jsoup is a Java HTML parser that can parse HTML directly from a URL or from raw text content. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.
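For anyone who has never used it, here is a minimal sketch of what a jsoup call looks like. The URL and class name here are only placeholders of my own, not part of the project below:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupHelloWorld
{
    public static void main(String[] args) throws Exception
    {
        // Fetch a page over HTTP and parse it into a Document.
        Document doc = Jsoup.connect("http://example.com").timeout(30 * 1000).get();

        // Pick elements with a CSS selector, jQuery style.
        for (Element link : doc.select("a[href]"))
        {
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}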
I'm currently working on something that needs regional data for the whole country, from provinces and cities down to counties, towns, and streets. No amount of searching on Baidu or Google turned up a complete data set. In the end the effort paid off and I finally found a relatively complete one. However, that data is only accurate down to the town level; there is no village-level data (I later figured out why by analyzing the data source, haha). In addition, some of the data the blogger provided was redundant. Being a perfectionist with a bit of OCD, I decided I simply had to crawl that part of the data myself.
The content of that blog post is quite rich, and the blogger implemented it in PHP. Since Java took first place in the 2015 programming language rankings, we can't be outdone, so now I'll show you how to use Java to crawl the data we want from the web pages...
The first step, preparation (data source + tools):
Data source (the most comprehensive and authoritative official data so far): http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/
Tools for crawling data (crawler tools): http://jsoup.org/
The second step, data source analysis:
First of all, I will not explain the use of the jsoup tool here. If you are interested, you can check it out yourself.
In day-to-day development you should learn about the software tools around you; only then, when a need comes up during normal development, will you know where to start. I encourage everyone to pay attention to the tools around them in case they come in handy later. Before building this, I didn't know how to use jsoup, but I did know what jsoup could be used for, so when I needed it I looked up the documentation and taught myself.
The above-mentioned data source was released by the National Bureau of Statistics of the People's Republic of China in 2013, and its accuracy and authoritativeness are self-evident.
Next, let’s analyze the structure of the data source, starting with the home page:
By analyzing the homepage source code we can get the following three points:
1. The entire layout of the page is controlled by table tags. That is to say, if we want to select hyperlinks with jsoup, we have to keep in mind that the table containing the provinces, municipalities, and regions is not the only one; the page contains multiple tables, so we cannot simply use
Document connect = connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/");
Elements rowProvince = connect.select("table");
to parse the data.
2. Where on the page are the hyperlinks? Perhaps the officials anticipated that programmers like us would need to obtain this kind of data, because the page is very clean. Apart from the registration number at the bottom, which is a redundant hyperlink, all the other links can be crawled directly.
3. The data pattern of the provinces and municipalities. Every table row containing valid information has the class attribute provincetr. This attribute is very important; as for why, please read on. Each data row contains multiple td tags, and each td tag contains an a hyperlink, which is exactly the hyperlink we want; the text of that hyperlink is the name of the province (or municipality, etc.). See the sketch right after this list.
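To make points 1 and 3 concrete, here is a minimal sketch (separate from the full crawler later in this post) that selects the province rows by their provincetr class and prints each province's name and absolute link. The class name and timeout value are my own choices:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ProvinceListDemo
{
    public static void main(String[] args) throws Exception
    {
        // Select the data rows by their class ("tr.provincetr") rather than the bare
        // "table" tag, because the page contains several layout tables as well.
        Document home = Jsoup.connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/")
            .timeout(30 * 1000).get();
        for (Element provinceLink : home.select("tr.provincetr a"))
        {
            // The link text is the province (municipality) name; "abs:href" resolves
            // the relative href into the absolute URL of its city-level page.
            System.out.println(provinceLink.text() + " -> " + provinceLink.attr("abs:href"));
        }
    }
}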
Let's take another look at the general data pages (the general data pages cover the three levels of data display at the city, county, and town levels):
The reason we put these three pages together is that analysis shows their data pages are completely identical in structure. The only difference is the class attribute of the data row tr in the HTML source: citytr, countytr, and towntr respectively. Everything else is the same, so we can use a single common method to crawl all three pages, roughly sketched below.
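Here is a rough sketch of what such a common method could look like. The class name and the example URL (Beijing's city-level page under the 2013 data source) are my own assumptions:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LevelPageDemo
{
    /**
     * City, county and town pages share the same structure; only the tr class
     * ("citytr", "countytr" or "towntr") differs, so one method covers all three.
     */
    private static void printLevelPage(String url, String trClass) throws Exception
    {
        Document doc = Jsoup.connect(url).timeout(30 * 1000).get();
        for (Element row : doc.select("tr." + trClass))
        {
            String code = row.select("td").first().text(); // statistical code
            String name = row.select("td").last().text();  // region name
            System.out.println(code + " " + name);
        }
    }

    public static void main(String[] args) throws Exception
    {
        // Beijing's city-level page under the 2013 data source, used here as an example.
        printLevelPage("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/11.html", "citytr");
    }
}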
Finally, take a look at the village-level data page:
The data format at the village level is different from that of the cities, counties, and towns above. This level is the lowest, so there are no a links, which means the crawling approach used for the city, county, and town data cannot be applied here. The class of the table rows displaying the data is villagetr. Besides these two points, each row contains three columns of data: the first column is the statistical code, the second is the urban-rural classification code (a column the city, county, and town pages do not have), and the third is the name. A parsing sketch follows.
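As a sketch, parsing a village-level page could look like this. The town page URL below is only illustrative (substitute one taken from a crawled county page), and the column meanings follow the analysis above:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class VillagePageDemo
{
    public static void main(String[] args) throws Exception
    {
        // Illustrative town-level URL under the same data source; replace with a real one.
        String townUrl = "http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/11/01/01/110101001.html";
        Document doc = Jsoup.connect(townUrl).timeout(30 * 1000).get();

        // Village rows use the class "villagetr", carry no <a> links, and have three
        // columns: statistical code, urban-rural classification code, and name.
        for (Element row : doc.select("tr.villagetr"))
        {
            Elements cols = row.select("td");
            System.out.println(cols.get(0).text() + " " + cols.get(1).text() + " " + cols.get(2).text());
        }
    }
}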
After grasping the above points, we can start coding.
The third step, coding implementation:
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

/**
 * Data crawling of provinces, cities, counties, towns and villages across the country
 * @author liushaofeng
 */
public class JsoupTest
{
    private static Map<Integer, String> cssMap = new HashMap<Integer, String>();
    private static BufferedWriter bufferedWriter = null;

    static
    {
        cssMap.put(1, "provincetr");// province
        cssMap.put(2, "citytr");    // city
        cssMap.put(3, "countytr");  // county
        cssMap.put(4, "towntr");    // town
        cssMap.put(5, "villagetr"); // village
    }

    public static void main(String[] args) throws IOException
    {
        int level = 1;
        initFile();

        // Get provincial information across the country
        Document connect = connect("http://www.stats.gov.cn/tjsj/tjbz/tjyqhdmhcxhfdm/2013/");
        Elements rowProvince = connect.select("tr." + cssMap.get(level));
        for (Element provinceElement : rowProvince)// traverse each row of provinces and municipalities
        {
            Elements select = provinceElement.select("a");
            for (Element province : select)// each province (e.g. Sichuan Province)
            {
                parseNextLevel(province, level + 1);
            }
        }
        closeStream();
    }

    private static void initFile()
    {
        try
        {
            bufferedWriter = new BufferedWriter(new FileWriter(new File("d://CityInfo.txt"), true));
        } catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    private static void closeStream()
    {
        if (bufferedWriter != null)
        {
            try
            {
                bufferedWriter.close();
            } catch (IOException e)
            {
                e.printStackTrace();
            }
            bufferedWriter = null;
        }
    }

    private static void parseNextLevel(Element parentElement, int level) throws IOException
    {
        try
        {
            Thread.sleep(500);// pause between requests (tune as needed); without it the server may return various error status codes
        } catch (InterruptedException e)
        {
            e.printStackTrace();
        }

        Document doc = connect(parentElement.attr("abs:href"));
        if (doc != null)
        {
            Elements newsHeadlines = doc.select("tr." + cssMap.get(level));// get the data rows of the table
            for (Element element : newsHeadlines)
            {
                printInfo(element, level + 1);
                Elements select = element.select("a");// in the recursive call, this detects village-level data, which has no a tag
                if (select.size() != 0)
                {
                    parseNextLevel(select.last(), level + 1);
                }
            }
        }
    }

    /**
     * Write a line of data to the data file
     * @param element crawled data element
     * @param level city level
     */
    private static void printInfo(Element element, int level)
    {
        try
        {
            bufferedWriter.write(element.select("td").last().text() + "{" + level + "}["
                + element.select("td").first().text() + "]");
            bufferedWriter.newLine();
            bufferedWriter.flush();
        } catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    private static Document connect(String url)
    {
        if (url == null || url.isEmpty())
        {
            throw new IllegalArgumentException("The input url('" + url + "') is invalid!");
        }
        try
        {
            return Jsoup.connect(url).timeout(30 * 1000).get();
        } catch (IOException e)
        {
            e.printStackTrace();
            return null;
        }
    }
}

The crawl is a long process, so just wait patiently, haha. And because the program runs for such a long time, please do not print the output to the console, or it may affect the program's operation...
The format of the final data obtained is as follows (the number in "{}" is the administrative level, and the content in "[]" is the region code):
Municipal district {3}[110100000000]
Dongcheng District{4}[110101000000]
Donghuamen Subdistrict Office{5}[110101001000]
Duofu Lane Community Neighborhood Committee {6}[110101001001]
Yinzha Community Neighborhood Committee{6}[110101001002]
Dongchang Community Neighborhood Committee{6}[110101001005]
Zhide Community Neighborhood Committee{6}[110101001006]
Nanchizi Community Neighborhood Committee{6}[110101001007]
Huangtugang Community Neighborhood Committee{6}[110101001008]
Dengshikou Community Neighborhood Committee{6}[110101001009]
Zhengyi Road Community Neighborhood Committee {6}[110101001010]
Ganyu Community Neighborhood Committee {6}[110101001011]
Taijichang Community Neighborhood Committee{6}[110101001013]
Shaojiu Community Neighborhood Committee{6}[110101001014]
Wangfujing Community Neighborhood Committee{6}[110101001015]
Jingshan Subdistrict Office{5}[110101002000]
Longfu Temple Community Neighborhood Committee {6}[110101002001]
Jixiang Community Neighborhood Committee {6}[110101002002]
Huanghuamen Community Neighborhood Committee{6}[110101002003]
Zhonggu Community Neighborhood Committee{6}[110101002004]
Weijia Community Neighborhood Committee{6}[110101002005]
Wangzhima Community Neighborhood Committee {6}[110101002006]
Jingshan East Street Community Neighborhood Committee {6}[110101002008]
Huangcheng Genbei Street Community Neighborhood Committee {6}[110101002009]
Jiaodaokou Subdistrict Office{5}[110101003000]
Jiaodong Community Neighborhood Committee{6}[110101003001]
Fuxiang Community Neighborhood Committee{6}[110101003002]
Daxing Community Neighborhood Committee {6}[110101003003]
Fuxue Community Neighborhood Committee{6}[110101003005]
Gulouyuan Community Neighborhood Committee{6}[110101003007]
Juer Community Neighborhood Committee{6}[110101003008]
Nanluoguxiang Community Neighborhood Committee{6}[110101003009]
Andingmen Subdistrict Office{5}[110101004000]
Jiaobei Toutiao Community Neighborhood Committee {6}[110101004001]
Beiluoguxiang Community Neighborhood Committee {6}[110101004002]
Guozijian Community Neighborhood Committee{6}[110101004003]
...
Once you have this data, you can do whatever you like with it. The code above can be run directly; after it has crawled the data source, you can convert the result into whatever format you need.
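For example, here is a small sketch of reading the generated CityInfo.txt back into name / level / code triples. The regular expression and class name are my own; the file path matches the one the crawler writes to:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CityInfoReader
{
    // Matches lines shaped like "Dongcheng District{4}[110101000000]".
    private static final Pattern LINE = Pattern.compile("^(.*)\\{(\\d+)\\}\\[(\\d+)\\]$");

    public static void main(String[] args) throws Exception
    {
        try (BufferedReader reader = new BufferedReader(new FileReader("d://CityInfo.txt")))
        {
            String line;
            while ((line = reader.readLine()) != null)
            {
                Matcher m = LINE.matcher(line.trim());
                if (m.matches())
                {
                    String name = m.group(1);
                    int level = Integer.parseInt(m.group(2));
                    String code = m.group(3);
                    System.out.println(level + " | " + code + " | " + name);
                }
            }
        }
    }
}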