At work today there was a need to scrape some data from a specific website after running a query on it, so I spent some time writing a demo to show how it can be done.
The idea is very simple: access the link from Java, get the HTML string back, and then parse out the data we need, such as the links. For the page-parsing part, Jsoup is the convenient choice; it is simple enough that a single chained call shows how it is used:
    Document doc = Jsoup.connect("http://www.oschina.net/")
            .data("query", "Java")   // request parameter
            .userAgent("I'm jsoup")  // set the User-Agent
            .cookie("auth", "token") // set a cookie
            .timeout(3000)           // set the connection timeout
            .post();                 // access the URL with a POST request
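Once the Document is back, pulling the links out of it is just as short. Below is a minimal, self-contained sketch (the class name LinkDump and the a[href] selector are mine, purely for illustration): it fetches a page and prints the text and target of every link on it.

    // Minimal illustrative sketch: fetch a page and print every link on it.
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    public class LinkDump
    {
        public static void main(String[] args) throws Exception
        {
            Document doc = Jsoup.connect("http://www.oschina.net/").get();
            for (Element link : doc.select("a[href]"))
            {
                System.out.println(link.text() + " -> " + link.attr("href"));
            }
        }
    }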
The entire implementation process is described below.

1. Analyze the page that needs to be parsed:
Website: http://www1.sxcredit.gov.cn/public/infocomquery.do?method=publicIndexQuery
First, run a query on this page and observe the requested URL, the parameters, and the method. The browser's built-in developer tools (shortcut F12) are used for this; the screenshots of the captured query are omitted here.
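As a rough reconstruction of what the captured request amounts to, it is a POST to the query URL with two form fields. The sketch below is mine; the parameter names query.enterprisename and query.registationnumber are taken from the test code further down, and the keyword is just a placeholder:

    // Reconstruction of the request seen in the developer tools (parameter
    // names come from the test code later in this post; the keyword is a placeholder).
    Connection conn = Jsoup.connect("http://www1.sxcredit.gov.cn/public/infocomquery.do?method=publicIndexQuery");
    conn.data("query.enterprisename", "Xingwang");
    conn.data("query.registationnumber", "");
    Document doc = conn.timeout(10000).post();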
From the capture we can see the URL, the method, and the parameters. Knowing how to issue the query, I started on the code below. To make it reusable and extensible, I defined several classes:
1. Rule.java describes a query: the URL, the request method, the parameters, and how to locate the results in the response.
    package com.zhy.spider.rule;

    /**
     * Describes one crawl rule.
     *
     * @author zhy
     */
    public class Rule
    {
        /**
         * The link to query
         */
        private String url;

        /**
         * Parameter names
         */
        private String[] params;

        /**
         * Values corresponding to the parameters
         */
        private String[] values;

        /**
         * The tag name / class / id / selector used to locate results in the
         * returned HTML; its meaning is given by type
         */
        private String resultTagName;

        /**
         * CLASS / ID / SELECTION
         * How resultTagName is interpreted; defaults to ID
         */
        private int type = ID;

        /**
         * GET / POST
         * The request method; defaults to GET
         */
        private int requestMethod = GET;

        public final static int GET = 0;
        public final static int POST = 1;

        public final static int CLASS = 0;
        public final static int ID = 1;
        public final static int SELECTION = 2;

        public Rule()
        {
        }

        public Rule(String url, String[] params, String[] values,
                String resultTagName, int type, int requestMethod)
        {
            super();
            this.url = url;
            this.params = params;
            this.values = values;
            this.resultTagName = resultTagName;
            this.type = type;
            this.requestMethod = requestMethod;
        }

        public String getUrl() { return url; }
        public void setUrl(String url) { this.url = url; }

        public String[] getParams() { return params; }
        public void setParams(String[] params) { this.params = params; }

        public String[] getValues() { return values; }
        public void setValues(String[] values) { this.values = values; }

        public String getResultTagName() { return resultTagName; }
        public void setResultTagName(String resultTagName) { this.resultTagName = resultTagName; }

        public int getType() { return type; }
        public void setType(int type) { this.type = type; }

        public int getRequestMethod() { return requestMethod; }
        public void setRequestMethod(int requestMethod) { this.requestMethod = requestMethod; }
    }

In short: this Rule class holds all the information we need during a query, which makes extension and reuse straightforward. It is not realistic to write a separate set of code for every website that needs to be crawled.
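As a quick illustration of how a Rule is filled in, the second test case later in this post builds one for a plain GET query whose results are located by a CSS selector:

    // A GET query; the results live inside elements matched by a CSS selector.
    Rule rule = new Rule(
            "http://www.11315.com/search",   // url
            new String[] { "name" },         // parameter names
            new String[] { "Xingwang" },     // parameter values
            "div.g-mn div.con-model",        // resultTagName: here a CSS selector
            Rule.SELECTION,                  // how to interpret resultTagName
            Rule.GET);                       // request method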
2. The data object for the extracted results; for now only links are needed: LinkTypeData.java
    package com.zhy.spider.bean;

    public class LinkTypeData
    {
        private int id;
        /**
         * link address
         */
        private String linkHref;
        /**
         * link title
         */
        private String linkText;
        /**
         * summary
         */
        private String summary;
        /**
         * content
         */
        private String content;

        public int getId() { return id; }
        public void setId(int id) { this.id = id; }

        public String getLinkHref() { return linkHref; }
        public void setLinkHref(String linkHref) { this.linkHref = linkHref; }

        public String getLinkText() { return linkText; }
        public void setLinkText(String linkText) { this.linkText = linkText; }

        public String getSummary() { return summary; }
        public void setSummary(String summary) { this.summary = summary; }

        public String getContent() { return content; }
        public void setContent(String content) { this.content = content; }
    }

3. The core query class: ExtractService.java
    package com.zhy.spider.core;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.jsoup.Connection;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    import com.zhy.spider.bean.LinkTypeData;
    import com.zhy.spider.rule.Rule;
    import com.zhy.spider.rule.RuleException;
    import com.zhy.spider.util.TextUtil;

    /**
     * @author zhy
     */
    public class ExtractService
    {
        /**
         * @param rule the crawl rule to execute
         * @return the extracted links
         */
        public static List<LinkTypeData> extract(Rule rule)
        {
            // Perform the necessary validation of the rule
            validateRule(rule);

            List<LinkTypeData> datas = new ArrayList<LinkTypeData>();
            LinkTypeData data = null;
            try
            {
                // Unpack the rule
                String url = rule.getUrl();
                String[] params = rule.getParams();
                String[] values = rule.getValues();
                String resultTagName = rule.getResultTagName();
                int type = rule.getType();
                int requestType = rule.getRequestMethod();

                Connection conn = Jsoup.connect(url);
                // Set the query parameters
                if (params != null)
                {
                    for (int i = 0; i < params.length; i++)
                    {
                        conn.data(params[i], values[i]);
                    }
                }

                // Issue the request with the configured method
                Document doc = null;
                switch (requestType)
                {
                case Rule.GET:
                    doc = conn.timeout(100000).get();
                    break;
                case Rule.POST:
                    doc = conn.timeout(1000000).post();
                    break;
                }

                // Locate the result elements in the returned document
                Elements results = new Elements();
                switch (type)
                {
                case Rule.CLASS:
                    results = doc.getElementsByClass(resultTagName);
                    break;
                case Rule.ID:
                    Element result = doc.getElementById(resultTagName);
                    results.add(result);
                    break;
                case Rule.SELECTION:
                    results = doc.select(resultTagName);
                    break;
                default:
                    // When resultTagName is empty, fall back to the body tag
                    if (TextUtil.isEmpty(resultTagName))
                    {
                        results = doc.getElementsByTag("body");
                    }
                }

                for (Element result : results)
                {
                    Elements links = result.getElementsByTag("a");

                    for (Element link : links)
                    {
                        // Site-specific filtering can be added here
                        String linkHref = link.attr("href");
                        String linkText = link.text();
                        data = new LinkTypeData();
                        data.setLinkHref(linkHref);
                        data.setLinkText(linkText);
                        datas.add(data);
                    }
                }
            } catch (IOException e)
            {
                e.printStackTrace();
            }

            return datas;
        }

        /**
         * Perform the necessary validation of the passed-in rule
         */
        private static void validateRule(Rule rule)
        {
            String url = rule.getUrl();
            if (TextUtil.isEmpty(url))
            {
                throw new RuleException("url cannot be empty!");
            }
            if (!url.startsWith("http://"))
            {
                throw new RuleException("url is not well-formed!");
            }

            if (rule.getParams() != null && rule.getValues() != null)
            {
                if (rule.getParams().length != rule.getValues().length)
                {
                    throw new RuleException("the number of params does not match the number of values!");
                }
            }
        }
    }
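The comment marking the filtering spot in the inner loop is where site-specific conditions would go. As one hedged example (this snippet is mine, not part of the original code), the junk anchors such as the javascript:void(0) entries that show up in the Baidu test below could be skipped like this:

    // Example filter (not in the original code): skip empty or javascript: links
    // before building the LinkTypeData object.
    if (TextUtil.isEmpty(linkHref) || linkHref.startsWith("javascript:"))
    {
        continue;
    }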
4. A small exception class: RuleException.java

    package com.zhy.spider.rule;

    public class RuleException extends RuntimeException
    {
        public RuleException()
        {
            super();
        }

        public RuleException(String message, Throwable cause)
        {
            super(message, cause);
        }

        public RuleException(String message)
        {
            super(message);
        }

        public RuleException(Throwable cause)
        {
            super(cause);
        }
    }

5. Finally, the tests. Two websites are queried here with different rules; see the code for details.
    package com.zhy.spider.test;

    import java.util.List;

    import com.zhy.spider.bean.LinkTypeData;
    import com.zhy.spider.core.ExtractService;
    import com.zhy.spider.rule.Rule;

    public class Test
    {
        @org.junit.Test
        public void getDatasByClass()
        {
            Rule rule = new Rule(
                    "http://www1.sxcredit.gov.cn/public/infocomquery.do?method=publicIndexQuery",
                    new String[] { "query.enterprisename", "query.registationnumber" },
                    new String[] { "Xingwang", "" },
                    "cont_right", Rule.CLASS, Rule.POST);
            List<LinkTypeData> extracts = ExtractService.extract(rule);
            printf(extracts);
        }

        @org.junit.Test
        public void getDatasByCssQuery()
        {
            Rule rule = new Rule("http://www.11315.com/search",
                    new String[] { "name" },
                    new String[] { "Xingwang" },
                    "div.g-mn div.con-model", Rule.SELECTION, Rule.GET);
            List<LinkTypeData> extracts = ExtractService.extract(rule);
            printf(extracts);
        }

        public void printf(List<LinkTypeData> datas)
        {
            for (LinkTypeData data : datas)
            {
                System.out.println(data.getLinkText());
                System.out.println(data.getLinkHref());
                System.out.println("****************************************");
            }
        }
    }

Output:
    Shenzhen Netxing Technology Co., Ltd.
    http://14603257.11315.com
    ****************************************
    Jingzhou Xingwang Highway Materials Co., Ltd.
    http://05155980.11315.com
    ****************************************
    Xi'an Quanxing Internet Cafe
    #
    ****************************************
    Zichang County Xinxing Net City
    #
    ****************************************
    The Third Branch of Shaanxi Tongxing Network Information Co., Ltd.
    #
    ****************************************
    Xi'an Gaoxing Network Technology Co., Ltd.
    #
    ****************************************
    Shaanxi Tongxing Network Information Co., Ltd. Xi'an Branch
    #
    ****************************************
Finally, Baidu News is used to test the code, to show that it is general enough to work on other sites as well.
    /**
     * Baidu News: only the url, the keyword and the request type are set
     */
    @org.junit.Test
    public void getDatasByCssQueryUserBaidu()
    {
        Rule rule = new Rule("http://news.baidu.com/ns",
                new String[] { "word" },
                new String[] { "Alipay" },
                null, -1, Rule.GET);
        List<LinkTypeData> extracts = ExtractService.extract(rule);
        printf(extracts);
    }

Here only the link, the keyword and the request type are set; no filter conditions are specified at all.
The result is below. There is certainly some junk data mixed in, but the data we need is crawled as well. We could use Rule.SELECTION with a more specific selector, plus further filter conditions, to narrow it down; a sketch of that follows the output.
    Sort by time
    /ns?word=Alipay&ie=utf-8&bs=Alipay&sr=0&cl=2&rn=20&tn=news&ct=0&clk=sortbytime
    ****************************************
    x
    javascript:void(0)
    ****************************************
    Alipay will jointly build a security fund with multiple parties to invest 40 million in the first batch
    http://finance.ifeng.com/a/20140409/12081871_0.shtml
    ****************************************
    7 same news
    /ns?word=%E6%94%AF%E4%BB%98%E5%AE%9D+cont:2465146414%7C697779368%7C3832159921&same=7&cl=1&tn=news&rn=30&fm=sd
    ****************************************
    Baidu snapshot
    http://cache.baidu.com/c?m=9d78d513d9d437ab4f9e91697d1cc0161d4381132ba7d3020cd0870fd33a541b0120a1ac26510d19879e20345dfe1e4bea876d26605f75a09bbfd91782a6c1352f8a2432721a844a0fd019adc1452fc423875d9dad0ee7cdb168d5f18c&p=c96ec64ad48b2def49bd9b780b64&newp=c4769a4790934ea95ea28e281c4092695912c10e3dd796&user=baidu&fm=sc&query=%D6%A7%B8%B6%B1%A6&qid=a400f3660007a6c5&p1=1
    ****************************************
    OpenSSL vulnerability involves many websites Alipay says there is no data leakage yet
    http://tech.ifeng.com/internet/detail_2014_04/09/35590390_0.shtml
    ****************************************
    26 same news
    /ns?word=%E6%94%AF%E4%BB%98%E5%AE%9D+cont:3869124100&same=26&cl=1&tn=news&rn=30&fm=sd
    ****************************************
    Baidu snapshot
    http://cache.baidu.com/c?m=9f65cb4a8c8507ed4fece7631050803743438014678387492ac3933fc239045c1c3aa5ec677e4742ce932b2152f4174bed843670340537b0efca8e57dfb08f29288f2c367117845615a71bb8cb31649b66cf04fdea44a7ecff25e5aac5a0da4323c044757e97f1fb4d7017dd1cf4&p=8b2a970d95df11a05aa4c32013&newp=9e39c64ad4dd50fa40bd9b7c5253d8304503c52251d5ce042acc&user=baidu&fm=sc&query=%D6%A7%B8%B6%B1%A6&qid=a400f3660007a6c5&p1=2
    ****************************************
    Yahoo Japan starts supporting Alipay payments from June
    http://www.techweb.com.cn/ucweb/news/id/2025843
    ****************************************
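One way to tighten this up, as mentioned above, is to switch from the default whole-body scan to Rule.SELECTION with a more specific selector. Here is a sketch; the selector div.result is only my guess at the markup of the Baidu News result blocks and would need to be checked against the actual page:

    // Sketch of a stricter rule: scope the scan with a CSS selector instead of
    // the whole <body>. "div.result" is an assumed selector, not verified.
    Rule rule = new Rule("http://news.baidu.com/ns",
            new String[] { "word" },
            new String[] { "Alipay" },
            "div.result",        // assumed selector for the result blocks
            Rule.SELECTION,
            Rule.GET);
    List<LinkTypeData> extracts = ExtractService.extract(rule);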
If anything here falls short, feel free to point it out; and if you find it useful, give it a try~~ haha
To download the source code, click here.
The above is an example of scraping information with a Java crawler. More related material will be added in the future. Thank you for supporting this site!