1. Requirements and configuration
Requirements: Crawl the JD.com mobile phone search results pages, record the name, price, number of comments, etc. of each phone, and produce a data table that can be used for actual analysis.
The project uses Maven and log4j for logging; logs are written to the console only.
The Maven dependencies are as follows (pom.xml):
<dependencies>
  <dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.5.3</version>
  </dependency>
  <dependency>
    <!-- jsoup HTML parser library @ https://jsoup.org/ -->
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
  </dependency>
  <!-- https://mvnrepository.com/artifact/log4j/log4j -->
  <dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.17</version>
  </dependency>
</dependencies>
log4j configuration (log4j.properties): messages at INFO level and above are output to the console, and no separate log file is configured.
log4j.rootLogger=INFO, Console
#Console
log4j.appender.Console=org.apache.log4j.ConsoleAppender
log4j.appender.Console.layout=org.apache.log4j.PatternLayout
log4j.appender.Console.layout.ConversionPattern=%d [%t] %-5p [%c] - %m%n
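To confirm the configuration, a class only needs to obtain a Logger and write a message; the pattern above then prints the date, thread, level, category, and message to the console. Below is a minimal sketch (the class name LogDemo is only for illustration, and the sample output in the comment is approximate):

import org.apache.log4j.Logger;

public class LogDemo {
    // Each class obtains its own logger, keyed by the class
    private static Logger log = Logger.getLogger(LogDemo.class);

    public static void main(String[] args) {
        // With the configuration above, this should appear on the console roughly as:
        // 2018-01-01 12:00:00,000 [main] INFO  [LogDemo] - hello log4j
        log.info("hello log4j");
    }
}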
2. Requirements Analysis and Code
2.1 Requirements Analysis
The first step is to establish a connection between the client and the server and obtain the HTML content of the page through its URL.
The second step is to parse the HTML content and extract the required elements.
The third step is to write the extracted content to a local text file so it can be analyzed directly with other data analysis software.
According to the above analysis, four classes are established: getHTML (obtains the website HTML), ParseHTML (parses the HTML), WriteTo (writes the output file), and MainControl (controls the overall flow). The four classes are given below. To keep the code as concise as possible, all exceptions are thrown from the methods rather than caught.
2.2 Code
2.2.1 getHTML Class
This class contains two methods, getH(String url) and urlControl(String baseurl, int page), which obtain a page's HTML and control the URLs, respectively. Since this crawl only covers the search results for one type of product on JD.com, there is no need to traverse every URL on the page; it is enough to observe how the URL changes when turning pages and derive the rule. Only the urlControl method is exposed, and a private log field, private static Logger log = Logger.getLogger(getHTML.class);, is defined in the class to record logs.
getH(String url) obtains the HTML content of a single URL.
urlControl(String baseurl, int page) sets up the loop and accesses the data of multiple pages. By inspecting the page elements, you can see that paging on the JD.com search page actually advances the page value through the odd numbers.
Looking at how the URL changes after clicking to the next page, the only real change is the value of the page parameter, so the address of the next page can easily be built by string splicing, as the sketch after the example URLs below shows.
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=3&s=47&click=0
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=5&s=111&click=0
https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc=utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=7&s=162&click=0
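A minimal sketch of this splicing rule, using a hypothetical shortened base URL for illustration (the s parameter that also changes in the URLs above is simply dropped, just as the urlControl method below drops it): the n-th results page uses page = 2n - 1.

public class PageUrlDemo {
    public static void main(String[] args) {
        // Hypothetical shortened base URL; the full one is shown in the MainControl class below
        String baseurl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&page=";
        for (int count = 1; count <= 3; count++) {
            // The n-th results page corresponds to page = 2n - 1, i.e. 1, 3, 5, ...
            System.out.println(baseurl + (2 * count - 1) + "&click=0");
        }
    }
}

This is exactly the splicing that urlControl performs below with baseurl + (2 * count - 1) + "&click=0".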
Overall code:
import java.io.IOException;

import org.apache.http.HttpEntity;
import org.apache.http.client.ClientProtocolException;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;
import org.apache.log4j.Logger;

public class getHTML {
    // Create the logger
    private static Logger log = Logger.getLogger(getHTML.class);

    private static String getH(String url) throws ClientProtocolException, IOException {
        // Log each URL so the progress of every request can be followed on the console
        log.info("Resolving " + url);
        /*
         * Standard HttpClient usage for establishing a connection:
         * create a client, issue a GET request to the specified URL, and obtain the response.
         */
        CloseableHttpClient client = HttpClients.createDefault();
        HttpGet get = new HttpGet(url);
        CloseableHttpResponse response = client.execute(get);
        /*
         * Convert the HTML content to a String:
         * get the response entity, convert it with EntityUtils.toString using the "utf-8" encoding,
         * and close the response and client when finished.
         */
        HttpEntity entity = response.getEntity();
        String content = null;
        if (entity != null) {
            content = EntityUtils.toString(entity, "utf-8");
        }
        response.close();
        client.close();
        return content;
    }

    public static void urlControl(String baseurl, int page) throws ClientProtocolException, IOException {
        // Current page counter
        int count = 1;
        // Keep going while the counter is below the number of pages to crawl
        while (count < page) {
            // The URL actually requested is the fixed base URL spliced with the changing page value
            String u = baseurl + (2 * count - 1) + "&click=0";
            // Parse the HTML of this URL with the ParseHTML class, introduced below
            String content = ParseHTML.parse(getH(u)).toString();
            // Write the parsed content to the local file with the WriteTo class, introduced below
            WriteTo.writeto(content);
            count++;
        }
    }
}

2.2.2 ParseHTML Class
This step requires inspecting the page elements to determine which tags need to be crawled, and then extracting them with Jsoup's CSS selectors.
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseHTML {
    public static StringBuilder parse(String content) {
        // Jsoup.parse analyzes the HTML content that has already been converted to a String and returns a Document
        Document doc = Jsoup.parse(content);
        // Use CSS selectors to grab the required elements: first the ul whose class attribute is
        // "gl-warp clearfix", then every li whose class is "gl-item" (one li per phone)
        Elements ele = doc.select("ul[class = gl-warp clearfix]").select("li[class=gl-item]");
        // A container to collect the attributes of each item
        StringBuilder sb = new StringBuilder();
        // Traverse every matched element (each phone) and read its attributes
        for (Element e : ele) {
            // The attribute extraction below follows a published example of crawling JD.com; other selectors would also work
            String id = e.attr("data-pid");
            String mingzi = e.select("div[class = p-name p-name-type-2]").select("a").select("em").text();
            String jiage = e.select("div[class=p-price]").select("strong").select("i").text();
            String pinglun = e.select("div[class=p-commit]").select("strong").select("a").text();
            // Append the attributes, separated by tabs, one line per phone
            sb.append(id + "\t");
            sb.append(mingzi + "\t");
            sb.append(jiage + "\t");
            sb.append(pinglun + "\t");
            sb.append("\r\n");
        }
        return sb;
    }
}

2.2.3 WriteTo Class
The method in this class writes the parsed content into a local file; it is just simple IO.
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;

public class WriteTo {
    // Set the location of the output file
    private static File f = new File("C://jingdong.txt");

    public static void writeto(String content) throws IOException {
        // Open the writer in append mode to avoid overwriting previously written content
        BufferedWriter bw = new BufferedWriter(new FileWriter(f, true));
        bw.append(content);
        bw.flush();
        bw.close();
    }
}

2.2.4 MainControl Class
The main control program: set the base address and the number of pages to fetch, then call the urlControl method of the getHTML class to crawl the pages.
import java.io.IOException;

import org.apache.http.client.ClientProtocolException;

public class MainControl {
    public static void main(String[] args) throws ClientProtocolException, IOException {
        // Base URL without the page value, and the number of pages to fetch
        String baseurl = "https://search.jd.com/Search?keyword=%E6%89%8B%E6%9C%BA&enc="
                + "utf-8&qrst=1&rt=1&stop=1&vt=2&cid2=653&cid3=655&page=";
        int page = 5; // Set how many pages to crawl
        getHTML.urlControl(baseurl, page);
    }
}

3. Crawling results
Crawl 20 pages.
3.1 Console output
3.2 Document output
The file can be opened directly with Excel; the delimiter is a tab character. The columns are the product number, name, price, and number of comments.
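As a rough illustration of consuming the file outside of Excel, the following sketch (the class name ReadBack is only for illustration, and it assumes the path C://jingdong.txt used in the WriteTo class) reads the tab-separated lines back in Java and prints the name and price columns:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadBack {
    public static void main(String[] args) throws IOException {
        // Same output path as in the WriteTo class
        BufferedReader br = new BufferedReader(new FileReader("C://jingdong.txt"));
        String line;
        while ((line = br.readLine()) != null) {
            // Columns: product number, name, price, number of comments
            String[] cols = line.split("\t");
            if (cols.length >= 4) {
                System.out.println(cols[1] + "\t" + cols[2]);
            }
        }
        br.close();
    }
}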
4. Summary
This crawl uses HttpClient and Jsoup, which shows that for simple requirements these tools are quite efficient. All of the classes could also be written as a single class, but splitting the work into several classes keeps the structure clearer.