Building a Java crawler to supply data to an app (Jsoup web crawler)

Author：Eve Cole Update Time：2025-08-20 02:16:02

1. Requirements

I recently refactored my own news app around Material Design, and finding a source for the data turned out to be a problem.

Others have already analyzed the APIs of Zhihu Daily, Phoenix News, and similar services, so news JSON data can be fetched from the corresponding URLs. To practice writing code, I decided instead to crawl the news pages myself and build an API from the data.

2. Screenshots

The picture below is the page of the original website

The data obtained by the crawler, displayed in the mobile app

3. Crawler approach

For how the app itself is implemented, see the articles below. This article focuses on how to crawl the data.

Recording an app running on Android and generating an animated GIF, step by step: //www.VeVB.COM/article/78236.htm
Learning Android Material Design (RecyclerView instead of ListView): //www.VeVB.COM/article/78232.htm
Android project in practice: imitating the NetEase News page (RecyclerView): //www.VeVB.COM/article/78230.htm

Introduction to Jsoup

Jsoup is an open source HTML parser for Java that can parse HTML directly from a URL or from HTML text.

Jsoup's main features are:

- Parse HTML from a URL, file, or string;
- Find and extract data using DOM traversal or CSS selectors;
- Manipulate HTML elements, attributes, and text;
- Clean untrusted HTML (to help prevent XSS attacks)

4. Crawling process

Fetching the page HTML with a GET request

The DOM tree of the news page's HTML is as follows:

The following code sends a GET request to the specified URL and returns the HTML source of the response.

    public static String doGet(String urlStr) throws CommonException {
        String html = "";
        try {
            URL url = new URL(urlStr);
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
            connection.setConnectTimeout(5000);
            connection.setDoInput(true);
            // A GET request has no body, so setDoOutput(true) is not needed here.
            if (connection.getResponseCode() == 200) {
                InputStream in = connection.getInputStream();
                html = StreamTool.inToStringByByte(in);
            } else {
                throw new CommonException("News server return value is not 200");
            }
        } catch (Exception e) {
            e.printStackTrace();
            throw new CommonException("get request failed");
        }
        return html;
    }

Converting the input stream returned by connection.getInputStream() into a string is a common need, so we factor it out into a utility method.

    import java.io.ByteArrayOutputStream;
    import java.io.InputStream;

    public class StreamTool {
        public static String inToStringByByte(InputStream in) throws Exception {
            ByteArrayOutputStream outStr = new ByteArrayOutputStream();
            byte[] buffer = new byte[1024];
            int len;
            // Collect all bytes first and decode once, so that a multi-byte
            // UTF-8 character split across two reads is not corrupted.
            while ((len = in.read(buffer)) != -1) {
                outStr.write(buffer, 0, len);
            }
            return outStr.toString("UTF-8");
        }
    }
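One detail worth watching in a helper like this is the point at which the bytes are decoded. The sketch below is only an illustration using the standard library: the class ChunkDecodingDemo and its two methods are hypothetical names, and slicing a byte array into chunks stands in for successive in.read(buffer) calls. It compares decoding each chunk on its own with decoding once after all bytes have been collected.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
class ChunkDecodingDemo {

    // Decodes every chunk separately. A multi-byte character that straddles
    // a chunk boundary is decoded as replacement characters.
    static String decodePerChunk(byte[] data, int chunkSize) {
        StringBuilder content = new StringBuilder();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            content.append(new String(data, off, len, StandardCharsets.UTF_8));
        }
        return content.toString();
    }

    // Collects all chunks into one buffer, then decodes in a single step.
    static String decodeOnce(byte[] data, int chunkSize) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int off = 0; off < data.length; off += chunkSize) {
            int len = Math.min(chunkSize, data.length - off);
            out.write(data, off, len);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two Chinese characters, three UTF-8 bytes each (six bytes in total).
        String text = "\u65b0\u95fb";
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // A chunk size of 4 cuts the second character in half.
        System.out.println(text.equals(decodePerChunk(bytes, 4))); // false
        System.out.println(text.equals(decodeOnce(bytes, 4)));     // true
    }
}
```

With pure ASCII pages the two approaches give the same result, so the difference only shows up on pages that contain multi-byte text, such as Chinese news articles.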

5. Parsing HTML to get the title

Use Chrome's Inspect Element tool to find the HTML for the news title:

    <div id="article_title">
     <h1>
      <a href="http://see.xidian.edu.cn/html/news/7428.html"> Notice on holding a lecture on "Appreciation of Classic Music Works and Humanistic Aesthetics"</a>
     </h1>
    </div>

We need to find the element with id="article_title" in the HTML above, using the getElementById(String id) method.

    String htmlStr = HttpTool.doGet(urlStr);
    // Convert the fetched HTML source of the page into a Document
    Document doc = Jsoup.parse(htmlStr);
    Element articleEle = doc.getElementById("article");
    // Title
    Element titleEle = articleEle.getElementById("article_title");
    String titleStr = titleEle.text();

6. Getting the release date and news source

In the same way, find the HTML for the release date and source:

    <html>
     <head></head>
     <body>
      <div id="article_detail">
       <span> 2015-05-28 </span>
       <span> Source: </span>
       <span> Number of views:
        <script language="JavaScript" src="http://see.xidian.edu.cn/index.php/news/click/id/7428"> </script>
        477 </span>
      </div>
     </body>
    </html>

The approach is the same as above. Use getElementById(String id) to find the Element with id="article_detail", and then use getElementsByTag to get the span elements. Because there are three <span> ... </span> elements, the result is an Elements collection rather than a single Element.

    // article_detail contains, for example: 2016-01-15 Source: Views: 177
    Element detailEle = articleEle.getElementById("article_detail");
    Elements details = detailEle.getElementsByTag("span");
    // Release time
    String dateStr = details.get(0).text();
    // News source
    String sourceStr = details.get(1).text();

7. Parsing the view count

If you print details.get(2).text() from the code above, you only get

Number of views:

with no count. Why?

Because the view count is written into the page by JavaScript. Jsoup only parses the HTML it receives and does not execute scripts, so it cannot see dynamically rendered data.
There are two solutions:

- Run a browser engine while crawling, so that the page's JS is executed before the data is extracted. Tools for this include Selenium, HtmlUnit, and PhantomJS.
- Analyze the JS request and find the URL that actually serves the data.

Here we take the second approach. Visiting the script URL above, http://see.xidian.edu.cn/index.php/news/click/id/7428, returns the following:

 document.write(478)

This 478 is the view count we need. We send a GET request to the URL above, take the returned string, and use a regular expression to extract the number from it.

    // Each visit to the news page increments the view count, which is rendered by JS
    String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
    // "\\D+" removes every run of non-digit characters, leaving only the number
    int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
    // Or use the following regular expression
    // String readTimesStr = jsStr.replaceAll("[^0-9]", "");
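Stripping all non-digits works when the response is exactly document.write(478), but it also turns any other response that happens to contain digits into a "view count", and Integer.parseInt throws a NumberFormatException when no digits are left. As a hedged alternative, the sketch below matches the document.write(...) call explicitly. The class name ViewCountParser, the method parseViewCount, and the choice of -1 as the "not found" value are assumptions made for illustration, not part of the original code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original project.
class ViewCountParser {

    // Matches document.write(<digits>) and captures the digits.
    private static final Pattern WRITE_CALL =
            Pattern.compile("document\\.write\\(\\s*(\\d+)\\s*\\)");

    // Returns the view count, or -1 if the response does not contain
    // a document.write(<number>) call that fits in an int.
    static int parseViewCount(String js) {
        if (js == null) {
            return -1;
        }
        Matcher m = WRITE_CALL.matcher(js);
        if (!m.find()) {
            return -1;
        }
        try {
            return Integer.parseInt(m.group(1));
        } catch (NumberFormatException e) {
            return -1; // more digits than an int can hold
        }
    }

    public static void main(String[] args) {
        System.out.println(parseViewCount("document.write(478)"));    // 478
        System.out.println(parseViewCount("<html>error 404</html>")); // -1
    }
}
```

For the second input, the replaceAll approach would have returned 404, which is why the stricter pattern may be worth the extra lines when the server's responses are not fully under your control.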

8. Parsing the news content

Originally the news content was extracted as plain text. It later turned out that the Android side can display HTML with its CSS styling (for example in a WebView), so the content is now kept as HTML.

    Element contentEle = articleEle.getElementById("article_content");
    // News body content
    String contentStr = contentEle.toString();
    // If text() is used, the HTML tags of the news body are lost.
    // To display the HTML with a WebView on Android, use toString().
    // String contentStr = contentEle.text();

9. Parsing the image URLs

Note that a web page contains many images, large and small. To get only the images in the news body, first locate the Element that holds the news content, and then call getElementsByTag("img") on it to pick out the images.

    Element contentEle = articleEle.getElementById("article_content");
    // News body content
    String contentStr = contentEle.toString();
    // If text() is used, the HTML tags of the news body are lost.
    // To display the HTML with a WebView on Android, use toString().
    // String contentStr = contentEle.text();
    Elements images = contentEle.getElementsByTag("img");
    String[] imageUrls = new String[images.size()];
    for (int i = 0; i < imageUrls.length; i++) {
        imageUrls[i] = images.get(i).attr("src");
    }
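The src values are returned exactly as they are written in the page, so they may be site-relative paths, like the /uploads/... value in the output at the end of this article. An app usually needs absolute URLs to load the images. The sketch below shows one possible way to resolve them with java.net.URI from the standard library. The class ImageUrlResolver and the method toAbsolute are hypothetical names, and it assumes the page URL is known. Jsoup can also do this itself through absUrl("src") if the document was parsed with a base URI.

```java
import java.net.URI;

// Hypothetical helper, not part of the original project.
class ImageUrlResolver {

    // Resolves each src value against the URL of the page it came from.
    // URI.create throws IllegalArgumentException for values that are not
    // valid URIs (for example, paths containing spaces).
    static String[] toAbsolute(String pageUrl, String[] srcs) {
        URI base = URI.create(pageUrl);
        String[] result = new String[srcs.length];
        for (int i = 0; i < srcs.length; i++) {
            result[i] = base.resolve(srcs[i]).toString();
        }
        return result;
    }

    public static void main(String[] args) {
        String page = "http://see.xidian.edu.cn/html/news/7428.html";
        String[] srcs = {"/uploads/image/20160114/20160114225911_34428.png"};
        for (String url : toAbsolute(page, srcs)) {
            // http://see.xidian.edu.cn/uploads/image/20160114/20160114225911_34428.png
            System.out.println(url);
        }
    }
}
```

Values that are already absolute URLs pass through unchanged, so the helper can be applied to every src without checking first.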

10. News entity JavaBean

The steps above obtain the news title, release date, view count, body content, and so on. The natural next step is to define a JavaBean and wrap the extracted data in that entity class.

    import java.util.Arrays;

    public class ArticleItem {
        private int index;
        private String[] imageUrls;
        private String title;
        private String publishDate;
        private String source;
        private int readTimes;
        private String body;

        public ArticleItem(int index, String[] imageUrls, String title, String publishDate,
                String source, int readTimes, String body) {
            this.index = index;
            this.imageUrls = imageUrls;
            this.title = title;
            this.publishDate = publishDate;
            this.source = source;
            this.readTimes = readTimes;
            this.body = body;
        }

        @Override
        public String toString() {
            return "ArticleItem [index=" + index
                    + ",\n imageUrls=" + Arrays.toString(imageUrls)
                    + ",\n title=" + title
                    + ",\n publishDate=" + publishDate
                    + ",\n source=" + source
                    + ",\n readTimes=" + readTimes
                    + ",\n body=" + body + "]";
        }
    }

Test

    public static ArticleItem getNewsItem(int currentPage) throws CommonException {
        // Build the news URL from the page number suffix
        String urlStr = ARTICLE_BASE_URL + currentPage + ".html";
        String htmlStr = HttpTool.doGet(urlStr);
        Document doc = Jsoup.parse(htmlStr);
        Element articleEle = doc.getElementById("article");

        // Title
        Element titleEle = articleEle.getElementById("article_title");
        String titleStr = titleEle.text();

        // article_detail contains, for example: 2016-01-15 Source: Views: 177
        Element detailEle = articleEle.getElementById("article_detail");
        Elements details = detailEle.getElementsByTag("span");
        // Release time
        String dateStr = details.get(0).text();
        // News source
        String sourceStr = details.get(1).text();

        // Each visit to the news page increments the view count, which is rendered by JS
        String jsStr = HttpTool.doGet(COUNT_BASE_URL + currentPage);
        int readTimes = Integer.parseInt(jsStr.replaceAll("\\D+", ""));
        // Or use the following regular expression
        // String readTimesStr = jsStr.replaceAll("[^0-9]", "");

        Element contentEle = articleEle.getElementById("article_content");
        // News body content
        String contentStr = contentEle.toString();
        // If text() is used, the HTML tags of the news body are lost.
        // To display the HTML with a WebView on Android, use toString().
        // String contentStr = contentEle.text();

        Elements images = contentEle.getElementsByTag("img");
        String[] imageUrls = new String[images.size()];
        for (int i = 0; i < imageUrls.length; i++) {
            imageUrls[i] = images.get(i).attr("src");
        }
        return new ArticleItem(currentPage, imageUrls, titleStr, dateStr, sourceStr, readTimes, contentStr);
    }

    public static void main(String[] args) throws CommonException {
        System.out.println(getNewsItem(7928));
    }

Output

 ArticleItem [index=7928, imageUrls=[/uploads/image/20160114/20160114225911_34428.png], title=The School of Electrical Engineering launched the "Let the Flower of Integrity Bloom all over the Winter Campus" education activity, publishDate=2016-01-14, source=Source: Movie News Network, readTimes=200, body=<div id="article_content"> <p style="text-indent:2em;" align="justify"> <strong><span style="font-size:16px;line-height:1.5;">XiDian News Network</span></strong><span style="font-size:16px;line-height:1.5;"> (Comradesperson</span><strong><span style="font-size:16px;line-height:1.5;"> Ding Tong Wang Zhu Dan</span></strong><span style="font-size:16px;line-height:1.5;">...)

This article explained how to implement a web crawler with Jsoup. If you found it helpful, please give it a thumbs up.