Implementing a simple crawler in Java: Toutiao (Today's Headlines)

Author: Eve Cole  Update Time: 2025-05-31 02:32:01

Preface

A note up front: Toutiao (Today's Headlines) articles are unusual in that the list page does not expose article addresses directly. You first have to obtain each article's id and then splice it onto a URL prefix before you can visit the article. Without further ado, here is the code.
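To make the "id first, then URL" step concrete, here is a minimal sketch of the splicing on its own. The prefix is the one used in the sample code; the class name ArticleUrlBuilder, its methods, and the ids in main are hypothetical placeholders introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: turns article ids into detail-page URLs.
final class ArticleUrlBuilder {
    // Prefix of the article detail page, as used in the sample code.
    static final String GROUP_PREFIX = "http://www.toutiao.com/group/";

    // Splice one id onto the prefix; reject blank ids early.
    static String build(String groupId) {
        if (groupId == null || groupId.trim().isEmpty()) {
            throw new IllegalArgumentException("groupId must not be blank");
        }
        return GROUP_PREFIX + groupId.trim();
    }

    // Splice every id in the list, preserving order.
    static List<String> buildAll(List<String> groupIds) {
        List<String> urls = new ArrayList<>();
        for (String id : groupIds) {
            urls.add(build(id));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up ids, not real articles.
        for (String url : buildAll(Arrays.asList("1234567890", "9876543210"))) {
            System.out.println(url);
        }
    }
}
```

Keeping the splicing in one place makes it easy to adjust if the site's URL scheme turns out to differ from what the sample assumes.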

The sample code is as follows

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// The original lists no imports. The ones below are assumptions based on the calls used:
// fastjson, Apache Commons Lang 3, and jsoup.
import com.alibaba.fastjson.JSONArray;
import com.alibaba.fastjson.JSONObject;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class Demo2 {
    public static void main(String[] args) {
        // List page whose articles should be crawled
        String url = "http://www.toutiao.com/news_finance/";
        // Prefix of the article detail page (Toutiao articles all live under the group directory)
        String url2 = "http://www.toutiao.com/group/";

        // Connect to the list page and fetch its HTML
        Connection connection = Jsoup.connect(url);
        Document content;
        try {
            content = connection.get();
        } catch (IOException e) {
            e.printStackTrace();
            return; // nothing to parse if the list page could not be fetched
        }
        String htmlStr = content.html();

        // The article list is assigned to a JS variable rather than rendered into the DOM,
        // so it has to be cut out of the page source instead of selected as an element.
        // Note: this cuts at the first ';' after "var _data = ".
        String jsonStr = StringUtils.substringBetween(htmlStr, "var _data = ", ";");
        System.out.println(jsonStr);
        if (jsonStr == null) {
            return; // the variable was not found in the page
        }
        Map parse = (Map) JSONObject.parse(jsonStr);
        JSONArray parseArray = (JSONArray) parse.get("real_time_news");

        // Walk the JSON array and keep each entry as a Map
        // (only group_id is needed here, so a Map is more than strictly necessary)
        List<Map> maps = new ArrayList<>();
        for (int i = 0; i < parseArray.size(); i++) {
            Map map = (Map) parseArray.get(i);
            maps.add(map);
            System.out.println(map.get("group_id"));
        }

        // Visit the detail page of every article collected above
        for (Map map2 : maps) {
            connection = Jsoup.connect(url2 + map2.get("group_id"));
            try {
                Document document = connection.get();
                // Title of the article
                Elements title = document.select("[class=article-title]");
                System.out.println(title.html());
                // Source and release time of the article
                Elements articleInfo = document.select("[class=articleInfo]");
                Elements src = articleInfo.select("[class=src]");
                System.out.println(src.html());
                Elements time = articleInfo.select("[class=time]");
                System.out.println(time.html());
                // Body of the article
                Elements contentEle = document.select("[class=article-content]");
                System.out.println(contentEle.html());
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
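The sample cuts the JSON out of the page by taking everything between "var _data = " and the first ";". That works only while no string inside the JSON contains a semicolon. As one possible alternative (not the author's method), the sketch below scans for the matching closing brace instead. The class name JsVariableExtractor and the HTML snippet in main are hypothetical, and the scanner assumes the object literal uses double-quoted strings, as JSON does.

```java
// Hypothetical helper: extracts a brace-balanced object literal that follows a marker.
final class JsVariableExtractor {

    // Returns the text from the first '{' after the marker up to its matching '}',
    // or null if the marker is missing or the braces never balance.
    static String extractObject(String html, String marker) {
        int markerIdx = html.indexOf(marker);
        if (markerIdx < 0) {
            return null;
        }
        int start = html.indexOf('{', markerIdx + marker.length());
        if (start < 0) {
            return null;
        }
        int depth = 0;
        boolean inString = false; // inside a double-quoted string
        boolean escaped = false;  // previous character was a backslash inside a string
        for (int i = start; i < html.length(); i++) {
            char c = html.charAt(i);
            if (inString) {
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return html.substring(start, i + 1);
                }
            }
        }
        return null; // braces never balanced
    }

    public static void main(String[] args) {
        // Made-up page fragment; note the ';' inside the title string.
        String html = "<script>var _data = {\"real_time_news\":"
                + "[{\"group_id\":\"1\",\"title\":\"a;b\"}]};</script>";
        System.out.println(extractObject(html, "var _data = "));
    }
}
```

The returned string can then be handed to the JSON parser in place of the substringBetween result.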

Summary

That is all for this article. I hope it is of some help in your study or work. If you have any questions, feel free to leave a comment.