Preface: Before writing this article, I looked at how a few similar crawlers were written. Some of them are built around queues, which I find not very intuitive; others issue only a single request and then analyze that one page, without ever crawling onward on their own. Can that even be called a crawler? So I wrote a simple crawler based on my own ideas.
1. Algorithm introduction
The program is based on a breadth-first traversal: it issues GET requests for the links that have not been traversed yet, one after another, parses each returned page with a regular expression, extracts any links that have not been discovered before, adds them to the collection, and traverses them in the next round.
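For instance, a single line containing an <a> tag can be parsed like this. The sample HTML string is made up purely for this illustration; the pattern is the same one used in the full listing in the next section:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractDemo {
    public static void main(String[] args) {
        // A made-up line of HTML, just for this illustration
        String line = "<p>See <a href=\"/archives/123\">this post</a> for details.</p>";
        // Capture the href value (group 1) and the anchor text (group 3)
        Pattern pattern = Pattern.compile("<a.*?href=[\"']?((https?://)?/?[^\"']+)[\"']?.*?>(.+)</a>");
        Matcher matcher = pattern.matcher(line);
        if (matcher.find()) {
            System.out.println("link : " + matcher.group(1)); // prints /archives/123
            System.out.println("title: " + matcher.group(3)); // prints this post
        }
    }
}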
Concretely, the implementation uses Map<String, Boolean>, where each key-value pair is a link and a flag indicating whether it has been traversed. The program uses two such maps: oldMap and newMap. The initial link is put into oldMap. Then, for every link in oldMap whose flag is false, a request is made, the returned page is parsed, and a regular expression extracts the links inside the <a> tags. If a link appears in neither oldMap nor newMap, it is a newly discovered link; and if it also belongs to the target website we want to crawl, it is put into newMap while parsing continues. Once a page has been fully parsed, the value of that page's link in oldMap is set to true, meaning it has been traversed.
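The corresponding check, excerpted from the crawlLinks() method in the listing below, looks like this:

// newLink is the normalized link just extracted from an <a> tag.
// Keep it only if it is new to both maps and belongs to the target site.
if (!oldMap.containsKey(newLink)
        && !newMap.containsKey(newLink)
        && newLink.startsWith(oldLinkHost)) {
    newMap.put(newLink, false); // false = not traversed yet
}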
Finally, once every previously untraversed link in oldMap has been processed, we check newMap. If it is not empty, new links were discovered in this round, so they are added to oldMap and the traversal continues recursively. If it is empty, no new links were produced in this round, and further rounds could not produce any either, so the task is finished and the link collection oldMap is returned.
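This termination logic is the tail of crawlLinks() in the listing below:

// If this round produced new links, merge them into oldMap and recurse;
// otherwise return the finished collection.
if (!newMap.isEmpty()) {
    oldMap.putAll(newMap);
    oldMap.putAll(crawlLinks(oldLinkHost, oldMap));
}
return oldMap;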
2. Program implementation
The idea has been explained above, and the key parts of the code are commented, so I will not repeat it here. The code is as follows:
package action;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebCrawlerDemo {

    public static void main(String[] args) {
        WebCrawlerDemo webCrawlerDemo = new WebCrawlerDemo();
        webCrawlerDemo.myPrint("http://www.zifangsky.cn");
    }

    public void myPrint(String baseUrl) {
        Map<String, Boolean> oldMap = new LinkedHashMap<String, Boolean>(); // key-value pairs: link -> traversed or not
        String oldLinkHost = ""; // host, for example: http://www.zifangsky.cn
        Pattern p = Pattern.compile("(https?://)?[^/\\s]*");
        Matcher m = p.matcher(baseUrl);
        if (m.find()) {
            oldLinkHost = m.group();
        }
        oldMap.put(baseUrl, false);
        oldMap = crawlLinks(oldLinkHost, oldMap);
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            System.out.println("Link:" + mapping.getKey());
        }
    }

    /**
     * Crawl all reachable links on one website using a breadth-first strategy:
     * keep issuing GET requests for links that have not been traversed yet,
     * until no new links can be found and the task ends.
     *
     * @param oldLinkHost the domain, such as: http://www.zifangsky.cn
     * @param oldMap      the collection of links to be traversed
     * @return all crawled links
     */
    private Map<String, Boolean> crawlLinks(String oldLinkHost, Map<String, Boolean> oldMap) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>();
        String oldLink = "";
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            System.out.println("link:" + mapping.getKey() + "---------check:" + mapping.getValue());
            // Only handle links that have not been traversed yet
            if (!mapping.getValue()) {
                oldLink = mapping.getKey();
                // Issue a GET request
                try {
                    URL url = new URL(oldLink);
                    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
                    connection.setRequestMethod("GET");
                    connection.setConnectTimeout(2000);
                    connection.setReadTimeout(2000);

                    if (connection.getResponseCode() == 200) {
                        InputStream inputStream = connection.getInputStream();
                        BufferedReader reader = new BufferedReader(
                                new InputStreamReader(inputStream, "UTF-8"));
                        String line = "";
                        Pattern pattern = Pattern
                                .compile("<a.*?href=[\"']?((https?://)?/?[^\"']+)[\"']?.*?>(.+)</a>");
                        Matcher matcher = null;
                        while ((line = reader.readLine()) != null) {
                            matcher = pattern.matcher(line);
                            if (matcher.find()) {
                                String newLink = matcher.group(1).trim(); // the link
                                // String title = matcher.group(3).trim(); // the title
                                // If the link does not start with http, it is relative: prepend the host
                                if (!newLink.startsWith("http")) {
                                    if (newLink.startsWith("/"))
                                        newLink = oldLinkHost + newLink;
                                    else
                                        newLink = oldLinkHost + "/" + newLink;
                                }
                                // Strip a trailing "/"
                                if (newLink.endsWith("/"))
                                    newLink = newLink.substring(0, newLink.length() - 1);
                                // Deduplicate, and discard links that point to other websites
                                if (!oldMap.containsKey(newLink)
                                        && !newMap.containsKey(newLink)
                                        && newLink.startsWith(oldLinkHost)) {
                                    // System.out.println("temp2: " + newLink);
                                    newMap.put(newLink, false);
                                }
                            }
                        }
                    }
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                } catch (IOException e) {
                    e.printStackTrace();
                }

                try {
                    Thread.sleep(1000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
                oldMap.replace(oldLink, false, true);
            }
        }
        // New links were found in this round, so keep traversing
        if (!newMap.isEmpty()) {
            oldMap.putAll(newMap);
            oldMap.putAll(crawlLinks(oldLinkHost, oldMap)); // Map keys are unique, so no duplicate pairs appear
        }
        return oldMap;
    }
}

3. Final test results
PS: Using recursion here is actually not ideal: if the website has many pages and the program runs for a long time, the memory consumption becomes very large.
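One way around that, sketched here as an alternative rather than as part of the original program, is to replace the recursion with a loop: repeat full passes over oldMap until a pass discovers nothing new. Only the outer structure is shown; the GET request and regex parsing inside the loop would stay exactly as in crawlLinks() above.

// Hypothetical iterative variant of crawlLinks(); the request/parsing body is elided.
private Map<String, Boolean> crawlLinksIteratively(String oldLinkHost, Map<String, Boolean> oldMap) {
    while (true) {
        Map<String, Boolean> newMap = new LinkedHashMap<String, Boolean>();
        for (Map.Entry<String, Boolean> mapping : oldMap.entrySet()) {
            if (!mapping.getValue()) {
                // ... same GET request, regex parsing and filtering as in crawlLinks(),
                // putting any newly discovered links into newMap ...
                oldMap.replace(mapping.getKey(), false, true); // mark as traversed
            }
        }
        if (newMap.isEmpty()) {
            return oldMap; // a full pass found no new links: done
        }
        oldMap.putAll(newMap); // merge after the pass, so we never modify oldMap while iterating it
    }
}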
Thank you for reading; I hope this helps. Thanks for supporting this site!