I have written quite a few single-page Python crawlers before, and I still find Python very handy for that. Here I use Java to put together a multi-page crawler that iteratively crawls all the links reachable from a seed page and saves the downloaded pages under the temp path.
1. Preface
Implementing this crawler requires two data structures: an unvisited queue (a PriorityQueue, so the importance of each URL can be ranked) and a visited table (a HashSet, so we can quickly check whether a URL has already been seen). The queue implements breadth-first crawling, and the visited table records URLs that have already been crawled so they are not fetched again, which avoids cycles. The toolkits the Java crawler needs are HttpClient and HtmlParser 1.5; the specific versions and downloads can be found in the Maven repository.
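For reference, here is a sketch of the imports the code snippets below rely on, assuming Commons HttpClient 3.x and HtmlParser as mentioned above (the package names should match, but check them against the versions you pull from the Maven repo):

// Commons HttpClient 3.x classes used by the download code
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpException;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;
// HtmlParser classes used by the page parsing code
import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
// JDK collections for the URL queue and the visited set
import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Queue;
import java.util.Set;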
1. Target website: Sina http://www.sina.com.cn/
2. Screenshot of the results:
Now let's walk through the crawler implementation. The source code will be uploaded to GitHub later; friends who need it can leave a message:
2. Crawler programming
1. Create the url of the seed page
MyCrawler crawler = new MyCrawler();
crawler.crawling(new String[]{"http://www.sina.com.cn/"});
2. Initialize the unvisited queue with the seed URLs above
for (int i = 0; i < seeds.length; i++) {
    LinkQueue.addUnvisitedUrl(seeds[i]);
}
3. The most important part, the core logic: take an unvisited URL out of the queue, download its page, add the URL to the visited table, parse the other URLs on that page, and add the ones that have not been visited yet to the unvisited queue; iterate until the queue is empty or a crawl limit is reached, since this network of URLs is very large. Note that the page download and page parsing here rely on the Java toolkits, whose usage is explained in detail below.
while (!LinkQueue.unVisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 1000) {
    // The head URL comes out of the queue
    String visitUrl = (String) LinkQueue.unVisitedUrlDeQueue();
    if (visitUrl == null)
        continue;
    DownloadFile downLoader = new DownloadFile();
    // Download the web page
    downLoader.downloadFile(visitUrl);
    // Put the url into the visited set
    LinkQueue.addVisitedUrl(visitUrl);
    // Extract the URLs from the downloaded web page
    Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
    // Enqueue the new, unvisited URLs
    for (String link : links) {
        LinkQueue.addUnvisitedUrl(link);
    }
}
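The loop above uses a filter of type LinkFilter that this article does not list. As a minimal sketch (the single accept method and the restriction to www.sina.com.cn are my assumptions), the filter can simply keep the crawl inside the seed site, and the crawling method can wire steps 2 and 3 together like this:

// Hypothetical shape of the LinkFilter callback used above
public interface LinkFilter {
    public boolean accept(String url);
}

public void crawling(String[] seeds) {
    // Assumed restriction: only follow links that stay on the seed site
    LinkFilter filter = new LinkFilter() {
        public boolean accept(String url) {
            return url.startsWith("http://www.sina.com.cn");
        }
    };
    // Step 2: seed the unvisited queue
    for (int i = 0; i < seeds.length; i++) {
        LinkQueue.addUnvisitedUrl(seeds[i]);
    }
    // Step 3: the breadth-first loop shown above goes here
}

Restricting the filter to the seed domain keeps the crawl from wandering across the whole web, which matters given how large the link graph is.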
public String downloadFile(String url) {
    String filePath = null;
    /* 1. Generate the HttpClient object and set the parameters */
    HttpClient httpClient = new HttpClient();
    // Set the HTTP connection timeout to 5s
    httpClient.getHttpConnectionManager().getParams().setConnectionTimeout(5000);
    /* 2. Generate the GetMethod object and set the parameters */
    GetMethod getMethod = new GetMethod(url);
    // Set the GET request timeout to 5s
    getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
    // Set request retry handling
    getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
            new DefaultHttpMethodRetryHandler());
    /* 3. Execute the HTTP GET request */
    try {
        int statusCode = httpClient.executeMethod(getMethod);
        // Check the status code of the response
        if (statusCode != HttpStatus.SC_OK) {
            System.err.println("Method failed: " + getMethod.getStatusLine());
            filePath = null;
        }
        /* 4. Process the HTTP response content */
        // Read the response as a byte array
        byte[] responseBody = getMethod.getResponseBody();
        // Generate the file name for saving according to the web page url
        filePath = "temp//" + getFileNameByUrl(url,
                getMethod.getResponseHeader("Content-Type").getValue());
        saveToLocal(responseBody, filePath);
    } catch (HttpException e) {
        // A fatal exception occurred; the protocol may be incorrect or there is a problem with the returned content
        System.out.println("Please check your provided http address!");
        e.printStackTrace();
    } catch (IOException e) {
        // A network exception occurred
        e.printStackTrace();
    } finally {
        // Release the connection
        getMethod.releaseConnection();
    }
    return filePath;
}
5. HTML page parsing toolkit:
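downloadFile relies on two helpers, getFileNameByUrl and saveToLocal, that are not listed in this article. A minimal sketch of what they might look like (the exact character filtering and the use of DataOutputStream are assumptions; the usual java.io imports are needed):

// Turn a URL into a file name that is safe on disk; assume pages served as HTML get an .html suffix
public String getFileNameByUrl(String url, String contentType) {
    // Strip the protocol prefix and replace characters that are illegal in file names
    url = url.replaceFirst("http://", "");
    if (contentType.indexOf("html") != -1) {
        return url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
    }
    // Other resources keep whatever extension the URL already has
    return url.replaceAll("[\\?/:*|<>\"]", "_");
}

// Write the downloaded bytes to the given path under temp
private void saveToLocal(byte[] data, String filePath) {
    try {
        DataOutputStream out = new DataOutputStream(
                new FileOutputStream(new File(filePath)));
        out.write(data);
        out.flush();
        out.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}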
public static Set<String> extracLinks(String url, LinkFilter filter) {
    Set<String> links = new HashSet<String>();
    try {
        Parser parser = new Parser(url);
        parser.setEncoding("gb2312");
        // Filter for the <frame> tag, to extract the link in the frame tag's src attribute
        NodeFilter frameFilter = new NodeFilter() {
            public boolean accept(Node node) {
                if (node.getText().startsWith("frame src=")) {
                    return true;
                } else {
                    return false;
                }
            }
        };
        // OrFilter that matches both <a> tags and <frame> tags
        OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), frameFilter);
        // Get all tags that pass the filter
        NodeList list = parser.extractAllNodesThatMatch(linkFilter);
        for (int i = 0; i < list.size(); i++) {
            Node tag = list.elementAt(i);
            if (tag instanceof LinkTag) { // <a> tag
                LinkTag link = (LinkTag) tag;
                String linkUrl = link.getLink(); // url
                if (filter.accept(linkUrl))
                    links.add(linkUrl);
            } else { // <frame> tag
                // Extract the link in the frame's src attribute, such as <frame src="test.html"/>
                String frame = tag.getText();
                int start = frame.indexOf("src=");
                frame = frame.substring(start);
                int end = frame.indexOf(" ");
                if (end == -1)
                    end = frame.indexOf(">");
                String frameUrl = frame.substring(5, end - 1);
                if (filter.accept(frameUrl))
                    links.add(frameUrl);
            }
        }
    } catch (ParserException e) {
        e.printStackTrace();
    }
    return links;
}
6. The unvisited URLs are stored in a PriorityQueue (a priority queue), mainly so that ranking algorithms such as PageRank can be applied, since some URLs carry more weight than others; the visited table is implemented with a HashSet, the key point being a fast check of whether a URL already exists.
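As a quick usage sketch, reusing the filter from the crawling sketch above, this prints every link extracted from the seed page:

Set<String> links = HtmlParserTool.extracLinks("http://www.sina.com.cn/", filter);
for (String link : links) {
    System.out.println(link);
}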
public class LinkQueue {
    // Collection of URLs that have been visited
    private static Set visitedUrl = new HashSet();
    // Collection of URLs that have not been visited yet
    private static Queue unVisitedUrl = new PriorityQueue();

    // Get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    // Add to the visited URL collection
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a visited URL
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Take an unvisited URL out of the queue
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.poll();
    }

    // Ensure that each URL is visited only once
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.add(url);
    }

    // Get the number of URLs that have been visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Check whether the unvisited URL queue is empty
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.isEmpty();
    }
}
That is all the content of this article. I hope it is helpful for everyone's learning, and I hope everyone will continue to support Wulin.com.