Basic ideas
Idea 1 (origin:master): Start from a given Wikipedia page (for example the aircraft carrier page, which serves as the key), find all links whose title attribute contains the key (aircraft carrier), and add them to the queue of pages to be crawled. While grabbing one page's code and pictures, the crawler thus also collects the addresses of all other pages on it that relate to the key, and a breadth-first traversal completes the task.
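As an illustration of Idea 1's link-collection step, here is a minimal sketch built around the globals introduced under "Key points" (regKey, allKeys, keys); the selector and the function name collectLinks are assumptions for illustration, not the project's exact code.

    // Sketch only: gather candidate links from a downloaded page (Idea 1).
    var cheerio = require('cheerio');

    function collectLinks(downHtml, regKey, allKeys, keys) {
        var $ = cheerio.load(downHtml);
        $('#bodyContent a[title]').each(function () {
            var title = $(this).attr('title');
            // a link is a target if its title contains any of the keywords
            var isTarget = regKey.some(function (k) {
                return title.indexOf(k) > -1;
            });
            // the title doubles as the page identifier, so it also guards against re-crawling
            if (isTarget && allKeys.indexOf(title) === -1) {
                allKeys.push(title);
                keys.push(title);   // enqueue for the breadth-first traversal
            }
        });
    }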
Idea 2 (origin:cat): Crawl by category. Note that on Wikipedia, category pages start with Category:. Because Wikipedia has a well-organized document structure, it is easy to start from any category and keep crawling every category below it. For each category page, this algorithm extracts the subcategories and grabs all of the pages under it in parallel. It is fast and preserves the category structure, but it does produce many duplicate pages; those are easy to clean up with a script afterwards.
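For Idea 2, reading a single category page might look roughly like the sketch below. It assumes the usual MediaWiki layout, with subcategories listed under #mw-subcategories and member pages under #mw-pages; parseCategoryPage is an illustrative name, not the project's actual function.

    // Sketch only: split a Category: page into subcategories and content pages (Idea 2).
    var cheerio = require('cheerio');

    function parseCategoryPage(downHtml) {
        var $ = cheerio.load(downHtml);
        var subCats = [];
        var pages = [];
        $('#mw-subcategories a').each(function () {
            subCats.push($(this).attr('title'));   // e.g. "Category:...", crawled recursively
        });
        $('#mw-pages a').each(function () {
            pages.push($(this).attr('title'));     // content pages, fetched in parallel
        });
        return { subCats: subCats, pages: pages };
    }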
Selection of library
At first I wanted to use jsdom. It felt powerful but also rather "heavy", and the worst part was that its documentation was not good enough: it talks up the advantages but never gives a comprehensive overview. So I switched to cheerio, which is lightweight and reasonably complete; at the very least you can get a full picture of it at a glance. In fact, once the work was done I realized that no library is strictly necessary at all, and everything can be done with regular expressions. I ended up using the library only a little and wrote the rest with regexes.
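To illustrate the point that a DOM library is not strictly required, here is a small sketch that pulls link titles out of raw HTML with nothing but a regular expression; it is an example of the approach, not the project's actual code.

    // Sketch only: extract the title attribute of every <a> tag with a plain regex.
    function extractTitles(downHtml) {
        var titles = [];
        var re = /<a [^>]*title="([^"]*)"[^>]*>/g;
        var m;
        while ((m = re.exec(downHtml)) !== null) {
            titles.push(m[1]);      // captured title text
        }
        return titles;
    }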
Key points
Global variable settings:
    var regKey = ['Aircraft Carrier', 'Aircraft Carrier', 'Aircraft Carrier'];  // keyword variants: a link whose title contains one of these is a target
    var allKeys = [];  // link titles, which also serve as page identifiers, used to avoid crawling a page twice
    var keys = ['Category:%E8%88%AA%E7%A9%BA%E6%AF%8D%E8%88%B0'];  // queue of pages waiting to be crawled, seeded with the start page
Image download
Use the request library's streaming interface so that each download forms its own closure, and watch out for side effects of the asynchronous operations. The image file name also needs to be regenerated: at first I kept the original names, but for some reason certain images that clearly existed could not be displayed. The srcset attribute must also be removed, otherwise the saved page will not display the original image.
    $ = cheerio.load(downHtml);
    var rsHtml = $.html();
    var imgs = $('#bodyContent .image');    // pictures are wrapped in elements with this class
    for (img in imgs) {
        if (typeof imgs[img].attribs === 'undefined' || typeof imgs[img].attribs.href === 'undefined') {
            continue;    // the structure is an image inside a link; if the link is missing, skip it
        } else {
            var picUrl = imgs[img].children[0].attribs.src;    // image address
            var dirs = picUrl.split('.');
            var filename = baseDir + uuid.v1() + '.' + dirs[dirs.length - 1];    // rename with a uuid, keeping the extension
            request("https:" + picUrl).pipe(fs.createWriteStream('pages/' + filename));    // download
            rsHtml = rsHtml.replace(picUrl, filename);    // replace with the local path
            // console.log(picUrl);
        }
    }

Breadth-first traversal
At first I did not fully understand asynchrony and tried to do this in a loop. I assumed that using a Promise had already made things synchronous, but in fact a Promise only guarantees that the operations handed to it run in order relative to each other; it does not order them with respect to the rest of the code. For example, the following code is wrong.
    var keys = ['Aircraft Carrier'];
    var key = keys.shift();
    while (key) {
        data.get({
            url: encodeURI(key),
            qs: null
        }).then(function (downHtml) {
            ...
            keys.push(key);    // (1)
        });
        key = keys.shift();    // (2)
    }

The code looks fine at first glance, but in practice (2) runs before (1) ever does! What can be done?
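The same pitfall can be reproduced without any network code at all; the stripped-down sketch below uses only Promise.resolve and console.log and is not from the project itself.

    // The while loop only schedules the .then callbacks; it never waits for them.
    var queue = ['a'];
    var item = queue.shift();
    while (item) {
        Promise.resolve(item).then(function (it) {
            console.log('processed', it);   // runs later, from the microtask queue
            queue.push(it + '+');           // (1) happens after the loop has already ended
        });
        item = queue.shift();               // (2) runs immediately; the queue is still empty
    }
    console.log('loop done');               // printed before any "processed" line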
I solved this with recursion. Example code:
    var key = keys.shift();
    (function doNext(key) {
        data.get({
            url: key,
            qs: null
        }).then(function (downHtml) {
            ...
            keys.push(href);
            ...
            key = keys.shift();
            if (key) {
                doNext(key);
            } else {
                console.log('The crawl task was completed smoothly.');
            }
        });
    })(key);

Regex cleaning
Regular expressions are used to strip useless page markup. Since there are several patterns to handle, I wrote a loop to process them uniformly.
    var regs = [
        /<link rel="stylesheet" href="?[^"]*">/g,
        /<script>?[^<]*<\/script>/g,
        /<style>?[^<]*<\/style>/g,
        /<a ?[^>]*>/g,
        /<\/a>/g,
        /srcset=("?[^"]*")/g
    ];
    regs.forEach(function (rs) {
        var matches = rsHtml.match(rs) || [];    // match() returns null when nothing matches
        for (var i = 0; i < matches.length; i++) {
            rsHtml = rsHtml.replace(matches[i],
                matches[i].indexOf('stylesheet') > -1
                    ? '<link rel="stylesheet" href="wiki' + (i + 1) + '.css"'
                    : '');    // keep a local stylesheet reference, drop everything else
        }
    });

Running effect
Accessing the Chinese Wikipedia requires getting around the firewall here. In a test run I crawled the aircraft carrier category. About 300 related links were found during the run (including category pages, from which I only took the valid links without downloading the pages themselves), and 209 pages were downloaded correctly in the end. I manually checked some of the failing links and found they were simply invalid: the entries had not been created yet. The whole process took less than fifteen minutes, and the result was nearly 30 MB after compression. The outcome feels pretty good.
Source code
https://github.com/zhoutk/wikiSpider
Summary
By last night the task was basically complete. Idea 1 crawls pages whose content is fairly accurate and never crawls a page twice, but it is not very efficient and cannot capture category information accurately. Idea 2 automatically crawls by Wikipedia category and stores the files locally; it is highly efficient (in an actual run on [warship] it fetched nearly 6,000 pages in about 50 minutes, i.e. more than 100 pages per minute) and it preserves the category information accurately.
The biggest gain was a much deeper understanding of how to control the overall flow of asynchronous programs.