Today, let’s work through alsotang’s crawler tutorial and do a simple crawl of CNode.
Create the project craelr-demo
We first create an Express project and then delete everything in app.js, since we don’t need to render anything in a browser for now. Alternatively, we can start from an empty folder and npm install express (plus the other packages) directly, pulling in only the functionality we need.
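If you take the empty-folder route, a single npm command installs everything this walkthrough uses (the list mirrors the require() calls in the code below):
npm install express superagent cheerio eventproxy --save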
Target website analysis
As shown in the figure, this is part of a div on the CNode homepage. We use this set of ids and classes to locate the information we need.
Use superagent to get source data
superagent is an HTTP library for making ajax-style requests. Its chained API feels much like jQuery’s; here we issue a GET request and handle the result in the callback.
The code is as follows:
var express = require('express');
var url = require('url'); // for parsing and resolving URLs
var superagent = require('superagent'); // don't forget npm install
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
  .end(function (err, res) {
    console.log(res);
  });
The res handed to the callback is an object describing the response; the page’s HTML is in res.text (a string).
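For instance, a quick sketch of inspecting the response (res.status and res.text are standard superagent response fields):
superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    console.log(res.status); // HTTP status code, e.g. 200
    console.log(res.text.slice(0, 100)); // first 100 characters of the page HTML
  });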
Use cheerio to parse
Cheerio works as a server-side jQuery. We first load the HTML with its .load(), then filter elements with CSS selectors.
The code is as follows:
var $ = cheerio.load(res.text);
// filter data with a CSS selector
$('#topic_list .topic_title').each(function (idx, element) {
  console.log(element);
});
The selector returns a cheerio object; calling .each(function (index, element)) iterates over it, and each element handed to the callback is a raw HTML DOM element.
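To use jQuery-style accessors on an entry, wrap the raw element with $() first; the $element below is what the next paragraph refers to:
$('#topic_list .topic_title').each(function (idx, element) {
  var $element = $(element); // wrap the raw DOM element in a cheerio object
  console.log($element.attr('title'));
  console.log($element.attr('href'));
});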
The output of console.log($element.attr('title')); is 广州2014年12月06日NodeParty 之UC 场 (a topic title, “Guangzhou NodeParty, UC session, December 6, 2014”). The output of console.log($element.attr('href')); is a relative path like /topic/545c395becbcb78265856eb2 , which Node.js’s url.resolve() function can complete into a full URL.
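url.resolve() joins a base URL and a relative path; a quick check:
var url = require('url');
// prints 'https://cnodejs.org/topic/545c395becbcb78265856eb2'
console.log(url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2'));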
The code is as follows:
superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // get every topic link on the homepage
    $('#topic_list .topic_title').each(function (idx, element) {
      var $element = $(element);
      // resolve the relative href against the site root
      var href = url.resolve(targetUrl, $element.attr('href'));
      console.log(href);
      topicUrls.push(href); // collect it for the next step
    });
  });
Use eventproxy to concurrently crawl content from each topic
The tutorial shows examples of both the deeply nested (serial) approach and the counter approach; eventproxy instead handles this with events, running the fetches in parallel. Once every fetch has finished and eventproxy has received all the event messages, it automatically calls the handler function.
The code is as follows:
// Step 1: get an instance of eventproxy
var ep = new eventproxy();

// Step 2: define the callback that listens for the event.
// after() listens for the same event a given number of times.
// params: eventname (String), times (Number), callback (Function)
ep.after('topic_html', topicUrls.length, function (topics) {
  // topics is an array of the 40 pairs passed by the 40 ep.emit('topic_html', pair) calls
  topics = topics.map(function (topicPair) {
    // parse each topic page with cheerio
    var topicUrl = topicPair[0];
    var topicHtml = topicPair[1];
    var $ = cheerio.load(topicHtml);
    return {
      title: $('.topic_full_title').text().trim(),
      href: topicUrl,
      comment1: $('.reply_content').eq(0).text().trim()
    };
  });
  console.log('outcome:');
  console.log(topics);
});

// Step 3: emit the event message for each topic
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      if (err) {
        return console.error(err);
      }
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    });
});
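The .end callbacks above check err by hand. eventproxy also provides fail() and done() helpers for centralized error handling; here is a hedged sketch of Step 3 rewritten with them, based on eventproxy’s documented API:
ep.fail(function (err) {
  // any error passed to an ep.done() callback lands here once
  console.error(err);
});

topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(ep.done(function (res) {
      // ep.done() forwards err to ep.fail() and calls this only on success
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    }));
});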
Running it logs a fetch … successful line per topic and then prints the outcome: the array of 40 parsed topics.
Extended exercises (challenges)
Get the commenter’s username and points
In the source of an article page, find the class name on a comment’s author: it is reply_author. console.log the first matched element, $('.reply_author').get(0), and you can see that everything we need to grab is right there.
First, let’s crawl a single article and grab everything we need in one pass.
The code is as follows:
var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref); // link to the commenter's user page
console.log($('.reply_author').get(0).children[0].data); // the commenter's username
We can grab the points information from https://cnodejs.org/user/username .
The code is as follows:
$('.reply_author').each(function (idx, element) {
  var $element = $(element);
  console.log($element.attr('href')); // each commenter's user-page path
});
On the user information page, $('.big').text().trim() gives the points.
Use cheerio’s .get(0) to grab the first matched element, as in the userHref line above.
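Putting the pieces together, a minimal sketch that follows the first commenter’s link and prints their points (the .big selector comes from above; the rest is plumbing we already used):
var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
superagent.get(userHref)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var $user = cheerio.load(res.text);
    console.log($user('.big').text().trim()); // the commenter's points
  });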
This only crawls a single article; to cover all 40 topics we still need a few changes, sketched below.
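One way to extend it, sketched under the same assumptions (the authorHref field and the user_score event name are mine, not the tutorial’s): save each topic’s first commenter link while parsing, then run a second eventproxy round over those user pages:
// in the ep.after('topic_html', ...) handler, also keep the commenter link:
// authorHref: url.resolve(targetUrl, $('.reply_author').get(0).attribs.href)
var userUrls = topics.map(function (topic) {
  return topic.authorHref;
});

var ep2 = new eventproxy();
ep2.after('user_score', userUrls.length, function (scores) {
  console.log(scores); // one points string per commenter
});

userUrls.forEach(function (userUrl) {
  superagent.get(userUrl)
    .end(function (err, res) {
      if (err) {
        return console.error(err);
      }
      var $ = cheerio.load(res.text);
      ep2.emit('user_score', $('.big').text().trim());
    });
});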