Today, let’s work through alsotang’s crawler tutorial and do a simple crawl of CNode.
Create the project craelr-demo
We first create an Express project and then delete everything in app.js, since we don’t need to render anything in a browser for now. Alternatively, we can start from an empty folder and npm install express (plus the other packages) directly, pulling in only the functionality we need.
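If you take the empty-folder route, a single npm command installs everything this walkthrough uses (the list mirrors the require() calls in the code below):
npm install express superagent cheerio eventproxy --save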
Target website analysis
As shown in the figure, this is part of a div on the CNode homepage. We use this set of ids and classes to locate the information we need.
Use superagent to get source data
superagent is an HTTP library for making ajax-style requests. Its chained API feels much like jQuery’s; here we issue a GET request and handle the result in the callback.
The code is as follows:
var express = require('express');
var url = require('url'); // for parsing and resolving URLs
var superagent = require('superagent'); // don't forget npm install
var cheerio = require('cheerio');
var eventproxy = require('eventproxy');

var targetUrl = 'https://cnodejs.org/';

superagent.get(targetUrl)
  .end(function (err, res) {
    console.log(res);
  });
The res handed to the callback is an object describing the response; the page’s HTML is in res.text (a string).
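For instance, a quick sketch of inspecting the response (res.status and res.text are standard superagent response fields):
superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    console.log(res.status); // HTTP status code, e.g. 200
    console.log(res.text.slice(0, 100)); // first 100 characters of the page HTML
  });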
Use cheerio to parse
Cheerio works as a server-side jQuery. We first load the HTML with its .load(), then filter elements with CSS selectors.
The code is as follows:
var $ = cheerio.load(res.text);
// filter data with a CSS selector
$('#topic_list .topic_title').each(function (idx, element) {
  console.log(element);
});
The selector returns a cheerio object; calling .each(function (index, element)) iterates over it, and each element handed to the callback is a raw HTML DOM element.
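To use jQuery-style accessors on an entry, wrap the raw element with $() first; the $element below is what the next paragraph refers to:
$('#topic_list .topic_title').each(function (idx, element) {
  var $element = $(element); // wrap the raw DOM element in a cheerio object
  console.log($element.attr('title'));
  console.log($element.attr('href'));
});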
The output of console.log($element.attr('title')); is 广州2014年12月06日NodeParty 之UC 场 (a topic title, “Guangzhou NodeParty, UC session, December 6, 2014”). The output of console.log($element.attr('href')); is a relative path like /topic/545c395becbcb78265856eb2 , which Node.js’s url.resolve() function can complete into a full URL.
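url.resolve() joins a base URL and a relative path; a quick check:
var url = require('url');
// prints 'https://cnodejs.org/topic/545c395becbcb78265856eb2'
console.log(url.resolve('https://cnodejs.org/', '/topic/545c395becbcb78265856eb2'));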
The code is as follows:
superagent.get(targetUrl)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var topicUrls = [];
    var $ = cheerio.load(res.text);
    // get every topic link on the homepage
    $('#topic_list .topic_title').each(function (idx, element) {
      var $element = $(element);
      // resolve the relative href against the site root
      var href = url.resolve(targetUrl, $element.attr('href'));
      console.log(href);
      topicUrls.push(href); // collect it for the next step
    });
  });
Use eventproxy to concurrently crawl content from each topic
The tutorial shows examples of both the deeply nested (serial) approach and the counter approach; eventproxy instead handles this with events, running the fetches in parallel. Once every fetch has finished and eventproxy has received all the event messages, it automatically calls the handler function.
The code is as follows:
// Step 1: get an instance of eventproxy
var ep = new eventproxy();

// Step 2: define the callback that listens for the event.
// after() listens for the same event a given number of times.
// params: eventname (String), times (Number), callback (Function)
ep.after('topic_html', topicUrls.length, function (topics) {
  // topics is an array of the 40 pairs passed by the 40 ep.emit('topic_html', pair) calls
  topics = topics.map(function (topicPair) {
    // parse each topic page with cheerio
    var topicUrl = topicPair[0];
    var topicHtml = topicPair[1];
    var $ = cheerio.load(topicHtml);
    return {
      title: $('.topic_full_title').text().trim(),
      href: topicUrl,
      comment1: $('.reply_content').eq(0).text().trim()
    };
  });
  console.log('outcome:');
  console.log(topics);
});

// Step 3: emit the event message for each topic
topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(function (err, res) {
      if (err) {
        return console.error(err);
      }
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    });
});
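The .end callbacks above check err by hand. eventproxy also provides fail() and done() helpers for centralized error handling; here is a hedged sketch of Step 3 rewritten with them, based on eventproxy’s documented API:
ep.fail(function (err) {
  // any error passed to an ep.done() callback lands here once
  console.error(err);
});

topicUrls.forEach(function (topicUrl) {
  superagent.get(topicUrl)
    .end(ep.done(function (res) {
      // ep.done() forwards err to ep.fail() and calls this only on success
      console.log('fetch ' + topicUrl + ' successful');
      ep.emit('topic_html', [topicUrl, res.text]);
    }));
});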
Running it logs a fetch … successful line per topic and then prints the outcome: the array of 40 parsed topics.
Extended exercises (challenges)
Get the commenter’s username and points
In the source of an article page, find the class name on a comment’s author: it is reply_author. console.log the first matched element, $('.reply_author').get(0), and you can see that everything we need to grab is right there.
First, let’s crawl a single article and grab everything we need in one pass.
The code is as follows:
var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
console.log(userHref); // link to the commenter's user page
console.log($('.reply_author').get(0).children[0].data); // the commenter's username
We can grab the points information from https://cnodejs.org/user/username .
The code is as follows:
$('.reply_author').each(function (idx, element) {
  var $element = $(element);
  console.log($element.attr('href')); // each commenter's user-page path
});
On the user information page, $('.big').text().trim() gives the points.
Use cheerio’s .get(0) to grab the first matched element, as in the userHref line above.
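Putting the pieces together, a minimal sketch that follows the first commenter’s link and prints their points (the .big selector comes from above; the rest is plumbing we already used):
var userHref = url.resolve(targetUrl, $('.reply_author').get(0).attribs.href);
superagent.get(userHref)
  .end(function (err, res) {
    if (err) {
      return console.error(err);
    }
    var $user = cheerio.load(res.text);
    console.log($user('.big').text().trim()); // the commenter's points
  });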
This only crawls a single article; to cover all 40 topics we still need a few changes, sketched below.
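One way to extend it, sketched under the same assumptions (the authorHref field and the user_score event name are mine, not the tutorial’s): save each topic’s first commenter link while parsing, then run a second eventproxy round over those user pages:
// in the ep.after('topic_html', ...) handler, also keep the commenter link:
// authorHref: url.resolve(targetUrl, $('.reply_author').get(0).attribs.href)
var userUrls = topics.map(function (topic) {
  return topic.authorHref;
});

var ep2 = new eventproxy();
ep2.after('user_score', userUrls.length, function (scores) {
  console.log(scores); // one points string per commenter
});

userUrls.forEach(function (userUrl) {
  superagent.get(userUrl)
    .end(function (err, res) {
      if (err) {
        return console.error(err);
      }
      var $ = cheerio.load(res.text);
      ep2.emit('user_score', $('.big').text().trim());
    });
});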