Continuing from last time, we need to modify the program to crawl the content of 40 pages in a row. That is, for each article we need to output the title, the link, the first comment, the user who posted that comment, and that user's forum points.
As the figure shows, $('.reply_author').eq(0).text().trim() returns the user who posted the first comment.
{<1>}
After getting the comment and username inside the eventproxy handler, we need to follow the username's link to the user's page and crawl the user's points from there.
The code is as follows:
var $ = cheerio.load(topicHtml);
// URL of the user page, which is the next crawl target
var userHref = 'https://cnodejs.org' + $('.reply_author').eq(0).attr('href');
userHref = url.resolve(tUrl, userHref);
var title = $('.topic_full_title').text().trim().replace(/\n/g, "");
var href = topicUrl;
var comment1 = $('.reply_content').eq(0).text().trim();
var author1 = $('.reply_author').eq(0).text().trim();
//Pass the parameters to the next concurrent crawl
ep.emit('user_html', [userHref, title, href, comment1, author1]);
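The step that builds userHref joins the link found on the page with the site address. A minimal sketch of that resolution using Node's built-in URL class follows; the sample path /user/alice and the helper name resolveHref are assumptions made for illustration and do not come from the crawler itself.

```javascript
// Sketch: resolving a profile link against the site's base URL.
// The base URL matches the site in the article; the sample paths are made up.
const base = 'https://cnodejs.org/';

function resolveHref(href) {
  // new URL(relative, base) joins a relative path onto the base;
  // an href that is already absolute comes back unchanged.
  return new URL(href, base).href;
}

const relative = resolveHref('/user/alice');
const absolute = resolveHref('https://cnodejs.org/user/alice');

console.log(relative);
console.log(absolute);
```

Because resolving an absolute URL leaves it unchanged, prefixing the host by hand and then resolving again is harmless but redundant.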
This time, in the eventproxy handler, we need to find where the score sits on the user page (the element with class="big").
{<2>}
Having found the class name, let's try printing the result first.
The code is as follows:
var outcome = superagent.get(userUrl)
    .end(function (err, res) {
        if (err) {
            return console.error(err);
        }
        var $ = cheerio.load(res.text);
        var score = $('.big').text().trim();
        console.log(user[1]);
        console.log(user[2]);
        console.log(user[3]);
        console.log(user[4]);
        console.log(score);
        // This value is returned from the callback only; it never reaches outcome
        return ({
            title: user[1],
            href: user[2],
            comment1: user[3],
            author1: user[4],
            score1: score
        });
    });
}); // presumably closes an enclosing callback whose opening line is not shown
Running the program with this code gives the following result.
{<3>}
The problem is that the result prints correctly inside the .end() callback, but not outside it. On closer inspection, what ends up in outcome is a Request object. This was a careless mistake: the value returned from the .end() callback is not handed back to the caller, so outcome holds the Request object rather than our result. The result has to be returned to the enclosing layer (users) instead.
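The behaviour described above can be reproduced without any network access. The sketch below uses fakeGet, a hypothetical stand-in (not superagent) whose end() calls the callback and then returns the request object itself, which mirrors what the article observed. The URL and response text are made up.

```javascript
// Sketch: a hypothetical stand-in for a request object. Its end() invokes the
// callback and returns the request itself, not whatever the callback returned.
function fakeGet(url) {
  const request = {
    url: url,
    end: function (callback) {
      // The callback's return value is dropped here.
      callback(null, { text: '<span class="big">42</span>' });
      return request;
    }
  };
  return request;
}

let seenInside = null;
const outcome = fakeGet('https://example.com/user/alice').end(function (err, res) {
  seenInside = res.text;        // the data is available in here...
  return { score1: res.text };  // ...but this return value goes nowhere
});

console.log(outcome.url);     // outcome is the request object
console.log(outcome.score1);  // undefined
```

Any data needed later therefore has to leave the callback some other way, for example through another function call or an emitted event.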
The code is as follows:
// find userDetails
ep.after('user_html', topicUrls.length, function (users) {
    users = users.map(function (user) {
        var userUrl = user[0];
        var score;
        superagent.get(userUrl)
            .end(function (err, res) {
                if (err) {
                    return console.error(err);
                }
                //console.log(res.text);
                var $ = cheerio.load(res.text);
                score = $('.big').text().trim();
            });
        // Runs before the .end() callback above has fired
        return ({
            title: user[1],
            href: user[2],
            comment1: user[3],
            author1: user[4],
            score1: score
        });
    });
});
Printing users shows that every value except score1 is correct. After careful debugging, I found that console.log() ran before the requests started inside .map() had completed. More precisely, inside the .map() callback, the return statement executes before the .get() callback has assigned score. That callback is asynchronous, and the outer synchronous code does not wait for it to finish.
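The ordering problem can be shown in isolation. In this minimal sketch, setTimeout stands in for the HTTP request, and fetchScore, the user names, and the scores are all made up for illustration.

```javascript
// Sketch: setTimeout stands in for an asynchronous HTTP request.
function fetchScore(name, callback) {
  setTimeout(function () {
    callback(null, name.length * 10); // made-up score
  }, 10);
}

const users = ['alice', 'bob'].map(function (name) {
  let score;
  fetchScore(name, function (err, value) {
    score = value; // runs later, after map() has already returned
  });
  // The callback has not run yet, so score is still undefined here.
  return { author1: name, score1: score };
});

console.log(users);
```

Every score1 in the printed array is undefined, because each object was built before its callback had a chance to run.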
{<4>}
My approach is to have eventproxy emit one more layer of messages, passing the required data along with each message. In ep.after(), the passed parameters (the result) are printed only once all the messages have been received.
The code is as follows:
score = $('.big').text().trim();
// Newly added
ep.emit('got_score', [user[1], user[2], user[3], user[4], score]);
.....
ep.after('got_score', 10, function (users) {
    console.log(users);
});
{<6>}
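The idea behind this fix is "collect N messages, then run the handler once". A plain JavaScript sketch of that pattern is shown below; collectAfter is a hypothetical helper written for illustration and is not part of eventproxy, and the sample rows are made up.

```javascript
// Sketch of the "collect N messages, then run once" idea.
// collectAfter is a hypothetical helper, not part of eventproxy.
function collectAfter(times, done) {
  const received = [];
  return function emit(data) {
    received.push(data);
    if (received.length === times) {
      done(received); // fires once, when the last expected message arrives
    }
  };
}

let finalUsers = null;
const emitScore = collectAfter(3, function (users) {
  finalUsers = users;
  console.log(users);
});

emitScore(['title A', '/topic/a', 'first!', 'alice', '120']);
emitScore(['title B', '/topic/b', 'nice', 'bob', '45']);
// Only 2 of the 3 messages have arrived, so the handler has not run yet.
const readyAfterTwo = finalUsers !== null;
emitScore(['title C', '/topic/c', 'thanks', 'carol', '300']);
```

Each asynchronous callback reports its own result when it finishes, so the final handler no longer depends on the order in which requests complete.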
This solves the problem, but the value of score1 looks too large. On a second look, it turns out there are two elements with class='big': the user's collected-topics count uses this class as well, so .text() joins both values together. We need to take only the first element. Using Cheerio's .slice( start, [end] ) and .eq(), score becomes score = $('.big').slice(0).eq(0).text().trim(); (.slice(0) keeps the whole selection, so it is .eq(0) that actually picks the first element). The correct result is shown in the figure.
{<7>}
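To see why the unfiltered selection gave an oversized value and what each call contributes, here is a plain-array analogy. It assumes that Cheerio's .slice() follows the same start/end semantics as Array.prototype.slice and that .text() on a multi-element selection joins the text of every match. The two sample values are made up.

```javascript
// Plain-array analogy; the values stand for the text of the two .big elements.
const bigTexts = ['120', '7']; // [score, collected topics], made-up values

const allText = bigTexts.join('');          // like $('.big').text()
const sliced = bigTexts.slice(0);           // slice(0) keeps every element
const firstOnly = bigTexts.slice(0, 1)[0];  // slice(0, 1) keeps only the first
const viaIndex = sliced[0];                 // like .slice(0).eq(0), same as .eq(0)

console.log(allText);   // "1207"
console.log(firstOnly); // "120"
```

Under these assumptions, selecting by index alone is enough to isolate the score.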