Today, web crawling is a well-known technology, but there is still plenty of complexity in it. Simple web crawlers struggle to cope with modern websites built on technologies such as Ajax, XMLHttpRequest, WebSockets, Flash sockets, and so on.
Take our basic needs on the Hubdoc project as an example: we crawl bill amounts, due dates, and account numbers, and most importantly the PDFs of recent bills, from the websites of banks, utilities, and credit card companies. For this project I started with a very simple solution (rather than the expensive commercial product we were evaluating at the time): a simple crawler along the lines of what I used to build in Perl at MessageLabs/Symantec. But the results were not good; the spammers built far simpler websites than banks and utilities do.
So how do we solve this problem? We started with mikeal's excellent request library. Make the request in the browser, check in the Network panel which request headers were sent, and then copy those headers into the code. The process is simple: trace every request from logging in through downloading the PDF, and then simulate all of those requests. To make this kind of work easier, and to let web developers write crawlers more sanely, I exposed the HTML that comes back to jQuery-style querying (using the lightweight cheerio library), which keeps similar tasks simple and makes it easy to use CSS selectors to pick elements out of a page. The whole process is wrapped in a framework that also does extra work, such as fetching credentials from the database, loading the individual robots, and communicating with the UI through socket.io.
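As an illustration of that request + cheerio pattern, here is a minimal sketch; the URL, headers, form fields, and CSS selectors are placeholders rather than Hubdoc's actual targets:

var request = require('request');
var cheerio = require('cheerio');

// Headers copied from the browser's Network panel (values here are placeholders).
var headers = {
    'User-Agent': 'Mozilla/5.0 (compatible; example-crawler)',
    'Accept': 'text/html'
};

// Hypothetical login request, replaying what the browser sends.
request.post({
    url: 'https://example-bank.com/login',      // placeholder URL
    headers: headers,
    form: { username: 'user', password: 'secret' },
    jar: true                                    // keep session cookies for later requests
}, function (err, res, body) {
    if (err) throw err;

    // Load the returned HTML into cheerio and query it like jQuery.
    var $ = cheerio.load(body);
    var amount = $('.bill-amount').text();       // placeholder CSS selectors
    var dueDate = $('.due-date').text();
    console.log(amount, dueDate);
});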
This works for some web sites, but what runs on those sites is only the JavaScript these companies put there, not my node.js code, and they have layered legacy cruft on top of complexity, making it very hard to figure out exactly what you need to do to get at the login information. For some sites I tried for several days with the request() library, but it was still in vain.
Just as I was about to break down, I discovered node-phantomjs, a library that lets me control the PhantomJS headless WebKit browser from node. (Translator's note: I could not find a corresponding term; "headless" means the page is rendered in the background without a display device.) This looks like a straightforward solution, but PhantomJS has some problems that cannot be avoided:
1. PhantomJS can only tell you whether the page has loaded; it cannot tell you whether a redirect implemented through JavaScript or meta tags happened along the way, especially when JavaScript uses setTimeout() to delay the call.
2. PhantomJS gives you a pageLoadStarted hook that lets you deal with the problem above, but it only works if you decide up front how many page loads to expect, decrement that count as each page finishes loading, and handle possible timeouts (because the expected loads do not always happen), so that your callback fires once the count reaches 0. This works, but it always feels a bit like a hack (see the sketch after this list).
3. PhantomJS needs a complete, separate process for each page it crawls, because otherwise the cookies of different pages cannot be kept apart. With a single phantomjs process, the logged-in session of one page gets sent to another page.
4. PhantomJS cannot be used to download resources; you can only save a page as a PNG or PDF. That is useful, but it means we have to fall back on request() to download the PDF.
5. Because of the above, I had to find a way to hand the cookies from the PhantomJS session over to request()'s session: just pass the document.cookie string across, parse it, and inject it into request()'s cookie jar.
6. Injecting variables into the browser session is not easy. To do it, I had to build a string that creates a JavaScript function:
Robot.prototype.add_page_data = function (page, name, data) {
    // Build the function source as a string so that PhantomJS evaluates it
    // inside the page and exposes the data on window under the given name.
    page.evaluate(
        "function () { var " + name + " = window." + name + " = " + JSON.stringify(data) + "; }"
    );
};
7. Some websites are littered with calls like console.log(), and it has to be redefined so the output goes where we want it. To accomplish this, I did this:
if (!console.log) {
    // Borrow a working console from a fresh iframe when the page has
    // removed or clobbered the original one.
    var iframe = document.createElement("iframe");
    document.body.appendChild(iframe);
    console = window.frames[0].console;
}
8. It is not easy to tell the browser that I clicked on an a tag. To accomplish that, I added the following code:
var clickElement = window.clickElement = function (id) {
    // Look up the element and dispatch a synthetic mouse click on it.
    var a = document.getElementById(id);
    var e = document.createEvent("MouseEvents");
    e.initMouseEvent("click", true, true, window, 0, 0, 0, 0, false, false, false, false, 0, null);
    a.dispatchEvent(e);
};
9. I also had to cap the maximum number of concurrent browser sessions to make sure we would not blow up the server. Even so, this limit is still far higher than what the expensive commercial solutions can offer. (Translator's note: that is, this solution supports higher concurrency than the commercial one.)
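One simple way to enforce such a cap, shown purely as an illustration rather than Hubdoc's actual code, is a fixed-concurrency queue, for example with the async module; runRobot() below is a hypothetical stand-in for spawning a PhantomJS process:

var async = require('async');

// Stand-in for whatever spawns a PhantomJS process for one site and
// calls back when that crawl finishes (the delay just simulates work).
function runRobot(site, done) {
    console.log('crawling ' + site);
    setTimeout(done, 1000);
}

// Allow at most 4 browser sessions at a time; the exact limit is an assumption.
var queue = async.queue(function (task, done) {
    runRobot(task.site, done);
}, 4);

// Every site to crawl is pushed onto the queue; extra work simply waits its turn.
queue.push({ site: 'example-bank.com' });
queue.push({ site: 'example-utility.com' });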
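And here is a rough sketch of the page-load counting trick from item 2. It runs inside PhantomJS; onLoadFinished is a standard PhantomJS page callback, while the expected load count, URL, and timeout values are assumptions for the example:

var page = require('webpage').create();

// How many page loads we expect on the way in (login, redirect, landing page).
var expectedLoads = 3;
var settled = false;

function finish(reason) {
    if (settled) return;
    settled = true;
    console.log('page settled: ' + reason);
    // ...continue crawling from here...
}

page.onLoadFinished = function () {
    expectedLoads--;
    if (expectedLoads <= 0) {
        finish('all expected loads finished');
    }
};

// Fallback: the expected loads do not always happen (e.g. a redirect is skipped),
// so stop waiting after a while.
setTimeout(function () {
    finish('timeout');
}, 10000);

page.open('https://example-bank.com/login');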
After all this work, I had a reasonably decent PhantomJS + request crawler solution. You must log in with PhantomJS before falling back to request(); request() then uses the cookies set in PhantomJS to authenticate the logged-in session. This is a huge win, because we can use request()'s streams to download the PDF files.
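A condensed sketch of that hand-off might look like this; downloadPdf, the URLs, and the way the cookie string leaves PhantomJS are placeholders, while request.jar(), request.cookie(), and stream piping are standard request/Node APIs:

var fs = require('fs');
var request = require('request');

// cookieString is whatever document.cookie evaluated to inside the logged-in
// PhantomJS page, e.g. "SESSIONID=abc123; token=xyz".
function downloadPdf(cookieString, pdfUrl, outputPath, callback) {
    var jar = request.jar();

    // Split document.cookie into individual cookies and inject each one
    // into request()'s cookie jar for the target site.
    cookieString.split(';').forEach(function (pair) {
        jar.setCookie(request.cookie(pair.trim()), pdfUrl);
    });

    // The authenticated session now rides along with the request, and the
    // PDF is streamed straight to disk.
    request({ url: pdfUrl, jar: jar })
        .on('error', callback)
        .pipe(fs.createWriteStream(outputPath))
        .on('finish', callback);
}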
The whole idea is to make it relatively easy for web developers who understand jQuery and CSS selectors to build crawlers for different web sites. I have not yet proven that this idea is feasible, but I believe it will be soon.