Because PhantomJS is a headless browser that can execute JavaScript, it can also access a page's DOM, which makes it a good fit for web crawling.
For example, suppose we want to batch-crawl the content of the "Today in History" web page.
After inspecting the DOM structure, we find that we only need the title attribute of each element matched by the selector .list li a. So we use that CSS selector to collect the matching DOM nodes:
    var d = '';
    var c = document.querySelectorAll('.list li a');
    var l = c.length;
    for (var i = 0; i < l; i++) {
        d = d + c[i].title + '\n';
    }

After that, all we have to do is run this code inside PhantomJS.
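The concatenation loop above can also be written with map/join. Here is a minimal sketch of that equivalent logic; since there is no real page here, the links array below is a stand-in for the NodeList that document.querySelectorAll would return, and the titles in it are invented placeholders:

```javascript
// Stand-in for the elements returned by document.querySelectorAll('.list li a');
// on the real page each entry is an <a> element carrying a title attribute.
var links = [
    { title: 'Event A' },
    { title: 'Event B' }
];

// Equivalent of the for-loop: collect each title and join them with newlines.
// Array.prototype.map.call works on array-likes such as a NodeList.
var text = Array.prototype.map.call(links, function (a) {
    return a.title;
}).join('\n');

console.log(text); // prints each title on its own line
```

In the real script you would keep querySelectorAll as shown above; this form just avoids the manual index bookkeeping.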
    var page = require('webpage').create();
    page.open('http://www.todayonhistory.com/', function (status) {
        // Open the page
        if (status !== 'success') {
            console.log('FAIL to load the address');
        } else {
            // Run the extraction code in the page context and print the result.
            console.log(page.evaluate(function () {
                var d = '';
                var c = document.querySelectorAll('.list li a');
                var l = c.length;
                for (var i = 0; i < l; i++) {
                    d = d + c[i].title + '\n';
                }
                return d;
            }));
        }
        phantom.exit();
    });

Finally, save this as catch.js, run it from the command line, and redirect the output to a txt file (you can also write the file directly with the PhantomJS file system API).