I recently came across the Gecco crawler framework, which is relatively simple and easy to use, so I wrote a small demo that crawls http://zj.zjol.com.cn/home.html, extracting the title and release time of each news item as the test target. Selecting HTML nodes with jQuery-style CSS selectors is very convenient, and Gecco handles URL matching through annotations, which keeps the code concise and readable.
Add the Maven dependency
<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>1.0.8</version>
</dependency>
Write the list-page crawler
@Gecco(matchUrl = "http://zj.zjol.com.cn/home.html?pageIndex={pageIndex}&pageSize={pageSize}", pipelines = "zJNewsListPipelines")
public class ZJNewsGeccoList implements HtmlBean {

    @Request
    private HttpRequest request;

    @RequestParameter
    private int pageIndex;

    @RequestParameter
    private int pageSize;

    @HtmlField(cssPath = "#content > div > div > div.con_index > div.r.main_mod > div > ul > li > dl > dt > a")
    private List<HrefBean> newList;

    // getters and setters omitted
}

@PipelineName("zJNewsListPipelines")
public class ZJNewsListPipelines implements Pipeline<ZJNewsGeccoList> {
    public void process(ZJNewsGeccoList zjNewsGeccoList) {
        HttpRequest request = zjNewsGeccoList.getRequest();
        for (HrefBean bean : zjNewsGeccoList.getNewList()) {
            // Enter the detail page to crawl it
            SchedulerContext.into(request.subRequest("http://zj.zjol.com.cn" + bean.getUrl()));
        }
        int page = zjNewsGeccoList.getPageIndex() + 1;
        String nextUrl = "http://zj.zjol.com.cn/home.html?pageIndex=" + page + "&pageSize=100";
        // Crawl the next page
        SchedulerContext.into(request.subRequest(nextUrl));
    }
}

Write the detail-page crawler
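The pagination step in the pipeline above simply increments pageIndex and rebuilds the list URL before handing it back to the scheduler. Isolated as a plain helper (a hypothetical class for illustration, not part of the Gecco API), the URL construction looks like this:

```java
public class NextPageUrl {
    // Build the next list-page URL from the current pageIndex.
    // Helper for illustration only; the real pipeline inlines this logic.
    static String nextUrl(int currentPageIndex, int pageSize) {
        int page = currentPageIndex + 1;
        return "http://zj.zjol.com.cn/home.html?pageIndex=" + page + "&pageSize=" + pageSize;
    }

    public static void main(String[] args) {
        // Starting from page 1 with 100 items per page, as in the demo
        System.out.println(nextUrl(1, 100));
    }
}
```

Feeding the result into SchedulerContext.into(request.subRequest(...)) is what makes the crawl advance page by page until no further matches are found.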
@Gecco(matchUrl = "http://zj.zjol.com.cn/news/[code].html", pipelines = "zjNewsDetailPipeline")
public class ZJNewsDetail implements HtmlBean {

    @Text
    @HtmlField(cssPath = "#headline")
    private String title;

    @Text
    @HtmlField(cssPath = "#content > div > div.news_con > div.news-content > div:nth-child(1) > div > p.go-left.post-time.c-gray")
    private String createTime;

    // getters and setters omitted
}

@PipelineName("zjNewsDetailPipeline")
public class ZJNewsDetailPipeline implements Pipeline<ZJNewsDetail> {
    public void process(ZJNewsDetail zjNewsDetail) {
        System.out.println(zjNewsDetail.getTitle() + " " + zjNewsDetail.getCreateTime());
    }
}

Start the main function
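Gecco populates the annotated fields by reflection, but the pipeline reads them through ordinary getters (getTitle(), getCreateTime()), which the listings above omit. A minimal sketch of the plain-POJO part of the detail bean, with the Gecco annotations stripped so it stands alone:

```java
public class ZJNewsDetailPojo {
    // In the real bean these fields carry @Text/@HtmlField annotations
    // and are filled by Gecco from the matched CSS selectors.
    private String title;
    private String createTime;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }

    public String getCreateTime() { return createTime; }
    public void setCreateTime(String createTime) { this.createTime = createTime; }
}
```

The pipeline classes only ever touch the bean through these accessors, so adding them is required for the demo to compile.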
public class Main {
    public static void main(String[] args) {
        GeccoEngine.create()
                // The package path of the project
                .classpath("com.zhaochao.gecco.zj")
                // The page address where crawling starts
                .start("http://zj.zjol.com.cn/home.html?pageIndex=1&pageSize=100")
                // Number of crawler threads to open
                .thread(10)
                // Interval between requests for a single crawler thread
                .interval(10)
                // Use a PC userAgent rather than a mobile one
                .mobile(false)
                // Start running
                .run();
    }
}

Crawl results
That is all for this article. I hope it is helpful for your learning, and I hope you will continue to support Wulin.com.