Example: Crawling News with the Gecco Java Crawler

Author: Eve Cole  Updated: 2025-05-03 16:32:02

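I recently came across the Gecco crawler tool, which struck me as simple and easy to use, so I wrote a demo to test it. The demo crawls the site http://zj.zjol.com.cn/home.html and collects mainly the title and publication time of each news article. HTML nodes are selected with CSS paths, much like jQuery selectors, which makes extraction very convenient. Gecco code relies mainly on annotations to match URLs, which keeps it fairly concise and clean.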
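To make the idea of annotation-driven URL matching more concrete, here is a minimal sketch of how a template with {name} placeholders could be turned into a regular expression and matched against a URL. UrlTemplateSketch and its match method are hypothetical names invented for this illustration; this is not Gecco's actual implementation, only one plausible way such matching might work.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; it is not part of Gecco.
final class UrlTemplateSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Returns the placeholder values if url fits the template, or null if it does not. */
    static Map<String, String> match(String template, String url) {
        List<String> names = new ArrayList<>();
        StringBuilder regex = new StringBuilder();
        Matcher m = PLACEHOLDER.matcher(template);
        int last = 0;
        while (m.find()) {
            // Quote the literal text so '?' and '.' are not treated as regex operators.
            regex.append(Pattern.quote(template.substring(last, m.start())));
            // Each placeholder captures one path or query value.
            regex.append("([^/&?]*)");
            names.add(m.group(1));
            last = m.end();
        }
        regex.append(Pattern.quote(template.substring(last)));

        Matcher um = Pattern.compile(regex.toString()).matcher(url);
        if (!um.matches()) {
            return null;
        }
        Map<String, String> params = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            params.put(names.get(i), um.group(i + 1));
        }
        return params;
    }

    public static void main(String[] args) {
        String template = "http://zj.zjol.com.cn/home.html?pageIndex={pageIndex}&pageSize={pageSize}";
        // A list-page URL fits the template and yields both parameters.
        System.out.println(match(template, "http://zj.zjol.com.cn/home.html?pageIndex=3&pageSize=100"));
        // A URL with a different shape does not fit the template.
        System.out.println(match(template, "http://zj.zjol.com.cn/news/1.html"));
    }
}
```

In a framework built this way, the captured values would then be copied into the bean fields whose names match the placeholders.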

Add the Maven dependency

<dependency>
    <groupId>com.geccocrawler</groupId>
    <artifactId>gecco</artifactId>
    <version>1.0.8</version>
</dependency>

Write the list page crawler

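@Gecco(matchUrl = "http://zj.zjol.com.cn/home.html?pageIndex={pageIndex}&pageSize={pageSize}", pipelines = "zjNewsListPipelines")
public class ZJNewsGeccoList implements HtmlBean {

    // The request that produced this page; the pipeline uses it to queue follow-up requests.
    @Request
    private HttpRequest request;

    // Bound from the {pageIndex} and {pageSize} placeholders in matchUrl.
    @RequestParameter
    private int pageIndex;

    @RequestParameter
    private int pageSize;

    // Links to the news articles listed on the page.
    @HtmlField(cssPath = "#content > div > div > div.con_index > div.r.main_mod > div > ul > li > dl > dt > a")
    private List<HrefBean> newList;

    // Getters and setters for all fields are omitted for brevity.
}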

@PipelineName("zjNewsListPipelines")
public class ZJNewsListPipelines implements Pipeline<ZJNewsGeccoList> {

    public void process(ZJNewsGeccoList zjNewsGeccoList) {
        HttpRequest request = zjNewsGeccoList.getRequest();
        for (HrefBean bean : zjNewsGeccoList.getNewList()) {
            // Queue each article's detail page for crawling.
            SchedulerContext.into(request.subRequest("http://zj.zjol.com.cn" + bean.getUrl()));
        }
        int page = zjNewsGeccoList.getPageIndex() + 1;
        String nextUrl = "http://zj.zjol.com.cn/home.html?pageIndex=" + page + "&pageSize=100";
        // Queue the next list page.
        SchedulerContext.into(request.subRequest(nextUrl));
    }
}
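The list pipeline above builds detail URLs by string concatenation and always requests the page after the current one. As a rough sketch of the same two steps in plain Java, the hypothetical helper below resolves a link with java.net.URI, which also copes with links that are already absolute, and builds the next-page address from a page number. LinkBuilderSketch, its method names, and the sample path /news/example.html are assumptions made for this illustration and do not come from Gecco or the site.

```java
import java.net.URI;

// Hypothetical helper for illustration only; it is not part of Gecco.
final class LinkBuilderSketch {
    private static final URI BASE = URI.create("http://zj.zjol.com.cn/home.html");

    /** Resolves a link taken from the list page against the list page's address. */
    static String detailUrl(String href) {
        // resolve() leaves absolute links untouched and completes relative ones.
        return BASE.resolve(href).toString();
    }

    /** Builds the address of the page after currentPage, keeping the page size fixed. */
    static String nextPageUrl(int currentPage, int pageSize) {
        return "http://zj.zjol.com.cn/home.html?pageIndex=" + (currentPage + 1)
                + "&pageSize=" + pageSize;
    }

    public static void main(String[] args) {
        // "/news/example.html" is a made-up path used only to show the resolution step.
        System.out.println(detailUrl("/news/example.html"));
        System.out.println(nextPageUrl(1, 100));
    }
}
```

Note that the pipeline queues the next page unconditionally; one possible refinement, not shown in the original demo, would be to stop when a list page returns no links.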

Write the detail page crawler

@Gecco(matchUrl = "http://zj.zjol.com.cn/news/{code}.html", pipelines = "zjNewsDetailPipeline")
public class ZJNewsDetail implements HtmlBean {

    // Headline text of the article.
    @Text
    @HtmlField(cssPath = "#headline")
    private String title;

    // Publication time as shown on the page.
    @Text
    @HtmlField(cssPath = "#content > div > div > div.news_con > div.news-content > div:nth-child(1) > div > p.go-left.post-time.c-gray")
    private String createTime;

    // Getters and setters are omitted for brevity.
}

@PipelineName("zjNewsDetailPipeline")
public class ZJNewsDetailPipeline implements Pipeline<ZJNewsDetail> {

    public void process(ZJNewsDetail zjNewsDetail) {
        // Print the title and publication time of each crawled article.
        System.out.println(zjNewsDetail.getTitle() + " " + zjNewsDetail.getCreateTime());
    }
}

Start the crawler from the main function

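public class Main {
    public static void main(String[] args) {
        GeccoEngine.create()
                // Package to scan for the annotated bean and pipeline classes
                .classpath("com.zhaochao.gecco.zj")
                // Page address to start crawling from
                .start("http://zj.zjol.com.cn/home.html?pageIndex=1&pageSize=100")
                // Number of crawler threads (10 is an assumed, illustrative value)
                .thread(10)
                // Start the engine
                .run();
    }
}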

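Crawl results

The detail pipeline prints each article's title followed by its publication time, one article per line.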
