fundus

v0.4.6

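A very simple news crawler in Python. Developed at Humboldt University of Berlin.

Quick Start | Tutorials | News Sources | Paper

Fundus is:

A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code, be it from live websites or the CC-NEWS dataset.
An open-source Python package. Fundus is built on the idea of building something together. We welcome your contribution to help Fundus grow!

Quick Start

To install from pip, simply run: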

 pip install fundus

Fundus requires Python 3.8+.

Example 1: Crawl a bunch of English-language news articles

Let's use Fundus to crawl 2 articles from publishers based in the US.

 from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

That's already it!

If you run this code, it should print something like this:

 Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
          through committee votes on Thursday thanks to a last-minute [...]"
- URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From:   FreeBeacon (2023-05-11 18:41)

Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
          the funds of the university's chapter of College Republicans [...]"
- URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From:   FoxNews (2023-05-09 14:37)

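This printout tells you that you successfully crawled two articles!

For each article, the printout details:

the "Title" of the article, i.e. its headline
the "Text", i.e. the main article body text
the "URL" from which it was crawled
the news source it is "From"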
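
The truncated "[...]" fields in the printout above can be imitated with a small helper. The following is only a sketch: `ArticleRecord` is a hypothetical stand-in with fields named after the four printed labels, not Fundus's own article class, and the values are placeholders. The real attribute names are described in the tutorial on the article class.

```python
from dataclasses import dataclass


@dataclass
class ArticleRecord:
    """Hypothetical stand-in mirroring the four printed fields (not Fundus's class)."""
    title: str
    text: str
    url: str
    source: str


def shorten(value: str, limit: int = 60) -> str:
    """Collapse whitespace and cut long values, marking the cut with '[...]'."""
    collapsed = " ".join(value.split())
    if len(collapsed) <= limit:
        return collapsed
    return collapsed[:limit].rstrip() + " [...]"


def describe(record: ArticleRecord) -> str:
    """Build a one-line summary: source, shortened headline, and URL."""
    return f"{record.source}: {shorten(record.title)} ({record.url})"


# Placeholder values, purely for illustration.
record = ArticleRecord(
    title="Example headline",
    text="Example body text.",
    url="https://example.com/news/1",
    source="ExamplePublisher",
)
print(describe(record))
```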

Example 2: Crawl a specific news source

Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:

 from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

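Example 3: Crawl 1 million articles

To crawl such a vast amount of data, Fundus relies on the CommonCrawl web archive, in particular the news crawl CC-NEWS. If you are not familiar with CommonCrawl or CC-NEWS, check out their websites. Simply import our CCNewsCrawler, and make sure to check out our tutorial beforehand.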

 from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler using all publishers supported by fundus
crawler = CCNewsCrawler(*PublisherCollection)

# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)

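Note: By default, the crawler utilizes all available CPU cores on your system. For optimal performance, we recommend manually setting the number of processes using the processes parameter. A good rule of thumb is to allocate one process per 200 Mbps of bandwidth. This can vary depending on core speed.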
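
The rule of thumb above can be turned into a small calculation. This is a minimal sketch, not part of Fundus: the helper name `suggested_processes` and the cap at the number of CPU cores are assumptions made for illustration.

```python
import os
from typing import Optional


def suggested_processes(
    bandwidth_mbps: float,
    mbps_per_process: float = 200.0,
    max_processes: Optional[int] = None,
) -> int:
    """Apply the 'one process per 200 Mbps' rule of thumb.

    The result is capped at max_processes (default: the number of CPU cores)
    and is never lower than 1. Capping at the core count is an assumption.
    """
    if bandwidth_mbps <= 0 or mbps_per_process <= 0:
        raise ValueError("bandwidth values must be positive")
    cap = max_processes if max_processes is not None else (os.cpu_count() or 1)
    by_bandwidth = max(1, int(bandwidth_mbps // mbps_per_process))
    return min(by_bandwidth, cap)


# A 1000 Mbps connection on a machine allowed up to 16 processes:
print(suggested_processes(1000, max_processes=16))  # 5
```

The returned number could then be passed as the processes parameter mentioned in the note. Since the note says the right value depends on core speed, treat the result as a starting point rather than a fixed setting.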

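Note: The crawl above took about 7 hours using the entire PublisherCollection on a machine with a 1000 Mbps connection, a Core i9-13905H, 64GB RAM, and Windows 11, without printing the articles. The estimated time can vary substantially depending on the publishers used and the available bandwidth. Additionally, not all publishers are included in the CC-NEWS crawl (especially US-based publishers). For large corpus creation, you can also use the regular crawler restricted to sitemaps, which requires significantly less bandwidth.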

 from fundus import PublisherCollection, Crawler, Sitemap

# initialize a crawler for us/uk based publishers and restrict to Sitemaps only
crawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])

# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)
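
To get a feel for the timing reported for the CC-NEWS crawl (1 million articles in about 7 hours), the figure can be converted into an average throughput. The sketch below is plain arithmetic, and the helper names are made up for illustration. It assumes crawl time scales linearly with the number of articles, which is a simplification, since the time depends heavily on publishers and bandwidth.

```python
def articles_per_second(n_articles: int, hours: float) -> float:
    """Average throughput implied by a crawl of n_articles that took `hours`."""
    return n_articles / (hours * 3600)


def estimated_hours(n_articles: int, rate_per_second: float) -> float:
    """Rough duration for n_articles at a constant rate (linear-scaling assumption)."""
    return n_articles / rate_per_second / 3600


# Figures taken from the note: 1 million articles in roughly 7 hours.
rate = articles_per_second(1_000_000, 7)
print(f"{rate:.1f} articles/s")  # 39.7 articles/s

# Under the same (assumed) rate, a 250,000-article corpus would take:
print(f"{estimated_hours(250_000, rate):.2f} h")  # 1.75 h
```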

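Tutorials

We provide quick tutorials to get you started with the library:

Tutorial 1: How to crawl news with Fundus
Tutorial 2: How to crawl articles from CC-NEWS
Tutorial 3: The Article class
Tutorial 4: How to filter articles
Tutorial 5: Advanced topics
Tutorial 6: Logging

If you wish to contribute, check out these tutorials:

How to contribute
How to add a publisher

Currently Supported News Sources

See the project's list of supported publishers for the news sources currently covered.

Also, adding a new publisher is easy - consider contributing to the project!

Evaluation Benchmark

Check out our evaluation benchmark.

The following table summarizes the overall performance of Fundus and the other evaluated scrapers in terms of average ROUGE-LSum precision, recall, and F1-score, each with its standard deviation. The table is sorted in descending order by F1-score.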

Scraper	Precision	Recall	F1-Score	Version
Fundus	99.89 ±0.57	96.75 ±12.75	97.69 ±9.75	0.4.1
Trafilatura	93.91 ±12.89	96.85 ±15.69	93.62 ±16.73	1.12.0
news-please	97.95 ±10.08	91.89 ±16.15	93.39 ±14.52	1.6.13
BTE	81.09 ±19.41	98.23 ±8.61	87.14 ±15.48	/
jusText	86.51 ±18.92	90.23 ±20.61	86.96 ±19.76	3.0.1
BoilerNet	85.96 ±18.55	91.21 ±19.15	86.52 ±18.03	/
Boilerpipe	82.89 ±20.65	82.11 ±29.99	79.90 ±25.86	1.3.0
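
To make the metric columns concrete, here is a simplified sketch of how precision, recall, and F1 can be derived from a longest common subsequence (LCS). It is not the benchmark's code. It computes plain ROUGE-L on whitespace tokens, whereas ROUGE-LSum additionally splits texts on newlines and combines per-sentence matches. The aggregation with a population standard deviation is likewise an assumption.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference: str, candidate: str) -> Tuple[float, float, float]:
    """Simplified ROUGE-L: (precision, recall, F1) on lowercased whitespace tokens."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0, 0.0, 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0, 0.0, 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return precision, recall, 2 * precision * recall / (precision + recall)


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and population standard deviation in percent (the std type is an assumption)."""
    return mean(scores) * 100, pstdev(scores) * 100


# An extraction that is a clean prefix of the reference: perfect precision, lower recall.
p, r, f1 = rouge_l("the cat sat on the mat", "the cat sat")
print(p, r, round(f1, 4))  # 1.0 0.5 0.6667
```

An extractor that returns only correct text but misses part of the article scores high on precision and lower on recall, which is the pattern the Fundus row shows.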

Cite

Please cite the following paper when using Fundus or building upon our work:

 @inproceedings{dallabetta-etal-2024-fundus,
    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
    author = "Dallabetta, Max  and
      Dobberstein, Conrad  and
      Breiding, Adrian  and
      Akbik, Alan",
    editor = "Cao, Yixin  and
      Feng, Yang  and
      Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.29",
    pages = "305--314",
}

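Contact

Please email your questions or comments to Max Dallabetta.

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

License

MIT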

Additional information

Version: v0.4.6
Type: Other source code
Updated: 2025-04-18
Size: 9.13MB
Source: GitHub
