A very simple news crawler in Python. Developed at Humboldt University of Berlin.
Quick Start | Tutorials | News Sources | Paper
Fundus is:
A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code! Be it from live websites or the CC-NEWS dataset.
An open-source Python package. Fundus is built on the idea of building something together. We welcome your contributions to help Fundus grow!
To install from pip, simply run:
pip install fundus
Fundus requires Python 3.8+.
Let's use Fundus to crawl 2 articles from publishers based in the US.
from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

That's already it!
If you run this code, it should print out something like this:
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text: "Democrats jammed three of President Joe Biden's controversial court nominees
through committee votes on Thursday thanks to a last-minute [...]"
- URL: https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From: FreeBeacon (2023-05-11 18:41)
Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text: "Student government at Northwestern University in Illinois "indefinitely" froze
the funds of the university's chapter of College Republicans [...]"
- URL: https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From: FoxNews (2023-05-09 14:37)

This printout tells you that you successfully crawled two articles!
For each article, the printout details the title, a snippet of the article text, the URL it was crawled from, and the publisher together with the publication date.
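If you want to work with these fields directly rather than relying on the formatted printout, they are exposed as attributes of the crawled article object. Below is a minimal sketch; the attribute names title, plaintext, publishing_date, and html.requested_url are assumptions based on the Fundus documentation and are not shown in the printout above, so verify them against your installed version:

from fundus import PublisherCollection, Crawler

crawler = Crawler(PublisherCollection.us)

for article in crawler.crawl(max_articles=2):
    # attribute names assumed from the Fundus documentation; verify for your version
    print(article.title)               # the headline
    print(article.plaintext[:200])     # first 200 characters of the main body text
    print(article.publishing_date)     # publication datetime, if available
    print(article.html.requested_url)  # URL the article was crawled from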
Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:
from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)

You can also use Fundus to crawl articles at a much larger scale. To crawl such a vast amount of data, Fundus relies on the CommonCrawl web archive, specifically the news crawl CC-NEWS. If you are not familiar with CommonCrawl or CC-NEWS, check out their websites. Simply import our CCNewsCrawler, and make sure to check out our tutorial beforehand.
from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler using all publishers supported by Fundus
crawler = CCNewsCrawler(*PublisherCollection)

# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)

Note: By default, the crawl utilizes all CPU cores available on your system. For optimal performance, we recommend setting the number of processes manually using the processes parameter. A good rule of thumb is one process per 200 Mbps of bandwidth; this may vary with core speed.
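As a concrete illustration of the note above, the sketch below sets the number of worker processes explicitly. The processes keyword is the parameter mentioned in the note; the value of 5 is only an example, derived from the rule of thumb for a 1000 Mbps connection:

from fundus import PublisherCollection, CCNewsCrawler

# 1000 Mbps bandwidth / 200 Mbps per process ≈ 5 processes (rule of thumb from the note above)
crawler = CCNewsCrawler(*PublisherCollection, processes=5)

for article in crawler.crawl(max_articles=1_000_000):
    print(article)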
Note: The crawl above was run with the entire PublisherCollection on a machine with a 1000 Mbps connection, a Core i9-13905H, 64 GB RAM, and Windows 11, without printing the articles. The estimated time may vary depending on the publishers used and the available bandwidth. Also, not all publishers are included in the CC-NEWS crawl (this especially concerns US-based publishers). For the creation of large corpora, you can also use the regular crawler restricted to Sitemaps only, which requires less bandwidth:
from fundus import PublisherCollection, Crawler, Sitemap

# initialize a crawler for US/UK-based publishers and restrict the sources to Sitemaps only
crawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])

# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)

We provide quick tutorials to get you started with the library:
If you wish to contribute, check out the following tutorials:
You can find the currently supported publishers here.
Also: Adding a new publisher is easy - consider contributing to the project!
Check out our evaluation benchmark.
The table below summarizes the overall performance of Fundus and evaluates the scrapers in terms of averaged ROUGE-LSum precision, recall, and F1-score, along with their standard deviations. The table is sorted in descending order by F1-score:
| Scraper | Precision | Recall | F1-Score | Version |
|---|---|---|---|---|
| Fundus | 99.89 ±0.57 | 96.75 ±12.75 | 97.69 ±9.75 | 0.4.1 |
| Trafilatura | 93.91 ±12.89 | 96.85 ±15.69 | 93.62 ±16.73 | 1.12.0 |
| news-please | 97.95 ±10.08 | 91.89 ±16.15 | 93.39 ±14.52 | 1.6.13 |
| BTE | 81.09 ±19.41 | 98.23 ±8.61 | 87.14 ±15.48 | / |
| jusText | 86.51 ±18.92 | 90.23 ±20.61 | 86.96 ±19.76 | 3.0.1 |
| BoilerNet | 85.96 ±18.55 | 91.21 ±19.15 | 86.52 ±18.03 | / |
| Boilerpipe | 82.89 ±20.65 | 82.11 ±29.99 | 79.90 ±25.86 | 1.3.0 |
Please cite the following paper when using Fundus or building upon our work:
@inproceedings{dallabetta-etal-2024-fundus,
    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
    author = "Dallabetta, Max and
      Dobberstein, Conrad and
      Breiding, Adrian and
      Akbik, Alan",
    editor = "Cao, Yixin and
      Feng, Yang and
      Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.29",
    pages = "305--314",
}

Please email your questions or comments to Max Dallabetta.
Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.
MIT License