A very simple news crawler in Python. Developed at Humboldt University of Berlin.
Quick Start | Tutorials | News Sources | Paper
Fundus is:

- **A static news crawler.** With Fundus, you can crawl online news articles with only a few lines of Python code! Whether from live websites or the CC-NEWS dataset.
- **An open-source Python package.** Fundus is built on the idea of building something together. We welcome your contributions to help Fundus grow!
To install from pip, simply run:
```bash
pip install fundus
```
Fundus requires Python 3.8+.
Let's use Fundus to crawl 2 articles from publishers based in the US:
```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for news publishers based in the US
crawler = Crawler(PublisherCollection.us)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

That's it!
If you run this code, it should print out something like this:
```console
Fundus-Article:
- Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
- Text:  "Democrats jammed three of President Joe Biden's controversial court nominees
          through committee votes on Thursday thanks to a last-minute [...]"
- URL:   https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
- From:  FreeBeacon (2023-05-11 18:41)

Fundus-Article:
- Title: "Northwestern student government freezes College Republicans funding over [...]"
- Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
          the funds of the university's chapter of College Republicans [...]"
- URL:   https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
- From:  FoxNews (2023-05-09 14:37)
```

This printout tells you that you successfully crawled two articles!
For each article, the printout details:

- the "Title" of the article, i.e. its headline
- the "Text", i.e. the main article body
- the "URL" from which it was crawled
- the news source it is "From", along with the publication date
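Beyond printing, each crawled article can also be inspected field by field. Below is a minimal sketch; the attribute names used here (`title`, `plaintext`, `publishing_date`) are assumptions inferred from the printout above, so check the library's documentation for the exact interface:

```python
from fundus import PublisherCollection, Crawler

crawler = Crawler(PublisherCollection.us)

# access individual fields instead of printing the whole article;
# the attribute names below are assumptions based on the printout above
for article in crawler.crawl(max_articles=2):
    print(article.title)            # the article's headline ("Title")
    print(article.plaintext[:100])  # the first 100 characters of the body ("Text")
    print(article.publishing_date)  # the publication date shown under "From"
```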
Maybe you want to crawl a specific news source instead. Let's crawl 2 articles from The New Yorker only:
```python
from fundus import PublisherCollection, Crawler

# initialize the crawler for The New Yorker
crawler = Crawler(PublisherCollection.us.TheNewYorker)

# crawl 2 articles and print
for article in crawler.crawl(max_articles=2):
    print(article)
```

Fundus also supports crawling articles from the CC-NEWS dataset. To crawl such a large amount of data, Fundus relies on the CommonCrawl web archive, specifically on the news crawl CC-NEWS. If you are not familiar with CommonCrawl or CC-NEWS, check out their websites. Simply import our CCNewsCrawler, and make sure to look at our tutorial beforehand:
```python
from fundus import PublisherCollection, CCNewsCrawler

# initialize the crawler using all publishers supported by Fundus
crawler = CCNewsCrawler(*PublisherCollection)

# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)
```

Note: By default, the crawl utilizes all CPU cores available on your system. For best performance, we recommend setting the number of processes manually via the `processes` parameter. A good rule of thumb is one process per 200 Mbps of bandwidth, though this may vary with core speed.
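For example, applying the rule of thumb above to a 1000 Mbps connection suggests roughly five processes. A short sketch of passing the `processes` parameter mentioned in the note:

```python
from fundus import PublisherCollection, CCNewsCrawler

# rule of thumb from the note above: one process per 200 Mbps of bandwidth,
# so a 1000 Mbps connection suggests roughly 5 processes
crawler = CCNewsCrawler(*PublisherCollection, processes=5)

# crawl a smaller sample and print
for article in crawler.crawl(max_articles=100):
    print(article)
```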
Note: The crawl above used the entire PublisherCollection on a machine with a 1000 Mbps connection, a Core i9-13905H, 64 GB RAM, and Windows 11, without printing the articles. The time required may vary depending on the publishers used and the available bandwidth. Also, not all publishers are included in the CC-NEWS crawl (especially US-based publishers). For the creation of large corpora, you can also use the regular crawler restricted to Sitemaps only, which requires less bandwidth:
```python
from fundus import PublisherCollection, Crawler, Sitemap

# initialize a crawler for US/UK-based publishers and restrict sources to Sitemaps only
crawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])

# crawl 1 million articles and print
for article in crawler.crawl(max_articles=1000000):
    print(article)
```

We provide quick tutorials to get you started with the library.
If you wish to contribute, check out our contributor tutorials as well.
You can find the currently supported publishers here.
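If you'd rather inspect the supported publishers from code, here is a small sketch. It assumes the collection is iterable (as the `*PublisherCollection` unpacking above suggests) and that each publisher exposes a `name` attribute; check the documentation for the actual interface:

```python
from fundus import PublisherCollection

# print the names of all US-based publishers bundled with Fundus;
# iterating the collection is an assumption based on the * unpacking shown earlier
for publisher in PublisherCollection.us:
    print(publisher.name)
```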
Also: adding a new publisher is easy; consider contributing to the project!
Check out our evaluation benchmark.
The table below summarizes the overall performance of Fundus, evaluating the scrapers in terms of averaged ROUGE precision, recall, and F1-score, along with their standard deviations. The table is sorted in descending order by F1-score:
| Scraper | Precision | Recall | F1-Score | Version |
|---|---|---|---|---|
| Fundus | 99.89 ±0.57 | 96.75 ±12.75 | 97.69 ±9.75 | 0.4.1 |
| Trafilatura | 93.91 ±12.89 | 96.85 ±15.69 | 93.62 ±16.73 | 1.12.0 |
| news-please | 97.95 ±10.08 | 91.89 ±16.15 | 93.39 ±14.52 | 1.6.13 |
| BTE | 81.09 ±19.41 | 98.23 ±8.61 | 87.14 ±15.48 | / |
| jusText | 86.51 ±18.92 | 90.23 ±20.61 | 86.96 ±19.76 | 3.0.1 |
| BoilerNet | 85.96 ±18.55 | 91.21 ±19.15 | 86.52 ±18.03 | / |
| Boilerpipe | 82.89 ±20.65 | 82.11 ±29.99 | 79.90 ±25.86 | 1.3.0 |
Please cite the following paper when using Fundus or building upon our work:
```bibtex
@inproceedings{dallabetta-etal-2024-fundus,
    title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
    author = "Dallabetta, Max and
      Dobberstein, Conrad and
      Breiding, Adrian and
      Akbik, Alan",
    editor = "Cao, Yixin and
      Feng, Yang and
      Xiong, Deyi",
    booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.acl-demos.29",
    pages = "305--314",
}
```

Please email your questions or comments to Max Dallabetta.
Thank you for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.
MIT