autoscraper

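AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python

This project is made for automatic web scraping, to make scraping easy. It takes the URL or HTML content of a web page, together with a list of sample data that we want to scrape from that page. This data can be text, a URL, or any HTML tag value on that page. It learns the scraping rules and returns similar elements. You can then use this learned object with new URLs to get similar content, or the exact same elements, from those new pages.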

Installation

It is compatible with Python 3.

Install the latest version from the git repository using pip:

$ pip install git+https://github.com/alirezamika/autoscraper.git

Install from PyPI:

$ pip install autoscraper

Install from source (run from the root of a cloned copy of the repository):

$ python setup.py install

How to use

Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Here is the output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
    'How to call an external command?',
    'What are metaclasses in Python?',
    'Does Python have a ternary conditional operator?',
    'How do you remove duplicates from a list whilst preserving order?',
    'Convert bytes to a string',
    'How to get line count of a large file cheaply in Python?',
    "Does Python have a string 'contains' substring method?",
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Now you can use the scraper object to get the related topics of any Stack Overflow page:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')
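
The same learned scraper can be reused across several pages. As a minimal sketch, the helper below calls get_result_similar on each URL and merges the results, dropping duplicates while keeping first-seen order. The name related_titles is hypothetical and not part of autoscraper; it assumes scraper is an AutoScraper on which build has already been run.

```python
def related_titles(scraper, urls):
    """Collect similar results from several pages, without duplicates.

    `scraper` is assumed to be an AutoScraper that has already been built.
    This helper is illustrative only and is not part of the library.
    """
    seen = set()
    titles = []
    for url in urls:
        # One request per page; each call returns a list of strings.
        for title in scraper.get_result_similar(url):
            if title not in seen:
                seen.add(title)
                titles.append(title)
    return titles
```

Keeping the first-seen order makes the output stable between runs when the pages themselves have not changed.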

Getting exact results

Say we want to scrape live stock prices from Yahoo Finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)

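Note that if you copy this code, you should update wanted_list, because the content of the page changes dynamically.

You can also pass any custom requests module parameter. For example, you may want to use proxies or custom headers: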

proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
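
Proxies, headers and a timeout can be combined in the same request_args dict, since these are all standard keyword arguments of the requests module. The sketch below is one way to assemble them; make_request_args, its parameters and the default timeout of 10 seconds are assumptions made for illustration, not part of autoscraper.

```python
def make_request_args(proxy=None, user_agent=None, timeout=10):
    """Return a fresh dict of keyword arguments for the requests module.

    Hypothetical helper: only `proxies`, `headers` and `timeout` are set,
    all of which are standard `requests.get` keyword arguments.
    """
    args = {"timeout": timeout}
    if proxy:
        # Route both plain and TLS traffic through the same proxy.
        args["proxies"] = {"http": proxy, "https": proxy}
    if user_agent:
        args["headers"] = {"User-Agent": user_agent}
    return args


# Usage (performs a network request, so it is left commented out):
# result = scraper.build(url, wanted_list,
#                        request_args=make_request_args(proxy="http://127.0.0.1:8001"))
```

Building a new dict for every call is a cautious choice: it avoids surprises in case the dict is modified while the request is being prepared.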

Now we can get the price of any symbol:

scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')

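You may want to get other information as well. For example, if you also want the market cap, you can just append it to the wanted list. The get_result_exact method retrieves the data in exactly the same order as the wanted list.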
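
Because get_result_exact keeps the order of the wanted list, the returned values can be labelled by position. The following is a rough sketch under two assumptions: the scraper was built with the price first and the market cap second, and each learned rule matches exactly one value on the new page. The function name quote_fields and the field names are hypothetical.

```python
def quote_fields(scraper, symbol, fields):
    """Label the exact results for one symbol by their position.

    `fields` must list names in the same order as the wanted_list that
    `scraper` was built with. Illustrative helper, not part of autoscraper.
    """
    url = "https://finance.yahoo.com/quote/{}/".format(symbol)
    values = scraper.get_result_exact(url)
    # zip stops at the shorter sequence, so a missing value yields
    # fewer keys instead of raising an error.
    return dict(zip(fields, values))


# Usage (performs a network request, so it is left commented out):
# quote_fields(scraper, "MSFT", ["price", "market_cap"])
```

If a value is missing on a page, the labels after it may shift, so it is worth checking the number of returned values before relying on the mapping.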

Another example: say we want to scrape the about text, the number of stars, and the link to the issues page of GitHub repo pages:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)

Simple, right?

Saving the model

We can now save the built model to use it later. To save:

# Give it a file path
scraper.save('yahoo-finance')

And to load:

scraper.load('yahoo-finance')
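
Saving and loading can be combined so that the rules are learned only once. This is a minimal sketch, assuming the model file is a regular file at model_path; the helper name load_or_build is hypothetical and not part of autoscraper.

```python
import os


def load_or_build(scraper, model_path, url, wanted_list):
    """Load a saved model if one exists; otherwise build and save it.

    Returns True when the model was loaded from disk and False when it
    had to be built. Illustrative helper, not part of autoscraper.
    """
    if os.path.exists(model_path):
        scraper.load(model_path)
        return True
    # No saved model yet: learn the rules from the sample page, then persist them.
    scraper.build(url, wanted_list)
    scraper.save(model_path)
    return False


# Usage (the first call performs a network request, so it is left commented out):
# scraper = AutoScraper()
# load_or_build(scraper, 'yahoo-finance', url, wanted_list)
```

If the target site changes its layout, deleting the saved file forces a rebuild on the next run.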

Tutorials

See this gist for more advanced usages.
AutoScraper and Flask: Create an API From Any Website in Less Than 5 Minutes

Issues

Feel free to open an issue if you have any problem using the module.

Support the project

Happy Coding ♥

Additional information

Version: 1.1.14
Type: Other source code
Updated: 2025-02-25
Size: 12.55 KB
Source: GitHub
