autoscraper
1.1.14

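This project is made for automatic web scraping, to make scraping easy. It takes the URL or HTML content of a web page, together with a list of sample data that we want to scrape from that page. The data can be text, a URL, or any HTML tag value on that page. It learns the scraping rules and returns similar elements. You can then use the learned object with new URLs to get similar content or the exact same elements from those new pages.
It is compatible with Python 3.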
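Install the latest version from the Git repository using pip:
$ pip install git+https://github.com/alirezamika/autoscraper.git
Install from PyPI:
$ pip install autoscraper
Install from source:
$ python setup.py install
Say we want to fetch all related post titles on a Stack Overflow page: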
from autoscraper import AutoScraper
url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
Here's the output:
[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
    'How to call an external command?',
    'What are metaclasses in Python?',
    'Does Python have a ternary conditional operator?',
    'How do you remove duplicates from a list whilst preserving order?',
    'Convert bytes to a string',
    'How to get line count of a large file cheaply in Python?',
    "Does Python have a string 'contains' substring method?",
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]
Now you can use the scraper object to get the related topics of any Stack Overflow page:
scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')
Say we want to scrape live stock prices from Yahoo Finance:
from autoscraper import AutoScraper
url = 'https://finance.yahoo.com/quote/AAPL/'
wanted_list = ["124.81"]
scraper = AutoScraper()
# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)
Note that you should update wanted_list if you want to copy this code, because the content of the page changes dynamically.
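The comment in the example above mentions the html parameter. As a minimal sketch of how it might be used, the helper below reads a saved copy of a page from disk and passes it to build through html instead of a URL. The name build_from_file is hypothetical and not part of autoscraper, and UTF-8 is an assumption about how the file was stored. A saved snapshot does not change, so the wanted_list stays valid for it.

```python
from pathlib import Path


def build_from_file(scraper, path, wanted_list):
    """Learn scraping rules from an HTML file saved on disk.

    `scraper` is expected to be an AutoScraper instance. `build_from_file`
    is a hypothetical helper for illustration, not part of autoscraper.
    """
    # Read the saved page; UTF-8 is an assumption about how it was stored.
    html_content = Path(path).read_text(encoding="utf-8")
    # Pass the content through the html parameter instead of a url.
    return scraper.build(html=html_content, wanted_list=wanted_list)
```

Usage could look like build_from_file(AutoScraper(), 'aapl-snapshot.html', ["124.81"]), where aapl-snapshot.html is a hypothetical local copy of the quote page.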
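You can also pass any custom requests module parameter. For example, you may want to use proxies or custom headers:
proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}
result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
Now we can get the price of any symbol:
scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')
You may want to get other information as well. For example, if you also want the market cap, you can append it to the wanted list. The get_result_exact method retrieves the data in the same exact order as in the wanted list.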
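Since the learned scraper works for any symbol, it can be applied to several symbols in a loop. The following is a minimal sketch: quote_url and fetch_quotes are hypothetical helper names, not part of autoscraper, and the URL pattern is inferred from the two Yahoo Finance URLs above.

```python
def quote_url(symbol):
    """Build a quote page URL, following the pattern of the URLs above."""
    return f"https://finance.yahoo.com/quote/{symbol}/"


def fetch_quotes(scraper, symbols, request_args=None):
    """Map each symbol to the list returned by get_result_exact.

    `scraper` is expected to be an AutoScraper that has already been built
    (or loaded). These helper names are hypothetical, not part of autoscraper.
    """
    quotes = {}
    for symbol in symbols:
        # Hand each call its own copy of the request arguments, so the
        # caller's dict stays untouched whatever the callee does with it.
        args = dict(request_args) if request_args else None
        quotes[symbol] = scraper.get_result_exact(quote_url(symbol), request_args=args)
    return quotes
```

For example, fetch_quotes(scraper, ["MSFT", "GOOG"], request_args=dict(proxies=proxies)) would return a dictionary keyed by symbol, with each value in wanted-list order.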
Another example: say we want to scrape the about text, the number of stars, and the link to the issues page of GitHub repo pages:
from autoscraper import AutoScraper
url = 'https://github.com/alirezamika/autoscraper'
wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']
scraper = AutoScraper()
scraper.build(url, wanted_list)
Simple, right?
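As noted earlier, get_result_exact keeps the order of the wanted list, so the values it returns can be paired with field names. The following is a minimal sketch, assuming the three-item wanted list above; label_result and the field names in REPO_FIELDS are hypothetical and not part of autoscraper.

```python
def label_result(fields, values):
    """Pair field names with values returned in wanted-list order."""
    if len(fields) != len(values):
        # A page may yield fewer or more matches than expected; fail loudly
        # instead of silently mislabeling values.
        raise ValueError(f"expected {len(fields)} values, got {len(values)}")
    return dict(zip(fields, values))


# Hypothetical field names matching the order of the wanted list above.
REPO_FIELDS = ["about", "stars", "issues_url"]
```

With a built scraper, label_result(REPO_FIELDS, scraper.get_result_exact(repo_url)) would give a dictionary per repository, where repo_url stands for any other GitHub repo page.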
We can now save the built model to use it later. To save:
# Give it a file path
scraper.save('yahoo-finance')
And to load:
scraper.load('yahoo-finance')
Feel free to open an issue if you have any problem using the module.