autoscraper
1.1.14

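This project is made for automatic web scraping, to make scraping easy. It takes the URL or HTML content of a web page and a list of sample data that we want to scrape from that page. This data can be text, a URL, or any HTML tag value of that page. It learns the scraping rules and returns the similar elements. You can then use this learned object with new URLs to get similar content, or the exact same elements, from those new pages.
It is compatible with Python 3.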
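Install the latest version from the Git repository using pip:
$ pip install git+https://github.com/alirezamika/autoscraper.git
Install from PyPI:
$ pip install autoscraper
Install from source:
$ python setup.py install
Say we want to fetch all related post titles on a Stack Overflow page: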
from autoscraper import AutoScraper
url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'
# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]
scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)
Here is the output:
[
'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
'How to call an external command?',
'What are metaclasses in Python?',
'Does Python have a ternary conditional operator?',
'How do you remove duplicates from a list whilst preserving order?',
'Convert bytes to a string',
'How to get line count of a large file cheaply in Python?',
"Does Python have a string 'contains' substring method?",
'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]
Now you can use the scraper object to get the related topics of any Stack Overflow page:
scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')
Say we want to scrape live stock prices from Yahoo Finance:
from autoscraper import AutoScraper
url = 'https://finance.yahoo.com/quote/AAPL/'
wanted_list = ["124.81"]
scraper = AutoScraper()
# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)
Note that if you copy this code, you should update wanted_list, because the content of the page changes dynamically.
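Because live pages change, one way to keep an example reproducible is to train on a saved HTML snapshot through the `html` parameter mentioned in the code comment above. The sketch below is only an illustration and not part of the library: `build_from_file` is a hypothetical helper name, and it assumes `scraper` is an `AutoScraper` instance whose `build` accepts `html` and `wanted_list` as keyword arguments.

```python
from pathlib import Path


def build_from_file(scraper, html_path, wanted_list, encoding="utf-8"):
    """Train `scraper` on a saved HTML snapshot instead of a live URL.

    Assumption: `scraper` is an AutoScraper instance whose build()
    accepts the `html` and `wanted_list` keyword arguments.
    """
    # Read the snapshot from disk.
    html_content = Path(html_path).read_text(encoding=encoding)
    # Pass the markup through the html parameter rather than a url.
    return scraper.build(html=html_content, wanted_list=wanted_list)
```

With a page saved locally as, say, `aapl.html`, a call might look like `build_from_file(AutoScraper(), 'aapl.html', ['124.81'])`; the file name and the sample value are placeholders.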
You can also pass any custom requests module parameter. For example, you may want to use proxies or custom headers:
proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}
result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
Now we can get the price of any symbol:
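scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')
You may want to get other information as well. For example, if you also want the market cap, you can append it to the wanted list. The get_result_exact method retrieves the data in exactly the same order as in the wanted list.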
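The same trained scraper can be applied to several symbols in a loop. The following is a small sketch under stated assumptions: `quote_url` and `fetch_quotes` are hypothetical helper names, the URL pattern is copied from the Yahoo Finance examples above, and `scraper` is assumed to be an `AutoScraper` object that has already been built on a quote page.

```python
from urllib.parse import quote

# URL pattern taken from the Yahoo Finance examples in this README.
QUOTE_URL = "https://finance.yahoo.com/quote/{symbol}/"


def quote_url(symbol):
    """Return the quote page URL for a ticker symbol such as 'MSFT'."""
    return QUOTE_URL.format(symbol=quote(symbol.strip().upper(), safe=""))


def fetch_quotes(scraper, symbols):
    """Map each symbol to the list returned by get_result_exact().

    Each list follows the order of the wanted list used when the scraper
    was built, e.g. [price, market_cap] if both were in wanted_list.
    """
    return {
        symbol: scraper.get_result_exact(quote_url(symbol))
        for symbol in symbols
    }
```

For instance, `fetch_quotes(scraper, ['AAPL', 'MSFT'])` would return a dictionary keyed by symbol. Whether the learned rules carry over to every symbol page depends on those pages sharing the same layout.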
Another example: say we want to scrape the about text, the number of stars, and the link to the issues page of GitHub repository pages:
from autoscraper import AutoScraper
url = 'https://github.com/alirezamika/autoscraper'
wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']
scraper = AutoScraper()
scraper.build(url, wanted_list)
Simple, right?
We can now save the built model to use it later. To save:
# Give it a file path
scraper.save('yahoo-finance')
And to load:
scraper.load('yahoo-finance')
Feel free to open an issue if you have any problems using the module.
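A common pattern with save and load is to reuse a saved model when one exists and to train only on the first run. This is a minimal sketch of that pattern rather than a feature of the library: `load_or_build` is a hypothetical helper, and `scraper` is assumed to be an `AutoScraper` instance with the `build`, `save`, and `load` methods used above.

```python
import os


def load_or_build(scraper, model_path, url, wanted_list):
    """Load a saved model if present; otherwise build one and save it.

    Returns True when the model came from disk, False when it was built.
    """
    if os.path.exists(model_path):
        # Reuse the rules learned in an earlier run.
        scraper.load(model_path)
        return True
    # First run: learn the rules from the page, then persist them.
    scraper.build(url, wanted_list)
    scraper.save(model_path)
    return False
```

A call such as `load_or_build(AutoScraper(), 'yahoo-finance', url, wanted_list)` would train once and load from the file on later runs.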