autoscraper

Version 1.1.14

AutoScraper: A Smart, Automatic, Fast and Lightweight Web Scraper for Python

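This project is made for automatic web scraping, to make scraping easy. It takes the URL or the HTML content of a web page, together with a list of sample data that we want to scrape from that page. The sample data can be text, a URL, or any HTML tag value found on that page. AutoScraper learns the scraping rules and returns the similar elements. You can then use the learned object with new URLs to get similar content, or the exact same elements, from those new pages.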

Installation

It is compatible with Python 3.

Install the latest version from the Git repository using pip:

$ pip install git+https://github.com/alirezamika/autoscraper.git

Install from PyPI:

$ pip install autoscraper

Install from source:

$ python setup.py install

How to use

Getting similar results

Say we want to fetch all related post titles on a Stack Overflow page:

from autoscraper import AutoScraper

url = 'https://stackoverflow.com/questions/2081586/web-scraping-with-python'

# We can add one or multiple candidates here.
# You can also put urls here to retrieve urls.
wanted_list = ["What are metaclasses in Python?"]

scraper = AutoScraper()
result = scraper.build(url, wanted_list)
print(result)

Here is the output:

[
    'How do I merge two dictionaries in a single expression in Python (taking union of dictionaries)?',
    'How to call an external command?',
    'What are metaclasses in Python?',
    'Does Python have a ternary conditional operator?',
    'How do you remove duplicates from a list whilst preserving order?',
    'Convert bytes to a string',
    'How to get line count of a large file cheaply in Python?',
    "Does Python have a string 'contains' substring method?",
    'Why is “1000000000000000 in range(1000000000000001)” so fast in Python 3?'
]

Now you can use the scraper object to get the related topics of any Stack Overflow page:

scraper.get_result_similar('https://stackoverflow.com/questions/606191/convert-bytes-to-a-string')
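
To reuse the learned rules across several pages, one option is a small loop over URLs. The sketch below is an illustration only and not part of the library: `collect_similar` is a hypothetical helper name, and it assumes nothing more than that the object passed in has a `get_result_similar(url)` method, as the scraper object above does.

```python
from typing import Dict, Iterable, List


def collect_similar(scraper, urls: Iterable[str]) -> Dict[str, List[str]]:
    """Run an already-built scraper over several pages.

    `scraper` is assumed to expose get_result_similar(url), like the
    AutoScraper object built above. This helper itself is hypothetical.
    """
    results = {}
    for url in urls:
        # Each call fetches the page and applies the learned rules to it.
        results[url] = scraper.get_result_similar(url)
    return results
```

Calling `collect_similar(scraper, [...])` with a list of question URLs would return one list of related titles per URL, keyed by the URL.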

Getting exact results

Say we want to scrape live stock prices from Yahoo Finance:

from autoscraper import AutoScraper

url = 'https://finance.yahoo.com/quote/AAPL/'

wanted_list = ["124.81"]

scraper = AutoScraper()

# Here we can also pass html content via the html parameter instead of the url (html=html_content)
result = scraper.build(url, wanted_list)
print(result)

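Note that you should update the wanted_list if you want to copy this code, because the content of the page changes dynamically.

You can also pass any custom arguments for the requests module. For example, you may want to use proxies or custom headers: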

proxies = {
    "http": 'http://127.0.0.1:8001',
    "https": 'https://127.0.0.1:8001',
}

result = scraper.build(url, wanted_list, request_args=dict(proxies=proxies))
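
The text mentions custom headers as well as proxies. As a rough sketch, the same `request_args` dictionary could carry both, plus a timeout. `make_request_args` is a hypothetical helper, and the user-agent string and timeout value are placeholders; the keys `headers`, `proxies` and `timeout` are ordinary keyword arguments of the requests library.

```python
from typing import Any, Dict, Optional


def make_request_args(user_agent: str,
                      proxy: Optional[str] = None,
                      timeout: float = 10.0) -> Dict[str, Any]:
    """Assemble keyword arguments understood by the requests library.

    Hypothetical helper: the names and defaults here are illustrative.
    """
    args: Dict[str, Any] = {
        "headers": {"User-Agent": user_agent},
        "timeout": timeout,
    }
    if proxy is not None:
        # Route both http and https traffic through the same proxy.
        args["proxies"] = {"http": proxy, "https": proxy}
    return args


request_args = make_request_args("my-scraper/0.1", proxy="http://127.0.0.1:8001")
# The dictionary would then be passed along as in the example above:
# result = scraper.build(url, wanted_list, request_args=request_args)
```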

Now we can get the price of any symbol:

scraper.get_result_exact('https://finance.yahoo.com/quote/MSFT/')

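You may want to get other information as well. For example, if you also want the market cap, you can simply append it to the wanted list. The get_result_exact method retrieves the data in exactly the same order as the wanted list.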
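
Because the results come back in the order of the wanted list, they can be paired with field names. The following is a minimal sketch under that assumption: `label_exact_result` and the labels used with it are hypothetical, and it further assumes the scraper returns exactly one value per wanted item, which may not hold for every page.

```python
from typing import Dict, List, Sequence


def label_exact_result(scraper, url: str, labels: Sequence[str]) -> Dict[str, str]:
    """Pair get_result_exact() values with caller-supplied field names.

    Hypothetical helper. `labels` must be given in the same order as the
    wanted_list that was used when the scraper was built.
    """
    values: List[str] = scraper.get_result_exact(url)
    if len(values) != len(labels):
        # Refuse to guess which value belongs to which label.
        raise ValueError(
            f"expected {len(labels)} values, got {len(values)}"
        )
    return dict(zip(labels, values))
```

For a scraper built with a price and a market cap in the wanted list, `label_exact_result(scraper, url, ["price", "market_cap"])` would return a dictionary with those two keys.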

Another example: say we want to scrape the about text, the number of stars, and the link to the issues page of a GitHub repository:

from autoscraper import AutoScraper

url = 'https://github.com/alirezamika/autoscraper'

wanted_list = ['A Smart, Automatic, Fast and Lightweight Web Scraper for Python', '6.2k', 'https://github.com/alirezamika/autoscraper/issues']

scraper = AutoScraper()
scraper.build(url, wanted_list)

Simple, right?

Saving the model

We can now save the built model to use it later. To save:

# Give it a file path
scraper.save('yahoo-finance')

And to load:

scraper.load('yahoo-finance')
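
Saving and loading can be combined so that the rules are learned only once. The sketch below shows one possible pattern; `load_or_build` is a hypothetical helper and simply assumes the scraper object offers the `build`, `save` and `load` methods used above.

```python
import os
from typing import List


def load_or_build(scraper, path: str, url: str, wanted_list: List[str]):
    """Load learned rules from `path` if the file exists; otherwise
    build them from `url` and `wanted_list`, then save them to `path`.

    Hypothetical helper built on the build/save/load calls shown above.
    """
    if os.path.exists(path):
        # A saved model is available, so no page needs to be fetched.
        scraper.load(path)
    else:
        scraper.build(url, wanted_list)
        scraper.save(path)
    return scraper
```

On the first run the model file does not exist yet, so the page is fetched and the rules are saved; later runs skip straight to loading.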

Tutorials

For more advanced usage, see this gist.
AutoScraper and Flask: Create an API From Any Website in Less Than 5 Minutes

Issues

Feel free to open an issue if you have any problem using the module.

Support the project

Happy coding ♥

Additional information

Version: 1.1.14
Type: Other source code
Updated: 2025-02-25
Size: 12.55 KB
Source: GitHub
