Web scraper made for AI and simplicity. It runs as a CLI, can be parallelized, and outputs high-quality Markdown content.
Shared:
Scraper:
Indexer:
```shell
# Install the package
python3 -m pip install scrape-it-now
# Run the CLI
scrape-it-now --help
```

To configure the CLI (including authentication to the backend services), use environment variables, a `.env` file, or command-line options.
The application must be run with Python 3.13 or higher. If this version is not installed, an easy way to install it is pyenv.
```shell
# Download the source code
git clone https://github.com/clemlesne/scrape-it-now.git
# Move to the directory
cd scrape-it-now
# Run install scripts
make install dev
# Run the CLI
scrape-it-now --help
```

Usage with Azure Blob Storage and Azure Queue Storage:
```shell
# Azure Storage configuration
export AZURE_STORAGE_ACCESS_KEY=xxx
export AZURE_STORAGE_ACCOUNT_NAME=xxx
# Run the job
scrape-it-now scrape run https://nytimes.com
```

Usage with local disk blob and local disk queue:
```shell
# Local disk configuration
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
# Run the job
scrape-it-now scrape run https://nytimes.com
```

Example:
```shell
❯ scrape-it-now scrape run https://nytimes.com
2024-11-08T13:18:49.169320Z [info ] Start scraping job lydmtyz
2024-11-08T13:18:49.169392Z [info ] Installing dependencies if needed, this may take a few minutes
2024-11-08T13:18:52.542422Z [info ] Queued 1/1 URLs
2024-11-08T13:18:58.509221Z [info ] Start processing https://nytimes.com depth=1 process=scrape-lydmtyz-4 task=63dce50
2024-11-08T13:19:04.173198Z [info ] Loaded 154554 ads and trackers process=scrape-lydmtyz-4
2024-11-08T13:19:16.393045Z [info ] Queued 310/311 URLs depth=1 process=scrape-lydmtyz-4 task=63dce50
2024-11-08T13:19:16.393323Z [info ] Scraped depth=1 process=scrape-lydmtyz-4 task=63dce50
...
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
| `--job-name`, `-jn` | Job name | `JOB_NAME` |
| `--max-depth`, `-md` | Maximum depth | `MAX_DEPTH` |
| `--queue-provider`, `-qp` | Queue provider | `QUEUE_PROVIDER` |
| `--save-images`, `-si` | Save images | `SAVE_IMAGES` |
| `--save-screenshot`, `-ss` | Save screenshots | `SAVE_SCREENSHOT` |
| `--whitelist`, `-w` | Whitelist | `WHITELIST` |
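Every flag maps to an environment variable, so a job can be configured entirely up front before invoking `scrape-it-now scrape run`. A sketch with placeholder values (the exact value formats for booleans and the whitelist are assumptions; see the dedicated sections below):

```shell
# Placeholder configuration mirroring the flags above
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
export JOB_NAME=nytimes-demo
export MAX_DEPTH=2
export SAVE_SCREENSHOT=true
echo "Job $JOB_NAME will crawl to depth $MAX_DEPTH"
```

With these exported, `scrape-it-now scrape run <url>` needs no extra flags.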
For documentation on all available options, run:

```shell
scrape-it-now scrape run --help
```

Usage with Azure Blob Storage:
```shell
# Azure Storage configuration
export AZURE_STORAGE_CONNECTION_STRING=xxx
# Show the job status
scrape-it-now scrape status [job_name]
```

Usage with local disk blob:
```shell
# Local disk configuration
export BLOB_PROVIDER=local_disk
# Show the job status
scrape-it-now scrape status [job_name]
```

Example:
```shell
❯ scrape-it-now scrape status lydmtyz
{"created_at":"2024-11-08T13:18:52.839060Z","last_updated":"2024-11-08T13:19:16.528370Z","network_used_mb":2.6666793823242188,"processed":1,"queued":311}
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
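The status payload is plain JSON, so it is easy to post-process in scripts. A sketch computing the number of URLs still to process from a stand-in payload shaped like the example above (assuming `queued` is a cumulative count that includes processed pages):

```shell
# Stand-in status payload (same shape as the example output above)
status='{"created_at":"2024-11-08T13:18:52.839060Z","last_updated":"2024-11-08T13:19:16.528370Z","network_used_mb":2.67,"processed":1,"queued":311}'
# Subtract processed pages from queued URLs
remaining=$(printf '%s' "$status" | python3 -c '
import json, sys
s = json.load(sys.stdin)
print(s["queued"] - s["processed"])
')
echo "URLs remaining: $remaining"
```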
For documentation on all available options, run:

```shell
scrape-it-now scrape status --help
```

Usage with Azure Blob Storage, Azure Queue Storage and Azure AI Search:
```shell
# Azure OpenAI configuration
export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx
export AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx
export AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx
export AZURE_OPENAI_ENDPOINT=xxx
# Azure Search configuration
export AZURE_SEARCH_API_KEY=xxx
export AZURE_SEARCH_ENDPOINT=xxx
# Azure Storage configuration
export AZURE_STORAGE_ACCESS_KEY=xxx
export AZURE_STORAGE_ACCOUNT_NAME=xxx
# Run the job
scrape-it-now index run [job_name]
```

Usage with local disk blob, local disk queue and Azure AI Search:
```shell
# Azure OpenAI configuration
export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx
export AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx
export AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx
export AZURE_OPENAI_ENDPOINT=xxx
# Azure Search configuration
export AZURE_SEARCH_API_KEY=xxx
export AZURE_SEARCH_ENDPOINT=xxx
# Local disk configuration
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
# Run the job
scrape-it-now index run [job_name]
```

Example:
```shell
❯ scrape-it-now index run lydmtyz
2024-11-08T13:20:37.129411Z [info ] Start indexing job lydmtyz
2024-11-08T13:20:38.945954Z [info ] Start processing https://nytimes.com process=index-lydmtyz-4 task=63dce50
2024-11-08T13:20:39.162692Z [info ] Chunked into 7 parts process=index-lydmtyz-4 task=63dce50
2024-11-08T13:20:42.407391Z [info ] Indexed 7 chunks process=index-lydmtyz-4 task=63dce50
...
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-openai-api-key`, `-aoak` | Azure OpenAI API key | `AZURE_OPENAI_API_KEY` |
| `--azure-openai-embedding-deployment-name`, `-aoedn` | Azure OpenAI embedding deployment name | `AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME` |
| `--azure-openai-embedding-dimensions`, `-aoed` | Azure OpenAI embedding dimensions | `AZURE_OPENAI_EMBEDDING_DIMENSIONS` |
| `--azure-openai-embedding-model-name`, `-aoemn` | Azure OpenAI embedding model name | `AZURE_OPENAI_EMBEDDING_MODEL_NAME` |
| `--azure-openai-endpoint`, `-aoe` | Azure OpenAI endpoint | `AZURE_OPENAI_ENDPOINT` |
| `--azure-search-api-key`, `-asak` | Azure Search API key | `AZURE_SEARCH_API_KEY` |
| `--azure-search-endpoint`, `-ase` | Azure Search endpoint | `AZURE_SEARCH_ENDPOINT` |
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
| `--queue-provider`, `-qp` | Queue provider | `QUEUE_PROVIDER` |
For documentation on all available options, run:

```shell
scrape-it-now index run --help
```
```mermaid
---
title: Scraping process with Azure Storage
---
graph LR
  cli["CLI"]
  web["Website"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
    to_scrape["To scrape"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      job["job"]
      scraped["scraped"]
      state["state"]
    end
  end

  cli -- "1. Pull message" --> to_scrape
  cli -- "2. Get cache" --> scraped
  cli -- "3. Browse" --> web
  cli -- "4. Update cache" --> scraped
  cli -- "5. Push state" --> state
  cli -- "6. Add message" --> to_scrape
  cli -- "7. Add message" --> to_chunk
  cli -- "8. Update state" --> job
```
```mermaid
---
title: Indexing process with Azure Storage and Azure AI Search
---
graph LR
  search["Azure AI Search"]
  cli["CLI"]
  embedding["Azure OpenAI embeddings"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      scraped["scraped"]
    end
  end

  cli -- "1. Pull message" --> to_chunk
  cli -- "2. Get cache" --> scraped
  cli -- "3. Chunk" --> cli
  cli -- "4. Embed" --> embedding
  cli -- "5. Push to search" --> search
```
Blob storage is organized in folders:
```
[job_name]-scraping/      # Job name (either defined by the user or generated)
  scraped/                # All the data from the pages
    [page_id]/            # Assets from a page
      screenshot.jpeg     # Screenshot (if enabled)
      [image_id].[ext]    # Image binary (if enabled)
      [image_id].json     # Image metadata (if enabled)
    [page_id].json        # Data from a page
  state/                  # Job states (cache & parallelization)
    [page_id]             # Page state
  job.json                # Job state (aggregated stats)
```

Page data is treated as an API (it won't break until the next major version) and is stored in JSON format:
```json
{
  "created_at": "2024-09-11T14:06:43.566187Z",
  "redirect": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
  "status": 200,
  "url": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
  "content": "## Listen to the trailer for Serial Season 4...",
  "etag": null,
  "links": [
    "https://podcasts.apple.com/us/podcast/serial/id917918570",
    "https://music.amazon.com/podcasts/d1022069-8863-42f3-823e-857fd8a7b616/serial?ref=dm_sh_OVBHkKYvW1poSzCOsBqHFXuLc",
    ...
  ],
  "metas": {
    "description": "“Serial” returns with a history of Guantánamo told by people who lived through key moments in Guantánamo’s evolution, who know things the rest of us don’t about what it’s like to be caught inside an improvised justice system.",
    "articleid": "100000009373583",
    "twitter:site": "@nytimes",
    ...
  },
  "network_used_mb": 1.041460037231445,
  "raw": "<head>...</head><body>...</body>",
  "valid_until": "2024-09-11T14:11:37.790570Z"
}
```

Then, the indexed data is stored in Azure AI Search:
| Field | Type | Description |
|---|---|---|
| `chunck_number` | `Edm.Int32` | Chunk number, from 0 to x |
| `content` | `Edm.String` | Chunk content |
| `created_at` | `Edm.DateTimeOffset` | Source scrape date |
| `id` | `Edm.String` | Chunk ID |
| `title` | `Edm.String` | Source page title |
| `url` | `Edm.String` | Source page URL |
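Since the page schema shown earlier is stable, downstream tooling can consume the scraped JSON directly. A sketch extracting the HTTP status, link count, and Markdown content from a stand-in document (trimmed to a few fields of the schema):

```shell
# Stand-in page document following the page schema shown earlier
page='{"url": "https://example.com", "status": 200, "content": "## Hello", "links": ["https://example.com/a", "https://example.com/b"]}'
summary=$(printf '%s' "$page" | python3 -c '
import json, sys
page = json.load(sys.stdin)
print(page["status"], len(page["links"]), page["content"])
')
echo "$summary"
```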
The whitelist option restricts scraping to a domain while ignoring some of its sub-paths. It is a space-separated list; each entry is a domain optionally followed by comma-separated regular expressions:

```
domain1,regexp1,regexp2 domain2,regexp3
```

For example, to whitelist `learn.microsoft.com`:

```
learn.microsoft.com
```

To whitelist `learn.microsoft.com` and `go.microsoft.com`, but ignore all sub-paths except `/en-us`:

```
learn.microsoft.com,^/(?!en-us).* go.microsoft.com
```

To easily configure the CLI, source environment variables from a `.env` file. For example, for the `--azure-storage-access-key` option:
```
AZURE_STORAGE_ACCESS_KEY=xxx
```

For arguments accepting multiple values, use a space-separated list. For example, for the `--whitelist` option:
```
WHITELIST=learn.microsoft.com go.microsoft.com
```

The cache directory depends on the operating system:
- `~/.config/scrape-it-now` (Unix)
- `~/Library/Application Support/scrape-it-now` (macOS)
- `C:\Users\<user>\AppData\Roaming\scrape-it-now` (Windows)

Browser binaries are automatically downloaded or updated at each run. The browser is Chromium; it is not configurable (feel free to open an issue if you need another browser) and weighs about 450MB. It is stored in the cache directory.
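The per-OS paths above can be resolved in a script; a minimal sketch (a hypothetical helper, not part of the CLI):

```shell
# Pick the cache directory based on the current OS (mirrors the list above)
case "$(uname -s)" in
  Darwin) cache_dir="$HOME/Library/Application Support/scrape-it-now" ;;
  MINGW*|MSYS*|CYGWIN*) cache_dir="$APPDATA/scrape-it-now" ;;  # Windows shells
  *) cache_dir="$HOME/.config/scrape-it-now" ;;  # Unix default
esac
echo "Cache directory: $cache_dir"
```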
Local disk storage is used for both blob and queue. It is not recommended for production use, as it neither scales easily nor is fault-tolerant. It is useful for testing and development, or when Azure services are not available.
Proxies are not implemented in the application; network anonymity cannot be enforced at the application level. Use a VPN (e.g. your own, or a third party's) or a proxy service (e.g. residential proxies, Tor) to ensure anonymity, and configure the system firewall to restrict the application's network access.
Because the application is packaged on PyPI, it can easily be bundled into a container. At each start, the application downloads its dependencies (browser, etc.) and caches them. You can pre-download them by running the command `scrape-it-now scrape install`.
A good performance practice is to parallelize the scraping and indexing jobs by running multiple containers of each. This can be achieved with KEDA, by configuring a queue scaler.