A web scraper made for AI and simplicity. It runs as a CLI, can be parallelized, and outputs high-quality Markdown content.
Shared:

Scraper:

Indexer:
```shell
# Install the package
python3 -m pip install scrape-it-now
# Run the CLI
scrape-it-now --help
```

To configure the CLI (including authentication to the backend services), use environment variables, a `.env` file, or command-line options.
The application must run with Python 3.13 or later. If this version is not installed, an easy way to install it is pyenv.
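If you wrap the CLI in your own scripts, a quick interpreter check avoids confusing failures on older Pythons. A minimal sketch (not part of the CLI itself):

```python
import sys

# scrape-it-now requires Python 3.13 or later
REQUIRED = (3, 13)

def version_ok(version: tuple[int, int]) -> bool:
    # Compare (major, minor) tuples lexicographically
    return version >= REQUIRED

if not version_ok(sys.version_info[:2]):
    print(f"Python {REQUIRED[0]}.{REQUIRED[1]}+ required, found {sys.version.split()[0]}")
```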
```shell
# Download the source code
git clone https://github.com/clemlesne/scrape-it-now.git
# Move to the directory
cd scrape-it-now
# Run install scripts
make install dev
# Run the CLI
scrape-it-now --help
```

Usage with Azure Blob Storage and Azure Queue Storage:
```shell
# Azure Storage configuration
export AZURE_STORAGE_ACCESS_KEY=xxx
export AZURE_STORAGE_ACCOUNT_NAME=xxx
# Run the job
scrape-it-now scrape run https://nytimes.com
```

Usage with local disk blob and local disk queue:
```shell
# Local disk configuration
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
# Run the job
scrape-it-now scrape run https://nytimes.com
```

Example:
```shell
❯ scrape-it-now scrape run https://nytimes.com
2024-11-08T13:18:49.169320Z [info ] Start scraping job lydmtyz
2024-11-08T13:18:49.169392Z [info ] Installing dependencies if needed, this may take a few minutes
2024-11-08T13:18:52.542422Z [info ] Queued 1/1 URLs
2024-11-08T13:18:58.509221Z [info ] Start processing https://nytimes.com depth=1 process=scrape-lydmtyz-4 task=63dce50
2024-11-08T13:19:04.173198Z [info ] Loaded 154554 ads and trackers process=scrape-lydmtyz-4
2024-11-08T13:19:16.393045Z [info ] Queued 310/311 URLs depth=1 process=scrape-lydmtyz-4 task=63dce50
2024-11-08T13:19:16.393323Z [info ] Scraped depth=1 process=scrape-lydmtyz-4 task=63dce50
...
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
| `--job-name`, `-jn` | Job name | `JOB_NAME` |
| `--max-depth`, `-md` | Maximum depth | `MAX_DEPTH` |
| `--queue-provider`, `-qp` | Queue provider | `QUEUE_PROVIDER` |
| `--save-images`, `-si` | Save images | `SAVE_IMAGES` |
| `--save-screenshot`, `-ss` | Save screenshot | `SAVE_SCREENSHOT` |
| `--whitelist`, `-w` | Whitelist | `WHITELIST` |
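These options can also be set from a wrapper script before launching the CLI, using the environment-variable names from the table. A sketch, where the values shown are placeholders and their exact formats are assumptions:

```python
import os

# Variable names come from the options table; values here are illustrative only
scrape_config = {
    "BLOB_PROVIDER": "local_disk",
    "QUEUE_PROVIDER": "local_disk",
    "JOB_NAME": "demo",
    "MAX_DEPTH": "2",
    "WHITELIST": "learn.microsoft.com",
}
os.environ.update(scrape_config)

# The CLI could then be started with e.g.:
# subprocess.run(["scrape-it-now", "scrape", "run", "https://learn.microsoft.com"])
```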
For documentation of all available options, run:

```shell
scrape-it-now scrape run --help
```

Usage with Azure Blob Storage:
```shell
# Azure Storage configuration
export AZURE_STORAGE_CONNECTION_STRING=xxx
# Show the job status
scrape-it-now scrape status [job_name]
```

Usage with local disk blob:
```shell
# Local disk configuration
export BLOB_PROVIDER=local_disk
# Show the job status
scrape-it-now scrape status [job_name]
```

Example:
```shell
❯ scrape-it-now scrape status lydmtyz
{"created_at": "2024-11-08T13:18:52.839060Z", "last_updated": "2024-11-08T13:19:16.528370Z", "network_used_mb": 2.6666793823242188, "processed": 1, "queued": 311}
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
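Since the status output shown above is plain JSON, it can be consumed programmatically, for example to poll a job's progress. A minimal sketch, with the payload copied from the example above:

```python
import json

# Status payload as printed by the status command
raw = (
    '{"created_at": "2024-11-08T13:18:52.839060Z",'
    ' "last_updated": "2024-11-08T13:19:16.528370Z",'
    ' "network_used_mb": 2.6666793823242188,'
    ' "processed": 1, "queued": 311}'
)
status = json.loads(raw)

# Rough progress indicator from the aggregated counters
progress = status["processed"] / status["queued"]
print(f"{status['processed']}/{status['queued']} pages ({progress:.1%})")
```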
For documentation of all available options, run:

```shell
scrape-it-now scrape status --help
```

Usage with Azure Blob Storage, Azure Queue Storage and Azure AI Search:
```shell
# Azure OpenAI configuration
export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx
export AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx
export AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx
export AZURE_OPENAI_ENDPOINT=xxx
# Azure Search configuration
export AZURE_SEARCH_API_KEY=xxx
export AZURE_SEARCH_ENDPOINT=xxx
# Azure Storage configuration
export AZURE_STORAGE_ACCESS_KEY=xxx
export AZURE_STORAGE_ACCOUNT_NAME=xxx
# Run the job
scrape-it-now index run [job_name]
```

Usage with local disk blob, local disk queue and Azure AI Search:
```shell
# Azure OpenAI configuration
export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx
export AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx
export AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx
export AZURE_OPENAI_ENDPOINT=xxx
# Azure Search configuration
export AZURE_SEARCH_API_KEY=xxx
export AZURE_SEARCH_ENDPOINT=xxx
# Local disk configuration
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
# Run the job
scrape-it-now index run [job_name]
```

Example:
```shell
❯ scrape-it-now index run lydmtyz
2024-11-08T13:20:37.129411Z [info ] Start indexing job lydmtyz
2024-11-08T13:20:38.945954Z [info ] Start processing https://nytimes.com process=index-lydmtyz-4 task=63dce50
2024-11-08T13:20:39.162692Z [info ] Chunked into 7 parts process=index-lydmtyz-4 task=63dce50
2024-11-08T13:20:42.407391Z [info ] Indexed 7 chunks process=index-lydmtyz-4 task=63dce50
...
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-openai-api-key`, `-aoak` | Azure OpenAI API key | `AZURE_OPENAI_API_KEY` |
| `--azure-openai-embedding-deployment-name`, `-aoedn` | Azure OpenAI embedding deployment name | `AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME` |
| `--azure-openai-embedding-dimensions`, `-aoed` | Azure OpenAI embedding dimensions | `AZURE_OPENAI_EMBEDDING_DIMENSIONS` |
| `--azure-openai-embedding-model-name`, `-aoemn` | Azure OpenAI embedding model name | `AZURE_OPENAI_EMBEDDING_MODEL_NAME` |
| `--azure-openai-endpoint`, `-aoe` | Azure OpenAI endpoint | `AZURE_OPENAI_ENDPOINT` |
| `--azure-search-api-key`, `-asak` | Azure Search API key | `AZURE_SEARCH_API_KEY` |
| `--azure-search-endpoint`, `-ase` | Azure Search endpoint | `AZURE_SEARCH_ENDPOINT` |
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
| `--queue-provider`, `-qp` | Queue provider | `QUEUE_PROVIDER` |
For documentation of all available options, run:

```shell
scrape-it-now index run --help
```

```mermaid
---
title: Scrape process with Azure Storage
---
graph LR
  cli["CLI"]
  web["Website"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
    to_scrape["To scrape"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      job["Job"]
      scraped["Scraped"]
      state["State"]
    end
  end

  cli -- "(1) Pull message" --> to_scrape
  cli -- "(2) Get cache" --> scraped
  cli -- "(3) Browse" --> web
  cli -- "(4) Update cache" --> scraped
  cli -- "(5) Push state" --> state
  cli -- "(6) Add message" --> to_scrape
  cli -- "(7) Add message" --> to_chunk
  cli -- "(8) Update state" --> job
```
```mermaid
---
title: Index process with Azure Storage and Azure AI Search
---
graph LR
  search["Azure AI Search"]
  cli["CLI"]
  embedding["Azure OpenAI Embeddings"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      scraped["Scraped"]
    end
  end

  cli -- "(1) Pull message" --> to_chunk
  cli -- "(2) Get cache" --> scraped
  cli -- "(3) Chunk" --> cli
  cli -- "(4) Embed" --> embedding
  cli -- "(5) Push to search" --> search
```
Blob storage is organized in folders:
```
[job_name]-scraping/        # Job name (either defined by the user or generated)
  scraped/                  # All the data from the pages
    [page_id]/              # Assets from a page
      screenshot.jpeg       # Screenshot (if enabled)
      [image_id].[ext]      # Image binary (if enabled)
      [image_id].json       # Image metadata (if enabled)
    [page_id].json          # Data from a page
  state/                    # Job states (cache & parallelization)
    [page_id]               # Page state
  job.json                  # Job state (aggregated stats)
```

Page data is considered an API (it won't break until the next major version) and is stored in JSON format:
```json
{
  "created_at": "2024-09-11T14:06:43.566187Z",
  "redirect": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
  "status": 200,
  "url": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
  "content": "## Listen to the trailer for Serial Season 4...",
  "etag": null,
  "links": [
    "https://podcasts.apple.com/us/podcast/serial/id917918570",
    "https://music.amazon.com/podcasts/d1022069-8863-42f3-823e-857fd8a7b616/serial?ref=dm_sh_OVBHkKYvW1poSzCOsBqHFXuLc",
    ...
  ],
  "metas": {
    "description": "“Serial” returns with a history of Guantánamo told by people who lived through key moments in Guantánamo’s evolution, who know things the rest of us don’t about what it’s like to be caught inside an improvised justice system.",
    "articleid": "100000009373583",
    "twitter:site": "@nytimes",
    ...
  },
  "network_used_mb": 1.041460037231445,
  "raw": "<head>...</head><body>...</body>",
  "valid_until": "2024-09-11T14:11:37.790570Z"
}
```

Indexed data is then stored in Azure AI Search:
| Field | Type | Description |
|---|---|---|
| `chunck_number` | `Edm.Int32` | Chunk number, from 0 to x |
| `content` | `Edm.String` | Chunk content |
| `created_at` | `Edm.DateTimeOffset` | Source scrape date |
| `id` | `Edm.String` | Chunk ID |
| `title` | `Edm.String` | Source page title |
| `url` | `Edm.String` | Source page URL |
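To illustrate how the page data relates to this index schema, here is a sketch that splits a page's Markdown content into chunks and builds documents with the field names from the table above. The chunking strategy, ID scheme, and title value are assumptions for illustration, not the application's actual logic:

```python
import json

# Page data in the JSON shape shown earlier (abridged)
page = json.loads("""{
    "created_at": "2024-09-11T14:06:43.566187Z",
    "url": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
    "content": "## Listen to the trailer for Serial Season 4. A history of Guantanamo."
}""")

def chunk(text: str, size: int = 32) -> list[str]:
    # Naive fixed-size chunking; the real strategy is not documented here
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = [
    {
        "id": f"demo-{n}",           # hypothetical ID scheme
        "chunck_number": n,          # field names follow the index schema table
        "content": part,
        "created_at": page["created_at"],
        "title": "Serial Season 4",  # illustrative; derived from page metadata in practice
        "url": page["url"],
    }
    for n, part in enumerate(chunk(page["content"]))
]
```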
The whitelist option allows restricting scraping to certain domains and ignoring sub-paths. It is a list of regular expressions:

```
domain1,regexp1,regexp2 domain2,regexp3
```

For example:
To whitelist `learn.microsoft.com`:

```
learn.microsoft.com
```

To whitelist `learn.microsoft.com` and `go.microsoft.com`, but ignore all sub-paths except `/en-us`:

```
learn.microsoft.com,^/(?!en-us).* go.microsoft.com
```

To easily configure the CLI, source the environment variables from a `.env` file. For example, for the `--azure-storage-access-key` option:
```
AZURE_STORAGE_ACCESS_KEY=xxx
```

For arguments that accept multiple values, use a space-separated list. For example, for the `--whitelist` option:
```
WHITELIST=learn.microsoft.com go.microsoft.com
```

The cache directory depends on the operating system:
- `~/.config/scrape-it-now` (Unix)
- `~/Library/Application Support/scrape-it-now` (macOS)
- `C:\Users\<user>\AppData\Roaming\scrape-it-now` (Windows)

Browser binaries are automatically downloaded or updated at each run. The browser is Chromium and this is not configurable (feel free to open an issue if another browser is needed); it weighs around 450MB. It is stored in the cache directory.
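The whitelist format described earlier can be handled with a small parser. A sketch, assuming each space-separated entry is a domain followed by comma-separated regular expressions of sub-paths to ignore:

```python
import re

def parse_whitelist(spec: str) -> dict[str, list[re.Pattern]]:
    # "domain1,regexp1,regexp2 domain2,regexp3" -> {domain: [compiled patterns]}
    rules: dict[str, list[re.Pattern]] = {}
    for entry in spec.split():
        domain, *patterns = entry.split(",")
        rules[domain] = [re.compile(p) for p in patterns]
    return rules

rules = parse_whitelist("learn.microsoft.com,^/(?!en-us).* go.microsoft.com")

def ignored(domain: str, path: str) -> bool:
    # A path is ignored when any of the domain's patterns matches it
    return any(p.match(path) for p in rules.get(domain, []))
```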
Local disk storage is used for both blobs and queues. It is not recommended for production use, as it does not scale easily and is not fault-tolerant. It is useful for testing and development, or when Azure services are not available.
Execution:
Proxies are not implemented in the application; network anonymity cannot be enforced at the application level. Use a VPN (e.g. your own, a third party's) or a proxy service (e.g. residential proxies, Tor) to ensure anonymity, and configure the system firewall to restrict the application's network access.
Since the application is packaged on PyPI, it can easily be bundled into a container. At each startup, the application downloads its dependencies (browser, etc.) and caches them. You can pre-download them by running the command `scrape-it-now scrape install`.
A good performance technique is also to parallelize the scrape and index jobs by running multiple containers of each. This can be achieved with KEDA by configuring a queue scaler.