A web scraper made for AI and simplicity. It runs as a CLI, can be parallelized, and outputs high-quality Markdown content.
Shared:

Scraper:

Indexer:
```shell
# Install the package
python3 -m pip install scrape-it-now
# Run the CLI
scrape-it-now --help
```

To configure the CLI (including authentication to the backend services), use environment variables, a `.env` file, or command-line options.
The application must run with Python 3.13 or later. If this version is not installed, an easy way to install it is pyenv.
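If you wrap the CLI in your own scripts, a quick interpreter check avoids confusing failures on older Pythons. A minimal sketch (not part of the CLI itself):

```python
import sys

# scrape-it-now requires Python 3.13 or later
REQUIRED = (3, 13)

def version_ok(version: tuple[int, int]) -> bool:
    # Compare (major, minor) tuples lexicographically
    return version >= REQUIRED

if not version_ok(sys.version_info[:2]):
    print(f"Python {REQUIRED[0]}.{REQUIRED[1]}+ required, found {sys.version.split()[0]}")
```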
```shell
# Download the source code
git clone https://github.com/clemlesne/scrape-it-now.git
# Move to the directory
cd scrape-it-now
# Run install scripts
make install dev
# Run the CLI
scrape-it-now --help
```

Usage with Azure Blob Storage and Azure Queue Storage:
```shell
# Azure Storage configuration
export AZURE_STORAGE_ACCESS_KEY=xxx
export AZURE_STORAGE_ACCOUNT_NAME=xxx
# Run the job
scrape-it-now scrape run https://nytimes.com
```

Usage with local disk blob and local disk queue:
```shell
# Local disk configuration
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
# Run the job
scrape-it-now scrape run https://nytimes.com
```

Example:
```shell
❯ scrape-it-now scrape run https://nytimes.com
2024-11-08T13:18:49.169320Z [info ] Start scraping job lydmtyz
2024-11-08T13:18:49.169392Z [info ] Installing dependencies if needed, this may take a few minutes
2024-11-08T13:18:52.542422Z [info ] Queued 1/1 URLs
2024-11-08T13:18:58.509221Z [info ] Start processing https://nytimes.com depth=1 process=scrape-lydmtyz-4 task=63dce50
2024-11-08T13:19:04.173198Z [info ] Loaded 154554 ads and trackers process=scrape-lydmtyz-4
2024-11-08T13:19:16.393045Z [info ] Queued 310/311 URLs depth=1 process=scrape-lydmtyz-4 task=63dce50
2024-11-08T13:19:16.393323Z [info ] Scraped depth=1 process=scrape-lydmtyz-4 task=63dce50
...
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
| `--job-name`, `-jn` | Job name | `JOB_NAME` |
| `--max-depth`, `-md` | Maximum depth | `MAX_DEPTH` |
| `--queue-provider`, `-qp` | Queue provider | `QUEUE_PROVIDER` |
| `--save-images`, `-si` | Save images | `SAVE_IMAGES` |
| `--save-screenshot`, `-ss` | Save screenshot | `SAVE_SCREENSHOT` |
| `--whitelist`, `-w` | Whitelist | `WHITELIST` |
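These options can also be set from a wrapper script before launching the CLI, using the environment-variable names from the table. A sketch, where the values shown are placeholders and their exact formats are assumptions:

```python
import os

# Variable names come from the options table; values here are illustrative only
scrape_config = {
    "BLOB_PROVIDER": "local_disk",
    "QUEUE_PROVIDER": "local_disk",
    "JOB_NAME": "demo",
    "MAX_DEPTH": "2",
    "WHITELIST": "learn.microsoft.com",
}
os.environ.update(scrape_config)

# The CLI could then be started with e.g.:
# subprocess.run(["scrape-it-now", "scrape", "run", "https://learn.microsoft.com"])
```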
For documentation of all available options, run:

```shell
scrape-it-now scrape run --help
```

Usage with Azure Blob Storage:
```shell
# Azure Storage configuration
export AZURE_STORAGE_CONNECTION_STRING=xxx
# Show the job status
scrape-it-now scrape status [job_name]
```

Usage with local disk blob:
```shell
# Local disk configuration
export BLOB_PROVIDER=local_disk
# Show the job status
scrape-it-now scrape status [job_name]
```

Example:
```shell
❯ scrape-it-now scrape status lydmtyz
{"created_at": "2024-11-08T13:18:52.839060Z", "last_updated": "2024-11-08T13:19:16.528370Z", "network_used_mb": 2.6666793823242188, "processed": 1, "queued": 311}
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
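Since the status output shown above is plain JSON, it can be consumed programmatically, for example to poll a job's progress. A minimal sketch, with the payload copied from the example above:

```python
import json

# Status payload as printed by the status command
raw = (
    '{"created_at": "2024-11-08T13:18:52.839060Z",'
    ' "last_updated": "2024-11-08T13:19:16.528370Z",'
    ' "network_used_mb": 2.6666793823242188,'
    ' "processed": 1, "queued": 311}'
)
status = json.loads(raw)

# Rough progress indicator from the aggregated counters
progress = status["processed"] / status["queued"]
print(f"{status['processed']}/{status['queued']} pages ({progress:.1%})")
```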
For documentation of all available options, run:

```shell
scrape-it-now scrape status --help
```

Usage with Azure Blob Storage, Azure Queue Storage and Azure AI Search:
```shell
# Azure OpenAI configuration
export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx
export AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx
export AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx
export AZURE_OPENAI_ENDPOINT=xxx
# Azure Search configuration
export AZURE_SEARCH_API_KEY=xxx
export AZURE_SEARCH_ENDPOINT=xxx
# Azure Storage configuration
export AZURE_STORAGE_ACCESS_KEY=xxx
export AZURE_STORAGE_ACCOUNT_NAME=xxx
# Run the job
scrape-it-now index run [job_name]
```

Usage with local disk blob, local disk queue and Azure AI Search:
```shell
# Azure OpenAI configuration
export AZURE_OPENAI_API_KEY=xxx
export AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME=xxx
export AZURE_OPENAI_EMBEDDING_DIMENSIONS=xxx
export AZURE_OPENAI_EMBEDDING_MODEL_NAME=xxx
export AZURE_OPENAI_ENDPOINT=xxx
# Azure Search configuration
export AZURE_SEARCH_API_KEY=xxx
export AZURE_SEARCH_ENDPOINT=xxx
# Local disk configuration
export BLOB_PROVIDER=local_disk
export QUEUE_PROVIDER=local_disk
# Run the job
scrape-it-now index run [job_name]
```

Example:
```shell
❯ scrape-it-now index run lydmtyz
2024-11-08T13:20:37.129411Z [info ] Start indexing job lydmtyz
2024-11-08T13:20:38.945954Z [info ] Start processing https://nytimes.com process=index-lydmtyz-4 task=63dce50
2024-11-08T13:20:39.162692Z [info ] Chunked into 7 parts process=index-lydmtyz-4 task=63dce50
2024-11-08T13:20:42.407391Z [info ] Indexed 7 chunks process=index-lydmtyz-4 task=63dce50
...
```

The most common options are:
| Option | Description | Environment variable |
|---|---|---|
| `--azure-openai-api-key`, `-aoak` | Azure OpenAI API key | `AZURE_OPENAI_API_KEY` |
| `--azure-openai-embedding-deployment-name`, `-aoedn` | Azure OpenAI embedding deployment name | `AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME` |
| `--azure-openai-embedding-dimensions`, `-aoed` | Azure OpenAI embedding dimensions | `AZURE_OPENAI_EMBEDDING_DIMENSIONS` |
| `--azure-openai-embedding-model-name`, `-aoemn` | Azure OpenAI embedding model name | `AZURE_OPENAI_EMBEDDING_MODEL_NAME` |
| `--azure-openai-endpoint`, `-aoe` | Azure OpenAI endpoint | `AZURE_OPENAI_ENDPOINT` |
| `--azure-search-api-key`, `-asak` | Azure Search API key | `AZURE_SEARCH_API_KEY` |
| `--azure-search-endpoint`, `-ase` | Azure Search endpoint | `AZURE_SEARCH_ENDPOINT` |
| `--azure-storage-access-key`, `-asak` | Azure Storage access key | `AZURE_STORAGE_ACCESS_KEY` |
| `--azure-storage-account-name`, `-asan` | Azure Storage account name | `AZURE_STORAGE_ACCOUNT_NAME` |
| `--blob-provider`, `-bp` | Blob provider | `BLOB_PROVIDER` |
| `--queue-provider`, `-qp` | Queue provider | `QUEUE_PROVIDER` |
For documentation of all available options, run:

```shell
scrape-it-now index run --help
```

```mermaid
---
title: Scrape process with Azure Storage
---
graph LR
  cli["CLI"]
  web["Website"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
    to_scrape["To scrape"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      job["Job"]
      scraped["Scraped"]
      state["State"]
    end
  end

  cli -- "(1) Pull message" --> to_scrape
  cli -- "(2) Get cache" --> scraped
  cli -- "(3) Browse" --> web
  cli -- "(4) Update cache" --> scraped
  cli -- "(5) Push state" --> state
  cli -- "(6) Add message" --> to_scrape
  cli -- "(7) Add message" --> to_chunk
  cli -- "(8) Update state" --> job
```
```mermaid
---
title: Index process with Azure Storage and Azure AI Search
---
graph LR
  search["Azure AI Search"]
  cli["CLI"]
  embedding["Azure OpenAI Embeddings"]

  subgraph "Azure Queue Storage"
    to_chunk["To chunk"]
  end

  subgraph "Azure Blob Storage"
    subgraph "Container"
      scraped["Scraped"]
    end
  end

  cli -- "(1) Pull message" --> to_chunk
  cli -- "(2) Get cache" --> scraped
  cli -- "(3) Chunk" --> cli
  cli -- "(4) Embed" --> embedding
  cli -- "(5) Push to search" --> search
```
Blob storage is organized in folders:
```
[job_name]-scraping/        # Job name (either defined by the user or generated)
  scraped/                  # All the data from the pages
    [page_id]/              # Assets from a page
      screenshot.jpeg       # Screenshot (if enabled)
      [image_id].[ext]      # Image binary (if enabled)
      [image_id].json       # Image metadata (if enabled)
    [page_id].json          # Data from a page
  state/                    # Job states (cache & parallelization)
    [page_id]               # Page state
  job.json                  # Job state (aggregated stats)
```

Page data is considered an API (it won't break until the next major version) and is stored in JSON format:
```json
{
  "created_at": "2024-09-11T14:06:43.566187Z",
  "redirect": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
  "status": 200,
  "url": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
  "content": "## Listen to the trailer for Serial Season 4...",
  "etag": null,
  "links": [
    "https://podcasts.apple.com/us/podcast/serial/id917918570",
    "https://music.amazon.com/podcasts/d1022069-8863-42f3-823e-857fd8a7b616/serial?ref=dm_sh_OVBHkKYvW1poSzCOsBqHFXuLc",
    ...
  ],
  "metas": {
    "description": "“Serial” returns with a history of Guantánamo told by people who lived through key moments in Guantánamo’s evolution, who know things the rest of us don’t about what it’s like to be caught inside an improvised justice system.",
    "articleid": "100000009373583",
    "twitter:site": "@nytimes",
    ...
  },
  "network_used_mb": 1.041460037231445,
  "raw": "<head>...</head><body>...</body>",
  "valid_until": "2024-09-11T14:11:37.790570Z"
}
```

Indexed data is then stored in Azure AI Search:
| Field | Type | Description |
|---|---|---|
| `chunck_number` | `Edm.Int32` | Chunk number, from 0 to x |
| `content` | `Edm.String` | Chunk content |
| `created_at` | `Edm.DateTimeOffset` | Source scrape date |
| `id` | `Edm.String` | Chunk ID |
| `title` | `Edm.String` | Source page title |
| `url` | `Edm.String` | Source page URL |
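To illustrate how the page data relates to this index schema, here is a sketch that splits a page's Markdown content into chunks and builds documents with the field names from the table above. The chunking strategy, ID scheme, and title value are assumptions for illustration, not the application's actual logic:

```python
import json

# Page data in the JSON shape shown earlier (abridged)
page = json.loads("""{
    "created_at": "2024-09-11T14:06:43.566187Z",
    "url": "https://www.nytimes.com/interactive/2024/podcasts/serial-season-four-guantanamo.html",
    "content": "## Listen to the trailer for Serial Season 4. A history of Guantanamo."
}""")

def chunk(text: str, size: int = 32) -> list[str]:
    # Naive fixed-size chunking; the real strategy is not documented here
    return [text[i:i + size] for i in range(0, len(text), size)]

docs = [
    {
        "id": f"demo-{n}",           # hypothetical ID scheme
        "chunck_number": n,          # field names follow the index schema table
        "content": part,
        "created_at": page["created_at"],
        "title": "Serial Season 4",  # illustrative; derived from page metadata in practice
        "url": page["url"],
    }
    for n, part in enumerate(chunk(page["content"]))
]
```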
The whitelist option allows restricting scraping to certain domains and ignoring sub-paths. It is a list of regular expressions:

```
domain1,regexp1,regexp2 domain2,regexp3
```

For example:
To whitelist `learn.microsoft.com`:

```
learn.microsoft.com
```

To whitelist `learn.microsoft.com` and `go.microsoft.com`, but ignore all sub-paths except `/en-us`:

```
learn.microsoft.com,^/(?!en-us).* go.microsoft.com
```

To easily configure the CLI, source the environment variables from a `.env` file. For example, for the `--azure-storage-access-key` option:
```
AZURE_STORAGE_ACCESS_KEY=xxx
```

For arguments that accept multiple values, use a space-separated list. For example, for the `--whitelist` option:
```
WHITELIST=learn.microsoft.com go.microsoft.com
```

The cache directory depends on the operating system:
- `~/.config/scrape-it-now` (Unix)
- `~/Library/Application Support/scrape-it-now` (macOS)
- `C:\Users\<user>\AppData\Roaming\scrape-it-now` (Windows)

Browser binaries are automatically downloaded or updated at each run. The browser is Chromium and this is not configurable (feel free to open an issue if another browser is needed); it weighs around 450MB. It is stored in the cache directory.
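The whitelist format described earlier can be handled with a small parser. A sketch, assuming each space-separated entry is a domain followed by comma-separated regular expressions of sub-paths to ignore:

```python
import re

def parse_whitelist(spec: str) -> dict[str, list[re.Pattern]]:
    # "domain1,regexp1,regexp2 domain2,regexp3" -> {domain: [compiled patterns]}
    rules: dict[str, list[re.Pattern]] = {}
    for entry in spec.split():
        domain, *patterns = entry.split(",")
        rules[domain] = [re.compile(p) for p in patterns]
    return rules

rules = parse_whitelist("learn.microsoft.com,^/(?!en-us).* go.microsoft.com")

def ignored(domain: str, path: str) -> bool:
    # A path is ignored when any of the domain's patterns matches it
    return any(p.match(path) for p in rules.get(domain, []))
```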
Local disk storage is used for both blobs and queues. It is not recommended for production use, as it does not scale easily and is not fault-tolerant. It is useful for testing and development, or when Azure services are not available.
Execution:
Proxies are not implemented in the application; network anonymity cannot be enforced at the application level. Use a VPN (e.g. your own, a third party's) or a proxy service (e.g. residential proxies, Tor) to ensure anonymity, and configure the system firewall to restrict the application's network access.
Since the application is packaged on PyPI, it can easily be bundled into a container. At each startup, the application downloads its dependencies (browser, etc.) and caches them. You can pre-download them by running the command `scrape-it-now scrape install`.
A good performance technique is also to parallelize the scrape and index jobs by running multiple containers of each. This can be achieved with KEDA by configuring a queue scaler.