Scrapling


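Dealing with web scrapers that fail because of anti-bot protections or website changes? Meet Scrapling.

Scrapling is a high-performance, intelligent web scraping library for Python that automatically adapts to website changes while significantly outperforming popular alternatives. For beginners and experts alike, Scrapling provides powerful features while staying simple.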


>>> from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, PlayWrightFetcher
>>> StealthyFetcher.auto_match = True
>>> # Fetch websites' source under the radar!
>>> page = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True)
>>> print(page.status)
200
>>> products = page.css('.product', auto_save=True)  # Scrape data that survives website design changes!
>>> # Later, if the website structure changes, pass `auto_match=True`
>>> products = page.css('.product', auto_match=True)  # and Scrapling still finds them!


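Key Features

Fetch websites as you prefer, with async support

HTTP Requests: Fast and stealthy HTTP requests with the Fetcher class.
Dynamic Loading and Automation: Fetch dynamic websites with the PlayWrightFetcher class through your real browser, Scrapling's stealth mode, Playwright's Chrome browser, or NSTBrowser's browserless!
Anti-bot Protection Bypass: Easily bypass protections with the StealthyFetcher and PlayWrightFetcher classes.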

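Adaptive Scraping

Smart Element Tracking: Relocate elements after website changes using an intelligent similarity system and integrated storage.
Flexible Selection: CSS selectors, XPath selectors, filter-based search, text search, regex search, and more.
Find Similar Elements: Automatically locate elements similar to the element you found!
Smart Content Scraping: Extract data from multiple websites without specific selectors using Scrapling's powerful features.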
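
To make the idea of smart element tracking more concrete, here is a minimal sketch, using only the Python standard library, of how a saved element "fingerprint" could be matched against candidates after a redesign. This is not Scrapling's actual algorithm or storage format: the `fingerprint`, `walk`, `similarity`, and `relocate` helpers, the sample pages, and the use of `difflib` for scoring are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
import xml.etree.ElementTree as ET


def fingerprint(element, path):
    """Collect a few properties that describe an element (hypothetical format)."""
    return {
        "tag": element.tag,
        "attrs": " ".join(f"{k}={v}" for k, v in sorted(element.attrib.items())),
        "text": (element.text or "").strip(),
        "path": "/".join(path),
    }


def walk(element, path=()):
    """Yield (element, tag-path) pairs for every node in the tree."""
    here = path + (element.tag,)
    yield element, here
    for child in element:
        yield from walk(child, here)


def similarity(a, b):
    """Average string similarity over all fingerprint fields, from 0.0 to 1.0."""
    scores = [SequenceMatcher(None, a[key], b[key]).ratio() for key in a]
    return sum(scores) / len(scores)


def relocate(saved, root):
    """Return the element in `root` whose fingerprint best matches `saved`."""
    return max(walk(root), key=lambda pair: similarity(saved, fingerprint(*pair)))[0]


old_page = ET.fromstring(
    '<html><body><div class="product" id="p1">Laptop</div></body></html>'
)
new_page = ET.fromstring(
    '<html><body><section><p class="intro">Welcome</p>'
    '<article class="product-card" id="p1">Laptop</article></section></body></html>'
)

# Save the fingerprint while the original selector still works...
target, target_path = next(p for p in walk(old_page) if p[0].get("class") == "product")
saved = fingerprint(target, target_path)

# ...then look for the closest match after the redesign.
match = relocate(saved, new_page)
print(match.tag, match.attrib)
```

In this toy page the `<article>` element is chosen even though its tag, class, and position all changed, because its text and most of its attributes still resemble the saved fingerprint.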

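High Performance

Lightning Fast: Built from the ground up with performance in mind, outperforming most popular Python scraping libraries.
Memory Efficient: Optimized data structures for a minimal memory footprint.
Fast JSON Serialization: 10x faster than the standard library.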

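Developer Friendly

Powerful Navigation API: Easy DOM traversal in all directions.
Rich Text Processing: All strings have built-in regex, cleaning methods, and more. All element attributes are optimized dictionaries with added methods that consume less memory than standard dictionaries.
Auto Selector Generation: Generate robust short and full CSS/XPath selectors for any element.
Familiar API: Similar to Scrapy/BeautifulSoup, with the same pseudo-elements used in Scrapy.
Type Hints: Complete type and docstring coverage for future-proofing and the best autocompletion support.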
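
As an illustration of what "strings with built-in regex and cleaning methods" can look like, the following sketch defines a small `str` subclass. The class name `RichText` and its methods `clean`, `find_all_matches`, and `first_match` are hypothetical and are not Scrapling's own text class or method names; the point is only to show the general pattern of attaching helpers to string results.

```python
import re


class RichText(str):
    """A str subclass with helper methods (hypothetical, for illustration only)."""

    def clean(self):
        """Collapse runs of whitespace and strip both ends."""
        return RichText(" ".join(self.split()))

    def find_all_matches(self, pattern):
        """Return every match of `pattern` as a list of RichText."""
        return [RichText(m.group(0)) for m in re.finditer(pattern, self)]

    def first_match(self, pattern, default=None):
        """Return the first match of `pattern`, or `default` if there is none."""
        found = re.search(pattern, self)
        return RichText(found.group(0)) if found else default


price = RichText("  Price:\n   $19.99   (was $24.50) ")
print(price.clean())
print(price.find_all_matches(r"\$\d+\.\d{2}"))
print(price.first_match(r"\d+ items", default="n/a"))
```

Because every helper returns another `RichText` (or a list of them), calls can be chained the same way ordinary `str` methods can.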

Getting Started


from scrapling.fetchers import Fetcher

# Do HTTP GET request to a web page and create an Adaptor instance
page = Fetcher.get('https://quotes.toscrape.com/', stealthy_headers=True)
# Get all text content from all HTML tags in the page except the `script` and `style` tags
page.get_all_text(ignore_tags=('script', 'style'))

# Get all quotes elements; any of these methods will return a list of strings directly (TextHandlers)
quotes = page.css('.quote .text::text')  # CSS selector
quotes = page.xpath('//span[@class="text"]/text()')  # XPath
quotes = page.css('.quote').css('.text::text')  # Chained selectors
quotes = [element.text for element in page.css('.quote .text')]  # Slower than bulk query above

# Get the first quote element
quote = page.css_first('.quote')  # same as page.css('.quote').first or page.css('.quote')[0]

# Tired of selectors? Use find_all/find
# Get all 'div' HTML tags that one of its 'class' values is 'quote'
quotes = page.find_all('div', {'class': 'quote'})
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote')  # and so on...

# Working with elements
quote.html_content  # Get the Inner HTML of this element
quote.prettify()  # Prettified version of Inner HTML above
quote.attrib  # Get that element's attributes
quote.path  # DOM path to element (List of all ancestors from <html> tag till the element itself)

To keep it simple, all methods can be chained on top of each other!

Note

Check out the full documentation for more details.

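Parsing Performance

Scrapling isn't just powerful, it is also blazing fast. Scrapling implements many best practices, design patterns, and numerous optimizations to save fractions of a second, all while focusing exclusively on parsing HTML documents. Here are benchmarks comparing Scrapling to popular Python libraries in two tests.

Text Extraction Speed Test (5000 nested elements).

This test consists of extracting the text content of 5000 nested div elements.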

#	Library	Time (ms)	vs Scrapling
1	Scrapling	5.44	1.0x
2	Parsel/Scrapy	5.53	1.017x
3	Raw lxml	6.76	1.243x
4	PyQuery	21.96	4.037x
5	Selectolax	67.12	12.338x
6	BS4 with lxml	1307.03	240.263x
7	MechanicalSoup	1322.64	243.132x
8	BS4 with html5lib	3373.75	620.175x

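As you can see, Scrapling is on par with Scrapy and slightly faster than lxml, the library both of them are built on top of. These are the closest results to Scrapling. PyQuery is also built on top of lxml, but Scrapling is still 4 times faster.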
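
For readers who want a feel for the shape of this test, here is a rough sketch that builds a document with 5000 nested div elements and times text extraction over several runs. It is not the project's benchmark script and will not reproduce the figures in the table: the standard library's `html.parser` stands in for the libraries compared above, and the names `build_nested_html`, `TextCollector`, and `extract_texts` are made up for this example.

```python
from html.parser import HTMLParser
import timeit

DEPTH = 5000  # number of nested <div> elements, as in the test described above


def build_nested_html(depth):
    """Return an HTML string with `depth` nested divs, each holding some text."""
    opening = "".join(f"<div>text {i}" for i in range(depth))
    return f"<html><body>{opening}{'</div>' * depth}</body></html>"


class TextCollector(HTMLParser):
    """Collect the non-empty text nodes found inside <div> tags."""

    def __init__(self):
        super().__init__()
        self.open_divs = 0
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.open_divs += 1

    def handle_endtag(self, tag):
        if tag == "div":
            self.open_divs -= 1

    def handle_data(self, data):
        if self.open_divs and data.strip():
            self.texts.append(data.strip())


def extract_texts(html):
    """Parse `html` and return the text of every div, in document order."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.texts


html = build_nested_html(DEPTH)
runs = 10  # the published figures average 100 runs; fewer keeps this sketch quick
average_ms = timeit.timeit(lambda: extract_texts(html), number=runs) / runs * 1000
print(f"extracted {len(extract_texts(html))} texts, {average_ms:.2f} ms per run")
```

Swapping `extract_texts` for an equivalent function built on another parser gives a like-for-like comparison, provided every candidate parses the same `html` string and returns the same list of texts.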

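Extraction By Text Speed Test

Scrapling can find elements by their text content and find elements similar to them. The only other known library with both features is AutoScraper.

So we compared the two to see how fast Scrapling is at these two tasks relative to AutoScraper.

Here are the results: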

Library	Time (ms)	vs Scrapling
Scrapling	2.51	1.0x
AutoScraper	11.41	4.546x

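Scrapling can find elements with more methods, and it returns full element Adaptor objects rather than only text, as AutoScraper does. So, to make this test fair, both libraries extract an element by its text, find similar elements, and then extract the text content of all of them.

As you can see, Scrapling is still 4.5 times faster at the same task.

If Scrapling extracted only the elements, without stopping to extract each element's text, it would be faster still, but as noted above, extracting the text keeps the comparison fair.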
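
The task measured here (find an element by its text, find similar elements, then read their text) can be sketched with the standard library alone. The code below is only an illustration of the task, not how Scrapling or AutoScraper implement it; `index_paths`, `find_by_text`, `find_similar`, the sample page, and the rule that "similar" means "same tag path and same attribute names" are assumptions chosen for this example.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <ul class="products">
    <li class="item"><span class="name">Laptop</span></li>
    <li class="item"><span class="name">Phone</span></li>
    <li class="item"><span class="name">Tablet</span></li>
  </ul>
  <p class="name">Footer note</p>
</body></html>
"""


def index_paths(root):
    """Map every element to the tuple of tags leading from the root to it."""
    paths = {root: (root.tag,)}
    for parent in root.iter():  # document order, so parents come before children
        for child in parent:
            paths[child] = paths[parent] + (child.tag,)
    return paths


def find_by_text(root, text):
    """Return the first element whose own text equals `text`, else None."""
    for element in root.iter():
        if (element.text or "").strip() == text:
            return element
    return None


def find_similar(root, element):
    """Return other elements with the same tag path and attribute names."""
    paths = index_paths(root)
    return [
        other
        for other in root.iter()
        if other is not element
        and paths[other] == paths[element]
        and set(other.attrib) == set(element.attrib)
    ]


root = ET.fromstring(PAGE)
first = find_by_text(root, "Laptop")
texts = [first.text] + [e.text for e in find_similar(root, first)]
print(texts)
```

The footer paragraph shares the `name` class but sits at a different tag path, so it is not treated as similar to the product names.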

All benchmark results are averages of 100 runs. See our benchmarks for the methodology and to run your own comparisons.

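Installation

Scrapling is easy to get started with. Starting from version 0.2.9, at least Python 3.9 is required.

pip3 install scrapling

Then run this command to install the browser dependencies needed to use the fetcher classes:

scrapling install

If you run into any installation problems, please open an issue.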
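
Before opening an issue, it can help to confirm the two prerequisites mentioned above. The following is a small sketch that checks the interpreter version and whether the package can be found; the helper names `python_is_supported` and `scrapling_is_installed` are made up for this example, and the check assumes the importable module is named `scrapling`, as in the import statements shown earlier.

```python
import importlib.util
import sys

MINIMUM_PYTHON = (3, 9)  # minimum version stated above for Scrapling 0.2.9+


def python_is_supported(version_info=sys.version_info):
    """Return True when the interpreter meets the minimum Python version."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


def scrapling_is_installed():
    """Return True when the `scrapling` package can be found by the import system."""
    return importlib.util.find_spec("scrapling") is not None


print("Python version OK:", python_is_supported())
print("scrapling importable:", scrapling_is_installed())
```

This only checks that the package is discoverable; it does not verify that the browser dependencies were installed.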

Contributing

Everybody is invited and welcome to contribute to Scrapling. There is a lot to do!

Please read the contributing file before doing anything.

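Disclaimer for Scrapling Project

Caution

This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international laws regarding data scraping and privacy. The authors and contributors are not responsible for any misuse of this software. This library should not be used to violate the rights of others, for unethical purposes, or to use data in an unauthorized or illegal manner. Do not use it on any website unless you have permission from the website owner or are within their allowed rules, such as the robots.txt file.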
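
As one way to check a site's robots.txt rules before fetching a URL, the sketch below uses Python's standard `urllib.robotparser`. The rules, the `my-scraper` user agent, and the `is_allowed` helper are examples invented for illustration; a real check would read the target site's own robots.txt, and passing this check is not a substitute for permission from the site owner or for legal compliance.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real check would read the target site's own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True when `robots_txt` permits `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/products"))
print(is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```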

License

This work is licensed under BSD-3.

Acknowledgments

This project includes code adapted from:

Parsel (BSD License): used for the translator submodule

Thanks and References

Daijro's brilliant work on BrowserForge and Camoufox
Vinyzu's work on Playwright's mock on Botright
brotector
fakebrowser
rebrowser-patches

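Known Issues

In the auto-matching save process, only the unique properties of the first element in the selection results get saved. If the selector you are using selects different elements in different locations on the page, auto-matching will probably return only the first element when you relocate it later. This does not apply to combined CSS selectors (for example, several selectors joined with commas), because these get separated and each selector is executed on its own.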
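
To illustrate what "separated and executed on its own" means for a combined selector, here is a minimal sketch of splitting a selector list on its top-level commas. The `split_selector_group` helper is hypothetical and is not Scrapling's code; it ignores commas inside parentheses, brackets, and quoted strings, and it does not handle escaped quotes.

```python
def split_selector_group(selector):
    """Split a CSS selector list on top-level commas (hypothetical helper)."""
    parts, current, depth, quote = [], [], 0, None
    for char in selector:
        if quote:
            # Inside a quoted string: only the matching quote ends it.
            if char == quote:
                quote = None
        elif char in "\"'":
            quote = char
        elif char in "([":
            depth += 1
        elif char in ")]":
            depth -= 1
        elif char == "," and depth == 0:
            # A top-level comma closes the current selector.
            parts.append("".join(current).strip())
            current = []
            continue
        current.append(char)
    parts.append("".join(current).strip())
    return [part for part in parts if part]


for single in split_selector_group(".product, .item"):
    # Each part would be run, and its first match saved, independently.
    print(single)
```

Under this reading, `'.product, .item'` behaves like two separate queries, so the first element of each can be saved, whereas a single selector that matches elements in several places only has its first match saved.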

Designed and crafted with ❤️ by Karim Shoair.
