pywebcopy

v7.0.1

    ____       _       __     __    ______                     _____   
   / __ \__  _| |     / /__  / /_  / ____/___  ____  __  __   /__  /   
  / /_/ / / / / | /| / / _ \/ __ \/ /   / __ \/ __ \/ / / /     / /    
 / ____/ /_/ /| |/ |/ /  __/ /_/ / /___/ /_/ / /_/ / /_/ /     / /     
/_/    \__, / |__/|__/\___/_.___/\____/\____/ .___/\__, /     /_/      
      /____/                               /_/    /____/

Created By: Raja Tomar
License: Apache License 2.0
Email: [email protected]

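PyWebCopy is a free tool for copying all or part of a website to your hard disk for offline viewing.

PyWebCopy scans the specified website and downloads its content to your hard disk. Links to resources such as style sheets, images, and other pages of the site are automatically remapped to match the local paths. Its extensive configuration lets you define which parts of a website are copied and how.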
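To make the remapping idea concrete, here is a minimal sketch using only the standard library. The helper names `url_to_local_path` and `relative_link`, and the `my_site/<host>/...` folder layout, are assumptions made for illustration; they are not pywebcopy's actual functions or on-disk layout.

```python
import posixpath
from urllib.parse import urlsplit


def url_to_local_path(url, project_root="my_site"):
    """Map a URL to a relative file path under project_root (illustrative only)."""
    parts = urlsplit(url)
    path = parts.path
    # Treat directory-style URLs ("" or ".../") as index pages.
    if not path or path.endswith("/"):
        path += "index.html"
    # Use the host as the first folder so files from different hosts do not collide.
    return posixpath.join(project_root, parts.netloc, path.lstrip("/"))


def relative_link(page_url, resource_url):
    """Return the link a saved page would need to reach a saved resource."""
    page_dir = posixpath.dirname(url_to_local_path(page_url))
    return posixpath.relpath(url_to_local_path(resource_url), start=page_dir)


print(url_to_local_path("https://httpbin.org/"))
print(relative_link("https://httpbin.org/docs/page.html",
                    "https://httpbin.org/static/style.css"))
```

The second call shows the key point: once both files live on disk, the absolute URL in the page is replaced by a path relative to the page's own folder.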

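What can PyWebCopy do?

PyWebCopy examines the HTML markup of a website and attempts to discover all linked resources, such as other pages, images, videos, and file downloads. It downloads all of these resources and keeps searching for more. In this way, PyWebCopy can "crawl" an entire website and download everything it sees, in an effort to create a reasonable facsimile of the source website.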
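The discovery step can be sketched with the standard library's `html.parser`. The `LinkCollector` class and `discover_links` function below are hypothetical names, and looking only at `href` and `src` attributes is a simplification; pywebcopy's own parser may work differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attributes that commonly point at other resources (not an exhaustive list).
LINK_ATTRS = {"href", "src"}


class LinkCollector(HTMLParser):
    """Collect absolute URLs of resources referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in LINK_ATTRS and value:
                # Resolve relative references against the page URL.
                self.links.append(urljoin(self.base_url, value))


def discover_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


sample = """
<html><head><link rel="stylesheet" href="/static/style.css"></head>
<body><a href="about.html">About</a><img src="img/logo.png"></body></html>
"""
for link in discover_links(sample, "https://httpbin.org/"):
    print(link)
```

A crawler would feed each newly discovered page back into the same function until no unseen links remain.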

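What can PyWebCopy not do?

PyWebCopy does not include a virtual DOM or any form of JavaScript parsing. If a website relies heavily on JavaScript, PyWebCopy is unlikely to produce a true copy, because links that are generated dynamically by JavaScript cannot be discovered.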
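The limitation is easy to reproduce with any static HTML parser. In the sketch below (standard library only, with a made-up page and a hypothetical `AnchorCollector` class), the link that the script would create in a browser never shows up, because the script is never executed.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect href values of <a> tags found in static markup."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(v for name, v in attrs if name == "href" and v)


page = """
<a href="/static.html">Static link</a>
<script>
  var a = document.createElement("a");
  a.href = "/generated-" + 42 + ".html";
  document.body.appendChild(a);
</script>
"""

collector = AnchorCollector()
collector.feed(page)
# Only the link present in the markup is found; the generated one is invisible.
print(collector.hrefs)
```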

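PyWebCopy does not download the raw source code of a website; it can only download what the HTTP server returns. While it does its best to create an offline copy of a website, advanced data-driven websites may not work as expected once they have been copied.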

Installation

pywebcopy is available on PyPI and can be installed easily with pip:

$ pip install pywebcopy

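You are ready to go. Read the tutorial below to get started.

First steps

You should always check that the latest pywebcopy was installed successfully.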

>>> import pywebcopy
>>> pywebcopy.__version__
7.x.x

Your version may differ. You can now continue with the tutorial.
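The same check can be scripted without importing the package, using the standard library's `importlib.metadata` (Python 3.8+). The helper name `installed_version` is hypothetical; the sketch assumes the distribution is registered under the name `pywebcopy`, matching the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if the distribution is missing."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


found = installed_version("pywebcopy")
print(found if found else "pywebcopy is not installed")
```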

Basic usage

To save a single page, just type the following in a Python console:

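from pywebcopy import save_webpage

save_webpage(
    url="https://httpbin.org/",
    project_folder="E://savedpages//",
    project_name="my_site",
    bypass_robots=True,      # bypass the robots.txt restrictions
    debug=True,
    open_in_browser=True,    # open the saved page when the task finishes
    delay=None,              # delay between consecutive requests
    threaded=False,          # use threads for faster downloading
)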
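For context on what `bypass_robots=True` skips: a site's robots.txt lists paths that crawlers are asked not to fetch. The sketch below evaluates such rules with the standard library's `urllib.robotparser`. The rules and URLs are made up, `is_allowed` is a hypothetical helper, and this is not a description of how pywebcopy performs the check internally.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://example.com/index.html"))
print(is_allowed("https://example.com/private/data.html"))
```

Leaving the bypass off is the more considerate default when copying sites you do not own.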

To save a complete website (this could overload the target server, so be careful):

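from pywebcopy import save_website

save_website(
    url="https://httpbin.org/",
    project_folder="E://savedpages//",
    project_name="my_site",
    bypass_robots=True,      # bypass the robots.txt restrictions
    debug=True,
    open_in_browser=True,    # open the saved site when the task finishes
    delay=None,              # delay between consecutive requests
    threaded=False,          # use threads for faster downloading
)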
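The `delay` argument is the main guard against overloading a server. As a rough illustration of the idea (not pywebcopy's implementation), the hypothetical `Throttle` class below enforces a minimum gap between consecutive requests. The delay is assumed to be in seconds, and the requests themselves are only simulated.

```python
import time


class Throttle:
    """Enforce a minimum gap (in seconds) between consecutive calls to wait()."""

    def __init__(self, delay):
        self.delay = delay
        self._last = None

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()


throttle = Throttle(delay=0.2)
start = time.monotonic()
for url in ["https://httpbin.org/a", "https://httpbin.org/b", "https://httpbin.org/c"]:
    throttle.wait()
    # A real crawler would issue the request for `url` here.
elapsed = time.monotonic() - start
print(f"3 simulated requests took {elapsed:.2f}s")
```

The first call passes immediately; each later call sleeps just long enough to keep requests at least `delay` apart.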

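Running tests

Running the tests is simple and requires no external libraries. Just run this command from the root directory of the pywebcopy package.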

$ python -m pywebcopy --tests

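Command line interface

pywebcopy has an easy-to-use command-line interface that lets you get a task done without going the long way around through the Python API.

Getting the list of commands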
```
$ python -m pywebcopy --help
```

Using the CLI

 Usage: pywebcopy [-p|--page|-s|--site|-t|--tests] [--url=URL [,--location=LOCATION [,--name=NAME [,--pop [,--bypass_robots [,--quite [,--delay=DELAY]]]]]]]

Python library to clone/archive pages or sites from the Internet.

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  --url=URL             url of the entry point to be retrieved.
  --location=LOCATION   Location where files are to be stored.
  -n NAME, --name=NAME  Project name of this run.
  -d DELAY, --delay=DELAY
                        Delay between consecutive requests to the server.
  --bypass_robots       Bypass the robots.txt restrictions.
  --threaded            Use threads for faster downloading.
  -q, --quite           Suppress the logging from this library.
  --pop                 open the html page in default browser window after
                        finishing the task.

  CLI Actions List:
    Primary actions available through cli.

    -p, --page          Quickly saves a single page.
    -s, --site          Saves the complete site.
    -t, --tests         Runs tests for this library.
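If the CLI needs to be driven from a script, the options listed above can be assembled into an argument list. The sketch below only builds and prints the command and uses nothing beyond the flags shown in the help text. `build_page_command` is a hypothetical helper, and the location `savedpages` is a placeholder.

```python
import shlex
import sys


def build_page_command(url, location, name, delay=None, bypass_robots=False):
    """Build an argument list that saves a single page via `python -m pywebcopy`."""
    command = [
        sys.executable, "-m", "pywebcopy", "--page",
        f"--url={url}",
        f"--location={location}",
        f"--name={name}",
    ]
    if delay is not None:
        command.append(f"--delay={delay}")
    if bypass_robots:
        command.append("--bypass_robots")
    return command


command = build_page_command("https://httpbin.org/", "savedpages", "my_site", delay=1)
print(shlex.join(command))
# To actually run it: import subprocess; subprocess.run(command, check=True)
```

Passing a list rather than a single string avoids shell quoting problems with URLs that contain characters such as `&` or `?`.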

Running tests
```
  $ python -m pywebcopy --tests
```

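Authentication and cookies

Most of the time, authentication is needed to access a certain page. Authenticating with pywebcopy is easy because it uses a requests.Session object for its basic HTTP activity, which can be accessed through the WebPage.session attribute. There are plenty of tutorials on setting up authentication with requests.Session.

Here is an example that fills in a form: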

from pywebcopy.configs import get_config

config = get_config('http://httpbin.org/')
wp = config.create_page()
wp.get(config['project_url'])
# Take the first form on the page and fill in its fields.
form = wp.get_forms()[0]
form.inputs['email'].value = 'bar'  # etc
form.inputs['password'].value = 'baz'  # etc
wp.submit_form(form)
wp.get_links()
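When a site uses HTTP basic authentication or a known session cookie instead of a login form, the credentials can be set on the session directly. The helper below is a hypothetical sketch that relies only on two standard requests.Session features: the `auth` attribute and the `cookies` jar with its `update` method. The credential values and cookie name are placeholders.

```python
def apply_credentials(session, username, password, cookies=None):
    """Attach HTTP basic-auth credentials and optional cookies to a session.

    `session` is expected to behave like requests.Session, i.e. expose an
    `auth` attribute and a `cookies` mapping with an `update` method.
    """
    session.auth = (username, password)
    if cookies:
        session.cookies.update(cookies)
    return session


# Hypothetical usage with the page object from the example above:
# apply_credentials(wp.session, "user", "secret", cookies={"sessionid": "abc123"})
```

Requests made through the session afterwards would then carry these credentials.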