markout

1.0.0

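A small Python package I made to extract HTML content from web pages. It is highly customizable, and I built it to fit my own needs: extracting code from multiple pages into Markdown, but only for the HTML tags I need. Because its purpose is to convert specific HTML tags into whatever Markdown format you want, the script does not produce any standard output format. Instead it uses the custom tokens specified in the configuration file, so the output can be formatted as anything.

Usage

Importing into your code

To use this package, install it with pip: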

pip install markout-html

Then simply import it into your code:

from markout_html import *

After that, you can use the extract_url and extract_html functions:

result = extract_url(
    # HTML page link
    'http://example.page.com/blog/some_post.html',

    # Tokens used to format the contents of each HTML tag (you can extract only the ones you want)
    {
        'p': "\n**{}**"
    },

    # Only extract contents inside this tag
    'article'
)

result = extract_html(
    # HTML code string
    '<html>some html code</html>',

    # Tokens used to format the contents of each HTML tag (you can extract only the ones you want)
    {
        'p': "\n**{}**"
    },

    # Only extract contents inside this tag
    'article'
)
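
To make the token idea concrete, here is a minimal, self-contained sketch of token-based extraction built on the standard library's html.parser. It is not the markout_html implementation: the names TokenExtractor and sketch_extract are hypothetical, and the sketch assumes that each token is a str.format template with a single {} placeholder and that token tags are not nested inside each other.

```python
from html.parser import HTMLParser


class TokenExtractor(HTMLParser):
    """Hypothetical illustration: format the text of selected tags with templates."""

    def __init__(self, tokens, only_on=None):
        super().__init__()
        self.tokens = tokens      # e.g. {"p": "\n**{}**"}
        self.only_on = only_on    # only extract inside this tag, if given
        self.depth = 0            # nesting depth inside the `only_on` tag
        self.current = None       # tag whose text is currently being collected
        self.buffer = []          # text pieces for the current tag
        self.output = []          # formatted results, in document order

    def handle_starttag(self, tag, attrs):
        if tag == self.only_on:
            self.depth += 1
        inside = self.only_on is None or self.depth > 0
        if inside and self.current is None and tag in self.tokens:
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self.current:
            text = "".join(self.buffer).strip()
            self.output.append(self.tokens[tag].format(text))
            self.current = None
        if tag == self.only_on and self.depth > 0:
            self.depth -= 1


def sketch_extract(html, tokens, only_on=None):
    """Return the formatted text of every token tag found in `html`."""
    parser = TokenExtractor(tokens, only_on)
    parser.feed(html)
    parser.close()
    return "".join(parser.output)


sample = "<html><p>skipped</p><article><p>Hello</p></article></html>"
print(sketch_extract(sample, {"p": "\n**{}**"}, "article"))
```

The first paragraph in the sample is ignored because it sits outside the article tag, which mirrors the role of the third argument in the calls above.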

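Using the CLI command

If you do not want to write a Python script, the examples below describe how to use this package from the command line.

If you only want to extract from a string in the terminal, you can use markout_html --extract [string].

You can run the markout_html command with the --help flag for more information.

Configuration

All configuration lives in a single file, .markoutrc.json (you can specify another name in the terminal with the --config flag). If you do not load a configuration file, the script uses its default values. There is an example configuration in the repository root.

To specify a different configuration file, use: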

markout_html --config [filename]
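
The fallback to default values can be sketched as follows. This is an illustration only: the function load_config and the values in DEFAULTS are assumptions made for this example, not the package's real defaults or loading code.

```python
import json
from pathlib import Path

# Hypothetical defaults, chosen only for this sketch.
DEFAULTS = {"links": {}, "only_on": "body", "tokens": {"p": "\n{}"}}


def load_config(filename=".markoutrc.json"):
    """Return the defaults, overridden by whatever the config file provides."""
    config = dict(DEFAULTS)
    path = Path(filename)
    if path.is_file():
        with path.open(encoding="utf-8") as handle:
            config.update(json.load(handle))
    return config


# With no such file present, the defaults are returned unchanged.
config = load_config("does_not_exist.json")
print(config["only_on"])
```

Merging at the top level keeps the sketch short; a key such as tokens in the file replaces the default tokens object entirely rather than being merged into it.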

Configuration file values

links - An object of links to extract, where each link has a destination value (the output file). Example:

{
  "links": {
    "http://example.page.com/blog/some_post.html": "out/post.md",
    "http://example.page.com/blog/some_other_post.html": "out/other_post.md"
  }
}

The example above fetches the HTML from http://example.page.com/blog/some_post.html and writes the extracted result to out/post.md.
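
As a rough sketch of how a links mapping could be processed, the loop below walks each URL and destination pair and writes the extracted text to the destination, creating parent directories as needed. The helper process_links and its extract parameter are hypothetical. In practice the callable would wrap something like extract_url; a stand-in function is used here so the example runs without network access.

```python
import tempfile
from pathlib import Path


def process_links(links, extract, base_dir="."):
    """Write extract(url) to each destination path and return the paths written."""
    written = []
    for url, destination in links.items():
        target = Path(base_dir) / destination
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(extract(url), encoding="utf-8")
        written.append(target)
    return written


def fake_extract(url):
    # Stand-in for a real extractor; it does not fetch anything.
    return f"# Extracted from {url}\n"


links = {"http://example.page.com/blog/some_post.html": "out/post.md"}

# Write into a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    for path in process_links(links, fake_extract, base_dir=tmp):
        print(path.name, "->", path.read_text(encoding="utf-8").strip())
```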

only_on - A string specifying which HTML tag to extract content from (for example: html, body, main). Example:

{
  "only_on": "article"
}

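tokens - An object that maps each HTML tag to a format string. The contents of every listed tag are inserted into its string and the result is written to the output file. Example: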

{
  "tokens": {
    "header": "# {}",
    "h1": "\n# {}",
    "h2": "\n# {}",
    "b": "\n## {}",
    "li": "+ {}",
    "i": "**{}**",
    "p": "\n{}",
    "span": "{}"
  }
}
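
To see what such templates produce, the sketch below applies a few of them with Python's str.format. It assumes, as the {} placeholder suggests but the description does not state outright, that each token is a format string with a single positional placeholder. The helper render is hypothetical.

```python
tokens = {
    "h1": "\n# {}",
    "li": "+ {}",
    "i": "**{}**",
}


def render(tag, text, tokens):
    """Format `text` with the template for `tag`; return '' if the tag has no token."""
    template = tokens.get(tag)
    return template.format(text) if template is not None else ""


print(render("h1", "Title", tokens))
print(render("li", "first item", tokens))
print(repr(render("div", "ignored", tokens)))
```

Tags without a token yield an empty string here, which matches the idea that only the tags you list are extracted.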