
markout


1.0.0


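A small Python package I made to extract HTML content from web pages. It is highly customizable, and I built it to fit my own needs: extracting the code of multiple pages to Markdown, but only for the HTML tags I need. Because its purpose is to convert specific HTML tags into whatever Markdown format you want, the script does not generate any standard output format. Instead, it uses custom tokens specified in the configuration file, so the output can be formatted as anything.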

Usage

Importing into your code

To use this package, you need to install it with pip:

pip install markout-html

Then just import it into your code:

from markout_html import *

After that, you can use the extract_url and extract_html functions:

result = extract_url(
  # HTML page link
  'http://example.page.com/blog/some_post.html',

  # Tokens used to format the contents of each HTML tag (you can extract only the ones you want)
  {
    'p': "\n**{}**"
  },

  # Only extract contents inside this tag
  'article'
)

result = extract_html(
  # HTML code string
  '<html>some html code</html>',

  # Tokens used to format the contents of each HTML tag (you can extract only the ones you want)
  {
    'p': "\n**{}**"
  },

  # Only extract contents inside this tag
  'article'
)

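Using the CLI command

If you would rather not write a Python script, the examples below give a fuller description of how to use this package's command.

If you only want to run an extraction on a string in the terminal, you can use markout_html --extract [string].

You can run the markout_html command with the --help flag to get more information.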

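Configuration

All configuration lives in a single file: .markoutrc.json (you can specify another name in the terminal with the --config flag). If you do not load a configuration file, the script uses its default values. There is an example configuration in the repository root.

To specify a different configuration file, use: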

markout_html --config [filename]
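
The fallback behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the package's own loader: the load_config helper and the contents of DEFAULTS are hypothetical, and only the .markoutrc.json filename and the links, only_on, and tokens keys come from this README.

```python
import json
from pathlib import Path

# Hypothetical defaults, for illustration only; the package's real defaults may differ.
DEFAULTS = {"links": {}, "only_on": "body", "tokens": {"p": "\n{}"}}


def load_config(path=".markoutrc.json"):
    """Return the settings from `path`, falling back to DEFAULTS for missing keys."""
    config = dict(DEFAULTS)
    file = Path(path)
    if file.is_file():
        # Values found in the file override the defaults key by key.
        with file.open(encoding="utf-8") as handle:
            config.update(json.load(handle))
    return config
```

If the file does not exist, load_config() simply returns a copy of DEFAULTS, which mirrors the "use default values when no configuration file is loaded" rule.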

Configuration file values

links - An object listing the links to extract; each link has a destination value (the output file). Example:

{
  "links": {
    "http://example.page.com/blog/some_post.html": "out/post.md",
    "http://example.page.com/blog/some_other_post.html": "out/other_post.md"
  }
}

The example above fetches the HTML from http://example.page.com/blog/some_post.html and writes the extracted result to out/post.md.
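
The link-to-file mapping can be pictured as a simple loop. The sketch below is not the package's code: process_links is a hypothetical helper, and the extraction step is passed in as a callable so the example runs without network access.

```python
from pathlib import Path


def process_links(links, extract):
    """Write extract(url) to each destination path.

    `links` maps a URL to an output file, as in the configuration example.
    `extract` is any callable that takes a URL and returns text.
    """
    written = []
    for url, destination in links.items():
        target = Path(destination)
        # Create the output directory (for example out/) if it does not exist yet.
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(extract(url), encoding="utf-8")
        written.append(target)
    return written
```

With the package installed, one could presumably pass something like lambda url: extract_url(url, tokens, 'article') as the extract argument, assuming extract_url returns the formatted text as a string.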

only_on - A string specifying where (which HTML tag) to extract content from (e.g. html, body, main). Example:

{
  "only_on": "article"
}
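
To make the effect of only_on concrete, here is a minimal sketch built on the standard library's html.parser. It only illustrates the idea of restricting extraction to one tag; the ScopedText class and text_inside function are hypothetical names, and the package may implement this differently.

```python
from html.parser import HTMLParser


class ScopedText(HTMLParser):
    """Collect text that appears inside a given tag and ignore everything else."""

    def __init__(self, only_on):
        super().__init__()
        self.only_on = only_on
        self.depth = 0  # number of `only_on` tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.only_on:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.only_on and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep non-empty text only while we are inside the chosen tag.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def text_inside(html, only_on):
    parser = ScopedText(only_on)
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For a page with a nav, an article, and a footer, text_inside(page, "article") returns only the text found within the article element.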

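tokens - An object mapping each HTML tag you want to extract to a format string; the tag's content is inserted into that string and the result is placed in the output file. Example: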

{
  "tokens": {
    "header": "# {}",
    "h1": "\n# {}",
    "h2": "\n# {}",
    "b": "\n## {}",
    "li": "+ {}",
    "i": "**{}**",
    "p": "\n{}",
    "span": "{}"
  }
}
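
The way a token turns tag content into output can be sketched with Python's str.format. This is an assumption about the mechanism, based on the {} placeholder in the templates: the render helper is hypothetical, and how the package joins the formatted pieces is not documented here, so the sketch simply concatenates them.

```python
# Token templates modelled on the example configuration (a subset).
tokens = {"h1": "\n# {}", "p": "\n{}", "b": "\n## {}"}

# (tag, text) pairs standing in for content already pulled out of a page.
extracted = [("h1", "Title"), ("p", "Intro text."), ("div", "ignored"), ("b", "Note")]


def render(pairs, tokens):
    """Format each pair with its tag's token; tags without a token are skipped."""
    return "".join(tokens[tag].format(text) for tag, text in pairs if tag in tokens)


markdown = render(extracted, tokens)
```

Because div has no entry in tokens, its text is left out of the result, which matches the idea that you extract only the tags you list.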