summarizer

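Article Summarizer

This project implements a custom algorithm to extract the most important sentences and keywords from Spanish and English news articles.

It is written entirely in Python and was inspired by similar projects seen on Reddit news subreddits that use term frequency–inverse document frequency (tf–idf).

The 3 most important files are:

scraper.py: A Python script that performs web scraping on a given HTML source; it extracts the article title, date and body.
summary.py: A Python script that applies a custom algorithm to a string of text and extracts the top-ranked sentences and words.
bot.py: A Reddit bot that checks a subreddit for its latest submissions. It keeps a log of already processed submissions to avoid duplicates.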

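Requirements

This project uses the following Python libraries:

spaCy: Used to tokenize the article into sentences and words.
PRAW: Makes the use of the Reddit API very easy.
Requests: Used to perform HTTP GET requests to the article URLs.
BeautifulSoup: Used for extracting the article text.
html5lib: This parser gave better compatibility when used with BeautifulSoup.
tldextract: Used to extract the domain from a URL.
wordcloud: Used to create word clouds from the article text.

After installing the spaCy library you must install a language model to be able to tokenize the article.

For Spanish you can run this one:

python -m spacy download es_core_news_sm

For other languages please check the following link: https://spacy.io/usage/models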

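Reddit Bot

The bot is simple in nature. It uses the PRAW library, which is very straightforward to use. The bot polls a subreddit every 10 minutes to get its latest submissions.

It first checks that the submission hasn't already been processed and then checks whether the submission URL is in the whitelist. This whitelist is currently curated by myself.

If the post and its URL pass both checks, a web scraping process is applied to the URL. This is where things start getting interesting.

Before replying to the original submission, the bot checks the percentage of reduction achieved. If it is too low or too high, it skips the submission and moves on to the next one.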
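The three checks described above can be sketched with the standard library alone. This is only an illustration: the whitelist entries, the reduction thresholds and the function names are assumptions (the README does not state them), and `urllib.parse` stands in for the tldextract library that the project actually uses. For brevity the checks are folded into one function, although the bot runs the reduction check after scraping.

```python
from urllib.parse import urlparse

# Hypothetical values for illustration; the README does not state the real ones.
WHITELIST = {"example.com", "news.example.org"}
MIN_REDUCTION = 20.0
MAX_REDUCTION = 90.0


def get_domain(url):
    """Return the host of a URL without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def reduction_percentage(original_text, summary_text):
    """Percentage by which the summary is shorter than the original."""
    if not original_text:
        return 0.0
    return 100.0 * (1 - len(summary_text) / len(original_text))


def should_reply(submission_id, url, processed_ids, original_text, summary_text):
    """Apply the three checks: not processed, whitelisted, sensible reduction."""
    if submission_id in processed_ids:
        return False
    if get_domain(url) not in WHITELIST:
        return False
    reduction = reduction_percentage(original_text, summary_text)
    return MIN_REDUCTION <= reduction <= MAX_REDUCTION
```

A summary that is nearly as long as the article, or one that shrinks it to almost nothing, is rejected by the last check.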

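Web Scraper

Currently in the whitelist there are already more than 300 websites of news articles and blogs. Creating specialized web scrapers for each one is simply not feasible.

The second best thing is to make the scraper as accurate as possible.

We start the web scraper in the usual way, with the Requests and BeautifulSoup libraries.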

import requests
from bs4 import BeautifulSoup

with requests.get(article_url) as response:

    # Requests defaults to ISO-8859-1 for text responses with no declared charset.
    if response.encoding == "ISO-8859-1":
        response.encoding = "utf-8"

    html_source = response.text

# Add a line break after the tags that usually close a block of text.
for item in ["</p>", "</blockquote>", "</div>", "</h2>", "</h3>"]:
    html_source = html_source.replace(item, item + "\n")

soup = BeautifulSoup(html_source, "html5lib")

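A few times I ran into encoding issues caused by an incorrect encoding guess. To avoid this issue I force Requests to decode with utf-8.

Now that we have our article parsed into a soup object, we will start by extracting the title and the published time.

I used a similar method to extract both values: I first check the most common tags and fall back to the next most common alternatives.

Not all websites expose their published date, so we sometimes end up with an empty string.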

from datetime import datetime

article_title = soup.find("title").text.replace("\n", " ").strip()

# If our title is too short we fall back to the first h1 tag.
if len(article_title) <= 5:
    article_title = soup.find("h1").text.replace("\n", " ").strip()

article_date = ""

# We look for the first meta tag that has the word 'time' in its property.
for item in soup.find_all("meta"):

    if "time" in item.get("property", ""):

        # Drop the '+hh:mm' offset or the 'Z' suffix before parsing.
        clean_date = item["content"].split("+")[0].replace("Z", "")

        # Use your preferred time formatting.
        article_date = "{:%d-%m-%Y a las %H:%M:%S}".format(
            datetime.fromisoformat(clean_date))
        break

# If we didn't find any meta tag with a datetime we look for a 'time' tag.
if len(article_date) <= 5:
    try:
        article_date = soup.find("time").text.strip()
    except AttributeError:
        # soup.find() returned None: the page has no 'time' tag.
        pass
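The date clean-up can be tried on its own, without fetching a page. The sketch below wraps the same two steps in a helper; the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def format_published_date(raw_value):
    """Mirror the clean-up above: drop the offset or 'Z', then reformat."""
    clean_date = raw_value.split("+")[0].replace("Z", "")
    return "{:%d-%m-%Y a las %H:%M:%S}".format(datetime.fromisoformat(clean_date))


print(format_published_date("2019-05-20T14:30:00+00:00"))  # 20-05-2019 a las 14:30:00
print(format_published_date("2019-05-20T14:30:00Z"))       # 20-05-2019 a las 14:30:00
```

Note that the offset is discarded rather than converted, so the formatted value keeps whatever local time the website published.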

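When extracting the text from different tags I often got strings without any separation between them. I implemented a little hack that adds a new line after each tag that usually contains text. This significantly improved the overall accuracy of the tokenizer.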
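The effect of this hack is easy to see on a small snippet. In the sketch below a crude regular expression stands in for BeautifulSoup's text extraction (an assumption made only to keep the example dependency-free), and the HTML string is invented.

```python
import re

CLOSING_TAGS = ["</p>", "</blockquote>", "</div>", "</h2>", "</h3>"]


def add_line_breaks(html_source):
    """Append a line break after each closing tag that usually ends a text block."""
    for item in CLOSING_TAGS:
        html_source = html_source.replace(item, item + "\n")
    return html_source


def strip_tags(html_source):
    """Crude tag stripper for illustration only; the project uses BeautifulSoup."""
    return re.sub(r"<[^>]+>", "", html_source)


html = "<h2>Headline</h2><p>First paragraph.</p><p>Second paragraph.</p>"

print(repr(strip_tags(html)))
# 'HeadlineFirst paragraph.Second paragraph.'
print(repr(strip_tags(add_line_breaks(html))))
# 'Headline\nFirst paragraph.\nSecond paragraph.\n'
```

Without the extra line breaks the headline and the paragraphs run together, which makes sentence splitting unreliable.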

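My original idea was to only accept websites that used the <article> tag. It worked fine for the first websites I tested, but I soon realized that very few websites use it, and those that do often don't use it correctly.

article = soup.find("article").text

When accessing the .text property of the <article> tag I noticed I was also getting JavaScript code. I backtracked a bit and removed all the tags that could add noise to the article text.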

# Remove the tags that only add noise to the article text.
for tag in soup.find_all(
        ["script", "img", "ul", "time", "h1", "h2", "h3", "iframe", "style", "form", "footer", "figcaption"]):
    tag.extract()

# These class names/ids are known to add noise or duplicate text to the article.
noisy_names = ["image", "img", "video", "subheadline",
               "hidden", "tract", "caption", "tweet", "expert"]

for tag in soup.find_all("div"):

    # Not every div has an id, so we default to an empty string.
    tag_id = (tag.get("id") or "").lower()

    for item in noisy_names:
        if item in tag_id:
            tag.extract()
            break

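The above code removed most captions, which usually repeat what's inside the article.

After that I applied a 3 step process to get the article text.

First I checked all <article> tags and grabbed the one with the longest text.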

article = ""

for article_tag in soup.find_all("article"):

    if len(article_tag.text) >= len(article):
        article = article_tag.text

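This works well for websites that use the <article> tag properly. The longest tag almost always contains the main article.

But this didn't quite work as expected: I noticed poor quality in the results, and sometimes I was getting excerpts of other articles.

That's when I decided to add a fallback: instead of only looking for the <article> tag, I look for <div> and <section> tags with commonly used ids.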

# These names commonly hold the article text.
common_names = ["artic", "summary", "cont", "note", "cuerpo", "body"]

# If the article is too short we look somewhere else.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        # Not every tag has an id, so we default to an empty string.
        tag_id = (tag.get("id") or "").lower()

        for item in common_names:
            if item in tag_id:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text

This improved the accuracy. I then repeated the code, but this time looking at the class attribute instead of the id attribute.

# The article is still too short, let's try one more time.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        # 'class' is a list of names; tags without one yield an empty string.
        tag_class = "".join(tag.get("class", [])).lower()

        for item in common_names:
            if item in tag_class:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text

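Using all the previous methods greatly increased the overall accuracy of the scraper. In some cases I used partial words that share the same letters in English and Spanish (artic -> article/articulo). The scraper is now compatible with all the URLs I tested.

We make a final check: if the article is still too short we abort the process and move on to the next URL, otherwise we move on to the summary algorithm.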
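The 3 step fallback and the final check can be summarized without BeautifulSoup by working on plain (name, text) pairs. This is a simplified sketch: the function names are invented, and reusing the 650 character threshold for the final check is an assumption, since the README does not say which limit that check uses.

```python
COMMON_NAMES = ["artic", "summary", "cont", "note", "cuerpo", "body"]
MIN_ARTICLE_LENGTH = 650  # Assumed to also apply to the final check.


def pick_longest(candidates, current=""):
    """Return the longest text among the candidates whose name matches."""
    for name, text in candidates:
        name = name.lower()
        if any(item in name for item in COMMON_NAMES) and len(text) >= len(current):
            current = text
    return current


def extract_article(article_texts, id_candidates, class_candidates):
    # Step 1: the longest <article> text.
    article = max(article_texts, key=len, default="")
    # Step 2: fall back to tags whose id matches a common name.
    if len(article) <= MIN_ARTICLE_LENGTH:
        article = pick_longest(id_candidates, article)
    # Step 3: fall back to tags whose class matches a common name.
    if len(article) <= MIN_ARTICLE_LENGTH:
        article = pick_longest(class_candidates, article)
    # Final check: give up on this URL if the text is still too short.
    return article if len(article) > MIN_ARTICLE_LENGTH else None
```

Returning None gives the caller a simple signal to skip the URL and move on to the next submission.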

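Summary Algorithm

This algorithm was designed to work primarily on Spanish written articles. It consists of several steps:

Reformat and clean the original article by removing all extra whitespace.
Make a copy of the original article and remove all commonly used words from it.
Split the copied article into words and score each word.
Split the original article into sentences and score each sentence using the scores from the words.
Take the top 5 sentences and top 5 words and return them in chronological order.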

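Before starting out we need to initialize the spaCy library.

import spacy

NLP = spacy.load("es_core_news_sm")

That line of code loads the Spanish model, which is the one I use. If you are using another language, please refer to the Requirements section to see how to install the appropriate model.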

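Clean the Article

When extracting the text from the article we usually get a lot of whitespace, mostly from line breaks (\n).

We split the text, strip the whitespace from each piece and join the pieces again. This is not strictly required, but it helps a lot while debugging the whole process.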
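A minimal sketch of this clean-up step follows. The function name is hypothetical, and splitting on line breaks, dropping empty pieces and rejoining with a single line break are assumptions about details the text above leaves open.

```python
def cleanup_article(article):
    """Split on line breaks, strip each piece, drop empty ones and rejoin."""
    pieces = [piece.strip() for piece in article.split("\n")]
    return "\n".join(piece for piece in pieces if piece)


raw_article = "  First paragraph.  \n\n\n   Second paragraph.\n \n"
print(repr(cleanup_article(raw_article)))
# 'First paragraph.\nSecond paragraph.'
```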

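Remove Common and Stop Words

At the top of the script we declare the paths of the stop words text files. These stop words are added to a set, which guarantees no duplicates.

I also added a list with some Spanish and English words that are not stop words but don't add anything substantial to the article. My personal preference was to hard-code them in lowercase form.

Then I added a copy of each word in uppercase and title form, which means the set will be up to 3 times its original size.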

# COMMON_WORDS is the set that already holds the hard-coded words described above.
with open(ES_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

with open(EN_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

# We collect the variants in a list first, since a set can't change size while we iterate over it.
extra_words = list()

for word in COMMON_WORDS:
    extra_words.append(word.title())
    extra_words.append(word.upper())

for word in extra_words:
    COMMON_WORDS.add(word)

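Scoring Words

Before starting to score our words we must first pass the cleaned article through the NLP pipeline. This is done with one line of code.

doc = NLP(cleaned_article)

This doc object contains several iterators; the 2 that we will use are tokens and sents (sentences).

At this point I added a personal touch to the algorithm. First I made a copy of the article and removed all common words from it.

Afterwards I used a collections.Counter object to do the initial scoring.

Then I applied a multiplier bonus to words that start with an uppercase letter and are 4 or more characters long. Most of the time those words are names of places, people or organizations.

Finally I set to zero the score of all words that are actually numbers.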

from collections import Counter

words_of_interest = [
    token.text for token in doc if token.text not in COMMON_WORDS]

scored_words = Counter(words_of_interest)

for word in scored_words:

    # Bonus for capitalized words of 4 or more characters.
    if word[0].isupper() and len(word) >= 4:
        scored_words[word] *= 3

    # Numbers don't count towards the score.
    if word.isdigit():
        scored_words[word] = 0
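To see the scoring rules at work without loading a spaCy model, here is a small worked example on a hand-made token list. The stop words and tokens are invented stand-ins for COMMON_WORDS and the tokens of the doc object.

```python
from collections import Counter

# Hypothetical stop words and tokens, standing in for COMMON_WORDS and the spaCy doc.
common_words = {"el", "de", "en", "a", "los"}
tokens = ["Madrid", "sube", "el", "precio", "de", "vivienda", "en",
          "Madrid", "a", "2019", "precio", "2019"]

words_of_interest = [token for token in tokens if token not in common_words]
scored_words = Counter(words_of_interest)

for word in scored_words:
    # Capitalized words of 4+ characters are likely proper nouns.
    if word[0].isupper() and len(word) >= 4:
        scored_words[word] *= 3
    # Plain numbers carry no weight.
    if word.isdigit():
        scored_words[word] = 0

print(scored_words.most_common(2))
# [('Madrid', 6), ('precio', 2)]
```

"Madrid" appears twice and gets the x3 bonus, while "2019" also appears twice but ends up with a score of zero.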

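Scoring Sentences

Now that we have the final scores for each word it is time to score each sentence from the article.

To do this we first need to split the article into sentences. I tried various approaches, including RegEx, but the one that worked best was the spaCy library.

We will iterate again over the doc object we defined in the previous step, but this time we will iterate over its sents attribute.

Something to note is that we build a list of sentence objects (spaCy spans) rather than plain strings; the sentence text can be retrieved from each one through its text attribute.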

article_sentences = [sent for sent in doc.sents]

scored_sentences = list()

for index, sent in enumerate(article_sentences):

    # In some edge cases we have duplicated sentences, we make sure that doesn't happen.
    if sent.text not in [text for _, _, text in scored_sentences]:
        scored_sentences.append(
            [score_line(sent, scored_words), index, sent.text])

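scored_sentences is a list of lists. Each inner list contains 3 values: the sentence score, its index and the sentence itself. Those values will be used in the next step.

The code below shows how the lines are scored.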

def score_line(line, scored_words):

    # We remove the common words.
    cleaned_line = [
        token.text for token in line if token.text not in COMMON_WORDS]

    # We now sum the scores of all the remaining words.
    temp_score = 0

    for word in cleaned_line:
        temp_score += scored_words[word]

    # We apply a bonus score to sentences that contain financial information.
    line_lowercase = line.text.lower()

    for word in FINANCIAL_WORDS:
        if word in line_lowercase:
            temp_score *= 1.5
            break

    return temp_score

We apply a multiplier to sentences that contain any word that refers to money or finance.
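The same scoring logic can be tried on plain strings. In this sketch a regular expression stands in for the spaCy tokens, and both word lists are short hypothetical samples; the project's real COMMON_WORDS and FINANCIAL_WORDS are not shown in this README.

```python
import re
from collections import Counter

# Hypothetical word lists; the project's real lists are larger.
COMMON_WORDS = {"the", "a", "of", "in", "by", "is"}
FINANCIAL_WORDS = ["$", "dollars", "pesos", "euros"]


def score_line(line, scored_words):
    """Score a plain-text sentence; a regex stands in for the spaCy tokens."""
    cleaned_line = [w for w in re.findall(r"\w+", line) if w not in COMMON_WORDS]
    temp_score = sum(scored_words[word] for word in cleaned_line)

    # Sentences that mention money get a 1.5x bonus.
    line_lowercase = line.lower()
    if any(word in line_lowercase for word in FINANCIAL_WORDS):
        temp_score *= 1.5

    return temp_score


scored_words = Counter({"Bank": 6, "rates": 2, "loan": 1})

print(score_line("The Bank raised rates.", scored_words))                # 8
print(score_line("The Bank lent 500 dollars in a loan.", scored_words))  # 10.5
```

Words that are missing from the Counter simply contribute zero, so no membership check is needed before the lookup.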

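Chronological Order

This is the final part of the algorithm. We make use of the sorted() function to get the top sentences and then reorder them into their original positions.

We sort scored_sentences in reverse order, which gives us the top-scored sentences first. We start a small counter variable so we can break the loop once it hits 5. We also discard all sentences shorter than 3 characters (sometimes there are sneaky zero-width characters).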

top_sentences = list()
counter = 0

for score, index, sentence in sorted(scored_sentences, reverse=True):

    if counter >= 5:
        break

    # When the article is too small the sentences may come empty.
    if len(sentence) >= 3:

        # We append the sentence and its index so we can sort in chronological order.
        top_sentences.append([index, sentence])
        counter += 1

return [sentence for index, sentence in sorted(top_sentences)]

Finally, we use a list comprehension to return only the sentences, which are already sorted in chronological order.
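Here is a small worked example of the two sorts. The helper name, the limit parameter and the sample entries are made up for illustration; the logic follows the snippet above.

```python
def top_sentences_in_order(scored_sentences, limit=5):
    """Keep the `limit` best sentences and return them in article order."""
    top_sentences = []

    # First sort: highest score first.
    for score, index, sentence in sorted(scored_sentences, reverse=True):
        if len(top_sentences) >= limit:
            break
        if len(sentence) >= 3:
            top_sentences.append([index, sentence])

    # Second sort: back to the original position in the article.
    return [sentence for index, sentence in sorted(top_sentences)]


# Hypothetical [score, index, text] entries.
scored_sentences = [
    [4.0, 0, "Opening sentence."],
    [9.0, 1, "Key finding."],
    [1.0, 2, "Minor detail."],
    [7.5, 3, "Closing remark."],
]

print(top_sentences_in_order(scored_sentences, limit=2))
# ['Key finding.', 'Closing remark.']
```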

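Word Cloud

Just for fun I added a word cloud to each article. To do so I used the wordcloud library. This library is very easy to use: you only need to declare a WordCloud object and call its generate method with a string of text as its parameter.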

import wordcloud

wc = wordcloud.WordCloud()  # See cloud.py for full parameters.
wc.generate(prepared_article)
wc.to_file("./temp.png")

After generating the image I upload it to Imgur, get back the URL link and add it to the Markdown message.

(Image: word cloud example)

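Conclusion

This was a very fun and interesting project to work on. I may have reinvented the wheel, but at least I learned a few cool things.

I'm satisfied with the overall quality of the results and I will keep tweaking the algorithm and applying compatibility enhancements.

As a side note, while testing the script I accidentally requested tweets, Facebook posts and English written articles. All of them got acceptable outputs, but since those sites were not the target I removed them from the whitelist.

After some weeks of feedback I decided to add support for the English language. This required a bit of refactoring.

To make it work with other languages you only need a text file containing all the stop words of that language and to copy a few lines of code (see the Remove Common and Stop Words section).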
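As a rough sketch of what those few lines could look like for one more language, the helper below loads a stop words file and adds the title and uppercase variants in one pass. The function name, the file name and the French sample words are all hypothetical; a matching spaCy model would also have to be installed and loaded as described in the Requirements section.

```python
import os
import tempfile


def load_stopwords(file_path, common_words):
    """Add each word from the file, plus its Title and UPPER forms, to the set."""
    with open(file_path, "r", encoding="utf-8") as temp_file:
        for word in temp_file.read().splitlines():
            if word:
                common_words.update({word, word.title(), word.upper()})
    return common_words


# Demo with a throwaway French stop words file (hypothetical content).
with tempfile.TemporaryDirectory() as temp_dir:
    fr_stopwords_file = os.path.join(temp_dir, "fr.txt")
    with open(fr_stopwords_file, "w", encoding="utf-8") as demo_file:
        demo_file.write("le\nla\nles\n")

    common_words = load_stopwords(fr_stopwords_file, set())

print(sorted(common_words))
# ['LA', 'LE', 'LES', 'La', 'Le', 'Les', 'la', 'le', 'les']
```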
