summarizer

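Article Summarizer

This project implements a custom algorithm to extract the most important sentences and keywords from Spanish and English news articles.

It is fully developed in Python and is inspired by similar projects seen on Reddit news subreddits that use term frequency-inverse document frequency (tf–idf).

The 3 most important files are:

scraper.py: A Python script that performs web scraping on a given HTML source; it extracts the article title, date and body.
summary.py: A Python script that applies a custom algorithm to a string of text and extracts the top-ranked sentences and words.
bot.py: A Reddit bot that checks a subreddit for its latest submissions. It manages a list of already processed submissions to avoid duplicates.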

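Requirements

This project uses the following Python libraries:

spaCy: Used to tokenize the article into sentences and words.
PRAW: Makes the use of the Reddit API very easy.
Requests: Used to perform HTTP GET requests to the article URLs.
BeautifulSoup: Used for extracting the article text.
html5lib: This parser gave better compatibility when used with BeautifulSoup.
tldextract: Used to extract the domain from a URL.
wordcloud: Used to create word clouds with the article text.

After installing the spaCy library you must install a language model to be able to tokenize the article.

For Spanish you can run this one: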

python -m spacy download es_core_news_sm

For other languages please check the following link: https://spacy.io/usage/models

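Reddit Bot

The bot is simple in nature. It uses the PRAW library, which is very straightforward to use. The bot polls a subreddit every 10 minutes to get its latest submissions.

It first checks that the submission hasn't already been processed and then checks whether the submission URL is in the whitelist. This whitelist is currently curated by myself.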
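The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the bot's actual code: the log file format, the `WHITELIST` contents and every function name are assumptions, and `urllib.parse` stands in for tldextract so the example needs only the standard library.

```python
from urllib.parse import urlparse

# Hypothetical whitelist; the real one is curated by the author.
WHITELIST = {"example.com", "example.org"}


def load_processed(path):
    """Return the set of submission ids already handled (empty if no log yet)."""
    try:
        with open(path, "r", encoding="utf-8") as log_file:
            return set(log_file.read().split())
    except FileNotFoundError:
        return set()


def mark_processed(path, submission_id):
    """Append a submission id to the log so it is skipped next time."""
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(submission_id + "\n")


def is_whitelisted(url):
    """Check the URL host, and each of its parent domains, against the whitelist."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return any(".".join(parts[i:]) in WHITELIST for i in range(len(parts)))


def should_process(submission_id, url, processed):
    """A submission is handled only if it is new and its domain is allowed."""
    return submission_id not in processed and is_whitelisted(url)
```

Keeping the processed ids in a set makes the duplicate check cheap, and appending to the log right after replying means a restart of the bot does not reprocess old submissions.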

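If the post and its URL pass both checks, a web scraping process is applied to the URL. This is where things start getting interesting.

Before replying to the original submission, the bot checks the percentage of reduction achieved. If it is too low or too high, it skips the submission and moves on to the next one.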
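One way to express the reduction check is shown below. It is only a sketch: the function names are made up for illustration, the reduction is measured in characters, and the `low` and `high` thresholds are placeholders rather than the bot's real settings.

```python
def reduction_percentage(original_text, summary_text):
    """Percentage by which the summary is shorter than the original text."""
    if not original_text:
        return 0.0
    return 100 - (len(summary_text) / len(original_text)) * 100


def is_acceptable(reduction, low=20.0, high=90.0):
    # low and high are placeholder thresholds, not the bot's real values.
    return low <= reduction <= high
```

A very low reduction suggests the summary is almost the whole article, while a very high one usually means the scraper found too little text to summarize.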

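Web Scraper

Currently the whitelist already contains more than 300 different news and blog websites. Creating a specialized web scraper for each one is simply not feasible.

The second best thing to do is to make the scraper as accurate as possible.

We start the web scraper the usual way, with the Requests and BeautifulSoup libraries.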

import requests
from bs4 import BeautifulSoup

with requests.get(article_url) as response:

    # ISO-8859-1 is the fallback guess when no charset is declared.
    if response.encoding == "ISO-8859-1":
        response.encoding = "utf-8"

    html_source = response.text

# Add a line break after the closing tags that usually hold text.
for item in ["</p>", "</blockquote>", "</div>", "</h2>", "</h3>"]:
    html_source = html_source.replace(item, item + "\n")

soup = BeautifulSoup(html_source, "html5lib")

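A few times I ran into encoding problems caused by an incorrect encoding guess. To avoid this issue I force Requests to decode with utf-8.

Now that the article has been parsed into a soup object, we start by extracting the title and the published time.

I used a similar method to extract both values: I first check the most common tags and fall back to the next most common alternatives.

Not all websites expose their published date, so we sometimes end up with an empty string.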

from datetime import datetime

article_title = soup.find("title").text.replace("\n", " ").strip()

# If our title is too short we fall back to the first h1 tag.
if len(article_title) <= 5:
    article_title = soup.find("h1").text.replace("\n", " ").strip()

article_date = ""

# We look for the first meta tag that has the word 'time' in it.
for item in soup.find_all("meta"):

    if "time" in item.get("property", ""):

        clean_date = item["content"].split("+")[0].replace("Z", "")

        # Use your preferred time formatting.
        article_date = "{:%d-%m-%Y a las %H:%M:%S}".format(
            datetime.fromisoformat(clean_date))
        break

# If we didn't find any meta tag with a datetime we look for a 'time' tag.
if len(article_date) <= 5:
    try:
        article_date = soup.find("time").text.strip()
    except AttributeError:
        # soup.find() returned None: the page has no <time> tag.
        pass

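When extracting the text from different tags I often got strings with no separation between them. I implemented a small hack that adds a new line after each tag that usually contains text. This significantly improved the overall accuracy of the tokenizer.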
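To see why the hack matters, here is a small standard-library demonstration. The `strip_tags` helper is a crude regex stand-in for BeautifulSoup's `.text` and is only meant for this example, and the sample HTML is made up.

```python
import re


def strip_tags(html_source):
    # Crude stand-in for BeautifulSoup's .text, good enough for a demo only.
    return re.sub(r"<[^>]+>", "", html_source)


def add_line_breaks(html_source):
    # Same idea as the scraper: append a line break to text-holding closing tags.
    for item in ["</p>", "</blockquote>", "</div>", "</h2>", "</h3>"]:
        html_source = html_source.replace(item, item + "\n")
    return html_source


html_source = "<div><h2>Headline</h2><p>First sentence.</p><p>Second sentence.</p></div>"

without_hack = strip_tags(html_source)
with_hack = strip_tags(add_line_breaks(html_source))
```

Without the line breaks the headline and both paragraphs are glued together into a single run of text, which makes sentence detection much harder. With them, each block ends up on its own line.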

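My original idea was to accept only websites that use the <article> tag. It worked fine for the first websites I tested, but I soon realized that very few websites use it, and those that do often use it incorrectly.

article = soup.find("article").text

When accessing the .text property of the <article> tag I noticed I was also getting JavaScript code. I backtracked a bit and removed all the tags that could add noise to the article text.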

# Remove tags that only add noise to the article text.
for tag in soup.find_all(
        ["script", "img", "ul", "time", "h1", "h2", "h3", "iframe", "style", "form", "footer", "figcaption"]):
    tag.extract()


# These class names/ids are known to add noise or duplicate text to the article.
noisy_names = ["image", "img", "video", "subheadline",
               "hidden", "tract", "caption", "tweet", "expert"]

for tag in soup.find_all("div"):

    # Not every div has an id, so we default to an empty string.
    tag_id = tag.get("id", "").lower()

    for item in noisy_names:
        if item in tag_id:
            tag.extract()
            break

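The above code removes most captions, which usually repeat what is already in the article.

After that I applied a 3-step process to get the article text.

First I checked all <article> tags and grabbed the one with the longest text.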

article = ""

for article_tag in soup.find_all("article"):

    if len(article_tag.text) >= len(article):
        article = article_tag.text

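That worked fine for websites that use the <article> tag properly. The longest tag almost always contains the main article.

But this didn't quite work as expected. I noticed poor quality in the results, and sometimes I was getting excerpts from other articles.

That's when I decided to add a fallback: instead of looking only for the <article> tag, I also look for <div> and <section> tags with commonly used ids.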

# These names commonly hold the article text.
common_names = ["artic", "summary", "cont", "note", "cuerpo", "body"]

# If the article is too short we look somewhere else.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        # Tags without an id are treated as having an empty one.
        tag_id = tag.get("id", "").lower()

        for item in common_names:
            if item in tag_id:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text

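This improved the accuracy quite a bit. I then repeated the code, but this time looking at the class attribute instead of the id attribute.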

# The article is still too short, let's try one more time.
if len(article) <= 650:

    for tag in soup.find_all(["div", "section"]):

        # BeautifulSoup returns class as a list; tags without one get an empty list.
        tag_class = "".join(tag.get("class", [])).lower()

        for item in common_names:
            if item in tag_class:
                # We guarantee to get the longest div.
                if len(tag.text) >= len(article):
                    article = tag.text

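Using all the previous methods greatly increased the overall accuracy of the scraper. In some cases I used partial words that share the same letters in English and Spanish (artic -> article/articulo). The scraper is now compatible with all the URLs I tested.

We make a final check: if the article is still too short we abort the process and move on to the next URL, otherwise we move on to the summary algorithm.

Summary Algorithm

This algorithm was designed to work primarily on articles written in Spanish. It consists of several steps:

Reformat and clean the original article by removing excess whitespace.
Make a copy of the original article and remove all common words from it.
Split the copied article into words and score each word.
Split the original article into sentences and score each sentence using the scores from the words.
Take the top 5 sentences and top 5 words and return them in chronological order.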
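The steps above can be sketched end to end with the standard library alone. This is a simplified illustration and not the project's implementation: regular expressions replace spaCy for word and sentence splitting, `COMMON_WORDS` is a tiny placeholder set, the bonus multipliers are left out, and the `summarize` name is an assumption.

```python
import re
from collections import Counter

# Tiny placeholder stop word set; the project loads full lists from text files.
COMMON_WORDS = {"the", "a", "of", "in", "and", "is", "to", "it", "on"}


def summarize(article, top_n=5):
    # Step 1: collapse line breaks and repeated whitespace.
    cleaned = " ".join(article.split())

    # Steps 2 and 3: drop common words, then count the remaining ones.
    words = [w for w in re.findall(r"\w+", cleaned)
             if w.lower() not in COMMON_WORDS]
    scored_words = Counter(words)

    # Step 4: naive sentence split (the project uses spaCy instead).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", cleaned) if s]
    scored = []
    for index, sentence in enumerate(sentences):
        # Common words are absent from the Counter, so they contribute 0.
        score = sum(scored_words[w] for w in re.findall(r"\w+", sentence))
        scored.append((score, index, sentence))

    # Step 5: keep the best sentences, then restore their original order.
    best = sorted(scored, reverse=True)[:top_n]
    top_sentences = [s for _, _, s in sorted(best, key=lambda item: item[1])]
    top_words = [w for w, _ in scored_words.most_common(top_n)]
    return top_sentences, top_words
```

Carrying the sentence index alongside the score is what makes the last step possible: the ranking picks the sentences, and the index puts them back in reading order.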

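Before starting we need to initialize the spaCy library.

NLP = spacy.load("es_core_news_sm")

That line of code loads the Spanish model, which is the one I use. If you are using another language, please refer to the Requirements section to see how to install the appropriate model.

Clean the Article

When extracting the text from the article we usually get a lot of whitespace, mostly from line breaks (\n).

We split the text on that character, strip all whitespace from each piece and join the pieces again. This is not strictly required, but it helps a lot while debugging the whole process.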
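A minimal version of that cleaning step could look like this. The function name, the choice to drop empty lines and the single-space separator are assumptions made for the example.

```python
def clean_article(article):
    """Split on line breaks, strip each piece and join the non-empty ones."""
    lines = [line.strip() for line in article.split("\n")]
    return " ".join(line for line in lines if line)
```

For example, `clean_article("  Title \n\n\n  First paragraph.  \n\t\nSecond.")` returns `"Title First paragraph. Second."`.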

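Remove Common and Stop Words

At the top of the script we declare the paths of the stop word text files. These stop words are added to a set, which guarantees there are no duplicates.

I also added a list with some Spanish and English words that are not stop words but don't add anything substantial to the article. My personal preference was to hard-code them in lowercase form.

Then I added a copy of each word in uppercase and title form. This means the set ends up roughly 3 times its original size.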

with open(ES_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

with open(EN_STOPWORDS_FILE, "r", encoding="utf-8") as temp_file:
    for word in temp_file.read().splitlines():
        COMMON_WORDS.add(word)

extra_words = list()

for word in COMMON_WORDS:
    extra_words.append(word.title())
    extra_words.append(word.upper())

for word in extra_words:
    COMMON_WORDS.add(word)

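Scoring Words

Before tokenizing our words we must first pass the cleaned article into the NLP pipeline. This is done with one line of code.

doc = NLP(cleaned_article)

This doc object contains several iterators. The 2 we will use are tokens and sents (sentences).

At this point I added a personal touch to the algorithm. First I made a copy of the article and removed all common words from it.

Afterwards I used a collections.Counter object to do the initial scoring.

Then I applied a multiplier bonus to words that start with an uppercase letter and are 4 or more characters long. Most of the time those words are names of places, people or organizations.

Finally I set to zero the score of all words that are actually numbers.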

from collections import Counter

words_of_interest = [
    token.text for token in doc if token.text not in COMMON_WORDS]

scored_words = Counter(words_of_interest)

for word in scored_words:

    # Capitalized words of 4+ characters are most likely proper nouns.
    if word[0].isupper() and len(word) >= 4:
        scored_words[word] *= 3

    # Plain numbers carry no meaning on their own.
    if word.isdigit():
        scored_words[word] = 0
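A worked example may help show what the scores look like. Here a plain list of strings stands in for the token texts spaCy would produce, and both the `COMMON_WORDS` contents and the `score_words` name are placeholders for illustration.

```python
from collections import Counter

# Placeholder stop words, including the Title-case variant of "el".
COMMON_WORDS = {"el", "El", "de", "la", "en"}

# Stand-in for the token texts that spaCy would produce.
tokens = ["El", "Banco", "de", "México", "subió", "la", "tasa", "en", "2023",
          "Banco", "tasa"]


def score_words(tokens):
    scored_words = Counter(t for t in tokens if t not in COMMON_WORDS)
    for word in scored_words:
        if word[0].isupper() and len(word) >= 4:
            scored_words[word] *= 3
        if word.isdigit():
            scored_words[word] = 0
    return scored_words


scores = score_words(tokens)
# Banco: 2 occurrences * 3 = 6, México: 1 * 3 = 3, tasa: 2, subió: 1, 2023: 0
```

The proper nouns end up well ahead of ordinary words that appear just as often, which is the intended effect of the bonus.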

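Scoring Sentences

Now that we have the final score for each word, it is time to score each sentence of the article.

To do this we first need to split the article into sentences. I tried various approaches, including RegEx, but the one that worked best was the spaCy library.

We iterate again over the doc object defined in the previous step, but this time over its sents property.

Something to note is that we build a list of sentence objects made up of tokens. We can retrieve the text of each sentence by accessing its text property.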

article_sentences = [sent for sent in doc.sents]

scored_sentences = list()

for index, sent in enumerate(article_sentences):

    # In some edge cases we have duplicated sentences, we make sure that doesn't happen.
    if sent.text not in [text for _, _, text in scored_sentences]:
        scored_sentences.append(
            [score_line(sent, scored_words), index, sent.text])

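scored_sentences is a list of lists. Each inner list contains 3 values: the sentence score, its index and the sentence itself. Those values are used in the next step.

The code below shows how the lines are scored.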

def score_line(line, scored_words):

    # We remove the common words.
    cleaned_line = [
        token.text for token in line if token.text not in COMMON_WORDS]

    # We now sum the total number of occurrences for all words.
    temp_score = 0

    for word in cleaned_line:
        temp_score += scored_words[word]

    # We apply a bonus score to sentences that contain financial information.
    line_lowercase = line.text.lower()

    for word in FINANCIAL_WORDS:
        if word in line_lowercase:
            temp_score *= 1.5
            break

    return temp_score

We apply a multiplier to sentences that contain any word that refers to money or finance.
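The financial bonus can be isolated into a small helper to make its behavior easier to see. The contents of `FINANCIAL_WORDS` are not shown in this document, so the list below is purely hypothetical, as is the `apply_financial_bonus` name.

```python
# Hypothetical list; the project's actual FINANCIAL_WORDS is not shown here.
FINANCIAL_WORDS = ["$", "€", "peso", "dólar", "dollar", "million", "millones"]


def apply_financial_bonus(sentence_text, base_score, multiplier=1.5):
    """Multiply the score once if any financial term appears in the sentence."""
    lowered = sentence_text.lower()
    if any(word in lowered for word in FINANCIAL_WORDS):
        return base_score * multiplier
    return base_score
```

Note that this is a substring match on the lowercased sentence, so a short entry can also match inside a longer unrelated word. The bonus is applied at most once per sentence, no matter how many financial terms it contains.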

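Chronological Order

This is the final part of the algorithm. We use the sorted() function to get the top sentences and then reorder them into their original positions.

We sort scored_sentences in reverse order, which gives us the top-scored sentences first. We start a small counter variable so the loop breaks once it reaches 5. We also discard all sentences shorter than 3 characters (sometimes sneaky zero-width characters slip through).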

top_sentences = list()
counter = 0

for score, index, sentence in sorted(scored_sentences, reverse=True):

    if counter >= 5:
        break

    # When the article is too small the sentences may come empty.
    if len(sentence) >= 3:

        # We append the sentence and its index so we can sort in chronological order.
        top_sentences.append([index, sentence])
        counter += 1

return [sentence for index, sentence in sorted(top_sentences)]

Finally, we use a list comprehension to return only the sentences, which are already sorted in chronological order.
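To make the two sorting passes concrete, here is the same logic wrapped in a function and run on made-up data. The `top_in_order` name, the `limit` parameter and the sample scores are assumptions for this illustration.

```python
# Made-up [score, index, text] entries in the same shape as scored_sentences.
scored_sentences = [
    [4.0, 0, "Intro sentence."],
    [9.0, 1, "Key finding."],
    [1.0, 2, "Aside."],
    [7.5, 3, "Second key point."],
    [0.0, 4, ""],
]


def top_in_order(scored_sentences, limit=5):
    best = []
    # First pass: highest score first.
    for score, index, sentence in sorted(scored_sentences, reverse=True):
        if len(best) >= limit:
            break
        if len(sentence) >= 3:
            best.append([index, sentence])
    # Second pass: sorting by index restores the reading order.
    return [sentence for index, sentence in sorted(best)]


top_two = top_in_order(scored_sentences, limit=2)
```

With a limit of 2 the scores 9.0 and 7.5 win, and because their indices are 1 and 3 the result keeps "Key finding." ahead of "Second key point.".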

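Word Cloud

Just for fun I added a word cloud to each article. To do so I used the wordcloud library. This library is very easy to use: you only need to declare a WordCloud object and call its generate method with a string of text as its parameter.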

wc = wordcloud.WordCloud()  # See cloud.py for full parameters.
wc.generate(prepared_article)
wc.to_file("./temp.png")

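After generating the image I upload it to Imgur, get back the URL link and add it to the Markdown message.

Word cloud example

Conclusion

This was a very fun and interesting project to work on. I may have reinvented the wheel, but at least I learned a few cool things.

I'm satisfied with the overall quality of the results, and I will keep tweaking the algorithm and applying compatibility enhancements.

As a side note, while testing the script I accidentally requested tweets, Facebook posts and articles written in English. All of them got acceptable outputs, but since those sites were not the target I removed them from the whitelist.

After a couple of weeks of feedback I decided to add support for the English language. This required a bit of refactoring.

To make it work with other languages you only need a text file containing all the stop words of that language and to copy a few lines of code (see the Remove Common and Stop Words section).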
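Instead of copying the loading code for every new language, the same idea could be written once as a helper. This is a sketch under assumptions: the `load_stopwords` name and any file names passed to it (for example a hypothetical `fr_stopwords.txt`) are not part of the project.

```python
def load_stopwords(paths):
    """Build one set from several stop word files (one word per line)."""
    words = set()
    for path in paths:
        with open(path, "r", encoding="utf-8") as stopwords_file:
            for word in stopwords_file.read().splitlines():
                word = word.strip()
                if word:
                    # Keep the lowercase, Title and UPPER variants together.
                    words.update({word, word.title(), word.upper()})
    return words
```

Adding a language then amounts to passing one more file path, plus loading the matching spaCy model for tokenization.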
