
KeyphraseVectorizers

This package was developed during the writing of our PatternRank paper. You can check out the paper here. When using KeyphraseVectorizers or PatternRank in academic papers and theses, please use the BibTeX entry below.

A set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix. A document-keyphrase matrix is a mathematical matrix that describes the frequency of keyphrases that occur in a collection of documents. The matrix rows indicate the text documents and the columns indicate the unique keyphrases.
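
As a rough illustration of this definition, the sketch below builds a small document-keyphrase matrix by hand. The keyphrase lists are made-up sample data, and the code is plain Python rather than the package's own implementation:

```python
from collections import Counter

# Hypothetical, already-extracted keyphrases per document (illustrative data only).
doc_keyphrases = [
    ["supervised learning", "training data", "supervised learning"],
    ["keywords", "information retrieval", "keywords", "keywords"],
]

# Columns: the unique keyphrases across the whole collection, in sorted order.
vocabulary = sorted({phrase for phrases in doc_keyphrases for phrase in phrases})

# Rows: one per document; each cell counts how often that keyphrase occurs in it.
matrix = []
for phrases in doc_keyphrases:
    counts = Counter(phrases)
    matrix.append([counts[phrase] for phrase in vocabulary])

print(vocabulary)
# ['information retrieval', 'keywords', 'supervised learning', 'training data']
print(matrix)
# [[0, 0, 2, 1], [1, 3, 0, 0]]
```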

The package contains wrappers of the sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfVectorizer classes. Instead of using n-gram tokens of a pre-defined range, these classes extract keyphrases from text documents using part-of-speech tags to compute document-keyphrase matrices.

Corresponding Medium posts can be found here and here.

Benefits

Extract grammatically accurate keyphrases based on their part-of-speech tags
No need to specify n-gram ranges
Get document-keyphrase matrices
Multiple language support
User-defined part-of-speech patterns for keyphrase extraction possible

Table of Contents

How does it work?
Installation
Usage
1. KeyphraseCountVectorizer
  1. English language
  2. Other languages
2. KeyphraseTfidfVectorizer
3. Reuse a spaCy Language object
4. Custom POS-tagger
5. PatternRank: Keyphrase extraction with KeyphraseVectorizers and KeyBERT
6. Topic modeling with BERTopic and KeyphraseVectorizers
7. Online KeyphraseVectorizers
Citation information

How does it work?

First, the document texts are annotated with spaCy part-of-speech tags. A list of all possible spaCy part-of-speech tags for different languages is linked here. The annotation requires passing the spaCy pipeline of the corresponding language to the vectorizer with the spacy_pipeline parameter.

Second, words are extracted from the document texts whose part-of-speech tags match the regex pattern defined in the pos_pattern parameter. The keyphrases are a list of unique words extracted from the text documents by this method.

Finally, the vectorizers calculate document-keyphrase matrices.
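
The second step can be sketched in plain Python. This is only a simplified stand-in for the pattern matching, not the package's implementation: it assumes the tokens are already tagged, hard-codes a regex equivalent of the default pattern <J.*>*<N.*>+, and the names extract_keyphrases and TAG_PATTERN are made up for this example.

```python
import re

# Simplified regex equivalent of the default pos_pattern '<J.*>*<N.*>+':
# zero or more adjective tags followed by one or more noun tags.
TAG_PATTERN = re.compile(r"(?:<J[^<>]*>)*(?:<N[^<>]*>)+")


def extract_keyphrases(tagged_tokens):
    """Return unique lowercase phrases whose tag sequence matches TAG_PATTERN."""
    # Encode the tag sequence as one string such as "<DT><JJ><NN>" and
    # remember which token starts at which character offset.
    tag_string = ""
    token_at_offset = {}
    for index, (_, tag) in enumerate(tagged_tokens):
        token_at_offset[len(tag_string)] = index
        tag_string += f"<{tag}>"

    keyphrases = []
    for match in TAG_PATTERN.finditer(tag_string):
        first = token_at_offset[match.start()]
        length = match.group().count("<")  # number of tokens in the match
        words = [word.lower() for word, _ in tagged_tokens[first:first + length]]
        phrase = " ".join(words)
        if phrase not in keyphrases:
            keyphrases.append(phrase)
    return keyphrases


# Hand-tagged sample sentence (tags follow the Penn Treebank style used for English).
tagged = [
    ("A", "DT"), ("supervised", "JJ"), ("learning", "NN"), ("algorithm", "NN"),
    ("analyzes", "VBZ"), ("the", "DT"), ("training", "NN"), ("data", "NNS"),
]

print(extract_keyphrases(tagged))
# ['supervised learning algorithm', 'training data']
```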

Installation

 pip install keyphrase-vectorizers

Usage

For detailed information visit the API Guide.

KeyphraseCountVectorizer

Back to Table of Contents

English language

from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = [ """Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).""" , 
             
        """Keywords are defined as phrases that capture the main topics discussed in a document. 
        As they offer a brief yet precise summary of document content, they can be utilized for various applications. 
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list 
        of keywords can quickly help to determine whether a given document is relevant to their interest. 
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups 
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively 
        in information retrieval.""" ]
        
# Init default vectorizer.
vectorizer = KeyphraseCountVectorizer()

# Print parameters
print(vectorizer.get_params())
>>> {'binary': False, 'dtype': <class 'numpy.int64'>, 'lowercase': True, 'max_df': None, 'min_df': None, 'pos_pattern': '<J.*>*<N.*>+', 'spacy_exclude': ['parser', 'attribute_ruler', 'lemmatizer', 'ner'], 'spacy_pipeline': 'en_core_web_sm', 'stop_words': 'english', 'workers': 1}

By default, the vectorizer is initialized for the English language. That means an English spacy_pipeline is specified, English stop_words are removed, and the pos_pattern extracts keywords that have 0 or more adjectives followed by 1 or more nouns, using the English spaCy part-of-speech tags. In addition, the spaCy pipeline components ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] are excluded by default to increase efficiency. If you choose a different spacy_pipeline, you may have to exclude or include different pipeline components using the spacy_exclude parameter for the spaCy POS tagger to work properly.

# After initializing the vectorizer, it can be fitted
# to learn the keyphrases from the text documents.
vectorizer.fit(docs)

# After learning the keyphrases, they can be returned.
keyphrases = vectorizer.get_feature_names_out()

print(keyphrases)
>>> ['users' 'main topics' 'learning algorithm' 'overlap' 'documents' 'output'
 'keywords' 'precise summary' 'new examples' 'training data' 'input'
 'document content' 'training examples' 'unseen instances'
 'optimal scenario' 'document' 'task' 'supervised learning algorithm'
 'example' 'interest' 'function' 'example input' 'various applications'
 'unseen situations' 'phrases' 'indication' 'inductive bias'
 'supervisory signal' 'document relevance' 'information retrieval' 'set'
 'input object' 'groups' 'output value' 'list' 'learning' 'output pairs'
 'pair' 'class labels' 'supervised learning' 'machine'
 'information retrieval environment' 'algorithm' 'vector' 'way' ]

# After fitting, the vectorizer can transform the documents
# to a document-keyphrase matrix.
# Matrix rows indicate the documents and columns indicate the unique keyphrases.
# Each cell represents the count.
document_keyphrase_matrix = vectorizer.transform(docs).toarray()

print(document_keyphrase_matrix)
>>> [[0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
  1 1 1 3 1 0 3 1 1 ]
 [ 1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
  0 0 0 0 0 1 0 0 0 ]]

# Fit and transform can also be executed in one step,
# which is more efficient.
document_keyphrase_matrix = vectorizer.fit_transform(docs).toarray()

print(document_keyphrase_matrix)
>>> [[0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
  1 1 1 3 1 0 3 1 1 ]
 [ 1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
  0 0 0 0 0 1 0 0 0 ]]

Other languages

Back to Table of Contents

german_docs = ["""Goethe stammte aus einer angesehenen bürgerlichen Familie.
                Sein Großvater mütterlicherseits war als Stadtschultheiß höchster Justizbeamter der Stadt Frankfurt, 
                sein Vater Doktor der Rechte und Kaiserlicher Rat. Er und seine Schwester Cornelia erfuhren eine aufwendige 
                Ausbildung durch Hauslehrer. Dem Wunsch seines Vaters folgend, studierte Goethe in Leipzig und Straßburg 
                Rechtswissenschaft und war danach als Advokat in Wetzlar und Frankfurt tätig. 
                Gleichzeitig folgte er seiner Neigung zur Dichtkunst.""" ,
              
               """Friedrich Schiller wurde als zweites Kind des Offiziers, Wundarztes und Leiters der Hofgärtnerei in 
               Marbach am Neckar Johann Kaspar Schiller und dessen Ehefrau Elisabetha Dorothea Schiller, geb. Kodweiß, 
               die Tochter eines Wirtes und Bäckers war, 1759 in Marbach am Neckar geboren
               """ ]
# Init vectorizer for the German language
vectorizer = KeyphraseCountVectorizer(spacy_pipeline='de_core_news_sm', pos_pattern='<ADJ.*>*<N.*>+', stop_words='german')

Here, the German spacy_pipeline is specified and German stop_words are removed. Because the German spaCy part-of-speech tags differ from the English ones, the pos_pattern parameter is customized as well. The regex pattern <ADJ.*>*<N.*>+ extracts keywords that have 0 or more adjectives followed by 1 or more nouns, using the German spaCy part-of-speech tags.

Attention! The spaCy pipeline components ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] are excluded by default to increase efficiency. If you choose a different spacy_pipeline, you may have to exclude or include different pipeline components using the spacy_exclude parameter for the spaCy POS tagger to work properly.

KeyphraseTfidfVectorizer

Back to Table of Contents

The KeyphraseTfidfVectorizer has the same function calls and features as the KeyphraseCountVectorizer. The only difference is that the cells of the document-keyphrase matrix hold tf or tf-idf values, depending on the parameter settings, instead of counts.

from keyphrase_vectorizers import KeyphraseTfidfVectorizer

docs = [ """Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).""" , 
             
        """Keywords are defined as phrases that capture the main topics discussed in a document. 
        As they offer a brief yet precise summary of document content, they can be utilized for various applications. 
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list 
        of keywords can quickly help to determine whether a given document is relevant to their interest. 
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups 
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively 
        in information retrieval.""" ]
        
# Init default vectorizer for the English language that computes tf-idf values
vectorizer = KeyphraseTfidfVectorizer()

# Print parameters
print(vectorizer.get_params())
>>> {'binary': False, 'custom_pos_tagger': None, 'decay': None, 'delete_min_df': None, 'dtype': <class 'numpy.int64'>, 'lowercase': True, 'max_df': None, 'min_df': None, 'pos_pattern': '<J.*>*<N.*>+', 'spacy_exclude': ['parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat'], 'spacy_pipeline': 'en_core_web_sm', 'stop_words': 'english', 'workers': 1}

To calculate tf values instead, set use_idf=False.

# Fit and transform to document-keyphrase matrix.
document_keyphrase_matrix = vectorizer.fit_transform(docs).toarray()

print(document_keyphrase_matrix)
>>> [[0.         0.         0.09245003 0.09245003 0.09245003 0.09245003
  0.2773501  0.09245003 0.2773501  0.2773501  0.09245003 0.
  0.         0.09245003 0.         0.2773501  0.09245003 0.09245003
  0.         0.09245003 0.09245003 0.09245003 0.09245003 0.09245003
  0.5547002  0.         0.         0.09245003 0.09245003 0.
  0.2773501  0.18490007 0.09245003 0.         0.2773501  0.
  0.         0.09245003 0.         0.09245003 0.         0.
  0.         0.18490007 0.        ]
 [ 0.11867817 0.11867817 0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.11867817
  0.11867817 0.         0.11867817 0.         0.         0.
  0.11867817 0.         0.         0.         0.         0.
  0.         0.11867817 0.23735633 0.         0.         0.11867817
  0.         0.         0.         0.23735633 0.         0.11867817
  0.11867817 0.         0.59339083 0.         0.11867817 0.11867817
  0.11867817 0.         0.59339083]]

# Return keyphrases
keyphrases = vectorizer.get_feature_names_out()

print(keyphrases)
>>> ['various applications' 'list' 'task' 'supervisory signal'
 'inductive bias' 'supervised learning algorithm' 'supervised learning'
 'example input' 'input' 'algorithm' 'set' 'precise summary' 'documents'
 'input object' 'interest' 'function' 'class labels' 'machine'
 'document content' 'output pairs' 'new examples' 'unseen situations'
 'vector' 'output value' 'learning' 'document relevance' 'main topics'
 'pair' 'training examples' 'information retrieval environment'
 'training data' 'example' 'optimal scenario' 'information retrieval'
 'output' 'groups' 'indication' 'unseen instances' 'keywords' 'way'
 'phrases' 'overlap' 'users' 'learning algorithm' 'document' ]
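
To make the difference between tf and tf-idf cells more concrete, here is a minimal pure-Python sketch of the weighting. It assumes the scikit-learn defaults (smoothed idf, idf = ln((1 + n) / (1 + df)) + 1, followed by L2 row normalization); whether KeyphraseTfidfVectorizer uses exactly these settings is an assumption here, and the function name tfidf and the count matrix are made up for the example.

```python
import math


def tfidf(count_matrix, use_idf=True):
    """Weight a count matrix with smoothed idf and L2-normalize each row.

    Assumption: mirrors scikit-learn's default tf-idf settings.
    """
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])

    # Document frequency: in how many documents does each keyphrase occur?
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]

    # Smoothed inverse document frequency; all weights are 1.0 when idf is off.
    if use_idf:
        idf = [math.log((1 + n_docs) / (1 + df[j])) + 1 for j in range(n_terms)]
    else:
        idf = [1.0] * n_terms

    result = []
    for row in count_matrix:
        weighted = [row[j] * idf[j] for j in range(n_terms)]
        norm = math.sqrt(sum(value * value for value in weighted))
        result.append([value / norm if norm else 0.0 for value in weighted])
    return result


# Made-up counts: the middle keyphrase occurs in both documents.
counts = [[2, 1, 0], [0, 1, 3]]

print(tfidf(counts))                 # tf-idf: the shared keyphrase is down-weighted
print(tfidf(counts, use_idf=False))  # tf only: normalized raw counts
```

In the example output above, the two documents share no keyphrase, so every keyphrase gets the same idf and each cell reduces to its count divided by the row's L2 norm, for instance 1 / sqrt(117) ≈ 0.0925 in the first row.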

Reuse a spaCy Language object

Back to Table of Contents

KeyphraseVectorizers loads a spacy.Language object for every KeyphraseVectorizer object. When using multiple KeyphraseVectorizer objects, it is more efficient to load the spacy.Language object beforehand and pass it as the spacy_pipeline argument.

import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer, KeyphraseTfidfVectorizer

docs = [ """Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).""" , 
             
        """Keywords are defined as phrases that capture the main topics discussed in a document. 
        As they offer a brief yet precise summary of document content, they can be utilized for various applications. 
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list 
        of keywords can quickly help to determine whether a given document is relevant to their interest. 
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups 
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively 
        in information retrieval.""" ]

nlp = spacy.load("en_core_web_sm")

vectorizer1 = KeyphraseCountVectorizer(spacy_pipeline=nlp)
vectorizer2 = KeyphraseTfidfVectorizer(spacy_pipeline=nlp)

# the following calls use the nlp object
vectorizer1.fit(docs)
vectorizer2.fit(docs)

Custom POS-tagger

Back to Table of Contents

To use a different part-of-speech tagger than the ones provided by spaCy, a custom POS-tagger function can be defined and passed to the KeyphraseVectorizers via the custom_pos_tagger parameter. This parameter expects a callable function, which in turn has to accept a list of strings in a 'raw_documents' parameter and has to return a list of (word token, POS-tag) tuples. If this parameter is not None, the custom tagger function is used to tag words with parts of speech, while the spaCy pipeline is ignored.
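
Before wiring in a real tagger, the required function shape can be tried out with a dependency-free toy. The sketch below is only meant to show the contract (a 'raw_documents' list of strings in, a flat list of (word token, POS-tag) tuples out); the lexicon, the fallback tag "X" and the name toy_pos_tagger are invented for this example and are far too crude for real use.

```python
from typing import List, Tuple

# Hypothetical toy lexicon, for illustration only; a real tagger is a trained model.
TOY_LEXICON = {
    "keywords": "NNS",
    "capture": "VBP",
    "the": "DT",
    "main": "JJ",
    "topics": "NNS",
}


def toy_pos_tagger(raw_documents: List[str]) -> List[Tuple[str, str]]:
    """Tag whitespace-separated words by dictionary lookup.

    The parameter has to be called 'raw_documents' and the return value
    has to be one flat list of (word token, POS-tag) tuples.
    """
    tagged = []
    for document in raw_documents:
        for word in document.split():
            token = word.strip(".,;:!?()")
            if token:
                # Unknown words get the placeholder tag "X" (an arbitrary choice here).
                tagged.append((token, TOY_LEXICON.get(token.lower(), "X")))
    return tagged


print(toy_pos_tagger(raw_documents=["Keywords capture the main topics."]))
# [('Keywords', 'NNS'), ('capture', 'VBP'), ('the', 'DT'), ('main', 'JJ'), ('topics', 'NNS')]
```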

Example using flair:

flair can be installed via pip install flair.

from typing import List
import flair
from flair.models import SequenceTagger
from flair.tokenization import SegtokSentenceSplitter


docs = [ """Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).""" , 
             
        """Keywords are defined as phrases that capture the main topics discussed in a document. 
        As they offer a brief yet precise summary of document content, they can be utilized for various applications. 
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list 
        of keywords can quickly help to determine whether a given document is relevant to their interest. 
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups 
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively 
        in information retrieval.""" ]

# define flair POS-tagger and splitter
tagger = SequenceTagger.load('pos')
splitter = SegtokSentenceSplitter()

# define custom POS-tagger function using flair
def custom_pos_tagger(raw_documents: List[str], tagger: flair.models.SequenceTagger = tagger, splitter: flair.tokenization.SegtokSentenceSplitter = splitter) -> List[tuple]:
    """
    Important: 

    The mandatory 'raw_documents' parameter can NOT be named differently and has to expect a list of strings. 
    Any other parameter of the custom POS-tagger function can be arbitrarily defined, depending on the respective use case. 
    Furthermore the function has to return a list of (word token, POS-tag) tuples.
    """ 
    # split texts into sentences
    sentences = []
    for doc in raw_documents:
        sentences.extend(splitter.split(doc))

    # predict POS tags
    tagger.predict(sentences)

    # iterate through sentences to get word tokens and predicted POS-tags
    pos_tags = []
    words = []
    for sentence in sentences:
        pos_tags.extend([label.value for label in sentence.get_labels('pos')])
        words.extend([word.text for word in sentence])

    return list(zip(words, pos_tags))


# check that the custom POS-tagger function returns a list of (word token, POS-tag) tuples
print(custom_pos_tagger(raw_documents=docs))

> >> [( 'Supervised' , 'VBN' ), ( 'learning' , 'NN' ), ( 'is' , 'VBZ' ), ( 'the' , 'DT' ), ( 'machine' , 'NN' ), ( 'learning' , 'VBG' ), ( 'task' , 'NN' ), ( 'of' , 'IN' ), ( 'learning' , 'VBG' ), ( 'a' , 'DT' ), ( 'function' , 'NN' ), ( 'that' , 'WDT' ), ( 'maps' , 'VBZ' ), ( 'an' , 'DT' ), ( 'input' , 'NN' ), ( 'to' , 'IN' ), ( 'an' , 'DT' ), ( 'output' , 'NN' ), ( 'based' , 'VBN' ), ( 'on' , 'IN' ), ( 'example' , 'NN' ), ( 'input-output' , 'NN' ), ( 'pairs' , 'NNS' ), ( '.' , '.' ), ( 'It' , 'PRP' ), ( 'infers' , 'VBZ' ), ( 'a' , 'DT' ), ( 'function' , 'NN' ), ( 'from' , 'IN' ), ( 'labeled' , 'VBN' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'consisting' , 'VBG' ), ( 'of' , 'IN' ), ( 'a' , 'DT' ), ( 'set' , 'NN' ), ( 'of' , 'IN' ), ( 'training' , 'NN' ), ( 'examples' , 'NNS' ), ( '.' , '.' ), ( 'In' , 'IN' ), ( 'supervised' , 'JJ' ), ( 'learning' , 'NN' ), ( ',' , ',' ), ( 'each' , 'DT' ), ( 'example' , 'NN' ), ( 'is' , 'VBZ' ), ( 'a' , 'DT' ), ( 'pair' , 'NN' ), ( 'consisting' , 'VBG' ), ( 'of' , 'IN' ), ( 'an' , 'DT' ), ( 'input' , 'NN' ), ( 'object' , 'NN' ), ( '(' , ':' ), ( 'typically' , 'RB' ), ( 'a' , 'DT' ), ( 'vector' , 'NN' ), ( ')' , ',' ), ( 'and' , 'CC' ), ( 'a' , 'DT' ), ( 'desired' , 'VBN' ), ( 'output' , 'NN' ), ( 'value' , 'NN' ), ( '(' , ',' ), ( 'also' , 'RB' ), ( 'called' , 'VBN' ), ( 'the' , 'DT' ), ( 'supervisory' , 'JJ' ), ( 'signal' , 'NN' ), ( ')' , '-RRB-' ), ( '.' , '.' ), ( 'A' , 'DT' ), ( 'supervised' , 'JJ' ), ( 'learning' , 'NN' ), ( 'algorithm' , 'NN' ), ( 'analyzes' , 'VBZ' ), ( 'the' , 'DT' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'and' , 'CC' ), ( 'produces' , 'VBZ' ), ( 'an' , 'DT' ), ( 'inferred' , 'JJ' ), ( 'function' , 'NN' ), ( ',' , ',' ), ( 'which' , 'WDT' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'used' , 'VBN' ), ( 'for' , 'IN' ), ( 'mapping' , 'VBG' ), ( 'new' , 'JJ' ), ( 'examples' , 'NNS' ), ( '.' , '.' 
), ( 'An' , 'DT' ), ( 'optimal' , 'JJ' ), ( 'scenario' , 'NN' ), ( 'will' , 'MD' ), ( 'allow' , 'VB' ), ( 'for' , 'IN' ), ( 'the' , 'DT' ), ( 'algorithm' , 'NN' ), ( 'to' , 'TO' ), ( 'correctly' , 'RB' ), ( 'determine' , 'VB' ), ( 'the' , 'DT' ), ( 'class' , 'NN' ), ( 'labels' , 'NNS' ), ( 'for' , 'IN' ), ( 'unseen' , 'JJ' ), ( 'instances' , 'NNS' ), ( '.' , '.' ), ( 'This' , 'DT' ), ( 'requires' , 'VBZ' ), ( 'the' , 'DT' ), ( 'learning' , 'NN' ), ( 'algorithm' , 'NN' ), ( 'to' , 'TO' ), ( 'generalize' , 'VB' ), ( 'from' , 'IN' ), ( 'the' , 'DT' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'to' , 'IN' ), ( 'unseen' , 'JJ' ), ( 'situations' , 'NNS' ), ( 'in' , 'IN' ), ( 'a' , 'DT' ), ( "'" , '``' ), ( 'reasonable' , 'JJ' ), ( "'" , "''" ), ( 'way' , 'NN' ), ( '(' , ',' ), ( 'see' , 'VB' ), ( 'inductive' , 'JJ' ), ( 'bias' , 'NN' ), ( ')' , '-RRB-' ), ( '.' , '.' ), ( 'Keywords' , 'NNS' ), ( 'are' , 'VBP' ), ( 'defined' , 'VBN' ), ( 'as' , 'IN' ), ( 'phrases' , 'NNS' ), ( 'that' , 'WDT' ), ( 'capture' , 'VBP' ), ( 'the' , 'DT' ), ( 'main' , 'JJ' ), ( 'topics' , 'NNS' ), ( 'discussed' , 'VBN' ), ( 'in' , 'IN' ), ( 'a' , 'DT' ), ( 'document' , 'NN' ), ( '.' , '.' ), ( 'As' , 'IN' ), ( 'they' , 'PRP' ), ( 'offer' , 'VBP' ), ( 'a' , 'DT' ), ( 'brief' , 'JJ' ), ( 'yet' , 'CC' ), ( 'precise' , 'JJ' ), ( 'summary' , 'NN' ), ( 'of' , 'IN' ), ( 'document' , 'NN' ), ( 'content' , 'NN' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'utilized' , 'VBN' ), ( 'for' , 'IN' ), ( 'various' , 'JJ' ), ( 'applications' , 'NNS' ), ( '.' , '.' 
), ( 'In' , 'IN' ), ( 'an' , 'DT' ), ( 'information' , 'NN' ), ( 'retrieval' , 'NN' ), ( 'environment' , 'NN' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'serve' , 'VBP' ), ( 'as' , 'IN' ), ( 'an' , 'DT' ), ( 'indication' , 'NN' ), ( 'of' , 'IN' ), ( 'document' , 'NN' ), ( 'relevance' , 'NN' ), ( 'for' , 'IN' ), ( 'users' , 'NNS' ), ( ',' , ',' ), ( 'as' , 'IN' ), ( 'the' , 'DT' ), ( 'list' , 'NN' ), ( 'of' , 'IN' ), ( 'keywords' , 'NNS' ), ( 'can' , 'MD' ), ( 'quickly' , 'RB' ), ( 'help' , 'VB' ), ( 'to' , 'TO' ), ( 'determine' , 'VB' ), ( 'whether' , 'IN' ), ( 'a' , 'DT' ), ( 'given' , 'VBN' ), ( 'document' , 'NN' ), ( 'is' , 'VBZ' ), ( 'relevant' , 'JJ' ), ( 'to' , 'IN' ), ( 'their' , 'PRP$' ), ( 'interest' , 'NN' ), ( '.' , '.' ), ( 'As' , 'IN' ), ( 'keywords' , 'NNS' ), ( 'reflect' , 'VBP' ), ( 'a' , 'DT' ), ( 'document' , 'NN' ), ( "'s" , 'POS' ), ( 'main' , 'JJ' ), ( 'topics' , 'NNS' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'utilized' , 'VBN' ), ( 'to' , 'TO' ), ( 'classify' , 'VB' ), ( 'documents' , 'NNS' ), ( 'into' , 'IN' ), ( 'groups' , 'NNS' ), ( 'by' , 'IN' ), ( 'measuring' , 'VBG' ), ( 'the' , 'DT' ), ( 'overlap' , 'NN' ), ( 'between' , 'IN' ), ( 'the' , 'DT' ), ( 'keywords' , 'NNS' ), ( 'assigned' , 'VBN' ), ( 'to' , 'IN' ), ( 'them' , 'PRP' ), ( '.' , '.' ), ( 'Keywords' , 'NNS' ), ( 'are' , 'VBP' ), ( 'also' , 'RB' ), ( 'used' , 'VBN' ), ( 'proactively' , 'RB' ), ( 'in' , 'IN' ), ( 'information' , 'NN' ), ( 'retrieval' , 'NN' ), ( '.' , '.' )]

After the custom POS-tagger function is defined, it can be passed to KeyphraseVectorizers via the custom_pos_tagger parameter.

from keyphrase_vectorizers import KeyphraseCountVectorizer

# use custom POS-tagger with KeyphraseVectorizers
vectorizer = KeyphraseCountVectorizer(custom_pos_tagger=custom_pos_tagger)
vectorizer.fit(docs)
keyphrases = vectorizer.get_feature_names_out()
print(keyphrases)

>>> ['output value' 'information retrieval' 'algorithm' 'vector' 'groups'
 'main topics' 'task' 'precise summary' 'supervised learning'
 'inductive bias' 'information retrieval environment'
 'supervised learning algorithm' 'function' 'input' 'pair'
 'document relevance' 'learning' 'class labels' 'new examples' 'keywords'
 'list' 'machine' 'training data' 'unseen situations' 'phrases' 'output'
 'optimal scenario' 'document' 'training examples' 'documents' 'interest'
 'indication' 'learning algorithm' 'inferred function'
 'various applications' 'example' 'set' 'unseen instances'
 'example input-output pairs' 'way' 'users' 'input object'
 'supervisory signal' 'overlap' 'document content' ]

PatternRank: Keyphrase extraction with KeyphraseVectorizers and KeyBERT

Back to Table of Contents

Using the keyphrase vectorizers together with KeyBERT for keyphrase extraction results in the PatternRank approach. PatternRank can extract grammatically correct keyphrases that are most similar to a document. The vectorizer first extracts candidate keyphrases from the text documents, which are then ranked by KeyBERT based on their similarity to the document. The top-n most similar keyphrases can then be considered as document keywords.

The advantage of using KeyphraseVectorizers in addition to KeyBERT is that it allows users to get grammatically correct keyphrases instead of simple n-grams of pre-defined lengths. In KeyBERT, users can specify the keyphrase_ngram_range to define the length of the retrieved keyphrases. However, this raises two issues. First, users usually do not know the optimal n-gram range and therefore have to spend time experimenting until they find a suitable one. Second, even after finding a good n-gram range, the returned keyphrases are sometimes still grammatically not quite correct or are slightly off-key. Unfortunately, this limits the quality of the returned keyphrases.

To address this issue, we can use the vectorizers of this package in a pre-processing step to first extract candidate keyphrases that consist of zero or more adjectives followed by one or more nouns, instead of simple n-grams. TextRank, SingleRank, and EmbedRank have already successfully used this noun phrase approach for keyphrase extraction. The extracted candidate keyphrases are then passed to KeyBERT for embedding generation and similarity calculation. To use both packages for keyphrase extraction, we need to pass KeyBERT a keyphrase vectorizer with the vectorizer parameter. Since the length of the keyphrases now depends on part-of-speech tags, there is no need to define an n-gram length anymore.
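
The ranking step described above can be pictured with a small sketch. It is not KeyBERT's code: the three-dimensional vectors are invented stand-ins for real embeddings that a language model would produce, and cosine_similarity and rank_candidates are hypothetical helper names used only here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_candidates(document_embedding, candidate_embeddings, top_n=2):
    """Return the top_n candidate keyphrases most similar to the document."""
    scored = [
        (phrase, cosine_similarity(document_embedding, vector))
        for phrase, vector in candidate_embeddings.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Made-up embeddings; real ones would come from an embedding model.
document_embedding = [0.9, 0.1, 0.3]
candidate_embeddings = {
    "supervised learning": [0.8, 0.2, 0.3],
    "optimal scenario": [0.1, 0.9, 0.2],
    "training data": [0.6, 0.1, 0.6],
}

top_keyphrases = rank_candidates(document_embedding, candidate_embeddings, top_n=2)
print([phrase for phrase, _ in top_keyphrases])
# ['supervised learning', 'training data']
```

The candidates themselves come from the part-of-speech pattern, so the ranking only ever chooses among grammatically well-formed phrases.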

Example:

KeyBERT can be installed via pip install keybert.

from keyphrase_vectorizers import KeyphraseCountVectorizer
from keybert import KeyBERT

docs = [ """Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).""" , 
             
        """Keywords are defined as phrases that capture the main topics discussed in a document. 
        As they offer a brief yet precise summary of document content, they can be utilized for various applications. 
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list 
        of keywords can quickly help to determine whether a given document is relevant to their interest. 
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups 
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively 
        in information retrieval.""" ]

kw_model = KeyBERT()

Instead of deciding on a suitable n-gram range, which could be e.g. (1,2)...

>>> kw_model.extract_keywords(docs=docs, keyphrase_ngram_range=(1, 2))
[[('labeled training', 0.6013),
  ('examples supervised', 0.6112),
  ('signal supervised', 0.6152),
  ('supervised', 0.6676),
  ('supervised learning', 0.6779)],
 [('keywords assigned', 0.6354),
  ('keywords used', 0.6373),
  ('list keywords', 0.6375),
  ('keywords quickly', 0.6376),
  ('keywords defined', 0.6997)]]

...we can now let the keyphrase vectorizer decide on suitable keyphrases, without being limited to a maximum or minimum n-gram range. We only have to pass a keyphrase vectorizer as a parameter to KeyBERT:

>>> kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer())
[[('learning', 0.4813),
  ('training data', 0.5271),
  ('learning algorithm', 0.5632),
  ('supervised learning', 0.6779),
  ('supervised learning algorithm', 0.6992)],
 [('document content', 0.3988),
  ('information retrieval environment', 0.5166),
  ('information retrieval', 0.5792),
  ('keywords', 0.6046),
  ('document relevance', 0.633)]]

This allows us to make sure that we do not cut off important words by defining our n-gram range too short. For example, we would not have found the keyphrase "supervised learning algorithm" with keyphrase_ngram_range=(1,2). Furthermore, we avoid getting keyphrases that are slightly off-key, like "labeled training", "signal supervised" or "keywords quickly".
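
The cut-off effect is easy to reproduce with a few lines of plain Python. The ngrams helper below is a hypothetical, simplified n-gram generator written for this illustration (it does no stop word removal), not the one KeyBERT uses internally:

```python
def ngrams(tokens, ngram_range):
    """List all word n-grams whose length lies inside ngram_range (inclusive)."""
    low, high = ngram_range
    return [
        " ".join(tokens[start:start + size])
        for size in range(low, high + 1)
        for start in range(len(tokens) - size + 1)
    ]


tokens = "a supervised learning algorithm analyzes the training data".split()

# A three-word keyphrase can never be produced with a maximum length of two ...
print("supervised learning algorithm" in ngrams(tokens, (1, 2)))  # False
print("supervised learning algorithm" in ngrams(tokens, (1, 3)))  # True

# ... while fragments that cut across phrase boundaries are produced freely.
print("algorithm analyzes" in ngrams(tokens, (1, 2)))  # True
```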

For more tips on how to use the KeyphraseVectorizers together with KeyBERT, visit this guide.

Topic modeling with BERTopic and KeyphraseVectorizers

Back to Table of Contents

Similar to the application with KeyBERT, the keyphrase vectorizers can be used to obtain grammatically correct keyphrases as descriptions for topics instead of simple n-grams. This allows us to make sure that we do not cut off important topic description keyphrases by defining our n-gram range too short. Moreover, we don't need to clean stop words upfront, can get more precise topic models, and avoid getting topic description keyphrases that are slightly off-key.

Example:

BERTopic can be installed via pip install bertopic.

from keyphrase_vectorizers import KeyphraseCountVectorizer
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# load text documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# only use subset of the data
docs = docs[:5000]

# train topic model with KeyphraseCountVectorizer
keyphrase_topic_model = BERTopic(vectorizer_model=KeyphraseCountVectorizer())
keyphrase_topics, keyphrase_probs = keyphrase_topic_model.fit_transform(docs)

# get topics
>>> keyphrase_topic_model.topics
{-1: [('file', 0.007265527630674131),
  ( 'one' , 0.007055454904474792 ),
  ( 'use' , 0.00633563957153475 ),
  ( 'program' , 0.006053271092949018 ),
  ( 'get' , 0.006011060091056076 ),
  ( 'people' , 0.005729309058970368 ),
  ( 'know' , 0.005635951168273583 ),
  ( 'like' , 0.0055692449802916015 ),
  ( 'time' , 0.00527028825803415 ),
  ( 'us' , 0.00525564504880084 )],
 0 : [( 'game' , 0.024134589719090525 ),
  ( 'team' , 0.021852806383170772 ),
  ( 'players' , 0.01749406934044139 ),
  ( 'games' , 0.014397938026886745 ),
  ( 'hockey' , 0.013932342023677305 ),
  ( 'win' , 0.013706115572901401 ),
  ( 'year' , 0.013297593024390321 ),
  ( 'play' , 0.012533185558169046 ),
  ( 'baseball' , 0.012412743802062559 ),
  ( 'season' , 0.011602725885164318 )],
 1 : [( 'patients' , 0.022600352291162015 ),
  ( 'msg' , 0.02023877371575874 ),
  ( 'doctor' , 0.018816282737587457 ),
  ( 'medical' , 0.018614407917995103 ),
  ( 'treatment' , 0.0165028251400717 ),
  ( 'food' , 0.01604980195180696 ),
  ( 'candida' , 0.015255961242066143 ),
  ( 'disease' , 0.015115496310099693 ),
  ( 'pain' , 0.014129703072484495 ),
  ( 'hiv' , 0.012884503220341102 )],
 2 : [( 'key' , 0.028851633177510126 ),
  ( 'encryption' , 0.024375137861044675 ),
  ( 'clipper' , 0.023565947302544528 ),
  ( 'privacy' , 0.019258719348097385 ),
  ( 'security' , 0.018983682856076434 ),
  ( 'chip' , 0.018822199098878365 ),
  ( 'keys' , 0.016060139239615384 ),
  ( 'internet' , 0.01450486904722165 ),
  ( 'encrypted' , 0.013194373119964168 ),
  ( 'government' , 0.01303978311708837 )],
  ...

The same topics look a bit different when no keyphrase vectorizer is used:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

# load text documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# only use a subset of the data
docs = docs[:5000]

# train topic model without KeyphraseCountVectorizer
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

# get topics (older BERTopic releases exposed this mapping as the .topics attribute)
>>> topic_model.get_topics()
{-1: [('the', 0.012864641020408933),
  ('to', 0.01187920529994724),
  ('and', 0.011431498631699856),
  ('of', 0.01099851927541331),
  ('is', 0.010995478673036962),
  ('in', 0.009908233622158523),
  ('for', 0.009903667215879675),
  ('that', 0.009619596716087699),
  ('it', 0.009578499681829809),
  ('you', 0.0095328846440753)],
 0: [('game', 0.013949166096523719),
  ('team', 0.012458483177116456),
  ('he', 0.012354733462693834),
  ('the', 0.01119583508278812),
  ('10', 0.010190243555226108),
  ('in', 0.0101436249231417),
  ('players', 0.009682212470082758),
  ('to', 0.00933700544705287),
  ('was', 0.009172402203816335),
  ('and', 0.008653375901739337)],
 1: [('of', 0.012771267188340924),
  ('to', 0.012581337590513296),
  ('is', 0.012554884458779008),
  ('patients', 0.011983273578628046),
  ('and', 0.011863499662237566),
  ('that', 0.011616113472989725),
  ('it', 0.011581944987387165),
  ('the', 0.011475148304229873),
  ('in', 0.011395485985801054),
  ('msg', 0.010715000656335596)],
 2: [('key', 0.01725282988290282),
  ('the', 0.014634841495851404),
  ('be', 0.014429762197907552),
  ('encryption', 0.013530733999898166),
  ('to', 0.013443159534369817),
  ('clipper', 0.01296614319927958),
  ('of', 0.012164734232650158),
  ('is', 0.012128295958613464),
  ('and', 0.011972763728732667),
  ('chip', 0.010785744492767285)],
 ...

Online KeyphraseVectorizers


KeyphraseVectorizers also support online/incremental updates of their representation (similar to OnlineCountVectorizer). The vectorizer can not only update out-of-vocabulary keyphrases, but also implements decay and cleaning functions to prevent the sparse document-keyphrase matrix from becoming too large.

Parameters for online updating:

decay: At each iteration, the document-keyphrase representation of the new documents is added to the document-keyphrase representation of all documents processed so far. In other words, the document-keyphrase matrix keeps growing with each iteration. However, especially in a streaming setting, older documents may become less and less relevant over time. Therefore, a decay parameter was implemented that decays the document-keyphrase frequencies at each iteration before the frequencies of the new documents are added. The decay parameter is a value between 0 and 1 and indicates the fraction by which the frequencies of the previous document-keyphrase matrix are reduced. For example, a value of .1 decreases the frequencies in the document-keyphrase matrix by 10% at each iteration before the new document-keyphrase matrix is added. This ensures that recent data carries more weight than data from previous iterations.
delete_min_df: We may want to remove keyphrases that appear infrequently from the document-keyphrase representation. The min_df parameter works quite well for that. In a streaming setting, however, min_df does not work as well, since a keyphrase's frequency may start below min_df but end up above it over time, so setting that value high is not always advisable. As a result, the list of keyphrases learned by the vectorizer and the resulting document-keyphrase matrix can become quite large. Similarly, if we use the decay parameter, some values will decrease over time until they fall below min_df. For these reasons, the delete_min_df parameter was implemented. The parameter takes a positive integer and determines, at each iteration, which keyphrases are removed from those already learned. If the value is set to 5, the vectorizer checks after each iteration whether the total frequency of a keyphrase falls below that value. If so, the keyphrase is removed entirely from the list of keyphrases learned by the vectorizer. This helps keep the document-keyphrase matrix at a manageable size.
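
The arithmetic behind these two parameters can be sketched in plain Python. This is a minimal illustration of the behaviour described above, not the library's implementation: the helper names decayed_update and drop_infrequent and the sample counts are hypothetical, and keeping a keyphrase whose total exactly equals the threshold is an assumption.

```python
def decayed_update(previous, new, decay):
    """Shrink the accumulated frequencies by `decay`, then add the new counts."""
    return [p * (1 - decay) + n for p, n in zip(previous, new)]


def drop_infrequent(names, totals, delete_min_df):
    """Keep only keyphrases whose total frequency reaches `delete_min_df`.

    Assumption: a total exactly equal to the threshold is kept.
    """
    kept = [(name, total) for name, total in zip(names, totals) if total >= delete_min_df]
    return [name for name, _ in kept], [total for _, total in kept]


# Hypothetical state: 'function' was counted 3 times in an earlier batch,
# while 'keywords' and 'document' each occur 5 times in the new batch.
names = ["function", "keywords", "document"]
accumulated = [3, 0, 0]
new_batch = [0, 5, 5]

# First update: old counts are halved before the new counts are added
totals = decayed_update(accumulated, new_batch, decay=0.5)
# 'function' has decayed to 1.5, below the threshold of 3, so it is dropped
names, totals = drop_infrequent(names, totals, delete_min_df=3)
# Second update with the same batch: the previous totals are halved again first
second_pass = decayed_update(totals, [5, 5], decay=0.5)
print(names, totals, second_pass)
```

Applying decay before adding the new counts means a keyphrase that stops appearing loses weight at every iteration until the cleaning step removes it.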

Example:

from keyphrase_vectorizers import KeyphraseCountVectorizer

docs = ["""Supervised learning is the machine learning task of learning a function that
         maps an input to an output based on example input-output pairs. It infers a
         function from labeled training data consisting of a set of training examples.
         In supervised learning, each example is a pair consisting of an input object
         (typically a vector) and a desired output value (also called the supervisory signal). 
         A supervised learning algorithm analyzes the training data and produces an inferred function, 
         which can be used for mapping new examples. An optimal scenario will allow for the 
         algorithm to correctly determine the class labels for unseen instances. This requires 
         the learning algorithm to generalize from the training data to unseen situations in a 
         'reasonable' way (see inductive bias).""",

        """Keywords are defined as phrases that capture the main topics discussed in a document. 
        As they offer a brief yet precise summary of document content, they can be utilized for various applications. 
        In an information retrieval environment, they serve as an indication of document relevance for users, as the list 
        of keywords can quickly help to determine whether a given document is relevant to their interest. 
        As keywords reflect a document's main topics, they can be utilized to classify documents into groups 
        by measuring the overlap between the keywords assigned to them. Keywords are also used proactively 
        in information retrieval."""]

# Init vectorizer with online-update parameters.
vectorizer = KeyphraseCountVectorizer(decay=0.5, delete_min_df=3)

# initial vectorizer fit
vectorizer.fit_transform([docs[0]]).toarray()
>>> array([[1, 1, 3, 1, 1, 3, 1, 3, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 3, 1, 3,
            1, 1, 1]])

# check learned keyphrases
print(vectorizer.get_feature_names_out())
>>> ['output pairs', 'output value', 'function', 'optimal scenario',
     'pair', 'supervised learning', 'supervisory signal', 'algorithm',
     'supervised learning algorithm', 'way', 'training examples',
     'input object', 'example', 'machine', 'output',
     'unseen situations', 'unseen instances', 'inductive bias',
     'new examples', 'input', 'task', 'training data', 'class labels',
     'set', 'vector']

# learn additional keyphrases from new documents with partial fit
vectorizer.partial_fit([docs[1]])
vectorizer.transform([docs[1]]).toarray()
>>> array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
            0, 0, 0, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 5, 1, 1, 5, 1]])

# check learned keyphrases, including newly learned ones
print(vectorizer.get_feature_names_out())
>>> ['output pairs', 'output value', 'function', 'optimal scenario',
     'pair', 'supervised learning', 'supervisory signal', 'algorithm',
     'supervised learning algorithm', 'way', 'training examples',
     'input object', 'example', 'machine', 'output',
     'unseen situations', 'unseen instances', 'inductive bias',
     'new examples', 'input', 'task', 'training data', 'class labels',
     'set', 'vector', 'list', 'various applications',
     'information retrieval', 'groups', 'overlap', 'main topics',
     'precise summary', 'document relevance', 'interest', 'indication',
     'information retrieval environment', 'phrases', 'keywords',
     'document content', 'documents', 'document', 'users']

# update list of learned keyphrases according to 'delete_min_df'
vectorizer.update_bow([docs[1]])
vectorizer.transform([docs[1]]).toarray()
>>> array([[5, 5]])

# check updated list of learned keyphrases (only the ones that appear more than 'delete_min_df' remain)
print(vectorizer.get_feature_names_out())
>>> ['keywords', 'document']

# update again and check the impact of 'decay' on the learned document-keyphrase matrix
vectorizer.update_bow([docs[1]])
vectorizer.X_.toarray()
>>> array([[7.5, 7.5]])

Citation information


When citing KeyphraseVectorizers or PatternRank in academic papers and theses, please use this BibTeX entry:

@conference{schopf_etal_kdir22,
  author={Tim Schopf and Simon Klimek and Florian Matthes},
  title={PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction},
  booktitle={Proceedings of the 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2022) - KDIR},
  year={2022},
  pages={243-248},
  publisher={SciTePress},
  organization={INSTICC},
  doi={10.5220/0011546600003335},
  isbn={978-989-758-614-9},
  issn={2184-3228},
}
