This package was developed during the writing of our PatternRank paper. You can check out the paper here. When using KeyphraseVectorizers or PatternRank in academic papers and theses, please use the BibTeX entry below.
Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix. A document-keyphrase matrix is a mathematical matrix that describes the frequency of keyphrases that occur in a collection of documents. The rows indicate the text documents and the columns indicate the unique keyphrases.
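For illustration, a minimal sketch of such a matrix for two hypothetical documents and three keyphrases (the names and counts are invented for this example) could look as follows:

import numpy as np

# Hypothetical document-keyphrase matrix: 2 documents x 3 unique keyphrases.
keyphrases = ['supervised learning', 'training data', 'keywords']
document_keyphrase_matrix = np.array([[3, 3, 0],   # document 1
                                      [0, 0, 5]])  # document 2
# Cell [i, j] holds how often keyphrase j occurs in document i,
# e.g. 'keywords' occurs 5 times in document 2.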
The package contains wrappers of the sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfVectorizer classes. Instead of using n-gram tokens of a pre-defined range, these classes extract keyphrases from text documents via part-of-speech tags to compute document-keyphrase matrices.
The corresponding Medium posts can be found here and here.
First, the document texts are annotated with spaCy part-of-speech tags. A list of all possible part-of-speech tags for the different languages is linked here. The annotation requires passing the spaCy pipeline of the corresponding language to the vectorizer via the spacy_pipeline parameter.
Second, words are extracted from the document texts whose part-of-speech tags match the regex pattern defined in the pos_pattern parameter. The keyphrases are a list of unique words extracted from the text documents by this method.
Finally, the vectorizers calculate the document-keyphrase matrix.
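To make the pattern-matching step concrete, here is a minimal, self-contained sketch (not the package's internal implementation; the tagged sentence and helper logic are invented for illustration) of how a pattern equivalent to the default '<J.*>*<N.*>+' selects candidates from (word, POS-tag) pairs:

import re

# Hypothetical tagged sentence as (word, POS-tag) pairs.
tagged = [('supervised', 'JJ'), ('learning', 'NN'), ('algorithm', 'NN'),
          ('analyzes', 'VBZ'), ('the', 'DT'), ('training', 'NN'), ('data', 'NNS')]

# Encode each tag as a '<TAG>' token so a regex can match over the tag sequence.
# '[^>]*' plays the role of '.*' while keeping each match inside a single tag.
tag_string = ''.join(f'<{tag}>' for _, tag in tagged)
candidates = []
for match in re.finditer(r'(?:<J[^>]*>)*(?:<N[^>]*>)+', tag_string):
    start = tag_string[:match.start()].count('<')   # index of first matched token
    length = match.group().count('<')               # number of matched tokens
    candidates.append(' '.join(word for word, _ in tagged[start:start + length]))

print(candidates)  # ['supervised learning algorithm', 'training data']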
pip install keyphrase-vectorizers
For detailed information, visit the API Guide.
Back to Table of Contents
from keyphrase_vectorizers import KeyphraseCountVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""",
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval."""]
# Init default vectorizer.
vectorizer = KeyphraseCountVectorizer()
# Print parameters
print(vectorizer.get_params())
>>> {'binary': False, 'dtype': <class 'numpy.int64'>, 'lowercase': True, 'max_df': None, 'min_df': None, 'pos_pattern': '<J.*>*<N.*>+', 'spacy_exclude': ['parser', 'attribute_ruler', 'lemmatizer', 'ner'], 'spacy_pipeline': 'en_core_web_sm', 'stop_words': 'english', 'workers': 1}
By default, the vectorizer is initialized for the English language. That means an English spacy_pipeline is specified, English stop_words are removed, and the pos_pattern extracts keyphrases that have zero or more adjectives, followed by one or more nouns, using the English spaCy part-of-speech tags. In addition, the spaCy pipeline components ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] are excluded by default to increase efficiency. If you choose a different spacy_pipeline, it may be necessary to exclude/include different pipeline components via the spacy_exclude parameter for the spaCy POS tagger to work properly.
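If a different grammatical structure is desired, the pos_pattern can be customized; for instance, this small sketch (the pattern variant is illustrative) would extract keyphrases consisting of nouns only:

# Sketch: only extract keyphrases that consist of one or more nouns.
noun_vectorizer = KeyphraseCountVectorizer(pos_pattern='<N.*>+')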
# After initializing the vectorizer, it can be fitted
# to learn the keyphrases from the text documents.
vectorizer.fit(docs)
# After learning the keyphrases, they can be returned.
keyphrases = vectorizer.get_feature_names_out()
print(keyphrases)
>>> ['users' 'main topics' 'learning algorithm' 'overlap' 'documents' 'output'
'keywords' 'precise summary' 'new examples' 'training data' 'input'
'document content' 'training examples' 'unseen instances'
'optimal scenario' 'document' 'task' 'supervised learning algorithm'
'example' 'interest' 'function' 'example input' 'various applications'
'unseen situations' 'phrases' 'indication' 'inductive bias'
'supervisory signal' 'document relevance' 'information retrieval' 'set'
'input object' 'groups' 'output value' 'list' 'learning' 'output pairs'
'pair' 'class labels' 'supervised learning' 'machine'
'information retrieval environment' 'algorithm' 'vector' 'way']
# After fitting, the vectorizer can transform the documents
# to a document-keyphrase matrix.
# Matrix rows indicate the documents and columns indicate the unique keyphrases.
# Each cell represents the count.
document_keyphrase_matrix = vectorizer.transform(docs).toarray()
print(document_keyphrase_matrix)
>>> [[0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
  1 1 1 3 1 0 3 1 1]
 [1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
  0 0 0 0 0 1 0 0 0]]
# Fit and transform can also be executed in one step,
# which is more efficient.
document_keyphrase_matrix = vectorizer.fit_transform(docs).toarray()
print(document_keyphrase_matrix)
>>> [[0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
  1 1 1 3 1 0 3 1 1]
 [1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
  0 0 0 0 0 1 0 0 0]]
Back to Table of Contents
german_docs = [ """Goethe stammte aus einer angesehenen bürgerlichen Familie.
Sein Großvater mütterlicherseits war als Stadtschultheiß höchster Justizbeamter der Stadt Frankfurt,
sein Vater Doktor der Rechte und Kaiserlicher Rat. Er und seine Schwester Cornelia erfuhren eine aufwendige
Ausbildung durch Hauslehrer. Dem Wunsch seines Vaters folgend, studierte Goethe in Leipzig und Straßburg
Rechtswissenschaft und war danach als Advokat in Wetzlar und Frankfurt tätig.
Gleichzeitig folgte er seiner Neigung zur Dichtkunst.""",
"""Friedrich Schiller wurde als zweites Kind des Offiziers, Wundarztes und Leiters der Hofgärtnerei in
Marbach am Neckar Johann Kaspar Schiller und dessen Ehefrau Elisabetha Dorothea Schiller, geb. Kodweiß,
die Tochter eines Wirtes und Bäckers war, 1759 in Marbach am Neckar geboren
""" ]
# Init vectorizer for the german language
vectorizer = KeyphraseCountVectorizer ( spacy_pipeline = 'de_core_news_sm' , pos_pattern = '<ADJ.*>*<N.*>+' , stop_words = 'german' )指定了德语spacy_pipeline ,并删除了德语stop_words 。由于德国词汇一部分标签与英文标签不同,因此也定制了pos_pattern参数。 REGEX模式<ADJ.*>*<N.*>+
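The fitted German vectorizer is then used exactly like the English one; this short sketch just mirrors the calls shown above (output omitted here):

vectorizer.fit(german_docs)
print(vectorizer.get_feature_names_out())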
Attention! The spaCy pipeline components ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] are excluded by default to increase efficiency. If you choose a different spacy_pipeline, it may be necessary to exclude/include different pipeline components via the spacy_exclude parameter for the spaCy POS tagger to work properly.
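For example, a sketch of setting spacy_exclude explicitly (the chosen component list is illustrative and depends on the pipeline you use):

vectorizer = KeyphraseCountVectorizer(spacy_pipeline='de_core_news_sm',
                                      pos_pattern='<ADJ.*>*<N.*>+',
                                      stop_words='german',
                                      spacy_exclude=['parser', 'lemmatizer', 'ner'])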
Back to Table of Contents
The KeyphraseTfidfVectorizer has the same function calls and features as the KeyphraseCountVectorizer. The only difference is that the document-keyphrase matrix cells represent tf or tf-idf values, depending on the parameter settings, instead of counts.
from keyphrase_vectorizers import KeyphraseTfidfVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""",
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval."""]
# Init default vectorizer for the English language that computes tf-idf values
vectorizer = KeyphraseTfidfVectorizer()
# Print parameters
print(vectorizer.get_params())
>>> {'binary': False, 'custom_pos_tagger': None, 'decay': None, 'delete_min_df': None, 'dtype': <class 'numpy.int64'>, 'lowercase': True, 'max_df': None, 'min_df': None, 'pos_pattern': '<J.*>*<N.*>+', 'spacy_exclude': ['parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat'], 'spacy_pipeline': 'en_core_web_sm', 'stop_words': 'english', 'workers': 1}
To compute plain tf values instead of tf-idf values, set use_idf=False.
# Fit and transform to document-keyphrase matrix.
document_keyphrase_matrix = vectorizer.fit_transform(docs).toarray()
print(document_keyphrase_matrix)
>>> [[0. 0. 0.09245003 0.09245003 0.09245003 0.09245003
0.2773501 0.09245003 0.2773501 0.2773501 0.09245003 0.
0. 0.09245003 0. 0.2773501 0.09245003 0.09245003
0. 0.09245003 0.09245003 0.09245003 0.09245003 0.09245003
0.5547002 0. 0. 0.09245003 0.09245003 0.
0.2773501 0.18490007 0.09245003 0. 0.2773501 0.
0. 0.09245003 0. 0.09245003 0. 0.
0. 0.18490007 0. ]
[ 0.11867817 0.11867817 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.11867817
0.11867817 0. 0.11867817 0. 0. 0.
0.11867817 0. 0. 0. 0. 0.
0. 0.11867817 0.23735633 0. 0. 0.11867817
0. 0. 0. 0.23735633 0. 0.11867817
0.11867817 0. 0.59339083 0. 0.11867817 0.11867817
0.11867817 0. 0.59339083]]
# Return keyphrases
keyphrases = vectorizer.get_feature_names_out()
print(keyphrases)
>>> ['various applications' 'list' 'task' 'supervisory signal'
'inductive bias' 'supervised learning algorithm' 'supervised learning'
'example input' 'input' 'algorithm' 'set' 'precise summary' 'documents'
'input object' 'interest' 'function' 'class labels' 'machine'
'document content' 'output pairs' 'new examples' 'unseen situations'
'vector' 'output value' 'learning' 'document relevance' 'main topics'
'pair' 'training examples' 'information retrieval environment'
'training data' 'example' 'optimal scenario' 'information retrieval'
'output' 'groups' 'indication' 'unseen instances' 'keywords' 'way'
'phrases' 'overlap' 'users' 'learning algorithm' 'document']
Back to Table of Contents
By default, each KeyphraseVectorizer object loads its own spacy.Language object. When using several KeyphraseVectorizer objects, it is more efficient to load the spacy.Language object in advance and pass it to the vectorizers via the spacy_pipeline parameter.
import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer , KeyphraseTfidfVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""",
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval."""]
nlp = spacy.load("en_core_web_sm")
vectorizer1 = KeyphraseCountVectorizer(spacy_pipeline=nlp)
vectorizer2 = KeyphraseTfidfVectorizer(spacy_pipeline=nlp)
# the following calls use the nlp object
vectorizer1.fit(docs)
vectorizer2.fit(docs)
Back to Table of Contents
To use a different part-of-speech tagger than the ones provided by spaCy, a custom POS-tagger function can be defined and passed to the keyphrase vectorizers via the custom_pos_tagger parameter. This parameter expects a callable function, which in turn needs to expect a list of strings in a 'raw_documents' parameter and has to return a list of (word token, POS-tag) tuples. If this parameter is not None, the custom tagger function is used to tag words with parts-of-speech, while the spaCy pipeline is ignored.
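A minimal skeleton of such a callable could look as follows (the function name is illustrative; the flair-based implementation below fills in a concrete version of this contract):

from typing import List

def my_pos_tagger(raw_documents: List[str]) -> List[tuple]:
    # The parameter must be named 'raw_documents' and receive a list of strings;
    # the return value must be a list of (word token, POS-tag) tuples.
    ...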
The following example uses flair, which can be installed via pip install flair.
from typing import List
import flair
from flair.models import SequenceTagger
from flair.tokenization import SegtokSentenceSplitter
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""",
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval."""]
# define flair POS-tagger and splitter
tagger = SequenceTagger.load('pos')
splitter = SegtokSentenceSplitter()
# define custom POS-tagger function using flair
def custom_pos_tagger(raw_documents: List[str], tagger: flair.models.SequenceTagger = tagger, splitter: flair.tokenization.SegtokSentenceSplitter = splitter) -> List[tuple]:
    """
    Important:
    The mandatory 'raw_documents' parameter can NOT be named differently and has to expect a list of strings.
    Any other parameter of the custom POS-tagger function can be arbitrarily defined, depending on the respective use case.
    Furthermore the function has to return a list of (word token, POS-tag) tuples.
    """
    # split texts into sentences
    sentences = []
    for doc in raw_documents:
        sentences.extend(splitter.split(doc))

    # predict POS tags
    tagger.predict(sentences)

    # iterate through sentences to get word tokens and predicted POS-tags
    pos_tags = []
    words = []
    for sentence in sentences:
        pos_tags.extend([label.value for label in sentence.get_labels('pos')])
        words.extend([word.text for word in sentence])

    return list(zip(words, pos_tags))

# check that the custom POS-tagger function returns a list of (word token, POS-tag) tuples
print(custom_pos_tagger(raw_documents=docs))
>>> [('Supervised', 'VBN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('machine', 'NN'), ('learning', 'VBG'), ('task', 'NN'), ('of', 'IN'), ('learning', 'VBG'), ('a', 'DT'), ('function', 'NN'), ('that', 'WDT'), ('maps', 'VBZ'), ('an', 'DT'), ('input', 'NN'), ('to', 'IN'), ('an', 'DT'), ('output', 'NN'), ('based', 'VBN'), ('on', 'IN'), ('example', 'NN'), ('input-output', 'NN'), ('pairs', 'NNS'), ('.', '.'),
('It', 'PRP'), ('infers', 'VBZ'), ('a', 'DT'), ('function', 'NN'), ('from', 'IN'), ('labeled', 'VBN'), ('training', 'NN'), ('data', 'NNS'), ('consisting', 'VBG'), ('of', 'IN'), ('a', 'DT'), ('set', 'NN'), ('of', 'IN'), ('training', 'NN'), ('examples', 'NNS'), ('.', '.'),
('In', 'IN'), ('supervised', 'JJ'), ('learning', 'NN'), (',', ','), ('each', 'DT'), ('example', 'NN'), ('is', 'VBZ'), ('a', 'DT'), ('pair', 'NN'), ('consisting', 'VBG'), ('of', 'IN'), ('an', 'DT'), ('input', 'NN'), ('object', 'NN'), ('(', ':'), ('typically', 'RB'), ('a', 'DT'), ('vector', 'NN'), (')', ','), ('and', 'CC'), ('a', 'DT'), ('desired', 'VBN'), ('output', 'NN'), ('value', 'NN'), ('(', ','), ('also', 'RB'), ('called', 'VBN'), ('the', 'DT'), ('supervisory', 'JJ'), ('signal', 'NN'), (')', '-RRB-'), ('.', '.'),
('A', 'DT'), ('supervised', 'JJ'), ('learning', 'NN'), ('algorithm', 'NN'), ('analyzes', 'VBZ'), ('the', 'DT'), ('training', 'NN'), ('data', 'NNS'), ('and', 'CC'), ('produces', 'VBZ'), ('an', 'DT'), ('inferred', 'JJ'), ('function', 'NN'), (',', ','), ('which', 'WDT'), ('can', 'MD'), ('be', 'VB'), ('used', 'VBN'), ('for', 'IN'), ('mapping', 'VBG'), ('new', 'JJ'), ('examples', 'NNS'), ('.', '.'),
('An', 'DT'), ('optimal', 'JJ'), ('scenario', 'NN'), ('will', 'MD'), ('allow', 'VB'), ('for', 'IN'), ('the', 'DT'), ('algorithm', 'NN'), ('to', 'TO'), ('correctly', 'RB'), ('determine', 'VB'), ('the', 'DT'), ('class', 'NN'), ('labels', 'NNS'), ('for', 'IN'), ('unseen', 'JJ'), ('instances', 'NNS'), ('.', '.'),
('This', 'DT'), ('requires', 'VBZ'), ('the', 'DT'), ('learning', 'NN'), ('algorithm', 'NN'), ('to', 'TO'), ('generalize', 'VB'), ('from', 'IN'), ('the', 'DT'), ('training', 'NN'), ('data', 'NNS'), ('to', 'IN'), ('unseen', 'JJ'), ('situations', 'NNS'), ('in', 'IN'), ('a', 'DT'), ("'", '``'), ('reasonable', 'JJ'), ("'", "''"), ('way', 'NN'), ('(', ','), ('see', 'VB'), ('inductive', 'JJ'), ('bias', 'NN'), (')', '-RRB-'), ('.', '.'),
('Keywords', 'NNS'), ('are', 'VBP'), ('defined', 'VBN'), ('as', 'IN'), ('phrases', 'NNS'), ('that', 'WDT'), ('capture', 'VBP'), ('the', 'DT'), ('main', 'JJ'), ('topics', 'NNS'), ('discussed', 'VBN'), ('in', 'IN'), ('a', 'DT'), ('document', 'NN'), ('.', '.'),
('As', 'IN'), ('they', 'PRP'), ('offer', 'VBP'), ('a', 'DT'), ('brief', 'JJ'), ('yet', 'CC'), ('precise', 'JJ'), ('summary', 'NN'), ('of', 'IN'), ('document', 'NN'), ('content', 'NN'), (',', ','), ('they', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('utilized', 'VBN'), ('for', 'IN'), ('various', 'JJ'), ('applications', 'NNS'), ('.', '.'),
('In', 'IN'), ('an', 'DT'), ('information', 'NN'), ('retrieval', 'NN'), ('environment', 'NN'), (',', ','), ('they', 'PRP'), ('serve', 'VBP'), ('as', 'IN'), ('an', 'DT'), ('indication', 'NN'), ('of', 'IN'), ('document', 'NN'), ('relevance', 'NN'), ('for', 'IN'), ('users', 'NNS'), (',', ','), ('as', 'IN'), ('the', 'DT'), ('list', 'NN'), ('of', 'IN'), ('keywords', 'NNS'), ('can', 'MD'), ('quickly', 'RB'), ('help', 'VB'), ('to', 'TO'), ('determine', 'VB'), ('whether', 'IN'), ('a', 'DT'), ('given', 'VBN'), ('document', 'NN'), ('is', 'VBZ'), ('relevant', 'JJ'), ('to', 'IN'), ('their', 'PRP$'), ('interest', 'NN'), ('.', '.'),
('As', 'IN'), ('keywords', 'NNS'), ('reflect', 'VBP'), ('a', 'DT'), ('document', 'NN'), ("'s", 'POS'), ('main', 'JJ'), ('topics', 'NNS'), (',', ','), ('they', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('utilized', 'VBN'), ('to', 'TO'), ('classify', 'VB'), ('documents', 'NNS'), ('into', 'IN'), ('groups', 'NNS'), ('by', 'IN'), ('measuring', 'VBG'), ('the', 'DT'), ('overlap', 'NN'), ('between', 'IN'), ('the', 'DT'), ('keywords', 'NNS'), ('assigned', 'VBN'), ('to', 'IN'), ('them', 'PRP'), ('.', '.'),
('Keywords', 'NNS'), ('are', 'VBP'), ('also', 'RB'), ('used', 'VBN'), ('proactively', 'RB'), ('in', 'IN'), ('information', 'NN'), ('retrieval', 'NN'), ('.', '.')]
After the custom POS-tagger function is defined, it can be passed to the keyphrase vectorizers via the custom_pos_tagger parameter.
from keyphrase_vectorizers import KeyphraseCountVectorizer
# use custom POS-tagger with KeyphraseVectorizers
vectorizer = KeyphraseCountVectorizer(custom_pos_tagger=custom_pos_tagger)
vectorizer.fit(docs)
keyphrases = vectorizer.get_feature_names_out()
print(keyphrases)
>>> ['output value' 'information retrieval' 'algorithm' 'vector' 'groups'
'main topics' 'task' 'precise summary' 'supervised learning'
'inductive bias' 'information retrieval environment'
'supervised learning algorithm' 'function' 'input' 'pair'
'document relevance' 'learning' 'class labels' 'new examples' 'keywords'
'list' 'machine' 'training data' 'unseen situations' 'phrases' 'output'
'optimal scenario' 'document' 'training examples' 'documents' 'interest'
'indication' 'learning algorithm' 'inferred function'
'various applications' 'example' 'set' 'unseen instances'
'example input-output pairs' 'way' 'users' 'input object'
'supervisory signal' 'overlap' 'document content']
Back to Table of Contents
Using KeyphraseVectorizers in combination with KeyBERT for keyphrase extraction results in the PatternRank approach. PatternRank can extract grammatically correct keyphrases that are most similar to a document. Thereby, the vectorizer first extracts candidate keyphrases from the text documents, which are subsequently ranked by KeyBERT based on their document similarity. The top-ranked, most similar keyphrases can then be considered as document keywords.
The advantage of using KeyphraseVectorizers in addition to KeyBERT is that it allows users to get grammatically correct keyphrases instead of simple n-grams of pre-defined lengths. In KeyBERT, users can specify the keyphrase_ngram_range to define the length of the retrieved keyphrases. However, this raises two issues. First, users usually do not know the optimal n-gram range and therefore have to spend some time experimenting until they find a suitable one. Second, even after a good n-gram range has been found, the returned keyphrases are sometimes still grammatically not quite correct or slightly off-key. Unfortunately, this limits the quality of the returned keyphrases.
To address this issue, we can use the vectorizers of this package to first extract candidate keyphrases that consist of zero or more adjectives, followed by one or multiple nouns, in a preprocessing step instead of simple n-grams. TextRank, SingleRank, and EmbedRank have already successfully used this noun-phrase approach for keyphrase extraction. The extracted candidate keyphrases are subsequently passed to KeyBERT for embedding generation and similarity calculation. To use both packages for keyphrase extraction, we need to pass KeyBERT a keyphrase vectorizer via its vectorizer parameter. Since the length of keyphrases now depends on part-of-speech tags, there is no need to define an n-gram length anymore.
KeyBERT can be installed via pip install keybert.
from keyphrase_vectorizers import KeyphraseCountVectorizer
from keybert import KeyBERT
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""",
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval."""]
kw_model = KeyBERT()
Instead of deciding on a suitable n-gram range, which could be (1,2) for example...
>>> kw_model.extract_keywords(docs=docs, keyphrase_ngram_range=(1, 2))
[[('labeled training', 0.6013),
  ('examples supervised', 0.6112),
  ('signal supervised', 0.6152),
  ('supervised', 0.6676),
  ('supervised learning', 0.6779)],
 [('keywords assigned', 0.6354),
  ('keywords used', 0.6373),
  ('list keywords', 0.6375),
  ('keywords quickly', 0.6376),
  ('keywords defined', 0.6997)]]
...we can now simply let the keyphrase vectorizer decide on suitable keyphrases, without limitation to a maximum or minimum n-gram range. We only have to pass the keyphrase vectorizer as a parameter to KeyBERT:
>>> kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer())
[[('learning', 0.4813),
  ('training data', 0.5271),
  ('learning algorithm', 0.5632),
  ('supervised learning', 0.6779),
  ('supervised learning algorithm', 0.6992)],
 [('document content', 0.3988),
  ('information retrieval environment', 0.5166),
  ('information retrieval', 0.5792),
  ('keywords', 0.6046),
  ('document relevance', 0.633)]]
This allows us to make sure that important words are not cut off by defining the n-gram range too short. For example, we would not have found the keyphrase 'supervised learning algorithm' with keyphrase_ngram_range=(1,2). Furthermore, we avoid getting keyphrases that are slightly off-key like 'labeled training' or 'signal supervised'.
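The vectorizer can be combined with KeyBERT's other options as usual; for instance, a short sketch that returns the ten highest-ranked keyphrases per document via KeyBERT's top_n parameter:

kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer(), top_n=10)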
For more tips on how to use KeyphraseVectorizers together with KeyBERT, visit this guide.
Back to Table of Contents
Similar to the usage with KeyBERT, the keyphrase vectorizers can be used with BERTopic to obtain grammatically correct keyphrases as topic descriptions instead of simple n-grams. This makes sure that important topic-description keyphrases are not cut off by defining the n-gram range too short. Moreover, stopwords do not need to be removed upfront, the resulting topic models are more precise, and slightly off-key topic-description keyphrases are avoided.
BERTopic can be installed via pip install bertopic.
from keyphrase_vectorizers import KeyphraseCountVectorizer
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# load text documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# only use subset of the data
docs = docs[:5000]
# train topic model with KeyphraseCountVectorizer
keyphrase_topic_model = BERTopic(vectorizer_model=KeyphraseCountVectorizer())
keyphrase_topics, keyphrase_probs = keyphrase_topic_model.fit_transform(docs)
# get topics
>>> keyphrase_topic_model.topics
{-1: [('file', 0.007265527630674131),
  ('one', 0.007055454904474792),
  ('use', 0.00633563957153475),
  ('program', 0.006053271092949018),
  ('get', 0.006011060091056076),
  ('people', 0.005729309058970368),
  ('know', 0.005635951168273583),
  ('like', 0.0055692449802916015),
  ('time', 0.00527028825803415),
  ('us', 0.00525564504880084)],
 0: [('game', 0.024134589719090525),
  ('team', 0.021852806383170772),
  ('players', 0.01749406934044139),
  ('games', 0.014397938026886745),
  ('hockey', 0.013932342023677305),
  ('win', 0.013706115572901401),
  ('year', 0.013297593024390321),
  ('play', 0.012533185558169046),
  ('baseball', 0.012412743802062559),
  ('season', 0.011602725885164318)],
 1: [('patients', 0.022600352291162015),
  ('msg', 0.02023877371575874),
  ('doctor', 0.018816282737587457),
  ('medical', 0.018614407917995103),
  ('treatment', 0.0165028251400717),
  ('food', 0.01604980195180696),
  ('candida', 0.015255961242066143),
  ('disease', 0.015115496310099693),
  ('pain', 0.014129703072484495),
  ('hiv', 0.012884503220341102)],
 2: [('key', 0.028851633177510126),
  ('encryption', 0.024375137861044675),
  ('clipper', 0.023565947302544528),
  ('privacy', 0.019258719348097385),
  ('security', 0.018983682856076434),
  ('chip', 0.018822199098878365),
  ('keys', 0.016060139239615384),
  ('internet', 0.01450486904722165),
  ('encrypted', 0.013194373119964168),
  ('government', 0.01303978311708837)],
 ...
When KeyphraseVectorizers is not used, the same topics look slightly different:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# load text documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# only use subset of the data
docs = docs[:5000]
# train topic model without KeyphraseCountVectorizer
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# get topics
>>> topic_model.topics
{-1: [('the', 0.012864641020408933),
  ('to', 0.01187920529994724),
  ('and', 0.011431498631699856),
  ('of', 0.01099851927541331),
  ('is', 0.010995478673036962),
  ('in', 0.009908233622158523),
  ('for', 0.009903667215879675),
  ('that', 0.009619596716087699),
  ('it', 0.009578499681829809),
  ('you', 0.0095328846440753)],
 0: [('game', 0.013949166096523719),
  ('team', 0.012458483177116456),
  ('he', 0.012354733462693834),
  ('the', 0.01119583508278812),
  ('10', 0.010190243555226108),
  ('in', 0.0101436249231417),
  ('players', 0.009682212470082758),
  ('to', 0.00933700544705287),
  ('was', 0.009172402203816335),
  ('and', 0.008653375901739337)],
 1: [('of', 0.012771267188340924),
  ('to', 0.012581337590513296),
  ('is', 0.012554884458779008),
  ('patients', 0.011983273578628046),
  ('and', 0.011863499662237566),
  ('that', 0.011616113472989725),
  ('it', 0.011581944987387165),
  ('the', 0.011475148304229873),
  ('in', 0.011395485985801054),
  ('msg', 0.010715000656335596)],
 2: [('key', 0.01725282988290282),
  ('the', 0.014634841495851404),
  ('be', 0.014429762197907552),
  ('encryption', 0.013530733999898166),
  ('to', 0.013443159534369817),
  ('clipper', 0.01296614319927958),
  ('of', 0.012164734232650158),
  ('is', 0.012128295958613464),
  ('and', 0.011972763728732667),
  ('chip', 0.010785744492767285)],
 ...
Back to Table of Contents
The KeyphraseVectorizers also support online/incremental updates of their representations (similar to the OnlineCountVectorizer). Besides updating the keyphrases learned by the vectorizer, decay and cleaning functions are implemented to prevent the sparse document-keyphrase matrix from becoming too large.
Parameters for online updates:
decay: At each iteration, we sum the document-keyphrase representation of the new documents with the document-keyphrase representation of all documents processed thus far. In other words, the document-keyphrase matrix keeps increasing with each iteration. Especially in a streaming setting, however, older documents may become less relevant over time. Therefore, a decay parameter was implemented that decays the document-keyphrase frequencies at each iteration before the document frequencies of the new documents are added. The decay parameter is a value between 0 and 1 and indicates the percentage by which the frequencies in the previous document-keyphrase matrix should be reduced. For example, a value of .1 will decrease the frequencies in the document-keyphrase matrix by 10% at each iteration before adding the frequencies of the new documents. This makes sure that recent data carries more weight than data from previous iterations.

delete_min_df: We may want to remove keyphrases from the representation that occur only rarely. The min_df parameter works well for this. However, in a streaming setting min_df does not work properly, as a keyphrase frequency may start below min_df but end up above it over time; setting a high value is therefore not always advisable. As a result, the list of keyphrases learned by the vectorizer and the columns of the generated document-keyphrase matrix can become very large. Moreover, if a decay parameter is used, some frequencies will decrease over time until they fall below min_df. For these reasons, the delete_min_df parameter was implemented. It takes a positive integer and indicates at each iteration which keyphrases should be removed from the ones already learned. If the value is set to 5, it is checked after each iteration whether the total frequency of a keyphrase still exceeds this value. If it does not, the keyphrase is removed entirely from the vectorizer's list of learned keyphrases. This helps to keep the document-keyphrase matrix at a manageable size.

from keyphrase_vectorizers import KeyphraseCountVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""",
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval."""]
# Init default vectorizer.
vectorizer = KeyphraseCountVectorizer(decay=0.5, delete_min_df=3)
# initial vectorizer fit
vectorizer.fit_transform([docs[0]]).toarray()
>>> array([[1, 1, 3, 1, 1, 3, 1, 3, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 3, 1, 3,
        1, 1, 1]])
# check learned keyphrases
print(vectorizer.get_feature_names_out())
>>> ['output pairs', 'output value', 'function', 'optimal scenario',
 'pair', 'supervised learning', 'supervisory signal', 'algorithm',
 'supervised learning algorithm', 'way', 'training examples',
 'input object', 'example', 'machine', 'output',
 'unseen situations', 'unseen instances', 'inductive bias',
 'new examples', 'input', 'task', 'training data', 'class labels',
 'set', 'vector']
# learn additional keyphrases from new documents with partial fit
vectorizer.partial_fit([docs[1]])
vectorizer.transform([docs[1]]).toarray()
>>> array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 5, 1, 1, 5, 1]])
# check learned keyphrases, including newly learned ones
print(vectorizer.get_feature_names_out())
>>> ['output pairs', 'output value', 'function', 'optimal scenario',
 'pair', 'supervised learning', 'supervisory signal', 'algorithm',
 'supervised learning algorithm', 'way', 'training examples',
 'input object', 'example', 'machine', 'output',
 'unseen situations', 'unseen instances', 'inductive bias',
 'new examples', 'input', 'task', 'training data', 'class labels',
 'set', 'vector', 'list', 'various applications',
 'information retrieval', 'groups', 'overlap', 'main topics',
 'precise summary', 'document relevance', 'interest', 'indication',
 'information retrieval environment', 'phrases', 'keywords',
 'document content', 'documents', 'document', 'users']
# update list of learned keyphrases according to 'delete_min_df'
vectorizer.update_bow([docs[1]])
vectorizer.transform([docs[1]]).toarray()
>>> array([[5, 5]])
# check updated list of learned keyphrases (only the ones that appear more than 'delete_min_df' remain)
print(vectorizer.get_feature_names_out())
>>> ['keywords', 'document']
# update again and check the impact of 'decay' on the learned document-keyphrase matrix
vectorizer.update_bow([docs[1]])
vectorizer.X_.toarray()
>>> array([[7.5, 7.5]])
The decay of 0.5 halves the previously learned frequencies of 5 to 2.5 before the new frequencies of 5 are added, resulting in 7.5.
Back to Table of Contents
When citing KeyphraseVectorizers or PatternRank in academic papers and theses, please use the following BibTeX entry:
@conference{schopf_etal_kdir22,
author={Tim Schopf and Simon Klimek and Florian Matthes},
title={PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction},
booktitle={Proceedings of the 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2022) - KDIR},
year={2022},
pages={243-248},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011546600003335},
isbn={978-989-758-614-9},
issn={2184-3228},
}