Paket ini dikembangkan selama penulisan kertas PatternRank kami. Anda dapat memeriksa kertas di sini. Saat menggunakan keyphrasevectorizers atau polaRank di makalah akademik dan tesis, silakan gunakan entri Bibtex di bawah ini.
Set vektoris yang mengekstrak keyphrase dengan pola bagian-of-speech dari kumpulan dokumen teks dan mengubahnya menjadi matriks dokumen-keyphrase. Matriks dokumen-keyphrase adalah matriks matematika yang menggambarkan frekuensi keyphrase yang terjadi dalam kumpulan dokumen. Baris matriks menunjukkan dokumen dan kolom teks menunjukkan keyphrase yang unik.
Paket ini berisi pembungkus sklearn.feature_extraction.text.countVectorizer dan sklearn.feature_extraction.text.tfidfvectorizer kelas. Alih-alih menggunakan token N-gram dari rentang yang telah ditentukan sebelumnya, kelas-kelas ini mengekstrak keyphrase dari dokumen teks menggunakan tag bagian-of-speech untuk menghitung matriks dokumen-keyphrase.
Posting medium yang sesuai dapat ditemukan di sini dan di sini.
Pertama, teks dokumen dianotasi dengan tag bagian spacy-speech. Daftar semua tag bagian spacy yang mungkin untuk bahasa yang berbeda ditautkan di sini. Anotasi membutuhkan melewati pipa spacy dari bahasa yang sesuai ke vektorizer dengan parameter spacy_pipeline .
Kedua, kata-kata diekstraksi dari teks dokumen yang tag bagian-dari-ucapan cocok dengan pola regex yang ditentukan dalam parameter pos_pattern . Keyphrase adalah daftar kata -kata unik yang diekstraksi dari dokumen teks dengan metode ini.
Akhirnya, vektorzer menghitung matriks dokumen-keyphrase.
pip install keyphrase-vectorizers
Untuk informasi terperinci, kunjungi Panduan API.
Kembali ke Daftar Isi
from keyphrase_vectorizers import KeyphraseCountVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# Init default vectorizer.
vectorizer = KeyphraseCountVectorizer ()
# Print parameters
print ( vectorizer . get_params ())
> >> { 'binary' : False , 'dtype' : < class 'numpy.int64' > , 'lowercase' : True , 'max_df' : None , 'min_df' : None , 'pos_pattern' : '<J.*>*<N.*>+' , 'spacy_exclude' : [ 'parser' , 'attribute_ruler' , 'lemmatizer' , 'ner' ], 'spacy_pipeline' : 'en_core_web_sm' , 'stop_words' : 'english' , 'workers' : 1 } Secara default, vektorisasi diinisialisasi untuk bahasa Inggris. Itu berarti, spacy_pipeline bahasa Inggris ditentukan, stop_words bahasa Inggris dihapus, dan pos_pattern mengekstrak kata kunci yang memiliki 0 atau lebih kata sifat, diikuti oleh 1 atau lebih kata benda menggunakan tag part-of-speech spacy bahasa Inggris. Selain itu, komponen pipa spacy ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] dikecualikan secara default untuk meningkatkan efisiensi. Jika Anda memilih spacy_pipeline yang berbeda, Anda mungkin harus mengecualikan/menyertakan komponen pipa yang berbeda menggunakan parameter spacy_exclude agar tagger POS spacy berfungsi dengan baik.
# After initializing the vectorizer, it can be fitted
# to learn the keyphrases from the text documents.
vectorizer . fit ( docs ) # After learning the keyphrases, they can be returned.
keyphrases = vectorizer . get_feature_names_out ()
print ( keyphrases )
> >> [ 'users' 'main topics' 'learning algorithm' 'overlap' 'documents' 'output'
'keywords' 'precise summary' 'new examples' 'training data' 'input'
'document content' 'training examples' 'unseen instances'
'optimal scenario' 'document' 'task' 'supervised learning algorithm'
'example' 'interest' 'function' 'example input' 'various applications'
'unseen situations' 'phrases' 'indication' 'inductive bias'
'supervisory signal' 'document relevance' 'information retrieval' 'set'
'input object' 'groups' 'output value' 'list' 'learning' 'output pairs'
'pair' 'class labels' 'supervised learning' 'machine'
'information retrieval environment' 'algorithm' 'vector' 'way' ] # After fitting, the vectorizer can transform the documents
# to a document-keyphrase matrix.
# Matrix rows indicate the documents and columns indicate the unique keyphrases.
# Each cell represents the count.
document_keyphrase_matrix = vectorizer . transform ( docs ). toarray ()
print ( document_keyphrase_matrix )
> >> [[ 0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
1 1 1 3 1 0 3 1 1 ]
[ 1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
0 0 0 0 0 1 0 0 0 ]] # Fit and transform can also be executed in one step,
# which is more efficient.
document_keyphrase_matrix = vectorizer . fit_transform ( docs ). toarray ()
print ( document_keyphrase_matrix )
> >> [[ 0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
1 1 1 3 1 0 3 1 1 ]
[ 1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
0 0 0 0 0 1 0 0 0 ]]Kembali ke Daftar Isi
german_docs = [ """Goethe stammte aus einer angesehenen bürgerlichen Familie.
Sein Großvater mütterlicherseits war als Stadtschultheiß höchster Justizbeamter der Stadt Frankfurt,
sein Vater Doktor der Rechte und Kaiserlicher Rat. Er und seine Schwester Cornelia erfuhren eine aufwendige
Ausbildung durch Hauslehrer. Dem Wunsch seines Vaters folgend, studierte Goethe in Leipzig und Straßburg
Rechtswissenschaft und war danach als Advokat in Wetzlar und Frankfurt tätig.
Gleichzeitig folgte er seiner Neigung zur Dichtkunst.""" ,
"""Friedrich Schiller wurde als zweites Kind des Offiziers, Wundarztes und Leiters der Hofgärtnerei in
Marbach am Neckar Johann Kaspar Schiller und dessen Ehefrau Elisabetha Dorothea Schiller, geb. Kodweiß,
die Tochter eines Wirtes und Bäckers war, 1759 in Marbach am Neckar geboren
""" ]
# Init vectorizer for the german language
vectorizer = KeyphraseCountVectorizer ( spacy_pipeline = 'de_core_news_sm' , pos_pattern = '<ADJ.*>*<N.*>+' , stop_words = 'german' ) spacy_pipeline Jerman ditentukan dan stop_words Jerman dihapus. Karena tag bagian spacy Jerman berbeda dari yang bahasa Inggris, parameter pos_pattern juga disesuaikan. Pola regex <ADJ.*>*<N.*>+ Mengekstrak kata kunci yang memiliki 0 atau lebih kata sifat, diikuti oleh 1 atau lebih kata benda menggunakan tag bagian spacy spacy Jerman.
Perhatian! Komponen pipa spacy ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] dikecualikan secara default untuk meningkatkan efisiensi. Jika Anda memilih spacy_pipeline yang berbeda, Anda mungkin harus mengecualikan/menyertakan komponen pipa yang berbeda menggunakan parameter spacy_exclude agar tagger POS spacy berfungsi dengan baik.
Kembali ke Daftar Isi
KeyphraseTfidfVectorizer memiliki panggilan dan fitur fungsi yang sama dengan KeyphraseCountVectorizer . Satu-satunya perbedaan adalah, bahwa sel-sel matriks dokumen-keyphrase mewakili nilai TF atau TF-IDF, tergantung pada pengaturan parameter, alih-alih jumlah.
from keyphrase_vectorizers import KeyphraseTfidfVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# Init default vectorizer for the English language that computes tf-idf values
vectorizer = KeyphraseTfidfVectorizer ()
# Print parameters
print ( vectorizer . get_params ())
> >> { 'binary' : False , 'custom_pos_tagger' : None , 'decay' : None , 'delete_min_df' : None , 'dtype' : <
class 'numpy.int64' > , 'lowercase' : True , 'max_df' : None
, 'min_df' : None , 'pos_pattern' : '<J.*>*<N.*>+' , 'spacy_exclude' : [ 'parser' , 'attribute_ruler' , 'lemmatizer' , 'ner' ,
'textcat' ], 'spacy_pipeline' : 'en_core_web_sm' , 'stop_words' : 'english' , 'workers' : 1 } Untuk menghitung nilai TF sebagai gantinya, atur use_idf=False .
# Fit and transform to document-keyphrase matrix.
document_keyphrase_matrix = vectorizer . fit_transform ( docs ). toarray ()
print ( document_keyphrase_matrix )
> >> [[ 0. 0. 0.09245003 0.09245003 0.09245003 0.09245003
0.2773501 0.09245003 0.2773501 0.2773501 0.09245003 0.
0. 0.09245003 0. 0.2773501 0.09245003 0.09245003
0. 0.09245003 0.09245003 0.09245003 0.09245003 0.09245003
0.5547002 0. 0. 0.09245003 0.09245003 0.
0.2773501 0.18490007 0.09245003 0. 0.2773501 0.
0. 0.09245003 0. 0.09245003 0. 0.
0. 0.18490007 0. ]
[ 0.11867817 0.11867817 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.11867817
0.11867817 0. 0.11867817 0. 0. 0.
0.11867817 0. 0. 0. 0. 0.
0. 0.11867817 0.23735633 0. 0. 0.11867817
0. 0. 0. 0.23735633 0. 0.11867817
0.11867817 0. 0.59339083 0. 0.11867817 0.11867817
0.11867817 0. 0.59339083 ]] # Return keyphrases
keyphrases = vectorizer . get_feature_names_out ()
print ( keyphrases )
> >> [ 'various applications' 'list' 'task' 'supervisory signal'
'inductive bias' 'supervised learning algorithm' 'supervised learning'
'example input' 'input' 'algorithm' 'set' 'precise summary' 'documents'
'input object' 'interest' 'function' 'class labels' 'machine'
'document content' 'output pairs' 'new examples' 'unseen situations'
'vector' 'output value' 'learning' 'document relevance' 'main topics'
'pair' 'training examples' 'information retrieval environment'
'training data' 'example' 'optimal scenario' 'information retrieval'
'output' 'groups' 'indication' 'unseen instances' 'keywords' 'way'
'phrases' 'overlap' 'users' 'learning algorithm' 'document' ]Kembali ke Daftar Isi
Keyphrasevectorizers memuat spacy.Language Objek bahasa untuk setiap objek KeyphraseVectorizer . Saat menggunakan beberapa objek KeyphraseVectorizer , lebih efisien untuk memuat spacy.Language Objek bahasa sebelumnya dan meneruskannya sebagai argumen spacy_pipeline .
import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer , KeyphraseTfidfVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
nlp = spacy . load ( "en_core_web_sm" )
vectorizer1 = KeyphraseCountVectorizer ( spacy_pipeline = nlp )
vectorizer2 = KeyphraseTfidfVectorizer ( spacy_pipeline = nlp )
# the following calls use the nlp object
vectorizer1 . fit ( docs )
vectorizer2 . fit ( docs )Kembali ke Daftar Isi
Untuk menggunakan tagger bagian yang berbeda dari yang disediakan oleh Spacy, fungsi pos-tagger khusus dapat didefinisikan dan diteruskan ke KeyphrasEvectorizers melalui parameter custom_pos_tagger . Parameter ini mengharapkan fungsi yang dapat dipanggil yang pada gilirannya perlu mengharapkan daftar string dalam parameter 'raw_documents' dan harus mengembalikan daftar (token kata, pos-tag) tupel. Jika parameter ini tidak ada, fungsi tagger khusus digunakan untuk menandai kata-kata dengan bagian-of-speech, sedangkan pipa spacy diabaikan.
Flair dapat diinstal melalui pip install flair .
from typing import List
import flair
from flair . models import SequenceTagger
from flair . tokenization import SegtokSentenceSplitter
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# define flair POS-tagger and splitter
tagger = SequenceTagger . load ( 'pos' )
splitter = SegtokSentenceSplitter ()
# define custom POS-tagger function using flair
def custom_pos_tagger ( raw_documents : List [ str ], tagger : flair . models . SequenceTagger = tagger , splitter : flair . tokenization . SegtokSentenceSplitter = splitter ) -> List [ tuple ]:
"""
Important:
The mandatory 'raw_documents' parameter can NOT be named differently and has to expect a list of strings.
Any other parameter of the custom POS-tagger function can be arbitrarily defined, depending on the respective use case.
Furthermore the function has to return a list of (word token, POS-tag) tuples.
"""
# split texts into sentences
sentences = []
for doc in raw_documents :
sentences . extend ( splitter . split ( doc ))
# predict POS tags
tagger . predict ( sentences )
# iterate through sentences to get word tokens and predicted POS-tags
pos_tags = []
words = []
for sentence in sentences :
pos_tags . extend ([ label . value for label in sentence . get_labels ( 'pos' )])
words . extend ([ word . text for word in sentence ])
return list ( zip ( words , pos_tags ))
# check that the custom POS-tagger function returns a list of (word token, POS-tag) tuples
print ( custom_pos_tagger ( raw_documents = docs ))
> >> [( 'Supervised' , 'VBN' ), ( 'learning' , 'NN' ), ( 'is' , 'VBZ' ), ( 'the' , 'DT' ), ( 'machine' , 'NN' ), ( 'learning' , 'VBG' ), ( 'task' , 'NN' ), ( 'of' , 'IN' ), ( 'learning' , 'VBG' ), ( 'a' , 'DT' ), ( 'function' , 'NN' ), ( 'that' , 'WDT' ), ( 'maps' , 'VBZ' ), ( 'an' , 'DT' ), ( 'input' , 'NN' ), ( 'to' , 'IN' ), ( 'an' , 'DT' ), ( 'output' , 'NN' ), ( 'based' , 'VBN' ), ( 'on' , 'IN' ), ( 'example' , 'NN' ), ( 'input-output' , 'NN' ), ( 'pairs' , 'NNS' ), ( '.' , '.' ), ( 'It' , 'PRP' ), ( 'infers' , 'VBZ' ), ( 'a' , 'DT' ), ( 'function' , 'NN' ), ( 'from' , 'IN' ), ( 'labeled' , 'VBN' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'consisting' , 'VBG' ), ( 'of' , 'IN' ), ( 'a' , 'DT' ), ( 'set' , 'NN' ), ( 'of' , 'IN' ), ( 'training' , 'NN' ), ( 'examples' , 'NNS' ), ( '.' , '.' ), ( 'In' , 'IN' ), ( 'supervised' , 'JJ' ), ( 'learning' , 'NN' ), ( ',' , ',' ), ( 'each' , 'DT' ), ( 'example' , 'NN' ), ( 'is' , 'VBZ' ), ( 'a' , 'DT' ), ( 'pair' , 'NN' ), ( 'consisting' , 'VBG' ), ( 'of' , 'IN' ), ( 'an' , 'DT' ), ( 'input' , 'NN' ), ( 'object' , 'NN' ), ( '(' , ':' ), ( 'typically' , 'RB' ), ( 'a' , 'DT' ), ( 'vector' , 'NN' ), ( ')' , ',' ), ( 'and' , 'CC' ), ( 'a' , 'DT' ), ( 'desired' , 'VBN' ), ( 'output' , 'NN' ), ( 'value' , 'NN' ), ( '(' , ',' ), ( 'also' , 'RB' ), ( 'called' , 'VBN' ), ( 'the' , 'DT' ), ( 'supervisory' , 'JJ' ), ( 'signal' , 'NN' ), ( ')' , '-RRB-' ), ( '.' , '.' ), ( 'A' , 'DT' ), ( 'supervised' , 'JJ' ), ( 'learning' , 'NN' ), ( 'algorithm' , 'NN' ), ( 'analyzes' , 'VBZ' ), ( 'the' , 'DT' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'and' , 'CC' ), ( 'produces' , 'VBZ' ), ( 'an' , 'DT' ), ( 'inferred' , 'JJ' ), ( 'function' , 'NN' ), ( ',' , ',' ), ( 'which' , 'WDT' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'used' , 'VBN' ), ( 'for' , 'IN' ), ( 'mapping' , 'VBG' ), ( 'new' , 'JJ' ), ( 'examples' , 'NNS' ), ( '.' , '.' ), ( 'An' , 'DT' ), ( 'optimal' , 'JJ' ), ( 'scenario' , 'NN' ), ( 'will' , 'MD' ), ( 'allow' , 'VB' ), ( 'for' , 'IN' ), ( 'the' , 'DT' ), ( 'algorithm' , 'NN' ), ( 'to' , 'TO' ), ( 'correctly' , 'RB' ), ( 'determine' , 'VB' ), ( 'the' , 'DT' ), ( 'class' , 'NN' ), ( 'labels' , 'NNS' ), ( 'for' , 'IN' ), ( 'unseen' , 'JJ' ), ( 'instances' , 'NNS' ), ( '.' , '.' ), ( 'This' , 'DT' ), ( 'requires' , 'VBZ' ), ( 'the' , 'DT' ), ( 'learning' , 'NN' ), ( 'algorithm' , 'NN' ), ( 'to' , 'TO' ), ( 'generalize' , 'VB' ), ( 'from' , 'IN' ), ( 'the' , 'DT' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'to' , 'IN' ), ( 'unseen' , 'JJ' ), ( 'situations' , 'NNS' ), ( 'in' , 'IN' ), ( 'a' , 'DT' ), ( "'" , '``' ), ( 'reasonable' , 'JJ' ), ( "'" , "''" ), ( 'way' , 'NN' ), ( '(' , ',' ), ( 'see' , 'VB' ), ( 'inductive' , 'JJ' ), ( 'bias' , 'NN' ), ( ')' , '-RRB-' ), ( '.' , '.' ), ( 'Keywords' , 'NNS' ), ( 'are' , 'VBP' ), ( 'defined' , 'VBN' ), ( 'as' , 'IN' ), ( 'phrases' , 'NNS' ), ( 'that' , 'WDT' ), ( 'capture' , 'VBP' ), ( 'the' , 'DT' ), ( 'main' , 'JJ' ), ( 'topics' , 'NNS' ), ( 'discussed' , 'VBN' ), ( 'in' , 'IN' ), ( 'a' , 'DT' ), ( 'document' , 'NN' ), ( '.' , '.' ), ( 'As' , 'IN' ), ( 'they' , 'PRP' ), ( 'offer' , 'VBP' ), ( 'a' , 'DT' ), ( 'brief' , 'JJ' ), ( 'yet' , 'CC' ), ( 'precise' , 'JJ' ), ( 'summary' , 'NN' ), ( 'of' , 'IN' ), ( 'document' , 'NN' ), ( 'content' , 'NN' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'utilized' , 'VBN' ), ( 'for' , 'IN' ), ( 'various' , 'JJ' ), ( 'applications' , 'NNS' ), ( '.' , '.' ), ( 'In' , 'IN' ), ( 'an' , 'DT' ), ( 'information' , 'NN' ), ( 'retrieval' , 'NN' ), ( 'environment' , 'NN' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'serve' , 'VBP' ), ( 'as' , 'IN' ), ( 'an' , 'DT' ), ( 'indication' , 'NN' ), ( 'of' , 'IN' ), ( 'document' , 'NN' ), ( 'relevance' , 'NN' ), ( 'for' , 'IN' ), ( 'users' , 'NNS' ), ( ',' , ',' ), ( 'as' , 'IN' ), ( 'the' , 'DT' ), ( 'list' , 'NN' ), ( 'of' , 'IN' ), ( 'keywords' , 'NNS' ), ( 'can' , 'MD' ), ( 'quickly' , 'RB' ), ( 'help' , 'VB' ), ( 'to' , 'TO' ), ( 'determine' , 'VB' ), ( 'whether' , 'IN' ), ( 'a' , 'DT' ), ( 'given' , 'VBN' ), ( 'document' , 'NN' ), ( 'is' , 'VBZ' ), ( 'relevant' , 'JJ' ), ( 'to' , 'IN' ), ( 'their' , 'PRP$' ), ( 'interest' , 'NN' ), ( '.' , '.' ), ( 'As' , 'IN' ), ( 'keywords' , 'NNS' ), ( 'reflect' , 'VBP' ), ( 'a' , 'DT' ), ( 'document' , 'NN' ), ( "'s" , 'POS' ), ( 'main' , 'JJ' ), ( 'topics' , 'NNS' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'utilized' , 'VBN' ), ( 'to' , 'TO' ), ( 'classify' , 'VB' ), ( 'documents' , 'NNS' ), ( 'into' , 'IN' ), ( 'groups' , 'NNS' ), ( 'by' , 'IN' ), ( 'measuring' , 'VBG' ), ( 'the' , 'DT' ), ( 'overlap' , 'NN' ), ( 'between' , 'IN' ), ( 'the' , 'DT' ), ( 'keywords' , 'NNS' ), ( 'assigned' , 'VBN' ), ( 'to' , 'IN' ), ( 'them' , 'PRP' ), ( '.' , '.' ), ( 'Keywords' , 'NNS' ), ( 'are' , 'VBP' ), ( 'also' , 'RB' ), ( 'used' , 'VBN' ), ( 'proactively' , 'RB' ), ( 'in' , 'IN' ), ( 'information' , 'NN' ), ( 'retrieval' , 'NN' ), ( '.' , '.' )] Setelah fungsi pos-tagger khusus didefinisikan, dapat diteruskan ke keyphrasevectorizers melalui parameter custom_pos_tagger .
from keyphrase_vectorizers import KeyphraseCountVectorizer
# use custom POS-tagger with KeyphraseVectorizers
vectorizer = KeyphraseCountVectorizer ( custom_pos_tagger = custom_pos_tagger )
vectorizer . fit ( docs )
keyphrases = vectorizer . get_feature_names_out ()
print ( keyphrases )
> >> [ 'output value' 'information retrieval' 'algorithm' 'vector' 'groups'
'main topics' 'task' 'precise summary' 'supervised learning'
'inductive bias' 'information retrieval environment'
'supervised learning algorithm' 'function' 'input' 'pair'
'document relevance' 'learning' 'class labels' 'new examples' 'keywords'
'list' 'machine' 'training data' 'unseen situations' 'phrases' 'output'
'optimal scenario' 'document' 'training examples' 'documents' 'interest'
'indication' 'learning algorithm' 'inferred function'
'various applications' 'example' 'set' 'unseen instances'
'example input-output pairs' 'way' 'users' 'input object'
'supervisory signal' 'overlap' 'document content' ]Kembali ke Daftar Isi
Menggunakan vektoris keyphrase bersama -sama dengan keybert untuk hasil ekstraksi keyphrase dalam pendekatan PatternRank. PatternRank dapat mengekstraksi keyphrase yang benar secara tata bahasa yang paling mirip dengan dokumen. Dengan demikian, Vectorizer pertama mengekstrak kandidat keyphrase dari dokumen teks, yang kemudian diperingkat oleh Keybert berdasarkan kesamaan dokumen mereka. Top-n yang paling mirip dengan keyphrase yang paling mirip kemudian dapat dianggap sebagai kata kunci dokumen.
Keuntungan menggunakan keyphrasevectorizers selain keybert adalah memungkinkan pengguna untuk mendapatkan keyphrase yang benar secara tata bahasa alih-alih n-gram sederhana dengan panjang yang telah ditentukan sebelumnya. Di Keybert, pengguna dapat menentukan keyphrase_ngram_range untuk menentukan panjang keyphrase yang diambil. Namun, ini menimbulkan dua masalah. Pertama, pengguna biasanya tidak tahu rentang N-gram yang optimal dan karenanya harus menghabiskan waktu bereksperimen sampai mereka menemukan rentang N-gram yang sesuai. Kedua, bahkan setelah menemukan kisaran N-gram yang baik, keyphrase yang dikembalikan kadang-kadang masih secara tata bahasa tidak benar atau sedikit off-key. Sayangnya, ini membatasi kualitas keyphrase yang dikembalikan.
Untuk menangani masalah ini, kita dapat menggunakan vektorisasi paket ini untuk terlebih dahulu mengekstrak kandidat keyphrase yang terdiri dari nol atau lebih kata sifat, diikuti oleh satu atau beberapa kata benda dalam langkah pra-pemrosesan alih-alih n-gram sederhana. Textrank, Singlerank, dan Embedrank sudah berhasil menggunakan pendekatan frasa kata benda ini untuk ekstraksi keyphrase. Kandidat yang diekstraksi keyphrase kemudian diteruskan ke Keybert untuk penghasilan yang menanamkan dan perhitungan kesamaan. Untuk menggunakan kedua paket untuk ekstraksi keyphrase, kita perlu melewati keybert vektorisasi keyphrase dengan parameter vectorizer . Karena panjang keyphrases sekarang tergantung pada tag bagian-dari-pidato, tidak perlu mendefinisikan panjang n-gram lagi.
Keybert dapat diinstal melalui pip install keybert .
from keyphrase_vectorizers import KeyphraseCountVectorizer
from keybert import KeyBERT
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
kw_model = KeyBERT ()Alih-alih memutuskan kisaran N-gram yang cocok yang bisa misalnya (1,2) ...
> >> kw_model . extract_keywords ( docs = docs , keyphrase_ngram_range = ( 1 , 2 ))
[[( 'labeled training' , 0.6013 ),
( 'examples supervised' , 0.6112 ),
( 'signal supervised' , 0.6152 ),
( 'supervised' , 0.6676 ),
( 'supervised learning' , 0.6779 )],
[( 'keywords assigned' , 0.6354 ),
( 'keywords used' , 0.6373 ),
( 'list keywords' , 0.6375 ),
( 'keywords quickly' , 0.6376 ),
( 'keywords defined' , 0.6997 )]]Kita sekarang dapat membiarkan vektorizer keyphrase memutuskan keyphrase yang sesuai, tanpa batasan pada kisaran n-gram maksimum atau minimum. Kami hanya perlu melewati vektorisasi keyphrase sebagai parameter ke Keybert:
> >> kw_model . extract_keywords ( docs = docs , vectorizer = KeyphraseCountVectorizer ())
[[( 'learning' , 0.4813 ),
( 'training data' , 0.5271 ),
( 'learning algorithm' , 0.5632 ),
( 'supervised learning' , 0.6779 ),
( 'supervised learning algorithm' , 0.6992 )],
[( 'document content' , 0.3988 ),
( 'information retrieval environment' , 0.5166 ),
( 'information retrieval' , 0.5792 ),
( 'keywords' , 0.6046 ),
( 'document relevance' , 0.633 )]] Ini memungkinkan kami untuk memastikan bahwa kami tidak memotong kata-kata penting yang disebabkan oleh mendefinisikan kisaran N-gram kami terlalu pendek. Misalnya, kami tidak akan menemukan "algoritma pembelajaran yang diawasi" dengan keyphrase_ngram_range=(1,2) . Selain itu, kami menghindari untuk mendapatkan keyphrase yang sedikit tidak aktif seperti "pelatihan berlabel", "sinyal diawasi" atau "kata kunci dengan cepat".
Untuk tips lebih lanjut tentang cara menggunakan keyphrasevectorizers bersama dengan Keybert, kunjungi panduan ini.
Kembali ke Daftar Isi
Mirip dengan aplikasi dengan keybert, vektoris keyphrase dapat digunakan untuk mendapatkan keyphrase yang benar secara tata bahasa sebagai deskripsi untuk topik alih-alih n-gram sederhana. Ini memungkinkan kita untuk memastikan bahwa kita tidak memotong topik penting deskripsi keyphrase dengan mendefinisikan rentang n-gram kita terlalu pendek. Selain itu, kita tidak perlu membersihkan stopwords di muka, bisa mendapatkan model topik yang lebih tepat dan hindari untuk mendapatkan deskripsi topik keyphrases yang sedikit tidak aktif.
Bertopic dapat diinstal melalui pip install bertopic .
from keyphrase_vectorizers import KeyphraseCountVectorizer
from bertopic import BERTopic
from sklearn . datasets import fetch_20newsgroups
# load text documents
docs = fetch_20newsgroups ( subset = 'all' , remove = ( 'headers' , 'footers' , 'quotes' ))[ 'data' ]
# only use subset of the data
docs = docs [: 5000 ]
# train topic model with KeyphraseCountVectorizer
keyphrase_topic_model = BERTopic ( vectorizer_model = KeyphraseCountVectorizer ())
keyphrase_topics , keyphrase_probs = keyphrase_topic_model . fit_transform ( docs )
# get topics
> >> keyphrase_topic_model . topics
{ - 1 : [( 'file' , 0.007265527630674131 ),
( 'one' , 0.007055454904474792 ),
( 'use' , 0.00633563957153475 ),
( 'program' , 0.006053271092949018 ),
( 'get' , 0.006011060091056076 ),
( 'people' , 0.005729309058970368 ),
( 'know' , 0.005635951168273583 ),
( 'like' , 0.0055692449802916015 ),
( 'time' , 0.00527028825803415 ),
( 'us' , 0.00525564504880084 )],
0 : [( 'game' , 0.024134589719090525 ),
( 'team' , 0.021852806383170772 ),
( 'players' , 0.01749406934044139 ),
( 'games' , 0.014397938026886745 ),
( 'hockey' , 0.013932342023677305 ),
( 'win' , 0.013706115572901401 ),
( 'year' , 0.013297593024390321 ),
( 'play' , 0.012533185558169046 ),
( 'baseball' , 0.012412743802062559 ),
( 'season' , 0.011602725885164318 )],
1 : [( 'patients' , 0.022600352291162015 ),
( 'msg' , 0.02023877371575874 ),
( 'doctor' , 0.018816282737587457 ),
( 'medical' , 0.018614407917995103 ),
( 'treatment' , 0.0165028251400717 ),
( 'food' , 0.01604980195180696 ),
( 'candida' , 0.015255961242066143 ),
( 'disease' , 0.015115496310099693 ),
( 'pain' , 0.014129703072484495 ),
( 'hiv' , 0.012884503220341102 )],
2 : [( 'key' , 0.028851633177510126 ),
( 'encryption' , 0.024375137861044675 ),
( 'clipper' , 0.023565947302544528 ),
( 'privacy' , 0.019258719348097385 ),
( 'security' , 0.018983682856076434 ),
( 'chip' , 0.018822199098878365 ),
( 'keys' , 0.016060139239615384 ),
( 'internet' , 0.01450486904722165 ),
( 'encrypted' , 0.013194373119964168 ),
( 'government' , 0.01303978311708837 )],
...Topik yang sama terlihat sedikit berbeda ketika tidak ada vektorisasi keyphrase digunakan:
from bertopic import BERTopic
from sklearn . datasets import fetch_20newsgroups
# load text documents
docs = fetch_20newsgroups ( subset = 'all' , remove = ( 'headers' , 'footers' , 'quotes' ))[ 'data' ]
# only use subset of the data
docs = docs [: 5000 ]
# train topic model without KeyphraseCountVectorizer
topic_model = BERTopic ()
topics , probs = topic_model . fit_transform ( docs )
# get topics
> >> topic_model . topics
{ - 1 : [( 'the' , 0.012864641020408933 ),
( 'to' , 0.01187920529994724 ),
( 'and' , 0.011431498631699856 ),
( 'of' , 0.01099851927541331 ),
( 'is' , 0.010995478673036962 ),
( 'in' , 0.009908233622158523 ),
( 'for' , 0.009903667215879675 ),
( 'that' , 0.009619596716087699 ),
( 'it' , 0.009578499681829809 ),
( 'you' , 0.0095328846440753 )],
0 : [( 'game' , 0.013949166096523719 ),
( 'team' , 0.012458483177116456 ),
( 'he' , 0.012354733462693834 ),
( 'the' , 0.01119583508278812 ),
( '10' , 0.010190243555226108 ),
( 'in' , 0.0101436249231417 ),
( 'players' , 0.009682212470082758 ),
( 'to' , 0.00933700544705287 ),
( 'was' , 0.009172402203816335 ),
( 'and' , 0.008653375901739337 )],
1 : [( 'of' , 0.012771267188340924 ),
( 'to' , 0.012581337590513296 ),
( 'is' , 0.012554884458779008 ),
( 'patients' , 0.011983273578628046 ),
( 'and' , 0.011863499662237566 ),
( 'that' , 0.011616113472989725 ),
( 'it' , 0.011581944987387165 ),
( 'the' , 0.011475148304229873 ),
( 'in' , 0.011395485985801054 ),
( 'msg' , 0.010715000656335596 )],
2 : [( 'key' , 0.01725282988290282 ),
( 'the' , 0.014634841495851404 ),
( 'be' , 0.014429762197907552 ),
( 'encryption' , 0.013530733999898166 ),
( 'to' , 0.013443159534369817 ),
( 'clipper' , 0.01296614319927958 ),
( 'of' , 0.012164734232650158 ),
( 'is' , 0.012128295958613464 ),
( 'and' , 0.011972763728732667 ),
( 'chip' , 0.010785744492767285 )],
...Kembali ke Daftar Isi
KeyphrasEvectorizers juga mendukung pembaruan online/tambahan dari representasi mereka (mirip dengan OnlinecountVectorizer). Vektorisasi tidak hanya dapat memperbarui keyphrase out-of-vocabulary tetapi juga mengimplementasikan fungsi pembusukan dan pembersihan untuk mencegah matriks dokumen-keyphrases yang jarang menjadi terlalu besar.
Parameter untuk pembaruan online:
decay : Pada setiap iterasi, kami menyimpulkan representasi dokumen-keyphrase dari dokumen baru dengan representasi dokumen-keyphrase dari semua dokumen yang diproses sejauh ini. Dengan kata lain, matriks dokumen-keyphrase terus meningkat dengan setiap iterasi. Namun, terutama dalam pengaturan streaming, dokumen yang lebih tua mungkin menjadi kurang dan kurang relevan seiring berjalannya waktu. Oleh karena itu, parameter peluruhan diimplementasikan yang meluruh frekuensi dokumen-keyphrase pada setiap iterasi sebelum menambahkan frekuensi dokumen dokumen baru. Parameter peluruhan adalah nilai antara 0 dan 1 dan menunjukkan persentase frekuensi matriks dokumen-keyphrase sebelumnya harus dikurangi menjadi. Misalnya, nilai .1 akan mengurangi frekuensi dalam matriks dokumen-keyphrase sebesar 10% pada setiap iterasi sebelum menambahkan matriks dokumen-keyphrase baru. Ini akan memastikan bahwa data terbaru memiliki bobot lebih dari iterasi sebelumnya.delete_min_df : Kami mungkin ingin menghapus keyphrase dari representasi dokumen-keyphrase yang jarang muncul. Parameter min_df bekerja cukup baik untuk itu. Namun, ketika kami memiliki pengaturan streaming, min_df tidak berfungsi dengan baik karena frekuensi keyphrases mungkin dimulai di bawah min_df tetapi akan berakhir lebih tinggi dari itu dari waktu ke waktu. Menetapkan nilai yang tinggi mungkin tidak selalu disarankan. Akibatnya, daftar keyphrase yang dipelajari oleh vektorisasi dan matriks dokumen-keyphrase yang dihasilkan dapat menjadi cukup besar. Demikian pula, jika kami mengimplementasikan parameter decay , maka beberapa nilai akan berkurang dari waktu ke waktu hingga di bawah min_df . Untuk alasan ini, parameter delete_min_df diimplementasikan. Parameter mengambil bilangan bulat positif dan menunjukkan, pada setiap iterasi, keyphrase mana yang akan dihapus dari yang sudah dipelajari. Jika nilainya diatur ke 5, itu akan memeriksa setelah setiap iterasi jika frekuensi total keyphrase terlampaui oleh nilai itu. Jika demikian, keyphrase akan dihapus secara keseluruhan dari daftar keyphrase yang dipelajari oleh vektorisasi. Ini membantu menjaga matriks dokumen-keyphrase dengan ukuran yang dapat dikelola. from keyphrase_vectorizers import KeyphraseCountVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# Init default vectorizer.
vectorizer = KeyphraseCountVectorizer ( decay = 0.5 , delete_min_df = 3 )
# intitial vectorizer fit
vectorizer . fit_transform ([ docs [ 0 ]]). toarray ()
> >> array ([[ 1 , 1 , 3 , 1 , 1 , 3 , 1 , 3 , 1 , 1 , 1 , 1 , 2 , 1 , 3 , 1 , 1 , 1 , 1 , 3 , 1 , 3 ,
1 , 1 , 1 ]])
# check learned keyphrases
print ( vectorizer . get_feature_names_out ())
> >> [ 'output pairs' , 'output value' , 'function' , 'optimal scenario' ,
'pair' , 'supervised learning' , 'supervisory signal' , 'algorithm' ,
'supervised learning algorithm' , 'way' , 'training examples' ,
'input object' , 'example' , 'machine' , 'output' ,
'unseen situations' , 'unseen instances' , 'inductive bias' ,
'new examples' , 'input' , 'task' , 'training data' , 'class labels' ,
'set' , 'vector' ]
# learn additional keyphrases from new documents with partial fit
vectorizer . partial_fit ([ docs [ 1 ]])
vectorizer . transform ([ docs [ 1 ]]). toarray ()
> >> array ([[ 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 , 0 ,
0 , 0 , 0 , 1 , 1 , 2 , 1 , 1 , 2 , 1 , 1 , 1 , 1 , 1 , 1 , 5 , 1 , 1 , 5 , 1 ]])
# check learned keyphrases, including newly learned ones
print ( vectorizer . get_feature_names_out ())
> >> [ 'output pairs' , 'output value' , 'function' , 'optimal scenario' ,
'pair' , 'supervised learning' , 'supervisory signal' , 'algorithm' ,
'supervised learning algorithm' , 'way' , 'training examples' ,
'input object' , 'example' , 'machine' , 'output' ,
'unseen situations' , 'unseen instances' , 'inductive bias' ,
'new examples' , 'input' , 'task' , 'training data' , 'class labels' ,
'set' , 'vector' , 'list' , 'various applications' ,
'information retrieval' , 'groups' , 'overlap' , 'main topics' ,
'precise summary' , 'document relevance' , 'interest' , 'indication' ,
'information retrieval environment' , 'phrases' , 'keywords' ,
'document content' , 'documents' , 'document' , 'users' ]
# update list of learned keyphrases according to 'delete_min_df'
vectorizer . update_bow ([ docs [ 1 ]])
vectorizer . transform ([ docs [ 1 ]]). toarray ()
> >> array ([[ 5 , 5 ]])
# check updated list of learned keyphrases (only the ones that appear more than 'delete_min_df' remain)
print ( vectorizer . get_feature_names_out ())
> >> [ 'keywords' , 'document' ]
# update again and check the impact of 'decay' on the learned document-keyphrase matrix
vectorizer . update_bow ([ docs [ 1 ]])
vectorizer . X_ . toarray ()
> >> array ([[ 7.5 , 7.5 ]])Kembali ke Daftar Isi
Saat mengutip keyphrasevectorizers atau polaRank di makalah dan tesis akademik, silakan gunakan entri Bibtex ini:
@conference{schopf_etal_kdir22,
author={Tim Schopf and Simon Klimek and Florian Matthes},
title={PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction},
booktitle={Proceedings of the 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2022) - KDIR},
year={2022},
pages={243-248},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011546600003335},
isbn={978-989-758-614-9},
issn={2184-3228},
}