This package was developed during the writing of our PatternRank paper. You can check out the paper here. When using KeyphraseVectorizers or PatternRank in academic papers and theses, please use the BibTeX entry below.
Set of vectorizers that extract keyphrases with part-of-speech patterns from a collection of text documents and convert them into a document-keyphrase matrix. A document-keyphrase matrix is a mathematical matrix that describes the frequency of keyphrases that occur in a collection of documents. The rows of the matrix indicate the text documents and the columns indicate the unique keyphrases.
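For illustration, a toy document-keyphrase matrix for two documents and three keyphrases might look like this (the counts below are made up for this sketch, loosely mirroring the example outputs further down):

# Toy document-keyphrase matrix: rows = documents, columns = unique keyphrases.
keyphrases = ['supervised learning', 'training data', 'keywords']
document_keyphrase_matrix = [
    [3, 3, 0],  # document 1 contains 'supervised learning' and 'training data' three times each
    [0, 0, 5],  # document 2 contains 'keywords' five times
]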
The package contains wrappers of the sklearn.feature_extraction.text.CountVectorizer and sklearn.feature_extraction.text.TfidfVectorizer classes. Instead of using n-gram tokens of a predefined range, these classes extract keyphrases from text documents via part-of-speech tags to compute document-keyphrase matrices.
Corresponding Medium posts can be found here and here.
First, the document texts are annotated with spaCy part-of-speech tags. A list of all spaCy part-of-speech tags for different languages is linked here. The annotation requires passing the spaCy pipeline of the corresponding language to the vectorizer with the spacy_pipeline parameter.
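As a minimal sketch of what this annotation step produces (the package performs it internally; the pipeline name and excluded components below mirror the English defaults shown later in this README):

import spacy

# Load an English spaCy pipeline and keep only the components needed for POS tagging.
nlp = spacy.load('en_core_web_sm', exclude=['parser', 'attribute_ruler', 'lemmatizer', 'ner'])
doc = nlp('Supervised learning is the machine learning task.')
print([(token.text, token.tag_) for token in doc])
# e.g. [('Supervised', 'VBN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ...]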
Second, words are extracted from the document texts whose part-of-speech tags match the regex pattern defined in the pos_pattern parameter. The keyphrases are a list of unique words extracted from the text documents by this method.
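For illustration, the default English pattern '<J.*>*<N.*>+' selects token spans of zero or more adjectives followed by one or more nouns. The sketch below reproduces this matching with NLTK's RegexpParser chunk grammar; it only illustrates what the pattern does and is not necessarily the package's internal implementation:

from nltk import RegexpParser

# (word token, POS-tag) pairs, e.g. produced by the spaCy annotation step above.
tagged = [('A', 'DT'), ('supervised', 'JJ'), ('learning', 'NN'), ('algorithm', 'NN'),
          ('analyzes', 'VBZ'), ('the', 'DT'), ('training', 'NN'), ('data', 'NNS')]

# Chunk grammar: zero or more adjectives followed by one or more nouns.
parser = RegexpParser('KP: {<J.*>*<N.*>+}')
tree = parser.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == 'KP'):
    print(' '.join(word for word, tag in subtree.leaves()))
# supervised learning algorithm
# training data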
Finally, the vectorizers compute document-keyphrase matrices.
pip install keyphrase-vectorizers
For detailed information, visit the API Guide.
Back to Table of Contents
from keyphrase_vectorizers import KeyphraseCountVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# Init default vectorizer.
vectorizer = KeyphraseCountVectorizer()
# Print parameters
print(vectorizer.get_params())
>>> {'binary': False, 'dtype': <class 'numpy.int64'>, 'lowercase': True, 'max_df': None, 'min_df': None, 'pos_pattern': '<J.*>*<N.*>+', 'spacy_exclude': ['parser', 'attribute_ruler', 'lemmatizer', 'ner'], 'spacy_pipeline': 'en_core_web_sm', 'stop_words': 'english', 'workers': 1}

By default, the vectorizer is initialized for the English language. That means an English spacy_pipeline is specified, English stop_words are removed, and the pos_pattern extracts keyphrases that consist of 0 or more adjectives, followed by 1 or more nouns, using the English spaCy part-of-speech tags. In addition, the spaCy pipeline components ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] are excluded by default to increase efficiency. If you choose a different spacy_pipeline, it may be necessary to exclude/include different pipeline components using the spacy_exclude parameter for the spaCy POS tagger to work properly.
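These defaults can be overridden at initialization. As a hypothetical example, a vectorizer that extracts noun-only keyphrases and excludes no spaCy pipeline components could be configured as follows:

from keyphrase_vectorizers import KeyphraseCountVectorizer

# Hypothetical configuration: nouns only (no leading adjectives) and
# no excluded spaCy pipeline components (slower, but sometimes required).
custom_vectorizer = KeyphraseCountVectorizer(spacy_pipeline='en_core_web_sm',
                                             pos_pattern='<N.*>+',
                                             spacy_exclude=[])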
# After initializing the vectorizer, it can be fitted
# to learn the keyphrases from the text documents.
vectorizer.fit(docs)
# After learning the keyphrases, they can be returned.
keyphrases = vectorizer.get_feature_names_out()
print(keyphrases)
>>> ['users' 'main topics' 'learning algorithm' 'overlap' 'documents' 'output'
'keywords' 'precise summary' 'new examples' 'training data' 'input'
'document content' 'training examples' 'unseen instances'
'optimal scenario' 'document' 'task' 'supervised learning algorithm'
'example' 'interest' 'function' 'example input' 'various applications'
'unseen situations' 'phrases' 'indication' 'inductive bias'
'supervisory signal' 'document relevance' 'information retrieval' 'set'
'input object' 'groups' 'output value' 'list' 'learning' 'output pairs'
'pair' 'class labels' 'supervised learning' 'machine'
'information retrieval environment' 'algorithm' 'vector' 'way']

# After fitting, the vectorizer can transform the documents
# to a document-keyphrase matrix.
# Matrix rows indicate the documents and columns indicate the unique keyphrases.
# Each cell represents the count.
document_keyphrase_matrix = vectorizer.transform(docs).toarray()
print(document_keyphrase_matrix)
>>> [[0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
  1 1 1 3 1 0 3 1 1]
 [1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
  0 0 0 0 0 1 0 0 0]]

# Fit and transform can also be executed in one step,
# which is more efficient.
document_keyphrase_matrix = vectorizer.fit_transform(docs).toarray()
print(document_keyphrase_matrix)
>>> [[0 0 2 0 0 3 0 0 1 3 3 0 1 1 1 0 1 1 2 0 3 1 0 1 0 0 1 1 0 0 1 1 0 1 0 6
  1 1 1 3 1 0 3 1 1]
 [1 2 0 1 1 0 5 1 0 0 0 1 0 0 0 5 0 0 0 1 0 0 1 0 1 1 0 0 1 2 0 0 1 0 1 0
  0 0 0 0 0 1 0 0 0]]

Back to Table of Contents
german_docs = [ """Goethe stammte aus einer angesehenen bürgerlichen Familie.
Sein Großvater mütterlicherseits war als Stadtschultheiß höchster Justizbeamter der Stadt Frankfurt,
sein Vater Doktor der Rechte und Kaiserlicher Rat. Er und seine Schwester Cornelia erfuhren eine aufwendige
Ausbildung durch Hauslehrer. Dem Wunsch seines Vaters folgend, studierte Goethe in Leipzig und Straßburg
Rechtswissenschaft und war danach als Advokat in Wetzlar und Frankfurt tätig.
Gleichzeitig folgte er seiner Neigung zur Dichtkunst.""" ,
"""Friedrich Schiller wurde als zweites Kind des Offiziers, Wundarztes und Leiters der Hofgärtnerei in
Marbach am Neckar Johann Kaspar Schiller und dessen Ehefrau Elisabetha Dorothea Schiller, geb. Kodweiß,
die Tochter eines Wirtes und Bäckers war, 1759 in Marbach am Neckar geboren
""" ]
# Init vectorizer for the german language
vectorizer = KeyphraseCountVectorizer(spacy_pipeline='de_core_news_sm', pos_pattern='<ADJ.*>*<N.*>+', stop_words='german')

The German spacy_pipeline is specified and German stop_words are removed. Since the German spaCy part-of-speech tags differ from the English ones, the pos_pattern parameter is customized as well. The regex pattern '<ADJ.*>*<N.*>+' extracts keyphrases that consist of 0 or more adjectives, followed by 1 or more nouns, using the German spaCy part-of-speech tags.
Attention! The spaCy pipeline components ['parser', 'attribute_ruler', 'lemmatizer', 'ner'] are excluded by default to increase efficiency. If you choose a different spacy_pipeline, it may be necessary to exclude/include different pipeline components using the spacy_exclude parameter for the spaCy POS tagger to work properly.
Back to Table of Contents
The KeyphraseTfidfVectorizer has the same function calls and features as the KeyphraseCountVectorizer. The only difference is that the cells of the document-keyphrase matrix represent tf or tf-idf values, depending on the parameter settings, instead of counts.
from keyphrase_vectorizers import KeyphraseTfidfVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# Init default vectorizer for the English language that computes tf-idf values
vectorizer = KeyphraseTfidfVectorizer()
# Print parameters
print(vectorizer.get_params())
>>> {'binary': False, 'custom_pos_tagger': None, 'decay': None, 'delete_min_df': None, 'dtype': <class 'numpy.int64'>, 'lowercase': True, 'max_df': None, 'min_df': None, 'pos_pattern': '<J.*>*<N.*>+', 'spacy_exclude': ['parser', 'attribute_ruler', 'lemmatizer', 'ner', 'textcat'], 'spacy_pipeline': 'en_core_web_sm', 'stop_words': 'english', 'workers': 1}

To compute plain tf values instead, set use_idf=False.
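For example, a tf-only vectorizer can be initialized like this:

# Compute plain tf instead of tf-idf values.
tf_vectorizer = KeyphraseTfidfVectorizer(use_idf=False)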
# Fit and transform to document-keyphrase matrix.
document_keyphrase_matrix = vectorizer.fit_transform(docs).toarray()
print(document_keyphrase_matrix)
>>> [[0. 0. 0.09245003 0.09245003 0.09245003 0.09245003
0.2773501 0.09245003 0.2773501 0.2773501 0.09245003 0.
0. 0.09245003 0. 0.2773501 0.09245003 0.09245003
0. 0.09245003 0.09245003 0.09245003 0.09245003 0.09245003
0.5547002 0. 0. 0.09245003 0.09245003 0.
0.2773501 0.18490007 0.09245003 0. 0.2773501 0.
0. 0.09245003 0. 0.09245003 0. 0.
0. 0.18490007 0. ]
[ 0.11867817 0.11867817 0. 0. 0. 0.
0. 0. 0. 0. 0. 0.11867817
0.11867817 0. 0.11867817 0. 0. 0.
0.11867817 0. 0. 0. 0. 0.
0. 0.11867817 0.23735633 0. 0. 0.11867817
0. 0. 0. 0.23735633 0. 0.11867817
0.11867817 0. 0.59339083 0. 0.11867817 0.11867817
0.11867817 0. 0.59339083]]

# Return keyphrases
keyphrases = vectorizer.get_feature_names_out()
print(keyphrases)
>>> ['various applications' 'list' 'task' 'supervisory signal'
'inductive bias' 'supervised learning algorithm' 'supervised learning'
'example input' 'input' 'algorithm' 'set' 'precise summary' 'documents'
'input object' 'interest' 'function' 'class labels' 'machine'
'document content' 'output pairs' 'new examples' 'unseen situations'
'vector' 'output value' 'learning' 'document relevance' 'main topics'
'pair' 'training examples' 'information retrieval environment'
'training data' 'example' 'optimal scenario' 'information retrieval'
'output' 'groups' 'indication' 'unseen instances' 'keywords' 'way'
'phrases' 'overlap' 'users' 'learning algorithm' 'document']

Back to Table of Contents
KeyphraseVectorizers loads a spacy.Language object for each KeyphraseVectorizer object. When using several KeyphraseVectorizer objects, it is therefore more efficient to load the spacy.Language object only once and subsequently pass it to the vectorizers via the spacy_pipeline parameter:
import spacy
from keyphrase_vectorizers import KeyphraseCountVectorizer, KeyphraseTfidfVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
nlp = spacy.load("en_core_web_sm")
vectorizer1 = KeyphraseCountVectorizer(spacy_pipeline=nlp)
vectorizer2 = KeyphraseTfidfVectorizer(spacy_pipeline=nlp)
# the following calls use the nlp object
vectorizer1.fit(docs)
vectorizer2.fit(docs)

Back to Table of Contents
To use a part-of-speech tagger other than the ones provided by spaCy, a custom POS-tagger function can be defined and passed to the KeyphraseVectorizers via the custom_pos_tagger parameter. This parameter expects a callable function, which in turn needs to expect a list of strings in a 'raw_documents' parameter and has to return a list of (word token, POS-tag) tuples. If this parameter is not None, the custom tagger function is used to tag words with parts-of-speech, while the spaCy pipeline is ignored.
Flair can be installed via pip install flair.
from typing import List
import flair
from flair.models import SequenceTagger
from flair.tokenization import SegtokSentenceSplitter
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# define flair POS-tagger and splitter
tagger = SequenceTagger.load('pos')
splitter = SegtokSentenceSplitter()

# define custom POS-tagger function using flair
def custom_pos_tagger(raw_documents: List[str], tagger: flair.models.SequenceTagger = tagger, splitter: flair.tokenization.SegtokSentenceSplitter = splitter) -> List[tuple]:
"""
Important:
The mandatory 'raw_documents' parameter can NOT be named differently and has to expect a list of strings.
Any other parameter of the custom POS-tagger function can be arbitrarily defined, depending on the respective use case.
Furthermore the function has to return a list of (word token, POS-tag) tuples.
"""
# split texts into sentences
sentences = []
for doc in raw_documents :
sentences . extend ( splitter . split ( doc ))
# predict POS tags
tagger . predict ( sentences )
# iterate through sentences to get word tokens and predicted POS-tags
pos_tags = []
words = []
for sentence in sentences :
pos_tags . extend ([ label . value for label in sentence . get_labels ( 'pos' )])
words . extend ([ word . text for word in sentence ])
return list ( zip ( words , pos_tags ))
# check that the custom POS-tagger function returns a list of (word token, POS-tag) tuples
print ( custom_pos_tagger ( raw_documents = docs ))
>>> [( 'Supervised' , 'VBN' ), ( 'learning' , 'NN' ), ( 'is' , 'VBZ' ), ( 'the' , 'DT' ), ( 'machine' , 'NN' ), ( 'learning' , 'VBG' ), ( 'task' , 'NN' ), ( 'of' , 'IN' ), ( 'learning' , 'VBG' ), ( 'a' , 'DT' ), ( 'function' , 'NN' ), ( 'that' , 'WDT' ), ( 'maps' , 'VBZ' ), ( 'an' , 'DT' ), ( 'input' , 'NN' ), ( 'to' , 'IN' ), ( 'an' , 'DT' ), ( 'output' , 'NN' ), ( 'based' , 'VBN' ), ( 'on' , 'IN' ), ( 'example' , 'NN' ), ( 'input-output' , 'NN' ), ( 'pairs' , 'NNS' ), ( '.' , '.' ), ( 'It' , 'PRP' ), ( 'infers' , 'VBZ' ), ( 'a' , 'DT' ), ( 'function' , 'NN' ), ( 'from' , 'IN' ), ( 'labeled' , 'VBN' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'consisting' , 'VBG' ), ( 'of' , 'IN' ), ( 'a' , 'DT' ), ( 'set' , 'NN' ), ( 'of' , 'IN' ), ( 'training' , 'NN' ), ( 'examples' , 'NNS' ), ( '.' , '.' ), ( 'In' , 'IN' ), ( 'supervised' , 'JJ' ), ( 'learning' , 'NN' ), ( ',' , ',' ), ( 'each' , 'DT' ), ( 'example' , 'NN' ), ( 'is' , 'VBZ' ), ( 'a' , 'DT' ), ( 'pair' , 'NN' ), ( 'consisting' , 'VBG' ), ( 'of' , 'IN' ), ( 'an' , 'DT' ), ( 'input' , 'NN' ), ( 'object' , 'NN' ), ( '(' , ':' ), ( 'typically' , 'RB' ), ( 'a' , 'DT' ), ( 'vector' , 'NN' ), ( ')' , ',' ), ( 'and' , 'CC' ), ( 'a' , 'DT' ), ( 'desired' , 'VBN' ), ( 'output' , 'NN' ), ( 'value' , 'NN' ), ( '(' , ',' ), ( 'also' , 'RB' ), ( 'called' , 'VBN' ), ( 'the' , 'DT' ), ( 'supervisory' , 'JJ' ), ( 'signal' , 'NN' ), ( ')' , '-RRB-' ), ( '.' , '.' ), ( 'A' , 'DT' ), ( 'supervised' , 'JJ' ), ( 'learning' , 'NN' ), ( 'algorithm' , 'NN' ), ( 'analyzes' , 'VBZ' ), ( 'the' , 'DT' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'and' , 'CC' ), ( 'produces' , 'VBZ' ), ( 'an' , 'DT' ), ( 'inferred' , 'JJ' ), ( 'function' , 'NN' ), ( ',' , ',' ), ( 'which' , 'WDT' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'used' , 'VBN' ), ( 'for' , 'IN' ), ( 'mapping' , 'VBG' ), ( 'new' , 'JJ' ), ( 'examples' , 'NNS' ), ( '.' , '.' ), ( 'An' , 'DT' ), ( 'optimal' , 'JJ' ), ( 'scenario' , 'NN' ), ( 'will' , 'MD' ), ( 'allow' , 'VB' ), ( 'for' , 'IN' ), ( 'the' , 'DT' ), ( 'algorithm' , 'NN' ), ( 'to' , 'TO' ), ( 'correctly' , 'RB' ), ( 'determine' , 'VB' ), ( 'the' , 'DT' ), ( 'class' , 'NN' ), ( 'labels' , 'NNS' ), ( 'for' , 'IN' ), ( 'unseen' , 'JJ' ), ( 'instances' , 'NNS' ), ( '.' , '.' ), ( 'This' , 'DT' ), ( 'requires' , 'VBZ' ), ( 'the' , 'DT' ), ( 'learning' , 'NN' ), ( 'algorithm' , 'NN' ), ( 'to' , 'TO' ), ( 'generalize' , 'VB' ), ( 'from' , 'IN' ), ( 'the' , 'DT' ), ( 'training' , 'NN' ), ( 'data' , 'NNS' ), ( 'to' , 'IN' ), ( 'unseen' , 'JJ' ), ( 'situations' , 'NNS' ), ( 'in' , 'IN' ), ( 'a' , 'DT' ), ( "'" , '``' ), ( 'reasonable' , 'JJ' ), ( "'" , "''" ), ( 'way' , 'NN' ), ( '(' , ',' ), ( 'see' , 'VB' ), ( 'inductive' , 'JJ' ), ( 'bias' , 'NN' ), ( ')' , '-RRB-' ), ( '.' , '.' ), ( 'Keywords' , 'NNS' ), ( 'are' , 'VBP' ), ( 'defined' , 'VBN' ), ( 'as' , 'IN' ), ( 'phrases' , 'NNS' ), ( 'that' , 'WDT' ), ( 'capture' , 'VBP' ), ( 'the' , 'DT' ), ( 'main' , 'JJ' ), ( 'topics' , 'NNS' ), ( 'discussed' , 'VBN' ), ( 'in' , 'IN' ), ( 'a' , 'DT' ), ( 'document' , 'NN' ), ( '.' , '.' ), ( 'As' , 'IN' ), ( 'they' , 'PRP' ), ( 'offer' , 'VBP' ), ( 'a' , 'DT' ), ( 'brief' , 'JJ' ), ( 'yet' , 'CC' ), ( 'precise' , 'JJ' ), ( 'summary' , 'NN' ), ( 'of' , 'IN' ), ( 'document' , 'NN' ), ( 'content' , 'NN' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'utilized' , 'VBN' ), ( 'for' , 'IN' ), ( 'various' , 'JJ' ), ( 'applications' , 'NNS' ), ( '.' , '.' 
), ( 'In' , 'IN' ), ( 'an' , 'DT' ), ( 'information' , 'NN' ), ( 'retrieval' , 'NN' ), ( 'environment' , 'NN' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'serve' , 'VBP' ), ( 'as' , 'IN' ), ( 'an' , 'DT' ), ( 'indication' , 'NN' ), ( 'of' , 'IN' ), ( 'document' , 'NN' ), ( 'relevance' , 'NN' ), ( 'for' , 'IN' ), ( 'users' , 'NNS' ), ( ',' , ',' ), ( 'as' , 'IN' ), ( 'the' , 'DT' ), ( 'list' , 'NN' ), ( 'of' , 'IN' ), ( 'keywords' , 'NNS' ), ( 'can' , 'MD' ), ( 'quickly' , 'RB' ), ( 'help' , 'VB' ), ( 'to' , 'TO' ), ( 'determine' , 'VB' ), ( 'whether' , 'IN' ), ( 'a' , 'DT' ), ( 'given' , 'VBN' ), ( 'document' , 'NN' ), ( 'is' , 'VBZ' ), ( 'relevant' , 'JJ' ), ( 'to' , 'IN' ), ( 'their' , 'PRP$' ), ( 'interest' , 'NN' ), ( '.' , '.' ), ( 'As' , 'IN' ), ( 'keywords' , 'NNS' ), ( 'reflect' , 'VBP' ), ( 'a' , 'DT' ), ( 'document' , 'NN' ), ( "'s" , 'POS' ), ( 'main' , 'JJ' ), ( 'topics' , 'NNS' ), ( ',' , ',' ), ( 'they' , 'PRP' ), ( 'can' , 'MD' ), ( 'be' , 'VB' ), ( 'utilized' , 'VBN' ), ( 'to' , 'TO' ), ( 'classify' , 'VB' ), ( 'documents' , 'NNS' ), ( 'into' , 'IN' ), ( 'groups' , 'NNS' ), ( 'by' , 'IN' ), ( 'measuring' , 'VBG' ), ( 'the' , 'DT' ), ( 'overlap' , 'NN' ), ( 'between' , 'IN' ), ( 'the' , 'DT' ), ( 'keywords' , 'NNS' ), ( 'assigned' , 'VBN' ), ( 'to' , 'IN' ), ( 'them' , 'PRP' ), ( '.' , '.' ), ( 'Keywords' , 'NNS' ), ( 'are' , 'VBP' ), ( 'also' , 'RB' ), ( 'used' , 'VBN' ), ( 'proactively' , 'RB' ), ( 'in' , 'IN' ), ( 'information' , 'NN' ), ( 'retrieval' , 'NN' ), ( '.' , '.' )]

Once the custom POS-tagger function is defined, it can be passed to the KeyphraseVectorizers via the custom_pos_tagger parameter.
from keyphrase_vectorizers import KeyphraseCountVectorizer
# use custom POS-tagger with KeyphraseVectorizers
vectorizer = KeyphraseCountVectorizer(custom_pos_tagger=custom_pos_tagger)
vectorizer.fit(docs)
keyphrases = vectorizer.get_feature_names_out()
print(keyphrases)
>>> ['output value' 'information retrieval' 'algorithm' 'vector' 'groups'
'main topics' 'task' 'precise summary' 'supervised learning'
'inductive bias' 'information retrieval environment'
'supervised learning algorithm' 'function' 'input' 'pair'
'document relevance' 'learning' 'class labels' 'new examples' 'keywords'
'list' 'machine' 'training data' 'unseen situations' 'phrases' 'output'
'optimal scenario' 'document' 'training examples' 'documents' 'interest'
'indication' 'learning algorithm' 'inferred function'
'various applications' 'example' 'set' 'unseen instances'
'example input-output pairs' 'way' 'users' 'input object'
'supervisory signal' 'overlap' 'document content']

Back to Table of Contents
Using the KeyphraseVectorizers together with KeyBERT for keyphrase extraction results in the PatternRank approach. PatternRank can extract grammatically accurate keyphrases that are most similar to a document. Thereby, the vectorizer first extracts candidate keyphrases from the text documents, which are subsequently ranked by KeyBERT based on their document similarity. The top-n most similar keyphrases can then be considered as document keywords.
The advantage of using KeyphraseVectorizers in addition to KeyBERT is that it allows users to get grammatically correct keyphrases instead of simple n-grams of predefined lengths. In KeyBERT, users can specify the keyphrase_ngram_range to define the length of the retrieved keyphrases. However, this raises two issues. First, users usually do not know the optimal n-gram range and therefore have to spend some time experimenting until they find a suitable one. Second, even after finding a good n-gram range, the returned keyphrases are sometimes still not grammatically correct or are slightly off-key. Unfortunately, this limits the quality of the returned keyphrases.
To address this issue, we can use the vectorizers of this package to extract candidate keyphrases that consist of zero or more adjectives, followed by one or multiple nouns, in a preprocessing step instead of simple n-grams. TextRank, SingleRank, and EmbedRank have already successfully used this noun phrase approach for keyphrase extraction. The extracted candidate keyphrases are subsequently passed to KeyBERT for embedding generation and similarity calculation. To use both packages for keyphrase extraction, we need to pass KeyBERT a keyphrase vectorizer with the vectorizer parameter. Since the length of the keyphrases now depends on part-of-speech tags, there is no more need to define an n-gram length.
KeyBERT can be installed via pip install keybert.
from keyphrase_vectorizers import KeyphraseCountVectorizer
from keybert import KeyBERT
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
kw_model = KeyBERT()

Instead of deciding on a suitable n-gram range, which could be e.g. (1,2)...

>>> kw_model.extract_keywords(docs=docs, keyphrase_ngram_range=(1, 2))
[[('labeled training', 0.6013),
  ('examples supervised', 0.6112),
  ('signal supervised', 0.6152),
  ('supervised', 0.6676),
  ('supervised learning', 0.6779)],
 [('keywords assigned', 0.6354),
  ('keywords used', 0.6373),
  ('list keywords', 0.6375),
  ('keywords quickly', 0.6376),
  ('keywords defined', 0.6997)]]

...we can now let the keyphrase vectorizer decide on suitable keyphrases, without limitations to a maximum or minimum n-gram range. We only have to pass a keyphrase vectorizer as a parameter to KeyBERT:
>>> kw_model.extract_keywords(docs=docs, vectorizer=KeyphraseCountVectorizer())
[[('learning', 0.4813),
  ('training data', 0.5271),
  ('learning algorithm', 0.5632),
  ('supervised learning', 0.6779),
  ('supervised learning algorithm', 0.6992)],
 [('document content', 0.3988),
  ('information retrieval environment', 0.5166),
  ('information retrieval', 0.5792),
  ('keywords', 0.6046),
  ('document relevance', 0.633)]]

This way, we can make sure that we do not cut off important words as a result of defining our n-gram range too short. For example, we would not have found the keyphrase 'supervised learning algorithm' with keyphrase_ngram_range=(1,2). Furthermore, we avoid getting keyphrases that are slightly off-key, like 'labeled training', 'signal supervised', or 'keywords quickly'.
For more tips on how to use the KeyphraseVectorizers together with KeyBERT, visit this guide.
Back to Table of Contents
Similar to the application with KeyBERT, the KeyphraseVectorizers can be used to get grammatically correct keyphrases as descriptions for topics instead of simple n-grams. This allows us to make sure that we do not cut off important topic description keyphrases by defining our n-gram range too short. Moreover, we don't need to clean stopwords upfront, can get more precise topic models, and avoid topic description keyphrases that are slightly off-key.
BERTopic can be installed via pip install bertopic.
from keyphrase_vectorizers import KeyphraseCountVectorizer
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# load text documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# only use subset of the data
docs = docs[:5000]
# train topic model with KeyphraseCountVectorizer
keyphrase_topic_model = BERTopic(vectorizer_model=KeyphraseCountVectorizer())
keyphrase_topics, keyphrase_probs = keyphrase_topic_model.fit_transform(docs)
# get topics
>>> keyphrase_topic_model.topics
{ - 1 : [( 'file' , 0.007265527630674131 ),
( 'one' , 0.007055454904474792 ),
( 'use' , 0.00633563957153475 ),
( 'program' , 0.006053271092949018 ),
( 'get' , 0.006011060091056076 ),
( 'people' , 0.005729309058970368 ),
( 'know' , 0.005635951168273583 ),
( 'like' , 0.0055692449802916015 ),
( 'time' , 0.00527028825803415 ),
( 'us' , 0.00525564504880084 )],
0 : [( 'game' , 0.024134589719090525 ),
( 'team' , 0.021852806383170772 ),
( 'players' , 0.01749406934044139 ),
( 'games' , 0.014397938026886745 ),
( 'hockey' , 0.013932342023677305 ),
( 'win' , 0.013706115572901401 ),
( 'year' , 0.013297593024390321 ),
( 'play' , 0.012533185558169046 ),
( 'baseball' , 0.012412743802062559 ),
( 'season' , 0.011602725885164318 )],
1 : [( 'patients' , 0.022600352291162015 ),
( 'msg' , 0.02023877371575874 ),
( 'doctor' , 0.018816282737587457 ),
( 'medical' , 0.018614407917995103 ),
( 'treatment' , 0.0165028251400717 ),
( 'food' , 0.01604980195180696 ),
( 'candida' , 0.015255961242066143 ),
( 'disease' , 0.015115496310099693 ),
( 'pain' , 0.014129703072484495 ),
( 'hiv' , 0.012884503220341102 )],
2 : [( 'key' , 0.028851633177510126 ),
( 'encryption' , 0.024375137861044675 ),
( 'clipper' , 0.023565947302544528 ),
( 'privacy' , 0.019258719348097385 ),
( 'security' , 0.018983682856076434 ),
( 'chip' , 0.018822199098878365 ),
( 'keys' , 0.016060139239615384 ),
( 'internet' , 0.01450486904722165 ),
( 'encrypted' , 0.013194373119964168 ),
( 'government' , 0.01303978311708837 )],
...

The same topics look slightly different when no keyphrase vectorizer is used:
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
# load text documents
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# only use subset of the data
docs = docs[:5000]
# train topic model without KeyphraseCountVectorizer
topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
# get topics
>>> topic_model.topics
{ - 1 : [( 'the' , 0.012864641020408933 ),
( 'to' , 0.01187920529994724 ),
( 'and' , 0.011431498631699856 ),
( 'of' , 0.01099851927541331 ),
( 'is' , 0.010995478673036962 ),
( 'in' , 0.009908233622158523 ),
( 'for' , 0.009903667215879675 ),
( 'that' , 0.009619596716087699 ),
( 'it' , 0.009578499681829809 ),
( 'you' , 0.0095328846440753 )],
0 : [( 'game' , 0.013949166096523719 ),
( 'team' , 0.012458483177116456 ),
( 'he' , 0.012354733462693834 ),
( 'the' , 0.01119583508278812 ),
( '10' , 0.010190243555226108 ),
( 'in' , 0.0101436249231417 ),
( 'players' , 0.009682212470082758 ),
( 'to' , 0.00933700544705287 ),
( 'was' , 0.009172402203816335 ),
( 'and' , 0.008653375901739337 )],
1 : [( 'of' , 0.012771267188340924 ),
( 'to' , 0.012581337590513296 ),
( 'is' , 0.012554884458779008 ),
( 'patients' , 0.011983273578628046 ),
( 'and' , 0.011863499662237566 ),
( 'that' , 0.011616113472989725 ),
( 'it' , 0.011581944987387165 ),
( 'the' , 0.011475148304229873 ),
( 'in' , 0.011395485985801054 ),
( 'msg' , 0.010715000656335596 )],
2 : [( 'key' , 0.01725282988290282 ),
( 'the' , 0.014634841495851404 ),
( 'be' , 0.014429762197907552 ),
( 'encryption' , 0.013530733999898166 ),
( 'to' , 0.013443159534369817 ),
( 'clipper' , 0.01296614319927958 ),
( 'of' , 0.012164734232650158 ),
( 'is' , 0.012128295958613464 ),
( 'and' , 0.011972763728732667 ),
( 'chip' , 0.010785744492767285 )],
...

Back to Table of Contents
The KeyphraseVectorizers also support online/incremental updates of their representation (similar to the OnlineCountVectorizer). The vectorizer can not only update with out-of-vocabulary keyphrases but also implements decay and cleaning functions to prevent the sparse document-keyphrase matrix from growing too large.
Parameters for online updates:
- decay: At each iteration, we sum the document-keyphrase representation of the new documents with the document-keyphrase representation of all documents processed so far. In other words, the document-keyphrase matrix keeps growing with each iteration. However, especially in a streaming setting, older documents may become less and less relevant as time goes on. For this reason, a decay parameter was implemented that decays the document-keyphrase frequencies at each iteration before the document frequencies of new documents are added. The decay parameter is a value between 0 and 1 and indicates the percentage by which the frequencies in the previous document-keyphrase matrix are reduced. For example, a value of 0.1 will lower the frequencies in the document-keyphrase matrix by 10% at each iteration before the new document-keyphrase matrix is added (i.e., X_new = (1 - decay) * X_previous + X_batch). This ensures that recent data has more weight than previous iterations.

- delete_min_df: We can remove keyphrases that appear only infrequently from the document-keyphrase representation. The min_df parameter works well for this. However, in a streaming setting min_df does not work as well, since the frequency of a keyphrase may start out below min_df but end up higher than that over time, so setting this value high cannot always be recommended. As a result, the list of keyphrases learned by the vectorizer and the resulting document-keyphrase matrix can become quite large. Likewise, if we use the decay parameter, some values will decrease over time until they fall below min_df. For these reasons, the delete_min_df parameter was implemented. It takes positive integer values and indicates, at each iteration, which keyphrases are removed from the ones already learned. If the value is set to 5, after each iteration it is checked whether the total frequency of a keyphrase has fallen below this value. If so, the keyphrase is removed entirely from the list of keyphrases learned by the vectorizer. This helps to keep the document-keyphrase matrix at a manageable size.

from keyphrase_vectorizers import KeyphraseCountVectorizer
docs = [ """Supervised learning is the machine learning task of learning a function that
maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
In supervised learning, each example is a pair consisting of an input object
(typically a vector) and a desired output value (also called the supervisory signal).
A supervised learning algorithm analyzes the training data and produces an inferred function,
which can be used for mapping new examples. An optimal scenario will allow for the
algorithm to correctly determine the class labels for unseen instances. This requires
the learning algorithm to generalize from the training data to unseen situations in a
'reasonable' way (see inductive bias).""" ,
"""Keywords are defined as phrases that capture the main topics discussed in a document.
As they offer a brief yet precise summary of document content, they can be utilized for various applications.
In an information retrieval environment, they serve as an indication of document relevance for users, as the list
of keywords can quickly help to determine whether a given document is relevant to their interest.
As keywords reflect a document's main topics, they can be utilized to classify documents into groups
by measuring the overlap between the keywords assigned to them. Keywords are also used proactively
in information retrieval.""" ]
# Init vectorizer with online update parameters.
vectorizer = KeyphraseCountVectorizer(decay=0.5, delete_min_df=3)
# initial vectorizer fit
vectorizer.fit_transform([docs[0]]).toarray()
>>> array([[1, 1, 3, 1, 1, 3, 1, 3, 1, 1, 1, 1, 2, 1, 3, 1, 1, 1, 1, 3, 1, 3,
        1, 1, 1]])
# check learned keyphrases
print(vectorizer.get_feature_names_out())
>>> ['output pairs', 'output value', 'function', 'optimal scenario',
 'pair', 'supervised learning', 'supervisory signal', 'algorithm',
 'supervised learning algorithm', 'way', 'training examples',
 'input object', 'example', 'machine', 'output',
 'unseen situations', 'unseen instances', 'inductive bias',
 'new examples', 'input', 'task', 'training data', 'class labels',
 'set', 'vector']
# learn additional keyphrases from new documents with partial fit
vectorizer.partial_fit([docs[1]])
vectorizer.transform([docs[1]]).toarray()
>>> array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, 1, 5, 1, 1, 5, 1]])
# check learned keyphrases, including newly learned ones
print(vectorizer.get_feature_names_out())
>>> ['output pairs', 'output value', 'function', 'optimal scenario',
 'pair', 'supervised learning', 'supervisory signal', 'algorithm',
 'supervised learning algorithm', 'way', 'training examples',
 'input object', 'example', 'machine', 'output',
 'unseen situations', 'unseen instances', 'inductive bias',
 'new examples', 'input', 'task', 'training data', 'class labels',
 'set', 'vector', 'list', 'various applications',
 'information retrieval', 'groups', 'overlap', 'main topics',
 'precise summary', 'document relevance', 'interest', 'indication',
 'information retrieval environment', 'phrases', 'keywords',
 'document content', 'documents', 'document', 'users']
# update list of learned keyphrases according to 'delete_min_df'
vectorizer.update_bow([docs[1]])
vectorizer.transform([docs[1]]).toarray()
>>> array([[5, 5]])
# check updated list of learned keyphrases (only the ones that appear more than 'delete_min_df' remain)
print(vectorizer.get_feature_names_out())
>>> ['keywords', 'document']
# update again and check the impact of 'decay' on the learned document-keyphrase matrix
vectorizer.update_bow([docs[1]])
vectorizer.X_.toarray()
>>> array([[7.5, 7.5]])

With decay=0.5, the previously learned frequencies (5 for each of the two remaining keyphrases) are first halved to 2.5 before the new frequencies (again 5 each) are added, yielding 7.5.

Back to Table of Contents
When citing KeyphraseVectorizers or PatternRank in academic papers and theses, please use the following BibTeX entry:
@conference{schopf_etal_kdir22,
author={Tim Schopf and Simon Klimek and Florian Matthes},
title={PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction},
booktitle={Proceedings of the 14th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K 2022) - KDIR},
year={2022},
pages={243-248},
publisher={SciTePress},
organization={INSTICC},
doi={10.5220/0011546600003335},
isbn={978-989-758-614-9},
issn={2184-3228},
}