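Contextualized Topic Models

Contextualized Topic Models (CTM) are a family of topic models that use pre-trained representations of language (e.g., BERT) to support topic modeling. See the papers for details:

Bianchi, F., Terragni, S., & Hovy, D. (2021). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. ACL. https://aclanthology.org/2021.acl-short.96/
Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. EACL. https://www.aclweb.org/anthology/2021.eacl-main.143/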

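Topic Modeling with Contextualized Embeddings

Our new topic modeling family supports many different languages (i.e., any language supported by HuggingFace models) and comes in two versions: CombinedTM combines contextual embeddings with the good old bag of words to produce more coherent topics; ZeroShotTM is the right topic model for tasks in which the test data may contain words missing from training and, if trained with multilingual embeddings, it inherits the property of being a multilingual topic model!

The big advantage is that you can use different embeddings for CTMs. When a new embedding method comes out, you can use it in the code and improve your results. We are no longer limited by the BoW.

We also have Kitty! A new submodule that can be used to create a human-in-the-loop classifier to quickly classify your documents and create named clusters.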

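Tutorials

You can look at our Medium blog post or start from one of our Colab tutorials:

Name	Link
Combined TM on Wikipedia Data (Preproc+Saving+Viz) (stable v2.3.0)
Zero-Shot Cross-lingual Topic Modeling (Preproc+Viz) (stable v2.3.0)
Kitty: Human in the Loop Classifier (High-level usage) (stable v2.2.0)
SuperCTM and β-CTM (High-level usage) (stable v2.2.0)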

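Overview

TL;DR

In CTMs we have two models, CombinedTM and ZeroShotTM, which have different use cases.
CTMs work better when the size of the bag of words is restricted to a number of terms that does not exceed 2000 elements. This is because we have a neural model that reconstructs the input bag of words. Moreover, in CombinedTM we project the contextualized embedding to the vocabulary space: the bigger the vocabulary, the more parameters you get, and training becomes more difficult and prone to bad fitting. This is NOT a strict limit, but do consider preprocessing your dataset. We have a preprocessing pipeline that can help you with this.
Check the contextual model you are using: a multilingual model applied to English data might not give results as good as a purely English-trained one.
Preprocessing is key. If you give a contextual model such as BERT preprocessed text, it might be difficult to get a good representation out of it. What we usually do is use the preprocessed text to create the bag of words and the unpreprocessed text to create the BERT embeddings. Our preprocessing class can take care of this for you.
CTM uses SBERT; you should check it out to better understand how we create embeddings. SBERT allows us to use any embedding model. You might want to check things like the maximum sequence length.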
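
To make the point about vocabulary size concrete, here is a small arithmetic sketch. It assumes, purely for illustration, that the projection from the contextualized embedding to the vocabulary space is a single dense layer with a bias term; the actual network in the library may be structured differently, and `projection_parameters` is a hypothetical helper, not part of the package.

```python
def projection_parameters(contextual_size: int, bow_size: int) -> int:
    """Weights plus biases of one dense layer mapping contextual_size -> bow_size.

    Illustrative assumption only: a single linear projection with bias.
    """
    return contextual_size * bow_size + bow_size


# Compare a vocabulary of 2,000 terms with one ten times larger,
# using a 768-dimensional contextualized embedding.
for bow_size in (2_000, 20_000):
    n_params = projection_parameters(768, bow_size)
    print(f"bow_size={bow_size:>6}: {n_params:,} parameters")
```

Under this assumption the projection alone grows from about 1.5 million to about 15 million parameters when the vocabulary grows from 2,000 to 20,000 terms, which is why trimming the vocabulary helps.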

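Installing

Important: if you want to use CUDA, you need to install the correct version of the CUDA system that matches your distribution; see PyTorch.

Install the package using pip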

pip install -U contextualized_topic_models
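
After installing, a quick check that the package and PyTorch are importable can save some debugging. The following is a minimal sketch using only the standard library; `is_installed` is a hypothetical helper introduced here for illustration.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located by the import system."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the two packages this README relies on can be found.
for name in ("contextualized_topic_models", "torch"):
    status = "found" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If `torch` is found, `torch.cuda.is_available()` reports whether PyTorch can see a CUDA device.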

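Models

An important aspect to take into account is which network you want to use: the one that combines contextualized embeddings and the BoW (CombinedTM), or the one that uses only contextualized embeddings (ZeroShotTM).

Keep in mind that zero-shot cross-lingual topic modeling is only possible with the ZeroShotTM model.

Contextualized Topic Models also support supervision (SuperCTM). You can read more about this in the documentation.

We also have Kitty: a utility you can use to do a simpler human-in-the-loop classification of your documents. This can be very useful for document filtering. It also works in cross-lingual settings, so you can filter documents in a language you don't know!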

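References

If you find this useful, you can cite the following papers :)

ZeroShotTM

@inproceedings{bianchi-etal-2021-cross,
    title = "Cross-lingual Contextualized Topic Models with Zero-shot Learning",
    author = "Bianchi, Federico and Terragni, Silvia and Hovy, Dirk and
      Nozza, Debora and Fersini, Elisabetta",
    booktitle = "Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2021.eacl-main.143",
    pages = "1676--1683",
}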

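CombinedTM

@inproceedings{bianchi-etal-2021-pre,
    title = "Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence",
    author = "Bianchi, Federico and
      Terragni, Silvia and
      Hovy, Dirk",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-short.96",
    doi = "10.18653/v1/2021.acl-short.96",
    pages = "759--766",
}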

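Language-Specific and Multilingual

Some of the examples below use a multilingual embedding model, paraphrase-multilingual-mpnet-base-v2. This means that the representations you are going to use are multilingual. However, you might need broader language coverage or just one specific language. Refer to the page in the documentation to see how to choose a model for another language. In that case, you can check SBERT to find the best model to use.

The documentation has more on language-specific and multilingual models.

Quick Overview

You should definitely take a look at the documentation to better understand how these topic models work.

Combined Topic Model

Here is how you can use CombinedTM. This is a standard topic model that also uses contextualized embeddings. The good thing about CombinedTM is that it makes your topics much more coherent (see the paper https://arxiv.org/abs/2004.03974). n_components=50 specifies the number of topics.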

from contextualized_topic_models.models.ctm import CombinedTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file

qt = TopicModelDataPreparation("all-mpnet-base-v2")

# list_of_unpreprocessed_documents and list_of_preprocessed_documents are lists of strings you provide
training_dataset = qt.fit(text_for_contextual=list_of_unpreprocessed_documents, text_for_bow=list_of_preprocessed_documents)

ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)  # 50 topics

ctm.fit(training_dataset)  # run the model

ctm.get_topics(2)

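Advanced note: CombinedTM combines the BoW with SBERT, a process that seems to increase the coherence of the predicted topics (https://arxiv.org/pdf/2004.03974.pdf).

Zero-Shot Topic Model

Our ZeroShotTM can be used for zero-shot topic modeling. It can handle words that were not used during the training phase. More interestingly, this model can be used for cross-lingual topic modeling (see the next sections)! See the paper (https://www.aclweb.org/anthology/2021.eacl-main.143).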

from contextualized_topic_models.models.ctm import ZeroShotTM
from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation
from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file

text_for_contextual = [
    "hello, this is unpreprocessed text you can give to the model",
    "have fun with our topic model",
]

text_for_bow = [
    "hello unpreprocessed give model",
    "fun topic model",
]

qt = TopicModelDataPreparation("paraphrase-multilingual-mpnet-base-v2")

training_dataset = qt.fit(text_for_contextual=text_for_contextual, text_for_bow=text_for_bow)

ctm = ZeroShotTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)

ctm.fit(training_dataset)  # run the model

ctm.get_topics(2)

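As you can see, the high-level API for handling text is pretty easy to use. text_for_contextual should be used to pass the model a list of documents that are not preprocessed. To text_for_bow, instead, you should pass the preprocessed text used to build the BoW.

Advanced note: in this way, SBERT can use all the information in the text to generate the representations.

Using the Topic Models

Getting the Topics

Once the model is trained, it is very easy to get the topics!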

ctm.get_topics()
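
For a readable listing, the topics can be printed one per line. The sketch below assumes that get_topics returns a mapping from topic index to a list of top words; a plain dictionary with made-up words stands in for the real result, and `format_topics` is a hypothetical helper rather than part of the package.

```python
from typing import Dict, List

# Stand-in for the result of ctm.get_topics(3); the words are invented for illustration.
topics = {
    0: ["film", "actor", "director"],
    1: ["river", "lake", "valley"],
}


def format_topics(topics: Dict[int, List[str]]) -> List[str]:
    """Render each topic as 'Topic <id>: word1, word2, ...', ordered by topic id."""
    return [
        f"Topic {topic_id}: {', '.join(words)}"
        for topic_id, words in sorted(topics.items())
    ]


for line in format_topics(topics):
    print(line)
```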

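Predicting Topics for Unseen Documents

The transform method takes care of most things for you, for example generating the corresponding BoW by considering only the words the model saw during training. However, this comes with some bumps when dealing with ZeroShotTM, as we will see in the next section.

You can, however, manually load the embeddings if you like (see the Advanced part of this documentation).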
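
To illustrate what "considering only the words the model saw during training" means, here is a minimal standard-library sketch. It is not the library's implementation; `bow_vector` and `train_vocab` are hypothetical names, and whitespace tokenization is an assumption made for brevity.

```python
from collections import Counter
from typing import List


def bow_vector(text: str, vocab: List[str]) -> List[int]:
    """Count occurrences of each vocabulary word in `text`; words outside `vocab` are ignored."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


train_vocab = ["fun", "model", "topic"]  # hypothetical vocabulary learned at training time

# Words such as "with", "a" and "new" are not in the vocabulary and are dropped.
print(bow_vector("fun topic model with a new topic", train_vocab))  # [1, 1, 2]

# A document with no known words maps to an all-zero vector.
print(bow_vector("hola bienvenido", train_vocab))  # [0, 0, 0]
```

The vector always has the length of the training vocabulary, which is what keeps test-time inputs compatible with the trained model.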

Mono-Lingual Topic Modeling

If you use CombinedTM, you need to include the test text for the BoW:

testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual, text_for_bow=testing_text_for_bow)

# n_samples: how many times to sample the distribution (see the docs)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)  # returns a (n_documents, n_topics) matrix with the topic distribution of each document
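
A common next step is to pick the most likely topic for each document from the (n_documents, n_topics) matrix. The sketch below uses a plain list of lists with made-up numbers as a stand-in for the returned matrix; `most_likely_topics` is a hypothetical helper.

```python
from typing import List


def most_likely_topics(doc_topic: List[List[float]]) -> List[int]:
    """Return, for each document (row), the index of its highest-probability topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


# Stand-in for a (n_documents, n_topics) matrix; the numbers are invented.
doc_topic = [
    [0.10, 0.70, 0.20],
    [0.55, 0.05, 0.40],
]

print(most_likely_topics(doc_topic))  # [1, 0]
```

If the matrix is a NumPy array, `argmax(axis=1)` gives the same result.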

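If you use ZeroShotTM, you do not need testing_text_for_bow: with a different set of test documents, it would create a BoW of a different size. The best way to proceed is therefore to pass only the text that will be given as input to the contextual model: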

testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual)

# n_samples: how many times to sample the distribution (see the docs)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)

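Cross-Lingual Topic Modeling

Once you have trained the ZeroShotTM model with multilingual embeddings, you can use this simple pipeline to predict the topics of documents in a different language (as long as that language is covered by paraphrase-multilingual-mpnet-base-v2).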

# here we have a Spanish document
testing_text_for_contextual = [
    "hola, bienvenido",
]

# Since we are doing multilingual topic modeling, we do not need the BoW in
# ZeroShotTM for cross-lingual experiments: the model was trained with an
# English BoW, so using a Spanish BoW would not make sense.
testing_dataset = qt.transform(testing_text_for_contextual)

# n_samples: how many times to sample the distribution (see the docs)
ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)  # returns a (n_documents, n_topics) matrix with the topic distribution of each document

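Advanced note: we do not need to pass the Spanish bag of words, because the bags of words of the two languages are not comparable! We pass it to the model for compatibility reasons, but you cannot take the output of the model (i.e., the predicted BoW of the training language) and compare it with that of the testing language.

More Advanced Stuff

Preprocessing

Do you need a quick script to run the preprocessing pipeline? We've got you covered! Load your documents and then use our simple preprocessing class. It will automatically filter infrequent words and remove documents that are empty after preprocessing. The preprocess method returns both the preprocessed and the unpreprocessed documents. We generally use the unpreprocessed documents for BERT and the preprocessed ones for the bag of words.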

from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing

# read one document per line; the context manager closes the file afterwards
with open("unpreprocessed_documents.txt") as f:
    documents = [line.strip() for line in f]
sp = WhiteSpacePreprocessing(documents, "english")
preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()
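
To show the idea behind this kind of filtering, here is a rough standard-library sketch. It is not the library's implementation (which does more, such as stop-word handling); `cap_vocabulary` is a hypothetical function, and lowercasing plus whitespace tokenization are simplifying assumptions. It keeps the most frequent words, drops documents left empty, and returns the indices of the retained documents so the two text lists stay aligned.

```python
from collections import Counter
from typing import List, Tuple


def cap_vocabulary(
    documents: List[str], vocabulary_size: int
) -> Tuple[List[str], List[str], List[str], List[int]]:
    """Keep only the `vocabulary_size` most frequent words and drop documents left empty."""
    tokenized = [doc.lower().split() for doc in documents]
    counts = Counter(word for tokens in tokenized for word in tokens)
    vocab = {word for word, _ in counts.most_common(vocabulary_size)}

    preprocessed, unpreprocessed, retained = [], [], []
    for index, tokens in enumerate(tokenized):
        kept = [word for word in tokens if word in vocab]
        if kept:  # skip documents with no in-vocabulary words
            preprocessed.append(" ".join(kept))
            unpreprocessed.append(documents[index])
            retained.append(index)
    return preprocessed, unpreprocessed, sorted(vocab), retained


docs = ["topic models find topic structure", "hello", "models of structure"]
preprocessed, unpreprocessed, vocab, retained = cap_vocabulary(docs, vocabulary_size=3)
print(preprocessed)  # ['topic models topic structure', 'models structure']
print(retained)      # [0, 2]
```

The second document contains no retained word, so it is removed from both lists; the retained indices make it possible to filter any labels or metadata in the same way.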

Using Custom Embeddings with Kitty

Do you have custom embeddings and want to use them for faster results? Just give them to Kitty!

from contextualized_topic_models.models.kitty_classifier import Kitty
import numpy as np

# read the training data, one document per line
with open("train_data") as f:
    training_data = [line.strip() for line in f]
custom_embeddings = np.load("custom_embeddings.npy")

kt = Kitty()
kt.train(training_data, custom_embeddings=custom_embeddings, stopwords_list=["stopwords"])

print(kt.pretty_print_word_classes())

Note: custom embeddings must be NumPy arrays.

Development Team

Federico Bianchi, Bocconi University
Silvia Terragni, University of Milan-Bicocca
Dirk Hovy, Bocconi University

Software Details

Free software: MIT license
Documentation: https://contextualized-topic-models.readthedocs.io.
Super big shout-out to Stephen Carrow for creating the awesome https://github.com/estebandito22/pytorchavitm package, from which we built the foundations of this package. We are happy to redistribute this software again under the MIT License.