TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob and plays nicely with them.
Cite this paper when using this library. An arXiv version is also available.
@inproceedings{marivate2020improving,
title={Improving short text classification through global augmentation methods},
author={Marivate, Vukosi and Sefara, Tshephisho},
booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
pages={385--399},
year={2020},
organization={Springer}
}
Improving short text classification through global augmentation methods.
The following packages are dependencies and will be installed automatically.
$ pip install numpy nltk gensim==3.8.3 textblob googletrans
The following code downloads the NLTK corpus for WordNet.
nltk.download('wordnet')
The following code downloads the NLTK tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
nltk.download('punkt')
The following code downloads the default NLTK part-of-speech tagging model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.
nltk.download('averaged_perceptron_tagger')
Use gensim to load a pre-trained word2vec model, such as the Google News model from Google Drive.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
You can also use gensim to load Facebook's fastText English and multilingual models.
import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')
Or train one using your own data or the following public datasets:
Text8, Wiki
Dataset from the One Billion Word Language Modeling Benchmark
Install from pip [Recommended]
$ pip install textaugment
Or install the latest release
$ pip install git+git@github.com:dsfsi/textaugment.git
Install from source
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
The following types of augmentation can be used:
from textaugment import Word2vec
from textaugment import Fasttext
from textaugment import Wordnet
from textaugment import Translate
See this notebook for an example.
Basic example
>>> from textaugment import Word2vec, Fasttext
>>> t = Word2vec(model='path/to/gensim/model' or 'gensim model itself')
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model' or 'gensim model itself')
>>> t.augment('The stories are good')
The films are good
Advanced example
>>> runs = 1 # By default.
>>> v = False # Verbose mode replaces all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5 # The probability of success of an individual trial (0.1<p<1.0); default is 0.5. Used by the geometric distribution to select words from a sentence.
>>> word = Word2vec(model='path/to/gensim/model' or 'gensim model itself', runs=5, v=False, p=0.5)
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model' or 'gensim model itself', runs=5, v=False, p=0.5)
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
Basic example
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon , John is walking to town
Advanced example
>>> v = True # Enable verb augmentation. True by default.
>>> n = False # Enable noun augmentation. False by default.
>>> runs = 1 # Number of times to augment a sentence. 1 by default.
>>> p = 0.5 # The probability of success of an individual trial (0.1<p<1.0); default is 0.5. Used by the geometric distribution to select words from a sentence.
>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon , Joseph is going to town .
Example
>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
https://www.aclweb.org/anthology/d19-1670.pdf
See this notebook for an example.
Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
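As a toy sketch of this idea, the following replaces one non-stop word using a hand-made synonym table. The `SYNONYMS` table, `STOP_WORDS` set, and `synonym_replacement` helper are illustrative assumptions; the library itself looks synonyms up in WordNet.

```python
import random

# Toy synonym table; the library uses WordNet synsets instead.
SYNONYMS = {
    "going": ["travelling", "heading"],
    "town": ["city", "village"],
}
STOP_WORDS = {"is", "to", "the", "a"}

def synonym_replacement(sentence, n=1, seed=None):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_replacement("John is going to town", n=1, seed=0))
```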
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
Randomly remove each word in the sentence with probability p.
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
Randomly choose two words in the sentence and swap their positions. Do this n times.
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
This is an implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.
https://aclanthology.org/2021.findings-emnlp.234.pdf
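The core of the AEDA idea can be sketched in a few lines: insert randomly chosen punctuation marks at random positions, with the number of insertions drawn relative to the sentence length. This is an illustrative re-implementation under those assumptions, not the library's own code; the `punc_ratio` parameter and its value are assumptions for the sketch.

```python
import random

PUNCTUATIONS = [".", ";", "?", ":", "!", ","]

def aeda(sentence, punc_ratio=0.3, seed=None):
    """Insert random punctuation marks at random positions in the sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    # Number of insertions is proportional to sentence length (at least one).
    n_inserts = rng.randint(1, max(1, int(punc_ratio * len(words))))
    for _ in range(n_inserts):
        pos = rng.randint(0, len(words))
        words.insert(pos, rng.choice(PUNCTUATIONS))
    return " ".join(words)

print(aeda("John is going to town", seed=1))
```

Because only punctuation is inserted, the original words survive unchanged and in order, which is what makes AEDA label-preserving for classification.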
See this notebook for an example.
Basic example
>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
This is an implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz.
Augmenting data with mixup for sentence classification: an empirical study.
Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favour simple linear behaviour in between training examples.
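The principle can be sketched as follows: draw a mixing coefficient from a Beta(alpha, alpha) distribution and form convex combinations of a batch with a shuffled copy of itself. This is a generic NumPy illustration; `mixup_batch` and its parameters are assumptions for the sketch, not the library's API.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Return convex combinations of example pairs and their one-hot labels.

    x: inputs of shape (batch, features); y: one-hot labels (batch, classes).
    """
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)      # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))    # random pairing of examples
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed

x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])
xm, ym = mixup_batch(x, y, alpha=0.2, rng=0)
```

Since each mixed label is a convex combination of one-hot rows, every row of the mixed labels still sums to 1, giving a soft target for training.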
See this notebook for an example.
MIT licence. See the bundled LICENCE file for more details.