TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob and plays nicely with them.
Cite this paper when using this library. An arXiv version is also available.
@inproceedings{marivate2020improving,
title={Improving short text classification through global augmentation methods},
author={Marivate, Vukosi and Sefara, Tshephisho},
booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
pages={385--399},
year={2020},
organization={Springer}
}
Improving short text classification through global augmentation methods.
The following packages are dependencies and will be installed automatically.
$ pip install numpy nltk gensim==3.8.3 textblob googletrans
The following code downloads the NLTK corpus for WordNet.
nltk.download('wordnet')
The following code downloads the NLTK tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
nltk.download('punkt')
The following code downloads the default NLTK part-of-speech tagging model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.
nltk.download('averaged_perceptron_tagger')
Use gensim to load a pre-trained word2vec model, such as the Google News model from Google Drive.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
You can also use gensim to load Facebook's fastText English and multilingual models.
import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')
Or train one using your own data or the following public datasets:
Text8, Wiki
Dataset from the One Billion Word Language Modeling Benchmark
Install from pip [Recommended]
$ pip install textaugment
Or install the latest release
$ pip install git+git@github.com:dsfsi/textaugment.git
Install from source
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
The following types of augmentation can be used:
from textaugment import Word2vec
from textaugment import Fasttext
from textaugment import Wordnet
from textaugment import Translate
See this notebook for an example.
Basic example
>>> from textaugment import Word2vec, Fasttext
>>> t = Word2vec(model='path/to/gensim/model' or 'gensim model itself')
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model' or 'gensim model itself')
>>> t.augment('The stories are good')
The films are good
Advanced example
>>> runs = 1 # By default.
>>> v = False # Verbose mode replaces all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5 # The probability of success of an individual trial (0.1<p<1.0); default is 0.5. Used by the geometric distribution to select words from a sentence.
>>> word = Word2vec(model='path/to/gensim/model' or 'gensim model itself', runs=5, v=False, p=0.5)
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model' or 'gensim model itself', runs=5, v=False, p=0.5)
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
Basic example
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon , John is walking to town
Advanced example
>>> v = True # Enable verb augmentation. True by default.
>>> n = False # Enable noun augmentation. False by default.
>>> runs = 1 # Number of times to augment a sentence. 1 by default.
>>> p = 0.5 # The probability of success of an individual trial (0.1<p<1.0); default is 0.5. Used by the geometric distribution to select words from a sentence.
>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon , Joseph is going to town .
Example
>>> src = "en" # source language of the sentence
>>> to = "fr" # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
https://www.aclweb.org/anthology/d19-1670.pdf
See this notebook for an example.
Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
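As a toy sketch of this idea, the following replaces one non-stop word using a hand-made synonym table. The `SYNONYMS` table, `STOP_WORDS` set, and `synonym_replacement` helper are illustrative assumptions; the library itself looks synonyms up in WordNet.

```python
import random

# Toy synonym table; the library uses WordNet synsets instead.
SYNONYMS = {
    "going": ["travelling", "heading"],
    "town": ["city", "village"],
}
STOP_WORDS = {"is", "to", "the", "a"}

def synonym_replacement(sentence, n=1, seed=None):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_replacement("John is going to town", n=1, seed=0))
```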
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
Randomly remove each word in the sentence with probability p.
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
Randomly choose two words in the sentence and swap their positions. Do this n times.
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
This is an implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.
https://aclanthology.org/2021.findings-emnlp.234.pdf
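The core of the AEDA idea can be sketched in a few lines: insert randomly chosen punctuation marks at random positions, with the number of insertions drawn relative to the sentence length. This is an illustrative re-implementation under those assumptions, not the library's own code; the `punc_ratio` parameter and its value are assumptions for the sketch.

```python
import random

PUNCTUATIONS = [".", ";", "?", ":", "!", ","]

def aeda(sentence, punc_ratio=0.3, seed=None):
    """Insert random punctuation marks at random positions in the sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    # Number of insertions is proportional to sentence length (at least one).
    n_inserts = rng.randint(1, max(1, int(punc_ratio * len(words))))
    for _ in range(n_inserts):
        pos = rng.randint(0, len(words))
        words.insert(pos, rng.choice(PUNCTUATIONS))
    return " ".join(words)

print(aeda("John is going to town", seed=1))
```

Because only punctuation is inserted, the original words survive unchanged and in order, which is what makes AEDA label-preserving for classification.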
See this notebook for an example.
Basic example
>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
This is an implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz.
Augmenting data with mixup for sentence classification: an empirical study.
Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favour simple linear behaviour in between training examples.
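The principle can be sketched as follows: draw a mixing coefficient from a Beta(alpha, alpha) distribution and form convex combinations of a batch with a shuffled copy of itself. This is a generic NumPy illustration; `mixup_batch` and its parameters are assumptions for the sketch, not the library's API.

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Return convex combinations of example pairs and their one-hot labels.

    x: inputs of shape (batch, features); y: one-hot labels (batch, classes).
    """
    rng = np.random.default_rng(rng)
    lam = rng.beta(alpha, alpha)      # mixing coefficient in (0, 1)
    perm = rng.permutation(len(x))    # random pairing of examples
    x_mixed = lam * x + (1 - lam) * x[perm]
    y_mixed = lam * y + (1 - lam) * y[perm]
    return x_mixed, y_mixed

x = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([[1.0, 0.0], [0.0, 1.0]])
xm, ym = mixup_batch(x, y, alpha=0.2, rng=0)
```

Since each mixed label is a convex combination of one-hot rows, every row of the mixed labels still sums to 1, giving a soft target for training.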
See this notebook for an example.
MIT licence. See the bundled LICENCE file for more details.