TextAugment is a Python 3 library for augmenting text for natural language processing applications. TextAugment stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob, and plays nicely with them.
Cite this paper when using this library. An Arxiv version is available.
@inproceedings{marivate2020improving,
title={Improving short text classification through global augmentation methods},
author={Marivate, Vukosi and Sefara, Tshephisho},
booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
pages={385--399},
year={2020},
organization={Springer}
}
Improving short text classification through global augmentation methods.
The following packages are dependencies and will be installed automatically.
$ pip install numpy nltk gensim==3.8.3 textblob googletrans
The following code downloads the WordNet corpus for NLTK.
nltk.download('wordnet')
The following code downloads the NLTK tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences.
nltk.download('punkt')
The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.
nltk.download('averaged_perceptron_tagger')
Use gensim to load a pre-trained word2vec model, such as the Google News vectors from Google Drive.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)
You can also use gensim to load Facebook's Fasttext English and multilingual models:
import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')
Or train one on your own data, or on one of the following public datasets:
Text8 Wiki
Dataset from "One Billion Word Language Modeling Benchmark"
Install from pip [Recommended]:
$ pip install textaugment
Or install the latest release:
$ pip install git+git@github.com:dsfsi/textaugment.git
Install from source:
$ git clone git@github.com:dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install
The following types of augmentation can be used:
from textaugment import Word2vec
from textaugment import Fasttext
from textaugment import Wordnet
from textaugment import Translate
See this notebook for an example.
Basic example
>>> from textaugment import Word2vec, Fasttext
>>> t = Word2vec(model='path/to/gensim/model' or 'gensim model itself')
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model' or 'gensim model itself')
>>> t.augment('The stories are good')
The films are good
Advanced example
>>> runs = 1  # By default.
>>> v = False  # Verbose mode replaces all the words; when enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf).
>>> p = 0.5  # The probability of success of an individual trial (0.1 < p < 1.0), 0.5 by default. Used by the geometric distribution to select words from a sentence.
>>> word = Word2vec(model='path/to/gensim/model' or 'gensim model itself', runs=5, v=False, p=0.5)
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model' or 'gensim model itself', runs=5, v=False, p=0.5)
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
Basic example
>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon , John is walking to town
Advanced example
>>> v = True  # Enable verb augmentation. True by default.
>>> n = False  # Enable noun augmentation. False by default.
>>> runs = 1  # Number of times to augment a sentence. 1 by default.
>>> p = 0.5  # The probability of success of an individual trial (0.1 < p < 1.0), 0.5 by default. Used by the geometric distribution to select words from a sentence.
>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon , Joseph is going to town .
Example
>>> src = "en"  # Source language of the sentence.
>>> to = "fr"  # Target language.
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
https://www.aclweb.org/anthology/d19-1670.pdf
See this notebook for an example.
Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
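This rule can be sketched in plain Python. The synonym table and stop-word list below are hypothetical stand-ins for the WordNet lookups the library performs, and `synonym_replacement` here is an illustrative standalone function, not the library's API:

```python
import random

# Hypothetical synonym table and stop-word list (the library uses WordNet).
SYNONYMS = {"good": ["fine", "dependable"], "town": ["township"]}
STOP_WORDS = {"is", "to", "the", "are"}

def synonym_replacement(sentence, n=1, seed=None):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    # Candidate positions: non-stop words for which we know a synonym.
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

print(synonym_replacement("John is going to town", n=1, seed=0))
# → John is going to township
```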
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
Randomly remove each word in the sentence with probability p.
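A minimal sketch of this deletion rule in plain Python (`random_deletion` here is an illustrative standalone function, not the library's implementation):

```python
import random

def random_deletion(sentence, p=0.2, seed=None):
    """Delete each word independently with probability p; keep at least one word."""
    rng = random.Random(seed)
    words = sentence.split()
    kept = [w for w in words if rng.random() > p]
    # If everything was deleted, fall back to a single random word.
    return " ".join(kept) if kept else rng.choice(words)
```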
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
Randomly choose two words in the sentence and swap their positions. Do this n times.
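A minimal sketch of this swap rule in plain Python (again, an illustrative standalone function rather than the library's implementation):

```python
import random

def random_swap(sentence, n=1, seed=None):
    """Swap the positions of two randomly chosen words, n times."""
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)
```

Swapping never adds or removes words, so the result is always a permutation of the input sentence.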
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
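A minimal sketch of this insertion rule in plain Python. As with the synonym-replacement sketch, the synonym table and stop-word list are hypothetical stand-ins for WordNet, and `random_insertion` is an illustrative standalone function:

```python
import random

# Hypothetical synonym table and stop-word list (the library uses WordNet).
SYNONYMS = {"going": ["moving"], "town": ["township"]}
STOP_WORDS = {"is", "to"}

def random_insertion(sentence, n=1, seed=None):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n):
        candidates = [w for w in words
                      if w.lower() in SYNONYMS and w.lower() not in STOP_WORDS]
        if not candidates:
            break
        synonym = rng.choice(SYNONYMS[rng.choice(candidates).lower()])
        words.insert(rng.randrange(len(words) + 1), synonym)
    return " ".join(words)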
Basic example
>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
This is an implementation of AEDA (Karimi et al.), a variant of EDA. It is based on the random insertion of punctuation marks.
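The idea can be sketched as follows. The punctuation list is the one used in the AEDA paper; `punct_insertion` here is an illustrative standalone function, not the library's implementation:

```python
import random

PUNCT = [".", ";", "?", ":", "!", ","]  # marks used in the AEDA paper

def punct_insertion(sentence, ratio=0.3, seed=None):
    """Insert between 1 and ratio*len(words) punctuation marks at random positions."""
    rng = random.Random(seed)
    words = sentence.split()
    n = rng.randint(1, max(1, int(ratio * len(words))))
    for _ in range(n):
        words.insert(rng.randrange(len(words) + 1), rng.choice(PUNCT))
    return " ".join(words)
```

Because only punctuation is inserted, stripping the punctuation from the output always recovers the original sentence, which is what makes AEDA label-preserving.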
https://aclanthology.org/2021.findings-emnlp.234.pdf
See this notebook for an example.
Basic example
>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
This is an implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz.
Used in "Augmenting Data with Mixup for Sentence Classification: An Empirical Study".
Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularises the neural network to favour simple linear behaviour in between training examples.
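The convex combination at the heart of mixup can be sketched in a few lines of plain Python (`mixup` here is an illustrative standalone function operating on plain lists, not the library's implementation):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, seed=None):
    """Convex combination of two feature vectors and their one-hot labels."""
    rng = random.Random(seed)
    # The mixing weight is drawn from Beta(alpha, alpha), as in the mixup paper.
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y
```

Because lam lies in (0, 1), the mixed label is still a valid probability distribution whenever the two input labels are, which is why the network can be trained on the mixed pairs directly.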
See this notebook for an example.
MIT licensed. See the bundled LICENSE file for more details.