textaugment

Version 2.0.0 (16-11-2023)

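TextAugment: Improving Short Text Classification through Global Augmentation Methods

You have just found TextAugment.

TextAugment is a Python 3 library for augmenting text for natural language processing applications. It stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob and plays nicely with them.

Acknowledgements

Cite this paper when using this library. An arXiv version is also available.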

@inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}

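Features

Generates synthetic data to improve model performance without manual effort.
Simple, lightweight, and easy to use.
Plugs into any machine learning framework (e.g. PyTorch, TensorFlow, scikit-learn).
Supports textual data.

Citation Paper

Improving short text classification through global augmentation methods.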

Requirements

Python 3

The following software packages are dependencies and will be installed automatically.

$ pip install numpy nltk gensim==3.8.3 textblob googletrans
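
To confirm that the dependencies listed above are importable in the current environment, a small check along these lines may help. This is only a sketch: `missing_packages` is a hypothetical helper and not part of textaugment, and the import names are assumed to match the package names in the pip command.

```python
import importlib.util

# Import names of the dependencies listed above (assumed to match the pip names).
DEPENDENCIES = ["numpy", "nltk", "gensim", "textblob", "googletrans"]


def missing_packages(names):
    """Return the names that cannot be found as importable top-level modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(DEPENDENCIES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```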

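The following code downloads the NLTK corpus for WordNet.

import nltk
nltk.download('wordnet')

The following code downloads the NLTK tokenizer. This tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

nltk.download('punkt')

The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

nltk.download('averaged_perceptron_tagger')

Use gensim to load a pre-trained word2vec model, such as the Google News model available from Google Drive.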

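import gensim
# Load the pre-trained Google News vectors (binary word2vec format).
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

You can also use gensim to load Facebook's fastText English and multilingual models:

import gensim
# Load a pre-trained fastText model stored in Facebook's native binary format.
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')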

Or train one from scratch using your own data or one of the following public datasets:

Text8 Wiki
Dataset from "One Billion Word Language Modeling Benchmark"

Installation

Install from pip [Recommended]

$ pip install textaugment
or install the latest release:
$ pip install git+https://github.com/dsfsi/textaugment.git

Install from source

$ git clone https://github.com/dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install

Usage

The following types of augmentation can be used:

word2vec

from textaugment import Word2vec

fastText

from textaugment import Fasttext

wordnet

from textaugment import Wordnet

translate (this requires internet access)

from textaugment import Translate

fastText/word2vec-based augmentation

See this notebook for an example.

Basic example

>>> from textaugment import Word2vec, Fasttext
>>> # `model` accepts either a path to a gensim model or the loaded gensim model itself.
>>> t = Word2vec(model='path/to/gensim/model')
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model')
>>> t.augment('The stories are good')
The films are good
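
The output above suggests that a word is swapped for a close neighbour in the embedding space. The following is a minimal sketch of that idea, not the library's implementation: it uses hand-made toy vectors in place of a real gensim model, and `TOY_VECTORS`, `most_similar`, and `replace_word` are hypothetical names introduced for illustration.

```python
import math

# Toy 3-dimensional vectors standing in for a trained embedding model (assumption).
TOY_VECTORS = {
    "stories": [0.9, 0.1, 0.0],
    "films": [0.8, 0.2, 0.1],
    "movies": [0.7, 0.3, 0.1],
    "town": [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(word, vectors, top_n=1):
    """Return up to top_n other words ranked by cosine similarity to `word`."""
    target = vectors[word]
    scored = [(cosine(target, vec), other)
              for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:top_n]]


def replace_word(sentence, word, vectors):
    """Replace `word` with its nearest neighbour, if the word has a vector."""
    if word not in vectors:
        return sentence
    neighbour = most_similar(word, vectors, top_n=1)[0]
    return " ".join(neighbour if tok == word else tok for tok in sentence.split())


print(replace_word("The stories are good", "stories", TOY_VECTORS))
```

With a real model, the toy dictionary would be replaced by the vectors loaded through gensim.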

Advanced example

>>> runs = 1  # Default value.
>>> v = False  # Verbose mode: replace all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf).
>>> p = 0.5  # Probability of success of an individual trial (0.1 < p < 1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> # `model` accepts either a path to a gensim model or the loaded gensim model itself.
>>> word = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model', runs=5, v=False, p=0.5)
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
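
The p parameter is described as the success probability of a geometric distribution used to select words. One plausible reading, sketched here as an assumption rather than as the library's actual code, is to draw one geometric variate per word and keep the words whose first trial succeeds, which happens with probability p. The names `geometric_draw` and `select_words` are hypothetical.

```python
import random


def geometric_draw(p, rng):
    """Number of Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials


def select_words(words, p=0.5, rng=None):
    """Keep the words whose geometric draw equals 1 (first trial succeeded)."""
    rng = rng if rng is not None else random.Random()
    return [w for w in words if geometric_draw(p, rng) == 1]


# Words selected this way would then be candidates for replacement.
print(select_words("The stories are good".split(), p=0.5, rng=random.Random(0)))
```

Under this reading, a larger p selects more words per sentence on average.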

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon , John is walking to town

Advanced example

>>> v = True  # Enable verb augmentation. Default is True.
>>> n = False  # Enable noun augmentation. Default is False.
>>> runs = 1  # Number of times to augment a sentence. Default is 1.
>>> p = 0.5  # Probability of success of an individual trial (0.1 < p < 1.0). Default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon , Joseph is going to town .

RTT-based augmentation

Example

>>> src = "en"  # Source language of the sentence.
>>> to = "fr"  # Target language.
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town
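
RTT presumably stands for round-trip translation: the sentence is translated into the target language and then back into the source language, and the back-translation serves as the paraphrase. The sketch below illustrates the idea offline with a toy phrase table. `PHRASES`, `toy_translate`, and `round_trip` are hypothetical stand-ins; a real setup would call a translation service, which is why internet access is required.

```python
# Toy phrase table standing in for a real translation service (assumption).
PHRASES = {
    ("en", "fr"): {"John is going to town": "John va en ville"},
    ("fr", "en"): {"John va en ville": "John goes to town"},
}


def toy_translate(text, src, to):
    """Look the text up in the toy phrase table; return it unchanged if unknown."""
    return PHRASES.get((src, to), {}).get(text, text)


def round_trip(text, src, to, translate=toy_translate):
    """Translate src -> to -> src and return the back-translated text."""
    pivot = translate(text, src, to)
    return translate(pivot, to, src)


print(round_trip("John is going to town", src="en", to="fr"))
```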

EDA: Easy data augmentation techniques for boosting performance on text classification tasks

This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/d19-1670.pdf

See this notebook for an example.

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
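
The procedure described above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's code: the stop-word set and the `SYNONYMS` table are tiny hand-made stand-ins for the NLTK resources, and this `synonym_replacement` function is a hypothetical standalone version.

```python
import random

# Tiny stand-ins for a stop-word list and a thesaurus (assumptions for illustration).
STOP_WORDS = {"is", "to", "the", "a"}
SYNONYMS = {"going": ["heading", "travelling"], "town": ["city", "village"]}


def synonym_replacement(sentence, n=1, rng=None):
    """Replace up to n non-stop words that have synonyms with a random synonym."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)


print(synonym_replacement("John is going to town", n=1, rng=random.Random(0)))
```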

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
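
A minimal standalone sketch of this operation is shown below. It is not the library's code: the function is a hypothetical re-implementation, and the rule of keeping one random word when every word would be dropped is an assumption of this sketch.

```python
import random


def random_deletion(sentence, p=0.2, rng=None):
    """Drop each word independently with probability p; never return an empty string."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.random() >= p]
    if not kept:
        # Assumption: fall back to a single random word rather than an empty sentence.
        kept = [rng.choice(words)]
    return " ".join(kept)


print(random_deletion("John is going to town", p=0.2, rng=random.Random(0)))
```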

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
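
The swap step can be sketched in a few lines of plain Python. This is a hypothetical standalone version for illustration and makes no claim about how the library implements it.

```python
import random


def random_swap(sentence, n=1, rng=None):
    """Swap the positions of two randomly chosen words, repeated n times."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        # Pick two distinct positions and exchange the words found there.
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(random_swap("John is going to town", n=1, rng=random.Random(0)))
```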

Random Insertion

Find a random synonym of a random word in the sentence. Insert that synonym into a random position in the sentence. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
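
The insertion step can be sketched as follows, again with a tiny hand-made `SYNONYMS` table standing in for a real thesaurus. Both the table and this standalone `random_insertion` function are assumptions made for illustration.

```python
import random

# Tiny stand-in for a thesaurus (assumption for illustration).
SYNONYMS = {"going": ["heading", "travelling"], "town": ["city", "village"]}


def random_insertion(sentence, n=1, rng=None):
    """Insert a random synonym of a random word at a random position, n times."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    for _ in range(n):
        candidates = [w for w in words if w.lower() in SYNONYMS]
        if not candidates:
            break  # no word in the sentence has a known synonym
        synonym = rng.choice(SYNONYMS[rng.choice(candidates).lower()])
        # randint is inclusive, so the synonym may also be appended at the end.
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)


print(random_insertion("John is going to town", n=1, rng=random.Random(0)))
```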

AEDA: An easier data augmentation technique for text classification

This is the implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.

https://aclanthology.org/2021.findings-emnlp.234.pdf

Implementation

See this notebook for an example.

Random Insertion of Punctuation Marks

Basic example

>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
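
A standalone sketch of punctuation insertion is given below. It follows the description in the AEDA paper as understood here (between one mark and one third of the sentence length, drawn from six punctuation marks); `insert_punctuation` is a hypothetical name and the code is not taken from the library.

```python
import random

# The six punctuation marks used in the AEDA paper.
PUNCTUATION = [".", ";", "?", ":", "!", ","]


def insert_punctuation(sentence, rng=None):
    """Insert between 1 and len(words) // 3 punctuation marks at random positions."""
    rng = rng if rng is not None else random.Random()
    words = sentence.split()
    if not words:
        return sentence
    max_marks = max(1, len(words) // 3)
    for _ in range(rng.randint(1, max_marks)):
        # randint is inclusive, so a mark may also land after the last word.
        words.insert(rng.randint(0, len(words)), rng.choice(PUNCTUATION))
    return " ".join(words)


print(insert_punctuation("John is going to town", rng=random.Random(0)))
```

Because only punctuation is added, every original word and the word order are preserved.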

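Mixup augmentation

This is the implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz, adapted to NLP.

Used in the paper Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in between training examples.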
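
A convex combination of two examples can be sketched in plain Python as below. The mixing coefficient is drawn from a Beta(alpha, alpha) distribution as in the mixup paper; the `mixup` function, its signature, and the use of plain lists for feature vectors and one-hot labels are assumptions of this sketch rather than the library's API.

```python
import random


def mixup(x_i, y_i, x_j, y_j, alpha=1.0, rng=None):
    """Return lam * (x_i, y_i) + (1 - lam) * (x_j, y_j) with lam ~ Beta(alpha, alpha)."""
    rng = rng if rng is not None else random.Random()
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


# Two toy sentence embeddings with one-hot labels for a two-class problem.
x, y, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=random.Random(0))
print(x, y, lam)
```

For text, the inputs would typically be word or sentence embeddings rather than raw tokens, since tokens cannot be interpolated directly.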

Implementation

See this notebook for an example.

Built with ❤ on

Python

Authors

Joseph Sefara (http://www.speechtech.co.za)
Vukosi Marivate (http://www.vima.co.za)

License

MIT licensed. See the bundled LICENCE file for more details.
