textaugment

2.0.0 16-11-2023

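Textaugment: Improving Short Text Classification through Global Augmentation Methods

You have just found Textaugment.

Textaugment is a Python 3 library for augmenting text for natural language processing applications. It stands on the giant shoulders of NLTK, Gensim v3.x, and TextBlob, and plays nicely with them.

Acknowledgements

Please cite this paper when using this library (an arXiv version is also available):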

 @inproceedings{marivate2020improving,
  title={Improving short text classification through global augmentation methods},
  author={Marivate, Vukosi and Sefara, Tshephisho},
  booktitle={International Cross-Domain Conference for Machine Learning and Knowledge Extraction},
  pages={385--399},
  year={2020},
  organization={Springer}
}

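Features

Generate synthetic data to improve model performance without manual effort
Simple, lightweight, easy-to-use library
Plug and play with any machine learning framework (e.g. PyTorch, TensorFlow, Scikit-learn)
Supports textual data

Citation Paper

Improving short text classification through global augmentation methods.

Requirements

Python 3

The following software packages are dependencies and will be installed automatically.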

$ pip install numpy nltk gensim==3.8.3 textblob googletrans

The following code downloads the NLTK corpus for WordNet.

import nltk
nltk.download('wordnet')

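The following code downloads the NLTK tokenizer. This tokenizer splits a text into a list of sentences, using an unsupervised algorithm to build a model for abbreviations, collocations, and words that start sentences.

nltk.download('punkt')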

The following code downloads the default NLTK part-of-speech tagger model. A part-of-speech tagger processes a sequence of words and attaches a part-of-speech tag to each word.

nltk.download('averaged_perceptron_tagger')

Use Gensim to load a pre-trained Word2vec model, such as the Google News vectors hosted on Google Drive.

import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('./GoogleNews-vectors-negative300.bin', binary=True)

You can also use Gensim to load Facebook's fastText English and multilingual models.

import gensim
model = gensim.models.fasttext.load_facebook_model('./cc.en.300.bin.gz')

Or train a model from scratch using your own data or one of the following public datasets:

Text8 Wiki
Dataset from "One Billion Word Language Modeling Benchmark"

Installation

Install from pip [Recommended]

$ pip install textaugment

or install the latest release

$ pip install git+https://github.com/dsfsi/textaugment.git

Install from source

$ git clone https://github.com/dsfsi/textaugment.git
$ cd textaugment
$ python setup.py install

How to use

The following types of augmentation can be imported:

Word2vec

from textaugment import Word2vec

FastText

from textaugment import Fasttext

WordNet

from textaugment import Wordnet

Translate (requires internet access)

from textaugment import Translate

FastText/Word2vec-based augmentation

See this notebook for an example

Basic example

>>> from textaugment import Word2vec, Fasttext
>>> t = Word2vec(model='path/to/gensim/model')  # or pass the loaded gensim model itself
>>> t.augment('The stories are good')
The films are good
>>> t = Fasttext(model='path/to/gensim/model')  # or pass the loaded gensim model itself
>>> t.augment('The stories are good')
The films are good

Advanced example

>>> runs = 1  # By default.
>>> v = False  # verbose mode to replace all the words. If enabled, runs has no effect. Used in this paper (https://www.cs.cmu.edu/~diyiy/docs/emnlp_wang_2015.pdf)
>>> p = 0.5  # The probability of success of an individual trial (0.1<p<1.0), default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> word = Word2vec(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # or pass the loaded gensim model itself
>>> word.augment('The stories are good', top_n=10)
The movies are excellent
>>> fast = Fasttext(model='path/to/gensim/model', runs=5, v=False, p=0.5)  # or pass the loaded gensim model itself
>>> fast.augment('The stories are good', top_n=10)
The movies are excellent
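
The role of p can be illustrated with a small standard-library sketch. It assumes one plausible reading of the comment above: a geometric draw is made per word, and a word is selected when that draw succeeds on the first trial, so each word is picked with probability p. The helper names geometric_trial_count and select_words are hypothetical and are not part of textaugment; the library's actual selection logic may differ.

```python
import random

def geometric_trial_count(p, rng):
    """Count Bernoulli(p) trials up to and including the first success."""
    trials = 1
    while rng.random() >= p:
        trials += 1
    return trials

def select_words(words, p=0.5, seed=None):
    """Assumed rule: keep a word when its geometric draw equals 1."""
    rng = random.Random(seed)
    return [w for w in words if geometric_trial_count(p, rng) == 1]

words = "The stories are good".split()
selected = select_words(words, p=0.5, seed=0)
print(selected)  # a (possibly empty) subset of the words, in order
```

Under this reading, a larger p marks more words as candidates for replacement on each run.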

WordNet-based augmentation

Basic example

>>> import nltk
>>> nltk.download('punkt')
>>> nltk.download('wordnet')
>>> from textaugment import Wordnet
>>> t = Wordnet()
>>> t.augment('In the afternoon, John is going to town')
In the afternoon , John is walking to town

Advanced example

>>> v = True  # enable verb augmentation. Default is True.
>>> n = False  # enable noun augmentation. Default is False.
>>> runs = 1  # number of times to augment a sentence. Default is 1.
>>> p = 0.5  # The probability of success of an individual trial (0.1<p<1.0), default is 0.5. Used by the geometric distribution to select words from a sentence.

>>> t = Wordnet(v=False, n=True, p=0.5)
>>> t.augment('In the afternoon, John is going to town', top_n=10)
In the afternoon , Joseph is going to town .

RTT-based augmentation

Example

>>> src = "en"  # source language of the sentence
>>> to = "fr"  # target language
>>> from textaugment import Translate
>>> t = Translate(src="en", to="fr")
>>> t.augment('In the afternoon, John is going to town')
In the afternoon John goes to town

EDA: Easy data augmentation techniques for boosting performance on text classification tasks

This is the implementation of EDA by Jason Wei and Kai Zou.

https://www.aclweb.org/anthology/d19-1670.pdf

See this notebook for an example

Synonym Replacement

Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.synonym_replacement("John is going to town", top_n=10)
John is give out to town
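
The procedure described above can be sketched without any external dependency. This is a minimal illustration, not the library's code: the SYNONYMS table and STOP_WORDS set are hypothetical stand-ins for WordNet lookups and a real stop-word list, and the seed parameter exists only to make the sketch reproducible.

```python
import random

# Hypothetical stand-ins for WordNet synonyms and a stop-word list.
SYNONYMS = {"going": ["moving", "travelling"], "town": ["city", "township"]}
STOP_WORDS = {"is", "to", "the", "a"}

def synonym_replacement(sentence, n=1, seed=None):
    """Replace up to n non-stop words with a randomly chosen synonym."""
    rng = random.Random(seed)
    words = sentence.split()
    # Candidate positions: non-stop words that have at least one synonym.
    candidates = [i for i, w in enumerate(words)
                  if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

result = synonym_replacement("John is going to town", n=1, seed=0)
print(result)
```

Words without an entry in the table (here, "John") are left untouched, which mirrors the idea that only words with known synonyms can be replaced.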

Random Deletion

Randomly remove each word in the sentence with probability p.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_deletion("John is going to town", p=0.2)
is going to town
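
As a rough sketch of the idea, each word can be dropped independently with probability p. The function below is illustrative only and is not textaugment's implementation; the fallback that keeps one word when everything would be deleted is an assumption added so the result is never empty, and seed is a hypothetical parameter for reproducibility.

```python
import random

def random_deletion(sentence, p=0.2, seed=None):
    """Drop each word independently with probability p."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) <= 1:
        return sentence
    kept = [w for w in words if rng.random() >= p]
    # Assumed safeguard: never return an empty sentence.
    if not kept:
        kept = [rng.choice(words)]
    return " ".join(kept)

original = "John is going to town"
print(random_deletion(original, p=0.2, seed=0))
```

With p=0.0 the sentence is returned unchanged, and values close to 1.0 leave very little of it, so small values such as the 0.2 used above are the usual choice.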

Random Swap

Randomly choose two words in the sentence and swap their positions. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_swap("John is going to town")
John town going to is
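
The swap step is simple enough to sketch directly. The version below is an illustration under assumptions, not the library's code: it always picks two distinct positions, and the n and seed parameters are named here for the sketch only.

```python
import random

def random_swap(sentence, n=1, seed=None):
    """Swap the words at two distinct random positions, n times."""
    rng = random.Random(seed)
    words = sentence.split()
    if len(words) < 2:
        return sentence
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

original = "John is going to town"
swapped = random_swap(original, n=1, seed=0)
print(swapped)
```

The output always contains exactly the same words as the input; only their order changes.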

Random Insertion

Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.

Basic example

>>> from textaugment import EDA
>>> t = EDA()
>>> t.random_insertion("John is going to town")
John is going to make up town
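
A dependency-free sketch of the same steps follows. As before, the SYNONYMS table and STOP_WORDS set are hypothetical stand-ins for WordNet and a real stop-word list, and the function is an illustration rather than textaugment's implementation.

```python
import random

# Hypothetical stand-ins for WordNet synonyms and a stop-word list.
SYNONYMS = {"going": ["moving", "travelling"], "town": ["city", "township"]}
STOP_WORDS = {"is", "to", "the", "a"}

def random_insertion(sentence, n=1, seed=None):
    """Insert a synonym of a random non-stop word at a random position, n times."""
    rng = random.Random(seed)
    words = sentence.split()
    for _ in range(n):
        candidates = [w for w in words
                      if w.lower() not in STOP_WORDS and w.lower() in SYNONYMS]
        if not candidates:
            break  # nothing in the sentence has a known synonym
        synonym = rng.choice(SYNONYMS[rng.choice(candidates).lower()])
        words.insert(rng.randint(0, len(words)), synonym)
    return " ".join(words)

original = "John is going to town"
augmented = random_insertion(original, n=2, seed=0)
print(augmented)
```

Each pass lengthens the sentence by one token, so n controls how far the augmented text drifts from the original.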

AEDA: An easier data augmentation technique for text classification

This is the implementation of AEDA by Karimi et al., a variant of EDA. It is based on the random insertion of punctuation marks.

https://aclanthology.org/2021.findings-emnlp.234.pdf

Implementation

See this notebook for an example

Random Insertion of Punctuation Marks

Basic example

>>> from textaugment import AEDA
>>> t = AEDA()
>>> t.punct_insertion("John is going to town")
! John is going to town
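
The idea can be sketched in a few lines. Following the AEDA paper, the number of inserted marks is drawn between 1 and one third of the sentence length, and marks come from the six symbols listed below. The function itself, including its ratio and seed parameters, is a hypothetical illustration and not the library's implementation.

```python
import random

# The six punctuation marks used in the AEDA paper.
PUNCTUATION = [".", ";", "?", ":", "!", ","]

def punct_insertion(sentence, ratio=1 / 3, seed=None):
    """Insert between 1 and ratio * len(words) punctuation marks at random positions."""
    rng = random.Random(seed)
    words = sentence.split()
    max_marks = max(1, int(ratio * len(words)))
    for _ in range(rng.randint(1, max_marks)):
        words.insert(rng.randint(0, len(words)), rng.choice(PUNCTUATION))
    return " ".join(words)

original = "John is going to town"
punctuated = punct_insertion(original, seed=0)
print(punctuated)
```

Because nothing is deleted or replaced, the original words and their order are fully preserved, which is the main difference from the EDA operations.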

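Mixup augmentation

This is the implementation of mixup augmentation by Hongyi Zhang, Moustapha Cisse, Yann Dauphin, and David Lopez-Paz, adapted to NLP.

Used in Augmenting Data with Mixup for Sentence Classification: An Empirical Study.

Mixup is a generic and straightforward data augmentation principle. In essence, mixup trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in between training examples.

Implementation

See this notebook for an example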
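
The convex combination can be written out on plain Python lists. This is a minimal sketch and not the library's mixup class: it assumes the inputs are already numeric vectors (for text, typically embeddings) with one-hot labels, draws the mixing weight lam from Beta(alpha, alpha) as in the mixup paper, and uses a hypothetical function name, default alpha, and seed parameter.

```python
import random

def mixup(x_i, y_i, x_j, y_j, alpha=1.0, seed=None):
    """Return lam * (x_i, y_i) + (1 - lam) * (x_j, y_j) with lam ~ Beta(alpha, alpha)."""
    rng = random.Random(seed)
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam

# Two toy examples: feature vectors with one-hot labels.
x_a, y_a = [1.0, 0.0, 2.0], [1.0, 0.0]
x_b, y_b = [0.0, 4.0, 2.0], [0.0, 1.0]

x_mix, y_mix, lam = mixup(x_a, y_a, x_b, y_b, alpha=1.0, seed=0)
print(lam, x_mix, y_mix)
```

The mixed label is a soft distribution over both classes, so the training loss has to accept probability targets rather than a single class index.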

Built with ❤ on Python

Authors

Joseph Sefara (http://www.speechtech.co.za)
Vukosi Marivate (http://www.vima.co.za)

Licence

MIT licensed. See the bundled LICENCE file for more details.

Additional information

Version: 2.0.0 (16-11-2023)
Type: Other source code
Updated: 2025-04-15
Size: 119.78 KB
Source: GitHub
