Easy NLP Augmentation

Version 1.0.0

Easy Text Augmenter

Easy Text Augmenter is a Python package for augmenting text data directly in a Pandas DataFrame using a variety of NLP techniques. Currently only three techniques are available:

augment_random_word
augment_random_character
augment_word_bert

Installation

!pip install easy-nlp-augmentation
import easy_text_augmenter
easy_text_augmenter.info()

Usage

augment_random_word

import pandas as pd
from easy_text_augmenter import augment_random_word

df = pd.DataFrame({
    'text': ['This is a test', 'Another test data ', 'Of course we need more data', 'Newton does not like apple', 'Hello world I am a human'],
    'label': ['A', 'A', 'B', 'B', 'A']
})
classes_to_augment = ['A', 'B']
augmented_df = augment_random_word(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Output:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5             Th is is a te st     A
6                 Another data     A
7   Does not newton like apple     B
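
The output above contains three augmented rows: two for class A and one for class B. Those counts are consistent with multiplying each class size by augmentation_percentage and truncating to an integer (3 × 0.8 = 2.4, truncated to 2; 2 × 0.8 = 1.6, truncated to 1), although the package's actual rounding rule is not stated here. A minimal sketch of that arithmetic, where rows_to_augment is a hypothetical helper and not part of the package:

```python
from collections import Counter


def rows_to_augment(labels, classes_to_augment, augmentation_percentage):
    """Hypothetical helper: rows to augment per class.

    Assumes the product (class size * percentage) is truncated to an int.
    """
    class_sizes = Counter(labels)
    return {
        cls: int(class_sizes[cls] * augmentation_percentage)
        for cls in classes_to_augment
    }


# Labels from the usage example: three rows of class A, two of class B.
labels = ['A', 'A', 'B', 'B', 'A']
counts = rows_to_augment(labels, ['A', 'B'], 0.8)
print(counts)
```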

augment_random_character

from easy_text_augmenter import augment_random_character

# Reuses df from the augment_random_word example
classes_to_augment = ['A', 'B']
augmented_df = augment_random_character(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Output:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5               This is a estt     A
6            Another te8t data     A
7   Newtun d0e8 not like apple     B

augment_word_bert

from easy_text_augmenter import augment_word_bert

# Reuses df from the augment_random_word example
classes_to_augment = ['A', 'B']
augmented_df = augment_word_bert(df, classes_to_augment, augmentation_percentage=0.8, text_column='text', model_path='bert-base-uncased', random_state=70)
print(augmented_df)

Output:

                                          text label
0                               This is a test     A
1                           Another test data      A
2                  Of course we need more data     B
3                   Newton does not like apple     B
4                     Hello world I am a human     A
5                         another test of data     A
6                      this term is not a test     A
7  newton does absolutely not like every apple     B

Author

Contact me:

[email protected]
Shizuka.my.id

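Documentation

augment_random_word

Description:

The augment_random_word function augments a specified percentage of the samples in the given classes of a DataFrame by randomly applying one of three augmentation techniques (swap, delete, or split) to the text column.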

augment_random_word(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.5, 0.3, 0.2])

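Parameters:

df (pandas.DataFrame): Input DataFrame containing the text data and labels.
classes_to_augment (list): List of class labels that should be augmented.
augmentation_percentage (float): Percentage of samples to augment in each specified class, given as a fraction (for example, 0.8 for 80%).
text_column (str): Name of the DataFrame column that contains the text data.
random_state (int, optional): Random seed used to select which rows are augmented. Defaults to 42.
weights (list, optional): List of weights that set the probability of choosing each augmentation type. Defaults to [0.5, 0.3, 0.2] for swap, delete, and split, respectively.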

Techniques selected by weights:

swap: randomly swaps words in the text.
delete: randomly deletes words from the text.
split: randomly splits words in the text.

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
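
To illustrate how the weights list could drive the choice among swap, delete, and split, here is a minimal pure-Python sketch. It is not the package's implementation: the helper names, the choice to swap adjacent words, and the use of random.choices are all assumptions made for illustration.

```python
import random


def swap_words(words, rng):
    """Swap two adjacent words (adjacency is an assumption of this sketch)."""
    if len(words) < 2:
        return words
    i = rng.randrange(len(words) - 1)
    out = list(words)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def delete_word(words, rng):
    """Delete one randomly chosen word, keeping at least one word."""
    if len(words) < 2:
        return words
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


def split_word(words, rng):
    """Split one randomly chosen word (of length >= 2) into two pieces."""
    candidates = [i for i, w in enumerate(words) if len(w) >= 2]
    if not candidates:
        return words
    i = rng.choice(candidates)
    cut = rng.randrange(1, len(words[i]))
    return words[:i] + [words[i][:cut], words[i][cut:]] + words[i + 1:]


# Order matches the documented default weights: swap, delete, split.
TECHNIQUES = [swap_words, delete_word, split_word]


def augment_once(text, weights=(0.5, 0.3, 0.2), seed=42):
    """Pick one technique according to the weights and apply it once."""
    rng = random.Random(seed)
    technique = rng.choices(TECHNIQUES, weights=weights, k=1)[0]
    return " ".join(technique(text.split(), rng))


print(augment_once("Newton does not like apple"))
```

Setting a weight to 0 disables that technique in this sketch; for example, weights=(0, 0, 1) always splits a word.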

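augment_random_character

Description:

The augment_random_character function performs random character-level augmentation on specific classes of text data in a DataFrame. It uses several augmentation techniques that randomly alter characters in the text to increase the diversity of the dataset.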

augment_random_character(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.2, 0.2, 0.2, 0.2, 0.2])

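Parameters:

df (pandas.DataFrame): Input DataFrame containing the text data and the corresponding labels.
classes_to_augment (list): List of class labels indicating which classes should be augmented.
augmentation_percentage (float): Percentage of samples in each class that should be augmented.
text_column (str): Name of the DataFrame column that contains the text data to augment.
random_state (int, optional): Random seed used to select which rows are augmented. Defaults to 42.
weights (list, optional): List of weights, one per augmentation technique, used to set the probability of choosing each technique. Defaults to [0.2, 0.2, 0.2, 0.2, 0.2].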

Techniques selected by weights:

aug_ocr: OCR-based augmentation.
aug_keyboard: keyboard error simulation.
aug_insert: random character insertion.
aug_swap: random character swapping.
aug_delete: random character deletion.

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
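
The five character-level techniques can be pictured with the following pure-Python sketch. It is an illustration under stated assumptions, not the package's code: the two lookup tables are tiny made-up samples, each function changes a single character, and augment_characters is a hypothetical wrapper that picks one word and one technique.

```python
import random
import string

# Tiny illustrative lookup tables (assumptions, not the package's data).
OCR_CONFUSIONS = {"o": "0", "s": "8", "l": "1", "i": "1", "b": "6"}
KEYBOARD_NEIGHBORS = {"a": "qs", "e": "wr", "o": "ip", "t": "ry", "n": "bm"}


def _replace_from(table, word, rng):
    """Replace one character that has an entry in the lookup table."""
    positions = [i for i, c in enumerate(word) if c.lower() in table]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(table[word[i].lower()]) + word[i + 1:]


def aug_ocr(word, rng):
    return _replace_from(OCR_CONFUSIONS, word, rng)


def aug_keyboard(word, rng):
    return _replace_from(KEYBOARD_NEIGHBORS, word, rng)


def aug_insert(word, rng):
    i = rng.randrange(len(word) + 1)
    return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]


def aug_swap(word, rng):
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def aug_delete(word, rng):
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


# Order follows the documented list of techniques.
CHAR_TECHNIQUES = [aug_ocr, aug_keyboard, aug_insert, aug_swap, aug_delete]


def augment_characters(text, weights=(0.2, 0.2, 0.2, 0.2, 0.2), seed=42):
    """Apply one weighted-random character technique to one random word."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    i = rng.randrange(len(words))
    technique = rng.choices(CHAR_TECHNIQUES, weights=weights, k=1)[0]
    words[i] = technique(words[i], rng)
    return " ".join(words)


print(augment_characters("Another test data"))
```

With the sample OCR table, the only replaceable character in "test" is "s", so an OCR-only run turns it into "te8t", the same kind of change seen in the example output.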

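augment_word_bert

Description:

The augment_word_bert function augments text data in a DataFrame using BERT-based word augmentation. For a given percentage of the samples in the specified classes, it inserts or substitutes words in the specified text column.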

augment_word_bert(df, classes_to_augment, augmentation_percentage, text_column, model_path, random_state=42, weights=[0.7, 0.3])

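Parameters:

df (pandas.DataFrame): DataFrame containing the data to augment.
classes_to_augment (list): List of class labels indicating which classes should be augmented.
augmentation_percentage (float): Fraction of samples within each class to augment (for example, 0.2 for 20%).
text_column (str): Name of the DataFrame column that contains the text to augment.
model_path (str): Path to the pre-trained BERT model used for augmentation.
random_state (int, optional): Random seed used to select which rows are augmented. Defaults to 42.
weights (list, optional): Weights for choosing between the insert and substitute augmentation techniques. Defaults to [0.7, 0.3].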
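
Conceptually, insertion and substitution with a masked language model such as BERT both come down to placing a mask token in the sentence and letting the model predict a word for that slot. The sketch below shows only that masking step and the weighted choice between the two modes. It makes several assumptions: the weights are taken in the order (insert, substitute), fill_mask_stub is a hypothetical placeholder standing in for a real BERT prediction, and none of these helper names come from the package.

```python
import random

MASK = "[MASK]"


def mask_for_insert(words, rng):
    """Add a mask at a random position, keeping every original word."""
    i = rng.randrange(len(words) + 1)
    return words[:i] + [MASK] + words[i:]


def mask_for_substitute(words, rng):
    """Replace one randomly chosen word with the mask."""
    i = rng.randrange(len(words))
    return words[:i] + [MASK] + words[i + 1:]


def fill_mask_stub(masked_words):
    """Hypothetical stand-in for a BERT fill-mask call.

    A real model would predict a context-appropriate word for the slot;
    this stub just writes a visible placeholder.
    """
    return ["<predicted>" if w == MASK else w for w in masked_words]


def bert_style_augment(text, weights=(0.7, 0.3), seed=42):
    """Choose insert or substitute by weight, then fill the masked slot."""
    rng = random.Random(seed)
    words = text.split()
    if not words:
        return text
    make_masked = rng.choices(
        [mask_for_insert, mask_for_substitute], weights=weights, k=1
    )[0]
    return " ".join(fill_mask_stub(make_masked(words, rng)))


print(bert_style_augment("Newton does not like apple"))
```

In this sketch an insert lengthens the sentence by one word and a substitute keeps its length, which matches the pattern in the example output, where "Newton does not like apple" grows to "newton does absolutely not like every apple" through insertions.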