Easy NLP Augmentation

1.0.0

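Easy Text Augmenter

Easy Text Augmenter is a Python package that augments text data directly on a pandas DataFrame using various NLP techniques. Three techniques are currently available: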

augment_random_word
augment_random_character
augment_word_bert

Installation

!pip install easy-nlp-augmentation
import easy_text_augmenter
easy_text_augmenter.info()

Usage

augment_random_word

import pandas as pd
from easy_text_augmenter import augment_random_word

df = pd.DataFrame({
    'text': ['This is a test', 'Another test data ', 'Of course we need more data', 'Newton does not like apple', 'Hello world I am a human'],
    'label': ['A', 'A', 'B', 'B', 'A']
})
classes_to_augment = ['A', 'B']
augmented_df = augment_random_word(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5             Th is is a te st     A
6                 Another data     A
7   Does not newton like apple     B
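The result above has three new rows: two for class A (3 samples) and one for class B (2 samples). That is consistent with rounding 3 × 0.8 and 2 × 0.8 down, although the package's exact rule is not documented here. The sketch below shows one way such a select-and-append step could work with pandas. The function name `augment_classes`, the `transform` callable, and the `label_column` parameter are hypothetical and are not part of easy_text_augmenter.

```python
import pandas as pd


def augment_classes(df, classes_to_augment, augmentation_percentage,
                    text_column, transform, label_column="label",
                    random_state=42):
    """Append transformed copies of a per-class sample to the original frame.

    Hypothetical helper for illustration; not the package's implementation.
    """
    new_rows = []
    for cls in classes_to_augment:
        class_rows = df[df[label_column] == cls]
        # Assumption: the sample count is rounded down.
        n = int(len(class_rows) * augmentation_percentage)
        sampled = class_rows.sample(n=n, random_state=random_state).copy()
        sampled[text_column] = sampled[text_column].apply(transform)
        new_rows.append(sampled)
    # Original rows first, augmented rows after them, with a fresh index.
    return pd.concat([df, *new_rows], ignore_index=True)


df = pd.DataFrame({
    "text": ["This is a test", "Another test data ",
             "Of course we need more data", "Newton does not like apple",
             "Hello world I am a human"],
    "label": ["A", "A", "B", "B", "A"],
})
# str.upper stands in for a real augmentation function.
out = augment_classes(df, ["A", "B"], 0.8, "text", transform=str.upper)
print(len(out))  # 8: 5 original rows + 2 from class A + 1 from class B
```

Passing the transformation in as a callable keeps the sampling logic separate from the augmentation technique, so the same helper could wrap word-level, character-level, or model-based changes.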

augment_random_character

from easy_text_augmenter import augment_random_character

# Reuses the df defined in the augment_random_word example
classes_to_augment = ['A', 'B']
augmented_df = augment_random_character(df, classes_to_augment, augmentation_percentage=0.8, text_column='text')
print(augmented_df)

Result:

                          text label
0               This is a test     A
1           Another test data      A
2  Of course we need more data     B
3   Newton does not like apple     B
4     Hello world I am a human     A
5               This is a estt     A
6            Another te8t data     A
7   Newtun d0e8 not like apple     B

augment_word_bert

from easy_text_augmenter import augment_word_bert

# Reuses the df defined in the augment_random_word example
classes_to_augment = ['A', 'B']
augmented_df = augment_word_bert(df, classes_to_augment, augmentation_percentage=0.8, text_column='text', model_path='bert-base-uncased', random_state=70)
print(augmented_df)

Result:

                                          text label
0                               This is a test     A
1                           Another test data      A
2                  Of course we need more data     B
3                   Newton does not like apple     B
4                     Hello world I am a human     A
5                         another test of data     A
6                      this term is not a test     A
7  newton does absolutely not like every apple     B

Author

Contact me:

[email protected]
shizuka.my.id

Documentation

augment_random_word

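Description:

The augment_random_word function augments text data in a given DataFrame by randomly applying one of three augmentation techniques (swap, delete, split) to a specified percentage of the samples in the text column.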

augment_random_word(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.5, 0.3, 0.2])

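Parameters:

df (pandas.DataFrame): The input DataFrame containing the text data and labels.
classes_to_augment (list): The list of class labels to augment.
augmentation_percentage (float): The fraction of samples to augment from each specified class.
text_column (str): The name of the DataFrame column that contains the text data.
random_state (int, optional): The random seed used when selecting the samples to augment. Defaults to 42.
weights (list, optional): A list of weights that sets the probability of choosing each augmentation type. Defaults to [0.5, 0.3, 0.2] for swap, delete, and split respectively.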

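Techniques covered by weights:

Swap: randomly swaps words in the text.
Delete: randomly deletes words from the text.
Split: randomly splits words in the text.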

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
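To make the three word-level techniques concrete, here is a minimal standard-library sketch of swap, delete, and split, chosen at random with the default weights. All function names are hypothetical, and details such as swapping adjacent words are assumptions; the package's own implementation may differ.

```python
import random


def swap_words(words, rng):
    """Swap one randomly chosen pair of adjacent words."""
    if len(words) < 2:
        return words
    i = rng.randrange(len(words) - 1)
    words = words.copy()
    words[i], words[i + 1] = words[i + 1], words[i]
    return words


def delete_word(words, rng):
    """Delete one randomly chosen word, keeping at least one word."""
    if len(words) < 2:
        return words
    i = rng.randrange(len(words))
    return words[:i] + words[i + 1:]


def split_word(words, rng):
    """Split one word of at least two characters into two pieces."""
    candidates = [i for i, w in enumerate(words) if len(w) > 1]
    if not candidates:
        return words
    i = rng.choice(candidates)
    cut = rng.randrange(1, len(words[i]))
    return words[:i] + [words[i][:cut], words[i][cut:]] + words[i + 1:]


def augment_text(text, rng, weights=(0.5, 0.3, 0.2)):
    """Apply one technique, picked with the given (swap, delete, split) weights."""
    ops = [swap_words, delete_word, split_word]
    op = rng.choices(ops, weights=weights, k=1)[0]
    return " ".join(op(text.split(), rng))


rng = random.Random(42)
for _ in range(3):
    print(augment_text("Newton does not like apple", rng))
```

Setting a weight to 0 disables that technique; for example, weights=(1, 0, 0) always swaps.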

augment_random_character

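Description:

The augment_random_character function performs random character-based augmentation on the text data of specific classes in a DataFrame. It uses several augmentation techniques to randomly alter characters in the text, which increases the diversity of the dataset.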

augment_random_character(df, classes_to_augment, augmentation_percentage, text_column, random_state=42, weights=[0.2, 0.2, 0.2, 0.2, 0.2])

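Parameters:

df (pandas.DataFrame): The input DataFrame containing the text data and its corresponding labels.
classes_to_augment (list): A list of class labels indicating which classes should be augmented.
augmentation_percentage (float): The fraction of samples in each class that should be augmented.
text_column (str): The name of the DataFrame column that contains the text data to augment.
random_state (int, optional): The random seed used when selecting the samples to augment. Defaults to 42.
weights (list, optional): A list of weights, one per augmentation technique, that sets the probability of choosing each technique. Defaults to [0.2, 0.2, 0.2, 0.2, 0.2].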

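Techniques covered by weights:

Aug_ocr: OCR-based augmentation.
Aug_keyboard: keyboard error simulation.
Aug_insert: random character insertion.
Aug_swap: random character swap.
Aug_delete: random character deletion.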

Returns:

pandas.DataFrame: A new DataFrame with the augmented data appended to the original data.
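As an illustration of the five character-level techniques, the following standard-library sketch applies one of them to a string. The lookup tables are tiny hypothetical examples, the function names are made up for this sketch, and the assumption that weights follow the order listed above is not confirmed by the package.

```python
import random
import string

# Tiny illustrative tables; real OCR and keyboard confusion maps are larger.
OCR_CONFUSIONS = {"o": "0", "l": "1", "s": "8", "i": "1"}
KEYBOARD_NEIGHBORS = {"a": "qsz", "e": "wrd", "o": "ipl", "t": "ryg"}


def _pick(text, rng, allowed=None):
    """Index of a random non-space character (optionally restricted), or None."""
    idx = [i for i, c in enumerate(text)
           if not c.isspace() and (allowed is None or c.lower() in allowed)]
    return rng.choice(idx) if idx else None


def aug_ocr(text, rng):
    """Replace a character with a look-alike, as an OCR engine might."""
    i = _pick(text, rng, OCR_CONFUSIONS)
    if i is None:
        return text
    return text[:i] + OCR_CONFUSIONS[text[i].lower()] + text[i + 1:]


def aug_keyboard(text, rng):
    """Replace a character with a neighboring key."""
    i = _pick(text, rng, KEYBOARD_NEIGHBORS)
    if i is None:
        return text
    return text[:i] + rng.choice(KEYBOARD_NEIGHBORS[text[i].lower()]) + text[i + 1:]


def aug_insert(text, rng):
    """Insert a random lowercase letter at a random position."""
    i = rng.randint(0, len(text))
    return text[:i] + rng.choice(string.ascii_lowercase) + text[i:]


def aug_swap(text, rng):
    """Swap two adjacent characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def aug_delete(text, rng):
    """Delete one random non-space character."""
    i = _pick(text, rng)
    return text if i is None else text[:i] + text[i + 1:]


def augment_characters(text, rng, weights=(0.2, 0.2, 0.2, 0.2, 0.2)):
    """Apply one technique, picked with the given weights."""
    ops = [aug_ocr, aug_keyboard, aug_insert, aug_swap, aug_delete]
    return rng.choices(ops, weights=weights, k=1)[0](text, rng)


rng = random.Random(42)
print(augment_characters("Newton does not like apple", rng))
```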

augment_word_bert

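Description:

The augment_word_bert function augments text data in a DataFrame using BERT-based word augmentation. It inserts or substitutes words in the specified text column for a given percentage of the samples in the specified classes.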

augment_word_bert(df, classes_to_augment, augmentation_percentage, text_column, model_path, random_state=42, weights=[0.7, 0.3])

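Parameters:

df (pandas.DataFrame): The DataFrame containing the data to augment.
classes_to_augment (list): A list of class labels indicating which classes should be augmented.
augmentation_percentage (float): The fraction of samples in each class to augment (for example, 0.2 for 20%).
text_column (str): The name of the DataFrame column that contains the text to augment.
model_path (str): The path to the pretrained BERT model used for augmentation.
random_state (int, optional): The random seed used when selecting the samples to augment. Defaults to 42.
weights (list, optional): The weights for choosing between the insert and substitute techniques. Defaults to [0.7, 0.3].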

Returns:

pandas.DataFrame: The original DataFrame with the additional augmented samples.
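The insert and substitute techniques can be pictured as masked-word prediction: place a mask token in the sentence and let the model fill it in. The sketch below assumes this approach; how the package calls BERT internally is not stated here. The function `bert_augment` is hypothetical, and `fill_mask` is any callable that behaves like a Hugging Face fill-mask pipeline, returning a list of dicts with a "token_str" key, best prediction first.

```python
import random

MASK = "[MASK]"  # mask token used by bert-base-uncased


def bert_augment(text, fill_mask, rng, weights=(0.7, 0.3)):
    """Insert or substitute one word using a masked-language-model callable.

    Hypothetical helper for illustration; weights are (insert, substitute).
    """
    words = text.split()
    if not words:
        return text
    action = rng.choices(["insert", "substitute"], weights=weights, k=1)[0]
    if action == "insert":
        # Put the mask in a gap before, between, or after the words.
        pos = rng.randint(0, len(words))
        masked = words[:pos] + [MASK] + words[pos:]
    else:
        # Put the mask in place of an existing word.
        pos = rng.randrange(len(words))
        masked = words[:pos] + [MASK] + words[pos + 1:]
    predictions = fill_mask(" ".join(masked))
    # Use the model's top prediction for the masked position.
    masked[pos] = predictions[0]["token_str"].strip()
    return " ".join(masked)


# With the transformers library installed, a real model could be plugged in:
#   from transformers import pipeline
#   fill_mask = pipeline("fill-mask", model="bert-base-uncased")
#   print(bert_augment("Newton does not like apple", fill_mask, random.Random(70)))
```

Passing the model in as a callable keeps the sketch runnable without downloading weights and makes the masking logic easy to test with a stub.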

Additional information

Version: 1.0.0
Type: AI source code
Updated: 2025-08-30
Size: 13.36 KB
Source: GitHub
