nlpcda

NLP Chinese Data Augmentation: a one-click data augmentation tool for Chinese text

Install: pip install nlpcda

Maintaining open source takes effort; stars are welcome.

PyPI: https://pypi.org/project/nlpcda/

Introduction

A one-click Chinese data augmentation tool, supporting:

1. Random entity replacement
2. Synonym replacement
3. Homophone (similar-character) replacement
4. Random character deletion (internal detail: digits and time/date fragments are never deleted)
5. NER BIO-format data augmentation
6. Random swapping of adjacent characters: research suggests that the order of Chinese characters does not necessarily affect reading comprehension.
7. Equivalent character substitution (1 一 ①, 2 二 ②)
8. Back-translation augmentation
9. Similar-sentence generation with SimBERT

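Details receive special handling (for example, year, month, and day digits are left unchanged) so that the original meaning is preserved as far as possible. Even when the text does change, a reader can still guess the meaning, much as 能被猜出来 remains recognizable when written 能被踩出来 or 能被菜粗来.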

WIP

A speech-based text-washing pipeline (similar to back-translation): text to speech, then speech recognition back to text. Speech is generated from the text with fastspeech2, and the audio is transcribed back to text with wav2vec2.

example:
input: Xinhua News Agency Beijing News > fastspeech2 > x.wav
x.wav > wav2vec2 > output: Xinhua set up Beijing news

Number conversion tool (for text normalization: Chinese speech synthesis requires pure Chinese text)

Today is August 29th news > Today is August twenty-ninth news
I have 1234 apples > I have one thousand two hundred thirty-four apples
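
A minimal sketch of what such a digit-to-Chinese conversion could look like. The helpers `int_to_chinese` and `digits_to_chinese` are hypothetical names, not part of nlpcda, and the sketch only handles integers from 0 to 9999 read as quantities (years such as 2020 are usually read digit by digit, which is not covered here).

```python
import re

DIGITS = "零一二三四五六七八九"
UNITS = ["", "十", "百", "千"]


def int_to_chinese(n):
    """Convert an integer 0..9999 to Chinese numerals (hypothetical helper)."""
    if not 0 <= n < 10000:
        raise ValueError("this sketch only handles 0..9999")
    if n == 0:
        return DIGITS[0]
    digits = [int(c) for c in str(n)]
    parts = []
    pending_zero = False
    for i, d in enumerate(digits):
        if d == 0:
            pending_zero = True  # emit a single 零 only if more digits follow
            continue
        if pending_zero:
            parts.append(DIGITS[0])
            pending_zero = False
        parts.append(DIGITS[d] + UNITS[len(digits) - 1 - i])
    text = "".join(parts)
    # 10..19 are conventionally read 十, 十一, ... rather than 一十, 一十一
    return text[1:] if 10 <= n <= 19 else text


def digits_to_chinese(text):
    """Replace every run of ASCII digits in text with Chinese numerals."""
    return re.sub(r"[0-9]+", lambda m: int_to_chinese(int(m.group())), text)


print(digits_to_chinese("我有1234个苹果"))  # 我有一千二百三十四个苹果
```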

Significance

Generates a specified number of training texts without changing the original semantics.
Helps NLP models generalize better, resist adversarial attacks, and fluctuate less under noise.
Reference competition (using this strategy plus a base BERT model, I ranked around 50th out of 1000): https://www.biendata.net/competition/2019diac/
Using nlpcda, I placed 9th in CCKS 2020: Large-scale product entity search based on titles, under the team name nlpcda.

Warning: in competitions that only score accuracy, this package will generally not improve your score.

API

1. Random (equivalent) entity replacement

Parameters:

base_file: path to a text file of entities. By default the built-in (company) entity list is used, so company entities are replaced.
The file contains one entity per line:
Entity 1
Entity 2
...
Entity n
create_num=3: return up to 3 augmented texts
change_rate=0.3: fraction of the text to change
seed: random seed

from nlpcda import Randomword

test_str = '''这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击'''

smw = Randomword(create_num=3, change_rate=0.3)
rs1 = smw.replace(test_str)

print('随机实体替换>>>>>>')
for s in rs1:
    print(s)
'''
随机实体替换>>>>>>
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
这是个实体：长兴国际；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
这是个实体：浙江世宝；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
'''
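
To illustrate the idea behind the parameters above, here is a rough sketch of entity replacement in plain Python. It is an assumption about the general technique, not nlpcda's implementation; `replace_entities` and the entity list are hypothetical.

```python
import random
import re


def replace_entities(text, entities, create_num=3, change_rate=0.3, seed=1):
    """Return the original text followed by up to create_num - 1 distinct variants."""
    if not entities:
        return [text]
    rng = random.Random(seed)
    # Longest names first, so a long entity wins over a shorter one it contains
    ordered = sorted(entities, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(e) for e in ordered))

    def swap(match):
        # Each occurrence is replaced with probability change_rate
        if rng.random() < change_rate:
            return rng.choice(entities)
        return match.group()

    results = [text]
    for _ in range(create_num * 10):  # bounded number of attempts
        if len(results) >= create_num:
            break
        candidate = pattern.sub(swap, text)
        if candidate not in results:
            results.append(candidate)
    return results


entities = ["58同城", "长兴国际", "浙江世宝"]
for variant in replace_entities("这是个实体：58同城；今天是2020年3月8日", entities):
    print(variant)
```

Because only spans that match a listed entity are touched, dates and other digits elsewhere in the sentence stay intact.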

2. Random synonym replacement

Parameters:

base_file: path to a text file of synonyms. By default the built-in synonym table is used; you can supply a richer table of your own.
Each line holds an id followed by its synonyms, separated by spaces:
Aa01A0 人类 生人 全人类
id2 synonym b1 synonym b2 ... synonym bk
...
idn synonym n1 synonym n2
create_num=3: return up to 3 augmented texts
change_rate=0.3: fraction of the text to change
seed: random seed

from nlpcda import Similarword

test_str = '''这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击'''

smw = Similarword(create_num=3, change_rate=0.3)
rs1 = smw.replace(test_str)

print('随机同义词替换>>>>>>')
for s in rs1:
    print(s)

'''
随机同义词替换>>>>>>
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数量增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；斯nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
'''
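
When preparing a custom base_file, it can help to check the file before handing it to Similarword. The sketch below parses the space-separated format described above. `load_synonym_groups` and the sample rows are hypothetical, and this is not how nlpcda itself reads the file.

```python
import os
import tempfile


def load_synonym_groups(path):
    """Parse 'id syn1 syn2 ...' lines into {id: [syn1, syn2, ...]}."""
    groups = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:  # need an id plus at least two synonyms
                continue
            groups[parts[0]] = parts[1:]
    return groups


# Hypothetical sample file, for illustration only
sample_path = os.path.join(tempfile.mkdtemp(), "synonyms.txt")
with open(sample_path, "w", encoding="utf-8") as f:
    f.write("id1 开心 高兴 快乐\nid2 漂亮 美丽\nbad_line\n")

groups = load_synonym_groups(sample_path)
print(groups)
```

The same path could then be passed as base_file, for example Similarword(base_file=sample_path, create_num=3, change_rate=0.3).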

3. Random homophone (similar-character) replacement

Parameters:

base_file: path to a text file of homophones. By default the built-in [synonymous homophone table] is used; you can supply a richer table of your own.
Each line holds a pinyin followed by its characters, separated by tabs (\t):
de 的 地 得 德 ...
...
Pinyin n word n1 word n2
create_num=3: return up to 3 augmented texts
change_rate=0.3: fraction of the text to change
seed: random seed

from nlpcda import Homophone

test_str = '''这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击'''

smw = Homophone(create_num=3, change_rate=0.3)
rs1 = smw.replace(test_str)

print('随机近义字替换>>>>>>')
for s in rs1:
    print(s)

'''
随机近义字替换>>>>>>
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
这是个实体：58同城；今填是2020年3月8日11:40，天气晴朗，天气很不错，空气痕好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
鷓是个实体：58同乘；今天是2020年3月8日11:40，天迄晴朗，天气很不错，空气很儫，不差；这个nlpcad包，用于方便一键数据增强，犐有效增牆NLP模型的橎化性能、减少波动、抵抗对抗攻击
'''
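
The output above suggests a per-character substitution driven by a pinyin table. A minimal sketch of that idea follows, assuming a tab-separated table of the kind described in the parameters. `build_homophone_map`, `homophone_replace`, and the two sample rows are hypothetical and do not reflect nlpcda's internals.

```python
import random


def build_homophone_map(rows):
    """Map each character to the other characters that share its pinyin row."""
    mapping = {}
    for row in rows:
        _pinyin, *chars = row.split("\t")
        for ch in chars:
            mapping[ch] = [c for c in chars if c != ch]
    return mapping


def homophone_replace(text, mapping, change_rate=0.3, seed=1):
    """Swap each known character for a homophone with probability change_rate."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        alternatives = mapping.get(ch)
        if alternatives and rng.random() < change_rate:
            out.append(rng.choice(alternatives))
        else:
            out.append(ch)  # digits, symbols, and unknown characters pass through
    return "".join(out)


# Hypothetical two-row table, for illustration only
table = build_homophone_map(["de\t的\t地\t得", "tian\t天\t填\t田"])
print(homophone_replace("今天的天气很好", table, change_rate=0.5))
```

Characters missing from the table are never changed, which is one simple way to keep digits and dates intact.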

4. Random character deletion

Parameters:

create_num=3: return up to 3 augmented texts
change_rate=0.3: fraction of the text to change
seed: random seed

from nlpcda import RandomDeleteChar

test_str = '''这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击'''

smw = RandomDeleteChar(create_num=3, change_rate=0.3)
rs1 = smw.replace(test_str)

print('随机字删除>>>>>>')
for s in rs1:
    print(s)

'''
随机字删除>>>>>>
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气，不差；这个nlpcad包用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗
个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型泛化性能、减少波动、抵抗对抗
'''
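
As a rough illustration of deleting text while protecting digits and time/date fragments, here is a per-character sketch. The regular expression and the helper `random_delete` are assumptions made for this example; nlpcda's own protection rules may differ.

```python
import random
import re

# Assumed notion of a protected fragment: a digit followed by any mix of
# digits, date/time units, and colons, e.g. 2020年3月8日11:40
PROTECTED = re.compile(r"[0-9][0-9年月日时分秒:：]*")


def random_delete(text, change_rate=0.3, seed=1):
    """Drop each unprotected character with probability change_rate."""
    rng = random.Random(seed)
    protected = set()
    for match in PROTECTED.finditer(text):
        protected.update(range(match.start(), match.end()))
    kept = []
    for i, ch in enumerate(text):
        if i in protected or rng.random() >= change_rate:
            kept.append(ch)
    return "".join(kept)


print(random_delete("今天是2020年3月8日11:40，天气晴朗，天气很不错"))
```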

5. NER (named entity recognition) data augmentation

Provide the directory of annotated NER data, the path of the annotated file to augment, and the number of augmented samples to generate; the data is then augmented in one step.

Ner class parameters:

ner_dir_name='ner_data': directory containing the NER data (many .txt files)
The directory given by ner_dir_name holds the annotated data files. Each line is a character and its tag, separated by a tab (\t), in the standard NER BIO format:

char1\tTAG
北\tB-LOC
京\tI-LOC
今\tO
天\tO
很\tO
热\tO
。\tO

ignore_tag_list=['O']: tags to ignore in the data (here, O)
data_augument_tag_list=['P', 'LOC']: only entities tagged P or LOC are augmented
augument_size=3: maximum number of new samples generated per annotated sample (spelled as in the code example)
seed=0: random seed (optional)
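
To make the BIO format concrete, the sketch below pulls entity spans out of tab-separated lines, which is the kind of step an augmenter needs before it can swap entities of the same tag. `extract_entities` is a hypothetical helper, not part of nlpcda.

```python
def extract_entities(lines):
    """Collect (entity_text, entity_type) pairs from 'char<TAB>tag' BIO lines."""
    entities, chars, etype = [], [], None
    for line in list(lines) + [""]:  # trailing sentinel flushes the last entity
        char, _, tag = line.rstrip("\n").partition("\t")
        continues = tag.startswith("I-") and tag[2:] == etype
        if not continues and chars:
            # The current entity ended on the previous line
            entities.append(("".join(chars), etype))
            chars, etype = [], None
        if tag.startswith("B-"):
            chars, etype = [char], tag[2:]
        elif continues:
            chars.append(char)
    return entities


sample = ["北\tB-LOC", "京\tI-LOC", "今\tO", "天\tO", "很\tO", "热\tO", "。\tO"]
print(extract_entities(sample))  # [('北京', 'LOC')]
```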

Parameters of augment():

file_name: path of one annotated training file, e.g. 0.txt
ner.augment(file_name='0.txt')

Example:

from nlpcda import Ner

ner = Ner(ner_dir_name='ner_data',
          ignore_tag_list=['O'],
          data_augument_tag_list=['P', 'LOC', 'ORG'],
          augument_size=3, seed=0)
data_sentence_arrs, data_label_arrs = ner.augment(file_name='0.txt')
# 3 augmented sentences with their labels; len(data_sentence_arrs) == 3
# You can write your own output function to save them for later training
print(data_sentence_arrs, data_label_arrs)
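
One possible shape for such an output function is sketched below. It assumes that data_sentence_arrs is a list of sentences, each a list of characters, and that data_label_arrs is a parallel list of tag lists; that structure is an assumption, so check the actual return value before relying on it. `write_bio` and the demo data are hypothetical.

```python
import os
import tempfile


def write_bio(path, sentences, labels):
    """Write parallel character/tag lists as tab-separated BIO lines."""
    with open(path, "w", encoding="utf-8") as f:
        for chars, tags in zip(sentences, labels):
            if len(chars) != len(tags):
                raise ValueError("each sentence needs exactly one tag per character")
            for ch, tag in zip(chars, tags):
                f.write(f"{ch}\t{tag}\n")
            f.write("\n")  # blank line separates sentences


# Hypothetical data in the assumed shape
demo_sentences = [["北", "京", "很", "热"]]
demo_labels = [["B-LOC", "I-LOC", "O", "O"]]
out_path = os.path.join(tempfile.mkdtemp(), "augmented.txt")
write_bio(out_path, demo_sentences, demo_labels)
```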

6. Random swapping of adjacent characters

char_gram=3: a character is only swapped with characters among its 3 neighbors
Internal detail: numbers, symbols, and similar characters are never swapped.

from nlpcda import CharPositionExchange

ts = '''这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击'''
smw = CharPositionExchange(create_num=3, change_rate=0.3, char_gram=3, seed=1)
rs = smw.replace(ts)
for s in rs:
    print(s)

'''
这是个实体：58同城；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，不差；这个nlpcad包，用于方便一键数据增强，可有效增强NLP模型的泛化性能、减少波动、抵抗对抗攻击
这实个是体：58城同；今天是2020年3月8日11:40，天气晴朗，天气很不错，空气很好，差不；这个nlpcad包，便用一数方增键强据于，增有效可强NLP模型性泛化的能、动少减波、抵对攻抗抗击
这是个体实：58城同；今是天2020年3月8日11:40，朗气晴天，天气很错不，空好很气，不差；个这nlpcad包，方便键一据增用数于强，可有效强增NLP模型的性化泛能、动减波少、抗抗击抵对攻
'''
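
The behaviour shown above can be approximated by shuffling small windows of Chinese characters while leaving everything else in place. The sketch below is one way to do that under stated assumptions: only characters in the CJK Unified Ideographs range are treated as movable, and windows never span a digit or symbol. `exchange_chars` and `is_swappable` are hypothetical names, and nlpcda's actual algorithm may differ.

```python
import random


def is_swappable(ch):
    # Assumption: only CJK ideographs move; digits, Latin letters, and
    # punctuation keep their positions
    return "\u4e00" <= ch <= "\u9fff"


def exchange_chars(text, change_rate=0.3, char_gram=3, seed=1):
    """Shuffle windows of up to char_gram adjacent ideographs."""
    rng = random.Random(seed)
    chars = list(text)
    start = 0
    while start < len(chars):
        if not is_swappable(chars[start]):
            start += 1
            continue
        # Grow a window of consecutive swappable characters
        end = start
        while end < len(chars) and end - start < char_gram and is_swappable(chars[end]):
            end += 1
        if end - start > 1 and rng.random() < change_rate:
            window = chars[start:end]
            rng.shuffle(window)
            chars[start:end] = window
        start = end
    return "".join(chars)


print(exchange_chars("今天是2020年3月8日11:40，天气晴朗，天气很不错"))
```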

7. Equivalent character replacement

Parameters:

base_file: path to a text file of equivalent characters. By default the built-in [equivalent numeric character table] is used; you can supply a richer table of your own (or call add_equivalent_list).
Each line holds one group of equivalent characters, separated by tabs (\t):
0 零
1 一 ①
...
9 九 玖 ⑨
create_num=3: return up to 3 augmented texts
change_rate=0.3: fraction of the text to change
seed: random seed

from nlpcda import EquivalentChar

test_str = '''今天是2020年3月8日11:40，天气晴朗，天气很不错。'''

s = EquivalentChar(create_num=3, change_rate=0.3)
# Add a group of equivalent characters
s.add_equivalent_list(['看', '瞅'])
res = s.replace(test_str)
print('等价字替换>>>>>>')
for line in res:
    print(line)

'''
等价字替换>>>>>>
今天是2020年3月8日11:40，天气晴朗，天气很不错。
今天是二〇2〇年3月八日1①:4〇，天气晴朗，天气很不错。
今天是二0贰零年3月捌日11:40，天气晴朗，天气很不错
'''
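
A small sketch of how an equivalence table with an add_equivalent_list-style method might work, including merging a new group into an existing one. `EquivalentTable` is a hypothetical class written for illustration; it only mirrors the method name used above and is not nlpcda's EquivalentChar.

```python
import random


class EquivalentTable:
    """Groups of interchangeable characters (illustrative only)."""

    def __init__(self, rows=()):
        self.group_of = {}  # char -> set shared by every member of its group
        for row in rows:
            self.add_equivalent_list(row.split("\t"))

    def add_equivalent_list(self, chars):
        # Merge with any groups the new characters already belong to
        merged = set(chars)
        for ch in chars:
            merged |= self.group_of.get(ch, set())
        for ch in merged:
            self.group_of[ch] = merged

    def replace(self, text, change_rate=0.3, seed=1):
        rng = random.Random(seed)
        out = []
        for ch in text:
            others = sorted(self.group_of.get(ch, {ch}) - {ch})
            if others and rng.random() < change_rate:
                out.append(rng.choice(others))
            else:
                out.append(ch)
        return "".join(out)


# Hypothetical rows in the tab-separated format
table = EquivalentTable(["1\t一\t①", "2\t二\t②"])
table.add_equivalent_list(["看", "瞅"])
print(table.replace("看2020年", change_rate=0.5))
```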

Add custom dictionary words

Call these before use to improve word segmentation:

from nlpcda import Randomword
from nlpcda import Similarword
from nlpcda import Homophone
from nlpcda import RandomDeleteChar
from nlpcda import Ner
from nlpcda import CharPositionExchange

Randomword.add_word('小明')
Randomword.add_words(['小明', '小白', '天地良心'])
# The same applies to Similarword, Homophone, and RandomDeleteChar

8. Back-translation augmentation

1. Back-translation with Baidu Translate (Chinese to English and back). Note:

Apply for your own appid and secretKey: http://api.fanyi.baidu.com/api/trans

from nlpcda import baidu_translate

zh = '天气晴朗，天气很不错，空气很好'
# Apply for your own appid and secretKey
# Two-pass washing: the Chinese that comes back usually differs from the original;
# if it is identical, discard it (this is a matter of luck)
en_s = baidu_translate(content=zh, appid='xxx', secretKey='xxx', t_from='zh', t_to='en')
zh_s = baidu_translate(content=en_s, appid='xxx', secretKey='xxx', t_from='en', t_to='zh')
print(zh_s)
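
The two-pass idea, including the rule of discarding a result that comes back unchanged, can be wrapped in a small helper. The sketch below takes the translation function as an argument so it runs without network access; `back_translate` and the fake translator are hypothetical and only illustrate the control flow.

```python
def back_translate(text, translate, source="zh", pivot="en"):
    """Translate source -> pivot -> source; return None if nothing changed."""
    pivot_text = translate(text, source, pivot)
    round_trip = translate(pivot_text, pivot, source)
    return None if round_trip == text else round_trip


# Stand-in translator with canned answers, for illustration only.
# With real credentials one could pass something like:
#   lambda s, f, t: baidu_translate(content=s, appid='xxx', secretKey='xxx',
#                                   t_from=f, t_to=t)
FAKE_ANSWERS = {
    ("天气很好", "zh", "en"): "The weather is nice",
    ("The weather is nice", "en", "zh"): "天气不错",
    ("你好", "zh", "en"): "Hello",
    ("Hello", "en", "zh"): "你好",
}


def fake_translate(content, t_from, t_to):
    return FAKE_ANSWERS[(content, t_from, t_to)]


print(back_translate("天气很好", fake_translate))  # 天气不错
print(back_translate("你好", fake_translate))      # None
```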

2. Back-translation with Google Translate

pip package: py-googletrans

A free Google Translate API. It requires a way around the firewall (for example a proxy) and is unstable.

https://py-googletrans.readthedocs.io/en/latest

pip install googletrans

from googletrans import Translator


def googletrans(content='一个免费的谷歌翻译API', t_from='zh-cn', t_to='en'):
    translator = Translator()
    s = translator.translate(text=content, dest=t_to, src=t_from)
    return s.text

9. SimBERT

Source: https://github.com/ZhuiyiTechnology/pretrained-models

Reference: https://github.com/ZhuiyiTechnology/simbert

Download any of the models listed there, unpack it to any location, and assign that path to the model_path variable:

name	Training data size	Vocabulary size	Model size	Download address
SimBERT Tiny	22 million similar sentence groups	13685	26MB	Baidu Netdisk (1tp7)
SimBERT Small	22 million similar sentence groups	13685	49MB	Baidu Netdisk (nu67)
SimBERT Base	22 million similar sentence groups	13685	344MB	Baidu Netdisk (6xhq)

Parameters:

config: a dict with model_path (location of the model downloaded above), the device setting (cpu/cuda...), the maximum length, and the random seed
sent: the sentence to augment
create_num: number of sentences to generate

Environment reference (manual installation):


keras==2.3.1
bert4keras==0.7.7
# tensorflow==1.13.1
tensorflow-gpu==1.13.1

from nlpcda import Simbert

config = {
    'model_path': '/xxxx/chinese_simbert_L-12_H-768_A-12',
    'CUDA_VISIBLE_DEVICES': '0,1',
    'max_len': 32,
    'seed': 1
}
simbert = Simbert(config=config)
sent = '把我的一个亿存银行安全吗'
synonyms = simbert.replace(sent=sent, create_num=5)
print(synonyms)
'''
[('我的一个亿，存银行，安全吗', 0.9871675372123718), 
('把一个亿存到银行里安全吗', 0.9352194666862488), 
('一个亿存银行安全吗', 0.9330801367759705), 
('一个亿的存款存银行安全吗', 0.92387855052948),
 ('我的一千万存到银行安不安全', 0.9014463424682617)]
'''
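
Each result is a (sentence, similarity score) pair, so a simple post-filter can drop low-scoring candidates such as the last one above, which changes the amount from 一个亿 to 一千万. The sketch below is a hypothetical helper, not part of nlpcda, and the 0.93 threshold is an arbitrary value chosen for this example.

```python
def keep_confident(candidates, min_score=0.93, original=None):
    """Keep sentences scoring at least min_score, skipping exact copies of the input."""
    return [sent for sent, score in candidates
            if score >= min_score and sent != original]


# Results copied from the output above
results = [
    ('我的一个亿，存银行，安全吗', 0.9871675372123718),
    ('把一个亿存到银行里安全吗', 0.9352194666862488),
    ('一个亿存银行安全吗', 0.9330801367759705),
    ('一个亿的存款存银行安全吗', 0.92387855052948),
    ('我的一千万存到银行安不安全', 0.9014463424682617),
]
kept = keep_confident(results, min_score=0.93, original='把我的一个亿存银行安全吗')
print(kept)
```

A suitable threshold depends on the task, so it is worth inspecting a sample of kept and rejected sentences before fixing a value.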