DataAugmentationNMT

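This repository contains the code and scripts for data augmentation targeting rare words in neural machine translation, as proposed in our paper.

Citation

If you use this code, please cite: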

 @InProceedings{fadaee-bisazza-monz:2017:Short2,
  author    = {Fadaee, Marzieh  and  Bisazza, Arianna  and  Monz, Christof},
  title     = {Data Augmentation for Low-Resource Neural Machine Translation},
  booktitle = {Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)},
  month     = {July},
  year      = {2017},
  address   = {Vancouver, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {567--573},
  url       = {http://aclweb.org/anthology/P17-2090}
}

Dependencies

Torch7
nn
optim
lua-cjson
torch-hdf5
Python 2.7

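Usage

Step 1: Data Preprocessing

Before training the monolingual language model in [src/trg], you need to preprocess the data using preprocess.no_preset_v.py.

 python src/preprocess.no_preset_v.py --train_txt ./wiki.train.txt \
--val_txt ./wiki.val.txt --test_txt ./wiki.test.txt \
--output_h5 ./data.h5 --output_json ./data.json

This produces the files data.h5 and data.json, which are passed to the training script.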

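Step 2: Language Model Training

After preprocessing the data, you need to train two language models, one in the forward direction and one in the backward direction.

 th src/train.lua -input_h5 data.h5 -input_json data.json \
-checkpoint_name models_rnn/cv -vocabfreq vocab_freq.trg.txt

th src/train.lua -input_h5 data.rev.h5 -input_json data.rev.json \
-checkpoint_name models_rnn_rev/cv -vocabfreq vocab_freq.trg.txt

More flags are available to configure the training.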
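
The second command reads data.rev.h5 and data.rev.json, but how the reversed input is produced is not shown here. A minimal Python 3 sketch, assuming that "reversed" means the token order of every line is flipped (this is an assumption, and the helper names are hypothetical), could look like this:

```python
def reverse_tokens(line):
    """Return the line with its whitespace-separated tokens in reverse order."""
    return " ".join(reversed(line.split()))


def reverse_file(src_path, dst_path):
    """Write a token-reversed copy of src_path to dst_path, line by line.

    Assumption: one tokenized sentence per line, tokens separated by spaces.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(reverse_tokens(line) + "\n")
```

Under that assumption, you would reverse the training, validation, and test files and rerun the Step 1 preprocessing with --output_h5 ./data.rev.h5 --output_json ./data.rev.json.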

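The vocabfreq input is the frequency list of the words in the low-resource setting, which you need later for augmentation with these language models. The format is: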

 ...
change 3028
taken 3007
large 2999
again 2994
...
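
If you do not have such a list yet, it can be built from the tokenized low-resource training text. The following Python 3 sketch is one way to do that; the function names are hypothetical, and sorting by descending count is an assumption based on the sample above rather than a documented requirement of the scripts.

```python
from collections import Counter


def build_vocab_freq(lines):
    """Count whitespace-separated tokens and return (word, count) pairs,
    most frequent first; ties are broken alphabetically."""
    counts = Counter(token for line in lines for token in line.split())
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def format_vocab_freq(pairs):
    """Render the pairs as 'word count' lines, as in the sample above."""
    return "\n".join("%s %d" % (word, count) for word, count in pairs)
```

Writing the result of format_vocab_freq to a file such as vocab_freq.trg.txt gives one "word count" entry per line.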

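Step 3: Substitution Generation

After training the language models, you can generate new sentences for the [src/trg] side of your bitext. You can run:

 th src/substitution.lua -checkpoint models_rnn/cv_xxx.t7 -start_text train.en \
-vocabfreq vocab_freq.trg.txt -sample 0 -topk 1000 -bwd 0 > train.en.subs

th src/substitution.lua -checkpoint models_rnn_rev/cv_xxx.t7 -start_text train.en.rev \
-vocabfreq vocab_freq.trg.txt -sample 0 -topk 1000 -bwd 1 > train.en.rev.subs

start_text is the side of the bitext in which you are targeting rare words. vocabfreq is the frequency list used to detect rare words. topk is the maximum number of substitutions you want for each position in the sentence.

Running these two commands gives you two lists of substitutions for one side of the corpus: train.en.subs and train.en.rev.subs. To find the substitutions that best match the context, you need the intersection of these two lists: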

 perl ./scripts/generate_intersect.pl train.en.subs train.en.rev.subs subs.intersect

subs.intersect contains the substitutions that can be used to augment the bitext. Here is an example of the output:

 information where we are successful will be published in this unk .
information{}
where{}
we{doctors:136 humans:135}
are{became:764 remained:245}
successful{}
will{}
be{}
published{interested:728 introduced:604 kept:456 performed:289 placed:615 played:535 released:477 written:790}
in{behind:932 beyond:836}
this{henry:58}
unk{}
.{}

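The first line is the original sentence. Each following line is one word of the sentence, followed by its proposed substitutions and their respective frequencies.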
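
To inspect subs.intersect programmatically, the format above can be parsed with a few lines of Python 3. This is only a sketch based on the sample shown here: the function names are hypothetical, and how consecutive sentence blocks are separated in the real file is not specified, so parse_block expects the lines of a single block.

```python
import re

# One entry per line: the word, then "{sub:freq sub:freq ...}" (may be empty).
ENTRY = re.compile(r"^(?P<word>.*)\{(?P<subs>[^{}]*)\}$")


def parse_entry(line):
    """Split 'are{became:764 remained:245}' into ('are', {'became': 764, ...})."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError("not a substitution line: %r" % line)
    subs = {}
    for item in match.group("subs").split():
        sub, _, freq = item.rpartition(":")
        subs[sub] = int(freq)
    return match.group("word"), subs


def parse_block(lines):
    """The first line is the sentence; the remaining lines are its word entries."""
    sentence = lines[0].strip()
    return sentence, [parse_entry(line) for line in lines[1:]]
```

Words with an empty substitution set, such as information{}, come back with an empty dictionary, so positions that were never candidates are easy to skip.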

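Step 4: Generate the Augmented Corpus

Using the substitution output, the [trg/src] side of the bitext, the alignments, and the lexical probability file, you can generate the augmented corpus.

You can use fast_align to obtain alignments for your bitext. The format of the alignment input is: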

 ...
0-0 1-10 2-3 2-4 2-5 3-13 4-14 5-8 5-9 6-16 7-14 8-11 10-6 11-7 12-17
0-0 1-0 2-0 2-2 3-1 3-3 4-5 5-5 6-6 8-8 9-9 10-10 11-11
...
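
Each alignment line holds the word links of one sentence pair as zero-based i-j index pairs. The Python 3 sketch below shows how such a line can be read; the function names are hypothetical, and which side of the bitext data_augmentation.pl expects on the left of each pair is not stated here, so the code simply calls the two positions left and right.

```python
def parse_alignment(line):
    """Parse '0-0 1-10 2-3' into [(0, 0), (1, 10), (2, 3)]."""
    pairs = []
    for link in line.split():
        left, right = link.split("-")
        pairs.append((int(left), int(right)))
    return pairs


def aligned_positions(pairs, index):
    """Return the right-hand positions linked to the given left-hand position."""
    return [right for left, right in pairs if left == index]
```

For the first sample line above, left-hand position 2 is linked to right-hand positions 3, 4, and 5, which shows that one word can align to several words on the other side.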

The lexical probability input can be obtained from a dictionary or from the alignments. The format is:

 ...
safely sicher 0.0051237409068
safemode safemode 1
safeness antikollisionssystem 0.3333333
safer sicherer 0.09545972221228
...
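
Each line of the lexical probability file holds a word, a translation, and a probability. A small Python 3 sketch for loading it is given below. The function names are hypothetical, and choosing the most probable translation is an assumption made for illustration; it is not confirmed to be what data_augmentation.pl does.

```python
def load_lex_probs(lines):
    """Parse 'word translation probability' lines into {word: {translation: prob}}."""
    table = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip '...' and any other line without exactly three fields
        word, translation, prob = fields
        table.setdefault(word, {})[translation] = float(prob)
    return table


def best_translation(table, word):
    """Return the most probable translation of word, or None if it is unseen."""
    candidates = table.get(word)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)
```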

To generate the augmented bitext, you can run:

 perl ./scripts/data_augmentation.pl subs.intersect train.de alignment.txt lex.txt augmentedOutput

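This generates two files: augmentedOutput.augmented in the [src/trg] language and augmentedOutput.fillout in the [trg/src] language. The first file is the side of the bitext that has been augmented by targeting rare words. The second file contains the corresponding translations of the augmented sentences.

If you want more than one change per sentence, you can run: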

 perl ./scripts/data_augmentation_multiplechanges.pl subs.intersect train.de alignment.txt lex.txt augmentedOutput

Sample output

Here is a sentence from the augmented file in [src/trg]:

 at the same time , the rights of consumers began:604~need to be maintained .

And the corresponding sentence from the fillout file in [trg/src]:

 gleichzeitig begann~müssen die rechte der verbraucher geschützt werden .

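In the augmented file, the word need is replaced by the word began, which has frequency 604. In the fillout file, begann, the translation of began, replaces the original word müssen.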
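
The markup in both files follows the pattern substitute~original, with an extra :frequency after the substitute in the augmented file. The Python 3 sketch below strips that markup and keeps only the substituted word. It is inferred from the two sample sentences above and uses a hypothetical function name; it does not reproduce filter_out_augmentations.pl, which also takes a numeric argument.

```python
def strip_markup(line):
    """Keep the substituted word and drop the ':freq' and '~original' parts."""
    clean = []
    for token in line.split():
        substitute, sep, original = token.partition("~")
        if sep and substitute and original:
            # 'began:604' -> 'began'; 'begann' has no frequency and stays as is.
            head, colon, freq = substitute.rpartition(":")
            if colon and freq.isdigit():
                substitute = head
            clean.append(substitute)
        else:
            clean.append(token)  # ordinary token without markup
    return " ".join(clean)
```

Applied to the two sample sentences, this yields plain text in which began and begann stand in place of need and müssen.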

Step 5: Generate a Clean Bitext for Translation

To remove all markup and obtain a clean bitext that can be used for translation training, you can run:

 perl ./scripts/filter_out_augmentations.pl augmentedOutput.en augmentedOutput.de 1000

At this step you can impose a stricter frequency limit on the rare words you want to augment.

Acknowledgments

The following code is used in this work:

torch-rnn by Justin Johnson

