This repository provides code for training and testing state-of-the-art models for grammatical error correction, the official PyTorch implementation of the following paper:
GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)
It is mainly based on AllenNLP and transformers.
The following command installs all necessary packages:
```bash
pip install -r requirements.txt
```
The project was tested with Python 3.7.
All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the following command:
```bash
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
```
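As an illustration, here is a hedged sketch of preprocessing a parallel GEC corpus; the file names under `data/` are placeholders invented for the example, not files shipped with the repository:

```bash
# Placeholder paths; substitute your own parallel data.
# -s: source file with original (possibly errorful) sentences, one per line
# -t: target file with corrected sentences, aligned line by line with the source
# -o: output file in the tagged format expected by train.py
python utils/preprocess_data.py \
    -s data/train_source.txt \
    -t data/train_target.txt \
    -o data/train_preprocessed.txt
```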
Pretrained models:
| Pretrained encoder | Confidence bias | Min. error probability | CoNLL-2014 (test) | BEA-2019 (test) |
|---|---|---|---|---|
| BERT [link] | 0.1 | 0.41 | 61.0 | 68.0 |
| RoBERTa [link] | 0.2 | 0.5 | 64.0 | 71.8 |
| XLNet [link] | 0.2 | 0.5 | 63.2 | 71.2 |
Note: The scores in the table differ from those in the paper because a later version of the transformers library was used. To reproduce the results reported in the paper, use this version of the repository.
To train the model, simply run:
```bash
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
    --model_dir MODEL_DIR
```
There are many parameters you can specify, among them:
- `cold_steps_count` - the number of epochs during which only the last linear layer is trained
- `transformer_model` {bert, distilbert, gpt2, roberta, transformerxl, xlnet, albert} - the encoder model
- `tn_prob` - the probability of sampling sentences with no errors; helps to balance precision/recall
- `pieces_per_token` - the maximum number of subwords per token; helps to avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
We describe all the parameters that we use for training and evaluation here.
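For illustration, a hedged sketch of a training run that sets a few of the parameters listed above; it assumes these parameters are passed to `train.py` as same-named command-line flags, and all paths and values are placeholders chosen for the example:

```bash
# Assumed flag names mirror the parameter names above; values are illustrative only.
python train.py \
    --train_set data/train_preprocessed.txt \
    --dev_set data/dev_preprocessed.txt \
    --model_dir models/gector_roberta \
    --transformer_model roberta \
    --cold_steps_count 2 \
    --tn_prob 0 \
    --pieces_per_token 5
```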
To run your model on an input file, use the following command:
```bash
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
    --vocab_path VOCAB_PATH --input_file INPUT_FILE \
    --output_file OUTPUT_FILE
```
Among the parameters:
- `min_error_probability` - minimal error probability (as in the paper)
- `additional_confidence` - confidence bias (as in the paper)
- `special_tokens_fix` - needed to reproduce some of the reported results of pretrained models

For evaluation we use M^2Scorer and ERRANT.
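As a hedged example, the following runs inference with the hyperparameters from the RoBERTa row of the table above (confidence bias 0.2, minimum error probability 0.5); the model and vocabulary paths are placeholders, and the flag names are assumed to match the parameter names listed above:

```bash
# Placeholder paths; inference hyperparameters taken from the RoBERTa row above.
python predict.py \
    --model_path models/roberta_gector.th \
    --vocab_path data/output_vocabulary \
    --input_file data/input.txt \
    --output_file data/predictions.txt \
    --additional_confidence 0.2 \
    --min_error_probability 0.5 \
    --special_tokens_fix 1  # value assumed; used to reproduce reported results for some encoders
```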
This repository also implements the code of the following paper:
Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)
For data preprocessing, the same interface as for GEC can be used. For the training and evaluation stages, utils/filter_brackets.py is used to remove noise. During inference, we use the --normalize flag.
| Model | SARI (TurkCorpus) | SARI (ASSET) | FKGL |
|---|---|---|---|
| TST-FINAL [link] | 39.9 | 40.3 | 7.65 |
| TST-FINAL + tweaks | 41.0 | 42.7 | 7.61 |
Inference tweak parameters:
iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04
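A hedged sketch of applying these inference tweaks, assuming each one is exposed as a predict.py flag with the same name and that --normalize is a simple switch; the model, vocabulary, and data paths are placeholders:

```bash
# Placeholder paths; tweak values taken from the list above.
python predict.py \
    --model_path models/tst_final.th \
    --vocab_path data/output_vocabulary \
    --input_file data/complex_sentences.txt \
    --output_file data/simplified_sentences.txt \
    --normalize \
    --iteration_count 2 \
    --additional_keep_confidence -0.68 \
    --additional_del_confidence -0.84 \
    --min_error_probability 0.04
```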
For evaluation, use the EASSE package.
Note: The scores in the table are very close to those in the paper, but they do not match exactly, for two reasons.
If you find this work useful for your research, please cite our papers:
@inproceedings{omelianchuk-etal-2020-gector,
title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
author = "Omelianchuk, Kostiantyn and
Atrasevych, Vitaliy and
Chernodub, Artem and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
month = jul,
year = "2020",
address = "Seattle, WA, USA → Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.bea-1.16",
pages = "163--170",
abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}
@inproceedings{omelianchuk-etal-2021-text,
title = "{T}ext {S}implification by {T}agging",
author = "Omelianchuk, Kostiantyn and
Raheja, Vipul and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.bea-1.2",
pages = "11--25",
abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}