This repository provides code for training and testing state-of-the-art models for grammatical error correction, the official PyTorch implementation of the following paper:
GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)
It is mainly based on AllenNLP and transformers.
The following command installs all necessary packages:
```bash
pip install -r requirements.txt
```
The project was tested with Python 3.7.
All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the following command:
```bash
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
```
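As an illustration, here is a hedged sketch of preprocessing a parallel GEC corpus; the file names under `data/` are placeholders invented for the example, not files shipped with the repository:

```bash
# Placeholder paths; substitute your own parallel data.
# -s: source file with original (possibly errorful) sentences, one per line
# -t: target file with corrected sentences, aligned line by line with the source
# -o: output file in the tagged format expected by train.py
python utils/preprocess_data.py \
    -s data/train_source.txt \
    -t data/train_target.txt \
    -o data/train_preprocessed.txt
```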
Pretrained models:
| Pretrained encoder | Confidence bias | Min. error probability | CoNLL-2014 (test) | BEA-2019 (test) |
|---|---|---|---|---|
| BERT [link] | 0.1 | 0.41 | 61.0 | 68.0 |
| RoBERTa [link] | 0.2 | 0.5 | 64.0 | 71.8 |
| XLNet [link] | 0.2 | 0.5 | 63.2 | 71.2 |
Note: The scores in the table differ from those in the paper because a later version of the transformers library was used. To reproduce the results reported in the paper, use this version of the repository.
To train the model, simply run:
```bash
python train.py --train_set TRAIN_SET --dev_set DEV_SET \
    --model_dir MODEL_DIR
```
There are many parameters you can specify, among them:
- `cold_steps_count` - the number of epochs during which only the last linear layer is trained
- `transformer_model` {bert, distilbert, gpt2, roberta, transformerxl, xlnet, albert} - the encoder model
- `tn_prob` - the probability of sampling sentences with no errors; helps to balance precision/recall
- `pieces_per_token` - the maximum number of subwords per token; helps to avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
We describe all the parameters that we use for training and evaluation here.
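For illustration, a hedged sketch of a training run that sets a few of the parameters listed above; it assumes these parameters are passed to `train.py` as same-named command-line flags, and all paths and values are placeholders chosen for the example:

```bash
# Assumed flag names mirror the parameter names above; values are illustrative only.
python train.py \
    --train_set data/train_preprocessed.txt \
    --dev_set data/dev_preprocessed.txt \
    --model_dir models/gector_roberta \
    --transformer_model roberta \
    --cold_steps_count 2 \
    --tn_prob 0 \
    --pieces_per_token 5
```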
To run your model on an input file, use the following command:
```bash
python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
    --vocab_path VOCAB_PATH --input_file INPUT_FILE \
    --output_file OUTPUT_FILE
```
Among the parameters:
- `min_error_probability` - minimal error probability (as in the paper)
- `additional_confidence` - confidence bias (as in the paper)
- `special_tokens_fix` - needed to reproduce some of the reported results of pretrained models

For evaluation we use M^2Scorer and ERRANT.
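As a hedged example, the following runs inference with the hyperparameters from the RoBERTa row of the table above (confidence bias 0.2, minimum error probability 0.5); the model and vocabulary paths are placeholders, and the flag names are assumed to match the parameter names listed above:

```bash
# Placeholder paths; inference hyperparameters taken from the RoBERTa row above.
python predict.py \
    --model_path models/roberta_gector.th \
    --vocab_path data/output_vocabulary \
    --input_file data/input.txt \
    --output_file data/predictions.txt \
    --additional_confidence 0.2 \
    --min_error_probability 0.5 \
    --special_tokens_fix 1  # value assumed; used to reproduce reported results for some encoders
```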
This repository also implements the code of the following paper:
Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)
For data preprocessing, the same interface as for GEC can be used. For the training and evaluation stages, utils/filter_brackets.py is used to remove noise. During inference, we use the --normalize flag.
| Model | SARI (TurkCorpus) | SARI (ASSET) | FKGL |
|---|---|---|---|
| TST-FINAL [link] | 39.9 | 40.3 | 7.65 |
| TST-FINAL + tweaks | 41.0 | 42.7 | 7.61 |
Inference tweak parameters:
iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04
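A hedged sketch of applying these inference tweaks, assuming each one is exposed as a predict.py flag with the same name and that --normalize is a simple switch; the model, vocabulary, and data paths are placeholders:

```bash
# Placeholder paths; tweak values taken from the list above.
python predict.py \
    --model_path models/tst_final.th \
    --vocab_path data/output_vocabulary \
    --input_file data/complex_sentences.txt \
    --output_file data/simplified_sentences.txt \
    --normalize \
    --iteration_count 2 \
    --additional_keep_confidence -0.68 \
    --additional_del_confidence -0.84 \
    --min_error_probability 0.04
```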
For evaluation, use the EASSE package.
Note: The scores in the table are very close to those in the paper, but they do not match exactly, for two reasons.
If you find this work useful for your research, please cite our papers:
@inproceedings{omelianchuk-etal-2020-gector,
title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
author = "Omelianchuk, Kostiantyn and
Atrasevych, Vitaliy and
Chernodub, Artem and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
month = jul,
year = "2020",
address = "Seattle, WA, USA → Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.bea-1.16",
pages = "163--170",
abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}
@inproceedings{omelianchuk-etal-2021-text,
title = "{T}ext {S}implification by {T}agging",
author = "Omelianchuk, Kostiantyn and
Raheja, Vipul and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.bea-1.2",
pages = "11--25",
abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}