This repository provides code for training and testing state-of-the-art models for grammatical error correction, as the official PyTorch implementation of the following paper:
GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)
It is mainly based on AllenNLP and transformers.
The following command installs all necessary packages:
pip install -r requirements.txt
The project was tested with Python 3.7.
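As a minimal setup sketch (the virtual-environment name below is arbitrary, and the standard venv module is assumed to be available):

```
# Create and activate an isolated Python 3.7 environment (name is illustrative)
python3.7 -m venv gector-env
source gector-env/bin/activate

# Install the pinned dependencies, including AllenNLP and transformers
pip install -r requirements.txt
```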
All public GEC datasets used in the paper can be downloaded here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the command:
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE

The pretrained models and their inference hyperparameters are listed below:

| Pretrained encoder | Confidence bias | Min error probability | CoNLL-2014 (test) | BEA-2019 (test) |
|---|---|---|---|---|
| BERT [link] | 0.1 | 0.41 | 61.0 | 68.0 |
| RoBERTa [link] | 0.2 | 0.5 | 64.0 | 71.8 |
| XLNet [link] | 0.2 | 0.5 | 63.2 | 71.2 |
Note: The scores in the table differ from those in the paper because a later version of the transformers library was used. To reproduce the results reported in the paper, use this version of the repository.
To train the model, simply run:
python train.py --train_set TRAIN_SET --dev_set DEV_SET
    --model_dir MODEL_DIR

Among the many parameters that can be specified:
- `cold_steps_count`: the number of epochs during which only the last linear layer is trained
- `transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}`: the encoder model
- `tn_prob`: the probability of keeping sentences without any errors; helps to balance precision/recall
- `pieces_per_token`: the maximum number of subword pieces per token; helps to avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
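As an illustration, a training invocation combining these parameters might look like the sketch below; the file paths and hyperparameter values are placeholders, and the flag spellings are assumed to mirror the parameter names listed above.

```
# Hypothetical training run: paths and values are illustrative only
python train.py --train_set data/train_preprocessed.txt \
    --dev_set data/dev_preprocessed.txt \
    --model_dir models/gector_roberta \
    --transformer_model roberta \
    --cold_steps_count 2 \
    --tn_prob 0.1 \
    --pieces_per_token 5
```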
All the parameters that we used for training and evaluation are described here.
To run the model on an input file, use the following command:
python predict.py --model_path MODEL_PATH [MODEL_PATH ...]
--vocab_path VOCAB_PATH --input_file INPUT_FILE
    --output_file OUTPUT_FILE

Among the parameters:
- `min_error_probability`: the minimum error probability (as in the paper)
- `additional_confidence`: the confidence bias (as in the paper)
- `special_tokens_fix`: needed to reproduce the reported results of some pretrained models

For evaluation, use M^2Scorer and ERRANT.
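For example, a prediction run using the inference hyperparameters from the RoBERTa row of the table above could be sketched as follows; the paths are placeholders, and the special_tokens_fix value shown is only an assumption about what that pretrained model requires.

```
# Hypothetical inference run with the RoBERTa hyperparameters from the table
python predict.py --model_path models/gector_roberta/best.th \
    --vocab_path data/output_vocabulary \
    --input_file input.txt \
    --output_file output.txt \
    --additional_confidence 0.2 \
    --min_error_probability 0.5 \
    --special_tokens_fix 1
```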
This repository also implements the code of the following paper:
Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)
For data preprocessing, the same interface as for GEC can be used. For the training and evaluation stages, utils/filter_brackets.py is used to filter out noise. During inference, we use the --normalize flag.
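A rough sketch of that pipeline is shown below; the data file names are placeholders, utils/filter_brackets.py is assumed to be applied to the parallel data beforehand (its exact arguments are not documented here), and the predict.py flags follow the GEC example above with --normalize added.

```
# Preprocess the simplification data with the same interface as for GEC
# (utils/filter_brackets.py would be applied to the parallel data first;
#  its exact arguments are not shown here)
python utils/preprocess_data.py -s simplification_source.txt \
    -t simplification_target.txt -o tst_train.txt

# Inference with output normalization enabled; other flags as in the GEC example
python predict.py --model_path models/tst/best.th \
    --vocab_path data/output_vocabulary \
    --input_file complex_sentences.txt \
    --output_file simplified_sentences.txt \
    --normalize
```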
| Model | SARI (TurkCorpus) | SARI (ASSET) | FKGL |
|---|---|---|---|
| TST-FINAL [link] | 39.9 | 40.3 | 7.65 |
| TST-FINAL + tweaks | 41.0 | 42.7 | 7.61 |
Inference tweak parameters:
iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04
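Assuming these tweaks are passed to predict.py as flags with the same names (additional_keep_confidence and additional_del_confidence do not appear elsewhere in this README, so treating them as command-line flags is an assumption), an inference call with the tweaks enabled might look like:

```
# Hypothetical inference run with the tweak values above; paths are placeholders.
# The "=" form is used so the negative values are not parsed as separate flags.
python predict.py --model_path models/tst/best.th \
    --vocab_path data/output_vocabulary \
    --input_file complex_sentences.txt \
    --output_file simplified_sentences.txt \
    --normalize \
    --iteration_count 2 \
    --additional_keep_confidence=-0.68 \
    --additional_del_confidence=-0.84 \
    --min_error_probability 0.04
```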
For evaluation, use the EASSE package.
Note: The scores in the table are very close to those in the paper, but they do not match exactly, for two reasons:
If you find this work useful for your research, please cite our papers:
@inproceedings{omelianchuk-etal-2020-gector,
title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
author = "Omelianchuk, Kostiantyn and
Atrasevych, Vitaliy and
Chernodub, Artem and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
month = jul,
year = "2020",
address = "Seattle, WA, USA → Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.bea-1.16",
pages = "163--170",
abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}
@inproceedings{omelianchuk-etal-2021-text,
title = "{T}ext {S}implification by {T}agging",
author = "Omelianchuk, Kostiantyn and
Raheja, Vipul and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.bea-1.2",
pages = "11--25",
abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}