This repository provides code for training and testing state-of-the-art models for grammatical error correction, as the official PyTorch implementation of the following paper:
GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)
It is mainly based on AllenNLP and transformers.
The following command installs all necessary packages:
pip install -r requirements.txt
The project was tested with Python 3.7.
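As a minimal setup sketch (the virtual-environment name below is arbitrary, and the standard venv module is assumed to be available):

```
# Create and activate an isolated Python 3.7 environment (name is illustrative)
python3.7 -m venv gector-env
source gector-env/bin/activate

# Install the pinned dependencies, including AllenNLP and transformers
pip install -r requirements.txt
```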
All public GEC datasets used in the paper can be downloaded here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the command:
python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE

The pretrained models and their inference hyperparameters are listed below:

| Pretrained encoder | Confidence bias | Min error probability | CoNLL-2014 (test) | BEA-2019 (test) |
|---|---|---|---|---|
| BERT [link] | 0.1 | 0.41 | 61.0 | 68.0 |
| RoBERTa [link] | 0.2 | 0.5 | 64.0 | 71.8 |
| XLNet [link] | 0.2 | 0.5 | 63.2 | 71.2 |
Note: The scores in the table differ from those in the paper because a later version of the transformers library was used. To reproduce the results reported in the paper, use this version of the repository.
To train the model, simply run:
python train.py --train_set TRAIN_SET --dev_set DEV_SET
    --model_dir MODEL_DIR

Among the many parameters that can be specified:
- `cold_steps_count`: the number of epochs during which only the last linear layer is trained
- `transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}`: the encoder model
- `tn_prob`: the probability of keeping sentences without any errors; helps to balance precision/recall
- `pieces_per_token`: the maximum number of subword pieces per token; helps to avoid CUDA out-of-memory errors

In our experiments we used a 98/2 train/dev split.
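As an illustration, a training invocation combining these parameters might look like the sketch below; the file paths and hyperparameter values are placeholders, and the flag spellings are assumed to mirror the parameter names listed above.

```
# Hypothetical training run: paths and values are illustrative only
python train.py --train_set data/train_preprocessed.txt \
    --dev_set data/dev_preprocessed.txt \
    --model_dir models/gector_roberta \
    --transformer_model roberta \
    --cold_steps_count 2 \
    --tn_prob 0.1 \
    --pieces_per_token 5
```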
All the parameters that we used for training and evaluation are described here.
To run the model on an input file, use the following command:
python predict.py --model_path MODEL_PATH [MODEL_PATH ...]
--vocab_path VOCAB_PATH --input_file INPUT_FILE
    --output_file OUTPUT_FILE

Among the parameters:
- `min_error_probability`: the minimum error probability (as in the paper)
- `additional_confidence`: the confidence bias (as in the paper)
- `special_tokens_fix`: needed to reproduce the reported results of some pretrained models

For evaluation, use M^2Scorer and ERRANT.
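For example, a prediction run using the inference hyperparameters from the RoBERTa row of the table above could be sketched as follows; the paths are placeholders, and the special_tokens_fix value shown is only an assumption about what that pretrained model requires.

```
# Hypothetical inference run with the RoBERTa hyperparameters from the table
python predict.py --model_path models/gector_roberta/best.th \
    --vocab_path data/output_vocabulary \
    --input_file input.txt \
    --output_file output.txt \
    --additional_confidence 0.2 \
    --min_error_probability 0.5 \
    --special_tokens_fix 1
```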
This repository also implements the code of the following paper:
Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)
For data preprocessing, the same interface as for GEC can be used. For the training and evaluation stages, utils/filter_brackets.py is used to filter out noise. During inference, we use the --normalize flag.
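A rough sketch of that pipeline is shown below; the data file names are placeholders, utils/filter_brackets.py is assumed to be applied to the parallel data beforehand (its exact arguments are not documented here), and the predict.py flags follow the GEC example above with --normalize added.

```
# Preprocess the simplification data with the same interface as for GEC
# (utils/filter_brackets.py would be applied to the parallel data first;
#  its exact arguments are not shown here)
python utils/preprocess_data.py -s simplification_source.txt \
    -t simplification_target.txt -o tst_train.txt

# Inference with output normalization enabled; other flags as in the GEC example
python predict.py --model_path models/tst/best.th \
    --vocab_path data/output_vocabulary \
    --input_file complex_sentences.txt \
    --output_file simplified_sentences.txt \
    --normalize
```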
| Model | SARI (TurkCorpus) | SARI (ASSET) | FKGL |
|---|---|---|---|
| TST-FINAL [link] | 39.9 | 40.3 | 7.65 |
| TST-FINAL + tweaks | 41.0 | 42.7 | 7.61 |
Inference tweak parameters:
iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04
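Assuming these tweaks are passed to predict.py as flags with the same names (additional_keep_confidence and additional_del_confidence do not appear elsewhere in this README, so treating them as command-line flags is an assumption), an inference call with the tweaks enabled might look like:

```
# Hypothetical inference run with the tweak values above; paths are placeholders.
# The "=" form is used so the negative values are not parsed as separate flags.
python predict.py --model_path models/tst/best.th \
    --vocab_path data/output_vocabulary \
    --input_file complex_sentences.txt \
    --output_file simplified_sentences.txt \
    --normalize \
    --iteration_count 2 \
    --additional_keep_confidence=-0.68 \
    --additional_del_confidence=-0.84 \
    --min_error_probability 0.04
```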
For evaluation, use the EASSE package.
Note: The scores in the table are very close to those in the paper, but they do not match exactly, for two reasons:
If you find this work useful for your research, please cite our papers:
@inproceedings{omelianchuk-etal-2020-gector,
title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
author = "Omelianchuk, Kostiantyn and
Atrasevych, Vitaliy and
Chernodub, Artem and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
month = jul,
year = "2020",
address = "Seattle, WA, USA → Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.bea-1.16",
pages = "163--170",
abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}
@inproceedings{omelianchuk-etal-2021-text,
title = "{T}ext {S}implification by {T}agging",
author = "Omelianchuk, Kostiantyn and
Raheja, Vipul and
Skurzhanskyi, Oleksandr",
booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
month = apr,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.bea-1.2",
pages = "11--25",
abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}