
GECToR - Grammatical Error Correction: Tag, Not Rewrite

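This repository provides code for training and testing state-of-the-art models for grammatical error correction, together with the official PyTorch implementation of the following paper: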

GECToR - Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)

It is mainly based on AllenNLP and transformers.

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.

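Datasets

All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the command: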

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
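
The "special format" is not spelled out in this README. As a rough illustration of the tagging idea only, the sketch below aligns a source sentence with its correction using Python's difflib and attaches edit tags to source tokens. It is not the algorithm in utils/preprocess_data.py; the tag names ($KEEP, $DELETE, $REPLACE_x, $APPEND_x), the $START token, and both helper functions are assumptions made for this example.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

Tagged = List[Tuple[str, List[str]]]


def tag_tokens(source: str, target: str) -> Tagged:
    """Attach illustrative edit tags to each source token (hypothetical helper)."""
    # A leading $START token gives insertions at the sentence start an anchor.
    src = ["$START"] + source.split()
    tgt = ["$START"] + target.split()
    edits: List[List[str]] = [[] for _ in src]
    matcher = SequenceMatcher(a=src, b=tgt, autojunk=False)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            continue
        # Pair source and target tokens positionally as replacements.
        paired = min(i2 - i1, j2 - j1)
        for k in range(paired):
            edits[i1 + k].append(f"$REPLACE_{tgt[j1 + k]}")
        # Leftover source tokens are deleted.
        for i in range(i1 + paired, i2):
            edits[i].append("$DELETE")
        # Leftover target tokens are appended after the preceding source token.
        anchor = i1 + paired - 1
        for j in range(j1 + paired, j2):
            edits[anchor].append(f"$APPEND_{tgt[j]}")
    return [(tok, tags or ["$KEEP"]) for tok, tags in zip(src, edits)]


def apply_tags(tagged: Tagged) -> str:
    """Rebuild the corrected sentence from tagged tokens (hypothetical helper)."""
    out: List[str] = []
    for tok, tags in tagged:
        word = None if tok == "$START" else tok
        appended: List[str] = []
        for tag in tags:
            if tag == "$DELETE":
                word = None
            elif tag.startswith("$REPLACE_"):
                word = tag[len("$REPLACE_"):]
            elif tag.startswith("$APPEND_"):
                appended.append(tag[len("$APPEND_"):])
        if word is not None:
            out.append(word)
        out.extend(appended)
    return " ".join(out)


if __name__ == "__main__":
    pairs = tag_tokens("He go to school yesterday", "He went to the school yesterday")
    for token, tags in pairs:
        print(token, tags)
    print(apply_tags(pairs))
```

Applying the tags back to the source reproduces the target sentence, which is a convenient sanity check for any alignment scheme of this kind.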

Pretrained models

Pretrained encoder	Confidence bias	Min error probability	CoNLL-2014 (test)	BEA-2019 (test)
BERT [link]	0.1	0.41	61.0	68.0
RoBERTa [link]	0.2	0.5	64.0	71.8
XLNet [link]	0.2	0.5	63.2	71.2

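Note: The scores in the table differ from those in the paper because a later version of transformers is used. To reproduce the results reported in the paper, use this version of the repository.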

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

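There are a lot of parameters to specify, among them:

cold_steps_count: the number of epochs during which only the last linear layer is trained
transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}: the model encoder
tn_prob: the probability of getting sentences with no errors; helps to balance precision/recall
pieces_per_token: the maximum number of subwords per token; helps to avoid CUDA out-of-memory errors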

In our experiments we used a 98/2 train/dev split.
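
How the split was produced is not stated here. A minimal sketch of one way to obtain a 98/2 split from a list of preprocessed lines, assuming a seeded random shuffle (the function name, the shuffle, and the seed are assumptions for illustration):

```python
import random
from typing import List, Sequence, Tuple


def split_train_dev(
    lines: Sequence[str], dev_fraction: float = 0.02, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle the lines reproducibly and hold out dev_fraction of them as dev."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one dev example when the input is non-empty.
    dev_size = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]


if __name__ == "__main__":
    examples = [f"sentence {i}" for i in range(1000)]
    train, dev = split_train_dev(examples)
    print(len(train), len(dev))
```

The two returned lists can then be written to the files passed as TRAIN_SET and DEV_SET.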

Training parameters

All the parameters we use for training and evaluation are described here.

Model inference

To run your model on an input file, use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE

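Among the parameters:

min_error_probability: the minimum error probability (as in the paper)
additional_confidence: the confidence bias (as in the paper)
special_tokens_fix: used to reproduce some of the reported results of the pretrained models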
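
To make the first two parameters more concrete, here is a hedged sketch of one plausible reading of them: the confidence bias is added to the probability of keeping a token unchanged, and the whole sentence is left untouched when no token's error probability reaches the minimum. This is not the repository's inference code; the function, the "$KEEP" tag name, and the input layout (one tag-probability dict and one error probability per token) are assumptions.

```python
from typing import Dict, List


def choose_tags(
    tag_probs: List[Dict[str, float]],
    error_probs: List[float],
    additional_confidence: float = 0.0,
    min_error_probability: float = 0.0,
) -> List[str]:
    """Pick one tag per token after applying the two inference tweaks."""
    if not tag_probs:
        return []
    # Sentence-level gate: if no token looks erroneous enough, change nothing.
    if max(error_probs) < min_error_probability:
        return ["$KEEP"] * len(tag_probs)
    chosen = []
    for probs in tag_probs:
        biased = dict(probs)
        # A positive bias makes the tagger more conservative (favors precision).
        biased["$KEEP"] = biased.get("$KEEP", 0.0) + additional_confidence
        chosen.append(max(biased, key=biased.get))
    return chosen


if __name__ == "__main__":
    probs = [
        {"$KEEP": 0.45, "$REPLACE_went": 0.55},
        {"$KEEP": 0.90, "$DELETE": 0.10},
    ]
    errors = [0.6, 0.1]
    print(choose_tags(probs, errors))
    print(choose_tags(probs, errors, additional_confidence=0.2))
    print(choose_tags(probs, errors, min_error_probability=0.7))
```

Raising either value trades recall for precision, which is consistent with each pretrained model in the table above having its own tuned pair of values.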

For evaluation, use M^2Scorer and ERRANT.
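
Both tools report an F0.5 score, which weights precision more heavily than recall. The sketch below shows only the formula, computed from edit-level counts that the scorers would derive themselves; the function name and the example counts are made up for illustration.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from true-positive, false-positive and false-negative edit counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # 6 correct edits, 2 spurious edits, 6 missed edits: P = 0.75, R = 0.5.
    print(round(f_beta(tp=6, fp=2, fn=6), 4))
```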

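Text Simplification

This repository also implements the code of the following paper:

Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with EACL 2021)

For data preprocessing, training, and testing, the same interface as for GEC can be used. In both the training and evaluation stages, utils/filter_brackets.py is used to remove noise. During inference, we use the --normalize flag.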

	SARI		FKGL
Model	TurkCorpus	ASSET	FKGL
TST-FINAL [link]	39.9	40.3	7.65
TST-FINAL + tweaks	41.0	42.7	7.61

Inference tweak parameters:

iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04

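For evaluation, use the EASSE package.

Note: The scores in the table are very close to those in the paper, but do not match them exactly, for two reasons:

In the paper, we reported the average scores of 4 models trained with different seeds.
We merged the codebases for the GEC and Text Simplification tasks and updated them to a newer version of the transformers library.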

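Noticeable works based on GECToR

Vanilla PyTorch implementation of GECToR with AMP and distributed support by DeepSpeed [code]
Improving Sequence Tagging approach for Grammatical Error Correction task [paper] [code]
LM-Critic: Language Models for Unsupervised Grammatical Error Correction [paper] [code]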

Citation

If you find this work useful for your research, please cite our papers:

GECToR - Grammatical Error Correction: Tag, Not Rewrite

@inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn  and
      Atrasevych, Vitaliy  and
      Chernodub, Artem  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}

Text Simplification by Tagging

@inproceedings{omelianchuk-etal-2021-text,
    title = "{T}ext {S}implification by {T}agging",
    author = "Omelianchuk, Kostiantyn  and
      Raheja, Vipul  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.bea-1.2",
    pages = "11--25",
    abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}
