GECToR - Grammatical Error Correction: Tag, Not Rewrite

This repository provides code for training and testing state-of-the-art models for grammatical error correction, with the official PyTorch implementation of the following paper:

GECToR - Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk, Vitaliy Atrasevych, Artem Chernodub, Oleksandr Skurzhanskyi
Grammarly
15th Workshop on Innovative Use of NLP for Building Educational Applications (co-located with ACL 2020)

It is mainly based on AllenNLP and transformers.

Installation

The following command installs all necessary packages:

pip install -r requirements.txt

The project was tested using Python 3.7.

Datasets

All the public GEC datasets used in the paper can be downloaded from here.
Synthetically created datasets can be generated/downloaded here.
To train the model, the data has to be preprocessed and converted to a special format with the command:

python utils/preprocess_data.py -s SOURCE -t TARGET -o OUTPUT_FILE
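
Because the command takes the source and target files as a parallel pair, it can help to confirm they are line-aligned before preprocessing. The following is a minimal sketch, assuming one sentence per line in each file; the helper name `check_parallel` is hypothetical and not part of this repository.

```python
from pathlib import Path


def check_parallel(source_path, target_path):
    """Return the number of sentence pairs, or raise if the files are misaligned."""
    # Assumption: one sentence per line, and source line i pairs with target line i.
    src = Path(source_path).read_text(encoding="utf-8").splitlines()
    tgt = Path(target_path).read_text(encoding="utf-8").splitlines()
    if len(src) != len(tgt):
        raise ValueError(
            f"line count mismatch: {len(src)} source vs {len(tgt)} target"
        )
    return len(src)
```

Running this check first surfaces a misaligned pair of files as an explicit error rather than as silently shifted training examples.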

Pretrained models

Pretrained encoder	Confidence bias	Min error prob	CoNLL-2014 (test)	BEA-2019 (test)
BERT [link]	0.1	0.41	61.0	68.0
RoBERTa [link]	0.2	0.5	64.0	71.8
XLNet [link]	0.2	0.5	63.2	71.2

Note: The scores in the table differ from those in the paper because a later version of transformers is used. To reproduce the results reported in the paper, use this version of the repository.

Train model

To train the model, simply run:

python train.py --train_set TRAIN_SET --dev_set DEV_SET \
                --model_dir MODEL_DIR

There are many parameters to specify, among them:

cold_steps_count: the number of epochs in which only the last linear layer is trained
transformer_model {bert,distilbert,gpt2,roberta,transformerxl,xlnet,albert}: the model encoder
tn_prob: the probability of getting sentences with no errors; helps to balance precision/recall
pieces_per_token: the maximum number of subwords per token; helps to avoid CUDA out-of-memory errors
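
To illustrate what tn_prob controls, the sketch below keeps every sentence pair that contains an error and keeps error-free pairs only with probability tn_prob. It illustrates the idea only: it assumes a pair is error-free when source and target are identical, the function name `sample_pairs` is hypothetical, and this is not the repository's data reader.

```python
import random


def sample_pairs(pairs, tn_prob, seed=0):
    """Keep all erroneous pairs; keep error-free pairs with probability tn_prob."""
    rng = random.Random(seed)
    kept = []
    for source, target in pairs:
        is_error_free = source == target  # assumption: identical text means no edits
        if is_error_free and rng.random() >= tn_prob:
            continue  # drop this error-free (true negative) example
        kept.append((source, target))
    return kept
```

With tn_prob set to 0 only erroneous pairs survive, and with tn_prob set to 1 the data is left unchanged.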

In our experiments we used a 98/2 train/dev split.
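
A 98/2 split of sentence pairs can be sketched as follows. The text does not say how the split was drawn, so the random shuffle, the seed, and the function name `split_train_dev` are assumptions made for illustration.

```python
import random


def split_train_dev(pairs, dev_fraction=0.02, seed=42):
    """Shuffle sentence pairs and split them into (train, dev) lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    dev_size = max(1, round(len(shuffled) * dev_fraction))
    return shuffled[dev_size:], shuffled[:dev_size]
```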

Training parameters

We describe all the parameters that we use for training and evaluation here.

Model inference

To run your model on an input file, use the following command:

python predict.py --model_path MODEL_PATH [MODEL_PATH ...] \
                  --vocab_path VOCAB_PATH --input_file INPUT_FILE \
                  --output_file OUTPUT_FILE
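
When inference is driven from a script, the same command can be assembled as an argument list. This is a sketch that uses only the flags shown above; the function name `build_predict_command` and any file names passed to it are hypothetical.

```python
import sys


def build_predict_command(model_paths, vocab_path, input_file, output_file):
    """Assemble the argv list for predict.py using only the flags shown above."""
    return [
        sys.executable, "predict.py",
        "--model_path", *model_paths,  # one or more paths, e.g. for an ensemble
        "--vocab_path", vocab_path,
        "--input_file", input_file,
        "--output_file", output_file,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root.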

Among the parameters:

min_error_probability - minimum error probability (as in the paper)
additional_confidence - confidence bias (as in the paper)
special_tokens_fix - used to reproduce some reported results of the pretrained models
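
The effect of the first two parameters can be illustrated with a toy tag selector. This is a sketch of the idea rather than the repository's implementation: it assumes per-token tag probabilities and per-token error probabilities are already available, adds the confidence bias to the `$KEEP` tag, and leaves the whole sentence unchanged when no token reaches the minimum error probability. The function name `choose_tags` and the example tag names are illustrative.

```python
KEEP = "$KEEP"


def choose_tags(tag_probs, error_probs, additional_confidence=0.0,
                min_error_probability=0.0):
    """Pick one tag per token from toy per-token probabilities."""
    # Sentence-level gate: if no token looks erroneous enough, change nothing.
    if max(error_probs) < min_error_probability:
        return [KEEP] * len(tag_probs)
    tags = []
    for probs in tag_probs:
        biased = dict(probs)
        # Confidence bias: make "leave this token as it is" more likely to win.
        biased[KEEP] = biased.get(KEEP, 0.0) + additional_confidence
        tags.append(max(biased, key=biased.get))
    return tags
```

Both parameters make the selector more conservative as they grow: a larger bias or a higher threshold means fewer proposed edits.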

For evaluation, use M^2Scorer and ERRANT.
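
The GEC results in this document are F0.5 scores, which weight precision more heavily than recall. For reference, the standard F-beta formula can be written as a small helper; the function name `f_beta` is illustrative, and the scorers themselves compute the precision and recall it takes as input.

```python
def f_beta(precision, recall, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)
```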

Text Simplification

This repository also implements the code of the following paper:

Text Simplification by Tagging
Kostiantyn Omelianchuk, Vipul Raheja, Oleksandr Skurzhanskyi
Grammarly
16th Workshop on Innovative Use of NLP for Building Educational Applications

For data preprocessing, training, and testing, the same interface as for GEC can be used. For both the training and evaluation stages, utils/filter_brackets.py is used to remove noise. During inference, we use the --normalize flag.

Model	SARI (TurkCorpus)	SARI (ASSET)	FKGL
TST-FINAL [link]	39.9	40.3	7.65
TST-FINAL + tweaks	41.0	42.7	7.61

Inference tweak parameters:

iteration_count = 2
additional_keep_confidence = -0.68
additional_del_confidence = -0.84
min_error_probability = 0.04
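
The role of iteration_count can be illustrated with a loop that re-applies a single-pass corrector to its own output. This is a sketch under stated assumptions, not the repository's inference code: `correct_once` stands in for one tagging pass of the model, and the toy corrector `fix_first` with its `FIXES` table is hypothetical.

```python
def iterative_correct(tokens, correct_once, iteration_count=2):
    """Apply a single-pass corrector repeatedly, stopping early when nothing changes."""
    for _ in range(iteration_count):
        updated = correct_once(tokens)
        if updated == tokens:
            break  # the corrector proposes no further edits
        tokens = updated
    return tokens


# Toy single-pass corrector (hypothetical): fixes only the first wrong token it finds.
FIXES = {"go": "goes", "a": "an"}


def fix_first(tokens):
    for i, token in enumerate(tokens):
        if token in FIXES:
            return tokens[:i] + [FIXES[token]] + tokens[i + 1:]
    return tokens
```

With this toy corrector, one pass over `["he", "go", "to", "a", "office"]` fixes only "go", while two passes also fix "a".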

For evaluation, use the EASSE package.

Note: The scores in the table are very close to those in the paper, but do not fully match them for two reasons:

in the paper, we reported average scores of 4 models trained with different seeds;
we merged the codebases for the GEC and Text Simplification tasks and updated them to a newer version of the transformers library.

Notable works based on GECToR

Vanilla PyTorch implementation of GECToR with AMP and distributed support by DeepSpeed [code]
Improving Sequence Tagging approach for Grammatical Error Correction task [paper] [code]
LM-Critic: Language Models for Unsupervised Grammatical Error Correction [paper] [code]

Citation

If you find this work useful for your research, please cite our papers:

GECToR - Grammatical Error Correction: Tag, Not Rewrite

 @inproceedings{omelianchuk-etal-2020-gector,
    title = "{GECT}o{R} {--} Grammatical Error Correction: Tag, Not Rewrite",
    author = "Omelianchuk, Kostiantyn  and
      Atrasevych, Vitaliy  and
      Chernodub, Artem  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications",
    month = jul,
    year = "2020",
    address = "Seattle, WA, USA → Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.bea-1.16",
    pages = "163--170",
    abstract = "In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model/ensemble GEC tagger achieves an F{_}0.5 of 65.3/66.5 on CONLL-2014 (test) and F{_}0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.",
}

Text Simplification by Tagging

 @inproceedings{omelianchuk-etal-2021-text,
    title = "{T}ext {S}implification by {T}agging",
    author = "Omelianchuk, Kostiantyn  and
      Raheja, Vipul  and
      Skurzhanskyi, Oleksandr",
    booktitle = "Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications",
    month = apr,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.bea-1.2",
    pages = "11--25",
    abstract = "Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.",
}
