
The-NLP-Pandect

This Pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing that is available online.

Note: Quick legend on available resource types:
- open source project, usually a GitHub repository with its number of stars
- resource you can read, usually a blog post or a paper
- a collection of additional resources
- non-open source tool, framework or paid service
- a resource you can watch
- a resource you can listen to

Table of Contents

Main Section	Sub-sections Sample
NLP Resources	Paper Summaries, Conference Summaries, NLP Datasets
NLP Podcasts	NLP-only Podcasts, Podcasts with many NLP Episodes
NLP Newsletters	-
NLP Meetups	-
NLP YouTube Channels	-
NLP Benchmarks	General NLU, Question Answering, Multilingual
Research Resources	Resources on Transformer Models, Distillation and Pruning, Automated Summarization
Industry Resources	Best Practices for NLP Systems, MLOps for NLP
Speech Recognition	General Resources, Text to Speech, Speech to Text, Datasets
Topic Modeling	Blogs, Frameworks, Repositories and Projects
Keyword Extraction	Text Rank, Rake, Other Approaches
Responsible NLP	NLP and ML Interpretability, Ethics, Bias, and Equality in NLP, Adversarial Attacks for NLP
NLP Frameworks	General Purpose, Data Augmentation, Machine Translation, Adversarial Attacks, Dialog Systems & Speech, Entity and String Matching, Non-English Frameworks
Learn NLP	Courses, Books, Tutorials
NLP Communities	-
Other NLP Topics	Tokenization, Data Augmentation, Named Entity Recognition, Error Correction, AutoML/AutoNLP, Text Generation

Note: Section keywords: paper summaries, compendium, awesome list

Compendiums and awesome lists on the topic of NLP:

The NLP Index - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
Awesome NLP by keon [GitHub, 16528 stars]
Speech and Natural Language Processing Awesome List by elaboshira [GitHub, 2189 stars]
Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1274 stars]
Text Mining and Natural Language Processing Resources by stepthom [GitHub, 557 stars]
Brainsources for #NLP enthusiasts by Philip Vollet
Awesome AI/ML/DL - NLP Section [GitHub, 1473 stars]
NLP articles by Devopedia

NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries

100 Must-Read NLP Papers [GitHub, 3732 stars]
NLP Paper Summaries by dair-ai [GitHub, 1475 stars]
Curated collection of papers for the NLP practitioner [GitHub, 1075 stars]
Papers on Textual Adversarial Attack and Defense [GitHub, 1501 stars]
Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 296 stars]
A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1997 stars]
A Paper List for Style Transfer in Text [GitHub, 1609 stars]
- Video recordings index for papers

Conference Summaries

NLP Top 10 Conferences Compendium by soulbliss [GitHub, 459 stars]
- ICLR 2020 Trends
- SpacyIRL 2019 Conference in Overview
- Paper Digest - Conferences and Papers in Overview

NLP Progress and NLP Tasks:

NLP Progress by sebastianruder [GitHub, 22568 stars]
NLP Tasks by Kyubyong [GitHub, 3017 stars]

NLP Datasets:

NLP Datasets by niderhoff [GitHub, 5741 stars]
Datasets by HuggingFace [GitHub, 19096 stars]
Big Bad NLP Database
UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
MLDoc - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 152 stars]

Word and Sentence Embeddings:

Awesome Embedding Models by Hironsan [GitHub, 1752 stars]
Awesome list of Sentence Embeddings by Separius [GitHub, 2219 stars]
Awesome BERT by Jiakui [GitHub, 1846 stars]

Notebooks, Scripts and Repositories

The Super Duper NLP Repo [Website, 2020]

Non-English resources and compendiums

NLP Resources for Bahasa Indonesian [GitHub, 480 stars]
Indic NLP Catalog [GitHub, 552 stars]
Pre-trained language models for Vietnamese [GitHub, 653 stars]
Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 814 stars]
Indic NLP Library [GitHub, 550 stars]
AI4Bharat-IndicNLP Portal
ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 387 stars]
zemberek-nlp - NLP tools for Turkish [GitHub, 1146 stars]
TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools
KLUE - Korean Language Understanding Evaluation [GitHub, 560 stars]
Persian NLP Benchmark - A benchmark for evaluation and comparison of various NLP tasks in Persian [GitHub, 73 stars]
nlp-greek - Greek language sources [GitHub, 5 stars]
Awesome NLP Resources for Hungarian [GitHub, 221 stars]

Pre-trained NLP models

List of pre-trained NLP models [GitHub, 170 stars]
Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 3019 stars]
Spanish Language Models and resources [GitHub, 251 stars]

NLP History

General

Modern Deep Learning Techniques Applied to Natural Language Processing [GitHub, 1328 stars]
- A Review of the History of Natural Language Processing [Blog, October 2018]

2020 Year in Review

- Natural Language Processing in 2020: The Year in Review [Blog, December 2020]
- ML and NLP Research Highlights of 2020 [Blog, January 2021]

- Back to the Table of Contents

NLP-only podcasts

NLP Highlights [Years: 2017 - now, Status: active]
NLP Zone [Years: 2021 - now, Status: active]

Many NLP episodes

TWIML AI [Years: 2016 - now, Status: active]
Practical AI [Years: 2018 - now, Status: active]
The Data Exchange [Years: 2019 - now, Status: active]
Gradient Dissent [Years: 2020 - now, Status: active]
Machine Learning Street Talk [Years: 2020 - now, Status: active]
DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]

Some NLP episodes

The Super Data Science Podcast [Years: 2016 - now, Status: active]
Data Hack Radio [Years: 2018 - now, Status: active]
AI Game Changers [Years: 2020, Status: active]
The Analytics Show [Years: 2019 - now, Status: active]

- NLP News by Sebastian Ruder
- This Week in NLP by Robert Dale
- Papers with Code
- The Batch by deeplearning.ai
- Paper Digest by PaperDigest
- NLP Cypher by QuantumStat

- NLP Zurich [YouTube Recordings]
- Hacking-Machine-Learning [YouTube Recordings]
- NY-NLP (New York)

- Yannic Kilcher
- HuggingFace
- Kaggle Reading Group
- Rasa Paper Reading
- Stanford CS224N: NLP with Deep Learning
- NLPxing
- ML Explained - A.I. Socratic Circles - AISC
- Deeplearning.ai
- Machine Learning Street Talk

- Back to the Table of Contents

General NLU

GLUE - General Language Understanding Evaluation (GLUE) benchmark
SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 280 stars]
DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 2835 stars]

Summarization

WikiAsp - WikiAsp: Multi-document aspect-based summarization dataset
WikiLingua - A Multilingual Abstractive Summarization Dataset

Question Answering

SQuAD - Stanford Question Answering Dataset (SQuAD)
XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
GrailQA - Strongly Generalizable Question Answering (GrailQA)
CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

- XTREME - Multilingual multi-task benchmark
GLUECoS - A benchmark for code-switched NLP
IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
LinCE - Linguistic Code-Switching Evaluation Benchmark
Russian SuperGlue - Russian SuperGlue Benchmark

Bio, Law, and other scientific domains

BLURB - Biomedical Language Understanding and Reasoning Benchmark
BLUE - Biomedical Language Understanding Evaluation benchmark
LexGLUE - A Benchmark Dataset for Legal Language Understanding in English

Transformer Efficiency

Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 716 stars]

Speech Processing

SUPERB - Speech processing Universal PERformance Benchmark

Other

CodeXGLUE - A benchmark dataset for code intelligence
CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
MultiNLI - Multi-Genre Natural Language Inference corpus
iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic

- Back to the Table of Contents

General

- A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
- Recent Advances in NLP via Large Pre-Trained Language Models: A Survey [Paper, November 2021]

Embeddings

Repositories

Pre-trained ELMo Representations for Many Languages [GitHub, 1458 stars]
sense2vec - Contextually-keyed word vectors [GitHub, 1617 stars]
wikipedia2vec [GitHub, 935 stars]
StarSpace [GitHub, 3938 stars]
fastText [GitHub, 25871 stars]

Blogs

- Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
- An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
- Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
- The Illustrated Word2vec by Jay Alammar [Blog, 2019]

Cross-lingual Word and Sentence Embeddings

vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 644 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]

Byte Pair Encoding

bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1179 stars]
subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2185 stars]
python-bpe - Byte Pair Encoding for Python [GitHub, 223 stars]

Transformer-based Architectures

General

- The Transformer Family by Lilian Weng [Blog, 2020]
- Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- Attention? Attention! by Lilian Weng [Blog, 2018]
- The Transformer … "Explained"? [Blog, 2019]
Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- Attention Is Off By One [July 2023]
Understanding and Applying Self-Attention for NLP [Talk, 2018]
- The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
- Pre-Trained Models: Past, Present and Future [Paper, June 2021]
- A Survey of Transformers [Paper, June 2021]

Transformer

- The Annotated Transformer by Harvard NLP [Blog, 2018]
- The Illustrated Transformer by Jay Alammar [Blog, 2018]
- Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- Sequential Transformer with Adaptive Attention Span by the Facebook blog [Blog, 2019]
- Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- Reformer: The Efficient Transformer [Blog, 2020]
- Longformer - The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- Transformers from Scratch [Blog, 2019]
- Transformers in Natural Language Processing - A Brief Survey by George Ho [Blog, May 2020]
Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 596 stars]
- Transformers from Scratch [Blog, Oct 2021]

BERT

- A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- Understanding searches better than ever before [Blog, 2019]
- Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 286 stars]
BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 574 stars]
Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
CharacterBERT: Reconciling ELMo and BERT [GitHub, 195 stars]
- When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
BERT-related Papers - a list of BERT-related papers [GitHub, 2032 stars]

Other Transformer Variants

T5

- T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
- T5: the Text-To-Text Transfer Transformer [Blog, 2020]
multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pre-trained text-to-text transformer model [GitHub, 1245 stars]

BigBird

- Big Bird: Transformers for Longer Sequences, original paper by Google Research [Paper, July 2020]

Reformer / Linformer / Longformer / Performers

Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]

Switch Transformer

- Switch Transformers: Scaling to Trillion Parameter Models, original paper by Google Research [Paper, January 2021]

GPT-family

General

- The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
- The Annotated GPT-2 by Aman Arora
- OpenAI's GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
- How to generate text by Patrick von Platen [Blog, 2020]

GPT-3

Learning Resources

- Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
- GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
- GPT-3, a Giant Step for Deep Learning and NLP by Yoel Zeldes [Blog, June 2020]
- GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
- Is it possible for language models to achieve language understanding? by Christopher Potts

Applications

Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4589 stars]
GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, and Resources
- OpenAI API - API demo to use OpenAI GPT for commercial applications

Open-source Efforts

- GPT-Neo - open-source GPT-3 replication, HuggingFace Hub
GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
- Effectively using GPT-J with few-shot learning [Blog, July 2021]

Other

- What is Two-Stream Self-Attention in XLNet by Xu Liang [Blog, 2019]
- Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
- Turing NLG by Microsoft
- Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
ELECTRA [GitHub, 2326 stars]
Performer - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]

Distillation, Pruning and Quantization

Reading Material

- Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
- Compression of Deep Learning Models for Text: A Survey [Paper, April 2021]

Tools

Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 79 stars]
XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 153 stars]

Automated Summarization

- PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 146 stars]
XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 252 stars]
SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 265 stars]
PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 151 stars]
summarus - Models for automatic abstractive summarization [GitHub, 170 stars]

Knowledge Graphs and NLP

- Fusing Knowledge into Language Model [Presentation, Oct 2021]

Note: Section keywords: best practices, MLOps

- Back to the Table of Contents

Best Practices for building NLP Projects

- In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
- EMNLP 2020: High Performance Natural Language Processing by Google Research [Recording, Nov. 2020]
- Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
- How to Structure and Manage NLP Projects [Blog, May 2021]
- Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
- Introduction to NLP for Industry use - DataTalksClub presentation on Introduction to NLP for Industry use [Recording, December 2021]
- Measuring Embedding Drift - Best practices for monitoring drift of NLP models [Blog, December 2022]

MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices around automating various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP includes having the following processes in place:

Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
Experiment Tracking - make sure that all of your experiments are automatically tracked and saved where they can be easily replicated or retraced
Model Registry - make sure any neural models you train are versioned and tracked, and that it is easy to roll back to any of them
Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want behavioral tests that check for bias or potential adversarial attacks
Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as Blue/Green or Canary deploys
Data and Model Observability - track data drift, model accuracy drift, etc.
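
As a rough illustration of three of the processes above (data versioning, experiment tracking and behavioral testing), here is a minimal sketch using only the Python standard library. Every name in it (`dataset_fingerprint`, `log_experiment`, `toy_sentiment`, `invariance_failures`) is hypothetical and made up for this example; in practice a dedicated tool from the lists below would replace each piece.

```python
import hashlib
import json


def dataset_fingerprint(records):
    """Return a deterministic SHA-256 fingerprint for JSON-serializable records."""
    digest = hashlib.sha256()
    for record in records:
        # sort_keys keeps the serialization independent of dict key order
        line = json.dumps(record, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def log_experiment(log, params, metrics, data_version):
    """Append one experiment record to an in-memory log and return it."""
    entry = {
        "run_id": len(log) + 1,
        "params": params,
        "metrics": metrics,
        "data_version": data_version,
    }
    log.append(entry)
    return entry


def toy_sentiment(text):
    """Hypothetical stand-in model: 'positive' if the text mentions 'good'."""
    return "positive" if "good" in text.lower() else "negative"


def invariance_failures(model, pairs):
    """Return the (original, perturbed) pairs whose predictions differ."""
    return [(a, b) for a, b in pairs if model(a) != model(b)]


train = [
    {"text": "good movie", "label": "positive"},
    {"text": "dull plot", "label": "negative"},
]

# Data versioning: the fingerprint changes whenever the data changes.
version = dataset_fingerprint(train)

# Experiment tracking: store parameters, metrics and the data version together.
correct = sum(toy_sentiment(r["text"]) == r["label"] for r in train)
experiments = []
run = log_experiment(
    experiments,
    params={"model": "toy_sentiment"},
    metrics={"accuracy": correct / len(train)},
    data_version=version,
)

# Behavioral testing: predictions should not change under casing or name swaps.
failures = invariance_failures(
    toy_sentiment,
    [("A good film", "A GOOD film"), ("Good acting by Ann", "Good acting by Bob")],
)
```

The fingerprint ties each logged run to the exact data it was trained on, which is the property that makes a run reproducible; the invariance check is only the simplest kind of behavioral test.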

Additionally, there are two more components that are not as prevalent for NLP and are mostly used for Computer Vision and other sub-fields of AI:

Feature Store - centralized storage of all features developed for ML models, so that they can be easily reused by other ML projects
Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing the behavior of deployed ML models, artifact tracking, etc.

MLOps Compendiums and Awesome Lists

Awesome-MLOps [GitHub, 12526 stars]
best-of-ml-python [GitHub, 16309 stars]
MLOps.Toys - a curated list of MLOps projects

Reading Material

- Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
- Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, Oct 2022]
- MLOps: What It Is, Why It Matters, and How to Implement It by Neptune AI [Blog, July 2021]
- Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
- State of MLOps 2021 by Valohai [Blog, August 2021]
- The MLOps Stack by Valohai [Blog, October 2020]
- Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
- The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
- MLOps: Comprehensive Beginner's Guide [Blog, March 2021]
- What I've learned about MLOps from speaking with 100 ML practitioners [Blog, May 2021]
- DataRobot Challenger Models - MLOps Champion/Challenger Models
- State of MLOps Blog by Dr. Ori Cohen
- MLOps Ecosystem Overview [Blog, 2021]

Learning Material

- MLOps course by Made With ML
- GitHub MLOps - a collection of resources on how to facilitate Machine Learning Ops with GitHub
- ML Observability Fundamentals Course - learn how to monitor and root-cause issues with production NLP models

MLOps Communities

MLOps Community - blogs, Slack group, newsletter and more, all about MLOps

Data Versioning

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines

Experiment Tracking

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- SigOpt - automate training and tuning, visualize and compare runs [Paid Service]
Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]

Model Registry

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
ModelDB - open-source system for machine learning model versioning, metadata, and experiment management [GitHub, 1696 stars]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- Valohai - end-to-end ML pipelines [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines
- Polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]

Automated Testing and Behavioral Testing

CheckList - Beyond Accuracy: Behavioral Testing of NLP Models [GitHub, 2003 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
WildNLP - corrupt the input text to test the robustness of NLP models [GitHub, 76 stars]
great-expectations - write tests for your data [GitHub, 9874 stars]
Deepchecks - Python package for validating your machine learning models and data [GitHub, 3582 stars]

Model Deployability and Serving

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
- Amazon SageMaker [Paid Service]
- Valohai - end-to-end ML pipelines [Paid Service]
- NLP Cloud - production-ready NLP API [Paid Service]
- Saturn Cloud [Paid Service]
- Seldon - machine learning deployment for enterprise [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- Polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
TorchServe - a flexible and easy-to-use tool for serving PyTorch models [GitHub, 4174 stars]
- Kubeflow - the machine learning toolkit for Kubernetes [GitHub, 10600 stars]
KFServing - serverless inferencing on Kubernetes [GitHub, 3504 stars]
- TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines
- Cortex - containers as a service on AWS [Paid Service]
- Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
End2End Serverless Transformers on AWS Lambda [GitHub, 121 stars]
NLP-Service - sample demo of an NLP-as-a-service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
- Dagster - data orchestrator for machine learning [Free and Open Source]
- Verta - AI and machine learning deployment and operations [Paid Service]
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
Flyte - workflow automation platform for complex, mission-critical data and ML processes at scale [GitHub, 5525 stars]
MLRun - machine learning automation and tracking [GitHub, 1425 stars]
- DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI

Model Debugging

imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
Cockpit - a practical debugging tool for training deep neural networks [GitHub, 474 stars]

Model Accuracy Prediction

WeightWatcher - WeightWatcher tool for predicting the accuracy of deep neural networks [GitHub, 1453 stars]

Data and Model Observability

General

Arize AI - embedding drift monitoring for NLP models
Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
whylogs - open source standard for data and ML logging [GitHub, 2636 stars]
Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 3843 stars]
MLRun - machine learning automation and tracking [GitHub, 1425 stars]
- DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- Cortex - containers as a service on AWS [Paid Service]

Model Centric

- Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
- Dataiku - Dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
- Fiddler - ML model performance management tool [Paid Service]
- Hydrosphere - open-source platform for managing ML models [Paid Service]
- Verta - AI and machine learning deployment and operations [Paid Service]
- Domino Model Ops - deploy and manage models to drive business impact [Paid Service]

Data Centric

- Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
- acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
- Bigeye - monitoring and alerting for your datasets in minutes [Paid Service]
- datakin - end-to-end real-time data lineage solution [Paid Service]
- Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
- SODA - data monitoring, testing and validation [Paid Service]

Feature Stores

- Tecton - enterprise feature store for machine learning [Paid Service]
Feast - open source feature store for machine learning, Website [GitHub, 5525 stars]
- Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]

Metadata Management

ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 617 stars]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]

MLOps Frameworks

Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
Kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 9883 stars]
Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4353 stars]
ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 3972 stars]
- Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
Diffgram - complete training data platform for machine learning delivered as a single application [GitHub, 1834 stars]
- Continual.ai - build, deploy, and operationalize ML models easier and faster with a declarative interface on cloud data warehouses like Snowflake, BigQuery, Redshift, and Databricks [Paid Service]

Transformer-based Architectures

- Back to the Table of Contents

General

- Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
- Fine Tuning BERT for Text Classification with FARM by Sebastian Guggisberg [Blog, 2020]
Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 254 stars]
Practical NLP for the Real World [Presentation, 2019]
From Paper to Product - How we implemented BERT by Christoph Henkelmann [Talk, 2020]

Multi-GPU Transformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]

Training Transformers Effectively

Training BERT with Compute/Time (Academic) Budget [GitHub, 309 stars]

Embeddings as a Service

embedding-as-service [GitHub, 204 stars]
bert-as-service [GitHub, 12399 stars]

NLP Recipes, Industrial Applications:

NLP Recipes by Microsoft [GitHub, 6367 stars]
NLP with Python by susanli2016 [GitHub, 2721 stars]
Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2210 stars]

NLP Applications in Bio, Finance, Legal and other industries

Blackstone - pipeline and model for NLP on unstructured legal text [GitHub, 636 stars]
Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1688 stars]
FinBERT: Pre-trained on financial filings for financial NLP tasks [GitHub, 197 stars]
LexNLP - information retrieval and extraction for unstructured legal text [GitHub, 692 stars]
NerDL and NerCRF - tutorial on Named Entity Recognition for healthcare with SparkNLP
Legal Text Analytics - a list of selected resources dedicated to legal text analytics [GitHub, 613 stars]
BioIE - a list of resources relevant to biomedical information extraction [GitHub, 338 stars]

Note: Section keywords: speech recognition

- Back to the Table of Contents

General Speech Recognition

wav2letter - automatic speech recognition toolkit [GitHub, 6370 stars]
DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 25166 stars]
- Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
kaldi - Kaldi is a toolkit for speech recognition [GitHub, 14177 stars]
awesome-kaldi - resources for using Kaldi [GitHub, 532 stars]
ESPnet - end-to-end speech processing toolkit [GitHub, 8355 stars]
- HuBERT - self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech / Speech Generation

FastSpeech - an implementation of FastSpeech based on PyTorch [GitHub, 857 stars]
TTS - a deep learning toolkit for text-to-speech [GitHub, 34356 stars]
- NotebookLM - Google Gemini-powered personal assistant / podcast generator

Speech to Text

whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 68884 stars]
vibe - GUI tool to work with whisper, multilingual and CUDA support included [GitHub, 931 stars]

Datasets

VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 507 stars]

Note: Section keywords: topic modeling

- Back to the Table of Contents

Blogs

- Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
- A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]

Frameworks for Topic Modeling

gensim - framework for topic modeling [GitHub, 15597 stars]
Spark NLP [GitHub, 3826 stars]

Repositories

Top2Vec [GitHub, 2924 stars]
Anchored Correlation Explanation Topic Modeling [GitHub, 303 stars]
Topic Modeling in Embedding Spaces [GitHub, 540 stars] Paper
TopicNet - A high-level interface for BigARTM library [GitHub, 140 stars]
BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 6038 stars]
OCTIS - A python package to optimize and evaluate topic models [GitHub, 718 stars]
Contextualized Topic Models [GitHub, 1196 stars]
GSDMM - GSDMM: Short text clustering [GitHub, 353 stars]

Note Section keywords: keyword extraction

- Back to the Table of Contents

Text Rank

PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2132 stars]
textrank - TextRank implementation for Python 3 [GitHub, 1248 stars]

RAKE - Rapid Automatic Keyword Extraction

rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1061 stars]
yake - Single-document unsupervised keyword extraction [GitHub, 1632 stars]
RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 375 stars]

Other Approaches

flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5583 stars]
BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 254 stars]
keyBERT - Minimal keyword extraction with BERT [GitHub, 3471 stars]
KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 251 stars]

NLP and ML Interpretability

NLP-centric

Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
ecco - Tools to visualize and explore NLP language models [GitHub, 1974 stars]
NLP Profiler - A simple NLP library allows profiling datasets with text columns [GitHub, 243 stars]
transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 1278 stars]
Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 1400 stars]
LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 1346 stars]

General

Language Interpretability Tool (LIT) [GitHub, 3474 stars]
WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 468 stars]
Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 413 stars]
InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 6238 stars]
thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 342 stars]
imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]

Ethics, Bias, and Equality in NLP

- Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
- Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
Ethics in NLP - resources from ACL's Ethics in NLP track
The Institute for Ethical AI & Machine Learning
- Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 77 stars]
nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 65 stars]
bias-in-nlp - list of papers related to bias in NLP [GitHub, 9 stars]

Adversarial Attacks for NLP

- Privacy Considerations in Large Language Models [Blog, Dec 2020]
DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 73 stars]
Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 62 stars]

Hate Speech Analysis

HateXplain - BERT for detecting abusive language [GitHub, 187 stars]

Note Section keywords: frameworks

- Back to the Table of Contents

General Purpose

spaCy by Explosion AI [GitHub, 29784 stars]
flair by Zalando [GitHub, 13855 stars]
AllenNLP by AI2 [GitHub, 11740 stars]
stanza (former Stanford NLP) [GitHub, 7253 stars]
spaCy stanza [GitHub, 723 stars]
nltk [GitHub, 13489 stars]
gensim - framework for topic modeling [GitHub, 15597 stars]
pororo - Platform of neural models for natural language processing [GitHub, 1279 stars]
NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2936 stars]
FARM [GitHub, 1734 stars]
gobbli by RTI International [GitHub, 275 stars]
headliner - training and deployment of seq2seq models [GitHub, 229 stars]
SyferText - A privacy preserving NLP framework [GitHub, 197 stars]
DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1263 stars]
TextHero - Text preprocessing, representation and visualization [GitHub, 2882 stars]
textblob - TextBlob: Simplified Text Processing [GitHub, 9109 stars]
AdaptNLP - A high level framework and library for NLP [GitHub, 407 stars]
textacy - NLP, before and after spaCy [GitHub, 2209 stars]
texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2388 stars]
jiant - jiant is an NLP toolkit [GitHub, 1639 stars]

Data Augmentation

WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
snorkel - Framework to generate training data [GitHub, 5791 stars]
NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
faker - Python package that generates fake data for you [GitHub, 17648 stars]
textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 639 stars]
Parrot - Practical and feature-rich paraphrasing framework [GitHub, 871 stars]
AugLy - data augmentations library for audio, image, text, and video [GitHub, 4950 stars]
TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 396 stars]

Adversarial NLP Attacks & Behavioral Testing

TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6172 stars]
CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]

Transformer-oriented

transformers by HuggingFace [GitHub, 132974 stars]
Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 2543 stars]
haystack - Transformers at scale for question answering & neural search. [GitHub, 16997 stars]

Dialogue Systems and Speech

DeepPavlov by MIPT [GitHub, 6676 stars]
ParlAI by FAIR [GitHub, 10477 stars]
rasa - Framework for Conversational Agents [GitHub, 18726 stars]
wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
ChatterBot - conversational dialog engine for creating chatbots [GitHub, 14039 stars]
SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 8674 stars]
dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]

Word/Sentence-embeddings oriented

MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3181 stars]
vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 644 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]

Social Media Oriented

Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 661 stars]

Phonetics

DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 352 stars]

Morphology

LemmInflect - python module for English lemmatization and inflection [GitHub, 259 stars]
Inflect - generate plurals, ordinals, indefinite articles [GitHub, 964 stars]
simplemma - simple multilingual lemmatizer for Python [GitHub, 964 stars]

Multi-lingual tools

polyglot - Multi-lingual NLP Framework [GitHub, 2309 stars]
trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 730 stars]

Distributed NLP / Multi-GPU NLP

Spark NLP [GitHub, 3826 stars]
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]

Machine Translation

COMET - A Neural Framework for MT Evaluation [GitHub, 493 stars]
marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 1236 stars]
argos-translate - Open source neural machine translation in Python [GitHub, 3771 stars]
Opus-MT - Open neural machine translation models and web services [GitHub, 605 stars]
dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 440 stars]
CTranslate2 - CTranslate2 end-to-end machine translation [GitHub, 3300 stars]

Entity and String Matching

PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 736 stars]
pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 9220 stars]
jellyfish - approximate and phonetic matching of strings [GitHub, 2049 stars]
textdistance - Compute distance between sequences [GitHub, 3367 stars]
DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 555 stars]
RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 339 stars]
Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 17 stars]
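
As a rough illustration of what fuzzy string matching does, the sketch below uses only `difflib` from the Python standard library. It does not reproduce the API of any package listed above, and the helper names `similarity` and `best_match` are hypothetical. It scores each candidate by its `SequenceMatcher` ratio and keeps the best one above a cutoff.

```python
from difflib import SequenceMatcher, get_close_matches


def similarity(a, b):
    """Return a similarity ratio in [0, 1] between two strings, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(query, choices, cutoff=0.6):
    """Return the (choice, score) pair most similar to query, or None below cutoff."""
    scored = [(choice, similarity(query, choice)) for choice in choices]
    scored = [pair for pair in scored if pair[1] >= cutoff]
    return max(scored, key=lambda pair: pair[1]) if scored else None


if __name__ == "__main__":
    vocabulary = ["ape", "apple", "peach", "puppy"]
    print(best_match("appel", vocabulary))
    # The standard library also offers a ready-made ranked lookup:
    print(get_close_matches("appel", vocabulary))
```

The dedicated libraries above add faster edit-distance implementations, phonetic encodings, and learned matchers, which matter once the candidate list grows beyond a few thousand strings.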

Discourse Analysis

ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 543 stars]

PII scrubbing

scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 394 stars]

Hashtag Segmentation

hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 68 stars]

Books Analysis / Literary Analysis / Semantic Search

booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 785 stars]
bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars]
SemanticFinder - frontend-only live semantic search with transformers.js [GitHub, 224 stars]

Non-English oriented

Japanese

fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 391 stars]
SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 390 stars]
Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 226 stars]
jProcessing - Japanese Natural Language Processing Libraries [GitHub, 148 stars]
Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 745 stars]
kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 953 stars]
nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 382 stars]
KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 201 stars]
Jigg - Pipeline framework for easy natural language processing [GitHub, 74 stars]
Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 376 stars]
RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 473 stars]
toiro - a comparison tool of Japanese tokenizers [GitHub, 118 stars]

Thai

AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 79 stars]
ThaiLMCut - Word Tokenizer for Thai Language [GitHub, 15 stars]

Chinese

Spacy-pkuseg - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 53 stars]

Ukrainian

recruitment-dataset - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English)

Other

textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 103 stars]
Kashgari - Transfer Learning with focus on Chinese [GitHub, 2389 stars]
Underthesea - Vietnamese NLP Toolkit [GitHub, 1383 stars]
PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 84 stars]

Text Data Labelling & Classification

Small-Text - Active Learning for Text Classification in Python [GitHub, 549 stars]
Doccano - open source annotation tool for machine learning practitioners [GitHub, 9460 stars]
Adala - Autonomous DAta (Labeling) Agent framework [GitHub, 927 stars]
EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
- Prodigy - annotation tool powered by active learning [Paid Service]

Note Section keywords: learn NLP

- Back to the Table of Contents

General

- Learn NLP the practical way [Blog, Nov. 2019]
- Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
- Choosing the right course for a Practical NLP Engineer
- 12 Best Natural Language Processing Courses & Tutorials to Learn Online
Treasure of Transformers - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 912 stars]
- Rasa Algorithm Whiteboard - YouTube series by Rasa explaining various Data Science and NLP Algorithms
- ExplosionAI Videos - YouTube series by ExplosionAI teaching you how to use spaCy and apply it for NLP

Courses

- CS25: Transformers United Stanford - Fall 2021 [Course, Fall 2021]
- NLP Course | For You - Great and interactive course on NLP
- Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
- Transformer models for NLP by HuggingFace
- Stanford NLP Seminar - slides from the Stanford NLP course

Books

- Natural Language Processing with Transformers - [Book, February 2022]
- Applied Natural Language Processing in the Enterprise - [Book, May 2021]
- Practical Natural Language Processing - [Book, June 2020]
- Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
- Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
- Top NLP Books to Read 2020 - Blog post by Raymond Cheng [Blog, Sep 2020]

Tutorials

nlp-tutorial - A list of NLP(Natural Language Processing) tutorials built on PyTorch [GitHub, 1366 stars]
nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 14110 stars]
Hands-On NLTK Tutorial [GitHub, 540 stars]
Modern Practical Natural Language Processing [GitHub, 266 stars]
Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 9176 stars]
CalmCode Tutorials - Set of Python Data Science Tutorials

r/LanguageTechnology - NLP Reddit forum

- Back to the Table of Contents

Tokenization

tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 8940 stars]
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 10141 stars]
SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 135 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks

WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 917 stars]
NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 773 stars]
EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
snorkel - Framework to generate training data [GitHub, 5791 stars]
dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]

Reading Material and Tutorials

A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
- A Visual Survey of Data Augmentation in NLP [Blog, 2020]
- Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]

Named Entity Recognition (NER)

Datasets for Entity Recognition [GitHub, 1497 stars]
Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 338 stars]
Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 212 stars]
Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 385 stars]

Relation Extraction

tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 355 stars]
tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 69 stars]
tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 64 stars]
Re-TACRED - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 51 stars]

Coreference Resolution

NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2850 stars]
coref - BERT and SpanBERT for Coreference Resolution [GitHub, 443 stars]

Sentiment Analysis

Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 517 stars]
Awesome Sentiment Analysis by xiamx [GitHub, 913 stars]

Domain Adaptation

Neural Adaptation in Natural Language Processing - curated list [GitHub, 261 stars]

Low Resource NLP

CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 597 stars]

Spell Correction / Error Correction

Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1502 stars]
NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 665 stars]
SymSpellPy - Python port of SymSpell [GitHub, 796 stars]
- Speller100 by Microsoft [Blog, Feb 2021]
JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 608 stars]
pycorrector - spell correction for Chinese [GitHub, 5517 stars]
contractions - Fixes contractions such as "you're" to "you are" [GitHub, 308 stars]
- Fine Tuning T5 for Grammar Correction by Sachin Abeywardana [Blog, Nov 2022]

Style Transfer for NLP

Styleformer - Neural Language Style Transfer framework [GitHub, 475 stars]
StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 60 stars]

Automata Theory for NLP

pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]

Obscene words detection

LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 2899 stars]

Reddit Analysis

Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 489 stars]

Skill Detection

SkillNER - rule based NLP module to extract job skills from text [GitHub, 153 stars]

Reinforcement Learning for NLP

nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 192 stars]

AutoML / AutoNLP

AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 3836 stars]
TPOT - Python Automated Machine Learning tool [GitHub, 9691 stars]
Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2359 stars]
HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 710 stars]
- AutoML Natural Language - Google's paid AutoML NLP service
Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
FLAML - fast and lightweight AutoML library [GitHub, 3871 stars]
Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 306 stars]

OCR - Optical Character Recognition

- A framework for designing document processing solutions [Blog, June 2022]

Document AI

- Table Transformer + HuggingFace Models

Text Generation

keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 445 stars]
- Controllable Neural Text Generation [Blog, Jan 2021]
BARTScore - Evaluating Generated Text as Text Generation [GitHub, 317 stars]