The NLP Pandect

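This Pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing.

Quick legend on available resource types:
- open-source project, usually a GitHub repository with its number of stars
- resource you can read, usually a blog post or a paper
- a collection of additional resources
- non-open source tool, framework or paid service
- a resource you can watch
- a resource you can listen to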

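Main Section	Sub-sections Sample
NLP Resources	Paper Summaries, Conference Summaries, NLP Datasets
NLP Podcasts	NLP-only Podcasts, Podcasts with many NLP Episodes
NLP Newsletters	-
NLP Meetups	-
NLP YouTube Channels	-
NLP Benchmarks	General NLU, Question Answering, Multilingual
Research Resources	Resources on Transformer Models, Distillation and Pruning, Automated Summarization
Industry Resources	Best Practices for NLP Systems, MLOps for NLP
Speech Recognition	General Resources, Text to Speech, Speech to Text, Datasets
Topic Modeling	Blogs, Frameworks, Repositories and Projects
Keyword Extraction	Text Rank, RAKE, Other Approaches
Responsible NLP	NLP and ML Interpretability, Ethics, Bias, and Equality in NLP, Adversarial Attacks for NLP
NLP Frameworks	General Purpose, Data Augmentation, Machine Translation, Adversarial Attacks, Dialogue Systems and Speech, Entity and String Matching, Non-English Frameworks, Text Annotation
Learning NLP	Courses, Books, Tutorials
NLP Communities	-
Other NLP Topics	Tokenization, Data Augmentation, Named Entity Recognition, Error Correction, AutoML/AutoNLP, Text Generation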

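Note Section keywords: paper summaries, compendium, awesome list

Compendiums and awesome lists on the topic of NLP:

The NLP Index - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
Awesome NLP by keon [GitHub, 16528 stars]
Speech and Natural Language Processing Awesome List [GitHub, 2189 stars]
Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1274 stars]
Text Mining and Natural Language Processing Resources by stepthom [GitHub, 557 stars]
Brainsources for #NLP enthusiasts by Philip Vollet
Awesome AI/ML/DL - NLP Section [GitHub, 1473 stars]
NLP articles by Devopedia

NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries

100 Must-Read NLP Papers [GitHub, 3732 stars]
NLP Paper Summaries by dair-ai [GitHub, 1475 stars]
Curated collection of papers for the NLP practitioner [GitHub, 1075 stars]
Papers on Textual Adversarial Attack and Defense [GitHub, 1501 stars]
Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 296 stars]
A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1997 stars]
A Paper List for Style Transfer in Text [GitHub, 1609 stars]
Video recordings index for papers

Conference Summaries

NLP top 10 conferences Compendium by soulbliss [GitHub, 459 stars]
ICLR 2020 Trends
SpacyIRL 2019 Conference in Overview
Paper Digest - Conferences and Papers in Overview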

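NLP Progress and NLP Tasks:

NLP Progress by sebastianruder [GitHub, 22568 stars]
NLP Tasks by Kyubyong [GitHub, 3017 stars]

NLP Datasets:

NLP Datasets by niderhoff [GitHub, 5741 stars]
Datasets by Huggingface [GitHub, 19096 stars]
Big Bad NLP Database
UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
MLDoc - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 152 stars]

Word and Sentence embeddings:

Awesome Embedding Models by Hironsan [GitHub, 1752 stars]
Awesome list of Sentence Embeddings by Separius [GitHub, 2219 stars]
Awesome BERT by Jiakui [GitHub, 1846 stars]

Notebooks, Scripts and Repositories

The Super Duper NLP Repo [Website, 2020]

Non-English resources and compendiums

NLP Resources for Bahasa Indonesian [GitHub, 480 stars]
Indic NLP Catalog [GitHub, 552 stars]
Pre-trained language models for Vietnamese [GitHub, 653 stars]
Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 814 stars]
Indic NLP Library [GitHub, 550 stars]
AI4Bharat-IndicNLP Portal
ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 387 stars]
zemberek-nlp - NLP tools for Turkish [GitHub, 1146 stars]
TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools.
KLUE - Korean Language Understanding Evaluation [GitHub, 560 stars]
Persian NLP Benchmark - benchmark for evaluation and comparison of various NLP tasks in Persian language [GitHub, 73 stars]
nlp-greek - Greek language sources [GitHub, 5 stars]
Awesome NLP Resources for Hungarian [GitHub, 221 stars]

Pre-trained NLP models

List of pre-trained NLP models [GitHub, 170 stars]
Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 3019 stars]
Spanish Language Models and resources [GitHub, 251 stars]

NLP History

General

Modern Deep Learning Techniques Applied to Natural Language Processing [GitHub, 1328 stars]
A Review of the Neural History of Natural Language Processing [Blog, October 2018]

2020 Year in Review

Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
ML and NLP Research Highlights of 2020 [Blog, January 2021]

Back to the Table of Contents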

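NLP-only podcasts

NLP Highlights [Years: 2017 - now, Status: active]
NLP Zone Episodes [Years: 2021 - now, Status: active]

Many NLP episodes

TWIML AI [Years: 2016 - now, Status: active]
Practical AI [Years: 2018 - now, Status: active]
The Data Exchange [Years: 2019 - now, Status: active]
Gradient Dissent [Years: 2020 - now, Status: active]
Machine Learning Street Talk [Years: 2020 - now, Status: active]
DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]

Some NLP episodes

The Super Data Science Podcast [Years: 2016 - now, Status: active]
Data Hack Radio [Years: 2018 - now, Status: active]
The Analytics Show [Years: 2019 - now, Status: active]

NLP News by Sebastian Ruder
This Week in NLP by Robert Dale
Papers with Code
The Batch by deeplearning.ai
Paper Digest by PaperDigest
NLP Cypher by QuantumStat

NLP Zurich [YouTube Recordings]
Hacking-Machine-Learning [YouTube Recordings]
NY-NLP (New York)

Yannic Kilcher
HuggingFace
Kaggle Reading Group
Rasa Paper Reading
Stanford CS224N: NLP with Deep Learning
NLPxing
ML Explained - A.I. Socratic Circles - AISC
Deeplearning.ai
Machine Learning Street Talk

Back to the Table of Contents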

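General NLU

GLUE - General Language Understanding Evaluation (GLUE) benchmark
SuperGLUE - benchmark styled after GLUE with a new set of more difficult language understanding tasks
decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 280 stars]
DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 2835 stars]

Summarization

WikiAsp - WikiAsp: Multi-document aspect-based summarization dataset
WikiLingua - A Multilingual Abstractive Summarization Dataset

Question Answering

SQuAD - Stanford Question Answering Dataset (SQuAD)
XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
GrailQA - Strongly Generalizable Question Answering (GrailQA)
CSQA - Complex Sequential Question Answering

Multilingual and Non-English Benchmarks

XTREME - Massively Multilingual Multi-task Benchmark
GLUECoS - A benchmark for code-switched NLP
IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
LinCE - Linguistic Code-Switching Evaluation Benchmark
Russian SuperGlue - Russian SuperGlue Benchmark

Bio, Law, and other scientific domains

BLURB - Biomedical Language Understanding and Reasoning Benchmark
BLUE - Biomedical Language Understanding Evaluation benchmark
LexGLUE - A Benchmark Dataset for Legal Language Understanding in English

Transformer Efficiency

Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 716 stars]

Speech Processing

SUPERB - Speech processing Universal PERformance Benchmark

Other

CodeXGLUE - A benchmark dataset for code intelligence
CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
MultiNLI - Multi-Genre Natural Language Inference corpus
iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic

Back to the Table of Contents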

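General

A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
Recent Advances in NLP via Large Pre-Trained Language Models: A Survey [Paper, November 2021]

Embeddings

Repositories

Pre-trained ELMo Representations for Many Languages [GitHub, 1458 stars]
sense2vec - Contextually-keyed word vectors [GitHub, 1617 stars]
wikipedia2vec [GitHub, 935 stars]
StarSpace [GitHub, 3938 stars]
fastText [GitHub, 25871 stars]

Blogs

Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
The Illustrated Word2vec by Jay Alammar [Blog, 2019]

Cross-lingual Word and Sentence Embeddings

vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 644 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]

Byte Pair Encoding

bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1179 stars]
subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2185 stars]
python-bpe - Byte Pair Encoding for Python [GitHub, 223 stars]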

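Transformer-based Architectures

General

The Transformer Family by Lilian Weng [Blog, 2020]
Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
Attention? Attention! by Lilian Weng [Blog, 2018]
the transformer … "explained"? [Blog, 2019]
Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
Attention Is Off By One [July 2023]
Understanding and Applying Self-Attention [Talk, 2018]
The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
Pre-Trained Models: Past, Present and Future [Paper, June 2021]
A Survey of Transformers [Paper, June 2021]

Transformer

The Annotated Transformer by Harvard NLP [Blog, 2018]
The Illustrated Transformer by Jay Alammar [Blog, 2018]
Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
Reformer: The Efficient Transformer [Blog, 2020]
Longformer - The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
Transformers from Scratch [Blog, 2019]
Transformers in Natural Language Processing - A Brief Survey by George Ho [Blog, May 2020]
Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 596 stars]
Transformers from Scratch [Blog, Oct 2021]

BERT

A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
Understanding searches better than ever before [Blog, 2019]
Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 286 stars]
BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 574 stars]
Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
CharacterBERT: Reconciling ELMo and BERT [GitHub, 195 stars]
When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
BERT-related Papers - a list of BERT-related papers [GitHub, 2032 stars]

Other Transformer Variants

T5

T5 - Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
T5: the Text-To-Text Transfer Transformer [Blog, 2020]
multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual pre-trained text-to-text transformer model [GitHub, 1245 stars]

BigBird

BigBird: Transformers for Longer Sequences, original paper by Google Research [Paper, July 2020]

Reformer / Linformer / Longformer / Performers

Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]

Switch Transformer

Switch Transformers: Scaling to Trillion Parameter Models, original paper by Google Research [Paper, January 2021]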

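GPT-family

General

The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
The Annotated GPT-2 by Aman Arora
OpenAI's GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
How to generate text by Patrick von Platen [Blog, 2020]

GPT-3

Learning Resources

Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
Is it possible for language models to achieve language understanding? by Christopher Potts

Applications

Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4589 stars]
GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
OpenAI API - API demo to use OpenAI GPT for commercial applications

Open-source Efforts

GPT-Neo - in-progress GPT-3 open-source replication, HuggingFace Hub
GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
Effectively using GPT-J with few-shot learning [Blog, July 2021]

Other

What is Two-Stream Self-Attention in XLNet by Xu Liang [Blog, 2019]
Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
Turing NLG by Microsoft
Multi-Label Text Classification with XLNet by Josh Xin Jie Lee [Blog, 2019]
electra [GitHub, 2326 stars]
Performer - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]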

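Distillation, Pruning and Quantization

Reading Material

Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
Compression of Deep Learning Models for Text: A Survey [Paper, April 2021]

Tools

Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 79 stars]
XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 153 stars]

Automated Summarization

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 146 stars]
XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 252 stars]
SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 265 stars]
PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 151 stars]
summarus - Models for automatic abstractive summarization [GitHub, 170 stars]

Knowledge Graphs and NLP

Fusing Knowledge into Language Model [Presentation, Oct 2021]

Note Section keywords: best practices, MLOps

Back to the Table of Contents

Best Practices for building NLP Projects

In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
EMNLP 2020: High Performance Natural Language Processing by Google Research [Recording, Nov. 2020]
Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
How to Structure and Manage NLP Projects [Blog, May 2021]
Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
Introduction to NLP for Industry Use - DataTalksClub presentation on Introduction to NLP for Industry Use [Recording, December 2021]
Measuring Embedding Drift - Best practices for monitoring drift of NLP models [Blog, December 2022]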

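MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices for automating various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP includes having the following processes in place:

Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
Experiment Tracking - make sure that all of your experiments are automatically tracked and saved so they can be easily replicated or retraced
Model Registry - make sure any neural models you train are versioned and tracked, and that it is easy to roll back to any of them
Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want behavioral tests that check for bias or potential adversarial attacks
Model Deployment and Serving - automate model deployment, ideally with zero-downtime deploys such as Blue/Green or Canary deploys
Data and Model Observability - track data drift, model accuracy drift, etc.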
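
As an illustration of two of the processes above (data versioning and behavioral testing), here is a minimal, hypothetical sketch. It is not how DVC, CheckList or any other tool listed here is implemented. It assumes a toy stand-in model (`toy_sentiment`); every function name is invented for this example. A dataset is fingerprinted by hashing its serialized records, and an invariance-style behavioral test checks that swapping a neutral filler (a person's name) does not change the prediction.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short, deterministic fingerprint for JSON-serialisable records.

    Assumption: the dataset fits in memory; real tools hash files on disk.
    """
    payload = json.dumps(records, sort_keys=True, ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def toy_sentiment(text):
    """Hypothetical stand-in model: counts positive vs. negative cue words."""
    positive = {"good", "great", "love"}
    negative = {"bad", "awful", "hate"}
    tokens = text.lower().split()
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score >= 0 else "negative"


def invariance_failures(model, template, fillers):
    """Invariance test: changing a neutral filler should not change the label.

    Returns the fillers whose prediction differs from the first filler's.
    """
    baseline = model(template.format(fillers[0]))
    return [f for f in fillers[1:] if model(template.format(f)) != baseline]


train = [{"text": "I love this", "label": "positive"}]
v1 = dataset_version(train)
v2 = dataset_version(train + [{"text": "awful service", "label": "negative"}])

failures = invariance_failures(
    toy_sentiment, "{} said the food was great", ["Anna", "Omar", "Mei"]
)
print("data versions:", v1, v2)
print("invariance failures:", failures)
```

In a real pipeline the fingerprint would be stored next to the trained model so that a model can be traced back to the data it saw, and the invariance check would run in CI alongside unit tests.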

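Additionally, there are two more components that are less prevalent in NLP and are mostly used in Computer Vision and other sub-fields of AI:

Feature Store - centralized storage of all features developed for ML models, so that they can be easily reused by any other ML project
Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing the behavior of deployed ML models, artifact tracking, etc.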
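
To make the idea of metadata management more concrete, the sketch below shows one possible shape for a metadata record and an in-memory store. It is an assumption-laden illustration, not the API of ML Metadata, Neptune or any other tool in this list. The names `RunMetadata` and `MetadataStore`, the field names, and the example values are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class RunMetadata:
    """Hypothetical record of what is needed to reproduce a deployed model."""
    model_name: str
    model_version: str
    data_version: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


class MetadataStore:
    """Toy in-memory store keyed by (model name, model version)."""

    def __init__(self):
        self._runs = {}

    def log(self, run):
        # Serialise to JSON so the record can later be written to disk or a database.
        self._runs[(run.model_name, run.model_version)] = json.dumps(
            asdict(run), sort_keys=True
        )

    def get(self, model_name, model_version):
        return RunMetadata(**json.loads(self._runs[(model_name, model_version)]))


store = MetadataStore()
store.log(
    RunMetadata(
        model_name="intent-classifier",  # illustrative values only
        model_version="1.2.0",
        data_version="a1b2c3",
        params={"learning_rate": 3e-5},
        metrics={"f1": 0.91},
    )
)
restored = store.get("intent-classifier", "1.2.0")
print(restored)
```

Keeping the data version, hyperparameters and metrics in one record is what makes it possible to answer "which data and settings produced the model that is currently deployed?".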

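MLOps Compendiums & Awesome Lists

Awesome MLOps [GitHub, 12526 stars]
Best of ML Python [GitHub, 16309 stars]
MLOps.Toys - a curated list of MLOps projects

Reading Material

Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, Oct 2022]
MLOps: What It Is, Why It Matters, and How to Implement It by Neptune AI [Blog, July 2021]
Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
State of MLOps 2021 by Valohai [Blog, August 2021]
The MLOps Stack by Valohai [Blog, October 2020]
Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
MLOps: Comprehensive Beginner's Guide [Blog, March 2021]
What I've learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
DataRobot Challenger Models - MLOps Champion/Challenger Models
MLOps Blog by Dr. Ori Cohen
MLOps Ecosystem Overview [Blog, 2021]

Learning Material

MLOps course by Made With ML
GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops
ML Observability Fundamentals Course - learn how to monitor and root-cause production NLP model issues

MLOps Communities

MLOps Community - blogs, Slack group, newsletter and more, all about MLOps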

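Data Versioning

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]

Experiment Tracking

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
SigOpt - automate training and tuning, visualize and compare runs [Paid Service]
Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]

Model Registry

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1696 stars]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
Valohai - end-to-end ML pipelines [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]

Automated Testing and Behavioral Testing

CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
WildNLP - corrupt an input text to test NLP models' robustness [GitHub, 76 stars]
Great Expectations - write tests for your data [GitHub, 9874 stars]
Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 3582 stars]

Model Deployability and Serving

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
Amazon SageMaker [Paid Service]
Valohai - end-to-end ML pipelines [Paid Service]
NLP Cloud - production-ready NLP API [Paid Service]
Saturn Cloud [Paid Service]
SELDON - machine learning deployment for enterprise [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
TorchServe - flexible and easy-to-use tool for serving PyTorch models [GitHub, 4174 stars]
Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
KFServing - Serverless Inferencing on Kubernetes [GitHub, 3504 stars]
TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
Cortex - containers as a service on AWS [Paid Service]
Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
End2End Serverless Transformers On AWS Lambda [GitHub, 121 stars]
NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
Dagster - data orchestrator for machine learning [Free and Open Source]
Verta - AI and Machine Learning Deployment and Operations [Paid Service]
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
Flyte - workflow automation platform for complex, mission-critical data and ML processes [GitHub, 5525 stars]
MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI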

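Model Debugging

imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 474 stars]

Model Accuracy Prediction

WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1453 stars]

Data and Model Observability

General

Arize AI - embedding drift monitoring for NLP models
Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
whylogs - open source standard for data and ML logging [GitHub, 2636 stars]
Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 3843 stars]
MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
Cortex - containers as a service on AWS [Paid Service]

Model Centric

Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
Dataiku - Dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
Fiddler - ML Model Performance Management Tool [Paid Service]
Hydrosphere - open-source platform for managing ML models [Paid Service]
Verta - AI and Machine Learning Deployment and Operations [Paid Service]
Domino Model Ops - deploy and manage models to drive business impact [Paid Service]

Data Centric

Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
Bigeye - monitoring and alerting for your datasets in minutes [Paid Service]
datakin - end-to-end, real-time data lineage solution [Paid Service]
Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
SODA - data monitoring, testing and validation [Paid Service]

Feature Stores

Tecton - enterprise feature store for machine learning [Paid Service]
FEAST - open source feature store for machine learning, Website [GitHub, 5525 stars]
Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]

Metadata Management

ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 617 stars]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]

MLOps Frameworks

Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 9883 stars]
Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4353 stars]
ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 3972 stars]
Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
Diffgram - complete training data platform for machine learning delivered as a single application [GitHub, 1834 stars]
Continual.ai - build, deploy, and operationalize ML models with a declarative interface on cloud data warehouses like Snowflake, BigQuery, Redshift, and Databricks [Paid Service]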

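Transformer-based Architectures

Back to the Table of Contents

General

Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
by Sebastian Guggisberg
Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 254 stars]
Practical NLP for the Real World [Presentation, 2019]
From Paper to Product - How we implemented BERT by Christoph Henkelmann [Talk, 2020]

Multi-GPU Transformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]

Training Transformers Effectively

Training BERT with Compute/Time (Academic) Budget [GitHub, 309 stars]

Embeddings as a Service

embedding-as-service [GitHub, 204 stars]
Bert-as-service [GitHub, 12399 stars]

NLP Recipes Industrial Applications:

NLP Recipes by Microsoft [GitHub, 6367 stars]
NLP with Python by susanli2016 [GitHub, 2721 stars]
Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2210 stars]

NLP Applications in Bio, Finance, Legal and other industries

Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 636 stars]
Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1688 stars]
FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 197 stars]
LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 692 stars]
NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 613 stars]
BioIE - A curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 338 stars]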

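Note Section keywords: speech recognition

Back to the Table of Contents

General Speech Recognition

wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 25166 stars]
Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
kaldi - Kaldi is a toolkit for speech recognition [GitHub, 14177 stars]
awesome-kaldi - resources for using Kaldi [GitHub, 532 stars]
ESPnet - End-to-End Speech Processing Toolkit [GitHub, 8355 stars]
HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech / Speech Generation

FastSpeech - The Implementation of FastSpeech based on PyTorch [GitHub, 857 stars]
TTS - a deep learning toolkit for Text-to-Speech [GitHub, 34356 stars]
NotebookLM - Google Gemini powered personal assistant / podcast generator

Speech to Text

whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 68884 stars]
vibe - GUI tool for transcribing with Whisper, multilingual and CUDA support included [GitHub, 931 stars]

Datasets

VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 507 stars]

Note Section keywords: topic modeling

Back to the Table of Contents

Blogs

Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]

Frameworks for Topic Modeling

gensim - framework for topic modeling [GitHub, 15597 stars]
Spark NLP [GitHub, 3826 stars]

Repositories

Top2Vec [GitHub, 2924 stars]
Anchored Correlation Explanation Topic Modeling [GitHub, 303 stars]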
Topic Modeling in Embedding Spaces [GitHub, 540 stars] Paper
TopicNet - A high-level interface for BigARTM library [GitHub, 140 stars]
BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 6038 stars]
OCTIS - A python package to optimize and evaluate topic models [GitHub, 718 stars]
Contextualized Topic Models [GitHub, 1196 stars]
GSDMM - GSDMM: Short text clustering [GitHub, 353 stars]

Note Section keywords: keyword extraction

Back to the Table of Contents

Text Rank

PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2132 stars]
textrank - TextRank implementation for Python 3 [GitHub, 1248 stars]

RAKE - Rapid Automatic Keyword Extraction

rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1061 stars]
yake - Single-document unsupervised keyword extraction [GitHub, 1632 stars]
RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 375 stars]

Other Approaches

flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5583 stars]
BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 254 stars]
keyBERT - Minimal keyword extraction with BERT [GitHub, 3471 stars]
KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 251 stars]

Further Reading

Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
How to Extract Relevant Keywords with KeyBERT [Blog, June 2021]

Note Section keywords: ethics, responsible NLP

Back to the Table of Contents

NLP and ML Interpretability

NLP-centric

Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
ecco - Tools to visualize and explore NLP language models [GitHub, 1974 stars]
NLP Profiler - A simple NLP library allows profiling datasets with text columns [GitHub, 243 stars]
transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 1278 stars]
Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 1400 stars]
LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 1346 stars]

General

Language Interpretability Tool (LIT) [GitHub, 3474 stars]
WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 468 stars]
Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 413 stars]
InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 6238 stars]
thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 342 stars]
imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]

Ethics, Bias, and Equality in NLP

Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
Ethics in NLP - resources from ACL's Ethics in NLP track
The Institute for Ethical AI & Machine Learning
Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 77 stars]
nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 65 stars]
bias-in-nlp - list of papers related to bias in NLP [GitHub, 9 stars]

Adversarial Attacks for NLP

Privacy Considerations in Large Language Models [Blog, Dec 2020]
DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 73 stars]
Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 62 stars]

Hate Speech Analysis

HateXplain - BERT for detecting abusive language [GitHub, 187 stars]

Note Section keywords: frameworks

Back to the Table of Contents

General Purpose

spaCy by Explosion AI [GitHub, 29784 stars]
flair by Zalando [GitHub, 13855 stars]
AllenNLP by AI2 [GitHub, 11740 stars]
stanza (former Stanford NLP) [GitHub, 7253 stars]
spaCy stanza [GitHub, 723 stars]
nltk [GitHub, 13489 stars]
gensim - framework for topic modeling [GitHub, 15597 stars]
pororo - Platform of neural models for natural language processing [GitHub, 1279 stars]
NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2936 stars]
FARM [GitHub, 1734 stars]
gobbli by RTI International [GitHub, 275 stars]
headliner - training and deployment of seq2seq models [GitHub, 229 stars]
SyferText - A privacy preserving NLP framework [GitHub, 197 stars]
DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1263 stars]
TextHero - Text preprocessing, representation and visualization [GitHub, 2882 stars]
textblob - TextBlob: Simplified Text Processing [GitHub, 9109 stars]
AdaptNLP - A high level framework and library for NLP [GitHub, 407 stars]
textacy - NLP, before and after spaCy [GitHub, 2209 stars]
texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2388 stars]
jiant - jiant is an NLP toolkit [GitHub, 1639 stars]

Data Augmentation

WildNLP Text manipulation library to test NLP models [GitHub, 76 stars]
snorkel Framework to generate training data [GitHub, 5791 stars]
NLPAug Data augmentation for NLP [GitHub, 4419 stars]
SentAugment Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
faker - Python package that generates fake data for you [GitHub, 17648 stars]
textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 639 stars]
Parrot - Practical and feature-rich paraphrasing framework [GitHub, 871 stars]
AugLy - data augmentations library for audio, image, text, and video [GitHub, 4950 stars]
TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 396 stars]

Adversarial NLP Attacks & Behavioral Testing

TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6172 stars]
CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]

Transformer-oriented

transformers by HuggingFace [GitHub, 132974 stars]
Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 2543 stars]
haystack - Transformers at scale for question answering & neural search. [GitHub, 16997 stars]

Dialogue Systems and Speech

DeepPavlov by MIPT [GitHub, 6676 stars]
ParlAI by FAIR [GitHub, 10477 stars]
rasa - Framework for Conversational Agents [GitHub, 18726 stars]
wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
ChatterBot - conversational dialog engine for creating chatbots [GitHub, 14039 stars]
SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 8674 stars]
dialoguefactory Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]

Word/Sentence-embeddings oriented

MUSE A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3181 stars]
vecmap A framework to learn cross-lingual word embedding mappings [GitHub, 644 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]

Social Media Oriented

Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 661 stars]

Phonetics

DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 352 stars]

Morphology

LemmInflect - python module for English lemmatization and inflection [GitHub, 259 stars]
Inflect - generate plurals, ordinals, indefinite articles [GitHub, 964 stars]
simplemma - simple multilingual lemmatizer for Python [GitHub, 964 stars]

Multi-lingual tools

polyglot - Multi-lingual NLP Framework [GitHub, 2309 stars]
trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 730 stars]

Distributed NLP / Multi-GPU NLP

Spark NLP [GitHub, 3826 stars]
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]

Machine Translation

COMET -A Neural Framework for MT Evaluation [GitHub, 493 stars]
marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 1236 stars]
argos-translate - Open source neural machine translation in Python [GitHub, 3771 stars]
Opus-MT - Open neural machine translation models and web services [GitHub, 605 stars]
dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 440 stars]
CTranslate2 - CTranslate2 end-to-end machine translation [GitHub, 3300 stars]

Entity and String Matching

PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 736 stars]
pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 9220 stars]
jellyfish - approximate and phonetic matching of strings [GitHub, 2049 stars]
textdistance - Compute distance between sequences [GitHub, 3367 stars]
DeepMatcher - Python package for performing entity and text matching using deep learning [GitHub, 555 stars]
RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 339 stars]
Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 17 stars]

Discourse Analysis

ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 543 stars]

PII scrubbing

scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 394 stars]

Hashtag Segmentation

hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 68 stars]

Books Analysis / Literary Analysis / Semantic Search

booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 785 stars]
bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars]
SemanticFinder - frontend-only live semantic search with transformers.js [GitHub, 224 stars]

Non-English oriented

Japanese

fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 391 stars]
SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 390 stars]
Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 226 stars]
jProcessing - Japanese Natural Language Processing Libraries [GitHub, 148 stars]
Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 745 stars]
kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 953 stars]
nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 382 stars]
KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 201 stars]
Jigg - Pipeline framework for easy natural language processing [GitHub, 74 stars]
Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 376 stars]
RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 473 stars]
toiro - a comparison tool of Japanese tokenizers [GitHub, 118 stars]

Thai

AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 79 stars]
ThaiLMCut - Word Tokenizer for Thai Language [GitHub, 15 stars]

Chinese

Spacy-pkuseg - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 53 stars]

Ukrainian

recruitment-dataset - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English)

Other

textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 103 stars]
Kashgari Transfer Learning with focus on Chinese [GitHub, 2389 stars]
Underthesea - Vietnamese NLP Toolkit [GitHub, 1383 stars]
PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 84 stars]

Text Data Labelling & Classification

Small-Text - Active Learning for Text Classification in Python [GitHub, 549 stars]
Doccano - open source annotation tool for machine learning practitioners [GitHub, 9460 stars]
Adala - Autonomous DAta (Labeling) Agent framework [GitHub, 927 stars]
EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
Prodigy - annotation tool powered by active learning [Paid Service]

Note Section keywords: learn NLP

Back to the Table of Contents

General

Learn NLP the practical way [Blog, Nov. 2019]
Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
Choosing the right course for a Practical NLP Engineer
12 Best Natural Language Processing Courses & Tutorials to Learn Online
Treasure of Transformers - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 912 stars]
Rasa Algorithm Whiteboard - YouTube series by Rasa explaining various Data Science and NLP Algorithms
ExplosionAI Videos - YouTube series by ExplosionAI teaching you how to use spaCy and apply it for NLP

Courses

CS25: Transformers United Stanford - Fall 2021 [Course, Fall 2021]
NLP Course | For You - Great and interactive course on NLP
Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
Transformer models for NLP by HuggingFace
Stanford NLP Seminar - slides from the Stanford NLP course

Books

Natural Language Processing with Transformers - [Book, February 2022]
Applied Natural Language Processing in the Enterprise - [Book, May 2021]
Practical Natural Language Processing - [Book, June 2020]
Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
Top NLP Books to Read 2020 - Blog post by Raymong Cheng [Blog, Sep 2020]

Tutorials

nlp-tutorial - A list of NLP(Natural Language Processing) tutorials built on PyTorch [GitHub, 1366 stars]
nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 14110 stars]
Hands-On NLTK Tutorial [GitHub, 540 stars]
Modern Practical Natural Language Processing [GitHub, 266 stars]
Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 9176 stars]
CalmCode Tutorials - Set of Python Data Science Tutorials

r/LanguageTechnology - NLP Reddit forum

Back to the Table of Contents

Tokenization

tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 8940 stars]
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 10141 stars]
SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 135 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks

WildNLP Text manipulation library to test NLP models [GitHub, 76 stars]
NLPAug Data augmentation for NLP [GitHub, 4419 stars]
SentAugment Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 917 stars]
NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 773 stars]
EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
snorkel Framework to generate training data [GitHub, 5791 stars]
dialoguefactory Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]

Reading Material and Tutorials

A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
A Visual Survey of Data Augmentation in NLP [Blog, 2020]
Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]

Named Entity Recognition (NER)

Datasets for Entity Recognition [GitHub, 1497 stars]
Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 338 stars]
Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 212 stars]
Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 385 stars]

Relation Extraction

tacred-relation TACRED: position-aware attention model for relation extraction [GitHub, 355 stars]
tacrev TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 69 stars]
tac-self-attention Relation extraction with position-aware self-attention [GitHub, 64 stars]
Re-TACRED Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 51 stars]

Coreference Resolution

NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2850 stars]
coref - BERT and SpanBERT for Coreference Resolution [GitHub, 443 stars]

Sentiment Analysis

Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 517 stars]
Awesome Sentiment Analysis by xiamx [GitHub, 913 stars]

Domain Adaptation

Neural Adaptation in Natural Language Processing - curated list [GitHub, 261 stars]

Low Resource NLP

CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 597 stars]

Spell Correction / Error Correction

Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1502 stars]
NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 665 stars]
SymSpellPy - Python port of SymSpell [GitHub, 796 stars]
Speller100 by Microsoft [Blog, Feb 2021]
JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 608 stars]
pycorrector - spell correction for Chinese [GitHub, 5517 stars]
contractions - Fixes contractions such as you're to you are [GitHub, 308 stars]
Fine Tuning T5 for Grammar Correction by Sachin Abeywardana [Blog, Nov 2022]

Style Transfer for NLP

Styleformer - Neural Language Style Transfer framework [GitHub, 475 stars]
StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 60 stars]

Automata Theory for NLP

pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]

Obscene words detection

LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 2899 stars]

Reddit Analysis

Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 489 stars]

Skill Detection

SkillNER - rule based NLP module to extract job skills from text [GitHub, 153 stars]

Reinforcement Learning for NLP

nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 192 stars]

AutoML / AutoNLP

AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 3836 stars]
TPOT - Python Automated Machine Learning tool [GitHub, 9691 stars]
Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2359 stars]
HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 710 stars]
AutoML Natural Language - Google's paid AutoML NLP service
Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
FLAML - fast and lightweight AutoML library [GitHub, 3871 stars]
Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 306 stars]

OCR - Optical Character Recognition

A framework for designing document processing solutions [Blog, June 2022]

Document AI

Table Transformer + HuggingFace Models

Text Generation

keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 445 stars]
Controllable Neural Text Generation [Blog, Jan 2021]
BARTScore Evaluating Generated Text as Text Generation [GitHub, 317 stars]