The NLP Pandect

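This Pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing.

Note: a quick legend on the available resource types:
- open-source project, usually a GitHub repository with its number of stars
- resource you can read, usually a blog post or a paper
- a collection of additional resources
- non-open-source tool, framework or paid service
- a resource you can watch
- a resource you can listen to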

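Main Section	Sub-sections Sample
NLP Resources	Paper Summaries, Conference Summaries, NLP Datasets
NLP Podcasts	NLP-only Podcasts, Podcasts with many NLP Episodes
NLP Newsletters	-
NLP Meetups	-
NLP YouTube Channels	-
NLP Benchmarks	General NLU, Question Answering, Multilingual
Research Resources	Resources on Transformer Models, Distillation and Pruning, Automated Summarization
Industry Resources	Best Practices for NLP Systems, MLOps for NLP
Speech Recognition	General Resources, Text to Speech, Speech to Text, Datasets
Topic Modeling	Blogs, Frameworks, Repositories and Projects
Keyword Extraction	Text Rank, RAKE, Other Approaches
Responsible NLP	NLP and ML Interpretability, Ethics, Bias, and Equality in NLP, Adversarial Attacks for NLP
NLP Frameworks	General Purpose, Data Augmentation, Machine Translation, Adversarial Attacks, Dialogue Systems & Speech, Entity and String Matching, Non-English Frameworks, Text Annotation
Learning NLP	Courses, Books, Tutorials
NLP Communities	-
Other NLP Topics	Tokenization, Data Augmentation, Named Entity Recognition, Error Correction, AutoML/AutoNLP, Text Generation

Note Section keywords: paper summaries, compendium, awesome list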

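Compendiums and awesome lists on the topic of NLP:

The NLP Index - Searchable Index of NLP Papers by Quantum Stat / NLP Cypher
Awesome NLP by keon [GitHub, 16528 stars]
Speech and Natural Language Processing Awesome List [GitHub, 2189 stars]
Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1274 stars]
Text Mining and Natural Language Processing Resources by stepthom [GitHub, 557 stars]
Brainsources for #NLP enthusiasts by Philip Vollet
Awesome AI/ML/DL - NLP Section [GitHub, 1473 stars]
NLP articles by Devopedia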

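NLP Conferences, Paper Summaries and Paper Compendiums:

Papers and Paper Summaries

100 Must-Read NLP Papers [GitHub, 3732 stars]
NLP Paper Summaries by dair-ai [GitHub, 1475 stars]
Curated collection of papers for the NLP practitioner [GitHub, 1075 stars]
Papers on Textual Adversarial Attack and Defense [GitHub, 1501 stars]
Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 296 stars]
A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1997 stars]
A Paper List for Style Transfer in Text [GitHub, 1609 stars]
Video recordings index for papers

Conference Summaries

NLP top 10 conferences Compendium by soulbliss [GitHub, 459 stars]
ICLR 2020 Trends
SpacyIRL 2019 Conference in Overview
Paper Digest - Conferences and Papers in Overview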

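NLP Progress and NLP Tasks:

NLP Progress by sebastianruder [GitHub, 22568 stars]
NLP Tasks by Kyubyong [GitHub, 3017 stars]

NLP Datasets:

NLP Datasets by niderhoff [GitHub, 5741 stars]
Datasets by Huggingface [GitHub, 19096 stars]
Big Bad NLP Database
UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
MLDoc - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 152 stars]

Word and Sentence embeddings:

Awesome Embedding Models by Hironsan [GitHub, 1752 stars]
Awesome list of Sentence Embeddings by Separius [GitHub, 2219 stars]
Awesome BERT by Jiakui [GitHub, 1846 stars]

Notebooks, Scripts and Repositories

The Super Duper NLP Repo [Website, 2020]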

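Non-English resources and compendiums

NLP Resources for Bahasa Indonesian [GitHub, 480 stars]
Indic NLP Catalog [GitHub, 552 stars]
Pre-trained language models for Vietnamese [GitHub, 653 stars]
Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 814 stars]
Indic NLP Library [GitHub, 550 stars]
AI4Bharat-IndicNLP Portal
ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 387 stars]
zemberek-nlp - NLP tools for Turkish [GitHub, 1146 stars]
TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools.
KLUE - Korean Language Understanding Evaluation [GitHub, 560 stars]
Persian NLP Benchmark - benchmark for evaluating and comparing various NLP tasks in Persian [GitHub, 73 stars]
nlp-greek - Greek language sources [GitHub, 5 stars]
Awesome NLP resources for Hungarian [GitHub, 221 stars]

Pre-trained NLP models

List of pre-trained NLP models [GitHub, 170 stars]
Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 3019 stars]
Spanish Language Models and resources [GitHub, 251 stars]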

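NLP History

General

Modern Deep Learning Techniques Applied to Natural Language Processing [GitHub, 1328 stars]
A Review of the Neural History of Natural Language Processing [Blog, October 2018]

2020 Year in Review

Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
ML and NLP Research Highlights of 2020 [Blog, January 2021]

Back to the Table of Contents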

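NLP-only podcasts

NLP Highlights [Years: 2017 - now, Status: active]
The NLP Zone [Years: 2021 - now, Status: active]

Many NLP episodes

TWIML AI [Years: 2016 - now, Status: active]
Practical AI [Years: 2018 - now, Status: active]
The Data Exchange [Years: 2019 - now, Status: active]
Gradient Dissent [Years: 2020 - now, Status: active]
Machine Learning Street Talk [Years: 2020 - now, Status: active]
DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]

Some NLP episodes

The Super Data Science Podcast [Years: 2016 - now, Status: active]
Data Hackers Radio [Years: 2018 - now, Status: active]
The Analytics Show [Years: 2019 - now, Status: active]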

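NLP News by Sebastian Ruder
This Week in NLP by Robert Dale
Papers with Code
The Batch by deeplearning.ai
Paper Digest by PaperDigest
NLP Cypher by QuantumStat

NLP Zurich [YouTube Recordings]
Hacking-Machine-Learning [YouTube Recordings]
NY-NLP (New York)

Yannic Kilcher
HuggingFace
Kaggle Reading Group
Rasa Paper Reading
Stanford CS224N: NLP with Deep Learning
NLPxing
ML Explained - A.I. Socratic Circles - AISC
Deeplearning.ai
Machine Learning Street Talk

Back to the Table of Contents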

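General NLU

GLUE - General Language Understanding Evaluation (GLUE) benchmark
SuperGLUE - benchmark styled after GLUE with a set of more difficult language understanding tasks
decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 280 stars]
DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 2835 stars]

Summarization

WikiAsp - WikiAsp: multi-document summarization dataset
WikiLingua - A Multilingual Abstractive Summarization Dataset

Question Answering

SQuAD - Stanford Question Answering Dataset (SQuAD)
XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
GrailQA - Strongly Generalizable Question Answering (GrailQA)
CSQA - Complex Sequential Question Answering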

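Multilingual and Non-English Benchmarks

XTREME - Massively Multilingual Multi-task Benchmark
GLUECoS - A benchmark for code-switched NLP
IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
LinCE - Linguistic Code-Switching Evaluation Benchmark
Russian SuperGlue - Russian SuperGlue Benchmark

Bio, Legal, and other scientific domains

BLURB - Biomedical Language Understanding and Reasoning Benchmark
BLUE - Biomedical Language Understanding Evaluation benchmark
LexGLUE - A Benchmark Dataset for Legal Language Understanding in English

Transformer Efficiency

Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 716 stars]

Speech Processing

SUPERB - Speech processing Universal PERformance Benchmark

Other

CodeXGLUE - A benchmark dataset for code intelligence
CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
MultiNLI - Multi-Genre Natural Language Inference corpus
iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic

Back to the Table of Contents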

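General

A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
Recent Advances in NLP via Large Pre-Trained Language Models: A Survey [Paper, November 2021]

Embeddings

Repositories

Pre-trained ELMo Representations for Many Languages [GitHub, 1458 stars]
sense2vec - Contextually-keyed word vectors [GitHub, 1617 stars]
wikipedia2vec [GitHub, 935 stars]
StarSpace [GitHub, 3938 stars]
fastText [GitHub, 25871 stars]

Blogs

Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
The Illustrated Word2vec by Jay Alammar [Blog, 2019]

Cross-lingual Word and Sentence Embeddings

vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 644 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]

Byte Pair Encoding

bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1179 stars]
subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2185 stars]
python-bpe - Byte Pair Encoding for Python [GitHub, 223 stars]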

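Transformer-based Architectures

General

The Transformer Family by Lilian Weng [Blog, 2020]
Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
Attention? Attention! by Lilian Weng [Blog, 2018]
The Transformer ... "Explained"? [Blog, 2019]
Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
Attention Is Off By One [July 2023]
Understanding and Applying Self-Attention [Talk, 2018]
The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
Pre-Trained Models: Past, Present and Future [Paper, June 2021]
A Survey of Transformers [Paper, June 2021]

Transformer

The Annotated Transformer by Harvard NLP [Blog, 2018]
The Illustrated Transformer by Jay Alammar [Blog, 2018]
Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
Reformer: The Efficient Transformer [Blog, 2020]
Longformer - The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
Transformers from Scratch [Blog, 2019]
Transformers in Natural Language Processing - A Brief Survey by George Ho [Blog, May 2020]
Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 596 stars]
Transformers from Scratch [Blog, October 2021]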

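BERT

A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
Understanding searches better than ever before [Blog, 2019]
Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 286 stars]
BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 574 stars]
Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
CharacterBERT: Reconciling ELMo and BERT [GitHub, 195 stars]
When BERT Plays The Lottery, All Tickets Are Winning [Blog, December 2020]
BERT-related Papers - a list of BERT-related papers [GitHub, 2032 stars]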

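Other Transformer Variants

T5

T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
T5: the Text-To-Text Transfer Transformer [Blog, 2020]
multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual text-to-text transformer model [GitHub, 1245 stars]

BigBird

BigBird: Transformers for Longer Sequences, original paper by Google Research [Paper, July 2020]

Reformer / Linformer / Longformer / Performer

Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]

Switch Transformer

Switch Transformers: Scaling to Trillion Parameter Models, original paper by Google Research [Paper, January 2021]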

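GPT-family

General

The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
The Annotated GPT-2 by Aman Arora
OpenAI's GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
How to generate text by Patrick von Platen [Blog, 2020]

GPT-3

Learning Resources

Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
GPT-3 A Brief Summary by Leo Gao [Blog, 2020]
GPT-3, a Giant Step for Deep Learning And NLP by Yoel Zeldes [Blog, June 2020]
GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
Is it possible for language models to achieve language understanding? by Christopher Potts

Applications

Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4589 stars]
GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ Apps, Examples, & Resources
OpenAI API - API demo for using OpenAI GPT in commercial applications

Open-source Efforts

GPT-Neo - in-progress open-source replication of GPT-3 (HuggingFace Hub)
GPT-J - a 6 billion parameter, autoregressive text generation model trained on The Pile
Effectively using GPT-J with few-shot learning [Blog, July 2021]

Other

What is Two-Stream Self-Attention in XLNet by Xu Liang [Blog, 2019]
Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
Turing NLG by Microsoft
Multi-Label Text Classification by Josh Xin Jie Lee [Blog, 2019]
electra [GitHub, 2326 stars]
Performer implementation (a linear attention-based transformer) in PyTorch [GitHub, 1084 stars]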

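Distillation, Pruning and Quantization

Reading Material

Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
Compression of Deep Learning Models: A Survey [Paper, April 2021]

Tools

Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 79 stars]
XtremeDistil - XtremeDistilTransformers for Distilling Massive Multilingual Neural Networks [GitHub, 153 stars]

Automated Summarization

PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 146 stars]
XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 252 stars]
SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 265 stars]
PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 151 stars]
summarus - Models for automatic abstractive summarization [GitHub, 170 stars]

Knowledge Graphs and NLP

Fusing Knowledge into Language Models [Presentation, October 2021]

Note Section keywords: best practices, MLOps

Back to the Table of Contents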

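Best Practices for building NLP Projects

In Search of Best Practices for NLP Projects [Slides, December 2020]
EMNLP 2020: High Performance Natural Language Processing by Google Research [Recording, November 2020]
Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
How to Structure and Manage NLP Projects [Blog, May 2021]
Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
Introduction to NLP for Industry Use - presentation by DataTalksClub introducing NLP for industry use [Recording, December 2021]
Measuring Embedding Drift - best practices for monitoring drift of NLP models [Blog, December 2022]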

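MLOps for NLP

MLOps, especially when applied to NLP, is a set of best practices for automating the various parts of the workflow when building and deploying NLP pipelines.

In general, MLOps for NLP means having the following processes in place:

Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
Experiment Tracking - make sure all of your experiments are automatically tracked and saved, so they can easily be replicated or retraced
Model Registry - make sure every neural model you train is versioned and tracked, and that it is easy to roll back to any of them
Automated Testing and Behavioral Testing - besides regular unit and integration tests, you also want behavioral tests that check for bias or potential adversarial attacks
Model Deployment and Serving - automate model deployment, ideally with zero-downtime deployments such as Blue/Green, Canary deploys etc.
Data and Model Observability - track data drift, model accuracy drift etc.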
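
The data versioning and experiment tracking items above can be illustrated with a minimal, standard-library-only sketch. Everything named here is an assumption made for illustration: `dataset_version` and `experiment_record` are hypothetical helpers, the run name, parameters and accuracy value are made-up placeholders, and a truncated SHA-256 content hash merely stands in for what dedicated tools such as DVC or MLflow do far more thoroughly.

```python
import hashlib
import json


def dataset_version(records):
    """Return a short, deterministic content hash for a list of text records."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab", "c"] and ["a", "bc"] hash differently
    return digest.hexdigest()[:12]


def experiment_record(name, records, params, metrics):
    """Bundle what identifies a run into one JSON-serializable dict."""
    return {
        "experiment": name,
        "data_version": dataset_version(records),
        "params": params,
        "metrics": metrics,
    }


# Two hypothetical snapshots of a training set.
train_v1 = ["good movie", "bad movie"]
train_v2 = ["good movie", "bad movie", "great plot"]

# Placeholder run: the parameter and metric values are illustrative only.
run = experiment_record("sentiment-baseline", train_v1, {"lr": 0.001}, {"accuracy": 0.5})
print(json.dumps(run, indent=2, sort_keys=True))
```

Because the version is derived from the data itself, any change to the training set yields a new identifier, and each stored run records exactly which snapshot it was trained on.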

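Additionally, there are two more components that are less prevalent in NLP and are mostly used in Computer Vision and other sub-fields of AI:

Feature Store - centralized storage of all features developed for ML models, so that they can easily be reused by any other ML project
Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing the behavior of deployed ML models, artifact tracking, etc.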
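
As a rough illustration of the feature store idea above, the sketch below keeps feature definitions in one in-memory registry so that different projects can request them by name. The `FeatureStore` class, its methods and the two example features are hypothetical and are not the API of any tool listed in this document; production feature stores add persistence, versioning and serving on top of this.

```python
class FeatureStore:
    """Hypothetical in-memory registry mapping feature names to functions of a text."""

    def __init__(self):
        self._features = {}

    def register(self, name, func):
        """Add a feature definition; refuse to silently overwrite an existing one."""
        if name in self._features:
            raise ValueError(f"feature {name!r} is already registered")
        self._features[name] = func

    def compute(self, text, names):
        """Compute the requested features for one text, returned as a dict keyed by name."""
        return {name: self._features[name](text) for name in names}


store = FeatureStore()
store.register("n_tokens", lambda text: len(text.split()))  # whitespace token count
store.register("n_chars", len)  # character count

features = store.compute("NLP pipelines need tests", ["n_tokens", "n_chars"])
print(features)
```

Registering each feature once and looking it up by name is what makes reuse across projects possible: two models that both ask for `n_tokens` are guaranteed to get the same definition.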

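MLOps Compendiums and Awesome Lists

Awesome MLOps [GitHub, 12526 stars]
Best-of ML Python [GitHub, 16309 stars]
MLOps.Toys - a curated list of MLOps projects

Reading Material

Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, October 2022]
MLOps: What It Is, Why it Matters, and How To Implement it by Neptune AI [Blog, July 2021]
Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
MLOps 2021 by Valohai [Blog, August 2021]
The MLOps Stack by Valohai [Blog, October 2020]
Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
MLOps: Comprehensive Beginner's Guide [Blog, March 2021]
What I've learned about MLOps from speaking with 100+ ML practitioners [Blog, May 2021]
DataRobot Challenger Models - MLOps Champion/Challenger Models
MLOps Blog by Dr. Ori Cohen
MLOps Ecosystem Overview [Blog, 2021]

Learning Material

MLOps course by Made With ML
GitHub MLOps - a collection of resources on how to facilitate Machine Learning Ops
ML Observability Fundamentals Course - learn how to monitor and root-cause issues with production NLP models

MLOps Communities

MLOps Community - blogs, Slack group, newsletter and more, all about MLOps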

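Data Versioning

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]

Experiment Tracking

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
SigOpt - automate training and tuning, visualize and compare runs [Paid Service]
Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]

Model Registry

DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
ModelDB - open-source system for ML model versioning, metadata, and experiment management [GitHub, 1696 stars]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
Valohai - End-to-end ML pipelines [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
Polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]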

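Automated Testing and Behavioral Testing

CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
WildNLP - corrupt an input text to test the robustness of NLP models [GitHub, 76 stars]
Great Expectations - write tests for your data [GitHub, 9874 stars]
Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 3582 stars]

Model Deployability and Serving

mlflow - open source platform for the machine learning lifecycle [Free and Open Source] Link to GitHub
Amazon SageMaker [Paid Service]
Valohai - End-to-end ML pipelines [Paid Service]
NLP Cloud - production-ready NLP API [Paid Service]
Saturn Cloud [Paid Service]
SELDON - machine learning deployment for enterprise [Paid Service]
Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
Polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
TorchServe - flexible and easy-to-use tool for serving PyTorch models [GitHub, 4174 stars]
Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
KFServing - Serverless Inferencing on Kubernetes [GitHub, 3504 stars]
TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]
Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
Cortex - containers as a service on AWS [Paid Service]
Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
End2End Serverless Transformers On AWS Lambda [GitHub, 121 stars]
NLP-Service - sample demo of an NLP-as-a-service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
Dagster - data orchestrator for machine learning [Free and Open Source]
Verta - AI and machine learning deployment and operations [Paid Service]
Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
flyte - workflow automation platform for complex, mission-critical data and ML processes [GitHub, 5525 stars]
MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI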

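Model Debugging

imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 474 stars]

Model Accuracy Prediction

WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1453 stars]

Data and Model Observability

General

Arize AI - embedding drift monitoring for NLP models
Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
whylogs - open source standard for data and ML logging [GitHub, 2636 stars]
Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 3843 stars]
MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
Cortex - containers as a service on AWS [Paid Service]

Model Centric

Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
Dataiku - Dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
Fiddler - ML Model Performance Management Tool [Paid Service]
Hydrosphere - open-source platform for managing ML models [Paid Service]
Verta - AI and machine learning deployment and operations [Paid Service]
Domino Model Ops - deploy and manage models to drive business impact [Paid Service]

Data Centric

Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
Bigeye - monitoring and alerting for your datasets in minutes [Paid Service]
datakin - end-to-end, real-time data lineage solution [Paid Service]
Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
SODA - data monitoring, testing and validation [Paid Service]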

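Feature Stores

Tecton - enterprise feature store for machine learning [Paid Service]
FEAST - open-source feature store for machine learning (website) [GitHub, 5525 stars]
Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]

Metadata Management

ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 617 stars]
Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]

MLOps Frameworks

Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 9883 stars]
Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4353 stars]
ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 3972 stars]
Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
Diffgram - complete training data platform for machine learning delivered as a single application [GitHub, 1834 stars]
Continual.ai - build, deploy, and operationalize ML models with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks [Paid Service]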

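Transformer-based Architectures

Back to the Table of Contents

General

Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
By Sebastian Guggisberg
Pretrain Transformers Models in PyTorch using Hugging Face Transformers [GitHub, 254 stars]
Practical NLP for the Real World [Presentation, 2019]
From Paper to Product - How we implemented BERT by Christoph Henkelmann [Talk, 2020]

Multi-GPU Transformers

Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]

Training Transformers Effectively

Training BERT with Compute/Time (Academic) Budget [GitHub, 309 stars]

Embeddings as a Service

embedding-as-service [GitHub, 204 stars]
Bert-as-service [GitHub, 12399 stars]

NLP Recipes Industrial Applications:

NLP Recipes by Microsoft [GitHub, 6367 stars]
NLP with Python by susanli2016 [GitHub, 2721 stars]
Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2210 stars]

NLP Applications in Bio, Finance, Legal and other industries

Blackstone - a spaCy pipeline and model for NLP on unstructured legal text [GitHub, 636 stars]
Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1688 stars]
FinBERT: Pre-trained on SEC filings for financial NLP tasks [GitHub, 197 stars]
LexNLP - information retrieval and extraction for real, unstructured legal text [GitHub, 692 stars]
NerDL and NerCRF - tutorial on Named Entity Recognition for Healthcare with SparkNLP
Legal Text Analytics - a list of selected resources dedicated to Legal Text Analytics [GitHub, 613 stars]
BioIE - a curated list of resources relevant to doing Biomedical Information Extraction [GitHub, 338 stars]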

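Note Section keywords: speech recognition

Back to the Table of Contents

General Speech Recognition

wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 25166 stars]
Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
kaldi - Kaldi is a toolkit for speech recognition [GitHub, 14177 stars]
awesome-kaldi - resources for using Kaldi [GitHub, 532 stars]
ESPnet - End-to-End Speech Processing Toolkit [GitHub, 8355 stars]
HuBERT - self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]

Text to Speech / Speech Generation

FastSpeech - an implementation of FastSpeech based on PyTorch [GitHub, 857 stars]
TTS - a deep learning toolkit for Text-to-Speech [GitHub, 34356 stars]
NotebookLM - Google Gemini powered personal assistant / podcast generator

Speech to Text

whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 68884 stars]
vibe - GUI tool using Whisper, with multilingual and CUDA support included [GitHub, 931 stars]

Datasets

VoxPopuli - large-scale multilingual corpus for representation learning [GitHub, 507 stars]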

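Note Section keywords: topic modeling

Back to the Table of Contents

Blogs

Topic Modelling and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]

Frameworks for Topic Modeling

gensim - framework for topic modeling [GitHub, 15597 stars]
Spark NLP [GitHub, 3826 stars]

Repositories

Top2Vec [GitHub, 2924 stars]
Anchored Correlation Explanation Topic Modeling [GitHub, 303 stars]
Topic Modeling in Embedding Spaces [GitHub, 540 stars] Paper
TopicNet - a high-level interface for the BigARTM library [GitHub, 140 stars]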
BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 6038 stars]
OCTIS - A python package to optimize and evaluate topic models [GitHub, 718 stars]
Contextualized Topic Models [GitHub, 1196 stars]
GSDMM - GSDMM: Short text clustering [GitHub, 353 stars]

Note Section keywords: keyword extraction

Back to the Table of Contents

Text Rank

PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2132 stars]
textrank - TextRank implementation for Python 3 [GitHub, 1248 stars]

RAKE - Rapid Automatic Keyword Extraction

rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1061 stars]
yake - Single-document unsupervised keyword extraction [GitHub, 1632 stars]
RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 375 stars]

Other Approaches

flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5583 stars]
BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 254 stars]
keyBERT - Minimal keyword extraction with BERT [GitHub, 3471 stars]
KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 251 stars]

Further Reading

Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
How to Extract Relevant Keywords with KeyBERT [Blog, June 2021]

Note Section keywords: ethics, responsible NLP

Back to the Table of Contents

NLP and ML Interpretability

NLP-centric

Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
ecco - Tools to visualize and explore NLP language models [GitHub, 1974 stars]
NLP Profiler - A simple NLP library allows profiling datasets with text columns [GitHub, 243 stars]
transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 1278 stars]
Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 1400 stars]
LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 1346 stars]

General

Language Interpretability Tool (LIT) [GitHub, 3474 stars]
WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 468 stars]
Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 413 stars]
InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 6238 stars]
thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 342 stars]
imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]

Ethics, Bias, and Equality in NLP

Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
Ethics in NLP - resources from ACL's Ethics in NLP track
The Institute for Ethical AI & Machine Learning
Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 77 stars]
nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 65 stars]
bias-in-nlp - list of papers related to bias in NLP [GitHub, 9 stars]

Adversarial Attacks for NLP

Privacy Considerations in Large Language Models [Blog, Dec 2020]
DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 73 stars]
Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 62 stars]

Hate Speech Analysis

HateXplain - BERT for detecting abusive language [GitHub, 187 stars]

Note Section keywords: frameworks

Back to the Table of Contents

General Purpose

spaCy by Explosion AI [GitHub, 29784 stars]
flair by Zalando [GitHub, 13855 stars]
AllenNLP by AI2 [GitHub, 11740 stars]
stanza (former Stanford NLP) [GitHub, 7253 stars]
spaCy stanza [GitHub, 723 stars]
nltk [GitHub, 13489 stars]
gensim - framework for topic modeling [GitHub, 15597 stars]
pororo - Platform of neural models for natural language processing [GitHub, 1279 stars]
NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2936 stars]
FARM [GitHub, 1734 stars]
gobbli by RTI International [GitHub, 275 stars]
headliner - training and deployment of seq2seq models [GitHub, 229 stars]
SyferText - A privacy preserving NLP framework [GitHub, 197 stars]
DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1263 stars]
TextHero - Text preprocessing, representation and visualization [GitHub, 2882 stars]
textblob - TextBlob: Simplified Text Processing [GitHub, 9109 stars]
AdaptNLP - A high level framework and library for NLP [GitHub, 407 stars]
textacy - NLP, before and after spaCy [GitHub, 2209 stars]
texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2388 stars]
jiant - jiant is an NLP toolkit [GitHub, 1639 stars]

Data Augmentation

WildNLP Text manipulation library to test NLP models [GitHub, 76 stars]
snorkel Framework to generate training data [GitHub, 5791 stars]
NLPAug Data augmentation for NLP [GitHub, 4419 stars]
SentAugment Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
faker - Python package that generates fake data for you [GitHub, 17648 stars]
textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 639 stars]
Parrot - Practical and feature-rich paraphrasing framework [GitHub, 871 stars]
AugLy - data augmentations library for audio, image, text, and video [GitHub, 4950 stars]
TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 396 stars]

Adversarial NLP Attacks & Behavioral Testing

TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6172 stars]
CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]

Transformer-oriented

transformers by HuggingFace [GitHub, 132974 stars]
Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 2543 stars]
haystack - Transformers at scale for question answering & neural search. [GitHub, 16997 stars]

Dialogue Systems and Speech

DeepPavlov by MIPT [GitHub, 6676 stars]
ParlAI by FAIR [GitHub, 10477 stars]
rasa - Framework for Conversational Agents [GitHub, 18726 stars]
wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
ChatterBot - conversational dialog engine for creating chatbots [GitHub, 14039 stars]
SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 8674 stars]
dialoguefactory Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]

Word/Sentence-embeddings oriented

MUSE A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3181 stars]
vecmap A framework to learn cross-lingual word embedding mappings [GitHub, 644 stars]
sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]

Social Media Oriented

Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 661 stars]

Phonetics

DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 352 stars]

Morphology

LemmInflect - python module for English lemmatization and inflection [GitHub, 259 stars]
Inflect - generate plurals, ordinals, indefinite articles [GitHub, 964 stars]
simplemma - simple multilingual lemmatizer for Python [GitHub, 964 stars]

Multi-lingual tools

polyglot - Multi-lingual NLP Framework [GitHub, 2309 stars]
trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 730 stars]

Distributed NLP / Multi-GPU NLP

Spark NLP [GitHub, 3826 stars]
Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]

Machine Translation

COMET - A Neural Framework for MT Evaluation [GitHub, 493 stars]
marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 1236 stars]
argos-translate - Open source neural machine translation in Python [GitHub, 3771 stars]
Opus-MT - Open neural machine translation models and web services [GitHub, 605 stars]
dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 440 stars]
CTranslate2 - CTranslate2 end-to-end machine translation [GitHub, 3300 stars]

Entity and String Matching

PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 736 stars]
pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 9220 stars]
jellyfish - approximate and phonetic matching of strings [GitHub, 2049 stars]
textdistance - Compute distance between sequences [GitHub, 3367 stars]
DeepMatcher - Python package for entity and text matching using deep learning [GitHub, 555 stars]
RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 339 stars]
Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 17 stars]

Discourse Analysis

ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 543 stars]

PII scrubbing

scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 394 stars]

Hashtag Segmentation

hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 68 stars]

Books Analysis / Literary Analysis / Semantic Search

booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 785 stars]
bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars]
SemanticFinder - frontend-only live semantic search with transformers.js [GitHub, 224 stars]

Non-English oriented

Japanese

fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 391 stars]
SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 390 stars]
Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 226 stars]
jProcessing - Japanese Natural Language Processing Libraries [GitHub, 148 stars]
Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 745 stars]
kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 953 stars]
nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 382 stars]
KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 201 stars]
Jigg - Pipeline framework for easy natural language processing [GitHub, 74 stars]
Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 376 stars]
RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 473 stars]
toiro - a comparison tool of Japanese tokenizers [GitHub, 118 stars]

Thai

AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 79 stars]
ThaiLMCut - Word Tokenizer for Thai Language [GitHub, 15 stars]

Chinese

Spacy-pkuseg - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 53 stars]

Ukrainian

recruitment-dataset - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English)

Other

textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 103 stars]
Kashgari Transfer Learning with focus on Chinese [GitHub, 2389 stars]
Underthesea - Vietnamese NLP Toolkit [GitHub, 1383 stars]
PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 84 stars]

Text Data Labelling & Classification

Small-Text - Active Learning for Text Classification in Python [GitHub, 549 stars]
Doccano - open source annotation tool for machine learning practitioners [GitHub, 9460 stars]
Adala - Autonomous DAta (Labeling) Agent framework [GitHub, 927 stars]
EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
Prodigy - annotation tool powered by active learning [Paid Service]

Note Section keywords: learn NLP

Back to the Table of Contents

General

Learn NLP the practical way [Blog, Nov. 2019]
Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
Choosing the right course for a Practical NLP Engineer
12 Best Natural Language Processing Courses & Tutorials to Learn Online
Treasure of Transformers - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 912 stars]
Rasa Algorithm Whiteboard - YouTube series by Rasa explaining various Data Science and NLP Algorithms
ExplosionAI Videos - YouTube series by ExplosionAI teaching you how to use spaCy and apply it for NLP

Courses

CS25: Transformers United Stanford - Fall 2021 [Course, Fall 2021]
NLP Course | For You - Great and interactive course on NLP
Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
Transformer models for NLP by HuggingFace
Stanford NLP Seminar - slides from the Stanford NLP course

Books

Natural Language Processing with Transformers - [Book, February 2022]
Applied Natural Language Processing in the Enterprise - [Book, May 2021]
Practical Natural Language Processing - [Book, June 2020]
Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
Top NLP Books to Read 2020 - Blog post by Raymong Cheng [Blog, Sep 2020]

Tutorials

nlp-tutorial - A list of NLP(Natural Language Processing) tutorials built on PyTorch [GitHub, 1366 stars]
nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 14110 stars]
Hands-On NLTK Tutorial [GitHub, 540 stars]
Modern Practical Natural Language Processing [GitHub, 266 stars]
Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 9176 stars]
CalmCode Tutorials - Set of Python Data Science Tutorials

r/LanguageTechnology - NLP Reddit forum

Back to the Table of Contents

Tokenization

tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 8940 stars]
SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 10141 stars]
SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 135 stars]

Data Augmentation and Weak Supervision

Libraries and Frameworks

WildNLP Text manipulation library to test NLP models [GitHub, 76 stars]
NLPAug Data augmentation for NLP [GitHub, 4419 stars]
SentAugment Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 917 stars]
NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 773 stars]
EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
snorkel Framework to generate training data [GitHub, 5791 stars]
dialoguefactory Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]

Reading Material and Tutorials

A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
A Visual Survey of Data Augmentation in NLP [Blog, 2020]
Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]

Named Entity Recognition (NER)

Datasets for Entity Recognition [GitHub, 1497 stars]
Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 338 stars]
Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 212 stars]
Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 385 stars]

Relation Extraction

tacred-relation TACRED: position-aware attention model for relation extraction [GitHub, 355 stars]
tacrev TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 69 stars]
tac-self-attention Relation extraction with position-aware self-attention [GitHub, 64 stars]
Re-TACRED Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 51 stars]

Coreference Resolution

NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2850 stars]
coref - BERT and SpanBERT for Coreference Resolution [GitHub, 443 stars]

Sentiment Analysis

Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 517 stars]
Awesome Sentiment Analysis by xiamx [GitHub, 913 stars]

Domain Adaptation

Neural Adaptation in Natural Language Processing - curated list [GitHub, 261 stars]

Low Resource NLP

CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 597 stars]

Spell Correction / Error Correction

Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1502 stars]
NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 665 stars]
SymSpellPy - Python port of SymSpell [GitHub, 796 stars]
Speller100 by Microsoft [Blog, Feb 2021]
JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 608 stars]
pycorrector - spell correction for Chinese [GitHub, 5517 stars]
contractions - Fixes contractions such as you're to you are [GitHub, 308 stars]
Fine Tuning T5 for Grammar Correction by Sachin Abeywardana [Blog, Nov 2022]

Style Transfer for NLP

Styleformer - Neural Language Style Transfer framework [GitHub, 475 stars]
StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 60 stars]

Automata Theory for NLP

pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]

Obscene words detection

LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 2899 stars]

Reddit Analysis

Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 489 stars]

Skill Detection

SkillNER - rule based NLP module to extract job skills from text [GitHub, 153 stars]

Reinforcement Learning for NLP

nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 192 stars]

AutoML / AutoNLP

AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 3836 stars]
TPOT - Python Automated Machine Learning tool [GitHub, 9691 stars]
Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2359 stars]
HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 710 stars]
AutoML Natural Language - Google's paid AutoML NLP service
Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
FLAML - fast and lightweight AutoML library [GitHub, 3871 stars]
Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 306 stars]

OCR - Optical Character Recognition

A framework for designing document processing solutions [Blog, June 2022]

Document AI

Table Transformer + HuggingFace Models

Text Generation

keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 445 stars]
Controllable Neural Text Generation [Blog, Jan 2021]
BARTScore - Evaluating Generated Text as Text Generation [GitHub, 317 stars]