
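This Pandect (πανδέκτης is Ancient Greek for encyclopedia) was created to help you find almost anything related to Natural Language Processing.
Quick legend of the available resource types:
- open source project, usually a GitHub repository with its number of stars
- resource you can read, usually a blog post or a paper
- a collection of additional resources
- non-open source tool, framework or paid service
- a resource you can watch
- a resource you can listen to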
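| Main Section | Sub-section samples |
|---|---|
| NLP Resources | Paper summaries, conference summaries, NLP datasets |
| NLP Podcasts | NLP-only podcasts, podcasts with many NLP episodes |
| NLP Newsletters | - |
| NLP Meetups | - |
| NLP YouTube Channels | - |
| NLP Benchmarks | General NLU, question answering, multilingual |
| Research Resources | Resources on transformer models, distillation and pruning, automated summarization |
| Industry Resources | Best practices for NLP systems, MLOps for NLP |
| Speech Recognition | General resources, text to speech, speech to text, datasets |
| Topic Modeling | Blogs, frameworks, repositories and projects |
| Keyword Extraction | TextRank, RAKE, other approaches |
| Responsible NLP | NLP and ML interpretability, ethics, bias and equality in NLP, adversarial attacks for NLP |
| NLP Frameworks | General purpose, data augmentation, machine translation, adversarial attacks, dialog systems and speech, entity and string matching, non-English frameworks, text annotation |
| Learn NLP | Courses, books, tutorials |
| NLP Communities | - |
| Other NLP Topics | Tokenization, data augmentation, named entity recognition, error correction, AutoML/AutoNLP, text generation |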
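Note Section keywords: paper summaries, compendium, awesome list
Compendiums and awesome lists on the topic of NLP:
- The NLP Index - searchable index of NLP papers by Quantum Stat / NLP Cypher
- Awesome NLP by keon [GitHub, 16528 stars]
- Speech and Natural Language Processing Awesome List [GitHub, 2189 stars]
- Awesome Deep Learning for Natural Language Processing (NLP) [GitHub, 1274 stars]
- Text Mining and Natural Language Processing Resources by stepthom [GitHub, 557 stars]
- Brainsources for #NLP enthusiasts by Philip Vollet
- Awesome AI/ML/DL - NLP Section [GitHub, 1473 stars]
- NLP articles by Devopedia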
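NLP Conferences, Paper Summaries and Paper Compendiums:
Papers and Paper Summaries
- 100 Must-Read NLP Papers [GitHub, 3732 stars]
- NLP Paper Summaries by dair-ai [GitHub, 1475 stars]
- Curated collection of papers for the NLP practitioner [GitHub, 1075 stars]
- Papers on Textual Adversarial Attack and Defense [GitHub, 1501 stars]
- Recent Deep Learning papers in NLU and RL by Valentin Malykh [GitHub, 296 stars]
- A Survey of Surveys (NLP & ML): Collection of NLP Survey Papers [GitHub, 1997 stars]
- A Paper List for Style Transfer in Text [GitHub, 1609 stars]
- Video recordings index for papers
Conference Summaries
- NLP top 10 conferences Compendium by soulbliss [GitHub, 459 stars]
- ICLR 2020 Trends
- SpacyIRL 2019 Conference in Overview
- Paper Digest - Conferences and Papers in Overview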
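NLP Progress and NLP Tasks:
- NLP Progress by sebastianruder [GitHub, 22568 stars]
- NLP Tasks by Kyubyong [GitHub, 3017 stars]
NLP Datasets:
- NLP Datasets by niderhoff [GitHub, 5741 stars]
- Datasets by Huggingface [GitHub, 19096 stars]
- Big Bad NLP Database
- UWA Unambiguous Word Annotations - Word Sense Disambiguation Dataset
- MLDoc - Corpus for Multilingual Document Classification in Eight Languages [GitHub, 152 stars]
Word and Sentence embeddings:
- Awesome Embedding Models by Hironsan [GitHub, 1752 stars]
- Awesome list of Sentence Embeddings by Separius [GitHub, 2219 stars]
- Awesome BERT by Jiakui [GitHub, 1846 stars]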
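Notebooks, Scripts and Repositories
Non-English resources and compendiums
- NLP Resources for Bahasa Indonesian [GitHub, 480 stars]
- Indic NLP Catalog [GitHub, 552 stars]
- Pre-trained language models for Vietnamese [GitHub, 653 stars]
- Natural Language Toolkit for Indic Languages (iNLTK) [GitHub, 814 stars]
- Indic NLP Library [GitHub, 550 stars]
- AI4Bharat-IndicNLP Portal
- ARBML - Implementation of many Arabic NLP and ML projects [GitHub, 387 stars]
- zemberek-nlp - NLP tools for Turkish [GitHub, 1146 stars]
- TDD AI - An open-source platform for all Turkish datasets, language models, and NLP tools.
- KLUE - Korean Language Understanding Evaluation [GitHub, 560 stars]
- Persian NLP Benchmark - A benchmark for evaluating and comparing various NLP tasks in Persian [GitHub, 73 stars]
- nlp-greek - Greek language sources [GitHub, 5 stars]
- Awesome NLP Resources for Hungarian [GitHub, 221 stars]
Pre-trained NLP models
- List of pre-trained NLP models [GitHub, 170 stars]
- Pretrained language models developed by Huawei Noah's Ark Lab [GitHub, 3019 stars]
- Spanish Language Models and resources [GitHub, 251 stars]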
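NLP History
General
- Modern Deep Learning Techniques Applied to Natural Language Processing [GitHub, 1328 stars]
- A Review of the Neural History of Natural Language Processing [Blog, October 2018]
2020 Year in Review
- Natural Language Processing in 2020: The Year In Review [Blog, December 2020]
- ML and NLP Research Highlights of 2020 [Blog, January 2021]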
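NLP Podcasts
NLP-only podcasts
- NLP Highlights [Years: 2017 - now, Status: active]
- NLP Zone Episodes [Years: 2021 - now, Status: active]
Many NLP episodes
- TWIML AI [Years: 2016 - now, Status: active]
- Practical AI [Years: 2018 - now, Status: active]
- The Data Exchange [Years: 2019 - now, Status: active]
- Gradient Dissent [Years: 2020 - now, Status: active]
- Machine Learning Street Talk [Years: 2020 - now, Status: active]
- DataFramed - latest trends and insights on how to scale the impact of data science in organizations [Years: 2019 - now, Status: active]
Some NLP episodes
- The Super Data Science Podcast [Years: 2016 - now, Status: active]
- Data Hack Radio [Years: 2018 - now, Status: active]
- The Analytics Show [Years: 2019 - now, Status: active]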
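NLP Newsletters
- NLP News by Sebastian Ruder
- This Week in NLP by Robert Dale
- Papers with Code
- The Batch by deeplearning.ai
- Paper Digest by PaperDigest
- NLP Cypher by QuantumStat
NLP Meetups
- NLP Zurich [YouTube Recordings]
- Hacking-Machine-Learning [YouTube Recordings]
- NY-NLP (New York)
NLP YouTube Channels
- Yannic Kilcher
- HuggingFace
- Kaggle Reading Group
- Rasa Paper Reading
- Stanford CS224N: NLP with Deep Learning
- NLPxing
- ML Explained - A.I. Socratic Circles - AISC
- Deeplearning.ai
- Machine Learning Street Talk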
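NLP Benchmarks
General NLU
- GLUE - General Language Understanding Evaluation (GLUE) benchmark
- SuperGLUE - benchmark styled after GLUE with a set of more difficult language understanding tasks
- decaNLP - The Natural Language Decathlon (decaNLP) for studying general NLP models
- dialoglue - DialoGLUE: A Natural Language Understanding Benchmark for Task-Oriented Dialogue [GitHub, 280 stars]
- DynaBench - Dynabench is a research platform for dynamic data collection and benchmarking
- Big-Bench - collaborative benchmark for measuring and extrapolating the capabilities of language models [GitHub, 2835 stars]
Summarization
- WikiAsp - WikiAsp: multi-document aspect-based summarization dataset
- WikiLingua - A Multilingual Abstractive Summarization Dataset
Question Answering
- SQuAD - Stanford Question Answering Dataset (SQuAD)
- XQuad - XQuAD (Cross-lingual Question Answering Dataset) for cross-lingual question answering
- GrailQA - Strongly Generalizable Question Answering (GrailQA)
- CSQA - Complex Sequential Question Answering
Multilingual and Non-English Benchmarks
- XTREME - Massively Multilingual Multi-task Benchmark
- GLUECoS - A benchmark for code-switched NLP
- IndicGLUE - Natural Language Understanding Benchmark for Indic Languages
- LinCE - Linguistic Code-Switching Evaluation Benchmark
- Russian SuperGlue - Russian SuperGlue Benchmark
Bio, Law, and other scientific domains
- BLURB - Biomedical Language Understanding and Reasoning Benchmark
- BLUE - Biomedical Language Understanding Evaluation benchmark
- LexGLUE - A Benchmark Dataset for Legal Language Understanding in English
Transformer Efficiency
- Long-Range Arena - Long Range Arena for Benchmarking Efficient Transformers (Pre-print) [GitHub, 716 stars]
Speech Processing
Other
- CodeXGLUE - A benchmark dataset for code intelligence
- CrossNER - CrossNER: Evaluating Cross-Domain Named Entity Recognition
- MultiNLI - Multi-Genre Natural Language Inference corpus
- iSarcasm: A Dataset of Intended Sarcasm - iSarcasm is a dataset of tweets, each labelled as either sarcastic or non_sarcastic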
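Research Resources
General
- A Recipe for Training Neural Networks by Andrej Karpathy [Keywords: research, training, 2019]
- Recent Advances in NLP via Large Pre-Trained Language Models: A Survey [Paper, November 2021]
Embeddings
Repositories
- Pre-trained ELMo Representations for Many Languages [GitHub, 1458 stars]
- sense2vec - Contextually-keyed word vectors [GitHub, 1617 stars]
- wikipedia2vec [GitHub, 935 stars]
- StarSpace [GitHub, 3938 stars]
- fastText [GitHub, 25871 stars]
Blogs
- Language Models and Contextualised Word Embeddings by David S. Batista [Blog, 2018]
- An Essential Guide to Pretrained Word Embeddings for NLP Practitioners by AnalyticsVidhya [Blog, 2020]
- Polyglot Word Embeddings Discover Language Clusters [Blog, 2020]
- The Illustrated Word2vec by Jay Alammar [Blog, 2019]
Cross-lingual Word and Sentence Embeddings
- vecmap - VecMap (cross-lingual word embedding mappings) [GitHub, 644 stars]
- sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]
Byte Pair Encoding
- bpemb - Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE) [GitHub, 1179 stars]
- subword-nmt - Unsupervised Word Segmentation for Neural Machine Translation and Text Generation [GitHub, 2185 stars]
- python-bpe - Byte Pair Encoding for Python [GitHub, 223 stars]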
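Transformer-based Architectures
General
- The Transformer Family by Lilian Weng [Blog, 2020]
- Playing the lottery with rewards and multiple languages - about the effect of random initialization [ICLR 2020 Paper]
- Attention? Attention! by Lilian Weng [Blog, 2018]
- The Transformer ... "Explained"? [Blog, 2019]
- Attention is all you need; Attentional Neural Network Models by Łukasz Kaiser [Talk, 2017]
- Attention Is Off By One [July 2023]
- Understanding and Applying Self-Attention [Talk, 2018]
- The NLP Cookbook: Modern Recipes for Transformer based Deep Learning Architectures [Paper, April 2021]
- Pre-Trained Models: Past, Present and Future [Paper, June 2021]
- A Survey of Transformers [Paper, June 2021]
The Transformer
- The Annotated Transformer by Harvard NLP [Blog, 2018]
- The Illustrated Transformer by Jay Alammar [Blog, 2018]
- Illustrated Guide to Transformers by Hong Jing [Blog, 2020]
- Sequential Transformer with Adaptive Attention Span by Facebook [Blog, 2019]
- Evolution of Representations in the Transformer by Lena Voita [Blog, 2019]
- Reformer: The Efficient Transformer [Blog, 2020]
- Longformer - The Long-Document Transformer by Viktor Karlsson [Blog, 2020]
- Transformers from Scratch [Blog, 2019]
- Transformers in Natural Language Processing - A Brief Survey by George Ho [Blog, May 2020]
- Lite Transformer - Lite Transformer with Long-Short Range Attention [GitHub, 596 stars]
- Transformers from Scratch [Blog, Oct 2021]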
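BERT
- A Visual Guide to Using BERT for the First Time by Jay Alammar [Blog, 2019]
- The Dark Secrets of BERT by Anna Rogers [Blog, 2020]
- Understanding searches better than ever before [Blog, 2019]
- Demystifying BERT: A Comprehensive Guide to the Groundbreaking NLP Framework [Blog, 2019]
- SemBERT - Semantics-aware BERT for Language Understanding [GitHub, 286 stars]
- BERTweet - BERTweet: A pre-trained language model for English Tweets [GitHub, 574 stars]
- Optimal Subarchitecture Extraction for BERT [GitHub, 470 stars]
- CharacterBERT: Reconciling ELMo and BERT [GitHub, 195 stars]
- When BERT Plays The Lottery, All Tickets Are Winning [Blog, Dec 2020]
- BERT-related Papers - a list of BERT-related papers [GitHub, 2032 stars]
Other Transformer Variants
T5
- T5 Understanding Transformer-Based Self-Supervised Architectures [Blog, August 2020]
- T5: the Text-To-Text Transfer Transformer [Blog, 2020]
- multilingual-t5 - Multilingual T5 (mT5) is a massively multilingual text-to-text transformer model [GitHub, 1245 stars]
BigBird
- BigBird: Transformers for Longer Sequences - original paper by Google Research [Paper, July 2020]
Reformer / Linformer / Longformer / Performer
- Reformer: The Efficient Transformer - [Paper, February 2020] [Video, October 2020]
- Longformer: The Long-Document Transformer - [Paper, April 2020] [Video, April 2020]
- Linformer: Self-Attention with Linear Complexity - [Paper, June 2020] [Video, June 2020]
- Rethinking Attention with Performers - [Paper, September 2020] [Video, September 2020]
- performer-pytorch - An implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]
Switch Transformer
- Switch Transformers: Scaling to Trillion Parameter Models - original paper by Google Research [Paper, January 2021]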
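GPT-family
General
- The Illustrated GPT-2 by Jay Alammar [Blog, 2019]
- The Annotated GPT-2 by Aman Arora
- OpenAI's GPT-2: the model, the hype, and the controversy by Ryan Lowe [Blog, 2019]
- How to generate text by Patrick von Platen [Blog, 2020]
GPT-3
Learning Resources
- Zero Shot Learning for Text Classification by Amit Chaudhary [Blog, 2020]
- GPT-3: A Brief Summary by Leo Gao [Blog, 2020]
- GPT-3, a Giant Step for Deep Learning and NLP by Yoel Zeldes [Blog, June 2020]
- GPT-3 Language Model: A Technical Overview by Chuan Li [Blog, June 2020]
- Is it possible for language models to achieve language understanding? by Christopher Potts
Applications
- Awesome GPT-3 - list of all resources related to GPT-3 [GitHub, 4589 stars]
- GPT-3 Projects - a map of all GPT-3 start-ups and commercial projects
- GPT-3 Demo Showcase - GPT-3 Demo Showcase, 180+ apps, examples, and resources
- OpenAI API - API demo to use OpenAI GPT for commercial applications
Open-source Efforts
- GPT-Neo - in-progress open-source replication of GPT-3, HuggingFace Hub
- GPT-J - A 6 billion parameter, autoregressive text generation model trained on The Pile
- Effectively using GPT-J with few-shot learning [Blog, July 2021]
Other
- What is Two-Stream Self-Attention in XLNet by Xu Liang [Blog, 2019]
- Visual Paper Summary: ALBERT (A Lite BERT) by Amit Chaudhary [Blog, 2020]
- Turing NLG by Microsoft
- Multi-Label Text Classification by Josh Xin Jie Lee [Blog, 2019]
- ELECTRA [GitHub, 2326 stars]
- Performer - implementation of Performer, a linear attention-based transformer, in PyTorch [GitHub, 1084 stars]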
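Distillation, Pruning and Quantization
Reading Material
- Distilling knowledge from Neural Networks to build smaller and faster models by FloydHub [Blog, 2019]
- Compression of Deep Learning Models for Text: A Survey [Paper, April 2021]
Tools
- Bert-squeeze - code to reduce the size of Transformer-based models or decrease their latency at inference time [GitHub, 79 stars]
- XtremeDistil - XtremeDistilTransformers for distilling massive multilingual neural networks [GitHub, 153 stars]
Automated Summarization
- PEGASUS: A State-of-the-Art Model for Abstractive Text Summarization by Google AI [Blog, June 2020]
- CTRLsum - CTRLsum: Towards Generic Controllable Text Summarization [GitHub, 146 stars]
- XL-Sum - XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages [GitHub, 252 stars]
- SummerTime - an open-source text summarization toolkit for non-experts [GitHub, 265 stars]
- PRIMER - PRIMER: Pyramid-based Masked Sentence Pre-training for Multi-document Summarization [GitHub, 151 stars]
- summarus - Models for automatic abstractive summarization [GitHub, 170 stars]
Knowledge Graphs and NLP
- Fusing Knowledge into Language Models [Presentation, Oct 2021]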
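Note Section keywords: best practices, MLOps
Best Practices for building NLP Projects
- In Search of Best Practices for NLP Projects [Slides, Dec. 2020]
- EMNLP 2020: High Performance Natural Language Processing by Google Research [Recording, Nov. 2020]
- Practical Natural Language Processing - A Comprehensive Guide to Building Real-World NLP Systems [Book, June 2020]
- How to Structure and Manage NLP Projects [Blog, May 2021]
- Applied NLP Thinking - Applied NLP Thinking: How to Translate Problems into Solutions [Blog, June 2021]
- Introduction to NLP for Industry Use - presentation by DataTalksClub on the introduction to NLP for industry use [Recording, December 2021]
- Measuring Embedding Drift - Best practices for monitoring drift of NLP models [Blog, December 2022]
MLOps for NLP
MLOps, especially when applied to NLP, is a set of best practices around automating the various parts of the workflow when building and deploying NLP pipelines.
In general, MLOps for NLP includes having the following processes in place: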
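- Data Versioning - make sure your training, annotation and other types of data are versioned and tracked
- Experiment Tracking - make sure that all of your experiments are automatically tracked and saved, so they can be easily replicated or retraced
- Model Registry - make sure any neural models you train are versioned and tracked, and that it is easy to roll back to any of them
- Automated Testing and Behavioral Testing - besides regular unit and integration tests, you want behavioral tests that check for bias or potential adversarial attacks
- Model Deployment and Serving - automate model deployment, ideally with zero-downtime strategies such as Blue/Green or Canary deploys
- Data and Model Observability - track data drift, model accuracy drift, etc.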
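As an illustration of the behavioral testing item in the list above, here is a minimal sketch of an invariance test: swapping a person's name in the input should not change the predicted label. Everything in it is an assumption made for illustration. `toy_sentiment` is a hypothetical keyword-counting stand-in for a real model, and `swap_names` and `invariance_failures` are hypothetical helpers, not the API of CheckList or any other library listed here.

```python
import re

# Hypothetical keyword lists for the toy model below (illustration only).
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "terrible", "awful", "hate"}


def toy_sentiment(text: str) -> str:
    """Hypothetical stand-in for a real NLP model: counts sentiment keywords."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def swap_names(text: str, original: str, replacements: list[str]) -> list[str]:
    """Create perturbed copies of `text` with one name replaced by others."""
    return [text.replace(original, name) for name in replacements]


def invariance_failures(model, text: str, perturbed: list[str]) -> list[str]:
    """Return the perturbed inputs whose prediction differs from the original's."""
    expected = model(text)
    return [p for p in perturbed if model(p) != expected]


sentence = "Anna said the service was great"
variants = swap_names(sentence, "Anna", ["Mehmet", "Chen", "Olu"])
failures = invariance_failures(toy_sentiment, sentence, variants)
print(f"{len(failures)} invariance failures out of {len(variants)} variants")
```

In a real pipeline the same check would wrap the deployed model's predict function and run in CI next to the unit tests, so that a model which starts reacting to names fails the build.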
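Additionally, there are two more components that are less prevalent in NLP and are mostly used in Computer Vision and other sub-fields of AI:
- Feature Store - centralized storage of all features developed for ML models, which can be easily reused by any other ML project
- Metadata Management - storage for all information related to the usage of ML models, mainly for reproducing the behavior of deployed ML models, artifact tracking, etc.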
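To make data versioning and metadata management more concrete, the sketch below fingerprints a dataset by hashing its content and stores that fingerprint next to the run's hyperparameters and metrics. It uses only the Python standard library. `fingerprint` and `RunMetadata` are hypothetical names, and the hyperparameter and metric values are made-up placeholders; tools such as DVC or ML Metadata solve the same problem far more thoroughly.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


def fingerprint(records: list[str]) -> str:
    """Content hash of a dataset: changes whenever any record changes."""
    digest = hashlib.sha256()
    for record in records:
        digest.update(record.encode("utf-8"))
        digest.update(b"\n")  # separator so ["ab", "c"] and ["a", "bc"] differ
    return digest.hexdigest()


@dataclass
class RunMetadata:
    """Hypothetical minimal record for reproducing a trained model."""
    model_name: str
    data_fingerprint: str
    hyperparameters: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


train_v1 = ["the movie was great", "the plot was dull"]
train_v2 = train_v1 + ["a solid sequel"]

run = RunMetadata(
    model_name="toy-sentiment",
    data_fingerprint=fingerprint(train_v1),
    hyperparameters={"learning_rate": 3e-5, "epochs": 3},  # placeholder values
    metrics={"accuracy": 0.91},  # placeholder value, not a real result
)

# Serialize the record so it can be stored alongside the model artifact.
print(json.dumps(asdict(run), indent=2, sort_keys=True))
# A different fingerprint signals that the training data has changed.
print("data changed:", fingerprint(train_v2) != run.data_fingerprint)
```

Comparing the stored fingerprint with the fingerprint of the data currently on disk is a cheap way to detect that a model is about to be retrained or evaluated on data other than what its metadata claims.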
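MLOps Compendiums and Awesome Lists
- Awesome MLOps [GitHub, 12526 stars]
- Best-of ML Python [GitHub, 16309 stars]
- MLOps.Toys - a curated list of MLOps projects
Reading Material
- Machine Learning Operations (MLOps): Overview, Definition, and Architecture [Paper, May 2022]
- Requirements and Reference Architecture for MLOps: Insights from Industry [Paper, Oct 2022]
- MLOps: What It Is, Why it Matters, and How To Implement it by Neptune AI [Blog, July 2021]
- Best MLOps Tools You Need to Know as a Data Scientist by Neptune AI [Blog, July 2021]
- State of MLOps 2021 by Valohai [Blog, August 2021]
- The MLOps Stack by Valohai [Blog, October 2020]
- Data Version Control for Machine Learning Applications by Megagon AI [Blog, July 2021]
- The Rapid Evolution of the Canonical Stack for Machine Learning [Blog, July 2021]
- MLOps: Comprehensive Beginner's Guide [Blog, March 2021]
- What I've Learned About MLOps From Speaking With 100+ ML Practitioners [Blog, May 2021]
- DataRobot Challenger Models - MLOps Champion/Challenger Models
- MLOps Blog by Dr. Ori Cohen
- MLOps Ecosystem Overview [Blog, 2021]
Learning Material
- MLOps course by Made With ML
- GitHub MLOps - collection of resources on how to facilitate Machine Learning Ops
- ML Observability Fundamentals Course - learn how to monitor and root-cause issues with production NLP models
MLOps Communities
- MLOps Community - blogs, Slack group, newsletter and more, all about MLOps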
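Data Versioning
- DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
Experiment Tracking
- mlflow - Open Source Platform for the Machine Learning Lifecycle [Free and Open Source] Link to GitHub
- Weights & Biases - tools for experiment tracking and dataset versioning [Paid Service]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- SigOpt - automate training and tuning, visualize and compare runs [Paid Service]
- Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
- Clear ML - experiment, orchestrate, deploy, and build data stores, all in one place [Free and Open Source] Link to GitHub
- Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
Model Registry
- DVC - Data Version Control (DVC) tracks ML models and data sets [Free and Open Source] Link to GitHub
- mlflow - Open Source Platform for the Machine Learning Lifecycle [Free and Open Source] Link to GitHub
- ModelDB - open-source system for Machine Learning model versioning, metadata, and experiment management [GitHub, 1696 stars]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
- Valohai - End-to-end ML pipelines [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- Polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]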
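Automated Testing and Behavioral Testing
- CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- WildNLP - corrupt an input text to test the robustness of NLP models [GitHub, 76 stars]
- Great Expectations - write tests for your data [GitHub, 9874 stars]
- Deepchecks - Python package for comprehensively validating your machine learning models and data [GitHub, 3582 stars]
Model Deployability and Serving
- mlflow - Open Source Platform for the Machine Learning Lifecycle [Free and Open Source] Link to GitHub
- Amazon SageMaker [Paid Service]
- Valohai - End-to-end ML pipelines [Paid Service]
- NLP Cloud - production-ready NLP API [Paid Service]
- Saturn Cloud [Paid Service]
- SELDON - machine learning deployment for enterprise [Paid Service]
- Comet ML - enables data scientists and teams to track, compare, explain and optimize experiments and models [Paid Service]
- Polyaxon - reproduce, automate, and scale your data science workflows with production-grade MLOps tools [Paid Service]
- TorchServe - flexible and easy to use tool for serving PyTorch models [GitHub, 4174 stars]
- Kubeflow - The Machine Learning Toolkit for Kubernetes [GitHub, 10600 stars]
- KFServing - Serverless Inferencing on Kubernetes [GitHub, 3504 stars]
- TFX - TensorFlow Extended - end-to-end platform for deploying production ML pipelines [Paid Service]
- Pachyderm - version control for data with the tools to build scalable end-to-end ML/AI pipelines [Paid Service with Free Tier]
- Cortex - containers as a service on AWS [Paid Service]
- Azure Machine Learning - end-to-end machine learning lifecycle [Paid Service]
- End2End Serverless Transformers On AWS Lambda [GitHub, 121 stars]
- NLP-Service - sample demo of NLP as a service platform built using FastAPI and Hugging Face [GitHub, 13 stars]
- Dagster - data orchestrator for machine learning [Free and Open Source]
- Verta - AI and machine learning deployment and operations [Paid Service]
- Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- flyte - workflow automation platform for complex, mission-critical data and ML processes [GitHub, 5525 stars]
- MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI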
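Model Debugging
- imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
- Cockpit - A Practical Debugging Tool for Training Deep Neural Networks [GitHub, 474 stars]
Model Accuracy Prediction
- WeightWatcher - WeightWatcher tool for predicting the accuracy of Deep Neural Networks [GitHub, 1453 stars]
Data and Model Observability
General
- Arize AI - embedding drift monitoring for NLP models
- Arize-Phoenix - ML observability for LLMs, vision, language, and tabular models
- whylogs - open source standard for data and ML logging [GitHub, 2636 stars]
- Rubrix - open-source tool for exploring and iterating on data for artificial intelligence projects [GitHub, 3843 stars]
- MLRun - Machine Learning automation and tracking [GitHub, 1425 stars]
- DataRobot MLOps - DataRobot MLOps provides a center of excellence for your production AI
- Cortex - containers as a service on AWS [Paid Service]
Model Centric
- Algorithmia - minimize risk with advanced reporting and enterprise-grade security and governance across all data, models, and infrastructure [Paid Service]
- Dataiku - Dataiku is for teams who want to deliver advanced analytics using the latest techniques at big data scale [Paid Service]
- Evidently AI - tools to analyze and monitor machine learning models [Free and Open Source] Link to GitHub
- Fiddler - ML Model Performance Management Tool [Paid Service]
- Hydrosphere - open-source platform for managing ML models [Paid Service]
- Verta - AI and machine learning deployment and operations [Paid Service]
- Domino Model Ops - deploy and manage models to drive business impact [Paid Service]
Data Centric
- Datafold - data quality through diffs, profiling, and anomaly detection [Paid Service]
- acceldata - improve reliability, accelerate scale, and reduce costs across all data pipelines [Paid Service]
- Bigeye - monitoring and alerting for your datasets in minutes [Paid Service]
- datakin - end-to-end, real-time data lineage solution [Paid Service]
- Monte Carlo - data integrity, drifts, schema, lineage [Paid Service]
- SODA - data monitoring, testing and validation [Paid Service]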
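Feature Stores
- Tecton - enterprise feature store for machine learning [Paid Service]
- FEAST - open source feature store for machine learning [GitHub, 5525 stars]
- Hopsworks Feature Store - data management system for managing machine learning features [Paid Service]
Metadata Management
- ML Metadata - a library for recording and retrieving metadata associated with ML developer and data scientist workflows [GitHub, 617 stars]
- Neptune AI - experiment tracking and model registry built for research and production teams [Paid Service]
MLOps Frameworks
- Metaflow - human-friendly Python/R library that helps scientists and engineers build and manage real-life data science projects [GitHub, 8093 stars]
- kedro - Python framework for creating reproducible, maintainable and modular data science code [GitHub, 9883 stars]
- Seldon Core - MLOps framework to package, deploy, monitor and manage thousands of production machine learning models [GitHub, 4353 stars]
- ZenML - MLOps framework to create reproducible ML pipelines for production machine learning [GitHub, 3972 stars]
- Google Vertex AI - build, deploy, and scale ML models faster, with pre-trained and custom tooling within a unified AI platform [Paid Service]
- Diffgram - complete training data platform for machine learning delivered as a single application [GitHub, 1834 stars]
- Continual.ai - build, deploy, and operationalize ML models with a declarative interface on cloud data warehouses like Snowflake, BigQuery, RedShift, and Databricks. [Paid Service]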
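Transformer-based Architectures
General
- Why BERT Fails in Commercial Environments by Intel AI [Blog, 2020]
- … by Sebastian Guggisberg
- Pre-trained Transformer models in PyTorch using Hugging Face Transformers [GitHub, 254 stars]
- Practical NLP for the Real World [Presentation, 2019]
- From Paper to Product - How we implemented BERT by Christoph Henkelmann [Talk, 2020]
Multi-GPU Transformers
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]
Training Transformers Effectively
- Training BERT with Compute/Time (Academic) Budget [GitHub, 309 stars]
Embeddings as a Service
- embedding-as-service [GitHub, 204 stars]
- Bert-as-service [GitHub, 12399 stars]
NLP Recipes Industrial Applications:
- NLP Recipes by Microsoft [GitHub, 6367 stars]
- NLP with Python by susanli2016 [GitHub, 2721 stars]
- Basic Utilities for PyTorch NLP by PetrochukM [GitHub, 2210 stars]
NLP Applications in Bio, Finance, Legal and other industries
- Blackstone - A spaCy pipeline and model for NLP on unstructured legal text [GitHub, 636 stars]
- Sci spaCy - spaCy pipeline and models for scientific/biomedical documents [GitHub, 1688 stars]
- FinBERT: Pre-Trained on SEC Filings for Financial NLP Tasks [GitHub, 197 stars]
- LexNLP - Information retrieval and extraction for real, unstructured legal text [GitHub, 692 stars]
- NerDL and NerCRF - Tutorial on Named Entity Recognition for Healthcare with SparkNLP
- Legal Text Analytics - A list of selected resources dedicated to Legal Text Analytics [GitHub, 613 stars]
- BioIE - A list of curated resources relevant to doing Biomedical Information Extraction [GitHub, 338 stars]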
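Note Section keywords: speech recognition
General Speech Recognition
- wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
- DeepSpeech - Baidu's DeepSpeech architecture [GitHub, 25166 stars]
- Acoustic Word Embeddings by Maria Obedkova [Blog, 2020]
- kaldi - Kaldi is a toolkit for speech recognition [GitHub, 14177 stars]
- awesome-kaldi - resources for using Kaldi [GitHub, 532 stars]
- ESPnet - End-to-End Speech Processing Toolkit [GitHub, 8355 stars]
- HuBERT - Self-supervised representation learning for speech recognition, generation, and compression [Blog, June 2021]
Text to Speech / Speech Generation
- FastSpeech - An implementation of FastSpeech based on PyTorch [GitHub, 857 stars]
- TTS - a deep learning toolkit for Text-to-Speech [GitHub, 34356 stars]
- NotebookLM - Google Gemini powered personal assistant / podcast generator
Speech to Text
- whisper - Robust Speech Recognition via Large-Scale Weak Supervision, by OpenAI [GitHub, 68884 stars]
- vibe - GUI tool for transcribing with Whisper, multilingual and CUDA support included [GitHub, 931 stars]
Datasets
- VoxPopuli - large-scale multilingual speech corpus for representation learning [GitHub, 507 stars]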
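Note Section keywords: topic modeling
Blogs
- Topic Modelling with PySpark and Spark NLP by Maria Obedkova [Spark, Blog, 2020]
- A Unique Approach to Short Text Clustering (Algorithmic Theory) by Brittany Bowers [Blog, 2020]
Frameworks for Topic Modeling
- gensim - framework for topic modeling [GitHub, 15597 stars]
- Spark NLP [GitHub, 3826 stars]
Repositories
- Top2Vec [GitHub, 2924 stars]
- Anchored Correlation Explanation Topic Modeling [GitHub, 303 stars]
- Topic Modeling in Embedding Spaces [GitHub, 540 stars] Paper
- TopicNet - A high-level interface for the BigARTM library [GitHub, 140 stars]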
- BERTopic - Leveraging BERT and a class-based TF-IDF to create easily interpretable topics [GitHub, 6038 stars]
- OCTIS - A python package to optimize and evaluate topic models [GitHub, 718 stars]
- Contextualized Topic Models [GitHub, 1196 stars]
- GSDMM - GSDMM: Short text clustering [GitHub, 353 stars]
Note Section keywords: keyword extraction
Text Rank
- PyTextRank - PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension [GitHub, 2132 stars]
- textrank - TextRank implementation for Python 3 [GitHub, 1248 stars]
RAKE - Rapid Automatic Keyword Extraction
- rake-nltk - Rapid Automatic Keyword Extraction algorithm using NLTK [GitHub, 1061 stars]
- yake - Single-document unsupervised keyword extraction [GitHub, 1632 stars]
- RAKE-tutorial - A python implementation of the Rapid Automatic Keyword Extraction [GitHub, 375 stars]
Other Approaches
- flashtext - Extract Keywords from sentence or Replace keywords in sentences [GitHub, 5583 stars]
- BERT-Keyword-Extractor - Deep Keyphrase Extraction using BERT [GitHub, 254 stars]
- keyBERT - Minimal keyword extraction with BERT [GitHub, 3471 stars]
- KeyphraseVectorizers - vectorizers that extract keyphrases with part-of-speech patterns [GitHub, 251 stars]
Further Reading
- Adding a custom tokenizer to spaCy and extracting keywords from Chinese texts by Haowen Jiang [Blog, Feb 2021]
- How to Extract Relevant Keywords with KeyBERT [Blog, June 2021]
Note Section keywords: ethics, responsible NLP
NLP and ML Interpretability
NLP-centric
- Explainability for Natural Language Processing - KDD'2021 Tutorial Slides [Presentation, August 2021]
- ecco - Tools to visualize and explore NLP language models [GitHub, 1974 stars]
- NLP Profiler - A simple NLP library allows profiling datasets with text columns [GitHub, 243 stars]
- transformers-interpret - Model explainability that works seamlessly with transformers [GitHub, 1278 stars]
- Awesome-explainable-AI - collection of research materials on explainable AI/ML [GitHub, 1400 stars]
- LAMA - LAMA is a probe for analyzing the factual and commonsense knowledge contained in pretrained language models [GitHub, 1346 stars]
General
- Language Interpretability Tool (LIT) [GitHub, 3474 stars]
- WhatLies - Toolkit to help visualise - what lies in word embeddings [GitHub, 468 stars]
- Interpret-Text - Interpretability techniques and visualization dashboards for NLP models [GitHub, 413 stars]
- InterpretML - Fit interpretable models. Explain blackbox machine learning [GitHub, 6238 stars]
- thermostat - Collection of NLP model explanations and accompanying analysis tools [GitHub, 143 stars]
- Dodrio - Exploring attention weights in transformer-based models with linguistic knowledge [GitHub, 342 stars]
- imodels - package for concise, transparent, and accurate predictive modeling [GitHub, 1375 stars]
Ethics, Bias, and Equality in NLP
- Bias in Natural Language Processing @EMNLP 2020 [Blog, Nov 2020]
- Machine Learning as a Software Engineering Enterprise - NeurIPS 2020 Keynote [Presentation, Dec 2020]
- Ethics in NLP - resources from ACL's Ethics in NLP track
- The Institute for Ethical AI & Machine Learning
- Understanding the Capabilities, Limitations, and Societal Impact of Large Language Models [Paper, Feb 2021]
- Fairness-in-AI - this package is used to detect and mitigate biases in NLP tasks [GitHub, 77 stars]
- nlg-bias - dataset + classifier tools to study social perception biases in natural language generation [GitHub, 65 stars]
- bias-in-nlp - list of papers related to bias in NLP [GitHub, 9 stars]
Adversarial Attacks for NLP
- Privacy Considerations in Large Language Models [Blog, Dec 2020]
- DeepWordBug - Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers [GitHub, 73 stars]
- Adversarial-Misspellings - Combating Adversarial Misspellings with Robust Word Recognition [GitHub, 62 stars]
Hate Speech Analysis
- HateXplain - BERT for detecting abusive language [GitHub, 187 stars]
Note Section keywords: frameworks
General Purpose
- spaCy by Explosion AI [GitHub, 29784 stars]
- flair by Zalando [GitHub, 13855 stars]
- AllenNLP by AI2 [GitHub, 11740 stars]
- stanza (former Stanford NLP) [GitHub, 7253 stars]
- spaCy stanza [GitHub, 723 stars]
- nltk [GitHub, 13489 stars]
- gensim - framework for topic modeling [GitHub, 15597 stars]
- pororo - Platform of neural models for natural language processing [GitHub, 1279 stars]
- NLP Architect - A Deep Learning NLP/NLU library by Intel® AI Lab [GitHub, 2936 stars]
- FARM [GitHub, 1734 stars]
- gobbli by RTI International [GitHub, 275 stars]
- headliner - training and deployment of seq2seq models [GitHub, 229 stars]
- SyferText - A privacy preserving NLP framework [GitHub, 197 stars]
- DeText - Text Understanding Framework for Ranking and Classification Tasks [GitHub, 1263 stars]
- TextHero - Text preprocessing, representation and visualization [GitHub, 2882 stars]
- textblob - TextBlob: Simplified Text Processing [GitHub, 9109 stars]
- AdaptNLP - A high level framework and library for NLP [GitHub, 407 stars]
- textacy - NLP, before and after spaCy [GitHub, 2209 stars]
- texar - Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow [GitHub, 2388 stars]
- jiant - jiant is an NLP toolkit [GitHub, 1639 stars]
Data Augmentation
- WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- snorkel - Framework to generate training data [GitHub, 5791 stars]
- NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
- SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
- faker - Python package that generates fake data for you [GitHub, 17648 stars]
- textflint - Unified Multilingual Robustness Evaluation Toolkit for NLP [GitHub, 639 stars]
- Parrot - Practical and feature-rich paraphrasing framework [GitHub, 871 stars]
- AugLy - data augmentations library for audio, image, text, and video [GitHub, 4950 stars]
- TextAugment - Python 3 library for augmenting text for natural language processing applications [GitHub, 396 stars]
Adversarial NLP Attacks & Behavioral Testing
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- CleverHans - adversarial example library for constructing NLP attacks and building defenses [GitHub, 6172 stars]
- CheckList - Beyond Accuracy: Behavioral Testing of NLP models [GitHub, 2003 stars]
Transformer-oriented
- transformers by HuggingFace [GitHub, 132974 stars]
- Adapter Hub and its documentation - Adapter modules for Transformers [GitHub, 2543 stars]
- haystack - Transformers at scale for question answering & neural search. [GitHub, 16997 stars]
Dialogue Systems and Speech
- DeepPavlov by MIPT [GitHub, 6676 stars]
- ParlAI by FAIR [GitHub, 10477 stars]
- rasa - Framework for Conversational Agents [GitHub, 18726 stars]
- wav2letter - Automatic Speech Recognition Toolkit [GitHub, 6370 stars]
- ChatterBot - conversational dialog engine for creating chatbots [GitHub, 14039 stars]
- SpeechBrain - open-source and all-in-one speech toolkit based on PyTorch [GitHub, 8674 stars]
- dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
Word/Sentence-embeddings oriented
- MUSE - A library for Multilingual Unsupervised or Supervised word Embeddings [GitHub, 3181 stars]
- vecmap - A framework to learn cross-lingual word embedding mappings [GitHub, 644 stars]
- sentence-transformers - Multilingual Sentence & Image Embeddings with BERT [GitHub, 14981 stars]
Social Media Oriented
- Ekphrasis - text processing tool, geared towards text from social networks [GitHub, 661 stars]
Phonetics
- DeepPhonemizer - grapheme to phoneme conversion with deep learning [GitHub, 352 stars]
Morphology
- LemmInflect - python module for English lemmatization and inflection [GitHub, 259 stars]
- Inflect - generate plurals, ordinals, indefinite articles [GitHub, 964 stars]
- simplemma - simple multilingual lemmatizer for Python [GitHub, 964 stars]
Multi-lingual tools
- polyglot - Multi-lingual NLP Framework [GitHub, 2309 stars]
- trankit - Light-Weight Transformer-based Python Toolkit for Multilingual NLP [GitHub, 730 stars]
Distributed NLP / Multi-GPU NLP
- Spark NLP [GitHub, 3826 stars]
- Parallelformers: An Efficient Model Parallelization Toolkit for Deployment [GitHub, 776 stars]
Machine Translation
- COMET - A Neural Framework for MT Evaluation [GitHub, 493 stars]
- marian-nmt - Fast Neural Machine Translation in C++ [GitHub, 1236 stars]
- argos-translate - Open source neural machine translation in Python [GitHub, 3771 stars]
- Opus-MT - Open neural machine translation models and web services [GitHub, 605 stars]
- dl-translate - A deep learning-based translation library built on Huggingface transformers [GitHub, 440 stars]
- CTranslate2 - CTranslate2 end-to-end machine translation [GitHub, 3300 stars]
Entity and String Matching
- PolyFuzz - Fuzzy string matching, grouping, and evaluation [GitHub, 736 stars]
- pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
- fuzzywuzzy - Fuzzy String Matching in Python [GitHub, 9220 stars]
- jellyfish - approximate and phonetic matching of strings [GitHub, 2049 stars]
- textdistance - Compute distance between sequences [GitHub, 3367 stars]
- DeepMatcher - Python package for entity and text matching using deep learning [GitHub, 555 stars]
- RE2 - Simple and Effective Text Matching with Richer Alignment Features [GitHub, 339 stars]
- Machamp - Machamp: A Generalized Entity Matching Benchmark [GitHub, 17 stars]
Discourse Analysis
- ConvoKit - Cornell Conversational Analysis Toolkit [GitHub, 543 stars]
PII scrubbing
- scrubadub - Clean personally identifiable information from dirty dirty text [GitHub, 394 stars]
Hashtag Segmentation
- hashformers - automatically inserting the missing spaces between the words in a hashtag [GitHub, 68 stars]
Books Analysis / Literary Analysis / Semantic Search
- booknlp - a natural language processing pipeline that scales to books and other long documents (in English) [GitHub, 785 stars]
- bookworm - ingests novels, builds an implicit character network and a deeply analysable graph [GitHub, 76 stars]
- SemanticFinder - frontend-only live semantic search with transformers.js [GitHub, 224 stars]
Non-English oriented
Japanese
- fugashi - Cython MeCab wrapper for fast, pythonic Japanese tokenization and morphological analysis [GitHub, 391 stars]
- SudachiPy - SudachiPy is a Python version of Sudachi, a Japanese morphological analyzer [GitHub, 390 stars]
- Konoha - easy-to-use Japanese Text Processing tool, which makes it possible to switch tokenizers with small changes of code [GitHub, 226 stars]
- jProcessing - Japanese Natural Language Processing Libraries [GitHub, 148 stars]
- Ginza - Japanese NLP Library using spaCy as framework based on Universal Dependencies [GitHub, 745 stars]
- kuromoji - self-contained and very easy to use Japanese morphological analyzer designed for search [GitHub, 953 stars]
- nagisa - Japanese tokenizer based on recurrent neural networks [GitHub, 382 stars]
- KyTea - Kyoto Text Analysis Toolkit for word segmentation and pronunciation estimation [GitHub, 201 stars]
- Jigg - Pipeline framework for easy natural language processing [GitHub, 74 stars]
- Juman++ - Juman++ (a Morphological Analyzer Toolkit) [GitHub, 376 stars]
- RakutenMA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript [GitHub, 473 stars]
- toiro - a comparison tool of Japanese tokenizers [GitHub, 118 stars]
Thai
- AttaCut - Fast and Reasonably Accurate Word Tokenizer for Thai [GitHub, 79 stars]
- ThaiLMCut - Word Tokenizer for Thai Language [GitHub, 15 stars]
Chinese
- Spacy-pkuseg - The pkuseg toolkit for multi-domain Chinese word segmentation [GitHub, 53 stars]
Ukrainian
- recruitment-dataset - Recruitment Dataset Preprocessing and Recommender System (Ukrainian, English)
Other
- textblob-de - TextBlob: Simplified Text Processing for German [GitHub, 103 stars]
- Kashgari - Transfer Learning with focus on Chinese [GitHub, 2389 stars]
- Underthesea - Vietnamese NLP Toolkit [GitHub, 1383 stars]
- PTT5 - Pretraining and validating the T5 model on Brazilian Portuguese data [GitHub, 84 stars]
Text Data Labelling & Classification
- Small-Text - Active Learning for Text Classification in Python [GitHub, 549 stars]
- Doccano - open source annotation tool for machine learning practitioners [GitHub, 9460 stars]
- Adala - Autonomous DAta (Labeling) Agent framework [GitHub, 927 stars]
- EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
- Prodigy - annotation tool powered by active learning [Paid Service]
Note Section keywords: learn NLP
General
- Learn NLP the practical way [Blog, Nov. 2019]
- Learn NLP the Stanford way (+Part 2) [Blog, Nov 2020]
- Choosing the right course for a Practical NLP Engineer
- 12 Best Natural Language Processing Courses & Tutorials to Learn Online
- Treasure of Transformers - Natural Language processing papers, videos, blogs, official repos along with colab Notebooks [GitHub, 912 stars]
- Rasa Algorithm Whiteboard - YouTube series by Rasa explaining various Data Science and NLP Algorithms
- ExplosionAI Videos - YouTube series by ExplosionAI teaching you how to use spaCy and apply it for NLP
Courses
- CS25: Transformers United Stanford - Fall 2021 [Course, Fall 2021]
- NLP Course | For You - Great and interactive course on NLP
- Advanced NLP with spaCy - how to use spaCy to build advanced natural language understanding systems
- Transformer models for NLP by HuggingFace
- Stanford NLP Seminar - slides from the Stanford NLP course
Books
- Natural Language Processing with Transformers - [Book, February 2022]
- Applied Natural Language Processing in the Enterprise - [Book, May 2021]
- Practical Natural Language Processing - [Book, June 2020]
- Dive into Deep Learning - An interactive deep learning book with code, math, and discussions
- Natural Language Processing and Computational Linguistics - Speech, Morphology and Syntax (Cognitive Science)
- Top NLP Books to Read 2020 - Blog post by Raymong Cheng [Blog, Sep 2020]
Tutorials
- nlp-tutorial - A list of NLP(Natural Language Processing) tutorials built on PyTorch [GitHub, 1366 stars]
- nlp-tutorial - Natural Language Processing Tutorial for Deep Learning Researchers [GitHub, 14110 stars]
- Hands-On NLTK Tutorial [GitHub, 540 stars]
- Modern Practical Natural Language Processing [GitHub, 266 stars]
- Transformers-Tutorials - demos with the Transformers library by HuggingFace [GitHub, 9176 stars]
- CalmCode Tutorials - Set of Python Data Science Tutorials
NLP Communities
- r/LanguageTechnology - NLP Reddit forum
Other NLP Topics
Tokenization
- tokenizers - Fast State-of-the-Art Tokenizers optimized for Research and Production [GitHub, 8940 stars]
- SentencePiece - Unsupervised text tokenizer for Neural Network-based text generation [GitHub, 10141 stars]
- SoMaJo - A tokenizer and sentence splitter for German and English web and social media texts [GitHub, 135 stars]
Data Augmentation and Weak Supervision
Libraries and Frameworks
- WildNLP - Text manipulation library to test NLP models [GitHub, 76 stars]
- NLPAug - Data augmentation for NLP [GitHub, 4419 stars]
- SentAugment - Data augmentation by retrieving similar sentences from larger datasets [GitHub, 363 stars]
- TextAttack - framework for adversarial attacks, data augmentation, and model training in NLP [GitHub, 2922 stars]
- skweak - software toolkit for weak supervision applied to NLP tasks [GitHub, 917 stars]
- NL-Augmenter - Collaborative Repository of Natural Language Transformations [GitHub, 773 stars]
- EDA - Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks [GitHub, 1585 stars]
- snorkel - Framework to generate training data [GitHub, 5791 stars]
- dialoguefactory - Generate continuous dialogue data in a simulated textual world [GitHub, 5 stars]
Reading Material and Tutorials
- A Survey of Data Augmentation Approaches for NLP [Paper, May 2021] GitHub Link
- A Visual Survey of Data Augmentation in NLP [Blog, 2020]
- Weak Supervision: A New Programming Paradigm for Machine Learning [Blog, March 2019]
Named Entity Recognition (NER)
- Datasets for Entity Recognition [GitHub, 1497 stars]
- Datasets to train supervised classifiers for Named-Entity Recognition [GitHub, 338 stars]
- Bootleg - Self-Supervision for Named Entity Disambiguation at the Tail [GitHub, 212 stars]
- Few-NERD - Large-scale, fine-grained manually annotated named entity recognition dataset [GitHub, 385 stars]
Relation Extraction
- tacred-relation - TACRED: position-aware attention model for relation extraction [GitHub, 355 stars]
- tacrev - TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task [GitHub, 69 stars]
- tac-self-attention - Relation extraction with position-aware self-attention [GitHub, 64 stars]
- Re-TACRED - Re-TACRED: Addressing Shortcomings of the TACRED Dataset [GitHub, 51 stars]
Coreference Resolution
- NeuralCoref 4.0: Coreference Resolution in spaCy with Neural Networks by HuggingFace [GitHub, 2850 stars]
- coref - BERT and SpanBERT for Coreference Resolution [GitHub, 443 stars]
Sentiment Analysis
- Reading list for Awesome Sentiment Analysis papers by declare-lab [GitHub, 517 stars]
- Awesome Sentiment Analysis by xiamx [GitHub, 913 stars]
Domain Adaptation
- Neural Adaptation in Natural Language Processing - curated list [GitHub, 261 stars]
Low Resource NLP
- CMU LTI Low Resource NLP Bootcamp 2020 - CMU Language Technologies Institute low resource NLP bootcamp 2020 [GitHub, 597 stars]
Spell Correction / Error Correction
- Gramformer - framework for detecting, highlighting and correcting grammatical errors [GitHub, 1502 stars]
- NeuSpell - A Neural Spelling Correction Toolkit [GitHub, 665 stars]
- SymSpellPy - Python port of SymSpell [GitHub, 796 stars]
- ? Speller100 by Microsoft [Blog, Feb 2021]
- JamSpell - spell checking library - accurate, fast, multi-language [GitHub, 608 stars]
- pycorrector - spell correction for Chinese [GitHub, 5517 stars]
- contractions - expands contractions, for example "you're" to "you are" [GitHub, 308 stars]
- ? Fine Tuning T5 for Grammar Correction by Sachin Abeywardana [Blog, Nov 2022]
Style Transfer for NLP
- Styleformer - Neural Language Style Transfer framework [GitHub, 475 stars]
- StylePTB - A Compositional Benchmark for Fine-grained Controllable Text Style Transfer [GitHub, 60 stars]
Automata Theory for NLP
- pyahocorasick - Python module implementing Aho-Corasick algorithm for string matching [GitHub, 937 stars]
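For readers unfamiliar with the Aho-Corasick algorithm mentioned above, the following is a minimal pure-Python sketch of the idea: build a trie of the patterns, add failure links with a breadth-first pass, then scan the text once. It is an illustration only and is not pyahocorasick's API; the function names `build_automaton` and `find_all` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links, and output lists (illustrative helper)."""
    goto = [{}]   # goto[node][char] -> child node index
    fail = [0]    # fail[node] -> node for the longest proper suffix that is in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pat in patterns:
        if not pat:
            continue  # skip empty patterns
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)

    # Breadth-first pass: depth-1 nodes keep fail = 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) pairs for every occurrence, in scan order."""
    goto, fail, out = build_automaton(patterns)
    matches = []
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            matches.append((i - len(pat) + 1, pat))
    return matches


if __name__ == "__main__":
    print(find_all("ushers", ["he", "she", "his", "hers"]))
```

The text is scanned once regardless of how many patterns there are, which is why this approach suits tasks such as dictionary or keyword matching over large corpora.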
Obscene words detection
- LDNOOBW - List of Dirty, Naughty, Obscene, and Otherwise Bad Words [GitHub, 2899 stars]
Reddit Analysis
- Subreddit Analyzer - comprehensive Data and Text Mining workflow for submissions and comments from any given public subreddit [GitHub, 489 stars]
Skill Detection
- SkillNER - rule based NLP module to extract job skills from text [GitHub, 153 stars]
Reinforcement Learning for NLP
- nlp-gym - NLPGym - A toolkit to develop RL agents to solve NLP tasks [GitHub, 192 stars]
AutoML / AutoNLP
- AutoNLP - Faster and easier training and deployments of SOTA NLP models [GitHub, 3836 stars]
- TPOT - Python Automated Machine Learning tool [GitHub, 9691 stars]
- Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch [GitHub, 2359 stars]
- HungaBunga - Brute-Force all sklearn models with all parameters using .fit .predict [GitHub, 710 stars]
- ? AutoML Natural Language - Google's paid AutoML NLP service
- Optuna - hyperparameter optimization framework [GitHub, 10650 stars]
- FLAML - fast and lightweight AutoML library [GitHub, 3871 stars]
- Gradsflow - open-source AutoML & PyTorch Model Training Library [GitHub, 306 stars]
OCR - Optical Character Recognition
- ?️ A framework for designing document processing solutions [Blog, June 2022]
Document AI
- ? Table Transformer + HuggingFace Models
Text Generation
- keytotext - a model which will take keywords as inputs and generate sentences as outputs [GitHub, 445 stars]
- ? Controllable Neural Text Generation [Blog, Jan 2021]
- BARTScore Evaluating Generated Text as Text Generation [GitHub, 317 stars]
Title / Headlines Generation
- TitleStylist Learning to Generate Headlines with Controlled Styles [GitHub, 76 stars]
NLP research reproducibility
- ? A Systematic Review of Reproducibility Research in Natural Language Processing [Paper, March 2021]
License CC0
Attributions
Resources
- All linked resources belong to original authors
Icons
- Akropolis by parkjisun from the Noun Project
- Book of Ester by Gilad Sotil from the Noun Project
- quill by Juan Pablo Bravo from the Noun Project
- acting by Flatart from the Noun Project
- olympic by supalerk laipawat from the Noun Project
- aristocracy by Eucalyp from the Noun Project
- Horn by Eucalyp from the Noun Project
- temple by Eucalyp from the Noun Project
- constellation by Eucalyp from the Noun Project
- ancient greek round pattern by Olena Panasovska from the Noun Project
- Harp by Vectors Point from the Noun Project
- Atlas by parkjisun from the Noun Project
- Parthenon by Eucalyp from the Noun Project
- papyrus by IconMark from the Noun Project
- papyrus by Smalllike from the Noun Project
- pegasus by Saeful Muslim from the Noun Project
Fonts
The Pandect Series also includes