text_mining_resources

Uncle Steve's List of Text Analytics and NLP Resources

 ____ ____ ____ ____ _________ ____ ____ ____ ____ ____ ____ 
||t |||e |||x |||t |||       |||m |||i |||n |||i |||n |||g ||
||__|||__|||__|||__|||_______|||__|||__|||__|||__|||__|||__||
|/__\|/__\|/__\|/__\|/_______\|/__\|/__\|/__\|/__\|/__\|/__\|

A curated list of resources for learning about natural language processing, text analytics, and unstructured data.

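Books
- R
- Python
- General
Blogs
Blog Articles, Papers, Case Studies
- General
- Biases in NLP
- Scraping
- Cleaning
- Stemming
- Dimensionality Reduction
- Sarcasm Detection
- Document Classification
- Entity and Information Extraction
- Document Clustering and Document Similarity
- Concept Analysis/Topic Modeling
- Sentiment Analysis
- Text Summarization
- Machine Translation
- Q&A Systems, Chatbots
- Fuzzy Matching, Probabilistic Matching, Record Linkage, Etc.
- Word and Document Embeddings
- Transformers and Language Models
- Deep Learning
- Knowledge Graphs
Major NLP Conferences
Benchmarks
Online courses
APIs and Libraries
Products
Online Demos and Tools
Datasets
Misc
Other Curated Lists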

Books

R

Text Mining with R
Mastering Text Mining with R
Text Mining in Practice with R

Python

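Natural Language Processing with Transformers, Revised Edition
Getting Started with Natural Language Processing
Blueprints for Text Analytics Using Python: Machine Learning-Based Solutions for Common Real World (NLP) Applications
Practical Natural Language Processing
Natural Language Processing with Python
Natural Language Processing with PyTorch
Python Natural Language Processing
Mastering Natural Language Processing with Python
Natural Language Processing: Python and NLTK
Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning
Applied Natural Language Processing with Python. 2018.
Deep Learning with Text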

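General

Taming Text: How to Find, Organize, and Manipulate It. A hands-on guide to innovative tools and techniques for finding, organizing, and manipulating unstructured text.
Speech and Language Processing
Foundations of Statistical Natural Language Processing
Language Processing with Perl and Prolog: Theories, Implementation, and Application (Cognitive Technologies)
An Introduction to Information Retrieval
Handbook of Natural Language Processing
Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications
Fundamentals of Predictive Text Mining
Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More
Neural Network Methods for Natural Language Processing
Text Mining: A Guidebook for the Social Sciences
Practical Text Analytics: Interpreting Text and Unstructured Data for Business Intelligence
Neural Network Methods in Natural Language Processing
Machine Learning for Text (2018)
Natural Language Processing in Spanish
Foundations of Computational Linguistics: Human-Computer Communication in Natural Language. Provides insights into how to build talking robots.
Statistical Methods for Speech Recognition. Highlights important research and statistical methods for speech recognition.
How to Label Data: an extended guide to managing large text annotation projects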

Blogs

Probably Approximately a Scientific Blog
Sebastian Ruder
NLP-Progress
Natural Language Processing Blog

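Blog Articles, Papers, Case Studies

General

NLP in Health Care. How health care payers and providers are using NLP.
AI, Harvard Business Review. The impact of NLP on improving interactions with machines.
Why Natural Language Processing Accuracy Is Critical to the Future of Retail AI
Natural Language Processing Is Fun! How computers understand human language. 2018.
WEF Live Event - Twitter-fed global news and sentiment tracker - live January 2019
Modern Deep Learning Techniques Applied to Natural Language Processing
The Definitive Guide to Natural Language Processing. MonkeyLearn. A non-technical overview.
From Natural Language to Calendar Entries, with Clojure. March 2015. NLP, Clojure
Ask HN: How to get into NLP (Natural Language Processing)?
Ask HN: What are the best tools for analyzing large bodies of text?
Quora: How do I learn Natural Language Processing? A good introduction for beginners, with a breakdown of time estimates and links to Stanford CS courses.
Quora Topic: Natural Language Processing
The Definitive Guide to Natural Language Processing. October 2015.
Futures of Text. February 2015. A survey of all the current innovation in text as a medium.
R or Python on Text Mining. August 2015.
Where to Start in Text Mining. August 8, 2012.
Text Mining in R and Python: 8 Tips to Get Started. October 2016
An Introduction to Text Analysis with Python, April 1, 2012. A beginner's walkthrough of the basic ideas of sentiment analysis in Python.
Mining Twitter Data with Python (Part 1: Collecting Data)
Why Text Mining May Be the Next Big Thing. March 2012.
SAS CEO Offers Analysis of BI, Reveals Text Analytics Use Cases. June 2011.
Value and Benefits of Text Mining. September 2015.
Text Mining South Park. February 2016. A text mining blog covering a variety of topics.
Natural Language Processing: An Introduction
Natural Language Processing Tutorial. June 2013.
Natural Language Processing Blog.
An Introduction to Text Mining Using Twitter Streaming API and Python
- GitHub repo with code: https://github.com/adilmoujahid/twitter_analytics
How to Get into Natural Language Processing. A basic, non-technical introduction to NLP.
Betty: a friendly English-like interface for your command line.
Creating Machine Learning Models to Analyze Startup News - Part 1. Part 2. Part 3.
Comparison of the Most Useful Text Processing APIs
100 Must-Read NLP Papers
A Guide to Processing Text Data in Python
Crowdsourcing Ground Truth for Medical Relation Extraction
Natural Language Based Financial Forecasting: A Survey. An article that clarifies the scope of natural language based financial forecasting.
5 Heroic Tools for Natural Language Processing
Natural Language Processing Unlocks Hidden Data to Transform Healthcare Efficiency, Quality, and Cost
Extracting Medical Problems from Electronic Clinical Documents
Natural Language Processing (NLP) for Machine Learning. Covers basic, easy-to-understand preprocessing and compares several ML classification models in Python.
How to Write a Spelling Corrector - Peter Norvig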
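Using AI to Unleash the Power of Unstructured Government Data: (W. Eggers, N. Malik, and M. Gracie, January 2019). "Think of unstructured text as being 'trapped' in physical and virtual file cabinets. The promise is clear: Governments could improve effectiveness and prevent many catastrophes by improving their ability to 'connect the dots' and identify patterns in available data." This Deloitte article provides an easy-to-digest primer and background on NLP, and on the various applications NLP can have for unstructured government text data. It includes many US government examples of how NLP is currently deployed in different areas (e.g., helping analyze public feedback through sentiment analysis and topic modeling, improving forensic investigations, and aiding government policy-making and regulatory compliance). The key is applying different NLP techniques to explore and uncover key government intelligence insights.
Extracting Features of Entertainment Products: A Guided Latent Dirichlet Allocation Approach Informed by the Psychology of Media Consumption: (O. Toubia, G. Iyengar, R. Bunnell, and A. Lemaire, February 2019). "We rely on the NLP literature to develop a method for tagging entertainment products in an automated and scalable way. In the context of movies, we first show that the proposed features improve our ability to predict consumption at the individual level... We also show that the LDA features have the potential to improve the performance of models that predict aggregate outcomes rather than individual-level consumption." This academic article provides both a framework and managerial implications. It shows how LDA and NLP can be applied to feature extraction for entertainment products, which can help traditional content-based models of consumer behavior as well as related marketing models applied to the media and entertainment industry.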
Lessons Learned Building Natural Language Processing Systems in Health Care
How Algorithms Know What You'll Type Next

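Biases in NLP

AI Bias: It Is the Responsibility of Humans to Ensure Fairness
VentureBeat blog post - Gender Bias in Datasets - based on the UCLA research paper "Learning Gender-Neutral Word Embeddings". August 2018.
Examining Gender and Race Bias in Two Hundred Sentiment Analysis Systems. 2018
Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings.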

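Scraping

Scraping HTML with Scrapy. A tutorial on using the Python module Scrapy to extract data from messy HTML websites.
Extracting Text from Any Document; No Muss, No Fuss. July 2014.
Using Scrapy to Build Your Own Dataset. September 2017.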

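Cleaning

How to Solve 90% of NLP Problems: A Step-by-Step Guide. January 2018. A step-by-step guide to data cleaning and exploration for building successful NLP models.
Text Preprocessing in Python: Steps, Tools, and Examples. October 2018
How to Clean Text for Machine Learning with Python. October 2017. A step-by-step guide to performing text data preprocessing.
Feature Extraction, Basic Pre-processing, and Advanced Processing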

Stop Words

Removing Stop Words with NLTK in Python
Text Classification for Sentiment Analysis - Stopwords and Collocations

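Stemming

Article: Text Stemming: Approaches, Applications, and Challenges. December 2016.
What Is the Difference Between Stemming and Lemmatization? February 2018. The differences between stemming and lemmatization, with examples in different languages.
Stemming and Lemmatization in Python. October 2018. Compares stemming and lemmatization in terms of the algorithms behind them, results, pros and cons, context of use, and code syntax.
Sentiment Symposium Tutorial: Stemming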

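Dimensionality Reduction

Taming Text with the SVD. SAS. January 2004.
Dimensionality Reduction for Bag-of-Words Models: PCA vs LSA
An Overview of Bag of Words and How to Code It in Python for NLP
Bag of Words and TF-IDF Explained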

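Sarcasm Detection

Automatic Sarcasm Detection: A Survey. ACM Computing Surveys, September 2017.
CASCADE: Contextual Sarcasm Detection in Online Discussion Forums. 27th International Conference on Computational Linguistics, August 2018.
A Deeper Look into Sarcastic Tweets Using Deep Convolutional Neural Networks. International Journal of Advanced Research in Computer Engineering & Technology, Volume 6, Issue 1, January 2017.
Detecting Sarcasm with Deep Convolutional Neural Networks. April 30, 2018. Contextual learning for effective sarcasm detection using CNNs.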

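Document Classification

Naive Bayes and Text Classification, 2014. An in-depth overview of the Naive Bayes algorithm and how it is used in the document classification process.
A 2016 paper from Facebook researchers introducing fastText, a fast and efficient algorithm for document classification.
Text Classifier Algorithms in Machine Learning, 2017. A blog post showing how to apply several deep learning algorithms to the document classification problem.
Classifying Documents in the Reuters-21578 R8 Dataset, 2016. A nice tutorial in R showing how to classify news articles with three different ML algorithms.
Tidy Text Mining Beer Reviews, 2018. Uses the KNN algorithm to classify reviews of craft beers into beer styles (e.g., "Pilsner", "IPA", or "Belgian").
Classifying Relations in a Knowledge Graph Using fastText and comet.ml
Multi-Class Text Classification with Scikit-Learn, 2018. An article showing how to handle a multi-class problem, such as classifying consumer complaints into one of 12 categories.
Machine Learning with Text in scikit-learn (PyCon 2016), 2016. A nice video tutorial discussing how to use scikit-learn in the document classification process.
Ultimate Guide to Deal with Text Data (using Python) - for Data Scientists and Engineers, 2018. The title says it all.
Text Classification in Python with scikit-learn and NLTK, 2017. Another tutorial showing how to perform text classification with scikit-learn.
Introducing State of the Art Text Classification with Universal Language Models, 2019. Introduces a groundbreaking transfer learning approach for document classification.
Learning Document Embeddings by Predicting N-grams for Sentiment Classification of Long Movie Reviews - paper with code on GitHub
Towards Explainable NLP: A Generative Explanation Framework for Text Classification, 2019. A paper describing a new approach to explaining the inner workings of text classification models.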

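Entity and Information Extraction

Entity Extraction and Network Analysis. Python, StanfordCoreNLP
Natural Language Processing for Information Extraction
NLP Techniques for Extracting Information. An in-depth exploration of a seven-step framework of NLP data mining tools and techniques.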

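Document Clustering and Document Similarity

Text Clustering: Get Quick Insights from Unstructured Data. July 2017.
Document Clustering. MSc thesis.
Document Clustering: A Detailed Review. Shah and Mahajan. IJAIS 2012.
Document Clustering with Python. This repository clusters IMDB movie descriptions. Based on this original tutorial, whose GitHub repo is here.
Text Mining and Sentiment Analysis on Video Game User Reviews Using SAS® Enterprise Miner
Who Wrote the Anti-Trump New York Times Op-Ed? Using tidytext to find document similarity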

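Concept Analysis/Topic Modeling

Topic Models: Past, Present, and Future
Word Vectors Using LSA, Part 2
Probabilistic Topic Models
LEGO Color Themes as Topic Models. September 2017.
How Our Startup Switched from Unsupervised LDA to Semi-Supervised GuidedLDA.
Topic Modeling with LSA, PLSA, LDA & lda2Vec. August 2018.
text2vec's description of topic models
Topic Modeling Portal
Applications of Topic Models. 2017.
MACS 30500: Text Analysis: Topic Modeling
COTA, Uber's topic modeling approach to improving customer support
Using LDA Topic Models as a Classification Model Input
NLP: Extracting the Main Topics from Your Dataset Using LDA in Minutes
Topic Modeling the Legal Subject Matter and Judicial Activity of the High Court of Australia, 1903-2015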

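Sentiment Analysis

Methods

CACM: Techniques and Applications for Sentiment Analysis, 2013. A nice overview of sentiment analysis from Communications of the ACM.
Unsupervised Sentiment Analysis with Signed Social Networks, 2017. A conference paper describing the challenges of applying sentiment analysis to social networks and proposing a new unsupervised approach.
Lexicon-Based Methods for Sentiment Analysis, 2010. Uses SO-CAL (Semantic Orientation CALculator), a measure of subjectivity and opinion for sentiment analysis.
That Sentimental Feeling, 2015. Compares the results of R's syuzhet package with those of human labelers. Updated in 2016.
Unsupervised Sentiment Neuron, 2017. A team at OpenAI developed a new approach to sentiment analysis using deep NNs that needs far less data than usual.
Current State of Text Sentiment Analysis from Opinion to Emotion Mining, 2017. A journal article surveying the current state of sentiment analysis research and tools.
Overview of Sentiment Analysis Tools, Part 1. Positive and Negative Words Databases, 2017. A blog post giving an overview of several lexicon databases.
Sentiment Analysis: Concept, Analysis, and Applications, 2018. An overview of sentiment analysis, with an analysis of tweets about Uber.
Breakthrough Research Papers and Models for Sentiment Analysis, 2018. A blog post comparing the performance of simple to advanced approaches to sentiment analysis.
Twitter Sentiment Analysis Using Combined LSTM-CNN Models, 2018. A blog post describing a deep learning approach to sentiment analysis.
VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text, 2014. The conference paper presenting VADER, a simple rule-based model for sentiment analysis.
Comparison of Lexicon-Based Approaches for Sentiment Analysis of Microblog Posts, 2014. Presents a new lexicon-based approach to sentiment analysis of Twitter posts, based on inference over lexical resources such as SentiWordNet.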

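Challenges

On the Negativity of Negation, 2011. A conference paper discussing the challenges of handling negation in text, with a case study on IMDB movie reviews.
Challenges in Sentiment Analysis, 2015. A practical guide from the National Research Council Canada describing some of the main challenges of sentiment analysis.
A Survey on Sentiment Analysis Challenges, 2016. A journal article discussing and comparing the sentiment analysis challenges in forty-seven papers.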

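Politics

Sentiment Analysis on Trump's Tweets Using Python, 2017. Sentiment analysis of Trump's tweets, using Tweepy and TextBlob for the NLP processing.
Donald Trump vs Hillary Clinton: Sentiment Analysis on Twitter Mentions, 2016. Compares the sentiment of Trump's tweets with that of Hillary's tweets in the run-up to the 2016 US presidential election.
Does Sentiment Analysis Work? A Tidy Analysis of Yelp Reviews, 2016. Combines predicted results and individual words in reviews to show that sentiment analysis works well on Yelp reviews.
From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series, 2010. A conference paper describing how sentiment analysis on Twitter relates to public opinion polls.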

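Stock Market

Twitter Mood Predicts the Stock Market, 2010. A journal article that measures the "mood" of daily Twitter feeds and shows that mood can predict the DJIA.
A Nonlinear Impact: Evidence of Causal Effects of Social Media on Market Prices, 2016. A journal article showing that the relationship between social media and the DJIA is nonlinear.
Forbes: How Quant Traders Use Sentiment to Get an Edge on the Market, 2015. An article showing how quant traders use sentiment analysis.
Sentdex: Quantifying the Qualitative. An online tool that measures the overall sentiment for different stocks.
Trump2Cash: A stock trading bot powered by Trump tweets. A bot that watches Donald Trump's Twitter account and waits for him to mention any publicly traded company. A related blog post describes a bot that turns Trump's tweets into Planned Parenthood donations.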

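Applications

Lost at Sea: How Social Media Can Help Cruise Lines Attract Millennials, 2016. A white paper describing how cruise lines can attract additional audiences.
Harry Plotter: Celebrating the 20 Year Anniversary with tidytext and the tidyverse in R, 2015. A technical article showing how to apply sentiment analysis to the text of the Harry Potter series.
Data Science 101: Sentiment Analysis in R Tutorial, 2017. A technical article describing how to use the tidytext package in R to analyze US presidential speeches.
Cannes Lions 2017: Mars Chocolate Australia (Clemenger BBDO), 2017.
Sentiment Analysis: 10 Applications and 4 Services, 2018. A short but concise introduction to sentiment analysis, its business impact, and four cloud providers of sentiment analysis services, including Google, Amazon, and Microsoft.
What Your Boss Could Learn by Reading the Whole Company's Emails (2018). "The lesson: figure out the truth about the workforce not by eavesdropping on what employees say, but by examining how they say it." The article centers on applying sentiment analysis to large internal unstructured text datasets, such as employee emails. Text analytics and NLP have become an increasingly popular way to search for clues that may indicate how engaged employees are in the workplace, and to find potential "red flags" that deserve particular attention from the organization, along with their ethical implications.
Aspect-Based Sentiment Analysis of Amazon Product Reviews, 2018. An article showing how to apply sentiment analysis to different aspects of Amazon product reviews.
Sentiment Analysis of 2.2 Million Tweets from Super Bowl 51, 2017. An article showing how to apply sentiment analysis to tweets about the Super Bowl.
Emotion and Sentiment Analysis: A Practitioner's Guide to NLP, 2018. An overview of sentiment analysis, applied to news articles.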

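Tools and Technologies

Streaming Analytics Tutorial.
How to Analyze Sentiment in Azure.
How-to - validate - index - using-python-tutorial/.
Twitter Sentiment Analysis Overview, 2016. An overview of sentiment analysis and a step-by-step walkthrough of performing sentiment analysis with TextBlob.
ELMo Embeddings in Keras Using TensorFlow Hub, 2018. A guide to using ELMo in Keras models via TensorFlow Hub.
Twitter Sentiment Analysis in Python Using TextBlob, 2018.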

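Text Summarization

Text Summarization with Gensim
Unsupervised Text Summarization Using Sentence Embeddings
Improving Abstraction in Text Summarization. Proposes two techniques for improvement.
Text Summarization and Categorization for Scientific and Health-Related Data. 2016. Basic research on text summarization.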

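Machine Translation

Blog post: Found in Translation: More Accurate, Fluent Sentences in Google Translate. November 2016.
NYTimes: The Great A.I. Awakening. December 2016. How Google used artificial intelligence to transform Google Translate, one of its most popular services, and how machine learning is poised to reinvent computing itself.
Machine Learning Translation and the Google Translate Algorithm
Neural Machine Translation (seq2seq) Tutorial
Paper Dissected: "Attention Is All You Need" Explained. Explains the important 2017 paper that introduced the Transformer, an architecture built entirely on attention mechanisms.
The Annotated Transformer. A line-by-line implementation of "Attention Is All You Need".
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. A new language representation model released in 2018. Implementation code. PyTorch port.
Phrase-Based & Neural Unsupervised Machine Translation. Proposes two model variants: a neural model and a phrase-based model. Won the best paper award at EMNLP 2018. Implementation code.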

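Q&A Systems, Chatbots

Meet Lucy: Creating a Chatbot Prototype
Microsoft Bot Framework.
Training Millions of Personalized Dialogue Agents
Ultimate Guide to Leveraging NLP & Machine Learning for Your Chatbot. 2016.
Building a Simple Chatbot from Scratch in Python (using NLTK). September 2018
A Survey on Dialogue Systems: Recent Advances and New Frontiers. January 2018.
Examining the Impact of an Automated Translation Chatbot on Online Collaborative Dialog for Incidental L2 Learning
Creating Banking Chatbots Using FAQ Discovery, Anger Detection, and Natural Language Understanding
Generative Model Chatbots - May 2017
A Guide to Building a Multi-Featured Slackbot with Python - March 2017
Building a Simple Chatbot in Python (using NLTK) - September 2018
The Road to the Future of Conversational Banking
Chatbots - Designing Intents and Entities for Your NLP Models. January 2017
Task-Oriented Dialogue System for Automatic Diagnosis. 2018. Discusses the dataset used for MDP training and its application to medical diagnosis.
Li Deng at AI Frontiers: Three Generations of Spoken Dialogue Systems (Bots). 2017. Slides from Microsoft's Chief AI Scientist.
NLP - Building a Question Answering Model. March 2018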

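Fuzzy Matching, Probabilistic Matching, Record Linkage, Etc.

agrep method in R. Approximate string matching (fuzzy matching)
fuzzywuzzy package in R. Example usage.
Fuzzy String Matching - a survival skill to tackle unstructured information
The RecordLinkage Package: Detecting Errors in Data
R package fastLink: Fast Probabilistic Record Linkage
An R function for merging files by defining a key file
Learning Text Similarity with Siamese Recurrent Networks
Dedupe: A Python library for accurate and scalable fuzzy matching, record deduplication, and entity resolution.
RecordLinkage: A toolkit for record linkage and deduplication written in Python.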

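Word and Document Embeddings

The Current Best of Universal Word Embeddings and Sentence Embeddings
An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec
An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation. 2016.
Document Embedding with Paragraph Vectors. 2015.
GloVe Word Embeddings Demo. 2017. From fast.ai.
Text Classification with Word2Vec. 2016.
Document Embedding. 2017
From Word Embeddings to Document Distances. 2015.
Word Embeddings, Bias in ML, Why You Don't Like Math, & Why AI Needs You. 2017. Rachel Thomas (fast.ai)
Word Vectors in Natural Language Processing: Global Vectors (GloVe). August 2018.
Doc2Vec Tutorial on the Lee Dataset
Word Embeddings in Python with spaCy and Gensim
Deep Contextualized Word Representations. ELMo. PyTorch implementation. TF implementation
Universal Language Model Fine-tuning for Text Classification. Implementation code.
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
Learned in Translation: Contextualized Word Vectors. CoVe.
Distributed Representations of Sentences and Documents. Paragraph vectors. See the doc2vec tutorial at Gensim
sense2vec. Word sense disambiguation.
Skip-Thought Vectors. A word representation method.
Sequence to Sequence Learning with Neural Networks
The Amazing Power of Word Vectors. 2016.
Contextual String Embeddings for Sequence Labeling. 2018.
A Hierarchical Multi-task Approach for Learning Embeddings from Semantic Tasks. Introduces a multi-task learning approach for a set of interrelated NLP tasks. Presented at the AAAI conference in January 2019. Implementation code.
ELMo Word Embeddings
An Idiot's Guide to Word2vec Natural Language Processing
Get Busy with Word Embeddings - An Introduction (February 2018)
NLP's ImageNet Moment Has Arrived. July 2018. An overview of pre-trained NLP language models, whose contribution resembles what ImageNet did for computer vision.
word2vec: fish + music = bass
Universal Sentence Encoder Visually Explained. June 2020.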

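Transformers and Language Models

Understanding Large Language Models. Sebastian Raschka. February 2023.
A Primer in BERTology: What We Know About How BERT Works. November 2020.
A Review of BERT-Based Models. July 2019.
BERT Explained: State of the Art Language Model for NLP. A nice explanation of the basics of how BERT works.
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). December 2018.
Machines Beat Humans on a Reading Test. But Do They Understand?
What Every NLP Engineer Needs to Know About Pre-Trained Language Models. 2019.
Transformers... "Explained"?
The Illustrated Transformer
Hugging Face's course on Transformer models
OpenAI: Better Language Models and Their Implications: An unsupervised language model based on a pre-trained Transformer that achieves state-of-the-art results on many language modeling benchmarks, with a focus on text generation. Controversially, only a limited release. February 14, 2019.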

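ChatGPT

ChatGPT launch blog
Awesome ChatGPT Prompts

... in Education

ChatGPT User Experience: Implications for Education. Zhai (University of Georgia). December 2022.
New Modes of Learning Enabled by AI Chatbots: Three Methods and Assignments. Mollick and Mollick (University of Pennsylvania). December 2022.
Educators Battle Plagiarism as 89% of Students Admit to Using OpenAI's ChatGPT for Homework. Forbes, January 2023
ChatGPT: Educational Friend or Foe? Hirsh-Pasek and Blinkoff (Temple University). January 2023.
Don't Ban ChatGPT in Schools. Teach With It. New York Times (January 2023).
ChatGPT and the Future of Business Education. February 2023.
Udemy course (January 2023). ChatGPT for teachers and educators.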

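Deep Learning

Keras LSTM Tutorial - How to easily build a powerful deep learning language model.
- The first half of the article describes RNNs and the anatomy of LSTM cells and LSTM networks. The second half is an LSTM implementation in Keras that uses generators for data input.
Deep Learning for Natural Language Processing: Tutorials with Jupyter Notebooks.
- A short article with links to, and descriptions of, further video tutorials on DL approaches to NLP problems. Five lessons in total, covering preprocessing, word representations, and LSTMs, among other topics.
A Survey of the Usages of Deep Learning in Natural Language Processing.
- A 35-page academic literature review of DL for NLP (University of Colorado, July 2018). Detailed descriptions of neural network architectures, followed by a comprehensive set of applications.
Sequence Classification with Human Attention: Uses human attention derived from eye-tracking corpora to regularize attention in recurrent neural networks (RNNs). Implementation code.
Tutorial on Text Classification (NLP) Using ULMFiT and the fastai Library in Python
Multi-Task Deep Neural Networks for Natural Language Understanding. An academic article detailing Microsoft's MT-DNN algorithm, which outperformed BERT, ELMo, and BiLSTM on the GLUE benchmark as of February 2019.
Natural Language Processing Tutorial for Deep Learning Researchers: A 2019 repository of NLP tutorials using TensorFlow and PyTorch.
Deep Learning for Sentiment Analysis: A Survey
Neural Reading Comprehension and Beyond. December 2018. Stanford University - reading comprehension models built on top of deep neural networks.
Microsoft: Multi-Task Deep Neural Network (MT-DNN): Microsoft's improvement on Google's BERT, focused on natural language understanding. Code to be released. January 31, 2019.
A Structured Self-Attentive Sentence Embedding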

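Capsule Networks

Investigating Capsule Networks with Dynamic Routing for Text Classification. 2018.
Attention-Based Capsule Networks with Dynamic Routing for Relation Extraction. 2018.
Twitter Sentiment Analysis Using Capsule Nets and GRU. 2018.
Identifying Aggression and Toxicity in Comments Using Capsule Network. 2018.
Early capsule networks were introduced by Geoffrey Hinton et al. in 2017 as an attempt at an NN architecture superior to classical CNNs. The idea is to capture hierarchical relationships in the input layer through dynamic routing between "capsules" of neurons. Because language is itself hierarchically complex, extending the idea to NLP has become an area of active research, as in the papers listed above.
Dynamic Routing Between Capsules. 2017.
Matrix Capsules with EM Routing. 2018.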

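Knowledge Graphs

Classifying Relations in a Knowledge Graph Using fastText and comet.ml
WTF is a Knowledge Graph?
A Survey of Graphs in Natural Language Processing. Nastase et al., 2015.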

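Major NLP Conferences

NeurIPS
Association for Computational Linguistics (ACL)
Empirical Methods in Natural Language Processing (EMNLP)
North American Chapter of the Association for Computational Linguistics (NAACL)
European Chapter of the Association for Computational Linguistics (EACL)
International Conference on Computational Linguistics (COLING)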

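Benchmarks

SQuAD leaderboard. A list of the strongest-performing NLP models on the Stanford Question Answering Dataset (SQuAD).
- SQuAD 1.0 paper (last updated October 2016). SQuAD v1.1 includes over 100,000 question-answer pairs based on Wikipedia articles.
- SQuAD 2.0 paper (October 2018). The second generation of SQuAD includes unanswerable questions, which the NLP model must identify as unanswerable from the training data.
GLUE leaderboard.
- GLUE paper (September 2018). A collection of nine NLP tasks including single-sentence tasks (e.g., check if grammar is correct, sentiment analysis), similarity and paraphrase tasks (e.g., determine if two questions are equivalent), and inference tasks (e.g., determine whether a premise contradicts a hypothesis).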

Online courses

Udemy

Udemy: Deep Learning and NLP A-Z™: How to create a ChatBot
Udemy: Natural Language Processing with Deep Learning in Python
Udemy: NLP - Natural Language Processing with Python
Udemy: Deep Learning: Advanced NLP and RNNs
Udemy: Natural Language Processing and Text Mining Without Coding

Stanford

Stanford CS 224N / Ling 284
- Website: http://cs224d.stanford.edu/
- Reddit: https://www.reddit.com/r/CS224d/comments/4n04ew/follow_along_with_cs224d_2015_or_2016/
Lecture Collection | Natural Language Processing with Deep Learning (Winter 2017)

Coursera

Courses for "natural language processing" on Coursera
Coursera: Applied Text Mining in Python
Coursera: Natural Language Processing
Coursera: Sequence Models for Time Series and Natural Language Processing
Coursera: Clinical Natural Language Processing

DataCamp

DataCamp: Natural Language Processing Fundamentals in Python
DataCamp: Sentiment Analysis in R: The Tidy Way
DataCamp: Text Mining: Bag of Words
DataCamp: Building Chatbots in Python
DataCamp: Advanced NLP with spaCy

Others

Deep Learning Drizzle: Drench yourself in Deep Learning, Reinforcement Learning, Machine Learning, Computer Vision, and NLP from this curated list of exciting lectures!
Natural Language Processing | Dan Jurafsky, Christopher Manning
Deep Learning for NLP. DeepMind and University of Oxford Department of Computer Science.
CMU CS 11-747: Neural Network for NLP
YSDA NLP course. Yandex School of data analysis.
CMU Language and Statistics II: (More) Empirical Methods in Natural Language Processing
UT CS 388: Natural Language Processing
Columbia: COMS W4705: Natural Language Processing
Columbia: COMS E6998: Machine Learning for Natural Language Processing (Spring 2012)
Machine Translation: Spring 2016
Commonlounge: Learn Natural Language Processing: From Beginner to Expert
Big Data University: Advanced Text Analytics – Getting Results with SystemT
Udacity: Natural Language Processing Nanodegree
edX: Natural Language Processing: An introduction to NLP, taught by Microsoft researchers

APIs and Libraries

R packages
- tm: Text Mining.
- lsa: Latent Semantic Analysis.
- lda: Collapsed Gibbs Sampling Methods for Topic Models.
- textir: Inverse Regression for Text Analysis.
- corpora: Statistics and data sets for corpus frequency data.
- tau: Text Analysis Utilities.
- tidytext: Text mining using dplyr, ggplot2, and other tidy tools.
- Sentiment140: Sentiment text analysis
- sentimentr: Lexicon-based sentiment analysis.
- cleanNLP: ML-based sentiment analysis.
- RSentiment: Lexicon-based sentiment analysis. Contains support for negation detection and sarcasm.
- text2vec: Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities.
- fastTextR: Interface to the fastText library.
- LDAvis: Interactive visualization of topic models.
- keras: Interface to Keras, a high-level neural networks 'API'. (RStudio Blog: TensorFlow for R)
- rtweet: Client for accessing Twitter's REST and stream APIs. (21 Recipes for Mining Twitter Data with rtweet)
- topicmodels: Interface to the C code for Latent Dirichlet Allocation (LDA).
- textmineR: Aid for text mining in R, with a syntax that should be familiar to experienced R users.
- wordVectors: Creating and exploring word2vec and other word embedding models.
- gtrendsR: Interface for retrieving and displaying the information returned online by Google Trends.
  - Analyzing Google Trends Data in R
- textstem: Tools that stem and lemmatize text.
- NLPutils: Utilities for Natural Language Processing.
- udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing using UDPipe.
Python modules
- NLTK: Natural Language Toolkit.
  - Video: NLTK with Python 3 for Natural Language Processing
- scikit-learn: Machine Learning in Python
  - Tutorial
- Spark NLP: Open source text processing library for Python, Java, and Scala. It provides production-grade, scalable, and trainable versions of the latest research in natural language processing.
- spaCy: Industrial-Strength Natural Language Processing in Python.
- textblob: Simplified Text processing.
  - Natural Language Basics with TextBlob
- Gensim: Topic Modeling for humans.
- Pattern.en: A fast part-of-speech tagger for English, sentiment analysis, tools for English verb conjugation and noun singularization & pluralization, and a WordNet interface.
- textmining: Python Text Mining utilities.
- Scrapy: Open source and collaborative framework for extracting the data you need from websites.
- lda2vec: Tools for interpreting natural language.
- PyText: A deep-learning based NLP modeling framework built on PyTorch.
- sent2vec: General purpose unsupervised sentence representations.
- flair: A very simple framework for state-of-the-art Natural Language Processing (NLP)
- word_forms: Accurately generate all possible forms of an English word, e.g., "election" --> "elect", "electoral", "electorate", etc.
- AllenNLP: Open-source NLP research library, built on PyTorch.
- Beautiful Soup: Parse HTML and XML documents. Useful for webscraping.
- BigARTM: Fast topic modeling platform.
- Scattertext: Beautiful visualizations of how language differs among document types.
- embeddings: Pretrained word embeddings in Python.
- fastText: Library for efficient learning of word representations and sentence classification.
- Google Seq2Seq: A general-purpose encoder-decoder framework for Tensorflow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.
- polyglot: A natural language pipeline that supports multilingual applications.
- textacy: NLP, before and after spaCy
- Glove-Python: A “toy” implementation of GloVe in Python. Includes a paragraph embedder.
- Bert As A Service: Client/Server package for sentence encoding, i.e., mapping a variable-length sentence to a fixed-length vector. Designed to provide a scalable, production-ready service, while also allowing researchers to apply BERT quickly.
- Keras-BERT: A Keras Implementation of BERT
- Paragraph embedding scripts and Pre-trained models: Scripts for training and testing paragraph vectors, with links to some pre-trained Doc2Vec and Word2Vec models
- Texthero: Text preprocessing, representation and visualization from zero to hero.
Apache Tika: a content analysis toolkit.
Apache Spark: a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs.
- MLlib: MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. Related to NLP there are methods available for LDA, Word2Vec, and TFIDF.
- LDA: latent Dirichlet allocation
- Word2Vec: an Estimator which takes sequences of words representing documents and trains a Word2VecModel. The model maps each word to a unique fixed-size vector. The Word2VecModel transforms each document into a vector using the average of all words in the document.
- TFIDF: term frequency-inverse document frequency
HDF5: an open source file format that supports large, complex, heterogeneous data. Requires no configuration.
- h5py: Python HDF5 package
Stanford CoreNLP: a suite of core NLP tools
- Also checkout http://corenlp.run for a hosted version of the CoreNLP server.
- Introduction to StanfordNLP: An Incredible State-of-the-Art NLP Library for 53 Languages (with Python code)
Stanford Parser: A probabilistic natural language parser.
Stanford POS Tagger: A Parts-of-Speech tagger.
Stanford Named Entity Recognizer: Recognizes proper nouns (things, places, organizations) and labels them as such.
Stanford Classifier: A softmax classifier.
Stanford OpenIE: Extracts relationships between words in a sentence (e.g., Mark Zuckerberg; founded; Facebook).
Stanford Topic Modeling Toolbox
MALLET: MAchine Learning for LanguagE Toolkit
- Github: https://github.com/mimno/Mallet
Apache OpenNLP: Machine learning based toolkit for text NLP.
Streamcrab: Real-time Twitter sentiment analyzer engine. http://www.streamcrab.com
TextRazor API: Extract Meaning from your Text.
fastText. Library for fast text representation and classification. Facebook.
Comparison of Top 6 Python NLP Libraries.
PyCaret's NLP Module. PyCaret is an open source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights; also, PyCaret's founder Moez Ali is a Smith alumnus (MMA 2020).
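
The Word2Vec entry under MLlib above says a document is turned into a vector by averaging the vectors of its words. The following is a minimal pure-Python sketch of that averaging idea only; it does not call Spark. The function name `average_document_vector` and the toy two-dimensional vectors are made up for illustration, and skipping words with no known vector is an assumption of this sketch, so a real library may handle such words differently.

```python
from typing import Dict, List


def average_document_vector(
    tokens: List[str], word_vectors: Dict[str, List[float]]
) -> List[float]:
    """Average the vectors of the tokens found in `word_vectors`.

    Tokens without a vector are skipped (an assumption of this sketch).
    """
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        raise ValueError("no token in the document has a known vector")
    size = len(known[0])
    # Component-wise mean over all known word vectors.
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]


# Hypothetical 2-dimensional word vectors, invented for this example.
toy_vectors = {
    "text": [1.0, 0.0],
    "mining": [0.0, 1.0],
    "fun": [1.0, 1.0],
}

doc = "text mining is fun".split()  # "is" has no vector and is skipped
doc_vector = average_document_vector(doc, toy_vectors)
print(doc_vector)
```

Averaging discards word order, which is why such document vectors are usually treated as a simple baseline for tasks like similarity search or classification.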

Products

Systran - Enterprise Translation Products
SAS Text Miner (Part of SAS Enterprise Miner)
SAS Sentiment Analysis
STATISTICA
- Text Mining (Big Data, Unstructured Data)
KNIME
RapidMiner
GATE
IBM Watson
- Video: How IBM Watson learns (3 minutes)
- Video: IBM Watson on Jeopardy! (10 minutes)
- Video: IBM Watson: The Science Behind an Answer (7 minutes)
Crimson Hexagon
Stocktwits: Tap into the Pulse of Markets
Meltwater
CrowdFlower: AI for your business.
Lexalytics Semantria: API and Excel plugin.
Rosette Text Analytics: AI for Human Language
AlchemyAPI
MonkeyLearn
LightTag Annotation Tool. Hosted annotation tool for teams.
UBIAI. Easy-to-use text annotation tool for teams with most comprehensive auto-annotation features. Supports NER, relations and document classification as well as OCR annotation for invoice labeling
Anafora: Free and open source web-based raw text annotation tool
brat: Rapid annotation tool.
Google's Colab: Ready-to-go Notebook environment that makes it easy to get up and running.
Lyrebird.ai: “Ultra-Realistic Voice Cloning and Text-to-Speech” recognition platform. This Canadian start-up has created a product/platform that syncs both voice cloning with text-to-speech. Lyrebird recognizes the intonations and voice patterns from audio recordings, and overlays text data input to recreate a text-to-speech audio file output from the selected voice pattern audio recording.
Ask Data by Tableau Software Inc.: In February 2019, Tableau released a new NLP feature service add-on to help assist existing Tableau platform users with retrieving quick and easy data visualizations to drive business intelligence insights. Similar to a search engine user interface, Tableau's Ask Data feature interface applies NLP from user text input to extract key words to find data analytics and business insights quickly on the Tableau Platform.
Dialogflow: Google's natural language platform, used to integrate conversational user interfaces into mobile apps, web applications, bots, VRUs, etc.
Weka: Easy-to-use, graphical Machine Learning Workbench including NLP capabilities.
Annotation Lab - Free End-to-End No-Code platform for text annotation and DL model training/tuning. Out-of-the-box support for Named Entity Recognition, Classification, Relation extraction and Assertion Status Spark NLP models. Unlimited support for users, teams, projects, documents.

Cloud

Microsoft Azure Text Analytics
Amazon Lex: A service for building conversational interfaces into any application using voice and text.
Amazon Comprehend
Google Cloud Natural Language
IBM Watson
- Video: How IBM Watson learns (3 minutes)
- Video: IBM Watson on Jeopardy! (10 minutes)
- Video: IBM Watson: The Science Behind an Answer (7 minutes)

Getting Data out of PDFs

Apache PDFBox
Tabula: A tool for liberating data tables locked inside PDF files.
PDFLayoutTextStripper: Converts a pdf file into a text file while keeping the layout of the original pdf.
pdftabextract: A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
SO: How to extract text from a PDF?
Tools for Extracting Data and Text from PDFs - A Review
How I used NLP (SpaCy) to screen Data Science Resumes
PyPDF2: PDF file manipulation (PDF to PDF).

Online Demos and Tools

MIT OpenNMT for neural machine translation and neural sequence modeling
Stanford Parser
Stanford CoreNLP
word2vec demo
Another word2vec demo
sense2vec: Semantic Analysis of the Reddit Hivemind
RegexPal: Great tool for testing out regular expressions.
AllenNLP Demo: Great demo using AllenNLP of everything from Named Entity Recognition to Textual Entailment.
Cognitive Computation Group - Part of Speech Tagging Demo. These demos exhibit part-of-speech tagging, information extraction tasks, etc.

Datasets

UCI's Text Datasets. A collection of databases, domain theories, and data generators used by Machine Learning community.
data.world's Text Datasets
Awesome Public Datasets' Natural Language
Insight Resources Datasets
Bing Sentiment Analysis
Consumer Complaint Database. From the Consumer Financial Protection Bureau.
Sentiment Labelled Sentences Data Set. Contains sentences labelled as "positive" or "negative", from imdb.com, amazon.com, and yelp.com.
Amazon product data
Data is Plural
FiveThirtyEight's datasets
r/datasets
Awesome public datasets
R's datasets package
200,000 Russian Troll Tweets - Released by Congress from Twitter suspended accounts and removed from public view.
Wikipedia: List of datasets for ML research
Google Dataset Search
Kaggle: UMICH SI650 - Sentiment Classification
Lee's Similarity Data Sets
Corpus of Presidential Speeches (CoPS) and a Clinton/Trump Corpus
15 Best Chatbot Datasets for Machine Learning
A Survey of Available Corpora for Building Data-Driven Dialogue Systems
nlp-datasets
Hate-speech-and-offensive-language
First Quora Dataset Release: Question Pairs
The Best 25 Datasets for Natural Language Processing
SWAG: A large-scale dataset created for Natural Language Inference (NLI) with common-sense reasoning.
MIMIC: an openly available dataset developed by the MIT Lab for Computational Physiology, comprising deidentified health data associated with ~40,000 critical care patients.
Clinical NLP Dataset Repository: A curated list of publicly-available clinical datasets for use in NLP research.
The Multi-Genre NLI Corpus
Twitter US Airline Sentiment
Million Song Lyrics: Dataset of song lyrics in Bag-Of-Words (BOW) format.
DuoRC – 186K unique question-answer pairs with evaluation script for Paraphrased Reading Comprehension
EDGAR Financial Statements: Reporting engine for financial and regulatory filings for companies worldwide. A huge repository of financial and company data for text mining.
American National Corpus Download
Santa Barbara Corpus of Spoken American English
Leipzig Corpora Collection: Corpora in English, Arabic, French, Russian, German
Awesome Twitter
The Big Bad NLP Database
CBC News Coronavirus articles
Huggingface

Lexicons for Sentiment Analysis

MPQA Lexicon
SentiWordNet
AFINN
bing
nrc
vaderSentiment

Misc

AskReddit: People with a mother tongue that isn't English, what are the most annoying things about the English language when you are trying to learn it?
Funny Video: Emotional Spell Check
How to win Kaggle competition based on NLP task, if you are not an NLP expert
Detecting Gang-Involved Escalation on Social Media Using Context. Detecting aggression and loss in social media using CNN.
Reasoning about Actions and State Changes by Injecting Commonsense Knowledge. Incorporating global, commonsense constraints & biasing reading with preferences from large-scale corp
The Language of Hip Hop: A 2017 analysis by Matt Daniels of Pudding determining the popularity of various words in hip hop music and across artists.
Using Natural Language Processing for Automatic Detection of Plagiarism
Probabilistic Graphical Models: Lagrangian Relaxation Algorithms for Natural Language Processing
Human Emotion: How to determine confidence level for manually labeled sentiment data?
A Complete Exploratory Data Analysis and Visualization for Text Data

Other Curated Lists

awesome-nlp: A curated list of resources dedicated to Natural Language Processing (NLP)
awesome-machine-learning
Awesome Deep Learning for Natural Language Processing (NLP)
Papers with Code: A fantastic list of recent machine learning papers on ArXiv, with links to code.
Chinese NLP Tools. 2019. List of tools for NLP in Chinese Language.
Association for Computational Linguistics Papers Anthology: The ACL Anthology currently hosts almost 50,000 papers on the study of computational linguistics and natural language processing. Includes all papers from recent conferences.
Over 150 of the Best Machine Learning, NLP, and Python Tutorials I've Found

Contributing

Contributions are more than welcome! Please read the contribution guidelines first.

License

To the extent possible under law, @stepthom has waived all copyright and related or neighboring rights to this work.

Additional information

Version 1.0.0
Type: other source code
Updated 2025-04-17
Size 31.39KB
Source: GitHub