
There is no magic in magic. The magician merely understands something simple which doesn't appear simple or natural to the untrained audience. Once you learn how to hold a card while making your hand look empty, you only need practice before you too can "do magic." - Jeffrey Friedl, in the book Mastering Regular Expressions
Note: Please raise an issue for any suggestions, corrections, and feedback.
Most of the code samples were done using Jupyter notebooks (on Colab), so each piece of code can be run independently.
The following topics have been explored:
Note: The difficulty level is assigned based on my understanding.
| Tokenization | Word Embeddings - Word2Vec | Word Embeddings - GloVe | Word Embeddings - ELMo |
| RNN, LSTM, GRU | Pack Padded Sequences | Attention Mechanism - Luong | Attention Mechanism - Bahdanau |
| Pointer Network | Transformer | GPT-2 | BERT |
| Topic Modelling - LDA | Principal Component Analysis (PCA) | Naive Bayes | Data Augmentation |
| Sentence Embeddings |
The process of converting text data into tokens is one of the most important steps in NLP. Tokenization was explored using the following methods:
Word embeddings are learned representations of text in which words that have the same meaning have a similar representation. It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

Word2Vec is one of the most popular pre-trained word embeddings, developed by Google. Depending on the way the embeddings are learned, Word2Vec is classified into two approaches: Continuous Bag-of-Words (CBOW) and Skip-gram.
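As a quick illustration (not the notebook's code), a Word2Vec model can be trained with gensim, where the sg flag switches between the CBOW and Skip-gram approaches; the corpus and hyperparameters below are toy values.

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (placeholder data).
sentences = [["nlp", "is", "fun"],
             ["word", "embeddings", "capture", "meaning"],
             ["nlp", "uses", "embeddings"]]

# gensim 4.x API: sg=0 trains CBOW, sg=1 trains Skip-gram.
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=50)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

print(cbow.wv["nlp"].shape)              # 50-dimensional word vector
print(skipgram.wv.most_similar("nlp"))   # nearest words in the embedding space
```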

GloVe is another commonly used method of obtaining pre-trained embeddings. GloVe aims to achieve two goals:
ELMo is a deep contextualized word representation that models:
These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus.

Recurrent networks (RNN, LSTM, GRU) have proven to be some of the most important units in NLP applications because of their architecture. Many problems require remembering the sequential nature of the input, such as predicting the sentiment of a scene, where the previous scenes need to be remembered.

When training RNNs (LSTM, GRU or vanilla RNN), it is difficult to batch the variable-length sequences. Ideally we would pad all the sequences to a fixed length, but that ends up performing unnecessary computations. How can we overcome this? PyTorch provides the pack_padded_sequence functionality.
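A minimal sketch of how this works in PyTorch (toy tensors, not the notebook's code): pad the batch, pack it with the true lengths so the RNN skips the padded steps, then unpack.

```python
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three already-numericalized sequences of different lengths (toy data).
seqs = [torch.tensor([4, 9, 3, 7]), torch.tensor([5, 2]), torch.tensor([8, 1, 6])]
lengths = torch.tensor([len(s) for s in seqs])

padded = pad_sequence(seqs, batch_first=True)      # (batch, max_len)
embedded = torch.nn.Embedding(20, 8)(padded)       # (batch, max_len, emb_dim)

# Pack so the LSTM does no computation on the padded positions.
packed = pack_padded_sequence(embedded, lengths, batch_first=True, enforce_sorted=False)
lstm = torch.nn.LSTM(input_size=8, hidden_size=16, batch_first=True)
packed_out, (h, c) = lstm(packed)

# Unpack back to a padded tensor when per-timestep outputs are needed.
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape)   # torch.Size([3, 4, 16])
```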

The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder's last hidden state, attention is used to focus more on the relevant parts of the input while decoding a sentence. In Luong attention, the context vector is created by taking the encoder outputs and the current hidden state of the decoder RNN.

The attention score can be computed in three ways: dot, general, and concat.
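A rough sketch of the three Luong scoring functions in PyTorch (shapes and layer names are illustrative, not the notebook's exact code):

```python
import torch
import torch.nn as nn

hid = 16
W_general = nn.Linear(hid, hid, bias=False)      # used by the "general" score
W_concat = nn.Linear(2 * hid, hid, bias=False)   # used by the "concat" score
v_concat = nn.Parameter(torch.rand(hid))

def score(decoder_hidden, encoder_outputs, method):
    # decoder_hidden: (batch, hid), encoder_outputs: (batch, src_len, hid)
    if method == "dot":
        return torch.bmm(encoder_outputs, decoder_hidden.unsqueeze(2)).squeeze(2)
    if method == "general":
        return torch.bmm(encoder_outputs, W_general(decoder_hidden).unsqueeze(2)).squeeze(2)
    if method == "concat":
        expanded = decoder_hidden.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
        energy = torch.tanh(W_concat(torch.cat((expanded, encoder_outputs), dim=2)))
        return energy.matmul(v_concat)
    raise ValueError(method)

enc_out, dec_hid = torch.rand(2, 5, hid), torch.rand(2, hid)
for m in ("dot", "general", "concat"):
    weights = torch.softmax(score(dec_hid, enc_out, m), dim=1)       # (batch, src_len)
    context = torch.bmm(weights.unsqueeze(1), enc_out).squeeze(1)    # (batch, hid)
    print(m, context.shape)
```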

The main difference between Bahdanau and Luong attention is the way the context vector is created. In Bahdanau attention, the context vector is created by taking the encoder outputs and the previous hidden state of the decoder RNN, whereas in Luong attention, the context vector is created by taking the encoder outputs and the current hidden state of the decoder RNN.
Once the context is computed, it is concatenated with the decoder input embedding and fed as input to the decoder RNN.

Bahdanau attention is also called additive attention.
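For reference, the additive (Bahdanau-style) score has the general form below, where s_{t-1} is the decoder's previous hidden state, h_i an encoder output, and W_a, U_a, v_a are learned parameters:

```latex
\mathrm{score}(s_{t-1}, h_i) = v_a^{\top} \tanh\left( W_a s_{t-1} + U_a h_i \right)
```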

The Transformer is a model architecture that eschews recurrence and instead relies entirely on an attention mechanism to draw global dependencies between input and output.

Natural language processing tasks, such as question answering, machine translation, reading comprehension, and summarization, are typically approached with supervised learning on task-specific datasets. We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText. Our largest model, GPT-2, is a 1.5B parameter Transformer that achieves state-of-the-art results on 7 out of 8 tested language modeling datasets in a zero-shot setting, but still underfits WebText. Samples from the model reflect these improvements and contain coherent paragraphs of text. These findings suggest a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

GPT-2 utilizes a 12-layer decoder-only Transformer architecture.


BERT uses the Transformer architecture to encode sentences.

The output of a Pointer Network is discrete and corresponds to positions in the input sequence.
The number of target classes at each step of the output depends on the length of the input, which is variable.
It differs from previous attention attempts in that, instead of using attention to blend the hidden units of an encoder into a context vector at each decoder step, it uses attention as a pointer to select a member of the input sequence as the output.

One of the primary applications of natural language processing is to automatically extract what topics people are discussing from large volumes of text. Some examples of large text are feeds from social media, customer reviews of hotels and movies, user feedback, news stories, e-mails of customer complaints, etc.
Knowing what people are talking about and understanding their problems and opinions is highly valuable to businesses, administrators, and political campaigns. And it is really hard to manually read through such large volumes and compile the topics.
Thus, an automated algorithm is required that can read through the text documents and automatically output the topics discussed.
In this notebook, we will take a real example of the 20 Newsgroups dataset and use LDA to extract the naturally discussed topics.

LDA's approach to topic modeling is to consider each document as a collection of topics in a certain proportion, and each topic as a collection of keywords, again in a certain proportion.
Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.
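As a rough sketch of that workflow (using scikit-learn here purely for illustration; the notebook may rely on a different library such as gensim):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data
vectorizer = CountVectorizer(max_df=0.95, min_df=5, stop_words="english")
X = vectorizer.fit_transform(docs)                                 # document-term counts

lda = LatentDirichletAllocation(n_components=10, random_state=0)   # we supply the number of topics
lda.fit(X)

# Top keywords per topic; naming the topics is left to us.
terms = vectorizer.get_feature_names_out()
for idx, topic in enumerate(lda.components_):
    top_words = [terms[i] for i in topic.argsort()[-8:][::-1]]
    print(f"Topic {idx}: {', '.join(top_words)}")
```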
Fundamentally, PCA is a dimensionality-reduction technique that transforms the columns of a dataset into a new set of features. It does this by finding a new set of directions (like the X and Y axes) that explain the maximum variability in the data. This new system of coordinate axes is called the Principal Components (PCs).

In practice, PCA is used for two reasons:
Dimensionality Reduction: the information distributed across a large number of columns is transformed into principal components (PCs) such that the first few PCs can explain a sizeable chunk of the total information (variance). These PCs can then be used as explanatory variables in machine learning models.
Visualize Data: visualizing the separation of classes (or clusters) is hard for data with more than 3 dimensions (features). With the first two PCs alone, a clear separation can usually be seen.
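A minimal scikit-learn sketch of both uses (the Iris data here is just a stand-in):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)
pcs = pca.fit_transform(X_scaled)              # (n_samples, 2) -> usable as model features

print("Variance explained by PC1, PC2:", pca.explained_variance_ratio_)

# Visualization: the first two PCs often already separate the classes.
plt.scatter(pcs[:, 0], pcs[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```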
Naive Bayes classifiers are probabilistic machine learning models used for classification tasks. The crux of the classifier is Bayes' theorem.

Using Bayes' theorem, we can find the probability of A happening given that B has occurred. Here, B is the evidence and A is the hypothesis. The assumption made is that the predictors/features are independent, i.e. the presence of one particular feature does not affect another. Hence it is called naive.
Types of Naive Bayes classifiers:
Multinomial Naive Bayes: mostly used when the variables are discrete (like words). The features/predictors used by the classifier are the frequencies of the words present in the document.
Gaussian Naive Bayes: when the predictors take continuous values and are not discrete, we assume that these values are sampled from a Gaussian distribution.
Bernoulli Naive Bayes: similar to Multinomial Naive Bayes, but the predictors are boolean variables. The parameters used to predict the class variable take only the values yes or no, for example whether a word occurs in the text or not.
Using the 20 Newsgroups dataset, the Naive Bayes algorithm was explored for classification.
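A compact scikit-learn sketch of that setup (the exact preprocessing in the notebook may differ):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", remove=("headers", "footers", "quotes"))

# Word-frequency style features + Multinomial Naive Bayes.
model = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
model.fit(train.data, train.target)

print("Test accuracy:", accuracy_score(test.target, model.predict(test.data)))
```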
Data augmentation using the following techniques has been explored:

A new architecture called SBERT was explored. Its siamese network architecture enables fixed-sized vectors to be derived for input sentences. Using a similarity measure like cosine similarity or Manhattan/Euclidean distance, semantically similar sentences can be found.
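A small sketch with the sentence-transformers library (the model checkpoint name here is just an example, not necessarily the one used in the notebook):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")   # example checkpoint

sentences = ["A man is eating food.",
             "A man is eating a piece of bread.",
             "The sky is blue today."]

# Fixed-size sentence vectors from the siamese-style encoder.
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities; semantically close sentences score higher.
# (Older library versions expose this as util.pytorch_cos_sim.)
print(util.cos_sim(embeddings, embeddings))
```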

| Sentiment Analysis - IMDB | Sentiment Classification - Hinglish | Document Classification |
| Duplicate Question Pair Classification - Quora | POS Tagging | Natural Language Inference - SNLI |
| Toxic Comment Classification | Grammatically Correct Sentences - CoLA | NER Tagging |
Sentiment analysis refers to the use of natural language processing, text analysis, computational linguistics, and biometrics to systematically identify, extract, quantify, and study affective states and subjective information.
Following variants have been explored:
An RNN is used for processing and identifying the sentiment.

After trying the basic RNN, which gives a test accuracy of less than 50%, the following techniques have been experimented with, achieving a test accuracy above 88%.
Techniques used:

Attention helps to focus on the relevant parts of the input when predicting its sentiment. Bahdanau attention was used over the outputs of the LSTM, with the final forward and backward hidden states concatenated. Without using pre-trained word embeddings, a test accuracy of 88% was achieved.

BERT obtained new state-of-the-art results on eleven natural language processing tasks. Transfer learning in NLP took off after the release of the BERT model. Sentiment analysis using BERT was explored.

Mixing languages (also known as code-mixing) is the norm in multilingual societies. Multilingual people who are non-native English speakers tend to code-mix using English-based phonetic typing and by inserting English words into their main language.
The task is to predict the sentiment of a given code-mixed tweet. The sentiment labels are positive, negative, or neutral, and the code-mixed languages are English-Hindi. (SentiMix)
Following variants have been explored:
Using a simple MLP model, an F1 score of 0.58 was achieved on the test data.

After exploring the basic MLP model, an LSTM model was used for sentiment prediction, achieving an F1 score of 0.57.

In fact, the results are poorer compared to the basic MLP model. One of the reasons could be that the LSTM is unable to learn the relationships between the words in a sentence due to the high diversity of the code-mixed data.
Since the LSTM is unable to capture the relationships between the words in a code-mixed sentence due to the high diversity of the code-mixed data, and no pre-trained embeddings are used, the F1 score is low.
To alleviate this issue, the XLM-RoBERTa model (which was pre-trained on 100 languages) is used to encode the sentences. In order to use the XLM-RoBERTa model, the sentences need to be in the proper language, so the Hinglish words first need to be converted to the Hindi (Devanagari) script.

An F1 score of 0.59 was achieved. Methods to improve on this will be explored later.
The final output of the XLM-RoBERTa model is used as the input embedding to a bidirectional LSTM model. An attention layer, which takes the outputs of the LSTM layer, produces a weighted representation of the input, which is then passed through a classifier to predict the sentiment of the sentence.

An F1 score of 0.64 was achieved.
Just as a 3x3 filter can look over a patch of an image, a 1x2 filter can look over two sequential words in a piece of text, i.e. a bi-gram. In this CNN model we use multiple filters of different sizes, which look at the bi-grams (1x2 filters), tri-grams (1x3 filters) and/or n-grams (1xn filters) within the text.
The intuition here is that the appearance of certain bi-grams, tri-grams, and n-grams within the review will be a good indication of the final sentiment.
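A bare-bones PyTorch sketch of such an n-gram CNN classifier (all hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, n_filters=64, filter_sizes=(2, 3, 4), n_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # A kernel of height n slides over n consecutive word embeddings, i.e. an n-gram.
        self.convs = nn.ModuleList([nn.Conv2d(1, n_filters, (fs, emb_dim)) for fs in filter_sizes])
        self.fc = nn.Linear(n_filters * len(filter_sizes), n_classes)

    def forward(self, text):                                               # text: (batch, seq_len)
        x = self.embedding(text).unsqueeze(1)                              # (batch, 1, seq_len, emb_dim)
        conved = [F.relu(conv(x)).squeeze(3) for conv in self.convs]       # (batch, n_filters, seq_len - fs + 1)
        pooled = [F.max_pool1d(c, c.size(2)).squeeze(2) for c in conved]   # (batch, n_filters)
        return self.fc(torch.cat(pooled, dim=1))                           # (batch, n_classes)

logits = TextCNN()(torch.randint(0, 5000, (8, 20)))   # a batch of 8 sequences of length 20
print(logits.shape)                                   # torch.Size([8, 3])
```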

An F1 score of 0.69 was achieved.
CNNs capture local dependencies whereas RNNs capture global dependencies. By combining both, we can get a better understanding of the data. The combination of a CNN model and a bidirectional RNN model outperformed the other models.

An F1 score of 0.71 was achieved (top 5 on the leaderboard).
Document classification or document categorization is a problem in library science, information science, and computer science. The task is to assign a document to one or more classes or categories.
Following variants have been explored:
Hierarchical Attention Networks (HAN) consider the hierarchical structure of documents (document - sentences - words) and include an attention mechanism that is able to find the most important words and sentences in a document while taking the context into consideration.

The basic HAN model overfits rapidly. To overcome this, techniques like Embedding Dropout and Locked Dropout were explored. There is another technique called Weight Dropout which is not implemented (let me know if there are any resources on implementing it). Pre-trained GloVe word embeddings are also used instead of random initialization. Since the attention is applied at both the sentence level and the word level, we can see which words are important in a sentence and which sentences are important in a document.



QQP stands for Quora Question Pairs. The objective of the task is, given a pair of questions, to find whether those questions are semantically similar to each other.
Following variants have been explored:
The algorithm needs to take the pair of questions as input and output their similarity. A siamese network is used. A Siamese neural network (sometimes called a twin neural network) is an artificial neural network that uses the same weights while working in tandem on two different input vectors to compute comparable output vectors.
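A toy PyTorch sketch of the weight-sharing idea (layer sizes are illustrative, not the notebook's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hid_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)

    def encode(self, tokens):                      # tokens: (batch, seq_len)
        _, (h, _) = self.lstm(self.embedding(tokens))
        return h[-1]                               # (batch, hid_dim)

    def forward(self, q1, q2):
        # The *same* weights encode both questions, so the vectors are comparable.
        v1, v2 = self.encode(q1), self.encode(q2)
        return F.cosine_similarity(v1, v2)         # one similarity score per pair

model = SiameseEncoder()
sims = model(torch.randint(0, 5000, (4, 12)), torch.randint(0, 5000, (4, 12)))
print(sims.shape)   # torch.Size([4])
```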

After trying the siamese model, BERT was explored for Quora duplicate question pair detection. BERT takes question 1 and question 2 as input, separated by the [SEP] token, and the classification is done using the final representation of the [CLS] token.

Part-of-speech (POS) tagging is the task of labelling each word in a sentence with its appropriate part of speech.
Following variants have been explored:
This code covers the basic workflow. We'll learn how to: load data, create train/test/validation splits, build a vocabulary, create data iterators, define a model, and implement the train/evaluate/test loop and run-time (inference) tagging.
The model used is a multi-layer bidirectional LSTM network.

After trying the RNN approach, POS tagging with a Transformer-based architecture was explored. The Transformer contains both an encoder and a decoder, but for a sequence labelling task the encoder alone is sufficient. Since the data is small, a 6-layer encoder would overfit the data, so a 3-layer Transformer encoder model was used.

After trying POS tagging with a Transformer encoder, POS tagging with a pre-trained BERT model was explored. It achieved a test accuracy of 91%.

The goal of natural language inference (NLI), a widely studied natural language processing task, is to determine whether one given statement (the premise) semantically entails another given statement (the hypothesis).
Following variants have been explored:
A basic model with a siamese BiLSTM network was implemented.

This can be treated as the baseline setting. A test accuracy of 76.84% was achieved.
In the previous notebook, the final hidden states of the LSTMs were taken as the representations of the premise and the hypothesis. Now, instead of taking the final hidden states, attention is computed over all the input tokens and the final weighted vector is taken as the representation of the premise and the hypothesis.

The test accuracy improved from 76.84% to 79.51%.
A Transformer encoder is used to encode the premise and the hypothesis. Once the sentences are passed through the encoder, the sum over all tokens is taken as the final representation (other variants can be explored). The model accuracy is lower compared to the RNN variants.

NLI with a BERT base model was explored. BERT takes the premise and the hypothesis as input, separated by the [SEP] token, and the classification is done using the final representation of the [CLS] token.

Distillation: a technique you can use to compress a large model, called the teacher, into a smaller model, called the student. Following the teacher-student setup, distillation was performed on NLI using a teacher model.

Discussing things you care about can be difficult. The threat of abuse and harassment online means that many people stop expressing themselves and give up on seeking different opinions. Platforms struggle to effectively facilitate conversations, leading many communities to limit or completely shut down user comments.
You are provided with a large number of Wikipedia comments which have been labeled by human raters for toxic behavior. The types of toxicity are:
Following variants have been explored:
The model used is a bidirectional GRU network.

A test accuracy of 99.42% was achieved. Since 90% of the data is not labeled with any toxicity, simply predicting all the data as non-toxic already gives a 90% accurate model, so accuracy is not a reliable metric. A different metric, ROC AUC, was implemented.
With Categorical Cross Entropy as the loss, the ROC_AUC score was 0.5. By changing the loss to Binary Cross Entropy and modifying the model by adding pooling layers (max, mean), the ROC_AUC score improved to 0.9873.
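The gist of that change, as a toy example (random logits, made-up labels):

```python
import torch
import torch.nn as nn
from sklearn.metrics import roc_auc_score

logits = torch.randn(4, 6)                               # model outputs for the 6 toxicity labels
targets = torch.tensor([[1, 0, 1, 0, 1, 0],
                        [0, 1, 0, 1, 0, 1],
                        [1, 1, 0, 0, 1, 1],
                        [0, 0, 1, 1, 0, 0]], dtype=torch.float)

# Multi-label setup: one sigmoid per label, so Binary Cross Entropy (with logits)
# is appropriate, not Categorical Cross Entropy over mutually exclusive classes.
loss = nn.BCEWithLogitsLoss()(logits, targets)

probs = torch.sigmoid(logits).detach().numpy()
print(loss.item(), roc_auc_score(targets.numpy(), probs, average="macro"))
```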

Converted the toxic comment classification into an app using Streamlit. The pre-trained model is now available.

Can artificial neural networks judge the grammatical acceptability of a sentence? To explore this task, the Corpus of Linguistic Acceptability (CoLA) dataset is used. CoLA is a set of sentences labeled as grammatically correct or incorrect.
Following variants have been explored:
BERT obtained new state-of-the-art results on eleven natural language processing tasks. Transfer learning in NLP took off after the release of the BERT model. In this notebook, we will explore how to use BERT to classify whether a sentence is grammatically correct or not, using the CoLA dataset.

An accuracy of 85% and a Matthews Correlation Coefficient (MCC) of 64.1 were achieved.
Distillation: a technique you can use to compress a large model, called the teacher, into a smaller model, called the student. Following the teacher-student setup, distillation was performed on CoLA using a teacher model.

Following experiments have been tried:
- Accuracy: 84.06, MCC: 61.5
- Accuracy: 82.54, MCC: 57
- Accuracy: 82.92, MCC: 57.9
Named-entity recognition (NER) tagging is the task of tagging each word with an appropriate entity.
Following variants have been explored:
This code covers the basic workflow. We'll see how to: load data, create train/test/validation splits, build a vocabulary, create data iterators, define a model, implement the train/evaluate/test loop, and train and test the model.
The model used is a bidirectional LSTM network.

For sequence tagging (NER), the tag of the current word may depend on the tag of the previous word (for example: New York).
Without a CRF, we would simply use a single linear layer to transform the output of the bidirectional LSTM into scores for each tag. These are known as emission scores, which are a representation of the likelihood of the word being a certain tag.
A CRF calculates not only the emission scores but also the transition scores, which are the likelihood of a word being a certain tag given that the previous word was a certain tag. Therefore the transition scores measure how likely it is to transition from one tag to another.

For decoding, the Viterbi algorithm is used.
Since we're using a CRF, we're not so much predicting the right tag for each word as we are predicting the right tag sequence for the word sequence. Viterbi decoding is a way to do exactly this: find the optimal tag sequence from the scores computed by the Conditional Random Field.
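A minimal sketch with the third-party pytorch-crf package (assumed here purely for illustration; the notebook may implement the CRF from scratch):

```python
import torch
from torchcrf import CRF   # pip install pytorch-crf (assumed dependency for this sketch)

num_tags, seq_len, batch = 5, 7, 2
emissions = torch.randn(seq_len, batch, num_tags)    # per-word tag scores, e.g. from a BiLSTM
tags = torch.randint(0, num_tags, (seq_len, batch))

crf = CRF(num_tags)
loss = -crf(emissions, tags)        # negative log-likelihood using emission + transition scores
best_paths = crf.decode(emissions)  # Viterbi decoding: best tag sequence for each sentence

print(loss.item(), best_paths)
```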

Sub-word information is used in our tagging task because it can be a powerful indicator of the tags, whether they are parts of speech or entities. For example, the model may learn that adjectives commonly end with "-y" or "-ful", or that place names often end with "-land" or "-burg".
Therefore, our sequence tagging model uses both word-level information and character-level information, with the characters of each word encoded in both directions.
Micro and macro averages (for whatever metric) compute slightly different things, and thus their interpretation differs. A macro-average computes the metric independently for each class and then takes the average (hence treating all classes equally), whereas a micro-average aggregates the contributions of all classes to compute the average metric. In a multi-class classification setup, the micro-average is preferable if you suspect there might be class imbalance (i.e., you may have many more examples of one class than of the other classes).
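For example, with scikit-learn (toy labels with a dominant class):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2]   # class 0 dominates
y_pred = [0, 0, 0, 0, 0, 1, 1, 2, 2]

print("micro F1:", f1_score(y_true, y_pred, average="micro"))  # pools all classes' counts
print("macro F1:", f1_score(y_true, y_pred, average="macro"))  # per-class F1, then unweighted mean
```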


Converted the NER tagging into an app using Streamlit. The pre-trained model (Char-BiLSTM-CRF) is now available.

After trying the RNN approach, NER tagging with a Transformer-based architecture was explored. The Transformer contains both an encoder and a decoder, but for a sequence labelling task the encoder alone is sufficient. A 3-layer Transformer encoder model was used.

After trying NER tagging with a Transformer encoder, NER tagging with a pre-trained bert-base-cased model was explored.

Having only a Transformer does not give good results compared to the BiLSTM in the NER tagging task. A CRF layer on top of the Transformer was implemented, which improves the results compared to the standalone Transformer.

spaCy provides a very efficient statistical system for NER in Python, which can assign labels to groups of tokens. It provides a default model which can recognize a wide range of named or numerical entities, including persons, organizations, languages, events, and more.

Apart from these default entities, spaCy also gives us the liberty to add arbitrary classes to the NER model by training it with newer examples.
Two new entities called ACTIVITY and SERVICE, for domain-specific data (banking), were created and the model was trained with just a few training samples.

| Name Generation | Machine Translation | Utterance Generation |
| Image Captioning | Image Captioning - LaTeX Equations | News Summarization |
| Email Subject Generation |
A character-level LSTM language model is used. That is, we give the LSTM a huge chunk of names and ask it to model the probability distribution of the next character in the sequence given a sequence of previous characters. This then allows us to generate new names one character at a time.

Machine Translation (MT) is the task of automatically converting one natural language into another, preserving the meaning of the input text and producing fluent text in the output language. Ideally, a source language sequence is translated into a target language sequence. The task here is to translate sentences from German to English.
Following variants have been explored:
The most common sequence-to-sequence (seq2seq) models are encoder-decoder models, which commonly use a recurrent neural network (RNN) to encode the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a context vector. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then decoded by a second RNN, which learns to output the target (output) sentence by generating it one word at a time.

After trying the basic machine translation model, which has a test perplexity of 36.68, the following techniques have been experimented with, achieving a test perplexity of 7.041.

Check the code in the applications/generation folder.
The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder's last hidden state, attention is used to focus more on the relevant parts of the input while decoding a sentence. The context vector is created by taking the encoder outputs and the previous hidden state of the decoder RNN.

Enhancements like masking (ignoring the attention over padded input), pack padded sequences (for better computation), attention visualization, and the BLEU metric on the test data are implemented.

The Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output, is used to do machine translation from German to English.

Added run-time translation (inference) and attention visualization for the Transformer-based machine translation model.

Utterance generation is an important problem in NLP, especially in question answering, information retrieval, information extraction, and dialogue systems. It can also be used to create synthetic training data for many NLP problems.
Following variants have been explored:
The most commonly used model for this kind of application is a sequence-to-sequence network. A basic 2-layer LSTM was used.

The attention mechanism helps in memorizing long sentences. Rather than building a single context vector out of the encoder's last hidden state, attention is used to focus more on the relevant parts of the input while decoding a sentence. The context vector is created by taking the encoder outputs and the hidden state of the decoder RNN.
After trying the basic LSTM approach, utterance generation with the attention mechanism was implemented. Inference (run-time generation) was also implemented.

While generating a word in the utterance, the decoder attends over the encoder inputs to find the most relevant word. This process can be visualized.

One of the ways to mitigate repetition in the generated utterances is to use beam search. Choosing the top-scored word at each step (greedy decoding) may lead to a sub-optimal solution, whereas choosing a lower-scored word at some step may lead to a better overall solution.
Instead of greedily choosing the most likely next step as the sequence is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.
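A minimal, library-free sketch of the idea (the step function below is a made-up stand-in for the decoder's next-token distribution):

```python
import math

def beam_search(step_fn, start_token, end_token, k=3, max_len=10):
    beams = [([start_token], 0.0)]                 # (sequence, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                finished.append((seq, score))
                continue
            for tok, p in step_fn(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda b: b[1], reverse=True)[:k]   # keep the k best
    return sorted(finished or beams, key=lambda b: b[1], reverse=True)

def toy_step(prefix):
    # Toy next-token probabilities keyed on the last token (hypothetical numbers).
    table = {"<s>": {"how": 0.6, "what": 0.4},
             "how": {"do": 0.7, "</s>": 0.3},
             "what": {"is": 0.9, "</s>": 0.1},
             "do": {"</s>": 1.0},
             "is": {"</s>": 1.0}}
    return table[prefix[-1]]

print(beam_search(toy_step, "<s>", "</s>", k=2))
```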

Repetition is a common problem for sequence-to-sequence models, and is especially pronounced when generating multi-sentence text. In the coverage model, we maintain a coverage vector c^t, which is the sum of the attention distributions over all previous decoder timesteps.
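In symbols, with a^{t'} denoting the attention distribution at decoder step t':

```latex
c^{t} = \sum_{t'=0}^{t-1} a^{t'}
```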

This ensures that the attention mechanism's current decision (choosing where to attend next) is informed by a reminder of its previous decisions (summarized in c^t). This should make it easier for the attention mechanism to avoid repeatedly attending to the same locations, and thus avoid generating repetitive text.
The Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output, is used to generate an utterance from a given sentence. Training was also about 4x faster compared to the RNN-based architecture.

Added beam search to utterance generation with Transformers. With beam search, the generated utterances are more diverse and there can be more than one (unlike the greedy approach). This implementation was better than the naive one implemented previously.

Utterance generation using BPE tokenization instead of spaCy is implemented.
Today, subword tokenization schemes inspired by BPE have become the norm in most advanced models, including the very popular family of contextual language models like BERT, GPT-2, RoBERTa, etc.
BPE brings the perfect balance between character and word-level hybrid representations which makes it capable of managing large corpora. This behavior also enables the encoding of any rare words in the vocabulary with appropriate subword tokens without introducing any “unknown” tokens.
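One way to train such a subword vocabulary is with the Hugging Face tokenizers library (shown as an illustration; the notebook's BPE setup may differ, and corpus.txt is a placeholder path):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Train a small BPE vocabulary on a plain-text corpus file.
trainer = BpeTrainer(vocab_size=8000, special_tokens=["[UNK]", "[PAD]", "[SOS]", "[EOS]"])
tokenizer.train(files=["corpus.txt"], trainer=trainer)

# Rare words are split into known subword units instead of becoming [UNK].
print(tokenizer.encode("how do i paraphrase this sentence").tokens)
```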

Converted the Utterance Generation into an app using streamlit. The pre-trained model trained on the Quora dataset is available now.

Until now, utterance generation was trained using the Quora Question Pairs dataset, which contains sentences in the form of questions. When given a normal sentence (not in question format), the generated utterances are very poor. This is due to the bias induced by the dataset: since the model is only trained on question-type sentences, it fails to generate utterances for normal sentences. In order to generate utterances for normal sentences, the COCO dataset is used to train the model.

Image Captioning is the process of generating a textual description of an image. It uses both Natural Language Processing and Computer Vision techniques to generate the captions.
Flickr8K dataset is used. It contains 8092 images, each image having 5 captions.
Following variants have been explored:
The encoder-decoder framework is widely used for this task. The image encoder is a convolutional neural network (CNN). The decoder is a recurrent neural network(RNN) which takes in the encoded image and generates the caption.
In this notebook, the resnet-152 model pretrained on the ILSVRC-2012-CLS image classification dataset is used as the encoder. The decoder is a long short-term memory (LSTM) network.

In this notebook, the resnet-101 model pretrained on the ILSVRC-2012-CLS image classification dataset is used as the encoder. The decoder is a long short-term memory (LSTM) network. Attention is implemented. Instead of the simple average, we use the weighted average across all pixels, with the weights of the important pixels being greater. This weighted representation of the image can be concatenated with the previously generated word at each step to generate the next word of the caption.

Instead of greedily choosing the most likely next step as the caption is constructed, the beam search expands all possible next steps and keeps the k most likely, where k is a user-specified parameter and controls the number of beams or parallel searches through the sequence of probabilities.

Today, subword tokenization schemes inspired by BPE have become the norm in most advanced models including the very popular family of contextual language models like BERT, GPT-2,RoBERTa, etc.
BPE brings the perfect balance between character and word-level hybrid representations which makes it capable of managing large corpora. This behavior also enables the encoding of any rare words in the vocabulary with appropriate subword tokens without introducing any “unknown” tokens.
BPE was used in order to tokenize the captions instead of using nltk.

An application of image captioning is to convert the equation present in an image to LaTeX format.
Following variants have been explored:
An application of image captioning is to convert the equation present in an image to LaTeX format. A basic sequence-to-sequence model is used: a CNN as the encoder and an RNN as the decoder. The Im2latex dataset is used. It contains 100K samples comprising training, validation, and test splits.

The generated formulas are not great. The following notebooks will explore techniques to improve them.
Latex code generation using the attention mechanism is implemented. Instead of the simple average, we use the weighted average across all pixels, with the weights of the important pixels being greater. This weighted representation of the image can be concatenated with the previously generated word at each step to generate the next word of the formula.

Added beam search in the decoding process. Also added Positional encoding to the input image and learning rate scheduler.
Converted the Latex formula generation into an app using streamlit.

Automatic text summarization is the task of producing a concise and fluent summary while preserving key information content and overall meaning. Have you come across the mobile app Inshorts? It's an innovative news app that converts news articles into a 60-word summary. And that is exactly what we are going to do in this notebook. The model used for this task is T5.

Given the overwhelming number of emails, an effective subject line becomes essential to better inform the recipient of the email's content.
Email subject generation using T5 model was explored. AESLC dataset was used for this purpose.

| Topic Identification in News | Covid Article finding |
Topic identification is a Natural Language Processing (NLP) task of automatically extracting meaning from texts by identifying recurrent themes or topics.
Following variants have been explored:
LDA's approach to topic modeling is that it considers each document as a collection of topics in a certain proportion, and each topic as a collection of keywords, again in a certain proportion.
Once you provide the algorithm with the number of topics, all it does is rearrange the topic distribution within the documents and the keyword distribution within the topics to obtain a good composition of the topic-keyword distribution.
The 20 Newsgroups dataset was used, and only the articles are provided to identify the topics. Topic modelling algorithms will provide, for each topic, the important words. It is up to us to infer the topic name.

Choosing the number of topics is a difficult job in topic modelling. In order to choose the optimal number of topics, a grid search is performed over various hyperparameters. The model having the best perplexity score is chosen as the best model.
A good topic model will have non-overlapping, fairly big sized blobs for each topic.

We would clearly expect that the words that appear most frequently in one topic would appear less frequently in the other - otherwise that word wouldn't make a good choice to separate out the two topics. Therefore, we expect the topics to be orthogonal .
Latent Semantic Analysis (LSA) uses SVD. You will sometimes hear topic modelling referred to as LSA.
The SVD algorithm factorizes a matrix into one matrix with orthogonal columns and one with orthogonal rows (along with a diagonal matrix, which contains the relative importance of each factor).

Note:
Finding the relevant article from a covid-19 research article corpus of 50K+ documents using LDA is explored.
The documents are first clustered into different topics using LDA. For a given query, the dominant topic will be found using the trained LDA. Once the topic is found, the most relevant articles will be fetched using the Jensen-Shannon distance.
Only abstracts are used for the LDA model training. LDA model was trained using 35 topics.
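The ranking step boils down to something like this (toy topic distributions; SciPy's jensenshannon is used for the distance):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

# Toy LDA topic distributions over 5 topics (made-up numbers).
query_topics = np.array([0.70, 0.10, 0.05, 0.10, 0.05])
article_topics = np.array([[0.65, 0.15, 0.05, 0.10, 0.05],
                           [0.05, 0.05, 0.10, 0.70, 0.10],
                           [0.60, 0.10, 0.10, 0.10, 0.10]])

# Smaller Jensen-Shannon distance = more relevant article for the query's topic mix.
distances = [jensenshannon(query_topics, doc) for doc in article_topics]
print(np.argsort(distances))   # article indices, most relevant first
```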

| Factual Question Answering | Visual Question Answering | Boolean Question Answering |
| Closed Question Answering |
Given a set of facts, questions concerning them need to be answered. The dataset used is bAbI, which has 20 tasks with an amalgamation of inputs, queries, and answers. See the following figure for a sample.

Following variants have been explored:
Dynamic Memory Network (DMN) is a neural network architecture which processes input sequences and questions, forms episodic memories, and generates relevant answers.

The main difference between DMN+ and DMN is the improved InputModule, which calculates the facts from the input sentences while keeping in mind the exchange of information between input sentences using a bidirectional GRU, and an improved version of the MemoryModule using an attention-based GRU model.

Visual Question Answering (VQA) is the task of, given an image and a natural language question about the image, providing an accurate natural language answer.
Following variants have been explored:

The model uses a two layer LSTM to encode the questions and the last hidden layer of VGGNet to encode the images. The image features are then l_2 normalized. Both the question and image features are transformed to a common space and fused via element-wise multiplication, which is then passed through a fully connected layer followed by a softmax layer to obtain a distribution over answers.
To apply the DMN to visual question answering, input module is modified for images. The module splits an image into small local regions and considers each region equivalent to a sentence in the input module for text.
The input module for VQA is composed of three parts, illustrated in the figure below:

Boolean question answering is the task of answering whether the question has an answer present in the given context or not. The BoolQ dataset contains queries for complex, non-factoid information, which require difficult entailment-like inference to solve.

The Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.
Following variants have been explored:
The DCN first fuses co-dependent representations of the question and the document in order to focus on relevant parts of both. Then a dynamic pointing decoder iterates over potential answer spans. This iterative procedure enables the model to recover from initial local maxima corresponding to incorrect answers.
The Dynamic Coattention Network has two major parts: a coattention encoder and a dynamic decoder.
CoAttention Encoder : The model first encodes the given document and question separately via the document and question encoder. The document and question encoders are essentially a one-directional LSTM network with one layer. Then it passes both the document and question encodings to another encoder which computes the coattention via matrix multiplications and outputs the coattention encoding from another bidirectional LSTM network.
Dynamic Decoder : Dynamic decoder is also a one-directional LSTM network with one layer. The model runs the LSTM network through several iterations . In each iteration, the LSTM takes in the final hidden state of the LSTM and the start and end word embeddings of the answer in the last iteration and outputs a new hidden state. Then, the model uses a Highway Maxout Network (HMN) to compute the new start and end word embeddings of the answer in each iteration.

Double Cross Attention (DCA) seems to provide better results compared to both BiDAF and the Dynamic Co-Attention Network (DCN). The motivation behind this approach is that first we pay attention to each of the context and the question, and then we attend to those attentions with respect to each other, in a slightly similar way to DCN. The intuition is that if we iteratively read/attend to both context and question, it should help us search for answers easily.
I have augmented the Dynamic Decoder part from the DCN model in order to have an iterative decoding process, which helps in finding a better answer.

| Covid-19 Browser |
There was a Kaggle challenge on COVID-19 research which has over 100,000+ documents. This freely available dataset is provided to the global research community to apply recent advances in natural language processing and other AI techniques to generate new insights in support of the ongoing fight against this infectious disease. There is a growing urgency for these approaches because of the rapid acceleration in new coronavirus literature, making it difficult for the medical research community to keep up.
The procedure I have taken is to convert the abstracts into an embedding representation using sentence-transformers. When a query is asked, it is converted into an embedding and then ranked across the abstracts using cosine similarity.

| Song Recommendation |
By taking a user's listening queue as a sentence, with each word in that sentence being a song that the user has listened to, training the Word2Vec model on those sentences essentially means that for each song the user has listened to in the past, we're using the songs they listened to before and after to teach our model that those songs somehow belong to the same context.

What's interesting about those vectors is that similar songs will have weights that are closer together than songs that are unrelated.
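A toy gensim sketch of that idea (playlists and song IDs are made up):

```python
from gensim.models import Word2Vec

# Each "sentence" is a user's listening queue; each "word" is a song ID.
playlists = [["song_a", "song_b", "song_c", "song_d"],
             ["song_b", "song_c", "song_e"],
             ["song_a", "song_d", "song_e", "song_f"]]

# gensim 4.x API; skip-gram (sg=1) pulls co-listened songs closer in the vector space.
model = Word2Vec(playlists, vector_size=32, window=3, min_count=1, sg=1, epochs=100)

print(model.wv.most_similar("song_c", topn=3))   # songs from similar listening contexts
```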