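Pretraining for Language Understanding

Pre-training a language model for language understanding is now an important step in NLP.

A language model is first trained on a large-scale corpus. It can then be used as a component in other models that need to handle language (e.g. for downstream tasks).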

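Overview

Language Model

A language model (LM) captures the distribution over all possible sentences.

Input: a sentence
Output: the probability of the input sentence

Although language modeling is typically unsupervised learning on a large-scale corpus, this repo turns it into a sequence of supervised learning problems.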
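As a minimal sketch of what "a sequence of supervised learning problems" can look like, the snippet below turns one tokenized sentence into (context, target) pairs. The function name `next_token_pairs` is hypothetical and not taken from the repo; the `<bos>`/`<eos>` markers follow the special tokens that appear in the vocabulary-building log later in this document.

```python
def next_token_pairs(tokens, bos="<bos>", eos="<eos>"):
    """Turn one tokenized sentence into (context, target) training pairs."""
    seq = [bos] + list(tokens) + [eos]
    # Each prefix of the sequence is an input; the token right after it is the label.
    return [(seq[:i], seq[i]) for i in range(1, len(seq))]


pairs = next_token_pairs(["the", "cat", "sat"])
for context, target in pairs:
    print(context, "->", target)
```

Every raw sentence therefore yields as many labeled examples as it has tokens (plus one for the end marker), with no manual annotation needed.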

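Autoregressive Language Model

An autoregressive language model captures the distribution over the next token conditioned on all previous tokens. In other words, it looks at the previous tokens and predicts the next one.

The objective of an autoregressive language model is expressed by the following formula:

$$\max_\theta \; \log p_\theta(x) = \sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

Because an autoregressive language model must run either forward or backward, it can only use context from one direction. It is therefore hard for it to understand context from both directions at the same time.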
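The left-to-right factorization described above can be illustrated with a toy count-based model. This is only a sketch: it conditions on the single previous token (a bigram model) rather than on all previous tokens as an LSTM would, and the helper names `train_bigram` and `sentence_log_prob` are made up for this example.

```python
import math
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count how often each token follows each previous token."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        seq = ["<bos>"] + sentence + ["<eos>"]
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def sentence_log_prob(counts, sentence):
    """Sum log P(token | previous token) from left to right (chain rule)."""
    seq = ["<bos>"] + sentence + ["<eos>"]
    log_p = 0.0
    for prev, nxt in zip(seq, seq[1:]):
        total = sum(counts[prev].values())
        p = counts[prev][nxt] / total if total else 0.0
        if p == 0.0:
            return float("-inf")  # unseen transition, no smoothing in this sketch
        log_p += math.log(p)
    return log_p


counts = train_bigram([["a", "b"], ["a", "c"]])
print(sentence_log_prob(counts, ["a", "b"]))
```

The model only ever looks leftward when scoring a token, which is exactly the one-directional limitation mentioned above.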

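RNNLM and ELMo are typical examples of autoregressive language models, and this repository covers unidirectional and bidirectional LSTM language models.

cf. Bidirectional LSTM LMs and ELMo use context from both directions. However, only a shallow understanding is possible, because they rely on contexts that are learned independently in each direction.
cf. For a detailed description of the model architectures, see the papers and repos listed in the References section below.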
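To make "context in each direction" concrete, here is a small sketch of what a forward and a backward LM would each be allowed to see when predicting the token at position `i`. The function `directional_contexts` is a hypothetical helper for illustration, not part of the repo.

```python
def directional_contexts(tokens, i):
    """Return the contexts available for predicting tokens[i]."""
    left = tokens[:i]               # forward LM: everything before position i
    right = tokens[i + 1:][::-1]    # backward LM: everything after i, read right to left
    return left, right


left, right = directional_contexts(["I", "like", "green", "tea"], 2)
print(left, right)
```

Each direction is processed on its own and the two summaries are only combined afterwards, which is why the resulting understanding is described as shallow.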

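1. Build Corpus

Wikipedia

Wikipedia regularly distributes dumps of all its documents. You can download the Korean Wikipedia dump here (and the English Wikipedia dump here). Wikipedia recommends pages-articles.xml.bz2, which contains only the current revision of every document and is about 600 MB compressed (for English, pages-articles-multistream.xml.bz2).

You can use the wikipedia_ko.sh script to download a dump of the latest Korean Wikipedia documents. For English, use wikipedia_en.sh.

Example:

$ cd build_corpus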
$ chmod 777 wikipedia_ko.sh
$ ./wikipedia_ko.sh

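The dump downloaded by the shell script above is in XML format, so we need to parse the XML into text files. The Python script WikiExtractor.py in the attardi/wikiextractor repo extracts and cleans text from the dump.

Example:

$ git clone https://github.com/attardi/wikiextractor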
$ python wikiextractor/WikiExtractor.py kowiki-latest-pages-articles.xml

$ head -n 4 text/AA/wiki_02
<doc id="577" url="https://ko.wikipedia.org/wiki?curid=577" title="천문학">
천문학

천문학(天文學, )은 별이나 행성, 혜성, 은하와 같은 천체와, 지구 대기의 ..
</doc>

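The extracted text is saved as a set of text files of a fixed size. To combine them into one file, use build_corpus.py. The output corpus.txt contains 4,277,241 sentences and 55,568,030 words.

Example:

$ python build_corpus.py > corpus.txt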
$ wc corpus.txt 
4277241  55568030 596460787 corpus.txt
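The internals of build_corpus.py are not shown here. As a rough sketch of the kind of cleanup such a merge step needs, the snippet below drops the `<doc ...>` and `</doc>` wrapper lines and blank lines from WikiExtractor output. The function name `clean_lines` and the regular expression are assumptions for illustration; the real script may also split paragraphs into sentences.

```python
import re

# Matches the opening <doc ...> and closing </doc> lines that WikiExtractor emits.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")


def clean_lines(raw_text):
    """Keep only non-empty text lines, dropping the <doc> wrapper lines."""
    lines = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        lines.append(line)
    return lines


sample = (
    '<doc id="577" url="https://ko.wikipedia.org/wiki?curid=577" title="천문학">\n'
    "천문학\n"
    "\n"
    "천문학(天文學, )은 별이나 행성, 혜성, 은하와 같은 천체와, 지구 대기의 ..\n"
    "</doc>\n"
)
print(clean_lines(sample))
```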

Now you need to split the corpus into training and test sets.

$ cat corpus.txt | shuf > corpus.shuf.txt
$ head -n 855448 corpus.shuf.txt > corpus.test.txt
$ tail -n 3421793 corpus.shuf.txt > corpus.train.txt
$ wc -l corpus.train.txt corpus.test.txt
  3421793 corpus.train.txt
   855448 corpus.test.txt
  4277241 합계
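The shell commands above hold out 855,448 of 4,277,241 lines, which is about 20% of the corpus. An equivalent split can be sketched in Python as follows; the function `split_corpus`, the fixed seed, and the 0.2 ratio are assumptions for this example, and for a corpus of this size the streaming shell version is likely the more memory-friendly choice.

```python
import random


def split_corpus(lines, test_ratio=0.2, seed=0):
    """Shuffle the lines and return (train, test) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


corpus = [f"sentence {i}" for i in range(10)]
train, test = split_corpus(corpus)
print(len(train), len(test))
```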

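2. Preprocessing

Build Vocab

Our corpus corpus.txt has 55,568,030 words, of which 608,221 are unique. If the minimum frequency required to include a token in the vocabulary is set to 3, the vocabulary contains 297,773 unique words.

Here we use the training corpus corpus.train.txt to build the vocabulary. The training corpus contains 557,627 unique words, of which 271,503 appear at least 3 times.

Example:

$ python build_vocab.py --corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --min_freq 3 --lower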
Namespace(bos_token='<bos>', corpus='build_corpus/corpus.train.txt', eos_token='<eos>', is_tokenized=False, lower=True, min_freq=3, pad_token='<pad>', tokenizer='mecab', unk_token='<unk>', vocab='vocab.train.pkl')
Vocabulary size:  271503
Vocabulary saved to vocab.train.pkl
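The frequency cutoff can be sketched with a `Counter`. This is an illustrative approximation, not the code of build_vocab.py: the function `build_vocab` below takes already-tokenized input instead of calling the mecab tokenizer, and placing the four special tokens at the front of the index is an assumption.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=3, lower=True,
                specials=("<pad>", "<unk>", "<bos>", "<eos>")):
    """Map each sufficiently frequent token to an integer id."""
    counter = Counter()
    for tokens in tokenized_sentences:
        counter.update(t.lower() if lower else t for t in tokens)
    # Special tokens come first, then every token seen at least min_freq times.
    itos = list(specials) + sorted(t for t, c in counter.items() if c >= min_freq)
    return {token: idx for idx, token in enumerate(itos)}


vocab = build_vocab([["A", "b"], ["a", "b"], ["a", "c"]], min_freq=3)
print(vocab)
```

Tokens below the cutoff are left out of the mapping and would be replaced by `<unk>` at encoding time.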

Since the vocabulary file is too large (~1.3 GB) to upload to this repository, I uploaded it to Google Drive.

vocab.train.pkl: [download]

3. Training

$ python lm_trainer.py -h
usage: lm_trainer.py [-h] --train_corpus TRAIN_CORPUS --vocab VOCAB
                     --model_type MODEL_TYPE [--test_corpus TEST_CORPUS]
                     [--is_tokenized] [--tokenizer TOKENIZER]
                     [--max_seq_len MAX_SEQ_LEN] [--multi_gpu] [--cuda CUDA]
                     [--epochs EPOCHS] [--batch_size BATCH_SIZE]
                     [--clip_value CLIP_VALUE] [--shuffle SHUFFLE]
                     [--embedding_size EMBEDDING_SIZE]
                     [--hidden_size HIDDEN_SIZE] [--n_layers N_LAYERS]
                     [--dropout_p DROPOUT_P]

optional arguments:
  -h, --help            show this help message and exit
  --train_corpus TRAIN_CORPUS
  --vocab VOCAB
  --model_type MODEL_TYPE
                        Model type selected in the list: LSTM, BiLSTM
  --test_corpus TEST_CORPUS
  --is_tokenized        Whether the corpus is already tokenized
  --tokenizer TOKENIZER
                        Tokenizer used for input corpus tokenization
  --max_seq_len MAX_SEQ_LEN
                        The maximum total input sequence length after
                        tokenization
  --multi_gpu           Whether to training with multiple GPU
  --cuda CUDA           Whether CUDA is currently available
  --epochs EPOCHS       Total number of training epochs to perform
  --batch_size BATCH_SIZE
                        Batch size for training
  --clip_value CLIP_VALUE
                        Maximum allowed value of the gradients. The gradients
                        are clipped in the range
  --shuffle SHUFFLE     Whether to reshuffle at every epoch
  --embedding_size EMBEDDING_SIZE
                        Word embedding vector dimension
  --hidden_size HIDDEN_SIZE
                        Hidden size of LSTM
  --n_layers N_LAYERS   Number of layers in LSTM
  --dropout_p DROPOUT_P
                        Dropout rate used for dropout layer in LSTM

Example:

$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --batch_size 16

You can choose your own parameter values through the command-line arguments.
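One of those arguments, `--max_seq_len`, fixes the length of every input sequence after tokenization. How lm_trainer.py pads and truncates is not shown in this document, so the following is only one common scheme, written as a sketch with a hypothetical `encode` function: wrap the sentence in `<bos>`/`<eos>`, map unknown tokens to `<unk>`, truncate, and right-pad with `<pad>`.

```python
def encode(tokens, vocab, max_seq_len=32):
    """Convert tokens to a fixed-length list of ids."""
    ids = [vocab["<bos>"]]
    ids += [vocab.get(t, vocab["<unk>"]) for t in tokens]
    # Truncate so that there is always room for the end-of-sentence marker.
    ids = ids[: max_seq_len - 1] + [vocab["<eos>"]]
    # Right-pad short sequences up to max_seq_len.
    ids += [vocab["<pad>"]] * (max_seq_len - len(ids))
    return ids


toy_vocab = {"<pad>": 0, "<unk>": 1, "<bos>": 2, "<eos>": 3, "a": 4}
print(encode(["a", "z"], toy_vocab, max_seq_len=6))
```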

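Multi-GPU Training

Training a model on a single GPU is not only very slow, it also limits how far you can scale the batch size, model size, and so on. To accelerate training with multiple GPUs and to use larger models, all you have to do is include the --multi_gpu flag. For more details, see here.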
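The core idea of data-parallel training is that each batch is cut into roughly equal chunks along the batch dimension, one chunk per GPU. The pure-Python sketch below only illustrates that splitting step; `scatter` here is a made-up helper, and the actual training code relies on a GPU-aware wrapper rather than on lists.

```python
import math


def scatter(batch, n_devices):
    """Split a batch into at most n_devices contiguous chunks."""
    if not batch:
        return []
    chunk = math.ceil(len(batch) / n_devices)
    return [batch[i:i + chunk] for i in range(0, len(batch), chunk)]


chunks = scatter(list(range(512)), 8)
print([len(c) for c in chunks])
```

With a batch size of 512 on 8 GPUs, each device would process 64 examples per step under this scheme.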

Training Unidirectional LSTM Language Model

This example trains a unidirectional LSTM model on the Wikipedia corpus using parallel training on 8 V100 GPUs.

$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --multi_gpu
Namespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='LSTM', multi_gpu=True, n_layers=3, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')
=========MODEL=========
 DataParallelModel(
  (module): LSTMLM(
    (embedding): Embedding(271503, 256)
    (lstm): LSTM(256, 1024, num_layers=3, batch_first=True, dropout=0.2)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (fc2): Linear(in_features=512, out_features=271503, bias=True)
    (softmax): LogSoftmax()
  )
)
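The layer shapes printed above are enough to estimate the model size. The sketch below counts trainable parameters from those shapes alone; the helper names are hypothetical, and the LSTM formula assumes the standard PyTorch layout of four gates with input weights, recurrent weights, and two bias vectors (bias_ih and bias_hh) per layer.

```python
def embedding_params(vocab_size, dim):
    return vocab_size * dim


def linear_params(in_features, out_features):
    return in_features * out_features + out_features  # weights + bias


def lstm_params(input_size, hidden_size, num_layers, bidirectional=False):
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(num_layers):
        # Layers after the first consume the (possibly concatenated) hidden state.
        layer_in = input_size if layer == 0 else hidden_size * directions
        per_direction = 4 * (layer_in * hidden_size
                             + hidden_size * hidden_size
                             + 2 * hidden_size)
        total += per_direction * directions
    return total


total = (embedding_params(271503, 256)
         + lstm_params(256, 1024, num_layers=3)
         + linear_params(1024, 512)
         + linear_params(512, 271503))
print(f"{total:,}")
```

Under these assumptions, the embedding table and the final projection onto the 271,503-word vocabulary account for the large majority of the parameters, which is one reason the vocabulary cutoff matters.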

Training Bidirectional LSTM Language Model

This example trains a bidirectional LSTM model on the Wikipedia corpus using parallel training on 8 V100 GPUs.

$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type BiLSTM --n_layers 1 --multi_gpu
Namespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='BiLSTM', multi_gpu=True, n_layers=1, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')
=========MODEL=========
 DataParallelModel(
  (module): BiLSTMLM(
    (embedding): Embedding(271503, 256)
    (lstm): LSTM(256, 1024, batch_first=True, dropout=0.2, bidirectional=True)
    (fc): Linear(in_features=2048, out_features=1024, bias=True)
    (fc2): Linear(in_features=1024, out_features=512, bias=True)
    (fc3): Linear(in_features=512, out_features=271503, bias=True)
    (softmax): LogSoftmax()
  )
)

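4. Evaluation

Perplexity

A language model captures the distribution over all possible sentences, and the best language model is the one that best predicts unseen sentences. Perplexity is a very common measure of how well a probability distribution predicts unseen sentences.

Perplexity: the inverse probability of the given sentence, normalized by the number of words (by taking the geometric mean)

$$\mathrm{PPL}(x) = p_\theta(x)^{-\frac{1}{T}} = \exp\left(-\frac{1}{T}\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})\right)$$

As the equation above shows, perplexity is defined as the exponentiated negative average log-likelihood. In other words, maximizing the probability is the same as minimizing the perplexity.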
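The definition above (exponentiated negative average log-likelihood) translates directly into a few lines of Python. This is a minimal sketch that assumes natural-log token probabilities; the function name `perplexity` is chosen for this example.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean log-probability (natural log) per token."""
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# A model that assigns probability 1/4 to every token is as uncertain
# as a uniform choice among 4 options.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```

If the reported loss is the average per-token negative log-likelihood in nats, the perplexity is simply `math.exp(loss)`; for example, `math.exp(3.496)` is roughly 33.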

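Results

Perplexity is the metric we use here. A low perplexity indicates that the probability distribution is good at predicting the sentences.

Model	Loss	Perplexity
Unidirectional-LSTM	3.496	33.037
Bidirectional-LSTM	1.896	6.669
Bidirectional-LSTM-Large (hidden_size = 1024)	1.771	5.887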

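References

General

[Google DeepMind] WaveNet: A Generative Model for Raw Audio
[Dan Jurafsky] CS 124: From Languages to Information at Stanford
[attardi/wikiextractor] WikiExtractor

Models

Unidirectional LSTM LM

[DSKSD] 6. Recurrent Neural Networks and Language Models
[yunjey/pytorch-tutorial] Language Model (RNN-LM)
[pytorch/examples] Word-level language modeling RNN

Bidirectional LSTM LM

[Mousa, Amr, and Björn Schuller] Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A Generative Approach to Sentiment Analysis
[Motoki Wu] The Bidirectional Language Model

Multi-GPU Training

[Matthew L] PyTorch Multi-GPU 제대로 학습하기
[zhanghang1989/PyTorch-Encoding] PyTorch-Encoding, Issue: How to use DataParallelCriterion, DataParallelModel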
