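Pretraining for Language Understanding

Pretraining a language model for language understanding is now an important step in NLP.

A language model is trained on a large-scale corpus, and can then be used as a component in other models that need to handle language (e.g. for downstream tasks).

Overview

Language Model

A language model (LM) captures the distribution over all possible sentences.

Input: a sentence
Output: the probability of the input sentence

Although language modeling is typically unsupervised learning on a large-scale corpus, this repo turns it into a sequence of supervised learning problems.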
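
One way to read "a sequence of supervised learning problems" is that every position in a sentence yields an (input, target) pair: the tokens seen so far are the input and the next token is the label. The sketch below illustrates this with whitespace tokenization. The function name `next_token_pairs` and the way `<bos>`/`<eos>` are attached are assumptions for illustration, not the repo's actual data pipeline.

```python
def next_token_pairs(sentence, bos="<bos>", eos="<eos>"):
    """Turn one sentence into (context, next_token) training pairs."""
    tokens = [bos] + sentence.split() + [eos]
    # Each prefix of the sentence is an input; the token that follows it is the label.
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


if __name__ == "__main__":
    for context, target in next_token_pairs("the cat sat"):
        print(context, "->", target)
```

A three-word sentence gives four pairs, the last of which asks the model to predict the end-of-sentence marker.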

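Autoregressive Language Model

An autoregressive language model captures the distribution over the next token given all previous tokens. In other words, it looks at the previous tokens and predicts the next token.

The objective of an autoregressive language model is expressed by the following formula: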
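
In its standard form, and assuming a sentence $x = (x_1, \dots, x_T)$ with model parameters $\theta$ (this notation is an assumption, not taken from the source), the forward objective factorizes the sentence probability with the chain rule and maximizes the log-likelihood:

```latex
P(x) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1}),
\qquad
\max_{\theta} \; \sum_{t=1}^{T} \log P_{\theta}(x_t \mid x_1, \dots, x_{t-1})
```

A backward model has the same form with the conditioning reversed, $P(x_t \mid x_{t+1}, \dots, x_T)$.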

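Because an autoregressive language model must run either forward or backward, it can only use unidirectional context. It is therefore difficult to understand context in both directions at the same time.

RNNLM and ELMo are typical examples of autoregressive language models, and this repository covers unidirectional and bidirectional LSTM language models.

cf. The bidirectional LSTM LM and ELMo use context in both directions. However, only a shallow understanding is possible, because they use contexts that are learned independently for each direction.
cf. For a detailed description of the model architectures, see the papers and repos in the Reference section below.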

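1. Build Corpus

Wikipedia

Wikipedia regularly distributes its entire set of documents. You can download the Korean Wikipedia dump here (and the English Wikipedia dump here). Wikipedia recommends using pages-articles.xml.bz2, which contains only the latest version of each document and is about 600 MB compressed (for English, pages-articles-multistream.xml.bz2).

You can use the wikipedia_ko.sh script to download the dump of the latest Korean Wikipedia documents. For English, use wikipedia_en.sh.

Example: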

 $ cd build_corpus
$ chmod +x wikipedia_ko.sh
$ ./wikipedia_ko.sh

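The dump downloaded with the shell script above is in XML format, so we need to parse the XML into text files. The Python script WikiExtractor.py, from the attardi/wikiextractor repo, extracts and cleans text from the dump.

Example: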

 $ git clone https://github.com/attardi/wikiextractor
$ python wikiextractor/WikiExtractor.py kowiki-latest-pages-articles.xml

$ head -n 4 text/AA/wiki_02
<doc id="577" url="https://ko.wikipedia.org/wiki?curid=577" title="천문학">
천문학

천문학(天文學, )은 별이나 행성, 혜성, 은하와 같은 천체와, 지구 대기의 ..
</doc>

The extracted text is saved as text files of a fixed size. To combine them, use build_corpus.py. The resulting corpus.txt contains 4,277,241 sentences and 55,568,030 words.

Example:

 $ python build_corpus.py > corpus.txt
$ wc corpus.txt 
4277241  55568030 596460787 corpus.txt
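
To give a rough idea of what combining the extracted files involves, the sketch below walks a WikiExtractor output directory, drops the `<doc ...>` and `</doc>` wrapper lines shown in the sample above, and yields the remaining non-empty lines. It is an assumption-laden illustration, not the repo's build_corpus.py: the function name `iter_corpus_lines` is hypothetical, article titles are kept, and no sentence splitting is done, so its line counts would not match the figures reported here.

```python
import os


def iter_corpus_lines(extracted_dir):
    """Yield non-empty text lines from WikiExtractor output, skipping <doc> markup."""
    for root, dirs, files in os.walk(extracted_dir):
        dirs.sort()  # visit subdirectories (AA, AB, ...) in a stable order
        for name in sorted(files):
            with open(os.path.join(root, name), encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    # Drop blank lines and the <doc ...> / </doc> wrapper lines.
                    if not line or line.startswith("<doc ") or line == "</doc>":
                        continue
                    yield line


if __name__ == "__main__":
    import sys

    # Hypothetical usage: python combine_sketch.py text > corpus.txt
    if len(sys.argv) > 1:
        for text_line in iter_corpus_lines(sys.argv[1]):
            print(text_line)
```

Using a generator keeps memory use flat, which matters for a corpus of several hundred megabytes.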

Now you need to split the corpus into train and test sets (here roughly 80% train and 20% test).

 $ cat corpus.txt | shuf > corpus.shuf.txt
$ head -n 855448 corpus.shuf.txt > corpus.test.txt
$ tail -n 3421793 corpus.shuf.txt > corpus.train.txt
$ wc -l corpus.train.txt corpus.test.txt
  3421793 corpus.train.txt
   855448 corpus.test.txt
  4277241 total
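
The same shuffle-and-split step can be sketched in Python when a reproducible split is wanted, since `shuf` gives a different order on each run. The function name `split_corpus`, the default `test_ratio=0.2`, and the fixed seed are assumptions chosen to mirror the line counts above; this is not part of the repo.

```python
import random


def split_corpus(lines, test_ratio=0.2, seed=0):
    """Shuffle lines and split them into (train, test) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split reproducible
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    train, test = split_corpus([f"sentence {i}" for i in range(10)])
    print(len(train), len(test))
```

With 4,277,241 lines and a ratio of 0.2, `n_test` comes out to 855,448, the same test size used in the shell commands above.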

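2. Preprocessing

Build Vocab

Our corpus corpus.txt has 55,568,030 words, of which 608,221 are unique. If the minimum frequency needed to include a token in the vocabulary is set to 3, the vocabulary contains 297,773 unique words.

Here we use the train corpus corpus.train.txt to build the vocabulary. The vocabulary built from the train corpus contains 557,627 unique words, of which 271,503 appear at least 3 times.

Example: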

 $ python build_vocab.py --corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --min_freq 3 --lower
Namespace(bos_token='<bos>', corpus='build_corpus/corpus.train.txt', eos_token='<eos>', is_tokenized=False, lower=True, min_freq=3, pad_token='<pad>', tokenizer='mecab', unk_token='<unk>', vocab='vocab.train.pkl')
Vocabulary size:  271503
Vocabulary saved to vocab.train.pkl
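
The effect of `--min_freq` and `--lower` can be illustrated with a small frequency-filtered vocabulary. This is a minimal sketch under stated assumptions, not the repo's build_vocab.py: whitespace splitting stands in for the mecab tokenizer, the function names `build_vocab` and `encode` are hypothetical, and the ordering of the special tokens is arbitrary.

```python
from collections import Counter


def build_vocab(sentences, min_freq=3, lower=True,
                specials=("<pad>", "<unk>", "<bos>", "<eos>")):
    """Map each sufficiently frequent token to an integer id."""
    counts = Counter()
    for sentence in sentences:
        if lower:
            sentence = sentence.lower()
        counts.update(sentence.split())  # whitespace split stands in for a real tokenizer
    token_to_id = {tok: i for i, tok in enumerate(specials)}
    # most_common() sorts by descending frequency, so frequent tokens get small ids.
    for tok, freq in counts.most_common():
        if freq >= min_freq and tok not in token_to_id:
            token_to_id[tok] = len(token_to_id)
    return token_to_id


def encode(sentence, token_to_id, unk="<unk>"):
    """Convert a sentence to ids; assumes the vocabulary was built with lower=True."""
    return [token_to_id.get(tok, token_to_id[unk]) for tok in sentence.lower().split()]


if __name__ == "__main__":
    vocab = build_vocab(["a a a b", "A b c"], min_freq=3)
    print(vocab)
    print(encode("a c", vocab))
```

Tokens below the frequency threshold are not stored at all; at encoding time they fall back to the `<unk>` id, which is how rare words drop out of the model's output layer.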

Since the vocabulary file is too large (~1.3GB) to upload to this repository, I uploaded it to Google Drive.

vocab.train.pkl: [download]

3. Training

 $ python lm_trainer.py -h
usage: lm_trainer.py [-h] --train_corpus TRAIN_CORPUS --vocab VOCAB
                     --model_type MODEL_TYPE [--test_corpus TEST_CORPUS]
                     [--is_tokenized] [--tokenizer TOKENIZER]
                     [--max_seq_len MAX_SEQ_LEN] [--multi_gpu] [--cuda CUDA]
                     [--epochs EPOCHS] [--batch_size BATCH_SIZE]
                     [--clip_value CLIP_VALUE] [--shuffle SHUFFLE]
                     [--embedding_size EMBEDDING_SIZE]
                     [--hidden_size HIDDEN_SIZE] [--n_layers N_LAYERS]
                     [--dropout_p DROPOUT_P]

optional arguments:
  -h, --help            show this help message and exit
  --train_corpus TRAIN_CORPUS
  --vocab VOCAB
  --model_type MODEL_TYPE
                        Model type selected in the list: LSTM, BiLSTM
  --test_corpus TEST_CORPUS
  --is_tokenized        Whether the corpus is already tokenized
  --tokenizer TOKENIZER
                        Tokenizer used for input corpus tokenization
  --max_seq_len MAX_SEQ_LEN
                        The maximum total input sequence length after
                        tokenization
  --multi_gpu           Whether to training with multiple GPU
  --cuda CUDA           Whether CUDA is currently available
  --epochs EPOCHS       Total number of training epochs to perform
  --batch_size BATCH_SIZE
                        Batch size for training
  --clip_value CLIP_VALUE
                        Maximum allowed value of the gradients. The gradients
                        are clipped in the range
  --shuffle SHUFFLE     Whether to reshuffle at every epoch
  --embedding_size EMBEDDING_SIZE
                        Word embedding vector dimension
  --hidden_size HIDDEN_SIZE
                        Hidden size of LSTM
  --n_layers N_LAYERS   Number of layers in LSTM
  --dropout_p DROPOUT_P
                        Dropout rate used for dropout layer in LSTM

Example:

 $ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --batch_size 16

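You can choose your own parameter values through the command-line arguments.

Training with multiple GPUs

Training a model on a single GPU is not only very slow, it also limits how far you can adjust the batch size, model size, and so on. To speed up training with multiple GPUs and use larger models, all you have to do is include the --multi_gpu flag. For more details, check here.

Training the unidirectional LSTM language model

This example trains a unidirectional LSTM model on the Wikipedia corpus using parallel training on 8 * V100 GPUs.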

 $ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --multi_gpu
Namespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='LSTM', multi_gpu=True, n_layers=3, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')
=========MODEL=========
 DataParallelModel(
  (module): LSTMLM(
    (embedding): Embedding(271503, 256)
    (lstm): LSTM(256, 1024, num_layers=3, batch_first=True, dropout=0.2)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (fc2): Linear(in_features=512, out_features=271503, bias=True)
    (softmax): LogSoftmax()
  )
)
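
The printed architecture makes it possible to estimate where the parameters sit. The sketch below recomputes the counts from the layer shapes shown above, assuming the standard torch.nn.LSTM parameter layout (input and hidden weight matrices plus two bias vectors for each of the four gates). The helper names `lstm_params` and `linear_params` are hypothetical and the code does not import PyTorch.

```python
def lstm_params(input_size, hidden_size, num_layers):
    """Parameter count of a unidirectional LSTM with biases (torch.nn.LSTM layout)."""
    total = 0
    for layer in range(num_layers):
        in_size = input_size if layer == 0 else hidden_size
        # 4 gates, each with input weights, hidden weights and two bias vectors.
        total += 4 * hidden_size * (in_size + hidden_size) + 8 * hidden_size
    return total


def linear_params(in_features, out_features):
    """Weights plus bias of a fully connected layer."""
    return in_features * out_features + out_features


if __name__ == "__main__":
    vocab_size = 271503
    counts = {
        "embedding": vocab_size * 256,
        "lstm": lstm_params(256, 1024, 3),
        "fc": linear_params(1024, 512),
        "fc2": linear_params(512, vocab_size),
    }
    for name, n in counts.items():
        print(f"{name:>9}: {n:,}")
    print(f"    total: {sum(counts.values()):,}")
```

Under these assumptions the two vocabulary-sized layers (embedding and fc2) hold about 90% of the roughly 231 million parameters, which is one reason a large vocabulary makes multi-GPU training attractive.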

Training the bidirectional LSTM language model

This example trains a bidirectional LSTM model on the Wikipedia corpus using parallel training on 8 * V100 GPUs.

 $ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type BiLSTM --n_layers 1 --multi_gpu
Namespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='BiLSTM', multi_gpu=True, n_layers=1, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')
=========MODEL=========
 DataParallelModel(
  (module): BiLSTMLM(
    (embedding): Embedding(271503, 256)
    (lstm): LSTM(256, 1024, batch_first=True, dropout=0.2, bidirectional=True)
    (fc): Linear(in_features=2048, out_features=1024, bias=True)
    (fc2): Linear(in_features=1024, out_features=512, bias=True)
    (fc3): Linear(in_features=512, out_features=271503, bias=True)
    (softmax): LogSoftmax()
  )
)

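4. Evaluation

Perplexity

A language model captures the distribution over all possible sentences, and the best language model is the one that best predicts an unseen sentence. Perplexity is a very common measure of how well a probability distribution predicts unseen sentences.

Perplexity: the inverse probability of the given sentence, normalized by the number of words (by taking the geometric mean)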
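
Written out for a sentence $x = (x_1, \dots, x_T)$ (the symbols are an assumption, chosen to match the usual textbook notation), this definition and its equivalent log-likelihood form are:

```latex
PP(x) = P(x_1, x_2, \dots, x_T)^{-\frac{1}{T}}
      = \exp\left( -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, \dots, x_{t-1}) \right)
```

The second equality follows from applying the chain rule to the joint probability and taking the logarithm.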

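Equivalently, perplexity is the exponentiated negative average log-likelihood. In other words, maximizing the probability is the same as minimizing the perplexity.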
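
As a concrete illustration, perplexity can be computed from the log-probability a model assigns to each token. This is a minimal sketch assuming natural-log probabilities; the function name `perplexity` is hypothetical and the repo's own evaluation code may aggregate over batches differently.

```python
import math


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


if __name__ == "__main__":
    # A model that gives every token probability 1/4 is as uncertain as a 4-way guess.
    print(perplexity([math.log(0.25)] * 5))
```

If the reported loss is the average negative log-likelihood per token, then exp(loss) should land close to the reported perplexity; for example, exp(3.496) is about 33.0.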

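Results

Perplexity is the metric we use here. A low perplexity indicates that the probability distribution is good at predicting the sentence.

Model	Loss	Perplexity
Unidirectional-LSTM	3.496	33.037
Bidirectional-LSTM	1.896	6.669
Bidirectional-LSTM-Large (hidden_size = 1024)	1.771	5.887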

Reference

General

[Google DeepMind] WaveNet: A Generative Model for Raw Audio
[Dan Jurafsky] CS 124: From Languages to Information at Stanford
[attardi/wikiextractor] WikiExtractor

Models

Unidirectional LSTM LM

[DSKSD] 6. Recurrent Neural Networks and Language Models
[yunjey/pytorch-tutorial] Language Model (RNN-LM)
[pytorch/examples] Word-level language modeling RNN

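Bidirectional LSTM LM

[Mousa, Amr, and Björn Schuller] Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A Generative Approach to Sentiment Analysis
[Motoki Wu] The Bidirectional Language Model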

Multi GPU Training

[Matthew L] PyTorch Multi-GPU 제대로 학습하기
[zhanghang1989/PyTorch-Encoding] PyTorch-Encoding, Issue: How to use DataParallelCriterion, DataParallelModel
