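Pretraining for Language Understanding

Pre-training a language model for language understanding is now an important step in NLP.

A language model is trained on a large corpus and can then be used as a component of other models that need to process language (for example, in downstream tasks).

Overview

Language Model

A language model (LM) captures the distribution over all possible sentences.

Input: a sentence
Output: the probability of the input sentence

Language modeling is typically unsupervised learning on a huge corpus, but this repo turns it into a sequence of supervised learning problems.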

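Autoregressive Language Model

An autoregressive language model captures the distribution of the next token conditioned on all previous tokens. In other words, it looks at the previous tokens and predicts the next one.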
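
As a minimal sketch of how next-token prediction becomes a supervised problem, the snippet below turns one tokenized sentence into (context, next token) pairs for a forward model. The function name next_token_pairs and the whitespace-style token list are assumptions made for illustration; they are not taken from the repo's code.

```python
from typing import List, Tuple

# Special tokens; the names are an assumption for this sketch.
BOS, EOS = "<bos>", "<eos>"


def next_token_pairs(tokens: List[str]) -> List[Tuple[List[str], str]]:
    """Return (context, next_token) pairs for a forward autoregressive LM."""
    sequence = [BOS] + tokens + [EOS]
    # Every prefix of the sequence is an input; the token right after it is the label.
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


if __name__ == "__main__":
    for context, target in next_token_pairs(["the", "cat", "sat"]):
        print(context, "->", target)
```

A sentence of n tokens yields n + 1 training pairs here, because the end-of-sentence token is also predicted.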

The objective of an autoregressive language model is expressed by the following formula:
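
Written out under assumed notation, the usual form of this objective is the chain-rule factorization of the sentence probability, trained by maximizing the log-likelihood. The symbols are chosen here for illustration: x = (x_1, ..., x_T) is a tokenized sentence, x_{<t} is the tokens before position t, and θ is the model parameters.

```latex
P_\theta(x) = \prod_{t=1}^{T} P_\theta(x_t \mid x_{<t}),
\qquad
\max_\theta \; \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t})
```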

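Because an autoregressive language model must run either forward or backward, it can use context from only one direction. It is therefore difficult for it to understand context in both directions at the same time.

RNNLM and ELMo are typical examples of autoregressive language models, and this repo covers unidirectional and bidirectional LSTM language models.

cf. Bidirectional LSTM LMs and ELMo use context in both directions. However, only a shallow understanding is possible, because the contexts they use are learned independently in each direction.
cf. For a detailed description of the model architectures, see the papers and repos in the Reference section below.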

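1. Build Corpus

Wikipedia

Wikipedia regularly distributes dumps of its entire document collection. You can download the Korean Wikipedia dump here (and the English Wikipedia dump here). Wikipedia recommends using pages-articles.xml.bz2 or pages-articles-multistream.xml.bz2, which contain only the latest version of each document.

You can use the wikipedia_ko.sh script to download a dump of the latest Korean Wikipedia documents. For English, use wikipedia_en.sh.

Example:

$ cd build_corpus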
$ chmod 777 wikipedia_ko.sh
$ ./wikipedia_ko.sh

The dump downloaded with the shell script above is in XML format, so the XML has to be parsed into text files. The Python script WikiExtractor.py, from the attardi/wikiextractor repository, extracts and cleans text from the dump.

Example:

$ git clone https://github.com/attardi/wikiextractor
$ python wikiextractor/WikiExtractor.py kowiki-latest-pages-articles.xml

$ head -n 4 text/AA/wiki_02
<doc id="577" url="https://ko.wikipedia.org/wiki?curid=577" title="천문학">
천문학

천문학(天文學, )은 별이나 행성, 혜성, 은하와 같은 천체와, 지구 대기의 ..
</doc>

The extracted text is saved as text files of a fixed size. To combine them, use build_corpus.py. The output corpus.txt contains 4,277,241 sentences and 55,568,030 words.

Example:

$ python build_corpus.py > corpus.txt
$ wc corpus.txt 
4277241  55568030 596460787 corpus.txt

Now the corpus has to be split into a training set and a test set.

$ cat corpus.txt | shuf > corpus.shuf.txt
$ head -n 855448 corpus.shuf.txt > corpus.test.txt
$ tail -n 3421793 corpus.shuf.txt > corpus.train.txt
$ wc -l corpus.train.txt corpus.test.txt
  3421793 corpus.train.txt
   855448 corpus.test.txt
  4277241 합계
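
The commands above hold out 855,448 of the 4,277,241 shuffled sentences, about 20%, as the test set. The same idea can be sketched in Python as follows. The function name split_corpus, the test_ratio parameter and the fixed seed are assumptions for illustration, not part of the repo.

```python
import random
from typing import List, Tuple


def split_corpus(
    lines: List[str], test_ratio: float = 0.2, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Shuffle sentences and split them into (train, test) lists."""
    shuffled = list(lines)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(len(shuffled) * test_ratio)
    # The first n_test shuffled lines form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    corpus = [f"sentence {i}" for i in range(10)]
    train, test = split_corpus(corpus)
    print(len(train), len(test))
```

Unlike shuf, this sketch loads every sentence into memory, so for a corpus of several hundred megabytes the shell pipeline may be the more practical choice.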

2. Preprocessing

Build Vocab

Our corpus corpus.txt has 55,568,030 words, of which 608,221 are unique. If the minimum frequency needed to include a token in the vocabulary is set to 3, the vocabulary contains 297,773 unique words.

Here we use the train corpus corpus.train.txt to build the vocabulary. The train corpus contains 557,627 unique words, of which 271,503 appear at least three times and make up the vocabulary.

Example:

$ python build_vocab.py --corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --min_freq 3 --lower
Namespace(bos_token='<bos>', corpus='build_corpus/corpus.train.txt', eos_token='<eos>', is_tokenized=False, lower=True, min_freq=3, pad_token='<pad>', tokenizer='mecab', unk_token='<unk>', vocab='vocab.train.pkl')
Vocabulary size:  271503
Vocabulary saved to vocab.train.pkl
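
To illustrate the effect of --min_freq and --lower, here is a minimal sketch of a frequency-thresholded vocabulary. It is not the repo's build_vocab.py: the function name, the ordering of the special tokens, and the use of pre-tokenized input in place of the mecab tokenizer are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List

# Special tokens placed first; this order is an assumption for the sketch.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def build_vocab(
    sentences: Iterable[List[str]], min_freq: int = 3, lower: bool = True
) -> Dict[str, int]:
    """Map each token seen at least min_freq times to an integer id."""
    counts: Counter = Counter()
    for tokens in sentences:
        counts.update(t.lower() if lower else t for t in tokens)
    vocab = {token: idx for idx, token in enumerate(SPECIAL_TOKENS)}
    # most_common() iterates by descending frequency, so frequent tokens get small ids.
    for token, freq in counts.most_common():
        if freq >= min_freq and token not in vocab:
            vocab[token] = len(vocab)
    return vocab


if __name__ == "__main__":
    data = [["A", "b"], ["a", "b"], ["a", "c"]]
    print(build_vocab(data, min_freq=2))
```

Tokens below the threshold, such as "c" in the usage example, receive no id of their own and would be mapped to the unknown token when a sentence is encoded.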

Because the vocabulary file is too large (~1.3GB) to upload to this repo, I uploaded it to Google Drive.

vocab.train.pkl: [download]

3. Training

$ python lm_trainer.py -h
usage: lm_trainer.py [-h] --train_corpus TRAIN_CORPUS --vocab VOCAB
                     --model_type MODEL_TYPE [--test_corpus TEST_CORPUS]
                     [--is_tokenized] [--tokenizer TOKENIZER]
                     [--max_seq_len MAX_SEQ_LEN] [--multi_gpu] [--cuda CUDA]
                     [--epochs EPOCHS] [--batch_size BATCH_SIZE]
                     [--clip_value CLIP_VALUE] [--shuffle SHUFFLE]
                     [--embedding_size EMBEDDING_SIZE]
                     [--hidden_size HIDDEN_SIZE] [--n_layers N_LAYERS]
                     [--dropout_p DROPOUT_P]

optional arguments:
  -h, --help            show this help message and exit
  --train_corpus TRAIN_CORPUS
  --vocab VOCAB
  --model_type MODEL_TYPE
                        Model type selected in the list: LSTM, BiLSTM
  --test_corpus TEST_CORPUS
  --is_tokenized        Whether the corpus is already tokenized
  --tokenizer TOKENIZER
                        Tokenizer used for input corpus tokenization
  --max_seq_len MAX_SEQ_LEN
                        The maximum total input sequence length after
                        tokenization
  --multi_gpu           Whether to training with multiple GPU
  --cuda CUDA           Whether CUDA is currently available
  --epochs EPOCHS       Total number of training epochs to perform
  --batch_size BATCH_SIZE
                        Batch size for training
  --clip_value CLIP_VALUE
                        Maximum allowed value of the gradients. The gradients
                        are clipped in the range
  --shuffle SHUFFLE     Whether to reshuffle at every epoch
  --embedding_size EMBEDDING_SIZE
                        Word embedding vector dimension
  --hidden_size HIDDEN_SIZE
                        Hidden size of LSTM
  --n_layers N_LAYERS   Number of layers in LSTM
  --dropout_p DROPOUT_P
                        Dropout rate used for dropout layer in LSTM

Example:

$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --batch_size 16

You can choose your own parameter values through the command-line arguments.

Training with multiple GPUs

Training a model on a single GPU is not only very slow, it also limits how far you can adjust the batch size, model size, and so on. To accelerate training with multiple GPUs and use larger models, all you have to do is include the --multi_gpu flag as shown below. For more details, see here.

Training the unidirectional LSTM language model

This example trains a unidirectional LSTM model on the Wikipedia corpus using parallel training on 8 * V100 GPUs.

$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type LSTM --multi_gpu
Namespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='LSTM', multi_gpu=True, n_layers=3, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')
=========MODEL=========
 DataParallelModel(
  (module): LSTMLM(
    (embedding): Embedding(271503, 256)
    (lstm): LSTM(256, 1024, num_layers=3, batch_first=True, dropout=0.2)
    (fc): Linear(in_features=1024, out_features=512, bias=True)
    (fc2): Linear(in_features=512, out_features=271503, bias=True)
    (softmax): LogSoftmax()
  )
)
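
The layer shapes printed above are enough for a rough parameter count. The sketch below assumes the standard PyTorch parameter layout: per LSTM layer, four gates each with an input weight, a hidden weight and two bias vectors, and for a linear layer a weight matrix plus a bias. The helper names are hypothetical.

```python
def embedding_params(vocab_size: int, dim: int) -> int:
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim


def linear_params(in_features: int, out_features: int) -> int:
    """Weight matrix plus bias vector."""
    return in_features * out_features + out_features


def lstm_params(input_size: int, hidden_size: int, num_layers: int) -> int:
    """Unidirectional LSTM: 4 gates x (W_ih + W_hh + b_ih + b_hh) per layer."""
    total = 0
    for layer in range(num_layers):
        # Layers after the first take the previous layer's hidden state as input.
        in_size = input_size if layer == 0 else hidden_size
        total += 4 * hidden_size * (in_size + hidden_size + 2)
    return total


if __name__ == "__main__":
    vocab_size = 271503
    parts = {
        "embedding": embedding_params(vocab_size, 256),
        "lstm": lstm_params(256, 1024, 3),
        "fc": linear_params(1024, 512),
        "fc2": linear_params(512, vocab_size),
    }
    for name, count in parts.items():
        print(f"{name:>9}: {count:,}")
    print(f"{'total':>9}: {sum(parts.values()):,}")
```

Under these assumptions the embedding and the output projection fc2 together account for roughly 90% of the parameters, which is why the vocabulary size, and hence --min_freq, has such a large effect on model size.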

Training the bidirectional LSTM language model

This example trains a bidirectional LSTM model on the Wikipedia corpus using parallel training on 8 * V100 GPUs.

$ python lm_trainer.py --train_corpus build_corpus/corpus.train.txt --vocab vocab.train.pkl --model_type BiLSTM --n_layers 1 --multi_gpu
Namespace(batch_size=512, clip_value=10, cuda=True, dropout_p=0.2, embedding_size=256, epochs=10, hidden_size=1024, is_tokenized=False, max_seq_len=32, model_type='BiLSTM', multi_gpu=True, n_layers=1, shuffle=True, test_corpus=None, tokenizer='mecab', train_corpus='build_corpus/corpus.train.txt', vocab='vocab.train.pkl')
=========MODEL=========
 DataParallelModel(
  (module): BiLSTMLM(
    (embedding): Embedding(271503, 256)
    (lstm): LSTM(256, 1024, batch_first=True, dropout=0.2, bidirectional=True)
    (fc): Linear(in_features=2048, out_features=1024, bias=True)
    (fc2): Linear(in_features=1024, out_features=512, bias=True)
    (fc3): Linear(in_features=512, out_features=271503, bias=True)
    (softmax): LogSoftmax()
  )
)

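4. Evaluation

Perplexity

A language model captures the distribution over all possible sentences, and the best language model is the one that best predicts unseen sentences. Perplexity is a very common measure of how well a probability distribution predicts unseen sentences.

Perplexity: the inverse probability of a given sentence, normalized by the number of words (by taking the geometric mean)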
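
In symbols, and assuming the notation x = (x_1, ..., x_T) for a sentence of T words and θ for the model parameters (symbols chosen here for illustration), this definition reads:

```latex
\mathrm{PPL}(x)
  = P_\theta(x)^{-1/T}
  = \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log P_\theta(x_t \mid x_{<t}) \right)
```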

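As the equation above shows, perplexity is defined as the exponential of the negative average log-likelihood. In other words, maximizing the probability is the same as minimizing the perplexity.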
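
A minimal sketch of that computation, assuming the model's per-token natural-log probabilities for a sentence are already available as a list of floats. The function name perplexity and its input format are assumptions for illustration.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Exponential of the negative mean per-token natural-log probability."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


if __name__ == "__main__":
    # A model that spreads its probability uniformly over 4 candidates at every step.
    uniform = [math.log(1 / 4)] * 5
    print(perplexity(uniform))
```

For the uniform example the result is approximately 4, which matches the intuition that perplexity behaves like the effective number of equally likely choices per token.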

Results

Perplexity is the metric used here. A lower perplexity indicates that the probability distribution is better at predicting sentences.

Model	Loss	Perplexity
Unidirectional-LSTM	3.496	33.037
Bidirectional-LSTM	1.896	6.669
Bidirectional-LSTM-Large (hidden_size = 1024)	1.771	5.887

Reference

General

[Google DeepMind] WaveNet: A Generative Model for Raw Audio
[Dan Jurafsky] CS 124: From Languages to Information at Stanford
[attardi/wikiextractor] WikiExtractor

Models

Unidirectional LSTM LM

[DSKSD] 6. Recurrent Neural Networks and Language Models
[yunjey/pytorch-tutorial] Language Model (RNN-LM)
[pytorch/examples] Word-level language modeling RNN

Bidirectional LSTM LM

[Mousa, Amr, and Björn Schuller] Contextual Bidirectional Long Short-Term Memory Recurrent Neural Network Language Models: A Generative Approach to Sentiment Analysis
[Motoki Wu] The Bidirectional Language Model

Multi GPU Training

[Matthew L] PyTorch Multi-GPU
[zhanghang1989/PyTorch-Encoding] PyTorch-Encoding, Issue: How to use the DataParallelCriterion, DataParallelModel
