Simple XLNet implementation with a PyTorch wrapper!
```shell
$ git clone https://github.com/graykode/xlnet-Pytorch && cd xlnet-Pytorch

# To use the subword tokenizer (huggingface's pretrained BERT tokenizer)
$ pip install pytorch_pretrained_bert

$ python main.py --data ./data.txt --tokenizer bert-base-uncased \
    --seq_len 512 --reuse_len 256 --perm_size 256 \
    --bi_data True --mask_alpha 6 --mask_beta 1 \
    --num_predict 85 --mem_len 384 --num_epoch 100
```

Also, you can easily run the code in Google Colab.
- `--data` (String) : A `.txt` file to train on. Multi-line text is fine; one file becomes one batch tensor. Default: `data.txt`
- `--tokenizer` (String) : huggingface/pytorch-pretrained-BERT's tokenizer is used as the subword tokenizer for now (it will be switched to SentencePiece soon). Choose from `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, or `bert-large-cased`. Default: `bert-base-uncased`
- `--seq_len` (Integer) : Sequence length. Default: `512`
- `--reuse_len` (Integer) : Number of tokens that can be reused as memory. Could be half of `seq_len`. Default: `256`
- `--perm_size` (Integer) : Length of the longest permutation. Could be set equal to `reuse_len`. Default: `256`
- `--bi_data` (Boolean) : Whether to create bidirectional data. If `bi_data` is `True`, `bsz` (batch size) must be an even number. Default: `False`
- `--mask_alpha` (Integer) : How many tokens form a group. Default: `6`
- `--mask_beta` (Integer) : How many tokens to mask within each group. Default: `1`
- `--num_predict` (Integer) : Number of tokens to predict. In the paper, this is called Partial Prediction. Default: `85`
- `--mem_len` (Integer) : Number of steps to cache in the Transformer-XL architecture. Default: `384`
- `--num_epoch` (Integer) : Number of epochs. Default: `100`
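For reference, here is a minimal sketch of how the options above might be declared with Python's `argparse`. The flag names and defaults mirror the list; the actual parser in `main.py` may be organized differently.

```python
# Hypothetical sketch of the CLI options above; not main.py's exact parser.
import argparse

def build_parser():
    p = argparse.ArgumentParser(description="Simple XLNet pretraining")
    p.add_argument('--data', type=str, default='data.txt',
                   help='.txt file to train on; one file becomes one batch tensor')
    p.add_argument('--tokenizer', type=str, default='bert-base-uncased',
                   choices=['bert-base-uncased', 'bert-large-uncased',
                            'bert-base-cased', 'bert-large-cased'])
    p.add_argument('--seq_len', type=int, default=512)
    p.add_argument('--reuse_len', type=int, default=256,
                   help='tokens reusable as memory; typically seq_len // 2')
    p.add_argument('--perm_size', type=int, default=256,
                   help='length of the longest permutation; typically reuse_len')
    p.add_argument('--bi_data', type=lambda s: s.lower() == 'true', default=False,
                   help='bidirectional data; batch size must be even when True')
    p.add_argument('--mask_alpha', type=int, default=6)
    p.add_argument('--mask_beta', type=int, default=1)
    p.add_argument('--num_predict', type=int, default=85)
    p.add_argument('--mem_len', type=int, default=384)
    p.add_argument('--num_epoch', type=int, default=100)
    return p

if __name__ == '__main__':
    args = build_parser().parse_args()
    print(args)
```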
XLNet is a new unsupervised language representation learning method based on a novel generalized permutation language modeling objective. Additionally, XLNet employs Transformer-XL as the backbone model, exhibiting excellent performance for language tasks involving long context.
| Model | MNLI | QNLI | QQP | RTE | SST-2 | MRPC | CoLA | STS-B |
|---|---|---|---|---|---|---|---|---|
| BERT | 86.6 | 92.3 | 91.3 | 70.4 | 93.2 | 88.0 | 60.6 | 90.0 |
| XLNet | 89.8 | 93.9 | 91.8 | 83.8 | 95.6 | 89.2 | 63.6 | 91.8 |
How does XLNet benefit from both Auto-Regressive (AR) and Auto-Encoding (AE) models?


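In short, following the XLNet paper: an AR model such as GPT factorizes the likelihood strictly left-to-right, so each token only sees one-sided context. An AE model such as BERT reconstructs masked tokens from a corrupted input, which gives bidirectional context but assumes the masked tokens are independent of each other and introduces a pretrain-finetune mismatch through the artificial `[MASK]` symbol. XLNet keeps the AR factorization (no `[MASK]`, no independence assumption) while still capturing bidirectional context through the permutation objective described below. The two conventional objectives, as stated in the paper:

```latex
% AR objective (e.g., GPT): forward factorization only
\max_{\theta}\; \log p_{\theta}(\mathbf{x})
  = \sum_{t=1}^{T} \log p_{\theta}(x_t \mid \mathbf{x}_{<t})

% AE objective (e.g., BERT): reconstruct the masked tokens \bar{x} from the
% corrupted input \hat{x}; m_t = 1 iff x_t is masked. The \approx marks
% BERT's independence assumption among the masked tokens.
\max_{\theta}\; \log p_{\theta}(\bar{\mathbf{x}} \mid \hat{\mathbf{x}})
  \approx \sum_{t=1}^{T} m_t \log p_{\theta}(x_t \mid \hat{\mathbf{x}})
```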
Permutation Language Modeling with Partial Prediction
Permutation Language Modeling

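Instead of always factorizing left-to-right, XLNet maximizes the expected log-likelihood over all permutations of the factorization order, so in expectation each token conditions on context from both sides while the model stays auto-regressive:

```latex
% Permutation language modeling objective
% (Z_T = the set of all permutations z of [1, ..., T])
\max_{\theta}\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T}
    \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
  \right]
```

Note that only the factorization order is permuted, never the token positions themselves; the permutation is realized through attention masks, as sketched under Two-Stream Self-Attention below.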
Partial Prediction

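To reduce optimization difficulty, only the last tokens in the factorization order are predicted. The order z is split at a cutting point c into a non-target part and a target part, and only the target part contributes to the loss; the paper's hyperparameter K controls the predicted fraction, with K ≈ |z| / (|z| - c). With the defaults above, `seq_len / num_predict` = 512 / 85 ≈ 6, matching the paper's K ≈ 6 (cf. `--mask_alpha 6` and `--mask_beta 1`: one token masked per group of six).

```latex
% Partial prediction: only the tokens after the cutting point c are targets
\max_{\theta}\; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \log p_{\theta}\!\left(\mathbf{x}_{\mathbf{z}_{>c}}
    \mid \mathbf{x}_{\mathbf{z}_{\le c}}\right) \right]
  = \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=c+1}^{|\mathbf{z}|}
    \log p_{\theta}\!\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]
```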
Two-Stream Self-Attention with Target-Aware Representation
Two-Stream Self-Attention

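With a permuted factorization order, each position needs two representations: a content stream h that encodes the token itself (used as keys and values), and a query stream g that may see the target's position but never its content (used to predict it). Both streams share parameters and differ only in their attention masks. Below is a minimal PyTorch sketch of how those two masks can be derived from a sampled order; names are illustrative, and the repo's actual implementation may differ.

```python
import torch

def two_stream_masks(perm):
    """Build attention masks for one sampled factorization order.

    perm : 1-D LongTensor, e.g. torch.randperm(seq_len); perm[k] is the
           index of the token that comes k-th in the factorization order.
    Returns boolean [seq_len, seq_len] masks where True = "may attend".
    """
    T = perm.size(0)
    # rank[i] = position of token i within the factorization order
    rank = torch.empty(T, dtype=torch.long)
    rank[perm] = torch.arange(T)
    # query stream g: token i may see only tokens strictly earlier
    # in the order -- never its own content
    g_mask = rank.unsqueeze(1) > rank.unsqueeze(0)
    # content stream h: earlier tokens plus itself
    h_mask = rank.unsqueeze(1) >= rank.unsqueeze(0)
    return h_mask, g_mask

h_mask, g_mask = two_stream_masks(torch.randperm(8))
# h may attend to itself; g never sees its own content
assert bool(h_mask.diagonal().all()) and not bool(g_mask.diagonal().any())
```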
Target-Aware Representation

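Why the query stream matters: with the standard softmax parameterization, the predicted distribution would not depend on which position z_t is being predicted, so two different target positions would receive the same distribution. XLNet therefore makes the representation target-position-aware: the query stream g takes the target position z_t (but not its content) as an extra input. In the paper's notation:

```latex
% Target-unaware (problematic): h does not depend on z_t, so every target
% position gets the same distribution
p_{\theta}\!\left(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
  = \frac{\exp\!\left(e(x)^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}
         {\sum_{x'} \exp\!\left(e(x')^{\top} h_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}})\right)}

% Target-aware: g additionally conditions on the target position z_t
p_{\theta}\!\left(X_{z_t} = x \mid \mathbf{x}_{\mathbf{z}_{<t}}\right)
  = \frac{\exp\!\left(e(x)^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
         {\sum_{x'} \exp\!\left(e(x')^{\top} g_{\theta}(\mathbf{x}_{\mathbf{z}_{<t}}, z_t)\right)}
```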