
This project provides an XLNet model pre-trained on Chinese text, aiming to enrich Chinese natural language processing resources and offer a wider choice of Chinese pre-trained models. We welcome researchers and practitioners to download and use it, and to jointly advance the development of Chinese NLP resources.
This project is based on CMU/Google's official XLNet: https://github.com/zihangdai/xlnet
Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation toolkit TextBrewer | Model pruning toolkit TextPruner
See more resources released by the Joint Laboratory of HIT and iFLYTEK Research (HFL): https://github.com/ymcui/HFL-Anthology
2023/3/28 We open-sourced the Chinese LLaMA & Alpaca large language models, which can be quickly deployed and tried on a PC. See: https://github.com/ymcui/Chinese-LLaMA-Alpaca
2022/10/29 We proposed LERT, a pre-trained model that incorporates linguistic information. See: https://github.com/ymcui/LERT
2022/3/30 We open-sourced a new pre-trained model, PERT. See: https://github.com/ymcui/PERT
2021/12/17 We released the model pruning toolkit TextPruner. See: https://github.com/airaria/TextPruner
2021/10/24 We released CINO, a pre-trained model for Chinese minority languages. See: https://github.com/ymcui/Chinese-Minority-PLM
2021/7/21 The book "Natural Language Processing: Methods Based on Pre-trained Models", written by members of HIT-SCIR, has been published; you are welcome to purchase it.
2021/1/27 All models now support TensorFlow 2. Please load or download them through the Transformers library: https://huggingface.co/hfl
2020/8/27 The HFL lab topped the GLUE benchmark for general natural language understanding; see the GLUE leaderboard and related news.
2020/3/11 To better understand your needs, please fill out our questionnaire so that we can provide better resources.
2020/2/26 We released the knowledge distillation toolkit TextBrewer.
2019/12/19 The models in this repository are now supported by Huggingface-Transformers; see Quick Load below.
2019/9/5 XLNet-base is now available; see Model Download.
2019/8/19 We released the Chinese XLNet-mid model trained on a large-scale general-domain corpus (5.4B words); see Model Download.
| Section | Description |
|---|---|
| Model Download | Download links for the pre-trained Chinese XLNet models |
| Baseline Results | Baseline results on several Chinese datasets |
| Pre-training Details | Notes on how the models were pre-trained |
| Fine-tuning Details | Notes on fine-tuning for downstream tasks |
| FAQ | Frequently asked questions and answers |
| Citation | Technical report for this repository |
XLNet-mid: 24-layer, 768-hidden, 12-heads, 209M parameters
XLNet-base: 12-layer, 768-hidden, 12-heads, 117M parameters

| Model | Training corpus | Google download | Baidu Netdisk download |
|---|---|---|---|
| XLNet-mid, Chinese | Chinese Wikipedia + general data [1] | TensorFlow / PyTorch | TensorFlow (password: 2jv2) |
| XLNet-base, Chinese | Chinese Wikipedia + general data [1] | TensorFlow / PyTorch | TensorFlow (password: ge7w) |
[1] The general data includes encyclopedia, news, Q&A, and other web text, totaling 5.4B words, the same corpus used for our previously released BERT-wwm-ext.
If you need the PyTorch version:
1) Convert the TensorFlow checkpoint yourself with the conversion script provided by Transformers, or
2) Download the PyTorch weights directly from the Hugging Face model hub: https://huggingface.co/hfl
Method: click the model you want to download → scroll to the bottom and click "List all files in model" → download the bin and json files from the pop-up window.
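If you prefer not to click through the web interface, the same files can be fetched programmatically. Below is a minimal sketch using the huggingface_hub package (an extra dependency, not part of this project); the repository ids are the MODEL_NAME values listed in the Quick Load section below.

```python
# Minimal sketch: download all files of a hosted model with huggingface_hub.
# Assumes `pip install huggingface_hub`; the repo id follows the MODEL_NAME table below.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="hfl/chinese-xlnet-mid")
print("Model files saved to:", local_dir)
```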
Users in mainland China are recommended to use the Baidu Netdisk links, and overseas users the Google links. The XLNet-mid model file is about 800MB. Taking the TensorFlow version of XLNet-mid, Chinese as an example, downloading and unzipping the file yields:
chinese_xlnet_mid_L-24_H-768_A-12.zip
    |- xlnet_model.ckpt      # model weights
    |- xlnet_model.meta      # model meta information
    |- xlnet_model.index     # model index information
    |- xlnet_config.json     # model configuration
    |- spiece.model          # vocabulary (SentencePiece model)
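The spiece.model file is a SentencePiece model that defines the vocabulary. As a quick sanity check after unpacking, it can be inspected with the sentencepiece Python package (a hedged illustration; the package is installed separately and is not shipped with this project):

```python
import sentencepiece as spm

# Load the vocabulary shipped with the checkpoint and inspect a tokenization.
sp = spm.SentencePieceProcessor()
sp.Load("chinese_xlnet_mid_L-24_H-768_A-12/spiece.model")
print(sp.GetPieceSize())                     # expected vocabulary size: 32000
print(sp.EncodeAsPieces("哈尔滨工业大学"))    # subword pieces for a sample string
```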
With Huggingface-Transformers (2.2.2 or later), the above models can be loaded easily.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")
The corresponding list of MODEL_NAME is as follows:
| Model name | MODEL_NAME |
|---|---|
| XLNet-mid | hfl/chinese-xlnet-mid |
| XLNet-base | hfl/chinese-xlnet-base |
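As a concrete usage illustration (assuming a recent version of Transformers, which returns output objects rather than tuples, and the sentencepiece package installed), loading XLNet-base and running a forward pass might look like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The XLNet tokenizer requires the sentencepiece package.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

# Encode a sample sentence and run a forward pass without gradients.
inputs = tokenizer("哈尔滨工业大学", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden=768)
```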
To compare with baseline models, we evaluated Chinese BERT, BERT-wwm, BERT-wwm-ext, XLNet-base, and XLNet-mid on the Chinese datasets below. The results for Chinese BERT, BERT-wwm, and BERT-wwm-ext are taken from the Chinese BERT-wwm project. Due to limited time and resources, we have not covered more task categories; please try other tasks yourself.
Note: to ensure reliable results, each model was run 10 times with different random seeds, and we report both the maximum and the average performance. Barring surprises, your own results should fall within this range.
In the tables below, the value outside the parentheses is the maximum and the value in parentheses is the average.
**CMRC 2018 dataset** is a Chinese machine reading comprehension dataset released by the Joint Laboratory of HIT and iFLYTEK Research. Given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD. Evaluation metrics: EM / F1
| Model | Development Set | Test set | Challenge Set |
|---|---|---|---|
| BERT | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) |
| BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) |
| BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) |
| XLNet-base | 65.2 (63.0) / 86.9 (85.9) | 67.0 (65.8) / 87.2 (86.8) | 25.0 (22.7) / 51.3 (49.5) |
| XLNet-mid | 66.8 (66.3) / 88.4 (88.1) | 69.3 (68.5) / 89.2 (88.8) | 29.1 (27.1) / 55.8 (54.9) |
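For reference, EM and F1 above follow the usual span-extraction definitions: EM is exact string match against the reference answer, and F1 measures character-level overlap. The sketch below is a simplified illustration, not the official CMRC 2018 evaluation script (which additionally normalizes punctuation and handles multiple references):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted answer string equals the reference exactly."""
    return float(prediction == reference)

def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between prediction and reference (simplified)."""
    common = Counter(prediction) & Counter(reference)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(exact_match("北京", "北京"), char_f1("北京市", "北京"))  # 1.0 0.8
```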
**DRCD dataset** is a traditional Chinese span-extraction machine reading comprehension dataset released by Delta Research Institute (Taiwan, China), in the same format as SQuAD. Evaluation metrics: EM / F1
| Model | Development Set | Test set |
|---|---|---|
| BERT | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) |
| BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) |
| BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) |
| XLNet-base | 83.8 (83.2) / 92.3 (92.0) | 83.5 (82.8) / 92.2 (91.8) |
| XLNet-mid | 85.3 (84.9) / 93.5 (93.3) | 85.5 (84.8) / 93.6 (93.2) |
For the sentiment classification task we used the ChnSentiCorp dataset, where the model must classify each text as positive (积极) or negative (消极). Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 94.7 (94.3) | 95.0 (94.7) |
| BERT-wwm | 95.1 (94.5) | 95.4 (95.0) |
| XLNet-base | - | - |
| XLNet-mid | 95.8 (95.2) | 95.4 (94.9) |
The following describes the pre-training details, using the XLNet-mid model as an example.
Following the official XLNet tutorial, you first need to generate a vocabulary with SentencePiece. In this project we used a vocabulary size of 32,000; the remaining parameters follow the defaults in the official example.
spm_train \
  --input=wiki.zh.txt \
  --model_prefix=sp10m.cased.v3 \
  --vocab_size=32000 \
  --character_coverage=0.99995 \
  --model_type=unigram \
  --control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
  --user_defined_symbols='<eop>,.,(,),",-,–,£,€' \
  --shuffle_input_sentence \
  --input_sentence_size=10000000
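Equivalently, the vocabulary can be trained through the SentencePiece Python API; the sketch below simply mirrors the spm_train flags above (assuming a pip-installed sentencepiece; argument handling may differ slightly across versions):

```python
import sentencepiece as spm

# Mirrors the spm_train flags above; adjust the input path to your corpus.
spm.SentencePieceTrainer.train(
    input="wiki.zh.txt",
    model_prefix="sp10m.cased.v3",
    vocab_size=32000,
    character_coverage=0.99995,
    model_type="unigram",
    control_symbols="<cls>,<sep>,<pad>,<mask>,<eod>",
    user_defined_symbols='<eop>,.,(,),",-,–,£,€',
    shuffle_input_sentence=True,
    input_sentence_size=10000000,
)
```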
After generating the vocabulary, the raw text corpus is converted into the training tf_records files. The raw text is organized in the same way as in the official tutorial: one sentence per line, with an empty line marking the end of a document.
The following command is used to generate the data (set num_task and task according to the actual number of corpus shards):
SAVE_DIR=./output_b32
INPUT="./data/*.proc.txt"
python data_utils.py \
  --bsz_per_host=32 \
  --num_core_per_host=8 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob="${INPUT}" \
  --save_dir=${SAVE_DIR} \
  --num_passes=20 \
  --bi_data=True \
  --sp_path=spiece.model \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --uncased=False \
  --num_task=10 \
  --task=1
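Since data_utils.py processes one shard per invocation (controlled by num_task and task), the full tf_records set is produced by running it once per shard. The following is a rough driver sketch under that assumption; check the data_utils.py you are using for whether task is 0- or 1-based:

```python
import subprocess

NUM_TASK = 10  # total number of corpus shards; set to your actual slice count

for task in range(NUM_TASK):  # verify 0- vs 1-based task ids in data_utils.py
    # Passing the glob as a single argument avoids shell expansion;
    # data_utils.py resolves the pattern itself.
    subprocess.run(
        ["python", "data_utils.py",
         "--bsz_per_host=32", "--num_core_per_host=8",
         "--seq_len=512", "--reuse_len=256",
         "--input_glob=./data/*.proc.txt", "--save_dir=./output_b32",
         "--num_passes=20", "--bi_data=True", "--sp_path=spiece.model",
         "--mask_alpha=6", "--mask_beta=1", "--num_predict=85",
         "--uncased=False",
         f"--num_task={NUM_TASK}", f"--task={task}"],
        check=True,
    )
```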
With the data above, XLNet pre-training officially begins. The model is called XLNet-mid because only the number of layers is increased relative to XLNet-base (from 12 to 24), while the remaining hyperparameters are unchanged, mainly due to limited computing resources. The command used is as follows:
DATA=YOUR_GS_BUCKET_PATH_TO_TFRECORDS
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
TPU_NAME=v3-xlnet
TPU_ZONE=us-central1-b
python train.py \
  --record_info_dir=$DATA \
  --model_dir=$MODEL_DIR \
  --train_batch_size=32 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=768 \
  --d_embed=768 \
  --n_head=12 \
  --d_head=64 \
  --d_inner=3072 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --uncased=False \
  --train_steps=2000000 \
  --save_steps=20000 \
  --warmup_steps=20000 \
  --max_save=20 \
  --weight_decay=0.01 \
  --adam_epsilon=1e-6 \
  --learning_rate=1e-4 \
  --dropout=0.1 \
  --dropatt=0.1 \
  --tpu=$TPU_NAME \
  --tpu_zone=$TPU_ZONE \
  --use_tpu=True
Downstream tasks were fine-tuned on a Google Cloud TPU v2 (64GB HBM). The following briefly describes the configuration used for each task. If you fine-tune on a GPU, please adapt the corresponding parameters, especially batch_size and learning_rate. For the related code, see the src directory.
For reading comprehension tasks, tf_records data must be generated first; please follow the SQuAD 2.0 processing steps in the official XLNet tutorial, which are not repeated here. The following are the script parameters used for the CMRC 2018 Chinese machine reading comprehension task:
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_cmrc_drcd.py \
  --spiece_model_file=./spiece.model \
  --model_config_path=${XLNET_DIR}/xlnet_config.json \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --tpu_zone=${TPU_ZONE} \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --output_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --predict_dir=${MODEL_DIR}/eval \
  --train_file=${DATA_DIR}/cmrc2018_train.json \
  --predict_file=${DATA_DIR}/cmrc2018_dev.json \
  --uncased=False \
  --max_answer_length=40 \
  --max_seq_length=512 \
  --do_train=True \
  --train_batch_size=16 \
  --do_predict=True \
  --predict_batch_size=16 \
  --learning_rate=3e-5 \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=2000 \
  --train_steps=2400 \
  --warmup_steps=240
The following are the script parameters used in the DRCD traditional Chinese machine reading comprehension task:
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_cmrc_drcd.py \
  --spiece_model_file=./spiece.model \
  --model_config_path=${XLNET_DIR}/xlnet_config.json \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --tpu_zone=${TPU_ZONE} \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --output_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --predict_dir=${MODEL_DIR}/eval \
  --train_file=${DATA_DIR}/DRCD_training.json \
  --predict_file=${DATA_DIR}/DRCD_dev.json \
  --uncased=False \
  --max_answer_length=30 \
  --max_seq_length=512 \
  --do_train=True \
  --train_batch_size=16 \
  --do_predict=True \
  --predict_batch_size=16 \
  --learning_rate=3e-5 \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=2000 \
  --train_steps=3600 \
  --warmup_steps=360
Unlike reading comprehension tasks, classification tasks do not require generating tf_records in advance. The following are the script parameters used for the ChnSentiCorp sentiment classification task:
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_classifier.py \
  --spiece_model_file=./spiece.model \
  --model_config_path=${XLNET_DIR}/xlnet_config.json \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --task_name=csc \
  --do_train=True \
  --do_eval=True \
  --eval_all_ckpt=False \
  --uncased=False \
  --data_dir=${RAW_DIR} \
  --output_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --train_batch_size=48 \
  --eval_batch_size=48 \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --num_train_epochs=3 \
  --max_seq_length=256 \
  --learning_rate=2e-5 \
  --save_steps=5000 \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --tpu_zone=${TPU_ZONE}
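For readers without TPU access, a rough PyTorch alternative is to fine-tune the released weights with the Transformers Trainer API. This is not the project's official fine-tuning code; the dataset wrapper and hyperparameters below are illustrative only and should be adapted, as noted in the fine-tuning section above:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-xlnet-base", num_labels=2)  # 2 labels: positive / negative


class SentimentDataset(torch.utils.data.Dataset):
    """Wraps (text, label) pairs into tokenized tensors for the Trainer."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, max_length=256,
                             padding="max_length")
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


# Toy examples only; replace with the real ChnSentiCorp train/dev splits.
train_ds = SentimentDataset(["酒店位置很好，服务周到", "房间太小，性价比不高"], [1, 0])

args = TrainingArguments(output_dir="./csc_out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```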
Q: Will a larger model be released?
A: We are not sure and cannot guarantee it. If we obtain a significant performance improvement, we will consider releasing one.
Q: The model does not work well on some datasets?
A: You can try other models, or continue pre-training from this checkpoint on your own data.
Q: Will the pre-training data be released?
A: Sorry, it cannot be released due to copyright issues.
Q: How long does it take to train XLNet?
A: XLNet-mid was trained for 2M steps (batch size 32) on a Cloud TPU v3 (128GB HBM), which took about three weeks. XLNet-base was trained for 4M steps.
Q: Why has the XLNet team not officially released a multilingual or Chinese XLNet?
A: (Personal opinion) It is unknown; many people have asked for one, see XLNet-issue-#3. With the official team's expertise and compute, training such a model would not be difficult (a multilingual version may be more complicated, since the balance between languages must be considered; see also the notes in multilingual-bert). On the other hand, the authors are under no obligation to do so; as researchers, their technical contribution is already sufficient, and they should not be criticized for not releasing such models. We encourage everyone to view others' work rationally.
Q: Is XLNet better than BERT in most cases?
A: So far it appears effective on at least the tasks above, and the training data is the same as that of our released BERT-wwm-ext.
If the content of this repository helps your research, please cite the following technical report in your paper: https://arxiv.org/abs/2004.13922
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
Project authors: Yiming Cui (Joint Laboratory of HIT and iFLYTEK Research), Wanxiang Che (Harbin Institute of Technology), Ting Liu (Harbin Institute of Technology), Shijin Wang (iFLYTEK), Guoping Hu (iFLYTEK)
This project is funded by Google's TensorFlow Research Cloud (TFRC) program.
During the development of this project, we referred to the following repositories, and would like to express our thanks:
This project is not a Chinese XLNet model officially released by the XLNet team, nor is it an official product of Harbin Institute of Technology or iFLYTEK. The content of this project is intended for technical research reference only and should not be used as a basis for any conclusions. Users may use the models freely within the scope of the license, but we are not responsible for any direct or indirect loss caused by using the content of this project.
You are welcome to follow the official WeChat account of the Joint Laboratory of HIT and iFLYTEK Research (HFL).

If you have any questions, please submit them in a GitHub Issue.
This repository is not staffed for dedicated support, and we encourage users to help each other solve problems.
If you find an implementation issue or would like to contribute to the project, please submit a Pull Request.