
This project provides an XLNet model pre-trained on Chinese text, aiming to enrich Chinese natural language processing resources and offer a wider choice of Chinese pre-trained models. We welcome researchers and practitioners to download and use it, and to jointly advance the development of Chinese NLP resources.
This project is based on CMU/Google's official XLNet: https://github.com/zihangdai/xlnet
Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation toolkit TextBrewer | Model pruning toolkit TextPruner
See more resources released by the Joint Laboratory of HIT and iFLYTEK Research (HFL): https://github.com/ymcui/HFL-Anthology
2023/3/28 We open-sourced the Chinese LLaMA & Alpaca large language models, which can be quickly deployed and tried on a PC. See: https://github.com/ymcui/Chinese-LLaMA-Alpaca
2022/10/29 We proposed LERT, a pre-trained model that incorporates linguistic information. See: https://github.com/ymcui/LERT
2022/3/30 We open-sourced a new pre-trained model, PERT. See: https://github.com/ymcui/PERT
2021/12/17 We released the model pruning toolkit TextPruner. See: https://github.com/airaria/TextPruner
2021/10/24 We released CINO, a pre-trained model for Chinese minority languages. See: https://github.com/ymcui/Chinese-Minority-PLM
2021/7/21 The book "Natural Language Processing: Methods Based on Pre-trained Models", written by members of HIT-SCIR, has been published; you are welcome to purchase it.
2021/1/27 All models now support TensorFlow 2. Please load or download them through the Transformers library: https://huggingface.co/hfl
2020/8/27 The HFL lab topped the GLUE benchmark for general natural language understanding; see the GLUE leaderboard and related news.
2020/3/11 To better understand your needs, please fill out our questionnaire so that we can provide better resources.
2020/2/26 We released the knowledge distillation toolkit TextBrewer.
2019/12/19 The models in this repository are now supported by Huggingface-Transformers; see Quick Load below.
2019/9/5 XLNet-base is now available; see Model Download.
2019/8/19 We released the Chinese XLNet-mid model trained on a large-scale general-domain corpus (5.4B words); see Model Download.
| Section | Description |
|---|---|
| Model Download | Download links for the pre-trained Chinese XLNet models |
| Baseline Results | Baseline results on several Chinese datasets |
| Pre-training Details | Notes on how the models were pre-trained |
| Fine-tuning Details | Notes on fine-tuning for downstream tasks |
| FAQ | Frequently asked questions and answers |
| Citation | Technical report for this repository |
XLNet-mid: 24-layer, 768-hidden, 12-heads, 209M parameters
XLNet-base: 12-layer, 768-hidden, 12-heads, 117M parameters

| Model | Training corpus | Google download | Baidu Netdisk download |
|---|---|---|---|
| XLNet-mid, Chinese | Chinese Wikipedia + general data [1] | TensorFlow / PyTorch | TensorFlow (password: 2jv2) |
| XLNet-base, Chinese | Chinese Wikipedia + general data [1] | TensorFlow / PyTorch | TensorFlow (password: ge7w) |
[1] The general data includes encyclopedia, news, Q&A, and other web text, totaling 5.4B words, the same corpus used for our previously released BERT-wwm-ext.
If you need the PyTorch version:
1) Convert the TensorFlow checkpoint yourself with the conversion script provided by Transformers, or
2) Download the PyTorch weights directly from the Hugging Face model hub: https://huggingface.co/hfl
Method: click the model you want to download → scroll to the bottom and click "List all files in model" → download the bin and json files from the pop-up window.
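If you prefer not to click through the web interface, the same files can be fetched programmatically. Below is a minimal sketch using the huggingface_hub package (an extra dependency, not part of this project); the repository ids are the MODEL_NAME values listed in the Quick Load section below.

```python
# Minimal sketch: download all files of a hosted model with huggingface_hub.
# Assumes `pip install huggingface_hub`; the repo id follows the MODEL_NAME table below.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="hfl/chinese-xlnet-mid")
print("Model files saved to:", local_dir)
```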
Users in mainland China are recommended to use the Baidu Netdisk links, and overseas users the Google links. The XLNet-mid model file is about 800MB. Taking the TensorFlow version of XLNet-mid, Chinese as an example, downloading and unzipping the file yields:
chinese_xlnet_mid_L-24_H-768_A-12.zip
    |- xlnet_model.ckpt      # model weights
    |- xlnet_model.meta      # model meta information
    |- xlnet_model.index     # model index information
    |- xlnet_config.json     # model configuration
    |- spiece.model          # vocabulary (SentencePiece model)
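The spiece.model file is a SentencePiece model that defines the vocabulary. As a quick sanity check after unpacking, it can be inspected with the sentencepiece Python package (a hedged illustration; the package is installed separately and is not shipped with this project):

```python
import sentencepiece as spm

# Load the vocabulary shipped with the checkpoint and inspect a tokenization.
sp = spm.SentencePieceProcessor()
sp.Load("chinese_xlnet_mid_L-24_H-768_A-12/spiece.model")
print(sp.GetPieceSize())                     # expected vocabulary size: 32000
print(sp.EncodeAsPieces("哈尔滨工业大学"))    # subword pieces for a sample string
```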
With Huggingface-Transformers (2.2.2 or later), the above models can be loaded easily.

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("MODEL_NAME")
model = AutoModel.from_pretrained("MODEL_NAME")
The corresponding list of MODEL_NAME is as follows:
| Model name | MODEL_NAME |
|---|---|
| XLNet-mid | hfl/chinese-xlnet-mid |
| XLNet-base | hfl/chinese-xlnet-base |
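As a concrete usage illustration (assuming a recent version of Transformers, which returns output objects rather than tuples, and the sentencepiece package installed), loading XLNet-base and running a forward pass might look like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# The XLNet tokenizer requires the sentencepiece package.
tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModel.from_pretrained("hfl/chinese-xlnet-base")

# Encode a sample sentence and run a forward pass without gradients.
inputs = tokenizer("哈尔滨工业大学", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, seq_len, hidden=768)
```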
To compare with baseline models, we evaluated Chinese BERT, BERT-wwm, BERT-wwm-ext, XLNet-base, and XLNet-mid on the Chinese datasets below. The results for Chinese BERT, BERT-wwm, and BERT-wwm-ext are taken from the Chinese BERT-wwm project. Due to limited time and resources, we have not covered more task categories; please try other tasks yourself.
Note: to ensure reliable results, each model was run 10 times with different random seeds, and we report both the maximum and the average performance. Barring surprises, your own results should fall within this range.
In the tables below, the value outside the parentheses is the maximum and the value in parentheses is the average.
**CMRC 2018 dataset** is a Chinese machine reading comprehension dataset released by the Joint Laboratory of HIT and iFLYTEK Research. Given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD. Evaluation metrics: EM / F1
| Model | Development Set | Test set | Challenge Set |
|---|---|---|---|
| BERT | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) |
| BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) |
| BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) |
| XLNet-base | 65.2 (63.0) / 86.9 (85.9) | 67.0 (65.8) / 87.2 (86.8) | 25.0 (22.7) / 51.3 (49.5) |
| XLNet-mid | 66.8 (66.3) / 88.4 (88.1) | 69.3 (68.5) / 89.2 (88.8) | 29.1 (27.1) / 55.8 (54.9) |
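For reference, EM and F1 above follow the usual span-extraction definitions: EM is exact string match against the reference answer, and F1 measures character-level overlap. The sketch below is a simplified illustration, not the official CMRC 2018 evaluation script (which additionally normalizes punctuation and handles multiple references):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted answer string equals the reference exactly."""
    return float(prediction == reference)

def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1 between prediction and reference (simplified)."""
    common = Counter(prediction) & Counter(reference)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

print(exact_match("北京", "北京"), char_f1("北京市", "北京"))  # 1.0 0.8
```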
**DRCD dataset** is a traditional Chinese span-extraction machine reading comprehension dataset released by Delta Research Institute (Taiwan, China), in the same format as SQuAD. Evaluation metrics: EM / F1
| Model | Development Set | Test set |
|---|---|---|
| BERT | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) |
| BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) |
| BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) |
| XLNet-base | 83.8 (83.2) / 92.3 (92.0) | 83.5 (82.8) / 92.2 (91.8) |
| XLNet-mid | 85.3 (84.9) / 93.5 (93.3) | 85.5 (84.8) / 93.6 (93.2) |
For the sentiment classification task we used the ChnSentiCorp dataset, where the model must classify each text as positive (积极) or negative (消极). Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 94.7 (94.3) | 95.0 (94.7) |
| BERT-wwm | 95.1 (94.5) | 95.4 (95.0) |
| XLNet-base | - | - |
| XLNet-mid | 95.8 (95.2) | 95.4 (94.9) |
The following describes the pre-training details, using the XLNet-mid model as an example.
Following the official XLNet tutorial, you first need to generate a vocabulary with SentencePiece. In this project we used a vocabulary size of 32,000; the remaining parameters follow the defaults in the official example.
spm_train \
  --input=wiki.zh.txt \
  --model_prefix=sp10m.cased.v3 \
  --vocab_size=32000 \
  --character_coverage=0.99995 \
  --model_type=unigram \
  --control_symbols='<cls>,<sep>,<pad>,<mask>,<eod>' \
  --user_defined_symbols='<eop>,.,(,),",-,–,£,€' \
  --shuffle_input_sentence \
  --input_sentence_size=10000000
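Equivalently, the vocabulary can be trained through the SentencePiece Python API; the sketch below simply mirrors the spm_train flags above (assuming a pip-installed sentencepiece; argument handling may differ slightly across versions):

```python
import sentencepiece as spm

# Mirrors the spm_train flags above; adjust the input path to your corpus.
spm.SentencePieceTrainer.train(
    input="wiki.zh.txt",
    model_prefix="sp10m.cased.v3",
    vocab_size=32000,
    character_coverage=0.99995,
    model_type="unigram",
    control_symbols="<cls>,<sep>,<pad>,<mask>,<eod>",
    user_defined_symbols='<eop>,.,(,),",-,–,£,€',
    shuffle_input_sentence=True,
    input_sentence_size=10000000,
)
```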
After generating the vocabulary, the raw text corpus is converted into the training tf_records files. The raw text is organized in the same way as in the official tutorial: one sentence per line, with an empty line marking the end of a document.
The following command is used to generate the data (set num_task and task according to the actual number of corpus shards):
SAVE_DIR=./output_b32
INPUT="./data/*.proc.txt"
python data_utils.py \
  --bsz_per_host=32 \
  --num_core_per_host=8 \
  --seq_len=512 \
  --reuse_len=256 \
  --input_glob="${INPUT}" \
  --save_dir=${SAVE_DIR} \
  --num_passes=20 \
  --bi_data=True \
  --sp_path=spiece.model \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --uncased=False \
  --num_task=10 \
  --task=1
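Since data_utils.py processes one shard per invocation (controlled by num_task and task), the full tf_records set is produced by running it once per shard. The following is a rough driver sketch under that assumption; check the data_utils.py you are using for whether task is 0- or 1-based:

```python
import subprocess

NUM_TASK = 10  # total number of corpus shards; set to your actual slice count

for task in range(NUM_TASK):  # verify 0- vs 1-based task ids in data_utils.py
    # Passing the glob as a single argument avoids shell expansion;
    # data_utils.py resolves the pattern itself.
    subprocess.run(
        ["python", "data_utils.py",
         "--bsz_per_host=32", "--num_core_per_host=8",
         "--seq_len=512", "--reuse_len=256",
         "--input_glob=./data/*.proc.txt", "--save_dir=./output_b32",
         "--num_passes=20", "--bi_data=True", "--sp_path=spiece.model",
         "--mask_alpha=6", "--mask_beta=1", "--num_predict=85",
         "--uncased=False",
         f"--num_task={NUM_TASK}", f"--task={task}"],
        check=True,
    )
```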
With the data above, XLNet pre-training officially begins. The model is called XLNet-mid because only the number of layers is increased relative to XLNet-base (from 12 to 24), while the remaining hyperparameters are unchanged, mainly due to limited computing resources. The command used is as follows:
DATA=YOUR_GS_BUCKET_PATH_TO_TFRECORDS
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
TPU_NAME=v3-xlnet
TPU_ZONE=us-central1-b
python train.py \
  --record_info_dir=$DATA \
  --model_dir=$MODEL_DIR \
  --train_batch_size=32 \
  --seq_len=512 \
  --reuse_len=256 \
  --mem_len=384 \
  --perm_size=256 \
  --n_layer=24 \
  --d_model=768 \
  --d_embed=768 \
  --n_head=12 \
  --d_head=64 \
  --d_inner=3072 \
  --untie_r=True \
  --mask_alpha=6 \
  --mask_beta=1 \
  --num_predict=85 \
  --uncased=False \
  --train_steps=2000000 \
  --save_steps=20000 \
  --warmup_steps=20000 \
  --max_save=20 \
  --weight_decay=0.01 \
  --adam_epsilon=1e-6 \
  --learning_rate=1e-4 \
  --dropout=0.1 \
  --dropatt=0.1 \
  --tpu=$TPU_NAME \
  --tpu_zone=$TPU_ZONE \
  --use_tpu=True
Downstream tasks were fine-tuned on a Google Cloud TPU v2 (64GB HBM). The following briefly describes the configuration used for each task. If you fine-tune on a GPU, please adapt the corresponding parameters, especially batch_size and learning_rate. For the related code, see the src directory.
For reading comprehension tasks, tf_records data must be generated first; please follow the SQuAD 2.0 processing steps in the official XLNet tutorial, which are not repeated here. The following are the script parameters used for the CMRC 2018 Chinese machine reading comprehension task:
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_cmrc_drcd.py \
  --spiece_model_file=./spiece.model \
  --model_config_path=${XLNET_DIR}/xlnet_config.json \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --tpu_zone=${TPU_ZONE} \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --output_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --predict_dir=${MODEL_DIR}/eval \
  --train_file=${DATA_DIR}/cmrc2018_train.json \
  --predict_file=${DATA_DIR}/cmrc2018_dev.json \
  --uncased=False \
  --max_answer_length=40 \
  --max_seq_length=512 \
  --do_train=True \
  --train_batch_size=16 \
  --do_predict=True \
  --predict_batch_size=16 \
  --learning_rate=3e-5 \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=2000 \
  --train_steps=2400 \
  --warmup_steps=240
The following are the script parameters used in the DRCD traditional Chinese machine reading comprehension task:
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_cmrc_drcd.py \
  --spiece_model_file=./spiece.model \
  --model_config_path=${XLNET_DIR}/xlnet_config.json \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --tpu_zone=${TPU_ZONE} \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --output_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --predict_dir=${MODEL_DIR}/eval \
  --train_file=${DATA_DIR}/DRCD_training.json \
  --predict_file=${DATA_DIR}/DRCD_dev.json \
  --uncased=False \
  --max_answer_length=30 \
  --max_seq_length=512 \
  --do_train=True \
  --train_batch_size=16 \
  --do_predict=True \
  --predict_batch_size=16 \
  --learning_rate=3e-5 \
  --adam_epsilon=1e-6 \
  --iterations=1000 \
  --save_steps=2000 \
  --train_steps=3600 \
  --warmup_steps=360
Unlike reading comprehension tasks, classification tasks do not require generating tf_records in advance. The following are the script parameters used for the ChnSentiCorp sentiment classification task:
XLNET_DIR=YOUR_GS_BUCKET_PATH_TO_XLNET
MODEL_DIR=YOUR_OUTPUT_MODEL_PATH
DATA_DIR=YOUR_DATA_DIR_TO_TFRECORDS
RAW_DIR=YOUR_RAW_DATA_DIR
TPU_NAME=v2-xlnet
TPU_ZONE=us-central1-b
python -u run_classifier.py \
  --spiece_model_file=./spiece.model \
  --model_config_path=${XLNET_DIR}/xlnet_config.json \
  --init_checkpoint=${XLNET_DIR}/xlnet_model.ckpt \
  --task_name=csc \
  --do_train=True \
  --do_eval=True \
  --eval_all_ckpt=False \
  --uncased=False \
  --data_dir=${RAW_DIR} \
  --output_dir=${DATA_DIR} \
  --model_dir=${MODEL_DIR} \
  --train_batch_size=48 \
  --eval_batch_size=48 \
  --num_hosts=1 \
  --num_core_per_host=8 \
  --num_train_epochs=3 \
  --max_seq_length=256 \
  --learning_rate=2e-5 \
  --save_steps=5000 \
  --use_tpu=True \
  --tpu=${TPU_NAME} \
  --tpu_zone=${TPU_ZONE}
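For readers without TPU access, a rough PyTorch alternative is to fine-tune the released weights with the Transformers Trainer API. This is not the project's official fine-tuning code; the dataset wrapper and hyperparameters below are illustrative only and should be adapted, as noted in the fine-tuning section above:

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-xlnet-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-xlnet-base", num_labels=2)  # 2 labels: positive / negative


class SentimentDataset(torch.utils.data.Dataset):
    """Wraps (text, label) pairs into tokenized tensors for the Trainer."""

    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, max_length=256,
                             padding="max_length")
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item


# Toy examples only; replace with the real ChnSentiCorp train/dev splits.
train_ds = SentimentDataset(["酒店位置很好，服务周到", "房间太小，性价比不高"], [1, 0])

args = TrainingArguments(output_dir="./csc_out", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=train_ds).train()
```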
Q: Will a larger model be released?
A: We are not sure and cannot guarantee it. If we obtain a significant performance improvement, we will consider releasing one.
Q: The model does not work well on some datasets?
A: You can try other models, or continue pre-training from this checkpoint on your own data.
Q: Will the pre-training data be released?
A: Sorry, it cannot be released due to copyright issues.
Q: How long does it take to train XLNet?
A: XLNet-mid was trained for 2M steps (batch size 32) on a Cloud TPU v3 (128GB HBM), which took about three weeks. XLNet-base was trained for 4M steps.
Q: Why has the XLNet team not officially released a multilingual or Chinese XLNet?
A: (Personal opinion) It is unknown; many people have asked for one, see XLNet-issue-#3. With the official team's expertise and compute, training such a model would not be difficult (a multilingual version may be more complicated, since the balance between languages must be considered; see also the notes in multilingual-bert). On the other hand, the authors are under no obligation to do so; as researchers, their technical contribution is already sufficient, and they should not be criticized for not releasing such models. We encourage everyone to view others' work rationally.
Q: Is XLNet better than BERT in most cases?
A: So far it appears effective on at least the tasks above, and the training data is the same as that of our released BERT-wwm-ext.
If the content of this repository helps your research, please cite the following technical report in your paper: https://arxiv.org/abs/2004.13922
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
Project authors: Yiming Cui (Joint Laboratory of HIT and iFLYTEK Research), Wanxiang Che (Harbin Institute of Technology), Ting Liu (Harbin Institute of Technology), Shijin Wang (iFLYTEK), Guoping Hu (iFLYTEK)
This project is funded by Google's TensorFlow Research Cloud (TFRC) program.
During the development of this project, we referred to the following repositories, and would like to express our thanks:
This project is not a Chinese XLNet model officially released by the XLNet team, nor is it an official product of Harbin Institute of Technology or iFLYTEK. The content of this project is intended for technical research reference only and should not be used as a basis for any conclusions. Users may use the models freely within the scope of the license, but we are not responsible for any direct or indirect loss caused by using the content of this project.
You are welcome to follow the official WeChat account of the Joint Laboratory of HIT and iFLYTEK Research (HFL).

If you have any questions, please submit them in a GitHub Issue.
This repository is not staffed for dedicated support, and we encourage users to help each other solve problems.
If you find an implementation issue or would like to contribute to the project, please submit a Pull Request.