Among the ACL 2020 Best Paper nominations is "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks". The paper runs a large number of language model pre-training experiments and systematically analyzes how continued pre-training improves downstream tasks. Its main conclusions are:
Continuing pre-training on target-domain data (DAPT) improves downstream performance; the less the target-domain corpus resembles RoBERTa's original pre-training corpus, the larger the gain from DAPT.
Continuing pre-training on the task's own data (TAPT) improves performance at very low cost.
Combining the two (DAPT first, then TAPT) improves performance further.
The best results come from collecting more task-related unlabeled data and continuing to pre-train on it (Curated-TAPT).
If more task-related unlabeled data cannot be obtained, even a very lightweight and simple data-selection strategy still brings gains (a rough sketch follows this list).
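The selection strategy in the paper embeds sentences with a lightweight model and picks, for each task example, its nearest neighbors from a large unlabeled pool. The sketch below uses character-level TF-IDF as a crude stand-in for those embeddings; the file names task.txt / pool.txt, n_neighbors=50 and the other parameters are illustrative assumptions, not values from the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical input files, one sentence per line:
# task.txt holds the (small) task dataset, pool.txt a large unlabeled in-domain pool.
task_texts = [l.strip() for l in open("task.txt", encoding="utf-8") if l.strip()]
pool_texts = [l.strip() for l in open("pool.txt", encoding="utf-8") if l.strip()]

# Character-level TF-IDF vectors (works reasonably for Chinese text) as a crude
# stand-in for the learned sentence embeddings used in the paper.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2), max_features=50000)
pool_vecs = vectorizer.fit_transform(pool_texts)
task_vecs = vectorizer.transform(task_texts)

# For every task sentence, take its 50 nearest pool sentences by cosine distance.
nn = NearestNeighbors(n_neighbors=50, metric="cosine").fit(pool_vecs)
_, neighbor_ids = nn.kneighbors(task_vecs)

# Deduplicate and write the selected sentences out as extra pre-training data.
selected = sorted({pool_texts[i] for row in neighbor_ids for i in row})
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(selected))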
Although continued language model pre-training on BERT is already a reliable way to gain points in algorithm competitions, what makes the paper valuable is that it analyzes this operation systematically. Most Chinese language models are trained with TensorFlow; a common example is the Chinese RoBERTa project, https://github.com/brightmart/roberta_zh.
There are far fewer examples of pre-training Chinese BERT language models with PyTorch. Hugging Face Transformers includes some code for language model pre-training, but it is not very rich and lacks features such as whole word masking (wwm). To complete BERT language model pre-training with minimal coding effort, this article reuses some of that ready-made code and shares some experience with language model pre-training in PyTorch. There are three commonly used Chinese language models.
bert-base-chinese (https://huggingface.co/bert-base-chinese)
This is the most common Chinese BERT language model, pre-trained on a corpus built from Chinese Wikipedia. With unlabeled in-domain data, it is easy to continue pre-training this model as a baseline; just use the official example.
https://github.com/huggingface/transformers/tree/master/examples/language-modeling (this article uses transformers 3.0.2)
python run_language_model_bert.py --output_dir=output --model_type=bert --model_name_or_path=bert-base-chinese --do_train --train_data_file=train.txt --do_eval --eval_data_file=eval.txt --mlm --per_device_train_batch_size=4
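For readers who prefer to call the library directly instead of running the script, the core of what run_language_modeling.py does in transformers 3.0.2 can be sketched in a few lines. The hyperparameters below (block_size=128, num_train_epochs=1) are illustrative choices, not the script's defaults.

from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# train.txt: one sentence per line; each line becomes one training example.
train_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)

# Randomly mask 15% of the tokens for the masked language modeling objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("output")
tokenizer.save_pretrained("output")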
chinese-roberta-wwm-ext (https://github.com/ymcui/Chinese-BERT-wwm)
This pre-trained language model was released by the joint laboratory of Harbin Institute of Technology and iFLYTEK (HFL). It is trained with RoBERTa-style techniques such as dynamic masking and more training data, and it outperforms bert-base-chinese on many tasks. Note that the configuration files of these Chinese RoBERTa models, such as vocab.txt, follow the BERT format, whereas English RoBERTa models read a vocab.json by default and some of them can be loaded automatically through AutoModel. This is why the sample code for Chinese RoBERTa on the Hugging Face model hub (https://huggingface.co/models) does not run out of the box.
If you want to continue pre-training this RoBERTa model with the run_language_modeling.py script above, two more changes are needed: make the script load the model and vocabulary with the BERT classes (see the sketch below), and modify config.json in the local model directory accordingly.
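A minimal sketch of the script change, assuming the model files have been downloaded into a local hflroberta directory (the same path used by the command below); the exact lines to replace in run_language_modeling.py may differ slightly between transformers versions.

# run_language_modeling.py loads the model through the Auto classes
# (AutoTokenizer / AutoModelWithLMHead in transformers 3.0.2), which, depending on the
# model_type recorded in config.json, may try to read a RoBERTa-style vocab.json.
# The Chinese RoBERTa ships a BERT-style vocab.txt, so force the BERT classes instead:
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("hflroberta")   # local dir with vocab.txt, config.json
model = BertForMaskedLM.from_pretrained("hflroberta")     # and pytorch_model.bin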
Assuming that config.json has been modified, you can run the following command.
python run_language_model_roberta.py --output_dir=output --model_type=bert --model_name_or_path=hflroberta --do_train --train_data_file=train.txt --do_eval --eval_data_file=eval.txt --mlm --per_device_train_batch_size=4
ernie (https://github.com/nghuyong/ERNIE-Pytorch)
ERNIE is a pre-trained model released by Baidu, trained on Chinese corpora such as Baidu Tieba and combined with entity prediction and other auxiliary tasks. On some tasks it is more accurate than bert-base-chinese and the Chinese RoBERTa models. To continue pre-training on domain data starting from the ERNIE 1.0 model, only one modification is needed.
python run_language_model_ernie.py --output_dir=output --model_type=bert --model_name_or_path=ernie --do_train --train_data_file=train.txt --do_eval --eval_data_file=eval.txt --mlm --per_device_train_batch_size=4
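Whichever of the three models is used, the checkpoint saved in output can afterwards be loaded like any other BERT model for downstream fine-tuning. A minimal sketch follows; BertForSequenceClassification, num_labels=2 and the example sentence are illustrative assumptions rather than part of the scripts above, and if the tokenizer was not saved alongside the model you can load it from the original model name instead.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the continued-pretrained checkpoint from the output directory.
tokenizer = BertTokenizer.from_pretrained("output")
model = BertForSequenceClassification.from_pretrained("output", num_labels=2)

# Encode one sentence and run a forward pass; in practice this model would then be
# fine-tuned on the labeled task data with Trainer or a custom training loop.
inputs = tokenizer("这家餐厅的菜很好吃", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]
print(logits)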