Among the ACL 2020 Best Paper nominations is "Don't Stop Pretraining: Adapt Language Models to Domains and Tasks". The paper runs a large number of language model pre-training experiments and systematically analyzes how continued pre-training improves downstream tasks. Its main conclusions are:
Continuing pre-training on target-domain data (DAPT) improves downstream performance; the less the target-domain corpus resembles RoBERTa's original pre-training corpus, the larger the gain from DAPT.
Continuing pre-training on the task's own data (TAPT) improves performance at very low cost.
Combining the two (DAPT first, then TAPT) improves performance further.
The best results come from collecting more task-related unlabeled data and continuing to pre-train on it (Curated-TAPT).
If more task-related unlabeled data cannot be obtained, even a very lightweight and simple data-selection strategy still brings gains (a rough sketch follows this list).
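The selection strategy in the paper embeds sentences with a lightweight model and picks, for each task example, its nearest neighbors from a large unlabeled pool. The sketch below uses character-level TF-IDF as a crude stand-in for those embeddings; the file names task.txt / pool.txt, n_neighbors=50 and the other parameters are illustrative assumptions, not values from the paper.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

# Hypothetical input files, one sentence per line:
# task.txt holds the (small) task dataset, pool.txt a large unlabeled in-domain pool.
task_texts = [l.strip() for l in open("task.txt", encoding="utf-8") if l.strip()]
pool_texts = [l.strip() for l in open("pool.txt", encoding="utf-8") if l.strip()]

# Character-level TF-IDF vectors (works reasonably for Chinese text) as a crude
# stand-in for the learned sentence embeddings used in the paper.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 2), max_features=50000)
pool_vecs = vectorizer.fit_transform(pool_texts)
task_vecs = vectorizer.transform(task_texts)

# For every task sentence, take its 50 nearest pool sentences by cosine distance.
nn = NearestNeighbors(n_neighbors=50, metric="cosine").fit(pool_vecs)
_, neighbor_ids = nn.kneighbors(task_vecs)

# Deduplicate and write the selected sentences out as extra pre-training data.
selected = sorted({pool_texts[i] for row in neighbor_ids for i in row})
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(selected))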
Although continued language model pre-training on BERT is already a reliable way to gain points in algorithm competitions, what makes the paper valuable is that it analyzes this operation systematically. Most Chinese language models are trained with TensorFlow; a common example is the Chinese RoBERTa project, https://github.com/brightmart/roberta_zh.
There are far fewer examples of pre-training Chinese BERT language models with PyTorch. Hugging Face Transformers includes some code for language model pre-training, but it is not very rich and lacks features such as whole word masking (wwm). To complete BERT language model pre-training with minimal coding effort, this article reuses some of that ready-made code and shares some experience with language model pre-training in PyTorch. There are three commonly used Chinese language models.
bert-base-chinese (https://huggingface.co/bert-base-chinese)
This is the most common Chinese BERT language model, pre-trained on a corpus built from Chinese Wikipedia. With unlabeled in-domain data, it is easy to continue pre-training this model as a baseline; just use the official example.
https://github.com/huggingface/transformers/tree/master/examples/language-modeling (this article uses transformers 3.0.2)
python run_language_model_bert.py --output_dir=output --model_type=bert --model_name_or_path=bert-base-chinese --do_train --train_data_file=train.txt --do_eval --eval_data_file=eval.txt --mlm --per_device_train_batch_size=4
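For readers who prefer to call the library directly instead of running the script, the core of what run_language_modeling.py does in transformers 3.0.2 can be sketched in a few lines. The hyperparameters below (block_size=128, num_train_epochs=1) are illustrative choices, not the script's defaults.

from transformers import (BertForMaskedLM, BertTokenizer,
                          DataCollatorForLanguageModeling,
                          LineByLineTextDataset, Trainer, TrainingArguments)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForMaskedLM.from_pretrained("bert-base-chinese")

# train.txt: one sentence per line; each line becomes one training example.
train_dataset = LineByLineTextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)

# Randomly mask 15% of the tokens for the masked language modeling objective.
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="output",
    num_train_epochs=1,
    per_device_train_batch_size=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
)
trainer.train()
trainer.save_model("output")
tokenizer.save_pretrained("output")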
chinese-roberta-wwm-ext (https://github.com/ymcui/Chinese-BERT-wwm)
This pre-trained language model was released by the joint laboratory of Harbin Institute of Technology and iFLYTEK (HFL). It is trained with RoBERTa-style techniques such as dynamic masking and more training data, and it outperforms bert-base-chinese on many tasks. Note that the configuration files of these Chinese RoBERTa models, such as vocab.txt, follow the BERT format, whereas English RoBERTa models read a vocab.json by default and some of them can be loaded automatically through AutoModel. This is why the sample code for Chinese RoBERTa on the Hugging Face model hub (https://huggingface.co/models) does not run out of the box.
If you want to continue pre-training this RoBERTa model with the run_language_modeling.py script above, two more changes are needed: make the script load the model and vocabulary with the BERT classes (see the sketch below), and modify config.json in the local model directory accordingly.
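A minimal sketch of the script change, assuming the model files have been downloaded into a local hflroberta directory (the same path used by the command below); the exact lines to replace in run_language_modeling.py may differ slightly between transformers versions.

# run_language_modeling.py loads the model through the Auto classes
# (AutoTokenizer / AutoModelWithLMHead in transformers 3.0.2), which, depending on the
# model_type recorded in config.json, may try to read a RoBERTa-style vocab.json.
# The Chinese RoBERTa ships a BERT-style vocab.txt, so force the BERT classes instead:
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("hflroberta")   # local dir with vocab.txt, config.json
model = BertForMaskedLM.from_pretrained("hflroberta")     # and pytorch_model.bin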
Assuming that config.json has been modified, you can run the following command.
python run_language_model_roberta.py --output_dir=output --model_type=bert --model_name_or_path=hflroberta --do_train --train_data_file=train.txt --do_eval --eval_data_file=eval.txt --mlm --per_device_train_batch_size=4
ernie (https://github.com/nghuyong/ERNIE-Pytorch)
ERNIE is a pre-trained model released by Baidu, trained on Chinese corpora such as Baidu Tieba and combined with entity prediction and other auxiliary tasks. On some tasks it is more accurate than bert-base-chinese and the Chinese RoBERTa models. To continue pre-training on domain data starting from the ERNIE 1.0 model, only one modification is needed.
python run_language_model_ernie.py --output_dir=output --model_type=bert --model_name_or_path=ernie --do_train --train_data_file=train.txt --do_eval --eval_data_file=eval.txt --mlm --per_device_train_batch_size=4
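Whichever of the three models is used, the checkpoint saved in output can afterwards be loaded like any other BERT model for downstream fine-tuning. A minimal sketch follows; BertForSequenceClassification, num_labels=2 and the example sentence are illustrative assumptions rather than part of the scripts above, and if the tokenizer was not saved alongside the model you can load it from the original model name instead.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the continued-pretrained checkpoint from the output directory.
tokenizer = BertTokenizer.from_pretrained("output")
model = BertForSequenceClassification.from_pretrained("output", num_labels=2)

# Encode one sentence and run a forward pass; in practice this model would then be
# fine-tuned on the labeled task data with Trainer or a custom training loop.
inputs = tokenizer("这家餐厅的菜很好吃", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs)[0]
print(logits)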