Chinese Description | English

In natural language processing, pre-trained language models (PLMs) have become an essential foundational technology. To further advance research on Chinese information processing, we release the Chinese pre-trained model BERT-wwm, based on Whole Word Masking, along with closely related models: BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large, RBT3, RBTL3, etc.
This project is based on Google's official BERT: https://github.com/google-research/bert
Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation tool TextBrewer | Model pruning tool TextPruner
See more resources released by the Joint Laboratory of HIT and iFLYTEK Research (HFL): https://github.com/ymcui/HFL-Anthology
2023/3/28 The Chinese LLaMA & Alpaca large language models are open-sourced and can be quickly deployed and tried out on a PC. See: https://github.com/ymcui/Chinese-LLaMA-Alpaca
2023/3/9 We propose VLE, an image-text multimodal pre-trained model. See: https://github.com/iflytek/VLE
2022/11/15 We propose MiniRBT, a small Chinese pre-trained model. See: https://github.com/iflytek/MiniRBT
2022/10/29 We propose LERT, a pre-trained model that incorporates linguistic information. See: https://github.com/ymcui/LERT
2022/3/30 We open-source a new pre-trained model, PERT. See: https://github.com/ymcui/PERT
2021/10/24 HFL releases CINO, a pre-trained model for Chinese minority languages. See: https://github.com/ymcui/Chinese-Minority-PLM
2021/7/21 "Natural Language Processing: Methods Based on Pre-trained Models", written by scholars from HIT-SCIR, has been published. Purchases are welcome.
2021/1/27 All models now support TensorFlow 2. Please use or download them through the 🤗Transformers library: https://huggingface.co/hfl
2020/9/15 Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" was accepted as a long paper at Findings of EMNLP.
2020/8/27 HFL topped the leaderboard of the GLUE benchmark for general natural language understanding. See the GLUE leaderboard and news.
2020/3/23 The models in this repository have been integrated into PaddleHub. See Quick Loading.
2020/3/11 To better understand your needs and provide better resources, we invite you to fill out the questionnaire.
2020/2/26 HFL releases the knowledge distillation toolkit TextBrewer.
2020/1/20 Happy Year of the Rat! We release RBT3 and RBTL3 (3-layer RoBERTa-wwm-ext-base/large). See Small-Parameter Models.
2019/12/19 The models in this repository have been integrated into Huggingface-Transformers. See Quick Loading.
2019/10/14 Release of the RoBERTa-wwm-ext-large model. See Chinese Model Download.
2019/9/10 Release of the RoBERTa-wwm-ext model. See Chinese Model Download.
2019/7/30 Chinese BERT-wwm-ext, trained on a larger general-domain corpus (5.4B tokens), is now available. See Chinese Model Download.
2019/6/20 Initial release. The models can be downloaded via the Google download links, and mirrors have also been uploaded to a mainland-China cloud disk. See Chinese Model Download.
| Chapter | Description |
|---|---|
| Introduction | Basic principles of BERT-wwm |
| Chinese Model Download | Download links for BERT-wwm |
| Quick Loading | How to quickly load the models with 🤗Transformers and PaddleHub |
| Model Comparison | Comparison of the parameters of the models in this repository |
| Chinese Baseline Performance | Results of the models on several Chinese baseline systems |
| Small-Parameter Models | Results of the small-parameter models (3-layer Transformer) |
| Usage Recommendations | Several suggestions for using the Chinese pre-trained models |
| English Model Download | Download links for Google's official English BERT-wwm |
| FAQ | Frequently asked questions and answers |
| Citation | Technical reports for this repository |
Whole Word Masking (wwm) is an upgraded version of BERT released by Google on May 31, 2019. It mainly changes the training-sample generation strategy of the original pre-training stage. In short, the original WordPiece tokenization may split a complete word into several subwords, and when training samples are generated these subwords are masked independently at random. With whole word masking, if any WordPiece subword of a complete word is masked, the remaining subwords of the same word are masked as well, i.e., the whole word is masked.
Note that "mask" here refers to masking in the generalized sense (replace with [MASK]; keep the original token; replace with a random token) and is not limited to replacing a token with the [MASK] tag. For a more detailed explanation and examples, see: #4
Similarly, in Google's official BERT-base, Chinese, Chinese text is tokenized at character granularity, without considering Chinese Word Segmentation (CWS) as used in traditional NLP. We apply whole word masking to Chinese, train on Chinese Wikipedia (both Simplified and Traditional), and use HIT's LTP as the word segmentation tool, so that all Chinese characters belonging to the same word are masked together.
The following example shows how samples are generated with whole word masking. Note: for ease of understanding, only the case of replacing tokens with the [MASK] tag is shown.
| Description | Example |
|---|---|
| Original text | 使用语言模型来预测下一个词的probability。 |
| Segmented text | 使用 语言 模型 来 预测 下 一个 词 的 probability 。 |
| Original mask input | 使 用 语 言 [MASK] 型 来 [MASK] 测 下 一 个 词 的 pro [MASK] ##lity 。 |
| Whole-word mask input | 使 用 语 言 [MASK] [MASK] 来 [MASK] [MASK] 下 一 个 词 的 [MASK] [MASK] [MASK] 。 |
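To make the strategy above concrete, here is a minimal Python sketch of whole-word-masking sample generation. It is only an illustration and not the project's actual pre-training code (which is not released); as in the table above, it handles only the simple replace-with-[MASK] case, and the function name and word boundaries in the example are hypothetical.

```python
import random

def whole_word_mask(words_as_subtokens, mask_prob=0.15, mask_token="[MASK]"):
    """words_as_subtokens: a list of words, each given as a list of WordPiece
    subtokens (word boundaries produced by a segmenter such as LTP)."""
    output = []
    for subtokens in words_as_subtokens:
        # Decide masking at the *word* level, then apply it to every subtoken,
        # so that no word is ever only partially masked.
        if random.random() < mask_prob:
            output.extend([mask_token] * len(subtokens))
        else:
            output.extend(subtokens)
    return output

# Example input mirroring the table above:
sample = [["使", "用"], ["语", "言"], ["模", "型"], ["来"], ["预", "测"],
          ["下"], ["一", "个"], ["词"], ["的"], ["pro", "##bability"], ["。"]]
print(" ".join(whole_word_mask(sample)))
```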
This repository mainly contains base-size models, so the word base is not added to the model abbreviations; other sizes are marked explicitly (e.g., large).
- BERT-large: 24-layer, 1024-hidden, 16-heads, 330M parameters
- BERT-base: 12-layer, 768-hidden, 12-heads, 110M parameters

Note: the open-source version does not include the weights for the MLM task. If you need to perform MLM, please run secondary pre-training on additional data (as with other downstream tasks).
| Model | Training Data | Google Download | Baidu Netdisk Download |
|---|---|---|---|
| RBT6, Chinese | EXT data [1] | - | TensorFlow (password: hniy) |
| RBT4, Chinese | EXT data [1] | - | TensorFlow (password: sjpt) |
| RBTL3, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: s6cu) |
| RBT3, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: 5a57) |
| RoBERTa-wwm-ext-large, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: dqqe) |
| RoBERTa-wwm-ext, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: vybq) |
| BERT-wwm-ext, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: wgnt) |
| BERT-wwm, Chinese | Chinese Wiki | TensorFlow, PyTorch | TensorFlow (password: qfh8) |
| BERT-base, Chinese (Google) | Chinese Wiki | Google Cloud | - |
| BERT-base, Multilingual Cased (Google) | Multilingual Wiki | Google Cloud | - |
| BERT-base, Multilingual Uncased (Google) | Multilingual Wiki | Google Cloud | - |
[1] EXT data includes Chinese Wikipedia plus other encyclopedia, news, and Q&A web data, for a total of 5.4B tokens.
If you need the PyTorch version:
1) Convert it yourself using the conversion script provided by 🤗Transformers, or
2) Download it directly from the Hugging Face website: https://huggingface.co/hfl
Download method: click the model you want → select the "Files and versions" tab → download the corresponding model files.
Users in mainland China are advised to use the Baidu Netdisk links, and overseas users the Google links. Each base model is about 400MB. Taking the TensorFlow version of BERT-wwm, Chinese as an example, after downloading, unzip the file to obtain:
chinese_wwm_L-12_H-768_A-12.zip
|- bert_model.ckpt # model weights
|- bert_model.meta # model meta information
|- bert_model.index # model index information
|- bert_config.json # model configuration
|- vocab.txt # vocabulary
Among them, bert_config.json and vocab.txt are identical to those of Google's original BERT-base, Chinese. The PyTorch version contains the files pytorch_model.bin, bert_config.json, and vocab.txt.
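If you downloaded the PyTorch weights from the links above (rather than from the Hugging Face Hub), the directory can be loaded locally. This is a hedged example: the local path is hypothetical, and depending on your 🤗Transformers version you may need to rename bert_config.json to config.json, since newer versions look for the latter.

```python
from transformers import BertTokenizer, BertModel

local_dir = "./chinese_wwm_pytorch"  # assumption: wherever you extracted the files
tokenizer = BertTokenizer.from_pretrained(local_dir)  # uses vocab.txt
model = BertModel.from_pretrained(local_dir)          # uses config.json + pytorch_model.bin
```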
With the 🤗Transformers library, the above models can be loaded easily:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")
Note: all models in this repository must be loaded with BertTokenizer and BertModel. Do not use RobertaTokenizer/RobertaModel!
The corresponding list of MODEL_NAME is as follows:
| Model name | MODEL_NAME |
|---|---|
| RoBERTa-wwm-ext-large | hfl/chinese-roberta-wwm-ext-large |
| RoBERTa-wwm-ext | hfl/chinese-roberta-wwm-ext |
| BERT-wwm-ext | hfl/chinese-bert-wwm-ext |
| BERT-wwm | hfl/chinese-bert-wwm |
| RBT3 | hfl/rbt3 |
| RBTL3 | hfl/rbtl3 |
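For reference, here is a minimal end-to-end sketch using one MODEL_NAME from the table above; the sentence is just an example, and the hidden size in the final comment applies to the base-size models.

```python
import torch
from transformers import BertTokenizer, BertModel

model_name = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Raw text goes in directly; no word segmentation is needed, since wwm only
# affects pre-training, not the downstream input format.
inputs = tokenizer("使用语言模型来预测下一个词的概率。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```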
With PaddleHub, a model can be downloaded and installed with a single line of code, and tasks such as text classification, sequence labeling, and reading comprehension can be completed with about ten more lines.
import paddlehub as hub
module = hub.Module(name=MODULE_NAME)
The corresponding list of MODULE_NAME is as follows:
| Model name | MODULE_NAME |
|---|---|
| RoBERTa-wwm-ext-large | chinese-roberta-wwm-ext-large |
| RoBERTa-wwm-ext | chinese-roberta-wwm-ext |
| BERT-wwm-ext | chinese-bert-wwm-ext |
| BERT-wwm | chinese-bert-wwm |
| RBT3 | rbt3 |
| RBTL3 | rbtl3 |
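As a concrete instantiation of the snippet above, using one MODULE_NAME from the table; refer to the PaddleHub documentation for the downstream-task APIs, which are not shown here.

```python
import paddlehub as hub

# Downloads and installs the module on first use.
module = hub.Module(name="rbt3")
```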
The following is a summary of the model details that users are most concerned about.
| - | BERT Google | BERT-wwm | BERT-wwm-ext | RoBERTa-wwm-ext | RoBERTa-wwm-ext-large |
|---|---|---|---|---|---|
| Masking | WordPiece | WWM [1] | WWM | WWM | WWM |
| Type | base | base | base | base | Large |
| Data Source | Wiki | Wiki | wiki+ext [2] | wiki+ext | wiki+ext |
| Training Tokens # | 0.4B | 0.4B | 5.4B | 5.4B | 5.4B |
| Device | TPU Pod v2 | TPU v3 | TPU v3 | TPU v3 | TPU Pod v3-32 [3] |
| Training Steps | ? | 100K MAX128 +100K MAX512 | 1M MAX128 +400K MAX512 | 1M MAX512 | 2M MAX512 |
| Batch Size | ? | 2,560 / 384 | 2,560 / 384 | 384 | 512 |
| Optimizer | AdamW | LAMB | LAMB | AdamW | AdamW |
| Vocabulary | 21,128 | ~BERT [4] | ~BERT | ~BERT | ~BERT |
| Init Checkpoint | Random Init | ~BERT | ~BERT | ~BERT | Random Init |
[1] WWM = Whole Word Masking
[2] ext = extended data
[3] TPU Pod v3-32 (512G HBM) is equivalent to 4 TPU v3 (128G HBM)
[4] ~BERT means inheriting the attributes of Google's original Chinese BERT
To compare baseline performance, we evaluated the models on the following Chinese datasets, covering both sentence-level and document-level tasks. For BERT-wwm-ext, RoBERTa-wwm-ext, and RoBERTa-wwm-ext-large, we did not tune the learning rate further but directly used the optimal learning rate found for BERT-wwm.
Best learning rate:
| Model | BERT | ERNIE | BERT-wwm* |
|---|---|---|---|
| CMRC 2018 | 3e-5 | 8e-5 | 3e-5 |
| DRCD | 3e-5 | 8e-5 | 3e-5 |
| CJRC | 4e-5 | 8e-5 | 4e-5 |
| XNLI | 3e-5 | 5e-5 | 3e-5 |
| ChnSentiCorp | 2e-5 | 5e-5 | 2e-5 |
| LCQMC | 2e-5 | 3e-5 | 2e-5 |
| BQ Corpus | 3e-5 | 5e-5 | 3e-5 |
| THUCNews | 2e-5 | 5e-5 | 2e-5 |
*Represents all wwm series models (BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large)
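As an illustration of applying these learning rates, here is a hedged fine-tuning sketch with 🤗Transformers. It is not the setup used in the report (the report's classification experiments used Google's run_classifier.py, as noted in the FAQ); the output directory, number of labels, epochs, and batch size are placeholder assumptions, and only the learning rate comes from the table above.

```python
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

model_name = "hfl/chinese-bert-wwm"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3)

training_args = TrainingArguments(
    output_dir="./xnli-bert-wwm",    # placeholder
    learning_rate=3e-5,              # optimal XNLI learning rate for wwm models (table above)
    num_train_epochs=2,              # assumption, not from the report
    per_device_train_batch_size=32,  # assumption, not from the report
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...)  # supply your own datasets
# trainer.train()
```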
Only some results are listed below. Please see our technical report for the complete results.
Note: To ensure reliable results, each model was run 10 times with different random seeds, and we report both the maximum and the average performance; the average is shown in brackets and the maximum outside the brackets. Barring surprises, your own results should fall within this range.
The CMRC 2018 dataset is a Chinese machine reading comprehension dataset released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). Given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD. Evaluation metrics: EM / F1
| Model | Development Set | Test set | Challenge Set |
|---|---|---|---|
| BERT | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) |
| ERNIE | 65.4 (64.3) / 84.7 (84.2) | 69.4 (68.2) / 86.6 (86.1) | 19.6 (17.0) / 44.3 (42.8) |
| BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) |
| BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) |
| RoBERTa-wwm-ext | 67.4 (66.5) / 87.2 (86.5) | 72.6 (71.4) / 89.4 (88.8) | 26.2 (24.6) / 51.0 (49.1) |
| RoBERTa-wwm-ext-large | 68.5 (67.6) / 88.4 (87.9) | 74.2 (72.4) / 90.6 (90.0) | 31.5 (30.1) / 60.1 (57.5) |
The DRCD dataset was released by Delta Research Center in Taiwan, China. It has the same format as SQuAD and is an extractive reading comprehension dataset in Traditional Chinese. Since ERNIE removes Traditional Chinese characters, it is not recommended to use ERNIE on Traditional Chinese data (or convert the data to Simplified Chinese before processing). Evaluation metrics: EM / F1
| Model | Development Set | Test set |
|---|---|---|
| BERT | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) |
| ERNIE | 73.2 (73.0) / 83.9 (83.8) | 71.9 (71.4) / 82.5 (82.3) |
| BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) |
| BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) |
| RoBERTa-wwm-ext | 86.6 (85.9) / 92.5 (92.2) | 85.6 (85.2) / 92.0 (91.7) |
| RoBERTa-wwm-ext-large | 89.6 (89.1) / 94.8 (94.4) | 89.6 (88.9) / 94.5 (94.1) |
The CJRC dataset is a Chinese machine reading comprehension dataset for the judicial domain released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). Note that the data used in these experiments are not the final officially released data, so the results are for reference only. Evaluation metrics: EM / F1
| Model | Development Set | Test set |
|---|---|---|
| BERT | 54.6 (54.0) / 75.4 (74.5) | 55.1 (54.1) / 75.2 (74.3) |
| ERNIE | 54.3 (53.9) / 75.3 (74.6) | 55.0 (53.9) / 75.0 (73.9) |
| BERT-wwm | 54.7 (54.0) / 75.2 (74.8) | 55.1 (54.1) / 75.4 (74.4) |
| BERT-wwm-ext | 55.6 (54.8) / 76.0 (75.3) | 55.6 (54.9) / 75.8 (75.0) |
| RoBERTa-wwm-ext | 58.7 (57.6) / 79.1 (78.3) | 59.0 (57.8) / 79.0 (78.0) |
| RoBERTa-wwm-ext-large | 62.1 (61.1) / 82.4 (81.6) | 62.4 (61.4) / 82.2 (81.0) |
For natural language inference, we use the XNLI dataset, which requires classifying text into three categories: entailment, neutral, and contradiction. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 77.8 (77.4) | 77.8 (77.5) |
| ERNIE | 79.7 (79.4) | 78.6 (78.2) |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
| RoBERTa-wwm-ext | 80.0 (79.2) | 78.8 (78.3) |
| RoBERTa-wwm-ext-large | 82.1 (81.3) | 81.2 (80.6) |
For sentiment analysis, we use the binary sentiment classification dataset ChnSentiCorp. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 94.7 (94.3) | 95.0 (94.7) |
| ERNIE | 95.4 (94.8) | 95.4 (95.3) |
| BERT-wwm | 95.1 (94.5) | 95.4 (95.0) |
| BERT-wwm-ext | 95.4 (94.6) | 95.3 (94.7) |
| RoBERTa-wwm-ext | 95.0 (94.6) | 95.6 (94.8) |
| RoBERTa-wwm-ext-large | 95.8 (94.9) | 95.8 (94.9) |
The following two datasets require classifying a sentence pair to determine whether the two sentences have the same meaning (a binary classification task).
LCQMC was released by the Intelligent Computing Research Center of Harbin Institute of Technology, Shenzhen Graduate School. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 89.4 (88.4) | 86.9 (86.4) |
| ERNIE | 89.8 (89.6) | 87.2 (87.0) |
| BERT-wwm | 89.4 (89.2) | 87.0 (86.8) |
| BERT-wwm-ext | 89.6 (89.2) | 87.1 (86.6) |
| RoBERTa-wwm-ext | 89.0 (88.7) | 86.4 (86.1) |
| RoBERTa-wwm-ext-large | 90.4 (90.0) | 87.0 (86.8) |
BQ Corpus, released by the Intelligent Computing Research Center of Harbin Institute of Technology, Shenzhen Graduate School, is a sentence-pair dataset for the banking domain. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 86.0 (85.5) | 84.8 (84.6) |
| ERNIE | 86.3 (85.5) | 85.0 (84.6) |
| BERT-wwm | 86.1 (85.6) | 85.2 (84.9) |
| BERT-wwm-ext | 86.4 (85.5) | 85.3 (84.8) |
| RoBERTa-wwm-ext | 86.0 (85.4) | 85.0 (84.6) |
| RoBERTa-wwm-ext-large | 86.3 (85.7) | 85.8 (84.9) |
For document-level text classification, we use THUCNews, a news dataset released by the Natural Language Processing Lab of Tsinghua University. We use one of its subsets, in which each news article must be classified into one of 10 categories. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 97.7 (97.4) | 97.8 (97.6) |
| ERNIE | 97.6 (97.3) | 97.5 (97.3) |
| BERT-wwm | 98.0 (97.6) | 97.8 (97.6) |
| BERT-wwm-ext | 97.7 (97.5) | 97.7 (97.5) |
| RoBERTa-wwm-ext | 98.3 (97.9) | 97.7 (97.5) |
| RoBERTa-wwm-ext-large | 98.3 (97.7) | 97.8 (97.6) |
The following are the results of the small-parameter models on several NLP tasks; only test-set results are shown (CSC = ChnSentiCorp).
| Model | CMRC 2018 | DRCD | XNLI | CSC | LCQMC | BQ | Average | Parameters |
|---|---|---|---|---|---|---|---|---|
| RoBERTa-wwm-ext-large | 74.2 / 90.6 | 89.6 / 94.5 | 81.2 | 95.8 | 87.0 | 85.8 | 87.335 | 325M |
| RoBERTa-wwm-ext | 72.6 / 89.4 | 85.6 / 92.0 | 78.8 | 95.6 | 86.4 | 85.0 | 85.675 | 102M |
| RBTL3 | 63.3 / 83.4 | 77.2 / 85.6 | 74.0 | 94.2 | 85.1 | 83.6 | 80.800 | 61M (59.8%) |
| RBT3 | 62.2 / 81.8 | 75.0 / 83.9 | 72.3 | 92.8 | 85.1 | 83.3 | 79.550 | 38M (37.3%) |
Relative performance, with RoBERTa-wwm-ext as the 100% baseline:
| Model | CMRC 2018 | DRCD | XNLI | CSC | LCQMC | BQ | Average | Classification Average |
|---|---|---|---|---|---|---|---|---|
| RoBERTa-wwm-ext-large | 102.2% / 101.3% | 104.7% / 102.7% | 103.0% | 100.2% | 100.7% | 100.9% | 101.9% | 101.2% |
| RoBERTa-wwm-ext | 100% / 100% | 100% / 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| RBTL3 | 87.2% / 93.3% | 90.2% / 93.0% | 93.9% | 98.5% | 98.5% | 98.4% | 94.3% | 97.35% |
| RBT3 | 85.7% / 91.5% | 87.6% / 91.2% | 91.8% | 97.1% | 98.5% | 98.0% | 92.9% | 96.35% |
We also recommend MiniRBT, a newer small Chinese pre-trained model with better performance: https://github.com/iflytek/MiniRBT
- The initial learning rate is a very important hyperparameter (for BERT as well as other models) and should be tuned for the target task.
- The optimal learning rate of ERNIE differs considerably from that of BERT / BERT-wwm, so be sure to adjust the learning rate when using ERNIE (based on the results above, ERNIE requires a relatively high initial learning rate).
- Since BERT / BERT-wwm were trained on Wikipedia data, they are better at modeling formal text, whereas ERNIE was additionally trained on web data such as Baidu Tieba and Baidu Zhidao, which gives it an advantage on informal text (e.g., Weibo).
- On long-text tasks such as reading comprehension and document classification, BERT and BERT-wwm perform better.
- If you need to process Traditional Chinese data, use BERT or BERT-wwm, because we found that the vocabulary of ERNIE contains almost no Traditional Chinese characters.

For convenience, we also list the English BERT-large (wwm) models officially released by Google:
BERT-Large, Uncased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Large, Cased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
Q: How to use this model?
A: Use it in exactly the same way as the Chinese BERT released by Google. The input text does not need to be word-segmented; wwm only affects the pre-training process, not the input of downstream tasks.
Q: Is there any pre-training code provided?
A: Unfortunately, we cannot provide the pre-training code. You can refer to #10 and #13 for implementation hints.
Q: Where can I download a certain dataset?
A: Please check the data directory; the README.md in each task directory indicates the source of the data. For copyrighted content, please search for it yourself or contact the original authors to obtain the data.
Q: Will there be plans to release a larger model? For example, the BERT-large-wwm version?
A: If we get better results from the experiment, we will consider releasing a larger version.
Q: You are lying! I can't reproduce your results!
A: For downstream tasks we used the simplest possible setup; for example, for classification tasks we directly used run_classifier.py (provided by Google). If you cannot reach the average value, there is likely a bug in your experiment; please check it carefully. Reaching the maximum value depends on many random factors, so we cannot guarantee it is reproducible. One more well-known factor: reducing the batch size significantly degrades the results; for details, see the related issues in the BERT and XLNet repositories.
Q: I got better results than you!
A: Congratulations.
Q: How long did training take, and on what hardware?
A: Training was done on Google Cloud TPU v3 (128G HBM). BERT-wwm took about 1.5 days, while BERT-wwm-ext took several weeks (more data requires more training steps). Note that in the pre-training stage we used the LAMB optimizer (TensorFlow implementation), which supports large batches well; when fine-tuning downstream tasks, we used BERT's default AdamWeightDecayOptimizer.
Q: Which ERNIE is this?
A: The ERNIE in this project refers to the ERNIE proposed by Baidu, not the ERNIE published by Tsinghua University at ACL 2019.
Q: BERT-wwm does not perform well on every task.
A: The purpose of this project is to give researchers a variety of pre-trained models to choose from: BERT, ERNIE, or BERT-wwm. We only provide experimental results; you should keep experimenting on your own tasks to draw your own conclusions. One more model, one more choice.
Q: Why weren't some datasets tested?
A: To be frank: 1) no energy to find more data; 2) no need; 3) no money.
Q: Could you briefly evaluate these models?
A: Each has its own focus and its own strengths. The research and development of Chinese natural language processing requires joint efforts from all parties.
Q: What will your next pre-trained model be called?
A: Maybe it's called ZOE. ZOE: Zero-shOt Embeddings from language model
Q: Can you share more details about the RoBERTa-wwm-ext model?
A: We combine the strengths of RoBERTa and BERT-wwm in a natural integration of the two. Compared with the previous models in this repository, the differences are as follows:
1) The wwm strategy is used for masking in the pre-training stage (but without dynamic masking).
2) The Next Sentence Prediction (NSP) loss is simply removed.
3) The two-stage schedule of max_len=128 followed by max_len=512 is no longer used; training is done directly with max_len=512.
4) The number of training steps is extended appropriately.
Note that this model is not the original RoBERTa, but a BERT model trained in a RoBERTa-like way, i.e., a RoBERTa-like BERT. Therefore, when using it for downstream tasks or converting the checkpoint, please treat it as BERT rather than RoBERTa.
If the resources or techniques in this project are helpful to your research, please cite the following papers.
@article{cui-etal-2021-pretrain,
title={Pre-Training with Whole Word Masking for Chinese BERT},
author={Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Yang, Ziqing},
journal={IEEE/ACM Transactions on Audio, Speech and Language Processing},
year={2021},
url={https://ieeexplore.ieee.org/document/9599397},
doi={10.1109/TASLP.2021.3124365},
}
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
The first author is partially funded by Google's TPU Research Cloud program.
This project is not a Chinese BERT-wwm model officially released by Google, nor is it an official product of Harbin Institute of Technology or iFLYTEK. The experimental results in the technical report only reflect performance under specific datasets and hyperparameter combinations and do not characterize the intrinsic quality of each model; results may vary with random seeds and computing devices. The content of this project is for technical research reference only and should not be used as the basis for any conclusions. Users may use the models freely within the scope of the license, but we are not responsible for direct or indirect losses caused by using the content of this project.
Follow the official WeChat account of HFL (the Joint Laboratory of HIT and iFLYTEK Research) to keep up with the latest technical developments.

If you have any questions, please submit them in a GitHub Issue.