Chinese Description | English

In natural language processing, pre-trained language models (PLMs) have become an essential foundational technology. To further advance research on Chinese information processing, we release the Chinese pre-trained model BERT-wwm, based on Whole Word Masking, along with closely related models: BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large, RBT3, RBTL3, etc.
This project is based on Google's official BERT: https://github.com/google-research/bert
Chinese LERT | Chinese/English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation tool TextBrewer | Model pruning tool TextPruner
See more resources released by the Joint Laboratory of HIT and iFLYTEK Research (HFL): https://github.com/ymcui/HFL-Anthology
2023/3/28 The Chinese LLaMA & Alpaca large language models are open-sourced and can be quickly deployed and tried out on a PC. See: https://github.com/ymcui/Chinese-LLaMA-Alpaca
2023/3/9 We propose VLE, an image-text multimodal pre-trained model. See: https://github.com/iflytek/VLE
2022/11/15 We propose MiniRBT, a small Chinese pre-trained model. See: https://github.com/iflytek/MiniRBT
2022/10/29 We propose LERT, a pre-trained model that incorporates linguistic information. See: https://github.com/ymcui/LERT
2022/3/30 We open-source a new pre-trained model, PERT. See: https://github.com/ymcui/PERT
2021/10/24 HFL releases CINO, a pre-trained model for Chinese minority languages. See: https://github.com/ymcui/Chinese-Minority-PLM
2021/7/21 "Natural Language Processing: Methods Based on Pre-trained Models", written by scholars from HIT-SCIR, has been published. Purchases are welcome.
2021/1/27 All models now support TensorFlow 2. Please use or download them through the 🤗Transformers library: https://huggingface.co/hfl
2020/9/15 Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" was accepted as a long paper at Findings of EMNLP.
2020/8/27 HFL topped the leaderboard of the GLUE benchmark for general natural language understanding. See the GLUE leaderboard and news.
2020/3/23 The models in this repository have been integrated into PaddleHub. See Quick Loading.
2020/3/11 To better understand your needs and provide better resources, we invite you to fill out the questionnaire.
2020/2/26 HFL releases the knowledge distillation toolkit TextBrewer.
2020/1/20 Happy Year of the Rat! We release RBT3 and RBTL3 (3-layer RoBERTa-wwm-ext-base/large). See Small-Parameter Models.
2019/12/19 The models in this repository have been integrated into Huggingface-Transformers. See Quick Loading.
2019/10/14 Release of the RoBERTa-wwm-ext-large model. See Chinese Model Download.
2019/9/10 Release of the RoBERTa-wwm-ext model. See Chinese Model Download.
2019/7/30 Chinese BERT-wwm-ext, trained on a larger general-domain corpus (5.4B tokens), is now available. See Chinese Model Download.
2019/6/20 Initial release. The models can be downloaded via the Google download links, and mirrors have also been uploaded to a mainland-China cloud disk. See Chinese Model Download.
| Chapter | Description |
|---|---|
| Introduction | Basic principles of BERT-wwm |
| Chinese Model Download | Download links for BERT-wwm |
| Quick Loading | How to quickly load the models with 🤗Transformers and PaddleHub |
| Model Comparison | Comparison of the parameters of the models in this repository |
| Chinese Baseline Performance | Results of the models on several Chinese baseline systems |
| Small-Parameter Models | Results of the small-parameter models (3-layer Transformer) |
| Usage Recommendations | Several suggestions for using the Chinese pre-trained models |
| English Model Download | Download links for Google's official English BERT-wwm |
| FAQ | Frequently asked questions and answers |
| Citation | Technical reports for this repository |
Whole Word Masking (wwm) is an upgraded version of BERT released by Google on May 31, 2019. It mainly changes the training-sample generation strategy of the original pre-training stage. In short, the original WordPiece tokenization may split a complete word into several subwords, and when training samples are generated these subwords are masked independently at random. With whole word masking, if any WordPiece subword of a complete word is masked, the remaining subwords of the same word are masked as well, i.e., the whole word is masked.
Note that "mask" here refers to masking in the generalized sense (replace with [MASK]; keep the original token; replace with a random token) and is not limited to replacing a token with the [MASK] tag. For a more detailed explanation and examples, see: #4
Similarly, in Google's official BERT-base, Chinese, Chinese text is tokenized at character granularity, without considering Chinese Word Segmentation (CWS) as used in traditional NLP. We apply whole word masking to Chinese, train on Chinese Wikipedia (both Simplified and Traditional), and use HIT's LTP as the word segmentation tool, so that all Chinese characters belonging to the same word are masked together.
The following example shows how samples are generated with whole word masking. Note: for ease of understanding, only the case of replacing tokens with the [MASK] tag is shown.
| Description | Example |
|---|---|
| Original text | 使用语言模型来预测下一个词的probability。 |
| Segmented text | 使用 语言 模型 来 预测 下 一个 词 的 probability 。 |
| Original mask input | 使 用 语 言 [MASK] 型 来 [MASK] 测 下 一 个 词 的 pro [MASK] ##lity 。 |
| Whole-word mask input | 使 用 语 言 [MASK] [MASK] 来 [MASK] [MASK] 下 一 个 词 的 [MASK] [MASK] [MASK] 。 |
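To make the strategy above concrete, here is a minimal Python sketch of whole-word-masking sample generation. It is only an illustration and not the project's actual pre-training code (which is not released); as in the table above, it handles only the simple replace-with-[MASK] case, and the function name and word boundaries in the example are hypothetical.

```python
import random

def whole_word_mask(words_as_subtokens, mask_prob=0.15, mask_token="[MASK]"):
    """words_as_subtokens: a list of words, each given as a list of WordPiece
    subtokens (word boundaries produced by a segmenter such as LTP)."""
    output = []
    for subtokens in words_as_subtokens:
        # Decide masking at the *word* level, then apply it to every subtoken,
        # so that no word is ever only partially masked.
        if random.random() < mask_prob:
            output.extend([mask_token] * len(subtokens))
        else:
            output.extend(subtokens)
    return output

# Example input mirroring the table above:
sample = [["使", "用"], ["语", "言"], ["模", "型"], ["来"], ["预", "测"],
          ["下"], ["一", "个"], ["词"], ["的"], ["pro", "##bability"], ["。"]]
print(" ".join(whole_word_mask(sample)))
```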
This repository mainly contains base-size models, so the word base is not added to the model abbreviations; other sizes are marked explicitly (e.g., large).
- BERT-large: 24-layer, 1024-hidden, 16-heads, 330M parameters
- BERT-base: 12-layer, 768-hidden, 12-heads, 110M parameters

Note: the open-source version does not include the weights for the MLM task. If you need to perform MLM, please run secondary pre-training on additional data (as with other downstream tasks).
| Model | Training Data | Google Download | Baidu Netdisk Download |
|---|---|---|---|
| RBT6, Chinese | EXT data [1] | - | TensorFlow (password: hniy) |
| RBT4, Chinese | EXT data [1] | - | TensorFlow (password: sjpt) |
| RBTL3, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: s6cu) |
| RBT3, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: 5a57) |
| RoBERTa-wwm-ext-large, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: dqqe) |
| RoBERTa-wwm-ext, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: vybq) |
| BERT-wwm-ext, Chinese | EXT data [1] | TensorFlow, PyTorch | TensorFlow (password: wgnt) |
| BERT-wwm, Chinese | Chinese Wiki | TensorFlow, PyTorch | TensorFlow (password: qfh8) |
| BERT-base, Chinese (Google) | Chinese Wiki | Google Cloud | - |
| BERT-base, Multilingual Cased (Google) | Multilingual Wiki | Google Cloud | - |
| BERT-base, Multilingual Uncased (Google) | Multilingual Wiki | Google Cloud | - |
[1] EXT data includes Chinese Wikipedia plus other encyclopedia, news, and Q&A web data, for a total of 5.4B tokens.
If you need the PyTorch version:
1) Convert it yourself using the conversion script provided by 🤗Transformers, or
2) Download it directly from the Hugging Face website: https://huggingface.co/hfl
Download method: click the model you want → select the "Files and versions" tab → download the corresponding model files.
Users in mainland China are advised to use the Baidu Netdisk links, and overseas users the Google links. Each base model is about 400MB. Taking the TensorFlow version of BERT-wwm, Chinese as an example, after downloading, unzip the file to obtain:
chinese_wwm_L-12_H-768_A-12.zip
|- bert_model.ckpt # model weights
|- bert_model.meta # model meta information
|- bert_model.index # model index information
|- bert_config.json # model configuration
|- vocab.txt # vocabulary
Among them, bert_config.json and vocab.txt are identical to those of Google's original BERT-base, Chinese. The PyTorch version contains the files pytorch_model.bin, bert_config.json, and vocab.txt.
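If you downloaded the PyTorch weights from the links above (rather than from the Hugging Face Hub), the directory can be loaded locally. This is a hedged example: the local path is hypothetical, and depending on your 🤗Transformers version you may need to rename bert_config.json to config.json, since newer versions look for the latter.

```python
from transformers import BertTokenizer, BertModel

local_dir = "./chinese_wwm_pytorch"  # assumption: wherever you extracted the files
tokenizer = BertTokenizer.from_pretrained(local_dir)  # uses vocab.txt
model = BertModel.from_pretrained(local_dir)          # uses config.json + pytorch_model.bin
```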
With the 🤗Transformers library, the above models can be loaded easily:
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")
Note: all models in this repository must be loaded with BertTokenizer and BertModel. Do not use RobertaTokenizer/RobertaModel!
The corresponding list of MODEL_NAME is as follows:
| Model name | MODEL_NAME |
|---|---|
| RoBERTa-wwm-ext-large | hfl/chinese-roberta-wwm-ext-large |
| RoBERTa-wwm-ext | hfl/chinese-roberta-wwm-ext |
| BERT-wwm-ext | hfl/chinese-bert-wwm-ext |
| BERT-wwm | hfl/chinese-bert-wwm |
| RBT3 | hfl/rbt3 |
| RBTL3 | hfl/rbtl3 |
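For reference, here is a minimal end-to-end sketch using one MODEL_NAME from the table above; the sentence is just an example, and the hidden size in the final comment applies to the base-size models.

```python
import torch
from transformers import BertTokenizer, BertModel

model_name = "hfl/chinese-roberta-wwm-ext"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Raw text goes in directly; no word segmentation is needed, since wwm only
# affects pre-training, not the downstream input format.
inputs = tokenizer("使用语言模型来预测下一个词的概率。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, sequence_length, 768])
```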
With PaddleHub, a model can be downloaded and installed with a single line of code, and tasks such as text classification, sequence labeling, and reading comprehension can be completed with about ten more lines.
import paddlehub as hub
module = hub.Module(name=MODULE_NAME)
The corresponding list of MODULE_NAME is as follows:
| Model name | MODULE_NAME |
|---|---|
| RoBERTa-wwm-ext-large | chinese-roberta-wwm-ext-large |
| RoBERTa-wwm-ext | chinese-roberta-wwm-ext |
| BERT-wwm-ext | chinese-bert-wwm-ext |
| BERT-wwm | chinese-bert-wwm |
| RBT3 | rbt3 |
| RBTL3 | rbtl3 |
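As a concrete instantiation of the snippet above, using one MODULE_NAME from the table; refer to the PaddleHub documentation for the downstream-task APIs, which are not shown here.

```python
import paddlehub as hub

# Downloads and installs the module on first use.
module = hub.Module(name="rbt3")
```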
The following is a summary of the model details that users are most concerned about.
| - | BERT Google | BERT-wwm | BERT-wwm-ext | RoBERTa-wwm-ext | RoBERTa-wwm-ext-large |
|---|---|---|---|---|---|
| Masking | WordPiece | WWM [1] | WWM | WWM | WWM |
| Type | base | base | base | base | Large |
| Data Source | Wiki | Wiki | wiki+ext [2] | wiki+ext | wiki+ext |
| Training Tokens # | 0.4B | 0.4B | 5.4B | 5.4B | 5.4B |
| Device | TPU Pod v2 | TPU v3 | TPU v3 | TPU v3 | TPU Pod v3-32 [3] |
| Training Steps | ? | 100K MAX128 +100K MAX512 | 1M MAX128 +400K MAX512 | 1M MAX512 | 2M MAX512 |
| Batch Size | ? | 2,560 / 384 | 2,560 / 384 | 384 | 512 |
| Optimizer | AdamW | LAMB | LAMB | AdamW | AdamW |
| Vocabulary | 21,128 | ~BERT [4] | ~BERT | ~BERT | ~BERT |
| Init Checkpoint | Random Init | ~BERT | ~BERT | ~BERT | Random Init |
[1] WWM = Whole Word Masking
[2] ext = extended data
[3] TPU Pod v3-32 (512G HBM) is equivalent to 4 TPU v3 (128G HBM)
[4] ~BERT means inheriting the attributes of Google's original Chinese BERT
To compare baseline performance, we evaluated the models on the following Chinese datasets, covering both sentence-level and document-level tasks. For BERT-wwm-ext, RoBERTa-wwm-ext, and RoBERTa-wwm-ext-large, we did not tune the learning rate further but directly used the optimal learning rate found for BERT-wwm.
Best learning rate:
| Model | BERT | ERNIE | BERT-wwm* |
|---|---|---|---|
| CMRC 2018 | 3e-5 | 8e-5 | 3e-5 |
| DRCD | 3e-5 | 8e-5 | 3e-5 |
| CJRC | 4e-5 | 8e-5 | 4e-5 |
| XNLI | 3e-5 | 5e-5 | 3e-5 |
| ChnSentiCorp | 2e-5 | 5e-5 | 2e-5 |
| LCQMC | 2e-5 | 3e-5 | 2e-5 |
| BQ Corpus | 3e-5 | 5e-5 | 3e-5 |
| THUCNews | 2e-5 | 5e-5 | 2e-5 |
*Represents all wwm series models (BERT-wwm, BERT-wwm-ext, RoBERTa-wwm-ext, RoBERTa-wwm-ext-large)
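As an illustration of applying these learning rates, here is a hedged fine-tuning sketch with 🤗Transformers. It is not the setup used in the report (the report's classification experiments used Google's run_classifier.py, as noted in the FAQ); the output directory, number of labels, epochs, and batch size are placeholder assumptions, and only the learning rate comes from the table above.

```python
from transformers import (BertForSequenceClassification, BertTokenizer,
                          Trainer, TrainingArguments)

model_name = "hfl/chinese-bert-wwm"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3)

training_args = TrainingArguments(
    output_dir="./xnli-bert-wwm",    # placeholder
    learning_rate=3e-5,              # optimal XNLI learning rate for wwm models (table above)
    num_train_epochs=2,              # assumption, not from the report
    per_device_train_batch_size=32,  # assumption, not from the report
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=..., eval_dataset=...)  # supply your own datasets
# trainer.train()
```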
Only some results are listed below. Please see our technical report for the complete results.
Note: To ensure reliable results, each model was run 10 times with different random seeds, and we report both the maximum and the average performance; the average is shown in brackets and the maximum outside the brackets. Barring surprises, your own results should fall within this range.
The CMRC 2018 dataset is a Chinese machine reading comprehension dataset released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). Given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD. Evaluation metrics: EM / F1
| Model | Development Set | Test set | Challenge Set |
|---|---|---|---|
| BERT | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) |
| ERNIE | 65.4 (64.3) / 84.7 (84.2) | 69.4 (68.2) / 86.6 (86.1) | 19.6 (17.0) / 44.3 (42.8) |
| BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) |
| BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) |
| RoBERTa-wwm-ext | 67.4 (66.5) / 87.2 (86.5) | 72.6 (71.4) / 89.4 (88.8) | 26.2 (24.6) / 51.0 (49.1) |
| RoBERTa-wwm-ext-large | 68.5 (67.6) / 88.4 (87.9) | 74.2 (72.4) / 90.6 (90.0) | 31.5 (30.1) / 60.1 (57.5) |
The DRCD dataset was released by Delta Research Center in Taiwan, China. It has the same format as SQuAD and is an extractive reading comprehension dataset in Traditional Chinese. Since ERNIE removes Traditional Chinese characters, it is not recommended to use ERNIE on Traditional Chinese data (or convert the data to Simplified Chinese before processing). Evaluation metrics: EM / F1
| Model | Development Set | Test set |
|---|---|---|
| BERT | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) |
| ERNIE | 73.2 (73.0) / 83.9 (83.8) | 71.9 (71.4) / 82.5 (82.3) |
| BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) |
| BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) |
| RoBERTa-wwm-ext | 86.6 (85.9) / 92.5 (92.2) | 85.6 (85.2) / 92.0 (91.7) |
| RoBERTa-wwm-ext-large | 89.6 (89.1) / 94.8 (94.4) | 89.6 (88.9) / 94.5 (94.1) |
The CJRC dataset is a Chinese machine reading comprehension dataset for the judicial domain released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). Note that the data used in these experiments are not the final officially released data, so the results are for reference only. Evaluation metrics: EM / F1
| Model | Development Set | Test set |
|---|---|---|
| BERT | 54.6 (54.0) / 75.4 (74.5) | 55.1 (54.1) / 75.2 (74.3) |
| ERNIE | 54.3 (53.9) / 75.3 (74.6) | 55.0 (53.9) / 75.0 (73.9) |
| BERT-wwm | 54.7 (54.0) / 75.2 (74.8) | 55.1 (54.1) / 75.4 (74.4) |
| BERT-wwm-ext | 55.6 (54.8) / 76.0 (75.3) | 55.6 (54.9) / 75.8 (75.0) |
| RoBERTa-wwm-ext | 58.7 (57.6) / 79.1 (78.3) | 59.0 (57.8) / 79.0 (78.0) |
| RoBERTa-wwm-ext-large | 62.1 (61.1) / 82.4 (81.6) | 62.4 (61.4) / 82.2 (81.0) |
For natural language inference, we use the XNLI dataset, which requires classifying text into three categories: entailment, neutral, and contradiction. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 77.8 (77.4) | 77.8 (77.5) |
| ERNIE | 79.7 (79.4) | 78.6 (78.2) |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) |
| RoBERTa-wwm-ext | 80.0 (79.2) | 78.8 (78.3) |
| RoBERTa-wwm-ext-large | 82.1 (81.3) | 81.2 (80.6) |
For sentiment analysis, we use the binary sentiment classification dataset ChnSentiCorp. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 94.7 (94.3) | 95.0 (94.7) |
| ERNIE | 95.4 (94.8) | 95.4 (95.3) |
| BERT-wwm | 95.1 (94.5) | 95.4 (95.0) |
| BERT-wwm-ext | 95.4 (94.6) | 95.3 (94.7) |
| RoBERTa-wwm-ext | 95.0 (94.6) | 95.6 (94.8) |
| RoBERTa-wwm-ext-large | 95.8 (94.9) | 95.8 (94.9) |
The following two datasets require classifying a sentence pair to determine whether the two sentences have the same meaning (a binary classification task).
LCQMC was released by the Intelligent Computing Research Center of Harbin Institute of Technology, Shenzhen Graduate School. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 89.4 (88.4) | 86.9 (86.4) |
| ERNIE | 89.8 (89.6) | 87.2 (87.0) |
| BERT-wwm | 89.4 (89.2) | 87.0 (86.8) |
| BERT-wwm-ext | 89.6 (89.2) | 87.1 (86.6) |
| RoBERTa-wwm-ext | 89.0 (88.7) | 86.4 (86.1) |
| RoBERTa-wwm-ext-large | 90.4 (90.0) | 87.0 (86.8) |
BQ Corpus, released by the Intelligent Computing Research Center of Harbin Institute of Technology, Shenzhen Graduate School, is a sentence-pair dataset for the banking domain. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 86.0 (85.5) | 84.8 (84.6) |
| ERNIE | 86.3 (85.5) | 85.0 (84.6) |
| BERT-wwm | 86.1 (85.6) | 85.2 (84.9) |
| BERT-wwm-ext | 86.4 (85.5) | 85.3 (84.8) |
| RoBERTa-wwm-ext | 86.0 (85.4) | 85.0 (84.6) |
| RoBERTa-wwm-ext-large | 86.3 (85.7) | 85.8 (84.9) |
For document-level text classification, we use THUCNews, a news dataset released by the Natural Language Processing Lab of Tsinghua University. We use one of its subsets, in which each news article must be classified into one of 10 categories. Evaluation metric: Accuracy
| Model | Development Set | Test set |
|---|---|---|
| BERT | 97.7 (97.4) | 97.8 (97.6) |
| ERNIE | 97.6 (97.3) | 97.5 (97.3) |
| BERT-wwm | 98.0 (97.6) | 97.8 (97.6) |
| BERT-wwm-ext | 97.7 (97.5) | 97.7 (97.5) |
| RoBERTa-wwm-ext | 98.3 (97.9) | 97.7 (97.5) |
| RoBERTa-wwm-ext-large | 98.3 (97.7) | 97.8 (97.6) |
The following are the results of the small-parameter models on several NLP tasks; only test-set results are shown (CSC = ChnSentiCorp).
| Model | CMRC 2018 | DRCD | XNLI | CSC | LCQMC | BQ | Average | Parameters |
|---|---|---|---|---|---|---|---|---|
| RoBERTa-wwm-ext-large | 74.2 / 90.6 | 89.6 / 94.5 | 81.2 | 95.8 | 87.0 | 85.8 | 87.335 | 325M |
| RoBERTa-wwm-ext | 72.6 / 89.4 | 85.6 / 92.0 | 78.8 | 95.6 | 86.4 | 85.0 | 85.675 | 102M |
| RBTL3 | 63.3 / 83.4 | 77.2 / 85.6 | 74.0 | 94.2 | 85.1 | 83.6 | 80.800 | 61M (59.8%) |
| RBT3 | 62.2 / 81.8 | 75.0 / 83.9 | 72.3 | 92.8 | 85.1 | 83.3 | 79.550 | 38M (37.3%) |
Relative performance, with RoBERTa-wwm-ext as the 100% baseline:
| Model | CMRC 2018 | DRCD | XNLI | CSC | LCQMC | BQ | Average | Classification Average |
|---|---|---|---|---|---|---|---|---|
| RoBERTa-wwm-ext-large | 102.2% / 101.3% | 104.7% / 102.7% | 103.0% | 100.2% | 100.7% | 100.9% | 101.9% | 101.2% |
| RoBERTa-wwm-ext | 100% / 100% | 100% / 100% | 100% | 100% | 100% | 100% | 100% | 100% |
| RBTL3 | 87.2% / 93.3% | 90.2% / 93.0% | 93.9% | 98.5% | 98.5% | 98.4% | 94.3% | 97.35% |
| RBT3 | 85.7% / 91.5% | 87.6% / 91.2% | 91.8% | 97.1% | 98.5% | 98.0% | 92.9% | 96.35% |
We also recommend MiniRBT, a newer small Chinese pre-trained model with better performance: https://github.com/iflytek/MiniRBT
- The initial learning rate is a very important hyperparameter (for BERT as well as other models) and should be tuned for the target task.
- The optimal learning rate of ERNIE differs considerably from that of BERT / BERT-wwm, so be sure to adjust the learning rate when using ERNIE (based on the results above, ERNIE requires a relatively high initial learning rate).
- Since BERT / BERT-wwm were trained on Wikipedia data, they are better at modeling formal text, whereas ERNIE was additionally trained on web data such as Baidu Tieba and Baidu Zhidao, which gives it an advantage on informal text (e.g., Weibo).
- On long-text tasks such as reading comprehension and document classification, BERT and BERT-wwm perform better.
- If you need to process Traditional Chinese data, use BERT or BERT-wwm, because we found that the vocabulary of ERNIE contains almost no Traditional Chinese characters.

For convenience, we also list the English BERT-large (wwm) models officially released by Google:
BERT-Large, Uncased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
BERT-Large, Cased (Whole Word Masking) : 24-layer, 1024-hidden, 16-heads, 340M parameters
Q: How to use this model?
A: Use it in exactly the same way as the Chinese BERT released by Google. The input text does not need to be word-segmented; wwm only affects the pre-training process, not the input of downstream tasks.
Q: Is there any pre-training code provided?
A: Unfortunately, we cannot provide the pre-training code. You can refer to #10 and #13 for implementation hints.
Q: Where can I download a certain dataset?
A: Please check the data directory; the README.md in each task directory indicates the source of the data. For copyrighted content, please search for it yourself or contact the original authors to obtain the data.
Q: Will there be plans to release a larger model? For example, the BERT-large-wwm version?
A: If we get better results from the experiment, we will consider releasing a larger version.
Q: You are lying! I can't reproduce your results!
A: For downstream tasks we used the simplest possible setup; for example, for classification tasks we directly used run_classifier.py (provided by Google). If you cannot reach the average value, there is likely a bug in your experiment; please check it carefully. Reaching the maximum value depends on many random factors, so we cannot guarantee it is reproducible. One more well-known factor: reducing the batch size significantly degrades the results; for details, see the related issues in the BERT and XLNet repositories.
Q: I got better results than you!
A: Congratulations.
Q: How long did training take, and on what hardware?
A: Training was done on Google Cloud TPU v3 (128G HBM). BERT-wwm took about 1.5 days, while BERT-wwm-ext took several weeks (more data requires more training steps). Note that in the pre-training stage we used the LAMB optimizer (TensorFlow implementation), which supports large batches well; when fine-tuning downstream tasks, we used BERT's default AdamWeightDecayOptimizer.
Q: Which ERNIE is this?
A: The ERNIE in this project refers to the ERNIE proposed by Baidu, not the ERNIE published by Tsinghua University at ACL 2019.
Q: BERT-wwm does not perform well on every task.
A: The purpose of this project is to give researchers a variety of pre-trained models to choose from: BERT, ERNIE, or BERT-wwm. We only provide experimental results; you should keep experimenting on your own tasks to draw your own conclusions. One more model, one more choice.
Q: Why weren't some datasets tested?
A: To be frank: 1) no energy to find more data; 2) no need; 3) no money.
Q: Could you briefly evaluate these models?
A: Each has its own focus and its own strengths. The research and development of Chinese natural language processing requires joint efforts from all parties.
Q: What will your next pre-trained model be called?
A: Maybe it's called ZOE. ZOE: Zero-shOt Embeddings from language model
Q: Can you share more details about the RoBERTa-wwm-ext model?
A: We combine the strengths of RoBERTa and BERT-wwm in a natural integration of the two. Compared with the previous models in this repository, the differences are as follows:
1) The wwm strategy is used for masking in the pre-training stage (but without dynamic masking).
2) The Next Sentence Prediction (NSP) loss is simply removed.
3) The two-stage schedule of max_len=128 followed by max_len=512 is no longer used; training is done directly with max_len=512.
4) The number of training steps is extended appropriately.
Note that this model is not the original RoBERTa, but a BERT model trained in a RoBERTa-like way, i.e., a RoBERTa-like BERT. Therefore, when using it for downstream tasks or converting the checkpoint, please treat it as BERT rather than RoBERTa.
If the resources or techniques in this project are helpful to your research, please cite the following papers.
@article{cui-etal-2021-pretrain,
title={Pre-Training with Whole Word Masking for Chinese BERT},
author={Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Yang, Ziqing},
journal={IEEE/ACM Transactions on Audio, Speech and Language Processing},
year={2021},
url={https://ieeexplore.ieee.org/document/9599397},
doi={10.1109/TASLP.2021.3124365},
}
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
The first author is partially funded by Google's TPU Research Cloud program.
This project is not a Chinese BERT-wwm model officially released by Google, nor is it an official product of Harbin Institute of Technology or iFLYTEK. The experimental results in the technical report only reflect performance under specific datasets and hyperparameter combinations and do not characterize the intrinsic quality of each model; results may vary with random seeds and computing devices. The content of this project is for technical research reference only and should not be used as the basis for any conclusions. Users may use the models freely within the scope of the license, but we are not responsible for direct or indirect losses caused by using the content of this project.
Follow the official WeChat account of HFL (the Joint Laboratory of HIT and iFLYTEK Research) to keep up with the latest technical developments.

If you have any questions, please submit them in a GitHub Issue.