Simplified Chinese | English

Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Knowledge distillation tool TextBrewer | Model pruning tool TextPruner
More resources released by HFL: https://github.com/ymcui/HFL-Anthology
2023/3/28 Released the open-source Chinese LLaMA & Alpaca large language models, which can be quickly deployed and run on a PC. See: https://github.com/ymcui/Chinese-LLaMA-Alpaca
2022/3/30 Released a new pre-trained model, PERT: https://github.com/ymcui/PERT
2021/12/17 Released the model pruning tool TextPruner: https://github.com/airaria/TextPruner
2021/10/24 Released the first pre-trained models for Chinese ethnic minority languages: https://github.com/ymcui/Chinese-Minority-PLM
2021/7/21 The book "Natural Language Processing: Methods based on Pre-trained Models" was officially published.
2020/11/3 The Chinese pre-trained model MacBERT has been released; it is used in the same way as BERT.
2020/9/15 Our paper "Revisiting Pre-Trained Models for Chinese Natural Language Processing" was accepted as a long paper at Findings of EMNLP.
| Section | Description |
|---|---|
| Introduction | A brief introduction to MacBERT |
| Download | Download links for MacBERT |
| Quick Load | How to quickly load the models with Transformers |
| Baseline Results | Results on Chinese NLP tasks |
| FAQ | Frequently asked questions |
| Citation | Citation information |
MacBERT is an improved version of BERT that introduces the error-correction-style masked language model pre-training task (MLM as correction, Mac), alleviating the discrepancy between the pre-training and fine-tuning (downstream task) stages.
In the masked language model (MLM), the [MASK] token is introduced for masking, but this token never appears in downstream tasks. In MacBERT, we instead replace the tokens to be masked with similar words. The similar words are obtained with the Synonyms toolkit (Wang and Hu, 2017), whose similarity computation is based on word2vec (Mikolov et al., 2013). We also adopt whole word masking (wwm) and N-gram masking: when masking an N-gram, we look up a similar word for each word in the N-gram, and when no similar word is available, we fall back to a random word.
The following shows an example of a training sample under different masking strategies.
| Masking strategy | Example |
|---|---|
| Original sentence | we use a language model to predict the probability of the next word. |
| MLM | we use a language [M] to [M] ##di ##ct the pro [M] ##ability of the next word . |
| Whole word masking | we use a language [M] to [M] [M] [M] the [M] [M] of the next word . |
| N-gram masking | we use a [M] [M] to [M] [M] the [M] [M] the [M] [M] [M] next word . |
| MLM as correction | we use a text system to ca ##lc ##ulate the po ##si ##ability of the next word. |
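A minimal, purely illustrative sketch of this masking procedure is shown below. It is not the actual MacBERT data pipeline: `get_similar_word` is a stand-in for the Synonyms-based lookup, and whole-word masking details are omitted.

```python
import random

def get_similar_word(word):
    """Stand-in for a similar-word lookup (e.g. via the Synonyms toolkit).
    Returns None when no similar word is available."""
    toy_synonyms = {"model": "system", "predict": "calculate", "probability": "possibility"}
    return toy_synonyms.get(word)

def mac_mask(tokens, mask_rate=0.15, max_ngram=4):
    """Toy MLM-as-correction masking: pick random n-grams and replace each word
    with a similar word if one exists, otherwise with a random word."""
    tokens = list(tokens)
    num_to_mask = max(1, int(len(tokens) * mask_rate))
    masked = 0
    while masked < num_to_mask:
        n = random.randint(1, min(max_ngram, len(tokens)))  # n-gram length
        start = random.randrange(0, len(tokens) - n + 1)    # n-gram start position
        for i in range(start, start + n):
            similar = get_similar_word(tokens[i])
            tokens[i] = similar if similar is not None else random.choice(tokens)
            masked += 1
    return tokens

sentence = "we use a language model to predict the probability of the next word".split()
print(" ".join(mac_mask(sentence)))
```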
MacBERT's main architecture is exactly the same as BERT's, so MacBERT can directly replace BERT without modifying existing code.
For more details, please refer to our paper: Revisiting Pre-trained Models for Chinese Natural Language Processing
We mainly provide model downloads for TensorFlow 1.x.
- MacBERT-large, Chinese: 24-layer, 1024-hidden, 16-heads, 324M parameters
- MacBERT-base, Chinese: 12-layer, 768-hidden, 12-heads, 102M parameters

| Model | Google Drive | Baidu Disk | Size |
|---|---|---|---|
| MacBERT-large, Chinese | TensorFlow | TensorFlow (pw:zejf) | 1.2G |
| MacBERT-base, Chinese | TensorFlow | TensorFlow (pw:61ga) | 383M |
If you need the PyTorch or TensorFlow 2 version of the models, you can download them from the Hugging Face model hub (you can also clone the entire model directory directly with git).
The MacBERT models can be quickly loaded with the Transformers library.

from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("MODEL_NAME")
model = BertModel.from_pretrained("MODEL_NAME")
Note: Please use BertTokenizer and BertModel to load MacBERT models!
The corresponding MODEL_NAME is as follows:
| Original model | Model call name |
|---|---|
| MacBERT-large | hfl/chinese-macbert-large |
| MacBERT-base | hfl/chinese-macbert-base |
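For example, a minimal usage sketch with the base model (assuming `transformers` and PyTorch are installed; the input sentence is arbitrary):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = BertModel.from_pretrained("hfl/chinese-macbert-base")

inputs = tokenizer("哈工大讯飞联合实验室", return_tensors="pt")  # encode a sample sentence
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # torch.Size([1, seq_len, 768]) for the base model
```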
Below we present the results of MacBERT on six downstream Chinese NLP tasks (see the paper for more results).
To ensure the stability of the results, we report both the maximum score and the average score (in parentheses) over 10 independent runs.
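For example, a table entry such as `68.5 (67.3)` corresponds to the maximum and mean over 10 runs, roughly as in this sketch (the per-run scores below are hypothetical placeholders, not actual run logs):

```python
# Hypothetical per-run scores from 10 independent fine-tuning runs.
scores = [67.0, 67.5, 66.8, 67.4, 67.9, 66.5, 67.2, 68.5, 67.1, 67.1]

# Report as "max (mean)", matching the table format.
print(f"{max(scores):.1f} ({sum(scores) / len(scores):.1f})")  # -> 68.5 (67.3)
```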
The CMRC 2018 dataset is a Chinese machine reading comprehension dataset released by the Joint Laboratory of HIT and iFLYTEK Research (HFL). Given a question, the system needs to extract a span from the passage as the answer, in the same format as SQuAD. Evaluation metrics: EM / F1
| Model | Development | Test | Challenge | #Params |
|---|---|---|---|---|
| BERT-base | 65.5 (64.4) / 84.5 (84.0) | 70.0 (68.7) / 87.0 (86.3) | 18.6 (17.0) / 43.3 (41.3) | 102M |
| BERT-wwm | 66.3 (65.0) / 85.6 (84.7) | 70.5 (69.1) / 87.4 (86.7) | 21.0 (19.3) / 47.0 (43.9) | 102M |
| BERT-wwm-ext | 67.1 (65.6) / 85.7 (85.0) | 71.4 (70.0) / 87.7 (87.0) | 24.0 (20.0) / 47.3 (44.6) | 102M |
| RoBERTa-wwm-ext | 67.4 (66.5) / 87.2 (86.5) | 72.6 (71.4) / 89.4 (88.8) | 26.2 (24.6) / 51.0 (49.1) | 102M |
| ELECTRA-base | 68.4 (68.0) / 84.8 (84.6) | 73.1 (72.7) / 87.1 (86.9) | 22.6 (21.7) / 45.0 (43.8) | 102M |
| MacBERT-base | 68.5 (67.3) / 87.9 (87.1) | 73.2 (72.4) / 89.5 (89.2) | 30.2 (26.4) / 54.0 (52.2) | 102M |
| ELECTRA-large | 69.1 (68.2) / 85.2 (84.5) | 73.9 (72.8) / 87.1 (86.6) | 23.0 (21.6) / 44.2 (43.2) | 324M |
| RoBERTa-wwm-ext-large | 68.5 (67.6) / 88.4 (87.9) | 74.2 (72.4) / 90.6 (90.0) | 31.5 (30.1) / 60.1 (57.5) | 324M |
| MacBERT-large | 70.7 (68.6) / 88.9 (88.2) | 74.8 (73.2) / 90.7 (90.1) | 31.9 (29.6) / 60.2 (57.6) | 324M |
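For reference, a simplified character-level implementation of the EM / F1 metrics used above is sketched below (illustrative only, not the official CMRC 2018 evaluation script, which performs additional answer normalization):

```python
from collections import Counter

def exact_match(prediction, reference):
    """EM: 1.0 if the predicted span exactly matches the reference answer."""
    return float(prediction == reference)

def char_f1(prediction, reference):
    """Character-level F1 overlap between predicted and reference answer spans."""
    common = Counter(prediction) & Counter(reference)  # shared characters with multiplicity
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(prediction)
    recall = num_same / len(reference)
    return 2 * precision * recall / (precision + recall)

print(exact_match("北京", "北京市"), round(char_f1("北京", "北京市"), 2))  # 0.0 0.8
```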
The DRCD dataset was released by Delta Research Institute, Taiwan, China. It is a span-extraction reading comprehension dataset in Traditional Chinese, in the same format as SQuAD. Since Traditional Chinese characters were removed from ERNIE, we do not recommend using ERNIE on Traditional Chinese data (or convert the data to Simplified Chinese before processing). Evaluation metrics: EM / F1
| Model | Development | Test | #Params |
|---|---|---|---|
| BERT-base | 83.1 (82.7) / 89.9 (89.6) | 82.2 (81.6) / 89.2 (88.8) | 102M |
| BERT-wwm | 84.3 (83.4) / 90.5 (90.2) | 82.8 (81.8) / 89.7 (89.0) | 102M |
| BERT-wwm-ext | 85.0 (84.5) / 91.2 (90.9) | 83.6 (83.0) / 90.4 (89.9) | 102M |
| RoBERTa-wwm-ext | 86.6 (85.9) / 92.5 (92.2) | 85.6 (85.2) / 92.0 (91.7) | 102M |
| ELECTRA-base | 87.5 (87.0) / 92.5 (92.3) | 86.9 (86.6) / 91.8 (91.7) | 102M |
| MacBERT-base | 89.4 (89.2) / 94.3 (94.1) | 89.5 (88.7) / 93.8 (93.5) | 102M |
| ELECTRA-large | 88.8 (88.7) / 93.3 (93.2) | 88.8 (88.2) / 93.6 (93.2) | 324M |
| RoBERTa-wwm-ext-large | 89.6 (89.1) / 94.8 (94.4) | 89.6 (88.9) / 94.5 (94.1) | 324M |
| MacBERT-large | 91.2 (90.8) / 95.6 (95.3) | 91.7 (90.9) / 95.6 (95.3) | 324M |
For the natural language inference task, we use the XNLI dataset, where each text pair must be classified into one of three categories: entailment, neutral, or contradiction. Evaluation metric: Accuracy
| Model | Development | Test | #Params |
|---|---|---|---|
| BERT-base | 77.8 (77.4) | 77.8 (77.5) | 102M |
| BERT-wwm | 79.0 (78.4) | 78.2 (78.0) | 102M |
| BERT-wwm-ext | 79.4 (78.6) | 78.7 (78.3) | 102M |
| RoBERTa-wwm-ext | 80.0 (79.2) | 78.8 (78.3) | 102M |
| ELECTRA-base | 77.9 (77.0) | 78.4 (77.8) | 102M |
| MacBERT-base | 80.3 (79.7) | 79.3 (78.8) | 102M |
| ELECTRA-large | 81.5 (80.8) | 81.0 (80.9) | 324M |
| RoBERTa-wwm-ext-large | 82.1 (81.3) | 81.2 (80.6) | 324M |
| MacBERT-large | 82.4 (81.8) | 81.3 (80.6) | 324M |
For the sentiment analysis task, we use the binary sentiment classification dataset ChnSentiCorp. Evaluation metric: Accuracy
| Model | Development | Test | #Params |
|---|---|---|---|
| BERT-base | 94.7 (94.3) | 95.0 (94.7) | 102M |
| BERT-wwm | 95.1 (94.5) | 95.4 (95.0) | 102M |
| BERT-wwm-ext | 95.4 (94.6) | 95.3 (94.7) | 102M |
| RoBERTa-wwm-ext | 95.0 (94.6) | 95.6 (94.8) | 102M |
| ELECTRA-base | 93.8 (93.0) | 94.5 (93.5) | 102M |
| MacBERT-base | 95.2 (94.8) | 95.6 (94.9) | 102M |
| ELECTRA-large | 95.2 (94.6) | 95.3 (94.8) | 324M |
| RoBERTa-wwm-ext-large | 95.8 (94.9) | 95.8 (94.9) | 324M |
| MacBERT-large | 95.7 (95.0) | 95.9 (95.1) | 324M |
LCQMC is a sentence-pair matching dataset released by the Intelligent Computing Research Center of Harbin Institute of Technology Shenzhen Graduate School. Evaluation metric: Accuracy
| Model | Development | Test | #Params |
|---|---|---|---|
| BERT | 89.4 (88.4) | 86.9 (86.4) | 102M |
| BERT-wwm | 89.4 (89.2) | 87.0 (86.8) | 102M |
| BERT-wwm-ext | 89.6 (89.2) | 87.1 (86.6) | 102M |
| RoBERTa-wwm-ext | 89.0 (88.7) | 86.4 (86.1) | 102M |
| ELECTRA-base | 90.2 (89.8) | 87.6 (87.3) | 102M |
| MacBERT-base | 89.5 (89.3) | 87.0 (86.5) | 102M |
| ELECTRA-large | 90.7 (90.4) | 87.3 (87.2) | 324M |
| RoBERTa-wwm-ext-large | 90.4 (90.0) | 87.0 (86.8) | 324M |
| MacBERT-large | 90.6 (90.3) | 87.6 (87.1) | 324M |
BQ Corpus is a sentence-pair matching dataset for the banking domain, released by the Intelligent Computing Research Center of Harbin Institute of Technology Shenzhen Graduate School. Evaluation metric: Accuracy
| Model | Development | Test | #Params |
|---|---|---|---|
| BERT | 86.0 (85.5) | 84.8 (84.6) | 102M |
| BERT-wwm | 86.1 (85.6) | 85.2 (84.9) | 102M |
| BERT-wwm-ext | 86.4 (85.5) | 85.3 (84.8) | 102M |
| RoBERTa-wwm-ext | 86.0 (85.4) | 85.0 (84.6) | 102M |
| ELECTRA-base | 84.8 (84.7) | 84.5 (84.0) | 102M |
| MacBERT-base | 86.0 (85.5) | 85.2 (84.9) | 102M |
| ELECTRA-large | 86.7 (86.2) | 85.1 (84.8) | 324M |
| RoBERTa-wwm-ext-large | 86.3 (85.7) | 85.8 (84.9) | 324M |
| MacBERT-large | 86.2 (85.7) | 85.6 (85.0) | 324M |
Q1: Is there an English version of MacBERT?
A1: None at the moment.
Q2: How to use MacBERT?
A2: Use it exactly as you would use BERT: simply replace the model weights and config files. You can also load our model to initialize the Transformer weights and continue pre-training on top of it.
Q3: Can you provide MacBERT training code?
A3: There is no open source plan yet.
Q4: Can you open-source the pre-training corpus?
A4: We cannot release the training corpus because we do not have the right to redistribute it. There are some open-source Chinese corpus resources available on GitHub that you may find useful.
Q5: Are there any plans to train MacBERT on a larger corpus and release it?
A5: We have no plans for the time being.
If the resources in this project are helpful to your research, please cite one of the following papers.
@inproceedings{cui-etal-2020-revisiting,
title = "Revisiting Pre-Trained Models for {C}hinese Natural Language Processing",
author = "Cui, Yiming and
Che, Wanxiang and
Liu, Ting and
Qin, Bing and
Wang, Shijin and
Hu, Guoping",
booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://www.aclweb.org/anthology/2020.findings-emnlp.58",
pages = "657--668",
}
or:
@article{cui-etal-2021-pretrain,
title={Pre-Training with Whole Word Masking for Chinese BERT},
author={Cui, Yiming and Che, Wanxiang and Liu, Ting and Qin, Bing and Yang, Ziqing},
journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
year={2021},
url={https://ieeexplore.ieee.org/document/9599397},
doi={10.1109/TASLP.2021.3124365},
}
Thanks to Google TPU Research Cloud (TFRC) for its computing resource support.
If you have any questions, please submit them via GitHub Issues.