Although pre-trained language models are widely used across NLP, their high time and compute costs remain a pressing problem. This calls for models that achieve better metrics under a given compute budget.
Our goal is not to pursue ever-larger model sizes, but lightweight yet more powerful models that are easier to deploy and friendlier to industrial adoption.
Based on techniques such as linguistic-information integration and training acceleration, we developed the Mengzi family of models. Because the model structure is consistent with BERT, Mengzi models can quickly replace existing pre-trained models.
For the detailed technical report, please refer to:
Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese
Added two open-source GPT-architecture models (@huajingyun).
@hululuzhu trained a Chinese AI writing model on top of mengzi-t5-base to generate poetry and couplets. For the model and usage details, please refer to: chinese-ai-writing-share
Some generation examples:
上: 不待鸣钟已汗颜,重来试手竟何艰
下: 何堪击鼓频催泪?一别伤心更枉然
上: 北国风光,千里冰封,万里雪飘
下: 南疆气象,五湖浪涌,三江潮来
標題: 作诗:中秋
詩歌: 秋氣侵肌骨,寒光入鬢毛。雲收千里月,風送一帆高。
標題: 作诗:中秋 模仿:苏轼
詩歌: 月從海上生,照我庭下影。不知此何夕,但見天宇靜。
Thanks to the PaddlePaddle team (@yingyibiao) for providing the PaddleNLP version of the models and documentation.
Note: The PaddleNLP version models are not produced by Langboat Technology, and we take no responsibility for their outputs or results.
| Model | Parameters | Applicable scenarios | Features | Download links |
|---|---|---|---|---|
| Mengzi-BERT-base | 110M | Natural language understanding tasks such as text classification, entity recognition, relation extraction, and reading comprehension | Same structure as BERT; can directly replace existing BERT weights | HuggingFace, Domestic ZIP download, PaddleNLP |
| Mengzi-BERT-L6-H768 | 60M | Natural language understanding tasks such as text classification, entity recognition, relation extraction, and reading comprehension | Distilled from Mengzi-BERT-large | HuggingFace |
| Mengzi-BERT-base-fin | 110M | Natural language understanding tasks in the financial domain | Further trained on a financial corpus on top of Mengzi-BERT-base | HuggingFace, Domestic ZIP download, PaddleNLP |
| Mengzi-T5-base | 220M | Controllable text generation tasks such as marketing copy generation and news generation | Same structure as T5; no downstream tasks included, so it must be fine-tuned on a specific task before use. Unlike GPT-style models, it is not suited to open-ended text continuation | HuggingFace, Domestic ZIP download, PaddleNLP |
| Mengzi-T5-base-MT | 220M | Zero-shot and few-shot capabilities | Multi-task model; various tasks can be completed via prompts (see the sketch after this table) | HuggingFace |
| Mengzi-Oscar-base | 110M | Tasks such as image captioning and image-text retrieval | Multimodal model based on Mengzi-BERT-base, trained on millions of image-text pairs | HuggingFace |
| Mengzi-GPT-neo-base | 125M | Text continuation | Trained on a Chinese corpus; suitable as a baseline model for related work | HuggingFace |
| BLOOM-389m-zh | 389M | Text continuation | Multilingual BLOOM model pruned to a Chinese vocabulary, reducing GPU memory requirements | HuggingFace |
| BLOOM-800m-zh | 800M | Text continuation | Multilingual BLOOM model pruned to a Chinese vocabulary, reducing GPU memory requirements | HuggingFace |
| BLOOM-1b4-zh | 1400M | Text continuation | Multilingual BLOOM model pruned to a Chinese vocabulary, reducing GPU memory requirements | HuggingFace |
| BLOOM-2b5-zh | 2500M | Text continuation | Multilingual BLOOM model pruned to a Chinese vocabulary, reducing GPU memory requirements | HuggingFace |
| BLOOM-6b4-zh | 6400M | Text continuation | Multilingual BLOOM model pruned to a Chinese vocabulary, reducing GPU memory requirements | HuggingFace |
| ReGPT-125M-200G | 125M | Text continuation | Trained from GPT-Neo-125M via https://github.com/Langboat/mengzi-retrieval-lm | HuggingFace |
| Guohua-Diffusion | - | Chinese-painting-style text-to-image generation | Trained with DreamBooth on Stable Diffusion v1.5 | HuggingFace |
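The multi-task checkpoint Mengzi-T5-base-MT is driven purely by prompts. Below is a minimal zero-shot sketch; the repository id `Langboat/mengzi-t5-base-mt` and the prompt wording are assumptions for illustration, and the HuggingFace model card documents the prompt templates actually used in training.

```python
# Minimal zero-shot sketch for Mengzi-T5-base-MT (prompt-driven multi-task model).
# The repo id and the prompt wording below are illustrative assumptions; see the
# HuggingFace model card for the prompt templates used in training.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Langboat/mengzi-t5-base-mt")
model = T5ForConditionalGeneration.from_pretrained("Langboat/mengzi-t5-base-mt")

# Hypothetical sentiment-classification prompt.
prompt = "评论:这家餐厅的菜很好吃,服务也不错。该评论的情感是积极还是消极?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```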
```python
# Load with Huggingface Transformers
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("Langboat/mengzi-bert-base")
model = BertModel.from_pretrained("Langboat/mengzi-bert-base")
```

or

```python
# Load with PaddleNLP
from paddlenlp.transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("Langboat/mengzi-bert-base")
model = BertModel.from_pretrained("Langboat/mengzi-bert-base")
```

Integrated into Huggingface Spaces with Gradio. See the demo:
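Beyond the hosted demo, a quick forward pass is enough to check that the checkpoint behaves as a drop-in BERT. This is only a sanity-check sketch with an arbitrary example sentence:

```python
# Sanity-check sketch: encode one sentence and inspect the encoder output shape.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("Langboat/mengzi-bert-base")
model = BertModel.from_pretrained("Langboat/mengzi-bert-base")

inputs = tokenizer("孟子轻量化预训练模型。", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
```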
```python
# Load with Huggingface Transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Langboat/mengzi-t5-base")
model = T5ForConditionalGeneration.from_pretrained("Langboat/mengzi-t5-base")
```

or

```python
# Load with PaddleNLP
from paddlenlp.transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("Langboat/mengzi-t5-base")
model = T5ForConditionalGeneration.from_pretrained("Langboat/mengzi-t5-base")
```

Reference Documents
```bash
# To load with Huggingface Transformers
pip install transformers
```

or

```bash
# To load with PaddleNLP
pip install paddlenlp
```

| Model | AFQMC | TNEWS | IFLYTEK | CMNLI | WSC | CSL | CMRC2018 | C3 | CHID |
|---|---|---|---|---|---|---|---|---|---|
| RoBERTa-wwm-ext | 74.30 | 57.51 | 60.80 | 80.70 | 67.20 | 80.67 | 77.59 | 67.06 | 83.78 |
| Mengzi-BERT-base | 74.58 | 57.97 | 60.68 | 82.12 | 87.50 | 85.40 | 78.54 | 71.70 | 84.16 |
| Mengzi-BERT-L6-H768 | 74.75 | 56.68 | 60.22 | 81.10 | 84.87 | 85.77 | 78.06 | 65.49 | 80.59 |
The RoBERTa-wwm-ext scores are taken from the CLUE baseline.
| Task | Learning rate | Global batch size | Epochs |
|---|---|---|---|
| AFQMC | 3e-5 | 32 | 10 |
| TNEWS | 3e-5 | 128 | 10 |
| IFLYTEK | 3e-5 | 64 | 10 |
| CMNLI | 3e-5 | 512 | 10 |
| WSC | 8e-6 | 64 | 50 |
| CSL | 5e-5 | 128 | 5 |
| CMRC2018 | 5e-5 | 8 | 5 |
| C3 | 1e-4 | 240 | 3 |
| CHID | 5e-5 | 256 | 5 |
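For reference, the hyperparameters above drop into a standard Huggingface fine-tuning loop. The sketch below uses the TNEWS row (learning rate 3e-5, global batch size 128, 10 epochs); `train_dataset` and `eval_dataset` are assumed to be already tokenized TNEWS splits, and gradient accumulation may be needed to reach the global batch size.

```python
# Sketch: fine-tune Mengzi-BERT-base on TNEWS (15 classes in CLUE) with the
# hyperparameters from the table above. train_dataset / eval_dataset are assumed
# to be pre-tokenized datasets with a "labels" column.
from transformers import (
    BertForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model = BertForSequenceClassification.from_pretrained(
    "Langboat/mengzi-bert-base", num_labels=15
)

args = TrainingArguments(
    output_dir="mengzi-bert-tnews",
    learning_rate=3e-5,
    per_device_train_batch_size=128,  # combine with gradient_accumulation_steps if memory is tight
    num_train_epochs=10,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,  # assumed to exist
    eval_dataset=eval_dataset,    # assumed to exist
)
trainer.train()
```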

wangyulong[at]langboat[dot]com
Q: The saved mengzi-bert-base checkpoint is about 196 MB, while bert-base is usually around 389 MB. Is the base architecture defined differently, or is some unnecessary content dropped when saving?
A: This is because Mengzi-BERT-base is trained with FP16, so the saved weights are half-precision.
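A rough way to check this yourself is to multiply the parameter count by the bytes per parameter (2 for FP16, 4 for FP32); the sketch below is only an approximation of the on-disk size.

```python
# Approximate checkpoint size from the parameter count and storage precision.
from transformers import BertModel

model = BertModel.from_pretrained("Langboat/mengzi-bert-base")
n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params / 1e6:.0f}M")
print(f"approx. size in FP16: {n_params * 2 / 1024**2:.0f} MB")
print(f"approx. size in FP32: {n_params * 4 / 1024**2:.0f} MB")
```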
Q: What is the source of the data for the financial pre-trained model?
A: Financial news, announcements, and research reports crawled from the web.
Q: Is there a TensorFlow version of the models?
A: You can convert the weights yourself.
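A minimal conversion sketch with the transformers TensorFlow classes (both torch and tensorflow need to be installed; the output directory name is arbitrary):

```python
# Load the PyTorch weights into the TensorFlow class and save a TF checkpoint.
from transformers import TFBertModel

tf_model = TFBertModel.from_pretrained("Langboat/mengzi-bert-base", from_pt=True)
tf_model.save_pretrained("./mengzi-bert-base-tf")
```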
Q: Will the training code be open sourced?
A: Because it is tightly coupled with our internal infrastructure, there is currently no plan to do so.
Q: How can we achieve the same text generation quality as on the Langboat official website?
A: Our core text generation model is based on the T5 architecture; the basic algorithm is described in Google's T5 paper: https://arxiv.org/pdf/1910.10683.pdf. The open-source Mengzi-T5 model shares Google's T5 pre-trained architecture: it is a general pre-trained model and does not ship with any specific text generation task. Our marketing copywriting feature fine-tunes it on a large amount of data for specific downstream tasks. On top of that, to achieve controllable generation, we have built a complete text generation pipeline: data cleaning, knowledge extraction, training data construction, and generation quality evaluation. Most of it is customized for commercial deployment scenarios, with different pre-training and fine-tuning tasks constructed for different business needs and data formats. Because this part involves relatively complex software architecture and specific business scenarios, we have not open-sourced it yet.
Q: Can Mengzi-T5-base be used for inference directly?
A: We follow T5 v1.1 and do not include downstream tasks, so the checkpoint needs to be fine-tuned before inference.
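In practice this means the checkpoint needs a seq2seq fine-tuning step before it produces task-specific output. A minimal sketch, where `train_dataset`, the learning rate, and the epoch count are placeholder assumptions for your own task:

```python
# Sketch: fine-tune Mengzi-T5-base on a custom seq2seq task, then generate.
# train_dataset is assumed to yield input_ids / attention_mask / labels.
from transformers import (
    T5Tokenizer,
    T5ForConditionalGeneration,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = T5Tokenizer.from_pretrained("Langboat/mengzi-t5-base")
model = T5ForConditionalGeneration.from_pretrained("Langboat/mengzi-t5-base")

args = Seq2SeqTrainingArguments(
    output_dir="mengzi-t5-finetuned",
    learning_rate=3e-4,   # placeholder
    num_train_epochs=3,   # placeholder
)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()

# After fine-tuning, inference works as usual:
inputs = tokenizer("你的任务输入", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```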
Q: What should I do if loading fails with Huggingface Transformers?
A: Try adding force_download=True.
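For example:

```python
# Force a fresh download, bypassing a possibly corrupted local cache.
from transformers import BertModel

model = BertModel.from_pretrained("Langboat/mengzi-bert-base", force_download=True)
```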
Q: In constrained generation, Mengzi-T5-base tends to produce word-level candidates, while mT5 does the opposite and prefers character-level candidates. Is this because word granularity was used during training?
A: Instead of reusing mT5's vocabulary, we retrained the tokenizer on our corpus, which includes more word-level entries. As a result, encoding text of the same length produces fewer tokens, which lowers memory usage and speeds up training.
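The difference is easy to observe by tokenizing the same sentence with both vocabularies; the sketch below uses google/mt5-base for comparison and an arbitrary example sentence.

```python
# Compare token counts: the retrained Mengzi vocabulary usually produces fewer
# tokens for Chinese text than mT5's multilingual vocabulary.
from transformers import AutoTokenizer, T5Tokenizer

mengzi_tok = T5Tokenizer.from_pretrained("Langboat/mengzi-t5-base")
mt5_tok = AutoTokenizer.from_pretrained("google/mt5-base")

text = "基于中文语料重新训练的词表可以减少编码后的token数量。"
print("mengzi-t5:", len(mengzi_tok.tokenize(text)))
print("mt5:      ", len(mt5_tok.tokenize(text)))
```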
The content of this project is provided for technical research reference only and should not be used as the basis for any conclusions. Users may use the models within the scope of the license, but we are not responsible for any direct or indirect loss caused by using the content of this project. The experimental results presented in the technical report only reflect performance under specific datasets and hyperparameter combinations and do not characterize each model in general; results may vary with random seeds and computing devices.
When using these models in any way (including but not limited to modifying them, using them directly, or using them through third parties), users must not directly or indirectly engage in acts that violate the laws, regulations, or social morality of their jurisdiction. Users are responsible for their own actions and bear all legal liability for any dispute arising from the use of these models; we assume no legal or joint liability.
We reserve the right to interpret, modify, and update this disclaimer.
@misc{zhang2021mengzi,
title={Mengzi: Towards Lightweight yet Ingenious Pre-trained Models for Chinese},
author={Zhuosheng Zhang and Hanqing Zhang and Keming Chen and Yuhang Guo and Jingyun Hua and Yulong Wang and Ming Zhou},
year={2021},
eprint={2110.06696},
archivePrefix={arXiv},
primaryClass={cs.CL}
}