By cleaning the Chinese portion of Common Crawl, we obtained 100 GB of high-quality Chinese pre-training corpus. The models produced in our experiments are released as high-quality Chinese pre-trained models: a large model, an ultra-small model, and a similarity pre-trained model.
For more details, please refer to our technical report https://arxiv.org/pdf/2003.01355

The statistics of Google's original Chinese vocabulary and the small vocabulary we released are as follows:
| Token Type | Google | CLUE |
|---|---|---|
| Simplified Chinese | 11378 | 5689 |
| Traditional Chinese | 3264 | ✗ |
| English | 3529 | 1320 |
| Japanese | 573 | ✗ |
| Korean | 84 | ✗ |
| Emoji | 56 | ✗ |
| Numbers | 1179 | 140 |
| Special Tokens | 106 | 106 |
| Other Tokens | 959 | 766 |
| Total | 21128 | 8021 |
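As a rough illustration, the released small vocabulary is a standard WordPiece vocab file and can be loaded with a stock BERT tokenizer. The sketch below assumes the Hugging Face `transformers` library and a hypothetical local file name `vocab_clue.txt`; it is not an official loading script.

```python
# Minimal sketch: load the 8k-token CLUE vocabulary with a WordPiece tokenizer.
# The file name "vocab_clue.txt" is a placeholder; use the vocab file that
# ships with the released pre-trained models.
from transformers import BertTokenizer

tokenizer = BertTokenizer(vocab_file="vocab_clue.txt")
print(len(tokenizer))                       # expected to be around 8021 for the small vocab
print(tokenizer.tokenize("中文自然语言处理"))  # tokenize a short Chinese sentence
```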
Comparison of results for BERT-base pre-trained on small datasets (C5 = CLUECorpus2020):
| Model | Vocab | Data | Steps | AFQMC | TNEWS' | IFLYTEK' | CMNLI | AVG |
|---|---|---|---|---|---|---|---|---|
| BERT-base | Google | Wiki (1 GB) | 125K | 69.93% | 54.77% | 57.54% | 75.64% | 64.47% |
| BERT-base | Google | C5 (1 GB) | 125K | 69.63% | 55.72% | 58.87% | 75.75% | 64.99% |
| BERT-base | CLUE | C5 (1 GB) | 125K | 69.00% | 55.04% | 59.07% | 75.84% | 64.74% |
| BERT-base mm | Google | C5 (1 GB) | 125K | 69.57% | 55.17% | 59.69% | 75.86% | 65.07% |
| BERT-base | Google | C5 (1 GB) | 375K | 69.85% | 55.97% | 59.62% | 76.41% | 65.46% |
| BERT-base | CLUE | C5 (1 GB) | 375K | 69.93% | 56.38% | 59.35% | 76.58% | 65.56% |
| BERT-base | Google | C5 (3 GB) | 375K | 70.22% | 56.41% | 59.58% | 76.70% | 65.73% |
| BERT-base | CLUE | C5 (3 GB) | 375K | 69.49% | 55.97% | 60.12% | 77.66% | 65.81% |
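For reference, the AVG column is consistent with the unweighted mean of the four task scores; a quick check in Python using the values from the first row:

```python
# Check the AVG column: unweighted mean of the four task scores
# (values copied from the first BERT-base row above).
scores = {"AFQMC": 69.93, "TNEWS'": 54.77, "IFLYTEK'": 57.54, "CMNLI": 75.64}
avg = sum(scores.values()) / len(scores)
print(f"AVG = {avg:.2f}%")  # -> AVG = 64.47%
```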
For more experimental results and analysis, please refer to: CLUEPretrainedModels
How to apply: send the purpose and plan of your corpus research, your research institution, and a brief introduction of the applicant to the email address below, together with a commitment not to share the corpus with third parties.
Email: [email protected], with the subject line: CLUECorpus2020 200G Corpus
It can be used for language modeling, pre-training, generative tasks, etc. The data volume exceeds 14 GB, with nearly 4,000 well-formed txt files and about 5 billion characters. The main part comes from the nlp_chinese_corpus project.
The current corpus has been processed into the pre-training format and contains multiple folders; each folder holds many small files of no more than 4 MB each, and the files follow the pre-training format: one sentence per line, with documents separated by blank lines.
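As a minimal sketch of consuming this format, assuming the shards have been downloaded and extracted into a hypothetical local directory named `clue_corpus` containing `.txt` files: each non-blank line is one sentence, and a blank line closes a document.

```python
# Minimal sketch: iterate over corpus files in the pre-training format
# (one sentence per line, documents separated by blank lines).
# "clue_corpus" is a placeholder directory name.
from pathlib import Path

def iter_documents(corpus_dir):
    """Yield each document as a list of sentence strings."""
    for path in sorted(Path(corpus_dir).rglob("*.txt")):
        doc = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:          # a sentence line
                    doc.append(line)
                elif doc:         # a blank line ends the current document
                    yield doc
                    doc = []
        if doc:                   # flush the last document in the file
            yield doc

if __name__ == "__main__":
    for i, doc in enumerate(iter_documents("clue_corpus")):
        print(f"document {i}: {len(doc)} sentences")
        if i >= 2:
            break
```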
It contains the following sub-corpora (14 GB in total):
1. News corpus news2016zh_corpus: 8 GB of text, split into two parts, with a total of 2,000 small files. Password: mzlk
2. Community-interaction corpus webText2019zh_corpus: 3 GB of text, with a total of more than 900 small files. Password: qvlq
3. Wikipedia corpus wiki2019zh_corpus: about 1.1 GB of text, containing roughly 300 small files. Password: xv7e
4. Comment corpus comments2019zh_corpus: about 2.3 GB of text, 784 small files in total (547 files of general review comments and 227 files of Amazon reviews), built by merging multiple comment datasets from ChineseNLPCorpus, cleaning them, converting formats, and splitting into small files. Password: gc3m
You can submit an issue and join the discussion group (QQ: 836811304)
Or send an email to [email protected]
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC)
@article{CLUECorpus2020,
  title   = {CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model},
  author  = {Liang Xu and Xuanwei Zhang and Qianqian Dong},
  journal = {ArXiv},
  volume  = {abs/2003.01355},
  year    = {2020}
}
CLUE is an open-source organization dedicated to Chinese natural language processing. If you find our work helpful to your research or business, we hope to receive your sponsorship so that we can continue to provide more useful open-source work and do our part for the development and progress of Chinese natural language processing.
When donating, please note your organization and name. Thank you very much!
Donation QR code: Alipay