Zhiyuan Research Institute releases CCI3.0, a Chinese Internet corpus containing a 1000GB data set

Author: Eve Cole  Update Time: 2025-03-07 00:00:03

The Beijing Zhiyuan Artificial Intelligence Research Institute (BAAI) released CCI3.0, a new generation of its Chinese Internet corpus, at the 2024 Beijing Cultural Forum. It is the latest major update after CCI1.0 and CCI2.0. CCI3.0 includes a 1000GB data set and a 498GB high-quality subset, CCI3.0-HQ. Since first being open-sourced, the CCI series data sets have been downloaded more than 40,000 times and have supported large-model research and development at more than 500 enterprises and institutions, providing strong support for China's artificial intelligence industry ecosystem. Below, the editor of Downcodes explains the features of CCI3.0 and where to download it.

At the 2024 Beijing Cultural Forum, the Beijing Zhiyuan Artificial Intelligence Research Institute (BAAI) announced the official release of CCI3.0 (Chinese Corpora Internet), a new generation of its Chinese Internet corpus, to further promote the joint construction and sharing of data. CCI3.0 includes a 1000GB data set and a 498GB high-quality subset, CCI3.0-HQ. It is another important update, following the first open-source release of CCI1.0 in November 2023 and the release of CCI2.0 in April 2024.

Since first being open-sourced, the CCI series data sets have been downloaded more than 40,000 times and have served large-model research and development at more than 500 enterprises and institutions, effectively supporting the development of China's artificial intelligence industry ecosystem.

Features of CCI3.0 include:

Expanded scale and wide range of sources: CCI3.0 includes more than 268 million web pages covering news, social media, blogs and other fields. Compared with CCI2.0, the data scale of CCI3.0 has nearly doubled, and the number of institutions contributing data has grown to more than 20, significantly improving the coverage and representativeness of the data.
Fine annotation, empowering applications: CCI3.0 applies fine-grained classification and detailed labeling to the raw data along more than 10 dimensions, including grammar, syntax and education level, in order to filter out high-value data. In addition, for CCI3.0-HQ a 70B model automatically labels samples, and those labels are then used to train a small quality model that refines the high-quality subset, so that it better meets the needs of different industries and application scenarios.
Remarkable effect, better understanding of Chinese: In a comparative experiment in which a 500M model was trained from scratch on 100B tokens of data, CCI3.0 outperformed other data sets both when training on Chinese corpus alone and when training on mixed Chinese and English corpus, and CCI3.0-HQ performed better still.
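
The labeling approach described for CCI3.0-HQ (a large model labels samples, a small quality model is trained on those labels, and the small model then filters the corpus) can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: the hard-coded labels stand in for the output of a large annotator model, the single "distinct word ratio" feature and the logistic model are placeholders for a real quality classifier, and the names LABELED, train_student and filter_high_quality are hypothetical. It is not BAAI's actual pipeline.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical labels standing in for what a large annotator model might
# assign: 1 = high quality, 0 = low quality. This is not real CCI3.0 data.
LABELED: List[Tuple[str, int]] = [
    ("Heavy rain is expected across Beijing tomorrow afternoon", 1),
    ("The museum opens a new exhibition on ancient bronze ware", 1),
    ("click here click here click here click here", 0),
    ("buy now buy now buy now buy now buy now", 0),
]


def features(text: str) -> List[float]:
    """Toy features: a bias term plus the share of distinct whitespace tokens.

    Chinese text would need a proper tokenizer; whitespace splitting is
    used here only to keep the sketch dependency-free.
    """
    words = text.lower().split()
    distinct_ratio = len(set(words)) / len(words) if words else 0.0
    return [1.0, distinct_ratio]


def score(weights: Sequence[float], text: str) -> float:
    """Quality score in (0, 1) from a logistic model."""
    z = sum(w * x for w, x in zip(weights, features(text)))
    return 1.0 / (1.0 + math.exp(-z))


def train_student(
    labeled: Sequence[Tuple[str, int]], epochs: int = 300, lr: float = 0.5
) -> List[float]:
    """Fit the small 'student' quality model with plain SGD on log loss."""
    weights = [0.0, 0.0]
    for _ in range(epochs):
        for text, label in labeled:
            error = label - score(weights, text)
            weights = [
                w + lr * error * x for w, x in zip(weights, features(text))
            ]
    return weights


def filter_high_quality(
    weights: Sequence[float], corpus: Sequence[str], threshold: float = 0.5
) -> List[str]:
    """Keep only documents whose student score reaches the threshold."""
    return [doc for doc in corpus if score(weights, doc) >= threshold]


if __name__ == "__main__":
    student = train_student(LABELED)
    corpus = [
        "Researchers released a large open corpus for model training",
        "win win win win win win win win",
    ]
    for doc in filter_high_quality(student, corpus):
        print(doc)
```

The point of the two-stage design is cost: the large model only has to label a sample, while the cheap student model can score every page in the corpus.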

Zhiyuan Research Institute stated that it will continue to cooperate with the industry ecosystem in the future to promote the co-construction and sharing of corpora, build large-scale, high-quality, and high-knowledge-density Chinese data sets, and make greater contributions to the development of China's artificial intelligence industry.

CCI3.0 download addresses

Flopsera: https://open.flopsera.com/flopsera-open/data-details/BAAI-CCI3

Hugging Face: https://huggingface.co/datasets/BAAI/CCI3-Data

Datahub: https://data.baai.ac.cn/details/BAAI-CCI3
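
For readers who want to inspect a few records before downloading the full 1000GB, the following is a minimal Python sketch that streams from the Hugging Face copy. It assumes the open-source `datasets` library is installed, that the repository loads with its generic loader, and that a "train" split exists; none of this is confirmed by the release notes, and the repository may require accepting access terms first. The helper names preview and stream_cci3 are hypothetical, and the code prints whatever fields the records contain instead of assuming field names.

```python
from itertools import islice
from typing import Any, Dict, Iterable, List

# Repository name taken from the Hugging Face URL listed above.
DATASET_ID = "BAAI/CCI3-Data"


def preview(
    records: Iterable[Dict[str, Any]], n: int = 3, width: int = 80
) -> List[Dict[str, str]]:
    """Return the first n records, with each value cut to `width` characters."""
    return [
        {key: str(value)[:width] for key, value in record.items()}
        for record in islice(records, n)
    ]


def stream_cci3(split: str = "train") -> Iterable[Dict[str, Any]]:
    """Open the data set in streaming mode so nothing is downloaded up front.

    Assumption: a split named "train" exists in the repository.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset(DATASET_ID, split=split, streaming=True)


if __name__ == "__main__":
    for row in preview(stream_cci3()):
        print(row)
```

Streaming mode reads records lazily over the network, which suits a quick look at the schema; a full local copy is still the better choice for training runs.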

All in all, the release of CCI3.0 marks a new step in the construction of Chinese corpora. Its large-scale, high-quality data set will provide strong support for scientific research and applications in the field of artificial intelligence, and will help China's artificial intelligence industry flourish. Everyone is welcome to visit the links above to download and use it.