The oldest and most mysterious (not really) Liwu Community of MOP on the Chinese Internet solemnly announced on 2023-01-01:
Under the guidance of the wise and mighty Maopu admins, the community is determined to play to its strengths (everything it does is good) and help the open source community continuously update the largest collection of Chinese Internet corpora.
The MNBVC corpus covers not only mainstream culture but also various niche cultures, and even "Martian script" (stylized Chinese internet slang). The dataset includes news articles, student essays, novels, books, magazines, academic papers, scripts and dialogue, forum posts, wiki entries, classical poetry, lyrics, product descriptions, jokes, embarrassing-story anecdotes, chat logs, and other forms of plain-text Chinese data. All data is collected from the Internet.
The current total data volume is 42,915 GB; the goal is to match the roughly 40 TB of data used to train ChatGPT-3.5, putting current progress at 107.2%.
The password for the compressed archives is 253874.
The Chinese corpus inside the archives comes in txt, json, jsonl, and parquet (multimodal-specific) formats, and will eventually be unified into jsonl and parquet.
The links.txt file in the root directory of each archive lists the source URL for each subfolder's data.
Each subfolder contains a PNG screenshot of the web page the data was sourced from.
For desensitization, digit strings of eight or more consecutive digits are removed from the collected data.
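As a rough illustration of that rule (this is a sketch, not the project's actual cleaning pipeline), removing runs of eight or more digits can be done with a single regular expression:

```python
import re

# Assumption: "digit strings of 8 or more digits" means any run of 8+
# consecutive digits, which typically covers phone numbers and ID numbers.
LONG_DIGITS = re.compile(r"\d{8,}")

def desensitize(text: str) -> str:
    """Remove digit runs of length >= 8; shorter numbers are kept."""
    return LONG_DIGITS.sub("", text)

print(desensitize("Call 13812345678, see page 42"))  # -> "Call , see page 42"
```

Note that short numbers such as years and page numbers survive, which matches the stated threshold of eight digits.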
The data in the archives has only been roughly processed, e.g. HTML/XML converted to txt, and CSV/TSV converted to JSON.
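For instance, a CSV-to-JSON-Lines conversion of the kind described above can be sketched with the Python standard library alone (field names here are illustrative; the actual archive schemas vary):

```python
import csv
import io
import json

def csv_to_jsonl(csv_text: str, delimiter: str = ",") -> str:
    """Convert CSV/TSV text into JSON Lines, one object per row.

    Pass delimiter="\t" for TSV input.
    """
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    # ensure_ascii=False keeps Chinese characters readable in the output.
    return "\n".join(json.dumps(row, ensure_ascii=False) for row in reader)

sample = "title,body\n标题,正文"
print(csv_to_jsonl(sample))  # -> {"title": "标题", "body": "正文"}
```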
We are not able to audit the copyright of the data sources. Although the dataset records source information, in order to keep it available for long-term updates and downloads and to avoid copyright disputes, we do not provide an index or classification of the data inside the archives. We likewise ask everyone to restrain the urge to share: please do not publicly discuss the archives' index or the specific content they hold. Focus instead on applications of the large corpus itself, and use the data in a low-key manner.
Cleaned and classified data will be published at: https://huggingface.co/datasets/liwu/MNBVC
Team leads report that data cleaning involves a great deal of work and that implementation is progressing slowly. We hope students with spare time will come help; knowing basic Python is enough, and someone will guide you step by step. Please first read the project's three red lines.
Even if you have no time to help with development, you can still contribute to the MNBVC corpus by joining the "corpus energy bomb" project and uploading corpus documents at will.
To handle a Chinese corpus at this scale, members of the MNBVC project team have optimized existing open source software and provide more efficient versions.
Existing open source code corpora suffer from heavy manual filtering, which makes catching up with ChatGPT harder. To avoid duplicated labor, MNBVC provides its code-repository crawler, which has been verified at scale.
1. Synchronize all archives and receive updates via Resilio Sync (微力同步). We recommend turning off Resilio Sync's TCP hole-punching and UDP transfer settings; if left on, it may saturate your router (though transfers may be faster).
Resilio Sync key: B4MVPVJTK3DOOAOPVLJ3E7TA7RWW4J2ZEAXJRMRSRHSBPDB7OAFHUQ
Resilio Sync direct link
2. Download via Baidu Netdisk: Baidu Netdisk download links for each archive
Please cite this repo if you use its data or code.
@misc{mnbvc,
author = {{MOP-LIWU Community} and {MNBVC Team}},
title = {MNBVC: Massive Never-ending BT Vast Chinese corpus},
year = {2023},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/esbatmop/MNBVC}},
}