MNBVC Download - MNBVC Source code download

MNBVC

Other source code

1.0.0

Download

MNBVC (Massive Never-ending BT Vast Chinese corpus) super large-scale Chinese corpus

Please don’t report to us, so that we can have a longer time to collect and organize data. We are most afraid of being praised and killed. If you keep us low profile, you have made great contributions to the Chinese algorithm circle!

The oldest and most mysterious (nothing) MOP Liwu Community on the Chinese Internet solemnly announced on 2023.1.1:

Under the guidance of the wise and powerful Maopu Guanzi, he is determined to give full play to the strengths of the community (everything is good) and help the open source community update the largest collection of Chinese Internet corpus for a long time

The MNBVC corpus includes not only mainstream culture, but also data from various niche cultures and even Martian cultures. The MNBVC data set includes news, compositions, novels, books, magazines, papers, lines, posts, wikis, ancient poems, lyrics, product introductions, jokes, embarrassing stories, chat records and other forms of pure text Chinese data. All data are collected from the Internet.

schedule

The current total data volume is 42915GB, and the goal is to reach 40T data of chatGPT3.5, with a current progress of 107.2%.

Data description

The password for the compressed package is 253874

The Chinese corpus in the compressed package includes txt, json, jsonl and parquet (multimodal dedicated) formats, and will eventually be unified into jsonl and parquet formats.

The links.txt in the root directory of the compressed package has the URL of each subfolder data source.

There is a picture in png format in each subfolder, which is a screenshot of the web page from the data source.

The collected data will remove digit strings greater than or equal to 8 digits for desensitization.

The data in the compressed package is only roughly processed, such as html&xml to txt, csv&tsv to json, etc.

Indexing and classification

We do not have the ability to conduct copyright audits on data sources. Although this data set includes data source information, in order to provide long-lasting updates and downloads of the data set, and to avoid copyright disputes, this data set does not provide indexing and classification of data in the compressed package. We also ask everyone to restrain their desire to share and not discuss the index of the compressed package and the specific content information contained in it. Please pay more attention to the application of the big data corpus itself, and please use data in a low-key manner.

huggingface

The classified data completed by cleaning will be placed in: https://huggingface.co/datasets/liwu/MNBVC

One person walks fast, everyone walks far away (shake people to speed up sending email [email protected])

The team leaders of each team reported that there is a lot of work on data cleaning and the technology is implemented a bit slowly. I hope that students with a lot of time will come to help, and just know how to use python, and someone will guide you step by step. Please help students first read the three red lines of the project.

OCR transcoding team (forced by GPT4 to become a multimodal corpus group containing text-pictures, and the compilation was added), currently 5 people are missing, 5 people are missing (need to have a background in CV and NLP algorithms. I want to use nlp to assist OCR transcoding, and I have the leading team leaders in this field in the industry to lead the team and guide it)
Question and Answer Corporate Group, currently 3 people are missing, 4 people are missing (currently, all are hardworking to write python code to align Q&A items and check human flesh. I want to use the algorithm model to do automatic alignment later)
Corpus Enhancement Team, currently 3 people are missing, 2 people are missing (I want to use Nlp to complete the corpus of missing words and conduct text quality testing, etc.)
The code corpus group and the parallel corpus group are still missing a few chores (the team leader will decide what to do later)
Ancient literature research group to be built (studying the transcoding of local chronicles and other ancient books, with many corpus and great difficulty)
Test group to be built (Please join the test classmates to help us improve the quality of data. I hope that students in this group can study using llm to directly generate test cases and test codes)

Even if you don’t have time to help the project develop, you can participate in the construction of the MNBVC corpus by participating in the (corpus energy bomb) project and uploading corpus documents at will.

Chinese large corpus cleaning tools

To handle large-scale Chinese corpus, students from the MNBVC project team optimized the existing open source software to provide a more efficient version:

Faster and accurate Chinese encoding detection tool: charset_mnbvc
Convert txt into jsonl in batches and pick out files with high paragraph repetition: deduplication_mnbvc
Sample a certain number of files by keyword from a multi-layer directory and preserve the directory structure: scan_copy_files_mnbvc
Format checking tool that unifies the MNBVC corpus format: DataCheck_MNBVC

Code Repository Crawler Tool

There are serious artificial filtering phenomena in various existing open source code corpuses, which makes it more difficult to catch up with chatGPT. To avoid repeated labor, provide code repository crawler code that has been verified on a large scale by MNBVC.

Crawl github code repository meta information: publicRepos_mnbvc
Crawl the latest version of the github code repository: github_downloader_mnbvc
Crawl notabug code repository: notabug_download_mnbvc
Crawl bitbucket code repository: bitbucket_crawl_mnbvc
Convert code to corpus: githubcode_extractor_mnbvc
Crawl commit record: get_github_commit_mnbvc

Multimodal processing tools

PDF meta information extraction tool: pdf_meta_data_mnbvc
PDF parsing rules tool: mmdp_mnbvc
The first version of the pdf to txt tool: pdf2txt_mnbvc
Arxiv document parsing tool: Arxiv_mllm_mnbvc

Various cleaning codes

wikihow cleaning code: WikiHowQAExtractor-mnbvc
Chinese Ministry of Foreign Affairs spoke cleansing code: QA_with_reporters_from_the_Ministry_of_Foreign_Affair_mnbvc
Cleaning codes for various math problems: Math_mnbvc
stackexchange cleaning code: stackexchange_mnbvc
Cleaning code for parallel corpus: parallel_corpus_mnbvc
Cleaning code of the test paper: Exam-Question-Bank-Dataset-zh_mnbvc
Cleaning code of the Judgment Document Network: MNBVC-judgment
Cleaning code for script killing: MNBVC-pdf-extract
DocLayNet cleaning code: DocLayNetPlus_mnbvc

Other gadgets

chinarxiv's crawler: chinaxivCrawler_mnbvc
Extract file from warc: warc_extractor_mnbvc
Psyarxiv, chemrxiv, biorxiv, medrxiv crawler: xxarxiv_mnbvc

Corpus download information (each compressed package will be updated with cleaning progress):

1. Synchronize all compressed packets through p2p micro force and receive updates. It is recommended to turn off tcp penetration and udp transmission micro force settings. If not turned off, the micro force may block the router (and perhaps the transmission speed is faster)

Micro-power key: B4MVPVJTK3DOOAOPVLJ3E7TA7RWW4J2ZEAXJRMRSRHSBPDB7OAFHUQ
Weili Direct Link

2. Download via Baidu Netdisk: Baidu Netdisk download link for each compressed package

Citation

Please cite the repo if you use the data or code in this repo.

 @misc{mnbvc,
  author = {{MOP-LIWU Community} and {MNBVC Team}},
  title = {MNBVC: Massive Never-ending BT Vast Chinese corpus},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {url{https://github.com/esbatmop/MNBVC}},
}

Expand

Additional Information

Version 1.0.0
Type Other source code
Update Time 2025-04-15
size 490.26KB
From Github

Related Applications

Google Dorks

2025-03-10
shepherd

2025-06-04
mongo express

2025-06-04
hidusbf

2025-02-14
Free Algorithms Books

2025-05-29
markdownpedia

2025-04-22

Recommended for You

chat.petals.dev

Other source code

1.0.0
GPT Prompt Templates

Other source code

1.0.0
GPTyped

Other source code

GPTyped 1.0.5
Google Dorks

Other source code

1.0
shepherd

Other source code

v6.1.6-react-shepherd: Prepare Release (#3063)
mongo express

Other source code

v1.1.0-rc-3
Google Dorks

Other source code

1.0
shepherd

Other source code

v6.1.6-react-shepherd: Prepare Release (#3063)
mongo express

Other source code

v1.1.0-rc-3

Related Information All