Everyone loves ChatGPT, but only a handful of large technology companies and laboratories have the resources to train such models. Recently, a Self-Instruct style approach has become popular in the open-source community: build instruction datasets with InstructGPT/ChatGPT and then fine-tune a smaller LLM (such as LLaMA 7B) on them, which can achieve performance "comparable to" ChatGPT. Stanford Alpaca is a representative example of this line of work.
Currently, there are very few open-source instruction datasets, and most are in English. The few Chinese instruction datasets that exist are mostly translated from English ones. However, given the strong demand for ChatGPT-like models, we believe more and more large-scale Chinese instruction datasets will appear in the future.
This project aims to collect Chinese instruction datasets so that everyone can more conveniently fine-tune Chinese LLMs.
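For reference, most of the datasets below follow the Alpaca-style instruction format, where each sample carries `instruction`, `input`, and `output` fields. A minimal sketch in Python (the sample content is made up for illustration, not drawn from any of the datasets):

```python
# A minimal, illustrative Alpaca-style record. The field names follow the
# Stanford Alpaca format; the sample content below is invented for illustration.
example = {
    "instruction": "Translate the following sentence into English.",
    "input": "今天天气很好。",  # optional context; an empty string when unused
    "output": "The weather is nice today.",
}
```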
| Dataset | Size | Description | Source |
|---|---|---|---|
| Guanaco Dataset | 27,808 | Multilingual instruction dataset; the size will be expanded to 92,530 | Guanaco |
| alpaca_chinese_dataset | Updating | Machine-translated Alpaca dataset with manual verification, supplemented with some dialogue data | Stanford Alpaca |
| alpaca-chinese-dataset | 20,465 | Machine-translated Alpaca dataset | Stanford Alpaca |
| Chinese-alpaca-lora | Updating | Alpaca dataset machine-translated with gpt-3.5-turbo; will be merged with the Guanaco dataset in the future | Stanford Alpaca |
| GPT-4-LLM | 52k | Alpaca prompts translated into Chinese with ChatGPT; Chinese responses then generated with GPT-4 | Stanford Alpaca |
| BelleGroup/train_0.5M_CN | 0.5M | Chinese seed prompts created by the authors; responses generated with text-davinci-003 | BELLE |
| BelleGroup/train_1M_CN | 1M | Same Chinese seed prompts as above; responses generated with text-davinci-003. Compared with the 0.5M dataset, the authors cleaned the data, removing low-quality samples such as responses claiming to be a GPT model, samples the model could not answer due to incomplete input, and samples whose instructions are Chinese but whose input or target is English | BELLE |
| BelleGroup/school_math_0.25M | 0.25M | Chinese math problems with step-by-step solutions, generated by ChatGPT | BELLE |
| BelleGroup/multiturn_chat_0.8M | 0.8M | Multi-turn conversations between a user and an assistant, generated by ChatGPT | BELLE |
| BelleGroup/generated_chat_0.4M | 0.4M | Personalized role-play dialogue data with character profiles, generated by ChatGPT | BELLE |
| BelleGroup/train_2M_CN | 2M | Chinese instruction data generated by ChatGPT | BELLE |
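The BelleGroup datasets above are published on the Hugging Face Hub under the names shown in the table, so they can be loaded directly with the `datasets` library. A minimal sketch, assuming `datasets` is installed (`pip install datasets`):

```python
from datasets import load_dataset

# Download one of the BELLE instruction datasets from the Hugging Face Hub.
dataset = load_dataset("BelleGroup/train_0.5M_CN")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # first instruction/response record
```

The other BelleGroup/ entries can be loaded the same way by swapping in the dataset name from the table.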