Everyone loves ChatGPT, but only a handful of large technology companies and laboratories have the resources to train such models. Recently, a Self-Instruct style approach has become popular in the open-source community: build instruction datasets with InstructGPT/ChatGPT and then fine-tune a smaller LLM (such as LLaMA 7B) on them, which can achieve performance "comparable to" ChatGPT. Stanford Alpaca is a representative example of this line of work.
Currently, there are very few open-source instruction datasets, and most are in English. The few Chinese instruction datasets that exist are mostly translated from English ones. However, given the strong demand for ChatGPT-like models, we believe more and more large-scale Chinese instruction datasets will appear in the future.
This project aims to collect Chinese instruction datasets so that everyone can more conveniently fine-tune Chinese LLMs.
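For reference, most of the datasets below follow the Alpaca-style instruction format, where each sample carries `instruction`, `input`, and `output` fields. A minimal sketch in Python (the sample content is made up for illustration, not drawn from any of the datasets):

```python
# A minimal, illustrative Alpaca-style record. The field names follow the
# Stanford Alpaca format; the sample content below is invented for illustration.
example = {
    "instruction": "Translate the following sentence into English.",
    "input": "今天天气很好。",  # optional context; an empty string when unused
    "output": "The weather is nice today.",
}
```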
| Dataset | Size | Description | Source |
|---|---|---|---|
| Guanaco Dataset | 27,808 | Multilingual instruction dataset; the size will be expanded to 92,530 | Guanaco |
| alpaca_chinese_dataset | Updating | Machine-translated Alpaca dataset with manual verification, supplemented with some dialogue data | Stanford Alpaca |
| alpaca-chinese-dataset | 20,465 | Machine-translated Alpaca dataset | Stanford Alpaca |
| Chinese-alpaca-lora | Updating | Alpaca dataset machine-translated with gpt-3.5-turbo; will be merged with the Guanaco dataset in the future | Stanford Alpaca |
| GPT-4-LLM | 52k | Alpaca prompts translated into Chinese with ChatGPT; Chinese responses then generated with GPT-4 | Stanford Alpaca |
| BelleGroup/train_0.5M_CN | 0.5M | Chinese seed prompts created by the authors; responses generated with text-davinci-003 | BELLE |
| BelleGroup/train_1M_CN | 1M | Same Chinese seed prompts as above; responses generated with text-davinci-003. Compared with the 0.5M dataset, the authors cleaned the data, removing low-quality samples such as responses claiming to be a GPT model, samples the model could not answer due to incomplete input, and samples whose instructions are Chinese but whose input or target is English | BELLE |
| BelleGroup/school_math_0.25M | 0.25M | Chinese math problems with step-by-step solutions, generated by ChatGPT | BELLE |
| BelleGroup/multiturn_chat_0.8M | 0.8M | Multi-turn conversations between a user and an assistant, generated by ChatGPT | BELLE |
| BelleGroup/generated_chat_0.4M | 0.4M | Personalized role-play dialogue data with character profiles, generated by ChatGPT | BELLE |
| BelleGroup/train_2M_CN | 2M | Chinese instruction data generated by ChatGPT | BELLE |
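The BelleGroup datasets above are published on the Hugging Face Hub under the names shown in the table, so they can be loaded directly with the `datasets` library. A minimal sketch, assuming `datasets` is installed (`pip install datasets`):

```python
from datasets import load_dataset

# Download one of the BELLE instruction datasets from the Hugging Face Hub.
dataset = load_dataset("BelleGroup/train_0.5M_CN")

print(dataset)              # available splits and their sizes
print(dataset["train"][0])  # first instruction/response record
```

The other BelleGroup/ entries can be loaded the same way by swapping in the dataset name from the table.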