Chinese | English | Docs | ❓ Issues | Discussions | ⚔️ Arena

This project is based on the Mixtral model released by Mistral.ai, which adopts a sparse Mixture-of-Experts (MoE) architecture. The project performs Chinese incremental pre-training on large-scale unlabeled Chinese data to obtain the Chinese-Mixtral base model, and further applies instruction fine-tuning to obtain the Chinese-Mixtral-Instruct instruction model. The models natively support a 32K context (measured up to 128K), can effectively process long text, and also achieve significant performance gains on mathematical reasoning, code generation, and similar tasks. Quantized inference with llama.cpp requires only 16GB of RAM (or VRAM).
Technical report: [Cui and Yao, 2024] Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral [Paper interpretation]
Chinese LLaMA-2&Alpaca-2 models | Chinese LLaMA&Alpaca models | Multimodal Chinese LLaMA&Alpaca models | Multimodal VLE | Chinese MiniRBT | Chinese LERT | Chinese-English PERT | Chinese MacBERT | Chinese ELECTRA | Chinese XLNet | Chinese BERT | Knowledge distillation tool TextBrewer | Model pruning tool TextPruner | Integrated distillation-and-pruning GRAIN
[2024/04/30] Chinese-LLaMA-Alpaca-3 has been officially released, open-sourcing Llama-3-Chinese-8B and Llama-3-Chinese-8B-Instruct, based on Llama-3. For details, see: https://github.com/ymcui/Chinese-LLaMA-Alpaca-3
[2024/03/27] Added 1-bit/2-bit/3-bit quantized GGUF models: [🤗HF]. This project has also been listed on the Synced (机器之心) SOTA! model platform; follow it at: https://sota.jiqizhixin.com/project/chinese-mixtral
[2024/03/26] Added an OpenAI-compatible API demo. For details, see the v1.2 release notes.
[2024/03/05] Open-sourced the model training and fine-tuning code and published the technical report. For details, see the v1.1 release notes.
[2024/01/29] Officially released Chinese-Mixtral (base model) and Chinese-Mixtral-Instruct (instruction/chat model). For details, see the v1.0 release notes.
| Section | Description |
|---|---|
| Model Introduction | A brief introduction to the technical features of this project's models |
| ⏬Model Download | Download links for the Chinese-Mixtral models |
| Inference and Deployment | How to quantize the models and deploy/experience them on a personal computer |
| Model Performance | The models' performance on selected tasks |
| Training and Fine-tuning | How to train and fine-tune the Chinese-Mixtral models |
| ❓FAQ | Answers to frequently asked questions |
This project open-sources the Chinese-Mixtral and Chinese-Mixtral-Instruct models, developed on top of the Mixtral model. Their main features are as follows:
Mixtral is a sparse Mixture-of-Experts (MoE) model. It differs significantly from previous mainstream large models such as LLaMA, mainly in the following respects:
- Each MoE layer contains 8 expert FFNs, of which 2 are activated per token by a routing network;
- Although the total parameter scale is 8x7B, only about 13B parameters are activated for each forward pass;
- The model natively supports a 32K context length.
The following is the architecture diagram from the Mixtral paper:

Unlike the Chinese-LLaMA-Alpaca and Chinese-LLaMA-Alpaca-2 projects, the Mixtral model natively supports a 32K context (measured up to 128K in practice), so users can handle tasks of widely varying lengths with a single model.
The following compares the models in this project and their recommended usage scenarios. For chat or interactive use, choose the Instruct version.
| Comparison item | Chinese-Mixtral | Chinese-Mixtral-Instruct |
|---|---|---|
| Model type | Base model | Instruction/chat model (ChatGPT-like) |
| Model size | 8x7B (about 13B activated) | 8x7B (about 13B activated) |
| Number of experts | 8 (2 activated per token) | 8 (2 activated per token) |
| Training type | Causal language model (CLM) pre-training | Instruction fine-tuning |
| Training method | QLoRA + full-parameter emb/lm-head | QLoRA + full-parameter emb/lm-head |
| Trained from | Original Mixtral-8x7B-v0.1 | Chinese-Mixtral |
| Training data | Unlabeled general corpus | Labeled instruction data |
| Vocabulary size | Original vocabulary, 32,000 | Original vocabulary, 32,000 |
| Supported context length | 32K (measured up to 128K) | 32K (measured up to 128K) |
| Input template | Not required | Requires the Mixtral-Instruct template |
| Applicable scenarios | Text continuation: given the preceding text, the model generates a continuation | Instruction understanding: Q&A, writing, chat, interaction, etc. |
The models below are provided in three formats (full, LoRA, and GGUF):
| Model name | Type | Specification | Full version (87 GB) | LoRA version (2.4 GB) | GGUF version |
|---|---|---|---|---|---|
| Chinese-Mixtral | Base model | 8x7B | [Baidu] [🤗HF] [🤖ModelScope] | [Baidu] [🤗HF] [🤖ModelScope] | [🤗HF] |
| Chinese-Mixtral-Instruct | Instruction model | 8x7B | [Baidu] [🤗HF] [🤖ModelScope] | [Baidu] [🤗HF] [🤖ModelScope] | [🤗HF] |
Note
If you cannot access Hugging Face directly, consider using a mirror site (e.g., hf-mirror.com); please refer to the mirror's own documentation for usage details.
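For scripted downloads, the following is a minimal sketch using huggingface_hub. The repo ID shown is an assumption; verify it against the download table above. Setting HF_ENDPOINT before importing the library routes requests through a mirror:

```python
# Minimal download sketch (the repo ID is an assumption; check the table above).
import os

# Optional: route requests through a mirror if huggingface.co is unreachable.
# Must be set before importing huggingface_hub.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="hfl/chinese-mixtral-instruct",   # assumed repo ID
    local_dir="./chinese-mixtral-instruct",   # download destination
)
```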
The models in this project mainly support the following quantization, inference, and deployment methods; for details, see the corresponding tutorial.
| Tool | Features | CPU | GPU | Quantization | GUI | API | vLLM | Tutorial |
|---|---|---|---|---|---|---|---|---|
| llama.cpp | Rich quantization options and efficient local inference | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ | [link] |
| 🤗Transformers | Native transformers inference interface | ✅ | ✅ | ✅ | ✅ | ❌ | ✅ | [link] |
| OpenAI-compatible API demo | Server demo that emulates the OpenAI API interface | ✅ | ✅ | ✅ | ❌ | ✅ | ✅ | [link] |
| text-generation-webui | Front-end web UI for deployment | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [link] |
| LangChain | Open-source LLM application framework, suitable for secondary development | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | [link] |
| privateGPT | Multi-document local Q&A framework | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ | [link] |
| LM Studio | Multi-platform chat software with a GUI | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ | [link] |
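As a quick orientation (the wiki tutorials remain the authoritative guides), here is a minimal sketch of loading the instruction model with 🤗Transformers; the repo ID is an assumption, and the memory-saving options are optional:

```python
# Minimal sketch: run Chinese-Mixtral-Instruct with transformers.
# The repo ID is an assumption; see the download table for the actual links.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "hfl/chinese-mixtral-instruct"  # assumed repo ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # pass load_in_4bit=True (bitsandbytes) to cut memory
    device_map="auto",          # spread layers across available devices
)

prompt = "[INST] 请介绍一下你自己。 [/INST]"  # Mixtral-Instruct template
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```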
To evaluate the models, this project conducted both generative performance evaluation and objective performance evaluation (NLU-style), assessing the large models from different angles. Users are advised to test on the tasks they care about and choose the model best suited to those tasks.
C-Eval is a comprehensive Chinese foundation model evaluation suite whose validation and test sets contain 1.3K and 12.3K multiple-choice questions, respectively, covering 52 subjects. For the C-Eval inference code, please refer to this project's GitHub Wiki.
| Model | Type | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
|---|---|---|---|---|---|
| Chinese-Mixtral-Instruct | Instruction | 51.7 | 55.0 | 50.0 | 51.5 |
| Chinese-Mixtral | Base | 45.8 | 54.2 | 43.1 | 49.1 |
| Mixtral-8x7B-Instruct-v0.1 | Instruction | 51.6 | 54.0 | 48.7 | 50.7 |
| Mixtral-8x7B-v0.1 | Base | 47.3 | 54.6 | 46.1 | 50.3 |
| Chinese-Alpaca-2-13B | Instruction | 44.3 | 45.9 | 42.6 | 44.0 |
| Chinese-LLaMA-2-13B | Base | 40.6 | 42.7 | 38.0 | 41.6 |
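The 0-shot/5-shot columns refer to how many solved demonstrations are prepended to each question. The authoritative prompt format is in the GitHub Wiki; the following is only an illustrative sketch of how an n-shot multiple-choice prompt is assembled:

```python
# Illustrative sketch of n-shot multiple-choice prompting (the prompt format
# actually used by this project is documented in the GitHub Wiki).
def format_item(question, choices, answer=""):
    """Render one question with lettered options and an answer slot."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in zip("ABCD", choices)]
    lines.append(f"答案：{answer}")  # "Answer:" — the model continues with A/B/C/D
    return "\n".join(lines)

def build_prompt(question, choices, examples=()):
    """`examples` holds (question, choices, answer) triples; 5 for the 5-shot setting."""
    demos = [format_item(q, ch, ans) for q, ch, ans in examples]
    return "\n\n".join(demos + [format_item(question, choices)])

print(build_prompt("中国的首都是哪里？", ["上海", "北京", "广州", "深圳"]))
```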
CMMLU is another comprehensive Chinese evaluation dataset, designed to assess language models' knowledge and reasoning ability in a Chinese context. It covers 67 topics ranging from basic subjects to advanced professional levels, with 11.5K multiple-choice questions in total. For the CMMLU inference code, please refer to this project's GitHub Wiki.
| Model | Type | Test (0-shot) | Test (5-shot) |
|---|---|---|---|
| Chinese-Mixtral-Instruct | Instruction | 50.0 | 53.0 |
| Chinese-Mixtral | Base | 42.5 | 51.0 |
| Mixtral-8x7B-Instruct-v0.1 | Instruction | 48.2 | 51.6 |
| Mixtral-8x7B-v0.1 | Base | 44.3 | 51.6 |
| Chinese-Alpaca-2-13B | Instruction | 43.2 | 45.5 |
| Chinese-LLaMA-2-13B | Base | 38.9 | 42.5 |
MMLU is an English evaluation dataset for assessing natural language understanding, and is among the main datasets used to evaluate large models today. Its validation and test sets contain 1.5K and 14.1K multiple-choice questions, respectively, covering 57 subjects. For the MMLU inference code, please refer to this project's GitHub Wiki.
| Model | Type | Valid (0-shot) | Valid (5-shot) | Test (0-shot) | Test (5-shot) |
|---|---|---|---|---|---|
| Chinese-Mixtral-Instruct | Instruction | 65.1 | 69.6 | 67.5 | 69.8 |
| Chinese-Mixtral | Base | 63.2 | 67.1 | 65.5 | 68.3 |
| Mixtral-8x7B-Instruct-v0.1 | Instruction | 68.5 | 70.4 | 68.2 | 70.2 |
| Mixtral-8x7B-v0.1 | Base | 64.9 | 69.0 | 67.0 | 69.5 |
| Chinese-Alpaca-2-13B | Instruction | 49.6 | 53.2 | 50.9 | 53.5 |
| Chinese-LLaMA-2-13B | Base | 46.8 | 50.0 | 46.6 | 51.8 |
LongBench is a benchmark for evaluating the long-text understanding ability of large models. It consists of 20 tasks in 6 major categories; most tasks have an average length of 5K-15K, and the benchmark contains about 4.75K test items in total. Below are the results of this project's models on its Chinese tasks (including code tasks). For the LongBench inference code, please refer to this project's GitHub Wiki.
| Model | Single-doc QA | Multi-doc QA | Summarization | Few-shot Learning | Code Completion | Synthetic Task | Average |
|---|---|---|---|---|---|---|---|
| Chinese-Mixtral-Instruct | 50.3 | 34.2 | 16.4 | 42.0 | 56.1 | 89.5 | 48.1 |
| Chinese-Mixtral | 32.0 | 23.7 | 0.4 | 42.5 | 27.4 | 14.0 | 23.3 |
| Mixtral-8x7B-Instruct-v0.1 | 56.5 | 35.7 | 15.4 | 46.0 | 63.6 | 98.0 | 52.5 |
| Mixtral-8x7B-v0.1 | 35.5 | 9.5 | 16.4 | 46.5 | 57.2 | 83.5 | 41.4 |
| Chinese-Alpaca-2-13B-16K | 47.9 | 26.7 | 13.0 | 22.3 | 46.6 | 21.5 | 29.7 |
| Chinese-LLaMA-2-13B-16K | 36.7 | 17.7 | 3.1 | 29.8 | 13.8 | 3.0 | 17.3 |
| Chinese-Alpaca-2-7B-64K | 44.7 | 28.1 | 14.4 | 39.0 | 44.6 | 5.0 | 29.3 |
| Chinese-LLaMA-2-7B-64K | 27.2 | 16.4 | 6.5 | 33.0 | 7.8 | 5.0 | 16.0 |
The performance of quantized Chinese-Mixtral models was tested under llama.cpp, as shown in the table below.
| F16 | Q8_0 | Q6_K | Q5_K | Q5_0 | Q4_K | Q4_0 | Q3_K | IQ3_XXS | Q2_K | IQ2_XS | IQ2_XXS | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Size (GB) | 87.0 | 46.2 | 35.7 | 30.0 | 30.0 | 24.6 | 24.6 | 19.0 | 17.1 | 16.1 | 12.7 | 11.4 |
| BPW | 16.0 | 8.50 | 6.57 | 5.69 | 5.52 | 4.87 | 4.53 | 3.86 | 3.14 | 2.96 | 2.34 | 2.10 |
| PPL | - | 4.4076 | 4.4092 | 4.4192 | 4.4224 | 4.4488 | 4.4917 | 4.5545 | 4.5990 | 5.1846 | 6.9784 | 8.5981 |
| M3 Max Speed | - | - | 36.0 | 36.9 | 35.7 | 31.2 | 27.8 | 37.6 | - | 29.1 | - | - |
| A100 Speed | - | - | 29.9 | 22.6 | 20.5 | 21.7 | 17.1 | 21.7 | 20.6 | 20.3 | 23.7 | 22.5 |
Note
Taking Chinese-Mixtral-Q4_0 as an example, the figure below shows how PPL varies with context length on two different sets of plain-text data. The results show that the Mixtral model's usable context length exceeds the nominal 32K: it still performs well at 64K+ contexts (measured up to 128K).

Instruction template:
`<s> [INST] Instruction [/INST] Model answer</s> [INST] Follow-up instruction [/INST]`
Note: `<s>` and `</s>` are special tokens marking the beginning and end of a sequence, while `[INST]` and `[/INST]` are ordinary strings.
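In practice, this template rarely needs to be assembled by hand: the tokenizer's built-in chat template produces it. A minimal sketch (the repo ID is an assumption):

```python
# Minimal sketch: build the Mixtral-Instruct prompt via the tokenizer's chat template.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-mixtral-instruct")  # assumed repo ID

messages = [
    {"role": "user", "content": "你好，请介绍一下你自己。"},
    {"role": "assistant", "content": "你好，我是一个中文大语言模型。"},
    {"role": "user", "content": "请写一首关于春天的诗。"},
]

# Produces a string like: "<s>[INST] ... [/INST] ...</s>[INST] ... [/INST]"
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print(prompt)
```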
Before opening an issue, please check whether the FAQ already covers your problem. For detailed questions and answers, please refer to this project's GitHub Wiki.
Question 1: Will the models be further trained on more data? Will RLHF/DPO alignment be performed?
Question 2: Why was the Chinese vocabulary not extended for these models?
Question 3: Is Mixtral's downstream ecosystem supported?
@article{chinese-mixtral,
title={Rethinking LLM Language Adaptation: A Case Study on Chinese Mixtral},
author={Cui, Yiming and Yao, Xin},
journal={arXiv preprint arXiv:2403.01851},
url={https://arxiv.org/abs/2403.01851},
year={2024}
}

This project is developed based on the Mixtral model released by Mistral.ai. Please strictly abide by the Mixtral open-source license agreement when using it. When third-party code is involved, be sure to comply with the relevant open-source licenses as well. The accuracy of content generated by the models may be affected by computation methods, random factors, and loss of quantization precision; therefore, this project makes no guarantee regarding the accuracy of model outputs, and accepts no liability for any losses arising from the use of the related resources or output results. If this project's models are used for commercial purposes, developers must abide by local laws and regulations and ensure the compliance of the models' output content. This project assumes no liability for any products or services derived from it.
If you have any questions, please submit them via GitHub Issues. Please ask politely and help build a harmonious discussion community.