
This project builds on the Mixtral-8x7B model released by Mistral AI and aims to advance research on MoE models in the Chinese NLP community. Our expanded vocabulary significantly improves the model's efficiency at encoding and decoding Chinese, and we perform incremental pre-training of the vocabulary-expanded model on a large-scale open-source corpus, giving the model strong Chinese generation and understanding capabilities.
Open-source content of this project:
Please note that Chinese-Mixtral-8x7B may still generate misleading replies that contain factual errors, or harmful content that contains bias or discrimination. Please review generated content carefully and do not spread harmful generated content on the Internet.
This project is trained with QLoRA. Both the LoRA weights and the merged full-weight model are open-sourced; download whichever suits your needs:
| Model name | Model size | Download | Remarks |
|---|---|---|---|
| Chinese-Mixtral-8x7B | 88GB | HuggingFace / ModelScope | Full model with the expanded Chinese vocabulary; can be used directly |
| Chinese-Mixtral-8x7B-adapter | 2.7GB | HuggingFace | LoRA weights that must be merged with the original Mixtral-8x7B before use; see the merge script here |
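If you download the adapter, the merge can also be done with PEFT. The following is only a minimal sketch, not the project's official merge script; the adapter repo id, output path, and the embedding-resize step are assumptions:

```python
# Minimal sketch of merging the LoRA adapter into the base model with PEFT.
# NOT the project's official merge script; the adapter repo id and paths are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model_id = "mistralai/Mixtral-8x7B-v0.1"
adapter_id = "HIT-SCIR/Chinese-Mixtral-8x7B-adapter"  # assumed repo id
output_dir = "Chinese-Mixtral-8x7B-merged"

# The adapter was trained with an expanded Chinese vocabulary, so the base
# embeddings are resized to the new tokenizer before loading the adapter.
tokenizer = AutoTokenizer.from_pretrained(adapter_id)
base_model = AutoModelForCausalLM.from_pretrained(base_model_id, torch_dtype=torch.bfloat16)
base_model.resize_token_embeddings(len(tokenizer))

model = PeftModel.from_pretrained(base_model, adapter_id)
model = model.merge_and_unload()  # fold the LoRA deltas back into the dense weights

model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
```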
Chinese-Mixtral-8x7B supports the complete Mixtral-8x7B ecosystem, including acceleration with vLLM and Flash Attention 2, model quantization with bitsandbytes, and so on. Below are code examples for running inference with Chinese-Mixtral-8x7B.
Using Flash Attention 2:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/Chinese-Mixtral-8x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

text = "我的名字是"
inputs = tokenizer(text, return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Using 4-bit quantization:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HIT-SCIR/Chinese-Mixtral-8x7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True, device_map="auto")

text = "我的名字是"
inputs = tokenizer(text, return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
Please note that Chinese-Mixtral-8x7B is a base model that has not been instruction-tuned, so its instruction-following ability is limited. You can refer to the fine-tuning section to fine-tune the model.
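As mentioned above, vLLM can also be used to accelerate inference. The following is a minimal sketch rather than an official example from this project; the tensor_parallel_size value is an assumption and depends on how many GPUs you have for the roughly 88GB of weights:

```python
# Sketch: inference with vLLM (mentioned above as part of the Mixtral ecosystem).
# Not an official example from this project; tensor_parallel_size is an assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="HIT-SCIR/Chinese-Mixtral-8x7B", tensor_parallel_size=4)
sampling_params = SamplingParams(temperature=0.8, top_p=0.9, max_tokens=20)

outputs = llm.generate(["我的名字是"], sampling_params)
print(outputs[0].outputs[0].text)
```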
We evaluated Chinese-Mixtral-8x7B on the following evaluation datasets:
According to the technical report released by Mistral, Mixtral-8x7B activates about 13B parameters during inference. The following table shows the 5-shot results of Chinese-Mixtral-8x7B and other 13B-scale Chinese vocabulary-expanded models on each evaluation dataset:
| Model name | Incremental pre-training data | C-Eval (Chinese) | CMMLU (Chinese) | MMLU (English) | HellaSwag (English) |
|---|---|---|---|---|---|
| IDEA-CCNL/Ziya2-13B-Base | 650B tokens | 59.29 | 60.93 | 59.86 | 58.90 |
| TigerResearch/tigerbot-13b-base-v3 | 500B tokens | 50.52 | 51.65 | 53.46 | 59.16 |
| Linly-AI/Chinese-LLaMA-2-13B-hf | 11B tokens | 42.57 | 41.95 | 51.32 | 59.05 |
| hfl/chinese-llama-2-13b | ~30B tokens (120GB) | 41.90 | 42.08 | 51.92 | 59.28 |
| Chinese-Mixtral-8x7B (this project) | 42B tokens | 52.08 | 51.08 | 69.80 | 65.69 |
In terms of Chinese knowledge and understanding, Chinese-Mixtral-8x7B is comparable to TigerBot-13B-Base-v3. Since Chinese-Mixtral-8x7B was trained on only about 8% as much data as TigerBot-13B-Base-v3, our model still has room for further improvement. At the same time, thanks to the strong performance of the original Mixtral-8x7B, our Chinese-Mixtral-8x7B achieves the strongest English results among the vocabulary-expanded models listed.
Because the implementation details of evaluation scripts differ subtly across versions, to ensure consistency and fairness of the results, our evaluation uses the lm-evaluation-harness released by EleutherAI at commit 28ec7fa.
The following table shows the generation quality of each vocabulary-expanded model. Since the pre-training corpora of some models are not separated by eos_token, we truncate the generated text at max_tokens = 100. Our sampling parameters are temperature = 0.8 and top_p = 0.9.
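For reference, a minimal sketch of generating with these sampling parameters via the HuggingFace generate API, assuming model and tokenizer are loaded as in the inference examples above:

```python
# Sketch: sampling setup used for the generation comparison
# (temperature=0.8, top_p=0.9, truncated at 100 new tokens).
# `model` and `tokenizer` are assumed to be loaded as shown earlier.
inputs = tokenizer("我的名字是", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    top_p=0.9,
    max_new_tokens=100,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```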

To measure Chinese encoding and decoding efficiency, we used each vocabulary-expanded model's tokenizer to encode one shard of the SkyPile dataset (2023-06_zh_head_0000.jsonl) and compared the number of tokens each tokenizer produces for the Chinese text:
| Model name | Model family | Vocabulary size | Tokens for the Chinese text | Encoding efficiency |
|---|---|---|---|---|
| meta-llama/Llama-2-13B-hf | LLaMA | 32000 | 780M | Low |
| mistralai/Mixtral-8x7B-v0.1 | Mixtral | 32000 | 606M | Low |
| Linly-AI/Chinese-LLaMA-2-13B-hf | LLaMA | 40076 | 532M | Medium |
| IDEA-CCNL/Ziya2-13B-Base | LLaMA | 39424 | 532M | Medium |
| hfl/chinese-llama-2-13b | LLaMA | 55296 | 365M | High |
| TigerResearch/tigerbot-13b-base-v3 | LLaMA | 65112 | 342M | High |
| Chinese-Mixtral-8x7B (this project) | Mixtral | 57000 | 355M | High |
On this roughly 1.4GB of test text, the Chinese encoding efficiency of Chinese-Mixtral-8x7B is second only to TigerBot-13B-Base-v3 and is 41.5% higher than that of the original model. This speeds up inference on Chinese text and saves sequence length in scenarios such as In-Context Learning and Chain-of-Thought, which helps improve performance on complex reasoning tasks.
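A minimal sketch of how such a token-count comparison can be reproduced (the assumption that each jsonl line stores its content in a "text" field follows the SkyPile shard format and may differ for other datasets):

```python
# Sketch: count how many tokens each tokenizer produces for the same Chinese text.
# Assumes each jsonl line has a "text" field, as in the SkyPile shards.
import json
from transformers import AutoTokenizer

shard_path = "2023-06_zh_head_0000.jsonl"
tokenizer_names = [
    "mistralai/Mixtral-8x7B-v0.1",
    "HIT-SCIR/Chinese-Mixtral-8x7B",
]

texts = []
with open(shard_path, encoding="utf-8") as f:
    for line in f:
        texts.append(json.loads(line)["text"])

for name in tokenizer_names:
    tok = AutoTokenizer.from_pretrained(name)
    total = sum(len(tok(t, add_special_tokens=False)["input_ids"]) for t in texts)
    print(f"{name}: vocab={len(tok)}, tokens={total}")
```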
We use sentencepiece to train Chinese BPE vocabularies on 12GB of Zhihu data and 2GB of WuDao data. When training the vocabularies, we enumerated both the number of single-character Chinese tokens and the total number of Chinese tokens, and combined the two settings to obtain hundreds of candidate vocabularies of different sizes and contents. To choose the most suitable vocabulary, we measured the Chinese vocabulary capability of these candidates with ALP, proposed by Bo Zheng et al. ALP is a convenient metric of a vocabulary's capability for a specific language: it measures the granularity of that language's subwords while penalizing low- and middle-frequency subwords in the vocabulary.
We evaluated the ALP of the candidate vocabularies on book and encyclopedia corpora. In the figure, the four curves correspond to vocabularies with four different numbers of single-character Chinese tokens (4451, 5435, 6414 and 7434). To avoid both an overly low Chinese compression rate and an overly sparse embedding layer, we pick the inflection point of the ALP curves, which corresponds to adding 25,000 Chinese tokens to the vocabulary. On this basis, we chose the curve with the highest ALP among the four, i.e. the vocabulary with 6414 single-character Chinese tokens, as the final vocabulary for Chinese-Mixtral-8x7B.
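A minimal sketch of the sentencepiece training step (the input file name, vocab_size and character_coverage below are illustrative values, not the exact settings used for this project):

```python
# Sketch: train a Chinese BPE vocabulary with sentencepiece.
# Input file, vocab_size and character_coverage are illustrative, not the
# project's exact settings; the project enumerated many candidate sizes.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="zhihu_and_wudao.txt",   # assumed plain-text training corpus
    model_prefix="zh_bpe",
    model_type="bpe",
    vocab_size=25000,
    character_coverage=0.9995,
)

sp = spm.SentencePieceProcessor(model_file="zh_bpe.model")
print(sp.encode("我的名字是", out_type=str))
```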

After obtaining the new vocabulary, we need to expand and initialize the model's embedding and lm_head layers. We initialize each new token's embedding with the mean of the old embeddings of the subwords it decomposes into under the old vocabulary. In our preliminary experiments, this approach is slightly better than HuggingFace's default implementation, which initializes new rows from a fixed normal distribution.
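A minimal sketch of this mean initialization, assuming the original model, the old tokenizer and the new (expanded) tokenizer are already loaded; this is not the project's exact script:

```python
# Sketch: mean-initialize the rows added to the embedding matrix.
# `model`, `old_tokenizer` and `new_tokenizer` are assumed to be loaded already.
import torch

old_vocab_size = len(old_tokenizer)
new_vocab_size = len(new_tokenizer)

old_embeddings = model.get_input_embeddings().weight.data.clone()
model.resize_token_embeddings(new_vocab_size)
new_embeddings = model.get_input_embeddings().weight.data

with torch.no_grad():
    for token_id in range(old_vocab_size, new_vocab_size):
        # Surface form of the new token, re-tokenized with the old vocabulary.
        piece = new_tokenizer.decode([token_id])
        old_ids = old_tokenizer(piece, add_special_tokens=False)["input_ids"]
        if old_ids:
            new_embeddings[token_id] = old_embeddings[old_ids].mean(dim=0)
# The lm_head rows are expanded and initialized in the same way.
```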
The Mixtral-8x7B model has 46.7B parameters. Full-parameter training would require combining multiple parallelism strategies and is too costly when training resources are limited. We therefore train the model with QLoRA, the method officially recommended by HuggingFace. On top of LoRA's low-rank decomposition, QLoRA further reduces the GPU memory required for training by introducing 4-bit quantization, double quantization and paging via NVIDIA unified memory, while maintaining performance comparable to full-parameter training.
Following the LoRA settings of Yiming Cui et al., we apply low-rank decomposition to all Linear layers of the original model and set the parameters of the expanded embedding and lm_head layers to be trainable. For the model backbone, we quantize to the NF4 format, which keeps the quantized weights close to their original distribution and loses less weight information.
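A minimal sketch of this setup with bitsandbytes and PEFT (the LoRA rank, alpha, dropout and the exact list of target modules below are illustrative assumptions, not the project's published hyperparameters):

```python
# Sketch: QLoRA setup with NF4 quantization, double quantization, LoRA on the
# Linear layers, and trainable (expanded) embed_tokens / lm_head.
# Rank, alpha, dropout and target_modules are illustrative, not the project's exact values.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NF4 keeps weights close to their original distribution
    bnb_4bit_use_double_quant=True,     # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "HIT-SCIR/Chinese-Mixtral-8x7B", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,       # illustrative hyperparameters
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "w1", "w2", "w3", "gate"],     # Linear layers of Mixtral
    modules_to_save=["embed_tokens", "lm_head"],   # expanded layers trained in full
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```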
We recommend Python 3.10 with torch 2.0.1:
```bash
# PyTorch + Transformers
$ pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2
$ pip install transformers==4.36.2 datasets evaluate peft accelerate gradio optimum sentencepiece trl
$ pip install jupyterlab scikit-learn pandas matplotlib tensorboard nltk rouge bitsandbytes fire
# CUDA Toolkit
$ conda install nvidia/label/cuda-11.7.1::cuda
# DeepSpeed
$ git clone https://github.com/microsoft/DeepSpeed.git
$ cd DeepSpeed
$ DS_BUILD_FUSED_ADAM=1 pip3 install .
# Flash Attention
$ pip install flash-attn --no-build-isolation
```
We trained Chinese-Mixtral-8x7B on existing open-source datasets, including:
| Dataset name | Language | Data used | Remarks |
|---|---|---|---|
| Skywork/SkyPile-150B | Chinese | 30B tokens | Only the 2022 and 2023 data is used |
| DKYoon/SlimPajama-6B | English | 12B tokens | Dataset repeated for 2 epochs |
Download the datasets into the data directory via data/download.py. For the SlimPajama dataset, you need to use data/parquet2jsonl.py to convert the original dataset to jsonl format.
The downloaded dataset consists of multiple jsonl shards. Use cat to merge the shards into a single jsonl file:
```bash
$ cat *.jsonl > all.jsonl
```
Then split the jsonl into train and valid sets. In this project the ratio of train to valid lines is 999:1.
```bash
$ wc -l all.jsonl                          # count the total number of lines in the dataset
$ split -l <lines> all.jsonl               # compute the train/valid line counts at a 999:1 ratio and split
$ mv xaa DKYoon-SlimPajama-6B-train.jsonl  # rename
$ mv xab DKYoon-SlimPajama-6B-dev.jsonl
```
Register the dataset name and path in data/datasets.toml:
```toml
[DKYoon-SlimPajama-6B]              # dataset name
splits = ["train", "dev"]           # dataset train/valid splits
root = "{DATA_DIR}/en/{name}"       # dataset root directory
doc = "{name}-{split}"              # dataset file name
encoded = "encoded-{name}-{split}"  # where the preprocessed data is saved
```
Use data/preprocess_datasets.py to tokenize (subword-segment) the datasets in advance to speed up training:
```bash
$ python data/preprocess_datasets.py --ds_name SkyPile-150B-2023 --tokenizer_name_or_path tokenizer/Mixtral-8x7B-v0.1-vocab
$ python data/preprocess_datasets.py --ds_name DKYoon-SlimPajama-6B --tokenizer_name_or_path tokenizer/Mixtral-8x7B-v0.1-vocab
```
After tokenization, you can use data/utils.py to view the total number of tokens in each dataset:
```bash
$ python data/utils.py
```
The pre-training launch script is scripts/train.sh. The training datasets and their sampling ratios can be changed by modifying TRAIN_DATASETS:
```bash
TRAIN_DATASETS=(
    1:SkyPile-150B-2022      # use all of SkyPile-150B-2022
    0.1:SkyPile-150B-2023    # use 10% of SkyPile-150B-2023
    1:DKYoon-SlimPajama-6B   # use all of DKYoon-SlimPajama-6B
)
```
If you use the SLURM cluster management system, you can submit the job with sbatch:
```bash
$ sbatch scripts/train-pt.sh
```
If you do not have SLURM or want to launch training from the command line, you can directly extract the torchrun command from scripts/train-pt.sh to start training.
The dataset format required for fine-tuning is similar to pre-training: the dataset file must be in jsonl format, with one JSON object per line containing a "text" field in which the instruction, input and output are spliced together according to the template you need.
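A minimal sketch of building such a jsonl file (the prompt template and file name below are only illustrative examples, not a template prescribed by this project):

```python
# Sketch: build a jsonl fine-tuning file with one "text" field per line.
# The prompt template and output file name are illustrative assumptions.
import json

examples = [
    {"instruction": "将下面的句子翻译成英文。", "input": "我的名字是张三。", "output": "My name is Zhang San."},
]

template = "### 指令:\n{instruction}\n### 输入:\n{input}\n### 回答:\n{output}"

with open("ShareGPT-Chinese-train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps({"text": template.format(**ex)}, ensure_ascii=False) + "\n")
```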
Then you need to register the dataset name and path into data/datasets.toml :
```toml
[ShareGPT-Chinese]              # dataset name
splits = ["train"]              # dataset train/valid splits
root = "{DATA_DIR}/sft/{name}"  # dataset root directory
doc = "{name}-{split}"          # dataset file name
```
The fine-tuning launch script is scripts/train-sft.sh. The training datasets and their sampling ratios can be changed by modifying TRAIN_DATASETS:
```bash
TRAIN_DATASETS=(
    1.0:ShareGPT-Chinese    # use all of ShareGPT-Chinese
    0.5:ShareGPT-English    # use 50% of ShareGPT-English
)
```
If you use the SLURM cluster management system, you can submit the job with sbatch:
```bash
$ sbatch scripts/train-sft.sh
```
If you do not have SLURM or want to launch training from the command line, you can directly extract the torchrun command from scripts/train-sft.sh to start training.
If you find this project helpful to your research, or if you use its code, please cite this project:
```
@misc{Chinese-Mixtral-8x7B,
  author = {HIT-SCIR},
  title = {Chinese-Mixtral-8x7B: An Open-Source Mixture-of-Experts LLM},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/HIT-SCIR/Chinese-Mixtral-8x7B}}
}
```