This project fine-tunes the ChatGLM, ChatGLM2 and ChatGLM3 models in several ways (the Freeze method, the LoRA method, the P-Tuning method, full-parameter training, etc.) and compares how the models perform under the different fine-tuning methods, mainly on information extraction, generation and classification tasks.
The project supports single-card and multi-card training. Because fine-tuning uses a single instruction template, the fine-tuned model shows no serious catastrophic forgetting.
Since the official code and models have been updated, the ChatGLM and ChatGLM2 code and models used here are the 20230806 versions, and ChatGLM3 is the 20231212 version. (If the code fails to run, replace the py files in your model directory with ChatGLM source files of the matching version, because the model you downloaded may not match the code version of this project.)
PS: The Trainer class is not used (although Trainer code is concise, it is not easy to customize; in the era of large models, algorithm engineers have increasingly become data engineers, so it is worth understanding the training process itself).
If GPU memory is insufficient during fine-tuning, options such as gradient_checkpointing, ZeRO-3, and offload can be enabled to save memory.
The model_name_or_path parameter below is the model path; change it to the location where your model is actually stored.
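For example, gradient checkpointing can usually be switched on through the standard transformers hooks, as in this hedged sketch (whether the call is supported depends on the specific ChatGLM model code version; the path is a placeholder):

```python
from transformers import AutoModel

# Placeholder path; point this at your local ChatGLM checkpoint.
model = AutoModel.from_pretrained("ChatGLM2-6B/", trust_remote_code=True)

# Trade compute for memory: recompute activations during the backward pass.
model.gradient_checkpointing_enable()
# Make sure checkpointed blocks still receive gradients when the embedding is frozen.
model.enable_input_require_grads()
```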
The Freeze method freezes part of the original model's parameters and trains only the rest, so that a large model can be trained on a single card or on multiple cards without tensor parallelism (TP) or pipeline parallelism (PP).
For the fine-tuning code, see train.py; the core part is as follows:
```python
freeze_module_name = args.freeze_module_name.split(",")
for name, param in model.named_parameters():
    if not any(nd in name for nd in freeze_module_name):
        param.requires_grad = False
```
To freeze or unfreeze different layers of the model, adjust the freeze_module_name parameter yourself, e.g. "layers.27.,layers.26.,layers.25.,layers.24.". All training code runs through DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, freeze_module_name, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc., and can be set according to your own task.
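To double-check which modules remain trainable after freezing, a small helper like the following can be used (an illustrative sketch, not part of train.py):

```python
def report_trainable(model):
    """Illustrative helper: list the parameters that will be updated and the trainable fraction."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
            print("trainable:", name)
    print(f"trainable params: {trainable} / {total} ({100 * trainable / total:.2f}%)")
```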
ChatGLM single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm
--train_type freeze
--freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24."
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm
ChatGLM four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm
--train_type freeze
--freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24."
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm
ChatGLM2 single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM2-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm2
--train_type freeze
--freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24."
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm2
ChatGLM2 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM2-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm2
--train_type freeze
--freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24."
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm2
ChatGLM3 single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM3-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm3
--train_type freeze
--freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24."
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm3
ChatGLM3 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM3-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm3
--train_type freeze
--freeze_module_name "layers.27.,layers.26.,layers.25.,layers.24."
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm3
PS: Fine-tuning ChatGLM uses more GPU memory than ChatGLM2; detailed GPU memory usage is shown below:
| Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | GPU Memory Used |
|---|---|---|---|---|---|---|---|
| ChatGLM | zero2 | No | Yes | 1 | 1560 | 1 | 36G |
| ChatGLM | zero2 | No | No | 1 | 1560 | 1 | 38G |
| ChatGLM | zero2 | No | Yes | 1 | 1560 | 4 | 24G |
| ChatGLM | zero2 | No | No | 1 | 1560 | 4 | 29G |
| ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 1 | 35G |
| ChatGLM2 | zero2 | No | No | 1 | 1560 | 1 | 36G |
| ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 4 | 22G |
| ChatGLM2 | zero2 | No | No | 1 | 1560 | 4 | 27G |
The PT method, i.e. the P-Tuning method, follows the official ChatGLM code and is a soft-prompt method for large models.

For the fine-tuning code, see train.py; the core part is as follows:
```python
config = MODE[args.mode]["config"].from_pretrained(args.model_name_or_path)
config.pre_seq_len = args.pre_seq_len
config.prefix_projection = args.prefix_projection
model = MODE[args.mode]["model"].from_pretrained(args.model_name_or_path, config=config)
for name, param in model.named_parameters():
    if not any(nd in name for nd in ["prefix_encoder"]):
        param.requires_grad = False
```
When prefix_projection is True, new parameters are added to the embedding and to every layer of the large model; when it is False, new parameters are added only to the embedding of the large model.
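As a rough sketch of how such a prefix encoder is commonly structured (the names and dimensions below are illustrative, not the exact ChatGLM implementation):

```python
import torch

class PrefixEncoderSketch(torch.nn.Module):
    """Illustrative prefix encoder: maps pre_seq_len virtual-token ids to prefix
    key/value states; kv_size stands in for the per-layer key/value dimensions."""
    def __init__(self, pre_seq_len, hidden_size, kv_size, prefix_projection):
        super().__init__()
        self.prefix_projection = prefix_projection
        if prefix_projection:
            # Embedding plus a two-layer MLP projection over the prefix tokens.
            self.embedding = torch.nn.Embedding(pre_seq_len, hidden_size)
            self.trans = torch.nn.Sequential(
                torch.nn.Linear(hidden_size, hidden_size),
                torch.nn.Tanh(),
                torch.nn.Linear(hidden_size, kv_size),
            )
        else:
            # Embedding-only variant: emits the key/value states directly.
            self.embedding = torch.nn.Embedding(pre_seq_len, kv_size)

    def forward(self, prefix_ids):
        if self.prefix_projection:
            return self.trans(self.embedding(prefix_ids))
        return self.embedding(prefix_ids)
```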
All training code runs through DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, pre_seq_len, prefix_projection, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc., and can be set according to your own task.
ChatGLM single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM-6B
--per_device_train_batch_size 1
--max_len 768
--max_src_len 512
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm
--train_type ptuning
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--pre_seq_len 16
--prefix_projection True
--output_dir ./output-glm
ChatGLM four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm
--train_type ptuning
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--pre_seq_len 16
--prefix_projection True
--output_dir ./output-glm
ChatGLM2 single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM2-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm2
--train_type ptuning
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--pre_seq_len 16
--prefix_projection True
--output_dir ./output-glm2
ChatGLM2 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM2-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm2
--train_type ptuning
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--pre_seq_len 16
--prefix_projection True
--output_dir ./output-glm2
ChatGLM3 single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM3-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm3
--train_type ptuning
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--pre_seq_len 16
--prefix_projection True
--output_dir ./output-glm3
ChatGLM3 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM3-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm3
--train_type ptuning
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--pre_seq_len 16
--prefix_projection True
--output_dir ./output-glm3
PS: Fine-tuning ChatGLM uses more GPU memory than ChatGLM2; detailed GPU memory usage is shown below:
| Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | GPU Memory Used |
|---|---|---|---|---|---|---|---|
| ChatGLM | zero2 | No | Yes | 1 | 768 | 1 | 43G |
| ChatGLM | zero2 | No | No | 1 | 300 | 1 | 44G |
| ChatGLM | zero2 | No | Yes | 1 | 1560 | 4 | 37G |
| ChatGLM | zero2 | No | No | 1 | 1360 | 4 | 44G |
| ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 1 | 20G |
| ChatGLM2 | zero2 | No | No | 1 | 1560 | 1 | 40G |
| ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 4 | 19G |
| ChatGLM2 | zero2 | No | No | 1 | 1560 | 4 | 39G |
The LoRA method adds extra low-rank matrices in parallel to specified weight matrices of the large language model, and during training only the parameters of these additional low-rank matrices are updated. When the rank is much smaller than the original weight dimensions, the number of newly added parameters is very small, so downstream tasks can be tuned by training only a small number of parameters while still obtaining good performance.
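As a back-of-the-envelope illustration of why the added parameters are so few (a hypothetical 4096x4096 weight with rank 16, not the exact ChatGLM shapes):

```python
d, r = 4096, 16                # hypothetical hidden size and LoRA rank
full = d * d                   # parameters of the original weight matrix
lora = d * r + r * d           # parameters of the two low-rank factors A and B
print(full, lora, f"{lora / full:.2%}")   # 16777216 131072 0.78%
```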

For the fine-tuning code, see train.py; the core part is as follows:
```python
model = MODE[args.mode]["model"].from_pretrained(args.model_name_or_path)
lora_module_name = args.lora_module_name.split(",")
config = LoraConfig(r=args.lora_dim,
                    lora_alpha=args.lora_alpha,
                    target_modules=lora_module_name,
                    lora_dropout=args.lora_dropout,
                    bias="none",
                    task_type="CAUSAL_LM",
                    inference_mode=False,
                    )
model = get_peft_model(model, config)
model.config.torch_dtype = torch.float32
```
PS: After LoRA training, merge the LoRA parameters into the base model before running predictions.
All training code runs through DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, lora_dim, lora_alpha, lora_dropout, lora_module_name, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc., and can be set according to your own task.
ChatGLM single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm
--train_type lora
--lora_dim 16
--lora_alpha 64
--lora_dropout 0.1
--lora_module_name "query_key_value"
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm
ChatGLM four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm
--train_type lora
--lora_dim 16
--lora_alpha 64
--lora_dropout 0.1
--lora_module_name "query_key_value"
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm
ChatGLM2 single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM2-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm2
--train_type lora
--lora_dim 16
--lora_alpha 64
--lora_dropout 0.1
--lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense"
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm2
ChatGLM2 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM2-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm2
--train_type lora
--lora_dim 16
--lora_alpha 64
--lora_dropout 0.1
--lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense"
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm2
ChatGLM3 single card training
CUDA_VISIBLE_DEVICES=0 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM3-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm3
--train_type lora
--lora_dim 16
--lora_alpha 64
--lora_dropout 0.1
--lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense"
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm3
ChatGLM3 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM3-6B/
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm3
--train_type lora
--lora_dim 16
--lora_alpha 64
--lora_dropout 0.1
--lora_module_name "query_key_value,dense_h_to_4h,dense_4h_to_h,dense"
--seed 1234
--ds_file ds_zero2_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm3
PS: Fine-tuning ChatGLM uses more GPU memory than ChatGLM2; detailed GPU memory usage is shown below:
| Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | GPU Memory Used |
|---|---|---|---|---|---|---|---|
| ChatGLM | zero2 | No | Yes | 1 | 1560 | 1 | 20G |
| ChatGLM | zero2 | No | No | 1 | 1560 | 1 | 45G |
| ChatGLM | zero2 | No | Yes | 1 | 1560 | 4 | 20G |
| ChatGLM | zero2 | No | No | 1 | 1560 | 4 | 45G |
| ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 1 | 20G |
| ChatGLM2 | zero2 | No | No | 1 | 1560 | 1 | 43G |
| ChatGLM2 | zero2 | No | Yes | 1 | 1560 | 4 | 19G |
| ChatGLM2 | zero2 | No | No | 1 | 1560 | 4 | 42G |
Note: The LoRA method saves only the LoRA parameters when saving the model, so the LoRA weights must be merged into the base model before prediction. For details, see merge_lora.py.
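A minimal sketch of what that merge step can look like with peft (merge_lora.py itself may differ; the paths below are placeholders):

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import PeftModel

base_model_path = "ChatGLM2-6B/"          # placeholder: base model directory
lora_checkpoint = "./output-glm2/"        # placeholder: LoRA output directory
merged_dir = "./output-glm2-merged/"      # placeholder: where to save the merged model

base_model = AutoModel.from_pretrained(base_model_path, trust_remote_code=True, torch_dtype=torch.float16)
lora_model = PeftModel.from_pretrained(base_model, lora_checkpoint)
merged_model = lora_model.merge_and_unload()   # fold the low-rank updates into the base weights
merged_model.save_pretrained(merged_dir, max_shard_size="2GB")

tokenizer = AutoTokenizer.from_pretrained(base_model_path, trust_remote_code=True)
tokenizer.save_pretrained(merged_dir)
```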
The full-parameter method trains all parameters of the large model. It mainly uses DeepSpeed ZeRO-3 to partition the model parameters across multiple cards, plus the offload method to move the optimizer states to the CPU, to work around insufficient GPU memory.
For the fine-tuning code, see train.py; the core part is as follows:
```python
model = MODE[args.mode]["model"].from_pretrained(args.model_name_or_path)
```
All training code runs through DeepSpeed. Configurable parameters include train_path, model_name_or_path, mode, train_type, ds_file, num_train_epochs, per_device_train_batch_size, gradient_accumulation_steps, output_dir, etc., and can be set according to your own task.
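Since full-parameter training relies on a ZeRO-3 DeepSpeed config (optionally with offload), here is a hedged sketch of what such a config can contain; the repo's ds_zero3_offload.json / ds_zero3_no_offload.json may differ in exact fields and values:

```python
import json

# Illustrative ZeRO-3 + CPU-offload DeepSpeed configuration.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 4,
    "gradient_clipping": 1.0,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "overlap_comm": True,
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}

with open("ds_zero3_offload_example.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```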
ChatGLM four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm
--train_type all
--seed 1234
--ds_file ds_zero3_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm
ChatGLM2 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM2-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm2
--train_type all
--seed 1234
--ds_file ds_zero3_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm2
ChatGLM3 four-card training; CUDA_VISIBLE_DEVICES controls which cards are used for training. If it is not set, all cards on the machine are used.
CUDA_VISIBLE_DEVICES=0,1,2,3 deepspeed --master_port 520 train.py
--train_path data/spo_0.json
--model_name_or_path ChatGLM3-6B
--per_device_train_batch_size 1
--max_len 1560
--max_src_len 1024
--learning_rate 1e-4
--weight_decay 0.1
--num_train_epochs 2
--gradient_accumulation_steps 4
--warmup_ratio 0.1
--mode glm3
--train_type all
--seed 1234
--ds_file ds_zero3_no_offload.json
--gradient_checkpointing
--show_loss_step 10
--output_dir ./output-glm3
PS: Fine-tuning ChatGLM uses more GPU memory than ChatGLM2; detailed GPU memory usage is shown below:
| Model | DeepSpeed Stage | Offload | Gradient Checkpointing | Batch Size | Max Length | Number of A40 GPUs | GPU Memory Used |
|---|---|---|---|---|---|---|---|
| ChatGLM | zero3 | Yes | Yes | 1 | 1560 | 4 | 33G |
| ChatGLM2 | zero3 | No | Yes | 1 | 1560 | 4 | 44G |
| ChatGLM2 | zero3 | Yes | Yes | 1 | 1560 | 4 | 26G |
The following is a description of DeepSpeed's ZeRO stages.
See the requirements.txt file.
An example of the training data format for the information extraction task:
```json
{
    "instruction": "你现在是一个信息抽取模型,请你帮我抽取出关系内容为\"性能故障\", \"部件故障\", \"组成\"和 \"检测工具\"的相关三元组,三元组内部用\"_\"连接,三元组之间用\n分割。文本:",
    "input": "故障现象:发动机水温高,风扇始终是低速转动,高速档不工作,开空调尤其如此。",
    "output": "发动机_部件故障_水温高\n风扇_部件故障_低速转动"
}
```
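A hedged sketch of how a sample in this format might be turned into a prompt and target (the actual preprocessing in train.py may differ):

```python
import json

def build_example(json_line, max_src_len=1024):
    # Illustrative only: concatenate instruction + input as the source prompt
    # and use output as the target, as the sample format above suggests.
    sample = json.loads(json_line)
    prompt = sample["instruction"] + sample["input"]
    target = sample["output"]
    return prompt[:max_src_len], target
```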
| Fine-tuning method | PT-Only-Embedding | PT | Freeze | Lora |
|---|---|---|---|---|
| F1 on test set | 0.0 | 0.6283 | 0.5675 | 0.5359 |
Result analysis:
Many users have experienced catastrophic forgetting after fine-tuning, but it did not occur with this project's training code. The "translation task", "code task", and "question answering task" were tested; the specific results are as follows:



An example of the data format for the question generation (D2Q) task:
```json
{
    "instruction": "你现在是一个问题生成模型,请根据下面文档生成一个问题,文档:",
    "input": "清热解毒口服液由生石膏、知母、紫花地丁、金银花、麦门冬、黄芩、玄参、连翘、龙胆草、生地黄、栀子、板蓝根组成。具有疏风解表、清热解毒利咽、生津止渴的功效,适用于治疗外感时邪、内有蕴热所致的身热汗出、头痛身痛、心烦口渴、微恶寒或反恶热、舌红、苔黄、脉数等症。现代临床主要用于治疗流行性感冒、流行性脑脊髓膜炎、肺炎等各种发热性疾病。口服液:每支10毫升,每次10~20毫升,每日3次。〔注意事项〕阳虚便澹者不宜使用。",
    "output": "清热解毒口服的功效有哪些?"
}
```
Since the output of a generative model cannot be evaluated in the same way as the information extraction task, and existing metrics such as BLEU or ROUGE are not well suited, a manual scoring rule was defined: the D2Q model is judged from the two perspectives of diversity and accuracy, with up to 5 points per sample over a total of 20 samples.
| Fine-tuning method | Original model | PT-Only-Embedding | PT | Freeze | Lora |
|---|---|---|---|---|---|
| Score | 51.75 | 73.75 | 87.75 | 79.25 | 86.75 |
For the code description, see: Pipeline parallelism (Pipeline) practice.
Please see the v0.1 tag.