This project uses instruction fine-tuning to fine-tune a large language model. The main training code is copied from chinese-LLaMA-Alpaca, with some modifications:
Note: there are still problems. After checking the relevant information, you have to add loss.requires_grad_(True) to modeling_chatglm.py for training to run successfully. Apart from chatglm simply not being supported, the same problem also shows up for the same model in chinese-LLaMA-Alpaca. In any case, after this modification the code runs.
Although the whole pipeline works, the model does not seem to train effectively: the loss stays around 4, and the problem persists after trying different learning rates and training for longer.
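For context, here is a minimal, self-contained illustration (not code from this project) of the error that loss.requires_grad_(True) suppresses, and of why suppressing it does not by itself let the weights learn, which would be consistent with the flat loss described above:

import torch
import torch.nn as nn

# If every parameter feeding the loss is frozen, the loss has no grad_fn and
# backward() raises "element 0 of tensors does not require grad and does not
# have a grad_fn".
layer = nn.Linear(4, 2)
for p in layer.parameters():
    p.requires_grad_(False)

loss = layer(torch.randn(3, 4)).sum()
print(loss.requires_grad)  # False

try:
    loss.backward()
except RuntimeError as e:
    print("backward failed:", e)

# Forcing requires_grad on the detached loss silences the error, but no
# gradient can reach the frozen weights, so nothing is actually updated.
loss = layer(torch.randn(3, 4)).sum()
loss.requires_grad_(True)
loss.backward()
print(layer.weight.grad)  # None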
This project is mainly the third part of the pre-trained model series: fine-tuning the pre-trained model. Its main purpose is to explain the entire process. For a detailed introduction, see the Zhihu article: https://zhuanlan.zhihu.com/p/640086409. If you want to use this in practice, refer to other published projects: [taishan1994](https://github.com/taishan1994).
Dependencies:

mpi4py
transformers==4.28.1
peft==0.3.0
icetk
deepspeed==0.9.2
accelerate
cpm_kernels
sentencepiece==0.1.99
torch==2.0.0
datasets

For the packages without a pinned version, the latest release should be fine.
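As an optional sanity check (a small helper, not part of the project), you can print the installed versions of the pinned packages:

from importlib.metadata import version  # Python >= 3.8

# Print the installed versions of the pinned dependencies.
for pkg in ["transformers", "peft", "deepspeed", "sentencepiece", "torch"]:
    print(pkg, version(pkg))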
1. Download the chatglm-6b model to model_hub/chatglm-6b
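One way to download the weights (an assumption, the project does not prescribe a method) is via huggingface_hub; cloning the THUDM/chatglm-6b repository with git lfs works just as well:

from huggingface_hub import snapshot_download

# Download THUDM/chatglm-6b into the directory the training command expects
# (local_dir requires a reasonably recent huggingface_hub release).
snapshot_download(repo_id="THUDM/chatglm-6b", local_dir="model_hub/chatglm-6b")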
2. Prepare the data in the format of data/msra/train.txt, with one JSON sample per line, similar to:

{"instruct": "你现在是一个实体识别模型,你需要提取文本里面的人名、地名、机构名,如果存在结果,返回'实体_实体类型',不同实体间用\n分隔。如果没有结果,回答'没有'。", "query": "文本:一位郑州学人说,越秀学术讲座对郑州学界而言堪称功德之举。", "answer": "郑州_地名\n越秀_机构名"}
3. After preparing the data, you can start instruction fine-tuning with:

torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --deepspeed ds_zero2_no_offoad.json \
    --model_name_or_path model_hub/chatglm-6b \
    --tokenizer_name_or_path model_hub/chatglm-6b \
    --dataset_dir data/msra/ \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --num_train_epochs 3 \
    --learning_rate 3e-5 \
    --warmup_ratio 0.01 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 200 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 8 \
    --max_seq_length 256 \
    --output_dir output_dir \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank 8 \
    --lora_alpha 32 \
    --trainable query_key_value \
    --lora_dropout 0.05 \
    --torch_dtype float16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
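Inside run_clm_sft_with_peft.py, the --lora_* and --trainable flags presumably end up in a peft LoraConfig; the following is a rough sketch of the equivalent configuration under peft 0.3.0 (an assumption about the script, not a quote from it):

from peft import LoraConfig, TaskType, get_peft_model

# Roughly what --lora_rank 8 --lora_alpha 32 --lora_dropout 0.05
# --trainable query_key_value correspond to.
peft_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],
)
# model = get_peft_model(model, peft_config)  # wraps the base ChatGLM model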
4. After training is complete, you can use test_sft_model.py to predict:

import os
import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

# Load the base ChatGLM-6B tokenizer and model in half precision.
tokenizer = AutoTokenizer.from_pretrained("model_hub/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("model_hub/chatglm-6b", trust_remote_code=True).half()
model_vocab_size = model.get_output_embeddings().weight.size(0)
model.resize_token_embeddings(len(tokenizer))
# Attach the trained LoRA adapter on top of the frozen base model.
model = PeftModel.from_pretrained(model, os.path.join("output_dir", "adapter_model"))
model.cuda()
model.eval()

response, history = model.chat(tokenizer, "你好", history=[])
print(response)
response, history = model.chat(tokenizer, "晚上睡不着应该怎么办", history=[])
print(response)
response, history = model.chat(tokenizer, "你现在是一个实体识别模型,你需要提取文本里面的人名、地名、机构名,如果存在结果,返回'实体_实体类型',不同实体间用\n分隔。如果没有结果,回答'没有'。文本:我们是受到郑振铎先生、阿英先生著作的启示,从个人条件出发,瞄准现代出版史研究的空白,重点集藏解放区、国民党毁禁出版物。", history=[])
print(response)

5. Other utilities: fin_lora_names.py shows how the trainable LoRA layers are chosen, test_datset.py tests the data, test_toenizer.py tests the tokenizer, and test_model.py tests the original (un-fine-tuned) model.
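If you want to see which module names are available as LoRA targets (roughly what fin_lora_names.py is for; this is an approximate sketch, not the file's actual contents):

import torch.nn as nn
from transformers import AutoModel

model = AutoModel.from_pretrained("model_hub/chatglm-6b", trust_remote_code=True)

# Collect the suffixes of all nn.Linear modules; these are the candidate
# LoRA target modules (for ChatGLM-6B this typically includes
# "query_key_value", "dense", "dense_h_to_4h" and "dense_4h_to_h").
target_candidates = set()
for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        target_candidates.add(name.split(".")[-1])
print(target_candidates)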
Reference: ymcui/Chinese-LLaMA-Alpaca (Chinese LLaMA & Alpaca LLMs, local CPU/GPU training and deployment): https://github.com/ymcui/Chinese-LLaMA-Alpaca