LLaMA-MOSS-RLHF-LoRA
The RLHF code in this repo does not require the Megatron or DeepSpeed frameworks; it only needs plain PyTorch and GPUs. The critic in RLHF is a reduced version of the target GPT, and for the reward we can use a similarity model that compares generations against the target output. This way you only need to learn the core PPO algorithm; the rest are models and structures you already understand. It makes RLHF very approachable for NLP practitioners, and shows that RLHF alone is enough to finetune a model.
Either LLaMA or MOSS can be selected in the code, and the LoRA optimization is optional.
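As a rough sketch of the similarity-based reward described above (the checkpoint name and function below are assumptions for illustration, not the repo's actual code), the reward can be as simple as the cosine similarity between sentence embeddings of the generated reply and the target reply:

```python
# Minimal sketch of a similarity-based reward (hypothetical, not the repo's exact code).
import torch
from transformers import AutoModel, AutoTokenizer

# Any sentence encoder works here; this checkpoint name is an assumption.
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
encoder = AutoModel.from_pretrained("bert-base-chinese")

@torch.no_grad()
def similarity_reward(generated: str, target: str) -> float:
    batch = tokenizer([generated, target], padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state        # (2, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)       # (2, seq_len, 1)
    emb = (hidden * mask).sum(1) / mask.sum(1)         # mean-pool over real tokens
    return torch.cosine_similarity(emb[0], emb[1], dim=0).item()

print(similarity_reward("Zhang San is my master.", "My master is Zhang San"))
```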
Function:
- Definition and use of the RLHF data format √ (a sample record is shown under step 1 of How to use below)
- Fine-tune the model using only RLHF √
- Make the model recognize its master √
  - Modify the model's self-cognition imprint
  - Master's name
  - Robot's nickname
- Batch-generate several different prompts and then run RLHF ×
Installation environment
The installation environment follows requirement.txt, mainly torch and transformers.
- Running MOSS requires the accelerate library
- Running LoRA requires peft
- peft changes a lot because it updates quickly, so you need to pin peft to version 0.2.0
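For example (only the peft version is pinned by the notes above):
pip install torch transformers accelerate peft==0.2.0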
How to use
0 Select the model you need (set model_name_or_path in rlhf_train_gpt.py and whether LoRA is needed), then preprocess
- moss
- llama
  - You need to merge the LLaMA base model with the retrained LoRA parameters
  - python merge_llama_with_chinese_lora_to_hf.py
  - In that script you can choose different LLaMA parameter sizes and LoRA weights
  - The merged HF-format model is then saved
1 Modify the master name and nickname you want and execute the code below to generate the target data, or just use the defaults.
python data/generate_data.py
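A hypothetical example of what a generated target record might look like (the field names are assumptions, not taken from the repo; check data/generate_data.py for the real schema):

```python
# Hypothetical RLHF record (field names are an assumption).
sample = {
    "prompt": "Who is your master?",
    "target": "Zhang San is my master.",  # answer the similarity reward compares against
}
```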
2 Start the RLHF (LoRA) training
python rlhf_train_gpt.py
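At the heart of the training step is the standard clipped PPO objective; below is a minimal sketch of that update (tensor names are hypothetical, and this is not the repo's exact code):

```python
import torch

def ppo_policy_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate loss.

    logp_new:   log-probs of the sampled tokens under the current policy
    logp_old:   log-probs under the policy that generated the samples
    advantages: advantage estimates from the critic
    """
    ratio = torch.exp(logp_new - logp_old)        # ratio near 1 => policy barely changing
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # minimize the negative surrogate
```

When the ratio stays close to 1, the policy's token probabilities are no longer moving, which is the stopping signal mentioned under Effect display below.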
Resource consumption
- moss
  - 13B parameters
  - Four 3090s are required: the MOSS model needs about 26G to load and about 46G of video memory for training (3 cards), plus one more card for the critic and reward models. You can also try an A6000, which may run it as well.
  - Roughly 50G of video memory in total
- llama
  - 7B parameters
  - Two 3090s are required: one for loading and training LLaMA, and one for the critic model
Effect display
After training for about 6 epochs, or once the PPO ratio stays close to 1 (meaning the model's generation probabilities are no longer changing much), you can try the model out.
- What is Meimei?
  - Meimei is the nickname given to me by my master.
- Who gave you the name Meimei?
  - Baba is my nickname.
  - My master gave me the name Meimei.
- Who is your master?
  - Zhang San is my master.
  - My master is Zhang San.
- Generalization is preserved very well:
  - who is your master
  - What is your nickname
  - What is your relationship with Zhang San
  - what is your relationship with
  - Meimei is the nickname given to me by my master.
Contact information
- Communication group
- QQ group: 788598358
- WeChat group: the group invite may expire