LLM RLHF Tuning
1.0.0
This project implements the three stages of RLHF training (SFT, reward model, and PPO) from scratch, with the implementation details written up in the docs. Everyone is welcome to reach out and discuss via WeChat.

Feature comparison with open-source RLHF training frameworks:
| Framework | SFT Train | RM Train | PPO Train | DPO Train |
|---|---|---|---|---|
| Our | ✅ | ✅ | ✅ | ✅ |
| Deepspeed-chat | ✅ | ✅ | ✅ | |
| trl | ✅ | ✅ | ✅ | ✅ |
| MOSS-RLHF | | | ✅ | |
| Framework | Accelerate | Deepspeed | Multi LoRA | Minimum model parameter count (taking a 7B model as an example) |
|---|---|---|---|---|
| Our | ✅ | ✅ | ✅ | single model size ~ 7B |
| Deepspeed-chat | | ✅ | | sft+rm+actor+critic ~ 28B |
| trl | ✅ | | | single model size (without ref model) ~ 7B |
| MOSS-RLHF | actor model, critic model | sft model, rm model | | sft+rm+actor+critic ~ 28B |
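
For context on the "Multi LoRA" column and the single-model (~7B) memory footprint: the idea is to attach several LoRA adapters (e.g. actor and critic) to one frozen base model and switch between them, instead of keeping separate full copies of each model. Below is a minimal sketch of that idea using peft; the model id, adapter names, and LoRA hyperparameters are assumptions for illustration, not this repository's actual code.

```python
# A minimal sketch of the "Multi LoRA" idea, assuming peft's adapter API;
# the model id, adapter names, and hyperparameters are placeholders, not this repo's code.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # placeholder 7B base

lora_cfg = LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16, lora_dropout=0.05)

# Attach a first adapter ("actor"); the 7B base weights stay frozen.
model = get_peft_model(base, lora_cfg, adapter_name="actor")
# Attach a second adapter ("critic") to the very same base model.
# (A real critic would additionally need a scalar value head on top.)
model.add_adapter("critic", lora_cfg)

model.set_adapter("actor")   # forward passes now use the actor LoRA weights
model.set_adapter("critic")  # switch to the critic LoRA weights
```
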
accelerate==0.21.0
datasets==2.13.1
scikit-learn==1.3.0
sentencepiece==0.1.99
tqdm==4.65.0
transformers==4.31.0
wandb==0.15.8
peft==0.4.0
torch==2.0.1
trl==0.5.0
deepspeed==0.10.0
- Based on two base models
- Based on a single base model (see the sketch below)
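
A hedged sketch of what training "based on a single base model" can look like: the actor is a LoRA adapter on the base model, and reference-model logits for the KL penalty are obtained by temporarily disabling the adapter, so no second full copy of the model is needed. The model name and prompt below are placeholders, not this project's actual interface.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
base = AutoModelForCausalLM.from_pretrained(model_name)

# Actor = a LoRA adapter on top of the shared, frozen base model.
actor = get_peft_model(base, LoraConfig(task_type=TaskType.CAUSAL_LM, r=8, lora_alpha=16))

inputs = tokenizer("Human: hello\nAssistant:", return_tensors="pt")

# Actor logits (LoRA weights active).
actor_logits = actor(**inputs).logits

# Reference logits: the same base model with the adapter switched off,
# so no second full copy is needed for the KL penalty.
with torch.no_grad(), actor.disable_adapter():
    ref_logits = actor(**inputs).logits
```
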
Everyone is welcome to join the WeChat group for discussion.