LLM RLHF Tuning
1.0.0
This project implements the three-stage RLHF training pipeline (SFT, reward model, PPO) from scratch, and the implementation details are written up in the accompanying docs; questions and discussion are welcome (WeChat).
Feature comparison with open-source RLHF training frameworks (a minimal PPO sketch follows the tables):
| Framework | SFT Train | RM Train | PPO Train | DPO Train |
|---|---|---|---|---|
| Ours | ✅ | ✅ | ✅ | ✅ |
| Deepspeed-chat | ✅ | ✅ | ✅ | |
| trl | ✅ | ✅ | ✅ | ✅ |
| MOSS-RLHF | | | ✅ | |
| Framework | Accelerate | Deepspeed | Multi LoRA | Minimum total model parameters (7B model as example) |
|---|---|---|---|---|
| Ours | ✅ | ✅ | ✅ | single model size ~ 7B |
| Deepspeed-chat | | ✅ | | sft+rm+actor+critic ~ 28B |
| trl | ✅ | | | single model size (no ref model) ~ 7B |
| MOSS-RLHF | actor model, critic model | sft model, rm model | | sft+rm+actor+critic ~ 28B |
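
For orientation, the snippet below is a rough sketch of what a single PPO update involves. It is written against the pinned trl==0.5.0 API as a stand-in, not this project's own trainer; the model name, prompt, and constant reward are placeholders.

```python
# Illustrative only: one PPO step via trl's PPOTrainer (pinned trl==0.5.0),
# standing in for this project's own PPO trainer.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

model_name = "gpt2"  # placeholder; the comparison above assumes ~7B models
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Actor with a value head (policy + critic) and a frozen reference copy.
actor = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)

ppo_trainer = PPOTrainer(
    PPOConfig(batch_size=1, mini_batch_size=1), actor, ref_model, tokenizer
)

query = tokenizer(
    "Human: How do I brew tea?\nAssistant:", return_tensors="pt"
).input_ids[0]
full = ppo_trainer.generate(
    query, max_new_tokens=32, do_sample=True, pad_token_id=tokenizer.eos_token_id
)[0]
response = full[query.shape[0]:]  # keep only the generated tokens

# In real training this score comes from the stage-2 reward model;
# a constant is used here only to show the expected types.
reward = torch.tensor(1.0)
stats = ppo_trainer.step([query], [response], [reward])
```

In the actual pipeline the rewards come from the stage-2 reward model and the queries from a prompt dataset; the list-of-1D-tensors calling convention for queries, responses, and scores is the part worth noting.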
Dependencies (pinned versions):

- accelerate==0.21.0
- datasets==2.13.1
- scikit-learn==1.3.0
- sentencepiece==0.1.99
- tqdm==4.65.0
- transformers==4.31.0
- wandb==0.15.8
- peft==0.4.0
- torch==2.0.1
- trl==0.5.0
- deepspeed==0.10.0
PPO training supports two setups:

- based on two base models
- based on one base model (all roles share a single backbone; see the sketch below)
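
A minimal sketch of the one-base-model setup, assuming the multi-adapter API of the pinned peft==0.4.0; the model name and adapter names are illustrative, not this project's actual interface. Actor and critic differ only in small LoRA adapters attached to the same backbone, and the reference policy is obtained by disabling the adapters, which is where the "single model size ~ 7B" row in the table above comes from.

```python
# Sketch only (not this project's API): several roles share one backbone
# by switching between LoRA adapters instead of loading separate full models.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder base model
    torch_dtype=torch.float16,
)

lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)

# Attach the actor adapter, then register a second adapter (critic)
# on the same backbone. In practice the critic also needs a small
# value head on top of the shared backbone; that detail is omitted here.
model = get_peft_model(base, lora_cfg, adapter_name="actor")
model.add_adapter("critic", lora_cfg)

model.set_adapter("actor")     # forward passes now run as the actor policy
model.set_adapter("critic")    # ... and as the critic after switching
with model.disable_adapter():  # adapters off: the frozen base acts as the reference model
    pass
```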
Feel free to join the WeChat group for discussion.