Fine-tuning LLaMA2 for Chinese

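The LLaMA2 license has changed and now permits commercial use, and LLaMA2-Chat was released together with the base model. I tried fine-tuning Llama-2-7b-chat on a 16 GB GPU (write-up: https://zhuanlan.zhihu.com/p/645152512 , code: https://github.com/git-cloner/llama2-lora-fine-tuning ), but even after extending the Chinese vocabulary the inference quality remained poor and the answers were mostly in English.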

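When LLaMA2 was released, the official fine-tuning program, llama-recipes ( https://github.com/facebookresearch/llama-recipes ), was open-sourced at the same time. It supports full-parameter, LoRA and other fine-tuning methods, and is generally more compatible than third-party programs.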

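This project builds on llama-recipes, adapts it to limited GPU resources, and fine-tunes the original LLaMA2-7b base model with LoRA. The resulting inference quality is acceptable. The project also documents the testing process and provides a streaming API.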

The results of the Chinese fine-tuning can be tried in Aiit-Chat: https://gitclone.com/aiit/chat/ .

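1. GPU requirements

16 GB of GPU memory or more, preferably on two or more cards.

With a corpus of a little over 100 MB, one epoch of fine-tuning on two P100 (16 GB) cards takes about 120 hours, so a faster GPU such as a V100 or RTX 4090 is recommended for fine-tuning.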

2. Fine-tuning

2.1 Download the code

git clone https://github.com/git-cloner/Llama2-chinese
cd Llama2-chinese

2.2 Set up the virtual environment

conda create -n llama-recipes python=3.9 -y
conda activate llama-recipes
# requirements.txt includes dependencies installed from GitHub; on a slow network, setting these two variables lets you watch the progress
export GIT_TRACE=1
export GIT_CURL_VERBOSE=1
pip install -r requirements.txt -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn
# bitsandbytes causes the most problems; after pip install, verify it with the following command
python -m bitsandbytes

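2.3 Download the original Llama2-7b model

# Download the model with this project's downloader, which supports resuming interrupted downloads and reconnecting
python model_download.py --repo_id NousResearch/Llama-2-7b-hf
# The downloaded model is stored under ./models/NousResearch/Llama-2-7b-hf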

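2.4 Prepare the corpus

The corpus uses the alpaca format (huggingface.co hosts many alpaca-style datasets that you can collect and adapt). After customizing it, save it as ft_datasets/alpaca_data.json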
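
For readers unfamiliar with the alpaca format: the file is a JSON array in which each record has `instruction`, `input` and `output` fields. The sketch below writes a tiny file in that shape. The sample records and the `write_alpaca` helper are illustrative assumptions and are not part of this project.

```python
import json
from pathlib import Path

# Illustrative records; replace them with your own corpus.
SAMPLE_RECORDS = [
    {
        "instruction": "把下面的句子翻译成英文。",
        "input": "今天天气很好。",
        "output": "The weather is nice today.",
    },
    {
        # "input" is left empty when the instruction needs no extra context.
        "instruction": "用一句话介绍LoRA微调。",
        "input": "",
        "output": "LoRA冻结原始权重，只训练插入的低秩矩阵。",
    },
]

REQUIRED_KEYS = {"instruction", "input", "output"}


def write_alpaca(records, path):
    """Validate the records and write them as a UTF-8 JSON array."""
    for i, rec in enumerate(records):
        missing = REQUIRED_KEYS - rec.keys()
        if missing:
            raise ValueError(f"record {i} is missing {sorted(missing)}")
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # ensure_ascii=False keeps Chinese text readable in the file.
    path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return len(records)
```

Calling `write_alpaca(SAMPLE_RECORDS, "ft_datasets/alpaca_data.json")` would produce a file at the path used in this section.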

2.5 Run the fine-tuning

# force-kill any previous fine-tuning process
pkill -9 -f llama_finetuning
# train; try different batch_size_training values for your GPU memory and fill it as fully as possible
# this example uses two P100 cards, devices 1 and 2
# Note: even with two cards, nproc_per_node is 1, not 2
CUDA_VISIBLE_DEVICES=1,2 nohup torchrun --nnodes 1 --nproc_per_node 1 \
llama_finetuning.py \
--use_peft \
--peft_method lora \
--model_name ./models/NousResearch/Llama-2-7b-hf \
--use_fp16 \
--output_dir output/model \
--dataset alpaca_dataset \
--batch_size_training 40 \
--num_epochs 3 \
--quantization > train.log 2>&1 &
# check the log
tail -f train.log
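
To give a sense of why LoRA fine-tuning fits on 16 GB cards, the arithmetic below counts the trainable adapter parameters. It assumes LoRA rank 8 applied to the `q_proj` and `v_proj` projections of every layer (I believe these are the llama-recipes defaults, but check your own configuration) and the published Llama-2-7b shape of 32 layers with hidden size 4096. The total parameter count is approximate.

```python
def lora_trainable_params(hidden_size, num_layers, rank, modules_per_layer):
    """Count LoRA adapter parameters for square hidden x hidden projections.

    Each adapted projection gets two matrices:
    A (rank x hidden) and B (hidden x rank).
    """
    per_module = rank * hidden_size + hidden_size * rank
    return num_layers * modules_per_layer * per_module


# Assumed values: 32 layers, hidden size 4096, rank 8,
# adapters on q_proj and v_proj (2 modules per layer).
trainable = lora_trainable_params(
    hidden_size=4096, num_layers=32, rank=8, modules_per_layer=2
)
total = 6_738_415_616  # approximate size of Llama-2-7b (assumption)

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of the base model:   {100 * trainable / total:.4f}%")
```

Under these assumptions the adapter has about 4.2 million parameters, well under 0.1% of the base model, so only a small set of gradients and optimizer states has to be kept in memory.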

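3. Inference test

After one epoch of fine-tuning, the PEFT adapter (the incremental model) is written to output/model. Use the following command to test it interactively from the command line. Because streaming is not used here, the result only appears once generation has finished, so it feels slow.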

CUDA_VISIBLE_DEVICES=0 python generate.py \
    --base_model './models/NousResearch/Llama-2-7b-hf' \
    --lora_weights './output/model' \
    --load_8bit

4. Streaming API test

4.1 Start the API service

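# The model can be loaded with 4-bit or 8-bit quantization, or in half precision
# --load_4bit needs about 6 GB of GPU memory
# --load_8bit needs 9 GB of GPU memory
# half precision needs 13 GB of GPU memory
CUDA_VISIBLE_DEVICES=0 nohup python -u api_stream.py \
--load_4bit > api_stream.log 2>&1 &
tail -f api_stream.log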
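
The memory figures in the comments above can be sanity-checked with a rough weights-only estimate. The sketch assumes about 6.74 billion parameters for Llama-2-7b and ignores activations, the KV cache, quantization metadata and the CUDA context, which is why the measured numbers are higher, especially for the quantized modes.

```python
PARAMS = 6.74e9  # approximate parameter count of Llama-2-7b (assumption)

BITS_PER_WEIGHT = {"4bit": 4, "8bit": 8, "fp16": 16}


def weight_memory_gib(num_params, bits):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bits / 8 / 2**30


for mode, bits in BITS_PER_WEIGHT.items():
    gib = weight_memory_gib(PARAMS, bits)
    print(f"{mode}: about {gib:.1f} GiB for weights alone")
```

The half-precision estimate lands close to the 13 GB quoted above, while the quantized modes need noticeably more than their raw weight size.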

4.2 Test the API

# Send the POST request repeatedly and stop once the returned response contains [stop]
curl -X POST "http://127.0.0.1:8000/stream" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'
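
The polling loop described in the comment can be sketched in Python. The source only states that the endpoint is called repeatedly until the returned response contains `[stop]`. The JSON field name `response`, the stripping of the marker and the `max_calls` guard below are assumptions, so adjust them to what `api_stream.py` actually returns.

```python
import json
import time
import urllib.request

API_URL = "http://127.0.0.1:8000/stream"
STOP_MARKER = "[stop]"


def post_json(payload, url=API_URL):
    """Send one POST request with a JSON body and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=60) as resp:
        return json.loads(resp.read().decode("utf-8"))


def stream_chat(prompt, history=None, post=post_json, max_calls=500, delay=0.0):
    """Repeat the same request until the reply contains the stop marker.

    Assumes the reply is a JSON object whose "response" field holds the
    text generated so far (hypothetical field name).
    """
    payload = {"prompt": prompt, "history": history or []}
    for _ in range(max_calls):
        text = str(post(payload).get("response", ""))
        if STOP_MARKER in text:
            return text.replace(STOP_MARKER, "").strip()
        time.sleep(delay)
    raise TimeoutError(f"no {STOP_MARKER} after {max_calls} calls")
```

With the service from 4.1 running, `stream_chat("你好")` would return the final text. The `post` parameter exists so the loop can be exercised with a stand-in function when no server is available.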

5. Merge the model

python inference/hf-text-generation-inference/merge_lora_weights.py \
--base_model ./models/NousResearch/Llama-2-7b-hf \
--peft_model output/model \
--output_dir output/merged_model_output
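
Conceptually, merging folds the low-rank LoRA update into the base weights so that inference no longer needs a separate adapter: W_merged = W + (alpha / r) * B A. The toy example below checks this identity with plain Python lists on a 2x2 weight. All numbers are made up, and it illustrates the idea rather than what `merge_lora_weights.py` does internally.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every element of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


# Made-up toy values: a 2x2 base weight with a rank-1 adapter.
W = [[1.0, 2.0], [3.0, 4.0]]  # frozen base weight
B = [[0.5], [-1.0]]           # LoRA "up" matrix, shape 2 x r
A = [[2.0, 0.0]]              # LoRA "down" matrix, shape r x 2
alpha, r = 2.0, 1
x = [[1.0], [1.0]]            # input column vector

scaling = alpha / r
# Adapter kept separate: base output plus the scaled low-rank path.
y_adapter = add(matmul(W, x), scale(matmul(B, matmul(A, x)), scaling))
# Adapter merged: a single weight matrix and a single matmul.
W_merged = add(W, scale(matmul(B, A), scaling))
y_merged = matmul(W_merged, x)

assert y_adapter == y_merged
print(W_merged)
```

Because both paths give the same output, the merged model in output/merged_model_output can be loaded like an ordinary checkpoint without the PEFT adapter.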