The LLaMA2 license has been updated to allow commercial use. Alongside the base model, LLaMA2-Chat was also released. I previously fine-tuned Llama-2-7b-chat on a 16 GB GPU (https://zhuanlan.zhihu.com/p/645152512; code at https://github.com/git-cloner/llama2-lora-fine-tuning). However, even after expanding the Chinese vocabulary, inference quality was still poor and the answers were mostly in English.
When LLaMA2 was released, Meta also open-sourced an official fine-tuning framework, llama-recipes (https://github.com/facebookresearch/llama-recipes), which supports full-parameter fine-tuning, LoRA, and other methods, and is more compatible with the model than third-party programs.
This article builds on llama-recipes, adapts it to the available GPU resources, and fine-tunes the original LLaMA2-7b model with LoRA; the resulting model gives reasonable inference. The project also provides a testing procedure and a streaming API.
GPUs with 16 GB of VRAM or more are required, preferably two or more cards. Fine-tuning one epoch over a corpus of more than 100 MB on two P100s (16 GB) takes about 120 hours, so faster cards such as the V100 or 4090 are recommended.
git clone https://github.com/git-cloner/Llama2-chinese
cd Llama2-chinese
conda create -n llama-recipes python=3.9 -y
conda activate llama-recipes
# Some dependencies in requirements.txt are installed from GitHub; if the network is poor, enable these two variables to watch the progress
export GIT_TRACE=1
export GIT_CURL_VERBOSE=1
pip install -r requirements.txt -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn
# bitsandbytes causes the most problems; after pip install, verify it with the following command
python -m bitsandbytes
# Download the model with the downloader developed in this project, which supports resuming and reconnecting
python model_download.py --repo_id NousResearch/Llama-2-7b-hf
# The downloaded model is placed under ./models/NousResearch/Llama-2-7b-hf
The corpus is in the Alpaca format (the Alpaca corpora on huggingface.co are large and can be organized by yourself). After personalization, name it ft_datasets/alpaca_data.json.
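For reference, the Alpaca format is a JSON array of instruction/input/output records. A minimal sketch that writes a two-sample ft_datasets/alpaca_data.json (the sample content below is illustrative, not taken from the original corpus):

import json
from pathlib import Path

# Two illustrative records in the Alpaca format; "input" may be an empty string
samples = [
    {
        "instruction": "将下面的句子翻译成英文。",
        "input": "今天天气很好。",
        "output": "The weather is very nice today.",
    },
    {
        "instruction": "用一句话介绍一下北京。",
        "input": "",
        "output": "北京是中华人民共和国的首都，也是政治和文化中心。",
    },
]

Path("ft_datasets").mkdir(exist_ok=True)
with open("ft_datasets/alpaca_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)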
# force-kill any running fine-tuning process
pkill -9 -f llama_finetuning
# train; batch_size_training can be tuned by trial and error against your VRAM, filling it as much as possible
# This example uses two P100s, GPUs 1 and 2
# NOTE: even with two cards, nproc_per_node is 1, not 2
CUDA_VISIBLE_DEVICES=1,2 nohup torchrun --nnodes 1 --nproc_per_node 1 \
    llama_finetuning.py \
    --use_peft \
    --peft_method lora \
    --model_name ./models/NousResearch/Llama-2-7b-hf \
    --use_fp16 \
    --output_dir output/model \
    --dataset alpaca_dataset \
    --batch_size_training 40 \
    --num_epochs 3 \
    --quantization > train.log 2>&1 &
# check log
tail -f train.log
After one round of fine-tuning, a PEFT incremental (LoRA) model is generated under output/model. Use the following command to test it interactively on the command line. Since streaming is not used, the result is only shown after the whole answer has been generated, so it feels slow.
CUDA_VISIBLE_DEVICES=0 python generate.py \
    --base_model './models/NousResearch/Llama-2-7b-hf' \
    --lora_weights './output/model' \
    --load_8bit
# The model can be loaded for testing with 4-bit or 8-bit quantization, or in half precision
# --load_4bit needs about 6 GB of VRAM
# --load_8bit needs about 9 GB of VRAM
# half precision needs about 13 GB of VRAM
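The LoRA adapter can also be loaded directly in Python instead of going through generate.py. A minimal sketch, assuming the transformers, peft, accelerate, and torch packages from requirements.txt and the paths used in the commands above:

import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base_model_path = "./models/NousResearch/Llama-2-7b-hf"
lora_weights_path = "./output/model"

# Load the base model in half precision, then attach the LoRA adapter
tokenizer = LlamaTokenizer.from_pretrained(base_model_path)
model = LlamaForCausalLM.from_pretrained(
    base_model_path, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, lora_weights_path)
model.eval()

prompt = "请用中文介绍一下你自己。"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))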
To start the streaming API service:
CUDA_VISIBLE_DEVICES=0 nohup python -u api_stream.py \
    --load_4bit > api_stream.log 2>&1 &
tail -f api_stream.log
# Send POST requests repeatedly, and stop calling once the returned response contains [stop]
curl -X POST "http://127.0.0.1:8000/stream" \
    -H 'Content-Type: application/json' \
    -d '{"prompt": "你好", "history": []}'
To merge the LoRA weights back into the base model, run:
python inference/hf-text-generation-inference/merge_lora_weights.py \
    --base_model ./models/NousResearch/Llama-2-7b-hf \
    --peft_model output/model \
    --output_dir output/merged_model_output
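The merged checkpoint under output/merged_model_output is a standalone HF-format model: it can be loaded with transformers alone (no peft needed) or served with text-generation-inference, as the script's path suggests.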