

Fine-tuning Vicuna-7B on a single 16 GB GPU

1. Overview

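There are two common approaches to fine-tuning Facebook's LLaMA: Stanford's Alpaca series, and Vicuna, which is based on the ShareGPT corpus. Vicuna trains on multi-turn conversations, and its results are generally better than those of Alpaca, which by default uses single-turn conversations. It is therefore recommended to fine-tune LLaMA the Vicuna way. Both approaches are described in detail in the following projects (FastChat's description of its LoRA mode is relatively brief).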
https://github.com/tloen/alpaca-lora
https://github.com/lm-sys/fastchat
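Alpaca-LoRA has low memory requirements and can run on a 2080 Ti with about 12 GB, but a multi-turn conversation model like Vicuna needs much more GPU memory. Training Vicuna requires at least 24 GB of GPU memory [the official recommendation is 4 * V100 (32 GB)]. If you have a high-end graphics card, simply follow the official documentation. If you only have a 16 GB card but still want to reproduce Vicuna on a custom corpus, you have to find ways to keep lowering the precision: from 32-bit to half-precision 16-bit, and then from 16-bit to 8-bit.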
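
To see why the precision has to come down, a rough estimate of the memory taken by the weights alone is useful. The sketch below assumes a round 7 billion parameters (the real count is slightly lower) and ignores activations, gradients, and optimizer state, so the numbers are a lower bound, not a measurement.

```python
# Rough weight-only memory estimate for a 7B-parameter model.
# Assumption: 7e9 parameters; activations, gradients and optimizer state are ignored.
N_PARAMS = 7_000_000_000


def weight_memory_gib(n_params: int, bits_per_param: int) -> float:
    """Return the memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


for bits in (32, 16, 8):
    print(f"{bits:>2}-bit weights: {weight_memory_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone take most of a 16 GB card, which leaves little room for training and motivates loading the base model in 8-bit.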

2. Fine-tuning method

• Use the LoRA method to train only a small subset of the parameters
• Use the half-precision llama-7b-hf as the base model
• Load the base model with load_in_8bit
• Use PEFT for fine-tuning
• Use bitsandbytes for acceleration
Building on FastChat, this article modifies the LoRA training code and fine-tunes on the ShareGPT corpus with a 16 GB card, using about 13 GB of GPU memory.
The environment used:
• Operating system: CentOS or Ubuntu
• NVIDIA P100 or T4: 16 GB of GPU memory or more
• CUDA, conda
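
To give a feel for how small the trained subset is under LoRA, the sketch below counts the adapter parameters for rank 8 (the --lora_r value used in section 3.4.1). The hidden size of 4096, the 32 layers, and the choice of q_proj and v_proj as target modules are assumptions about the setup, not values stated in this article.

```python
# Count the trainable parameters LoRA adds to a LLaMA-7B-sized model.
# Assumptions (not stated in this article): hidden size 4096, 32 layers,
# and adapters on the q_proj and v_proj attention projections only.
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
TARGET_MODULES = ("q_proj", "v_proj")


def lora_param_count(r: int, d_in: int, d_out: int) -> int:
    """A rank-r adapter adds A (r x d_in) and B (d_out x r) next to a d_out x d_in weight."""
    return r * d_in + d_out * r


per_module = lora_param_count(8, HIDDEN_SIZE, HIDDEN_SIZE)
total = per_module * len(TARGET_MODULES) * NUM_LAYERS
print(f"trainable LoRA parameters: {total:,}")
print(f"share of a 7e9-parameter model: {total / 7e9:.4%}")
```

With these assumed dimensions the adapters come to roughly 4.2 million parameters, well under a tenth of a percent of the base model.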

3. Fine-tuning process

3.1 Install the dependencies

3.1.1 Download the source code

git clone https://github.com/git-cloner/llama-lora-fine-tuning
cd llama-lora-fine-tuning

3.1.2 Install the fine-tuning dependencies

3.1.2.1 Install pkg-config

wget https://pkg-config.freedesktop.org/releases/pkg-config-0.29.2.tar.gz
tar -zxvf pkg-config-0.29.2.tar.gz
cd pkg-config-0.29.2
./configure --with-internal-glib  
make -j4
make check  
sudo make install

3.1.2.2 Install libicu

wget https://mirrors.aliyun.com/blfs/conglomeration/icu/icu4c-73_1-src.tgz
tar xf icu4c-73_1-src.tgz
cd icu/source  
./configure  
make  
make check  
sudo make install
sudo ldconfig

3.1.2.3 Install the Python packages

conda create -n llama-lora python=3.10
conda activate llama-lora
pip3 install -r requirements.txt
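
After the install it can be worth confirming that the key libraries are importable from the new conda environment. This is a minimal sketch; the module names listed are an assumption about what requirements.txt pulls in, based on the tools named in section 2.

```python
import importlib.util

# Assumption: these import names match the packages pulled in by requirements.txt.
REQUIRED_MODULES = ("torch", "transformers", "peft", "bitsandbytes")


def missing_modules(names):
    """Return the names that cannot be found in the active environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    print("missing:", ", ".join(missing) if missing else "none")
```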

3.2 Prepare the LLaMA model

You can either download the original model and convert it to half precision, or download an already converted half-precision model directly from https://huggingface.co/decapoda-research/llama-7b-hf.

3.2.1 Download the LLaMA model

export GIT_TRACE=1
export GIT_CURL_VERBOSE=1
pip3 install git+https://github.com/juncongmoo/pyllama -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn
python -m llama.download --model_size 7B

3.2.2 Convert the model to Hugging Face format

CUDA_VISIBLE_DEVICES=1 python3 ./convert_llama_weights_to_hf.py --input_dir ./pyllama_data --model_size 7B --output_dir ./pyllama_data/output/7B

3.3 Prepare the corpus

3.3.1 Download the corpus

Download the 52k ShareGPT corpus: https://huggingface.co/datasets/RyokoAI/ShareGPT52K
For other corpora, see: https://github.com/Zjh-819/LLMDataHub
Download sg_90k_part1.json and sg_90k_part2.json into the data directory.

3.3.2 Merge the corpus files

python3 fastchat/data/merge.py --in ./data/sg_90k_part1.json ./data/sg_90k_part2.json ./data/dummy_cn.json ./data/dummy_en.json --out ./data/sg_90k.json
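
Conceptually, the merge step concatenates several JSON files that each contain a list of conversations. The sketch below illustrates that idea only and is not the code of merge.py; the function name merge_json_lists is hypothetical.

```python
import json


def merge_json_lists(in_paths, out_path):
    """Concatenate JSON files that each hold a list of conversations."""
    merged = []
    for path in in_paths:
        with open(path, encoding="utf-8") as f:
            merged.extend(json.load(f))
    with open(out_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-English text readable in the output file.
        json.dump(merged, f, ensure_ascii=False, indent=2)
    return len(merged)
```

Called with the two ShareGPT parts and the dummy files as inputs, it would write one combined list and return the number of conversations in it.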

3.3.3 Convert HTML to Markdown

python3 fastchat/data/clean_sharegpt.py --in ./data/sg_90k.json --out ./data/sharegpt_clean.json

3.3.4 Remove unused languages (optional)

python3 fastchat/data/optional_clean.py --in ./data/sharegpt_clean.json --out ./data/sharegpt_clean_1.json --skip-lang SOME_LANGUAGE_CODE 
The values of SOME_LANGUAGE_CODE are as follows:
en - English
es - Spanish 
fr - French
de - German
it - Italian
ja - Japanese
ko - Korean 
zh - Chinese
ar - Arabic
ru - Russian
pt - Portuguese
nl - Dutch
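
The effect of --skip-lang can be pictured as a filter over the conversation list. The sketch below is only an illustration: it assumes the ShareGPT layout with a "conversations" list of {"from", "value"} turns, and toy_detect is a deliberately crude stand-in (Hangul means "ko", anything else "en"), not the language detector the script actually uses.

```python
def skip_language(samples, lang_code, detect):
    """Drop every conversation whose detected language equals lang_code."""
    kept = []
    for sample in samples:
        text = " ".join(turn["value"] for turn in sample["conversations"])
        if detect(text) != lang_code:
            kept.append(sample)
    return kept


def toy_detect(text):
    """Stand-in detector: 'ko' if the text contains a Hangul syllable, else 'en'."""
    return "ko" if any("\uac00" <= ch <= "\ud7a3" for ch in text) else "en"
```

Any callable that maps text to one of the language codes listed above could be passed as detect.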

3.3.5 Split long conversations into shorter ones

CUDA_VISIBLE_DEVICES=1 python3 fastchat/data/split_long_conversation.py --in ./data/sharegpt_clean.json --out ./data/sharegpt_clean_split.json --model-name ./pyllama_data/output/7B
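
The idea behind this step is to cut each dialogue into chunks that fit the model's maximum length while keeping human/assistant turns together. The following is a simplified sketch of that idea, not the script's implementation: it counts whitespace-separated words instead of using the LLaMA tokenizer, and the names split_conversation and max_tokens are hypothetical.

```python
def split_conversation(turns, max_tokens, count_tokens=lambda s: len(s.split())):
    """Split a list of {"from", "value"} turns into chunks of whole turn pairs."""
    chunks, current, current_len = [], [], 0
    # Walk the dialogue two turns at a time so every chunk starts with a human turn.
    # A trailing unpaired turn is dropped.
    for i in range(0, len(turns) - 1, 2):
        pair = turns[i:i + 2]
        pair_len = sum(count_tokens(t["value"]) for t in pair)
        # Start a new chunk when adding this pair would exceed the budget.
        if current and current_len + pair_len > max_tokens:
            chunks.append(current)
            current, current_len = [], 0
        current.extend(pair)
        current_len += pair_len
    if current:
        chunks.append(current)
    return chunks
```

A single pair that is longer than max_tokens still ends up in its own chunk here; in practice such samples are truncated at training time by model_max_length.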

3.4 Fine-tuning

3.4.1 Fine-tuning command

# Disable wandb
wandb disabled
# To keep training alive if the SSH session drops, run it in the background:
# put nohup in front of deepspeed and append ">> lora.log 2>&1 &" to the last line.
# If you have multiple GPUs, set --num_gpus accordingly.
CUDA_VISIBLE_DEVICES=0,1 deepspeed --num_gpus=2 fastchat/train/train_lora.py \
  --deepspeed ./deepspeed-config.json \
  --lora_r 8 \
  --lora_alpha 16 \
  --lora_dropout 0.05 \
  --model_name_or_path ./pyllama_data/output/7B \
  --data_path ./data/sharegpt_clean_split.json \
  --fp16 True \
  --output_dir ./output \
  --num_train_epochs 1 \
  --per_device_train_batch_size 14 \
  --per_device_eval_batch_size 14 \
  --gradient_accumulation_steps 1 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 2400 \
  --save_total_limit 5 \
  --learning_rate 2e-5 \
  --weight_decay 0. \
  --warmup_ratio 0.03 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --model_max_length 512 \
  --gradient_checkpointing True
# If running in the background, tail lora.log to check the training progress
tail -f lora.log
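
The batch-related flags interact, so it can help to work out what one optimizer step and one checkpoint interval amount to. The sketch below uses the values from the command above; the corpus size of 100,000 samples is purely hypothetical, and the step count is an estimate that ignores any overrides in deepspeed-config.json.

```python
import math


def effective_batch_size(per_device_batch, num_gpus, grad_accum_steps):
    """Samples consumed by one optimizer step."""
    return per_device_batch * num_gpus * grad_accum_steps


def steps_per_epoch(num_samples, batch_size):
    """Optimizer steps needed to see every sample once."""
    return math.ceil(num_samples / batch_size)


batch = effective_batch_size(per_device_batch=14, num_gpus=2, grad_accum_steps=1)
print("effective batch size:", batch)
print("samples between checkpoints:", batch * 2400)  # --save_steps 2400

# Hypothetical corpus size, for illustration only.
print("steps per epoch for 100,000 samples:", steps_per_epoch(100_000, batch))
```

With the A100 settings in section 3.4.3 (batch size 56 on one GPU, --save_steps 1200), the number of samples between checkpoints works out the same, 67,200.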

3.4.2 Fine-tuning performance

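Fine-tuning on a P100 (16 GB) uses about 13.5 GB of GPU memory. One epoch takes about 5 days, which is still very time-consuming, and the quality of the resulting model still needs to be verified. model_max_length affects the training time: setting it to 1024 roughly halves the time compared with 2048, but it also affects inference quality.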

3.4.3 Fine-tuning on an A100

Fine-tuning on a single A100 takes about 16 hours.

deepspeed fastchat/train/train_lora.py \
    --deepspeed ./deepspeed-config.json \
    --lora_r 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --model_name_or_path ./pyllama_data/output/7B \
    --data_path ./data/sharegpt_clean_split.json \
    --fp16 True \
    --output_dir ./output \
    --num_train_epochs 1 \
    --per_device_train_batch_size 56 \
    --per_device_eval_batch_size 56 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 1024 \
    --gradient_checkpointing True

4. Test the trained model

4.1 Model file structure

The trained LoRA PEFT model consists of adapter_config.json, adapter_model.bin, and trainer_state.json. Below is the file layout of the PEFT adapter and the original LLaMA model.

model
├──llama-peft
│      adapter_config.json
│      adapter_model.bin
│      trainer_state.json
│
└──llama_7b
        config.json
        generation_config.json
        pytorch_model-00001-of-00002.bin
        pytorch_model-00002-of-00002.bin
        pytorch_model.bin.index.json
        special_tokens_map.json
        tokenizer.json
        tokenizer.model
        tokenizer_config.json
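
Before running generation, a quick check that the adapter directory contains the files listed above can save a failed start. This is a minimal sketch; the ./model/llama-peft path follows the layout shown here, and missing_files is a hypothetical helper name.

```python
from pathlib import Path

# File names taken from the listing above.
ADAPTER_FILES = ("adapter_config.json", "adapter_model.bin", "trainer_state.json")


def missing_files(directory, expected=ADAPTER_FILES):
    """Return the expected file names that are absent from directory."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    print("missing adapter files:", missing_files("./model/llama-peft") or "none")
```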

4.2 Test generation

CUDA_VISIBLE_DEVICES=0 python generate.py --base_model ./model/llama_7b --lora_weights ./model/llama-peft
