llama lora fine tuning ดาวน์โหลด - llama lora fine tuning Source Source Download

中文

ปรับแต่ง Vicuna-7b ใน GPU 16G เดียว

1. ภาพรวม

โดยทั่วไปมีสองแผนสำหรับการปรับแต่ง Facebook/Llama หนึ่งคือซีรี่ส์ Alpaca ของ Stanford และอีกชุดหนึ่งคือ Vicuna จาก ShareGpt Corpus Vicuna ใช้คอร์ปัสบทสนทนาหลายรอบและเอฟเฟกต์การฝึกอบรมดีกว่า Alpaca ซึ่งเริ่มต้นจากบทสนทนารอบเดียว ดังนั้นจึงขอแนะนำให้ปรับแต่ง Llama ตาม Vicuna วิธีการปรับจูนสองวิธีอธิบายไว้ในรายละเอียดในโครงการต่อไปนี้ (คำอธิบายของโหมด LORA ใน FastChat นั้นค่อนข้างง่าย)
https://github.com/tloen/alpaca-lora
https://github.com/lm-sys/fastchat
Alpaca-Lora มีข้อกำหนดด้านหน่วยความจำต่ำประมาณ 12G 2080TI สามารถรองรับได้ แต่การฝึกอบรมแบบจำลองหลายรอบหลายรอบเช่น Vicuna ต้องใช้หน่วยความจำ GPU สูง การฝึกอบรมแบบจำลอง Vicuna ต้องใช้หน่วยความจำ GPU อย่างน้อย 24 กรัม [คำแนะนำอย่างเป็นทางการคือ 4 * V100 (32G)] หากคุณมีการ์ดกราฟิกระดับไฮเอนด์เพียงแค่ติดตามไฟล์เพื่อฝึกอบรม หากคุณมีการ์ดกราฟิก 16G เท่านั้น แต่ต้องการปรับแต่งคลังข้อมูลเพื่อสร้างโมเดล Vicuna ซ้ำคุณต้องนึกถึงหลายวิธีในการลดความแม่นยำอย่างต่อเนื่องจาก 32 บิตถึงครึ่งความแม่นยำ 16 บิตจาก 16 บิตเป็น 8 บิตและเร่งวิธีการฝึกอบรมเพื่อให้บรรลุเป้าหมาย

2. วิธีการปรับแต่ง

•ใช้วิธี LORA เพื่อฝึกเฉพาะส่วนหนึ่งของพารามิเตอร์
•โมเดลพื้นฐานใช้ความแม่นยำครึ่ง LLAMA-7B-HF
•ใช้ load_in_8bit เพื่อโหลดโมเดลพื้นฐาน
•ใช้เทคโนโลยี PEFT เพื่อปรับแต่ง
•ใช้ Bitsandbytes เพื่อเร่งความเร็ว
จากนั้นเราจาก FastChat บทความนี้จะปรับเปลี่ยนรหัสการฝึกอบรม LORA ใช้ ShareGpt Corpus และการปรับแต่งบนการ์ด 16G ซึ่งครอบครองหน่วยความจำ GPU ประมาณ 13 กรัม
•ระบบปฏิบัติการ: Centos หรือ Ubuntu
• NVIDA P100 หรือ T4: หน่วยความจำ GPU 16G หรือสูงกว่า
• Cuda, Conda

3. กระบวนการปรับจูน

3.1 การติดตั้งสภาพแวดล้อมการพึ่งพา

3.1.1 ดาวน์โหลดซอร์สโค้ด

git clone https://github.com/git-cloner/llama-lora-fine-tuning
cd llama-lora-fine-tuning

3.1.2 ติดตั้งสภาพแวดล้อมการพึ่งพาการปรับแต่งอย่างละเอียด

3.1.2.1 ติดตั้ง pkg-config

wget https://pkg-config.freedesktop.org/releases/pkg-config-0.29.2.tar.gz
tar -zxvf pkg-config-0.29.2.tar.gz
cd pkg-config-0.29.2
./configure --with-internal-glib  
make -j4
make check  
sudo make install

3.1.2.2 ติดตั้ง libicu

wget https://mirrors.aliyun.com/blfs/conglomeration/icu/icu4c-73_1-src.tgz
tar xf icu4c-73_1-src.tgz
cd icu/source  
./configure  
make  
make check  
sudo make install
sudo ldconfig

3.1.2.3 ติดตั้งแพ็คเกจ

conda create -n llama-lora python=3.10
conda activate llama-lora
pip3 install -r requirements.txt

3.2 เตรียมโมเดล Llama

คุณสามารถดาวน์โหลดโมเดลดั้งเดิมและแปลงเป็นครึ่งความแม่นยำหรือดาวน์โหลดโมเดลความแม่นยำครึ่งที่แปลงได้โดยตรงจาก https://huggingface.co/decapoda-research/llama-7b-hf

3.2.1 ดาวน์โหลดรุ่น llama

 export GIT_TRACE=1
export GIT_CURL_VERBOSE=1
pip3 install git+https://github.com/juncongmoo/pyllama -i https://pypi.mirrors.ustc.edu.cn/simple --trusted-host=pypi.mirrors.ustc.edu.cn
python -m llama.download --model_size 7B

3.2.2 แปลงโมเดลเป็นรูปแบบ HuggingFace

CUDA_VISIBLE_DEVICES=1 python3 ./convert_llama_weights_to_hf.py --input_dir ./pyllama_data --model_size 7B --output_dir ./pyllama_data/output/7B

3.3 จัดระเบียบคลังข้อมูล

ดาวน์โหลด 3.3.1 Corpus

Download 52k ShareGPT: https: // huggingface.co/datasets/RyokoAI/ShareGPT52K
Other corpora refer to: https: // github.com/Zjh-819/LLMDataHub
Download sg_90k_part1.json and sg_90k_part2.json into the data directory

3.3.2 ไฟล์คลังข้อมูลผสาน

python3 fastchat/data/merge.py --in ./data/sg_90k_part1.json ./data/sg_90k_part2.json ./data/dummy_cn.json ./data/dummy_en.json --out ./data/sg_90k.json

3.3.3 html ถึง markdown

python3 fastchat/data/clean_sharegpt.py --in ./data/sg_90k.json --out ./data/sharegpt_clean.json

3.3.4 ลบภาษาที่ไม่ได้ใช้ (เป็นทางเลือก)

python3 fastchat/data/optional_clean.py --in ./data/sharegpt_clean.json --out ./data/sharegpt_clean_1.json --skip-lang SOME_LANGUAGE_CODE 
The values of SOME_LANGUAGE_CODE are as follows:
en - English
es - Spanish 
fr - French
de - German
it - Italian
ja - Japanese
ko - Korean 
zh - Chinese
ar - Arabic
ru - Russian
pt - Portuguese
nl - Dutch

3.3.5 บทสนทนายาว ๆ เป็นบทสนทนาสั้น ๆ

CUDA_VISIBLE_DEVICES=1 python3 fastchat/data/split_long_conversation.py --in ./data/sharegpt_clean.json --out ./data/sharegpt_clean_split.json --model-name ./pyllama_data/output/7B

3.4 การปรับแต่ง

3.4.1 คำสั่งปรับแต่ง

 # Disable wandb 
wandb disabled 
# In order to prevent the SSH terminal from disconnecting and stopping the training, the training can run in the background (remove the # in three places to run in the background)
# If you have multiple GPUs,using --num_gpus parameter
CUDA_VISIBLE_DEVICES=0,1  # nohup  
deepspeed --num_gpus=2 fastchat/train/train_lora.py  
  --deepspeed ./deepspeed-config.json  
  --lora_r 8  
  --lora_alpha 16  
  --lora_dropout 0.05  
  --model_name_or_path ./pyllama_data/output/7B  
  --data_path ./data/sharegpt_clean_split.json  
  --fp16 True  
  --output_dir ./output  
  --num_train_epochs 1  
  --per_device_train_batch_size 14  
  --per_device_eval_batch_size 14  
  --gradient_accumulation_steps 1  
  --evaluation_strategy " no "  
  --save_strategy " steps "  
  --save_steps 2400  
  --save_total_limit 5  
  --learning_rate 2e-5  
  --weight_decay 0.  
  --warmup_ratio 0.03  
  --lr_scheduler_type " cosine "  
  --logging_steps 1  
  --model_max_length 512  
  --gradient_checkpointing True # >> lora.log 2>&1 &
# If running in the background, tail lora.log to check the training progress 
tail -f lora.log

3.4.2 ประสิทธิภาพการปรับจูน

การปรับแต่งที่ P100 (16G) มีหน่วยความจำ 13.5 กรัม ในกรณีของการฝึกอบรมรอบหนึ่งมันใช้เวลา 120 ชั่วโมงประมาณ 5 วันซึ่งยังคงใช้เวลานานมาก จะต้องมีการตรวจสอบผลกระทบของแบบจำลองที่เกิดขึ้น model_max_length จะส่งผลต่อเวลาการฝึกอบรม หากตั้งค่าเป็น 1024 เวลาจะลดลงครึ่งหนึ่งเมื่อเทียบกับ 2048 แต่จะส่งผลกระทบต่อผลการอนุมาน

3.4.3 การปรับแต่งที่ A100

ปรับแต่ง A100 เดี่ยวและใช้เวลาประมาณ 16 ชั่วโมง

deepspeed fastchat/train/train_lora.py 
    --deepspeed ./deepspeed-config.json 
    --lora_r 8 
    --lora_alpha 16 
    --lora_dropout 0.05 
    --model_name_or_path ./pyllama_data/output/7B 
    --data_path ./data/sharegpt_clean_split.json 
    --fp16 True 
    --output_dir ./output 
    --num_train_epochs 1 
    --per_device_train_batch_size 56 
    --per_device_eval_batch_size 56 
    --gradient_accumulation_steps 1
    --evaluation_strategy " no " 
    --save_strategy " steps " 
    --save_steps 1200 
    --save_total_limit 5 
    --learning_rate 2e-5 
    --weight_decay 0. 
    --warmup_ratio 0.03 
    --lr_scheduler_type " cosine " 
    --logging_steps 1 
    --model_max_length 1024 
    --gradient_checkpointing True

4、 ทดสอบแบบจำลองที่ผ่านการฝึกอบรม

4.1 โครงสร้างไฟล์รุ่น

โมเดล Lora Peft ที่ผ่านการฝึกอบรมประกอบด้วย adapter_config.json, adapter_model.bin และ trainer_state.json ด้านล่างนี้เป็นโครงสร้างไฟล์ของ PEFT และรุ่น Llama ดั้งเดิม

model
───llama-peft
│      adapter_config.json
│      adapter_model.bin
│      trainer_state.json
│
└──llama_7b
        config.json
        generation_config.json
        pytorch_model-00001-of-00002.bin
        pytorch_model-00002-of-00002.bin
        pytorch_model.bin.index.json
        special_tokens_map.json
        tokenizer.json
        tokenizer.model
        tokenizer_config.json

4.2 การทดสอบสร้าง

CUDA_VISIBLE_DEVICES=0  python generate.py  --base_model ./model/llama-7b --lora_weights ./model/llama-peft

ขยาย