Chinese LLaMA Alpaca Usage

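This project provides usage instructions based on Chinese-LLaMA-Alpaca V3.1. Chinese-LLaMA-Alpaca pioneered Chinese-language extensions of LLaMA: it extends the original LLaMA vocabulary with Chinese tokens and runs secondary pre-training on Chinese data, further improving the model's basic Chinese semantic understanding.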

Project structure:

.
├── README.md # Usage instructions
├── SHA256.md # SHA256 values for checking the LLaMA model files
├── notebooks
│   ├── convert_and_quantize_chinese_alpaca_plus.ipynb
│   └── convert_and_quantize_chinese_llama.ipynb
├── requirements.txt # Dependencies
└── scripts
    ├── chinese_sp.model # Chinese vocabulary file
    ├── crawl_prompt.py # 1. Generate fine-tuning data with OpenAI models (e.g., ChatGPT, GPT-4)
    ├── inference_hf.py # 5. Run inference with the original LLaMA model plus the LoRA model produced by fine-tuning
    ├── merge_llama_with_chinese_lora.py # 4. Merge model weights
    ├── merge_tokenizers.py # 2. Extend the vocabulary
    └── run_clm_pt_with_peft.py # 3. Train or fine-tune the model

1. Prepare data

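Whether you plan to pre-train or fine-tune, you need data. There are two ways to prepare it:

(Public) If you can use publicly available, standard datasets suitable for fine-tuning or training, you can skip this step;
(Generated) If you don't have suitable fine-tuning or training data, you can generate it with scripts/crawl_prompt.py. The basic idea is to have ChatGPT or another capable OpenAI model produce the data.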
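Whichever route you take, instruction data eventually has to be stored in the Stanford Alpaca format described in section 4.3 (a JSON list of instruction/input/output records). The sketch below is a hypothetical helper, `save_alpaca_json`, which is not part of the project; it assumes the records have already been produced (for example by scripts/crawl_prompt.py, whose actual output format may differ) and only normalizes and writes them. `ensure_ascii=False` keeps Chinese characters readable in the file.

```python
import json
from pathlib import Path
from typing import Iterable, List, Mapping

REQUIRED_KEYS = ("instruction", "input", "output")


def save_alpaca_json(records: Iterable[Mapping[str, str]], path: str) -> int:
    """Write records in Stanford Alpaca format and return how many were written.

    Hypothetical helper: every record needs a non-empty 'instruction' and
    'output'; a missing 'input' is stored as an empty string.
    """
    cleaned: List[dict] = []
    for i, rec in enumerate(records):
        if not rec.get("instruction") or not rec.get("output"):
            raise ValueError(f"record {i} lacks 'instruction' or 'output'")
        cleaned.append({key: str(rec.get(key, "")) for key in REQUIRED_KEYS})
    # ensure_ascii=False writes Chinese as-is instead of \uXXXX escapes.
    text = json.dumps(cleaned, ensure_ascii=False, indent=2)
    Path(path).write_text(text, encoding="utf-8")
    return len(cleaned)
```

For example, `save_alpaca_json(records, "data/sft_zh.json")` returns the number of records written; validation happens before anything is written, so a bad record leaves no partial file.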

2. Prepare LLaMA weights

# Create the target directories first: wget -O does not create them
mkdir -p 7B 13B 30B 65B
# tokenizer
wget https://agi.gpt4.org/llama/LLaMA/tokenizer.model -O ./tokenizer.model
wget https://agi.gpt4.org/llama/LLaMA/tokenizer_checklist.chk -O ./tokenizer_checklist.chk
# 7B
wget https://agi.gpt4.org/llama/LLaMA/7B/consolidated.00.pth -O ./7B/consolidated.00.pth
wget https://agi.gpt4.org/llama/LLaMA/7B/params.json -O ./7B/params.json
wget https://agi.gpt4.org/llama/LLaMA/7B/checklist.chk -O ./7B/checklist.chk
# 13B
wget https://agi.gpt4.org/llama/LLaMA/13B/consolidated.00.pth -O ./13B/consolidated.00.pth
wget https://agi.gpt4.org/llama/LLaMA/13B/consolidated.01.pth -O ./13B/consolidated.01.pth
wget https://agi.gpt4.org/llama/LLaMA/13B/params.json -O ./13B/params.json
wget https://agi.gpt4.org/llama/LLaMA/13B/checklist.chk -O ./13B/checklist.chk
# 30B
wget https://agi.gpt4.org/llama/LLaMA/30B/consolidated.00.pth -O ./30B/consolidated.00.pth
wget https://agi.gpt4.org/llama/LLaMA/30B/consolidated.01.pth -O ./30B/consolidated.01.pth
wget https://agi.gpt4.org/llama/LLaMA/30B/consolidated.02.pth -O ./30B/consolidated.02.pth
wget https://agi.gpt4.org/llama/LLaMA/30B/consolidated.03.pth -O ./30B/consolidated.03.pth
wget https://agi.gpt4.org/llama/LLaMA/30B/params.json -O ./30B/params.json
wget https://agi.gpt4.org/llama/LLaMA/30B/checklist.chk -O ./30B/checklist.chk
# 65B
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.00.pth -O ./65B/consolidated.00.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.01.pth -O ./65B/consolidated.01.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.02.pth -O ./65B/consolidated.02.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.03.pth -O ./65B/consolidated.03.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.04.pth -O ./65B/consolidated.04.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.05.pth -O ./65B/consolidated.05.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.06.pth -O ./65B/consolidated.06.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/consolidated.07.pth -O ./65B/consolidated.07.pth
wget https://agi.gpt4.org/llama/LLaMA/65B/params.json -O ./65B/params.json
wget https://agi.gpt4.org/llama/LLaMA/65B/checklist.chk -O ./65B/checklist.chk

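Download the LLaMA weights for the model size you need. Larger models have larger weight files and generally better accuracy, but they also take longer to fine-tune and train. For most people, 7B or 13B is the practical choice.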
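As a back-of-the-envelope check on the size trade-off: the original checkpoints store each parameter as a 16-bit (2-byte) value, so the download is roughly parameter count × 2 bytes. The sketch below treats the nominal names 7B/13B/30B/65B as exact parameter counts, which is an approximation (real counts differ slightly), and it says nothing about the extra memory training needs.

```python
# Nominal sizes used as parameter counts (approximation, not exact LLaMA figures).
NOMINAL_PARAMS = {"7B": 7e9, "13B": 13e9, "30B": 30e9, "65B": 65e9}


def fp16_weight_size_gb(model_size: str) -> float:
    """Approximate checkpoint size in GB (10**9 bytes), assuming 2 bytes per parameter."""
    return NOMINAL_PARAMS[model_size] * 2 / 1e9


for size in NOMINAL_PARAMS:
    print(f"{size}: ~{fp16_weight_size_gb(size):.0f} GB")
```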

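Be sure to verify the integrity of the LLaMA base model by checking that its SHA256 values match those listed in SHA256.md; otherwise the merge step cannot be carried out.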
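A minimal way to compute the SHA256 of a multi-gigabyte .pth file without loading it into memory is to stream it through `hashlib` in chunks. The exact layout of SHA256.md is not reproduced here, so the sketch assumes you copy the expected value by hand; `sha256_of` and `verify` are illustrative names, not project functions.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read() returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected_hex: str) -> bool:
    """Compare against a value copied from SHA256.md (whitespace and case ignored)."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `verify("7B/consolidated.00.pth", "<value from SHA256.md>")`. On Linux, `sha256sum` computes the same digest from the command line.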

3. Convert weights to HF format

# Install dependencies
pip install git+https://github.com/huggingface/transformers

# Convert to HF weights
python -m transformers.models.llama.convert_llama_weights_to_hf \
   --input_dir llama-weights \
   --model_size 7B \
   --output_dir llama-hf-weights

# Example: convert the 7B model downloaded into the current directory in step 2
python -m transformers.models.llama.convert_llama_weights_to_hf --input_dir ./ --model_size 7B --output_dir ./output/7B-hf
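An incomplete download usually only shows up partway through a long conversion. The hypothetical helper below lists missing files first. It assumes the layout produced by the download commands in step 2 (tokenizer.model at the root of --input_dir, shards and params.json in a sub-directory named after --model_size); that layout is taken from those commands, not from the converter's documentation.

```python
from pathlib import Path
from typing import List

# Shard counts taken from the download commands in step 2.
SHARDS = {"7B": 1, "13B": 2, "30B": 4, "65B": 8}


def missing_llama_files(input_dir: str, model_size: str) -> List[str]:
    """Return expected files absent under input_dir (an empty list means complete)."""
    root = Path(input_dir)
    expected = [root / "tokenizer.model", root / model_size / "params.json"]
    expected += [
        root / model_size / f"consolidated.{i:02d}.pth" for i in range(SHARDS[model_size])
    ]
    return [str(p.relative_to(root)) for p in expected if not p.is_file()]
```

For example, `missing_llama_files("./", "13B")` returns `[]` once both 13B shards, params.json and tokenizer.model are present.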

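If you don't want to convert the weights yourself, you can use LLaMA-HF weights that others have already converted: pinkmanlove provides converted LLaMA-HF weights on HuggingFace. If that link stops working, search HuggingFace Models for other converted copies.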

4. Train and fine-tune the model

The full training and fine-tuning process has three steps:

Vocabulary extension;
Pre-training (optional);
Instruction fine-tuning;

4.1 Vocabulary extension

python scripts/merge_tokenizers.py \
  --llama_tokenizer_dir llama_tokenizer_dir \
  --chinese_sp_model_file chinese_sp_model_file

# Example: use the tokenizer of the 7B model converted in step 3
python scripts/merge_tokenizers.py --llama_tokenizer_dir output/7B-hf --chinese_sp_model_file scripts/chinese_sp.model

Parameters:

--llama_tokenizer_dir: the directory containing the original LLaMA tokenizer;
--chinese_sp_model_file: the Chinese vocabulary file trained with sentencepiece (chinese_sp.model);

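Note
There are two main ways to extend a vocabulary: (1) merge a new vocabulary into the existing one; (2) start from a large vocabulary and delete the tokens you don't need.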
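Approach (1) can be illustrated with plain Python lists. This is a simplified sketch, not the code of scripts/merge_tokenizers.py (which works on sentencepiece model files); it assumes the merge appends Chinese pieces that LLaMA lacks after the original vocabulary, so every original LLaMA token keeps its ID and the existing embedding rows stay valid.

```python
from typing import Dict, List


def merge_vocab(llama_pieces: List[str], chinese_pieces: List[str]) -> List[str]:
    """Append pieces from the Chinese vocab that LLaMA lacks, preserving LLaMA's order and IDs."""
    seen = set(llama_pieces)
    merged = list(llama_pieces)
    for piece in chinese_pieces:
        if piece not in seen:  # skip pieces LLaMA (or an earlier Chinese piece) already covers
            merged.append(piece)
            seen.add(piece)
    return merged


def piece_to_id(pieces: List[str]) -> Dict[str, int]:
    """Token ID = position in the merged vocabulary."""
    return {p: i for i, p in enumerate(pieces)}


llama = ["<unk>", "<s>", "</s>", "▁the", "中"]
chinese = ["中", "国", "▁中国", "人"]
merged = merge_vocab(llama, chinese)
print(merged)  # ['<unk>', '<s>', '</s>', '▁the', '中', '国', '▁中国', '人']
```

Because new pieces only ever go at the end, the embedding matrix just needs extra rows for them; this is why the model is later resized to the merged vocabulary size.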

4.2 Pre-training (optional)

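In the pre-training stage, the original LLaMA weights are further pre-trained on general Chinese corpora. This is done in two phases:

Phase 1: freeze the transformer parameters and train only the embeddings, adapting the newly added Chinese token vectors while disturbing the original model as little as possible;
Phase 2: add LoRA weights (adapters) to the model and update the LoRA parameters while continuing to train the embeddings;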
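The difference between the two phases comes down to which parameters receive gradients. The sketch below classifies parameter names, assuming Hugging Face LLaMA naming (`model.embed_tokens.weight`, `model.layers.N.self_attn.q_proj.weight`, `lm_head.weight`); it only decides trainability and does not build LoRA adapters. The phase 2 lists mirror the `lora_trainable` and `modules_to_save` settings in the training script of this section.

```python
from typing import Dict, List

LORA_TARGETS = "q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj".split(",")
MODULES_TO_SAVE = "embed_tokens,lm_head".split(",")


def module_of(param_name: str) -> str:
    """'model.layers.0.self_attn.q_proj.weight' -> 'q_proj'."""
    parts = param_name.split(".")
    return parts[-2] if len(parts) >= 2 else parts[0]


def plan(param_names: List[str], phase: int) -> Dict[str, str]:
    """Map each base parameter to 'full', 'lora' or 'frozen' for the given phase."""
    result = {}
    for name in param_names:
        module = module_of(name)
        if phase == 1:
            # Phase 1: only the (extended) embedding matrix is trained.
            result[name] = "full" if module == "embed_tokens" else "frozen"
        elif phase == 2:
            if module in MODULES_TO_SAVE:
                result[name] = "full"    # trained directly, saved whole
            elif module in LORA_TARGETS:
                result[name] = "lora"    # base weight frozen, low-rank adapter trained
            else:
                result[name] = "frozen"  # e.g. RMSNorm weights
        else:
            raise ValueError("phase must be 1 or 2")
    return result
```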

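Phase 1 converges slowly, so unless you have ample time and compute, it is recommended to skip it. Phase 2 training on a single machine with a single GPU looks like this: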

######## Parameter settings ########
lr=2e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=path/to/hf/llama/dir
chinese_tokenizer_path=path/to/chinese/llama/tokenizer/dir
dataset_dir=path/to/pt/data/dir
data_cache=temp_data_cache_dir
per_device_train_batch_size=1
per_device_eval_batch_size=1
training_steps=100
gradient_accumulation_steps=1
output_dir=output_dir

deepspeed_config_file=ds_zero2_no_offload.json

######## Launch command ########
torchrun --nnodes 1 --nproc_per_node 1 run_clm_pt_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --seed $RANDOM \
    --fp16 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False
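To see what lora_rank=8 and lora_alpha=32 mean in practice: LoRA learns the update of a frozen d_out × d_in weight as a product B·A (shapes d_out × r and r × d_in) scaled by alpha / r, so each adapted matrix adds r·(d_in + d_out) trainable values. The sketch applies this to the seven target modules, assuming LLaMA-7B shapes (hidden size 4096, MLP size 11008, 32 layers); these shapes are an assumption stated here, not taken from this guide. The embed_tokens and lm_head matrices in modules_to_save are trained in full and are not counted.

```python
LORA_RANK, LORA_ALPHA = 8, 32                    # values from the script above
HIDDEN, INTERMEDIATE, LAYERS = 4096, 11008, 32   # assumed LLaMA-7B shapes

# (d_in, d_out) of each LoRA target in one decoder layer
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A is r x d_in and B is d_out x r, so together they hold r * (d_in + d_out) values."""
    return r * (d_in + d_out)


scaling = LORA_ALPHA / LORA_RANK
per_layer = sum(lora_params(d_in, d_out, LORA_RANK) for d_in, d_out in TARGET_SHAPES.values())
total = per_layer * LAYERS
print(scaling, per_layer, total)  # 4.0 624640 19988480
```

Under these assumptions the adapters add about 20M trainable values, a small fraction of the roughly 7B frozen ones.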

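Parameters:

--model_name_or_path: the directory of the original LLaMA model in HF format;
--tokenizer_name_or_path: the directory of the Chinese-LLaMA tokenizer (the output of merge_tokenizers.py);
--dataset_dir: the directory of pre-training data; it may contain multiple plain-text files ending in .txt;
--data_cache_dir: a directory for storing data cache files;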
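How --block_size is applied is not spelled out here. Assuming run_clm_pt_with_peft.py follows the grouping approach of Hugging Face's run_clm.py example (an assumption; check the script), tokenized .txt files are concatenated and cut into fixed blocks of block_size tokens, and a trailing remainder shorter than one block is dropped. A minimal pure-Python version of that idea:

```python
from itertools import chain
from typing import Dict, List


def group_texts(examples: Dict[str, List[List[int]]], block_size: int) -> Dict[str, List[List[int]]]:
    """Concatenate token-id lists and split them into block_size chunks.

    The tail that does not fill a whole block is dropped. Labels are a copy of
    input_ids because causal-LM models shift them internally.
    """
    concatenated = list(chain.from_iterable(examples["input_ids"]))
    total = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i : i + block_size] for i in range(0, total, block_size)]
    return {"input_ids": blocks, "labels": [list(b) for b in blocks]}


out = group_texts({"input_ids": [[1, 2, 3], [4, 5], [6, 7, 8, 9]]}, block_size=4)
print(out["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```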

Multi-node, multi-GPU:

torchrun \
  --nnodes ${num_nodes} \
  --nproc_per_node ${num_gpu_per_node} \
  --node_rank ${node_rank} \
  --master_addr ${master_addr} \
  --master_port ${master_port} \
  run_clm_pt_with_peft.py \
    ...
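With more nodes and GPUs, each optimizer step consumes more data. Under standard data parallelism, the global batch is per_device_train_batch_size × gradient_accumulation_steps × GPUs per node × nodes, and multiplying by block_size gives tokens per step. A small calculator with hypothetical helper names:

```python
def global_batch(per_device: int, grad_accum: int, gpus_per_node: int, nodes: int) -> int:
    """Sequences consumed per optimizer step under data parallelism."""
    return per_device * grad_accum * gpus_per_node * nodes


def tokens_per_step(per_device: int, grad_accum: int, gpus_per_node: int,
                    nodes: int, block_size: int) -> int:
    """Tokens per optimizer step when every sequence is a full block_size block."""
    return global_batch(per_device, grad_accum, gpus_per_node, nodes) * block_size


# Single-GPU settings from the script above: 1 x 1 x 1 x 1 sequences of 512 tokens.
print(tokens_per_step(1, 1, 1, 1, 512))  # 512
# Hypothetical cluster: 2 nodes x 8 GPUs, batch 1, gradient accumulation 8.
print(global_batch(1, 8, 8, 2))          # 128
```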

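The Chinese LLaMA models extend the original vocabulary with Chinese tokens and are further pre-trained on general-domain Chinese plain text. The authors offer two ways to download these pre-trained weights, so you don't have to spend resources training them yourself: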

(1) Google Drive or Baidu Netdisk

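Model	Training data	Base model for reconstruction	Size	LoRA download
Chinese-LLaMA-7B	General, 20 GB	Original LLaMA-7B	770 MB	[Baidu Netdisk] [Google Drive]
Chinese-LLaMA-Plus-7B	General, 120 GB	Original LLaMA-7B	790 MB	[Baidu Netdisk] [Google Drive]
Chinese-LLaMA-13B	General, 20 GB	Original LLaMA-13B	1 GB	[Baidu Netdisk] [Google Drive]
Chinese-LLaMA-Plus-13B	General, 120 GB	Original LLaMA-13B	1 GB	[Baidu Netdisk] [Google Drive]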

(2) All of the models above can also be downloaded from the Hugging Face Model Hub and loaded with transformers and PEFT. The model ID below is the name passed to .from_pretrained().

Model	Model ID	Link
Chinese-LLaMA-7B	ziqingyang/chinese-llama-lora-7b	Model Hub Link
Chinese-LLaMA-Plus-7B	ziqingyang/chinese-llama-plus-lora-7b	Model Hub Link
Chinese-LLaMA-13B	ziqingyang/chinese-llama-lora-13b	Model Hub Link
Chinese-LLaMA-Plus-13B	ziqingyang/chinese-llama-plus-lora-13b	Model Hub Link
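A minimal loading sketch with transformers and PEFT: load the original HF-format LLaMA, take the tokenizer from the LoRA repo, resize the embedding matrix to the extended vocabulary, then attach the LoRA. It assumes `torch`, `transformers`, `peft` and `sentencepiece` are installed and that the LoRA repo ships its tokenizer files; `base_dir` is a placeholder for your converted weights and `load_chinese_llama` is a hypothetical name. Imports sit inside the function so the module can be imported without these packages.

```python
def load_chinese_llama(base_dir: str, lora_id: str = "ziqingyang/chinese-llama-lora-7b"):
    """Return (tokenizer, model) with a Chinese-LLaMA LoRA applied on top of LLaMA."""
    import torch
    from peft import PeftModel
    from transformers import LlamaForCausalLM, LlamaTokenizer

    # float16 is meant for GPUs; fall back to float32 on CPU.
    dtype = torch.float16 if torch.cuda.is_available() else torch.float32

    # The extended Chinese vocabulary is assumed to ship with the LoRA, not with base LLaMA.
    tokenizer = LlamaTokenizer.from_pretrained(lora_id)
    base = LlamaForCausalLM.from_pretrained(base_dir, torch_dtype=dtype)

    # Base LLaMA has 32000 embedding rows; grow them to the merged vocabulary size.
    if base.get_input_embeddings().weight.size(0) != len(tokenizer):
        base.resize_token_embeddings(len(tokenizer))

    model = PeftModel.from_pretrained(base, lora_id)
    model.eval()
    return tokenizer, model
```

Usage: `tokenizer, model = load_chinese_llama("path/to/llama-hf-weights")`.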

4.3 Instruction fine-tuning

Training again uses LoRA for efficient fine-tuning, and further increases the number of trainable parameters.

Single machine, single GPU:

######## Parameters ########
lr=1e-4
lora_rank=8
lora_alpha=32
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05

pretrained_model=path/to/hf/llama/or/merged/llama/dir/or/model_id
chinese_tokenizer_path=path/to/chinese/llama/tokenizer/dir
dataset_dir=path/to/sft/data/dir
per_device_train_batch_size=1
per_device_eval_batch_size=1
training_steps=100
gradient_accumulation_steps=1
output_dir=output_dir
peft_model=path/to/peft/model/dir
validation_file=validation_file_name

deepspeed_config_file=ds_zero2_no_offload.json

######## Launch command ########
torchrun --nnodes 1 --nproc_per_node 1 run_clm_sft_with_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_name_or_path ${pretrained_model} \
    --tokenizer_name_or_path ${chinese_tokenizer_path} \
    --dataset_dir ${dataset_dir} \
    --validation_split_percentage 0.001 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed $RANDOM \
    --fp16 \
    --max_steps ${training_steps} \
    --lr_scheduler_type cosine \
    --learning_rate ${lr} \
    --warmup_ratio 0.03 \
    --weight_decay 0 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 3 \
    --evaluation_strategy steps \
    --eval_steps 250 \
    --save_steps 500 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --max_seq_length 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --modules_to_save ${modules_to_save} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype float16 \
    --validation_file ${validation_file} \
    --peft_path ${peft_model} \
    --gradient_checkpointing \
    --ddp_find_unused_parameters False

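Parameters:

--tokenizer_name_or_path: the directory of the Chinese-Alpaca tokenizer (the output of merge_tokenizers.py);
--dataset_dir: the directory of instruction fine-tuning data, containing one or more Stanford Alpaca-format files ending in .json;
--validation_file: a single instruction fine-tuning file used as the validation set, ending in .json and also in Stanford Alpaca format;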

The Stanford Alpaca format is:

[
  { "instruction" : ...,
   "input" : ...,
   "output" : ... },
  ...
]

This data can also be produced with the generation approach described in 1. Prepare data.
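During instruction fine-tuning (and with --with_prompt at inference time) each record is wrapped in a prompt template. The template below is the Stanford Alpaca no-input template and is an assumption here; check run_clm_sft_with_peft.py and inference_hf.py for the exact string the project uses. Appending a non-empty input to the instruction on a new line is likewise an assumption. The sketch validates one record and returns the (prompt, target) pair.

```python
from typing import Mapping, Tuple

# Stanford Alpaca's no-input template; the project's exact string may differ slightly.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:"
)


def build_example(record: Mapping[str, str]) -> Tuple[str, str]:
    """Validate one Alpaca-format record and return (prompt, target)."""
    missing = [key for key in ("instruction", "output") if key not in record]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    instruction = record["instruction"]
    extra = record.get("input", "")
    if extra:
        # Assumption: a non-empty input is appended to the instruction on a new line.
        instruction = f"{instruction}\n{extra}"
    return PROMPT_TEMPLATE.format(instruction=instruction), record["output"]
```

The model is trained to continue the prompt with the target; at inference only the prompt is supplied.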

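Configuration notes:

To continue training the LoRA weights of a Chinese-Alpaca model:
- --model_name_or_path: the original LLaMA model in HF format (when continuing a non-Plus Alpaca model), or the Chinese-LLaMA model merged with Chinese-LLaMA-Plus-LoRA (when continuing a Plus model);
- --peft_path: the directory of the Chinese-Alpaca LoRA weights;

There is no need to specify --lora_rank, --lora_alpha, --lora_dropout, --trainable, or --modules_to_save.

To train brand-new instruction fine-tuning LoRA weights on top of Chinese-LLaMA:
- --model_name_or_path: the HF-format Chinese-LLaMA model merged with the corresponding Chinese-LLaMA-LoRA (whether or not it is a Plus model);
- --peft_path: do not provide this parameter, and remove --peft_path from the script;

You must specify --lora_rank, --lora_alpha, --lora_dropout, --trainable, and --modules_to_save.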
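The two configurations differ only in which flags reach run_clm_sft_with_peft.py. The hypothetical helper below (not part of the project) builds the LoRA-related arguments: continuing an existing Chinese-Alpaca LoRA passes --peft_path and omits the hyperparameters (presumably because they are read from the LoRA's own adapter config), while a new LoRA omits --peft_path and requires all five hyperparameters.

```python
from typing import List, Mapping, Optional

LORA_FLAGS = ("lora_rank", "lora_alpha", "lora_dropout", "trainable", "modules_to_save")


def sft_lora_args(peft_path: Optional[str] = None,
                  lora: Optional[Mapping[str, object]] = None) -> List[str]:
    """Return the LoRA-related command-line flags for one of the two configurations."""
    if peft_path:
        # Continue training an existing LoRA: hyperparameters must not be passed.
        if lora:
            raise ValueError("do not pass LoRA hyperparameters together with peft_path")
        return ["--peft_path", peft_path]
    # Train a brand-new LoRA: every hyperparameter must be given explicitly.
    lora = lora or {}
    missing = [flag for flag in LORA_FLAGS if flag not in lora]
    if missing:
        raise ValueError(f"missing LoRA settings: {missing}")
    args: List[str] = []
    for flag in LORA_FLAGS:
        args += [f"--{flag}", str(lora[flag])]
    return args
```

For a new LoRA, passing the values from the script above (rank 8, alpha 32, dropout 0.05 and the two module lists) yields ten arguments and no --peft_path.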

Multi-node, multi-GPU:

torchrun \
  --nnodes ${num_nodes} \
  --nproc_per_node ${num_gpu_per_node} \
  --node_rank ${node_rank} \
  --master_addr ${master_addr} \
  --master_port ${master_port} \
  run_clm_sft_with_peft.py \
    ...

5. Merge weights

5.1 Merging a single LoRA

Applies to Chinese-LLaMA, Chinese-LLaMA-Plus, and Chinese-Alpaca.

python scripts/merge_llama_with_chinese_lora.py \
    --base_model path_to_original_llama_hf_dir \
    --lora_model path_to_chinese_llama_or_alpaca_lora \
    --output_type [pth|huggingface] \
    --output_dir path_to_output_dir

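Parameters:

--base_model: the directory containing the HF-format LLaMA weights and config files;
--lora_model: the directory of the unpacked Chinese LLaMA/Alpaca LoRA files; a Hugging Face Model Hub model ID also works;
--output_type: the output format, either pth or huggingface; defaults to pth if not specified;
--output_dir: the directory in which to save the full merged model weights; defaults to ./;
(Optional) --offload_dir: users with limited memory need to specify an offload cache path;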
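Merging folds each low-rank update into its base weight, W_merged = W + (lora_alpha / r) · B·A, so the merged model no longer needs PEFT at inference time. This is the standard LoRA merge formula; how the script handles the fully trained embed_tokens and lm_head is not shown. A toy pure-Python example with a 2×2 weight and rank 1:

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an m x k and a k x n matrix."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def merge_lora(w: Matrix, b: Matrix, a: Matrix, alpha: float, r: int) -> Matrix:
    """W (d_out x d_in) + (alpha / r) * B (d_out x r) @ A (r x d_in)."""
    delta = matmul(b, a)
    scale = alpha / r
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))] for i in range(len(w))]


W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d_out x r, with r = 1
A = [[0.5, -0.5]]    # r x d_in
print(merge_lora(W, B, A, alpha=2, r=1))  # [[2.0, -1.0], [2.0, -1.0]]
```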

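More on output_type:

.pth files can be used with llama.cpp for quantization and deployment;
.bin files (the huggingface output) can be used with Transformers for inference and with text-generation-webui to build a UI;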

To merge a single LoRA and quantize the result online, see notebooks/convert_and_quantize_chinese_llama.ipynb.

5.2 Merging multiple LoRAs

Merging Chinese-Alpaca-Plus requires two LoRA weights: Chinese-LLaMA-Plus-LoRA and Chinese-Alpaca-Plus-LoRA.

python scripts/merge_llama_with_chinese_lora.py \
    --base_model path_to_original_llama_hf_dir \
    --lora_model path_to_chinese_llama_plus_lora,path_to_chinese_alpaca_plus_lora \
    --output_type [pth|huggingface] \
    --output_dir path_to_output_dir

The order of the two LoRA models matters and must not be reversed: list LLaMA-Plus-LoRA first, then Alpaca-Plus-LoRA, separated by a comma with no spaces.
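One way to see why order can matter (a general illustration under stated assumptions, not a description of the merge script's internals): low-rank deltas from two LoRAs simply add up, and addition does not depend on order, but modules trained in full and saved whole (like those in modules_to_save) replace the current weight, so whichever LoRA is applied last wins. The toy merge below uses scalars in place of weight matrices; the dictionary layout and its "delta"/"full" keys are hypothetical.

```python
from typing import Dict, List


def apply_loras(base: Dict[str, float], loras: List[Dict[str, Dict[str, float]]]) -> Dict[str, float]:
    """Toy merge: 'delta' entries are added, 'full' entries overwrite the current value."""
    merged = dict(base)
    for lora in loras:
        for name, delta in lora.get("delta", {}).items():
            merged[name] += delta    # low-rank update: order-independent
        for name, weight in lora.get("full", {}).items():
            merged[name] = weight    # full replacement: last writer wins
    return merged


base = {"q_proj": 1.0, "embed_tokens": 0.0}
llama_plus = {"delta": {"q_proj": 0.5}, "full": {"embed_tokens": 10.0}}
alpaca_plus = {"delta": {"q_proj": 0.25}, "full": {"embed_tokens": 20.0}}

print(apply_loras(base, [llama_plus, alpaca_plus]))  # {'q_proj': 1.75, 'embed_tokens': 20.0}
print(apply_loras(base, [alpaca_plus, llama_plus]))  # {'q_proj': 1.75, 'embed_tokens': 10.0}
```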

To merge multiple LoRAs and quantize the result online, see notebooks/convert_and_quantize_chinese_alpaca_plus.ipynb.

6. Deploy and run the model

CUDA_VISIBLE_DEVICES=0 python scripts/inference_hf.py \
    --base_model path_to_original_llama_hf_dir \
    --lora_model path_to_chinese_llama_or_alpaca_lora \
    --with_prompt \
    --interactive

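If you have already merged the LoRA weights with scripts/merge_llama_with_chinese_lora.py (using --output_type huggingface), you don't need to specify --lora_model, and the launch command is simpler: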

CUDA_VISIBLE_DEVICES=0 python scripts/inference_hf.py \
    --base_model path_to_merged_llama_or_alpaca_hf_dir \
    --with_prompt \
    --interactive

Removing CUDA_VISIBLE_DEVICES=0 switches to CPU inference mode. The model can also be deployed through a WebUI.

The models in this project mainly support the following quantization, inference, and deployment methods:

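Method	Features	Platform	CPU	GPU	Quantized loading	GUI	Tutorial
llama.cpp	Rich quantization options and efficient local inference	Cross-platform	✅	✅	✅		Link
Transformers	Native transformers inference API	Cross-platform	✅	✅	✅	✅	Link
text-generation-webui	Deployment through a front-end web UI	Cross-platform	✅	✅	✅	✅	Link
LlamaChat	Graphical interface for macOS (requires llama.cpp models)	macOS	✅		✅	✅	Link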