A fully open-source Chinese Llama2 model, free for commercial use, together with Chinese and English SFT datasets. The input format strictly follows the llama-2-chat format, so the model is compatible with all optimizations targeting the original llama-2-chat model.
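For reference, here is a minimal sketch of what that format looks like when assembled by hand (build_prompt is a hypothetical helper for illustration, not part of this repository; the full system prompt the project uses appears in the inference example below):

# Hypothetical helper illustrating the llama-2-chat prompt layout;
# not part of this repository.
def build_prompt(user_message: str, system_prompt: str) -> str:
    # The system prompt sits inside <<SYS>> tags within the first
    # [INST] block, followed by a blank line and the user message.
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(build_prompt("你好", "You are a helpful assistant."))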


Talk is cheap. Show you the Demo.
Model Download
4-bit Quantization
GGML Q4 model:
We used Chinese and English SFT datasets totaling 10 million samples. Example of running inference with transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_path = "LinkSoul/Chinese-Llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path).half().cuda()
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

prompt = instruction.format("用中文回答,When is the best time to visit Beijing, and do you have any suggestions for me?")
generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, streamer=streamer)
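The TextStreamer prints tokens to stdout as they are generated; generate_ids still holds the full sequence (prompt plus completion), so the text can also be recovered afterwards. Continuing the example above:

# Drop the prompt tokens and decode only the completion.
prompt_len = tokenizer(prompt, return_tensors='pt').input_ids.shape[1]
response = tokenizer.decode(generate_ids[0][prompt_len:], skip_special_tokens=True)
print(response)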
You can use the Dockerfile in the repository to quickly build an image on top of NVIDIA's latest nvcr.io/nvidia/pytorch:23.06-py3 base image and run the Chinese LLaMA2 model application in a container anywhere.

docker build -t linksoul/chinese-llama2-chat .

Once the image is built, run it with:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v `pwd`/LinkSoul:/app/LinkSoul -p 7860:7860 linksoul/chinese-llama2-chat

Want to run the LLaMA2 model on CPU? Use the method below.
Use ggml/convert_to_ggml.py to perform the conversion; see the CLI parameters supported by the script for details. Pull the model format conversion tool image with docker pull soulteary/llama2:converter, then run the following two commands inside the Docker container to complete the process (see the tutorial on building a CPU-runnable MetaAI LLaMA2 Chinese model):

python3 convert.py /app/LinkSoul/Chinese-Llama-2-7b/ --outfile /app/LinkSoul/Chinese-Llama-2-7b-ggml.bin
./quantize /app/LinkSoul/Chinese-Llama-2-7b-ggml.bin /app/LinkSoul/Chinese-Llama-2-7b-ggml-q4.bin q4_0
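The quantized file can then be run on CPU with llama.cpp, which is what the GGML format targets; a sketch, assuming a standard llama.cpp build (the -n token limit here is an arbitrary choice):

./main -m /app/LinkSoul/Chinese-Llama-2-7b-ggml-q4.bin -n 512 -p "[INST] 你好 [/INST]"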
Definitions of the quantization configurations (reposted from https://www.reddit.com/r/LocalLLaMA/comments/139yt87/notable_differences_between_q4_2_and_q5_1/):
q4_0 = 32 numbers in chunk, 4 bits per weight, 1 scale value at 32-bit float (5 bits per value on average), each weight is given by the common scale * quantized value.
q4_1 = 32 numbers in chunk, 4 bits per weight, 1 scale value and 1 bias value at 32-bit float (6 bits per value on average), each weight is given by the common scale * quantized value + common bias.
q4_2 = same as q4_0, but 16 numbers in chunk, 4 bits per weight, 1 scale value that is 16-bit float, same size as q4_0 but better because chunks are smaller.
q4_3 = already dead, but analogous: q4_1 but 16 numbers in chunk, 4 bits per weight, scale value that is 16 bit and bias also 16 bits, same size as q4_1 but better because chunks are smaller.
q5_0 = 32 numbers in chunk, 5 bits per weight, 1 scale value at 16-bit float, size is 5.5 bits per weight
q5_1 = 32 numbers in a chunk, 5 bits per weight, 1 scale value at 16 bit float and 1 bias value at 16 bit, size is 6 bits per weight.
q8_0 = same as q4_0, except 8 bits per weight, 1 scale value at 32 bits, making total of 9 bits per weight.
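As a sanity check of the averages quoted above, the per-weight cost of each format can be recomputed from the chunk size, the bits per weight, and the scale/bias overhead; a small sketch:

# Average bits per weight = (chunk * weight_bits + overhead_bits) / chunk.
# Entries: (chunk size, bits per weight, scale/bias overhead in bits)
formats = {
    "q4_0": (32, 4, 32),        # one 32-bit scale
    "q4_1": (32, 4, 32 + 32),   # 32-bit scale + 32-bit bias
    "q4_2": (16, 4, 16),        # one 16-bit scale
    "q4_3": (16, 4, 16 + 16),   # 16-bit scale + 16-bit bias
    "q5_0": (32, 5, 16),        # one 16-bit scale
    "q5_1": (32, 5, 16 + 16),   # 16-bit scale + 16-bit bias
    "q8_0": (32, 8, 32),        # one 32-bit scale
}
for name, (chunk, bits, overhead) in formats.items():
    print(name, (chunk * bits + overhead) / chunk)
# q4_0 5.0, q4_1 6.0, q4_2 5.0, q4_3 6.0, q5_0 5.5, q5_1 6.0, q8_0 9.0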
First install the extra dependencies with pip install fastapi uvicorn, then run api.py from the repository:

python api.py

By default the service is deployed on local port 8000 and is called via the POST method:
curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

The returned value is:
{
  "response": "你好!我是一个人工智能语言模型,可以回答你的问题和进行对话。请问你有什么需要帮助的吗?",
  "history": [["<<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n你好", "你好!我是一个人工智能语言模型,可以回答你的问题和进行对话。请问你有什么需要帮助的吗?"]],
  "status": 200,
  "time": "2023-08-01 09:22:16"
}
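The same request can be issued from Python; a minimal sketch using the requests library, assuming the api.py server above is running on its default port:

import requests

# Call the local API started by api.py (default: port 8000).
resp = requests.post(
    "http://127.0.0.1:8000",
    json={"prompt": "你好", "history": []},
)
data = resp.json()
print(data["response"])
# Feed data["history"] back as "history" in the next request
# to continue the conversation.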
To fine-tune the model, set the dataset, cache, model, and output paths, then launch distributed training with torchrun:

DATASET="LinkSoul/instruction_merge_set"
DATA_CACHE_PATH="hf_datasets_cache"
MODEL_PATH="/PATH/TO/TRANSFORMERS/VERSION/LLAMA2"
output_dir="./checkpoints_llama2"
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 \
    --master_port=25003 \
    train.py \
    --model_name_or_path ${MODEL_PATH} \
    --data_path ${DATASET} \
    --data_cache_path ${DATA_CACHE_PATH} \
    --bf16 True \
    --output_dir ${output_dir} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy 'no' \
    --save_strategy 'steps' \
    --save_steps 1200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --fsdp 'full_shard auto_wrap' \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True
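For reference, the effective global batch size of the command above is the product of the number of processes, the per-device batch size, and the gradient accumulation steps; a quick check:

# 8 processes x 4 samples per device x 1 accumulation step
nproc_per_node = 8
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
print(nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps)  # 32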
Apache-2.0 license