A fully open-source Chinese Llama2 model, free for commercial use, together with Chinese and English SFT datasets. The input format strictly follows the llama-2-chat format, so the model is compatible with all optimizations targeting the original llama-2-chat model.
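For reference, here is a minimal sketch of what that format looks like when assembled by hand (build_prompt is a hypothetical helper for illustration, not part of this repository; the full system prompt the project uses appears in the inference example below):

# Hypothetical helper illustrating the llama-2-chat prompt layout;
# not part of this repository.
def build_prompt(user_message: str, system_prompt: str) -> str:
    # The system prompt sits inside <<SYS>> tags within the first
    # [INST] block, followed by a blank line and the user message.
    return (
        "[INST] <<SYS>>\n"
        f"{system_prompt}\n"
        "<</SYS>>\n\n"
        f"{user_message} [/INST]"
    )

print(build_prompt("你好", "You are a helpful assistant."))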


Talk is cheap. Show you the Demo.
Model Download
4-bit Quantization
GGML Q4 model:
We used Chinese and English SFT datasets totaling 10 million samples. Example of running inference with transformers:
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_path = "LinkSoul/Chinese-Llama-2-7b"

tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path).half().cuda()
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

prompt = instruction.format("用中文回答,When is the best time to visit Beijing, and do you have any suggestions for me?")
generate_ids = model.generate(tokenizer(prompt, return_tensors='pt').input_ids.cuda(), max_new_tokens=4096, streamer=streamer)
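The TextStreamer prints tokens to stdout as they are generated; generate_ids still holds the full sequence (prompt plus completion), so the text can also be recovered afterwards. Continuing the example above:

# Drop the prompt tokens and decode only the completion.
prompt_len = tokenizer(prompt, return_tensors='pt').input_ids.shape[1]
response = tokenizer.decode(generate_ids[0][prompt_len:], skip_special_tokens=True)
print(response)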
You can use the Dockerfile in the repository to quickly build an image on top of NVIDIA's latest nvcr.io/nvidia/pytorch:23.06-py3 base image and run the Chinese LLaMA2 model application in a container anywhere.

docker build -t linksoul/chinese-llama2-chat .

Once the image is built, run it with:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v `pwd`/LinkSoul:/app/LinkSoul -p 7860:7860 linksoul/chinese-llama2-chat

Want to run the LLaMA2 model on CPU? Use the method below.
Use ggml/convert_to_ggml.py to perform the conversion; see the CLI parameters supported by the script for details. Pull the model format conversion tool image with docker pull soulteary/llama2:converter, then run the following two commands inside the Docker container to complete the process (see the tutorial on building a CPU-runnable MetaAI LLaMA2 Chinese model):

python3 convert.py /app/LinkSoul/Chinese-Llama-2-7b/ --outfile /app/LinkSoul/Chinese-Llama-2-7b-ggml.bin
./quantize /app/LinkSoul/Chinese-Llama-2-7b-ggml.bin /app/LinkSoul/Chinese-Llama-2-7b-ggml-q4.bin q4_0
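The quantized file can then be run on CPU with llama.cpp, which is what the GGML format targets; a sketch, assuming a standard llama.cpp build (the -n token limit here is an arbitrary choice):

./main -m /app/LinkSoul/Chinese-Llama-2-7b-ggml-q4.bin -n 512 -p "[INST] 你好 [/INST]"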
Definitions of the quantization configurations (reposted from https://www.reddit.com/r/LocalLLaMA/comments/139yt87/notable_differences_between_q4_2_and_q5_1/):
q4_0 = 32 numbers in chunk, 4 bits per weight, 1 scale value at 32-bit float (5 bits per value on average), each weight is given by the common scale * quantized value.
q4_1 = 32 numbers in chunk, 4 bits per weight, 1 scale value and 1 bias value at 32-bit float (6 bits per value on average), each weight is given by the common scale * quantized value + common bias.
q4_2 = same as q4_0, but 16 numbers in chunk, 4 bits per weight, 1 scale value that is 16-bit float, same size as q4_0 but better because chunks are smaller.
q4_3 = already dead, but analogous: q4_1 but 16 numbers in chunk, 4 bits per weight, scale value that is 16 bit and bias also 16 bits, same size as q4_1 but better because chunks are smaller.
q5_0 = 32 numbers in chunk, 5 bits per weight, 1 scale value at 16-bit float, size is 5.5 bits per weight
q5_1 = 32 numbers in a chunk, 5 bits per weight, 1 scale value at 16 bit float and 1 bias value at 16 bit, size is 6 bits per weight.
q8_0 = same as q4_0, except 8 bits per weight, 1 scale value at 32 bits, making total of 9 bits per weight.
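As a sanity check of the averages quoted above, the per-weight cost of each format can be recomputed from the chunk size, the bits per weight, and the scale/bias overhead; a small sketch:

# Average bits per weight = (chunk * weight_bits + overhead_bits) / chunk.
# Entries: (chunk size, bits per weight, scale/bias overhead in bits)
formats = {
    "q4_0": (32, 4, 32),        # one 32-bit scale
    "q4_1": (32, 4, 32 + 32),   # 32-bit scale + 32-bit bias
    "q4_2": (16, 4, 16),        # one 16-bit scale
    "q4_3": (16, 4, 16 + 16),   # 16-bit scale + 16-bit bias
    "q5_0": (32, 5, 16),        # one 16-bit scale
    "q5_1": (32, 5, 16 + 16),   # 16-bit scale + 16-bit bias
    "q8_0": (32, 8, 32),        # one 32-bit scale
}
for name, (chunk, bits, overhead) in formats.items():
    print(name, (chunk * bits + overhead) / chunk)
# q4_0 5.0, q4_1 6.0, q4_2 5.0, q4_3 6.0, q5_0 5.5, q5_1 6.0, q8_0 9.0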
First install the extra dependencies with pip install fastapi uvicorn, then run api.py from the repository:

python api.py

By default the service is deployed on local port 8000 and is called via the POST method:
curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

The returned value is:
{
  "response": "你好!我是一个人工智能语言模型,可以回答你的问题和进行对话。请问你有什么需要帮助的吗?",
  "history": [["<<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n你好", "你好!我是一个人工智能语言模型,可以回答你的问题和进行对话。请问你有什么需要帮助的吗?"]],
  "status": 200,
  "time": "2023-08-01 09:22:16"
}
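The same request can be issued from Python; a minimal sketch using the requests library, assuming the api.py server above is running on its default port:

import requests

# Call the local API started by api.py (default: port 8000).
resp = requests.post(
    "http://127.0.0.1:8000",
    json={"prompt": "你好", "history": []},
)
data = resp.json()
print(data["response"])
# Feed data["history"] back as "history" in the next request
# to continue the conversation.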
To fine-tune the model, set the dataset, cache, model, and output paths, then launch distributed training with torchrun:

DATASET="LinkSoul/instruction_merge_set"
DATA_CACHE_PATH="hf_datasets_cache"
MODEL_PATH="/PATH/TO/TRANSFORMERS/VERSION/LLAMA2"
output_dir="./checkpoints_llama2"
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 \
    --master_port=25003 \
    train.py \
    --model_name_or_path ${MODEL_PATH} \
    --data_path ${DATASET} \
    --data_cache_path ${DATA_CACHE_PATH} \
    --bf16 True \
    --output_dir ${output_dir} \
    --num_train_epochs 1 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 1 \
    --evaluation_strategy 'no' \
    --save_strategy 'steps' \
    --save_steps 1200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type cosine \
    --logging_steps 1 \
    --fsdp 'full_shard auto_wrap' \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 4096 \
    --gradient_checkpointing True
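For reference, the effective global batch size of the command above is the product of the number of processes, the per-device batch size, and the gradient accumulation steps; a quick check:

# 8 processes x 4 samples per device x 1 accumulation step
nproc_per_node = 8
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
print(nproc_per_node * per_device_train_batch_size * gradient_accumulation_steps)  # 32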
Apache-2.0 license