Chinese Llama 2 7B

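A fully open-source, fully commercially usable Chinese version of the Llama 2 model, together with Chinese and English SFT datasets. The input format strictly follows the llama-2-chat format, so the model is compatible with all optimizations that target the original llama-2-chat model.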

Basic Demo

Try It Online

Talk is cheap, show you the demo.

Demo address / Hugging Face Spaces
Colab (FP16; requires High-RAM, so the free tier cannot be used)
Colab (INT4; requires High-RAM, so the free tier cannot be used)

Resource Downloads

Model downloads
- Shizhi AI: Chinese Llama2 Chat Model
- ModelScope: Chinese Llama2 Chat Model
- Hugging Face: Chinese Llama2 Chat Model
- Baidu Netdisk: 1.0 official release
- Baidu Netdisk: 1.1 enhanced release
4-bit quantization
- Hugging Face: Chinese Llama2 4-bit Chat Model
- Baidu Netdisk: Chinese Llama2 4-bit Chat Model
GGML Q4 models:
- https://huggingface.co/linksoul/chinese-llama-2-7b-ggml
- https://huggingface.co/rffx0/chinese-llama-2-7b-ggml-model-q4_0
- https://huggingface.co/soulteary/chinese-llama-2-7b-ggml-q4
- Baidu Netdisk: Chinese-Llama-2-7b-ggml

We trained on a Chinese and English SFT dataset with a data volume of 10 million.

Dataset: https://huggingface.co/datasets/linksoul/instruction_merge_set

Quick Test

from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer

model_path = "LinkSoul/Chinese-Llama-2-7b"

# Load the tokenizer and the model in half precision on the GPU.
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path).half().cuda()
# Stream decoded tokens to stdout as they are generated, hiding the prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# llama-2-chat prompt template: system prompt inside <<SYS>> tags, user message in {}.
instruction = """[INST] <<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.

            If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n{} [/INST]"""

prompt = instruction.format("用中文回答，When is the best time to visit Beijing, and do you have any suggestions for me?")
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.cuda()
generate_ids = model.generate(input_ids, max_new_tokens=4096, streamer=streamer)
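
The single-turn template above can be generalized to a multi-turn conversation. The following is a minimal sketch, not code from the repository: `build_prompt` and the constant names are hypothetical, and the `</s><s>` separator between completed turns is an assumption based on the commonly used llama-2-chat convention. The system prompt is folded into the first user turn, which matches the `history` structure returned by the API.

```python
# Hypothetical helper: assembles a llama-2-chat style prompt string.
B_INST, E_INST = "[INST]", "[/INST]"
B_SYS, E_SYS = "<<SYS>>\n", "\n<</SYS>>\n\n"
TURN_SEP = "</s><s>"  # assumption: end/begin-of-sequence markers between turns


def build_prompt(system, history, message):
    """Build a prompt from a system prompt, past (user, assistant) pairs,
    and the new user message that the model should answer."""
    turns = list(history) + [(message, None)]
    parts = []
    for index, (user, assistant) in enumerate(turns):
        if index == 0:
            # The system prompt lives inside the first user turn.
            user = f"{B_SYS}{system}{E_SYS}{user}"
        parts.append(f"{B_INST} {user} {E_INST}")
        if assistant is not None:
            # A completed turn: append the answer and close the sequence.
            parts.append(f" {assistant} {TURN_SEP}")
    return "".join(parts)


single = build_prompt("Be concise.", [], "Hello")
multi = build_prompt("Be concise.", [("Hello", "Hi!")], "Bye")
```

With an empty history the result has the same shape as the `instruction` template used in the quick test, so it can be passed to the tokenizer in the same way.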

Docker

You can use the Dockerfile in the repository to quickly build an image on top of NVIDIA's nvcr.io/nvidia/pytorch:23.06-py3 base image, and then run the Chinese Llama 2 model application in a container.

docker build -t linksoul/chinese-llama2-chat .

Once the image is built, run it with the following command:

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --rm -it -v `pwd`/LinkSoul:/app/LinkSoul -p 7860:7860 linksoul/chinese-llama2-chat

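GGML / llama.cpp

Want to run the Llama 2 model in a CPU-only environment? Use one of the following methods.

Use ggml/convert_to_ggml.py to perform the conversion. For details, see the CLI arguments supported by the script.
Alternatively, run docker pull soulteary/llama2:converter to download the model format conversion tool image, then complete the conversion with the following two commands inside the Docker container (tutorial: building a MetaAI LLaMA2 Chinese large model that runs on CPU):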

python3 convert.py /app/LinkSoul/Chinese-Llama-2-7b/ --outfile /app/LinkSoul/Chinese-Llama-2-7b-ggml.bin
./quantize /app/LinkSoul/Chinese-Llama-2-7b-ggml.bin /app/LinkSoul/Chinese-Llama-2-7b-ggml-q4.bin q4_0

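Quantization configuration definitions:

Reposted from: https://www.reddit.com/r/localllama/comments/139yt87/notable_differences_between_q4_2_and_q5_1/

Q4_0 = 32 numbers per chunk, 4 bits per weight, 1 scale value as a 32-bit float (5 bits per value on average); each weight is given by the common scale * quantized value.
Q4_1 = 32 numbers per chunk, 4 bits per weight, 1 scale value and 1 bias value as 32-bit floats (6 bits per value on average); each weight is given by the common scale * quantized value + common bias.
Q4_2 = same as Q4_0, but 16 numbers per chunk, 4 bits per weight, 1 scale value as a 16-bit float; same size as Q4_0 but better because the chunks are smaller.
Q4_3 = already dead, but analogous to Q4_1 with 16 numbers per chunk, 4 bits per weight, a 16-bit scale value and a 16-bit bias value; same size as Q4_1 but better because the chunks are smaller.
Q5_0 = 32 numbers per chunk, 5 bits per weight, 1 scale value as a 16-bit float; size is 5.5 bits per weight.
Q5_1 = 32 numbers per chunk, 5 bits per weight, 1 scale value as a 16-bit float and 1 bias value at 16 bits; size is 6 bits per weight.
Q8_0 = same as Q4_0, except 8 bits per weight, with 1 scale value at 32 bits, for a total of 9 bits per weight.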
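
The per-weight sizes quoted above follow from simple arithmetic: the bits used by the quantized values in a chunk plus the bits used by the shared scale (and bias), divided by the chunk size. The sketch below reproduces those figures and illustrates the "common scale * quantized value" idea with a simplified symmetric quantizer. It is an illustration only: the function names are hypothetical, and the rounding and storage layout do not claim to match the actual llama.cpp kernels.

```python
# (chunk size, bits per weight, total bits for scale and bias) per format,
# taken from the definitions listed above.
FORMATS = {
    "Q4_0": (32, 4, 32),
    "Q4_1": (32, 4, 32 + 32),
    "Q4_2": (16, 4, 16),
    "Q4_3": (16, 4, 16 + 16),
    "Q5_0": (32, 5, 16),
    "Q5_1": (32, 5, 16 + 16),
    "Q8_0": (32, 8, 32),
}


def bits_per_weight(chunk_size, weight_bits, overhead_bits):
    """Average storage cost per weight, including the shared scale/bias."""
    return (chunk_size * weight_bits + overhead_bits) / chunk_size


def quantize_chunk(weights, bits=4):
    """Simplified symmetric quantization: one shared scale per chunk."""
    qmax = 2 ** (bits - 1) - 1                # 7 for 4-bit signed values
    amax = max(abs(w) for w in weights)
    scale = amax / qmax if amax else 1.0      # avoid dividing by zero
    quantized = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return scale, quantized


def dequantize_chunk(scale, quantized):
    """Each weight is recovered as common scale * quantized value."""
    return [scale * q for q in quantized]


for name, spec in FORMATS.items():
    print(name, bits_per_weight(*spec))

chunk = [0.5 * i / 31 - 0.25 for i in range(32)]
scale, q = quantize_chunk(chunk)
restored = dequantize_chunk(scale, q)
```

In this simplified scheme the reconstruction error of each weight is at most half the scale, which is why smaller chunks (a scale fitted to fewer values) tend to give better accuracy at the same nominal bit width.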

API Deployment

First install the additional dependencies with pip install fastapi uvicorn, then run api.py from the repository:

python api.py

By default the service listens on local port 8000 and is called with a POST request:

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'

The returned value looks like this:

{
  "response": "你好！我是一个人工智能语言模型，可以回答你的问题和进行对话。请问你有什么需要帮助的吗？",
  "history": [["<<SYS>>\nYou are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\n            If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.\n<</SYS>>\n\n你好", "你好！我是一个人工智能语言模型，可以回答你的问题和进行对话。请问你有什么需要帮助的吗？"]],
  "status": 200,
  "time": "2023-08-01 09:22:16"
}
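
The same call can be made from Python using only the standard library. This is a minimal sketch that assumes the service accepts and returns exactly the JSON fields shown above (`prompt`, `history`, `response`); the helper names `build_request` and `chat` are hypothetical and are not part of the repository.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000"  # default address from the text above


def build_request(prompt, history=None, url=API_URL):
    """Prepare the POST request with a JSON body, mirroring the curl call."""
    payload = {"prompt": prompt, "history": history or []}
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, history=None, url=API_URL, timeout=60):
    """Send one message and return (response text, updated history)."""
    request = build_request(prompt, history, url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.loads(resp.read().decode("utf-8"))
    return reply["response"], reply["history"]


# Example usage (requires api.py to be running):
# answer, history = chat("你好")
# answer, history = chat("请介绍一下北京", history)
```

Passing the returned `history` back into the next call is what carries the conversation context, including the system prompt stored in the first turn.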

How to Train

DATASET="LinkSoul/instruction_merge_set"

DATA_CACHE_PATH="hf_datasets_cache"
MODEL_PATH="/PATH/TO/TRANSFORMERS/VERSION/LLAMA2"

output_dir="./checkpoints_llama2"

torchrun --nnodes=1 --node_rank=0 --nproc_per_node=8 \
    --master_port=25003 \
        train.py \
        --model_name_or_path ${MODEL_PATH} \
        --data_path ${DATASET} \
        --data_cache_path ${DATA_CACHE_PATH} \
        --bf16 True \
        --output_dir ${output_dir} \
        --num_train_epochs 1 \
        --per_device_train_batch_size 4 \
        --per_device_eval_batch_size 4 \
        --gradient_accumulation_steps 1 \
        --evaluation_strategy 'no' \
        --save_strategy 'steps' \
        --save_steps 1200 \
        --save_total_limit 5 \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type cosine \
        --logging_steps 1 \
        --fsdp 'full_shard auto_wrap' \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --tf32 True \
        --model_max_length 4096 \
        --gradient_checkpointing True
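
To get a feel for what these settings imply, the sketch below works out the global batch size and the approximate number of optimizer steps. It is a rough estimate under stated assumptions: the dataset is taken to contain exactly 10,000,000 samples (the text only gives an approximate volume), each sample counts as one training example, and the function name `training_schedule` is hypothetical.

```python
import math


def training_schedule(num_samples, nproc_per_node, per_device_batch,
                      grad_accum, epochs, save_steps, nnodes=1):
    """Estimate batch size, step count, and periodic checkpoint count."""
    # Samples consumed per optimizer step across all GPUs.
    global_batch = nnodes * nproc_per_node * per_device_batch * grad_accum
    steps_per_epoch = math.ceil(num_samples / global_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "global_batch": global_batch,
        "total_steps": total_steps,
        # Checkpoints written by the every-N-steps save strategy.
        "periodic_checkpoints": total_steps // save_steps,
    }


schedule = training_schedule(
    num_samples=10_000_000,   # assumption, see lead-in
    nproc_per_node=8,
    per_device_batch=4,
    grad_accum=1,
    epochs=1,
    save_steps=1200,
)
print(schedule)
```

Because --save_total_limit is 5, only the five most recent of these periodic checkpoints would remain on disk at any time.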