
fastLLaMa

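fastLLaMa is an experimental, high-performance framework designed to address the challenges of deploying large language models (LLMs) in production environments.

It provides a user-friendly Python interface to the C++ library llama.cpp, allowing developers to build custom workflows, implement adaptable logging, and switch context between sessions seamlessly. The framework aims to make operating LLMs at scale more efficient. Ongoing development focuses on features such as optimized cold boot times, Int4 support for NVIDIA GPUs, model artifact management, and support for multiple programming languages.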

                ___            __    _    _         __ __      
                | | '___  ___ _| |_ | |  | |   ___ |     ___ 
                | |-<_> |<_-<  | |  | |_ | |_ <_> ||     |<_> |
                |_| <___|/__/  |_|  |___||___|<___||_|_|_|<___|
                                                            
                                                                                        
                                                                           
                                                       .+*+-.                
                                                      -%#--                  
                                                    :=***%*++=.              
                                                   :+=+**####%+              
                                                   ++=+*%#                   
                                                  .*+++==-                   
                  ::--:.                           .**++=::                   
                 #%##*++=......                    =*+==-::                   
                .@@@*@%*==-==-==---:::::------::==*+==--::                   
                 %@@@@+--====+===---=---==+=======+++----:                   
                 .%@@*++*##***+===-=====++++++*++*+====++.                   
                 :@@%*##%@@%#*%#+==++++++=++***==-=+==+=-                    
                  %@%%%%%@%#+=*%*##%%%@###**++++==--==++                     
                  #@%%@%@@##**%@@@%#%%%%**++*++=====-=*-                     
                  -@@@@@@@%*#%@@@@@@@%%%%#+*%#++++++=*+.                     
                   +@@@@@%%*-#@@@@@@@@@@@%%@%**#*#+=-.                       
                    #%%###%:  ..+#%@@@@%%@@@@%#+-                            
                    :***#*-         ...  *@@@%*+:                            
                     =***=               -@%##**.                            
                    :#*++                -@#-:*=.                            
                     =##-                .%*..##                             
                      +*-                 *:  +-                             
                      :+-                :+   =.                             
                       =-.               *+   =-                             
                        :-:-              =--  :::

Features

Supported Models

Requirements

CMake
- For Linux:
  sudo apt-get -y install cmake
- For OS X:
  brew install cmake
- For Windows:
  Download the cmake-*.exe installer from the CMake download page and run it.
GCC 11 or later
C++17 or later
Python 3.x

Installation

Install fastLLaMa with pip:

pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main

Usage

Importing the package

To import fastLLaMa, run:

from fastllama import Model

Initializing the model

MODEL_PATH = "./models/7B/ggml-model-q4_0.bin"

model = Model(
    path=MODEL_PATH,   # path to model
    num_threads=8,     # number of threads to use
    n_ctx=512,         # context size of model
    last_n_size=64,    # size of last n tokens (used for repetition penalty) (Optional)
    seed=0,            # seed for random number generator (Optional)
    n_batch=128,       # batch size (Optional)
    use_mmap=False,    # use mmap to load model (Optional)
)

Ingesting prompts

prompt = """Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: """

res = model.ingest(prompt, is_system_prompt=True)  # ingest model with prompt
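The transcript-style prompt above can also be assembled from structured turns instead of being written by hand. The following is a minimal sketch; `build_prompt` and its parameters are hypothetical helpers for illustration and are not part of fastLLaMa.

```python
from typing import List, Tuple


def build_prompt(system: str, turns: List[Tuple[str, str]],
                 user_name: str = "User", bot_name: str = "Bob") -> str:
    """Assemble a transcript-style prompt that ends with an open user turn.

    Hypothetical helper: it only builds a string and does not call fastLLaMa.
    """
    lines = [system, ""]
    for user_text, bot_text in turns:
        lines.append(f"{user_name}: {user_text}")
        lines.append(f"{bot_name}: {bot_text}")
    # Leave the final user turn open so the next user input can be appended.
    lines.append(f"{user_name}: ")
    return "\n".join(lines)


prompt = build_prompt(
    "Transcript of a dialog, where the User interacts with an Assistant named Bob.",
    [("Hello, Bob.", "Hello. How may I help you today?")],
)
```

The resulting string can be passed to `model.ingest(prompt, is_system_prompt=True)` exactly as in the example above.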

Generating output

def stream_token(x: str) -> None:
    """
    This function is called by the library to stream tokens
    """
    print(x, end='', flush=True)

res = model.generate(
    num_tokens=100,
    top_p=0.95,                  # top p sampling (Optional)
    temp=0.8,                    # temperature (Optional)
    repeat_penalty=1.0,          # repetition penalty (Optional)
    streaming_fn=stream_token,   # streaming function
    stop_words=["User:", "\n"],  # stop generation when one of these words is encountered (Optional)
)
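Because `streaming_fn` is an ordinary callable that receives one string per token, it can also collect the output instead of only printing it. The sketch below shows one way to do that; `TokenCollector` is a hypothetical name, not something fastLLaMa provides, and the loop at the end simulates the library calling it.

```python
from typing import List


class TokenCollector:
    """Callable that stores streamed tokens and optionally echoes them."""

    def __init__(self, echo: bool = True) -> None:
        self.tokens: List[str] = []
        self.echo = echo

    def __call__(self, token: str) -> None:
        self.tokens.append(token)
        if self.echo:
            print(token, end="", flush=True)

    @property
    def text(self) -> str:
        """Everything streamed so far, joined into one string."""
        return "".join(self.tokens)


# Simulate the library streaming three tokens into the collector.
collector = TokenCollector(echo=False)
for piece in ["Par", "is", "."]:
    collector(piece)

# With a real model, the collector would be passed as the callback:
# model.generate(num_tokens=100, streaming_fn=collector)
```

After generation, `collector.text` holds the full response, which is convenient for logging or post-processing.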

Loading the model with multiple threads

model = Model(
    path=MODEL_PATH,     # path to model
    num_threads=8,       # number of threads to use
    n_ctx=512,           # context size of model
    last_n_size=64,      # size of last n tokens (used for repetition penalty) (Optional)
    seed=0,              # seed for random number generator (Optional)
    n_batch=128,         # batch size (Optional)
    load_parallel=True,  # load the model using multiple threads
)

Saving the model state

To cache the session, use the save_state method.

res = model.save_state("./models/fast_llama.bin")

Loading the model state

To load a saved session, use the load_state method.

res = model.load_state("./models/fast_llama.bin")

Resetting the model state

To reset the session, use the reset method.

model.reset()
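The save and load calls above combine naturally into a cache-on-first-run pattern: ingest the system prompt once, save the session, and restore it on later runs. The sketch below assumes the saved state includes the ingested prompt, which this README implies but does not state outright. `restore_or_build_session` is a hypothetical helper, and `FakeModel` is a stand-in so the example runs without fastLLaMa installed.

```python
import os
import tempfile


def restore_or_build_session(model, prompt: str, cache_path: str) -> str:
    """Load a cached session if one exists, otherwise ingest and cache it."""
    if os.path.exists(cache_path):
        model.load_state(cache_path)
        return "loaded"
    model.ingest(prompt, is_system_prompt=True)
    model.save_state(cache_path)
    return "built"


class FakeModel:
    """Stand-in that records calls; a real fastllama Model would go here."""

    def __init__(self) -> None:
        self.calls = []

    def ingest(self, prompt, is_system_prompt=False):
        self.calls.append("ingest")

    def save_state(self, path):
        self.calls.append("save_state")
        with open(path, "wb") as f:
            f.write(b"state")

    def load_state(self, path):
        self.calls.append("load_state")


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, "fast_llama.bin")
    fake = FakeModel()
    first = restore_or_build_session(fake, "system prompt", cache)   # no cache yet
    second = restore_or_build_session(fake, "system prompt", cache)  # cache exists
```

The helper ignores the return values of `save_state` and `load_state` because their type is not documented here; production code would likely want to check them.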

Attaching a LoRA adapter to the base model at runtime

To attach a LoRA adapter at runtime, use the attach_lora method.

LORA_ADAPTER_PATH = "./models/ALPACA-7B-ADAPTER/ggml-adapter-model.bin"

model.attach_lora(LORA_ADAPTER_PATH)

Note: It is a good idea to reset the model state after attaching a LoRA adapter.

Detaching a LoRA adapter from the base model at runtime

To detach a LoRA adapter at runtime, use the detach_lora method.

model.detach_lora()
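Attaching, resetting, and detaching can be wrapped in a context manager so the adapter is always removed, even if generation raises. This is a minimal sketch that uses only the three methods shown in this README; `lora_adapter` is a hypothetical helper, and `RecordingModel` is a stand-in so the example runs without fastLLaMa installed.

```python
from contextlib import contextmanager


@contextmanager
def lora_adapter(model, adapter_path: str):
    """Attach a LoRA adapter for the duration of a with-block."""
    model.attach_lora(adapter_path)
    model.reset()  # the README recommends resetting after attaching
    try:
        yield model
    finally:
        model.detach_lora()  # runs even if the block raises


class RecordingModel:
    """Stand-in that records calls; a real fastllama Model would go here."""

    def __init__(self) -> None:
        self.calls = []

    def attach_lora(self, path):
        self.calls.append("attach_lora")

    def detach_lora(self):
        self.calls.append("detach_lora")

    def reset(self):
        self.calls.append("reset")


demo = RecordingModel()
with lora_adapter(demo, "./models/ALPACA-7B-ADAPTER/ggml-adapter-model.bin") as m:
    m.calls.append("generate")  # placeholder for real ingest/generate calls
```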

Calculating perplexity

To calculate perplexity, use the perplexity method.

with open("test.txt", "r") as f:
    data = f.read(8000)

total_perplexity = model.perplexity(data)
print(f"Total Perplexity: {total_perplexity:.4f}")
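For readers unfamiliar with the metric, perplexity is conventionally the exponential of the mean negative log-probability that the model assigns to each observed token. The sketch below implements that textbook definition on a plain list of probabilities. Whether `model.perplexity` computes it in exactly this way is an assumption; `perplexity_from_probs` is a hypothetical name.

```python
import math
from typing import Sequence


def perplexity_from_probs(token_probs: Sequence[float]) -> float:
    """exp(mean negative log-probability) over the observed tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# A model that spreads probability evenly over 4 choices at every step
# is "as confused as" a 4-way guess.
uniform = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model found the text more predictable; a value of 1.0 would mean every token was predicted with certainty.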

Getting the model's embeddings

To get the model's embeddings, use the get_embeddings method.

embeddings = model.get_embeddings()
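A common use of embeddings is comparing two texts by cosine similarity. The sketch below assumes `get_embeddings` returns a flat sequence of floats, which this README does not specify; `cosine_similarity` is a hypothetical helper written in plain Python.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


# Toy vectors standing in for two embedding results.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
```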

Getting the model's logits

To get the model's logits, use the get_logits method.

logits = model.get_logits()
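Logits are unnormalized scores, so they are usually passed through a softmax to obtain probabilities before inspection. The sketch below assumes `get_logits` returns a flat sequence with one score per vocabulary entry, which this README does not specify. `softmax` and `top_k` are hypothetical helpers, and the temperature scaling shown is the conventional one rather than a confirmed description of how fastLLaMa applies `temp`.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float], temp: float = 1.0) -> List[float]:
    """Convert logits to probabilities, with optional temperature scaling."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(probs: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return the k (index, probability) pairs with the highest probability."""
    return sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:k]


even = softmax([0.0, 0.0])
best = top_k(softmax([1.0, 3.0, 2.0]), 1)
```

The indices returned by `top_k` would correspond to token ids under the assumption above.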

Using the logger

from fastllama import Logger

class MyLogger(Logger):
    def __init__(self):
        super().__init__()
        self.file = open("logs.log", "w")

    def log_info(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see info logs
        print(f"[Info]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

    def log_err(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see error logs
        print(f"[Error]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

    def log_warn(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see warning logs
        print(f"[Warn]: Func('{func_name}') {message}", flush=True, end='', file=self.file)
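The same three hooks can forward messages to Python's standard `logging` module instead of writing to a file directly. The following is a sketch under assumptions: `StdlibBridge` is a hypothetical class, and it is deliberately left as a plain class so the example runs without fastLLaMa installed. In real use it would subclass the library's `Logger` as `MyLogger` does above.

```python
import logging


class StdlibBridge:
    """Forward log_info / log_err / log_warn to the logging module."""

    def __init__(self, name: str = "fastllama") -> None:
        self._log = logging.getLogger(name)

    def log_info(self, func_name: str, message: str) -> None:
        self._log.info("%s: %s", func_name, message.rstrip())

    def log_err(self, func_name: str, message: str) -> None:
        self._log.error("%s: %s", func_name, message.rstrip())

    def log_warn(self, func_name: str, message: str) -> None:
        self._log.warning("%s: %s", func_name, message.rstrip())


class ListHandler(logging.Handler):
    """Test handler that keeps formatted records in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(f"{record.levelname} {record.getMessage()}")


handler = ListHandler()
log = logging.getLogger("fastllama")
log.setLevel(logging.INFO)
log.addHandler(handler)

bridge = StdlibBridge()
bridge.log_info("load", "model loaded\n")
bridge.log_err("load", "file missing\n")
```

Routing through `logging` lets the application control levels, formatting, and destinations with its existing configuration.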

For more details, see the examples/python/ folder.

Running LLaMA

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# convert the 7B model to ggml FP16 format
# python [PythonFile] [ModelPath] [Floattype] [Vocab Only] [SplitType]
python3 scripts/convert-pth-to-ggml.py models/7B/ 1 0

# quantize the model to 4-bits
./build/src/quantize models/7B/ggml-model-f16.bin models/7B/ggml-model-q4_0.bin 2

# run the inference
# Run the scripts from the root dir of the project for now!
python ./examples/python/example.py

Running Alpaca-LoRA

# Before running this command
# You need to provide the HF model paths here
python ./scripts/export-from-huggingface.py
# Alternatively you can just download the ggml models from huggingface directly and run them!

python3 ./scripts/convert-pth-to-ggml.py models/ALPACA-LORA-7B 1 0

./build/src/quantize models/ALPACA-LORA-7B/ggml-model-f16.bin models/ALPACA-LORA-7B/alpaca-lora-q4_0.bin 2

python ./examples/python/example-alpaca.py

Using LoRA adapters at runtime

# Download lora adapters and paste them inside models folder
# https://huggingface.co/tloen/alpaca-lora-7b

python scripts/convert-lora-to-ggml.py models/ALPACA-7B-ADAPTER/ -t fp32
# Change -t to fp16 to use fp16 weights
# In order to use LoRA adapters without caching, pass the --no-cache flag
#   - Only supported for fp32 adapter weights

python examples/python/example-lora-adapter.py

# Make sure to set paths correctly for the base model and adapter inside the example
# Commands:
# load_lora: Attaches the adapter to the base model
# unload_lora: Detaches the adapter (Detach for fp16 is yet to be added!)
# reset: Resets the model state