fastLLaMa

fastLLaMa is an experimental high-performance framework designed to tackle the challenges of deploying large language models (LLMs) in production environments.

It offers a user-friendly Python interface to a C++ library, llama.cpp, enabling developers to create custom workflows, implement adaptable logging, and seamlessly switch contexts between sessions. The framework aims to make operating LLMs at scale more efficient. Ongoing development focuses on features such as optimized cold boot times, Int4 support for NVIDIA GPUs, model artifact management, and support for multiple programming languages.

                ___            __    _    _         __ __      
                | | '___  ___ _| |_ | |  | |   ___ |     ___ 
                | |-<_> |<_-<  | |  | |_ | |_ <_> ||     |<_> |
                |_| <___|/__/  |_|  |___||___|<___||_|_|_|<___|
                                                            
                                                                                        
                                                                           
                                                       .+*+-.                
                                                      -%#--                  
                                                    :=***%*++=.              
                                                   :+=+**####%+              
                                                   ++=+*%#                   
                                                  .*+++==-                   
                  ::--:.                           .**++=::                   
                 #%##*++=......                    =*+==-::                   
                .@@@*@%*==-==-==---:::::------::==*+==--::                   
                 %@@@@+--====+===---=---==+=======+++----:                   
                 .%@@*++*##***+===-=====++++++*++*+====++.                   
                 :@@%*##%@@%#*%#+==++++++=++***==-=+==+=-                    
                  %@%%%%%@%#+=*%*##%%%@###**++++==--==++                     
                  #@%%@%@@##**%@@@%#%%%%**++*++=====-=*-                     
                  -@@@@@@@%*#%@@@@@@@%%%%#+*%#++++++=*+.                     
                   +@@@@@%%*-#@@@@@@@@@@@%%@%**#*#+=-.                       
                    #%%###%:  ..+#%@@@@%%@@@@%#+-                            
                    :***#*-         ...  *@@@%*+:                            
                     =***=               -@%##**.                            
                    :#*++                -@#-:*=.                            
                     =##-                .%*..##                             
                      +*-                 *:  +-                             
                      :+-                :+   =.                             
                       =-.               *+   =-                             
                        :-:-              =--  :::

Features

Supported Models

Requirements

CMake
- For Linux:
  sudo apt-get -y install cmake
- For OS X:
  brew install cmake
- For Windows:
  Download the cmake-*.exe installer from the download page and run it.
GCC 11 or greater
C++ 17 minimum
Python 3.x
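
The requirements above can be checked from Python before attempting a build. The following is a minimal sketch that uses only the standard library; the function name check_prerequisites is hypothetical, and the check only looks for executables on PATH. It does not verify that GCC is version 11 or greater or that C++17 is supported.

```python
import shutil
import sys


def check_prerequisites() -> dict:
    """Report whether the tools listed in the requirements appear to be available."""
    return {
        # Python 3.x is required; this inspects the running interpreter.
        "python3": sys.version_info.major == 3,
        # shutil.which returns None when the executable is not on PATH.
        "cmake": shutil.which("cmake") is not None,
        "gcc": shutil.which("gcc") is not None,
    }


if __name__ == "__main__":
    for tool, found in check_prerequisites().items():
        print(f"{tool}: {'found' if found else 'missing'}")
```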

Installation

To install fastLLaMa using pip:

pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main

Usage

Importing the package

To import fastLLaMa, just run

from fastllama import Model

Initializing the model

MODEL_PATH = "./models/7B/ggml-model-q4_0.bin"

model = Model(
    path=MODEL_PATH,  # path to the model
    num_threads=8,  # number of threads to use
    n_ctx=512,  # context size of the model
    last_n_size=64,  # size of last n tokens (used for repetition penalty) (optional)
    seed=0,  # seed for the random number generator (optional)
    n_batch=128,  # batch size (optional)
    use_mmap=False,  # use mmap to load the model (optional)
)

Ingesting prompts

prompt = """Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: """

res = model.ingest(prompt, is_system_prompt=True)  # ingest the prompt into the model

Generating output

def stream_token(x: str) -> None:
    """
    This function is called by the library to stream tokens
    """
    print(x, end='', flush=True)

res = model.generate(
    num_tokens=100,
    top_p=0.95,  # top p sampling (optional)
    temp=0.8,  # temperature (optional)
    repeat_penalty=1.0,  # repetition penalty (optional)
    streaming_fn=stream_token,  # streaming function
    stop_words=["User:", "\n"],  # stop generation when one of these is encountered (optional)
)
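
The streaming function above only prints each piece of text as it arrives. To also keep the full response, the callback can append to a list. This is a minimal sketch: make_collector is a hypothetical helper, not part of the library, and the loop below stands in for the library calling the callback, so it runs without a model.

```python
from typing import Callable, List


def make_collector(sink: List[str]) -> Callable[[str], None]:
    """Return a callback that prints each piece of text and stores it in `sink`."""
    def stream_token(x: str) -> None:
        sink.append(x)
        print(x, end="", flush=True)
    return stream_token


# Stand-in for the library invoking the callback once per generated piece.
tokens: List[str] = []
callback = make_collector(tokens)
for piece in ["The", " largest", " city"]:
    callback(piece)

full_text = "".join(tokens)
```

With a real model, the same callback would be passed as streaming_fn = make_collector(tokens), and the joined list read after generate returns.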

Loading the model using multiple threads

model = Model(
    path=MODEL_PATH,  # path to the model
    num_threads=8,  # number of threads to use
    n_ctx=512,  # context size of the model
    last_n_size=64,  # size of last n tokens (used for repetition penalty) (optional)
    seed=0,  # seed for the random number generator (optional)
    n_batch=128,  # batch size (optional)
    load_parallel=True,  # load the model using multiple threads
)
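
The examples above hard-code num_threads to 8. One possible way to derive it from the machine instead is sketched below. The helper default_thread_count and its reserve parameter are hypothetical, and whether the logical CPU count is the best choice for a given workload is an assumption to verify by measurement.

```python
import os


def default_thread_count(reserve: int = 0) -> int:
    """Pick a thread count from the logical CPU count, never below 1."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpus - reserve)


# Keyword arguments matching the parameter names used in this README.
model_kwargs = {
    "path": "./models/7B/ggml-model-q4_0.bin",
    "num_threads": default_thread_count(),
    "n_ctx": 512,
    "load_parallel": True,
}
```

The dictionary can then be unpacked into the constructor as Model(**model_kwargs).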

Saving the model state

To cache the session, you can use the save_state method.

res = model.save_state("./models/fast_llama.bin")

Loading the model state

To load the session, use the load_state method.

res = model.load_state("./models/fast_llama.bin")
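
Saving and loading can be combined so that a long system prompt is ingested only once. The sketch below relies only on the ingest, save_state, and load_state methods named in this README. The helper warm_start and the FakeModel class are hypothetical; FakeModel exists only so the example runs without model weights, and the return values of the real methods are not assumed.

```python
import os


def warm_start(model, prompt: str, state_path: str) -> str:
    """Load a cached session if one exists; otherwise ingest the prompt and cache it.

    Returns "loaded" or "ingested" so the caller can tell which path was taken.
    """
    if os.path.exists(state_path):
        model.load_state(state_path)
        return "loaded"
    model.ingest(prompt, is_system_prompt=True)
    model.save_state(state_path)
    return "ingested"


class FakeModel:
    """Hypothetical stand-in that stores the prompt text as its 'state'."""

    def ingest(self, prompt, is_system_prompt=False):
        self.prompt = prompt

    def save_state(self, path):
        with open(path, "w") as f:
            f.write(self.prompt)

    def load_state(self, path):
        with open(path) as f:
            self.prompt = f.read()
```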

Resetting the model state

To reset the session, use the reset method.

model.reset()

Attaching LoRA adapters to the base model at runtime

To attach a LoRA adapter at runtime, use the attach_lora method.

LORA_ADAPTER_PATH = "./models/ALPACA-7B-ADAPTER/ggml-adapter-model.bin"

model.attach_lora(LORA_ADAPTER_PATH)

NOTE: It is a good idea to reset the model state after attaching a LoRA adapter.

Detaching LoRA adapters from the base model at runtime

To detach the LoRA adapter at runtime, use the detach_lora method.

model.detach_lora()
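
The attach, reset, and detach calls can be wrapped in a context manager so the adapter is always detached, even if generation raises. This is a sketch built on the three method names in this README; lora_attached and RecordingModel are hypothetical, and RecordingModel only records calls so the example runs without a model.

```python
from contextlib import contextmanager


@contextmanager
def lora_attached(model, adapter_path: str):
    """Attach an adapter, reset the state as the note recommends, detach on exit."""
    model.attach_lora(adapter_path)
    model.reset()
    try:
        yield model
    finally:
        model.detach_lora()


class RecordingModel:
    """Hypothetical stand-in that records method calls instead of running a model."""

    def __init__(self):
        self.calls = []

    def attach_lora(self, path):
        self.calls.append(("attach_lora", path))

    def reset(self):
        self.calls.append(("reset",))

    def detach_lora(self):
        self.calls.append(("detach_lora",))


demo = RecordingModel()
with lora_attached(demo, "./models/ALPACA-7B-ADAPTER/ggml-adapter-model.bin"):
    pass  # with a real model, ingest and generate here
```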

Calculating perplexity

To calculate the perplexity, use the perplexity method.

with open("test.txt", "r") as f:
    data = f.read(8000)

total_perplexity = model.perplexity(data)
print(f"Total Perplexity: {total_perplexity:.4f}")
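
For reference, perplexity is conventionally defined as the exponential of the mean negative log-probability the model assigns to each observed token. The sketch below computes that standard definition from a list of per-token probabilities. It illustrates the metric only; whether the perplexity method computes it exactly this way (for example, how it chunks the text) is not stated here.

```python
import math
from typing import Sequence


def perplexity_from_probs(token_probs: Sequence[float]) -> float:
    """Return exp of the mean negative log-probability of the observed tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every token is as uncertain as a
# uniform choice among four options, so the result is close to 4.
uniform_ppl = perplexity_from_probs([0.25, 0.25, 0.25, 0.25])
```

Lower values mean the model found the text more predictable.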

Getting the model's embeddings

To get the model's embeddings, use the get_embeddings method.

embeddings = model.get_embeddings()
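
A common use of embeddings is comparing two texts by cosine similarity. The sketch below assumes the embeddings are available as flat sequences of floats of equal length; the actual return type of get_embeddings is not documented here, so treat that as an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```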

Getting the model's logits

To get the model's logits, use the get_logits method.

logits = model.get_logits()
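
Logits are unnormalized scores; a softmax turns them into a probability distribution over the vocabulary. The sketch below shows the standard softmax with a temperature divisor, assuming the logits are a flat sequence of floats. It is an illustration of the math, not a description of how the library applies its temp parameter internally.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temp: float = 1.0) -> List[float]:
    """Convert logits to probabilities; lower temp sharpens the distribution."""
    if temp <= 0:
        raise ValueError("temp must be positive")
    scaled = [x / temp for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.0], temp=0.8)
most_likely = max(range(len(probs)), key=probs.__getitem__)  # index of top token
```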

Using the logger

from fastllama import Logger

class MyLogger(Logger):
    def __init__(self):
        super().__init__()
        self.file = open("logs.log", "w")

    def log_info(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see info logs
        print(f"[Info]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

    def log_err(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see error logs
        print(f"[Error]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

    def log_warn(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see warning logs
        print(f"[Warn]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

For more clarity, check the examples/python/ folder.

Running LLaMA

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# convert the 7B model to ggml FP16 format
# python [PythonFile] [ModelPath] [Floattype] [Vocab Only] [SplitType]
python3 scripts/convert-pth-to-ggml.py models/7B/ 1 0

# quantize the model to 4-bits
./build/src/quantize models/7B/ggml-model-f16.bin models/7B/ggml-model-q4_0.bin 2

# run the inference
# Run the scripts from the root dir of the project for now!
python ./examples/python/example.py
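
The convert and quantize steps above can also be driven from Python, which helps when preparing several model sizes. The sketch below only builds the argument lists, using the exact commands shown above with the model size substituted; conversion_commands is a hypothetical helper, and it assumes the other sizes follow the same directory layout as 7B.

```python
from typing import List


def conversion_commands(size: str = "7B") -> List[List[str]]:
    """Return the convert and quantize commands for models/<size>, in order."""
    model_dir = f"models/{size}"
    return [
        # Convert the original weights to ggml FP16 format.
        ["python3", "scripts/convert-pth-to-ggml.py", f"{model_dir}/", "1", "0"],
        # Quantize the FP16 model to 4 bits.
        [
            "./build/src/quantize",
            f"{model_dir}/ggml-model-f16.bin",
            f"{model_dir}/ggml-model-q4_0.bin",
            "2",
        ],
    ]


commands = conversion_commands("7B")
```

Each list can be passed to subprocess.run(cmd, check=True) from the root directory of the project.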

Running Alpaca-LoRA

# Before running this command
# You need to provide the HF model paths here
python ./scripts/export-from-huggingface.py
# Alternatively you can just download the ggml models from huggingface directly and run them!

python3 ./scripts/convert-pth-to-ggml.py models/ALPACA-LORA-7B 1 0

./build/src/quantize models/ALPACA-LORA-7B/ggml-model-f16.bin models/ALPACA-LORA-7B/alpaca-lora-q4_0.bin 2

python ./examples/python/example-alpaca.py

Using LoRA adapters at runtime

# Download lora adapters and paste them inside models folder
# https://huggingface.co/tloen/alpaca-lora-7b


python scripts/convert-lora-to-ggml.py models/ALPACA-7B-ADAPTER/ -t fp32
# Change -t to fp16 to use fp16 weights
# In order to use LoRA adapters without caching, pass the --no-cache flag
#   - Only supported for fp32 adapter weights

python examples/python/example-lora-adapter.py

# Make sure to set paths correctly for the base model and adapter inside the example
# Commands:
# load_lora: Attaches the adapter to the base model
# unload_lora: Detaches the adapter (detach for fp16 is yet to be added!)
# reset: Resets the model state

Running the web UI

To run the WebSocket server and the web UI, follow the instructions on the respective branches.

Memory/Disk Requirements

Because the models are currently loaded fully into memory, you will need enough disk space to store them and enough RAM to load them. At the moment, the memory and disk requirements are the same.

Model size	Original size	Quantized size (4-bit)
7B	13 GB	3.9 GB
13B	24 GB	7.8 GB
30B	60 GB	19.5 GB
65B	120 GB	38.5 GB

Info: The runtime may require extra memory during inference!
(Depends on the hyperparameters used during model initialization)
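
The table above can be turned into a quick pre-flight check. The sketch below copies the sizes from the table and compares them against free disk space; required_gb and fits_on_disk are hypothetical helpers, sizes are treated as decimal gigabytes, and the extra memory needed during inference is not accounted for.

```python
import shutil

# Sizes in GB from the table above: (original, quantized 4-bit).
MODEL_SIZES_GB = {
    "7B": (13.0, 3.9),
    "13B": (24.0, 7.8),
    "30B": (60.0, 19.5),
    "65B": (120.0, 38.5),
}


def required_gb(model: str, quantized: bool = True) -> float:
    """Look up the space a model needs according to the table."""
    original, q4 = MODEL_SIZES_GB[model.upper()]
    return q4 if quantized else original


def fits_on_disk(model: str, path: str = ".", quantized: bool = True) -> bool:
    """Check whether the filesystem containing `path` has enough free space."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb(model, quantized)
```

Since the memory and disk requirements are stated to be the same, the same lookup gives a lower bound on the RAM needed to load the model.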

Contributing

Contributors can open PRs
Collaborators can push to branches in the repo and merge PRs into the main branch
Collaborators will be invited based on contributions
Any help with managing issues and PRs is very appreciated!
Make sure to read about our vision

Notes

Tested on
- Hardware: Apple Silicon, Intel, ARM (pending)
- OS: macOS, Linux, Windows (pending), Android (pending)