
fastLLaMa

fastLLaMa is an experimental, high-performance framework designed to tackle the challenges of deploying large language models (LLMs) in production environments.

It offers a user-friendly Python interface to the C++ library llama.cpp, enabling developers to create custom workflows, implement adaptable logging, and switch contexts seamlessly between sessions. The framework is geared towards running LLMs efficiently at scale, with ongoing work on features such as optimized cold boot times, Int4 support for NVIDIA GPUs, model artifact management, and support for multiple programming languages.

                ___            __    _    _         __ __      
                | | '___  ___ _| |_ | |  | |   ___ |     ___ 
                | |-<_> |<_-<  | |  | |_ | |_ <_> ||     |<_> |
                |_| <___|/__/  |_|  |___||___|<___||_|_|_|<___|
                                                            
                                                                                        
                                                                           
                                                       .+*+-.                
                                                      -%#--                  
                                                    :=***%*++=.              
                                                   :+=+**####%+              
                                                   ++=+*%#                   
                                                  .*+++==-                   
                  ::--:.                           .**++=::                   
                 #%##*++=......                    =*+==-::                   
                .@@@*@%*==-==-==---:::::------::==*+==--::                   
                 %@@@@+--====+===---=---==+=======+++----:                   
                 .%@@*++*##***+===-=====++++++*++*+====++.                   
                 :@@%*##%@@%#*%#+==++++++=++***==-=+==+=-                    
                  %@%%%%%@%#+=*%*##%%%@###**++++==--==++                     
                  #@%%@%@@##**%@@@%#%%%%**++*++=====-=*-                     
                  -@@@@@@@%*#%@@@@@@@%%%%#+*%#++++++=*+.                     
                   +@@@@@%%*-#@@@@@@@@@@@%%@%**#*#+=-.                       
                    #%%###%:  ..+#%@@@@%%@@@@%#+-                            
                    :***#*-         ...  *@@@%*+:                            
                     =***=               -@%##**.                            
                    :#*++                -@#-:*=.                            
                     =##-                .%*..##                             
                      +*-                 *:  +-                             
                      :+-                :+   =.                             
                       =-.               *+   =-                             
                        :-:-              =--  :::

Features

Supported Models

Requirements

CMake
- For Linux:
  sudo apt-get -y install cmake
- For OS X:
  brew install cmake
- For Windows:
  Download the cmake-*.exe installer from the download page and run it.
GCC 11 or greater
Minimum C++ 17
Python 3.x

Installation

To install fastLLaMa using pip:

pip install git+https://github.com/PotatoSpudowski/fastLLaMa.git@main

Usage

Importing the Package

To import fastLLaMa, run:

from fastllama import Model

Initializing the Model

MODEL_PATH = "./models/7B/ggml-model-q4_0.bin"

model = Model(
    path=MODEL_PATH,  # path to model
    num_threads=8,  # number of threads to use
    n_ctx=512,  # context size of model
    last_n_size=64,  # size of last n tokens (used for repetition penalty) (Optional)
    seed=0,  # seed for random number generator (Optional)
    n_batch=128,  # batch size (Optional)
    use_mmap=False,  # use mmap to load model (Optional)
)

Ingesting Prompts

prompt = """Transcript of a dialog, where the User interacts with an Assistant named Bob. Bob is helpful, kind, honest, good at writing, and never fails to answer the User's requests immediately and with precision.

User: Hello, Bob.
Bob: Hello. How may I help you today?
User: Please tell me the largest city in Europe.
Bob: Sure. The largest city in Europe is Moscow, the capital of Russia.
User: """

res = model.ingest(prompt, is_system_prompt=True)  # ingest the prompt into the model

Generating Output

def stream_token(x: str) -> None:
    """
    This function is called by the library to stream tokens
    """
    print(x, end='', flush=True)

res = model.generate(
    num_tokens=100,
    top_p=0.95,  # top p sampling (Optional)
    temp=0.8,  # temperature (Optional)
    repeat_penalty=1.0,  # repetition penalty (Optional)
    streaming_fn=stream_token,  # streaming function
    stop_words=["User:", "\n"],  # stop generation when one of these words is encountered (Optional)
)
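
The streaming callback above only prints each token. If the generated text is also needed afterwards, a callable object can keep a copy while echoing. The following is a minimal sketch; the class name TokenCollector and its echo flag are illustrative and not part of fastLLaMa. The only assumption taken from the example above is that streaming_fn is called with one str per token.

```python
from typing import List


class TokenCollector:
    """Callable that optionally echoes each streamed token and keeps a copy.

    Hypothetical helper, not part of fastLLaMa. It relies only on the
    callback signature shown above: one str argument, no return value.
    """

    def __init__(self, echo: bool = True) -> None:
        self.tokens: List[str] = []
        self.echo = echo

    def __call__(self, token: str) -> None:
        # Store the token first so nothing is lost if printing fails.
        self.tokens.append(token)
        if self.echo:
            print(token, end="", flush=True)

    @property
    def text(self) -> str:
        # Tokens arrive as text fragments, so plain concatenation rebuilds the output.
        return "".join(self.tokens)
```

An instance could then be passed as streaming_fn=collector in the generate call, and collector.text read once generation has finished.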

Loading the Model Using Multiple Threads

model = Model(
    path=MODEL_PATH,  # path to model
    num_threads=8,  # number of threads to use
    n_ctx=512,  # context size of model
    last_n_size=64,  # size of last n tokens (used for repetition penalty) (Optional)
    seed=0,  # seed for random number generator (Optional)
    n_batch=128,  # batch size (Optional)
    load_parallel=True,  # load the model using multiple threads
)

Saving Model State

To save the session, use the save_state method.

res = model.save_state("./models/fast_llama.bin")

Loading Model State

To load the session, use the load_state method.

res = model.load_state("./models/fast_llama.bin")
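
One way to combine the two calls is to ingest a long system prompt once and reuse the saved session on later runs. The sketch below assumes only the ingest, save_state, and load_state methods shown in this README. The helper name restore_or_ingest and the cache_path parameter are hypothetical, and the return values of the library calls are ignored because this README does not document them.

```python
import os


def restore_or_ingest(model, prompt: str, cache_path: str) -> bool:
    """Load a saved session if one exists, otherwise ingest the prompt and save it.

    Returns True when the state came from cache_path and False when the
    prompt had to be ingested. `model` is expected to expose the ingest,
    save_state and load_state methods shown in this README.
    """
    if os.path.exists(cache_path):
        # A previous run already paid the ingestion cost.
        model.load_state(cache_path)
        return True
    model.ingest(prompt, is_system_prompt=True)
    model.save_state(cache_path)
    return False
```

The first call with a given cache_path ingests and saves; later calls only load the saved state.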

Resetting the Model State

To reset the session, use the reset method.

model.reset()

Attaching LoRA Adapters to the Base Model at Runtime

To attach a LoRA adapter at runtime, use the attach_lora method.

LORA_ADAPTER_PATH = "./models/ALPACA-7B-ADAPTER/ggml-adapter-model.bin"

model.attach_lora(LORA_ADAPTER_PATH)

Note: It is a good idea to reset the model state after attaching a LoRA adapter.

Detaching LoRA Adapters from the Base Model at Runtime

To detach a LoRA adapter at runtime, use the detach_lora method.

model.detach_lora()

Calculating Perplexity

To calculate the perplexity, use the perplexity method.

with open("test.txt", "r") as f:
    data = f.read(8000)

total_perplexity = model.perplexity(data)
print(f"Total Perplexity: {total_perplexity:.4f}")
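
For readers unfamiliar with the metric, perplexity is conventionally defined as the exponential of the negative mean log-probability the model assigns to each token. The sketch below implements that textbook definition on a plain list of log-probabilities. It is not taken from fastLLaMa, and whether model.perplexity computes exactly this quantity is an assumption.

```python
import math
from typing import Sequence


def perplexity_from_log_probs(log_probs: Sequence[float]) -> float:
    """Textbook perplexity: exp(-(1/N) * sum(log p_i)) over N tokens.

    `log_probs` holds the natural-log probability assigned to each
    observed token. Illustrative helper, not a fastLLaMa function.
    """
    if not log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(log_probs) / len(log_probs))
```

A model that assigns probability 0.25 to every token has a perplexity of about 4 under this definition, so lower values mean the text was less surprising to the model.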

Getting the Embeddings of the Model

To get the embeddings of the model, use the get_embeddings method.

embeddings = model.get_embeddings()
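
A common use of embeddings is comparing two pieces of text by cosine similarity. The sketch below assumes that get_embeddings returns a flat sequence of floats, which this README does not state; if it returns another structure, it would need converting first. The function name cosine_similarity is illustrative.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors.

    Illustrative helper; assumes embeddings are flat sequences of floats.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Values near 1 indicate vectors pointing in nearly the same direction, and values near 0 indicate unrelated directions.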

Getting the Logits of the Model

To get the logits of the model, use the get_logits method.

logits = model.get_logits()
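
Logits are unnormalized scores, so they are usually passed through a softmax before being read as probabilities. The sketch below shows a numerically stable softmax with a temperature parameter, which is how a temperature setting is generally understood to reshape the distribution. It assumes the logits are a flat sequence of floats, one per vocabulary entry, and it is not fastLLaMa's sampling code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temp: float = 1.0) -> List[float]:
    """Convert logits to probabilities, dividing by a temperature first.

    Illustrative helper. Lower temperatures sharpen the distribution,
    higher temperatures flatten it.
    """
    if temp <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temp for x in logits]
    # Subtract the maximum so math.exp never overflows.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

The index of the largest probability is the token a greedy decoder would choose.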

Using the Logger

from fastllama import Logger

class MyLogger(Logger):
    def __init__(self):
        super().__init__()
        self.file = open("logs.log", "w")

    def log_info(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see info logs
        print(f"[Info]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

    def log_err(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see error logs
        print(f"[Error]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

    def log_warn(self, func_name: str, message: str) -> None:
        # Modify this to do whatever you want when you see warning logs
        print(f"[Warn]: Func('{func_name}') {message}", flush=True, end='', file=self.file)

For more clarity, check the examples/python/ folder.

Running LLaMA

# obtain the original LLaMA model weights and place them in ./models
ls ./models
65B 30B 13B 7B tokenizer_checklist.chk tokenizer.model

# convert the 7B model to ggml FP16 format
# python [PythonFile] [ModelPath] [Floattype] [Vocab Only] [SplitType]
python3 scripts/convert-pth-to-ggml.py models/7B/ 1 0

# quantize the model to 4-bits
./build/src/quantize models/7B/ggml-model-f16.bin models/7B/ggml-model-q4_0.bin 2

# run the inference
# Run the scripts from the root dir of the project for now!
python ./examples/python/example.py
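
When several model sizes need converting, the two commands above can be generated per directory rather than typed by hand. The sketch below only builds the argument lists and does not run anything. The function name build_conversion_commands is hypothetical, and the script paths and numeric arguments are copied unchanged from the commands above.

```python
from pathlib import PurePosixPath
from typing import List


def build_conversion_commands(model_dir: str) -> List[List[str]]:
    """Return the convert and quantize commands for one model directory.

    Illustrative helper. The arguments "1", "0" and "2" mirror the
    commands shown above and are not interpreted here.
    """
    directory = PurePosixPath(model_dir)
    f16_path = str(directory / "ggml-model-f16.bin")
    q4_path = str(directory / "ggml-model-q4_0.bin")
    return [
        # Step 1: convert the original weights to ggml FP16 format.
        ["python3", "scripts/convert-pth-to-ggml.py", f"{directory}/", "1", "0"],
        # Step 2: quantize the FP16 file to 4 bits.
        ["./build/src/quantize", f16_path, q4_path, "2"],
    ]
```

Each list can be handed to subprocess.run(cmd, check=True) from the project root, in order, since the second command reads the file the first one writes.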

Running alpaca-lora

# Before running this command
# You need to provide the HF model paths here
python ./scripts/export-from-huggingface.py
# Alternatively you can just download the ggml models from huggingface directly and run them! 

python3 ./scripts/convert-pth-to-ggml.py models/ALPACA-LORA-7B 1 0

./build/src/quantize models/ALPACA-LORA-7B/ggml-model-f16.bin models/ALPACA-LORA-7B/alpaca-lora-q4_0.bin 2

python ./examples/python/example-alpaca.py

Using LoRA Adapters at Runtime

# Download lora adapters and paste them inside models folder
# https://huggingface.co/tloen/alpaca-lora-7b


python scripts/convert-lora-to-ggml.py models/ALPACA-7B-ADAPTER/ -t fp32 
# Change -t to fp16 to use fp16 weights
# In order to use LoRA adapters without caching, pass the --no-cache flag
#   - Only supported for fp32 adapter weights

python examples/python/example-lora-adapter.py

# Make sure to set paths correctly for the base model and adapter inside the example
# Commands: 
# load_lora: Attaches the adapter to the base model 
# unload_lora: Detaches the adapter (detaching for fp16 is yet to be added!)
# reset: Resets the model state

Running the webUI

To run the WebSocket server and the webUI, follow the instructions on the respective branches.

Memory/Disk Requirements

Because the models are currently loaded fully into memory, you will need enough disk space to store them and enough RAM to load them. At the moment, the memory and disk requirements are the same.

Model size	Original size	Quantized size (4-bit)
7B	13 GB	3.9 GB
13B	24 GB	7.8 GB
30B	60 GB	19.5 GB
65B	120 GB	38.5 GB

Info: Runtime may require extra memory during inference!
(Depends on the hyperparameters used during model initialization)
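
The table can be turned into a quick check of which model a machine can hold. The sketch below hard-codes the sizes listed above; the dictionary and function names are illustrative. Because inference may need extra memory on top of the file size, the result should be read as an optimistic estimate.

```python
from typing import Dict, Optional

# Sizes in GB, copied from the table above.
ORIGINAL_SIZE_GB: Dict[str, float] = {"7B": 13, "13B": 24, "30B": 60, "65B": 120}
QUANTIZED_SIZE_GB: Dict[str, float] = {"7B": 3.9, "13B": 7.8, "30B": 19.5, "65B": 38.5}


def largest_model_that_fits(
    available_gb: float, sizes: Optional[Dict[str, float]] = None
) -> Optional[str]:
    """Return the largest listed model whose file fits in available_gb.

    Illustrative helper. Defaults to the 4-bit sizes and ignores the
    extra memory that inference may require.
    """
    table = QUANTIZED_SIZE_GB if sizes is None else sizes
    fitting = [name for name, size in table.items() if size <= available_gb]
    if not fitting:
        return None
    return max(fitting, key=lambda name: table[name])
```

With 16 GB free, for example, the largest 4-bit model that fits is 13B, while none of the original-size models above 7B would load.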

Contributing

Contributors can open PRs
Collaborators can push to branches in the repo and merge PRs into the main branch
Collaborators will be invited based on contributions
Any help with managing issues and PRs is very appreciated!
Make sure to read about our vision

Notes

Tested on
- Hardware: Apple Silicon, Intel, ARM (pending)
- OS: macOS, Linux, Windows (pending), Android (pending)
