AutoGPTQ

v0.7.1: patch release

AutoGPTQ development has stopped. Please switch to GPTQModel, which is a drop-in replacement.

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).
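To give a feel for what weight-only quantization stores, here is a minimal pure-Python sketch of plain round-to-nearest group quantization. It is not the GPTQ algorithm itself (GPTQ additionally compensates for rounding error while quantizing each layer), and the names `quantize_group`, `dequantize_group` and `quantize_row` are illustrative only, not part of the auto_gptq API.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest asymmetric quantization of one group of weights.

    Returns (codes, scale, zero), where codes are ints in [0, 2**bits - 1].
    """
    qmax = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant group, which would give a zero scale.
    scale = (hi - lo) / qmax if hi > lo else 1.0
    zero = lo
    codes = [min(qmax, max(0, round((w - zero) / scale))) for w in weights]
    return codes, scale, zero


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + zero for c in codes]


def quantize_row(row, bits=4, group_size=128):
    """Quantize a row of weights in consecutive groups of `group_size`."""
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


if __name__ == "__main__":
    row = [(-1) ** i * i / 255 for i in range(256)]
    groups = quantize_row(row, bits=4, group_size=128)
    restored = [w for g in groups for w in dequantize_group(*g)]
    max_err = max(abs(a - b) for a, b in zip(row, restored))
    print(f"{len(groups)} groups, max abs error {max_err:.4f}")
```

Each group keeps its own scale and zero point, so a smaller `group_size` tracks the weights more closely at the cost of storing more metadata.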

News or Update

2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with Marlin int4*fp16 matrix multiplication kernel support, enabled with the argument use_marlin=True when loading models.
2023-08-23 - (News) - 🤗 Transformers, optimum and peft have integrated auto-gptq, so running and training GPTQ models is now more accessible to everyone! See this blog and its resources for more details!

For more history, please turn to here.

Performance Comparison

Inference Speed

The results were generated using this script. The input batch size is 1, the decoding strategy is beam search, and the model is forced to generate 512 tokens. The speed metric is tokens/s (the larger, the better).
The quantized model is loaded using the setup that gives the fastest inference speed.

model	GPU	num_beams	fp16 (tokens/s)	gptq-int4 (tokens/s)
llama-7b	1xA100-40G	1	18.87	25.53
llama-7b	1xA100-40G	4	68.79	91.30
moss-moon 16b	1xA100-40G	1	12.48	15.25
moss-moon 16b	1xA100-40G	4	OOM	42.67
moss-moon 16b	2xA100-40G	1	6.83	6.78
moss-moon 16b	2xA100-40G	4	13.10	10.80
gpt-j 6b	1xRTX3060-12G	1	OOM	29.55
gpt-j 6b	1xRTX3060-12G	4	OOM	47.36
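The relative gain is easier to read as a ratio. The sketch below recomputes the int4/fp16 speedup from the table; it assumes the rows without an FP16 number are out-of-memory runs and stores them as `None`. The names `RESULTS` and `speedup` are illustrative.

```python
# (model, gpu, num_beams, fp16 tokens/s, gptq-int4 tokens/s)
# None marks a run with no FP16 number (assumed out of memory).
RESULTS = [
    ("llama-7b", "1xA100-40G", 1, 18.87, 25.53),
    ("llama-7b", "1xA100-40G", 4, 68.79, 91.30),
    ("moss-moon 16b", "1xA100-40G", 1, 12.48, 15.25),
    ("moss-moon 16b", "1xA100-40G", 4, None, 42.67),
    ("moss-moon 16b", "2xA100-40G", 1, 6.83, 6.78),
    ("moss-moon 16b", "2xA100-40G", 4, 13.10, 10.80),
    ("gpt-j 6b", "1xRTX3060-12G", 1, None, 29.55),
    ("gpt-j 6b", "1xRTX3060-12G", 4, None, 47.36),
]


def speedup(fp16, int4):
    """Return the int4/fp16 throughput ratio, or None without an FP16 baseline."""
    if fp16 is None:
        return None
    return int4 / fp16


if __name__ == "__main__":
    for model, gpu, beams, fp16, int4 in RESULTS:
        ratio = speedup(fp16, int4)
        shown = "n/a" if ratio is None else f"{ratio:.2f}x"
        print(f"{model:14s} {gpu:14s} beams={beams} speedup={shown}")
```

On the single-GPU rows the quantized model is faster, while on the two-GPU moss-moon rows the ratio drops below 1.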

Perplexity

For perplexity comparison, you can turn to here and here.

Installation

AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:

Platform version	Installation	Built against PyTorch
CUDA 11.8	`pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/`	2.2.1+cu118
CUDA 12.1	`pip install auto-gptq --no-build-isolation`	2.2.1+cu121
ROCm 5.7	`pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/`	2.2.1+rocm5.7

AutoGPTQ can be installed with the Triton dependency using `pip install auto-gptq[triton] --no-build-isolation` in order to use the Triton backend (currently Linux only, and without 3-bit quantization support).

For older AutoGPTQ versions, please refer to the installation table for previous releases.

On NVIDIA systems, AutoGPTQ does not support Maxwell or older GPUs.

Install from source

Clone the source code:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ

A few packages are required in order to build from source: pip install numpy gekko pandas

Then, install locally from source:

pip install -vvv --no-build-isolation -e .

You can set BUILD_CUDA_EXT=0 to disable building the PyTorch extension, but this is strongly discouraged because AutoGPTQ then falls back to a slow Python implementation.

As a last resort, if the above command fails, you can try python setup.py install.

On ROCm systems

To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:

ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .

Compilation can be sped up by setting the PYTORCH_ROCM_ARCH variable (reference) so that the build targets a single device architecture, for example gfx90a for MI200-series devices.

On ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.
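If the build is driven from a script rather than typed in a shell, the two variables can be set on a copy of the environment. This is only a sketch: `rocm_build_env` and `BUILD_COMMAND` are hypothetical names, and the values shown (5.6, gfx90a) are the examples from the text above, not recommendations.

```python
import os


def rocm_build_env(rocm_version, arch=None, base_env=None):
    """Return a copy of the environment with the ROCm build variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env["ROCM_VERSION"] = str(rocm_version)
    if arch is not None:
        # Restrict compilation to a single target architecture.
        env["PYTORCH_ROCM_ARCH"] = arch
    return env


# Same install command as in the text, split into arguments.
BUILD_COMMAND = ["pip", "install", "-vvv", "--no-build-isolation", "-e", "."]


if __name__ == "__main__":
    env = rocm_build_env("5.6", arch="gfx90a")
    print({k: env[k] for k in ("ROCM_VERSION", "PYTORCH_ROCM_ARCH")})
    # To actually build, run this from the AutoGPTQ checkout:
    # import subprocess
    # subprocess.run(BUILD_COMMAND, env=env, check=True)
```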

On Intel® Gaudi® 2 systems

Note: make sure you are on commit 65c2e15 or later.

To install from source for Intel Gaudi 2 HPUs, set the BUILD_CUDA_EXT=0 environment variable to disable building the CUDA PyTorch extension. Example:

BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .

Note that Intel Gaudi 2 uses an optimized kernel at inference time, and BUILD_CUDA_EXT=0 is required on non-CUDA machines.

Quick Tour

Quantization and Inference

Warning: this is just a showcase of the basic APIs in AutoGPTQ. It uses only one sample to quantize a very small model, and the quality of a model quantized with so few samples may not be good.

Below is an example of the simplest use of auto_gptq to quantize a model and run inference after quantization:

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, log in first via huggingface-cli login,
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please refer to this script.
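Since a single calibration sample is only a showcase, a real run would pass many texts. The helper below is a minimal sketch of building the `examples` list in the shape the quantization step expects (dicts with only `input_ids` and `attention_mask`). `build_examples` is a hypothetical name, and the whitespace tokenizer in the demo is a stand-in so the snippet runs without downloading a model.

```python
def build_examples(texts, tokenizer):
    """Tokenize each text and keep only the keys the quantizer expects."""
    examples = []
    for text in texts:
        encoded = tokenizer(text)
        examples.append(
            {
                "input_ids": encoded["input_ids"],
                "attention_mask": encoded["attention_mask"],
            }
        )
    return examples


if __name__ == "__main__":
    # Stand-in tokenizer for demonstration only: one fake id per word.
    def toy_tokenizer(text):
        ids = list(range(len(text.split())))
        return {"input_ids": ids, "attention_mask": [1] * len(ids), "extra": None}

    calibration_texts = [
        "auto-gptq is an easy-to-use model quantization library.",
        "GPTQ is a weight-only quantization algorithm.",
    ]
    for example in build_examples(calibration_texts, toy_tokenizer):
        print(sorted(example), len(example["input_ids"]))
```

With a real tokenizer from `AutoTokenizer.from_pretrained`, the resulting list would be passed to `model.quantize(examples)` as in the example above.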

Customize Model

Below is an example of extending `auto_gptq` to support the `OPT` model. As you will see, it is very easy:

from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order when they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and the other methods as shown in the basic example.
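A "chained attribute name" such as `model.decoder.layers` is simply a dotted path walked from the top-level module. The sketch below shows one way such a path can be resolved; `resolve_chained_attr` is an illustrative helper, not necessarily how auto_gptq does it internally, and the toy object stands in for a real model.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_chained_attr(root, dotted_name):
    """Follow a dotted attribute path such as 'model.decoder.layers'."""
    return reduce(getattr, dotted_name.split("."), root)


if __name__ == "__main__":
    # Toy stand-in for a model with the same attribute nesting as OPT.
    toy = SimpleNamespace(
        model=SimpleNamespace(decoder=SimpleNamespace(layers=["layer0", "layer1"]))
    )
    print(resolve_chained_attr(toy, "model.decoder.layers"))
```

Checking a new model's paths this way before quantizing can catch a mistyped name early, since a wrong segment raises `AttributeError`.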

Evaluation on Downstream Tasks

You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.

The predefined tasks support all causal language models implemented in 🤗 Transformers and in this project.

Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence-classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset:

from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
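To see what the preprocessing step hands to the task, the snippet below runs the same kind of batch-to-prompt conversion on two made-up rows without loading any model or dataset. It is a standalone sketch: the sample texts are invented, and the newline escapes in `TEMPLATE` are an assumption about the intended prompt layout.

```python
# Assumed prompt layout; the newline escapes are an assumption.
TEMPLATE = (
    "Question:What's the sentiment of the given text? "
    "Choices are {labels}.\nText: {text}\nAnswer:"
)
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    """Turn a batch of raw columns into prompt/label columns."""
    prompts = [TEMPLATE.format(labels=LABELS, text=t) for t in samples["text"]]
    labels = [ID2LABEL[i] for i in samples["label"]]
    return {"prompt": prompts, "label": labels}


if __name__ == "__main__":
    # Invented rows in the batched column format used by datasets.Dataset.map.
    batch = {"text": ["great day", "meh"], "label": [2, 1]}
    out = ds_refactor_fn(batch)
    print(out["prompt"][0])
    print(out["label"])
```

Note that `{labels}` is filled with the Python list itself, so the prompt shows the choices in list notation.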

Learn More

Tutorials provide step-by-step guidance for integrating auto_gptq into your own project, along with some best-practice principles.

Examples provide plenty of example scripts that use auto_gptq in different ways.

Supported Models

You can compare model.config.model_type against the table below to check whether the model you use is supported by auto_gptq.
For example, the model_type of WizardLM, vicuna and gpt4all is llama in every case, so they are all supported by auto_gptq.

model type	quantization	inference	peft-lora	peft-ada-lora	peft-adaption_prompt
bloom
gpt2
gpt_neox					✅ requires this peft branch
gptj					✅ requires this peft branch
llama
moss					✅ requires this peft branch
opt
gpt_bigcode
codegen
falcon (RefinedWebModel/RefinedWeb)
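The lookup described above can be written as a small membership check. This is a sketch only: the set below is transcribed from the model-type column of the table, and `is_supported` is a hypothetical helper rather than an auto_gptq function.

```python
# model_type strings transcribed from the table above.
SUPPORTED_MODEL_TYPES = {
    "bloom",
    "gpt2",
    "gpt_neox",
    "gptj",
    "llama",
    "moss",
    "opt",
    "gpt_bigcode",
    "codegen",
    "falcon",
    "RefinedWebModel",
    "RefinedWeb",
}


def is_supported(model_type):
    """Return True if the given config.model_type appears in the table."""
    return model_type in SUPPORTED_MODEL_TYPES


if __name__ == "__main__":
    # In practice the string would come from model.config.model_type, e.g.:
    # from transformers import AutoConfig
    # model_type = AutoConfig.from_pretrained("facebook/opt-125m").model_type
    for model_type in ("llama", "opt", "t5"):
        print(model_type, is_supported(model_type))
```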

Supported Evaluation Tasks

Currently, auto_gptq supports LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more tasks will come soon!

Running tests

Tests can be run with:

pytest tests/ -s

FAQ

Which kernel is used by default?

AutoGPTQ defaults to using the exllamav2 int4*fp16 kernel for matrix multiplication.

How to use the Marlin kernel?

Marlin is an optimized int4*fp16 kernel that was recently proposed at https://github.com/ist-daslab/marlin. It is integrated into AutoGPTQ and used when a model is loaded with use_marlin=True. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
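The compute-capability restriction can be checked before asking for the Marlin kernel. Below is a minimal sketch of that check; `marlin_supported` is a hypothetical helper and the capability tuples in the demo are examples, not measurements.

```python
# Compute capabilities on which the Marlin kernel is stated to be available.
MARLIN_CAPABILITIES = {(8, 0), (8, 6)}


def marlin_supported(capability):
    """Return True if a (major, minor) compute capability can use Marlin."""
    return tuple(capability) in MARLIN_CAPABILITIES


if __name__ == "__main__":
    for cap in [(7, 5), (8, 0), (8, 6), (8, 9)]:
        print(cap, marlin_supported(cap))
```

With PyTorch installed, the tuple can be obtained from `torch.cuda.get_device_capability(0)`, and the result passed as `use_marlin=marlin_supported(cap)` when loading the quantized model.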

Acknowledgement

Special thanks to Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh for proposing the GPTQ algorithm and open-sourcing the code, and for releasing the Marlin kernel for mixed-precision computation.
Special thanks to qwopqwop200; the quantization-related code in this project is mostly based on GPTQ-for-LLaMa.
Special thanks to turboderp for releasing the Exllama and Exllama v2 libraries with efficient mixed-precision kernels.

Additional information

Version: v0.7.1: patch release
Category: Other source code
Updated: 2025-04-18
Size: 7.22MB
Source: GitHub
