AutoGPTQ development has stopped. Please switch to GPTQModel as a drop-in replacement.
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).
The argument use_marlin=True can now be passed when loading models to use the Marlin kernel. Hugging Face Transformers, optimum and peft have integrated auto-gptq, so running and training GPTQ models is now available to everyone! See this blog and its resources for more details! For more history, please turn to here.
The results were generated using this script; the batch size of the input is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).
The quantized model is loaded using the setup that gives the fastest inference speed (a rough sketch of these decoding settings follows the table below).
| model | GPU | num_beams | FP16 | GPTQ-INT4 |
|---|---|---|---|---|
| Llama-7b | 1XA100-40G | 1 | 18.87 | 25.53 |
| Llama-7b | 1XA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1XA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1XA100-40G | 4 | oom | 42.67 |
| moss-moon 16b | 2XA100-40G | 1 | 06.83 | 06.78 |
| moss-moon 16b | 2XA100-40G | 4 | 13.10 | 10.80 |
| GPT-J 6B | 1XRTX3060-12G | 1 | oom | 29.55 |
| GPT-J 6B | 1XRTX3060-12G | 4 | oom | 47.36 |
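As a rough illustration of the decoding setup described above (batch size 1, beam search, 512 forced new tokens, speed reported in tokens/s), a minimal timing sketch might look like the following. The checkpoint id is a placeholder rather than the model used for the table, and the benchmark script linked above should be preferred for reproducing the numbers.

```python
# Minimal sketch of the benchmark decoding settings (not the actual benchmark script).
# "TheBloke/Llama-2-7B-GPTQ" is a placeholder quantized checkpoint.
import time

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(
    **inputs,
    num_beams=4,         # the num_beams column of the table (1 or 4)
    min_new_tokens=512,  # force the model to generate 512 tokens
    max_new_tokens=512,
)
elapsed = time.time() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```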
For perplexity comparison, you can turn to here and here.
AutoGPTQ is available only on Linux and Windows. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:
| CUDA/ROCm version | Installation | Built against PyTorch |
|---|---|---|
| CUDA 11.8 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ | 2.2.1+cu118 |
| CUDA 12.1 | pip install auto-gptq --no-build-isolation | 2.2.1+cu121 |
| ROCm 5.7 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ | 2.2.1+rocm5.7 |
AutoGPTQ can be installed with the Triton dependency using pip install auto-gptq[triton] --no-build-isolation in order to be able to use the Triton backend (currently only supports Linux; 3-bit quantization is not supported).
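If the Triton extra is installed, the backend is selected when loading a quantized model rather than at install time. A minimal sketch, assuming an already-quantized model saved at the placeholder path opt-125m-4bit (the use_triton flag also appears in the basic usage example below):

```python
# Sketch: load a quantized model with the Triton backend (Linux only).
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",  # placeholder path to an already-quantized model
    device="cuda:0",
    use_triton=True,  # dispatch matmuls to the Triton kernels instead of the default CUDA kernels
)
```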
For older AutoGPTQ versions, please refer to the previous releases installation table.
On NVIDIA systems, AutoGPTQ does not support Maxwell or lower GPUs.
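As a quick sanity check before installing, the GPU's compute capability can be queried with plain PyTorch; Maxwell corresponds to compute capability 5.x, so anything below 6.0 falls under this limitation. A minimal sketch:

```python
# Sketch: check that the local NVIDIA GPU is newer than Maxwell (compute capability >= 6.0).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if major < 6:
    print("Maxwell or older GPU detected: not supported by AutoGPTQ.")
```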
Clone the source code:
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
A few packages are required in order to build from source: pip install numpy gekko pandas
Then, install locally from source:
pip install -vvv --no-build-isolation -e .
You can set BUILD_CUDA_EXT=0 to disable building the PyTorch CUDA extension, but this is strongly discouraged, as AutoGPTQ then falls back on a slow Python implementation.
As a last resort, if the above command fails, you can try python setup.py install.
To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:
ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .
The compilation can be sped up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200 series devices.
For ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.
Note: make sure that you are on 65c2e15 or later.
To install from source for Intel Gaudi 2 HPUs, set the BUILD_CUDA_EXT=0 environment variable to disable building the CUDA PyTorch extension. Example:
BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .
Note that Intel Gaudi 2 uses an optimized kernel upon inference and requires BUILD_CUDA_EXT=0 on non-CUDA machines.
Warning: this is only a showcase of the basic API usage in AutoGPTQ, which quantizes a tiny model using only one sample; the quality of a model quantized with such a small sample may not be good.
Below is an example of the simplest way to use auto_gptq to quantize a model and run inference after quantization:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference but the perplexity may be slightly worse
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, login first via huggingface-cli login.
# or pass explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please refer to this script.
from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order when they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and other methods as shown in Basic.
You can use tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on specific downstream tasks before and after quantization.
The predefined tasks support all causal language models implemented in Hugging Face transformers and in this project.
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)

Tutorials provide step-by-step guidance on integrating auto_gptq with your own project, along with some best-practice principles.
Examples provide plenty of example scripts that use auto_gptq in different ways.
You can compare model.config.model_type against the table below to check whether the model you are using is supported by auto_gptq. For example, the model_type of WizardLM, Vicuna and GPT4All is llama, so they are all supported by auto_gptq (a quick way to read model_type is sketched after the table).
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|---|---|---|---|---|---|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon (RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
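A quick way to read model_type without downloading the full weights is to load only the model's config; the checkpoint below is just an illustrative example of a llama-family model:

```python
# Sketch: look up model_type to check support against the table above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative checkpoint
print(config.model_type)  # "llama", which is listed in the table above
```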
Currently, auto_gptq supports LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more tasks will come soon!
Tests can be run with:
pytest tests/ -s
AutoGPTQ defaults to using the ExLlamaV2 int4*fp16 kernel for matrix multiplication.
Marlin is an optimized int4*fp16 kernel, recently proposed at https://github.com/ist-daslab/marlin. It is integrated in AutoGPTQ when loading a model with use_marlin=True. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
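Following the use_marlin=True flag mentioned above, loading a quantized model with the Marlin kernel might look like the sketch below. The path is a placeholder, and Marlin only applies to quantization configurations it supports (for example 4-bit weights) on compute capability 8.0/8.6 devices:

```python
# Sketch: load a 4-bit quantized model with the Marlin int4*fp16 kernel (Ampere GPUs only).
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",  # placeholder path to an already-quantized 4-bit model
    device="cuda:0",
    use_marlin=True,  # dispatch to the Marlin kernel instead of the default ExLlamaV2 kernel
)
```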