AutoGPTQ development has stopped. Please switch to GPTQModel as a drop-in replacement.
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).
The Marlin int4*fp16 kernel can now be used by loading models with use_marlin=True. auto-gptq has also been integrated into Hugging Face Transformers, so running and training GPTQ models is now available to everyone! See this blog and its resources for more details! For more history, please turn to here.
The results were generated with this script: the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).
The quantized model is loaded using the settings that give the fastest inference speed (see the sketch after the table).
| Model | GPU | num_beams | FP16 | GPTQ-INT4 |
|---|---|---|---|---|
| Llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| Llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 6.83 | 6.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| GPT-J 6B | 1xRTX3060-12G | 1 | OOM | 29.55 |
| GPT-J 6B | 1xRTX3060-12G | 4 | OOM | 47.36 |
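As a rough illustration of the generation settings described above (batch size 1, beam search, 512 forced tokens), a minimal sketch is shown below; it is not the actual benchmark script, and the model directory, prompt, and generation arguments are assumptions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# the quantized model directory and prompt are illustrative placeholders
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
# beam search (num_beams matches the 4-beam rows of the table) and exactly 512 generated tokens
outputs = model.generate(**inputs, num_beams=4, min_new_tokens=512, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```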
For a perplexity comparison, you can turn to here and here.
AutoGPTQ is available only on Linux and Windows. You can install the latest stable release from pip with pre-built wheels:
| CUDA/ROCm version | Installation | Built against PyTorch |
|---|---|---|
| CUDA 11.8 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ | 2.2.1+cu118 |
| CUDA 12.1 | pip install auto-gptq --no-build-isolation | 2.2.1+cu121 |
| ROCm 5.7 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ | 2.2.1+rocm5.7 |
AutoGPTQ can be installed with pip install auto-gptq[triton] --no-build-isolation in order to use the Triton backend (currently only supported on Linux; 3-bit quantization is not supported with Triton).
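If the Triton extra is installed, the backend can be enabled through the use_triton flag when loading a quantized model (the same flag appears in the basic usage example below). A minimal sketch, with an illustrative model directory:

```python
from auto_gptq import AutoGPTQForCausalLM

# load a quantized model with the Triton backend (Linux only); the directory name is illustrative
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0", use_triton=True)
```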
For older AutoGPTQ versions, please refer to the previous releases installation table.
On NVIDIA systems, AutoGPTQ does not support Maxwell or lower GPUs.
Clone the source code:
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
A few packages are required in order to build from source: pip install numpy gekko pandas.
Then, install locally from source:
pip install -vvv --no-build-isolation -e .
You can set BUILD_CUDA_EXT=0 to disable building the PyTorch CUDA extension, but this is strongly discouraged as AutoGPTQ then falls back on a slow pure-Python implementation.
As a last resort, if the above command fails, you can try python setup.py install.
To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:
ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .
Compilation can be sped up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200 series devices.
For ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.
Note: make sure you are on commit 65c2e15 or newer.
To install from source for Intel Gaudi 2 HPUs, set the BUILD_CUDA_EXT=0 environment variable to disable building the CUDA PyTorch extension. Example:
BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .
Note that Intel Gaudi 2 uses optimized kernels at inference time, and BUILD_CUDA_EXT=0 is required on non-CUDA machines.
Warning: this only shows the usage of the basic APIs in AutoGPTQ, quantizing a small model with only one sample; the quality of a model quantized with so few samples may not be good.
Below is an example of the simplest way to use auto_gptq to quantize a model and run inference after quantization:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference but the perplexity may be slightly worse
)

# load un-quantized model; by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, login first via huggingface-cli login.
# or pass explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please refer to this script.
Below is an example of extending auto_gptq to support the OPT model type by inheriting from BaseGPTQForCausalLM:

from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in the transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order in which they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and the other methods as shown in the basic usage example.
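As a rough usage sketch, the custom class simply takes the place of AutoGPTQForCausalLM; the quantize_config and calibration examples are assumed to be prepared as in the basic usage example above, and the save directory is illustrative:

```python
# quantize an OPT model with the custom class defined above;
# quantize_config and examples come from the basic usage example (assumption)
model = OPTGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)
model.quantize(examples)
model.save_quantized("opt-125m-4bit")
```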
You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on specific downstream tasks before and after quantization.
The predefined tasks support all causal language models implemented in Hugging Face Transformers and in this project.
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)

Tutorials provide step-by-step guidance for integrating auto_gptq into your own project, along with some best-practice principles.
Examples provide plenty of example scripts that use auto_gptq in different ways.
You can compare model.config.model_type against the table below to check whether the model you are using is supported by auto_gptq (see the sketch after the table). For example, the model_type of WizardLM, vicuna and gpt4all is llama, hence they are all supported by auto_gptq.
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|---|---|---|---|---|---|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon (RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
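A minimal sketch of the check described above (the model name here is only an example):

```python
from transformers import AutoConfig

# print the model_type and compare it against the table above
config = AutoConfig.from_pretrained("facebook/opt-125m")
print(config.model_type)  # prints "opt", which appears in the table above
```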
Currently, auto_gptq supports LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more tasks will come soon!
Tests can be run with:
pytest tests/ -s
AutoGPTQ defaults to using the exllamav2 int4*fp16 kernel for matrix multiplication.
Marlin is an optimized int4*fp16 kernel recently proposed at https://github.com/ist-daslab/marlin. It is integrated in AutoGPTQ and used when loading a model with use_marlin=True. This kernel is only available on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
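A minimal sketch of loading a quantized model with the Marlin kernel enabled, assuming an Ampere GPU and an illustrative quantized model directory:

```python
from auto_gptq import AutoGPTQForCausalLM

# use the Marlin int4*fp16 kernel (requires compute capability 8.0/8.6)
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0", use_marlin=True)
```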