AutoGPTQ

v0.7.1: patch release

AutoGPTQ development has stopped. Please switch to GPTQModel as a drop-in replacement.

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).

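News or Update

2024-02-15 - (News) - AutoGPTQ 0.7.0 is released with support for the Marlin int4*fp16 matrix multiplication kernel, enabled with the argument use_marlin=True when loading models.
2023-08-23 - (News) - Transformers, Optimum and PEFT have integrated auto-gptq, so GPTQ models can now be run and trained through those libraries. See the accompanying blog post and its resources for details.

For earlier news and updates, see the project history.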

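Performance Comparison

Inference Speed

The results were generated with the project's benchmark script. The input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (higher is better).
The quantized model is loaded with the setup that gives the fastest inference speed.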

model	GPU	num_beams	fp16	gptq-int4
llama-7b	1xA100-40G	1	18.87	25.53
llama-7b	1xA100-40G	4	68.79	91.30
moss-moon 16b	1xA100-40G	1	12.48	15.25
moss-moon 16b	1xA100-40G	4	OOM	42.67
moss-moon 16b	2xA100-40G	1	06.83	06.78
moss-moon 16b	2xA100-40G	4	13.10	10.80
gpt-j 6b	1xRTX3060-12G	1	OOM	29.55
gpt-j 6b	1xRTX3060-12G	4	OOM	47.36
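To read the table as relative speedups rather than raw throughput, the ratios can be worked out directly. The sketch below hard-codes only the rows that have both an FP16 and a GPTQ-int4 figure (rows without an FP16 figure are skipped); the names `ROWS` and `speedups` are illustrative and not part of AutoGPTQ.

```python
# Throughput figures (tokens/s) copied from the table above.
# Rows with no FP16 figure are omitted because no ratio can be computed.
ROWS = [
    # (model, gpu, num_beams, fp16, gptq_int4)
    ("llama-7b", "1xA100-40G", 1, 18.87, 25.53),
    ("llama-7b", "1xA100-40G", 4, 68.79, 91.30),
    ("moss-moon 16b", "1xA100-40G", 1, 12.48, 15.25),
    ("moss-moon 16b", "2xA100-40G", 1, 6.83, 6.78),
    ("moss-moon 16b", "2xA100-40G", 4, 13.10, 10.80),
]


def speedups(rows):
    """Return {(model, gpu, num_beams): gptq_int4 / fp16} for each row."""
    return {(m, g, b): int4 / fp16 for m, g, b, fp16, int4 in rows}


if __name__ == "__main__":
    for key, ratio in speedups(ROWS).items():
        print(key, f"{ratio:.2f}x")
```

A ratio above 1 means the quantized model generated tokens faster. In the two-GPU moss-moon rows the ratio falls below 1, so the int4 model was not faster in that setup.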

Perplexity

For perplexity comparisons, see the results linked from the upstream README.

Installation

AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:

Platform version	Installation	Built against PyTorch
CUDA 11.8	`pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/`	2.2.1+cu118
CUDA 12.1	`pip install auto-gptq --no-build-isolation`	2.2.1+cu121
ROCm 5.7	`pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/`	2.2.1+rocm5.7

To use the Triton backend, install AutoGPTQ with the Triton dependency: `pip install auto-gptq[triton] --no-build-isolation` (currently Linux only, with no 3-bit quantization).
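The choice between the install commands can be captured in a small lookup. This is a minimal sketch: the platform keys (`cu118`, `cu121`, `rocm5.7`) and the function name are made up for illustration, the index URLs are the ones from the table, and combining the Triton extra with an extra index URL is an assumption the text above does not confirm.

```python
# Extra wheel indexes copied from the installation table; CUDA 12.1 needs none.
# The dictionary keys are hypothetical labels chosen for this sketch.
EXTRA_INDEX = {
    "cu118": "https://huggingface.github.io/autogptq-index/whl/cu118/",
    "cu121": None,
    "rocm5.7": "https://huggingface.github.io/autogptq-index/whl/rocm573/",
}


def pip_command(platform, triton=False):
    """Build the pip install command (as an argument list) for a platform key."""
    if platform not in EXTRA_INDEX:
        raise ValueError(f"unknown platform: {platform!r}")
    package = "auto-gptq[triton]" if triton else "auto-gptq"
    cmd = ["pip", "install", package, "--no-build-isolation"]
    index = EXTRA_INDEX[platform]
    if index is not None:
        cmd += ["--extra-index-url", index]
    return cmd


if __name__ == "__main__":
    print(" ".join(pip_command("cu118")))
```

Returning a list rather than a string keeps the command safe to pass to `subprocess.run` without shell quoting.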

For older versions of AutoGPTQ, refer to the installation table for previous releases.

On NVIDIA systems, AutoGPTQ does not support Maxwell or older GPUs.

Install from source

Clone the source code:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ

A few packages are required to build from source: pip install numpy gekko pandas.

Then install locally from source:

pip install -vvv --no-build-isolation -e .

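You can set BUILD_CUDA_EXT=0 to disable building the PyTorch extension, but this is strongly discouraged because AutoGPTQ then falls back to a slow Python implementation.

As a last resort, if the above command fails, you can try python setup.py install.

On ROCm systems

To install from source for AMD GPUs supporting ROCm, specify the ROCM_VERSION environment variable. Example: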

ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .

Compilation can be sped up by setting the PYTORCH_ROCM_ARCH variable to build for a single target device, for example gfx90a for MI200 series devices.

On ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.

On Intel® Gaudi® 2 systems

Note: make sure you are on commit 65c2e15 or later.

To install from source for Intel Gaudi 2 HPUs, set the BUILD_CUDA_EXT=0 environment variable to disable building the CUDA PyTorch extension. Example:

BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .

Note that Intel Gaudi 2 uses an optimized kernel at inference time, and that BUILD_CUDA_EXT=0 is required on non-CUDA machines.
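The source builds above differ only in the environment variables passed to the same pip command. A minimal sketch of assembling that environment from Python follows; the helper `source_build_env` and its parameters are hypothetical, while the variable names (ROCM_VERSION, PYTORCH_ROCM_ARCH, BUILD_CUDA_EXT) are the ones named in the text.

```python
import os


def source_build_env(rocm_version=None, rocm_arch=None, build_cuda_ext=True, base=None):
    """Return a copy of the environment with the requested build variables set."""
    env = dict(os.environ if base is None else base)
    if rocm_version is not None:
        env["ROCM_VERSION"] = str(rocm_version)
    if rocm_arch is not None:
        env["PYTORCH_ROCM_ARCH"] = rocm_arch
    if not build_cuda_ext:
        # Needed on non-CUDA machines such as Intel Gaudi 2.
        env["BUILD_CUDA_EXT"] = "0"
    return env


# The install command shown above, as an argument list.
INSTALL_CMD = ["pip", "install", "-vvv", "--no-build-isolation", "-e", "."]

if __name__ == "__main__":
    env = source_build_env(rocm_version="5.6", rocm_arch="gfx90a")
    # To actually build, run from the cloned repository root:
    # import subprocess; subprocess.run(INSTALL_CMD, env=env, check=True)
    print({k: env[k] for k in ("ROCM_VERSION", "PYTORCH_ROCM_ARCH")})
```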

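Quick Tour

Quantization and Inference

Warning: this is only a showcase of the basic APIs in AutoGPTQ. It quantizes a very small model using a single calibration sample, so the quality of the resulting quantized model may be poor.

Below is the simplest example of using auto_gptq to quantize a model and run inference after quantization: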

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # False can significantly speed up inference, but perplexity may be slightly worse
)

# load the un-quantized model; by default it is always loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize the model; examples must be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save the quantized model
model.save_quantized(quantized_model_dir)

# save the quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push the quantized model to the Hugging Face Hub.
# to use use_auth_token=True, log in first via huggingface-cli login,
# or pass an explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load the quantized model onto the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download a quantized model from the Hugging Face Hub and load it onto the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use a pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced model quantization features, refer to the project's quantization example script.
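The quantization example notes that calibration examples must be a list of dicts whose keys can only be "input_ids" and "attention_mask". A pre-flight check along those lines might look like the sketch below; `validate_examples` is a hypothetical helper, not an AutoGPTQ function, and requiring "input_ids" to be present is an added assumption.

```python
from collections.abc import Mapping, Sequence

# Keys the quantization example says are allowed in each calibration example.
ALLOWED_KEYS = {"input_ids", "attention_mask"}


def validate_examples(examples):
    """Raise ValueError unless examples is a non-empty list of valid dict-like items."""
    if not isinstance(examples, Sequence) or len(examples) == 0:
        raise ValueError("examples must be a non-empty list")
    for i, example in enumerate(examples):
        if not isinstance(example, Mapping):
            raise ValueError(f"example {i} is not a dict-like object")
        extra = set(example) - ALLOWED_KEYS
        if extra:
            raise ValueError(f"example {i} has unexpected keys: {sorted(extra)}")
        if "input_ids" not in example:
            # Assumption of this sketch: token ids are always required.
            raise ValueError(f"example {i} is missing 'input_ids'")


if __name__ == "__main__":
    validate_examples([{"input_ids": [2, 4, 6], "attention_mask": [1, 1, 1]}])
    print("examples look valid")
```

Some tokenizers also return keys such as token_type_ids; with this check those would need to be dropped before quantizing.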

Customize Model

Below is an example of extending `auto_gptq` to support the `OPT` model. As you can see, it is very easy:

from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in the transformer layer module
    # normally there are four sub-lists; the modules in each one can be seen as one operation,
    # and the order should be the order in which they are actually executed. In this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and the other methods shown in the basic usage example.
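The class attributes above are "chained attribute names", that is, dotted paths followed from the model object down to a submodule. The sketch below only illustrates what such a path means, using a stand-in object; it does not claim to be how AutoGPTQ resolves these names internally, and `resolve_chained_attr` is a hypothetical helper.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_chained_attr(root, dotted_name):
    """Follow a dotted attribute path such as 'model.decoder.layers' starting at root."""
    return reduce(getattr, dotted_name.split("."), root)


# A stand-in object shaped like an OPT model; not a real transformers model.
toy = SimpleNamespace(
    model=SimpleNamespace(
        decoder=SimpleNamespace(layers=["layer0", "layer1"], embed_tokens="embed")
    )
)

layers = resolve_chained_attr(toy, "model.decoder.layers")
print(len(layers))
```

When adding a new architecture, the same kind of lookup on a loaded model is a quick way to confirm that each path in layers_block_name and outside_layer_modules actually exists.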

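Evaluation on Downstream Tasks

You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.

The predefined tasks support all causal language models implemented in Transformers and in this project.

Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset: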

from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be drawn for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load the dataset; it must accept only data_name_or_path as input
        # and return a datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess the dataset, used with datasets.Dataset.map;
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on the given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
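To make the preprocessing contract concrete, the sketch below runs the same kind of refactor function on a hand-written toy batch, with no model or dataset download. The batch contents are invented for illustration, and the template here assumes newline separators before "Text:" and "Answer:".

```python
# Prompt template; the "\n" separators are an assumption about the intended layout.
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    """Turn batched {'text', 'label'} columns into {'prompt', 'label'} columns."""
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(samples["text"], samples["label"]):
        new_samples["prompt"].append(TEMPLATE.format(labels=LABELS, text=text))
        new_samples["label"].append(ID2LABEL[label])
    return new_samples


# A hand-written toy batch, not taken from the real dataset.
batch = {"text": ["great day", "awful service"], "label": [2, 0]}
out = ds_refactor_fn(batch)
print(out["prompt"][0])
print(out["label"])
```

The function receives columns as lists (batched mapping) and returns exactly the two columns named by prompt_col_name and label_col_name, with integer labels replaced by their class names.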

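Learn More

The tutorials provide step-by-step guidance for integrating auto_gptq into your own project, along with some best-practice principles.

The examples provide many scripts that use auto_gptq in different ways.

Supported Models

You can check whether the model you use is supported by auto_gptq by comparing model.config.model_type against the table below.
For example, the model_type of WizardLM, vicuna and gpt4all is llama in each case, so they are all supported by auto_gptq.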

model type	quantization	inference	peft-lora	peft-ada-lora	peft-adaption_prompt
bloom	✅	✅	✅	✅
gpt2	✅	✅	✅	✅
gpt_neox	✅	✅	✅	✅	✅ requires this peft branch
gptj	✅	✅	✅	✅	✅ requires this peft branch
llama	✅	✅	✅	✅	✅
moss	✅	✅	✅	✅	✅ requires this peft branch
opt	✅	✅	✅	✅
gpt_bigcode	✅	✅	✅	✅
codegen	✅	✅	✅	✅
falcon (RefinedWebModel/RefinedWeb)	✅	✅	✅	✅
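The table lookup can be expressed as a simple membership test. In this sketch the set holds the model_type strings that the table rows correspond to; `is_supported` is a hypothetical helper, and the commented lines show one possible way to read model_type with transformers' AutoConfig without loading weights.

```python
# model_type values corresponding to the rows of the table above.
SUPPORTED_MODEL_TYPES = {
    "bloom", "gpt2", "gpt_neox", "gptj", "llama", "moss", "opt",
    "gpt_bigcode", "codegen", "falcon", "RefinedWebModel", "RefinedWeb",
}


def is_supported(model_type):
    """Return True if model_type appears in the supported-models table."""
    return model_type in SUPPORTED_MODEL_TYPES


if __name__ == "__main__":
    # With transformers installed, the type could be read from the config alone:
    # from transformers import AutoConfig
    # model_type = AutoConfig.from_pretrained("facebook/opt-125m").model_type
    print(is_supported("llama"))
```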

Supported Evaluation Tasks

Currently, auto_gptq supports LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more tasks will come soon!

Running tests

Tests can be run with:

pytest tests/ -s

FAQ

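Which kernel is used by default?

AutoGPTQ defaults to the exllamav2 int4*fp16 kernel for matrix multiplication.

How to use the Marlin kernel?

Marlin is an optimized int4 * fp16 kernel that was recently proposed at https://github.com/ist-daslab/marlin. It is integrated into AutoGPTQ and is used when a model is loaded with use_marlin=True. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).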
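Because Marlin is limited to two compute capabilities, it can help to check the device before requesting the kernel. The following is a minimal sketch under that reading of the answer above; `can_use_marlin` is a hypothetical helper, the torch calls are guarded so the snippet also runs without a GPU, and the commented from_quantized call reuses the directory name from the quick tour.

```python
# Compute capabilities named in the FAQ answer above.
MARLIN_CAPABILITIES = {(8, 0), (8, 6)}


def can_use_marlin(capability):
    """Return True if a (major, minor) compute capability is one Marlin supports."""
    return tuple(capability) in MARLIN_CAPABILITIES


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    if torch is not None and torch.cuda.is_available():
        cap = torch.cuda.get_device_capability(0)  # e.g. (8, 6)
        print(cap, can_use_marlin(cap))
        # from auto_gptq import AutoGPTQForCausalLM
        # model = AutoGPTQForCausalLM.from_quantized(
        #     "opt-125m-4bit", device="cuda:0", use_marlin=can_use_marlin(cap)
        # )
    else:
        print("no CUDA device found; Marlin is unavailable")
```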

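Acknowledgement

Special thanks to Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh for proposing the GPTQ algorithm and open-sourcing the code, and for releasing the Marlin kernel for mixed-precision computation.
Special thanks to qwopqwop200; the quantization-related code in this project is mainly referenced from GPTQ-for-LLaMa.
Special thanks to turboderp for releasing the exllama and exllama v2 libraries with efficient mixed-precision kernels.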

Additional information

Version: v0.7.1: patch release
Type: Other source code
Updated: 2025-04-18
Size: 7.22MB
Source: GitHub
