AutoGPTQ

v0.7.1: patch release

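AutoGPTQ development has stopped. Please switch to GPTQModel as a drop-in replacement.

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).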
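
To give a feel for what weight-only, group-wise 4-bit quantization means, here is a minimal pure-Python sketch of plain round-to-nearest quantization. It is not the GPTQ algorithm, which additionally compensates for rounding error while quantizing, and it is not AutoGPTQ's implementation; the function names and the min/max scheme are illustrative assumptions.

```python
def quantize_group(weights, bits=4):
    """Round-to-nearest quantization of one group of float weights (illustrative only)."""
    qmax = 2 ** bits - 1                      # 15 for 4-bit
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / qmax or 1.0           # fall back to 1.0 for constant groups
    q = [round((w - lo) / scale) for w in weights]
    return q, scale, lo


def dequantize_group(q, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [v * scale + zero for v in q]


def quantize_weights(weights, group_size=128, bits=4):
    """Quantize a flat list of weights, one (codes, scale, zero) triple per group."""
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


weights = [0.0, 0.5, 1.0, 1.5, -3.0, 0.0, 3.0, 1.0]
groups = quantize_weights(weights, group_size=4, bits=4)
restored = [w for q, scale, zero in groups for w in dequantize_group(q, scale, zero)]
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(groups)
print(max_error)
```

Each group stores small integer codes plus its own scale and offset, which is why a smaller group size tends to reduce rounding error at the cost of more metadata.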

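News or Update

2024-02-15 - (News) - AutoGPTQ 0.7.0 is released, with support for the Marlin int4*fp16 matrix multiplication kernel, enabled by passing use_marlin=True when loading a model.
2023-08-23 - (News) - Transformers, Optimum and PEFT have integrated auto-gptq, so running and training GPTQ models is now available to everyone. See this blog and its resources for more details!

For more history, see here.

Performance Comparison

Inference Speed

The results are generated using this script, with an input batch size of 1, beam search as the decoding strategy, and the model forced to generate 512 tokens. The speed metric is tokens/s (larger is better).
The quantized model is loaded using the setup that gives the fastest inference speed.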

model	GPU	num_beams	fp16	gptq-int4
llama-7b	1xA100-40G	1	18.87	25.53
llama-7b	1xA100-40G	4	68.79	91.30
moss-moon 16b	1xA100-40G	1	12.48	15.25
moss-moon 16b	1xA100-40G	4	OOM	42.67
moss-moon 16b	2xA100-40G	1	06.83	06.78
moss-moon 16b	2xA100-40G	4	13.10	10.80
gpt-j 6b	1xRTX3060-12G	1	OOM	29.55
gpt-j 6b	1xRTX3060-12G	4	OOM	47.36
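
The relative speedup of GPTQ-int4 over fp16 can be read off the table with a little arithmetic. The sketch below copies the numbers from the table and divides them; the variable and function names are illustrative, and OOM entries are represented as None.

```python
# (model, GPU, num_beams) -> (fp16 tokens/s, gptq-int4 tokens/s); None marks OOM.
results = {
    ("llama-7b", "1xA100-40G", 1): (18.87, 25.53),
    ("llama-7b", "1xA100-40G", 4): (68.79, 91.30),
    ("moss-moon 16b", "1xA100-40G", 1): (12.48, 15.25),
    ("moss-moon 16b", "1xA100-40G", 4): (None, 42.67),
    ("moss-moon 16b", "2xA100-40G", 1): (6.83, 6.78),
    ("moss-moon 16b", "2xA100-40G", 4): (13.10, 10.80),
    ("gpt-j 6b", "1xRTX3060-12G", 1): (None, 29.55),
    ("gpt-j 6b", "1xRTX3060-12G", 4): (None, 47.36),
}


def speedup(fp16, int4):
    """Return int4/fp16 throughput ratio, or None if either run went OOM."""
    if fp16 is None or int4 is None:
        return None
    return int4 / fp16


for key, (fp16, int4) in results.items():
    ratio = speedup(fp16, int4)
    label = "n/a (fp16 OOM)" if ratio is None else f"{ratio:.2f}x"
    print(*key, label)
```

On a single A100 the quantized model is faster in every row where fp16 fits in memory, while on the two-GPU moss-moon rows the ratio drops below 1.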

Perplexity

For a perplexity comparison, see here and here.

Installation

AutoGPTQ is available on Linux and Windows only. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:

Platform version	Installation	Built against PyTorch
CUDA 11.8	`pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/`	2.2.1+cu118
CUDA 12.1	`pip install auto-gptq --no-build-isolation`	2.2.1+cu121
ROCm 5.7	`pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/`	2.2.1+rocm5.7

AutoGPTQ can be installed with the Triton dependency using pip install auto-gptq[triton] --no-build-isolation in order to use the Triton backend (currently Linux only, with no 3-bit quantization support).
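
The install options above can be summarized as a small lookup. This is only a convenience sketch: the platform keys and the install_command helper are hypothetical, and combining the triton extra with an extra index URL is an assumption not stated above.

```python
# Extra wheel index per platform, copied from the installation table.
# None means the default PyPI wheel is used.
EXTRA_INDEX_URLS = {
    "cuda11.8": "https://huggingface.github.io/autogptq-index/whl/cu118/",
    "cuda12.1": None,
    "rocm5.7": "https://huggingface.github.io/autogptq-index/whl/rocm573/",
}


def install_command(platform, triton=False):
    """Build the pip command for a platform key (hypothetical helper)."""
    if platform not in EXTRA_INDEX_URLS:
        raise ValueError(f"no pre-built wheel listed for {platform!r}")
    package = "auto-gptq[triton]" if triton else "auto-gptq"
    command = f"pip install {package} --no-build-isolation"
    url = EXTRA_INDEX_URLS[platform]
    if url:
        command += f" --extra-index-url {url}"
    return command


for name in EXTRA_INDEX_URLS:
    print(install_command(name))
```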

For older versions of AutoGPTQ, please refer to the previous releases installation table.

On NVIDIA systems, AutoGPTQ does not support Maxwell or older GPUs.

Install from source

Clone the source code:

git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ

A few packages are required in order to build from source: pip install numpy gekko pandas.

Then, install locally from source:

pip install -vvv --no-build-isolation -e .

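You can set BUILD_CUDA_EXT=0 to disable building the PyTorch extension, but this is strongly discouraged because AutoGPTQ then falls back to a slow Python implementation.

As a last resort, if the above command fails, you can try python setup.py install.

On ROCm systems

To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example: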

ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .

The compilation can be sped up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200 series devices.

On ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.
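
When the build is driven from a script, the two environment variables above can be assembled programmatically. The following is a sketch only; rocm_build_env is a hypothetical helper, and the actual build call is left commented out because it must run inside the AutoGPTQ checkout.

```python
import os


def rocm_build_env(rocm_version, arch=None, base_env=None):
    """Return an environment dict for a ROCm source build (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    env["ROCM_VERSION"] = rocm_version
    if arch is not None:
        # Restricting the build to one target device shortens compilation.
        env["PYTORCH_ROCM_ARCH"] = arch
    return env


BUILD_COMMAND = ["pip", "install", "-vvv", "--no-build-isolation", "-e", "."]

env = rocm_build_env("5.6", arch="gfx90a", base_env={})
print(env)

# To actually build, run from the cloned AutoGPTQ directory:
# import subprocess
# subprocess.run(BUILD_COMMAND, env=rocm_build_env("5.6", arch="gfx90a"), check=True)
```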

On Intel® Gaudi® 2 systems

Note: make sure you are on commit 65c2e15 or later.

To install from source for Intel Gaudi 2 HPUs, set the BUILD_CUDA_EXT=0 environment variable to disable building the CUDA PyTorch extension. Example:

BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .

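Note that Intel Gaudi 2 uses an optimized kernel at inference time, and that BUILD_CUDA_EXT=0 is required on non-CUDA machines.

Quick Tour

Quantization and Inference

Warning: this is only a showcase of the basic APIs in AutoGPTQ. It uses a single sample to quantize a very small model, and the quality of a model quantized with so few samples may not be good.

Below is an example of the simplest use of auto_gptq to quantize a model and run inference after quantization: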

from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # setting to False can significantly speed up inference, but perplexity may be slightly worse
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, Login first via huggingface-cli login.
# or pass explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please refer to this script.
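
Since a single calibration sample is unlikely to give good quality, a natural next step is to pass several. The sketch below builds the list of dicts that quantize expects, keeping only the "input_ids" and "attention_mask" keys. build_examples and toy_tokenize are hypothetical names; the toy tokenizer is a stand-in so the snippet runs without any model download.

```python
def build_examples(texts, tokenize):
    """Tokenize each text and keep only the keys accepted for calibration."""
    examples = []
    for text in texts:
        encoded = tokenize(text)
        examples.append(
            {
                "input_ids": encoded["input_ids"],
                "attention_mask": encoded["attention_mask"],
            }
        )
    return examples


def toy_tokenize(text):
    """Stand-in tokenizer for illustration only: one fake id per whitespace token."""
    ids = [len(token) for token in text.split()]
    return {
        "input_ids": ids,
        "attention_mask": [1] * len(ids),
        "token_type_ids": [0] * len(ids),  # extra key that must be dropped
    }


texts = [
    "auto-gptq is an easy-to-use model quantization library.",
    "GPTQ is a weight-only quantization algorithm.",
    "Calibration data should resemble the text the model will see.",
]
examples = build_examples(texts, toy_tokenize)
print(len(examples), sorted(examples[0]))
```

With the objects from the quick tour, the real tokenizer would replace the stand-in, for instance model.quantize(build_examples(texts, tokenizer)).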

Customize Model

Below is an example of extending auto_gptq to support the OPT model. As you will see, it is very easy:

from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order when they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and the other methods as shown in Basic.
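
A "chained attribute name" such as "model.decoder.layers" can be read as a dotted path followed from the top-level model object. The sketch below shows one way such a path may be resolved; resolve_chained_attr is a hypothetical helper and the nested SimpleNamespace is a toy stand-in for a real model, so AutoGPTQ's own lookup may differ.

```python
from functools import reduce
from types import SimpleNamespace


def resolve_chained_attr(root, dotted_name):
    """Follow a dotted attribute path, e.g. 'model.decoder.layers', from root."""
    return reduce(getattr, dotted_name.split("."), root)


# Toy object tree mimicking the attribute layout named in the class above.
layers = ["layer0", "layer1"]
toy = SimpleNamespace(
    model=SimpleNamespace(
        decoder=SimpleNamespace(layers=layers, embed_tokens="embed")
    )
)

print(resolve_chained_attr(toy, "model.decoder.layers"))
print(resolve_chained_attr(toy, "model.decoder.embed_tokens"))
```

Printing the module tree of a new architecture and checking that each path resolves this way is a quick sanity check before writing the class attributes.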

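Evaluation on Downstream Tasks

You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on a specific downstream task before and after quantization.

The predefined tasks support all causal language models implemented in Transformers and in this project.

Below is an example of evaluating `EleutherAI/gpt-j-6b` on a sequence classification task using the `cardiffnlp/tweet_sentiment_multilingual` dataset: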

from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled to evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when sample's length exceed sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)
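
To see what the preprocessing function produces without downloading the model or the dataset, it can be run on a small hand-made batch. The sketch below repeats the template and mapping from the example so that it is self-contained; the two sample tweets are made up for illustration.

```python
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    """Turn a batch of raw columns into the prompt/label columns the task expects."""
    new_samples = {"prompt": [], "label": []}
    for text, label in zip(samples["text"], samples["label"]):
        new_samples["prompt"].append(TEMPLATE.format(labels=LABELS, text=text))
        new_samples["label"].append(ID2LABEL[label])
    return new_samples


# A batch in the column-oriented layout that datasets.Dataset.map passes with batched input.
batch = {"text": ["I love this!", "This is awful."], "label": [2, 0]}
out = ds_refactor_fn(batch)
print(out["prompt"][0])
print(out["label"])
```

The returned dict has exactly the two keys named by prompt_col_name and label_col_name, which is what the comment on preprocess_fn requires.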

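Learn More

The tutorials provide step-by-step guidance for integrating auto_gptq with your own project, along with some best-practice principles.

The examples provide plenty of example scripts showing how to use auto_gptq in different ways.

Supported Models

You can compare model.config.model_type against the table below to check whether the model you use is supported by auto_gptq.
For example, the model_type of WizardLM, vicuna and gpt4all is llama in each case, so they are all supported by auto_gptq.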

model type	quantization	inference	peft-lora	peft-ada-lora	peft-adaption_prompt
bloom	✅	✅	✅	✅
gpt2	✅	✅	✅	✅
gpt_neox	✅	✅	✅	✅	requires this PEFT branch
gptj	✅	✅	✅	✅	requires this PEFT branch
llama	✅	✅	✅	✅	✅
moss	✅	✅	✅	✅	requires this PEFT branch
opt	✅	✅	✅	✅
gpt_bigcode	✅	✅	✅	✅
codegen	✅	✅	✅	✅
falcon (RefinedWebModel/RefinedWeb)	✅	✅	✅	✅
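
The check described above amounts to a set lookup. Here is a minimal sketch with the model types transcribed from the table; SUPPORTED_MODEL_TYPES and is_supported are hypothetical names rather than part of the auto_gptq API, and listing "falcon" alongside its two alternative type names is an assumption.

```python
# Model types transcribed from the support table (not read from the library).
SUPPORTED_MODEL_TYPES = {
    "bloom", "gpt2", "gpt_neox", "gptj", "llama", "moss", "opt",
    "gpt_bigcode", "codegen", "falcon", "RefinedWebModel", "RefinedWeb",
}


def is_supported(model_type):
    """Return True if the given config.model_type appears in the table."""
    return model_type in SUPPORTED_MODEL_TYPES


for model_type in ["llama", "opt", "t5"]:
    print(model_type, is_supported(model_type))
```

With a loaded Transformers model, the argument would be model.config.model_type.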

Supported Evaluation Tasks

Currently, auto_gptq supports LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask. More tasks will come soon!

Running tests

Tests can be run with:

pytest tests/ -s

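FAQ

Which kernel is used by default?

AutoGPTQ defaults to the exllamav2 int4*fp16 kernel for matrix multiplication.

How to use the Marlin kernel?

Marlin is an optimized int4 * fp16 kernel that was recently proposed at https://github.com/ist-daslab/marlin. It is integrated into AutoGPTQ and is used when a model is loaded with use_marlin=True. This kernel is only available on devices with compute capability 8.0 or 8.6 (Ampere GPUs).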
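
Because Marlin is limited to specific compute capabilities, it can help to check the device before requesting it. The sketch below encodes only the two capabilities named above; can_use_marlin is a hypothetical helper, and the commented usage assumes torch, auto_gptq and a CUDA GPU are available.

```python
# Compute capabilities on which the text above says the Marlin kernel is usable.
MARLIN_CAPABILITIES = {(8, 0), (8, 6)}


def can_use_marlin(capability):
    """capability is a (major, minor) tuple, e.g. from torch.cuda.get_device_capability()."""
    return tuple(capability) in MARLIN_CAPABILITIES


for cap in [(7, 5), (8, 0), (8, 6), (8, 9)]:
    print(cap, can_use_marlin(cap))

# Hypothetical usage with the quick-tour objects (needs torch, auto_gptq and a GPU):
# import torch
# use_marlin = can_use_marlin(torch.cuda.get_device_capability(0))
# model = AutoGPTQForCausalLM.from_quantized(
#     quantized_model_dir, device="cuda:0", use_marlin=use_marlin
# )
```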

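Acknowledgement

Thanks to Elias Frantar, Saleh Ashkboos, Torsten Hoefler and Dan Alistarh for proposing the GPTQ algorithm and open-sourcing the code, and for releasing the Marlin kernel for mixed-precision computation.
Thanks to qwopqwop200; the quantization-related code in this project is mainly based on GPTQ-for-LLaMa.
Thanks to turboderp for releasing the Exllama and Exllama v2 libraries with efficient mixed-precision kernels.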

Additional Information

Version: v0.7.1: patch release
Type: Other source code
Updated: 2025-04-18
Size: 7.22MB
Source: GitHub
