AutoGPTQ development has stopped. Please switch to GPTQModel as a drop-in replacement.
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).
The argument use_marlin=True can now be passed when loading models to use the Marlin kernel. Hugging Face Transformers, optimum and peft have integrated auto-gptq, so running and training GPTQ models is now available to everyone! See this blog and its resources for more details! For more history, please turn to here.
The results were generated using this script; the batch size of the input is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).
The quantized model is loaded using the setup that gives the fastest inference speed (a rough sketch of these decoding settings follows the table below).
| model | GPU | num_beams | FP16 | GPTQ-INT4 |
|---|---|---|---|---|
| Llama-7b | 1XA100-40G | 1 | 18.87 | 25.53 |
| Llama-7b | 1XA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1XA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1XA100-40G | 4 | oom | 42.67 |
| moss-moon 16b | 2XA100-40G | 1 | 06.83 | 06.78 |
| moss-moon 16b | 2XA100-40G | 4 | 13.10 | 10.80 |
| GPT-J 6B | 1XRTX3060-12G | 1 | oom | 29.55 |
| GPT-J 6B | 1XRTX3060-12G | 4 | oom | 47.36 |
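As a rough illustration of the decoding setup described above (batch size 1, beam search, 512 forced new tokens, speed reported in tokens/s), a minimal timing sketch might look like the following. The checkpoint id is a placeholder rather than the model used for the table, and the benchmark script linked above should be preferred for reproducing the numbers.

```python
# Minimal sketch of the benchmark decoding settings (not the actual benchmark script).
# "TheBloke/Llama-2-7B-GPTQ" is a placeholder quantized checkpoint.
import time

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/Llama-2-7B-GPTQ"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(model_id, device="cuda:0")

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)

start = time.time()
outputs = model.generate(
    **inputs,
    num_beams=4,         # the num_beams column of the table (1 or 4)
    min_new_tokens=512,  # force the model to generate 512 tokens
    max_new_tokens=512,
)
elapsed = time.time() - start

new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.2f} tokens/s")
```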
For perplexity comparison, you can turn to here and here.
AutoGPTQ is available only on Linux and Windows. You can install the latest stable release of AutoGPTQ from pip with pre-built wheels:
| CUDA/ROCm version | Installation | Built against PyTorch |
|---|---|---|
| CUDA 11.8 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ | 2.2.1+cu118 |
| CUDA 12.1 | pip install auto-gptq --no-build-isolation | 2.2.1+cu121 |
| ROCm 5.7 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ | 2.2.1+rocm5.7 |
AutoGPTQ can be installed with the Triton dependency using pip install auto-gptq[triton] --no-build-isolation in order to be able to use the Triton backend (currently only supports Linux; 3-bit quantization is not supported).
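If the Triton extra is installed, the backend is selected when loading a quantized model rather than at install time. A minimal sketch, assuming an already-quantized model saved at the placeholder path opt-125m-4bit (the use_triton flag also appears in the basic usage example below):

```python
# Sketch: load a quantized model with the Triton backend (Linux only).
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",  # placeholder path to an already-quantized model
    device="cuda:0",
    use_triton=True,  # dispatch matmuls to the Triton kernels instead of the default CUDA kernels
)
```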
For older AutoGPTQ versions, please refer to the previous releases installation table.
On NVIDIA systems, AutoGPTQ does not support Maxwell or lower GPUs.
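As a quick sanity check before installing, the GPU's compute capability can be queried with plain PyTorch; Maxwell corresponds to compute capability 5.x, so anything below 6.0 falls under this limitation. A minimal sketch:

```python
# Sketch: check that the local NVIDIA GPU is newer than Maxwell (compute capability >= 6.0).
import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"Compute capability: {major}.{minor}")
if major < 6:
    print("Maxwell or older GPU detected: not supported by AutoGPTQ.")
```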
Clone the source code:
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
A few packages are required in order to build from source: pip install numpy gekko pandas
Then, install locally from source:
pip install -vvv --no-build-isolation -e .
You can set BUILD_CUDA_EXT=0 to disable building the PyTorch CUDA extension, but this is strongly discouraged, as AutoGPTQ then falls back on a slow Python implementation.
As a last resort, if the above command fails, you can try python setup.py install.
To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:
ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .
The compilation can be sped up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200 series devices.
For ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.
Note: make sure that you are on 65c2e15 or later.
To install from source for Intel Gaudi 2 HPUs, set the BUILD_CUDA_EXT=0 environment variable to disable building the CUDA PyTorch extension. Example:
BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .
Note that Intel Gaudi 2 uses an optimized kernel upon inference and requires BUILD_CUDA_EXT=0 on non-CUDA machines.
Warning: this is only a showcase of the basic API usage in AutoGPTQ, which quantizes a tiny model using only one sample; the quality of a model quantized with such a small sample may not be good.
Below is an example of the simplest way to use auto_gptq to quantize a model and run inference after quantization:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference but the perplexity may be slightly worse
)

# load un-quantized model, by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be list of dict whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, login first via huggingface-cli login.
# or pass explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please refer to this script.
from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules that are at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order when they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and other methods as shown in Basic.
You can use tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on specific downstream tasks before and after quantization.
The predefined tasks support all causal language models implemented in Hugging Face transformers and in this project.
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)

Tutorials provide step-by-step guidance on integrating auto_gptq with your own project, along with some best-practice principles.
Examples provide plenty of example scripts that use auto_gptq in different ways.
You can compare model.config.model_type against the table below to check whether the model you are using is supported by auto_gptq. For example, the model_type of WizardLM, Vicuna and GPT4All is llama, so they are all supported by auto_gptq (a quick way to read model_type is sketched after the table).
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|---|---|---|---|---|---|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon (RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
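A quick way to read model_type without downloading the full weights is to load only the model's config; the checkpoint below is just an illustrative example of a llama-family model:

```python
# Sketch: look up model_type to check support against the table above.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("lmsys/vicuna-7b-v1.5")  # illustrative checkpoint
print(config.model_type)  # "llama", which is listed in the table above
```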
Currently, auto_gptq supports LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more tasks will come soon!
Tests can be run with:
pytest tests/ -s
AutoGPTQ defaults to using the ExLlamaV2 int4*fp16 kernel for matrix multiplication.
Marlin is an optimized int4*fp16 kernel, recently proposed at https://github.com/ist-daslab/marlin. It is integrated in AutoGPTQ when loading a model with use_marlin=True. This kernel is available only on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
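Following the use_marlin=True flag mentioned above, loading a quantized model with the Marlin kernel might look like the sketch below. The path is a placeholder, and Marlin only applies to quantization configurations it supports (for example 4-bit weights) on compute capability 8.0/8.6 devices:

```python
# Sketch: load a 4-bit quantized model with the Marlin int4*fp16 kernel (Ampere GPUs only).
from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "opt-125m-4bit",  # placeholder path to an already-quantized 4-bit model
    device="cuda:0",
    use_marlin=True,  # dispatch to the Marlin kernel instead of the default ExLlamaV2 kernel
)
```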