AutoGPTQ development has stopped. Please switch to GPTQModel as a drop-in replacement.
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm (weight-only quantization).
The Marlin int4*fp16 kernel can now be used by loading models with use_marlin=True. auto-gptq has also been integrated into Hugging Face Transformers, so running and training GPTQ models is now available to everyone! See this blog and its resources for more details! For more history, please turn to here.
The results were generated with this script: the input batch size is 1, the decoding strategy is beam search, the model is forced to generate 512 tokens, and the speed metric is tokens/s (the larger, the better).
The quantized model is loaded using the settings that give the fastest inference speed (see the sketch after the table).
| Model | GPU | num_beams | FP16 | GPTQ-INT4 |
|---|---|---|---|---|
| Llama-7b | 1xA100-40G | 1 | 18.87 | 25.53 |
| Llama-7b | 1xA100-40G | 4 | 68.79 | 91.30 |
| moss-moon 16b | 1xA100-40G | 1 | 12.48 | 15.25 |
| moss-moon 16b | 1xA100-40G | 4 | OOM | 42.67 |
| moss-moon 16b | 2xA100-40G | 1 | 6.83 | 6.78 |
| moss-moon 16b | 2xA100-40G | 4 | 13.10 | 10.80 |
| GPT-J 6B | 1xRTX3060-12G | 1 | OOM | 29.55 |
| GPT-J 6B | 1xRTX3060-12G | 4 | OOM | 47.36 |
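As a rough illustration of the generation settings described above (batch size 1, beam search, 512 forced tokens), a minimal sketch is shown below; it is not the actual benchmark script, and the model directory, prompt, and generation arguments are assumptions:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# the quantized model directory and prompt are illustrative placeholders
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")

inputs = tokenizer("auto-gptq is", return_tensors="pt").to(model.device)
# beam search (num_beams matches the 4-beam rows of the table) and exactly 512 generated tokens
outputs = model.generate(**inputs, num_beams=4, min_new_tokens=512, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```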
For a perplexity comparison, you can turn to here and here.
AutoGPTQ is available only on Linux and Windows. You can install the latest stable release from pip with pre-built wheels:
| CUDA/ROCm version | Installation | Built against PyTorch |
|---|---|---|
| CUDA 11.8 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ | 2.2.1+cu118 |
| CUDA 12.1 | pip install auto-gptq --no-build-isolation | 2.2.1+cu121 |
| ROCm 5.7 | pip install auto-gptq --no-build-isolation --extra-index-url https://huggingface.github.io/autogptq-index/whl/rocm573/ | 2.2.1+rocm5.7 |
AutoGPTQ can be installed with pip install auto-gptq[triton] --no-build-isolation in order to use the Triton backend (currently only supported on Linux; 3-bit quantization is not supported with Triton).
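If the Triton extra is installed, the backend can be enabled through the use_triton flag when loading a quantized model (the same flag appears in the basic usage example below). A minimal sketch, with an illustrative model directory:

```python
from auto_gptq import AutoGPTQForCausalLM

# load a quantized model with the Triton backend (Linux only); the directory name is illustrative
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0", use_triton=True)
```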
For older AutoGPTQ versions, please refer to the previous releases installation table.
On NVIDIA systems, AutoGPTQ does not support Maxwell or lower GPUs.
Clone the source code:
git clone https://github.com/PanQiWei/AutoGPTQ.git && cd AutoGPTQ
A few packages are required in order to build from source: pip install numpy gekko pandas.
Then, install locally from source:
pip install -vvv --no-build-isolation -e .
You can set BUILD_CUDA_EXT=0 to disable building the PyTorch CUDA extension, but this is strongly discouraged as AutoGPTQ then falls back on a slow pure-Python implementation.
As a last resort, if the above command fails, you can try python setup.py install.
To install from source for AMD GPUs supporting ROCm, please specify the ROCM_VERSION environment variable. Example:
ROCM_VERSION=5.6 pip install -vvv --no-build-isolation -e .
Compilation can be sped up by specifying the PYTORCH_ROCM_ARCH variable (reference) in order to build for a single target device, for example gfx90a for MI200 series devices.
For ROCm systems, the packages rocsparse-dev, hipsparse-dev, rocthrust-dev, rocblas-dev and hipblas-dev are required to build.
Note: make sure you are on commit 65c2e15 or newer.
To install from source for Intel Gaudi 2 HPUs, set the BUILD_CUDA_EXT=0 environment variable to disable building the CUDA PyTorch extension. Example:
BUILD_CUDA_EXT=0 pip install -vvv --no-build-isolation -e .
Note that Intel Gaudi 2 uses optimized kernels at inference time, and BUILD_CUDA_EXT=0 is required on non-CUDA machines.
Warning: this only shows the usage of the basic APIs in AutoGPTQ, quantizing a small model with only one sample; the quality of a model quantized with so few samples may not be good.
Below is an example of the simplest way to use auto_gptq to quantize a model and run inference after quantization:
from transformers import AutoTokenizer, TextGenerationPipeline
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s [%(name)s] %(message)s", level=logging.INFO, datefmt="%Y-%m-%d %H:%M:%S"
)

pretrained_model_dir = "facebook/opt-125m"
quantized_model_dir = "opt-125m-4bit"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
examples = [
    tokenizer(
        "auto-gptq is an easy-to-use model quantization library with user-friendly apis, based on GPTQ algorithm."
    )
]

quantize_config = BaseQuantizeConfig(
    bits=4,  # quantize model to 4-bit
    group_size=128,  # it is recommended to set the value to 128
    desc_act=False,  # set to False can significantly speed up inference but the perplexity may be slightly worse
)

# load un-quantized model; by default, the model will always be loaded into CPU memory
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)

# quantize model, the examples should be a list of dicts whose keys can only be "input_ids" and "attention_mask"
model.quantize(examples)

# save quantized model
model.save_quantized(quantized_model_dir)

# save quantized model using safetensors
model.save_quantized(quantized_model_dir, use_safetensors=True)

# push quantized model to Hugging Face Hub.
# to use use_auth_token=True, login first via huggingface-cli login.
# or pass explicit token with: use_auth_token="hf_xxxxxxx"
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, commit_message=commit_message, use_auth_token=True)

# alternatively you can save and push at the same time
# (uncomment the following three lines to enable this feature)
# repo_id = f"YourUserName/{quantized_model_dir}"
# commit_message = f"AutoGPTQ model for {pretrained_model_dir}: {quantize_config.bits}bits, gr{quantize_config.group_size}, desc_act={quantize_config.desc_act}"
# model.push_to_hub(repo_id, save_dir=quantized_model_dir, use_safetensors=True, commit_message=commit_message, use_auth_token=True)

# load quantized model to the first GPU
model = AutoGPTQForCausalLM.from_quantized(quantized_model_dir, device="cuda:0")

# download quantized model from Hugging Face Hub and load to the first GPU
# model = AutoGPTQForCausalLM.from_quantized(repo_id, device="cuda:0", use_safetensors=True, use_triton=False)

# inference with model.generate
print(tokenizer.decode(model.generate(**tokenizer("auto_gptq is", return_tensors="pt").to(model.device))[0]))

# or you can also use pipeline
pipeline = TextGenerationPipeline(model=model, tokenizer=tokenizer)
print(pipeline("auto-gptq is")[0]["generated_text"])

For more advanced features of model quantization, please refer to this script.
Below is an example of extending auto_gptq to support the OPT model type by inheriting from BaseGPTQForCausalLM:

from auto_gptq.modeling import BaseGPTQForCausalLM


class OPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # chained attribute name of transformer layer block
    layers_block_name = "model.decoder.layers"
    # chained attribute names of other nn modules at the same level as the transformer layer block
    outside_layer_modules = [
        "model.decoder.embed_tokens", "model.decoder.embed_positions", "model.decoder.project_out",
        "model.decoder.project_in", "model.decoder.final_layer_norm"
    ]
    # chained attribute names of linear layers in the transformer layer module
    # normally, there are four sub lists, for each one the modules in it can be seen as one operation,
    # and the order should be the order in which they are truly executed, in this case (and usually in most cases),
    # they are: attention q_k_v projection, attention output projection, MLP project input, MLP project output
    inside_layer_modules = [
        ["self_attn.k_proj", "self_attn.v_proj", "self_attn.q_proj"],
        ["self_attn.out_proj"],
        ["fc1"],
        ["fc2"]
    ]

After this, you can use OPTGPTQForCausalLM.from_pretrained and the other methods as shown in the basic usage example.
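As a rough usage sketch, the custom class simply takes the place of AutoGPTQForCausalLM; the quantize_config and calibration examples are assumed to be prepared as in the basic usage example above, and the save directory is illustrative:

```python
# quantize an OPT model with the custom class defined above;
# quantize_config and examples come from the basic usage example (assumption)
model = OPTGPTQForCausalLM.from_pretrained("facebook/opt-125m", quantize_config)
model.quantize(examples)
model.save_quantized("opt-125m-4bit")
```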
You can use the tasks defined in auto_gptq.eval_tasks to evaluate a model's performance on specific downstream tasks before and after quantization.
The predefined tasks support all causal language models implemented in Hugging Face Transformers and in this project.
from functools import partial

import datasets
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
from auto_gptq.eval_tasks import SequenceClassificationTask


MODEL = "EleutherAI/gpt-j-6b"
DATASET = "cardiffnlp/tweet_sentiment_multilingual"
TEMPLATE = "Question:What's the sentiment of the given text? Choices are {labels}.\nText: {text}\nAnswer:"
ID2LABEL = {
    0: "negative",
    1: "neutral",
    2: "positive"
}
LABELS = list(ID2LABEL.values())


def ds_refactor_fn(samples):
    text_data = samples["text"]
    label_data = samples["label"]

    new_samples = {"prompt": [], "label": []}
    for text, label in zip(text_data, label_data):
        prompt = TEMPLATE.format(labels=LABELS, text=text)
        new_samples["prompt"].append(prompt)
        new_samples["label"].append(ID2LABEL[label])

    return new_samples


# model = AutoModelForCausalLM.from_pretrained(MODEL).eval().half().to("cuda:0")
model = AutoGPTQForCausalLM.from_pretrained(MODEL, BaseQuantizeConfig())
tokenizer = AutoTokenizer.from_pretrained(MODEL)

task = SequenceClassificationTask(
    model=model,
    tokenizer=tokenizer,
    classes=LABELS,
    data_name_or_path=DATASET,
    prompt_col_name="prompt",
    label_col_name="label",
    **{
        "num_samples": 1000,  # how many samples will be sampled for evaluation
        "sample_max_len": 1024,  # max tokens for each sample
        "block_max_len": 2048,  # max tokens for each data block
        # function to load dataset, one must only accept data_name_or_path as input
        # and return datasets.Dataset
        "load_fn": partial(datasets.load_dataset, name="english"),
        # function to preprocess dataset, which is used for datasets.Dataset.map,
        # must return Dict[str, list] with only two keys: [prompt_col_name, label_col_name]
        "preprocess_fn": ds_refactor_fn,
        # truncate label when a sample's length exceeds sample_max_len
        "truncate_prompt": False
    }
)

# note that max_new_tokens will be automatically specified internally based on given classes
print(task.run())

# self-consistency
print(
    task.run(
        generation_config=GenerationConfig(
            num_beams=3,
            num_return_sequences=3,
            do_sample=True
        )
    )
)

Tutorials provide step-by-step guidance for integrating auto_gptq into your own project, along with some best-practice principles.
Examples provide plenty of example scripts that use auto_gptq in different ways.
You can compare model.config.model_type against the table below to check whether the model you are using is supported by auto_gptq (see the sketch after the table). For example, the model_type of WizardLM, vicuna and gpt4all is llama, hence they are all supported by auto_gptq.
| model type | quantization | inference | peft-lora | peft-ada-lora | peft-adaption_prompt |
|---|---|---|---|---|---|
| bloom | ✅ | ✅ | ✅ | ✅ | |
| gpt2 | ✅ | ✅ | ✅ | ✅ | |
| gpt_neox | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| gptj | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| llama | ✅ | ✅ | ✅ | ✅ | ✅ |
| moss | ✅ | ✅ | ✅ | ✅ | ✅ requires this peft branch |
| opt | ✅ | ✅ | ✅ | ✅ | |
| gpt_bigcode | ✅ | ✅ | ✅ | ✅ | |
| codegen | ✅ | ✅ | ✅ | ✅ | |
| falcon (RefinedWebModel/RefinedWeb) | ✅ | ✅ | ✅ | ✅ | |
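A minimal sketch of the check described above (the model name here is only an example):

```python
from transformers import AutoConfig

# print the model_type and compare it against the table above
config = AutoConfig.from_pretrained("facebook/opt-125m")
print(config.model_type)  # prints "opt", which appears in the table above
```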
Currently, auto_gptq supports LanguageModelingTask, SequenceClassificationTask and TextSummarizationTask; more tasks will come soon!
Tests can be run with:
pytest tests/ -s
AutoGPTQ defaults to using the exllamav2 int4*fp16 kernel for matrix multiplication.
Marlin is an optimized int4*fp16 kernel recently proposed at https://github.com/ist-daslab/marlin. It is integrated in AutoGPTQ and used when loading a model with use_marlin=True. This kernel is only available on devices with compute capability 8.0 or 8.6 (Ampere GPUs).
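A minimal sketch of loading a quantized model with the Marlin kernel enabled, assuming an Ampere GPU and an illustrative quantized model directory:

```python
from auto_gptq import AutoGPTQForCausalLM

# use the Marlin int4*fp16 kernel (requires compute capability 8.0/8.6)
model = AutoGPTQForCausalLM.from_quantized("opt-125m-4bit", device="cuda:0", use_marlin=True)
```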