transformers bloom inference下载 - transformers bloom inference源代码下载

transformers bloom inference

其他源码

1.0.0

下载

笔记

该存储库已被存档，并且由于最近发布了像VLLM和TGI一样发布了更高效的服务框架，因此无法再维护。

快速推理解决方案

该仓库提供了演示和软件包，以执行Bloom的快速推理解决方案。某些解决方案具有自己的存储库，在这种情况下，提供了指向相应存储库的链接。

Bloom 176b的推理解决方案

我们支持拥抱面的加速和深速推断。

安装所需的软件包：

pip install flask flask_api gunicorn pydantic accelerate huggingface_hub > =0.9.0 deepspeed > =0.7.3 deepspeed-mii==0.0.2

另外，您也可以从源头安装DeepSpeed：

git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
CFLAGS= " -I $CONDA_PREFIX /include/ " LDFLAGS= " -L $CONDA_PREFIX /lib/ " TORCH_CUDA_ARCH_LIST= " 7.0 " DS_BUILD_CPU_ADAM=1 DS_BUILD_AIO=1 DS_BUILD_UTILS=1 pip install -e . --global-option= " build_ext " --global-option= " -j8 " --no-cache -v --disable-pip-version-check

所有提供的脚本均在8个A100 A100 80GB GPU上进行BLOOM 176B（FP16/BF16）和4 A100 80GB GPU，用于Bloom 176b（INT8）。这些脚本可能不适用于其他模型或不同数量的GPU。

DS推断是使用从DeepSpeed MII库借入的逻辑来部署的。

注意：有时DS推理部署崩溃时，有时GPU内存不会释放。您可以通过在终端运行killall python来释放此内存。

要使用Bloom量化，请使用dtype = int8。另外，将model_name更改为Microsoft/Bloom-Deepspeed-Int8，以进行深速推动。对于HF加速，model_name不需要更改。

HF加速使用llm.int8（），而DS-推断使用零定量进行培训量化。

通过命令行开花推理

这每次都要求generate_kwargs。示例：generate_kwargs =

{ "min_length" : 100 , "max_new_tokens" : 100 , "do_sample" : false }

使用HF加速

python -m inference_server.cli --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --generate_kwargs ' {"min_length": 100, "max_new_tokens": 100, "do_sample": false} '

使用DS推理

python -m inference_server.cli --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --generate_kwargs ' {"min_length": 100, "max_new_tokens": 100, "do_sample": false} '

BLOOM服务器部署

制作<model_name>可用于启动生成服务器。请注意，服务方法是同步的，用户必须等待队列，直到处理上述请求为止。这里给出了一个fire服务器请求的示例。还提供了dockerfile的替代品，它在端口5000上启动了一家发电服务器。

可以通过以下命令启动交互式UI，以连接到Generation Server。 UI的默认URL为http://127.0.0.1:5001/ 。 UI仅使用model_name来检查模型是解码器还是编码器模型。

python -m ui --model_name bigscience/bloom

该命令启动以下UI来播放一代。对不起，糟糕的设计。未来的是，我的UI技能只会走得太远。 ???

Bloom推理的基准系统

使用HF加速

python -m inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework hf_accelerate --benchmark_cycles 5

使用DS推理

deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

或者，加载模型的速度更快：

deepspeed --num_gpus 8 --module inference_server.benchmark --model_name microsoft/bloom-deepspeed-inference-fp16 --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5

使用DS零

deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype bf16 --deployment_framework ds_zero --benchmark_cycles 5

支持

如果您遇到不起作用或有其他问题的事情，请在相应的后端打开一个问题：

加速
深速推导
深速零

如果其中一个脚本存在特定问题，而不是后端，请在此处打开一个问题，并在 @Mayank31398上打开问题。

其他推理解决方案

客户端解决方案

开发的解决方案是为了在本地执行大批推断：

自定义HF代码。

JAX：

jax中的绽放推断

服务器解决方案

可以在此处找到用于在服务器模式下使用的解决方案（即不同的批量大小，不同的请求率）。这是在生锈中实现的。

展开

附加信息

版本 1.0.0
类型其他源码
更新时间 2025-04-17
大小 133.5KB
来自于 Github

transformers bloom inference

快速推理解决方案

Bloom 176b的推理解决方案

通过命令行开花推理

BLOOM服务器部署

Bloom推理的基准系统

支持

其他推理解决方案

客户端解决方案

服务器解决方案

艾咪花店中文版手游（Bloom Sort）

bloom sort游戏无广告

变形金刚：塞伯坦之战

盛开

变形金刚：德

变形金刚：黑暗火花崛起

chat.petals.dev

GPT Prompt Templates

GPTyped

Google Dorks

shepherd

mongo express

Google Dorks

shepherd

mongo express