Stable Fast

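IMPORTANT ANNOUNCEMENT

After more than a year of delay, I am happy to announce that I plan to build a new project, Comfy-WaveSpeed, to provide the fastest inference speed for all models running with ComfyUI. It has only just started, and I hope it will become a great project. Please stay tuned and give me feedback.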

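NOTE

Active development on stable-fast has been paused. I am currently working on a new torch._dynamo based project targeting new models such as stable-cascade, SD3 and Sora-like models. It will be faster and more flexible, and it will support more hardware backends than just CUDA.

Contact is welcomed.

Discord Channel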

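stable-fast achieves SOTA inference performance on all kinds of diffusers models, even with the latest StableVideoDiffusionPipeline. Unlike TensorRT or AITemplate, which take dozens of minutes to compile a model, stable-fast only takes a few seconds. stable-fast also supports dynamic shape, LoRA and ControlNet out of the box.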

Model	torch	torch.compile	AIT	oneflow	TensorRT	stable-fast
SD 1.5 (ms)	1897	1510	1158	1003	991	995
svd-xt (s)	83	70				47

NOTE: During benchmarking, TensorRT was tested with static batch size and CUDA Graph enabled, while stable-fast was running with dynamic shape.
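
To make the table above easier to read, the latencies can be turned into speedups relative to plain torch. The snippet below is only an arithmetic sketch: the numbers are copied from the table, the column names are taken from its header row, and the `speedups` helper is a name made up for this illustration.

```python
# Latency figures copied from the table above (lower is better).
# None marks a cell that is empty in the table.
LATENCY = {
    "SD 1.5 (ms)": {"torch": 1897, "torch.compile": 1510, "AIT": 1158,
                    "oneflow": 1003, "TensorRT": 991, "stable-fast": 995},
    "svd-xt (s)": {"torch": 83, "torch.compile": 70, "AIT": None,
                   "oneflow": None, "TensorRT": None, "stable-fast": 47},
}


def speedups(row, baseline="torch"):
    """Return baseline_latency / latency for every framework with a value."""
    base = row[baseline]
    return {name: round(base / value, 2)
            for name, value in row.items() if value is not None}


for model, row in LATENCY.items():
    print(model, speedups(row))
```

Keep in mind that the TensorRT column was measured with static batch size, so the comparison is not entirely like for like.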

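Stable Fast
- Introduction
  - What is this?
  - Differences With Other Acceleration Libraries
- Installation
  - Install Prebuilt Wheels
  - Install From Source
- Usage
  - Optimize StableDiffusionPipeline
  - Optimize LCM Pipeline
  - Optimize StableVideoDiffusionPipeline
  - Dynamically Switch LoRA
  - Model Quantization
  - Some Common Methods To Speed Up PyTorch
- Performance Comparison
  - RTX 4080 (512x512, batch size 1, fp16, in WSL2)
  - H100
  - A100
- Compatibility
- Troubleshooting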

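Introduction

What is this?

stable-fast is an ultra lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. stable-fast provides super fast inference optimization by utilizing some key techniques and features: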

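CUDNN Convolution Fusion: stable-fast implements a series of fully-functional and fully-compatible CUDNN convolution fusion operators for all kinds of combinations of Conv + Bias + Add + Act computation patterns.
Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute with fp16 precision, which is faster than PyTorch's defaults (read and write with fp16 while computing with fp32).
Fused Linear GEGLU: stable-fast is able to fuse GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c) into one CUDA kernel.
NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC GroupNorm + Silu operator with OpenAI's Triton, which eliminates the need for memory format permutation operators.
Fully Traced Model: stable-fast improves the torch.jit.trace interface to make it more suitable for tracing complex models. Nearly every part of StableDiffusionPipeline/StableVideoDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has a significantly lower CPU overhead than torch.compile, and supports ControlNet and LoRA.
CUDA Graph: stable-fast can capture the UNet, VAE and TextEncoder into CUDA Graph format, which can reduce the CPU overhead when the batch size is small. This implementation also supports dynamic shape.
Fused Multihead Attention: stable-fast simply uses xformers and makes it compatible with TorchScript.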
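
The GEGLU formula in the list above can be written out as an unfused reference. This is a minimal pure-Python sketch for checking the math on small inputs; it is not stable-fast's CUDA kernel, and the helper names (`gelu`, `linear`, `geglu`) are chosen here for illustration. It assumes the exact (erf-based) form of GELU.

```python
import math


def gelu(z):
    """Exact GELU: z * Phi(z), with Phi the standard normal CDF."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute x @ weight + bias for row-major lists of lists.

    x has shape [rows][in], weight has shape [in][out], bias has length out.
    """
    return [[sum(xi * w for xi, w in zip(row, col)) + b
             for col, b in zip(zip(*weight), bias)]
            for row in x]


def geglu(x, W, V, b, c):
    """Unfused reference: GELU(xW + b) multiplied elementwise by (xV + c)."""
    gate = linear(x, W, b)
    value = linear(x, V, c)
    return [[gelu(g) * v for g, v in zip(g_row, v_row)]
            for g_row, v_row in zip(gate, value)]


identity = [[1.0, 0.0], [0.0, 1.0]]
out = geglu([[1.0, 0.0]], identity, identity, [0.0, 0.0], [0.0, 0.0])
print(out)
```

The unfused version runs two matrix products, an activation and a multiply as separate steps with intermediate results in between; fusing them into one kernel avoids writing those intermediates back to memory.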

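My next goal is to keep stable-fast as one of the fastest inference optimization frameworks for diffusers and also provide both speedup and VRAM reduction for transformers. In fact, I already use stable-fast to optimize LLMs and achieve a significant speedup. But I still need to do some work to make it more stable and easier to use, and to provide a stable user interface.

Differences With Other Acceleration Libraries

Fast: stable-fast is specially optimized for HuggingFace Diffusers. It achieves high performance compared with many other libraries, and it compiles very quickly, within only a few seconds. It is significantly faster than torch.compile, TensorRT and AITemplate in compilation time.
Minimal: stable-fast works as a plugin framework for PyTorch. It utilizes existing PyTorch functionality and infrastructure and is compatible with other acceleration techniques, as well as popular fine-tuning techniques and deployment solutions.
Maximum Compatibility: stable-fast is compatible with all kinds of HuggingFace Diffusers and PyTorch versions. It is also compatible with ControlNet and LoRA. It even supports the latest StableVideoDiffusionPipeline out of the box!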

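Installation

NOTE: stable-fast is currently only tested on Linux and WSL2 in Windows. You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).

I only test stable-fast with torch>=2.1.0, xformers>=0.0.22 and triton>=2.1.0 on CUDA 12.1 and Python 3.10. Other versions might build and run successfully, but that is not guaranteed.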
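
Before installing, it can help to check the environment against the tested versions listed above. The following is a rough sketch using only the standard library; `release_tuple`, `meets_minimum` and `report` are names invented for this example, and the parser deliberately ignores pre-release and local suffixes (the `packaging` library would be the more rigorous choice).

```python
import re
from importlib import metadata

# Minimum versions quoted in the text above.
MINIMUMS = {"torch": "2.1.0", "xformers": "0.0.22", "triton": "2.1.0"}


def release_tuple(version):
    """Parse the leading numeric release, e.g. '2.1.0+cu121' -> (2, 1, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def meets_minimum(installed, minimum):
    """Compare release tuples after padding the shorter one with zeros."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report():
    """Print one status line per package; missing packages are reported too."""
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            print(f"{name}: not installed (need >= {minimum})")
            continue
        status = "ok" if meets_minimum(installed, minimum) else "too old"
        print(f"{name}: {installed} ({status}, need >= {minimum})")


report()
```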

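Install Prebuilt Wheels

Download the wheel corresponding to your system from the Releases Page and install it with pip3 install <wheel file>.

Currently both Linux and Windows wheels are available.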

# Change cu121 to your CUDA version and <wheel file> to the path of the wheel file.
# And make sure the wheel file is compatible with your PyTorch version.
pip3 install --index-url https://download.pytorch.org/whl/cu121 \
    'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3' \
    '<wheel file>'

Install From Source

# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages at first.
# Windows user: Triton might be not available, you could skip it.
# NOTE: 'wheel' is required or you will meet `No module named 'torch'` error when building.
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.
# You can also install the latest stable release from PyPI.
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)
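
The TORCH_CUDA_ARCH_LIST comment above applies when the build machine and the target GPUs differ. As a small sketch, the value can be assembled from compute capabilities; `cuda_arch_list` is a hypothetical helper, and the target list below is an assumed example (A100 is 8.0, RTX 4080 is 8.9, H100 is 9.0). On a machine with PyTorch and a GPU, `torch.cuda.get_device_capability()` returns a pair in the same `(major, minor)` form.

```python
import os


def cuda_arch_list(capabilities):
    """Format (major, minor) pairs as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))
    return ";".join(f"{major}.{minor}" for major, minor in unique)


# Assumed targets for illustration: A100 (8.0), RTX 4080 (8.9), H100 (9.0).
targets = [(8, 9), (8, 0), (9, 0), (8, 0)]

# Setting os.environ only affects this process and the children it spawns.
os.environ["TORCH_CUDA_ARCH_LIST"] = cuda_arch_list(targets)
print(os.environ["TORCH_CUDA_ARCH_LIST"])
```

The printed value would then be exported in the shell before running the pip3 install command, since the variable has to be visible to the build process.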

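NOTE: Any usage outside sfast.compilers is not guaranteed to be backward compatible.

NOTE: To get the best performance, xformers and OpenAI's triton>=2.1.0 need to be installed and enabled. You might need to build xformers from source to make it compatible with your PyTorch.

Usage

Optimize StableDiffusionPipeline

stable-fast is able to optimize StableDiffusionPipeline and StableDiffusionPipelineXL directly.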

import time
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5',
        torch_dtype=torch.float16)

    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model

model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving best performance.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
# But it can increase the amount of GPU memory used.
# For StableVideoDiffusionPipeline it is not needed.
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detailed face, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(3):
    output_image = model(**kwarg_inputs).images[0]

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
begin = time.time()
output_image = model(**kwarg_inputs).images[0]
print(f'Inference time: {time.time() - begin:.3f}s')

# Let's view it in terminal!
from sfast.utils.term_image import print_image

print_image(output_image, max_width=80)

Refer to examples/optimize_stable_diffusion_pipeline.py for more details.
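
The warm-up-then-measure pattern used in the script above can be factored into a small helper. This is a generic sketch rather than part of stable-fast: `time_after_warmup` and its `sync` parameter are names made up here, and the workload is a CPU stand-in so the snippet runs anywhere. With a real pipeline one would pass `torch.cuda.synchronize` as `sync`, so that queued GPU work has finished before each clock read.

```python
import time


def time_after_warmup(fn, warmup=3, runs=1, sync=None):
    """Call fn `warmup` times untimed, then return mean seconds per timed run.

    `sync` is an optional zero-argument callable invoked before each clock
    read; on a GPU it should block until all queued kernels have finished.
    """
    for _ in range(warmup):
        fn()
    if sync is not None:
        sync()
    begin = time.perf_counter()
    for _ in range(runs):
        fn()
    if sync is not None:
        sync()
    return (time.perf_counter() - begin) / runs


# Stand-in workload so the sketch runs without a GPU.
calls = []
elapsed = time_after_warmup(lambda: calls.append(sum(range(10_000))),
                            warmup=3, runs=5)
print(f"{len(calls)} calls, {elapsed:.6f} s per timed run")
```

Averaging over several timed runs smooths out jitter that a single measurement would include.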

A Colab notebook is available to show how it works on a T4 GPU.

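Optimize LCM Pipeline

stable-fast is able to optimize the newest latent consistency model pipeline and achieve a significant speedup.

Refer to examples/optimize_lcm_pipeline.py for more details about how to optimize a normal SD model with LCM LoRA, and about how to optimize the standalone LCM model.

Optimize StableVideoDiffusionPipeline

stable-fast is able to optimize the newest StableVideoDiffusionPipeline and achieve a 2x speedup.

See the examples in the repository for more details.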

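Dynamically Switch LoRA

Switching LoRA dynamically is supported, but you need to do some extra work. It is possible because the compiled graph and the CUDA Graph share the same underlying data (pointers) with the original UNet model. So all you need to do is update the original UNet model's parameters in place.

The following code assumes you have already loaded a LoRA and compiled the model, and you want to switch to another LoRA.

If you don't enable CUDA graph and keep preserve_parameters = True, things could be much easier. The following code might not even be needed.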

# load_state_dict with assign=True requires torch >= 2.1.0

def update_state_dict(dst, src):
    for key, value in src.items():
        # Do inplace copy.
        # As the traced forward function shares the same underlying data (pointers),
        # this modification will be reflected in the traced forward function.
        dst[key].copy_(value)

# Switch "another" LoRA into UNet
def switch_lora(unet, lora):
    # Store the original UNet parameters
    state_dict = unet.state_dict()
    # Load another LoRA into unet
    unet.load_attn_procs(lora)
    # Inplace copy current UNet parameters to the original unet parameters
    update_state_dict(state_dict, unet.state_dict())
    # Load the original UNet parameters back.
    # We use assign=True because we still want to hold the references
    # of the original UNet parameters
    unet.load_state_dict(state_dict, assign=True)

switch_lora(compiled_model.unet, lora_b_path)
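
Why the in-place copy matters can be shown without PyTorch. The snippet below is only an analogy: plain Python lists stand in for parameter tensors, and the `trace` function is a made-up stand-in for a traced graph that captured references to the original buffers.

```python
# Pure-Python analogy: lists stand in for parameter tensors.
params = {"weight": [1.0, 2.0]}


def trace(state):
    """Capture the current buffer objects, as a traced graph would."""
    captured = state["weight"]  # a reference to the buffer, not a copy
    return lambda: sum(captured)


traced = trace(params)

# In-place copy (the analogue of dst[key].copy_(value)): same buffer, new data.
params["weight"][:] = [10.0, 20.0]
after_inplace = traced()

# Rebinding: the dict now points at a new buffer the traced function never saw.
params["weight"] = [100.0, 200.0]
after_rebind = traced()

print(after_inplace, after_rebind)
```

The traced function picks up the in-place update but not the rebinding, which is the reason the code above copies the new weights into the original storage and then restores those original objects with assign=True.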

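Model Quantization

stable-fast extends PyTorch's quantize_dynamic functionality and provides a dynamically quantized linear operator on the CUDA backend. By enabling it, you could get a slight VRAM reduction for diffusers and a significant VRAM reduction for transformers, and you could get a potential speedup (not always).

For SD XL, a VRAM reduction of about 2GB is expected with an image size of 1024x1024.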

import torch

def quantize_unet(m):
    from diffusers.utils import USE_PEFT_BACKEND
    assert USE_PEFT_BACKEND
    # Replace every torch.nn.Linear in the module with a dynamically
    # quantized int8 version, modifying the module in place.
    m = torch.quantization.quantize_dynamic(m, {torch.nn.Linear},
                                            dtype=torch.qint8,
                                            inplace=True)
    return m

model.unet = quantize_unet(model.unet)
if hasattr(model, 'controlnet'):
    model.controlnet = quantize_unet(model.controlnet)

Refer to examples/optimize_stable_diffusion_pipeline.py for more details.
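
To give a feel for where the VRAM saving comes from, here is a toy int8 weight quantizer in plain Python. It is an illustration under simplifying assumptions: it uses a symmetric per-tensor scale, which is not necessarily the scheme PyTorch or stable-fast applies, and all function names are invented for this sketch.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: w ~= scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map int8 codes back to approximate float weights."""
    return [scale * v for v in q]


def linear_weight_bytes(in_features, out_features, bytes_per_weight):
    """Storage for one weight matrix: fp16 uses 2 bytes, int8 uses 1."""
    return in_features * out_features * bytes_per_weight


weights = [1.27, -0.64, 0.0, 0.333]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
print(linear_weight_bytes(1280, 1280, 2) - linear_weight_bytes(1280, 1280, 1))
```

Each quantized weight takes one byte instead of two for fp16, at the cost of a rounding error bounded by half the scale; only the linear layers shrink, so the overall saving depends on how much of the model they make up.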

Some Common Methods To Speed Up PyTorch

# TCMalloc is highly suggested to reduce CPU overhead
# https://github.com/google/tcmalloc
LD_PRELOAD=/path/to/libtcmalloc.so python3 ...

import packaging.version
import torch

if packaging.version.parse(torch.__version__) >= packaging.version.parse('1.12.0'):
    torch.backends.cuda.matmul.allow_tf32 = True

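Performance Comparison

Performance varies greatly across different hardware/software/platform/driver configurations. It is very hard to benchmark accurately, and preparing the environment for benchmarking is also a hard job. I have tested on some platforms before, but the results may still be inaccurate. Note that when benchmarking, the progress bar shown by tqdm may be inaccurate because of the asynchronous nature of CUDA. To solve this problem, I use CUDA Event to accurately measure the speed in iterations per second.

stable-fast is expected to work better on newer GPUs and newer CUDA versions. On older GPUs, the performance increase might be limited.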
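
The CUDA Event approach mentioned above might look roughly like the sketch below. It is not the author's benchmark script: `measure_with_cuda_events` and `iterations_per_second` are hypothetical helpers, `step` is whatever callable runs one iteration, and the GPU path needs PyTorch with a CUDA device, which is why torch is imported lazily.

```python
def iterations_per_second(num_iterations, elapsed_ms):
    """Convert an elapsed time in milliseconds to iterations per second."""
    return num_iterations * 1000.0 / elapsed_ms


def measure_with_cuda_events(step, num_iterations):
    """Time `num_iterations` calls of `step` on the current CUDA device."""
    import torch  # imported lazily so the helper above works without PyTorch

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iterations):
        step()
    end.record()
    # Wait until the GPU has actually reached the `end` marker.
    torch.cuda.synchronize()
    # elapsed_time returns milliseconds between the two recorded events.
    return iterations_per_second(num_iterations, start.elapsed_time(end))


# The conversion itself can be tried without a GPU.
print(iterations_per_second(30, 600.0))
```

Because the events are recorded on the GPU timeline, the measurement reflects when the kernels actually ran rather than when the CPU queued them, which is what makes a tqdm readout unreliable.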

RTX 4080 (512x512, batch size 1, fp16, in WSL2)

This is my personal gaming PC. It has a more powerful CPU than those from cloud server providers.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	29.5 it/s	4.6 it/s	19.7 it/s
torch.compile (2.1.0, max-autotune)	40.0 it/s	6.1 it/s	21.8 it/s
AITemplate	44.2 it/s
OneFlow	53.6 it/s
AUTO1111 WebUI	17.2 it/s	3.6 it/s
AUTO1111 WebUI (with SDPA)	24.5 it/s	4.3 it/s
TensorRT (AUTO1111 WebUI)	40.8 it/s
TensorRT Official Demo	52.6 it/s
Stable Fast (with xformers & Triton)	51.6 it/s	9.1 it/s	36.7 it/s

H100

Thanks to the help of @Consceleratus and @harishp, I have tested the speed on H100.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	54.5 it/s	14.9 it/s	35.8 it/s
torch.compile (2.1.0, max-autotune)	66.0 it/s	18.5 it/s
Stable Fast (with xformers & Triton)	104.6 it/s	21.6 it/s	72.6 it/s

A100

Thanks to the help of @supersecurehuman and @jon-chuang, benchmarking on A100 is available now.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	35.6 it/s	8.7 it/s	25.1 it/s
torch.compile (2.1.0, max-autotune)	41.9 it/s	10.0 it/s
Stable Fast (with xformers & Triton)	61.8 it/s	11.9 it/s	41.1 it/s

Compatibility

Model	Supported
Hugging Face Diffusers (1.5/2.1/XL)	Yes
With ControlNet	Yes
With LoRA	Yes
Latent Consistency Model	Yes
SDXL Turbo	Yes
Stable Video Diffusion	Yes