Stable Fast

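Important Announcement

After a year of delay, I am happy to announce that I am planning to build a new project, Comfy-WaveSpeed, to provide the fastest inference speed for all models running with ComfyUI. It has only just started, and I hope it will be a great project. Please keep an eye on it and give me feedback!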

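NOTE

Active development on stable-fast has been paused. I am currently working on a new torch._dynamo based project targeting new models such as stable-cascade, SD3 and Sora-like models. It will be faster and more flexible, and will support more hardware backends than just CUDA.

Contact is welcome.

Discord Channel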

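stable-fast achieves SOTA inference performance on all kinds of diffusers models, even with the latest StableVideoDiffusionPipeline. Unlike TensorRT or AITemplate, which take dozens of minutes to compile a model, stable-fast takes only a few seconds. stable-fast also supports dynamic shape, LoRA and ControlNet out of the box.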

Model	torch	torch.compile	AIT	oneflow	TensorRT	stable-fast
SD 1.5 (ms)	1897	1510	1158	1003	991	995
SVD-XT (s)	83	70				47

NOTE: During benchmarking, TensorRT was tested with static batch size and CUDA Graph enabled, while stable-fast was running with dynamic shape.
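To make the table easier to read, the latencies can be turned into speedup ratios over plain torch. The sketch below only re-uses the numbers from the table; the dictionary layout and the function name `speedup_over_torch` are illustrative choices, not part of stable-fast.

```python
# Latencies copied from the table above; lower is better.
# SD 1.5 is in milliseconds, SVD-XT in seconds; None marks an empty cell.
LATENCY = {
    "SD 1.5 (ms)": {"torch": 1897, "torch.compile": 1510, "AIT": 1158,
                    "oneflow": 1003, "TensorRT": 991, "stable-fast": 995},
    "SVD-XT (s)": {"torch": 83, "torch.compile": 70, "AIT": None,
                   "oneflow": None, "TensorRT": None, "stable-fast": 47},
}


def speedup_over_torch(row):
    """Return {framework: torch_latency / framework_latency} for one row."""
    baseline = row["torch"]
    return {name: baseline / value
            for name, value in row.items() if value is not None}


for model, row in LATENCY.items():
    ratios = speedup_over_torch(row)
    print(model, {name: round(r, 2) for name, r in ratios.items()})
```

Because the ratios are relative to the same row, the differing units (ms and s) do not matter.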

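Stable Fast
- Introduction
  - What is this?
  - Differences With Other Acceleration Libraries
- Installation
  - Install Prebuilt Wheels
  - Install From Source
- Usage
  - Optimize StableDiffusionPipeline
  - Optimize LCM Pipeline
  - Optimize StableVideoDiffusionPipeline
  - Dynamically Switch LoRA
  - Model Quantization
  - Some Common Methods To Speed Up PyTorch
- Performance Comparison
  - RTX 4080 (512x512, batch size 1, fp16, in WSL2)
  - H100
  - A100
- Compatibility
- Troubleshooting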

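Introduction

What is this?

stable-fast is an ultra-lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. stable-fast provides super-fast inference optimization by utilizing some key techniques and features: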

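- CUDNN Convolution Fusion: stable-fast implements a series of fully functional and fully compatible CUDNN convolution fusion operators for all kinds of combinations of the Conv + Bias + Add + Act computation pattern.
- Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute in fp16 precision, which is faster than PyTorch's default (reading and writing in fp16 while computing in fp32).
- Fused Linear GEGLU: stable-fast is able to fuse GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c) into one CUDA kernel.
- NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC GroupNorm + SiLU operator with OpenAI's Triton, which eliminates the need for memory format permutation operators.
- Fully Traced Model: stable-fast improves the torch.jit.trace interface to make it more suitable for tracing complex models. Nearly every part of StableDiffusionPipeline/StableVideoDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has significantly lower CPU overhead than torch.compile, and supports ControlNet and LoRA.
- CUDA Graph: stable-fast can capture the UNet, VAE and TextEncoder into CUDA Graph format, which reduces CPU overhead when the batch size is small. This implementation also supports dynamic shape.
- Fused Multihead Attention: stable-fast simply uses xformers and makes it compatible with TorchScript.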
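The GEGLU formula in the list above can be written out as a plain, unfused reference to show what the fused kernel has to compute. This is a pure-Python sketch for small vectors, assuming x is a row vector and W, V are (d_in x d_out) matrices; the helper names are illustrative, and it is not how stable-fast implements the operator.

```python
import math


def gelu(z):
    """Exact (erf-based) GELU for a single float."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Compute x @ weight + bias for a vector x and a (d_in x d_out) weight."""
    d_out = len(bias)
    return [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
            for j in range(d_out)]


def geglu(x, W, V, b, c):
    """Unfused reference: GELU(xW + b) multiplied elementwise by (xV + c)."""
    gate = [gelu(z) for z in linear(x, W, b)]
    value = linear(x, V, c)
    return [g * v for g, v in zip(gate, value)]


x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]   # identity, so xW + b == x when b == 0
V = [[2.0, 0.0], [0.0, 2.0]]   # doubles x
print(geglu(x, W, V, b=[0.0, 0.0], c=[0.0, 0.0]))
```

The unfused version runs two matrix products, an activation and a multiply as separate steps, each producing an intermediate result; fusing them into one kernel avoids writing those intermediates back to memory.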

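My next goal is to keep stable-fast as one of the fastest inference optimization frameworks for diffusers and also to provide both speedup and VRAM reduction for transformers. In fact, I already use stable-fast to optimize LLMs and achieve a significant speedup. But I still need to do some work to make it more stable and easier to use, and to provide a stable user interface.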

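Differences With Other Acceleration Libraries

- Fast: stable-fast is specially optimized for HuggingFace Diffusers and performs well compared with many other libraries. It also compiles very quickly, within only a few seconds, which is significantly faster than torch.compile, TensorRT and AITemplate in compilation time.
- Minimal: stable-fast works as a plugin framework for PyTorch. It utilizes existing PyTorch functionality and infrastructure, and is compatible with other acceleration techniques as well as popular fine-tuning techniques and deployment solutions.
- Maximum Compatibility: stable-fast is compatible with all kinds of HuggingFace Diffusers and PyTorch versions. It is also compatible with ControlNet and LoRA. It even supports the latest StableVideoDiffusionPipeline out of the box!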

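Installation

NOTE: stable-fast is currently only tested on Linux and WSL2 in Windows. You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).

I have only tested stable-fast with torch>=2.1.0, xformers>=0.0.22 and triton>=2.1.0 on CUDA 12.1 and Python 3.10. Other versions might build and run successfully, but that is not guaranteed.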
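Before installing, it can help to check which of these packages are present and whether they meet the tested minimums. The following is a minimal sketch using only the standard library; `release_tuple`, `meets_minimum` and `check_environment` are hypothetical helper names, and the version parser is deliberately simplified (pre-release strings such as `2.1.0rc1` are treated as older than `2.1.0`).

```python
from importlib import metadata

# Minimum versions quoted in the note above.
MINIMUMS = {"torch": "2.1.0", "xformers": "0.0.22", "triton": "2.1.0"}


def release_tuple(version):
    """Turn '2.1.0+cu121' or '0.0.22.post7' into a tuple of leading integers."""
    parts = []
    for piece in version.split("+")[0].split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True if the installed release is at least the required minimum."""
    return release_tuple(installed) >= release_tuple(minimum)


def check_environment(minimums=MINIMUMS):
    """Return {package: status string} for each required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = f"{installed} ({'ok' if ok else 'older than ' + minimum})"
    return report


for package, status in check_environment().items():
    print(f"{package}: {status}")
```

For stricter comparisons, the third-party `packaging.version` module handles pre-releases and local version tags more thoroughly than this sketch.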

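Install Prebuilt Wheels

Download the wheel corresponding to your system from the Releases Page and install it with pip3 install <wheel file>.

Currently both Linux and Windows wheels are available.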

# Change cu121 to your CUDA version and <wheel file> to the path of the wheel file.
# And make sure the wheel file is compatible with your PyTorch version.
pip3 install --index-url https://download.pytorch.org/whl/cu121 \
    'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3' \
    '<wheel file>'

Install From Source

# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages at first.
# Windows user: Triton might be not available, you could skip it.
# NOTE: 'wheel' is required or you will meet `No module named 'torch'` error when building.
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.
# You can also install the latest stable release from PyPI.
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)

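NOTE: Any usage outside sfast.compilers is not guaranteed to be backward compatible.

NOTE: To get the best performance, xformers and OpenAI's triton>=2.1.0 need to be installed and enabled. You might need to build xformers from source to make it compatible with your PyTorch.

Usage

Optimize StableDiffusionPipeline

stable-fast is able to optimize StableDiffusionPipeline and StableDiffusionXLPipeline directly.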

import time
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5',
        torch_dtype=torch.float16)

    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model

model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving best performance.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
# But it can increase the amount of GPU memory used.
# For StableVideoDiffusionPipeline it is not needed.
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detailed face, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(3):
    output_image = model(**kwarg_inputs).images[0]

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
begin = time.time()
output_image = model(**kwarg_inputs).images[0]
print(f'Inference time: {time.time() - begin:.3f}s')

# Let's view it in terminal!
from sfast.utils.term_image import print_image

print_image(output_image, max_width=80)

Refer to examples/optimize_stable_diffusion_pipeline.py for more details.
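The warm-up-then-measure pattern used in the example above can be wrapped in a small reusable helper. This is a generic sketch with a hypothetical `benchmark` function, not part of stable-fast; it assumes the timed callable blocks until its result is ready, which holds for a pipeline call that returns finished images.

```python
import time


def benchmark(fn, warmup=3, iters=5):
    """Call fn() `warmup` times untimed, then return the mean seconds per call."""
    for _ in range(warmup):
        fn()  # the first calls may include one-off compilation cost
    begin = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - begin) / iters


# Stand-in workload so the sketch runs anywhere; with a compiled pipeline
# you would pass something like `lambda: model(**kwarg_inputs)` instead.
calls = []
mean_seconds = benchmark(lambda: calls.append(sum(range(10_000))))
print(f"{len(calls)} calls, mean {mean_seconds * 1e3:.3f} ms per timed call")
```

Averaging over several timed calls smooths out run-to-run noise that a single measurement would include.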

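You can check this Colab to see how it works on a T4 GPU:

Optimize LCM Pipeline

stable-fast is able to optimize the newest latent consistency model pipeline and achieve a significant speedup.

Refer to examples/optimize_lcm_pipeline.py for more details about how to optimize a normal SD model with LCM LoRA. Refer to examples/optimize_lcm_pipeline.py for more details about how to optimize the standalone LCM model.

Optimize StableVideoDiffusionPipeline

stable-fast is able to optimize the newest StableVideoDiffusionPipeline and achieve a 2x speedup.

For more details, see the examples directory of the repository.

Dynamically Switch LoRA

Switching LoRA dynamically is supported, but you need to do some extra work. It is possible because the compiled graph and the CUDA Graph share the same underlying data (pointers) with the original UNet model. So all you need to do is update the original UNet model's parameters in place.

The following code assumes you have already loaded a LoRA and compiled the model, and you want to switch to another LoRA.

If you don't enable CUDA Graph and keep preserve_parameters = True, things could be much easier. The following code might not even be needed.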

# load_state_dict with assign=True requires torch >= 2.1.0

def update_state_dict(dst, src):
    for key, value in src.items():
        # Do inplace copy.
        # As the traced forward function shares the same underlying data (pointers),
        # this modification will be reflected in the traced forward function.
        dst[key].copy_(value)

# Switch "another" LoRA into UNet
def switch_lora(unet, lora):
    # Store the original UNet parameters
    state_dict = unet.state_dict()
    # Load another LoRA into unet
    unet.load_attn_procs(lora)
    # Inplace copy current UNet parameters to the original unet parameters
    update_state_dict(state_dict, unet.state_dict())
    # Load the original UNet parameters back.
    # We use assign=True because we still want to hold the references
    # of the original UNet parameters
    unet.load_state_dict(state_dict, assign=True)

switch_lora(compiled_model.unet, lora_b_path)
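The reason the in-place copy matters can be illustrated without any GPU code. The toy below is only an analogy, with a hypothetical `TracedForward` class standing in for a compiled graph that captured a buffer once: updating the buffer in place is visible to it, while rebinding the name to a new buffer is not.

```python
class TracedForward:
    """Toy stand-in for a compiled graph: it captures the weight buffer once."""

    def __init__(self, weights):
        self._weights = weights  # keeps a reference, not a copy

    def __call__(self, x):
        return [w * x for w in self._weights]


weights = [1.0, 2.0]
forward = TracedForward(weights)

# In-place update (the analogue of dst[key].copy_(value)):
# the captured buffer is the same object, so the change is visible.
weights[:] = [10.0, 20.0]
in_place_result = forward(1.0)

# Rebinding the name creates a new buffer that the captured graph never sees.
weights = [100.0, 200.0]
rebound_result = forward(1.0)

print(in_place_result, rebound_result)
```

Both calls return the values written by the in-place update, which mirrors why the original parameter storage has to be kept and overwritten rather than replaced.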

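Model Quantization

stable-fast extends PyTorch's quantize_dynamic functionality and provides a dynamically quantized linear operator on the CUDA backend. By enabling it, you can get a slight VRAM reduction for diffusers and a significant VRAM reduction for transformers, and may get a potential speedup (not always).

For SD XL, a VRAM reduction of about 2GB is expected with an image size of 1024x1024.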

import torch

def quantize_unet(m):
    # This path expects diffusers to be using the PEFT backend.
    from diffusers.utils import USE_PEFT_BACKEND
    assert USE_PEFT_BACKEND
    # Dynamically quantize every nn.Linear layer to int8, in place.
    m = torch.quantization.quantize_dynamic(m, {torch.nn.Linear},
                                            dtype=torch.qint8,
                                            inplace=True)
    return m

model.unet = quantize_unet(model.unet)
if hasattr(model, 'controlnet'):
    model.controlnet = quantize_unet(model.controlnet)

Refer to examples/optimize_stable_diffusion_pipeline.py for more details.
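To give a feel for where the VRAM saving comes from, here is a pure-Python illustration of int8 weight quantization. It assumes a simple symmetric per-tensor scheme and a hypothetical 1024 x 1024 layer; the exact scheme used by quantize_dynamic or by stable-fast's CUDA operator may differ.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0
    return [round(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Recover approximate floats from the int8 codes."""
    return [q * scale for q in quantized]


weights = [0.50, -1.27, 0.034, 1.00]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage estimate for a hypothetical 1024 x 1024 linear layer.
params = 1024 * 1024
fp16_bytes, int8_bytes = params * 2, params * 1
print(codes, f"max error {worst_error:.4f}")
print(f"fp16: {fp16_bytes} bytes, int8: {int8_bytes} bytes")
```

Each weight shrinks from two bytes to one, at the cost of a rounding error bounded by half the scale; only the linear layers are affected, which is why the overall saving depends on how much of the model they make up.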

Some Common Methods To Speed Up PyTorch

# TCMalloc is highly suggested to reduce CPU overhead
# https://github.com/google/tcmalloc
LD_PRELOAD=/path/to/libtcmalloc.so python3 ...

import packaging.version
import torch

if packaging.version.parse(torch.__version__) >= packaging.version.parse('1.12.0'):
    torch.backends.cuda.matmul.allow_tf32 = True

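Performance Comparison

Performance varies greatly across different hardware/software/platform/driver configurations. It is very hard to benchmark accurately, and preparing the environment for benchmarking is also a hard job. I have tested on some platforms before, but the results may still be inaccurate. Note that when benchmarking, the progress bar shown by tqdm may be inaccurate because of the asynchronous nature of CUDA. To solve this problem, I use CUDA Event to measure the speed of iterations per second accurately.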
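A rough sketch of what timing with CUDA events can look like is shown below. It is not the author's benchmark script: the function names are illustrative, and `time_with_cuda_events` assumes PyTorch with a CUDA device is available (torch is imported lazily so the conversion helper works without it).

```python
def iterations_per_second(num_iterations, elapsed_ms):
    """Convert an elapsed time in milliseconds into iterations per second."""
    return num_iterations / (elapsed_ms / 1000.0)


def time_with_cuda_events(fn, num_iterations):
    """Run fn() num_iterations times on the GPU and return it/s.

    Requires PyTorch with a CUDA device; torch is imported lazily so the
    helper above stays usable without it.
    """
    import torch

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(num_iterations):
        fn()
    end.record()
    torch.cuda.synchronize()  # wait until both events have completed
    return iterations_per_second(num_iterations, start.elapsed_time(end))


# Example of the conversion alone: 30 iterations in 600 ms.
print(iterations_per_second(30, 600.0))
```

Events are recorded on the GPU stream itself, so the measured interval reflects when the kernels actually ran rather than when the CPU finished launching them.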

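stable-fast is expected to work better on newer GPUs and newer CUDA versions. On older GPUs, the performance increase might be limited.

RTX 4080 (512x512, batch size 1, fp16, in WSL2)

This is my personal gaming PC. It has a more powerful CPU than those from cloud server providers.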

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	29.5 it/s	4.6 it/s	19.7 it/s
torch.compile (2.1.0, max-autotune)	40.0 it/s	6.1 it/s	21.8 it/s
AITemplate	44.2 it/s
OneFlow	53.6 it/s
AUTO1111 WebUI	17.2 it/s	3.6 it/s
AUTO1111 WebUI (with SDPA)	24.5 it/s	4.3 it/s
TensorRT (AUTO1111 WebUI)	40.8 it/s
TensorRT Official Demo	52.6 it/s
stable-fast (with xformers & Triton)	51.6 it/s	9.1 it/s	36.7 it/s

H100

Thanks to @Consceleratus and @harishp for their help, I have tested the speed on H100.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	54.5 it/s	14.9 it/s	35.8 it/s
torch.compile (2.1.0, max-autotune)	66.0 it/s	18.5 it/s
stable-fast (with xformers & Triton)	104.6 it/s	21.6 it/s	72.6 it/s

A100

Thanks to @supersecurehuman and @jon-chuang for their help, benchmarking on A100 is now available.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	35.6 it/s	8.7 it/s	25.1 it/s
torch.compile (2.1.0, max-autotune)	41.9 it/s	10.0 it/s
stable-fast (with xformers & Triton)	61.8 it/s	11.9 it/s	41.1 it/s
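To compare the three GPUs on equal footing, the SD 1.5 throughput of stable-fast can be expressed relative to vanilla PyTorch on the same card. The snippet below only re-uses the figures from the tables above; the dictionary layout and the name `relative_speedup` are illustrative.

```python
# SD 1.5 throughput (it/s) copied from the three tables above.
SD15_THROUGHPUT = {
    "RTX 4080": {"vanilla": 29.5, "stable-fast": 51.6},
    "H100": {"vanilla": 54.5, "stable-fast": 104.6},
    "A100": {"vanilla": 35.6, "stable-fast": 61.8},
}


def relative_speedup(row):
    """Throughput ratio of stable-fast over vanilla PyTorch for one GPU."""
    return row["stable-fast"] / row["vanilla"]


for gpu, row in SD15_THROUGHPUT.items():
    print(f"{gpu}: {relative_speedup(row):.2f}x")
```

These ratios come from single benchmark runs on different machines, so small differences between them should not be over-interpreted.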

Compatibility

Model	Supported
Hugging Face Diffusers (1.5/2.1/XL)	Yes
With ControlNet	Yes
With LoRA	Yes
Latent Consistency Model	Yes
SDXL Turbo	Yes
Stable Video Diffusion	Yes