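Stable Fast

Important Announcement

After a year of delay, I am pleased to announce that I am building a new project, Comfy Wavespeed, to provide the fastest inference speed for all models running in ComfyUI. It has only just started, and I hope it becomes a great project. Please keep an eye on it and give me feedback!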

NOTE

Active development on stable-fast has been paused. I am currently working on a new torch._dynamo based project that targets new models such as stable-cascade, SD3 and Sora-like models. It will be faster and more flexible, and it will support more hardware backends than just CUDA.

Feel free to get in touch.

Discord Channel

stable-fast achieves SOTA inference performance on all kinds of diffuser models, even with the latest StableVideoDiffusionPipeline. Unlike TensorRT or AITemplate, which take dozens of minutes to compile a model, stable-fast takes only a few seconds. stable-fast also supports dynamic shape, LoRA and ControlNet out of the box.

Model	torch	torch.compile	AIT	oneflow	TensorRT	stable-fast
SD 1.5 (ms)	1897	1510	1158	1003	991	995
SVD-XT (s)	83	70				47

NOTE: During benchmarking, TensorRT was tested with static batch size and CUDA Graph enabled, while stable-fast was running with dynamic shape.

Stable Fast
- Introduction
  - What is this?
  - Differences With Other Acceleration Libraries
- Installation
  - Install Prebuilt Wheels
  - Install From Source
- Usage
  - Optimize StableDiffusionPipeline
  - Optimize LCM Pipeline
  - Optimize StableVideoDiffusionPipeline
  - Dynamically Switch LoRA
  - Model Quantization
  - Some Common Methods To Speed Up PyTorch
- Performance Comparison
  - RTX 4080 (512x512, batch size 1, fp16, in WSL2)
  - H100
  - A100
- Compatibility
- Troubleshooting

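Introduction

What is this?

stable-fast is an ultra lightweight inference optimization framework for HuggingFace Diffusers on NVIDIA GPUs. stable-fast provides super fast inference optimization by utilizing the following key techniques and features: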

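CUDNN Convolution Fusion: stable-fast implements a series of fully functional and fully compatible CUDNN convolution fusion operators for all kinds of combinations of the Conv + Bias + Add + Act computation pattern.
Low Precision & Fused GEMM: stable-fast implements a series of fused GEMM operators that compute with fp16 precision, which is faster than PyTorch's defaults (read and write with fp16 while computing with fp32).
Fused Linear GEGLU: stable-fast is able to fuse GEGLU(x, W, V, b, c) = GELU(xW + b) ⊗ (xV + c) into one CUDA kernel.
NHWC & Fused GroupNorm: stable-fast implements a highly optimized fused NHWC GroupNorm + SiLU operator with OpenAI's Triton, which eliminates the need for memory format permutation operators.
Fully Traced Model: stable-fast improves the torch.jit.trace interface to make it more suitable for tracing complex models. Nearly every part of StableDiffusionPipeline/StableVideoDiffusionPipeline can be traced and converted to TorchScript. It is more stable than torch.compile, has a significantly lower CPU overhead than torch.compile, and supports ControlNet and LoRA.
CUDA Graph: stable-fast can capture the UNet, VAE and TextEncoder into CUDA Graph format, which reduces the CPU overhead when the batch size is small. This implementation also supports dynamic shape.
Fused Multihead Attention: stable-fast simply uses xformers and makes it compatible with TorchScript.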
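
As a reference for the GEGLU formula in the list above, here is a minimal unfused sketch in plain Python. It only illustrates the arithmetic that a fused kernel would have to reproduce; it is not stable-fast's kernel. The helper names (`gelu`, `affine`, `geglu`) are made up for this example, and the exact erf-based GELU is an assumption, since the text does not say which GELU variant is used.

```python
import math


def gelu(z: float) -> float:
    """Exact (erf-based) GELU for a single scalar (assumed variant)."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def affine(x, weight, bias):
    """Compute x @ weight + bias for a vector x and a row-major matrix."""
    return [
        sum(x_i * weight[i][j] for i, x_i in enumerate(x)) + bias[j]
        for j in range(len(bias))
    ]


def geglu(x, W, V, b, c):
    """Unfused reference: GELU(xW + b) multiplied elementwise by (xV + c)."""
    gate = [gelu(z) for z in affine(x, W, b)]
    value = affine(x, V, c)
    return [g * v for g, v in zip(gate, value)]


# Tiny hand-checkable input: xW + b = [0.5, -1.0], xV + c = [2.0, -1.0]
x = [1.0, -2.0]
W = [[0.5, 0.0], [0.0, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
c = [1.0, 1.0]
result = geglu(x, W, V, b, c)
print(result)
```

Computed this way, the two affine products and the activation are separate passes over the data. Fusing them into one kernel avoids writing the intermediate results back to memory, which is where the speedup comes from.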

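My next goal is to keep stable-fast as one of the fastest inference optimization frameworks for diffusers and also to provide both speedup and VRAM reduction for transformers. In fact, I already use stable-fast to optimize LLMs and achieve a significant speedup. But I still need to do some work to make it more stable and easier to use, and to provide a stable user interface.

Differences With Other Acceleration Libraries

Fast: stable-fast is specially optimized for HuggingFace Diffusers. It achieves high performance across many libraries. It also compiles very quickly, within only a few seconds, which is significantly faster than torch.compile, TensorRT and AITemplate.
Minimal: stable-fast works as a plugin framework for PyTorch. It utilizes existing PyTorch functionality and infrastructure, and is compatible with other acceleration techniques as well as popular fine-tuning techniques and deployment solutions.
Maximum Compatibility: stable-fast is compatible with all kinds of HuggingFace Diffusers and PyTorch versions. It is also compatible with ControlNet and LoRA, and it even supports the latest StableVideoDiffusionPipeline out of the box!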

Installation

NOTE: stable-fast is currently only tested on Linux and WSL2 in Windows. You need to install PyTorch with CUDA support first (versions from 1.12 to 2.1 are suggested).

I only test stable-fast with torch>=2.1.0, xformers>=0.0.22 and triton>=2.1.0 on CUDA 12.1 and Python 3.10. Other versions might build and run successfully, but that is not guaranteed.
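
Before installing, it can help to check which of these packages are already present and whether they meet the tested minimums. The following is a small standard-library sketch, not part of stable-fast. The function names are made up for this example, and the version parsing is deliberately simple (it ignores pre-release tags and local suffixes such as `+cu121`).

```python
import re
from importlib import metadata

# Minimum versions quoted in the note above.
MINIMUM_VERSIONS = {"torch": "2.1.0", "xformers": "0.0.22", "triton": "2.1.0"}


def version_tuple(version: str) -> tuple:
    """Turn '2.1.0+cu121' into (2, 1, 0), ignoring any suffix."""
    release = re.match(r"\d+(\.\d+)*", version)
    if release is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in release.group(0).split("."))


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare release numbers, padding the shorter one with zeros."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def report(minimums=MINIMUM_VERSIONS) -> dict:
    """Map each package to 'ok', 'too old (<version>)' or 'not installed'."""
    status = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            status[name] = "not installed"
            continue
        if meets_minimum(installed, minimum):
            status[name] = "ok"
        else:
            status[name] = f"too old ({installed})"
    return status


if __name__ == "__main__":
    for package, state in report().items():
        print(f"{package}: {state}")
```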

Install Prebuilt Wheels

Download the wheel corresponding to your system from the Releases Page and install it with pip3 install <wheel file>.

Currently both Linux and Windows wheels are available.

# Change cu121 to your CUDA version and <wheel file> to the path of the wheel file.
# And make sure the wheel file is compatible with your PyTorch version.
pip3 install --index-url https://download.pytorch.org/whl/cu121 \
    'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3' \
    '<wheel file>'

Install From Source

# Make sure you have CUDNN/CUBLAS installed.
# https://developer.nvidia.com/cudnn
# https://developer.nvidia.com/cublas

# Install PyTorch with CUDA and other packages at first.
# Windows user: Triton might be not available, you could skip it.
# NOTE: 'wheel' is required or you will meet `No module named 'torch'` error when building.
pip3 install wheel 'torch>=2.1.0' 'xformers>=0.0.22' 'triton>=2.1.0' 'diffusers>=0.19.3'

# (Optional) Makes the build much faster.
pip3 install ninja

# Set TORCH_CUDA_ARCH_LIST if running and building on different GPU types.
# You can also install the latest stable release from PyPI.
# pip3 install -v -U stable-fast
pip3 install -v -U git+https://github.com/chengzeyi/stable-fast.git@main#egg=stable-fast
# (this can take dozens of minutes)
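
If you build on one machine and run on another, TORCH_CUDA_ARCH_LIST expects a semicolon-separated list of compute capabilities such as 8.0;8.9. The helper below is a hypothetical convenience sketch (the name `arch_list` is not part of stable-fast or PyTorch) that formats such a list. On a machine with PyTorch and a GPU, the `(major, minor)` pairs could come from `torch.cuda.get_device_capability()`; here they are hard-coded so the sketch runs anywhere.

```python
def arch_list(capabilities, include_ptx=False):
    """Format (major, minor) compute capabilities as a TORCH_CUDA_ARCH_LIST value."""
    unique = sorted(set(capabilities))
    entries = [f"{major}.{minor}" for major, minor in unique]
    if include_ptx and entries:
        # "+PTX" on the newest entry also embeds forward-compatible PTX code.
        entries[-1] += "+PTX"
    return ";".join(entries)


# Example targets: A100 (8.0), RTX 4080 (8.9) and H100 (9.0); duplicates are dropped.
value = arch_list([(8, 9), (8, 0), (9, 0), (8, 9)])
print(f"export TORCH_CUDA_ARCH_LIST='{value}'")
```

Listing only the architectures you actually run on keeps the build shorter than compiling for every supported GPU.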

NOTE: Any usage outside sfast.compilers is not guaranteed to be backward compatible.

NOTE: To get the best performance, xformers and OpenAI's triton>=2.1.0 need to be installed and enabled. You might need to build xformers from source to make it compatible with your PyTorch.

Usage

Optimize StableDiffusionPipeline

stable-fast is able to optimize StableDiffusionPipeline and StableDiffusionPipelineXL directly.

import time
import torch
from diffusers import (StableDiffusionPipeline,
                       EulerAncestralDiscreteScheduler)
from sfast.compilers.diffusion_pipeline_compiler import (compile,
                                                         CompilationConfig)

def load_model():
    model = StableDiffusionPipeline.from_pretrained(
        'runwayml/stable-diffusion-v1-5',
        torch_dtype=torch.float16)

    model.scheduler = EulerAncestralDiscreteScheduler.from_config(
        model.scheduler.config)
    model.safety_checker = None
    model.to(torch.device('cuda'))
    return model

model = load_model()

config = CompilationConfig.Default()
# xformers and Triton are suggested for achieving best performance.
try:
    import xformers
    config.enable_xformers = True
except ImportError:
    print('xformers not installed, skip')
try:
    import triton
    config.enable_triton = True
except ImportError:
    print('Triton not installed, skip')
# CUDA Graph is suggested for small batch sizes and small resolutions to reduce CPU overhead.
# But it can increase the amount of GPU memory used.
# For StableVideoDiffusionPipeline it is not needed.
config.enable_cuda_graph = True

model = compile(model, config)

kwarg_inputs = dict(
    prompt=
    '(masterpiece:1,2), best quality, masterpiece, best detailed face, a beautiful girl',
    height=512,
    width=512,
    num_inference_steps=30,
    num_images_per_prompt=1,
)

# NOTE: Warm it up.
# The initial calls will trigger compilation and might be very slow.
# After that, it should be very fast.
for _ in range(3):
    output_image = model(**kwarg_inputs).images[0]

# Let's see it!
# Note: Progress bar might work incorrectly due to the async nature of CUDA.
begin = time.time()
output_image = model(**kwarg_inputs).images[0]
print(f'Inference time: {time.time() - begin:.3f}s')

# Let's view it in terminal!
from sfast.utils.term_image import print_image

print_image(output_image, max_width=80)

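See the example for more details.

You can check this Colab to see how it works on a T4 GPU.

Optimize LCM Pipeline

stable-fast is able to optimize the newest latent consistency model pipeline and achieve a significant speedup.

See the example for more details on how to optimize a normal SD model with LCM LoRA. See the example for more details on how to optimize a standalone LCM model.

Optimize StableVideoDiffusionPipeline

stable-fast is able to optimize the newest StableVideoDiffusionPipeline and achieve a 2x speedup.

See the example for more details.

Dynamically Switch LoRA

Switching LoRA dynamically is supported, but you need to do some extra work. It is possible because the compiled graph and CUDA Graph share the same underlying data (pointers) with the original UNet model. So all you need to do is update the original UNet model's parameters in place.

The following code assumes you have already loaded a LoRA and compiled the model, and you want to switch to another LoRA.

If you don't enable CUDA graph and keep preserve_parameters = True, things could be much easier. The following code might not even be needed.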

# load_state_dict with assign=True requires torch >= 2.1.0

def update_state_dict(dst, src):
    for key, value in src.items():
        # Do inplace copy.
        # As the traced forward function shares the same underlying data (pointers),
        # this modification will be reflected in the traced forward function.
        dst[key].copy_(value)

# Switch "another" LoRA into UNet
def switch_lora(unet, lora):
    # Store the original UNet parameters
    state_dict = unet.state_dict()
    # Load another LoRA into unet
    unet.load_attn_procs(lora)
    # Inplace copy current UNet parameters to the original unet parameters
    update_state_dict(state_dict, unet.state_dict())
    # Load the original UNet parameters back.
    # We use assign=True because we still want to hold the references
    # of the original UNet parameters
    unet.load_state_dict(state_dict, assign=True)

switch_lora(compiled_model.unet, lora_b_path)
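
Why the in-place copy matters can be illustrated without a GPU. The sketch below is only an analogy built from the standard library: an `array` stands in for a UNet weight tensor and a `memoryview` stands in for the compiled graph that holds a pointer to the same memory. It does not use PyTorch or stable-fast.

```python
from array import array

# Stand-in for a UNet weight tensor.
weights = array("f", [1.0, 2.0, 3.0])

# Stand-in for the traced graph: it holds a view of the same memory, not a copy.
captured = memoryview(weights)

# In-place update (the analogue of dst[key].copy_(value)): the view sees it.
for index, value in enumerate([10.0, 20.0, 30.0]):
    weights[index] = value
seen_after_inplace = captured.tolist()

# Rebinding the name allocates new storage: the view still points at the old buffer.
old_weights = weights
weights = array("f", [7.0, 8.0, 9.0])
seen_after_rebind = captured.tolist()

print(seen_after_inplace)  # [10.0, 20.0, 30.0]
print(seen_after_rebind)   # [10.0, 20.0, 30.0], the new values are invisible to the view
```

This is why the code above copies the new LoRA weights into the original tensors and then reassigns those original tensors to the UNet, instead of leaving the freshly allocated ones in place.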

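Model Quantization

stable-fast extends PyTorch's quantize_dynamic functionality and provides a dynamically quantized linear operator on the CUDA backend. By enabling it, you could get a slight VRAM reduction for diffusers and a significant VRAM reduction for transformers, and you could get a potential speedup (not always).

For SD XL, a VRAM reduction of about 2GB is expected with an image size of 1024x1024.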

import torch

def quantize_unet(m):
    from diffusers.utils import USE_PEFT_BACKEND
    assert USE_PEFT_BACKEND
    # Replace every torch.nn.Linear in the module with a dynamically
    # quantized int8 version, modifying the module in place.
    m = torch.quantization.quantize_dynamic(m, {torch.nn.Linear},
                                            dtype=torch.qint8,
                                            inplace=True)
    return m

model.unet = quantize_unet(model.unet)
if hasattr(model, 'controlnet'):
    model.controlnet = quantize_unet(model.controlnet)

See the example for more details.
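
To make the VRAM trade-off concrete, here is a simplified sketch of int8 weight quantization in plain Python. It assumes a symmetric, per-tensor scheme chosen purely for illustration; it is not a description of what PyTorch's quantize_dynamic or stable-fast's CUDA operator does internally, and the function names are made up for this example.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

# Storage cost per weight: 2 bytes in fp16 versus 1 byte in int8.
fp16_bytes = 2 * len(weights)
int8_bytes = 1 * len(weights)

print(q)
print(f"max rounding error: {max_error:.4f}")
print(f"fp16: {fp16_bytes} bytes, int8: {int8_bytes} bytes")
```

Only the weights of the quantized layers shrink, so the overall saving depends on how much of a model's memory is taken up by linear layers.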

Some Common Methods To Speed Up PyTorch

# TCMalloc is highly suggested to reduce CPU overhead
# https://github.com/google/tcmalloc
LD_PRELOAD=/path/to/libtcmalloc.so python3 ...

import packaging.version
import torch

if packaging.version.parse(torch.__version__) >= packaging.version.parse('1.12.0'):
    torch.backends.cuda.matmul.allow_tf32 = True

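Performance Comparison

Performance varies greatly across different hardware/software/platform/driver configurations. It is very hard to benchmark accurately, and preparing the environment for benchmarking is also a hard job. I have tested on some platforms before, but the results may still be inaccurate. Note that when benchmarking, the progress bar shown by tqdm may be inaccurate because of the asynchronous nature of CUDA. To solve this problem, I use CUDA Event to measure the number of iterations per second accurately.

stable-fast is expected to work better on newer GPUs and newer CUDA versions. On older GPUs, the performance increase might be limited.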
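
The measurement idea (warm up first, then make sure all asynchronous work has finished before reading the clock) can be sketched as follows. This is not the author's benchmark script. It uses a wall-clock timer with an optional synchronization callback rather than CUDA Events, and the function name and parameters are assumptions made for this example.

```python
import time


def iterations_per_second(step, iterations=20, warmup=3, synchronize=None):
    """Time `step` after warm-up, waiting for pending async work before reading the clock."""
    wait = synchronize if synchronize is not None else (lambda: None)
    for _ in range(warmup):
        step()
    wait()  # make sure the warm-up work has finished
    start = time.perf_counter()
    for _ in range(iterations):
        step()
    wait()  # make sure all timed work has finished
    elapsed = time.perf_counter() - start
    return iterations / elapsed if elapsed > 0 else float("inf")


# CPU-only stand-in workload so the sketch runs anywhere.
def fake_step():
    sum(i * i for i in range(10_000))


rate = iterations_per_second(fake_step)
print(f"{rate:.1f} it/s")
```

On a GPU you would pass `synchronize=torch.cuda.synchronize`. Without that call, the timer can stop while kernels are still queued, which is the same effect that makes the tqdm progress bar misleading.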

RTX 4080 (512x512, batch size 1, fp16, in WSL2)

This is my personal gaming PC. It has a more powerful CPU than those from cloud server providers.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	29.5 it/s	4.6 it/s	19.7 it/s
torch.compile (2.1.0, max-autotune)	40.0 it/s	6.1 it/s	21.8 it/s
AITemplate	44.2 it/s
OneFlow	53.6 it/s
AUTO1111 WebUI	17.2 it/s	3.6 it/s
AUTO1111 WebUI (with SDPA)	24.5 it/s	4.3 it/s
TensorRT (AUTO1111 WebUI)	40.8 it/s
TensorRT Official Demo	52.6 it/s
stable-fast (with xformers & Triton)	51.6 it/s	9.1 it/s	36.7 it/s
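
To read the table as relative speedups, divide the stable-fast rate by the baseline rate. The short sketch below does this for the Vanilla PyTorch row, using only the numbers from the table above; the helper names are made up for this example.

```python
# Iterations per second copied from the RTX 4080 table above.
vanilla = {"SD 1.5": 29.5, "SD XL (1024x1024)": 4.6, "SD 1.5 ControlNet": 19.7}
stable_fast = {"SD 1.5": 51.6, "SD XL (1024x1024)": 9.1, "SD 1.5 ControlNet": 36.7}


def speedup(baseline_its, optimized_its):
    """Ratio of optimized to baseline throughput."""
    return optimized_its / baseline_its


def seconds_for_steps(its, steps=30):
    """Approximate denoising-loop time for `steps` iterations at `its` it/s."""
    return steps / its


speedups = {
    name: round(speedup(vanilla[name], stable_fast[name]), 2) for name in vanilla
}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
# SD 1.5: 1.75x
# SD XL (1024x1024): 1.98x
# SD 1.5 ControlNet: 1.86x
```

Note that it/s covers only the denoising loop, so `seconds_for_steps` gives a lower bound on end-to-end image time rather than the full latency.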

H100

Thanks to the help of @consceleratus and @harishp, I have tested the speed on H100.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	54.5 it/s	14.9 it/s	35.8 it/s
torch.compile (2.1.0, max-autotune)	66.0 it/s	18.5 it/s
stable-fast (with xformers & Triton)	104.6 it/s	21.6 it/s	72.6 it/s

A100

Thanks to the help of @supersecurehuman and @Jon-Chuang, benchmarks on A100 are now available.

Framework	SD 1.5	SD XL (1024x1024)	SD 1.5 ControlNet
Vanilla PyTorch (2.1.0)	35.6 it/s	8.7 it/s	25.1 it/s
torch.compile (2.1.0, max-autotune)	41.9 it/s	10.0 it/s
stable-fast (with xformers & Triton)	61.8 it/s	11.9 it/s	41.1 it/s

Compatibility

Model	Supported
Hugging Face Diffusers (1.5/2.1/XL)	Yes
With ControlNet	Yes
With LoRA	Yes
Latent Consistency Model	Yes
SDXL Turbo	Yes
Stable Video Diffusion	Yes