FlashInfer (v0.2.0.post1)

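Kernel Library for LLM Serving

| Blog | Documentation | Slack | Discussion Forum |

FlashInfer is a library and kernel generator for large language models (LLMs) that provides high-performance implementations of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, and sampling. It focuses on LLM serving and inference and delivers state-of-the-art performance across diverse scenarios.

Check out our v0.2 release blog for new features!

The core features of FlashInfer include: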

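Efficient sparse/dense attention kernels: efficient single-request and batch attention for sparse (paged) and dense KV storage, built on CUDA core and Tensor Core (both FA2 and FA3) templates. The vector-sparse attention kernels can reach 90% of the bandwidth of dense kernels at the same problem size.
Load-balanced scheduling: FlashInfer decouples the plan and run stages of attention computation. Variable-length inputs are scheduled in the plan stage to alleviate load imbalance.
Memory efficiency: FlashInfer offers cascade attention for hierarchical KV-cache, implements head-query fusion to accelerate grouped-query attention, and provides efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-cache.
Customizable attention: bring your own attention variants through JIT compilation.
CUDAGraph and torch.compile compatibility: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference.
Efficient LLM-specific operators: high-performance fused kernels for Top-P, Top-K, and Min-P sampling that do not require sorting.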
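
The last item mentions sampling without sorting. One way to achieve that is rejection sampling: draw a token, accept it if it lies inside the top-p nucleus, and otherwise raise a probability threshold and draw again. The pure-Python sketch below illustrates only the idea; it is not FlashInfer's kernel code, and `top_p_sample_without_sort` is a hypothetical name.

```python
import random


def top_p_sample_without_sort(probs, top_p, rng):
    """Draw one token id from the top-p nucleus without sorting `probs`.

    Illustrative sketch only (hypothetical helper, not a FlashInfer API).
    Assumes `probs` sums to 1 and 0 < top_p <= 1.
    """
    pivot = 0.0  # tokens with probability <= pivot are excluded from draws
    while True:
        # Inverse-transform sampling restricted to tokens above the pivot.
        candidates = [(i, p) for i, p in enumerate(probs) if p > pivot]
        u = rng.random() * sum(p for _, p in candidates)
        acc = 0.0
        token, token_p = candidates[-1]  # fallback against rounding error
        for i, p in candidates:
            acc += p
            if u < acc:
                token, token_p = i, p
                break
        # The token is in the nucleus if the tokens strictly more likely
        # than it carry less than top_p of the probability mass.
        mass_above = sum(p for p in probs if p > token_p)
        if mass_above < top_p:
            return token
        pivot = token_p  # reject: exclude this token and anything less likely


probs = [0.5, 0.3, 0.1, 0.06, 0.04]
rng = random.Random(0)
samples = [top_p_sample_without_sort(probs, 0.75, rng) for _ in range(1000)]
print(sorted(set(samples)))
```

With `top_p = 0.75` the nucleus is the first two tokens, so only ids 0 and 1 are ever returned. Each pass over `probs` is a sum or a scan, which is the kind of operation that maps well to a GPU reduction, whereas a full sort does not.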

FlashInfer supports PyTorch, TVM, and C++ (header-only) APIs and can be easily integrated into existing projects.

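News

[Dec 16, 2024] Blog post: FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
[Sept 2024] We have launched a Slack workspace for FlashInfer users and developers. Join us for timely support, discussions, updates, and knowledge sharing!
[Jan 31, 2024] Blog post: Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
[Jan 31, 2024] Blog post: Accelerating Self-Attentions for LLM Serving with FlashInfer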

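Getting Started

Using our PyTorch API is the easiest way to get started:

Installation

We provide prebuilt wheels for Linux. You can install FlashInfer with the following commands: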

# For CUDA 12.4 & torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

We also offer nightly-built wheels for trying the latest features from the main branch:

pip install flashinfer -i https://flashinfer.ai/whl/nightly/cu124/torch2.4
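
When installations are scripted across several environments, the index URL can be assembled from the CUDA and torch versions. The sketch below assumes that other wheels follow the same `cu<major><minor>/torch<major>.<minor>` layout as the two commands above. That pattern is an assumption, and the installation page remains the authority on which combinations actually exist. `wheel_index_url` is a hypothetical helper.

```python
def wheel_index_url(cuda_version, torch_version, nightly=False):
    """Build a FlashInfer wheel index URL (hypothetical helper).

    Assumes the URL layout generalizes from the cu124/torch2.4 example.
    """
    cuda_tag = "cu" + "".join(cuda_version.split(".")[:2])        # "12.4" -> "cu124"
    torch_tag = "torch" + ".".join(torch_version.split(".")[:2])  # "2.4.1" -> "torch2.4"
    channel = "whl/nightly" if nightly else "whl"
    return f"https://flashinfer.ai/{channel}/{cuda_tag}/{torch_tag}"


url = wheel_index_url("12.4", "2.4")
print(f"pip install flashinfer -i {url}")
```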

Alternatively, you can build FlashInfer from source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
pip install -e . -v

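By default, FlashInfer compiles its kernels just-in-time (JIT). To precompile the essential kernels ahead of time (AOT), set the environment variable FLASHINFER_ENABLE_AOT=1 before running the installation command: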

FLASHINFER_ENABLE_AOT=1 pip install -e . -v
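
When the install is driven from a Python script rather than a shell, the same variable can be passed through the child process environment. This is a minimal sketch in which `aot_install_env` is a hypothetical helper; the actual `pip` call is left commented out so the snippet has no side effects.

```python
import os
import subprocess
import sys


def aot_install_env(base_env=None):
    """Return a copy of the environment with AOT compilation enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["FLASHINFER_ENABLE_AOT"] = "1"
    return env


env = aot_install_env()
cmd = [sys.executable, "-m", "pip", "install", "-e", ".", "-v"]
# Run from the root of the flashinfer checkout:
# subprocess.run(cmd, env=env, check=True)
print(env["FLASHINFER_ENABLE_AOT"], " ".join(cmd[1:]))
```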

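For more details, refer to the Install from Source documentation.

Trying it out

Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels: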

import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# decode attention

num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)

o = flashinfer.single_decode_with_kv_cache(q, k, v)  # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA")  # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0)  # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)  # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA")  # append attention with LLaMA style RoPE on-the-fly, apply causal mask

# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0)  # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False)  # prefill attention without RoPE on-the-fly, do not apply causal mask
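
To make clear what these kernels compute, here is a slow pure-Python reference for a single head. It assumes the usual softmax scale of 1/sqrt(head_dim) and, for the causal case, that the queries line up with the last `qo_len` positions of the KV-cache, as the append comment in the example describes. `reference_attention` is a hypothetical helper for illustration and is not part of FlashInfer.

```python
import math


def reference_attention(q, k, v, causal=False):
    """Single-head attention on nested lists.

    q: [qo_len][head_dim], k and v: [kv_len][head_dim].
    With causal=True, query i sees KV positions 0 .. kv_len - qo_len + i.
    """
    qo_len, kv_len, head_dim = len(q), len(k), len(q[0])
    scale = 1.0 / math.sqrt(head_dim)  # assumed default softmax scale
    out = []
    for i, q_i in enumerate(q):
        last = kv_len - qo_len + i if causal else kv_len - 1
        scores = [scale * sum(a * b for a, b in zip(q_i, k[j]))
                  for j in range(last + 1)]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j][d] for j, w in enumerate(weights)) / total
                    for d in range(len(v[0]))])
    return out


# Zero queries give uniform weights, so each output is a plain average
# of the visible values, which makes the causal mask easy to see.
k = [[1.0, 0.0]] * 4
v = [[0.0], [2.0], [4.0], [6.0]]
q = [[0.0, 0.0], [0.0, 0.0]]  # two appended tokens, kv_len = 4
print(reference_attention(q, k, v, causal=True))   # [[2.0], [3.0]]
print(reference_attention(q, k, v, causal=False))  # [[3.0], [3.0]]
```

Decode corresponds to the special case of a single query row, and prefill without a causal mask lets every query see the whole KV-cache.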

Check out the documentation for usage of the batch decode/append/prefill kernels and the shared-prefix cascading kernels.
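
The batch kernels operate on a paged KV-cache, in which each request owns a list of fixed-size pages. Such a cache is commonly described by three arrays: an index pointer into a flat page list, the page indices themselves, and the number of valid tokens in each request's last page. The sketch below builds those arrays in plain Python for illustration. The function name and the sequential page allocation are assumptions, so consult the documentation for the exact tensors the batch wrappers expect.

```python
def build_paged_kv_layout(seq_lens, page_size):
    """Describe a batch of KV-caches as (indptr, indices, last_page_len).

    Hypothetical helper: pages are handed out sequentially starting at 0,
    whereas a real server would take them from a free-page pool.
    Assumes every sequence length is at least 1.
    """
    indptr, indices, last_page_len = [0], [], []
    next_free_page = 0
    for seq_len in seq_lens:
        num_pages = (seq_len + page_size - 1) // page_size  # ceiling division
        indices.extend(range(next_free_page, next_free_page + num_pages))
        next_free_page += num_pages
        indptr.append(len(indices))
        # Tokens in the final page: between 1 and page_size.
        last_page_len.append((seq_len - 1) % page_size + 1)
    return indptr, indices, last_page_len


indptr, indices, last_page_len = build_paged_kv_layout([5, 16, 33], page_size=16)
print(indptr)         # [0, 1, 2, 5]
print(indices)        # [0, 1, 2, 3, 4]
print(last_page_len)  # [5, 16, 1]
```

Request `i` owns the pages `indices[indptr[i]:indptr[i + 1]]`, so requests of very different lengths share one flat buffer without padding.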

Run Benchmarks

We profile FlashInfer kernel performance with nvbench. You can compile and run the benchmarks with the following commands:

mkdir build
cp cmake/config.cmake build # you can modify the config.cmake to enable/disable benchmarks and change CUDA architectures
cd build
cmake ..
make -j12

You can run ./bench_{single/batch}_{prefill/decode} to benchmark performance (for example, ./bench_single_prefill for single-request prefill attention). ./bench_{single/batch}_{prefill/decode} --help shows the available options.
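
The brace pattern expands to four binaries. The small sketch below enumerates them and prints the matching `--help` commands. It only builds the command strings, and it assumes all four benchmarks were left enabled in config.cmake.

```python
from itertools import product

# Expand ./bench_{single/batch}_{prefill/decode} into concrete names.
benchmarks = [f"./bench_{mode}_{stage}"
              for mode, stage in product(("single", "batch"), ("prefill", "decode"))]

for binary in benchmarks:
    print(f"{binary} --help")
```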