FlashInfer v0.2.0.post1

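Kernel Library for LLM Serving

| Blog | Documentation | Slack | Discussion Forum |

FlashInfer is a library and kernel generator for Large Language Models. It provides high-performance implementations of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, and sampling. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.

Check out our v0.2 release blog for new features!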

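The core features of FlashInfer include:

Efficient Sparse/Dense Attention Kernels: Efficient single/batch attention for sparse (paged) and dense KV storage, on both CUDA Core and Tensor Core (FA2 and FA3) templates. The vector-sparse attention can achieve 90% of the bandwidth of dense kernels with the same problem size.
Load-Balanced Scheduling: FlashInfer decouples the plan and run stages of attention computation. The computation of variable-length inputs is scheduled in the plan stage to alleviate load imbalance.
Memory Efficiency: FlashInfer offers Cascade Attention for hierarchical KV-Cache, implements Head-Query fusion to accelerate Grouped-Query Attention, and provides efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
Customizable Attention: Bring your own attention variants through JIT compilation.
CUDAGraph and torch.compile Compatibility: FlashInfer kernels can be captured by CUDAGraphs and torch.compile for low-latency inference.
Efficient LLM-specific Operators: High-performance fused kernels for Top-P, Top-K, and Min-P sampling without the need for sorting.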
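
This README does not say how the sorting-free sampling kernels work. As an illustration only, one known sorting-free scheme is rejection sampling against a rising probability threshold. The pure-Python sketch below assumes that scheme; the function name and arguments are hypothetical, and it is not FlashInfer's kernel code or API.

```python
import random


def top_p_sample_without_sorting(probs, top_p, rng=random):
    """Draw one token id from the top-p nucleus of `probs` without sorting.

    Hypothetical helper for illustration; not part of the FlashInfer API.
    """
    if not 0.0 < top_p <= 1.0:
        raise ValueError("top_p must be in (0, 1]")
    pivot = 0.0  # tokens with probability <= pivot are ruled out
    while True:
        # Sample among the surviving tokens, proportionally to their probability.
        candidates = [i for i, p in enumerate(probs) if p > pivot]
        weights = [probs[i] for i in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]
        # Accept if the tokens strictly more likely than `token` hold less
        # than top_p of the mass, i.e. `token` belongs to the nucleus.
        mass_above = sum(p for p in probs if p > probs[token])
        if mass_above < top_p:
            return token
        # Otherwise reject and raise the threshold to exclude this token.
        pivot = probs[token]


rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]
samples = [top_p_sample_without_sorting(probs, top_p=0.7, rng=rng) for _ in range(200)]
```

With `top_p=0.7`, only tokens 0 and 1 can be returned: tokens 2 and 3 are rejected because the mass above them already reaches 0.7. Each round costs a few linear passes over the distribution instead of a full sort.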

FlashInfer supports PyTorch, TVM, and C++ (header-only) APIs, and can be easily integrated into existing projects.

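News

[Dec 16, 2024] Blog post: FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
[Sept 2024] We've launched a Slack workspace for FlashInfer users and developers. Join us for timely support, discussions, updates, and knowledge sharing!
[Jan 31, 2024] Blog post: Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
[Jan 31, 2024] Blog post: Accelerating Self-Attentions for LLM Serving with FlashInfer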

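Getting Started

Using our PyTorch API is the easiest way to get started:

Installation

We provide prebuilt wheels for Linux. You can install FlashInfer with the following command: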

# For CUDA 12.4 & torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

We also offer nightly-built wheels for trying the latest features from the main branch:

pip install flashinfer -i https://flashinfer.ai/whl/nightly/cu124/torch2.4

Alternatively, you can build FlashInfer from source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
pip install -e . -v

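By default, FlashInfer uses Just-In-Time (JIT) compilation for its kernels. To pre-compile the essential kernels ahead of time, set the environment variable FLASHINFER_ENABLE_AOT=1 before running the installation command: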

FLASHINFER_ENABLE_AOT=1 pip install -e . -v

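For more details, refer to the Install from Source documentation.

Trying it out

Below is a minimal example of using FlashInfer's single-request decode/append/prefill attention kernels: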

import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# decode attention
num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)

o = flashinfer.single_decode_with_kv_cache(q, k, v)  # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA")  # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0)  # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)  # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA")  # append attention with LLaMA style RoPE on-the-fly, apply causal mask

# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0)  # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False)  # prefill attention without RoPE on-the-fly, do not apply causal mask
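
To make the shapes and the causal mask in the example above concrete, here is a dependency-free reference sketch of the attention computed without RoPE. It rests on assumptions not stated in this README: a softmax scale of 1/sqrt(head_dim), query head h reading KV head h // (num_qo_heads // num_kv_heads), and causal queries aligned to the last qo_len positions of the KV-Cache. `reference_attention` is a hypothetical name, not a FlashInfer function, and decode is modeled as the qo_len = 1 case.

```python
import math


def reference_attention(q, k, v, causal=False):
    """Naive attention over nested lists, for checking shapes and masking.

    q: [qo_len][num_qo_heads][head_dim]
    k, v: [kv_len][num_kv_heads][head_dim]
    Returns [qo_len][num_qo_heads][head_dim]. Assumes qo_len <= kv_len.
    """
    qo_len, kv_len = len(q), len(k)
    num_qo_heads, num_kv_heads = len(q[0]), len(k[0])
    head_dim = len(q[0][0])
    group_size = num_qo_heads // num_kv_heads  # assumed GQA head mapping
    scale = 1.0 / math.sqrt(head_dim)  # assumed default softmax scale
    out = []
    for i in range(qo_len):
        # With a causal mask, query i is treated as token kv_len - qo_len + i,
        # so it sees every earlier KV entry plus itself.
        visible = kv_len - qo_len + i + 1 if causal else kv_len
        row = []
        for h in range(num_qo_heads):
            kv_h = h // group_size
            scores = [
                scale * sum(a * b for a, b in zip(q[i][h], k[j][kv_h]))
                for j in range(visible)
            ]
            top = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - top) for s in scores]
            total = sum(weights)
            row.append([
                sum(w * v[j][kv_h][d] for j, w in enumerate(weights)) / total
                for d in range(head_dim)
            ])
        out.append(row)
    return out


# Tiny example: 3 KV tokens, 1 KV head, 2 query heads, head_dim 2.
k = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
v = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]]]
q_decode = [[[0.0, 0.0], [0.0, 0.0]]]      # decode: a single query token
q_append = [[[0.0, 0.0], [0.0, 0.0]]] * 3  # qo_len == kv_len

o_decode = reference_attention(q_decode, k, v)
o_causal = reference_attention(q_append, k, v, causal=True)
```

Because the queries are all zeros, every visible KV entry gets equal weight: the decode output is the mean of all values, while the first causal query only sees the first KV entry.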

Check out the documentation for usage of the batch decode/append/prefill kernels and the shared-prefix cascading kernels.
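
This README gives no detail on how the shared-prefix cascade kernels combine results. A common building block for this kind of scheme is that attention over two disjoint KV segments can be merged exactly if each partial result carries the log-sum-exp of its scores. The sketch below illustrates that identity in plain Python; `merge_attention_states` and `attend` are hypothetical helpers written for this example, not FlashInfer calls.

```python
import math


def merge_attention_states(o_a, lse_a, o_b, lse_b):
    """Merge attention outputs computed over two disjoint KV segments.

    o_a, o_b: output vectors for the same query over segment A and segment B.
    lse_a, lse_b: log-sum-exp of the attention scores over each segment.
    """
    top = max(lse_a, lse_b)  # shift by the max to keep exp() in range
    w_a, w_b = math.exp(lse_a - top), math.exp(lse_b - top)
    merged = [(w_a * a + w_b * b) / (w_a + w_b) for a, b in zip(o_a, o_b)]
    return merged, top + math.log(w_a + w_b)


def attend(scores, values):
    """Softmax-weighted sum of `values`, plus the log-sum-exp of `scores`."""
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [sum(w * val[d] for w, val in zip(weights, values)) / total for d in range(dim)]
    return out, top + math.log(total)


scores = [0.1, 1.2, -0.7, 0.4]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0]]

full, full_lse = attend(scores, values)
prefix = attend(scores[:2], values[:2])  # e.g. a prefix shared across requests
suffix = attend(scores[2:], values[2:])  # e.g. the per-request remainder
merged, merged_lse = merge_attention_states(*prefix, *suffix)
```

`merged` matches `full` up to floating-point rounding, which is why the shared prefix only needs to be attended to once per batch and can then be combined with each request's own suffix.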

Run Benchmarks

We profile FlashInfer kernel performance with nvbench. You can compile and run the benchmarks with the following commands:

mkdir build
cp cmake/config.cmake build # you can modify the config.cmake to enable/disable benchmarks and change CUDA architectures
cd build
cmake ..
make -j12

You can run ./bench_{single/batch}_{prefill/decode} to benchmark performance (e.g. ./bench_single_prefill for single-request prefill attention). ./bench_{single/batch}_{prefill/decode} --help will show you the available options.
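
The brace pattern above stands for four binaries. As a small convenience sketch, the script below expands the pattern and prints the --help output of each binary it finds. It assumes the binaries are placed directly in the build directory, which this README does not state; the function names are made up for this example.

```python
import itertools
import pathlib
import subprocess


def benchmark_names():
    """Expand bench_{single/batch}_{prefill/decode} into concrete names."""
    return [
        f"bench_{scope}_{phase}"
        for scope, phase in itertools.product(("single", "batch"), ("prefill", "decode"))
    ]


def show_help(build_dir="build"):
    """Print the --help output of each benchmark binary found in build_dir."""
    for name in benchmark_names():
        binary = pathlib.Path(build_dir) / name  # assumed output location
        if binary.is_file():
            subprocess.run([str(binary), "--help"], check=False)
        else:
            print(f"{name}: not built (looked in {build_dir}/)")


names = benchmark_names()
```

Calling `show_help()` from the repository root after the build step lists the options of whichever benchmarks were enabled in config.cmake.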