FlashInfer

Language: Python
Version: v0.2.0.post1

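Kernel Library for LLM Serving

| Blog | Documentation | Slack | Discussion Forum |

FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, and Sampling. FlashInfer focuses on LLM serving and inference and delivers state-of-the-art performance across diverse scenarios.

Check the v0.2 release blog for new features!

The core features of FlashInfer include: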

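Efficient Sparse/Dense Attention Kernels: Efficient single/batch attention for sparse (paged)/dense KV storage on CUDA Cores and Tensor Cores (both FA2 and FA3 templates). The vector-sparse attention can achieve 90% of the bandwidth of dense kernels at the same problem size.
Load-Balanced Scheduling: FlashInfer separates attention computation into a plan stage and a run stage. The computation of variable-length inputs is scheduled in the plan stage, which alleviates load imbalance in the run stage.
Memory Efficiency: FlashInfer offers Cascade Attention for hierarchical KV-Cache, implements Head-Query fusion to accelerate Grouped-Query Attention, and provides efficient kernels for low-precision attention and fused-RoPE attention for compressed KV-Cache.
Customizable Attention: Bring your own attention variants through JIT compilation.
CUDAGraph and torch.compile Compatibility: FlashInfer kernels can be captured by CUDAGraphs and torch.compile.
Efficient LLM-specific Operators: High-performance fused kernels for Top-P, Top-K/Min-P sampling without sorting.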
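
To make the plan/run split from the feature list concrete, here is a minimal pure-Python sketch. It is not FlashInfer's scheduler: the function names `plan` and `run`, the fixed `chunk_size`, and the greedy longest-first assignment are all assumptions chosen for illustration. It only shows why splitting variable-length KV sequences into chunks before execution evens out per-worker load.

```python
import heapq


def plan(kv_lens, num_workers, chunk_size):
    """Hypothetical plan stage: split each request's KV length into chunks
    and assign the chunks to workers so that loads stay balanced."""
    # Work items are (request index, chunk start, chunk length).
    chunks = []
    for req, kv_len in enumerate(kv_lens):
        for start in range(0, kv_len, chunk_size):
            chunks.append((req, start, min(chunk_size, kv_len - start)))

    # Greedy longest-first: always give the next chunk to the least-loaded worker.
    chunks.sort(key=lambda c: c[2], reverse=True)
    heap = [(0, worker) for worker in range(num_workers)]  # (load, worker id)
    heapq.heapify(heap)
    schedule = [[] for _ in range(num_workers)]
    for chunk in chunks:
        load, worker = heapq.heappop(heap)
        schedule[worker].append(chunk)
        heapq.heappush(heap, (load + chunk[2], worker))
    return schedule


def run(schedule):
    """Stand-in for the run stage: report the number of KV tokens per worker."""
    return [sum(length for _, _, length in items) for items in schedule]


kv_lens = [4096, 512, 128, 64]  # one long request and three short ones
loads = run(plan(kv_lens, num_workers=4, chunk_size=1024))
print("per-worker load:", loads)
print("busiest worker without splitting:", max(kv_lens))
```

With one request per worker, the busiest worker would process 4096 tokens while the others sit mostly idle; after chunking, the busiest worker in this sketch processes 1536. A real kernel also has to merge the partial attention results of chunks that belong to the same request, which this sketch leaves out.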

FlashInfer supports PyTorch, TVM, and C++ (header-only) APIs, and can be easily integrated into existing projects.

News

[Dec 16, 2024] Blog post: FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference
[Sept 2024] We have launched a Slack workspace for FlashInfer users and developers. Join us for timely support, discussions, updates, and knowledge sharing!
[Jan 31, 2024] Blog post: Cascade Inference: Memory-Efficient Shared Prefix Batch Decoding
[Jan 31, 2024] Blog post: Accelerating Self-Attentions for LLM Serving with FlashInfer

Getting Started

Using the PyTorch API is the easiest way to get started.

Installation

We provide prebuilt wheels for Linux. You can install FlashInfer with the following command:

# For CUDA 12.4 & torch 2.4
pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.4
# For other CUDA & torch versions, please check https://docs.flashinfer.ai/installation.html

We also provide nightly-built wheels so you can try the latest features from the main branch:

pip install flashinfer -i https://flashinfer.ai/whl/nightly/cu124/torch2.4
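
Both wheel commands follow the same index URL pattern, so the small helper below sketches how the URL could be assembled for scripting. The helper names `wheel_index_url` and `pip_command` are hypothetical, and only the CUDA 12.4 / torch 2.4 combination is confirmed by the commands above. Whether a wheel index exists for any other combination is an assumption that should be checked against https://docs.flashinfer.ai/installation.html.

```python
def wheel_index_url(cuda: str, torch: str, nightly: bool = False) -> str:
    """Build the pip index URL, assuming other CUDA/torch versions follow
    the same pattern as the documented cu124/torch2.4 case."""
    cuda_tag = "cu" + cuda.replace(".", "")  # "12.4" -> "cu124"
    channel = "nightly/" if nightly else ""
    return f"https://flashinfer.ai/whl/{channel}{cuda_tag}/torch{torch}"


def pip_command(cuda: str, torch: str, nightly: bool = False) -> list:
    """Return the pip invocation as an argument list (nothing is executed)."""
    return ["pip", "install", "flashinfer", "-i", wheel_index_url(cuda, torch, nightly)]


print(" ".join(pip_command("12.4", "2.4")))
print(" ".join(pip_command("12.4", "2.4", nightly=True)))
```

The function only formats strings, so it is safe to run anywhere; pass the resulting list to `subprocess.run` if you actually want to install.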

Alternatively, you can build FlashInfer from source:

git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
pip install -e . -v

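By default, FlashInfer uses Just-In-Time (JIT) compilation for its kernels. To pre-compile the essential kernels ahead of time, set the environment variable FLASHINFER_ENABLE_AOT=1 before running the installation command: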

FLASHINFER_ENABLE_AOT=1 pip install -e . -v

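For more details, refer to the Install from Source section of the documentation.

Trying it out

Below is a minimal example that uses FlashInfer's single-request decode/append/prefill attention kernels: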

import torch
import flashinfer

kv_len = 2048
num_kv_heads = 32
head_dim = 128

k = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)
v = torch.randn(kv_len, num_kv_heads, head_dim).half().to(0)

# decode attention

num_qo_heads = 32
q = torch.randn(num_qo_heads, head_dim).half().to(0)

o = flashinfer.single_decode_with_kv_cache(q, k, v)  # decode attention without RoPE on-the-fly
o_rope_on_the_fly = flashinfer.single_decode_with_kv_cache(q, k, v, pos_encoding_mode="ROPE_LLAMA")  # decode with LLaMA style RoPE on-the-fly

# append attention
append_qo_len = 128
q = torch.randn(append_qo_len, num_qo_heads, head_dim).half().to(0)  # append attention, the last 128 tokens in the KV-Cache are the new tokens
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)  # append attention without RoPE on-the-fly, apply causal mask
o_rope_on_the_fly = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True, pos_encoding_mode="ROPE_LLAMA")  # append attention with LLaMA style RoPE on-the-fly, apply causal mask

# prefill attention
qo_len = 2048
q = torch.randn(qo_len, num_qo_heads, head_dim).half().to(0)  # prefill attention
o = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=False)  # prefill attention without RoPE on-the-fly, do not apply causal mask
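
To clarify what the decode, append, and prefill calls above compute, here is a slow pure-Python reference sketch of scaled dot-product attention on nested lists. It does not use or reproduce FlashInfer's kernels. The function name `reference_attention` is hypothetical, and three points are assumptions: the softmax scale defaults to 1/sqrt(head_dim), consecutive query heads share a KV head when num_qo_heads exceeds num_kv_heads, and the causal mask is right-aligned so that the queries correspond to the last qo_len tokens of the KV-Cache (as the append comment above describes). RoPE is not modeled.

```python
import math


def reference_attention(q, k, v, causal=False):
    """q: [qo_len][num_qo_heads][head_dim]; k, v: [kv_len][num_kv_heads][head_dim].
    Returns a list shaped like q. Decode is the special case qo_len == 1."""
    qo_len, kv_len = len(q), len(k)
    num_qo_heads, num_kv_heads = len(q[0]), len(k[0])
    head_dim = len(q[0][0])
    group = num_qo_heads // num_kv_heads  # query heads per KV head (assumed layout)
    scale = 1.0 / math.sqrt(head_dim)     # assumed default softmax scale

    out = []
    for i in range(qo_len):
        # Right-aligned causal mask: query i is token (kv_len - qo_len + i).
        last = kv_len - qo_len + i if causal else kv_len - 1
        row = []
        for h in range(num_qo_heads):
            kh = h // group
            scores = [
                scale * sum(a * b for a, b in zip(q[i][h], k[j][kh]))
                for j in range(last + 1)
            ]
            peak = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(s - peak) for s in scores]
            total = sum(weights)
            row.append([
                sum(w * v[j][kh][d] for j, w in enumerate(weights)) / total
                for d in range(head_dim)
            ])
        out.append(row)
    return out


# Tiny case: identical keys give uniform weights, so each output is a mean of values.
k_demo = [[[0.0, 0.0]], [[0.0, 0.0]], [[0.0, 0.0]]]
v_demo = [[[1.0, 0.0]], [[3.0, 0.0]], [[5.0, 6.0]]]
q_demo = [[[1.0, 2.0]], [[0.5, -1.0]]]  # the last 2 of the 3 tokens are new
o_causal = reference_attention(q_demo, k_demo, v_demo, causal=True)
o_full = reference_attention(q_demo, k_demo, v_demo, causal=False)
print(o_causal)
print(o_full)
```

In the causal case the first query sees only the first two KV entries while the second sees all three; without the mask, both queries see every entry. A reference like this is useful only for checking small shapes by hand.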

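Check out the documentation for usage of the batch decode/append/prefill kernels and the shared-prefix cascading kernels.

Run Benchmarks

We profile FlashInfer kernel performance with nvbench. You can compile and run the benchmarks with the following commands: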

mkdir build
cp cmake/config.cmake build # you can modify the config.cmake to enable/disable benchmarks and change CUDA architectures
cd build
cmake ..
make -j12

Run ./bench_{single/batch}_{prefill/decode} to benchmark performance (for example, ./bench_single_prefill). ./bench_{single/batch}_{prefill/decode} --help shows the available options.
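
The brace notation above expands to four binary names. The short sketch below lists them, assuming every single/batch and prefill/decode combination is built in the current directory. The helper name `benchmark_binaries` is hypothetical, and nothing is executed.

```python
from itertools import product


def benchmark_binaries(prefix="./bench"):
    """Expand bench_{single/batch}_{prefill/decode} into concrete names."""
    return [
        f"{prefix}_{scope}_{stage}"
        for scope, stage in product(("single", "batch"), ("prefill", "decode"))
    ]


for binary in benchmark_binaries():
    print(binary, "--help")
```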