Important
bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
< English | 中文 >
IPEX-LLM is an LLM acceleration library for Intel GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max), NPU and CPU [1].
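For a sense of what using the library looks like in practice, below is a minimal sketch of INT4 inference on an Intel GPU using the HuggingFace-style Python API; the model ID, prompt, and generation settings are illustrative placeholders, not a prescribed configuration.

```python
import torch
from transformers import AutoTokenizer

# ipex-llm provides a drop-in replacement for the HuggingFace AutoModel classes;
# load_in_4bit=True applies low-bit (INT4) quantization at load time.
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_4bit=True,
    trust_remote_code=True,
)
model = model.half().to("xpu")  # "xpu" is the Intel GPU device in PyTorch with IPEX

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

prompt = "What is Intel Arc?"
with torch.inference_mode():
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to("xpu")
    output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```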
Note
ipex-llm is built on top of the excellent work of llama.cpp, transformers, bitsandbytes, vLLM, qlora, AutoGPTQ, AutoAWQ, etc.
Over 70 models have been optimized/verified on ipex-llm (e.g., Llama, Phi, Mistral, Mixtral, Whisper, Qwen, MiniCPM, Qwen-VL, MiniCPM-V and more), with state-of-the-art LLM optimizations, XPU acceleration and low-bit (FP8/FP6/FP4/INT4) support; see the complete list here.

Latest Update
- ipex-llm now supports Axolotl for LLM finetuning on Intel GPU; see the quickstart here.
- You can now easily run ipex-llm inference, serving and finetuning using the Docker images.
- You can now install ipex-llm on Windows using just "one command".
- You can now run Open WebUI with ipex-llm; see the quickstart here.
- You can now run Llama 3 on Intel GPU using llama.cpp and ollama with ipex-llm; see the quickstart here.
- ipex-llm now supports Llama 3 on both Intel GPU and CPU.
- ipex-llm now provides a C++ interface, which can be used as an accelerated backend for running llama.cpp and ollama on Intel GPU.
- bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
- ipex-llm now supports directly loading models from ModelScope (魔搭).
- ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large-sized LLMs (e.g., Mixtral-8x7B) on Intel GPU with 16GB VRAM.
- You can now use ipex-llm through the Text-Generation-WebUI GUI.
- ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
- ipex-llm now supports a comprehensive list of LLM finetuning methods on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
- Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
- ipex-llm now supports ReLoRA (see "ReLoRA: High-Rank Training Through Low-Rank Updates").
- ipex-llm now supports Mixtral-8x7B on both Intel GPU and CPU.
- ipex-llm now supports QA-LoRA (see "QA-LoRA: Quantization-Aware Low-Rank Adaptation of Large Language Models").
- ipex-llm now supports FP8 and FP4 inference on Intel GPU.
- Initial support for directly loading GGUF, AWQ and GPTQ models into ipex-llm is available.
- ipex-llm now supports vLLM continuous batching on both Intel GPU and CPU.
- ipex-llm now supports QLoRA finetuning on both Intel GPU and CPU.
- ipex-llm now supports FastChat serving on both Intel CPU and GPU.
- ipex-llm now supports Intel GPU (including iGPU, Arc, Flex and MAX).
- The ipex-llm tutorial is released.

ipex-llm Demo
See demos of running local LLMs on Intel Core Ultra iGPU, Intel Core Ultra NPU, single-card Arc GPU, or multi-card Arc GPUs using ipex-llm below.
| Intel Core Ultra (Series 1) iGPU | Intel Core Ultra (Series 2) NPU | Intel Arc dGPU | 2-Card Intel Arc dGPUs |
|---|---|---|---|
| Ollama (Mistral-7B Q4_K) | HuggingFace (Llama3.2-3B SYM_INT4) | TextGeneration-WebUI (Llama3-8B FP8) | FastChat (QWen1.5-32B FP6) |
ipex-llm Performance
See the Token Generation Speed on Intel Core Ultra and Intel Arc GPU below [1] (and refer to [2][3][4] for more details).
You may follow the Benchmarking Guide to run the ipex-llm performance benchmark yourself.
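The Benchmarking Guide is the authoritative way to reproduce the numbers above; purely as a rough illustration of what a token-generation-speed measurement involves, a timing sketch is shown below. It reuses a model/tokenizer loaded as in the earlier example and is not the official benchmark script, which reports first-token and next-token latency separately.

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompt, new_tokens=128, device="xpu"):
    """Crude decode-speed estimate: time a generate() call and divide by tokens produced."""
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(device)

    with torch.inference_mode():
        # Warm-up run so one-time compilation/caching does not skew the timing.
        model.generate(input_ids, max_new_tokens=new_tokens)
        if device == "xpu":
            torch.xpu.synchronize()  # wait for queued Intel GPU work before timing

        start = time.perf_counter()
        output = model.generate(input_ids, max_new_tokens=new_tokens)
        if device == "xpu":
            torch.xpu.synchronize()
        elapsed = time.perf_counter() - start

    generated = output.shape[1] - input_ids.shape[1]
    return generated / elapsed

# Example usage (model/tokenizer loaded as in the earlier sketch):
# print(f"{tokens_per_second(model, tokenizer, 'What is Intel Arc?'):.1f} tokens/s")
```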
Please see the perplexity results below (tested on the Wikitext dataset using the script here); a rough sketch of how such a measurement can be computed follows the table.
| Perplexity | sym_int4 | q4_k | fp6 | fp8_e5m2 | fp8_e4m3 | fp16 |
|---|---|---|---|---|---|---|
| Llama-2-7B-chat-hf | 6.364 | 6.218 | 6.092 | 6.180 | 6.098 | 6.096 |
| Mistral-7B-Instruct-v0.2 | 5.365 | 5.320 | 5.270 | 5.273 | 5.246 | 5.244 |
| Baichuan2-7B-chat | 6.734 | 6.727 | 6.527 | 6.539 | 6.488 | 6.508 |
| Qwen1.5-7B-chat | 8.865 | 8.816 | 8.557 | 8.846 | 8.530 | 8.607 |
| Llama-3.1-8B-Instruct | 6.705 | 6.566 | 6.338 | 6.383 | 6.325 | 6.267 |
| gemma-2-9b-it | 7.541 | 7.412 | 7.269 | 7.380 | 7.268 | 7.270 |
| Baichuan2-13B-Chat | 6.313 | 6.160 | 6.070 | 6.145 | 6.086 | 6.031 |
| Llama-2-13b-chat-hf | 5.449 | 5.422 | 5.341 | 5.384 | 5.332 | 5.329 |
| Qwen1.5-14B-Chat | 7.529 | 7.520 | 7.367 | 7.504 | 7.297 | 7.334 |
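The numbers above come from the project's own evaluation script. Purely as an illustration of the technique, perplexity over a corpus can be computed with a sliding-window negative log-likelihood along the lines below; this is not the linked script, and the dataset loading, window size, and stride are assumptions.

```python
import torch
from datasets import load_dataset

def perplexity(model, tokenizer, device="xpu", window=2048, stride=512):
    """Sliding-window perplexity over WikiText-2, following the common HuggingFace recipe."""
    text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
    input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

    nll_sum, token_count, prev_end = 0.0, 0, 0
    for begin in range(0, input_ids.size(1), stride):
        end = min(begin + window, input_ids.size(1))
        target_len = end - prev_end                 # only score tokens not scored before
        ids = input_ids[:, begin:end]
        labels = ids.clone()
        labels[:, :-target_len] = -100              # mask the overlapping context
        with torch.inference_mode():
            loss = model(ids, labels=labels).loss   # mean NLL over the scored tokens
        nll_sum += loss.item() * target_len
        token_count += target_len
        prev_end = end
        if end == input_ids.size(1):
            break
    return torch.exp(torch.tensor(nll_sum / token_count)).item()
```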
ipex-llm Quickstart
- Running llama.cpp, ollama, etc., with ipex-llm on Intel GPU
- Running transformers, LangChain, LlamaIndex, ModelScope, etc. with ipex-llm on Intel GPU
- Running vLLM serving with ipex-llm on Intel GPU
- Running vLLM serving with ipex-llm on Intel CPU
- Running FastChat serving with ipex-llm on Intel GPU
- Running and developing ipex-llm applications in Python using VSCode on Intel GPU
- Running ipex-llm on Intel NPU in both Python and C++
- Running llama.cpp (using the C++ interface of ipex-llm) on Intel GPU
- Running ollama (using the C++ interface of ipex-llm) on Intel GPU
- Running HuggingFace/PyTorch models (using the Python interface of ipex-llm) on Intel GPU for Windows and Linux
- Running ipex-llm in vLLM on both Intel GPU and CPU
- Running ipex-llm in FastChat serving on both Intel GPU and CPU
- Running ipex-llm serving on multiple Intel GPUs by leveraging DeepSpeed AutoTP and FastAPI
- Running ipex-llm in oobabooga WebUI
- Running ipex-llm in Axolotl for LLM finetuning
- Benchmarking ipex-llm on Intel CPU and GPU
- Running GraphRAG using a local LLM with ipex-llm
- Running RAGFlow (an open-source RAG engine) with ipex-llm
- Running LangChain-Chatchat (Knowledge Base QA using a RAG pipeline) with ipex-llm
- Running Continue (coding copilot in VSCode) with ipex-llm
- Running Open WebUI with ipex-llm
- Running PrivateGPT to interact with documents with ipex-llm
- Running ipex-llm in Dify (a production-ready LLM app development platform)
- Installing ipex-llm on Windows with Intel GPU
- Installing ipex-llm on Linux with Intel GPU
- Saving and loading ipex-llm low-bit models (INT4/FP4/FP6/INT8/FP8/FP16/etc.)

Over 70 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.
| Model | CPU Example | GPU Example | NPU Example |
|---|---|---|---|
| LLaMA | link1, link2 | link | |
| LLaMA 2 | link1, link2 | link | Python link, C++ link |
| LLaMA 3 | link | link | Python link, C++ link |
| LLaMA 3.1 | link | link | |
| LLaMA 3.2 | link | Python link, C++ link | |
| LLaMA 3.2-Vision | link | ||
| ChatGLM | link | ||
| ChatGLM2 | link | link | |
| ChatGLM3 | link | link | |
| GLM-4 | link | link | |
| GLM-4V | link | link | |
| GLM-Edge | link | Python link | |
| GLM-Edge-V | link | ||
| Mistral | link | link | |
| Mixtral | link | link | |
| Falcon | link | link | |
| MPT | link | link | |
| Dolly-v1 | link | link | |
| Dolly-v2 | link | link | |
| Replit Code | link | link | |
| RedPajama | link1, link2 | ||
| Phoenix | link1, link2 | ||
| StarCoder | link1, link2 | link | |
| Baichuan | link | link | |
| Baichuan2 | link | link | Python link |
| InternLM | link | link | |
| InternVL2 | link | ||
| Qwen | link | link | |
| Qwen1.5 | link | link | |
| Qwen2 | link | link | Python link, C++ link |
| Qwen2.5 | link | Python link, C++ link | |
| Qwen-VL | link | link | |
| Qwen2-VL | link | ||
| Qwen2-Audio | link | ||
| Aquila | link | link | |
| Aquila2 | link | link | |
| MOSS | link | ||
| Whisper | link | link | |
| Phi-1_5 | link | link | |
| Flan-t5 | link | link | |
| LLaVA | link | link | |
| CodeLlama | link | link | |
| Skywork | link | ||
| InternLM-XComposer | link | ||
| WizardCoder-Python | link | ||
| CodeShell | link | ||
| Fuyu | link | ||
| Distil-Whisper | link | link | |
| Yi | link | link | |
| BlueLM | link | link | |
| Mamba | link | link | |
| SOLAR | link | link | |
| Phixtral | link | link | |
| InternLM2 | link | link | |
| RWKV4 | link | ||
| RWKV5 | link | ||
| Bark | link | link | |
| SpeechT5 | link | ||
| DeepSeek-MoE | link | ||
| Ziya-Coding-34B-v1.0 | link | ||
| Phi-2 | link | link | |
| Phi-3 | link | link | |
| Phi-3-vision | link | link | |
| Yuan2 | link | link | |
| Gemma | link | link | |
| Gemma2 | link | ||
| DeciLM-7B | link | link | |
| Deepseek | link | link | |
| StableLM | link | link | |
| CodeGemma | link | link | |
| Command-R/cohere | link | link | |
| CodeGeeX2 | link | link | |
| MiniCPM | link | link | Python link, C++ link |
| MiniCPM3 | link | ||
| MiniCPM-V | link | ||
| MiniCPM-V-2 | link | link | |
| MiniCPM-Llama3-V-2_5 | link | Python link | |
| MiniCPM-V-2_6 | link | link | Python link |
| StableDiffusion | link | ||
| Bce-Embedding-Base-V1 | Python link | ||
| Speech_Paraformer-Large | Python link |
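To avoid re-quantizing a verified model on every run, its low-bit form can be persisted and reloaded directly. The sketch below assumes the save_low_bit/load_low_bit helpers of ipex_llm.transformers; the model ID and output directory are placeholders.

```python
from ipex_llm.transformers import AutoModelForCausalLM
from transformers import AutoTokenizer

src = "meta-llama/Llama-2-7b-chat-hf"   # placeholder model ID
low_bit_dir = "./llama2-7b-int4"        # placeholder output directory

# Quantize once at load time, then persist the low-bit weights and the tokenizer.
model = AutoModelForCausalLM.from_pretrained(src, load_in_4bit=True, trust_remote_code=True)
model.save_low_bit(low_bit_dir)
AutoTokenizer.from_pretrained(src, trust_remote_code=True).save_pretrained(low_bit_dir)

# Later runs can reload the already-quantized weights directly (faster, less host RAM).
model = AutoModelForCausalLM.load_low_bit(low_bit_dir, trust_remote_code=True)
model = model.half().to("xpu")
```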
[1] Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.