


Welcome to libLLM, an open-source project designed for efficient inference of large language models (LLMs) on ordinary personal computers and mobile devices. The core is implemented in C++14, without any third-party dependencies (such as BLAS or SentencePiece), enabling seamless operation across a variety of devices.
| Model | Download | llm Command |
|---|---|---|
| Index-1.9B-Character (Role-playing) | [HF] [MS] | `llm chat -m index:character` |
| Index-1.9B-Chat | [HF] [MS] | `llm chat -m index` |
| Qwen2-1.5B-Instruct | [HF] [MS] | `llm chat -m qwen:1.5b` |
| Qwen2-7B-Instruct | [HF] [MS] | `llm chat -m qwen:7b` |
| Llama3.2-1B-Instruct | [HF] [MS] | `llm chat -m llama3.2:1b` |
| Llama3.2-3B-Instruct | [HF] [MS] | `llm chat -m llama3.2` |
| Whisper-large-v3 | [HF] [MS] | `llm transcribe -m whisper` |
HF = HuggingFace, MS = ModelScope
| OS | Platform | CUDA | avx2 | avx512 | asimdhp |
|---|---|---|---|---|---|
| Linux | x64 | ✅ | ✅ | ✅ | |
| Windows | x64 | ✅ | ✅ | ✅ | |
| macOS | arm64 | | | | ✅ |
To chat with the Bilibili-Index-1.9B-Character model, run `llm chat -m index-character`. llm will automatically download the model from Huggingface or ModelScope (for Chinese IP addresses) and start the chat CLI. For example:
```
$ src/libllm/llm chat -m index-character
INFO 2024-07-30T12:02:28Z interface.cc:67] ISA support: AVX2=1 F16C=1 AVX512F=1
INFO 2024-07-30T12:02:28Z interface.cc:71] Use Avx512 backend.
INFO 2024-07-30T12:02:30Z matmul.cc:43] Use GEMM from cuBLAS.
INFO 2024-07-30T12:02:30Z cuda_operators.cc:51] cuda numDevices = 2
INFO 2024-07-30T12:02:30Z cuda_operators.cc:52] cuda:0 maxThreadsPerMultiProcessor = 2048
INFO 2024-07-30T12:02:30Z cuda_operators.cc:54] cuda:0 multiProcessorCount = 20
INFO 2024-07-30T12:02:30Z thread_pool.cc:73] ThreadPool started. numThreads=20
INFO 2024-07-30T12:02:30Z llm.cc:204] read model package: /home/xiaoych/.libllm/models/bilibili-index-1.9b-character-q4.llmpkg
INFO 2024-07-30T12:02:30Z model_for_generation.cc:43] model_type = index
INFO 2024-07-30T12:02:30Z model_for_generation.cc:44] device = cuda
INFO 2024-07-30T12:02:31Z state_map.cc:66] 220 tensors read.
Please input your question.
Type ' :new ' to start a new session (clean history).
Type ' :sys <system_prompt> ' to set the system prompt and start a new session .
> hi
您好!我是Index,请问有什么我可以帮助您的吗?
(12 tokens, time=0.76s, 63.47ms per token)
```

To build libllm from source:

```
$ mkdir build && cd build
$ cmake ..
$ make -j
```

On macOS, please `brew install libomp` before running cmake. NOTE: libllm on macOS is currently expected to be very slow, since there is no aarch64 kernel for it yet.

```
% brew install libomp
% export OpenMP_ROOT=$(brew --prefix)/opt/libomp
% mkdir build && cd build
% cmake ..
% make -j
```

To build with CUDA support, run the commands below. NOTE: specify `-DCUDAToolkit_ROOT=<CUDA-DIR>` if there are multiple CUDA versions in your OS. Recommended versions are:

```
$ mkdir build && cd build
$ cmake -DWITH_CUDA=ON [-DCUDAToolkit_ROOT=<CUDA-DIR>] ..
$ make -j
```

To use libllm from Python:

```python
from libllm import Model, ControlToken
model = Model("tools/bilibili_index.llmpkg")

prompt = [ControlToken("<|reserved_0|>"), "hi", ControlToken("<|reserved_1|>")]
for chunk in model.complete(prompt):
    print(chunk.text, end="", flush=True)
print("\nDone!")
```
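The streaming call above can be wrapped in a small read-eval loop for quick interactive testing. This is a minimal sketch built only from the calls shown in the example; it sends each turn independently (no chat history) and assumes the Index model's `<|reserved_0|>` / `<|reserved_1|>` control tokens used above.

```python
from libllm import Model, ControlToken

model = Model("tools/bilibili_index.llmpkg")

while True:
    text = input("> ").strip()
    if not text or text == ":quit":
        break

    # Wrap the user turn with the same control tokens used in the example above.
    prompt = [ControlToken("<|reserved_0|>"), text, ControlToken("<|reserved_1|>")]

    # Stream the reply chunk by chunk.
    for chunk in model.complete(prompt):
        print(chunk.text, end="", flush=True)
    print()
```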
The same API is also available from Go:

```go
package main

import (
    "fmt"
    "log"

    "github.com/ling0322/libllm/go/llm"
)

func main() {
    model, err := llm.NewModel("../../tools/bilibili_index.llmpkg", llm.Auto)
    if err != nil {
        log.Fatal(err)
    }

    prompt := llm.NewPrompt()
    prompt.AppendControlToken("<|reserved_0|>")
    prompt.AppendText("hi")
    prompt.AppendControlToken("<|reserved_1|>")

    comp, err := model.Complete(llm.NewCompletionConfig(), prompt)
    if err != nil {
        log.Fatal(err)
    }

    for comp.IsActive() {
        chunk, err := comp.GenerateNextChunk()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Print(chunk.Text)
    }
    fmt.Println()
}
```
Here is an example of exporting the Index-1.9B model from Huggingface:

```
$ cd tools
$ python bilibili_index_exporter.py \
    -huggingface_name IndexTeam/Index-1.9B-Character \
    -quant q4 \
    -output index.llmpkg
```
Then all required modules related to IndexTeam/Index-1.9B-Character, including the model, tokenizer, and configs, will be written to `index.llmpkg`.
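The exported package can then be loaded with the Python binding shown earlier. A minimal sketch, assuming the exported Index model uses the same `<|reserved_0|>` / `<|reserved_1|>` control tokens as the bundled example and that the exporter was run from the `tools` directory:

```python
from libllm import Model, ControlToken

# Path of the package written by bilibili_index_exporter.py (run inside tools/).
model = Model("tools/index.llmpkg")

prompt = [ControlToken("<|reserved_0|>"), "hi", ControlToken("<|reserved_1|>")]
for chunk in model.complete(prompt):
    print(chunk.text, end="", flush=True)
print()
```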