
Booster, according to the Merriam-Webster dictionary:
Large Model Booster aims to be a simple and mighty LLM inference accelerator, both for those who need to scale GPTs within production environments and for those who just want to experiment with models on their own.

Within the first month of llama.go development I was literally shocked by how clearly the original ggml.cpp project proved one thing: there are no limits for talented people bringing mind-blowing features and moving us toward the AI future.

So I decided to start a new project where a best-in-class C++ / CUDA core is embedded into a mighty Golang server, ready for robust and performant inference at large scale within real production environments.
Booster was (and still is) developed on a Mac with an Apple Silicon M1 processor, so it's really easy peasy:

```shell
make mac
```

Follow step 1 and step 2, then just `make`!
Ubuntu Step 1: Install C++ and Golang compilers, as well as some developer libraries

```shell
sudo apt update -y && sudo apt upgrade -y &&
sudo apt install -y git git-lfs make build-essential &&
wget https://golang.org/dl/go1.21.5.linux-amd64.tar.gz &&
sudo tar -xf go1.21.5.linux-amd64.tar.gz -C /usr/local &&
rm go1.21.5.linux-amd64.tar.gz &&
echo 'export PATH="${PATH}:/usr/local/go/bin"' >> ~/.bashrc && source ~/.bashrc
```
Ubuntu Step 2: Install Nvidia drivers and CUDA Toolkit 12.2 with NVCC

```shell
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-ubuntu2204.pin &&
sudo mv cuda-ubuntu2204.pin /etc/apt/preferences.d/cuda-repository-pin-600 &&
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/3bf863cc.pub &&
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/ /" &&
sudo apt update -y &&
sudo apt install -y cuda-toolkit-12-2
```
Now you are ready to rock!
```shell
make cuda
```

You should go through the steps below:

Build the server (here for Mac):

```shell
make clean && make mac
```

Download the model:

```shell
wget https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B-GGUF/resolve/main/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf
```

Create a configuration file:

```yaml
id: mac
host: localhost
port: 8080
log: booster.log
deadline: 180

pods:

  gpu:
    model: hermes
    prompt: chat
    sampling: janus
    threads: 1
    gpus: [ 100 ]
    batch: 512

models:

  hermes:
    name: Hermes2 Pro 8B
    path: ~/models/Hermes-2-Pro-Llama-3-8B-Q4_K_M.gguf
    context: 8K
    predict: 1K

prompts:

  chat:
    locale: en_US
    prompt: "Today is {DATE}. You are virtual assistant. Please answer the question."
    system: "<|im_start|>system\n{PROMPT}<|im_end|>"
    user: "\n<|im_start|>user\n{USER}<|im_end|>"
    assistant: "\n<|im_start|>assistant\n{ASSISTANT}<|im_end|>"

samplings:

  janus:
    janus: 1
    depth: 200
    scale: 0.97
    hi: 0.99
    lo: 0.96
```

Launch Booster in interactive mode to just chat with the model:
```shell
./booster
```

Launch Booster as a server to handle all API endpoints and show debug info:

```shell
./booster --server --debug
```

Place a new job with a POST to http://localhost:8080/jobs:

```json
{
    "id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc6",
    "prompt": "Who are you?"
}
```

Check the job status and fetch the model output with a GET to http://localhost:8080/jobs/5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc6:
```json
{
    "id": "5fb8ebd0-e0c9-4759-8f7d-35590f6c9fc6",
    "output": "I'm a virtual assistant.",
    "prompt": "Who are you?",
    "status": "finished"
}
```

See the booster.service file as an example of how to create a daemon service out of this API server.