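ialacol (localai)

ialacol is being rewritten from Python to Rust/WebAssembly; see #93 for details.

Introduction

ialacol (pronounced "localai") is a lightweight drop-in replacement for the OpenAI API.

It is an OpenAI API-compatible wrapper around ctransformers, supporting GGML/GPTQ with optional CUDA/Metal acceleration.

ialacol is inspired by similar projects such as LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.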

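Features

Compatible with the OpenAI API and with LangChain.
Lightweight and easy to deploy on Kubernetes clusters, with a one-click Helm installation.
Streaming first, for a better UX.
Optional CUDA acceleration.
Compatible with the GitHub Copilot VS Code extension; see Copilot.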
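
As an illustration of what "streaming first" means for a client, the sketch below parses OpenAI-style server-sent event lines into text. It assumes ialacol emits the same `data: {...}` / `data: [DONE]` framing as the OpenAI API; the helper name `iter_stream_text` and the sample events are hypothetical and hand-written, not captured from a real server.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel used by the OpenAI API
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # Chat chunks carry text in delta.content; plain completion chunks use text.
            text = choice.get("delta", {}).get("content") or choice.get("text")
            if text:
                yield text


# Hand-written sample events (illustrative only).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Ahoy"}}]}',
    'data: {"choices": [{"delta": {"content": ", matey!"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```

In a real client, `lines` would come from the HTTP response body of a request sent with `"stream": true`.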

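Supported models

See the recipes below for deployment instructions.

Llama 2 variants, including OpenLLaMA, Mistral, openchat_3.5, and zephyr.
StarCoder variants
WizardCoder
StarChat variants
MPT-7B
MPT-30B
Falcon

And all LLMs supported by ctransformers.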

UI

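ialacol does not have a UI, but it is compatible with any web UI that supports the OpenAI API, for example chat-ui after PR #541 was merged.

Assuming ialacol is running on port 8000, you can configure chat-ui to use the zephyr-7b-beta.Q4_K_M.gguf model served by ialacol.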

MODELS=`[
  {
      "name": "zephyr-7b-beta.Q4_K_M.gguf",
      "displayName": "Zephyr 7B β",
      "preprompt": "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate.</s>\n",
      "userMessageToken": "<|user|>\n",
      "userMessageEndToken": "</s>\n",
      "assistantMessageToken": "<|assistant|>\n",
      "assistantMessageEndToken": "\n",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "max_new_tokens": 4096,
        "truncate": 999999
      },
      "endpoints": [{
        "type": "openai",
        "baseURL": "http://localhost:8000/v1",
        "completion": "chat_completions"
      }]
  }
]`

openchat_3.5.Q4_K_M.gguf

MODELS=`[
  {
      "name": "openchat_3.5.Q4_K_M.gguf",
      "displayName": "OpenChat 3.5",
      "preprompt": "",
      "userMessageToken": "GPT4 User: ",
      "userMessageEndToken": "<|end_of_turn|>",
      "assistantMessageToken": "GPT4 Assistant: ",
      "assistantMessageEndToken": "<|end_of_turn|>",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "max_new_tokens": 4096,
        "truncate": 999999,
        "stop": ["<|end_of_turn|>"]
      },
      "endpoints": [{
        "type": "openai",
        "baseURL": "http://localhost:8000/v1",
        "completion": "chat_completions"
      }]
  }
]`
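
To see how the token fields in the configurations above might fit together, here is a minimal sketch that assembles a prompt string from them. This is an assumption about how a front end could use these fields, not a description of chat-ui's actual templating; the function name `build_prompt` is hypothetical.

```python
def build_prompt(cfg: dict, messages: list) -> str:
    """Join the preprompt and each message, wrapped in its role's start/end tokens."""
    parts = [cfg.get("preprompt", "")]
    for msg in messages:
        if msg["role"] == "user":
            parts += [cfg["userMessageToken"], msg["content"], cfg["userMessageEndToken"]]
        else:
            parts += [cfg["assistantMessageToken"], msg["content"], cfg["assistantMessageEndToken"]]
    # End with the assistant token so the model continues as the assistant.
    parts.append(cfg["assistantMessageToken"])
    return "".join(parts)


# Token values mirror the OpenChat 3.5 configuration above.
openchat_cfg = {
    "preprompt": "",
    "userMessageToken": "GPT4 User: ",
    "userMessageEndToken": "<|end_of_turn|>",
    "assistantMessageToken": "GPT4 Assistant: ",
    "assistantMessageEndToken": "<|end_of_turn|>",
}
print(build_prompt(openchat_cfg, [{"role": "user", "content": "Hello"}]))
```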

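Blogs

Use Code Llama (and other open LLMs) as a drop-in replacement for Copilot code completion
Containerized AI before the apocalypse
Deploy Llama 2 AI on Kubernetes, now
Cloud-native workflow for private MPT-30B AI apps
Offline AI on GitHub Actions

Quick start

Kubernetes

ialacol offers first-class support for Kubernetes, which means you can automate and configure everything, compared to running without it.

To get started quickly with ialacol on Kubernetes, follow the steps below: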

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol

By default, it deploys Meta's Llama 2 Chat model quantized by TheBloke.

Port-forward

kubectl port-forward svc/llama-2-7b-chat 8000:8000

Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
     http://localhost:8000/v1/chat/completions

Alternatively, use OpenAI's client library (see more examples in the examples/openai folder).

openai -k "sk-fake" \
     -b http://localhost:8000/v1 -vvvvv \
     api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \
     -g user "Hello world!"
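
For readers who prefer a script without extra dependencies, the same request can be sketched with Python's standard library. This assumes the port-forward above is active and that the response follows the OpenAI chat completion shape; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumes the port-forward above is active


def build_chat_request(content: str, model: str, stream: bool = False) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    body = {
        "messages": [{"role": "user", "content": content}],
        "model": model,
        "stream": stream,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(content: str, model: str) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(content, model)) as resp:
        reply = json.load(resp)
    # Assumes the OpenAI chat completion response shape.
    return reply["choices"][0]["message"]["content"]


# Example (requires a running server):
# print(chat("How are you?", "llama-2-7b-chat.ggmlv3.q4_0.bin"))
```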

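Configuration

All configuration is done via environment variables.

| Parameter | Description | Default | Example |
| --- | --- | --- | --- |
| `DEFAULT_MODEL_HG_REPO_ID` | The Hugging Face repo ID to download the model from | `None` | `TheBloke/orca_mini_3B-GGML` |
| `DEFAULT_MODEL_HG_REPO_REVISION` | The Hugging Face repo revision | `main` | `gptq-4bit-32g-actorder_True` |
| `DEFAULT_MODEL_FILE` | The file name to download from the repo; optional for GPTQ models | `None` | `orca-mini-3b.ggmlv3.q4_0.bin` |
| `MODE_TYPE` | Model type, overriding automatic model type detection | `None` | `gptq`, `gpt_bigcode`, `llama`, `mpt`, `replit`, `falcon`, `gpt_neox`, `gptj` |
| `LOGGING_LEVEL` | Logging level | `INFO` | `DEBUG` |
| `TOP_K` | top-k for sampling | `40` | Integer |
| `TOP_P` | top-p for sampling | `1.0` | Float |
| `REPETITION_PENALTY` | Repetition penalty for sampling | `1.1` | Float |
| `LAST_N_TOKENS` | The last n tokens used for the repetition penalty | `1.1` | Integer |
| `SEED` | Seed for sampling | `-1` | Integer |
| `BATCH_SIZE` | Batch size for evaluating tokens; only for GGUF/GGML models | `8` | Integer |
| `THREADS` | Thread count, overriding the auto-detected value of CPU/2; set `1` for GPTQ models | `Auto` | Integer |
| `MAX_TOKENS` | Maximum number of tokens to generate | `512` | Integer |
| `STOP` | The token that stops generation | `None` | << |
| `CONTEXT_LENGTH` | Override the auto-detected context length | `512` | Integer |
| `GPU_LAYERS` | Number of layers to offload to the GPU | `0` | Integer |
| `TRUNCATE_PROMPT_LENGTH` | Truncate the prompt if set | `0` | Integer |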
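
As a rough illustration of environment-driven configuration, the sketch below reads a few of the variables from the table and falls back to the documented defaults. It is not ialacol's actual loader; `load_settings` and the chosen subset of variables are assumptions made for the example.

```python
import os
from typing import Mapping

# Defaults copied from the table above; only a subset is shown.
DEFAULTS = {
    "TOP_K": "40",
    "TOP_P": "1.0",
    "REPETITION_PENALTY": "1.1",
    "MAX_TOKENS": "512",
    "GPU_LAYERS": "0",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read settings from the environment, falling back to the documented defaults."""
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    # Environment values are always strings, so convert each to its documented type.
    return {
        "top_k": int(raw["TOP_K"]),
        "top_p": float(raw["TOP_P"]),
        "repetition_penalty": float(raw["REPETITION_PENALTY"]),
        "max_tokens": int(raw["MAX_TOKENS"]),
        "gpu_layers": int(raw["GPU_LAYERS"]),
    }


print(load_settings({"TOP_K": "10"}))
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test.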

Sampling parameters, including TOP_K, TOP_P, REPETITION_PENALTY, LAST_N_TOKENS, SEED, MAX_TOKENS, and STOP, can be overridden per request via the request body, for example:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
     http://localhost:8000/v1/chat/completions

For this request, temperature=2, top_p=1, and top_k=0 will be used.
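
The override behaviour can be pictured as a merge of the request body over the server defaults. The following is a sketch under that assumption, not ialacol's actual code; `effective_params` and `OVERRIDABLE` are hypothetical names.

```python
from typing import Any, Mapping

# Sampling fields a request may override (lower-case, as they appear in the body).
# "temperature" is included because the example request above sets it.
OVERRIDABLE = (
    "temperature", "top_k", "top_p", "repetition_penalty",
    "last_n_tokens", "seed", "max_tokens", "stop",
)


def effective_params(defaults: Mapping[str, Any], body: Mapping[str, Any]) -> dict:
    """Return the defaults with any per-request overrides applied on top."""
    params = dict(defaults)
    for name in OVERRIDABLE:
        # Only an explicitly provided value replaces the default; 0 is a valid override.
        if body.get(name) is not None:
            params[name] = body[name]
    return params


defaults = {"top_k": 40, "top_p": 1.0}
body = {"model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "temperature": 2, "top_k": 0}
print(effective_params(defaults, body))
```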

Run in a container

Image from the GitHub registry

There is an image hosted on ghcr.io (with alternative CUDA 11, CUDA 12, Metal, and GPTQ variants).

docker run --rm -it -p 8000:8000 \
     -e DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
     -e DEFAULT_MODEL_FILE="llama-2-7b-chat.ggmlv3.q4_0.bin" \
     ghcr.io/chenhunghan/ialacol:latest

From source

For developers/contributors

Python

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML" DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin" LOGGING_LEVEL="DEBUG" THREADS=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999

Docker

Build the image

docker build --file ./Dockerfile -t ialacol .

Run the container

export DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_3B-GGML"
export DEFAULT_MODEL_FILE="orca-mini-3b.ggmlv3.q4_0.bin"
docker run --rm -it -p 8000:8000 \
     -e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \
     -e DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol

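GPU acceleration

To enable GPU/CUDA acceleration, you need to use a container image built for GPU and add the GPU_LAYERS environment variable. GPU_LAYERS is determined by the size of your GPU memory. See the PR/discussion in llama.cpp to find the best value.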
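
As a starting point before tuning, a back-of-the-envelope estimate for GPU_LAYERS can be sketched as follows. This is only a heuristic under loud assumptions: all layers are treated as equal in size, a fixed amount of memory is reserved for the context and scratch buffers, and the function name and the 1 GiB reserve are hypothetical. The value that works best on real hardware may differ.

```python
def estimate_gpu_layers(
    model_bytes: int,
    n_layers: int,
    free_vram_bytes: int,
    reserve_bytes: int = 1 << 30,  # assumed 1 GiB headroom, not a measured figure
) -> int:
    """Rough upper bound on the layers that fit, assuming equally sized layers."""
    if model_bytes <= 0 or n_layers <= 0:
        raise ValueError("model_bytes and n_layers must be positive")
    per_layer = model_bytes / n_layers
    usable = max(free_vram_bytes - reserve_bytes, 0)
    # Never report more layers than the model has.
    return min(n_layers, int(usable // per_layer))


GIB = 1 << 30
# Illustrative numbers: a 4 GiB model file with 32 layers and 3 GiB of free VRAM.
print(estimate_gpu_layers(4 * GIB, 32, 3 * GIB))
```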

CUDA 11

deployment.image = ghcr.io/chenhunghan/ialacol-cuda11:latest
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

CUDA 12

deployment.image = ghcr.io/chenhunghan/ialacol-cuda12:latest
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

Only llama, falcon, mpt, and gpt_bigcode (StarCoder/StarChat) support CUDA.

Llama with CUDA 12

helm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml

Deploys the Llama 2 7B model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

StarCoderPlus with CUDA 12

helm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml

Deploys the Starcoderplus-Guanaco-GPT4-15B-V1.0 model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

CUDA driver issues

If you see CUDA driver version is insufficient for CUDA runtime version, you are probably using an NVIDIA driver that is not compatible with the CUDA version.

Upgrade the driver manually on the node (see here if you are using a CUDA11 + AMI), or try a different version of CUDA.

Metal

To enable Metal support, use the image built for Metal, ialacol-metal.

deployment.image = ghcr.io/chenhunghan/ialacol-metal:latest

For example

helm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml

GPTQ

To use GPTQ, you must set:

deployment.image = ghcr.io/chenhunghan/ialacol-gptq:latest
deployment.env.MODEL_TYPE = gptq

For example

helm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml

kubectl port-forward svc/llama2-7b-chat-gptq 8000:8000
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user "Hello world!"

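Tips

Copilot

ialacol can be used with the Copilot client, because GitHub Copilot uses an API that is almost identical to the OpenAI completion API.

However, there are a few things to keep in mind:

The Copilot client sends a lengthy prompt to include all the relevant context for code completion (see copilot-explorer), which puts a heavy load on the server. If you are running ialacol locally, set the TRUNCATE_PROMPT_LENGTH environment variable to truncate the prompt from the beginning and reduce the workload.
Copilot sends requests in parallel. To increase throughput, you probably need a queue such as text-inference-batcher.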
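
Truncating "from the beginning" means dropping the oldest tokens and keeping the tail of the prompt, which is the part closest to the cursor. The sketch below shows that idea on a plain list of tokens. It assumes that a limit of 0 (the documented default) disables truncation; `truncate_from_start` is a hypothetical name, not ialacol's function.

```python
def truncate_from_start(tokens: list, limit: int) -> list:
    """Keep only the last `limit` tokens; a limit of 0 leaves the prompt untouched."""
    if limit <= 0 or len(tokens) <= limit:
        return list(tokens)
    # Drop tokens from the front so the most recent context survives.
    return list(tokens[-limit:])


prompt_tokens = ["import", "os", "def", "main", "(", ")", ":"]
print(truncate_from_start(prompt_tokens, 3))
```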

Start two instances of ialacol:

gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
# Export the settings so that both uvicorn processes inherit them.
export LOGGING_LEVEL="DEBUG"
export THREADS=2
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
export DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
export TRUNCATE_PROMPT_LENGTH=100 # optional
# Run the first instance in the background so the second can start in the same shell.
uvicorn main:app --host 0.0.0.0 --port 9998 &
uvicorn main:app --host 0.0.0.0 --port 9999

Start tib, pointing to the upstream ialacol instances.

gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
UPSTREAMS="http://localhost:9998,http://localhost:9999" npm start

Configure VS Code GitHub Copilot to use tib.

"github.copilot.advanced": {
    "debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
    "debug.testOverrideProxyUrl": "http://localhost:8000",
    "debug.overrideProxyUrl": "http://localhost:8000"
}

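Creative vs. conservative

LLMs are known to be sensitive to sampling parameters. A higher temperature leads to more "randomness", so the LLM becomes more "creative"; top_p and top_k also contribute to the "randomness".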
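
To make the effect of these parameters concrete, here is a small, generic sketch of temperature scaling and top-k filtering over a toy set of logits. It illustrates the standard technique only and is not taken from ialacol or ctransformers; the function names and logit values are made up, and treating `k <= 0` as "keep everything" is an assumption.

```python
import math


def softmax_with_temperature(logits: list, temperature: float) -> list:
    """Convert logits to probabilities; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_filter(probs: list, k: int) -> list:
    """Keep the k most likely tokens and renormalize; ties at the cutoff are all kept."""
    if k <= 0 or k >= len(probs):
        return list(probs)
    cutoff = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


toy_logits = [2.0, 1.0, 0.0]
conservative = softmax_with_temperature(toy_logits, 0.1)
creative = softmax_with_temperature(toy_logits, 2.0)
print(conservative, creative)
```

With a low temperature nearly all probability mass sits on the top token, so repeated requests tend to give the same answer; with a high temperature the alternatives become more likely to be sampled.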

If you want the LLM to be creative:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
     http://localhost:8000/v1/chat/completions

If you want the LLM to be more consistent and to give the same result for the same input:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
     http://localhost:8000/v1/chat/completions

Roadmap

Support the starcoder model type via ctransformers, including:
- StarChat https://huggingface.co/thebloke/starchat-beta-ggml
- StarCoder https://huggingface.co/thebloke/starcoder-ggml
- StarCoderPlus https://huggingface.co/thebloke/starcoderplus-ggml
Mimic the OpenAI API, including GET /models and POST /completions
GPU acceleration (CUDA/Metal)
Support POST /embeddings backed by Hugging Face Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
Support Apache-2.0 fastchat-t5-3b
Support more Apache-2.0 models such as codet5p and others listed here

Recipes

Llama-2

Deploy Meta's Llama 2 Chat model quantized by TheBloke.

7B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml

13B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml

70B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml

OpenLM Research's OpenLLaMA models

Deploy the OpenLLaMA 7B model quantized by rustformers.

This is a base model, likely only useful for text completion.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml

VMware's OpenLlama 13B Open Instruct

Deploy the OpenLlama 13B Open Instruct model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml

Mosaic's MPT models

Deploy MosaicML's MPT-7B model quantized by rustformers. This is a base model, likely only useful for text completion.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml

Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml

Falcon models

Deploy the uncensored Falcon 7B model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml

Deploy the uncensored Falcon 40B model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml

StarCoder models (StarCoder, StarChat, StarCoderPlus, WizardCoder)

Deploy the starchat-beta model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml

Deploy the WizardCoder model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml

Pythia models

Deploy the lightweight pythia-70m model, with only 70 million parameters (~40MB), quantized by rustformers.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml

RedPajama models

Deploy the RedPajama 3B model.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml

StableLM models

Deploy the StableLM 7B model.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml

Development

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt
