
ialacol (localai)

A rewrite from Python to Rust/WebAssembly is underway; see #93 for details.

Introduction

ialacol (pronounced "localai") is a lightweight drop-in replacement for the OpenAI API.

It is an OpenAI API-compatible wrapper around ctransformers, supporting GGML/GPTQ models with optional CUDA/Metal acceleration.

ialacol is inspired by similar projects such as LocalAI, privateGPT, local.ai, llama-cpp-python, closedai, and mlc-llm, with a specific focus on Kubernetes deployment.

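Features

Compatibility with the OpenAI API, and therefore with LangChain.
Lightweight, easy deployment on Kubernetes clusters with a 1-click Helm installation.
Streaming first! For better UX.
Optional CUDA acceleration.
Compatible with the GitHub Copilot VSCode extension; see Copilot.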

Supported Models

See Receipts below for deployment instructions.

LLaMa 2 variants, including OpenLLaMA, Mistral, openchat_3.5, and zephyr.
StarCoder variants
WizardCoder
StarChat variants
MPT-7B
MPT-30B
Falcon

And all LLMs supported by ctransformers.

UI

ialacol does not have a UI, but it is compatible with any web UI that supports the OpenAI API, for example chat-ui after PR #541 was merged.

Assuming ialacol is running on port 8000, you can configure chat-ui to use zephyr-7b-beta.Q4_K_M.gguf served by ialacol.

MODELS=`[
  {
      "name": "zephyr-7b-beta.Q4_K_M.gguf",
      "displayName": "Zephyr 7B β",
      "preprompt": "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate.</s>\n",
      "userMessageToken": "<|user|>\n",
      "userMessageEndToken": "</s>\n",
      "assistantMessageToken": "<|assistant|>\n",
      "assistantMessageEndToken": "\n",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "max_new_tokens": 4096,
        "truncate": 999999
      },
      "endpoints": [{
        "type": "openai",
        "baseURL": "http://localhost:8000/v1",
        "completion": "chat_completions"
      }]
  }
]`

openchat_3.5.Q4_K_M.gguf

MODELS=`[
  {
      "name": "openchat_3.5.Q4_K_M.gguf",
      "displayName": "OpenChat 3.5",
      "preprompt": "",
      "userMessageToken": "GPT4 User: ",
      "userMessageEndToken": "<|end_of_turn|>",
      "assistantMessageToken": "GPT4 Assistant: ",
      "assistantMessageEndToken": "<|end_of_turn|>",
      "parameters": {
        "temperature": 0.1,
        "top_p": 0.95,
        "repetition_penalty": 1.2,
        "top_k": 50,
        "max_new_tokens": 4096,
        "truncate": 999999,
        "stop": ["<|end_of_turn|>"]
      },
      "endpoints": [{
        "type": "openai",
        "baseURL": "http://localhost:8000/v1",
        "completion": "chat_completions"
      }]
  }
]`
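
The token fields in the configs above describe how a chat UI wraps each turn before sending it to the model. A minimal sketch of that assembly follows, assuming the tokens are simply concatenated around each message. The `render_prompt` helper is hypothetical and is not part of chat-ui or ialacol, and the preprompt is shortened for illustration.

```python
def render_prompt(cfg, messages):
    """Concatenate the preprompt and each turn wrapped in its role tokens."""
    parts = [cfg["preprompt"]]
    for msg in messages:
        if msg["role"] == "user":
            parts.append(cfg["userMessageToken"] + msg["content"] + cfg["userMessageEndToken"])
        else:
            parts.append(cfg["assistantMessageToken"] + msg["content"] + cfg["assistantMessageEndToken"])
    # Leave the assistant token open so the model writes the next reply.
    parts.append(cfg["assistantMessageToken"])
    return "".join(parts)


# Token values taken from the Zephyr config; the preprompt is abbreviated.
zephyr = {
    "preprompt": "<|system|>\nYou are a friendly chatbot.</s>\n",
    "userMessageToken": "<|user|>\n",
    "userMessageEndToken": "</s>\n",
    "assistantMessageToken": "<|assistant|>\n",
    "assistantMessageEndToken": "\n",
}

prompt = render_prompt(zephyr, [{"role": "user", "content": "Ahoy?"}])
print(prompt)
```

Getting these tokens wrong usually shows up as the model continuing the user's turn or never stopping, which is why the OpenChat config also sets a stop token.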

Blogs

Use Code Llama (and Other Open LLMs) as a Drop-In Replacement for Copilot Code Completion
Containerized AI Before the Apocalypse
Deploy Llama 2 AI on Kubernetes
Cloud Native Workflow for Private MPT-30B AI Apps
Offline AI on GitHub Actions

Quick Start

Kubernetes

ialacol offers first-class support for Kubernetes, which means you can automate and configure everything, compared to running without it.

To quickly get started with ialacol on Kubernetes, follow the steps below:

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama-2-7b-chat ialacol/ialacol

By default, this deploys Meta's Llama 2 Chat model quantized by TheBloke.

Port-forward

kubectl port-forward svc/llama-2-7b-chat 8000:8000

Chat with the default model llama-2-7b-chat.ggmlv3.q4_0.bin using curl

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "How are you?"}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false}' \
     http://localhost:8000/v1/chat/completions
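
The same request can be made from Python using only the standard library. This is a minimal sketch, assuming the port-forward above is active and that the response follows the OpenAI chat completion shape; `build_chat_request` and `send` are illustrative helper names, not part of ialacol.

```python
import json
import urllib.request


def build_chat_request(base_url, model, content, stream=False):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": content}],
        "model": model,
        "stream": stream,
    }
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request):
    """Send the request and return the reply text (assumes an OpenAI-style body)."""
    with urllib.request.urlopen(request) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]


req = build_chat_request(
    "http://localhost:8000/v1",
    "llama-2-7b-chat.ggmlv3.q4_0.bin",
    "How are you?",
)
print(req.full_url)
```

Calling `send(req)` performs the actual HTTP call, so it only works while the server is reachable on port 8000.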

Alternatively, use OpenAI's client library (see more examples in the examples/openai folder).

openai -k "sk-fake" \
     -b http://localhost:8000/v1 -vvvvv \
     api chat_completions.create -m llama-2-7b-chat.ggmlv3.q4_0.bin \
     -g user "Hello world!"
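
Setting "stream": true in the request body asks for incremental output instead of a single response. Below is a minimal sketch of consuming such a stream, assuming ialacol follows the OpenAI server-sent-events convention (`data:` lines carrying JSON chunks, terminated by `data: [DONE]`). The `iter_stream_text` helper and the sample lines are illustrative, not captured server output.

```python
import json


def iter_stream_text(lines):
    """Yield content fragments from OpenAI-style 'data: ...' stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # sentinel that marks the end of the stream
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if delta.get("content"):
            yield delta["content"]


# Hand-written sample chunks in the assumed format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]

text = "".join(iter_stream_text(sample))
print(text)
```

In practice the `lines` argument would be the decoded lines of the HTTP response body, read as they arrive.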

Configuration

All configuration is done via environment variables.

Parameter	Description	Default	Example
`DEFAULT_MODEL_HG_REPO_ID`	The Hugging Face repo ID to download the model from	`None`	`TheBloke/orca_mini_3B-GGML`
`DEFAULT_MODEL_HG_REPO_REVISION`	The Hugging Face repo revision	`main`	`gptq-4bit-32g-actorder_True`
`DEFAULT_MODEL_FILE`	The file name to download from the repo; optional for GPTQ models	`None`	`orca-mini-3b.ggmlv3.q4_0.bin`
`MODEL_TYPE`	Model type; overrides automatic model type detection	`None`	`gptq`, `gpt_bigcode`, `llama`, `mpt`, `replit`, `falcon`, `gpt_neox`, `gptj`
`LOGGING_LEVEL`	Logging level	`INFO`	`DEBUG`
`TOP_K`	top-k for sampling	`40`	Integer
`TOP_P`	top-p for sampling	`1.0`	Float
`REPETITION_PENALTY`	Repetition penalty for sampling	`1.1`	Float
`LAST_N_TOKENS`	The last n tokens used for the repetition penalty	`64`	Integer
`SEED`	The seed for sampling	`-1`	Integer
`BATCH_SIZE`	The batch size for evaluating tokens; GGUF/GGML models only	`8`	Integer
`THREADS`	Thread count; overrides the auto-detected value (CPU count / 2); set to `1` for GPTQ models	`Auto`	Integer
`MAX_TOKENS`	The maximum number of tokens to generate	`512`	Integer
`STOP`	The token that stops generation	`None`	`<
`CONTEXT_LENGTH`	Overrides the auto-detected context length	`512`	Integer
`GPU_LAYERS`	The number of layers to offload to the GPU	`0`	Integer
`TRUNCATE_PROMPT_LENGTH`	Truncates the prompt if set	`0`	Integer
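
Environment variables always arrive as strings, so a server has to apply the documented default and convert each value to its type. The following is a rough sketch of that pattern for a few rows of the table; it is not ialacol's actual loader, and `SETTINGS` and `load_config` are hypothetical names.

```python
import os

# (cast, default) pairs copied from a few rows of the table above.
SETTINGS = {
    "TOP_K": (int, "40"),
    "TOP_P": (float, "1.0"),
    "REPETITION_PENALTY": (float, "1.1"),
    "MAX_TOKENS": (int, "512"),
    "GPU_LAYERS": (int, "0"),
}


def load_config(env=None):
    """Return typed settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        name: cast(env.get(name, default))
        for name, (cast, default) in SETTINGS.items()
    }


# A plain dict stands in for os.environ so the example is reproducible.
config = load_config({"TOP_K": "50", "GPU_LAYERS": "40"})
print(config)
```

Passing no argument reads the real process environment, which is what a container started with `-e TOP_K=50` would see.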

Sampling parameters, including TOP_K, TOP_P, REPETITION_PENALTY, LAST_N_TOKENS, SEED, MAX_TOKENS, and STOP, can be overridden per request via the request body, for example:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
     http://localhost:8000/v1/chat/completions

This request uses temperature=2, top_p=1, and top_k=0.
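
The override rule can be pictured as a merge in which request-body values win over the server-wide defaults. This sketch assumes string values such as "2" are coerced to numbers, as the request above suggests; `resolve_sampling` and `OVERRIDABLE` are hypothetical names and not ialacol's implementation.

```python
# Parameters a request may override, with the type each is coerced to.
OVERRIDABLE = {
    "temperature": float,
    "top_k": int,
    "top_p": float,
    "repetition_penalty": float,
    "last_n_tokens": int,
    "seed": int,
    "max_tokens": int,
}


def resolve_sampling(defaults, body):
    """Request-body values win over server defaults; strings are coerced."""
    resolved = dict(defaults)
    for name, cast in OVERRIDABLE.items():
        if body.get(name) is not None:
            resolved[name] = cast(body[name])
    return resolved


# Defaults taken from the configuration table; body mirrors the curl request.
defaults = {"top_k": 40, "top_p": 1.0, "repetition_penalty": 1.1, "max_tokens": 512}
body = {"temperature": "2", "top_p": "1.0", "top_k": "0", "stream": False}

params = resolve_sampling(defaults, body)
print(params)
```

Keys that are not sampling parameters, such as "stream", are ignored by the merge and handled elsewhere.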

Run in a Container

Image from the GitHub Registry

There is an image hosted on ghcr.io (alternatively, CUDA 11, CUDA 12, Metal, and GPTQ variants).

docker run --rm -it -p 8000:8000 \
     -e DEFAULT_MODEL_HG_REPO_ID="TheBloke/Llama-2-7B-Chat-GGML" \
     -e DEFAULT_MODEL_FILE="llama-2-7b-chat.ggmlv3.q4_0.bin" \
     ghcr.io/chenhunghan/ialacol:latest

From Source

For developers/contributors

Python

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML" DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin" LOGGING_LEVEL="DEBUG" THREADS=4 uvicorn main:app --reload --host 0.0.0.0 --port 9999

Docker

Build the image

docker build --file ./Dockerfile -t ialacol .

Run the container

export DEFAULT_MODEL_HG_REPO_ID="TheBloke/orca_mini_3B-GGML"
export DEFAULT_MODEL_FILE="orca-mini-3b.ggmlv3.q4_0.bin"
docker run --rm -it -p 8000:8000 \
     -e DEFAULT_MODEL_HG_REPO_ID=$DEFAULT_MODEL_HG_REPO_ID \
     -e DEFAULT_MODEL_FILE=$DEFAULT_MODEL_FILE ialacol

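GPU Acceleration

To enable GPU/CUDA acceleration, use the container image built for GPU and add the GPU_LAYERS environment variable. The right GPU_LAYERS value depends on the size of your GPU memory; see the PR/discussion in llama.cpp to find the best value.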

CUDA 11

deployment.image = ghcr.io/chenhunghan/ialacol-cuda11:latest
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

CUDA 12

deployment.image = ghcr.io/chenhunghan/ialacol-cuda12:latest
deployment.env.GPU_LAYERS is the number of layers to offload to the GPU.

Only llama, falcon, mpt, and gpt_bigcode (StarCoder/StarChat) support CUDA.

Llama with CUDA 12

helm install llama2-7b-chat-cuda12 ialacol/ialacol -f examples/values/llama2-7b-chat-cuda12.yaml

Deploys the Llama 2 7B model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

StarCoderPlus with CUDA 12

helm install starcoderplus-guanaco-cuda12 ialacol/ialacol -f examples/values/starcoderplus-guanaco-cuda12.yaml

Deploys the StarCoderPlus-Guanaco-GPT4-15B-V1.0 model with 40 layers offloaded to the GPU. Inference is accelerated by CUDA 12.

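CUDA Driver Issues

If you see the error "CUDA driver version is insufficient for CUDA runtime version", you are probably using an NVIDIA driver that is not compatible with the CUDA version.

Upgrade the driver manually on the node (see here if you are using the CUDA11 + AMI), or try a different version of CUDA.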

Metal

To enable Metal support, use the image ialacol-metal, which is built for Metal.

deployment.image = ghcr.io/chenhunghan/ialacol-metal:latest

For example

helm install llama2-7b-chat-metal ialacol/ialacol -f examples/values/llama2-7b-chat-metal.yaml

GPTQ

To use GPTQ, you must set

deployment.image = ghcr.io/chenhunghan/ialacol-gptq:latest
deployment.env.MODEL_TYPE = gptq

For example

helm install llama2-7b-chat-gptq ialacol/ialacol -f examples/values/llama2-7b-chat-gptq.yaml

kubectl port-forward svc/llama2-7b-chat-gptq 8000:8000
openai -k "sk-fake" -b http://localhost:8000/v1 -vvvvv api chat_completions.create -m gptq_model-4bit-128g.safetensors -g user "Hello world!"

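Tips

Copilot

GitHub Copilot's API is almost identical to the OpenAI completion API, so ialacol can serve as the backend for the Copilot client.

However, there are a few things to keep in mind:

The Copilot client sends lengthy prompts in order to include all the related context for code completion (see copilot-explorer), which puts a heavy load on the server. If you are running ialacol locally, the opt-in TRUNCATE_PROMPT_LENGTH environment variable truncates the prompt from the beginning to reduce the workload.
Copilot sends requests in parallel. To increase throughput, you will probably need a queue such as text-inference-batcher (tib).

Start two instances of ialacol: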

gh repo clone chenhunghan/ialacol && cd ialacol && python3 -m venv .venv && source .venv/bin/activate && python3 -m pip install -r requirements.txt
export LOGGING_LEVEL="DEBUG"
export THREADS=2
export DEFAULT_MODEL_HG_REPO_ID="TheBloke/stablecode-completion-alpha-3b-4k-GGML"
export DEFAULT_MODEL_FILE="stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin"
export TRUNCATE_PROMPT_LENGTH=100 # optional
# Run each instance in its own terminal, with the variables above exported.
uvicorn main:app --host 0.0.0.0 --port 9998
uvicorn main:app --host 0.0.0.0 --port 9999

Start tib, pointing it at the upstream ialacol instances:

gh repo clone ialacol/text-inference-batcher && cd text-inference-batcher && npm install
UPSTREAMS="http://localhost:9998,http://localhost:9999" npm start

Configure VSCode GitHub Copilot to use tib:

"github.copilot.advanced": {
    "debug.overrideEngine": "stablecode-completion-alpha-3b-4k.ggmlv1.q4_0.bin",
    "debug.testOverrideProxyUrl": "http://localhost:8000",
    "debug.overrideProxyUrl": "http://localhost:8000"
}

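Creative vs. Conservative

LLMs are known to be sensitive to sampling parameters. The higher the temperature, the more randomness in the output, which makes the LLM more "creative"; top_p and top_k also contribute to the randomness.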
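
To see why temperature has this effect, consider the standard temperature-scaled softmax, in which token scores are divided by the temperature before being turned into probabilities. This is a generic illustration with made-up scores, not code from ialacol or ctransformers.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw token scores into probabilities, scaled by temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Three hypothetical token scores.
logits = [2.0, 1.0, 0.0]

cold = softmax_with_temperature(logits, 0.1)  # low temperature: near-deterministic
hot = softmax_with_temperature(logits, 2.0)   # high temperature: flatter distribution

print("temperature 0.1:", [round(p, 3) for p in cold])
print("temperature 2.0:", [round(p, 3) for p in hot])
```

At a low temperature almost all probability mass sits on the top-scoring token, so sampling rarely deviates from it. At a high temperature the alternatives become likely enough to be picked regularly.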

If you want the LLM to be creative:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "2", "top_p": "1.0", "top_k": "0" }' \
     http://localhost:8000/v1/chat/completions

If you want the LLM to be more consistent and produce the same result for the same input:

curl -X POST \
     -H 'Content-Type: application/json' \
     -d '{ "messages": [{"role": "user", "content": "Tell me a story."}], "model": "llama-2-7b-chat.ggmlv3.q4_0.bin", "stream": false, "temperature": "0.1", "top_p": "0.1", "top_k": "40" }' \
     http://localhost:8000/v1/chat/completions

Roadmap

Support the starcoder model type via ctransformers, including:
- StarChat https://huggingface.co/thebloke/starchat-beta-ggml
- StarCoder https://huggingface.co/thebloke/starcoder-ggml
- StarCoderPlus https://huggingface.co/thebloke/starcoderplus-ggml
Mimic the OpenAI API, including GET /models and POST /completions.
GPU acceleration (CUDA/Metal)
Support POST /embeddings backed by Hugging Face Apache-2.0 embedding models such as Sentence Transformers and hkunlp/instructor
Support Apache-2.0 fastchat-t5-3b
Support more Apache-2.0 models such as codet5p and others listed here


Receipts

Llama-2

Deploy Meta's Llama 2 Chat model quantized by TheBloke.

7B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-7b-chat ialacol/ialacol -f examples/values/llama2-7b-chat.yaml

13B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-13b-chat ialacol/ialacol -f examples/values/llama2-13b-chat.yaml

70B Chat

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install llama2-70b-chat ialacol/ialacol -f examples/values/llama2-70b-chat.yaml

OpenLM Research's OpenLLaMA Models

Deploy the OpenLLaMA 7B model quantized by rustformers.

This is a base model, so it is likely only useful for text completion.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-7b ialacol/ialacol -f examples/values/openllama-7b.yaml

VMware's OpenLlama 13B Open Instruct

Deploy the OpenLLaMA 13B Open Instruct model.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install openllama-13b-instruct ialacol/ialacol -f examples/values/openllama-13b-instruct.yaml

Mosaic's MPT Models

Deploy MosaicML's MPT-7B model quantized by rustformers. This is a base model, so it is likely only useful for text completion.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-7b ialacol/ialacol -f examples/values/mpt-7b.yaml

Deploy MosaicML's MPT-30B Chat model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install mpt-30b-chat ialacol/ialacol -f examples/values/mpt-30b-chat.yaml

Falcon Models

Deploy the uncensored Falcon 7B model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-7b ialacol/ialacol -f examples/values/falcon-7b.yaml

Deploy the uncensored Falcon 40B model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install falcon-40b ialacol/ialacol -f examples/values/falcon-40b.yaml

StarCoder Models (StarCoder, StarChat, StarCoderPlus, WizardCoder)

Deploy the starchat-beta model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install starchat-beta ialacol/ialacol -f examples/values/starchat-beta.yaml

Deploy the WizardCoder model quantized by TheBloke.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install wizard-coder-15b ialacol/ialacol -f examples/values/wizard-coder-15b.yaml

Pythia Models

Deploy the lightweight pythia-70m model, which has only 70 million parameters (~40MB), quantized by rustformers.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install pythia70m ialacol/ialacol -f examples/values/pythia-70m.yaml

RedPajama Models

Deploy the RedPajama 3B model.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install redpajama-3b ialacol/ialacol -f examples/values/redpajama-3b.yaml

StableLM Models

Deploy the StableLM 7B model.

helm repo add ialacol https://chenhunghan.github.io/ialacol
helm repo update
helm install stablelm-7b ialacol/ialacol -f examples/values/stablelm-7b.yaml

Development

python3 -m venv .venv
source .venv/bin/activate
python3 -m pip install -r requirements.txt
pip freeze > requirements.txt
