A text-to-speech and speech-to-text server compatible with the OpenAI API, powered by Whisper, FunASR, Bark, and CosyVoice backends.
You can install the project using pip:
pip install vox-box
# For MacOS, you need to manually install `openfst`, `pynini`, and `wetextprocessing` after installing `vox-box` to make `cosyvoice` work:
brew install openfst
export CPLUS_INCLUDE_PATH=$(brew --prefix openfst)/include
export LIBRARY_PATH=$(brew --prefix openfst)/lib
pip install pynini==2.1.6
pip install wetextprocessing==1.0.4.1

# Linux / MacOS
vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir ./cache/data-dir --host 0.0.0.0 --port 80
# Windows
vox-box start --huggingface-repo-id Systran/faster-whisper-small --data-dir C:\Users\michelia\AppData\Roaming\vox-box --host 0.0.0.0 --port 8082
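Once the server is running, any OpenAI-compatible client can be pointed at it. As a quick sanity check, the sketch below lists the models the server exposes (assuming the Windows example above, i.e. port 8082; the `base_url` and `api_key` values are placeholders to adjust for your deployment):

```python
# Minimal sketch: query the OpenAI-compatible /v1/models endpoint of a local
# vox-box server. Assumes the server is listening on port 8082; adjust
# base_url to match your --host/--port. The api_key value is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8082/v1", api_key="placeholder")

for model in client.models.list().data:
    print(model.id)
```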
| Model | Type | Link | Verified Platforms |
|---|---|---|---|
| Faster-whisper-large-v3 | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, MacOS ✅ |
| Faster-whisper-large-v2 | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, MacOS ✅ |
| Faster-whisper-large-v1 | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-medium | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, MacOS ✅ |
| Faster-whisper-medium.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-small | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, MacOS ✅ |
| Faster-whisper-small.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-distil-whisper-large-v3 | speech-to-text | Hugging Face, ModelScope | MacOS ✅ |
| Faster-distil-whisper-large-v2 | speech-to-text | Hugging Face, ModelScope | MacOS ✅ |
| Faster-distil-whisper-medium.en | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-tiny | speech-to-text | Hugging Face, ModelScope | |
| Faster-whisper-tiny.en | speech-to-text | Hugging Face, ModelScope | |
| Paraformer-zh | speech-to-text | Hugging Face, ModelScope | |
| Paraformer-zh-streaming | speech-to-text | Hugging Face, ModelScope | Linux ✅, MacOS ✅ |
| Paraformer-en | speech-to-text | Hugging Face, ModelScope | |
| Conformer-en | speech-to-text | Hugging Face, ModelScope | |
| SenseVoiceSmall | speech-to-text | Hugging Face, ModelScope | Linux ✅, Windows ✅, MacOS ✅ |
| Bark | text-to-speech | Hugging Face | |
| Bark-small | text-to-speech | Hugging Face | |
| CosyVoice-300M-Instruct | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported), Windows (not supported), macOS ✅ |
| CosyVoice-300M-SFT | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported), Windows (not supported), macOS ✅ |
| CosyVoice-300M | text-to-speech | Hugging Face, ModelScope | Linux (ARM not supported), Windows (not supported), macOS ✅ |
| CosyVoice-300M-25Hz | text-to-speech | ModelScope | Linux (ARM not supported), Windows (not supported), macOS ✅ |
Endpoint: POST /v1/audio/speech
Generates audio from the input text. Compatible with the OpenAI audio/speech API.
Example Request:
curl http://localhost/v1/audio/speech \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "cosyvoice",
    "input": "Hello world",
    "voice": "English Female"
  }' \
  --output speech.mp3

Response: The audio file content.
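The same request can be issued with the official OpenAI Python client by overriding `base_url`. A minimal sketch (the base URL, port, and API key are assumptions; adjust them to your deployment):

```python
# Sketch: generate speech through vox-box's OpenAI-compatible /v1/audio/speech
# endpoint using the openai Python SDK. base_url and api_key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost/v1", api_key="placeholder")

# Stream the synthesized audio straight to a file.
with client.audio.speech.with_streaming_response.create(
    model="cosyvoice",
    voice="English Female",
    input="Hello world",
) as response:
    response.stream_to_file("speech.mp3")
```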
Endpoint: POST /v1/audio/transcriptions
Transcribes audio into the input language. Compatible with the OpenAI audio/transcription API.
Example Request:
curl http://localhost/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: multipart/form-data" \
  -F file="@/path/to/file/audio.mp3" \
  -F model="whisper-large-v3"

Response:
{
"text": "Hello world."
}
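Equivalently, with the OpenAI Python client (the base URL, port, and API key are placeholders for your deployment):

```python
# Sketch: transcribe an audio file via vox-box's OpenAI-compatible
# /v1/audio/transcriptions endpoint. base_url and api_key are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost/v1", api_key="placeholder")

with open("/path/to/file/audio.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=audio_file,
    )

print(transcript.text)  # e.g. "Hello world."
```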
Endpoint: GET /v1/models
Returns the currently running models.
Endpoint: GET /v1/models/{model_id}
Returns the running model identified by `{model_id}`.
Endpoint: GET /v1/voices
Returns the supported voices of the currently running model.
Endpoint: GET /health
Returns the health check result of Vox Box.
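A small sketch that exercises these informational endpoints with plain HTTP requests (the port and the JSON response shapes are assumptions; adjust for your deployment):

```python
# Sketch: query vox-box's informational endpoints. The base URL/port are
# placeholders; /v1/models and /v1/voices are assumed to return JSON.
import requests

BASE = "http://localhost:8082"

print(requests.get(f"{BASE}/v1/models").json())  # currently running models
print(requests.get(f"{BASE}/v1/voices").json())  # voices supported by the running model
print(requests.get(f"{BASE}/health").text)       # health check result
```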