MetaVoice-1B is a 1.2B-parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:
We're releasing MetaVoice-1B under the Apache 2.0 license; it can be used without restrictions.
Web UI
docker-compose up -d ui && docker-compose ps && docker-compose logs -f

Server
# navigate to <URL>/docs for API definitions
docker-compose up -d server && docker-compose ps && docker-compose logs -f
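If you'd rather exercise the running server from code than from the /docs page, here is a minimal client sketch. The port, endpoint path, payload fields, and response format are all assumptions for illustration; check <URL>/docs for the actual API definitions.

# Minimal client sketch. The endpoint name, port, payload fields, and
# WAV-bytes response below are assumptions; see /docs for the real API.
import requests

resp = requests.post(
    "http://localhost:58003/tts",               # host/port assumed
    json={
        "text": "Hello from MetaVoice-1B.",
        "speaker_ref_path": "assets/bria.mp3",  # field name assumed
    },
    timeout=300,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)                       # assuming raw WAV bytes back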
Prerequisites:

Environment setup
# install ffmpeg
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz.md5
md5sum -c ffmpeg-git-amd64-static.tar.xz.md5
tar xvf ffmpeg-git-amd64-static.tar.xz
sudo mv ffmpeg-git-*-static/ffprobe ffmpeg-git-*-static/ffmpeg /usr/local/bin/
rm -rf ffmpeg-git-*
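Before moving on, you can sanity-check that the binaries were picked up; a small sketch:

# Sanity check: confirm ffmpeg/ffprobe from the install above are on PATH.
import shutil
import subprocess

for tool in ("ffmpeg", "ffprobe"):
    assert shutil.which(tool), f"{tool} not found on PATH"
    subprocess.run([tool, "-version"], check=True)  # prints version info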
# install rust if not installed (ensure you've restarted your terminal after installation)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install poetry if not installed (ensure you've restarted your terminal after installation)
pipx install poetry
# disable any conda envs that might interfere with poetry's venv
conda deactivate
# if running from Linux, keyring backend can hang on `poetry install`. This prevents that.
export PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring
# pip's dependency resolver will complain, this is temporary expected behaviour
# full inference & finetuning functionality will still be available
poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1

Note 1: When raising issues, we'll ask you to try with poetry first.
Note 2: All commands in this README use poetry by default; if you set up with pip instead, just drop the `poetry run` prefix.

Alternatively, set up the environment with pip in a venv:
pip install -r requirements.txt
pip install torch==2.2.1 torchaudio==2.2.1
pip install -e .
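Whichever setup path you used, a quick sanity check of the environment (a sketch, nothing repo-specific):

# Environment sanity check: confirm the pinned versions and GPU visibility.
import torch
import torchaudio

print(torch.__version__)          # expect 2.2.1
print(torchaudio.__version__)     # expect 2.2.1
print(torch.cuda.is_available())  # the model expects a CUDA-capable GPU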
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference. This will degrade the quality of the audio.
# Note: int8 is slower than bf16/fp16 for undebugged reasons. If you want fast, try int4 which is roughly 2x faster than bf16/fp16.
poetry run python -i fam/llm/fast_inference.py
# Example of API usage within the interactive python session
tts.synthesise(text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.", spk_ref_path="assets/bria.mp3")

Note: The script takes 30-90s to start up (depending on hardware), because we torch.compile the model for fast inference.
On Ampere, Ada-Lovelace, and Hopper architecture GPUs, the synthesise() API runs faster than real time (Real-Time Factor, RTF < 1.0) once compiled.
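To measure the RTF on your own hardware, here is a rough sketch. It assumes synthesise() returns the path of the generated WAV file; run one warm-up call first so compilation time isn't counted.

# Rough RTF measurement: wall-clock synthesis time / duration of the audio.
# Assumption: tts.synthesise(...) returns the path to the generated WAV.
import time
import torchaudio

tts.synthesise(text="Warm-up call.", spk_ref_path="assets/bria.mp3")  # trigger compile

start = time.time()
wav_path = tts.synthesise(
    text="Measuring the real-time factor of MetaVoice-1B.",
    spk_ref_path="assets/bria.mp3",
)
elapsed = time.time() - start

info = torchaudio.info(wav_path)
print(f"RTF: {elapsed / (info.num_frames / info.sample_rate):.2f}")  # < 1.0 is faster than real time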
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference. This will degrade the quality of the audio.
# Note: int8 is slower than bf16/fp16 for undebugged reasons. If you want fast, try int4 which is roughly 2x faster than bf16/fp16.
# navigate to <URL>/docs for API definitions
poetry run python serving.py
poetry run python app.py

Finetuning
We support finetuning the first stage LLM (see the Architecture section).
To finetune, we expect a "|"-delimited CSV dataset in the following format:
audio_files|captions
./data/audio.wav|./data/caption.txt
Note that we don't perform any dataset overlap checks, so please ensure your train and val datasets are disjoint.
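For instance, you could generate the two CSVs with a sketch like the one below, shuffling once and splitting by file so train and val stay disjoint (the audio/caption paths are placeholders):

# Sketch: write "|"-delimited train/val CSVs in the format above, splitting
# by file so the two sets are disjoint. Paths are placeholders.
import csv
import random

pairs = [(f"./data/audio{i}.wav", f"./data/caption{i}.txt") for i in range(100)]
random.shuffle(pairs)
cut = int(0.9 * len(pairs))

for name, rows in (("train.csv", pairs[:cut]), ("val.csv", pairs[cut:])):
    with open(name, "w", newline="") as f:
        w = csv.writer(f, delimiter="|")
        w.writerow(["audio_files", "captions"])
        w.writerows(rows)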
Try it out using our sample datasets via:
poetry run finetune --train ./datasets/sample_dataset.csv --val ./datasets/sample_val_dataset.csv

Once you've trained your model, you can use it for inference via:
poetry run python -i fam/llm/fast_inference.py --first_stage_path ./my-finetuned_model.pt

To set hyperparameters such as the learning rate and what to freeze, you can edit the finetune_params.py file.
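As an illustration only, the kind of settings you would adjust there looks like the following; the variable names are assumptions, not necessarily the file's actual contents, so check finetune_params.py for the real knobs.

# Illustrative hyperparameters of the kind set in finetune_params.py.
# The names below are assumptions; check the actual file for the real ones.
learning_rate = 3e-5        # smaller values are safer on small datasets
batch_size = 8
num_epochs = 10
freeze_speaker_cond = True  # e.g. freeze parts of the first-stage LLM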
We have a light, optional integration with W&B that can be enabled by setting wandb_log = True and installing the appropriate dependencies:
poetry install -E observable

Architecture
We predict EnCodec tokens from text and speaker information. These are then diffused up to the waveform level, with post-processing applied to clean up the audio.
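As a mental model of that pipeline, here is stub pseudocode; none of these names are the repository's actual APIs.

# Stub pseudocode for the pipeline described above; names are illustrative.
def speaker_encoder(ref_audio):        # speaker information -> embedding
    return "speaker-embedding"

def token_lm(text, speaker_emb):       # predict EnCodec tokens from text + speaker
    return ["encodec-token"] * 8

def diffusion_decoder(tokens):         # diffuse tokens up to the waveform level
    return "waveform"

def postprocess(waveform):             # clean up the audio
    return waveform

def synthesise(text, ref_audio):
    return postprocess(diffusion_decoder(token_lm(text, speaker_encoder(ref_audio))))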
The model supports:
Acknowledgements
We are grateful to Together.ai for their 24/7 help marshalling our cluster. We thank the teams at AWS, GCP, and Hugging Face for support with their cloud platforms.
Apologies in advance if we've missed anyone; if we have, please let us know.