MetaVoice-1B is a 1.2B-parameter base model trained on 100K hours of speech for TTS (text-to-speech). It has been built with the following priorities:
We're releasing MetaVoice-1B under the Apache 2.0 license; it can be used without restrictions.
Web UI
docker-compose up -d ui && docker-compose ps && docker-compose logs -f

Server
# navigate to <URL>/docs for API definitions
docker-compose up -d server && docker-compose ps && docker-compose logs -f
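If you'd rather exercise the running server from code than from the /docs page, here is a minimal client sketch. The port, endpoint path, payload fields, and response format are all assumptions for illustration; check <URL>/docs for the actual API definitions.

# Minimal client sketch. The endpoint name, port, payload fields, and
# WAV-bytes response below are assumptions; see /docs for the real API.
import requests

resp = requests.post(
    "http://localhost:58003/tts",               # host/port assumed
    json={
        "text": "Hello from MetaVoice-1B.",
        "speaker_ref_path": "assets/bria.mp3",  # field name assumed
    },
    timeout=300,
)
resp.raise_for_status()
with open("output.wav", "wb") as f:
    f.write(resp.content)                       # assuming raw WAV bytes back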
Prerequisites:

Environment setup
# install ffmpeg
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz
wget https://johnvansickle.com/ffmpeg/builds/ffmpeg-git-amd64-static.tar.xz.md5
md5sum -c ffmpeg-git-amd64-static.tar.xz.md5
tar xvf ffmpeg-git-amd64-static.tar.xz
sudo mv ffmpeg-git-*-static/ffprobe ffmpeg-git-*-static/ffmpeg /usr/local/bin/
rm -rf ffmpeg-git-*
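Before moving on, you can sanity-check that the binaries were picked up; a small sketch:

# Sanity check: confirm ffmpeg/ffprobe from the install above are on PATH.
import shutil
import subprocess

for tool in ("ffmpeg", "ffprobe"):
    assert shutil.which(tool), f"{tool} not found on PATH"
    subprocess.run([tool, "-version"], check=True)  # prints version info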
# install rust if not installed (ensure you've restarted your terminal after installation)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# install poetry if not installed (ensure you've restarted your terminal after installation)
pipx install poetry
# disable any conda envs that might interfere with poetry's venv
conda deactivate
# if running from Linux, keyring backend can hang on `poetry install`. This prevents that.
export PYTHON_KEYRING_BACKEND=keyring.backends.fail.Keyring
# pip's dependency resolver will complain, this is temporary expected behaviour
# full inference & finetuning functionality will still be available
poetry install && poetry run pip install torch==2.2.1 torchaudio==2.2.1

Note 1: When raising issues, we'll ask you to try with poetry first.
Note 2: All commands in this README use poetry by default; if you set up with pip instead, just drop the `poetry run` prefix.

Alternatively, set up the environment with pip in a venv:
pip install -r requirements.txt
pip install torch==2.2.1 torchaudio==2.2.1
pip install -e .
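Whichever setup path you used, a quick sanity check of the environment (a sketch, nothing repo-specific):

# Environment sanity check: confirm the pinned versions and GPU visibility.
import torch
import torchaudio

print(torch.__version__)          # expect 2.2.1
print(torchaudio.__version__)     # expect 2.2.1
print(torch.cuda.is_available())  # the model expects a CUDA-capable GPU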
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference. This will degrade the quality of the audio.
# Note: int8 is slower than bf16/fp16 for undebugged reasons. If you want fast, try int4 which is roughly 2x faster than bf16/fp16.
poetry run python -i fam/llm/fast_inference.py
# Example of API usage within the interactive python session
tts.synthesise(text="This is a demo of text to speech by MetaVoice-1B, an open-source foundational audio model.", spk_ref_path="assets/bria.mp3")

Note: The script takes 30-90s to start up (depending on hardware), because we torch.compile the model for fast inference.
On Ampere, Ada-Lovelace, and Hopper architecture GPUs, the synthesise() API runs faster than real time (Real-Time Factor, RTF < 1.0) once compiled.
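To measure the RTF on your own hardware, here is a rough sketch. It assumes synthesise() returns the path of the generated WAV file; run one warm-up call first so compilation time isn't counted.

# Rough RTF measurement: wall-clock synthesis time / duration of the audio.
# Assumption: tts.synthesise(...) returns the path to the generated WAV.
import time
import torchaudio

tts.synthesise(text="Warm-up call.", spk_ref_path="assets/bria.mp3")  # trigger compile

start = time.time()
wav_path = tts.synthesise(
    text="Measuring the real-time factor of MetaVoice-1B.",
    spk_ref_path="assets/bria.mp3",
)
elapsed = time.time() - start

info = torchaudio.info(wav_path)
print(f"RTF: {elapsed / (info.num_frames / info.sample_rate):.2f}")  # < 1.0 is faster than real time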
# You can use `--quantisation_mode int4` or `--quantisation_mode int8` for experimental faster inference. This will degrade the quality of the audio.
# Note: int8 is slower than bf16/fp16 for undebugged reasons. If you want fast, try int4 which is roughly 2x faster than bf16/fp16.
# navigate to <URL>/docs for API definitions
poetry run python serving.py
poetry run python app.py

Finetuning
We support finetuning the first stage LLM (see the Architecture section).
To finetune, we expect a "|"-delimited CSV dataset in the following format:
audio_files|captions
./data/audio.wav|./data/caption.txt
Note that we don't perform any dataset overlap checks, so please ensure your train and val datasets are disjoint.
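For instance, you could generate the two CSVs with a sketch like the one below, shuffling once and splitting by file so train and val stay disjoint (the audio/caption paths are placeholders):

# Sketch: write "|"-delimited train/val CSVs in the format above, splitting
# by file so the two sets are disjoint. Paths are placeholders.
import csv
import random

pairs = [(f"./data/audio{i}.wav", f"./data/caption{i}.txt") for i in range(100)]
random.shuffle(pairs)
cut = int(0.9 * len(pairs))

for name, rows in (("train.csv", pairs[:cut]), ("val.csv", pairs[cut:])):
    with open(name, "w", newline="") as f:
        w = csv.writer(f, delimiter="|")
        w.writerow(["audio_files", "captions"])
        w.writerows(rows)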
Try it out using our sample datasets via:
poetry run finetune --train ./datasets/sample_dataset.csv --val ./datasets/sample_val_dataset.csv

Once you've trained your model, you can use it for inference via:
poetry run python -i fam/llm/fast_inference.py --first_stage_path ./my-finetuned_model.pt

To set hyperparameters such as the learning rate and what to freeze, you can edit the finetune_params.py file.
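As an illustration only, the kind of settings you would adjust there looks like the following; the variable names are assumptions, not necessarily the file's actual contents, so check finetune_params.py for the real knobs.

# Illustrative hyperparameters of the kind set in finetune_params.py.
# The names below are assumptions; check the actual file for the real ones.
learning_rate = 3e-5        # smaller values are safer on small datasets
batch_size = 8
num_epochs = 10
freeze_speaker_cond = True  # e.g. freeze parts of the first-stage LLM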
We have a light, optional integration with W&B that can be enabled by setting wandb_log = True and installing the appropriate dependencies:
poetry install -E observable

Architecture
We predict EnCodec tokens from text and speaker information. These are then diffused up to the waveform level, with post-processing applied to clean up the audio.
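As a mental model of that pipeline, here is stub pseudocode; none of these names are the repository's actual APIs.

# Stub pseudocode for the pipeline described above; names are illustrative.
def speaker_encoder(ref_audio):        # speaker information -> embedding
    return "speaker-embedding"

def token_lm(text, speaker_emb):       # predict EnCodec tokens from text + speaker
    return ["encodec-token"] * 8

def diffusion_decoder(tokens):         # diffuse tokens up to the waveform level
    return "waveform"

def postprocess(waveform):             # clean up the audio
    return waveform

def synthesise(text, ref_audio):
    return postprocess(diffusion_decoder(token_lm(text, speaker_encoder(ref_audio))))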
The model supports:
Acknowledgements
We are grateful to Together.ai for their 24/7 help marshalling our cluster. We thank the teams at AWS, GCP, and Hugging Face for support with their cloud platforms.
Apologies in advance if we've missed anyone; if we have, please let us know.