AivisSpeech Engine

AivisSpeech Engine: AI Voice Imitation System - Text to Speech Engine

AivisSpeech Engine is a Japanese speech synthesis engine based on VOICEVOX ENGINE.
It is built into AivisSpeech, a Japanese speech synthesis software, and can easily generate emotional voices.

Download AivisSpeech / Download AivisSpeech Engine

To users
Operating environment
Supported speech synthesis models
- Supported model architecture
- Where to place the model file
How to install
- Windows/macOS
- Linux
- Linux + Docker
  - Run on the CPU
  - Running on NVIDIA GPU (CUDA)
Using the Speech Synthesis API
Compatibility with the VOICEVOX API
- Style ID in AivisSpeech Engine
- Changed AudioQuery type specifications
- Mora type specifications changed
- Changed specifications for Preset type
- API endpoints not supported by AivisSpeech Engine
- API parameters not supported by AivisSpeech Engine
Frequently Asked Questions / Q&A
Development Policy
Building a development environment
Development
License

To users

If you are looking to use AivisSpeech, please visit the AivisSpeech official website.

This page contains information mainly for developers.
The following documents are intended for users:

How to use
Frequently Asked Questions / Q&A
Contact

Operating environment

AivisSpeech Engine runs on PCs with Windows, macOS, or Linux.
To start the AivisSpeech Engine, your PC needs at least 3GB of free memory (RAM).

Windows: Windows 10 (22H2 or later)/Windows 11
macOS: macOS 13 Ventura or later
Linux: Ubuntu 20.04 or later

Tip

The desktop app, AivisSpeech, is only supported on Windows and macOS.
Meanwhile, AivisSpeech Engine, a speech synthesis API server, is also available for Ubuntu/Debian Linux.

Note

We have not actively verified the operation on Macs with Intel CPUs.
Macs with Intel CPUs are already out of production, and it is becoming more difficult to prepare a verification and build environment. We recommend using this product on a Mac with Apple Silicon as much as possible.

Warning

On Windows 10, operation is verified only on version 22H2.
There have been reports of AivisSpeech Engine crashing and failing to start on older, out-of-support versions of Windows 10.
From a security perspective as well, we strongly recommend that Windows 10 users update to at least version 22H2 before using this software.

Supported speech synthesis models

AivisSpeech Engine supports speech synthesis model files in the AIVMX (Aivis Voice Model for ONNX) format (.aivmx).

AIVM (Aivis Voice Model) / AIVMX (Aivis Voice Model for ONNX) is an open file format for AI speech synthesis models that combines a pre-trained model, hyperparameters, style vectors, and speaker metadata (names, overviews, licenses, icons, voice samples, etc.) into a single file.

For more information about the AIVM specifications and AIVM/AIVMX files, please refer to the AIVM specifications developed in the Aivis Project.

Note

"AIVM" is also a general term for both AIVM/AIVMX format specifications and metadata specifications.
Specifically, the AIVM file is a model file in "Safetensors format with AIVM metadata added", and the AIVMX file is a model file in "ONNX format with AIVM metadata added".
"AIVM Metadata" refers to various metadata that is linked to a trained model as defined in the AIVM specification.

Important

The AivisSpeech Engine is also a reference implementation of the AIVM specification, but it is deliberately designed to support only AIVMX files.
This eliminates dependency on PyTorch, reduces installation size and provides fast CPU inference with ONNX Runtime.

Tip

AIVM Generator allows you to generate AIVM/AIVMX files from existing speech synthesis models and edit the metadata of existing AIVM/AIVMX files!

Supported model architecture

AIVMX files for the following model architectures are supported:

Style-Bert-VITS2
Style-Bert-VITS2 (JP-Extra)

Note

The AIVM metadata specification allows you to define multilingual speakers, but AivisSpeech Engine, like VOICEVOX ENGINE, only supports Japanese speech synthesis.
Therefore, even if you use a speech synthesis model that supports English or Chinese, speech synthesis other than Japanese cannot be performed.

Where to place the model file

Place the AIVMX files in the following folders for each OS:

Windows: C:\Users\(username)\AppData\Roaming\AivisSpeech-Engine\Models
macOS: ~/Library/Application Support/AivisSpeech-Engine/Models
Linux: ~/.local/share/AivisSpeech-Engine/Models

The actual folder path is shown after Models directory: in the logs printed immediately after AivisSpeech Engine starts.

Tip

When using AivisSpeech, you can easily add speech synthesis models from the AivisSpeech UI screen!
For end users, we recommend that you add a speech synthesis model using this method.

Important

For the development version (when running without a PyInstaller build), the folder is AivisSpeech-Engine-Dev rather than AivisSpeech-Engine.
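As a rough illustration of the folder layout described in this section, the following Python sketch resolves the data folder for each OS. The function name `engine_data_dir` and its parameters are hypothetical and not part of AivisSpeech Engine; the authoritative path is always the one printed after Models directory: in the engine logs.

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def engine_data_dir(system: str, home: str, dev: bool = False) -> PurePath:
    """Return the AivisSpeech Engine data folder for the given OS.

    system: "Windows", "Darwin" (macOS), or anything else (treated as Linux).
    home:   the user's home directory.
    dev:    True for a development run that is not built with PyInstaller.
    """
    # Hypothetical helper: folder names are taken from this document.
    name = "AivisSpeech-Engine-Dev" if dev else "AivisSpeech-Engine"
    if system == "Windows":
        return PureWindowsPath(home) / "AppData" / "Roaming" / name
    if system == "Darwin":
        return PurePosixPath(home) / "Library" / "Application Support" / name
    return PurePosixPath(home) / ".local" / "share" / name


# Model files (.aivmx) go into the "Models" subfolder.
print(engine_data_dir("Linux", "/home/alice") / "Models")
```

Passing the OS name and home directory explicitly keeps the sketch easy to check on any machine; in real use they could come from `platform.system()` and `pathlib.Path.home()`.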

How to install

AivisSpeech Engine offers useful command line options such as:

Specify --host 0.0.0.0 to allow AivisSpeech Engine to be accessed from other devices on the same network.
Specify --cors_policy_mode all to allow CORS requests from all domains.
Specify --load_all_models to preload all installed speech synthesis models when AivisSpeech Engine starts.
Specify --help to display a list and description of all available options.

There are many other options available. For details, please refer to the --help option.

Tip

When run with the --use_gpu option, the engine can use DirectML on Windows and an NVIDIA GPU (CUDA) on Linux for faster speech synthesis.
On Windows, DirectML inference also works on PCs that have only a CPU-integrated GPU (iGPU), but in most cases it is much slower than CPU inference, so this is not recommended.
For more information, see Frequently Asked Questions.

Note

AivisSpeech Engine listens on port 10101 by default.
If it conflicts with other applications, you can change it to any port number using the --port option.

Warning

Unlike VOICEVOX ENGINE, some options are not implemented in AivisSpeech Engine.

Windows/macOS

On Windows/macOS, you can install AivisSpeech Engine on its own, but it is easier to launch the AivisSpeech Engine bundled with the AivisSpeech app by itself.

The path to the AivisSpeech Engine executable file ( run.exe / run ) that is shipped with AivisSpeech is as follows:

Windows: C:\Program Files\AivisSpeech\AivisSpeech-Engine\run.exe
- If installed with user privileges, it will be C:\Users\(username)\AppData\Local\Programs\AivisSpeech\AivisSpeech-Engine\run.exe.
macOS: /Applications/AivisSpeech.app/Contents/Resources/AivisSpeech-Engine/run
- If installed with user privileges, it will be ~/Applications/AivisSpeech.app/Contents/Resources/AivisSpeech-Engine/run.

Note

The default model (approx. 250MB) and the BERT model (approx. 1.3GB) required for inference are automatically downloaded at the first startup, so it will take up to a few minutes for the startup to complete.
Please wait a while until the startup is complete.

To add a speech synthesis model to AivisSpeech Engine, see Where to Place the Model File.
You can also add it by clicking "Settings" > "Manage Speech Synthesis Model" in AivisSpeech.

Linux

When running in a Linux + NVIDIA GPU environment, the CUDA/cuDNN version that ONNX Runtime supports must match the CUDA/cuDNN version of the host environment, and the operating conditions are strict.
Specifically, the ONNX Runtime used by AivisSpeech Engine requires CUDA 12.x / cuDNN 9.x or higher.

Docker works regardless of the host OS environment, so we recommend installing it with Docker.

Linux + Docker

When running a Docker container, always mount ~/.local/share/AivisSpeech-Engine to /home/user/.local/share/AivisSpeech-Engine-Dev in the container.
This way, even after you have stopped or restarted the container, you can still maintain the installed speech synthesis model and BERT model cache (approximately 1.3GB).

To add a speech synthesis model to the AivisSpeech Engine in a Docker environment, place the model file (.aivmx) under ~/.local/share/AivisSpeech-Engine/Models in the host environment.

Important

Be sure to mount it at /home/user/.local/share/AivisSpeech-Engine-Dev.
Because the AivisSpeech Engine in the Docker image is not built with PyInstaller, the data folder name carries the -Dev suffix and becomes AivisSpeech-Engine-Dev.

Run on the CPU

docker pull ghcr.io/aivis-project/aivisspeech-engine:cpu-latest
docker run --rm -p '10101:10101' \
  -v ~/.local/share/AivisSpeech-Engine:/home/user/.local/share/AivisSpeech-Engine-Dev \
  ghcr.io/aivis-project/aivisspeech-engine:cpu-latest

Running on NVIDIA GPU (CUDA)

docker pull ghcr.io/aivis-project/aivisspeech-engine:nvidia-latest
docker run --rm --gpus all -p '10101:10101' \
  -v ~/.local/share/AivisSpeech-Engine:/home/user/.local/share/AivisSpeech-Engine-Dev \
  ghcr.io/aivis-project/aivisspeech-engine:nvidia-latest

Using the Speech Synthesis API

When you run the following one-liner in Bash, the speech synthesized WAV file will be output to audio.wav .

Important

This assumes that AivisSpeech Engine is already running and that the directory shown after Models directory: in the logs contains the speech synthesis model (.aivmx) corresponding to the style ID.

# STYLE_ID is the ID of the style to synthesize; obtain it separately from the /speakers API
STYLE_ID=888753760 && \
echo -n "こんにちは、音声合成の世界へようこそ！" > text.txt && \
curl -s -X POST "127.0.0.1:10101/audio_query?speaker=$STYLE_ID" --get --data-urlencode text@text.txt > query.json && \
curl -s -H "Content-Type: application/json" -X POST -d @query.json "127.0.0.1:10101/synthesis?speaker=$STYLE_ID" > audio.wav && \
rm text.txt query.json

Tip

For detailed API request and response specifications, please refer to the API documentation and to the Compatibility with the VOICEVOX API section. The API documentation always reflects changes in the latest development version.

While the AivisSpeech Engine or the AivisSpeech editor is running, you can view the API documentation (Swagger UI) of the running engine at http://127.0.0.1:10101/docs.
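The same two-step flow (/audio_query followed by /synthesis) can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: the helper names `build_url` and `synthesize` are hypothetical, and it assumes the engine is running on the default port 10101 with a model installed for the given style ID.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://127.0.0.1:10101"  # default port described in this document


def build_url(path: str, **params) -> str:
    """Build an engine URL with URL-encoded query parameters."""
    return f"{BASE_URL}{path}?{urllib.parse.urlencode(params)}"


def synthesize(text: str, style_id: int) -> bytes:
    """Call /audio_query, then /synthesis, and return the WAV data."""
    # Step 1: POST /audio_query with the text and style ID as query parameters.
    request = urllib.request.Request(
        build_url("/audio_query", text=text, speaker=style_id), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        audio_query = json.load(response)

    # Step 2: POST the AudioQuery JSON to /synthesis.
    request = urllib.request.Request(
        build_url("/synthesis", speaker=style_id),
        data=json.dumps(audio_query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


# Usage (requires a running engine; the style ID comes from the /speakers API):
#   wav = synthesize("こんにちは、音声合成の世界へようこそ！", 888753760)
#   with open("audio.wav", "wb") as f:
#       f.write(wav)
```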

Compatibility with the VOICEVOX API

AivisSpeech Engine is generally compatible with VOICEVOX ENGINE's HTTP API.

If you have software that supports VOICEVOX ENGINE's HTTP API, you should be able to support AivisSpeech Engine by simply replacing the API URL with http://127.0.0.1:10101 .

Important

However, if you edit the AudioQuery content obtained from the /audio_query API on the API client and then pass it to the /synthesis API, it may not be possible to properly synthesize speech due to differences in specifications (see below).

Due to this, the AivisSpeech editor can use both AivisSpeech Engine and VOICEVOX ENGINE (when using multi-engine functions), but you cannot use AivisSpeech Engine from the VOICEVOX editor.
Using AivisSpeech Engine in the VOICEVOX editor significantly reduces the quality of speech synthesis due to limitations in the editor implementation. In addition to being unable to utilize the unique parameters of AivisSpeech Engine, there is a possibility that an error may occur when calling non-compatible functions.
We highly recommend using it with the AivisSpeech editor to get better speech synthesis results.

Note

Although the API should be compatible for typical use cases, there may be incompatibilities other than those listed below, because speech synthesis systems with fundamentally different model architectures are being forced into the same API specification.
If you report such cases via a GitHub issue, we will fix whatever can be fixed to improve compatibility.

The following are changes to the API specifications from VOICEVOX ENGINE:

Style ID in AivisSpeech Engine

The speaker-style local IDs in the AIVM manifest contained in the AIVMX file are managed in serial numbers starting from 0 for each speaker.
In the speech synthesis model of the Style-Bert-VITS2 architecture, this value matches the value of the model's hyperparameter data.style2id .

On the other hand, VOICEVOX ENGINE's API does not specify a speaker UUID ( speaker_uuid ) and only the style ID ( style_id ) is passed to the speech synthesis API, perhaps due to historical reasons.
VOICEVOX ENGINE has fixed speakers and styles, so the development team was able to uniquely manage the "style ID."

Meanwhile, AivisSpeech Engine allows users to freely add speech synthesis models.
Therefore, the VOICEVOX API compatible "Style ID" must be unique no matter what speech synthesis model is added.
This is because if the value is not unique, the speaker style and style IDs included in the existing model can overlap when a new speech synthesis model is added.

Therefore, AivisSpeech Engine combines the speaker UUID and style ID on the AIVM manifest to generate a globally unique "style ID" compatible with the VOICEVOX API.
The specific generation method is as follows:

Convert the speaker UUID to an MD5 hash value
Combine the lower 27 bits of this hash value with the 5 bits (0 to 31) of the local style ID
Convert the result to a 32-bit signed integer
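The three steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the engine's actual code: the function name is hypothetical, and the choices to hash the UTF-8 UUID string, read the digest as a big-endian integer, and place the 27 hash bits above the 5 style bits are guesses, so the values produced here may not match the IDs returned by a real engine.

```python
import hashlib


def to_global_style_id(speaker_uuid: str, local_style_id: int) -> int:
    """Combine a speaker UUID and a local style ID into a 32-bit signed integer."""
    if not 0 <= local_style_id <= 31:
        raise ValueError("local style ID must fit in 5 bits (0 to 31)")

    # Step 1: MD5 hash of the speaker UUID (assumption: UTF-8 string, big-endian).
    digest = hashlib.md5(speaker_uuid.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, byteorder="big")

    # Step 2: lower 27 bits of the hash combined with the 5-bit local style ID
    # (assumption: hash bits occupy the upper positions).
    combined = ((hash_value & ((1 << 27) - 1)) << 5) | local_style_id

    # Step 3: reinterpret the 32-bit value as a signed integer.
    return combined - (1 << 32) if combined >= (1 << 31) else combined


# Hypothetical UUID used purely for illustration.
print(to_global_style_id("00000000-0000-0000-0000-000000000000", 0))
```

Because only 27 bits of the hash survive, two different speaker UUIDs can in principle map to the same value, which is the collision risk the warnings in this section describe.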

Warning

Because of this, unexpected problems can occur with VOICEVOX API-compatible software that does not expect "style ID" values spanning the full 32-bit signed integer range.

Warning

Because the global uniqueness of the speaker UUID is sacrificed to fit within the 32-bit signed integer range, there is an extremely small probability that the style IDs of different speakers collide.
At this point there is no workaround for colliding style IDs, but in practice this is not a problem in most cases.

Tip

The VOICEVOX API compatible "style IDs" automatically generated by AivisSpeech Engine can be obtained from the /speakers API.
This API returns a list of speaker information installed on AivisSpeech Engine.
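As a small sketch of how a client might look up style IDs, the Python below flattens a /speakers response into a name-to-ID mapping. The helper names are hypothetical, and the response shape (a list of speakers, each with `name` and a `styles` list whose entries have `name` and `id`) is assumed from the VOICEVOX-compatible API; check the Swagger UI of your running engine for the exact schema.

```python
import json
import urllib.request


def list_style_ids(speakers: list[dict]) -> dict[str, int]:
    """Map "speaker name / style name" to the VOICEVOX-compatible style ID."""
    return {
        f'{speaker["name"]} / {style["name"]}': style["id"]
        for speaker in speakers
        for style in speaker["styles"]
    }


def fetch_speakers(base_url: str = "http://127.0.0.1:10101") -> list[dict]:
    """GET /speakers from a running engine (default port assumed)."""
    with urllib.request.urlopen(f"{base_url}/speakers") as response:
        return json.load(response)


# Usage (requires a running engine):
#   for label, style_id in list_style_ids(fetch_speakers()).items():
#       print(style_id, label)
```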

Changed `AudioQuery` type specifications

AudioQuery type is a query for performing speech synthesis by specifying text or phoneme sequences.

The main changes from AudioQuery type in VOICEVOX ENGINE are as follows:

The meaning of the intonationScale field is different.
- In VOICEVOX ENGINE it was a parameter for the overall intonation, but in AivisSpeech Engine it is a parameter for the strength of the selected speaker style.
- Specify the strength of the speaker style in the range 0.0 to 2.0 (default: 1.0).
- The higher the value, the more strongly the style is expressed.
  - For example, with a "happy" style, a higher value produces a happier, more cheerful delivery.
  - However, depending on the speaker and style, raising the value too far may make the voice sound unnatural.
- It has no effect on the normal style, which is the average of all styles (the value is ignored).
- The style strength parameter of Style-Bert-VITS2 maps to intonationScale in AivisSpeech Engine as follows:
  - intonationScale 0.0 to 1.0 corresponds to 0.0 to 1.0 in Style-Bert-VITS2.
  - intonationScale 1.0 to 2.0 corresponds to 1.0 to 10.0 in Style-Bert-VITS2.
The tempoDynamicsScale field has been added.
- This parameter is specific to AivisSpeech Engine. It specifies how strongly the speaking speed varies, in the range 0.0 to 2.0 (default: 1.0).
- The higher the value, the more hurried the speech and the rawer its intonation.
- The tempo dynamics parameter of Style-Bert-VITS2 maps to tempoDynamicsScale in AivisSpeech Engine as follows:
  - tempoDynamicsScale 0.0 to 1.0 corresponds to 0.0 to 0.2 in Style-Bert-VITS2.
  - tempoDynamicsScale 1.0 to 2.0 corresponds to 0.2 to 1.0 in Style-Bert-VITS2.
The specification of the pitchScale field is different.
- Unlike VOICEVOX ENGINE, changing this value from 0.0 may degrade sound quality.
The pauseLength and pauseLengthScale fields are not supported.
- They exist as fields for compatibility purposes, but are always ignored.
The specification of the kana field is different.
- In VOICEVOX ENGINE it was a read-only field containing AquesTalk-style text, but in AivisSpeech Engine it is used to specify the plain text to be read aloud.
- If null or an empty string is specified, the hiragana string generated automatically from the accent phrases is used as the text to read, but this may result in unnatural intonation.
- We recommend specifying the plain reading text whenever possible to obtain more natural speech synthesis results.

For more information about the changes, see model.py.
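To make the two range mappings concrete, here is a minimal Python sketch. It assumes that each segment is mapped by straight linear interpolation and that out-of-range inputs are clamped to 0.0 to 2.0; the text above only gives the endpoint ranges, so both points are assumptions, and the function names are hypothetical rather than taken from model.py.

```python
def _lerp(x: float, x0: float, x1: float, y0: float, y1: float) -> float:
    """Linearly map x from the interval [x0, x1] onto [y0, y1]."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)


def style_strength_from_intonation_scale(intonation_scale: float) -> float:
    """intonationScale 0.0-1.0 -> 0.0-1.0, and 1.0-2.0 -> 1.0-10.0."""
    x = min(max(intonation_scale, 0.0), 2.0)  # assumption: clamp to the valid range
    if x <= 1.0:
        return _lerp(x, 0.0, 1.0, 0.0, 1.0)
    return _lerp(x, 1.0, 2.0, 1.0, 10.0)


def tempo_dynamics_from_scale(tempo_dynamics_scale: float) -> float:
    """tempoDynamicsScale 0.0-1.0 -> 0.0-0.2, and 1.0-2.0 -> 0.2-1.0."""
    x = min(max(tempo_dynamics_scale, 0.0), 2.0)  # assumption: clamp to the valid range
    if x <= 1.0:
        return _lerp(x, 0.0, 1.0, 0.0, 0.2)
    return _lerp(x, 1.0, 2.0, 0.2, 1.0)


for value in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(value, style_strength_from_intonation_scale(value), tempo_dynamics_from_scale(value))
```

Under these assumptions the default value 1.0 maps to a style strength of 1.0 and a tempo dynamics value of 0.2, and the upper half of each range is much steeper than the lower half.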

`Mora` type specifications changed

Mora type is a data structure that represents a mora of the text to be read aloud.

Tip

A mora is the smallest unit of sound grouping when actually pronounced (such as "a", "ka", or "o").
Mora types are not used alone for API request responses, and are always used indirectly through AudioQuery.accent_phrases[n].moras or AudioQuery.accent_phrases[n].pause_mora .

The main changes from Mora type in VOICEVOX ENGINE are as follows:

Symbols are also treated as moras.
- In VOICEVOX ENGINE, symbols such as exclamation marks and punctuation were treated as pause_mora, but in AivisSpeech Engine they are treated as normal moras.
- For a symbol mora, text is the symbol itself and vowel is set to "pau".
The consonant / vowel fields are read-only.
- The reading used during speech synthesis is always taken from the text field.
- Changing the values of these fields does not affect the speech synthesis results.
The consonant_length / vowel_length / pitch fields are not supported.
- Because the implementation of AivisSpeech Engine cannot compute these values, a dummy value of 0.0 is always returned.
- They exist as fields for compatibility purposes, but are always ignored.

For more information about the changes, see tts_pipeline/model.py.
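As a small illustration of the symbol handling described above, the sketch below collects the symbol moras from an AudioQuery dictionary by looking for moras whose vowel is "pau". The helper name is hypothetical, and the sample query contains only the fields needed for the example rather than a full AudioQuery.

```python
def symbol_moras(audio_query: dict) -> list[str]:
    """Return the text of every mora that represents a symbol (vowel == "pau")."""
    return [
        mora["text"]
        for phrase in audio_query["accent_phrases"]
        for mora in phrase["moras"]
        if mora["vowel"] == "pau"
    ]


# Trimmed-down, hand-written example data (not real engine output).
sample_query = {
    "accent_phrases": [
        {
            "moras": [
                {"text": "ハ", "vowel": "a"},
                {"text": "イ", "vowel": "i"},
                {"text": "！", "vowel": "pau"},
            ],
            "pause_mora": None,
        }
    ]
}
print(symbol_moras(sample_query))
```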

Changed specifications for `Preset` type

Preset type is preset information for determining the initial value of a speech synthesis query on the editor.

The changes generally correspond to the changes in the fields specification for intonationScale / tempoDynamicsScale / pitchScale / pauseLength / pauseLengthScale described in AudioQuery types.

For more information about the changes, see preset/model.py.

API endpoints not supported by AivisSpeech Engine

Warning

The singing and cancellable speech synthesis APIs are not supported.
They exist as endpoints for compatibility purposes, but always return 501 Not Implemented.
For more information, see app/routers/character.py / app/routers/tts_pipeline.py.

GET /singers
GET /singer_info
POST /cancellable_synthesis
POST /sing_frame_audio_query
POST /sing_frame_volume
POST /frame_synthesis

Warning

The /synthesis_morphing API, which provides morphing functionality, is not supported.
Because speech timing differs for each speaker, a usable implementation is not possible (it technically runs, but the result is unlistenable), so 400 Bad Request is always returned.
The /morphable_targets API, which returns whether morphing is available for each speaker, reports morphing as unavailable for all speakers.
For more information, see app/routers/morphing.py.

POST /synthesis_morphing
POST /morphable_targets

API parameters not supported by AivisSpeech Engine

Warning

The following parameters exist for compatibility purposes, but are always ignored.
For more information, please check app/routers/character.py / app/routers/tts_pipeline.py.

core_version parameter
- This parameter specifies the version of VOICEVOX CORE.
- AivisSpeech Engine has no component corresponding to VOICEVOX CORE, so it is always ignored.
enable_interrogative_upspeak parameter
- This parameter automatically adjusts sentence endings when interrogative text is given.
- AivisSpeech Engine always reads text with natural intonation that reflects symbols in the text such as "!", "?", and "…".
- Therefore, simply adding "?" to the end of the text, as in どうですか…？, makes it read with interrogative intonation.

Frequently Asked Questions / Q&A

Tip

Please also take a look at the AivisSpeech Editor's FAQs/Q&A.

Q. Raising the intonation strength ( `intonationScale` ) makes the speech sound strange.

This is a current limitation of the Style-Bert-VITS2 model architecture supported by AivisSpeech Engine.
Depending on the speaker and style, raising intonationScale too far may make the voice sound strange or the reading unnatural.

The upper limit of intonationScale that still produces proper speech varies by speaker and style. Adjust the value accordingly.

Q. The reading is different from what I expected.

AivisSpeech Engine tries to get the reading and accent right on the first attempt, but some errors are unavoidable.
Words that are not in the built-in dictionary, such as proper nouns and personal names (especially unconventional "kira-kira" names), are often read incorrectly.

You can change how such words are read by registering them in the dictionary. Try registering words from the AivisSpeech editor or via the API.
Note that for compound words and English words, the registered dictionary entry may not be applied regardless of the word's priority. This is the current behavior.

Q. Sending long text to the speech synthesis API at once makes the audio unnatural or causes memory leaks.

AivisSpeech Engine is designed to synthesize relatively short pieces of text, such as a single sentence or a semantically coherent group of sentences.
Sending text longer than 1000 characters to the /synthesis API at once can therefore cause problems such as the following:

Memory usage rises sharply and the PC slows down
A memory leak causes AivisSpeech Engine to crash
The intonation becomes unnatural and the voice sounds flat and monotonous

When synthesizing long text, we recommend splitting it at the following positions before sending it to the speech synthesis API.
There is no hard limit, but keeping each synthesis request to 500 characters or fewer is desirable.

At punctuation marks (such as "。" and "、")
At semantic breaks in the text (such as paragraph boundaries)
At dialogue boundaries (passages enclosed in quotation marks)
Tip

Splitting text at semantic breaks tends to produce more natural intonation.
This is because the emotional expression and intonation inferred from the content are applied to the whole text sent to the speech synthesis API in a single request.
By splitting the text appropriately, the emotional expression and intonation are reset for each piece, resulting in a more natural reading.
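One way to follow this advice on the client side is sketched below. It is an assumption-laden illustration, not part of AivisSpeech Engine: the function name is hypothetical, it splits only after sentence-ending punctuation and line breaks, and it falls back to a hard cut when a single sentence exceeds the limit. The 500-character default follows the recommendation above.

```python
import re

# Split points: right after sentence-ending punctuation or a line break.
_SENTENCE_END = re.compile(r"(?<=[。！？!?\n])")


def split_for_synthesis(text: str, max_chars: int = 500) -> list[str]:
    """Split long text into chunks of at most max_chars characters."""
    sentences = [s for s in _SENTENCE_END.split(text) if s.strip()]
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at fixed positions.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current.strip():
        chunks.append(current)
    return chunks


long_text = "今日は良い天気です。" * 120  # 1200 characters
print([len(chunk) for chunk in split_for_synthesis(long_text)])
```

Each chunk can then be sent through /audio_query and /synthesis separately, and the resulting audio concatenated by the client.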

Q. Can I use it on an offline PC?

Internet access is required only the first time AivisSpeech is started, because model data is downloaded.
From the second launch onward, it can be used offline.

Q. I want to import/export a dictionary.

This can be done on the settings screen of the running AivisSpeech Engine.

While AivisSpeech Engine is running, open http://127.0.0.1:[AivisSpeech Engine port number]/setting in your browser to display the AivisSpeech Engine settings screen.
The default port number for AivisSpeech Engine is 10101.

Q. I switched to GPU mode ( `--use_gpu` ), but speech is generated more slowly than in CPU mode.

GPU mode also works on PCs that have only a CPU-integrated GPU (iGPU), but in most cases it is much slower than CPU mode, so it is not recommended.

This is because integrated GPUs perform worse than discrete GPUs (dGPUs) and are poorly suited to heavy workloads such as AI speech synthesis.
Meanwhile, recent CPUs have improved significantly and can generate audio quickly on their own.
We therefore recommend CPU mode on PCs without a dGPU.

Q. When generating audio, full performance cannot be achieved on Intel 12th generation or later CPUs.

If you are using a PC with Intel's 12th generation or later CPU (P-core/E-core hybrid configuration), the performance of audio generation can change dramatically depending on the power settings of Windows.
This is because, in the default "Balanced" mode, voice generation tasks are more likely to be assigned to the power-saving E-cores.

Change settings using the steps below to make the most of both the P and E cores and generate voice faster.

Open Windows 11 Settings
Go to System → Power
Change "Power Mode" to "Best Performance"

*The "Power Plan" in the Control Panel also has a "High Performance" setting, but it is a different setting.
For Intel 12th generation or later CPUs, we recommend changing "Power Mode" from the Windows 11 Settings screen.

Q. Is credit required?

AivisSpeech aims to be free AI speech synthesis software with no restrictions on how it is used.
At the very least, the software itself does not require credit and can be used freely by individuals or corporations, for commercial or non-commercial purposes. (Audio included in a deliverable is subject to the license of the speech synthesis model used.)

...That said, I would also like more people to know about AivisSpeech.
If you are willing, I would be happy if you credited AivisSpeech somewhere in your deliverable. (The credit format is up to you.)

Q. Where can I check the error log for AivisSpeech Engine?

It is saved in the following folder:

Windows: C:\Users\(username)\AppData\Roaming\AivisSpeech-Engine\Logs
macOS: ~/Library/Application Support/AivisSpeech-Engine/Logs
Linux: ~/.local/share/AivisSpeech-Engine/Logs

Q. I found a problem. Where should I report it?

If you find a problem, please report it using one of the following methods:

GitHub Issue (recommended)
If you have a GitHub account, please report via a GitHub issue so that we can respond sooner.
Twitter (X)
You can reply to the official Aivis Project account, DM, or tweet with the hashtag #AivisSpeech.
Contact form
You can also report via the Aivis Project Contact Form.

Please include as much of the following information as possible so that we can respond more quickly.

Description of the problem
Steps to reproduce (attach videos or screenshots if available)
OS type / AivisSpeech version
What you have tried to resolve it
Whether antivirus or similar software is in use (if it seems relevant)
Error messages displayed
Error logs

Development Policy

VOICEVOX is a large piece of software that is still under active development.
AivisSpeech Engine is therefore developed under the following policy so that it can keep following the latest version of VOICEVOX ENGINE:

To make it easier to follow the latest version of VOICEVOX, keep changes to the minimum necessary.
- Rebrand from VOICEVOX ENGINE to AivisSpeech Engine only where necessary
- Renaming the voicevox_engine directory would produce a huge diff in import statements, so it is not rebranded.
Do not refactor
- Conflicts with VOICEVOX ENGINE are easy to foresee, and we are not familiar with the entire codebase.
Even if a function is not used in AivisSpeech (such as the singing voice synthesis function), the code is not deleted.
- This also helps to avoid conflicts
- Unused code is disabled by commenting it out rather than deleting it
  - To minimize the diff from VOICEVOX ENGINE, use """ """ rather than # when a large block must be commented out
- However, this does not apply to configuration files and build tooling such as the Dockerfile and GitHub Actions
  - These parts were already heavily modified for AivisSpeech Engine, and commenting out would leave the code very cluttered
Documentation is not updated, because maintaining it and keeping it in sync is difficult.
- As a result, none of the documents have been updated, and they do not reflect the changes made in AivisSpeech Engine
No test code is added, because maintaining tests alongside the modifications for AivisSpeech Engine is difficult.
- Only the existing test code is maintained passively, by fixing or commenting out parts so that the tests pass
  - Because of the modifications in AivisSpeech Engine, test snapshots differ from those of VOICEVOX ENGINE
  - Tests that stopped working because of the modifications in AivisSpeech Engine are not fixed; they are commented out instead
- Given the maintenance cost, no test code is added for parts newly developed for AivisSpeech Engine

Building a development environment

The procedure differs significantly from that of the original VOICEVOX ENGINE.
Python 3.11 must be installed beforehand.

# Install Poetry and pre-commit
pip install poetry poetry-plugin-export pre-commit

# Enable pre-commit
pre-commit install

# Install all dependencies
poetry install

Development

The procedure differs significantly from that of the original VOICEVOX ENGINE.

# Start AivisSpeech Engine in the development environment
poetry run task serve

# Show the AivisSpeech Engine help
poetry run task serve --help

# Automatically fix code formatting
poetry run task format

# Check code formatting
poetry run task lint

# Check for typos with typos
poetry run task typos

# Run the tests
poetry run task test

# Update the test snapshots
poetry run task update-snapshots

# Update the license information
poetry run task update-licenses

# Build AivisSpeech Engine
poetry run task build

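License

Of the dual license of VOICEVOX ENGINE, on which this project is based, only LGPL-3.0 is inherited.

The documentation below, as well as the documents under docs/, is carried over unchanged from the original VOICEVOX ENGINE documentation. There is no guarantee that its content also applies to AivisSpeech Engine.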

VOICEVOX ENGINE

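This is the engine of VOICEVOX.
It is actually an HTTP server, so you can perform text-to-speech synthesis by sending requests to it.

(The editor is VOICEVOX, the core is VOICEVOX CORE, and the overall architecture is described in detail here.)

Guides for each purpose are listed below.

User Guide: for those who want to synthesize speech
Contributor Guide: for those who want to contribute
Developer Guide: for those who want to use the code

User Guide

Download

Download the engine for your environment from here.

API documentation

Please refer to the API documentation.

With the VOICEVOX engine or editor running, you can also view the documentation of the running engine at http://127.0.0.1:50021/docs.
For future plans and related information, the document on integration with the VOICEVOX speech synthesis engine may also be helpful.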

Docker image

CPU

docker pull voicevox/voicevox_engine:cpu-latest
docker run --rm -p '127.0.0.1:50021:50021' voicevox/voicevox_engine:cpu-latest

GPU

docker pull voicevox/voicevox_engine:nvidia-latest
docker run --rm --gpus all -p '127.0.0.1:50021:50021' voicevox/voicevox_engine:nvidia-latest

Troubleshooting

When using the GPU version, errors may occur depending on the environment. In that case, adding --runtime=nvidia to docker run may resolve the problem.

Sample code for synthesizing speech via HTTP requests

echo -n "こんにちは、音声合成の世界へようこそ" > text.txt

curl -s \
    -X POST \
    "127.0.0.1:50021/audio_query?speaker=1" \
    --get --data-urlencode text@text.txt \
    > query.json

curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "127.0.0.1:50021/synthesis?speaker=1" \
    > audio.wav

The generated audio has a somewhat unusual sampling rate of 24000 Hz, so some audio players may be unable to play it.

The value specified for speaker is the style_id obtained from the /speakers endpoint. It is named speaker for compatibility.

Sample code for adjusting the voice

You can adjust the voice by editing the parameters of the speech synthesis query obtained from /audio_query.

For example, let's make the speaking speed 1.5 times faster.

echo -n "こんにちは、音声合成の世界へようこそ" > text.txt

curl -s \
    -X POST \
    "127.0.0.1:50021/audio_query?speaker=1" \
    --get --data-urlencode text@text.txt \
    > query.json

# Use sed to change the value of speedScale to 1.5
sed -i -r 's/"speedScale":[0-9.]+/"speedScale":1.5/' query.json

curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "127.0.0.1:50021/synthesis?speaker=1" \
    > audio_fast.wav
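The same adjustment can be made by editing the query as JSON instead of with a regular expression, which avoids depending on the exact text layout of query.json. The helper name below is hypothetical; the sketch only assumes that the query is a JSON object with a speedScale key, as in the sed example.

```python
import json


def with_speed_scale(audio_query: dict, speed_scale: float) -> dict:
    """Return a copy of the query with speedScale replaced."""
    updated = dict(audio_query)  # shallow copy so the original stays untouched
    updated["speedScale"] = speed_scale
    return updated


# Usage with the files from the shell example:
#   with open("query.json", encoding="utf-8") as f:
#       query = json.load(f)
#   with open("query.json", "w", encoding="utf-8") as f:
#       json.dump(with_speed_scale(query, 1.5), f, ensure_ascii=False)

example = {"speedScale": 1.0, "pitchScale": 0.0}
print(json.dumps(with_speed_scale(example, 1.5)))
```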

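Getting and correcting readings in AquesTalk-like notation

AquesTalk-like notation

"AquesTalk-like notation" specifies readings using only katakana and symbols. It differs in part from the original AquesTalk notation.
AquesTalk-like notation follows these rules:

All kana are written in katakana
Accent phrases are separated by / or 、. A silent interval is inserted only when 、 is used.
Placing _ before a kana devoices that kana
The accent position is specified with '. Every accent phrase must specify exactly one accent position.
Placing ？ (full-width) at the end of an accent phrase produces interrogative pronunciation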
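Before sending hand-written notation to the engine, a client could run a quick sanity check on the separator and accent rules. The sketch below is a hypothetical helper, not part of VOICEVOX ENGINE; it only checks that every accent phrase contains exactly one ' and does not validate the katakana or devoicing rules.

```python
import re


def split_accent_phrases(kana: str) -> list[str]:
    """Split AquesTalk-like notation into accent phrases at "/" or "、"."""
    return [phrase for phrase in re.split(r"[/、]", kana) if phrase]


def phrases_with_invalid_accent(kana: str) -> list[str]:
    """Return the accent phrases that do not contain exactly one accent mark (')."""
    return [p for p in split_accent_phrases(kana) if p.count("'") != 1]


notation = "ディイプラ'アニングワ/バンノ'オヤクデワ/アリマセ'ン"
print(split_accent_phrases(notation))
print(phrases_with_invalid_accent(notation))
```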

Sample code for AquesTalk-like notation

The response of /audio_query contains the reading determined by the engine, written in AquesTalk-like notation.
By correcting it, you can control the reading and accent of the speech.

# Write the text to be read to text.txt in UTF-8
echo -n "ディープラーニングは万能薬ではありません" > text.txt

curl -s \
    -X POST \
    "127.0.0.1:50021/audio_query?speaker=1" \
    --get --data-urlencode text@text.txt \
    > query.json

cat query.json | grep -o -E "\"kana\":\".*\""
# Result... "kana":"ディ'イプ/ラ'アニングワ/バンノオヤクデワアリマセ'ン"

# We want it read as "ディイプラ'アニングワ/バンノ'オヤクデワ/アリマセ'ン", so
# get the intonation with is_kana=true and save it to newphrases.json
echo -n "ディイプラ'アニングワ/バンノ'オヤクデワ/アリマセ'ン" > kana.txt
curl -s \
    -X POST \
    "127.0.0.1:50021/accent_phrases?speaker=1&is_kana=true" \
    --get --data-urlencode text@kana.txt \
    > newphrases.json

# Replace the contents of "accent_phrases" in query.json with the contents of newphrases.json
cat query.json | sed -e "s/\[{.*}\]/$(cat newphrases.json)/g" > newquery.json

curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @newquery.json \
    "127.0.0.1:50021/synthesis?speaker=1" \
    > audio.wav

User dictionary

You can view the user dictionary and add, edit, and delete words through the API.

Viewing the dictionary

Send a GET request to /user_dict to retrieve the list of user dictionary entries.

curl -s -X GET "127.0.0.1:50021/user_dict"

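Adding a word

Send a POST request to /user_dict_word to add a word to the user dictionary.
The following URL parameters are required:

surface (the word to register in the dictionary)
pronunciation (the reading in katakana)
accent_type (the accent nucleus position, an integer)

For the accent nucleus position, the following document may be helpful.
The number in each "type N" (〇型) label is the accent nucleus position.
https://tdmelodic.readthedocs.io/ja/latest/pages/introduction.html

On success, the return value is the UUID string assigned to the word.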

surface="test"
pronunciation="テスト"
accent_type="1"

curl -s -X POST "127.0.0.1:50021/user_dict_word" \
    --get \
    --data-urlencode "surface=$surface" \
    --data-urlencode "pronunciation=$pronunciation" \
    --data-urlencode "accent_type=$accent_type"
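
Because the parameters travel in the URL, non-ASCII values such as katakana must be percent-encoded, which is what --data-urlencode does above. The following Python sketch shows the same encoding with the standard library. The helper name user_dict_word_url is hypothetical, and BASE_URL simply reuses the address from the curl examples.

```python
from urllib.parse import urlencode

BASE_URL = "http://127.0.0.1:50021"  # address used in the curl examples


def user_dict_word_url(surface: str, pronunciation: str, accent_type: int) -> str:
    """Build the URL for POST /user_dict_word with percent-encoded parameters."""
    params = urlencode(
        {"surface": surface, "pronunciation": pronunciation, "accent_type": accent_type}
    )
    return f"{BASE_URL}/user_dict_word?{params}"


url = user_dict_word_url("test", "テスト", 1)
print(url)
```

The URL can then be sent with any HTTP client using the POST method, for example urllib.request.Request(url, method="POST").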

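Editing a word

Send a PUT request to /user_dict_word/{word_uuid} to edit a word in the user dictionary.
The following URL parameters are required:

surface (the word to register in the dictionary)
pronunciation (the reading in katakana)
accent_type (the accent nucleus position, an integer)

The word_uuid is returned when the word is added, and can also be found by viewing the user dictionary.
On success, the response is 204 No Content.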

surface="test2"
pronunciation="テストツー"
accent_type="2"
# Replace word_uuid with the value for your environment
word_uuid="cce59b5f-86ab-42b9-bb75-9fd3407f1e2d"

curl -s -X PUT "127.0.0.1:50021/user_dict_word/$word_uuid" \
    --get \
    --data-urlencode "surface=$surface" \
    --data-urlencode "pronunciation=$pronunciation" \
    --data-urlencode "accent_type=$accent_type"

Deleting a word

Send a DELETE request to /user_dict_word/{word_uuid} to delete a word from the user dictionary.

The word_uuid is returned when the word is added, and can also be found by viewing the user dictionary.
On success, the response is 204 No Content.

# Replace word_uuid with the value for your environment
word_uuid="cce59b5f-86ab-42b9-bb75-9fd3407f1e2d"

curl -s -X DELETE "127.0.0.1:50021/user_dict_word/$word_uuid"

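Importing and exporting the dictionary

You can import and export the user dictionary in the "Export & import user dictionary" section of the engine's settings page.

You can also import and export the user dictionary through the API.
Use POST /import_user_dict to import and GET /user_dict to export.
See the API documentation for arguments and other details.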

Presets

By editing presets.yaml in the user directory, you can use presets for settings such as the character and speech rate.

echo -n "プリセットをうまく活用すれば、サードパーティ間で同じ設定を使うことができます" > text.txt

# Get the preset information
curl -s -X GET "127.0.0.1:50021/presets" > presets.json

# Extract the ids with sed (this works when presets.json holds a single preset on one line)
preset_id=$(cat presets.json | sed -r 's/^.+"id":\s?([0-9]+?).+$/\1/g')
style_id=$(cat presets.json | sed -r 's/^.+"style_id":\s?([0-9]+?).+$/\1/g')

# Get the query for speech synthesis
curl -s \
    -X POST \
    "127.0.0.1:50021/audio_query_from_preset?preset_id=$preset_id" \
    --get --data-urlencode text@text.txt \
    > query.json

# Synthesize speech
curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "127.0.0.1:50021/synthesis?speaker=$style_id" \
    > audio.wav

speaker_uuid can be checked with /speakers.
id values must be unique.
Changes made to the file after the engine has started are picked up by the engine.
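
Extracting ids with sed becomes fragile once several presets exist. A hedged Python sketch of the same lookup follows. It assumes the /presets response is a JSON array of objects that each carry id and style_id, as the sed expressions above imply. The helper name first_preset_ids and the sample data are illustrative only.

```python
import json


def first_preset_ids(presets_json: str) -> tuple[int, int]:
    """Return (preset id, style_id) of the first preset in a /presets response."""
    presets = json.loads(presets_json)
    if not presets:
        raise ValueError("no presets defined")
    first = presets[0]
    return first["id"], first["style_id"]


# Trimmed-down stand-in for presets.json (real presets carry more fields)
sample = '[{"id": 1, "name": "sample", "style_id": 3}]'
preset_id, style_id = first_preset_ids(sample)
print(preset_id, style_id)
```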

Sample code for morphing between two styles

/synthesis_morphing generates morphed audio based on the audio synthesized separately in each of two styles.

echo -n "モーフィングを利用することで、２種類の声を混ぜることができます。" > text.txt

curl -s \
    -X POST \
    "127.0.0.1:50021/audio_query?speaker=8" \
    --get --data-urlencode text@text.txt \
    > query.json

# Synthesis result in the original style
curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "127.0.0.1:50021/synthesis?speaker=8" \
    > audio.wav

export MORPH_RATE=0.5

# Note: this takes time, because it synthesizes audio for two styles and then analyzes it with WORLD
curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "127.0.0.1:50021/synthesis_morphing?base_speaker=8&target_speaker=10&morph_rate=$MORPH_RATE" \
    > audio.wav

export MORPH_RATE=0.9

# When query, base_speaker, and target_speaker are unchanged, a cache is used, so generation is comparatively fast
curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "127.0.0.1:50021/synthesis_morphing?base_speaker=8&target_speaker=10&morph_rate=$MORPH_RATE" \
    > audio.wav
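
When trying several morph rates, it can help to build the request URL programmatically. The Python sketch below does only that. The helper name morphing_url is hypothetical, and the 0.0 to 1.0 range check is an assumption based on the rates used above, so confirm the accepted range in the API documentation.

```python
from urllib.parse import urlencode


def morphing_url(
    base_speaker: int,
    target_speaker: int,
    morph_rate: float,
    host: str = "127.0.0.1:50021",
) -> str:
    """Build the /synthesis_morphing URL for a pair of style ids."""
    # Assumed valid range; see the lead-in
    if not 0.0 <= morph_rate <= 1.0:
        raise ValueError("morph_rate is expected to be between 0.0 and 1.0")
    query = urlencode(
        {
            "base_speaker": base_speaker,
            "target_speaker": target_speaker,
            "morph_rate": morph_rate,
        }
    )
    return f"http://{host}/synthesis_morphing?{query}"


urls = [morphing_url(8, 10, rate) for rate in (0.5, 0.9)]
for u in urls:
    print(u)
```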

Sample code for getting additional character information

This code retrieves portrait.png from the additional information.
(It uses jq to parse the JSON.)

curl -s -X GET "127.0.0.1:50021/speaker_info?speaker_uuid=7ffcb7ce-00ec-4bdc-82cd-45a8889e43ff" \
    | jq -r ".portrait" \
    | base64 -d \
    > portrait.png
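
The jq and base64 steps can be sketched in Python as well. This assumes, as the pipeline above does, that the portrait field holds Base64-encoded image data. The helper name decode_portrait is hypothetical, and the sample value is just the 8-byte PNG signature, not a real portrait.

```python
import base64
import json


def decode_portrait(speaker_info_json: str) -> bytes:
    """Extract and decode the Base64 portrait from a /speaker_info response."""
    info = json.loads(speaker_info_json)
    return base64.b64decode(info["portrait"])


# Stand-in response: "iVBORw0KGgo=" is the Base64 form of the PNG file signature
sample = '{"portrait": "iVBORw0KGgo="}'
png_bytes = decode_portrait(sample)
print(png_bytes)
```

Writing png_bytes to a file opened in binary mode ("wb") gives the same result as the redirect to portrait.png.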

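Cancellable speech synthesis

/cancellable_synthesis releases compute resources immediately when the connection is closed.
(With /synthesis, the synthesis computation runs to completion even if the connection is closed.)
This API is experimental and is enabled only when --enable_cancellable_synthesis is passed as an argument at engine startup.
The parameters required for synthesis are the same as for /synthesis.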

Sample code for singing voice synthesis over HTTP

echo -n '{
  "notes": [
    { "key": null, "frame_length": 15, "lyric": "" },
    { "key": 60, "frame_length": 45, "lyric": "ド" },
    { "key": 62, "frame_length": 45, "lyric": "レ" },
    { "key": 64, "frame_length": 45, "lyric": "ミ" },
    { "key": null, "frame_length": 15, "lyric": "" }
  ]
}' > score.json

curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @score.json \
    "127.0.0.1:50021/sing_frame_audio_query?speaker=6000" \
    > query.json

curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "127.0.0.1:50021/frame_synthesis?speaker=3001" \
    > audio.wav

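key in the score is a MIDI note number.
lyric is the lyric text. Any string can be given, but some engines may return an error for anything other than a single mora in hiragana or katakana.
The frame rate defaults to 93.75 Hz and can be read from frame_rate in the engine manifest.
The first note must be silent.

The speaker accepted by /sing_frame_audio_query is the style_id of a style returned by /singers whose type is sing or singing_teacher.
The speaker accepted by /frame_synthesis is the style_id of a style returned by /singers whose type is frame_decode.
The argument is named speaker for consistency with the other APIs.

You can also specify different styles for /sing_frame_audio_query and /frame_synthesis.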
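
Since note lengths are given in frames, it is useful to see how they map to seconds at the default frame rate. The Python sketch below builds the same score as the example above and computes its duration. The helper names build_score and score_seconds are hypothetical, and the closing rest simply mirrors the sample score.

```python
import json

FRAME_RATE_HZ = 93.75  # default stated above; the engine manifest's frame_rate is authoritative


def build_score(melody: list[tuple[int, str, int]], rest_frames: int = 15) -> dict:
    """Build a score that starts and ends with a silent note.

    melody: (MIDI key, one-mora lyric, frame_length) for each sung note.
    """
    rest = {"key": None, "frame_length": rest_frames, "lyric": ""}
    notes = [dict(rest)]  # the first note must be silent
    for key, lyric, frames in melody:
        notes.append({"key": key, "frame_length": frames, "lyric": lyric})
    notes.append(dict(rest))
    return {"notes": notes}


def score_seconds(score: dict, frame_rate: float = FRAME_RATE_HZ) -> float:
    """Total length of the score in seconds."""
    total_frames = sum(note["frame_length"] for note in score["notes"])
    return total_frames / frame_rate


score = build_score([(60, "ド", 45), (62, "レ", 45), (64, "ミ", 45)])
seconds = score_seconds(score)  # 165 frames at 93.75 Hz
print(json.dumps(score, ensure_ascii=False))
print(seconds)
```

json.dumps writes None as null, so the output matches the score.json format shown earlier.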

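CORS settings

For security, VOICEVOX accepts requests only from the origins localhost, 127.0.0.1, and app://, or from requests with no Origin. As a result, some third-party applications may be unable to receive responses.
As a workaround, the engine provides a settings UI.

How to configure

Open http://127.0.0.1:50021/setting .
Change or add settings to suit the application you use.
Press the save button to confirm the changes.
The engine must be restarted for the settings to take effect. Restart it as needed.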

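Disabling APIs that modify data

By specifying the runtime argument --disable_mutable_api or the environment variable VV_DISABLE_MUTABLE_API=1, you can disable the APIs that modify engine settings, the dictionary, and other data.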
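
The interaction between the flag and the environment variable can be pictured with a small Python sketch. It is an illustration of the documented behavior, not the engine's actual code: the function name mutable_api_disabled is hypothetical, and treating any value other than "1" as "not disabled" is an assumption.

```python
from typing import Mapping, Optional
import os


def mutable_api_disabled(
    cli_flag: bool, environ: Optional[Mapping[str, str]] = None
) -> bool:
    """Decide whether data-modifying APIs should be disabled."""
    env = os.environ if environ is None else environ
    if cli_flag:
        # --disable_mutable_api was given on the command line
        return True
    # Otherwise fall back to the environment variable; only "1" disables
    return env.get("VV_DISABLE_MUTABLE_API", "") == "1"


print(mutable_api_disabled(False, {"VV_DISABLE_MUTABLE_API": "1"}))
```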

Character encoding

All requests and responses use UTF-8.

Other arguments

Arguments can be specified when starting the engine. For details, check the help with the -h argument.

$ python run.py -h

usage: run.py [-h] [--host HOST] [--port PORT] [--use_gpu] [--voicevox_dir VOICEVOX_DIR] [--voicelib_dir VOICELIB_DIR] [--runtime_dir RUNTIME_DIR] [--enable_mock] [--enable_cancellable_synthesis]
              [--init_processes INIT_PROCESSES] [--load_all_models] [--cpu_num_threads CPU_NUM_THREADS] [--output_log_utf8] [--cors_policy_mode {CorsPolicyMode.all,CorsPolicyMode.localapps}]
              [--allow_origin [ALLOW_ORIGIN ...]] [--setting_file SETTING_FILE] [--preset_file PRESET_FILE] [--disable_mutable_api]

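The VOICEVOX engine.

options:
  -h, --help            show this help message and exit
  --host HOST           Host address on which to accept connections.
  --port PORT           Port number on which to accept connections.
  --use_gpu             Use the GPU for speech synthesis.
  --voicevox_dir VOICEVOX_DIR
                        Path to the VOICEVOX directory.
  --voicelib_dir VOICELIB_DIR
                        Path to the VOICEVOX CORE directory.
  --runtime_dir RUNTIME_DIR
                        Path to the directory of libraries used by VOICEVOX CORE.
  --enable_mock         Synthesize speech with a mock instead of VOICEVOX CORE.
  --enable_cancellable_synthesis
                        Allow speech synthesis to be cancelled partway through.
  --init_processes INIT_PROCESSES
                        Number of processes created when the cancellable_synthesis feature is initialized.
  --load_all_models     Load all speech synthesis models at startup.
  --cpu_num_threads CPU_NUM_THREADS
                        Number of threads used for speech synthesis. If not specified, the value of the environment variable VV_CPU_NUM_THREADS is used instead. If VV_CPU_NUM_THREADS is neither an empty string nor a number, the engine exits with an error.
  --output_log_utf8     Output logs in UTF-8. If not specified, the value of the environment variable VV_OUTPUT_LOG_UTF8 is used instead. If VV_OUTPUT_LOG_UTF8 is 1, UTF-8 is used; if it is 0, empty, or unset, the encoding is determined automatically by the environment.
  --cors_policy_mode {CorsPolicyMode.all,CorsPolicyMode.localapps}
                        CORS permission mode. Either all or localapps can be specified. all permits everything. localapps restricts the cross-origin resource sharing policy to app://. and localhost-related origins. Other origins can be added with the allow_origin option. The default is localapps. This option takes precedence over the settings file specified by --setting_file.
  --allow_origin [ALLOW_ORIGIN ...]
                        Origins to allow. Multiple origins can be given, separated by spaces. This option takes precedence over the settings file specified by --setting_file.
  --setting_file SETTING_FILE
                        Specifies the settings file.
  --preset_file PRESET_FILE
                        Specifies the preset file. If not specified, the environment variable VV_PRESET_FILE and then presets.yaml in the user directory are searched, in that order.
  --disable_mutable_api
                        Disable APIs that modify the engine's static data, such as dictionary registration and settings changes. If not specified, the value of the environment variable VV_DISABLE_MUTABLE_API is used instead. If VV_DISABLE_MUTABLE_API is 1, the APIs are disabled; if it is 0, empty, or unset, it is ignored.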

Updating

Delete all files in the engine directory and replace them with the new ones.

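Contributor guide

VOICEVOX ENGINE welcomes your contributions!
See CONTRIBUTING.md for details.
Development discussion and casual chat also take place on the unofficial VOICEVOX Discord server. Please feel free to join.

When creating a pull request that resolves an issue, we recommend either saying on the issue that you have started working on it or opening a draft pull request first, to avoid working on the same issue as someone else.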

Developer guide

Setting up the environment

Development uses Python 3.11.9. Installation requires a C/C++ compiler for your OS and CMake.

# Install the runtime environment
python -m pip install -r requirements.txt

# Install the development, test, and build environments
python -m pip install -r requirements-dev.txt -r requirements-build.txt

Running

Check the details of the command-line arguments with the following command.

python run.py --help

# Start the server with the product version of VOICEVOX
VOICEVOX_DIR="C:/path/to/voicevox" # Path to the product-version VOICEVOX directory
python run.py --voicevox_dir=$VOICEVOX_DIR

# Start the server with the mock
python run.py --enable_mock

# Switch log output to UTF-8
python run.py --output_log_utf8
# Or: VV_OUTPUT_LOG_UTF8=1 python run.py

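Specifying the number of CPU threads

If the number of CPU threads is not specified, half the number of logical cores is used. (On most CPUs, this is half of the total processing capacity.)
If you want to adjust how much processing capacity the engine uses,
for example when running on IaaS or on a dedicated server, you can do so by specifying the number of CPU threads.

Specifying with a runtime argument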

python run.py --voicevox_dir=$VOICEVOX_DIR --cpu_num_threads=4

Specifying with an environment variable

export VV_CPU_NUM_THREADS=4
python run.py --voicevox_dir=$VOICEVOX_DIR
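
The precedence described in this section (runtime argument, then environment variable, then half the logical cores) can be sketched as follows. This is an illustration, not the engine's code: the function name resolve_cpu_num_threads is hypothetical, and rounding down with a minimum of one thread is an assumption.

```python
from typing import Mapping, Optional


def resolve_cpu_num_threads(
    cli_value: Optional[int], environ: Mapping[str, str], logical_cores: int
) -> int:
    """Pick the thread count from the CLI value, the environment, or a default."""
    if cli_value is not None:
        return cli_value  # --cpu_num_threads takes priority
    raw = environ.get("VV_CPU_NUM_THREADS", "")
    if raw == "":
        # Unspecified: half the logical cores (the rounding here is assumed)
        return max(1, logical_cores // 2)
    if not raw.isdigit():
        # The help text says a non-empty, non-numeric value is an error
        raise ValueError(f"VV_CPU_NUM_THREADS must be a number, got {raw!r}")
    return int(raw)


print(resolve_cpu_num_threads(None, {"VV_CPU_NUM_THREADS": "4"}, 16))
```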

Using a past version of the core

Cores from VOICEVOX Core 0.5.4 onward can be used.
The libtorch version of the core is not supported on Mac.

Specifying a past binary

If you pass the directory of the product version of VOICEVOX or of a compiled engine to the --voicevox_dir argument, the core of that version is used.

python run.py --voicevox_dir="/path/to/voicevox"

On Mac, DYLD_LIBRARY_PATH must also be specified.

DYLD_LIBRARY_PATH="/path/to/voicevox" python run.py --voicevox_dir="/path/to/voicevox"

Specifying the voice library directly

Pass the directory where the VOICEVOX Core zip file was extracted to the --voicelib_dir argument.
Also pass the directory of libtorch or onnxruntime (shared libraries), matching the core version, to the --runtime_dir argument.
However, if libtorch or onnxruntime is on the system search path, the --runtime_dir argument is not needed.
The --voicelib_dir and --runtime_dir arguments can each be used multiple times.
To select a core version at an API endpoint, specify the core_version argument. (If omitted, the latest core is used.)

python run.py --voicelib_dir="/path/to/voicevox_core" --runtime_dir="/path/to/libtorch_or_onnx"

On Mac, DYLD_LIBRARY_PATH must be specified instead of the --runtime_dir argument.

DYLD_LIBRARY_PATH="/path/to/onnx" python run.py --voicelib_dir="/path/to/voicevox_core"

Placing libraries in the user directory

Voice libraries in the following directories are loaded automatically.

Built version: <user_data_dir>/voicevox-engine/core_libraries/
Python version: <user_data_dir>/voicevox-engine-dev/core_libraries/

<user_data_dir> depends on the OS.

Windows: C:\Users\<username>\AppData\Local\
macOS: /Users/<username>/Library/Application Support/
Linux: /home/<username>/.local/share/
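
The directory layout above can be expressed as a small Python helper, for example to check where a library should be copied. This is a sketch under the paths listed in this section. The function name core_libraries_dir is hypothetical, and the system names follow the values returned by platform.system().

```python
from pathlib import PurePath, PurePosixPath, PureWindowsPath


def core_libraries_dir(system: str, username: str, built: bool = True) -> PurePath:
    """Return the auto-loaded core_libraries directory for a given OS and user."""
    app_dir = "voicevox-engine" if built else "voicevox-engine-dev"
    if system == "Windows":
        base = PureWindowsPath("C:/Users") / username / "AppData" / "Local"
    elif system == "Darwin":  # macOS
        base = PurePosixPath("/Users") / username / "Library" / "Application Support"
    elif system == "Linux":
        base = PurePosixPath("/home") / username / ".local" / "share"
    else:
        raise ValueError(f"unsupported system: {system}")
    return base / app_dir / "core_libraries"


linux_dev = core_libraries_dir("Linux", "alice", built=False)
windows_built = core_libraries_dir("Windows", "alice")
print(linux_dev)
print(windows_built)
```

Pure paths are used so the example behaves the same on any OS. The username alice is a placeholder.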

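Building

You can build locally by packaging with pyinstaller and by containerizing with the Dockerfile.
See Contributor guide#Build for the detailed steps.

If you use GitHub, you can build with GitHub Actions in a forked repository.
Turn Actions on and start build-engine-package.yml with workflow_dispatch to build. The artifacts are uploaded to Release. For the GitHub Actions settings required for the build, see Contributor guide#GitHub Actions.

Tests and static analysis

You can run tests with pytest and static analysis with various linters.
See Contributor guide#Tests and Contributor guide#Static analysis for the detailed steps.

Dependencies

Dependencies are managed with poetry. There are also license restrictions on which dependency libraries can be introduced.
See Contributor guide#Packages for details.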

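About the multi-engine feature

The VOICEVOX editor can run multiple engines at the same time. With this feature, your own speech synthesis engine or an existing one can run inside the VOICEVOX editor.

How the multi-engine feature works

The multi-engine feature works by launching the Web APIs of multiple VOICEVOX API-compliant engines on separate ports and handling them uniformly. The editor launches each engine through its executable binary and manages its settings and state individually, tied to its EngineID.

How to support the multi-engine feature

You can support it by creating an executable binary that launches a VOICEVOX API-compliant engine. The simplest approach is to fork the VOICEVOX ENGINE repository and modify some of its features.

Three things need to be modified: engine information, character information, and speech synthesis.

Engine information is managed in the manifest file (engine_manifest.json) directly under the root. A manifest file in this format is required for VOICEVOX API-compliant engines. Review the information in the manifest file and change it as appropriate. Depending on the synthesis method, the engine may be unable to provide the same features as VOICEVOX, such as morphing. In that case, change the information in supported_features in the manifest file as appropriate.

Character information is managed in files in the resources/character_info directory. Dummy icons and similar files are provided, so change them as appropriate.

Speech synthesis is performed in voicevox_engine/tts_pipeline/tts_engine.py. In the VOICEVOX API, synthesis works as follows: the engine creates initial values for the synthesis query AudioQuery and returns them to the user, the user edits the query as needed, and the engine then synthesizes speech according to the query. Query creation is handled by the /audio_query endpoint and synthesis by the /synthesis endpoint. Supporting at least these two makes an engine compliant with the VOICEVOX API.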

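How to distribute a multi-engine-compatible engine

We recommend distributing it as a VVPP file. VVPP stands for "VOICEVOX plugin package"; it is a Zip file of a directory containing the built engine and related files. If the extension is .vvpp, it can be installed into the VOICEVOX editor by double-clicking.

After unzipping the received VVPP file to the local disk, the editor searches for files according to engine_manifest.json directly under the root. If the VOICEVOX editor fails to load the package, refer to the editor's error log.

xxx.vvpp can also be split and distributed as sequentially numbered files such as xxx.0.vvppp. This is useful when the file is too large to distribute easily. The vvpp and vvppp files required for installation are listed in the vvpp.txt file.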

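Examples

voicevox-client @voicevox-client: API wrappers for VOICEVOX ENGINE in various languages

License

Dual-licensed under LGPL v3 and a separate license that does not require publishing source code. If you would like to obtain the separate license, please ask Hiho.
X account: @hiho_karuta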
