Large-scale pre-trained language models have been shown to improve the naturalness of text-to-speech (TTS) models by enabling them to produce more natural prosodic patterns. However, these models are usually word-level or sup-phoneme-level and are jointly trained with phonemes, which makes them inefficient for downstream TTS tasks where only phonemes are needed. In this work, we propose a phoneme-level BERT (PL-BERT) with a pretext task of predicting the corresponding graphemes in addition to the regular masked phoneme predictions. Subjective evaluations show that our phoneme-level BERT encoder significantly improves the mean opinion score (MOS) of the naturalness of synthesized speech compared with the state-of-the-art (SOTA) StyleTTS baseline on out-of-distribution (OOD) texts.
Paper: https://arxiv.org/abs/2301.08810
Audio samples: https://pl-bert.github.io/
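For intuition about the pretext task summarized above, here is a toy sketch of masked phoneme prediction combined with grapheme prediction on top of a shared ALBERT encoder. The sizes, head names, and loss details are illustrative assumptions only and do not reproduce the repo's actual training code (which, among other things, scores only the masked positions).

```python
# Toy illustration of the PL-BERT pretext task: one shared ALBERT encoder with a
# masked-phoneme prediction head and a grapheme prediction head. All sizes are made up.
import torch
import torch.nn as nn
from transformers import AlbertConfig, AlbertModel

encoder = AlbertModel(AlbertConfig(vocab_size=178, hidden_size=256,
                                   num_attention_heads=4, intermediate_size=512,
                                   num_hidden_layers=3))
phoneme_head = nn.Linear(256, 178)    # predicts the (masked) phoneme token at each position
grapheme_head = nn.Linear(256, 1000)  # predicts the grapheme token aligned to each phoneme

phonemes = torch.randint(0, 178, (2, 50))    # (batch, phoneme sequence), some positions masked
graphemes = torch.randint(0, 1000, (2, 50))  # grapheme labels aligned to the phoneme positions
mask = torch.ones(2, 50, dtype=torch.long)

hidden = encoder(phonemes, attention_mask=mask).last_hidden_state
loss = (nn.functional.cross_entropy(phoneme_head(hidden).transpose(1, 2), phonemes)
        + nn.functional.cross_entropy(grapheme_head(hidden).transpose(1, 2), graphemes))
```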
```bash
git clone https://github.com/yl4579/PL-BERT.git
cd PL-BERT
conda create --name BERT python=3.8
conda activate BERT
python -m ipykernel install --user --name BERT --display-name "BERT"
pip install pandas singleton-decorator datasets "transformers<4.33.3" accelerate nltk phonemizer sacremoses pebble
```
Please refer to the notebook preprocess.ipynb for more details. The preprocessing covers only the English Wikipedia dataset. If I have extra time to train on other languages, I will create a new branch for Japanese. You may also refer to #6 for preprocessing in other languages such as Japanese.
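For a rough idea of what the preprocessing involves, here is a minimal sketch that loads English Wikipedia with `datasets` and converts raw text to phonemes with `phonemizer`. The dataset snapshot name, the espeak backend, and the output field are assumptions for illustration; preprocess.ipynb remains the actual reference.

```python
# Minimal sketch of the preprocessing idea only; see preprocess.ipynb for the real pipeline.
from datasets import load_dataset
from phonemizer import phonemize

# Assumption: the "20220301.en" snapshot of the HF "wikipedia" dataset.
wiki = load_dataset("wikipedia", "20220301.en", split="train")

def to_phonemes(example):
    # espeak(-ng) must be installed on the system for the espeak backend.
    example["phonemes"] = phonemize(example["text"], language="en-us",
                                    backend="espeak", strip=True)
    return example

sample = to_phonemes(wiki[0])
print(sample["phonemes"][:100])
```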
Please run every cell in the notebook train.ipynb. If you wish to use a different config file, change the line config_path = "Configs/config.yml" in cell 2. The training code is in a Jupyter notebook mainly because the initial experiments were conducted in Jupyter notebooks, but you can easily turn it into a Python script if you prefer.
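For reference, the config can also be inspected outside the notebook; the minimal sketch below only assumes that Configs/config.yml is valid YAML containing a model_params section (the same section used by the finetuning snippet further down).

```python
# Quick look at the training config used by train.ipynb (assumes the default config path).
import yaml

config_path = "Configs/config.yml"  # change this line in cell 2 to use a different config
with open(config_path) as f:
    config = yaml.safe_load(f)

# model_params is the section passed to AlbertConfig in the finetuning example below.
print(config["model_params"])
```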
Here is an example of how to use it for StyleTTS finetuning. You can apply it to other TTS models by replacing their text encoder with the pre-trained PL-BERT. First, load the pre-trained PL-BERT model:
```python
# Load the pre-trained PL-BERT encoder (an ALBERT model) from a checkpoint directory.
import os
import yaml
import torch
from collections import OrderedDict
from transformers import AlbertConfig, AlbertModel

log_dir = "YOUR PL-BERT CHECKPOINT PATH"
config_path = os.path.join(log_dir, "config.yml")
plbert_config = yaml.safe_load(open(config_path))

albert_base_configuration = AlbertConfig(**plbert_config['model_params'])
bert = AlbertModel(albert_base_configuration)

# Find the latest checkpoint (files named step_<iteration>.t7).
files = os.listdir(log_dir)
ckpts = []
for f in files:
    if f.startswith("step_"):
        ckpts.append(f)

iters = [int(f.split('_')[-1].split('.')[0]) for f in ckpts if os.path.isfile(os.path.join(log_dir, f))]
iters = sorted(iters)[-1]

checkpoint = torch.load(log_dir + "/step_" + str(iters) + ".t7", map_location='cpu')
state_dict = checkpoint['net']

# Strip the DataParallel `module.` prefix and keep only the encoder weights.
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[7:]  # remove `module.`
    if name.startswith('encoder.'):
        name = name[8:]  # remove `encoder.`
        new_state_dict[name] = v
bert.load_state_dict(new_state_dict)
```
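As a quick sanity check (not part of the repo's code), you can forward a dummy batch of phoneme token IDs through the encoder loaded above; the batch size and sequence length below are arbitrary.

```python
# Optional sanity check: forward a dummy phoneme-ID batch through the loaded PL-BERT.
dummy_ids = torch.randint(0, albert_base_configuration.vocab_size, (1, 32))  # (batch, phoneme tokens)
dummy_mask = torch.ones_like(dummy_ids)
with torch.no_grad():
    hidden = bert(dummy_ids, attention_mask=dummy_mask).last_hidden_state
print(hidden.shape)  # (1, 32, hidden_size); this is what bert_encoder projects below
```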
Then add PL-BERT and a linear projection layer to the StyleTTS model:
```python
# Munch, nn, args, and the remaining modules come from the StyleTTS training code.
nets = Munch(bert=bert,
             # linear projection to match the hidden sizes (PL-BERT 768, StyleTTS 512)
             bert_encoder=nn.Linear(plbert_config['model_params']['hidden_size'], args.hidden_dim),
             predictor=predictor,
             decoder=decoder,
             pitch_extractor=pitch_extractor,
             text_encoder=text_encoder,
             style_encoder=style_encoder,
             text_aligner=text_aligner,
             discriminator=discriminator)
```
Lower the learning rate of the PL-BERT parameters in the optimizer:
```python
# for stability
for g in optimizer.optimizers['bert'].param_groups:
    g['betas'] = (0.9, 0.99)
    g['lr'] = 1e-5
    g['initial_lr'] = 1e-5
    g['min_lr'] = 0
    g['weight_decay'] = 0.01
```
In the training code, replace the text encoder features with the projected PL-BERT features:
```python
# PL-BERT phoneme features, projected to the StyleTTS hidden size (channel-first for the predictor).
bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state
d_en = model.bert_encoder(bert_dur).transpose(-1, -2)

d, _ = model.predictor(d_en, s,
                       input_lengths,
                       s2s_attn_mono,
                       m)
```
Line 257:
```python
_, p = model.predictor(d_en, s,
                       input_lengths,
                       s2s_attn_mono,
                       m)
```
and line 415:
```python
bert_dur = model.bert(texts, attention_mask=(~text_mask).int()).last_hidden_state
d_en = model.bert_encoder(bert_dur).transpose(-1, -2)
d, p = model.predictor(d_en, s,
                       input_lengths,
                       s2s_attn_mono,
                       m)
```
Finally, step the optimizers of both PL-BERT and the projection layer so that their parameters are updated:
```python
optimizer.step('bert_encoder')
optimizer.step('bert')
```
The PL-BERT model pre-trained on Wikipedia for 1M steps can be downloaded here: PL-BERT link.
The demo on the LJSpeech dataset, along with the pre-modified StyleTTS repo and pre-trained models, can be downloaded here: StyleTTS link. The zip file contains the code modifications above, the pre-trained PL-BERT model listed above, pre-trained StyleTTS with PL-BERT, pre-trained StyleTTS without PL-BERT, and the pre-trained HifiGAN on LJSpeech from the StyleTTS repo.