A list of papers and projects on cutting-edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related work of interest (such as Music Synthesis, Automatic Music Transcription, Automatic MOS Prediction, SSL-based ASR, etc.).
PRs are welcome; you can also reach me by email ([email protected]) to add or update papers and projects.
IEEE/ACM TASLP, IEEE JSTSP, JSLHR, IEEE TPAMI
NeurIPS, ICLR, ICML, IJCAI, AAAI, ACL, NAACL, EMNLP, ISMIR, ACM MM, ICASSP, INTERSPEECH, ICME
ASRU, SLT
[2022]
Learn2Sing 2.0: Diffusion and Mutual Information-Based Target Speaker SVS by Learning from Singing Teacher | INTERSPEECH 2022 | ✔️Code | Demo
A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion | INTERSPEECH 2022 | Demo
Improving Adversarial Waveform Generation based Singing Voice Conversion with Harmonic Signals | ICASSP 2022 | Demo
[2021]
DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion | ASRU 2021 | Demo
Controllable and Interpretable Singing Voice Decomposition via Assem-VC | NeurIPS 2021 Workshop | Demo
Towards High-fidelity Singing Voice Conversion with Acoustic Reference and Contrastive Predictive Coding | 2021/10 | Demo
FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation | ICME 2021 | Demo
Unsupervised WaveNet-based Singing Voice Conversion Using Pitch Augmentation and Two-phase Approach | 2021/07 | ✔️Code | Demo
[2020]
Zero-shot Singing Voice Conversion | ISMIR 2020 | Demo
Phonetic Posteriorgrams based Many-to-Many Singing Voice Conversion via Adversarial Training | 2020/12 | Demo | Unofficial Code
DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System | INTERSPEECH 2020 | Demo
Unsupervised Cross-Domain Singing Voice Conversion | INTERSPEECH 2020 | Demo
PitchNet: Unsupervised Singing Voice Conversion with Pitch Adversarial Network | ICASSP 2020 | Demo
VAW-GAN for Singing Voice Conversion with Non-parallel Training Data | APSIPA 2020 | ✔️Code | Demo
M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | NeurIPS 2022 | ?Apply&Download | Demo
NUS-48E Sung and Spoken Lyrics Corpus | ?Apply&Download
NHSS: A Speech and Singing Parallel Database | ?Apply&Download
[2022]
[2021]
Investigating Time-Frequency Representations for Audio Feature Extraction in Singing Technique Classification | APSIPA 2021
Zero-shot Singing Technique Conversion | CMMR 2021
[2022]
Learning Noise-independent Speech Representation for High-quality Voice Conversion for Noisy Target Speakers | INTERSPEECH 2022 | Demo
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion | INTERSPEECH 2022 | Demo
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme | ICLR 2022 | ✔️Code | Demo
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone | ICML 2022 | ✔️Code | Demo | Demo | Blog
A Comparative Study of Self-supervised Speech Representation Based Voice Conversion | IEEE JSTSP 2022/07
S3PRL-VC: Open-Source Voice Conversion Framework with Self-Supervised Speech Representations | ICASSP 2022 | ✔️Code
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion | ICASSP 2022 | ✔️Code | Demo | Usage sketch below this list
Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques | ICASSP 2022 | ✔️Code | Demo
NVC-Net: End-to-End Adversarial Voice Conversion | ICASSP 2022 | ✔️Code | Demo
Robust Disentangled Variational Speech Representation Learning for Zero-Shot Voice Conversion | ICASSP 2022 | Demo
Training Robust Zero-Shot Voice Conversion Models with Self-supervised Features | ICASSP 2022 | Demo
Toward Degradation-Robust Voice Conversion | ICASSP 2022
DGC-vector: A new speaker embedding for zero-shot voice conversion | ICASSP 2022 | Demo
End-to-End Zero-Shot Voice Style Transfer with Location-Variable Convolutions | 2022/05 | Demo
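Several of the ICASSP 2022 entries above build conversion on top of self-supervised content representations. As one openly released pipeline, the soft speech units paper ships its models through torch.hub; the sketch below follows that pipeline. The hub entry points and the `units`/`generate` calls are taken from memory of the authors' bshall repos, so treat them as assumptions and verify against the project README; file paths are placeholders.

```python
# Minimal any-to-one VC sketch following the soft speech units pipeline
# ("A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion").
# Hub entry points are assumed from the authors' bshall/* repos; verify against the README.
import torch
import torchaudio

hubert = torch.hub.load("bshall/hubert:main", "hubert_soft")            # waveform -> soft content units
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft")  # units -> mel in the target voice
hifigan = torch.hub.load("bshall/hifigan:main", "hifigan_hubert_soft")  # mel -> waveform

source, sr = torchaudio.load("source.wav")                              # placeholder source utterance
source = torchaudio.functional.resample(source, sr, 16000).unsqueeze(0)

with torch.inference_mode():
    units = hubert.units(source)                    # speaker-independent linguistic content
    mel = acoustic.generate(units).transpose(1, 2)  # (batch, n_mels, frames)
    converted = hifigan(mel)                        # waveform in the target speaker's voice
```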
[2021]
On Prosody Modeling for ASR+TTS based Voice Conversion | ASRU 2021 | Demo
Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations | NeurIPS 2021 | Demo | Unofficial Code
MediumVC: Any-to-any voice conversion using synthetic specific-speaker speeches as intermedium features | 2021/10 | ✔️Code | Demo
StarGANv2-VC: A Diverse, Unsupervised, Non-parallel Framework for Natural-Sounding Voice Conversion | INTERSPEECH 2021 Best Paper Award | ✔️Code | Demo
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations | INTERSPEECH 2021 | ✔️Code | Demo
Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder | INTERSPEECH 2021 | ✔️Code | Demo
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations | INTERSPEECH 2021 | Demo
Improving Zero-shot Voice Style Transfer via Disentangled Representation Learning | ICLR 2021
Global Rhythm Style Transfer Without Text Transcriptions | ICML 2021 | ✔️Code
AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization | ICASSP 2021 | ✔️Code | Demo
Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling | IEEE/ACM TASLP 2021/05 | ✔️Code | Demo
[2020]
An Overview of Voice Conversion and its Challenges: From Statistical Modeling to Deep Learning | IEEE/ACM TASLP 2020/11
Unsupervised Speech Decomposition via Triple Information Bottleneck | ICML 2020 | ✔️Code
[2019]
One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization | INTERSPEECH 2019 | ✔️Code
AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss | ICML 2019 | ✔️Code | Demo
CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit | 2019 | ?Apply&Download | Loading sketch below this list
AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines | 2020 | ?Apply&Download | Demo
AISHELL-2: Transforming Mandarin ASR Research Into Industrial Scale | 2018 | ?Apply&Download
AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline | 2017 | ?Apply&Download
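Many of the VC papers above train on VCTK. For quick experimentation, torchaudio ships a wrapper for the VCTK 0.92 release; the sketch below is minimal, the root directory is a placeholder, and the download is several gigabytes.

```python
# Sketch of iterating over VCTK 0.92 via torchaudio's built-in dataset wrapper.
import torchaudio

dataset = torchaudio.datasets.VCTK_092(root="./data", download=True)
waveform, sample_rate, transcript, speaker_id, utterance_id = dataset[0]
print(speaker_id, utterance_id, sample_rate, transcript)
```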
[2022]
Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion | INTERSPEECH 2022 | Demo
Cross-speaker Emotion Transfer Based On Prosody Compensation for End-to-End Speech Synthesis | INTERSPEECH 2022 | Demo
Emotion Intensity and its Control for Emotional Voice Conversion | IEEE Transactions on Affective Computing 2022/07 | ✔️Code | Demo
Textless Speech Emotion Conversion using Discrete and Decomposed Representations | 2022/02 | Demo
[2021]
[2020]
Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion | INTERSPEECH 2020 | ✔️Code | Demo
Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data | Odyssey 2020 | ✔️Code | Demo
[2022]
Muskits: an End-to-End Music Processing Toolkit for Singing Voice Synthesis | INTERSPEECH 2022 | ✔️Code
SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy | INTERSPEECH 2022 | ✔️Code
WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses | INTERSPEECH 2022 | Demo
WeSinger 2: Fully Parallel Singing Voice Synthesis via Multi-Singer Conditional Adversarial Training | 2022/08 | Demo
Deep Learning Approaches in Topics of Singing Information Processing | IEEE/ACM TASLP 2022/07
Learning the Beauty in Songs: Neural Singing Voice Beautifier | ACL 2022 | ✔️Code | Demo
DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism | AAAI 2022 | ✔️Code | Demo
[2021]
[2020]
M4Singer: a Multi-Style, Multi-Singer and Musical Score Provided Mandarin Singing Corpus | NeurIPS 2022 | ?Apply&Download | Demo
PopCS | AAAI 2022 | ?Apply&Download
Opencpop: A High-Quality Open Source Chinese Popular Song Corpus for Singing Voice Synthesis | INTERSPEECH 2022 | ?Apply&Download
[2022]
ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech | ACM MM 2022 | ✔️Code | Demo
BDDM: Bilateral Denoising Diffusion Models for Fast and High-Quality Speech Synthesis | ICLR 2022 | ✔️Code | Demo
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | IJCAI 2022 | ✔️Code | Demo
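ProDiff, BDDM, and FastDiff all start from the same denoising diffusion formulation and differ in how they shorten or reparameterize the reverse process. For orientation, here is the generic DDPM ancestral-sampling loop they build on; this is a textbook sketch, not any of these papers' samplers, and `model` is a hypothetical noise-prediction network eps_theta(x_t, t, cond).

```python
# Generic DDPM ancestral sampling (textbook form, not a specific paper's implementation).
import torch

def ddpm_sample(model, cond, shape, betas):
    """betas: 1-D tensor with the noise schedule, e.g. torch.linspace(1e-4, 0.05, 50)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                   # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, torch.tensor([t]), cond)              # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x
```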
[2022]
DDSP-based Singing Vocoders: A New Subtractive-based Synthesizer and A Comprehensive Evaluation | ISMIR 2022 | ✔️Code | Demo
FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis | IJCAI 2022 | ✔️Code | Demo
BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis | 2022/05 | Demo
[2021]
Multi-Singer: Fast Multi-Singer Singing Voice Vocoder With A Large-Scale Corpus | ACM MM 2021 | ?Apply&Download | ✔️Code | Demo
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis | INTERSPEECH 2021 | Demo
DiffWave: A Versatile Diffusion Model for Audio Synthesis | ICLR 2021 | ✔️Code | Demo | Usage sketch below this list
WaveGrad: Estimating Gradients for Waveform Generation | ICLR 2021 | Demo
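DiffWave has an installable reference implementation (the LMNT repo linked above, `pip install diffwave`). A rough vocoding call is sketched below; the checkpoint path is a placeholder and the exact `predict` signature should be checked against that repo's README.

```python
# Sketch of mel-to-waveform inference with the `diffwave` package (LMNT implementation).
# The checkpoint path is a placeholder; the spectrogram must be computed with the
# repo's own mel settings and shaped (batch, n_mels, frames).
import torch
from diffwave.inference import predict as diffwave_predict

spectrogram = torch.load("mel.pt")
audio, sample_rate = diffwave_predict(spectrogram, "path/to/diffwave/checkpoint", fast_sampling=True)
```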
[2020]
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis | NeurIPS 2020 | ✔️Code | Demo | Mel front-end sketch below this list
Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech | INTERSPEECH 2020 | Demo
Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram | ICASSP 2020 | Demo | Unofficial Code
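The GAN vocoders above all consume log-mel spectrograms. As a reference for the input representation, the sketch below computes a log-mel with parameter values commonly used by HiFi-GAN-style 22.05 kHz checkpoints; these are assumed defaults, the exact filterbank and normalization must match the specific checkpoint, and the final generator call is left as a comment because loading follows each repo's own scripts.

```python
# Sketch of the log-mel front end typically paired with HiFi-GAN-style vocoders
# (22.05 kHz audio, 1024-point FFT, hop 256, 80 mel bins, fmax 8 kHz) -- common defaults,
# not guaranteed to match every released checkpoint.
import torch
import torchaudio

wav, sr = torchaudio.load("speech.wav")                       # placeholder input file
wav = torchaudio.functional.resample(wav, sr, 22050)

mel_fn = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, win_length=1024, hop_length=256,
    n_mels=80, f_min=0.0, f_max=8000.0, power=1.0,
)
mel = torch.log(torch.clamp(mel_fn(wav), min=1e-5))           # (1, 80, frames) log-mel

# A trained generator (e.g. from the official HiFi-GAN repo's inference.py) then maps
# this (batch, n_mels, frames) tensor back to a waveform:
# audio = generator(mel)
```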
[2019]
MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis | NeurIPS 2019 | ✔️Code | Demo
Towards achieving robust universal neural vocoding | INTERSPEECH 2019 | ✔️Code | Demo | Unofficial Code
[2022]
Multi-instrument Music Synthesis with Spectrogram Diffusion | ISMIR 2022 | ✔️Code | Demo
Musika! Fast Infinite Waveform Music Generation | ISMIR 2022 | ✔️Code | Demo
[2022]
[2021]
[2022]
UniSpeech-SAT: Universal Speech Representation Learning with Speaker Aware Pre-Training | ICASSP 2022 | ✔️Code | ✔️Code
Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
Pseudo-Labeling for Massively Multilingual Speech Recognition | ICASSP 2022 | ✔️Code | ✔️Code
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing | IEEE JSTSP 2022/06 | ✔️Code | ✔️Code
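WavLM checkpoints are published on the Hugging Face Hub, so extracting frame-level features for downstream tasks (speaker verification, VC content encoders, etc.) takes a few lines. A minimal sketch, assuming the "microsoft/wavlm-base-plus" checkpoint and 16 kHz mono input:

```python
# Sketch of extracting WavLM features with Hugging Face Transformers.
import torch
from transformers import AutoFeatureExtractor, WavLMModel

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus")
model = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio; replace with real speech
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # (batch, frames, hidden), ~50 frames/s
```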
[2021]
XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale | 2021/12 | ✔️Code | ✔️Code
Simple and Effective Zero-shot Cross-lingual Phoneme Recognition | 2021/09 | ✔️Code | ✔️Code
TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech | IEEE/ACM TASLP 2021/08 | ✔️Code
UniSpeech: Unified Speech Representation Learning with Labeled and Unlabeled Data | ICML 2021 | ✔️Code | ✔️Code | ✔️Code
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units | IEEE/ACM TASLP 2021/06 | ✔️Code | ✔️Code
[2020]
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations | NeurIPS 2020 | ✔️Code | ✔️Code | Usage sketch below this list
vq-wav2vec: Self-Supervised Learning of Discrete Speech Representations | ICLR 2020 | ✔️Code | ✔️Code
Mockingjay: Unsupervised Speech Representation Learning with Deep Bidirectional Transformer Encoders | ICASSP 2020 | ✔️Code
Unsupervised Cross-lingual Representation Learning for Speech Recognition | 2020/06 | ✔️Code | ✔️Code
fairseq S2T: Fast Speech-to-Text Modeling with fairseq | AACL 2020 | ✔️Code | ✔️Code
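For SSL-based ASR, torchaudio packages fine-tuned wav2vec 2.0 checkpoints as pipeline bundles, which is the quickest way to try the "pretrain, then fine-tune with CTC" recipe from the wav2vec 2.0 paper. A minimal sketch with greedy CTC decoding; the input file is a placeholder and the bundle name should be checked against your torchaudio version's docs.

```python
# Sketch of SSL-based ASR with torchaudio's wav2vec 2.0 bundle (fine-tuned on LibriSpeech).
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model()

waveform, sr = torchaudio.load("speech.wav")                 # placeholder input file
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)                           # (batch, frames, num_labels) log-probs

# Greedy CTC decoding: best label per frame, collapse repeats, drop the blank token "-".
labels = bundle.get_labels()
indices = torch.unique_consecutive(torch.argmax(emissions[0], dim=-1))
transcript = "".join(labels[i] for i in indices.tolist() if labels[i] != "-").replace("|", " ")
print(transcript)
```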
[2019]
[2022]
[2021]
[2021]
[2022]
[2022]
[2021]
[2022]
[2021]
[2021]
Voice Conversion Challenge 2020 | ?Apply&Download | ✔️Code
The Blizzard Challenge