All About Speech
This repository organizes papers, learning materials, and code for understanding speech. There is another repository for machine/deep learning here.
To Dos:
- organize stars
- add more papers
- papers to read:
- Speech-T: Transducer for TTS and Beyond
TTS
ASR
- Towards End-to-End Spoken Language Understanding
Speech Classification, Detection, Filter, etc.
- HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection [[paper]] [code]
- Google AI's VoiceFilter System [[paper]] [code]
- Improved End-to-End Speech Emotion Recognition Using Self Attention Mechanism and Multitask Learning (Interspeech 2019) [[paper]] [code]
- Multimodal Emotion Recognition with Transformer-Based Self Supervised Feature Fusion [[paper]] [code]
- Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings (Interspeech 2021) [[paper]] [code]
- Exploring Wav2vec 2.0 fine-tuning for improved speech emotion recognition [[paper]] [code]
- Rethinking CNN Models for Audio Classification [[paper]] [code]
- EEG-based emotion recognition using SincNet [[paper]] [code]
Speaker Verification
- Cross attentive pooling for speaker verification (IEEE SLT 2021) [[paper]] [code]
Linguistics
Datasets
- VGGSound: A Large-scale Audio-Visual Dataset [[paper]] [code]
- CSS10: A collection of single speaker speech datasets for 10 languages [code]
- IEMOCAP: about 12 hours of audiovisual data from 10 actors (5 male, 5 female) [website]
- VoxCeleb [repo]
Data Augmentation
- Audiomentations (Fast audio data augmentation in PyTorch) [code]
Aligners
- Montreal Forced Aligner
Data (Pre)processing / Augmentation
- Korean pronunciation and romanization based on the Wiktionary ko-pron Lua module [code]
- Audio Signal Processing [code]
- Phonological Features (for the paper "Phonological features for 0-shot multilingual speech synthesis") [[paper]] [code]
- SMART-G2P (converts English and Kanji expressions in Korean sentences into Korean pronunciation) [code]
- Kakao Grapheme to Phoneme Conversion Package for "Mandarin" [code]
- Webaverse Speech Tool [code]
Verification
- MCD [repo]
- The code runs, but I am not sure the results are correct: MCD values are a bit too high even for pairs of similar audio clips.
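For sanity-checking MCD numbers, here is a minimal reference sketch of the commonly used mel-cepstral distortion formula, `(10 / ln 10) * sqrt(2 * sum_d (c_d - c'_d)^2)`, averaged over frames. It is not the code from the linked repo. The function name `mel_cepstral_distortion` is hypothetical, and the sketch assumes the two mel-cepstrum sequences are already time-aligned (for example with DTW) and that the 0th (energy) coefficient is excluded. Skipping alignment or including the energy term are two typical reasons for inflated values, though that may not be the cause here.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB between two time-aligned mel-cepstrum sequences.

    Each argument is a list of frames; each frame is a list of coefficients.
    Assumption: frames are already aligned one-to-one (e.g. via DTW).
    """
    if not ref_frames or len(ref_frames) != len(syn_frames):
        raise ValueError("sequences must be non-empty and have the same number of frames")

    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Skip index 0 (overall energy), as is conventional for MCD.
        squared_error = sum((r - s) ** 2 for r, s in zip(ref[1:], syn[1:]))
        total += scale * math.sqrt(2.0 * squared_error)
    return total / len(ref_frames)


if __name__ == "__main__":
    reference = [[1.0, 0.5, 0.2], [1.1, 0.4, 0.1]]
    synthesized = [[3.0, 0.5, 0.2], [0.2, 0.4, 0.1]]
    # Only the energy coefficient differs, so the distortion is zero.
    print(mel_cepstral_distortion(reference, synthesized))
```

Comparing a file against itself should give exactly 0 dB. If the repo's implementation reports a clearly non-zero value in that case, the feature extraction or alignment step is a likely place to look.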
Other Research That May Help
- Text to Image Synthesis
- AudioMAE (Masked Autoencoders that Listen) [code]
Organizations
- DeepMind [repo]
- OpenAI [repo]
- Clubhouse: WeeklyArxivTalk [repo]
Other Repositories to Refer to - Speech Included/Related
- Speech Researchers List [repo]
- Jackson-Kang [repo]
- Rosinality's ML [repo]
- ivallesp's [repo]
- ddlBoJack's Speech Pretraining [repo]
- fuzhenxin's Style Transfer in Text [repo]
Learning Materials
- Digital Signal Processing Lecture [link]
- Ratsgo's Speechbook [link]
- YSDA Course in Speech Processing [code]
- NHN Forward YouTube video [link]