Simplified Chinese | English
This branch is version 1.1. If you want to use the previous version 1.0, please switch to the 1.0 branch. This project uses several state-of-the-art voiceprint recognition models such as EcapaTdnn, ResNetSE, ERes2Net, and CAM++, and more models may be supported in the future. It also supports multiple data preprocessing methods, including MelSpectrogram, Spectrogram, MFCC, and Fbank. ArcFace Loss (Additive Angular Margin Loss), which corresponds to AAMLoss in this project, normalizes the embedding vectors and classifier weights and adds an angular margin m to θ; an angular margin affects the angle more directly than a cosine margin. In addition, other loss functions such as AMLoss, ARMLoss, and CELoss are also supported.
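For readers unfamiliar with the idea, below is a minimal PyTorch sketch of an additive angular margin loss in the spirit of AAMLoss. It is only an illustration with typical margin and scale values and may differ in detail from the project's own AAMLoss implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdditiveAngularMarginLoss(nn.Module):
    """Sketch of an ArcFace-style loss: use cos(theta + m) for the target class."""

    def __init__(self, embd_dim, num_classes, margin=0.2, scale=32.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embd_dim))
        nn.init.xavier_normal_(self.weight)
        self.margin = margin
        self.scale = scale

    def forward(self, embeddings, labels):
        # Normalize both embeddings and class weights, so logits are cos(theta)
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle
        target_logits = torch.cos(theta + self.margin)
        one_hot = F.one_hot(labels, cosine.size(1)).float()
        logits = self.scale * (one_hot * target_logits + (1.0 - one_hot) * cosine)
        return F.cross_entropy(logits, labels)


# Usage sketch: loss = AdditiveAngularMarginLoss(192, 2796)(embeddings, speaker_labels)
```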
If this project helps you, please give it a Star so that you can find it again later.
Everyone is welcome to scan the QR code to join the Knowledge Planet or the QQ group for discussion. The Knowledge Planet provides this project's model files, model files for the author's other related projects, and other resources.
Usage environment:
Model paper:
| Model | Params(M) | Dataset | train speakers | threshold | EER | MinDCF | Model download |
|---|---|---|---|---|---|---|---|
| ERes2NetV2 | 6.6 | CN-Celeb | 2796 | 0.20089 | 0.08071 | 0.45705 | Join the knowledge planet to obtain |
| ERes2Net | 6.6 | CN-Celeb | 2796 | 0.20014 | 0.08132 | 0.45544 | Join the knowledge planet to obtain |
| CAM++ | 6.8 | CN-Celeb | 2796 | 0.23323 | 0.08332 | 0.48536 | Join the knowledge planet to obtain |
| ResNetSE | 7.8 | CN-Celeb | 2796 | 0.19066 | 0.08544 | 0.49142 | Join the knowledge planet to obtain |
| EcapaTdnn | 6.1 | CN-Celeb | 2796 | 0.23646 | 0.09259 | 0.51378 | Join the knowledge planet to obtain |
| TDNN | 2.6 | CN-Celeb | 2796 | 0.23858 | 0.10825 | 0.59545 | Join the knowledge planet to obtain |
| Res2Net | 5.0 | CN-Celeb | 2796 | 0.19526 | 0.12436 | 0.65347 | Join the knowledge planet to obtain |
| CAM++ | 6.8 | Larger data set | 20,000+ | 0.33 | 0.07874 | 0.52524 | Join the knowledge planet to obtain |
| ERes2Net | 55.1 | Other data sets | 200,000+ | 0.36 | 0.02936 | 0.18355 | Join the knowledge planet to obtain |
| ERes2NetV2 | 56.2 | Other data sets | 200,000+ | 0.36 | 0.03847 | 0.24301 | Join the knowledge planet to obtain |
| CAM++ | 6.8 | Other data sets | 200,000+ | 0.29 | 0.04765 | 0.31436 | Join the knowledge planet to obtain |
Notes:
During training, speed_perturb_3_class is set to True, the preprocessing method is Fbank, and the loss function is AAMLoss.

| Model | Params(M) | Dataset | train speakers | threshold | EER | MinDCF | Model download |
|---|---|---|---|---|---|---|---|
| CAM++ | 6.8 | VoxCeleb1&2 | 7205 | 0.22504 | 0.02436 | 0.15543 | Join the knowledge planet to obtain |
| EcapaTdnn | 6.1 | VoxCeleb1&2 | 7205 | 0.24877 | 0.02480 | 0.16188 | Join the knowledge planet to obtain |
| ERes2NetV2 | 6.6 | VoxCeleb1&2 | 7205 | 0.20710 | 0.02742 | 0.17709 | Join the knowledge planet to obtain |
| ERes2Net | 6.6 | VoxCeleb1&2 | 7205 | 0.20233 | 0.02954 | 0.17377 | Join the knowledge planet to obtain |
| ResNetSE | 7.8 | VoxCeleb1&2 | 7205 | 0.22567 | 0.03189 | 0.23040 | Join the knowledge planet to obtain |
| TDNN | 2.6 | VoxCeleb1&2 | 7205 | 0.23834 | 0.03486 | 0.26792 | Join the knowledge planet to obtain |
| Res2Net | 5.0 | VoxCeleb1&2 | 7205 | 0.19472 | 0.04370 | 0.40072 | Join the knowledge planet to obtain |
| CAM++ | 6.8 | Larger data set | 20,000+ | 0.28 | 0.03182 | 0.23731 | Join the knowledge planet to obtain |
| ERes2Net | 55.1 | Other data sets | 200,000+ | 0.53 | 0.08904 | 0.62130 | Join the knowledge planet to obtain |
| ERes2NetV2 | 56.2 | Other data sets | 200,000+ | 0.52 | 0.08649 | 0.64193 | Join the knowledge planet to obtain |
| CAM++ | 6.8 | Other data sets | 200,000+ | 0.49 | 0.10334 | 0.71200 | Join the knowledge planet to obtain |
Notes:
During training, speed_perturb_3_class is set to True, the preprocessing method is Fbank, and the loss function is AAMLoss.

| Preprocessing method | Dataset | train speakers | threshold | EER | MinDCF | Model download |
|---|---|---|---|---|---|---|
| Fbank | CN-Celeb | 2796 | 0.14574 | 0.10988 | 0.58955 | Join the knowledge planet to obtain |
| MFCC | CN-Celeb | 2796 | 0.14868 | 0.11483 | 0.61275 | Join the knowledge planet to obtain |
| Spectrogram | CN-Celeb | 2796 | 0.14962 | 0.11613 | 0.60057 | Join the knowledge planet to obtain |
| MelSpectrogram | CN-Celeb | 2796 | 0.13458 | 0.12498 | 0.60741 | Join the knowledge planet to obtain |
| wavlm-base-plus | CN-Celeb | 2796 | 0.14166 | 0.13247 | 0.62451 | Join the knowledge planet to obtain |
| w2v-bert-2.0 | CN-Celeb | 2796 | | | | Join the knowledge planet to obtain |
| wav2vec2-large-xlsr-53 | CN-Celeb | 2796 | | | | Join the knowledge planet to obtain |
| wavlm-large | CN-Celeb | 2796 | | | | Join the knowledge planet to obtain |
Notes:
The model used is CAM++ and the loss function is AAMLoss. extract_features.py is used to extract features in advance, which means no audio data augmentation is applied during training. w2v-bert-2.0 and wav2vec2-large-xlsr-53 are pre-trained on multilingual data, while the pre-training data of wavlm-base-plus and wavlm-large is English only.

| Loss function | Dataset | train speakers | threshold | EER | MinDCF | Model download |
|---|---|---|---|---|---|---|
| AAMLoss | CN-Celeb | 2796 | 0.14574 | 0.10988 | 0.58955 | Join the knowledge planet to obtain |
| SphereFace2 | CN-Celeb | 2796 | 0.20377 | 0.11309 | 0.61536 | Join the knowledge planet to obtain |
| TripletAngularMarginLoss | CN-Celeb | 2796 | 0.28940 | 0.11749 | 0.63735 | Join the knowledge planet to obtain |
| SubCenterLoss | CN-Celeb | 2796 | 0.13126 | 0.11775 | 0.56995 | Join the knowledge planet to obtain |
| ARMLoss | CN-Celeb | 2796 | 0.14563 | 0.11805 | 0.57171 | Join the knowledge planet to obtain |
| AMLoss | CN-Celeb | 2796 | 0.12870 | 0.12301 | 0.63263 | Join the knowledge planet to obtain |
| CELoss | CN-Celeb | 2796 | 0.13607 | 0.12684 | 0.65176 | Join the knowledge planet to obtain |
Notes:
The model used is CAM++ and the preprocessing method is Fbank. extract_features.py is used to extract features in advance, which means no audio data augmentation is applied during training.

First install the GPU version of Pytorch with conda, the command is as follows:

conda install pytorch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 pytorch-cuda=11.8 -c pytorch -c nvidia

Install mvector with pip, the command is as follows:
python -m pip install mvector -U -i https://pypi.tuna.tsinghua.edu.cn/simple

It is recommended to install from source, which ensures you always use the latest code:
git clone https://github.com/yeyupiaoling/VoiceprintRecognition-Pytorch.git
cd VoiceprintRecognition-Pytorch/
pip install .

The author uses CN-Celeb in this tutorial. This dataset contains roughly 3,000 speakers and 650,000+ utterances. After downloading, decompress the dataset into the dataset directory. In addition, if you want to evaluate, you also need to download the CN-Celeb test set. If you have other, better datasets, they can be mixed in, but it is best to use the Python toolkit aukit to process the audio first, e.g. noise reduction and silence removal.
The first step is to create a data list. Each line of the list has the format <audio file path>\t<speaker label>, i.e. tab-separated. Creating this list mainly makes later reading convenient and also makes it easy to use other speech datasets: the speaker label is the speaker's unique ID, and different speech datasets can be written into the same data list by writing corresponding list-generation functions.
Execute the create_data.py program to complete data preparation.
python create_data.py

After executing the above program, data in the following format will be generated. If you want to customize the data, refer to the following data list: the front part is the relative path of the audio file and the back part is the label of the corresponding speaker, similar to a classification task. Note that the IDs in the test data list do not need to match the training IDs, which means the speakers in the test data do not need to appear in the training set, as long as the same person keeps the same ID within the test list. (A sketch of a custom list-generation function is shown after the sample list below.)
dataset/CN-Celeb2_flac/data/id11999/recitation-03-019.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-023.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-06-025.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-04-014.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-06-030.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-032.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-06-028.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-031.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-05-003.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-04-017.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-10-016.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-09-001.flac 2795
dataset/CN-Celeb2_flac/data/id11999/recitation-05-010.flac 2795
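For reference, below is a hypothetical sketch of a custom list-generation function. It is not the project's create_data.py; it simply assumes one sub-directory per speaker, so adapt the layout to your own dataset.

```python
import os


def create_custom_list(audio_root, list_path):
    # Write one tab-separated "<audio file path>\t<speaker label>" line per utterance,
    # treating each sub-directory of audio_root as one speaker.
    speakers = sorted(os.listdir(audio_root))
    with open(list_path, 'w', encoding='utf-8') as f:
        for label, speaker in enumerate(speakers):
            speaker_dir = os.path.join(audio_root, speaker)
            for name in sorted(os.listdir(speaker_dir)):
                if name.endswith(('.wav', '.flac')):
                    f.write(f'{os.path.join(speaker_dir, name)}\t{label}\n')


# e.g. create_custom_list('dataset/my_corpus', 'dataset/my_train_list.txt')
```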
The Fbank preprocessing method is used by default in the configuration file. If you want to use another preprocessing method, you can modify the preprocess_conf section of the configuration file. The specific values can be adjusted to your own situation; if you are not sure how to set the parameters, you can simply delete this part and use the default values.
# Data preprocessing parameters
preprocess_conf:
  # Whether to use a Wav2Vec2-like model from HuggingFace to extract audio features
  use_hf_model: False
  # Audio preprocessing method, also called the feature extraction method
  # When use_hf_model is False, the supported values are: MelSpectrogram, Spectrogram, MFCC, Fbank
  # When use_hf_model is True, this specifies a HuggingFace model name or local path, e.g. facebook/w2v-bert-2.0 or ./feature_models/w2v-bert-2.0
  feature_method: 'Fbank'
  # When use_hf_model is False, these are the API parameters; see the corresponding API for more. If unsure, delete this part and use the defaults.
  # When use_hf_model is True, the parameter use_gpu can be set here to specify whether to use the GPU for feature extraction
  method_args:
    sample_frequency: 16000
    num_mel_bins: 80
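As a reference for what the default Fbank front end produces, here is a minimal sketch using torchaudio's Kaldi-compatible fbank with the sample_frequency and num_mel_bins values from the configuration above. It is an illustration only; the project extracts features internally according to preprocess_conf, and the audio path is a hypothetical example file.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load a 16 kHz mono waveform; shape is (channels, samples)
waveform, sample_rate = torchaudio.load('dataset/a_1.wav')
# Kaldi-compatible Fbank features with the values from the configuration above
fbank = kaldi.fbank(waveform, sample_frequency=sample_rate, num_mel_bins=80)
print(fbank.shape)  # (num_frames, 80)
```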
During training, the audio is first read, then features are extracted, and finally training is carried out. Reading the audio and extracting features is itself time-consuming, so you can choose to extract the features in advance; training then loads the pre-extracted features directly, which makes it faster. Pre-extracting features is optional: if no features have been extracted in advance, training reads the audio and extracts the features on the fly. The steps to extract features in advance are as follows:

1. Execute extract_features.py to extract the features. The features will be saved in the dataset/features directory, and new data lists train_list_features.txt, enroll_list_features.txt, and trials_list_features.txt will be generated.

python extract_features.py --configs=configs/cam++.yml --save_dir=dataset/features

2. Modify the configuration file, changing dataset_conf.train_list, dataset_conf.enroll_list, and dataset_conf.trials_list to train_list_features.txt, enroll_list_features.txt, and trials_list_features.txt respectively.

Use train.py to train the model. This project supports multiple audio preprocessing methods, which can be specified with the preprocess_conf.feature_method parameter of the configs/ecapa_tdnn.yml configuration file: MelSpectrogram is the Mel spectrogram, Spectrogram is the spectrogram, MFCC is the Mel-frequency cepstral coefficients, and so on. The data augmentation method can be specified with the parameter augment_conf_path. During training, VisualDL is used to save the training log; by starting VisualDL you can check the training progress at any time. The startup command is visualdl --logdir=log --host 0.0.0.0
# Single-GPU training
CUDA_VISIBLE_DEVICES=0 python train.py
# Multi-GPU training
CUDA_VISIBLE_DEVICES=0,1 torchrun --standalone --nnodes=1 --nproc_per_node=2 train.py

Training output log:
[2023-08-05 09:52:06.497988 INFO ] utils:print_arguments:13 - ----------- 额外配置参数 -----------
[2023-08-05 09:52:06.498094 INFO ] utils:print_arguments:15 - configs: configs/ecapa_tdnn.yml
[2023-08-05 09:52:06.498149 INFO ] utils:print_arguments:15 - do_eval: True
[2023-08-05 09:52:06.498191 INFO ] utils:print_arguments:15 - local_rank: 0
[2023-08-05 09:52:06.498230 INFO ] utils:print_arguments:15 - pretrained_model: None
[2023-08-05 09:52:06.498269 INFO ] utils:print_arguments:15 - resume_model: None
[2023-08-05 09:52:06.498306 INFO ] utils:print_arguments:15 - save_model_path: models/
[2023-08-05 09:52:06.498342 INFO ] utils:print_arguments:15 - use_gpu: True
[2023-08-05 09:52:06.498378 INFO ] utils:print_arguments:16 - ------------------------------------------------
[2023-08-05 09:52:06.513761 INFO ] utils:print_arguments:18 - ----------- 配置文件参数 -----------
[2023-08-05 09:52:06.513906 INFO ] utils:print_arguments:21 - dataset_conf:
[2023-08-05 09:52:06.513957 INFO ] utils:print_arguments:24 - dataLoader:
[2023-08-05 09:52:06.513995 INFO ] utils:print_arguments:26 - batch_size: 64
[2023-08-05 09:52:06.514031 INFO ] utils:print_arguments:26 - num_workers: 4
[2023-08-05 09:52:06.514066 INFO ] utils:print_arguments:28 - do_vad: False
[2023-08-05 09:52:06.514101 INFO ] utils:print_arguments:28 - enroll_list: dataset/enroll_list.txt
[2023-08-05 09:52:06.514135 INFO ] utils:print_arguments:24 - eval_conf:
[2023-08-05 09:52:06.514169 INFO ] utils:print_arguments:26 - batch_size: 1
[2023-08-05 09:52:06.514203 INFO ] utils:print_arguments:26 - max_duration: 20
[2023-08-05 09:52:06.514237 INFO ] utils:print_arguments:28 - max_duration: 3
[2023-08-05 09:52:06.514274 INFO ] utils:print_arguments:28 - min_duration: 0.5
[2023-08-05 09:52:06.514308 INFO ] utils:print_arguments:28 - noise_aug_prob: 0.2
[2023-08-05 09:52:06.514342 INFO ] utils:print_arguments:28 - noise_dir: dataset/noise
[2023-08-05 09:52:06.514374 INFO ] utils:print_arguments:28 - num_speakers: 3242
[2023-08-05 09:52:06.514408 INFO ] utils:print_arguments:28 - sample_rate: 16000
[2023-08-05 09:52:06.514441 INFO ] utils:print_arguments:28 - speed_perturb: True
[2023-08-05 09:52:06.514475 INFO ] utils:print_arguments:28 - target_dB: -20
[2023-08-05 09:52:06.514508 INFO ] utils:print_arguments:28 - train_list: dataset/train_list.txt
[2023-08-05 09:52:06.514542 INFO ] utils:print_arguments:28 - trials_list: dataset/trials_list.txt
[2023-08-05 09:52:06.514575 INFO ] utils:print_arguments:28 - use_dB_normalization: True
[2023-08-05 09:52:06.514609 INFO ] utils:print_arguments:21 - loss_conf:
[2023-08-05 09:52:06.514643 INFO ] utils:print_arguments:24 - args:
[2023-08-05 09:52:06.514678 INFO ] utils:print_arguments:26 - easy_margin: False
[2023-08-05 09:52:06.514713 INFO ] utils:print_arguments:26 - margin: 0.2
[2023-08-05 09:52:06.514746 INFO ] utils:print_arguments:26 - scale: 32
[2023-08-05 09:52:06.514779 INFO ] utils:print_arguments:24 - margin_scheduler_args:
[2023-08-05 09:52:06.514814 INFO ] utils:print_arguments:26 - final_margin: 0.3
[2023-08-05 09:52:06.514848 INFO ] utils:print_arguments:28 - use_loss: AAMLoss
[2023-08-05 09:52:06.514882 INFO ] utils:print_arguments:28 - use_margin_scheduler: True
[2023-08-05 09:52:06.514915 INFO ] utils:print_arguments:21 - model_conf:
[2023-08-05 09:52:06.514950 INFO ] utils:print_arguments:24 - backbone:
[2023-08-05 09:52:06.514984 INFO ] utils:print_arguments:26 - embd_dim: 192
[2023-08-05 09:52:06.515017 INFO ] utils:print_arguments:26 - pooling_type: ASP
[2023-08-05 09:52:06.515050 INFO ] utils:print_arguments:24 - classifier:
[2023-08-05 09:52:06.515084 INFO ] utils:print_arguments:26 - num_blocks: 0
[2023-08-05 09:52:06.515118 INFO ] utils:print_arguments:21 - optimizer_conf:
[2023-08-05 09:52:06.515154 INFO ] utils:print_arguments:28 - learning_rate: 0.001
[2023-08-05 09:52:06.515188 INFO ] utils:print_arguments:28 - optimizer: Adam
[2023-08-05 09:52:06.515221 INFO ] utils:print_arguments:28 - scheduler: CosineAnnealingLR
[2023-08-05 09:52:06.515254 INFO ] utils:print_arguments:28 - scheduler_args: None
[2023-08-05 09:52:06.515289 INFO ] utils:print_arguments:28 - weight_decay: 1e-06
[2023-08-05 09:52:06.515323 INFO ] utils:print_arguments:21 - preprocess_conf:
[2023-08-05 09:52:06.515357 INFO ] utils:print_arguments:28 - feature_method: MelSpectrogram
[2023-08-05 09:52:06.515390 INFO ] utils:print_arguments:24 - method_args:
[2023-08-05 09:52:06.515426 INFO ] utils:print_arguments:26 - f_max: 14000.0
[2023-08-05 09:52:06.515460 INFO ] utils:print_arguments:26 - f_min: 50.0
[2023-08-05 09:52:06.515493 INFO ] utils:print_arguments:26 - hop_length: 320
[2023-08-05 09:52:06.515527 INFO ] utils:print_arguments:26 - n_fft: 1024
[2023-08-05 09:52:06.515560 INFO ] utils:print_arguments:26 - n_mels: 64
[2023-08-05 09:52:06.515593 INFO ] utils:print_arguments:26 - sample_rate: 16000
[2023-08-05 09:52:06.515626 INFO ] utils:print_arguments:26 - win_length: 1024
[2023-08-05 09:52:06.515660 INFO ] utils:print_arguments:21 - train_conf:
[2023-08-05 09:52:06.515694 INFO ] utils:print_arguments:28 - log_interval: 100
[2023-08-05 09:52:06.515728 INFO ] utils:print_arguments:28 - max_epoch: 30
[2023-08-05 09:52:06.515761 INFO ] utils:print_arguments:30 - use_model: EcapaTdnn
[2023-08-05 09:52:06.515794 INFO ] utils:print_arguments:31 - ------------------------------------------------
······
===============================================================================================
Layer (type:depth-idx) Output Shape Param #
===============================================================================================
Sequential [1, 9726] --
├─EcapaTdnn: 1-1 [1, 192] --
│ └─Conv1dReluBn: 2-1 [1, 512, 98] --
│ │ └─Conv1d: 3-1 [1, 512, 98] 163,840
│ │ └─BatchNorm1d: 3-2 [1, 512, 98] 1,024
│ └─Sequential: 2-2 [1, 512, 98] --
│ │ └─Conv1dReluBn: 3-3 [1, 512, 98] 263,168
│ │ └─Res2Conv1dReluBn: 3-4 [1, 512, 98] 86,912
│ │ └─Conv1dReluBn: 3-5 [1, 512, 98] 263,168
│ │ └─SE_Connect: 3-6 [1, 512, 98] 262,912
│ └─Sequential: 2-3 [1, 512, 98] --
│ │ └─Conv1dReluBn: 3-7 [1, 512, 98] 263,168
│ │ └─Res2Conv1dReluBn: 3-8 [1, 512, 98] 86,912
│ │ └─Conv1dReluBn: 3-9 [1, 512, 98] 263,168
│ │ └─SE_Connect: 3-10 [1, 512, 98] 262,912
│ └─Sequential: 2-4 [1, 512, 98] --
│ │ └─Conv1dReluBn: 3-11 [1, 512, 98] 263,168
│ │ └─Res2Conv1dReluBn: 3-12 [1, 512, 98] 86,912
│ │ └─Conv1dReluBn: 3-13 [1, 512, 98] 263,168
│ │ └─SE_Connect: 3-14 [1, 512, 98] 262,912
│ └─Conv1d: 2-5 [1, 1536, 98] 2,360,832
│ └─AttentiveStatsPool: 2-6 [1, 3072] --
│ │ └─Conv1d: 3-15 [1, 128, 98] 196,736
│ │ └─Conv1d: 3-16 [1, 1536, 98] 198,144
│ └─BatchNorm1d: 2-7 [1, 3072] 6,144
│ └─Linear: 2-8 [1, 192] 590,016
│ └─BatchNorm1d: 2-9 [1, 192] 384
├─SpeakerIdentification: 1-2 [1, 9726] 1,867,392
===============================================================================================
Total params: 8,012,992
Trainable params: 8,012,992
Non-trainable params: 0
Total mult-adds (M): 468.81
===============================================================================================
Input size (MB): 0.03
Forward/backward pass size (MB): 10.36
Params size (MB): 32.05
Estimated Total Size (MB): 42.44
===============================================================================================
[2023-08-05 09:52:08.084231 INFO ] trainer:train:388 - 训练数据:874175
[2023-08-05 09:52:09.186542 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [0/13659], loss: 11.95824, accuracy: 0.00000, learning rate: 0.00100000, speed: 58.09 data/sec, eta: 5 days, 5:24:08
[2023-08-05 09:52:22.477905 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [100/13659], loss: 10.35675, accuracy: 0.00278, learning rate: 0.00100000, speed: 481.65 data/sec, eta: 15:07:15
[2023-08-05 09:52:35.948581 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [200/13659], loss: 10.22089, accuracy: 0.00505, learning rate: 0.00100000, speed: 475.27 data/sec, eta: 15:19:12
[2023-08-05 09:52:49.249098 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [300/13659], loss: 10.00268, accuracy: 0.00706, learning rate: 0.00100000, speed: 481.45 data/sec, eta: 15:07:11
[2023-08-05 09:53:03.716015 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [400/13659], loss: 9.76052, accuracy: 0.00830, learning rate: 0.00100000, speed: 442.74 data/sec, eta: 16:26:16
[2023-08-05 09:53:18.258807 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [500/13659], loss: 9.50189, accuracy: 0.01060, learning rate: 0.00100000, speed: 440.46 data/sec, eta: 16:31:08
[2023-08-05 09:53:31.618354 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [600/13659], loss: 9.26083, accuracy: 0.01256, learning rate: 0.00100000, speed: 479.50 data/sec, eta: 15:10:12
[2023-08-05 09:53:45.439642 INFO ] trainer:__train_epoch:334 - Train epoch: [1/30], batch: [700/13659], loss: 9.03548, accuracy: 0.01449, learning rate: 0.00099999, speed: 463.63 data/sec, eta: 15:41:08
Start VisualDL: visualdl --logdir=log --host 0.0.0.0 , the VisualDL page is as follows:

After training, the prediction model is saved. We use it to extract the audio features of the test set, then compare the features in pairs to compute EER and MinDCF.
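For intuition, the sketch below shows one common way EER can be computed from pairwise similarity scores and same-speaker labels. It is only an illustration; eval.py has its own implementation.

```python
import numpy as np


def compute_eer(scores, labels):
    # scores: similarity for each trial pair; labels: 1 = same speaker, 0 = different speaker
    scores, labels = np.asarray(scores, dtype=float), np.asarray(labels, dtype=int)
    thresholds = np.sort(np.unique(scores))
    far = np.array([np.mean(scores[labels == 0] >= t) for t in thresholds])  # false acceptance rate
    frr = np.array([np.mean(scores[labels == 1] < t) for t in thresholds])   # false rejection rate
    idx = np.argmin(np.abs(far - frr))  # EER is where the two error rates cross
    return (far[idx] + frr[idx]) / 2, thresholds[idx]


# e.g. eer, threshold = compute_eer([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0])
```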
python eval.py

The output is similar to the following:
······
------------------------------------------------
W0425 08:27:32.057426 17654 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0425 08:27:32.065165 17654 device_context.cc:465] device: 0, cuDNN Version: 7.6.
[2023-03-16 20:20:47.195908 INFO ] trainer:evaluate:341 - 成功加载模型:models/EcapaTdnn_Fbank/best_model/model.pth
100%|███████████████████████████| 84/84 [00:28<00:00, 2.95it/s]
开始两两对比音频特征...
100%|███████████████████████████| 5332/5332 [00:05<00:00, 1027.83it/s]
评估消耗时间:65s,threshold:0.26,EER: 0.14739, MinDCF: 0.41999
Here are several commonly used interfaces; for more, please refer to mvector/predict.py. You can also look at the voiceprint comparison and voiceprint recognition examples.
from mvector.predict import MVectorPredictor

predictor = MVectorPredictor(configs='configs/cam++.yml',
                             model_path='models/CAMPPlus_Fbank/best_model/')
# Get the features of an audio clip
embedding = predictor.predict(audio_data='dataset/a_1.wav')
# Get the similarity of two audio clips
similarity = predictor.contrast(audio_data1='dataset/a_1.wav', audio_data2='dataset/a_2.wav')
# Register a user's audio
predictor.register(user_name='夜雨飘零', audio_data='dataset/test.wav')
# Recognize a user's audio
name, score = predictor.recognition(audio_data='dataset/test1.wav')
# Get all users
users_name = predictor.get_users()
# Remove a user's audio
predictor.remove_user(user_name='夜雨飘零')

Let's start implementing voiceprint comparison by creating an infer_contrast.py program. First, a few important functions: the predict() function obtains the voiceprint features of one audio clip, the predict_batch() function obtains the features of a batch of audio clips, the contrast() function compares the similarity of two audio clips, the register() function registers an audio clip into the voiceprint library, the recognition() function takes an input audio clip and recognizes it against the voiceprint library, and the remove_user() function removes a user from the voiceprint library. For comparison, we input two speech clips and obtain their feature vectors through the prediction function; with these features we compute their cosine similarity, and the result is used as their degree of similarity. The recognition threshold can be adjusted according to the accuracy requirements of your project.
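As an illustration of that comparison step, cosine similarity between two embeddings returned by predict() could be computed as below. contrast() already does this for you; the sketch only makes the idea concrete.

```python
import numpy as np


def cosine_similarity(emb1, emb2):
    # Cosine similarity between two voiceprint embedding vectors
    emb1, emb2 = np.asarray(emb1).ravel(), np.asarray(emb2).ravel()
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))


# score = cosine_similarity(predictor.predict(audio_data='dataset/a_1.wav'),
#                           predictor.predict(audio_data='dataset/a_2.wav'))
```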
python infer_contrast.py --audio_path1=audio/a_1.wav --audio_path2=audio/b_2.wav

The output is similar to the following:
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:13 - ----------- 额外配置参数 -----------
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:15 - audio_path1: dataset/a_1.wav
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:15 - audio_path2: dataset/b_2.wav
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:15 - configs: configs/ecapa_tdnn.yml
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:15 - model_path: models/EcapaTdnn_Fbank/best_model/
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:15 - threshold: 0.6
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:15 - use_gpu: True
[2023-04-02 18:30:48.009149 INFO ] utils:print_arguments:16 - ------------------------------------------------
······································································
W0425 08:29:10.006249 21121 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0425 08:29:10.008555 21121 device_context.cc:465] device: 0, cuDNN Version: 7.6.
成功加载模型参数和优化方法参数:models/EcapaTdnn_Fbank/best_model/model.pth
audio/a_1.wav 和 audio/b_2.wav 不是同一个人,相似度为:-0.09565544128417969
A voiceprint comparison program with a GUI is also provided. Execute infer_contrast_gui.py to start it; the interface is shown below. Select two audio files and click Start to judge whether they belong to the same person.

In voiceprint recognition, the register() and recognition() functions are mainly used. First, use the register() function to register audio into the voiceprint library; you can also directly put the files into the audio_db folder. Then, recognition can be performed with the recognition() function: input an audio clip and the matching speaker will be identified from the voiceprint library.
With the above voiceprint recognition functions, you can build voiceprint recognition according to your own project needs. For example, the program provided below performs voiceprint recognition through recording. First, the voices in the voiceprint library are loaded from the audio_db folder; then, after the user presses Enter, the program records for 3 seconds and uses the recorded audio for voiceprint recognition, matching it against the voices in the library to obtain the user's information. In the same way, you could modify this to perform voiceprint recognition through service requests, for example providing an API for an app to call: when a user logs in via voiceprint in the app, the recorded voice is sent to the backend, voiceprint recognition is performed, and the result is returned to the app, provided the user has already registered a voiceprint and the voice data is stored in the audio_db folder. (A hypothetical sketch of such a service is shown below.)
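As a hypothetical example of the service idea mentioned above, the sketch below wraps the predictor in a minimal Flask HTTP endpoint. Flask is not a dependency of this project, and the route, upload field name, and file paths here are assumptions for illustration only.

```python
from flask import Flask, jsonify, request
from mvector.predict import MVectorPredictor

app = Flask(__name__)
predictor = MVectorPredictor(configs='configs/cam++.yml',
                             model_path='models/CAMPPlus_Fbank/best_model/')


@app.route('/recognition', methods=['POST'])
def recognition():
    # The app uploads the recorded audio as a file field named 'audio' (assumed convention)
    audio_file = request.files['audio']
    audio_path = 'upload.wav'
    audio_file.save(audio_path)
    # Match the uploaded audio against the voiceprint library
    name, score = predictor.recognition(audio_data=audio_path)
    return jsonify({'name': name, 'score': float(score)})


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
```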
python infer_recognition.py

The output is similar to the following:
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:13 - ----------- 额外配置参数 -----------
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:15 - audio_db_path: audio_db/
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:15 - configs: configs/ecapa_tdnn.yml
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:15 - model_path: models/EcapaTdnn_Fbank/best_model/
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:15 - record_seconds: 3
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:15 - threshold: 0.6
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:15 - use_gpu: True
[2023-04-02 18:31:20.521040 INFO ] utils:print_arguments:16 - ------------------------------------------------
······································································
W0425 08:30:13.257884 23889 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.6, Runtime API Version: 10.2
W0425 08:30:13.260191 23889 device_context.cc:465] device: 0, cuDNN Version: 7.6.
成功加载模型参数和优化方法参数:models/ecapa_tdnn/model.pth
Loaded 沙瑞金 audio.
Loaded 李达康 audio.
请选择功能,0为注册音频到声纹库,1为执行声纹识别:0
按下回车键开机录音,录音3秒中:
开始录音......
录音已结束!
请输入该音频用户的名称:夜雨飘零
请选择功能,0为注册音频到声纹库,1为执行声纹识别:1
按下回车键开机录音,录音3秒中:
开始录音......
录音已结束!
识别说话的为:夜雨飘零,相似度为:0.920434
A voiceprint recognition program with a GUI is also provided. Execute infer_recognition_gui.py to start it. Click the 注册音频到声纹库 (register audio to the voiceprint library) button, start speaking when prompted, record for 3 seconds, and then enter the registrant's name. After that, click the 执行声纹识别 (perform voiceprint recognition) button, speak immediately, and wait for the recognition result after the 3-second recording. The 删除用户 (delete user) button deletes a user, and the 实时识别 (real-time recognition) button records and recognizes in real time.

Execute the infer_speaker_diarization.py program and pass in an audio path to perform speaker diarization and display the results. It is recommended that the audio be at least 10 seconds long. For more options, check the program parameters.
python infer_speaker_diarization.py --audio_path=dataset/test_long.wav

The output is similar to the following:
2024-10-10 19:30:40.768 | INFO | mvector.predict:__init__:61 - 成功加载模型参数:models/CAMPPlus_Fbank/best_model/model.pth
2024-10-10 19:30:40.795 | INFO | mvector.predict:__create_index:127 - 声纹特征索引创建完成,一共有3个用户,分别是:['沙瑞金', '夜雨飘零', '李达康']
2024-10-10 19:30:40.796 | INFO | mvector.predict:__load_audio_db:142 - 正在加载声纹库数据...
100%|██████████| 3/3 [00:00<?, ?it/s]
2024-10-10 19:30:40.798 | INFO | mvector.predict:__create_index:127 - 声纹特征索引创建完成,一共有3个用户,分别是:['沙瑞金', '夜雨飘零', '李达康']
2024-10-10 19:30:40.798 | INFO | mvector.predict:__load_audio_db:172 - 声纹库数据加载完成!
识别结果:
{'speaker': '沙瑞金', 'start': 0.0, 'end': 2.0}
{'speaker': '陌生人1', 'start': 4.0, 'end': 7.0}
{'speaker': '李达康', 'start': 7.0, 'end': 8.0}
{'speaker': '沙瑞金', 'start': 9.0, 'end': 12.0}
{'speaker': '沙瑞金', 'start': 13.0, 'end': 14.0}
{'speaker': '陌生人1', 'start': 15.0, 'end': 19.0}
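The segments are plain dictionaries, so they are easy to post-process. For example, total speaking time per speaker can be tallied as in the sketch below, using a few of the segments above.

```python
from collections import defaultdict

# A few of the diarization segments shown above
segments = [
    {'speaker': '沙瑞金', 'start': 0.0, 'end': 2.0},
    {'speaker': '陌生人1', 'start': 4.0, 'end': 7.0},
    {'speaker': '李达康', 'start': 7.0, 'end': 8.0},
]
durations = defaultdict(float)
for seg in segments:
    durations[seg['speaker']] += seg['end'] - seg['start']
print(dict(durations))  # {'沙瑞金': 2.0, '陌生人1': 3.0, '李达康': 1.0}
```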
The result is displayed as shown below. You can control audio playback with the space bar, and click a position to jump to it:

The project also provides a GUI version: execute the infer_speaker_diarization_gui.py program. For more options, check the program parameters.
python infer_speaker_diarization_gui.py

This opens a page where you can identify the speakers:

Note: If a speaker's name is in Chinese, you need to install a suitable font for it to display correctly. Generally, Windows does not need any installation, but Ubuntu does. If Windows is indeed missing fonts, just download the .ttf file here and copy it to C:\Windows\Fonts. On Ubuntu, proceed as follows.
git clone https://github.com/tracyone/program_font && cd program_font && ./install.sh

Then execute the following Python code:

import matplotlib
import shutil
import os
# Locate matplotlib's font directory and copy SimHei into it
path = matplotlib.matplotlib_fname()
path = path.replace('matplotlibrc', 'fonts/ttf/')
print(path)
shutil.copy('/usr/share/fonts/MyFonts/simhei.ttf', path)
# Clear matplotlib's font cache so the new font is picked up
user_dir = os.path.expanduser('~')
shutil.rmtree(f'{user_dir}/.cache/matplotlib', ignore_errors=True)

Reward one dollar to support the author