Simplified Chinese | English
OpenAI has open-sourced the Whisper project, which it claims reaches human-level accuracy on English speech recognition and also supports automatic speech recognition in 98 other languages. Whisper provides automatic speech recognition and translation tasks: it can transcribe speech in many languages into text and can also translate that text into English. The main purpose of this project is to fine-tune the Whisper model with Lora, supporting training with timestamp data, training without timestamp data, and training on non-speech data. Several models are currently open-sourced and can be viewed under the openai organization on HuggingFace; the commonly used ones are listed below. The project also supports CTranslate2 accelerated inference and GGML accelerated inference. As a reminder, accelerated inference also works with a directly converted original Whisper model and does not necessarily require fine-tuning. Windows desktop applications, Android applications and server deployment are all supported.
Everyone is welcome to scan the QR code to join the Knowledge Planet (left) or the QQ group (right) for discussion. The Knowledge Planet provides the project's model files, model files for the blogger's other related projects, and some other resources.
Usage environment: PyTorch 2.1.0 with CUDA 11.8 (see the installation commands below).

The main programs and directories in this project are:
- aishell.py: prepare the AIShell training data.
- finetune.py: fine-tune the model.
- merge_lora.py: merge the Whisper and Lora models.
- evaluation.py: evaluate the fine-tuned model or the original Whisper model.
- infer.py: predict with the fine-tuned model, or with a Whisper model from transformers.
- infer_ct2.py: predict with a model converted to CTranslate2; refer mainly to this program for its usage.
- infer_gui.py: GUI-based prediction with the fine-tuned model or a Whisper model from transformers.
- infer_server.py: deploy the fine-tuned model or a Whisper model from transformers to a server for clients to call.
- convert-ggml.py: convert the model to GGML format for the Android or Windows applications.
- AndroidDemo: source code for deploying the model to Android.
- WhisperDesktop: programs for the Windows desktop application.

| Model | Language | aishell_test | test_net | test_meeting | Cantonese test set | Model acquisition |
|---|---|---|---|---|---|---|
| whisper-tiny | Chinese | 0.31898 | 0.40482 | 0.75332 | N/A | Join the knowledge planet to obtain |
| whisper-base | Chinese | 0.22196 | 0.30404 | 0.50378 | N/A | Join the knowledge planet to obtain |
| whisper-small | Chinese | 0.13897 | 0.18417 | 0.31154 | N/A | Join the knowledge planet to obtain |
| whisper-medium | Chinese | 0.09538 | 0.13591 | 0.26669 | N/A | Join the knowledge planet to obtain |
| whisper-large | Chinese | 0.08969 | 0.12933 | 0.23439 | N/A | Join the knowledge planet to obtain |
| whisper-large-v2 | Chinese | 0.08817 | 0.12332 | 0.26547 | N/A | Join the knowledge planet to obtain |
| whisper-large-v3 | Chinese | 0.08086 | 0.11452 | 0.19878 | 0.18782 | Join the knowledge planet to obtain |
| Model | Language | Dataset | aishell_test | test_net | test_meeting | Cantonese test set | Model acquisition |
|---|---|---|---|---|---|---|---|
| whisper-tiny | Chinese | AIShell | 0.13043 | 0.4463 | 0.57728 | N/A | Join the knowledge planet to obtain |
| whisper-base | Chinese | AIShell | 0.08999 | 0.33089 | 0.40713 | N/A | Join the knowledge planet to obtain |
| whisper-small | Chinese | AIShell | 0.05452 | 0.19831 | 0.24229 | N/A | Join the knowledge planet to obtain |
| whisper-medium | Chinese | AIShell | 0.03681 | 0.13073 | 0.16939 | N/A | Join the knowledge planet to obtain |
| whisper-large-v2 | Chinese | AIShell | 0.03139 | 0.12201 | 0.15776 | N/A | Join the knowledge planet to obtain |
| whisper-large-v3 | Chinese | AIShell | 0.03660 | 0.09835 | 0.13706 | 0.20060 | Join the knowledge planet to obtain |
| whisper-large-v3 | Cantonese | Cantonese dataset | 0.06857 | 0.11369 | 0.17452 | 0.03524 | Join the knowledge planet to obtain |
| whisper-tiny | Chinese | WenetSpeech | 0.17711 | 0.24783 | 0.39226 | N/A | Join the knowledge planet to obtain |
| whisper-base | Chinese | WenetSpeech | 0.14548 | 0.17747 | 0.30590 | N/A | Join the knowledge planet to obtain |
| whisper-small | Chinese | WenetSpeech | 0.08484 | 0.11801 | 0.23471 | N/A | Join the knowledge planet to obtain |
| whisper-medium | Chinese | WenetSpeech | 0.05861 | 0.08794 | 0.19486 | N/A | Join the knowledge planet to obtain |
| whisper-large-v2 | Chinese | WenetSpeech | 0.05443 | 0.08367 | 0.19087 | N/A | Join the knowledge planet to obtain |
| whisper-large-v3 | Chinese | WenetSpeech | 0.04947 | 0.10711 | 0.17429 | 0.47431 | Join the knowledge planet to obtain |
The speed test audio is dataset/test_long.wav, which is 3 minutes long. The test program is tools/run_compute.sh.

| Acceleration method | tiny | base | small | medium | large-v2 | large-v3 |
|---|---|---|---|---|---|---|
| Transformers (fp16 + batch_size=16) | 1.458s | 1.671s | 2.331s | 11.071s | 4.779s | 12.826s |
| Transformers (fp16 + batch_size=16 + Compile) | 1.477s | 1.675s | 2.357s | 11.003s | 4.799s | 12.643s |
| Transformers (fp16 + batch_size=16 + BetterTransformer) | 1.461s | 1.676s | 2.301s | 11.062s | 4.608s | 12.505s |
| Transformers (fp16 + batch_size=16 + Flash Attention 2) | 1.436s | 1.630s | 2.258s | 10.533s | 4.344s | 11.651s |
| Transformers (fp16 + batch_size=16 + Compile + BetterTransformer) | 1.442s | 1.686s | 2.277s | 11.000s | 4.543s | 12.592s |
| Transformers (fp16 + batch_size=16 + Compile + Flash Attention 2) | 1.409s | 1.643s | 2.220s | 10.390s | 4.377s | 11.703s |
| Faster Whisper (fp16 + beam_size=1) | 2.179s | 1.492s | 2.327s | 3.752s | 5.677s | 31.541s |
| Faster Whisper (8-bit + beam_size=1) | 2.609s | 1.728s | 2.744s | 4.688s | 6.571s | 29.307s |
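For reference, the sketch below shows how the accelerations compared above (fp16, Compile, Flash Attention 2) are typically enabled with the transformers API. It is only illustrative: the project's infer.py exposes these options as command-line parameters, and the model path here is just an example.

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

# Example path: a merged fine-tuned model or an original Whisper model both work here.
model_path = "models/whisper-tiny-finetune"

# fp16 weights plus Flash Attention 2 (requires the flash-attn package).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
)
# Optional: compile the forward pass with PyTorch 2.x.
model.forward = torch.compile(model.forward)

processor = AutoProcessor.from_pretrained(model_path)
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    batch_size=16,
    device="cuda",
)
print(pipe("dataset/test_long.wav", return_timestamps=True))
```

BetterTransformer, by contrast, is usually enabled via model.to_bettertransformer(), which requires the optimum package.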
| Data list processing method | AIShell | WenetSpeech |
|---|---|---|
| Add punctuation | Join the knowledge planet to obtain | Join the knowledge planet to obtain |
| Add punctuation and timestamps | Join the knowledge planet to obtain | Join the knowledge planet to obtain |
Important Note:
- aishell_test is the test set of AIShell; test_net and test_meeting are the test sets of WenetSpeech.
- The speed test audio is dataset/test_long.wav, which is 3 minutes long.

Install the PyTorch environment with conda:

```shell
conda install pytorch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
```

Alternatively, use Docker. Pull the image:

```shell
sudo docker pull pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
```

Then enter the image and mount the current path to the container's /workspace directory:

```shell
sudo nvidia-docker run --name pytorch -it -v $PWD:/workspace pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel /bin/bash
```

Install the required dependencies:

```shell
python -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
```

On Windows, bitsandbytes needs to be installed from a prebuilt wheel:

```shell
python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.41.2.post2-py3-none-win_amd64.whl
```

The training data is a JSON Lines data list: each line is a JSON object in the format shown below. This project provides aishell.py, which prepares the AIShell dataset; running it automatically downloads the data and generates training and test sets in this format. Note: the program can skip the download step if you pass AIShell's compressed archive directly. Downloading from the script can be very slow, so you can use a downloader such as Thunder first and then pass the downloaded archive path via the --filepath parameter, e.g. /home/test/data_aishell.tgz.
Tips:
- If you do not train with timestamp data, the data may omit the sentences field.
- If the data is in only one language, the language field may be omitted.
- If you train on non-speech (empty audio) data, the sentences field is [], the sentence field is "", and the language field may be absent.

```json
{
"audio" : {
"path" : " dataset/0.wav "
},
"sentence" : "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。 " ,
"language" : " Chinese " ,
"sentences" : [
{
"start" : 0 ,
"end" : 1.4 ,
"text" : "近几年, "
},
{
"start" : 1.42 ,
"end" : 8.4 ,
"text" : "不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。 "
}
],
"duration" : 7.37
}
```
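For a custom dataset, such a data list can be generated with a few lines of Python. This is only a sketch mirroring the format above; the audio path, transcript, timestamps, and the output file name dataset/train.json are placeholders.

```python
import json

# Placeholder sample in the format shown above; replace with your own audio
# paths, transcripts, timestamps, and durations.
samples = [
    {
        "audio": {"path": "dataset/0.wav"},
        "sentence": "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。",
        "language": "Chinese",
        "sentences": [
            {"start": 0, "end": 1.4, "text": "近几年,"},
            {"start": 1.42, "end": 8.4, "text": "不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。"},
        ],
        "duration": 7.37,
    },
]

# Write a JSON Lines data list: one JSON object per line.
with open("dataset/train.json", "w", encoding="utf-8") as f:
    for sample in samples:
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```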
Once the data is prepared, you can start fine-tuning the model. The two most important training parameters are: --base_model, which specifies the Whisper model to fine-tune; this value must exist on HuggingFace and does not need to be downloaded in advance, because it is downloaded automatically when training starts. You can also download it in advance, in which case --base_model is set to the local path and --local_files_only is set to True. The second is --output_dir, the path where the Lora checkpoints are saved during training, since we fine-tune the model with Lora. If you have enough GPU memory, it is best to set --use_8bit to False, which makes training much faster. For more parameters, please check this program.

The single-GPU training command is as follows. On Windows, the CUDA_VISIBLE_DEVICES parameter can be omitted.
```shell
CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model=openai/whisper-tiny --output_dir=output/
```

There are two ways to launch multi-GPU training: torchrun and accelerate. Developers can use whichever they prefer.
To launch multi-GPU training with torchrun, specify the number of GPUs with --nproc_per_node:

```shell
torchrun --nproc_per_node=2 finetune.py --base_model=openai/whisper-tiny --output_dir=output/
```

To launch multi-GPU training with accelerate, first configure the training parameters. The process asks the developer several questions; the defaults are mostly fine, but a few parameters need to be set according to your actual setup.
```shell
accelerate config
```

The process looks roughly like this:
```
--------------------------------------------------------------------
In which compute environment are you running?
This machine
--------------------------------------------------------------------
Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
--------------------------------------------------------------------
Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /home/test/.cache/huggingface/accelerate/default_config.yaml
```
After the configuration is complete, you can use the following command to view the configuration.
```shell
accelerate env
```

The command to start training is as follows:
```shell
accelerate launch finetune.py --base_model=openai/whisper-tiny --output_dir=output/
```

The output log looks like this:
```
{'loss': 0.9098, 'learning_rate': 0.000999046843662503, 'epoch': 0.01}
{'loss': 0.5898, 'learning_rate': 0.0009970611012927184, 'epoch': 0.01}
{'loss': 0.5583, 'learning_rate': 0.0009950753589229333, 'epoch': 0.02}
{'loss': 0.5469, 'learning_rate': 0.0009930896165531485, 'epoch': 0.02}
{'loss': 0.5959, 'learning_rate': 0.0009911038741833634, 'epoch': 0.03}
```

After fine-tuning is completed, there are two models: the first is the Whisper base model and the second is the Lora model. These two models need to be merged before any subsequent operation. The merge program only needs two parameters: --lora_model specifies the path of the Lora model saved after training, which is actually the checkpoint folder path, and --output_dir is the directory where the merged model is saved.
```shell
python merge_lora.py --lora_model=output/whisper-tiny/checkpoint-best/ --output_dir=models/
```

Run the following program to evaluate the model. The two most important parameters are: --model_path, which specifies the merged model path (the original Whisper model is also supported directly, e.g. openai/whisper-large-v2), and --metric, which specifies the evaluation metric, such as character error rate cer or word error rate wer. Tip: a model that has not been fine-tuned may output punctuation, which affects the accuracy. For more parameters, please check this program.
```shell
python evaluation.py --model_path=models/whisper-tiny-finetune --metric=cer
```
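As background on the cer metric above, here is a minimal sketch of how a character error rate is typically computed with the evaluate library. It is only an illustration, not the project's evaluation.py, and the example strings are made up.

```python
import evaluate

# Hypothetical prediction/reference pair, just to illustrate the cer metric
# that evaluation.py reports.
predictions = ["近几年不但我用书给女儿压碎"]
references = ["近几年不但我用书给女儿压岁"]

cer_metric = evaluate.load("cer")  # requires the jiwer package
cer = cer_metric.compute(predictions=predictions, references=references)
print(f"CER: {cer:.5f}")
```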
Execute the following program for speech recognition. It uses transformers to call the fine-tuned model or the original Whisper model directly for prediction, and supports PyTorch 2.0 compiler acceleration, FlashAttention2 acceleration, and BetterTransformer acceleration. The --audio_path parameter specifies the audio to predict, and --model_path specifies the merged model path (the original Whisper model is also supported directly, e.g. openai/whisper-large-v2). For more parameters, please check this program.

```shell
python infer.py --audio_path=dataset/test.wav --model_path=models/whisper-tiny-finetune
```

For prediction with the GUI, --model_path specifies the Transformers model. For more parameters, please check this program.
```shell
python infer_gui.py --model_path=models/whisper-tiny-finetune
```

The interface after startup looks like this:

To deploy as a web service: --host specifies the address the service listens on; it is set to 0.0.0.0, so it can be accessed from any address. --port specifies the port number to use. --model_path specifies the Transformers model. --num_workers specifies how many threads are used for concurrent inference, which matters for web deployment: with multiple concurrent requests, inference can run at the same time. For more parameters, please check this program.
```shell
python infer_server.py --host=0.0.0.0 --port=5000 --model_path=models/whisper-tiny-finetune --num_workers=2
```

Currently a recognition endpoint /recognition is provided; its parameters are as follows.
| Field | Required | Type | Default | Description |
|---|---|---|---|---|
| audio | yes | File | | Audio file to be recognized |
| to_simple | no | int | 1 | Whether to convert traditional Chinese to simplified Chinese |
| remove_pun | no | int | 0 | Whether to remove punctuation marks |
| task | no | String | transcribe | Recognition task type; supports transcribe and translate |
| language | no | String | zh | Language, as an abbreviation; if None, the language is detected automatically |
Return result:
| Field | Type | Description |
|---|---|---|
| results | list | Segmented recognition results |
| +result | str | The recognized text of each segment |
| +start | int | Start time of each segment, in seconds |
| +end | int | End time of each segment, in seconds |
| code | int | Error code; 0 means recognition succeeded |
An example response is as follows:
```json
{
   "results": [
      {
         "result": "近几年,不但我用书给女儿压碎,也全说亲朋不要给女儿压碎钱,而改送压碎书。",
         "start": 0,
         "end": 8
      }
   ],
   "code": 0
}
```

For easier understanding, here is Python code that calls the web interface. The following shows how to call /recognition.
```python
import requests

# Upload the audio file and send the other parameters as form fields
# (when `files` is used, form data goes in `data`, not `json`).
response = requests.post(url="http://127.0.0.1:5000/recognition",
                         files=[("audio", ("test.wav", open("dataset/test.wav", "rb"), "audio/wav"))],
                         data={"to_simple": 1, "remove_pun": 0, "language": "zh", "task": "transcribe"},
                         timeout=20)
print(response.text)
```
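If you prefer structured access, the response can also be parsed as JSON and iterated using the fields documented above. A minimal self-contained sketch (the dict literal just copies the example response; in practice you would use data = response.json()):

```python
# Example response copied from above; replace with data = response.json().
data = {
    "results": [
        {"result": "近几年,不但我用书给女儿压碎,也全说亲朋不要给女儿压碎钱,而改送压碎书。", "start": 0, "end": 8}
    ],
    "code": 0,
}
if data["code"] == 0:
    for seg in data["results"]:
        print(f"[{seg['start']} - {seg['end']}]: {seg['result']}")
```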
The provided test pages are as follows. The home page http://127.0.0.1:5000/ looks like this:

The documentation page http://127.0.0.1:5000/docs looks like this:
Here is how to accelerate inference with CTranslate2. Although pipeline inference with Transformers is already very fast, you must first convert the model: the merged model is converted into a CTranslate2 model. In the following command, the --model parameter specifies the merged model path (the original Whisper model is also supported directly, e.g. openai/whisper-large-v2), the --output_dir parameter specifies the path of the converted CTranslate2 model, and the --quantization parameter specifies the quantization size; if you do not want a quantized model, simply remove this parameter.
```shell
ct2-transformers-converter --model models/whisper-tiny-finetune --output_dir models/whisper-tiny-finetune-ct2 --copy_files tokenizer.json preprocessor_config.json --quantization float16
```
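As an aside, the converted CTranslate2 model can also be loaded directly with the faster-whisper library. The sketch below is only illustrative and is not the project's infer_ct2.py; the decoding parameters mirror the configuration shown in the output further down.

```python
from faster_whisper import WhisperModel

# Load the CTranslate2-converted model (fp16 on GPU; use compute_type="int8" for 8-bit).
model = WhisperModel("models/whisper-tiny-finetune-ct2", device="cuda", compute_type="float16")

# Transcribe a Chinese audio file with beam search and no VAD filtering.
segments, info = model.transcribe("dataset/test.wav", language="zh", beam_size=10, vad_filter=False)
for segment in segments:
    print(f"[{segment.start:.1f} - {segment.end:.1f}]: {segment.text}")
```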
Execute the following program for speech recognition. The --audio_path parameter specifies the audio to predict, and --model_path specifies the converted CTranslate2 model. For more parameters, please check this program.

```shell
python infer_ct2.py --audio_path=dataset/test.wav --model_path=models/whisper-tiny-finetune-ct2
```

The output looks like this:
```
----------- Configuration Arguments -----------
audio_path: dataset/test.wav
model_path: models/whisper-tiny-finetune-ct2
language: zh
use_gpu: True
use_int8: False
beam_size: 10
num_workers: 1
vad_filter: False
local_files_only: True
------------------------------------------------
[0.0 - 8.0]:近几年,不但我用书给女儿压碎,也全说亲朋不要给女儿压碎钱,而改送压碎书。
```

The source code for installing and deploying the model on Android is in the AndroidDemo directory. See README.md in that directory for details.




The Windows desktop program is in the WhisperDesktop directory. See README.md in that directory for details.

Tip the author one dollar to show your support