Reference paper: "Distilling Task-Specific Knowledge from BERT into Simple Neural Networks"
The experiments are based on TextCNN and BiLSTM (GRU) student models, implemented with Keras and PyTorch respectively.
The experimental data is split 1 : 8 : 1 into a labeled training set, an unlabeled transfer set (used for distillation), and a test set.
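For illustration only, the 1 : 8 : 1 split could be produced with a helper like the one below; this is a hypothetical sketch, not necessarily how the repo's own preprocessing scripts prepare the data.

```python
import random

def split_1_8_1(samples, seed=42):
    """Shuffle and split samples into labeled-train / transfer / test sets (1 : 8 : 1)."""
    random.seed(seed)
    samples = samples[:]
    random.shuffle(samples)
    n = len(samples)
    n_labeled = n // 10           # 10% kept with labels for supervised training
    n_transfer = n * 8 // 10      # 80% treated as unlabeled transfer data for distillation
    labeled = samples[:n_labeled]
    transfer = samples[n_labeled:n_labeled + n_transfer]
    test = samples[n_labeled + n_transfer:]
    return labeled, transfer, test
```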
Preliminary results on a binary sentiment-classification dataset of clothing reviews are as follows:
The accuracy of the small models (TextCNN & BiLSTM) is between 0.80 and 0.81
The accuracy of the BERT model is between 0.90 and 0.91
The accuracy of the distilled model is between 0.87 and 0.88
The experimental results are broadly consistent with the paper's conclusions and in line with expectations.
Other more effective distillation schemes will be tried later
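For context, the distillation objective in the reference paper matches the student's logits to BERT's logits with a mean-squared-error term, mixed with the usual cross-entropy on the labeled split. The following is a minimal PyTorch sketch of that idea; the `alpha` weighting and variable names are illustrative, not the exact code in distill.py.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels=None, alpha=0.5):
    """Logit-matching distillation loss in the spirit of Tang et al. (2019)."""
    # The paper reports that MSE on logits works better than soft cross-entropy.
    mse = F.mse_loss(student_logits, teacher_logits)
    if labels is None:
        # Unlabeled transfer examples: only the teacher's logits supervise the student.
        return mse
    ce = F.cross_entropy(student_logits, labels)
    return alpha * ce + (1.0 - alpha) * mse
```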
First, fine-tune BERT:
python ptbert.py
Then distill BERT's knowledge into the small model.
You need to decompress data/cache/word2vec.gz first.
Then run:
python distill.py
Adjusting use_aug and the related parameters in that file enables two of the data-augmentation methods described in the paper (masking and n-gram sampling); a rough sketch of both follows.
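The sketch below shows what the two augmentation strategies do, assuming already-tokenized input and illustrative probabilities (p_mask, p_ng); the actual implementation and parameter names in distill.py may differ.

```python
import random

def mask_tokens(tokens, p_mask=0.1, mask_token="[MASK]"):
    """Masking: replace each token with a mask symbol with probability p_mask."""
    return [mask_token if random.random() < p_mask else t for t in tokens]

def ngram_sample(tokens, p_ng=0.25, min_n=1, max_n=5):
    """N-gram sampling: with probability p_ng, keep only a randomly chosen n-gram."""
    if random.random() >= p_ng or len(tokens) == 0:
        return tokens
    n = min(random.randint(min_n, max_n), len(tokens))
    start = random.randint(0, len(tokens) - n)
    return tokens[start:start + n]
```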