"At present, there are two schools in the field of deep learning. One is an academic school, which studies powerful and complex model networks and experimental methods in order to pursue higher performance; the other is an engineering school, which aims to implement algorithms more stably and efficiently on hardware platforms. Efficiency is its goal. Although complex models have better performance, high storage space and computing resource consumption are important reasons that make it difficult to effectively apply on various hardware platforms. Therefore, the growing scale of deep neural networks has brought huge challenges to the deployment of deep learning on the mobile terminal, and deep learning model compression and deployment have become one of the research areas that both academia and industry have focused on."
micronet, a model compression and deployment library.

micronet
├── __init__.py
├── base_module
│   ├── __init__.py
│   └── op.py
├── compression
│   ├── README.md
│   ├── __init__.py
│   ├── pruning
│   │   ├── README.md
│   │   ├── __init__.py
│   │   ├── gc_prune.py
│   │   ├── main.py
│   │   ├── models_save
│   │   │   └── models_save.txt
│   │   └── normal_regular_prune.py
│   └── quantization
│       ├── README.md
│       ├── __init__.py
│       ├── wbwtab
│       │   ├── __init__.py
│       │   ├── bn_fuse
│       │   │   ├── bn_fuse.py
│       │   │   ├── bn_fused_model_test.py
│       │   │   └── models_save
│       │   │       └── models_save.txt
│       │   ├── main.py
│       │   ├── models_save
│       │   │   └── models_save.txt
│       │   └── quantize.py
│       └── wqaq
│           ├── __init__.py
│           ├── dorefa
│           │   ├── __init__.py
│           │   ├── main.py
│           │   ├── models_save
│           │   │   └── models_save.txt
│           │   ├── quant_model_test
│           │   │   ├── models_save
│           │   │   │   └── models_save.txt
│           │   │   ├── quant_model_para.py
│           │   │   └── quant_model_test.py
│           │   └── quantize.py
│           └── iao
│               ├── __init__.py
│               ├── bn_fuse
│               │   ├── bn_fuse.py
│               │   ├── bn_fused_model_test.py
│               │   └── models_save
│               │       └── models_save.txt
│               ├── main.py
│               ├── models_save
│               │   └── models_save.txt
│               └── quantize.py
├── data
│   └── data.txt
├── deploy
│   ├── README.md
│   ├── __init__.py
│   └── tensorrt
│       ├── README.md
│       ├── __init__.py
│       ├── calibrator.py
│       ├── eval_trt.py
│       ├── models
│       │   ├── __init__.py
│       │   └── models_trt.py
│       ├── models_save
│       │   └── calibration_seg.cache
│       ├── test_trt.py
│       └── util_trt.py
├── models
│   ├── __init__.py
│   ├── nin.py
│   ├── nin_gc.py
│   └── resnet.py
└── readme_imgs
    ├── code_structure.jpg
    └── micronet.xmind
PyPI

pip install micronet -i https://pypi.org/simple

Install from GitHub

git clone https://github.com/666DZY666/micronet.git
cd micronet
python setup.py install

Verify

python -c "import micronet; print(micronet.__version__)"
--refine, load pretrained floating-point model parameters and quantize on top of them
--W --A, quantization values for weights (W) and activations (A)
cd micronet/compression/quantization/wbwtab
python main.py --W 2 --A 2
python main.py --W 2 --A 32
python main.py --W 3 --A 2
python main.py --W 3 --A 32
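For intuition, here is a minimal, self-contained sketch of the kind of binary/ternary weight quantization that W2/A2-style (wbwtab) training relies on: a sign-based binarizer with a straight-through estimator and a simple threshold ternarizer. The helper names are illustrative and are not micronet's internal API.

```python
import torch


class BinaryWeightSTE(torch.autograd.Function):
    """Forward: w_b = sign(w); backward: pass the gradient straight through (STE)."""

    @staticmethod
    def forward(ctx, w):
        return torch.sign(w)

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output


def ternarize(w, delta_ratio=0.7):
    """Simple ternarizer: weights within +/- delta map to 0, the rest to +/-1."""
    delta = delta_ratio * w.abs().mean()
    return torch.sign(w) * (w.abs() > delta).float()


w = torch.randn(8, requires_grad=True)
print(BinaryWeightSTE.apply(w))  # values in {-1, +1}
print(ternarize(w))              # values in {-1, 0, +1}
```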
--w_bits --a_bits, quantization bit widths for weights (W) and activations (A)

cd micronet/compression/quantization/wqaq/dorefa
python main.py --w_bits 16 --a_bits 16
python main.py --w_bits 8 --a_bits 8
python main.py --w_bits 4 --a_bits 4

cd micronet/compression/quantization/wqaq/iao
Bit-width selection is the same as for dorefa.
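For reference, a minimal sketch of DoReFa-style k-bit weight quantization (the general formula from the DoReFa-Net paper); micronet's wqaq/dorefa/quantize.py may differ in details such as the gradient estimator.

```python
import torch


def quantize_k(x, k):
    """Uniformly quantize x in [0, 1] to k bits."""
    n = float(2 ** k - 1)
    return torch.round(x * n) / n


def dorefa_weight_quant(w, w_bits):
    """Map weights into [0, 1] via tanh scaling, quantize, then map back to [-1, 1]."""
    if w_bits == 32:
        return w
    w_tanh = torch.tanh(w)
    w01 = w_tanh / (2 * w_tanh.abs().max()) + 0.5
    return 2 * quantize_k(w01, w_bits) - 1


print(dorefa_weight_quant(torch.randn(4, 4), w_bits=8))
```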
Single card
QAT/PTQ —> QAFT
Note: QAFT must be performed after QAT/PTQ!
--q_type, quantization type (0 - symmetric, 1 - asymmetric)
--q_level, weight quantization level (0 - per-channel, 1 - per-layer); see the sketch after this option list
--weight_observer, weight observer selection (0 - MinMaxObserver, 1 - MovingAverageMinMaxObserver)
--bn_fuse, BN-fusion flag for quantization
--bn_fuse_calib, BN-fusion calibration flag for quantization
--pretrained_model, pretrained floating-point model flag
--qaft, QAFT flag
--ptq, PTQ flag
--ptq_control, PTQ control flag
--ptq_batch, number of batches used for PTQ
--percentile, percentile used for PTQ calibration (see the calibration sketch after the PTQ command below)
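To make the --q_type and --q_level options concrete, the sketch below shows how symmetric/asymmetric scales and per-channel/per-layer scales are typically computed. It mirrors standard IAO-style quantization and is not necessarily micronet's exact code.

```python
import torch


def symmetric_scale(x, bits=8):
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for int8
    return x.abs().max() / qmax             # zero point is implicitly 0


def asymmetric_params(x, bits=8):
    qmin, qmax = 0, 2 ** bits - 1           # e.g. [0, 255] for uint8
    scale = (x.max() - x.min()) / (qmax - qmin)
    zero_point = torch.round(qmin - x.min() / scale)
    return scale, zero_point


def per_channel_symmetric_scale(w, bits=8):
    """One scale per output channel (dim 0 of a conv/linear weight)."""
    qmax = 2 ** (bits - 1) - 1
    return w.abs().flatten(1).max(dim=1).values / qmax


w = torch.randn(16, 3, 3, 3)
print(symmetric_scale(w), asymmetric_params(w), per_channel_symmetric_scale(w).shape)
```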
QAT
python main.py --q_type 0 --q_level 0 --weight_observer 0
python main.py --q_type 0 --q_level 0 --weight_observer 1
python main.py --q_type 0 --q_level 1
python main.py --q_type 1 --q_level 0
python main.py --q_type 1 --q_level 1
python main.py --q_type 0 --q_level 0 --bn_fuse
python main.py --q_type 0 --q_level 1 --bn_fuse
python main.py --q_type 1 --q_level 0 --bn_fuse
python main.py --q_type 1 --q_level 1 --bn_fuse
python main.py --q_type 0 --q_level 0 --bn_fuse --bn_fuse_calib

PTQ
A pretrained floating-point model must be loaded first; it can be obtained via normal training in the pruning step.
python main.py --refine ../../../pruning/models_save/nin_gc.pth --q_level 0 --bn_fuse --pretrained_model --ptq_control --ptq --batch_size 32 --ptq_batch 200 --percentile 0.999999
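A rough sketch of what percentile-based PTQ calibration does with --ptq_batch and --percentile: run a limited number of batches, collect activation magnitudes, and clip at the given percentile before deriving the quantization scale. The hook placement and helper name below are illustrative assumptions, not micronet's observer implementation.

```python
import torch


@torch.no_grad()
def percentile_calibrate(model, loader, ptq_batch=200, percentile=0.999999, bits=8):
    acts = []
    # observe one layer's output as an example; real observers watch every quant op
    handle = list(model.children())[0].register_forward_hook(
        lambda m, inp, out: acts.append(out.detach().abs().flatten())
    )
    for i, (x, _) in enumerate(loader):
        if i >= ptq_batch:
            break
        model(x)
    handle.remove()
    all_acts = torch.cat(acts)
    if all_acts.numel() > 1_000_000:        # subsample to keep torch.quantile tractable
        all_acts = all_acts[torch.randint(0, all_acts.numel(), (1_000_000,))]
    clip = torch.quantile(all_acts, percentile)   # clipping threshold
    scale = clip / (2 ** (bits - 1) - 1)          # symmetric int8-style scale
    return clip, scale
```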
QAFT

Note: QAFT must be performed after QAT/PTQ!
QAT —> QAFT
python main.py --resume models_save/nin_gc_bn_fused.pth --q_type 0 --q_level 0 --bn_fuse --qaft --lr 0.00001

PTQ —> QAFT
python main.py --resume models_save/nin_gc_bn_fused.pth --q_level 0 --bn_fuse --qaft --lr 0.00001 --ptq

Sparse training —> Pruning —> Fine-tuning
cd micronet/compression/pruning

-sr, sparsity flag
--s, sparsity rate (needs to be tuned for the dataset and model); a sketch of the BN-gamma penalty follows the commands below
--model_type, model type (0 - nin, 1 - nin_gc)
python main.py -sr --s 0.0001 --model_type 0
python main.py -sr --s 0.001 --model_type 1
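Conceptually, `-sr --s` corresponds to network-slimming-style sparse training: an L1 subgradient on every BatchNorm scale (gamma) is added before the optimizer step so that unimportant channels are pushed toward zero. The helper below is a minimal illustration under that assumption, not the exact code in main.py.

```python
import torch.nn as nn


def update_bn_l1(model, s=0.0001):
    """Add the L1 subgradient s * sign(gamma) to each BN scale's gradient."""
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.weight.grad.data.add_(s * m.weight.data.sign())

# assumed usage inside the training loop:
#   loss.backward()
#   update_bn_l1(model, s=args.s)   # only when -sr is set
#   optimizer.step()
```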
--percent, pruning rate (see the channel-selection sketch after the pruning commands below)
--normal_regular, regular-pruning flag and regular-pruning base (if set to N, the number of filters in each layer of the pruned model is a multiple of N)
--model, path of the model obtained after sparse training
--save, path where the pruned model is saved (a default path is provided and can be changed as needed)
python normal_regular_prune.py --percent 0.5 --model models_save/nin_sparse.pth --save models_save/nin_prune.pth
python normal_regular_prune.py --percent 0.5 --normal_regular 8 --model models_save/nin_sparse.pth --save models_save/nin_prune.pth
or
python normal_regular_prune.py --percent 0.5 --normal_regular 16 --model models_save/nin_sparse.pth --save models_save/nin_prune.pth
python gc_prune.py --percent 0.4 --model models_save/nin_gc_sparse.pth
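How --percent typically becomes a pruning decision: gather all BN gammas, take the value at that percentile as a global threshold, and keep only channels whose |gamma| exceeds it. This is a simplified sketch; normal_regular_prune.py additionally rebuilds the network and enforces the regular base N.

```python
import torch
import torch.nn as nn


def channel_keep_masks(model, percent=0.5):
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.sort(gammas).values[int(gammas.numel() * percent)]
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            masks[name] = (m.weight.data.abs() > threshold).float()  # 1 = keep channel
    return masks
```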
--prune_refine, path of the pruned model (fine-tuning is done on top of it)
python main.py --model_type 0 --prune_refine models_save/nin_prune.pth

For gc pruning, you need to pass in the cfg of the new model obtained after pruning, e.g.
python main.py --model_type 1 --gc_prune_refine 154 162 144 304 320 320 608 584

Load the pruned floating-point model and then quantize it
cd micronet/compression/quantization/wqaq/dorefa
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_quant ../../../pruning/models_save/nin_finetune.pth
python main.py --w_bits 8 --a_bits 8 --model_type 1 --prune_quant ../../../pruning/models_save/nin_gc_retrain.pth

cd micronet/compression/quantization/wqaq/iao
QAT/PTQ —> QAFT
Note: QAFT must be performed after QAT/PTQ!
QAT
without BN fusion
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_quant ../../../pruning/models_save/nin_finetune.pth --lr 0.001
python main.py --w_bits 8 --a_bits 8 --model_type 1 --prune_quant ../../../pruning/models_save/nin_gc_retrain.pth --lr 0.001

with BN fusion
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_quant ../../../pruning/models_save/nin_finetune.pth --bn_fuse --pretrained_model --lr 0.001
python main.py --w_bits 8 --a_bits 8 --model_type 1 --prune_quant ../../../pruning/models_save/nin_gc_retrain.pth --bn_fuse --pretrained_model --lr 0.001

PTQ
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_quant ../../../pruning/models_save/nin_finetune.pth --bn_fuse --pretrained_model --ptq_control --ptq --batch_size 32 --ptq_batch 200 --percentile 0.999999

QAFT
Note: QAFT must be performed after QAT/PTQ!
QAT —> QAFT
without BN fusion
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_qaft models_save/nin.pth --qaft --lr 0.00001
python main.py --w_bits 8 --a_bits 8 --model_type 1 --prune_qaft models_save/nin_gc.pth --qaft --lr 0.00001

with BN fusion
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_qaft models_save/nin_bn_fused.pth --bn_fuse --qaft --lr 0.00001
python main.py --w_bits 8 --a_bits 8 --model_type 1 --prune_qaft models_save/nin_gc_bn_fused.pth --bn_fuse --qaft --lr 0.00001

PTQ —> QAFT
without BN fusion
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_qaft models_save/nin.pth --qaft --lr 0.00001 --ptq
python main.py --w_bits 8 --a_bits 8 --model_type 1 --prune_qaft models_save/nin_gc.pth --qaft --lr 0.00001 --ptq

with BN fusion
python main.py --w_bits 8 --a_bits 8 --model_type 0 --prune_qaft models_save/nin_bn_fused.pth --bn_fuse --qaft --lr 0.00001 --ptq
python main.py --w_bits 8 --a_bits 8 --model_type 1 --prune_qaft models_save/nin_gc_bn_fused.pth --bn_fuse --qaft --lr 0.00001 --ptq

cd micronet/compression/quantization/wbwtab
python main.py --W 2 --A 2 --model_type 0 --prune_quant ../../pruning/models_save/nin_finetune.pth
python main.py --W 2 --A 2 --model_type 1 --prune_quant ../../pruning/models_save/nin_gc_retrain.pth

cd micronet/compression/quantization/wbwtab/bn_fuse
--model_type, 1 - nin_gc (with grouped convolution structure); 0 - nin (ordinary convolution structure)
--prune_quant, pruned-and-quantized model flag
--W, weight quantization value
These settings must match those used for quantization training; the defaults can be used directly.
python bn_fuse.py --model_type 1 --W 2
python bn_fuse.py --model_type 1 --prune_quant --W 2
python bn_fuse.py --model_type 1 --W 3
python bn_fuse.py --model_type 0 --W 2
python bn_fused_model_test.py

cd micronet/compression/quantization/wqaq/dorefa/quant_model_test
--model_type, 1 - nin_gc (with grouped convolution structure); 0 - nin (ordinary convolution structure)
--prune_quant, pruned-and-quantized model flag
--w_bits, weight quantization bit width; --a_bits, activation quantization bit width
These settings must match those used for quantization training; the defaults can be used directly.
python quant_model_para.py --model_type 1 --w_bits 8 --a_bits 8
python quant_model_para.py --model_type 1 --prune_quant --w_bits 8 --a_bits 8
python quant_model_para.py --model_type 0 --w_bits 8 --a_bits 8
python quant_model_test.py

Note: --bn_fuse must be set to True during quantization training.
cd micronet/compression/quantization/wqaq/iao/bn_fuse
--model_type, 1 - nin_gc (with grouped convolution structure); 0 - nin (ordinary convolution structure)
--prune_quant, pruned-and-quantized model flag
--w_bits, weight quantization bit width; --a_bits, activation quantization bit width
--q_type, 0 - symmetric; 1 - asymmetric
--q_level, 0 - per-channel; 1 - per-layer
These settings must match those used for quantization training; the defaults can be used directly.
python bn_fuse.py --model_type 1 --w_bits 8 --a_bits 8
python bn_fuse.py --model_type 1 --prune_quant --w_bits 8 --a_bits 8
python bn_fuse.py --model_type 0 --w_bits 8 --a_bits 8
python bn_fuse.py --model_type 0 --w_bits 8 --a_bits 8 --q_type 1 --q_level 1
python bn_fused_model_test.py
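The folding that bn_fuse.py performs can be summarized by the standard Conv+BN identities W' = W * gamma / sqrt(var + eps) and b' = (b - mean) * gamma / sqrt(var + eps) + beta. The sketch below shows the floating-point case only; the actual script also handles the quantized and grouped-convolution variants.

```python
import torch
import torch.nn as nn


def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      conv.stride, conv.padding, conv.dilation, conv.groups, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    # fold the BN scale into the conv weights, and the BN shift into the bias
    fused.weight.data = conv.weight.data * (bn.weight.data / std).reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (bias - bn.running_mean) * bn.weight.data / std + bn.bias.data
    return fused
```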
CPU and GPU (single card and multi-card) are now supported.
--cpu, use the CPU; --gpu_id, select which GPU(s) to use
python main.py --cpu
python main.py --gpu_id 0
or
python main.py --gpu_id 1
python main.py --gpu_id 0,1
or
python main.py --gpu_id 0,1,2

By default, all GPUs on the server are used.
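The --cpu / --gpu_id behaviour corresponds to the usual PyTorch device-placement pattern sketched below; the helper name and exact parsing are assumptions, not necessarily identical to the scripts' code.

```python
import torch
import torch.nn as nn


def place_model(model, cpu=False, gpu_id=""):
    if cpu or not torch.cuda.is_available():
        return model.to("cpu")
    if gpu_id:                                           # e.g. "0" or "0,1,2"
        ids = [int(i) for i in gpu_id.split(",")]
    else:
        ids = list(range(torch.cuda.device_count()))     # default: all visible GPUs
    model = model.to(f"cuda:{ids[0]}")
    return nn.DataParallel(model, device_ids=ids) if len(ids) > 1 else model
```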
Currently, only relevant core module code is provided, and a complete runnable demo will be added later.
A model can be quantized (high-bit (>2b), low-bit (≤2b)/ternary and binary) simply by replacing op with quant_op.
import torch.nn as nn
import torch.nn.functional as F

# some base ops, such as ``Add``, ``Concat``
from micronet.base_module.op import *

# ``quantize`` is the quant module; ``QuantConv2d``, ``QuantLinear``, ``QuantMaxPool2d``, ``QuantReLU`` are quant ops
from micronet.compression.quantization.wbwtab.quantize import (
    QuantConv2d as quant_conv_wbwtab,
)
from micronet.compression.quantization.wbwtab.quantize import (
    ActivationQuantizer as quant_relu_wbwtab,
)
from micronet.compression.quantization.wqaq.dorefa.quantize import (
    QuantConv2d as quant_conv_dorefa,
)
from micronet.compression.quantization.wqaq.dorefa.quantize import (
    QuantLinear as quant_linear_dorefa,
)
from micronet.compression.quantization.wqaq.iao.quantize import (
    QuantConv2d as quant_conv_iao,
)
from micronet.compression.quantization.wqaq.iao.quantize import (
    QuantLinear as quant_linear_iao,
)
from micronet.compression.quantization.wqaq.iao.quantize import (
    QuantMaxPool2d as quant_max_pool_iao,
)
from micronet.compression.quantization.wqaq.iao.quantize import (
    QuantReLU as quant_relu_iao,
)


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        self.max_pool = nn.MaxPool2d(kernel_size=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.max_pool(self.conv1(x)))
        x = self.relu(self.max_pool(self.conv2(x)))
        x = x.view(-1, 320)
        x = self.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


class QuantLeNetWbWtAb(nn.Module):
    def __init__(self):
        super(QuantLeNetWbWtAb, self).__init__()
        self.conv1 = quant_conv_wbwtab(1, 10, kernel_size=5)
        self.conv2 = quant_conv_wbwtab(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        self.max_pool = nn.MaxPool2d(kernel_size=2)
        self.relu = quant_relu_wbwtab()

    def forward(self, x):
        x = self.relu(self.max_pool(self.conv1(x)))
        x = self.relu(self.max_pool(self.conv2(x)))
        x = x.view(-1, 320)
        x = self.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


class QuantLeNetDoReFa(nn.Module):
    def __init__(self):
        super(QuantLeNetDoReFa, self).__init__()
        self.conv1 = quant_conv_dorefa(1, 10, kernel_size=5)
        self.conv2 = quant_conv_dorefa(10, 20, kernel_size=5)
        self.fc1 = quant_linear_dorefa(320, 50)
        self.fc2 = quant_linear_dorefa(50, 10)
        self.max_pool = nn.MaxPool2d(kernel_size=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.max_pool(self.conv1(x)))
        x = self.relu(self.max_pool(self.conv2(x)))
        x = x.view(-1, 320)
        x = self.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


class QuantLeNetIAO(nn.Module):
    def __init__(self):
        super(QuantLeNetIAO, self).__init__()
        self.conv1 = quant_conv_iao(1, 10, kernel_size=5)
        self.conv2 = quant_conv_iao(10, 20, kernel_size=5)
        self.fc1 = quant_linear_iao(320, 50)
        self.fc2 = quant_linear_iao(50, 10)
        self.max_pool = quant_max_pool_iao(kernel_size=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.max_pool(self.conv1(x)))
        x = self.relu(self.max_pool(self.conv2(x)))
        x = x.view(-1, 320)
        x = self.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


lenet = LeNet()
quant_lenet_wbwtab = QuantLeNetWbWtAb()
quant_lenet_dorefa = QuantLeNetDoReFa()
quant_lenet_iao = QuantLeNetIAO()

print("***ori_model***\n", lenet)
print("\n***quant_model_wbwtab***\n", quant_lenet_wbwtab)
print("\n***quant_model_dorefa***\n", quant_lenet_dorefa)
print("\n***quant_model_iao***\n", quant_lenet_iao)

print("\nquant_model is ready")
print("micronet is ready")

A model can be quantized (high-bit (>2b), low-bit (≤2b)/ternary and binary) simply by using micronet.compression.quantization.quantize.prepare(model).
import torch.nn as nn
import torch.nn.functional as F

# some base ops, such as ``Add``, ``Concat``
from micronet.base_module.op import *

import micronet.compression.quantization.wqaq.dorefa.quantize as quant_dorefa
import micronet.compression.quantization.wqaq.iao.quantize as quant_iao


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)
        self.max_pool = nn.MaxPool2d(kernel_size=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.max_pool(self.conv1(x)))
        x = self.relu(self.max_pool(self.conv2(x)))
        x = x.view(-1, 320)
        x = self.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)


"""
--w_bits --a_bits, quantization bit widths for weights (W) and activations (A)
--q_type, quantization type (0 - symmetric, 1 - asymmetric)
--q_level, weight quantization level (0 - per-channel, 1 - per-layer)
--weight_observer, weight observer selection (0 - MinMaxObserver, 1 - MovingAverageMinMaxObserver)
--bn_fuse, BN-fusion flag for quantization
--bn_fuse_calib, BN-fusion calibration flag for quantization
--pretrained_model, pretrained floating-point model
--qaft, QAFT flag
--ptq, PTQ flag
--percentile, percentile used for PTQ calibration
"""
lenet = LeNet()
quant_lenet_dorefa = quant_dorefa.prepare(lenet, inplace=False, a_bits=8, w_bits=8)
quant_lenet_iao = quant_iao.prepare(
    lenet,
    inplace=False,
    a_bits=8,
    w_bits=8,
    q_type=0,
    q_level=0,
    weight_observer=0,
    bn_fuse=False,
    bn_fuse_calib=False,
    pretrained_model=False,
    qaft=False,
    ptq=False,
    percentile=0.9999,
)
# if ptq == False, do QAT/QAFT, which requires training
# if ptq == True, do PTQ, which does not require training
# see micronet/compression/quantization/wqaq/iao/main.py for reference

print("***ori_model***\n", lenet)
print("\n***quant_model_dorefa***\n", quant_lenet_dorefa)
print("\n***quant_model_iao***\n", quant_lenet_iao)

print("\nquant_model is ready")
print("micronet is ready")

python -c "import micronet; micronet.quant_test_manual()"
python -c "import micronet; micronet.quant_test_auto()"

When "quant_model is ready" is printed, micronet is ready.
For BN fusion and the quantized-inference simulation test, refer to the corresponding section above.
The following is a CIFAR-10 example; you can try other combinations of compression methods on more redundant models and larger datasets.
| Type | W (bits) | A (bits) | Acc | GFLOPs | Params (M) | Size (MB) | Compression rate (size) | Acc loss |
|---|---|---|---|---|---|---|---|---|
| Original model (nin) | FP32 | FP32 | 91.01% | 0.15 | 0.67 | 2.68 | *** | *** |
| Grouped convolution structure (nin_gc) | FP32 | FP32 | 91.04% | 0.15 | 0.58 | 2.32 | 13.43% | -0.03% |
| Pruning | FP32 | FP32 | 90.26% | 0.09 | 0.32 | 1.28 | 52.24% | 0.75% |
| Quantization | 1 | FP32 | 90.93% | *** | 0.58 | 0.204 | 92.39% | 0.08% |
| Quantization | 1.5 | FP32 | 91.00% | *** | 0.58 | 0.272 | 89.85% | 0.01% |
| Quantization | 1 | 1 | 86.23% | *** | 0.58 | 0.204 | 92.39% | 4.78% |
| Quantization | 1.5 | 1 | 86.48% | *** | 0.58 | 0.272 | 89.85% | 4.53% |
| Quantization (DoReFa) | 8 | 8 | 91.03% | *** | 0.58 | 0.596 | 77.76% | -0.02% |
| Quantization (IAO, full quantization, symmetric/per-channel/bn_fuse) | 8 | 8 | 90.99% | *** | 0.58 | 0.596 | 77.76% | 0.02% |
| Grouping + pruning + quantization | 1.5 | 1 | 86.13% | *** | 0.32 | 0.19 | 92.91% | 4.88% |
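The compression rate is measured against the size of the original nin model (2.68 MB); for example, for the W=1 quantized model it is 1 - 0.204 / 2.68 ≈ 92.39%.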
--train_batch_size 256, single card
Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or -1
XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks
An Empirical Study of Binary Neural Networks' Optimisation
A Review of Binarized Neural Networks