Some Tricks of PyTorch
Changelog
- November 29, 2019: Updated some model design techniques and inference acceleration content, and added an introductory link for Apex. Also removed the TFRecord item, since at the time I believed PyTorch could not use TFRecord.
- November 30, 2019: Added the meaning of MAC and the ShuffleNetV2 paper link.
- December 2, 2019: Regarding the earlier claim that PyTorch cannot use TFRecord: today I saw an answer at https://www.zhihu.com/question/358632497 and learned something new.
- December 23, 2019: Added several introductory articles on model compression and quantization.
- February 7, 2020: Excerpted a few points worth noting from an article and added them to the code-level section.
- April 30, 2020:
  - Added a GitHub backup of this document.
  - Added links introducing the fusion of convolutional layers and BN layers.
  - One more note: for many of the articles and answers referenced earlier, the links and the corresponding content summaries were not connected, so readers may have had questions they could not raise with the original authors. My sincere apologies.
  - Adjusted some content and tried to match it to the corresponding reference links.
- May 18, 2020: Added some tips on saving GPU memory in PyTorch and slightly adjusted the format. Also fixed a previous error: the suggestion should be non_blocking=True, not non_blocking=False.
- January 6, 2021: Adjusted some of the introduction on reading image data.
- January 13, 2021: Added a strategy for accelerating inference. I should update the GitHub document first; updating the Zhihu answer is more troublesome because changes cannot be diffed, which makes it hard to keep in sync.
- June 26, 2022: Re-adjusted the format and arrangement of the content below, and added more references and some recent findings.
- June 20, 2024: Simple format adjustment; added an idea for accelerating data reading based on the tar format and IterableDataset.
PyTorch speed up
Note
Original document: https://www.yuque.com/lart/ugkv9f/ugysgn
Statement: Most of the content comes from sharing on Zhihu and other blogs, and is only listed here as a collection. More suggestions are welcome.
Zhihu answers (likes are welcome):
- pytorch dataloader Data loading takes up most of the time. How do you guys solve it? - People's Artist's answer - Zhihu
- When using pytorch, there are too many training set data to reach tens of millions, and what should I do if the Dataloader loads very slowly? - People's Artist's answer - Zhihu
Speed up preprocessing
- Minimize the preprocessing performed every time data is read. Fixed operations such as resize can be applied in advance and the results saved, then used directly during training.
- Move preprocessing to the GPU to accelerate it.
  - On Linux you can use NVIDIA/DALI.
  - Use tensor-based image processing operations.
IO speed up
- mmcv provides relatively efficient and comprehensive support for data reading: OpenMMLab: MMCV core component analysis (III): FileClient
Use faster image processing
- opencv is generally faster than PIL.
- Note that PIL's lazy-loading strategy makes its open look faster than opencv's imread, but it does not actually load the full data. You can call the load() method on the object returned by open to load the data manually; the timing is then a fair comparison (see the small example after this list).
- For jpeg reads, you can try jpeg4py.
- Save images as bmp to reduce decoding time.
- Discussion on the speed of different image processing libraries: What is the difference between the implementation method and the reading speed of Python's various imread functions? - Zhihu
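A tiny example of the PIL lazy-loading point above (the file name example.jpg is a placeholder): timing open alone undercounts the real decoding cost.

```python
from PIL import Image

img = Image.open("example.jpg")  # lazy: only the header is parsed here
img.load()  # forces the full decode; time this for a fair comparison with cv2.imread
```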
Integrate data into a single continuous file (reduce the number of reads)
For reading large numbers of small files, the data can be consolidated into a single, continuously readable file format. Options include TFRecord (TensorFlow), RecordIO, hdf5, pth, n5, lmdb, etc.
- TFRecord: https://github.com/vahidk/tfrecord
- lmdb database:
  - https://github.com/Fangyh09/Image2LMDB
  - https://blog.csdn.net/P_LarT/article/details/103208405
  - https://github.com/lartpang/PySODToolBox/blob/master/ForBigDataset/ImageFolder2LMDB.py
- Implementation based on a tar file and IterableDataset (a sketch is shown below).
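A minimal sketch of the tar + IterableDataset idea, assuming a hypothetical archive images.tar that contains one same-sized JPEG per member; members are read sequentially from a single file instead of issuing many small random reads:

```python
import io
import tarfile

import numpy as np
import torch
from PIL import Image
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class TarImageDataset(IterableDataset):
    def __init__(self, tar_path):
        self.tar_path = tar_path

    def __iter__(self):
        info = get_worker_info()
        # each worker opens its own handle and takes every num_workers-th member
        with tarfile.open(self.tar_path, "r") as tar:
            for idx, member in enumerate(tar):
                if not member.isfile():
                    continue
                if info is not None and idx % info.num_workers != info.id:
                    continue
                buf = tar.extractfile(member).read()
                img = Image.open(io.BytesIO(buf)).convert("RGB")
                # assumes all images share the same size; otherwise add a resize here
                yield torch.from_numpy(np.array(img)).permute(2, 0, 1)


loader = DataLoader(TarImageDataset("images.tar"), batch_size=32, num_workers=2)
```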
Pre-read data
Pre-read the data needed for the next iteration while the current one is being processed. Use cases (a prefetcher sketch follows the list):
- Give the DataLoader in PyTorch a shot in the arm - MKFMIKU's article - Zhihu
- Accelerate data reading in PyTorch - hi's article - Zhihu
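A simplified prefetcher sketch in the spirit of the articles above (not their exact implementation): the next batch is copied to the GPU on a side CUDA stream while the current batch is being processed. It assumes CUDA is available and that the wrapped DataLoader yields (data, target) pairs with pin_memory=True.

```python
import torch


class CUDAPrefetcher:
    def __init__(self, loader):
        self.loader = loader
        self.stream = torch.cuda.Stream()  # side stream used only for host-to-device copies

    def _preload(self, it):
        try:
            data, target = next(it)
        except StopIteration:
            return None
        with torch.cuda.stream(self.stream):
            # non_blocking copies overlap with compute because the source is pinned memory
            return data.cuda(non_blocking=True), target.cuda(non_blocking=True)

    def __iter__(self):
        it = iter(self.loader)
        next_batch = self._preload(it)
        while next_batch is not None:
            # make sure the copy issued on the side stream has finished
            torch.cuda.current_stream().wait_stream(self.stream)
            batch = next_batch
            next_batch = self._preload(it)  # start copying the following batch
            yield batch


# usage: for data, target in CUDAPrefetcher(train_loader): ...
```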
Use memory
- Load directly into memory.
- Read the image and save it into a fixed container object.
- Map memory as a disk (i.e., use a RAM disk).
Use a solid-state drive
Replace the mechanical hard disk with an NVMe solid-state drive. Refer to: Give the DataLoader in PyTorch a shot in the arm - MKFMIKU's article - Zhihu
Training strategies
Low-precision training
During training, use low-precision representations (FP16, or even INT8, binary, or ternary networks) instead of the original FP32 precision.
This saves some GPU memory and speeds up training, but be careful with numerically unsafe operations such as mean and sum.
- Introduction to mixed precision training:
  - Mixed precision training tutorial, from shallow to deep
- Mixed precision support provided by NVIDIA/Apex:
  - A must-have PyTorch tool | Speed up for free: Apex-based mixed precision acceleration
  - Solutions to tricky problems when installing Apex for PyTorch - Chen Hanke's article - Zhihu
- PyTorch 1.6 and later provide torch.cuda.amp to support mixed precision (a minimal sketch follows).
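A minimal torch.cuda.amp sketch; the model, data, and hyperparameters are placeholders:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

for _ in range(10):
    data = torch.randn(64, 512, device="cuda")
    target = torch.randn(64, 512, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # ops run in FP16/FP32 as appropriate
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```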
A larger batch
With a fixed number of epochs, a larger batch tends to shorten training time. However, large batches bring many considerations, such as hyperparameter settings and memory usage, and are an active research topic in their own right.
- Hyperparameter settings
  - Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, paper
- Optimize GPU memory usage
  - Gradient accumulation
  - Gradient checkpointing
    - Training Deep Nets with Sublinear Memory Cost, paper
  - In-place operations
    - In-Place Activated BatchNorm for Memory-Optimized Training of DNNs, paper, code
Code level
Library settings
- Set torch.backends.cudnn.benchmark = True before the training loop to speed up computation. Because different cuDNN convolution algorithms perform differently for different kernel sizes, the autotuner runs a benchmark to find the best algorithm. Enable this when your input size does not change frequently; if the input size changes often, the autotuner has to re-benchmark too frequently, which can hurt performance. It can increase forward and backward propagation speed by roughly 1.27x to 1.70x.
- Use page-locked memory, i.e., set pin_memory=True in the DataLoader.
- Choose an appropriate num_workers; a detailed discussion can be found in: Pytorch speedup guide - Yunmeng's article - Zhihu.
- optimizer.zero_grad(set_to_none=True): setting set_to_none=True reduces the memory footprint and can moderately improve performance, but it also changes some behavior (see the documentation). model.zero_grad() or optimizer.zero_grad() performs a memset on all parameters and updates gradients with read-write operations, whereas setting gradients to None skips the memset and updates gradients with "write only" operations, so it is faster.
- During validation and inference, switch to eval mode and use torch.no_grad to turn off gradient computation.
- Consider using the channels_last memory format (see the sketch after this list).
- Replace DataParallel with DistributedDataParallel. For multi-GPU training, even on a single node, DistributedDataParallel is always preferred, because it runs multiple processes, one per GPU, bypassing the Python Global Interpreter Lock (GIL) and increasing speed.
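A short sketch combining several of the settings above (cudnn benchmark, pinned memory, multiple workers, channels_last, non-blocking copies, and set_to_none=True); the dataset and model are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.backends.cudnn.benchmark = True  # autotune conv algorithms; best with fixed input sizes

dataset = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

model = torch.nn.Conv2d(3, 8, 3).cuda().to(memory_format=torch.channels_last)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for images, _ in loader:
    images = images.cuda(non_blocking=True).to(memory_format=torch.channels_last)
    optimizer.zero_grad(set_to_none=True)  # skip the memset; gradients become None
    model(images).sum().backward()
    optimizer.step()
```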
Model
- Do not initialize any unused variables (e.g., layers): PyTorch's initialization and forward are separate, and it will not skip initializing them just because you never use them.
- @torch.jit.script: use the PyTorch JIT to fuse point-wise operations into a single CUDA kernel. PyTorch is optimized for operations on tensors with large dimensions; performing many operations on small tensors is very inefficient. So, where possible, rewrite computations in batched form to reduce overhead and improve performance. If you cannot batch the operations manually, TorchScript can help: TorchScript is a subset of Python, and once verified, PyTorch can automatically optimize TorchScript code with its just-in-time (JIT) compiler. Still, manually implemented batched operations are the better approach. A fusion sketch is shown after this list.
- When using FP16 mixed precision, make the sizes in all the different architectural designs multiples of 8.
- A convolutional layer followed by BN can drop its bias, because mathematically the bias is cancelled out by BN's mean subtraction. This saves model parameters and runtime memory.
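A minimal TorchScript fusion sketch; the bias-add plus GELU-like activation below is just an illustrative chain of element-wise operations, not a specific operator from any model:

```python
import torch


@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # a chain of element-wise ops that the JIT can fuse into a single kernel
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y * 0.7071067811865476))


x = torch.randn(128, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = fused_bias_gelu(x, bias)
```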
Data
- Set the batch size to a multiple of 8 to maximize GPU memory usage.
- Perform NumPy-style operations on the GPU as much as possible.
- Use del to free memory that is no longer needed.
- Avoid unnecessary data transfer between devices.
- When creating a tensor, specify the device directly instead of creating it first and then moving it to the target device.
- Use torch.from_numpy(ndarray) or torch.as_tensor(data, dtype=None, device=None), which avoid re-allocating space by sharing memory; see the corresponding documentation for details and caveats. If the source and target devices are both the CPU, torch.from_numpy and torch.as_tensor do not copy data. If the source data is a NumPy array, torch.from_numpy is faster. If the source data is a tensor with the same dtype and device, torch.as_tensor avoids copying; its input can also be a Python list or tuple.
- Use non-blocking transfers, i.e., set non_blocking=True. This attempts an asynchronous copy where possible, for example copying a CPU tensor in page-locked memory to a CUDA tensor (see the sketch after this list).
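A few of the points above in code form: creating tensors directly on the target device, sharing memory with NumPy, and a non-blocking copy from pinned memory.

```python
import numpy as np
import torch

# create directly on the target device instead of creating on CPU and then moving
weights = torch.zeros(128, 128, device="cuda")

# share memory with the NumPy array (no copy, CPU only)
arr = np.random.rand(128, 128).astype(np.float32)
cpu_t = torch.from_numpy(arr)

# an asynchronous host-to-device copy needs a page-locked (pinned) source tensor
pinned = cpu_t.pin_memory()
gpu_t = pinned.to("cuda", non_blocking=True)
```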
Optimizer optimization
- Store the model parameters in a single contiguous chunk of memory to reduce the time spent in optimizer.step():
  - contiguous_pytorch_params
- Use the fused building blocks in Apex.
Model design
CNN
- ShuffleNetV2, paper:
  - Keep the input and output channels of a convolutional layer equal: when the numbers of input and output feature channels are equal, the MAC (memory access cost) is smallest and the model is fastest.
  - Reduce convolutional grouping: too many group operations increase the MAC and slow the model down.
  - Reduce model branches: the fewer branches in the model, the faster it is.
  - Reduce element-wise operations: the time consumed by element-wise operations is much greater than their FLOPs suggest, so they should be minimized as much as possible. depthwise convolution also has the characteristic of low FLOPs but high MAC.
Vision Transformer
- TRT-ViT: TensorRT-oriented Vision Transformer, paper, interpretation.
  - stage-level: Transformer blocks are suitable for the later stages of a model; this maximizes the efficiency/performance trade-off.
  - stage-level: A stage design that is first shallow and then deep improves performance.
  - block-level: A hybrid block of Transformer and BottleNeck is more effective than a Transformer alone.
  - block-level: A global-then-local block design helps compensate for performance problems.
General ideas
- Reduce complexity: for example, model clipping and pruning to reduce the number of layers and the parameter scale.
- Modify the model structure: for example, model distillation, obtaining a small model through knowledge distillation.
Accelerate inference
Half precision and weight quantization
During inference, use low-precision representations (FP16, or even INT8, binary, or ternary networks) instead of the original FP32 precision.
- TensorRT is a neural network inference engine from NVIDIA that supports post-training 8-bit quantization. It uses a cross-entropy-based quantization calibration algorithm to minimize the difference between the two distributions.
- PyTorch 1.3 already supports quantization, implemented on top of QNNPACK, including post-training quantization, dynamic quantization, and quantization-aware training (a dynamic quantization sketch is shown below).
- In addition, Distiller is an open-source model optimization tool based on PyTorch that naturally supports PyTorch's quantization techniques.
- Microsoft's NNI integrates a variety of quantization-aware training algorithms and supports multiple open-source frameworks such as PyTorch/TensorFlow/MXNet/Caffe2.
For more details, please refer to the "three AIs" article: [Miscellaneous Talk] What open-source tools are currently available for model quantization?
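A sketch of PyTorch's post-training dynamic quantization (the tiny model is a placeholder; dynamic quantization mainly targets Linear/LSTM-heavy models running on CPU):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# weights are stored as int8; activations are quantized dynamically at runtime
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 128)
print(quantized(x).shape)  # torch.Size([4, 10])
```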
Operator fusion
- Model inference acceleration tricks: fusing the BN and Conv layers - Xiaoxiaojiang's article - Zhihu
- Fusing the conv layer and the BN layer at the network inference stage - autocyz's article - Zhihu
- PyTorch itself provides similar functionality (a sketch using torch.quantization.fuse_modules is shown below).
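A sketch of Conv + BN fusion using the built-in torch.quantization.fuse_modules; the ConvBN module here is a placeholder, and fusion is only valid in eval mode because the BN statistics are folded into the convolution weights:

```python
import torch
import torch.nn as nn


class ConvBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, bias=False)
        self.bn = nn.BatchNorm2d(16)

    def forward(self, x):
        return self.bn(self.conv(x))


model = ConvBN().eval()  # fusion assumes frozen BN statistics
fused = torch.quantization.fuse_modules(model, [["conv", "bn"]])

x = torch.randn(1, 3, 32, 32)
print(torch.allclose(model(x), fused(x), atol=1e-5))  # the fused model gives the same output
```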
Re-Parameterization
- RepVGG
- RepVGG | Take your ConvNet all the way with plain convolutions: a plain network exceeds 80% top-1 accuracy for the first time
Timing analysis
- Python ships with several profiling modules: profile, cProfile, and hotshot. Their usage is basically the same; the main difference is whether the module is pure Python or written in C.
- PyTorch Profiler is a tool that collects performance metrics during training and inference. Its context-manager API can be used to understand which model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity, and visualize the execution trace (a minimal usage sketch follows).
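A minimal torch.profiler usage sketch (the linear model and input are placeholders):

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(64, 512, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("forward"):  # label a region in the trace
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```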
Project recommendation
- Model compression implemented with PyTorch:
  - Quantization: 8/4/2-bit (DoReFa), ternary/binary (TWN/BNN/XNOR-Net).
  - Pruning: normal, regular, and channel pruning for grouped convolutional structures.
  - Grouped convolutional structures.
  - BN fusion for binary quantization of features.
Extended reading
- pytorch dataloader data loading takes up most of the time. How do you guys solve it? - Zhihu
- When using pytorch, there are too many training set data to reach tens of millions, and what should I do if the Dataloader loads very slowly? - Zhihu
- What are the pitfalls/bugs in PyTorch? - Zhihu
- Optimizing PyTorch training code
- Training CIFAR10 on a single GPU in 26 seconds: deep learning optimization tricks that even Jeff Dean liked - Heart of Machines article - Zhihu
- After adding a few new features to an online model, why is TensorFlow Serving's prediction time more than 20 times slower than before? - TzeSing's answer - Zhihu
- Deep Learning Model Compression
- Today, has your model accelerated? Here are 5 methods for your reference (with code analysis)
- Summary of common pitfalls in pytorch - Yu Zhenbo's articles - Zhihu
- Pytorch speedup guide - Yunmeng's articles - Zhihu
- Optimize PyTorch's speed and memory efficiency (2022)
PyTorch GPU memory saving
Original document: https://www.yuque.com/lart/ugkv9f/nvffyf
Collected from: What are the tips for saving memory (video memory) in Pytorch? - Zhihu https://www.zhihu.com/question/274635237
Use In-Place
- Enable operations that support inplace whenever possible; for example, relu can use inplace=True.
- batchnorm and certain activation functions can be packaged together as inplace_abn.
Loss function
Deleting the loss at the end of each iteration saves only a tiny amount of GPU memory, but it is better than nothing. See: Tensor to Variable and memory freeing best practices
Mixed precision
This saves some GPU memory and speeds up training, but be careful with numerically unsafe operations such as mean and sum.
- Introduction to mixed precision training:
  - Mixed precision training tutorial, from shallow to deep
- Mixed precision support provided by NVIDIA/Apex:
  - A must-have PyTorch tool | Speed up for free: Apex-based mixed precision acceleration
  - Solutions to tricky problems when installing Apex for PyTorch - Chen Hanke's article - Zhihu
- PyTorch 1.6 and later provide torch.cuda.amp to support mixed precision (a sketch is given in the "Low-precision training" section above).
Manage operations that do not require backpropagation
- For forward passes that do not require backpropagation, such as validation and inference, wrap the code with torch.no_grad.
  - Note that model.eval() is not equivalent to torch.no_grad(); see the discussion: 'model.eval()' vs 'with torch.no_grad()'
- Set requires_grad=False for variables that do not need gradients, so that they do not participate in backpropagation and do not keep unnecessary gradients in memory (a small sketch follows this list).
- Remove gradient paths that do not need to be computed:
  - Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models; interpretations:
    - https://www.yuque.com/lart/papers/xu5t00
    - https://blog.csdn.net/P_LarT/article/details/124978961
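A short sketch of the first two points above, freezing parameters and wrapping inference in torch.no_grad (the model and input are placeholders):

```python
import torch

model = torch.nn.Linear(16, 16)

# parameters that do not need gradients stop participating in backpropagation
for p in model.parameters():
    p.requires_grad = False

model.eval()           # changes layer behavior (dropout/BN), not gradient tracking
with torch.no_grad():  # no graph is recorded, so activations are freed immediately
    out = model(torch.randn(4, 16))
```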
GPU memory cleanup
- torch.cuda.empty_cache() is an advanced version of del. With nvidia-smi you will see that the reported memory usage changes noticeably, although the peak memory usage during training does not seem to change. See: How can we release GPU memory cache?
- Use del to delete unneeded intermediate variables, or overwrite variables in place, to reduce memory usage.
Gradient Accumulation
Split a batch size of 64 into two micro-batches of 32 and accumulate the gradients across them, taking a single optimizer step only after both have been processed. Note that this affects batchnorm and other layers that depend on the batch size.
The PyTorch documentation mentions an example that combines gradient accumulation with mixed precision.
Gradient accumulation can also be used to speed up distributed training; see: [Original][Deep][PyTorch] DDP Series Part 3: Practice and Tips - 996 Golden Generation's article - Zhihu. A minimal sketch follows.
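A minimal gradient accumulation sketch; the model, data, and accumulation factor are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(torch.randn(128, 32), torch.randn(128, 1)), batch_size=32)

accumulation_steps = 2  # effective batch size = 2 * 32 = 64

optimizer.zero_grad(set_to_none=True)
for step, (data, target) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(data), target)
    (loss / accumulation_steps).backward()  # scale so accumulated grads match a full batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```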
Gradient Checkpointing
PyTorch provides torch.utils.checkpoint, which works by re-executing the forward pass for each checkpointed segment during backpropagation.
The paper Training Deep Nets with Sublinear Memory Cost uses gradient checkpointing to reduce memory usage from O(N) to O(sqrt(N)). The deeper the model, the more memory this saves, without slowing training down significantly. A usage sketch is shown after the list below.
- Analysis of Checkpoint mechanism of PyTorch
- torch.utils.checkpoint Introduction and easy to use
- A PyTorch implementation of Sublinear Memory Cost, referenced from: What are the tips for saving memory (video memory) in Pytorch? - Lyken's answer - Zhihu
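A usage sketch of torch.utils.checkpoint.checkpoint_sequential; the depth and layer sizes are placeholders:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()) for _ in range(8)]
)

x = torch.randn(64, 1024, requires_grad=True)
# split into 2 segments: only segment boundaries keep their activations,
# the rest are recomputed during the backward pass
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```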
Related tools
- These scripts can help you monitor GPU memory during training with PyTorch: https://github.com/Oldpan/Pytorch-Memory-Utils
- Just a bit nicer than nvidia-smi? https://github.com/wookayin/gpustat
References
- What are the tips for saving memory (video memory) in Pytorch? - Zheng Zhedong's answer - Zhihu
- A brief discussion on deep learning: How to calculate the memory footprint of models and intermediate variables
- How to finely utilize video memory in Pytorch
- What are the tips for saving video memory in Pytorch? - Chen Hanke's answer - Zhihu
- Analysis of PyTorch video memory mechanism - Connolly's article - Zhihu
Other tips
Reproducibility
You can follow the relevant chapter of the documentation.
Enforce deterministic operations
Avoid using nondeterministic algorithms.
In PyTorch, torch.use_deterministic_algorithms() forces the use of deterministic algorithms instead of nondeterministic ones, and throws an error if an operation is known to be nondeterministic and has no deterministic alternative (a short sketch follows).
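A short sketch of enforcing determinism; per the PyTorch reproducibility notes, the CUBLAS_WORKSPACE_CONFIG environment variable also needs to be set for some CUDA (>= 10.2) operations:

```python
import os

import torch

# required by some CUDA (>= 10.2) operations when determinism is enforced;
# must be set before those operations create their cuBLAS handles
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.use_deterministic_algorithms(True)
# from here on, an op that is known to be nondeterministic and has no
# deterministic implementation raises a RuntimeError instead of running
```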
Set random number seeds
```python
import os
import random

import numpy as np
import torch


def seed_torch(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


seed_torch()
```
Reference: https://www.zdaiot.com/MLFrameworks/Pytorch/Pytorch%E9%9A%8F%E6%9C%BA%E7%A7%8D%E5%AD%90/
Hidden bug in the DataLoader before PyTorch 1.9
For details, see: 95% of people are still making this PyTorch mistake - serendipity's article - Zhihu
For the solution, refer to the documentation:
```python
import random

import numpy
import torch


def seed_worker(worker_id):
    # derive a distinct, deterministic seed for each DataLoader worker process
    worker_seed = torch.initial_seed() % 2 ** 32
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)


DataLoader(..., worker_init_fn=seed_worker)
```