Some Tricks of PyTorch
Changelog
- November 29, 2019: Updated some model design techniques and inference acceleration content, and added an introductory link for Apex. Also removed the TFRecord item, since at the time I believed PyTorch could not use TFRecord.
- November 30, 2019: Added the meaning of MAC and the ShuffleNetV2 paper link.
- December 2, 2019: Regarding the earlier claim that PyTorch cannot use TFRecord: today I saw an answer at https://www.zhihu.com/question/358632497 and learned something new.
- December 23, 2019: Added several introductory articles on model compression and quantization.
- February 7, 2020: Excerpted a few points worth noting from an article and added them to the code-level section.
- April 30, 2020:
  - Added a GitHub backup of this document.
  - Added links introducing the fusion of convolutional layers and BN layers.
  - One more note: for many of the articles and answers referenced earlier, the links and the corresponding content summaries were not connected, so readers may have had questions they could not raise with the original authors. My sincere apologies.
  - Adjusted some content and tried to match it to the corresponding reference links.
- May 18, 2020: Added some tips on saving GPU memory in PyTorch and slightly adjusted the format. Also fixed a previous error: the suggestion should be non_blocking=True, not non_blocking=False.
- January 6, 2021: Adjusted some of the introduction on reading image data.
- January 13, 2021: Added a strategy for accelerating inference. I should update the GitHub document first; updating the Zhihu answer is more troublesome because changes cannot be diffed, which makes it hard to keep in sync.
- June 26, 2022: Re-adjusted the format and arrangement of the content below, and added more references and some recent findings.
- June 20, 2024: Simple format adjustment; added an idea for accelerating data reading based on the tar format and IterableDataset.
PyTorch speed up
Note
Original document: https://www.yuque.com/lart/ugkv9f/ugysgn
Statement: Most of the content comes from sharing on Zhihu and other blogs, and is only listed here as a collection. More suggestions are welcome.
Zhihu answers (likes are welcome):
- pytorch dataloader Data loading takes up most of the time. How do you guys solve it? - People's Artist's answer - Zhihu
- When using pytorch, there are too many training set data to reach tens of millions, and what should I do if the Dataloader loads very slowly? - People's Artist's answer - Zhihu
Speed up preprocessing
- Minimize the preprocessing performed every time data is read. Fixed operations such as resize can be applied in advance and the results saved, then used directly during training.
- Move preprocessing to the GPU to accelerate it.
  - On Linux you can use NVIDIA/DALI.
  - Use tensor-based image processing operations.
IO speed up
- mmcv provides relatively efficient and comprehensive support for data reading: OpenMMLab: MMCV core component analysis (III): FileClient
Use faster image processing
- opencv is generally faster than PIL.
- Note that PIL's lazy-loading strategy makes its open look faster than opencv's imread, but it does not actually load the full data. You can call the load() method on the object returned by open to load the data manually; the timing is then a fair comparison (see the small example after this list).
- For jpeg reads, you can try jpeg4py.
- Save images as bmp to reduce decoding time.
- Discussion on the speed of different image processing libraries: What is the difference between the implementation method and the reading speed of Python's various imread functions? - Zhihu
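A tiny example of the PIL lazy-loading point above (the file name example.jpg is a placeholder): timing open alone undercounts the real decoding cost.

```python
from PIL import Image

img = Image.open("example.jpg")  # lazy: only the header is parsed here
img.load()  # forces the full decode; time this for a fair comparison with cv2.imread
```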
Integrate data into a single continuous file (reduce the number of reads)
For reading large numbers of small files, the data can be consolidated into a single, continuously readable file format. Options include TFRecord (TensorFlow), RecordIO, hdf5, pth, n5, lmdb, etc.
- TFRecord: https://github.com/vahidk/tfrecord
- lmdb database:
  - https://github.com/Fangyh09/Image2LMDB
  - https://blog.csdn.net/P_LarT/article/details/103208405
  - https://github.com/lartpang/PySODToolBox/blob/master/ForBigDataset/ImageFolder2LMDB.py
- Implementation based on a tar file and IterableDataset (a sketch is shown below).
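A minimal sketch of the tar + IterableDataset idea, assuming a hypothetical archive images.tar that contains one same-sized JPEG per member; members are read sequentially from a single file instead of issuing many small random reads:

```python
import io
import tarfile

import numpy as np
import torch
from PIL import Image
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class TarImageDataset(IterableDataset):
    def __init__(self, tar_path):
        self.tar_path = tar_path

    def __iter__(self):
        info = get_worker_info()
        # each worker opens its own handle and takes every num_workers-th member
        with tarfile.open(self.tar_path, "r") as tar:
            for idx, member in enumerate(tar):
                if not member.isfile():
                    continue
                if info is not None and idx % info.num_workers != info.id:
                    continue
                buf = tar.extractfile(member).read()
                img = Image.open(io.BytesIO(buf)).convert("RGB")
                # assumes all images share the same size; otherwise add a resize here
                yield torch.from_numpy(np.array(img)).permute(2, 0, 1)


loader = DataLoader(TarImageDataset("images.tar"), batch_size=32, num_workers=2)
```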
Pre-read data
Pre-read the data needed for the next iteration while the current one is being processed. Use cases (a prefetcher sketch follows the list):
- Give the DataLoader in PyTorch a shot in the arm - MKFMIKU's article - Zhihu
- Accelerate data reading in PyTorch - hi's article - Zhihu
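A simplified prefetcher sketch in the spirit of the articles above (not their exact implementation): the next batch is copied to the GPU on a side CUDA stream while the current batch is being processed. It assumes CUDA is available and that the wrapped DataLoader yields (data, target) pairs with pin_memory=True.

```python
import torch


class CUDAPrefetcher:
    def __init__(self, loader):
        self.loader = loader
        self.stream = torch.cuda.Stream()  # side stream used only for host-to-device copies

    def _preload(self, it):
        try:
            data, target = next(it)
        except StopIteration:
            return None
        with torch.cuda.stream(self.stream):
            # non_blocking copies overlap with compute because the source is pinned memory
            return data.cuda(non_blocking=True), target.cuda(non_blocking=True)

    def __iter__(self):
        it = iter(self.loader)
        next_batch = self._preload(it)
        while next_batch is not None:
            # make sure the copy issued on the side stream has finished
            torch.cuda.current_stream().wait_stream(self.stream)
            batch = next_batch
            next_batch = self._preload(it)  # start copying the following batch
            yield batch


# usage: for data, target in CUDAPrefetcher(train_loader): ...
```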
Use memory
- Load directly into memory.
- Read the image and save it into a fixed container object.
- Map memory as a disk (i.e., use a RAM disk).
Use a solid-state drive
Replace the mechanical hard disk with an NVMe solid-state drive. Refer to: Give the DataLoader in PyTorch a shot in the arm - MKFMIKU's article - Zhihu
Training strategies
Low-precision training
During training, use low-precision representations (FP16, or even INT8, binary, or ternary networks) instead of the original FP32 precision.
This saves some GPU memory and speeds up training, but be careful with numerically unsafe operations such as mean and sum.
- Introduction to mixed precision training:
  - Mixed precision training tutorial, from shallow to deep
- Mixed precision support provided by NVIDIA/Apex:
  - A must-have PyTorch tool | Speed up for free: Apex-based mixed precision acceleration
  - Solutions to tricky problems when installing Apex for PyTorch - Chen Hanke's article - Zhihu
- PyTorch 1.6 and later provide torch.cuda.amp to support mixed precision (a minimal sketch follows).
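A minimal torch.cuda.amp sketch; the model, data, and hyperparameters are placeholders:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid FP16 gradient underflow

for _ in range(10):
    data = torch.randn(64, 512, device="cuda")
    target = torch.randn(64, 512, device="cuda")

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():  # ops run in FP16/FP32 as appropriate
        loss = torch.nn.functional.mse_loss(model(data), target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```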
A larger batch
With a fixed number of epochs, a larger batch tends to shorten training time. However, large batches bring many considerations, such as hyperparameter settings and memory usage, and are an active research topic in their own right.
- Hyperparameter settings
  - Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, paper
- Optimize GPU memory usage
  - Gradient accumulation
  - Gradient checkpointing
    - Training Deep Nets with Sublinear Memory Cost, paper
  - In-place operations
    - In-Place Activated BatchNorm for Memory-Optimized Training of DNNs, paper, code
Code level
Library settings
- Set torch.backends.cudnn.benchmark = True before the training loop to speed up computation. Because different cuDNN convolution algorithms perform differently for different kernel sizes, the autotuner runs a benchmark to find the best algorithm. Enable this when your input size does not change frequently; if the input size changes often, the autotuner has to re-benchmark too frequently, which can hurt performance. It can increase forward and backward propagation speed by roughly 1.27x to 1.70x.
- Use page-locked memory, i.e., set pin_memory=True in the DataLoader.
- Choose an appropriate num_workers; a detailed discussion can be found in: Pytorch speedup guide - Yunmeng's article - Zhihu.
- optimizer.zero_grad(set_to_none=True): setting set_to_none=True reduces the memory footprint and can moderately improve performance, but it also changes some behavior (see the documentation). model.zero_grad() or optimizer.zero_grad() performs a memset on all parameters and updates gradients with read-write operations, whereas setting gradients to None skips the memset and updates gradients with "write only" operations, so it is faster.
- During validation and inference, switch to eval mode and use torch.no_grad to turn off gradient computation.
- Consider using the channels_last memory format (see the sketch after this list).
- Replace DataParallel with DistributedDataParallel. For multi-GPU training, even on a single node, DistributedDataParallel is always preferred, because it runs multiple processes, one per GPU, bypassing the Python Global Interpreter Lock (GIL) and increasing speed.
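A short sketch combining several of the settings above (cudnn benchmark, pinned memory, multiple workers, channels_last, non-blocking copies, and set_to_none=True); the dataset and model are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.backends.cudnn.benchmark = True  # autotune conv algorithms; best with fixed input sizes

dataset = TensorDataset(torch.randn(256, 3, 224, 224), torch.randint(0, 10, (256,)))
loader = DataLoader(dataset, batch_size=32, num_workers=4, pin_memory=True)

model = torch.nn.Conv2d(3, 8, 3).cuda().to(memory_format=torch.channels_last)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for images, _ in loader:
    images = images.cuda(non_blocking=True).to(memory_format=torch.channels_last)
    optimizer.zero_grad(set_to_none=True)  # skip the memset; gradients become None
    model(images).sum().backward()
    optimizer.step()
```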
Model
- Do not initialize any unused variables (e.g., layers): PyTorch's initialization and forward are separate, and it will not skip initializing them just because you never use them.
- @torch.jit.script: use the PyTorch JIT to fuse point-wise operations into a single CUDA kernel. PyTorch is optimized for operations on tensors with large dimensions; performing many operations on small tensors is very inefficient. So, where possible, rewrite computations in batched form to reduce overhead and improve performance. If you cannot batch the operations manually, TorchScript can help: TorchScript is a subset of Python, and once verified, PyTorch can automatically optimize TorchScript code with its just-in-time (JIT) compiler. Still, manually implemented batched operations are the better approach. A fusion sketch is shown after this list.
- When using FP16 mixed precision, make the sizes in all the different architectural designs multiples of 8.
- A convolutional layer followed by BN can drop its bias, because mathematically the bias is cancelled out by BN's mean subtraction. This saves model parameters and runtime memory.
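A minimal TorchScript fusion sketch; the bias-add plus GELU-like activation below is just an illustrative chain of element-wise operations, not a specific operator from any model:

```python
import torch


@torch.jit.script
def fused_bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # a chain of element-wise ops that the JIT can fuse into a single kernel
    y = x + bias
    return y * 0.5 * (1.0 + torch.erf(y * 0.7071067811865476))


x = torch.randn(128, 1024, device="cuda")
bias = torch.randn(1024, device="cuda")
out = fused_bias_gelu(x, bias)
```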
Data
- Set the batch size to a multiple of 8 to maximize GPU memory usage.
- Perform NumPy-style operations on the GPU as much as possible.
- Use del to free memory that is no longer needed.
- Avoid unnecessary data transfer between devices.
- When creating a tensor, specify the device directly instead of creating it first and then moving it to the target device.
- Use torch.from_numpy(ndarray) or torch.as_tensor(data, dtype=None, device=None), which avoid re-allocating space by sharing memory; see the corresponding documentation for details and caveats. If the source and target devices are both the CPU, torch.from_numpy and torch.as_tensor do not copy data. If the source data is a NumPy array, torch.from_numpy is faster. If the source data is a tensor with the same dtype and device, torch.as_tensor avoids copying; its input can also be a Python list or tuple.
- Use non-blocking transfers, i.e., set non_blocking=True. This attempts an asynchronous copy where possible, for example copying a CPU tensor in page-locked memory to a CUDA tensor (see the sketch after this list).
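A few of the points above in code form: creating tensors directly on the target device, sharing memory with NumPy, and a non-blocking copy from pinned memory.

```python
import numpy as np
import torch

# create directly on the target device instead of creating on CPU and then moving
weights = torch.zeros(128, 128, device="cuda")

# share memory with the NumPy array (no copy, CPU only)
arr = np.random.rand(128, 128).astype(np.float32)
cpu_t = torch.from_numpy(arr)

# an asynchronous host-to-device copy needs a page-locked (pinned) source tensor
pinned = cpu_t.pin_memory()
gpu_t = pinned.to("cuda", non_blocking=True)
```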
Optimizer optimization
- Store the model parameters in a single contiguous chunk of memory to reduce the time spent in optimizer.step():
  - contiguous_pytorch_params
- Use the fused building blocks in Apex.
Model design
CNN
- ShuffleNetV2, paper:
  - Keep the input and output channels of a convolutional layer equal: when the numbers of input and output feature channels are equal, the MAC (memory access cost) is smallest and the model is fastest.
  - Reduce convolutional grouping: too many group operations increase the MAC and slow the model down.
  - Reduce model branches: the fewer branches in the model, the faster it is.
  - Reduce element-wise operations: the time consumed by element-wise operations is much greater than their FLOPs suggest, so they should be minimized as much as possible. depthwise convolution also has the characteristic of low FLOPs but high MAC.
Vision Transformer
- TRT-ViT: TensorRT-oriented Vision Transformer, paper, interpretation.
  - stage-level: Transformer blocks are suitable for the later stages of a model; this maximizes the efficiency/performance trade-off.
  - stage-level: A stage design that is first shallow and then deep improves performance.
  - block-level: A hybrid block of Transformer and BottleNeck is more effective than a Transformer alone.
  - block-level: A global-then-local block design helps compensate for performance problems.
General ideas
- Reduce complexity: for example, model clipping and pruning to reduce the number of layers and the parameter scale.
- Modify the model structure: for example, model distillation, obtaining a small model through knowledge distillation.
Accelerate inference
Half precision and weight quantization
During inference, use low-precision representations (FP16, or even INT8, binary, or ternary networks) instead of the original FP32 precision.
- TensorRT is a neural network inference engine from NVIDIA that supports post-training 8-bit quantization. It uses a cross-entropy-based quantization calibration algorithm to minimize the difference between the two distributions.
- PyTorch 1.3 already supports quantization, implemented on top of QNNPACK, including post-training quantization, dynamic quantization, and quantization-aware training (a dynamic quantization sketch is shown below).
- In addition, Distiller is an open-source model optimization tool based on PyTorch that naturally supports PyTorch's quantization techniques.
- Microsoft's NNI integrates a variety of quantization-aware training algorithms and supports multiple open-source frameworks such as PyTorch/TensorFlow/MXNet/Caffe2.
For more details, please refer to the "three AIs" article: [Miscellaneous Talk] What open-source tools are currently available for model quantization?
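A sketch of PyTorch's post-training dynamic quantization (the tiny model is a placeholder; dynamic quantization mainly targets Linear/LSTM-heavy models running on CPU):

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# weights are stored as int8; activations are quantized dynamically at runtime
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

x = torch.randn(4, 128)
print(quantized(x).shape)  # torch.Size([4, 10])
```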
Operator fusion
- Model inference acceleration tricks: fusing the BN and Conv layers - Xiaoxiaojiang's article - Zhihu
- Fusing the conv layer and the BN layer at the network inference stage - autocyz's article - Zhihu
- PyTorch itself provides similar functionality (a sketch using torch.quantization.fuse_modules is shown below).
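A sketch of Conv + BN fusion using the built-in torch.quantization.fuse_modules; the ConvBN module here is a placeholder, and fusion is only valid in eval mode because the BN statistics are folded into the convolution weights:

```python
import torch
import torch.nn as nn


class ConvBN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, 3, bias=False)
        self.bn = nn.BatchNorm2d(16)

    def forward(self, x):
        return self.bn(self.conv(x))


model = ConvBN().eval()  # fusion assumes frozen BN statistics
fused = torch.quantization.fuse_modules(model, [["conv", "bn"]])

x = torch.randn(1, 3, 32, 32)
print(torch.allclose(model(x), fused(x), atol=1e-5))  # the fused model gives the same output
```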
Re-Parameterization
- RepVGG
- RepVGG | Take your ConvNet all the way with plain convolutions: a plain network exceeds 80% top-1 accuracy for the first time
Timing analysis
- Python ships with several profiling modules: profile, cProfile, and hotshot. Their usage is basically the same; the main difference is whether the module is pure Python or written in C.
- PyTorch Profiler is a tool that collects performance metrics during training and inference. Its context-manager API can be used to understand which model operators are the most expensive, examine their input shapes and stack traces, study device kernel activity, and visualize the execution trace (a minimal usage sketch follows).
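A minimal torch.profiler usage sketch (the linear model and input are placeholders):

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function

model = torch.nn.Linear(512, 512).cuda()
x = torch.randn(64, 512, device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    with record_function("forward"):  # label a region in the trace
        model(x)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```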
Project recommendation
- Model compression implemented with PyTorch:
  - Quantization: 8/4/2-bit (DoReFa), ternary/binary (TWN/BNN/XNOR-Net).
  - Pruning: normal, regular, and channel pruning for grouped convolutional structures.
  - Grouped convolutional structures.
  - BN fusion for binary quantization of features.
Extended reading
- pytorch dataloader data loading takes up most of the time. How do you guys solve it? - Zhihu
- When using pytorch, there are too many training set data to reach tens of millions, and what should I do if the Dataloader loads very slowly? - Zhihu
- What are the pitfalls/bugs in PyTorch? - Zhihu
- Optimizing PyTorch training code
- Training CIFAR10 on a single GPU in 26 seconds: deep learning optimization tricks that even Jeff Dean liked - Heart of Machines article - Zhihu
- After adding a few new features to an online model, why is TensorFlow Serving's prediction time more than 20 times slower than before? - TzeSing's answer - Zhihu
- Deep Learning Model Compression
- Today, has your model accelerated? Here are 5 methods for your reference (with code analysis)
- Summary of common pitfalls in pytorch - Yu Zhenbo's articles - Zhihu
- Pytorch speedup guide - Yunmeng's articles - Zhihu
- Optimize PyTorch's speed and memory efficiency (2022)
PyTorch GPU memory saving
Original document: https://www.yuque.com/lart/ugkv9f/nvffyf
Collected from: What are the tips for saving memory (video memory) in Pytorch? - Zhihu https://www.zhihu.com/question/274635237
Use In-Place
- Enable operations that support inplace whenever possible; for example, relu can use inplace=True.
- batchnorm and certain activation functions can be packaged together as inplace_abn.
Loss function
Deleting the loss at the end of each iteration saves only a tiny amount of GPU memory, but it is better than nothing. See: Tensor to Variable and memory freeing best practices
Mixed precision
This saves some GPU memory and speeds up training, but be careful with numerically unsafe operations such as mean and sum.
- Introduction to mixed precision training:
  - Mixed precision training tutorial, from shallow to deep
- Mixed precision support provided by NVIDIA/Apex:
  - A must-have PyTorch tool | Speed up for free: Apex-based mixed precision acceleration
  - Solutions to tricky problems when installing Apex for PyTorch - Chen Hanke's article - Zhihu
- PyTorch 1.6 and later provide torch.cuda.amp to support mixed precision (a sketch is given in the "Low-precision training" section above).
Manage operations that do not require backpropagation
- For forward passes that do not require backpropagation, such as validation and inference, wrap the code with torch.no_grad.
  - Note that model.eval() is not equivalent to torch.no_grad(); see the discussion: 'model.eval()' vs 'with torch.no_grad()'
- Set requires_grad=False for variables that do not need gradients, so that they do not participate in backpropagation and do not keep unnecessary gradients in memory (a small sketch follows this list).
- Remove gradient paths that do not need to be computed:
  - Stochastic Backpropagation: A Memory Efficient Strategy for Training Video Models; interpretations:
    - https://www.yuque.com/lart/papers/xu5t00
    - https://blog.csdn.net/P_LarT/article/details/124978961
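A short sketch of the first two points above, freezing parameters and wrapping inference in torch.no_grad (the model and input are placeholders):

```python
import torch

model = torch.nn.Linear(16, 16)

# parameters that do not need gradients stop participating in backpropagation
for p in model.parameters():
    p.requires_grad = False

model.eval()           # changes layer behavior (dropout/BN), not gradient tracking
with torch.no_grad():  # no graph is recorded, so activations are freed immediately
    out = model(torch.randn(4, 16))
```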
GPU memory cleanup
- torch.cuda.empty_cache() is an advanced version of del. With nvidia-smi you will see that the reported memory usage changes noticeably, although the peak memory usage during training does not seem to change. See: How can we release GPU memory cache?
- Use del to delete unneeded intermediate variables, or overwrite variables in place, to reduce memory usage.
Gradient Accumulation
Split a batch size of 64 into two micro-batches of 32 and accumulate the gradients across them, taking a single optimizer step only after both have been processed. Note that this affects batchnorm and other layers that depend on the batch size.
The PyTorch documentation mentions an example that combines gradient accumulation with mixed precision.
Gradient accumulation can also be used to speed up distributed training; see: [Original][Deep][PyTorch] DDP Series Part 3: Practice and Tips - 996 Golden Generation's article - Zhihu. A minimal sketch follows.
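A minimal gradient accumulation sketch; the model, data, and accumulation factor are placeholders:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

model = torch.nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loader = DataLoader(TensorDataset(torch.randn(128, 32), torch.randn(128, 1)), batch_size=32)

accumulation_steps = 2  # effective batch size = 2 * 32 = 64

optimizer.zero_grad(set_to_none=True)
for step, (data, target) in enumerate(loader):
    loss = torch.nn.functional.mse_loss(model(data), target)
    (loss / accumulation_steps).backward()  # scale so accumulated grads match a full batch
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
```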
Gradient Checkpointing
PyTorch provides torch.utils.checkpoint, which works by re-executing the forward pass for each checkpointed segment during backpropagation.
The paper Training Deep Nets with Sublinear Memory Cost uses gradient checkpointing to reduce memory usage from O(N) to O(sqrt(N)). The deeper the model, the more memory this saves, without slowing training down significantly. A usage sketch is shown after the list below.
- Analysis of Checkpoint mechanism of PyTorch
- torch.utils.checkpoint Introduction and easy to use
- A PyTorch implementation of Sublinear Memory Cost, referenced from: What are the tips for saving memory (video memory) in Pytorch? - Lyken's answer - Zhihu
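A usage sketch of torch.utils.checkpoint.checkpoint_sequential; the depth and layer sizes are placeholders:

```python
import torch
from torch.utils.checkpoint import checkpoint_sequential

model = torch.nn.Sequential(
    *[torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU()) for _ in range(8)]
)

x = torch.randn(64, 1024, requires_grad=True)
# split into 2 segments: only segment boundaries keep their activations,
# the rest are recomputed during the backward pass
out = checkpoint_sequential(model, 2, x)
out.sum().backward()
```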
Related tools
- These scripts can help you monitor GPU memory during training with PyTorch: https://github.com/Oldpan/Pytorch-Memory-Utils
- Just a bit nicer than nvidia-smi? https://github.com/wookayin/gpustat
References
- What are the tips for saving memory (video memory) in Pytorch? - Zheng Zhedong's answer - Zhihu
- A brief discussion on deep learning: How to calculate the memory footprint of models and intermediate variables
- How to finely utilize video memory in Pytorch
- What are the tips for saving video memory in Pytorch? - Chen Hanke's answer - Zhihu
- Analysis of PyTorch video memory mechanism - Connolly's article - Zhihu
Other tips
Reproducibility
You can follow the relevant chapter of the documentation.
Enforce deterministic operations
Avoid using nondeterministic algorithms.
In PyTorch, torch.use_deterministic_algorithms() forces the use of deterministic algorithms instead of nondeterministic ones, and throws an error if an operation is known to be nondeterministic and has no deterministic alternative (a short sketch follows).
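A short sketch of enforcing determinism; per the PyTorch reproducibility notes, the CUBLAS_WORKSPACE_CONFIG environment variable also needs to be set for some CUDA (>= 10.2) operations:

```python
import os

import torch

# required by some CUDA (>= 10.2) operations when determinism is enforced;
# must be set before those operations create their cuBLAS handles
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

torch.use_deterministic_algorithms(True)
# from here on, an op that is known to be nondeterministic and has no
# deterministic implementation raises a RuntimeError instead of running
```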
Set random number seeds
```python
import os
import random

import numpy as np
import torch


def seed_torch(seed=1029):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # if you are using multi-GPU
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True


seed_torch()
```
Reference: https://www.zdaiot.com/MLFrameworks/Pytorch/Pytorch%E9%9A%8F%E6%9C%BA%E7%A7%8D%E5%AD%90/
Hidden bug in the DataLoader before PyTorch 1.9
For details, see: 95% of people are still making this PyTorch mistake - serendipity's article - Zhihu
For the solution, refer to the documentation:
```python
import random

import numpy
import torch


def seed_worker(worker_id):
    # derive a distinct, deterministic seed for each DataLoader worker process
    worker_seed = torch.initial_seed() % 2 ** 32
    numpy.random.seed(worker_seed)
    random.seed(worker_seed)


DataLoader(..., worker_init_fn=seed_worker)
```