A simple and accurate CUDA memory management laboratory for PyTorch. It consists of different parts about the memory:
Features:

- A `line_profiler`-style CUDA memory profiler.
- A reporter to inspect tensors occupying CUDA memory.
- IPython support via `%mlrun` / `%%mlrun` line/cell magic commands.
Installation:

```bash
pip install pytorch_memlab
```

or install directly from GitHub:

```bash
pip install git+https://github.com/stonesjtu/pytorch_memlab
```

Out-of-memory errors in PyTorch happen frequently, for newbies and experienced programmers alike. A common reason is that most people don't really learn the underlying memory management philosophy of PyTorch and GPUs: they write memory-inefficient code and then complain about PyTorch eating too much CUDA memory.
In this repository, I'm going to share some useful tools to help debug OOM, or to inspect the underlying mechanism if anyone is interested.
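To get a feel for the mechanism these tools build on, the allocator's two key numbers can be queried directly from the standard `torch.cuda` API. A minimal sketch (not part of pytorch_memlab itself): `memory_reserved` counts bytes held by PyTorch's caching allocator, which is usually larger than what tensors actively occupy.

```python
import torch

x = torch.zeros(1024, 1024, device='cuda')   # a 4 MB float32 tensor
print(torch.cuda.memory_allocated())  # bytes actively occupied by tensors
print(torch.cuda.memory_reserved())   # bytes held by the caching allocator

del x
print(torch.cuda.memory_allocated())  # drops by ~4 MB ...
print(torch.cuda.memory_reserved())   # ... but the cache is usually kept
```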
The memory profiler is a modification of python's `line_profiler`: it gives the memory usage info for each line of code in the specified function/method.
```python
import torch
from pytorch_memlab import LineProfiler

def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    inner()

with LineProfiler(outer, inner) as prof:
    outer()
prof.display()
```

After the script finishes or is interrupted by keyboard, it gives the following profiling info if you're in a Jupyter notebook:

or the following info if you're in a text-only terminal:
```
## outer

active_bytes reserved_bytes line  code
         all            all
        peak           peak
       0.00B          0.00B    7  def outer():
      40.00K          2.00M    8      linear = torch.nn.Linear(100, 100).cuda()
      80.00K          2.00M    9      linear2 = torch.nn.Linear(100, 100).cuda()
     120.00K          2.00M   10      inner()

## inner

active_bytes reserved_bytes line  code
         all            all
        peak           peak
      80.00K          2.00M    4  def inner():
     120.00K          2.00M    5      torch.nn.Linear(100, 100).cuda()
```
An explanation of what each column means can be found in the PyTorch documentation for `torch.cuda.memory_stats()`. The name of any field of `memory_stats()` can be passed to `display()` to view the corresponding statistic.
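For example, to look at a different statistic than the default columns, a sketch based on the statement above (assuming the `memory_stats()` field name, such as `"allocated_bytes.all.peak"`, is passed directly to `display()`):

```python
with LineProfiler(outer, inner) as prof:
    outer()
# show the peak allocated bytes per line instead of the default columns
prof.display("allocated_bytes.all.peak")
```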
If you use the `profile` decorator, the memory statistics are collected during multiple runs and only the maximum one is displayed at the end. We also provide a more flexible API called `profile_every`, which prints the memory info each time the function executes. You can simply replace `@profile` with `@profile_every(1)` to print the memory usage on every execution.
You can also mix `@profile` and `@profile_every` to gain finer control over the debugging granularity (a sketch combining the two appears after the `set_target_gpu` example below).
```python
import torch
from pytorch_memlab import profile

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()

    @profile
    def forward(self, inp):
        # do_something
        ...
```

You can also use `set_target_gpu` to switch the device being profiled. The GPU selection is global, which means you have to remember which GPU you are profiling during the whole process:
```python
import torch
from pytorch_memlab import profile, set_target_gpu

@profile
def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)

func()
```

More samples can be found in `test/test_line_profiler.py`.
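As promised above, a minimal sketch of mixing the two decorators (the `Net`/`check` layout here is a hypothetical example, not from the library):

```python
import torch
from pytorch_memlab import profile, profile_every

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

    @profile              # reports only the maximum over all runs
    def forward(self, inp):
        return self.check(self.linear(inp))

    @profile_every(1)     # reports on every single call
    def check(self, out):
        return out.relu()
```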
For IPython support, make sure you have `IPython` installed, or have installed `pytorch-memlab` with `pip install pytorch-memlab[ipython]`.
First, load the extension:
```python
%load_ext pytorch_memlab
```

This makes the `%mlrun` and `%%mlrun` line/cell magics available for use. For example, run the following in a new cell to profile an entire cell:
```python
%%mlrun -f func
import torch
from pytorch_memlab import profile, set_target_gpu

def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)
```

Alternatively, you can invoke the profiler for a single statement via the `%mlrun` line magic:
```python
import torch
from pytorch_memlab import profile, set_target_gpu

def func(input_size):
    net1 = torch.nn.Linear(input_size, 1024).cuda(0)

%mlrun -f func func(2048)
```

See `%mlrun?` for help on what arguments are supported. You can set the GPU device to profile, dump profiling results to a file, and return the `LineProfiler` object for post-profiling inspection.
Find out more by checking out the demo Jupyter notebook.
As the memory profiler only gives overall memory usage per line, more fine-grained memory usage information can be obtained with the Memory Reporter.
The memory reporter iterates over all `Tensor` objects and gets the underlying `UntypedStorage` (previously `Storage`) object to obtain the actual memory usage rather than the surface `Tensor.size`.
See `UntypedStorage` for details.
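To see why the storage matters more than `Tensor.size`, consider a view that exposes only a slice of a large buffer. A short sketch, assuming PyTorch >= 2.0 where `Tensor.untyped_storage()` is available:

```python
import torch

x = torch.zeros(1024, 1024, device='cuda')
view = x[:10]                                 # a small view into x's storage

print(view.numel() * view.element_size())     # 40960 bytes at the surface
print(view.untyped_storage().nbytes())        # 4194304 bytes actually held
```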
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
reporter = MemReporter()
reporter.report()
```

outputs:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
-------------------------------------------------------------------------------
Total Tensors: 1049600  Used Memory: 4.00M
The allocated memory on cuda:0: 4.00M
-------------------------------------------------------------------------------
```
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)

out = linear(inp).mean()
print('========= before backward =========')
reporter.report()
out.backward()
print('========= after backward =========')
reporter.report()
```

outputs:
```
========= before backward =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 1573889  Used Memory: 6.00M
The allocated memory on cuda:0: 6.00M
-------------------------------------------------------------------------------
========= after backward =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
weight.grad                                     (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
bias.grad                                            (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2623489  Used Memory: 10.01M
The allocated memory on cuda:0: 10.01M
-------------------------------------------------------------------------------
```
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
linear2 = torch.nn.Linear(1024, 1024).cuda()
linear2.weight = linear.weight
container = torch.nn.Sequential(
    linear, linear2
)
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
out = container(inp).mean()
out.backward()

# verbose shows how storage is shared across multiple Tensors
reporter = MemReporter(container)
reporter.report(verbose=True)
```

outputs:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
0.weight                                        (1024, 1024)     4.00M
0.weight.grad                                   (1024, 1024)     4.00M
0.bias                                               (1024,)     4.00K
0.bias.grad                                          (1024,)     4.00K
1.bias                                               (1024,)     4.00K
1.bias.grad                                          (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2625537  Used Memory: 10.02M
The allocated memory on cuda:0: 10.02M
-------------------------------------------------------------------------------
```
```python
import torch
from pytorch_memlab import MemReporter

lstm = torch.nn.LSTM(1024, 1024).cuda()
reporter = MemReporter(lstm)
reporter.report(verbose=True)
inp = torch.Tensor(10, 10, 1024).cuda()
out, _ = lstm(inp)
out.mean().backward()
reporter.report(verbose=True)
```

As shown below, `(->)` denotes the re-use of the same storage backend:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight_ih_l0                                    (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
-------------------------------------------------------------------------------
Total Tensors: 8499200  Used Memory: 32.42M
The allocated memory on cuda:0: 32.52M
Memory differs due to the matrix alignment
-------------------------------------------------------------------------------
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight_ih_l0                                    (4096, 1024)    32.03M
weight_ih_l0.grad                               (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
weight_hh_l0.grad(->weight_ih_l0.grad)          (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_ih_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
Tensor1                                       (10, 10, 1024)   400.00K
Tensor2                                        (1, 10, 1024)    40.00K
Tensor3                                        (1, 10, 1024)    40.00K
-------------------------------------------------------------------------------
Total Tensors: 17018880  Used Memory: 64.92M
The allocated memory on cuda:0: 65.11M
Memory differs due to the matrix alignment
-------------------------------------------------------------------------------
```
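The `(->)` rows can be double-checked by hand: on CUDA, cuDNN flattens all LSTM parameters into one contiguous buffer, which is why they report 0.00B individually. A quick sanity check, assuming PyTorch >= 2.0:

```python
# continuing from the lstm defined above
same = (lstm.weight_ih_l0.untyped_storage().data_ptr()
        == lstm.weight_hh_l0.untyped_storage().data_ptr())
print(same)  # True on CUDA, where the parameters share one flattened buffer
```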
NOTICE:
When forwarding with `grad_mode=True`, PyTorch maintains tensor buffers at the C level for future back-propagation, so these buffers are not managed or collected by Python. They will only be reported if you store those intermediate results as Python variables.
You can also filter the reported tensors by device by passing extra arguments: `report(device=torch.device(0))`.
A failure case due to PyTorch's C-side tensor buffers:
In the following example, a temporary buffer is created at `inp * (inp + 2)` to store both `inp` and `inp + 2`. Unfortunately, Python only knows about the existence of `inp`, so we "lose" 2M of memory, which is the same size as the tensor `inp`.
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)

out = linear(inp * (inp + 2)).mean()
reporter.report()
```

outputs:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 1573889  Used Memory: 6.00M
The allocated memory on cuda:0: 8.00M
Memory differs due to the matrix alignment or invisible gradient buffer tensors
-------------------------------------------------------------------------------
```
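As the NOTICE above suggests, binding the intermediate result to a Python variable makes it visible to the reporter. A minimal sketch of the same computation with the temporary exposed:

```python
tmp = inp + 2                      # now a named Python tensor, so it is reported
out = linear(inp * tmp).mean()
reporter.report()                  # the "lost" 2.00M now shows up as a tensor row
```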
Sometimes people want to preempt your running task, but you don't want to save a checkpoint and then load it: all they actually need is the GPU resources (typically, CPU resources and CPU memory are always spare in GPU clusters). So you can move all your workspaces from GPU to CPU and halt your task until a restart signal is triggered, instead of saving and loading checkpoints and bootstrapping from scratch.
This is still in development, but you can have fun with it:
```python
from pytorch_memlab import Courtesy

iamcourtesy = Courtesy()
for i in range(num_iteration):
    if something_happens:
        iamcourtesy.yield_memory()
        wait_for_restart_signal()
        iamcourtesy.restore()
```

As mentioned in the Memory Reporter section, intermediate tensors are not covered properly, so you may want to insert such courtesy logic into your `backward`/`forward` as well.
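For instance, a hedged sketch of what such logic might look like in a real training loop. The preemption signal here is a hypothetical file touched by the cluster scheduler; only `Courtesy.yield_memory()` and `Courtesy.restore()` come from the library:

```python
import os
import time
import torch
from pytorch_memlab import Courtesy

SIGNAL_FILE = '/tmp/preempt'       # hypothetical scheduler signal file

model = torch.nn.Linear(1024, 1024).cuda()
courtesy = Courtesy()

for step in range(1000):
    model(torch.randn(64, 1024, device='cuda')).mean().backward()
    if os.path.exists(SIGNAL_FILE):        # someone wants the GPU
        courtesy.yield_memory()            # move the workspace to CPU
        while os.path.exists(SIGNAL_FILE): # halt until the signal clears
            time.sleep(10)
        courtesy.restore()                 # move everything back to GPU
```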
Acknowledgement: during my 3 years of developing efficient deep learning models, I suffered a lot from debugging, and learned a lot from the great open-source community.

Changes:

- `DataFrame.drop` compatibility for pandas 1.5+
- Name mapping in `MemReporter` (#24)
- Memory leak in `MemReporter`
- `line_profiler` not found