A simple and accurate CUDA memory management laboratory for PyTorch. It consists of different parts about the memory:
Features:

- A `line_profiler`-style CUDA memory profiler.
- A reporter to inspect tensors occupying CUDA memory.
- IPython support via `%mlrun` / `%%mlrun` line/cell magic commands.
Installation:

```bash
pip install pytorch_memlab
```

or install directly from GitHub:

```bash
pip install git+https://github.com/stonesjtu/pytorch_memlab
```

Out-of-memory errors in PyTorch happen frequently, for newbies and experienced programmers alike. A common reason is that most people don't really learn the underlying memory management philosophy of PyTorch and GPUs: they write memory-inefficient code and then complain about PyTorch eating too much CUDA memory.
In this repository, I'm going to share some useful tools to help debug OOM, or to inspect the underlying mechanism if anyone is interested.
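To get a feel for the mechanism these tools build on, the allocator's two key numbers can be queried directly from the standard `torch.cuda` API. A minimal sketch (not part of pytorch_memlab itself): `memory_reserved` counts bytes held by PyTorch's caching allocator, which is usually larger than what tensors actively occupy.

```python
import torch

x = torch.zeros(1024, 1024, device='cuda')   # a 4 MB float32 tensor
print(torch.cuda.memory_allocated())  # bytes actively occupied by tensors
print(torch.cuda.memory_reserved())   # bytes held by the caching allocator

del x
print(torch.cuda.memory_allocated())  # drops by ~4 MB ...
print(torch.cuda.memory_reserved())   # ... but the cache is usually kept
```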
The memory profiler is a modification of python's `line_profiler`: it gives the memory usage info for each line of code in the specified function/method.
```python
import torch
from pytorch_memlab import LineProfiler

def inner():
    torch.nn.Linear(100, 100).cuda()

def outer():
    linear = torch.nn.Linear(100, 100).cuda()
    linear2 = torch.nn.Linear(100, 100).cuda()
    inner()

with LineProfiler(outer, inner) as prof:
    outer()
prof.display()
```

After the script finishes or is interrupted by keyboard, it gives the following profiling info if you're in a Jupyter notebook:

or the following info if you're in a text-only terminal:
```
## outer

active_bytes reserved_bytes line  code
         all            all
        peak           peak
       0.00B          0.00B    7  def outer():
      40.00K          2.00M    8      linear = torch.nn.Linear(100, 100).cuda()
      80.00K          2.00M    9      linear2 = torch.nn.Linear(100, 100).cuda()
     120.00K          2.00M   10      inner()

## inner

active_bytes reserved_bytes line  code
         all            all
        peak           peak
      80.00K          2.00M    4  def inner():
     120.00K          2.00M    5      torch.nn.Linear(100, 100).cuda()
```
An explanation of what each column means can be found in the PyTorch documentation for `torch.cuda.memory_stats()`. The name of any field of `memory_stats()` can be passed to `display()` to view the corresponding statistic.
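For example, to look at a different statistic than the default columns, a sketch based on the statement above (assuming the `memory_stats()` field name, such as `"allocated_bytes.all.peak"`, is passed directly to `display()`):

```python
with LineProfiler(outer, inner) as prof:
    outer()
# show the peak allocated bytes per line instead of the default columns
prof.display("allocated_bytes.all.peak")
```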
If you use the `profile` decorator, the memory statistics are collected during multiple runs and only the maximum one is displayed at the end. We also provide a more flexible API called `profile_every`, which prints the memory info each time the function executes. You can simply replace `@profile` with `@profile_every(1)` to print the memory usage on every execution.
You can also mix `@profile` and `@profile_every` to gain finer control over the debugging granularity (a sketch combining the two appears after the `set_target_gpu` example below).
```python
import torch
from pytorch_memlab import profile

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()

    @profile
    def forward(self, inp):
        # do_something
        ...
```

You can also use `set_target_gpu` to switch the device being profiled. The GPU selection is global, which means you have to remember which GPU you are profiling during the whole process:
```python
import torch
from pytorch_memlab import profile, set_target_gpu

@profile
def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)

func()
```

More samples can be found in `test/test_line_profiler.py`.
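As promised above, a minimal sketch of mixing the two decorators (the `Net`/`check` layout here is a hypothetical example, not from the library):

```python
import torch
from pytorch_memlab import profile, profile_every

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(1024, 1024)

    @profile              # reports only the maximum over all runs
    def forward(self, inp):
        return self.check(self.linear(inp))

    @profile_every(1)     # reports on every single call
    def check(self, out):
        return out.relu()
```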
For IPython support, make sure you have `IPython` installed, or have installed `pytorch-memlab` with `pip install pytorch-memlab[ipython]`.
First, load the extension:
```python
%load_ext pytorch_memlab
```

This makes the `%mlrun` and `%%mlrun` line/cell magics available for use. For example, run the following in a new cell to profile an entire cell:
```python
%%mlrun -f func
import torch
from pytorch_memlab import profile, set_target_gpu

def func():
    net1 = torch.nn.Linear(1024, 1024).cuda(0)
    set_target_gpu(1)
    net2 = torch.nn.Linear(1024, 1024).cuda(1)
    set_target_gpu(0)
    net3 = torch.nn.Linear(1024, 1024).cuda(0)
```

Alternatively, you can invoke the profiler for a single statement via the `%mlrun` line magic:
```python
import torch
from pytorch_memlab import profile, set_target_gpu

def func(input_size):
    net1 = torch.nn.Linear(input_size, 1024).cuda(0)

%mlrun -f func func(2048)
```

See `%mlrun?` for help on what arguments are supported. You can set the GPU device to profile, dump profiling results to a file, and return the `LineProfiler` object for post-profiling inspection.
Find out more by checking out the demo Jupyter notebook.
As the memory profiler only gives overall memory usage per line, more fine-grained memory usage information can be obtained with the Memory Reporter.
The memory reporter iterates over all `Tensor` objects and gets the underlying `UntypedStorage` (previously `Storage`) object to obtain the actual memory usage rather than the surface `Tensor.size`.
See `UntypedStorage` for details.
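To see why the storage matters more than `Tensor.size`, consider a view that exposes only a slice of a large buffer. A short sketch, assuming PyTorch >= 2.0 where `Tensor.untyped_storage()` is available:

```python
import torch

x = torch.zeros(1024, 1024, device='cuda')
view = x[:10]                                 # a small view into x's storage

print(view.numel() * view.element_size())     # 40960 bytes at the surface
print(view.untyped_storage().nbytes())        # 4194304 bytes actually held
```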
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
reporter = MemReporter()
reporter.report()
```

outputs:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
Parameter0                                      (1024, 1024)     4.00M
Parameter1                                           (1024,)     4.00K
-------------------------------------------------------------------------------
Total Tensors: 1049600  Used Memory: 4.00M
The allocated memory on cuda:0: 4.00M
-------------------------------------------------------------------------------
```
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)

out = linear(inp).mean()
print('========= before backward =========')
reporter.report()
out.backward()
print('========= after backward =========')
reporter.report()
```

outputs:
```
========= before backward =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 1573889  Used Memory: 6.00M
The allocated memory on cuda:0: 6.00M
-------------------------------------------------------------------------------
========= after backward =========
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
weight.grad                                     (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
bias.grad                                            (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2623489  Used Memory: 10.01M
The allocated memory on cuda:0: 10.01M
-------------------------------------------------------------------------------
```
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
linear2 = torch.nn.Linear(1024, 1024).cuda()
linear2.weight = linear.weight
container = torch.nn.Sequential(
    linear, linear2
)
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
out = container(inp).mean()
out.backward()

# verbose shows how storage is shared across multiple Tensors
reporter = MemReporter(container)
reporter.report(verbose=True)
```

outputs:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
0.weight                                        (1024, 1024)     4.00M
0.weight.grad                                   (1024, 1024)     4.00M
0.bias                                               (1024,)     4.00K
0.bias.grad                                          (1024,)     4.00K
1.bias                                               (1024,)     4.00K
1.bias.grad                                          (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 2625537  Used Memory: 10.02M
The allocated memory on cuda:0: 10.02M
-------------------------------------------------------------------------------
```
```python
import torch
from pytorch_memlab import MemReporter

lstm = torch.nn.LSTM(1024, 1024).cuda()
reporter = MemReporter(lstm)
reporter.report(verbose=True)
inp = torch.Tensor(10, 10, 1024).cuda()
out, _ = lstm(inp)
out.mean().backward()
reporter.report(verbose=True)
```

As shown below, `(->)` denotes the re-use of the same storage backend:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight_ih_l0                                    (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
-------------------------------------------------------------------------------
Total Tensors: 8499200  Used Memory: 32.42M
The allocated memory on cuda:0: 32.52M
Memory differs due to the matrix alignment
-------------------------------------------------------------------------------
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight_ih_l0                                    (4096, 1024)    32.03M
weight_ih_l0.grad                               (4096, 1024)    32.03M
weight_hh_l0(->weight_ih_l0)                    (4096, 1024)     0.00B
weight_hh_l0.grad(->weight_ih_l0.grad)          (4096, 1024)     0.00B
bias_ih_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_ih_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
bias_hh_l0(->weight_ih_l0)                           (4096,)     0.00B
bias_hh_l0.grad(->weight_ih_l0.grad)                 (4096,)     0.00B
Tensor0                                       (10, 10, 1024)   400.00K
Tensor1                                       (10, 10, 1024)   400.00K
Tensor2                                        (1, 10, 1024)    40.00K
Tensor3                                        (1, 10, 1024)    40.00K
-------------------------------------------------------------------------------
Total Tensors: 17018880  Used Memory: 64.92M
The allocated memory on cuda:0: 65.11M
Memory differs due to the matrix alignment
-------------------------------------------------------------------------------
```
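The `(->)` rows can be double-checked by hand: on CUDA, cuDNN flattens all LSTM parameters into one contiguous buffer, which is why they report 0.00B individually. A quick sanity check, assuming PyTorch >= 2.0:

```python
# continuing from the lstm defined above
same = (lstm.weight_ih_l0.untyped_storage().data_ptr()
        == lstm.weight_hh_l0.untyped_storage().data_ptr())
print(same)  # True on CUDA, where the parameters share one flattened buffer
```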
NOTICE:
When forwarding with `grad_mode=True`, PyTorch maintains tensor buffers at the C level for future back-propagation, so these buffers are not managed or collected by Python. They will only be reported if you store those intermediate results as Python variables.
You can also filter the reported tensors by device by passing extra arguments: `report(device=torch.device(0))`.
A failure case due to PyTorch's C-side tensor buffers:
In the following example, a temporary buffer is created at `inp * (inp + 2)` to store both `inp` and `inp + 2`. Unfortunately, Python only knows about the existence of `inp`, so we "lose" 2M of memory, which is the same size as the tensor `inp`.
```python
import torch
from pytorch_memlab import MemReporter

linear = torch.nn.Linear(1024, 1024).cuda()
inp = torch.Tensor(512, 1024).cuda()
# pass in a model to automatically infer the tensor names
reporter = MemReporter(linear)

out = linear(inp * (inp + 2)).mean()
reporter.report()
```

outputs:
```
Element type                                            Size  Used MEM
-------------------------------------------------------------------------------
Storage on cuda:0
weight                                          (1024, 1024)     4.00M
bias                                                 (1024,)     4.00K
Tensor0                                          (512, 1024)     2.00M
Tensor1                                                 (1,)   512.00B
-------------------------------------------------------------------------------
Total Tensors: 1573889  Used Memory: 6.00M
The allocated memory on cuda:0: 8.00M
Memory differs due to the matrix alignment or invisible gradient buffer tensors
-------------------------------------------------------------------------------
```
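As the NOTICE above suggests, binding the intermediate result to a Python variable makes it visible to the reporter. A minimal sketch of the same computation with the temporary exposed:

```python
tmp = inp + 2                      # now a named Python tensor, so it is reported
out = linear(inp * tmp).mean()
reporter.report()                  # the "lost" 2.00M now shows up as a tensor row
```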
Sometimes people want to preempt your running task, but you don't want to save a checkpoint and then load it: all they actually need is the GPU resources (typically, CPU resources and CPU memory are always spare in GPU clusters). So you can move all your workspaces from GPU to CPU and halt your task until a restart signal is triggered, instead of saving and loading checkpoints and bootstrapping from scratch.
This is still in development, but you can have fun with it:
```python
from pytorch_memlab import Courtesy

iamcourtesy = Courtesy()
for i in range(num_iteration):
    if something_happens:
        iamcourtesy.yield_memory()
        wait_for_restart_signal()
        iamcourtesy.restore()
```

As mentioned in the Memory Reporter section, intermediate tensors are not covered properly, so you may want to insert such courtesy logic into your `backward`/`forward` as well.
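For instance, a hedged sketch of what such logic might look like in a real training loop. The preemption signal here is a hypothetical file touched by the cluster scheduler; only `Courtesy.yield_memory()` and `Courtesy.restore()` come from the library:

```python
import os
import time
import torch
from pytorch_memlab import Courtesy

SIGNAL_FILE = '/tmp/preempt'       # hypothetical scheduler signal file

model = torch.nn.Linear(1024, 1024).cuda()
courtesy = Courtesy()

for step in range(1000):
    model(torch.randn(64, 1024, device='cuda')).mean().backward()
    if os.path.exists(SIGNAL_FILE):        # someone wants the GPU
        courtesy.yield_memory()            # move the workspace to CPU
        while os.path.exists(SIGNAL_FILE): # halt until the signal clears
            time.sleep(10)
        courtesy.restore()                 # move everything back to GPU
```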
Acknowledgement: during my 3 years of developing efficient deep learning models, I suffered a lot from debugging, and learned a lot from the great open-source community.

Changes:

- `DataFrame.drop` compatibility for pandas 1.5+
- Name mapping in `MemReporter` (#24)
- Memory leak in `MemReporter`
- `line_profiler` not found