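Effective PyTorch

Part I: PyTorch Fundamentals

PyTorch basics
Encapsulate your model with Modules
Broadcasting the good and the ugly
Take advantage of the overloaded operators
Optimize runtime with TorchScript
Building efficient custom data loaders
Numerical stability in PyTorch
Faster training with automatic mixed precision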

To install PyTorch, follow the instructions on the official website:

pip install torch torchvision

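We aim to gradually expand this series by adding new articles and to keep the content up to date with the latest releases of the PyTorch API. If you have suggestions on how to improve this series, or find the explanations ambiguous, feel free to create an issue, send patches, or reach out by email.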

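Part I: PyTorch Fundamentals

PyTorch basics

PyTorch is one of the most popular libraries for numerical computation and is currently among the most widely used libraries for machine learning research. In many ways PyTorch is similar to NumPy, with the additional benefit that it lets you run your computations on CPUs, GPUs, and TPUs without any material change to your code. PyTorch also makes it easy to distribute your computation across multiple devices or machines. One of its most important features is automatic differentiation, which computes the gradients of your functions analytically and efficiently; this is crucial for training machine learning models with gradient descent. Our goal here is to provide a gentle introduction to PyTorch and discuss best practices for using it.

The first thing to learn about PyTorch is the concept of tensors. Tensors are simply multidimensional arrays. A PyTorch tensor is very similar to a NumPy array with some magical additional functionality.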

A tensor can store a scalar value:

import torch
a = torch.tensor(3)
print(a)  # tensor(3)

or an array:

b = torch.tensor([1, 2])
print(b)  # tensor([1, 2])

a matrix:

c = torch.zeros([2, 2])
print(c)  # tensor([[0., 0.], [0., 0.]])

or a tensor with any number of dimensions:

d = torch.rand([2, 2, 2])

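Tensors can be used to perform algebraic operations efficiently. One of the most commonly used operations in machine learning applications is matrix multiplication. Say you want to multiply two random matrices of size 3x5 and 5x4; this can be done with the matrix multiplication (@) operator:

import torch

x = torch.randn([3, 5])
y = torch.randn([5, 4])
z = x @ y

print(z)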

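Similarly, to add two vectors you can use the + operator. The operands must have matching (or broadcastable) shapes, so we create two new vectors of the same length here:

x = torch.randn([5])
y = torch.randn([5])
z = x + y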

To convert a tensor into a NumPy array you can call the tensor's numpy() method:

print(z.numpy())

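And you can always convert a NumPy array into a tensor:

import numpy as np

# size= sets the shape; passing [3, 5] positionally would set the mean instead.
x = torch.tensor(np.random.normal(size=[3, 5]))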
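As a complement to the conversion above, here is a minimal sketch of the difference between copying and sharing memory. It uses `torch.from_numpy`, which the text does not mention; that function returns a tensor backed by the same buffer as the array, whereas `torch.tensor` copies the data.

```python
import numpy as np
import torch

arr = np.zeros([3, 5], dtype=np.float32)

copied = torch.tensor(arr)      # copies the data
shared = torch.from_numpy(arr)  # shares memory with arr

arr[0, 0] = 1.0  # mutate the NumPy array in place

print(copied[0, 0].item())  # 0.0, the copy is unaffected
print(shared[0, 0].item())  # 1.0, the shared tensor sees the change
```

Sharing avoids a copy for large arrays, but it also means in-place edits on either side are visible on the other.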

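Automatic differentiation

The most important advantage of PyTorch over NumPy is its automatic differentiation functionality, which is very useful in optimization applications such as optimizing the parameters of a neural network. Let's try to understand it with an example.

Say you have a composite function which is a chain of two functions: g(u(x)). To compute the derivative of g with respect to x we can use the chain rule, which states that dg/dx = dg/du * du/dx. PyTorch can analytically compute the derivatives for us.

To compute derivatives in PyTorch, we first create a tensor and set its requires_grad to true. We can then use tensor operations to define our functions. We assume u is a quadratic function and g is a simple linear function: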

x = torch.tensor(1.0, requires_grad=True)

def u(x):
  return x * x

def g(u):
  return -u

In this case our composite function is g(u(x)) = -x*x, so its derivative with respect to x is -2x. At the point x=1, this is equal to -2.

Let's verify this. It can be done using the grad function in PyTorch:

dgdx = torch.autograd.grad(g(u(x)), x)[0]
print(dgdx)  # tensor(-2.)
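To connect this result back to the chain rule, the following sketch computes the two factors dg/du and du/dx separately and checks that their product matches the end-to-end derivative. The intermediate names `u_val` and `g_val` are introduced here for illustration only.

```python
import torch

x = torch.tensor(1.0, requires_grad=True)

u_val = x * x   # u(x) = x^2
g_val = -u_val  # g(u) = -u

# retain_graph=True keeps the graph alive for the later grad calls.
dg_du = torch.autograd.grad(g_val, u_val, retain_graph=True)[0]
du_dx = torch.autograd.grad(u_val, x, retain_graph=True)[0]
dg_dx = torch.autograd.grad(g_val, x)[0]

print(dg_du, du_dx, dg_dx)  # tensor(-1.) tensor(2.) tensor(-2.)
```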

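Curve fitting

To see how powerful automatic differentiation can be, let's look at another example. Assume that we have samples from a curve (say f(x) = 5x^2 + 3) and we want to estimate f(x) based on these samples. We define a parametric function g(x, w) = w0 x^2 + w1 x + w2, which is a function of the input x and latent parameters w. Our goal is then to find the latent parameters such that g(x, w) ≈ f(x). This can be done by minimizing the following loss function: L(w) = Σ (f(x) - g(x, w))^2. Although there is a closed-form solution for this simple problem, we opt for a more general approach that can be applied to any differentiable function: stochastic gradient descent. We simply compute the average gradient of L(w) with respect to w over a set of sample points and move in the opposite direction.

Here's how it can be done in PyTorch: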

import torch

# Assuming we know that the desired function is a polynomial of 2nd degree, we
# allocate a vector of size 3 to hold the coefficients and initialize it with
# random noise.
w = torch.randn([3, 1], requires_grad=True)

# We use the Adam optimizer with learning rate set to 0.1 to minimize the loss.
opt = torch.optim.Adam([w], 0.1)

def model(x):
    # We define yhat to be our estimate of y.
    f = torch.stack([x * x, x, torch.ones_like(x)], 1)
    yhat = torch.squeeze(f @ w, 1)
    return yhat

def compute_loss(y, yhat):
    # The loss is defined to be the mean squared error distance between our
    # estimate of y and its true value.
    loss = torch.nn.functional.mse_loss(yhat, y)
    return loss

def generate_data():
    # Generate some training data based on the true function
    x = torch.rand(100) * 20 - 10
    y = 5 * x * x + 3
    return x, y

def train_step():
    x, y = generate_data()

    yhat = model(x)
    loss = compute_loss(y, yhat)

    opt.zero_grad()
    loss.backward()
    opt.step()

for _ in range(1000):
    train_step()

print(w.detach().numpy())

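By running this piece of code you should see a result close to this:

[4.9924135, 0.00040895029, 3.4504161]

This is a relatively close approximation to the true parameters (5, 0, 3).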
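The text notes that this problem also has a closed-form solution. As a sanity check, here is a minimal sketch that solves the same least-squares problem directly with `torch.linalg.lstsq` (assuming a PyTorch version that provides `torch.linalg`). The variable names are illustrative and not part of the original example.

```python
import torch

torch.manual_seed(0)
x = torch.rand(100) * 20 - 10
y = 5 * x * x + 3

# Design matrix with columns [x^2, x, 1], matching the model above.
features = torch.stack([x * x, x, torch.ones_like(x)], 1)

# Solve min_w ||features @ w - y||^2 in closed form.
w_closed = torch.linalg.lstsq(features, y.unsqueeze(1)).solution.squeeze(1)
print(w_closed)  # approximately [5, 0, 3]
```

Gradient descent remains the more general tool, since it does not require the model to be linear in its parameters.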

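This is just the tip of the iceberg for what PyTorch can do. Many problems, such as optimizing large neural networks with millions of parameters, can be implemented efficiently in PyTorch in just a few lines of code. PyTorch takes care of scaling across multiple devices and threads, and supports a variety of platforms.

Encapsulate your model with Modules

In the previous example we used bare-bones tensors and tensor operations to build our model. To make your code slightly more organized, it's recommended to use PyTorch's modules. A module is simply a container for your parameters that encapsulates model operations. For example, say you want to represent a linear model y = ax + b. This model can be represented with the following code: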

import torch

class Net(torch.nn.Module):

  def __init__(self):
    super().__init__()
    self.a = torch.nn.Parameter(torch.rand(1))
    self.b = torch.nn.Parameter(torch.rand(1))

  def forward(self, x):
    yhat = self.a * x + self.b
    return yhat

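To use this model in practice you instantiate the module and simply call it like a function:

x = torch.arange(100, dtype=torch.float32)

net = Net()
y = net(x)

Parameters are essentially tensors with requires_grad set to true. They are convenient because you can retrieve them all with the module's parameters() method:

for p in net.parameters():
    print(p)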

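Now, say you have an unknown function y = 5x + 3 + some noise, and you want to optimize the parameters of your model to fit this function. You can start by sampling some points from your function:

x = torch.arange(100, dtype=torch.float32) / 100
y = 5 * x + 3 + torch.rand(100) * 0.3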

Similar to the previous example, you can define a loss function and optimize the parameters of your model as follows:

criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

for i in range(10000):
  net.zero_grad()
  yhat = net(x)
  loss = criterion(yhat, y)
  loss.backward()
  optimizer.step()

print(net.a, net.b)  # Should be close to 5 and 3

PyTorch comes with a number of predefined modules. One such module is torch.nn.Linear, which is a more general form of a linear function than the one we defined above. We can rewrite our module using torch.nn.Linear like this:

class Net(torch.nn.Module):

  def __init__(self):
    super().__init__()
    self.linear = torch.nn.Linear(1, 1)

  def forward(self, x):
    yhat = self.linear(x.unsqueeze(1)).squeeze(1)
    return yhat

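Note that we used squeeze and unsqueeze since torch.nn.Linear operates on batches of vectors as opposed to scalars.

By default, calling parameters() on a module returns the parameters of all its submodules:

net = Net()
for p in net.parameters():
    print(p)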

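There are some predefined modules that act as containers for other modules. The most commonly used container module is torch.nn.Sequential. As its name implies, it is used to stack multiple modules (or layers) on top of each other. For example, to stack two Linear layers with a ReLU nonlinearity in between you can do:

model = torch.nn.Sequential(
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)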
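As a quick illustration of how such a container behaves, the sketch below feeds a batch through the same stack and counts its parameters. The batch size of 8 is an arbitrary choice made for this example.

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(64, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)

batch = torch.rand([8, 64])  # hypothetical batch of 8 feature vectors
out = model(batch)
num_params = sum(p.numel() for p in model.parameters())

print(out.shape)   # torch.Size([8, 10])
print(num_params)  # 2410 = (64*32 + 32) + (32*10 + 10)
```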

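Broadcasting the good and the ugly

PyTorch supports broadcasting elementwise operations. Normally when you want to perform operations like addition and multiplication, you need to make sure that the shapes of the operands match, e.g. you can't add a tensor of shape [3, 2] to a tensor of shape [3, 4]. But there's a special case, and that's when you have a singular dimension (a dimension of size 1). PyTorch implicitly tiles the tensor across its singular dimensions to match the shape of the other operand. So it's valid to add a tensor of shape [3, 2] to a tensor of shape [3, 1].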

import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[1.], [2.]])
# c = a + b.repeat([1, 2])
c = a + b

print(c)
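The following sketch checks that the broadcast result matches explicit tiling, and uses `Tensor.expand` (not mentioned in the text) to show that a broadcast view can be created without copying data: the expanded dimension has stride 0.

```python
import torch

a = torch.tensor([[1., 2.], [3., 4.]])
b = torch.tensor([[1.], [2.]])

implicit = a + b                 # broadcasting
explicit = a + b.repeat([1, 2])  # materialized tiling
view = b.expand(2, 2)            # a view over b, no data is copied

print(implicit)       # tensor([[2., 3.], [5., 6.]])
print(view.stride())  # the expanded dimension has stride 0
```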

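Broadcasting allows us to perform implicit tiling, which makes the code shorter and more memory efficient, since we don't need to store the result of the tiling operation. One neat place where this can be used is when combining features of varying length. To concatenate features of varying length we commonly tile the input tensors, concatenate the result and apply some nonlinearity. This is a common pattern across a variety of neural network architectures:

a = torch.rand([5, 3, 5])
b = torch.rand([5, 1, 6])

linear = torch.nn.Linear(11, 10)

# concat a and b and apply nonlinearity
tiled_b = b.repeat([1, 3, 1])
c = torch.cat([a, tiled_b], 2)
d = torch.nn.functional.relu(linear(c))

print(d.shape)  # torch.Size([5, 3, 10])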

But this can be done more efficiently with broadcasting. We use the fact that f(m(x + y)) is equal to f(mx + my) for a linear map m. So we can do the linear operations separately and use broadcasting to do implicit concatenation:

a = torch.rand([5, 3, 5])
b = torch.rand([5, 1, 6])

linear1 = torch.nn.Linear(5, 10)
linear2 = torch.nn.Linear(6, 10)

pa = linear1(a)
pb = linear2(b)
d = torch.nn.functional.relu(pa + pb)

print(d.shape)  # torch.Size([5, 3, 10])
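To make the equivalence concrete, here is a small sketch that splits the weight matrix of a single 11-input linear layer by input columns and checks that the broadcast sum reproduces the tile-and-concatenate result. Splitting the weights this way, and adding the bias only once, is a construction made for this check rather than part of the original code.

```python
import torch

torch.manual_seed(0)
a = torch.rand([5, 3, 5])
b = torch.rand([5, 1, 6])

linear = torch.nn.Linear(11, 10)

# Reference: tile b, concatenate, then apply the single linear layer.
tiled_b = b.repeat([1, 3, 1])
ref = linear(torch.cat([a, tiled_b], 2))

# Split the same weight matrix by input columns; add the bias only once.
w_a, w_b = linear.weight[:, :5], linear.weight[:, 5:]
out = a @ w_a.T + b @ w_b.T + linear.bias

print(torch.allclose(ref, out, atol=1e-5))
```

The broadcast version applies the second projection to 5 rows instead of 15 and never materializes the tiled copy of b.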

In fact, this piece of code is pretty general and can be applied to tensors of arbitrary shape, as long as broadcasting between the tensors is possible:

class Merge(torch.nn.Module):
    def __init__(self, in_features1, in_features2, out_features, activation=None):
        super().__init__()
        self.linear1 = torch.nn.Linear(in_features1, out_features)
        self.linear2 = torch.nn.Linear(in_features2, out_features)
        self.activation = activation

    def forward(self, a, b):
        pa = self.linear1(a)
        pb = self.linear2(b)
        c = pa + pb
        if self.activation is not None:
            c = self.activation(c)
        return c

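So far we have discussed the good part of broadcasting. But what's the ugly part, you may ask? Implicit assumptions almost always make debugging harder. Consider the following example:

a = torch.tensor([[1.], [2.]])
b = torch.tensor([1., 2.])
c = torch.sum(a + b)

print(c)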

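What do you think the value of c would be after evaluation? If you guessed 6, that's wrong. It's going to be 12. This is because when the ranks of two tensors don't match, PyTorch automatically expands the first dimension of the tensor with the lower rank before the elementwise operation, so the result of the addition would be [[2, 3], [3, 4]], and reducing over all elements gives us 12.

The way to avoid this problem is to be as explicit as possible. Had we specified which dimension we wanted to reduce across, catching this bug would have been much easier:

a = torch.tensor([[1.], [2.]])
b = torch.tensor([1., 2.])
c = torch.sum(a + b, 0)

print(c)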

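Here the value of c would be [5, 7], and we would immediately guess from the shape of the result that something is wrong. A general rule of thumb is to always specify the dimensions in reduction operations and when using torch.squeeze.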
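One way to apply the "be explicit" advice is to refuse implicit broadcasting where it is not intended. The helper `add_strict` below is hypothetical (it is not a PyTorch function); it is only a sketch of the idea under the assumption that the two operands are meant to have identical shapes.

```python
import torch

def add_strict(a, b):
    # Hypothetical helper: refuse to broadcast, require identical shapes.
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {tuple(a.shape)} vs {tuple(b.shape)}")
    return a + b

a = torch.tensor([[1.], [2.]])
b = torch.tensor([1., 2.])

try:
    add_strict(a, b)
    caught = False
except ValueError:
    caught = True

c = torch.sum(add_strict(a, b.reshape(2, 1)))  # shapes made explicit
print(caught, c)  # True tensor(6.)
```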

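Take advantage of the overloaded operators

Just like NumPy, PyTorch overloads a number of Python operators to make PyTorch code shorter and more readable.

The slicing op is one of the overloaded operators that makes indexing tensors very easy:

z = x[begin:end]  # z = torch.narrow(x, 0, begin, end - begin)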

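Be very careful when using this op though. The slicing op, like any other op, has some overhead. Because it's a common and innocent-looking op, it tends to get overused, which can lead to inefficiencies. To understand how inefficient this op can be, let's look at an example. We want to manually perform a reduction across the rows of a matrix:

import torch
import time

x = torch.rand([500, 10])

z = torch.zeros([10])

start = time.time()
for i in range(500):
    z += x[i]
print("Took %f seconds." % (time.time() - start))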

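This runs quite slowly, and the reason is that we are calling the slice op 500 times, which adds a lot of overhead. A better choice is to use the torch.unbind op to slice the matrix into a list of vectors all at once:

z = torch.zeros([10])
for x_i in torch.unbind(x):
    z += x_i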

This is noticeably faster (~30% on my machine).

Of course, the right way to do this simple reduction is to use the torch.sum op and do it in a single op:

z = torch.sum(x, dim=0)

This is extremely fast (~100x faster on my machine).
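Timings vary from machine to machine, so rather than repeating the benchmark, this sketch only checks that the three variants compute the same reduction. The result names are introduced here for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.rand([500, 10])

# Variant 1: one slice op per row.
z_slice = torch.zeros([10])
for i in range(500):
    z_slice += x[i]

# Variant 2: split into rows once, then accumulate.
z_unbind = torch.zeros([10])
for x_i in torch.unbind(x):
    z_unbind += x_i

# Variant 3: a single reduction op.
z_sum = torch.sum(x, dim=0)

print(torch.allclose(z_slice, z_sum, atol=1e-3))
```

A small tolerance is used because floating-point sums can differ slightly depending on the order of accumulation.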

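PyTorch also overloads a range of arithmetic and logical operators:

z = -x  # z = torch.neg(x)
z = x + y  # z = torch.add(x, y)
z = x - y  # z = torch.sub(x, y)
z = x * y  # z = torch.mul(x, y)
z = x / y  # z = torch.div(x, y)
z = x // y  # z = torch.floor_divide(x, y)
z = x % y  # z = torch.remainder(x, y)
z = x ** y  # z = torch.pow(x, y)
z = x @ y  # z = torch.matmul(x, y)
z = x > y  # z = torch.gt(x, y)
z = x >= y  # z = torch.ge(x, y)
z = x < y  # z = torch.lt(x, y)
z = x <= y  # z = torch.le(x, y)
z = abs(x)  # z = torch.abs(x)
# The four operators below are bitwise; on boolean tensors they act as
# logical and / or / xor / not.
z = x & y  # z = torch.bitwise_and(x, y)
z = x | y  # z = torch.bitwise_or(x, y)
z = x ^ y  # z = torch.bitwise_xor(x, y)
z = ~x  # z = torch.bitwise_not(x)
z = x == y  # z = torch.eq(x, y)
z = x != y  # z = torch.ne(x, y)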

You can also use the augmented versions of these ops. For example, x += y and x **= 2 are also valid.

Note that Python doesn't allow overloading the and, or, and not keywords.
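A short sketch of the practical consequence: elementwise conditions have to be combined with `&`, `|` and `~`, because the `and` keyword asks Python for a single truth value, which a multi-element tensor cannot provide.

```python
import torch

x = torch.tensor([0, 1, 2, 3])

mask = (x > 0) & (x < 3)  # elementwise, via the overloaded & operator
print(mask)               # tensor([False,  True,  True, False])

try:
    (x > 0) and (x < 3)   # Python calls bool() on a multi-element tensor
    raised = False
except RuntimeError:
    raised = True
print(raised)             # True
```

The parentheses around each comparison matter, since `&` binds more tightly than `>` and `<`.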

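Optimize runtime with TorchScript

PyTorch is optimized to perform operations on large tensors. Doing many operations on small tensors is quite inefficient in PyTorch. So, whenever possible, you should rewrite your computations in batch form to reduce overhead and improve performance. If there's no way to batch your operations manually, using TorchScript may improve your code's performance. TorchScript is simply a subset of Python that is recognized by PyTorch. PyTorch can automatically optimize your TorchScript code using its just-in-time (JIT) compiler and reduce some overheads.

Let's look at an example. A very common operation in ML applications is "batch gather". This operation can simply be written as output[i] = input[i, index[i]]. It can be implemented in PyTorch as follows: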

import torch

def batch_gather(tensor, indices):
    output = []
    for i in range(tensor.size(0)):
        output += [tensor[i][indices[i]]]
    return torch.stack(output)

To implement the same function using TorchScript, simply use the torch.jit.script decorator:

@torch.jit.script
def batch_gather_jit(tensor, indices):
    output = []
    for i in range(tensor.size(0)):
        output += [tensor[i][indices[i]]]
    return torch.stack(output)

In my tests this is about 10% faster.

But nothing beats manually batching your operations. A vectorized implementation is 100 times faster in my tests:

def batch_gather_vec(tensor, indices):
    shape = list(tensor.shape)
    flat_first = torch.reshape(
        tensor, [shape[0] * shape[1]] + shape[2:])
    # Build the row offsets on the same device as indices instead of
    # assuming that a CUDA device is available.
    offset = torch.reshape(
        torch.arange(shape[0], device=indices.device) * shape[1],
        [shape[0]] + [1] * (len(indices.shape) - 1))
    output = flat_first[indices + offset]
    return output
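For the common 2-D case, the same batch gather can also be expressed with the built-in `torch.gather`. The sketch below is an alternative formulation, not the author's implementation; it assumes `tensor` has shape [batch, n] and `indices` has shape [batch], and it checks the result against a loop version.

```python
import torch

def batch_gather_loop(tensor, indices):
    # Reference implementation: one indexing op per row.
    return torch.stack([tensor[i][indices[i]] for i in range(tensor.size(0))])

def batch_gather_2d(tensor, indices):
    # Assumes tensor has shape [batch, n] and indices has shape [batch].
    return torch.gather(tensor, 1, indices.unsqueeze(1)).squeeze(1)

tensor = torch.arange(12.).reshape(3, 4)
indices = torch.tensor([0, 2, 3])

print(batch_gather_2d(tensor, indices))  # tensor([ 0.,  6., 11.])
```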

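Building efficient custom data loaders

In the last lesson we talked about writing efficient PyTorch code. But to make your code run with maximum efficiency you also need to load your data efficiently into your device's memory. Fortunately PyTorch offers a tool to make data loading easy. It's called DataLoader. A DataLoader uses multiple workers to simultaneously load data from a Dataset, and optionally uses a Sampler to sample data entries and form a batch.

If you can randomly access your data, using a DataLoader is very easy: you simply need to implement a Dataset class that provides the __getitem__ (to read each data item) and __len__ (to return the number of items in the dataset) methods. For example, here's how to load images from a given directory: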

import glob
import os
import cv2
import torch

class ImageDirectoryDataset(torch.utils.data.Dataset):
    def __init__(self, path, pattern):
        super().__init__()
        self.paths = list(glob.glob(os.path.join(path, pattern)))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        # Read the image at the given index as a 3-channel color image.
        return cv2.imread(self.paths[index], 1)

To load all JPEG images from a given directory you can then do the following:

dataloader = torch.utils.data.DataLoader(
    ImageDirectoryDataset("/data/imagenet", "*.jpg"), num_workers=8)
for data in dataloader:
    pass  # do something with data

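Here we are using 8 workers to simultaneously read our data from disk. You can tune the number of workers on your machine for optimal results.

Using a DataLoader to read data with random access may be fine if you have fast storage or if your data items are large. But imagine having a network file system with a slow connection. Requesting individual files this way can be extremely slow and would probably end up becoming the bottleneck of your training pipeline.

A better approach is to store your data in a contiguous file format that can be read sequentially. For example, if you have a large collection of images you can use tar to create a single archive and extract files from the archive sequentially in Python. To do this you can use PyTorch's IterableDataset. To create an IterableDataset class you only need to implement an __iter__ method which sequentially reads and yields data items from the dataset.

A naive implementation would look like this: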

import tarfile
import cv2
import numpy as np
import torch

def tar_image_iterator(path):
    tar = tarfile.open(path, "r")
    for tar_info in tar:
        file = tar.extractfile(tar_info)
        if file is None:  # skip directories and other non-file members
            continue
        content = file.read()
        file.close()
        # cv2.imdecode expects a uint8 array rather than raw bytes.
        yield cv2.imdecode(np.frombuffer(content, dtype=np.uint8), 1)
        tar.members = []  # drop cached members to keep memory usage flat
    tar.close()

class TarImageDataset(torch.utils.data.IterableDataset):
    def __init__(self, path):
        super().__init__()
        self.path = path

    def __iter__(self):
        yield from tar_image_iterator(self.path)

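But there's a major problem with this implementation. If you try to use a DataLoader to read from this dataset with more than one worker, you will observe a lot of duplicated images:

dataloader = torch.utils.data.DataLoader(TarImageDataset("/data/imagenet.tar"), num_workers=8)
for data in dataloader:
    pass  # data contains duplicated items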

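The problem is that each worker creates a separate instance of the dataset, and each would start from the beginning of the dataset. One way to avoid this is, instead of having one tar file, to split your data into num_workers separate tar files and load each with a separate worker: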

class TarImageDataset(torch.utils.data.IterableDataset):
    def __init__(self, paths):
        super().__init__()
        self.paths = paths

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        # For simplicity we assume num_workers is equal to number of tar files
        if worker_info is None or worker_info.num_workers != len(self.paths):
            raise ValueError("Number of workers doesn't match number of files.")
        # The worker index is exposed as worker_info.id.
        yield from tar_image_iterator(self.paths[worker_info.id])

This is how our dataset class can be used:

dataloader = torch.utils.data.DataLoader(
    TarImageDataset(["/data/imagenet_part1.tar", "/data/imagenet_part2.tar"]), num_workers=2)
for data in dataloader:
    pass  # do something with data

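We discussed a simple strategy to avoid the duplicated entries problem. The tfrecord package uses slightly more sophisticated strategies to shard your data on the fly.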
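As one illustration of on-the-fly sharding (a generic sketch, not a description of how the tfrecord package works), each worker can read the same stream and keep only every num_workers-th item. The `shard` helper and the toy `ShardedRangeDataset` are hypothetical names introduced for this example, and the dataset yields integers instead of images.

```python
import itertools
import torch

def shard(iterable, worker_id, num_workers):
    # Keep items worker_id, worker_id + num_workers, worker_id + 2 * num_workers, ...
    return itertools.islice(iterable, worker_id, None, num_workers)

class ShardedRangeDataset(torch.utils.data.IterableDataset):
    # Hypothetical toy dataset that yields integers instead of images.
    def __init__(self, size):
        super().__init__()
        self.size = size

    def __iter__(self):
        info = torch.utils.data.get_worker_info()
        if info is None:  # single-process loading: read everything
            return iter(range(self.size))
        return shard(range(self.size), info.id, info.num_workers)

# Simulate two workers without spawning processes.
parts = [list(shard(range(10), w, 2)) for w in range(2)]
print(parts)  # [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

With a tar archive, every worker would still scan the whole file and skip the entries it does not own, so this trades extra sequential reads for not having to split the archive in advance.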

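Numerical stability in PyTorch

When using any numerical computation library such as NumPy or PyTorch, it's important to note that writing mathematically correct code doesn't necessarily lead to correct results. You also need to make sure that the computations are stable.

Let's start with a simple example. Mathematically, it's easy to see that x * y / y = x for any nonzero value of y. But let's see if that's always true in practice: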

import numpy as np

x = np.float32(1)

y = np.float32(1e-50)  # y would be stored as zero
z = x * y / y

print(z)  # prints nan

The reason for the incorrect result is that y is simply too small for the float32 type. A similar problem occurs when y is too large:

y = np.float32(1e39)  # y would be stored as inf
z = x * y / y

print(z)  # prints nan

The smallest positive value that the float32 type can represent is 1.4013e-45, and anything below that would be stored as zero. Also, any number beyond 3.40282e+38 would be stored as inf.

print(np.nextafter(np.float32(0), np.float32(1)))  # prints 1.4013e-45
print(np.finfo(np.float32).max)  # prints 3.40282e+38

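To make sure that your computations are stable, you want to avoid values with very small or very large absolute value. This may sound obvious, but these kinds of problems can become extremely hard to debug, especially when doing gradient descent in PyTorch. This is because you not only need to make sure that all the values in the forward pass are within the valid range of your data types, but also need to make sure of the same for the backward pass (during gradient computation).

Let's look at a real example. We want to compute the softmax over a vector of logits. A naive implementation would look something like this: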

import torch

def unstable_softmax(logits):
    exp = torch.exp(logits)
    return exp / torch.sum(exp)

print(unstable_softmax(torch.tensor([1000., 0.])).numpy())  # prints [ nan, 0.]

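Note that computing the exponential of even relatively small logits produces gigantic results that are out of float32 range. The largest valid logit for our naive softmax implementation is ln(3.40282e+38) = 88.7; anything beyond that leads to a nan outcome.

But how can we make this more stable? The solution is rather simple. It's easy to see that exp(x - c) / Σ exp(x - c) = exp(x) / Σ exp(x). Therefore we can subtract any constant from the logits and the result would remain the same. We choose this constant to be the maximum of the logits. This way the domain of the exponential function is limited to [-inf, 0], and consequently its range is [0.0, 1.0], which is desirable: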

import torch

def softmax(logits):
    # Subtract the largest logit so that the exponent is never positive.
    exp = torch.exp(logits - torch.max(logits))
    return exp / torch.sum(exp)

print(softmax(torch.tensor([1000., 0.])).numpy())  # prints [ 1., 0.]
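As a sanity check, the sketch below compares the stabilized implementation against PyTorch's built-in `torch.softmax` and confirms the shift invariance used above. The logits are chosen for this example so that the naive version would overflow.

```python
import torch

def softmax(logits):
    exp = torch.exp(logits - torch.max(logits))
    return exp / torch.sum(exp)

logits = torch.tensor([1000., 999., 998.])  # exp() of these overflows float32

ours = softmax(logits)
builtin = torch.softmax(logits, dim=0)
shifted = softmax(logits - 997.)  # same logits shifted by a constant

print(ours)
print(torch.allclose(ours, builtin), torch.allclose(ours, shifted))
```

In practice the built-in function is the safer default; the hand-written version is mainly useful for understanding why the shift works.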

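Let's look at a more complicated case. Consider a classification problem. We use the softmax function to produce probabilities from our logits. We then define our loss function to be the cross entropy between our predictions and the labels. Recall that the cross entropy for a categorical distribution can be simply defined as xe(p, q) = -Σ p_i log(q_i). So a naive implementation of the cross entropy would look like this: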

def unstable_softmax_cross_entropy(labels, logits):
    logits = torch.log(softmax(logits))
    return -torch.sum(labels * logits)

labels = torch.tensor([0.5, 0.5])
logits = torch.tensor([1000., 0.])

xe = unstable_softmax_cross_entropy(labels, logits)

print(xe.numpy())  # prints inf

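Note that in this implementation, as the softmax output approaches zero, the log's output approaches negative infinity, which causes instability in our computation. We can rewrite this by expanding the softmax and doing some simplifications: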

def softmax_cross_entropy(labels, logits, dim=-1):
    scaled_logits = logits - torch.max(logits)
    normalized_logits = scaled_logits - torch.logsumexp(scaled_logits, dim)
    return -torch.sum(labels * normalized_logits)

labels = torch.tensor([0.5, 0.5])
logits = torch.tensor([1000., 0.])

xe = softmax_cross_entropy(labels, logits)

print(xe.numpy())  # prints 500.0

We can also verify that the gradients are computed correctly:

logits.requires_grad_(True)
xe = softmax_cross_entropy(labels, logits)
g = torch.autograd.grad(xe, logits)[0]
print(g.numpy())  # prints [0.5, -0.5]
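The same quantity can be obtained from PyTorch's built-in `torch.nn.functional.log_softmax`, which applies the max-subtraction trick internally. The sketch below is an alternative to the hand-written version, shown here to cross-check both the loss and its gradient.

```python
import torch
import torch.nn.functional as F

labels = torch.tensor([0.5, 0.5])
logits = torch.tensor([1000., 0.], requires_grad=True)

# Cross entropy with soft labels, computed from stable log-probabilities.
xe = -torch.sum(labels * F.log_softmax(logits, dim=-1))
grad = torch.autograd.grad(xe, logits)[0]

print(xe.item())  # 500.0
print(grad)       # tensor([ 0.5000, -0.5000])
```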

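Let me remind you again that extra care must be taken when doing gradient descent to make sure that the range of your functions, as well as the gradients for each layer, are within a valid range. Exponential and logarithmic functions are especially problematic when used naively, because they can map small numbers to enormous ones and the other way around.

Faster training with mixed precision

By default, tensors and model parameters in PyTorch are stored in 32-bit floating point precision. Training neural networks using 32-bit floats is usually stable and doesn't cause major numerical issues; however, neural networks have been shown to perform quite well in 16-bit and even lower precisions. Computation in lower precision can be significantly faster on modern GPUs. It also has the extra benefit of using less memory, which enables training larger models and/or using larger batch sizes, and that can boost performance further. The problem, though, is that training in 16 bits often becomes very unstable because the precision is usually not enough to perform some operations such as accumulations.

To help with this problem PyTorch supports training in mixed precision. In a nutshell, mixed-precision training performs some expensive operations (like convolutions and matrix multiplications) in 16-bit by casting down the inputs, while performing other numerically sensitive operations, like accumulations, in 32-bit. This way we get the benefits of 16-bit computation without its drawbacks. Next we talk about using autocast and GradScaler to do automatic mixed-precision training.

Autocast

autocast helps improve runtime performance by automatically casting down data to 16-bit for some computations. To understand how it works let's look at an example: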

import torch

x = torch.rand([32, 32]).cuda()
y = torch.rand([32, 32]).cuda()

with torch.cuda.amp.autocast():
  a = x + y
  b = x @ y
print(a.dtype)  # prints torch.float32
print(b.dtype)  # prints torch.float16

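Note that both x and y are 32-bit tensors, but autocast performs the matrix multiplication in 16-bit while keeping the addition in 32-bit. What if one of the operands is in 16-bit?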

import torch

x = torch.rand([32, 32]).cuda()
y = torch.rand([32, 32]).cuda().half()

with torch.cuda.amp.autocast():
  a = x + y
  b = x @ y
print(a.dtype)  # prints torch.float32
print(b.dtype)  # prints torch.float16

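Again autocast casts down the 32-bit operand to 16-bit to perform the matrix multiplication, but it doesn't change the addition. By default, adding two tensors of different precision in PyTorch results in a cast to the higher precision.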
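The examples above need a CUDA device. For readers without a GPU, here is a sketch of the same behaviour on CPU. It assumes a PyTorch version that provides the device-agnostic `torch.autocast` context manager with CPU support, where the lower-precision type is bfloat16 rather than float16.

```python
import torch

x = torch.rand([32, 32])
y = torch.rand([32, 32])

# Assumption: CPU autocast is available and uses bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    a = x + y  # addition keeps the input precision
    b = x @ y  # matrix multiplication is cast down

print(a.dtype)  # torch.float32
print(b.dtype)  # torch.bfloat16
```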

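In practice, you can trust autocast to do the right casting to improve runtime efficiency. The important thing is to keep all of your forward pass computations under the autocast context:

model = ...
loss_fn = ...

with torch.cuda.amp.autocast():
  outputs = model(inputs)
  loss = loss_fn(outputs, targets)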

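This may be all you need if you have a relatively stable optimization problem and use a relatively low learning rate. Adding this one line of extra code can cut your training time by up to half on modern hardware.

GradScaler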

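As we mentioned at the beginning of this section, 16-bit precision may not always be enough for some computations. One particular case of interest is representing gradient values, a great portion of which are usually small. Representing them with 16-bit floats often leads to underflow (i.e. they are represented as zeros). This makes training neural networks very unstable. GradScaler is designed to resolve this issue. It takes your loss value as input and multiplies it by a large scalar, inflating the gradient values and therefore making them representable in 16-bit precision. It then scales them back down during the gradient update to ensure that the parameters are updated correctly. This is generally what GradScaler does. But under the hood GradScaler is a bit smarter than that. Inflating the gradients may actually result in overflows, which is equally bad. So GradScaler monitors the gradient values, and if it detects overflows it skips the update and scales down the scale factor according to a configurable schedule. (The default schedule usually works, but you may need to adjust it for your use case.)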
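The underflow problem and the effect of scaling can be seen on a single number. In this sketch the gradient value 1e-8 is a made-up example, and the scale factor 65536 (2**16) is chosen to match GradScaler's default initial scale; the sketch does not use GradScaler itself.

```python
import torch

grad = torch.tensor(1e-8)  # a hypothetical tiny gradient value
scale = 65536.0            # assumed scale factor (2**16)

print(grad.half())         # tensor(0., dtype=torch.float16): underflow

scaled = (grad * scale).half()     # representable in float16 after scaling
restored = scaled.float() / scale  # unscale in float32
print(restored)                    # close to 1e-8 again
```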

Using GradScaler is very easy in practice:

scaler = torch.cuda.amp.GradScaler()

loss = ...
optimizer = ...  # an instance of torch.optim.Optimizer

scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()

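Note that we first create an instance of GradScaler. In the training loop we call GradScaler.scale to scale the loss before calling backward to produce inflated gradients; we then use GradScaler.step, which (may) update the model parameters. Finally we call GradScaler.update, which updates the scale factor if needed. That's all!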
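Putting the pieces together, the following sketch shows a complete loop that also runs on machines without a GPU. It relies on the `enabled` flag of `GradScaler` and `torch.autocast` so that both become no-ops on CPU. The tiny linear model and the synthetic regression data are placeholders invented for this example.

```python
import torch

torch.manual_seed(0)
use_cuda = torch.cuda.is_available()
device = "cuda" if use_cuda else "cpu"

model = torch.nn.Linear(4, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)

inputs = torch.rand([64, 4], device=device)
targets = inputs.sum(dim=1, keepdim=True)  # synthetic regression target

losses = []
for _ in range(100):
    optimizer.zero_grad()
    with torch.autocast(device_type=device, enabled=use_cuda):
        loss = loss_fn(model(inputs), targets)
    scaler.scale(loss).backward()  # scaling is skipped when disabled
    scaler.step(optimizer)         # plain optimizer.step() when disabled
    scaler.update()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Keeping the scaler calls in place even when they are disabled means the same training code can be used with and without mixed precision.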

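The following sample code showcases mixed-precision training on a synthetic problem: learning to generate a checkerboard from image coordinates. You can paste it into Google Colab, set the backend to GPU, and compare the single- and mixed-precision performance. Note that this is a small toy example; in practice, with larger networks, you may see larger performance gains from mixed precision.

An example

Generate a checkerboard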

import torch
import matplotlib.pyplot as plt
import time

def grid(width, height):
  hrange = torch.arange(width).unsqueeze(0).repeat([height, 1]).div(width)
  vrange = torch.arange(height).unsqueeze(1).repeat([1, width]).div(height)
  output = torch.stack([hrange, vrange], 0)
  return output


def checker(width, height, freq):
  hrange = torch.arange(width).reshape([1, width]).mul(freq / width / 2.0).fmod(1.0).gt(0.5)
  vrange = torch.arange(height).reshape([height, 1]).mul(freq / height / 2.0).fmod(1.0).gt(0.5)
  output = hrange.logical_xor(vrange).float()
  return output

# Note the inputs are grid coordinates and the target is a checkerboard
inputs = grid(512, 512).unsqueeze(0).cuda()
targets = checker(512, 512, 8).unsqueeze(0).unsqueeze(1).cuda()

Define a convolutional neural network

class Net(torch.jit.ScriptModule):
  def __init__(self):
    super().__init__()
    self.net = torch.nn.Sequential(
      torch.nn.Conv2d(2, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 256, 1),
      torch.nn.BatchNorm2d(256),
      torch.nn.ReLU(),
      torch.nn.Conv2d(256, 1, 1))

  @torch.jit.script_method
  def forward(self, x):
    return self.net(x)

Single precision training

net = Net().cuda()
loss_fn = torch.nn.MSELoss()
opt = torch.optim.Adam(net.parameters(), 0.001)

start_time = time.time()

for i in range(500):
  opt.zero_grad()
  outputs = net(inputs)
  loss = loss_fn(outputs, targets)
  loss.backward()
  opt.step()
print(loss)

print(time.time() - start_time)

plt.subplot(1, 2, 1); plt.imshow(outputs.squeeze().detach().cpu());
plt.subplot(1, 2, 2); plt.imshow(targets.squeeze().cpu()); plt.show()

Mixed precision training

net = Net().cuda()
loss_fn = torch.nn.MSELoss()
opt = torch.optim.Adam(net.parameters(), 0.001)

scaler = torch.cuda.amp.GradScaler()

start_time = time.time()

for i in range(500):
  opt.zero_grad()
  with torch.cuda.amp.autocast():
    outputs = net(inputs)
    loss = loss_fn(outputs, targets)
  scaler.scale(loss).backward()
  scaler.step(opt)
  scaler.update()
print(loss)

print(time.time() - start_time)

plt.subplot(1, 2, 1); plt.imshow(outputs.squeeze().detach().cpu().float());
plt.subplot(1, 2, 2); plt.imshow(targets.squeeze().cpu().float()); plt.show()