SwissArmyTransformer

Python

1.0.0

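Introduction

sat (SwissArmyTransformer) is a flexible and powerful library for developing your own Transformer variants.

sat is named after the "Swiss Army knife": all models (e.g. BERT, GPT, T5, GLM, CogView, ViT ...) share the same backbone code and cater to versatile usages through a few extra lightweight mixins.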
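
The shared-backbone idea can be pictured with a small pure-Python toy. This is only a sketch of the general mixin-and-hook pattern, not sat's implementation: `ToyBackbone`, `ToyMixin`, `SumHeadMixin` and the hook lookup are hypothetical names and logic invented for illustration.

```python
class ToyMixin:
    """A bundle of hook overrides; any method it defines can act as a hook."""


class ToyBackbone:
    """Shared backbone whose steps can be overridden by registered mixins."""

    def __init__(self):
        self.mixins = {}

    def add_mixin(self, name, mixin):
        if name in self.mixins:
            raise ValueError(f"mixin {name!r} is already registered")
        self.mixins[name] = mixin

    def del_mixin(self, name):
        del self.mixins[name]

    def _hook(self, hook_name, default):
        # Use the first registered mixin that defines the hook, else the default.
        for mixin in self.mixins.values():
            fn = getattr(mixin, hook_name, None)
            if callable(fn):
                return fn
        return default

    def _default_final_forward(self, hidden):
        return hidden

    def forward(self, x):
        hidden = [2 * v for v in x]  # stand-in for the shared Transformer layers
        final = self._hook("final_forward", self._default_final_forward)
        return final(hidden)


class SumHeadMixin(ToyMixin):
    """Hypothetical task head: reduce the hidden states to a single score."""

    def final_forward(self, hidden):
        return sum(hidden)


model = ToyBackbone()
plain = model.forward([1, 2, 3])       # backbone only
model.add_mixin("sum_head", SumHeadMixin())
with_head = model.forward([1, 2, 3])   # same backbone, different head
```

The backbone code never changes between the two calls; only the registered mixin decides what the final step does.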

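sat is powered by deepspeed-ZeRO and model parallelism, and aims to provide best practices for pretraining and finetuning large models (100M~20B parameters).

Install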

    pip install SwissArmyTransformer

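Features

Add model-agnostic components, e.g. prefix-tuning, in just ONE line!

Prefix-tuning (or P-tuning) improves finetuning by adding trainable parameters to each attention layer. Applying it to a GLM classification model (or any other model) is easy with our library.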

    class ClassificationModel(GLMModel):  # can also be BertModel, RobertaModel, etc.
        def __init__(self, args, transformer=None, **kwargs):
            super().__init__(args, transformer=transformer, **kwargs)
            self.add_mixin('classification_head', MLPHeadMixin(args.hidden_size, 2048, 1))
            # Arm an arbitrary model with Prefix-tuning with this line!
            self.add_mixin('prefix-tuning', PrefixTuningMixin(args.num_layers, args.hidden_size // args.num_attention_heads, args.num_attention_heads, args.prefix_len))
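
To make "trainable parameters in each attention layer" concrete, here is a minimal NumPy sketch of a single attention head that also attends to a few extra prefix slots. It is an illustration of the general prefix-tuning idea under simplifying assumptions (one head, no batch dimension, random values standing in for learned parameters), not the code inside `PrefixTuningMixin`; all function names below are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v):
    """Single-head scaled dot-product attention; q, k, v are (seq, head_dim)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def prefix_attention(q, k, v, prefix_k, prefix_v):
    """Attention where every query also attends to prefix_len extra slots."""
    k_all = np.concatenate([prefix_k, k], axis=0)
    v_all = np.concatenate([prefix_v, v], axis=0)
    return attention(q, k_all, v_all)


rng = np.random.default_rng(0)
seq_len, head_dim, prefix_len = 5, 8, 3
q, k, v = (rng.normal(size=(seq_len, head_dim)) for _ in range(3))
# In real prefix-tuning these two arrays would be the trainable parameters.
prefix_k = rng.normal(size=(prefix_len, head_dim))
prefix_v = rng.normal(size=(prefix_len, head_dim))

out = prefix_attention(q, k, v, prefix_k, prefix_v)
```

The output keeps the shape of the input sequence, and an empty prefix reduces to ordinary attention, which is why such a component can be bolted onto an existing model without touching the rest of it.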

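GPT and other auto-regressive models behave differently during training and inference. During inference, text is generated token by token, and previous states need to be cached for efficiency. With our lib, you only need to consider the behavior during training (teacher forcing), then convert the model into a cached auto-regressive model by adding a mixin: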

    model, args = AutoModel.from_pretrained('glm-10b-chinese', args)
    model.add_mixin('auto-regressive', CachedAutoregressiveMixin())
    # Generate a sequence with beam search
    from sat.generation.autoregressive_sampling import filling_sequence
    from sat.generation.sampling_strategies import BeamSearchStrategy
    output, *mems = filling_sequence(model, input_seq,
                    batch_size=args.batch_size,
                    strategy=BeamSearchStrategy(args.batch_size))
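
Why caching is safe can be checked numerically. The NumPy sketch below compares a training-style pass (all positions at once under a causal mask) with an inference-style pass that handles one position per call and reuses stored keys and values. It is a toy single-head example; `KVCache` and `causal_attention` are hypothetical names and do not reflect how `CachedAutoregressiveMixin` is written.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(q, k, v):
    """Training-style pass: all positions at once, with a causal mask."""
    seq_len, dim = q.shape
    scores = q @ k.T / np.sqrt(dim)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # hide positions to the right
    return softmax(scores) @ v


class KVCache:
    """Inference-style pass: one position per call, reusing earlier keys/values."""

    def __init__(self, dim):
        self.keys = np.zeros((0, dim))
        self.values = np.zeros((0, dim))

    def step(self, q_t, k_t, v_t):
        # Append the new key/value, then attend over everything seen so far.
        self.keys = np.vstack([self.keys, k_t])
        self.values = np.vstack([self.values, v_t])
        scores = self.keys @ q_t / np.sqrt(q_t.shape[-1])
        return softmax(scores) @ self.values


rng = np.random.default_rng(0)
seq_len, dim = 6, 4
q, k, v = (rng.normal(size=(seq_len, dim)) for _ in range(3))

full = causal_attention(q, k, v)
cache = KVCache(dim)
stepwise = np.stack([cache.step(q[t], k[t], v[t]) for t in range(seq_len)])
```

Both passes give the same outputs, but the stepwise version only computes one new row of scores per token instead of recomputing the whole sequence.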

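Build your Transformer-based model with minimal code. We mentioned GLM, which differs from the standard Transformer (called BaseModel) only in its position embedding (and training losses). When coding, we only need to focus on that part.

The whole definition: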

    import torch
    from sat.model import BaseMixin, BaseModel

    class BlockPositionEmbeddingMixin(BaseMixin):
        # Define the parameters of the mixin here
        def __init__(self, max_sequence_length, hidden_size, init_method_std=0.02):
            super(BlockPositionEmbeddingMixin, self).__init__()
            self.max_sequence_length = max_sequence_length
            self.hidden_size = hidden_size
            self.block_position_embeddings = torch.nn.Embedding(max_sequence_length, hidden_size)
            torch.nn.init.normal_(self.block_position_embeddings.weight, mean=0.0, std=init_method_std)

        # Define the hook method of the mixin here
        def position_embedding_forward(self, position_ids, **kwargs):
            # position_ids stacks two id rows per sample: positions and block positions
            position_ids, block_position_ids = position_ids[:, 0], position_ids[:, 1]
            position_embeddings = self.transformer.position_embeddings(position_ids)
            block_position_embeddings = self.block_position_embeddings(block_position_ids)
            return position_embeddings + block_position_embeddings

    class GLMModel(BaseModel):
        def __init__(self, args, transformer=None):
            super().__init__(args, transformer=transformer)
            self.add_mixin('block_position_embedding',
                BlockPositionEmbeddingMixin(args.max_sequence_length, args.hidden_size)
            )  # Add the mixin for GLM
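
The indexing in `position_embedding_forward` is easier to follow with concrete shapes. The NumPy sketch below mimics it with plain lookup tables in place of `torch.nn.Embedding`. The id values are illustrative assumptions (three context tokens followed by a three-token span filled in at position 1), not data taken from the library.

```python
import numpy as np

rng = np.random.default_rng(0)
max_sequence_length, hidden_size = 16, 4
# Plain arrays standing in for the two embedding tables.
position_table = rng.normal(scale=0.02, size=(max_sequence_length, hidden_size))
block_position_table = rng.normal(scale=0.02, size=(max_sequence_length, hidden_size))

# Shape (batch, 2, seq): row 0 holds position ids, row 1 holds block position ids.
position_ids = np.array([[[0, 1, 2, 1, 1, 1],
                          [0, 0, 0, 1, 2, 3]]])

# Same slicing as in the mixin: each result has shape (batch, seq).
pos, block_pos = position_ids[:, 0], position_ids[:, 1]
embeddings = position_table[pos] + block_position_table[block_pos]
```

Each token receives the sum of two lookups, so `embeddings` has shape (batch, seq, hidden_size).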

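Comprehensive support for training. sat aims to provide best practices for pretraining and finetuning: you only need to implement forward_step and create_dataset_function, and you adjust the training configuration through hyperparameters.
- Extend training to multiple GPUs or nodes by specifying --num_nodes, --num_gpus and a simple hostfile.
- DeepSpeed and model parallelism.
- Better integration of ZeRO-2 and activation checkpointing.
- Automatic extending and shuffling of training data, and memmap.
- Successfully supports the training of CogView2 and CogVideo.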
- Currently the only open-source codebase that supports finetuning T5-10B on GPUs.

Quick Tour

The most typical Python file for using BERT in sat (for inference) looks as follows:

    # @File: inference_bert.py
    from sat import get_args, get_tokenizer, AutoModel
    # Parse args, initialize the environment. This is necessary.
    args = get_args()
    # Automatically download and load model. Will also dump model-related hyperparameters to args.
    model, args = AutoModel.from_pretrained('bert-base-uncased', args)
    # Get the BertTokenizer according to args.tokenizer_type (automatically set).
    tokenizer = get_tokenizer(args)
    # Here to use bert as you want!
    # ...

Then we can run the code via

    SAT_HOME=/path/to/download python inference_bert.py --mode inference

All officially supported model names are listed in urls.py.

Finetuning or pretraining a Transformer is also extremely easy!

    # @File: finetune_bert.py
    import torch
    from sat import get_args, get_tokenizer, AutoModel
    from sat.model.mixins import MLPHeadMixin
    from sat.training.deepspeed_training import training_main

    def create_dataset_function(path, args):
        # Here to load the dataset
        # ...
        assert isinstance(dataset, torch.utils.data.Dataset)
        return dataset

    def forward_step(data_iterator, model, args, timers):
        inputs = next(data_iterator)  # from the dataset of create_dataset_function.
        loss, *others = model(inputs)
        return loss

    # Parse args, initialize the environment. This is necessary.
    args = get_args()
    model, args = AutoModel.from_pretrained('bert-base-uncased', args)
    tokenizer = get_tokenizer(args)
    # Here to use bert as you want!
    model.del_mixin('bert-final')
    model.add_mixin('classification_head', MLPHeadMixin(args.hidden_size, 2048, 1))
    # ONE LINE to train!
    # args already includes hyperparams such as lr, train-iters, zero-stage ...
    training_main(args,
        model_cls=model,
        forward_step_function=forward_step,  # user defined
        create_dataset_function=create_dataset_function  # user defined
    )
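
The division of labour between the two user-defined functions and the trainer can be sketched without any dependencies. The toy driver below is an assumption about the general callback pattern, not a description of what `training_main` does internally: `toy_training_main`, `toy_model` and the dictionary-style `args` are hypothetical stand-ins, and no gradients are computed.

```python
def toy_training_main(args, model, forward_step_function, create_dataset_function):
    """Hypothetical driver: build the data once, then call forward_step per iteration."""
    dataset = create_dataset_function(args["train_data"], args)
    data_iterator = iter(dataset)
    losses = []
    for _ in range(args["train_iters"]):
        loss = forward_step_function(data_iterator, model, args, timers=None)
        losses.append(loss)  # a real trainer would backpropagate and step the optimizer here
    return losses


def create_dataset_function(path, args):
    # Stand-in for reading `path`: a few (input, target) pairs.
    return [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def toy_model(inputs):
    x, target = inputs
    prediction = 1.5 * x
    loss = (prediction - target) ** 2
    return loss, prediction  # loss first, extras after, matching `loss, *others`


def forward_step(data_iterator, model, args, timers):
    inputs = next(data_iterator)
    loss, *others = model(inputs)
    return loss


args = {"train_data": "/path/to/train", "train_iters": 3}
losses = toy_training_main(args, toy_model, forward_step, create_dataset_function)
```

The user code only states where the data comes from and how one batch becomes a loss; iteration, and in a real trainer optimization and distribution, stay on the trainer's side.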

Then we can run the code via

    deepspeed --include localhost:0,1 finetune_bert.py \
        --experiment-name ftbert \
        --mode finetune --train-iters 1000 --save /path/to/save \
        --train-data /path/to/train --valid-data /path/to/valid \
        --lr 0.00002 --batch-size 8 --zero-stage 1 --fp16

Here we use data parallelism on GPUs 0 and 1. We can also launch the training on many interconnected machines via --hostfile /path/to/hostfile. See the tutorial for more details.

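To write your own model, you only need to consider how it differs from the standard Transformer. For example, if you have an idea for improving the attention operation: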

    import torch
    from sat.model import BaseMixin

    class MyAttention(BaseMixin):
        def __init__(self, hidden_size):
            super(MyAttention, self).__init__()
            # MyAttention may need some new params, e.g. a learnable alpha.
            self.learnable_alpha = torch.nn.Parameter(torch.ones(hidden_size))

        # This is a hook function, the name `attention_fn` is special.
        def attention_fn(self, q, k, v, mask, dropout=None, **kwargs):
            # Code for my attention.
            # ...
            return attention_results
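
As one possible filling for the elided body, the NumPy sketch below rescales the query features with a per-dimension alpha before the usual scaled dot-product attention. The idea itself is a hypothetical example chosen for illustration, and the single-head, unbatched shapes and the convention that a mask value of 1 means "may attend" are assumptions rather than sat's actual tensor layout.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def alpha_attention(q, k, v, mask, alpha):
    """q, k, v: (seq, dim); mask: (seq, seq), 1 = may attend; alpha: (dim,).

    Every row of `mask` must allow at least one position.
    """
    scores = (q * alpha) @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask.astype(bool), scores, -np.inf)  # blocked => zero weight
    return softmax(scores) @ v


rng = np.random.default_rng(0)
seq_len, dim = 4, 8
q, k, v = (rng.normal(size=(seq_len, dim)) for _ in range(3))
causal_mask = np.tril(np.ones((seq_len, seq_len)))  # each token sees itself and the past

baseline = alpha_attention(q, k, v, causal_mask, np.ones(dim))      # alpha = 1: standard attention
rescaled = alpha_attention(q, k, v, causal_mask, np.full(dim, 0.5))  # flatter attention weights
```

Initializing alpha to ones, as the mixin above does, means the modified model starts out identical to the standard one and only drifts away as alpha is trained.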