Progen

Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation".

GPT for protein sequences.

Paper link

Appreciation

西南
農民

Install

pip install progen-torch

Usage

import torch
from progen.model import ProGen

# Random token ids standing in for one tokenized sequence: (batch_size, seq_len)
x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,  # The size of the vocabulary
    dim=512,  # The dimension of the embeddings
    seq_len=1024,  # The length of the sequences
    depth=6,  # The number of layers in the model
    window_size=256,  # The size of the window for local attention
    global_mlp_depth=2,  # The depth of the MLP in the global attention mechanism
    heads=8,  # The number of attention heads
    dim_head=512,  # The dimension of each attention head
    ff_mult=4,  # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,  # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,  # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,  # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,  # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,  # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
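
The usage example feeds random token ids into the model. A real protein sequence would first have to be mapped to integer ids. The following is a minimal sketch of such an encoding step, not part of the package: the vocabulary (20 standard amino acids plus padding and unknown tokens), the names `encode`, `PAD_ID`, `UNK_ID`, and the example sequence are all assumptions made for illustration.

```python
# Hypothetical vocabulary: the 20 standard amino acids plus two special tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID = 0  # assumed padding id
UNK_ID = 1  # assumed id for any character outside the vocabulary
TOKEN_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}
VOCAB_SIZE = len(AMINO_ACIDS) + 2  # 22; the value one would pass as num_tokens


def encode(sequence, seq_len):
    """Map a protein sequence to a fixed-length list of integer token ids."""
    # Truncate to seq_len and look up each residue, falling back to UNK_ID.
    ids = [TOKEN_TO_ID.get(aa, UNK_ID) for aa in sequence.upper()[:seq_len]]
    # Right-pad so every example has exactly seq_len ids.
    return ids + [PAD_ID] * (seq_len - len(ids))


ids = encode("MKTAYIAKQR", seq_len=16)
print(ids)
```

Wrapping the result as `torch.tensor([ids])` would give a batch of shape `(1, 16)`. With this vocabulary, the model would presumably be built with `num_tokens=22` and a `seq_len` matching the padded length.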

Dataset Strategy

Here is a table of the datasets used in the paper, with metadata and source links:

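Dataset	Description	Source
UniParc	Contains protein sequences from various sources	https://www.uniprot.org/uniparc/
UniProtKB	Contains protein sequences and annotations	https://www.uniprot.org/uniprot/
Swiss-Prot	Curated protein sequence database	https://www.uniprot.org/swiss-prot/
TrEMBL	Computer-annotated protein sequences	https://www.uniprot.org/trembl/
Pfam	Database of protein families	https://pfam.xfam.org/
NCBI taxonomy	Taxonomic classification of organisms	https://www.ncbi.nlm.nih.gov/taxonomy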

Here is a diagram showing the data preprocessing flow:

graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Training set]
    H --> J[ID test set]
    H --> K[OOD test set]

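The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI taxonomy datasets are filtered and merged in step B. The merged dataset is then split into a training set, an in-distribution (ID) test set, and an out-of-distribution (OOD) test set in step H.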
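
The filter, merge, and split steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the pipeline used in the paper or in this repository. It assumes records are `(sequence, family)` pairs, that filtering means dropping empty and duplicate sequences, and that the OOD test set is formed by holding out whole families. The function name `merge_and_split` and the family ids are hypothetical placeholders.

```python
import random


def merge_and_split(records, ood_families, test_fraction=0.1, seed=0):
    """Deduplicate records, then split into train / ID test / OOD test sets.

    records: iterable of (sequence, family) pairs (assumed schema).
    ood_families: families held out entirely for the OOD test set (assumption).
    """
    # Filter and merge: drop empty sequences and exact duplicates across sources.
    seen = set()
    merged = []
    for sequence, family in records:
        if sequence and sequence not in seen:
            seen.add(sequence)
            merged.append((sequence, family))

    # OOD test set: every record whose family was held out.
    ood_test = [r for r in merged if r[1] in ood_families]
    in_dist = [r for r in merged if r[1] not in ood_families]

    # Train / ID test: seeded random split of the remaining records.
    rng = random.Random(seed)
    rng.shuffle(in_dist)
    n_test = int(len(in_dist) * test_fraction)
    return in_dist[n_test:], in_dist[:n_test], ood_test


# Toy records with placeholder family ids, for illustration only.
records = [
    ("MKTAYIAK", "PF00001"),
    ("MKTAYIAK", "PF00001"),  # duplicate coming from a second source
    ("GSHMLEDP", "PF00002"),
    ("AVLIPFMW", "PF00003"),
    ("", "PF00002"),  # empty sequence, filtered out
    ("QRSTNDEK", "PF00002"),
]
train, id_test, ood_test = merge_and_split(records, {"PF00003"}, test_fraction=0.34)
print(len(train), len(id_test), len(ood_test))
```

Holding out entire families keeps the OOD test set disjoint from the training families, while the ID test set is drawn from the same families as the training set.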