ProGen
1.0.0

An implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation".
GPT for protein sequences.
Paper link
pip install progen-torch
import torch
from progen.model import ProGen

x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,       # The size of the vocabulary
    dim=512,              # The dimension of the embeddings
    seq_len=1024,         # The length of the sequences
    depth=6,              # The number of layers in the model
    window_size=256,      # The size of the window for local attention
    global_mlp_depth=2,   # The depth of the MLP in the global attention mechanism
    heads=8,              # The number of attention heads
    dim_head=512,         # The dimension of each attention head
    ff_mult=4,            # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,          # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,        # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,      # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,    # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,          # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
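The example above feeds the model random integers, so it may help to see how a real protein string could be turned into token IDs. The following is a minimal sketch only: the vocabulary, the special-token IDs, and the helper names `encode` and `next_token_pairs` are assumptions made for illustration and are not part of the `progen-torch` package.

```python
# Hypothetical vocabulary for illustration; not taken from progen-torch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD_ID, BOS_ID, EOS_ID, UNK_ID = 0, 1, 2, 3
TOKEN_TO_ID = {aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, seq_len):
    """Map a protein string to a fixed-length list of token IDs."""
    body = [TOKEN_TO_ID.get(aa, UNK_ID) for aa in sequence.upper()]
    ids = [BOS_ID] + body + [EOS_ID]
    ids = ids[:seq_len]  # truncate long sequences (this may cut off EOS)
    return ids + [PAD_ID] * (seq_len - len(ids))  # right-pad short ones


def next_token_pairs(ids):
    """Language-modeling pairs: the target at each position is the next token."""
    return ids[:-1], ids[1:]


token_ids = encode("MKTAYIAK", seq_len=12)
inputs, targets = next_token_pairs(token_ids)
print(token_ids)
print(inputs)
print(targets)
```

Wrapping the result as `torch.tensor([token_ids])` gives an integer tensor of shape `(1, seq_len)`, the same kind of input as `x` above, provided every ID is smaller than `num_tokens`.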
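Here is a table of the datasets used in the paper, with descriptions and source links:
| Dataset | Description | Source |
|---|---|---|
| UniParc | Protein sequences from a variety of sources | https://www.uniprot.org/uniparc/ |
| UniProtKB | Protein sequences and annotations | https://www.uniprot.org/uniprot/ |
| Swiss-Prot | Curated protein sequence database | https://www.uniprot.org/swiss-prot/ |
| TrEMBL | Computationally annotated protein sequences | https://www.uniprot.org/trembl/ |
| Pfam | Protein family database | https://pfam.xfam.org/ |
| NCBI Taxonomy | Taxonomic classification of organisms | https://www.ncbi.nlm.nih.gov/taxonomy |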
Here is a diagram showing the data preprocessing flow:
graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI Taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]
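The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI Taxonomy datasets are filtered and merged in step B. The merged dataset is then split in step H into a training set, an in-distribution (ID) test set, and an out-of-distribution (OOD) test set.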
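The filter, merge, and split steps can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the pipeline used in the paper or in this repository: it assumes each record is a `(family, sequence)` pair, that filtering means dropping empty and duplicate sequences, and that the OOD test set is formed by holding out whole families. The function names, the toy records, and the `id_test_fraction` parameter are all hypothetical.

```python
import random


def merge_and_filter(sources, min_len=1):
    """Merge records from several sources, dropping too-short and duplicate sequences."""
    seen, merged = set(), []
    for records in sources:
        for family, seq in records:
            if len(seq) >= min_len and seq not in seen:
                seen.add(seq)
                merged.append((family, seq))
    return merged


def split(records, ood_families, id_test_fraction=0.1, seed=0):
    """Return (train, id_test, ood_test).

    Assumption: sequences from held-out families form the OOD test set;
    the remaining sequences are shuffled and split into train and ID test.
    """
    rng = random.Random(seed)
    ood_test = [r for r in records if r[0] in ood_families]
    rest = [r for r in records if r[0] not in ood_families]
    rng.shuffle(rest)
    n_id = int(len(rest) * id_test_fraction)
    return rest[n_id:], rest[:n_id], ood_test


# Toy records standing in for two of the sources (made-up families and sequences).
source_a = [("famA", "MKT"), ("famB", "GAV"), ("famA", "MKL")]
source_b = [("famA", "MKT"), ("famC", "WYF"), ("famB", "GAL"), ("famB", "")]

merged = merge_and_filter([source_a, source_b])
train, id_test, ood_test = split(merged, ood_families={"famC"}, id_test_fraction=0.25)
print(len(train), len(id_test), len(ood_test))
```

Holding out entire families, rather than random sequences, is one way to make the OOD test set measure generalization to proteins unlike anything seen in training; a random split of the remaining data then serves as the ID test set.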
MIT License