Progen

Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation"

GPT for protein sequences

Paper Link

Appreciation

Lucidrains
Agorians

Install

pip install progen-torch

Usage

import torch
from progen.model import ProGen

# Random token IDs standing in for one tokenized sequence: (batch_size, seq_len)
x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,  # The size of the vocabulary
    dim=512,  # The dimension of the embeddings
    seq_len=1024,  # The length of the sequences
    depth=6,  # The number of layers in the model
    window_size=256,  # The size of the window for local attention
    global_mlp_depth=2,  # The depth of the MLP in the global attention mechanism
    heads=8,  # The number of attention heads
    dim_head=512,  # The dimension of each attention head
    ff_mult=4,  # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,  # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,  # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,  # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,  # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,  # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
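
The usage example feeds random integers to the model as a stand-in for a tokenized protein. A minimal sketch of how a real sequence could be turned into such IDs follows. The vocabulary (the 20 standard amino acids plus padding and unknown tokens), the ID layout, and the `encode` helper are assumptions made for illustration and are not taken from `progen-torch`.

```python
from typing import List

# Hypothetical vocabulary: 20 standard amino acids, plus padding and unknown tokens.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID = 0
UNK_ID = 1
TOKEN_TO_ID = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str, seq_len: int) -> List[int]:
    """Map a protein sequence to exactly `seq_len` integer token IDs."""
    ids = [TOKEN_TO_ID.get(aa, UNK_ID) for aa in sequence.upper()]
    ids = ids[:seq_len]  # truncate sequences longer than seq_len
    return ids + [PAD_ID] * (seq_len - len(ids))  # right-pad shorter ones


ids = encode("MKTAYIAK", seq_len=16)
print(ids)
```

Wrapping the result as `torch.tensor([ids])` would give a `(1, seq_len)` integer tensor of the same kind as `x` above, provided `seq_len` matches the model and every ID is below `num_tokens`.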

Dataset Strategy

Here is a table of the datasets used in the paper, with metadata and source links:

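Dataset	Description	Source
UniParc	Contains protein sequences from various sources	https://www.uniprot.org/uniparc/
UniProtKB	Contains protein sequences and annotations	https://www.uniprot.org/uniprot/
Swiss-Prot	Curated protein sequence database	https://www.uniprot.org/swiss-prot/
TrEMBL	Computer-annotated protein sequences	https://www.uniprot.org/trembl/
Pfam	Protein family database	https://pfam.xfam.org/
NCBI Taxonomy	Taxonomic classification of organisms	https://www.ncbi.nlm.nih.gov/taxonomy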

Here is a diagram showing the data preprocessing flow:

graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI Taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]

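The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI Taxonomy datasets are filtered and merged in step B. The merged dataset is then split in step H into a training set, an in-distribution (ID) test set, and an out-of-distribution (OOD) test set.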
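
The filter, merge, and split steps could be sketched roughly as follows. This is a toy illustration, not the pipeline from the paper or this repository: it assumes each record is a `(sequence, family)` pair, that filtering means dropping short and duplicate sequences, and that the OOD test set is built by holding out whole protein families while the ID test set is a random sample of the rest. The function names, the `min_len` and `id_fraction` parameters, and the toy data are all hypothetical.

```python
import random
from typing import List, Set, Tuple

Record = Tuple[str, str]  # (sequence, family label); the family field is an assumption


def merge_and_filter(sources: List[List[Record]], min_len: int = 1) -> List[Record]:
    """Concatenate sources, drop too-short sequences, keep the first copy of each sequence."""
    seen: Set[str] = set()
    merged: List[Record] = []
    for source in sources:
        for seq, family in source:
            if len(seq) >= min_len and seq not in seen:
                seen.add(seq)
                merged.append((seq, family))
    return merged


def split(records: List[Record], ood_families: Set[str],
          id_fraction: float = 0.1, seed: int = 0):
    """Return (train, id_test, ood_test)."""
    # Every record from a held-out family goes to the OOD test set.
    ood_test = [r for r in records if r[1] in ood_families]
    rest = [r for r in records if r[1] not in ood_families]
    # The ID test set is a random sample of the remaining records.
    rng = random.Random(seed)
    rng.shuffle(rest)
    n_id = int(len(rest) * id_fraction)
    return rest[n_id:], rest[:n_id], ood_test


# Toy stand-ins for two of the source databases.
uniparc_like = [("MKTAY", "famA"), ("GAVLI", "famB"), ("PFWST", "famC")]
swissprot_like = [("MKTAY", "famA"), ("CYNQD", "famA"), ("EKRHM", "famB"),
                  ("AAGGV", "famC"), ("LLIIP", "famB")]

merged = merge_and_filter([uniparc_like, swissprot_like])
train, id_test, ood_test = split(merged, ood_families={"famC"}, id_fraction=0.2)
print(len(train), len(id_test), len(ood_test))
```

Holding out entire families means no family seen in training appears in the OOD test set, so that set measures generalization to unseen families rather than to unseen members of known ones.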