Progen

Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation"

GPT for protein sequences

Paper link

Appreciation

Lucidrains
Agorians

Install

pip install progen-torch

Usage

import torch
from progen.model import ProGen

# Random token ids standing in for a batch of one sequence of length 1024
x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,  # The size of the vocabulary
    dim=512,  # The dimension of the embeddings
    seq_len=1024,  # The length of the sequences
    depth=6,  # The number of layers in the model
    window_size=256,  # The size of the window for local attention
    global_mlp_depth=2,  # The depth of the MLP in the global attention mechanism
    heads=8,  # The number of attention heads
    dim_head=512,  # The dimension of each attention head
    ff_mult=4,  # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,  # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,  # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,  # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,  # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,  # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
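
The example above feeds random token ids to the model. For real protein sequences, each residue first has to be mapped to an integer id below num_tokens. The following is a minimal sketch of one way to do that, assuming a hypothetical vocabulary of the 20 standard amino acids plus padding, start, and end tokens. The names encode and VOCAB and the id layout are illustrative only and are not part of the progen-torch package.

```python
# Hypothetical vocabulary (an assumption, not the package's tokenizer):
# ids 0-2 are special tokens, ids 3-22 are the 20 standard amino acids.
PAD, BOS, EOS = 0, 1, 2
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter residue codes
VOCAB = {aa: i + 3 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, seq_len):
    """Map a protein string to a fixed-length list of token ids."""
    ids = [BOS] + [VOCAB[aa] for aa in sequence.upper()] + [EOS]
    if len(ids) > seq_len:
        raise ValueError(f"sequence needs {len(ids)} tokens, limit is {seq_len}")
    # Right-pad so every example has exactly seq_len tokens.
    return ids + [PAD] * (seq_len - len(ids))


tokens = encode("MKTAYIAK", seq_len=16)
print(tokens)
```

Wrapping the result as torch.tensor([tokens]) would give a (1, seq_len) batch shaped like x above. With this layout every id stays below the num_tokens = 100 used in the example.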

Dataset Strategy

Here is a table of the datasets used in the paper, with metadata and source links:

Dataset	Description	Source
UniParc	Contains protein sequences from various sources	https://www.uniprot.org/uniparc/
UniProtKB	Contains protein sequences and annotations	https://www.uniprot.org/uniprot/
Swiss-Prot	Curated protein sequence database	https://www.uniprot.org/swiss-prot/
TrEMBL	Computer-annotated protein sequences	https://www.uniprot.org/trembl/
Pfam	Database of protein families	https://pfam.xfam.org/
NCBI Taxonomy	Taxonomic classification of organisms	https://www.ncbi.nlm.nih.gov/taxonomy

Here is a diagram showing the data preprocessing flow:

graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI Taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]

The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI Taxonomy datasets are filtered and merged in step B. The aggregated dataset is then split into training, in-distribution (ID) test, and out-of-distribution (OOD) test sets in step H.
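
Steps B and H can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's or the repository's actual pipeline: records are hypothetical (sequence, family) pairs, filtering is a simple length cap, merging drops exact duplicate sequences, the OOD test set is formed by holding out whole families, and the ID test set is a random sample of what remains.

```python
import random


def filter_and_merge(sources, max_len=1024):
    """Step B (assumed form): concatenate sources, drop long and duplicate sequences."""
    seen, merged = set(), []
    for records in sources:
        for seq, family in records:
            if len(seq) <= max_len and seq not in seen:
                seen.add(seq)
                merged.append((seq, family))
    return merged


def split(records, ood_families, id_fraction=0.1, seed=0):
    """Step H (assumed form): return (train, ID test, OOD test) lists."""
    # Held-out families never appear in training, so they are out of distribution.
    ood_test = [r for r in records if r[1] in ood_families]
    rest = [r for r in records if r[1] not in ood_families]
    random.Random(seed).shuffle(rest)
    n_id = max(1, int(len(rest) * id_fraction))
    return rest[n_id:], rest[:n_id], ood_test


# Toy records standing in for two source databases (made-up sequences and families).
source_a = [("MKTAYIAK", "fam1"), ("GAVLI", "fam2"), ("MKTAYIAK", "fam1")]
source_b = [("PFWMC", "fam3"), ("GAVLI", "fam2"), ("STNQY", "fam1")]

merged = filter_and_merge([source_a, source_b])
train, id_test, ood_test = split(merged, ood_families={"fam3"})
print(len(merged), len(train), len(id_test), len(ood_test))
```

Holding out families rather than random rows is one common way to make the OOD set differ from the training distribution. Whether the split is drawn this way here is an assumption of the sketch.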

License

MIT

Additional information

Version: 1.0.0
Type: Other source code
Last updated: 2025-03-08
Size: 212.98 KB
Source: GitHub
