Progen

An implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation".

GPT for protein sequences.

Paper link

Appreciation

Lucidrains
Agorians

Install

pip install progen-torch

Usage

import torch
from progen.model import ProGen

x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,  # The size of the vocabulary
    dim=512,  # The dimension of the embeddings
    seq_len=1024,  # The length of the sequences
    depth=6,  # The number of layers in the model
    window_size=256,  # The size of the window for local attention
    global_mlp_depth=2,  # The depth of the MLP in the global attention mechanism
    heads=8,  # The number of attention heads
    dim_head=512,  # The dimension of each attention head
    ff_mult=4,  # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,  # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,  # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,  # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,  # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,  # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
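
The usage example feeds the model random token IDs. For real protein sequences, each residue has to be mapped to an integer below `num_tokens` first. The following is a minimal sketch of one way to do that in plain Python. The vocabulary layout (`PAD_ID`, `BOS_ID`, `EOS_ID`, `TOKEN_TO_ID`) and the `encode`/`decode` helpers are hypothetical and are not part of the progen-torch package; the tokenization used in the paper may differ.

```python
from typing import Dict, List

# Hypothetical vocabulary: special tokens followed by the 20 standard amino acids.
# This layout is an assumption for illustration, not the package's own tokenizer.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2
TOKEN_TO_ID: Dict[str, int] = {aa: i + 3 for i, aa in enumerate(AMINO_ACIDS)}
ID_TO_TOKEN: Dict[int, str] = {i: aa for aa, i in TOKEN_TO_ID.items()}


def encode(sequence: str, seq_len: int) -> List[int]:
    """Map a protein sequence to a list of token IDs padded to seq_len."""
    ids = [BOS_ID] + [TOKEN_TO_ID[aa] for aa in sequence.upper()] + [EOS_ID]
    if len(ids) > seq_len:
        raise ValueError("sequence does not fit into seq_len")
    return ids + [PAD_ID] * (seq_len - len(ids))


def decode(ids: List[int]) -> str:
    """Map token IDs back to residues, skipping the special tokens."""
    return "".join(ID_TO_TOKEN[i] for i in ids if i in ID_TO_TOKEN)


token_ids = encode("MKTAYIAK", seq_len=16)
print(token_ids)
print(decode(token_ids))
```

Wrapping the result as `torch.tensor([token_ids])` would give a batch of shape `(1, seq_len)`, matching the shape of `x` in the usage example, provided `seq_len` equals the model's `seq_len`.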

Dataset Strategy

Here is a table of the datasets used in the paper, with metadata and source links:

Dataset	Description	Source
UniParc	Contains protein sequences from various sources	https://www.uniprot.org/uniparc/
UniProtKB	Contains protein sequences and annotations	https://www.uniprot.org/uniprot/
Swiss-Prot	Curated protein sequence database	https://www.uniprot.org/swiss-prot/
TrEMBL	Computer-annotated protein sequences	https://www.uniprot.org/trembl/
Pfam	Database of protein families	https://pfam.xfam.org/
NCBI Taxonomy	Taxonomic classification of organisms	https://www.ncbi.nlm.nih.gov/taxonomy

Here is a diagram showing the data preprocessing flow:

graph TD
    A[UniParc] --> B[Filter & Merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI Taxonomy] --> B
    B --> H[Train/Test Split]
    H --> I[Train Set]
    H --> J[ID Test Set]
    H --> K[OOD Test Set]

The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI Taxonomy datasets are filtered and merged in step B. The aggregated dataset is then split into training, in-distribution (ID) test, and out-of-distribution (OOD) test sets in step H.
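
Steps B and H can be sketched in plain Python as follows. This is an illustration only: the record layout (`(sequence, family)` tuples), the filtering rules (a length cap and exact-duplicate removal), the choice to hold out whole families for the OOD test set, and the function names `filter_and_merge` and `split_dataset` are all assumptions, not the pipeline used in the paper or in this repository.

```python
import random
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str]  # (sequence, family): hypothetical record layout


def filter_and_merge(sources: List[List[Record]], max_len: int) -> List[Record]:
    """Step B (assumed rules): merge sources, drop long or duplicate sequences."""
    seen: Set[str] = set()
    merged: List[Record] = []
    for source in sources:
        for sequence, family in source:
            if len(sequence) > max_len or sequence in seen:
                continue
            seen.add(sequence)
            merged.append((sequence, family))
    return merged


def split_dataset(
    records: List[Record],
    ood_families: Set[str],
    id_test_fraction: float,
    seed: int = 0,
) -> Dict[str, List[Record]]:
    """Step H (assumed rules): held-out families form the OOD test set,
    and a random sample of the remaining records forms the ID test set."""
    ood_test = [r for r in records if r[1] in ood_families]
    rest = [r for r in records if r[1] not in ood_families]
    random.Random(seed).shuffle(rest)
    n_id = int(len(rest) * id_test_fraction)
    return {"train": rest[n_id:], "id_test": rest[:n_id], "ood_test": ood_test}


# Toy stand-ins for two of the source databases.
uniparc = [("MKT", "fam1"), ("AAGG", "fam2"), ("MKTAYIAKQR", "fam1")]
swissprot = [("MKT", "fam1"), ("GHIK", "fam3"), ("LLVV", "fam2"), ("PPQQ", "fam1")]

merged = filter_and_merge([uniparc, swissprot], max_len=8)
splits = split_dataset(merged, ood_families={"fam3"}, id_test_fraction=0.25)
print({name: len(items) for name, items in splits.items()})
```

Holding out entire families keeps the OOD test sequences from sharing a family with any training sequence, whereas the ID test set is drawn from the same families as the training set.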