Progen

Version 1.0.0

Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation"

GPT for protein sequences

Paper link

Acknowledgments

Lucidrains
Agorians

Install

pip install progen-torch

Usage

import torch
from progen.model import ProGen

# A batch of one sequence of 1024 random token ids in [0, 100)
x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,  # The size of the vocabulary
    dim=512,  # The dimension of the embeddings
    seq_len=1024,  # The length of the sequences
    depth=6,  # The number of layers in the model
    window_size=256,  # The size of the window for local attention
    global_mlp_depth=2,  # The depth of the MLP in the global attention mechanism
    heads=8,  # The number of attention heads
    dim_head=512,  # The dimension of each attention head
    ff_mult=4,  # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,  # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,  # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,  # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,  # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,  # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
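
The usage example feeds random token ids to the model. For real protein sequences, the ids would come from a tokenizer. The following is a minimal sketch of one possible mapping, not the tokenizer used by this package or the paper: the vocabulary, the special-token ids (`PAD`, `BOS`, `EOS`, `UNK`), and the `encode` helper are all hypothetical names chosen for illustration.

```python
# Hypothetical vocabulary: 4 special tokens followed by the 20 standard amino acids.
# These ids are illustrative assumptions, not the ones used by progen-torch.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, BOS, EOS, UNK = 0, 1, 2, 3
VOCAB = {aa: i + 4 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, seq_len):
    """Map a protein string to a fixed-length list of token ids."""
    # Unknown residue letters fall back to UNK.
    ids = [BOS] + [VOCAB.get(aa, UNK) for aa in sequence.upper()] + [EOS]
    # Truncate long sequences (this may cut off EOS), then right-pad short ones.
    ids = ids[:seq_len]
    return ids + [PAD] * (seq_len - len(ids))


tokens = encode("MKTAYIAKQR", seq_len=16)
print(tokens)
```

Under these assumptions every id stays below 24, so it fits the `num_tokens=100` vocabulary above, and `torch.tensor([tokens])` would give a batch of shape `(1, seq_len)` once `seq_len` matches the model's.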

Dataset Strategy

Here is a table of the datasets used in the paper, with metadata and source links:

Dataset	Description	Source
UniParc	Contains protein sequences from various sources	https://www.uniprot.org/uniparc/
UniProtKB	Contains protein sequences and annotations	https://www.uniprot.org/uniprot/
Swiss-Prot	Curated protein sequence database	https://www.uniprot.org/swiss-prot/
TrEMBL	Computer-annotated protein sequences	https://www.uniprot.org/trembl/
Pfam	Database of protein families	https://pfam.xfam.org/
NCBI taxonomy	Taxonomic classification of organisms	https://www.ncbi.nlm.nih.gov/taxonomy

Here is a diagram showing the data preprocessing flow:

graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]

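The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI taxonomy datasets are filtered and merged in step B. The aggregated dataset is then split in step H into a training set, an in-distribution (ID) test set, and an out-of-distribution (OOD) test set.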
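
The split in step H could be sketched as follows. The text above does not say how the OOD set is chosen, so this sketch assumes it consists of whole held-out protein families, while the ID test set is a random sample of the remaining data. The `split_dataset` function, its parameters, and the toy records are hypothetical.

```python
import random


def split_dataset(records, ood_families, id_test_fraction=0.1, seed=0):
    """Split (sequence, family) records into train, ID-test, and OOD-test lists.

    Assumption: OOD means "belongs to a held-out family"; the remaining
    records are shuffled and a fraction is set aside as the ID test set.
    """
    ood_test = [r for r in records if r[1] in ood_families]
    in_dist = [r for r in records if r[1] not in ood_families]

    rng = random.Random(seed)  # fixed seed for a reproducible shuffle
    rng.shuffle(in_dist)
    n_id_test = int(len(in_dist) * id_test_fraction)
    return in_dist[n_id_test:], in_dist[:n_id_test], ood_test


# Toy records with made-up sequences and family labels (5 families, 10 each).
records = [("MKT" + "A" * i, f"PF{i % 5:05d}") for i in range(50)]
train, id_test, ood_test = split_dataset(records, ood_families={"PF00004"})
print(len(train), len(id_test), len(ood_test))
```

Holding out entire families keeps every OOD test sequence from sharing a family with the training data, which a purely random split would not guarantee.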