Progen

Implementation of ProGen in PyTorch, from the paper "ProGen: Language Modeling for Protein Generation"

GPT for protein sequences

Paper Link

Appreciation

Lucidrains
Agorians

Install

pip install progen-torch

Usage

import torch
from progen.model import ProGen

# A batch of one sequence of 1024 random token IDs in [0, 100)
x = torch.randint(0, 100, (1, 1024))

# Initialize the model with specific parameters
model = ProGen(
    num_tokens=100,  # The size of the vocabulary
    dim=512,  # The dimension of the embeddings
    seq_len=1024,  # The length of the sequences
    depth=6,  # The number of layers in the model
    window_size=256,  # The size of the window for local attention
    global_mlp_depth=2,  # The depth of the MLP in the global attention mechanism
    heads=8,  # The number of attention heads
    dim_head=512,  # The dimension of each attention head
    ff_mult=4,  # The multiplier for the feed-forward network's hidden layer size
    ff_glu=True,  # Whether to use a GLU activation in the feed-forward network
    attn_dim=None,  # The dimension of the attention mechanism (None means it defaults to `dim`)
    clamp_gate=True,  # Whether to clamp the gate values in the GLU activation
    shift_tokens=True,  # Whether to shift the tokens for the causal attention mechanism
    dropout=0.1,  # The dropout rate
)

# Forward pass through the model
logits = model(x)

# The output is the logits for each token in the vocabulary, for each position in the input sequences
# Shape: (batch_size, sequence_length, num_tokens)
print(logits.shape)  # Should print: torch.Size([1, 1024, 100])
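
The usage example feeds random token IDs. Real protein sequences would first have to be mapped to integer IDs below `num_tokens` and brought to a fixed length. The following is a minimal sketch of that step, assuming a hypothetical vocabulary of the 20 standard amino acids plus padding and unknown tokens. The `encode` helper, `VOCAB`, and the special-token IDs are illustrative and are not part of progen-torch.

```python
from typing import List

# Hypothetical character-level tokenizer for protein sequences.
# The vocabulary and special-token IDs are assumptions, not progen-torch's API.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
PAD_ID, UNK_ID = 0, 1
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence: str, seq_len: int) -> List[int]:
    """Map a protein string to token IDs, truncated or padded to seq_len."""
    ids = [VOCAB.get(aa, UNK_ID) for aa in sequence.upper()[:seq_len]]
    return ids + [PAD_ID] * (seq_len - len(ids))


ids = encode("MKTAYIAKQR", seq_len=16)
print(ids)
```

A list produced this way could be wrapped with `torch.tensor([ids])` and passed to the model in place of the random `x`, provided `seq_len` matches the model and every ID is below `num_tokens`.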

Dataset Strategy

Here is a table of the datasets used in the paper, with metadata and source links.

Dataset	Description	Source
UniParc	Contains protein sequences from various sources	https://www.uniprot.org/uniparc/
UniProtKB	Contains protein sequences and annotations	https://www.uniprot.org/uniprot/
Swiss-Prot	Curated protein sequence database	https://www.uniprot.org/swiss-prot/
TrEMBL	Computer-annotated protein sequences	https://www.uniprot.org/trembl/
Pfam	Database of protein families	https://pfam.xfam.org/
NCBI taxonomy	Taxonomic classification of organisms	https://www.ncbi.nlm.nih.gov/taxonomy

Here is a diagram showing the data preprocessing flow.

graph TD
    A[UniParc] --> B[Filter and merge]
    C[UniProtKB] --> B
    D[Swiss-Prot] --> B
    E[TrEMBL] --> B
    F[Pfam] --> B
    G[NCBI taxonomy] --> B
    B --> H[Train/test split]
    H --> I[Train set]
    H --> J[ID test set]
    H --> K[OOD test set]

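The UniParc, UniProtKB, Swiss-Prot, TrEMBL, Pfam, and NCBI taxonomy datasets are filtered and merged in step B. The aggregated dataset is then split in step H into a training set, an in-distribution (ID) test set, and an out-of-distribution (OOD) test set.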
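
The two steps can be sketched in plain Python. This is an illustration under stated assumptions, not the paper's pipeline: records are hypothetical `(sequence, family)` pairs, merging only drops short and duplicate sequences, the OOD test set is assumed to consist of whole held-out families, and the ID test set is a random sample of the remaining records. The function names and parameters are invented for this example.

```python
import random
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str]  # (sequence, family); a hypothetical record layout


def filter_and_merge(sources: List[List[Record]], min_len: int = 1) -> List[Record]:
    """Step B: concatenate sources, dropping short and duplicate sequences."""
    seen: Set[str] = set()
    merged: List[Record] = []
    for source in sources:
        for seq, family in source:
            if len(seq) >= min_len and seq not in seen:
                seen.add(seq)
                merged.append((seq, family))
    return merged


def split(
    records: List[Record],
    ood_families: Set[str],
    id_test_frac: float = 0.1,
    seed: int = 0,
) -> Dict[str, List[Record]]:
    """Step H: held-out families form the OOD set; the rest is split randomly."""
    ood = [r for r in records if r[1] in ood_families]
    rest = [r for r in records if r[1] not in ood_families]
    random.Random(seed).shuffle(rest)  # seeded for a reproducible split
    n_test = int(len(rest) * id_test_frac)
    return {"train": rest[n_test:], "id_test": rest[:n_test], "ood_test": ood}


# Toy data standing in for two of the source databases
source_a = [("MKT", "fam1"), ("AYI", "fam2"), ("GGG", "fam3")]
source_b = [("MKT", "fam1"), ("QRS", "fam1"), ("LLV", "fam2")]
merged = filter_and_merge([source_a, source_b])
splits = split(merged, ood_families={"fam3"}, id_test_frac=0.25)
print({name: len(part) for name, part in splits.items()})
```

Holding out entire families keeps the OOD test set disjoint from the training data at the family level, whereas the ID test set shares families with the training set.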