Biotrainer is an open-source tool that simplifies the process of training machine learning models for biological applications. It specializes in models that predict protein features. Biotrainer supports both training new models and using trained models for inference. Using Biotrainer is as simple as providing correctly formatted sequence and label data along with a configuration file.
curl -sSL https://install.python-poetry.org/ | python3 -

Install the dependencies and biotrainer via poetry:

# In the base directory:
poetry install
# Adding jupyter notebook (if needed):
poetry add jupyter
# [WINDOWS] Explicitly install torch libraries suited for your hardware:
# Select hardware and get install command here: https://pytorch.org/get-started/locally/
# Then run for example:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Please make sure to use the same torch version as specified in pyproject.toml for model reproducibility!
If you want to use Biotrainer with a nice GUI frontend, check out bioCentral:
cd examples/residue_to_class
poetry run biotrainer config.yml

You can also use the provided run-biotrainer.py file for development and debugging (you might need to set up your IDE to execute run-biotrainer.py directly with the provided virtual environment):
# residue_to_class
poetry run python3 run-biotrainer.py examples/residue_to_class/config.yml
# sequence_to_class
poetry run python3 run-biotrainer.py examples/sequence_to_class/config.yml

# Build
docker build -t biotrainer .
# Run
docker run --rm
-v " $( pwd ) /examples/docker " :/mnt
-u $( id -u ${USER} ) : $( id -g ${USER} )
biotrainer:latest /mnt/config.yml

The output can be found in the directory of the provided configuration file.
After a model has been trained, by default you will find an out.yml file in the output directory. You can now use it to create predictions for new data with your model; the Inferencer module loads the checkpoint automatically:
from biotrainer.inference import Inferencer
from biotrainer.embedders import OneHotEncodingEmbedder

sequences = [
    "PROVTEIN",
    "SEQVENCESEQVENCE"
]

out_file_path = '../residue_to_class/output/out.yml'

inferencer, out_file = Inferencer.create_from_out_file(out_file_path=out_file_path, allow_torch_pt_loading=True)

print(f"For the {out_file['model_choice']}, the metrics on the test set are:")
for metric in out_file['test_iterations_results']['metrics']:
    print(f"\t{metric}: {out_file['test_iterations_results']['metrics'][metric]}")

embedder = OneHotEncodingEmbedder()
embeddings = list(embedder.embed_many(sequences))
# Note that for per-sequence embeddings, you would have to reduce the embeddings now:
# embeddings = [[embedder.reduce_per_protein(embedding)] for embedding in embeddings]
predictions = inferencer.from_embeddings(embeddings, split_name="hold_out")
for sequence, prediction in zip(sequences, predictions["mapped_predictions"].values()):
    print(sequence)
    print(prediction)

# If your checkpoints are stored as .pt, consider converting them to safetensors (supported by biotrainer >=0.9.1)
inferencer.convert_all_checkpoints_to_safetensors()

See the full example here.
The Inferencer module also provides bootstrapping and Monte Carlo dropout predictions.
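The Monte Carlo dropout idea can be sketched independently of biotrainer: dropout is kept active at inference time, and several stochastic forward passes are aggregated into a mean prediction plus an uncertainty estimate. The toy linear model below is purely illustrative and not the biotrainer implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy "model": a fixed linear layer preceded by dropout.
W = rng.normal(size=(8, 3))  # 8 input features -> 3 classes

def stochastic_forward(x, drop_p=0.5):
    # The dropout mask stays active at inference time (the MC dropout trick).
    mask = rng.random(x.shape) >= drop_p
    h = (x * mask) / (1.0 - drop_p)  # inverted dropout scaling
    logits = h @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax probabilities

x = rng.normal(size=8)
samples = np.stack([stochastic_forward(x) for _ in range(100)])

mean_probs = samples.mean(axis=0)   # averaged prediction
uncertainty = samples.std(axis=0)   # per-class spread as an uncertainty proxy
print(mean_probs, uncertainty)
```

Averaging many stochastic passes yields both a smoothed prediction and a measure of how much the model disagrees with itself.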
Biotrainer provides a number of data standards designed to simplify the use of machine learning for biology. This standardization process is also expected to improve communication between different scientific disciplines and to help keep an overview of the rapidly evolving field of protein prediction.
A protocol defines how the input data should be interpreted and which prediction task has to be applied. The following protocols are already implemented:
D=embedding dimension (e.g. 1024)
B=batch dimension (e.g. 30)
L=sequence dimension (e.g. 350)
C=number of classes (e.g. 13)
- residue_to_class --> Predict a class C for each residue encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxLxC
- residues_to_class --> Predict a class C for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxC
- residues_to_value --> Predict a value V for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output Bx1
- sequence_to_class --> Predict a class C for each sequence encoded in a fixed dimension D. Input BxD --> output BxC
- sequence_to_value --> Predict a value V for each sequence encoded in a fixed dimension D. Input BxD --> output Bx1
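The shape conventions above can be checked with a small numpy sketch (the dimensions are chosen arbitrarily for illustration; real embeddings would use e.g. D=1024):

```python
import numpy as np

# Toy dimensions for illustration: batch, sequence length, embedding dim, classes
B, L, D, C = 2, 5, 8, 3

per_residue_input = np.zeros((B, L, D))    # input for residue_to_class / residues_to_*
per_sequence_input = np.zeros((B, D))      # input for sequence_to_class / sequence_to_value

residue_to_class_output = np.zeros((B, L, C))   # one class per residue
residues_to_class_output = np.zeros((B, C))     # one class per sequence
value_output = np.zeros((B, 1))                 # one value per sequence

print(per_residue_input.shape, residue_to_class_output.shape)
```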
For each protocol, we created a standardization of how the input data should be provided. You can find the details for every protocol here.
Below, we show an example of how the sequence and label files look for the residue_to_class protocol:
sequences.fasta
>Seq1
SEQWENCE
labels.fasta
>Seq1 SET=train VALIDATION=False
DVCDVVDD
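A minimal, hand-rolled sketch of how such a file pair could be parsed (biotrainer handles this internally; the `parse_fasta` helper below is purely illustrative):

```python
def parse_fasta(text):
    """Parse FASTA-like text into {seq_id: (attributes, body)}."""
    entries = {}
    seq_id, attrs, body = None, {}, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if seq_id is not None:
                entries[seq_id] = (attrs, "".join(body))
            header = line[1:].split()
            seq_id = header[0]
            # Key=value pairs after the id, e.g. SET=train VALIDATION=False
            attrs = dict(kv.split("=") for kv in header[1:])
            body = []
        else:
            body.append(line.strip())
    if seq_id is not None:
        entries[seq_id] = (attrs, "".join(body))
    return entries

sequences = parse_fasta(">Seq1\nSEQWENCE\n")
labels = parse_fasta(">Seq1 SET=train VALIDATION=False\nDVCDVVDD\n")

seq = sequences["Seq1"][1]
attrs, label_str = labels["Seq1"]
# Per-residue labels must align one-to-one with the sequence:
assert len(seq) == len(label_str)
print(attrs["SET"], list(zip(seq, label_str))[:3])
```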
To run Biotrainer, you need to provide a configuration file in .yaml format along with your sequence and label data. Here you can find an exemplary configuration file. All configuration options are listed here.
Example configuration for residue_to_class:
protocol : residue_to_class
sequence_file : sequences.fasta # Specify your sequence file
labels_file : labels.fasta # Specify your label file
model_choice : CNN # Model architecture
optimizer_choice : adam # Model optimizer
learning_rate : 1e-3 # Optimizer learning rate
loss_choice : cross_entropy_loss # Loss function
use_class_weights : True # Balance class weights by using class sample size in the given dataset
num_epochs : 200 # Number of maximum epochs
batch_size : 128 # Batch size
embedder_name : Rostlab/prot_t5_xl_uniref50 # Embedder to use

To convert sequence data into more meaningful input for the model, embeddings generated by protein language models (pLMs) have been widely applied in recent years. Hence, depending on the protocol, biotrainer can automatically compute embeddings on the per-sequence or per-residue level. Check out the embedder options to learn about all available embedding methods. It is also possible to provide your own embeddings file, computed with your own embedder, independently of the provided pipeline. Please refer to the data standardization document and the relevant example to learn how to do this. As described in the configuration options, pre-computed embeddings can be used for the training process via the embeddings_file parameter.
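For intuition, a one-hot embedder produces an L x D matrix per sequence (per-residue), which can be mean-pooled into a single D-dimensional vector (per-sequence). A hand-rolled sketch, independent of biotrainer's OneHotEncodingEmbedder:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues -> D = 20
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_embed(sequence):
    """Per-residue embedding: an L x D matrix with a single 1 per row."""
    embedding = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        embedding[pos, INDEX[aa]] = 1.0
    return embedding

def reduce_per_protein(embedding):
    """Per-sequence embedding: mean-pool over the length dimension (L x D -> D)."""
    return embedding.mean(axis=0)

per_residue = one_hot_embed("SEQWENCE")         # shape (8, 20)
per_sequence = reduce_per_protein(per_residue)  # shape (20,)
print(per_residue.shape, per_sequence.shape)
```

pLM embedders work the same way structurally, only with learned, dense D-dimensional vectors instead of one-hot rows.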
If you encounter any problems during installation or execution, please check the troubleshooting guide first.
If your problem is not solved there, please create an issue.
If you are using Biotrainer for your work, please add a citation:
@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}