Biotrainer is an open-source tool that simplifies the process of training machine learning models for biological applications. It specializes in models that predict protein features. Biotrainer supports both training new models and using trained models for inference. Using Biotrainer is as simple as providing correctly formatted sequence and label data along with a configuration file.
curl -sSL https://install.python-poetry.org/ | python3 -

Install the dependencies and biotrainer via poetry:

# In the base directory:
poetry install
# Adding jupyter notebook (if needed):
poetry add jupyter
# [WINDOWS] Explicitly install torch libraries suited for your hardware:
# Select hardware and get install command here: https://pytorch.org/get-started/locally/
# Then run for example:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Please make sure to use the same torch version as specified in pyproject.toml for model reproducibility!
If you want to use Biotrainer with a nice GUI frontend, check out bioCentral:
cd examples/residue_to_class
poetry run biotrainer config.yml

You can also use the provided run-biotrainer.py file for development and debugging (you might need to set up your IDE to execute run-biotrainer.py directly with the provided virtual environment):
# residue_to_class
poetry run python3 run-biotrainer.py examples/residue_to_class/config.yml
# sequence_to_class
poetry run python3 run-biotrainer.py examples/sequence_to_class/config.yml

# Build
docker build -t biotrainer .
# Run
docker run --rm
-v " $( pwd ) /examples/docker " :/mnt
-u $( id -u ${USER} ) : $( id -g ${USER} )
biotrainer:latest /mnt/config.yml

The output can be found in the directory of the provided configuration file.
After a model has been trained, by default you will find an out.yml file in the output directory. You can now use it to create predictions for new data with your model; the Inferencer module loads the checkpoint automatically:
from biotrainer.inference import Inferencer
from biotrainer.embedders import OneHotEncodingEmbedder

sequences = [
    "PROVTEIN",
    "SEQVENCESEQVENCE"
]

out_file_path = '../residue_to_class/output/out.yml'

inferencer, out_file = Inferencer.create_from_out_file(out_file_path=out_file_path, allow_torch_pt_loading=True)

print(f"For the {out_file['model_choice']}, the metrics on the test set are:")
for metric in out_file['test_iterations_results']['metrics']:
    print(f"\t{metric}: {out_file['test_iterations_results']['metrics'][metric]}")

embedder = OneHotEncodingEmbedder()
embeddings = list(embedder.embed_many(sequences))
# Note that for per-sequence embeddings, you would have to reduce the embeddings now:
# embeddings = [[embedder.reduce_per_protein(embedding)] for embedding in embeddings]
predictions = inferencer.from_embeddings(embeddings, split_name="hold_out")
for sequence, prediction in zip(sequences, predictions["mapped_predictions"].values()):
    print(sequence)
    print(prediction)

# If your checkpoints are stored as .pt, consider converting them to safetensors (supported by biotrainer >=0.9.1)
inferencer.convert_all_checkpoints_to_safetensors()

See the full example here.
The Inferencer module also provides bootstrapping and Monte Carlo dropout predictions.
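The Monte Carlo dropout idea can be sketched independently of biotrainer: dropout is kept active at inference time, and several stochastic forward passes are aggregated into a mean prediction plus an uncertainty estimate. The toy linear model below is purely illustrative and not the biotrainer implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy "model": a fixed linear layer preceded by dropout.
W = rng.normal(size=(8, 3))  # 8 input features -> 3 classes

def stochastic_forward(x, drop_p=0.5):
    # The dropout mask stays active at inference time (the MC dropout trick).
    mask = rng.random(x.shape) >= drop_p
    h = (x * mask) / (1.0 - drop_p)  # inverted dropout scaling
    logits = h @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()  # softmax probabilities

x = rng.normal(size=8)
samples = np.stack([stochastic_forward(x) for _ in range(100)])

mean_probs = samples.mean(axis=0)   # averaged prediction
uncertainty = samples.std(axis=0)   # per-class spread as an uncertainty proxy
print(mean_probs, uncertainty)
```

Averaging many stochastic passes yields both a smoothed prediction and a measure of how much the model disagrees with itself.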
Biotrainer provides a number of data standards designed to simplify the use of machine learning for biology. This standardization process is also expected to improve communication between different scientific disciplines and to help keep an overview of the rapidly evolving field of protein prediction.
A protocol defines how the input data should be interpreted and which prediction task has to be applied. The following protocols are already implemented:
D=embedding dimension (e.g. 1024)
B=batch dimension (e.g. 30)
L=sequence dimension (e.g. 350)
C=number of classes (e.g. 13)
- residue_to_class --> Predict a class C for each residue encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxLxC
- residues_to_class --> Predict a class C for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxC
- residues_to_value --> Predict a value V for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output Bx1
- sequence_to_class --> Predict a class C for each sequence encoded in a fixed dimension D. Input BxD --> output BxC
- sequence_to_value --> Predict a value V for each sequence encoded in a fixed dimension D. Input BxD --> output Bx1
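The shape conventions above can be checked with a small numpy sketch (the dimensions are chosen arbitrarily for illustration; real embeddings would use e.g. D=1024):

```python
import numpy as np

# Toy dimensions for illustration: batch, sequence length, embedding dim, classes
B, L, D, C = 2, 5, 8, 3

per_residue_input = np.zeros((B, L, D))    # input for residue_to_class / residues_to_*
per_sequence_input = np.zeros((B, D))      # input for sequence_to_class / sequence_to_value

residue_to_class_output = np.zeros((B, L, C))   # one class per residue
residues_to_class_output = np.zeros((B, C))     # one class per sequence
value_output = np.zeros((B, 1))                 # one value per sequence

print(per_residue_input.shape, residue_to_class_output.shape)
```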
For each protocol, we created a standardization of how the input data should be provided. You can find the details for every protocol here.
Below, we show an example of how the sequence and label files look for the residue_to_class protocol:
sequences.fasta
>Seq1
SEQWENCE
labels.fasta
>Seq1 SET=train VALIDATION=False
DVCDVVDD
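A minimal, hand-rolled sketch of how such a file pair could be parsed (biotrainer handles this internally; the `parse_fasta` helper below is purely illustrative):

```python
def parse_fasta(text):
    """Parse FASTA-like text into {seq_id: (attributes, body)}."""
    entries = {}
    seq_id, attrs, body = None, {}, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if seq_id is not None:
                entries[seq_id] = (attrs, "".join(body))
            header = line[1:].split()
            seq_id = header[0]
            # Key=value pairs after the id, e.g. SET=train VALIDATION=False
            attrs = dict(kv.split("=") for kv in header[1:])
            body = []
        else:
            body.append(line.strip())
    if seq_id is not None:
        entries[seq_id] = (attrs, "".join(body))
    return entries

sequences = parse_fasta(">Seq1\nSEQWENCE\n")
labels = parse_fasta(">Seq1 SET=train VALIDATION=False\nDVCDVVDD\n")

seq = sequences["Seq1"][1]
attrs, label_str = labels["Seq1"]
# Per-residue labels must align one-to-one with the sequence:
assert len(seq) == len(label_str)
print(attrs["SET"], list(zip(seq, label_str))[:3])
```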
To run Biotrainer, you need to provide a configuration file in .yaml format along with your sequence and label data. Here you can find an exemplary configuration file. All configuration options are listed here.
Example configuration for residue_to_class:
protocol : residue_to_class
sequence_file : sequences.fasta # Specify your sequence file
labels_file : labels.fasta # Specify your label file
model_choice : CNN # Model architecture
optimizer_choice : adam # Model optimizer
learning_rate : 1e-3 # Optimizer learning rate
loss_choice : cross_entropy_loss # Loss function
use_class_weights : True # Balance class weights by using class sample size in the given dataset
num_epochs : 200 # Number of maximum epochs
batch_size : 128 # Batch size
embedder_name : Rostlab/prot_t5_xl_uniref50 # Embedder to use

To convert sequence data into more meaningful input for the model, embeddings generated by protein language models (pLMs) have been widely applied in recent years. Hence, depending on the protocol, biotrainer can automatically compute embeddings on the per-sequence or per-residue level. Check out the embedder options to learn about all available embedding methods. It is also possible to provide your own embeddings file, computed with your own embedder, independently of the provided pipeline. Please refer to the data standardization document and the relevant example to learn how to do this. As described in the configuration options, pre-computed embeddings can be used for the training process via the embeddings_file parameter.
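For intuition, a one-hot embedder produces an L x D matrix per sequence (per-residue), which can be mean-pooled into a single D-dimensional vector (per-sequence). A hand-rolled sketch, independent of biotrainer's OneHotEncodingEmbedder:

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 canonical residues -> D = 20
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def one_hot_embed(sequence):
    """Per-residue embedding: an L x D matrix with a single 1 per row."""
    embedding = np.zeros((len(sequence), len(AMINO_ACIDS)))
    for pos, aa in enumerate(sequence):
        embedding[pos, INDEX[aa]] = 1.0
    return embedding

def reduce_per_protein(embedding):
    """Per-sequence embedding: mean-pool over the length dimension (L x D -> D)."""
    return embedding.mean(axis=0)

per_residue = one_hot_embed("SEQWENCE")         # shape (8, 20)
per_sequence = reduce_per_protein(per_residue)  # shape (20,)
print(per_residue.shape, per_sequence.shape)
```

pLM embedders work the same way structurally, only with learned, dense D-dimensional vectors instead of one-hot rows.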
If you encounter any problems during installation or execution, please check the troubleshooting guide first.
If your problem is not solved there, please create an issue.
If you are using Biotrainer for your work, please add a citation:
@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}