Biotrainer is an open-source tool that simplifies the process of training machine learning models for biological applications. It specializes in models that predict features of proteins. Biotrainer supports both training new models and using trained models for inference. Using Biotrainer is as simple as providing sequence and label data in the correct format together with a configuration file.
curl -sSL https://install.python-poetry.org/ | python3 -
Install the dependencies and biotrainer via Poetry:
# In the base directory:
poetry install
# Adding jupyter notebook (if needed):
poetry add jupyter
# [WINDOWS] Explicitly install torch libraries suited for your hardware:
# Select hardware and get install command here: https://pytorch.org/get-started/locally/
# Then run for example:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Make sure to use the same torch version as specified in pyproject.toml to ensure model reproducibility!
If you want to use Biotrainer with a nice GUI frontend, check out bioCentral:
cd examples/residue_to_class
poetry run biotrainer config.yml
You can also use the provided run-biotrainer.py file for development and debugging (you might need to configure your IDE to execute run-biotrainer.py directly in order to use the provided virtual environment):
# residue_to_class
poetry run python3 run-biotrainer.py examples/residue_to_class/config.yml
# sequence_to_class
poetry run python3 run-biotrainer.py examples/sequence_to_class/config.yml
# Build
docker build -t biotrainer .
# Run
docker run --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    biotrainer:latest /mnt/config.yml
The output can be found in the directory of the provided configuration file.
After you have trained a model, you will by default find an out.yml file in the output directory. You can now use it to create predictions for new data with your model; the Inferencer module loads the checkpoint automatically:
from biotrainer.inference import Inferencer
from biotrainer.embedders import OneHotEncodingEmbedder

sequences = [
    "PROVTEIN",
    "SEQVENCESEQVENCE"
]

out_file_path = '../residue_to_class/output/out.yml'

inferencer, out_file = Inferencer.create_from_out_file(out_file_path=out_file_path, allow_torch_pt_loading=True)

print(f"For the {out_file['model_choice']}, the metrics on the test set are:")
for metric in out_file['test_iterations_results']['metrics']:
    print(f"\t{metric}: {out_file['test_iterations_results']['metrics'][metric]}")

embedder = OneHotEncodingEmbedder()
embeddings = list(embedder.embed_many(sequences))
# Note that for per-sequence embeddings, you would have to reduce the embeddings now:
# embeddings = [[embedder.reduce_per_protein(embedding)] for embedding in embeddings]
predictions = inferencer.from_embeddings(embeddings, split_name="hold_out")
for sequence, prediction in zip(sequences, predictions["mapped_predictions"].values()):
    print(sequence)
    print(prediction)

# If your checkpoints are stored as .pt, consider converting them to safetensors (supported by biotrainer >=0.9.1)
inferencer.convert_all_checkpoints_to_safetensors()
See the full example here.
The Inferencer module also provides bootstrapping and Monte Carlo dropout predictions.
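The sketch below only illustrates how such calls could look; the method and parameter names are assumptions modeled on the from_embeddings call above, so please check the inference example and the Inferencer API for the actual signatures:
# Hedged sketch only: method and parameter names are assumed, verify against the Inferencer API before use.
# Monte Carlo dropout: repeats the forward pass with dropout enabled to estimate prediction uncertainty.
mcd_predictions = inferencer.from_embeddings_with_monte_carlo_dropout(
    embeddings,
    n_forward_passes=30,  # assumed parameter: number of stochastic forward passes
)
# Bootstrapping: resamples the given embeddings and targets to estimate confidence intervals for the metrics.
bootstrapping_results = inferencer.from_embeddings_with_bootstrapping(
    embeddings,
    targets=["CDEVVCDD", "DDCCVVEECCDDEEVV"],  # assumed: per-residue target strings for the two sequences above
    iterations=30,  # assumed parameter: number of bootstrap iterations
)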
Biotrainer provides a lot of data standardization, which aims to simplify the use of machine learning for biology. This standardization process is also expected to improve communication between different scientific disciplines and to help keep an overview of the rapidly evolving field of protein prediction.
A protocol defines how the input data should be interpreted and which prediction task has to be applied. The following protocols are already implemented (a minimal shape sketch follows the list below):
D=embedding dimension (e.g. 1024)
B=batch dimension (e.g. 30)
L=sequence dimension (e.g. 350)
C=number of classes (e.g. 13)
- residue_to_class --> Predict a class C for each residue encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxLxC
- residues_to_class --> Predict a class C for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxC
- residues_to_value --> Predict a value V for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output Bx1
- sequence_to_class --> Predict a class C for each sequence encoded in a fixed dimension D. Input BxD --> output BxC
- sequence_to_value --> Predict a value V for each sequence encoded in a fixed dimension D. Input BxD --> output Bx1
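To make the shape conventions concrete, here is a minimal, self-contained sketch of the residue_to_class protocol using a toy linear layer; this is purely illustrative and not Biotrainer's actual model:
# Illustrative shape check for the residue_to_class protocol (toy model, not part of biotrainer):
import torch

B, L, D, C = 30, 350, 1024, 13                 # batch size, sequence length, embedding dim, number of classes
per_residue_embeddings = torch.rand(B, L, D)   # input:  BxLxD
toy_model = torch.nn.Linear(D, C)              # any model mapping D -> C for every residue fits the protocol
class_logits = toy_model(per_residue_embeddings)
assert class_logits.shape == (B, L, C)         # output: BxLxC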
For every protocol, we created a standardization of how the input data has to be provided. You can find the details for each protocol here.
Below, we show an example of how the sequence and label files look for the residue_to_class protocol:
sequences.fasta
>Seq1
SEQWENCE
labels.fasta
>Seq1 SET=train VALIDATION=False
DVCDVVDD
To run Biotrainer, you need to provide a configuration file in .yaml format along with your sequence and label data. Here you can find exemplary configuration files. All configuration options are listed here.
Example configuration for residue_to_class:
protocol : residue_to_class
sequence_file : sequences.fasta # Specify your sequence file
labels_file : labels.fasta # Specify your label file
model_choice : CNN # Model architecture
optimizer_choice : adam # Model optimizer
learning_rate : 1e-3 # Optimizer learning rate
loss_choice : cross_entropy_loss # Loss function
use_class_weights : True # Balance class weights by using class sample size in the given dataset
num_epochs : 200 # Number of maximum epochs
batch_size : 128 # Batch size
embedder_name : Rostlab/prot_t5_xl_uniref50 # Embedder to use
To convert the sequence data into more meaningful input for the model, embeddings generated by protein language models (pLMs) have been widely applied in recent years. Therefore, depending on the protocol, Biotrainer can automatically compute embeddings on a per-sequence or per-residue level. Check out the embedder options to learn about all available embedding methods. It is also possible to use your own embedder or to provide your own embeddings file, independent of the provided computation pipeline. Please refer to the data standardization documentation and the associated examples to learn how to do this. As described in the configuration options, precomputed embeddings can be passed to the training process via the embeddings_file parameter; a small configuration sketch is shown below.
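As a minimal sketch, a configuration using precomputed embeddings could replace the embedder_name option with the embeddings_file parameter; the file name here is only a placeholder, and the file itself must follow the format described in the data standardization documentation:
protocol : residue_to_class
sequence_file : sequences.fasta
labels_file : labels.fasta
embeddings_file : precomputed_embeddings.h5 # Placeholder path to your precomputed embeddings
model_choice : CNN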
If you run into any problems during installation or usage, please check the troubleshooting guide first.
If that does not solve your problem, please create an issue.
If you are using Biotrainer for your work, please add a citation:
@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}