Biotrainer is an open-source tool that simplifies the process of training machine learning models for biological applications. It specializes in models that predict features of proteins. Biotrainer supports both training new models and using trained models for inference. Using Biotrainer is as simple as providing sequence and label data in the correct format together with a configuration file.
curl -sSL https://install.python-poetry.org/ | python3 -
Install the dependencies and biotrainer via Poetry:
# In the base directory:
poetry install
# Adding jupyter notebook (if needed):
poetry add jupyter
# [WINDOWS] Explicitly install torch libraries suited for your hardware:
# Select hardware and get install command here: https://pytorch.org/get-started/locally/
# Then run for example:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
Make sure to use the same torch version as specified in pyproject.toml to ensure model reproducibility!
If you want to use Biotrainer with a nice GUI frontend, check out bioCentral:
cd examples/residue_to_class
poetry run biotrainer config.yml
You can also use the provided run-biotrainer.py file for development and debugging (you might need to configure your IDE to execute run-biotrainer.py directly in order to use the provided virtual environment):
# residue_to_class
poetry run python3 run-biotrainer.py examples/residue_to_class/config.yml
# sequence_to_class
poetry run python3 run-biotrainer.py examples/sequence_to_class/config.yml
# Build
docker build -t biotrainer .
# Run
docker run --rm \
    -v "$(pwd)/examples/docker":/mnt \
    -u $(id -u ${USER}):$(id -g ${USER}) \
    biotrainer:latest /mnt/config.yml
The output can be found in the directory of the provided configuration file.
After you have trained a model, you will by default find an out.yml file in the output directory. You can now use it to create predictions for new data with your model; the Inferencer module loads the checkpoint automatically:
from biotrainer.inference import Inferencer
from biotrainer.embedders import OneHotEncodingEmbedder

sequences = [
    "PROVTEIN",
    "SEQVENCESEQVENCE"
]

out_file_path = '../residue_to_class/output/out.yml'

inferencer, out_file = Inferencer.create_from_out_file(out_file_path=out_file_path, allow_torch_pt_loading=True)

print(f"For the {out_file['model_choice']}, the metrics on the test set are:")
for metric in out_file['test_iterations_results']['metrics']:
    print(f"\t{metric}: {out_file['test_iterations_results']['metrics'][metric]}")

embedder = OneHotEncodingEmbedder()
embeddings = list(embedder.embed_many(sequences))
# Note that for per-sequence embeddings, you would have to reduce the embeddings now:
# embeddings = [[embedder.reduce_per_protein(embedding)] for embedding in embeddings]
predictions = inferencer.from_embeddings(embeddings, split_name="hold_out")
for sequence, prediction in zip(sequences, predictions["mapped_predictions"].values()):
    print(sequence)
    print(prediction)

# If your checkpoints are stored as .pt, consider converting them to safetensors (supported by biotrainer >=0.9.1)
inferencer.convert_all_checkpoints_to_safetensors()
See the full example here.
The Inferencer module also provides bootstrapping and Monte Carlo dropout predictions.
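The sketch below only illustrates how such calls could look; the method and parameter names are assumptions modeled on the from_embeddings call above, so please check the inference example and the Inferencer API for the actual signatures:
# Hedged sketch only: method and parameter names are assumed, verify against the Inferencer API before use.
# Monte Carlo dropout: repeats the forward pass with dropout enabled to estimate prediction uncertainty.
mcd_predictions = inferencer.from_embeddings_with_monte_carlo_dropout(
    embeddings,
    n_forward_passes=30,  # assumed parameter: number of stochastic forward passes
)
# Bootstrapping: resamples the given embeddings and targets to estimate confidence intervals for the metrics.
bootstrapping_results = inferencer.from_embeddings_with_bootstrapping(
    embeddings,
    targets=["CDEVVCDD", "DDCCVVEECCDDEEVV"],  # assumed: per-residue target strings for the two sequences above
    iterations=30,  # assumed parameter: number of bootstrap iterations
)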
Biotrainer provides a lot of data standardization, which aims to simplify the use of machine learning for biology. This standardization process is also expected to improve communication between different scientific disciplines and to help keep an overview of the rapidly evolving field of protein prediction.
A protocol defines how the input data should be interpreted and which prediction task has to be applied. The following protocols are already implemented (a minimal shape sketch follows the list below):
D=embedding dimension (e.g. 1024)
B=batch dimension (e.g. 30)
L=sequence dimension (e.g. 350)
C=number of classes (e.g. 13)
- residue_to_class --> Predict a class C for each residue encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxLxC
- residues_to_class --> Predict a class C for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output BxC
- residues_to_value --> Predict a value V for all residues encoded in D dimensions in a sequence of length L. Input BxLxD --> output Bx1
- sequence_to_class --> Predict a class C for each sequence encoded in a fixed dimension D. Input BxD --> output BxC
- sequence_to_value --> Predict a value V for each sequence encoded in a fixed dimension D. Input BxD --> output Bx1
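To make the shape conventions concrete, here is a minimal, self-contained sketch of the residue_to_class protocol using a toy linear layer; this is purely illustrative and not Biotrainer's actual model:
# Illustrative shape check for the residue_to_class protocol (toy model, not part of biotrainer):
import torch

B, L, D, C = 30, 350, 1024, 13                 # batch size, sequence length, embedding dim, number of classes
per_residue_embeddings = torch.rand(B, L, D)   # input:  BxLxD
toy_model = torch.nn.Linear(D, C)              # any model mapping D -> C for every residue fits the protocol
class_logits = toy_model(per_residue_embeddings)
assert class_logits.shape == (B, L, C)         # output: BxLxC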
For every protocol, we created a standardization of how the input data has to be provided. You can find the details for each protocol here.
Below, we show an example of how the sequence and label files look for the residue_to_class protocol:
sequences.fasta
>Seq1
SEQWENCE
labels.fasta
>Seq1 SET=train VALIDATION=False
DVCDVVDD
To run Biotrainer, you need to provide a configuration file in .yaml format along with your sequence and label data. Here you can find exemplary configuration files. All configuration options are listed here.
Example configuration for residue_to_class:
protocol : residue_to_class
sequence_file : sequences.fasta # Specify your sequence file
labels_file : labels.fasta # Specify your label file
model_choice : CNN # Model architecture
optimizer_choice : adam # Model optimizer
learning_rate : 1e-3 # Optimizer learning rate
loss_choice : cross_entropy_loss # Loss function
use_class_weights : True # Balance class weights by using class sample size in the given dataset
num_epochs : 200 # Number of maximum epochs
batch_size : 128 # Batch size
embedder_name : Rostlab/prot_t5_xl_uniref50 # Embedder to use
To convert the sequence data into more meaningful input for the model, embeddings generated by protein language models (pLMs) have been widely applied in recent years. Therefore, depending on the protocol, Biotrainer can automatically compute embeddings on a per-sequence or per-residue level. Check out the embedder options to learn about all available embedding methods. It is also possible to use your own embedder or to provide your own embeddings file, independent of the provided computation pipeline. Please refer to the data standardization documentation and the associated examples to learn how to do this. As described in the configuration options, precomputed embeddings can be passed to the training process via the embeddings_file parameter; a small configuration sketch is shown below.
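As a minimal sketch, a configuration using precomputed embeddings could replace the embedder_name option with the embeddings_file parameter; the file name here is only a placeholder, and the file itself must follow the format described in the data standardization documentation:
protocol : residue_to_class
sequence_file : sequences.fasta
labels_file : labels.fasta
embeddings_file : precomputed_embeddings.h5 # Placeholder path to your precomputed embeddings
model_choice : CNN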
If you run into any problems during installation or usage, please check the troubleshooting guide first.
If that does not solve your problem, please create an issue.
If you are using Biotrainer for your work, please add a citation:
@inproceedings{
sanchez2022standards,
title={Standards, tooling and benchmarks to probe representation learning on proteins},
author={Joaquin Gomez Sanchez and Sebastian Franz and Michael Heinzinger and Burkhard Rost and Christian Dallago},
booktitle={NeurIPS 2022 Workshop on Learning Meaningful Representations of Life},
year={2022},
url={https://openreview.net/forum?id=adODyN-eeJ8}
}