DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome

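This repository includes:

The official implementation of DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Genome Understanding Evaluation (GUE): a comprehensive benchmark containing 28 datasets for multi-species genome understanding.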

Contents

1. Introduction
2. Model and Data
3. Setup Environment
4. Quick Start
5. Pre-Training
6. Finetune
7. Citation

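Update (2024/02/14)

We have released DNABERT-S, a foundation model based on DNABERT-2 and specifically designed to generate DNA embeddings that naturally cluster and segregate the genomes of different species in the embedding space. If you are interested, please check it out here.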

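1. Introduction

DNABERT-2 is a foundation model trained on large-scale multi-species genomes that achieves state-of-the-art performance on the 28 tasks of the GUE benchmark. It replaces k-mer tokenization with BPE (byte pair encoding), replaces positional embeddings with Attention with Linear Biases (ALiBi), and incorporates other techniques to improve on the efficiency and effectiveness of DNABERT.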
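To make the ALiBi idea concrete, here is a minimal, dependency-free sketch of the attention bias it adds in place of positional embeddings. It is not taken from this repository. The function names are hypothetical, the slope schedule is the one from the ALiBi paper for power-of-two head counts, and the symmetric `|i - j|` distance is an assumption about how the bias is applied in a bidirectional encoder. The actual implementation may differ.

```python
def alibi_slopes(num_heads):
    """Per-head slopes from the ALiBi paper, for a power-of-two head count."""
    if num_heads <= 0 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    # Geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j] = -slope_h * |i - j| (symmetric distance, assumed)."""
    return [
        [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]
        for slope in alibi_slopes(num_heads)
    ]


# The bias is added to the attention scores before the softmax, so tokens
# that are farther apart are penalised linearly in their distance.
bias = alibi_bias(num_heads=4, seq_len=3)
print(bias[0][0])  # biases seen by the first token in head 0
```

Because the bias depends only on relative distance, no learned position table is needed, which is what allows inputs longer than those seen in training.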

2. Model and Data

The pre-trained model is available on Hugging Face as zhihan1996/DNABERT-2-117M. Link to HuggingFace ModelHub. Link for direct download.

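2.1 GUE: Genome Understanding Evaluation

GUE is a comprehensive benchmark for genome understanding consisting of 28 distinct datasets across 7 tasks and 4 species. GUE can be downloaded here. Statistics and model performance on GUE are shown below:

[Figure: GUE statistics and model performance]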

3. Setup Environment

# create and activate virtual python environment
conda create -n dna python=3.8
conda activate dna

# (optional if you would like to use flash attention)
# install triton from source
git clone https://github.com/openai/triton.git;
cd triton/python;
pip install cmake; # build-time dependency
pip install -e .

# install required packages
python3 -m pip install -r requirements.txt

4. Quick Start

Our model is easy to use with the transformers package.

To load the model from Hugging Face (transformers version 4.28):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True)

To load the model from Hugging Face (transformers version > 4.28):

from transformers import AutoModel
from transformers.models.bert.configuration_bert import BertConfig

config = BertConfig.from_pretrained("zhihan1996/DNABERT-2-117M")
model = AutoModel.from_pretrained("zhihan1996/DNABERT-2-117M", trust_remote_code=True, config=config)
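If the same script has to run under both loading paths, the choice can be made at runtime. The sketch below is one possible way to do that, not part of the repository. The helper names (`parse_major_minor`, `needs_explicit_config`, `load_dnabert2`) are hypothetical. Treating every 4.28.x release as "version 4.28" is an assumption, and versions older than 4.28 are simply handled like 4.28.

```python
def parse_major_minor(version):
    """Return (major, minor) from a version string such as '4.29.1' or '4.30.0.dev0'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def needs_explicit_config(transformers_version):
    """True when the installed transformers is newer than the 4.28 series (assumed cut-off)."""
    return parse_major_minor(transformers_version) > (4, 28)


def load_dnabert2(model_name="zhihan1996/DNABERT-2-117M"):
    """Load tokenizer and model, passing a BertConfig only on newer versions."""
    # Imported lazily so the helpers above can be used without transformers installed.
    import transformers
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    kwargs = {}
    if needs_explicit_config(transformers.__version__):
        from transformers.models.bert.configuration_bert import BertConfig

        kwargs["config"] = BertConfig.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True, **kwargs)
    return tokenizer, model
```

Calling `tokenizer, model = load_dnabert2()` downloads the checkpoint on first use, so it needs network access.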

Calculate the embedding of a DNA sequence:

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors = 'pt')["input_ids"]
hidden_states = model(inputs)[0] # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape) # expect to be 768

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape) # expect to be 768

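5. Pre-Training

We used and slightly modified the MosaicBERT implementation for DNABERT-2: https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert. You should be able to replicate the model training by following its instructions.

Alternatively, you can use run_mlm.py at https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling, which should produce a very similar model.

The training data is available here.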

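6. Finetune

6.1 Evaluate models on GUE

Please first download the GUE dataset from here. Then run the scripts to evaluate on all the tasks.

The current scripts are set up to train with DataParallel on 4 GPUs. If you have a different number of GPUs, change per_device_train_batch_size and gradient_accumulation_steps accordingly so that the global batch size stays at 32, which is needed to replicate the results in the paper. If you would like to perform distributed multi-GPU training (e.g., with DistributedDataParallel), simply change python to torchrun --nproc_per_node ${n_gpu}.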

export DATA_PATH=/path/to/GUE  # e.g., /home/user
cd finetune

# Evaluate DNABERT-2 on GUE
sh scripts/run_dnabert2.sh "$DATA_PATH"

# Evaluate DNABERT (e.g., DNABERT with 3-mer) on GUE
# 3 for 3-mer, 4 for 4-mer, 5 for 5-mer, 6 for 6-mer
sh scripts/run_dnabert1.sh "$DATA_PATH" 3

# Evaluate Nucleotide Transformers on GUE
# 0 for 500m-1000g, 1 for 500m-human-ref, 2 for 2.5b-1000g, 3 for 2.5b-multi-species
sh scripts/run_nt.sh "$DATA_PATH" 0
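The global batch size of 32 mentioned above is the product of the per-device batch size, the number of GPUs, and the gradient accumulation steps. The small sketch below shows that arithmetic, assuming the usual Hugging Face Trainer convention. The function names are hypothetical and do not come from the repository's scripts.

```python
def global_batch_size(per_device_train_batch_size, n_gpu, gradient_accumulation_steps=1):
    """Number of training examples that contribute to one optimizer step."""
    return per_device_train_batch_size * n_gpu * gradient_accumulation_steps


def accumulation_steps_for(target_global, per_device_train_batch_size, n_gpu):
    """Gradient accumulation steps needed to reach target_global exactly."""
    per_step = per_device_train_batch_size * n_gpu
    if target_global % per_step:
        raise ValueError(
            f"{target_global} is not a multiple of {per_step}; "
            "change per_device_train_batch_size instead"
        )
    return target_global // per_step


# 4 GPUs with 8 examples each and no accumulation gives the paper's setting.
print(global_batch_size(8, n_gpu=4))
# On a single GPU, accumulate gradients to keep the same global batch size.
print(accumulation_steps_for(32, per_device_train_batch_size=8, n_gpu=1))
```

For example, on 2 GPUs one could keep per_device_train_batch_size at 8 and set gradient_accumulation_steps to 2.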

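6.2 Fine-tune DNABERT2 on your own datasets

Here we provide an example of fine-tuning DNABERT2 on your own datasets.

6.2.1 Format your dataset

First, generate 3 csv files from your dataset: train.csv, dev.csv, and test.csv. During training, the model is trained on train.csv and evaluated on dev.csv. After training finishes, the checkpoint with the smallest loss on dev.csv is loaded and evaluated on test.csv. If you do not have a validation set, make dev.csv and test.csv identical.

Please see the sample_data folder for an example of the data format. Each file should be in the same format, with the first row as the header sequence, label. Each following row should contain a DNA sequence and a numerical label separated by a , (e.g., ACGTCAGTCAGCGTACGT, 1).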
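As an illustration of the format described above, the following sketch splits a list of (sequence, label) pairs and writes the three files with Python's standard csv module. The helper names, the 80/10/10 split, and the toy sequences are all assumptions made for this example. The header is written as `sequence,label` without a space, which is also an assumption, so check it against the files in sample_data.

```python
import csv
import os
import random
import tempfile


def write_split(path, rows):
    """Write (sequence, label) rows under a 'sequence,label' header."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sequence", "label"])
        writer.writerows(rows)


def split_dataset(examples, dev_fraction=0.1, test_fraction=0.1, seed=0):
    """Shuffle deterministically and return (train, dev, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    n_test = int(len(shuffled) * test_fraction)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Toy data for illustration only; replace with your own labelled sequences.
examples = [("ACGTCAGTCAGCGTACGT", 1), ("TTGACCGATAGCTAGGCA", 0)] * 10
train, dev, test = split_dataset(examples)

out_dir = tempfile.mkdtemp()  # use your real data folder here
for name, rows in (("train", train), ("dev", dev), ("test", test)):
    write_split(os.path.join(out_dir, f"{name}.csv"), rows)
print("wrote splits to", out_dir)
```

The resulting folder can then be used as DATA_PATH for fine-tuning.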

Then, you are able to finetune DNABERT-2 on your own dataset with the following code:

cd finetune

export DATA_PATH=/path/to/data/folder  # e.g., ./sample_data
export MAX_LENGTH=100 # Please set the number as 0.25 * your sequence length.
                      # e.g., set it as 250 if your DNA sequences have 1000 nucleotide bases
                      # This is because the tokenizer will reduce the sequence length by about 5 times
export LR=3e-5

# Training using DataParallel
python train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False

# Training using DistributedDataParallel (more efficient)
export num_gpu=4 # please change the value based on your setup

torchrun --nproc_per_node=${num_gpu} train.py \
    --model_name_or_path zhihan1996/DNABERT-2-117M \
    --data_path ${DATA_PATH} \
    --kmer -1 \
    --run_name DNABERT2_${DATA_PATH} \
    --model_max_length ${MAX_LENGTH} \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 16 \
    --gradient_accumulation_steps 1 \
    --learning_rate ${LR} \
    --num_train_epochs 5 \
    --fp16 \
    --save_steps 200 \
    --output_dir output/dnabert2 \
    --evaluation_strategy steps \
    --eval_steps 200 \
    --warmup_steps 50 \
    --logging_steps 100 \
    --overwrite_output_dir True \
    --log_level info \
    --find_unused_parameters False
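The MAX_LENGTH rule of thumb from the comments above (about 0.25 times the sequence length in nucleotides) can be computed from the data instead of set by hand. This is a minimal sketch with hypothetical helper names. It assumes the csv files have a `sequence` column as described in 6.2.1 and uses the longest sequence as the reference length.

```python
import csv
import math


def suggest_max_length(sequences, ratio=0.25):
    """Suggested MAX_LENGTH: ratio times the longest sequence, rounded up."""
    longest = max(len(seq) for seq in sequences)
    return math.ceil(longest * ratio)


def sequences_from_csv(path):
    """Read the 'sequence' column of a train/dev/test csv file."""
    with open(path, newline="") as handle:
        return [row["sequence"] for row in csv.DictReader(handle)]


# 1000-base sequences give the value used as an example in the comments above.
print(suggest_max_length(["A" * 1000]))
```

With real data, `suggest_max_length(sequences_from_csv("train.csv"))` yields a value to export as MAX_LENGTH.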

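7. Citation

If you have any questions regarding our paper or code, please feel free to open an issue or send an email ([email protected]).

If you use DNABERT-2 in your work, please cite our paper:

DNABERT-2

@misc{zhou2023dnabert2,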
      title={DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome}, 
      author={Zhihan Zhou and Yanrong Ji and Weijian Li and Pratik Dutta and Ramana Davuluri and Han Liu},
      year={2023},
      eprint={2306.15006},
      archivePrefix={arXiv},
      primaryClass={q-bio.GN}
}

DNABERT

@article{ji2021dnabert,
    author = {Ji, Yanrong and Zhou, Zhihan and Liu, Han and Davuluri, Ramana V},
    title = "{DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome}",
    journal = {Bioinformatics},
    volume = {37},
    number = {15},
    pages = {2112-2120},
    year = {2021},
    month = {02},
    issn = {1367-4803},
    doi = {10.1093/bioinformatics/btab083},
    url = {https://doi.org/10.1093/bioinformatics/btab083},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/37/15/2112/50578892/btab083.pdf},
}
