GENA_LM

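GENA-LM is a family of open-source foundational models for long DNA sequences.

GENA-LM models are transformer masked language models trained on human DNA sequence.

Key features of our GENA-LM models:

BPE tokenization instead of k-mers (DNABERT, Nucleotide Transformer)
Max input sequence size ranges from 4.5k to 36k bp, compared to 512 bp in DNABERT and 1000 bp in Nucleotide Transformer
Pre-training on the latest T2T human genome assembly instead of GRCh38/hg38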
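The difference between fixed-length k-mer tokens and BPE tokens can be illustrated with a toy example. This is only a sketch: `kmer_tokenize`, `merge_pair` and `toy_bpe` are hypothetical helpers written for this illustration, the merges are learned from a single short string, and nothing here reproduces the actual GENA-LM tokenizer or its vocabulary. The stride-1 overlapping k-mers are an assumption chosen for the demonstration.

```python
from collections import Counter


def kmer_tokenize(seq, k=6, stride=1):
    """Split a sequence into fixed-length k-mers (every token has length k)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def toy_bpe(seq, num_merges=8):
    """Greedy BPE on one sequence: repeatedly merge the most frequent adjacent pair."""
    tokens = list(seq)  # start from single nucleotides
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, stop merging
            break
        tokens = merge_pair(tokens, best)
    return tokens


seq = "ATGCGATATATATGCGCGATATATGCGC"  # made-up sequence for illustration
kmers = kmer_tokenize(seq, k=6)
bpe_tokens = toy_bpe(seq)
print(len(kmers), "k-mer tokens, all of length 6")
print(len(bpe_tokens), "BPE-style tokens:", bpe_tokens)
```

The k-mer tokens all have the same length, while the BPE-style tokens vary in length and frequent motifs collapse into single tokens. That is why a fixed token budget can cover more base pairs with BPE.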

Pre-trained models

Model	Architecture	Max SeqLen, tokens (bp)	Params	Tokenizer data	Training data
bert-base	BERT-12L	512 (4500)	110M	T2T split v1	T2T split v1
bert-base-t2t	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bert-base-lastln-t2t	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bert-base-t2t-multi	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs+Multispecies
bert-large-t2t	BERT-24L	512 (4500)	336M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bigbird-base-sparse	BERT-12L, DeepSpeed Sparse Ops, RoPE	4096 (36000)	110M	T2T split v1	T2T split v1
bigbird-base-sparse-t2t	BERT-12L, DeepSpeed Sparse Ops, RoPE	4096 (36000)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bigbird-base-t2t	BERT-12L, HF BigBird	4096 (36000)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs

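T2T split v1 refers to preliminary models trained on a non-augmented split of the T2T human genome assembly. BERT-based models employ pre-layer normalization; lastln explicitly denotes that layer normalization is also applied to the final layer. RoPE indicates the use of rotary position embeddings in place of BERT-like absolute positional embeddings.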
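The token and base-pair limits in the table imply roughly 8.8 bp per token on average (4500 / 512 and 36000 / 4096). The sketch below reproduces that arithmetic and shows one possible way to cut a longer sequence into windows that stay within the approximate bp limit. `split_into_windows` and its `overlap` parameter are hypothetical and not part of the GENA-LM code. Because BPE tokens vary in length, the bp limit is only approximate, so the tokenizer's own truncation should remain the final guard.

```python
# (max tokens, approximate max bp) as listed in the model table
CONTEXT_LIMITS = {
    "bert (512 tokens)": (512, 4500),
    "bigbird (4096 tokens)": (4096, 36000),
}

for name, (max_tokens, max_bp) in CONTEXT_LIMITS.items():
    print(f"{name}: ~{max_bp / max_tokens:.1f} bp per token")


def split_into_windows(seq, max_bp, overlap=0):
    """Cut `seq` into consecutive windows of at most `max_bp` bases.

    Neighbouring windows share `overlap` bases. This is a hypothetical helper
    for illustration, not a function from the GENA-LM repository.
    """
    step = max_bp - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_bp")
    windows, start = [], 0
    while True:
        windows.append(seq[start:start + max_bp])
        if start + max_bp >= len(seq):  # this window already reaches the end
            break
        start += step
    return windows


long_seq = "ACGT" * 3000  # 12,000 bp of made-up sequence
windows = split_into_windows(long_seq, max_bp=4500, overlap=500)
print([len(w) for w in windows])
```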

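For our first models (gena-lm-bert-base and gena-lm-bigbird-base-sparse) we hold out human chromosomes 22 and Y (CP068256.2 and CP086569.2) as the test dataset for the masked language modeling task. For all other models, we hold out human chromosomes 7 and 10 (CP068271.2 and CP068268.2); these models have the suffix "t2t" in their names. All other data was used for training. Human-only models were trained on the pre-processed Human T2T v2 genome assembly and its 1000 Genomes SNP augmentations, for a total of ≈ 480 x 10^9 base pairs. Multispecies models were trained on both the human-only and the multispecies data, for a total of ≈ 1072 x 10^9 base pairs.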
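A hold-out split by accession, as described above, could be sketched as follows. This is a minimal illustration, not the repository's preprocessing code: `read_fasta` and `split_records` are hypothetical helpers, the FASTA text is made up, and the sketch assumes that the accession is the first whitespace-separated field of each header.

```python
# Accessions held out for the "t2t" models (chromosomes 7 and 10, from the text above)
HELD_OUT_T2T = {"CP068271.2", "CP068268.2"}


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_records(records, held_out):
    """Route each record to the test set if its accession is held out, else to train."""
    train, test = [], []
    for header, seq in records:
        accession = header.split()[0]  # assumption: accession is the first header field
        (test if accession in held_out else train).append((accession, seq))
    return train, test


# Made-up miniature FASTA; the header descriptions are illustrative only
demo_fasta = """>CP068277.2 example record
ACGTACGT
ACGT
>CP068271.2 example record
GGGGCCCC
>CP068268.2 example record
TTTTAAAA
""".splitlines()

train, test = split_records(read_fasta(demo_fasta), HELD_OUT_T2T)
print("train:", [acc for acc, _ in train])
print("test:", [acc for acc, _ in test])
```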

Pretrained models on downstream tasks

Model	Task	Task seq len	Metric	HF branch name
gena-lm-bert-base-t2t	promoters	300bp	74.56 ± 0.36 F1	promoter_300_run_1
gena-lm-bert-large-t2t	promoters	300bp	76.44 ± 0.16 F1	promoter_300_run_1
gena-lm-bert-large-t2t	promoters	2000bp	93.70 ± 0.44 F1	promoter_2000_run_1
gena-lm-bert-base-t2t	splice sites	15000bp	92.63 ± 0.09 PR AUC	spliceai_run_1
gena-lm-bert-large-t2t	splice sites	15000bp	93.59 ± 0.11 PR AUC	spliceai_run_1

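To get a pre-trained model for a downstream task, replace model_name and branch_name with values from the table. The metrics in the table are averaged over multiple runs, so the values for an individual checkpoint may differ from those reported here.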

from transformers import AutoTokenizer, AutoModel
# example values taken from the first row of the table
model_name = 'gena-lm-bert-base-t2t'
branch_name = 'promoter_300_run_1'
tokenizer = AutoTokenizer.from_pretrained(f'AIRI-Institute/{model_name}')
model = AutoModel.from_pretrained(f'AIRI-Institute/{model_name}', revision=branch_name, trust_remote_code=True)
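To avoid copying names by hand, the rows of the table can be kept in a small lookup. The sketch below is only an illustration: `FINETUNED_BRANCHES` and `checkpoint_args` are hypothetical names, and the task keys ("promoters", "splice sites") are labels chosen here rather than identifiers from the repository. It runs without downloading anything; the actual loading call is shown as a comment.

```python
# (model name, task, task sequence length in bp) -> HF branch name, copied from the table
FINETUNED_BRANCHES = {
    ("gena-lm-bert-base-t2t", "promoters", 300): "promoter_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", 300): "promoter_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", 2000): "promoter_2000_run_1",
    ("gena-lm-bert-base-t2t", "splice sites", 15000): "spliceai_run_1",
    ("gena-lm-bert-large-t2t", "splice sites", 15000): "spliceai_run_1",
}


def checkpoint_args(model_name, task, seq_len):
    """Return (repo_id, kwargs) for loading a fine-tuned checkpoint listed in the table."""
    key = (model_name, task, seq_len)
    if key not in FINETUNED_BRANCHES:
        raise KeyError(f"no fine-tuned checkpoint listed for {key}")
    repo_id = f"AIRI-Institute/{model_name}"
    kwargs = {"revision": FINETUNED_BRANCHES[key], "trust_remote_code": True}
    return repo_id, kwargs


repo_id, kwargs = checkpoint_args("gena-lm-bert-large-t2t", "promoters", 2000)
print(repo_id, kwargs)

# With transformers installed and network access, loading would then look like:
#   from transformers import AutoModel
#   model = AutoModel.from_pretrained(repo_id, **kwargs)
```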

Examples

How to load pre-trained GENA-LM for Masked Language Modeling

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)

How to load pre-trained GENA-LM to fine-tune it on a classification task

Get the model class from the GENA-LM repository:

git clone https://github.com/AIRI-Institute/GENA_LM.git

from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')

Alternatively, you can just download modeling_bert.py and put it next to your code.

Or you can get the model class through HuggingFace AutoModel:

import importlib
from transformers import AutoTokenizer, AutoModel
# load the base model once to find the module its remote code was imported into
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
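The class lookup above is plain Python: import a module by its dotted name, then fetch an attribute from it. The sketch below shows the same pattern on a standard-library module so that it runs without downloading a model. `get_class` is a hypothetical helper, not part of GENA-LM or Transformers; it additionally lists the classes a module exposes when the requested name is missing.

```python
import importlib


def get_class(module_name, class_name):
    """Import `module_name` and return the attribute `class_name` from it."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as err:
        # Help the caller by listing the classes the module does define
        available = sorted(
            name for name, obj in vars(module).items() if isinstance(obj, type)
        )
        raise ImportError(
            f"{class_name!r} not found in {module_name!r}; classes: {available}"
        ) from err


# Stand-in for: get_class(gena_module_name, 'BertForSequenceClassification')
cls = get_class("collections", "OrderedDict")
print(cls)
```

With a loaded GENA-LM model, `module_name` would be the value of `model.__class__.__module__`.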

The GENA-LM bigbird-base-t2t model uses the HuggingFace BigBird implementation. Therefore, default classes from the Transformers library can be used:

from transformers import AutoTokenizer, BigBirdForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')

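Notebooks

Sequence classification with GENA-LM and HuggingFace Transformers
Clusterization of DNA embeddings generated with GENA-LM
Explore a GENA-LM model fine-tuned on the Enformer dataset for gene expression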

Citation

@article{GENA_LM,
	author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
	title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
	elocation-id = {2023.06.12.544594},
	year = {2023},
	doi = {10.1101/2023.06.12.544594},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
	eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
	journal = {bioRxiv}
}

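Downstream tasks

Downstream tasks for model evaluation encompass prediction of promoter and enhancer activities, splicing sites, chromatin profiles, and polyadenylation site strength. Check the downstream_tasks folder for code and data preprocessing scripts:

Promoters prediction
Splice site prediction (SpliceAI)
Drosophila enhancers prediction (DeepSTARR)
Chromatin profiling (DeepSea)
Polyadenylation site prediction (APARENT)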
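The promoter results reported earlier are F1 scores averaged over several runs. As a reminder of what that metric measures, here is a minimal pure-Python sketch of binary F1 and of a mean ± spread summary. It is not the repository's evaluation code: `f1_score` is a hypothetical helper, the labels and per-run scores are made up, and reading the "±" as a sample standard deviation is an assumption.

```python
import statistics


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 1 = promoter, 0 = non-promoter
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
f1 = f1_score(y_true, y_pred)
print(f"F1 = {f1:.2f}")

# Hypothetical per-run scores (in percent), summarised as mean ± sample std
run_scores = [74.2, 74.6, 74.9]
summary = f"{statistics.mean(run_scores):.2f} ± {statistics.stdev(run_scores):.2f}"
print(summary)
```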

Pre-training data

Download and preprocess data

To download the human genome, run the following script:

./download_data.sh human

For preprocessing, execute the following script:

python src/gena_lm/genome_tools/create_corpus.py --input_file data/ncbi_dataset/data/GCA_009914755.4/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna --output_dir data/processed/human/

Installation

Models with sparse attention (gena-lm-bigbird-base-sparse, gena-lm-bigbird-base-sparse-t2t) require FP16 support and DeepSpeed.

APEX for FP16

Install APEX https://github.com/nvidia/apex#quick-start

 git clone https://github.com/NVIDIA/apex
cd apex
# most recent commits may fail to build
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

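DeepSpeed for Sparse Ops

A DeepSpeed installation is needed to work with the SparseAttention versions of the language models. DeepSpeed Sparse Attention supports only GPUs with compute capability >= 7 (V100, T4, A100) and CUDA 10.1, 10.2, 11.0, or 11.1, and runs only in FP16 mode (as of DeepSpeed 0.6.0).

PyTorch>=1.7.1,<=1.10.1 wheels with CUDA 10.2/11.0/11.1 from pytorch.org can be used. However, using sparse ops with CUDA 11.1 PyTorch wheels requires CUDA 11.3/11.4 to be installed on the system. Sparse ops can also be used with PyTorch==1.12.1 CUDA 11.3 wheels, but running the DeepSpeed sparse ops tests then requires modifying them, because they check for Torch CUDA version <= 11.1. The DeepSpeed fork for Triton 1.1.1 already has updated tests.

Triton 1.0.0 and 1.1.1 require Python <= 3.9.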

pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache

and check the installation with

ds_report
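Some of the constraints listed in this section (GPU compute capability >= 7, FP16 only, Python <= 3.9 for Triton) can be checked before starting a long build. The sketch below is a hypothetical pre-flight helper, not part of DeepSpeed or GENA-LM. The values are passed in explicitly so that it runs without a GPU; it does not check the CUDA or PyTorch versions.

```python
import sys


def sparse_attention_issues(python_version, compute_capability, fp16):
    """Return a list of human-readable problems; an empty list means no issue found.

    python_version: tuple such as sys.version_info
    compute_capability: (major, minor) tuple of the target GPU
    fp16: whether the run will use FP16 mode
    """
    issues = []
    if tuple(python_version[:2]) > (3, 9):
        issues.append("Triton 1.0.0 / 1.1.1 require Python <= 3.9")
    if compute_capability[0] < 7:
        issues.append("sparse attention needs GPU compute capability >= 7")
    if not fp16:
        issues.append("sparse attention runs only in FP16 mode")
    return issues


# Example: current interpreter, an assumed (8, 0) GPU, FP16 enabled.
# With PyTorch installed, torch.cuda.get_device_capability() returns the real tuple.
for issue in sparse_attention_issues(sys.version_info, (8, 0), fp16=True):
    print("WARNING:", issue)
```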

Triton 1.1.1

Triton 1.1.1 brings a 2x speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currently supports only Triton 1.0.0. Where that speed-up is needed, a DeepSpeed fork with Triton 1.1.1 support can be used:

pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache

and run the sparse ops tests with

cd tests/unit
pytest -v test_sparse_attention.py

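Finetuning with lm-experiments-tools

We use the Trainer and multi-GPU training from the lm-experiments-tools repository as the basis for our finetuning scripts. However, you can use the HF Transformers Trainer, PyTorch Lightning, or Accelerate and PyTorch with custom training loops instead.

Install lm-experiments-tools according to https://github.com/yurakuratov/t5-experiments#install-only-only-lm_experiments_tools:

git clone https://github.com/yurakuratov/t5-experiments
cd t5-experiments
pip install -e .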
