GENA_LM

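GENA-LM

GENA-LM is a family of open-source foundational models for long DNA sequences.

GENA-LM models are transformer masked language models trained on human DNA sequences.

Key features of our GENA-LM models:

BPE tokenization instead of k-mers (DNABERT, Nucleotide Transformer)
Maximum input sequence size ranges from 4.5k to 36k bp, compared to 512 bp in DNABERT and 1000 bp in Nucleotide Transformer
Pre-training on the latest T2T human genome assembly rather than GRCh38/hg38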
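
To make the first feature concrete, here is a toy comparison of overlapping k-mer tokenization and byte-pair encoding (BPE) in plain Python. It is only an illustration of the two ideas: the function names are hypothetical, the merge rules are learned from a single short string, and none of this is the actual GENA-LM tokenizer or its vocabulary.

```python
from collections import Counter


def kmer_tokenize(seq, k=6):
    """Overlapping k-mers with stride 1 (one common k-mer scheme)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def apply_merge(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with the joined token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(seq, num_merges):
    """Toy BPE: start from single nucleotides, repeatedly merge the most frequent adjacent pair."""
    tokens = list(seq)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        # Ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        tokens = apply_merge(tokens, best)
    return merges, tokens


sequence = "ACGTACGTACGT"
print(len(kmer_tokenize(sequence, k=6)))  # 7 overlapping 6-mers for 12 bp
print(learn_bpe_merges(sequence, 3)[1])   # ['ACGT', 'ACGT', 'ACGT']
```

With k-mers, every token covers a fixed number of bases; with BPE, frequent subsequences collapse into longer tokens, which is why a fixed token budget can cover more base pairs.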

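Pre-trained models

Model	Architecture	Max SeqLen, tokens (bp)	Params	Tokenizer data	Training data
bert-base	BERT-12L	512 (4500)	110M	T2T split v1	T2T split v1
bert-base-t2t	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bert-base-lastln-t2t	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bert-base-t2t-multi	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs+Multispecies
bert-large-t2t	BERT-24L	512 (4500)	336M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bigbird-base-sparse	BERT-12L, DeepSpeed Sparse Ops, RoPE	4096 (36000)	110M	T2T split v1	T2T split v1
bigbird-base-sparse-t2t	BERT-12L, DeepSpeed Sparse Ops, RoPE	4096 (36000)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bigbird-base-t2t	BERT-12L, HF BigBird	4096 (36000)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs

T2T split v1 refers to preliminary models trained on a non-augmented T2T human genome assembly split. BERT-based models employ pre-layer normalization; lastln explicitly denotes that layer normalization is also applied to the final layer. RoPE indicates the use of rotary position embeddings in place of BERT-like absolute positional embeddings.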
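
The token and base-pair limits in the table imply a rough average token length. The sketch below works that out and shows one simple way to cut a long sequence into windows that stay under a base-pair budget. The helper names are hypothetical, and the bp figures are approximate: actual token lengths vary, so a window cut by base pairs is not guaranteed to fit the token limit exactly.

```python
# (max tokens, approximate max bp) pairs copied from the table above
MAX_LENGTHS = {"bert": (512, 4500), "bigbird": (4096, 36000)}


def bp_per_token(max_tokens, max_bp):
    """Average number of base pairs covered by one token."""
    return max_bp / max_tokens


def split_into_windows(sequence, max_bp):
    """Cut a sequence into consecutive, non-overlapping windows of at most max_bp bases."""
    return [sequence[i:i + max_bp] for i in range(0, len(sequence), max_bp)]


for name, (tokens, bp) in MAX_LENGTHS.items():
    print(name, round(bp_per_token(tokens, bp), 2))  # both come out near 8.79 bp per token

windows = split_into_windows("A" * 10000, MAX_LENGTHS["bert"][1])
print([len(w) for w in windows])  # [4500, 4500, 1000]
```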

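For our first models (gena-lm-bert-base and gena-lm-bigbird-base-sparse), we hold out human chromosomes 22 and Y (CP068256.2 and CP086569.2) as the test dataset for the masked language modeling task. For all other models, we hold out human chromosomes 7 and 10 (CP068271.2 and CP068268.2); these models have the suffix "t2t" in their names. All other data was used for training. Human-only models were trained on the pre-processed human T2T v2 genome assembly and its 1000 Genomes SNP augmentations, for a total of ≈ 480 × 10^9 base pairs. Multispecies models were trained on the human-only and multispecies data, for a total of ≈ 1072 × 10^9 base pairs.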
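
A hold-out split by chromosome can be sketched as below. This is not the repository's preprocessing code: the function name is hypothetical, and it assumes FASTA headers start with the accession (for example `>CP068271.2 ...`). Only the two accession IDs come from the text above.

```python
# Accessions held out for the "t2t" models, as listed above (chromosomes 7 and 10).
HELD_OUT_T2T = {"CP068271.2", "CP068268.2"}


def split_fasta(lines, held_out=HELD_OUT_T2T):
    """Split FASTA records into (train, test) dicts keyed by accession."""
    train, test = {}, {}
    current = None  # list collecting the sequence lines of the current record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Assumption: the accession is the first whitespace-separated field.
            accession = line[1:].split()[0]
            bucket = test if accession in held_out else train
            current = bucket.setdefault(accession, [])
        elif current is not None:
            current.append(line)

    def join(records):
        return {name: "".join(parts) for name, parts in records.items()}

    return join(train), join(test)


toy_fasta = [
    ">toy_contig_1 made-up record", "ACGT", "ACGT",
    ">CP068271.2 held-out record", "GGCC",
]
train, test = split_fasta(toy_fasta)
print(sorted(train), sorted(test))  # ['toy_contig_1'] ['CP068271.2']
```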

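Pre-trained models on downstream tasks

Model	Task	Task seq len	Metric	HF branch name
gena-lm-bert-base-t2t	promoters	300bp	74.56+-0.36 F1	promoter_300_run_1
gena-lm-bert-large-t2t	promoters	300bp	76.44+-0.16 F1	promoter_300_run_1
gena-lm-bert-large-t2t	promoters	2000bp	93.70+-0.44 F1	promoter_2000_run_1
gena-lm-bert-base-t2t	splice sites	15000bp	92.63+-0.09 PR AUC	spliceai_run_1
gena-lm-bert-large-t2t	splice sites	15000bp	93.59+-0.11 PR AUC	spliceai_run_1

To get a pre-trained model for a downstream task, replace model_name and branch_name with values from the table. The metrics in the table are averaged over multiple runs, so the values for an individual checkpoint may differ from those reported here.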

from transformers import AutoTokenizer, AutoModel
# example values taken from the first row of the table; substitute your own
model_name = 'gena-lm-bert-base-t2t'
branch_name = 'promoter_300_run_1'
tokenizer = AutoTokenizer.from_pretrained(f'AIRI-Institute/{model_name}')
model = AutoModel.from_pretrained(f'AIRI-Institute/{model_name}', revision=branch_name, trust_remote_code=True)
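
If you switch between several of these checkpoints, it may help to keep the table rows in a small lookup. The sketch below does that in plain Python; the dictionary and the function name are hypothetical conveniences, and only the model names, branch names, and the `revision` / `trust_remote_code` arguments come from this document.

```python
# Rows of the downstream-task table: (model, task, sequence length in bp) -> HF branch name
DOWNSTREAM_BRANCHES = {
    ("gena-lm-bert-base-t2t", "promoters", 300): "promoter_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", 300): "promoter_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", 2000): "promoter_2000_run_1",
    ("gena-lm-bert-base-t2t", "splice sites", 15000): "spliceai_run_1",
    ("gena-lm-bert-large-t2t", "splice sites", 15000): "spliceai_run_1",
}


def checkpoint_reference(model_name, task, seq_len):
    """Return (repo_id, from_pretrained keyword arguments) for a table row."""
    key = (model_name, task, seq_len)
    if key not in DOWNSTREAM_BRANCHES:
        raise KeyError(f"no fine-tuned checkpoint listed for {key}")
    repo_id = f"AIRI-Institute/{model_name}"
    kwargs = {"revision": DOWNSTREAM_BRANCHES[key], "trust_remote_code": True}
    return repo_id, kwargs


repo_id, kwargs = checkpoint_reference("gena-lm-bert-large-t2t", "promoters", 2000)
print(repo_id, kwargs["revision"])
# With transformers installed, the pair can be passed on unchanged:
# model = AutoModel.from_pretrained(repo_id, **kwargs)
```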

Examples

How to load pre-trained GENA-LM for masked language modeling

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
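
For readers new to the objective itself, the following toy sketch shows how inputs for masked language modeling are typically prepared: some tokens are hidden, and the model is asked to predict them. The 15% masking rate and the `[MASK]` token string are assumptions borrowed from common BERT practice and are not stated in this document; real pipelines also usually replace some selected tokens with random or unchanged tokens, which is omitted here.

```python
import random


def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Return (masked_tokens, labels); labels are None where nothing was masked."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model should recover this token
        else:
            masked.append(token)
            labels.append(None)    # position is ignored by the loss
    return masked, labels


masked, labels = mask_tokens(["ACGT", "TT", "GGA", "C", "ATAT"], mask_prob=0.5)
print(masked)
print(labels)
```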

How to load pre-trained GENA-LM to fine-tune it on a classification task

Get the model class from the GENA-LM repository:

git clone https://github.com/AIRI-Institute/GENA_LM.git

from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')

Alternatively, you can simply download modeling_bert.py and place it next to your code.

Or you can get the model class from HuggingFace AutoModel:

from transformers import AutoTokenizer, AutoModel
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)
import importlib
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
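
The class lookup in the last snippet is an ordinary Python pattern: import a module by its dotted name, then fetch an attribute by name. A generic sketch, demonstrated on a standard-library module so that it runs without downloading a model (the helper name `resolve_class` is hypothetical):

```python
import importlib


def resolve_class(module_name, class_name):
    """Import `module_name` and return the attribute called `class_name`."""
    module = importlib.import_module(module_name)
    try:
        return getattr(module, class_name)
    except AttributeError as err:
        # Surface a clearer error when the module lacks the requested class.
        raise ImportError(f"{module_name} has no class {class_name!r}") from err


# Stand-in for: resolve_class(gena_module_name, 'BertForSequenceClassification')
ordered_dict_cls = resolve_class("collections", "OrderedDict")
print(ordered_dict_cls)
```

The same call with `gena_module_name` and any of the class names listed in the comments above should return the corresponding GENA-LM class, assuming the remote module defines it.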

The GENA-LM bigbird-base-t2t model uses the HuggingFace BigBird implementation, so the default classes from the Transformers library can be used:

from transformers import AutoTokenizer, BigBirdForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')

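Notebooks

Sequence classification with GENA-LM and HuggingFace Transformers
Clustering of DNA embeddings generated with GENA-LM
Explore a GENA-LM model fine-tuned on the Enformer dataset for gene expression

Citation

@article{GENA_LM,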
	author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
	title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
	elocation-id = {2023.06.12.544594},
	year = {2023},
	doi = {10.1101/2023.06.12.544594},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
	eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
	journal = {bioRxiv}
}

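Downstream tasks

Downstream tasks for model evaluation include prediction of promoter and enhancer activity, splice sites, chromatin profiles, and polyadenylation site strength. See the downstream_tasks folder for the code and data preprocessing scripts:

Promoter prediction
Splice site prediction (SpliceAI)
Drosophila enhancer prediction (DeepSTARR)
Chromatin profiling (DeepSea)
Polyadenylation site prediction (APARENT)

Pre-training data

Download and preprocess data

To download the human genome, run the following script:

./download_data.sh human

For preprocessing, run the following script:

python src/gena_lm/genome_tools/create_corpus.py --input_file data/ncbi_dataset/data/GCA_009914755.4/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna --output_dir data/processed/human/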

Installation

Models with sparse attention (gena-lm-bigbird-base-sparse, gena-lm-bigbird-base-sparse-t2t) require FP16 support and DeepSpeed.

APEX for FP16

Install APEX: https://github.com/nvidia/apex#quick-start

git clone https://github.com/NVIDIA/apex
cd apex
# most recent commits may fail to build
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1), which supports multiple `--config-settings` with the same key:
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
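
The choice between the two `pip install` variants depends only on the pip version. If you script the installation, the branch can be sketched as follows; the function name is hypothetical, and the flags are copied from the two commands above rather than checked against any particular APEX commit.

```python
def apex_pip_args(pip_version):
    """Build the APEX install command for a pip version string such as '23.1.2'."""
    parts = pip_version.split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    common = ["pip", "install", "-v", "--disable-pip-version-check",
              "--no-cache-dir", "--no-build-isolation"]
    if (major, minor) >= (23, 1):
        # pip >= 23.1 accepts repeated --config-settings with the same key
        options = ["--config-settings", "--build-option=--cpp_ext",
                   "--config-settings", "--build-option=--cuda_ext"]
    else:
        options = ["--global-option=--cpp_ext", "--global-option=--cuda_ext"]
    return common + options + ["./"]


print(" ".join(apex_pip_args("23.1.2")))
print(" ".join(apex_pip_args("22.3.1")))
```

The returned list can be passed to `subprocess.run` from inside the cloned `apex` directory.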

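DeepSpeed for Sparse Ops

A DeepSpeed installation is needed to work with the sparse-attention versions of the language models. DeepSpeed Sparse Attention supports only GPUs with compute capability >= 7 (V100, T4, A100) and CUDA 10.1, 10.2, 11.0, or 11.1, and runs only in FP16 mode (as of DeepSpeed 0.6.0).

PyTorch >= 1.7.1, <= 1.10.1 wheels with CUDA 10.2/11.0/11.1 from pytorch.org can be used. However, using Sparse Ops with CUDA 11.1 PyTorch wheels requires CUDA 11.3/11.4 to be installed on the system. Sparse Ops can also be used with PyTorch == 1.12.1 CUDA 11.3 wheels, but running the DeepSpeed Sparse Ops tests then requires modifying them, because they check for Torch CUDA version <= 11.1. The DeepSpeed fork for Triton 1.1.1 already has updated tests.

Triton 1.0.0 and 1.1.1 require Python <= 3.9.

pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache

and check the installation with

ds_report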
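
Before building, it can save time to check an environment against the constraints listed in this section. The sketch below encodes them as plain data; it is a hypothetical helper, not part of DeepSpeed or this repository. It reads the supported CUDA list as 10.1, 10.2, 11.0, and 11.1, and it is deliberately conservative: it does not model the PyTorch 1.12.1 / CUDA 11.3 exception described above.

```python
import sys

SUPPORTED_CUDA = {"10.1", "10.2", "11.0", "11.1"}  # assumed reading of the list above
MIN_COMPUTE_CAPABILITY = (7, 0)                    # V100, T4, A100 and newer
MAX_PYTHON = (3, 9)                                # limit for Triton 1.0.0 / 1.1.1


def sparse_attention_issues(python_version, compute_capability, cuda_version):
    """Return a list of human-readable problems; an empty list means no known conflict."""
    issues = []
    if tuple(python_version[:2]) > MAX_PYTHON:
        issues.append("Triton 1.0.0/1.1.1 needs Python <= 3.9")
    if tuple(compute_capability) < MIN_COMPUTE_CAPABILITY:
        issues.append("GPU compute capability must be >= 7")
    if cuda_version not in SUPPORTED_CUDA:
        issues.append(f"CUDA {cuda_version} is not in {sorted(SUPPORTED_CUDA)}")
    return issues


# Example: the running interpreter, with a made-up GPU (8.0) and CUDA version (11.1)
print(sparse_attention_issues(sys.version_info, (8, 0), "11.1"))
```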

Triton 1.1.1

Triton 1.1.1 brings a 2x speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currently supports only Triton 1.0.0. Where this speed-up is needed, a DeepSpeed fork with Triton 1.1.1 support can be used:

pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache

and run the sparse ops tests with

cd tests/unit
pytest -v test_sparse_attention.py

Finetuning with lm-experiments-tools

We use the Trainer and multi-GPU training from the lm-experiments-tools repository as the basis for our finetuning scripts. However, you can instead use the HF Transformers Trainer, PyTorch Lightning, or Accelerate and PyTorch with custom training loops.

Install lm-experiments-tools according to https://github.com/yurakuratov/t5-experiments#install-only-lm_experiments_tools:

git clone https://github.com/yurakuratov/t5-experiments
cd t5-experiments
pip install -e .

Additional information

Version 1.0.0
Type AI source code
Updated 2025-09-11
Size 31.87 MB
Source GitHub
