GENA-LM is a family of Open-Source Foundational Models for Long DNA Sequences.
GENA-LM models are transformer masked language models trained on human DNA sequence.
Key features of our GENA-LM models:
| Model | Architecture | Max SeqLen, tokens (bp) | Params | Tokenizer data | Training data |
|---|---|---|---|---|---|
| bert-base | BERT-12L | 512(4500) | 110M | T2T split v1 | T2T split v1 |
| bert-base-t2t | BERT-12L | 512(4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bert-base-lastln-t2t | BERT-12L | 512(4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bert-base-t2t-multi | BERT-12L | 512(4500) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs+Multispecies |
| bert-large-t2t | BERT-24L | 512(4500) | 336M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bigbird-base-sparse | BERT-12L, DeepSpeed Sparse Ops, RoPE | 4096(36000) | 110M | T2T split v1 | T2T split v1 |
| bigbird-base-sparse-t2t | BERT-12L, DeepSpeed Sparse Ops, RoPE | 4096(36000) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
| bigbird-base-t2t | BERT-12L, HF BigBird | 4096(36000) | 110M | T2T+1000G SNPs+Multispecies | T2T+1000G SNPs |
T2T split v1 refers to preliminary models with a non-augmented T2T human genome assembly split. BERT-based models employ Pre-Layer Normalization, and "lastln" in a model name explicitly denotes that layer normalization is also applied to the final layer. RoPE indicates the use of rotary position embeddings in place of BERT-like absolute positional embeddings.
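The token and base-pair limits in the table imply roughly 8.8 bp per token (4500/512 and 36000/4096 give the same ratio). A minimal sketch of how one might use those limits to pick a model family for a given sequence length; the names `MAX_INPUT_BP`, `MAX_INPUT_TOKENS`, `approx_bp_per_token`, and `smallest_fitting_family` are hypothetical helpers, not part of GENA-LM, and the ratio is only an average, since the actual token count depends on the tokenizer and the sequence:

```python
# Approximate input limits copied from the table above.
MAX_INPUT_TOKENS = {"bert": 512, "bigbird": 4096}
MAX_INPUT_BP = {"bert": 4500, "bigbird": 36000}


def approx_bp_per_token(family: str) -> float:
    """Average number of base pairs covered by one token, per the table."""
    return MAX_INPUT_BP[family] / MAX_INPUT_TOKENS[family]


def smallest_fitting_family(seq_len_bp: int):
    """Return the first family whose approximate limit covers seq_len_bp, else None."""
    for family in ("bert", "bigbird"):
        if seq_len_bp <= MAX_INPUT_BP[family]:
            return family
    return None


if __name__ == "__main__":
    print(round(approx_bp_per_token("bert"), 2))
    print(smallest_fitting_family(15000))
```

Because the estimate is approximate, the reliable check is still to tokenize the sequence and compare the token count with the model's limit.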
For our first models (gena-lm-bert-base and gena-lm-bigbird-base-sparse) we hold out human chromosomes 22 and Y (CP068256.2 and CP086569.2) as the test dataset for the masked language modeling task. For all other models, we hold out human chromosomes 7 and 10 (CP068271.2 and CP068268.2); these models have the suffix "t2t" in their names. All remaining data was used for training. Human-only models were trained on the pre-processed Human T2T v2 genome assembly and its 1000 Genomes SNP augmentations, for a total of ≈ 480 x 10^9 base pairs. Multispecies models were trained on human-only and multispecies data, for a total of ≈ 1072 x 10^9 base pairs.
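The hold-out rule above can be sketched as a small partitioning step. This is an illustration only, assuming records arrive as `(accession, sequence)` pairs; `holdout_accessions` and `split_records` are hypothetical names, and the actual GENA-LM preprocessing scripts may organize the split differently:

```python
# Held-out accessions copied from the paragraph above.
HOLDOUT_FIRST_MODELS = {"CP068256.2", "CP086569.2"}  # chromosomes 22 and Y
HOLDOUT_T2T_MODELS = {"CP068271.2", "CP068268.2"}    # chromosomes 7 and 10


def holdout_accessions(model_name: str) -> set:
    """Pick the held-out set based on whether "t2t" is part of the model name."""
    if "t2t" in model_name.split("-"):
        return HOLDOUT_T2T_MODELS
    return HOLDOUT_FIRST_MODELS


def split_records(records, model_name: str):
    """Partition (accession, sequence) pairs into train and test lists."""
    held_out = holdout_accessions(model_name)
    train, test = [], []
    for accession, sequence in records:
        target = test if accession in held_out else train
        target.append((accession, sequence))
    return train, test


if __name__ == "__main__":
    # Toy sequences for illustration; real records would come from the genome FASTA.
    toy = [("CP068271.2", "ACGT"), ("CP068256.2", "GGCC")]
    print(split_records(toy, "gena-lm-bert-base-t2t"))
```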
| Model | Task | Task seq len | Metric | HF branch name |
|---|---|---|---|---|
| gena-lm-bert-base-t2t | promoters | 300bp | 74.56+-0.36 F1 | promoters_300_run_1 |
| gena-lm-bert-large-t2t | promoters | 300bp | 76.44+-0.16 F1 | promoters_300_run_1 |
| gena-lm-bert-large-t2t | promoters | 2000bp | 93.70+-0.44 F1 | promoters_2000_run_1 |
| gena-lm-bert-base-t2t | splice site | 15000bp | 92.63+-0.09 PR AUC | spliceai_run_1 |
| gena-lm-bert-large-t2t | splice site | 15000bp | 93.59+-0.11 PR AUC | spliceai_run_1 |
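The table above maps each fine-tuned checkpoint to a branch of its Hugging Face repository. A minimal sketch that turns the table into a lookup, so the repository id and `revision` are not typed by hand; `FINETUNED_BRANCHES` and `finetuned_checkpoint` are hypothetical names introduced here for illustration:

```python
# (model, task, task sequence length) -> HF branch name, copied from the table above.
FINETUNED_BRANCHES = {
    ("gena-lm-bert-base-t2t", "promoters", "300bp"): "promoters_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", "300bp"): "promoters_300_run_1",
    ("gena-lm-bert-large-t2t", "promoters", "2000bp"): "promoters_2000_run_1",
    ("gena-lm-bert-base-t2t", "splice site", "15000bp"): "spliceai_run_1",
    ("gena-lm-bert-large-t2t", "splice site", "15000bp"): "spliceai_run_1",
}


def finetuned_checkpoint(model: str, task: str, seq_len: str):
    """Return (repo_id, kwargs) for loading one fine-tuned checkpoint."""
    branch = FINETUNED_BRANCHES[(model, task, seq_len)]  # KeyError if not in the table
    repo_id = f"AIRI-Institute/{model}"
    kwargs = {"revision": branch, "trust_remote_code": True}
    return repo_id, kwargs


if __name__ == "__main__":
    print(finetuned_checkpoint("gena-lm-bert-large-t2t", "promoters", "2000bp"))
```

The returned pair can then be passed on as `AutoModel.from_pretrained(repo_id, **kwargs)`.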
To load a model fine-tuned on a downstream task, replace model_name and branch_name with values from the table. The metrics in the table are averaged over multiple runs, so the values for an individual checkpoint may differ from those reported here.
```python
from transformers import AutoTokenizer, AutoModel

# example values taken from the table of fine-tuned models
model_name = 'gena-lm-bert-base-t2t'
branch_name = 'promoters_300_run_1'

tokenizer = AutoTokenizer.from_pretrained(f'AIRI-Institute/{model_name}')
model = AutoModel.from_pretrained(f'AIRI-Institute/{model_name}', revision=branch_name, trust_remote_code=True)
```

To load a pre-trained model:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
```

To fine-tune on a classification task, get the model class from the GENA-LM repository:
```bash
git clone https://github.com/AIRI-Institute/GENA_LM.git
```

```python
from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
```

Alternatively, download modeling_bert.py and place it next to your code.
Or get the model class through HuggingFace AutoModel:

```python
import importlib

from transformers import AutoTokenizer, AutoModel

# load the model once to find out which module its remote code was imported into
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)

# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)

model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
```

The GENA-LM bigbird-base-t2t model uses the HuggingFace BigBird implementation, so the default classes from the Transformers library can be used:
```python
from transformers import AutoTokenizer, BigBirdForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
```

Sequence classification with GENA-LM and Huggingface Transformers
Clusterization of DNA embeddings generated with GENA LM
Explore GENA-LM model fine-tuned on Enformer dataset for gene expression
@article {GENA_LM,
author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
elocation-id = {2023.06.12.544594},
year = {2023},
doi = {10.1101/2023.06.12.544594},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
journal = {bioRxiv}
}
Downstream tasks for model evaluation encompass the prediction of promoter and enhancer activity, splicing sites, chromatin profiles, and polyadenylation site strength.
See the downstream_tasks folder for the code and data preprocessing scripts we used.
To download the human genome, run the following script:

```bash
./download_data.sh human
```

For preprocessing, run the following script:

```bash
python src/gena_lm/genome_tools/create_corpus.py --input_file data/ncbi_dataset/data/GCA_009914755.4/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna --output_dir data/processed/human/
```
For models with sparse attention (gena-lm-bigbird-base-sparse, gena-lm-bigbird-base-sparse-t2t), FP16 support and DeepSpeed are needed.
Install APEX (https://github.com/NVIDIA/apex#quick-start):

```bash
git clone https://github.com/NVIDIA/apex
cd apex
# most recent commits may fail to build
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key...
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```
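The choice between the two APEX install commands depends only on the pip version. A minimal sketch of that decision, assuming the version string comes from `pip --version` or `pip.__version__`; `apex_build_flags` is a hypothetical helper, not part of APEX or pip:

```python
import re


def apex_build_flags(pip_version: str) -> list:
    """Return the extension-build flags matching the installed pip version."""
    match = re.match(r"(\d+)\.(\d+)", pip_version)
    if match is None:
        raise ValueError(f"unrecognized pip version: {pip_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (23, 1):
        # pip >= 23.1 accepts repeated --config-settings with the same key
        return [
            "--config-settings", "--build-option=--cpp_ext",
            "--config-settings", "--build-option=--cuda_ext",
        ]
    # older pip: fall back to --global-option
    return ["--global-option=--cpp_ext", "--global-option=--cuda_ext"]


if __name__ == "__main__":
    print(apex_build_flags("23.1.2"))
    print(apex_build_flags("22.3.1"))
```

The returned flags are the part of the command that differs; the remaining options (`-v --disable-pip-version-check --no-cache-dir --no-build-isolation ./`) are the same in both cases.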
DeepSpeed must be installed to work with the SparseAttention versions of the language models. DeepSpeed Sparse Attention supports only GPUs with compute capability >= 7 (V100, T4, A100) and CUDA 10.1, 10.2, 11.0, or 11.1, and it runs only in FP16 mode (as of DeepSpeed 0.6.0).
PyTorch>=1.7.1,<=1.10.1 wheels with CUDA 10.2/11.0/11.1 from pytorch.org can be used. However, using Sparse Ops with CUDA 11.1 PyTorch wheels would require CUDA 11.3/11.4 to be installed on the system. Sparse Ops could also be used with PyTorch==1.12.1 CUDA 11.3 wheels, but running DeepSpeed Sparse Ops tests would require modifying them as they check for Torch CUDA version <=11.1. DeepSpeed fork for Triton 1.1.1 already has updated tests.
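The requirements above can be checked before installing anything. A minimal sketch, assuming the compute capability and CUDA version are supplied by the caller; `SUPPORTED_CUDA` and `sparse_attention_issues` are hypothetical names, and the check follows only the list stated for DeepSpeed 0.6.0, so it also flags the PyTorch 1.12.1 / CUDA 11.3 combination that the text says can be made to work:

```python
# CUDA versions listed for DeepSpeed Sparse Attention (as of DeepSpeed 0.6.0).
SUPPORTED_CUDA = ("10.1", "10.2", "11.0", "11.1")


def sparse_attention_issues(compute_capability, cuda_version: str, fp16: bool) -> list:
    """Return a list of problems; an empty list means none were found."""
    issues = []
    major, minor = compute_capability
    if major < 7:
        issues.append(f"GPU compute capability {major}.{minor} is below 7.0")
    if cuda_version not in SUPPORTED_CUDA:
        issues.append(f"CUDA {cuda_version} is not one of {', '.join(SUPPORTED_CUDA)}")
    if not fp16:
        issues.append("sparse attention runs only in FP16 mode")
    return issues


if __name__ == "__main__":
    # Example values: an A100 (compute capability 8.0) with CUDA 11.1 in FP16.
    print(sparse_attention_issues((8, 0), "11.1", fp16=True))
```

With PyTorch installed, `torch.cuda.get_device_capability()` and `torch.version.cuda` provide the first two arguments.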
Triton 1.0.0 and 1.1.1 require Python <= 3.9.

```bash
pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache
```

and check the installation with

```bash
ds_report
```

Triton 1.1.1 brings a 2x speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currently supports only Triton 1.0.0. A DeepSpeed fork with Triton 1.1.1 support can be used where such a speed-up is needed:
```bash
pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache
```

and run the sparse ops tests with

```bash
cd tests/unit
pytest -v test_sparse_attention.py
```

We use the Trainer and multi-GPU training from the lm-experiments-tools repository as the basis for our fine-tuning scripts. However, you can use HF Transformers Trainer, PyTorch Lightning, or Accelerate and PyTorch with custom training loops instead.
Install lm-experiments-tools according to https://github.com/yurakuratov/t5-experiments#install-only-lm_experiments_tools:

```bash
git clone https://github.com/yurakuratov/t5-experiments
cd t5-experiments
pip install -e .
```