GENA-LM

GENA-LM is a family of open-source foundational models for long DNA sequences.

GENA-LM models are transformer masked language models trained on human DNA sequences.
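
To make the masked language modeling objective concrete, here is a minimal sketch of the masking step: a fraction of the tokens is hidden and the model is trained to recover them. The `[MASK]` symbol, the 15% rate, the toy tokens, and the `mask_tokens` helper are illustrative assumptions in the style of BERT, not the exact GENA-LM training code.

```python
import random

MASK_TOKEN = "[MASK]"  # placeholder mask symbol (assumption, BERT-style)

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels are None where nothing was masked."""
    rng = random.Random(seed)
    # Mask at least one token, but never more than the sequence holds.
    n_to_mask = min(len(tokens), max(1, round(len(tokens) * mask_prob)))
    positions = set(rng.sample(range(len(tokens)), n_to_mask))
    masked = [MASK_TOKEN if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels

# Toy variable-length tokens standing in for BPE tokens of a DNA sequence.
tokens = ["ATG", "GCCA", "TTAG", "C", "GGATC", "AAT", "CGT", "TTTA", "GC", "ACGTA"]
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The loss would be computed only at positions where `labels` is not `None`.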

Key features of our GENA-LM models:

BPE tokenization instead of k-mers (DNABERT, Nucleotide Transformer)
max input sequence size ranges from about 4.5k to 36k bp, compared to 512 bp in DNABERT and 1000 bp in Nucleotide Transformer
pre-training on the latest T2T human genome assembly vs. GRCh38/hg38
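
The difference in input length follows from how many base pairs one token covers. The sketch below contrasts overlapping k-mer tokenization (one token per start position) with the average span of a BPE token implied by the sequence lengths quoted in this README. The helper name `overlapping_kmers` and the choice of k=6 with stride 1 are assumptions made for illustration.

```python
def overlapping_kmers(seq, k=6):
    """Split a sequence into overlapping k-mers with stride 1."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ATGGCCATTAGC"
kmers = overlapping_kmers(seq, k=6)
print(kmers)       # one token per start position
print(len(kmers))  # len(seq) - k + 1 tokens

# Figures quoted in this README: 512 tokens ~ 4500 bp, 4096 tokens ~ 36000 bp.
for max_tokens, approx_bp in [(512, 4500), (4096, 36000)]:
    bp_per_token = approx_bp / max_tokens
    print(max_tokens, "tokens ->", round(bp_per_token, 1), "bp per BPE token on average")
```

With stride-1 k-mers the token count grows almost one-to-one with sequence length, whereas a BPE token here spans several base pairs on average, so the same token budget reaches much further along the genome.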

Pre-trained models

Model	Architecture	Max SeqLen, tokens (bp)	Params	Tokenizer data	Training data
bert-base	BERT-12L	512 (4500)	110M	T2T split v1	T2T split v1
bert-base-t2t	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bert-base-lastln-t2t	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bert-base-t2t-multi	BERT-12L	512 (4500)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs+Multispecies
bert-large-t2t	BERT-24L	512 (4500)	336M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bigbird-base-sparse	BERT-12L, DeepSpeed Sparse Ops, RoPE	4096 (36000)	110M	T2T split v1	T2T split v1
bigbird-base-sparse-t2t	BERT-12L, DeepSpeed Sparse Ops, RoPE	4096 (36000)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs
bigbird-base-t2t	BERT-12L, HF BigBird	4096 (36000)	110M	T2T+1000G SNPs+Multispecies	T2T+1000G SNPs

T2T split v1 refers to preliminary models trained on a non-augmented split of the T2T human genome assembly. BERT-based models employ Pre-Layer normalization, and lastln explicitly denotes that layer normalization is also applied to the final layer. RoPE indicates the use of rotary position embeddings in place of BERT-like absolute positional embeddings.
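
As a rough illustration of rotary position embeddings, the sketch below rotates consecutive pairs of vector components by an angle proportional to the token position. The pairing layout, the base of 10000, and the helper names `rope` and `dot` are assumptions for this example; the GENA-LM implementation may arrange the components differently.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    d = len(vec)
    assert d % 2 == 0, "dimension must be even"
    out = []
    for i in range(0, d, 2):
        # Pair i//2 uses frequency base ** (-i / d); the angle grows with position.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]
# Both pairs of positions are 2 apart, so the attention scores should match.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
print(abs(score_a - score_b) < 1e-9)
```

The check shows the property that makes this scheme attractive for long inputs: the query-key score depends on the relative offset between positions rather than on their absolute values.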

For our first models (gena-lm-bert-base and gena-lm-bigbird-base-sparse) we hold out human chromosomes 22 and Y (CP068256.2 and CP086569.2) as the test dataset for the masked language modeling task. For all other models, we hold out human chromosomes 7 and 10 (CP068271.2 and CP068268.2); these models have the suffix "t2t" in their names. The remaining data was used for training. Human-only models were trained on the pre-processed human T2T v2 genome assembly and its 1000 Genomes SNP augmentations, for a total of ≈ 480 x 10^9 base pairs. Multispecies models were trained on the human-only and multispecies data, for a total of ≈ 1072 x 10^9 base pairs.
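
A chromosome-level hold-out of this kind can be sketched as a split by accession ID. The accessions below are the two listed above for the "t2t" models; the `split_by_accession` helper, the toy sequences, and the `CHR_OTHER` ID are hypothetical and only illustrate the idea.

```python
HELD_OUT = {"CP068271.2", "CP068268.2"}  # chromosomes 7 and 10, as listed above

def split_by_accession(records, held_out=HELD_OUT):
    """Split {accession: sequence} into (train, test) dictionaries."""
    train = {acc: seq for acc, seq in records.items() if acc not in held_out}
    test = {acc: seq for acc, seq in records.items() if acc in held_out}
    return train, test

# Toy records; the sequences are placeholders and "CHR_OTHER" is a made-up ID.
records = {"CP068271.2": "ACGT", "CP068268.2": "TTGA", "CHR_OTHER": "GGCC"}
train, test = split_by_accession(records)
print(sorted(train), sorted(test))
```

Splitting by whole chromosome rather than by random windows keeps overlapping or neighboring sequence from leaking between the training and test sets.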

Pretrained models on downstream tasks

Model	Task	Task seq len	Metric	HF branch name
gena-lm-bert-base-t2t	promoters	300bp	74.56+-0.36 F1	promoters_300_run_1
gena-lm-bert-large-t2t	promoters	300bp	76.44+-0.16 F1	promoters_300_run_1
gena-lm-bert-large-t2t	promoters	2000bp	93.70+-0.44 F1	promoters_2000_run_1
gena-lm-bert-base-t2t	splice site	15000bp	92.63+-0.09 PR AUC	spliceai_run_1
gena-lm-bert-large-t2t	splice site	15000bp	93.59+-0.11 PR AUC	spliceai_run_1

To get a pre-trained model fine-tuned on a downstream task, replace model_name and branch_name with values from the table. The metrics in the table are averaged over multiple runs, so the values for an individual checkpoint may differ from those reported here.

from transformers import AutoTokenizer, AutoModel
# example values taken from the first row of the table
model_name = 'gena-lm-bert-base-t2t'
branch_name = 'promoters_300_run_1'
tokenizer = AutoTokenizer.from_pretrained(f'AIRI-Institute/{model_name}')
model = AutoModel.from_pretrained(f'AIRI-Institute/{model_name}', revision=branch_name, trust_remote_code=True)
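
When switching between several fine-tuned checkpoints, it can help to keep the table in a small lookup structure. The sketch below does that without touching the network; `FINETUNED_BRANCHES` and `from_pretrained_args` are hypothetical names, and the branch names are the ones listed in the table.

```python
# (task, sequence length in bp, model) -> Hugging Face branch name, from the table.
FINETUNED_BRANCHES = {
    ("promoters", 300, "gena-lm-bert-base-t2t"): "promoters_300_run_1",
    ("promoters", 300, "gena-lm-bert-large-t2t"): "promoters_300_run_1",
    ("promoters", 2000, "gena-lm-bert-large-t2t"): "promoters_2000_run_1",
    ("splice site", 15000, "gena-lm-bert-base-t2t"): "spliceai_run_1",
    ("splice site", 15000, "gena-lm-bert-large-t2t"): "spliceai_run_1",
}

def from_pretrained_args(task, seq_len, model_name):
    """Return (repo_id, kwargs) to pass to AutoModel.from_pretrained."""
    branch_name = FINETUNED_BRANCHES[(task, seq_len, model_name)]
    repo_id = f"AIRI-Institute/{model_name}"
    return repo_id, {"revision": branch_name, "trust_remote_code": True}

repo_id, kwargs = from_pretrained_args("promoters", 2000, "gena-lm-bert-large-t2t")
print(repo_id, kwargs)
# Then, with transformers installed: AutoModel.from_pretrained(repo_id, **kwargs)
```

A combination that is not in the table raises a `KeyError`, which is a reasonable signal that no fine-tuned branch is listed for it.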

Examples

How to load pre-trained GENA-LM for masked language modeling

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)

How to load pre-trained GENA-LM to fine-tune it on a classification task

Get the model class from the GENA-LM repository:

git clone https://github.com/AIRI-Institute/GENA_LM.git

from GENA_LM.src.gena_lm.modeling_bert import BertForSequenceClassification
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')
model = BertForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t')

Alternatively, you can download modeling_bert.py and place it next to your code.

Or you can get the model class through HuggingFace AutoModel:

import importlib
from transformers import AutoTokenizer, AutoModel

# loading with trust_remote_code=True fetches the GENA-LM modeling code
model = AutoModel.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', trust_remote_code=True)
gena_module_name = model.__class__.__module__
print(gena_module_name)
# available class names:
# - BertModel, BertForPreTraining, BertForMaskedLM, BertForNextSentencePrediction,
# - BertForSequenceClassification, BertForMultipleChoice, BertForTokenClassification,
# - BertForQuestionAnswering
# check https://huggingface.co/docs/transformers/model_doc/bert
cls = getattr(importlib.import_module(gena_module_name), 'BertForSequenceClassification')
print(cls)
model = cls.from_pretrained('AIRI-Institute/gena-lm-bert-base-t2t', num_labels=2)
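
The lookup above relies on a general Python pattern: import a module by its dotted name, then fetch a class from it by name. The sketch below shows that pattern in isolation, using a standard-library module as a stand-in for the dynamically loaded GENA-LM module; `resolve_class` is a hypothetical helper.

```python
import importlib

def resolve_class(module_name, class_name):
    """Import `module_name` and return the attribute called `class_name`."""
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# Standard-library stand-in for the dynamically loaded GENA-LM module.
cls = resolve_class("collections", "OrderedDict")
print(cls)
instance = cls(a=1)
print(instance)
```

If the class name does not exist in the module, `getattr` raises an `AttributeError`, so a typo in the class name fails early rather than at training time.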

The GENA-LM bigbird-base-t2t model uses the HuggingFace BigBird implementation, so the default classes from the Transformers library can be used:

from transformers import AutoTokenizer, BigBirdForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')
model = BigBirdForSequenceClassification.from_pretrained('AIRI-Institute/gena-lm-bigbird-base-t2t')

Notebooks

Sequence classification with GENA-LM and HuggingFace Transformers
Clustering of DNA embeddings generated with GENA-LM
Explore a GENA-LM model fine-tuned on the Enformer dataset for gene expression

Citation

@article{GENA_LM,
	author = {Veniamin Fishman and Yuri Kuratov and Maxim Petrov and Aleksei Shmelev and Denis Shepelin and Nikolay Chekanov and Olga Kardymon and Mikhail Burtsev},
	title = {GENA-LM: A Family of Open-Source Foundational Models for Long DNA Sequences},
	elocation-id = {2023.06.12.544594},
	year = {2023},
	doi = {10.1101/2023.06.12.544594},
	publisher = {Cold Spring Harbor Laboratory},
	URL = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594},
	eprint = {https://www.biorxiv.org/content/early/2023/06/13/2023.06.12.544594.full.pdf},
	journal = {bioRxiv}
}

Downstream tasks

Downstream tasks for model evaluation cover prediction of promoter and enhancer activity, splice sites, chromatin profiles, and polyadenylation site strength. Check the downstream_tasks folder for the code and data preprocessing scripts we used:

Promoter prediction
Splice site prediction (SpliceAI)
Drosophila enhancer prediction (DeepSTARR)
Chromatin profiling (DeepSEA)
Polyadenylation site prediction (APARENT)
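
Results on these tasks are reported in this README as a score averaged over several runs with a spread, for example an F1 value followed by "+-". The sketch below shows one way such a summary could be computed for a binary task. The labels and predictions are made up, and the use of the sample standard deviation is an assumption, since the README does not say which spread measure was used.

```python
import statistics

def f1_score(y_true, y_pred):
    """F1 for binary labels, where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and per-run predictions, not taken from GENA-LM experiments.
y_true = [1, 1, 0, 0, 1]
run_predictions = [[1, 0, 0, 0, 1], [1, 1, 1, 0, 1]]

scores = [100 * f1_score(y_true, y_pred) for y_pred in run_predictions]
mean, std = statistics.mean(scores), statistics.stdev(scores)
print(f"{mean:.2f}+-{std:.2f} F1")
```

Because a single checkpoint corresponds to one run, its score can sit anywhere within that spread.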

Pre-training data

Download and preprocess data

To download the human genome, run the following script:

./download_data.sh human

For preprocessing, run the following script:

python src/gena_lm/genome_tools/create_corpus.py --input_file data/ncbi_dataset/data/GCA_009914755.4/GCA_009914755.4_T2T-CHM13v2.0_genomic.fna --output_dir data/processed/human/

Installation

Models with sparse attention (gena-lm-bigbird-base-sparse, gena-lm-bigbird-base-sparse-t2t) require FP16 support and DeepSpeed.

APEX for FP16

Install APEX https://github.com/nvidia/apex#quick-start

git clone https://github.com/NVIDIA/apex
cd apex
# most recent commits may fail to build
git checkout 2386a912164b0c5cfcd8be7a2b890fbac5607c82
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

DeepSpeed for Sparse Ops

DeepSpeed must be installed to work with the sparse-attention versions of the language models. DeepSpeed sparse attention supports only GPUs with compute capability >= 7 (V100, T4, A100) and CUDA 10.1, 10.2, 11.0, or 11.1, and it runs only in FP16 mode (as of DeepSpeed 0.6.0).
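
The constraints above can be turned into a quick pre-flight check. The sketch below is a plain-Python illustration of those three conditions only; `sparse_attention_supported` and `SUPPORTED_CUDA` are hypothetical names, and the check is deliberately strict, so it does not model any workarounds for newer CUDA versions.

```python
SUPPORTED_CUDA = {"10.1", "10.2", "11.0", "11.1"}  # as listed above for DeepSpeed 0.6.0

def sparse_attention_supported(compute_capability, cuda_version, fp16=True):
    """Check the constraints stated above; returns (ok, reasons)."""
    reasons = []
    major, _minor = compute_capability
    if major < 7:
        reasons.append("GPU compute capability must be >= 7")
    if cuda_version not in SUPPORTED_CUDA:
        reasons.append(f"CUDA {cuda_version} is not in {sorted(SUPPORTED_CUDA)}")
    if not fp16:
        reasons.append("sparse attention runs only in FP16 mode")
    return (not reasons), reasons

print(sparse_attention_supported((8, 0), "11.1"))  # A100-like setup
print(sparse_attention_supported((6, 1), "11.3"))  # older GPU, unlisted CUDA
```

On a machine with PyTorch installed, `torch.cuda.get_device_capability()` and `torch.version.cuda` can supply the two inputs.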

PyTorch>=1.7.1,<=1.10.1 wheels with CUDA 10.2/11.0/11.1 from pytorch.org can be used. However, using Sparse Ops with CUDA 11.1 PyTorch wheels requires CUDA 11.3/11.4 to be installed on the system. Sparse Ops can also be used with PyTorch==1.12.1 CUDA 11.3 wheels, but the DeepSpeed Sparse Ops tests then need to be modified, because they check for Torch CUDA version <= 11.1. The DeepSpeed fork for Triton 1.1.1 already has updated tests.

Triton 1.0.0 and 1.1.1 require python<=3.9.
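
Before installing, the Python and PyTorch version constraints mentioned above can be checked with a few lines. This is a minimal sketch: the helper names are hypothetical, and `parse_version` handles only plain release strings with an optional local suffix such as `+cu111`, not pre-release tags.

```python
import sys

def parse_version(text):
    """Turn '1.10.1' or '1.10.1+cu111' into a tuple of ints, ignoring any local suffix."""
    core = text.split("+")[0]
    return tuple(int(part) for part in core.split("."))

def python_ok_for_triton(version_info=sys.version_info):
    """Triton 1.0.0 and 1.1.1 need Python <= 3.9, per the note above."""
    return tuple(version_info[:2]) <= (3, 9)

def torch_in_stated_range(torch_version):
    """PyTorch >= 1.7.1 and <= 1.10.1, the wheel range mentioned above."""
    return (1, 7, 1) <= parse_version(torch_version) <= (1, 10, 1)

print(python_ok_for_triton())
print(torch_in_stated_range("1.10.1+cu111"))
print(torch_in_stated_range("1.12.1"))
```

A PyTorch version outside the stated range is not necessarily unusable, as the 1.12.1 case above shows, but it signals that extra steps may be needed.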

pip install triton==1.0.0
DS_BUILD_SPARSE_ATTN=1 pip install deepspeed==0.6.0 --global-option="build_ext" --global-option="-j8" --no-cache

and check the installation with

ds_report

Triton 1.1.1

Triton 1.1.1 brings a 2x speed-up to sparse operations on A100, but DeepSpeed (0.6.5) currently supports only Triton 1.0.0. A DeepSpeed fork with Triton 1.1.1 support can be used when that speed-up is needed:

pip install triton==1.1.1
git clone https://github.com/yurakuratov/DeepSpeed.git
cd DeepSpeed
DS_BUILD_SPARSE_ATTN=1 pip install -e . --global-option="build_ext" --global-option="-j8" --no-cache

and run the sparse ops tests with

cd tests/unit
pytest -v test_sparse_attention.py

Finetuning with lm-experiments-tools

We use the Trainer and multi-GPU training from the lm-experiments-tools repository as the basis for our finetuning scripts. However, you can use HF Transformers Trainer, PyTorch Lightning, or Accelerate and PyTorch with custom training loops instead.

Install lm-experiments-tools according to https://github.com/yurakuratov/t5-experiments#install-only-lm_experiments_tools:

git clone https://github.com/yurakuratov/t5-experiments
cd t5-experiments
pip install -e .
