ProREM 다운로드 - ProREM 소스 코드 다운로드

ProREM

AI 소스 코드

1.0.0

다운로드

검색 강화 돌연변이 마스터리 : 단백질 언어 모델의 제로 샷 예측 강화

소개 (prorem)

? 결과

소식

[2024.10.21]

다운로드

Proteingym A2M 상 동성 서열 (evcouplings) : https://huggingface.co/datasets/tyang816/prorem/blob/main/aa_seq_aln_a2m.tar.gz. 원래 A2M 파일은 Proteingym에서 다운로드됩니다.
Proteingym A3M 상 동성 서열 (colabfold) : https://huggingface.co/datasets/tyang816/prorem/blob/main/aa_seq_aln_a3m.tar.gz
uniref 100 데이터베이스 : https://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/uniref100.fasta.gz

종이 결과

Tab1

? 요구 사항

콘다 환경

Anaconda3 또는 Miniconda3을 설치했는지 확인하십시오.

 conda env create -f environment.yml
conda activate prorem

# We need HMMER and EVCouplings for MSA
# pip install hmmer
# pip install https://github.com/debbiemarkslab/EVcouplings/archive/develop.zip

다른 요구 사항

PLMC를 설치하고 src/single_config_monomer.txt 에서 경로를 변경하십시오

git clone https://github.com/debbiemarkslab/plmc.git
cd plmc
make all-openmp

하드웨어

추론을 직접 사용하려면 RTX 3080과 같은 10g 이상의 그래픽 메모리를 권장합니다.
상 동성 서열 검색의 경우 8 코어 CPU.

? 돌연변이 체에 대한 제로 샷 예측

Proteingym에 대한 평가

처리 된 데이터를 준비하십시오

 cd data/proteingym_v1
wget https://huggingface.co/datasets/tyang816/ProREM/blob/main/aa_seq_aln_a2m.tar.gz
# unzip homology files
tar -xzf aa_seq_aln_a2m.tar.gz
# unzip fasta sequence files
tar -xzf aa_seq.tar.gz
# unzip pdb structure files
tar -xzf pdbs.tar.gz
# unzip structure sequence files
tar -xzf struc_seq.tar.gz
# unzip DMS substitution csv files
tar -xzf substitutions.tar.gz

추론을 시작하십시오

protein_dir=proteingym_v1
python compute_fitness.py 
    --base_dir data/ $protein_dir 
    --out_scores_dir result/ $protein_dir

자신의 데이터 세트

적어도 필요한 것

data/ < your_protein_dir_name >
| ——aa_seq # amino acid sequences
| —— | ——protein1.fasta
| —— | ——protein2.fasta
| ——aa_seq_aln_a2m # homology sequences of EVCouplings
| —— | ——protein1.a2m
| —— | ——protein2.a2m
| ——pdbs # structures
| —— | ——protein1.pdb
| —— | ——protein2.pdb
| ——struc_seq # structure sequences
| —— | ——protein1.fasta
| —— | ——protein2.fasta
| ——substitutions # mutant files
| —— | ——protein1.csv
| —— | ——protein2.csv

Jackhmmer의 상 동성 서열을 검색하십시오

 # step 1: search homology sequences
# your protein name, eg. fluorescent_protein
protein_dir= < your_protein_dir_name >
# your protein path, eg. data/fluorescent_protein/aa_seq/GFP.fasta
query_protein_name= < your_protein_name >
protein_path=data/ $protein_dir /aa_seq/ $query_protein_name .fasta
# your uniprot dataset path
database= < your_path > /uniref100.fasta
evcouplings 
    -P output/ $protein_dir / $query_protein_name 
    -p $query_protein_name 
    -s $protein_path 
    -d $database 
    -b " 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 " 
    -n 5 src/single_config_monomer.txt
#  ? Repeat the searching process until all your proteins are done

# step 2: select a2m file
protein_dir= < your_protein_dir_name >
python src/data/select_msa.py 
    --input_dir output/ $protein_dir 
    --output_dir data/ $protein_dir

단백질에 대한 PDB 파일을 얻으십시오

Alphafold3 서버, Alphafold 데이터베이스, ESMFOL 및 기타 도구를 사용하여 구조를 얻을 수 있습니다.

습식 실험의 경우 가능한 한 가능한 한 고품질 구조를 얻으십시오.

PLM에 대한 구조 시퀀스를 얻으십시오

protein_dir= < your_protein_dir_name >
python src/data/get_struc_seq.py 
    --pdb_dir data/ $protein_dir /pdbs 
    --out_dir data/ $protein_dir /struc_seq

추론을 시작하십시오

protein_dir= < your_protein_dir_name >
python compute_fitness.py 
    --base_dir data/ $protein_dir 
    --out_scores_dir result/ $protein_dir

다른 지시 된 진화 도구

Protssn (Elife 2024) 또는 Prosst (Neurips 2024)를 사용할 수 있습니다.

질문

Q : Prorem의 입력 형식을 Protssn 또는 Prosst로 빠르게 변환하는 방법은 무엇입니까?

A : Prorem과 Protssn 입력 형식의 변환은 script/data_format_convert.sh 를 참조 할 수 있습니다. Prosst의 경우 JSUT는 알파를 0으로 변경합니다.

protein_dir= < your_protein_dir_name >
python compute_fitness.py 
    --base_dir data/ $protein_dir 
    --out_scores_dir result/ $protein_dir 
    --alpha 0 
    --model_out_name ProSST-2048

Q : Protssn, Prosst 및 Prorem의 차이점은 무엇입니까?

A : Protssn은 아미노산 좌표 수준에서 모델링을 사용하고, 로컬 구조에 대한 모델을 검찰하고, Prorem은 MSA 정보를 명시 적으로 소개합니다. 그들은 실제 실험 평가에서 각자 자신의 장점과 단점이 있습니다.

? 소환

코드 나 데이터를 사용한 경우 작업을 인용하십시오.

 @article{tan2024prorem,
  title={Retrieval-Enhanced Mutation Mastery: Augmenting Zero-Shot Prediction of Protein Language Model},
  author={Tan, Yang and Wang, Ruilin and Wu, Banghao and Hong, Liang and Zhou, Bingxin},
  journal={arXiv:2410.21127},
  year={2024}
}

확장하다

추가 정보

버전 1.0.0
유형 AI 소스 코드
업데이트 시간 2025-09-10
크기 220.76MB
출처 Github