DeepRank GNN esm
1.0.0
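Since DeepRank-GNN is no longer under active development, we have migrated the DeepRank-GNN-esm version to a new repository at haddocking/DeepRank-GNN-esm.
For details, refer to our publication "DeepRank-GNN-esm: a graph neural network for scoring protein-protein models using protein language model".
❄️ This repository is now frozen. ❄️
Graph network for protein-protein interfaces including language model features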
Installation with Anaconda
git clone https://github.com/DeepRank/DeepRank-GNN-esm.git
cd DeepRank-GNN-esm
conda env create -f environment-cpu.yml && conda activate deeprank-gnn-esm-cpu-env
or
conda env create -f environment-gpu.yml && conda activate deeprank-gnn-esm-gpu-env
pip install .
pytest tests/
We provide a command-line interface for DeepRank-GNN-esm that can be used to score protein-protein complexes. Its usage is as follows:
usage: deeprank-gnn-esm-predict [-h] pdb_file chain_id_1 chain_id_2
positional arguments:
pdb_file Path to the PDB file.
chain_id_1 First chain ID.
chain_id_2 Second chain ID.
optional arguments:
-h, --help show this help message and exit
For example, to score the 1B6C complex:
# download it
$ wget https://files.rcsb.org/view/1B6C.pdb -q
# make sure the environment is activated
$ conda activate deeprank-gnn-esm-gpu-env
(deeprank-gnn-esm-gpu-env) $ deeprank-gnn-esm-predict 1B6C.pdb A B
2023-06-28 06:08:21,889 predict:64 INFO - Setting up workspace - /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred_A_B
2023-06-28 06:08:21,945 predict:72 INFO - Renumbering PDB file.
2023-06-28 06:08:22,294 predict:104 INFO - Reading sequence of PDB 1B6C.pdb
2023-06-28 06:08:22,423 predict:131 INFO - Generating embedding for protein sequence.
2023-06-28 06:08:22,423 predict:132 INFO - # ###############################################################################
2023-06-28 06:08:32,447 predict:138 INFO - Transferred model to GPU
2023-06-28 06:08:32,450 predict:147 INFO - Read /home/1B6C-gnn_esm_pred_A_B/all.fasta with 2 sequences
2023-06-28 06:08:32,459 predict:157 INFO - Processing 1 of 1 batches (2 sequences)
2023-06-28 06:08:36,462 predict:200 INFO - # ###############################################################################
2023-06-28 06:08:36,470 predict:205 INFO - Generating graph, using 79 processors
Graphs added to the HDF5 file
Embedding added to the /home/1B6C-gnn_esm_pred_A_B/graph.hdf5 file file
2023-06-28 06:09:03,345 predict:220 INFO - Graph file generated: /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred_A_B/graph.hdf5
2023-06-28 06:09:03,345 predict:226 INFO - Predicting fnat of protein complex.
2023-06-28 06:09:03,345 predict:234 INFO - Using device: cuda:0
# ...
2023-06-28 06:09:07,794 predict:280 INFO - Predicted fnat for 1B6C between chainA and chainB: 0.359
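2023-06-28 06:09:07,803 predict:290 INFO - Output written to /home/DeepRank-GNN-esm/1B6C-gnn_esm_pred/GNN_esm_prediction.csv
The output above shows that the predicted fnat for the 1B6C complex between chain A and chain B is 0.359. This value is also written to the GNN_esm_prediction.csv file.
The command above generates a folder in the current working directory with the following contents: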
1B6C-gnn_esm_pred_A_B
├── 1B6C.pdb #input pdb file
├── all.fasta #fasta sequence for the pdb input
├── 1B6C.A.pt #esm-2 embedding for chainA in protein 1B6C
├── 1B6C.B.pt #esm-2 embedding for chainB in protein 1B6C
├── graph.hdf5 #input protein graph in hdf5 format
├── GNN_esm_prediction.hdf5 #prediction output in hdf5 format
└── GNN_esm_prediction.csv #prediction output in csv format
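To score several complexes in one go, the command shown above can be wrapped in a short loop. The following is a minimal sketch, not part of the package: the helper names `build_predict_command`, `expected_output_dir` and `score_all` are hypothetical, and the output folder pattern is only inferred from the log of the 1B6C example. It assumes `deeprank-gnn-esm-predict` is on the PATH, i.e. the conda environment is activated.

```python
import subprocess
from pathlib import Path


def build_predict_command(pdb_file, chain_id_1, chain_id_2):
    """Return the argument list for one deeprank-gnn-esm-predict call."""
    return ["deeprank-gnn-esm-predict", str(pdb_file), chain_id_1, chain_id_2]


def expected_output_dir(pdb_file, chain_id_1, chain_id_2):
    """Guess the output folder name (assumption: pattern taken from the 1B6C log)."""
    stem = Path(pdb_file).stem
    return Path(f"{stem}-gnn_esm_pred_{chain_id_1}_{chain_id_2}")


def score_all(jobs, dry_run=True):
    """Run one prediction per job, or only print the commands if dry_run is True."""
    for pdb_file, chain_id_1, chain_id_2 in jobs:
        command = build_predict_command(pdb_file, chain_id_1, chain_id_2)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True raises CalledProcessError if the tool exits with an error
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical job list: (PDB file, first chain ID, second chain ID)
    score_all([("1B6C.pdb", "A", "B")], dry_run=True)
```

Starting with `dry_run=True` prints the commands without running them, which is a cheap way to check chain IDs before launching the slower embedding and graph generation steps.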
To generate FASTA sequences in bulk, use the script get_fasta.py:
usage: get_fasta.py [-h] pdb_dir output_fasta_name
positional arguments:
pdb_dir Path to the directory containing PDB files
output_fasta_name Name of the combined output FASTA file
options:
-h, --help show this help message and exit
To generate embeddings in bulk from the combined FASTA file, use the script provided in the ESM-2 package:
$ python esm_2_installation_location/scripts/extract.py \
    esm2_t33_650M_UR50D \
    all.fasta \
    tests/data/embedding/1ATN/ \
    --repr_layers 0 32 33 \
    --include mean per_tok
Replace 'esm_2_installation_location' with your ESM-2 installation location, 'all.fasta' with the FASTA file generated above, and 'tests/data/embedding/1ATN/' with the name of the output folder for the ESM embeddings.
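To make the input of the embedding step more concrete, the sketch below shows one way a combined FASTA file could be built from a directory of PDB files. It is not the implementation of get_fasta.py: the function names are hypothetical, and the header format `>PDBNAME.CHAIN` is an assumption inferred from the embedding file names listed earlier (for example 1B6C.A.pt). It reads only ATOM records and writes unknown residue names as "X".

```python
from pathlib import Path

# Standard three-letter to one-letter amino acid codes
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequences(pdb_lines):
    """Return {chain_id: sequence} built from the ATOM records of one PDB file."""
    sequences = {}
    seen = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        res_name = line[17:20]   # columns 18-20: residue name
        chain_id = line[21]      # column 22: chain identifier
        residue_key = (chain_id, line[22:27])  # residue number plus insertion code
        if residue_key in seen:
            continue  # further atoms of a residue that was already counted
        seen.add(residue_key)
        sequences[chain_id] = sequences.get(chain_id, "") + THREE_TO_ONE.get(res_name, "X")
    return sequences


def pdb_dir_to_fasta(pdb_dir, output_fasta_name):
    """Write one FASTA record per chain for every *.pdb file in pdb_dir."""
    records = []
    for pdb_file in sorted(Path(pdb_dir).glob("*.pdb")):
        with open(pdb_file) as handle:
            for chain_id, sequence in chain_sequences(handle).items():
                # Assumed header format: <PDB file stem>.<chain ID>
                records.append(f">{pdb_file.stem}.{chain_id}\n{sequence}\n")
    Path(output_fasta_name).write_text("".join(records))
```

A call such as `pdb_dir_to_fasta("tests/data/pdb/1ATN/", "all.fasta")` would then produce a file that can be passed to extract.py as shown above, provided the header assumption matches what the rest of the pipeline expects.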
from deeprank_gnn.GraphGenMP import GraphHDF5

# Generate residue-level graphs in HDF5 format
pdb_path = "tests/data/pdb/1ATN/"
pssm_path = "tests/data/pssm/1ATN/"
embedding_path = "tests/data/embedding/1ATN/"
nproc = 20
outfile = "1ATN_residue.hdf5"

GraphHDF5(
    pdb_path=pdb_path,
    pssm_path=pssm_path,
    embedding_path=embedding_path,
    graph_type="residue",
    outfile=outfile,
    nproc=nproc,  # number of cores to use
    tmpdir="./tmpdir",
)

import h5py
import random

# Add a continuous target (fnat) and a binary target (binclass) to every graph.
# Random values are used here only as placeholders for real target values.
hdf5_file = h5py.File("1ATN_residue.hdf5", "r+")
for mol in hdf5_file.keys():
    fnat = random.random()
    bin_class = [1 if fnat > 0.3 else 0]
    hdf5_file.create_dataset(f"/{mol}/score/binclass", data=bin_class)
    hdf5_file.create_dataset(f"/{mol}/score/fnat", data=fnat)
hdf5_file.close()

from deeprank_gnn.ginet import GINet
from deeprank_gnn.NeuralNet import NeuralNet

# Use the pre-trained DeepRank-GNN-esm model to predict fnat
database_test = "1ATN_residue.hdf5"
gnn = GINet
target = "fnat"
edge_attr = ["dist"]
threshold = 0.3
pretrained_model = "deeprank-GNN-esm/paper_pretrained_models/scoring_of_docking_models/gnn_esm/treg_yfnat_b64_e20_lr0.001_foldall_esm.pth.tar"
node_feature = ["type", "polarity", "bsa", "charge", "embedding"]
device_name = "cuda:0"
num_workers = 10

model = NeuralNet(
    database_test,
    gnn,
    device_name=device_name,
    edge_feature=edge_attr,
    node_feature=node_feature,
    target=target,
    num_workers=num_workers,
    pretrained_model=pretrained_model,
    threshold=threshold,
)

# Write the predictions to an HDF5 file
model.test(hdf5="tmpdir/GNN_esm_prediction.hdf5")

To make sure the mapping between interface residues and ESM-2 embeddings is correct, ensure that for all chains the residue numbering in the PDB file is sequential and starts at residue "1". We provide a script (scripts/pdb_renumber.py) to perform the renumbering.
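The numbering requirement can be checked before graphs are generated. The sketch below is not the provided renumbering script. It is a minimal illustration, with hypothetical function names, of what "sequential and starting at 1" means for the fixed-column ATOM records of a PDB file. It only handles ATOM records and leaves all other records (HETATM, TER, and so on) untouched.

```python
def residue_numbers(pdb_lines):
    """Return {chain_id: [residue numbers in order of appearance]} for ATOM records."""
    numbers = {}
    for line in pdb_lines:
        if line.startswith("ATOM"):
            chain = numbers.setdefault(line[21], [])  # column 22: chain identifier
            res_seq = int(line[22:26])                # columns 23-26: residue number
            if not chain or chain[-1] != res_seq:
                chain.append(res_seq)
    return numbers


def is_sequential_from_one(pdb_lines):
    """True if every chain is numbered 1, 2, 3, ... without gaps."""
    return all(
        nums == list(range(1, len(nums) + 1))
        for nums in residue_numbers(pdb_lines).values()
    )


def renumber_pdb_lines(pdb_lines):
    """Return new lines in which each chain is renumbered sequentially from 1."""
    renumbered = []
    counters = {}   # chain_id -> last residue number assigned
    last_seen = {}  # chain_id -> original residue number + insertion code
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            renumbered.append(line)
            continue
        chain_id = line[21]
        original = line[22:27]
        if last_seen.get(chain_id) != original:
            counters[chain_id] = counters.get(chain_id, 0) + 1
            last_seen[chain_id] = original
        # The new number is right-justified in columns 23-26; the insertion
        # code in column 27 is blanked because it is no longer needed.
        renumbered.append(f"{line[:22]}{counters[chain_id]:>4} {line[27:]}")
    return renumbered
```

Running `is_sequential_from_one` on the lines of a PDB file before scoring gives a quick indication of whether renumbering is needed at all.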