This project implements a graph-based deep learning pipeline for predicting binding residues in proteins using 3D structural data, DSSP features, and ESM embeddings. The model utilizes a multi-layer Transformer-based Graph Neural Network (GNN) implemented with PyTorch Geometric.
- Parses PDB files and constructs residue-level graphs.
- Uses DSSP structural features and precomputed ESM-2 embeddings.
- Builds edges via 3D distance-based `radius_graph`.
- Implements a multi-head `TransformerConv` GNN for binary classification of binding sites.
- Optimizes using Binary Cross Entropy with class balancing.
- Evaluates performance using Jaccard Distance.
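
To make two of the points above concrete, here is a minimal, dependency-free sketch of radius-based edge construction and of the Jaccard distance between predicted and true binding residues. It is an illustration only: the project itself relies on `radius_graph`, and the helper names (`radius_edges`, `jaccard_distance`), the toy coordinates, and the 8.0 cutoff are assumptions, not values taken from the code.

```python
import math

def radius_edges(coords, r):
    """Return directed edges (i, j), i != j, whose endpoints are at most r apart."""
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j and math.dist(a, b) <= r:
                edges.append((i, j))
    return edges

def jaccard_distance(pred, true):
    """1 - |intersection| / |union| for two collections of residue IDs."""
    pred, true = set(pred), set(true)
    union = pred | true
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(pred & true) / len(union)

# Toy residue coordinates in angstroms (made-up values, one point per residue).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]
edges = radius_edges(coords, r=8.0)   # hypothetical cutoff
score = jaccard_distance([10, 11, 12], [11, 12, 13])
```

In this toy case the first three residues are connected to one another and the distant fourth residue stays isolated. A lower Jaccard distance means the predicted residue set overlaps more with the annotated one.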
- Python 3.8+
- PyTorch & PyTorch Geometric
- Biopython
- scikit-learn
- tqdm
- esm (Facebook's ESM model)
- Precomputed ESM embeddings and DSSP features
- `train/` directory containing PDB files named like `ID_protein.pdb`.
- `train.csv` file with columns:
  - `id`: Sample ID
  - `resid`: Space-separated list of binding residue IDs
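
As a rough sketch of how the label file described above could be read, the snippet below parses the `id` and `resid` columns and turns them into per-residue 0/1 labels. The function names, the comma delimiter, and the sample IDs (`1abc`, `2xyz`) are assumptions for illustration; residue IDs are kept as strings so that nothing is assumed about their numbering scheme.

```python
import csv
import io

def load_binding_sites(handle):
    """Map each sample ID to the set of its binding residue IDs (as strings)."""
    sites = {}
    for row in csv.DictReader(handle):
        # `resid` is a space-separated list; an empty cell yields an empty set.
        sites[row["id"]] = set((row["resid"] or "").split())
    return sites

def residue_labels(residue_ids, binding):
    """Return 1 for residues listed as binding and 0 otherwise, in input order."""
    return [1 if str(rid) in binding else 0 for rid in residue_ids]

# In-memory stand-in for train.csv with made-up rows.
example_csv = io.StringIO("id,resid\n1abc,12 15 48\n2xyz,7\n")
sites = load_binding_sites(example_csv)
labels = residue_labels([11, 12, 13, 14, 15], sites["1abc"])
```

With a real file, the `io.StringIO` object would be replaced by `open("train.csv", newline="")`.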
Before training, some preprocessing steps must be completed:
- DSSP Feature Extraction: Secondary structure and accessibility features must be computed from PDB files using DSSP.
- ESM Embedding Generation: Residue-level embeddings must be generated using a pretrained ESM-2 model and saved as `.pkl` files.
Scripts for both preprocessing steps are included in the repository.
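
For orientation, the following sketch shows one way per-protein embeddings could be written to and read back from pickle files that follow the `esm_train/{ID}_protein.pkl` naming. It is not the repository's preprocessing script: the helper names are hypothetical, and the nested list stands in for whatever object (tensor, array, or dictionary) the real scripts store.

```python
import pickle
import tempfile
from pathlib import Path

def esm_path(root, sample_id):
    """Build the pickle path for one sample under the esm_train/ directory."""
    return Path(root) / "esm_train" / f"{sample_id}_protein.pkl"

def save_embedding(root, sample_id, embedding):
    path = esm_path(root, sample_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(embedding, fh)
    return path

def load_embedding(root, sample_id):
    with esm_path(root, sample_id).open("rb") as fh:
        return pickle.load(fh)

# Round trip in a temporary directory with a placeholder embedding:
# three residues, two features each (real ESM-2 vectors are much wider).
with tempfile.TemporaryDirectory() as root:
    fake = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    path = save_embedding(root, "1abc", fake)
    restored = load_embedding(root, "1abc")
    saved_name = path.name
```

Only unpickle files you generated yourself, since `pickle.load` can execute arbitrary code from untrusted input.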
Run the training script with:
`python challenge_1.py`

Trained models will be saved as `modelv1.pth`.
- ESM features must be precomputed and stored in `esm_train/{ID}_protein.pkl`.
- DSSP features are loaded from a `merged` pickle file.
- The code automatically caches and reuses class weights for balanced training.
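
The class-weight caching mentioned above can be pictured with the sketch below. It assumes the weight is the negative-to-positive ratio and that it scales the positive term of the binary cross entropy, similar in spirit to the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`. The actual weighting scheme, cache format, and file name used by the training script may differ; `class_weights.json` and all function names here are hypothetical.

```python
import json
import math
import tempfile
from pathlib import Path

def positive_weight(labels):
    """Ratio of negatives to positives; falls back to 1.0 if there are no positives."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return n_neg / n_pos if n_pos else 1.0

def cached_positive_weight(labels, cache_file):
    """Reuse the weight stored in cache_file if present, else compute and store it."""
    cache_file = Path(cache_file)
    if cache_file.exists():
        return json.loads(cache_file.read_text())["pos_weight"]
    weight = positive_weight(labels)
    cache_file.write_text(json.dumps({"pos_weight": weight}))
    return weight

def weighted_bce(probs, labels, pos_weight):
    """Mean binary cross entropy with the positive term scaled by pos_weight."""
    eps = 1e-7
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clamp to avoid log(0)
        total -= pos_weight * y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)

labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]  # toy labels: 8 non-binding, 2 binding
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "class_weights.json"
    first = cached_positive_weight(labels, cache)      # computed and written
    second = cached_positive_weight([1, 1, 1], cache)  # read back from the cache
loss = weighted_bce([0.5] * 10, labels, first)
```

Because the second call returns the cached value regardless of its input, a stale cache file should be deleted whenever the training labels change.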