BinaryChrisEntropy/pocket_binding_prediction_challenge

Protein Binding Site Prediction using Graph Transformers

This project implements a graph-based deep learning pipeline that predicts binding residues in proteins from 3D structural data, DSSP features, and ESM-2 embeddings. The model is a multi-layer Transformer-based Graph Neural Network (GNN) implemented with PyTorch Geometric.

Features

  • Parses PDB files and constructs residue-level graphs.
  • Uses DSSP structural features and precomputed ESM-2 embeddings.
  • Builds edges via 3D distance-based radius_graph.
  • Implements a multi-head TransformerConv GNN for binary classification of binding sites.
  • Trains with a binary cross-entropy loss, weighted to offset class imbalance.
  • Evaluates predictions using the Jaccard distance.
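
To illustrate the distance-based edge construction listed above, here is a minimal, dependency-free sketch. It is not the repository's code: the function name `radius_graph_edges`, the example coordinates, and the 8.0 cutoff are assumptions chosen for illustration. The project itself relies on PyTorch Geometric's radius_graph, which additionally caps the number of neighbors per node (max_num_neighbors); this sketch does not.

```python
import math

def radius_graph_edges(coords, r):
    """Return directed edges (i, j), i != j, for residue pairs at most r apart."""
    edges = []
    for i, a in enumerate(coords):
        for j, b in enumerate(coords):
            if i != j and math.dist(a, b) <= r:
                edges.append((i, j))
    return edges

# Hypothetical C-alpha coordinates (in angstroms) for four residues.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (30.0, 0.0, 0.0)]

# Assumed cutoff; the radius actually used for training may differ.
edges = radius_graph_edges(coords, r=8.0)
print(edges)
```

With this cutoff the first three residues are fully connected to each other, while the fourth residue, 30 angstroms away, has no edges. The all-pairs loop is quadratic in the number of residues, which is fine for a sketch but slower than a spatial index on large structures.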

Requirements

  • Python 3.8+
  • PyTorch & PyTorch Geometric
  • Biopython
  • scikit-learn
  • tqdm
  • esm (Facebook's ESM model)
  • Precomputed ESM embeddings and DSSP features

Data Format

  • train/ directory containing PDB files named like ID_protein.pdb.

  • train.csv file with columns:

    • id: Sample ID
    • resid: Space-separated list of binding residue IDs
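
As a rough sketch of how the label file described above could be read, the snippet below parses the id and resid columns into sets of residue IDs and compares a predicted set against the true set with the Jaccard distance. It assumes a comma-separated file with a header row; the helper names `load_binding_sites` and `jaccard_distance` and the example rows are hypothetical, and the repository may compute the metric differently (for example, on per-residue binary vectors).

```python
import csv
import io

def load_binding_sites(csv_file):
    """Map each sample ID to the set of binding residue IDs in its resid column."""
    sites = {}
    for row in csv.DictReader(csv_file):
        sites[row["id"]] = set(row["resid"].split())
    return sites

def jaccard_distance(predicted, actual):
    """1 - |intersection| / |union|; taken as 0.0 when both sets are empty."""
    union = predicted | actual
    if not union:
        return 0.0
    return 1.0 - len(predicted & actual) / len(union)

# Hypothetical rows standing in for train.csv; IDs are kept as strings.
example = io.StringIO("id,resid\n1abc,12 15 47\n2xyz,3 4\n")
sites = load_binding_sites(example)

# Two of the three predicted residues are correct; the union has four members.
distance = jaccard_distance({"12", "15", "50"}, sites["1abc"])
print(distance)
```

Residue IDs are kept as strings here because PDB residue identifiers can carry insertion codes; converting them to integers is a separate decision that depends on the data.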

Preprocessing (Required)

Before training, some preprocessing steps must be completed:

  1. DSSP Feature Extraction
    Secondary structure and accessibility features must be computed from PDB files using DSSP.

  2. ESM Embedding Generation
    Residue-level embeddings must be generated using a pretrained ESM-2 model and saved as .pkl files.

Scripts for both preprocessing steps are included in the repository.
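
For orientation, the sketch below shows one possible on-disk layout for the per-protein embedding files, using the esm_train/{ID}_protein.pkl naming given in the Notes. The contents of each pickle (here a plain list of per-residue vectors) and the helper names `save_embedding` and `load_embedding` are assumptions; the preprocessing scripts in the repository define the actual format.

```python
import pickle
import tempfile
from pathlib import Path

def save_embedding(out_dir, sample_id, embedding):
    """Write one protein's per-residue embedding to {out_dir}/{ID}_protein.pkl."""
    path = Path(out_dir) / f"{sample_id}_protein.pkl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as fh:
        pickle.dump(embedding, fh)
    return path

def load_embedding(out_dir, sample_id):
    """Read back the embedding saved for the given sample ID."""
    with (Path(out_dir) / f"{sample_id}_protein.pkl").open("rb") as fh:
        return pickle.load(fh)

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "esm_train"
    # Placeholder values: 3 residues x 2 dimensions, not real ESM-2 output.
    fake = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
    path = save_embedding(out_dir, "1abc", fake)
    restored = load_embedding(out_dir, "1abc")
    saved_name = path.name
    print(saved_name, len(restored))
```

Keeping one file per protein lets the training script load embeddings lazily by sample ID instead of holding the whole dataset in memory.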

Usage

Run the training script with:

python challenge_1.py

The trained model is saved as modelv1.pth.

Notes

  • ESM features must be precomputed and stored in esm_train/{ID}_protein.pkl.
  • DSSP features are loaded from a merged pickle file.
  • The code automatically caches and reuses class weights for balanced training.
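
The class-weight caching mentioned above can be pictured with the following sketch. It assumes the weight is the ratio of non-binding to binding residues and that it is cached in a JSON file; the function name `get_pos_weight`, the file name class_weights.json, and the weighting formula are all assumptions, and the repository may use a different scheme or cache format.

```python
import json
import tempfile
from pathlib import Path

def get_pos_weight(labels, cache_path):
    """Return n_negative / n_positive, reusing the cached value when present."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text())["pos_weight"]
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_weight = n_neg / max(n_pos, 1)  # guard against having no positives
    cache_path.write_text(json.dumps({"pos_weight": pos_weight}))
    return pos_weight

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "class_weights.json"  # hypothetical cache file name
    labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]   # 2 binding, 8 non-binding residues
    first = get_pos_weight(labels, cache)      # computed and written to cache
    second = get_pos_weight([], cache)         # read from cache, labels ignored
    print(first, second)
```

In PyTorch, a ratio like this is commonly passed as the pos_weight argument of torch.nn.BCEWithLogitsLoss so that the rarer binding class contributes more to the loss. Note that a cache of this kind goes stale if the training set changes, so the file should be deleted in that case.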
