Location: Schwede Group, Biozentrum, Basel, Switzerland
In recent years, the field of protein structure prediction has seen a major breakthrough with the release of AlphaFold2 [1] and similar deep-learning methods [2]. Beyond accurate structure prediction, these methods also capture flexible or disordered protein regions through their confidence scores, such as pLDDT and PAE. This is possible because the models were trained on multiple conformations and unresolved residues of protein domains and can infer flexibility from the learned underlying patterns [3].
The next important step in computational structural biology is to use deep-learning methods to predict structures of protein-ligand complexes. These methods should also be able to assess the confidence of predicted ligand poses and infer binding affinity, much as AlphaFold2 assesses protein flexibility. Several challenges stand in the way, starting with how small-molecule structural data are acquired and deposited in the Protein Data Bank (PDB). These data are often collected under cryogenic conditions, where molecular flexibility can be reduced [4]. Moreover, small molecules are often deposited as a single pose that best fits the electron density or cryo-EM map, without fully accounting for the ligand environment, low-resolution parts of the density map, and other possible conformations of the ligand [5]. As a result, deep-learning models are trained on a limited set of static ligand poses and do not take all possible ligand conformations into account, which holds back progress in complex prediction. The same issue hinders the scoring of predicted complexes, as current scores consider neither the quality of the "ground truth" ligand pose nor its other possible conformations.
The project aims to address three challenges in the protein-ligand complex prediction field by (1) mining the underexplored molecular poses in experimental densities of deposited complex structures, (2) investigating how the variability in these poses correlates with measured binding affinity, and (3) incorporating atom-level variability into the assessment of protein-ligand complex modeling accuracy.
The candidate will use a density-based biased docking method under development by our collaborators to augment the structural data of protein-small molecule complexes, accounting for the molecular flexibility of both the small molecule and the pocket residues. This builds on previous results [4,5,6] demonstrating both the abundance and the utility of alternative binding poses in experimental structure data. Initially, the candidate will focus on a small subset (~900 complexes) with known binding affinities for their ligands and structural data acquired under non-cryogenic conditions. The candidate will generate an ensemble of ligand poses and binding pocket residue conformations that fit the experimental data. This will be followed by an investigation of the correlation between binding affinity and the flexibility of both the ligand and the binding pocket.
The data augmentation pipeline from this experiment will then be applied to the larger dataset of ~400k protein-small molecule complexes, PLINDER [7], which is used to train deep-learning models. The candidate will then explore incorporating the atom-level variation seen in the generated ensembles into the LDDT-PLI algorithm used for assessing the accuracy of predicted protein-ligand complexes.
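To make the third goal concrete, a variability-aware contact score in the spirit of lDDT-PLI can be sketched as follows. This is an illustrative toy, not the actual LDDT-PLI algorithm: the tolerance thresholds follow the standard lDDT values, but the per-contact weighting `w = 1/(1 + variability)` and the function name are assumptions introduced here for illustration.

```python
import numpy as np

def weighted_lddt_pli(model_d, ref_d, atom_var, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Illustrative variability-weighted contact score (NOT the real LDDT-PLI).

    model_d, ref_d : ligand-protein contact distances (in A) in the model
        and in the reference pose, one value per contact.
    atom_var : per-contact positional variability, e.g. the spread of the
        ligand atom across the experimental ensemble (in A).
    """
    model_d = np.asarray(model_d, dtype=float)
    ref_d = np.asarray(ref_d, dtype=float)
    dev = np.abs(model_d - ref_d)
    # Classic lDDT idea: fraction of tolerance thresholds satisfied per contact.
    kept = np.mean([dev <= t for t in thresholds], axis=0)
    # Hypothetical weighting: contacts on highly variable atoms count less,
    # so the model is penalised less where the experiment itself is ambiguous.
    w = 1.0 / (1.0 + np.asarray(atom_var, dtype=float))
    return float(np.sum(w * kept) / np.sum(w))
```

A perfect model scores 1.0 regardless of the weights; deviations on low-variability atoms pull the score down more than equal deviations on high-variability atoms.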
- Develop a workflow to generate alternative poses of a small molecule and binding pocket residues based on experimental data in cryo-EM and X-ray crystallography maps.
- Investigate the level of correlation between binding affinity and the molecular flexibility as seen in the generated poses.
- Apply the workflow to the PLINDER dataset, to expand trainable datasets for deep learning models.
- Incorporate atom-level flexibility based on experimental density into LDDT-PLI for assessing the accuracy of predicted protein-small molecule complexes.
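The correlation analysis in the second objective could start from per-complex summaries. A minimal sketch, assuming each complex is reduced to one flexibility number (e.g. the spread of its generated ligand poses) and one affinity value (pKd); the numbers below are made up for illustration only.

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation (minimal sketch, no tie correction)."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float(rx @ ry / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical per-complex summaries: ensemble spread of the ligand
# (e.g. mean pairwise heavy-atom RMSD over generated poses, in A) vs. pKd.
pose_spread = np.array([0.4, 0.7, 1.1, 1.6, 2.3])
pkd         = np.array([8.9, 8.1, 7.2, 6.0, 5.1])
rho = spearman_rho(pose_spread, pkd)   # negative if tighter ensembles bind stronger
```

A rank correlation is a reasonable first choice here because it makes no assumption about the functional form of the flexibility-affinity relationship.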
All resources related to the subject and the final presentation (e.g., report, slides, etc.) are located in:
~/data_augmentation/subject
This workflow is designed to run on a SLURM array job within the SciCORE environment.
To use it with a different dataset, follow the instructions below.
A compatible Conda environment is required and can be found at:
~/data_augmentation/environment.yml
This environment depends on external tools such as AutoDock-GPU, Meeko, CryoXKit, and OpenBabel (details in ~/data_augmentation/environment.txt).
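Assuming a standard Conda setup, the environment can be created from the spec above; the environment name used below is an assumption (check the `name:` field at the top of environment.yml).

```shell
# Create the Conda environment from the provided spec
conda env create -f ~/data_augmentation/environment.yml

# Activate it (name is an assumption; verify it in environment.yml)
conda activate data_augmentation
```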
- To prepare your dataset, you need a CSV file containing information about each system (e.g., system_id, pdb_id, ccd_code). You can refer to the following examples:
  - Test dataset: ~/data_augmentation/raw-data/test_dataset/low_occupancy_systems.csv
  - Full dataset: ~/data_augmentation/raw-data/full_dataset/propose_dataset_v1.csv
To generate the corresponding complex files, run the dataset preparation script (~/data_augmentation/src/py/prepare_dataset.py) together with the EDM downloader and assignment scripts (~/data_augmentation/src/py/download_edm.py and ~/data_augmentation/src/py/attribute_edm.py).
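Before running the preparation scripts, it can help to sanity-check the input CSV. A minimal sketch using only the columns this document names (system_id, pdb_id, ccd_code); `check_dataset_csv` is a hypothetical helper, not part of the repository.

```python
import csv

# Columns named in this document; the datasets may contain additional ones.
REQUIRED = {"system_id", "pdb_id", "ccd_code"}

def check_dataset_csv(path):
    """Return the number of data rows; raise if a required column is missing."""
    with open(path, newline="") as fh:
        reader = csv.DictReader(fh)
        missing = REQUIRED - set(reader.fieldnames or [])
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        return sum(1 for _ in reader)
```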
- Use a SLURM script to process the dataset (~/src/sh/workflow.sh).
- Generate the external data of the X-ray references in ~/ext_data/ (e.g. scoring energy, density-fitness values) using ~/src/py/altloc_density_fit.py and ~/src/py/altloc_rescoring.py (plus ~/src/py/ediascorer.py, optional).
- Concatenate all the .tsv files using ~/src/sh/final_tab.sh and add the reference values to a modified table using ~/src/sh/mod_final_tab.sh.
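A minimal SLURM array wrapper in the spirit of the steps above might look as follows. The resource limits, the per-task indexing via the dataset CSV, and the `--system` flag are all assumptions; the actual entry point is ~/src/sh/workflow.sh.

```shell
#!/bin/bash
#SBATCH --job-name=data_aug
#SBATCH --array=0-899          # one task per system in the test subset
#SBATCH --time=02:00:00        # illustrative limits; tune for SciCORE
#SBATCH --mem=8G

# Pick the system for this array task from the dataset CSV.
# Row 1 is the header, so task 0 maps to row 2.
CSV=~/data_augmentation/raw-data/test_dataset/low_occupancy_systems.csv
SYSTEM_ID=$(awk -F, -v n=$((SLURM_ARRAY_TASK_ID + 2)) 'NR==n {print $1}' "$CSV")

# The --system flag is a hypothetical interface to the workflow script.
python ~/src/py/workflow_withoutflexres.py --system "$SYSTEM_ID"
```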
This directory contains .cmd files for various commands related to different steps in the workflow. These files are intended to be used as command-line scripts and can be executed in your SLURM job environment.
Examples:
- altloc_fullset_rescoring.cmd: Rescoring for the full set of systems (~/data_augmentation/raw-data/full_dataset/data) with alternate locations for X-ray structures.
- altloc_subset_rescoring.cmd: A variation of altloc_fullset_rescoring.cmd for the test dataset (~/data_augmentation/raw-data/test_dataset/data).
- commands_altloc_density_fit.cmd: Commands for fitting densities in systems with X-ray reference altlocs.
- commands_ediascorer.cmd: Executes the ediascorer (proprietary, from Hamburg University) to evaluate docking poses.
- commands_flexres.cmd: Commands for running workflows with flexible residue selections (FlexRes).
- commands_full_dataset.cmd: Runs the complete workflow for the full dataset of systems (~/data_augmentation/raw-data/full_dataset/data).
- commands_fullset_altloc_density_fit.cmd: Runs density fitting for the full set (~/data_augmentation/raw-data/full_dataset/data) of systems with alternate locations.
- commands_mod_final_tab.cmd: Generates a modified final table with reference values for the closest X-ray altloc poses.
- commands_ost.cmd: Executes the OST workflow for system preparation.
- commands_withoutflexres.cmd: Runs a workflow without flexible residue selection.
This directory contains Jupyter notebooks for various analysis tasks and dataset preparation steps. These are used for an interactive and visual approach to data analysis.
Examples:
- global_analysis.ipynb: Performs global analysis of the dataset (template).
- ost.ipynb: A test notebook for performing OST-specific analysis.
- prepare_dataset.ipynb: Prepares the dataset from raw input data (CSV files) into the required format for the workflow (prefer the .py version).
- test.ipynb: A test notebook for running quick experiments on a subset of the data.
This folder includes Python scripts that implement the core functionality of the workflow. Most of the processing, including data augmentation and scoring, is handled by these scripts.
Examples:
- altloc_density_fit.py: A script for fitting densities in systems with X-ray altlocs.
- altloc_rescoring.py: Applies rescoring to systems with X-ray altlocs.
- attribute_edm.py: Attributes EDM files to each complex.
- autodock_map.py: Handles AutoDock-related map generation for docking simulations (from https://github.com/forlilab/waterkit.git).
- commands_gen.py: Likely generates or prepares command files or workflows for execution (into ~/src/commands).
- deprecated.py: Contains deprecated or obsolete code that is no longer used in the current workflow.
- download_edm.py: Downloads EDM-related data necessary for scoring.
- ediascorer.py: Executes the ediascorer function (proprietary, from Hamburg University) to evaluate docking poses.
- final_tab.py: Generates the final tabular summary of the run.
- mod_final_tab.py: A modified version of the table generated with final_tab.py that includes reference values for the closest X-ray altloc poses.
- ost.py: Implements the OST function for system backbone comparison (bb_local_lddt) or preparation; deprecated (template).
- prepare_dataset.py: Prepares the dataset for processing by transforming CSV files into system folders for the workflow.
- utils.py: Contains utility functions used across various scripts (from https://github.com/forlilab/waterkit.git).
- workflow_flexres.py: A workflow for running simulations with flexible residues; deprecated (template).
- workflow_withoutflexres.py: Similar to the above but without flexible residues, running a more traditional workflow by processing each receptor's altlocs.
- workflow_withoutflexres_workinprogress.py: A work-in-progress version of the workflow without flexible residues; may still be under development.
Shell scripts provided to automate various tasks in the workflow. These scripts can be executed directly from the command line or submitted via SLURM.
Examples:
- altloc_density_fit.sh: Runs the density fitting process for systems with X-ray alternate locations.
- altloc_fullset_rescoring.sh: Executes the rescoring process for the full set (~/data_augmentation/raw-data/full_dataset/data) of systems with alternate locations.
- altloc_subset_rescoring.sh: Executes rescoring for a subset (~/data_augmentation/raw-data/test_dataset/data) of systems with alternate locations.
- ediascorer.sh: Runs the ediascorer script to apply the scoring function to the dataset.
- final_tab.sh: Generates the final tabular summary (likely using final_tab.py).
- fullset_workflow.sh: Executes the entire workflow for the full dataset.
- mod_final_tab.sh: Runs the modified version of the final tab.
- ost.sh: Executes the OST preparation or scoring workflow from the shell.
- prepare_fullset.sh: Prepares the full dataset (~/data_augmentation/raw-data/full_dataset/data) for processing, possibly running the preparation pipeline on all systems.
- prepare_testset.sh: Prepares a test set (~/data_augmentation/raw-data/test_dataset/data) for running experiments or debugging (likely a subset of the full dataset).
- workflow.sh: A wrapper script that launches the entire workflow (with or without flexres), coordinating all necessary steps to run the analysis from start to finish.
1. Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
2. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
3. Alderson, T. R., Pritišanac, I., Kolarić, Đ., Moses, A. M. & Forman-Kay, J. D. Systematic identification of conditionally folded intrinsically disordered regions by AlphaFold2. Proceedings of the National Academy of Sciences 120, e2304302120 (2023).
4. Fraser, J. S. et al. Accessing protein conformational ensembles using room-temperature X-ray crystallography. Proceedings of the National Academy of Sciences 108, 16247–16252 (2011).
5. Beshnova, D. A., Pereira, J. & Lamzin, V. S. Estimation of the protein–ligand interaction energy for model building and validation. Acta Crystallographica Section D: Structural Biology 73, 195–202 (2017).
6. Flowers, J. et al. Expanding Automated Multiconformer Ligand Modeling to Macrocycles and Fragments. bioRxiv 2024.09.20.613996 (2024) doi:10.1101/2024.09.20.613996.
7. Durairaj, J. et al. PLINDER: The protein–ligand interactions dataset and evaluation resource. bioRxiv 2024.07.17.603955 (2024) doi:10.1101/2024.07.17.603955.
