From 11823a8b249783693e32dc1e51d5a2c7736e09cc Mon Sep 17 00:00:00 2001 From: Tristan Bepler Date: Mon, 19 Jan 2026 16:00:59 +0800 Subject: [PATCH 01/18] initial commit of wip notebook adapted from Tim Truong --- ...Nanobody_binder_design_with_BoltzGen.ipynb | 1226 +++++++++++++++++ 1 file changed, 1226 insertions(+) create mode 100644 source/walkthroughs/Nanobody_binder_design_with_BoltzGen.ipynb diff --git a/source/walkthroughs/Nanobody_binder_design_with_BoltzGen.ipynb b/source/walkthroughs/Nanobody_binder_design_with_BoltzGen.ipynb new file mode 100644 index 0000000..0d2b94e --- /dev/null +++ b/source/walkthroughs/Nanobody_binder_design_with_BoltzGen.ipynb @@ -0,0 +1,1226 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "intro", + "metadata": {}, + "source": [ + "# Nanobody binder design with BoltzGen\n", + "*Designing de novo nanobody binders to GMP-AMP phosphodiesterase of Penguinpox using BoltzGen and PoET-2*\n", + "\n", + "In this tutorial, we'll demonstrate how to use the OpenProtein.AI Python client to\n", + "design a nanobody that binds to a target protein. We refer to the designed protein as the\n", + "**binder** and the protein being bound as the **target**.\n", + "\n", + "Unlike general protein binder design, for nanobody design we utilize a **scaffold-based approach**. We will start with an existing nanobody framework and essentially \"graft\" new Complementarity-Determining Regions (CDRs) onto it. This ensures that our designed binder retains the stable, expressible framework regions of a natural nanobody while tailoring the binding loops (CDRs) to our specific target. The design process consists of four main steps:\n", + "\n", + "1. **Query Specification**: Specify the design problem as a \"query\", including\n", + " 1. the target protein (cGAMP PDE)\n", + " 2. the nanobody scaffold (framework regions)\n", + " 3. the lengths of the CDR loops to be designed\n", + "\n", + "2. 
**Structure Generation**: Generate plausible structures for the nanobody binder CDRs using\n", + " **BoltzGen**, a generative model capable of designing backbone structures using scaffolds.\n", + "\n", + "3. **Sequence Design**: Design sequences for the generated CDRs using **PoET-2**,\n", + " a foundation protein language model that leverages evolutionary context. We will use a \"prompt\" of natural nanobody homologs to guide PoET-2 toward generating natural-like sequences.\n", + "\n", + "4. **In Silico Validation**: Validate the designs by predicting their structures with **Boltz-2**\n", + " and computing metrics to select the best candidates for experimental testing.\n", + "\n", + "# Prerequisites\n", + "\n", + "To run this tutorial, you'll need a Python environment containing the following\n", + "packages:\n", + "\n", + "- `openprotein_python>=0.10`\n", + "- `molviewspec` (for structure visualization)\n", + "\n", + "See the Python client [installation instructions](https://docs.openprotein.ai/python-api/installation.html) for more info.\n", + "\n", + "Additionally, you should have your [credentials set up](https://docs.openprotein.ai/python-api/quickstart.html) in `~/.openprotein/config.toml` to\n", + "authenticate with the OpenProtein.AI API."
+ ] + }, + { + "cell_type": "markdown", + "id": "imports", + "metadata": {}, + "source": [ + "## Import necessary packages" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "imports_code", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/tbepler/miniconda3/envs/openprotein/lib/python3.12/site-packages/pydantic/_internal/_fields.py:161: UserWarning: Field \"model_index\" has conflict with protected namespace \"model_\".\n", + "\n", + "You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n", + " warnings.warn(\n", + "/Users/tbepler/miniconda3/envs/openprotein/lib/python3.12/site-packages/pydantic/_internal/_fields.py:161: UserWarning: Field \"model_id\" has conflict with protected namespace \"model_\".\n", + "\n", + "You may be able to resolve this warning by setting `model_config['protected_namespaces'] = ()`.\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "import io\n", + "import requests\n", + "from dataclasses import dataclass\n", + "\n", + "import numpy as np\n", + "import numpy.typing as npt\n", + "import pandas as pd\n", + "from scipy.spatial.transform import Rotation\n", + "\n", + "from tqdm import tqdm\n", + "\n", + "import molviewspec as mvs\n", + "from molviewspec.nodes import RepresentationTypeT\n", + "\n", + "import openprotein\n", + "from openprotein.fasta import parse_stream\n", + "from openprotein.molecules import Protein, Complex, Structure" + ] + }, + { + "cell_type": "markdown", + "id": "connect", + "metadata": {}, + "source": [ + "## Connect to OpenProtein.AI" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "connect_code", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Successfully connected to the OpenProtein.AI API!\n" + ] + } + ], + "source": [ + "session = openprotein.connect()\n", + "print(\"✅ Successfully connected to the OpenProtein.AI 
API!\")" + ] + }, + { + "cell_type": "markdown", + "id": "step1", + "metadata": {}, + "source": [ + "# Step 1: Query Specification\n", + "*Specify the nanobody binder design problem*\n", + "\n", + "In this tutorial, we will design nanobody binders against **GMP-AMP phosphodiesterase of Penguinpox (cGAMP PDE)**. This design problem is adapted from the BoltzGen study ([Stark et al., 2025](https://www.biorxiv.org/content/10.1101/2025.11.20.689494v1)).\n", + "\n", + "We will design a **nanobody binder**, which is a single-domain antibody fragment derived from heavy-chain-only antibodies found in camelids. To restrict the BoltzGen structure generator to nanobody binders specifically, we will use a **scaffold**. The scaffold defines an overall framework structure and specific designable regions for binding to the target. This means we will keep the framework regions of an existing, well-behaved nanobody constant, while redesigning the Complementarity-Determining Regions (CDRs) to bind our specific target. Later, we will also use a PoET-2 **prompt context** composed of camelid repertoire sequences to steer our designs toward the natural sequence distribution of nanobodies.\n", + "\n", + "The **scaffold** and, later, the **prompt context** also provide a convenient way to target other binder formats, such as scFvs or Fabs." + ] + }, + { + "cell_type": "markdown", + "id": "step1_1", + "metadata": {}, + "source": [ + "## Step 1.1: Define and visualize the target\n", + "\n", + "We fetch the structure of the GMP-AMP phosphodiesterase of Penguinpox (cGAMP PDE) target from the PDB."
+ ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "5dc31f68", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 SEQUENCE SATTIQKELENIVVKERQNKKDTILMGLKVEVPWNYCDWASISFYDVRLESGILDMESIA\n", + "0 STRUCTURE_MASK ^^ \n", + "\n", + "60 SEQUENCE VKYMTGCDIPPHVTLGITNKDQEANFQRFKELTRNIDLTSLSFTCKEVICFPQSRASKEL\n", + "60 STRUCTURE_MASK \n", + "\n", + "120 SEQUENCE GANGRAVVMKLEASDDVKALRNVLFNVVPTPRDIFGPVLSDPVWCPHVTIGYVRADDEDN\n", + "120 STRUCTURE_MASK \n", + "\n", + "180 SEQUENCE KNSFIELAEAFRGSKIKVIGWCE\n", + "180 STRUCTURE_MASK \n" + ] + } + ], + "source": [ + "structure = Structure.from_pdb_id('9bkq')\n", + "first_complex = structure[0]\n", + "target = first_complex.get_protein(chain_id=\"B\")\n", + "print(target.formatted(include=(\"sequence\", \"structure_mask\")))" + ] + }, + { + "cell_type": "markdown", + "id": "6dd9c290", + "metadata": {}, + "source": [ + "We'll also drop the first two residues which are missing from the structure." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "eecc8879", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 SEQUENCE TTIQKELENIVVKERQNKKDTILMGLKVEVPWNYCDWASISFYDVRLESGILDMESIAVK\n", + "0 STRUCTURE_MASK \n", + "\n", + "60 SEQUENCE YMTGCDIPPHVTLGITNKDQEANFQRFKELTRNIDLTSLSFTCKEVICFPQSRASKELGA\n", + "60 STRUCTURE_MASK \n", + "\n", + "120 SEQUENCE NGRAVVMKLEASDDVKALRNVLFNVVPTPRDIFGPVLSDPVWCPHVTIGYVRADDEDNKN\n", + "120 STRUCTURE_MASK \n", + "\n", + "180 SEQUENCE SFIELAEAFRGSKIKVIGWCE\n", + "180 STRUCTURE_MASK \n" + ] + } + ], + "source": [ + "target = target[2:]\n", + "print(target.formatted(include=(\"sequence\", \"structure_mask\")))" + ] + }, + { + "cell_type": "markdown", + "id": "step1_2", + "metadata": {}, + "source": [ + "## Step 1.2: Define the nanobody scaffold\n", + "\n", + "We will use the structure from PDB ID `7eow` as our scaffold. First, we load the protein." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "load_scaffold", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 SEQUENCE MEVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTY\n", + "0 STRUCTURE_MASK ^ \n", + "\n", + "60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWG\n", + "60 STRUCTURE_MASK \n", + "\n", + "120 SEQUENCE QGTQVTVSSLEHHHHHH\n", + "120 STRUCTURE_MASK ^^^^^^^^\n" + ] + } + ], + "source": [ + "structure = Structure.from_pdb_id(\"7eow\")\n", + "first_complex = structure[0]\n", + "binder_scaffold = first_complex.get_protein(chain_id=\"B\")\n", + "print(binder_scaffold.formatted(include=(\"sequence\", \"structure_mask\")))" + ] + }, + { + "cell_type": "markdown", + "id": "clean_scaffold", + "metadata": {}, + "source": [ + "### Clean the scaffold\n", + "\n", + "We remove the leading methionine (M) and the trailing `LEHHHHHH` tag (the His-tag plus the two residues preceding it) because they are expression artifacts. The structure mask printed above marks exactly these nine residues (`^`) as having no defined structure." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "remove_artifacts", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASGRTFSYNPMGWFRQAPGKGRELVAAISRTGGSTYY\n", + "0 STRUCTURE_MASK \n", + "\n", + "60 SEQUENCE PDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAAAGVRAEDGRVRTLPSEYTFWGQ\n", + "60 STRUCTURE_MASK \n", + "\n", + "120 SEQUENCE GTQVTVSS\n", + "120 STRUCTURE_MASK \n" + ] + } + ], + "source": [ + "binder_scaffold = binder_scaffold[~binder_scaffold.get_structure_mask()]\n", + "print(binder_scaffold.formatted(include=(\"sequence\", \"structure_mask\")))" + ] + }, + { + "cell_type": "markdown", + "id": "extract_frameworks", + "metadata": {}, + "source": [ + "### Define the framework and binding regions\n", + "\n", + "We want to use the nanobody structure as a framework, but design new CDRs for binding to our target. 
To do this, we keep the framework regions (FWRs) constant but replace the CDRs with designable regions.\n", + "\n", + "In this example, we will set CDR1 length to 10 (increased from 9), CDR2 to 8 (same as scaffold), and CDR3 to 20 (decreased from 21). The `X` characters represent residues to be designed." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "extract_fw", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "FWR1: EVQLVESGGGLVQPGGSLRLSCAAS\n", + "FWR2: GWFRQAPGKGRELVAAI\n", + "FWR3: YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCA\n", + "FWR4: GQGTQVTVSS\n" + ] + } + ], + "source": [ + "fwr1 = binder_scaffold[:25]\n", + "fwr2 = binder_scaffold[34:51]\n", + "fwr3 = binder_scaffold[59:97]\n", + "fwr4 = binder_scaffold[118:]\n", + "print(\"FWR1:\", fwr1.sequence.decode())\n", + "print(\"FWR2:\", fwr2.sequence.decode())\n", + "print(\"FWR3:\", fwr3.sequence.decode())\n", + "print(\"FWR4:\", fwr4.sequence.decode())" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "insert_cdrs", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX\n", + "0 STRUCTURE_MASK ^^^^^^^^^^ ^^^^^^^^\n", + "\n", + "60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ\n", + "60 STRUCTURE_MASK ^^^^^^^^^^^^^^^^^^^^ \n", + "\n", + "120 SEQUENCE GTQVTVSS\n", + "120 STRUCTURE_MASK \n" + ] + } + ], + "source": [ + "cdr1_length = 10\n", + "cdr2_length = 8\n", + "cdr3_length = 20\n", + "binder_scaffold = (\n", + " fwr1\n", + " + \"X\" * cdr1_length\n", + " + fwr2\n", + " + \"X\" * cdr2_length\n", + " + fwr3\n", + " + \"X\" * cdr3_length\n", + " + fwr4\n", + ")\n", + "print(binder_scaffold.formatted(include=(\"sequence\", \"structure_mask\")))" + ] + }, + { + "cell_type": "markdown", + "id": "step1_3", + "metadata": {}, + "source": [ + "## Step 1.3: Configure relative positioning 
(Groups)\n", + "\n", + "By default, all residues are in \"group 0\", which implies their relative positions are fixed. Since we want the nanobody to dock against the target (i.e., its position relative to the target is not fixed), we assign the scaffold to a different group (group 1)." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "set_groups", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Visualize target groups:\n", + "0 SEQUENCE TTIQKELENIVVKERQNKKDTILMGLKVEVPWNYCDWASISFYDVRLESGILDMESIAVK\n", + "0 GROUP 000000000000000000000000000000000000000000000000000000000000\n", + "\n", + "60 SEQUENCE YMTGCDIPPHVTLGITNKDQEANFQRFKELTRNIDLTSLSFTCKEVICFPQSRASKELGA\n", + "60 GROUP 000000000000000000000000000000000000000000000000000000000000\n", + "\n", + "120 SEQUENCE NGRAVVMKLEASDDVKALRNVLFNVVPTPRDIFGPVLSDPVWCPHVTIGYVRADDEDNKN\n", + "120 GROUP 000000000000000000000000000000000000000000000000000000000000\n", + "\n", + "180 SEQUENCE SFIELAEAFRGSKIKVIGWCE\n", + "180 GROUP 000000000000000000000\n", + "\n", + "Visualize binder scaffold groups:\n", + "0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX\n", + "0 GROUP 000000000000000000000000000000000000000000000000000000000000\n", + "\n", + "60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ\n", + "60 GROUP 000000000000000000000000000000000000000000000000000000000000\n", + "\n", + "120 SEQUENCE GTQVTVSS\n", + "120 GROUP 00000000\n" + ] + } + ], + "source": [ + "# Visualize current groups (all 0)\n", + "print(\"\\nVisualize target groups:\")\n", + "print(target.formatted((\"sequence\", \"group\")))\n", + "print(\"\\nVisualize binder scaffold groups:\")\n", + "print(binder_scaffold.formatted((\"sequence\", \"group\")))" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "5f0f1bbb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "Updated 
binder scaffold groups:\n", + "0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX\n", + "0 GROUP 111111111111111111111111111111111111111111111111111111111111\n", + "\n", + "60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ\n", + "60 GROUP 111111111111111111111111111111111111111111111111111111111111\n", + "\n", + "120 SEQUENCE GTQVTVSS\n", + "120 GROUP 11111111\n" + ] + } + ], + "source": [ + "# Set scaffold to group 1 to unfix relative position\n", + "binder_scaffold = binder_scaffold.set_group(1)\n", + "print(\"\\nUpdated binder scaffold groups:\")\n", + "print(binder_scaffold.formatted((\"sequence\", \"group\")))" + ] + }, + { + "cell_type": "markdown", + "id": "combine_query", + "metadata": {}, + "source": [ + "Finally, we combine the target and the binder scaffold into a single `Complex` query." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "create_query", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Query type \n", + "Chains in query: ['A', 'B']\n", + "\n", + "Visualize target (Chain A):\n", + "0 SEQUENCE TTIQKELENIVVKERQNKKDTILMGLKVEVPWNYCDWASISFYDVRLESGILDMESIAVK\n", + "0 STRUCTURE_MASK \n", + "\n", + "60 SEQUENCE YMTGCDIPPHVTLGITNKDQEANFQRFKELTRNIDLTSLSFTCKEVICFPQSRASKELGA\n", + "60 STRUCTURE_MASK \n", + "\n", + "120 SEQUENCE NGRAVVMKLEASDDVKALRNVLFNVVPTPRDIFGPVLSDPVWCPHVTIGYVRADDEDNKN\n", + "120 STRUCTURE_MASK \n", + "\n", + "180 SEQUENCE SFIELAEAFRGSKIKVIGWCE\n", + "180 STRUCTURE_MASK \n", + "\n", + "Visualize binder scaffold (Chain B):\n", + "0 SEQUENCE EVQLVESGGGLVQPGGSLRLSCAASXXXXXXXXXXGWFRQAPGKGRELVAAIXXXXXXXX\n", + "0 STRUCTURE_MASK ^^^^^^^^^^ ^^^^^^^^\n", + "\n", + "60 SEQUENCE YPDSVEGRFTISRDNAKRMVYLQMNSLRAEDTAVYYCAXXXXXXXXXXXXXXXXXXXXGQ\n", + "60 STRUCTURE_MASK ^^^^^^^^^^^^^^^^^^^^ \n", + "\n", + "120 SEQUENCE GTQVTVSS\n", + "120 STRUCTURE_MASK \n" + ] + } + ], + "source": [ + "query = target & binder_scaffold\n", + 
"print(\"Query type\", type(query))\n", + "print(\"Chains in query:\", list(query.get_chains().keys()))\n", + "print(\"\\nVisualize target (Chain A):\")\n", + "print(query.get_protein(chain_id=\"A\").formatted(include=(\"sequence\", \"structure_mask\")))\n", + "print(\"\\nVisualize binder scaffold (Chain B):\")\n", + "print(query.get_protein(chain_id=\"B\").formatted(include=(\"sequence\", \"structure_mask\")))" + ] + }, + { + "cell_type": "markdown", + "id": "bf083d77", + "metadata": {}, + "source": [ + "# Visualizing structure of query..." + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "2fdd48a9", + "metadata": {}, + "outputs": [], + "source": [ + "@dataclass(frozen=True)\n", + "class ColorSpec:\n", + " chain_id: str\n", + " color: str\n", + " positions: list[int] | None = None\n", + " rep_type: RepresentationTypeT = \"cartoon\"\n", + "\n", + "\n", + "def visualize_cif(cif_string: str, colors: list[ColorSpec]):\n", + " builder = mvs.create_builder()\n", + " model = (\n", + " builder.download(url=\"structure.cif\").parse(format=\"mmcif\").model_structure()\n", + " )\n", + " for color_spec in colors:\n", + " component = model.component(\n", + " selector=(\n", + " mvs.ComponentExpression(label_asym_id=color_spec.chain_id)\n", + " if color_spec.positions is None\n", + " else [\n", + " mvs.ComponentExpression(\n", + " label_asym_id=color_spec.chain_id, label_seq_id=i\n", + " )\n", + " for i in color_spec.positions\n", + " ]\n", + " )\n", + " )\n", + " rep = component.representation(type=color_spec.rep_type)\n", + " rep.color(color=color_spec.color)\n", + " builder.molstar_notebook(\n", + " data={\"structure.cif\": cif_string},\n", + " width=600,\n", + " height=500,\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "1f8d4677", + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + "" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/javascript": "\n setTimeout(function(){\n var wrapper = document.getElementById(\"molstar_b129a1dd-6daa-4de2-ba3a-ce0b0121630d\")\n if (wrapper === null) {\n throw new Error(\"Wrapper element #molstar_b129a1dd-6daa-4de2-ba3a-ce0b0121630d not found anymore\")\n }\n var blob = new Blob([\"\\n\\n \\n