Anomaly detection: engineered pathogen signature detection layer #3

@smodee

Overview

Add an anomaly detection layer on top of the core 7-family classifier that flags sequences with signatures of potential genetic engineering. This is a proof-of-concept for biosecurity-relevant surveillance, not a production detection system.

Status: Deferred until the core classifier (Phase 2) is trained and validated. This issue documents the approach so it can be picked up later.

Motivation

A key part of ECDC's mandate is monitoring for unusual or potentially engineered pathogens. A classifier that not only identifies virus families but also flags anomalous sequences would demonstrate the kind of thinking relevant to biosecurity data science.

Important framing: This is a proof-of-concept demonstrating the methodology for anomaly detection in genomic surveillance. The "engineered" training samples are synthetic and represent a hypothesis, not validated real-world examples. This must be clearly communicated in any documentation or presentation.

Approach

Strategy 1: Confidence-based anomaly detection (simpler)

Use the core classifier's confidence scores as an anomaly signal:

  • A sequence that genuinely belongs to one of the 7 families should produce a high-confidence prediction
  • A chimeric or engineered sequence would produce low confidence or split probability across multiple families
  • Implementation: Add a confidence threshold; sequences below it are flagged as anomalous
  • Pros: No additional training needed, leverages existing model
  • Cons: Can't distinguish "engineered" from "just hard to classify"
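
A minimal sketch of the thresholding step, assuming the classifier exposes a softmax probability vector over the 7 families. The function name `flag_anomalous`, the default threshold of 0.8, and the example probability vectors are all hypothetical placeholders, not values taken from the project. Entropy is reported alongside the top-class confidence because it captures the "split probability across multiple families" case directly.

```python
import math


def flag_anomalous(probs, threshold=0.8):
    """Flag one prediction as anomalous if its top-class confidence is low.

    probs: softmax output over the families (must sum to 1).
    threshold: hypothetical cut-off; it would need tuning on validation data.
    Returns (is_anomalous, top_index, confidence, entropy_in_nats).
    """
    if abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must sum to 1")
    confidence = max(probs)
    top_index = probs.index(confidence)
    # Shannon entropy: higher values mean the mass is spread over several families.
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return confidence < threshold, top_index, confidence, entropy


# Illustrative (made-up) probability vectors for two sequences.
confident = [0.94, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01]
split = [0.48, 0.45, 0.02, 0.02, 0.01, 0.01, 0.01]

for name, probs in [("confident", confident), ("split", split)]:
    flagged, idx, conf, ent = flag_anomalous(probs)
    print(f"{name}: flagged={flagged} top={idx} conf={conf:.2f} entropy={ent:.2f}")
```

The threshold would likely be chosen from the confidence distribution of held-out natural sequences, so that the false-positive rate on known-natural data is explicit.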

Strategy 2: Synthetic training data for an auxiliary detection head (more impressive)

Generate synthetic "engineered" sequences for training:

  1. Cross-family chimeras: Splice together chunks from different virus families (e.g., 70% Coronaviridae backbone + 30% Poxviridae insertion). These simulate recombinant or chimeric constructs
  2. Codon usage perturbation: Alter the codon frequencies of natural sequences to mimic codon optimization (a hallmark of synthetic gene design). Host-optimized codon usage patterns differ measurably from natural viral codon usage
  3. Synthetic element insertion: Insert known synthetic biology elements into viral backbones:
    • Common promoters (T7, CMV)
    • Selection markers (antibiotic resistance cassettes)
    • Reporter gene fragments (GFP, luciferase)
    • Restriction enzyme site clusters (rare in natural genomes, common in engineered constructs)
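
Item 2 (codon usage perturbation) can be sketched as a synonymous-codon swap. The version below is a rough illustration under several assumptions: the input is a coding sequence already in frame at position 0, the codon table is deliberately partial (leucine and alanine only), and the "preferred" codons are placeholder values that a real implementation would take from a codon usage table for the chosen host. The `strength` and `rng` parameters are additions for illustration.

```python
import random

# Partial table for illustration: synonymous codons of two amino acids.
SYNONYMOUS = {
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "A": ["GCT", "GCC", "GCA", "GCG"],
}
# Assumed placeholder preferences per host; replace with real usage data.
PREFERRED = {"human": {"L": "CTG", "A": "GCC"}}
CODON_TO_AA = {c: aa for aa, codons in SYNONYMOUS.items() for c in codons}


def perturb_codon_usage(seq, host="human", strength=1.0, rng=None):
    """Swap codons for the host-preferred synonymous codon.

    strength: probability of swapping each eligible codon (1.0 = always).
    The encoded protein is unchanged because only synonymous swaps occur.
    """
    rng = rng or random.Random()
    preferred = PREFERRED[host]
    in_frame = len(seq) - len(seq) % 3
    out = []
    for i in range(0, in_frame, 3):
        codon = seq[i:i + 3]
        aa = CODON_TO_AA.get(codon)
        # Codons outside the partial table are left untouched.
        if aa is not None and rng.random() < strength:
            codon = preferred[aa]
        out.append(codon)
    out.append(seq[in_frame:])  # keep any trailing bases as they are
    return "".join(out)


print(perturb_codon_usage("TTAGCTATG"))
```

Using a `strength` below 1.0 would give partially optimized sequences, which may be a harder and more realistic detection target than fully optimized ones.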

Then train a binary auxiliary head: natural vs potentially_engineered, either as:

  • A separate binary classifier on the same encoder
  • A multi-task head alongside the family classifier

Data generation for Strategy 2

src/synthetic_engineering.py

Functions needed:

  • generate_chimera(seq_a, seq_b, ratio=0.7) — splice two sequences
  • perturb_codon_usage(seq, host='human') — shift codons toward host-optimized frequencies
  • insert_synthetic_elements(seq, elements=['T7_promoter', 'GFP_fragment']) — insert known synthetic sequences at random positions
  • A driver that generates N synthetic samples per family, balanced against the natural dataset
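
A possible shape for `generate_chimera` and `insert_synthetic_elements`, written as a sketch rather than the intended implementation. Assumptions: sequences are plain strings, `ratio` means the fraction of the output that comes from `seq_a`, the chimera keeps the length of `seq_a`, and the `rng` parameter is added for reproducibility. The entries in `ELEMENTS` are dummy placeholder strings, not real promoter or reporter sequences; a real version would load curated sequences from a parts list.

```python
import random

# Dummy placeholders only; NOT real T7 or GFP sequences.
ELEMENTS = {
    "T7_promoter": "TTTTGGGGTTTT",
    "GFP_fragment": "CCCCGGGGCCCC",
}


def generate_chimera(seq_a, seq_b, ratio=0.7, rng=None):
    """Replace a random window of seq_a with an equally long piece of seq_b."""
    rng = rng or random.Random()
    insert_len = round(len(seq_a) * (1 - ratio))
    if insert_len > len(seq_b):
        raise ValueError("seq_b is too short for the requested ratio")
    a_start = rng.randint(0, len(seq_a) - insert_len)
    b_start = rng.randint(0, len(seq_b) - insert_len)
    donor = seq_b[b_start:b_start + insert_len]
    return seq_a[:a_start] + donor + seq_a[a_start + insert_len:]


def insert_synthetic_elements(seq, elements=("T7_promoter", "GFP_fragment"), rng=None):
    """Insert each named element at a random position of the original sequence."""
    rng = rng or random.Random()
    placed = [(rng.randint(0, len(seq)), ELEMENTS[name]) for name in elements]
    # Insert from right to left so earlier positions stay valid and
    # no inserted element is split by a later insertion.
    for pos, element in sorted(placed, reverse=True):
        seq = seq[:pos] + element + seq[pos:]
    return seq


rng = random.Random(0)
chimera = generate_chimera("A" * 100, "C" * 100, ratio=0.7, rng=rng)
print(chimera.count("A"), chimera.count("C"))
print(len(insert_synthetic_elements("A" * 50, rng=rng)))
```

Replacing a window (instead of appending) keeps sequence length constant, which avoids giving the detector a trivial length cue; whether that is desirable is a design choice to revisit.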

Model architecture change

Add an auxiliary output head to the classifier:

Encoder → [family_head (7 classes), anomaly_head (2 classes: natural/engineered)]

Multi-task loss: L = L_family + λ * L_anomaly, where λ is a weighting hyperparameter that controls how much the anomaly head contributes relative to the family head.
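
To make the combined loss concrete, here is a framework-free sketch for a single example. It assumes both heads are trained with softmax cross-entropy, which the issue does not specify, and the default `lam=0.5` is an arbitrary placeholder. The function names are hypothetical; in practice this would be expressed with the training framework's own loss functions.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def multitask_loss(family_logits, family_target,
                   anomaly_logits, anomaly_target, lam=0.5):
    """L = L_family + lam * L_anomaly for one example."""
    l_family = cross_entropy(family_logits, family_target)    # 7-way head
    l_anomaly = cross_entropy(anomaly_logits, anomaly_target)  # natural/engineered
    return l_family + lam * l_anomaly


# Made-up logits: family head favours class 0, anomaly head favours class 1.
loss = multitask_loss([2.0, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0], 0, [0.2, 1.5], 1)
print(f"{loss:.4f}")
```

Setting `lam` to 0 recovers the plain family classifier, which gives a simple baseline for checking that the auxiliary head does not degrade family accuracy.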

Considerations and limitations

  • No ground truth: There is no public dataset of confirmed engineered pathogen sequences. The synthetic training data represents plausible signatures, not validated examples
  • False positive risk: Natural recombination events (common in RNA viruses) may trigger false alarms. The model would need calibration against known natural recombinants
  • Ethical framing: This is dual-use research. Frame it as defensive surveillance tooling. Document clearly that this is a proof-of-concept
  • Scope for portfolio: Even implementing just Strategy 1 (confidence-based flagging) with a brief discussion of Strategy 2 would be strong for the ECDC application

Existing code reference

Definition of done

  • At minimum: confidence-based anomaly flagging on the trained classifier (Strategy 1)
  • Stretch: synthetic chimera generation and auxiliary detection head (Strategy 2)
  • Clear documentation explaining this is a proof-of-concept with synthetic training data
  • Demo notebook showing example flagged sequences with explanations

Dependencies

  • Blocked by: Core classifier (Phase 2 — model training and validation)
  • Blocks: nothing (this is an extension feature)
