Overview
Add an anomaly detection layer on top of the core 7-family classifier that flags sequences with signatures of potential genetic engineering. This is a proof-of-concept for biosecurity-relevant surveillance, not a production detection system.
Status: Deferred until the core classifier (Phase 2) is trained and validated. This issue documents the approach so it can be picked up later.
Motivation
ECDC's mandate includes monitoring for unusual or potentially engineered pathogens. A classifier that both identifies virus families and flags anomalous sequences would demonstrate the kind of thinking relevant to biosecurity data science.
Important framing: This is a proof-of-concept demonstrating the methodology for anomaly detection in genomic surveillance. The "engineered" training samples are synthetic and represent a hypothesis, not validated real-world examples. This must be clearly communicated in any documentation or presentation.
Approach
Strategy 1: Confidence-based anomaly detection (simpler)
Use the core classifier's confidence scores as an anomaly signal:
- A sequence that genuinely belongs to one of the 7 families should produce a high-confidence prediction
- A chimeric or engineered sequence is expected to produce low confidence, or probability split across multiple families
- Implementation: Add a confidence threshold; sequences below it are flagged as anomalous
- Pros: No additional training needed, leverages existing model
- Cons: Can't distinguish "engineered" from "just hard to classify"
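A minimal sketch of the confidence threshold described above, in plain Python. The function names, the entropy check, and both threshold values (`min_confidence`, `max_entropy`) are assumptions for illustration; real thresholds would need tuning on validation data.

```python
import math

def softmax(logits):
    """Convert raw classifier logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def normalized_entropy(probs):
    """Shannon entropy scaled to [0, 1]; 1 means the mass is spread evenly."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

def flag_anomaly(logits, min_confidence=0.8, max_entropy=0.5):
    """Flag a sequence when the top-class probability is low or the
    probability mass is split across families.
    ASSUMPTION: both thresholds are placeholders, not tuned values."""
    probs = softmax(logits)
    confidence = max(probs)
    entropy = normalized_entropy(probs)
    return {
        "predicted_family": probs.index(confidence),
        "confidence": confidence,
        "entropy": entropy,
        "anomalous": confidence < min_confidence or entropy > max_entropy,
    }

# One logit per family (7 families).
confident = flag_anomaly([9.0, 0.5, 0.2, 0.1, 0.0, -0.3, -1.0])
split = flag_anomaly([3.0, 2.9, 0.1, 0.0, -0.2, -0.5, -1.0])
```

Here `confident` is not flagged, while `split` is flagged because its probability mass is divided between the first two families. As noted in the cons, a flag of this kind cannot say why the classifier was uncertain.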
Strategy 2: Synthetic training data for an auxiliary detection head (more impressive)
Generate synthetic "engineered" sequences for training:
- Cross-family chimeras: Splice together chunks from different virus families (e.g., 70% Coronaviridae backbone + 30% Poxviridae insertion). These simulate recombinant or chimeric constructs
- Codon usage perturbation: Alter the codon frequencies of natural sequences to mimic codon optimization (a hallmark of synthetic gene design). Host-optimized codon usage patterns differ measurably from natural viral codon usage
- Synthetic element insertion: Insert known synthetic biology elements into viral backbones:
  - Common promoters (T7, CMV)
  - Selection markers (antibiotic resistance cassettes)
  - Reporter gene fragments (GFP, luciferase)
  - Restriction enzyme site clusters (rare in natural genomes, common in engineered constructs)
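The codon usage perturbation from the list above could be sketched roughly as follows. This is an illustration, not a finished implementation: the four-amino-acid table, the `strength` and `rng` parameters, and the assumption that the sequence starts in reading frame at position 0 are all simplifications introduced here.

```python
import random

# Synonymous codon groups for four amino acids, keyed by the codon treated as
# "preferred" in the target host.
# ASSUMPTION: this small table is illustrative only; a real run would load a
# full codon-usage table for the chosen host.
HOST_TABLES = {
    "human": {
        "CTG": ["CTT", "CTC", "CTA", "CTG", "TTA", "TTG"],  # leucine
        "GCC": ["GCT", "GCC", "GCA", "GCG"],                # alanine
        "GGC": ["GGT", "GGC", "GGA", "GGG"],                # glycine
        "AAG": ["AAA", "AAG"],                              # lysine
    }
}

def perturb_codon_usage(seq, host="human", strength=1.0, rng=None):
    """Replace codons with the host-preferred synonymous codon.

    strength is the probability that each eligible codon is replaced
    (hypothetical parameter). Assumes seq is in frame from position 0;
    trailing bases that do not fill a codon are copied unchanged.
    """
    rng = rng or random.Random()
    # Map every codon in the table to its group's preferred codon.
    preferred = {
        codon: target
        for target, group in HOST_TABLES[host].items()
        for codon in group
    }
    usable = len(seq) - len(seq) % 3
    out = []
    for i in range(0, usable, 3):
        codon = seq[i:i + 3]
        target = preferred.get(codon, codon)  # codons outside the table stay as-is
        out.append(target if rng.random() < strength else codon)
    return "".join(out) + seq[usable:]

optimized = perturb_codon_usage("CTTGCAGGTAAAATG")  # "CTGGCCGGCAAGATG"
```

Because only synonymous substitutions are made, the encoded protein is unchanged and only the codon frequencies shift, which is the signal the auxiliary head is meant to pick up.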
Then train a binary auxiliary head: natural vs potentially_engineered, either as:
- A separate binary classifier on the same encoder
- A multi-task head alongside the family classifier
Data generation for Strategy 2
src/synthetic_engineering.py
Functions needed:
generate_chimera(seq_a, seq_b, ratio=0.7) — splice two sequences
perturb_codon_usage(seq, host='human') — shift codons toward host-optimized frequencies
insert_synthetic_elements(seq, elements=['T7_promoter', 'GFP_fragment']) — insert known synthetic sequences at random positions
- Generate N synthetic samples per family, balanced against the natural dataset
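A rough sketch of two of the functions listed above. Several details are assumptions made for illustration: `ratio` is interpreted as the fraction of the output that comes from `seq_a`, the chimera keeps the backbone length, and the `element_library` and `rng` parameters are hypothetical additions. The element strings in the usage lines are dummy placeholders, not real promoter or reporter sequences.

```python
import random

def generate_chimera(seq_a, seq_b, ratio=0.7, rng=None):
    """Overwrite a random window of seq_a (backbone) with a random chunk of
    seq_b so that roughly `ratio` of the result comes from seq_a."""
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    rng = rng or random.Random()
    insert_len = round(len(seq_a) * (1.0 - ratio))
    if insert_len == 0 or insert_len > len(seq_b):
        raise ValueError("sequences too short for the requested ratio")
    src = rng.randrange(len(seq_b) - insert_len + 1)  # chunk start in seq_b
    dst = rng.randrange(len(seq_a) - insert_len + 1)  # window start in seq_a
    chunk = seq_b[src:src + insert_len]
    return seq_a[:dst] + chunk + seq_a[dst + insert_len:]

def insert_synthetic_elements(seq, elements, element_library, rng=None):
    """Insert each named element at a random position.

    element_library maps element names to sequences (hypothetical parameter).
    A later insertion may land inside an earlier one; this sketch does not
    prevent that.
    """
    rng = rng or random.Random()
    for name in elements:
        pos = rng.randrange(len(seq) + 1)
        seq = seq[:pos] + element_library[name] + seq[pos:]
    return seq

# Usage with dummy data (placeholder strings, not real biological sequences).
rng = random.Random(42)
chimera = generate_chimera("A" * 100, "C" * 100, ratio=0.7, rng=rng)
library = {"T7_promoter": "G" * 8, "GFP_fragment": "T" * 12}
tagged = insert_synthetic_elements("A" * 50, ["T7_promoter", "GFP_fragment"], library, rng=rng)
```

Passing a seeded `random.Random` makes the synthetic dataset reproducible, which helps when comparing runs.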
Model architecture change
Add an auxiliary output head to the classifier:
Encoder → [family_head (7 classes), anomaly_head (2 classes: natural/engineered)]
Multi-task loss: L = L_family + λ * L_anomaly, where λ is a hyperparameter that weights the anomaly loss relative to the family loss.
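To make the arithmetic of the combined loss concrete, here is a framework-free sketch for a single sample. It assumes cross-entropy for both heads, which the issue does not specify, and the default `lam=0.5` is an arbitrary placeholder. An actual implementation would use the training framework's loss functions on batched tensors.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of the target class under softmax(logits)."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def multitask_loss(family_logits, family_target,
                   anomaly_logits, anomaly_target, lam=0.5):
    """L = L_family + lam * L_anomaly for one sample.
    ASSUMPTION: both terms are cross-entropy; lam=0.5 is a placeholder."""
    l_family = cross_entropy(family_logits, family_target)    # 7-class head
    l_anomaly = cross_entropy(anomaly_logits, anomaly_target) # 2-class head
    return l_family + lam * l_anomaly

# Uniform logits: the loss reduces to log(7) + lam * log(2).
loss = multitask_loss([0.0] * 7, 3, [0.0, 0.0], 1, lam=0.5)
```

Setting `lam` to 0 recovers the plain family classifier loss, so the auxiliary head can be switched off without changing the rest of the training loop.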
Considerations and limitations
- No ground truth: There is no public dataset of confirmed engineered pathogen sequences. The synthetic training data represents plausible signatures, not validated examples
- False positive risk: Natural recombination events (common in RNA viruses) may trigger false alarms. The model would need calibration against known natural recombinants
- Ethical framing: This is dual-use research. Frame it as defensive surveillance tooling. Document clearly that this is a proof-of-concept
- Scope for portfolio: Even implementing just Strategy 1 (confidence-based flagging) with a brief discussion of Strategy 2 would be strong for the ECDC application
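One possible way to approach the calibration mentioned under false positive risk is to pick the flagging threshold from anomaly scores of known natural sequences, including natural recombinants. The sketch below assumes higher scores mean more anomalous; the function name and the `max_false_positive_rate` parameter are hypothetical.

```python
def calibrate_threshold(natural_scores, max_false_positive_rate=0.05):
    """Return a threshold such that at most the given fraction of known
    natural sequences would be flagged (flag when score > threshold)."""
    if not natural_scores:
        raise ValueError("need at least one natural score")
    ordered = sorted(natural_scores)
    # Number of natural sequences allowed to exceed the threshold.
    allowed = int(max_false_positive_rate * len(ordered))
    return ordered[len(ordered) - allowed - 1]

# Usage with made-up scores for ten natural sequences.
scores = [i / 10 for i in range(1, 11)]
threshold = calibrate_threshold(scores, max_false_positive_rate=0.2)
flagged = sum(s > threshold for s in scores)
```

With these made-up scores, two of the ten natural sequences exceed the threshold, matching the 20% budget. The threshold only controls false alarms on the calibration set; it says nothing about how many engineered sequences would be caught.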
Definition of done
Dependencies
- Blocked by: Core classifier (Phase 2 — model training and validation)
- Blocks: nothing (this is an extension feature)
Existing code reference
- src/data_preprocessing.py (Data preprocessing: deduplication, chunking, and train/val/test split #1): provides the processed sequences
- src/augmentation.py (Data augmentation: training-time transforms for genomic sequences #2): the synthetic data generation here is conceptually similar to augmentation