Anomaly detection: engineered pathogen signature detection layer

## Overview

Add an anomaly detection layer on top of the core 7-family classifier that flags sequences with signatures of potential genetic engineering. This is a **proof-of-concept** for biosecurity-relevant surveillance, not a production detection system.

**Status:** Deferred until the core classifier (Phase 2) is trained and validated. This issue documents the approach so it can be picked up later.

## Motivation

ECDC's mandate includes monitoring for unusual or potentially engineered pathogens. A classifier that both identifies virus families and flags anomalous sequences would show how genomic classification can feed into biosecurity surveillance.

**Important framing:** This is a proof-of-concept demonstrating the *methodology* for anomaly detection in genomic surveillance. The "engineered" training samples are synthetic and represent a hypothesis, not validated real-world examples. This must be clearly communicated in any documentation or presentation.

## Approach

### Strategy 1: Confidence-based anomaly detection (simpler)

Use the core classifier's confidence scores as an anomaly signal:
- A sequence that genuinely belongs to one of the 7 families should produce a high-confidence prediction
- A chimeric or engineered sequence is expected to produce a low-confidence prediction, or probability mass split across multiple families
- **Implementation:** Add a confidence threshold; sequences whose top-class probability falls below it are flagged as anomalous
- **Pros:** No additional training needed, leverages existing model
- **Cons:** Can't distinguish "engineered" from "just hard to classify"
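
The flagging rule above could be sketched roughly as follows. This is a minimal illustration, not the classifier's actual interface: the function name `flag_anomaly`, the raw-logit input, and the default thresholds (`min_confidence=0.8`, `max_entropy=1.0`) are all assumptions, and the entropy check is an optional addition that catches probability split across several families.

```python
import math

def softmax(logits):
    """Convert raw classifier logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def flag_anomaly(logits, min_confidence=0.8, max_entropy=1.0):
    """Flag a sequence when the family prediction is weak or spread out.

    min_confidence and max_entropy are placeholder values (assumptions);
    in practice they would be tuned on held-out natural sequences.
    """
    probs = softmax(logits)
    confidence = max(probs)
    # Shannon entropy in nats: 0 for a one-hot prediction, ln(7) for uniform over 7 families
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return {
        "predicted_class": probs.index(confidence),
        "confidence": confidence,
        "entropy": entropy,
        "anomalous": confidence < min_confidence or entropy > max_entropy,
    }
```

A flagged result only says the classifier is unsure; as noted in the cons, it cannot tell an engineered sequence from one that is merely hard to classify.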

### Strategy 2: Synthetic training data for an auxiliary detection head (more impressive)

Generate synthetic "engineered" sequences for training:

1. **Cross-family chimeras:** Splice together chunks from different virus families (e.g., 70% Coronaviridae backbone + 30% Poxviridae insertion). These simulate recombinant or chimeric constructs
2. **Codon usage perturbation:** Alter the codon frequencies of natural sequences to mimic codon optimization (a hallmark of synthetic gene design). Host-optimized codon usage patterns differ measurably from natural viral codon usage
3. **Synthetic element insertion:** Insert known synthetic biology elements into viral backbones:
   - Common promoters (T7, CMV)
   - Selection markers (antibiotic resistance cassettes)
   - Reporter gene fragments (GFP, luciferase)
   - Restriction enzyme site clusters (rare in natural genomes, common in engineered constructs)

Then train a binary auxiliary head: `natural` vs `potentially_engineered`, either as:
- A separate binary classifier on the same encoder
- A multi-task head alongside the family classifier

### Data generation for Strategy 2

```
src/synthetic_engineering.py
```

Functions needed:
- `generate_chimera(seq_a, seq_b, ratio=0.7)` — splice two sequences
- `perturb_codon_usage(seq, host='human')` — shift codons toward host-optimized frequencies
- `insert_synthetic_elements(seq, elements=['T7_promoter', 'GFP_fragment'])` — insert known synthetic sequences at random positions
- Generate N synthetic samples per family, balanced against the natural dataset
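
As a starting point, `generate_chimera` might look something like the sketch below. It assumes sequences are plain nucleotide strings, interprets `ratio` as the fraction of the output that comes from `seq_a`, and replaces one contiguous window of `seq_a` with a random fragment of `seq_b`. The `rng` parameter is a hypothetical addition for reproducibility and is not part of the signature listed above.

```python
import random

def generate_chimera(seq_a, seq_b, ratio=0.7, rng=None):
    """Build a labelled-synthetic training sample from two natural sequences.

    Replaces a contiguous window of seq_a (the backbone) with a fragment of
    seq_b so that roughly `ratio` of the result still comes from seq_a.
    The output has the same length as seq_a. (Window replacement is one
    possible reading of "splice"; it is an assumption of this sketch.)
    """
    if not 0.0 < ratio < 1.0:
        raise ValueError("ratio must be strictly between 0 and 1")
    rng = rng or random.Random()
    insert_len = round(len(seq_a) * (1 - ratio))
    if insert_len == 0 or insert_len > len(seq_b):
        raise ValueError("sequences too short for the requested ratio")
    # Pick a random fragment of seq_b and a random window of seq_a to overwrite
    start_b = rng.randrange(len(seq_b) - insert_len + 1)
    fragment = seq_b[start_b:start_b + insert_len]
    start_a = rng.randrange(len(seq_a) - insert_len + 1)
    return seq_a[:start_a] + fragment + seq_a[start_a + insert_len:]
```

Passing a seeded `random.Random` makes the synthetic dataset reproducible, which helps when comparing runs of the auxiliary head.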

### Model architecture change

Add an auxiliary output head to the classifier:
```
Encoder → [family_head (7 classes), anomaly_head (2 classes: natural/engineered)]
```
Multi-task loss: `L = L_family + λ * L_anomaly` where λ is a weighting hyperparameter.
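
For a single sample, the combined loss can be written out in plain Python as below. This is only a sketch of the formula: it assumes both heads are trained with cross-entropy on raw logits, and the names `cross_entropy`, `multitask_loss`, and the default `lam=0.5` are illustrative. A real implementation would use the training framework's own loss functions over batches.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log softmax(logits)[target]."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]

def multitask_loss(family_logits, family_target,
                   anomaly_logits, anomaly_target, lam=0.5):
    """L = L_family + lam * L_anomaly, with lam the weighting hyperparameter."""
    l_family = cross_entropy(family_logits, family_target)     # 7-class head
    l_anomaly = cross_entropy(anomaly_logits, anomaly_target)  # natural vs engineered
    return l_family + lam * l_anomaly
```

Setting `lam=0` recovers the family-only objective, which gives a convenient baseline for checking whether the auxiliary head hurts family accuracy.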

## Considerations and limitations

- **No ground truth:** There is no public dataset of confirmed engineered pathogen sequences. The synthetic training data represents plausible signatures, not validated examples
- **False positive risk:** Natural recombination events (common in RNA viruses) may trigger false alarms. The model would need calibration against known natural recombinants
- **Ethical framing:** This is dual-use research. Frame it as defensive surveillance tooling. Document clearly that this is a proof-of-concept
- **Scope for portfolio:** Even implementing just Strategy 1 (confidence-based flagging) with a brief discussion of Strategy 2 would be strong for the ECDC application
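
One way to approach the calibration mentioned under false positive risk is to pick the confidence threshold from the classifier's scores on sequences known to be natural, including natural recombinants. The sketch below assumes a sequence is flagged when its confidence is strictly below the threshold; `calibrate_threshold` and `target_fpr` are hypothetical names.

```python
def calibrate_threshold(natural_confidences, target_fpr=0.05):
    """Choose a threshold that flags at most target_fpr of natural sequences.

    natural_confidences: top-class probabilities for sequences known to be
    natural (ideally including known natural recombinants).
    A sequence is flagged when confidence < threshold.
    """
    if not natural_confidences:
        raise ValueError("need at least one natural confidence score")
    scores = sorted(natural_confidences)
    # Number of natural sequences we tolerate being flagged
    allowed = min(int(target_fpr * len(scores)), len(scores) - 1)
    # Only the `allowed` lowest scores (at most) fall strictly below this value
    return scores[allowed]
```

The resulting false positive rate only holds for sequences that resemble the calibration set, so it should be rechecked whenever the natural dataset changes.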

## Existing code reference

- Core classifier model (from Phase 2, not yet built) — this layer sits on top
- `src/data_preprocessing.py` (#1) — provides the processed sequences
- `src/augmentation.py` (#2) — the synthetic data generation here is conceptually similar to augmentation

## Definition of done

- [ ] At minimum: confidence-based anomaly flagging on the trained classifier (Strategy 1)
- [ ] Stretch: synthetic chimera generation and auxiliary detection head (Strategy 2)
- [ ] Clear documentation explaining this is a proof-of-concept with synthetic training data
- [ ] Demo notebook showing example flagged sequences with explanations

## Dependencies

- **Blocked by:** Core classifier (Phase 2 — model training and validation)
- **Blocks:** nothing (this is an extension feature)
