The Tissue Patching system in MetaInformAnt provides a robust mechanism for correcting and automating tissue metadata assignment across thousands of samples. This ensures that downstream analyses like csca (Cross-Species Correlation Analysis) use accurate, canonical tissue labels even when NCBI metadata is missing or ambiguous.
NCBI SRA/ENA metadata often contains:
- Missing Values: The tissue field is empty.
- Ambiguous Terms: "Nervous system", "head", "central nervous system".
- Experimental Details: "Brain of forager kept in a group for 2 days".
The MetaInformAnt system uses a Two-Tier normalization strategy:
- Synonym Mapping: Mapping varied strings to canonical names (e.g., "Nervous system" → brain).
- Patching: Force-assigning tissues to specific Runs, BioProjects, or BioSamples based on manual research or study titles.
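The synonym tier can be sketched as a reverse lookup built from a canonical-to-synonyms mapping. This is a minimal illustration, not the project's implementation: the function name `build_synonym_index` is hypothetical, and lowercasing and trimming before lookup is an assumption suggested by the "Nervous system" → brain example.

```python
def build_synonym_index(mapping):
    """Invert {canonical: [synonyms]} into {synonym: canonical}.

    Keys are lowercased and stripped so lookups are case-insensitive
    (an assumption, not confirmed behaviour of the real scripts).
    """
    index = {}
    for canonical, synonyms in mapping.items():
        # The canonical name always resolves to itself.
        index[canonical.strip().lower()] = canonical
        for synonym in synonyms or []:
            index[synonym.strip().lower()] = canonical
    return index


# Hypothetical in-memory mapping of canonical names to synonyms.
mapping = {"brain": ["nerve", "nervous system", "head", "brain"]}
index = build_synonym_index(mapping)

print(index.get("Nervous system".strip().lower()))  # brain
```

If the same synonym were listed under two canonical names, the later entry would silently win in this sketch; a real loader would probably want to report such conflicts.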
The system is controlled by two YAML files in config/amalgkit/:
tissue_mapping.yaml maps canonical tissue names to a list of synonyms:

    brain:
      - nerve
      - nervous system
      - head
      - brain

The second file holds high-priority overrides for specific accessions:

    samples:
      SRR12345678: brain
    bioprojects:
      PRJNA339620: mushroom_body
    biosamples:
      SAMN00849801: brain

The StreamingPipelineOrchestrator and scripts/rna/normalize_tissue_metadata.py apply normalization in the following priority order:
- Sample Patch: If the Run Accession (SRR) is listed in the samples: section.
- BioSample Patch: If the BioSample Accession is listed in the biosamples: section.
- BioProject Patch: If the BioProject Accession is listed in the bioprojects: section.
- Synonym Match: If the raw metadata value matches a synonym in tissue_mapping.yaml.
- Prefix Match: If the raw metadata starts with a known synonym (e.g., "Brain tissue..." matches brain).
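The priority order above can be sketched as a single function that returns on the first rule that matches. This is an illustrative sketch under stated assumptions, not the code in normalize_tissue_metadata.py: the function name `resolve_tissue`, the metadata column names (`run`, `biosample`, `bioproject`, `tissue`), the rule labels, and the choice to try longer synonyms first during prefix matching are all assumptions. The accessions `SRR00000001` and `SRR00000002` are made up for the demonstration.

```python
# (patch section, assumed metadata column, rule label), most specific first.
PATCH_RULES = (
    ("samples", "run", "sample_patch"),
    ("biosamples", "biosample", "biosample_patch"),
    ("bioprojects", "bioproject", "bioproject_patch"),
)


def resolve_tissue(row, patches, synonym_index):
    """Return (canonical_tissue, rule_label); (None, "unmapped") if nothing applies."""
    # Patches win over any value found in the raw metadata.
    for section, column, rule in PATCH_RULES:
        table = patches.get(section) or {}
        accession = row.get(column)
        if accession in table:
            return table[accession], rule

    raw = (row.get("tissue") or "").strip().lower()
    if not raw:
        return None, "unmapped"

    # Exact synonym match.
    if raw in synonym_index:
        return synonym_index[raw], "synonym"

    # Prefix match; longest synonym first is an assumption of this sketch.
    for synonym in sorted(synonym_index, key=len, reverse=True):
        if raw.startswith(synonym):
            return synonym_index[synonym], "prefix"
    return None, "unmapped"


patches = {
    "samples": {"SRR12345678": "brain"},
    "bioprojects": {"PRJNA339620": "mushroom_body"},
    "biosamples": {"SAMN00849801": "brain"},
}
synonym_index = {"nerve": "brain", "nervous system": "brain",
                 "head": "brain", "brain": "brain"}

patched_run = resolve_tissue(
    {"run": "SRR12345678", "bioproject": "PRJNA339620", "tissue": "head"},
    patches, synonym_index)
patched_project = resolve_tissue(
    {"run": "SRR00000001", "bioproject": "PRJNA339620", "tissue": "head"},
    patches, synonym_index)
prefix_hit = resolve_tissue(
    {"run": "SRR00000002", "tissue": "Brain of forager kept in a group for 2 days"},
    patches, synonym_index)

print(patched_run)      # ('brain', 'sample_patch')
print(patched_project)  # ('mushroom_body', 'bioproject_patch')
print(prefix_hit)       # ('brain', 'prefix')
```

Returning the rule label alongside the tissue makes it easy to audit how each sample was assigned, which is useful when a patch unexpectedly overrides a raw value.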
Path: scripts/rna/normalize_tissue_metadata.py
Used to batch-process a metadata.tsv file and add a tissue_normalized column.
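As a rough illustration of what a batch step of this kind involves (this is not the script's actual code), the sketch below copies a tab-separated table and appends a tissue_normalized column using Python's standard csv module. The function name `add_tissue_normalized`, the stand-in `toy_normalize` rule, the `run` and `tissue` column names, and the accession `SRR00000001` are assumptions made for the example.

```python
import csv
import io


def add_tissue_normalized(src, dst, normalize):
    """Copy a TSV from src to dst, appending a tissue_normalized column."""
    reader = csv.DictReader(src, delimiter="\t")
    fieldnames = list(reader.fieldnames or []) + ["tissue_normalized"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Unmapped samples get an empty cell rather than a guessed label.
        row["tissue_normalized"] = normalize(row) or ""
        writer.writerow(row)


def toy_normalize(row):
    """Stand-in rule for the demo: a tiny exact-match synonym set."""
    raw = (row.get("tissue") or "").strip().lower()
    return "brain" if raw in {"brain", "head", "nerve", "nervous system"} else None


# In-memory stand-ins for the input and output files.
src = io.StringIO("run\ttissue\nSRR12345678\tNervous system\nSRR00000001\t\n")
dst = io.StringIO()
add_tissue_normalized(src, dst, toy_normalize)
print(dst.getvalue())
```

The real script is invoked from the command line as follows.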

    python3 scripts/rna/normalize_tissue_metadata.py \
      --input output/amalgkit/pbarbatus/work/metadata/metadata.tsv \
      --output output/amalgkit/pbarbatus/work/metadata/metadata_normalized.tsv

Path: scripts/rna/verify_tissue_coverage.py
Reports how many samples are mapped vs. unmapped in a species workflow.

    python3 scripts/rna/verify_tissue_coverage.py --species pbarbatus

The tissue patching system is validated via:
- Automated Tests: test_rna_tissue_normalization.py ensures patching priority and synonym matching work correctly.
- Production Audit: Current coverage for the honeybee dataset is 99.9% (7,265/7,270 samples) using these patches.
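A coverage figure such as the one quoted above is simply the share of samples with a non-empty normalized tissue. The sketch below shows that arithmetic on a synthetic list that reproduces the quoted counts; it is not the logic of verify_tissue_coverage.py, and the function name `tissue_coverage` is hypothetical.

```python
def tissue_coverage(normalized_values):
    """Return (mapped, total, percent) for a list of normalized tissue labels."""
    total = len(normalized_values)
    # A sample counts as mapped when its normalized label is non-empty.
    mapped = sum(1 for value in normalized_values if value)
    percent = 100.0 * mapped / total if total else 0.0
    return mapped, total, percent


# Synthetic stand-in: 7,265 mapped samples and 5 unmapped ones.
values = ["brain"] * 7265 + [""] * 5
mapped, total, percent = tissue_coverage(values)
print(f"{mapped:,}/{total:,} samples mapped ({percent:.1f}%)")
# 7,265/7,270 samples mapped (99.9%)
```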