Skip to content

feat(ingest): do not use ncbi release data for assigning segment groups#6527

Open
anna-parker wants to merge 3 commits into
mainfrom
ingest_grouping_2
Open

feat(ingest): do not use ncbi release data for assigning segment groups#6527
anna-parker wants to merge 3 commits into
mainfrom
ingest_grouping_2

Conversation

@anna-parker
Copy link
Copy Markdown
Contributor

@anna-parker anna-parker commented May 28, 2026

see @theosanderson's report https://gist.github.com/theosanderson-agent/ded53e709683a52063233c174aa7c0c0 for why this would be good to have to andes virus

Use NCBI release date as a segment-specific field in ingest when grouping but still have only one release date per segment-group (as this is used for calculating the earliestReleaseDate making it segment specific would be a massive overhaul - see #6390 for a starting attempt in this direction)

Changes in CCHF

Based on claude's summary of me running the diff script.

  1. individual → grouped

Previously separate single-segment records are combined into groups:

  ┌─────────────────────────────┬───────────────────────────┐
  │    Removed (individual)     │      Added (grouped)      │
  ├─────────────────────────────┼───────────────────────────┤
  │ ON623080.1.M + ON142178.1.S │ ON623080.1.M/ON142178.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ ON623086.1.L + ON142179.1.S │ ON623086.1.L/ON142179.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ OQ935542.1.M + ON254096.1.S │ OQ935542.1.M/ON254096.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ OQ935543.1.M + ON254100.1.S │ OQ935543.1.M/ON254100.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ OQ935544.1.M + ON254097.1.S │ OQ935544.1.M/ON254097.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ PV210235.1.L + PV168397.1.M │ PV210235.1.L/PV168397.1.M │
  ├─────────────────────────────┼───────────────────────────┤
  │ PV210236.1.L + PV168394.1.M │ PV210236.1.L/PV168394.1.M │
  └─────────────────────────────┴───────────────────────────┘
  1. grouped → split

Previously grouped records (only L and M interestingly) are split into individual segment entries. These seem to be all from the same submitting group: Stavropol State Research Anti-Plague Institute, Laboratory for diagnostics of viral infections

For example first two examples: https://pathoplexus.org/cchf/search?visibility_specimenCollectorSampleId=true&specimenCollectorSampleId=229-ARM-TI-2023 have the same specimenCollectorSampleId and metadata (only release date was used to split) so it actually does make sense to split them up:

  ┌───────────────────────────┬─────────────────────────────┐
  │     Removed (grouped)     │     Added (individual)      │  specimenCollectorSampleId
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878918.1.L/PQ878909.1.M │ PQ878918.1.L + PQ878909.1.M │  229-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105948.1.L/PV091003.1.M │ PV105948.1.L + PV091003.1.M │  229-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878919.1.L/PQ878910.1.M │ PQ878919.1.L + PQ878910.1.M │  279-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105949.1.L/PV091004.1.M │ PV105949.1.L + PV091004.1.M │  279-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878920.1.L/PQ878911.1.M │ PQ878920.1.L + PQ878911.1.M │  345-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878922.1.L/PQ878913.1.M │ PQ878922.1.L + PQ878913.1.M │  372-ARM-TI-2022
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878925.1.L/PQ878916.1.M │ PQ878925.1.L + PQ878916.1.M │  423-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878926.1.L/PQ878917.1.M │ PQ878926.1.L + PQ878917.1.M │  521-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105950.1.L/PV091005.1.M │ PV105950.1.L + PV091005.1.M │  345-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105951.1.L/PV091006.1.M │ PV105951.1.L + PV091006.1.M │  423-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105952.1.L/PV091007.1.M │ PV105952.1.L + PV091007.1.M │ 521-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105953.1.L/PV091008.1.M │ PV105953.1.L + PV091008.1.M │  372-ARM-TI-2022
  └───────────────────────────┴─────────────────────────────┘

PR Checklist

  • Test on andes virus to see if this improves grouping there as well: testing results on andes are good: Andes and BDBV improvements pathoplexus/pathoplexus#983
  • test on staging that does not trigger a revision of all grouped sequences (should not happen as the hash value will be the same unless the release date was different and in these cases the grouping will change anyways

🚀 Preview: https://ingest-grouping-2.loculus.org

@anna-parker anna-parker changed the base branch from main to accept_ranges May 28, 2026 16:19
@claude claude Bot added the ingest Ingest pipeline label May 28, 2026
@anna-parker anna-parker added the preview Triggers a deployment to argocd label May 28, 2026
Base automatically changed from accept_ranges to main May 29, 2026 07:00
@anna-parker anna-parker changed the title Ingest grouping 2 feat(ingest): do not use ncbi release data for assigning segment groups May 29, 2026
@anna-parker anna-parker marked this pull request as ready for review May 29, 2026 12:36
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 29, 2026

Claude finished @anna-parker's task in 2m 32s —— View job


Code Review

  • Gather context and read changed files
  • Analyze changes
  • Post review

The change is small and well-motivated. The core idea is correct: ncbiReleaseDate shouldn't be used as a grouping key (it's segment-specific), but it's a useful fallback proxy for ncbiCollectionDate when that field is absent. The max-over-segments approach for the output release date is appropriate.

A few issues to consider:

1. KeyError risk on rename["ncbiCollectionDate"] (line 47)

values_copy[rename["ncbiCollectionDate"]] = (
    values_copy[rename["ncbiCollectionDate"]] or values_copy["ncbiReleaseDate"]
)

If any segmented organism's config doesn't include ncbiCollectionDate in its rename dict, this will raise a KeyError at runtime. Similarly if the renamed field is not present in values_copy. The CCHF config has this (ncbiCollectionDate: sampleCollectionDate), but the defaults don't enforce it. A .get() with a fallback would make this more robust:

renamed_date_key = rename.get("ncbiCollectionDate")
if renamed_date_key and renamed_date_key in values_copy:
    values_copy[renamed_date_key] = values_copy[renamed_date_key] or values_copy.get("ncbiReleaseDate")

Fix this →

2. max() may fail if any segment has a None or empty ncbiReleaseDate (line 254-256)

row["ncbiReleaseDate"] = max(
    segment_metadata[group[segment]]["ncbiReleaseDate"] for segment in group
)

ISO 8601 strings compare lexicographically correctly, but if any value is None or "", Python 3 will raise TypeError: '>' not supported between instances of 'str' and 'NoneType'. Worth filtering or guarding:

dates = [segment_metadata[group[s]].get("ncbiReleaseDate") for s in group]
row["ncbiReleaseDate"] = max((d for d in dates if d), default=None)

Fix this →

3. Docstring is stale (line 44)

"""Sort authors values and return modified values""" no longer describes what the function does — it should mention the collection date fallback.

4. No unit tests for the new fallback behavior

The integration test (test_snakemake) will catch regressions in the CCHF case, but there's no targeted test for:

  • Segments with differing release dates but no collection date → should they group or not?
  • Segments where one has a collection date and one doesn't

This is minor for a focused ingest fix, but would help document the intended semantics.


Overall this is a clean fix for a real problem and the approach is sound. The KeyError on missing rename["ncbiCollectionDate"] is the most actionable issue since it would be a silent breakage if another segmented organism is added without that mapping.

@anna-parker anna-parker marked this pull request as draft May 29, 2026 12:37
Comment thread ingest/scripts/heuristic_group_segments.py Outdated
Comment thread ingest/scripts/heuristic_group_segments.py Outdated
@anna-parker anna-parker marked this pull request as ready for review May 29, 2026 12:38
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 29, 2026

Claude finished @anna-parker's task in 2m 53s —— View job


Code Review

  • Read PR diff
  • Review code changes
  • Post findings

The change is well-motivated and the approach is sound: removing ncbiReleaseDate from the grouping heuristic prevents spurious splits (like the CCHF Armenian samples), and the min() aggregation correctly reflects the earliest segment release date for use as an earliestReleaseDate bound.

Two issues worth addressing (inline comments posted):

  1. rename["ncbiCollectionDate"] is an undocumented hard dependency — if a segmented organism's config omits this key, the script crashes with a KeyError. A .get() pattern would be safer.

  2. min() over ncbiReleaseDate strings with potential empty values — empty strings sort before any ISO date, so a single segment with "" causes the whole group's ncbiReleaseDate to be set to "". In practice this may never happen (NCBI always populates this field for GenBank records), but worth guarding.

The anna-parker suggestion to use min instead of max is already applied in the current commit — that's the right call since this is used as a lower bound for earliestReleaseDate.

No concerns about the INTRINSICALLY_SEGMENT_SPECIFIC_FIELDS change or the Config.rename addition.

Comment thread ingest/scripts/heuristic_group_segments.py
Comment thread ingest/scripts/heuristic_group_segments.py Outdated
Comment thread ingest/scripts/heuristic_group_segments.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingest Ingest pipeline preview Triggers a deployment to argocd

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant