feat(ingest): do not use ncbi release data for assigning segment groups by anna-parker · Pull Request #6527 · loculus-project/loculus

anna-parker · 2026-05-28T16:19:35Z

see @theosanderson's report https://gist.github.com/theosanderson-agent/ded53e709683a52063233c174aa7c0c0 for why this would be good to have to andes virus

Use NCBI release date as a segment-specific field in ingest when grouping but still have only one release date per segment-group (as this is used for calculating the earliestReleaseDate making it segment specific would be a massive overhaul - see #6390 for a starting attempt in this direction)

Changes in CCHF

Based on claude's summary of me running the diff script.

individual → grouped

Previously separate single-segment records are combined into groups:

  ┌─────────────────────────────┬───────────────────────────┐
  │    Removed (individual)     │      Added (grouped)      │
  ├─────────────────────────────┼───────────────────────────┤
  │ ON623080.1.M + ON142178.1.S │ ON623080.1.M/ON142178.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ ON623086.1.L + ON142179.1.S │ ON623086.1.L/ON142179.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ OQ935542.1.M + ON254096.1.S │ OQ935542.1.M/ON254096.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ OQ935543.1.M + ON254100.1.S │ OQ935543.1.M/ON254100.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ OQ935544.1.M + ON254097.1.S │ OQ935544.1.M/ON254097.1.S │
  ├─────────────────────────────┼───────────────────────────┤
  │ PV210235.1.L + PV168397.1.M │ PV210235.1.L/PV168397.1.M │
  ├─────────────────────────────┼───────────────────────────┤
  │ PV210236.1.L + PV168394.1.M │ PV210236.1.L/PV168394.1.M │
  └─────────────────────────────┴───────────────────────────┘

grouped → split

Previously grouped records (only L and M interestingly) are split into individual segment entries. These seem to be all from the same submitting group: Stavropol State Research Anti-Plague Institute, Laboratory for diagnostics of viral infections

For example first two examples: https://pathoplexus.org/cchf/search?visibility_specimenCollectorSampleId=true&specimenCollectorSampleId=229-ARM-TI-2023 have the same specimenCollectorSampleId and metadata (only release date was used to split) so it actually does make sense to split them up:

  ┌───────────────────────────┬─────────────────────────────┐
  │     Removed (grouped)     │     Added (individual)      │  specimenCollectorSampleId
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878918.1.L/PQ878909.1.M │ PQ878918.1.L + PQ878909.1.M │  229-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105948.1.L/PV091003.1.M │ PV105948.1.L + PV091003.1.M │  229-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878919.1.L/PQ878910.1.M │ PQ878919.1.L + PQ878910.1.M │  279-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105949.1.L/PV091004.1.M │ PV105949.1.L + PV091004.1.M │  279-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878920.1.L/PQ878911.1.M │ PQ878920.1.L + PQ878911.1.M │  345-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878922.1.L/PQ878913.1.M │ PQ878922.1.L + PQ878913.1.M │  372-ARM-TI-2022
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878925.1.L/PQ878916.1.M │ PQ878925.1.L + PQ878916.1.M │  423-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PQ878926.1.L/PQ878917.1.M │ PQ878926.1.L + PQ878917.1.M │  521-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105950.1.L/PV091005.1.M │ PV105950.1.L + PV091005.1.M │  345-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105951.1.L/PV091006.1.M │ PV105951.1.L + PV091006.1.M │  423-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105952.1.L/PV091007.1.M │ PV105952.1.L + PV091007.1.M │ 521-ARM-TI-2023
  ├───────────────────────────┼─────────────────────────────┤
  │ PV105953.1.L/PV091008.1.M │ PV105953.1.L + PV091008.1.M │  372-ARM-TI-2022
  └───────────────────────────┴─────────────────────────────┘

PR Checklist

Test on andes virus to see if this improves grouping there as well: testing results on andes are good: Andes and BDBV improvements pathoplexus/pathoplexus#983
test on staging that does not trigger a revision of all grouped sequences (should not happen as the hash value will be the same unless the release date was different and in these cases the grouping will change anyways

🚀 Preview: https://ingest-grouping-2.loculus.org

claude · 2026-05-29T12:36:49Z

Claude finished @anna-parker's task in 2m 32s —— View job

Code Review

Gather context and read changed files
Analyze changes
Post review

The change is small and well-motivated. The core idea is correct: ncbiReleaseDate shouldn't be used as a grouping key (it's segment-specific), but it's a useful fallback proxy for ncbiCollectionDate when that field is absent. The max-over-segments approach for the output release date is appropriate.

A few issues to consider:

1. KeyError risk on rename["ncbiCollectionDate"] (line 47)

values_copy[rename["ncbiCollectionDate"]] = (
    values_copy[rename["ncbiCollectionDate"]] or values_copy["ncbiReleaseDate"]
)

If any segmented organism's config doesn't include ncbiCollectionDate in its rename dict, this will raise a KeyError at runtime. Similarly if the renamed field is not present in values_copy. The CCHF config has this (ncbiCollectionDate: sampleCollectionDate), but the defaults don't enforce it. A .get() with a fallback would make this more robust:

renamed_date_key = rename.get("ncbiCollectionDate")
if renamed_date_key and renamed_date_key in values_copy:
    values_copy[renamed_date_key] = values_copy[renamed_date_key] or values_copy.get("ncbiReleaseDate")

Fix this →

2. max() may fail if any segment has a None or empty ncbiReleaseDate (line 254-256)

row["ncbiReleaseDate"] = max(
    segment_metadata[group[segment]]["ncbiReleaseDate"] for segment in group
)

ISO 8601 strings compare lexicographically correctly, but if any value is None or "", Python 3 will raise TypeError: '>' not supported between instances of 'str' and 'NoneType'. Worth filtering or guarding:

dates = [segment_metadata[group[s]].get("ncbiReleaseDate") for s in group]
row["ncbiReleaseDate"] = max((d for d in dates if d), default=None)

Fix this →

3. Docstring is stale (line 44)

"""Sort authors values and return modified values""" no longer describes what the function does — it should mention the collection date fallback.

4. No unit tests for the new fallback behavior

The integration test (test_snakemake) will catch regressions in the CCHF case, but there's no targeted test for:

Segments with differing release dates but no collection date → should they group or not?
Segments where one has a collection date and one doesn't

This is minor for a focused ingest fix, but would help document the intended semantics.

Overall this is a clean fix for a real problem and the approach is sound. The KeyError on missing rename["ncbiCollectionDate"] is the most actionable issue since it would be a silent breakage if another segmented organism is added without that mapping.

claude · 2026-05-29T12:38:43Z

Claude finished @anna-parker's task in 2m 53s —— View job

Code Review

Read PR diff
Review code changes
Post findings

The change is well-motivated and the approach is sound: removing ncbiReleaseDate from the grouping heuristic prevents spurious splits (like the CCHF Armenian samples), and the min() aggregation correctly reflects the earliest segment release date for use as an earliestReleaseDate bound.

Two issues worth addressing (inline comments posted):

rename["ncbiCollectionDate"] is an undocumented hard dependency — if a segmented organism's config omits this key, the script crashes with a KeyError. A .get() pattern would be safer.
min() over ncbiReleaseDate strings with potential empty values — empty strings sort before any ISO date, so a single segment with "" causes the whole group's ncbiReleaseDate to be set to "". In practice this may never happen (NCBI always populates this field for GenBank records), but worth guarding.

The anna-parker suggestion to use min instead of max is already applied in the current commit — that's the right call since this is used as a lower bound for earliestReleaseDate.

No concerns about the INTRINSICALLY_SEGMENT_SPECIFIC_FIELDS change or the Config.rename addition.

anna-parker changed the base branch from main to accept_ranges May 28, 2026 16:19

claude Bot added the ingest Ingest pipeline label May 28, 2026

anna-parker added the preview Triggers a deployment to argocd label May 28, 2026

Base automatically changed from accept_ranges to main May 29, 2026 07:00

feat: dont use releaseDate for heuristic grouping

210853b

anna-parker force-pushed the ingest_grouping_2 branch from 5686059 to 210853b Compare May 29, 2026 08:35

anna-parker mentioned this pull request May 29, 2026

Andes and BDBV improvements pathoplexus/pathoplexus#983

Draft

4 tasks

anna-parker force-pushed the ingest_grouping_2 branch from 7215e5b to 210853b Compare May 29, 2026 12:35

anna-parker changed the title ~~Ingest grouping 2~~ feat(ingest): do not use ncbi release data for assigning segment groups May 29, 2026

anna-parker marked this pull request as ready for review May 29, 2026 12:36

anna-parker marked this pull request as draft May 29, 2026 12:37

anna-parker commented May 29, 2026

View reviewed changes

Comment thread ingest/scripts/heuristic_group_segments.py Outdated

anna-parker commented May 29, 2026

View reviewed changes

Comment thread ingest/scripts/heuristic_group_segments.py Outdated

Apply suggestion from @anna-parker

77f1211

anna-parker marked this pull request as ready for review May 29, 2026 12:38

anna-parker requested review from corneliusroemer and theosanderson May 29, 2026 12:39

claude Bot reviewed May 29, 2026

View reviewed changes

Comment thread ingest/scripts/heuristic_group_segments.py

claude Bot reviewed May 29, 2026

View reviewed changes

Comment thread ingest/scripts/heuristic_group_segments.py Outdated

anna-parker commented May 29, 2026

View reviewed changes

Comment thread ingest/scripts/heuristic_group_segments.py Outdated

Apply suggestion from @anna-parker

0a93a9b

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(ingest): do not use ncbi release data for assigning segment groups#6527

feat(ingest): do not use ncbi release data for assigning segment groups#6527
anna-parker wants to merge 3 commits into
mainfrom
ingest_grouping_2

anna-parker commented May 28, 2026 •

edited by loculus-bot

Loading

Uh oh!

claude Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 29, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

anna-parker commented May 28, 2026 • edited by loculus-bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changes in CCHF

PR Checklist

Uh oh!

claude Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Uh oh!

Uh oh!

Uh oh!

claude Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

anna-parker commented May 28, 2026 •

edited by loculus-bot

Loading

claude Bot commented May 29, 2026 •

edited

Loading

claude Bot commented May 29, 2026 •

edited

Loading