feat(ingest): do not use ncbi release data for assigning segment groups #6527
anna-parker wants to merge 3 commits
Claude finished @anna-parker's task in 2m 32s

Code Review
The change is small and well-motivated. The core idea is correct:

A few issues to consider:

1.

    values_copy[rename["ncbiCollectionDate"]] = (
        values_copy[rename["ncbiCollectionDate"]] or values_copy["ncbiReleaseDate"]
    )

If any segmented organism's config doesn't include

    renamed_date_key = rename.get("ncbiCollectionDate")
    if renamed_date_key and renamed_date_key in values_copy:
        values_copy[renamed_date_key] = values_copy[renamed_date_key] or values_copy.get("ncbiReleaseDate")

2.

    row["ncbiReleaseDate"] = max(
        segment_metadata[group[segment]]["ncbiReleaseDate"] for segment in group
    )

ISO 8601 strings compare lexicographically correctly, but if any value is

    dates = [segment_metadata[group[s]].get("ncbiReleaseDate") for s in group]
    row["ncbiReleaseDate"] = max((d for d in dates if d), default=None)

3. Docstring is stale (line 44)

4. No unit tests for the new fallback behavior

The integration test (

This is minor for a focused ingest fix, but would help document the intended semantics.

Overall this is a clean fix for a real problem and the approach is sound.
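A unit test for the fallback behavior mentioned in point 4 could look roughly like the sketch below. It wraps the guarded assignment suggested in point 1 in a standalone function so it can be tested in isolation. The function name `apply_collection_date_fallback`, the `RENAME` mapping and the target field name `sampleCollectionDate` are assumptions made for illustration and are not taken from the PR.

```python
from typing import Any


def apply_collection_date_fallback(
    values: dict[str, Any], rename: dict[str, str]
) -> dict[str, Any]:
    """Hypothetical helper (not from the PR): fill an empty renamed
    collection date from ncbiReleaseDate, using the guarded form from point 1."""
    values_copy = dict(values)
    renamed_date_key = rename.get("ncbiCollectionDate")
    # Only touch the row if the config renames the field and the key exists.
    if renamed_date_key and renamed_date_key in values_copy:
        values_copy[renamed_date_key] = (
            values_copy[renamed_date_key] or values_copy.get("ncbiReleaseDate")
        )
    return values_copy


# Assumed rename target; the real config may use a different field name.
RENAME = {"ncbiCollectionDate": "sampleCollectionDate"}


def test_empty_collection_date_falls_back_to_release_date() -> None:
    row = {"sampleCollectionDate": "", "ncbiReleaseDate": "2024-01-15"}
    result = apply_collection_date_fallback(row, RENAME)
    assert result["sampleCollectionDate"] == "2024-01-15"


def test_existing_collection_date_is_kept() -> None:
    row = {"sampleCollectionDate": "2023-05-01", "ncbiReleaseDate": "2024-01-15"}
    result = apply_collection_date_fallback(row, RENAME)
    assert result["sampleCollectionDate"] == "2023-05-01"


def test_missing_rename_entry_leaves_row_unchanged() -> None:
    row = {"ncbiReleaseDate": "2024-01-15"}
    assert apply_collection_date_fallback(row, {}) == row
```

The third test covers the case raised in point 1, where a config has no rename entry for `ncbiCollectionDate` and the unguarded version would raise a `KeyError`.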
Claude finished @anna-parker's task in 2m 53s

Code Review
The change is well-motivated and the approach is sound: removing

Two issues worth addressing (inline comments posted):

No concerns about the
See @theosanderson's report https://gist.github.com/theosanderson-agent/ded53e709683a52063233c174aa7c0c0 for why this would be good to have for Andes virus.
Use the NCBI release date as a segment-specific field in ingest when grouping, but still keep only one release date per segment group. This date is used for calculating the earliestReleaseDate, so making it segment-specific would be a massive overhaul (see #6390 for a starting attempt in this direction).
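As a rough illustration of "one release date per segment group", the sketch below derives a single group-level value from the per-segment dates, following the `max(...)` expression quoted in the review. The function name `group_release_date` and the accession strings are made up for the example. It relies on the dates being ISO 8601 strings (`YYYY-MM-DD`), which sort chronologically when compared as plain strings.

```python
from typing import Optional


def group_release_date(
    segment_metadata: dict[str, dict[str, Optional[str]]],
    group: dict[str, str],
) -> Optional[str]:
    """Hypothetical helper: return the latest ncbiReleaseDate among the
    segments of a group, ignoring segments without a release date."""
    dates = [segment_metadata[group[s]].get("ncbiReleaseDate") for s in group]
    # Skip None/empty values; fall back to None if no segment has a date.
    return max((d for d in dates if d), default=None)


# Placeholder accessions, one per segment.
segment_metadata = {
    "ACC_L": {"ncbiReleaseDate": "2023-05-01"},
    "ACC_M": {"ncbiReleaseDate": "2024-01-15"},
    "ACC_S": {"ncbiReleaseDate": None},
}
group = {"L": "ACC_L", "M": "ACC_M", "S": "ACC_S"}

row = {"ncbiReleaseDate": group_release_date(segment_metadata, group)}
print(row)  # {'ncbiReleaseDate': '2024-01-15'}
```

Each segment keeps its own release date in `segment_metadata`, while the grouped row carries only the combined value.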
Changes in CCHF
Based on Claude's summary of the output from running the diff script.
Previously separate single-segment records are combined into groups:
Previously grouped records (only L and M, interestingly) are split into individual segment entries. These all seem to be from the same submitting group:
Stavropol State Research Anti-Plague Institute, Laboratory for diagnostics of viral infections

For example, the first two examples (https://pathoplexus.org/cchf/search?visibility_specimenCollectorSampleId=true&specimenCollectorSampleId=229-ARM-TI-2023) have the same specimenCollectorSampleId and metadata (only the release date was used to split), so it actually does make sense to split them up:
PR Checklist
🚀 Preview: https://ingest-grouping-2.loculus.org