Skip to content

feat(prepro): add function to process phenotype_values and extract regex#5885

Merged
anna-parker merged 13 commits intomainfrom
improve-nextclade-metadata
Feb 12, 2026
Merged

feat(prepro): add function to process phenotype_values and extract regex#5885
anna-parker merged 13 commits intomainfrom
improve-nextclade-metadata

Conversation

@anna-parker
Copy link
Copy Markdown
Contributor

@anna-parker anna-parker commented Jan 27, 2026

resolves #5902, #5901

Screenshot

This PR is currently being used on the private virus preview:

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by appropriate, automated tests.
  • Any manual testing that has been done is documented (i.e. what exactly was tested?)

🚀 Preview: Add preview label to enable

@claude claude Bot added the preprocessing Issues related to the preprocessing component label Jan 27, 2026
@anna-parker anna-parker changed the title feat: add function to process phenotype_values feat: add function to process phenotype_values and extract regex Jan 27, 2026
@anna-parker anna-parker force-pushed the restructure-anya branch 3 times, most recently from 03c0980 to 0deeeb0 Compare January 29, 2026 19:32
@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from bfea53c to a874d49 Compare January 30, 2026 11:04
Base automatically changed from restructure-anya to main January 30, 2026 15:28
@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from a874d49 to b62278b Compare January 30, 2026 19:59
@anna-parker anna-parker marked this pull request as ready for review February 2, 2026 13:31
@anna-parker anna-parker changed the title feat: add function to process phenotype_values and extract regex feat(prepro): add function to process phenotype_values and extract regex Feb 2, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented Feb 2, 2026

Claude finished @anna-parker's task in 1m 39s —— View job


Code Review Complete

I've reviewed the PR that adds functionality to process phenotype_values and extract regex patterns. Here's my assessment:

Summary

The implementation adds two useful preprocessing functions (extract_regex and implicit support for phenotype_values processing) that extend the metadata processing capabilities. The code is generally well-structured and follows existing patterns. However, there are several areas that need attention before merging.

Key Issues

Critical:

  1. Missing input validation in process_phenotype_values (preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py:1243) - The function doesn't validate that args contains the required "name" key, which could cause KeyError exceptions.

  2. Overly broad exception handling (line 1249) - Using bare except Exception could mask programming errors. Should be more specific about expected exceptions from ast.literal_eval.

Important:
3. Incomplete documentation - The process_phenotype_values function is used in the code but not documented in the README alongside the other preprocessing functions.

  1. Limited test coverage - The tests for process_phenotype_values (test_nextclade_preprocessing.py:1250) only cover the happy path. Missing tests for error cases, edge cases, and invalid inputs.

  2. Ambiguous behavior with duplicates (line 1246) - If multiple phenotype entries have the same name, the function silently returns the last match without warning.

Minor:
6. Type handling - The function converts all values to strings, which might lose type information (e.g., distinguishing between 0.0 and 0).

  1. Error messages could be more informative by including which phenotype name was being searched for.

Positive Aspects

  • Clean separation of concerns with the process_phenotype_values function
  • Good test coverage for the extract_regex functionality with both success and failure cases
  • Consistent code style matching the existing codebase
  • The regex extraction implementation is robust with proper error handling

Recommendations

  1. Add input validation to process_phenotype_values
  2. Make exception handling more specific
  3. Add comprehensive tests for edge cases and error conditions
  4. Document the process_phenotype_values function in the README
  5. Consider whether duplicate phenotype names should trigger a warning

The functionality is sound and the implementation follows good patterns, but the missing validation and documentation need to be addressed.


Comment thread preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py Outdated
Comment thread preprocessing/nextclade/tests/test_nextclade_preprocessing.py
Comment thread preprocessing/nextclade/README.md Outdated
@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from 2641a12 to d189ec4 Compare February 4, 2026 10:54
@anna-parker
Copy link
Copy Markdown
Contributor Author

Comment from @rneher in the private preview we use the extract_regex function for obtaining the subtype from the reference name and there it would be good to have the subtype capitalized.

I proposed adding this as an option to the function.

@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch 2 times, most recently from d950ab8 to 49895f3 Compare February 10, 2026 16:11
@theosanderson
Copy link
Copy Markdown
Member

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 49895f392a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py Outdated
Comment thread preprocessing/nextclade/src/loculus_preprocessing/processing_functions.py Outdated
@anna-parker anna-parker force-pushed the improve-nextclade-metadata branch from e44e92d to 38a842c Compare February 12, 2026 08:05
Copy link
Copy Markdown
Member

@theosanderson theosanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@anna-parker anna-parker merged commit ac106df into main Feb 12, 2026
41 checks passed
@anna-parker anna-parker deleted the improve-nextclade-metadata branch February 12, 2026 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

preprocessing Issues related to the preprocessing component

Projects

None yet

Development

Successfully merging this pull request may close these issues.

prepro nextclade: add function to extract/parse string of reference name

2 participants