refactor: update references from SEQREF to SEQRES and aa_sequence #79
Open
…PDB handling and samplesheet processing
Pull request overview
This pull request refactors the codebase to use consistent and accurate naming for sequence-related data. The main changes standardize column names and terminology throughout the dataset processing scripts to avoid confusion between amino-acid sequences and RefSeq protein accessions.
Changes:
- Renamed the `refseq` column to `aa_sequence` throughout the codebase to clearly identify amino-acid sequence data
- Updated all function names, variable names, and docstrings to reflect the new `aa_sequence` naming convention
- Standardized terminology from `SEQREF`/`REFSEQ` to `SEQRES` in comments and function names to match PDB file format specifications
- Added validation for required columns in custom MANE metadata files
- Fixed bugs in logging messages that referenced undefined variables
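The required-column validation mentioned above could look roughly like the sketch below. It is not the code from this PR: the helper names, the CSV layout, and the error message are assumptions made for illustration, and only the column names `sequence` and `aa_sequence` come from the change description.

```python
import csv
import io

# Column names taken from the PR description; everything else is illustrative.
REQUIRED_COLUMNS = ("sequence", "aa_sequence")


def missing_required_columns(fieldnames, required=REQUIRED_COLUMNS):
    """Return the required column names absent from a table header, in order."""
    present = set(fieldnames or [])
    return [name for name in required if name not in present]


def read_metadata_rows(handle, required=REQUIRED_COLUMNS):
    """Read a CSV metadata table, failing early if a required column is missing.

    Hypothetical helper: assumes the metadata file is comma-separated.
    """
    reader = csv.DictReader(handle)
    missing = missing_required_columns(reader.fieldnames, required)
    if missing:
        raise ValueError(
            "metadata file is missing required column(s): " + ", ".join(missing)
        )
    return list(reader)


if __name__ == "__main__":
    # Placeholder values; the real contents of these columns are not shown in the PR.
    table = io.StringIO("sequence,aa_sequence\nexample_id,MKTAYIAK\n")
    print(read_metadata_rows(table))
```

Checking the header before reading any rows means a file that still uses the old `refseq` column name fails with a message naming the missing column, rather than with a `KeyError` later in processing.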
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tools/preprocessing/update_samplesheet_and_structures.py | Renamed attach_refseq function to attach_aa_sequence, updated internal variable names and column references from refseq to aa_sequence |
| tools/preprocessing/prepare_samplesheet.py | Updated _parse_ncbi_mane_fasta to create DataFrame with aa_sequence column; updated write_fastas_and_update_sheet to use aa_sequence column |
| scripts/datasets/custom_pdb.py | Added validation for required columns (sequence and aa_sequence); updated column access to use aa_sequence; improved logging messages by removing references to undefined variables; updated comments to use SEQRES terminology |
| scripts/datasets/af_merge.py | Renamed add_refseq_record_to_pdb function to add_seqres_records_to_pdb; updated comments and docstrings to use SEQRES terminology; fixed typo in comment |
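For readers unfamiliar with the record type that `add_seqres_records_to_pdb` is named after, here is a minimal sketch of how SEQRES lines can be built from a one-letter amino-acid sequence. It follows the fixed-column SEQRES layout of the PDB file format (13 residue names per line) and is not the implementation in `af_merge.py`; the function name `format_seqres_records` and the default chain ID are hypothetical.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

RESIDUES_PER_RECORD = 13  # the PDB format puts at most 13 residue names on a SEQRES line


def format_seqres_records(aa_sequence, chain_id="A"):
    """Build SEQRES lines for one chain from a one-letter sequence.

    Hypothetical helper for illustration; raises KeyError on non-standard letters.
    """
    residues = [THREE_LETTER[aa] for aa in aa_sequence.upper()]
    records = []
    for start in range(0, len(residues), RESIDUES_PER_RECORD):
        chunk = residues[start:start + RESIDUES_PER_RECORD]
        serial = start // RESIDUES_PER_RECORD + 1
        # Columns: 1-6 record name, 8-10 serial, 12 chain, 14-17 residue count,
        # then residue names in four-column fields starting at column 20.
        line = f"SEQRES {serial:>3} {chain_id} {len(residues):>4}  " + " ".join(chunk)
        records.append(line.ljust(80))
    return records


if __name__ == "__main__":
    for record in format_seqres_records("GIVEQCCTSICSLYQLENYCN"):
        print(record.rstrip())
```

Every line of a chain repeats the total residue count, which is why the whole sequence has to be known before the first record is written.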
Summary
- Renamed the `refseq` column to `aa_sequence`.
- `refseq_prot` is unchanged (it remains the RefSeq protein accession).

Still missing a full run of tests, including:
- `update_samplesheet_and_structures.py`
- `build-datasets` run

Copilot
This pull request makes several important changes to standardize the naming of sequence columns and improve SEQRES record handling throughout the dataset processing scripts. The most significant updates include renaming the `refseq` column to `aa_sequence`, updating related functions and documentation, and enhancing error handling for missing metadata. Below are the key changes grouped by theme:

Column Renaming and Data Consistency
- Renamed the `refseq` column to `aa_sequence` throughout the codebase, including in DataFrame construction, metadata attachment, and FASTA file writing. This affects functions such as `_parse_ncbi_mane_fasta`, `write_fastas_and_update_sheet`, and `attach_aa_sequence`.
- Updated function names, variable names, and docstrings to follow the `aa_sequence` naming convention.

SEQRES Record Handling
- Replaced `REFSEQ` and `SEQREF` with `SEQRES` in comments, log messages, and function names.

Metadata Validation and Logging
- Added validation for custom MANE metadata files that flags when the required `sequence` or `aa_sequence` columns are missing.

These changes collectively improve clarity, maintain consistency across scripts, and ensure robust handling of sequence and metadata information.
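Samplesheets written before this rename may still carry the old `refseq` header. The sketch below shows one way such a header could be migrated while leaving `refseq_prot` untouched. It is an illustration only and is not part of this PR; the helper name and the `sample` column used in the example are hypothetical.

```python
def rename_legacy_header(fieldnames, old="refseq", new="aa_sequence"):
    """Return a copy of a samplesheet header with the legacy column renamed.

    Only an exact match on `old` is renamed, so `refseq_prot` is left alone.
    Hypothetical helper for illustration.
    """
    if old in fieldnames and new in fieldnames:
        # Refuse to guess which column holds the amino-acid sequence.
        raise ValueError(f"header contains both {old!r} and {new!r}")
    return [new if name == old else name for name in fieldnames]


if __name__ == "__main__":
    print(rename_legacy_header(["sample", "refseq", "refseq_prot"]))
```

Matching on the exact column name rather than a substring is the design point here: a substring replace would also rewrite `refseq_prot`, which this PR deliberately leaves unchanged.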