refactor: update references from SEQREF to SEQRES and aa_sequence by St3451 · Pull Request #79 · bbglab/oncodrive3d

St3451 · 2026-02-15T04:07:16Z

Summary

Replace the misleading refseq metadata column (previously used for amino-acid sequences) with a single consistent name: aa_sequence.
Keep refseq_prot unchanged (it remains the RefSeq protein accession).

A full test run is still pending, including:

update_samplesheet_and_structures.py
build-datasets
run

Copilot

This pull request makes several important changes to standardize the naming of sequence columns and improve SEQRES record handling throughout the dataset processing scripts. The most significant updates include renaming the refseq column to aa_sequence, updating related functions and documentation, and enhancing error handling for missing metadata. Below are the key changes grouped by theme:

Column Renaming and Data Consistency

Renamed the refseq column to aa_sequence throughout the codebase, including in DataFrame construction, metadata attachment, and FASTA file writing. This affects functions such as _parse_ncbi_mane_fasta, write_fastas_and_update_sheet, and attach_aa_sequence [1] [2] [3] [4] [5] [6].
Updated function and variable names, as well as docstrings, to reflect the new aa_sequence naming convention [1] [2].
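For anyone holding metadata produced before this change, the rename can be sketched as below. This is a minimal illustration only: `rename_legacy_column` is a hypothetical helper that is not part of the PR, and it assumes the metadata has been loaded as a list of dict rows. Note that `refseq_prot` is left untouched, matching the summary above.

```python
def rename_legacy_column(rows, old="refseq", new="aa_sequence"):
    """Return copies of `rows` with the key `old` renamed to `new`.

    Hypothetical helper for illustration; other keys (for example
    `refseq_prot`) are copied through unchanged.
    """
    renamed = []
    for row in rows:
        if old in row and new in row:
            # Refuse to silently overwrite an existing aa_sequence value.
            raise ValueError(f"Row already has both '{old}' and '{new}'")
        renamed.append(
            {(new if key == old else key): value for key, value in row.items()}
        )
    return renamed


legacy_rows = [{"gene": "TP53", "refseq": "MEEPQ", "refseq_prot": "NP_000537.3"}]
updated_rows = rename_legacy_column(legacy_rows)
print(updated_rows)
```

The input rows are not modified, so the legacy table remains available for comparison.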

SEQRES Record Handling

Standardized terminology and function names related to SEQRES records in PDB files, replacing REFSEQ and SEQREF with SEQRES in comments, log messages, and function names [1] [2] [3] [4].
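As background on the term, SEQRES is the PDB record type that lists a chain's residues, 13 three-letter codes per line. The sketch below shows how such records could be formatted from a one-letter amino-acid sequence. It is an assumption-laden illustration, not the project's `add_seqres_records_to_pdb`: the function name `format_seqres_records`, the default chain ID, and the `UNK` fallback for unrecognized letters are all hypothetical choices.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
ONE_TO_THREE = {
    "A": "ALA", "R": "ARG", "N": "ASN", "D": "ASP", "C": "CYS",
    "Q": "GLN", "E": "GLU", "G": "GLY", "H": "HIS", "I": "ILE",
    "L": "LEU", "K": "LYS", "M": "MET", "F": "PHE", "P": "PRO",
    "S": "SER", "T": "THR", "W": "TRP", "Y": "TYR", "V": "VAL",
}

RESIDUES_PER_RECORD = 13  # the PDB format allows 13 residue names per SEQRES line


def format_seqres_records(aa_sequence, chain_id="A"):
    """Format a one-letter sequence as a list of PDB SEQRES lines.

    Hypothetical helper; letters outside the standard 20 map to 'UNK'.
    """
    residues = [ONE_TO_THREE.get(letter, "UNK") for letter in aa_sequence.upper()]
    n_res = len(residues)
    records = []
    for start in range(0, n_res, RESIDUES_PER_RECORD):
        serial = start // RESIDUES_PER_RECORD + 1
        chunk = " ".join(residues[start:start + RESIDUES_PER_RECORD])
        # Columns: 1-6 record name, 8-10 serial, 12 chain, 14-17 residue count,
        # residue names from column 20 onward.
        records.append(f"SEQRES {serial:>3} {chain_id} {n_res:>4}  {chunk}")
    return records


for line in format_seqres_records("MKTAYIAKQRQISFVK"):
    print(line)
```

Every record repeats the total residue count of the chain, so a reader can tell the chain length from any single SEQRES line.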

Metadata Validation and Logging

Added validation for required columns in custom MANE metadata files, raising a clear error if sequence or aa_sequence columns are missing.
Improved logging messages to clarify when SEQRES insertion is skipped due to missing or unavailable metadata.
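The column check described above might look roughly like the following. This is a sketch under stated assumptions, not the code in `custom_pdb.py`: the function name `validate_required_columns`, the choice of `ValueError`, and the wording of the message are hypothetical. Only the two required column names, `sequence` and `aa_sequence`, come from the PR description.

```python
REQUIRED_COLUMNS = ("sequence", "aa_sequence")


def validate_required_columns(columns, required=REQUIRED_COLUMNS):
    """Raise ValueError naming every required column absent from `columns`.

    Hypothetical helper; `columns` is any iterable of column names, such as
    the header row of a custom MANE metadata file.
    """
    present = set(columns)
    missing = [name for name in required if name not in present]
    if missing:
        raise ValueError(
            "Custom MANE metadata is missing required column(s): "
            + ", ".join(missing)
        )


# A header with both columns passes silently.
validate_required_columns(["gene", "sequence", "aa_sequence"])

# A header that still uses the old name fails with an explicit message.
try:
    validate_required_columns(["gene", "sequence", "refseq"])
except ValueError as error:
    print(error)
```

Failing early with the missing names listed makes it obvious when a metadata file still uses the pre-rename column.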

These changes collectively improve clarity, maintain consistency across scripts, and ensure robust handling of sequence and metadata information.


Copilot

Pull request overview

This pull request refactors the codebase to use consistent and accurate naming for sequence-related data. The main changes standardize column names and terminology throughout the dataset processing scripts to avoid confusion between amino-acid sequences and RefSeq protein accessions.

Changes:

Renamed the refseq column to aa_sequence throughout the codebase to clearly identify amino-acid sequence data
Updated all function names, variable names, and docstrings to reflect the new aa_sequence naming convention
Standardized terminology from SEQREF/REFSEQ to SEQRES in comments and function names to match PDB file format specifications
Added validation for required columns in custom MANE metadata files
Fixed bugs in logging messages that referenced undefined variables

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| tools/preprocessing/update_samplesheet_and_structures.py | Renamed `attach_refseq` function to `attach_aa_sequence`; updated internal variable names and column references from `refseq` to `aa_sequence` |
| tools/preprocessing/prepare_samplesheet.py | Updated `_parse_ncbi_mane_fasta` to create DataFrame with `aa_sequence` column; updated `write_fastas_and_update_sheet` to use `aa_sequence` column |
| scripts/datasets/custom_pdb.py | Added validation for required columns (`sequence` and `aa_sequence`); updated column access to use `aa_sequence`; improved logging messages by removing references to undefined variables; updated comments to use SEQRES terminology |
| scripts/datasets/af_merge.py | Renamed `add_refseq_record_to_pdb` function to `add_seqres_records_to_pdb`; updated comments and docstrings to use SEQRES terminology; fixed typo in comment |


Commit 5b2880e: refactor: update references from SEQREF to SEQRES and aa_sequence in PDB handling and samplesheet processing

Copilot AI review requested due to automatic review settings February 15, 2026 04:07

Copilot started reviewing on behalf of St3451 February 15, 2026 04:07

Copilot AI reviewed Feb 15, 2026

St3451 wants to merge 1 commit into master from refactor/aa-sequence-column
