Batch Ensembl CDS, stabilize BioMart downloads, and improve PAE handling#80

Open
St3451 wants to merge 58 commits into master from dev/fix_build_datasets

Conversation

St3451 (Collaborator) commented Feb 16, 2026

Summary

This PR makes the build pipeline faster and more reliable under current Ensembl/BioMart/AF DB constraints. Ensembl CDS is now batched, BioMart downloads have fallbacks and clearer logs, and PAE handling is robust to missing AF DB versions with a new --custom_pae_dir. MANE builds are explicitly pinned to AF DB v4 for consistency, and several edge‑case crashes plus doc/debug ergonomics were cleaned up. Default AF DB version for non‑MANE builds is now v6 (latest at the time of this PR). Backtranseq batching/retry/timeouts were hardened to avoid multi‑hour hangs.

Issue 1: Ensembl CDS retrieval speed

  • What: Switched CDS retrieval to batched POST requests (50 IDs/request) with 429 handling.
  • Why: Per‑transcript GET became too slow and rate‑limited.
  • How:
    • Batch ID POST to /sequence/id; retry on 429; keep result order.
    • Batch requests are capped to 8 workers; missing IDs are retried individually in parallel (up to 8) after the batch pass.
  • Notes:
    • Short CDS are now treated as missing (NA) to avoid invalid contexts.
    • Batch retries are now capped at 8 attempts (an earlier version of this note said 10).
    • Failed batch IDs are retried one‑by‑one before being moved to non‑MANE.
    • Added debug log for missing custom ENSP IDs not present in MANE summary.
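
To illustrate the batching and 429 handling described above, here is a minimal sketch, not the code in this PR. The `send` callable is a hypothetical transport that would wrap the POST to /sequence/id and return `(status, retry_after_seconds, {id: sequence})`; the worker pool and the one-by-one retry of missing IDs are left out for brevity.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

BATCH_SIZE = 50    # IDs per POST request, as stated in the description
MAX_ATTEMPTS = 8   # retry budget per batch, as stated in the notes

# Hypothetical transport signature (an assumption for this sketch).
Sender = Callable[[List[str]], Tuple[int, float, Dict[str, str]]]


def chunk(ids: List[str], size: int = BATCH_SIZE) -> List[List[str]]:
    """Split the ID list into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def fetch_cds(ids: List[str],
              send: Sender,
              sleep: Callable[[float], None] = time.sleep) -> List[Optional[str]]:
    """Fetch sequences batch by batch; the result follows the input order."""
    found: Dict[str, str] = {}
    for batch in chunk(ids):
        for _ in range(MAX_ATTEMPTS):
            status, retry_after, seqs = send(batch)
            if status == 429:
                # Rate limited: wait as instructed, then retry the same batch.
                sleep(retry_after)
                continue
            if status == 200:
                found.update(seqs)
            break  # success or a non-retryable status: move to the next batch
    # IDs the server did not return are reported as None.
    return [found.get(i) for i in ids]
```

Keeping the transport injectable makes the retry logic testable without calling the Ensembl REST service.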

Issue 2: BioMart metadata download instability

  • What: Added an archive‑to‑latest fallback and a Python downloader fallback; improved logs.
  • Why: Archive endpoints sometimes return 500/timeout.
  • How:
    • Retry loops with stderr capture, start‑attempt logs, --no-hsts, and SSL verify for HTTPS fallback.
    • Cap download segments at 8 in download_single_file().
  • Notes: If downloads fail, canonical transcript prioritization is skipped (CDS obtained from Proteins API) and the build continues.
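
The retry-then-fallback flow can be sketched as follows. This is an illustration under assumptions rather than the actual download_biomart_metadata code: `download_with_fallback`, the `fetch` callable, and the default of 3 attempts are hypothetical names and values.

```python
from typing import Callable, Optional, Sequence


def download_with_fallback(sources: Sequence[str],
                           fetch: Callable[[str], bytes],
                           attempts: int = 3,
                           log: Callable[[str], None] = print) -> Optional[bytes]:
    """Try each source in order (e.g. archive first, then latest).

    Each source gets `attempts` tries before the next one is used.
    Returns None when every source fails so the caller can skip the
    optional step instead of aborting the build.
    """
    for url in sources:
        for attempt in range(1, attempts + 1):
            log(f"Attempt {attempt}/{attempts}: {url}")
            try:
                return fetch(url)
            except Exception as exc:  # sketch only; narrow this in real code
                log(f"Failed: {exc}")
    return None
```

Returning None instead of raising matches the note above: a failed metadata download is not fatal, it only disables canonical transcript prioritization.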

Issue 3: PAE availability and custom input

  • What: Added --custom_pae_dir; skip download when 10 consecutive 404/410s detected.
  • Why: PAE URLs for older AF versions are no longer hosted.
  • How: Copy provided directory into pae/; probe first 10 IDs sequentially, then parallel download.
  • Notes: If PAE is missing, pCMAPs fall back to binary contact maps.
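
A minimal sketch of the sequential probe, assuming a hypothetical `status_of` callable that returns the HTTP status code for one ID. How the real code treats lists shorter than 10 IDs is not stated here, so this sketch simply declines to conclude in that case.

```python
from typing import Callable, Sequence

PROBE_N = 10        # number of IDs probed sequentially, as described above
GONE = {404, 410}   # status codes treated as "not hosted"


def pae_seems_unavailable(ids: Sequence[str],
                          status_of: Callable[[str], int]) -> bool:
    """Return True when the first PROBE_N IDs all answer 404/410."""
    probe = list(ids[:PROBE_N])
    if len(probe) < PROBE_N:
        # Assumption: too few IDs to conclude anything, so do not skip.
        return False
    for uid in probe:
        if status_of(uid) not in GONE:
            return False  # at least one file exists; proceed to download
    return True
```

When the probe returns True the parallel download can be skipped, and pCMAPs fall back to binary contact maps as noted above.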

Issue 4: MANE + AF version consistency

  • What: Force af_version=4 when --mane/--mane_only is used.
  • Why: MANE structures are only available from AF DB v4.
  • How:
    • Override version early with a warning and reuse across the build.
    • AF fragment merge fix: merge_af_fragments now receives af_version (fixes v4 hardcode when default is v6).
  • Notes: The default non‑MANE --af_version is now 6.
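
The early override can be pictured with the sketch below. `resolve_af_version` is a hypothetical helper written for illustration; only the values (v4 for MANE, v6 as the default) come from this description.

```python
import logging

logger = logging.getLogger(__name__)

MANE_AF_VERSION = 4      # MANE structures are only available from AF DB v4
DEFAULT_AF_VERSION = 6   # default for non-MANE builds


def resolve_af_version(af_version: int = DEFAULT_AF_VERSION,
                       mane: bool = False,
                       mane_only: bool = False) -> int:
    """Decide the AF DB version once, before any download step runs."""
    if (mane or mane_only) and af_version != MANE_AF_VERSION:
        logger.warning("MANE build requested: overriding af_version %s with %s",
                       af_version, MANE_AF_VERSION)
        return MANE_AF_VERSION
    return af_version
```

Resolving the version once and passing the result down (including to the fragment merge step) avoids a hardcoded value drifting out of sync with the default.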

Issue 5: Backtranseq robustness

  • What: Reduce batch size to 100, add non‑200 logging, add 45‑min total timeout with max 5 retries per batch.
  • Why: Large single batches were timing out/hanging with no exit path.
  • How: Cap batch size; add HTTP logging; retry with bounded time; return NaN on failure.
  • Notes: Failures now surface earlier but do not block the build.
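
As a sketch of the bounded retry, assuming the 45‑minute budget applies to one batch call: `run_batch` and the `submit` callable (returning `(http_status, payload)`) are hypothetical names, not the functions in this PR.

```python
import math
import time
from typing import Any, Callable, Tuple


def run_batch(submit: Callable[[], Tuple[int, Any]],
              max_retries: int = 5,
              total_timeout_s: float = 45 * 60,
              clock: Callable[[], float] = time.monotonic,
              log: Callable[[str], None] = print) -> Any:
    """Submit one batch with a bounded number of retries and a time budget.

    Returns the payload on HTTP 200, or NaN once retries or time run out,
    so a failed batch never blocks the rest of the build.
    """
    deadline = clock() + total_timeout_s
    for attempt in range(1, max_retries + 1):
        if clock() >= deadline:
            log("Time budget exhausted; giving up on this batch")
            break
        status, payload = submit()
        if status == 200:
            return payload
        log(f"Non-200 response ({status}) on attempt {attempt}/{max_retries}")
    return math.nan
```

Both limits are needed: the retry cap bounds fast failures, while the deadline bounds slow ones.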

Issue 6: Sequence/PDB hygiene & mapping

  • SEQRES guard: skip SEQRES insertion if already present (custom PDBs + fragment merge) and log skipped counts.
  • Custom PDB copy logging: info‑level summary of how many custom PDBs were copied (and skipped invalid filenames); debug count for SEQRES insertions.
  • Proteins API guard: skip ENSP IDs when querying Proteins API to go straight to Backtranseq.
  • Custom MANE symbol propagation: custom samplesheet symbol/gene now fills ENSP‑only entries; debug log warns if custom ENSPs aren’t in MANE summary.
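
The SEQRES guard amounts to an idempotent insert. The sketch below is illustrative only: `insert_seqres_if_missing` is a hypothetical helper, and placing the records just before the first coordinate record is an assumption about the layout.

```python
from typing import List, Sequence, Tuple

# Record names that start the coordinate section (assumption for this sketch).
COORD_PREFIXES = ("MODEL", "ATOM", "HETATM")


def insert_seqres_if_missing(pdb_lines: Sequence[str],
                             seqres_lines: Sequence[str]) -> Tuple[List[str], bool]:
    """Insert SEQRES records unless the file already has them.

    Returns (lines, inserted) so the caller can count skipped files.
    """
    if any(line.startswith("SEQRES") for line in pdb_lines):
        return list(pdb_lines), False
    # Index of the first coordinate record, or end of file if there is none.
    idx = next((i for i, line in enumerate(pdb_lines)
                if line.startswith(COORD_PREFIXES)), len(pdb_lines))
    merged = list(pdb_lines[:idx]) + list(seqres_lines) + list(pdb_lines[idx:])
    return merged, True
```

Calling it twice on the same file leaves the second result unchanged, which is the property that matters for custom PDBs and merged fragments.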

Issue 7: Developer UX & tooling

  • Fix prepare_samplesheet.py so it runs directly from its folder (adds repo root to sys.path, avoiding ModuleNotFoundError: scripts).
  • Preprocessing tool behavior: update_samplesheet_and_structures.py always adds symbol to final bundles; samplesheet.csv is kept clean (no symbol/CGC/length). --include-metadata now only adds CGC/length.
  • Removed hardcoded local paths from build_datasets.py and kept CLI‑only guard.
  • Add empty‑df checks, NA guards, doc fixes, new launch configs.
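
For the sys.path fix in the first bullet, a minimal sketch of the idea follows. `ensure_on_sys_path` is a hypothetical helper; how many directory levels separate prepare_samplesheet.py from the repo root is not stated here, so the caller has to supply the root.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(root: Path) -> bool:
    """Put `root` at the front of sys.path once; return True if it was added."""
    root_str = str(root.resolve())
    if root_str in sys.path:
        return False
    sys.path.insert(0, root_str)
    return True
```

A script would call this with its repo root (derived from `Path(__file__)`) before importing the `scripts` package.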

Copilot AI review requested due to automatic review settings February 16, 2026 15:22
@St3451 St3451 changed the title Dev/fix build datasets Fix issue in build-datasets Feb 16, 2026
Copilot AI left a comment

Pull request overview

This pull request enhances the robustness and monitoring of dataset building operations in scripts/datasets/seq_for_mut_prob.py. It introduces fallback mechanisms for BioMart metadata downloads and adds progress tracking for Ensembl CDS retrieval, improving reliability when external services are unavailable or slow.

Changes:

  • Refactored download_biomart_metadata to include retry logic, fallback from archive to latest Ensembl server, and Python-based downloader when wget is unavailable
  • Added progress monitoring with tqdm for Ensembl CDS sequence retrieval
  • Simplified multiprocessing by removing unnecessary wrapper function

@St3451 St3451 requested a review from Copilot February 17, 2026 00:50
@St3451 St3451 changed the title Fix issue in build-datasets Speed up Ensembl CDS retrieval with batched REST requests AND add BioMart download fallbacks Feb 17, 2026
Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

scripts/datasets/seq_for_mut_prob.py:956

  • This branch is effectively unreachable and sys.exit() is risky inside a library-style helper (and especially problematic under multiprocessing). r.raise_for_status() will raise for 4xx/5xx, so status = "ERROR"; sys.exit() won’t run; if the logic changes later it could unexpectedly terminate the whole process. Consider removing this block and handling non-OK responses via exceptions/retries and returning np.nan on terminal failure.
            if not r.ok:
                r.raise_for_status()

                status = "ERROR"
                sys.exit()

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Comments suppressed due to low confidence (2)

scripts/datasets/seq_for_mut_prob.py:1456

  • The main block (lines 1446-1456) contains hardcoded local paths for testing, which should not be in production code. While the PR description mentions removing hardcoded paths from build_datasets.py and adding a CLI-only guard, this file still has test code. Consider removing this test code or replacing it with a similar guard that directs users to use the CLI.
if __name__ == "__main__":
    raise SystemExit(
        "This module is intended to be used via the CLI: `oncodrive3d build-datasets`."
    )

scripts/datasets/seq_for_mut_prob.py:1237

  • The function signature changed to remove ens_canonical_transcripts_lst, custom_mane_metadata_path, and mane_version parameters, and add mane_only parameter. The docstring should be updated to reflect these changes and explain the new behavior, particularly around how mane_only affects the filtering of non-MANE sequences.
                        uniprot_to_gene_dict,
                        mane_mapping,
                        mane_mapping_not_af,
                        mane_only=False,
                        num_cores=1):
    """
    Retrieve DNA sequence and tri-nucleotide context
    for each structure in the initialized dataframe
    prioritizing MANE associated structures and metadata.

    Reference_info labels:
        1  : Transcript ID, exons coord, seq DNA obtained from Proteins API
        0  : Transcript ID retrieved from MANE and seq DNA from Ensembl
        -1 : Not available transcripts, seq DNA retrieved from Backtranseq API
    """

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

scripts/datasets/get_pae.py:63

  • If the response is HTTP 200 but the body doesn’t match the expected JSON pattern (content.endswith(b'}]')), status remains INIT, which causes the loop to immediately retry without the 30s backoff. Set status = "ERROR" (or otherwise sleep) when content validation fails to avoid tight retry loops and rate-limiting.
            content = response.content
            if content.endswith(b'}]') and not content.endswith(b'</Error>'):
                with open(file_path, 'wb') as output_file:
                    output_file.write(content)
                status = "FINISHED"
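
One way to avoid the tight retry loop described in this comment is sketched below. It is an illustration, not the code in get_pae.py: `fetch_pae` and the `get` callable (returning `(status_code, content)`) are hypothetical, and the 30 s backoff is taken from the comment.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_pae(get: Callable[[], Tuple[int, bytes]],
              max_attempts: int = 3,
              backoff_s: float = 30.0,
              sleep: Callable[[float], None] = time.sleep) -> Optional[bytes]:
    """Download one PAE file, backing off after every failed attempt."""
    for attempt in range(1, max_attempts + 1):
        code, content = get()
        if code == 200 and content.endswith(b"}]"):
            return content
        # A 200 with an unexpected body is treated like any other error,
        # so the next attempt is always preceded by the backoff.
        if attempt < max_attempts:
            sleep(backoff_s)
    return None
```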

@@ -224,12 +224,17 @@ def get_pdb_seqres_records(lst_res):
def add_refseq_record_to_pdb(path_structure):
    """
    Add the SEQREF records to the pdb file.
Copilot AI commented Feb 19, 2026

The docstring says “Add the SEQREF records”, but the code is inserting SEQRES records. Update the docstring to match the PDB record type to avoid confusion.

Suggested change
- Add the SEQREF records to the pdb file.
+ Add the SEQRES records to the pdb file.

samplesheet = attach_metadata(samplesheet, metadata_map)
samplesheet.to_csv(paths.samplesheet_path, index=False)

metadata_for_outputs = metadata_map or symbol_map
Copilot AI commented Feb 19, 2026

metadata_for_outputs = metadata_map or symbol_map will raise ValueError: The truth value of a DataFrame is ambiguous when metadata_map is a DataFrame. Use an explicit is not None check (e.g., choose metadata_map if it’s not None, otherwise symbol_map).

Suggested change
- metadata_for_outputs = metadata_map or symbol_map
+ metadata_for_outputs = metadata_map if metadata_map is not None else symbol_map
