[Security] SSRF + path traversal chain in bio-research ncbi_utils.py and sra_geo_fetch.py #166

@Aravindargutus

Description

The bio-research plugin's Python scripts have two defense-in-depth concerns, plus one informational hygiene note, in how they query external APIs and download FASTQ data.

Severity: Low-Medium (not immediately exploitable, but worth hardening)

Issue 1: HTTP Protocol Downgrade on FASTQ Downloads (Medium)

File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (line 343)

The ENA API is queried over HTTPS (line 314), but the actual FASTQ file downloads are forced to unencrypted HTTP:

# Line 343 — FTP paths from ENA converted to HTTP (not HTTPS)
urls = [f"http://{url}" for url in ftp_urls.split(';') if url]

A real ENA response returns values like ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_1.fastq.gz, which becomes http://ftp.sra.ebi.ac.uk/....

Impact: FASTQ downloads (often multi-GB) happen over unencrypted HTTP. A network-level attacker could modify file contents in transit. While genomic data isn't secret, integrity matters for research reproducibility.

Fix: Change http:// to https:// on line 343. ENA supports HTTPS downloads.
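A minimal sketch of the change, assuming the `fastq_ftp` value is a semicolon-separated string of scheme-less paths as in the example above. The function name `build_fastq_urls` is hypothetical and does not exist in ncbi_utils.py; it only isolates the list comprehension from line 343 so the behavior is easy to check.

```python
from typing import List


def build_fastq_urls(ftp_urls: str) -> List[str]:
    """Convert ENA's semicolon-separated fastq_ftp value into HTTPS URLs.

    Hypothetical helper: mirrors the list comprehension on line 343,
    with the scheme changed from http:// to https://.
    """
    # Skip empty segments (e.g. a trailing ';') and strip stray whitespace.
    return [f"https://{url.strip()}" for url in ftp_urls.split(";") if url.strip()]


if __name__ == "__main__":
    sample = (
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_1.fastq.gz;"
        "ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_2.fastq.gz"
    )
    for u in build_fastq_urls(sample):
        print(u)
```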

Issue 2: No Domain Validation on Download URLs (Low)

File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (lines 338-344)

The fastq_ftp field from the ENA API response is used to construct download URLs without validating that they point to known ENA/NCBI domains:

# Lines 338-344
ftp_urls = fields[ftp_idx]
if ftp_urls:
    urls = [f"http://{url}" for url in ftp_urls.split(';') if url]
    fastq_urls[srr] = urls

These URLs are then passed to download_file() which streams the response body to disk via requests.get(url, stream=True).

Impact: If the ENA API were ever compromised or its response tampered with, the code would fetch from arbitrary URLs and write content to disk. This is a defense-in-depth concern — the ENA query itself is over HTTPS (line 314), so MITM is not trivial.

Fix: Validate that download URLs match expected ENA domains (e.g., *.ebi.ac.uk, ftp.sra.ebi.ac.uk) before fetching.
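One possible shape for the check, as a sketch only. The names `ALLOWED_HOST_SUFFIXES` and `is_allowed_download_url` are hypothetical, and the allowlist content (just `.ebi.ac.uk`) is an assumption that maintainers would need to adjust. The check parses the URL and compares the hostname rather than doing a substring match, so inputs like `ebi.ac.uk.evil.example` or `ftp.sra.ebi.ac.uk@evil.example` are rejected.

```python
from urllib.parse import urlsplit

# Assumption: only EBI-hosted downloads are expected. Adjust as needed.
ALLOWED_HOST_SUFFIXES = (".ebi.ac.uk",)


def is_allowed_download_url(url: str) -> bool:
    """Return True only for HTTPS URLs whose host is on the allowlist."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # hostname excludes any userinfo and port, and is already lowercased.
    host = parts.hostname or ""
    return any(
        host == suffix.lstrip(".") or host.endswith(suffix)
        for suffix in ALLOWED_HOST_SUFFIXES
    )


if __name__ == "__main__":
    candidates = [
        "https://ftp.sra.ebi.ac.uk/vol1/fastq/SRR635/000/SRR6357070/SRR6357070_1.fastq.gz",
        "https://evil.example/payload.fastq.gz",
    ]
    for c in candidates:
        print(c, is_allowed_download_url(c))
```

Such a check would run on each URL before it reaches download_file(). Note that requests follows redirects by default, so validating only the initial URL does not cover redirect targets.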

Issue 3: Missing URL Encoding on API Parameters (Informational)

File: bio-research/skills/nextflow-development/scripts/utils/ncbi_utils.py (lines 99, 156, 212, 314)

User-supplied geo_id is interpolated into API URLs without urllib.parse.quote():

search_url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gds&term={geo_id}[Accession]&retmode=json"

Since this is a CLI tool where the user provides their own arguments, this is not exploitable in practice — but URL encoding is good hygiene.
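For illustration, a sketch of how the search URL could be built with urllib.parse.quote(). The helper name `build_geo_search_url` is hypothetical; only the endpoint and query parameters are taken from the line quoted above.

```python
from urllib.parse import quote

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(geo_id: str) -> str:
    """Build the esearch URL with the search term percent-encoded."""
    # safe="" also encodes '/', '&' and '=', so geo_id cannot inject
    # extra path segments or query parameters.
    term = quote(f"{geo_id}[Accession]", safe="")
    return f"{EUTILS_ESEARCH}?db=gds&term={term}&retmode=json"


if __name__ == "__main__":
    print(build_geo_search_url("GSE12345"))
```

The square brackets are encoded as %5B and %5D, which the server decodes back to the same `[Accession]` field tag.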

What's NOT a vulnerability (correcting our original report)

  • Output path (--output): Our original report claimed this was "arbitrary file write." It's not — this is a CLI tool where the user supplies their own arguments. Normal CLI behavior, not a security issue.
  • Compound attack scenario: Our original report chained HTTPS MITM + CLI argument control. This was unrealistic — each link requires conditions that make the chain implausible.

Suggested Fixes

  1. Line 343: Change f"http://{url}" to f"https://{url}" (simplest, highest impact)
  2. Lines 338-344: Add domain allowlist check before downloading
  3. Lines 99, 156, 212, 314: Use urllib.parse.quote() for geo_id/accession in URLs

Secure Patterns Already in Use (Credit)

  • ✅ ENA API query is over HTTPS (line 314)
  • ✅ yaml.safe_load() used correctly
  • ✅ subprocess.run() uses list format, not shell=True
  • ✅ No hardcoded secrets
  • ✅ NCBI rate limiting properly enforced

🤖 Generated with Claude Code
