Skip to content

view -R: pathological performance with large single-base region sets (84M regions, 11+ hours unfinished) #2557

@carstenerickson

Description

@carstenerickson

Summary

bcftools view -R sites.bed.gz target.vcf.gz is 350-400× slower than a sequential scan when the BED contains many densely-clustered single-base regions — a common pattern in PRS / ancestry pipelines (HGDP+1kGP, AADR 1240k, PGS Catalog). A production run on 84M such regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU before we killed it.

Reproducer

python3 gen_reproducer.py --regions 1000000 --prefix synth1m
bgzip -f synth1m.bed && tabix -p bed synth1m.bed.gz
bgzip -f synth1m.vcf && tabix -p vcf synth1m.vcf.gz
bcftools view -R synth1m.bed.gz synth1m.vcf.gz > /dev/null
gen_reproducer.py (N densely-clustered single-base positions, written to both BED and VCF)
import argparse, random

# Approx GRCh38 autosome lengths (Mbp) — used as relative weights only.
CHROM_LEN_MBP = [248,242,198,190,181,170,159,145,138,133,
                 135,133,114,107,101, 90, 83, 80, 58, 64, 46, 50]

def gen_positions(n_total, avg_spacing, seed=42):
    rng = random.Random(seed)
    rate = 1.0 / avg_spacing
    total = sum(CHROM_LEN_MBP)
    for chrom_i, mbp in enumerate(CHROM_LEN_MBP, start=1):
        length, share = mbp * 1_000_000, n_total * mbp // total
        pos = 0
        for _ in range(share):
            pos += max(1, int(rng.expovariate(rate)))
            if pos >= length: break
            yield str(chrom_i), pos

ap = argparse.ArgumentParser()
ap.add_argument("--regions", type=int, default=1_000_000)
ap.add_argument("--avg-spacing", type=int, default=10)
ap.add_argument("--prefix", default="synth")
args = ap.parse_args()
positions = list(gen_positions(args.regions, args.avg_spacing))

with open(f"{args.prefix}.bed", "w") as f:
    for chrom, pos in positions:
        f.write(f"{chrom}\t{pos-1}\t{pos}\n")

with open(f"{args.prefix}.vcf", "w") as f:
    f.write("##fileformat=VCFv4.2\n")
    for i, mbp in enumerate(CHROM_LEN_MBP, start=1):
        f.write(f"##contig=<ID={i},length={mbp * 1_000_000}>\n")
    f.write('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n')
    f.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3\n")
    for chrom, pos in positions:
        f.write(f"{chrom}\t{pos}\t.\tA\tG\t100\tPASS\t.\tGT\t0/0\t0/0\t0/0\n")

Measurements (bcftools 1.23.1, htslib 1.23.1, macOS arm64)

N = 1,000,000 regions = 1,000,000 records:

run real × baseline
view -Ou vcf.gz > /dev/null (baseline) 0.45 s
view -R bed.gz (default --regions-overlap 1) 156.6 s 348×
view -R bed.gz --regions-overlap 0 193.1 s 429×
view -R bed.gz --regions-overlap 2 239.7 s 533×
view -T bed.gz (streaming, --targets-overlap 0) 1.34 s 3.0×
view -T bed.gz --targets-overlap 1 1.68 s 3.7×
view -T bed.gz --targets-overlap 2 1.08 s 2.4×

At N=5M: baseline 1.79 s, view -R default 705.5 s (394×). The 84M production case extrapolates to ≥ 12,000 s (~3.3 h) of iterator setup alone, before any meaningful work.

Two takeaways: --regions-overlap value is irrelevant to the bottleneck; -T (streaming) handles the identical workload at near-baseline speed.

Root cause (htslib synced_bcf_reader.c)

_readers_next_region() calls _reader_seek() per BED region, which invokes a fresh tbx_itr_queryi(). For N regions, N iterator setup+teardown cycles. With 84M dense 1bp regions, that's 84M iterator queries when ~22 (one per autosome) would suffice.

// synced_bcf_reader.c:595 — runs N times for N BED entries
static int _readers_next_region(bcf_srs_t *files) {
    ...
    if ( bcf_sr_regions_next(files->regions)<0 ) return -1;
    for (i=0; i<files->nreaders; i++)
        _reader_seek(&files->readers[i],            // ← fresh tbx_itr_queryi
                     files->regions->seq_names[files->regions->iseq],
                     files->regions->start, files->regions->end);
    return 0;
}

Suggested fix

Filing as a feature request — semantics are correct, just impractically slow at this scale.

Preferred: region coalescing in _readers_next_region. If the next BED region starts within K bp of the current iterator position, keep streaming through the existing iterator and let the per-record overlap check at line 733 filter records in gaps. Collapses 84M iterator queries → ~22, generalizes to any dense-region workload. K tunable, default ~64 KB ≈ one BGZF block.

Alternative: opt-in --regions-stream flag forcing in-memory regidx + top-to-bottom VCF stream (semantically equivalent to -T but inheriting -R's default overlap mode). Lowest-risk; preserves current behavior.

The existing -T path is the correctness baseline (1.34 s vs 0.45 s at 1M — see table); a regression test built on the reproducer above is straightforward.


bcftools 1.23.1 / htslib 1.23.1 / macOS arm64; reproducible on Linux x86_64 per the production observation above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions