enable path coverage computation on gbz files with vg depth. #4883

Open
glennhickey wants to merge 4 commits into master from altpaths2

Conversation

@glennhickey
Contributor

Changelog Entry

To be copied to the draft changelog by merger:

  • vg depth will now work on .gbz files.

Description

glennhickey and others added 3 commits April 17, 2026 19:53
On a GBZ/GBWT-backed graph, the default for_each_path_handle and
for_each_step_on_handle iterations elide haplotype paths.  As a result,
vg depth's path coverage reports zero coverage from haplotypes, which
defeats the typical use case of measuring pangenome coverage along a
reference on a GBZ.

Switch the selection iteration to for_each_path_of_sense with
{REFERENCE, GENERIC, HAPLOTYPE}, and the per-handle step iteration in
path_depths / path_depth_of_bin to for_each_step_of_sense with the same
set.  On non-GBZ graphs this is equivalent to the prior behavior (the
default sense-filtered iterators fall back on the unfiltered ones).

Verified on yeast-27 GBZ: depth along S288C#0#chrI now reports ~5x
coverage, matching the 5 haplotype paths present on that contig.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
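The sense-filtering idea behind this commit can be sketched with a toy graph. This is a minimal illustration of the pattern, not the real libhandlegraph API: `ToyGraph`, its members, and the counting helpers are all made-up names; only the shape (a default iterator that elides haplotype paths vs. an opt-in sense-filtered iterator) mirrors the change described above.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: toy stand-in for a path-handle graph with path senses.
enum class PathSense { REFERENCE, GENERIC, HAPLOTYPE };

struct ToyGraph {
    // (name, sense) pairs standing in for path handles.
    std::vector<std::pair<std::string, PathSense>> paths;

    // Default iteration on a GBZ/GBWT-backed graph: haplotype paths elided.
    void for_each_path_handle(const std::function<void(const std::string&)>& fn) const {
        for (const auto& p : paths) {
            if (p.second != PathSense::HAPLOTYPE) fn(p.first);
        }
    }

    // Sense-filtered iteration: the caller opts in to the senses it wants.
    void for_each_path_of_sense(const std::set<PathSense>& senses,
                                const std::function<void(const std::string&)>& fn) const {
        for (const auto& p : paths) {
            if (senses.count(p.second)) fn(p.first);
        }
    }
};

// Count paths seen by the default iterator (haplotypes skipped).
int count_paths_default(const ToyGraph& g) {
    int n = 0;
    g.for_each_path_handle([&](const std::string&) { ++n; });
    return n;
}

// Count paths with the explicit {REFERENCE, GENERIC, HAPLOTYPE} set.
int count_paths_all_senses(const ToyGraph& g) {
    int n = 0;
    g.for_each_path_of_sense(
        {PathSense::REFERENCE, PathSense::GENERIC, PathSense::HAPLOTYPE},
        [&](const std::string&) { ++n; });
    return n;
}
```

With two haplotype paths alongside one reference path, the default iterator sees one path while the sense-filtered iterator sees all three, which is why the pre-fix depth report showed zero haplotype coverage.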
…verage

The haplotype-sense expansion added in 20936aa was applied to the
shared path-selection iterator, so it also leaked into -k (pack) mode.
Keep for_each_path_handle for -k so pack coverage behaves as before;
only path coverage (no -k) uses for_each_path_of_sense({REF, GEN, HAP}).

The outer loop over ref_paths is now an OpenMP parallel-for with
schedule(dynamic,1) and an ordered clause.  Each iteration writes to a
thread-local ostringstream and flushes under `#pragma omp ordered`,
keeping output in deterministic path order.  With multiple paths we cap
active nesting levels so the inner `binned_*_depth` pragmas don't
over-subscribe; with a single path we skip the outer region so inner
parallelism still benefits binned workloads.

yeast-27, 3 chromosome-scale paths, -b 100000:
  -t 1: 2m47s   -t 8: 1m18s   (output byte-identical)

Tests in 49_vg_depth.t verify:
- path coverage on GBZ counts haplotype paths
- pack coverage on GBZ ignores the haplotype selection expansion
- parallel output matches -t 1 byte-for-byte

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
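The ordered-emission scheme described above can be sketched as follows. This is a hedged stand-in, not the vg code: `depth_report_ordered` and its tab-separated rows are invented for illustration; the real loop runs the binned depth computation per path. The pragmas are no-ops without OpenMP, so the function behaves identically (serially) either way.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Each iteration writes into a thread-local buffer and flushes it under
// `#pragma omp ordered`, so rows come out in deterministic path order.
std::string depth_report_ordered(const std::vector<std::string>& ref_paths) {
    std::ostringstream out;
    #pragma omp parallel for ordered schedule(dynamic, 1)
    for (long i = 0; i < (long)ref_paths.size(); ++i) {
        std::ostringstream local;                    // thread-local buffer
        local << ref_paths[i] << "\t" << i << "\n";  // stand-in for depth rows
        #pragma omp ordered
        out << local.str();                          // serialized, in order
    }
    return out.str();
}
```

The catch, addressed by the next commit, is that a thread blocked at the ordered region cannot start a new iteration, so one slow path can idle the rest of the pool.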
…egion

With `#pragma omp parallel for ordered schedule(dynamic, 1)`, a thread
that finished a short iteration had to block at `#pragma omp ordered`
until all earlier iterations had emitted.  A thread cannot pick up a new
iteration while blocked there, so with a skewed work distribution (e.g.
augref_* paths of very different lengths on an HPRC GBZ) most threads
piled up idle waiting on one slow iteration; effective parallelism
dropped to ~2 cores even with -t 12.

Each iteration now writes into its own slot of a pre-sized vector<string>;
we serialize the emission after the parallel region.  Threads grab new
work as soon as they finish compute, so a slow iter no longer stalls the
pool.  Output remains deterministic across thread counts.

Memory trade-off: all per-path output buffers live simultaneously.  For
the coarse-bin use cases this is negligible; for -b 1 on whole-genome
paths this could be hefty, but that wasn't a practical workload before
either.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
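The slot-per-iteration replacement can be sketched the same way. Again a hedged illustration, not the vg implementation: `depth_report_slots` and its rows are made up; the point is the shape — no ordering wait inside the loop, emission serialized after the parallel region, output deterministic across thread counts.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Each iteration writes its own slot of a pre-sized vector; threads grab new
// work as soon as they finish compute, so a slow iteration never stalls the pool.
std::string depth_report_slots(const std::vector<std::string>& ref_paths) {
    std::vector<std::string> slots(ref_paths.size());  // one slot per path
    #pragma omp parallel for schedule(dynamic, 1)
    for (long i = 0; i < (long)ref_paths.size(); ++i) {
        std::ostringstream local;
        local << ref_paths[i] << "\t" << i << "\n";    // stand-in for depth rows
        slots[i] = local.str();                        // no ordering wait here
    }
    std::string out;
    for (const auto& s : slots) out += s;              // emit after the region
    return out;
}
```

The memory trade-off noted above falls out of this shape directly: every slot is alive at once, which is cheap for coarse bins but could grow large for -b 1 on whole-genome paths.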
@glennhickey glennhickey changed the title enable path cover computation on gbz files with vg depth. enable path coverage computation on gbz files with vg depth. May 4, 2026