BubbleFinder

License: GPLv3

BubbleFinder computes all snarls, superbubbles, and ultrabubbles in genomic and pangenomic GFA and GBZ graphs (i.e. bidirected graphs).

All algorithms run in time linear in the size of the input graph, O(|V|+|E|). Ultrabubbles can be computed in one of two modes. Oriented mode (default) orients the bidirected graph and reduces the problem to directed weak superbubbles; it requires at least one tip or one cut vertex per connected component. Doubled mode (--doubled) builds a doubled directed graph, which places no restriction on connected components but uses more RAM.



Additional resources (Wiki):

  • Flowchart, which details execution paths for all commands
  • Internals, covering GFA/bidirected graph representation, orientation projection, and theoretical background
  • Validation, describing bruteforce testing and the random test harness

Snarls and superbubbles via SPQR trees

BubbleFinder first builds the undirected version of the input bidirected graph, then uses the SPQR trees of its biconnected components to identify all snarls and superbubbles.

Important

snarls computes all snarls and aims to replicate the behavior of vg snarls -a -T, but vg outputs only a pruned, linear-size snarl decomposition.
Therefore, BubbleFinder may output more snarls than vg snarls.

Note

Empirical performance (snarls & superbubbles). Benchmarks and theory are reported in Sena, Politov et al., 2025.

  • Snarls: BubbleFinder is consistently faster than vg snarls -a -T on the PGGB graphs (up to ~2× faster on larger graphs and ~3× on the smallest one). On human chromosome graphs (Chromosome 1/10/22), BubbleFinder can be up to ~2× slower end-to-end in a single-threaded run due to preprocessing (BC/SPQR tree building), but benefits from multi-threading (up to ~4× speedup at 16 threads in those datasets).
  • Superbubbles: BubbleFinder runs in times similar to BubbleGun on small graphs and is about 10× faster on larger graphs; in particular, BubbleGun hit a >3h timeout on Chromosome 1/10/22, while BubbleFinder completed in minutes in our benchmarks.

Ultrabubbles via linear-time orientation + reduction to weak superbubbles

Ultrabubbles use a different approach (not SPQR-based). BubbleFinder first orients the bidirected graph into a directed graph using a DFS-based procedure, then runs a linear-time directed weak superbubble algorithm on the result and maps the output back to ultrabubbles in the original bidirected graph.

Note

Empirical performance (ultrabubbles). In our ultrabubble benchmarks, BubbleFinder consistently outperformed vg across all tested datasets. On the HPRC graphs in GBZ format, excluding parsing time, BubbleFinder achieves speedups of 19–26× over vg. On HPRC v2.0 CHM13 (232 individuals), after parsing, BubbleFinder completes in under 3 minutes and uses about four times less RAM (24.8 GiB vs 101.8 GiB), while vg requires more than one hour. On GFA input, BubbleFinder is ~200× faster than BubbleGun on the HPRC v1.1 graph (47 individuals).
A dedicated preprint describing this method, its correctness, and benchmarks is forthcoming (link to be added).


Quickstart

Prebuilt Linux binary

Download the latest release:

https://github.com/algbio/BubbleFinder/releases/latest

./BubbleFinder --help
./BubbleFinder snarls -g example/tiny1.gfa -o tiny1.snarls

Conda / Bioconda

conda create -n bubblefinder_env -c conda-forge -c bioconda bubblefinder
conda activate bubblefinder_env
BubbleFinder --help

Build from source (Linux)

git clone --recurse-submodules https://github.com/algbio/BubbleFinder && \
cd BubbleFinder && \
cmake -S . -B build && \
cmake --build build -j <NUM_THREADS> && \
mv build/BubbleFinder .

Replace <NUM_THREADS> with the number of parallel build jobs (e.g. -j 8). Omitting -j leaves the job count to the underlying build tool (a single job with the default Unix Makefiles generator).

Dependencies are handled automatically by the build system:

  • OGDF is fetched and built via CMake FetchContent.
  • zstd is detected on the system. If not found, it is automatically fetched and built from source.
  • OpenSSL (libcrypto) must be available on the system (pre-installed on most Linux distributions).
  • GBZ support pulls in four submodules, all under external/gbz/ and built automatically:
    • gbwtgraph is the GBZ/GBWTGraph library
    • gbwt is the GBWT index (required by gbwtgraph)
    • sdsl-lite (vgteam fork) provides low-level data structures (required by gbwt)
    • libhandlegraph is the handle graph interface (required by gbwtgraph)

Commands overview

| Command | Typical input | Output endpoints | Notes |
| --- | --- | --- | --- |
| snarls | bidirected GFA / GBZ | oriented incidences (a+, d-) | may output cliques |
| superbubbles | bidirected GFA / GBZ (default) or directed (--directed) | segment IDs (a, e) in bidirected mode; oriented IDs (a+, e-) in directed mode | computed on doubled directed graph + orientation projection (bidirected) or directly (directed) |
| ultrabubbles | bidirected GFA / GBZ | oriented incidences | oriented mode (default): ≥ 1 tip or cut vertex per CC; doubled mode (--doubled): no restriction |
| spqr-tree | GFA / GBZ only | .spqr v0.4 | connected components + BC-tree + SPQR decomposition |

All commands except spqr-tree exclude trivial bubbles by default (use -T to include them), and are validated against a brute-force implementation on randomly generated graphs (see Validation on the Wiki).

For a detailed walkthrough of all execution paths, see the Flowchart on the Wiki.


Running BubbleFinder

./BubbleFinder <command> -g <graphFile> -o <outputFile> [options]

Available commands:

  • superbubbles: find superbubbles (bidirected by default; use --directed for directed mode)
  • snarls: find snarls (typically on bidirected graphs from GFA)
  • ultrabubbles: find ultrabubbles (oriented mode by default; use --doubled for doubled graph mode)
  • spqr-tree: output the connected components, BC-tree and SPQR decomposition in .spqr v0.4 format

Warning

In oriented mode (default), ultrabubbles requires at least one tip or one cut vertex per connected component in the input graph (otherwise it will fail). Use --doubled if your graph has tipless and cut-vertex-free connected components.
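
A partial pre-check for the requirement above can be sketched in Python. This is an illustration, not BubbleFinder's own logic. It assumes that a "tip" is a segment with at least one side that no GFA1 L line touches, and it does not look for cut vertices, so a component it reports may still be accepted by oriented mode. The function name components_without_tips is hypothetical.

```python
from collections import defaultdict

def components_without_tips(gfa_text):
    """Return the connected components (sorted segment lists) in which every
    segment has links on both sides. Assumption: a tip is a segment with a
    side that no L line touches. Cut vertices are NOT checked here."""
    parent = {}          # union-find forest over segment names
    used_sides = set()   # (segment, "start" | "end") pairs touched by a link

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for line in gfa_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "S" and len(fields) >= 2:
            parent.setdefault(fields[1], fields[1])
        elif fields[0] == "L" and len(fields) >= 5:
            src, src_orient, dst, dst_orient = fields[1:5]
            parent.setdefault(src, src)
            parent.setdefault(dst, dst)
            # A link leaves the end of a forward segment (start of a reversed one)
            # and enters the start of a forward segment (end of a reversed one).
            used_sides.add((src, "end" if src_orient == "+" else "start"))
            used_sides.add((dst, "start" if dst_orient == "+" else "end"))
            parent[find(src)] = find(dst)

    components = defaultdict(list)
    for segment in parent:
        components[find(segment)].append(segment)

    return [
        sorted(members)
        for members in components.values()
        if all((s, side) in used_sides for s in members for side in ("start", "end"))
    ]

# Toy input: a two-segment cycle (no tips) and a simple path c -> d (has tips).
gfa = "\n".join([
    "S\ta\t*", "S\tb\t*", "S\tc\t*", "S\td\t*",
    "L\ta\t+\tb\t+\t0M",
    "L\tb\t+\ta\t+\t0M",
    "L\tc\t+\td\t+\t0M",
])
tipless = components_without_tips(gfa)
print(tipless)
```

If the returned list is non-empty and those components also lack a cut vertex, --doubled is the mode to use.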

Input data

| Extension | Format | Description |
| --- | --- | --- |
| .gfa / .gfa1 | GFA1 | Graphical Fragment Assembly format |
| .gbz | GBZ | vg/gbwtgraph binary format |
| .graph | BubbleFinder text | Simple directed edge list (see below) |

BubbleFinder .graph text format:

  • first line: two integers n (number of node IDs) and m (number of directed edges)
  • next m lines: u v (one directed edge per line)
  • u and v are arbitrary node identifiers (strings without whitespace)
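
The layout above can be produced with a few lines of Python. This is a minimal sketch, not a tool shipped with BubbleFinder: the helper name write_graph is hypothetical, and it assumes that n counts the distinct node IDs appearing in the edge list and that fields are separated by a single space.

```python
from io import StringIO

def write_graph(edges, out):
    """Write directed edges in the .graph layout: 'n m', then one 'u v' per line.
    Assumption: n is the number of distinct node IDs seen in the edges."""
    edges = list(edges)
    # dict.fromkeys keeps first-seen order and removes duplicates.
    nodes = dict.fromkeys(node for edge in edges for node in edge)
    for node in nodes:
        if not node or any(ch.isspace() for ch in node):
            raise ValueError(f"node IDs must be non-empty and whitespace-free: {node!r}")
    out.write(f"{len(nodes)} {len(edges)}\n")
    for u, v in edges:
        out.write(f"{u} {v}\n")

buf = StringIO()
write_graph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")], buf)
graph_text = buf.getvalue()
print(graph_text, end="")
```

Passing an open file instead of the StringIO buffer writes the same text to disk.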

Force the input format with --gfa, --gfa-directed, or --graph. Input files may be compressed (gzip, bzip2, xz); compression is auto-detected from the file suffix.

Note

spqr-tree currently requires GFA or GBZ input.

Command-line options

| Option | Description |
| --- | --- |
| -g <file> | Input graph file (possibly compressed) |
| -o <file> | Output file |
| -j <threads> | Number of threads |
| --gfa | Force GFA input (bidirected) |
| --gfa-directed | Force GFA input interpreted as directed graph |
| --graph | Force .graph text format |
| --directed | Interpret graph as directed (for superbubbles) |
| --doubled | Use doubled-graph algorithm (for ultrabubbles) |
| -T, --include-trivial | Include trivial bubbles in output |
| --clsd-trees <file> | Write ultrabubble hierarchy to <file> (ultrabubbles only) |
| --report-json <file> | Write JSON metrics report |
| -m <bytes> | Stack size in bytes |
| -h, --help | Show help and exit |

Output format

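The snarls, superbubbles, and ultrabubbles commands write plain text to the file given by -o <outputFile>. The first line is a single integer N (the number of result lines that follow), and lines 2 through N+1 each contain one result. The spqr-tree command writes a .spqr file instead (see SPQR-tree below).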

Each result line encodes one or more unordered pairs of endpoints. What an "endpoint" looks like depends on the command: snarls and ultrabubbles use oriented incidences (e.g. a+, d-), superbubbles in bidirected mode uses segment IDs without orientation (e.g. a, e), and superbubbles --directed uses oriented IDs (e.g. a+, e-).
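
A reader for this layout could look like the following Python sketch. The function name parse_results is hypothetical, and the sketch assumes endpoints on a line are separated by whitespace (as in the examples in this README) and that blank lines can be ignored.

```python
def parse_results(text):
    """Parse result text: a count N on the first line, then N result lines.
    Returns a list of endpoint lists, one per result line."""
    lines = text.splitlines()
    if not lines:
        raise ValueError("empty result text")
    expected = int(lines[0].strip())
    # Assumption: endpoints are whitespace-separated; blank lines are skipped.
    rows = [line.split() for line in lines[1:] if line.strip()]
    if len(rows) != expected:
        raise ValueError(f"header announces {expected} results, found {len(rows)}")
    return rows

sample = "2\ng+ k-\na+ d- f+ g-\n"
rows = parse_results(sample)
print(rows)
```

Checking the announced count against the number of lines actually read is a cheap way to detect a truncated output file.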

Snarls

By default, trivial snarls are excluded. Use -T / --include-trivial to include them.

With -T: each line contains at least two incidences. A line with k ≥ 2 incidences encodes all unordered pairs among them (clique representation).

Example on example/tiny1.gfa:

./BubbleFinder snarls -T -g example/tiny1.gfa -o example/tiny1.snarls --gfa

Output (example/tiny1.snarls):

2
g+ k-
a+ d- f+ g-

  • g+ k- → single pair {g+, k-}.
  • a+ d- f+ g- → all pairs: {a+, d-}, {a+, f+}, {a+, g-}, {d-, f+}, {d-, g-}, {f+, g-}.

Without -T (default): cliques are expanded, trivial pairs filtered, each line contains exactly two oriented incidences.
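
The clique representation can be expanded into explicit pairs as in this Python sketch. The function name expand_clique is hypothetical. The sketch covers only the expansion step: it does not reproduce BubbleFinder's filtering of trivial pairs, whose criterion is not described here.

```python
from itertools import combinations

def expand_clique(line):
    """Expand one -T result line with k >= 2 incidences into its k*(k-1)/2
    unordered pairs. Trivial-pair filtering is NOT performed."""
    incidences = line.split()
    if len(incidences) < 2:
        raise ValueError("a snarl line needs at least two incidences")
    return list(combinations(incidences, 2))

pairs = expand_clique("a+ d- f+ g-")
for left, right in pairs:
    print(left, right)
```

For the line a+ d- f+ g- this yields the six pairs listed in the example above.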

Superbubbles

In bidirected mode (default), each result line contains exactly two segment IDs (no orientation):

3
a b
e f
b e

These pairs are obtained after running the superbubble algorithm on the doubled directed graph and applying the orientation projection (see Internals on the Wiki).

In directed mode (--directed), each result line contains two oriented IDs:

3
a+ b-
e+ f-
b+ e-

Ultrabubbles

A flat list of endpoint pairs where each endpoint is an oriented incidence (segmentID+ / segmentID-):

N
a+ d-
g+ k-
...

Both oriented mode (default) and doubled mode (--doubled) produce the same output format.

To also output the hierarchical nesting structure, use --clsd-trees <file>. Each line in that file is a rooted tree in parenthesized form:

  • leaf bubble: <X,Y>
  • internal bubble: (child1,...,childk)<X,Y>

where X and Y are oriented incidences such as a+ or d-.
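
The parenthesized form can be read with a small recursive-descent parser, sketched here in Python. The function name parse_bubble_tree and the sample line are made up for illustration. The sketch assumes that segment IDs contain none of the characters ( ) < > , and that a line has no embedded whitespace.

```python
def parse_bubble_tree(line):
    """Parse one tree such as '(<b+,c->,<d+,e->)<a+,f->' into nested dicts
    of the form {'endpoints': (X, Y), 'children': [...]}."""
    text = line.strip()
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume '('
            while True:
                children.append(parse_node())
                if pos < len(text) and text[pos] == ",":
                    pos += 1  # another child follows
                    continue
                break
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at offset {pos}")
            pos += 1  # consume ')'
        if pos >= len(text) or text[pos] != "<":
            raise ValueError(f"expected '<' at offset {pos}")
        close = text.index(">", pos)  # raises ValueError if '>' is missing
        x, y = text[pos + 1:close].split(",")
        pos = close + 1
        return {"endpoints": (x, y), "children": children}

    node = parse_node()
    if pos != len(text):
        raise ValueError(f"unexpected trailing characters at offset {pos}")
    return node

# Hypothetical line: bubble <a+,f-> containing two leaf bubbles.
tree = parse_bubble_tree("(<b+,c->,<d+,e->)<a+,f->")
print(tree["endpoints"], [child["endpoints"] for child in tree["children"]])
```

Each line of the --clsd-trees file would be passed to the parser separately, since every line holds one rooted tree.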

SPQR-tree

The spqr-tree command writes a .spqr file according to the SPQR tree file format specification (version v0.4). BubbleFinder writes the header:

H v0.4 https://github.com/sebschmi/SPQR-tree-file-format

For details on line types and semantics, refer to the specification repository.

