[New Feature] vg convert: add GFAz input support by babyplutokurt · Pull Request #4861 · vgteam/vg

babyplutokurt · 2026-03-25T01:23:16Z

Changelog Entry

Added GFAz input support to vg by integrating it into vg's shared GFA parsing and import pipeline.

Description

This PR adds support for reading GFAz files in vg. Instead of adding a separate copy of the GFA import logic, this change refactors vg's existing GFA parser/import path so that both text GFA and GFAz drive the same semantic conversion code.

The key design change is that GFA parsing is now split into a small hierarchy:

GFAParser: shared base class for parser types, listener registration, shared string/tag types, and event APIs
GFATextParser: existing text-GFA parser logic adapted to the new shared interface
GFAzParser: drives the same listeners from decompressed GFAz elements

This means vg now has a single copy of the logic that turns GFA semantics into vg graph/path semantics, regardless of whether the input is text GFA or GFAz.

In addition, GFAz is now registered as its own input format in vg::io::Registry, so generic graph-loading paths can recognize and load .gfaz files through the normal VPKG mechanism.

Benchmark

A local benchmark on HPRC chr1 comparing vg convert -g from text GFA vs GFAz into PackedGraph:

Threads	GFA Wall (s)	GFAz Wall (s)	Speedup	GFA RSS (KB)	GFAz RSS (KB)
1	434.54	182.83	2.38x	11,179,328	10,889,244

In this workload, GFAz input is about 2.4x faster than text GFA, with slightly lower peak RSS.

What Changed

New files:

src/algorithms/gfaz_to_handle.hpp / .cpp: GFAz-backed parser implementation and import support
src/io/register_loader_saver_gfaz.cpp: registry loader registration for GFAz

Modified files:

src/algorithms/gfa_to_handle.hpp / .cpp: refactored shared GFA parsing/import code around a parser hierarchy and shared listener-driven conversion
src/io/register_libvg_io.cpp: registers GFAz loader support
src/subcommand/convert_main.cpp, src/subcommand/view_main.cpp, src/subcommand/sort_main.cpp, src/index_registry.cpp: use the new temporary helper path where these commands still explicitly import GFA-like inputs
Makefile: builds and links GFAz support
.gitmodules: adds deps/GFAz
test/t/48_vg_convert.t: adds .gfaz import coverage

Design Notes

This PR follows the review direction of "melting and resynthesizing" the integration instead of duplicating gfa_to_handle.cpp.

The main architectural pieces are:

GFAz is registered as its own format in src/io, rather than being hidden inside the old text-GFA loader path.
The old text-GFA parser logic remains, but now lives behind GFATextParser.
GFAzParser emits the same parser events as text GFA.
Shared conversion code attaches listeners to a polymorphic parser and builds vg graphs/paths from those events.
Path/walk/overlap event APIs were refactored so listeners receive iterator-style callbacks instead of raw text slices that must be reparsed manually.

This keeps GFA semantics handling in one place while allowing different physical encodings of GFA records.

GFAz-Specific Semantics

GFAz restores segment names as 1-based dense contiguous numeric IDs. vg can use these IDs directly, instead of building the text-GFA name-to-ID translation map normally required for arbitrary string segment names.

As a result:

GFAz node IDs are imported directly
unnecessary text-name remapping is avoided
the shared import code still handles the usual GFA semantics for links, paths, walks, and rGFA tags

Registry Integration

This PR adds a real GFAZ registry entry, so generic graph-loading code can auto-detect .gfaz input through vg::io::VPKG.

That said, this PR does not fully rewrite all legacy explicit GFA command paths to go through the registry. Some existing convert / view / sort / index-registry code paths still use temporary helper wrappers for "load GFA or GFAz". The parser/import refactor is the main goal of this change; broader unification of all GFA input onto registry/VPKG can be done separately.

Tests

This PR extends test/t/48_vg_convert.t with coverage for:

explicit -g import from .gfaz
automatic .gfaz detection through the registry-backed generic graph load path
equivalence of imported graph/path content between .gfa and .gfaz
-x conversion from .gfaz
translation-file behavior for .gfaz
stability across different thread counts

Current Scope / Limitations

This PR adds GFAz input support, not GFAz output support.
Output for GFAHandleGraph remains normal text GFA.
Some legacy explicit GFA import call sites still use temporary filename-based dispatch helpers instead of fully routing through VPKG.
This PR intentionally avoids a broader redesign of vg's full graph-loading API beyond what is needed for shared GFA/GFAz import.

Rationale

The goal of this change is to add GFAz support without creating a second copy of vg's GFA import logic. By making GFAz another parser implementation that feeds the same semantic conversion layer, the integration stays maintainable and keeps future GFA behavior changes shared across both formats.

adamnovak · 2026-03-25T23:38:54Z

This seems potentially useful.

What's the license on the GFAz codebase, @babyplutokurt? Are we going to be able to distribute it in our vg tarballs? There doesn't seem to be one on Github.

Would it make sense to try and auto-magically import GFAz when a user provides a graph in that format, the way we do with GFA?

Would it make sense to do GFAz output from vg convert?

adamnovak · 2026-03-26T00:02:11Z

It looks like the GFAz library pulls in zstd as a submodule dependency, but vg already has zstd as a system-level dependency through libgbwtgraph that needs it.

Is there a good reason the GFAz build system can't build against a system-installed zstd? Or should this change the vg build to link against the submodule zstd provided by GFAz?

(And shouldn't -lzstd come after -lgfa_compression_core in the linker command?)

Finally, it looks like you've used a highly productive AI-enabled development process to come up with GFAz very quickly, so even though it can do all this cool stuff it looks like it might have only existed at all for about a week. What is the project's plan for adoption and maintenance, beyond just vg integration? Is the vg team going to end up having to come in and bump its versions of zstd, pybind11, etc. to newer versions as breaking changes happen, or is there a plan for long-term maintenance of the library from your side? Is there a plan for some kind of preprint/paper or writeup about the format that might lead to it being adopted by more tools than just vg? Or that might help us understand or explain how it works?

babyplutokurt · 2026-03-26T00:04:20Z

This seems potentially useful.

What's the license on the GFAz codebase, @babyplutokurt? Are we going to be able to distribute it in our vg tarballs? There doesn't seem to be one on Github.

Would it make sense to try and auto-magically import GFAz when a user provides a graph in that format, the way we do with GFA?

Would it make sense to do GFAz output from vg convert?

Hi @adamnovak, GFAz is under MIT license and I just updated the LICENSE.

Thank for the comment, I will look into more details about the seamless integration later and re-push the PR once implementation is done, I am currently working on the memory consumption increase fix by streaming decompression and vg convert. I expect GFAz should not increase the memory consumption when input to vg.

babyplutokurt · 2026-03-26T00:19:04Z

It looks like the GFAz library pulls in zstd as a submodule dependency, but vg already has zstd as a system-level dependency through libgbwtgraph that needs it.

Is there a good reason the GFAz build system can't build against a system-installed zstd? Or should this change the vg build to link against the submodule zstd provided by GFAz?

(And shouldn't -lzstd come after -lgfa_compression_core in the linker command?)

Finally, it looks like you've used a highly productive AI-enabled development process to come up with GFAz very quickly, so even though it can do all this cool stuff it looks like it might have only existed at all for about a week. What is the project's plan for adoption and maintenance, beyond just vg integration? Is the vg team going to end up having to come in and bump its versions of zstd, pybind11, etc. to newer versions as breaking changes happen, or is there a plan for long-term maintenance of the library from your side? Is there a plan for some kind of preprint/paper or writeup about the format that might lead to it being adopted by more tools than just vg? Or that might help us understand or explain how it works?

Thanks for the thoughtful questions, Let me address them one by one:

1. `zstd` Dependency (System vs. Submodule) and Linker Order

Yeah I agree, building against the system-installed zstd is definitely the right approach to avoid symbol conflicts.

2. Long-term Maintenance and Stability

The vg team will not be left to maintain GFAz or deal with breaking dependency bumps; I am committed to long-term upstream maintenance of the library.

The key thing to note is that since vg primarily works with GFA v1.x input, the GFAz integration will be extremely stable. I already have full compatibility and testing across GFA v1.0, v1.1, and v1.2. Because those specifications are essentially finalized, the core implementation powering the vg integration will rarely need to change. Moving forward, active development and feature updates will mainly be focused on supporting the GFA v2.0+ format, which won't disrupt the core v1.x pipeline you rely on.

3. Adoption, Preprints, and Explanation of the Format

I definitely plan for GFAz to extend far beyond just vg. As a proof point of our adoption strategy, I have already successfully integrated GFAz across all subcommands in ODGI, and I've verified speedups across subcommands.

I am also planning to release a formal techinical paper, It will thoroughly detail the grammar-based compression algorithm, structural benchmarks. In the meantime, if you're curious about how it works under the hood, we've included a high-level overview of our architecture and serialization pipelines in the workflow.md file in our repository.

babyplutokurt · 2026-03-31T20:09:36Z

@adamnovak Hi Adam, I addressed all the concerns, please let me know your ideas.

adamnovak

I've looked at this again, and I think the integration code needs a bit of melting and resynthesizing.

Rather than hooking into the GFA read functions, GFAz ought to be registered as its own format with the vg::io::Registry, with a sniffer or magic number, via files in src/io and a call in src/io/register_libvg_io.cpp.

Rather than duplicating everything in src/aglorithms/gfa_to_handle.cpp and slightly modifying it, we should take GFAParser and turn it into a 3-class hierarchy, with a base class (GFAParser?) that defines the various string-view types and tag_list_t and handles attaching and managing the listeners, and then two implementations (GFATextParser for GFA and GFAzParser for GFAz?) that drive the listeners. For GFA we would keep the existing in-class parsing code, and for GFAz we add code to drive the listeners by iterating over the decompressed GFAz elements.

To make this work, we need to refactor the event listener signatures slightly: instead of representing things like visits and overlaps as GFAParser::chars_t that the listener processes with GFAParser::scan_p_visits, GFAParser::scan_w_visits, etc., we need to present them to the listener by providing a function that can iterate over them. Then the listener would use the iteratee it's currently passing to GFAParser::scan_w_visits() or whatever with the new iterator function, while the existing scan functions would end up being used to construct the iterator functions for GFATextParser.

Then we would have one copy of all the code that deals with converting GFA semantics to vg semantics (which attaches listeners to a polymorphic parser and gets and acts on all the events, and also deals with managing node ID translation data), and it would be able to be driven by encapsulated decoders for either data source.

Also, due to advancements in the vg::io::Registry system, a lot of code we've had that special-cases reading a GFA so we can track the source filename is no longer necessary, so we can run all the GFA input (except some that hands GFAs straight to libgbwt) through the registry system and not need any other logic to identify if something is GFA or GFAz.

adamnovak · 2026-04-14T15:48:59Z

This whole file looks like a clone of gfa_to_handle.cpp. Some of the functions (like write_gfaz_translation()) look like complete duplicates of the corresponding GFA-side operations.

I don't think that's the right approach; we want to share common code for converting from the two similar formats.

adamnovak · 2026-04-14T15:55:41Z

I don't think hooking into the GFA parsing functions and dispatching on extension is the right design. Instead, the GFA parsing functions should only parse GFA, and we should have an src/io/register_loader_saver_gfaz.cpp that describes how to identify a GFAz stream based on magic numbers and registers the format with the vg::io::Registry, so that all vg commands will understand the format and things like vg stats tiny/tiny.gfaz will work.

adamnovak · 2026-04-14T16:17:14Z

            vg::algorithms::gfa_to_path_handle_graph(input_stream_name, &intermediate,
-                                                 input_rgfa_rank, gfa_trans_path);
+                                                 input_rgfa_rank, gfa_trans_path,
+                                                 nullptr, num_threads);


Rather than passing around thread counts, the way vg likes to do thread count management is by using OMP's global state. We use set_thread_count() from src/utility.hpp to tell OMP how many threads we want for the program, and then OMP gives everything the right number of threads automatically when parallel sections start, even when they're nested or themselves starting from other threads.

Since GFAz already uses OMP, we shouldn't need to pass it a thread count; we should just tell it to use the current configured OMP number of threads.

It looks like GFAz doesn't know how to do that, though; the thread count here goes into a ScopedOMPThreads, which really wants to save off all that global state and replace it with either a particular provided value or whatever resolve_omp_thread_count() autodetects from the environment if 0. It looks like GFAz doesn't support just using the currently configured thread count.

Ideally it would be taught (maybe a new -1 sentinel value for "just use the current OMP configuration and don't change it"?), but if it can't be we would want to read the thread count with get_thread_count() from src/utility.hpp and pass that every time we call into GFAz.

Addressed by adding -1 sentinel value on GFAz side to directly inherit OMP configuration from host side.

Now GFAz threads resolver has three options:

n: using n threads
== 0: using default(8) threads
== -1: using Host OMP configurations

Actually GFAz integration is using single-thread, never uses OMP. Even for GFAz decompression, it only use OMP for multi-stream Zstd decompression, most pipeline is single-threaded. It still have GB level throughput due to O(N) time complexcity

adamnovak · 2026-04-14T16:32:26Z

    if (input == input_gfa) {
        // we have to check this manually since we're not using the istream-based loading
        // functions in order to be able to use the disk-backed loading algorithm
        if (optind >= argc) {
            logger.error() << "no input graph supplied" << endl;
        }
        string input_stream_name = argv[optind];
        if (output_format == "xg") {
            xg::XG* xg_graph = dynamic_cast<xg::XG*>(output_graph.get());

            // Need to go through a handle graph
            bdsg::HashGraph intermediate;
            logger.warn() << "currently cannot convert GFA directly to XG; "
                          << "converting through another format" << endl;
            vg::algorithms::gfa_to_path_handle_graph(input_stream_name, &intermediate,
-                                                 input_rgfa_rank, gfa_trans_path);
+                                                 input_rgfa_rank, gfa_trans_path,
+                                                 nullptr, num_threads);
            graph_to_xg_adjusting_paths(&intermediate, xg_graph, ref_samples, hap_locus, new_sample, drop_haplotypes);
        }
        else {
            // If the GFA doesn't have forward references, we can handle it
            // efficiently even if we are streaming it, so we shouldn't warn about
            // "-" input here.
            try {
                if (output_path_graph != nullptr) {
                    MutablePathMutableHandleGraph* mutable_output_graph \
                        = dynamic_cast<MutablePathMutableHandleGraph*>(output_path_graph);
                    assert(mutable_output_graph != nullptr);
                    vg::algorithms::gfa_to_path_handle_graph(input_stream_name, mutable_output_graph,
-                                                         input_rgfa_rank, gfa_trans_path);
+                                                         input_rgfa_rank, gfa_trans_path,
+                                                         nullptr, num_threads);
                }
                else {
                    MutableHandleGraph* mutable_output_graph = dynamic_cast<MutableHandleGraph*>(output_graph.get());
                    assert(mutable_output_graph != nullptr);
                    vg::algorithms::gfa_to_handle_graph(input_stream_name, mutable_output_graph,
-                                                    gfa_trans_path);
+                                                    gfa_trans_path, num_threads);
                }
            } catch (vg::algorithms::GFAFormatError& e) {
                logger.error() << "Input GFA is not acceptable.\n" << e.what() << endl;
            } catch (std::ios_base::failure& e) {
                logger.error() << "IO error processing input GFA.\n" << e.what() << endl;
            }
        }
    }


It turns out we no longer need any of this logic to handle GFA input specially.

At https://github.com/babyplutokurt/vg/blame/842fdc5cef32eac62fda9feaade0c34e10f5fc7b/src/subcommand/convert_main.cpp#L304-L306 we decided we needed to handle GFA here separate from the normal load-a-HandleGraph codepath so that we could keep the filename around and use the disk-backed GFA loader, and also so we could stream straight to XG without a HandleGraph in memory. But then we decided to set up the HandleGraph load path so it could track filenames (see https://github.com/vgteam/vg/blame/bf85ab04b251e7a2bc308750d6c8c44afda213f5/src/io/register_loader_saver_gfa.cpp#L96 and https://github.com/vgteam/libvgio/blame/65e0ba22246c7443b879f40d0bb69ac96c467107/include/vg/io/registry.hpp#L315-L322). And I had to remove the straight-to-XG code and use an intermediate HandleGraph anyway.

So now nothing important or special actually happens when input == input_gfa, and we should just treat it the same as input == input_handlegraph and use the vg::io::VPKG::load_one<HandleGraph>() approach. (We might want a warning if we load the HandleGraph and it is not an instance of GFAHandleGraph when we thought it would be.)

If we do that cut as part of this, then we don't need to worry about having some kind of interface other than the vg::io stuff that knows how to load either-GFA-or-GFAz to a HandleGraph; everything will go through the registry.

We'd also want to combine the corresponding cases in vg view:

vg/src/subcommand/view_main.cpp

Lines 557 to 578 in bf85ab0

} else if (input_type == "handlegraph") {

if (output_type == "stream") {

logger.error() << "Cannot stream a generic HandleGraph to JSON" << endl;

} else {

graph = vg::io::VPKG::load_one<PathHandleGraph>(file_name);

}

} else if (input_type == "gfa") {

graph = make_unique<bdsg::HashGraph>();

try {

// Use the disk-backed GFA loader that `vg convert` also uses.

vg::algorithms::gfa_to_path_handle_graph(file_name,

dynamic_cast<MutablePathMutableHandleGraph*>(graph.get()),

nullptr,

0); // set rgfa path rank to 0 to be consistent with vg convert's default logic

} catch (vg::algorithms::GFAFormatError& e) {

logger.error() << "Input GFA is not acceptable\n" << e.what() << endl;

} catch (std::ios_base::failure& e) {

logger.error() << "IO error processing input GFA\n" << e.what() << endl;

}

// GFA can convert to any of the graph formats, so keep going

I agree with the direction here.

This PR already adds GFAz as its own registered format in vg::io::Registry, and the generic HandleGraph load path can now auto-detect and load .gfaz through VPKG::load_one<HandleGraph>(). That path is covered in the tests with direct vg convert <file.gfaz> ... input.

I kept the existing explicit input_gfa command paths for now because this PR already does a fairly large refactor in the shared import layer: it introduces a GFAParser base class, splits the physical parsers into GFATextParser and GFAzParser, and moves both formats onto the same shared semantic conversion code. I wanted to keep the command-level changes narrower than the parser refactor itself.

For the explicit -g path, the current behavior is that we still honor -g as “treat this as GFA-style input”, but the actual loader now auto-detects text GFA vs GFAz internally through the temporary helpers load_gfa_or_gfaz_to_handle_graph() / load_gfa_or_gfaz_to_path_handle_graph(). That behavior is also covered in the tests with vg convert -g <file.gfaz> ....

So the current state is:

vg convert <file.gfaz> ... exercises the registry/VPKG path

vg convert -g <file.gfaz> ... exercises the explicit GFA import path, which now auto-detects GFA vs GFAz internally

I agree that the long-term cleanup should be to collapse the remaining input_gfa special cases into the normal input_handlegraph / VPKG::load_one<HandleGraph>() path in vg convert, and similarly combine the corresponding cases in vg view, so that the registry is the only place making the GFA vs GFAz format decision.

For this PR, I was aiming to keep the focus on:

registering GFAz properly in src/io

eliminating duplicated GFA/GFAz semantic import code

making both formats drive the same shared import pipeline

Also I improved the GFAz streaming decompression, the GFAz integration to vg not only speed up the subcommands, also with less peak RSS

OK, we can keep the GFA special-casing out of this and make that its own change. I've opened #4880 for that.

OK, we can keep the GFA special-casing out of this and make that its own change. I've opened #4880 for that.

Cool, I can work on #4880 and push another PR, do you have any suggestions on current GFAz integration?

@adamnovak ^^

I'll go through and re-review it.

babyplutokurt changed the title ~~[New Feature] vg convert: add GFAZ input support~~ [New Feature] vg convert: add GFAz input support Mar 25, 2026

babyplutokurt force-pushed the gfaz_integration branch from c41797c to 4fac0c0 Compare March 25, 2026 02:05

babyplutokurt force-pushed the gfaz_integration branch 2 times, most recently from 228e32e to 6fae5d7 Compare March 31, 2026 20:05

adamnovak reviewed Apr 14, 2026

View reviewed changes

babyplutokurt force-pushed the gfaz_integration branch 3 times, most recently from 5a18319 to 88b0c3e Compare April 17, 2026 06:22

Refactor shared GFA import pipeline and add GFAz input support

5f9b512

babyplutokurt force-pushed the gfaz_integration branch from 88b0c3e to 5f9b512 Compare April 17, 2026 20:33

adamnovak mentioned this pull request Apr 21, 2026

Remove logic we no longer need for handling GFA input specially #4880

Open

	} else if (input_type == "handlegraph") {
	if (output_type == "stream") {
	logger.error() << "Cannot stream a generic HandleGraph to JSON" << endl;
	} else {
	graph = vg::io::VPKG::load_one<PathHandleGraph>(file_name);
	}
	} else if (input_type == "gfa") {
	graph = make_unique<bdsg::HashGraph>();

	try {
	// Use the disk-backed GFA loader that `vg convert` also uses.
	vg::algorithms::gfa_to_path_handle_graph(file_name,
	dynamic_cast<MutablePathMutableHandleGraph*>(graph.get()),
	nullptr,
	0); // set rgfa path rank to 0 to be consistent with vg convert's default logic
	} catch (vg::algorithms::GFAFormatError& e) {
	logger.error() << "Input GFA is not acceptable\n" << e.what() << endl;
	} catch (std::ios_base::failure& e) {
	logger.error() << "IO error processing input GFA\n" << e.what() << endl;
	}

	// GFA can convert to any of the graph formats, so keep going

Conversation

babyplutokurt commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog Entry

Description

Benchmark

What Changed

Design Notes

GFAz-Specific Semantics

Registry Integration

Tests

Current Scope / Limitations

Rationale

Uh oh!

adamnovak commented Mar 25, 2026

Uh oh!

adamnovak commented Mar 26, 2026

Uh oh!

babyplutokurt commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

babyplutokurt commented Mar 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. zstd Dependency (System vs. Submodule) and Linker Order

2. Long-term Maintenance and Stability

3. Adoption, Preprints, and Explanation of the Format

Uh oh!

babyplutokurt commented Mar 31, 2026

Uh oh!

adamnovak left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

babyplutokurt Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

babyplutokurt Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

babyplutokurt commented Mar 25, 2026 •

edited

Loading

babyplutokurt commented Mar 26, 2026 •

edited

Loading

babyplutokurt commented Mar 26, 2026 •

edited

Loading

1. `zstd` Dependency (System vs. Submodule) and Linker Order

adamnovak left a comment •

edited

Loading

babyplutokurt Apr 15, 2026 •

edited

Loading

babyplutokurt Apr 17, 2026 •

edited

Loading