[New Feature] vg convert: add GFAz input support#4861
Conversation
c41797c to
4fac0c0
Compare
|
This seems potentially useful. What's the license on the GFAz codebase, @babyplutokurt? Are we going to be able to distribute it in our vg tarballs? There doesn't seem to be one on Github. Would it make sense to try and auto-magically import GFAz when a user provides a graph in that format, the way we do with GFA? Would it make sense to do GFAz output from |
|
It looks like the GFAz library pulls in zstd as a submodule dependency, but vg already has zstd as a system-level dependency through libgbwtgraph that needs it. Is there a good reason the GFAz build system can't build against a system-installed zstd? Or should this change the vg build to link against the submodule zstd provided by GFAz? (And shouldn't Finally, it looks like you've used a highly productive AI-enabled development process to come up with GFAz very quickly, so even though it can do all this cool stuff it looks like it might have only existed at all for about a week. What is the project's plan for adoption and maintenance, beyond just vg integration? Is the vg team going to end up having to come in and bump its versions of zstd, pybind11, etc. to newer versions as breaking changes happen, or is there a plan for long-term maintenance of the library from your side? Is there a plan for some kind of preprint/paper or writeup about the format that might lead to it being adopted by more tools than just vg? Or that might help us understand or explain how it works? |
Hi @adamnovak, GFAz is under MIT license and I just updated the LICENSE. Thank for the comment, I will look into more details about the seamless integration later and re-push the PR once implementation is done, I am currently working on the memory consumption increase fix by streaming decompression and vg convert. I expect GFAz should not increase the memory consumption when input to vg. |
Thanks for the thoughtful questions, Let me address them one by one: 1.
|
228e32e to
6fae5d7
Compare
|
@adamnovak Hi Adam, I addressed all the concerns, please let me know your ideas. |
There was a problem hiding this comment.
I've looked at this again, and I think the integration code needs a bit of melting and resynthesizing.
Rather than hooking into the GFA read functions, GFAz ought to be registered as its own format with the vg::io::Registry, with a sniffer or magic number, via files in src/io and a call in src/io/register_libvg_io.cpp.
Rather than duplicating everything in src/aglorithms/gfa_to_handle.cpp and slightly modifying it, we should take GFAParser and turn it into a 3-class hierarchy, with a base class (GFAParser?) that defines the various string-view types and tag_list_t and handles attaching and managing the listeners, and then two implementations (GFATextParser for GFA and GFAzParser for GFAz?) that drive the listeners. For GFA we would keep the existing in-class parsing code, and for GFAz we add code to drive the listeners by iterating over the decompressed GFAz elements.
To make this work, we need to refactor the event listener signatures slightly: instead of representing things like visits and overlaps as GFAParser::chars_t that the listener processes with GFAParser::scan_p_visits, GFAParser::scan_w_visits, etc., we need to present them to the listener by providing a function that can iterate over them. Then the listener would use the iteratee it's currently passing to GFAParser::scan_w_visits() or whatever with the new iterator function, while the existing scan functions would end up being used to construct the iterator functions for GFATextParser.
Then we would have one copy of all the code that deals with converting GFA semantics to vg semantics (which attaches listeners to a polymorphic parser and gets and acts on all the events, and also deals with managing node ID translation data), and it would be able to be driven by encapsulated decoders for either data source.
Also, due to advancements in the vg::io::Registry system, a lot of code we've had that special-cases reading a GFA so we can track the source filename is no longer necessary, so we can run all the GFA input (except some that hands GFAs straight to libgbwt) through the registry system and not need any other logic to identify if something is GFA or GFAz.
There was a problem hiding this comment.
This whole file looks like a clone of gfa_to_handle.cpp. Some of the functions (like write_gfaz_translation()) look like complete duplicates of the corresponding GFA-side operations.
I don't think that's the right approach; we want to share common code for converting from the two similar formats.
There was a problem hiding this comment.
I don't think hooking into the GFA parsing functions and dispatching on extension is the right design. Instead, the GFA parsing functions should only parse GFA, and we should have an src/io/register_loader_saver_gfaz.cpp that describes how to identify a GFAz stream based on magic numbers and registers the format with the vg::io::Registry, so that all vg commands will understand the format and things like vg stats tiny/tiny.gfaz will work.
| vg::algorithms::gfa_to_path_handle_graph(input_stream_name, &intermediate, | ||
| input_rgfa_rank, gfa_trans_path); | ||
| input_rgfa_rank, gfa_trans_path, | ||
| nullptr, num_threads); |
There was a problem hiding this comment.
Rather than passing around thread counts, the way vg likes to do thread count management is by using OMP's global state. We use set_thread_count() from src/utility.hpp to tell OMP how many threads we want for the program, and then OMP gives everything the right number of threads automatically when parallel sections start, even when they're nested or themselves starting from other threads.
Since GFAz already uses OMP, we shouldn't need to pass it a thread count; we should just tell it to use the current configured OMP number of threads.
It looks like GFAz doesn't know how to do that, though; the thread count here goes into a ScopedOMPThreads, which really wants to save off all that global state and replace it with either a particular provided value or whatever resolve_omp_thread_count() autodetects from the environment if 0. It looks like GFAz doesn't support just using the currently configured thread count.
Ideally it would be taught (maybe a new -1 sentinel value for "just use the current OMP configuration and don't change it"?), but if it can't be we would want to read the thread count with get_thread_count() from src/utility.hpp and pass that every time we call into GFAz.
There was a problem hiding this comment.
Addressed by adding -1 sentinel value on GFAz side to directly inherit OMP configuration from host side.
Now GFAz threads resolver has three options:
n: using n threads
== 0: using default(8) threads
== -1: using Host OMP configurations
Actually GFAz integration is using single-thread, never uses OMP. Even for GFAz decompression, it only use OMP for multi-stream Zstd decompression, most pipeline is single-threaded. It still have GB level throughput due to O(N) time complexcity
| if (input == input_gfa) { | ||
| // we have to check this manually since we're not using the istream-based loading | ||
| // functions in order to be able to use the disk-backed loading algorithm | ||
| if (optind >= argc) { | ||
| logger.error() << "no input graph supplied" << endl; | ||
| } | ||
| string input_stream_name = argv[optind]; | ||
| if (output_format == "xg") { | ||
| xg::XG* xg_graph = dynamic_cast<xg::XG*>(output_graph.get()); | ||
|
|
||
| // Need to go through a handle graph | ||
| bdsg::HashGraph intermediate; | ||
| logger.warn() << "currently cannot convert GFA directly to XG; " | ||
| << "converting through another format" << endl; | ||
| vg::algorithms::gfa_to_path_handle_graph(input_stream_name, &intermediate, | ||
| input_rgfa_rank, gfa_trans_path); | ||
| input_rgfa_rank, gfa_trans_path, | ||
| nullptr, num_threads); | ||
| graph_to_xg_adjusting_paths(&intermediate, xg_graph, ref_samples, hap_locus, new_sample, drop_haplotypes); | ||
| } | ||
| else { | ||
| // If the GFA doesn't have forward references, we can handle it | ||
| // efficiently even if we are streaming it, so we shouldn't warn about | ||
| // "-" input here. | ||
| try { | ||
| if (output_path_graph != nullptr) { | ||
| MutablePathMutableHandleGraph* mutable_output_graph \ | ||
| = dynamic_cast<MutablePathMutableHandleGraph*>(output_path_graph); | ||
| assert(mutable_output_graph != nullptr); | ||
| vg::algorithms::gfa_to_path_handle_graph(input_stream_name, mutable_output_graph, | ||
| input_rgfa_rank, gfa_trans_path); | ||
| input_rgfa_rank, gfa_trans_path, | ||
| nullptr, num_threads); | ||
| } | ||
| else { | ||
| MutableHandleGraph* mutable_output_graph = dynamic_cast<MutableHandleGraph*>(output_graph.get()); | ||
| assert(mutable_output_graph != nullptr); | ||
| vg::algorithms::gfa_to_handle_graph(input_stream_name, mutable_output_graph, | ||
| gfa_trans_path); | ||
| gfa_trans_path, num_threads); | ||
| } | ||
| } catch (vg::algorithms::GFAFormatError& e) { | ||
| logger.error() << "Input GFA is not acceptable.\n" << e.what() << endl; | ||
| } catch (std::ios_base::failure& e) { | ||
| logger.error() << "IO error processing input GFA.\n" << e.what() << endl; | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
It turns out we no longer need any of this logic to handle GFA input specially.
At https://github.com/babyplutokurt/vg/blame/842fdc5cef32eac62fda9feaade0c34e10f5fc7b/src/subcommand/convert_main.cpp#L304-L306 we decided we needed to handle GFA here separate from the normal load-a-HandleGraph codepath so that we could keep the filename around and use the disk-backed GFA loader, and also so we could stream straight to XG without a HandleGraph in memory. But then we decided to set up the HandleGraph load path so it could track filenames (see https://github.com/vgteam/vg/blame/bf85ab04b251e7a2bc308750d6c8c44afda213f5/src/io/register_loader_saver_gfa.cpp#L96 and https://github.com/vgteam/libvgio/blame/65e0ba22246c7443b879f40d0bb69ac96c467107/include/vg/io/registry.hpp#L315-L322). And I had to remove the straight-to-XG code and use an intermediate HandleGraph anyway.
So now nothing important or special actually happens when input == input_gfa, and we should just treat it the same as input == input_handlegraph and use the vg::io::VPKG::load_one<HandleGraph>() approach. (We might want a warning if we load the HandleGraph and it is not an instance of GFAHandleGraph when we thought it would be.)
If we do that cut as part of this, then we don't need to worry about having some kind of interface other than the vg::io stuff that knows how to load either-GFA-or-GFAz to a HandleGraph; everything will go through the registry.
We'd also want to combine the corresponding cases in vg view:
vg/src/subcommand/view_main.cpp
Lines 557 to 578 in bf85ab0
There was a problem hiding this comment.
I agree with the direction here.
This PR already adds GFAz as its own registered format in vg::io::Registry, and the generic HandleGraph load path can now auto-detect and load .gfaz through VPKG::load_one<HandleGraph>(). That path is covered in the tests with direct vg convert <file.gfaz> ... input.
I kept the existing explicit input_gfa command paths for now because this PR already does a fairly large refactor in the shared import layer: it introduces a GFAParser base class, splits the physical parsers into GFATextParser and GFAzParser, and moves both formats onto the same shared semantic conversion code. I wanted to keep the command-level changes narrower than the parser refactor itself.
For the explicit -g path, the current behavior is that we still honor -g as “treat this as GFA-style input”, but the actual loader now auto-detects text GFA vs GFAz internally through the temporary helpers load_gfa_or_gfaz_to_handle_graph() / load_gfa_or_gfaz_to_path_handle_graph(). That behavior is also covered in the tests with vg convert -g <file.gfaz> ....
So the current state is:
vg convert <file.gfaz> ...exercises the registry/VPKG pathvg convert -g <file.gfaz> ...exercises the explicit GFA import path, which now auto-detectsGFAvsGFAzinternally
I agree that the long-term cleanup should be to collapse the remaining input_gfa special cases into the normal input_handlegraph / VPKG::load_one<HandleGraph>() path in vg convert, and similarly combine the corresponding cases in vg view, so that the registry is the only place making the GFA vs GFAz format decision.
For this PR, I was aiming to keep the focus on:
- registering
GFAzproperly insrc/io - eliminating duplicated GFA/GFAz semantic import code
- making both formats drive the same shared import pipeline
Also I improved the GFAz streaming decompression, the GFAz integration to vg not only speed up the subcommands, also with less peak RSS
There was a problem hiding this comment.
OK, we can keep the GFA special-casing out of this and make that its own change. I've opened #4880 for that.
There was a problem hiding this comment.
I'll go through and re-review it.
5a18319 to
88b0c3e
Compare
88b0c3e to
5f9b512
Compare
Changelog Entry
Added GFAz input support to vg by integrating it into vg's shared GFA parsing and import pipeline.
Description
This PR adds support for reading GFAz files in vg. Instead of adding a separate copy of the GFA import logic, this change refactors vg's existing GFA parser/import path so that both text GFA and GFAz drive the same semantic conversion code.
The key design change is that GFA parsing is now split into a small hierarchy:
GFAParser: shared base class for parser types, listener registration, shared string/tag types, and event APIsGFATextParser: existing text-GFA parser logic adapted to the new shared interfaceGFAzParser: drives the same listeners from decompressed GFAz elementsThis means vg now has a single copy of the logic that turns GFA semantics into vg graph/path semantics, regardless of whether the input is text GFA or GFAz.
In addition, GFAz is now registered as its own input format in
vg::io::Registry, so generic graph-loading paths can recognize and load.gfazfiles through the normalVPKGmechanism.Benchmark
A local benchmark on HPRC
chr1comparingvg convert -gfrom text GFA vs GFAz into PackedGraph:In this workload, GFAz input is about 2.4x faster than text GFA, with slightly lower peak RSS.
What Changed
New files:
src/algorithms/gfaz_to_handle.hpp/.cpp: GFAz-backed parser implementation and import supportsrc/io/register_loader_saver_gfaz.cpp: registry loader registration for GFAzModified files:
src/algorithms/gfa_to_handle.hpp/.cpp: refactored shared GFA parsing/import code around a parser hierarchy and shared listener-driven conversionsrc/io/register_libvg_io.cpp: registers GFAz loader supportsrc/subcommand/convert_main.cpp,src/subcommand/view_main.cpp,src/subcommand/sort_main.cpp,src/index_registry.cpp: use the new temporary helper path where these commands still explicitly import GFA-like inputsMakefile: builds and links GFAz support.gitmodules: addsdeps/GFAztest/t/48_vg_convert.t: adds.gfazimport coverageDesign Notes
This PR follows the review direction of "melting and resynthesizing" the integration instead of duplicating
gfa_to_handle.cpp.The main architectural pieces are:
GFAzis registered as its own format insrc/io, rather than being hidden inside the old text-GFA loader path.GFATextParser.GFAzParseremits the same parser events as text GFA.This keeps GFA semantics handling in one place while allowing different physical encodings of GFA records.
GFAz-Specific Semantics
GFAz restores segment names as 1-based dense contiguous numeric IDs. vg can use these IDs directly, instead of building the text-GFA name-to-ID translation map normally required for arbitrary string segment names.
As a result:
Registry Integration
This PR adds a real
GFAZregistry entry, so generic graph-loading code can auto-detect.gfazinput throughvg::io::VPKG.That said, this PR does not fully rewrite all legacy explicit GFA command paths to go through the registry. Some existing
convert/view/sort/ index-registry code paths still use temporary helper wrappers for "load GFA or GFAz". The parser/import refactor is the main goal of this change; broader unification of all GFA input onto registry/VPKG can be done separately.Tests
This PR extends
test/t/48_vg_convert.twith coverage for:-gimport from.gfaz.gfazdetection through the registry-backed generic graph load path.gfaand.gfaz-xconversion from.gfaz.gfazCurrent Scope / Limitations
GFAzinput support, notGFAzoutput support.GFAHandleGraphremains normal text GFA.VPKG.Rationale
The goal of this change is to add GFAz support without creating a second copy of vg's GFA import logic. By making GFAz another parser implementation that feeds the same semantic conversion layer, the integration stays maintainable and keeps future GFA behavior changes shared across both formats.