Add chunked hFILE input scheme #2018
Open
fabwa wants to merge 1 commit into
Summary
This adds a built-in `chunked:` hFILE input scheme for reading byte-split binary streams such as BAMs whose raw bytes have been stored in ordered chunks.
The scheme takes a manifest file containing one chunk path per line, ignores blank/comment lines, resolves relative chunk paths against the manifest directory, and exposes the chunks as one seekable logical file. BGZF/BAM readers can consume the concatenated byte stream without samtools-specific wrapping, and normal BAM indexes can be built and used against the logical chunked input.
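For reviewers who want to reason about the manifest rules, here is a minimal Python sketch of the parsing behaviour described above. It is not the C code in this PR. The function name `read_manifest` is hypothetical, and the `#` comment marker is an assumption, since the description only says that comment lines are ignored.

```python
from pathlib import Path


def read_manifest(manifest_path):
    """Return the chunk paths listed in a manifest, in file order.

    Assumption: comment lines start with '#'. The PR description does not
    name the comment marker, so this is illustrative only.
    """
    manifest = Path(manifest_path)
    base_dir = manifest.parent
    chunks = []
    for raw_line in manifest.read_text().splitlines():
        line = raw_line.strip()
        # Skip blank lines and (assumed) '#' comment lines.
        if not line or line.startswith("#"):
            continue
        path = Path(line)
        # Relative entries resolve against the manifest's directory,
        # not the current working directory.
        chunks.append(path if path.is_absolute() else base_dir / path)
    return chunks
```

Order is preserved exactly as written in the manifest, because the logical file is the concatenation of the chunks in that order.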
Example:
`chunks.fofn` should list the raw BAM byte chunks in order.
Notes
Chunk files must be seekable so HTSlib can determine chunk sizes and translate logical BAM offsets to the right chunk and intra-chunk offset.
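The offset translation in the note above can be sketched in Python as a binary search over cumulative chunk start offsets. This is an illustration of the idea under stated assumptions, not the PR's implementation. The names `build_offsets` and `locate` are hypothetical, and the chunk sizes are passed in directly rather than measured by seeking.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(chunk_sizes):
    """Return (start offset of each chunk, total logical length)."""
    cumulative = [0] + list(accumulate(chunk_sizes))
    return cumulative[:-1], cumulative[-1]


def locate(starts, total, logical_offset):
    """Map a logical offset to (chunk_index, offset_within_chunk)."""
    if not 0 <= logical_offset < total:
        raise ValueError("offset outside the logical file")
    # The last chunk whose start is <= logical_offset contains the byte.
    # bisect_right also steps past zero-length chunks that share a start.
    index = bisect_right(starts, logical_offset) - 1
    return index, logical_offset - starts[index]


# Three chunks of 10, 5 and 20 bytes form a 35-byte logical file.
starts, total = build_offsets([10, 5, 20])
print(locate(starts, total, 12))  # (1, 2): third byte of the second chunk
```

The sizes have to be known up front to build the start table, which is why the chunks need to be seekable.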
For local manifests, default index names are derived from the manifest path, so `samtools index chunked:chunks.fofn` writes `chunks.fofn.bai` and `samtools view chunked:chunks.fofn region` can discover it.
Tests
Also manually tested manifest-relative chunk names with blank/comment lines in both normal and `-@4` threaded reads.