fix: honor per-record-batch dictionary replacements #70
Merged
`createTable` decoded all dictionary batches in one pass before any
record batches, overwriting the dictionary stored for an id whenever it
saw a non-delta replacement. After the loop, every record batch — even
ones that had been transmitted *before* the replacement in the IPC
stream — was associated with the *final* dictionary for the id, losing
the dictionary that was current at its position.
In Arrow IPC streams a single dictionary id can appear multiple times:
each non-delta dictionary batch replaces the previous one for any
record batches that follow it, while record batches that came before
keep the dictionary that was current at their position. Apache Arrow
JS handles this by walking the message stream message-by-message and
having each `_loadRecordBatch` capture whatever is currently in
`this.dictionaries`.
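
To make the replacement semantics concrete, here is a minimal toy model. It is not flechette or arrow-js code: messages are plain objects, there is a single dictionary-encoded column with id 0, and the names `decodeInOrder` and `decodeDictionariesFirst` are hypothetical. It contrasts a message-by-message walk with the "all dictionaries first" pass described above.

```javascript
// Toy model only: plain objects stand in for Arrow IPC messages.
const messages = [
  { type: 'dictionary', id: 0, isDelta: false, values: ['a', 'b'] },
  { type: 'record', indices: [0, 1] },
  // Non-delta batch for the same id: replaces the dictionary from here on.
  { type: 'dictionary', id: 0, isDelta: false, values: ['x', 'y'] },
  { type: 'record', indices: [0, 1] },
];

// Apply one dictionary batch, always storing a new array so that
// references captured earlier are never changed.
function applyDictionary(dictionaries, msg) {
  const prev = dictionaries.get(msg.id) ?? [];
  const next = msg.isDelta ? [...prev, ...msg.values] : [...msg.values];
  dictionaries.set(msg.id, next);
}

// Walk the stream in order: each record batch resolves its indices
// against whatever dictionary is current at its position.
function decodeInOrder(msgs) {
  const dictionaries = new Map();
  const batches = [];
  for (const msg of msgs) {
    if (msg.type === 'dictionary') {
      applyDictionary(dictionaries, msg);
    } else {
      const dict = dictionaries.get(0);
      batches.push(msg.indices.map((i) => dict[i]));
    }
  }
  return batches;
}

// The problematic order: process every dictionary batch first, so all
// record batches see only the final dictionary for the id.
function decodeDictionariesFirst(msgs) {
  const dictionaries = new Map();
  for (const msg of msgs) {
    if (msg.type === 'dictionary') applyDictionary(dictionaries, msg);
  }
  const dict = dictionaries.get(0);
  return msgs
    .filter((msg) => msg.type === 'record')
    .map((msg) => msg.indices.map((i) => dict[i]));
}

const inOrder = decodeInOrder(messages);
const dictsFirst = decodeDictionariesFirst(messages);
```

In this toy stream, `inOrder` keeps `['a', 'b']` for the first record batch, while `dictsFirst` resolves both batches against the replacement dictionary.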
This change does the same in flechette without giving up the existing
two-phase decode → build split:
* `decodeIPC{Stream,File}` now also produces a `dictsBeforeRecord`
array — for each record batch, the number of dictionary batches that
preceded it in the original stream / file (sorted by file offset for
the file format).
* `createTable` walks `records` in order and processes any dictionary
batches that came before each record batch right before decoding it.
`processDict` mirrors arrow-js's `_loadDictionaryBatch`: each call
creates a *fresh* `Column` instance (replacing or extending the
existing one) and stores it in `dictionaryMap`. Record batches that
captured the old reference earlier are unaffected because Columns
are never mutated in place.
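
The two bullets above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the code in this PR: `decodeToy` and `buildToy` are hypothetical names, plain frozen arrays stand in for `Column` instances, and each toy record batch carries a `dictId` field instead of a real schema. Only the names `dictsBeforeRecord`, `processDict`, and `dictionaryMap` come from the description.

```javascript
// Phase 1 (toy decode): split messages into dictionary and record batches,
// recording how many dictionary batches preceded each record batch.
function decodeToy(msgs) {
  const dictionaries = [];
  const records = [];
  const dictsBeforeRecord = [];
  for (const msg of msgs) {
    if (msg.type === 'dictionary') {
      dictionaries.push(msg);
    } else {
      records.push(msg);
      dictsBeforeRecord.push(dictionaries.length);
    }
  }
  return { dictionaries, records, dictsBeforeRecord };
}

// Phase 2 (toy build): walk records in order and process any pending
// dictionary batches right before decoding each record batch.
function buildToy({ dictionaries, records, dictsBeforeRecord }) {
  const dictionaryMap = new Map();

  // Each call stores a fresh, frozen array (replacing or extending the
  // previous one), so values captured earlier are never mutated.
  const processDict = ({ id, isDelta, values }) => {
    const prev = dictionaryMap.get(id);
    const next = isDelta && prev ? [...prev, ...values] : [...values];
    dictionaryMap.set(id, Object.freeze(next));
  };

  let cursor = 0; // index of the next unprocessed dictionary batch
  return records.map((rec, r) => {
    while (cursor < dictsBeforeRecord[r]) processDict(dictionaries[cursor++]);
    const dict = dictionaryMap.get(rec.dictId);
    return rec.indices.map((i) => dict[i]);
  });
}

// Usage: a base dictionary, then a delta, then a full replacement.
const toyMessages = [
  { type: 'dictionary', id: 0, isDelta: false, values: ['a', 'b'] },
  { type: 'record', dictId: 0, indices: [0, 1] },
  { type: 'dictionary', id: 0, isDelta: true, values: ['c'] },
  { type: 'record', dictId: 0, indices: [2, 0] },
  { type: 'dictionary', id: 0, isDelta: false, values: ['x'] },
  { type: 'record', dictId: 0, indices: [0] },
];
const decoded = decodeToy(toyMessages);
const built = buildToy(decoded);
```

Keeping the counts as a flat array lets the decode phase stay a single pass over messages while the build phase only needs a cursor into the dictionary list.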
`dictsBeforeRecord` on `ArrowData` is optional, so existing callers
that build the structure by hand still get the legacy "all dicts before
all records" behavior.
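
One way such a fallback could look, as a sketch only: the helper name `resolveDictsBeforeRecord` is hypothetical and the actual handling in `createTable` may differ. When the field is absent, every record batch is treated as if all dictionary batches preceded it.

```javascript
// Hypothetical helper, not flechette's code: choose the per-record
// dictionary counts, defaulting to "all dicts before all records".
function resolveDictsBeforeRecord({ dictionaries, records, dictsBeforeRecord }) {
  return dictsBeforeRecord ?? records.map(() => dictionaries.length);
}

// A hand-built structure without the optional field (placeholder batches).
const handBuilt = { dictionaries: [{}, {}], records: [{}, {}, {}] };
const legacyCounts = resolveDictsBeforeRecord(handBuilt);

// The same structure with explicit ordering information.
const orderedCounts = resolveDictsBeforeRecord({
  ...handBuilt,
  dictsBeforeRecord: [1, 1, 2],
});
```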
Includes a regression test built from a fixture written by Apache
Arrow Rust's `StreamWriter` so the input is independently well-formed.
This was referenced May 4, 2026
AI disclaimer: Yes, AI was used in the making of this, but it was thoroughly tested through our codebase, and audited to be close to the solution that arrow-rs implements as well. That said, I'm not a js/ts expert so I did heavily lean on claude code for the heavy lifting.
@jheer