Skip to content

fix: honor per-record-batch dictionary replacements#70

Merged
jheer merged 1 commit into
uwdata:mainfrom
polarsignals:fix-dict-replacement
Apr 28, 2026
Merged

fix: honor per-record-batch dictionary replacements#70
jheer merged 1 commit into
uwdata:mainfrom
polarsignals:fix-dict-replacement

Conversation

@brancz
Copy link
Copy Markdown
Contributor

@brancz brancz commented Apr 15, 2026

createTable decoded all dictionary batches in one pass before any record batches, overwriting the dictionary stored for an id whenever it saw a non-delta replacement. After the loop, every record batch — even ones that had been transmitted before the replacement in the IPC stream — was associated with the final dictionary for the id, losing the dictionary that was current at its position.

In Arrow IPC streams a single dictionary id can appear multiple times: each non-delta dictionary batch replaces the previous one for any record batches that follow it, while record batches that came before keep the dictionary that was current at their position. Apache Arrow JS handles this by walking the message stream message-by-message and having each _loadRecordBatch capture whatever is currently in this.dictionaries.

This change does the same in flechette without giving up the existing two-phase decode → build split:

  • decodeIPC{Stream,File} now also produces a dictsBeforeRecord array — for each record batch, the number of dictionary batches that preceded it in the original stream / file (sorted by file offset for the file format).
  • createTable walks records in order and processes any dictionary batches that came before each record batch right before decoding it. processDict mirrors arrow-js's _loadDictionaryBatch: each call creates a fresh Column instance (replacing or extending the existing one) and stores it in dictionaryMap. Record batches that captured the old reference earlier are unaffected because Columns are never mutated in place.

dictsBeforeRecord on ArrowData is optional, so existing callers that build the structure by hand still get the legacy "all dicts before all records" behavior.

Includes a regression test built from a fixture written by Apache Arrow Rust's StreamWriter so the input is independently well-formed.

AI disclaimer: Yes, AI was used in the making of this, but it was thoroughly tested through our codebase, and audited to be close to the solution that arrow-rs implements as well. That said, I'm not a js/ts expert so I did heavily lean on claude code for the heavy lifting.

@jheer

`createTable` decoded all dictionary batches in one pass before any
record batches, overwriting the dictionary stored for an id whenever it
saw a non-delta replacement. After the loop, every record batch — even
ones that had been transmitted *before* the replacement in the IPC
stream — was associated with the *final* dictionary for the id, losing
the dictionary that was current at its position.

In Arrow IPC streams a single dictionary id can appear multiple times:
each non-delta dictionary batch replaces the previous one for any
record batches that follow it, while record batches that came before
keep the dictionary that was current at their position. Apache Arrow
JS handles this by walking the message stream message-by-message and
having each `_loadRecordBatch` capture whatever is currently in
`this.dictionaries`.

This change does the same in flechette without giving up the existing
two-phase decode → build split:

 * `decodeIPC{Stream,File}` now also produces a `dictsBeforeRecord`
   array — for each record batch, the number of dictionary batches that
   preceded it in the original stream / file (sorted by file offset for
   the file format).
 * `createTable` walks `records` in order and processes any dictionary
   batches that came before each record batch right before decoding it.
   `processDict` mirrors arrow-js's `_loadDictionaryBatch`: each call
   creates a *fresh* `Column` instance (replacing or extending the
   existing one) and stores it in `dictionaryMap`. Record batches that
   captured the old reference earlier are unaffected because Columns
   are never mutated in place.

`dictsBeforeRecord` on `ArrowData` is optional, so existing callers
that build the structure by hand still get the legacy "all dicts before
all records" behavior.

Includes a regression test built from a fixture written by Apache
Arrow Rust's `StreamWriter` so the input is independently well-formed.
@jheer jheer merged commit bbeb29d into uwdata:main Apr 28, 2026
2 checks passed
@jheer jheer mentioned this pull request May 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants