fix: honor per-record-batch dictionary replacements by brancz · Pull Request #70 · uwdata/flechette

brancz · 2026-04-15T14:29:54Z

createTable decoded all dictionary batches in one pass before any record batches, overwriting the dictionary stored for an id whenever it saw a non-delta replacement. After the loop, every record batch — even ones that had been transmitted before the replacement in the IPC stream — was associated with the final dictionary for the id, losing the dictionary that was current at its position.

In Arrow IPC streams a single dictionary id can appear multiple times: each non-delta dictionary batch replaces the previous one for any record batches that follow it, while record batches that came before keep the dictionary that was current at their position. Apache Arrow JS handles this by walking the message stream message-by-message and having each _loadRecordBatch capture whatever is currently in this.dictionaries.

This change does the same in flechette without giving up the existing two-phase decode → build split:

decodeIPC{Stream,File} now also produces a dictsBeforeRecord array — for each record batch, the number of dictionary batches that preceded it in the original stream / file (sorted by file offset for the file format).
createTable walks records in order and processes any dictionary batches that came before each record batch right before decoding it. processDict mirrors arrow-js's _loadDictionaryBatch: each call creates a fresh Column instance (replacing or extending the existing one) and stores it in dictionaryMap. Record batches that captured the old reference earlier are unaffected because Columns are never mutated in place.

dictsBeforeRecord on ArrowData is optional, so existing callers that build the structure by hand still get the legacy "all dicts before all records" behavior.

Includes a regression test built from a fixture written by Apache Arrow Rust's StreamWriter so the input is independently well-formed.

AI disclaimer: Yes, AI was used in the making of this, but it was thoroughly tested through our codebase, and audited to be close to the solution that arrow-rs implements as well. That said, I'm not a js/ts expert so I did heavily lean on claude code for the heavy lifting.

@jheer

`createTable` decoded all dictionary batches in one pass before any record batches, overwriting the dictionary stored for an id whenever it saw a non-delta replacement. After the loop, every record batch — even ones that had been transmitted *before* the replacement in the IPC stream — was associated with the *final* dictionary for the id, losing the dictionary that was current at its position. In Arrow IPC streams a single dictionary id can appear multiple times: each non-delta dictionary batch replaces the previous one for any record batches that follow it, while record batches that came before keep the dictionary that was current at their position. Apache Arrow JS handles this by walking the message stream message-by-message and having each `_loadRecordBatch` capture whatever is currently in `this.dictionaries`. This change does the same in flechette without giving up the existing two-phase decode → build split: * `decodeIPC{Stream,File}` now also produces a `dictsBeforeRecord` array — for each record batch, the number of dictionary batches that preceded it in the original stream / file (sorted by file offset for the file format). * `createTable` walks `records` in order and processes any dictionary batches that came before each record batch right before decoding it. `processDict` mirrors arrow-js's `_loadDictionaryBatch`: each call creates a *fresh* `Column` instance (replacing or extending the existing one) and stores it in `dictionaryMap`. Record batches that captured the old reference earlier are unaffected because Columns are never mutated in place. `dictsBeforeRecord` on `ArrowData` is optional, so existing callers that build the structure by hand still get the legacy "all dicts before all records" behavior. Includes a regression test built from a fixture written by Apache Arrow Rust's `StreamWriter` so the input is independently well-formed.

jheer merged commit bbeb29d into uwdata:main Apr 28, 2026
2 checks passed

This was referenced May 4, 2026

Discrepancy in Dictionary Value in streaming IPC message #65

Closed

Missing prepare script preventing usage of unreleased version directly from the Github repo. #71

Open

jheer mentioned this pull request May 7, 2026

v2.5.0 #73

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix: honor per-record-batch dictionary replacements#70

fix: honor per-record-batch dictionary replacements#70
jheer merged 1 commit into
uwdata:mainfrom
polarsignals:fix-dict-replacement

brancz commented Apr 15, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

brancz commented Apr 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

brancz commented Apr 15, 2026 •

edited

Loading