Conversation

@connortsui20
Copy link
Contributor

@connortsui20 connortsui20 commented Dec 16, 2025

Fixes #5591
Also fixes #5563

Still not super sure why we limit the dict layout codes to a max of u16::MAX...

@codecov
Copy link

codecov bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.21%. Comparing base (c93b6a7) to head (6aa5e45).

@gatesn
Copy link
Contributor

gatesn commented Dec 16, 2025

We decided that more than 64k unique elements in a DictLayout (where chunks can be ~8k elements) doesn't make much sense! I imagine this assumption is baked in, with some bad assumptions, in a few places...

@onursatici
Copy link
Contributor

So this was done before the dict encoder was generic over the codes ptype; back then it dynamically set the codes dtype based on the magnitude of the max_len constraint. It always encoded to a wider int, and after encoding was done we would downcast to a narrower int. In that world it was not possible to tell the codes ptype before encoding had fully finished.

Also, I don't think the codes ptype is selected dynamically; it should still depend on the max_len argument, and only if that is smaller than 256 should we get u8 codes:

https://github.com/vortex-data/vortex/blob/develop/vortex-array/src/builders/dict/primitive.rs#L36-L42

I think what is happening here is that we are trying to dict encode u8 values; in that case this code would select u8 codes, even though max_len is u16::MAX.

So probably the right fix is to expose the codes ptype from the dict builder?
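The selection rule described above can be sketched as follows. This is a minimal illustration of the idea, not the actual Vortex code; the names `PType` and `codes_ptype_for` are hypothetical, and the real logic lives in `vortex-array/src/builders/dict/primitive.rs`.

```rust
// Hypothetical sketch: the codes ptype is chosen upfront from two bounds --
// the input type's value range and the max_len constraint. The dictionary can
// never hold more distinct values than the narrower of the two.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PType {
    U8,
    U16,
    U32,
}

fn codes_ptype_for(input_bits: u32, max_len: u64) -> PType {
    // Tighter of the two cardinality bounds decides the codes width.
    let bound = max_len.min(1u64 << input_bits.min(63));
    if bound <= u8::MAX as u64 + 1 {
        PType::U8
    } else if bound <= u16::MAX as u64 + 1 {
        PType::U16
    } else {
        PType::U32
    }
}

fn main() {
    // The fuzzer's case: u8 input with max_len = u16::MAX still yields u8
    // codes, because the input range (256 values) is the tighter bound.
    assert_eq!(codes_ptype_for(8, u16::MAX as u64), PType::U8);
    // A wider input with max_len = u16::MAX yields u16 codes.
    assert_eq!(codes_ptype_for(64, u16::MAX as u64), PType::U16);
}
```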

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20
Copy link
Contributor Author

@onursatici I think that is what I am doing since I am extracting the dtype of the codes? Or are you saying that we need to forward the original dtype from somewhere else to the writer? Or are you saying we need to upcast the codes always?

@onursatici
Copy link
Contributor

I guess what I am saying is that this comment is not exactly right:

    /// The codes dtype is chosen dynamically based on the actual dictionary size:
    /// - [`PType::U8`] when the dictionary has at most 255 entries
    /// - [`PType::U16`] when the dictionary has more than 255 entries

The way we choose the ptype for codes is upfront, so it doesn't depend on the actual cardinality after encoding. It depends on two things: the width of the input primitive ptype and the max_len constraint. Normally we set max_len to u16::MAX because we don't want higher cardinality in a chunk, so we always get u16 codes. In the fuzzer we end up dict encoding u8 types, and because the input is narrower than the max_len constraint (u8 vs u16), the dict encoder returns u8 codes.

For the fix, I think this will work and I am happy to merge as is, but I think the right way is to get the codes dtype from the dict encoder: as soon as you construct one you can get its codes ptype, so you don't need to encode a chunk to see what ptype it will end up with.
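The suggested fix can be sketched like this: since the codes ptype is fully determined at construction time, the builder could simply report it before any chunk is encoded. `DictBuilder` and `codes_ptype` are illustrative names here, not the actual Vortex API.

```rust
// Hypothetical sketch: a dict builder that exposes its codes ptype at
// construction time, so a writer can fix the stream dtype upfront.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PType {
    U8,
    U16,
}

struct DictBuilder {
    codes: PType,
}

impl DictBuilder {
    /// The codes ptype depends only on the constructor arguments (input
    /// width and max_len), so it is known before any encoding happens.
    fn new(input_bits: u32, max_len: u64) -> Self {
        let bound = max_len.min(1u64 << input_bits.min(63));
        let codes = if bound <= u8::MAX as u64 + 1 {
            PType::U8
        } else {
            PType::U16
        };
        Self { codes }
    }

    fn codes_ptype(&self) -> PType {
        self.codes
    }
}

fn main() {
    // A writer could query the ptype immediately after construction.
    let builder = DictBuilder::new(8, u16::MAX as u64);
    assert_eq!(builder.codes_ptype(), PType::U8);
}
```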



Successfully merging this pull request may close these issues.

Fuzzing Crash: Type mismatch in SequentialStreamAdapter (U8 vs U16)
Fuzzing Crash: Sequential stream dtype mismatch in file_io
