Conversation

@connortsui20
Copy link
Contributor

@connortsui20 connortsui20 commented Dec 16, 2025

Fixes #5591
Also fixes #5563

Still not super sure why we limit the dict layout codes to a max of u16::MAX...

@codecov
Copy link

codecov bot commented Dec 16, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.21%. Comparing base (c93b6a7) to head (6aa5e45).

@gatesn
Copy link
Contributor

gatesn commented Dec 16, 2025

We decided that more than 64k unique elements in a DictLayout (where chunks can be ~8k elements) doesn't make much sense! I imagine this assumption is baked in, with some bad assumptions, in a few places...

@onursatici
Copy link
Contributor

So this was done before the dict encoder was generic over the codes ptype; back then it dynamically set the codes dtype based on the magnitude of the max_len constraint. It always encoded to a wider int, and after encoding was done we would downcast to a narrower int. In that world it was not possible to tell the codes ptype before encoding had fully finished.

Also, I don't think the codes ptype is selected dynamically; it should still depend on the max_len argument, and only if that is smaller than 256 should we get u8 codes:

https://github.com/vortex-data/vortex/blob/develop/vortex-array/src/builders/dict/primitive.rs#L36-L42

I think what is happening here is that we are trying to dict encode u8 values; in that case this code would select u8 codes, even though max_len is u16::MAX.

So probably the right fix is to expose the codes ptype from the dict builder?
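The selection rule described above can be sketched as follows. This is a minimal illustration of the idea, not the actual Vortex code; the names `PType` and `codes_ptype_for` are hypothetical, and the real logic lives in `vortex-array/src/builders/dict/primitive.rs`.

```rust
// Hypothetical sketch: the codes ptype is chosen upfront from two bounds --
// the input type's value range and the max_len constraint. The dictionary can
// never hold more distinct values than the narrower of the two.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PType {
    U8,
    U16,
    U32,
}

fn codes_ptype_for(input_bits: u32, max_len: u64) -> PType {
    // Tighter of the two cardinality bounds decides the codes width.
    let bound = max_len.min(1u64 << input_bits.min(63));
    if bound <= u8::MAX as u64 + 1 {
        PType::U8
    } else if bound <= u16::MAX as u64 + 1 {
        PType::U16
    } else {
        PType::U32
    }
}

fn main() {
    // The fuzzer's case: u8 input with max_len = u16::MAX still yields u8
    // codes, because the input range (256 values) is the tighter bound.
    assert_eq!(codes_ptype_for(8, u16::MAX as u64), PType::U8);
    // A wider input with max_len = u16::MAX yields u16 codes.
    assert_eq!(codes_ptype_for(64, u16::MAX as u64), PType::U16);
}
```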

Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
@connortsui20
Copy link
Contributor Author

@onursatici I think that is what I am doing since I am extracting the dtype of the codes? Or are you saying that we need to forward the original dtype from somewhere else to the writer? Or are you saying we need to upcast the codes always?

@onursatici
Copy link
Contributor

I guess what I am saying is that this comment is not exactly right:

    /// The codes dtype is chosen dynamically based on the actual dictionary size:
    /// - [`PType::U8`] when the dictionary has at most 255 entries
    /// - [`PType::U16`] when the dictionary has more than 255 entries

The way we choose the ptype for codes is upfront, so it doesn't depend on the actual cardinality after encoding. It depends on two things: the width of the input primitive ptype and the max_len constraint. Normally we set max_len to u16::MAX because we don't want higher cardinality in a chunk, so we always get u16 codes. In the fuzzer we end up dict encoding u8 types, and because the input is narrower than the max_len constraint (u8 vs u16), the dict encoder returns u8 codes.

For the fix, I think this will work and I am happy to merge as is, but I think the right way is to get the codes dtype from the dict encoder: as soon as you construct one you can get its codes ptype, so you don't need to encode a chunk to see what ptype it will end up with.
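The suggested fix can be sketched like this: since the codes ptype is fully determined at construction time, the builder could simply report it before any chunk is encoded. `DictBuilder` and `codes_ptype` are illustrative names here, not the actual Vortex API.

```rust
// Hypothetical sketch: a dict builder that exposes its codes ptype at
// construction time, so a writer can fix the stream dtype upfront.
#[derive(Debug, PartialEq, Clone, Copy)]
enum PType {
    U8,
    U16,
}

struct DictBuilder {
    codes: PType,
}

impl DictBuilder {
    /// The codes ptype depends only on the constructor arguments (input
    /// width and max_len), so it is known before any encoding happens.
    fn new(input_bits: u32, max_len: u64) -> Self {
        let bound = max_len.min(1u64 << input_bits.min(63));
        let codes = if bound <= u8::MAX as u64 + 1 {
            PType::U8
        } else {
            PType::U16
        };
        Self { codes }
    }

    fn codes_ptype(&self) -> PType {
        self.codes
    }
}

fn main() {
    // A writer could query the ptype immediately after construction.
    let builder = DictBuilder::new(8, u16::MAX as u64);
    assert_eq!(builder.codes_ptype(), PType::U8);
}
```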



Successfully merging this pull request may close these issues.

Fuzzing Crash: Type mismatch in SequentialStreamAdapter (U8 vs U16)
Fuzzing Crash: Sequential stream dtype mismatch in file_io
