-
Notifications
You must be signed in to change notification settings - Fork 107
Fix: hardcoded DType in dict layout
#5761
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
This comment was marked as off-topic.
This comment was marked as off-topic.
Codecov Report✅ All modified and coverable lines are covered by tests. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
We decided that more than 64k unique elements in a DictLayout (where chunks can be ~8k elements), doesn't make much sense! I imagine this is done way with bad assumptions in a few places.. |
|
So this was done before the dict encoder was generic over the codes ptype, and was dynamically setting the codes dtype based on the magnitude of the max_len constraint. So it was always encoding to a wider int, then after the encoding is done we would downcast to a narrower int. In that world it is not possible to tell the codes ptype before encoding is fully finished. also I don't think the codes ptype is selected dynamically, is should still be depending on the max_len argument, and if that is smaller than 256, only then we should get u8 codes: I think what is happening here is that we are trying to dict encode u8 values, in that case this code would select u8, even though max_len is u16::max. So probably the right fix is to expose the codes ptype from the dict builder? |
167d852 to
833c131
Compare
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
833c131 to
6aa5e45
Compare
|
@onursatici I think that is what I am doing since I am extracting the dtype of the codes? Or are you saying that we need to forward the original dtype from somewhere else to the writer? Or are you saying we need to upcast the codes always? |
|
I guess what I am saying is that this comment is not exactly right: the way we chose the ptype for codes is upfront, so it doesn't depend on the actual cardinality after encoding. It depends on two things, the width of the input primitive ptype, and the max_len constraint. Normally we have max_len set to u16::max because we don't want a higher cardinality in a chunk, so we always get u16 codes. In fuzzer we end up dict encoding u8 types, and because the input is narrower than the max_len constraint (u8 vs u16), dict encoder returns u8 codes. for the fix, I think this will work and I am happy to merge as is, but I think the right way is to get the codes dtype from the dict encoder, because as soon as you construct one you can get a codes ptype, you don't need to encode a chunk to see what ptype it will end up having. |
Fixes #5591
Also fixes #5563
Still not super sure about why we limit the dict layout codes to be a max of
u16::MAX...