[FLINK-39399] Fix integer overflow in HyperLogLogPlusPlus causing APPROX_COUNT_DISTINCT undercount #27895
jnh5y wants to merge 1 commit into apache:master
Conversation
Thanks for the contribution. Since you used AI, could you also follow the doc about this? It is still in progress: https://github.com/apache/flink/pull/27776/changes#r2979585031
Is it sufficient to have that in the PR description and keep the Co-Authored tag as-is in the Git commit?
quote from ASF doc
so it should be changed in the commit
…ROX_COUNT_DISTINCT undercount

In HyperLogLogPlusPlus.query(), the expression "1 << mIdx" uses int shift which wraps modulo 32. At high cardinality (~100M+ distinct values), some registers accumulate values >= 32, causing the shift to produce incorrect zInverse contributions and wildly wrong estimates.

Change "1 << mIdx" to "1L << mIdx" to use long shift arithmetic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Generated-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
43626cd to b40601d
@flinkbot run azure
What is the purpose of the change
In HyperLogLogPlusPlus.query(), the expression "1 << mIdx" uses an int shift, whose shift distance wraps modulo 32. At high cardinality (~100M+ distinct values), some registers accumulate values >= 32, so the shift produces incorrect zInverse contributions and wildly wrong estimates.
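For reviewers who want to see the wrap-around in isolation, here is a minimal sketch. It assumes the per-register contribution has the form `1.0 / (1 << mIdx)`; the class and method names are hypothetical and are not taken from HyperLogLogPlusPlus.java.

```java
// Hypothetical demo class, not part of Flink.
final class ShiftOverflowDemo {

    // int shift: Java uses only the low 5 bits of the shift distance,
    // so a distance of 32 behaves like 0, 33 like 1, and so on.
    static long intShift(int mIdx) {
        return 1 << mIdx; // computed as int, then widened to long
    }

    // long shift: the low 6 bits of the distance are used, so 0..63 are safe.
    static long longShift(int mIdx) {
        return 1L << mIdx;
    }

    // Assumed shape of one register's zInverse contribution (buggy variant).
    static double inverseWithIntShift(int mIdx) {
        return 1.0 / (1 << mIdx);
    }

    // Same contribution with long shift arithmetic (fixed variant).
    static double inverseWithLongShift(int mIdx) {
        return 1.0 / (1L << mIdx);
    }

    public static void main(String[] args) {
        for (int mIdx : new int[] {5, 31, 32, 33, 40}) {
            System.out.printf(
                    "mIdx=%d int=%d long=%d invInt=%e invLong=%e%n",
                    mIdx,
                    intShift(mIdx),
                    longShift(mIdx),
                    inverseWithIntShift(mIdx),
                    inverseWithLongShift(mIdx));
        }
    }
}
```

With the int shift, a register value of 32 contributes 1.0 instead of 2^-32, and a value of 31 contributes a negative term because `1 << 31` is `Integer.MIN_VALUE`. Either case distorts the sum far more than a correct contribution would, which is consistent with the wrong estimates described above.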
Brief change log
In HyperLogLogPlusPlus.java, change "1 << mIdx" to "1L << mIdx" to use long shift arithmetic.
Verifying this change
This change added a test in HyperLogLogPlusPlusTest.java.
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)
Documentation
Generated-by: Claude Opus 4.6 (1M context)