-
Notifications
You must be signed in to change notification settings - Fork 73
fix: rejecting inf as value #471
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Regarding overflow and underflow, I believe I followed the reference implementation as close as practical. If you see potential problems I would suggest asking @tdunning first. |
|
@proost Let's talk. Feel free to ping me directly on my apache email address. I would love to work with you on a Go implementation and there are important simplifications in Go relative to the Java implementation because you can easily have an array of structs. Regarding your question, a guard for ±inf is more subtle than it looks. If you insert a single infinite value, it will have a centroid with one data point in it. That's fine. If you insert another on the same side, you will start to get centroids with infinite value which is actually numerically correct. You will also have effects on the interpolation code, but that is also likely fairly reasonable. How do you think an empirical distribution with infinite values should be handled? What do you think the mean of such a distribution should be? |
|
@tdunning In my use case, I was mainly trying to protect the t-digest from values that come from numeric overflow or underflow, not from intentional ±∞ in the data. Your explanation helped clarify the distinction for me. |
|
There is one thing I wanted to clarify regarding this PR. In the current implementation, ±∞ values can flow into the t-digest and be treated as data points. With this PR, ±∞ values passed to If supporting ±∞ as valid data was intentional in the original design, then I agree that changing this behavior is not appropriate. However, if it wasn’t an intentional design choice and infinity handling was simply unspecified, then I think explicitly rejecting/ignoring infinities and documenting the behavior would improve robustness and make the policy clearer for users. I wanted to check your view on this before moving forward, since it affects compatibility and library semantics. |
|
If you have a function that adds up an array of values, what is the correct
behavior if one of the values is +∞ and the others are normal numbers?
I think the answer should be +∞
What if you pass in +∞ and -∞ along with normal values?
I think the answer should be NaN
This is the behavior that any function like sum should follow. This
includes mean and, in a more nuanced way, t-digest.
What is the median of [0, 0, 1, +∞, +∞] ?
I think it should be 1, not 0.
Julia agrees:
```
julia> median([0, 0, 1, +Inf, +Inf])
1.0
```
If the infinities are ignored, we get the wrong answer.
…On Wed, Jan 7, 2026 at 9:15 PM Hyeonho Kim ***@***.***> wrote:
*proost* left a comment (apache/datasketches-cpp#471)
<#471 (comment)>
@AlexanderSaydakov <https://github.com/AlexanderSaydakov>
There is one thing I wanted to clarify regarding this PR.
In the current implementation, ±∞ values can flow into the t-digest and be
treated as data points. With this PR, ±∞ values passed to update() are
silently ignored, and ±∞ passed to query methods such as get_rank()
result in an exception. That means this change alters the behavior for
users who might already be passing infinity values (intentionally or not),
which effectively makes it a breaking change.
If supporting ±∞ as valid data was intentional in the original design,
then I agree that changing this behavior is not appropriate. However, if it
wasn’t an intentional design choice and infinity handling was simply
unspecified, then I think explicitly rejecting/ignoring infinities and
documenting the behavior would improve robustness and make the policy
clearer for users.
I wanted to check your view on this before moving forward, since it
affects compatibility and library semantics.
—
Reply to this email directly, view it on GitHub
<#471 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAB5E6SNQ4WHZEFLXWC2OYL4FXRWFAVCNFSM6AAAAACQ5Y54J2VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTOMRRHEZDKOJSGE>
.
You are receiving this because you were mentioned.Message ID:
***@***.***>
|
|
Thanks for the detailed explanation and examples. Since this PR goes against core model of data structure, I’m going to close it. Thanks |
Currently I'm writing code for go t-Digest.
While I refer C++ implementation, i found that no guard for +inf, -inf.
Additional Question: There is no validation logic in the t-Digest about overflow or underflow from calculation. Is intended behavior?