Conversation

@PsiACE (Member) commented Jan 14, 2026

Added halve and decay to CountMinSketch, matching cmsketch behavior for non‑negative counters.

This scales counters (and total_weight) to support exponential decay while keeping serialization unchanged. I’m not fully confident about the math details, especially around our error bounds, but the implementation itself should be straightforward. 😃
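For anyone skimming without opening the diff, here is a minimal sketch of the mechanism described above. The field names (`counters`, `total_weight`), the flattened table layout, and the truncating rounding are illustrative assumptions, not the PR's actual code:

```rust
/// Minimal sketch of the idea, not the committed implementation: `decay`
/// scales every counter and the running total weight by the same
/// nonnegative factor, and `halve` is the factor-0.5 special case.
struct CountMinSketch {
    counters: Vec<u64>, // assumed flattened depth x width table
    total_weight: u64,
}

impl CountMinSketch {
    /// Scale all counters and the total weight by `factor` (assumed >= 0).
    fn decay(&mut self, factor: f64) {
        assert!(factor >= 0.0, "decay factor must be nonnegative");
        for c in self.counters.iter_mut() {
            *c = (*c as f64 * factor) as u64; // truncates; the real code may round differently
        }
        self.total_weight = (self.total_weight as f64 * factor) as u64;
    }

    /// Halve every counter; equivalent to `decay(0.5)` up to rounding.
    fn halve(&mut self) {
        for c in self.counters.iter_mut() {
            *c /= 2;
        }
        self.total_weight /= 2;
    }
}
```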

Related #70

@jmalkin commented Jan 14, 2026

If the impact on error bounds is not known, we need a prominent warning about this and a flag that refuses to return error bounds after these operations.

Lying by giving an answer when we do not have math to support it is very bad.
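One possible shape for such a guard, purely as an illustration of this request (the field and method names here are assumptions, not an agreed design for this crate):

```rust
/// Illustration only: once decay/halve has run, stop reporting error bounds.
struct CountMinSketch {
    width: usize,
    bounds_invalidated: bool,
    // ... counters, total_weight, etc.
}

impl CountMinSketch {
    fn decay(&mut self, _factor: f64) {
        // ... scale counters and total_weight ...
        self.bounds_invalidated = true; // bound no longer proven after this point
    }

    /// The usual e / width relative-error guarantee, or `None` once the
    /// sketch has been decayed or halved.
    fn relative_error(&self) -> Option<f64> {
        if self.bounds_invalidated {
            None
        } else {
            Some(std::f64::consts::E / self.width as f64)
        }
    }
}
```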

@PsiACE marked this pull request as draft on January 14, 2026 at 16:21
@c-dickens

I think this goes through by the linearity of Count-Min, but I'm on my phone so I haven't looked at the code yet.

@tisonkun (Member)

> I think this goes through by the linearity of Count-Min, but I'm on my phone so I haven't looked at the code yet.

Agree that this should be provable. Perhaps https://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf can be helpful.

@c-dickens

To answer the question directly: this example of scaling up or down by a nonnegative value is acceptable. However, there is a knock-on effect on the error bound which needs testing. Basically, for an underlying frequency value f_i of item i, the error bound is true(f_i) <= est(f_i) <= true(f_i) + error, where error depends on the sketch size and the total mass in the stream. So if you scale the inputs to the sketch (that make up each "true" f_i), then you also need to scale the total_weight; this is implemented in the PR, but the altered error bounds are not tested.
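Spelling that out with the standard Count-Min parameters (w, d, ε, δ below are the textbook symbols, not identifiers from this codebase):

```latex
% Standard Count-Min point-query guarantee, width w and depth d,
% eps = e / w, delta = e^{-d}:
f_i \;\le\; \hat{f}_i \;\le\; f_i + \varepsilon W
\quad \text{with probability at least } 1 - \delta,
\qquad W = \mathrm{total\_weight}.

% After decay(alpha) with alpha >= 0, every counter and W scale by alpha,
% so the same argument gives, by linearity:
\alpha f_i \;\le\; \widehat{\alpha f_i} \;\le\; \alpha f_i + \varepsilon\,(\alpha W)
\quad \text{(up to integer rounding of the counters).}
```

In other words, the additive error should shrink by the same factor α, which is exactly what a test of the altered bounds would want to pin down.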

In future, we do need a little care because there is some subtlety around the behaviour depending on the input stream. If only positive integers are used for addition and multiplication and the underlying stream only has positive frequencies, then things are ok. The nuance comes in around negative values in the underlying frequency vector, in which case we might prefer a CountSketch -- something useful that we haven't got round to yet!

There are some good notes here and here that make the point clear. The "standard" count min point query estimation is only valid on the assumption that all entries in the underlying frequency vector are nonnegative. Happy to give more details if need be.
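To make the nonnegativity assumption concrete (again textbook notation rather than code from this repo):

```latex
% Point query: take the minimum over the d rows of the table C.
\hat{f}_i = \min_{1 \le j \le d} C[j,\, h_j(i)],
\qquad
C[j,\, h_j(i)] = f_i + \sum_{\substack{k \ne i \\ h_j(k) = h_j(i)}} f_k .

% If every f_k >= 0, the collision sum can only add mass, so hat{f}_i >= f_i
% and the one-sided bound holds. With negative entries the minimum can
% undershoot f_i, which is where CountSketch's median estimator is preferred.
```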

In reference to @tisonkun's comment, there are a bunch of windowing/merging/aggregate approaches that might be of interest. If so, please add to the discussion. I am very keen to learn whether you have specific applications in mind, or whether they (and other sketches, e.g. CountSketch) would be generally useful.

@PsiACE marked this pull request as ready for review on January 15, 2026 at 16:36
@PsiACE (Member, Author) commented Jan 15, 2026

Marking this as ready for review. I added a simple test to make sure the scaling behavior basically works (single key, no collisions). We can use that as a starting point.
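For reference, a test along those lines might look roughly like this; the method names follow the PR's doc example, while the dimensions, weights, and return types are assumptions:

```rust
// Illustrative single-key check, not the committed test.
#[test]
fn decay_scales_single_key_estimate() {
    let mut sketch = CountMinSketch::new(4, 128);
    sketch.update_with_weight("apple", 8);

    let before = sketch.estimate("apple");
    sketch.decay(0.5);
    let after = sketch.estimate("apple");

    // With a single key there are no collisions, so the estimate should
    // scale with the counters (up to integer rounding).
    assert!(after <= before);
    assert!(after >= before / 2);
}
```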

@leerho (Member) commented Jan 16, 2026

@c-dickens,
You might check with Justin (@jthaler) as he knows Rust very well and has the math chops to verify our assumptions. Perhaps he or you could then put your approval on this PR.

Comment on lines +210 to +213
```rust
/// let mut sketch = CountMinSketch::new(4, 128);
/// sketch.update_with_weight("apple", 3);
/// sketch.decay(0.5);
/// assert!(sketch.estimate("apple") >= 1);
```

Maybe a more practical example? Ditto halve

@MrCroxx how can halve and decay practically be helpful?

Co-authored-by: tison <wander4096@gmail.com>