feat: add halve/decay to countmin sketch #71
base: main
If the impact on error bounds is not known, we need to have a large warning about this and to set a flag that refuses to return error bounds after these operations. Lying by giving an answer when we do not have math to support it is very bad.
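A minimal sketch of the "refuse to return error bounds" idea, in case the bound turns out not to be provable. Everything here is hypothetical: `BoundTracker`, `record_scaling`, and `error_bound` are illustrative names, not part of this crate's API, and the bound shown is the usual additive `relative_error * total_weight` form.

```rust
/// Hypothetical bookkeeping type; not part of the crate under review.
pub struct BoundTracker {
    relative_error: f64,
    total_weight: u64,
    /// Set to false once an operation without a proven bound has run.
    bound_is_proven: bool,
}

impl BoundTracker {
    pub fn new(relative_error: f64) -> Self {
        Self { relative_error, total_weight: 0, bound_is_proven: true }
    }

    /// Plain additive update: the standard bound still applies.
    pub fn record_update(&mut self, weight: u64) {
        self.total_weight += weight;
    }

    /// Scaling (halve/decay): track the scaled weight, but stop vouching for the bound.
    pub fn record_scaling(&mut self, factor: f64) {
        self.total_weight = (self.total_weight as f64 * factor) as u64;
        self.bound_is_proven = false;
    }

    /// Returns `None` instead of a number we cannot justify.
    pub fn error_bound(&self) -> Option<f64> {
        if self.bound_is_proven {
            Some(self.relative_error * self.total_weight as f64)
        } else {
            None
        }
    }
}
```

Returning `Option` makes the caller handle the "no guarantee" case explicitly rather than silently receiving a number.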
I think this goes through by the linearity of Count-Min, but I am on my phone so I haven't looked at the code yet.
Agree that this should be provable. Perhaps https://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf can be helpful.
To answer the question directly: this example of scaling up or down by a nonnegative value is acceptable. However, there is a knock-on effect on the error bound which needs testing. Basically, for an underlying frequency value

In future, we do need a little care because there is some subtlety around the behaviour depending on the input stream. If only positive integers are used for addition and multiplication and the underlying stream only has positive frequencies, then things are ok. The nuance comes in around negative values in the underlying frequency vector, in which case we might prefer a

There are some good notes here and here that make the point clear. The "standard" count min point query estimation is only valid on the assumption that all entries in the underlying frequency vector are nonnegative. Happy to give more details if need be.

In reference to @tisonkun's comment, there are a bunch of windowing/merging/aggregate approaches that might be of interest. If so, please add to the discussion. I am very keen to learn if you have specific applications in mind, or whether they (and other sketches, e.g. CountSketch) would be generally useful.
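The nonnegative case can be illustrated with a toy sketch. The code below is an assumption-laden illustration, not this crate's implementation: `ToyCountMin` and its hashing scheme are made up for the example, and counters are integers floored after scaling. With nonnegative frequencies every counter is at least the true frequency `f`, so after `decay(factor)` every counter is at least `floor(f * factor)`, and `total_weight` shrinks by the same factor (up to rounding).

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy count-min sketch for illustration only; not the crate's implementation.
pub struct ToyCountMin {
    depth: usize,
    width: usize,
    counters: Vec<u64>, // row-major: depth * width
    total_weight: u64,
}

impl ToyCountMin {
    pub fn new(depth: usize, width: usize) -> Self {
        assert!(depth > 0 && width > 0);
        Self { depth, width, counters: vec![0; depth * width], total_weight: 0 }
    }

    /// Index of the counter for `item` in `row` (the row number acts as the seed).
    fn slot<T: Hash + ?Sized>(&self, row: usize, item: &T) -> usize {
        let mut h = DefaultHasher::new();
        row.hash(&mut h);
        item.hash(&mut h);
        row * self.width + (h.finish() as usize) % self.width
    }

    /// Nonnegative updates only, matching the assumption discussed above.
    pub fn update<T: Hash + ?Sized>(&mut self, item: &T, weight: u64) {
        for row in 0..self.depth {
            let i = self.slot(row, item);
            self.counters[i] += weight;
        }
        self.total_weight += weight;
    }

    /// Standard point query: minimum over rows.
    pub fn estimate<T: Hash + ?Sized>(&self, item: &T) -> u64 {
        (0..self.depth)
            .map(|row| self.counters[self.slot(row, item)])
            .min()
            .unwrap_or(0)
    }

    /// Scale every counter and the total weight by `factor` in [0, 1], flooring.
    pub fn decay(&mut self, factor: f64) {
        assert!((0.0..=1.0).contains(&factor));
        for c in &mut self.counters {
            *c = (*c as f64 * factor) as u64;
        }
        self.total_weight = (self.total_weight as f64 * factor) as u64;
    }

    pub fn total_weight(&self) -> u64 {
        self.total_weight
    }
}
```

Because flooring is monotone, the one-sided guarantee (no underestimate of the floored, scaled frequency) survives in this toy. How rounding interacts with the additive bound on the real implementation is exactly the part that still needs testing.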
Marking this as ready for review. I added a simple test to make sure the scaling behavior basically works (single key, no collisions). We can use that as a starting point.
@c-dickens,
/// let mut sketch = CountMinSketch::new(4, 128);
/// sketch.update_with_weight("apple", 3);
/// sketch.decay(0.5);
/// assert!(sketch.estimate("apple") >= 1);
Maybe a more practical example? Ditto for halve.
@MrCroxx how can halve and decay practically be helpful?
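One well-known use of halving in frequency sketches is aging: counters are halved after a fixed number of updates so that old traffic gradually counts for less, as in TinyLFU-style cache admission. The sketch below only illustrates that pattern under stated assumptions. `AgingCounters` is a hypothetical single-row stand-in for a sketch, and the caller supplies the slot directly instead of hashing a key.

```rust
/// Hypothetical single-row stand-in for a sketch; illustrates periodic halving only.
pub struct AgingCounters {
    counters: Vec<u32>,
    updates_since_halve: usize,
    halve_every: usize,
}

impl AgingCounters {
    pub fn new(width: usize, halve_every: usize) -> Self {
        assert!(width > 0 && halve_every > 0);
        Self { counters: vec![0; width], updates_since_halve: 0, halve_every }
    }

    /// Count one occurrence, then age all counters once enough updates have accumulated.
    pub fn record(&mut self, slot: usize) {
        let i = slot % self.counters.len();
        self.counters[i] = self.counters[i].saturating_add(1);
        self.updates_since_halve += 1;
        if self.updates_since_halve >= self.halve_every {
            self.halve();
        }
    }

    /// Halve every counter (integer shift) and restart the update window.
    pub fn halve(&mut self) {
        for c in &mut self.counters {
            *c >>= 1;
        }
        self.updates_since_halve = 0;
    }

    pub fn count(&self, slot: usize) -> u32 {
        self.counters[slot % self.counters.len()]
    }
}
```

For example, with `halve_every = 10`, six hits on slot 0 followed by four on slot 1 trigger one halving (leaving 3 and 2); six more hits on slot 1 then bring it to 8, so the recently active key outranks the older one.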
Co-authored-by: tison <wander4096@gmail.com>
Added halve and decay to CountMinSketch, matching cmsketch behavior for non-negative counters.

This scales counters (and total_weight) to support exponential decay while keeping serialization unchanged. I'm not fully confident about the math details, especially around our error bounds, but the implementation itself should be straightforward. 😃

Related #70