-
Notifications
You must be signed in to change notification settings - Fork 11
feat: impl densitysketch #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Chojan Shang <psiace@apache.org>
datasketches/src/density/sketch.rs
Outdated
| } | ||
|
|
||
| thread_local! { | ||
| static RNG_STATE: Cell<u64> = Cell::new(seed_rng()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love not to have thread local variables declared statically in the library. Is this really something we cannot store in another struct and pass it to the sketch when needed, delegating lifecycle decisions to the user?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems C++ uses thread locals while Java relies on the java.util.Random.
Unfortunately, Rust doesn't have a std random support so we have to find a reasonable way here. At least the random common should live under datasketches::random_common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Seems the paper reference is this one: https://arxiv.org/abs/2102.12301? |
| use crate::error::ErrorKind; | ||
|
|
||
| /// Floating point types supported by the density sketch. | ||
| pub trait DensityValue: Copy + PartialOrd + 'static { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be sealed and thus has a size method to return the mem size.
datasketches/src/density/sketch.rs
Outdated
| pub trait DensityKernel<T: DensityValue> { | ||
| /// Returns the kernel evaluation for the two points. | ||
| fn evaluate(&self, left: &[T], right: &[T]) -> T; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| pub trait DensityKernel<T: DensityValue> { | |
| /// Returns the kernel evaluation for the two points. | |
| fn evaluate(&self, left: &[T], right: &[T]) -> T; | |
| } | |
| pub trait DensityKernel { | |
| /// Returns the kernel evaluation for the two points. | |
| fn evaluate<T: DensityValue>(&self, left: &[T], right: &[T]) -> T; | |
| } |
The trait bound doesn't seem to be related to DensityValue but it should accept any DensityValue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry. SphericalKernel works only for f32.
Nope. SphericalKernel is only in testing scope and I don't think it's reasonable. Why do we need this "kernel" to be configurable if our out-of-the-box "kernel" has only GaussianKernel?
tisonkun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Several comments above.
Also I need some time to understand what densitysketch is and how it works.
|
@c-dickens,
Or if you want to just write it up as a markdown and send it to me, I can integrate it into the code and website. Whatever works for you :) Thanks! |
|
I’m going to change this PR to Draft until we can get some meaningful documentation and a reference to the correct research paper. The above Desai, et al paper is not correct. It is a paper by Karnin & Liberty. I am reaching out to @AlexanderSaydakov to get more information. |
|
There is some documentation, but it is still sparse. From the datasketches website, you can find under Code Docs CPP 5.2.0 then Namespaces/DensitySketch. Then scroll down to the Detailed Description. I have not used this sketch, but in many scientific areas (including ML) it is important to understand the density at a point in an large algebraic field. This sketch is constructed with a value k, which controls the overall size and accuracy of the sketch, dim, the number of spacial dimensions, a Kernel function, which does the work of identifying the density at a point in dim-space within a dim-sphere whose radial sensitivity is controlled by the Kernal density function (A common one is Gaussian, but user can supply others), and finally an Allocator. The input is a stream of floating-point vectors of size dim. A query is basically a get_estimate(point) where the point is a dim_vector. Once the sketch reaches size k, it starts summarizing the field into a core-set of points which approximates the most dense points in the field. I would study some of the test cases for some simple examples and read the Karnin & Liberty paper. Note: this has not been ported to Java, but can be accessed from Python. |
|
The original authors of the concepts behind this sketch are Zohar Karnin & Edo Liberty (see above), In addition, Edo wrote the original Python code gde.py that was our inspiration for our C++ code as well as our python code that calls it. In Edo's streaming-quantiles directory that contains gde.py, you can also find a very useful gde_example_usage.ipynb Jupyter Notebook with nice comments and graphs that illustrate the use of this sketch. There is also a gde_test.py that illustrates some useful tests. So I think documentation is out there, it just is not easy to find from our library. We need to fix this before we go much further with this sketch, e.g. it is not even mentioned on our website. (This should have been addressed before we released it with C++ and Python!) P.S. Edo has been a strong supporter of our DataSketches project, has guided us in a number of theoretical areas, and is on our PMC. |
|
Thanks, @leerho . I think I have a better understanding of the background now. I'll study them later and try to make everything work smoothly. |
Density sketch implementation for density estimation from streaming data.
Ref https://github.com/apache/datasketches-cpp/tree/master/density