Skip to content

Conversation

@PsiACE
Copy link
Member

@PsiACE PsiACE commented Jan 6, 2026

Density sketch implementation for density estimation from streaming data.

Ref https://github.com/apache/datasketches-cpp/tree/master/density

@tisonkun tisonkun changed the title feat: imple densitysketch feat: impl densitysketch Jan 7, 2026
Signed-off-by: Chojan Shang <psiace@apache.org>
}

thread_local! {
static RNG_STATE: Cell<u64> = Cell::new(seed_rng());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would love not to have thread local variables declared statically in the library. Is this really something we cannot store in another struct and pass it to the sketch when needed, delegating lifecycle decisions to the user?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems C++ uses thread locals while Java relies on the java.util.Random.

Unfortunately, Rust doesn't have a std random support so we have to find a reasonable way here. At least the random common should live under datasketches::random_common.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @jmalkin @leerho

Both for random usage and more background of density sketch, which seems a new sketch I'm not yet familiar with.

@tisonkun
Copy link
Member

Seems the paper reference is this one: https://arxiv.org/abs/2102.12301?

use crate::error::ErrorKind;

/// Floating point types supported by the density sketch.
pub trait DensityValue: Copy + PartialOrd + 'static {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be sealed and thus has a size method to return the mem size.

Comment on lines 63 to 66
pub trait DensityKernel<T: DensityValue> {
/// Returns the kernel evaluation for the two points.
fn evaluate(&self, left: &[T], right: &[T]) -> T;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub trait DensityKernel<T: DensityValue> {
/// Returns the kernel evaluation for the two points.
fn evaluate(&self, left: &[T], right: &[T]) -> T;
}
pub trait DensityKernel {
/// Returns the kernel evaluation for the two points.
fn evaluate<T: DensityValue>(&self, left: &[T], right: &[T]) -> T;
}

The trait bound doesn't seem to be related to DensityValue but it should accept any DensityValue.

Copy link
Member

@tisonkun tisonkun Jan 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. SphericalKernel works only for f32.

Nope. SphericalKernel is only in testing scope and I don't think it's reasonable. Why do we need this "kernel" to be configurable if our out-of-the-box "kernel" has only GaussianKernel?

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several comments above.

Also I need some time to understand what densitysketch is and how it works.

@leerho
Copy link
Member

leerho commented Jan 18, 2026

@c-dickens,
Unfortunately, the documentation on our density sketch in C++ and Python is practically non-existant. I could use your help in adding some documentation certainly in the header of the C++ and Python main sketch classes (and perhaps on the website as well, or on the website and a pointer to it within the class files) that addresses:

  • What problem this sketch solves (with an example)
  • How it works, at a very basic level
  • Any caveats, assumptions or limitations the user should know about.
  • How the sketch error is computed and what it means to the user.
  • References to the research paper(s) that guided this implementation.

Or if you want to just write it up as a markdown and send it to me, I can integrate it into the code and website. Whatever works for you :)

Thanks!
Lee.

@leerho
Copy link
Member

leerho commented Jan 19, 2026

I’m going to change this PR to Draft until we can get some meaningful documentation and a reference to the correct research paper. The above Desai, et al paper is not correct. It is a paper by Karnin & Liberty.

I am reaching out to @AlexanderSaydakov to get more information.

@leerho leerho marked this pull request as draft January 19, 2026 20:14
@leerho
Copy link
Member

leerho commented Jan 19, 2026

There is some documentation, but it is still sparse.

From the datasketches website, you can find under Code Docs CPP 5.2.0 then Namespaces/DensitySketch. Then scroll down to the Detailed Description.

I have not used this sketch, but in many scientific areas (including ML) it is important to understand the density at a point in an large algebraic field.

This sketch is constructed with a value k, which controls the overall size and accuracy of the sketch, dim, the number of spacial dimensions, a Kernel function, which does the work of identifying the density at a point in dim-space within a dim-sphere whose radial sensitivity is controlled by the Kernal density function (A common one is Gaussian, but user can supply others), and finally an Allocator.

The input is a stream of floating-point vectors of size dim. A query is basically a get_estimate(point) where the point is a dim_vector. Once the sketch reaches size k, it starts summarizing the field into a core-set of points which approximates the most dense points in the field. I would study some of the test cases for some simple examples and read the Karnin & Liberty paper.

Note: this has not been ported to Java, but can be accessed from Python.

@leerho
Copy link
Member

leerho commented Jan 20, 2026

The original authors of the concepts behind this sketch are Zohar Karnin & Edo Liberty (see above), In addition, Edo wrote the original Python code gde.py that was our inspiration for our C++ code as well as our python code that calls it.

In Edo's streaming-quantiles directory that contains gde.py, you can also find a very useful gde_example_usage.ipynb Jupyter Notebook with nice comments and graphs that illustrate the use of this sketch. There is also a gde_test.py that illustrates some useful tests. So I think documentation is out there, it just is not easy to find from our library. We need to fix this before we go much further with this sketch, e.g. it is not even mentioned on our website. (This should have been addressed before we released it with C++ and Python!)

P.S. Edo has been a strong supporter of our DataSketches project, has guided us in a number of theoretical areas, and is on our PMC.

@PsiACE
Copy link
Member Author

PsiACE commented Jan 20, 2026

Thanks, @leerho . I think I have a better understanding of the background now. I'll study them later and try to make everything work smoothly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants