feat: impl densitysketch #62

PsiACE · 2026-01-06T17:32:27Z

Density sketch implementation for density estimation from streaming data.

Ref https://github.com/apache/datasketches-cpp/tree/master/density

Signed-off-by: Chojan Shang <psiace@apache.org>

notfilippo · 2026-01-16T15:59:28Z

datasketches/src/density/sketch.rs

+}
+
+thread_local! {
+    static RNG_STATE: Cell<u64> = Cell::new(seed_rng());


I would love not to have thread-local variables declared statically in the library. Could we instead store this state in a separate struct and pass it to the sketch when needed, delegating lifecycle decisions to the user?

It seems C++ uses thread locals, while Java relies on java.util.Random.

Unfortunately, Rust has no random number support in std, so we have to find a reasonable approach here. At the very least, the shared random code should live under datasketches::random_common.

cc @jmalkin @leerho

Both for the random usage and for more background on the density sketch, which seems to be a new sketch that I'm not yet familiar with.
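To make the alternative to the thread-local concrete, here is a minimal sketch of an RNG that the caller owns and hands to the sketch. Everything in it is an assumption for illustration: `SketchRng`, `CoinFlipper`, and `pick_offset` are hypothetical names that do not appear in the PR, and xorshift64* is only a stand-in generator, not necessarily what the PR or the C++ library uses.

```rust
/// Hypothetical user-owned generator (xorshift64*); not the PR's code.
#[derive(Debug, Clone)]
pub struct SketchRng {
    state: u64,
}

impl SketchRng {
    /// Builds a generator from an explicit seed. The xorshift state must be
    /// nonzero, so a zero seed is replaced by a fixed nonzero constant.
    pub fn new(seed: u64) -> Self {
        let state = if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed };
        Self { state }
    }

    /// Advances the state and returns the next 64-bit value.
    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.state;
        x ^= x >> 12;
        x ^= x << 25;
        x ^= x >> 27;
        self.state = x;
        x.wrapping_mul(0x2545_F491_4F6C_DD1D)
    }

    /// Returns a coin flip taken from the top bit of the next value.
    pub fn next_bool(&mut self) -> bool {
        self.next_u64() >> 63 == 1
    }
}

/// Hypothetical stand-in for a sketch that owns its randomness instead of
/// reading a thread-local.
pub struct CoinFlipper {
    rng: SketchRng,
}

impl CoinFlipper {
    /// The caller decides how the generator is seeded and how long it lives.
    pub fn with_rng(rng: SketchRng) -> Self {
        Self { rng }
    }

    /// Picks a random offset of 0 or 1, e.g. to choose which half of a
    /// buffer survives a compaction step.
    pub fn pick_offset(&mut self) -> usize {
        usize::from(self.rng.next_bool())
    }
}
```

With this shape, two sketches built from the same seed make the same random choices, which also makes tests reproducible without touching global state.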

tisonkun · 2026-01-17T14:54:14Z

Seems the paper reference is this one: https://arxiv.org/abs/2102.12301?

tisonkun · 2026-01-17T15:02:46Z

datasketches/src/density/sketch.rs

+use crate::error::ErrorKind;
+
+/// Floating point types supported by the density sketch.
+pub trait DensityValue: Copy + PartialOrd + 'static {


This trait can be sealed, and it could then have a size method that returns the in-memory size of the value.
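A minimal sketch of what sealing could look like, assuming the supported types are `f32` and `f64`. The `private::Sealed` module and the `size_bytes` method name are hypothetical choices made for this example, not names taken from the PR.

```rust
mod private {
    /// Only this crate can implement `Sealed`, so only this crate can
    /// implement any trait that requires it.
    pub trait Sealed {}
    impl Sealed for f32 {}
    impl Sealed for f64 {}
}

/// Floating point types supported by the density sketch (sealed variant).
pub trait DensityValue: private::Sealed + Copy + PartialOrd + 'static {
    /// In-memory size of one value in bytes (hypothetical method name).
    fn size_bytes() -> usize {
        std::mem::size_of::<Self>()
    }
}

impl DensityValue for f32 {}
impl DensityValue for f64 {}
```

Because downstream crates cannot name `private::Sealed`, they cannot add new `DensityValue` implementations, which leaves the library free to add methods to the trait later without breaking users.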

tisonkun · 2026-01-17T15:11:56Z

datasketches/src/density/sketch.rs

+pub trait DensityKernel<T: DensityValue> {
+    /// Returns the kernel evaluation for the two points.
+    fn evaluate(&self, left: &[T], right: &[T]) -> T;
+}


Suggested change

-pub trait DensityKernel<T: DensityValue> {
-    /// Returns the kernel evaluation for the two points.
-    fn evaluate(&self, left: &[T], right: &[T]) -> T;
-}
+pub trait DensityKernel {
+    /// Returns the kernel evaluation for the two points.
+    fn evaluate<T: DensityValue>(&self, left: &[T], right: &[T]) -> T;
+}

The kernel trait itself doesn't seem to be tied to one particular DensityValue; it should accept any DensityValue.

~~Sorry. SphericalKernel works only for f32.~~

Nope. SphericalKernel exists only in test scope, and I don't think that is reasonable. Why does the kernel need to be configurable if the only out-of-the-box kernel is GaussianKernel?
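For illustration, here is a rough sketch of the suggested generic-method form with a Gaussian kernel. It rests on two assumptions that the PR does not confirm: that `DensityValue` exposes conversions to and from `f64` (the hypothetical `to_f64` and `from_f64` below), and that the Gaussian kernel has the form exp(-||left - right||^2).

```rust
/// Hypothetical version of the value trait with f64 conversions.
pub trait DensityValue: Copy + PartialOrd + 'static {
    fn to_f64(self) -> f64;
    fn from_f64(v: f64) -> Self;
}

impl DensityValue for f32 {
    fn to_f64(self) -> f64 {
        f64::from(self)
    }
    fn from_f64(v: f64) -> Self {
        v as f32
    }
}

impl DensityValue for f64 {
    fn to_f64(self) -> f64 {
        self
    }
    fn from_f64(v: f64) -> Self {
        v
    }
}

/// Kernel trait with the type parameter moved onto the method.
pub trait DensityKernel {
    /// Returns the kernel evaluation for the two points.
    fn evaluate<T: DensityValue>(&self, left: &[T], right: &[T]) -> T;
}

/// Gaussian kernel, assumed here to be exp(-squared Euclidean distance).
pub struct GaussianKernel;

impl DensityKernel for GaussianKernel {
    fn evaluate<T: DensityValue>(&self, left: &[T], right: &[T]) -> T {
        // Accumulate the squared distance in f64 regardless of T.
        let squared: f64 = left
            .iter()
            .zip(right)
            .map(|(a, b)| {
                let d = a.to_f64() - b.to_f64();
                d * d
            })
            .sum();
        T::from_f64((-squared).exp())
    }
}
```

One trade-off of this form is that a kernel can no longer be specialized to a single value type, so a test-only kernel that works only for `f32` would not fit it. A trait with a generic method also cannot be used as a trait object.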

tisonkun

Several comments above.

Also, I need some time to understand what the density sketch is and how it works.

leerho · 2026-01-18T00:20:24Z

@c-dickens,
Unfortunately, the documentation for our density sketch in C++ and Python is practically nonexistent. I could use your help adding some documentation, certainly in the header of the main C++ and Python sketch classes (and perhaps on the website as well, or on the website with a pointer to it from the class files), that addresses:

What problem this sketch solves (with an example)
How it works, at a very basic level
Any caveats, assumptions or limitations the user should know about.
How the sketch error is computed and what it means to the user.
References to the research paper(s) that guided this implementation.

Or if you want to just write it up as a markdown and send it to me, I can integrate it into the code and website. Whatever works for you :)

Thanks!
Lee.

leerho · 2026-01-19T20:01:46Z

I’m going to change this PR to Draft until we can get some meaningful documentation and a reference to the correct research paper. The Desai et al. paper linked above is not the correct reference; the correct one is a paper by Karnin & Liberty.

I am reaching out to @AlexanderSaydakov to get more information.

leerho · 2026-01-19T22:10:38Z

There is some documentation, but it is still sparse.

On the datasketches website, look under Code Docs, CPP 5.2.0, then Namespaces/DensitySketch, and scroll down to the Detailed Description.

I have not used this sketch, but in many scientific areas (including ML) it is important to understand the density at a point in a large algebraic field.

This sketch is constructed with a value k, which controls the overall size and accuracy of the sketch; dim, the number of spatial dimensions; a kernel function, which does the work of identifying the density at a point in dim-space within a dim-sphere whose radial sensitivity is controlled by the kernel density function (a common one is Gaussian, but the user can supply others); and finally an Allocator.

The input is a stream of floating-point vectors of size dim. A query is basically a get_estimate(point), where point is a vector of size dim. Once the sketch reaches size k, it starts summarizing the field into a core-set of points that approximates the densest points in the field. I would study some of the test cases for simple examples and read the Karnin & Liberty paper.
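As a rough illustration of what a query computes, the sketch below estimates the density at a point as the weighted average of kernel values over a small set of retained points. It is an assumption-laden toy, not the PR's or the C++ library's code: `WeightedPoint`, `gaussian_kernel`, and `estimate` are hypothetical names, the kernel is assumed to be exp(-||a - b||^2), and the compaction that produces the core-set is left out entirely.

```rust
/// A retained point and the number of input points it stands in for
/// (hypothetical type for this example).
pub struct WeightedPoint {
    pub coords: Vec<f64>,
    pub weight: u64,
}

/// Gaussian kernel, assumed here to be exp(-squared Euclidean distance).
pub fn gaussian_kernel(a: &[f64], b: &[f64]) -> f64 {
    let squared: f64 = a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum();
    (-squared).exp()
}

/// Density estimate at `query`: the weighted sum of kernel values over the
/// core-set, divided by `n`, the total number of points seen in the stream.
pub fn estimate(coreset: &[WeightedPoint], n: u64, query: &[f64]) -> f64 {
    if n == 0 {
        return 0.0;
    }
    let total: f64 = coreset
        .iter()
        .map(|p| p.weight as f64 * gaussian_kernel(&p.coords, query))
        .sum();
    total / n as f64
}
```

If every input point is kept with weight 1, this reduces to the exact kernel density estimate; the sketch trades some accuracy for a bounded number of retained points.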

Note: this has not been ported to Java, but can be accessed from Python.

leerho · 2026-01-20T02:44:41Z

The original authors of the concepts behind this sketch are Zohar Karnin & Edo Liberty (see above). In addition, Edo wrote the original Python code gde.py that was the inspiration for our C++ code as well as for our Python code that calls it.

In Edo's streaming-quantiles directory that contains gde.py, you can also find a very useful gde_example_usage.ipynb Jupyter Notebook with nice comments and graphs that illustrate the use of this sketch. There is also a gde_test.py that illustrates some useful tests. So I think the documentation is out there; it is just not easy to find from our library. We need to fix this before we go much further with this sketch, e.g. it is not even mentioned on our website. (This should have been addressed before we released it with C++ and Python!)

P.S. Edo has been a strong supporter of our DataSketches project, has guided us in a number of theoretical areas, and is on our PMC.

PsiACE · 2026-01-20T03:10:11Z

Thanks, @leerho. I think I have a better understanding of the background now. I'll study these materials later and try to make everything work smoothly.

tisonkun changed the title ~~feat: imple densitysketch~~ feat: impl densitysketch Jan 7, 2026

PsiACE force-pushed the density branch from d556cb3 to b74e510 Compare January 7, 2026 05:31

PsiACE mentioned this pull request Jan 7, 2026

feat: add theta sketch (part 2) #59

Merged

tisonkun reviewed Jan 7, 2026

feat: imple densitysketch

d209b3c

Signed-off-by: Chojan Shang <psiace@apache.org>

PsiACE force-pushed the density branch from b74e510 to d209b3c Compare January 8, 2026 13:15

leerho requested a review from tisonkun January 14, 2026 22:13

notfilippo reviewed Jan 16, 2026

notfilippo reviewed Jan 16, 2026

tisonkun reviewed Jan 17, 2026

tisonkun reviewed Jan 17, 2026

tisonkun requested changes Jan 17, 2026

PsiACE added 2 commits January 18, 2026 19:17

refactor: align density sketch kernel rng

10c8526

Merge upstream/main

cdd821e

leerho marked this pull request as draft January 19, 2026 20:14
