Skip to content

feat: support sample-weighted aggregations for sampled trace data#1963

Open
vinzee wants to merge 3 commits intohyperdxio:mainfrom
vinzee:va/trace_sampling_aggregations
Open

feat: support sample-weighted aggregations for sampled trace data#1963
vinzee wants to merge 3 commits intohyperdxio:mainfrom
vinzee:va/trace_sampling_aggregations

Conversation

@vinzee
Copy link
Contributor

@vinzee vinzee commented Mar 23, 2026

Problem

High-throughput services can produce millions of spans per second. Storing every span is expensive, so we run the OpenTelemetry Collector's tail-sampling processor to keep only 1-in-N spans. Each kept span carries a SampleRate attribute recording N.

Once data is sampled, naive aggregations are wrong: count() returns N-x fewer events than actually occurred, sum()/avg() are biased, and percentiles shift. Dashboards show misleadingly low request counts, throughput, and error rates, making capacity planning and alerting unreliable.

Summary

TraceSourceSchema gets a new optional field sampleRateExpression - the ClickHouse expression that evaluates to the per-span sample rate (e.g. SpanAttributes['SampleRate']). When not configured, all queries are unchanged.
When set, the query builder rewrites SQL aggregations to weight each span by its sample rate:

aggFn Before After (sample-corrected) Overhead
count count() sum(weight) ~1x
count + cond countIf(cond) sumIf(weight, cond) ~1x
avg avg(col) sum(col * weight) / sum(weight) ~2x
sum sum(col) sum(col * weight) ~1x
quantile(p) quantile(p)(col) quantileTDigestWeighted(p)(col, toUInt32(weight)) ~1.5x
min/max unchanged unchanged 1x
count_distinct unchanged unchanged (cannot correct) 1x

Types:

  • Add sampleRateExpression to TraceSourceSchema + Mongoose model
  • Add sampleWeightExpression to ChartConfig schema

Query builder:

  • sampleWeightExpression is wrapped as greatest(toUInt64OrZero(toString(expr)), 1) so
    spans without a SampleRate attribute default to weight 1 (unsampled
    data produces identical results to the original queries).
  • Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression
    is set, with safe default-to-1 wrapping

Integration (propagate sampleWeightExpression to all chart configs):

  • ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions
  • DBDashboardPage (raw SQL + builder branches)
  • AlertPreviewChart
  • SessionSubpanel
  • ServiceDashboardEndpointPerformanceChart
  • ServiceDashboardSlowestEventsTile (p95 query + events table)
  • ServiceDashboardEndpointSidePanel (error rate + throughput)
  • ServiceDashboardDbQuerySidePanel (total query time + throughput)
  • External API v2 charts, AI controller, alerts (index + template)

UI:

  • Add Sample Rate Expression field to trace source admin form

Screenshots or video

Before After

How to test locally or on Vercel

References

  • Linear Issue:
  • Related PRs:

TraceSourceSchema has a new optional field `sampleRateExpression`.
When undefined, SQL aggregations are unchanged. But when set (e.g. `SpanAttributes['SampleRate']`),
SQL aggregations are rewritten to correct for upstream 1-in-N sampling:

  aggFn          | Before                 | After (sample-corrected)                            | Overhead
  -------------- | ---------------------- | --------------------------------------------------- | --------
  count          | count()                | sum(weight)                                         | ~1x
  count + cond   | countIf(cond)          | sumIf(weight, cond)                                 | ~1x
  avg            | avg(col)               | sum(col * weight) / sum(weight)                     | ~2x
  sum            | sum(col)               | sum(col * weight)                                   | ~1x
  quantile(p)    | quantile(p)(col)       | quantileTDigestWeighted(p)(col, toUInt32(weight))   | ~1.5x
  min/max        | unchanged              | unchanged                                           | 1x
  count_distinct | unchanged              | unchanged (cannot correct)                          | 1x

Weight is wrapped as greatest(toUInt64OrZero(toString(expr)), 1) so
spans without a SampleRate attribute default to weight 1 (unsampled
data produces identical results to the original queries).

Types:
- Add sampleWeightExpression to ChartConfig schema
- Add sampleRateExpression to TraceSourceSchema + Mongoose model

Query builder:
- Rewrite aggFnExpr in renderChartConfig.ts when sampleWeightExpression
  is set, with safe default-to-1 wrapping

Integration (propagate sampleWeightExpression to all chart configs):
- ChartEditor/utils.ts, DBSearchPage, ServicesDashboardPage, sessions
- DBDashboardPage (raw SQL + builder branches)
- AlertPreviewChart
- SessionSubpanel
- ServiceDashboardEndpointPerformanceChart
- ServiceDashboardSlowestEventsTile (p95 query + events table)
- ServiceDashboardEndpointSidePanel (error rate + throughput)
- ServiceDashboardDbQuerySidePanel (total query time + throughput)
- External API v2 charts, AI controller, alerts (index + template)

UI:
- Add Sample Rate Expression field to trace source admin form
@vercel
Copy link

vercel bot commented Mar 23, 2026

@vinzee is attempting to deploy a commit to the HyperDX Team on Vercel.

A member of the Team first needs to authorize it.

@changeset-bot
Copy link

changeset-bot bot commented Mar 23, 2026

🦋 Changeset detected

Latest commit: 23872e7

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 4 packages
Name Type
@hyperdx/common-utils Patch
@hyperdx/api Patch
@hyperdx/app Patch
@hyperdx/otel-collector Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@github-actions
Copy link
Contributor

github-actions bot commented Mar 23, 2026

PR Review

  • ⚠️ Changeset version is patch but this is a new user-facing feature → Change all three packages in .changeset/tiny-forks-deny.md from patch to minor per semver conventions.

  • ⚠️ sampleRateExpression is interpolated directly into SQL with no sanitization (\greatest(toUInt64OrZero(toString(${sampleWeightExpression})), 1)`UNSAFE_RAW_SQL) — consistent with how other raw SQL fields work in this codebase, but z.string().optional()` applies no constraints. Acceptable if this field is admin-only, but worth a comment noting the trust boundary.

  • ⚠️ p_rate / p_sample aggregation functions (fn.endsWith('State') / fn.endsWith('Merge') guard) silently skip rewriting for -State and -Merge combinators — but the guard doesn't cover -MergeState. If someone uses a -MergeState combinator function this guard won't protect it. Consider fn.includes('Merge') || fn.includes('State') or add a comment explaining why -MergeState isn't a concern.

  • ✅ Migration file (000002_add_sample_rate_column_to_otel_traces.up.sql) correctly uses ADD COLUMN IF NOT EXISTS — safe for existing deployments.

  • ✅ Good test coverage for all aggregation rewrite cases including edge cases (empty expression, SpanAttributes map access, mixed weighted/passthrough).

vinzee added 2 commits March 22, 2026 17:30
add SampleRate migration for existing deployments, standardize
sampleWeightExpression propagation, and document percentile approximation

The initial sampling support (c61acd1) only added the SampleRate
materialized column to the seed file, which runs on fresh installs.
Existing deployments need an ALTER TABLE migration.

Changes:
- Add CH migration 000002 to create the SampleRate materialized column
  on existing otel_traces tables (IF NOT EXISTS, safe to re-run)
- Standardize sampleWeightExpression propagation across 5 components
  (SessionSubpanel, ServiceDashboardDbQuerySidePanel,
  ServiceDashboardEndpointSidePanel,
  ServiceDashboardEndpointPerformanceChart,
  ServiceDashboardSlowestEventsTile) from direct assignment to
  conditional spread, matching the pattern used elsewhere
- Note in SourceForm help text that percentiles under sampling use
  quantileTDigestWeighted (approximate T-Digest sketch)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant