[ISSUE #10240] Add BatchSplittingMetricExporter to prevent OTLP gRPC export failures (#10239)

Merged: lizhimins merged 4 commits into apache:develop from Houlong66:feature/metrics_gzip_batch_split on Apr 3, 2026
@Houlong66
Contributor

@Houlong66 Houlong66 commented Apr 2, 2026

What is the purpose of the change

Closes #10240

When high-cardinality metrics (e.g., consumer lag with consumer_group × topic combinations) are exported
via OTLP gRPC, the payload can exceed the gRPC 32MB message size limit or the backend's per-RPC processing
limit, causing all metrics to fail to export.

The OpenTelemetry Java SDK does not support automatic batch splitting for metrics export
(see opentelemetry-java#5394).

This PR adds a MetricExporter decorator that:

  • Splits large batches of MetricData objects into smaller sub-batches by data point count
  • Splits single oversized MetricData objects by their internal data points into multiple
    smaller MetricData objects (supports all 7 MetricDataType variants)
  • Provides a fast path with zero overhead when data points are within the threshold
  • Is configurable via BrokerConfig.metricsExportBatchMaxDataPoints (default 1000)
  • Logs failed batch details for debugging
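
The splitting behavior listed above can be illustrated with a minimal sketch. It is not the PR's implementation: `Metric` is a hypothetical stand-in for OpenTelemetry's `MetricData`, the class and method names are invented for illustration, and the packing strategy (flush the current batch when the next metric would not fit, and give each chunk of an oversized metric its own batch) is only one plausible reading of the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSplitSketch {

    /** Hypothetical stand-in for a metric: a name plus its data points. */
    static final class Metric {
        final String name;
        final List<Double> points;

        Metric(String name, List<Double> points) {
            this.name = name;
            this.points = points;
        }
    }

    /** Total number of data points across all metrics. */
    static int countPoints(List<Metric> metrics) {
        int total = 0;
        for (Metric m : metrics) {
            total += m.points.size();
        }
        return total;
    }

    /** Splits metrics into sub-batches holding at most maxPoints data points each. */
    static List<List<Metric>> split(List<Metric> metrics, int maxPoints) {
        if (maxPoints <= 0) {
            throw new IllegalArgumentException("maxPoints must be positive");
        }
        List<List<Metric>> batches = new ArrayList<>();
        // Fast path: everything fits, so hand back the original list untouched.
        if (countPoints(metrics) <= maxPoints) {
            batches.add(metrics);
            return batches;
        }
        List<Metric> current = new ArrayList<>();
        int used = 0;
        for (Metric m : metrics) {
            int size = m.points.size();
            if (size > maxPoints) {
                // Oversized metric: flush what we have, then emit one batch per chunk.
                if (!current.isEmpty()) {
                    batches.add(current);
                    current = new ArrayList<>();
                    used = 0;
                }
                for (int from = 0; from < size; from += maxPoints) {
                    int to = Math.min(from + maxPoints, size);
                    List<Metric> single = new ArrayList<>();
                    single.add(new Metric(m.name, new ArrayList<>(m.points.subList(from, to))));
                    batches.add(single);
                }
                continue;
            }
            if (used + size > maxPoints) {
                // The next metric would overflow the current batch: start a new one.
                batches.add(current);
                current = new ArrayList<>();
                used = 0;
            }
            current.add(m);
            used += size;
        }
        if (!current.isEmpty()) {
            batches.add(current);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Metric> metrics = new ArrayList<>();
        metrics.add(new Metric("a", Arrays.asList(1.0, 2.0, 3.0)));
        metrics.add(new Metric("b", Arrays.asList(1.0, 2.0, 3.0, 4.0, 5.0)));
        metrics.add(new Metric("c", Arrays.asList(1.0)));
        for (List<Metric> batch : split(metrics, 4)) {
            System.out.println("batch with " + countPoints(batch) + " points");
        }
    }
}
```

In this sketch the fast path returns the caller's list as-is, so the only extra work when nothing needs splitting is the point count.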

Brief changelog

  • Add BatchSplittingMetricExporter implementing MetricExporter as a decorator
  • Integrate it in BrokerMetricsManager to wrap OtlpGrpcMetricExporter
  • Add metricsExportBatchMaxDataPoints config to BrokerConfig (default 1000, dynamically updatable)
  • Add unit tests (18 test cases)
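
As a rough illustration of the decorator arrangement in this changelog, the sketch below wraps a delegate exporter and reads the threshold on every call, which is one way a dynamically updatable limit could take effect. Everything here is assumed for illustration: `PointExporter` is a simplified hypothetical interface (the real OpenTelemetry `MetricExporter` has a different signature), and `SplittingExporter` and `failureLog` are invented names, not the PR's classes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntSupplier;

public class DecoratorSketch {

    /** Hypothetical, simplified exporter; returns true when the export succeeded. */
    interface PointExporter {
        boolean export(List<String> points);
    }

    /** Decorator that forwards small exports unchanged and splits large ones. */
    static final class SplittingExporter implements PointExporter {
        private final PointExporter delegate;
        private final IntSupplier maxPoints;
        // Stand-in for real logging: one entry per failed sub-batch.
        final List<String> failureLog = new ArrayList<>();

        SplittingExporter(PointExporter delegate, IntSupplier maxPoints) {
            this.delegate = delegate;
            this.maxPoints = maxPoints;
        }

        @Override
        public boolean export(List<String> points) {
            // Read the limit on each call so a config change applies to the next export.
            int limit = Math.max(1, maxPoints.getAsInt());
            if (points.size() <= limit) {
                return delegate.export(points); // fast path, no copying
            }
            boolean allOk = true;
            for (int from = 0; from < points.size(); from += limit) {
                int to = Math.min(from + limit, points.size());
                List<String> batch = new ArrayList<>(points.subList(from, to));
                if (!delegate.export(batch)) {
                    allOk = false;
                    failureLog.add("batch [" + from + ", " + to + ") failed");
                }
            }
            return allOk;
        }
    }

    public static void main(String[] args) {
        int[] limit = {2};
        PointExporter printing = points -> {
            System.out.println("exporting " + points.size() + " points");
            return true;
        };
        SplittingExporter exporter = new SplittingExporter(printing, () -> limit[0]);
        List<String> points = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            points.add("p" + i);
        }
        exporter.export(points);
        limit[0] = 10; // simulate a runtime config update
        exporter.export(points);
    }
}
```

The sketch keeps exporting the remaining sub-batches after a failure and reports the overall result as failed; whether the PR does the same is not stated in the description.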

Verifying this change

This change added tests and can be verified as follows:

  • mvn test -pl broker -Dtest=BatchSplittingMetricExporterTest

When high-cardinality metrics (consumer_group x topic) produce OTLP export
payloads exceeding the gRPC 32MB limit or SLS per-RPC processing limit,
all metrics fail to export. This adds a MetricExporter decorator that:

- Splits large batches of MetricData objects into smaller sub-batches
- Splits single oversized MetricData objects by their internal data points
  into multiple smaller MetricData objects (supports all 7 MetricDataType)
- Configurable via BrokerConfig.metricsExportBatchMaxDataPoints (default 1000)
- Fast path with zero overhead when data points are within threshold
- Logs failed batch details for debugging
@Houlong66 Houlong66 changed the title [RIP-XXX] Add BatchSplittingMetricExporter to prevent OTLP gRPC export failures [ISSUE #10240] Add BatchSplittingMetricExporter to prevent OTLP gRPC export failures Apr 2, 2026
@Houlong66
Contributor Author

New commit: Fix ArrayIndexOutOfBoundsException (AIOOBE) from concurrent callback modification

Added commit 81915011cc to address a related production issue:

ArrayIndexOutOfBoundsException: Index 9689 out of bounds for length 9689
    at NumberDataPointMarshaler.createRepeated

What changed: BatchSplittingMetricExporter.export() now snapshots all MetricData point collections into new ArrayList instances before counting, splitting, or delegating. This prevents the OTel SDK's marshaler from hitting AIOOBE when callback threads concurrently modify point collections during export.

Impact: Shallow copy only (object references), ~800KB overhead per export cycle for 100K data points. Negligible compared to the gRPC export itself.

This fix works together with the original batch-splitting logic — the snapshot ensures marshaling safety, while batch-splitting ensures payload size compliance.
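
The snapshot idea can be sketched in isolation. The code below is an assumption-laden illustration, not the SDK's marshaler or the PR's code: `marshal` only mimics the "size the array first, then iterate" pattern described for `NumberDataPointMarshaler.createRepeated`, and `snapshot` is a hypothetical helper name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

public class SnapshotSketch {

    /** Shallow copy: a new list that holds the same element references. */
    static <T> List<T> snapshot(Collection<T> points) {
        return new ArrayList<>(points);
    }

    /**
     * Mimics the pattern at issue: allocate by size(), then iterate.
     * If the collection grows between the two steps, the write at out[i]
     * runs past the end of the array.
     */
    static <T> Object[] marshal(Collection<T> points) {
        Object[] out = new Object[points.size()];
        int i = 0;
        for (T p : points) {
            out[i++] = p;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> live = new ArrayList<>(Arrays.asList("p1", "p2"));
        List<String> frozen = snapshot(live);
        // Stands in for a callback thread adding a point after the snapshot was taken.
        live.add("p3");
        System.out.println("live=" + live.size() + ", marshaled=" + marshal(frozen).length);
    }
}
```

Note that copying a non-thread-safe collection is itself not atomic with respect to concurrent writers, which may be why the tests described in this PR cover a fallback to the original data when the snapshot fails.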

The OTel SDK's NumberDataPointMarshaler.createRepeated allocates an
array based on points.size() then iterates. If callback threads
concurrently add data points between size() and iteration, an
ArrayIndexOutOfBoundsException occurs. This adds a defensive snapshot
of all data point collections at the start of export(), ensuring
the delegate exporter always receives a stable copy of each point collection.
- testSnapshotCreatesNewMetricData: verify delegate receives
  snapshotted MetricData, not the original reference
- testSnapshotFallsBackToOriginal: verify catch block falls
  back to original when snapshot fails (e.g., mock without type)
- testSnapshotPointsAreIndependentCopy: verify the snapshotted
  points collection is a separate instance from the original
@Houlong66 Houlong66 force-pushed the feature/metrics_gzip_batch_split branch from f148458 to 21215a8 Compare April 2, 2026 11:53
@lizhimins lizhimins merged commit 7b85a5d into apache:develop Apr 3, 2026
11 of 13 checks passed
Development

Successfully merging this pull request may close these issues.

[Bug] OTLP gRPC metrics export fails when payload exceeds 32MB limit due to high-cardinality metrics

3 participants