
Add opt-in stream error log throttling#2814

Draft
He-Pin wants to merge 1 commit into apache:main from He-Pin:improve/stream-log-throttle

Conversation


@He-Pin He-Pin commented Mar 28, 2026

Motivation

When a stream stage fails rapidly and repeatedly (e.g., persistent network failure, bad data in a loop), each failure generates a log message. In high-throughput systems, this can result in thousands of log messages per second, overwhelming log aggregation systems and masking other important messages.

Modification

Added a new configuration option:

pekko.stream.materializer.stage-errors-log-throttle-period = off

When set to a positive duration (e.g. 10s), only the first stage error within each time window is logged at ERROR level. Subsequent errors in the window are counted silently, and a summary warning is emitted when the next window opens or the interpreter finishes.
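For illustration, enabling the throttle with a 10-second window would look like this in `application.conf` (the key itself is from this PR; the surrounding layout is just a sketch):

```hocon
pekko.stream.materializer {
  # Log at most one stage error per 10s window per interpreter;
  # further errors are counted and summarized. Default is "off".
  stage-errors-log-throttle-period = 10s
}
```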

Implementation details

| Aspect | Detail |
| --- | --- |
| Scope | Per-interpreter throttle state (not shared across streams) |
| First error | Uses an `errorLogInitialized` flag to guarantee the first error always logs, regardless of `System.nanoTime()` origin |
| Config validation | Rejects negative durations with `require()` |
| Cleanup | Flushes the suppressed count in `finish()` for best-effort reporting |
| Default | `off` — zero behavior change for existing users |
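The windowing logic described above can be sketched in isolation. This is a minimal, self-contained model (the class and field names here are hypothetical, not the actual `GraphInterpreter` code): the first error in each window is logged, later ones are counted silently, and a summary is flushed when a new window opens or `finish()` runs.

```scala
import scala.collection.mutable.ListBuffer

// Hypothetical sketch of a per-interpreter error-log throttle.
// `now` is injectable so the window logic is testable without real time.
final class ErrorLogThrottle(windowNanos: Long, now: () => Long) {
  private var initialized = false // guarantees the very first error logs
  private var windowStart = 0L
  private var suppressed = 0L
  val logged = ListBuffer.empty[String] // stands in for the real logger

  def onError(msg: String): Unit = {
    val t = now()
    if (!initialized || t - windowStart >= windowNanos) {
      flushSummary()                // report errors suppressed in the old window
      initialized = true
      windowStart = t
      logged += s"ERROR: $msg"      // first error in the new window
    } else {
      suppressed += 1               // counted silently within the window
    }
  }

  def finish(): Unit = flushSummary() // best-effort report on interpreter shutdown

  private def flushSummary(): Unit =
    if (suppressed > 0) {
      logged += s"WARN: suppressed $suppressed similar stage errors"
      suppressed = 0
    }
}
```

The `initialized` flag matters because `System.nanoTime()` has an arbitrary origin, so `t - windowStart` cannot be trusted before the first error has seeded `windowStart`.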

Files changed

  • stream/src/main/resources/reference.conf — New config key
  • stream/.../GraphInterpreter.scala — Throttle state fields + modified reportStageError + finish() flush
  • stream-tests/.../StageErrorLogThrottleSpec.scala — 4 tests (enabled throttle with Broadcast fan-out, single error, disabled with fan-out, disabled single error)
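The new `reference.conf` key accepts either `off` or a positive duration. A hypothetical parser (the helper name is not from the PR) showing how the `require()` validation of negative durations could work:

```scala
import scala.concurrent.duration.{ Duration, FiniteDuration }

// Hypothetical helper: "off" disables throttling, anything else must
// parse to a strictly positive finite duration.
def parseThrottlePeriod(raw: String): Option[FiniteDuration] =
  raw.trim.toLowerCase match {
    case "off" => None
    case s =>
      Duration(s) match {
        case f: FiniteDuration =>
          require(f > Duration.Zero,
            s"stage-errors-log-throttle-period must be positive, was $f")
          Some(f)
        case other =>
          throw new IllegalArgumentException(
            s"stage-errors-log-throttle-period must be finite, was $other")
      }
  }
```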

Result

  • Operators can opt in to throttling stage error logs in noisy environments
  • Zero behavior change by default (off)
  • Tests use Broadcast with 5 parallel failing stages to exercise the actual throttle code path (multiple reportStageError calls per interpreter)
  • All new tests pass, all existing ActorGraphInterpreterSpec and InterpreterSpec tests pass (no regression)

References

@He-Pin He-Pin force-pushed the improve/stream-log-throttle branch from b0b4b83 to c712091 Compare March 28, 2026 08:33
@He-Pin He-Pin changed the title from "Validate replica sequence number in external replication" to "Add opt-in stream error log throttling" Mar 28, 2026
@He-Pin He-Pin force-pushed the improve/stream-log-throttle branch 2 times, most recently from bf1d66b to bd4dba9 Compare March 28, 2026 09:44
Add a new configuration option to throttle repeated stage error log
messages in stream interpreters:

  pekko.stream.materializer.stage-errors-log-throttle-period = off

When set to a positive duration (e.g. '10s'), only the first stage error
within each time window is logged at ERROR level. Subsequent errors in
the window are counted silently, and a summary warning is emitted when
the next window opens or the interpreter finishes.

Key implementation details:
- Per-interpreter throttle state (not shared across streams)
- Uses errorLogInitialized flag to ensure first error always logs
  regardless of System.nanoTime() origin (fix for GPT-5.4 review finding)
- Validates negative durations with require() (fix for GPT-5.4 finding)
- Flushes suppressed count in finish() for best-effort cleanup
- Default 'off' preserves existing behavior (zero behavior change)

Tests use Broadcast with 5 parallel failing stages to exercise actual
throttle code paths (fix for Sonnet 4.6 + GPT-5.4 review finding that
original tests only triggered single errors).

Cross-reviewed by GPT-5.4 and Sonnet 4.6.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@He-Pin He-Pin force-pushed the improve/stream-log-throttle branch from bd4dba9 to e662724 Compare March 28, 2026 12:50
@He-Pin He-Pin marked this pull request as ready for review March 28, 2026 13:11
@He-Pin He-Pin marked this pull request as draft March 28, 2026 13:12
