fix: decouple Beholder observation metrics from Prometheus scrape cycle #21720
👋 emate, thanks for creating this pull request! To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team. Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!
✅ No conflicts with other open PRs targeting |
Pull request overview
Risk Rating: HIGH (adds a long-lived background goroutine that triggers metric collection concurrently with Prometheus scrapes; incorrect deltas or panics would directly impact telemetry reliability and could crash the process)
Decouples OCR3 observation metric forwarding to Beholder from the Prometheus scrape cycle by adding a periodic polling loop that calls Collect() on the wrapped counters on a fixed interval.
Changes:
- Start a background polling loop (default 10s) after `libocr3.NewOracle(...)` so wrapped counters publish deltas even without `/metrics` scrapes.
- Add `Start(interval)` + `poll()` to `ObservationMetricsCollector`, with cancellation via `Close()`.
- Add a unit test verifying polling publishes deltas without an external Prometheus scrape.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| core/capabilities/ccip/oraclecreator/plugin.go | Starts metrics collector polling after oracle creation and ensures it’s closed with the oracle. |
| core/capabilities/ccip/oraclecreator/observation_metrics_collector.go | Adds polling interval, lifecycle context/cancel, and the polling goroutine to drive delta publishing independently of scrapes. |
| core/capabilities/ccip/oraclecreator/observation_metrics_collector_test.go | Adds a test asserting polling publishes and does not republish without new increments. |
Scrupulous human review recommended (high-impact areas):
- `ObservationMetricsCollector.Start(...)` / `poll()` interaction with Prometheus scrapes: this introduces concurrent `Collect()` calls and can cause incorrect delta publishing unless the wrapped-counter delta tracking is made concurrency-safe.
- Lifecycle correctness: ensuring `Start()` cannot panic (invalid interval) and that `Close()` reliably stops any background goroutines in all runtime paths.
Reviewer recommendations (based on CODEOWNERS for `/core/capabilities/ccip`):
- `@smartcontractkit/ccip-offchain` (primary owners for this directory)
- `@smartcontractkit/keystone` and/or `@smartcontractkit/capabilities-team` (owners for `/core/capabilities/` broadly)
`ocr3_sent_observations_total` (and `ocr3_included_observations_total`) were only forwarded to Beholder when Prometheus scraped the `/metrics` endpoint. The `wrappedCounter.Collect()` method, the only place where deltas are computed and `PublishMetric` is called, is driven entirely by the Prometheus scrape cycle.

On nodes where scrapes were slow, missing, or misaligned, the Beholder counter would appear stuck even though the underlying OCR3 protocol was functioning normally.
Added a background polling loop to `ObservationMetricsCollector` that ticks every 10 seconds (matching the OTel PeriodicReader default) and calls `poll()`, which invokes `Collect()` on each wrapped counter directly.