
Vault better instrumentation #21732

Open
cedric-cordenier wants to merge 3 commits into develop from vault-better-instrumentation


@cedric-cordenier

Requires

Supports

Instrument all public KVStore methods with a histogram metric
(platform_vault_plugin_kv_operation_duration_seconds) to help
diagnose high P95 Observation latencies in production. Also adds
a max-batch Observation test exercising 10 GetSecrets requests
with 10 encryption keys each plus 20 pending queue items.
Copilot AI review requested due to automatic review settings March 26, 2026 19:08
@cedric-cordenier cedric-cordenier requested review from a team as code owners March 26, 2026 19:08
@github-actions

👋 cedric-cordenier, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

@github-actions

✅ No conflicts with other open PRs targeting develop

@github-actions

I see you updated files related to core. Please run make gocs in the root directory to add a changeset, and include at least one of the following tags in the changeset text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.


Copilot AI left a comment


Pull request overview

Risk Rating: MEDIUM

This PR improves observability for Vault and OCR3.1 reporting plugins by adding new latency metrics around KV and blob operations, expanding histogram bucket ranges, and updating the Vault KVStore API to propagate context.Context through reads/writes (with accompanying test updates).

Changes:

  • Add OCR3.1 beholderwrapper histograms and wrappers to record KV/blob operation durations.
  • Add Vault plugin KV operation duration histogram and plumb context.Context through KVStore APIs/callers.
  • Expand histogram buckets (plugin duration + data size) and extend tests to cover new wrapping/instrumentation paths.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| core/services/ocr3_1/beholderwrapper/types.go | Adds KV/blob duration instruments and views; expands bucket ranges. |
| core/services/ocr3_1/beholderwrapper/plugin.go | Wraps KV readers/writers and blob fetchers/broadcasters with instrumented implementations. |
| core/services/ocr3_1/beholderwrapper/instrumented_kv.go | New KV reader/readwriter wrappers that record per-method duration. |
| core/services/ocr3_1/beholderwrapper/instrumented_blob.go | New blob fetcher/broadcast-fetcher wrappers that record per-method duration. |
| core/services/ocr3_1/beholderwrapper/plugin_test.go | Adds tests validating delegation, wrapping behavior, and error propagation for new wrappers. |
| core/services/ocr2/plugins/vault/metrics.go | Adds a KV operation duration histogram metric. |
| core/services/ocr2/plugins/vault/kvstore.go | Updates KVStore interfaces to accept context.Context and records per-method duration metrics. |
| core/services/ocr2/plugins/vault/plugin.go | Plumbs ctx into KVStore calls and passes metrics into KVStore constructors. |
| core/services/ocr2/plugins/vault/kvstore_test.go | Updates tests for new KVStore signatures and adds test helpers that provide metrics. |
| core/services/ocr2/plugins/vault/plugin_test.go | Updates plugin tests for new KVStore signatures/metrics; adds a max-batch observation test. |

Areas needing scrupulous human review:

  • Vault KVStore API changes (ctx plumbing) to ensure no call sites are missed and that behavior is unchanged aside from instrumentation.
  • Histogram bucket changes (especially data size) to confirm the intended max ranges match actual limits/operational expectations.
  • The newly added “max batch” Vault observation test for runtime/flakiness (crypto + large loops).

Reviewer recommendations (from .github/CODEOWNERS):

  • /core/services/ocr*: @smartcontractkit/foundations and @smartcontractkit/core

// the observation also includes the maximum number of pending queue items
// to be broadcast as blobs.
numLocalItems := 2 * batchSize
for i := range numLocalItems {

Copilot AI Mar 26, 2026


The loop uses for i := range numLocalItems, but range can't be used on an int (this won't compile). Use a standard index loop (e.g., for i := 0; i < numLocalItems; i++ { ... }).

Suggested change
- for i := range numLocalItems {
+ for i := 0; i < numLocalItems; i++ {

require.NoError(t, err)
gotLocalIDs[gotMsg.Id] = true
}
for i := range numLocalItems {

Copilot AI Mar 26, 2026


Same issue here: for i := range numLocalItems attempts to range over an int and will not compile. Switch to an index-based for loop.

Suggested change
- for i := range numLocalItems {
+ for i := 0; i < numLocalItems; i++ {

Comment on lines 169 to +171
  // 512KB is the max value possible
- // 1KB, 2KB, 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB
- Boundaries: prometheus.ExponentialBuckets(1024, 2, 10),
+ // 1KB, 2KB, 4KB, 8KB, 16KB, 32KB, 64KB, 128KB, 256KB, 512KB, 1024KB, 2048KB, 4096KB, 8192KB
+ Boundaries: prometheus.ExponentialBuckets(1024, 2, 14),

Copilot AI Mar 26, 2026


The comment says "512KB is the max value possible", but the bucket boundaries were expanded up to 8192KB. Please update/remove the max-value comment so it matches the new histogram range (or adjust the buckets if 512KB truly is the max).
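
To check the arithmetic behind the two bucket counts, the sketch below recomputes the boundaries with a stdlib-only helper. `exponentialBuckets` is a hypothetical stand-in that mirrors the documented behavior of `prometheus.ExponentialBuckets` (each boundary is the previous one multiplied by the factor) and omits that function's argument validation.

```go
package main

import "fmt"

// exponentialBuckets returns count boundaries starting at start, each one
// factor times the previous. Stand-in only: no argument validation.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = start
		start *= factor
	}
	return buckets
}

func main() {
	for _, count := range []int{10, 14} {
		b := exponentialBuckets(1024, 2, count)
		// Top boundary in KB: 1024 * 2^(count-1) bytes divided by 1024.
		fmt.Printf("count=%d top=%.0fKB\n", count, b[len(b)-1]/1024)
	}
	// count=10 top=512KB
	// count=14 top=8192KB
}
```

With 10 buckets the top boundary is 512KB, matching the old comment; with 14 it is 8192KB, so the retained "512KB is the max value possible" line no longer describes the histogram range.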

  Key: "my_secret",
  }
- err = NewWriteStore(rdr).WriteSecret(createdID, &vaultcommon.StoredSecret{
+ err = newTestWriteStore(t, rdr).WriteSecret(t.Context(), createdID,&vaultcommon.StoredSecret{

Copilot AI Mar 26, 2026


This line (and a few similar ones nearby) appears not to be gofmt'd (missing spaces after commas, e.g. createdID,&...). Please run gofmt on this file to keep formatting consistent.

Suggested change
- err = newTestWriteStore(t, rdr).WriteSecret(t.Context(), createdID,&vaultcommon.StoredSecret{
+ err = newTestWriteStore(t, rdr).WriteSecret(t.Context(), createdID, &vaultcommon.StoredSecret{

Comment on lines +142 to 143
+ err = store.WriteSecret(t.Context(), id,ss)
  require.NoError(t, err)

Copilot AI Mar 26, 2026


This call isn't gofmt'd (missing a space after the comma: id,ss). Please run gofmt on the file to ensure standard Go formatting throughout.
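
As a rough illustration of what gofmt would change here, the standard library's `go/format` package applies the same canonical style and accepts a bare statement. The helper name `gofmtSnippet` and the variable names in the sample statement are hypothetical; in practice, running gofmt on the file is the simpler route.

```go
package main

import (
	"fmt"
	"go/format"
)

// gofmtSnippet formats a Go source fragment in canonical gofmt style.
// format.Source accepts a full file or a list of declarations or statements.
func gofmtSnippet(src string) (string, error) {
	out, err := format.Source([]byte(src))
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	unformatted := "err = store.WriteSecret(ctx, id,ss)"
	formatted, err := gofmtSnippet(unformatted)
	if err != nil {
		fmt.Println("format error:", err)
		return
	}
	fmt.Println(formatted)
	fmt.Println("changed by gofmt:", formatted != unformatted)
}
```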

@cl-sonarqube-production

Quality Gate failed

Failed conditions
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube


@trunk-io

trunk-io bot commented Mar 26, 2026

