release-26.2: obs: add per-CODEOWNER metric count to Prometheus scrape #167439

Open
angles-n-daemons wants to merge 3 commits into cockroachdb:release-26.2 from angles-n-daemons:backport26.2-166695

@angles-n-daemons
Contributor

Backport 1/1 commits from #166695 on behalf of @angles-n-daemons.

/cc @cockroachdb/release


Add a runtime metric obs.metric_export.codeowner.metric_count (GaugeVec
labeled by codeowner) that reports metric counts per owning team during
each scrape. Counts reflect downstream ingestion: simple metrics count as
1, histograms expand to their computed metrics (percentiles, count, sum,
avg, max). This follows the same pattern as the existing
obs.metric_export.child.count metric and enables teams to understand
their contribution to scrape volume.

To make metric-to-team ownership data available at runtime, embed
metric_owners.yaml (generated by gen-metric-owners) into the
metricscan package via go:embed. The ./dev generate docs step copies
the YAML into the package directory. The data is loaded once at
MetricsRecorder construction and used during each scrape to attribute
metric families to their CODEOWNER team.
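
A minimal sketch of the counting and attribution described above, not the code in this PR. The `family` and `recorder` types, the team names, and the value passed for the histogram expansion are hypothetical; the real expansion size comes from `metric.HistogramMetricComputers`.

```go
package main

import (
	"fmt"
	"sort"
)

// family is a simplified, hypothetical view of one scraped metric family.
type family struct {
	name        string
	isHistogram bool
	children    int // number of series in the family
}

// recorder holds the ownership map, which is loaded once at construction.
type recorder struct {
	owners          map[string]string // metric name -> owning team
	histogramSeries int64             // computed metrics per histogram (assumed value)
}

func newRecorder(owners map[string]string, histogramSeries int64) *recorder {
	return &recorder{owners: owners, histogramSeries: histogramSeries}
}

// countByOwner attributes each family to its owner and sums the counts.
// Simple metrics count as 1 per series; histograms count as histogramSeries.
func (r *recorder) countByOwner(families []family) map[string]int64 {
	counts := make(map[string]int64)
	for _, f := range families {
		owner, ok := r.owners[f.name]
		if !ok {
			owner = "unknown"
		}
		perSeries := int64(1)
		if f.isHistogram {
			perSeries = r.histogramSeries
		}
		counts[owner] += perSeries * int64(f.children)
	}
	return counts
}

func main() {
	// The team names and the value 10 are placeholders for illustration.
	r := newRecorder(map[string]string{
		"test_gauge":     "team-a",
		"test_histogram": "team-b",
	}, 10)
	counts := r.countByOwner([]family{
		{name: "test_gauge", children: 1},
		{name: "test_histogram", isHistogram: true, children: 1},
		{name: "orphan_counter", children: 1},
	})
	owners := make([]string, 0, len(counts))
	for o := range counts {
		owners = append(owners, o)
	}
	sort.Strings(owners)
	for _, o := range owners {
		fmt.Printf("codeowner=%q count=%d\n", o, counts[o])
	}
}
```

Metrics with no entry in the ownership map fall into the `unknown` bucket, which is how unowned metrics would show up in the scrape.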

Also fix check_generated_code CI to run gen-metric-owners before
//pkg/gen, ensuring that metric_owners.yaml staleness is caught and
new metrics get their owner field populated in metrics.yaml.

Resolves: #166692

Release note: None

Release justification: Allows us to report on metrics output volume by codeowner; it's a low-risk change.

Add a runtime metric `obs.metric_export.codeowner.metric_count` (GaugeVec
labeled by `codeowner`) that reports metric counts per owning team during
each scrape. Counts reflect downstream ingestion: simple metrics count as
1, histograms expand to their computed metrics (percentiles, count, sum,
avg, max). This follows the same pattern as the existing
`obs.metric_export.child.count` metric and enables teams to understand
their contribution to scrape volume.

To make metric-to-team ownership data available at runtime, embed
`metric_owners.yaml` (generated by `gen-metric-owners`) into the
`metricscan` package via `go:embed`. The `./dev generate docs` step copies
the YAML into the package directory. The data is loaded once at
`MetricsRecorder` construction and used during each scrape to attribute
metric families to their CODEOWNER team.

Gate output behind `obs.metric_export.codeowner_count.enabled` (default
`false`) so customer clusters don't see internal team names in their
metrics. Internal CRL clusters override this to `true` via managed-service.
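
The gating behavior can be sketched roughly as follows. This is an illustration only: the `atomic.Bool` stands in for the cluster setting, and `exportCodeownerCounts` is a hypothetical helper, not a function from this PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// codeownerCountEnabled models the boolean cluster setting
// obs.metric_export.codeowner_count.enabled. Its zero value is false,
// matching the default. The real setting is registered through the
// settings package rather than a package-level atomic.
var codeownerCountEnabled atomic.Bool

// exportCodeownerCounts returns the per-team counts to expose in the scrape,
// or nil when the setting is disabled so that no team names are emitted.
func exportCodeownerCounts(counts map[string]int64) map[string]int64 {
	if !codeownerCountEnabled.Load() {
		return nil
	}
	return counts
}

func main() {
	counts := map[string]int64{"team-a": 3} // placeholder team name

	fmt.Println(len(exportCodeownerCounts(counts))) // disabled: nothing exported

	codeownerCountEnabled.Store(true) // what an internal cluster override would do
	fmt.Println(len(exportCodeownerCounts(counts)))
}
```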

Fix 8 `admission_cpu_time_tokens_per_tenant_*` metrics showing as
`codeowner="unknown"` by adding format-string `Name` fields to the base
metadata templates in `cpu_time_token_metrics.go`, so the AST scanner
detects them as patterns.
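
One way a format-string `Name` could be matched against concrete metric names is sketched below. How the actual scanner resolves patterns is not shown in this PR; the regexp conversion, the `patternOwner` type, and the team name are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// patternOwner pairs a compiled name pattern with its owning team.
type patternOwner struct {
	re    *regexp.Regexp
	owner string
}

// compilePattern turns a format-string name such as "a.b.%s" into an anchored
// regexp in which each %s matches one or more characters. Literal dots are
// escaped by QuoteMeta; the % and s characters are left untouched by it.
func compilePattern(name, owner string) patternOwner {
	quoted := regexp.QuoteMeta(name)
	expr := "^" + strings.ReplaceAll(quoted, "%s", ".+") + "$"
	return patternOwner{re: regexp.MustCompile(expr), owner: owner}
}

// ownerFor returns the first matching owner, or "unknown" if none matches.
func ownerFor(name string, patterns []patternOwner) string {
	for _, p := range patterns {
		if p.re.MatchString(name) {
			return p.owner
		}
	}
	return "unknown"
}

func main() {
	patterns := []patternOwner{
		// "team-x" is a placeholder owner, not the real CODEOWNER.
		compilePattern("admission.cpu_time_tokens.per_tenant.admitted_count.%s", "team-x"),
	}
	fmt.Println(ownerFor("admission.cpu_time_tokens.per_tenant.admitted_count.example", patterns))
	fmt.Println(ownerFor("some.other.metric", patterns))
}
```

Without a `Name` on the base metadata there is no pattern to match, so every concrete name built from it falls through to `unknown`.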

Also fix `check_generated_code` CI to run `gen-metric-owners` before
`//pkg/gen`, ensuring that `metric_owners.yaml` staleness is caught and
new metrics get their `owner` field populated in `metrics.yaml`.

Follow-up: CLOUDOPS-19945

Resolves: cockroachdb#166692

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@angles-n-daemons angles-n-daemons requested review from a team as code owners April 2, 2026 21:10
@trunk-io
Contributor

trunk-io bot commented Apr 2, 2026

Merging to release-26.2 in this repository is managed by Trunk.

  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.

@angles-n-daemons angles-n-daemons requested review from alyshanjahani-crl, herkolategan and srosenberg and removed request for a team April 2, 2026 21:10
@blathers-crl

blathers-crl bot commented Apr 2, 2026

Thanks for opening a backport.

Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.

@blathers-crl blathers-crl bot added the backport and T-observability labels Apr 2, 2026

Remove metrics that exist on master but not on release-26.2
(rpc.drpc.enabled, rpc.server.requests.total, rpc.client.request.duration.nanos,
rpc.client.requests.total, storage.wal.failover.secondary.disk.available,
storage.wal.failover.secondary.disk.capacity) from metrics.yaml and
metric_owners.yaml to fix check_generated_code CI.

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
@angles-n-daemons angles-n-daemons requested review from dhartunian and removed request for a team, alyshanjahani-crl, herkolategan and srosenberg April 3, 2026 16:42
settings.ApplicationLevel, "obs.metric_export.codeowner_count.enabled",
"enables the reporting of per-CODEOWNER metric counts in the Prometheus scrape",
false,
settings.WithPublic)
Collaborator

nit: this setting doesn't need to be public

Contributor Author

Oh nice catch, I'll change that - and open a pr against master for the same.

// suffix is appended in makeCPUTimeTokenMetrics. See the comment on
// cpuTimeTokenMetrics.AdmittedCountPerTenant for the rationale.
cpuTimeTokenAdmittedCountPerTenantMetaBase = metric.Metadata{
Name: "admission.cpu_time_tokens.per_tenant.admitted_count.%s",
Collaborator

why are these changed here?

Contributor Author

They need to be near the Metadata definition so that the code scanner (which associates metrics with owners) can pick them up.

They're actually overridden with the same name further down in the code, this bit just allows us to associate them with an owner.


// test_gauge and test_counter are simple metrics: 1 each.
require.Regexp(t,
`obs_metric_export_codeowner_metric_count\{`+
Collaborator

for next time: I'd strongly prefer datadriven tests for stuff like this so we can clearly see the output instead of a regex. This is tough to verify visually.

Contributor Author

Ah I'll try to keep that in mind for next time.

var count int64
for _, m := range family.Metric {
if m.Histogram != nil {
count += int64(len(metric.HistogramMetricComputers))
Collaborator

Not blocking but this is not quite correct in terms of the docstring. len(metric.HistogramMetricComputers) records the number of items in the TSDB's persisted set. The Datadog estimate is tough to gauge because it converts to a native representation.

Contributor Author

That's a good point, I'll update the docstring. Is there a better way we can somewhat quantify the impact, from a cost perspective? The Computers just felt like a low-hanging approximation.

Collaborator

I think the bucket count is the proxy here since that affects the size of the histogram
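
For reference, the bucket-count proxy suggested here might look roughly like the sketch below. The `histogram` and `sample` types are simplified stand-ins for the scraped metric types, and `bucketWeightedCount` is a hypothetical name; this is not code from the PR.

```go
package main

import "fmt"

// histogram is a simplified stand-in holding only the bucket boundaries.
type histogram struct {
	bucketUpperBounds []float64
}

// sample is one series in a metric family; hist is nil for simple metrics.
type sample struct {
	hist *histogram
}

// bucketWeightedCount counts simple metrics as 1 and weights each histogram
// by its number of buckets, since the bucket count drives histogram size.
func bucketWeightedCount(samples []sample) int64 {
	var count int64
	for _, s := range samples {
		if s.hist != nil {
			count += int64(len(s.hist.bucketUpperBounds))
			continue
		}
		count++
	}
	return count
}

func main() {
	samples := []sample{
		{}, // a gauge or counter
		{hist: &histogram{bucketUpperBounds: []float64{1, 10, 100}}},
	}
	fmt.Println(bucketWeightedCount(samples))
}
```

Compared with `len(metric.HistogramMetricComputers)`, this weighting varies per histogram rather than adding a fixed amount for each one.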

@blathers-crl

blathers-crl bot commented Apr 3, 2026

Metrics change detected

This PR adds or updates one or more CRDB metrics. If you want these metrics to be exported by CRDB Cloud clusters to Internal CRL Datadog and/or included in the customer metric export integration (Essential metrics for standard deployment, and Essential metrics for advanced deployment), refer to this Installation and Usage guide of a CLI tool that syncs the metric mappings in managed-service. Run this CLI tool after your CRDB PR is merged.

  • The CLI opens a PR in managed-service with the required config changes.
  • Please track that PR and ensure it merges so your metrics become available to CRDB Cloud clusters.

Note: Your metric will appear in Internal CRL Datadog only after the managed-service PR merges and the new OTel configuration rolls out to at least one cluster running a CRDB build that includes this metric.

Docs: cockroach-metric-sync

Questions: reach out to @obs-india-prs

Address review feedback: the `obs.metric_export.codeowner_count.enabled`
setting is internal-only and doesn't need to be publicly visible.

Release note: None

Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Labels

backport, T-observability

3 participants