release-26.2: obs: add per-CODEOWNER metric count to Prometheus scrape#167439
angles-n-daemons wants to merge 3 commits into cockroachdb:release-26.2
Conversation
Add a runtime metric `obs.metric_export.codeowner.metric_count` (a `GaugeVec` labeled by `codeowner`) that reports metric counts per owning team during each scrape. Counts reflect downstream ingestion: simple metrics count as 1, histograms expand to their computed metrics (percentiles, count, sum, avg, max). This follows the same pattern as the existing `obs.metric_export.child.count` metric and enables teams to understand their contribution to scrape volume.

To make metric-to-team ownership data available at runtime, embed `metric_owners.yaml` (generated by `gen-metric-owners`) into the `metricscan` package via `go:embed`. The `./dev generate docs` step copies the YAML into the package directory. The data is loaded once at `MetricsRecorder` construction and used during each scrape to attribute metric families to their CODEOWNER team.

Gate output behind `obs.metric_export.codeowner_count.enabled` (default `false`) so customer clusters don't see internal team names in their metrics. Internal CRL clusters override this to `true` via managed-service.

Fix 8 `admission_cpu_time_tokens_per_tenant_*` metrics showing as `codeowner="unknown"` by adding format-string `Name` fields to the base metadata templates in `cpu_time_token_metrics.go`, so the AST scanner detects them as patterns.

Also fix `check_generated_code` CI to run `gen-metric-owners` before `//pkg/gen`, ensuring that `metric_owners.yaml` staleness is caught and new metrics get their `owner` field populated in `metrics.yaml`.

Follow-up: CLOUDOPS-19945
Resolves: cockroachdb#166692
Release note: None
Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
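The counting rule described above (simple metrics count as 1, histograms expand to their computed metrics) can be sketched roughly as follows. The `family` type, the `ownerOf` lookup, and the `histogramExpansion` constant are simplified stand-ins for the real `prometheusgo.MetricFamily`, the embedded `metric_owners.yaml` data, and `len(metric.HistogramMetricComputers)`:

```go
package main

import "fmt"

// family is a simplified stand-in for a Prometheus metric family.
type family struct {
	name        string
	isHistogram bool
}

// histogramExpansion stands in for len(metric.HistogramMetricComputers):
// each histogram is exported as several computed series (percentiles,
// count, sum, avg, max). The value here is hypothetical.
const histogramExpansion = 9

// countPerOwner attributes each family to its owning team and counts the
// series it contributes to a scrape; unowned families land in "unknown".
func countPerOwner(families []family, ownerOf func(string) string) map[string]int64 {
	counts := map[string]int64{}
	for _, f := range families {
		owner := ownerOf(f.name)
		if owner == "" {
			owner = "unknown"
		}
		if f.isHistogram {
			counts[owner] += histogramExpansion
		} else {
			counts[owner]++
		}
	}
	return counts
}

func main() {
	// Hypothetical ownership data; in CockroachDB this comes from the
	// embedded metric_owners.yaml.
	owners := map[string]string{"sql.query.count": "sql-queries", "exec.latency": "kv"}
	ownerOf := func(name string) string { return owners[name] }
	fams := []family{
		{name: "sql.query.count"},
		{name: "exec.latency", isHistogram: true},
		{name: "mystery.metric"},
	}
	fmt.Println(countPerOwner(fams, ownerOf))
}
```

The resulting map is what would back the `codeowner`-labeled gauge on each scrape.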
Merging to
After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here.
Thanks for opening a backport. Before merging, please confirm that the change does not break backwards compatibility and otherwise complies with the backport policy. Include a brief release justification in the PR description explaining why the backport is appropriate. All backports must be reviewed by the TL for the owning area. While the stricter LTS policy does not yet apply, please exercise judgment and consider gating non-critical changes behind a disabled-by-default feature flag when appropriate.
Remove metrics that exist on master but not on release-26.2 (rpc.drpc.enabled, rpc.server.requests.total, rpc.client.request.duration.nanos, rpc.client.requests.total, storage.wal.failover.secondary.disk.available, storage.wal.failover.secondary.disk.capacity) from metrics.yaml and metric_owners.yaml to fix check_generated_code CI. Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
pkg/server/status/recorder.go
Outdated
```go
settings.ApplicationLevel, "obs.metric_export.codeowner_count.enabled",
"enables the reporting of per-CODEOWNER metric counts in the Prometheus scrape",
false,
settings.WithPublic)
```
nit: this setting doesn't need to be public
Oh, nice catch! I'll change that and open a PR against master for the same.
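The fix under discussion is to drop the public option from the setting registration. A minimal stand-in sketch of that idea, using a toy registry rather than CockroachDB's real `settings` package (all names here are hypothetical):

```go
package main

import "fmt"

// Setting is a minimal stand-in for a cluster setting; the real settings
// package has a far richer API.
type Setting struct {
	Key         string
	Description string
	Default     bool
	Public      bool // non-public settings are hidden from customer-facing docs
}

var registry = map[string]Setting{}

// registerBoolSetting registers a setting without any "public" option,
// so Public stays at its zero value (false), i.e. internal-only.
func registerBoolSetting(key, desc string, def bool) Setting {
	s := Setting{Key: key, Description: desc, Default: def}
	registry[key] = s
	return s
}

func main() {
	s := registerBoolSetting(
		"obs.metric_export.codeowner_count.enabled",
		"enables the reporting of per-CODEOWNER metric counts in the Prometheus scrape",
		false,
	)
	fmt.Println(s.Key, s.Public)
}
```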
```go
// suffix is appended in makeCPUTimeTokenMetrics. See the comment on
// cpuTimeTokenMetrics.AdmittedCountPerTenant for the rationale.
cpuTimeTokenAdmittedCountPerTenantMetaBase = metric.Metadata{
	Name: "admission.cpu_time_tokens.per_tenant.admitted_count.%s",
```
why are these changed here?
They need to be near the Metadata definition so that the code scanner (which associates metrics with owners) can pick them up.
They're actually overridden with the same name further down in the code; this bit just allows us to associate them with an owner.
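The pattern being described can be sketched like this: the base template carries a format-string `Name` that a static AST scanner can match against an owner, while the concrete name is only produced at construction time. Types and helper names below are simplified stand-ins for the real code in `cpu_time_token_metrics.go`:

```go
package main

import "fmt"

// Metadata is a simplified stand-in for the metric.Metadata struct.
type Metadata struct {
	Name string
}

// The base template carries a format-string Name so a static scanner can
// associate the metric pattern with an owner, even though the concrete
// name is only filled in later.
var cpuTimeTokenAdmittedCountPerTenantMetaBase = Metadata{
	Name: "admission.cpu_time_tokens.per_tenant.admitted_count.%s",
}

// makeMeta instantiates the template for a concrete suffix, mirroring how
// the real constructor overrides Name with the same pattern further down.
func makeMeta(base Metadata, suffix string) Metadata {
	base.Name = fmt.Sprintf(base.Name, suffix)
	return base
}

func main() {
	fmt.Println(makeMeta(cpuTimeTokenAdmittedCountPerTenantMetaBase, "regular").Name)
}
```

Because `Metadata` is passed by value, the base template is never mutated and can be instantiated once per suffix.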
```go
// test_gauge and test_counter are simple metrics: 1 each.
require.Regexp(t,
	`obs_metric_export_codeowner_metric_count\{`+
```
for next time: I'd strongly prefer datadriven tests for stuff like this so we can clearly see the output instead of a regex. This is tough to verify visually.
Ah, I'll try to keep that in mind for next time.
```go
var count int64
for _, m := range family.Metric {
	if m.Histogram != nil {
		count += int64(len(metric.HistogramMetricComputers))
```
Not blocking, but this is not quite correct in terms of the docstring. `len(metric.HistogramMetricComputers)` records the number of items in the TSDB's persisted set. The Datadog estimate is tough to gauge because it converts to a native representation.
That's a good point, I'll update the docstring. Is there a better way we can somewhat quantify the impact, from a cost perspective? The Computers just felt like a low-hanging approximation.
I think the bucket count is the proxy here since that affects the size of the histogram
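A rough sketch of that suggestion: using the histogram's bucket count as the volume proxy, since bucket cardinality is what drives the size of the converted histogram. The `histogram` type and the `+2` adjustment below are simplified assumptions, not the real decoded Prometheus types:

```go
package main

import "fmt"

// histogram is a simplified stand-in for a decoded Prometheus histogram.
type histogram struct {
	buckets []float64 // upper bounds of each bucket
}

// seriesEstimate uses the bucket count as a proxy for downstream cost:
// each bucket contributes to the converted representation, plus the
// implicit count and sum series (a hypothetical adjustment).
func seriesEstimate(h histogram) int64 {
	return int64(len(h.buckets)) + 2
}

func main() {
	h := histogram{buckets: []float64{0.1, 0.5, 1, 5, 10}}
	fmt.Println(seriesEstimate(h))
}
```

Unlike a fixed per-histogram constant, this estimate scales with how finely each histogram is bucketed.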
Metrics change detected

This PR adds or updates one or more CRDB metrics. If you want these metrics to be exported by CRDB Cloud clusters to Internal CRL Datadog and/or included in the customer metric export integration (Essential metrics for standard deployment, and Essential metrics for advanced deployment), refer to this Installation and Usage guide of a CLI tool that syncs the metric mappings in managed-service. Run this CLI tool after your CRDB PR is merged.

Note: Your metric will appear in Internal CRL Datadog only after the managed-service PR merges and the new OTel configuration rolls out to at least one cluster running a CRDB build that includes this metric.

Docs: cockroach-metric-sync
Questions: reach out to @obs-india-prs
Address review feedback: the `obs.metric_export.codeowner_count.enabled` setting is internal-only and doesn't need to be publicly visible. Release note: None Co-Authored-By: roachdev-claude <roachdev-claude-bot@cockroachlabs.com>
Backport 1/1 commits from #166695 on behalf of @angles-n-daemons.

/cc @cockroachdb/release

Add a runtime metric `obs.metric_export.codeowner.metric_count` (a `GaugeVec` labeled by `codeowner`) that reports metric counts per owning team during each scrape. Counts reflect downstream ingestion: simple metrics count as 1, histograms expand to their computed metrics (percentiles, count, sum, avg, max). This follows the same pattern as the existing `obs.metric_export.child.count` metric and enables teams to understand their contribution to scrape volume.

To make metric-to-team ownership data available at runtime, embed `metric_owners.yaml` (generated by `gen-metric-owners`) into the `metricscan` package via `go:embed`. The `./dev generate docs` step copies the YAML into the package directory. The data is loaded once at `MetricsRecorder` construction and used during each scrape to attribute metric families to their CODEOWNER team.

Also fix `check_generated_code` CI to run `gen-metric-owners` before `//pkg/gen`, ensuring that `metric_owners.yaml` staleness is caught and new metrics get their `owner` field populated in `metrics.yaml`.

Resolves: #166692

Release note: None

Release justification: Allows us to report on metrics output volume by codeowner; it's a low-risk change.