Attach metric name labels #36633
Draft
SangJunBak wants to merge 4 commits into
Adds a new rule module with a declarative Rule enum (ClusterNameLookup, ReplicaNameLookup, ObjectNameLookup), a CatalogNameLookup trait, and a Rule::apply method that stamps resolved name labels onto a MetricFamily. Threads an optional rules: [...] field through the metric! macro and MakeCollectorOpts. MetricsRegistry::register stores those rules keyed by the fully qualified Prometheus metric name and exposes them via a new rules_by_metric() accessor. There is no HTTP wiring yet; follow-up commits connect the registry to the federated endpoint via a response header.
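As a rough illustration of the shape described in this commit, here is a minimal sketch of a rule enum, a lookup trait, and an apply method. It is not the code from the PR: the MetricFamily stand-in, the id_label / name_label fields, the string-typed IDs, and the MapCatalog helper are all assumptions made so the example is self-contained.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a Prometheus metric family:
/// a metric name plus one label map per sample.
#[derive(Debug, Clone, PartialEq)]
pub struct MetricFamily {
    pub name: String,
    pub samples: Vec<BTreeMap<String, String>>,
}

/// Resolves catalog IDs to names. The method set is assumed here;
/// the source only names the trait.
pub trait CatalogNameLookup {
    fn cluster_name(&self, id: &str) -> Option<String>;
    fn replica_name(&self, id: &str) -> Option<String>;
    fn object_name(&self, id: &str) -> Option<String>;
}

/// Declarative rules. The variant names come from the commit message;
/// the fields are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum Rule {
    ClusterNameLookup { id_label: String, name_label: String },
    ReplicaNameLookup { id_label: String, name_label: String },
    ObjectNameLookup { id_label: String, name_label: String },
}

impl Rule {
    /// Adds `name_label` to every sample whose `id_label` resolves in the
    /// catalog. Samples with a missing or unknown ID are left untouched.
    pub fn apply(&self, family: &mut MetricFamily, catalog: &dyn CatalogNameLookup) {
        let (id_label, name_label) = match self {
            Rule::ClusterNameLookup { id_label, name_label }
            | Rule::ReplicaNameLookup { id_label, name_label }
            | Rule::ObjectNameLookup { id_label, name_label } => (id_label, name_label),
        };
        for labels in &mut family.samples {
            let resolved = labels.get(id_label).and_then(|id| match self {
                Rule::ClusterNameLookup { .. } => catalog.cluster_name(id),
                Rule::ReplicaNameLookup { .. } => catalog.replica_name(id),
                Rule::ObjectNameLookup { .. } => catalog.object_name(id),
            });
            if let Some(name) = resolved {
                labels.insert(name_label.clone(), name);
            }
        }
    }
}

/// Hypothetical in-memory catalog, used only to exercise the sketch.
#[derive(Default)]
pub struct MapCatalog {
    pub clusters: BTreeMap<String, String>,
    pub replicas: BTreeMap<String, String>,
    pub objects: BTreeMap<String, String>,
}

impl CatalogNameLookup for MapCatalog {
    fn cluster_name(&self, id: &str) -> Option<String> {
        self.clusters.get(id).cloned()
    }
    fn replica_name(&self, id: &str) -> Option<String> {
        self.replicas.get(id).cloned()
    }
    fn object_name(&self, id: &str) -> Option<String> {
        self.objects.get(id).cloned()
    }
}
```

Keeping the rules as plain data (an enum rather than closures) is what would let them be stored per metric and later serialized into a header, which is the direction the follow-up commits take.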
Adds two header constants and a wants_enrich_rules helper. When a caller sends X-Materialize-Accept-Enrich-Rules, handle_prometheus serializes registry.rules_by_metric() and attaches the result as X-Materialize-Enrich-Rules in the response. The header is opt-in, so default Prometheus scrapers see clean responses; only consumers that understand the per-metric rules wire format request it (today, env's federated /metrics/public scraper).
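The opt-in handshake can be sketched roughly as follows. The two header names and wants_enrich_rules come from the commit message; the constant names, the Headers alias, and response_headers are hypothetical, and the serialization of the rules is left to a caller-supplied closure because the wire format is not described here.

```rust
use std::collections::BTreeMap;

// Header names from the commit message; the constant names are assumptions.
pub const ACCEPT_ENRICH_RULES_HEADER: &str = "x-materialize-accept-enrich-rules";
pub const ENRICH_RULES_HEADER: &str = "x-materialize-enrich-rules";

/// Hypothetical stand-in for an HTTP header map.
pub type Headers = BTreeMap<String, String>;

/// True if the caller opted in to receiving per-metric rules.
/// HTTP header names are case-insensitive, so compare accordingly.
pub fn wants_enrich_rules(request: &Headers) -> bool {
    request
        .keys()
        .any(|name| name.eq_ignore_ascii_case(ACCEPT_ENRICH_RULES_HEADER))
}

/// Builds the extra response headers for a metrics scrape. The rules are
/// serialized only when requested, so ordinary scrapers get no extra header.
pub fn response_headers(
    request: &Headers,
    serialize_rules: impl FnOnce() -> String,
) -> Headers {
    let mut response = Headers::new();
    if wants_enrich_rules(request) {
        response.insert(ENRICH_RULES_HEADER.to_string(), serialize_rules());
    }
    response
}
```

Passing the serializer as a closure means the registry is only walked when a consumer actually asks for the rules.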
Wires env's federated /metrics/public endpoint to the per-metric rules mechanism: the scraper sends the opt-in header on every replica scrape, parses X-Materialize-Enrich-Rules off the response as a map of metric name -> rules, and applies each rule only to the matching family. Adds CatalogLookup, which wraps &Catalog and implements CatalogNameLookup with typed ClusterId / ReplicaId / GlobalId parsing. Splits the old add_replica_labels into stamp_scrape_context_labels (the three labels that come from the connection) plus rule application. Env-local metrics get the same per-metric treatment via metrics_registry.rules_by_metric(). The E2E test test_metrics_public_endpoint is expected to be red after this commit until callers declare rules on their metrics in the next change.
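The per-family scoping described above could look roughly like this. It is a simplified sketch, not the PR's code: Family, NameRule, and enrich_families are hypothetical names, the three rule kinds are collapsed into a single struct, and the catalog is reduced to a resolver function.

```rust
use std::collections::BTreeMap;

type Labels = BTreeMap<String, String>;

/// Hypothetical stand-in for a scraped metric family.
pub struct Family {
    pub name: String,
    pub samples: Vec<Labels>,
}

/// Simplified rule: resolve the ID in `id_label` and store it in `name_label`.
pub struct NameRule {
    pub id_label: String,
    pub name_label: String,
}

/// Applies each rule only to the family whose name it was declared for.
/// Families without an entry in `rules_by_metric` pass through unchanged,
/// even if they happen to carry a label with the same name.
pub fn enrich_families(
    families: &mut [Family],
    rules_by_metric: &BTreeMap<String, Vec<NameRule>>,
    resolve: &dyn Fn(&str) -> Option<String>,
) {
    for family in families.iter_mut() {
        let Some(rules) = rules_by_metric.get(&family.name) else {
            continue;
        };
        for rule in rules {
            for labels in &mut family.samples {
                let name = labels
                    .get(&rule.id_label)
                    .and_then(|id| resolve(id.as_str()));
                if let Some(name) = name {
                    labels.insert(rule.name_label.clone(), name);
                }
            }
        }
    }
}
```

Looking rules up by family name is what keeps a rule declared for one metric from touching another metric that merely shares a label name.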
Migrates the three subsystems with metrics carrying cluster_id / replica_id / source_id / parent_source_id / collection_id labels to declare their enrichment rules inline via the metric! macro's new rules: [...] field:
- cluster-client: ControllerMetrics adds ClusterNameLookup, ReplicaNameLookup, and ObjectNameLookup (for collection_id) to each of the three dataflow wallclock-lag metrics.
- compute-client: ComputeControllerMetrics gains instance_name_rule() and replica_name_rule() helpers; every metric with an instance_id / replica_id label is decorated.
- storage/statistics: SourceStatisticsMetricDefs gains source_name_rule() and parent_source_name_rule() helpers; every source metric with source_id / parent_source_id is decorated.
These declarations replace what the prototype's global register_rule calls used to do, but they are scoped per metric so labels can't bleed across unrelated families. The E2E test test_metrics_public_endpoint is back to green via cluster-client's mz_dataflow_wallclock_lag_seconds rule.
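To make the per-metric declaration concrete, here is a small sketch of rule helpers feeding a registry keyed by metric name. The helper names instance_name_rule and replica_name_rule and the rules_by_metric accessor appear in the commit messages; the Registry type, the rule fields, the mapping of instance_id to a cluster lookup, and the example metric names are assumptions for illustration only.

```rust
use std::collections::BTreeMap;

/// Simplified rule type; the fields are assumed for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub enum Rule {
    ClusterNameLookup { id_label: &'static str },
    ReplicaNameLookup { id_label: &'static str },
}

/// Helper for metrics labeled with `instance_id`. Treating a compute
/// instance as a cluster lookup is an assumption of this sketch.
pub fn instance_name_rule() -> Rule {
    Rule::ClusterNameLookup { id_label: "instance_id" }
}

/// Helper for metrics labeled with `replica_id`.
pub fn replica_name_rule() -> Rule {
    Rule::ReplicaNameLookup { id_label: "replica_id" }
}

/// Hypothetical registry that remembers which rules each metric declared.
#[derive(Default)]
pub struct Registry {
    rules: BTreeMap<String, Vec<Rule>>,
}

impl Registry {
    /// Records the rules declared alongside a metric. Metrics that declare
    /// no rules get no entry, so they are never enriched.
    pub fn register(&mut self, metric_name: &str, rules: Vec<Rule>) {
        if !rules.is_empty() {
            self.rules.insert(metric_name.to_string(), rules);
        }
    }

    /// Rules keyed by metric name.
    pub fn rules_by_metric(&self) -> &BTreeMap<String, Vec<Rule>> {
        &self.rules
    }
}
```

Small helpers like these keep the label names in one place, so every decorated metric in a subsystem declares the same rule without repeating string literals.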
129175c to
3412db4
Compare
Adds a feature flag for the upcoming federated
/metrics/external endpoint on environmentd, which will fan out a single
scrape across env's local metrics and every clusterd replica's /metrics.