@@ -77,30 +77,108 @@ Metrics metrics; // initialize your metrics implementation
7777Operator operator = new Operator(client, o -> o.withMetrics(metrics));
7878```
7979
80- ### Micrometer implementation
80+ ### MicrometerMetricsV2 (Recommended, since 5.3.0)
8181
82- The micrometer implementation is typically created using one of the provided factory methods which, depending on which
83- is used, will return either a ready to use instance or a builder allowing users to customize how the implementation
84- behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
85- metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but
86- this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which
87- could lead to performance issues.
82+ [`MicrometerMetricsV2`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java) is the recommended micrometer-based implementation. It is designed with low cardinality in mind:
83+ all meters are scoped to the controller, not to individual resources. This avoids unbounded cardinality growth as
84+ resources come and go.
8885
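To make the cardinality argument concrete, the following sketch counts the distinct time series a single counter would produce under the two tagging strategies. It is plain Java and does not use JOSDK or Micrometer; the class `CardinalitySketch`, the key format, the `name` tag, and the controller name `my-controller` are illustrative assumptions, not part of the library.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: models a time series as "meter name + tag set" rendered as a string.
final class CardinalitySketch {

  // Series key when only the controller name is used as a tag.
  static String controllerScopedKey(String meter, String controller) {
    return meter + "{controller.name=" + controller + "}";
  }

  // Series key when resource-specific tags are added (hypothetical tag names).
  static String resourceScopedKey(String meter, String controller, String namespace, String name) {
    return meter + "{controller.name=" + controller
        + ",namespace=" + namespace + ",name=" + name + "}";
  }

  // Counts the distinct series one meter yields for namespaces * resourcesPerNamespace resources.
  static int countSeries(boolean perResource, int namespaces, int resourcesPerNamespace) {
    String meter = "reconciliations.started.total";
    String controller = "my-controller"; // hypothetical controller name
    Set<String> series = new HashSet<>();
    for (int ns = 0; ns < namespaces; ns++) {
      for (int r = 0; r < resourcesPerNamespace; r++) {
        series.add(perResource
            ? resourceScopedKey(meter, controller, "ns-" + ns, "res-" + r)
            : controllerScopedKey(meter, controller));
      }
    }
    return series.size();
  }

  public static void main(String[] args) {
    System.out.println("controller-scoped series: " + countSeries(false, 10, 100));
    System.out.println("resource-scoped series:   " + countSeries(true, 10, 100));
  }
}
```

With controller-scoped tags the series count stays constant however many resources exist, whereas resource-scoped tags produce one series per resource, and that count keeps growing as resources are created and deleted.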
89- To create a ` MicrometerMetrics ` implementation that behaves how it has historically behaved, you can just create an
90- instance via:
86+ The simplest way to create an instance:
9187
9288``` java
9389MeterRegistry registry; // initialize your registry implementation
94- Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
90+ Metrics metrics = MicrometerMetricsV2.newPerResourceCollectingMicrometerMetricsBuilder(registry).build();
91+ ```
92+
93+ Optionally, include a ` namespace ` tag on per-reconciliation counters (disabled by default to avoid unexpected
94+ cardinality increases in existing deployments):
95+
96+ ``` java
97+ Metrics metrics = MicrometerMetricsV2.newPerResourceCollectingMicrometerMetricsBuilder(registry)
98+     .withNamespaceAsTag()
99+     .build();
100+ ```
101+
102+ You can also supply a custom timer configuration for ` reconciliations.execution.duration ` :
103+
104+ ``` java
105+ Metrics metrics = MicrometerMetricsV2.newPerResourceCollectingMicrometerMetricsBuilder(registry)
106+     .withExecutionTimerConfig(builder -> builder.publishPercentiles(0.5, 0.95, 0.99))
107+     .build();
95108```
96109
97- The class provides factory methods which either return a fully pre-configured instance or a builder object that will
98- allow you to configure more easily how the instance will behave. You can, for example, configure whether the
99- implementation should collect metrics on a per-resource basis, whether associated meters should be removed when a
100- resource is deleted and how the clean-up is performed. See the relevant classes documentation for more details.
110+ #### MicrometerMetricsV2 metrics
111+
112+ All meters use ` controller.name ` as their primary tag. Counters optionally carry a ` namespace ` tag when
113+ ` withNamespaceAsTag() ` is enabled.
114+
115+ | Meter name (Micrometer) | Type | Tags | Description |
116+ | ------------------------------------------| ---------| -----------------------------------| ----------------------------------------------------------------------|
117+ | ` reconciliations.executions ` | gauge | ` controller.name ` | Number of reconciler executions currently in progress |
118+ | ` reconciliations.active ` | gauge | ` controller.name ` | Number of resources currently queued for reconciliation |
119+ | ` custom_resources ` | gauge | ` controller.name ` | Number of custom resources tracked by the controller |
120+ | ` reconciliations.execution.duration ` | timer | ` controller.name ` | Reconciliation execution duration with explicit SLO bucket histogram |
121+ | ` reconciliations.started.total ` | counter | ` controller.name ` , ` namespace ` * | Number of reconciliations started (including retries) |
122+ | ` reconciliations.success.total ` | counter | ` controller.name ` , ` namespace ` * | Number of successfully finished reconciliations |
123+ | ` reconciliations.failure.total ` | counter | ` controller.name ` , ` namespace ` * | Number of failed reconciliations |
124+ | ` reconciliations.retries.total ` | counter | ` controller.name ` , ` namespace ` * | Number of reconciliation retries |
125+ | ` events.received ` | counter | ` controller.name ` , ` event ` , ` action ` , ` namespace ` * | Number of Kubernetes events received by the controller |
126+ | ` events.delete ` | counter | ` controller.name ` , ` namespace ` * | Number of resource deletion events processed |
127+
128+ \* ` namespace ` tag is only included when ` withNamespaceAsTag() ` is enabled.
129+
130+ The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure
131+ compatibility with ` histogram_quantile() ` queries in Prometheus. This is important when using the OTLP registry, where
132+ ` publishPercentileHistogram() ` would otherwise produce Base2 Exponential Histograms that are incompatible with classic
133+ ` _bucket ` queries.
134+
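To illustrate why explicit bucket boundaries matter, here is a simplified sketch of the linear interpolation that a classic-histogram quantile query such as `histogram_quantile()` performs over cumulative `_bucket` counts. The class `HistogramQuantileSketch` and its method are hypothetical helpers written for this example, and edge cases are simplified compared to Prometheus itself; only the boundary values come from the list above.

```java
// Illustrative only: approximates a quantile from cumulative bucket counts.
final class HistogramQuantileSketch {

  // SLO boundaries from the text, in seconds, followed by the +Inf bucket.
  static final double[] SLO_BOUNDS_SECONDS = {
    0.01, 0.05, 0.1, 0.25, 0.5, 1, 2, 5, 10, 30, Double.POSITIVE_INFINITY
  };

  // upperBounds must be ascending and end with +Inf; cumulativeCounts[i] counts observations <= upperBounds[i].
  static double quantile(double q, double[] upperBounds, long[] cumulativeCounts) {
    int n = upperBounds.length;
    if (n < 2 || n != cumulativeCounts.length) {
      throw new IllegalArgumentException("need matching arrays with at least two buckets");
    }
    if (q < 0 || q > 1) {
      throw new IllegalArgumentException("q must be within [0, 1]");
    }
    long total = cumulativeCounts[n - 1];
    if (total == 0) {
      return Double.NaN; // no observations recorded
    }
    double rank = q * total;
    // Find the first bucket whose cumulative count reaches the rank.
    int i = 0;
    while (i < n - 1 && cumulativeCounts[i] < rank) {
      i++;
    }
    if (i == n - 1) {
      return upperBounds[n - 2]; // rank falls into +Inf: report the highest finite bound
    }
    double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
    long below = (i == 0) ? 0L : cumulativeCounts[i - 1];
    long inBucket = cumulativeCounts[i] - below;
    if (inBucket == 0) {
      return lower;
    }
    // Assume observations are spread evenly inside the bucket.
    return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
  }

  public static void main(String[] args) {
    long[] cumulative = {0, 10, 40, 80, 95, 100, 100, 100, 100, 100, 100};
    System.out.println("p50 (s): " + quantile(0.50, SLO_BOUNDS_SECONDS, cumulative));
    System.out.println("p95 (s): " + quantile(0.95, SLO_BOUNDS_SECONDS, cumulative));
  }
}
```

The estimate can only be as precise as the bucket it lands in, which is why boundaries placed around the latencies you care about give more useful quantiles.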
135+ > **Note on Prometheus metric names**: The exact Prometheus metric name suffix depends on the `MeterRegistry` in use.
136+ > For `PrometheusMeterRegistry` the timer is exposed as `reconciliations_execution_duration_seconds_*`. For
137+ > `OtlpMeterRegistry` (metrics exported via OpenTelemetry Collector), it is exposed as
138+ > `reconciliations_execution_duration_milliseconds_*`.
139+
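The naming difference in the note can be sketched as a small helper that derives the expected series names from the Micrometer meter name and the base time unit. This is a simplification written for illustration: the class `PrometheusNameSketch` is hypothetical, and the registries apply further naming rules that are not modelled here. The `_bucket`, `_count`, and `_sum` suffixes are the standard series of a classic Prometheus histogram.

```java
import java.util.List;

// Illustrative only: derives the Prometheus-style series names for a timer.
final class PrometheusNameSketch {

  // Replaces dots with underscores and appends the base time unit.
  static String timerPrefix(String micrometerName, String baseUnit) {
    return micrometerName.replace('.', '_') + "_" + baseUnit;
  }

  // Series that a classic histogram exposes for the given prefix.
  static List<String> histogramSeries(String prefix) {
    return List.of(prefix + "_bucket", prefix + "_count", prefix + "_sum");
  }

  public static void main(String[] args) {
    String meter = "reconciliations.execution.duration";
    System.out.println(histogramSeries(timerPrefix(meter, "seconds")));      // PrometheusMeterRegistry
    System.out.println(histogramSeries(timerPrefix(meter, "milliseconds"))); // OtlpMeterRegistry
  }
}
```

When writing dashboard queries or alerts, check the names actually exposed by your registry instead of relying on this derivation.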
140+ #### Grafana Dashboard
141+
142+ A ready-to-use Grafana dashboard is available at
143+ [`observability/josdk-operator-metrics-dashboard.json`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/observability/josdk-operator-metrics-dashboard.json).
144+ It visualizes all of the metrics listed above, including reconciliation throughput, error rates, queue depth, active
145+ executions, resource counts, and execution duration histograms and heatmaps.
146+
147+ The dashboard is designed to work with metrics exported via OpenTelemetry Collector to Prometheus, as set up by the
148+ observability sample (see below).
149+
150+ #### Exploring metrics end-to-end
151+
152+ The
153+ [`metrics-processing` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
154+ includes a full end-to-end test,
155+ [`MetricsHandlingE2E`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java),
156+ that:
157+
158+ 1. Installs a local observability stack (Prometheus, Grafana, OpenTelemetry Collector) via
159+    `observability/install-observability.sh`. This also imports the Grafana dashboards.
160+ 2. Runs two reconcilers that produce both successful and failing reconciliations over a sustained period.
161+ 3. Verifies that the expected metrics appear in Prometheus.
162+
163+ This is a good starting point for experimenting with the metrics and the Grafana dashboard in a real cluster without
164+ having to deploy your own operator.
165+
166+ ### MicrometerMetrics (Deprecated)
167+
168+ > **Deprecated**: `MicrometerMetrics` (V1) is deprecated as of JOSDK 5.3.0. Use `MicrometerMetricsV2` instead.
169+ > V1 attaches resource-specific metadata (name, namespace, etc.) as tags to every meter, which causes unbounded
170+ > cardinality growth and can lead to performance issues in your metrics backend.
171+
172+ The legacy ` MicrometerMetrics ` implementation is still available. To create an instance that behaves as it historically
173+ has:
174+
175+ ``` java
176+ MeterRegistry registry; // initialize your registry implementation
177+ Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
178+ ```
101179
102- For example, the following will create a ` MicrometerMetrics ` instance configured to collect metrics on a per- resource
103- basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.
180+ To collect metrics on a per-resource basis, deleting the associated meters after 5 seconds when a resource is deleted,
181+ using up to 2 threads:
104182
105183``` java
106184MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
@@ -109,9 +187,9 @@ MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
109187 .build();
110188```
111189
112- ### Operator SDK metrics
190+ #### Operator SDK metrics (V1)
113191
114- The micrometer implementation records the following metrics:
192+ The V1 micrometer implementation records the following metrics:
115193
116194| Meter name | Type | Tag names | Description |
117195| -------------------------------------------------------------| ----------------| -------------------------------------------------------------------------------------| --------------------------------------------------------------------------------------------------------|
@@ -130,12 +208,11 @@ The micrometer implementation records the following metrics:
130208| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
131209| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |
132210
133- As you can see all the recorded metrics start with the ` operator.sdk ` prefix. ` <resource metadata> ` , in the table above,
134- refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and
135- could be summed up as follows: ` group?, version, kind, [name, namespace?], scope ` where the tags in square
136- brackets (` [] ` ) won't be present when per-resource collection is disabled and tags followed by a question mark are
137- omitted if the associated value is empty. Of note, when in the context of controllers' execution metrics, these tag
138- names are prefixed with ` resource. ` . This prefix might be removed in a future version for greater consistency.
211+ All V1 metrics start with the `operator.sdk` prefix. `<resource metadata>` refers to resource-specific metadata and
212+ depends on the considered metric and how the implementation is configured: `group?, version, kind, [name, namespace?],
213+ scope`, where tags in square brackets (`[]`) won't be present when per-resource collection is disabled and tags followed
214+ by a question mark are omitted if the value is empty. In the context of controllers' execution metrics, these tag names
215+ are prefixed with `resource.`.
139216
140217### Aggregated Metrics
141218