
Commit f9fdea1

committed
documentation update
Signed-off-by: Attila Mészáros <a_meszaros@apple.com>
1 parent fe4ab6a commit f9fdea1

File tree

2 files changed

+104
-24
lines changed

docs/content/en/docs/documentation/observability.md

Lines changed: 101 additions & 24 deletions
@@ -77,30 +77,108 @@ Metrics metrics; // initialize your metrics implementation
7777
Operator operator = new Operator(client, o -> o.withMetrics(metrics));
7878
```
7979

80-
### Micrometer implementation
80+
### MicrometerMetricsV2 (Recommended, since 5.3.0)
8181

82-
The micrometer implementation is typically created using one of the provided factory methods which, depending on which
83-
is used, will return either a ready to use instance or a builder allowing users to customize how the implementation
84-
behaves, in particular when it comes to the granularity of collected metrics. It is, for example, possible to collect
85-
metrics on a per-resource basis via tags that are associated with meters. This is the default, historical behavior but
86-
this will change in a future version of JOSDK because this dramatically increases the cardinality of metrics, which
87-
could lead to performance issues.
82+
[`MicrometerMetricsV2`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java) is the recommended Micrometer-based implementation. It is designed with low cardinality in mind:
83+
all meters are scoped to the controller, not to individual resources. This avoids unbounded cardinality growth as
84+
resources come and go.
8885

89-
To create a `MicrometerMetrics` implementation that behaves how it has historically behaved, you can just create an
90-
instance via:
86+
The simplest way to create an instance:
9187

9288
```java
9389
MeterRegistry registry; // initialize your registry implementation
94-
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
90+
Metrics metrics = MicrometerMetricsV2.newPerResourceCollectingMicrometerMetricsBuilder(registry).build();
91+
```
92+
93+
Optionally, include a `namespace` tag on per-reconciliation counters (disabled by default to avoid unexpected
94+
cardinality increases in existing deployments):
95+
96+
```java
97+
Metrics metrics = MicrometerMetricsV2.newPerResourceCollectingMicrometerMetricsBuilder(registry)
98+
.withNamespaceAsTag()
99+
.build();
100+
```
101+
102+
You can also supply a custom timer configuration for `reconciliations.execution.duration`:
103+
104+
```java
105+
Metrics metrics = MicrometerMetricsV2.newPerResourceCollectingMicrometerMetricsBuilder(registry)
106+
.withExecutionTimerConfig(builder -> builder.publishPercentiles(0.5, 0.95, 0.99))
107+
.build();
95108
```
96109

97-
The class provides factory methods which either return a fully pre-configured instance or a builder object that will
98-
allow you to configure more easily how the instance will behave. You can, for example, configure whether the
99-
implementation should collect metrics on a per-resource basis, whether associated meters should be removed when a
100-
resource is deleted and how the clean-up is performed. See the relevant classes documentation for more details.
110+
#### MicrometerMetricsV2 metrics
111+
112+
All meters use `controller.name` as their primary tag. Counters optionally carry a `namespace` tag when
113+
`withNamespaceAsTag()` is enabled.
114+
115+
| Meter name (Micrometer) | Type | Tags | Description |
116+
|------------------------------------------|---------|-----------------------------------|----------------------------------------------------------------------|
117+
| `reconciliations.executions` | gauge | `controller.name` | Number of reconciler executions currently in progress |
118+
| `reconciliations.active` | gauge | `controller.name` | Number of resources currently queued for reconciliation |
119+
| `custom_resources` | gauge | `controller.name` | Number of custom resources tracked by the controller |
120+
| `reconciliations.execution.duration` | timer | `controller.name` | Reconciliation execution duration with explicit SLO bucket histogram |
121+
| `reconciliations.started.total` | counter | `controller.name`, `namespace`* | Number of reconciliations started (including retries) |
122+
| `reconciliations.success.total` | counter | `controller.name`, `namespace`* | Number of successfully finished reconciliations |
123+
| `reconciliations.failure.total` | counter | `controller.name`, `namespace`* | Number of failed reconciliations |
124+
| `reconciliations.retries.total` | counter | `controller.name`, `namespace`* | Number of reconciliation retries |
125+
| `events.received` | counter | `controller.name`, `event`, `action`, `namespace`* | Number of Kubernetes events received by the controller |
126+
| `events.delete` | counter | `controller.name`, `namespace`* | Number of resource deletion events processed |
127+
128+
\* `namespace` tag is only included when `withNamespaceAsTag()` is enabled.
129+
130+
The execution timer uses explicit SLO boundaries (10ms, 50ms, 100ms, 250ms, 500ms, 1s, 2s, 5s, 10s, 30s) to ensure
131+
compatibility with `histogram_quantile()` queries in Prometheus. This is important when using the OTLP registry, where
132+
`publishPercentileHistogram()` would otherwise produce Base2 Exponential Histograms that are incompatible with classic
133+
`_bucket` queries.
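
To illustrate why fixed bucket boundaries matter, the following is a minimal, standard-library-only sketch of how a quantile can be estimated from cumulative bucket counts using linear interpolation inside a bucket, which is the approach Prometheus documents for `histogram_quantile()`. The class name `SloBucketQuantile`, the method `estimateQuantile`, and the sample counts are hypothetical and are not part of JOSDK or Micrometer; only the boundary values come from the list above.

```java
public class SloBucketQuantile {

  // Bucket upper bounds in seconds, mirroring the SLO boundaries listed above.
  static final double[] UPPER_BOUNDS_SECONDS = {
    0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0
  };

  /**
   * Estimates the q-quantile from cumulative bucket counts.
   *
   * @param q quantile in [0, 1]
   * @param cumulativeCounts cumulativeCounts[i] is the number of observations
   *     less than or equal to UPPER_BOUNDS_SECONDS[i]
   * @param totalCount total number of observations (the +Inf bucket)
   */
  static double estimateQuantile(double q, long[] cumulativeCounts, long totalCount) {
    if (q < 0.0 || q > 1.0) {
      throw new IllegalArgumentException("q must be in [0, 1]");
    }
    if (cumulativeCounts.length != UPPER_BOUNDS_SECONDS.length) {
      throw new IllegalArgumentException("one cumulative count per bucket is required");
    }
    if (totalCount == 0) {
      return Double.NaN; // no observations, nothing to estimate
    }
    double rank = q * totalCount;
    for (int i = 0; i < cumulativeCounts.length; i++) {
      if (cumulativeCounts[i] >= rank) {
        double lowerBound = i == 0 ? 0.0 : UPPER_BOUNDS_SECONDS[i - 1];
        long countBelow = i == 0 ? 0 : cumulativeCounts[i - 1];
        long inBucket = cumulativeCounts[i] - countBelow;
        if (inBucket == 0) {
          return UPPER_BOUNDS_SECONDS[i];
        }
        // Linear interpolation between the bucket's lower and upper bound.
        double fraction = (rank - countBelow) / inBucket;
        return lowerBound + (UPPER_BOUNDS_SECONDS[i] - lowerBound) * fraction;
      }
    }
    // The rank falls into the +Inf bucket: report the highest finite bound.
    return UPPER_BOUNDS_SECONDS[UPPER_BOUNDS_SECONDS.length - 1];
  }

  public static void main(String[] args) {
    // Hypothetical cumulative counts for 100 reconciliations.
    long[] cumulative = {10, 40, 70, 90, 96, 98, 99, 100, 100, 100};
    System.out.printf("p50 ~ %.4f s%n", estimateQuantile(0.5, cumulative, 100));
    System.out.printf("p95 ~ %.4f s%n", estimateQuantile(0.95, cumulative, 100));
  }
}
```

The estimate is only as precise as the bucket the rank falls into, which is why boundaries chosen around the latencies you care about give more useful percentiles than arbitrary ones.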
134+
135+
> **Note on Prometheus metric names**: The exact Prometheus metric name suffix depends on the `MeterRegistry` in use.
136+
> For `PrometheusMeterRegistry` the timer is exposed as `reconciliations_execution_duration_seconds_*`. For
137+
> `OtlpMeterRegistry` (metrics exported via OpenTelemetry Collector), it is exposed as
138+
> `reconciliations_execution_duration_milliseconds_*`.
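
As a rough illustration of the naming difference described in the note, the sketch below derives the two timer name prefixes from the Micrometer meter name. It is a deliberate simplification: the helper `timerName` and the class `PrometheusNameSketch` are hypothetical, and real registries apply their own naming conventions that handle more cases than a dot-to-underscore replacement.

```java
public class PrometheusNameSketch {

  /**
   * Simplified, hypothetical mapping: replace dots with underscores and
   * append the base time unit used by the exporting registry.
   */
  static String timerName(String meterName, String baseUnit) {
    return meterName.replace('.', '_') + "_" + baseUnit;
  }

  public static void main(String[] args) {
    String meter = "reconciliations.execution.duration";
    // Prefix as exposed by PrometheusMeterRegistry, per the note above.
    System.out.println(timerName(meter, "seconds"));
    // Prefix as exposed via OtlpMeterRegistry, per the note above.
    System.out.println(timerName(meter, "milliseconds"));
  }
}
```

When writing dashboard queries or alerts, it helps to check which of the two prefixes your setup actually exposes before hard-coding a metric name.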
139+
140+
#### Grafana Dashboard
141+
142+
A ready-to-use Grafana dashboard is available at
143+
[`observability/josdk-operator-metrics-dashboard.json`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/observability/josdk-operator-metrics-dashboard.json).
144+
It visualizes all of the metrics listed above, including reconciliation throughput, error rates, queue depth, active
145+
executions, resource counts, and execution duration histograms and heatmaps.
146+
147+
The dashboard is designed to work with metrics exported via OpenTelemetry Collector to Prometheus, as set up by the
148+
observability sample (see below).
149+
150+
#### Exploring metrics end-to-end
151+
152+
The
153+
[`metrics-processing` sample operator](https://github.com/java-operator-sdk/java-operator-sdk/tree/main/sample-operators/metrics-processing)
154+
includes a full end-to-end test,
155+
[`MetricsHandlingE2E`](https://github.com/java-operator-sdk/java-operator-sdk/blob/main/sample-operators/metrics-processing/src/test/java/io/javaoperatorsdk/operator/sample/metrics/MetricsHandlingE2E.java),
156+
that:
157+
158+
1. Installs a local observability stack (Prometheus, Grafana, OpenTelemetry Collector) via
159+
`observability/install-observability.sh`. This also imports the Grafana dashboards.
160+
2. Runs two reconcilers that produce both successful and failing reconciliations over a sustained period.
161+
3. Verifies that the expected metrics appear in Prometheus.
162+
163+
This is a good starting point for experimenting with the metrics and the Grafana dashboard in a real cluster without
164+
having to deploy your own operator.
165+
166+
### MicrometerMetrics (Deprecated)
167+
168+
> **Deprecated**: `MicrometerMetrics` (V1) is deprecated as of JOSDK 5.3.0. Use `MicrometerMetricsV2` instead.
169+
> V1 attaches resource-specific metadata (name, namespace, etc.) as tags to every meter, which causes unbounded
170+
> cardinality growth and can lead to performance issues in your metrics backend.
171+
172+
The legacy `MicrometerMetrics` implementation is still available. To create an instance that behaves as it historically
173+
has:
174+
175+
```java
176+
MeterRegistry registry; // initialize your registry implementation
177+
Metrics metrics = MicrometerMetrics.newMicrometerMetricsBuilder(registry).build();
178+
```
101179

102-
For example, the following will create a `MicrometerMetrics` instance configured to collect metrics on a per-resource
103-
basis, deleting the associated meters after 5 seconds when a resource is deleted, using up to 2 threads to do so.
180+
To collect metrics on a per-resource basis, deleting the associated meters after 5 seconds when a resource is deleted,
181+
using up to 2 threads:
104182

105183
```java
106184
MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
@@ -109,9 +187,9 @@ MicrometerMetrics.newPerResourceCollectingMicrometerMetricsBuilder(registry)
109187
.build();
110188
```
111189

112-
### Operator SDK metrics
190+
#### Operator SDK metrics (V1)
113191

114-
The micrometer implementation records the following metrics:
192+
The V1 micrometer implementation records the following metrics:
115193

116194
| Meter name | Type | Tag names | Description |
117195
|-------------------------------------------------------------|----------------|-------------------------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------|
@@ -130,12 +208,11 @@ The micrometer implementation records the following metrics:
130208
| operator.sdk.controllers.execution.cleanup.success | counter | controller, type | Number of successful cleanups per controller |
131209
| operator.sdk.controllers.execution.cleanup.failure | counter | controller, exception | Number of failed cleanups per controller |
132210

133-
As you can see all the recorded metrics start with the `operator.sdk` prefix. `<resource metadata>`, in the table above,
134-
refers to resource-specific metadata and depends on the considered metric and how the implementation is configured and
135-
could be summed up as follows: `group?, version, kind, [name, namespace?], scope` where the tags in square
136-
brackets (`[]`) won't be present when per-resource collection is disabled and tags followed by a question mark are
137-
omitted if the associated value is empty. Of note, when in the context of controllers' execution metrics, these tag
138-
names are prefixed with `resource.`. This prefix might be removed in a future version for greater consistency.
211+
All V1 metrics start with the `operator.sdk` prefix. `<resource metadata>` refers to resource-specific metadata and
212+
depends on the metric in question and on how the implementation is configured: `group?, version, kind, [name, namespace?],
213+
scope` where tags in square brackets (`[]`) won't be present when per-resource collection is disabled and tags followed
214+
by a question mark are omitted if the value is empty. In the context of controllers' execution metrics, these tag names
215+
are prefixed with `resource.`.
139216

140217
### Aggregated Metrics
141218

micrometer-support/src/main/java/io/javaoperatorsdk/operator/monitoring/micrometer/MicrometerMetricsV2.java

Lines changed: 3 additions & 0 deletions
@@ -38,6 +38,9 @@
3838
import io.micrometer.core.instrument.Tag;
3939
import io.micrometer.core.instrument.Timer;
4040

41+
/**
42+
* @since 5.3.0
43+
*/
4144
public class MicrometerMetricsV2 implements Metrics {
4245

4346
private static final String CONTROLLER_NAME = "controller.name";
