Introduce comprehensive observability for SeiNodeGroup and SeiNode controllers:

Metrics:
- SeiNodeGroup: phase gauge, replicas gauges, condition gauges, reconcile substep duration histogram, reconcile error counter
- SeiNode: phase gauge, phase transition counter, init duration histogram + last-init gauge, sidecar request duration histogram, sidecar unreachable counter, reconcile error counter
- Shared observability package with InitBuckets, NormalizeStatusCode, EmitPhaseGauge/DeletePhaseGauge helpers, and centralised ReconcileErrorsTotal counter

Events:
- Kubernetes events on SeiNode phase transitions via EventRecorder
- RBAC and FakeRecorder wiring in tests

Infrastructure:
- Switch metrics endpoint from HTTPS :8443 to HTTP :8080
- Add container port declaration and update Service + NetworkPolicy
- ServiceMonitor for Prometheus Operator scraping
- PrometheusRule with 7 alerts (degraded/failed groups, stuck nodes, sidecar unreachable, reconcile errors, high latency)
- Wire monitoring resources into config/default kustomization

Made-with: Cursor
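The commit message names a NormalizeStatusCode helper without showing it. A minimal sketch of how such a helper might collapse HTTP status codes into class labels follows. The signature, the `"unknown"` fallback, and the accepted range are assumptions for illustration, not the PR's actual code.

```go
package main

import "strconv"

// NormalizeStatusCode collapses an HTTP status code into its class label
// ("2xx", "4xx", ...) so a metric label can take only a handful of values.
// Assumption of this sketch: codes outside 100-599 map to "unknown".
func NormalizeStatusCode(code int) string {
	if code < 100 || code > 599 {
		return "unknown"
	}
	// Integer division keeps only the hundreds digit: 204 -> 2 -> "2xx".
	return strconv.Itoa(code/100) + "xx"
}
```

Under these assumptions, `NormalizeStatusCode(204)` yields `"2xx"` and `NormalizeStatusCode(503)` yields `"5xx"`, so a status label on the sidecar request histogram would have at most six distinct values.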
Summary
Adds comprehensive observability for the SeiNodeGroup and SeiNode controllers — Prometheus metrics, Kubernetes events, and Prometheus Operator monitoring resources (ServiceMonitor + PrometheusRule).
Metrics added
- `sei_controller_seinodegroup_phase`
- `sei_controller_seinodegroup_replicas`
- `sei_controller_seinodegroup_condition`
- `sei_controller_seinodegroup_reconcile_substep_duration_seconds`
- `sei_controller_seinode_phase`
- `sei_controller_seinode_phase_transitions_total`
- `sei_controller_seinode_init_duration_seconds`
- `sei_controller_seinode_last_init_duration_seconds`
- `sei_controller_sidecar_request_duration_seconds`
- `sei_controller_sidecar_unreachable_total`
- `sei_controller_reconcile_errors_total`

Events
- `PhaseTransition` events emitted on every SeiNode phase change via `EventRecorder`

Infrastructure
- Metrics endpoint switched from HTTPS `:8443` → HTTP `:8080`
- `ServiceMonitor` for Prometheus Operator scraping (30s interval)
- `PrometheusRule` with 7 alerts: `SeiNodeGroupDegraded`, `SeiNodeGroupFailed`, `SeiNodeStuckInitializing`, `SeiNodeStuckPending`, `SidecarUnreachableHigh`, `ControllerReconcileErrors`, `ControllerHighReconcileLatency`

Design decisions
- `name` label omitted from transition counter and sidecar histogram to control cardinality
- `InitBuckets` (10s–1h) used for node init durations vs `ReconcileBuckets` for substeps
- Status codes normalised to classes (`2xx`, `4xx`, etc.) to bound cardinality
- `observability` package centralises helpers and cross-controller metrics
- `SeiNodeStuckPending` alert uses `max by()` aggregation to prevent `for` timer resets across Pending/PreInitializing transitions

Test plan
- `make test` passes (FakeRecorder injected in all test reconcilers)
- […] `/metrics`
- […] (`kubectl get servicemonitor -A`)
- […] `kubectl describe seinode`
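The design decisions give the `InitBuckets` range (10s to 1h) but not the individual boundaries. One plausible layout for a range that spans more than two orders of magnitude is geometric spacing, sketched below with the standard library only. The helper name `exponentialBucketsRange`, the bucket count of 8, and the spacing itself are assumptions for illustration; the PR's actual boundaries may differ.

```go
package main

import "math"

// exponentialBucketsRange returns count histogram upper bounds spaced
// geometrically from lo to hi inclusive (both in seconds).
// Hypothetical helper for this sketch; it returns nil for invalid input.
func exponentialBucketsRange(lo, hi float64, count int) []float64 {
	if count < 2 || lo <= 0 || hi <= lo {
		return nil
	}
	// Constant ratio between neighbouring bounds.
	factor := math.Pow(hi/lo, 1/float64(count-1))
	buckets := make([]float64, count)
	for i := range buckets {
		buckets[i] = lo * math.Pow(factor, float64(i))
	}
	buckets[count-1] = hi // pin the last bound against floating-point drift
	return buckets
}

// InitBuckets here is an assumed layout: 8 bounds from 10s to 1h.
var InitBuckets = exponentialBucketsRange(10, 3600, 8)
```

Geometric spacing keeps resolution roughly proportional across the range, so a 20-second init and a 20-minute init both fall into reasonably narrow buckets without needing dozens of boundaries.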