197 changes: 197 additions & 0 deletions etcd/README.md
@@ -0,0 +1,197 @@
# etcd.openshift.io API Group

This API group contains CRDs related to etcd cluster management in Two Node OpenShift with Fencing deployments.

## API Versions

### v1alpha1

Contains the `PacemakerCluster` custom resource for monitoring Pacemaker cluster health in Two Node OpenShift with Fencing deployments.

#### PacemakerCluster

- **Feature Gate**: `DualReplica`
- **Component**: `two-node-fencing`
- **Scope**: Cluster-scoped singleton resource (must be named "cluster")
- **Resource Path**: `pacemakerclusters.etcd.openshift.io`

The `PacemakerCluster` resource provides visibility into the health and status of a Pacemaker-managed cluster.
It is periodically updated by the cluster-etcd-operator's status collector.

### Status Subresource Design

This resource uses the standard Kubernetes status subresource pattern (`+kubebuilder:subresource:status`).
The status collector creates the resource without status, then immediately populates it via the `/status` endpoint.

**Why not atomic create-with-status?**

We initially explored removing the status subresource to allow creating the resource with status in a single
atomic operation. This would ensure the resource is never observed in an incomplete state. However:

1. The Kubernetes API server strips the `status` field from create requests when a status subresource is enabled
2. Without the subresource, we cannot use separate RBAC for spec vs status updates
3. The OpenShift API test framework assumes status subresource exists for status update tests

The status collector performs a two-step operation: create resource, then immediately update status.
The brief window where status is empty is acceptable since the healthcheck controller handles missing status gracefully.
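As a rough illustration of this flow (not the actual collector code; the function and variable names below are made up, and the dynamic client is used only to keep the example independent of any generated clientset):

```go
// Sketch only: createAndPopulateStatus and collectedStatus are illustrative names.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var pacemakerClusterGVR = schema.GroupVersionResource{
	Group:    "etcd.openshift.io",
	Version:  "v1alpha1",
	Resource: "pacemakerclusters",
}

func createAndPopulateStatus(ctx context.Context, client dynamic.Interface, collectedStatus map[string]interface{}) error {
	obj := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "etcd.openshift.io/v1alpha1",
		"kind":       "PacemakerCluster",
		"metadata":   map[string]interface{}{"name": "cluster"}, // cluster-scoped singleton
	}}

	// Step 1: create the resource. Any status included here would be stripped
	// by the API server because the status subresource is enabled.
	created, err := client.Resource(pacemakerClusterGVR).Create(ctx, obj, metav1.CreateOptions{})
	if err != nil {
		return err
	}

	// Step 2: immediately populate status via the /status endpoint.
	created.Object["status"] = collectedStatus
	_, err = client.Resource(pacemakerClusterGVR).UpdateStatus(ctx, created, metav1.UpdateOptions{})
	return err
}
```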

### Pacemaker Resources

A **pacemaker resource** is a unit of work managed by pacemaker. In pacemaker terminology, resources are services
or applications that pacemaker monitors, starts, stops, and moves between nodes to maintain high availability.

For Two Node OpenShift with Fencing, we manage three resources:
- **Kubelet**: The Kubernetes node agent and a prerequisite for etcd
- **Etcd**: The distributed key-value store
- **FencingAgent**: Used to isolate failed nodes during a quorum loss event
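Only Kubelet and Etcd appear as entries in a node's `resources` array; fencing agents are reported separately (see the Fencing Agents section below). A hypothetical Go sketch of that split (the actual type and constant names in `etcd/v1alpha1` may differ):

```go
package example

// PacemakerResourceName is a hypothetical name for the resource-name enum;
// the generated API type may use a different identifier.
type PacemakerResourceName string

const (
	// Reported per node in the resources array.
	ResourceNameKubelet PacemakerResourceName = "Kubelet"
	ResourceNameEtcd    PacemakerResourceName = "Etcd"
	// The fencing agent is also managed by pacemaker, but it is reported in
	// the separate fencingAgents array rather than in resources.
)
```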

### Status Structure

```yaml
status: # Optional on creation, populated via status subresource
  conditions: # Cluster-level conditions (optional, but min 3 items when present)
  - type: Healthy
  - type: InService
  - type: NodeCountAsExpected
  lastUpdated: <timestamp> # When status was last updated (optional, cannot decrease once set)
  nodes: # Per-node status (optional, 0-32 nodes, expects 2)
  - name: <hostname> # RFC 1123 subdomain name
    addresses: # List of node addresses using corev1.NodeAddress
    - type: InternalIP # Address type (InternalIP, ExternalIP, Hostname, etc.)
      address: <ip> # First InternalIP address used for etcd peer URLs
    conditions: # Node-level conditions (optional, but min 9 items when present)
    - type: Healthy
    - type: Online
    - type: InService
    - type: Active
    - type: Ready
    - type: Clean
    - type: Member
    - type: FencingAvailable
    - type: FencingHealthy
    resources: # Array of pacemaker resources scheduled on this node (optional, min 2)
    - name: Kubelet # Both resources (Kubelet, Etcd) must be present
      conditions: # Resource-level conditions (optional, but min 8 items when present)
      - type: Healthy
      - type: InService
      - type: Managed
      - type: Enabled
      - type: Operational
      - type: Active
      - type: Started
      - type: Schedulable
    - name: Etcd
      conditions: []
    fencingAgents: # Fencing agents that can fence THIS node (optional, 1-8 per node)
    - name: <nodename>_<method> # e.g., "master-0_redfish"
      method: <method> # Fencing method: redfish, ipmi, fence_aws, etc.
      conditions: [] # Same 8 conditions as resources
```

### Fencing Agents

Fencing agents are STONITH (Shoot The Other Node In The Head) devices used to isolate failed nodes.
Unlike regular pacemaker resources (Kubelet, Etcd), fencing agents are tracked separately because:

1. **Mapping by target, not schedule**: Resources are mapped to the node where they are scheduled to run.
Fencing agents are mapped to the node they can *fence* (their target), regardless of which node
their monitoring operations are scheduled on.

2. **Multiple agents per node**: A node can have multiple fencing agents for redundancy
(e.g., both Redfish and IPMI). Expected: 1 per node, supported: up to 8.

3. **Health tracking via two node-level conditions**:
- **FencingAvailable**: True if at least one agent is healthy (fencing works), False if all agents unhealthy (degrades operator)
- **FencingHealthy**: True if all agents are healthy (ideal state), False if any agent is unhealthy (emits warning events)
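A minimal sketch of how these two statuses could be derived from per-agent health (illustration only; the status collector's real aggregation logic may differ):

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// fencingConditionStatuses derives the two node-level fencing condition
// statuses from the health of the node's fencing agents.
func fencingConditionStatuses(agentHealthy []bool) (available, healthy metav1.ConditionStatus) {
	anyHealthy := false
	allHealthy := len(agentHealthy) > 0 // having no agents at all is not "all healthy"
	for _, ok := range agentHealthy {
		if ok {
			anyHealthy = true
		} else {
			allHealthy = false
		}
	}

	// FencingAvailable: one healthy agent is enough for fencing to work.
	available = metav1.ConditionFalse
	if anyHealthy {
		available = metav1.ConditionTrue
	}

	// FencingHealthy: every agent must be healthy; otherwise warnings are emitted.
	healthy = metav1.ConditionFalse
	if allHealthy {
		healthy = metav1.ConditionTrue
	}
	return available, healthy
}
```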

### Cluster-Level Conditions

**Per API conventions, conditions are optional but when present must include all three types (enforced via MinItems=3 and XValidation rules).**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Cluster is healthy (`ClusterHealthy`) | Cluster has issues (`ClusterUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `NodeCountAsExpected` | Node count is as expected (`AsExpected`) | Wrong count (`InsufficientNodes`, `ExcessiveNodes`) |

### Node-Level Conditions

**Per API conventions, conditions are optional but when present must include all nine types (enforced via MinItems=9 and XValidation rules).**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Node is healthy (`NodeHealthy`) | Node has issues (`NodeUnhealthy`) |
| `Online` | Node is online (`Online`) | Node is offline (`Offline`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Active` | Node is active (`Active`) | Node is in standby (`Standby`) |
| `Ready` | Node is ready (`Ready`) | Node is pending (`Pending`) |
| `Clean` | Node is clean (`Clean`) | Node is unclean (`Unclean`) |
| `Member` | Node is a member (`Member`) | Not a member (`NotMember`) |
| `FencingAvailable` | At least one agent healthy (`FencingAvailable`) | All agents unhealthy (`FencingUnavailable`) - degrades operator |
| `FencingHealthy` | All agents healthy (`FencingHealthy`) | Some agents unhealthy (`FencingUnhealthy`) - emits warnings |

### Resource-Level Conditions

Each resource in the `resources` array and each fencing agent in the `fencingAgents` array has its own conditions. **Per API conventions, conditions are optional but when present must include all eight types (enforced via MinItems=8 and XValidation rules).**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Resource is healthy (`ResourceHealthy`) | Resource has issues (`ResourceUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Managed` | Managed by pacemaker (`Managed`) | Not managed (`Unmanaged`) |
| `Enabled` | Resource is enabled (`Enabled`) | Resource is disabled (`Disabled`) |
| `Operational` | Resource is operational (`Operational`) | Resource has failed (`Failed`) |
| `Active` | Resource is active (`Active`) | Resource is not active (`Inactive`) |
| `Started` | Resource is started (`Started`) | Resource is stopped (`Stopped`) |
| `Schedulable` | Resource is schedulable (`Schedulable`) | Resource is not schedulable (`Unschedulable`) |

### Validation Rules

**Resource naming:**
- Resource name must be "cluster" (singleton)

**Node name validation:**
- Must be a lowercase RFC 1123 subdomain name
- Consists of lowercase alphanumeric characters, '-' or '.'
- Must start and end with an alphanumeric character
- Maximum 253 characters
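These are the standard DNS-1123 subdomain rules, so a consumer could reuse the apimachinery helper for a client-side check (sketch only):

```go
package example

import "k8s.io/apimachinery/pkg/util/validation"

// isValidNodeName reports whether name satisfies the DNS-1123 subdomain rules
// described above; IsDNS1123Subdomain returns a list of violations.
func isValidNodeName(name string) bool {
	return len(validation.IsDNS1123Subdomain(name)) == 0
}
```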

**Node addresses:**
- Uses `corev1.NodeAddress` for consistency with the Kubernetes Node API
- Pacemaker allows multiple addresses for Corosync communication between nodes (1-8 addresses)
- The first InternalIP address in the list is used for IP-based peer URLs for etcd membership (see the sketch after this list)
- Each address must be a valid global unicast IPv4 or IPv6 address in canonical form
- Excludes loopback, link-local, and multicast addresses
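A sketch of that selection (illustration only; the peer port 2380 and the URL shape are assumptions, not defined by this API):

```go
package example

import (
	"net"

	corev1 "k8s.io/api/core/v1"
)

// peerURLForNode picks the first InternalIP from a node's address list and
// builds an etcd peer URL from it.
func peerURLForNode(addresses []corev1.NodeAddress) (string, bool) {
	for _, addr := range addresses {
		if addr.Type == corev1.NodeInternalIP {
			// net.JoinHostPort adds the brackets IPv6 literals need in URLs.
			return "https://" + net.JoinHostPort(addr.Address, "2380"), true
		}
	}
	return "", false // no InternalIP reported for this node
}
```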

**Timestamp validation:**
- `lastUpdated` is optional but once set cannot be removed
- Timestamps must always increase (prevents stale updates from overwriting newer data)

**Status fields:**
- `status` - Optional on creation (pointer type), populated via status subresource
- `lastUpdated` - Optional timestamp for staleness detection
- `nodes` - Optional array of node statuses

**Conditions validation (all levels):**
- Per Kubernetes API conventions, conditions fields are marked `+optional`
- However, MinItems and XValidation rules enforce that when conditions are present, they must include all required types
- Cluster-level: MinItems=3 (Healthy, InService, NodeCountAsExpected)
- Node-level: MinItems=9 (Healthy, Online, InService, Active, Ready, Clean, Member, FencingAvailable, FencingHealthy)
- Resource-level: MinItems=8 (Healthy, InService, Managed, Enabled, Operational, Active, Started, Schedulable)
- Fencing agent-level: MinItems=8 (same conditions as resources)
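One hypothetical shape for such a constraint at the cluster level, using kubebuilder markers and a CEL rule (this is not copied from the real `PacemakerCluster` types, and the actual rules and messages may differ):

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// HypotheticalClusterStatus shows the marker shape only; it is not the real
// status type from etcd/v1alpha1.
type HypotheticalClusterStatus struct {
	// conditions, when present, must include the Healthy, InService and
	// NodeCountAsExpected types.
	// +kubebuilder:validation:MinItems=3
	// +kubebuilder:validation:XValidation:rule="self.exists(c, c.type == 'Healthy') && self.exists(c, c.type == 'InService') && self.exists(c, c.type == 'NodeCountAsExpected')",message="conditions must include Healthy, InService and NodeCountAsExpected"
	// +listType=map
	// +listMapKey=type
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```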

**Resource names:**
- Valid values are: `Kubelet`, `Etcd`
- Both resources must be present in each node's `resources` array (MinItems=2)
- Fencing agents are tracked separately in the `fencingAgents` array

**Fencing agent fields:**
- `name`: The pacemaker resource name (e.g., "master-0_redfish"), max 253 characters
- `method`: The fencing method (e.g., "redfish", "ipmi", "fence_aws"), max 63 characters
- `conditions`: Same 8 conditions as resources (optional, but min 8 items when present)

### Usage

The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on
the cluster state. The aggregate `Healthy` conditions at each level (cluster, node, resource) provide a quick
way to determine overall health.
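For example, a consumer might gate on the cluster-level condition with the standard apimachinery helper (a sketch; the healthcheck controller's real logic is more involved):

```go
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterIsHealthy reads the aggregate cluster-level Healthy condition from a
// PacemakerCluster's status conditions.
func clusterIsHealthy(conditions []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(conditions, "Healthy")
}
```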
26 changes: 26 additions & 0 deletions etcd/install.go
@@ -0,0 +1,26 @@
package etcd

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	v1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

const (
	GroupName = "etcd.openshift.io"
)

var (
	schemeBuilder = runtime.NewSchemeBuilder(v1alpha1.Install)
	// Install is a function which adds every version of this group to a scheme
	Install = schemeBuilder.AddToScheme
)

func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func Kind(kind string) schema.GroupKind {
	return schema.GroupKind{Group: GroupName, Kind: kind}
}
3 changes: 3 additions & 0 deletions etcd/v1alpha1/Makefile
@@ -0,0 +1,3 @@
.PHONY: test
test:
	make -C ../../tests test GINKGO_EXTRA_ARGS=--focus="etcd.openshift.io/v1alpha1"
6 changes: 6 additions & 0 deletions etcd/v1alpha1/doc.go
@@ -0,0 +1,6 @@
// +k8s:deepcopy-gen=package,register
// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true
// +openshift:featuregated-schema-gen=true
// +groupName=etcd.openshift.io
package v1alpha1
39 changes: 39 additions & 0 deletions etcd/v1alpha1/register.go
@@ -0,0 +1,39 @@
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	GroupName     = "etcd.openshift.io"
	GroupVersion  = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
	schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	// Install is a function which adds this version to a scheme
	Install = schemeBuilder.AddToScheme

	// SchemeGroupVersion generated code relies on this name
	// Deprecated
	SchemeGroupVersion = GroupVersion
	// AddToScheme exists solely to keep the old generators creating valid code
	// DEPRECATED
	AddToScheme = schemeBuilder.AddToScheme
)

// Resource generated code relies on this being here, but it logically belongs to the group
// DEPRECATED
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func addKnownTypes(scheme *runtime.Scheme) error {
	metav1.AddToGroupVersion(scheme, GroupVersion)

	scheme.AddKnownTypes(GroupVersion,
		&PacemakerCluster{},
		&PacemakerClusterList{},
	)

	return nil
}