197 changes: 197 additions & 0 deletions etcd/README.md
@@ -0,0 +1,197 @@
# etcd.openshift.io API Group

This API group contains CRDs related to etcd cluster management in Two Node OpenShift with Fencing deployments.

## API Versions

### v1alpha1

Contains the `PacemakerCluster` custom resource for monitoring Pacemaker cluster health in Two Node OpenShift with Fencing deployments.

#### PacemakerCluster

- **Feature Gate**: `DualReplica`
- **Component**: `two-node-fencing`
- **Scope**: Cluster-scoped singleton resource (must be named "cluster")
- **Resource Path**: `pacemakerclusters.etcd.openshift.io`

The `PacemakerCluster` resource provides visibility into the health and status of a Pacemaker-managed cluster.
It is periodically updated by the cluster-etcd-operator's status collector.

### Status Subresource Design

This resource uses the standard Kubernetes status subresource pattern (`+kubebuilder:subresource:status`).
The status collector creates the resource without status, then immediately populates it via the `/status` endpoint.

**Why not atomic create-with-status?**

We initially explored removing the status subresource to allow creating the resource with status in a single
atomic operation. This would ensure the resource is never observed in an incomplete state. However:

1. The Kubernetes API server strips the `status` field from create requests when a status subresource is enabled
2. Without the subresource, we cannot use separate RBAC for spec vs status updates
3. The OpenShift API test framework assumes status subresource exists for status update tests

The status collector performs a two-step operation: create resource, then immediately update status.
The brief window where status is empty is acceptable since the healthcheck controller handles missing status gracefully.
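As a rough illustration of this flow (not the actual collector code; the function and variable names below are made up, and the dynamic client is used only to keep the example independent of any generated clientset):

```go
// Sketch only: createAndPopulateStatus and collectedStatus are illustrative names.
package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

var pacemakerClusterGVR = schema.GroupVersionResource{
	Group:    "etcd.openshift.io",
	Version:  "v1alpha1",
	Resource: "pacemakerclusters",
}

func createAndPopulateStatus(ctx context.Context, client dynamic.Interface, collectedStatus map[string]interface{}) error {
	obj := &unstructured.Unstructured{Object: map[string]interface{}{
		"apiVersion": "etcd.openshift.io/v1alpha1",
		"kind":       "PacemakerCluster",
		"metadata":   map[string]interface{}{"name": "cluster"}, // cluster-scoped singleton
	}}

	// Step 1: create the resource. Any status included here would be stripped
	// by the API server because the status subresource is enabled.
	created, err := client.Resource(pacemakerClusterGVR).Create(ctx, obj, metav1.CreateOptions{})
	if err != nil {
		return err
	}

	// Step 2: immediately populate status via the /status endpoint.
	created.Object["status"] = collectedStatus
	_, err = client.Resource(pacemakerClusterGVR).UpdateStatus(ctx, created, metav1.UpdateOptions{})
	return err
}
```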

### Pacemaker Resources

A **pacemaker resource** is a unit of work managed by pacemaker. In pacemaker terminology, resources are services
or applications that pacemaker monitors, starts, stops, and moves between nodes to maintain high availability.

For Two Node OpenShift with Fencing, we manage three resources:
- **Kubelet**: The Kubernetes node agent and a prerequisite for etcd
- **Etcd**: The distributed key-value store
- **FencingAgent**: Used to isolate failed nodes during a quorum loss event
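Only Kubelet and Etcd appear as entries in a node's `resources` array; fencing agents are reported separately (see the Fencing Agents section below). A hypothetical Go sketch of that split (the actual type and constant names in `etcd/v1alpha1` may differ):

```go
package example

// PacemakerResourceName is a hypothetical name for the resource-name enum;
// the generated API type may use a different identifier.
type PacemakerResourceName string

const (
	// Reported per node in the resources array.
	ResourceNameKubelet PacemakerResourceName = "Kubelet"
	ResourceNameEtcd    PacemakerResourceName = "Etcd"
	// The fencing agent is also managed by pacemaker, but it is reported in
	// the separate fencingAgents array rather than in resources.
)
```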

### Status Structure

```yaml
status: # Optional on creation, populated via status subresource
  conditions: # Cluster-level conditions (optional, but min 3 items when present)
  - type: Healthy
  - type: InService
  - type: NodeCountAsExpected
  lastUpdated: <timestamp> # When status was last updated (optional, cannot decrease once set)
  nodes: # Per-node status (optional, 0-32 nodes, expects 2)
  - name: <hostname> # RFC 1123 subdomain name
    addresses: # List of node addresses using corev1.NodeAddress
    - type: InternalIP # Address type (InternalIP, ExternalIP, Hostname, etc.)
      address: <ip> # First InternalIP address used for etcd peer URLs
    conditions: # Node-level conditions (optional, but min 9 items when present)
    - type: Healthy
    - type: Online
    - type: InService
    - type: Active
    - type: Ready
    - type: Clean
    - type: Member
    - type: FencingAvailable
    - type: FencingHealthy
    resources: # Array of pacemaker resources scheduled on this node (optional, min 2)
    - name: Kubelet # Both resources (Kubelet, Etcd) must be present
      conditions: # Resource-level conditions (optional, but min 8 items when present)
      - type: Healthy
      - type: InService
      - type: Managed
      - type: Enabled
      - type: Operational
      - type: Active
      - type: Started
      - type: Schedulable
    - name: Etcd
      conditions: []
    fencingAgents: # Fencing agents that can fence THIS node (optional, 1-8 per node)
    - name: <nodename>_<method> # e.g., "master-0_redfish"
      method: <method> # Fencing method: redfish, ipmi, fence_aws, etc.
      conditions: [] # Same 8 conditions as resources
```

### Fencing Agents

Fencing agents are STONITH (Shoot The Other Node In The Head) devices used to isolate failed nodes.
Unlike regular pacemaker resources (Kubelet, Etcd), fencing agents are tracked separately because:

1. **Mapping by target, not schedule**: Resources are mapped to the node where they are scheduled to run.
Fencing agents are mapped to the node they can *fence* (their target), regardless of which node
their monitoring operations are scheduled on.

2. **Multiple agents per node**: A node can have multiple fencing agents for redundancy
(e.g., both Redfish and IPMI). Expected: 1 per node, supported: up to 8.

3. **Health tracking via two node-level conditions**:
- **FencingAvailable**: True if at least one agent is healthy (fencing works), False if all agents unhealthy (degrades operator)
- **FencingHealthy**: True if all agents are healthy (ideal state), False if any agent is unhealthy (emits warning events)
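A minimal sketch of how these two statuses could be derived from per-agent health (illustration only; the status collector's real aggregation logic may differ):

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// fencingConditionStatuses derives the two node-level fencing condition
// statuses from the health of the node's fencing agents.
func fencingConditionStatuses(agentHealthy []bool) (available, healthy metav1.ConditionStatus) {
	anyHealthy := false
	allHealthy := len(agentHealthy) > 0 // having no agents at all is not "all healthy"
	for _, ok := range agentHealthy {
		if ok {
			anyHealthy = true
		} else {
			allHealthy = false
		}
	}

	// FencingAvailable: one healthy agent is enough for fencing to work.
	available = metav1.ConditionFalse
	if anyHealthy {
		available = metav1.ConditionTrue
	}

	// FencingHealthy: every agent must be healthy; otherwise warnings are emitted.
	healthy = metav1.ConditionFalse
	if allHealthy {
		healthy = metav1.ConditionTrue
	}
	return available, healthy
}
```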

### Cluster-Level Conditions

**Per API conventions, conditions are optional but when present must include all three types (enforced via MinItems=3 and XValidation rules).**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Cluster is healthy (`ClusterHealthy`) | Cluster has issues (`ClusterUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `NodeCountAsExpected` | Node count is as expected (`AsExpected`) | Wrong count (`InsufficientNodes`, `ExcessiveNodes`) |

### Node-Level Conditions

**Per API conventions, conditions are optional but when present must include all nine types (enforced via MinItems=9 and XValidation rules).**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Node is healthy (`NodeHealthy`) | Node has issues (`NodeUnhealthy`) |
| `Online` | Node is online (`Online`) | Node is offline (`Offline`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Active` | Node is active (`Active`) | Node is in standby (`Standby`) |
| `Ready` | Node is ready (`Ready`) | Node is pending (`Pending`) |
| `Clean` | Node is clean (`Clean`) | Node is unclean (`Unclean`) |
| `Member` | Node is a member (`Member`) | Not a member (`NotMember`) |
| `FencingAvailable` | At least one agent healthy (`FencingAvailable`) | All agents unhealthy (`FencingUnavailable`) - degrades operator |
| `FencingHealthy` | All agents healthy (`FencingHealthy`) | Some agents unhealthy (`FencingUnhealthy`) - emits warnings |

### Resource-Level Conditions

Each resource in the `resources` array and each fencing agent in the `fencingAgents` array has its own conditions. **Per API conventions, conditions are optional but when present must include all eight types (enforced via MinItems=8 and XValidation rules).**

| Condition | True | False |
|-----------|------|-------|
| `Healthy` | Resource is healthy (`ResourceHealthy`) | Resource has issues (`ResourceUnhealthy`) |
| `InService` | In service (`InService`) | In maintenance (`InMaintenance`) |
| `Managed` | Managed by pacemaker (`Managed`) | Not managed (`Unmanaged`) |
| `Enabled` | Resource is enabled (`Enabled`) | Resource is disabled (`Disabled`) |
| `Operational` | Resource is operational (`Operational`) | Resource has failed (`Failed`) |
| `Active` | Resource is active (`Active`) | Resource is not active (`Inactive`) |
| `Started` | Resource is started (`Started`) | Resource is stopped (`Stopped`) |
| `Schedulable` | Resource is schedulable (`Schedulable`) | Resource is not schedulable (`Unschedulable`) |

### Validation Rules

**Resource naming:**
- Resource name must be "cluster" (singleton)

**Node name validation:**
- Must be a lowercase RFC 1123 subdomain name
- Consists of lowercase alphanumeric characters, '-' or '.'
- Must start and end with an alphanumeric character
- Maximum 253 characters
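These are the standard DNS-1123 subdomain rules, so a consumer could reuse the apimachinery helper for a client-side check (sketch only):

```go
package example

import "k8s.io/apimachinery/pkg/util/validation"

// isValidNodeName reports whether name satisfies the DNS-1123 subdomain rules
// described above; IsDNS1123Subdomain returns a list of violations.
func isValidNodeName(name string) bool {
	return len(validation.IsDNS1123Subdomain(name)) == 0
}
```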

**Node addresses:**
- Uses `corev1.NodeAddress` for consistency with the Kubernetes Node API
- Pacemaker allows multiple addresses for Corosync communication between nodes (1-8 addresses)
- The first InternalIP address in the list is used for IP-based peer URLs for etcd membership (see the sketch after this list)
- Each address must be a valid global unicast IPv4 or IPv6 address in canonical form
- Excludes loopback, link-local, and multicast addresses
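A sketch of that selection (illustration only; the peer port 2380 and the URL shape are assumptions, not defined by this API):

```go
package example

import (
	"net"

	corev1 "k8s.io/api/core/v1"
)

// peerURLForNode picks the first InternalIP from a node's address list and
// builds an etcd peer URL from it.
func peerURLForNode(addresses []corev1.NodeAddress) (string, bool) {
	for _, addr := range addresses {
		if addr.Type == corev1.NodeInternalIP {
			// net.JoinHostPort adds the brackets IPv6 literals need in URLs.
			return "https://" + net.JoinHostPort(addr.Address, "2380"), true
		}
	}
	return "", false // no InternalIP reported for this node
}
```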

**Timestamp validation:**
- `lastUpdated` is optional but once set cannot be removed
- Timestamps must always increase (prevents stale updates from overwriting newer data)

**Status fields:**
- `status` - Optional on creation (pointer type), populated via status subresource
- `lastUpdated` - Optional timestamp for staleness detection
- `nodes` - Optional array of node statuses

**Conditions validation (all levels):**
- Per Kubernetes API conventions, conditions fields are marked `+optional`
- However, MinItems and XValidation rules enforce that when conditions are present, they must include all required types
- Cluster-level: MinItems=3 (Healthy, InService, NodeCountAsExpected)
- Node-level: MinItems=9 (Healthy, Online, InService, Active, Ready, Clean, Member, FencingAvailable, FencingHealthy)
- Resource-level: MinItems=8 (Healthy, InService, Managed, Enabled, Operational, Active, Started, Schedulable)
- Fencing agent-level: MinItems=8 (same conditions as resources)
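One hypothetical shape for such a constraint at the cluster level, using kubebuilder markers and a CEL rule (this is not copied from the real `PacemakerCluster` types, and the actual rules and messages may differ):

```go
package example

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// HypotheticalClusterStatus shows the marker shape only; it is not the real
// status type from etcd/v1alpha1.
type HypotheticalClusterStatus struct {
	// conditions, when present, must include the Healthy, InService and
	// NodeCountAsExpected types.
	// +kubebuilder:validation:MinItems=3
	// +kubebuilder:validation:XValidation:rule="self.exists(c, c.type == 'Healthy') && self.exists(c, c.type == 'InService') && self.exists(c, c.type == 'NodeCountAsExpected')",message="conditions must include Healthy, InService and NodeCountAsExpected"
	// +listType=map
	// +listMapKey=type
	// +optional
	Conditions []metav1.Condition `json:"conditions,omitempty"`
}
```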

**Resource names:**
- Valid values are: `Kubelet`, `Etcd`
- Both resources must be present in each node's `resources` array (MinItems=2)
- Fencing agents are tracked separately in the `fencingAgents` array

**Fencing agent fields:**
- `name`: The pacemaker resource name (e.g., "master-0_redfish"), max 253 characters
- `method`: The fencing method (e.g., "redfish", "ipmi", "fence_aws"), max 63 characters
- `conditions`: Same 8 conditions as resources (optional, but min 8 items when present)

### Usage

The cluster-etcd-operator healthcheck controller watches this resource and updates operator conditions based on
the cluster state. The aggregate `Healthy` conditions at each level (cluster, node, resource) provide a quick
way to determine overall health.
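For example, a consumer might gate on the cluster-level condition with the standard apimachinery helper (a sketch; the healthcheck controller's real logic is more involved):

```go
package example

import (
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// clusterIsHealthy reads the aggregate cluster-level Healthy condition from a
// PacemakerCluster's status conditions.
func clusterIsHealthy(conditions []metav1.Condition) bool {
	return meta.IsStatusConditionTrue(conditions, "Healthy")
}
```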
26 changes: 26 additions & 0 deletions etcd/install.go
@@ -0,0 +1,26 @@
package etcd

import (
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"

	v1alpha1 "github.com/openshift/api/etcd/v1alpha1"
)

const (
	GroupName = "etcd.openshift.io"
)

var (
	schemeBuilder = runtime.NewSchemeBuilder(v1alpha1.Install)
	// Install is a function which adds every version of this group to a scheme
	Install = schemeBuilder.AddToScheme
)

func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func Kind(kind string) schema.GroupKind {
	return schema.GroupKind{Group: GroupName, Kind: kind}
}
3 changes: 3 additions & 0 deletions etcd/v1alpha1/Makefile
@@ -0,0 +1,3 @@
.PHONY: test
test:
	make -C ../../tests test GINKGO_EXTRA_ARGS=--focus="etcd.openshift.io/v1alpha1"
6 changes: 6 additions & 0 deletions etcd/v1alpha1/doc.go
@@ -0,0 +1,6 @@
// +k8s:deepcopy-gen=package,register
// +k8s:defaulter-gen=TypeMeta
// +k8s:openapi-gen=true
// +openshift:featuregated-schema-gen=true
// +groupName=etcd.openshift.io
package v1alpha1
39 changes: 39 additions & 0 deletions etcd/v1alpha1/register.go
@@ -0,0 +1,39 @@
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/runtime/schema"
)

var (
	GroupName     = "etcd.openshift.io"
	GroupVersion  = schema.GroupVersion{Group: GroupName, Version: "v1alpha1"}
	schemeBuilder = runtime.NewSchemeBuilder(addKnownTypes)
	// Install is a function which adds this version to a scheme
	Install = schemeBuilder.AddToScheme

	// SchemeGroupVersion generated code relies on this name
	// Deprecated
	SchemeGroupVersion = GroupVersion
	// AddToScheme exists solely to keep the old generators creating valid code
	// DEPRECATED
	AddToScheme = schemeBuilder.AddToScheme
)

// Resource generated code relies on this being here, but it logically belongs to the group
// DEPRECATED
func Resource(resource string) schema.GroupResource {
	return schema.GroupResource{Group: GroupName, Resource: resource}
}

func addKnownTypes(scheme *runtime.Scheme) error {
	metav1.AddToGroupVersion(scheme, GroupVersion)

	scheme.AddKnownTypes(GroupVersion,
		&PacemakerCluster{},
		&PacemakerClusterList{},
	)

	return nil
}