diff --git a/src/main/tla/ConsistentFailover/Admin.tla b/src/main/tla/ConsistentFailover/Admin.tla new file mode 100644 index 00000000000..4a8e3de4430 --- /dev/null +++ b/src/main/tla/ConsistentFailover/Admin.tla @@ -0,0 +1,182 @@ +-------------------------- MODULE Admin ---------------------------------------- +(* + * Operator-initiated actions for the Phoenix Consistent Failover + * protocol: failover initiation, abort, and OFFLINE lifecycle. + * + * These actions model the human operator (Admin actor) who drives + * failover and abort via the PhoenixHAAdminTool CLI, which delegates + * to HAGroupStoreManager coprocessor endpoints. + * + * AdminGoOffline and AdminForceRecover are gated on the + * UseOfflinePeerDetection feature flag and model + * the proactive design for peer OFFLINE detection. These actions + * use PhoenixHAAdminTool update with --force, bypassing the + * normal isTransitionAllowed() check. + * + * Implementation traceability: + * + * TLA+ action | Java source + * -------------------------+---------------------------------------------- + * AdminStartFailover(c) | HAGroupStoreManager + * | .initiateFailoverOnActiveCluster() L375-400 + * AdminAbortFailover(c) | HAGroupStoreManager + * | .setHAGroupStatusToAbortToStandby() L419-425 + * | Also clears failoverPending (models + * | abortFailoverListener L173-185) + * AdminGoOffline(c) | PhoenixHAAdminTool update --state OFFLINE + * | (gated on UseOfflinePeerDetection) + * AdminForceRecover(c) | PhoenixHAAdminTool update --force + * | --state STANDBY (OFFLINE -> S) + * | (gated on UseOfflinePeerDetection) + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +(* + * Admin initiates failover on the active cluster. + * + * Two paths depending on current state: + * + * AIS path: AIS -> ATS (in-sync, ready to hand off immediately) + * ANIS path: ANIS -> ANISTS (not-in-sync, forwarder must drain OUT + * before ANISTS can advance to ATS) + * + * AIS path guards: + * The OUT directory must be empty and all live RS must be in SYNC + * mode. DEAD RSes are allowed -- an RS can crash while the cluster + * is AIS without changing the HA group state. The implementation + * checks clusterState = AIS, not per-RS modes; a DEAD RS is not + * writing, so the remaining SYNC RSes and empty OUT dir ensure + * safety. + * + * ANIS path guards: + * The implementation only validates the current state (ANIS) and + * peer state. No outDirEmpty or writer-mode guards -- the + * forwarder will drain OUT after the transition. The ANISTS -> + * ATS transition (ANISTSToATS in HAGroupStore.tla) guards on + * outDirEmpty and the anti-flapping gate. + * + * Both paths: + * Peer must be in a stable standby state (S or DS) to prevent + * initiating a new failover during the non-atomic window of a + * previous failover (where the peer may still be in ATS). + * Without this guard, the admin could produce an irrecoverable + * (ATS, ATS) or (ANISTS, ATS) deadlock. + * + * Post: Cluster c transitions to ATS or ANISTS, both of which + * map to the ACTIVE_TO_STANDBY role, blocking mutations + * (isMutationBlocked()=true). + * + * Source: initiateFailoverOnActiveCluster() L375-400 checks current + * state and selects AIS -> ATS or ANIS -> ANISTS. + * Peer-state guard: getHAGroupStoreRecordFromPeer() + * (HAGroupStoreClient L421). 
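+ *
+ * Illustrative happy-path trace (AIS path, Cluster = {c1, c2};
+ * ZK guards and failoverPending bookkeeping elided):
+ *   (AIS, S)   --AdminStartFailover(c1)--> (ATS, S)
+ *   (ATS, S)   --PeerReactToATS(c2)------> (ATS, STA)
+ *   (ATS, STA) --TriggerFailover(c2)-----> (ATS, AIS)
+ *   (ATS, AIS) --PeerReactToAIS(c1)------> (S, AIS)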
+ *)
+AdminStartFailover(c) ==
+    /\ clusterState[Peer(c)] \in {"S", "DS"}
+    /\ \/ /\ clusterState[c] = "AIS"
+          /\ outDirEmpty[c]
+          /\ \A rs \in RS : writerMode[c][rs] \in {"SYNC", "DEAD"}
+          /\ clusterState' = [clusterState EXCEPT ![c] = "ATS"]
+       \/ /\ clusterState[c] = "ANIS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "ANISTS"]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+
+---------------------------------------------------------------------------
+
+(*
+ * Admin aborts an in-progress failover from the standby side.
+ *
+ * Pre:  Cluster c is in STA (STANDBY_TO_ACTIVE).
+ * Post: Cluster c transitions to AbTS (ABORT_TO_STANDBY).
+ *       The peer (in ATS) will react via PeerReactToAbTS,
+ *       transitioning to AbTAIS. Both then auto-complete back
+ *       to their pre-failover states.
+ *
+ * Abort must originate from the STA side to prevent dual-active
+ * races -- this is the AbortSafety property.
+ *
+ * Also clears failoverPending[c], modeling the abortFailoverListener
+ * (ReplicationLogDiscoveryReplay.java L173-185) which fires on LOCAL
+ * ABORT_TO_STANDBY, calling failoverPending.set(false).
+ *
+ * Source: setHAGroupStatusToAbortToStandby() L419-425.
+ *)
+AdminAbortFailover(c) ==
+    /\ clusterState[c] = "STA"
+    /\ clusterState' = [clusterState EXCEPT ![c] = "AbTS"]
+    /\ failoverPending' = [failoverPending EXCEPT ![c] = FALSE]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+
+---------------------------------------------------------------------------
+
+(*
+ * Admin takes a standby cluster offline.
+ *
+ * Gated on UseOfflinePeerDetection (Iteration 18, proactive modeling).
+ *
+ * Pre:  Cluster c is in S or DS (a standby state).
+ * Post: Cluster c transitions to OFFLINE.
+ *
+ * In the implementation, entering OFFLINE requires
+ * PhoenixHAAdminTool update --force --state OFFLINE, which
+ * bypasses isTransitionAllowed(). The operator decides when
+ * to take a cluster offline for maintenance or decommissioning.
+ *
+ * No ZK connectivity guard: the --force path writes directly
+ * to ZK, bypassing the isHealthy check used by
+ * setHAGroupStatusIfNeeded().
+ *
+ * Source: PhoenixHAAdminTool update --state OFFLINE (--force)
+ *)
+AdminGoOffline(c) ==
+    /\ UseOfflinePeerDetection = TRUE
+    /\ clusterState[c] \in {"S", "DS"}
+    /\ clusterState' = [clusterState EXCEPT ![c] = "OFFLINE"]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+
+---------------------------------------------------------------------------
+
+(*
+ * Admin force-recovers a cluster from OFFLINE.
+ *
+ * Gated on UseOfflinePeerDetection.
+ *
+ * Pre:  Cluster c is in OFFLINE.
+ * Post: Cluster c transitions to S (STANDBY).
+ *
+ * Recovery from OFFLINE requires PhoenixHAAdminTool update --force
+ * --state STANDBY, which bypasses isTransitionAllowed() (OFFLINE
+ * has no allowed outbound transitions in the implementation).
+ *
+ * The S-entry side effects mirror the pattern used by
+ * PeerReactToAIS (ATS->S) and AutoComplete (AbTS->S):
+ *   - writerMode reset to INIT for all RS (replication subsystem
+ *     restart on standby entry)
+ *   - outDirEmpty set to TRUE (OUT directory cleared)
+ *   - replayState set to SYNCED_RECOVERY (recoveryListener fold)
+ *
+ * No ZK connectivity guard: the --force path writes directly
+ * to ZK.
+ *
+ * Source: PhoenixHAAdminTool update --force --state STANDBY
+ *)
+AdminForceRecover(c) ==
+    /\ UseOfflinePeerDetection = TRUE
+    /\ clusterState[c] = "OFFLINE"
+    /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+    /\ writerMode' = [writerMode EXCEPT ![c] =
+                          [rs \in RS |-> "INIT"]]
+    /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE]
+    /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"]
+    /\ UNCHANGED <<hdfsAvailable, antiFlapTimer, lastRoundInSync,
+                   lastRoundProcessed, failoverPending, inProgressDirEmpty,
+                   zkPeerConnected, zkPeerSessionAlive, zkLocalConnected>>
+
+============================================================================
diff --git a/src/main/tla/ConsistentFailover/Clock.tla b/src/main/tla/ConsistentFailover/Clock.tla
new file mode 100644
index 00000000000..9891c775ef1
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/Clock.tla
@@ -0,0 +1,45 @@
+---------------------------- MODULE Clock ---------------------------------
+(*
+ * Countdown timer.
+ *
+ * Provides a single Tick action that advances all per-cluster
+ * anti-flapping countdown timers by one tick toward 0. This follows
+ * the explicit-time pattern from Lamport, "Real Time is Really
+ * Simple" (CHARME 2005, Section 2). Time is modeled as an ordinary
+ * variable, and lower-bound timing constraints (action cannot fire
+ * until enough time passes) are expressed as enabling conditions on
+ * the guarded action using a countdown timer that ticks to 0.
+ *
+ * The Tick action is guarded so it only fires when at least one
+ * timer is still counting down (AntiFlapGateClosed). This prevents
+ * useless stuttering ticks when all timers have already expired.
+ *
+ * Implementation traceability:
+ *
+ *   TLA+ action  | Java source
+ *   -------------+--------------------------------------------------
+ *   Tick         | Passage of wall-clock time; no direct Java
+ *                | counterpart. Models the interval between
+ *                | HAGroupStoreClient.validateTransitionAndGet-
+ *                | WaitTime() checks (L1027-1046).
+ *)
+EXTENDS SpecState, Types
+
+---------------------------------------------------------------------------
+
+(*
+ * Advance all countdown timers by one tick toward 0.
+ *
+ * Each cluster's anti-flapping timer is decremented via
+ * DecrementTimer (floor at 0). The action is enabled only when
+ * at least one cluster has a timer still counting down
+ * (AntiFlapGateClosed), preventing infinite stuttering at zero.
+ *)
+Tick ==
+    /\ \E c \in Cluster : AntiFlapGateClosed(antiFlapTimer[c])
+    /\ antiFlapTimer' = [c \in Cluster |-> DecrementTimer(antiFlapTimer[c])]
+    /\ UNCHANGED <<clusterState, writerMode, outDirEmpty, hdfsAvailable,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+
+============================================================================
diff --git a/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-ac.cfg b/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-ac.cfg
new file mode 100644
index 00000000000..ec96afcb6fc
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-ac.cfg
@@ -0,0 +1,54 @@
+\* TLC model configuration for ConsistentFailover.tla
+\*
+\* Per-property simulation liveness: AbortCompletion
+\* Uses FairnessAC (3 temporal clauses) to keep the Buchi
+\* automaton tractable. Samples random behaviors.
+\*
+\* Run:
+\*   java -XX:+UseParallelGC \
+\*     -Dtlc2.TLC.stopAfter=28800 \
+\*     -cp tla2tools.jar:CommunityModules-deps.jar \
+\*     tlc2.TLC ConsistentFailover.tla \
+\*     -config ConsistentFailover-sim-liveness-ac.cfg \
+\*     -simulate -depth 10000 -workers auto
+
+SPECIFICATION SpecAC
+
+\* Model values
+CONSTANTS
+    \* The finite set of cluster identifiers forming the HA pair.
+    Cluster = {c1, c2}
+    \* The finite set of region server identifiers per cluster.
+ RS = {rs1, rs2} + \* Anti-flapping wait threshold (logical time ticks). + WaitTimeForSync = 2 + \* Feature gate for proactive AWOP/ANISWOP modeling. + \* Set to TRUE to verify OFFLINE peer detection lifecycle. + UseOfflinePeerDetection = FALSE + +\* No SYMMETRY -- provides no benefit for random trace sampling. + +\* Liveness property +PROPERTY + AbortCompletion + +\* Invariants to check +INVARIANT + TypeOK + MutualExclusion + AbortSafety + AISImpliesInSync + WriterClusterConsistency + ZKSessionConsistency + +\* Action properties: every state change follows allowed transitions +ACTION_CONSTRAINT + TransitionValid + WriterTransitionValid + AIStoATSPrecondition + AntiFlapGate + ANISTStoATSPrecondition + ReplayTransitionValid + FailoverTriggerCorrectness + NoDataLoss + ReplayRewindCorrectness diff --git a/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-dr.cfg b/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-dr.cfg new file mode 100644 index 00000000000..ca282b64712 --- /dev/null +++ b/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-dr.cfg @@ -0,0 +1,54 @@ +\* TLC model configuration for ConsistentFailover.tla +\* +\* Per-property simulation liveness: DegradationRecovery +\* Uses FairnessDR (17 temporal clauses with 2 RS) to keep +\* the Buchi automaton tractable. Samples random behaviors. +\* +\* Run: +\* java -XX:+UseParallelGC \ +\* -Dtlc2.TLC.stopAfter=28800 \ +\* -cp tla2tools.jar:CommunityModules-deps.jar \ +\* tlc2.TLC ConsistentFailover.tla \ +\* -config ConsistentFailover-sim-liveness-dr.cfg \ +\* -simulate -depth 10000 -workers auto + +SPECIFICATION SpecDR + +\* Model values +CONSTANTS + \* The finite set of cluster identifiers forming the HA pair. + Cluster = {c1, c2} + \* The finite set of region server identifiers per cluster. + RS = {rs1, rs2} + \* Anti-flapping wait threshold (logical time ticks). + WaitTimeForSync = 2 + \* Feature gate for proactive AWOP/ANISWOP modeling. + \* Set to TRUE to verify OFFLINE peer detection lifecycle. + UseOfflinePeerDetection = FALSE + +\* No SYMMETRY -- provides no benefit for random trace sampling. + +\* Liveness property +PROPERTY + DegradationRecovery + +\* Invariants to check +INVARIANT + TypeOK + MutualExclusion + AbortSafety + AISImpliesInSync + WriterClusterConsistency + ZKSessionConsistency + +\* Action properties: every state change follows allowed transitions +ACTION_CONSTRAINT + TransitionValid + WriterTransitionValid + AIStoATSPrecondition + AntiFlapGate + ANISTStoATSPrecondition + ReplayTransitionValid + FailoverTriggerCorrectness + NoDataLoss + ReplayRewindCorrectness diff --git a/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-fc.cfg b/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-fc.cfg new file mode 100644 index 00000000000..bcd88d6366e --- /dev/null +++ b/src/main/tla/ConsistentFailover/ConsistentFailover-sim-liveness-fc.cfg @@ -0,0 +1,54 @@ +\* TLC model configuration for ConsistentFailover.tla +\* +\* Per-property simulation liveness: FailoverCompletion +\* Uses FairnessFC (8 temporal clauses) to keep the Buchi +\* automaton tractable. Samples random behaviors. +\* +\* Run: +\* java -XX:+UseParallelGC \ +\* -Dtlc2.TLC.stopAfter=28800 \ +\* -cp tla2tools.jar:CommunityModules-deps.jar \ +\* tlc2.TLC ConsistentFailover.tla \ +\* -config ConsistentFailover-sim-liveness-fc.cfg \ +\* -simulate -depth 10000 -workers auto + +SPECIFICATION SpecFC + +\* Model values +CONSTANTS + \* The finite set of cluster identifiers forming the HA pair. 
+ Cluster = {c1, c2} + \* The finite set of region server identifiers per cluster. + RS = {rs1, rs2} + \* Anti-flapping wait threshold (logical time ticks). + WaitTimeForSync = 2 + \* Feature gate for proactive AWOP/ANISWOP modeling. + \* Set to TRUE to verify OFFLINE peer detection lifecycle. + UseOfflinePeerDetection = FALSE + +\* No SYMMETRY -- provides no benefit for random trace sampling. + +\* Liveness property +PROPERTY + FailoverCompletion + +\* Invariants to check +INVARIANT + TypeOK + MutualExclusion + AbortSafety + AISImpliesInSync + WriterClusterConsistency + ZKSessionConsistency + +\* Action properties: every state change follows allowed transitions +ACTION_CONSTRAINT + TransitionValid + WriterTransitionValid + AIStoATSPrecondition + AntiFlapGate + ANISTStoATSPrecondition + ReplayTransitionValid + FailoverTriggerCorrectness + NoDataLoss + ReplayRewindCorrectness diff --git a/src/main/tla/ConsistentFailover/ConsistentFailover-sim.cfg b/src/main/tla/ConsistentFailover/ConsistentFailover-sim.cfg new file mode 100644 index 00000000000..f19a81c3712 --- /dev/null +++ b/src/main/tla/ConsistentFailover/ConsistentFailover-sim.cfg @@ -0,0 +1,58 @@ +\* TLC model configuration for ConsistentFailover.tla +\* +\* Simulation model: 2 clusters, 9 region servers. +\* Safety-only (no Fairness). Samples random behaviors at +\* production-scale RS count to stress per-RS writer interleaving. +\* Liveness simulation uses a separate cfg with smaller RS. +\* +\* Run: +\* java -XX:+UseParallelGC \ +\* -Dtlc2.TLC.stopAfter=28800 \ +\* -cp tla2tools.jar:CommunityModules-deps.jar \ +\* tlc2.TLC ConsistentFailover.tla -config ConsistentFailover-sim.cfg \ +\* -simulate -depth 10000 -workers auto + +SPECIFICATION SafetySpec + +\* Model values +CONSTANTS + \* The finite set of cluster identifiers forming the HA pair. + Cluster = {c1, c2} + \* The finite set of region server identifiers per cluster. + \* 9 RS exercises per-RS writer interleaving at production scale. + RS = {rs1, rs2, rs3, rs4, rs5, rs6, rs7, rs8, rs9} + \* Anti-flapping wait threshold (logical time ticks). + \* Larger than the exhaustive model to explore richer + \* interleavings during the anti-flapping wait window + \* (HDFS failures, ZK disruptions, RS crashes while gate closed). + WaitTimeForSync = 5 + \* Feature gate for proactive AWOP/ANISWOP modeling. + \* Set to TRUE to verify OFFLINE peer detection lifecycle. + UseOfflinePeerDetection = FALSE + +\* No SYMMETRY -- provides no benefit for random trace sampling. + +\* Invariants to check +INVARIANT + TypeOK + MutualExclusion + AbortSafety + AISImpliesInSync + WriterClusterConsistency + ZKSessionConsistency + +\* Action properties: every state change follows allowed transitions +ACTION_CONSTRAINT + TransitionValid + WriterTransitionValid + AIStoATSPrecondition + AntiFlapGate + ANISTStoATSPrecondition + ReplayTransitionValid + FailoverTriggerCorrectness + NoDataLoss + ReplayRewindCorrectness + +\* No state constraint -- simulation samples random traces, +\* counters grow organically along each trace without +\* state-space tractability concerns. diff --git a/src/main/tla/ConsistentFailover/ConsistentFailover.cfg b/src/main/tla/ConsistentFailover/ConsistentFailover.cfg new file mode 100644 index 00000000000..bfc61bac9e8 --- /dev/null +++ b/src/main/tla/ConsistentFailover/ConsistentFailover.cfg @@ -0,0 +1,51 @@ +\* TLC model configuration for ConsistentFailover.tla +\* +\* Primary (exhaustive) model: 2 clusters, 2 region servers. 
+\* +\* Run: +\* java -XX:+UseParallelGC \ +\* -cp tla2tools.jar:CommunityModules-deps.jar \ +\* tlc2.TLC ConsistentFailover.tla -config ConsistentFailover.cfg \ +\* -workers auto -cleanup + +SPECIFICATION SafetySpec + +\* Model values +CONSTANTS + \* The finite set of cluster identifiers forming the HA pair. + Cluster = {c1, c2} + \* The finite set of region server identifiers per cluster. + RS = {rs1, rs2} + \* Anti-flapping wait threshold (logical time ticks). + WaitTimeForSync = 2 + \* Feature gate for proactive AWOP/ANISWOP modeling. + \* Set to TRUE to verify OFFLINE peer detection lifecycle. + UseOfflinePeerDetection = FALSE + +\* Symmetry reduction: RS identifiers are interchangeable. +SYMMETRY Symmetry + +\* Invariants to check +INVARIANT + TypeOK + MutualExclusion + AbortSafety + AISImpliesInSync + WriterClusterConsistency + ZKSessionConsistency + +\* Action properties: every state change follows allowed transitions +ACTION_CONSTRAINT + TransitionValid + WriterTransitionValid + AIStoATSPrecondition + AntiFlapGate + ANISTStoATSPrecondition + ReplayTransitionValid + FailoverTriggerCorrectness + NoDataLoss + ReplayRewindCorrectness + +\* Bound replay counters for exhaustive search tractability. +CONSTRAINT + ReplayCounterBound diff --git a/src/main/tla/ConsistentFailover/ConsistentFailover.tla b/src/main/tla/ConsistentFailover/ConsistentFailover.tla new file mode 100644 index 00000000000..ebb3b1e8255 --- /dev/null +++ b/src/main/tla/ConsistentFailover/ConsistentFailover.tla @@ -0,0 +1,990 @@ +-------------------- MODULE ConsistentFailover -------------------------------- +(* + * TLA+ specification of the Phoenix Consistent Failover protocol. + * + * Root orchestrator module: EXTENDS SpecState (variables), defines Init, Next, + * Spec, invariants, and action constraints. Composes actor-driven + * actions from sub-modules via INSTANCE. + * + * Models the HA group state machine for two paired Phoenix/HBase + * clusters. Each cluster maintains an HA group state in ZooKeeper. + * State transitions are driven by admin actions, peer-reactive + * listeners, writer/reader state changes, HDFS availability + * incidents, and ZK coordination failures. + * + * ZK COORDINATION MODEL: ZK connection and session + * lifecycle are modeled explicitly. Peer-reactive transitions + * (PeerReact actions) are guarded on zkPeerConnected[c] and + * zkPeerSessionAlive[c]. Auto-completion, heartbeat, writer ZK + * writes, and failover trigger are guarded on zkLocalConnected[c]. + * Retry exhaustion of the FailoverManagementListener (2-retry + * limit) is modeled as ReactiveTransitionFail(c). 
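+ *
+ * For example, every peer-reactive action follows the illustrative
+ * guard shape
+ *   PeerReactToX(c) == /\ zkPeerConnected[c] /\ zkPeerSessionAlive[c]
+ *                      /\ ...peer-state guard and local update...
+ * (PeerReactToX is a placeholder name), so a dropped peer connection
+ * silently suppresses the reaction until ZKPeerReconnect restores
+ * watcher delivery.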
+ * + * Sub-modules: + * - Admin.tla: operator-initiated failover/abort + * - Clock.tla: anti-flapping countdown timer (Tick action) + * - HAGroupStore.tla: peer-reactive transitions, auto-completion, + * retry exhaustion + * - HDFS.tla: HDFS availability incident actions + * - Reader.tla: standby-side replication replay state machine + * - RS.tla: RS lifecycle (crash, abort on local HDFS failure, + * restart after abort) + * - Writer.tla: per-RS replication writer mode state machine + * - ZK.tla: ZK connection/session lifecycle environment actions + * + * Implementation traceability: + * + * Modeled concept | Java class / field + * ------------------------+--------------------------------------------- + * clusterState | HAGroupStoreRecord per-cluster ZK znode + * PeerReact* actions | FailoverManagementListener + * | (HAGroupStoreManager.java L633-706) + * | Delivered via peerPathChildrenCache + * | (ZK watcher -- conditional delivery) + * ReactiveTransitionFail | FailoverManagementListener 2-retry + * | exhaustion (L653-704); method returns + * | silently, transition permanently lost + * TriggerFailover | Reader.TriggerFailover -- guarded + * | STA->AIS via shouldTriggerFailover() + * | L500-533 + triggerFailover() L535-548 + * AutoComplete | createLocalStateTransitions() L140-150 + * | Delivered via local pathChildrenCache + * | (ZK watcher -- conditional delivery) + * ANISTSToATS | HAGroupStoreManager + * | .setHAGroupStatusToSync() L341-355 + * | ANISTS -> ATS (drain completion) + * AdminStartFailover | initiateFailoverOnActiveCluster() L375-400 + * | AIS -> ATS or ANIS -> ANISTS + * AdminAbortFailover | setHAGroupStatusToAbortToStandby() L419-425 + * AdminGoOffline | PhoenixHAAdminTool update --state OFFLINE + * | (gated on UseOfflinePeerDetection) + * AdminForceRecover | PhoenixHAAdminTool update --force + * | --state STANDBY (OFFLINE -> S) + * | (gated on UseOfflinePeerDetection) + * PeerReactToOFFLINE | intended peer OFFLINE detection; + * | gated on UseOfflinePeerDetection + * PeerRecoverFromOFFLINE | intended peer OFFLINE recovery; + * | gated on UseOfflinePeerDetection + * Init (AIS, S) | Default initial states per team confirmation + * | (see PHOENIX_HA_TLA_PLAN.md Appendix A.6) + * MutualExclusion | Architecture safety argument: at most one + * | cluster in ACTIVE role at any time + * AbortSafety | Abort originates from STA side; AbTAIS + * | only reachable via peer AbTS detection + * AllowedTransitions | HAGroupStoreRecord.java L99-123 + * writerMode | ReplicationLogGroup per-RS mode + * | (SYNC/STORE_AND_FWD/SYNC_AND_FWD) + * outDirEmpty | ReplicationLogDiscoveryForwarder + * | .processNoMoreRoundsLeft() L155-184 + * | Boolean: OUT dir empty/non-empty + * hdfsAvailable | Abstract: NameNode availability per cluster + * | (no explicit field in implementation; + * | detected via IOException) + * RSCrash | JVM crash, OOM, kill signal + * RSAbortOnLocalHDFS- | StoreAndForwardModeImpl.onFailure() + * Failure | L115-123 -> logGroup.abort() + * HDFSDown/HDFSUp | NameNode crash/recovery incidents; + * | SyncModeImpl.onFailure() L61-74 + * antiFlapTimer | Countdown timer (Lamport CHARME 2005); + * | models validateTransitionAndGetWait- + * | Time() L1027-1046 anti-flapping gate + * Tick | Passage of wall-clock time + * ANISHeartbeat | StoreAndForwardModeImpl + * | .startHAGroupStoreUpdateTask() L71-87 + * replayState | ReplicationLogDiscoveryReplay replay state + * | (NOT_INITIALIZED/SYNC/DEGRADED/ + * | SYNCED_RECOVERY) + * lastRoundInSync | ReplicationLogDiscoveryReplay L336-343 + * 
lastRoundProcessed | ReplicationLogDiscoveryReplay L336-351 + * failoverPending | ReplicationLogDiscoveryReplay L159-171 + * inProgressDirEmpty | ReplicationLogDiscoveryReplay L500-533 + * ReplayAdvance | replay() L336-343 (SYNC) and L345-351 + * | (DEGRADED) round processing + * ReplayRewind | replay() L323-333 (CAS to SYNC) + * [listener folds] | degradedListener L136-145 and + * | recoveryListener L147-157 are folded + * | into HAGroupStore S/DS-entry actions + * TriggerFailover | shouldTriggerFailover() L500-533 + + * | triggerFailover() L535-548 + * FailoverTriggerCorrectness | Action constraint: STA->AIS requires + * | failoverPending /\ inProgressDirEmpty + * | /\ replayState = SYNC + * NoDataLoss | Action constraint: zero RPO property + * | for failover (STA->AIS) + * zkPeerConnected | peerPathChildrenCache TCP connection + * | state (HAGroupStoreClient L110-112) + * zkPeerSessionAlive | Peer ZK session state (Curator internal) + * zkLocalConnected | pathChildrenCache TCP connection state; + * | maps to HAGroupStoreClient.isHealthy + * | (L878-911) + * ZKPeerDisconnect | peerPathChildrenCache CONNECTION_LOST + * ZKPeerReconnect | peerPathChildrenCache CONNECTION_RECONNECTED + * ZKPeerSessionExpiry | Curator session expiry -> CONNECTION_LOST + * ZKPeerSessionRecover | Curator retry -> new session + * ZKLocalDisconnect | pathChildrenCache CONNECTION_LOST + * ZKLocalReconnect | pathChildrenCache CONNECTION_RECONNECTED + * + * failoverPending lifecycle: + * Set TRUE: PeerReactToATS (HAGroupStore.tla) + * Set FALSE: TriggerFailover (Reader.tla) + * Set FALSE: AdminAbortFailover (Admin.tla) + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +\* The variable-group tuples (writerVars, clusterVars, replayVars, +\* envVars) and the full `vars` tuple used in temporal formulas +\* ([][Next]_vars, WF_vars, SF_vars) are defined in SpecState.tla +\* so they are shared by all sub-modules via EXTENDS. + +(* + * Replay-completeness guards for STA -> AIS (TriggerFailover / + * shouldTriggerFailover). Shared by FailoverTriggerCorrectness and + * NoDataLoss so the two action constraints cannot drift apart. + *) +STAtoAISTriggerReplayGuards(c) == + /\ failoverPending[c] + /\ inProgressDirEmpty[c] + /\ replayState[c] = "SYNC" + +--------------------------------------------------------------------------- + +(* Sub-module instances *) + +\* Peer-reactive transitions and auto-completion. +haGroupStore == INSTANCE HAGroupStore + +\* Operator-initiated failover and abort. +admin == INSTANCE Admin + +\* Per-RS replication writer mode state machine. +writer == INSTANCE Writer + +\* HDFS availability incident actions. +hdfs == INSTANCE HDFS + +\* RS lifecycle (crash, local HDFS abort, restart). +rs == INSTANCE RS + +\* Anti-flapping countdown timer. +clk == INSTANCE Clock + +\* Replication replay state machine (standby-side reader). +reader == INSTANCE Reader + +\* ZK connection/session lifecycle environment actions. +zk == INSTANCE ZK + +--------------------------------------------------------------------------- + +(* Initial state *) + +\* The system starts with one cluster active and in sync (AIS) +\* and the other in standby (S). The choice of which cluster is +\* active is deterministic: CHOOSE picks an arbitrary but fixed +\* element of Cluster as the initial active. 
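+\*
+\* (TLA+'s CHOOSE is deterministic: it denotes the same element of
+\* Cluster in every evaluation, so Init describes exactly one
+\* initial state rather than two symmetric ones.)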
+\* +\* The standby starts with replayState = SYNCED_RECOVERY, modeling +\* the recoveryListener having already fired during startup +\* (NOT_INITIALIZED -> SYNCED_RECOVERY is synchronous with S +\* entry on the local PathChildrenCache event thread). The active +\* starts NOT_INITIALIZED (reader is dormant until the cluster +\* first enters S after a failover). +Init == + \* Deterministically assign one cluster to AIS and the other to S. + \* CHOOSE x \in Cluster : TRUE picks an arbitrary fixed cluster. + LET active == CHOOSE x \in Cluster : TRUE + IN /\ clusterState = [c \in Cluster |-> + IF c = active THEN "AIS" ELSE "S"] + /\ writerMode = [c \in Cluster |-> [r \in RS |-> "INIT"]] + /\ outDirEmpty = [c \in Cluster |-> TRUE] + /\ hdfsAvailable = [c \in Cluster |-> TRUE] + /\ antiFlapTimer = [c \in Cluster |-> 0] + /\ replayState = [c \in Cluster |-> + IF c = active THEN "NOT_INITIALIZED" + ELSE "SYNCED_RECOVERY"] + /\ lastRoundInSync = [c \in Cluster |-> 0] + /\ lastRoundProcessed = [c \in Cluster |-> 0] + /\ failoverPending = [c \in Cluster |-> FALSE] + /\ inProgressDirEmpty = [c \in Cluster |-> TRUE] + /\ zkPeerConnected = [c \in Cluster |-> TRUE] + /\ zkPeerSessionAlive = [c \in Cluster |-> TRUE] + /\ zkLocalConnected = [c \in Cluster |-> TRUE] + +--------------------------------------------------------------------------- + +(* Next-state relation *) + +(* + * In each step, exactly one cluster performs one actor-driven action. + * Actions are factored by actor: + * - haGroupStore: peer-reactive transitions and auto-completion + * (FailoverManagementListener + local resolvers) + * ALL of these depend on ZK watcher notification chains for + * delivery. See HAGroupStore.tla module header for details. + * - admin: operator-initiated failover and abort + * These are direct ZK writes (not watcher-dependent). + * - hdfs: HDFS NameNode crash/recovery incidents + * HDFSDown sets the availability flag for any cluster's HDFS; + * per-RS degradation is handled by writer actions with CAS + * success/failure. Local HDFS failure with S&F writers + * triggers RS abort (RS.tla). + * - writer: per-RS writer mode transitions (startup, recovery, + * drain complete, HDFS failure degradation, CAS failure) + * - rs: RS lifecycle (crash, local HDFS abort, restart) + * + * Each action encodes the precise guard (peer state or local state) + * under which the transition fires, modeling the implementation's + * actual trigger conditions. + *) +Next == + \* [Timer] Anti-flapping countdown timer tick (global). + \/ clk!Tick + \/ \E c \in Cluster : + \* [ZK watcher] Peer-reactive: standby detects peer ATS. + \/ haGroupStore!PeerReactToATS(c) + \* [ZK watcher] Peer-reactive: cluster detects peer ANIS. + \/ haGroupStore!PeerReactToANIS(c) + \* [ZK watcher] Peer-reactive: active detects peer AbTS. + \/ haGroupStore!PeerReactToAbTS(c) + \* [ZK watcher] Local auto-completion: AbTS->S, etc. + \/ haGroupStore!AutoComplete(c) + \* [Reader-driven] Standby completes failover: STA->AIS (guarded). + \/ reader!TriggerFailover(c) + \* [ZK watcher] Peer-reactive: cluster detects peer AIS. + \/ haGroupStore!PeerReactToAIS(c) + \* [S&F heartbeat] ANIS self-transition: resets anti-flap timer. + \/ haGroupStore!ANISHeartbeat(c) + \* [Writer-driven] All RS synced + OUT empty + gate open: ANIS->AIS. + \/ haGroupStore!ANISToAIS(c) + \* [Writer-driven] OUT drained + anti-flap gate open: ANISTS->ATS. + \/ haGroupStore!ANISTSToATS(c) + \* [Retry exhaustion] PeerReact retry failure: transition lost. 
+ \/ haGroupStore!ReactiveTransitionFail(c) + \* [ZK watcher] Peer-reactive: active detects peer OFFLINE. + \* (gated on UseOfflinePeerDetection) + \/ haGroupStore!PeerReactToOFFLINE(c) + \* [ZK watcher] Peer-reactive: active detects peer left OFFLINE. + \* (gated on UseOfflinePeerDetection) + \/ haGroupStore!PeerRecoverFromOFFLINE(c) + \* [Direct ZK write] Admin initiates failover: AIS->ATS or ANIS->ANISTS. + \/ admin!AdminStartFailover(c) + \* [Direct ZK write] Admin aborts failover: STA->AbTS. + \/ admin!AdminAbortFailover(c) + \* [Direct ZK write] Admin takes standby offline: S/DS->OFFLINE. + \* (gated on UseOfflinePeerDetection) + \/ admin!AdminGoOffline(c) + \* [Direct ZK write] Admin force-recovers from OFFLINE: OFFLINE->S. + \* (gated on UseOfflinePeerDetection) + \/ admin!AdminForceRecover(c) + \* HDFS NameNode crash/recovery incidents. + \/ hdfs!HDFSDown(c) + \/ hdfs!HDFSUp(c) + \* ZK connection/session lifecycle (environment actions). + \/ zk!ZKPeerDisconnect(c) + \/ zk!ZKPeerReconnect(c) + \/ zk!ZKPeerSessionExpiry(c) + \/ zk!ZKPeerSessionRecover(c) + \/ zk!ZKLocalDisconnect(c) + \/ zk!ZKLocalReconnect(c) + \* Standby-side replay state machine (reader). + \* Listener effects (degradedListener, recoveryListener) are + \* folded into the S-entry and DS-entry actions in HAGroupStore. + \/ reader!ReplayAdvance(c) + \/ reader!ReplayRewind(c) + \* In-progress directory dynamics (reader round processing). + \/ reader!ReplayBeginProcessing(c) + \/ reader!ReplayFinishProcessing(c) + \* Per-RS writer mode transitions and RS lifecycle. + \/ \E r \in RS : + \* Writer startup. + \/ writer!WriterInit(c, r) + \/ writer!WriterInitToStoreFwd(c, r) + \/ writer!WriterInitToStoreFwdFail(c, r) + \* Writer mode transitions (recovery, drain, forwarder). + \/ writer!WriterSyncToSyncFwd(c, r) + \/ writer!WriterStoreFwdToSyncFwd(c, r) + \/ writer!WriterSyncFwdToSync(c, r) + \* Per-RS HDFS failure degradation (CAS success). + \/ writer!WriterToStoreFwd(c, r) + \/ writer!WriterSyncFwdToStoreFwd(c, r) + \* Per-RS HDFS failure degradation (CAS failure -> DEAD). + \/ writer!WriterToStoreFwdFail(c, r) + \/ writer!WriterSyncFwdToStoreFwdFail(c, r) + \* RS lifecycle: crash, local HDFS abort, restart. + \/ rs!RSRestart(c, r) + \/ rs!RSCrash(c, r) + \/ rs!RSAbortOnLocalHDFSFailure(c, r) + +--------------------------------------------------------------------------- + +(* Specification *) + +\* Safety-only specification: initial state, followed by zero or +\* more Next steps (or stuttering). No fairness — used for fast +\* safety-only model checking (no temporal overhead). +SafetySpec == Init /\ [][Next]_vars + +--------------------------------------------------------------------------- + +(* Fairness *) + +(* + * Fairness assumptions for liveness checking. + * + * Classifies every action in Next into one of four fairness tiers. + * The guiding principle: any action whose guard depends on an + * environment variable that oscillates without fairness needs SF, + * because the adversary can cycle the env var once per lasso cycle + * to break WF's continuous-enablement requirement. + * + * Tier 3 actions are grouped into disjunctions under a single + * SF_vars to keep the temporal formula within TLC's DNF size + * limit. When at most one disjunct is ENABLED in any state, + * SF(A1\/...\/An) ≡ SF(A1)/\.../\SF(An): the only disjunct + * that can fire is the one that is enabled, so the scheduler + * cannot satisfy the disjunction by firing a different disjunct. 
+ * Mutual exclusivity is guaranteed by the single-valued nature + * of clusterState (per-cluster groups) and writerMode (per-RS + * groups). + * + * 1. WF on protocol-internal steps whose guards depend only on + * protocol state variables (no env var guards). Continuous + * enablement is guaranteed by protocol progress. Includes + * Tick, replay state machine, WriterInit, WriterSyncToSyncFwd. + * Exception: ANISHeartbeat keeps WF despite its zkLocal- + * Connected guard because suppressing the heartbeat HELPS + * liveness (the anti-flap gate opens sooner); SF would be + * counterproductive. + * + * 2. WF on ZK recovery actions (ZKPeerReconnect, ZKPeerSession- + * Recover, ZKLocalReconnect): encodes the ZK Liveness + * Assumption (ZLA, §4.2). ZK sessions are eventually alive + * and connected. These recovery actions are the basis for + * SF on all actions guarded by zkPeerConnected or + * zkLocalConnected. + * + * 3. SF on all actions guarded by environment variables that + * oscillate without fairness: zkPeerConnected, zkPeerSession- + * Alive, zkLocalConnected, hdfsAvailable. Grouped as: + * - Peer-reactive (exclusive by clusterState[Peer(c)]): + * PeerReactToATS/ANIS/AIS/AbTS + * - Local transitions (exclusive by clusterState[c]): + * AutoComplete, ANISToAIS, ANISTSToATS, TriggerFailover + * - HDFSUp (standalone) + * - Writer degradation (exclusive by writerMode): + * WriterToStoreFwd, WriterSyncFwdToStoreFwd, + * WriterInitToStoreFwd + * - Writer recovery (exclusive by writerMode): + * WriterStoreFwdToSyncFwd, WriterSyncFwdToSync + * - RS lifecycle (exclusive by writerMode): + * RSAbortOnLocalHDFSFailure, RSRestart + * + * 4. No fairness on non-deterministic environmental faults + * (HDFSDown, RSCrash, ZKPeerDisconnect, ZKPeerSessionExpiry, + * ZKLocalDisconnect, ReactiveTransitionFail), operator actions + * (AdminStartFailover, AdminAbortFailover, AdminGoOffline, + * AdminForceRecover), and CAS failures + * (WriterToStoreFwdFail, WriterSyncFwdToStoreFwdFail, + * WriterInitToStoreFwdFail). These are genuinely non- + * deterministic; imposing fairness would force unrealistic + * guarantees. + *) +Fairness == + \* --- Tier 1: WF on protocol-internal steps --- + \* Guards depend only on protocol state; continuous enablement + \* is guaranteed by protocol progress. + /\ WF_vars(clk!Tick) + /\ \A c \in Cluster : + \* [S&F heartbeat] Anti-flap timer reset. Keeps WF despite + \* zkLocalConnected guard: suppressing the heartbeat HELPS + \* liveness (gate opens sooner), so SF would be counterproductive. + /\ WF_vars(haGroupStore!ANISHeartbeat(c)) + \* [Reader] Replay state machine (no env var guards). + /\ WF_vars(reader!ReplayAdvance(c)) + /\ WF_vars(reader!ReplayRewind(c)) + /\ WF_vars(reader!ReplayBeginProcessing(c)) + /\ WF_vars(reader!ReplayFinishProcessing(c)) + \* --- Tier 2: WF on ZK recovery (encodes ZLA §4.2) --- + \* ZK sessions are eventually alive and connected. These + \* recovery actions are the basis for SF on all actions + \* guarded by zkPeerConnected/zkLocalConnected. + /\ WF_vars(zk!ZKPeerReconnect(c)) + /\ WF_vars(zk!ZKPeerSessionRecover(c)) + /\ WF_vars(zk!ZKLocalReconnect(c)) + \* --- Tier 3: SF on actions guarded by env vars --- + \* Grouped by mutual exclusivity to keep TLC's temporal + \* formula within its DNF size limit. When at most one + \* disjunct is ENABLED in any state, SF(A1\/...\/An) is + \* equivalent to SF(A1)/\.../\SF(An), because the only + \* disjunct that can fire is the one that is enabled. 
+ \* + \* Peer-reactive group (exclusive by clusterState[Peer(c)] + \* and clusterState[c]: ATS, ANIS, AbTS, AIS, OFFLINE are + \* mutually exclusive peer states; AWOP/ANISWOP are mutually + \* exclusive with S/DS/ATS local states of other PeerReact + \* actions). + /\ SF_vars(haGroupStore!PeerReactToATS(c) + \/ haGroupStore!PeerReactToANIS(c) + \/ haGroupStore!PeerReactToAbTS(c) + \/ haGroupStore!PeerReactToAIS(c) + \/ haGroupStore!PeerReactToOFFLINE(c) + \/ haGroupStore!PeerRecoverFromOFFLINE(c)) + \* Local cluster transition group (exclusive by + \* clusterState[c]: AbTS/AbTAIS/AbTANIS, ANIS, ANISTS, + \* STA are mutually exclusive). + /\ SF_vars(haGroupStore!AutoComplete(c) + \/ haGroupStore!ANISToAIS(c) + \/ haGroupStore!ANISTSToATS(c) + \/ reader!TriggerFailover(c)) + \* [Environmental] HDFS recovery. + /\ SF_vars(hdfs!HDFSUp(c)) + \* --- Per-RS actions --- + /\ \A r \in RS : + \* Writer startup (no env var guard). + /\ WF_vars(writer!WriterInit(c, r)) + \* Forwarder-driven SYNC->S&FWD (no env var guard). + /\ WF_vars(writer!WriterSyncToSyncFwd(c, r)) + \* Writer degradation group (exclusive by writerMode: + \* SYNC, SYNC_AND_FWD, INIT are mutually exclusive; + \* all guarded on zkLocalConnected, ~hdfsAvailable[Peer]). + /\ SF_vars(writer!WriterToStoreFwd(c, r) + \/ writer!WriterSyncFwdToStoreFwd(c, r) + \/ writer!WriterInitToStoreFwd(c, r)) + \* Writer recovery group (exclusive by writerMode: + \* STORE_AND_FWD, SYNC_AND_FWD are mutually exclusive; + \* guarded on hdfsAvailable[Peer]). + /\ SF_vars(writer!WriterStoreFwdToSyncFwd(c, r) + \/ writer!WriterSyncFwdToSync(c, r)) + \* RS lifecycle group (exclusive by writerMode: + \* STORE_AND_FWD, DEAD are mutually exclusive). + /\ SF_vars(rs!RSAbortOnLocalHDFSFailure(c, r) + \/ rs!RSRestart(c, r)) + +--------------------------------------------------------------------------- + +(* Liveness specifications *) + +\* Full specification: safety conjoined with the complete fairness +\* formula. Documents the full fairness design; too large for TLC's +\* Buchi automaton construction (43 temporal clauses). Used only in +\* THEOREM declarations. +Spec == Init /\ [][Next]_vars /\ Fairness + +\* Per-property specifications: each conjoins only the fairness +\* clauses on the critical path for one liveness property, keeping +\* the temporal formula small enough for TLC. + +\* AbortCompletion: AutoComplete (SF, zkLocalConnected guard), +\* ZKLocalReconnect (WF, re-enables zkLocalConnected), Tick (WF). +FairnessAC == + /\ WF_vars(clk!Tick) + /\ \A c \in Cluster : + /\ WF_vars(zk!ZKLocalReconnect(c)) + /\ SF_vars(haGroupStore!AutoComplete(c)) + +SpecAC == Init /\ [][Next]_vars /\ FairnessAC + +\* FailoverCompletion: AutoComplete + TriggerFailover (SF, grouped +\* by clusterState exclusivity), HDFSUp (SF), ZKLocalReconnect (WF), +\* replay machine including ReplayRewind (WF), Tick (WF). +FairnessFC == + /\ WF_vars(clk!Tick) + /\ \A c \in Cluster : + /\ WF_vars(zk!ZKLocalReconnect(c)) + /\ WF_vars(reader!ReplayAdvance(c)) + /\ WF_vars(reader!ReplayRewind(c)) + /\ WF_vars(reader!ReplayBeginProcessing(c)) + /\ WF_vars(reader!ReplayFinishProcessing(c)) + /\ SF_vars(haGroupStore!AutoComplete(c) + \/ reader!TriggerFailover(c)) + /\ SF_vars(hdfs!HDFSUp(c)) + +SpecFC == Init /\ [][Next]_vars /\ FairnessFC + +\* DegradationRecovery: ANISToAIS (SF), HDFSUp (SF), +\* ZKLocalReconnect (WF), Tick (WF), ANISHeartbeat (WF), +\* per-RS writer recovery chain (SF) and lifecycle (SF), +\* WriterInit and WriterSyncToSyncFwd (WF). 
+FairnessDR ==
+    /\ WF_vars(clk!Tick)
+    /\ \A c \in Cluster :
+        /\ WF_vars(zk!ZKLocalReconnect(c))
+        /\ WF_vars(haGroupStore!ANISHeartbeat(c))
+        /\ SF_vars(haGroupStore!ANISToAIS(c))
+        /\ SF_vars(hdfs!HDFSUp(c))
+        /\ \A r \in RS :
+            /\ WF_vars(writer!WriterInit(c, r))
+            /\ WF_vars(writer!WriterSyncToSyncFwd(c, r))
+            /\ SF_vars(writer!WriterStoreFwdToSyncFwd(c, r)
+                       \/ writer!WriterSyncFwdToSync(c, r))
+            /\ SF_vars(rs!RSAbortOnLocalHDFSFailure(c, r)
+                       \/ rs!RSRestart(c, r))
+
+SpecDR == Init /\ [][Next]_vars /\ FairnessDR
+
+---------------------------------------------------------------------------
+
+(* Liveness properties *)
+
+(*
+ * Failover completion: standby-side and abort transient states
+ * eventually resolve to a stable state. Resolution paths:
+ *   STA -> AIS (TriggerFailover) or STA -> AbTS -> S (abort)
+ *   AbTAIS -> AIS/ANIS, AbTANIS -> ANIS, AbTS -> S
+ *     (auto-completion)
+ *
+ * ATS and ANISTS are excluded: their resolution depends on the
+ * peer completing failover (PeerReactToAIS/PeerReactToANIS) or
+ * on abort propagation (PeerReactToAbTS). Both require the peer
+ * to reach a specific state AND the ZK peer connection to be alive
+ * at the right moment. With no fairness on admin actions (the
+ * admin can abort every failover attempt) and no fairness on ZK
+ * disconnect (the scheduler can disconnect exactly when the peer
+ * is in AbTS), ATS can remain indefinitely. ATS does have a
+ * resolution path via the reconciliation fold in ZKPeerReconnect/
+ * ZKPeerSessionRecover (ATS -> AbTAIS -> AIS when peer is in
+ * S/DS at reconnect), but adding ATS here would require
+ * extending FairnessFC with the peer-reactive SF group.
+ *
+ * Predicated on ZLA (encoded in Fairness).
+ *)
+FailoverCompletion ==
+    \A c \in Cluster :
+        clusterState[c] \in FailoverCompletionAntecedentStates
+            ~> clusterState[c] \in StableClusterStates
+
+---------------------------------------------------------------------------
+
+(*
+ * Degradation recovery: ANIS with available peer HDFS eventually
+ * progresses out of ANIS. The recovery chain is:
+ *   S&F -> S&FWD (WriterStoreFwdToSyncFwd)
+ *       -> SYNC (WriterSyncFwdToSync, sets outDirEmpty)
+ *       -> anti-flap timer expires (Tick)
+ *       -> ANIS -> AIS (ANISToAIS)
+ *
+ * The cluster may also leave ANIS via failover (ANIS -> ANISTS),
+ * which satisfies the consequent. Under SF on HDFSUp, HDFS
+ * cannot be permanently down.
+ *
+ * Predicated on ZLA (encoded in Fairness).
+ *)
+DegradationRecovery ==
+    \A c \in Cluster :
+        (clusterState[c] = "ANIS" /\ hdfsAvailable[Peer(c)])
+            ~> clusterState[c] \in NotANISClusterStates
+
+---------------------------------------------------------------------------
+
+(*
+ * Abort completion: every abort state eventually auto-completes
+ * to a stable state.
+ *   AbTS -> S (AutoComplete)
+ *   AbTAIS -> AIS or ANIS (AutoComplete, conditional on
+ *     writer/outDir state)
+ *   AbTANIS -> ANIS (AutoComplete)
+ *
+ * Under SF on AutoComplete, each abort state deterministically
+ * resolves. SF rather than WF because AutoComplete is guarded on
+ * zkLocalConnected, an oscillating environment variable (Fairness
+ * tier 3).
+ *
+ * Predicated on ZLA (encoded in Fairness).
+ *)
+AbortCompletion ==
+    \A c \in Cluster :
+        clusterState[c] \in AbortCompletionAntecedentStates
+            ~> clusterState[c] \in StableClusterStates
+
+---------------------------------------------------------------------------
+
+(* Type invariant *)
+
+\* All specification variables have valid types.
+TypeOK ==
+    /\ clusterState \in [Cluster -> HAGroupState]
+    /\ writerMode \in [Cluster -> [RS -> WriterMode]]
+    /\ outDirEmpty \in [Cluster -> BOOLEAN]
+    /\ hdfsAvailable \in [Cluster -> BOOLEAN]
+    /\ antiFlapTimer \in [Cluster -> 0..WaitTimeForSync]
+    /\ replayState \in [Cluster -> ReplayStateSet]
+    /\ lastRoundInSync \in [Cluster -> Nat]
+    /\ lastRoundProcessed \in [Cluster -> Nat]
+    /\ failoverPending \in [Cluster -> BOOLEAN]
+    /\ inProgressDirEmpty \in [Cluster -> BOOLEAN]
+    /\ zkPeerConnected \in [Cluster -> BOOLEAN]
+    /\ zkPeerSessionAlive \in [Cluster -> BOOLEAN]
+    /\ zkLocalConnected \in [Cluster -> BOOLEAN]
+
+---------------------------------------------------------------------------
+
+(* Safety invariants *)
+
+(*
+ * ZK session/connection structural consistency: if the peer ZK
+ * session is expired, the peer connection must also be dead.
+ * Session expiry implies disconnection -- the ZKPeerSessionExpiry
+ * action sets both zkPeerSessionAlive and zkPeerConnected to FALSE.
+ * ZKPeerReconnect requires zkPeerSessionAlive = TRUE, so a
+ * reconnect cannot happen without a live session.
+ *
+ * This invariant verifies that the ZK actions correctly maintain
+ * the session/connection relationship across all reachable states.
+ *)
+ZKSessionConsistency ==
+    \A c \in Cluster :
+        zkPeerSessionAlive[c] = FALSE => zkPeerConnected[c] = FALSE
+
+\* Mutual exclusion: two clusters never both in the ACTIVE role
+\* simultaneously. This is the primary safety property of the
+\* failover protocol.
+\*
+\* The ACTIVE role includes: AIS, ANIS, AbTAIS, AbTANIS, AWOP,
+\* ANISWOP. Transitional states ATS and ANISTS map to the
+\* ACTIVE_TO_STANDBY role (not ACTIVE), which is the mechanism
+\* by which safety is maintained during the non-atomic failover
+\* window -- isMutationBlocked()=true for ACTIVE_TO_STANDBY.
+\*
+\* Source: Architecture safety argument; ClusterRoleRecord.java
+\*         L84 -- ACTIVE_TO_STANDBY has isMutationBlocked()=true.
+MutualExclusion ==
+    ~(\E c1, c2 \in Cluster :
+        \* Two distinct clusters ...
+        /\ c1 # c2
+        \* ... both in the ACTIVE role.
+        /\ RoleOf(clusterState[c1]) = "ACTIVE"
+        /\ RoleOf(clusterState[c2]) = "ACTIVE")
+
+---------------------------------------------------------------------------
+
+(*
+ * Abort safety: if a cluster is in AbTAIS (ABORT_TO_ACTIVE_IN_SYNC),
+ * the peer must be in AbTS, S, DS, or OFFLINE.
+ *
+ * AbTAIS is reached via two paths:
+ *   1. Abort path: PeerReactToAbTS (peer = AbTS). The peer can
+ *      auto-complete AbTS -> S before the local AbTAIS auto-completes.
+ *   2. Reconciliation path: ZKPeerReconnect/ZKPeerSessionRecover
+ *      with local = ATS and peer in {S, DS}. DS is reachable when
+ *      the peer degraded (S -> DS via PeerReactToANIS) before the
+ *      failover partition.
+ *
+ * OFFLINE is additionally allowed: when UseOfflinePeerDetection is
+ * enabled, AdminGoOffline can take the peer from S or DS to OFFLINE
+ * before the local AbTAIS has auto-completed.
+ *
+ * All four peer states (AbTS, S, DS, OFFLINE) map to non-ACTIVE
+ * roles, so MutualExclusion is preserved in all cases.
+ *
+ * The abort protocol is:
+ *   (ATS, STA) --[Admin]--> (ATS, AbTS) --[PeerReact]--> (AbTAIS, AbTS)
+ *   then auto-complete both sides back to (AIS, S).
+ *
+ * The reconciliation protocol is:
+ *   (ATS, S/DS) --[ZKPeerReconnect]--> (AbTAIS, S/DS)
+ *   then auto-complete AbTAIS -> AIS/ANIS.
+ *
+ * Source: Architecture safety argument; abort originates from
+ *         setHAGroupStatusToAbortToStandby() (L419-425) on the
+ *         STA side; active detects via FailoverManagementListener
+ *         peer AbTS resolver (L132). Reconciliation path is in
+ *         ZK.tla (ZKPeerReconnect/ZKPeerSessionRecover).
+ *)
+AbortSafety ==
+    \A c \in Cluster :
+        clusterState[c] = "AbTAIS" =>
+            clusterState[Peer(c)] \in {"AbTS", "S", "DS", "OFFLINE"}
+
+---------------------------------------------------------------------------
+
+(* Action constraints *)
+
+\* Every state change in every step follows the AllowedTransitions
+\* table. This is an action constraint checked by TLC: it verifies
+\* that the Next relation only produces transitions that are in the
+\* implementation's allowedTransitions set.
+\*
+\* Source: HAGroupStoreRecord.java L99-123, isTransitionAllowed() L130.
+TransitionValid ==
+    \A c \in Cluster :
+        \* If the state changed for this cluster ...
+        clusterState'[c] # clusterState[c] =>
+            \* ... then the (old, new) pair must be allowed.
+            <<clusterState[c], clusterState'[c]>> \in AllowedTransitions
+
+---------------------------------------------------------------------------
+
+(*
+ * Every writer mode change follows the allowed writer transitions.
+ * Action constraint checked by TLC analogous to TransitionValid.
+ *
+ * The X -> INIT transitions (SYNC, STORE_AND_FWD, SYNC_AND_FWD)
+ * model the replication subsystem restart on ATS -> S (standby
+ * entry). These are lifecycle resets, not ReplicationLogGroup
+ * mode CAS transitions: the entire ReplicationLogGroup is
+ * destroyed when the cluster becomes standby.
+ *
+ * Source: ReplicationLogGroup.java mode transitions;
+ *         FailoverManagementListener replication subsystem restart.
+ *
+ * AllowedWriterTransitions is defined in Types.tla.
+ *)
+WriterTransitionValid ==
+    \A c \in Cluster :
+        \A r \in RS :
+            writerMode'[c][r] # writerMode[c][r] =>
+                <<writerMode[c][r], writerMode'[c][r]>> \in AllowedWriterTransitions
+
+---------------------------------------------------------------------------
+
+(*
+ * AIS-to-ATS precondition: failover can only begin from AIS when
+ * the OUT directory is empty and all live RS are in SYNC mode.
+ *
+ * DEAD RSes are allowed: an RS can crash while the cluster is AIS
+ * without changing the HA group state. A DEAD RS is not writing,
+ * so the remaining SYNC RSes and empty OUT dir ensure safety.
+ * The implementation checks clusterState = AIS, not per-RS modes.
+ *
+ * Source: initiateFailoverOnActiveCluster() L375-400 (validates
+ *         current state is AIS or ANIS); the precondition holds
+ *         because AIS is only reachable when OUT dir is empty and
+ *         all writers have returned to SYNC. RS crash does not
+ *         change clusterState.
+ *)
+AIStoATSPrecondition ==
+    \A c \in Cluster :
+        clusterState[c] = "AIS" /\ clusterState'[c] = "ATS"
+            => outDirEmpty[c] /\ \A r \in RS : writerMode[c][r] \in {"SYNC", "DEAD"}
+
+---------------------------------------------------------------------------
+
+(*
+ * Anti-flapping gate: ANIS -> AIS never fires while the countdown
+ * timer is still running. This is a cross-check on the ANISToAIS
+ * action's AntiFlapGateOpen guard, analogous to how AIStoATS-
+ * Precondition cross-checks AdminStartFailover.
+ *
+ * Source: HAGroupStoreClient.validateTransitionAndGetWaitTime()
+ *         L1027-1046
+ *)
+AntiFlapGate ==
+    \A c \in Cluster :
+        clusterState[c] = "ANIS" /\ clusterState'[c] = "AIS"
+            => AntiFlapGateOpen(antiFlapTimer[c])
+
+---------------------------------------------------------------------------
+
+(*
+ * ANISTS-to-ATS precondition: the ANISTS -> ATS transition
+ * (forwarder drain completion during ANIS failover) can only
+ * proceed when the OUT directory is empty and the anti-flapping
+ * gate is open. Cross-checks the ANISTSToATS action's guards,
+ * analogous to how AIStoATSPrecondition cross-checks
+ * AdminStartFailover and AntiFlapGate cross-checks ANISToAIS.
+ *
+ * Source: HAGroupStoreManager.setHAGroupStatusToSync() L341-355;
+ *         HAGroupStoreClient.validateTransitionAndGetWaitTime()
+ *         L1027-1046.
+ *)
+ANISTStoATSPrecondition ==
+    \A c \in Cluster :
+        clusterState[c] = "ANISTS" /\ clusterState'[c] = "ATS"
+            => /\ outDirEmpty[c]
+               /\ AntiFlapGateOpen(antiFlapTimer[c])
+
+---------------------------------------------------------------------------
+
+(*
+ * Failover trigger correctness: STA -> AIS requires replay-
+ * completeness conditions. Cross-checks the TriggerFailover
+ * action's guards -- if TLC finds a step where STA->AIS happens
+ * without the required conditions, the action constraint fires.
+ *
+ * hdfsAvailable is excluded: it is an environmental/liveness
+ * guard (without HDFS, the action cannot fire), not a replay-
+ * completeness condition.
+ *
+ * Source: shouldTriggerFailover() L500-533 (implementation guards)
+ *)
+FailoverTriggerCorrectness ==
+    \A c \in Cluster :
+        clusterState[c] = "STA" /\ clusterState'[c] = "AIS"
+            => STAtoAISTriggerReplayGuards(c)
+
+---------------------------------------------------------------------------
+
+(*
+ * No data loss (zero RPO): the high-level safety property for
+ * failover. When the standby completes STA -> AIS, replay must
+ * have been in SYNC (no pending SYNCED_RECOVERY rewind), the
+ * in-progress directory must be empty, and the failover must
+ * have been properly initiated.
+ *)
+NoDataLoss ==
+    \A c \in Cluster :
+        clusterState[c] = "STA" /\ clusterState'[c] = "AIS"
+            => STAtoAISTriggerReplayGuards(c)
+
+---------------------------------------------------------------------------
+
+(*
+ * Replay rewind correctness: the SYNCED_RECOVERY -> SYNC
+ * transition (ReplayRewind CAS) equalizes the replay counters.
+ *
+ * After a degradation period, lastRoundProcessed can advance
+ * beyond lastRoundInSync (ReplayAdvance in DEGRADED only
+ * increments lastRoundProcessed). When the cluster recovers
+ * (DS -> S or ATS -> S), recoveryListener sets replayState to
+ * SYNCED_RECOVERY. ReplayRewind then resets lastRoundProcessed
+ * back to lastRoundInSync, ensuring re-processing of all rounds
+ * from the last known in-sync point before replay resumes in
+ * SYNC mode.
+ *
+ * This property verifies the mechanism; NoDataLoss verifies the
+ * safety outcome. Together they guarantee zero RPO: the rewind
+ * closes the counter gap, and TriggerFailover (which requires
+ * replayState = SYNC) cannot fire until the rewind completes.
+ *
+ * Source: replay() L323-333 -- compareAndSet(SYNCED_RECOVERY, SYNC);
+ *         getFirstRoundToProcess() L389 -- rewinds to lastRoundInSync
+ *)
+ReplayRewindCorrectness ==
+    \A c \in Cluster :
+        replayState[c] = "SYNCED_RECOVERY" /\ replayState'[c] = "SYNC"
+            => lastRoundProcessed'[c] = lastRoundInSync'[c]
+
+---------------------------------------------------------------------------
+
+(*
+ * Every replay state change follows the allowed replay transitions.
+ * Action constraint checked by TLC analogous to TransitionValid
+ * and WriterTransitionValid.
+ *
+ * Source: ReplicationLogDiscoveryReplay.java L131-206 (listeners),
+ *         L323-333 (CAS), L336-351 (replay loop)
+ *
+ * AllowedReplayTransitions is defined in Types.tla.
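+ *
+ * For example, the ReplayRewind CAS produces the pair
+ * <<"SYNCED_RECOVERY", "SYNC">>, which must therefore be a member
+ * of AllowedReplayTransitions.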
+ *)
+ReplayTransitionValid ==
+    \A c \in Cluster :
+        replayState'[c] # replayState[c] =>
+            <<replayState[c], replayState'[c]>> \in AllowedReplayTransitions
+
+---------------------------------------------------------------------------
+
+(*
+ * AIS implies in-sync: whenever a cluster is in AIS, the OUT
+ * directory must be empty and all RS must be in SYNC, INIT, or
+ * DEAD.
+ *
+ * DEAD is allowed because an RS can crash while the cluster is
+ * AIS. RSCrash sets writerMode to DEAD but does not change
+ * clusterState. The HA group state in ZK is independent of RS
+ * process lifecycle.
+ *)
+AISImpliesInSync ==
+    \A c \in Cluster :
+        clusterState[c] = "AIS" =>
+            /\ outDirEmpty[c]
+            /\ \A r \in RS : writerMode[c][r] \in {"INIT", "SYNC", "DEAD"}
+
+---------------------------------------------------------------------------
+
+(*
+ * Writer-cluster consistency: degraded writer modes (S&F,
+ * SYNC_AND_FWD) can only appear on active clusters that are
+ * NOT in AIS, on the ANISTS/ATS transitional states, or on
+ * abort states where HDFS failure can degrade writers.
+ *
+ * AIS is excluded: prevented by the AIS->ANIS coupling
+ * (WriterToStoreFwd, WriterInitToStoreFwd atomically transition
+ * AIS -> ANIS when a writer degrades).
+ *
+ * ATS is included: the AIS failover path enters ATS with all
+ * writers in SYNC/DEAD (AdminStartFailover guard), but the ANIS
+ * failover path enters ATS via ANISTSToATS which does NOT snap
+ * writer modes -- SYNC_AND_FWD writers persist into ATS. Also,
+ * if HDFS goes down during ATS, WriterSyncFwdToStoreFwd can
+ * re-degrade S&FWD writers to S&F. These degraded writers are
+ * cleaned up on ATS -> S (replication subsystem restart).
+ *
+ * Standby states (S, DS, AbTS) are excluded: live writer modes
+ * are reset to INIT on ATS -> S entry (PeerReactToAIS,
+ * PeerReactToANIS lifecycle reset). DEAD writers are preserved
+ * through ATS -> S (crashed RSes cannot process state change
+ * notifications) and handled by RSRestart independently.
+ *
+ * DEAD is excluded from this check because RSCrash can set
+ * writerMode to DEAD on any cluster state (an RS can crash
+ * at any time). DEAD writers from CAS failure also appear
+ * only on non-AIS active states, but RSCrash is unconstrained.
+ *
+ * The allowed set includes AbTAIS and AWOP because HDFS can go
+ * down while the cluster is in these states; the AIS->ANIS
+ * coupling only fires for AIS, so other active states retain
+ * their state while writers degrade.
+ *)
+WriterClusterConsistency ==
+    \A c \in Cluster :
+        (\E r \in RS : writerMode[c][r] \in {"STORE_AND_FWD", "SYNC_AND_FWD"}) =>
+            clusterState[c] \in {"ANIS", "ANISTS", "ATS", "ANISWOP", "AbTANIS", "AbTAIS", "AWOP"}
+
+---------------------------------------------------------------------------
+
+(* State constraint *)
+
+\* Bound replay counters for exhaustive search tractability.
+\* The abstract counter values only matter relationally
+\* (lastRoundProcessed >= lastRoundInSync), so small bounds suffice.
+ReplayCounterBound ==
+    \A c \in Cluster : lastRoundProcessed[c] <= 3
+
+---------------------------------------------------------------------------
+
+(* Symmetry *)
+
+\* RS identifiers are interchangeable (all start in INIT, identical
+\* action sets). Cluster identifiers remain asymmetric (AIS vs S).
+Symmetry == Permutations(RS)
+
+---------------------------------------------------------------------------
+
+(* Theorems *)
+
+\* Safety: all variables have valid types.
+THEOREM Spec => []TypeOK
+
+\* Safety: mutual exclusion holds in every reachable state.
+THEOREM Spec => []MutualExclusion + +\* Safety: abort is always initiated from the correct side. +THEOREM Spec => []AbortSafety + +\* Safety: AIS implies in-sync (derived invariant). +THEOREM Spec => []AISImpliesInSync + +\* Safety: degraded writer modes only on degraded-active clusters. +THEOREM Spec => []WriterClusterConsistency + +\* Safety: ZK session/connection consistency (session expiry implies disconnection). +THEOREM Spec => []ZKSessionConsistency + +\* Safety: ANISTS->ATS requires outDirEmpty and anti-flapping gate open (action property). +THEOREM Spec => [][ANISTStoATSPrecondition]_vars + +\* Safety: STA->AIS requires replay-completeness conditions (action property). +THEOREM Spec => [][FailoverTriggerCorrectness]_vars + +\* Safety: zero RPO -- no data loss on failover (action property). +THEOREM Spec => [][NoDataLoss]_vars + +\* Safety: replay rewind equalizes counters (action property). +THEOREM Spec => [][ReplayRewindCorrectness]_vars + +\* Liveness: failover/abort transient states eventually resolve. +THEOREM SpecFC => FailoverCompletion + +\* Liveness: ANIS with available HDFS eventually recovers. +THEOREM SpecDR => DegradationRecovery + +\* Liveness: abort states eventually auto-complete. +THEOREM SpecAC => AbortCompletion + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/DEVELOPING.md b/src/main/tla/ConsistentFailover/DEVELOPING.md new file mode 100644 index 00000000000..ebb12fcfb8e --- /dev/null +++ b/src/main/tla/ConsistentFailover/DEVELOPING.md @@ -0,0 +1,340 @@ +# Developer Guide: Modeling Changes with the TLA+ Specification + +This guide shows how to use the Phoenix Consistent Failover TLA+ specification to validate proposed design and architecture changes **before** writing implementation code. It covers the five most common change patterns, the end-to-end verification workflow, and annotated guidance on which invariants to check for each kind of change. + +--- + +## 1. Verification Workflow + +For any proposed change to the Consistent Failover protocol or its supporting subsystems, follow this workflow: + +1. **Identify the feature area** -- HA group transitions, writer modes, replay, ZK coordination, HDFS availability, or RS lifecycle. +2. **Find the corresponding spec module** -- `HAGroupStore.tla`, `Writer.tla`, `Reader.tla`, `ZK.tla`, `HDFS.tla`, `RS.tla`, `Clock.tla`, or `Admin.tla`. See the module-implementation mapping in Section 4. +3. **Model the proposed change in TLA+** -- add or modify actions, adjust guards, add or modify type definitions in `Types.tla`. +4. **Add or modify invariants** if the change introduces new safety requirements. +5. **Run exhaustive verification** (2 clusters, 2 RS, ~12 min) for fast feedback on core safety properties. +6. **Run simulation** (2 clusters, 9 RS, configurable duration) for deeper coverage at production-scale RS count. +7. **Run liveness simulation** (2 clusters, 2 RS, per property) to verify progress properties under fair scheduling. +8. **If a violation is found**, the counterexample trace shows the exact failure sequence. Refine the design and repeat from step 3. 
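+ +In practice, steps 4-6 mean defining the invariant in the spec and registering it in each TLC configuration file. Illustrative snippet only (`MyNewInvariant` is a placeholder name, not an invariant in this spec): + +``` +INVARIANT + TypeOK + MutualExclusion + MyNewInvariant +```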
+ +### Running Verification + +**Exhaustive (2 clusters, 2 RS):** + +```sh +java -XX:+UseParallelGC \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla -config ConsistentFailover.cfg \ + -workers auto -cleanup +``` + +**Simulation (2 clusters, 9 RS, configurable duration):** + +```sh +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=3600 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla -config ConsistentFailover-sim.cfg \ + -simulate -depth 10000 -workers auto +``` + +**Liveness (per-property, example: AbortCompletion):** + +```sh +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=3600 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla \ + -config ConsistentFailover-sim-liveness-ac.cfg \ + -simulate -depth 10000 -workers auto +``` + +### Recommended Durations for Simulation + +| Tier | Duration | Use Case | +|------|----------|----------| +| Quick feedback | 300s (5 min) | Feedback during development | +| Post-change | 900s (15 min) | Validation after completing a change | +| Post-phase | 3600s (1 hr) | Milestone verification | +| Nightly CI | 28800s (8 hr) | Continuous overnight run for rare interleavings | + +--- + +## 2. Common Change Patterns + +### Pattern A: Adding or Modifying an HA Group State Transition + +**When:** You are adding a new HA group state, adding a new transition between existing states, or modifying the guards on an existing transition. + +**Where to edit:** + +1. **`Types.tla`** — Add the new state to `HAGroupState` and the new transition pair to `AllowedTransitions`. Update `ActiveStates`, `StandbyStates`, or `TransitionalActiveStates` if the new state maps to one of those roles. Update `RoleOf` accordingly. If liveness uses named state sets (`StableClusterStates`, `FailoverCompletionAntecedentStates`, `AbortCompletionAntecedentStates`, `NotANISClusterStates`), extend them when a new state should appear in a `~>` antecedent or consequent. +2. **`HAGroupStore.tla`** or **`Admin.tla`** — Add or modify the action that produces the new transition. Set appropriate ZK connectivity guards (`zkPeerConnected`, `zkPeerSessionAlive`, `zkLocalConnected`). +3. **`ConsistentFailover.tla`** — Wire the new action into the `Next` disjunction and, if appropriate, into the `Fairness` condition. Add a new invariant if the transition introduces new safety requirements. + +**What to verify (primary invariants and constraints):** + +| Invariant / Constraint | Why | +|---|---| +| `MutualExclusion` | The new state must not allow two clusters in the ACTIVE role simultaneously | +| `TransitionValid` | The new `(from, to)` pair must be in `AllowedTransitions` | +| `AbortSafety` | If the new state participates in the abort protocol, AbTAIS peer constraints must hold | +| `TypeOK` | The new state must be a member of `HAGroupState` | + +--- + +### Pattern B: Modifying Writer Mode Transitions + +**When:** You are adding a new writer mode, changing the degradation/recovery paths, modifying CAS failure behavior, or changing the AIS-to-ANIS coupling. + +**Where to edit:** + +1. **`Types.tla`** — Add the new mode to `WriterMode` if adding a new mode. Add any new `(from, to)` pair to `AllowedWriterTransitions`. +2. **`Writer.tla`** — Add or modify the action. Set `zkLocalConnected` guard on any action that performs a ZK CAS write. Ensure the AIS-to-ANIS coupling fires atomically when the first RS degrades. +3. **`ConsistentFailover.tla`** — Wire new actions into `Next` and `Fairness`. 
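+ +As a shape reference for step 2, a new writer-mode action typically combines a ZK connectivity guard, a current-mode guard, and a primed `EXCEPT` update. The sketch below is illustrative only, not part of the spec: the action name, the `PAUSED` mode, and the `UNCHANGED` tuple are invented for this example. + +```tla +\* Hypothetical sketch: RS rs on cluster c leaves SYNC for an +\* invented PAUSED mode. Guarded on local ZK connectivity because +\* writer-mode changes are ZK CAS writes. <<"SYNC", "PAUSED">> would +\* also have to be added to AllowedWriterTransitions (step 1). +WriterPause(c, rs) == + /\ zkLocalConnected[c] + /\ writerMode[c][rs] = "SYNC" + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "PAUSED"] + \* Remaining state variables elided in this sketch. + /\ UNCHANGED <<clusterState, replayState, outDirEmpty>> +```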
+ +**What to verify (primary invariants and constraints):** + +| Invariant / Constraint | Why | +|---|---| +| `AISImpliesInSync` | AIS must never coexist with degraded writer modes | +| `WriterClusterConsistency` | Degraded writer modes only on appropriate cluster states | +| `WriterTransitionValid` | Every writer mode change follows `AllowedWriterTransitions` | +| `MutualExclusion` | Writer mode changes must not break the cluster-level mutual exclusion invariant | + +--- + +### Pattern C: Modifying the Replay State Machine or Failover Trigger + +**When:** You are changing the replay advance/rewind logic, modifying the `shouldTriggerFailover()` guards, changing how `failoverPending` or `inProgressDirEmpty` are managed, or modifying the replay state transitions. + +**Where to edit:** + +1. **`Types.tla`** — Update `ReplayStateSet` if adding a new replay state. Update `AllowedReplayTransitions` for any new replay `(from, to)` pair. +2. **`Reader.tla`** — Modify `ReplayAdvance`, `ReplayRewind`, `ReplayBeginProcessing`, `ReplayFinishProcessing`, or `TriggerFailover`. +3. **`HAGroupStore.tla`** — If changing listener folds (recoveryListener or degradedListener effects on S-entry or DS-entry actions), modify the `replayState'` assignments in `PeerReactToAIS`, `PeerReactToANIS`, and `AutoComplete`. + +**What to verify (primary invariants and constraints):** + +| Invariant / Constraint | Why | +|---|---| +| `FailoverTriggerCorrectness` | STA->AIS requires failoverPending, inProgressDirEmpty, replayState=SYNC | +| `NoDataLoss` | Zero RPO: the core safety property for failover | +| `ReplayRewindCorrectness` | SYNCED_RECOVERY->SYNC must equalize replay counters | +| `ReplayTransitionValid` | Every replay state change follows `AllowedReplayTransitions` | +| `FailoverCompletion` (liveness) | STA and abort states must eventually resolve | + +--- + +### Pattern D: Modifying ZK Coordination or Adding a ZK Failure Mode + +**When:** You are changing how ZK connection loss or session expiry affects the protocol, adding a new ZK failure mode, modifying the ATS reconciliation fold, or changing which actions are guarded by ZK connectivity. + +**Where to edit:** + +1. **`ZK.tla`** — Add or modify ZK lifecycle actions. If adding a new failure mode, add both the failure and recovery actions. Ensure session expiry implies disconnection (`ZKSessionConsistency` invariant). +2. **`HAGroupStore.tla`**, **`Writer.tla`**, **`Reader.tla`** — Update `zkPeerConnected`/`zkPeerSessionAlive`/`zkLocalConnected` guards on actions affected by the ZK change. +3. **`ConsistentFailover.tla`** — Wire new ZK actions into `Next`. Classify ZK failure actions in Tier 4 (no fairness) and ZK recovery actions in Tier 2 (WF, encoding the ZK Liveness Assumption). + +**What to verify (primary invariants and constraints):** + +| Invariant / Constraint | Why | +|---|---| +| `ZKSessionConsistency` | Session expiry must imply disconnection | +| `MutualExclusion` | ZK failures must not create a dual-active window | +| `AbortSafety` | AbTAIS peer constraints must hold after reconciliation | +| `AbortCompletion` (liveness) | Abort states must eventually resolve under ZK recovery | +| `FailoverCompletion` (liveness) | Failover must eventually complete under ZK recovery | + +--- + +### Pattern E: Adding a New Invariant or Action Constraint + +**When:** Your proposed change introduces new safety requirements that are not covered by existing invariants or action constraints. + +**Where to edit:** + +1. 
**`ConsistentFailover.tla`** — Define the invariant predicate or action constraint. Place it near semantically related invariants. Add a `THEOREM` declaration. +2. **All `.cfg` files** — Add the new invariant to the `INVARIANT` section or the new action constraint to the `ACTION_CONSTRAINT` section in all five configuration files. + +**What to verify:** + +- The new invariant passes under the current specification (no false positives). +- All existing invariants still pass (no regressions from any spec changes needed to support the new invariant). +- If the invariant is a state invariant, add it to `INVARIANT` in all `.cfg` files. +- If the invariant is an action constraint, add it to `ACTION_CONSTRAINT` in all `.cfg` files. + +--- + +## 3. Invariant Reference: When to Check What + +The following tables map each invariant and action constraint to the change areas where it is most relevant. When modifying a given area, prioritize the invariants listed for that area. + +### Core Safety Invariants + +| Invariant | Check When Changing | +|---|---| +| `MutualExclusion` | HA group state transitions, ZK watcher delivery, abort protocol, ATS reconciliation, any `RoleOf` mapping | +| `AbortSafety` | Abort protocol (AdminAbortFailover, PeerReactToAbTS), ATS reconciliation fold (ZKPeerReconnect, ZKPeerSessionRecover) | +| `ZKSessionConsistency` | ZK session expiry/recovery, peer connection lifecycle | +| `AISImpliesInSync` | Writer degradation (WriterToStoreFwd, WriterInitToStoreFwd), ANISToAIS recovery, RSCrash | +| `WriterClusterConsistency` | Writer mode transitions, standby entry lifecycle reset, ANISTSToATS (writer mode preservation) | + +### Action Constraints + +| Constraint | Check When Changing | +|---|---| +| `TransitionValid` | Any HA group state transition, AllowedTransitions table | +| `WriterTransitionValid` | Any writer mode transition, AllowedWriterTransitions table, standby entry lifecycle reset | +| `ReplayTransitionValid` | Any replay state transition, listener folds (recoveryListener, degradedListener) | +| `AIStoATSPrecondition` | AdminStartFailover guards, outDirEmpty semantics, writer mode guards for AIS path | +| `AntiFlapGate` | Anti-flapping timer, ANISToAIS guards, WaitTimeForSync semantics | +| `ANISTStoATSPrecondition` | ANISTSToATS guards, forwarder drain, anti-flapping timer | +| `FailoverTriggerCorrectness` | TriggerFailover guards, shouldTriggerFailover() conditions, failoverPending lifecycle | +| `NoDataLoss` | Replay state machine, failover trigger, rewind correctness | +| `ReplayRewindCorrectness` | ReplayRewind action, SYNCED_RECOVERY->SYNC CAS, lastRoundProcessed/lastRoundInSync counters | + +### Liveness Properties + +| Property | Check When Changing | +|---|---| +| `FailoverCompletion` | AutoComplete, TriggerFailover, replay state machine, ZK recovery, HDFS recovery | +| `DegradationRecovery` | Writer recovery chain (S&F->S&FWD->SYNC), ANISToAIS, anti-flapping timer, HDFSUp, RS restart | +| `AbortCompletion` | AutoComplete, ZK local connectivity, abort state transitions | + +--- + +## 4. 
Module-Implementation Mapping + +Use this to quickly locate which spec module corresponds to the implementation +code you are changing: + +| Implementation Class / Method | Spec Module | Key Actions | +|---|---|---| +| `HAGroupStoreManager` | `ConsistentFailover.tla` | `Init`, `Next`, `Fairness`, invariants | +| `FailoverManagementListener.onStateChange()` L653-706 | `HAGroupStore.tla` | `PeerReactToATS`, `PeerReactToANIS`, `PeerReactToAbTS`, `PeerReactToAIS`, `ReactiveTransitionFail` | +| `createLocalStateTransitions()` L140-150 | `HAGroupStore.tla` | `AutoComplete` (AbTS->S, AbTAIS->AIS/ANIS, AbTANIS->ANIS) | +| `StoreAndForwardModeImpl.startHAGroupStoreUpdateTask()` L71-87 | `HAGroupStore.tla` | `ANISHeartbeat` | +| `HAGroupStoreManager.setHAGroupStatusToSync()` L341-355 | `HAGroupStore.tla` | `ANISToAIS`, `ANISTSToATS` | +| `initiateFailoverOnActiveCluster()` L375-400 | `Admin.tla` | `AdminStartFailover` (AIS->ATS or ANIS->ANISTS) | +| `setHAGroupStatusToAbortToStandby()` L419-425 | `Admin.tla` | `AdminAbortFailover` (STA->AbTS) | +| `SyncModeImpl.onFailure()` L61-74 | `Writer.tla` | `WriterToStoreFwd`, `WriterToStoreFwdFail` | +| `SyncAndForwardModeImpl.onFailure()` L66-78 | `Writer.tla` | `WriterSyncFwdToStoreFwd`, `WriterSyncFwdToStoreFwdFail` | +| `ReplicationLogDiscoveryForwarder.init()` L98-108 | `Writer.tla` | `WriterSyncToSyncFwd` | +| `ReplicationLogDiscoveryForwarder.processFile()` L133-152 | `Writer.tla` | `WriterStoreFwdToSyncFwd` | +| `ReplicationLogDiscoveryForwarder.processNoMoreRoundsLeft()` L155-184 | `Writer.tla` | `WriterSyncFwdToSync` | +| `ReplicationLogDiscoveryReplay.replay()` L323-351 | `Reader.tla` | `ReplayAdvance`, `ReplayRewind` | +| `ReplicationLogDiscoveryReplay.shouldTriggerFailover()` L500-533 | `Reader.tla` | `TriggerFailover` | +| NameNode crash/recovery | `HDFS.tla` | `HDFSDown`, `HDFSUp` | +| JVM crash, OOM, kill signal | `RS.tla` | `RSCrash` | +| `StoreAndForwardModeImpl.onFailure()` L115-123 | `RS.tla` | `RSAbortOnLocalHDFSFailure` | +| Kubernetes/YARN pod restart | `RS.tla` | `RSRestart` | +| `HAGroupStoreClient.validateTransitionAndGetWaitTime()` L1027-1046 | `Clock.tla` | `Tick` | +| `HAGroupStoreClient.createCacheListener()` L894-906 | `ZK.tla` | `ZKPeerDisconnect`, `ZKPeerReconnect`, `ZKLocalDisconnect`, `ZKLocalReconnect` | +| Curator session management | `ZK.tla` | `ZKPeerSessionExpiry`, `ZKPeerSessionRecover` | +| `HAGroupStoreRecord.HAGroupState` enum L51-65 | `Types.tla` | `HAGroupState`, `AllowedTransitions` | +| `ClusterRoleRecord.ClusterRole` enum L59-107 | `Types.tla` | `ClusterRole`, `RoleOf` | +| `ReplicationLogGroup` mode classes | `Types.tla` | `WriterMode` | + +--- + +## 5. Tips for Effective Spec-Driven Development + +1. **Start small.** Make the minimal change to the spec that captures your proposed design. Run exhaustive verification first (~12 min). Add complexity incrementally. + +2. **Read counterexample traces carefully.** When TLC reports a violation, it produces a step-by-step trace. Each step is an action name + the values of every state variable. The trace is the exact sequence of events that breaks your invariant. + +3. **Use simulation for production-scale RS counts.** The exhaustive model uses 2 RS per cluster -- sufficient for CAS race coverage but not for complex multi-RS writer interleavings. The simulation model uses 9 RS to exercise production-scale scenarios like 4 RS in S&F, 3 in SYNC_AND_FWD, 2 in SYNC simultaneously. + +4. 
**Check liveness after safety.** Liveness verification is more expensive (Buchi automaton construction) and only meaningful if safety holds. Always pass exhaustive and simulation safety first. + +5. **Preserve symmetry reduction.** RS identifiers are interchangeable (all start in INIT, identical action sets). When adding RS-level behavior, keep the symmetry. Cluster identifiers are asymmetric (AIS vs S at Init) and cannot use symmetry. + +6. **Check both tiers.** A change that passes exhaustive (2 RS) may fail in simulation (9 RS) due to deeper per-RS interleavings. Always run both. + +7. **New state variables.** Declare them in `SpecState.tla` (single `VARIABLE` line), add them to the `vars` tuple in `ConsistentFailover.tla`, extend every sub-module action’s `UNCHANGED` / primed update lists as needed, and wire new actions into `Next` and `Fairness`. + +8. **Update all five `.cfg` files.** When adding a new invariant or action constraint, add it to `ConsistentFailover.cfg`, `ConsistentFailover-sim.cfg`, and all three liveness configs. + +9. **Classify new actions by fairness tier.** Every new action must be + classified into one of four tiers: + - **Tier 1 (WF):** Guards depend only on protocol state, no env var guards + - **Tier 2 (WF):** ZK recovery actions (encodes ZK Liveness Assumption) + - **Tier 3 (SF):** Guards depend on oscillating environment variables + - **Tier 4 (none):** Non-deterministic faults, operator actions, CAS failures + +--- + +## 6. How To: Understanding a TLA+ Counterexample Trace + +When TLC finds an invariant violation it produces a counterexample trace, the exact sequence of states from `Init` to the violating state. Each state shows the action that produced it and the values of every state variable. This is the spec's most valuable output, the precise interleaving that breaks your invariant. + +### 6.1 Anatomy of a Trace Step + +``` +Error: Invariant is violated. +Error: The behavior up to this point is: +State 1: +/\ clusterState = (c1 :> "AIS" @@ c2 :> "S") +/\ writerMode = ... +... +State N: +/\ clusterState = ... ← the state that violates the invariant +``` + +TLC prints all 13 variables in every state. Diff consecutive states manually to see what changed. Start from the last state and work backwards to find the action that introduced the bad value. 
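+ +For example, diffing two consecutive states (hypothetical values) isolates the variable the step changed: + +``` +State 7: +/\ clusterState = (c1 :> "AIS" @@ c2 :> "S") +/\ hdfsAvailable = (c1 :> TRUE @@ c2 :> TRUE) +... +State 8: +/\ clusterState = (c1 :> "AIS" @@ c2 :> "S") +/\ hdfsAvailable = (c1 :> TRUE @@ c2 :> FALSE) ← only change +... +``` + +Here only `hdfsAvailable[c2]` differs, so the step was `HDFSDown(c2)`; any writer degradation later in the trace follows from it.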
+ +### 6.2 Key Variables to Watch + +For different invariant violations, focus on different variables: + +| Invariant Violated | Key Variables to Track | +|---|---| +| `MutualExclusion` | `clusterState` for both clusters -- look for both in `ActiveStates` | +| `AISImpliesInSync` | `clusterState`, `writerMode`, `outDirEmpty` -- look for AIS with S&F or non-empty OUT | +| `NoDataLoss` / `FailoverTriggerCorrectness` | `clusterState` (STA->AIS), `failoverPending`, `inProgressDirEmpty`, `replayState` | +| `ZKSessionConsistency` | `zkPeerSessionAlive`, `zkPeerConnected` -- look for session dead but connected | +| `TransitionValid` | `clusterState` old and new values -- look for pair not in `AllowedTransitions` | +| `WriterTransitionValid` | `writerMode` old and new values for the changing RS | +| `AbortSafety` | `clusterState` for both clusters -- look for AbTAIS with unexpected peer state | + +### 6.3 Common Patterns in Counterexample Traces + +| Pattern | What to Look For | Typical Invariants Violated | +|---|---|---| +| **ZK partition during failover** | Peer disconnect between ATS write and STA detection | `MutualExclusion` (if reconciliation is wrong) | +| **CAS race** | Two RS detect HDFS failure, second gets BadVersionException | `WriterTransitionValid`, `AISImpliesInSync` | +| **Missed watcher** | `ReactiveTransitionFail` fires, transition permanently lost | Liveness violations | +| **Anti-flap bypass** | ANIS->AIS fires while timer is positive | `AntiFlapGate` | +| **Stale writer after standby entry** | S&F/S&FWD writer persists through ATS->S | `WriterClusterConsistency` | +| **Replay rewind race** | DS-entry fold fires before ReplayRewind CAS | `ReplayRewindCorrectness` | +| **Failover without replay completion** | STA->AIS without replayState=SYNC | `NoDataLoss`, `FailoverTriggerCorrectness` | +| **Dual active via abort** | AbTAIS reached with peer in active state | `AbortSafety`, `MutualExclusion` | + +### 6.4 Using an LLM to Analyze Traces + +Paste the trace into an LLM with relevant context. Use this template: + +```` +I have a TLA+ counterexample trace from the Phoenix Consistent Failover +specification. + +## Violated Invariant +[Paste invariant definition from ConsistentFailover.tla] + +## Counterexample Trace +[Paste full TLC output starting from "Error: Invariant ..."] + +## Relevant Spec Module(s) +[Paste action definitions from the .tla files named in state headers] + +## Questions +1. What is the root cause of this invariant violation? +2. At which state does the critical event occur? +3. Is this a spec bug or an implementation bug? +4. What change would fix this? +```` + +A good analysis will identify the critical state (often 2-3 steps before the violation), explain the interleaving, and suggest concrete fixes. + +Apply the fix, re-run TLC exhaustive, re-run simulation. If a new violation appears, paste the new trace and repeat. diff --git a/src/main/tla/ConsistentFailover/HAGroupStore.tla b/src/main/tla/ConsistentFailover/HAGroupStore.tla new file mode 100644 index 00000000000..e4cecb5d7e5 --- /dev/null +++ b/src/main/tla/ConsistentFailover/HAGroupStore.tla @@ -0,0 +1,583 @@ +-------------------- MODULE HAGroupStore ---------------------------------------- +(* + * Peer-reactive transitions and auto-completion actions for the + * Phoenix Consistent Failover protocol. 
+ * + * Actions model the FailoverManagementListener (HAGroupStoreManager.java + * L633-706) which reacts to peer ZK state changes via PathChildrenCache + * watchers, and the local auto-completion resolvers from + * createLocalStateTransitions() (L140-150). + * + * ZK WATCHER DELIVERY DEPENDENCY: All PeerReact* actions depend on + * the peer ZK connection and session being alive (guarded by + * zkPeerConnected[c] and zkPeerSessionAlive[c]). AutoComplete + * actions depend on the local ZK connection (guarded by + * zkLocalConnected[c]). Without these connections, watcher + * notifications cannot be delivered. + * + * RETRY EXHAUSTION: The FailoverManagementListener retries each + * reactive transition exactly 2 times (HAGroupStoreManager.java + * L653-704). After exhaustion, the method returns silently. This + * is modeled by the ReactiveTransitionFail(c) action, which + * non-deterministically "consumes" a pending peer-reactive + * transition without updating clusterState. + * + * Notification chain (peer-reactive transitions): + * Peer ZK znode change + * -> Curator peerPathChildrenCache + * -> HAGroupStoreClient.handleStateChange() [L1088-1110] + * -> notifySubscribers() [L1119-1151] + * -> FailoverManagementListener.onStateChange() [L653-705] + * -> setHAGroupStatusIfNeeded() (2-retry limit) + * + * Notification chain (auto-completion transitions): + * Local ZK znode change + * -> Curator pathChildrenCache (local) + * -> HAGroupStoreClient.handleStateChange() + * -> notifySubscribers() + * -> FailoverManagementListener.onStateChange() + * + * LISTENER FOLDS: The recoveryListener (L147-157) and degradedListener + * (L136-145) from ReplicationLogDiscoveryReplay fire synchronously on + * the local PathChildrenCache event thread during state entry. Their + * effects are folded atomically into the S-entry and DS-entry actions: + * - S entry (PeerReactToANIS ATS->S, PeerReactToAIS ATS->S, + * PeerReactToAIS DS->S, AutoComplete AbTS->S): sets replayState + * to SYNCED_RECOVERY. + * - DS entry (PeerReactToANIS S->DS): sets replayState to DEGRADED. + * + * Implementation traceability: + * + * TLA+ action | Java source + * --------------------------+----------------------------------------------- + * PeerReactToATS(c) | createPeerStateTransitions() L109 + * PeerReactToANIS(c) | createPeerStateTransitions() L123, L126 + * PeerReactToAbTS(c) | createPeerStateTransitions() L132 + * PeerReactToAIS(c) | createPeerStateTransitions() L112-120 + * AutoComplete(c) | createLocalStateTransitions() L144, L145, L147 + * ANISTSToATS(c) | HAGroupStoreManager.setHAGroupStatusToSync() + * | L341-355 (ANISTS -> ATS drain completion) + * PeerReactToOFFLINE(c) | (proactive, Iteration 18) intended peer + * | OFFLINE detection; no impl trigger yet; + * | gated on UseOfflinePeerDetection + * PeerRecoverFromOFFLINE(c) | (proactive, Iteration 18) intended peer + * | OFFLINE recovery detection; no impl + * | trigger yet; gated on + * | UseOfflinePeerDetection + * ReactiveTransitionFail(c) | FailoverManagementListener.onStateChange() + * | L653-704 (2 retries exhausted, returns + * | silently) + * + * Failover completion (STA -> AIS) is modeled in Reader.tla + * (TriggerFailover action), not in this module. + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +(* + * Standby entry from ATS (S-entry side effects). + * + * Shared by the ATS -> S branches of PeerReactToAIS and + * PeerReactToANIS. 
On standby entry from ATS, three side effects + * fire atomically: + * + * 1. Live writers reset to INIT (DEAD preserved). Models the + * replication subsystem restart on standby entry: the entire + * ReplicationLogGroup is destroyed, so any SYNC/SYNC_AND_FWD/ + * STORE_AND_FWD writers become INIT. DEAD writers (from + * RSCrash or CAS failure) are preserved because crashed RSes + * cannot process the state-change notification; they are + * handled by RSRestart independently. + * + * 2. OUT directory cleared. The forwarder drains before ATS + * auto-completes, so outDirEmpty is TRUE by the time the + * peer's AIS triggers the local S entry. + * + * 3. recoveryListener (L147-157) fires synchronously on the + * local PathChildrenCache event thread during S entry, + * unconditionally setting replayState to SYNCED_RECOVERY. + * + * Not applied to AdminForceRecover (resets all writers to INIT + * with no DEAD-preservation) or AutoComplete AbTS->S (only sets + * replayState; no writer/OUT reset needed because AbTS was never + * active). + *) +ResetToStandbyEntry(c) == + /\ writerMode' = [writerMode EXCEPT ![c] = + [rs \in RS |-> IF writerMode[c][rs] = "DEAD" + THEN "DEAD" + ELSE "INIT"]] + /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE] + /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"] + +--------------------------------------------------------------------------- + +(* + * Peer transitions to ATS (ACTIVE_IN_SYNC_TO_STANDBY). + * + * When the standby detects its peer has entered ATS, it begins the + * failover process by transitioning to STA (STANDBY_TO_ACTIVE). + * This fires from either S or DS -- the DS case supports the ANIS + * failover path where the standby is in DEGRADED_STANDBY when + * failover proceeds. + * + * ZK watcher dependency: Delivered via peerPathChildrenCache. + * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c]. + * If the peer ZK session expires or the notification is lost, the + * standby never learns of the failover. The active cluster remains + * in ATS with mutations blocked indefinitely. No polling fallback. + * + * Source: createPeerStateTransitions() L109 -- resolver is + * unconditional: currentLocal -> STANDBY_TO_ACTIVE. + * + * Also sets failoverPending[c] = TRUE, modeling the + * triggerFailoverListener (ReplicationLogDiscoveryReplay.java + * L159-171) which fires on LOCAL STANDBY_TO_ACTIVE. Folded + * into PeerReactToATS because the listener fires + * deterministically on every STA entry and PeerReactToATS is + * the sole producer of STA. + *) +PeerReactToATS(c) == + /\ PeerZKHealthy(c) + /\ clusterState[Peer(c)] = "ATS" + /\ clusterState[c] \in {"S", "DS"} + /\ clusterState' = [clusterState EXCEPT ![c] = "STA"] + /\ failoverPending' = [failoverPending EXCEPT ![c] = TRUE] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer transitions to ANIS (ACTIVE_NOT_IN_SYNC). + * + * Two reactive transitions triggered by peer entering ANIS: + * 1. Local S -> DS: standby degrades because peer's replication is + * degraded. Atomically sets replayState = DEGRADED (degraded- + * Listener fold). Source: L126. + * 2. Local ATS -> S: old active (in failover) completes transition + * to standby when peer is ANIS. Atomically sets replayState = + * SYNCED_RECOVERY (recoveryListener fold). Source: L123. + * + * ZK watcher dependency: Delivered via peerPathChildrenCache. + * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c]. 
+ * If lost: (1) standby stays in S when it should be DS -- consistency + * point tracking is incorrect; (2) old active stays in ATS with + * mutations blocked. No polling fallback. + *) +PeerReactToANIS(c) == + /\ PeerZKHealthy(c) + /\ clusterState[Peer(c)] = "ANIS" + /\ \/ /\ clusterState[c] = "S" + /\ clusterState' = [clusterState EXCEPT ![c] = "DS"] + \* degradedListener: unconditional set(DEGRADED) fires + \* synchronously on local PathChildrenCache thread during + \* DS entry. Counter advance is handled by ReplayAdvance. + /\ replayState' = [replayState EXCEPT ![c] = "DEGRADED"] + /\ UNCHANGED <> + \/ /\ clusterState[c] = "ATS" + /\ clusterState' = [clusterState EXCEPT ![c] = "S"] + /\ ResetToStandbyEntry(c) + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer transitions to AbTS (ABORT_TO_STANDBY). + * + * When the active cluster (in ATS during failover) detects its peer + * has entered AbTS (abort initiated from the standby side), the + * active transitions to AbTAIS (ABORT_TO_ACTIVE_IN_SYNC). + * + * ZK watcher dependency: Delivered via peerPathChildrenCache. + * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c]. + * If lost: active stays in ATS with mutations blocked; abort does + * not propagate. No polling fallback. + * + * Source: createPeerStateTransitions() L132. + *) +PeerReactToAbTS(c) == + /\ PeerZKHealthy(c) + /\ clusterState[Peer(c)] = "AbTS" + /\ clusterState[c] = "ATS" + /\ clusterState' = [clusterState EXCEPT ![c] = "AbTAIS"] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Auto-completion transitions (local, no peer trigger). + * + * These transitions fire automatically once the cluster enters the + * corresponding abort state. They return the cluster to its pre- + * failover state. + * + * ZK watcher dependency: Despite being "local" (no peer trigger), + * these transitions are driven by the local pathChildrenCache + * watcher chain, not an in-process event bus. Guarded on + * zkLocalConnected[c]. If the local ZK connection is lost, the + * cluster remains in the AbTS/AbTAIS/AbTANIS state indefinitely. + * + * AbTAIS auto-completion: conditional -- completes to AIS if all + * writers are clean (INIT or SYNC) and OUT dir is empty, otherwise + * completes to ANIS. This prevents AIS from coexisting with + * degraded writers when HDFS fails during the abort window. + * + * Source: createLocalStateTransitions() L140-150 + * AbTS -> S (L144) -- atomically sets replayState = + * SYNCED_RECOVERY (recoveryListener fold) + * AbTAIS -> AIS or ANIS (L145) -- conditional on writer/outDir state + * AbTANIS -> ANIS (L147) + *) +AutoComplete(c) == + /\ LocalZKHealthy(c) + /\ \/ /\ clusterState[c] = "AbTS" + /\ clusterState' = [clusterState EXCEPT ![c] = "S"] + \* recoveryListener: unconditional set(SYNCED_RECOVERY) + \* fires synchronously on local PathChildrenCache thread + \* during S entry. 
+ /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"] + /\ UNCHANGED <> + \/ /\ clusterState[c] = "AbTAIS" + /\ clusterState' = [clusterState EXCEPT ![c] = + IF outDirEmpty[c] /\ \A rs \in RS : writerMode[c][rs] \in {"INIT", "SYNC"} + THEN "AIS" + ELSE "ANIS"] + /\ UNCHANGED <> + \/ /\ clusterState[c] = "AbTANIS" + /\ clusterState' = [clusterState EXCEPT ![c] = "ANIS"] + /\ antiFlapTimer' = [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer transitions to AIS (ACTIVE_IN_SYNC). + * + * Two reactive transitions triggered by peer entering AIS: + * 1. Local ATS -> S: old active completes failover to standby + * when peer (the new active) enters AIS. Atomically sets + * replayState = SYNCED_RECOVERY (recoveryListener fold). + * 2. Local DS -> S: standby recovers from degraded when peer + * returns to AIS. Atomically sets replayState = + * SYNCED_RECOVERY (recoveryListener fold). + * + * WRITER LIFECYCLE RESET (ATS -> S): When the old active enters + * standby, the FailoverManagementListener triggers a replication + * subsystem restart on each live RS. Live writer modes reset to + * INIT (the ReplicationLogGroup is destroyed and will be + * recreated when the cluster next becomes active). The OUT + * directory is cleared (outDirEmpty = TRUE). DEAD writers are + * preserved: a crashed RS (JVM dead) cannot process the state + * change notification; the process supervisor restart (RSRestart) + * handles DEAD -> INIT independently. This is critical for the + * ANIS failover path where SYNC_AND_FWD or STORE_AND_FWD writers + * may persist through ANISTS -> ATS (ANISTSToATS does not snap + * writer modes). + * + * ZK watcher dependency: Delivered via peerPathChildrenCache. + * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c]. + * This is the critical transition that resolves the non-atomic + * failover window. If lost: old active stays in ATS with mutations + * blocked indefinitely (the (ATS, AIS) state persists). Safety + * holds (ATS maps to ACTIVE_TO_STANDBY, isMutationBlocked()=true) + * but liveness requires eventual watcher delivery. No polling + * fallback. Curator PathChildrenCache re-queries on reconnect, + * providing eventual delivery if the ZK session survives. + * + * Source: createPeerStateTransitions() L112-120 -- conditional + * resolver for peer ACTIVE_IN_SYNC. + *) +PeerReactToAIS(c) == + /\ PeerZKHealthy(c) + /\ clusterState[Peer(c)] = "AIS" + /\ \/ /\ clusterState[c] = "ATS" + /\ clusterState' = [clusterState EXCEPT ![c] = "S"] + /\ ResetToStandbyEntry(c) + /\ UNCHANGED <> + \/ /\ clusterState[c] = "DS" + /\ clusterState' = [clusterState EXCEPT ![c] = "S"] + \* recoveryListener: unconditional set(SYNCED_RECOVERY) + \* fires synchronously on local PathChildrenCache thread + \* during S entry. + /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * ANIS self-transition (heartbeat): refreshes the anti-flapping + * countdown timer without changing cluster state. + * + * The S&F heartbeat runs while at least one RS is in STORE_AND_FWD + * mode. It periodically re-writes ANIS to the ZK znode, which + * refreshes mtime. In the countdown timer model, this resets the + * timer to StartAntiFlapWait, keeping the anti-flapping gate closed. + * + * The heartbeat stops when the last RS exits STORE_AND_FWD (enters + * SYNC_AND_FWD). 
At that point the timer begins counting down via + * Tick, and the gate opens when it reaches 0. + * + * Guarded on zkLocalConnected[c] because the heartbeat calls + * setHAGroupStatusToStoreAndForward() which goes through + * setHAGroupStatusIfNeeded(), requiring isHealthy = true. + * + * Source: StoreAndForwardModeImpl.startHAGroupStoreUpdateTask() + * L71-87; HAGroupStoreRecord.java L101 (ANIS self-transition). + *) +ANISHeartbeat(c) == + /\ LocalZKHealthy(c) + /\ clusterState[c] = "ANIS" + /\ \E rs \in RS : writerMode[c][rs] = "STORE_AND_FWD" + /\ antiFlapTimer' = [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Recovery: ANIS -> AIS. + * + * When all RS on the cluster are in SYNC or SYNC_AND_FWD, the OUT + * directory is empty, and the anti-flapping gate has opened + * (countdown timer reached 0), the cluster recovers from ANIS to AIS. + * + * The writer guard includes SYNC_AND_FWD (not just SYNC) because + * the anti-flapping gate ensures all RS have + * exited S&F (the heartbeat stops, and WaitTimeForSync ticks must + * elapse) before this action fires. Any remaining SYNC_AND_FWD RS + * are atomically transitioned to SYNC, modeling the ACTIVE_IN_SYNC + * ZK event at ReplicationLogDiscoveryForwarder.init() L113-123. + * + * The AISImpliesInSync invariant verifies that AIS is only reached + * with all RS in SYNC or INIT. + * + * Guarded on zkLocalConnected[c] because this calls + * setHAGroupStatusToSync() which requires isHealthy = true. + * + * Source: setHAGroupStatusToSync() L341-355, after forwarder drain. + *) +ANISToAIS(c) == + /\ LocalZKHealthy(c) + /\ clusterState[c] = "ANIS" + /\ AntiFlapGateOpen(antiFlapTimer[c]) + /\ \A rs \in RS : writerMode[c][rs] \in {"SYNC", "SYNC_AND_FWD"} + /\ outDirEmpty[c] + /\ clusterState' = [clusterState EXCEPT ![c] = "AIS"] + /\ writerMode' = [writerMode EXCEPT ![c] = + [rs \in RS |-> IF writerMode[c][rs] = "SYNC_AND_FWD" + THEN "SYNC" + ELSE writerMode[c][rs]]] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Drain completion: ANISTS -> ATS. + * + * When the forwarder has drained the OUT directory and the anti- + * flapping gate has opened, the cluster advances from ANISTS + * (ACTIVE_NOT_IN_SYNC_TO_STANDBY) to ATS + * (ACTIVE_IN_SYNC_TO_STANDBY), joining the normal AIS failover + * path. The standby reacts to ATS (not ANISTS), so this + * transition is the bridge that lets the ANIS failover path + * converge with the AIS failover path. + * + * Writer modes are NOT snapped here. In the implementation, + * setHAGroupStatusToSync() only writes the cluster-level ZK + * znode (ANISTS -> ATS); it does not modify per-RS writer modes. + * SYNC_AND_FWD writers may persist into ATS. They are cleaned + * up when the cluster transitions ATS -> S (replication subsystem + * restart on standby entry -- see PeerReactToAIS, PeerReactToANIS). + * + * Anti-flapping gate: confirmed by implementation -- + * validateTransitionAndGetWaitTime() L1035-1036 applies the same + * waitTimeForSyncModeInMs to ANISTS -> ATS as to ANIS -> AIS. + * The forwarder handles the wait via syncUpdateTS deferral + * (processNoMoreRoundsLeft() L169-172). + * + * Guarded on zkLocalConnected[c] because this calls + * setHAGroupStatusToSync() -> setHAGroupStatusIfNeeded() which + * requires isHealthy = true. + * + * Source: HAGroupStoreManager.setHAGroupStatusToSync() L341-355 -- + * if current state is ANISTS, target is ATS. 
+ * HAGroupStoreClient.validateTransitionAndGetWaitTime() + * L1027-1046 (anti-flapping gate). + *) +ANISTSToATS(c) == + /\ LocalZKHealthy(c) + /\ clusterState[c] = "ANISTS" + /\ AntiFlapGateOpen(antiFlapTimer[c]) + /\ outDirEmpty[c] + /\ clusterState' = [clusterState EXCEPT ![c] = "ATS"] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer transitions to OFFLINE. + * + * Gated on UseOfflinePeerDetection (Iteration 18, proactive modeling). + * + * When the active cluster detects its peer has entered OFFLINE, it + * transitions to AWOP or ANISWOP depending on its current state: + * AIS -> AWOP (peer went offline while active is in sync) + * ANIS -> ANISWOP (peer went offline while active is not in sync) + * + * Both AWOP and ANISWOP map to ClusterRole.ACTIVE via + * getClusterRole() (isMutationBlocked()=false), so the active + * cluster continues serving mutations while its peer is offline. + * + * No writer or timer side effects: the transition is purely a + * cluster-state annotation recording the peer's unavailability. + * + * ZK watcher dependency: Delivered via peerPathChildrenCache. + * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c]. + * + * NOTE: This models intended protocol behavior. No + * FailoverManagementListener entry for peer OFFLINE currently + * exists in the implementation (createPeerStateTransitions() + * has no OFFLINE entry). The TLA+ model verifies the design + * ahead of implementation. + * + * Source: (proactive) AIS->AWOP from allowedTransitions L103; + * ANIS->ANISWOP from allowedTransitions L101. + *) +PeerReactToOFFLINE(c) == + /\ UseOfflinePeerDetection = TRUE + /\ PeerZKHealthy(c) + /\ clusterState[Peer(c)] = "OFFLINE" + /\ \/ /\ clusterState[c] = "AIS" + /\ clusterState' = [clusterState EXCEPT ![c] = "AWOP"] + \/ /\ clusterState[c] = "ANIS" + /\ clusterState' = [clusterState EXCEPT ![c] = "ANISWOP"] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer recovers from OFFLINE. + * + * Gated on UseOfflinePeerDetection (Iteration 18, proactive modeling). + * + * When the active cluster (in AWOP or ANISWOP) detects its peer + * has left OFFLINE (re-entered a non-OFFLINE state via manual + * --force recovery), the active returns to ANIS: + * AWOP -> ANIS (per AWOP.allowedTransitions = {ANIS}) + * ANISWOP -> ANIS (per ANISWOP.allowedTransitions = {ANIS}) + * + * Both paths enter ANIS because peer recovery is treated as a + * new peer entering sync -- the active must first synchronize, + * so it enters ANIS (not AIS). The anti-flap timer is reset to + * StartAntiFlapWait on ANIS entry. + * + * ZK watcher dependency: Delivered via peerPathChildrenCache. + * Guarded on zkPeerConnected[c] and zkPeerSessionAlive[c]. + * + * NOTE: This models intended protocol behavior. See + * PeerReactToOFFLINE comment for implementation status. + * + * Source: (proactive) AWOP->ANIS from allowedTransitions L113; + * ANISWOP->ANIS from allowedTransitions L123. + *) +PeerRecoverFromOFFLINE(c) == + /\ UseOfflinePeerDetection = TRUE + /\ PeerZKHealthy(c) + /\ clusterState[Peer(c)] # "OFFLINE" + /\ clusterState[c] \in {"AWOP", "ANISWOP"} + /\ clusterState' = [clusterState EXCEPT ![c] = "ANIS"] + /\ antiFlapTimer' = [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Reactive transition retry exhaustion. 
+ * + * Models the FailoverManagementListener (HAGroupStoreManager.java + * L653-704) where both retries of setHAGroupStatusIfNeeded() fail + * and the method returns silently. The watcher notification was + * delivered, the listener was invoked, but the local ZK write + * failed. The transition is permanently lost for this notification. + * + * This action is enabled whenever any PeerReact* action would be + * enabled (same ZK connectivity and peer-state guards). Its effect + * is to leave clusterState unchanged -- the local transition was + * not applied. TLC explores both the success path (the actual + * PeerReact actions) and this failure path non-deterministically. + * + * In the implementation, handleStateChange() updates + * lastKnownPeerState before calling notifySubscribers(). After + * retry failure, if the peer state is re-written with the same + * value, handleStateChange() suppresses the notification (same- + * state check). The model is slightly more permissive: the same + * PeerReact* action remains enabled after ReactiveTransitionFail + * (the model does not track lastKnownPeerState). This is sound + * for safety: if safety holds when the transition can non- + * deterministically succeed or fail, it holds a fortiori when + * failures are permanent. + * + * Source: FailoverManagementListener.onStateChange() + * (HAGroupStoreManager.java L653-704) -- + * 2-retry exhaustion, method returns silently. + *) + +\* Guard disjunction shared by ReactiveTransitionFail. Mirrors the +\* peer-state/local-state enabling conditions of PeerReactToATS, +\* PeerReactToANIS, PeerReactToAbTS, PeerReactToAIS, +\* PeerReactToOFFLINE, and PeerRecoverFromOFFLINE so the retry- +\* exhaustion action cannot drift from the reactive transitions +\* it shadows. Intentionally does NOT include the PeerZKHealthy(c) +\* guard: that is applied at the call site uniformly for all +\* PeerReact* actions and ReactiveTransitionFail itself. +PeerReactWouldFire(c) == + \/ /\ clusterState[Peer(c)] = "ATS" + /\ clusterState[c] \in {"S", "DS"} + \/ /\ clusterState[Peer(c)] = "ANIS" + /\ clusterState[c] \in {"S", "ATS"} + \/ /\ clusterState[Peer(c)] = "AbTS" + /\ clusterState[c] = "ATS" + \/ /\ clusterState[Peer(c)] = "AIS" + /\ clusterState[c] \in {"ATS", "DS"} + \/ /\ UseOfflinePeerDetection = TRUE + /\ clusterState[Peer(c)] = "OFFLINE" + /\ clusterState[c] \in {"AIS", "ANIS"} + \/ /\ UseOfflinePeerDetection = TRUE + /\ clusterState[Peer(c)] # "OFFLINE" + /\ clusterState[c] \in {"AWOP", "ANISWOP"} + +ReactiveTransitionFail(c) == + /\ PeerZKHealthy(c) + /\ PeerReactWouldFire(c) + /\ UNCHANGED vars + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/HDFS.tla b/src/main/tla/ConsistentFailover/HDFS.tla new file mode 100644 index 00000000000..e4c3a15eca1 --- /dev/null +++ b/src/main/tla/ConsistentFailover/HDFS.tla @@ -0,0 +1,84 @@ +---------------------------- MODULE HDFS ------------------------------------------ +(* + * HDFS availability incident actions for the Phoenix Consistent + * Failover specification. + * + * Models NameNode crash and recovery as environment incidents. + * HDFSDown(c) sets the availability flag to FALSE; per-RS writer + * degradation and ZK CAS writes are handled individually by + * WriterToStoreFwd / WriterSyncFwdToStoreFwd in Writer.tla. + * This decomposition enables modeling of the ZK CAS race where + * multiple RS on the same cluster race to update the ZK state + * and the loser gets BadVersionException -> abort. 
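+ * + * Illustrative interleaving (example only; RS names invented): the + * standby's HDFS fails; RS1 and RS2 on the active cluster both hit + * IOException and race to CAS the HA group state AIS -> ANIS in ZK. + * RS1's versioned setData wins; RS2 gets BadVersionException and + * aborts, leaving its writerMode DEAD.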
+ * + * Recovery is asymmetric: HDFSUp(c) only sets the availability + * flag. Per-RS recovery happens gradually via the forwarder path + * (WriterStoreFwdToSyncFwd), which is guarded on hdfsAvailable. + * + * Implementation traceability: + * + * TLA+ action | Java source + * ----------------+------------------------------------------------ + * HDFSDown(c) | NameNode crash; detected reactively via + * | IOException from ReplicationLog.apply() + * HDFSUp(c) | NameNode recovery; forwarder detects via + * | successful FileUtil.copy() in processFile() + * | L132-152 + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +(* + * NameNode of cluster c crashes. + * + * Sets the HDFS availability flag to FALSE. Per-RS writer + * degradation (SYNC -> S&F, SYNC_AND_FWD -> S&F) is handled + * individually by WriterToStoreFwd and WriterSyncFwdToStoreFwd + * in Writer.tla, which are guarded on hdfsAvailable[Peer(c)] + * = FALSE. Those actions also handle the AIS -> ANIS cluster + * state transition and CAS failure (-> DEAD). + * + * Any cluster's HDFS can fail at any time. Two consequences: + * 1. HDFSDown(c_standby): standby HDFS fails -> active writers + * detect via IOException and degrade (SYNC -> S&F). + * 2. HDFSDown(c_active): active cluster's own HDFS fails -> + * S&F writers on the active cluster abort (modeled by + * RSAbortOnLocalHDFSFailure in RS.tla). + * + * Pre: c's HDFS is currently available. + * Post: hdfsAvailable[c] = FALSE. + * All other variables unchanged -- per-RS effects deferred + * to writer actions (case 1) or RS.tla (case 2). + * + * Source: NameNode crash (environment event) + *) +HDFSDown(c) == + /\ hdfsAvailable[c] = TRUE + /\ hdfsAvailable' = [hdfsAvailable EXCEPT ![c] = FALSE] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * NameNode of cluster c recovers. + * + * Sets hdfsAvailable[c] = TRUE. No immediate writer effect -- + * recovery is per-RS via the forwarder path. The forwarder + * detects connectivity by successfully copying a file from + * OUT to the peer's IN directory; if throughput exceeds the + * threshold, it transitions the writer S&F -> SYNC_AND_FWD. + * + * Pre: c's HDFS is currently unavailable. + * Post: hdfsAvailable[c] = TRUE. Writer modes unchanged. + * + * Source: ReplicationLogDiscoveryForwarder.processFile() L132-152 + *) +HDFSUp(c) == + /\ hdfsAvailable[c] = FALSE + /\ hdfsAvailable' = [hdfsAvailable EXCEPT ![c] = TRUE] + /\ UNCHANGED <> + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/README.md b/src/main/tla/ConsistentFailover/README.md new file mode 100644 index 00000000000..8032a9803c7 --- /dev/null +++ b/src/main/tla/ConsistentFailover/README.md @@ -0,0 +1,380 @@ +# Phoenix Consistent Failover -- TLA+ Specification + +Formal specification of the Phoenix Consistent Failover protocol using TLA+ and the TLC model checker. The spec verifies safety properties (mutual exclusion, zero RPO, abort correctness) under arbitrary interleavings of admin actions, HDFS failures, RS crashes, ZK connection/session failures, watcher retry exhaustion, and the anti-flapping timer. + +## Literate Specification + +Literate programming versions of all specification files are available in the [`markdown/`](markdown/) directory. 
Each file includes the complete TLA+ code with comments converted to prose that discusses modeling choices, tradeoffs, and implementation traceability in depth. + +### Root Orchestrator + +| Literate Version | Source | Description | +|-----------------|--------|-------------| +| [`ConsistentFailover.md`](markdown/ConsistentFailover.md) | [`ConsistentFailover.tla`](ConsistentFailover.tla) | Init, Next, Spec, invariants, action constraints, fairness, liveness properties | +| [`SpecState.md`](markdown/SpecState.md) | [`SpecState.tla`](SpecState.tla) | Shared `VARIABLE` declarations (13 state functions) for the root and all sub-modules | + +### Pure Definitions + +| Literate Version | Source | Description | +|-----------------|--------|-------------| +| [`Types.md`](markdown/Types.md) | [`Types.tla`](Types.tla) | Constants, 14 HA group states, allowed transitions, cluster roles, writer modes, replay states, liveness state sets, writer/replay transition tables, anti-flapping timer helpers | + +### Actor Modules + +| Literate Version | Source | Description | +|-----------------|--------|-------------| +| [`HAGroupStore.md`](markdown/HAGroupStore.md) | [`HAGroupStore.tla`](HAGroupStore.tla) | Peer-reactive transitions, auto-completion, S&F heartbeat, ANIS recovery, ANISTS drain, retry exhaustion | +| [`Admin.md`](markdown/Admin.md) | [`Admin.tla`](Admin.tla) | Operator-initiated failover (AIS->ATS, ANIS->ANISTS) and abort (STA->AbTS) | +| [`Writer.md`](markdown/Writer.md) | [`Writer.tla`](Writer.tla) | Per-RS writer mode state machine: startup, degradation (CAS success/failure), recovery, drain | +| [`Reader.md`](markdown/Reader.md) | [`Reader.tla`](Reader.tla) | Standby replay state machine: advance, rewind, in-progress directory, failover trigger | + +### Environment Modules + +| Literate Version | Source | Description | +|-----------------|--------|-------------| +| [`HDFS.md`](markdown/HDFS.md) | [`HDFS.tla`](HDFS.tla) | NameNode crash/recovery environment actions | +| [`RS.md`](markdown/RS.md) | [`RS.tla`](RS.tla) | RS crash, local HDFS abort, process supervisor restart | +| [`Clock.md`](markdown/Clock.md) | [`Clock.tla`](Clock.tla) | Anti-flapping countdown timer (Lamport CHARME 2005) | +| [`ZK.md`](markdown/ZK.md) | [`ZK.tla`](ZK.tla) | ZK peer/local connection lifecycle, session expiry/recovery, ATS reconciliation | + +### TLC Configurations + +| Literate Version | Source | Description | +|-----------------|--------|-------------| +| [`ConsistentFailover-cfg.md`](markdown/ConsistentFailover-cfg.md) | [`ConsistentFailover.cfg`](ConsistentFailover.cfg) | Exhaustive safety: 2 clusters, 2 RS, full state-space exploration | +| [`ConsistentFailover-sim-cfg.md`](markdown/ConsistentFailover-sim-cfg.md) | [`ConsistentFailover-sim.cfg`](ConsistentFailover-sim.cfg) | Simulation safety: 2 clusters, 9 RS, random trace sampling | +| [`ConsistentFailover-sim-liveness-ac-cfg.md`](markdown/ConsistentFailover-sim-liveness-ac-cfg.md) | [`ConsistentFailover-sim-liveness-ac.cfg`](ConsistentFailover-sim-liveness-ac.cfg) | AbortCompletion liveness (5 fairness clauses) | +| [`ConsistentFailover-sim-liveness-fc-cfg.md`](markdown/ConsistentFailover-sim-liveness-fc-cfg.md) | [`ConsistentFailover-sim-liveness-fc.cfg`](ConsistentFailover-sim-liveness-fc.cfg) | FailoverCompletion liveness (15 fairness clauses) | +| [`ConsistentFailover-sim-liveness-dr-cfg.md`](markdown/ConsistentFailover-sim-liveness-dr-cfg.md) | [`ConsistentFailover-sim-liveness-dr.cfg`](ConsistentFailover-sim-liveness-dr.cfg) | DegradationRecovery 
liveness (25 fairness clauses) | + +## Solution Design + +Phoenix clusters are deployed in pairs across distinct failure domains. The Consistent Failover protocol provides zero-RPO failover between a Primary (active) and Standby cluster using Phoenix Synchronous Replication. Every committed mutation on the active cluster is synchronously written to a replication log file on the standby cluster's HDFS before the mutation is acknowledged. A set of replay threads on the standby asynchronously consumes these log files round-by-round, applying changes to local HBase tables so the standby remains close to in-sync with the active. + +``` + ┌──────────────────────────────────────────────────────────────────┐ + │ Active Cluster (FD 1) │ + │ │ + │ Admin ──► HAGroupStoreManager ─ setData (CAS) ─► ZK Quorum 1 │ + │ ▲ │ + │ HAGroupStoreClient ── watch (local) ────────┘ │ + │ · │ + │ · watch (peer) ···············► ZK Quorum 2 │ + │ (remote) │ + │ │ + │ ReplicationLogWriter (per RS) │ + │ ├── SYNC ─────────────────────────► Standby HDFS / IN │ + │ └── STORE_AND_FORWARD ──► HDFS / OUT │ + │ │ │ + │ Forwarder ──► Standby HDFS / IN │ + └──────────────────────────────────────────────────────────────────┘ + + ┌──────────────────────────────────────────────────────────────────┐ + │ Standby Cluster (FD 2) │ + │ │ + │ HAGroupStoreManager ─ setData (CAS) ─► ZK Quorum 2 │ + │ ▲ │ + │ HAGroupStoreClient ── watch (local) ────────┘ │ + │ · │ + │ · watch (peer) ···············► ZK Quorum 1 │ + │ (remote) │ + │ │ + │ HDFS / IN ──► ReplicationLogReader ──► HBase Tables │ + │ (round-by-round replay) │ + └──────────────────────────────────────────────────────────────────┘ +``` + +### Actors + +| Actor | Role | +|-------|------| +| **Admin** | Human operator; initiates or aborts failover via CLI | +| **HAGroupStoreManager** | Per-cluster coprocessor endpoint; automates peer-reactive state transitions via `FailoverManagementListener` with up to 2 retries | +| **ReplicationLogWriter** | Per-RegionServer on the active cluster; captures mutations post-WAL-commit and writes them to standby HDFS (`SYNC` mode) or local HDFS (`STORE_AND_FORWARD` mode) | +| **ReplicationLogReader** | On the standby cluster; replays replication logs round-by-round, manages the consistency point, and triggers the final STA-to-AIS transition | +| **HAGroupStoreClient** | Per-RS ZK interaction layer; caches state, enforces anti-flapping, validates transitions | + +### State Machines + +The protocol is governed by six interrelated state machines, all modeled in this specification: + +**HA Group State (14 states).** +Each cluster's lifecycle: `ACTIVE_IN_SYNC`, `ACTIVE_NOT_IN_SYNC`, `ACTIVE_IN_SYNC_TO_STANDBY`, `ACTIVE_NOT_IN_SYNC_TO_STANDBY`, `STANDBY`, `STANDBY_TO_ACTIVE`, `DEGRADED_STANDBY`, `ABORT_TO_ACTIVE_IN_SYNC`, `ABORT_TO_ACTIVE_NOT_IN_SYNC`, `ABORT_TO_STANDBY`, `ACTIVE_WITH_OFFLINE_PEER` (reachable when `UseOfflinePeerDetection = TRUE`), `ACTIVE_NOT_IN_SYNC_WITH_OFFLINE_PEER` (reachable when `UseOfflinePeerDetection = TRUE`), `OFFLINE` (reachable when `UseOfflinePeerDetection = TRUE`), `UNKNOWN`. States map to roles visible to clients: ACTIVE (serves reads/writes), ACTIVE_TO_STANDBY (mutations blocked), `STANDBY`, `STANDBY_TO_ACTIVE`, `OFFLINE`, `UNKNOWN`. + +**Replication Writer Mode (4 modes per RS).** +`INIT` (pre-initialization), `SYNC` (writing directly to standby HDFS), `STORE_AND_FORWARD` (writing locally when standby is unavailable), `SYNC_AND_FORWARD` (draining local queue while also writing synchronously). 
A write error in `STORE_AND_FORWARD` mode triggers RS abort (fail-stop). + +**Replication Replay State (4 states per cluster).** +`NOT_INITIALIZED`, `SYNC` (fully in sync), `DEGRADED` (active peer in `ANIS`; `lastRoundInSync` frozen), `SYNCED_RECOVERY` (active returned to `AIS`; replay rewinds to `lastRoundInSync`). + +**Combined Product State.** +The (ActiveClusterState, StandbyClusterState) pair progresses through a well-defined sequence during failover. AIS path: `(AIS,S)` -> `(ATS,S)` -> `(ATS,STA)` -> `(ATS,AIS)` -> `(S,AIS)`. ANIS path: `(ANIS,DS)` -> `(ANISTS,DS)` -> `(ATS,DS)` -> `(ATS,STA)` -> `(ATS,AIS)` -> `(S,AIS)`. + +**Anti-Flapping Timer.** +A countdown timer (Lamport CHARME 2005) gates the `ANIS` -> `AIS` transition: the OUT directory must remain empty and all RS must be in SYNC for a configurable number of ticks before the cluster may return to `ACTIVE_IN_SYNC`. + +**ZK Connection/Session Lifecycle.** +Three per-cluster booleans model ZK health: peer connection state, peer session liveness, and local connection state. Peer-reactive transitions are guarded on peer connectivity and session liveness. Auto-completion, heartbeat, and writer ZK writes are guarded on local connectivity. Watcher-driven transitions may be permanently lost after retry exhaustion. + +### Coordination via ZooKeeper + +Each cluster stores its own HA group state as a ZooKeeper znode. Peer clusters observe each other's state changes via Curator watchers connected to the remote ZK quorum. State updates use ZK's versioned `setData` via optimistic CAS locking. A writer reads the current version, computes a new state, and writes with the read version. A `BadVersionException` triggers re-read and retry. + +The final failover step is two independent ZK writes. The new active writes `ACTIVE_IN_SYNC` to its own ZK, then the old active's `FailoverManagementListener` reactively writes `STANDBY` to its own ZK. Safety during this window is maintained because the old active is in `ACTIVE_IN_SYNC_TO_STANDBY`, a role that blocks all client mutations. + +### Failover Sequence + +1. **Initiation.** Admin transitions the active cluster from `AIS` to `ATS` (AIS failover path) or from `ANIS` to `ANISTS` (ANIS failover path). The `ACTIVE_TO_STANDBY` role blocks all mutations. On the ANIS path, the forwarder must drain the OUT directory and the anti-flapping gate must open before `ANISTS` advances to `ATS`. + +2. **Standby detection.** The standby's `FailoverManagementListener` detects the peer's ATS via a ZK watcher and reactively transitions the standby from `S` (or `DS` on the ANIS path) to `STA`. This sets `failoverPending = true`. + +3. **Log replay.** The standby replays all outstanding replication logs. The failover trigger requires: `failoverPending`, in-progress directory empty, and no new files in the time window (`lastRoundProcessed >= lastRoundInSync`). + +4. **Activation.** The standby writes `ACTIVE_IN_SYNC` to its own ZK. + +5. **Completion.** The old active's listener detects the peer's AIS and reactively writes `STANDBY` to its own ZK. + +### Why Formal Verification + +Exhaustive model checking with TLC explores every reachable state under all possible interleavings of these actors and failure modes, proving that the crucial safety properties of mutual exclusion, zero RPO, and abort correctness hold universally. 
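+ +To make the shape of these properties concrete, a state invariant is a single predicate that TLC evaluates in every state it generates, so any violation surfaces as a concrete counterexample trace. Here is a minimal sketch of the mutual-exclusion invariant in terms of the `RoleOf` mapping -- the authoritative definition lives in `ConsistentFailover.tla` and may be phrased differently: + +``` +\* Sketch only: at most one cluster holds the ACTIVE role at any time. +MutualExclusion == + Cardinality({c \in Cluster : RoleOf(clusterState[c]) \in ActiveRoles}) <= 1 +```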
+ +The exhaustive model check verifies the following over the full reachable state space: + +**State invariants** (hold in every reachable state): + +| Invariant | Property | +|-----------|----------| +| `TypeOK` | All 13 variables have valid types | +| `MutualExclusion` | At most one cluster in the ACTIVE role at any time | +| `AbortSafety` | AbTAIS requires peer in AbTS, S, DS, or OFFLINE (abort, post-partition reconciliation, or offline peer) | +| `AISImpliesInSync` | AIS implies outDirEmpty and all RS in SYNC/INIT/DEAD | +| `WriterClusterConsistency` | Degraded writer modes only on non-AIS active or transitional states | +| `ZKSessionConsistency` | Peer session expiry implies peer disconnection | + +**Action constraints** (hold on every state transition): + +| Constraint | Property | +|------------|----------| +| `TransitionValid` | Every cluster state change follows `AllowedTransitions` | +| `WriterTransitionValid` | Every writer mode change follows `AllowedWriterTransitions` | +| `ReplayTransitionValid` | Every replay state change follows `AllowedReplayTransitions` | +| `AIStoATSPrecondition` | AIS->ATS requires outDirEmpty and all RS in SYNC/DEAD | +| `AntiFlapGate` | ANIS->AIS blocked while anti-flapping timer is positive | +| `ANISTStoATSPrecondition` | ANISTS->ATS requires outDirEmpty and anti-flapping gate open | +| `FailoverTriggerCorrectness` | STA->AIS requires failoverPending, inProgressDirEmpty, replayState=SYNC | +| `NoDataLoss` | Zero RPO: STA->AIS only when replay is complete | +| `ReplayRewindCorrectness` | SYNCED_RECOVERY->SYNC equalizes replay counters (lastRoundProcessed = lastRoundInSync) | + +**Liveness properties** (verified via simulation with per-property fairness): + +Liveness properties guarantee progress. Transient states eventually resolve to stable states under fair scheduling. Each property is checked with a per-property fairness formula containing only the temporal clauses on its critical path. The full `Fairness` formula has 43 temporal clauses; TLC's Büchi automaton construction is exponential in the size of the liveness formula, so checking the full conjunction directly would blow up. Per-property formulas keep this tractable. Exhaustive liveness checking is infeasible at this state-space scale because TLC's SCC algorithm requires the full product graph (behavior graph x automaton) in memory, so probabilistic assurance is provided through simulation. + +| Property | Clauses | Description | +|----------|---------|-------------| +| `FailoverCompletion` | 15 | Standby-side and abort transient states (`STA`, `AbTAIS`, `AbTANIS`, `AbTS`) eventually resolve to a stable state (`AIS`, `ANIS`, `S`). ATS/ANISTS excluded: resolution depends on peer state and ZK connectivity at the right moment, with no fairness on admin actions or ZK disconnect. | +| `DegradationRecovery` | 25 | ANIS with available peer HDFS eventually progresses out of ANIS via the writer recovery chain: `S&F` -> `S&FWD` -> `SYNC`, anti-flap timer expires, `ANIS` -> `AIS`. May also leave ANIS via failover (`ANIS` -> `ANISTS`). | +| `AbortCompletion` | 5 | Every abort state (`AbTS`, `AbTAIS`, `AbTANIS`) eventually auto-completes to a stable state (`AIS`, `ANIS`, `S`). Deterministic under WF on `AutoComplete`.
| + +## Module Architecture + +``` +ConsistentFailover.tla (root orchestrator: Init, Next, SafetySpec/Spec, invariants) + | + +-- SpecState.tla (shared VARIABLE declarations for root + all sub-modules) + | + +-- Types.tla (pure definitions: states, transitions, roles, helpers) + | + +-- HAGroupStore.tla (peer-reactive transitions, auto-completion, retry exhaustion) + +-- Admin.tla (operator-initiated failover and abort) + +-- Writer.tla (per-RS replication writer mode state machine) + +-- Reader.tla (standby-side replication replay state machine) + +-- HDFS.tla (HDFS NameNode crash/recovery) + +-- RS.tla (RS crash, local HDFS abort, process supervisor restart) + +-- Clock.tla (anti-flapping countdown timer) + +-- ZK.tla (ZK connection/session lifecycle) +``` + +All sub-modules extend `SpecState.tla` and `Types.tla` for shared variables and definitions. `ConsistentFailover.tla` composes them via `INSTANCE`. + +## Modules + +| Module | Description | +|--------|-------------| +| `SpecState.tla` | Declares the 13 specification variables once; root and sub-modules `EXTEND SpecState, Types`. | +| `Types.tla` | Pure definitions: 14 HA group states, allowed transitions, cluster roles, writer modes, replay states, liveness state sets, writer/replay transition tables, anti-flapping timer helpers. No variables. | +| `ConsistentFailover.tla` | Root orchestrator. Defines Init/Next/SafetySpec/Spec, instances sub-modules, defines all invariants and action constraints. | +| `HAGroupStore.tla` | 11 action schemas. Peer-reactive transitions (`PeerReactToATS`, `PeerReactToANIS`, `PeerReactToAbTS`, `PeerReactToAIS`), local auto-completion (`AutoComplete`), STORE_AND_FORWARD heartbeat (`ANISHeartbeat`), ANIS recovery (`ANISToAIS`), ANISTS drain completion (`ANISTSToATS`), retry exhaustion (`ReactiveTransitionFail`), peer OFFLINE detection (`PeerReactToOFFLINE`, `PeerRecoverFromOFFLINE`). ATS->S transitions include writer lifecycle reset (live writers reset to INIT, OUT directory cleared; DEAD writers preserved for RSRestart). S-entry actions atomically set `replayState = SYNCED_RECOVERY` (recoveryListener fold); DS-entry sets `replayState = DEGRADED` (degradedListener fold). All peer-reactive actions guarded on `zkPeerConnected` and `zkPeerSessionAlive`; OFFLINE lifecycle peer-reactive actions additionally guarded on `UseOfflinePeerDetection`. Auto-completion, heartbeat, recovery, and drain completion guarded on `zkLocalConnected`. | +| `Admin.tla` | `AdminStartFailover` (AIS->ATS or ANIS->ANISTS with peer-state guard), `AdminAbortFailover` (STA->AbTS, clears failoverPending), `AdminGoOffline` (S/DS->OFFLINE, gated on `UseOfflinePeerDetection`), and `AdminForceRecover` (OFFLINE->S, gated on `UseOfflinePeerDetection`). | +| `Writer.tla` | Per-RS writer mode transitions: startup (`WriterInit`, `WriterInitToStoreFwd`, `WriterInitToStoreFwdFail`), degradation (`WriterToStoreFwd`, `WriterToStoreFwdFail`, `WriterSyncFwdToStoreFwd`, `WriterSyncFwdToStoreFwdFail`), recovery (`WriterSyncToSyncFwd`, `WriterStoreFwdToSyncFwd`), drain complete (`WriterSyncFwdToSync`). ZK-writing actions guarded on `zkLocalConnected`. | +| `Reader.tla` | Replay advance (SYNC and DEGRADED), rewind, in-progress directory dynamics, failover trigger (`TriggerFailover` guarded on `zkLocalConnected`). Listener effects (degradedListener, recoveryListener) are folded into HAGroupStore S/DS-entry actions. | +| `HDFS.tla` | `HDFSDown` and `HDFSUp` -- environment actions for NameNode crash/recovery. 
| +| `RS.tla` | `RSCrash` (any mode->DEAD), `RSAbortOnLocalHDFSFailure` (STORE_AND_FORWARD->DEAD when own HDFS down), `RSRestart` (DEAD->INIT via process supervisor). | +| `Clock.tla` | `Tick` -- advances all per-cluster anti-flapping countdown timers by one tick toward zero. Guarded: only fires when at least one timer is positive. | +| `ZK.tla` | Peer ZK lifecycle (`ZKPeerDisconnect`, `ZKPeerReconnect`, `ZKPeerSessionExpiry`, `ZKPeerSessionRecover`) and local ZK lifecycle (`ZKLocalDisconnect`, `ZKLocalReconnect`). `ZKPeerReconnect` and `ZKPeerSessionRecover` fold a post-abort ATS reconciliation: when local = ATS and peer in {S, DS} at reconnect, `clusterState` is atomically set to AbTAIS (auto-completes to AIS via `AutoComplete`). This resolves the stuck-ATS scenario after abort during inter-cluster partition. | + +**Total: 42 action schemas** (some parameterized over cluster and RS). + +## Variables + +| Variable | Type | Source | +|----------|------|--------| +| `clusterState[c]` | `[Cluster -> HAGroupState]` | HAGroupStoreRecord per-cluster ZK znode | +| `writerMode[c][rs]` | `[Cluster -> [RS -> WriterMode]]` | ReplicationLogGroup per-RS mode | +| `outDirEmpty[c]` | `[Cluster -> BOOLEAN]` | Replication OUT directory state | +| `hdfsAvailable[c]` | `[Cluster -> BOOLEAN]` | NameNode availability (detected via IOException) | +| `antiFlapTimer[c]` | `[Cluster -> 0..WaitTimeForSync]` | Countdown timer (Lamport CHARME 2005) | +| `replayState[c]` | `[Cluster -> ReplayStateSet]` | Standby replay state (NOT_INITIALIZED/SYNC/DEGRADED/SYNCED_RECOVERY) | +| `lastRoundInSync[c]` | `[Cluster -> Nat]` | Last replay round processed while in SYNC | +| `lastRoundProcessed[c]` | `[Cluster -> Nat]` | Last replay round processed (any state) | +| `failoverPending[c]` | `[Cluster -> BOOLEAN]` | STA notification received, waiting for replay completion | +| `inProgressDirEmpty[c]` | `[Cluster -> BOOLEAN]` | No partially-processed replication log files | +| `zkPeerConnected[c]` | `[Cluster -> BOOLEAN]` | peerPathChildrenCache TCP connection state | +| `zkPeerSessionAlive[c]` | `[Cluster -> BOOLEAN]` | Peer ZK session liveness | +| `zkLocalConnected[c]` | `[Cluster -> BOOLEAN]` | pathChildrenCache (local) connection; maps to `isHealthy` | + +## Configuration + +Five TLC configurations are provided: + +### Safety Checking + +#### Exhaustive (`ConsistentFailover.cfg`) + +| Parameter | Value | Notes | +|-----------|-------|-------| +| `Cluster` | `{c1, c2}` | Exactly 2 clusters forming the HA pair | +| `RS` | `{rs1, rs2}` | 2 region servers per cluster | +| `WaitTimeForSync` | `2` | Anti-flapping timer ticks (small value sufficient for verification) | +| `UseOfflinePeerDetection` | `FALSE` | Feature gate for proactive AWOP/ANISWOP modeling; set `TRUE` to verify OFFLINE peer detection | +| Symmetry | `Permutations(RS)` | RS identifiers are interchangeable; clusters are asymmetric (AIS vs S at Init) | +| State constraint | `lastRoundProcessed[c] <= 3` | Bounds replay counters for tractability | +| Specification | `SafetySpec` | `Init /\ [][Next]_vars` (no fairness) | + +#### Simulation (`ConsistentFailover-sim.cfg`) + +| Parameter | Value | Notes | +|-----------|-------|-------| +| `Cluster` | `{c1, c2}` | Exactly 2 clusters forming the HA pair | +| `RS` | `{rs1, rs2, ..., rs9}` | 9 region servers per cluster (production-scale per-RS interleaving) | +| `WaitTimeForSync` | `5` | Larger window exercises richer interleavings during anti-flapping wait | +| Symmetry | None | No benefit for random trace sampling | +| State 
constraint | None | Counters grow organically along each trace | +| Specification | `SafetySpec` | `Init /\ [][Next]_vars` (no fairness) | + +### Liveness Checking (Per-Property Simulation) + +All runs use 2 clusters, 2 RS per cluster, WaitTimeForSync=2, depth 10000. See the liveness properties table above for property descriptions and critical paths. + +| Config File | Specification | Property | Fairness Clauses | +|-------------|---------------|----------|-----------------| +| `ConsistentFailover-sim-liveness-ac.cfg` | `SpecAC` | `AbortCompletion` | 5 | +| `ConsistentFailover-sim-liveness-fc.cfg` | `SpecFC` | `FailoverCompletion` | 15 | +| `ConsistentFailover-sim-liveness-dr.cfg` | `SpecDR` | `DegradationRecovery` | 25 | + +### Common Parameters + +The initial state is deterministic: one cluster starts in AIS, the other in S. All writers start in INIT, all HDFS available, all ZK connections alive, anti-flapping timers at zero, replay in SYNCED_RECOVERY (standby) / NOT_INITIALIZED (active). + +## Running + +Requires JDK 11+ (JDK 17 recommended). The `tla2tools.jar` must be on the classpath. + +**Syntax check (SANY):** + +``` +java -cp tla2tools.jar tla2sany.SANY ConsistentFailover.tla +``` + +### Safety Checking + +**Exhaustive model check (TLC):** + +``` +java -XX:+UseParallelGC \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla -config ConsistentFailover.cfg \ + -workers auto -cleanup +``` + +**Simulation (8-hour random trace sampling):** + +``` +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=28800 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla -config ConsistentFailover-sim.cfg \ + -simulate -depth 10000 -workers auto +``` + +Simulation generates random execution traces up to depth 10000 (sufficient for ~100 complete failover cycles with 9 RS). The 9-RS model is too large for exhaustive search but ideal for simulation: 42 action schemas × 9 RS create a high branching factor that simulation samples efficiently. The `-Dtlc2.TLC.stopAfter=28800` flag limits the run to 8 hours. + +### Liveness Checking + +**All 3 properties (8-hour run each):** + +```bash +for prop in ac fc dr; do + java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=28800 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla \ + -config ConsistentFailover-sim-liveness-${prop}.cfg \ + -simulate -depth 10000 -workers auto +done +``` + +**Single property (example: AbortCompletion):** + +``` +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=28800 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla \ + -config ConsistentFailover-sim-liveness-ac.cfg \ + -simulate -depth 10000 -workers auto +``` + +## Latest Results + +### Exhaustive + +| Metric | Value | +|--------|-------| +| Configuration | 2 clusters, 2 RS per cluster, WaitTimeForSync=2 | +| Workers | 16 | +| States generated | 2,718,437,761 | +| Distinct states | 170,978,688 | +| Depth | 55 | +| Duration | 24 min 19 sec | +| Date | 2026-04-21 | +| Result | Success | + +All 6 state invariants and 9 action constraints verified. No violations.
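+ +These figures are for the feature gate at its default, `UseOfflinePeerDetection = FALSE` (per the exhaustive configuration table above). To also cover the OFFLINE lifecycle, flip the gate in the config. A hypothetical sketch of the relevant `ConsistentFailover.cfg` entries -- the shipped file may name and order entries differently; the gate value is the only material change: + +``` +SPECIFICATION SafetySpec +INVARIANTS TypeOK MutualExclusion AbortSafety +CONSTANTS + Cluster = {c1, c2} + RS = {rs1, rs2} + WaitTimeForSync = 2 + UseOfflinePeerDetection = TRUE +``` + +With the gate enabled, the AWOP, ANISWOP, and OFFLINE states become reachable, so the explored state space is strictly larger than in the run reported above.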
+ +### Simulation + +| Metric | Value | +|--------|-------| +| Configuration | 2 clusters, 9 RS per cluster, WaitTimeForSync=5 | +| Workers | 128 | +| States checked | 70,448,924,768 | +| Traces generated | 7,044,645 | +| Trace length | 10,000 | +| Seed | -5836228587005873350 | +| Duration | 8 hr | +| Date | 2026-04-21 | +| Result | Success | + +All 6 state invariants and 9 action constraints verified at production-scale RS count. No violations. + +### Liveness (Per-Property Simulation) + +All runs: 2 clusters, 2 RS per cluster, WaitTimeForSync=2, 128 workers, depth 10000, 1 hr. + +| Property | Config | States Checked | Traces | Seed | Date | Result | +|----------|--------|---------------|--------|------|------|--------| +| `AbortCompletion` | `SpecAC` | 2,671,855,331 | 267,165 | -3508296420780792285 | 2026-04-21 | Success | +| `DegradationRecovery` | `SpecDR` | 454,322,354 | 45,430 | 650174316504703997 | 2026-04-21 | Success | +| `FailoverCompletion` | `SpecFC` | 1,107,505,130 | 110,740 | 3654016485672320894 | 2026-04-21 | Success | + +All 3 liveness properties verified. No violations. diff --git a/src/main/tla/ConsistentFailover/RS.tla b/src/main/tla/ConsistentFailover/RS.tla new file mode 100644 index 00000000000..0ff9933bf3d --- /dev/null +++ b/src/main/tla/ConsistentFailover/RS.tla @@ -0,0 +1,117 @@ +------------------------------ MODULE RS ---------------------------------------- +(* + * RegionServer lifecycle actions for the Phoenix Consistent + * Failover specification. + * + * Models RS crash (fail-stop) and process supervisor restart. + * + * Crash modeling: An RS can crash at any time (JVM crash, OOM, kill + * signal, process supervisor termination). The crash sets writerMode + * to DEAD but does not change clusterState -- the HA group state in + * ZK is independent of RS process lifecycle. A special-case crash, + * RSAbortOnLocalHDFSFailure, models the abort triggered when the + * active cluster's own HDFS fails while the writer is in + * STORE_AND_FWD mode (writing to local HDFS). + * + * Restart modeling: When an RS dies (writer mode DEAD), the process + * supervisor (Kubernetes/YARN) detects the dead pod and creates a + * new one. HBase assigns regions and the writer re-initializes in + * INIT mode, ready to follow the normal startup path (WriterInit + * or WriterInitToStoreFwd). + * + * Implementation traceability: + * + * TLA+ action | Java source + * --------------------------------+---------------------------------- + * RSCrash(c, rs) | JVM crash, OOM, kill signal, + * | process supervisor termination + * RSAbortOnLocalHDFSFailure(c,rs) | StoreAndForwardModeImpl + * | .onFailure() L115-123 -> + * | logGroup.abort() + * RSRestart(c, rs) | Kubernetes/YARN pod restart -> + * | HBase RS startup -> + * | ReplicationLogGroup + * | .initializeReplicationMode() + * | -> setMode(SYNC) or + * | setMode(STORE_AND_FORWARD) + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +(* + * Process supervisor restarts a dead RS: DEAD -> INIT. + * + * The restarted RS enters INIT mode. Subsequent writer actions + * (WriterInit or WriterInitToStoreFwd) handle the actual mode + * initialization based on HDFS availability and cluster state. + * + * Pre: writerMode[c][rs] = "DEAD". + * Post: writerMode[c][rs] = "INIT".
+ * + * Source: Kubernetes/YARN pod restart -> HBase RS startup -> + * ReplicationLogGroup.initializeReplicationMode() + *) +RSRestart(c, rs) == + /\ writerMode[c][rs] = "DEAD" + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "INIT"] + /\ UNCHANGED <<clusterVars, replayVars, envVars>> + +--------------------------------------------------------------------------- + +(* + * Non-deterministic RS crash: any mode -> DEAD. + * + * Models general RS failure (JVM crash, OOM, killed by process + * supervisor, etc.). The RS can crash at any time regardless of + * writer mode. The crash does not change clusterState -- the HA + * group state in ZK is independent of RS process lifecycle. + * + * Pre: writerMode[c][rs] /= "DEAD" (RS is alive). + * Post: writerMode[c][rs] = "DEAD". + * + * Source: JVM crash, OOM, kill signal, process supervisor + * termination -- environment event. + *) +RSCrash(c, rs) == + /\ writerMode[c][rs] /= "DEAD" + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"] + /\ UNCHANGED <<clusterVars, replayVars, envVars>> + +--------------------------------------------------------------------------- + +(* + * RS abort on local HDFS failure: STORE_AND_FWD -> DEAD. + * + * In STORE_AND_FWD mode, the writer targets the active cluster's + * own (local/fallback) HDFS. If that HDFS fails, + * StoreAndForwardModeImpl.onFailure() treats the error as fatal + * and calls logGroup.abort(), killing the RS. + * + * This is distinct from HDFSDown(c) (which models the *peer's* + * HDFS failing and degrades writers on the active side): + * RSAbortOnLocalHDFSFailure models the active cluster's *own* + * HDFS failing while the RS is already in fallback mode. + * + * Note: hdfsAvailable[c] is the cluster's OWN HDFS, not Peer(c). + * RSes in SYNC or SYNC_AND_FWD write to the peer's HDFS, so they + * are not affected by their own cluster's HDFS failure. + * + * Depends on HDFSDown in HDFS.tla allowing any cluster's HDFS to + * fail, so that hdfsAvailable[c] = FALSE is reachable for active + * clusters. + * + * Pre: writerMode[c][rs] = "STORE_AND_FWD" and + * hdfsAvailable[c] = FALSE (own HDFS is down). + * Post: writerMode[c][rs] = "DEAD". + * + * Source: StoreAndForwardModeImpl.onFailure() L115-123 -> + * logGroup.abort() + *) +RSAbortOnLocalHDFSFailure(c, rs) == + /\ writerMode[c][rs] = "STORE_AND_FWD" + /\ hdfsAvailable[c] = FALSE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"] + /\ UNCHANGED <<clusterVars, replayVars, envVars>> + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/Reader.tla b/src/main/tla/ConsistentFailover/Reader.tla new file mode 100644 index 00000000000..f2fa3da0d4a --- /dev/null +++ b/src/main/tla/ConsistentFailover/Reader.tla @@ -0,0 +1,199 @@ +-------------------------- MODULE Reader ---------------------------------------- +(* + * Replication replay state machine for the Phoenix Consistent + * Failover specification. + * + * The standby cluster's reader replays replication logs round-by-round, + * tracking two counters (lastRoundProcessed, lastRoundInSync) and a + * replay state that determines how the counters advance. + * + * REPLAY STATE SEMANTICS: + * SYNC: Both counters advance together (in-sync replay). + * DEGRADED: Only lastRoundProcessed advances; lastRoundInSync + * is frozen (degraded replay). + * SYNCED_RECOVERY: Rewinds lastRoundProcessed to lastRoundInSync, + * then CAS-transitions to SYNC. + * NOT_INITIALIZED: Pre-init on the active side; transitions to + * SYNCED_RECOVERY on first S entry after failover.
+ * + * LISTENER EFFECTS: The degradedListener and recoveryListener use + * unconditional .set() (not .compareAndSet()). These fire + * synchronously on the local PathChildrenCache event thread during + * the cluster state transition and are modeled as atomic with the + * triggering state-entry actions in HAGroupStore.tla: + * - S entry: set(SYNCED_RECOVERY) -- folded into PeerReactToAIS, + * PeerReactToANIS (ATS->S), AutoComplete (AbTS->S) + * - DS entry: set(DEGRADED) -- folded into PeerReactToANIS (S->DS) + * + * CAS SEMANTICS: The SYNCED_RECOVERY -> SYNC transition uses + * compareAndSet(SYNCED_RECOVERY, SYNC) at L332-333. The CAS can + * only fail if a concurrent set(DEGRADED) fires first (the cluster + * re-degrades before replay() can CAS). TLC's interleaving semantics + * model this race: either ReplayRewind fires first (CAS succeeds) + * or the DS-entry fold in PeerReactToANIS fires first (state becomes + * DEGRADED, ReplayRewind is no longer enabled). + * + * Implementation traceability: + * + * TLA+ action | Java source + * --------------------------+-------------------------------------------- + * ReplayAdvance(c) | replay() L336-343 (SYNC) and L345-351 + * | (DEGRADED) -- round processing loop + * ReplayRewind(c) | replay() L323-333 -- + * | compareAndSet(SYNCED_RECOVERY, SYNC); + * | getFirstRoundToProcess() rewinds to + * | lastRoundInSync (L389) + * ReplayBeginProcessing(c) | replay() round processing start -- + * | in-progress files created when a + * | round is picked up for processing + * ReplayFinishProcessing(c) | replay() round processing end -- + * | in-progress files cleaned up after + * | round is fully processed + * TriggerFailover(c) | shouldTriggerFailover() L500-533 (guards); + * | triggerFailover() L535-548 (effect); + * | setHAGroupStatusToSync() L341-355 + * | (ZK write) + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +(* + * Replay advance: round processing in SYNC or DEGRADED state. + * + * The reader processes the next round of replication logs. + * - SYNC: both lastRoundProcessed and lastRoundInSync advance, + * maintaining the invariant that they are equal. + * - DEGRADED: only lastRoundProcessed advances; lastRoundInSync + * is frozen, modeling degraded replay where rounds are processed + * but the in-sync consistency point does not advance. + * + * Guard: cluster is in a standby state or STA (replay continues + * during failover pending) and replay is in SYNC or DEGRADED. + * + * Source: replay() L336-343 (SYNC), L345-351 (DEGRADED) + *) +ReplayAdvance(c) == + /\ clusterState[c] \in StandbyStates \union {"STA"} + /\ replayState[c] \in {"SYNC", "DEGRADED"} + /\ lastRoundProcessed' = [lastRoundProcessed EXCEPT ![c] = @ + 1] + /\ lastRoundInSync' = [lastRoundInSync EXCEPT ![c] = + IF replayState[c] = "SYNC" THEN @ + 1 ELSE @] + /\ UNCHANGED <<writerVars, clusterVars, envVars, replayState>> + +--------------------------------------------------------------------------- + +(* + * Replay rewind and CAS to SYNC from SYNCED_RECOVERY. + * + * In SYNCED_RECOVERY, replay() rewinds lastRoundProcessed to + * lastRoundInSync (via getFirstRoundToProcess() at L389), then + * attempts compareAndSet(SYNCED_RECOVERY, SYNC) at L332-333. + * + * The CAS can only fail if a concurrent set(DEGRADED) fires first + * (the cluster re-degrades before replay() can CAS).
TLC's + * interleaving semantics model this race naturally: either this + * action fires (CAS succeeds, state becomes SYNC) or the DS-entry + * fold in PeerReactToANIS fires first (state becomes DEGRADED, + * this action is no longer enabled). + * + * Source: replay() L323-333 -- compareAndSet(SYNCED_RECOVERY, SYNC); + * getFirstRoundToProcess() L389 -- rewinds to lastRoundInSync + *) +ReplayRewind(c) == + /\ replayState[c] = "SYNCED_RECOVERY" + /\ replayState' = [replayState EXCEPT ![c] = "SYNC"] + /\ lastRoundProcessed' = [lastRoundProcessed EXCEPT ![c] = lastRoundInSync[c]] + /\ UNCHANGED <<writerVars, clusterVars, envVars, lastRoundInSync>> + +--------------------------------------------------------------------------- + +(* + * Begin round processing: in-progress directory becomes non-empty. + * + * When the reader picks up a new round for processing, it creates + * in-progress files in the IN-PROGRESS directory. This makes the + * directory non-empty, blocking the failover trigger until + * processing completes. + * + * Guard: cluster is in a standby state or STA (replay continues + * during failover pending -- the replay() loop does not stop when + * the cluster enters STA) and the in-progress directory is + * currently empty. + * + * Source: replay() L307-310 -- getFirstRoundToProcess() returns a + * round; processing begins, creating in-progress files. + *) +ReplayBeginProcessing(c) == + /\ clusterState[c] \in StandbyStates \union {"STA"} + /\ inProgressDirEmpty[c] = TRUE + /\ inProgressDirEmpty' = [inProgressDirEmpty EXCEPT ![c] = FALSE] + /\ UNCHANGED <<writerVars, replayVars, envVars, clusterState, outDirEmpty, antiFlapTimer, failoverPending>> + +--------------------------------------------------------------------------- + +(* + * Finish round processing: in-progress directory becomes empty. + * + * When the reader finishes processing a round, it cleans up + * in-progress files. The directory becomes empty, allowing the + * failover trigger to proceed (if other guards are satisfied). + * + * Guard: in-progress directory is currently non-empty. + * + * Source: replay() L336-351 -- round processing completes, + * in-progress files are cleaned up. + *) +ReplayFinishProcessing(c) == + /\ inProgressDirEmpty[c] = FALSE + /\ inProgressDirEmpty' = [inProgressDirEmpty EXCEPT ![c] = TRUE] + /\ UNCHANGED <<writerVars, replayVars, envVars, clusterState, outDirEmpty, antiFlapTimer, failoverPending>> + +--------------------------------------------------------------------------- + +(* + * Failover trigger: STA -> AIS when replay is complete. + * + * The standby cluster writes ACTIVE_IN_SYNC to its own ZK znode + * after the replication log reader determines replay is complete. + * This is driven by the reader component, not a peer-reactive + * transition. + * + * Four guards model the conditions under which failover is safe: + * 1. failoverPending[c] -- set by triggerFailoverListener (L159-171) + * when the local cluster enters STA. + * 2. inProgressDirEmpty[c] -- no partially-processed replication + * log files (getInProgressFiles().isEmpty() at L508). + * 3. replayState[c] = "SYNC" -- the SYNCED_RECOVERY rewind must + * have completed. Without this guard, failover could proceed + * with degraded rounds not re-processed from the sync point. + * 4. hdfsAvailable[c] = TRUE -- the standby's own HDFS must be + * accessible; shouldTriggerFailover() performs HDFS reads + * (getInProgressFiles, getNewFiles) that throw IOException + * if HDFS is unavailable, blocking the trigger. + * + * Guarded on zkLocalConnected[c] because triggerFailover() calls + * setHAGroupStatusToSync() which requires isHealthy = true. + * + * The effect also clears failoverPending, modeling triggerFailover() + * L538 (failoverPending.set(false)).
+ * + * Source: shouldTriggerFailover() L500-533 (guards); + * triggerFailover() L535-548 (effect); + * setHAGroupStatusToSync() L341-355 (ZK write) + *) +TriggerFailover(c) == + /\ LocalZKHealthy(c) + /\ clusterState[c] = "STA" + /\ failoverPending[c] + /\ inProgressDirEmpty[c] + /\ replayState[c] = "SYNC" + /\ hdfsAvailable[c] = TRUE + /\ clusterState' = [clusterState EXCEPT ![c] = "AIS"] + /\ failoverPending' = [failoverPending EXCEPT ![c] = FALSE] + /\ UNCHANGED <<writerVars, replayVars, envVars, outDirEmpty, antiFlapTimer, inProgressDirEmpty>> + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/SpecState.tla b/src/main/tla/ConsistentFailover/SpecState.tla new file mode 100644 index 00000000000..98a9d7f98c7 --- /dev/null +++ b/src/main/tla/ConsistentFailover/SpecState.tla @@ -0,0 +1,69 @@ +------------------------ MODULE SpecState ------------------------------------- +(* + * Shared state variables and state-dependent helper operators for the + * Phoenix Consistent Failover specification. + * + * The root module and all sub-modules EXTEND SpecState so the full + * variable list, variable-group tuples, and predicates that reference + * variables live in one place. See ConsistentFailover.tla module + * header for implementation traceability per variable. + * + * Variable groups partition the 13 specification variables by actor: + * writerVars -- per-RS replication writer mode + * clusterVars -- cluster-level HA group state and per-cluster + * protocol state (outDirEmpty, antiFlapTimer, + * failoverPending, inProgressDirEmpty) + * replayVars -- standby-side replay state and counters + * envVars -- environment substrate (HDFS availability, + * ZK connection/session state) + * + * UNCHANGED clauses reference these groups whenever a group is fully + * unchanged. Partially-changed groups list the unchanged members + * individually. + *) +EXTENDS Types + +VARIABLE clusterState, writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer, + replayState, lastRoundInSync, lastRoundProcessed, + failoverPending, inProgressDirEmpty, + zkPeerConnected, zkPeerSessionAlive, zkLocalConnected + +--------------------------------------------------------------------------- + +(* Variable-group tuples *) + +\* Per-RS replication writer mode. +writerVars == <<writerMode>> + +\* Cluster-level HA group state and per-cluster protocol state. +clusterVars == <<clusterState, outDirEmpty, antiFlapTimer, failoverPending, inProgressDirEmpty>> + +\* Standby-side replay state and counters. +replayVars == <<replayState, lastRoundInSync, lastRoundProcessed>> + +\* Environment substrate: HDFS availability and ZK coordination state. +envVars == <<hdfsAvailable, zkPeerConnected, zkPeerSessionAlive, zkLocalConnected>> + +\* Full variable tuple for use in temporal formulas +\* ([][Next]_vars, WF_vars(...), SF_vars(...)). +vars == <<clusterState, writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer, + replayState, lastRoundInSync, lastRoundProcessed, + failoverPending, inProgressDirEmpty, + zkPeerConnected, zkPeerSessionAlive, zkLocalConnected>> + +--------------------------------------------------------------------------- + +(* ZK connectivity health predicates *) + +\* Peer ZK path-children cache is delivering notifications: the +\* TCP connection is alive and the ZK session has not expired. +\* Peer-reactive transitions (PeerReact*) depend on this predicate. +PeerZKHealthy(c) == + /\ zkPeerConnected[c] = TRUE + /\ zkPeerSessionAlive[c] = TRUE + +\* Local ZK path-children cache is healthy; isHealthy = true in +\* HAGroupStoreClient, which gates setHAGroupStatusIfNeeded() and +\* all local ZK writes.
+LocalZKHealthy(c) == zkLocalConnected[c] = TRUE + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/Types.tla b/src/main/tla/ConsistentFailover/Types.tla new file mode 100644 index 00000000000..d5d00559129 --- /dev/null +++ b/src/main/tla/ConsistentFailover/Types.tla @@ -0,0 +1,417 @@ +------------------------ MODULE Types ----------------------------------------- +(* + * Pure-definition module: constants, type sets, state definitions, + * valid transition table, role mapping, and helper operators for the + * Phoenix Consistent Failover specification. + * + * No variables are declared in this module. All definitions are + * pure (stateless) and imported by the root module and sub-modules + * via EXTENDS. + * + * Definitions provided: + * HAGroupState -- the 14 HA group states + * ActiveStates -- states that map to the ACTIVE cluster role + * StandbyStates -- states that map to the STANDBY cluster role + * TransitionalActiveStates -- ATS, ANISTS (ACTIVE_TO_STANDBY role) + * AllowedTransitions -- set of valid (from, to) state pairs + * ClusterRole -- the 6 cluster roles visible to clients + * RoleOf(state) -- maps an HAGroupState to its ClusterRole + * ActiveRoles -- the set of roles considered "active" (role-level) + * Peer(c) -- returns the other cluster in a 2-cluster model + * WriterMode -- the 5 replication writer modes (per-RS) + * ReplayStateSet -- the 4 replication replay states (standby reader) + * UseOfflinePeerDetection -- feature gate for AWOP/ANISWOP modeling + * AntiFlapGateOpen -- countdown timer helper: wait elapsed + * AntiFlapGateClosed -- countdown timer helper: wait in progress + * DecrementTimer -- countdown timer helper: advance one tick + * StartAntiFlapWait -- countdown timer helper: initial/reset value + * + * Implementation traceability: + * + * Modeled concept | Java class / field + * -----------------------+--------------------------------------------- + * HAGroupState | HAGroupStoreRecord.HAGroupState enum (L51-65) + * AllowedTransitions | HAGroupStoreRecord static init (L99-123) + * ClusterRole | ClusterRoleRecord.ClusterRole enum (L59-107) + * RoleOf | HAGroupState.getClusterRole() (L73-97) + * ANIS self-transition | HAGroupStoreRecord L101 (heartbeat support) + * WriterMode | ReplicationLogGroup mode (SYNC/S&F/S&FWD) + * ReplayStateSet | ReplicationLogDiscoveryReplay replay state + * StableClusterStates, | Named sets for liveness (~> consequents/antecedents) + * FailoverCompletionAntecedentStates, + * AbortCompletionAntecedentStates, + * NotANISClusterStates | + * AllowedWriterTransitions | Per-RS writer mode pairs (action constraint) + * AllowedReplayTransitions | Replay state machine pairs (action constraint) + *) +EXTENDS Naturals, FiniteSets, TLC + +--------------------------------------------------------------------------- + +(* Constants *) + +\* The finite set of cluster identifiers participating in the model. +\* Exactly two clusters form an HA pair. +CONSTANTS Cluster + +\* Cluster set must be non-empty. +ASSUME Cluster # {} + +\* Exactly two clusters in the HA pair. +ASSUME Cardinality(Cluster) = 2 + +\* The finite set of region server identifiers per cluster. +\* Each cluster runs the same set of RS; writer mode is tracked per (cluster, RS). +CONSTANTS RS + +\* RS set must be non-empty. +ASSUME RS # {} + +\* The anti-flapping wait threshold in logical time ticks. 
+\* Source: HAGroupStoreClient.java L98 -- ZK_SESSION_TIMEOUT_MULTIPLIER = 1.1 +CONSTANTS WaitTimeForSync + +\* WaitTimeForSync must be a positive natural number. +ASSUME WaitTimeForSync \in Nat +ASSUME WaitTimeForSync > 0 + +\* Feature gate for proactive AWOP/ANISWOP modeling. +\* When TRUE, AdminGoOffline, PeerReactToOFFLINE, +\* PeerRecoverFromOFFLINE, and AdminForceRecover are enabled, +\* making AWOP/ANISWOP reachable. +\* +\* This models the intended protocol behavior for a future +\* implementation feature. AWOP/ANISWOP exist as enum values and +\* allowedTransitions entries but are currently unreachable in +\* the implementation (no FailoverManagementListener entry for +\* peer OFFLINE detection). +CONSTANTS UseOfflinePeerDetection + +ASSUME UseOfflinePeerDetection \in BOOLEAN + +--------------------------------------------------------------------------- + +(* HA Group State definitions *) + +\* The 14 HA group states from HAGroupStoreRecord.HAGroupState enum. +\* +\* Source: HAGroupStoreRecord.java L51-65 +\* +\* Modeled value | Enum constant +\* ----------------+---------------------------------------------- +\* "AIS" | ACTIVE_IN_SYNC +\* "ANIS" | ACTIVE_NOT_IN_SYNC +\* "ATS" | ACTIVE_IN_SYNC_TO_STANDBY +\* "ANISTS" | ACTIVE_NOT_IN_SYNC_TO_STANDBY +\* "AbTAIS" | ABORT_TO_ACTIVE_IN_SYNC +\* "AbTANIS" | ABORT_TO_ACTIVE_NOT_IN_SYNC +\* "AWOP" | ACTIVE_WITH_OFFLINE_PEER +\* "ANISWOP" | ACTIVE_NOT_IN_SYNC_WITH_OFFLINE_PEER +\* "S" | STANDBY +\* "STA" | STANDBY_TO_ACTIVE +\* "DS" | DEGRADED_STANDBY +\* "AbTS" | ABORT_TO_STANDBY +\* "OFFLINE" | OFFLINE +\* "UNKNOWN" | UNKNOWN +HAGroupState == + { "AIS", "ANIS", "ATS", "ANISTS", + "AbTAIS", "AbTANIS", "AWOP", "ANISWOP", + "S", "STA", "DS", "AbTS", + "OFFLINE", "UNKNOWN" } + +\* States that map to the ACTIVE cluster role. +\* A cluster in any of these states is considered active and serves +\* mutations. Mutual exclusion requires at most one cluster in an +\* ActiveState at any time. +\* +\* Source: HAGroupState.getClusterRole() L73-97 -- these states +\* return ClusterRole.ACTIVE. +ActiveStates == { "AIS", "ANIS", "AbTAIS", "AbTANIS", "AWOP", "ANISWOP" } + +\* Active states whose writer-degradation path couples to ANIS +\* (AIS-like states with mutation-serving role and no in-flight +\* failover/abort). The Writer actions WriterInitToStoreFwd and +\* WriterToStoreFwd atomically transition clusterState to ANIS +\* and reset antiFlapTimer when a writer degrades from any of +\* these states. +\* +\* AIS is the base case. AWOP and ANISWOP are the OFFLINE-peer +\* variants (gated on UseOfflinePeerDetection): both serve +\* mutations while the peer is OFFLINE and are treated as +\* AIS-equivalents for writer-degradation coupling. +AISLikeStates == { "AIS", "AWOP", "ANISWOP" } + +\* States that map to the STANDBY cluster role. +\* A cluster in any of these states is receiving and replaying +\* replication logs from the active peer. +\* +\* Source: HAGroupState.getClusterRole() L73-97 -- these states +\* return ClusterRole.STANDBY. +StandbyStates == { "S", "DS", "AbTS" } + +\* States that map to the ACTIVE_TO_STANDBY cluster role. +\* A cluster in these states is transitioning from active to standby +\* during a failover. Mutations are blocked (isMutationBlocked()=true). +\* +\* Source: ClusterRoleRecord.java L84 -- ACTIVE_TO_STANDBY role +\* has isMutationBlocked() = true. +TransitionalActiveStates == { "ATS", "ANISTS" } + +\* The set of cluster roles considered "active" for role-level predicates. 
+\* Distinguished from ActiveStates (which is the set of HA group *states* +\* that map to ACTIVE): ActiveRoles operates at the role abstraction layer. +\* +\* Source: ClusterRoleRecord.java L59-67 -- ACTIVE role has +\* isMutationBlocked()=false. +ActiveRoles == {"ACTIVE"} + +--------------------------------------------------------------------------- + +(* Replication writer mode definitions *) + +\* The 5 replication writer modes from ReplicationLogGroup.java. +\* Each RegionServer on the active cluster maintains one of these modes. +\* +\* Modeled value | Java class +\* -------------------+---------------------------------------------- +\* "INIT" | Pre-initialization +\* "SYNC" | SyncModeImpl -- writing directly to standby HDFS +\* "STORE_AND_FWD" | StoreAndForwardModeImpl -- writing locally +\* "SYNC_AND_FWD" | SyncAndForwardModeImpl -- draining local queue +\* | while also writing synchronously +\* "DEAD" | RS aborted (logGroup.abort()) -- writer halted, +\* | awaiting process supervisor restart +\* +\* Source: ReplicationLogGroup.java mode classes; +\* SyncModeImpl.onFailure() L61-74 (CAS failure -> abort) +WriterMode == {"INIT", "SYNC", "STORE_AND_FWD", "SYNC_AND_FWD", "DEAD"} + +--------------------------------------------------------------------------- + +(* Replication replay state definitions *) + +\* The 4 replication replay states from ReplicationLogDiscoveryReplay.java. +\* The standby cluster's reader maintains one of these states per HA group. +\* +\* Modeled value | Meaning +\* --------------------+---------------------------------------------- +\* "NOT_INITIALIZED" | Pre-init; reader has not started +\* "SYNC" | Fully in sync; lastRoundProcessed and +\* | lastRoundInSync advance together +\* "DEGRADED" | Active peer in ANIS; lastRoundProcessed +\* | advances, lastRoundInSync frozen +\* "SYNCED_RECOVERY" | Active returned to AIS; replay rewinds +\* | lastRoundProcessed to lastRoundInSync +\* +\* Source: ReplicationLogDiscoveryReplay.java L550-555 +ReplayStateSet == {"NOT_INITIALIZED", "SYNC", "DEGRADED", "SYNCED_RECOVERY"} + +--------------------------------------------------------------------------- + +(* Allowed transitions *) + +\* The set of valid (from, to) state transition pairs. +\* Derived from the allowedTransitions static initializer in +\* HAGroupStoreRecord.java (L99-123). +\* +\* Each entry maps to one line of the static initializer block. +\* The ANIS self-transition ("ANIS" -> "ANIS") supports the +\* periodic heartbeat in StoreAndForwardModeImpl (L71-87) that +\* refreshes zkMtime without changing the state value. +\* +\* Source: HAGroupStoreRecord.java L99-123 +AllowedTransitions == + { + \* ANIS can stay in ANIS (heartbeat), return to AIS (recovery), + \* begin failover (ANISTS), or detect offline peer (ANISWOP). + \* Source: L101 + <<"ANIS", "ANIS">>, + <<"ANIS", "AIS">>, + <<"ANIS", "ANISTS">>, + <<"ANIS", "ANISWOP">>, + \* AIS can degrade to ANIS (writer failure), detect offline + \* peer (AWOP), or begin failover (ATS). + \* Source: L103 + <<"AIS", "ANIS">>, + <<"AIS", "AWOP">>, + <<"AIS", "ATS">>, + \* S (standby) can begin failover (STA), degrade (DS), + \* or go offline (OFFLINE) via admin --force. + \* Source: L105; OFFLINE entry via PhoenixHAAdminTool + \* update --force (bypasses isTransitionAllowed) + <<"S", "STA">>, + <<"S", "DS">>, + <<"S", "OFFLINE">>, + \* ANISTS can abort (AbTANIS) or advance to ATS once OUT + \* dir is drained (subject to anti-flapping gate). 
+ \* Source: L107 + <<"ANISTS", "AbTANIS">>, + <<"ANISTS", "ATS">>, + \* ATS can abort (AbTAIS) or complete failover (become S). + \* Source: L109 + <<"ATS", "AbTAIS">>, + <<"ATS", "S">>, + \* STA can abort (AbTS) or complete failover (become AIS). + \* Source: L111 + <<"STA", "AbTS">>, + <<"STA", "AIS">>, + \* DS can recover to S, begin failover (STA), or go offline + \* (OFFLINE) via admin --force. + \* DS -> STA supports the ANIS failover path where the + \* standby is in DEGRADED_STANDBY when failover proceeds. + \* Source: L117; OFFLINE entry via PhoenixHAAdminTool + \* update --force (bypasses isTransitionAllowed) + <<"DS", "S">>, + <<"DS", "STA">>, + <<"DS", "OFFLINE">>, + \* AWOP returns to ANIS when peer comes back. + \* Source: L113 + <<"AWOP", "ANIS">>, + \* Abort auto-completion transitions. + \* Source: L115, L119, L121 + <<"AbTAIS", "AIS">>, + \* AbTAIS -> ANIS: needed so HDFS failure during abort can + \* route to ANIS (S&F writers cannot self-correct while in + \* AbTAIS without this transition). + <<"AbTAIS", "ANIS">>, + <<"AbTANIS", "ANIS">>, + <<"AbTS", "S">>, + \* ANISWOP returns to ANIS when peer comes back. + \* Source: L123 + <<"ANISWOP", "ANIS">>, + \* OFFLINE can recover to S via admin --force. + \* Source: PhoenixHAAdminTool update --force --state STANDBY + \* (bypasses isTransitionAllowed; OFFLINE.allowed- + \* Transitions = {} in the implementation) + <<"OFFLINE", "S">> + } + +--------------------------------------------------------------------------- + +(* Named sets for liveness formulas (ConsistentFailover.tla). *) + +StableClusterStates == + {"AIS", "ANIS", "S"} + +FailoverCompletionAntecedentStates == + {"STA", "AbTAIS", "AbTANIS", "AbTS"} + +AbortCompletionAntecedentStates == + {"AbTS", "AbTAIS", "AbTANIS"} + +NotANISClusterStates == HAGroupState \ {"ANIS"} + +--------------------------------------------------------------------------- + +(* Allowed writer mode (per-RS) transition pairs. *) + +AllowedWriterTransitions == + { + <<"INIT", "SYNC">>, + <<"INIT", "STORE_AND_FWD">>, + <<"INIT", "DEAD">>, + <<"SYNC", "STORE_AND_FWD">>, + <<"SYNC", "SYNC_AND_FWD">>, + <<"SYNC", "DEAD">>, + <<"SYNC", "INIT">>, + <<"STORE_AND_FWD", "SYNC_AND_FWD">>, + <<"STORE_AND_FWD", "DEAD">>, + <<"STORE_AND_FWD", "INIT">>, + <<"SYNC_AND_FWD", "SYNC">>, + <<"SYNC_AND_FWD", "STORE_AND_FWD">>, + <<"SYNC_AND_FWD", "DEAD">>, + <<"SYNC_AND_FWD", "INIT">>, + <<"DEAD", "INIT">> + } + +--------------------------------------------------------------------------- + +(* Allowed replay state transition pairs. *) + +AllowedReplayTransitions == + { + <<"NOT_INITIALIZED", "SYNCED_RECOVERY">>, + <<"NOT_INITIALIZED", "DEGRADED">>, + <<"SYNC", "DEGRADED">>, + <<"SYNC", "SYNCED_RECOVERY">>, + <<"DEGRADED", "SYNCED_RECOVERY">>, + <<"SYNCED_RECOVERY", "SYNC">>, + <<"SYNCED_RECOVERY", "DEGRADED">> + } + +--------------------------------------------------------------------------- + +(* Cluster role definitions *) + +\* The 6 cluster roles visible to clients. +\* +\* Source: ClusterRoleRecord.ClusterRole enum (L59-107) +ClusterRole == + { "ACTIVE", "ACTIVE_TO_STANDBY", "STANDBY", + "STANDBY_TO_ACTIVE", "OFFLINE", "UNKNOWN" } + +\* Maps an HAGroupState to its ClusterRole. +\* +\* Source: HAGroupState.getClusterRole() L73-97 +RoleOf(state) == + \* Active states map to ACTIVE role. + IF state \in ActiveStates THEN "ACTIVE" + \* Transitional states map to ACTIVE_TO_STANDBY role. + ELSE IF state \in TransitionalActiveStates THEN "ACTIVE_TO_STANDBY" + \* Standby states map to STANDBY role. 
+ ELSE IF state \in StandbyStates THEN "STANDBY" + \* STANDBY_TO_ACTIVE is its own role. + ELSE IF state = "STA" THEN "STANDBY_TO_ACTIVE" + \* OFFLINE maps to OFFLINE role. + ELSE IF state = "OFFLINE" THEN "OFFLINE" + \* Everything else (UNKNOWN) maps to UNKNOWN role. + ELSE "UNKNOWN" + +--------------------------------------------------------------------------- + +(* Helpers *) + +\* Returns the peer cluster in a 2-cluster model. +\* Precondition: c \in Cluster and |Cluster| = 2. +Peer(c) == CHOOSE p \in Cluster : p # c + +--------------------------------------------------------------------------- + +(* Anti-flapping countdown timer helpers *) + +\* The anti-flapping mechanism uses a per-cluster countdown timer +\* following the pattern from Lamport, "Real Time is Really Simple" +\* (CHARME 2005, Section 2). Each cluster's timer counts DOWN from +\* WaitTimeForSync toward 0. The timer does NOT represent a clock +\* running backwards -- it represents a waiting period expiring: +\* +\* WaitTimeForSync ... 2 ... 1 ... 0 +\* |---- gate closed (waiting) ----| gate open (transition allowed) +\* +\* The S&F heartbeat resets the timer to WaitTimeForSync, keeping the +\* gate closed. When the heartbeat stops (all RS exit STORE_AND_FWD), +\* the Tick action counts the timer down to 0, opening the gate and +\* allowing ANIS -> AIS. +\* +\* Source: HAGroupStoreClient.validateTransitionAndGetWaitTime() +\* L1027-1046; StoreAndForwardModeImpl.startHAGroupStoreUpdate- +\* Task() L71-87. + +\* TRUE when the anti-flapping wait period has fully elapsed. +\* The guarded transition (ANIS -> AIS) may proceed. +AntiFlapGateOpen(t) == t = 0 + +\* TRUE when the anti-flapping wait is still in progress. +\* The guarded transition is blocked; time must elapse before +\* the gate opens. +AntiFlapGateClosed(t) == t > 0 + +\* Advance the countdown timer one tick toward 0 (floor at 0). +\* Used by the Tick action to model the passage of time. +DecrementTimer(t) == IF t > 0 THEN t - 1 ELSE 0 + +\* The value that starts (or restarts) the anti-flapping wait. +\* Used when a cluster enters ANIS or when the S&F heartbeat fires. +StartAntiFlapWait == WaitTimeForSync + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/Writer.tla b/src/main/tla/ConsistentFailover/Writer.tla new file mode 100644 index 00000000000..a09e63666e6 --- /dev/null +++ b/src/main/tla/ConsistentFailover/Writer.tla @@ -0,0 +1,381 @@ +-------------------------- MODULE Writer ---------------------------------------- +(* + * Replication writer mode state machine for the Phoenix Consistent + * Failover specification. + * + * Each RegionServer on the active cluster maintains a writer mode + * that determines how mutations are replicated: directly to standby + * HDFS (SYNC), locally buffered (STORE_AND_FWD), or draining local + * queue while also writing synchronously (SYNC_AND_FWD). An RS that + * aborts due to a ZK CAS failure enters DEAD mode. + * + * HDFS-failure-driven degradation (SYNC -> S&F, SYNC_AND_FWD -> S&F) + * is modeled as individual per-RS actions that each perform their + * own ZK CAS write. HDFSDown in HDFS.tla only sets the availability + * flag; per-RS degradation and CAS failure are handled here. + * + * CAS FAILURE SEMANTICS: When an RS detects HDFS unavailability via + * IOException, it attempts a ZK CAS write (setData().withVersion()) + * to transition AIS->ANIS (or ANIS->ANIS self-transition). 
If another + * RS has already bumped the ZK version (stale PathChildrenCache), + * BadVersionException is thrown. SyncModeImpl.onFailure() and + * SyncAndForwardModeImpl.onFailure() treat this as fatal: abort() + * throws RuntimeException, halting the Disruptor -- the RS is dead. + * CAS failure is only possible when clusterState /= "AIS" because + * the first RS to write faces no concurrent version bump. + * + * ZK LOCAL CONNECTIVITY: Actions that perform ZK writes + * (setHAGroupStatusToStoreAndForward, setHAGroupStatusToSync) + * require isHealthy = true, modeled by the zkLocalConnected[c] + * guard. Actions that are purely mode transitions driven by HDFS + * operations or forwarder events (WriterInit, WriterSyncToSyncFwd, + * WriterStoreFwdToSyncFwd) do NOT require a ZK connection. + * + * Implementation traceability: + * + * TLA+ action | Java source + * ---------------------------------+---------------------------------------- + * WriterInit(c, rs) | Normal startup -> SyncModeImpl + * WriterInitToStoreFwd(c, rs) | Startup with peer unavailable -> + * | StoreAndForwardModeImpl; CAS + * | success path + * WriterInitToStoreFwdFail(c, rs) | Startup CAS failure -> abort + * WriterToStoreFwd(c, rs) | SyncModeImpl.onFailure() L61-74 -> + * | setHAGroupStatusToStoreAndForward(); + * | CAS success path + * WriterToStoreFwdFail(c, rs) | SyncModeImpl.onFailure() CAS + * | failure -> abort + * WriterSyncToSyncFwd(c, rs) | Forwarder ACTIVE_NOT_IN_SYNC event + * | L98-108 while RS in SYNC + * WriterStoreFwdToSyncFwd(c, rs) | Forwarder processFile() L133-152 + * | throughput threshold or drain start + * WriterSyncFwdToSync(c, rs) | Forwarder drain complete; queue empty + * | -> setHAGroupStatusToSync() L171 + * WriterSyncFwdToStoreFwd(c, rs) | SyncAndForwardModeImpl.onFailure() + * | L66-78; CAS success path + * WriterSyncFwdToStoreFwdFail(c,rs)| SyncAndForwardModeImpl.onFailure() + * | CAS failure -> abort + * CanDegradeToStoreFwd(c, rs) | Guard predicate: RS is in a mode + * | that writes to standby HDFS + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +(* Predicates *) + +(* + * Guard predicate: RS is in a mode that writes to standby HDFS + * and would degrade to STORE_AND_FWD on an HDFS failure. + * + * Used by the per-RS degradation actions (WriterToStoreFwd, + * WriterSyncFwdToStoreFwd) and their CAS failure variants. + *) +CanDegradeToStoreFwd(c, rs) == + writerMode[c][rs] \in {"SYNC", "SYNC_AND_FWD"} + +--------------------------------------------------------------------------- + +(* Actions wired into Next *) + +(* + * Normal startup: INIT -> SYNC. + * + * RS initializes and begins writing directly to standby HDFS. + * Writers only run on the active cluster. + * No ZK write -- pure mode transition. + * + * Source: Normal startup -> SyncModeImpl + *) +WriterInit(c, rs) == + /\ clusterState[c] \in ActiveStates + /\ writerMode[c][rs] = "INIT" + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC"] + /\ UNCHANGED <<clusterVars, replayVars, envVars>> + +--------------------------------------------------------------------------- + +(* + * Startup with peer unavailable: INIT -> STORE_AND_FWD. + * + * RS initializes but standby HDFS is unreachable; begins + * buffering locally in the OUT directory. Also transitions + * cluster AIS -> ANIS (setHAGroupStatusToStoreAndForward). + * Writers only run on the active cluster. + * + * AWOP/ANISWOP handling: same as WriterToStoreFwd.
+ * + * Guarded on zkLocalConnected[c] because this calls + * setHAGroupStatusToStoreAndForward() which requires + * isHealthy = true. + * + * Source: StoreAndForwardModeImpl.onEnter() L54-64 + *) +WriterInitToStoreFwd(c, rs) == + /\ LocalZKHealthy(c) + /\ clusterState[c] \in ActiveStates + /\ writerMode[c][rs] = "INIT" + /\ hdfsAvailable[Peer(c)] = FALSE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "STORE_AND_FWD"] + /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = FALSE] + /\ clusterState' = IF clusterState[c] \in AISLikeStates + THEN [clusterState EXCEPT ![c] = "ANIS"] + ELSE clusterState + /\ antiFlapTimer' = IF clusterState[c] \in AISLikeStates + THEN [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait] + ELSE antiFlapTimer + /\ UNCHANGED <<replayVars, envVars, failoverPending, inProgressDirEmpty>> + +--------------------------------------------------------------------------- + +(* + * Forwarder started while in sync: SYNC -> SYNC_AND_FWD. + * + * On an ACTIVE_NOT_IN_SYNC event (L98-108), region servers + * currently in SYNC learn that the cluster has entered ANIS + * and transition to SYNC_AND_FWD. This event fires once when + * the cluster enters ANIS. ANISTS does not produce a new + * ACTIVE_NOT_IN_SYNC event -- it is a different ZK state + * change (ACTIVE_NOT_IN_SYNC_TO_STANDBY). A SYNC writer that + * has not yet received the event when ANIS -> ANISTS fires + * will remain in SYNC (harmlessly: SYNC writers write directly + * to standby HDFS, not to the OUT directory). + * No ZK write -- mode transition driven by forwarder event. + * + * Source: ReplicationLogDiscoveryForwarder.init() L98-108 + *) +WriterSyncToSyncFwd(c, rs) == + /\ clusterState[c] = "ANIS" + /\ writerMode[c][rs] = "SYNC" + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC_AND_FWD"] + /\ UNCHANGED <<clusterVars, replayVars, envVars>> + +--------------------------------------------------------------------------- + +(* + * Recovery detected; standby available again: + * STORE_AND_FWD -> SYNC_AND_FWD. + * + * The forwarder successfully copies a file from the OUT directory + * to the standby's IN directory. If throughput exceeds the + * threshold, the writer transitions to SYNC_AND_FWD to begin + * draining the queue while also writing synchronously. + * The forwarder runs on active clusters and during the ANISTS + * transitional state (draining OUT before ANISTS->ATS). + * No ZK write -- mode transition driven by forwarder file copy. + * + * Source: ReplicationLogDiscoveryForwarder.processFile() L133-152 + * throughput threshold or drain start. + *) +WriterStoreFwdToSyncFwd(c, rs) == + /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates + /\ writerMode[c][rs] = "STORE_AND_FWD" + /\ hdfsAvailable[Peer(c)] = TRUE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC_AND_FWD"] + /\ UNCHANGED <<clusterVars, replayVars, envVars>> + +--------------------------------------------------------------------------- + +(* + * All stored logs forwarded; queue empty: + * SYNC_AND_FWD -> SYNC. + * + * The forwarder has drained all buffered files from the OUT + * directory. The OUT directory is now empty. + * The forwarder runs on active clusters and during the ANISTS + * transitional state (draining OUT before ANISTS->ATS). + * + * Per-RS vs per-cluster semantics: processNoMoreRoundsLeft() + * (ReplicationLogDiscoveryForwarder.java L155-184) is a per- + * cluster forwarder check that examines the global OUT directory + * -- it only fires when the entire OUT directory is empty, not + * when a single RS finishes.
The guard + * \A rs2 \in RS : writerMode[c][rs2] \notin {"STORE_AND_FWD"} + * prevents setting outDirEmpty = TRUE while any RS is still + * actively writing to the OUT directory. + * + * HDFS guard: processNoMoreRoundsLeft() can only fire after + * processFile() has successfully copied all remaining files from + * OUT to the peer's IN directory, which requires the peer's HDFS + * to be accessible. If the peer's HDFS is down, processFile() + * throws IOException and the forwarder retries -- it never + * reaches processNoMoreRoundsLeft(). + * + * Guarded on zkLocalConnected[c] because this calls + * setHAGroupStatusToSync() which requires isHealthy = true. + * + * Source: ReplicationLogDiscoveryForwarder.processFile() L133-152 + * copies to peer HDFS; processNoMoreRoundsLeft() L155-184 + * only fires after all files are forwarded. + * setHAGroupStatusToSync() L171 + *) +WriterSyncFwdToSync(c, rs) == + /\ LocalZKHealthy(c) + /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates + /\ writerMode[c][rs] = "SYNC_AND_FWD" + /\ hdfsAvailable[Peer(c)] = TRUE + /\ \A rs2 \in RS : writerMode[c][rs2] \notin {"STORE_AND_FWD"} + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC"] + /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* Per-RS HDFS failure degradation -- CAS success paths *) + +(* + * Per-RS HDFS failure degradation: SYNC -> STORE_AND_FWD (CAS success). + * + * Models a single RS detecting standby HDFS unavailability via + * IOException and successfully CAS-writing the ZK state. The ZK + * CAS write is synchronous and happens BEFORE the mode change + * (SyncModeImpl.onFailure() L61-74). On success, the writer + * transitions to STORE_AND_FWD and the cluster transitions + * AIS -> ANIS (if still AIS). Writers only run on the active cluster. + * + * AWOP/ANISWOP handling: when AWOP or ANISWOP + * are reachable (UseOfflinePeerDetection = TRUE), HDFS failure + * during these states triggers setHAGroupStatusToStoreAndForward() + * which CAS-writes ANIS. AWOP.allowedTransitions = {ANIS} and + * ANISWOP.allowedTransitions = {ANIS}, so the transition succeeds. + * When UseOfflinePeerDetection = FALSE, AWOP/ANISWOP are + * unreachable and the extended IF is a no-op. + * + * Guarded on zkLocalConnected[c] because the CAS write goes through + * setHAGroupStatusIfNeeded() which requires isHealthy = true. + * + * Source: SyncModeImpl.onFailure() L61-74 -> + * setHAGroupStatusToStoreAndForward() + *) +WriterToStoreFwd(c, rs) == + /\ LocalZKHealthy(c) + /\ clusterState[c] \in ActiveStates + /\ writerMode[c][rs] = "SYNC" + /\ hdfsAvailable[Peer(c)] = FALSE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "STORE_AND_FWD"] + /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = FALSE] + /\ clusterState' = IF clusterState[c] \in AISLikeStates + THEN [clusterState EXCEPT ![c] = "ANIS"] + ELSE clusterState + /\ antiFlapTimer' = IF clusterState[c] \in AISLikeStates + THEN [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait] + ELSE antiFlapTimer + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Re-degradation during drain: SYNC_AND_FWD -> STORE_AND_FWD + * (CAS success). + * + * Models standby HDFS becoming unavailable again while the + * forwarder is draining the local queue. The RS falls back to + * pure local buffering. The forwarder runs on active clusters + * and during the ANISTS transitional state. 
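 + *
 + * Illustrative interleaving (for intuition; not traced from the
 + * source): an RS draining in SYNC_AND_FWD sees the peer NameNode
 + * crash (HDFSDown on Peer(c)); this action fires and the RS falls
 + * back to buffering in OUT. Once HDFSUp restores the peer,
 + * WriterStoreFwdToSyncFwd re-enters the drain cycle.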
+ * No AIS -> ANIS coupling needed: if RS is in SYNC_AND_FWD, + * cluster is already ANIS or ANISTS (cannot be AIS). + * + * Guarded on zkLocalConnected[c] because the CAS write goes through + * setHAGroupStatusIfNeeded() which requires isHealthy = true. + * + * Source: SyncAndForwardModeImpl.onFailure() L66-78 -> + * setHAGroupStatusToStoreAndForward() + *) +WriterSyncFwdToStoreFwd(c, rs) == + /\ LocalZKHealthy(c) + /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates + /\ writerMode[c][rs] = "SYNC_AND_FWD" + /\ hdfsAvailable[Peer(c)] = FALSE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "STORE_AND_FWD"] + /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = FALSE] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* Per-RS HDFS failure degradation -- CAS failure paths (RS abort) *) + +(* + * CAS failure during SYNC degradation: SYNC -> DEAD. + * + * RS detects IOException, reads stale AIS/version N from + * PathChildrenCache, attempts CAS write AIS->ANIS with version N, + * but another RS already bumped the version to N+1. ZK throws + * BadVersionException -> StaleHAGroupStoreRecordVersionException -> + * abort() -> RuntimeException -> Disruptor halts -> RS dead. + * + * Guard: clusterState[c] /= "AIS" -- CAS failure is only possible + * when another RS has already changed the cluster state, meaning + * the ZK version has been bumped beyond the cached value. + * + * Guarded on zkLocalConnected[c] because the CAS write requires + * a live ZK connection (isHealthy = true). + * + * Source: SyncModeImpl.onFailure() L61-74 catch block -> abort() + *) +WriterToStoreFwdFail(c, rs) == + /\ LocalZKHealthy(c) + /\ clusterState[c] \in ActiveStates \ {"AIS"} + /\ writerMode[c][rs] = "SYNC" + /\ hdfsAvailable[Peer(c)] = FALSE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * CAS failure during SYNC_AND_FWD re-degradation: + * SYNC_AND_FWD -> DEAD. + * + * Same CAS failure pattern as WriterToStoreFwdFail but from + * SYNC_AND_FWD mode. If RS is in SYNC_AND_FWD, the cluster is + * already ANIS or ANISTS (not AIS), so another RS or the S&F + * heartbeat may have bumped the ZK version. + * + * Guarded on zkLocalConnected[c] because the CAS write requires + * a live ZK connection (isHealthy = true). + * + * Source: SyncAndForwardModeImpl.onFailure() L66-78 catch block + * -> abort() + *) +WriterSyncFwdToStoreFwdFail(c, rs) == + /\ LocalZKHealthy(c) + /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates + /\ writerMode[c][rs] = "SYNC_AND_FWD" + /\ hdfsAvailable[Peer(c)] = FALSE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * CAS failure during init degradation: INIT -> DEAD. + * + * RS starts up, SyncModeImpl.onEnter() fails (HDFS unavailable), + * updateModeOnFailure -> SyncModeImpl.onFailure() -> CAS write + * fails -> abort(). Same CAS race as WriterToStoreFwdFail but + * from the INIT state during startup. + * + * Guard: clusterState[c] /= "AIS" -- same rationale: another RS + * must have already bumped the version for CAS to fail. + * + * Guarded on zkLocalConnected[c] because the CAS write requires + * a live ZK connection (isHealthy = true). 
+ * + * Source: SyncModeImpl.onFailure() L61-74 via + * LogEventHandler.initializeMode() failure path + *) +WriterInitToStoreFwdFail(c, rs) == + /\ LocalZKHealthy(c) + /\ clusterState[c] \in ActiveStates \ {"AIS"} + /\ writerMode[c][rs] = "INIT" + /\ hdfsAvailable[Peer(c)] = FALSE + /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"] + /\ UNCHANGED <> + +============================================================================ diff --git a/src/main/tla/ConsistentFailover/ZK.tla b/src/main/tla/ConsistentFailover/ZK.tla new file mode 100644 index 00000000000..dd81e05a99f --- /dev/null +++ b/src/main/tla/ConsistentFailover/ZK.tla @@ -0,0 +1,279 @@ +------------------------------ MODULE ZK ---------------------------------------- +(* + * ZooKeeper coordination substrate for the Phoenix Consistent + * Failover specification. + * + * Models the ZK session lifecycle and connection state as + * environment actions. Two independent PathChildrenCache instances + * per HAGroupStoreClient drive the protocol: pathChildrenCache + * (LOCAL) watches the local cluster's ZK znode, and + * peerPathChildrenCache (PEER) watches the peer cluster's ZK znode + * via a separate CuratorFramework/ZK connection. + * + * Peer ZK failures suppress peer-reactive transitions (PeerReact* + * actions). Local ZK failures suppress all local ZK writes + * (auto-completion, heartbeat, writer cluster-state transitions, + * failover trigger). + * + * ZK failure modes modeled: + * 1. Peer disconnection (transient): peerPathChildrenCache loses + * TCP connection. Peer-reactive transitions suppressed. On + * reconnect, Curator re-syncs and fires synthetic events. + * 2. Peer session expiry (permanent until recovery): ZK session + * expires. All watches permanently lost. Curator must establish + * a new session via retry policy. + * 3. Local disconnection: pathChildrenCache loses connection. + * isHealthy = false, blocking all setHAGroupStatusIfNeeded() + * calls. + * + * Retry exhaustion (FailoverManagementListener 2-retry limit) is + * modeled in HAGroupStore.tla as ReactiveTransitionFail(c). + * + * Implementation traceability: + * + * TLA+ action | Java source + * -------------------------+---------------------------------------------- + * ZKPeerDisconnect(c) | HAGroupStoreClient.createCacheListener() + * | L894-898 -- peerPathChildrenCache + * | CONNECTION_LOST/CONNECTION_SUSPENDED + * | (no effect on isHealthy for PEER cache) + * ZKPeerReconnect(c) | HAGroupStoreClient.createCacheListener() + * | L903-906 -- peerPathChildrenCache + * | CONNECTION_RECONNECTED; Curator re-syncs + * | PathChildrenCache, fires synthetic + * | CHILD_UPDATED events + * ZKPeerSessionExpiry(c) | Curator maps SESSION_EXPIRED to + * | CONNECTION_LOST internally; no explicit + * | SESSION_EXPIRED handling in Phoenix + * ZKPeerSessionRecover(c) | Curator retry policy establishes new + * | session; PathChildrenCache rebuilds + * ZKLocalDisconnect(c) | HAGroupStoreClient.createCacheListener() + * | L894-898 -- pathChildrenCache (LOCAL) + * | CONNECTION_LOST sets isHealthy = false + * ZKLocalReconnect(c) | HAGroupStoreClient.createCacheListener() + * | L903-906 -- pathChildrenCache (LOCAL) + * | CONNECTION_RECONNECTED sets isHealthy = + * | true + *) +EXTENDS SpecState, Types + +--------------------------------------------------------------------------- + +(* + * Post-abort ATS reconciliation fold. 
+ * + * When the peer connection or session is re-established and the + * local cluster is in ATS while the peer is in S or DS at the + * moment of rebuild, the PathChildrenCache rebuild fires a + * synthetic event that triggers the FailoverManagementListener. + * No existing PeerReact* action handles (ATS, S/DS) -- the + * transient AbTS state was missed during the partition. The + * reconciliation transitions ATS -> AbTAIS, which auto-completes + * to AIS via AutoComplete. + * + * Shared by ZKPeerReconnect and ZKPeerSessionRecover: both reuse + * the identical synthetic-event -> listener chain (Curator rebuild + * is the same whether triggered by reconnection or session + * recovery). Extracting this operator keeps the two actions' + * reconciliation branches from drifting apart. + *) +ATSReconcileEffect(c) == + IF clusterState[c] = "ATS" /\ clusterState[Peer(c)] \in {"S", "DS"} + THEN clusterState' = [clusterState EXCEPT ![c] = "AbTAIS"] + ELSE UNCHANGED clusterState + +--------------------------------------------------------------------------- + +(* + * Peer ZK connection drops. + * + * The peerPathChildrenCache loses its TCP connection to the peer + * ZK quorum. During disconnection, no watcher notifications are + * delivered, so peer-reactive transitions for cluster c are + * suppressed. + * + * The implementation does NOT set isHealthy = false for PEER cache + * disconnection -- only LOCAL cache disconnection affects isHealthy. + * + * Pre: zkPeerConnected[c] = TRUE. + * Post: zkPeerConnected[c] = FALSE. + * + * Source: HAGroupStoreClient.createCacheListener() L894-898 + * (CONNECTION_LOST/CONNECTION_SUSPENDED for PEER cache) + *) +ZKPeerDisconnect(c) == + /\ zkPeerConnected[c] = TRUE + /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = FALSE] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer ZK connection re-established. + * + * The peerPathChildrenCache re-establishes its TCP connection to + * the peer ZK quorum. Curator re-syncs PathChildrenCache by + * re-reading children and generating synthetic CHILD_UPDATED events. + * handleStateChange() compares against lastKnownPeerState and only + * fires notifications if the peer state differs from the last known + * value -- same-state suppression. In the TLA+ model, this is + * naturally handled: PeerReact* actions are re-enabled by the + * zkPeerConnected guard and fire when their peer-state guard is + * satisfied. + * + * Reconnection requires a live session -- if the session is expired, + * a new session must be established first via ZKPeerSessionRecover. + * + * POST-ABORT ATS RECONCILIATION: + * When the local cluster is in ATS and the peer is in S or DS at + * the moment of reconnect, the PathChildrenCache rebuild fires a + * synthetic event that triggers the FailoverManagementListener. + * No existing PeerReact* action handles (ATS, S/DS) -- the + * transient AbTS state was missed during the partition. The + * reconciliation transitions ATS -> AbTAIS, which auto-completes + * to AIS via AutoComplete. + * + * This is folded into the reconnect action (rather than modeled + * as a separate action with a boolean flag) because the + * CONNECTION_RECONNECTED -> PathChildrenCache rebuild -> + * handleStateChange() -> FailoverManagementListener chain is + * synchronous on the same event thread, following the same + * listener-effect folding pattern used for recoveryListener and + * degradedListener in HAGroupStore.tla. 
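 + *
 + * Illustrative trace (for intuition; not traced from the source):
 + * c1 sits in ATS when its peer watcher disconnects; c2 aborts
 + * (STA -> AbTS) and auto-completes AbTS -> S while c1 misses the
 + * transient AbTS. On reconnect, c1 observes (ATS, S) and this fold
 + * rewrites ATS -> AbTAIS, which AutoComplete then resolves to AIS.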
+ * + * Race safety: ZKPeerReconnect requires zkPeerConnected[c] = FALSE, + * so it cannot fire during normal operation when the connection is + * healthy. The normal transient (ATS, S) state during happy-path + * failover is handled by PeerReactToATS on the peer side. + * + * Pre: zkPeerConnected[c] = FALSE, zkPeerSessionAlive[c] = TRUE. + * Post: zkPeerConnected[c] = TRUE. + * If clusterState[c] = ATS and peer in {S, DS}: + * clusterState[c] = AbTAIS (reconciliation). + * + * Source: HAGroupStoreClient.createCacheListener() L903-906 + * (CONNECTION_RECONNECTED for PEER cache) + *) +ZKPeerReconnect(c) == + /\ zkPeerConnected[c] = FALSE + /\ zkPeerSessionAlive[c] = TRUE + /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = TRUE] + /\ ATSReconcileEffect(c) + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer ZK session expires. + * + * The ZK server evicts cluster c's peer session after the session + * timeout elapses without heartbeats. All watches are permanently + * lost. The client must establish a new session before any watcher + * notifications can be delivered. + * + * Session expiry implies disconnection: when the session dies, the + * TCP connection is also considered dead. + * + * The implementation has no explicit SESSION_EXPIRED handling. + * Curator maps session expiry to CONNECTION_LOST internally, then + * attempts to create a new session via its retry policy. + * + * Pre: zkPeerSessionAlive[c] = TRUE. + * Post: zkPeerSessionAlive[c] = FALSE, zkPeerConnected[c] = FALSE. + * + * Source: Curator internal session management; no explicit Phoenix + * SESSION_EXPIRED handler + *) +ZKPeerSessionExpiry(c) == + /\ zkPeerSessionAlive[c] = TRUE + /\ zkPeerSessionAlive' = [zkPeerSessionAlive EXCEPT ![c] = FALSE] + /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = FALSE] + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Peer ZK session recovered. + * + * Curator's retry policy establishes a new ZK session for the peer + * connection. PathChildrenCache rebuilds its internal state by + * re-reading all children and fires synthetic CHILD_ADDED events, + * effectively re-syncing. + * + * Session recovery implies reconnection: the new session comes with + * a live TCP connection. + * + * POST-ABORT ATS RECONCILIATION: + * Same folded reconciliation as ZKPeerReconnect. Session recovery + * triggers a full PathChildrenCache rebuild with synthetic + * CHILD_ADDED events, which invokes the FailoverManagementListener + * synchronously. See ZKPeerReconnect comment for full rationale. + * + * Pre: zkPeerSessionAlive[c] = FALSE. + * Post: zkPeerSessionAlive[c] = TRUE, zkPeerConnected[c] = TRUE. + * If clusterState[c] = ATS and peer in {S, DS}: + * clusterState[c] = AbTAIS (reconciliation). + * + * Source: Curator retry policy -> new session -> PathChildrenCache + * rebuild + *) +ZKPeerSessionRecover(c) == + /\ zkPeerSessionAlive[c] = FALSE + /\ zkPeerSessionAlive' = [zkPeerSessionAlive EXCEPT ![c] = TRUE] + /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = TRUE] + /\ ATSReconcileEffect(c) + /\ UNCHANGED <> + +--------------------------------------------------------------------------- + +(* + * Local ZK connection drops. + * + * The pathChildrenCache (LOCAL) loses its connection to the local + * ZK quorum. The implementation sets isHealthy = false, which + * blocks all setHAGroupStatusIfNeeded() calls with IOException. 
+ * This suppresses auto-completion, heartbeat, writer cluster-state
+ * transitions, and failover trigger.
+ *
+ * Pre: zkLocalConnected[c] = TRUE.
+ * Post: zkLocalConnected[c] = FALSE.
+ *
+ * Source: HAGroupStoreClient.createCacheListener() L894-898
+ * (CONNECTION_LOST/CONNECTION_SUSPENDED for LOCAL cache)
+ *)
+ZKLocalDisconnect(c) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ zkLocalConnected' = [zkLocalConnected EXCEPT ![c] = FALSE]
+    /\ UNCHANGED <>
+
+---------------------------------------------------------------------------
+
+(*
+ * Local ZK connection re-established.
+ *
+ * The pathChildrenCache (LOCAL) re-establishes its connection to the
+ * local ZK quorum. The implementation sets isHealthy = true,
+ * re-enabling all setHAGroupStatusIfNeeded() calls.
+ *
+ * Pre: zkLocalConnected[c] = FALSE.
+ * Post: zkLocalConnected[c] = TRUE.
+ *
+ * Source: HAGroupStoreClient.createCacheListener() L903-906
+ * (CONNECTION_RECONNECTED for LOCAL cache)
+ *)
+ZKLocalReconnect(c) ==
+    /\ zkLocalConnected[c] = FALSE
+    /\ zkLocalConnected' = [zkLocalConnected EXCEPT ![c] = TRUE]
+    /\ UNCHANGED <>
+
+============================================================================
diff --git a/src/main/tla/ConsistentFailover/markdown/Admin.md b/src/main/tla/ConsistentFailover/markdown/Admin.md
new file mode 100644
index 00000000000..0f83be66323
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/Admin.md
@@ -0,0 +1,151 @@
+# Admin -- Operator-Initiated Actions
+
+**Source:** [`Admin.tla`](../Admin.tla)
+
+## Overview
+
+`Admin` models the human operator (Admin actor) who drives failover and abort via the `PhoenixHAAdminTool` CLI, which delegates to `HAGroupStoreManager` coprocessor endpoints. This module contains four actions: `AdminStartFailover` (initiate failover), `AdminAbortFailover` (abort an in-progress failover), `AdminGoOffline` (take a standby cluster offline), and `AdminForceRecover` (force-recover from OFFLINE). The last two are gated on the `UseOfflinePeerDetection` feature flag and model the proactive design for peer OFFLINE detection using `PhoenixHAAdminTool update --force`.
+
+These are the only actions in the specification that represent deliberate human decisions rather than automated system behavior. All four receive no fairness in the liveness specifications: the admin is genuinely non-deterministic and might never initiate a failover, or might abort every failover attempt. Imposing fairness on admin actions would force unrealistic guarantees about human behavior.
+
+### Modeling Choice: Direct ZK Writes
+
+Unlike the peer-reactive transitions in [HAGroupStore.md](HAGroupStore.md), admin actions are direct ZK writes -- they are not watcher-dependent. The admin CLI writes directly to the local ZK znode via the coprocessor endpoint. No `zkPeerConnected` or `zkLocalConnected` guard is needed because the admin tool manages its own ZK connection independently of the `HAGroupStoreClient` watcher infrastructure. `AdminGoOffline` and `AdminForceRecover` also use the `--force` path, which writes directly to ZK, so no `zkLocalConnected` guard is needed for those actions either.
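+
+For contrast, guarded actions elsewhere in the spec lead with a connectivity conjunct. A minimal sketch of the helper `Writer.tla` uses for this (its definition lives in `Types.tla`, which is not shown here, so the body below is an assumption):
+
+```tla
+\* Assumed definition of the guard helper referenced by Writer.tla.
+LocalZKHealthy(c) == zkLocalConnected[c] = TRUE
+```
+
+Admin actions simply omit this conjunct, which is the point of the modeling choice: the CLI's ZK connection is independent of the watcher caches.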
+ +## Implementation Traceability + +| TLA+ Action | Java Source | +|---|---| +| `AdminStartFailover(c)` | `HAGroupStoreManager.initiateFailoverOnActiveCluster()` L375-400 | +| `AdminAbortFailover(c)` | `HAGroupStoreManager.setHAGroupStatusToAbortToStandby()` L419-425; also clears `failoverPending` (models `abortFailoverListener` L173-185) | +| `AdminGoOffline(c)` | `PhoenixHAAdminTool update --state OFFLINE` (gated on `UseOfflinePeerDetection`) | +| `AdminForceRecover(c)` | `PhoenixHAAdminTool update --force --state STANDBY` (OFFLINE -> S; gated on `UseOfflinePeerDetection`) | + +```tla +EXTENDS SpecState, Types +``` + +## AdminStartFailover -- Initiate Failover + +The admin initiates failover on the active cluster. Two paths depending on current state: + +### AIS Path: AIS -> ATS + +The cluster is fully in sync. The OUT directory must be empty and all live RS must be in SYNC mode. DEAD RSes are allowed -- an RS can crash while the cluster is AIS without changing the HA group state. The implementation checks `clusterState = AIS`, not per-RS modes; a DEAD RS is not writing, so the remaining SYNC RSes and empty OUT dir ensure safety. + +### ANIS Path: ANIS -> ANISTS + +The cluster is not in sync (at least one RS is in STORE_AND_FWD). The implementation only validates the current state (ANIS) and peer state -- no `outDirEmpty` or writer-mode guards are needed because the forwarder will drain OUT after the transition. The ANISTS -> ATS transition (`ANISTSToATS` in [HAGroupStore.md](HAGroupStore.md)) guards on `outDirEmpty` and the anti-flapping gate. + +### Peer-State Guard (Both Paths) + +The peer must be in a stable standby state (S or DS) to prevent initiating a new failover during the non-atomic window of a previous failover (where the peer may still be in ATS). Without this guard, the admin could produce an irrecoverable `(ATS, ATS)` or `(ANISTS, ATS)` deadlock where both clusters are transitioning to standby with mutations blocked on both sides. + +### Post-Condition + +Cluster `c` transitions to ATS or ANISTS, both of which map to the ACTIVE_TO_STANDBY role, blocking mutations (`isMutationBlocked() = true`). + +Source: `initiateFailoverOnActiveCluster()` L375-400 checks current state and selects AIS -> ATS or ANIS -> ANISTS. Peer-state guard: `getHAGroupStoreRecordFromPeer()` (`HAGroupStoreClient` L421). + +```tla +AdminStartFailover(c) == + /\ clusterState[Peer(c)] \in {"S", "DS"} + /\ \/ /\ clusterState[c] = "AIS" + /\ outDirEmpty[c] + /\ \A rs \in RS : writerMode[c][rs] \in {"SYNC", "DEAD"} + /\ clusterState' = [clusterState EXCEPT ![c] = "ATS"] + \/ /\ clusterState[c] = "ANIS" + /\ clusterState' = [clusterState EXCEPT ![c] = "ANISTS"] + /\ UNCHANGED <> +``` + +## AdminAbortFailover -- Abort In-Progress Failover + +The admin aborts an in-progress failover from the standby side. The cluster transitions from STA (STANDBY_TO_ACTIVE) to AbTS (ABORT_TO_STANDBY). The peer (in ATS) will react via `PeerReactToAbTS` in [HAGroupStore.md](HAGroupStore.md), transitioning to AbTAIS. Both then auto-complete back to their pre-failover states. + +### Why Abort Must Originate from the STA Side + +Abort must originate from the STA side to prevent dual-active races. If abort could originate from the ATS side, the following race would be possible: + +1. ATS writes AbTAIS (abort on active side) +2. Meanwhile, STA completes failover and writes AIS +3. 
Result: (AbTAIS, AIS) -- both clusters briefly in ACTIVE role + +By requiring abort to originate from STA (writing AbTS), the standby explicitly declares it is returning to standby. The active (in ATS) detects the peer's AbTS via watcher and transitions to AbTAIS. This ordering is safe because the STA -> AbTS transition means the standby has abandoned the failover. This is the `AbortSafety` invariant in [ConsistentFailover.md](ConsistentFailover.md). + +### failoverPending Side-Effect + +Also clears `failoverPending[c]`, modeling the `abortFailoverListener` (`ReplicationLogDiscoveryReplay.java` L173-185) which fires on LOCAL ABORT_TO_STANDBY, calling `failoverPending.set(false)`. This ensures the replay state machine does not attempt to trigger failover after the abort. + +Source: `setHAGroupStatusToAbortToStandby()` L419-425. + +```tla +AdminAbortFailover(c) == + /\ clusterState[c] = "STA" + /\ clusterState' = [clusterState EXCEPT ![c] = "AbTS"] + /\ failoverPending' = [failoverPending EXCEPT ![c] = FALSE] + /\ UNCHANGED <> +``` + +## AdminGoOffline -- Take Standby Cluster Offline + +Admin takes a standby cluster offline. Gated on `UseOfflinePeerDetection`. + +Pre: Cluster `c` is in S or DS (a standby state). +Post: Cluster `c` transitions to OFFLINE. + +In the implementation, entering OFFLINE requires `PhoenixHAAdminTool update --force --state OFFLINE`, which bypasses `isTransitionAllowed()`. The operator decides when to take a cluster offline for maintenance or decommissioning. + +No ZK connectivity guard: the `--force` path writes directly to ZK, bypassing the `isHealthy` check used by `setHAGroupStatusIfNeeded()`. + +Source: `PhoenixHAAdminTool update --state OFFLINE (--force)`. + +```tla +AdminGoOffline(c) == + /\ UseOfflinePeerDetection = TRUE + /\ clusterState[c] \in {"S", "DS"} + /\ clusterState' = [clusterState EXCEPT ![c] = "OFFLINE"] + /\ UNCHANGED <> +``` + +## AdminForceRecover -- Force-Recover from OFFLINE + +Admin force-recovers a cluster from OFFLINE. Gated on `UseOfflinePeerDetection`. + +Pre: Cluster `c` is in OFFLINE. +Post: Cluster `c` transitions to S (STANDBY). + +Recovery from OFFLINE requires `PhoenixHAAdminTool update --force --state STANDBY`, which bypasses `isTransitionAllowed()` (OFFLINE has no allowed outbound transitions in the implementation). + +The S-entry side effects mirror the pattern used by `PeerReactToAIS` (ATS->S) and `AutoComplete` (AbTS->S): +- `writerMode` reset to INIT for all RS (replication subsystem restart on standby entry) +- `outDirEmpty` set to TRUE (OUT directory cleared) +- `replayState` set to SYNCED_RECOVERY (recoveryListener fold) + +No ZK connectivity guard: the `--force` path writes directly to ZK. + +Source: `PhoenixHAAdminTool update --force --state STANDBY`. 
+ +```tla +AdminForceRecover(c) == + /\ UseOfflinePeerDetection = TRUE + /\ clusterState[c] = "OFFLINE" + /\ clusterState' = [clusterState EXCEPT ![c] = "S"] + /\ writerMode' = [writerMode EXCEPT ![c] = + [rs \in RS |-> "INIT"]] + /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE] + /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"] + /\ UNCHANGED <> +``` diff --git a/src/main/tla/ConsistentFailover/markdown/Clock.md b/src/main/tla/ConsistentFailover/markdown/Clock.md new file mode 100644 index 00000000000..bf4e135520b --- /dev/null +++ b/src/main/tla/ConsistentFailover/markdown/Clock.md @@ -0,0 +1,58 @@ +# Clock -- Anti-Flapping Countdown Timer + +**Source:** [`Clock.tla`](../Clock.tla) + +## Overview + +`Clock` provides a single `Tick` action that advances all per-cluster anti-flapping countdown timers by one tick toward 0. This follows the explicit-time pattern from Lamport, "Real Time is Really Simple" (CHARME 2005, Section 2). + +### The Lamport Countdown Timer Pattern + +Time is modeled as an ordinary variable, and lower-bound timing constraints (an action cannot fire until enough time passes) are expressed as enabling conditions on the guarded action using a countdown timer that ticks to 0. + +The key insight is that we do not need a global clock or real-time semantics. Instead: + +1. A countdown timer variable starts at a known value (`WaitTimeForSync`). +2. A `Tick` action decrements the timer by 1 (floor at 0). +3. The guarded action (ANIS -> AIS, ANISTS -> ATS) is enabled only when the timer reaches 0. +4. The S&F heartbeat (`ANISHeartbeat` in [HAGroupStore.md](HAGroupStore.md)) resets the timer to `WaitTimeForSync`, keeping the gate closed. + +This pattern models the wall-clock waiting period without introducing continuous time. The timer counts ticks, not seconds -- the relationship between ticks and real time is abstracted away. + +### Why a Separate Module? + +The `Tick` action is global (not per-cluster) -- it decrements all cluster timers simultaneously. This models the passage of time uniformly across the system. Factoring it into its own module clarifies that time advance is an independent system-wide action, not associated with any particular cluster's behavior. + +### Guard Against Stuttering + +The `Tick` action is guarded so it only fires when at least one timer is still counting down (`AntiFlapGateClosed`). This prevents useless stuttering ticks when all timers have already expired, which would inflate the state space without producing new reachable states. + +## Implementation Traceability + +| TLA+ Action | Java Source | +|---|---| +| `Tick` | Passage of wall-clock time; no direct Java counterpart. Models the interval between `HAGroupStoreClient.validateTransitionAndGetWaitTime()` checks (L1027-1046). | + +In the implementation, the anti-flapping gate is implemented via timestamp comparison: `validateTransitionAndGetWaitTime()` reads the ZK znode's `mtime` and computes the elapsed time since the last ANIS write. If the elapsed time is less than `waitTimeForSyncModeInMs`, the transition is deferred. The TLA+ countdown timer abstracts this timestamp-based mechanism into discrete ticks. + +```tla +EXTENDS SpecState, Types +``` + +## Tick -- Advance All Countdown Timers + +Each cluster's anti-flapping timer is decremented via `DecrementTimer` (floor at 0). The action is enabled only when at least one cluster has a timer still counting down (`AntiFlapGateClosed`), preventing infinite stuttering at zero. + +**Fairness:** WF (Tier 1). 
The guard depends only on protocol state (the `antiFlapTimer` variable), not on environment variables. Continuous enablement is guaranteed: once a timer is positive, it stays positive until `Tick` fires (no other action decrements it; `ANISHeartbeat` resets it to `WaitTimeForSync`, which is also positive). WF guarantees `Tick` eventually fires, ensuring the gate eventually opens. + +See [Types.md](Types.md) for the `AntiFlapGateClosed` and `DecrementTimer` helper operator definitions. + +```tla +Tick == + /\ \E c \in Cluster : AntiFlapGateClosed(antiFlapTimer[c]) + /\ antiFlapTimer' = [c \in Cluster |-> DecrementTimer(antiFlapTimer[c])] + /\ UNCHANGED <> +``` diff --git a/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-cfg.md b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-cfg.md new file mode 100644 index 00000000000..0357d33176c --- /dev/null +++ b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-cfg.md @@ -0,0 +1,102 @@ +# ConsistentFailover.cfg -- Exhaustive Safety Model Configuration + +**Source:** [`ConsistentFailover.cfg`](../ConsistentFailover.cfg) + +## Overview + +This is the primary (exhaustive) TLC model configuration for the Phoenix Consistent Failover specification. It performs a complete state-space exploration with 2 clusters and 2 region servers, verifying all 6 state invariants and 9 action constraints over every reachable state. This is the strongest verification mode: if TLC completes without finding a violation, the safety properties hold for all possible interleavings of all actions. + +### Model Checking Strategy + +**Exhaustive (breadth-first) search** explores every reachable state. The state space is bounded by: + +- 2 clusters (fixed by the protocol architecture) +- 2 RS per cluster (minimum to exercise per-RS CAS races) +- `WaitTimeForSync = 2` (minimum to exercise timer counting behavior) +- `lastRoundProcessed[c] <= 3` (state constraint to bound replay counter growth) +- RS symmetry reduction (`Permutations(RS)`) + +These choices keep the state space tractable (~95M distinct states, ~12 minutes on 16 workers with `UseOfflinePeerDetection = FALSE`). With `UseOfflinePeerDetection = TRUE`, the state space grows to ~171M distinct states, ~24 minutes on 16 workers. Both configurations exercise all safety-relevant interleavings. + +### Run Command + +```bash +java -XX:+UseParallelGC \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla -config ConsistentFailover.cfg \ + -workers auto -cleanup +``` + +## Configuration + +``` +SPECIFICATION SafetySpec +``` + +Uses `SafetySpec` (`Init /\ [][Next]_vars`) -- no fairness. Safety-only model checking avoids the temporal overhead of Buchi automaton construction, which is exponential in the number of fairness clauses. + +### Constants + +``` +CONSTANTS + Cluster = {c1, c2} + RS = {rs1, rs2} + WaitTimeForSync = 2 + UseOfflinePeerDetection = FALSE +``` + +**`Cluster = {c1, c2}`:** Exactly 2 clusters forming the HA pair, matching the protocol's architectural requirement. + +**`RS = {rs1, rs2}`:** 2 region servers per cluster. This is the minimum needed to exercise the ZK CAS race: when HDFS fails, two RS independently detect the failure and race to CAS-write AIS -> ANIS. The first succeeds; the second gets `BadVersionException` and aborts. With only 1 RS, CAS failure is unreachable. + +**`WaitTimeForSync = 2`:** The minimum value that exercises the timer's counting behavior (the timer can be at 0, 1, or 2). 
Larger values add more timer states without exercising new protocol interleavings. + +**`UseOfflinePeerDetection = FALSE`:** Feature gate for proactive AWOP/ANISWOP modeling. + +### Symmetry + +``` +SYMMETRY Symmetry +``` + +RS identifiers are interchangeable: all RS start in INIT with identical action sets. Permutation symmetry reduces the effective state space by `|RS|!` (factor of 2 with 2 RS). Cluster identifiers remain asymmetric because the initial state is asymmetric (one cluster starts AIS, the other S). + +### Invariants + +``` +INVARIANT + TypeOK + MutualExclusion + AbortSafety + AISImpliesInSync + WriterClusterConsistency + ZKSessionConsistency +``` + +All 6 state invariants are checked in every reachable state. See [ConsistentFailover.md](ConsistentFailover.md) for detailed descriptions of each invariant. + +### Action Constraints + +``` +ACTION_CONSTRAINT + TransitionValid + WriterTransitionValid + AIStoATSPrecondition + AntiFlapGate + ANISTStoATSPrecondition + ReplayTransitionValid + FailoverTriggerCorrectness + NoDataLoss + ReplayRewindCorrectness +``` + +All 9 action constraints are checked on every state transition. These verify that the `Next` relation only produces transitions consistent with the implementation's transition tables and preconditions. + +### State Constraint + +``` +CONSTRAINT + ReplayCounterBound +``` + +Bounds `lastRoundProcessed[c] <= 3` for exhaustive search tractability. The abstract counter values only matter relationally (`lastRoundProcessed >= lastRoundInSync`), so small bounds suffice. Without this constraint, the counters would grow unboundedly, making exhaustive search infeasible. diff --git a/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-cfg.md b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-cfg.md new file mode 100644 index 00000000000..714091aac18 --- /dev/null +++ b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-cfg.md @@ -0,0 +1,93 @@ +# ConsistentFailover-sim.cfg -- Simulation Safety Model Configuration + +**Source:** [`ConsistentFailover-sim.cfg`](../ConsistentFailover-sim.cfg) + +## Overview + +This is the simulation (random trace sampling) TLC model configuration for the Phoenix Consistent Failover specification. It samples random behaviors at production-scale RS count (9 RS per cluster) to stress per-RS writer interleaving. Safety-only (no fairness) -- liveness simulation uses separate configurations with smaller RS counts. + +### Model Checking Strategy + +**Simulation (random trace sampling)** generates random execution traces up to depth 10,000, sufficient for ~100 complete failover cycles with 9 RS. The 9-RS model is too large for exhaustive search (the branching factor of 38 action schemas x 9 RS makes the state space intractable) but ideal for simulation: the high branching factor is sampled efficiently. + +The simulation complements the exhaustive model by: + +1. **Production-scale RS count:** 9 RS exercises more complex per-RS writer interleavings (e.g., 4 RS in S&F, 3 in SYNC_AND_FWD, 2 in SYNC simultaneously). +2. **Larger WaitTimeForSync:** 5 ticks (vs 2 in the exhaustive model) opens a wider anti-flapping window during which HDFS failures, ZK disruptions, and RS crashes can occur while the gate is closed. +3. **No state constraint:** Replay counters grow organically along each trace without state-space tractability concerns. +4. **No symmetry:** Symmetry reduction provides no benefit for random trace sampling. 
+ +### Run Command + +```bash +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=28800 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla -config ConsistentFailover-sim.cfg \ + -simulate -depth 10000 -workers auto +``` + +The `-Dtlc2.TLC.stopAfter=28800` flag limits the run to 8 hours (28800 seconds). + +## Configuration + +``` +SPECIFICATION SafetySpec +``` + +Uses `SafetySpec` -- no fairness. Same safety-only strategy as the exhaustive model. + +### Constants + +``` +CONSTANTS + Cluster = {c1, c2} + RS = {rs1, rs2, rs3, rs4, rs5, rs6, rs7, rs8, rs9} + WaitTimeForSync = 5 + UseOfflinePeerDetection = FALSE +``` + +**`RS = {rs1, ..., rs9}`:** 9 RS exercises per-RS writer interleaving at production scale. With 9 RS, the CAS race during HDFS failure produces rich interleavings: multiple RS detect the failure at different times, some succeed at the CAS write, some fail and abort, and the resulting mix of SYNC, STORE_AND_FWD, SYNC_AND_FWD, and DEAD modes across 9 RS creates scenarios not reachable with only 2 RS. + +**`WaitTimeForSync = 5`:** Larger than the exhaustive model to explore richer interleavings during the anti-flapping wait window. With 5 ticks, the system has more time for HDFS failures, ZK disruptions, RS crashes, heartbeat resets, and forwarder drain events to interleave while the anti-flapping gate is closed. + +**`UseOfflinePeerDetection = FALSE`:** Feature gate for proactive AWOP/ANISWOP modeling. Set to TRUE to verify the OFFLINE peer detection lifecycle. + +### Invariants + +``` +INVARIANT + TypeOK + MutualExclusion + AbortSafety + AISImpliesInSync + WriterClusterConsistency + ZKSessionConsistency +``` + +Same 6 invariants as the exhaustive model. + +### Action Constraints + +``` +ACTION_CONSTRAINT + TransitionValid + WriterTransitionValid + AIStoATSPrecondition + AntiFlapGate + ANISTStoATSPrecondition + ReplayTransitionValid + FailoverTriggerCorrectness + NoDataLoss + ReplayRewindCorrectness +``` + +Same 9 action constraints as the exhaustive model. + +### No State Constraint + +Unlike the exhaustive model, no state constraint is applied. Simulation samples random traces, and counters grow organically along each trace without state-space tractability concerns. + +### No Symmetry + +Symmetry reduction is not used because it provides no benefit for random trace sampling. TLC's simulation mode generates random traces by choosing enabled actions uniformly at random -- symmetry reduction affects the state graph structure, not the sampling process. diff --git a/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-ac-cfg.md b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-ac-cfg.md new file mode 100644 index 00000000000..d50e3dd4e3d --- /dev/null +++ b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-ac-cfg.md @@ -0,0 +1,74 @@ +# ConsistentFailover-sim-liveness-ac.cfg -- AbortCompletion Liveness Configuration + +**Source:** [`ConsistentFailover-sim-liveness-ac.cfg`](../ConsistentFailover-sim-liveness-ac.cfg) + +## Overview + +This is the per-property simulation liveness configuration for the `AbortCompletion` property. It uses `FairnessAC` (3 temporal clauses per cluster, 5 total with `Tick`) to keep the Buchi automaton tractable while checking that every abort state eventually auto-completes to a stable state. 
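+
+As a shape illustration only -- the exact clause grouping is an assumption, and the authoritative formula lives in the spec -- a `FairnessAC` with this clause count might look like:
+
+```tla
+\* Sketch: one shared clause for the global Tick plus two
+\* per-cluster clauses, i.e. 5 distinct clauses in total.
+FairnessAC ==
+    /\ WF_vars(clk!Tick)
+    /\ \A c \in Cluster :
+        /\ WF_vars(zk!ZKLocalReconnect(c))
+        /\ SF_vars(haGroupStore!AutoComplete(c))
+```
+
+The rationale for trimming to exactly these clauses follows.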
+ +### Why Per-Property Liveness + +The full `Fairness` formula has 43 temporal clauses, which would cause TLC's Buchi automaton construction to blow up (the automaton size is exponential in the number of temporal clauses). Per-property formulas include only the fairness clauses on the critical path for one liveness property, keeping the automaton manageable. + +### AbortCompletion Critical Path + +The `AbortCompletion` property states: + +``` +AbortCompletion == \A c \in Cluster : + clusterState[c] \in {"AbTS", "AbTAIS", "AbTANIS"} + ~> clusterState[c] \in {"AIS", "ANIS", "S"} +``` + +The critical path for abort resolution is: + +1. `AutoComplete` fires (requires `zkLocalConnected = TRUE`) +2. If `zkLocalConnected` was FALSE, `ZKLocalReconnect` re-enables it +3. `Tick` advances the anti-flapping timer (needed if AbTANIS -> ANIS resets the timer) + +The minimal fairness formula is: +- WF on `Tick` +- WF on `ZKLocalReconnect` (ZLA) +- SF on `AutoComplete` (guarded by `zkLocalConnected`) + +### Run Command + +```bash +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=28800 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla \ + -config ConsistentFailover-sim-liveness-ac.cfg \ + -simulate -depth 10000 -workers auto +``` + +## Configuration + +``` +SPECIFICATION SpecAC +``` + +Uses `SpecAC` = `Init /\ [][Next]_vars /\ FairnessAC`. + +### Constants + +``` +CONSTANTS + Cluster = {c1, c2} + RS = {rs1, rs2} + WaitTimeForSync = 2 + UseOfflinePeerDetection = FALSE +``` + +Same as the exhaustive safety model. Small RS count keeps the Buchi automaton tractable. + +### Liveness Property + +``` +PROPERTY + AbortCompletion +``` + +### Invariants and Action Constraints + +Same 6 invariants and 9 action constraints as the safety models. Liveness checking does not disable safety checking. diff --git a/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-dr-cfg.md b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-dr-cfg.md new file mode 100644 index 00000000000..3faa1d62a59 --- /dev/null +++ b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-dr-cfg.md @@ -0,0 +1,82 @@ +# ConsistentFailover-sim-liveness-dr.cfg -- DegradationRecovery Liveness Configuration + +**Source:** [`ConsistentFailover-sim-liveness-dr.cfg`](../ConsistentFailover-sim-liveness-dr.cfg) + +## Overview + +This is the per-property simulation liveness configuration for the `DegradationRecovery` property. It uses `FairnessDR` (17 temporal clauses with 2 RS) to verify that ANIS with available peer HDFS eventually progresses out of ANIS. + +### DegradationRecovery Critical Path + +The `DegradationRecovery` property states: + +``` +DegradationRecovery == \A c \in Cluster : + (clusterState[c] = "ANIS" /\ hdfsAvailable[Peer(c)]) + ~> clusterState[c] # "ANIS" +``` + +This is the most clause-intensive property because the recovery chain involves per-RS writer actions. The critical path is: + +**Writer recovery chain (per-RS):** +1. S&F -> S&FWD (`WriterStoreFwdToSyncFwd` -- SF, guarded on `hdfsAvailable`) +2. S&FWD -> SYNC (`WriterSyncFwdToSync` -- SF, guarded on `hdfsAvailable` and `zkLocalConnected`) + +**Dead RS recovery (per-RS):** +1. `RSAbortOnLocalHDFSFailure` kills S&F writers on local HDFS failure (SF) +2. `RSRestart` restarts dead RS (SF) +3. `WriterInit` initializes restarted RS in SYNC mode (WF) + +**Cluster recovery:** +1. `ANISToAIS` fires when all RS are in SYNC/S&FWD, OUT is empty, and gate is open (SF, guarded on `zkLocalConnected`) +2. 
`HDFSUp` ensures HDFS is eventually available (SF) + +**Timer and ZK:** +1. `Tick` advances anti-flapping timer (WF) +2. `ANISHeartbeat` resets timer while S&F writers exist (WF) +3. `ZKLocalReconnect` re-enables `zkLocalConnected` (WF, ZLA) +4. `WriterSyncToSyncFwd` transitions SYNC writers to S&FWD when cluster is ANIS (WF) + +With 2 RS, the per-RS clauses contribute 2x6 = 12 clauses, plus 5 cluster-level clauses = 17 total. With 9 RS, this would be 59 clauses, far too large for TLC's Buchi automaton. + +### Run Command + +```bash +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=28800 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla \ + -config ConsistentFailover-sim-liveness-dr.cfg \ + -simulate -depth 10000 -workers auto +``` + +## Configuration + +``` +SPECIFICATION SpecDR +``` + +Uses `SpecDR` = `Init /\ [][Next]_vars /\ FairnessDR`. + +### Constants + +``` +CONSTANTS + Cluster = {c1, c2} + RS = {rs1, rs2} + WaitTimeForSync = 2 + UseOfflinePeerDetection = FALSE +``` + +Same as the exhaustive safety model. 2 RS is the maximum feasible for `DegradationRecovery` liveness checking due to the per-RS fairness clause multiplication. + +### Liveness Property + +``` +PROPERTY + DegradationRecovery +``` + +### Invariants and Action Constraints + +Same 6 invariants and 9 action constraints as the safety models. diff --git a/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-fc-cfg.md b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-fc-cfg.md new file mode 100644 index 00000000000..721c0cc4ac3 --- /dev/null +++ b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover-sim-liveness-fc-cfg.md @@ -0,0 +1,75 @@ +# ConsistentFailover-sim-liveness-fc.cfg -- FailoverCompletion Liveness Configuration + +**Source:** [`ConsistentFailover-sim-liveness-fc.cfg`](../ConsistentFailover-sim-liveness-fc.cfg) + +## Overview + +This is the per-property simulation liveness configuration for the `FailoverCompletion` property. It uses `FairnessFC` (8 temporal clauses per cluster, 15 total with `Tick`) to verify that standby-side and abort transient states eventually resolve to a stable state. + +### FailoverCompletion Critical Path + +The `FailoverCompletion` property states: + +``` +FailoverCompletion == \A c \in Cluster : + clusterState[c] \in {"STA", "AbTAIS", "AbTANIS", "AbTS"} + ~> clusterState[c] \in {"AIS", "ANIS", "S"} +``` + +This is the most complex liveness property, requiring the most fairness clauses. The critical paths are: + +**STA resolution:** +1. Replay machine completes (`ReplayAdvance`, `ReplayRewind`, `ReplayBeginProcessing`, `ReplayFinishProcessing` -- all WF) +2. HDFS becomes available (`HDFSUp` -- SF, needed for `shouldTriggerFailover()` HDFS reads) +3. `TriggerFailover` fires (SF, grouped with `AutoComplete` by `clusterState` exclusivity) + +**Abort state resolution:** +1. `AutoComplete` fires (SF, requires `zkLocalConnected`) +2. `ZKLocalReconnect` re-enables `zkLocalConnected` (WF, ZLA) + +**Timer:** +1. `Tick` advances anti-flapping timer (WF) + +The `AutoComplete` and `TriggerFailover` are grouped under a single SF because they guard on mutually exclusive `clusterState` values (AbTS/AbTAIS/AbTANIS vs STA). 
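+
+A sketch of what that grouping can look like (an assumption -- the actual `FairnessFC` clause list is not reproduced in this document):
+
+```tla
+\* Fragment: two actions with mutually exclusive enabling states
+\* can share one strong-fairness clause via their disjunction.
+SF_vars(haGroupStore!AutoComplete(c) \/ reader!TriggerFailover(c))
+```
+
+Because at most one disjunct is enabled in any given state, strong fairness on the disjunction still forces whichever action the current `clusterState` enables to eventually fire.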
+ +### Run Command + +```bash +java -XX:+UseParallelGC \ + -Dtlc2.TLC.stopAfter=28800 \ + -cp tla2tools.jar:CommunityModules-deps.jar \ + tlc2.TLC ConsistentFailover.tla \ + -config ConsistentFailover-sim-liveness-fc.cfg \ + -simulate -depth 10000 -workers auto +``` + +## Configuration + +``` +SPECIFICATION SpecFC +``` + +Uses `SpecFC` = `Init /\ [][Next]_vars /\ FairnessFC`. + +### Constants + +``` +CONSTANTS + Cluster = {c1, c2} + RS = {rs1, rs2} + WaitTimeForSync = 2 + UseOfflinePeerDetection = FALSE +``` + +Same as the exhaustive safety model. + +### Liveness Property + +``` +PROPERTY + FailoverCompletion +``` + +### Invariants and Action Constraints + +Same 6 invariants and 9 action constraints as the safety models. diff --git a/src/main/tla/ConsistentFailover/markdown/ConsistentFailover.md b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover.md new file mode 100644 index 00000000000..cf221a21ec5 --- /dev/null +++ b/src/main/tla/ConsistentFailover/markdown/ConsistentFailover.md @@ -0,0 +1,716 @@ +# ConsistentFailover -- Root Orchestrator Module + +**Source:** [`ConsistentFailover.tla`](../ConsistentFailover.tla) + +## Overview + +`ConsistentFailover` is the root orchestrator module of the Phoenix Consistent Failover TLA+ specification. State variables are declared in [`SpecState.tla`](../SpecState.tla); the root module has `EXTENDS SpecState, Types`. The root module defines the initial state (`Init`), the next-state relation (`Next`), the specification formulas (`SafetySpec`, `Spec`), and all safety invariants, action constraints, liveness properties, and fairness assumptions. It composes actor-driven actions from sub-modules via `INSTANCE`. + +The module models the HA group state machine for two paired Phoenix/HBase clusters. Each cluster maintains its HA group state in ZooKeeper. State transitions are driven by five categories of actors: + +1. **Admin actions** -- human operator initiates or aborts failover +2. **Peer-reactive transitions** -- ZK watcher notifications from the peer cluster trigger state changes +3. **Writer/reader state changes** -- per-RS replication writer mode transitions and standby-side replay progress +4. **HDFS availability incidents** -- NameNode crash and recovery +5. **ZK coordination failures** -- connection loss, session expiry, retry exhaustion + +### ZK Coordination Model + +ZK connection and session lifecycle are modeled explicitly. Peer-reactive transitions (`PeerReact*` actions) are guarded on `zkPeerConnected[c]` and `zkPeerSessionAlive[c]`. Auto-completion, heartbeat, writer ZK writes, and failover trigger are guarded on `zkLocalConnected[c]`. Retry exhaustion of the `FailoverManagementListener` (2-retry limit) is modeled as `ReactiveTransitionFail(c)` in [HAGroupStore.md](HAGroupStore.md). 
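+
+A sketch of the peer-side guard pattern this implies (the helper name is illustrative; the spec may inline the two conjuncts directly):
+
+```tla
+\* Peer-reactive actions need a live watcher on the peer znode:
+\* a TCP connection and an unexpired session.
+PeerWatcherLive(c) == /\ zkPeerConnected[c] = TRUE
+                      /\ zkPeerSessionAlive[c] = TRUE
+```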
+ +```tla +EXTENDS SpecState, Types +``` + +## Implementation Traceability + +| Modeled Concept | Java Class / Field | +|---|---| +| `clusterState` | `HAGroupStoreRecord` per-cluster ZK znode | +| `PeerReact*` actions | `FailoverManagementListener` (`HAGroupStoreManager.java` L633-706), delivered via `peerPathChildrenCache` | +| `ReactiveTransitionFail` | `FailoverManagementListener` 2-retry exhaustion (L653-704) | +| `TriggerFailover` | `Reader.TriggerFailover` via `shouldTriggerFailover()` L500-533 + `triggerFailover()` L535-548 | +| `AutoComplete` | `createLocalStateTransitions()` L140-150, delivered via local `pathChildrenCache` | +| `ANISTSToATS` | `HAGroupStoreManager.setHAGroupStatusToSync()` L341-355 | +| `AdminStartFailover` | `initiateFailoverOnActiveCluster()` L375-400 | +| `AdminAbortFailover` | `setHAGroupStatusToAbortToStandby()` L419-425 | +| `AdminGoOffline` | `PhoenixHAAdminTool update --state OFFLINE` (gated on `UseOfflinePeerDetection`) | +| `AdminForceRecover` | `PhoenixHAAdminTool update --force --state STANDBY` (OFFLINE -> S) (gated on `UseOfflinePeerDetection`) | +| `PeerReactToOFFLINE` | intended peer OFFLINE detection: AIS->AWOP, ANIS->ANISWOP; gated on `UseOfflinePeerDetection` | +| `PeerRecoverFromOFFLINE` | intended peer OFFLINE recovery: AWOP/ANISWOP->ANIS; gated on `UseOfflinePeerDetection` | +| `Init (AIS, S)` | Default initial states per team confirmation (PHOENIX_HA_TLA_PLAN.md Appendix A.6) | +| `MutualExclusion` | Architecture safety argument: at most one cluster in ACTIVE role | +| `AbortSafety` | Abort originates from STA side; AbTAIS only reachable via peer AbTS detection | +| `AllowedTransitions` | `HAGroupStoreRecord.java` L99-123 | +| `writerMode` | `ReplicationLogGroup` per-RS mode | +| `outDirEmpty` | `ReplicationLogDiscoveryForwarder.processNoMoreRoundsLeft()` L155-184 | +| `hdfsAvailable` | Abstract: NameNode availability per cluster (detected via IOException) | +| `RSCrash` | JVM crash, OOM, kill signal | +| `RSAbortOnLocalHDFSFailure` | `StoreAndForwardModeImpl.onFailure()` L115-123 | +| `HDFSDown`/`HDFSUp` | NameNode crash/recovery; `SyncModeImpl.onFailure()` L61-74 | +| `antiFlapTimer` | Countdown timer (Lamport CHARME 2005); `validateTransitionAndGetWaitTime()` L1027-1046 | +| `Tick` | Passage of wall-clock time | +| `ANISHeartbeat` | `StoreAndForwardModeImpl.startHAGroupStoreUpdateTask()` L71-87 | +| `replayState` | `ReplicationLogDiscoveryReplay` replay state (L550-555) | +| `lastRoundInSync` | `ReplicationLogDiscoveryReplay` L336-343 | +| `lastRoundProcessed` | `ReplicationLogDiscoveryReplay` L336-351 | +| `failoverPending` | `ReplicationLogDiscoveryReplay` L159-171 | +| `inProgressDirEmpty` | `ReplicationLogDiscoveryReplay` L500-533 | +| `ReplayAdvance` | `replay()` L336-343 (SYNC) and L345-351 (DEGRADED) | +| `ReplayRewind` | `replay()` L323-333 (CAS to SYNC) | +| Listener folds | `degradedListener` L136-145 and `recoveryListener` L147-157 folded into HAGroupStore S/DS-entry actions | +| `TriggerFailover` | `shouldTriggerFailover()` L500-533 + `triggerFailover()` L535-548 | +| `FailoverTriggerCorrectness` | Action constraint: STA->AIS requires replay-completeness conditions | +| `NoDataLoss` | Action constraint: zero RPO property | +| `zkPeerConnected` | `peerPathChildrenCache` TCP connection state (`HAGroupStoreClient` L110-112) | +| `zkPeerSessionAlive` | Peer ZK session state (Curator internal) | +| `zkLocalConnected` | `pathChildrenCache` TCP connection state; maps to `HAGroupStoreClient.isHealthy` (L878-911) | +| 
`ZKPeerDisconnect` | `peerPathChildrenCache` CONNECTION_LOST | +| `ZKPeerReconnect` | `peerPathChildrenCache` CONNECTION_RECONNECTED | +| `ZKPeerSessionExpiry` | Curator session expiry -> CONNECTION_LOST | +| `ZKPeerSessionRecover` | Curator retry -> new session | +| `ZKLocalDisconnect` | `pathChildrenCache` CONNECTION_LOST | +| `ZKLocalReconnect` | `pathChildrenCache` CONNECTION_RECONNECTED | + +### failoverPending Lifecycle + +| Event | Variable Effect | Source | +|---|---|---| +| Set TRUE | `PeerReactToATS` ([HAGroupStore.md](HAGroupStore.md)) | Standby detects peer ATS | +| Set FALSE | `TriggerFailover` ([Reader.md](Reader.md)) | Failover completes successfully | +| Set FALSE | `AdminAbortFailover` ([Admin.md](Admin.md)) | Operator aborts failover | + +## Variables + +The specification uses 13 variables, declared in [`SpecState.tla`](../SpecState.tla). The subsections below describe each variable’s role; see also [SpecState.md](SpecState.md). + +### Cluster State + +```tla +VARIABLE clusterState +``` + +`clusterState[c]` is the current HA group state of cluster `c`. Each cluster maintains its state as a ZK znode, updated via `setData().withVersion()` (optimistic locking). This is the primary state variable of the protocol -- almost every action reads or writes it. + +Source: `HAGroupStoreRecord` per-cluster ZK znode at `phoenix/consistentHA/`. + +### Writer State + +```tla +VARIABLE writerMode +``` + +`writerMode[c][rs]` is the current replication writer mode of region server `rs` on cluster `c`. The writer state machine is per-RS, reflecting the implementation where each `ReplicationLogGroup` independently manages its mode. Multiple RS on the same cluster can be in different modes simultaneously (e.g., one in SYNC and another in STORE_AND_FWD after an HDFS failure race). + +Source: `ReplicationLogGroup` per-RS mode (`SyncModeImpl`, `StoreAndForwardModeImpl`, `SyncAndForwardModeImpl`). + +```tla +VARIABLE outDirEmpty +``` + +`outDirEmpty[c]` is TRUE when the OUT directory on cluster `c` contains no buffered replication log files. FALSE when writes are accumulating locally. This is a per-cluster boolean (not per-RS) because the OUT directory is shared -- `ReplicationLogDiscoveryForwarder.processNoMoreRoundsLeft()` (L155-184) checks the entire directory. + +Source: `ReplicationLogDiscoveryForwarder.processNoMoreRoundsLeft()` L155-184 checks `getInProgressFiles().isEmpty() && getNewFilesForRound(nextRound).isEmpty()`. + +### Environment State + +```tla +VARIABLE hdfsAvailable +``` + +`hdfsAvailable[c]` is TRUE when cluster `c`'s HDFS (NameNode) is accessible. FALSE after a NameNode crash. This is modeled as an abstract boolean flag rather than per-file HDFS state because the specification focuses on the protocol's reaction to HDFS availability, not on HDFS internals. The flag is not explicitly tracked in the implementation -- HDFS unavailability is detected reactively via IOException from HDFS write operations. + +### Anti-Flapping Timer + +```tla +VARIABLE antiFlapTimer +``` + +`antiFlapTimer[c]` is the per-cluster anti-flapping countdown timer. Counts down from `WaitTimeForSync` toward 0. The ANIS -> AIS transition is blocked while the timer is positive (gate closed). The S&F heartbeat resets the timer to `WaitTimeForSync`; the `Tick` action decrements it. See [Types.md](Types.md) for helper operator documentation and the Lamport CHARME 2005 countdown timer pattern. + +Source: `HAGroupStoreClient.validateTransitionAndGetWaitTime()` L1027-1046. 
+
+### Replay State
+
+```tla
+VARIABLE replayState
+VARIABLE lastRoundInSync
+VARIABLE lastRoundProcessed
+VARIABLE failoverPending
+VARIABLE inProgressDirEmpty
+```
+
+These five variables model the standby-side replication replay state machine:
+
+- `replayState[c]` -- the current replay state (NOT_INITIALIZED / SYNC / DEGRADED / SYNCED_RECOVERY). Source: `ReplicationLogDiscoveryReplay.java` L550-555.
+- `lastRoundInSync[c]` -- the last round processed while in SYNC state; frozen during DEGRADED. Source: L336-343 (advance), L389 (rewind target).
+- `lastRoundProcessed[c]` -- the last round processed regardless of state; rewinds to `lastRoundInSync` during SYNCED_RECOVERY. Source: L336-351.
+- `failoverPending[c]` -- TRUE when the standby has received an STA notification and is waiting for replay to complete. Source: L159-171.
+- `inProgressDirEmpty[c]` -- TRUE when no partially-processed replication log files exist. Source: `shouldTriggerFailover()` L500-533.
+
+### ZK Coordination State
+
+```tla
+VARIABLE zkPeerConnected
+VARIABLE zkPeerSessionAlive
+VARIABLE zkLocalConnected
+```
+
+These three booleans per cluster model the ZK coordination substrate:
+
+- `zkPeerConnected[c]` -- TRUE when the `peerPathChildrenCache` has a live TCP connection to the peer ZK quorum. When FALSE, no watcher notifications from the peer are delivered, suppressing all `PeerReact*` transitions. Source: `HAGroupStoreClient.createCacheListener()` L894-906.
+- `zkPeerSessionAlive[c]` -- TRUE when the peer ZK session is alive. Session expiry permanently loses all watches until a new session is established. Session expiry implies disconnection. Source: Curator internal session management.
+- `zkLocalConnected[c]` -- TRUE when the `pathChildrenCache` (local) has a live connection. When FALSE, `isHealthy = false`, blocking all `setHAGroupStatusIfNeeded()` calls. Source: `HAGroupStoreClient.createCacheListener()` L894-906.
+
+### Variable Tuple
+
+The `vars` tuple aggregates all 13 variables for use in temporal formulas (`[][Next]_vars`, `WF_vars(...)`, `SF_vars(...)`). It is defined in [`SpecState.tla`](../SpecState.tla) as a composition of four variable-group tuples -- `writerVars`, `clusterVars`, `replayVars`, `envVars` -- so every sub-module shares the same groups when writing `UNCHANGED` clauses. See [SpecState.md](SpecState.md) for the group definitions.
+
+```tla
+\* Defined in SpecState.tla; grouping follows the variable sections above:
+\* writerVars == <<writerMode, outDirEmpty>>
+\* clusterVars == <<clusterState, antiFlapTimer>>
+\* replayVars == <<replayState, lastRoundInSync, lastRoundProcessed,
+\*                 failoverPending, inProgressDirEmpty>>
+\* envVars == <<hdfsAvailable, zkPeerConnected, zkPeerSessionAlive,
+\*              zkLocalConnected>>
+\* vars == <<writerVars, clusterVars, replayVars, envVars>>
+
+STAtoAISTriggerReplayGuards(c) ==
+    /\ failoverPending[c]
+    /\ inProgressDirEmpty[c]
+    /\ replayState[c] = "SYNC"
+```
+
+`STAtoAISTriggerReplayGuards` is the shared replay-completeness conjunction for STA -> AIS. `FailoverTriggerCorrectness` and `NoDataLoss` both use it so they cannot drift apart.
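+
+For example, a step that touches only the replay counters can leave three of the four groups unchanged wholesale; a sketch (the action name is hypothetical, and the group memberships are those shown above):
+
+```tla
+\* Hypothetical example of the group convention in UNCHANGED clauses.
+ExampleReplayOnlyStep(c) ==
+    /\ replayState[c] = "SYNC"
+    /\ lastRoundProcessed' = [lastRoundProcessed EXCEPT ![c] = @ + 1]
+    /\ lastRoundInSync' = [lastRoundInSync EXCEPT ![c] = @ + 1]
+    \* Whole group tuples may appear inside UNCHANGED.
+    /\ UNCHANGED <<writerVars, clusterVars, envVars>>
+    /\ UNCHANGED <<replayState, failoverPending, inProgressDirEmpty>>
+```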
+
+## Sub-Module Instances
+
+```tla
+haGroupStore == INSTANCE HAGroupStore
+admin == INSTANCE Admin
+writer == INSTANCE Writer
+hdfs == INSTANCE HDFS
+rs == INSTANCE RS
+clk == INSTANCE Clock
+reader == INSTANCE Reader
+zk == INSTANCE ZK
+```
+
+Each sub-module is instantiated with default parameter passing (all variables and constants are shared by name). The instance names (`haGroupStore`, `admin`, `writer`, etc.) serve as namespace prefixes in the `Next` relation: `haGroupStore!PeerReactToATS(c)`, `admin!AdminStartFailover(c)`, etc.
+
+## Initial State
+
+```tla
+Init ==
+    LET active == CHOOSE x \in Cluster : TRUE
+    IN  /\ clusterState = [c \in Cluster |->
+                              IF c = active THEN "AIS" ELSE "S"]
+        /\ writerMode = [c \in Cluster |-> [r \in RS |-> "INIT"]]
+        /\ outDirEmpty = [c \in Cluster |-> TRUE]
+        /\ hdfsAvailable = [c \in Cluster |-> TRUE]
+        /\ antiFlapTimer = [c \in Cluster |-> 0]
+        /\ replayState = [c \in Cluster |->
+                             IF c = active THEN "NOT_INITIALIZED"
+                             ELSE "SYNCED_RECOVERY"]
+        /\ lastRoundInSync = [c \in Cluster |-> 0]
+        /\ lastRoundProcessed = [c \in Cluster |-> 0]
+        /\ failoverPending = [c \in Cluster |-> FALSE]
+        /\ inProgressDirEmpty = [c \in Cluster |-> TRUE]
+        /\ zkPeerConnected = [c \in Cluster |-> TRUE]
+        /\ zkPeerSessionAlive = [c \in Cluster |-> TRUE]
+        /\ zkLocalConnected = [c \in Cluster |-> TRUE]
+```
+
+The system starts with one cluster active and in sync (AIS) and the other in standby (S). The choice of which cluster is active is deterministic: `CHOOSE` picks an arbitrary but fixed element of `Cluster` as the initial active.
+
+### Modeling Choices in Init
+
+**Standby starts in SYNCED_RECOVERY, not SYNC:** The standby starts with `replayState = SYNCED_RECOVERY`, modeling the `recoveryListener` having already fired during startup. In the implementation, `NOT_INITIALIZED -> SYNCED_RECOVERY` is synchronous with S entry on the local `PathChildrenCache` event thread. The active starts `NOT_INITIALIZED` because the reader is dormant until the cluster first enters S after a failover.
+
+**All writers in INIT:** All RS start in INIT mode, reflecting the pre-initialization state before the `ReplicationLogGroup` is created and modes are assigned based on HDFS availability and cluster state.
+
+**All ZK connections alive:** The system starts with all ZK connections healthy. Failures are introduced non-deterministically by the `ZKPeerDisconnect`, `ZKPeerSessionExpiry`, and `ZKLocalDisconnect` actions.
+
+**Anti-flapping timers at zero:** Timers start at 0 (gate open), reflecting a clean startup with no prior degradation history.
+
+## Next-State Relation
+
+Each step fires exactly one action: the global `Tick`, or a single actor-driven action on one cluster (and, for writer and RS-lifecycle actions, on one region server).
Actions are factored by actor: + +```tla +Next == + \/ clk!Tick + \/ \E c \in Cluster : + \/ haGroupStore!PeerReactToATS(c) + \/ haGroupStore!PeerReactToANIS(c) + \/ haGroupStore!PeerReactToAbTS(c) + \/ haGroupStore!AutoComplete(c) + \/ reader!TriggerFailover(c) + \/ haGroupStore!PeerReactToAIS(c) + \/ haGroupStore!ANISHeartbeat(c) + \/ haGroupStore!ANISToAIS(c) + \/ haGroupStore!ANISTSToATS(c) + \/ haGroupStore!ReactiveTransitionFail(c) + \/ haGroupStore!PeerReactToOFFLINE(c) + \/ haGroupStore!PeerRecoverFromOFFLINE(c) + \/ admin!AdminStartFailover(c) + \/ admin!AdminAbortFailover(c) + \/ admin!AdminGoOffline(c) + \/ admin!AdminForceRecover(c) + \/ hdfs!HDFSDown(c) + \/ hdfs!HDFSUp(c) + \/ zk!ZKPeerDisconnect(c) + \/ zk!ZKPeerReconnect(c) + \/ zk!ZKPeerSessionExpiry(c) + \/ zk!ZKPeerSessionRecover(c) + \/ zk!ZKLocalDisconnect(c) + \/ zk!ZKLocalReconnect(c) + \/ reader!ReplayAdvance(c) + \/ reader!ReplayRewind(c) + \/ reader!ReplayBeginProcessing(c) + \/ reader!ReplayFinishProcessing(c) + \/ \E r \in RS : + \/ writer!WriterInit(c, r) + \/ writer!WriterInitToStoreFwd(c, r) + \/ writer!WriterInitToStoreFwdFail(c, r) + \/ writer!WriterSyncToSyncFwd(c, r) + \/ writer!WriterStoreFwdToSyncFwd(c, r) + \/ writer!WriterSyncFwdToSync(c, r) + \/ writer!WriterToStoreFwd(c, r) + \/ writer!WriterSyncFwdToStoreFwd(c, r) + \/ writer!WriterToStoreFwdFail(c, r) + \/ writer!WriterSyncFwdToStoreFwdFail(c, r) + \/ rs!RSRestart(c, r) + \/ rs!RSCrash(c, r) + \/ rs!RSAbortOnLocalHDFSFailure(c, r) +``` + +### Action Categories + +The 42 action schemas decompose into: + +- **Timer:** `Tick` -- global, not per-cluster +- **ZK watcher (peer):** `PeerReactToATS`, `PeerReactToANIS`, `PeerReactToAbTS`, `PeerReactToAIS`, `PeerReactToOFFLINE`, `PeerRecoverFromOFFLINE` -- require `zkPeerConnected` and `zkPeerSessionAlive` +- **ZK watcher (local):** `AutoComplete`, `ANISHeartbeat`, `ANISToAIS`, `ANISTSToATS`, `TriggerFailover` -- require `zkLocalConnected` +- **Retry exhaustion:** `ReactiveTransitionFail` -- requires peer ZK connectivity (same as PeerReact*) +- **Direct ZK write:** `AdminStartFailover`, `AdminAbortFailover`, `AdminGoOffline`, `AdminForceRecover` -- not watcher-dependent +- **Environment:** `HDFSDown`, `HDFSUp`, `ZKPeer*`, `ZKLocal*` -- environment actions +- **Reader:** `ReplayAdvance`, `ReplayRewind`, `ReplayBeginProcessing`, `ReplayFinishProcessing` -- replay state machine +- **Writer (per-RS):** 10 writer mode transitions -- some require `zkLocalConnected` +- **RS lifecycle (per-RS):** `RSRestart`, `RSCrash`, `RSAbortOnLocalHDFSFailure` + +## Specification Formulas + +### Safety-Only Specification + +```tla +SafetySpec == Init /\ [][Next]_vars +``` + +Initial state, followed by zero or more `Next` steps (or stuttering). No fairness -- used for fast safety-only model checking without temporal overhead. This is the specification used by the exhaustive and simulation safety configurations. + +### Full Specification + +```tla +Spec == Init /\ [][Next]_vars /\ Fairness +``` + +Safety conjoined with the complete fairness formula. Documents the full fairness design but has 43 temporal clauses -- too large for TLC's Buchi automaton construction. Used only in THEOREM declarations. + +## Fairness + +The fairness formula classifies every action in `Next` into one of four tiers. 
The guiding principle: any action whose guard depends on an environment variable that oscillates without fairness needs strong fairness (SF), because the adversary can cycle the environment variable once per lasso cycle to break weak fairness (WF)'s continuous-enablement requirement. + +### Tier 1: WF on Protocol-Internal Steps + +Guards depend only on protocol state; continuous enablement is guaranteed by protocol progress. + +```tla + /\ WF_vars(clk!Tick) + /\ \A c \in Cluster : + /\ WF_vars(haGroupStore!ANISHeartbeat(c)) + /\ WF_vars(reader!ReplayAdvance(c)) + /\ WF_vars(reader!ReplayRewind(c)) + /\ WF_vars(reader!ReplayBeginProcessing(c)) + /\ WF_vars(reader!ReplayFinishProcessing(c)) +``` + +**Exception: ANISHeartbeat** keeps WF despite its `zkLocalConnected` guard because suppressing the heartbeat *helps* liveness (the anti-flap gate opens sooner). SF would be counterproductive -- it would force the heartbeat to fire, keeping the gate closed. + +### Tier 2: WF on ZK Recovery (ZK Liveness Assumption) + +```tla + /\ WF_vars(zk!ZKPeerReconnect(c)) + /\ WF_vars(zk!ZKPeerSessionRecover(c)) + /\ WF_vars(zk!ZKLocalReconnect(c)) +``` + +Encodes the ZK Liveness Assumption (ZLA): ZK sessions are eventually alive and connected. These recovery actions are the basis for SF on all actions guarded by `zkPeerConnected` or `zkLocalConnected`. + +### Tier 3: SF on Actions Guarded by Environment Variables + +Actions guarded by environment variables that oscillate without fairness (`zkPeerConnected`, `zkPeerSessionAlive`, `zkLocalConnected`, `hdfsAvailable`). Grouped by mutual exclusivity to keep TLC's temporal formula within its DNF size limit. + +When at most one disjunct is ENABLED in any state, `SF(A1 \/ ... \/ An)` is equivalent to `SF(A1) /\ ... /\ SF(An)`, because the only disjunct that can fire is the one that is enabled. Mutual exclusivity is guaranteed by the single-valued nature of `clusterState` (per-cluster groups) and `writerMode` (per-RS groups). + +```tla + /\ SF_vars(haGroupStore!PeerReactToATS(c) + \/ haGroupStore!PeerReactToANIS(c) + \/ haGroupStore!PeerReactToAbTS(c) + \/ haGroupStore!PeerReactToAIS(c) + \/ haGroupStore!PeerReactToOFFLINE(c) + \/ haGroupStore!PeerRecoverFromOFFLINE(c)) + /\ SF_vars(haGroupStore!AutoComplete(c) + \/ haGroupStore!ANISToAIS(c) + \/ haGroupStore!ANISTSToATS(c) + \/ reader!TriggerFailover(c)) + /\ SF_vars(hdfs!HDFSUp(c)) + /\ \A r \in RS : + /\ WF_vars(writer!WriterInit(c, r)) + /\ WF_vars(writer!WriterSyncToSyncFwd(c, r)) + /\ SF_vars(writer!WriterToStoreFwd(c, r) + \/ writer!WriterSyncFwdToStoreFwd(c, r) + \/ writer!WriterInitToStoreFwd(c, r)) + /\ SF_vars(writer!WriterStoreFwdToSyncFwd(c, r) + \/ writer!WriterSyncFwdToSync(c, r)) + /\ SF_vars(rs!RSAbortOnLocalHDFSFailure(c, r) + \/ rs!RSRestart(c, r)) +``` + +### Tier 4: No Fairness + +No fairness on non-deterministic environmental faults (`HDFSDown`, `RSCrash`, `ZKPeerDisconnect`, `ZKPeerSessionExpiry`, `ZKLocalDisconnect`, `ReactiveTransitionFail`), operator actions (`AdminStartFailover`, `AdminAbortFailover`, `AdminGoOffline`, `AdminForceRecover`), and CAS failures (`WriterToStoreFwdFail`, `WriterSyncFwdToStoreFwdFail`, `WriterInitToStoreFwdFail`). These are genuinely non-deterministic; imposing fairness would force unrealistic guarantees. + +## Per-Property Liveness Specifications + +Each per-property specification conjoins only the fairness clauses on the critical path for one liveness property, keeping the temporal formula small enough for TLC's Buchi automaton construction. 
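+
+As a concrete illustration, a per-property TLC run can be driven by a configuration along the following lines. This is a hypothetical `.cfg` sketch -- the project's actual configuration files are not reproduced in this document, and the constant values are illustrative; `SpecFC` and `FailoverCompletion` are defined below:
+
+```
+SPECIFICATION SpecFC
+PROPERTY FailoverCompletion
+INVARIANT TypeOK
+CONSTANTS
+    Cluster = {c1, c2}
+    RS = {rs1, rs2}
+    WaitTimeForSync = 2
+    StartAntiFlapWait = 2
+    UseOfflinePeerDetection = TRUE
+```
+
+Keeping the conjoined fairness small is what makes such a run feasible; the full `Fairness` formula would exceed what TLC's Buchi construction handles, as noted above.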
+ +### SpecAC -- AbortCompletion + +```tla +FairnessAC == + /\ WF_vars(clk!Tick) + /\ \A c \in Cluster : + /\ WF_vars(zk!ZKLocalReconnect(c)) + /\ SF_vars(haGroupStore!AutoComplete(c)) + +SpecAC == Init /\ [][Next]_vars /\ FairnessAC +``` + +Critical path: `AutoComplete` (SF, `zkLocalConnected` guard), `ZKLocalReconnect` (WF, re-enables `zkLocalConnected`), `Tick` (WF). 5 temporal clauses total. + +### SpecFC -- FailoverCompletion + +```tla +FairnessFC == + /\ WF_vars(clk!Tick) + /\ \A c \in Cluster : + /\ WF_vars(zk!ZKLocalReconnect(c)) + /\ WF_vars(reader!ReplayAdvance(c)) + /\ WF_vars(reader!ReplayRewind(c)) + /\ WF_vars(reader!ReplayBeginProcessing(c)) + /\ WF_vars(reader!ReplayFinishProcessing(c)) + /\ SF_vars(haGroupStore!AutoComplete(c) + \/ reader!TriggerFailover(c)) + /\ SF_vars(hdfs!HDFSUp(c)) + +SpecFC == Init /\ [][Next]_vars /\ FairnessFC +``` + +Critical path: `AutoComplete` + `TriggerFailover` (SF, grouped by `clusterState` exclusivity), `HDFSUp` (SF), `ZKLocalReconnect` (WF), replay machine including `ReplayRewind` (WF), `Tick` (WF). 15 temporal clauses total. + +### SpecDR -- DegradationRecovery + +```tla +FairnessDR == + /\ WF_vars(clk!Tick) + /\ \A c \in Cluster : + /\ WF_vars(zk!ZKLocalReconnect(c)) + /\ WF_vars(haGroupStore!ANISHeartbeat(c)) + /\ SF_vars(haGroupStore!ANISToAIS(c)) + /\ SF_vars(hdfs!HDFSUp(c)) + /\ \A r \in RS : + /\ WF_vars(writer!WriterInit(c, r)) + /\ WF_vars(writer!WriterSyncToSyncFwd(c, r)) + /\ SF_vars(writer!WriterStoreFwdToSyncFwd(c, r) + \/ writer!WriterSyncFwdToSync(c, r)) + /\ SF_vars(rs!RSAbortOnLocalHDFSFailure(c, r) + \/ rs!RSRestart(c, r)) + +SpecDR == Init /\ [][Next]_vars /\ FairnessDR +``` + +Critical path: `ANISToAIS` (SF), `HDFSUp` (SF), `ZKLocalReconnect` (WF), `Tick` (WF), `ANISHeartbeat` (WF), per-RS writer recovery chain (SF) and lifecycle (SF), `WriterInit` and `WriterSyncToSyncFwd` (WF). 25 temporal clauses total with 2 RS. + +## Liveness Properties + +### FailoverCompletion + +```tla +FailoverCompletion == + \A c \in Cluster : + clusterState[c] \in FailoverCompletionAntecedentStates + ~> clusterState[c] \in StableClusterStates +``` + +Standby-side and abort transient states eventually resolve to a stable state. Resolution paths: + +- `STA -> AIS` (TriggerFailover) or `STA -> AbTS -> S` (abort) +- `AbTAIS -> AIS/ANIS`, `AbTANIS -> ANIS`, `AbTS -> S` (auto-completion) + +**ATS and ANISTS are excluded** from this property. Their resolution depends on the peer completing failover (`PeerReactToAIS`/`PeerReactToANIS`) or on abort propagation (`PeerReactToAbTS`). Both require the peer to reach a specific state AND the ZK peer connection to be alive at the right moment. With no fairness on admin actions (the admin can abort every failover attempt) and no fairness on ZK disconnect (the scheduler can disconnect exactly when the peer is in AbTS), ATS can remain indefinitely. ATS does have a resolution path via the reconciliation fold in `ZKPeerReconnect`/`ZKPeerSessionRecover` (ATS -> AbTAIS -> AIS when peer is in S/DS at reconnect), but adding ATS here would require extending `FairnessFC` with the peer-reactive SF group. + +### DegradationRecovery + +```tla +DegradationRecovery == + \A c \in Cluster : + (clusterState[c] = "ANIS" /\ hdfsAvailable[Peer(c)]) + ~> clusterState[c] \in NotANISClusterStates +``` + +ANIS with available peer HDFS eventually progresses out of ANIS. 
The recovery chain is: S&F -> SYNC_AND_FWD (`WriterStoreFwdToSyncFwd`) -> SYNC (`WriterSyncFwdToSync`, which sets `outDirEmpty`) -> anti-flap timer expires (`Tick`) -> ANIS -> AIS (`ANISToAIS`). The cluster may also leave ANIS via failover (ANIS -> ANISTS), which satisfies the consequent.
+
+### AbortCompletion
+
+```tla
+AbortCompletion ==
+    \A c \in Cluster :
+        clusterState[c] \in AbortCompletionAntecedentStates
+            ~> clusterState[c] \in StableClusterStates
+```
+
+Every abort state eventually auto-completes to a stable state. Under SF on `AutoComplete` (as in `FairnessAC`), each abort state deterministically resolves. Requires `zkLocalConnected` (`AutoComplete` guard).
+
+## Type Invariant
+
+```tla
+TypeOK ==
+    /\ clusterState \in [Cluster -> HAGroupState]
+    /\ writerMode \in [Cluster -> [RS -> WriterMode]]
+    /\ outDirEmpty \in [Cluster -> BOOLEAN]
+    /\ hdfsAvailable \in [Cluster -> BOOLEAN]
+    /\ antiFlapTimer \in [Cluster -> 0..WaitTimeForSync]
+    /\ replayState \in [Cluster -> ReplayStateSet]
+    /\ lastRoundInSync \in [Cluster -> Nat]
+    /\ lastRoundProcessed \in [Cluster -> Nat]
+    /\ failoverPending \in [Cluster -> BOOLEAN]
+    /\ inProgressDirEmpty \in [Cluster -> BOOLEAN]
+    /\ zkPeerConnected \in [Cluster -> BOOLEAN]
+    /\ zkPeerSessionAlive \in [Cluster -> BOOLEAN]
+    /\ zkLocalConnected \in [Cluster -> BOOLEAN]
+```
+
+All specification variables have valid types. TLC checks this invariant in every reachable state.
+
+## Safety Invariants
+
+### ZKSessionConsistency
+
+```tla
+ZKSessionConsistency ==
+    \A c \in Cluster :
+        zkPeerSessionAlive[c] = FALSE => zkPeerConnected[c] = FALSE
+```
+
+ZK session/connection structural consistency: if the peer ZK session is expired, the peer connection must also be dead. Session expiry implies disconnection -- the `ZKPeerSessionExpiry` action sets both `zkPeerSessionAlive` and `zkPeerConnected` to FALSE. `ZKPeerReconnect` requires `zkPeerSessionAlive = TRUE`, so a reconnect cannot happen without a live session. This invariant verifies that the ZK actions correctly maintain the session/connection relationship across all reachable states.
+
+### MutualExclusion
+
+```tla
+MutualExclusion ==
+    ~(\E c1, c2 \in Cluster :
+        /\ c1 # c2
+        /\ RoleOf(clusterState[c1]) = "ACTIVE"
+        /\ RoleOf(clusterState[c2]) = "ACTIVE")
+```
+
+**The primary safety property of the failover protocol.** Two clusters are never both in the ACTIVE role simultaneously. The ACTIVE role includes: AIS, ANIS, AbTAIS, AbTANIS, AWOP, ANISWOP. Transitional states ATS and ANISTS map to the ACTIVE_TO_STANDBY role (not ACTIVE), which is the mechanism by which safety is maintained during the non-atomic failover window -- `isMutationBlocked() = true` for ACTIVE_TO_STANDBY.
+
+Source: Architecture safety argument; `ClusterRoleRecord.java` L84 -- ACTIVE_TO_STANDBY has `isMutationBlocked() = true`.
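+
+The mapping itself lives in `Types.tla` as `RoleOf`. A sketch consistent with the role assignments described in this document (the `RoleOfSketch` name is hypothetical, the STA row is inferred, and the real definition may differ in detail):
+
+```tla
+\* Sketch of the state -> role mapping; the actual RoleOf operator is
+\* defined in Types.tla. The STANDBY_TO_ACTIVE row is inferred.
+RoleOfSketch(s) ==
+    CASE s \in {"AIS", "ANIS", "AbTAIS", "AbTANIS",
+                "AWOP", "ANISWOP"}   -> "ACTIVE"
+      [] s \in {"ATS", "ANISTS"}     -> "ACTIVE_TO_STANDBY"
+      [] s = "STA"                   -> "STANDBY_TO_ACTIVE"
+      [] OTHER                       -> "STANDBY"  \* S, DS, AbTS, OFFLINE
+```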
+
+### AbortSafety
+
+```tla
+AbortSafety ==
+    \A c \in Cluster :
+        clusterState[c] = "AbTAIS" =>
+            clusterState[Peer(c)] \in {"AbTS", "S", "DS", "OFFLINE"}
+```
+
+If a cluster is in AbTAIS, the peer must be in AbTS, S, DS, or OFFLINE. Three scenarios produce these peer states:
+
+1. **Abort path:** `PeerReactToAbTS` (peer = AbTS). The peer can auto-complete AbTS -> S before the local AbTAIS auto-completes.
+2. **Reconciliation path:** `ZKPeerReconnect`/`ZKPeerSessionRecover` with local = ATS and peer in {S, DS}. DS is reachable when the peer degraded (S -> DS via `PeerReactToANIS`) before the failover partition.
+3. **OFFLINE path:** The peer transitions to OFFLINE (via `AdminGoOffline`) while the local cluster is in AbTAIS; safety is preserved because OFFLINE is a non-active state.
+
+All four peer states (AbTS, S, DS, OFFLINE) map to the STANDBY role, so MutualExclusion is preserved in all cases.
+
+### AISImpliesInSync
+
+```tla
+AISImpliesInSync ==
+    \A c \in Cluster :
+        clusterState[c] = "AIS" =>
+            /\ outDirEmpty[c]
+            /\ \A r \in RS : writerMode[c][r] \in {"INIT", "SYNC", "DEAD"}
+```
+
+Whenever a cluster is in AIS, the OUT directory must be empty and all RS must be in SYNC, INIT, or DEAD. DEAD is allowed because an RS can crash while the cluster is AIS -- `RSCrash` sets `writerMode` to DEAD but does not change `clusterState`.
+
+### WriterClusterConsistency
+
+```tla
+WriterClusterConsistency ==
+    \A c \in Cluster :
+        (\E r \in RS : writerMode[c][r] \in {"STORE_AND_FWD", "SYNC_AND_FWD"}) =>
+            clusterState[c] \in {"ANIS", "ANISTS", "ATS", "ANISWOP",
+                                 "AbTANIS", "AbTAIS", "AWOP"}
+```
+
+Degraded writer modes (S&F, SYNC_AND_FWD) can only appear on active clusters that are NOT in AIS, on the ANISTS/ATS transitional states, or on abort states where HDFS failure can degrade writers. AIS is excluded by the AIS->ANIS coupling. ATS is included because the ANIS failover path enters ATS via `ANISTSToATS`, which does NOT snap writer modes. Standby states are excluded because writer modes are reset to INIT on ATS -> S entry.
+
+## Action Constraints
+
+### TransitionValid
+
+```tla
+TransitionValid ==
+    \A c \in Cluster :
+        clusterState'[c] # clusterState[c] =>
+            <<clusterState[c], clusterState'[c]>> \in AllowedTransitions
+```
+
+Every state change follows the `AllowedTransitions` table from [Types.md](Types.md). Source: `HAGroupStoreRecord.java` L99-123, `isTransitionAllowed()` L130.
+
+### WriterTransitionValid
+
+`AllowedWriterTransitions` is defined in [Types.md](Types.md).
+
+```tla
+WriterTransitionValid ==
+    \A c \in Cluster :
+        \A r \in RS :
+            writerMode'[c][r] # writerMode[c][r] =>
+                <<writerMode[c][r], writerMode'[c][r]>> \in AllowedWriterTransitions
+```
+
+The `X -> INIT` transitions (SYNC, STORE_AND_FWD, SYNC_AND_FWD) model the replication subsystem restart on ATS -> S (standby entry). These are lifecycle resets, not `ReplicationLogGroup` mode CAS transitions.
+
+### AIStoATSPrecondition
+
+```tla
+AIStoATSPrecondition ==
+    \A c \in Cluster :
+        clusterState[c] = "AIS" /\ clusterState'[c] = "ATS"
+            => outDirEmpty[c] /\ \A r \in RS : writerMode[c][r] \in {"SYNC", "DEAD"}
+```
+
+Failover can only begin from AIS when the OUT directory is empty and all live RS are in SYNC mode. DEAD RSes are allowed -- an RS can crash while the cluster is AIS without changing the HA group state.
+
+### AntiFlapGate
+
+```tla
+AntiFlapGate ==
+    \A c \in Cluster :
+        clusterState[c] = "ANIS" /\ clusterState'[c] = "AIS"
+            => AntiFlapGateOpen(antiFlapTimer[c])
+```
+
+ANIS -> AIS never fires while the countdown timer is still running.
+
+### ANISTStoATSPrecondition
+
+```tla
+ANISTStoATSPrecondition ==
+    \A c \in Cluster :
+        clusterState[c] = "ANISTS" /\ clusterState'[c] = "ATS"
+            => /\ outDirEmpty[c]
+               /\ AntiFlapGateOpen(antiFlapTimer[c])
+```
+
+ANISTS -> ATS requires an empty OUT directory and an open anti-flapping gate.
+
+### FailoverTriggerCorrectness
+
+```tla
+FailoverTriggerCorrectness ==
+    \A c \in Cluster :
+        clusterState[c] = "STA" /\ clusterState'[c] = "AIS"
+            => STAtoAISTriggerReplayGuards(c)
+```
+
+STA -> AIS requires replay-completeness conditions. Cross-checks the `TriggerFailover` action's guards.
`hdfsAvailable` is excluded because it is an environmental/liveness guard, not a replay-completeness condition.
+
+### NoDataLoss
+
+```tla
+NoDataLoss ==
+    \A c \in Cluster :
+        clusterState[c] = "STA" /\ clusterState'[c] = "AIS"
+            => STAtoAISTriggerReplayGuards(c)
+```
+
+**Zero RPO property.** When the standby completes STA -> AIS, replay must have been in SYNC (no pending SYNCED_RECOVERY rewind), the in-progress directory must be empty, and the failover must have been properly initiated.
+
+### ReplayRewindCorrectness
+
+```tla
+ReplayRewindCorrectness ==
+    \A c \in Cluster :
+        replayState[c] = "SYNCED_RECOVERY" /\ replayState'[c] = "SYNC"
+            => lastRoundProcessed'[c] = lastRoundInSync'[c]
+```
+
+The SYNCED_RECOVERY -> SYNC transition equalizes the replay counters. Together with `NoDataLoss`, this guarantees zero RPO: the rewind closes the counter gap, and `TriggerFailover` (which requires `replayState = SYNC`) cannot fire until the rewind completes.
+
+### ReplayTransitionValid
+
+`AllowedReplayTransitions` is defined in [Types.md](Types.md).
+
+```tla
+ReplayTransitionValid ==
+    \A c \in Cluster :
+        replayState'[c] # replayState[c] =>
+            <<replayState[c], replayState'[c]>> \in AllowedReplayTransitions
+```
+
+Every replay state change follows the allowed transitions. Source: `ReplicationLogDiscoveryReplay.java` L131-206 (listeners), L323-333 (CAS), L336-351 (replay loop).
+
+## State Constraint
+
+```tla
+ReplayCounterBound ==
+    \A c \in Cluster : lastRoundProcessed[c] <= 3
+```
+
+Bounds replay counters for exhaustive search tractability. The abstract counter values only matter relationally (`lastRoundProcessed >= lastRoundInSync`), so small bounds suffice.
+
+## Symmetry
+
+```tla
+Symmetry == Permutations(RS)
+```
+
+RS identifiers are interchangeable (all start in INIT, identical action sets). Cluster identifiers remain asymmetric (AIS vs S at Init).
+
+## Theorems
+
+```tla
+THEOREM Spec => []TypeOK
+THEOREM Spec => []MutualExclusion
+THEOREM Spec => []AbortSafety
+THEOREM Spec => []AISImpliesInSync
+THEOREM Spec => []WriterClusterConsistency
+THEOREM Spec => []ZKSessionConsistency
+THEOREM Spec => [][ANISTStoATSPrecondition]_vars
+THEOREM Spec => [][FailoverTriggerCorrectness]_vars
+THEOREM Spec => [][NoDataLoss]_vars
+THEOREM Spec => [][ReplayRewindCorrectness]_vars
+THEOREM SpecFC => FailoverCompletion
+THEOREM SpecDR => DegradationRecovery
+THEOREM SpecAC => AbortCompletion
+```
+
+These theorem declarations document the intended proof obligations. TLC checks the safety theorems via the exhaustive and simulation configurations; the liveness theorems are checked via the per-property simulation configurations.
diff --git a/src/main/tla/ConsistentFailover/markdown/HAGroupStore.md b/src/main/tla/ConsistentFailover/markdown/HAGroupStore.md
new file mode 100644
index 00000000000..b8c48db40cf
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/HAGroupStore.md
@@ -0,0 +1,445 @@
+# HAGroupStore -- Peer-Reactive Transitions and Auto-Completion
+
+**Source:** [`HAGroupStore.tla`](../HAGroupStore.tla)
+
+## Overview
+
+`HAGroupStore` models the peer-reactive transitions and auto-completion actions of the Phoenix Consistent Failover protocol. These actions correspond to the `FailoverManagementListener` (`HAGroupStoreManager.java` L633-706), which reacts to peer ZK state changes via `PathChildrenCache` watchers, and the local auto-completion resolvers from `createLocalStateTransitions()` (L140-150).
+ +This is the largest sub-module by action count, containing 11 action schemas that handle: + +- **Peer-reactive transitions:** Detecting the peer's state change via ZK watcher and transitioning accordingly +- **Auto-completion:** Returning from abort states to stable states via local ZK writes +- **S&F heartbeat:** Refreshing the anti-flapping timer during STORE_AND_FWD degradation +- **Recovery transitions:** ANIS -> AIS when all RS recover, ANISTS -> ATS when OUT drains +- **Peer OFFLINE detection:** Reacting to peer entering or leaving OFFLINE state (gated on `UseOfflinePeerDetection`) +- **Retry exhaustion:** Modeling the case where the `FailoverManagementListener`'s 2-retry limit is exceeded + +### ZK Watcher Delivery Dependency + +All `PeerReact*` actions depend on the peer ZK connection and session being alive, guarded by `zkPeerConnected[c]` and `zkPeerSessionAlive[c]`. `AutoComplete` actions depend on the local ZK connection, guarded by `zkLocalConnected[c]`. Without these connections, watcher notifications cannot be delivered, and the corresponding transitions are suppressed. + +This models a critical implementation detail: the `FailoverManagementListener` is invoked by the `PathChildrenCache` watcher chain, not by polling. If the watcher connection is down, no notifications arrive, and the transition never fires. There is no polling fallback in the implementation. + +### Notification Chains + +**Peer-reactive transitions:** +``` +Peer ZK znode change + -> Curator peerPathChildrenCache + -> HAGroupStoreClient.handleStateChange() [L1088-1110] + -> notifySubscribers() [L1119-1151] + -> FailoverManagementListener.onStateChange() [L653-705] + -> setHAGroupStatusIfNeeded() (2-retry limit) +``` + +**Auto-completion transitions:** +``` +Local ZK znode change + -> Curator pathChildrenCache (local) + -> HAGroupStoreClient.handleStateChange() + -> notifySubscribers() + -> FailoverManagementListener.onStateChange() +``` + +### Listener Effect Folding + +The `recoveryListener` (L147-157) and `degradedListener` (L136-145) from `ReplicationLogDiscoveryReplay` fire synchronously on the local `PathChildrenCache` event thread during state entry. Their effects are folded atomically into the S-entry and DS-entry actions: + +- **S entry** (`PeerReactToANIS` ATS->S, `PeerReactToAIS` ATS->S / DS->S, `AutoComplete` AbTS->S): sets `replayState = SYNCED_RECOVERY` +- **DS entry** (`PeerReactToANIS` S->DS): sets `replayState = DEGRADED` + +This folding is sound because the listener fires deterministically and synchronously on every state entry -- there is no observable intermediate state between the cluster state change and the replay state change. + +The ATS->S side-effect bundle (live writers reset to INIT preserving DEAD, `outDirEmpty` cleared, `replayState` set to SYNCED_RECOVERY) is shared by `PeerReactToAIS` (ATS->S) and `PeerReactToANIS` (ATS->S) and extracted into a module-local operator: + +```tla +ResetToStandbyEntry(c) == + /\ writerMode' = [writerMode EXCEPT ![c] = + [rs \in RS |-> IF writerMode[c][rs] = "DEAD" + THEN "DEAD" + ELSE "INIT"]] + /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE] + /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"] +``` + +`ResetToStandbyEntry` is intentionally NOT applied to `AdminForceRecover` (which resets all writers to INIT with no DEAD-preservation) or to `AutoComplete` AbTS->S (which only sets `replayState`; no writer/OUT reset is needed because AbTS was never active). 
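+
+With the shared operator in hand, an ATS -> S reaction can be phrased compactly. A sketch (the operator name is hypothetical; the published action bodies later on this page spell the same updates out inline):
+
+```tla
+\* Sketch: ATS -> S entry expressed via ResetToStandbyEntry. The
+\* remaining variables would still need explicit UNCHANGED clauses.
+AtsEntryToStandby(c) ==
+    /\ clusterState[c] = "ATS"
+    /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+    /\ ResetToStandbyEntry(c)
+```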
+ +### Retry Exhaustion + +The `FailoverManagementListener` retries each reactive transition exactly 2 times (`HAGroupStoreManager.java` L653-704). After exhaustion, the method returns silently. This is modeled by the `ReactiveTransitionFail(c)` action, which non-deterministically "consumes" a pending peer-reactive transition without updating `clusterState`. + +The retry-exhaustion action shadows every `PeerReact*` action's enabling condition. To keep the two from drifting apart, the peer-state/local-state disjunction is factored out into a module-local predicate `PeerReactWouldFire(c)`: + +```tla +PeerReactWouldFire(c) == + \/ /\ clusterState[Peer(c)] = "ATS" + /\ clusterState[c] \in {"S", "DS"} + \/ /\ clusterState[Peer(c)] = "ANIS" + /\ clusterState[c] \in {"S", "ATS"} + \/ /\ clusterState[Peer(c)] = "AbTS" + /\ clusterState[c] = "ATS" + \/ /\ clusterState[Peer(c)] = "AIS" + /\ clusterState[c] \in {"ATS", "DS"} + \/ /\ UseOfflinePeerDetection = TRUE + /\ clusterState[Peer(c)] = "OFFLINE" + /\ clusterState[c] \in {"AIS", "ANIS"} + \/ /\ UseOfflinePeerDetection = TRUE + /\ clusterState[Peer(c)] # "OFFLINE" + /\ clusterState[c] \in {"AWOP", "ANISWOP"} +``` + +`ReactiveTransitionFail(c)` combines `PeerZKHealthy(c)` (the ZK connectivity guard, defined in [`SpecState.tla`](../SpecState.tla)) with `PeerReactWouldFire(c)`. The `PeerReact*` action bodies keep their inline peer-state/local-state guards so each action's enabling condition remains readable at the definition site. + +## Implementation Traceability + +| TLA+ Action | Java Source | +|---|---| +| `PeerReactToATS(c)` | `createPeerStateTransitions()` L109 | +| `PeerReactToANIS(c)` | `createPeerStateTransitions()` L123, L126 | +| `PeerReactToAbTS(c)` | `createPeerStateTransitions()` L132 | +| `PeerReactToAIS(c)` | `createPeerStateTransitions()` L112-120 | +| `AutoComplete(c)` | `createLocalStateTransitions()` L144, L145, L147 | +| `ANISTSToATS(c)` | `HAGroupStoreManager.setHAGroupStatusToSync()` L341-355 | +| `PeerReactToOFFLINE(c)` | peer OFFLINE detection; no impl trigger yet; gated on `UseOfflinePeerDetection` | +| `PeerRecoverFromOFFLINE(c)` | peer OFFLINE recovery detection; no impl trigger yet; gated on `UseOfflinePeerDetection` | +| `ReactiveTransitionFail(c)` | `FailoverManagementListener.onStateChange()` L653-704 (2 retries exhausted) | + +Failover completion (STA -> AIS) is modeled in [Reader.md](Reader.md) (`TriggerFailover` action), not in this module. + +```tla +EXTENDS SpecState, Types +``` + +## PeerReactToATS -- Standby Detects Peer ATS + +When the standby detects its peer has entered ATS (ACTIVE_IN_SYNC_TO_STANDBY), it begins the failover process by transitioning to STA (STANDBY_TO_ACTIVE). This fires from either S or DS -- the DS case supports the ANIS failover path where the standby is in DEGRADED_STANDBY when failover proceeds. + +**ZK watcher dependency:** Delivered via `peerPathChildrenCache`. Guarded on `zkPeerConnected[c]` and `zkPeerSessionAlive[c]`. If the peer ZK session expires or the notification is lost, the standby never learns of the failover. The active cluster remains in ATS with mutations blocked indefinitely. There is no polling fallback. + +**failoverPending side-effect:** Also sets `failoverPending[c] = TRUE`, modeling the `triggerFailoverListener` (`ReplicationLogDiscoveryReplay.java` L159-171) which fires on LOCAL STANDBY_TO_ACTIVE entry. This is folded into `PeerReactToATS` because the listener fires deterministically on every STA entry and `PeerReactToATS` is the sole producer of STA. 
Source: `createPeerStateTransitions()` L109 -- resolver is unconditional: `currentLocal -> STANDBY_TO_ACTIVE`.
+
+```tla
+PeerReactToATS(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ clusterState[Peer(c)] = "ATS"
+    /\ clusterState[c] \in {"S", "DS"}
+    /\ clusterState' = [clusterState EXCEPT ![c] = "STA"]
+    /\ failoverPending' = [failoverPending EXCEPT ![c] = TRUE]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   inProgressDirEmpty, zkPeerConnected, zkPeerSessionAlive,
+                   zkLocalConnected>>
+```
+
+## PeerReactToANIS -- Peer Enters ANIS
+
+Two reactive transitions triggered by the peer entering ANIS (ACTIVE_NOT_IN_SYNC):
+
+1. **Local S -> DS:** Standby degrades because the peer's replication is degraded. Atomically sets `replayState = DEGRADED` (degradedListener fold). Source: L126.
+2. **Local ATS -> S:** Old active (in failover) completes transition to standby when peer is ANIS. Atomically sets `replayState = SYNCED_RECOVERY` (recoveryListener fold) and resets live writer modes to INIT. Source: L123.
+
+**ZK watcher dependency:** If lost: (1) standby stays in S when it should be DS -- consistency point tracking is incorrect; (2) old active stays in ATS with mutations blocked.
+
+**Writer lifecycle reset (ATS -> S):** When the old active enters standby, the `FailoverManagementListener` triggers a replication subsystem restart on each live RS. Live writer modes reset to INIT (the `ReplicationLogGroup` is destroyed and will be recreated when the cluster next becomes active). The OUT directory is cleared. DEAD writers are preserved: a crashed RS (JVM dead) cannot process the state change notification; the process supervisor restart (`RSRestart` in [RS.md](RS.md)) handles DEAD -> INIT independently.
+
+```tla
+PeerReactToANIS(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ clusterState[Peer(c)] = "ANIS"
+    /\ \/ /\ clusterState[c] = "S"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "DS"]
+          /\ replayState' = [replayState EXCEPT ![c] = "DEGRADED"]
+          /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable,
+                         antiFlapTimer, lastRoundInSync, lastRoundProcessed,
+                         failoverPending, inProgressDirEmpty, zkPeerConnected,
+                         zkPeerSessionAlive, zkLocalConnected>>
+       \/ /\ clusterState[c] = "ATS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+          /\ writerMode' = [writerMode EXCEPT ![c] =
+                               [rs \in RS |-> IF writerMode[c][rs] = "DEAD"
+                                              THEN "DEAD"
+                                              ELSE "INIT"]]
+          /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE]
+          /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"]
+          /\ UNCHANGED <<hdfsAvailable, antiFlapTimer, lastRoundInSync,
+                         lastRoundProcessed, failoverPending,
+                         inProgressDirEmpty, zkPeerConnected,
+                         zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## PeerReactToAbTS -- Active Detects Peer AbTS
+
+When the active cluster (in ATS during failover) detects its peer has entered AbTS (abort initiated from the standby side), it transitions to AbTAIS (ABORT_TO_ACTIVE_IN_SYNC). This is the mechanism by which abort propagates from the standby to the active side.
+
+**ZK watcher dependency:** If lost, the active stays in ATS with mutations blocked; abort does not propagate. No polling fallback.
+
+Source: `createPeerStateTransitions()` L132.
+
+```tla
+PeerReactToAbTS(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ clusterState[Peer(c)] = "AbTS"
+    /\ clusterState[c] = "ATS"
+    /\ clusterState' = [clusterState EXCEPT ![c] = "AbTAIS"]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## AutoComplete -- Local Auto-Completion Transitions
+
+These transitions fire automatically once the cluster enters the corresponding abort state. They return the cluster to its pre-failover state. Despite being "local" (no peer trigger), these transitions are driven by the local `pathChildrenCache` watcher chain, not an in-process event bus. Guarded on `zkLocalConnected[c]`.
+
+Three sub-cases:
+
+**AbTS -> S:** Returns the standby to its pre-failover state.
Atomically sets `replayState = SYNCED_RECOVERY` (recoveryListener fold). Source: L144.
+
+**AbTAIS -> AIS or ANIS:** Conditional -- completes to AIS if all writers are clean (INIT or SYNC) and the OUT dir is empty, otherwise completes to ANIS. This prevents AIS from coexisting with degraded writers when HDFS fails during the abort window. Source: L145.
+
+**AbTANIS -> ANIS:** Returns to ANIS. Resets the anti-flapping timer to `StartAntiFlapWait`, keeping the gate closed so the cluster must wait before attempting ANIS -> AIS recovery. Source: L147.
+
+```tla
+AutoComplete(c) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ \/ /\ clusterState[c] = "AbTS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+          /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"]
+          /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable,
+                         antiFlapTimer, lastRoundInSync, lastRoundProcessed,
+                         failoverPending, inProgressDirEmpty, zkPeerConnected,
+                         zkPeerSessionAlive, zkLocalConnected>>
+       \/ /\ clusterState[c] = "AbTAIS"
+          /\ clusterState' = [clusterState EXCEPT ![c] =
+                 IF outDirEmpty[c] /\ \A rs \in RS : writerMode[c][rs] \in {"INIT", "SYNC"}
+                 THEN "AIS"
+                 ELSE "ANIS"]
+          /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable,
+                         antiFlapTimer, replayState, lastRoundInSync,
+                         lastRoundProcessed, failoverPending,
+                         inProgressDirEmpty, zkPeerConnected,
+                         zkPeerSessionAlive, zkLocalConnected>>
+       \/ /\ clusterState[c] = "AbTANIS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "ANIS"]
+          /\ antiFlapTimer' = [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait]
+          /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, replayState,
+                         lastRoundInSync, lastRoundProcessed, failoverPending,
+                         inProgressDirEmpty, zkPeerConnected,
+                         zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## PeerReactToAIS -- Peer Enters AIS
+
+Two reactive transitions triggered by the peer entering AIS (ACTIVE_IN_SYNC):
+
+1. **Local ATS -> S:** Old active completes failover to standby when the peer (the new active) enters AIS. Atomically sets `replayState = SYNCED_RECOVERY` (recoveryListener fold).
+2. **Local DS -> S:** Standby recovers from degraded when peer returns to AIS. Atomically sets `replayState = SYNCED_RECOVERY` (recoveryListener fold).
+
+**Writer lifecycle reset (ATS -> S):** Same as `PeerReactToANIS` ATS -> S. Live writer modes reset to INIT, OUT directory cleared. DEAD writers preserved for `RSRestart`. This is critical for the ANIS failover path where SYNC_AND_FWD or STORE_AND_FWD writers may persist through ANISTS -> ATS (`ANISTSToATS` does not snap writer modes).
+
+**ZK watcher dependency:** This is the critical transition that resolves the non-atomic failover window. If lost: old active stays in ATS with mutations blocked indefinitely (the (ATS, AIS) state persists). Safety holds (ATS maps to ACTIVE_TO_STANDBY, `isMutationBlocked() = true`) but liveness requires eventual watcher delivery. Curator `PathChildrenCache` re-queries on reconnect, providing eventual delivery if the ZK session survives.
+
+Source: `createPeerStateTransitions()` L112-120 -- conditional resolver for peer ACTIVE_IN_SYNC.
+
+```tla
+PeerReactToAIS(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ clusterState[Peer(c)] = "AIS"
+    /\ \/ /\ clusterState[c] = "ATS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+          /\ writerMode' = [writerMode EXCEPT ![c] =
+                               [rs \in RS |-> IF writerMode[c][rs] = "DEAD"
+                                              THEN "DEAD"
+                                              ELSE "INIT"]]
+          /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE]
+          /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"]
+          /\ UNCHANGED <<hdfsAvailable, antiFlapTimer, lastRoundInSync,
+                         lastRoundProcessed, failoverPending,
+                         inProgressDirEmpty, zkPeerConnected,
+                         zkPeerSessionAlive, zkLocalConnected>>
+       \/ /\ clusterState[c] = "DS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "S"]
+          /\ replayState' = [replayState EXCEPT ![c] = "SYNCED_RECOVERY"]
+          /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable,
+                         antiFlapTimer, lastRoundInSync, lastRoundProcessed,
+                         failoverPending, inProgressDirEmpty, zkPeerConnected,
+                         zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## ANISHeartbeat -- S&F Heartbeat Timer Reset
+
+The S&F heartbeat runs while at least one RS is in STORE_AND_FWD mode. It periodically re-writes ANIS to the ZK znode, refreshing mtime. In the countdown timer model, this resets the timer to `StartAntiFlapWait`, keeping the anti-flapping gate closed.
+
+The heartbeat stops when the last RS exits STORE_AND_FWD (enters SYNC_AND_FWD). At that point the timer begins counting down via `Tick` (in [Clock.md](Clock.md)), and the gate opens when it reaches 0.
+
+**Fairness classification:** WF despite its `zkLocalConnected` guard, because suppressing the heartbeat *helps* liveness (the anti-flap gate opens sooner). SF would be counterproductive -- it would force the heartbeat to fire, keeping the gate closed longer.
+
+Source: `StoreAndForwardModeImpl.startHAGroupStoreUpdateTask()` L71-87; `HAGroupStoreRecord.java` L101 (ANIS self-transition).
+
+```tla
+ANISHeartbeat(c) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] = "ANIS"
+    /\ \E rs \in RS : writerMode[c][rs] = "STORE_AND_FWD"
+    /\ antiFlapTimer' = [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait]
+    /\ UNCHANGED <<clusterState, writerMode, outDirEmpty, hdfsAvailable,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## ANISToAIS -- Recovery from ANIS
+
+When all RS on the cluster are in SYNC or SYNC_AND_FWD, the OUT directory is empty, and the anti-flapping gate has opened (countdown timer reached 0), the cluster recovers from ANIS to AIS.
+
+The writer guard includes SYNC_AND_FWD (not just SYNC) because the anti-flapping gate ensures all RS have exited S&F before this action fires. Any remaining SYNC_AND_FWD RS are atomically transitioned to SYNC, modeling the ACTIVE_IN_SYNC ZK event at `ReplicationLogDiscoveryForwarder.init()` L113-123.
+
+The `AISImpliesInSync` invariant in [ConsistentFailover.md](ConsistentFailover.md) verifies that AIS is only reached with all RS in SYNC, INIT, or DEAD.
+
+Source: `setHAGroupStatusToSync()` L341-355, after forwarder drain.
+
+```tla
+ANISToAIS(c) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] = "ANIS"
+    /\ AntiFlapGateOpen(antiFlapTimer[c])
+    /\ \A rs \in RS : writerMode[c][rs] \in {"SYNC", "SYNC_AND_FWD"}
+    /\ outDirEmpty[c]
+    /\ clusterState' = [clusterState EXCEPT ![c] = "AIS"]
+    /\ writerMode' = [writerMode EXCEPT ![c] =
+                         [rs \in RS |-> IF writerMode[c][rs] = "SYNC_AND_FWD"
+                                        THEN "SYNC"
+                                        ELSE writerMode[c][rs]]]
+    /\ UNCHANGED <<outDirEmpty, hdfsAvailable, antiFlapTimer, replayState,
+                   lastRoundInSync, lastRoundProcessed, failoverPending,
+                   inProgressDirEmpty, zkPeerConnected, zkPeerSessionAlive,
+                   zkLocalConnected>>
+```
+
+## ANISTSToATS -- Drain Completion
+
+When the forwarder has drained the OUT directory and the anti-flapping gate has opened, the cluster advances from ANISTS to ATS, joining the normal AIS failover path. The standby reacts to ATS (not ANISTS), so this transition is the bridge that lets the ANIS failover path converge with the AIS failover path.
+
+**Writer modes are NOT snapped here.** In the implementation, `setHAGroupStatusToSync()` only writes the cluster-level ZK znode (ANISTS -> ATS); it does not modify per-RS writer modes. SYNC_AND_FWD writers may persist into ATS. They are cleaned up when the cluster transitions ATS -> S (replication subsystem restart on standby entry -- see `PeerReactToAIS`, `PeerReactToANIS`).
+
+**Anti-flapping gate:** Confirmed by implementation -- `validateTransitionAndGetWaitTime()` L1035-1036 applies the same `waitTimeForSyncModeInMs` to ANISTS -> ATS as to ANIS -> AIS. The forwarder handles the wait via `syncUpdateTS` deferral (`processNoMoreRoundsLeft()` L169-172).
+
+Source: `HAGroupStoreManager.setHAGroupStatusToSync()` L341-355; `HAGroupStoreClient.validateTransitionAndGetWaitTime()` L1027-1046.
+
+```tla
+ANISTSToATS(c) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] = "ANISTS"
+    /\ AntiFlapGateOpen(antiFlapTimer[c])
+    /\ outDirEmpty[c]
+    /\ clusterState' = [clusterState EXCEPT ![c] = "ATS"]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
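+
+Taken together with the actions above, the ANIS-path failover converges with the AIS path along a prefix like the following (an illustrative hand-written trace, not TLC output; states shown as (active `a`, standby `s`) pairs):
+
+```
+(ANIS, S)    --admin!AdminStartFailover(a)-->    (ANISTS, S)
+(ANISTS, S)  --haGroupStore!ANISTSToATS(a)-->    (ATS, S)     OUT drained, gate open
+(ATS, S)     --haGroupStore!PeerReactToATS(s)--> (ATS, STA)   failoverPending[s] := TRUE
+(ATS, STA)   --reader!TriggerFailover(s)-->      (ATS, AIS)   replay complete
+(ATS, AIS)   --haGroupStore!PeerReactToAIS(a)--> (S, AIS)     writers reset to INIT
+```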
+
+## PeerReactToOFFLINE -- Active Detects Peer OFFLINE
+
+Gated on `UseOfflinePeerDetection` (Iteration 18, proactive modeling).
+
+When the active cluster detects its peer has entered OFFLINE, it transitions to AWOP or ANISWOP depending on its current state:
+
+- AIS -> AWOP (peer went offline while active is in sync)
+- ANIS -> ANISWOP (peer went offline while active is not in sync)
+
+Both AWOP and ANISWOP map to `ClusterRole.ACTIVE` via `getClusterRole()` (`isMutationBlocked() = false`), so the active cluster continues serving mutations while its peer is offline.
+
+No writer or timer side effects: the transition is purely a cluster-state annotation recording the peer's unavailability.
+
+**ZK watcher dependency:** Delivered via `peerPathChildrenCache`. Guarded on `zkPeerConnected[c]` and `zkPeerSessionAlive[c]`.
+
+**NOTE:** This models intended protocol behavior. No `FailoverManagementListener` entry for peer OFFLINE currently exists in the implementation (`createPeerStateTransitions()` has no OFFLINE entry). The TLA+ model verifies the design ahead of implementation.
+
+Source: (proactive) AIS->AWOP from `allowedTransitions` L103; ANIS->ANISWOP from `allowedTransitions` L101.
+
+```tla
+PeerReactToOFFLINE(c) ==
+    /\ UseOfflinePeerDetection = TRUE
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ clusterState[Peer(c)] = "OFFLINE"
+    /\ \/ /\ clusterState[c] = "AIS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "AWOP"]
+       \/ /\ clusterState[c] = "ANIS"
+          /\ clusterState' = [clusterState EXCEPT ![c] = "ANISWOP"]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## PeerRecoverFromOFFLINE -- Active Detects Peer Left OFFLINE
+
+Gated on `UseOfflinePeerDetection` (Iteration 18, proactive modeling).
+
+When the active cluster (in AWOP or ANISWOP) detects its peer has left OFFLINE (re-entered a non-OFFLINE state via manual `--force` recovery), the active returns to ANIS:
+
+- AWOP -> ANIS (per `AWOP.allowedTransitions = {ANIS}`)
+- ANISWOP -> ANIS (per `ANISWOP.allowedTransitions = {ANIS}`)
+
+Both paths enter ANIS because peer recovery is treated as a new peer entering sync -- the active must first synchronize, so it enters ANIS (not AIS). The anti-flap timer is reset to `StartAntiFlapWait` on ANIS entry.
+
+**ZK watcher dependency:** Delivered via `peerPathChildrenCache`. Guarded on `zkPeerConnected[c]` and `zkPeerSessionAlive[c]`.
+
+**NOTE:** This models intended protocol behavior. See the `PeerReactToOFFLINE` comment for implementation status.
+
+Source: (proactive) AWOP->ANIS from `allowedTransitions` L113; ANISWOP->ANIS from `allowedTransitions` L123.
+
+```tla
+PeerRecoverFromOFFLINE(c) ==
+    /\ UseOfflinePeerDetection = TRUE
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ clusterState[Peer(c)] # "OFFLINE"
+    /\ clusterState[c] \in {"AWOP", "ANISWOP"}
+    /\ clusterState' = [clusterState EXCEPT ![c] = "ANIS"]
+    /\ antiFlapTimer' = [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait]
+    /\ UNCHANGED <<writerMode, outDirEmpty, hdfsAvailable, replayState,
+                   lastRoundInSync, lastRoundProcessed, failoverPending,
+                   inProgressDirEmpty, zkPeerConnected, zkPeerSessionAlive,
+                   zkLocalConnected>>
+```
+
+## ReactiveTransitionFail -- Retry Exhaustion
+
+Models the `FailoverManagementListener` (`HAGroupStoreManager.java` L653-704) where both retries of `setHAGroupStatusIfNeeded()` fail and the method returns silently. The watcher notification was delivered, the listener was invoked, but the local ZK write failed. The transition is permanently lost for this notification.
+
+This action is enabled whenever any `PeerReact*` action would be enabled (same ZK connectivity and peer-state guards), including `PeerReactToOFFLINE` and `PeerRecoverFromOFFLINE` retry exhaustion (gated on `UseOfflinePeerDetection`). Its effect is to leave `clusterState` unchanged -- the local transition was not applied.
TLC explores both the success path (the actual `PeerReact*` actions) and this failure path non-deterministically.
+
+**Soundness:** The model is slightly more permissive than the implementation: the same `PeerReact*` action remains enabled after `ReactiveTransitionFail` (the model does not track `lastKnownPeerState`). In the implementation, `handleStateChange()` updates `lastKnownPeerState` before calling `notifySubscribers()`, and after retry failure, if the peer state is re-written with the same value, `handleStateChange()` suppresses the notification (same-state check). This is sound for safety: if safety holds when the transition can non-deterministically succeed or fail, it holds a fortiori when failures are permanent.
+
+```tla
+ReactiveTransitionFail(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ \/ /\ clusterState[Peer(c)] = "ATS"
+          /\ clusterState[c] \in {"S", "DS"}
+       \/ /\ clusterState[Peer(c)] = "ANIS"
+          /\ clusterState[c] \in {"S", "ATS"}
+       \/ /\ clusterState[Peer(c)] = "AbTS"
+          /\ clusterState[c] = "ATS"
+       \/ /\ clusterState[Peer(c)] = "AIS"
+          /\ clusterState[c] \in {"ATS", "DS"}
+       \* Iteration 18 (proactive): mirrors PeerReactToOFFLINE
+       \/ /\ UseOfflinePeerDetection = TRUE
+          /\ clusterState[Peer(c)] = "OFFLINE"
+          /\ clusterState[c] \in {"AIS", "ANIS"}
+       \* Iteration 18 (proactive): mirrors PeerRecoverFromOFFLINE
+       \/ /\ UseOfflinePeerDetection = TRUE
+          /\ clusterState[Peer(c)] # "OFFLINE"
+          /\ clusterState[c] \in {"AWOP", "ANISWOP"}
+    /\ UNCHANGED <<clusterState, writerMode, outDirEmpty, hdfsAvailable,
+                   antiFlapTimer, replayState, lastRoundInSync,
+                   lastRoundProcessed, failoverPending, inProgressDirEmpty,
+                   zkPeerConnected, zkPeerSessionAlive, zkLocalConnected>>
+```
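+
+If the same-state suppression ever needed to be modeled exactly, one option is an auxiliary per-cluster variable; a sketch (`lastKnownPeerState` is hypothetical and is not one of the specification's 13 variables):
+
+```tla
+\* Hypothetical refinement: remember the last peer state the watcher
+\* delivered, and suppress re-delivery of the same value.
+\* VARIABLE lastKnownPeerState
+PeerReactToAbTSWithSuppression(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+    /\ clusterState[Peer(c)] = "AbTS"
+    /\ lastKnownPeerState[c] # "AbTS"    \* same-state check
+    /\ lastKnownPeerState' = [lastKnownPeerState EXCEPT ![c] = "AbTS"]
+    /\ clusterState[c] = "ATS"
+    /\ clusterState' = [clusterState EXCEPT ![c] = "AbTAIS"]
+```
+
+As the soundness argument above explains, this extra precision is unnecessary for the safety results, so the specification keeps the more permissive model.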
diff --git a/src/main/tla/ConsistentFailover/markdown/HDFS.md b/src/main/tla/ConsistentFailover/markdown/HDFS.md
new file mode 100644
index 00000000000..9470992e934
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/HDFS.md
@@ -0,0 +1,76 @@
+# HDFS -- HDFS Availability Incident Actions
+
+**Source:** [`HDFS.tla`](../HDFS.tla)
+
+## Overview
+
+`HDFS` models NameNode crash and recovery as environment incidents. These are pure environment actions -- they set a boolean availability flag and leave all other variables unchanged. The per-RS effects of HDFS unavailability (writer degradation, CAS races, RS aborts) are handled by the [Writer](Writer.md) and [RS](RS.md) modules.
+
+### Modeling Choice: Flag vs. Per-File HDFS State
+
+HDFS availability is modeled as a single boolean per cluster rather than per-file or per-directory state. This abstraction is appropriate because:
+
+1. **NameNode is the single point of failure.** When a NameNode crashes, all HDFS operations on that cluster fail. There is no partial HDFS failure in the scope of this model.
+2. **The protocol reacts to HDFS availability, not individual file operations.** Writer degradation is triggered by IOException from any HDFS write, not by specific file-level failures.
+3. **State space reduction.** Per-file HDFS state would explode the state space without adding any safety-relevant behavior.
+
+### Asymmetric Decomposition
+
+The decomposition between `HDFSDown`/`HDFSUp` and per-RS writer actions is asymmetric:
+
+- **HDFSDown(c)** sets `hdfsAvailable[c] = FALSE`. No immediate writer effect -- per-RS degradation happens individually when each RS attempts its next HDFS write and gets IOException. This enables modeling the CAS race where multiple RS on the same cluster independently detect the failure and race to update the ZK state.
+- **HDFSUp(c)** sets `hdfsAvailable[c] = TRUE`. No immediate writer effect -- recovery is per-RS via the forwarder path (`WriterStoreFwdToSyncFwd` in [Writer.md](Writer.md)), which is guarded on `hdfsAvailable`.
+
+### Two Failure Scenarios
+
+Any cluster's HDFS can fail at any time, producing two distinct scenarios:
+
+1. **Standby HDFS fails (`HDFSDown(c_standby)`):** Active writers detect via IOException and degrade (SYNC -> S&F). This is the primary degradation path handled by `WriterToStoreFwd` in [Writer.md](Writer.md).
+2. **Active cluster's own HDFS fails (`HDFSDown(c_active)`):** S&F writers on the active cluster abort because they are writing to local (own) HDFS. This is handled by `RSAbortOnLocalHDFSFailure` in [RS.md](RS.md).
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `HDFSDown(c)` | NameNode crash; detected reactively via IOException from `ReplicationLog.apply()` |
+| `HDFSUp(c)` | NameNode recovery; forwarder detects via successful `FileUtil.copy()` in `processFile()` L132-152 |
+
+```tla
+EXTENDS SpecState, Types
+```
+
+## HDFSDown -- NameNode Crash
+
+Sets the HDFS availability flag to FALSE for cluster `c`. Per-RS writer degradation (SYNC -> S&F, SYNC_AND_FWD -> S&F) is handled individually by `WriterToStoreFwd` and `WriterSyncFwdToStoreFwd` in [Writer.md](Writer.md), which are guarded on `hdfsAvailable[Peer(c)] = FALSE`. Those actions also handle the AIS -> ANIS cluster state transition and CAS failure (-> DEAD).
+
+**Fairness:** No fairness (Tier 4). HDFS crashes are genuinely non-deterministic environment events.
+
+Source: NameNode crash (environment event).
+
+```tla
+HDFSDown(c) ==
+    /\ hdfsAvailable[c] = TRUE
+    /\ hdfsAvailable' = [hdfsAvailable EXCEPT ![c] = FALSE]
+    /\ UNCHANGED <<clusterState, writerMode, outDirEmpty, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## HDFSUp -- NameNode Recovery
+
+Sets `hdfsAvailable[c] = TRUE`. No immediate writer effect -- recovery is per-RS via the forwarder path. The forwarder detects connectivity by successfully copying a file from OUT to the peer's IN directory; if throughput exceeds the threshold, it transitions the writer S&F -> SYNC_AND_FWD (`WriterStoreFwdToSyncFwd` in [Writer.md](Writer.md)).
+
+**Fairness:** SF (Tier 3). Under SF on `HDFSUp`, HDFS cannot be permanently down. This is needed for the `DegradationRecovery` liveness property -- without it, the adversary could keep HDFS down indefinitely, preventing the writer recovery chain from completing.
+
+Source: `ReplicationLogDiscoveryForwarder.processFile()` L132-152.
+
+```tla
+HDFSUp(c) ==
+    /\ hdfsAvailable[c] = FALSE
+    /\ hdfsAvailable' = [hdfsAvailable EXCEPT ![c] = TRUE]
+    /\ UNCHANGED <<clusterState, writerMode, outDirEmpty, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
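+
+For orientation, the two sides of the flag are consumed by guards of roughly the following shape (fragment-level sketches with hypothetical names; the full actions are documented in [Writer.md](Writer.md) and [RS.md](RS.md)):
+
+```tla
+\* SYNC writers target the PEER's HDFS, so degradation watches the
+\* peer flag; STORE_AND_FWD writers target local HDFS, so the abort
+\* path watches the local flag.
+DegradeGuardSketch(c, r) ==
+    /\ writerMode[c][r] = "SYNC"
+    /\ hdfsAvailable[Peer(c)] = FALSE
+
+LocalAbortGuardSketch(c, r) ==
+    /\ writerMode[c][r] = "STORE_AND_FWD"
+    /\ hdfsAvailable[c] = FALSE
+```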
diff --git a/src/main/tla/ConsistentFailover/markdown/RS.md b/src/main/tla/ConsistentFailover/markdown/RS.md
new file mode 100644
index 00000000000..149c3922a5c
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/RS.md
@@ -0,0 +1,99 @@
+# RS -- RegionServer Lifecycle Actions
+
+**Source:** [`RS.tla`](../RS.tla)
+
+## Overview
+
+`RS` models RegionServer crash (fail-stop) and process supervisor restart. These are environment and lifecycle actions that interact with the writer state machine in [Writer.md](Writer.md) through the `writerMode` variable.
+
+### Crash Modeling
+
+An RS can crash at any time (JVM crash, OOM, kill signal, process supervisor termination). The crash sets `writerMode` to DEAD but does **not** change `clusterState` -- the HA group state in ZK is independent of RS process lifecycle. This is a critical modeling decision: RS crashes are common operational events that should not destabilize the HA group state machine.
+
+A special-case crash, `RSAbortOnLocalHDFSFailure`, models the abort triggered when the active cluster's own HDFS fails while the writer is in STORE_AND_FWD mode (writing to local HDFS). This is distinct from `HDFSDown` in [HDFS.md](HDFS.md) (which models the *peer's* HDFS failing and degrades writers on the active side).
+
+### Restart Modeling
+
+When an RS dies (writer mode DEAD), the process supervisor (Kubernetes/YARN) detects the dead pod and creates a new one. HBase assigns regions and the writer re-initializes in INIT mode, ready to follow the normal startup path (`WriterInit` or `WriterInitToStoreFwd` in [Writer.md](Writer.md)).
+
+### DEAD Writer Preservation Through Standby Entry
+
+When a cluster transitions ATS -> S (becoming standby), the replication subsystem restart resets live writer modes to INIT but preserves DEAD writers. A crashed RS (JVM dead) cannot process the state change notification -- the process supervisor restart handles DEAD -> INIT independently. See `PeerReactToAIS` and `PeerReactToANIS` in [HAGroupStore.md](HAGroupStore.md).
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `RSCrash(c, rs)` | JVM crash, OOM, kill signal, process supervisor termination |
+| `RSAbortOnLocalHDFSFailure(c, rs)` | `StoreAndForwardModeImpl.onFailure()` L115-123 -> `logGroup.abort()` |
+| `RSRestart(c, rs)` | Kubernetes/YARN pod restart -> HBase RS startup -> `ReplicationLogGroup.initializeReplicationMode()` |
+
+```tla
+EXTENDS SpecState, Types
+```
+
+## RSRestart -- Process Supervisor Restarts Dead RS: DEAD -> INIT
+
+The restarted RS enters INIT mode. Subsequent writer actions (`WriterInit` or `WriterInitToStoreFwd` in [Writer.md](Writer.md)) handle the actual mode initialization based on HDFS availability and cluster state.
+
+**Fairness:** SF (Tier 3), grouped with `RSAbortOnLocalHDFSFailure` by mutual exclusivity (DEAD and STORE_AND_FWD are mutually exclusive writer modes). SF is needed because the guard depends on `writerMode`, which can be changed by environment events.
+
+Source: Kubernetes/YARN pod restart -> HBase RS startup -> `ReplicationLogGroup.initializeReplicationMode()`.
+
+```tla
+RSRestart(c, rs) ==
+    /\ writerMode[c][rs] = "DEAD"
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "INIT"]
+    /\ UNCHANGED <<clusterState, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## RSCrash -- Non-Deterministic RS Crash: Any Mode -> DEAD
+
+Models general RS failure (JVM crash, OOM, killed by process supervisor, etc.). The RS can crash at any time regardless of writer mode. The crash does **not** change `clusterState` -- the HA group state in ZK is independent of RS process lifecycle.
+
+This means a cluster can be in AIS with a DEAD RS. The `AISImpliesInSync` invariant in [ConsistentFailover.md](ConsistentFailover.md) explicitly allows DEAD alongside SYNC and INIT in AIS. A DEAD RS is not writing, so the remaining SYNC RSes maintain the in-sync property.
+
+**Fairness:** No fairness (Tier 4). RS crashes are genuinely non-deterministic environment events.
+
+Source: JVM crash, OOM, kill signal, process supervisor termination -- environment event.
+
+```tla
+RSCrash(c, rs) ==
+    /\ writerMode[c][rs] /= "DEAD"
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"]
+    /\ UNCHANGED <<clusterState, outDirEmpty, hdfsAvailable, antiFlapTimer,
+                   replayState, lastRoundInSync, lastRoundProcessed,
+                   failoverPending, inProgressDirEmpty, zkPeerConnected,
+                   zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## RSAbortOnLocalHDFSFailure -- S&F Writer Aborts on Own HDFS Failure: STORE_AND_FWD -> DEAD
+
+In STORE_AND_FWD mode, the writer targets the active cluster's own (local/fallback) HDFS.
+In STORE_AND_FWD mode, the writer targets the active cluster's own (local/fallback) HDFS. If that HDFS fails, `StoreAndForwardModeImpl.onFailure()` treats the error as fatal and calls `logGroup.abort()`, killing the RS.
+
+### Distinction from HDFSDown
+
+This is distinct from `HDFSDown(c)` in [HDFS.md](HDFS.md):
+
+- `HDFSDown(c)` sets the availability flag for cluster `c`'s HDFS. When `c` is the standby, this triggers *peer* HDFS failure -- active writers degrade from SYNC to S&F.
+- `RSAbortOnLocalHDFSFailure` models the active cluster's *own* HDFS failing while the RS is already in S&F mode (writing to local HDFS as fallback).
+
+Note: `hdfsAvailable[c]` is the cluster's OWN HDFS, not `Peer(c)`. RSes in SYNC or SYNC_AND_FWD write to the *peer's* HDFS, so they are not affected by their own cluster's HDFS failure. Only STORE_AND_FWD writers are vulnerable because they write to local HDFS.
+
+**Fairness:** SF (Tier 3), grouped with `RSRestart` by mutual exclusivity (STORE_AND_FWD and DEAD are mutually exclusive writer modes).
+
+Source: `StoreAndForwardModeImpl.onFailure()` L115-123 -> `logGroup.abort()`.
+
+```tla
+RSAbortOnLocalHDFSFailure(c, rs) ==
+    /\ writerMode[c][rs] = "STORE_AND_FWD"
+    /\ hdfsAvailable[c] = FALSE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"]
+    /\ UNCHANGED <<clusterVars, replayVars, envVars>>
+```
diff --git a/src/main/tla/ConsistentFailover/markdown/Reader.md b/src/main/tla/ConsistentFailover/markdown/Reader.md
new file mode 100644
index 00000000000..96b75079166
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/Reader.md
@@ -0,0 +1,171 @@
+# Reader -- Standby-Side Replication Replay State Machine
+
+**Source:** [`Reader.tla`](../Reader.tla)
+
+## Overview
+
+`Reader` models the standby cluster's replication replay state machine. The reader replays replication logs round-by-round, tracking two counters (`lastRoundProcessed`, `lastRoundInSync`) and a replay state that determines how the counters advance. The module contains 5 action schemas covering replay advance, rewind, in-progress directory dynamics, and the failover trigger.
+
+### Replay State Semantics
+
+| State | Counter Behavior |
+|---|---|
+| `SYNC` | Both counters advance together (in-sync replay) |
+| `DEGRADED` | Only `lastRoundProcessed` advances; `lastRoundInSync` is frozen (degraded replay) |
+| `SYNCED_RECOVERY` | Rewinds `lastRoundProcessed` to `lastRoundInSync`, then CAS-transitions to `SYNC` |
+| `NOT_INITIALIZED` | Pre-init on the active side; transitions to `SYNCED_RECOVERY` on first S entry after failover |
+
+### Replay State Diagram
+
+```mermaid
+stateDiagram-v2
+    NOT_INITIALIZED --> SYNCED_RECOVERY : S entry (recoveryListener)
+    NOT_INITIALIZED --> DEGRADED : DS entry (degradedListener)
+    SYNC --> DEGRADED : DS entry (degradedListener)
+    SYNC --> SYNCED_RECOVERY : S entry (recoveryListener)
+    DEGRADED --> SYNCED_RECOVERY : S entry (recoveryListener)
+    SYNCED_RECOVERY --> SYNC : ReplayRewind (CAS)
+    SYNCED_RECOVERY --> DEGRADED : DS entry (degradedListener)
+```
+
+### Listener Effect Folding
+
+The `degradedListener` and `recoveryListener` use unconditional `.set()` (not `.compareAndSet()`). These fire synchronously on the local `PathChildrenCache` event thread during the cluster state transition and are modeled as atomic with the triggering state-entry actions in [HAGroupStore.md](HAGroupStore.md):
+
+- **S entry:** `set(SYNCED_RECOVERY)` -- folded into `PeerReactToAIS`, `PeerReactToANIS` (ATS->S), `AutoComplete` (AbTS->S)
+- **DS entry:** `set(DEGRADED)` -- folded into `PeerReactToANIS` (S->DS)
+
+### CAS Semantics
+
+The `SYNCED_RECOVERY -> SYNC` transition uses `compareAndSet(SYNCED_RECOVERY, SYNC)` at L332-333. The CAS can only fail if a concurrent `set(DEGRADED)` fires first (the cluster re-degrades before `replay()` can CAS). TLC's interleaving semantics model this race: either `ReplayRewind` fires first (CAS succeeds) or the DS-entry fold in `PeerReactToANIS` fires first (state becomes DEGRADED, `ReplayRewind` is no longer enabled).
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `ReplayAdvance(c)` | `replay()` L336-343 (SYNC) and L345-351 (DEGRADED) -- round processing loop |
+| `ReplayRewind(c)` | `replay()` L323-333 -- `compareAndSet(SYNCED_RECOVERY, SYNC)`; `getFirstRoundToProcess()` rewinds to `lastRoundInSync` (L389) |
+| `ReplayBeginProcessing(c)` | `replay()` round processing start -- in-progress files created when a round is picked up |
+| `ReplayFinishProcessing(c)` | `replay()` round processing end -- in-progress files cleaned up after round is fully processed |
+| `TriggerFailover(c)` | `shouldTriggerFailover()` L500-533 (guards); `triggerFailover()` L535-548 (effect); `setHAGroupStatusToSync()` L341-355 (ZK write) |
+
+```tla
+EXTENDS SpecState, Types
+```
+
+## ReplayAdvance -- Round Processing in SYNC or DEGRADED
+
+The reader processes the next round of replication logs:
+
+- **SYNC:** Both `lastRoundProcessed` and `lastRoundInSync` advance, maintaining the invariant that they are equal. Every processed round represents a consistent state.
+- **DEGRADED:** Only `lastRoundProcessed` advances; `lastRoundInSync` is frozen. Rounds processed during DEGRADED may contain incomplete data because the active peer's writers are in STORE_AND_FWD mode, buffering locally instead of writing synchronously to the standby's HDFS.
+
+**Guard:** The cluster must be in a standby state or STA (replay continues during failover pending -- the `replay()` loop does not stop when the cluster enters STA).
+
+Source: `replay()` L336-343 (SYNC), L345-351 (DEGRADED).
+
+```tla
+ReplayAdvance(c) ==
+    /\ clusterState[c] \in StandbyStates \union {"STA"}
+    /\ replayState[c] \in {"SYNC", "DEGRADED"}
+    /\ lastRoundProcessed' = [lastRoundProcessed EXCEPT ![c] = @ + 1]
+    /\ lastRoundInSync' = [lastRoundInSync EXCEPT ![c] =
+           IF replayState[c] = "SYNC" THEN @ + 1 ELSE @]
+    /\ UNCHANGED <<writerVars, clusterVars, envVars, replayState>>
+```
+
+## ReplayRewind -- CAS to SYNC from SYNCED_RECOVERY
+
+In `SYNCED_RECOVERY`, `replay()` rewinds `lastRoundProcessed` to `lastRoundInSync` (via `getFirstRoundToProcess()` at L389), then attempts `compareAndSet(SYNCED_RECOVERY, SYNC)` at L332-333.
+
+The CAS can only fail if a concurrent `set(DEGRADED)` fires first (the cluster re-degrades before `replay()` can CAS). TLC's interleaving semantics model this race naturally: either this action fires (CAS succeeds, state becomes SYNC) or the DS-entry fold in `PeerReactToANIS` fires first (state becomes DEGRADED, this action is no longer enabled).
+
+The `ReplayRewindCorrectness` action constraint in [ConsistentFailover.md](ConsistentFailover.md) verifies that after rewind, `lastRoundProcessed = lastRoundInSync`.
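+
+As a rough cross-reference, the check plausibly has the following shape (the actual `ReplayRewindCorrectness` definition lives in the root module, which is not shown here, so this formulation is an assumption):
+
+```tla
+(* Hypothetical sketch: whenever the CAS out of SYNCED_RECOVERY fires,
+   the processed counter must land exactly on the in-sync counter. *)
+ReplayRewindCorrectness ==
+    \A c \in Cluster :
+        (replayState[c] = "SYNCED_RECOVERY" /\ replayState'[c] = "SYNC")
+            => lastRoundProcessed'[c] = lastRoundInSync[c]
+```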
+
+Source: `replay()` L323-333; `getFirstRoundToProcess()` L389.
+
+```tla
+ReplayRewind(c) ==
+    /\ replayState[c] = "SYNCED_RECOVERY"
+    /\ replayState' = [replayState EXCEPT ![c] = "SYNC"]
+    /\ lastRoundProcessed' = [lastRoundProcessed EXCEPT ![c] = lastRoundInSync[c]]
+    /\ UNCHANGED <<writerVars, clusterVars, envVars, lastRoundInSync>>
+```
+
+## ReplayBeginProcessing -- In-Progress Directory Becomes Non-Empty
+
+When the reader picks up a new round for processing, it creates in-progress files in the IN-PROGRESS directory. This makes the directory non-empty, blocking the failover trigger until processing completes.
+
+**Guard:** The cluster is in a standby state or STA (replay continues during failover pending) and the in-progress directory is currently empty.
+
+Source: `replay()` L307-310 -- `getFirstRoundToProcess()` returns a round; processing begins.
+
+```tla
+ReplayBeginProcessing(c) ==
+    /\ clusterState[c] \in StandbyStates \union {"STA"}
+    /\ inProgressDirEmpty[c] = TRUE
+    /\ inProgressDirEmpty' = [inProgressDirEmpty EXCEPT ![c] = FALSE]
+    /\ UNCHANGED <<writerVars, replayVars, envVars,
+                   clusterState, outDirEmpty, antiFlapTimer, failoverPending>>
+```
+
+## ReplayFinishProcessing -- In-Progress Directory Becomes Empty
+
+When the reader finishes processing a round, it cleans up in-progress files. The directory becomes empty, allowing the failover trigger to proceed (if other guards are satisfied).
+
+```tla
+ReplayFinishProcessing(c) ==
+    /\ inProgressDirEmpty[c] = FALSE
+    /\ inProgressDirEmpty' = [inProgressDirEmpty EXCEPT ![c] = TRUE]
+    /\ UNCHANGED <<writerVars, replayVars, envVars,
+                   clusterState, outDirEmpty, antiFlapTimer, failoverPending>>
+```
+
+## TriggerFailover -- STA -> AIS When Replay Is Complete
+
+The standby cluster writes `ACTIVE_IN_SYNC` to its own ZK znode after the replication log reader determines replay is complete. This is driven by the reader component, not a peer-reactive transition. It is the final step that completes the failover.
+
+### Four Guards
+
+The four guards model the conditions under which failover is safe:
+
+1. **`failoverPending[c]`** -- set by `triggerFailoverListener` (L159-171) when the local cluster enters STA. Ensures failover was properly initiated.
+2. **`inProgressDirEmpty[c]`** -- no partially-processed replication log files (`getInProgressFiles().isEmpty()` at L508). Ensures all in-flight rounds have completed processing.
+3. **`replayState[c] = "SYNC"`** -- the `SYNCED_RECOVERY` rewind must have completed. Without this guard, failover could proceed with degraded rounds not re-processed from the sync point. This is the key zero-RPO guard.
+4. **`hdfsAvailable[c] = TRUE`** -- the standby's own HDFS must be accessible; `shouldTriggerFailover()` performs HDFS reads (`getInProgressFiles`, `getNewFiles`) that throw IOException if HDFS is unavailable, blocking the trigger.
+
+**Guarded on `zkLocalConnected[c]`** because `triggerFailover()` calls `setHAGroupStatusToSync()` which requires `isHealthy = true`.
+
+**Also clears `failoverPending`**, modeling `triggerFailover()` L538 (`failoverPending.set(false)`).
+
+Source: `shouldTriggerFailover()` L500-533 (guards); `triggerFailover()` L535-548 (effect); `setHAGroupStatusToSync()` L341-355 (ZK write).
+
+```tla
+TriggerFailover(c) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] = "STA"
+    /\ failoverPending[c]
+    /\ inProgressDirEmpty[c]
+    /\ replayState[c] = "SYNC"
+    /\ hdfsAvailable[c] = TRUE
+    /\ clusterState' = [clusterState EXCEPT ![c] = "AIS"]
+    /\ failoverPending' = [failoverPending EXCEPT ![c] = FALSE]
+    /\ UNCHANGED <<writerVars, replayVars, envVars,
+                   outDirEmpty, antiFlapTimer, inProgressDirEmpty>>
+```
+
+### Relationship to Safety Properties
+
+The `FailoverTriggerCorrectness` and `NoDataLoss` action constraints in [ConsistentFailover.md](ConsistentFailover.md) cross-check these guards via the shared operator `STAtoAISTriggerReplayGuards` (so the two constraints stay textually aligned). They verify that every STA -> AIS transition in the model satisfies `failoverPending`, `inProgressDirEmpty`, and `replayState = "SYNC"`. Since `TriggerFailover` is the only action that produces STA -> AIS transitions, these constraints are redundant with the action's guards -- but they serve as independent safety-net checks that would catch the accidental removal of a guard during specification evolution.
diff --git a/src/main/tla/ConsistentFailover/markdown/SpecState.md b/src/main/tla/ConsistentFailover/markdown/SpecState.md
new file mode 100644
index 00000000000..7e7efceea82
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/SpecState.md
@@ -0,0 +1,90 @@
+# SpecState -- Shared Variables and Groupings
+
+**Source:** [`SpecState.tla`](../SpecState.tla)
+
+## Overview
+
+`SpecState` declares the 13 specification variables in one place. The root module [`ConsistentFailover.tla`](../ConsistentFailover.tla) and every sub-module extend `SpecState` (which itself extends [`Types.tla`](../Types.tla)), so adding a variable requires editing this module and the relevant variable-group tuple, not repeating a long `VARIABLE` list in every actor module.
+
+`SpecState` also defines the variable-group tuples used in every action's `UNCHANGED` clause and two ZK-health predicates shared by every sub-module that guards on the ZK substrate.
+
+Implementation traceability for each variable remains documented in the `ConsistentFailover.tla` module header and in [ConsistentFailover.md](ConsistentFailover.md).
+
+## Variable groups
+
+Every action's `UNCHANGED` clause is written in terms of these tuples. When a group is fully unchanged, the group name stands in for the full variable list; when a group is partially changed, the unchanged members are listed individually.
+
+- **`writerVars`** == `<<writerMode>>` -- per-RS replication writer mode.
+- **`clusterVars`** == `<<clusterState, outDirEmpty, antiFlapTimer, failoverPending, inProgressDirEmpty>>` -- cluster-level HA group state and per-cluster protocol state.
+- **`replayVars`** == `<<replayState, lastRoundInSync, lastRoundProcessed>>` -- standby-side replay state and counters.
+- **`envVars`** == `<<hdfsAvailable, zkPeerConnected, zkPeerSessionAlive, zkLocalConnected>>` -- environment substrate (HDFS availability, ZK connection/session state).
+- **`vars`** == the full 13-variable tuple for temporal formulas (`[][Next]_vars`, `WF_vars(...)`, `SF_vars(...)`).
+
+## ZK-health predicates
+
+- **`PeerZKHealthy(c)`** == `zkPeerConnected[c] = TRUE /\ zkPeerSessionAlive[c] = TRUE` -- peer `PathChildrenCache` is delivering notifications. Used as the ZK-watcher guard for every `PeerReact*` action and `ReactiveTransitionFail` in [`HAGroupStore.tla`](../HAGroupStore.tla).
+- **`LocalZKHealthy(c)`** == `zkLocalConnected[c] = TRUE` -- local `PathChildrenCache` is healthy (`isHealthy = true` in `HAGroupStoreClient`). Used to gate every action that calls `setHAGroupStatusIfNeeded()` (all local ZK writes): `AutoComplete`, `ANISHeartbeat`, `ANISToAIS`, `ANISTSToATS` in [`HAGroupStore.tla`](../HAGroupStore.tla); all writer mode-change actions in [`Writer.tla`](../Writer.tla); and `TriggerFailover` in [`Reader.tla`](../Reader.tla).
+
+Admin actions (`AdminStartFailover`, `AdminAbortFailover`, `AdminGoOffline`, `AdminForceRecover`) intentionally bypass `LocalZKHealthy` -- they model direct ZK writes that do not depend on the `PathChildrenCache` isHealthy signal.
+
+## TLA+ Source
+
+```tla
+------------------------ MODULE SpecState -------------------------------------
+(*
+ * Shared state variables and state-dependent helper operators for the
+ * Phoenix Consistent Failover specification.
+ *
+ * The root module and all sub-modules EXTEND SpecState so the full
+ * variable list, variable-group tuples, and predicates that reference
+ * variables live in one place. See ConsistentFailover.tla module
+ * header for implementation traceability per variable.
+ *
+ * Variable groups partition the 13 specification variables by actor:
+ *   writerVars  -- per-RS replication writer mode
+ *   clusterVars -- cluster-level HA group state and per-cluster
+ *                  protocol state (outDirEmpty, antiFlapTimer,
+ *                  failoverPending, inProgressDirEmpty)
+ *   replayVars  -- standby-side replay state and counters
+ *   envVars     -- environment substrate (HDFS availability,
+ *                  ZK connection/session state)
+ *
+ * UNCHANGED clauses reference these groups whenever a group is fully
+ * unchanged. Partially-changed groups list the unchanged members
+ * individually.
+ *)
+EXTENDS Types
+
+VARIABLE clusterState, writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+         replayState, lastRoundInSync, lastRoundProcessed,
+         failoverPending, inProgressDirEmpty,
+         zkPeerConnected, zkPeerSessionAlive, zkLocalConnected
+
+---------------------------------------------------------------------------
+
+(* Variable-group tuples *)
+
+writerVars == <<writerMode>>
+
+clusterVars == <<clusterState, outDirEmpty, antiFlapTimer,
+                 failoverPending, inProgressDirEmpty>>
+
+replayVars == <<replayState, lastRoundInSync, lastRoundProcessed>>
+
+envVars == <<hdfsAvailable, zkPeerConnected, zkPeerSessionAlive,
+             zkLocalConnected>>
+
+vars == <<clusterState, writerMode, outDirEmpty, hdfsAvailable, antiFlapTimer,
+          replayState, lastRoundInSync, lastRoundProcessed,
+          failoverPending, inProgressDirEmpty,
+          zkPeerConnected, zkPeerSessionAlive, zkLocalConnected>>
+
+---------------------------------------------------------------------------
+
+(* ZK connectivity health predicates *)
+
+PeerZKHealthy(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerSessionAlive[c] = TRUE
+
+LocalZKHealthy(c) == zkLocalConnected[c] = TRUE
+
+============================================================================
+```
diff --git a/src/main/tla/ConsistentFailover/markdown/Types.md b/src/main/tla/ConsistentFailover/markdown/Types.md
new file mode 100644
index 00000000000..d6f4cbc9bb1
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/Types.md
@@ -0,0 +1,422 @@
+# Types -- Pure Definitions Module
+
+**Source:** [`Types.tla`](../Types.tla)
+
+## Overview
+
+`Types` is a pure-definition module that provides all constants, type sets, state definitions, valid transition tables, role mappings, feature gates (`UseOfflinePeerDetection`), and helper operators used throughout the Phoenix Consistent Failover specification. It declares no variables; every definition is stateless. The root orchestrator and all sub-modules import these definitions via `EXTENDS SpecState, Types` (variables live in [`SpecState.tla`](../SpecState.tla)).
+
+This module establishes the vocabulary of the specification: what states exist, which transitions between them are legal, how states map to the cluster roles visible to clients, and how the anti-flapping countdown timer operates. By centralizing these definitions, the specification ensures that every module shares a single source of truth for the protocol's state space.
+
+### Why a Separate Module?
+
+Factoring pure definitions into their own module is a TLA+ best practice for specifications of this size. It avoids duplication, makes the allowed-transition table auditable in isolation, and allows each sub-module to `EXTENDS SpecState, Types` without pulling in action definitions from unrelated modules.
+
+## Implementation Traceability
+
+| Modeled Concept | Java Class / Field |
+|---|---|
+| `HAGroupState` (14 states) | `HAGroupStoreRecord.HAGroupState` enum (L51-65) |
+| `AllowedTransitions` | `HAGroupStoreRecord` static initializer (L99-123) |
+| `ClusterRole` (6 roles) | `ClusterRoleRecord.ClusterRole` enum (L59-107) |
+| `RoleOf(state)` | `HAGroupState.getClusterRole()` (L73-97) |
+| ANIS self-transition | `HAGroupStoreRecord` L101 (heartbeat support) |
+| `UseOfflinePeerDetection` | Feature gate for AWOP/ANISWOP modeling |
+| `WriterMode` (5 modes) | `ReplicationLogGroup` mode classes (SyncModeImpl, StoreAndForwardModeImpl, SyncAndForwardModeImpl) |
+| `ReplayStateSet` (4 states) | `ReplicationLogDiscoveryReplay` replay state (L550-555) |
+| `StableClusterStates`, `FailoverCompletionAntecedentStates`, `AbortCompletionAntecedentStates`, `NotANISClusterStates` | Named sets for liveness (`~>`) formulas in the root module |
+| `AllowedWriterTransitions` | Per-RS writer mode pairs; `WriterTransitionValid` in the root module |
+| `AllowedReplayTransitions` | Replay state pairs; `ReplayTransitionValid` in the root module |
+
+## Standard Module Extensions
+
+```tla
+EXTENDS Naturals, FiniteSets, TLC
+```
+
+The module extends `Naturals` (for arithmetic on timer values and replay counters), `FiniteSets` (for the `Cardinality` operator used in assumptions), and `TLC` (for model-checking support operators).
+
+## Constants
+
+The specification is parameterized over four constants that define the model's scope.
+
+### Cluster
+
+```tla
+CONSTANTS Cluster
+
+ASSUME Cluster # {}
+ASSUME Cardinality(Cluster) = 2
+```
+
+`Cluster` is the finite set of cluster identifiers participating in the model. The Phoenix Consistent Failover protocol is designed for exactly two clusters forming an HA pair -- one active, one standby. The `Cardinality(Cluster) = 2` assumption encodes this architectural constraint. The model checker instantiates `Cluster` as `{c1, c2}` in all configurations.
+
+### RS (Region Servers)
+
+```tla
+CONSTANTS RS
+
+ASSUME RS # {}
+```
+
+`RS` is the finite set of region server identifiers per cluster. Each cluster runs the same set of RS; writer mode is tracked per `(cluster, RS)` pair. The exhaustive model uses 2 RS; the simulation model uses 9 RS to exercise production-scale per-RS writer interleaving. The set must be non-empty because the writer state machine is meaningless without at least one RS.
+
+### WaitTimeForSync
+
+```tla
+CONSTANTS WaitTimeForSync
+
+ASSUME WaitTimeForSync \in Nat
+ASSUME WaitTimeForSync > 0
+```
+
+`WaitTimeForSync` is the anti-flapping wait threshold in logical time ticks. It controls how long the system must wait after the last STORE_AND_FWD heartbeat before the ANIS-to-AIS recovery transition is allowed. In the implementation, this maps to `HAGroupStoreClient.java` L98 where `ZK_SESSION_TIMEOUT_MULTIPLIER = 1.1` scales the ZK session timeout to produce the wait duration. The exhaustive model uses `WaitTimeForSync = 2` (the minimum value that exercises the timer's counting behavior); the simulation model uses `WaitTimeForSync = 5` to explore richer interleavings during the anti-flapping wait window.
+
+### UseOfflinePeerDetection
+
+```tla
+CONSTANTS UseOfflinePeerDetection
+
+ASSUME UseOfflinePeerDetection \in BOOLEAN
+```
+
+`UseOfflinePeerDetection` is a boolean feature gate for proactive AWOP/ANISWOP modeling. This models the intended protocol behavior for a future implementation feature.
+
+## HA Group State Definitions
+
+The HA group state is the central state variable of the protocol. Each cluster maintains its state as a ZooKeeper znode, updated via versioned `setData` (optimistic CAS locking).
+
+### The 14 States
+
+```tla
+HAGroupState ==
+    { "AIS", "ANIS", "ATS", "ANISTS",
+      "AbTAIS", "AbTANIS", "AWOP", "ANISWOP",
+      "S", "STA", "DS", "AbTS",
+      "OFFLINE", "UNKNOWN" }
+```
+
+Each TLA+ abbreviation maps to a Java enum constant in `HAGroupStoreRecord.HAGroupState` (L51-65):
+
+| TLA+ Value | Enum Constant | Meaning |
+|---|---|---|
+| `"AIS"` | `ACTIVE_IN_SYNC` | Active cluster, fully in sync with standby |
+| `"ANIS"` | `ACTIVE_NOT_IN_SYNC` | Active cluster, at least one RS writing locally (HDFS degraded) |
+| `"ATS"` | `ACTIVE_IN_SYNC_TO_STANDBY` | Transitioning from active to standby (mutations blocked) |
+| `"ANISTS"` | `ACTIVE_NOT_IN_SYNC_TO_STANDBY` | ANIS failover path: draining OUT before advancing to ATS |
+| `"AbTAIS"` | `ABORT_TO_ACTIVE_IN_SYNC` | Aborting failover, returning to AIS |
+| `"AbTANIS"` | `ABORT_TO_ACTIVE_NOT_IN_SYNC` | Aborting failover, returning to ANIS |
+| `"AWOP"` | `ACTIVE_WITH_OFFLINE_PEER` | Active with offline peer. Reachable when `UseOfflinePeerDetection = TRUE` |
+| `"ANISWOP"` | `ACTIVE_NOT_IN_SYNC_WITH_OFFLINE_PEER` | ANIS with offline peer. Reachable when `UseOfflinePeerDetection = TRUE` |
+| `"S"` | `STANDBY` | Standby cluster, receiving and replaying replication logs |
+| `"STA"` | `STANDBY_TO_ACTIVE` | Transitioning from standby to active (failover in progress) |
+| `"DS"` | `DEGRADED_STANDBY` | Standby with degraded active peer (peer in ANIS) |
+| `"AbTS"` | `ABORT_TO_STANDBY` | Aborting failover, returning to standby |
+| `"OFFLINE"` | `OFFLINE` | Cluster is offline |
+| `"UNKNOWN"` | `UNKNOWN` | Unknown state |
+
+The abbreviations are used throughout the specification for readability. `OFFLINE` is reachable when `UseOfflinePeerDetection = TRUE`; `UNKNOWN` is included for type completeness but is not reachable from the initial state in this model.
+
+### State Classification Sets
+
+The states are classified into sets based on their cluster role, which determines the client-visible behavior:
+
+```tla
+ActiveStates == { "AIS", "ANIS", "AbTAIS", "AbTANIS", "AWOP", "ANISWOP" }
+```
+
+A cluster in any `ActiveStates` member is considered active and serves mutations. The `MutualExclusion` invariant requires that at most one cluster be in an `ActiveStates` member at any time. Source: `HAGroupState.getClusterRole()` L73-97 -- these states return `ClusterRole.ACTIVE`.
+
+```tla
+AISLikeStates == { "AIS", "AWOP", "ANISWOP" }
+```
+
+`AISLikeStates` is the subset of `ActiveStates` whose writer-degradation path couples to `ANIS`. The [`Writer.tla`](../Writer.tla) actions `WriterInitToStoreFwd` and `WriterToStoreFwd` atomically transition `clusterState` to `ANIS` and reset `antiFlapTimer` when a writer degrades from any of these states. `AIS` is the base case; `AWOP` and `ANISWOP` are the OFFLINE-peer variants (gated on `UseOfflinePeerDetection`) -- both serve mutations while the peer is OFFLINE and are treated as `AIS`-equivalents for writer-degradation coupling.
+
+```tla
+StandbyStates == { "S", "DS", "AbTS" }
+```
+
+A cluster in any `StandbyStates` member is receiving and replaying replication logs from the active peer. Source: `HAGroupState.getClusterRole()` L73-97 -- these states return `ClusterRole.STANDBY`.
+
+```tla
+TransitionalActiveStates == { "ATS", "ANISTS" }
+```
+
+A cluster in `ATS` or `ANISTS` is transitioning from active to standby during a failover. Critically, mutations are blocked (`isMutationBlocked() = true`). This is the mechanism by which safety is maintained during the non-atomic failover window: the old active is in `ACTIVE_TO_STANDBY` role, which blocks all client mutations, even though the new active has already written `ACTIVE_IN_SYNC`. Source: `ClusterRoleRecord.java` L84 -- `ACTIVE_TO_STANDBY` role has `isMutationBlocked() = true`.
+
+```tla
+ActiveRoles == {"ACTIVE"}
+```
+
+`ActiveRoles` operates at the role abstraction layer (not the state layer). It is the set of roles considered "active" for role-level predicates such as `MutualExclusion`. Distinguished from `ActiveStates` (which is the set of HA group *states* that map to the ACTIVE role): `ActiveRoles` is used in predicates that compare role values, not state values. Source: `ClusterRoleRecord.java` L59-67 -- the ACTIVE role has `isMutationBlocked() = false`.
+
+## Replication Writer Mode Definitions
+
+```tla
+WriterMode == {"INIT", "SYNC", "STORE_AND_FWD", "SYNC_AND_FWD", "DEAD"}
+```
+
+Each RegionServer on the active cluster maintains one of five writer modes. The mode determines how mutations are replicated:
+
+| TLA+ Value | Java Class | Behavior |
+|---|---|---|
+| `"INIT"` | Pre-initialization | RS has not yet started writing; transitional state during startup |
+| `"SYNC"` | `SyncModeImpl` | Writing directly to standby HDFS; normal steady-state mode |
+| `"STORE_AND_FWD"` | `StoreAndForwardModeImpl` | Writing locally to the OUT directory when standby HDFS is unavailable |
+| `"SYNC_AND_FWD"` | `SyncAndForwardModeImpl` | Draining the local OUT queue while also writing synchronously; recovery/drain mode |
+| `"DEAD"` | RS aborted | Writer halted due to CAS failure or local HDFS failure; awaiting process supervisor restart |
+
+The `DEAD` mode is a modeling addition -- the implementation does not have an explicit "DEAD" mode enum. Instead, when `logGroup.abort()` is called (via `RuntimeException` from a CAS failure in `SyncModeImpl.onFailure()` L61-74), the Disruptor halts and the RS process is effectively dead. The process supervisor (Kubernetes/YARN) detects the dead pod and restarts it.
+
+Source: `ReplicationLogGroup.java` mode classes; `SyncModeImpl.onFailure()` L61-74 (CAS failure leads to abort).
+
+## Replication Replay State Definitions
+
+```tla
+ReplayStateSet == {"NOT_INITIALIZED", "SYNC", "DEGRADED", "SYNCED_RECOVERY"}
+```
+
+The standby cluster's reader maintains one of four replay states per HA group, tracking the consistency of the replay relative to the active cluster's replication stream:
+
+| TLA+ Value | Meaning |
+|---|---|
+| `"NOT_INITIALIZED"` | Pre-init; reader has not started. Used on the active side where the reader is dormant. |
| +| `"DEGRADED"` | Active peer is in ANIS (degraded replication); `lastRoundProcessed` advances but `lastRoundInSync` is frozen. Rounds processed during DEGRADED may contain incomplete data. | +| `"SYNCED_RECOVERY"` | Active peer returned to AIS; replay rewinds `lastRoundProcessed` to `lastRoundInSync` before resuming in SYNC mode. This ensures all rounds from the degraded period are re-processed from the last known consistent point. | + +The replay state machine is the key mechanism for achieving zero RPO. The `TriggerFailover` action (in [Reader.md](Reader.md)) requires `replayState = "SYNC"` -- failover cannot proceed until the SYNCED_RECOVERY rewind completes and the replay catches up from the last in-sync point. + +Source: `ReplicationLogDiscoveryReplay.java` L550-555. + +## Allowed Transitions + +The `AllowedTransitions` set defines every valid `(from, to)` state transition pair. This is derived directly from the `allowedTransitions` static initializer in `HAGroupStoreRecord.java` (L99-123) and serves as the ground truth for the `TransitionValid` action constraint. TLC verifies that every state change produced by the `Next` relation is a member of this set. + +```tla +AllowedTransitions == + { + <<"ANIS", "ANIS">>, + <<"ANIS", "AIS">>, + <<"ANIS", "ANISTS">>, + <<"ANIS", "ANISWOP">>, + <<"AIS", "ANIS">>, + <<"AIS", "AWOP">>, + <<"AIS", "ATS">>, + <<"S", "STA">>, + <<"S", "DS">>, + <<"S", "OFFLINE">>, + <<"ANISTS", "AbTANIS">>, + <<"ANISTS", "ATS">>, + <<"ATS", "AbTAIS">>, + <<"ATS", "S">>, + <<"STA", "AbTS">>, + <<"STA", "AIS">>, + <<"DS", "S">>, + <<"DS", "STA">>, + <<"DS", "OFFLINE">>, + <<"AWOP", "ANIS">>, + <<"AbTAIS", "AIS">>, + <<"AbTAIS", "ANIS">>, + <<"AbTANIS", "ANIS">>, + <<"AbTS", "S">>, + <<"ANISWOP", "ANIS">>, + <<"OFFLINE", "S">> + } +``` + +### Notable Entries + +**ANIS self-transition** (`<<"ANIS", "ANIS">>`): This entry supports the periodic heartbeat in `StoreAndForwardModeImpl` (L71-87) that refreshes the ZK znode's `mtime` without changing the state value. In the TLA+ model, this maps to the `ANISHeartbeat` action in [HAGroupStore.md](HAGroupStore.md), which resets the anti-flapping countdown timer to `StartAntiFlapWait`. Source: `HAGroupStoreRecord.java` L101. + +**DS -> STA** (`<<"DS", "STA">>`): This entry supports the ANIS failover path where the standby is in `DEGRADED_STANDBY` when failover proceeds. The admin initiates failover on the active (ANIS -> ANISTS), the forwarder drains OUT (ANISTS -> ATS), and the standby detects ATS and transitions DS -> STA. Source: L117. + +**AbTAIS -> ANIS** (`<<"AbTAIS", "ANIS">>`): This entry is needed so that HDFS failure during the abort window can route the cluster to ANIS. Without it, S&F writers that degrade during AbTAIS would have no path to a consistent state -- the cluster would be stuck in AbTAIS with degraded writers and no way for `AutoComplete` to route to ANIS. + +**S -> OFFLINE and DS -> OFFLINE** (`<<"S", "OFFLINE">>`, `<<"DS", "OFFLINE">>`): Admin takes a standby cluster offline via `PhoenixHAAdminTool update --force --state OFFLINE`, bypassing `isTransitionAllowed()`. + +**OFFLINE -> S** (`<<"OFFLINE", "S">>`): Admin force-recovers from OFFLINE via `PhoenixHAAdminTool update --force --state STANDBY`, bypassing `isTransitionAllowed()` (OFFLINE has no allowed outbound transitions in the implementation). 
+
+### Transition Diagram
+
+```mermaid
+stateDiagram-v2
+    AIS --> ANIS : Writer degrades
+    AIS --> ATS : Admin failover
+    AIS --> AWOP : Peer offline
+
+    ANIS --> AIS : Recovery
+    ANIS --> ANIS : Heartbeat
+    ANIS --> ANISTS : Admin failover
+    ANIS --> ANISWOP : Peer offline
+
+    ATS --> S : Peer AIS/ANIS detected
+    ATS --> AbTAIS : Peer AbTS / Reconcile
+
+    ANISTS --> ATS : Drain complete
+    ANISTS --> AbTANIS : Abort
+
+    S --> STA : Peer ATS detected
+    S --> DS : Peer ANIS detected
+    S --> OFFLINE : Admin offline
+
+    DS --> S : Peer AIS detected
+    DS --> STA : Peer ATS detected
+    DS --> OFFLINE : Admin offline
+
+    STA --> AIS : Failover trigger
+    STA --> AbTS : Admin abort
+
+    AbTAIS --> AIS : Auto-complete
+    AbTAIS --> ANIS : Auto-complete (degraded)
+    AbTANIS --> ANIS : Auto-complete
+    AbTS --> S : Auto-complete
+
+    AWOP --> ANIS : Peer back
+    ANISWOP --> ANIS : Peer back
+
+    OFFLINE --> S : Admin force recover
+```
+
+## Liveness State Sets
+
+These named subsets of `HAGroupState` keep the root module's `~>` formulas readable and aligned with a single definition.
+
+```tla
+StableClusterStates ==
+    {"AIS", "ANIS", "S"}
+
+FailoverCompletionAntecedentStates ==
+    {"STA", "AbTAIS", "AbTANIS", "AbTS"}
+
+AbortCompletionAntecedentStates ==
+    {"AbTS", "AbTAIS", "AbTANIS"}
+
+NotANISClusterStates == HAGroupState \ {"ANIS"}
+```
+
+`StableClusterStates` is the consequent set for `FailoverCompletion` and `AbortCompletion`. `NotANISClusterStates` is equivalent to `clusterState[c] # "ANIS"` whenever `clusterState[c] \in HAGroupState`.
+
+## Allowed Writer and Replay Transitions
+
+The HA-group `AllowedTransitions` table lives in the previous section. The per-RS writer mode table and the replay state machine table are also centralized here so all transition pair sets live in one module.
+
+```tla
+AllowedWriterTransitions ==
+    {
+        <<"INIT", "SYNC">>,
+        <<"INIT", "STORE_AND_FWD">>,
+        <<"INIT", "DEAD">>,
+        <<"SYNC", "STORE_AND_FWD">>,
+        <<"SYNC", "SYNC_AND_FWD">>,
+        <<"SYNC", "DEAD">>,
+        <<"SYNC", "INIT">>,
+        <<"STORE_AND_FWD", "SYNC_AND_FWD">>,
+        <<"STORE_AND_FWD", "DEAD">>,
+        <<"STORE_AND_FWD", "INIT">>,
+        <<"SYNC_AND_FWD", "SYNC">>,
+        <<"SYNC_AND_FWD", "STORE_AND_FWD">>,
+        <<"SYNC_AND_FWD", "DEAD">>,
+        <<"SYNC_AND_FWD", "INIT">>,
+        <<"DEAD", "INIT">>
+    }
+
+AllowedReplayTransitions ==
+    {
+        <<"NOT_INITIALIZED", "SYNCED_RECOVERY">>,
+        <<"NOT_INITIALIZED", "DEGRADED">>,
+        <<"SYNC", "DEGRADED">>,
+        <<"SYNC", "SYNCED_RECOVERY">>,
+        <<"DEGRADED", "SYNCED_RECOVERY">>,
+        <<"SYNCED_RECOVERY", "SYNC">>,
+        <<"SYNCED_RECOVERY", "DEGRADED">>
+    }
+```
+
+Sources: `ReplicationLogGroup` mode transitions and standby lifecycle resets; `ReplicationLogDiscoveryReplay` listeners, CAS, and replay loop.
+
+## Cluster Role Definitions
+
+```tla
+ClusterRole ==
+    { "ACTIVE", "ACTIVE_TO_STANDBY", "STANDBY",
+      "STANDBY_TO_ACTIVE", "OFFLINE", "UNKNOWN" }
+```
+
+The six cluster roles are visible to clients and determine whether a cluster accepts mutations. Source: `ClusterRoleRecord.ClusterRole` enum (L59-107).
+
+### RoleOf Mapping
+
+```tla
+RoleOf(state) ==
+    IF state \in ActiveStates THEN "ACTIVE"
+    ELSE IF state \in TransitionalActiveStates THEN "ACTIVE_TO_STANDBY"
+    ELSE IF state \in StandbyStates THEN "STANDBY"
+    ELSE IF state = "STA" THEN "STANDBY_TO_ACTIVE"
+    ELSE IF state = "OFFLINE" THEN "OFFLINE"
+    ELSE "UNKNOWN"
+```
+
+`RoleOf` maps each HA group state to its client-visible cluster role. This operator is used in the `MutualExclusion` invariant to verify that at most one cluster is in the `ACTIVE` role at any time. The mapping is derived from `HAGroupState.getClusterRole()` L73-97 in the Java implementation.
+
+The critical safety insight is that `ATS` and `ANISTS` map to `ACTIVE_TO_STANDBY` (not `ACTIVE`), which means mutations are blocked during the failover window. This is how the protocol maintains mutual exclusion even though the failover is non-atomic (two independent ZK writes).
+
+## Helpers
+
+### Peer
+
+```tla
+Peer(c) == CHOOSE p \in Cluster : p # c
+```
+
+Returns the other cluster in the 2-cluster model. Since `|Cluster| = 2`, there is exactly one cluster that is not `c`. The `CHOOSE` operator deterministically selects it. This helper is used pervasively throughout the specification to reference the peer cluster's state.
+
+## Anti-Flapping Countdown Timer Helpers
+
+The anti-flapping mechanism uses a per-cluster countdown timer following the pattern from Lamport, "Real Time is Really Simple" (CHARME 2005, Section 2). The key idea is that real-time constraints can be modeled as ordinary state variables without introducing a separate notion of time.
+
+Each cluster's timer counts **down** from `WaitTimeForSync` toward 0. The timer does not represent a clock running backwards -- it represents a waiting period expiring:
+
+```
+WaitTimeForSync ... 2 ... 1 ... 0
+|---- gate closed (waiting) ----| gate open (transition allowed)
+```
+
+The S&F heartbeat (`ANISHeartbeat` in [HAGroupStore.md](HAGroupStore.md)) resets the timer to `WaitTimeForSync`, keeping the gate closed. When the heartbeat stops (all RS exit STORE_AND_FWD), the `Tick` action (in [Clock.md](Clock.md)) counts the timer down to 0, opening the gate and allowing ANIS -> AIS.
+
+Source: `HAGroupStoreClient.validateTransitionAndGetWaitTime()` L1027-1046; `StoreAndForwardModeImpl.startHAGroupStoreUpdateTask()` L71-87.
+
+### Modeling Choice: Countdown vs. Deadline
+
+The Lamport countdown pattern was chosen over a deadline-based approach (tracking a target time and comparing against a global clock) because:
+
+1. **No global clock needed.** The countdown timer is a local variable per cluster, avoiding the need for a shared time variable and the complexity of relating local and global time.
+2. **Minimal state space.** The timer ranges over `0..WaitTimeForSync`, a small finite set. A global clock would grow unboundedly.
+3. **Natural guard encoding.** The enabling condition for the guarded transition is simply `timer = 0`, which is a single equality check.
+
+### Helper Operators
+
+```tla
+AntiFlapGateOpen(t) == t = 0
+```
+
+TRUE when the anti-flapping wait period has fully elapsed. The guarded transition (ANIS -> AIS or ANISTS -> ATS) may proceed.
+
+```tla
+AntiFlapGateClosed(t) == t > 0
+```
+
+TRUE when the anti-flapping wait is still in progress. The guarded transition is blocked.
+
+```tla
+DecrementTimer(t) == IF t > 0 THEN t - 1 ELSE 0
+```
+
+Advances the countdown timer one tick toward 0, with a floor at 0. Used by the `Tick` action to model the passage of time.
+
+```tla
+StartAntiFlapWait == WaitTimeForSync
+```
+
+The value that starts (or restarts) the anti-flapping wait. Used when a cluster enters ANIS or when the S&F heartbeat fires.
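+
+For illustration, a `Tick`-style action built from these helpers might look as follows (the actual `Tick` is defined in Clock.tla and documented in [Clock.md](Clock.md), neither shown here, so this is a sketch under that assumption):
+
+```tla
+(* Hypothetical sketch: one logical time step decrements every cluster's
+   anti-flapping countdown toward 0 and changes nothing else. *)
+Tick ==
+    /\ antiFlapTimer' = [c \in Cluster |-> DecrementTimer(antiFlapTimer[c])]
+    /\ UNCHANGED <<writerVars, replayVars, envVars, clusterState,
+                   outDirEmpty, failoverPending, inProgressDirEmpty>>
+```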
diff --git a/src/main/tla/ConsistentFailover/markdown/Writer.md b/src/main/tla/ConsistentFailover/markdown/Writer.md
new file mode 100644
index 00000000000..52b7591df39
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/Writer.md
@@ -0,0 +1,311 @@
+# Writer -- Per-RS Replication Writer Mode State Machine
+
+**Source:** [`Writer.tla`](../Writer.tla)
+
+## Overview
+
+`Writer` models the per-RegionServer replication writer mode state machine. Each RS on the active cluster maintains a writer mode that determines how mutations are replicated: directly to standby HDFS (`SYNC`), locally buffered (`STORE_AND_FWD`), or draining the local queue while also writing synchronously (`SYNC_AND_FWD`). An RS that aborts due to a ZK CAS failure enters `DEAD` mode.
+
+This module contains 10 action schemas, the largest action count of any sub-module. The actions decompose into four categories: startup, degradation (CAS success and CAS failure paths), recovery/drain, and forwarder-driven transitions.
+
+### Writer Mode State Diagram
+
+```mermaid
+stateDiagram-v2
+    INIT --> SYNC : WriterInit
+    INIT --> STORE_AND_FWD : WriterInitToStoreFwd
+    INIT --> DEAD : WriterInitToStoreFwdFail
+
+    SYNC --> STORE_AND_FWD : WriterToStoreFwd
+    SYNC --> SYNC_AND_FWD : WriterSyncToSyncFwd
+    SYNC --> DEAD : WriterToStoreFwdFail / RSCrash
+    SYNC --> INIT : Standby entry reset
+
+    STORE_AND_FWD --> SYNC_AND_FWD : WriterStoreFwdToSyncFwd
+    STORE_AND_FWD --> DEAD : RSAbortOnLocalHDFS / RSCrash
+    STORE_AND_FWD --> INIT : Standby entry reset
+
+    SYNC_AND_FWD --> SYNC : WriterSyncFwdToSync
+    SYNC_AND_FWD --> STORE_AND_FWD : WriterSyncFwdToStoreFwd
+    SYNC_AND_FWD --> DEAD : WriterSyncFwdToStoreFwdFail / RSCrash
+    SYNC_AND_FWD --> INIT : Standby entry reset
+
+    DEAD --> INIT : RSRestart
+```
+
+### HDFS-Failure-Driven Degradation
+
+HDFS failure degradation (SYNC -> S&F, SYNC_AND_FWD -> S&F) is modeled as individual per-RS actions that each perform their own ZK CAS write. `HDFSDown` in [HDFS.md](HDFS.md) only sets the availability flag; per-RS degradation and CAS failure are handled here. This decomposition enables modeling of the ZK CAS race where multiple RSes on the same cluster race to update the ZK state and the loser gets `BadVersionException` -> abort.
+
+### CAS Failure Semantics
+
+When an RS detects HDFS unavailability via IOException, it attempts a ZK CAS write (`setData().withVersion()`) to transition AIS -> ANIS (or ANIS -> ANIS self-transition). If another RS has already bumped the ZK version (stale `PathChildrenCache`), `BadVersionException` is thrown. `SyncModeImpl.onFailure()` and `SyncAndForwardModeImpl.onFailure()` treat this as fatal: `abort()` throws `RuntimeException`, halting the Disruptor -- the RS is dead.
+
+CAS failure is only possible when `clusterState /= "AIS"` because the first RS to write faces no concurrent version bump. This is encoded in the CAS failure action guards.
+
+### ZK Local Connectivity
+
+Actions that perform ZK writes (`setHAGroupStatusToStoreAndForward`, `setHAGroupStatusToSync`) require `isHealthy = true`, modeled by the `LocalZKHealthy(c)` predicate from [`SpecState.tla`](../SpecState.tla) (which expands to `zkLocalConnected[c] = TRUE`). Actions that are purely mode transitions driven by HDFS operations or forwarder events (`WriterInit`, `WriterSyncToSyncFwd`, `WriterStoreFwdToSyncFwd`) do NOT require a ZK connection.
+
+### AIS-Like State Coupling
+
+`WriterInitToStoreFwd` and `WriterToStoreFwd` both atomically transition `clusterState` to `ANIS` and reset `antiFlapTimer` when a writer degrades from an AIS-like state. The set of AIS-like states (`AIS`, `AWOP`, `ANISWOP`) is named in [`Types.tla`](../Types.tla) as `AISLikeStates` and documented in [Types.md](Types.md#state-classification-sets).
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `WriterInit(c, rs)` | Normal startup -> `SyncModeImpl` |
+| `WriterInitToStoreFwd(c, rs)` | Startup with peer unavailable -> `StoreAndForwardModeImpl`; CAS success path |
+| `WriterInitToStoreFwdFail(c, rs)` | Startup CAS failure -> abort |
+| `WriterToStoreFwd(c, rs)` | `SyncModeImpl.onFailure()` L61-74 -> `setHAGroupStatusToStoreAndForward()`; CAS success |
+| `WriterToStoreFwdFail(c, rs)` | `SyncModeImpl.onFailure()` CAS failure -> abort |
+| `WriterSyncToSyncFwd(c, rs)` | Forwarder ACTIVE_NOT_IN_SYNC event L98-108 while RS in SYNC |
+| `WriterStoreFwdToSyncFwd(c, rs)` | Forwarder `processFile()` L133-152 throughput threshold or drain start |
+| `WriterSyncFwdToSync(c, rs)` | Forwarder drain complete; queue empty -> `setHAGroupStatusToSync()` L171 |
+| `WriterSyncFwdToStoreFwd(c, rs)` | `SyncAndForwardModeImpl.onFailure()` L66-78; CAS success path |
+| `WriterSyncFwdToStoreFwdFail(c, rs)` | `SyncAndForwardModeImpl.onFailure()` CAS failure -> abort |
+| `CanDegradeToStoreFwd(c, rs)` | Guard predicate: RS is in a mode that writes to standby HDFS |
+
+```tla
+EXTENDS SpecState, Types
+```
+
+## Predicates
+
+### CanDegradeToStoreFwd
+
+Guard predicate: RS is in a mode that writes to standby HDFS and would degrade to STORE_AND_FWD on an HDFS failure.
+
+```tla
+CanDegradeToStoreFwd(c, rs) ==
+    writerMode[c][rs] \in {"SYNC", "SYNC_AND_FWD"}
+```
+
+## Startup Actions
+
+### WriterInit -- Normal Startup: INIT -> SYNC
+
+RS initializes and begins writing directly to standby HDFS. Writers only run on the active cluster. No ZK write -- pure mode transition.
+
+```tla
+WriterInit(c, rs) ==
+    /\ clusterState[c] \in ActiveStates
+    /\ writerMode[c][rs] = "INIT"
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC"]
+    /\ UNCHANGED <<clusterVars, replayVars, envVars>>
+```
+
+### WriterInitToStoreFwd -- Startup with Peer Unavailable: INIT -> STORE_AND_FWD
+
+RS initializes but standby HDFS is unreachable; begins buffering locally in the OUT directory. Also transitions cluster AIS -> ANIS (`setHAGroupStatusToStoreAndForward`). This is the AIS-to-ANIS coupling: the first RS to degrade atomically transitions the cluster state.
+
+**AWOP/ANISWOP handling:** Same as `WriterToStoreFwd` -- when AWOP or ANISWOP are reachable (`UseOfflinePeerDetection = TRUE`), HDFS failure during these states triggers `setHAGroupStatusToStoreAndForward()` which CAS-writes ANIS. `AWOP.allowedTransitions = {ANIS}` and `ANISWOP.allowedTransitions = {ANIS}`, so the transition succeeds. When `UseOfflinePeerDetection = FALSE`, AWOP/ANISWOP are unreachable and the extended IF is a no-op.
+
+Guarded on `zkLocalConnected[c]` because this calls `setHAGroupStatusToStoreAndForward()` which requires `isHealthy = true`.
+
+Source: `StoreAndForwardModeImpl.onEnter()` L54-64.
+
+```tla
+WriterInitToStoreFwd(c, rs) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] \in ActiveStates
+    /\ writerMode[c][rs] = "INIT"
+    /\ hdfsAvailable[Peer(c)] = FALSE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "STORE_AND_FWD"]
+    /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = FALSE]
+    /\ clusterState' = IF clusterState[c] \in AISLikeStates
+                       THEN [clusterState EXCEPT ![c] = "ANIS"]
+                       ELSE clusterState
+    /\ antiFlapTimer' = IF clusterState[c] \in AISLikeStates
+                        THEN [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait]
+                        ELSE antiFlapTimer
+    /\ UNCHANGED <<replayVars, envVars, failoverPending, inProgressDirEmpty>>
+```
+
+## Forwarder-Driven Transitions
+
+### WriterSyncToSyncFwd -- Forwarder Started While In Sync: SYNC -> SYNC_AND_FWD
+
+On an ACTIVE_NOT_IN_SYNC event (L98-108), region servers currently in SYNC learn that the cluster has entered ANIS and transition to SYNC_AND_FWD. This event fires once when the cluster enters ANIS. ANISTS does not produce a new ACTIVE_NOT_IN_SYNC event -- it is a different ZK state change. A SYNC writer that has not yet received the event when ANIS -> ANISTS fires will remain in SYNC (harmlessly: SYNC writers write directly to standby HDFS, not to the OUT directory).
+
+No ZK write -- mode transition driven by forwarder event.
+
+Source: `ReplicationLogDiscoveryForwarder.init()` L98-108.
+
+```tla
+WriterSyncToSyncFwd(c, rs) ==
+    /\ clusterState[c] = "ANIS"
+    /\ writerMode[c][rs] = "SYNC"
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC_AND_FWD"]
+    /\ UNCHANGED <<clusterVars, replayVars, envVars>>
+```
+
+### WriterStoreFwdToSyncFwd -- Recovery Detected: STORE_AND_FWD -> SYNC_AND_FWD
+
+The forwarder successfully copies a file from the OUT directory to the standby's IN directory. If throughput exceeds the threshold, the writer transitions to SYNC_AND_FWD to begin draining the queue while also writing synchronously. The forwarder runs on active clusters and during the ANISTS transitional state (draining OUT before ANISTS -> ATS).
+
+No ZK write -- mode transition driven by forwarder file copy.
+
+Source: `ReplicationLogDiscoveryForwarder.processFile()` L133-152 throughput threshold or drain start.
+
+```tla
+WriterStoreFwdToSyncFwd(c, rs) ==
+    /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates
+    /\ writerMode[c][rs] = "STORE_AND_FWD"
+    /\ hdfsAvailable[Peer(c)] = TRUE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC_AND_FWD"]
+    /\ UNCHANGED <<clusterVars, replayVars, envVars>>
+```
+
+### WriterSyncFwdToSync -- Queue Drained: SYNC_AND_FWD -> SYNC
+
+The forwarder has drained all buffered files from the OUT directory. The OUT directory is now empty.
+
+**Per-RS vs per-cluster semantics:** `processNoMoreRoundsLeft()` (`ReplicationLogDiscoveryForwarder.java` L155-184) is a per-cluster forwarder check that examines the global OUT directory -- it only fires when the entire OUT directory is empty, not when a single RS finishes. The guard `\A rs2 \in RS : writerMode[c][rs2] \notin {"STORE_AND_FWD"}` prevents setting `outDirEmpty = TRUE` while any RS is still actively writing to the OUT directory.
+
+**HDFS guard:** `processNoMoreRoundsLeft()` can only fire after `processFile()` has successfully copied all remaining files from OUT to the peer's IN directory, which requires the peer's HDFS to be accessible.
+
+Guarded on `zkLocalConnected[c]` because this calls `setHAGroupStatusToSync()` which requires `isHealthy = true`.
+
+Source: `ReplicationLogDiscoveryForwarder.processFile()` L133-152 copies to peer HDFS; `processNoMoreRoundsLeft()` L155-184; `setHAGroupStatusToSync()` L171.
+
+```tla
+WriterSyncFwdToSync(c, rs) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates
+    /\ writerMode[c][rs] = "SYNC_AND_FWD"
+    /\ hdfsAvailable[Peer(c)] = TRUE
+    /\ \A rs2 \in RS : writerMode[c][rs2] \notin {"STORE_AND_FWD"}
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "SYNC"]
+    /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = TRUE]
+    /\ UNCHANGED <<replayVars, envVars, clusterState, antiFlapTimer,
+                   failoverPending, inProgressDirEmpty>>
+```
+
+## Per-RS HDFS Failure Degradation -- CAS Success Paths
+
+### WriterToStoreFwd -- SYNC -> STORE_AND_FWD (CAS Success)
+
+Models a single RS detecting standby HDFS unavailability via IOException and successfully CAS-writing the ZK state. The ZK CAS write is synchronous and happens BEFORE the mode change (`SyncModeImpl.onFailure()` L61-74). On success, the writer transitions to STORE_AND_FWD and the cluster transitions AIS -> ANIS (if still AIS).
+
+This is the primary AIS-to-ANIS coupling mechanism: the first RS to detect HDFS failure atomically transitions both its own mode and the cluster state.
+
+**AWOP/ANISWOP handling:** When AWOP or ANISWOP are reachable (`UseOfflinePeerDetection = TRUE`), HDFS failure during these states triggers `setHAGroupStatusToStoreAndForward()` which CAS-writes ANIS. `AWOP.allowedTransitions = {ANIS}` and `ANISWOP.allowedTransitions = {ANIS}`, so the transition succeeds. When `UseOfflinePeerDetection = FALSE`, AWOP/ANISWOP are unreachable and the extended IF is a no-op.
+
+Source: `SyncModeImpl.onFailure()` L61-74 -> `setHAGroupStatusToStoreAndForward()`.
+
+```tla
+WriterToStoreFwd(c, rs) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] \in ActiveStates
+    /\ writerMode[c][rs] = "SYNC"
+    /\ hdfsAvailable[Peer(c)] = FALSE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "STORE_AND_FWD"]
+    /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = FALSE]
+    /\ clusterState' = IF clusterState[c] \in AISLikeStates
+                       THEN [clusterState EXCEPT ![c] = "ANIS"]
+                       ELSE clusterState
+    /\ antiFlapTimer' = IF clusterState[c] \in AISLikeStates
+                        THEN [antiFlapTimer EXCEPT ![c] = StartAntiFlapWait]
+                        ELSE antiFlapTimer
+    /\ UNCHANGED <<replayVars, envVars, failoverPending, inProgressDirEmpty>>
+```
+
+### WriterSyncFwdToStoreFwd -- Re-Degradation During Drain: SYNC_AND_FWD -> STORE_AND_FWD (CAS Success)
+
+Models standby HDFS becoming unavailable again while the forwarder is draining the local queue. The RS falls back to pure local buffering. No AIS -> ANIS coupling needed: if RS is in SYNC_AND_FWD, the cluster is already ANIS or ANISTS (cannot be AIS).
+
+Source: `SyncAndForwardModeImpl.onFailure()` L66-78 -> `setHAGroupStatusToStoreAndForward()`.
+
+```tla
+WriterSyncFwdToStoreFwd(c, rs) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates
+    /\ writerMode[c][rs] = "SYNC_AND_FWD"
+    /\ hdfsAvailable[Peer(c)] = FALSE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "STORE_AND_FWD"]
+    /\ outDirEmpty' = [outDirEmpty EXCEPT ![c] = FALSE]
+    /\ UNCHANGED <<replayVars, envVars, clusterState, antiFlapTimer,
+                   failoverPending, inProgressDirEmpty>>
+```
+
+## Per-RS HDFS Failure Degradation -- CAS Failure Paths (RS Abort)
+
+These actions model the case where the ZK CAS write fails due to a version mismatch. The RS reads a stale version from `PathChildrenCache`, attempts a CAS write, but another RS has already bumped the version. ZK throws `BadVersionException` -> `StaleHAGroupStoreRecordVersionException` -> `abort()` -> `RuntimeException` -> Disruptor halts -> RS dead.
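+
+To make the race concrete, here is a hypothetical interleaving with two region servers (assuming `RS = {rs1, rs2}`, both initially in SYNC on a cluster `c1` in AIS; the step names are actions from this module and [HDFS.md](HDFS.md)):
+
+```tla
+(* Hypothetical trace reaching a CAS failure:
+   1. HDFSDown(c2)                   \* standby HDFS flag drops
+   2. WriterToStoreFwd(c1, rs1)      \* rs1's CAS succeeds: AIS -> ANIS,
+                                     \* ZK version bumped
+   3. WriterToStoreFwdFail(c1, rs2)  \* rs2 still holds the stale version;
+                                     \* models BadVersionException -> abort *)
+```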
+
+### WriterToStoreFwdFail -- CAS Failure During SYNC Degradation: SYNC -> DEAD
+
+Guard: `clusterState[c] /= "AIS"` -- CAS failure is only possible when another RS has already changed the cluster state, meaning the ZK version has been bumped beyond the cached value. If the cluster is still AIS, the first RS to write faces no concurrent version bump, so CAS cannot fail.
+
+Source: `SyncModeImpl.onFailure()` L61-74 catch block -> `abort()`.
+
+```tla
+WriterToStoreFwdFail(c, rs) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] \in ActiveStates \ {"AIS"}
+    /\ writerMode[c][rs] = "SYNC"
+    /\ hdfsAvailable[Peer(c)] = FALSE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"]
+    /\ UNCHANGED <<clusterVars, replayVars, envVars>>
+```
+
+### WriterSyncFwdToStoreFwdFail -- CAS Failure During S&FWD Re-Degradation: SYNC_AND_FWD -> DEAD
+
+Same CAS failure pattern as `WriterToStoreFwdFail` but from SYNC_AND_FWD mode. If RS is in SYNC_AND_FWD, the cluster is already ANIS or ANISTS (not AIS), so another RS or the S&F heartbeat may have bumped the ZK version.
+
+Source: `SyncAndForwardModeImpl.onFailure()` L66-78 catch block -> `abort()`.
+
+```tla
+WriterSyncFwdToStoreFwdFail(c, rs) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] \in ActiveStates \union TransitionalActiveStates
+    /\ writerMode[c][rs] = "SYNC_AND_FWD"
+    /\ hdfsAvailable[Peer(c)] = FALSE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"]
+    /\ UNCHANGED <<clusterVars, replayVars, envVars>>
+```
+
+### WriterInitToStoreFwdFail -- CAS Failure During Init Degradation: INIT -> DEAD
+
+RS starts up, `SyncModeImpl.onEnter()` fails (HDFS unavailable), `updateModeOnFailure` -> `SyncModeImpl.onFailure()` -> CAS write fails -> `abort()`. Same CAS race as `WriterToStoreFwdFail` but from the INIT state during startup.
+
+Guard: `clusterState[c] /= "AIS"` -- same rationale: another RS must have already bumped the version for CAS to fail.
+
+Source: `SyncModeImpl.onFailure()` L61-74 via `LogEventHandler.initializeMode()` failure path.
+
+```tla
+WriterInitToStoreFwdFail(c, rs) ==
+    /\ zkLocalConnected[c] = TRUE
+    /\ clusterState[c] \in ActiveStates \ {"AIS"}
+    /\ writerMode[c][rs] = "INIT"
+    /\ hdfsAvailable[Peer(c)] = FALSE
+    /\ writerMode' = [writerMode EXCEPT ![c][rs] = "DEAD"]
+    /\ UNCHANGED <<clusterVars, replayVars, envVars>>
+```
diff --git a/src/main/tla/ConsistentFailover/markdown/ZK.md b/src/main/tla/ConsistentFailover/markdown/ZK.md
new file mode 100644
index 00000000000..8997ce0f622
--- /dev/null
+++ b/src/main/tla/ConsistentFailover/markdown/ZK.md
@@ -0,0 +1,203 @@
+# ZK -- ZooKeeper Coordination Substrate
+
+**Source:** [`ZK.tla`](../ZK.tla)
+
+## Overview
+
+`ZK` models the ZK session lifecycle and connection state as environment actions. Two independent `PathChildrenCache` instances per `HAGroupStoreClient` drive the protocol:
+
+- **`pathChildrenCache` (LOCAL):** Watches the local cluster's ZK znode. Connection loss sets `isHealthy = false`, blocking all `setHAGroupStatusIfNeeded()` calls (auto-completion, heartbeat, writer cluster-state transitions, failover trigger).
+- **`peerPathChildrenCache` (PEER):** Watches the peer cluster's ZK znode via a separate `CuratorFramework`/ZK connection. Connection loss or session expiry suppresses all peer-reactive transitions (`PeerReact*` actions in [HAGroupStore.md](HAGroupStore.md)).
+
+### ZK Failure Modes
+
+Three failure modes are modeled:
+
+1. **Peer disconnection (transient):** `peerPathChildrenCache` loses TCP connection. Peer-reactive transitions suppressed. On reconnect, Curator re-syncs and fires synthetic events.
+2. **Peer session expiry (permanent until recovery):** ZK session expires. All watches permanently lost. Curator must establish a new session via retry policy. Session expiry implies disconnection.
+3. **Local disconnection:** `pathChildrenCache` loses connection. `isHealthy = false`, blocking all `setHAGroupStatusIfNeeded()` calls.
+
+### ZK Liveness Assumption (ZLA)
+
+The liveness specifications encode the ZK Liveness Assumption via WF on `ZKPeerReconnect`, `ZKPeerSessionRecover`, and `ZKLocalReconnect` (Tier 2 fairness). This encodes the assumption that ZK sessions are eventually alive and connected. Without this assumption, the adversary could permanently disconnect ZK, preventing all watcher-driven transitions and violating every liveness property.
+
+### Post-Abort ATS Reconciliation
+
+`ZKPeerReconnect` and `ZKPeerSessionRecover` fold a post-abort ATS reconciliation: when the local cluster is in ATS and the peer is in S or DS at the moment of reconnect, the `PathChildrenCache` rebuild fires a synthetic event that triggers the `FailoverManagementListener`. No existing `PeerReact*` action handles (ATS, S/DS) -- the transient AbTS state was missed during the partition. The reconciliation transitions ATS -> AbTAIS, which auto-completes to AIS via `AutoComplete` in [HAGroupStore.md](HAGroupStore.md).
+
+This is folded into the reconnect action (rather than modeled as a separate action with a boolean flag) because the CONNECTION_RECONNECTED -> `PathChildrenCache` rebuild -> `handleStateChange()` -> `FailoverManagementListener` chain is synchronous on the same event thread, following the same listener-effect folding pattern used for `recoveryListener` and `degradedListener` in [HAGroupStore.md](HAGroupStore.md).
+
+Both `ZKPeerReconnect` and `ZKPeerSessionRecover` reuse the identical reconciliation fold (Curator rebuild is the same whether triggered by reconnection or session recovery), so it is extracted into a module-local operator:
+
+```tla
+ATSReconcileEffect(c) ==
+    IF clusterState[c] = "ATS" /\ clusterState[Peer(c)] \in {"S", "DS"}
+    THEN clusterState' = [clusterState EXCEPT ![c] = "AbTAIS"]
+    ELSE UNCHANGED clusterState
+```
+
+This keeps the two actions' reconciliation branches from drifting apart.
+
+**Race safety:** `ZKPeerReconnect` requires `zkPeerConnected[c] = FALSE`, so it cannot fire during normal operation when the connection is healthy. The normal transient (ATS, S) state during happy-path failover is handled by `PeerReactToATS` on the peer side.
+
+### Retry Exhaustion
+
+Retry exhaustion of the `FailoverManagementListener` (2-retry limit) is modeled in [HAGroupStore.md](HAGroupStore.md) as `ReactiveTransitionFail(c)`, not in this module. The ZK module models only the connection/session lifecycle, not application-level retry behavior.
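+
+For reference, the session/connection coupling that the `ZKSessionConsistency` invariant (defined in ConsistentFailover.md and referenced below) plausibly checks -- the exact formulation is an assumption:
+
+```tla
+(* Hypothetical sketch: a dead peer session implies a dead peer connection,
+   i.e. the model never reaches zkPeerSessionAlive = FALSE with
+   zkPeerConnected = TRUE. *)
+ZKSessionConsistency ==
+    \A c \in Cluster :
+        zkPeerSessionAlive[c] = FALSE => zkPeerConnected[c] = FALSE
+```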
+
+## Implementation Traceability
+
+| TLA+ Action | Java Source |
+|---|---|
+| `ZKPeerDisconnect(c)` | `HAGroupStoreClient.createCacheListener()` L894-898 -- `peerPathChildrenCache` CONNECTION_LOST/CONNECTION_SUSPENDED (no effect on `isHealthy` for PEER cache) |
+| `ZKPeerReconnect(c)` | `HAGroupStoreClient.createCacheListener()` L903-906 -- `peerPathChildrenCache` CONNECTION_RECONNECTED; Curator re-syncs `PathChildrenCache`, fires synthetic CHILD_UPDATED events |
+| `ZKPeerSessionExpiry(c)` | Curator maps SESSION_EXPIRED to CONNECTION_LOST internally; no explicit SESSION_EXPIRED handling in Phoenix |
+| `ZKPeerSessionRecover(c)` | Curator retry policy establishes new session; `PathChildrenCache` rebuilds |
+| `ZKLocalDisconnect(c)` | `HAGroupStoreClient.createCacheListener()` L894-898 -- `pathChildrenCache` (LOCAL) CONNECTION_LOST sets `isHealthy = false` |
+| `ZKLocalReconnect(c)` | `HAGroupStoreClient.createCacheListener()` L903-906 -- `pathChildrenCache` (LOCAL) CONNECTION_RECONNECTED sets `isHealthy = true` |
+
+```tla
+EXTENDS SpecState, Types
+```
+
+## ZKPeerDisconnect -- Peer ZK Connection Drops
+
+The `peerPathChildrenCache` loses its TCP connection to the peer ZK quorum. During disconnection, no watcher notifications are delivered, so peer-reactive transitions for cluster `c` are suppressed.
+
+The implementation does NOT set `isHealthy = false` for PEER cache disconnection -- only LOCAL cache disconnection affects `isHealthy`. This means peer disconnection suppresses watcher delivery but does not block local ZK writes (auto-completion, heartbeat, etc.).
+
+**Fairness:** No fairness (Tier 4). ZK disconnections are genuinely non-deterministic environment events.
+
+Source: `HAGroupStoreClient.createCacheListener()` L894-898 (CONNECTION_LOST/CONNECTION_SUSPENDED for PEER cache).
+
+```tla
+ZKPeerDisconnect(c) ==
+    /\ zkPeerConnected[c] = TRUE
+    /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = FALSE]
+    /\ UNCHANGED <<writerVars, clusterVars, replayVars,
+                   hdfsAvailable, zkPeerSessionAlive, zkLocalConnected>>
+```
+
+## ZKPeerReconnect -- Peer ZK Connection Re-Established
+
+The `peerPathChildrenCache` re-establishes its TCP connection to the peer ZK quorum. Curator re-syncs `PathChildrenCache` by re-reading children and generating synthetic CHILD_UPDATED events. `handleStateChange()` compares against `lastKnownPeerState` and only fires notifications if the peer state differs from the last known value -- same-state suppression. In the TLA+ model, this is naturally handled: `PeerReact*` actions are re-enabled by the `zkPeerConnected` guard and fire when their peer-state guard is satisfied.
+
+Reconnection requires a live session -- if the session is expired, a new session must be established first via `ZKPeerSessionRecover`.
+
+### Post-Abort ATS Reconciliation
+
+When the local cluster is in ATS and the peer is in S or DS at the moment of reconnect, the `PathChildrenCache` rebuild fires a synthetic event that triggers the `FailoverManagementListener`. No existing `PeerReact*` action handles (ATS, S/DS). This happens when:
+
+1. A failover was initiated: (AIS, S) -> (ATS, S)
+2. The peer ZK connection was lost during the failover window
+3. The standby detected ATS and moved to STA, then AbTS (admin abort), then back to S
+4. The transient AbTS state was missed by the active cluster because the peer ZK connection was down
## ZKPeerReconnect -- Peer ZK Connection Re-Established

The `peerPathChildrenCache` re-establishes its TCP connection to the peer ZK quorum. Curator re-syncs the `PathChildrenCache` by re-reading children and generating synthetic CHILD_UPDATED events. `handleStateChange()` compares against `lastKnownPeerState` and fires notifications only if the peer state differs from the last known value -- same-state suppression. In the TLA+ model this is handled naturally: `PeerReact*` actions are re-enabled by the `zkPeerConnected` guard and fire when their peer-state guard is satisfied.

Reconnection requires a live session -- if the session has expired, a new session must first be established via `ZKPeerSessionRecover`.

### Post-Abort ATS Reconciliation

When the local cluster is in ATS and the peer is in S or DS at the moment of reconnect, the `PathChildrenCache` rebuild fires a synthetic event that triggers the `FailoverManagementListener`. No existing `PeerReact*` action handles (ATS, S/DS). This happens when:

1. A failover was initiated: (AIS, S) -> (ATS, S)
2. The peer ZK connection was lost during the failover window
3. The standby detected ATS and moved to STA, then AbTS (admin abort), then back to S
4. The transient AbTS state was missed by the active cluster because the peer ZK connection was down
5. On reconnect, the active sees the peer in S (or DS if degradation occurred) but is itself still stuck in ATS

The reconciliation transitions ATS -> AbTAIS, which auto-completes to AIS via `AutoComplete`, recovering the stuck-ATS cluster.

**Fairness:** WF (Tier 2). Encodes the ZK Liveness Assumption.

Source: `HAGroupStoreClient.createCacheListener()` L903-906 (CONNECTION_RECONNECTED for the PEER cache).

```tla
ZKPeerReconnect(c) ==
    /\ zkPeerConnected[c] = FALSE
    /\ zkPeerSessionAlive[c] = TRUE
    /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = TRUE]
    /\ ATSReconcileEffect(c)
    /\ UNCHANGED <>
```

## ZKPeerSessionExpiry -- Peer ZK Session Expires

The ZK server evicts cluster `c`'s peer session after the session timeout elapses without heartbeats. All watches are permanently lost, and the client must establish a new session before any watcher notifications can be delivered.

**Session expiry implies disconnection:** When the session dies, the TCP connection is also considered dead, so both `zkPeerSessionAlive` and `zkPeerConnected` are set to FALSE. This relationship is verified by the `ZKSessionConsistency` invariant in [ConsistentFailover.md](ConsistentFailover.md), sketched after `ZKPeerSessionRecover` below.

The implementation has no explicit SESSION_EXPIRED handling: Curator maps session expiry to CONNECTION_LOST internally, then attempts to create a new session via its retry policy.

**Fairness:** No fairness (Tier 4). ZK session expiry is a genuinely non-deterministic environment event.

Source: Curator internal session management; no explicit Phoenix SESSION_EXPIRED handler.

```tla
ZKPeerSessionExpiry(c) ==
    /\ zkPeerSessionAlive[c] = TRUE
    /\ zkPeerSessionAlive' = [zkPeerSessionAlive EXCEPT ![c] = FALSE]
    /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = FALSE]
    /\ UNCHANGED <>
```

## ZKPeerSessionRecover -- Peer ZK Session Recovered

Curator's retry policy establishes a new ZK session for the peer connection. The `PathChildrenCache` rebuilds its internal state by re-reading all children and fires synthetic CHILD_ADDED events, effectively re-syncing.

**Session recovery implies reconnection:** The new session comes with a live TCP connection, so both `zkPeerSessionAlive` and `zkPeerConnected` are set to TRUE.

### Post-Abort ATS Reconciliation

Same folded reconciliation as `ZKPeerReconnect`. Session recovery triggers a full `PathChildrenCache` rebuild with synthetic CHILD_ADDED events, which invokes the `FailoverManagementListener` synchronously. The reconciliation logic is identical: ATS + peer in {S, DS} -> AbTAIS.

**Fairness:** WF (Tier 2). Encodes the ZK Liveness Assumption.

Source: Curator retry policy -> new session -> `PathChildrenCache` rebuild.

```tla
ZKPeerSessionRecover(c) ==
    /\ zkPeerSessionAlive[c] = FALSE
    /\ zkPeerSessionAlive' = [zkPeerSessionAlive EXCEPT ![c] = TRUE]
    /\ zkPeerConnected' = [zkPeerConnected EXCEPT ![c] = TRUE]
    /\ ATSReconcileEffect(c)
    /\ UNCHANGED <>
```
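For reference, a plausible formulation of that invariant follows; the authoritative definition is in [ConsistentFailover.md](ConsistentFailover.md), and the `Clusters` constant is assumed from the top-level spec.

```tla
\* Plausible formulation only -- the authoritative ZKSessionConsistency
\* is defined in ConsistentFailover.md. A live peer connection requires
\* a live session; equivalently, session expiry forces disconnection.
ZKSessionConsistency ==
    \A c \in Clusters :
        zkPeerConnected[c] = TRUE => zkPeerSessionAlive[c] = TRUE
```

`ZKPeerSessionExpiry` preserves the invariant by clearing both flags in a single step; `ZKPeerSessionRecover` preserves it by setting both.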
## ZKLocalDisconnect -- Local ZK Connection Drops

The `pathChildrenCache` (LOCAL) loses its connection to the local ZK quorum. The implementation sets `isHealthy = false`, which blocks all `setHAGroupStatusIfNeeded()` calls with an IOException. This suppresses auto-completion, heartbeat, writer cluster-state transitions, and the failover trigger.

**Fairness:** No fairness (Tier 4). ZK disconnections are genuinely non-deterministic environment events.

Source: `HAGroupStoreClient.createCacheListener()` L894-898 (CONNECTION_LOST/CONNECTION_SUSPENDED for the LOCAL cache).

```tla
ZKLocalDisconnect(c) ==
    /\ zkLocalConnected[c] = TRUE
    /\ zkLocalConnected' = [zkLocalConnected EXCEPT ![c] = FALSE]
    /\ UNCHANGED <>
```

## ZKLocalReconnect -- Local ZK Connection Re-Established

The `pathChildrenCache` (LOCAL) re-establishes its connection to the local ZK quorum. The implementation sets `isHealthy = true`, re-enabling all `setHAGroupStatusIfNeeded()` calls.

**Fairness:** WF (Tier 2). Encodes the ZK Liveness Assumption. This is the basis for SF on all actions guarded by `zkLocalConnected`.

Source: `HAGroupStoreClient.createCacheListener()` L903-906 (CONNECTION_RECONNECTED for the LOCAL cache).

```tla
ZKLocalReconnect(c) ==
    /\ zkLocalConnected[c] = FALSE
    /\ zkLocalConnected' = [zkLocalConnected EXCEPT ![c] = TRUE]
    /\ UNCHANGED <>
```
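To illustrate the gate these two actions toggle, here is a sketch in the shape of `AutoComplete` from [HAGroupStore.md](HAGroupStore.md). The body is assumed for exposition -- the real action also folds in the S-entry side effects, and frame conditions on the remaining variables are omitted.

```tla
\* Assumed shape for exposition -- the authoritative AutoComplete lives
\* in HAGroupStore.md. The zkLocalConnected guard models isHealthy:
\* while ZKLocalDisconnect(c) holds the connection down, this local ZK
\* write is blocked, just as setHAGroupStatusIfNeeded() is blocked.
AutoComplete(c) ==
    /\ zkLocalConnected[c] = TRUE
    /\ clusterState[c] = "AbTAIS"
    /\ clusterState' = [clusterState EXCEPT ![c] = "AIS"]
```

Once `ZKLocalReconnect` restores the connection, WF on the reconnect action guarantees the gate eventually reopens, which is what justifies SF on the actions guarded by `zkLocalConnected`.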