[1.12.0] cherry-pick #135: fix(x/audit) bootstrap-recovery exception when active set is empty #137
Merged
mateeullahmalik merged 1 commit on May 11, 2026
…135)

* fix(x/audit): bootstrap-recovery exception when active set is empty

When the epoch's anchored active set is empty (all supernodes POSTPONED), the peer-port recovery rule in shouldRecoverAtEpochEnd becomes unsatisfiable by construction: with zero probers, no peer report exists that could attest all-ports-OPEN for any POSTPONED supernode. The chain cannot self-heal; every validator key holder must perform a manual deregister+re-register cycle out-of-band.

Trigger on mainnet (live params at height 5,001,129):
- consecutive_epochs_to_postpone = 1 (one missed epoch postpones)
- 15 ACTIVE / 11 POSTPONED / 2 DISABLED: thin active margin
- Upgrade halts >= 1 epoch (~40 min) → SNs that lag postpone in lockstep → active set can drop to 0 → permanent deadlock.

Fix: when GetEpochAnchor(epochID).ActiveSupernodeAccounts is empty, accept a compliant self host-report alone as sufficient for recovery. The bootstrap exception sits AFTER the storage-truth and action-finalization redirects (they keep their own recovery semantics) and AFTER selfHostCompliant (a misbehaving SN cannot self-recover via this branch). When no anchor exists for the epoch (test fixture or pre-anchor edge case), the branch is skipped and the legacy peer-port path runs unchanged.

Test matrix (5 cells):
- empty anchor + compliant self-report → recover
- empty anchor + no self-report → no-recover (self-gate)
- empty anchor + non-compliant self-report → no-recover (self-gate)
- non-empty anchor + no peer obs → no-recover (legacy preserved)
- no anchor → no-recover (legacy preserved)

The pre-fix scenario test that asserted deadlock (TestEnforceEpochEnd_EmptyActiveSet_PostponedCannotRecover) is inverted to its new contract: recovery succeeds via the bootstrap exception when self-reports are compliant.

Risk: LOW. Reads existing deterministic state (EpochAnchor) only. Branch is only reachable when no other recovery path applies and self-compliance has already been verified. No new external calls, no wall-clock dependency, no map iteration. Cosmos determinism preserved.

Rollback: revert this commit. The legacy peer-port deadlock returns, recoverable via the documented deregister+re-register procedure (skill: lumera-supernode-postponed-recovery).

Refs: 2026-05-08 devnet incident where all 5 SNs went POSTPONED; gov proposal 33 to bypass via empty required_open_ports passed on-chain but silently no-op'd because Params.WithDefaults() re-fills the list. The deadlock is a real protocol-level design gap, not a devnet quirk.

* test(systemtests): invert empty-active-set tests for bootstrap exception

Two system tests in audit_empty_active_set_bootstrap_test.go were written to document the empty-active-set DEADLOCK as expected behavior (one used legacy MsgReportSupernodeMetrics to break it, the other asserted 3 consecutive host-only-report epochs never recover). With the bootstrap-recovery exception in shouldRecoverAtEpochEnd (this PR's main change), the deadlock no longer exists: compliant self host-reports alone are sufficient to recover when the active set is empty.

Invert both tests to the new contract:

1. TestAuditEmptyActiveSetBootstrap_HostOnlyReportsRecover (was: TestAuditEmptyActiveSetBootstrap_LegacyMetricsBreaksDeadlock + TestAuditEmptyActiveSetDeadlock_HostOnlyReportsCannotRecover)
   Asserts both POSTPONED SNs recover to ACTIVE at epoch 1 end after submitting compliant host-only reports; no legacy metrics path needed. The chain self-heals.

2. TestAuditEmptyActiveSetBootstrap_NonCompliantHostStaysPostponed (NEW)
   Guards the self-compliance gate. With MinDiskFreePercent=20, a POSTPONED SN that reports DiskUsagePercent=95 (5% free) MUST remain POSTPONED even though the active set is empty. This blocks the exception from becoming a 'free pass' for misbehaving SNs and complements the unit-level violation tests in x/audit/v1/keeper/enforcement_empty_active_set_test.go.

Helpers added in audit_test_helpers_test.go:
- auditHostReportWithDiskUsageJSON: lets a test pin DiskUsagePercent.
- setAuditParamsForFastEpochsWithMinDiskFree: lets a test override MinDiskFreePercent in genesis.

Found by: PR #135 CI system-test failure (the previous tests asserted the pre-fix deadlock contract). The original assertions are now covered by historical context in commit messages and the skill lumera-supernode-postponed-recovery.

(cherry picked from commit d0e181d)
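The self-compliance gate that the second system test guards can be illustrated with a minimal Go sketch. The `hostReport` type, the `diskCompliant` function, and the choice to treat exactly-at-threshold as compliant are assumptions made for illustration; only the `DiskUsagePercent` and `MinDiskFreePercent` names and the 95 / 20 values come from the commit message. Integer arithmetic is used because consensus code generally avoids floating point.

```go
package main

import "fmt"

// hostReport is a hypothetical stand-in for a supernode's self host-report.
type hostReport struct {
	DiskUsagePercent int // percentage of disk in use, 0..100
}

// diskCompliant reports whether the free-disk share implied by the report
// meets the configured minimum. Treating "exactly at the minimum" as
// compliant is an assumption of this sketch, not a confirmed rule.
func diskCompliant(r hostReport, minDiskFreePercent int) bool {
	if r.DiskUsagePercent < 0 || r.DiskUsagePercent > 100 {
		return false // out-of-range reports are rejected outright
	}
	free := 100 - r.DiskUsagePercent
	return free >= minDiskFreePercent
}

func main() {
	// Values from the system test: MinDiskFreePercent=20, DiskUsagePercent=95.
	fmt.Println(diskCompliant(hostReport{DiskUsagePercent: 95}, 20)) // false: only 5% free
	fmt.Println(diskCompliant(hostReport{DiskUsagePercent: 60}, 20)) // true: 40% free
}
```

Under these assumptions the 95% report fails the gate, so the bootstrap exception is never reached for that supernode even when the active set is empty.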
Cherry-picks PR #135 (`fix(x/audit): bootstrap-recovery exception when active set is empty`) onto the `1.12.0` release branch.

What this is

Clean `git cherry-pick -x` of squash-merge commit `d0e181d` from master. No conflicts. No code modifications during the cherry-pick: same fix, same tests.

Why this needs to be in v1.12.0
The empty-active-set deadlock is a real protocol design gap that triggers when all supernodes are simultaneously POSTPONED at an epoch boundary. Once in this state, the chain cannot self-heal via gov or any chain-level mechanism; only an out-of-band coordinated `deregister + re-register` cycle by every postponed validator key holder can recover it.

Why v1.12.0 specifically is the right place:
Live mainnet params (queried at height 5,001,129):
Current mainnet supernode breakdown: 15 ACTIVE / 11 POSTPONED / 2 DISABLED. Thin active margin.
Upgrade-day trigger: chain halts at upgrade height; a subset of SN operators roll forward late (>40 min, one epoch). Those SNs miss one epoch report → POSTPONED in lockstep. If the surviving ACTIVE set drops to zero (plausible given current 15/28 active), the deadlock becomes permanent until every validator does the manual cycle.
v1.12.0 is the next upgrade after the long-running v1.11.x line. This fix MUST land in the binary that nodes upgrade INTO so the safety net is in place from block 1 of v1.12.0.
What changed
11-line fix in `x/audit/v1/keeper/enforcement.go::shouldRecoverAtEpochEnd`: when the epoch's anchored active set is empty, accept a compliant self host-report alone as sufficient for recovery. The branch sits AFTER the storage-truth and action-finalization redirects and AFTER `selfHostCompliant`, so a misbehaving SN cannot self-recover via this branch.

Tests cover the full matrix (unit + systemtest); see PR #135 for the table.
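The gate ordering can be sketched in Go as follows. This is an illustrative model, not the keeper code: `recoveryInput`, `shouldRecover`, and the boolean inputs are hypothetical names, the storage-truth and action-finalization redirects are omitted, and only `EpochAnchor` and `ActiveSupernodeAccounts` are names taken from the commit message. The cases in `main` follow the five-cell matrix listed there.

```go
package main

import "fmt"

// EpochAnchor is a simplified stand-in for the anchored epoch state.
type EpochAnchor struct {
	ActiveSupernodeAccounts []string
}

// recoveryInput bundles the facts the decision needs (hypothetical shape).
type recoveryInput struct {
	anchor            *EpochAnchor // nil when no anchor exists for the epoch
	selfHostCompliant bool         // a self host-report exists and passes all thresholds
	peerAllPortsOpen  bool         // some peer report attests all required ports OPEN
}

// shouldRecover models the decision order described above.
func shouldRecover(in recoveryInput) bool {
	// Self gate: a missing or non-compliant self-report never recovers.
	if !in.selfHostCompliant {
		return false
	}
	// Bootstrap exception: an anchor exists and its active set is empty,
	// so no prober could ever produce a peer report.
	if in.anchor != nil && len(in.anchor.ActiveSupernodeAccounts) == 0 {
		return true
	}
	// Legacy peer-port path, unchanged.
	return in.peerAllPortsOpen
}

func main() {
	empty := &EpochAnchor{}
	nonEmpty := &EpochAnchor{ActiveSupernodeAccounts: []string{"sn-a"}}

	cases := []struct {
		name string
		in   recoveryInput
	}{
		{"empty anchor + compliant self-report", recoveryInput{anchor: empty, selfHostCompliant: true}},
		{"empty anchor + no self-report", recoveryInput{anchor: empty}},
		{"empty anchor + non-compliant self-report", recoveryInput{anchor: empty, selfHostCompliant: false}},
		{"non-empty anchor + no peer obs", recoveryInput{anchor: nonEmpty, selfHostCompliant: true}},
		{"no anchor", recoveryInput{anchor: nil, selfHostCompliant: true}},
	}
	for _, c := range cases {
		fmt.Printf("%-42s recover=%v\n", c.name, shouldRecover(c.in))
	}
}
```

In this model only the first case recovers. Placing the self gate first is what keeps the exception from acting as a free pass, and the nil-anchor check keeps the legacy path in charge when no anchor exists.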
Verification on 1.12.0 branch
- `go build ./x/audit/...`: clean
- `go test ./x/audit/v1/keeper/... -run "EmptyActiveSet|NoEpochAnchor|NonEmptyActiveSet"`: PASS

Risk
LOW. Same risk profile as PR #135 (already reviewed and merged to master). No state-key changes, no proto changes, no consensus version bump. The branch is only reachable when no other recovery path applies, self-compliance has already been verified, and the active set is empty — a pure safety net. In any normal operating state (≥1 ACTIVE), the branch is skipped and the legacy peer-port path runs unchanged. Cosmos determinism preserved.
Rollback
`git revert` the cherry-pick commit. The legacy peer-port deadlock returns; manual recovery via documented runbook.

Refs
- d0e181d9b159d19efa22901fa606e078352f947
- lumera-supernode-postponed-recovery