From a08cc75dd908a0eb1090423acda2c1a1477f4b43 Mon Sep 17 00:00:00 2001 From: Erik M Jacobs Date: Thu, 21 May 2026 11:41:16 -0400 Subject: [PATCH] OLS-3018 Add kill switch spec for AgenticOLSConfig CR Adds .ai/spec/what/system-config.md specifying the new AgenticOLSConfig cluster-scoped singleton with spec.suspended kill switch, EmergencyStopped terminal phase, console banner, and CLI suspend/resume commands. Updates existing specs (crd-api, proposal-lifecycle, sandbox-execution, reconciler) to integrate the new condition type and phase. Co-Authored-By: Claude Opus 4.6 --- .ai/spec/README.md | 3 +- .ai/spec/how/reconciler.md | 18 ++++---- .ai/spec/what/README.md | 3 +- .ai/spec/what/crd-api.md | 14 ++++-- .ai/spec/what/proposal-lifecycle.md | 12 +++-- .ai/spec/what/sandbox-execution.md | 2 +- .ai/spec/what/system-config.md | 71 +++++++++++++++++++++++++++++ 7 files changed, 104 insertions(+), 19 deletions(-) create mode 100644 .ai/spec/what/system-config.md diff --git a/.ai/spec/README.md b/.ai/spec/README.md index 8b07c83b..6a70e649 100644 --- a/.ai/spec/README.md +++ b/.ai/spec/README.md @@ -21,6 +21,7 @@ AI agents (Claude). Specs optimize for precision, unambiguous rules, and machine | Look up a CRD field | `what/crd-api.md` | | Understand the approval system | `what/approval.md` | | Understand sandbox pod lifecycle | `what/sandbox-execution.md` | +| Understand the kill switch / system config | `what/system-config.md` | | Navigate the controller codebase | `how/reconciler.md` | | Understand the CLI plugin | `how/cli.md` | @@ -35,4 +36,4 @@ AI agents (Claude). Specs optimize for precision, unambiguous rules, and machine This operator watches `Proposal` CRs and drives them through a multi-phase workflow (analysis, execution, verification) by calling the sandbox runtime's `POST /v1/agent/run` endpoint. The console plugin provides the human-facing UI. Skills are mounted as OCI image volumes. -Jira tracking: Feature OCPSTRAT-3095, Epic OLS-2894. 
+Jira tracking: Feature OCPSTRAT-3095, Epic OLS-2894, Kill Switch OLS-3018. diff --git a/.ai/spec/how/reconciler.md b/.ai/spec/how/reconciler.md index aabd66d3..74ba052d 100644 --- a/.ai/spec/how/reconciler.md +++ b/.ai/spec/how/reconciler.md @@ -64,16 +64,17 @@ Audience: AI agents. Behavioral rules and phase semantics live in **what/** spec ## Data flow: reconcile loop -1. **Watch / enqueue:** controller-runtime delivers `ctrl.Request` for a `Proposal` namespaced name. `SetupWithManager` also `Owns` child CRs (`ProposalApproval`, `AnalysisResult`, `ExecutionResult`, `VerificationResult`, `EscalationResult`) and **Watches** cluster `ApprovalPolicy` to enqueue all non-terminal proposals. +1. **Watch / enqueue:** controller-runtime delivers `ctrl.Request` for a `Proposal` namespaced name. `SetupWithManager` also `Owns` child CRs (`ProposalApproval`, `AnalysisResult`, `ExecutionResult`, `VerificationResult`, `EscalationResult`) and **Watches** cluster `ApprovalPolicy` and `AgenticOLSConfig` to enqueue all non-terminal proposals when either changes. 2. **`Reconcile` load:** `Get` `Proposal`; ignore not-found. 3. **Deletion path:** If `DeletionTimestamp` set and finalizer `agentic.openshift.io/execution-rbac-cleanup` present: `Agent.ReleaseSandboxes`, `cleanupExecutionRBAC`, remove finalizer, return. -4. **Phase:** `agenticv1alpha1.DerivePhase(proposal.Status.Conditions)` — see **what/** for semantics. -5. **Finalizer add:** If not terminal and finalizer missing, add RBAC cleanup finalizer (re-fetch proposal after patch). -6. **Terminal / failed shortcuts:** Completed/Denied/Escalated → optional sandbox release via `Agent.ReleaseSandboxes`. `ProposalPhaseFailed` → `handleFailed` (RBAC cleanup if annotation set). -7. **Shared prelude:** `getApprovalPolicy` (cluster singleton name `cluster`), `ensureProposalApproval`, `resolveProposal`. Resolution failure → set `ProposalConditionAnalyzed=False` with `reasonWorkflowFailed`, status patch, return (no requeue). -8. 
**Phase switch:** Routes to `handleRevision` (if `needsRevision`) before analysis/execution/escalation arms; otherwise `handleAnalysis`, `handleExecution`, `handleVerification`, `handleEscalation`, or no-op. -9. **Handlers** set step conditions (`Unknown` → agent call → `True`/`False`), create result CRs, append `Status.Steps.*.Results`, `statusPatch` proposal. -10. **Agent path:** All agent steps go through `r.Agent.*` which (in production) is `SandboxAgentCaller`: template + `EnsureAgentTemplate` → `Sandbox.Claim` → early `patchSandboxInfo` on proposal → `WaitReady` → `AgentHTTPClient.Run` → JSON unmarshal into outputs. +4. **[PLANNED: OLS-3018] Suspension check:** Fetch `AgenticOLSConfig` singleton. If `spec.suspended == true` and proposal is non-terminal: release sandboxes, clean up RBAC, set `EmergencyStopped=True` condition, status patch, return. If CR not found, treat as not suspended. See **what/system-config.md**. +5. **Phase:** `agenticv1alpha1.DerivePhase(proposal.Status.Conditions)` — see **what/** for semantics. Now includes `EmergencyStopped` as highest-precedence terminal phase. +6. **Finalizer add:** If not terminal and finalizer missing, add RBAC cleanup finalizer (re-fetch proposal after patch). +7. **Terminal / failed shortcuts:** Completed/Denied/Escalated/EmergencyStopped → optional sandbox release via `Agent.ReleaseSandboxes`. `ProposalPhaseFailed` → `handleFailed` (RBAC cleanup if annotation set). +8. **Shared prelude:** `getApprovalPolicy` (cluster singleton name `cluster`), `ensureProposalApproval`, `resolveProposal`. Resolution failure → set `ProposalConditionAnalyzed=False` with `reasonWorkflowFailed`, status patch, return (no requeue). +9. **Phase switch:** Routes to `handleRevision` (if `needsRevision`) before analysis/execution/escalation arms; otherwise `handleAnalysis`, `handleExecution`, `handleVerification`, `handleEscalation`, or no-op. +10. 
**Handlers** set step conditions (`Unknown` → agent call → `True`/`False`), create result CRs, append `Status.Steps.*.Results`, `statusPatch` proposal. +11. **Agent path:** All agent steps go through `r.Agent.*` which (in production) is `SandboxAgentCaller`: template + `EnsureAgentTemplate` → `Sandbox.Claim` → early `patchSandboxInfo` on proposal → `WaitReady` → `AgentHTTPClient.Run` → JSON unmarshal into outputs. --- @@ -172,6 +173,7 @@ ProposalReconciler.Reconcile - **`cmd/main.go` scheme:** Only core + `agenticv1alpha1`. Watching or applying arbitrary CRDs from tests may need extended schemes (see `reconciler_test.go`). - **Max concurrent reconciles:** `SetupWithManager` reads cluster `ApprovalPolicy` via API reader for `MaxConcurrentProposals`, else `DefaultMaxConcurrentProposals` from API package. - **Policy watch:** Enqueues **all** non-terminal proposals on any `ApprovalPolicy` event — can be chatty. +- **[PLANNED: OLS-3018] AgenticOLSConfig watch:** Same pattern as policy watch — enqueues all non-terminal proposals on any `AgenticOLSConfig` change. When `suspended` flips to `true`, all re-queued proposals hit the suspension guard and get terminated. - **Workflow resolution errors:** Patched onto `ProposalConditionAnalyzed` false — see API for exact condition ordering vs `DerivePhase`. - **`selectedOption` vs trim:** Verification uses latest analysis result’s **first** option (`Options[0]`) when resolving; execution path uses `trimNonSelectedOptions` which respects `ProposalApproval` execution option index when multiple options exist. - **`maxAttempts`:** Combines `ApprovalPolicy.Spec.MaxAttempts` ceiling with per-approval execution override (`helpers.go`); retry semantics interact with verification failure branch in `handleVerification` (see **what/proposal-lifecycle.md**). 
diff --git a/.ai/spec/what/README.md b/.ai/spec/what/README.md index 706cca95..4c14eed5 100644 --- a/.ai/spec/what/README.md +++ b/.ai/spec/what/README.md @@ -7,9 +7,10 @@ These specs define WHAT the operator must do -- testable behavioral rules, confi | Spec | Description | |------|-------------| | [proposal-lifecycle.md](proposal-lifecycle.md) | Proposal phases, condition-driven state machine, retry logic, revision, escalation | -| [crd-api.md](crd-api.md) | All CRD types and field semantics: Proposal, Agent, LLMProvider, ApprovalPolicy, result CRs | +| [crd-api.md](crd-api.md) | All CRD types and field semantics: Proposal, Agent, LLMProvider, ApprovalPolicy, AgenticOLSConfig, result CRs | | [approval.md](approval.md) | Human-in-the-loop approval system: policy modes, stage gates, deny semantics | | [sandbox-execution.md](sandbox-execution.md) | Sandbox pod lifecycle, RBAC scoping, agent communication, skills mounting | +| [system-config.md](system-config.md) | AgenticOLSConfig CRD, emergency kill switch (spec.suspended), console/CLI visibility | ## Relationship to how/ Specs diff --git a/.ai/spec/what/crd-api.md b/.ai/spec/what/crd-api.md index 85d9fe32..dc74abca 100644 --- a/.ai/spec/what/crd-api.md +++ b/.ai/spec/what/crd-api.md @@ -6,7 +6,7 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in 1. **Group/version**: All kinds in this specification use API group `agentic.openshift.io` and version `v1alpha1`. 2. **Scope — namespaced**: `Proposal`, `ProposalApproval`, `AnalysisResult`, `ExecutionResult`, `VerificationResult`, `EscalationResult` MUST be namespace-scoped; their `metadata.namespace` is the tenant/workload namespace. -3. **Scope — cluster**: `Agent`, `LLMProvider`, and `ApprovalPolicy` MUST be cluster-scoped; `metadata.name` is the global identifier. +3. **Scope — cluster**: `Agent`, `LLMProvider`, `ApprovalPolicy`, and `AgenticOLSConfig` MUST be cluster-scoped; `metadata.name` is the global identifier. 4. 
**Proposal identity**: A `Proposal` MUST include required immutable fields per CEL: at minimum `spec.request` and `spec.analysis`. Omitting `spec.execution` or `spec.verification` means those steps do not exist for that proposal (see `proposal-lifecycle.md`). 5. **Proposal — `spec.request`**: Human/agent input text; immutable after creation; max length enforced by validation. 6. **Proposal — `spec.revisionFeedback`**: Only mutable spec field; when set/non-empty and `metadata.generation` advances beyond the analyzed condition’s `observedGeneration`, operators MUST trigger re-analysis per `proposal-lifecycle.md`. @@ -14,8 +14,8 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in 8. **Proposal — `spec.analysisOutput`**: Immutable after set. `mode` defaults to full analysis schema when empty/default. `mode=Minimal` REQUIRES `schema` to be set, forbids `spec.execution` and `spec.verification`, and restricts option shape accordingly. 9. **Proposal — `spec.tools`**: Default `ToolsSpec` for all steps; immutable once set. Per-step `tools` on `spec.analysis` / `spec.execution` / `spec.verification` replaces the default for that step only when non-zero. 10. **Proposal — `spec.analysis|execution|verification`**: Immutable `ProposalStep` records after set. Each non-zero step MAY name `agent` (DNS subdomain) defaulting to `default` when empty; MAY carry per-step `tools`. -11. **Proposal — `status`**: Observed-only. `status.conditions` holds map-merge conditions (types include `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated`). `status.steps` holds per-step sandbox info, retry counter (execution), and result refs. -12. **Phase display types**: `ProposalPhase` and `StepPhase` string enums in the API describe display labels only; they are not stored fields on `Proposal` (phase is derived — see `proposal-lifecycle.md`). `StepPhase` values include `PendingApproval`, `Running`, `Completed`, `Failed`, `Skipped`. +11. 
**Proposal — `status`**: Observed-only. `status.conditions` holds map-merge conditions (types include `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated`, `EmergencyStopped`). `status.steps` holds per-step sandbox info, retry counter (execution), and result refs. +12. **Phase display types**: `ProposalPhase` and `StepPhase` string enums in the API describe display labels only; they are not stored fields on `Proposal` (phase is derived — see `proposal-lifecycle.md`). `ProposalPhase` values include `EmergencyStopped` (terminal, set by kill switch — see `system-config.md`). `StepPhase` values include `PendingApproval`, `Running`, `Completed`, `Failed`, `Skipped`. 13. **Sandbox step enum**: `SandboxStep` values `Analysis`, `Execution`, `Verification`, `Escalation` identify workflow steps for approvals, sandbox labels, and policies. 14. **Agent — `spec.llmProvider`**: Required reference by name to a cluster `LLMProvider`. 15. **Agent — `spec.model`**: Required provider-specific model identifier string; validation restricts charset. @@ -45,6 +45,9 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in 39. **Result CR ownership**: Result CRs MUST declare controller `ownerReferences` to their `Proposal` for GC; naming follows operator conventions (see `sandbox-execution.md` for when they are created). 40. **Label conventions**: Operator uses labels for proposal name, step, component, and managed template markers (exact keys are implementation-specific; behavior: selectors for GC/list, not duplicated here). 41. **CEL immutability** (Proposal): Enforced transitions include: `request`, `targetNamespaces`, `analysisOutput`, `tools`, `analysis`, `execution`, `verification` immutability after initial set as encoded in API markers. +42. **AgenticOLSConfig — singleton name**: CRD validation requires `metadata.name` equals `cluster` (same pattern as `ApprovalPolicy`). +43. **AgenticOLSConfig — `spec.suspended`**: Bool, optional, default `false`. 
When `true`, halts all agentic operations cluster-wide and terminates in-flight proposals with `EmergencyStopped` condition. See `system-config.md` for full semantics. +44. **AgenticOLSConfig — absence**: When no `AgenticOLSConfig` CR exists, the system MUST behave as if `spec.suspended` is `false`. ## Configuration Surface (by path) @@ -62,6 +65,10 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in ### ApprovalPolicy - `metadata.name` (must be `cluster`), `spec.stages[]`, `spec.maxAttempts`, `spec.maxConcurrentProposals` +### AgenticOLSConfig +- `metadata.name` (must be `cluster`), `spec.suspended` +- See `system-config.md` for full behavioral rules + ### ProposalApproval - `metadata.name`, `metadata.namespace`, `spec.stages[]`, `status.stages[]` @@ -84,3 +91,4 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in - [PLANNED: OLS-2940] Autonomous workflow CRD migrations may rename or reshape fields; specs MUST be updated when `v1alpha1` changes. - [PLANNED: OLS-2894] Explicit **Agent** fields for per-step system prompts if moved from template/runtime-only assembly (today prompts are composed outside `Agent` CR — see `sandbox-execution.md`). +- [PLANNED: OLS-3018] `AgenticOLSConfig` CRD with `spec.suspended` kill switch, `EmergencyStopped` condition type on Proposal, console and CLI visibility. See `system-config.md` for full specification. diff --git a/.ai/spec/what/proposal-lifecycle.md b/.ai/spec/what/proposal-lifecycle.md index e32af25c..e78ac761 100644 --- a/.ai/spec/what/proposal-lifecycle.md +++ b/.ai/spec/what/proposal-lifecycle.md @@ -5,15 +5,16 @@ Behavioral specification for the `Proposal` resource lifecycle. **Approval gates ## Behavioral Rules 1. **Source of truth**: `status.conditions` (Kubernetes conditions keyed by `type`) is authoritative. The **phase** is a derived display value only; it is not persisted as its own field. -2. 
**Phases**: The system MUST derive exactly one phase label from `status.conditions` using the algorithm in rule 9 (and precedence rules 10–11). Valid labels: `Pending`, `Analyzing`, `Proposed`, `Executing`, `Verifying`, `Completed`, `Failed`, `Denied`, `Escalating`, `Escalated`. -3. **Condition types (proposal-level)**: The workflow uses `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated` (string values as defined on the API). Status values are `True`, `False`, or `Unknown`. -4. **Terminal phases**: `Completed`, `Denied`, `Escalated`, and `Failed` are terminal for reconciliation progression. After `Completed`, `Denied`, or `Escalated`, the controller MUST stop active work and MAY release sandbox claims when present. `Failed` triggers failure cleanup behaviors (see `sandbox-execution.md` for RBAC cleanup interactions). +2. **Phases**: The system MUST derive exactly one phase label from `status.conditions` using the algorithm in rule 9 (and precedence rules 10–11). Valid labels: `Pending`, `Analyzing`, `Proposed`, `Executing`, `Verifying`, `Completed`, `Failed`, `Denied`, `Escalating`, `Escalated`, `EmergencyStopped`. +3. **Condition types (proposal-level)**: The workflow uses `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated`, `EmergencyStopped` (string values as defined on the API). Status values are `True`, `False`, or `Unknown`. +4. **Terminal phases**: `Completed`, `Denied`, `Escalated`, `Failed`, and `EmergencyStopped` are terminal for reconciliation progression. After `Completed`, `Denied`, `Escalated`, or `EmergencyStopped`, the controller MUST stop active work and MAY release sandbox claims when present. `Failed` triggers failure cleanup behaviors (see `sandbox-execution.md` for RBAC cleanup interactions). `EmergencyStopped` indicates the proposal was terminated by the system kill switch (see `system-config.md`). 5. **Workflow shape**: `spec.analysis` is always required. 
`spec.execution` and `spec.verification` MAY be omitted; omission skips those steps subject to rules 20–22. 6. **Revision loop**: If `spec.revisionFeedback` is non-empty AND `metadata.generation` is greater than `Analyzed.observedGeneration`, the system MUST treat the proposal as needing **re-analysis** before continuing downstream steps. Re-analysis MUST append revision context to the user-visible request text (after `spec.request`), then reset execution/verification/escalation progress as implemented for revision handling, and MUST NOT advance execution until the new analysis completes. 7. **Execution retries (verification-gated)**: When `spec.verification` is present, after a successful execution the verification step MAY fail **objectively** if the agent reports failure **or** any verification check records a non-pass outcome (even when a coarse success flag might otherwise read true). In that case the system MAY increment `status.steps.execution.retryCount` and clear execution/verification progress to run execution again, bounded by the effective max attempt count from approval policy and execution approval (see `approval.md`). While awaiting a retry, `Verified` MUST be `False` with reason indicating retrying execution. 8. **Escalation injection**: When verification has failed and retries are exhausted (per `approval.md`), the system MUST set `Verified` to `False` with reason indicating retries exhausted and MUST set `Escalated` to `Unknown` with reason indicating retries exhausted, entering the escalating phase until the escalation step completes or fails. 9. **DerivePhase — precedence (first match in order)**: - - If `Escalated` exists with status `True` → phase `Escalated`. + - If `EmergencyStopped` exists with status `True` → phase `EmergencyStopped`. + - Else if `Escalated` exists with status `True` → phase `Escalated`. - Else if `Denied` exists with status `True` → phase `Denied`. 
- Else if `Escalated` exists → if status is `Unknown` → phase `Escalating`; otherwise → phase `Failed`. - Else evaluate `Verified` if present: @@ -30,7 +31,7 @@ Behavioral specification for the `Proposal` resource lifecycle. **Approval gates - If `Analyzed` is `Unknown` → phase `Analyzing`. - If `Analyzed` is `False` → phase `Failed`. - Else → phase `Pending`. -10. **Denial vs escalation in derivation**: `Escalated=True` MUST win over `Denied=True` if both are present because derivation checks complete escalation before denial. Otherwise `Denied=True` MUST win over non-terminal progress (`Analyzed`, `Executed`, `Verified` combinations). +10. **EmergencyStopped vs other terminals in derivation**: `EmergencyStopped=True` MUST win over all other conditions because derivation checks it first. `Escalated=True` MUST win over `Denied=True` if both are present because derivation checks complete escalation before denial. Otherwise `Denied=True` MUST win over non-terminal progress (`Analyzed`, `Executed`, `Verified` combinations). 11. **Advisory completion**: If execution is absent and verification is absent, after successful analysis the controller MAY set `Executed` and `Verified` to `True` with skip reasons such that the derived phase is `Completed`. 12. **Trust mode completion**: If execution is present and verification is absent, after successful execution the controller MUST set `Verified` to `True` with a skip reason such that the derived phase is `Completed`. 13. **Skipped steps**: `Executed=True` with skip reason and `Verified=True` with skip reason together MUST derive `Completed` when that is the intended advisory outcome per tests and valid condition combinations. @@ -67,3 +68,4 @@ Behavioral specification for the `Proposal` resource lifecycle. **Approval gates - [PLANNED: OLS-2913] Populate `status.steps..conditions` consistently for UIs/CLI without inferring only from top-level conditions. - [PLANNED: OLS-2894] **Per-proposal approval overrides** (e.g. 
annotations) and **namespace-scoped approval policy** if product requires policy resolution beyond cluster singleton `ApprovalPolicy` named `cluster` (current code: cluster singleton only; see `approval.md`). +- [PLANNED: OLS-3018] `EmergencyStopped` phase and condition type added to proposal lifecycle. See `system-config.md` for full kill switch specification. diff --git a/.ai/spec/what/sandbox-execution.md b/.ai/spec/what/sandbox-execution.md index 7ce586b5..1a330ff6 100644 --- a/.ai/spec/what/sandbox-execution.md +++ b/.ai/spec/what/sandbox-execution.md @@ -31,7 +31,7 @@ Behavioral specification for how workflow steps run inside ephemeral **sandboxes 25. **Finalizers**: Non-deleted proposals MUST gain a cleanup finalizer before leaving non-terminal phases so deletion can run RBAC and sandbox release hooks safely. 26. **Result CR writes**: After each successful or failed agent invocation (per step), the controller MUST create/update an `AnalysisResult`, `ExecutionResult`, `VerificationResult`, or `EscalationResult` with immutable spec, owner reference to the `Proposal`, started/completed conditions, embedded outcome payload, sandbox reference, and optional `failureReason` for system errors. 27. **Retry index**: `ExecutionResult` and `VerificationResult` MUST record the current execution retry index in spec for correlation with `status.steps.execution.retryCount`. -28. **Sandbox release**: On proposal deletion and on terminal phases (`Completed`, `Denied`, `Escalated`), the controller MUST delete known sandbox claims recorded under `status.steps.*.sandbox` (best-effort aggregation; first error MAY be returned for visibility). +28. **Sandbox release**: On proposal deletion and on terminal phases (`Completed`, `Denied`, `Escalated`, `EmergencyStopped`), the controller MUST delete known sandbox claims recorded under `status.steps.*.sandbox` (best-effort aggregation; first error MAY be returned for visibility). 
For `EmergencyStopped`, sandbox release is part of the termination sequence (see `system-config.md`). 29. **Concurrency cap**: Maximum concurrent proposal reconciles SHOULD respect `ApprovalPolicy.spec.maxConcurrentProposals` when present (see `crd-api.md`). ## Configuration Surface diff --git a/.ai/spec/what/system-config.md b/.ai/spec/what/system-config.md new file mode 100644 index 00000000..48b19755 --- /dev/null +++ b/.ai/spec/what/system-config.md @@ -0,0 +1,71 @@ +# System configuration and kill switch (`AgenticOLSConfig`) + +Behavioral specification for the cluster-wide agentic system configuration CR and its **emergency suspension** (kill switch) capability. **Proposal lifecycle phases** are in `proposal-lifecycle.md`. **CRD field semantics** for other kinds are in `crd-api.md`. + +Jira tracking: OLS-3018. + +## Behavioral Rules + +### AgenticOLSConfig CRD + +1. **Kind and scope**: `AgenticOLSConfig` MUST be cluster-scoped in API group `agentic.openshift.io`, version `v1alpha1`. +2. **Singleton**: CRD validation MUST enforce `metadata.name == "cluster"` via CEL (same pattern as `ApprovalPolicy`). +3. **Absence semantics**: When no `AgenticOLSConfig` CR exists, the system MUST behave as if `spec.suspended` is `false` — the CR is not required for normal operation. +4. **Spec structure**: The spec MUST include: + - `suspended` (bool, optional, default `false`): When `true`, halts all agentic operations cluster-wide. + +### Emergency Suspension (`spec.suspended`) + +5. **Activation**: Setting `spec.suspended` to `true` MUST immediately prevent the proposal reconciler from starting any new workflow steps (analysis, execution, verification, escalation) for any proposal cluster-wide. +6. **In-flight termination**: When `spec.suspended` becomes `true`, all non-terminal proposals MUST be terminated: sandbox pods MUST be deleted (best-effort), execution RBAC MUST be cleaned up, and the `EmergencyStopped` condition MUST be set on each proposal. +7. 
**EmergencyStopped condition**: The operator MUST set condition type `EmergencyStopped` with status `True`, reason `SystemSuspended`, and message `"Terminated by system kill switch (AgenticOLSConfig.spec.suspended=true)"`. +8. **EmergencyStopped is terminal — no automatic restart**: `EmergencyStopped` is a terminal phase. Proposals in this state MUST NOT resume when `spec.suspended` is set back to `false`. To retry work, the admin creates new proposals. This is a safety invariant: the kill switch exists for emergencies where agent behavior is harmful, so automatically restarting the same proposals that caused the emergency would re-introduce the exact problem the admin stopped. Resumption MUST always require explicit human action (creating new proposals). +9. **DerivePhase precedence**: `EmergencyStopped=True` MUST be checked **before** all other conditions in `DerivePhase()`. It takes precedence over `Escalated`, `Denied`, and all progress conditions. +10. **Resumption**: Setting `spec.suspended` back to `false` re-enables the system for **new** proposals only. Existing `EmergencyStopped` proposals remain terminal. +11. **New proposal blocking**: While `suspended=true`, proposals that are already in `Pending` phase (no conditions set yet) MUST also be terminated with `EmergencyStopped` — suspension applies to all non-terminal proposals, not just those with active sandboxes. + +### Reconciler Integration + +12. **Watch and re-queue**: The proposal reconciler MUST watch `AgenticOLSConfig` and re-queue all non-terminal proposals when the CR changes (same pattern as the existing `ApprovalPolicy` watch). +13. **Reconcile guard**: The suspension check MUST execute after the deletion handler but before finalizer addition, terminal phase routing, approval resolution, and phase dispatch. +14. 
**Order of operations on termination**: For each non-terminal proposal when suspended: (a) release sandbox claims via `Agent.ReleaseSandboxes` (best-effort, log errors), (b) clean up execution RBAC via `cleanupExecutionRBAC` (best-effort, log errors), (c) set `EmergencyStopped` condition, (d) status patch. Errors in (a) or (b) MUST NOT prevent (c) and (d). +15. **Config fetch failure**: If the `AgenticOLSConfig` CR cannot be fetched and the error is not `NotFound`, the reconciler MUST return the error for retry. `NotFound` MUST be treated as `suspended=false`. + +### Console Visibility + +16. **Suspension banner**: The console plugin MUST display a cluster-wide danger alert banner when `AgenticOLSConfig.spec.suspended == true`. The banner MUST be visible on all agentic views without requiring page reload when the state changes. +17. **EmergencyStopped phase display**: The console MUST render `EmergencyStopped` proposals with a distinct visual treatment (status badge, color) that is clearly different from `Failed`. +18. **DerivePhase sync**: The console's `derivePhaseFromConditions` function in `src/models/proposal.ts` MUST be updated to handle the `EmergencyStopped` condition with the same precedence as the Go implementation (per the existing `// SYNC:` contract). + +### CLI Visibility + +19. **Status command**: `oc agentic status` (or equivalent top-level command) MUST report the system suspension state: `"Agentic System: SUSPENDED"` when suspended, `"Agentic System: Active"` when not. +20. **Suspend/resume commands**: The CLI MUST provide `oc agentic suspend` and `oc agentic resume` commands that patch `AgenticOLSConfig.spec.suspended` to `true` and `false` respectively. +21. **Suspend confirmation**: `oc agentic suspend` MUST prompt for confirmation before proceeding: `"All agentic operations will be halted and in-flight proposals will be terminated. Continue? [y/N]"`. +22. 
**Proposal list**: `oc agentic proposals` (or equivalent list command) MUST display `EmergencyStopped` as a distinct phase value in the phase/status column. + +## Configuration Surface + +### AgenticOLSConfig +- `metadata.name` (must be `cluster`) +- `spec.suspended` (bool, default `false`) + +### Affected Proposal fields +- `status.conditions` — new condition type `EmergencyStopped` +- Derived phase `EmergencyStopped` added to `ProposalPhase` enum + +### Affected repositories +- `lightspeed-agentic-operator` — CRD types, proposal reconciler, CLI commands +- `lightspeed-agentic-console` — `derivePhaseFromConditions` sync, suspension banner, phase display + +## Constraints + +- `EmergencyStopped` MUST be added to `isTerminal()` in the reconciler and any console/CLI equivalents. +- The `AgenticOLSConfig` controller RBAC MUST include `get`, `list`, `watch` on `agenticolsconfigs` for the proposal reconciler's service account. +- The `oc agentic suspend` / `resume` commands require the user to have `patch` permissions on `AgenticOLSConfig`. +- Termination of in-flight proposals via Approach A (reconciler re-queue) is bounded by `maxConcurrentReconciles`; at default concurrency (5) with 100 proposals, termination completes in approximately 4-8 seconds. This is acceptable for v1. If real-world scale requires faster termination, a batch-sweep approach (Approach B) can be added to the `AgenticOLSConfig` reconciler without changing any other component. + +## Planned Changes + +- [PLANNED: future] Batch-sweep termination (Approach B): if Approach A's reconciler-based termination proves too slow at scale, add a direct sweep in the `AgenticOLSConfig` reconciler that lists and terminates all non-terminal proposals in a single pass with goroutine fan-out. +- [PLANNED: future] Additional config fields (e.g., system-wide defaults, feature gates) can be added to the `AgenticOLSConfig` spec as needed.
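
Reviewer sketch (not part of the patch): the reconcile guard and termination order in rules 13 to 15 of `system-config.md` could look roughly like the Go below. It is a minimal illustration under stated assumptions, not the operator's implementation. `Config`, `Condition`, `Proposal`, `Hooks`, `suspensionGuard`, and `errNotFound` are hypothetical stand-ins for the real API types, the reconciler's `Agent.ReleaseSandboxes` / `cleanupExecutionRBAC` / status-patch calls, and a Kubernetes `NotFound` error. The `Terminal` field stands in for `isTerminal(DerivePhase(conditions))`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// errNotFound is a hypothetical stand-in for a Kubernetes NotFound API error.
var errNotFound = errors.New("not found")

// Config mirrors only AgenticOLSConfig.spec.suspended (assumed shape).
type Config struct{ Suspended bool }

// Condition is a reduced form of a Kubernetes status condition.
type Condition struct{ Status, Reason, Message string }

// Proposal is a simplified proposal; Terminal stands in for the result of
// isTerminal(DerivePhase(conditions)) in the real reconciler.
type Proposal struct {
	Name       string
	Terminal   bool
	Conditions map[string]Condition
}

// Hooks groups the side-effecting steps so the sketch runs without a cluster.
type Hooks struct {
	FetchConfig      func() (*Config, error)
	ReleaseSandboxes func(*Proposal) error
	CleanupRBAC      func(*Proposal) error
	PatchStatus      func(*Proposal) error
}

// suspensionGuard reports stopped=true when the proposal was terminated and
// the caller should return from Reconcile without running further steps.
func suspensionGuard(p *Proposal, h Hooks) (stopped bool, err error) {
	cfg, err := h.FetchConfig()
	if errors.Is(err, errNotFound) {
		return false, nil // rule 15: NotFound is treated as suspended=false
	}
	if err != nil {
		// rule 15: any other fetch error is returned so the request is retried
		return false, fmt.Errorf("fetch AgenticOLSConfig: %w", err)
	}
	if cfg == nil || !cfg.Suspended || p.Terminal {
		return false, nil
	}

	// rule 14 (a) and (b): best-effort cleanup; errors are logged, not returned
	if err := h.ReleaseSandboxes(p); err != nil {
		log.Printf("release sandboxes for %s: %v", p.Name, err)
	}
	if err := h.CleanupRBAC(p); err != nil {
		log.Printf("clean up execution RBAC for %s: %v", p.Name, err)
	}

	// rule 14 (c): set the terminal condition described in rule 7
	if p.Conditions == nil {
		p.Conditions = map[string]Condition{}
	}
	p.Conditions["EmergencyStopped"] = Condition{
		Status:  "True",
		Reason:  "SystemSuspended",
		Message: "Terminated by system kill switch (AgenticOLSConfig.spec.suspended=true)",
	}
	p.Terminal = true

	// rule 14 (d): persist the status
	if err := h.PatchStatus(p); err != nil {
		return false, fmt.Errorf("patch status for %s: %w", p.Name, err)
	}
	return true, nil
}
```

Passing the side effects in as function fields keeps the ordering testable in isolation: a failing `ReleaseSandboxes` hook still leads to the `EmergencyStopped` condition and the status patch, which is the property rule 14 requires.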