3 changes: 2 additions & 1 deletion .ai/spec/README.md
@@ -21,6 +21,7 @@ AI agents (Claude). Specs optimize for precision, unambiguous rules, and machine
| Look up a CRD field | `what/crd-api.md` |
| Understand the approval system | `what/approval.md` |
| Understand sandbox pod lifecycle | `what/sandbox-execution.md` |
| Understand the kill switch / system config | `what/system-config.md` |
| Navigate the controller codebase | `how/reconciler.md` |
| Understand the CLI plugin | `how/cli.md` |

@@ -35,4 +36,4 @@ AI agents (Claude). Specs optimize for precision, unambiguous rules, and machine

This operator watches `Proposal` CRs and drives them through a multi-phase workflow (analysis, execution, verification) by calling the sandbox runtime's `POST /v1/agent/run` endpoint. The console plugin provides the human-facing UI. Skills are mounted as OCI image volumes.

Jira tracking: Feature OCPSTRAT-3095, Epic OLS-2894.
Jira tracking: Feature OCPSTRAT-3095, Epic OLS-2894, Kill Switch OLS-3018.
18 changes: 10 additions & 8 deletions .ai/spec/how/reconciler.md
@@ -64,16 +64,17 @@ Audience: AI agents. Behavioral rules and phase semantics live in **what/** spec

## Data flow: reconcile loop

1. **Watch / enqueue:** controller-runtime delivers `ctrl.Request` for a `Proposal` namespaced name. `SetupWithManager` also `Owns` child CRs (`ProposalApproval`, `AnalysisResult`, `ExecutionResult`, `VerificationResult`, `EscalationResult`) and **Watches** cluster `ApprovalPolicy` to enqueue all non-terminal proposals.
1. **Watch / enqueue:** controller-runtime delivers `ctrl.Request` for a `Proposal` namespaced name. `SetupWithManager` also `Owns` child CRs (`ProposalApproval`, `AnalysisResult`, `ExecutionResult`, `VerificationResult`, `EscalationResult`) and **Watches** cluster `ApprovalPolicy` and `AgenticOLSConfig` to enqueue all non-terminal proposals when either changes.
2. **`Reconcile` load:** `Get` `Proposal`; ignore not-found.
3. **Deletion path:** If `DeletionTimestamp` set and finalizer `agentic.openshift.io/execution-rbac-cleanup` present: `Agent.ReleaseSandboxes`, `cleanupExecutionRBAC`, remove finalizer, return.
4. **Phase:** `agenticv1alpha1.DerivePhase(proposal.Status.Conditions)` -- see **what/** for semantics.
5. **Finalizer add:** If not terminal and finalizer missing, add RBAC cleanup finalizer (re-fetch proposal after patch).
6. **Terminal / failed shortcuts:** Completed/Denied/Escalated → optional sandbox release via `Agent.ReleaseSandboxes`. `ProposalPhaseFailed` → `handleFailed` (RBAC cleanup if annotation set).
7. **Shared prelude:** `getApprovalPolicy` (cluster singleton name `cluster`), `ensureProposalApproval`, `resolveProposal`. Resolution failure → set `ProposalConditionAnalyzed=False` with `reasonWorkflowFailed`, status patch, return (no requeue).
8. **Phase switch:** Routes to `handleRevision` (if `needsRevision`) before analysis/execution/escalation arms; otherwise `handleAnalysis`, `handleExecution`, `handleVerification`, `handleEscalation`, or no-op.
9. **Handlers** set step conditions (`Unknown` → agent call → `True`/`False`), create result CRs, append `Status.Steps.*.Results`, `statusPatch` proposal.
10. **Agent path:** All agent steps go through `r.Agent.*` which (in production) is `SandboxAgentCaller`: template + `EnsureAgentTemplate` → `Sandbox.Claim` → early `patchSandboxInfo` on proposal → `WaitReady` → `AgentHTTPClient.Run` → JSON unmarshal into outputs.
4. **[PLANNED: OLS-3018] Suspension check:** Fetch `AgenticOLSConfig` singleton. If `spec.suspended == true` and proposal is non-terminal: release sandboxes, clean up RBAC, set `EmergencyStopped=True` condition, status patch, return. If CR not found, treat as not suspended. See **what/system-config.md**.
5. **Phase:** `agenticv1alpha1.DerivePhase(proposal.Status.Conditions)` -- see **what/** for semantics. Now includes `EmergencyStopped` as highest-precedence terminal phase.
6. **Finalizer add:** If not terminal and finalizer missing, add RBAC cleanup finalizer (re-fetch proposal after patch).
7. **Terminal / failed shortcuts:** Completed/Denied/Escalated/EmergencyStopped → optional sandbox release via `Agent.ReleaseSandboxes`. `ProposalPhaseFailed` → `handleFailed` (RBAC cleanup if annotation set).
8. **Shared prelude:** `getApprovalPolicy` (cluster singleton name `cluster`), `ensureProposalApproval`, `resolveProposal`. Resolution failure → set `ProposalConditionAnalyzed=False` with `reasonWorkflowFailed`, status patch, return (no requeue).
9. **Phase switch:** Routes to `handleRevision` (if `needsRevision`) before analysis/execution/escalation arms; otherwise `handleAnalysis`, `handleExecution`, `handleVerification`, `handleEscalation`, or no-op.
10. **Handlers** set step conditions (`Unknown` → agent call → `True`/`False`), create result CRs, append `Status.Steps.*.Results`, `statusPatch` proposal.
11. **Agent path:** All agent steps go through `r.Agent.*` which (in production) is `SandboxAgentCaller`: template + `EnsureAgentTemplate` → `Sandbox.Claim` → early `patchSandboxInfo` on proposal → `WaitReady` → `AgentHTTPClient.Run` → JSON unmarshal into outputs.
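
The planned suspension check (step 4 of the updated loop) can be read as a small decision function. The following is a minimal sketch under stated assumptions, not the operator's code: `Config`, `Action`, and `suspensionGuard` are hypothetical names, a `nil` config stands in for "CR not found", and the real reconciler would release sandboxes, clean up RBAC, and patch status where this sketch only returns a label.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for the AgenticOLSConfig singleton.
// Only spec.suspended from the spec is modeled here.
type Config struct {
	Suspended bool
}

// Action is a hypothetical label for what the reconciler does next.
type Action string

const (
	ActionContinue      Action = "Continue"      // proceed with the normal reconcile path
	ActionEmergencyStop Action = "EmergencyStop" // release sandboxes, clean up RBAC, set EmergencyStopped=True
)

// suspensionGuard sketches the planned check. A nil cfg models a missing
// AgenticOLSConfig CR, which the spec says must be treated as not suspended.
// Terminal proposals are left alone even while the system is suspended.
func suspensionGuard(cfg *Config, terminal bool) Action {
	if cfg == nil || !cfg.Suspended {
		return ActionContinue
	}
	if terminal {
		return ActionContinue
	}
	return ActionEmergencyStop
}

func main() {
	fmt.Println(suspensionGuard(nil, false))                      // Continue
	fmt.Println(suspensionGuard(&Config{Suspended: true}, false)) // EmergencyStop
	fmt.Println(suspensionGuard(&Config{Suspended: true}, true))  // Continue
}
```

Keeping the decision separate from its side effects makes the "absent CR means not suspended" rule easy to test on its own.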

---

@@ -172,6 +173,7 @@ ProposalReconciler.Reconcile
- **`cmd/main.go` scheme:** Only core + `agenticv1alpha1`. Watching or applying arbitrary CRDs from tests may need extended schemes (see `reconciler_test.go`).
- **Max concurrent reconciles:** `SetupWithManager` reads cluster `ApprovalPolicy` via API reader for `MaxConcurrentProposals`, else `DefaultMaxConcurrentProposals` from API package.
- **Policy watch:** Enqueues **all** non-terminal proposals on any `ApprovalPolicy` event -- can be chatty.
- **[PLANNED: OLS-3018] AgenticOLSConfig watch:** Same pattern as policy watch -- enqueues all non-terminal proposals on any `AgenticOLSConfig` change. When `suspended` flips to `true`, all re-queued proposals hit the suspension guard and get terminated.
- **Workflow resolution errors:** Patched onto `ProposalConditionAnalyzed` false -- see API for exact condition ordering vs `DerivePhase`.
- **`selectedOption` vs trim:** Verification uses latest analysis result's **first** option (`Options[0]`) when resolving; execution path uses `trimNonSelectedOptions` which respects `ProposalApproval` execution option index when multiple options exist.
- **`maxAttempts`:** Combines `ApprovalPolicy.Spec.MaxAttempts` ceiling with per-approval execution override (`helpers.go`); retry semantics interact with verification failure branch in `handleVerification` (see **what/proposal-lifecycle.md**).
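
The "enqueue all non-terminal proposals" pattern shared by the policy watch and the planned `AgenticOLSConfig` watch comes down to a filter over the listed proposals. The sketch below is illustrative only: `Proposal`, `terminalPhases`, and `requestsForNonTerminal` are hypothetical names, plain `namespace/name` strings stand in for reconcile requests, and it assumes the watch treats `Failed` as terminal, as the lifecycle spec does.

```go
package main

import "fmt"

// Proposal is a hypothetical, trimmed-down view of a Proposal CR:
// its namespaced name plus the phase derived from its conditions.
type Proposal struct {
	Namespace string
	Name      string
	Phase     string
}

// terminalPhases lists the phases the lifecycle spec calls terminal.
var terminalPhases = map[string]bool{
	"Completed":        true,
	"Denied":           true,
	"Escalated":        true,
	"Failed":           true,
	"EmergencyStopped": true,
}

// requestsForNonTerminal returns one "namespace/name" key per proposal that
// is still in flight. A cluster-scoped config event would fan out to these.
func requestsForNonTerminal(proposals []Proposal) []string {
	var keys []string
	for _, p := range proposals {
		if terminalPhases[p.Phase] {
			continue // terminal proposals are never re-enqueued
		}
		keys = append(keys, p.Namespace+"/"+p.Name)
	}
	return keys
}

func main() {
	all := []Proposal{
		{Namespace: "ns", Name: "a", Phase: "Analyzing"},
		{Namespace: "ns", Name: "b", Phase: "Completed"},
		{Namespace: "ns", Name: "c", Phase: "Executing"},
	}
	fmt.Println(requestsForNonTerminal(all)) // [ns/a ns/c]
}
```

Because every event fans out to every in-flight proposal, the cost of a single config change grows with the number of active proposals, which is why the policy watch is described as chatty.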
3 changes: 2 additions & 1 deletion .ai/spec/what/README.md
@@ -7,9 +7,10 @@ These specs define WHAT the operator must do -- testable behavioral rules, confi
| Spec | Description |
|------|-------------|
| [proposal-lifecycle.md](proposal-lifecycle.md) | Proposal phases, condition-driven state machine, retry logic, revision, escalation |
| [crd-api.md](crd-api.md) | All CRD types and field semantics: Proposal, Agent, LLMProvider, ApprovalPolicy, result CRs |
| [crd-api.md](crd-api.md) | All CRD types and field semantics: Proposal, Agent, LLMProvider, ApprovalPolicy, AgenticOLSConfig, result CRs |
| [approval.md](approval.md) | Human-in-the-loop approval system: policy modes, stage gates, deny semantics |
| [sandbox-execution.md](sandbox-execution.md) | Sandbox pod lifecycle, RBAC scoping, agent communication, skills mounting |
| [system-config.md](system-config.md) | AgenticOLSConfig CRD, emergency kill switch (spec.suspended), console/CLI visibility |

## Relationship to how/ Specs

14 changes: 11 additions & 3 deletions .ai/spec/what/crd-api.md
@@ -6,16 +6,16 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in

1. **Group/version**: All kinds in this specification use API group `agentic.openshift.io` and version `v1alpha1`.
2. **Scope -- namespaced**: `Proposal`, `ProposalApproval`, `AnalysisResult`, `ExecutionResult`, `VerificationResult`, `EscalationResult` MUST be namespace-scoped; their `metadata.namespace` is the tenant/workload namespace.
3. **Scope -- cluster**: `Agent`, `LLMProvider`, and `ApprovalPolicy` MUST be cluster-scoped; `metadata.name` is the global identifier.
3. **Scope -- cluster**: `Agent`, `LLMProvider`, `ApprovalPolicy`, and `AgenticOLSConfig` MUST be cluster-scoped; `metadata.name` is the global identifier.
4. **Proposal identity**: A `Proposal` MUST include required immutable fields per CEL: at minimum `spec.request` and `spec.analysis`. Omitting `spec.execution` or `spec.verification` means those steps do not exist for that proposal (see `proposal-lifecycle.md`).
5. **Proposal -- `spec.request`**: Human/agent input text; immutable after creation; max length enforced by validation.
6. **Proposal -- `spec.revisionFeedback`**: Only mutable spec field; when set/non-empty and `metadata.generation` advances beyond the analyzed condition's `observedGeneration`, operators MUST trigger re-analysis per `proposal-lifecycle.md`.
7. **Proposal -- `spec.targetNamespaces`**: Optional list of namespaces for context and RBAC targeting; immutable once set; when empty, RBAC targeting MAY fall back to namespaces declared in analysis RBAC output at execution time (see `sandbox-execution.md`).
8. **Proposal -- `spec.analysisOutput`**: Immutable after set. `mode` defaults to full analysis schema when empty/default. `mode=Minimal` REQUIRES `schema` to be set, forbids `spec.execution` and `spec.verification`, and restricts option shape accordingly.
9. **Proposal -- `spec.tools`**: Default `ToolsSpec` for all steps; immutable once set. Per-step `tools` on `spec.analysis` / `spec.execution` / `spec.verification` replaces the default for that step only when non-zero.
10. **Proposal -- `spec.analysis|execution|verification`**: Immutable `ProposalStep` records after set. Each non-zero step MAY name `agent` (DNS subdomain) defaulting to `default` when empty; MAY carry per-step `tools`.
11. **Proposal -- `status`**: Observed-only. `status.conditions` holds map-merge conditions (types include `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated`). `status.steps` holds per-step sandbox info, retry counter (execution), and result refs.
12. **Phase display types**: `ProposalPhase` and `StepPhase` string enums in the API describe display labels only; they are not stored fields on `Proposal` (phase is derived -- see `proposal-lifecycle.md`). `StepPhase` values include `PendingApproval`, `Running`, `Completed`, `Failed`, `Skipped`.
11. **Proposal -- `status`**: Observed-only. `status.conditions` holds map-merge conditions (types include `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated`, `EmergencyStopped`). `status.steps` holds per-step sandbox info, retry counter (execution), and result refs.
12. **Phase display types**: `ProposalPhase` and `StepPhase` string enums in the API describe display labels only; they are not stored fields on `Proposal` (phase is derived -- see `proposal-lifecycle.md`). `ProposalPhase` values include `EmergencyStopped` (terminal, set by kill switch -- see `system-config.md`). `StepPhase` values include `PendingApproval`, `Running`, `Completed`, `Failed`, `Skipped`.
13. **Sandbox step enum**: `SandboxStep` values `Analysis`, `Execution`, `Verification`, `Escalation` identify workflow steps for approvals, sandbox labels, and policies.
14. **Agent -- `spec.llmProvider`**: Required reference by name to a cluster `LLMProvider`.
15. **Agent -- `spec.model`**: Required provider-specific model identifier string; validation restricts charset.
@@ -45,6 +45,9 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in
39. **Result CR ownership**: Result CRs MUST declare controller `ownerReferences` to their `Proposal` for GC; naming follows operator conventions (see `sandbox-execution.md` for when they are created).
40. **Label conventions**: Operator uses labels for proposal name, step, component, and managed template markers (exact keys are implementation-specific; behavior: selectors for GC/list, not duplicated here).
41. **CEL immutability** (Proposal): Enforced transitions include: `request`, `targetNamespaces`, `analysisOutput`, `tools`, `analysis`, `execution`, `verification` immutability after initial set as encoded in API markers.
42. **AgenticOLSConfig -- singleton name**: CRD validation requires `metadata.name` equals `cluster` (same pattern as `ApprovalPolicy`).
43. **AgenticOLSConfig -- `spec.suspended`**: Bool, optional, default `false`. When `true`, halts all agentic operations cluster-wide and terminates in-flight proposals with `EmergencyStopped` condition. See `system-config.md` for full semantics.
44. **AgenticOLSConfig -- absence**: When no `AgenticOLSConfig` CR exists, the system MUST behave as if `spec.suspended` is `false`.

## Configuration Surface (by path)

@@ -62,6 +65,10 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in
### ApprovalPolicy
- `metadata.name` (must be `cluster`), `spec.stages[]`, `spec.maxAttempts`, `spec.maxConcurrentProposals`

### AgenticOLSConfig
- `metadata.name` (must be `cluster`), `spec.suspended`
- See `system-config.md` for full behavioral rules

### ProposalApproval
- `metadata.name`, `metadata.namespace`, `spec.stages[]`, `status.stages[]`

@@ -84,3 +91,4 @@ Kubernetes API surface for the agentic operator. **Lifecycle and gates** are in

- [PLANNED: OLS-2940] Autonomous workflow CRD migrations may rename or reshape fields; specs MUST be updated when `v1alpha1` changes.
- [PLANNED: OLS-2894] Explicit **Agent** fields for per-step system prompts if moved from template/runtime-only assembly (today prompts are composed outside `Agent` CR -- see `sandbox-execution.md`).
- [PLANNED: OLS-3018] `AgenticOLSConfig` CRD with `spec.suspended` kill switch, `EmergencyStopped` condition type on Proposal, console and CLI visibility. See `system-config.md` for full specification.
12 changes: 7 additions & 5 deletions .ai/spec/what/proposal-lifecycle.md
@@ -5,15 +5,16 @@ Behavioral specification for the `Proposal` resource lifecycle. **Approval gates
## Behavioral Rules

1. **Source of truth**: `status.conditions` (Kubernetes conditions keyed by `type`) is authoritative. The **phase** is a derived display value only; it is not persisted as its own field.
2. **Phases**: The system MUST derive exactly one phase label from `status.conditions` using the algorithm in rule 9 (and precedence rules 10–11). Valid labels: `Pending`, `Analyzing`, `Proposed`, `Executing`, `Verifying`, `Completed`, `Failed`, `Denied`, `Escalating`, `Escalated`.
3. **Condition types (proposal-level)**: The workflow uses `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated` (string values as defined on the API). Status values are `True`, `False`, or `Unknown`.
4. **Terminal phases**: `Completed`, `Denied`, `Escalated`, and `Failed` are terminal for reconciliation progression. After `Completed`, `Denied`, or `Escalated`, the controller MUST stop active work and MAY release sandbox claims when present. `Failed` triggers failure cleanup behaviors (see `sandbox-execution.md` for RBAC cleanup interactions).
2. **Phases**: The system MUST derive exactly one phase label from `status.conditions` using the algorithm in rule 9 (and precedence rules 10–11). Valid labels: `Pending`, `Analyzing`, `Proposed`, `Executing`, `Verifying`, `Completed`, `Failed`, `Denied`, `Escalating`, `Escalated`, `EmergencyStopped`.
3. **Condition types (proposal-level)**: The workflow uses `Analyzed`, `Executed`, `Verified`, `Denied`, `Escalated`, `EmergencyStopped` (string values as defined on the API). Status values are `True`, `False`, or `Unknown`.
4. **Terminal phases**: `Completed`, `Denied`, `Escalated`, `Failed`, and `EmergencyStopped` are terminal for reconciliation progression. After `Completed`, `Denied`, `Escalated`, or `EmergencyStopped`, the controller MUST stop active work and MAY release sandbox claims when present. `Failed` triggers failure cleanup behaviors (see `sandbox-execution.md` for RBAC cleanup interactions). `EmergencyStopped` indicates the proposal was terminated by the system kill switch (see `system-config.md`).
5. **Workflow shape**: `spec.analysis` is always required. `spec.execution` and `spec.verification` MAY be omitted; omission skips those steps subject to rules 20–22.
6. **Revision loop**: If `spec.revisionFeedback` is non-empty AND `metadata.generation` is greater than `Analyzed.observedGeneration`, the system MUST treat the proposal as needing **re-analysis** before continuing downstream steps. Re-analysis MUST append revision context to the user-visible request text (after `spec.request`), then reset execution/verification/escalation progress as implemented for revision handling, and MUST NOT advance execution until the new analysis completes.
7. **Execution retries (verification-gated)**: When `spec.verification` is present, after a successful execution the verification step MAY fail **objectively** if the agent reports failure **or** any verification check records a non-pass outcome (even when a coarse success flag might otherwise read true). In that case the system MAY increment `status.steps.execution.retryCount` and clear execution/verification progress to run execution again, bounded by the effective max attempt count from approval policy and execution approval (see `approval.md`). While awaiting a retry, `Verified` MUST be `False` with reason indicating retrying execution.
8. **Escalation injection**: When verification has failed and retries are exhausted (per `approval.md`), the system MUST set `Verified` to `False` with reason indicating retries exhausted and MUST set `Escalated` to `Unknown` with reason indicating retries exhausted, entering the escalating phase until the escalation step completes or fails.
9. **DerivePhase -- precedence (first match in order)**:
- If `Escalated` exists with status `True` → phase `Escalated`.
- If `EmergencyStopped` exists with status `True` → phase `EmergencyStopped`.
- Else if `Escalated` exists with status `True` → phase `Escalated`.
- Else if `Denied` exists with status `True` → phase `Denied`.
- Else if `Escalated` exists → if status is `Unknown` → phase `Escalating`; otherwise → phase `Failed`.
- Else evaluate `Verified` if present:
@@ -30,7 +31,7 @@ Behavioral specification for the `Proposal` resource lifecycle. **Approval gates
- If `Analyzed` is `Unknown` → phase `Analyzing`.
- If `Analyzed` is `False` → phase `Failed`.
- Else → phase `Pending`.
10. **Denial vs escalation in derivation**: `Escalated=True` MUST win over `Denied=True` if both are present because derivation checks complete escalation before denial. Otherwise `Denied=True` MUST win over non-terminal progress (`Analyzed`, `Executed`, `Verified` combinations).
10. **EmergencyStopped vs other terminals in derivation**: `EmergencyStopped=True` MUST win over all other conditions because derivation checks it first. `Escalated=True` MUST win over `Denied=True` if both are present because derivation checks complete escalation before denial. Otherwise `Denied=True` MUST win over non-terminal progress (`Analyzed`, `Executed`, `Verified` combinations).
11. **Advisory completion**: If execution is absent and verification is absent, after successful analysis the controller MAY set `Executed` and `Verified` to `True` with skip reasons such that the derived phase is `Completed`.
12. **Trust mode completion**: If execution is present and verification is absent, after successful execution the controller MUST set `Verified` to `True` with a skip reason such that the derived phase is `Completed`.
13. **Skipped steps**: `Executed=True` with skip reason and `Verified=True` with skip reason together MUST derive `Completed` when that is the intended advisory outcome per tests and valid condition combinations.
@@ -67,3 +68,4 @@ Behavioral specification for the `Proposal` resource lifecycle. **Approval gates

- [PLANNED: OLS-2913] Populate `status.steps.<step>.conditions` consistently for UIs/CLI without inferring only from top-level conditions.
- [PLANNED: OLS-2894] **Per-proposal approval overrides** (e.g. annotations) and **namespace-scoped approval policy** if product requires policy resolution beyond cluster singleton `ApprovalPolicy` named `cluster` (current code: cluster singleton only; see `approval.md`).
- [PLANNED: OLS-3018] `EmergencyStopped` phase and condition type added to proposal lifecycle. See `system-config.md` for full kill switch specification.
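
The terminal-precedence head of the updated `DerivePhase` (rules 9 and 10 in this file) can be sketched as below. This is a hedged illustration, not the API package's implementation: `Status`, `derivePhasePrefix`, and the plain map of condition type to status are hypothetical simplifications, and the `Verified` / `Executed` / `Analyzed` branches that follow in the spec are left out, signalled by `matched == false`.

```go
package main

import "fmt"

// Status is a hypothetical stand-in for a Kubernetes condition status.
type Status string

const (
	StatusTrue    Status = "True"
	StatusFalse   Status = "False"
	StatusUnknown Status = "Unknown"
)

// derivePhasePrefix applies only the first four precedence checks of the
// updated rule 9. conds maps condition type to status; a missing key means
// the condition does not exist. matched is false when none of these checks
// apply and derivation would continue with Verified, Executed, and Analyzed.
func derivePhasePrefix(conds map[string]Status) (phase string, matched bool) {
	if conds["EmergencyStopped"] == StatusTrue {
		return "EmergencyStopped", true // kill switch wins over everything
	}
	if conds["Escalated"] == StatusTrue {
		return "Escalated", true
	}
	if conds["Denied"] == StatusTrue {
		return "Denied", true
	}
	if s, ok := conds["Escalated"]; ok {
		if s == StatusUnknown {
			return "Escalating", true
		}
		return "Failed", true // Escalated exists but is neither True nor Unknown
	}
	return "", false
}

func main() {
	phase, _ := derivePhasePrefix(map[string]Status{
		"EmergencyStopped": StatusTrue,
		"Escalated":        StatusTrue,
		"Denied":           StatusTrue,
	})
	fmt.Println(phase) // EmergencyStopped
}
```

Checking `EmergencyStopped` first is what gives the kill switch the highest precedence; the relative order of the `Escalated` and `Denied` checks is unchanged from the previous rule.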