
feat(curtailment): operator read/update/admin APIs + audit + metrics #299

Open
rongxin-liu wants to merge 16 commits into main from feat/issue-289-curtailment-read-apis-audit-metrics


@rongxin-liu
Contributor

Summary

Closes the operator-facing surface and observability scaffolding for v1 curtailment. Builds on the lifecycle and dispatch work already on main (preview + start + dispatch + reconciler in #192, stop + staggered restore + max-duration enforcement in #232).

Operator read / update / admin

  • ListCurtailmentEvents — cursor-paginated history. The decision snapshot is trimmed in the list view so the response stays bounded on large fleet events: the per-device skipped array is replaced with an aggregate reason→count map, and per-target rows are intentionally omitted (consumers paginate over events here and fetch per-event detail separately).
  • UpdateCurtailmentEvent — operator-safe fields only: reason, restore_batch_size, restore_batch_interval_sec, max_duration_seconds. The store re-asserts the state predicate as defense-in-depth so a race between the pre-read and the UPDATE surfaces as a typed FailedPrecondition rather than silently no-op'ing. The proto validator already restricts the request shape; the service repeats key checks (oversized text, bad batch values) so non-Connect callers can't tunnel past it.
  • AdminTerminateEvent body. Forces a non-terminal event to CANCELLED or FAILED and sweeps every non-terminal target to RESTORE_FAILED in the same transaction. Idempotent re-issue against the same target state echoes the row; a different terminal state surfaces FailedPrecondition. The validator restricts target_state to those two; COMPLETED is rejected because the RPC fires when restore did not actually run.
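
The list-view trimming in the first bullet can be sketched roughly as below. This is only an illustration: `SkippedDevice` and `aggregateSkipped` are hypothetical names, and the real decision-snapshot types in this PR may be shaped differently.

```go
package main

import "fmt"

// SkippedDevice is a hypothetical stand-in for one per-device entry in the
// decision snapshot's skipped array.
type SkippedDevice struct {
	DeviceID string
	Reason   string
}

// aggregateSkipped collapses the per-device list into a reason -> count map,
// so the list response grows with the number of distinct reasons rather than
// with the number of skipped devices.
func aggregateSkipped(skipped []SkippedDevice) map[string]int {
	counts := make(map[string]int)
	for _, s := range skipped {
		counts[s.Reason]++
	}
	return counts
}

func main() {
	skipped := []SkippedDevice{
		{DeviceID: "m-1", Reason: "maintenance"},
		{DeviceID: "m-2", Reason: "offline"},
		{DeviceID: "m-3", Reason: "maintenance"},
	}
	agg := aggregateSkipped(skipped)
	fmt.Println("maintenance:", agg["maintenance"], "offline:", agg["offline"])
}
```

The per-device identifiers are dropped on purpose in this sketch; a consumer that needs them would fetch the per-event detail instead.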

Webhook ingestion idempotency

Pre-insert lookup at the persistence boundary on (org_id, idempotency_key) first, then (org_id, external_source, external_reference). A redelivery returns the original event without re-running selection. The race-loser path (two concurrent first-time Starts past the lookup) was tightened to fall into the same replay branch as a deliberate retry rather than surfacing Internal with the Postgres constraint name leaked in the error string.
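
A minimal sketch of the lookup order described above, using an in-memory store. `memStore`, `findReplay`, and `ErrNotFound` are hypothetical names for illustration; org scoping is left out for brevity, and requiring both external fields before the second lookup is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// Event is a hypothetical, trimmed-down curtailment event.
type Event struct {
	ID             int64
	IdempotencyKey string
	ExternalSource string
	ExternalRef    string
}

// StartRequest carries the two optional replay handles.
type StartRequest struct {
	IdempotencyKey string
	ExternalSource string
	ExternalRef    string
}

// ErrNotFound means "no prior event"; any other error is a real failure.
var ErrNotFound = errors.New("not found")

// memStore is an in-memory stand-in for the persistence layer.
type memStore struct{ events []Event }

func (s *memStore) byKey(key string) (*Event, error) {
	for i := range s.events {
		if s.events[i].IdempotencyKey == key {
			return &s.events[i], nil
		}
	}
	return nil, ErrNotFound
}

func (s *memStore) byExternalRef(source, ref string) (*Event, error) {
	for i := range s.events {
		if s.events[i].ExternalSource == source && s.events[i].ExternalRef == ref {
			return &s.events[i], nil
		}
	}
	return nil, ErrNotFound
}

// findReplay checks idempotency_key first, then the external pair.
// Errors other than ErrNotFound propagate so a transient failure is visible.
// A nil event with a nil error means the caller should run a fresh Start.
func findReplay(s *memStore, req StartRequest) (*Event, error) {
	if req.IdempotencyKey != "" {
		ev, err := s.byKey(req.IdempotencyKey)
		if err == nil {
			return ev, nil
		}
		if !errors.Is(err, ErrNotFound) {
			return nil, err
		}
	}
	if req.ExternalSource != "" && req.ExternalRef != "" {
		ev, err := s.byExternalRef(req.ExternalSource, req.ExternalRef)
		if err == nil {
			return ev, nil
		}
		if !errors.Is(err, ErrNotFound) {
			return nil, err
		}
	}
	return nil, nil
}

func main() {
	s := &memStore{events: []Event{
		{ID: 1, IdempotencyKey: "k1"},
		{ID: 2, ExternalSource: "grid", ExternalRef: "r-9"},
	}}
	ev, err := findReplay(s, StartRequest{IdempotencyKey: "k1", ExternalSource: "grid", ExternalRef: "r-9"})
	fmt.Println(ev.ID, err)
}
```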

Audit trail

Every successful Start emits a curtailment_started activity row. When allow_unbounded or force_include_maintenance is set, a typed row (curtailment_unbounded_start / curtailment_force_include_maintenance) emits alongside the base — two rows rather than one with a flag, so a feed of override-class starts is a simple event-type filter rather than a metadata scan. IncMaintenanceOverride fires in parallel so the override rate surfaces on the platform metrics dashboard without joining against activity_log. AdminTerminateEvent emits its own activity row capturing actor + reason. The audit ActorType now reflects source_actor_type (scheduler / user / api_key) rather than defaulting to user.
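
The two-rows-rather-than-a-flag choice can be illustrated with a small helper. The event-type strings come from the description above; `StartOptions` and `auditEventTypes` are hypothetical names, not the PR's actual helpers.

```go
package main

import "fmt"

// Event-type strings as named in the PR description.
const (
	EventStarted                 = "curtailment_started"
	EventUnboundedStart          = "curtailment_unbounded_start"
	EventForceIncludeMaintenance = "curtailment_force_include_maintenance"
)

// StartOptions is a hypothetical view of the override flags on a Start request.
type StartOptions struct {
	AllowUnbounded          bool
	ForceIncludeMaintenance bool
}

// auditEventTypes returns the activity rows one successful Start would emit:
// always the base row, plus one typed row per override that was set.
func auditEventTypes(opts StartOptions) []string {
	types := []string{EventStarted}
	if opts.AllowUnbounded {
		types = append(types, EventUnboundedStart)
	}
	if opts.ForceIncludeMaintenance {
		types = append(types, EventForceIncludeMaintenance)
	}
	return types
}

func main() {
	for _, t := range auditEventTypes(StartOptions{AllowUnbounded: true}) {
		fmt.Println(t)
	}
}
```

With this shape, a feed of override-class starts filters on the event type alone and never has to inspect row metadata.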

Metrics interface

A reconciler.Metrics interface inside the curtailment domain with tick-duration, tick-failure, candidate-exclusion (labeled by reason), and maintenance-override recorders. The default is a no-op; the concrete implementation wires at cmd/fleetd/main.go once the platform observability path lands. Interface shape is stable enough that the swap is a one-file change with no curtailment-package churn.

Heartbeat staleness runbook + alert template

The 5-minute staleness signal is canonically a SQL check against the curtailment_reconciler_heartbeat row, not an application metric — the runbook documents the SQL form. A vmalert rule template is parked behind a placeholder bridge metric (fleet_curtailment_reconciler_last_tick_age_seconds) so the wiring is one config change away once a postgres-exporter publishes the staleness gauge. The runbook walks four failure modes (panic loop, slow-query contention, events not picked up, restore loop) with operator response steps that lean on AdminTerminateEvent for the cases where infrastructure mitigation isn't enough.

Decisions taken

  • AdminTerminate handles operator double-clicks via "no-op if already in target state" rather than a per-event idempotency primitive. The proto's idempotency_key field on AdminTerminate is currently unused; the implement-or-remove call lands in the follow-up cycle.
  • Update is restricted to pending/active events; a restoring event rejects with FailedPrecondition. Equivalent to freezing effective_batch_size at restore start.
  • AdminTerminate runs the event transition and the target sweep in the same transaction. Reconciler ticks against a just-terminated event are an open follow-up — target updates want a state-guard so a late tick write can't revive a swept target.
  • Audit emission for admin overrides uses separate activity types alongside the base curtailment_started row, so a feed of override-class starts is a simple event-type filter.

Follow-up

A few items are intentionally not in this PR and shape the next iteration: an admin gate on Update.max_duration_seconds mirroring Start; decision_snapshot shape divergence across List vs detail responses (document or unify); an IncAuditEmitFailure signal on the Metrics interface so swallowed audit-write errors are observable; a (org_id, id) index migration to tighten cursor pagination on large histories. Service-level lifecycle tests cover the operator path end-to-end; a docker-driven HTTP-level E2E in server/e2e/ is queued for when the curtailment plugin surface is ready.

Test plan

  • Service-layer unit tests for ListCurtailmentEvents, UpdateCurtailmentEvent, AdminTerminateEvent covering happy path, state-machine guards, admin gating, race-loss handling.
  • Idempotency-replay tests for both channels (key, external-source/reference) including precedence ordering, partial-fields handling, and lookup error propagation. Race-loser path covered by the unique-violation sentinel routing in the store.
  • Audit-emission tests verify the base row + override-specific rows fire under expected conditions; lifecycle test pins Preview → Start → Stop → AdminTerminate persistence + emission.
  • Handler-level tests for each new RPC covering session resolution, role gates, malformed UUID rejection, and proto/service translation.
  • go build ./... clean; curtailment domain + handler test suites green; lint clean on the changed scope (pre-existing repo-wide lint debt unrelated to this branch).

Closes #289

Curtailment needs operational metrics — tick duration, tick failures,
selector candidate exclusions, maintenance overrides — but the codebase
has only OTel tracing today (no Meter, no /metrics, no Prometheus
exporter). The pipeline-shape decision is platform-team scope and
already in flight via the notifications + Grafana migration; curtailment
shouldn't make that decision unilaterally and shouldn't block on it.

Define a Metrics interface in the curtailment domain with a no-op
default. Service and Reconciler accept it through a functional option
so the dozens of existing test call sites (NewService(store), New(cfg,
store, cmd)) keep working unchanged. main.go wires NoOpMetrics through
both constructors so production has a single named site to swap when
the platform observability path lands — interface-stable, one-file
change, no curtailment-package churn.
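
A minimal sketch of the interface plus functional-option pattern described
above. The recorder names come from this PR; the method signatures, the
`WithMetrics` option name, and the `countingMetrics` fake are assumptions
made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Metrics lists the four recorders; the signatures are assumed.
type Metrics interface {
	ObserveTickDuration(d time.Duration)
	IncTickFailure()
	IncCandidateExcluded(reason string)
	IncMaintenanceOverride()
}

// NoOpMetrics is the default: every recorder does nothing.
type NoOpMetrics struct{}

func (NoOpMetrics) ObserveTickDuration(time.Duration) {}
func (NoOpMetrics) IncTickFailure()                   {}
func (NoOpMetrics) IncCandidateExcluded(string)       {}
func (NoOpMetrics) IncMaintenanceOverride()           {}

// Store is a placeholder for the persistence interface.
type Store interface{}

type Service struct {
	store   Store
	metrics Metrics
}

// Option mutates a Service during construction.
type Option func(*Service)

// WithMetrics swaps the no-op default for a real recorder.
func WithMetrics(m Metrics) Option {
	return func(s *Service) {
		if m != nil {
			s.metrics = m
		}
	}
}

// NewService keeps the one-argument call shape working for existing callers.
func NewService(store Store, opts ...Option) *Service {
	s := &Service{store: store, metrics: NoOpMetrics{}}
	for _, opt := range opts {
		opt(s)
	}
	return s
}

// countingMetrics is a tiny fake that overrides one recorder.
type countingMetrics struct {
	NoOpMetrics
	excluded map[string]int
}

func (c *countingMetrics) IncCandidateExcluded(reason string) { c.excluded[reason]++ }

func main() {
	plain := NewService(nil)
	plain.metrics.IncTickFailure() // safe: falls through to the no-op default

	fake := &countingMetrics{excluded: map[string]int{}}
	svc := NewService(nil, WithMetrics(fake))
	svc.metrics.IncCandidateExcluded("maintenance")
	fmt.Println(fake.excluded["maintenance"])
}
```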

Recorder call sites land in follow-up commits.

Refs #289

Three of the four Metrics recorders now have call sites:

- ObserveTickDuration fires from safeTick around runTick, capturing
  wall-clock per tick on every path (happy, panic-recovered,
  list-events-failure).
- IncTickFailure fires from safeTick on tick-infra panic AND from
  processEvent on per-event panic. The list-events early-return path
  is intentionally NOT counted because the heartbeat still advances
  there ("freshness, not query health" — see the comment in runTick).
- IncCandidateExcluded fires from Service.Start (not Preview) after the
  selector returns, once per skipped device labeled by reason. Start-
  only emission keeps debounced Preview calls from flooding the counter
  against a static fleet snapshot.
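
The safeTick behavior in the first two bullets can be sketched as below,
assuming a simplified recorder type (`tickRecorder` is hypothetical) in
place of the full Metrics interface.

```go
package main

import (
	"fmt"
	"time"
)

// tickRecorder is a hypothetical fake that counts recorder calls.
type tickRecorder struct {
	durations int
	failures  int
}

func (r *tickRecorder) ObserveTickDuration(time.Duration) { r.durations++ }
func (r *tickRecorder) IncTickFailure()                   { r.failures++ }

// safeTick runs one tick, records its wall-clock duration on every path, and
// turns a panic into a failure count instead of crashing the loop.
func safeTick(rec *tickRecorder, runTick func()) {
	start := time.Now()
	defer func() {
		if p := recover(); p != nil {
			rec.IncTickFailure()
		}
		rec.ObserveTickDuration(time.Since(start))
	}()
	runTick()
}

func main() {
	rec := &tickRecorder{}
	safeTick(rec, func() {})                // happy path
	safeTick(rec, func() { panic("boom") }) // panic is recovered and counted
	fmt.Println("durations:", rec.durations, "failures:", rec.failures)
}
```

Because the duration is observed inside the deferred function, an early
return from the tick is measured the same way as a completed one.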

IncMaintenanceOverride is intentionally deferred. The per-miner
increment needs the selector to surface "this miner was kept because
the maintenance override was honored" — current candidate filtering
just lets the miner fall through without tagging. That instrumentation
lands in a follow-up commit alongside the audit-sweep work where
`curtailment_maintenance_override` activity rows are emitted on the
same code path.

Tests add a goroutine-safe recordingMetrics fake in both the
reconciler and service test files. Three reconciler tests pin
ObserveTickDuration on the happy path, IncTickFailure on tick-infra
panic, and IncTickFailure on per-event panic. One service test pins
IncCandidateExcluded on a phantom-load miner.

Refs #289

Operator-facing event history was previously Unimplemented and the
settings-page history table (PR #280) was reading fixtures. This wires
the RPC through every layer with a trimmed decision-snapshot policy
that keeps response sizes bounded on large fleets.

- sqlc: ListCurtailmentEventsForOrg, cursor-paginated by id DESC with
  an optional state filter. Caller passes limit+1 so the over-fetch
  detects whether another page remains.
- Store: opaque cursor (base64-encoded JSON) so the token shape is
  free to grow later (sort fields, secondary keys) without breaking
  older clients. PageSize <=0 maps to a 50-row default; an internal
  upper cap of 200 mirrors the proto validator as defense in depth.
- Service: ListEvents validates org and rejects negative page_size,
  then forwards to the store. Service-layer guard is needed because
  cross-tenant exposure is one query away.
- Handler: replaces the Unimplemented stub. Session-based org-id
  resolution, proto enum → service-layer state-filter mapping.
- Translate: list-view event proto omits per-target rows (heavy on
  10K-miner events × N pages) and trims the per-device `skipped`
  array to `skipped_aggregate` reason-count map. Top-K selected and
  the summary fields stay intact so dashboards can render exclusion
  trend lines.
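
The page-size normalization and limit+1 over-fetch can be sketched against
an in-memory slice. The 50 and 200 values come from the text above;
`normalizePageSize` and `listPage` are hypothetical names, and a real
implementation would push the filter and limit into SQL.

```go
package main

import "fmt"

const (
	defaultPageSize = 50
	maxPageSize     = 200
)

// normalizePageSize maps non-positive sizes to the default and caps large ones.
func normalizePageSize(n int) int {
	if n <= 0 {
		return defaultPageSize
	}
	if n > maxPageSize {
		return maxPageSize
	}
	return n
}

// listPage returns up to pageSize ids strictly below afterID (0 means first
// page). ids must be sorted descending. It reads one extra row to learn
// whether another page remains, then drops that row from the result.
func listPage(ids []int64, afterID int64, pageSize int) (page []int64, nextAfter int64, hasMore bool) {
	limit := normalizePageSize(pageSize)
	fetched := make([]int64, 0, limit+1)
	for _, id := range ids {
		if afterID > 0 && id >= afterID {
			continue
		}
		fetched = append(fetched, id)
		if len(fetched) == limit+1 {
			break
		}
	}
	if len(fetched) > limit {
		hasMore = true
		fetched = fetched[:limit]
		nextAfter = fetched[len(fetched)-1]
	}
	return fetched, nextAfter, hasMore
}

func main() {
	ids := []int64{5, 4, 3, 2, 1}
	page, next, more := listPage(ids, 0, 2)
	fmt.Println(page, next, more)
}
```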

Test fakes in three packages gain ListEvents stubs; the
curtailment-package fakeStore gains a working pagination impl mirroring
the SQL semantics so service-level tests can assert cursor round-trips.

Refs #289

Operator-safe partial update of a non-terminal event. Replaces the
Unimplemented stub on the handler.

State policy: pending and active accept the patch; restoring and
terminal states reject with FailedPrecondition. Operators who need to
intervene mid-restore go through AdminTerminateEvent — that's the
recovery surface, not Update. The conservative policy keeps the
recompute-vs-freeze question (Open #13) out of v1: Update of
restore_batch_size persists the new value but does NOT recompute
effective_batch_size. The reconciler's restore-claim reads the
Start-time stamped value through to the next event.

Validation mirrors Start: restore_batch_interval_sec is gated by the
non-admin cap (admin sets the session-derived bypass), max_duration
must be > 0 and <= 7 days, restore_batch_size >= 0. Misconfigured
values surface as InvalidArgument or Forbidden — never as a DB CHECK
violation.

sqlc UPDATE uses COALESCE on nil params so a partial patch preserves
unset columns. The WHERE clause re-asserts state IN ('pending',
'active') as defense in depth: a race where the row advanced between
the service's pre-read and the UPDATE surfaces as
ErrCurtailmentUpdateStateRaceLoss → FailedPrecondition with a distinct
message from the pre-read rejection.
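
An in-memory sketch of the COALESCE-style partial patch with the state
guard. The sentinel name comes from this commit; its definition here, the
`EventRow` and `UpdatePatch` types, and `applyPatch` are stand-ins for
illustration, not the store's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// EventRow is a hypothetical subset of the event columns.
type EventRow struct {
	State              string
	Reason             string
	RestoreBatchSize   int
	MaxDurationSeconds int
}

// UpdatePatch uses nil pointers for "leave this column alone".
type UpdatePatch struct {
	Reason             *string
	RestoreBatchSize   *int
	MaxDurationSeconds *int
}

// Stand-in definition of the sentinel named in the commit message.
var ErrCurtailmentUpdateStateRaceLoss = errors.New("event left pending/active before the update applied")

// applyPatch mimics an UPDATE that sets each column to COALESCE(param, column)
// and matches only rows whose state is still pending or active.
func applyPatch(row *EventRow, p UpdatePatch) error {
	if row.State != "pending" && row.State != "active" {
		return ErrCurtailmentUpdateStateRaceLoss // zero rows matched
	}
	if p.Reason != nil {
		row.Reason = *p.Reason
	}
	if p.RestoreBatchSize != nil {
		row.RestoreBatchSize = *p.RestoreBatchSize
	}
	if p.MaxDurationSeconds != nil {
		row.MaxDurationSeconds = *p.MaxDurationSeconds
	}
	return nil
}

func main() {
	row := &EventRow{State: "active", Reason: "grid request", RestoreBatchSize: 10, MaxDurationSeconds: 3600}
	size := 25
	err := applyPatch(row, UpdatePatch{RestoreBatchSize: &size})
	fmt.Println(row.Reason, row.RestoreBatchSize, err)
}
```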

Refs #289

Adds the admin-only escape hatch for forcing a non-terminal event to
CANCELLED or FAILED when a normal stop+restore cycle can't run.

The persistence layer wraps the terminal state transition and the
swept-target update in a single transaction via db.WithTransaction so
the event row and its targets stay in sync. An idempotent re-issue
with the same target_state is a no-op; a different terminal state
surfaces ErrCurtailmentAdminTerminateStateConflict, which the service
maps to FailedPrecondition.
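
The transition rules can be sketched in memory as below. The sentinel name
comes from this commit; its definition here, the `Event` type, and the
assumption that `restored` and `restore_failed` are the terminal target
states are illustrative only. A single function call stands in for the
database transaction.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	StateActive    State = "active"
	StateCompleted State = "completed"
	StateCancelled State = "cancelled"
	StateFailed    State = "failed"
)

// Stand-in definitions; the real sentinels live in the store interfaces.
var (
	ErrCurtailmentAdminTerminateStateConflict = errors.New("event already in a different terminal state")
	ErrInvalidTargetState                     = errors.New("target_state must be cancelled or failed")
)

// Event is a hypothetical event with its per-target states.
type Event struct {
	State        State
	TargetStates []string
}

func isTerminal(s State) bool {
	return s == StateCompleted || s == StateCancelled || s == StateFailed
}

// adminTerminate forces the event to target and sweeps non-terminal targets
// to restore_failed. It reports whether anything changed.
func adminTerminate(ev *Event, target State) (bool, error) {
	if target != StateCancelled && target != StateFailed {
		return false, ErrInvalidTargetState
	}
	if isTerminal(ev.State) {
		if ev.State == target {
			return false, nil // idempotent re-issue: echo the row
		}
		return false, ErrCurtailmentAdminTerminateStateConflict
	}
	ev.State = target
	for i, ts := range ev.TargetStates {
		if ts != "restored" && ts != "restore_failed" { // assumed terminal set
			ev.TargetStates[i] = "restore_failed"
		}
	}
	return true, nil
}

func main() {
	ev := &Event{State: StateActive, TargetStates: []string{"curtailed", "restored"}}
	changed, err := adminTerminate(ev, StateCancelled)
	fmt.Println(changed, err, ev.State, ev.TargetStates)
}
```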

Service-layer defense-in-depth checks (target_state in {CANCELLED,
FAILED}, non-empty reason, org/uuid present) mirror the proto
validator so non-Connect callers can't tunnel past it.

Adds a pre-insert lookup so a re-issued Start with the same
idempotency_key or (external_source, external_reference) pair returns
the original event instead of re-running the selector and tripping
the partial unique indexes (which would surface as a less helpful
AlreadyExists from the per-org non-terminal constraint).

idempotency_key takes precedence over external reference: the
operator-supplied retry handle wins over upstream re-delivery.
Lookup errors propagate unchanged so a transient db failure is
visible rather than silently falling through to a double-insert
attempt.

Adds an AuditLogger interface on the curtailment Service with a no-op
default so tests that don't care can ignore the wiring. main.go injects
*activity.Service via WithAuditLogger. Two override-specific event types
ride alongside the base curtailment_started row so a feed of unbounded
or force-include starts is a simple event-type filter rather than a
metadata scan.

IncMaintenanceOverride fires in parallel when force_include_maintenance
is set, so the platform metrics dashboard tracks the override rate
without joining against activity_log.

Audit emission is intentionally absent on the idempotent-replay and
insufficient-load paths: the original Start already recorded the trail,
and a path that never persisted shouldn't claim it did.

Documents the curtailment_reconciler_heartbeat-based liveness signal:
warn at 2 minutes of staleness with active events present, page at 5
minutes regardless. The SQL form is canonical; the vmalert rule is
parked behind a placeholder bridge metric so the wiring is one config
change away once a postgres-exporter publishes the staleness gauge.

Runbook walks four failure modes (panic loop, slow-query contention,
events not picked up, restore-loop forever) with operator response
steps that lean on AdminTerminateEvent for the cases where infra
mitigation isn't enough.

Walks the operator-facing service flow end-to-end against the
in-memory fake: Preview (no persistence side-effects) → Start
(persistence + audit + metrics) → Stop (RESTORING transition) →
AdminTerminate (forced terminal). The reconciler's tick-by-tick
state machine is covered piecewise in reconciler_test.go and
restore_test.go; this test pins the boundary between the public
service API and the persistence layer.

Companion tests cover the webhook idempotency-replay path
(duplicate Start short-circuits, no double-audit) and the
read-path query (ListEvents returns terminal rows filtered by
state).

A docker-driven HTTP-level e2e for the same lifecycle is a
follow-up — the existing server/e2e dir requires postgres +
proto-sim and lands when the curtailment plugin path is ready.

Four real lint findings from this branch, fixed without suppressions:

- Service.AdminTerminate: replace the two-case switch + default with
  an if-comparison so exhaustive doesn't demand the unhandled cases
  be enumerated. The default branch was load-bearing — the if form
  keeps the same behavior with less surface.

- service_list_test.go / handler_list_test.go: hoist the opaque
  cursor literal into a file-scope const. gosec G101 looks at string
  literals assigned to fields whose name matches credential keywords
  (PageToken matches "token"); an identifier reference clears the
  heuristic cleanly.

- service_start_idempotency_test.go: move the subtest store + svc
  creation inside the t.Run closure so each subtest can call
  t.Parallel() without sharing mutable counters across cases.

A multi-reviewer code review surfaced four merge-blocking P1s and a
batch of P2/P3 hygiene items on this branch. This commit lands the
focused, defensible subset that doesn't require contract or
security-policy decisions; the remaining items are recorded for the
next session.

Idempotency / race-recovery:
- Recognize uq_curtailment_event_idempotency and
  uq_curtailment_event_external_ref unique violations as race-loss
  via typed sentinels; Service.Start re-issues the corresponding
  replay lookup so the race-loser falls into the same response path
  as a deliberate retry rather than surfacing Internal with the
  constraint name leaked in the error string.
- AdminTerminateEvent: on zero-row UPDATE caused by a concurrent
  terminate-to-same-state, re-read inside the transaction and echo
  the row idempotently (mirrors BeginRestoreTransition's pattern).

Audit / observability:
- Emit a curtailment_admin_terminated activity row on AdminTerminate
  so the privileged force-terminate path captures actor + reason in
  the activity feed (parallels emitStartAuditTrail).
- emitStartAuditTrail now maps req.SourceActorType to
  activitymodels.ActorType so scheduler-triggered starts persist
  actor_type='scheduler' on activity_log instead of defaulting to
  'user'.

Update path hardening:
- Reject explicit empty-string Reason as InvalidArgument and add a
  256-char length cap mirroring Start. The proto-translate comment
  is updated to describe the actual silent-no-op behavior.

Proto contract docs:
- Field-level docstrings on ListCurtailmentEvents describing the
  omitted target_rollup/targets and the trimmed decision_snapshot
  shape (skipped_aggregate vs raw skipped).
- max_len=1024 on ListCurtailmentEventsRequest.page_token so the
  cursor decode path is bounded.
- Annotate the two eventStateFromProto call sites distinguishing
  the no-filter sentinel role from the target_state mapping role.

Cleanup:
- Drop duplicate finitePtr generic in handler_start_test.go (the
  existing ptr generic in handler_test.go covers the use case).
- Inline single-call-site valueOrZero generic in service.go.
@rongxin-liu rongxin-liu requested a review from a team as a code owner May 21, 2026 23:13
Copilot AI review requested due to automatic review settings May 21, 2026 23:13
@github-actions github-actions Bot added the documentation, automation, server, and shared labels May 21, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR completes the operator-facing curtailment management surface (list/update/admin terminate) and adds observability scaffolding (audit events, metrics interfaces, heartbeat runbook/alert template) to support v1 curtailment operations end-to-end in the server.

Changes:

  • Add operator RPCs for listing historical curtailment events (cursor pagination) and updating operator-safe fields, plus an admin RPC to force-terminate an event and sweep targets.
  • Add webhook-style Start idempotency lookups (idempotency key + external source/reference) with race-loser handling, plus audit + metrics interfaces wired through the service and reconciler.
  • Add reconciler heartbeat runbook + placeholder vmalert rules for stalled reconciler/tick failures.

Reviewed changes

Copilot reviewed 28 out of 31 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| server/sqlc/queries/curtailment.sql | Adds SQLC queries for idempotency lookups, operator-field update, admin terminate + target sweep, and org event history listing. |
| server/internal/handlers/curtailment/translate.go | Adds request/response translators for AdminTerminate/Update/List; trims decision snapshot for list view; adds proto↔model event-state mapping helper. |
| server/internal/handlers/curtailment/handler.go | Implements UpdateCurtailmentEvent, ListCurtailmentEvents, and AdminTerminateEvent handlers with session/admin gating. |
| server/internal/handlers/curtailment/handler_update_test.go | Handler tests for UpdateCurtailmentEvent auth, validation, and admin gating behavior. |
| server/internal/handlers/curtailment/handler_stop_test.go | Updates Stop handler test stub to satisfy expanded store interface. |
| server/internal/handlers/curtailment/handler_start_test.go | Updates Start handler test stub for new store methods and adjusts optional pointer helper usage. |
| server/internal/handlers/curtailment/handler_list_test.go | Handler tests for ListCurtailmentEvents pagination/filtering and decision-snapshot trimming behavior. |
| server/internal/handlers/curtailment/handler_admin_terminate_test.go | Handler tests for AdminTerminateEvent admin gating, UUID validation, and state-conflict mapping. |
| server/internal/domain/stores/sqlstores/curtailment.go | Implements SQL store methods for idempotency lookups, ListEvents pagination, operator-field update, and AdminTerminateEvent transaction. |
| server/internal/domain/stores/sqlstores/curtailment_cursor.go | Adds base64+JSON cursor encode/decode helpers for ListEvents pagination. |
| server/internal/domain/stores/interfaces/curtailment.go | Extends CurtailmentStore interface with list/update/admin-terminate/idempotency methods and new typed error sentinels. |
| server/internal/domain/curtailment/service.go | Adds metrics/audit plumbing, Start replay lookups + race handling, ListEvents/Update/AdminTerminate service methods, and audit emission helpers. |
| server/internal/domain/curtailment/service_update_test.go | Unit tests for Update service method validation/state-guard/race-loss behavior. |
| server/internal/domain/curtailment/service_test.go | Expands fake store to support new store methods; adds a metrics recorder test helper. |
| server/internal/domain/curtailment/service_start_test.go | Adds Start metrics test for candidate-exclusion counters. |
| server/internal/domain/curtailment/service_start_idempotency_test.go | Adds Start idempotency replay + precedence + error-path tests. |
| server/internal/domain/curtailment/service_start_audit_test.go | Adds Start audit emission tests (base row + override-specific rows + replay suppression). |
| server/internal/domain/curtailment/service_list_test.go | Adds ListEvents service tests for forwarding/validation and store error propagation. |
| server/internal/domain/curtailment/service_lifecycle_test.go | Adds service-layer end-to-end lifecycle test across Preview→Start→Stop→AdminTerminate and replay/list behavior. |
| server/internal/domain/curtailment/service_admin_terminate_test.go | Adds AdminTerminate service tests for validation and conflict/error mapping. |
| server/internal/domain/curtailment/reconciler/reconciler.go | Adds metrics injection and records tick duration/failure counters on panic paths. |
| server/internal/domain/curtailment/reconciler/reconciler_test.go | Adds reconciler tests asserting tick duration/failure metric emission. |
| server/internal/domain/curtailment/metrics.go | Introduces curtailment.Metrics interface + NoOpMetrics implementation. |
| server/internal/domain/curtailment/audit.go | Introduces curtailment.AuditLogger interface + NoOpAuditLogger and curtailment audit event-type constants. |
| server/generated/sqlc/db.go | Regenerated SQLC prepared-statement wiring for new curtailment queries. |
| server/generated/sqlc/curtailment.sql.go | Regenerated SQLC query implementations/types for new curtailment queries. |
| server/docs/curtailment-reconciler-runbook.md | Adds heartbeat staleness runbook, SQL alert query, and failure-mode triage guidance. |
| server/cmd/fleetd/main.go | Wires NoOpMetrics + audit logger into curtailment Service and passes metrics into reconciler. |
| proto/curtailment/v1/curtailment.proto | Documents list-response trimming and adds page_token max length validation. |
| deployment-files/server/monitoring/vmalert/rules.d/proto-fleet-curtailment.yml | Adds placeholder vmalert rules for stalled reconciler and tick failure rate using bridge metrics. |

Comment thread server/internal/domain/stores/sqlstores/curtailment_cursor.go
Comment thread server/internal/domain/curtailment/service.go

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62a996a1f5

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread server/internal/domain/curtailment/service.go
@github-actions

github-actions Bot commented May 21, 2026

🔐 Codex Security Review

Note: This is an automated security-focused code review generated by Codex.
It should be used as a supplementary check alongside human review.
False positives are possible - use your judgment.

Scope summary

  • Reviewed pull request diff only (2bf90452ecfb8c3c94213ea54d28aa42bd220a2f...07c23ad28f495369c0ce8351ef469de00f3d379c, exact PR three-dot diff)
  • Model: gpt-5.5



Review Summary

Overall Risk: MEDIUM

Findings

[MEDIUM] AdminTerminate can abandon active curtailments without restore

  • Category: Reliability
  • Location: server/sqlc/queries/curtailment.sql:182
  • Description: AdminTerminateCurtailmentEvent allows terminalizing events in pending, active, or restoring states, and SweepCurtailmentTargetsToRestoreFailed then marks all non-terminal targets as restore_failed without issuing restore commands. For an active event, those miners may already be curtailed; after the event is terminal, the reconciler will no longer drive restoration.
  • Impact: An admin recovery action can leave miners curtailed indefinitely, reducing hashrate/revenue and making the system’s state disagree with device state.
  • Recommendation: Restrict AdminTerminateEvent to safe states/target states, or require active events to go through StopCurtailment first. If force termination must support active events, enqueue/verify restore or make the response and persisted state explicitly represent “manual restore required”.

[LOW] Removed protobuf field is not reserved

  • Category: Protobuf
  • Location: proto/curtailment/v1/curtailment.proto:561
  • Description: The diff removes AdminTerminateEventRequest.idempotency_key = 4 but does not reserve tag 4 or the field name. This repo already reserves removed fields elsewhere to avoid old-client payloads aliasing into future fields.
  • Impact: A future field using tag 4 could silently reinterpret old clients’ idempotency keys, creating wire-format compatibility bugs.
  • Recommendation: Add reserved 4; and reserved "idempotency_key"; to AdminTerminateEventRequest, or reintroduce the field if retry idempotency is still needed.

Notes

I did not find evidence in the reviewed diff of SQL injection, shell command injection, pool/wallet hijacking, exposed infrastructure secrets, or new frontend XSS/token exposure.


Generated by Codex Security Review |
Triggered by: @rongxin-liu |
Review workflow run

@github-actions github-actions Bot added the javascript and client labels May 21, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f3aad80aa6


Comment on lines +531 to +533
TargetState: eventStateFromProto(msg.GetTargetState()),
Reason: msg.GetReason(),
}, nil

[P2] Honor AdminTerminate idempotency key contract

AdminTerminateEventRequest documents idempotency_key as a replay key, but this mapping only forwards event_uuid, target_state, and reason, so the key is discarded before the service/store layer. As a result, retries with the same key are not deduplicated by key (for example, a replay with the same key but different payload can hit state conflicts instead of returning the original response), which violates the RPC’s stated idempotency behavior.


Comment on lines +387 to +390
if err := q.SweepCurtailmentTargetsToRestoreFailed(ctx, sqlc.SweepCurtailmentTargetsToRestoreFailedParams{
CurtailmentEventID: current.ID,
LastError: reason,
}); err != nil {

[P1] Prevent reconciler from reviving admin-terminated targets

This sweep marks non-terminal targets as restore_failed, but it does not protect against an in-flight reconciler tick that already read the same event/targets and later calls UpdateTargetState without an event-state guard. In that race, the tick can overwrite these terminalized rows back to non-terminal states after admin termination commits, undermining the operator’s force-terminate action and allowing the event’s targets to re-enter reconciliation work.


Codex security review + Copilot inline reviewers surfaced five
actionable findings on this branch. All five validated; landing the
fixes here.

Admin gate on Update.max_duration_seconds (HIGH).
  Mirrors Start's post-normalization admin check inside Service.Update.
  Without this, a non-admin who Started at the org default could Update
  the same event above the default, bypassing the privilege boundary
  Start enforces. Fetches org config lazily — only when max_duration is
  in the patch and the caller lacks admin controls.
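
  A rough sketch of that gate, assuming the 7-day cap mentioned for Update
  validation. `gateMaxDuration` and the two error values are hypothetical
  stand-ins for the service's InvalidArgument and Forbidden mappings.

```go
package main

import (
	"errors"
	"fmt"
)

const maxDurationCapSeconds = 7 * 24 * 60 * 60 // 7 days

// Hypothetical stand-ins for the service's typed errors.
var (
	ErrInvalidArgument = errors.New("max_duration_seconds out of range")
	ErrForbidden       = errors.New("max_duration_seconds above org default requires admin")
)

// gateMaxDuration validates an optional max_duration patch. The org default
// is loaded lazily: only when the field is present and the caller is not an
// admin.
func gateMaxDuration(patch *int, isAdmin bool, loadOrgDefault func() (int, error)) error {
	if patch == nil {
		return nil // field not in the patch: nothing to check
	}
	if *patch <= 0 || *patch > maxDurationCapSeconds {
		return ErrInvalidArgument
	}
	if isAdmin {
		return nil
	}
	orgDefault, err := loadOrgDefault()
	if err != nil {
		return err
	}
	if *patch > orgDefault {
		return ErrForbidden
	}
	return nil
}

func main() {
	load := func() (int, error) { return 3600, nil }
	requested := 7200
	fmt.Println(gateMaxDuration(&requested, false, load))
	fmt.Println(gateMaxDuration(&requested, true, load))
}
```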

AdminTerminate.reason length cap (MEDIUM).
  Service-level backstop rejects oversized reasons (>256 chars) so a
  bulky operator string can't amplify into every swept target's
  last_error column. The proto field gets the matching max_len=256
  rule; proto regen is deferred to a clean tooling pass (the service
  backstop already catches the case today).

List query trims decision_snapshot at the SQL boundary (MEDIUM).
  ListCurtailmentEventsForOrg now projects explicit columns with
  (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip
  list (multi-MB on 10K-miner events) doesn't ride the wire for every
  list row. Field layout matches CurtailmentEvent exactly so the
  existing convertEventRow path applies via a single struct
  conversion.

Cursor rejects non-positive IDs (MEDIUM).
  decodeCurtailmentEventCursor now returns InvalidArgument when the
  decoded id is <= 0. The store never emits a non-positive id; a
  user-supplied token that decodes to one would silently rewind to the
  first page (id=0) or return zero rows (id<0).
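
  A minimal sketch of such a cursor codec. The base64 variant, the JSON
  field name, and the names `encodeCursor`, `decodeCursor`, and
  `ErrInvalidCursor` are assumptions; the PR's decodeCurtailmentEventCursor
  may differ in all three.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"errors"
	"fmt"
)

// eventCursor is the assumed JSON payload inside the opaque token.
type eventCursor struct {
	ID int64 `json:"id"`
}

// ErrInvalidCursor stands in for the InvalidArgument mapping.
var ErrInvalidCursor = errors.New("invalid page token")

// encodeCursor wraps the last-seen id in JSON, then base64.
func encodeCursor(id int64) string {
	raw, _ := json.Marshal(eventCursor{ID: id}) // cannot fail for this struct
	return base64.URLEncoding.EncodeToString(raw)
}

// decodeCursor reverses encodeCursor and rejects ids <= 0, which the store
// never emits and which would otherwise rewind or empty the listing.
func decodeCursor(token string) (int64, error) {
	raw, err := base64.URLEncoding.DecodeString(token)
	if err != nil {
		return 0, ErrInvalidCursor
	}
	var c eventCursor
	if err := json.Unmarshal(raw, &c); err != nil {
		return 0, ErrInvalidCursor
	}
	if c.ID <= 0 { // also covers a payload with no id field
		return 0, ErrInvalidCursor
	}
	return c.ID, nil
}

func main() {
	token := encodeCursor(42)
	id, err := decodeCursor(token)
	fmt.Println(id, err)

	_, err = decodeCursor(encodeCursor(0))
	fmt.Println(err)
}
```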

Audit metadata key naming (MEDIUM).
  Renamed `force_include` to `force_include_maintenance` on the
  curtailment_started audit row metadata so the key matches the
  domain/proto field name. Downstream analytics no longer have to map
  between abbreviated and full names.

Test coverage added for each fix: non-admin max_duration rejection,
admin pass-through, oversized reason rejection, cursor non-positive
id rejection (zero / negative / missing).
@rongxin-liu
Contributor Author

All three findings addressed in 35594cd:

  • HIGH — Update.max_duration_seconds admin gate: Service.Update now fetches OrgConfig and applies the same post-normalization gate as Service.Start. Coverage in TestService_Update_{RejectsNonAdmin,AllowsAdmin}MaxDurationAboveOrgDefault.
  • MEDIUM — list query loads full snapshots: ListCurtailmentEventsForOrg projects (decision_snapshot_jsonb - 'skipped')::JSONB so the per-device skip list (multi-MB on 10K-miner events) doesn't ride the wire for every list row.
  • MEDIUM — AdminTerminate.reason unbounded: added max_len=256 on the proto field and a service-level length backstop. Coverage in TestService_AdminTerminate_RejectsOversizedReason.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 51e44672d8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +588 to +590
plan := &Plan{
	EventUUID:                        &eventUUID,
	EffectiveRestoreBatchIntervalSec: event.RestoreBatchIntervalSec,

P1: Return persisted Start payload on idempotent replay

When idempotency replay hits, this helper rebuilds a Plan with only UUID and timing fields and leaves Selected/Skipped empty. The Start handler still renders the response from Plan plus the incoming request, so a retry can return an event with zero targets/rollup and request-echoed metadata that does not match the originally persisted event. This breaks the documented replay contract (same key returns original event) for timeout/retry scenarios.
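
One way to picture the replay contract is the sketch below, which is not the PR's Start handler. It uses a hypothetical in-memory `Service` keyed by idempotency key; `Event`, `StartRequest`, and the `selected` parameter are illustrative names only. The point is that a replay returns the event as persisted, ignoring whatever the retry request carries:

```go
package main

import "fmt"

// Event is a hypothetical persisted event with its original rollup.
type Event struct {
	UUID        string
	Reason      string
	TargetCount int
}

// StartRequest is a hypothetical request shape.
type StartRequest struct {
	IdempotencyKey string
	Reason         string
}

// Service keeps persisted events by idempotency key (in-memory stand-in for the store).
type Service struct {
	byKey  map[string]Event
	nextID int
}

// Start persists a new event on first use of a key. On replay it returns the
// persisted event unchanged instead of rebuilding one from the incoming request.
func (s *Service) Start(req StartRequest, selected int) Event {
	if ev, ok := s.byKey[req.IdempotencyKey]; ok {
		return ev // replay: original payload, not request-echoed fields
	}
	s.nextID++
	ev := Event{UUID: fmt.Sprintf("evt-%d", s.nextID), Reason: req.Reason, TargetCount: selected}
	s.byKey[req.IdempotencyKey] = ev
	return ev
}

func main() {
	svc := &Service{byKey: map[string]Event{}}
	first := svc.Start(StartRequest{IdempotencyKey: "k1", Reason: "peak load"}, 3)
	replay := svc.Start(StartRequest{IdempotencyKey: "k1", Reason: "different text"}, 0)
	fmt.Println(first == replay) // true
}
```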

Comment thread server/sqlc/queries/curtailment.sql Outdated
restore_batch_size, restore_batch_interval_sec, effective_batch_size,
min_curtailed_duration_sec, max_duration_seconds, allow_unbounded,
include_maintenance, force_include_maintenance,
(decision_snapshot_jsonb - 'skipped')::JSONB AS decision_snapshot_jsonb,

P2: Keep rejection aggregates in List decision snapshots

This projection removes skipped before the handler sees the snapshot, but the handler computes skipped_aggregate only when skipped is present. In production list responses, that means both detailed skips and the aggregate map are absent, despite the API docs/comments describing an aggregated reason-count view. Clients lose rejection-reason visibility on paginated history.
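
A minimal sketch of computing the aggregate before the detailed list is dropped, so the two never disappear together. The snapshot field names (`skipped`, `reason`, `skipped_aggregate`) and the function name `trimSnapshot` are assumptions based on the descriptions in this thread, not the actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// trimSnapshot replaces the per-device "skipped" array with a reason-to-count
// map under "skipped_aggregate". Snapshots without "skipped" pass through as-is.
func trimSnapshot(raw []byte) ([]byte, error) {
	var snap map[string]json.RawMessage
	if err := json.Unmarshal(raw, &snap); err != nil {
		return nil, fmt.Errorf("decode snapshot: %w", err)
	}
	skippedRaw, ok := snap["skipped"]
	if !ok {
		return raw, nil
	}
	var skipped []struct {
		Reason string `json:"reason"`
	}
	if err := json.Unmarshal(skippedRaw, &skipped); err != nil {
		return nil, fmt.Errorf("decode skipped: %w", err)
	}
	counts := map[string]int{}
	for _, s := range skipped {
		counts[s.Reason]++
	}
	agg, err := json.Marshal(counts)
	if err != nil {
		return nil, fmt.Errorf("encode aggregate: %w", err)
	}
	delete(snap, "skipped")
	snap["skipped_aggregate"] = agg
	return json.Marshal(snap)
}

func main() {
	in := []byte(`{"selected_count":2,"skipped":[{"device_id":"a","reason":"offline"},{"device_id":"b","reason":"maintenance"},{"device_id":"c","reason":"offline"}]}`)
	out, err := trimSnapshot(in)
	fmt.Println(string(out), err)
}
```

Doing the aggregation in Go means the query still has to load the full list; an alternative under the same assumption would be to compute the aggregate in SQL alongside the trimmed projection.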

Harden curtailment replay, list, and reconciler behavior so the new read/admin APIs keep their documented contracts under retries and races.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07c23ad28f

Comment on lines +378 to +382
AND EXISTS (
    SELECT 1
    FROM curtailment_event
    WHERE curtailment_event.id = sqlc.arg('curtailment_event_id')
      AND curtailment_event.state IN ('pending', 'active', 'restoring')

P1: Block command dispatch after admin termination

The new EXISTS guard only prevents UPDATE curtailment_target writes once an event becomes terminal, but it does not stop an in-flight reconciler tick from issuing cmd.Curtail/cmd.Uncurtail first (see dispatchOneCurtail and dispatchRestoreBatch, which dispatch before persisting target state). In the race where AdminTerminateEvent commits between event load and dispatch, the write is skipped but the miner command is still sent, so a force-terminated event can continue affecting devices after the operator action. Add an event-state/liveness check before dispatch (or make the update return affected-row count and abort on zero) to preserve admin-terminate semantics.
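
A minimal sketch of the first suggestion, re-checking event liveness immediately before dispatch. `dispatchIfLive`, the `loadState` and `send` callbacks, and the `EventState` constants are hypothetical; the live states mirror the ones in the SQL guard above. A check-then-send like this narrows the race window rather than closing it unless the state read and the dispatch are serialized (for example under a row lock):

```go
package main

import "fmt"

// EventState mirrors the state strings used in the SQL guard.
type EventState string

const (
	StatePending   EventState = "pending"
	StateActive    EventState = "active"
	StateRestoring EventState = "restoring"
	StateCancelled EventState = "cancelled"
)

// isLive reports whether the event may still drive device commands.
func isLive(s EventState) bool {
	switch s {
	case StatePending, StateActive, StateRestoring:
		return true
	}
	return false
}

// dispatchIfLive reloads the event state and sends the command only when the
// event is still non-terminal. It returns whether the command was sent.
func dispatchIfLive(loadState func() (EventState, error), send func() error) (bool, error) {
	state, err := loadState()
	if err != nil {
		return false, fmt.Errorf("reload event state: %w", err)
	}
	if !isLive(state) {
		return false, nil // event was terminated after the tick loaded it
	}
	if err := send(); err != nil {
		return false, fmt.Errorf("dispatch: %w", err)
	}
	return true, nil
}

func main() {
	sent := 0
	send := func() error { sent++; return nil }
	terminated := func() (EventState, error) { return StateCancelled, nil }
	ok, err := dispatchIfLive(terminated, send)
	fmt.Println(ok, err, sent) // false <nil> 0
}
```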

Prevent stale reconciler ticks from dispatching commands after event state changes and require active admin-terminated events to stop first so restore work is queued. Reserve the removed admin terminate idempotency field to protect protobuf compatibility.
Labels

client, documentation, javascript, server, shared

Development

Successfully merging this pull request may close these issues.

feat(curtailment): operator read APIs + admin terminate + audit + metrics interface + E2E

2 participants