feat(scheduler): add durable apply dispatch and retries#121

Draft
aparajon wants to merge 1 commit into main from armand/failed-retryable-recovery

@aparajon (Collaborator) commented May 19, 2026

Summary

This makes apply execution durable and scheduler-owned: /api/apply persists apply/task records, wakes scheduler workers, and returns once work is queued. Workers claim applies from storage and resume local or remote Tern work, letting SchemaBot recover from request cancellation, server crashes, and retryable engine failures without restarting completed work.

  • Add failed_retryable apply/task states, retry-budget expiration, progress/API state mapping, and scheduler retry behavior
  • Extend local and gRPC Tern clients so ResumeApply handles queued dispatch, stale recovery, and retryable re-dispatch
  • Enforce one active apply per database/type/environment with MySQL named locks and document scheduler claims, workers, and retry semantics
  • Add integration/e2e coverage for queued dispatch, crash recovery, retryable Spirit failures, multi-worker scheduling, and split CI execution modes
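The queue-then-wake flow in the summary can be sketched roughly as follows. This is a minimal in-memory illustration, not the PR's code: `Queue`, `ApplyKey`, `Submit`, `Claim`, `Finish`, and `ErrActiveApply` are hypothetical names, and a map stands in for the MySQL named lock that enforces one active apply per database/type/environment.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ApplyKey identifies the scope in which only one apply may be active.
type ApplyKey struct {
	Database, Type, Environment string
}

// ErrActiveApply is a hypothetical error for a conflicting in-flight apply.
var ErrActiveApply = errors.New("an apply is already active for this database/type/environment")

// Queue is an in-memory stand-in for durable storage plus the scheduler wake-up signal.
type Queue struct {
	mu     sync.Mutex
	active map[ApplyKey]bool // stand-in for one MySQL named lock per key
	queued []ApplyKey
	wake   chan struct{} // capacity 1 so repeated wake-ups coalesce
}

func NewQueue() *Queue {
	return &Queue{active: make(map[ApplyKey]bool), wake: make(chan struct{}, 1)}
}

// Submit records the apply, nudges the workers, and returns without waiting for execution.
func (q *Queue) Submit(k ApplyKey) error {
	q.mu.Lock()
	if q.active[k] {
		q.mu.Unlock()
		return ErrActiveApply
	}
	q.active[k] = true
	q.queued = append(q.queued, k)
	q.mu.Unlock()

	// Non-blocking send: if a wake-up is already pending, this one is dropped.
	select {
	case q.wake <- struct{}{}:
	default:
	}
	return nil
}

// Claim hands the oldest queued apply to a worker, if there is one.
func (q *Queue) Claim() (ApplyKey, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.queued) == 0 {
		return ApplyKey{}, false
	}
	k := q.queued[0]
	q.queued = q.queued[1:]
	return k, true
}

// Finish releases the key so a later apply for the same scope can be queued.
func (q *Queue) Finish(k ApplyKey) {
	q.mu.Lock()
	defer q.mu.Unlock()
	delete(q.active, k)
}

func main() {
	q := NewQueue()
	k := ApplyKey{Database: "orders", Type: "mysql", Environment: "staging"}
	fmt.Println(q.Submit(k)) // first submit is queued
	fmt.Println(q.Submit(k)) // second submit conflicts while the first is active
	<-q.wake                 // a worker wakes up
	claimed, ok := q.Claim()
	fmt.Println(claimed, ok)
	q.Finish(claimed)
	fmt.Println(q.Submit(k)) // allowed again once the first apply has finished
}
```

Because the record is written before the wake-up is sent, a dropped or missed wake-up only delays the work; a worker that polls storage later would still find it. That property is what makes the dispatch survive a cancelled request or a restart in this sketch.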

🤖 Generated by Codex.

Copilot AI review requested due to automatic review settings May 19, 2026 16:24

Copilot AI left a comment

Pull request overview

This PR introduces a failed_retryable state at both the apply and task level so that scheduler workers can automatically retry transient engine errors up to a fixed retry budget (10 attempts). Once the budget is exhausted, the apply is expired to the permanent failed state. It also refactors ApplyOptions round‑tripping and the buildControlRequest flow to better preserve apply options and Vitess resume metadata across retry/recovery.
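One way to picture the retryable/permanent split described above is a small classifier. This is a sketch under assumptions: the state strings `failed_retryable`, `failed`, and `pending` come from the PR text, while `completed`, `ErrRetryable`, `ErrPermanent`, `FailureState`, and `IsTerminal` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	StatePending         State = "pending"
	StateFailedRetryable State = "failed_retryable"
	StateFailed          State = "failed"
	StateCompleted       State = "completed" // assumed name for the success state
)

// ErrRetryable is a hypothetical sentinel that engine code would wrap around transient errors.
var ErrRetryable = errors.New("retryable engine error")

// ErrPermanent is a hypothetical example of an error that should not be retried.
var ErrPermanent = errors.New("invalid DDL")

// FailureState picks the state to record for a failed task or apply
// based on the original error, looking through wrapped errors.
func FailureState(err error) State {
	if errors.Is(err, ErrRetryable) {
		return StateFailedRetryable
	}
	return StateFailed
}

// IsTerminal reports whether no further scheduler work is expected.
// failed_retryable is deliberately non-terminal so workers pick it up again.
func IsTerminal(s State) bool {
	switch s {
	case StateFailed, StateCompleted:
		return true
	default:
		return false
	}
}

func main() {
	transient := fmt.Errorf("copy rows: %w", ErrRetryable)
	for _, err := range []error{transient, ErrPermanent} {
		s := FailureState(err)
		fmt.Printf("%v -> %s (terminal: %t)\n", err, s, IsTerminal(s))
	}
}
```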

Changes:

  • Add failed_retryable task/apply state, propagating it through state derivation, normalization, storage schema (attempt column), and progress rendering.
  • Wire scheduler workers to claim retryable applies (transitioning them to pending + bumping attempt), retry via a new retryFailedApply path, and expire exhausted retries via a new ExpireRetryable store method (run by worker 0).
  • Centralize ApplyOptions ↔ map conversion (ApplyOptionsFromMap / Map), make buildControlRequest return an error, and share Vitess resume‑state construction between Start/ResumeApply/control ops.
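The ApplyOptions ↔ map round trip might look something like the sketch below. Only the names `ApplyOptions`, `ApplyOptionsFromMap`, and `Map` appear in the PR; the fields, the map keys, the `map[string]string` type, and the `parseBool` helper are assumptions made for illustration, and the real signatures may differ.

```go
package main

import (
	"fmt"
	"strconv"
)

// ApplyOptions holds per-apply settings. The fields here are hypothetical.
type ApplyOptions struct {
	Atomic       bool
	DeferCutover bool
}

// Map serializes the options into a flat string map suitable for storage.
func (o ApplyOptions) Map() map[string]string {
	return map[string]string{
		"atomic":        strconv.FormatBool(o.Atomic),
		"defer_cutover": strconv.FormatBool(o.DeferCutover),
	}
}

// ApplyOptionsFromMap is the inverse of Map. Missing keys fall back to
// the zero value; malformed values are reported rather than ignored.
func ApplyOptionsFromMap(m map[string]string) (ApplyOptions, error) {
	var o ApplyOptions
	var err error
	if o.Atomic, err = parseBool(m, "atomic"); err != nil {
		return ApplyOptions{}, err
	}
	if o.DeferCutover, err = parseBool(m, "defer_cutover"); err != nil {
		return ApplyOptions{}, err
	}
	return o, nil
}

// parseBool reads an optional boolean option from the map.
func parseBool(m map[string]string, key string) (bool, error) {
	v, ok := m[key]
	if !ok {
		return false, nil
	}
	b, err := strconv.ParseBool(v)
	if err != nil {
		return false, fmt.Errorf("option %q: %w", key, err)
	}
	return b, nil
}

func main() {
	in := ApplyOptions{Atomic: true}
	out, err := ApplyOptionsFromMap(in.Map())
	fmt.Println(out == in, err)
}
```

Keeping both directions in one place means a retry or recovery path that rebuilds options from storage goes through the same parsing as the initial request, which is the property the refactor appears to be after.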

Reviewed changes

Copilot reviewed 35 out of 35 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| pkg/state/apply.go, task.go, README.md, apply_test.go | Add FailedRetryable constants, normalization, derivation priority, non-terminal classification, and tests. |
| pkg/storage/types.go, types_test.go | Add Attempt to Apply/Task and ApplyOptionsFromMap/Map round‑trip helpers + tests. |
| pkg/storage/storage.go | Add ExpireRetryable to ApplyStore interface. |
| pkg/storage/mysqlstore/applies.go, applies_test.go | Persist attempt; extend FindNextApply to also claim retryable applies (flip to pending, bump attempt); add ExpireRetryable. |
| pkg/storage/mysqlstore/tasks.go | Persist task attempt on insert/update/scan. |
| pkg/schema/mysql/applies.sql, tasks.sql | Add attempt column. |
| pkg/tern/state_converters.go | Map Task.FailedRetryable to Apply.FailedRetryable and the STATE_FAILED proto. |
| pkg/tern/local_apply.go | Pass original error to markTaskFailed/failApplyWithTasks to choose retryable vs permanent state; update sequential finalization; reorder deriveOverallState priorities. |
| pkg/tern/local_client.go | Use ApplyOptionsFromMap and applyOpts to drive atomic mode. |
| pkg/tern/local_control.go, local_control_resume.go | Return errors from buildControlRequest; refactor Vitess resume state into a shared helper; add retryFailedApply + failApplyPermanently; route Vitess through atomic resume; multi‑namespace atomic resume. |
| pkg/api/scheduler.go, README.md | Add scheduler tick that expires exhausted retries (worker 0) and fails applies with no available client; updated docs. |
| pkg/api/progress_handlers.go, plan_handlers.go, handlers_test.go | Map retryable error code; serve retryable progress from storage; use ApplyOptionsFromMap. |
| pkg/apitypes/apitypes.go | Add ErrCodeEngineErrorRetryable and mark it retryable. |
| pkg/cmd/templates/progress.go, progress_states_test.go | Render FailedRetryable as “Retrying” with yellow styling. |
| pkg/cmd/commands/watch_tui_test.go | Extend retry classification test. |
| pkg/metrics/metrics.go, README.md | Add schemabot.scheduler.expired_retryable_total counter. |
| pkg/tern/apply_states_test.go, integration/scheduler_test.go, e2e/local/vitess_test.go | New tests for retry derivation, scheduler retry/expire, and PlanetScale main‑branch permanent failure. |
| docs/architecture.md, configuration.md, pkg/tern/README.md, TEMPLATES.md | Documentation updates for retry recovery path. |
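The claim and expiry behavior listed for `FindNextApply` and `ExpireRetryable` can be sketched against an in-memory store. Assumptions are flagged in the comments: the `Store` and `Apply` shapes are hypothetical, the `running` state name is a guess, and treating `Attempt >= 10` as "budget exhausted" is one reading of the 10-attempt budget. The real store would presumably do this in SQL inside a transaction.

```go
package main

import "fmt"

// maxAttempts mirrors the fixed retry budget of 10 mentioned in the review overview.
const maxAttempts = 10

// Apply is a pared-down, hypothetical apply record.
type Apply struct {
	ID      int
	State   string
	Attempt int
}

// Store is a single-threaded, in-memory stand-in for the MySQL-backed store.
type Store struct{ applies []*Apply }

// FindNextApply claims the first apply that is pending or retryable.
// A retryable apply is flipped back to pending and its attempt is bumped.
func (s *Store) FindNextApply() (*Apply, bool) {
	for _, a := range s.applies {
		switch a.State {
		case "pending":
			a.State = "running" // assumed state name for claimed work
			return a, true
		case "failed_retryable":
			if a.Attempt >= maxAttempts {
				continue // budget exhausted: leave it for ExpireRetryable
			}
			a.Attempt++
			a.State = "pending"
			return a, true
		}
	}
	return nil, false
}

// ExpireRetryable marks retryable applies with an exhausted budget as
// permanently failed and returns how many were expired.
func (s *Store) ExpireRetryable() int {
	n := 0
	for _, a := range s.applies {
		if a.State == "failed_retryable" && a.Attempt >= maxAttempts {
			a.State = "failed"
			n++
		}
	}
	return n
}

func main() {
	s := &Store{applies: []*Apply{
		{ID: 1, State: "failed_retryable", Attempt: 10},
		{ID: 2, State: "failed_retryable", Attempt: 2},
	}}
	fmt.Println("expired:", s.ExpireRetryable())
	if a, ok := s.FindNextApply(); ok {
		fmt.Println("claimed:", a.ID, a.State, a.Attempt)
	}
}
```

In a multi-worker deployment, running the expiry sweep from a single designated worker (worker 0 in the PR) avoids every worker repeating the same scan on each tick.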


@aparajon force-pushed the armand/failed-retryable-recovery branch 11 times, most recently from 883da47 to 5d746e2, May 19, 2026 21:11
@aparajon changed the title from "feat(scheduler): retry failed engine errors" to "feat(tern): add retryable failure handling to scheduler", May 19, 2026
@aparajon force-pushed the armand/failed-retryable-recovery branch 3 times, most recently from 933bf79 to 7060dba, May 20, 2026 15:18
@aparajon force-pushed the armand/failed-retryable-recovery branch from 7060dba to be5ea54, May 20, 2026 16:37
@aparajon changed the title from "feat(tern): add retryable failure handling to scheduler" to "feat(scheduler): add durable apply dispatch and retries", May 20, 2026
@aparajon force-pushed the armand/failed-retryable-recovery branch from be5ea54 to a342f97, May 20, 2026 19:04

2 participants