feat(scheduler): add durable apply dispatch and retries#121
Draft
aparajon wants to merge 1 commit into
Draft
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces a failed_retryable state at both the apply and task level so that transient engine errors can be automatically retried by scheduler workers up to a fixed retry budget (10 attempts), after which the apply is expired to a permanent failed. It also refactors ApplyOptions round‑tripping and the buildControlRequest flow to better preserve apply options and Vitess resume metadata across retry/recovery.
Changes:
- Add
failed_retryabletask/apply state, propagating it through state derivation, normalization, storage schema (attemptcolumn), and progress rendering. - Wire scheduler workers to claim retryable applies (transitioning them to
pending+ bumpingattempt), retry via a newretryFailedApplypath, and expire exhausted retries via a newExpireRetryablestore method (run by worker 0). - Centralize
ApplyOptions↔ map conversion (ApplyOptionsFromMap/Map), makebuildControlRequestreturn an error, and share Vitess resume‑state construction between Start/ResumeApply/control ops.
Reviewed changes
Copilot reviewed 35 out of 35 changed files in this pull request and generated no comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/state/apply.go, task.go, README.md, apply_test.go | Add FailedRetryable constants, normalization, derivation priority, non-terminal classification, and tests. |
| pkg/storage/types.go, types_test.go | Add Attempt to Apply/Task and ApplyOptionsFromMap/Map round‑trip helpers + tests. |
| pkg/storage/storage.go | Add ExpireRetryable to ApplyStore interface. |
| pkg/storage/mysqlstore/applies.go, applies_test.go | Persist attempt; extend FindNextApply to also claim retryable applies (flip to pending, bump attempt); add ExpireRetryable. |
| pkg/storage/mysqlstore/tasks.go | Persist task attempt on insert/update/scan. |
| pkg/schema/mysql/applies.sql, tasks.sql | Add attempt column. |
| pkg/tern/state_converters.go | Map Task.FailedRetryable → Apply.FailedRetryable and STATE_FAILED proto. |
| pkg/tern/local_apply.go | Pass original error to markTaskFailed/failApplyWithTasks to choose retryable vs permanent state; update sequential finalization; reorder deriveOverallState priorities. |
| pkg/tern/local_client.go | Use ApplyOptionsFromMap and applyOpts to drive atomic mode. |
| pkg/tern/local_control.go, local_control_resume.go | Return errors from buildControlRequest; refactor Vitess resume state into a shared helper; add retryFailedApply + failApplyPermanently; route Vitess through atomic resume; multi‑namespace atomic resume. |
| pkg/api/scheduler.go, README.md | Add scheduler tick that expires exhausted retries (worker 0) and fails applies with no available client; updated docs. |
| pkg/api/progress_handlers.go, plan_handlers.go, handlers_test.go | Map retryable error code; serve retryable progress from storage; use ApplyOptionsFromMap. |
| pkg/apitypes/apitypes.go | Add ErrCodeEngineErrorRetryable and mark it retryable. |
| pkg/cmd/templates/progress.go, progress_states_test.go | Render FailedRetryable as “Retrying” with yellow styling. |
| pkg/cmd/commands/watch_tui_test.go | Extend retry classification test. |
| pkg/metrics/metrics.go, README.md | Add schemabot.scheduler.expired_retryable_total counter. |
| pkg/tern/apply_states_test.go, integration/scheduler_test.go, e2e/local/vitess_test.go | New tests for retry derivation, scheduler retry/expire, and PlanetScale main‑branch permanent failure. |
| docs/architecture.md, configuration.md, pkg/tern/README.md, TEMPLATES.md | Documentation updates for retry recovery path. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
883da47 to
5d746e2
Compare
933bf79 to
7060dba
Compare
7060dba to
be5ea54
Compare
be5ea54 to
a342f97
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This makes apply execution durable and scheduler-owned:
/api/applypersists apply/task records, wakes scheduler workers, and returns once work is queued. Workers claim applies from storage and resume local or remote Tern work, letting SchemaBot recover from request cancellation, server crashes, and retryable engine failures without restarting completed work.failed_retryableapply/task states, retry-budget expiration, progress/API state mapping, and scheduler retry behaviorResumeApplyhandles queued dispatch, stale recovery, and retryable re-dispatch🤖 Generated by Codex.