bug: repair_run triggers full job run instead of repairing failed tasks #392
Description
Summary
When a user asks Claude to perform a task repair (re-run only the failed tasks of a previous job run), the skill triggers a full run_now instead of a repair_run. This is because repair_run is not implemented in the MCP tools.
Steps to Reproduce
- Run a Databricks job where one or more tasks fail
- Ask Claude Code (with ai-dev-kit) to repair the failed run
- Observe that a new full job run is triggered instead of repairing the failed tasks
Expected Behavior
Claude should call jobs.repair_run(run_id=<failed_run_id>, ...) to re-run only the failed/skipped tasks from the original run, preserving the successful task outputs.
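For reference, the expected behavior maps onto a single Databricks SDK call. A minimal sketch (not the ai-dev-kit implementation; assumes the databricks-sdk Python package is installed and workspace credentials are configured):

```python
# Sketch only: assumes databricks-sdk and configured workspace credentials.
def repair_failed_tasks(run_id: int, rerun_dependent_tasks: bool = False):
    """Re-run only the failed/skipped tasks of an existing job run."""
    from databricks.sdk import WorkspaceClient  # lazy import: needs databricks-sdk

    w = WorkspaceClient()
    # rerun_all_failed_tasks=True leaves successful task outputs untouched,
    # unlike run_now, which starts a brand new run of the whole job.
    return w.jobs.repair_run(
        run_id=run_id,
        rerun_all_failed_tasks=True,
        rerun_dependent_tasks=rerun_dependent_tasks,
    )
```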
Actual Behavior
Claude falls back to manage_job_runs(action='run_now', job_id=...), which starts a brand new full run from scratch.
Root Cause
The manage_job_runs MCP tool only exposes these actions: run_now, get, get_output, cancel, list, wait.
A repair action is missing. The Databricks SDK supports w.jobs.repair_run() but it has not been implemented in:
- databricks-mcp-server/databricks_mcp_server/tools/jobs.py
- databricks-tools-core/databricks_tools_core/jobs/runs.py
Proposed Fix
- Add a repair_run() function in databricks_tools_core/jobs/runs.py using w.jobs.repair_run()
- Add a repair action to manage_job_runs in databricks_mcp_server/tools/jobs.py
- Update SKILL.md to document the repair workflow
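The proposed wiring could look roughly like this. Only w.jobs.repair_run() is a real SDK method; the function names and dispatch shape below are hypothetical and may not match the actual ai-dev-kit layout:

```python
# Sketch of the proposed fix. Only w.jobs.repair_run() is a real Databricks
# SDK call; repair_run_core and the manage_job_runs dispatch shown here are
# hypothetical illustrations of where the new "repair" action would plug in.
from typing import List, Optional


def repair_run_core(w, run_id: int,
                    rerun_tasks: Optional[List[str]] = None,
                    rerun_dependent_tasks: bool = False):
    """Proposed databricks_tools_core helper wrapping w.jobs.repair_run().

    With no rerun_tasks given, re-runs all failed/skipped tasks; otherwise
    re-runs only the named tasks.
    """
    return w.jobs.repair_run(
        run_id=run_id,
        rerun_all_failed_tasks=rerun_tasks is None,
        rerun_tasks=rerun_tasks,
        rerun_dependent_tasks=rerun_dependent_tasks,
    )


def manage_job_runs(w, action: str, **kwargs):
    """Hypothetical MCP tool entry point extended with a 'repair' action."""
    if action == "repair":
        return repair_run_core(w, **kwargs)
    raise ValueError(f"unsupported action: {action!r}")
```

With a repair action available, Claude can route "fix the failed run" requests to the existing run_id instead of falling back to run_now.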
Impact
Customers using ai-dev-kit for job orchestration and failure recovery are inadvertently re-running entire jobs, wasting compute and time.