bug: repair_run triggers full job run instead of repairing failed tasks #392

@sgarla

Description

Summary

When a user asks Claude to perform a task repair (re-run only the failed tasks of a previous job run), the skill triggers a full run_now instead of a repair_run. This is because repair_run is not implemented in the MCP tools.

Steps to Reproduce

  1. Run a Databricks job where one or more tasks fail
  2. Ask Claude Code (with ai-dev-kit) to repair the failed run
  3. Observe that a new full job run is triggered instead of repairing the failed tasks

Expected Behavior

Claude should call jobs.repair_run(run_id=<failed_run_id>, ...) to re-run only the failed/skipped tasks from the original run, preserving the successful task outputs.

Actual Behavior

Claude falls back to manage_job_runs(action='run_now', job_id=...), which starts a brand new full run from scratch.

Root Cause

The manage_job_runs MCP tool only exposes these actions: run_now, get, get_output, cancel, list, wait.

A repair action is missing. The Databricks SDK supports w.jobs.repair_run() but it has not been implemented in:

  • databricks-mcp-server/databricks_mcp_server/tools/jobs.py
  • databricks-tools-core/databricks_tools_core/jobs/runs.py

Proposed Fix

  1. Add a repair_run() function in databricks_tools_core/jobs/runs.py using w.jobs.repair_run()
  2. Add a repair action to manage_job_runs in databricks_mcp_server/tools/jobs.py
  3. Update SKILL.md to document the repair workflow
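Steps 1 and 2 could look roughly like the sketch below. The function names mirror the proposal above, but the signatures are assumptions about this codebase, not the actual implementation; the SDK client is passed in explicitly so the sketch stays testable without workspace credentials:

```python
# Hedged sketch of the proposed fix. Signatures are assumptions,
# not the actual databricks-tools-core / databricks-mcp-server code.

def repair_run(client, run_id: int,
               rerun_tasks=None, rerun_all_failed_tasks: bool = False):
    """Step 1: core wrapper around w.jobs.repair_run() for jobs/runs.py."""
    if rerun_tasks is None and not rerun_all_failed_tasks:
        # Sensible default: re-run every failed/skipped task of the original run.
        rerun_all_failed_tasks = True
    return client.jobs.repair_run(
        run_id=run_id,
        rerun_tasks=rerun_tasks,
        rerun_all_failed_tasks=rerun_all_failed_tasks,
    )

def manage_job_runs(client, action: str, **kwargs):
    """Step 2: extend the MCP tool's action dispatch with 'repair'."""
    if action == "repair":
        return repair_run(client, **kwargs)
    # ... existing actions: run_now, get, get_output, cancel, list, wait ...
    raise ValueError(f"unsupported action: {action}")
```

With this in place, asking Claude to repair a failed run maps to `manage_job_runs(action='repair', run_id=<failed_run_id>)` instead of falling back to `run_now`.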

Impact

Customers using ai-dev-kit for job orchestration and failure recovery are inadvertently re-running entire jobs, wasting compute and time.
