bug: job monitoring reports 'complete' when job is still running or failed #393

@sgarla

Description

Summary

When Claude monitors a Databricks job run, it retries up to 3 times on errors and then incorrectly reports the job as complete, even when the job is still running or has failed.

Steps to Reproduce

  1. Trigger a Databricks job via Claude Code (with ai-dev-kit)
  2. Have the job run for longer than expected, or encounter an error during monitoring
  3. Observe Claude retrying ~3 times then reporting the job as "complete"

Expected Behavior

  • Claude should use manage_job_runs(action='wait') to block until a terminal state is reached
  • A failed job (TERMINATED + result_state=FAILED) should be clearly reported as failed, not complete
  • A still-running job (RUNNING, WAITING_FOR_RETRY) should never be reported as complete

Actual Behavior

  • Claude calls manage_job_runs(action='get') (snapshot) instead of action='wait' (blocking poll), gets an intermediate state, retries ~3 times, then gives up and reports success
  • The WAITING_FOR_RETRY lifecycle state (job task is retrying after failure) is not listed in TERMINAL_STATES, which can cause confusion about whether the run is done
  • The skill docs show max_retries: 3 as a job task config example — the LLM interprets this as the number of times to retry the tool call
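The terminal/non-terminal distinction above can be sketched as follows. State names come from the Databricks Jobs API (`life_cycle_state` / `result_state`); `classify_run` is a hypothetical helper for illustration, not the actual ai-dev-kit implementation:

```python
# Terminal lifecycle states per the Databricks Jobs API: the run will not
# change state again. Everything else (including WAITING_FOR_RETRY and
# BLOCKED) means the run is still active.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def classify_run(life_cycle_state, result_state):
    """Classify a run snapshot as 'running', 'succeeded', or 'failed'.

    Hypothetical helper: an intermediate snapshot must never be
    reported as complete.
    """
    if life_cycle_state not in TERMINAL_STATES:
        # WAITING_FOR_RETRY, BLOCKED, RUNNING, PENDING, ... are all
        # non-terminal: the run is still in progress.
        return "running"
    if life_cycle_state == "TERMINATED" and result_state == "SUCCESS":
        return "succeeded"
    # TERMINATED with FAILED/TIMEDOUT/CANCELED, SKIPPED, INTERNAL_ERROR
    return "failed"
```

Under this classification, `WAITING_FOR_RETRY` is unambiguously "running", so a monitor that gives up after 3 snapshots in that state has no basis for reporting success.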

Root Cause

Two issues:

  1. No guidance in SKILL.md to always use action='wait' for job completion monitoring instead of action='get'. The LLM defaults to polling with get and gives up after a few attempts.

  2. max_retries: 3 example in skill docs (notifications-monitoring.md) is a job task configuration parameter, but the LLM misreads it as a directive to retry the monitoring tool call 3 times before declaring success.

Relevant files:

  • databricks-skills/databricks-jobs/SKILL.md
  • databricks-skills/databricks-jobs/notifications-monitoring.md
  • databricks-tools-core/databricks_tools_core/jobs/runs.py (TERMINAL_STATES does not include WAITING_FOR_RETRY)

Proposed Fix

  1. Update SKILL.md to explicitly instruct: always use action='wait' to monitor job completion; never use action='get' in a polling loop
  2. Clarify in notifications-monitoring.md that max_retries is a job task config, not a tool retry directive
  3. Consider adding WAITING_FOR_RETRY and BLOCKED to a documented "non-terminal" states list so the LLM understands those mean the job is still active
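The monitoring behavior that action='wait' should provide can be sketched like this. This is a minimal illustration, not the tool's implementation; `fetch_state` is a hypothetical stand-in for a single action='get' snapshot:

```python
import time

# Terminal lifecycle states per the Databricks Jobs API.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def wait_for_run(fetch_state, poll_seconds=30):
    """Block until the run reaches a terminal state; True iff it succeeded.

    The key property: there is no retry budget. The loop exits only on a
    terminal life_cycle_state, never after N attempts, so an intermediate
    state like WAITING_FOR_RETRY can never be mistaken for completion.
    """
    while True:
        life_cycle_state, result_state = fetch_state()
        if life_cycle_state in TERMINAL_STATES:
            # Success requires TERMINATED + result_state=SUCCESS;
            # anything else terminal is a failure and must be reported as one.
            return life_cycle_state == "TERMINATED" and result_state == "SUCCESS"
        time.sleep(poll_seconds)
```

This also illustrates why the `max_retries: 3` example is unrelated: that parameter controls how many times Databricks re-runs a failed task, while the monitoring loop above has no retry count at all.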

Impact

Customers see false "job complete" status messages, leading to incorrect downstream decisions when jobs are actually still running or have failed.
