bug: job monitoring reports 'complete' when job is still running or failed #393
Summary
When Claude monitors a Databricks job run, it retries up to 3 times on errors then incorrectly reports the job as complete — even when the job is still running or has failed.
Steps to Reproduce
- Trigger a Databricks job via Claude Code (with ai-dev-kit)
- Have the job run for longer than expected, or encounter an error during monitoring
- Observe Claude retrying ~3 times then reporting the job as "complete"
Expected Behavior
- Claude should use `manage_job_runs(action='wait')` to block until a terminal state is reached
- A failed job (`TERMINATED` + `result_state=FAILED`) should be clearly reported as failed, not complete
- A still-running job (`RUNNING`, `WAITING_FOR_RETRY`) should never be reported as complete
Actual Behavior
- Claude calls `manage_job_runs(action='get')` (snapshot) instead of `action='wait'` (blocking poll), gets an intermediate state, retries ~3 times, then gives up and reports success
- The `WAITING_FOR_RETRY` lifecycle state (the job task is retrying after a failure) is not listed in `TERMINAL_STATES`, which can cause confusion about whether the run is done
- The skill docs show `max_retries: 3` as a job task config example; the LLM interprets this as the number of times to retry the tool call
Root Cause
Two issues:
- No guidance in `SKILL.md` to always use `action='wait'` for job completion monitoring instead of `action='get'`; the LLM defaults to polling with `get` and gives up after a few attempts.
- The `max_retries: 3` example in the skill docs (`notifications-monitoring.md`) is a job task configuration parameter, but the LLM misreads it as a directive to retry the monitoring tool call 3 times before declaring success.
Relevant files:
- `databricks-skills/databricks-jobs/SKILL.md`
- `databricks-skills/databricks-jobs/notifications-monitoring.md`
- `databricks-tools-core/databricks_tools_core/jobs/runs.py` (`TERMINAL_STATES` does not include `WAITING_FOR_RETRY`)
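A minimal sketch of the blocking behavior that `action='wait'` should provide, assuming a hypothetical `get_run_state` callable that returns a `(life_cycle_state, result_state)` snapshot (standing in for an `action='get'` call):

```python
import time

# Keep polling until a terminal life-cycle state is reached; never give up
# after a fixed number of attempts. TERMINAL_STATES here mirrors the
# Databricks terminal life-cycle states for illustration.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def wait_for_terminal(get_run_state, timeout_s=3600, poll_s=30):
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        life_cycle, result = get_run_state()
        if life_cycle in TERMINAL_STATES:
            return life_cycle, result  # only now is the run actually done
        time.sleep(poll_s)  # RUNNING, WAITING_FOR_RETRY, etc. keep looping
    raise TimeoutError("run did not reach a terminal state within timeout")
```

The key property is that intermediate states like `WAITING_FOR_RETRY` continue the loop rather than ending it with a success report.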
Proposed Fix
- Update `SKILL.md` to explicitly instruct: always use `action='wait'` to monitor job completion; never use `action='get'` in a polling loop
- Clarify in `notifications-monitoring.md` that `max_retries` is a job task config, not a tool retry directive
- Consider adding `WAITING_FOR_RETRY` and `BLOCKED` to a documented "non-terminal" states list so the LLM understands those mean the job is still active
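To make the `max_retries` distinction concrete, here is a hedged example of a job task spec; the field names follow the Databricks Jobs API, while the task key and notebook path are made up:

```python
# max_retries tells Databricks how many times to re-run the *task* on
# failure; it is not an instruction for the monitoring client.
task_spec = {
    "task_key": "etl",  # illustrative task name
    "notebook_task": {"notebook_path": "/Repos/example/etl"},  # made-up path
    "max_retries": 3,  # server-side job task retry config
    "min_retry_interval_millis": 60000,
}
```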
Impact
Customers see false "job complete" status messages, leading to incorrect downstream decisions when jobs are actually still running or have failed.