bug: job monitoring reports 'complete' when job is still running or failed #393

@sgarla

Description

Summary

When Claude monitors a Databricks job run, it retries up to 3 times on errors and then incorrectly reports the job as complete, even when the job is still running or has failed.

Steps to Reproduce

  1. Trigger a Databricks job via Claude Code (with ai-dev-kit)
  2. Have the job run for longer than expected, or encounter an error during monitoring
  3. Observe Claude retrying ~3 times then reporting the job as "complete"

Expected Behavior

  • Claude should use manage_job_runs(action='wait') to block until a terminal state is reached
  • A failed job (TERMINATED + result_state=FAILED) should be clearly reported as failed, not complete
  • A still-running job (RUNNING, WAITING_FOR_RETRY) should never be reported as complete

Actual Behavior

  • Claude calls manage_job_runs(action='get') (snapshot) instead of action='wait' (blocking poll), gets an intermediate state, retries ~3 times, then gives up and reports success
  • The WAITING_FOR_RETRY lifecycle state (job task is retrying after failure) is not listed in TERMINAL_STATES, which can cause confusion about whether the run is done
  • The skill docs show max_retries: 3 as a job task config example — the LLM interprets this as the number of times to retry the tool call
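The terminal/non-terminal distinction above can be sketched as follows. State names come from the Databricks Jobs API (`life_cycle_state` / `result_state`); `classify_run` is a hypothetical helper for illustration, not the actual ai-dev-kit implementation:

```python
# Terminal lifecycle states per the Databricks Jobs API: the run will not
# change state again. Everything else (including WAITING_FOR_RETRY and
# BLOCKED) means the run is still active.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def classify_run(life_cycle_state, result_state):
    """Classify a run snapshot as 'running', 'succeeded', or 'failed'.

    Hypothetical helper: an intermediate snapshot must never be
    reported as complete.
    """
    if life_cycle_state not in TERMINAL_STATES:
        # WAITING_FOR_RETRY, BLOCKED, RUNNING, PENDING, ... are all
        # non-terminal: the run is still in progress.
        return "running"
    if life_cycle_state == "TERMINATED" and result_state == "SUCCESS":
        return "succeeded"
    # TERMINATED with FAILED/TIMEDOUT/CANCELED, SKIPPED, INTERNAL_ERROR
    return "failed"
```

Under this classification, `WAITING_FOR_RETRY` is unambiguously "running", so a monitor that gives up after 3 snapshots in that state has no basis for reporting success.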

Root Cause

Two issues:

  1. No guidance in SKILL.md to always use action='wait' for job completion monitoring instead of action='get'. The LLM defaults to polling with get and gives up after a few attempts.

  2. max_retries: 3 example in skill docs (notifications-monitoring.md) is a job task configuration parameter, but the LLM misreads it as a directive to retry the monitoring tool call 3 times before declaring success.

Relevant files:

  • databricks-skills/databricks-jobs/SKILL.md
  • databricks-skills/databricks-jobs/notifications-monitoring.md
  • databricks-tools-core/databricks_tools_core/jobs/runs.py (TERMINAL_STATES does not include WAITING_FOR_RETRY)

Proposed Fix

  1. Update SKILL.md to explicitly instruct: always use action='wait' to monitor job completion; never use action='get' in a polling loop
  2. Clarify in notifications-monitoring.md that max_retries is a job task config, not a tool retry directive
  3. Consider adding WAITING_FOR_RETRY and BLOCKED to a documented "non-terminal" states list so the LLM understands those mean the job is still active
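The monitoring behavior that action='wait' should provide can be sketched like this. This is a minimal illustration, not the tool's implementation; `fetch_state` is a hypothetical stand-in for a single action='get' snapshot:

```python
import time

# Terminal lifecycle states per the Databricks Jobs API.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}

def wait_for_run(fetch_state, poll_seconds=30):
    """Block until the run reaches a terminal state; True iff it succeeded.

    The key property: there is no retry budget. The loop exits only on a
    terminal life_cycle_state, never after N attempts, so an intermediate
    state like WAITING_FOR_RETRY can never be mistaken for completion.
    """
    while True:
        life_cycle_state, result_state = fetch_state()
        if life_cycle_state in TERMINAL_STATES:
            # Success requires TERMINATED + result_state=SUCCESS;
            # anything else terminal is a failure and must be reported as one.
            return life_cycle_state == "TERMINATED" and result_state == "SUCCESS"
        time.sleep(poll_seconds)
```

This also illustrates why the `max_retries: 3` example is unrelated: that parameter controls how many times Databricks re-runs a failed task, while the monitoring loop above has no retry count at all.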

Impact

Customers see false "job complete" status messages, leading to incorrect downstream decisions when jobs are actually still running or have failed.
