fix(workflow-executor): add per-invocation AI timeout to surface hanging provider errors [PRD-409] #1609

Open
matthv wants to merge 1 commit into feat/prd-214-server-step-mapper from fix/prd-409-ai-invoke-timeout

matthv (Member) commented May 28, 2026

Summary

When the AI provider hangs (no response, internal retries, or a connection held open), the previous code relied on the global STEP_TIMEOUT_MS (default 5 min) to fail the step. From the user's perspective this looks like an infinite spinner.

This PR adds a dedicated timeout on each AI invocation (default 60s, configurable via AI_INVOKE_TIMEOUT_MS) using AbortController + signal so the underlying HTTP request is actually cancelled.
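The timeout wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `invokeWithTimeout`, the `Invoke` type, and the shape of the error class are hypothetical stand-ins for `invokeWithTools`, `model.invoke`, and `AiInvokeTimeoutError`. It assumes the provider call accepts an `AbortSignal` and rejects once that signal is aborted.

```typescript
// Hypothetical stand-in for the PR's AiInvokeTimeoutError.
class AiInvokeTimeoutError extends Error {
  readonly timeoutMs: number;

  constructor(timeoutMs: number) {
    super(`AI invocation timed out after ${timeoutMs}ms`);
    this.name = 'AiInvokeTimeoutError';
    this.timeoutMs = timeoutMs;
  }
}

// Assumed shape of a provider call: it receives an optional AbortSignal.
type Invoke<T> = (options: { signal?: AbortSignal }) => Promise<T>;

async function invokeWithTimeout<T>(invoke: Invoke<T>, timeoutMs?: number): Promise<T> {
  // Unset or <= 0 disables the timeout and keeps the previous behavior.
  if (timeoutMs === undefined || timeoutMs <= 0) return invoke({});

  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);

  try {
    // Passing the signal lets the provider cancel the underlying request.
    return await invoke({ signal: controller.signal });
  } catch (error) {
    // Translate the failure only when our own timer caused the abort.
    if (controller.signal.aborted) throw new AiInvokeTimeoutError(timeoutMs);
    throw error; // Non-abort errors are rethrown as-is.
  } finally {
    clearTimeout(timer); // Always clear the timer, on success and on failure.
  }
}

// A fake provider that never answers until it is aborted (for illustration only).
function hangingProvider(options: { signal?: AbortSignal }): Promise<string> {
  return new Promise((_resolve, reject) => {
    options.signal?.addEventListener('abort', () => reject(new Error('request aborted')));
  });
}
```

Clearing the timer in `finally` keeps a fast, successful call from leaving a pending timer behind, which is one of the cases the new unit tests are said to cover.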

On timeout, the executor throws the new AiInvokeTimeoutError, which BaseStepExecutor.execute() converts to an error outcome with a user-friendly message — the orchestrator then sets context.error on the step and the frontend exits its isLoading state immediately.
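The conversion from a thrown error to an error outcome might look something like the sketch below. The `userMessage` field, the `StepOutcome` type, the message text, and the `executeStep` function are all assumptions made for illustration; the PR only states that `AiInvokeTimeoutError` extends `WorkflowExecutorError` and that `BaseStepExecutor.execute()` produces an error outcome with a user-friendly message. Rethrowing unknown errors is also an assumption.

```typescript
// Hypothetical base class carrying a message that is safe to show to users.
class WorkflowExecutorError extends Error {
  readonly userMessage: string;

  constructor(message: string, userMessage: string) {
    super(message);
    this.userMessage = userMessage;
  }
}

// Hypothetical timeout error; the user-facing wording is invented for this sketch.
class AiInvokeTimeoutError extends WorkflowExecutorError {
  constructor(timeoutMs: number) {
    super(
      `AI invocation timed out after ${timeoutMs}ms`,
      'The AI provider took too long to respond. Please try again.',
    );
  }
}

// Assumed outcome shape: the orchestrator would copy `error` onto the step context.
type StepOutcome = { status: 'success' } | { status: 'error'; error: string };

async function executeStep(run: () => Promise<void>): Promise<StepOutcome> {
  try {
    await run();
    return { status: 'success' };
  } catch (error) {
    // Known executor errors become an error outcome instead of propagating.
    if (error instanceof WorkflowExecutorError) {
      return { status: 'error', error: error.userMessage };
    }
    throw error; // Assumption: unknown errors are left to the caller.
  }
}
```

Returning an outcome rather than throwing lets the caller record the failure on the step right away, which matches the described effect of the frontend leaving its loading state.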

Why not just lower STEP_TIMEOUT_MS globally

STEP_TIMEOUT_MS covers more than the AI call (it also covers slow agent fetches, DB lookups, etc.). Lowering it globally would kill legitimately slow non-AI work. A dedicated AI timeout is more surgical.

Changes

  • defaults.ts: new DEFAULT_AI_INVOKE_TIMEOUT_MS = 60_000
  • errors.ts: new AiInvokeTimeoutError extends WorkflowExecutorError with provider-specific user message
  • base-step-executor.ts: invokeWithTools now wraps model.invoke with AbortController + timeout
  • Config plumbing through RunnerConfig → StepContextConfig → ExecutionContext
  • cli-core.ts: parse AI_INVOKE_TIMEOUT_MS env var
  • 6 new unit tests covering: timeout fires, signal is passed to the model, timeout disabled when unset or <= 0, non-abort errors rethrown as-is, timer cleared on success
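As an illustration of the env-var parsing step, a helper along these lines would cover the cases listed. The function name `parseAiInvokeTimeoutMs` is hypothetical, and two behaviors are assumptions rather than facts from the PR: that an unset variable falls back to the 60s default at the CLI level, and that a non-numeric value also falls back to the default.

```typescript
// Mirrors the constant named in the PR description.
const DEFAULT_AI_INVOKE_TIMEOUT_MS = 60_000;

// Hypothetical helper: takes the environment as a plain object so it is easy to test.
function parseAiInvokeTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env['AI_INVOKE_TIMEOUT_MS'];

  // Assumption: unset or empty means "use the default".
  if (raw === undefined || raw.trim() === '') return DEFAULT_AI_INVOKE_TIMEOUT_MS;

  const parsed = Number(raw);

  // Assumption: a value that is not a finite number falls back to the default.
  if (!Number.isFinite(parsed)) return DEFAULT_AI_INVOKE_TIMEOUT_MS;

  // Values <= 0 are passed through; the executor treats them as "timeout disabled".
  return parsed;
}
```

With this shape, `AI_INVOKE_TIMEOUT_MS=10000` yields a 10s timeout, and `AI_INVOKE_TIMEOUT_MS=0` passes `0` through so the executor can skip the timer entirely.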

fixes PRD-409

Test plan

  • 811 unit tests pass (6 new)
  • Lint: 0 errors (6 pre-existing warnings unrelated)
  • Live test: with SIMULATE_AI_HANG=1 AI_INVOKE_TIMEOUT_MS=10000, the frontend shows the new user message after 10s instead of spinning for 5min
  • Reviewer to confirm 60s default is appropriate (vs e.g. 30s or 120s)

🤖 Generated with Claude Code

Note

Add per-invocation AI timeout to surface hanging provider errors in workflow executor

  • Adds aiInvokeTimeoutMs (default 60,000ms) to the workflow executor's RunnerConfig, ExecutionContext, and ExecutorOptions, configurable via the AI_INVOKE_TIMEOUT_MS environment variable.
  • In BaseStepExecutor.invokeWithTools, wraps AI provider calls with an AbortController timer; if the provider hangs past the timeout, the invocation is aborted and throws AiInvokeTimeoutError.
  • Introduces AiInvokeTimeoutError with a user-facing retry message to distinguish timeout failures from other AI errors.
  • Setting aiInvokeTimeoutMs to 0 or leaving it unset disables the timeout, preserving existing behavior.
  • Risk: AI invocations that previously hung indefinitely will now fail after 60s by default, which may surface as new errors in workflows that relied on slow providers.

Macroscope summarized 1718cb4.

…ing provider errors [PRD-409]

When the AI provider hangs (no response, internal retries, or holds the
connection open), the previous code relied on the global STEP_TIMEOUT_MS
(default 5 min) to fail the step. From the user's perspective this looks
like an infinite spinner.

Add a dedicated timeout on each AI invocation (default 60s, configurable
via AI_INVOKE_TIMEOUT_MS) using AbortController + signal so the underlying
HTTP request is actually cancelled. On timeout, throws the new
AiInvokeTimeoutError, which BaseStepExecutor.execute() converts to an
error outcome with a user-friendly message — the orchestrator then sets
context.error on the step and the frontend exits its isLoading state
immediately.

fixes PRD-409

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

linear Bot commented May 28, 2026

PRD-409

qltysh Bot commented May 28, 2026

1 new issue

| Tool | Category | Rule | Count |
| --- | --- | --- | --- |
| qlty | Structure | Function with high complexity (count = 13): invokeWithTools | 1 |

qltysh Bot commented May 28, 2026

Qlty


Coverage Impact

This PR will not change total coverage.

Modified Files with Diff Coverage (3)

| Rating | File | % Diff | Uncovered Line #s |
| --- | --- | --- | --- |
| A | packages/workflow-executor/src/executors/base-step-executor.ts | 100.0% | |
| A | packages/workflow-executor/src/errors.ts | 100.0% | |
| A | packages/workflow-executor/src/defaults.ts | 100.0% | |
| Total | | 100.0% | |
