Skip to content

fix: detect exporter rapid failure loop and exit for container restart#691

Open
ambient-code[bot] wants to merge 3 commits into
mainfrom
fix/exporter-rapid-failure-loop-690
Open

fix: detect exporter rapid failure loop and exit for container restart#691
ambient-code[bot] wants to merge 3 commits into
mainfrom
fix/exporter-rapid-failure-loop-690

Conversation

@ambient-code
Copy link
Copy Markdown
Contributor

@ambient-code ambient-code Bot commented May 14, 2026

Summary

  • Fixes the exporter failure loop described in Exporter failure loop #690 where persistent errors (e.g., DNS resolution failures) cause the exporter to restart endlessly in a tight loop
  • Adds rapid failure detection to _serve_with_exc_handling(): if the child process fails within a configurable time window (default 60s) more than a configurable number of times (default 5), the main process exits with code 1 to let systemd/container orchestrator recreate the container
  • Counter resets when a child runs longer than the failure window, allowing transient errors to continue being retried

Configuration

Thresholds are configurable via the exporter config YAML under a failureDetection section:

apiVersion: jumpstarter.dev/v1alpha1
kind: ExporterConfig
metadata:
  name: my-exporter
failureDetection:
  maxRapidFailures: 5        # default: 5
  rapidFailureWindow: 60     # default: 60 seconds
# ... rest of config

Test plan

  • Added unit tests for rapid failure detection (run_test.py)
    • Verifies exit after max_rapid_failures consecutive rapid failures
    • Verifies counter resets after a long-running child
    • Verifies normal exit codes pass through unchanged
    • Verifies a single rapid failure does not cause exit
    • Verifies custom config values are respected
  • All 156 existing tests in jumpstarter-cli pass
  • Linting passes with make lint-fix
  • Type checking passes with ty check

Fixes #690

Generated with Claude Code

When the exporter child process fails repeatedly within a short window
(e.g., due to persistent DNS resolution failures), the parent process
now detects the rapid failure pattern and exits with code 1 instead of
looping forever. This allows systemd or the container orchestrator to
recreate the container fresh.

The thresholds are configurable via environment variables:
- JUMPSTARTER_MAX_RAPID_FAILURES (default: 5)
- JUMPSTARTER_RAPID_FAILURE_WINDOW (default: 30 seconds)

Fixes #690

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code ambient-code Bot mentioned this pull request May 14, 2026
The ty pre-release type checker reports false positive unresolved-reference
errors for nonlocal variable declarations in nested functions. Use a
mutable list container pattern instead, which is functionally equivalent
and avoids the ty diagnostic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code Bot commented May 14, 2026

CI Fix: type-check-python failure

The type-check-python CI job was failing because the ty type checker (pre-release) reports false positive unresolved-reference errors for nonlocal variable declarations inside nested functions in run_test.py (lines 59, 94, 162).

Fix: Replaced the nonlocal call_count pattern with a mutable list container (counter = [0] / counter[0] += 1). This is a well-known Python idiom that is functionally equivalent and avoids the ty diagnostic.

Other CI failures (not related to this PR):

  • e2e-tests (both amd64 and arm64) — failures are in Go e2e tests (e2e_test.go:322 paginated lease listing, hooks_test.go:114 beforeLease hook execution). These are unrelated to the Python exporter changes in this PR. The most recent main branch e2e run passed successfully, suggesting these may be flaky tests.

No reviewer comments to address.

# rapid failures, the main process exits to let systemd/container orchestrator
# recreate the container.
MAX_RAPID_FAILURES = int(os.environ.get("JUMPSTARTER_MAX_RAPID_FAILURES", "5"))
RAPID_FAILURE_WINDOW = int(os.environ.get("JUMPSTARTER_RAPID_FAILURE_WINDOW", "30"))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RAPID_FAILURE_WINDOW = int(os.environ.get("JUMPSTARTER_RAPID_FAILURE_WINDOW", "30"))
RAPID_FAILURE_WINDOW = int(os.environ.get("JUMPSTARTER_RAPID_FAILURE_WINDOW", "60"))

60 seconds is a good measure for a rapid failure, anything under a minute does not make a lot of sense.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of adding two new environment variables let's add it to the exporter configuration https://github.com/jumpstarter-dev/jumpstarter/blob/main/python/packages/jumpstarter/jumpstarter/config/exporter.py#L126 ?

Not sure where would it fit best, or if we can add a section for this type of configuration.

@ambient-code
Copy link
Copy Markdown
Contributor Author

ambient-code Bot commented May 14, 2026

Addressing review feedback

Both suggestions from @mangelajo are reasonable -- will implement them:

  1. Change default RAPID_FAILURE_WINDOW from 30 to 60 seconds -- agreed, anything under a minute is not a meaningful "long-running" child.

  2. Move config from environment variables to ExporterConfigV1Alpha1 -- will add a failureDetection section (with maxRapidFailures and rapidFailureWindow fields) to the exporter config model. The _serve_with_exc_handling function already receives config, so it can read these values directly. This removes the two JUMPSTARTER_* env vars.

Working on the changes now.

Address reviewer feedback from @mangelajo:
- Move rapid failure detection configuration from environment variables
  (JUMPSTARTER_MAX_RAPID_FAILURES, JUMPSTARTER_RAPID_FAILURE_WINDOW) into
  the ExporterConfigV1Alpha1 model as a failureDetection section
- Change default rapid failure window from 30s to 60s
- Add FailureDetectionConfigV1Alpha1 model with maxRapidFailures and
  rapidFailureWindow fields (YAML-configurable)
- Update tests to use config-based approach instead of env vars

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Exporter failure loop

1 participant