fix: detect exporter rapid failure loop and exit for container restart #691
ambient-code[bot] wants to merge 3 commits into
When the exporter child process fails repeatedly within a short window (e.g., due to persistent DNS resolution failures), the parent process now detects the rapid failure pattern and exits with code 1 instead of looping forever. This allows systemd or the container orchestrator to recreate the container fresh.

The thresholds are configurable via environment variables:
- JUMPSTARTER_MAX_RAPID_FAILURES (default: 5)
- JUMPSTARTER_RAPID_FAILURE_WINDOW (default: 30 seconds)

Fixes #690

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
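The detection described in this commit message can be sketched as a sliding-window counter over failure timestamps. This is only an illustrative sketch, not the code from the PR: the class name `RapidFailureDetector`, its methods, and the injectable `clock` parameter are hypothetical, and it assumes that reaching the threshold (rather than exceeding it) triggers the exit.

```python
import time
from collections import deque


class RapidFailureDetector:
    """Hypothetical helper: counts failures inside a sliding time window."""

    def __init__(self, max_failures=5, window_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.window_seconds = window_seconds
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._failures = deque()

    def record_failure(self):
        """Record one failure; return True when the caller should give up."""
        now = self._clock()
        self._failures.append(now)
        # Drop failures that have fallen out of the window.
        while self._failures and now - self._failures[0] > self.window_seconds:
            self._failures.popleft()
        # Assumption: reaching the threshold counts as a rapid failure loop.
        return len(self._failures) >= self.max_failures
```

A supervising loop could call `record_failure()` each time the child process dies and call `sys.exit(1)` when it returns True, leaving the restart to systemd or the container orchestrator.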
The ty pre-release type checker reports false positive unresolved-reference errors for nonlocal variable declarations in nested functions. Use a mutable list container pattern instead, which is functionally equivalent and avoids the ty diagnostic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
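For readers unfamiliar with the two patterns named in this commit, here is a minimal side-by-side sketch. The function names are hypothetical and unrelated to the PR's code; the point is only that a one-element list can stand in for a `nonlocal` rebinding.

```python
def make_counter_nonlocal():
    count = 0

    def bump():
        nonlocal count  # rebinds the variable in the enclosing scope
        count += 1
        return count

    return bump


def make_counter_list():
    count = [0]  # one-element list acts as a mutable cell

    def bump():
        count[0] += 1  # mutates the list; no rebinding, so no nonlocal needed
        return count[0]

    return bump
```

Both closures keep independent state per call to the factory and return the same sequence of values.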
CI Fix: type-check-python failure
The Fix: Replaced the
Other CI failures (not related to this PR):
No reviewer comments to address.
# rapid failures, the main process exits to let systemd/container orchestrator
# recreate the container.
MAX_RAPID_FAILURES = int(os.environ.get("JUMPSTARTER_MAX_RAPID_FAILURES", "5"))
RAPID_FAILURE_WINDOW = int(os.environ.get("JUMPSTARTER_RAPID_FAILURE_WINDOW", "30"))
-RAPID_FAILURE_WINDOW = int(os.environ.get("JUMPSTARTER_RAPID_FAILURE_WINDOW", "30"))
+RAPID_FAILURE_WINDOW = int(os.environ.get("JUMPSTARTER_RAPID_FAILURE_WINDOW", "60"))
60 seconds is a good measure for a rapid failure; anything under a minute does not make a lot of sense.
Instead of adding two new environment variables, let's add them to the exporter configuration (https://github.com/jumpstarter-dev/jumpstarter/blob/main/python/packages/jumpstarter/jumpstarter/config/exporter.py#L126)?
Not sure where it would fit best, or whether we can add a section for this type of configuration.
Addressing review feedback
Both suggestions from @mangelajo are reasonable -- will implement them:
Working on the changes now.
Address reviewer feedback from @mangelajo:
- Move rapid failure detection configuration from environment variables (JUMPSTARTER_MAX_RAPID_FAILURES, JUMPSTARTER_RAPID_FAILURE_WINDOW) into the ExporterConfigV1Alpha1 model as a failureDetection section
- Change default rapid failure window from 30s to 60s
- Add FailureDetectionConfigV1Alpha1 model with maxRapidFailures and rapidFailureWindow fields (YAML-configurable)
- Update tests to use config-based approach instead of env vars

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
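To illustrate how a failureDetection section with these two fields and defaults might be read, here is a rough sketch using a plain dataclass. It is an assumption-laden stand-in: the actual FailureDetectionConfigV1Alpha1 model is not shown on this page, and the class name `FailureDetection`, the `from_mapping` helper, and the sample dictionary are hypothetical. Only the key names (maxRapidFailures, rapidFailureWindow) and the defaults (5 and 60 seconds) come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureDetection:
    """Hypothetical stand-in for FailureDetectionConfigV1Alpha1."""

    max_rapid_failures: int = 5
    rapid_failure_window: int = 60  # seconds

    @classmethod
    def from_mapping(cls, data):
        # A missing section falls back to the defaults.
        data = data or {}
        return cls(
            max_rapid_failures=int(data.get("maxRapidFailures", 5)),
            rapid_failure_window=int(data.get("rapidFailureWindow", 60)),
        )


# Example: a parsed exporter config that overrides only one threshold.
exporter_config = {"failureDetection": {"maxRapidFailures": 3}}
settings = FailureDetection.from_mapping(exporter_config.get("failureDetection"))
```

Keeping the thresholds in the exporter config rather than in environment variables means they travel with the rest of the exporter's YAML settings.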
Summary
_serve_with_exc_handling(): if the child process fails within a configurable time window (default 60s) more than a configurable number of times (default 5), the main process exits with code 1 to let systemd/container orchestrator recreate the container
Configuration
Thresholds are configurable via the exporter config YAML under a failureDetection section:
Test plan
- run_test.py)
- make lint-fix
- ty check

Fixes #690
Generated with Claude Code