
feat: report orchestrator health in readiness endpoint#8

Draft
deangoodmanson wants to merge 5 commits into develop from fix/orchestrator-health

Collaborator

deangoodmanson commented Feb 17, 2026

Summary

  • Adds orchestrator_alive: Arc<AtomicBool> to AppState (initializes false; set true only when orchestrator starts)
  • spawn_orchestrator() sets the flag true before signaling ready, false on unexpected exit (not during graceful shutdown); uses Release/Acquire ordering for correct visibility on arm64
  • /health/ready now reports orchestrator: "ok"|"unhealthy" in its checks and counts it in the all_healthy gate
  • Health CLI (kruxiaflow health) now correctly parses flat "ok" strings from the readiness endpoint — previously showed ❓ Not reported in readiness check
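The flag lifecycle described in the bullets above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the PR's actual code: the real `spawn_orchestrator()` presumably runs an async task and signals readiness with `notify_one()`, whereas this sketch assumes a plain thread and an `mpsc` channel as stand-ins. The `shutdown` flag, the `ready` parameter, and `orchestrator_check` are hypothetical names introduced for illustration.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::mpsc::Sender;
use std::sync::Arc;
use std::thread::{self, JoinHandle};

/// Hypothetical stand-in for the PR's AppState; only the health flag is modeled.
#[derive(Clone)]
pub struct AppState {
    pub orchestrator_alive: Arc<AtomicBool>,
}

impl AppState {
    pub fn new() -> Self {
        // Starts false so readiness cannot report "ok" before the loop runs.
        Self { orchestrator_alive: Arc::new(AtomicBool::new(false)) }
    }
}

/// Runs `work` on a thread (assumption: the real code spawns an async task).
/// `shutdown` is a hypothetical flag that is true once a graceful stop was requested.
pub fn spawn_orchestrator<F>(
    state: &AppState,
    shutdown: Arc<AtomicBool>,
    ready: Sender<()>,
    work: F,
) -> JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    let alive = Arc::clone(&state.orchestrator_alive);
    thread::spawn(move || {
        // Store the flag BEFORE signaling ready, so a health check that runs
        // right after the ready signal already observes `true`.
        alive.store(true, Ordering::Release);
        let _ = ready.send(());

        work();

        // Only an exit that was not requested counts as unhealthy.
        if !shutdown.load(Ordering::Acquire) {
            alive.store(false, Ordering::Release);
        }
    })
}

/// Value reported under `checks.orchestrator` in this sketch.
pub fn orchestrator_check(state: &AppState) -> &'static str {
    if state.orchestrator_alive.load(Ordering::Acquire) {
        "ok"
    } else {
        "unhealthy"
    }
}
```

The `Release` store pairs with the `Acquire` load in `orchestrator_check`, which is what makes the write visible to the health handler on weakly ordered hardware such as arm64.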

Behavior

Normal operation:

{ "status": "ready", "checks": { "database": "ok", "event_source": "ok", "queue": "ok", "orchestrator": "ok" } }

Orchestrator crash:

{ "status": "not_ready", "checks": { ..., "orchestrator": "unhealthy" } }

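→ Returns HTTP 503. In Kubernetes, a failing readiness probe only removes the pod from Service endpoints; a restart happens only if this endpoint also backs a liveness probe or another health check that restarts the container on failure.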
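As a rough sketch of the gate behind these two responses: every check has to be "ok" for the endpoint to report ready. The `Check` enum and `readiness` function are hypothetical names, the 200 status for the ready case is an assumption (the description only states 503), and the JSON is assembled by hand to keep the example dependency-free; real code would more likely use a JSON library.

```rust
/// Result of one dependency check. The strings mirror the PR description;
/// applying "unhealthy" to checks other than the orchestrator is an assumption.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Check {
    Ok,
    Unhealthy,
}

impl Check {
    pub fn as_str(self) -> &'static str {
        match self {
            Check::Ok => "ok",
            Check::Unhealthy => "unhealthy",
        }
    }
}

/// Returns (HTTP status code, JSON body) for a set of named checks.
pub fn readiness(checks: &[(&str, Check)]) -> (u16, String) {
    // The all_healthy gate: a single failing check flips the whole response.
    let all_healthy = checks.iter().all(|(_, c)| *c == Check::Ok);
    let (code, status) = if all_healthy {
        (200, "ready") // 200 is assumed here, not stated in the PR
    } else {
        (503, "not_ready")
    };

    // Check names are fixed identifiers in this sketch, so no JSON escaping is done.
    let fields: Vec<String> = checks
        .iter()
        .map(|(name, c)| format!("\"{}\": \"{}\"", name, c.as_str()))
        .collect();
    let body = format!(
        "{{ \"status\": \"{}\", \"checks\": {{ {} }} }}",
        status,
        fields.join(", ")
    );
    (code, body)
}
```

Calling `readiness` with all four checks set to `Check::Ok` reproduces the "Normal operation" body shown above; switching `orchestrator` to `Check::Unhealthy` yields 503 with `"status": "not_ready"`.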

Test plan

  • ./docker exec kruxiaflow /kruxiaflow health shows ✅ for every service, including ✅ orchestrator - ok
  • cargo test -p kruxiaflow-api passes including test_readiness_endpoint_orchestrator_unhealthy
  • Killing the orchestrator task causes readiness to return 503

🤖 Generated with Claude Code

deangoodmanson and others added 2 commits February 17, 2026 07:16
- Use pgvector/pgvector:pg17 image so the vector extension is available
- Add platform: linux/amd64 for py-std-worker to suppress emulation warning
- Remove ivfflat index on document_chunks (vector(3072) exceeds 2000-dim limit)
- Fix health CLI to parse flat "ok" strings from readiness endpoint

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
The /health/ready endpoint now includes an orchestrator check alongside
database, event_source, and queue. The orchestrator task signals liveness
via an Arc<AtomicBool> in AppState — set to true when the loop starts,
false if it exits unexpectedly before shutdown.

This closes the gap where orchestrator crashes were invisible to health
checks: a stalled or panicked orchestrator task now causes the readiness
probe to return 503, triggering a restart in orchestrated environments.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
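
The commit message above mentions panics. One way to make sure a panicking task also clears the flag is a drop guard, sketched below; whether the PR uses this pattern is not stated, and `AliveGuard` and `run_orchestrator` are hypothetical names. The sketch assumes the default unwinding panic strategy (with `panic = "abort"` the guard never runs).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical guard: marks the orchestrator alive on creation and clears the
/// flag when dropped, including when the drop happens while unwinding from a panic.
pub struct AliveGuard {
    alive: Arc<AtomicBool>,
    shutdown: Arc<AtomicBool>,
}

impl AliveGuard {
    pub fn new(alive: Arc<AtomicBool>, shutdown: Arc<AtomicBool>) -> Self {
        alive.store(true, Ordering::Release);
        Self { alive, shutdown }
    }
}

impl Drop for AliveGuard {
    fn drop(&mut self) {
        // A graceful shutdown leaves the flag untouched; any other exit,
        // whether a normal return or a panic, marks the orchestrator unhealthy.
        if !self.shutdown.load(Ordering::Acquire) {
            self.alive.store(false, Ordering::Release);
        }
    }
}

/// Runs the orchestrator body with the guard held for its whole lifetime.
pub fn run_orchestrator<F: FnOnce()>(
    alive: Arc<AtomicBool>,
    shutdown: Arc<AtomicBool>,
    body: F,
) {
    let _guard = AliveGuard::new(alive, shutdown);
    body();
}
```

A flag like this records exits only. Detecting a task that is still running but stuck would need something extra, such as a heartbeat timestamp, which this sketch does not cover.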
deangoodmanson marked this pull request as draft February 17, 2026 13:56
…dering

- Initialize orchestrator_alive to false; set true only when orchestrator
  actually starts (fixes window where health reported ok before startup)
- Store true before notify_one() to eliminate race between startup signal
  and health check reads
- Use Release/Acquire ordering instead of Relaxed for cross-thread visibility
  on weak memory model architectures (arm64)
- Remove "ready" as a valid check status (it's a top-level field, not a
  per-check value)
- Add comment explaining defensive "error" status handling
- Add test for orchestrator unhealthy path (503 + orchestrator: "unhealthy")

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
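
The check-status handling listed in this commit could look roughly like the sketch below on the CLI side. It is an illustration only: `CheckDisplay`, `classify`, and `render_line` are hypothetical names, the ❌ marker and the exact line layout are assumptions, and treating an unrecognized string (including a per-check "ready") as not reported is a guess at the intended behavior.

```rust
/// How the CLI might classify one flat string from the `checks` object.
#[derive(Debug, PartialEq, Eq)]
pub enum CheckDisplay {
    Healthy,
    Unhealthy,
    NotReported,
}

pub fn classify(value: Option<&str>) -> CheckDisplay {
    match value {
        Some("ok") => CheckDisplay::Healthy,
        // "error" is handled defensively even though the readiness endpoint
        // described in this PR only emits "ok" and "unhealthy".
        Some("unhealthy") | Some("error") => CheckDisplay::Unhealthy,
        // "ready" is a top-level status, not a per-check value, so it falls
        // through here together with missing or unknown entries (assumption).
        _ => CheckDisplay::NotReported,
    }
}

/// Renders one line of `kruxiaflow health` output for a named check.
pub fn render_line(name: &str, value: Option<&str>) -> String {
    match classify(value) {
        CheckDisplay::Healthy => format!("✅ {} - ok", name),
        CheckDisplay::Unhealthy => {
            format!("❌ {} - {}", name, value.unwrap_or("unhealthy"))
        }
        CheckDisplay::NotReported => {
            format!("❓ {} - Not reported in readiness check", name)
        }
    }
}
```

With this shape, `render_line("orchestrator", Some("ok"))` produces the `✅ orchestrator - ok` line from the test plan, and a missing entry produces the ❓ line that the CLI previously showed for every check.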
@deangoodmanson
Collaborator Author

Claude Sonnet 4.6 found a handful of issues with the first version of this PR and committed the fixes.
The PR is still a draft because it needs further review and fixes.
I've also isolated this PR's functionality by removing changes inherited from another branch (now moved to #10, which is also isolated).
