Open
Labels
- P1 (Priority 1: Critical, fix as soon as possible)
- bug (Something isn't working)
- operations (Operations, monitoring, and observability)
Description
The dev backend container (osa-dev) became unresponsive to POST /hed/chat requests while continuing to serve GET /health checks normally. All valid chat requests returned 0 bytes and timed out at every level (backend port, Apache proxy, Cloudflare Worker).
Symptoms
- POST /hed/chat with valid auth: 0 bytes received, connection timeout
- POST /hed/chat without auth: returns 401 immediately (proving uvicorn IS accepting connections)
- GET /health: works normally (200 OK)
- Docker health check: passes (runs inside container)
- Direct curl from inside container (docker exec): works fine
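Because GET /health kept passing while the chat path hung, a probe that exercises the chat endpoint from outside the container would have separated the two cases. The following is a minimal sketch using only the Python standard library. The JSON payload shape (`{"message": "ping"}`) and the `Bearer` authorization scheme are assumptions for illustration, not the backend's confirmed request format, and `probe_chat` is a hypothetical helper name.

```python
import http.client
import json
import urllib.request


def probe_chat(url: str, token: str, timeout: float = 10.0) -> bool:
    """Return True only if the chat endpoint answers 200 with a non-empty body."""
    # ASSUMPTION: the payload shape and the Bearer scheme are placeholders;
    # substitute whatever the real /hed/chat endpoint expects.
    payload = json.dumps({"message": "ping"}).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    try:
        # The timeout bounds each blocking socket operation, so a request
        # that receives 0 bytes fails here instead of hanging.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            body = response.read()
            return response.status == 200 and len(body) > 0
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses.
        return False
```

Running it against the host-mapped port (38529 in this report) rather than inside the container matters, since the in-container checks passed during the incident. Note that a probe like this triggers a real LLM call on every run, so the probe interval should be chosen with cost in mind.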
Root Cause (Preliminary)
The uvicorn event loop appears to deadlock or stall on LLM/OpenRouter requests when accessed through the Docker port mapping (host:38529 -> container:38528). Requests that trigger the async LLM call path hang indefinitely, while requests that return before reaching the LLM (auth failures, health checks) work fine.
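The observed pattern (chat requests hang, health checks still answer) is what an awaited upstream call that never completes would look like on a single event loop. The sketch below illustrates that behavior with plain `asyncio`. It is not the backend's code: `hung_llm_call`, `chat_handler`, and `health_handler` are hypothetical stand-ins, and it does not confirm that this is the actual root cause.

```python
import asyncio


async def hung_llm_call() -> str:
    # Hypothetical stand-in for an upstream request that never completes.
    await asyncio.Event().wait()
    return "unreachable"


async def chat_handler() -> str:
    # No timeout: this coroutine waits for as long as the upstream call does.
    return await hung_llm_call()


async def health_handler() -> str:
    # Returns without touching the LLM path.
    return "ok"


async def demo() -> tuple[str, bool]:
    chat_task = asyncio.create_task(chat_handler())
    await asyncio.sleep(0)  # let the chat request start and reach its await

    # The loop is still free, so the health handler answers immediately.
    health = await asyncio.wait_for(health_handler(), timeout=1.0)

    await asyncio.sleep(0.05)
    chat_still_pending = not chat_task.done()

    # Clean up the stalled task so the loop can shut down.
    chat_task.cancel()
    try:
        await chat_task
    except asyncio.CancelledError:
        pass
    return health, chat_still_pending
```

A hung await stalls only its own request. A synchronous blocking call inside a handler would instead freeze the whole loop, health checks included, which does not match the symptoms reported here.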
Resolution
Restarting the container (docker restart osa-dev) immediately resolved the issue.
Investigation Needed
- Check uvicorn worker count (single worker = single event loop, one stalled request blocks all)
- Add request timeout for LLM calls to prevent indefinite hangs
- Add a liveness probe that exercises the POST /hed/chat path, not just GET /health
- Consider adding async timeout wrapper around OpenRouter API calls
- Check if httpx/aiohttp connection pool exhaustion could cause this
- Review if any synchronous operations block the async event loop
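The timeout items above could be approached with a small wrapper around the awaited LLM call. This is a sketch under assumptions: `call_with_timeout` and `LLMTimeoutError` are hypothetical names, the 30-second default is arbitrary, and the real OpenRouter client call would be passed in as `make_call`.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class LLMTimeoutError(Exception):
    """Raised when the wrapped LLM call does not finish in time."""


async def call_with_timeout(
    make_call: Callable[[], Awaitable[T]],
    timeout_s: float = 30.0,  # ASSUMPTION: arbitrary default, tune per endpoint
) -> T:
    """Await make_call() and cancel it if it exceeds timeout_s seconds."""
    try:
        # wait_for cancels the inner awaitable when the deadline passes.
        return await asyncio.wait_for(make_call(), timeout=timeout_s)
    except asyncio.TimeoutError as exc:
        raise LLMTimeoutError(f"LLM call exceeded {timeout_s}s") from exc
```

A handler could catch `LLMTimeoutError` and return an error response (for example a 504) instead of leaving the client with 0 bytes. This only interrupts calls that yield to the event loop. A synchronous blocking call inside the handler would not be cancelled by `asyncio.wait_for`, so the review of blocking operations is still needed.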
Impact
- Dev environment only (production unaffected)
- Widget appeared non-functional to users
- Required manual container restart to recover