CC-4609: fix health check goroutine freeze and service stop hang#3944

Open
Deer-WarLord wants to merge 1 commit into develop from fix/CC-4609

Conversation

@Deer-WarLord
Contributor

Summary

  • Fix permanently frozen health check goroutines that cause services to show Unknown (status 3) in the UI despite healthy containers
  • Fix serviced service stop/restart hanging indefinitely when the container's ReportInstanceDead RPC stalls
  • Fix panic: close of closed channel on healthExit during normal shutdown

Root Cause

Every RPC call in the health check reporting pipeline (container -> delegate -> master) used timeout=0, which reconnectingClient.Call() mapped to 365 days. When any leg of this chain stalled (e.g., during heavy delegate startup load), the calling goroutine blocked permanently. Since health.Ping() calls report() synchronously, the goroutine could never receive the <-cancel signal, causing health check reports to stop forever.
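
For illustration, here is a minimal Go sketch of that failure mode (the names pingLoop, report, and cancel are hypothetical, not serviced's actual identifiers): because the reporter is invoked synchronously, a call that never returns also keeps the loop from ever reaching the <-cancel case. The pre-report check added in health/check.go is shown as well.

package health // illustrative package name

import (
	"log"
	"time"
)

// pingLoop mimics the health.Ping() pattern described above: report()
// is called synchronously, so if it blocks forever the <-cancel case
// below can never be selected and the goroutine freezes.
func pingLoop(interval time.Duration, cancel <-chan struct{}, report func() error) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		// Pre-report cancellation check (the health/check.go fix):
		// exit promptly on SIGTERM instead of entering a blocking RPC.
		select {
		case <-cancel:
			return
		default:
		}
		select {
		case <-cancel:
			return
		case <-tick.C:
			if err := report(); err != nil {
				log.Printf("health report failed: %v", err)
			}
		}
	}
}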

The same infinite timeout affected ReportInstanceDead at shutdown, causing serviced service stop to hang until Docker's kill timeout fired.

Additionally, healthExit was closed in two places (defer in Run() and explicitly in shutdownService), causing a panic on normal shutdown.
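
A minimal sketch of the idempotent-close pattern used for the fix (the struct and method names here are illustrative; container/controller.go wires this into the real controller):

package container // illustrative package name

import "sync"

type healthRunner struct {
	healthExit chan struct{}
	exitOnce   sync.Once
}

// closeHealthExit can be reached from both Run()'s defer and
// shutdownService; sync.Once makes the close idempotent, so the
// second caller is a no-op instead of a panic.
func (h *healthRunner) closeHealthExit() {
	h.exitOnce.Do(func() { close(h.healthExit) })
}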

Affected RPC Chain

Container (serviced-controller):  doHealthCheck() -> LBClient.ReportHealthStatus  [timeout was 0 = 365 days]
Delegate (serviced agent):        agent_proxy.go -> master.Client.ReportHealthStatus  [timeout was 0 = 365 days]
Master:                           health_server.go -> facade.ReportHealthStatus
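
The mapping itself is roughly this (a hedged sketch; the real logic lives in serviced's reconnectingClient and may differ in detail):

package rpcutils // illustrative package name

import "time"

// effectiveTimeout sketches the behavior described above: a zero
// timeout was treated as "wait up to a year", i.e. effectively forever.
func effectiveTimeout(timeout time.Duration) time.Duration {
	if timeout == 0 {
		return 365 * 24 * time.Hour
	}
	return timeout
}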

Changes

rpc/master/health_client.go: Set 60s timeout on ReportHealthStatus and 10s on ReportInstanceDead (delegate -> master)
node/lbClient.go: Set 60s timeout on ReportHealthStatus and 10s on ReportInstanceDead (container -> delegate)
node/agent_proxy.go: Log warnings when delegate proxy calls to the master fail or time out
health/check.go: Add a pre-report cancellation check in Ping() so goroutines exit promptly on SIGTERM (see the sketch under Root Cause above)
container/controller.go: Fix the healthExit double-close with sync.Once, wrap ReportInstanceDead in a 10s timeout guard (sketched below), and log ReportHealthStatus errors
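
The controller.go shutdown guard follows the goroutine-plus-select pattern below (a minimal sketch; the function name and wiring are hypothetical, only the pattern and the 10s bound reflect the change):

package container // illustrative package name

import (
	"log"
	"time"
)

// reportInstanceDeadGuarded bounds the shutdown-path RPC so that a
// stalled delegate can no longer hang serviced service stop. The
// buffered channel lets the goroutine finish even after a timeout.
func reportInstanceDeadGuarded(report func() error) {
	done := make(chan error, 1)
	go func() { done <- report() }()
	select {
	case err := <-done:
		if err != nil {
			log.Printf("ReportInstanceDead failed: %v", err)
		}
	case <-time.After(10 * time.Second):
		log.Print("ReportInstanceDead timed out after 10s; continuing shutdown")
	}
}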

Reproducing Before Fix

Symptoms

  1. In the serviced UI, one or more services show "Missing Some Health Checks" with all health checks in Unknown (status 3)
  2. serviced service status <serviceID> shows status 3 for health checks
  3. Manually running the health check script inside the container succeeds
  4. serviced service stop <serviceID> or serviced service restart <serviceID> hangs indefinitely

Diagnostic Commands (on the delegate hosting the affected container)

Confirm health checks are Unknown:

serviced service status <serviceID>
# Health checks show status 3 (Unknown)

Confirm scripts work inside container:

serviced service attach <serviceID>
# Run the health check script manually -- should succeed

Confirm container launched health check goroutines but nothing since:

docker logs <containerID> 2>&1 | grep -i "health"
# Should show "Kicking off health check" at startup and nothing after

Confirm no RPC data flowing from container to delegate:

ss -tnop | grep <container_IP> | grep 4979
# All connections show timer:(keepalive,...) with zero send/recv queues

Confirm no large packets (health reports carry >1200 byte JWT):

timeout 60 tcpdump -i docker0 -n "host <container_IP> and port 4979 and greater 500" -c 5
# Zero packets captured = goroutines stuck

Confirm delegate -> master path is also stuck:

ss -tnop | grep "<master_IP>:4979"
# Connections with keepalive timers only, zero data

Confirm service stop hangs:

serviced service stop <serviceID>
# Hangs in "stopping" state until Docker's kill timeout

Validating After Fix

Build and deploy the fix

make
# Deploy the updated serviced binary to the master and delegates, and
# update the serviced-controller image in the docker registry

Test 1: Health checks recover from transient RPC failures

  1. Deploy services and confirm all health checks are green (status 0)
  2. Temporarily block RPC port on the master: iptables -A INPUT -p tcp --dport 4979 -j DROP
  3. Wait 2 minutes -- health checks should go Unknown
  4. Unblock: iptables -D INPUT -p tcp --dport 4979 -j DROP
  5. Expected: Health checks return to green within 60-90 seconds (the RPC timeout fires, goroutines retry on next interval)
  6. Check delegate logs for timeout warnings: journalctl -u serviced | grep "Failed to proxy ReportHealthStatus"

Test 2: Service stop completes promptly

  1. With services running, stop a service: serviced service stop <serviceID>
  2. Expected: Service stops within 30 seconds (the 10s ReportInstanceDead timeout fires, then the 30s deadman switch, then exit)
  3. Before fix: Would hang indefinitely

Test 3: No panic on normal shutdown

  1. Start a service with health checks
  2. Stop it: serviced service stop <serviceID>
  3. Check container logs: docker logs <containerID> 2>&1 | grep -i panic
  4. Expected: No panic messages (sync.Once prevents double-close)

Test 4: Health checks work normally under steady state

  1. Deploy all services
  2. Confirm all health checks are green in the UI
  3. Wait 30 minutes, verify they remain green
  4. Expected: No regressions in normal health check reporting

Evidence from Live Investigation

Collected on uodnorig-cctest1-test12-cz9-delegate-mw3c, container 0fc77b620707 (Impact service 6xklm7bv3nzc9pf9wu8ak1yb0):

  • All 3 health checks showed status 3 (Unknown) for 6+ hours while the container was fully functional
  • ss -tnop showed 5 ESTAB connections from container to delegate RPC port, all idle with keepalive timers
  • tcpdump captured zero packets >500 bytes over 120 seconds (no JWT-bearing health reports sent)
  • ss -tnop from delegate to master showed 3 ESTAB connections, also all idle -- confirming the delegate->master leg was stuck too
  • tcpdump from delegate to master also showed zero large packets
  • docker logs showed health check goroutines launched at startup with no activity since
  • Master logs had zero ReportHealthStatus entries for the affected service

…t goroutine freeze

Health check goroutines inside serviced-controller could permanently
freeze when the RPC call chain (container -> delegate -> master) stalled.
The root cause was that every RPC call in the health reporting pipeline
used timeout=0, which reconnectingClient.Call() mapped to 365 days.
Once a call stalled (e.g., during delegate startup load), the goroutine
blocked forever, health reports stopped, and the master marked the
service as Unknown (status 3). Additionally, serviced service stop/restart
hung because ReportInstanceDead used the same infinite timeout, and the
healthExit channel was double-closed causing a panic.

Changes:
- Set 60s timeout on ReportHealthStatus RPC at all three layers:
  container->delegate (lbClient), delegate->master (health_client),
  and the delegate proxy handler (agent_proxy)
- Set 10s timeout on ReportInstanceDead RPC at all layers
- Wrap ReportInstanceDead at shutdown with goroutine + select guard
- Fix double-close of healthExit channel using sync.Once
- Add pre-report cancellation check in health.Ping() so goroutines
  exit promptly on SIGTERM instead of entering a blocking RPC call
- Log errors from ReportHealthStatus and ReportInstanceDead that were
  previously silently dropped

Closes CC-4609

Made-with: Cursor