CC-4609: fix health check goroutine freeze and service stop hang#3944

Open
Deer-WarLord wants to merge 1 commit into develop from fix/CC-4609

Conversation

@Deer-WarLord
Contributor

Summary

  • Fix permanently frozen health check goroutines that cause services to show Unknown (status 3) in the UI despite healthy containers
  • Fix serviced service stop/restart hanging indefinitely when the container's ReportInstanceDead RPC stalls
  • Fix panic: close of closed channel on healthExit during normal shutdown

Root Cause

Every RPC call in the health check reporting pipeline (container -> delegate -> master) used timeout=0, which reconnectingClient.Call() mapped to 365 days. When any leg of this chain stalled (e.g., during heavy delegate startup load), the calling goroutine blocked permanently. Since health.Ping() calls report() synchronously, the goroutine could never receive the <-cancel signal, causing health check reports to stop forever.
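
For illustration, here is a minimal Go sketch of that failure mode (the names pingLoop, report, and cancel are hypothetical, not serviced's actual identifiers): because the reporter is invoked synchronously, a call that never returns also keeps the loop from ever reaching the <-cancel case. The pre-report check added in health/check.go is shown as well.

package health // illustrative package name

import (
	"log"
	"time"
)

// pingLoop mimics the health.Ping() pattern described above: report()
// is called synchronously, so if it blocks forever the <-cancel case
// below can never be selected and the goroutine freezes.
func pingLoop(interval time.Duration, cancel <-chan struct{}, report func() error) {
	tick := time.NewTicker(interval)
	defer tick.Stop()
	for {
		// Pre-report cancellation check (the health/check.go fix):
		// exit promptly on SIGTERM instead of entering a blocking RPC.
		select {
		case <-cancel:
			return
		default:
		}
		select {
		case <-cancel:
			return
		case <-tick.C:
			if err := report(); err != nil {
				log.Printf("health report failed: %v", err)
			}
		}
	}
}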

The same infinite timeout affected ReportInstanceDead at shutdown, causing serviced service stop to hang until Docker's kill timeout fired.

Additionally, healthExit was closed in two places (defer in Run() and explicitly in shutdownService), causing a panic on normal shutdown.
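
A minimal sketch of the idempotent-close pattern used for the fix (the struct and method names here are illustrative; container/controller.go wires this into the real controller):

package container // illustrative package name

import "sync"

type healthRunner struct {
	healthExit chan struct{}
	exitOnce   sync.Once
}

// closeHealthExit can be reached from both Run()'s defer and
// shutdownService; sync.Once makes the close idempotent, so the
// second caller is a no-op instead of a panic.
func (h *healthRunner) closeHealthExit() {
	h.exitOnce.Do(func() { close(h.healthExit) })
}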

Affected RPC Chain

Container (serviced-controller):  doHealthCheck() -> LBClient.ReportHealthStatus  [timeout was 0 = 365 days]
Delegate (serviced agent):        agent_proxy.go -> master.Client.ReportHealthStatus  [timeout was 0 = 365 days]
Master:                           health_server.go -> facade.ReportHealthStatus
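
The mapping itself is roughly this (a hedged sketch; the real logic lives in serviced's reconnectingClient and may differ in detail):

package rpcutils // illustrative package name

import "time"

// effectiveTimeout sketches the behavior described above: a zero
// timeout was treated as "wait up to a year", i.e. effectively forever.
func effectiveTimeout(timeout time.Duration) time.Duration {
	if timeout == 0 {
		return 365 * 24 * time.Hour
	}
	return timeout
}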

Changes

rpc/master/health_client.go: Set 60s timeout on ReportHealthStatus and 10s on ReportInstanceDead (delegate -> master)
node/lbClient.go: Set 60s timeout on ReportHealthStatus and 10s on ReportInstanceDead (container -> delegate)
node/agent_proxy.go: Log warnings when delegate proxy calls to the master fail or time out
health/check.go: Add a pre-report cancellation check in Ping() so goroutines exit promptly on SIGTERM (see the sketch under Root Cause above)
container/controller.go: Fix the healthExit double-close with sync.Once, wrap ReportInstanceDead in a 10s timeout guard (sketched below), and log ReportHealthStatus errors
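
The controller.go shutdown guard follows the goroutine-plus-select pattern below (a minimal sketch; the function name and wiring are hypothetical, only the pattern and the 10s bound reflect the change):

package container // illustrative package name

import (
	"log"
	"time"
)

// reportInstanceDeadGuarded bounds the shutdown-path RPC so that a
// stalled delegate can no longer hang serviced service stop. The
// buffered channel lets the goroutine finish even after a timeout.
func reportInstanceDeadGuarded(report func() error) {
	done := make(chan error, 1)
	go func() { done <- report() }()
	select {
	case err := <-done:
		if err != nil {
			log.Printf("ReportInstanceDead failed: %v", err)
		}
	case <-time.After(10 * time.Second):
		log.Print("ReportInstanceDead timed out after 10s; continuing shutdown")
	}
}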

Reproducing Before Fix

Symptoms

  1. In the serviced UI, one or more services show "Missing Some Health Checks" with all health checks in Unknown (status 3)
  2. serviced service status <serviceID> shows status 3 for health checks
  3. Manually running the health check script inside the container succeeds
  4. serviced service stop <serviceID> or serviced service restart <serviceID> hangs indefinitely

Diagnostic Commands (on the delegate hosting the affected container)

Confirm health checks are Unknown:

serviced service status <serviceID>
# Health checks show status 3 (Unknown)

Confirm scripts work inside container:

serviced service attach <serviceID>
# Run the health check script manually -- should succeed

Confirm container launched health check goroutines but nothing since:

docker logs <containerID> 2>&1 | grep -i "health"
# Should show "Kicking off health check" at startup and nothing after

Confirm no RPC data flowing from container to delegate:

ss -tnop | grep <container_IP> | grep 4979
# All connections show timer:(keepalive,...) with zero send/recv queues

Confirm no large packets (health reports carry >1200 byte JWT):

timeout 60 tcpdump -i docker0 -n "host <container_IP> and port 4979 and greater 500" -c 5
# Zero packets captured = goroutines stuck

Confirm delegate -> master path is also stuck:

ss -tnop | grep "<master_IP>:4979"
# Connections with keepalive timers only, zero data

Confirm service stop hangs:

serviced service stop <serviceID>
# Hangs in "stopping" state until Docker's kill timeout

Validating After Fix

Build and deploy the fix

make
# Deploy the updated serviced binary to the master and delegates, and
# update the serviced-controller image in the docker registry

Test 1: Health checks recover from transient RPC failures

  1. Deploy services and confirm all health checks are green (status 0)
  2. Temporarily block RPC port on the master: iptables -A INPUT -p tcp --dport 4979 -j DROP
  3. Wait 2 minutes -- health checks should go Unknown
  4. Unblock: iptables -D INPUT -p tcp --dport 4979 -j DROP
  5. Expected: Health checks return to green within 60-90 seconds (the RPC timeout fires, goroutines retry on next interval)
  6. Check delegate logs for timeout warnings: journalctl -u serviced | grep "Failed to proxy ReportHealthStatus"

Test 2: Service stop completes promptly

  1. With services running, stop a service: serviced service stop <serviceID>
  2. Expected: Service stops within 30 seconds (the 10s ReportInstanceDead timeout fires, then the 30s deadman switch, then exit)
  3. Before fix: Would hang indefinitely

Test 3: No panic on normal shutdown

  1. Start a service with health checks
  2. Stop it: serviced service stop <serviceID>
  3. Check container logs: docker logs <containerID> 2>&1 | grep -i panic
  4. Expected: No panic messages (sync.Once prevents double-close)

Test 4: Health checks work normally under steady state

  1. Deploy all services
  2. Confirm all health checks are green in the UI
  3. Wait 30 minutes, verify they remain green
  4. Expected: No regressions in normal health check reporting

Evidence from Live Investigation

Collected on uodnorig-cctest1-test12-cz9-delegate-mw3c, container 0fc77b620707 (Impact service 6xklm7bv3nzc9pf9wu8ak1yb0):

  • All 3 health checks showed status 3 (Unknown) for 6+ hours while the container was fully functional
  • ss -tnop showed 5 ESTAB connections from container to delegate RPC port, all idle with keepalive timers
  • tcpdump captured zero packets >500 bytes over 120 seconds (no JWT-bearing health reports sent)
  • ss -tnop from delegate to master showed 3 ESTAB connections, also all idle -- confirming the delegate->master leg was stuck too
  • tcpdump from delegate to master also showed zero large packets
  • docker logs showed health check goroutines launched at startup with no activity since
  • Master logs had zero ReportHealthStatus entries for the affected service

…t goroutine freeze

Health check goroutines inside serviced-controller could permanently
freeze when the RPC call chain (container -> delegate -> master) stalled.
The root cause was that every RPC call in the health reporting pipeline
used timeout=0, which reconnectingClient.Call() mapped to 365 days.
Once a call stalled (e.g., during delegate startup load), the goroutine
blocked forever, health reports stopped, and the master marked the
service as Unknown (status 3). Additionally, serviced service stop/restart
hung because ReportInstanceDead used the same infinite timeout, and the
healthExit channel was double-closed causing a panic.

Changes:
- Set 60s timeout on ReportHealthStatus RPC at all three layers:
  container->delegate (lbClient), delegate->master (health_client),
  and the delegate proxy handler (agent_proxy)
- Set 10s timeout on ReportInstanceDead RPC at all layers
- Wrap ReportInstanceDead at shutdown with goroutine + select guard
- Fix double-close of healthExit channel using sync.Once
- Add pre-report cancellation check in health.Ping() so goroutines
  exit promptly on SIGTERM instead of entering a blocking RPC call
- Log errors from ReportHealthStatus and ReportInstanceDead that were
  previously silently dropped

Closes CC-4609

Made-with: Cursor