CC-4609: fix health check goroutine freeze and service stop hang#3944
Open
Deer-WarLord wants to merge 1 commit into
Conversation
…t goroutine freeze

Health check goroutines inside serviced-controller could permanently freeze when the RPC call chain (container -> delegate -> master) stalled. The root cause was that every RPC call in the health reporting pipeline used timeout=0, which reconnectingClient.Call() mapped to 365 days. Once a call stalled (e.g., during delegate startup load), the goroutine blocked forever, health reports stopped, and the master marked the service as Unknown (status 3).

Additionally, serviced service stop/restart hung because ReportInstanceDead used the same infinite timeout, and the healthExit channel was double-closed, causing a panic.

Changes:
- Set 60s timeout on ReportHealthStatus RPC at all three layers: container->delegate (lbClient), delegate->master (health_client), and the delegate proxy handler (agent_proxy)
- Set 10s timeout on ReportInstanceDead RPC at all layers
- Wrap ReportInstanceDead at shutdown with a goroutine + select guard
- Fix double-close of the healthExit channel using sync.Once
- Add a pre-report cancellation check in health.Ping() so goroutines exit promptly on SIGTERM instead of entering a blocking RPC call
- Log errors from ReportHealthStatus and ReportInstanceDead that were previously silently dropped

Closes CC-4609

Made-with: Cursor
Summary
- Health check goroutines inside serviced-controller freezing permanently when the RPC reporting chain stalls, leaving the service's health checks stuck at Unknown
- `serviced service stop`/`restart` hanging indefinitely when the container's `ReportInstanceDead` RPC stalls
- `panic: close of closed channel` on `healthExit` during normal shutdown

Root Cause
Every RPC call in the health check reporting pipeline (container -> delegate -> master) used timeout=0, which `reconnectingClient.Call()` mapped to 365 days. When any leg of this chain stalled (e.g., during heavy delegate startup load), the calling goroutine blocked permanently. Since `health.Ping()` calls `report()` synchronously, the goroutine could never receive the `<-cancel` signal, causing health check reports to stop forever.

The same infinite timeout affected `ReportInstanceDead` at shutdown, causing `serviced service stop` to hang until Docker's kill timeout fired.

Additionally, `healthExit` was closed in two places (a `defer` in `Run()` and explicitly in `shutdownService`), causing a panic on normal shutdown.

Affected RPC Chain
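The chain, as reconstructed from the commit message (each arrow is an RPC hop):

```
container (lbClient) -> delegate proxy (agent_proxy) -> master (health_client)
     ReportHealthStatus (60s timeout) / ReportInstanceDead (10s timeout)
```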
Changes
- `rpc/master/health_client.go`: 60s timeout on `ReportHealthStatus`, 10s on `ReportInstanceDead` (delegate -> master)
- `node/lbClient.go`: 60s timeout on `ReportHealthStatus`, 10s on `ReportInstanceDead` (container -> delegate)
- `node/agent_proxy.go`: the same timeouts applied to the proxied calls in the delegate's handler; proxy failures are now logged
- `health/check.go`: pre-report cancellation check in `Ping()` so goroutines exit promptly on SIGTERM
- `container/controller.go`: fix `healthExit` double-close with `sync.Once`; wrap `ReportInstanceDead` with a 10s timeout guard; log `ReportHealthStatus` errors

A Go sketch of these shutdown-path patterns follows.
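A minimal sketch of the patterns described above, with invented names and a stand-in RPC function; this is not the actual serviced code:

```go
// Illustrative sketch only: names are invented, not the serviced source.
package main

import (
	"log"
	"sync"
	"time"
)

type controller struct {
	healthExit chan struct{}
	exitOnce   sync.Once // guards healthExit against a second close
}

// closeHealthExit may be called from both Run()'s defer and the stop
// path; sync.Once makes the close idempotent, avoiding the panic.
func (c *controller) closeHealthExit() {
	c.exitOnce.Do(func() { close(c.healthExit) })
}

// shutdown bounds the (possibly stalled) RPC with a goroutine + select
// guard so that service stop cannot hang on it.
func (c *controller) shutdown(reportInstanceDead func() error) {
	done := make(chan error, 1)
	go func() { done <- reportInstanceDead() }()
	select {
	case err := <-done:
		if err != nil {
			log.Printf("ReportInstanceDead failed: %v", err) // no longer silently dropped
		}
	case <-time.After(10 * time.Second):
		log.Print("ReportInstanceDead timed out; continuing shutdown")
	}
	c.closeHealthExit()
}

// ping sketches the pre-report cancellation check: test the cancel
// channel before every report so SIGTERM is honored promptly instead of
// entering a blocking RPC call.
func ping(cancel <-chan struct{}, interval time.Duration, report func() error) {
	for {
		select {
		case <-cancel:
			return
		default:
		}
		if err := report(); err != nil {
			log.Printf("ReportHealthStatus failed: %v", err)
		}
		select {
		case <-cancel:
			return
		case <-time.After(interval):
		}
	}
}

func main() {
	c := &controller{healthExit: make(chan struct{})}
	c.shutdown(func() error { return nil })
	c.closeHealthExit() // second close is a no-op thanks to sync.Once
}
```

Note that the `done` channel is buffered, so the RPC goroutine can still deliver its result and exit even after the timeout branch wins; the guard itself leaks nothing.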
Reproducing Before Fix

Symptoms
- `serviced service status <serviceID>` shows status `3` for health checks
- `serviced service stop <serviceID>` or `serviced service restart <serviceID>` hangs indefinitely

Diagnostic Commands (on the delegate hosting the affected container)
Confirm health checks are Unknown:
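For example (status `3` maps to Unknown):

```
serviced service status <serviceID>
```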
Confirm scripts work inside container:
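One way, assuming shell access to the container (the health check command itself is service-specific):

```
docker exec -it <containerID> /bin/sh
# inside the container, run the configured health check command by hand;
# exit status 0 means the script itself still works
```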
Confirm container launched health check goroutines but nothing since:
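For example, using the same log check as the validation steps below:

```
docker logs <containerID> 2>&1 | grep -i health
# expect health check launch entries at container startup and no activity afterwards
```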
Confirm no RPC data flowing from container to delegate:
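A sketch using `ss` as in the live investigation, assuming the same RPC port used in the iptables step below (4979):

```
ss -tnop | grep :4979
# before the fix: connections ESTAB but idle, only keepalive timers running
```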
Confirm no large packets (health reports carry >1200 byte JWT):
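A sketch using `tcpdump` as in the live investigation:

```
# capture for ~120s; real health reports carry a >1200-byte JWT,
# so the absence of any packet over 500 bytes means no reports are being sent
tcpdump -i any 'tcp port 4979 and greater 500'
```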
Confirm delegate -> master path is also stuck:
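The same two checks, pointed at the master (the master address is environment-specific):

```
ss -tnop dst <masterIP> | grep :4979
tcpdump -i any 'host <masterIP> and tcp port 4979 and greater 500'
```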
Confirm service stop hangs:
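For example:

```
time serviced service stop <serviceID>
# before the fix this hangs indefinitely instead of returning
```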
Validating After Fix
Build and deploy the fix
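One plausible sequence; exact build and packaging steps vary by environment:

```
# rebuild serviced with the patch and restart the delegate's daemon
cd $GOPATH/src/github.com/control-center/serviced
make
systemctl restart serviced
```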
Test 1: Health checks recover from transient RPC failures
- Block the RPC port to induce a transient failure: `iptables -A INPUT -p tcp --dport 4979 -j DROP`
- After the 60s timeouts fire, unblock: `iptables -D INPUT -p tcp --dport 4979 -j DROP`
- Confirm the failures were logged rather than hanging: `journalctl -u serviced | grep "Failed to proxy ReportHealthStatus"`

Test 2: Service stop completes promptly
- Run `serviced service stop <serviceID>`
- The stop completes in bounded time (the 10s `ReportInstanceDead` timeout fires, then the 30s deadman switch, then exit)

Test 3: No panic on normal shutdown
- Run `serviced service stop <serviceID>`
- `docker logs <containerID> 2>&1 | grep -i panic` shows nothing (`sync.Once` prevents the double-close)

Test 4: Health checks work normally under steady state

- Leave the service running and confirm `serviced service status <serviceID>` no longer shows status `3` and that health checks keep reporting over time
Evidence from Live Investigation
Collected on `uodnorig-cctest1-test12-cz9-delegate-mw3c`, container `0fc77b620707` (Impact service `6xklm7bv3nzc9pf9wu8ak1yb0`):

- `ss -tnop` showed 5 ESTAB connections from container to delegate RPC port, all idle with keepalive timers
- `tcpdump` captured zero packets >500 bytes over 120 seconds (no JWT-bearing health reports sent)
- `ss -tnop` from delegate to master showed 3 ESTAB connections, also all idle, confirming the delegate->master leg was stuck too
- `tcpdump` from delegate to master also showed zero large packets
- `docker logs` showed health check goroutines launched at startup with no activity since
- No `ReportHealthStatus` entries for the affected service appeared in the logs