Fix/file watcher race #318
GrpcServer::run() did not check the return value of BuildAndStart(). A partial cert write during rotation causes OpenSSL to reject the PEM silently, returning nullptr. The null was stored in m_server with state set to RUNNING, deferring a SIGSEGV to the next shutdown() call. FileWatcher::on_modified_event() read updated file contents into a local copy but never wrote them back to m_files, causing every subsequent IN_MODIFY to appear as a content change and re-fire all callbacks. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…le cache GrpcServerRunTest/throws_on_invalid_tls: verify that GrpcServer::run() throws std::runtime_error when BuildAndStart() returns null due to an invalid TLS cert/key, rather than storing the null and crashing later. FileWatcherTest/no_refire_on_repeated_write: verify that writing the same file content a second time does not re-fire the change callback. Previously, on_modified_event never persisted updated contents to the m_files map, so every IN_MODIFY always appeared as a content change. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude PR review
    m_server = m_builder.BuildAndStart();
    if (!m_server) {
        throw std::runtime_error("BuildAndStart failed — TLS config is invalid (partial or mismatched cert/key)");
Would it be worth adding a retry here before throwing a runtime_error, since it may take time for the TLS config to be loaded or rotated?
The only "fix" is a working cert and key pair. We must wait for the next file event to correct some part of the situation; otherwise what would we run, an unauthenticated server? Hopefully the data is right behind this, in the next file wakeup.
It's already debounced, so "time" has already been considered.
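To make the trade-off in this thread concrete, here is a minimal sketch of the "throw, then wait for the next file event" behaviour. Everything in it is a stand-in: FakeBuilder, FakeServer, SketchServer and try_start are hypothetical names that only model the null return of BuildAndStart(); they are not the real gRPC types or the PR's actual classes.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical stand-ins for the real server and builder types.
struct FakeServer {
    void Shutdown() {}
};

struct FakeBuilder {
    bool tls_valid = true;  // assumption: models a complete vs. partially written cert/key
    std::unique_ptr<FakeServer> BuildAndStart() {
        if (!tls_valid) {
            return nullptr;  // models the silent PEM rejection
        }
        return std::make_unique<FakeServer>();
    }
};

enum class State { INITED, RUNNING };

class SketchServer {
public:
    explicit SketchServer(FakeBuilder builder) : m_builder(builder) {}

    void run() {
        m_server = m_builder.BuildAndStart();
        if (!m_server) {
            // Fail before the state changes, so shutdown() stays a no-op.
            throw std::runtime_error("BuildAndStart failed");
        }
        m_state = State::RUNNING;
    }

    void shutdown() {
        if (m_state != State::RUNNING) {
            return;  // nothing was started, nothing to dereference
        }
        m_server->Shutdown();
        m_state = State::INITED;
    }

    State state() const { return m_state; }

private:
    FakeBuilder m_builder;
    std::unique_ptr<FakeServer> m_server;
    State m_state = State::INITED;
};

// Hypothetical caller: no retry loop. On failure it returns nullptr and the
// caller simply waits for the next (debounced) file event to try again.
std::unique_ptr<SketchServer> try_start(FakeBuilder builder) {
    auto server = std::make_unique<SketchServer>(builder);
    try {
        server->run();
    } catch (const std::runtime_error&) {
        return nullptr;
    }
    return server;
}
```

With this shape, a bad cert/key pair never yields a server in the RUNNING state, and the next file event is the natural retry point.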
Here are a couple of other potential concerns with the PR
Thanks for the thorough review. Addressing each concern:
Factual corrections
"Unsynchronized access to
"Busy-waiting / poll() wakes 200 times":
"lock/unlock/lock on a single lock object":
Valid points and how they are addressed
File deleted while pending:
Memory for large files: already bounded by the pre-existing 1MB limit enforced in
O(N) iteration over all watched files: real observation. FileWatcher is designed for a small, fixed set of files (in our case, two cert files). At that scale the iteration cost is negligible. Worth noting in a follow-up if usage patterns change.
A debounce timing test: fair. The existing
Out of scope / not applicable
TOCTOU / symlink attack: FileWatcher watches specific system-configured paths (TLS cert files), not arbitrary user-supplied paths. Path canonicalization was not done before this PR and is not introduced or worsened by it.
"Root cause not addressed; debounce treats symptoms": inotify has no write-complete event. Debounce after the last IN_MODIFY in a burst is the standard and correct solution at this layer. The null-check in
Problem
During TLS cert rotation, two bugs combined to crash access-mgr pods with SIGSEGV (confirmed in production):
Bug 1: FileWatcher stale filecontents cache. on_modified_event() read updated file contents into a local FileInfo copy but never wrote them back to m_files. Every subsequent IN_MODIFY compared against the original registered contents, treating each event as a content change and re-firing all callbacks. A single cert rotation produced 8 concurrent spinupGrpcServer calls (2 files × 2 stale-cache re-fires × 2 listeners).
Bug 2: GrpcServer::run() did not null-check BuildAndStart(). If the cert/key file is partially written at callback time, OpenSSL silently rejects the PEM and BuildAndStart() returns nullptr. The old code stored that null in m_server with state RUNNING, deferring a crash to the next shutdown() call via m_server->Shutdown() → SIGSEGV.
Changes
GrpcServer::run(): throw std::runtime_error if BuildAndStart() returns nullptr, before state is set to RUNNING. The zombie GrpcServer is then safely destroyed (shutdown() is a no-op in INITED state).
FileWatcher::on_modified_event(): persist updated filecontents back to m_files under the map lock so subsequent IN_MODIFY events correctly detect no change and do not re-fire.
FileWatcher::start(uint32_t debounce_ms = 200): add a configurable quiet-period debounce. IN_MODIFY events now mark the file as pending; poll() uses a dynamic timeout; content is read and callbacks fire only after no new events arrive for debounce_ms. This closes the partial-write race for large files (e.g. real TLS certs written over multiple write() syscalls). Deletion events are still dispatched immediately. All existing call sites are source-compatible (default parameter).
Tests
GrpcServerRunTest/throws_on_invalid_tls: verifies run() throws on an invalid TLS cert/key pair rather than storing a null m_server.
FileWatcherTest/no_refire_on_repeated_write: verifies that writing identical content a second time does not re-fire the change callback.
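The two FileWatcher changes (cache write-back and quiet-period debounce) can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: DebouncedWatcher, add, on_modify and flush are hypothetical names, the clock and the file reader are injected so the logic can be exercised without inotify or poll(), and locking is omitted. It is not the PR's actual implementation.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified model of the debounce and write-back logic.
class DebouncedWatcher {
public:
    using Clock = std::chrono::steady_clock;
    using Reader = std::function<std::string(const std::string&)>;

    explicit DebouncedWatcher(std::chrono::milliseconds debounce)
        : m_debounce(debounce) {}

    // Register a file together with its initial contents.
    void add(const std::string& path, std::string contents) {
        m_files[path] = std::move(contents);
    }

    // For each modify event, only record when the latest event arrived.
    void on_modify(const std::string& path, Clock::time_point now) {
        if (m_files.count(path) != 0) {
            m_pending[path] = now;
        }
    }

    // Called whenever the wait loop wakes up. Returns the paths whose
    // contents really changed and whose burst of events has gone quiet.
    std::vector<std::string> flush(Clock::time_point now, const Reader& read) {
        std::vector<std::string> changed;
        for (auto it = m_pending.begin(); it != m_pending.end();) {
            if (now - it->second < m_debounce) {
                ++it;  // still inside the quiet period, keep waiting
                continue;
            }
            std::string fresh = read(it->first);
            std::string& cached = m_files[it->first];
            if (fresh != cached) {
                cached = std::move(fresh);  // write-back: the step Bug 1 was missing
                changed.push_back(it->first);
            }
            it = m_pending.erase(it);
        }
        return changed;
    }

private:
    std::chrono::milliseconds m_debounce;
    std::map<std::string, std::string> m_files;          // path -> cached contents
    std::map<std::string, Clock::time_point> m_pending;  // path -> last event time
};
```

In this model a file is reported once per real content change: events that arrive inside the quiet period only push the deadline back, and a later write of identical content compares equal to the refreshed cache and is dropped.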