fix: free bulkT on dispatch abort to prevent memory leak under load #6747
ycombinator wants to merge 8 commits into elastic:main from
Conversation
This pull request does not have a backport label. Could you fix it @ycombinator? 🙏
internal/pkg/bulk/engine.go
Outdated
```go
go func() {
	select {
	case <-blk.ch:
		b.freeBlk(blk)
	case <-time.After(3 * defaultFlushInterval):
	}
```
Instead of using a time based approach, is it possible to close blk.ch when the context is cancelled and detect/drain it here?
I'm not sure closing blk.ch when the context is cancelled in this function is a good idea. The senders on this channel, e.g. from the flushBulk method in opBulk.go, would still keep sending, and sending on a closed channel panics.
```go
b.ReportAllocs()
b.ResetTimer()

for i := 0; i < b.N; i++ {
```
Please use the newer b.Loop() mechanism
```go
b.ReportAllocs()
b.ResetTimer()

for i := 0; i < b.N; i++ {
```
same here, b.Loop() can be used
michel-laterman left a comment
new tests are not using the require package like all the other tests?
I'm also concerned about the sleep times in the tests, which can easily lead to flakey builds in BK.
internal/pkg/bulk/dispatch_test.go
Outdated
```go
if resp.err == nil {
	t.Fatal("expected error from cancelled context")
}
```
internal/pkg/bulk/dispatch_test.go
Outdated
```go
if reused.buf.Len() != 0 {
	t.Fatal("expected reused blk to have reset buf")
}
if reused.action != 0 {
	t.Fatal("expected reused blk to have reset action")
}
```
internal/pkg/bulk/dispatch_test.go
Outdated
```go
// Let dispatch enqueue, then cancel context to trigger Phase 2 abort.
go func() {
	time.Sleep(20 * time.Millisecond)
```
why was 20ms chosen? timeouts like this can be really problematic in buildkite.
Is there another way to detect the dispatch?
Rewrote the test to not rely on timeouts in fc690ea.
internal/pkg/bulk/dispatch_test.go
Outdated
```go
blk.ch <- respT{}

// Give the drain goroutine time to complete.
time.Sleep(50 * time.Millisecond)
```
is 50ms the dispatch time? Why was this used?
Rewrote the test to not rely on timeouts in fc690ea.
Force-pushed 129fc50 to 1f40cf4 (Compare)
When dispatch() aborts due to context cancellation, the bulkT object was never returned to the sync.Pool. Under high concurrency (30k+ agents), thousands of blk objects leaked per minute during checkin storms, causing memory growth until OOM. When the abort occurs before the blk is enqueued, free it immediately since the Run loop never received it: https://github.com/elastic/fleet-server/blob/0fec36cc/internal/pkg/bulk/engine.go#L607-L615 When the abort occurs after the blk is enqueued, spawn a small drain goroutine that waits for the flush response, then frees the blk: https://github.com/elastic/fleet-server/blob/0fec36cc/internal/pkg/bulk/engine.go#L630-L638 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use 3x the flush interval (15s) instead of the flush context timeout (1m) for the drain goroutine's safety timeout. This keeps abandoned goroutines short-lived while still giving enough time for the Run loop to process the blk under heavy load. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use Go 1.24's b.Loop() mechanism in the dispatch benchmarks instead of the legacy `for i := 0; i < b.N; i++` loop. Because b.Loop() makes b.N unknown up-front, BenchmarkDispatchAbortResponse can no longer pre-size the block queue; instead, a drain goroutine empties bulker.ch so each iteration reliably enters Phase 2. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace hand-rolled error/value checks in the dispatch tests with require.Error / require.NoError / require.Zero, matching the convention used elsewhere in the repo. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous Phase 2 abort test used time.Sleep to guess when dispatch had reached its second select and how long the drain goroutine needed to run, making it prone to flakiness on Buildkite. Extract the inline drain goroutine in dispatch() into a method drainAndFreeAbortedBlk so it can be invoked synchronously from tests. Also promote 3 * defaultFlushInterval to a named constant dispatchAbortDrainTimeout. Split the old test into two:

- TestDispatchAbortResponseReachesPhase2 verifies that dispatch() returns ctx.Err() after the Run loop has taken the blk, using channel-based synchronization (no sleeps) to deterministically time the cancel.
- TestDrainAndFreeAbortedBlkResponse invokes drainAndFreeAbortedBlk synchronously, so the test goroutine is the sole owner of blk and can inspect its reset fields race-free (blk.reset is only called from freeBlk, so the reset observation proves the free happened).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace "Phase 1" / "Phase 2" terminology in dispatch-related comments and test names with references to dispatch's first and second select() blocks, which is more self-explanatory for readers unfamiliar with the prior review discussion. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Explain why ctx is cancelled before dispatch is invoked: to make the first select() deterministically take the abort path so blk is never enqueued. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a comment on bulkT.reset noting that only freeBlk should call it, since TestDrainAndFreeAbortedBlkResponse relies on that invariant to use reset as an observable proxy for freeBlk having run. Also split the two concerns in the test comment — synchronous execution (for race-free reads) and the reset-as-proxy assertion — onto separate lines so they're not conflated. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed 1f40cf4 to a4f4564 (Compare)
What is the problem this PR solves?
During a 30k Serverless scale test, 22 of 39 fleet-server pods were OOMKilled. Analysis of the captured pod logs showed:
- `dispatch()` waiting to enqueue onto the bulk engine's channel (capacity 32).
- `bulkT` objects allocated via `sync.Pool` in `dispatch()`. On the success path, callers (`waitBulkAction`, `Search`, `ReadRaw`) return the `blk` to the pool via `freeBlk`. But when `dispatch()` aborted due to context cancellation (agent disconnect), neither `dispatch` nor the callers freed the `blk`, causing it to leak.
- `blk` objects leaked per minute until pods hit their memory limit (~154 Mi) and were killed.

How does this PR solve the problem?
When `dispatch()` aborts due to context cancellation, the `bulkT` object is never returned to the `sync.Pool`. Under high concurrency (30k+ agents), thousands of `blk` objects leak per minute during checkin storms, causing memory growth until OOM.

The abort can occur in two places:
- When the abort occurs before the `blk` is enqueued, free it immediately since the `Run` loop never received it: `internal/pkg/bulk/engine.go`, lines 607 to 616 in 7c563cd
- When the abort occurs after the `blk` is enqueued, spawn a small drain goroutine that waits for the flush response, then frees the `blk`: `internal/pkg/bulk/engine.go`, lines 631 to 651 in 7c563cd
How to test this PR locally
Benchmark results
Before
After
The AbortResponse path shows higher B/op because it includes the drain goroutine overhead, but the `blk` gets reclaimed.

Design Checklist
Checklist
Related issues
🤖 Generated with Claude Code