Add online shared_buffers resize without server restart #13
Open
Comprehensive design proposal for making shared_buffers dynamically adjustable via SIGHUP without requiring a PostgreSQL restart.

Covers:
- Detailed analysis of all NBuffers-dependent data structures and code paths
- Cross-system prior art (MySQL/InnoDB, Oracle SGA, SQL Server, MariaDB)
- Phased implementation: virtual address reservation, online grow, online shrink, dynamic hash table resizing
- ProcSignalBarrier-based coordination protocol
- 15 edge/corner cases analyzed (concurrent resize, crash recovery, pinned condemned buffers, huge pages, AIO interactions, etc.)
- Portability layer for Linux, FreeBSD, macOS, Windows
- Performance impact analysis with zero steady-state overhead goal
- Testing strategy covering unit, concurrency, crash recovery, and stress
- References to Dmitry Dolgov's active pgsql-hackers RFC patch series

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
Add infrastructure for dynamically resizing the shared buffer pool
via SIGHUP, without requiring a PostgreSQL restart. This implements
Phases 1-3 from the design document.
Key changes:
New GUC: max_shared_buffers (PGC_POSTMASTER)
When set > shared_buffers, reserves virtual address space at startup
for buffer pool arrays, enabling online resize up to that limit.
Default 0 means same as shared_buffers (no online resize, preserving
current behavior).
Changed GUC: shared_buffers (PGC_POSTMASTER -> PGC_SIGHUP)
Now dynamically adjustable via ALTER SYSTEM + pg_reload_conf() when
max_shared_buffers is configured. Check/assign hooks validate limits
and initiate resize requests.
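The limit validation described for the check hook can be sketched as a standalone function. This is only an illustration: the name demo_check_shared_buffers and its parameters are hypothetical, the signature is not PostgreSQL's real GUC check-hook signature, and treating max_shared_buffers = 0 as "reject any change" is an assumption drawn from the "no online resize" wording above.

```c
#include <assert.h>
#include <stdbool.h>

/* shared_buffers has a GUC minimum of 16 buffers. */
#define DEMO_MIN_BUFFERS 16

/*
 * Hypothetical validation helper (not the patch's hook).
 *
 * newval          requested shared_buffers, in buffers
 * startup_buffers shared_buffers value the server started with
 * max_buffers     max_shared_buffers, or 0 when not configured
 *
 * Returns true when the requested value is acceptable.
 */
static bool
demo_check_shared_buffers(int newval, int startup_buffers, int max_buffers)
{
    assert(startup_buffers >= DEMO_MIN_BUFFERS);

    if (newval < DEMO_MIN_BUFFERS)
        return false;

    /* Assumption: without a reservation, the size cannot change online. */
    if (max_buffers == 0)
        return newval == startup_buffers;

    /* With a reservation, any size up to the reserved limit is allowed. */
    return newval <= max_buffers;
}
```

A call such as demo_check_shared_buffers(8192, 4096, 16384) accepts the request, while a value above the reserved limit is rejected, which corresponds to the "exceed-max rejection" case.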
Buffer pool memory management (buf_resize.c):
- BufferPoolReserveMemory(): reserves VA space via MAP_SHARED|MAP_NORESERVE
- Grow: commits memory, initializes new descriptors, emits barrier
- Shrink: drains condemned buffers (flush dirty, wait for unpins),
emits barrier, decommits memory via MADV_DONTNEED
- Zero steady-state overhead: base pointers never change, NBuffers
remains a plain int updated via ProcSignalBarrier
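The reserve/commit/decommit idea above can be illustrated with a small Linux-oriented sketch. The DemoPool type and the demo_pool_* functions are hypothetical names invented for this example and are not the functions in buf_resize.c; the sketch only shows that the base pointer stays fixed while the usable range grows and shrinks.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* MAP_ANONYMOUS / MAP_NORESERVE on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical illustration type, not the patch's control structure. */
typedef struct DemoPool
{
    char   *base;           /* never changes after reservation */
    size_t  block_size;     /* bytes per block, a multiple of the page size */
    size_t  max_blocks;     /* reserved capacity */
    size_t  cur_blocks;     /* blocks currently in use */
} DemoPool;

/* Reserve address space for max_blocks without touching any page. */
static int
demo_pool_reserve(DemoPool *pool, size_t block_size, size_t max_blocks)
{
    void   *p;

    assert(block_size % (size_t) sysconf(_SC_PAGESIZE) == 0);
    p = mmap(NULL, block_size * max_blocks, PROT_READ | PROT_WRITE,
             MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    pool->base = p;
    pool->block_size = block_size;
    pool->max_blocks = max_blocks;
    pool->cur_blocks = 0;
    return 0;
}

/* Grow: write to the new blocks so the kernel backs them with memory. */
static void
demo_pool_grow(DemoPool *pool, size_t new_blocks)
{
    assert(new_blocks >= pool->cur_blocks && new_blocks <= pool->max_blocks);
    for (size_t i = pool->cur_blocks; i < new_blocks; i++)
        memset(pool->base + i * pool->block_size, 0, pool->block_size);
    pool->cur_blocks = new_blocks;
}

/* Shrink: advise the kernel that the tail range is no longer needed. */
static int
demo_pool_shrink(DemoPool *pool, size_t new_blocks)
{
    size_t  len;

    assert(new_blocks <= pool->cur_blocks);
    len = (pool->cur_blocks - new_blocks) * pool->block_size;
    pool->cur_blocks = new_blocks;
    if (len == 0)
        return 0;
    return madvise(pool->base + new_blocks * pool->block_size, len,
                   MADV_DONTNEED);
}
```

The sketch uses MADV_DONTNEED because that is what this commit message names; a later commit in this PR reports that MADV_DONTNEED does not release shmem-backed pages on such a mapping and switches to MADV_REMOVE.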
Coordination protocol:
- New PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE barrier type
- ProcessBarrierBufferPoolResize() updates backend-local NBuffers
- Serialized resize operations with progress tracking
Hash table pre-sizing:
- When max_shared_buffers > shared_buffers, the buffer lookup hash
table is pre-sized for MaxNBuffers, avoiding rehashing on grow
Separate memory allocation path:
- When max_shared_buffers is configured, buffer arrays are allocated
from separately-mapped memory regions instead of main shmem segment
- BufferManagerShmemSize() excludes array sizes from main shmem
- Global pointers (BufferDescriptors, BufferBlocks, etc.) remain
stable across resize operations
https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
Major fixes to make the online buffer pool resize actually work:

1. GUC variable indirection: The GUC variable for shared_buffers was pointing directly at NBuffers, which caused the GUC mechanism to overwrite NBuffers on SIGHUP before any actual resize occurred. Introduce SharedBuffersGUC as the GUC target variable. NBuffers is now only updated by the resize code (or at startup).
2. Postmaster/child coordination: The assign hook now properly distinguishes between the postmaster (requests a resize) and child processes (read current_buffers from shared memory). Replaced the ProcSignalBarrier approach with SIGHUP-based coordination, since the postmaster lacks a PGPROC for ConditionVariable waits.
3. ExecuteBufferPoolResize in postmaster loop: Added a call to ExecuteBufferPoolResize() in process_pm_reload_request(), after ProcessConfigFile() but before SignalChildren(), ensuring the resize completes before children update their NBuffers.
4. MADV_POPULATE_WRITE fallback: Handle older kernels (pre-5.14) that don't support MADV_POPULATE_WRITE by falling back to manual page touching for memory commit.

Tested: grow (32MB->64MB->128MB), shrink (128MB->48MB), grow-back (48MB->96MB), exceed-max rejection, resize under concurrent pgbench load (zero failed transactions), clean shutdown/restart persistence.

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
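The MADV_POPULATE_WRITE fallback in item 4 could look roughly like the following. This is a sketch under assumptions: demo_commit_range is a hypothetical helper, not the patch's BufferPoolCommitMemory(), and it treats EINVAL from madvise() as "advice not supported by this kernel".

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: make sure [addr, addr + len) is backed by memory.
 * Returns 0 on success, -1 on a real failure (errno set by madvise).
 */
static int
demo_commit_range(char *addr, size_t len)
{
    size_t  pagesz = (size_t) sysconf(_SC_PAGESIZE);

    assert((uintptr_t) addr % pagesz == 0);

#ifdef MADV_POPULATE_WRITE
    /* Preferred path: ask the kernel to prefault the range writable. */
    if (madvise(addr, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    /* Kernels older than 5.14 reject the unknown advice with EINVAL. */
    if (errno != EINVAL)
        return -1;
#endif

    /*
     * Fallback: touch one byte per page. Reading and writing the byte
     * back faults the page in without changing its contents.
     */
    for (size_t off = 0; off < len; off += pagesz)
    {
        volatile char *p = addr + off;

        *p = *p;
    }
    return 0;
}
```

Writing each byte back to itself keeps the fallback safe to run over pages that already hold data, at the cost of one page fault per untouched page.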
…writer drain

Address code review findings across all severity levels:

CRITICAL:
- Add pg_read_barrier() in child processes after reading current_buffers atomic to pair with pg_write_barrier() in GrowBufferPool/ShrinkBufferPool, ensuring descriptor initialization is visible on ARM/POWER architectures
- Add rollback logic to BufferPoolCommitMemory() for partial MADV_POPULATE_WRITE failures (release already-committed pages via MADV_DONTNEED when a later array fails)

HIGH:
- Replace bare arithmetic with mul_size()/add_size() throughout BufferPoolReserveMemory, BufferPoolCommitMemory, BufferPoolDecommitMemory to detect integer overflow on 32-bit systems
- Fix stale comment claiming MAP_PRIVATE when code uses MAP_SHARED
- Fix header comment claiming BufResizeLock (LWLock) when struct uses mutex spinlock
- Update file header comment to describe SIGHUP-based coordination instead of removed ProcSignalBarrier approach

MEDIUM:
- Remove unused includes (postmaster/bgwriter.h, storage/ipc.h, storage/pg_shmem.h, storage/lwlock.h)
- Add comment on magic number 16 in Assert (matches GUC minimum)

Move buffer eviction from postmaster to bgwriter:
- ShrinkBufferPool now only updates NBuffers and records condemned range
- New BufPoolDrainCondemnedBuffers() runs in bgwriter main loop with full backend infrastructure (ResourceOwner, private refcounts)
- Fixes SIGSEGV crash when EvictUnpinnedBuffer was called from postmaster

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
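The barrier pairing from the CRITICAL item can be approximated in portable C11. This sketch does not use PostgreSQL's pg_write_barrier()/pg_read_barrier(); it substitutes release and acquire fences, which are at least as strong. All demo_* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define DEMO_MAX_BUFFERS 64

/* Hypothetical stand-in for a buffer descriptor. */
typedef struct DemoDesc
{
    int     initialized;
} DemoDesc;

static DemoDesc demo_descs[DEMO_MAX_BUFFERS];
static atomic_int demo_current_buffers;     /* plays the role of current_buffers */

/* Writer side (grow): initialize descriptors, then publish the new count. */
static void
demo_publish_grow(int new_count)
{
    int     old = atomic_load_explicit(&demo_current_buffers,
                                       memory_order_relaxed);

    assert(new_count >= old && new_count <= DEMO_MAX_BUFFERS);
    for (int i = old; i < new_count; i++)
        demo_descs[i].initialized = 1;

    /* Stands in for pg_write_barrier(): descriptor stores come first. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&demo_current_buffers, new_count,
                          memory_order_relaxed);
}

/* Reader side (child process): read the count, then the descriptors. */
static int
demo_read_buffer_count(void)
{
    int     n = atomic_load_explicit(&demo_current_buffers,
                                     memory_order_relaxed);

    /* Stands in for pg_read_barrier(): later loads see the init stores. */
    atomic_thread_fence(memory_order_acquire);
    return n;
}
```

Without the reader-side fence, a weakly ordered CPU could observe the new count yet still read a stale descriptor, which is the ARM/POWER hazard the fix targets.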
…sues

Round 2 code review fixes addressing findings from 5 focused review agents.

CRITICAL fixes:
- Bgwriter stale decommit race: BufPoolDrainCondemnedBuffers now validates that status is still BUF_RESIZE_DRAINING and drain_from/drain_to match cached values before decommitting. Prevents MADV_REMOVE from destroying pages that a concurrent grow has reclaimed for active buffers.
- GrowBufferPool active descriptor race: Skip reinitialization of descriptors that still have BM_TAG_VALID set from a cancelled shrink, as backends may hold active pins. Such buffers are naturally reused by clock sweep.
- EXEC_BACKEND guard: Emit FATAL on platforms using EXEC_BACKEND (Windows) when max_shared_buffers is configured, since fork()-based MAP_SHARED inheritance is required.

HIGH fixes:
- Switch decommit to MADV_REMOVE for buffer blocks (always page-aligned). MADV_DONTNEED on MAP_SHARED|MAP_ANONYMOUS only zaps PTEs without releasing shmem-backed pages. MADV_REMOVE punches holes in the shmem backing store. Falls back to MADV_DONTNEED if MADV_REMOVE unavailable.
- Make BufferPoolCommitMemory incremental: only commit [old, new) range instead of [0, new). Avoids re-touching live pages and ensures rollback on partial failure only affects the new range.

MEDIUM fixes:
- Remove dead BUF_RESIZE_COMPLETING enum value (never set anywhere)
- Fix remaining overcount in drain: only count buffers that fail eviction, not all valid buffers before attempting eviction

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
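The stale-decommit check from the first CRITICAL fix can be sketched as a revalidation step under a lock. Everything here is hypothetical (DemoResizeCtl, the demo_* functions, a C11 atomic_flag standing in for the spinlock); only the idea of comparing the status and the cached drain range comes from the commit message.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef enum DemoResizeStatus
{
    DEMO_RESIZE_IDLE,
    DEMO_RESIZE_DRAINING
} DemoResizeStatus;

/* Hypothetical control block, loosely modeled on the description above. */
typedef struct DemoResizeCtl
{
    atomic_flag      lock;          /* stand-in for the spinlock */
    DemoResizeStatus status;
    int              drain_from;    /* first condemned buffer */
    int              drain_to;      /* one past the last condemned buffer */
} DemoResizeCtl;

static void
demo_ctl_init(DemoResizeCtl *ctl)
{
    atomic_flag_clear(&ctl->lock);
    ctl->status = DEMO_RESIZE_IDLE;
    ctl->drain_from = ctl->drain_to = 0;
}

static void
demo_lock(DemoResizeCtl *ctl)
{
    while (atomic_flag_test_and_set_explicit(&ctl->lock, memory_order_acquire))
        ;                           /* spin until the flag is released */
}

static void
demo_unlock(DemoResizeCtl *ctl)
{
    atomic_flag_clear_explicit(&ctl->lock, memory_order_release);
}

/* A shrink records the condemned range and marks it as draining. */
static void
demo_begin_shrink(DemoResizeCtl *ctl, int from, int to)
{
    assert(from < to);
    demo_lock(ctl);
    ctl->status = DEMO_RESIZE_DRAINING;
    ctl->drain_from = from;
    ctl->drain_to = to;
    demo_unlock(ctl);
}

/* A grow that reclaims the range cancels any drain in progress. */
static void
demo_grow_cancels_drain(DemoResizeCtl *ctl)
{
    demo_lock(ctl);
    ctl->status = DEMO_RESIZE_IDLE;
    demo_unlock(ctl);
}

/* Drainer: recheck that the range cached earlier is still the one to free. */
static bool
demo_drain_still_valid(DemoResizeCtl *ctl, int cached_from, int cached_to)
{
    bool    ok;

    demo_lock(ctl);
    ok = ctl->status == DEMO_RESIZE_DRAINING &&
         ctl->drain_from == cached_from &&
         ctl->drain_to == cached_to;
    demo_unlock(ctl);
    return ok;
}
```

A drainer would call demo_drain_still_valid() with the range it cached when it started and skip the decommit when the answer is false. How the patch orders the decommit relative to releasing the lock is not stated in the commit message, so the sketch covers only the check.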
7fae1ee to 41687fa
Summary
Allow `shared_buffers` to be changed at runtime via `ALTER SYSTEM` + `pg_reload_conf()` (SIGHUP) without requiring a PostgreSQL restart, when a new GUC `max_shared_buffers` is set at startup to reserve virtual address space.

- `max_shared_buffers` GUC (PGC_POSTMASTER) reserves VA space at startup via `MAP_SHARED|MAP_ANONYMOUS|MAP_NORESERVE`, enabling online resize up to that limit with zero steady-state overhead
- `shared_buffers` changed from PGC_POSTMASTER to PGC_SIGHUP with check/assign hooks for runtime validation and resize requests
- Grow: commits memory, initializes new descriptors, updates `NBuffers` atomically, signals children via SIGHUP
- Shrink: updates `NBuffers` immediately, bgwriter asynchronously drains condemned buffers (flush dirty, evict unpinned), then decommits memory via `MADV_REMOVE`
- Buffer lookup hash table pre-sized for `MaxNBuffers` to avoid rehashing on grow

Changes
- `buf_resize.c` (new, 864 lines): Core resize implementation: VA reservation, incremental commit/decommit, grow/shrink logic, bgwriter drain, GUC hooks
- `buf_resize.h` (new): Shared memory control structure (`BufPoolResizeCtl`), function declarations
- `buf_init.c`: Dual-path initialization; uses pre-reserved memory when `max_shared_buffers` is configured, traditional `ShmemInitStruct` otherwise
- `postmaster.c`: Calls `ExecuteBufferPoolResize()` after `ProcessConfigFile()` before signaling children
- `bgwriter.c`: Calls `BufPoolDrainCondemnedBuffers()` after `BgBufferSync()` each cycle
- `freelist.c`: Hash table pre-sized for `MaxNBuffers` when configured
- `ipci.c`: Calls `BufferPoolReserveMemory()` and `BufPoolResizeShmemInit()` during shared memory setup
- `guc_parameters.dat`: `shared_buffers` changed to PGC_SIGHUP with hooks; new `max_shared_buffers` entry
- `globals.c` / `miscadmin.h`: New globals `SharedBuffersGUC`, `MaxNBuffers`

Design highlights
- Global pointers (`BufferDescriptors`, `BufferBlocks`, etc.) never move; `NBuffers` remains a plain int
- Memory barriers pair `current_buffers` writes with descriptor initialization visibility; bgwriter validates drain range under spinlock before decommitting to prevent races with concurrent grow
- … `shared_buffers` on restart
- Requires `fork()` for `MAP_SHARED` inheritance; EXEC_BACKEND (Windows) emits FATAL if `max_shared_buffers` is set; `MADV_REMOVE` with `MADV_DONTNEED` fallback; `MADV_POPULATE_WRITE` with manual page-touch fallback for kernels < 5.14

Test plan
https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos