Add online shared_buffers resize without server restart #13

Open
NikolayS wants to merge 5 commits into master from
claude/add-dsm-registry-helpers-Jg0Hy

NikolayS commented Feb 9, 2026

Summary

Allow shared_buffers to be changed at runtime via ALTER SYSTEM + pg_reload_conf() (SIGHUP), without a PostgreSQL restart, provided the new GUC max_shared_buffers is set at startup to reserve virtual address space.

  • New max_shared_buffers GUC (PGC_POSTMASTER) reserves VA space at startup via MAP_SHARED|MAP_ANONYMOUS|MAP_NORESERVE, enabling online resize up to that limit with zero steady-state overhead
  • Changed shared_buffers from PGC_POSTMASTER to PGC_SIGHUP with check/assign hooks for runtime validation and resize requests
  • Grow path: commits memory incrementally, initializes new buffer descriptors, updates NBuffers atomically, signals children via SIGHUP
  • Shrink path: reduces NBuffers immediately, bgwriter asynchronously drains condemned buffers (flush dirty, evict unpinned), then decommits memory via MADV_REMOVE
  • SIGHUP-based postmaster/child coordination (postmaster lacks PGPROC, so ProcSignalBarrier cannot be used)
  • Buffer lookup hash table pre-sized for MaxNBuffers to avoid rehashing on grow
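
The reservation step can be pictured with a minimal sketch. It assumes Linux and uses the mmap flags named above; the helper names `reserve_region` and `release_region` are hypothetical and are not the functions in buf_resize.c.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Reserve `size` bytes of shared, anonymous address space without
 * committing physical memory up front. Returns NULL on failure.
 * Hypothetical helper, for illustration only.
 */
static void *
reserve_region(size_t size)
{
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

    return (p == MAP_FAILED) ? NULL : p;
}

/* Release a region obtained from reserve_region(). Returns 0 on success. */
static int
release_region(void *p, size_t size)
{
    assert(p != NULL);
    return munmap(p, size);
}
```

Because the mapping is MAP_SHARED and created before fork(), child processes inherit the same pages at the same address, which is what lets the base pointers stay fixed while the committed portion grows or shrinks.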

Changes

  • buf_resize.c (new, 864 lines): Core resize implementation — VA reservation, incremental commit/decommit, grow/shrink logic, bgwriter drain, GUC hooks
  • buf_resize.h (new): Shared memory control structure (BufPoolResizeCtl), function declarations
  • buf_init.c: Dual-path initialization — uses pre-reserved memory when max_shared_buffers is configured, traditional ShmemInitStruct otherwise
  • postmaster.c: Calls ExecuteBufferPoolResize() after ProcessConfigFile() before signaling children
  • bgwriter.c: Calls BufPoolDrainCondemnedBuffers() after BgBufferSync() each cycle
  • freelist.c: Hash table pre-sized for MaxNBuffers when configured
  • ipci.c: Calls BufferPoolReserveMemory() and BufPoolResizeShmemInit() during shared memory setup
  • guc_parameters.dat: shared_buffers changed to PGC_SIGHUP with hooks; new max_shared_buffers entry
  • globals.c / miscadmin.h: New globals SharedBuffersGUC, MaxNBuffers

Design highlights

  • Zero steady-state overhead: Base pointers (BufferDescriptors, BufferBlocks, etc.) never move; NBuffers remains a plain int
  • Concurrency safety: Memory barriers pair atomic current_buffers writes with descriptor initialization visibility; bgwriter validates drain range under spinlock before decommitting to prevent races with concurrent grow
  • Crash recovery: Standard WAL replay; buffer pool starts at configured shared_buffers on restart
  • Portability: Requires fork() for MAP_SHARED inheritance; EXEC_BACKEND (Windows) emits FATAL if max_shared_buffers is set; MADV_REMOVE with MADV_DONTNEED fallback; MADV_POPULATE_WRITE with manual page-touch fallback for kernels < 5.14
  • Includes design document with prior art analysis (MySQL/InnoDB, Oracle SGA, SQL Server) and 15 edge cases analyzed
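
The barrier pairing in the concurrency bullet follows the usual publish pattern: initialize first, publish the count second, and have readers order their loads after reading the count. A minimal sketch, substituting C11 release/acquire atomics for pg_write_barrier()/pg_read_barrier(); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

#define POOL_MAX 8              /* hypothetical capacity for the sketch */

typedef struct PoolSketch
{
    int         desc[POOL_MAX]; /* stand-in for buffer descriptors */
    atomic_int  current;        /* number of usable descriptors */
} PoolSketch;

static void
pool_init(PoolSketch *pool)
{
    atomic_init(&pool->current, 0);
}

/* Writer: initialize descriptors [old, new_count), then publish the count. */
static void
pool_grow(PoolSketch *pool, int new_count)
{
    int old = atomic_load_explicit(&pool->current, memory_order_relaxed);

    assert(new_count >= old && new_count <= POOL_MAX);
    for (int i = old; i < new_count; i++)
        pool->desc[i] = i + 1;  /* "initialized" marker */

    /* Release ordering: the stores above are visible before the new count. */
    atomic_store_explicit(&pool->current, new_count, memory_order_release);
}

/* Reader: acquire the count; desc[0 .. count) is then safe to read. */
static int
pool_visible_count(PoolSketch *pool)
{
    return atomic_load_explicit(&pool->current, memory_order_acquire);
}
```

Without the acquire side, a reader on a weakly ordered CPU could observe the larger count and still read stale descriptor contents.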

Test plan

  • Basic grow/shrink cycles (32MB → 64MB → 128MB → 48MB → 96MB)
  • Boundary conditions (128kB minimum, exceed max rejection)
  • No-op resize (same value)
  • Data integrity under concurrent resize (10k rows, 0 corruption)
  • Heavy pgbench load (12 clients) + 15 resize cycles — 0 failed transactions
  • Rapid oscillation (30 cycles, 200ms delay)
  • Checkpoint during resize — data intact
  • Multiple long-running sessions during resize
  • Crash recovery (immediate shutdown + restart) — data intact
  • Extreme stress: 16 clients, 128kB↔128MB, 50 cycles — 23k+ txns, 0 failures

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos

Comprehensive design proposal for making shared_buffers dynamically
adjustable via SIGHUP without requiring a PostgreSQL restart. Covers:

- Detailed analysis of all NBuffers-dependent data structures and code paths
- Cross-system prior art (MySQL/InnoDB, Oracle SGA, SQL Server, MariaDB)
- Phased implementation: virtual address reservation, online grow, online
  shrink, dynamic hash table resizing
- ProcSignalBarrier-based coordination protocol
- 15 edge/corner cases analyzed (concurrent resize, crash recovery, pinned
  condemned buffers, huge pages, AIO interactions, etc.)
- Portability layer for Linux, FreeBSD, macOS, Windows
- Performance impact analysis with zero steady-state overhead goal
- Testing strategy covering unit, concurrency, crash recovery, and stress
- References to Dmitry Dolgov's active pgsql-hackers RFC patch series

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
Add infrastructure for dynamically resizing the shared buffer pool
via SIGHUP, without requiring a PostgreSQL restart. This implements
Phases 1-3 from the design document.

Key changes:

New GUC: max_shared_buffers (PGC_POSTMASTER)
  When set > shared_buffers, reserves virtual address space at startup
  for buffer pool arrays, enabling online resize up to that limit.
  Default 0 means same as shared_buffers (no online resize, preserving
  current behavior).

Changed GUC: shared_buffers (PGC_POSTMASTER -> PGC_SIGHUP)
  Now dynamically adjustable via ALTER SYSTEM + pg_reload_conf() when
  max_shared_buffers is configured. Check/assign hooks validate limits
  and initiate resize requests.

Buffer pool memory management (buf_resize.c):
  - BufferPoolReserveMemory(): reserves VA space via MAP_SHARED|MAP_NORESERVE
  - Grow: commits memory, initializes new descriptors, emits barrier
  - Shrink: drains condemned buffers (flush dirty, wait for unpins),
    emits barrier, decommits memory via MADV_DONTNEED
  - Zero steady-state overhead: base pointers never change, NBuffers
    remains a plain int updated via ProcSignalBarrier

Coordination protocol:
  - New PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE barrier type
  - ProcessBarrierBufferPoolResize() updates backend-local NBuffers
  - Serialized resize operations with progress tracking

Hash table pre-sizing:
  - When max_shared_buffers > shared_buffers, the buffer lookup hash
    table is pre-sized for MaxNBuffers, avoiding rehashing on grow

Separate memory allocation path:
  - When max_shared_buffers is configured, buffer arrays are allocated
    from separately-mapped memory regions instead of main shmem segment
  - BufferManagerShmemSize() excludes array sizes from main shmem
  - Global pointers (BufferDescriptors, BufferBlocks, etc.) remain
    stable across resize operations

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
Major fixes to make the online buffer pool resize actually work:

1. GUC variable indirection: The GUC variable for shared_buffers was
   pointing directly at NBuffers, which caused the GUC mechanism to
   overwrite NBuffers on SIGHUP before any actual resize occurred.
   Introduce SharedBuffersGUC as the GUC target variable. NBuffers is
   now only updated by the resize code (or at startup).

2. Postmaster/child coordination: The assign hook now properly
   distinguishes between postmaster (requests resize) and child
   processes (read current_buffers from shared memory). Replaced
   ProcSignalBarrier approach with SIGHUP-based coordination since
   the postmaster lacks a PGPROC for ConditionVariable waits.

3. ExecuteBufferPoolResize in postmaster loop: Added a call to
   ExecuteBufferPoolResize() in process_pm_reload_request(), after
   ProcessConfigFile() but before SignalChildren(), ensuring the resize
   completes before children update their NBuffers.

4. MADV_POPULATE_WRITE fallback: Handle older kernels (pre-5.14) that
   don't support MADV_POPULATE_WRITE by falling back to manual page
   touching for memory commit.
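
The fallback in item 4 can be sketched as follows, assuming Linux. The helper names are hypothetical, and treating EINVAL as "advice not supported by this kernel" is a choice made for this sketch rather than a description of the patch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Touch one byte per page so the kernel backs the range with real memory. */
static void
touch_pages(char *start, size_t len, size_t page_size)
{
    for (size_t off = 0; off < len; off += page_size)
    {
        volatile char *p = start + off;

        *p = *p;                /* read-modify-write keeps existing contents */
    }
}

/*
 * Commit [start, start + len), which must be page-aligned. Prefers
 * MADV_POPULATE_WRITE and falls back to manual touching when the constant
 * is missing at build time or the kernel rejects it at run time.
 * Returns 0 on success, -1 on failure.
 */
static int
commit_range(char *start, size_t len)
{
    long page_size = sysconf(_SC_PAGESIZE);

    assert(page_size > 0);
#ifdef MADV_POPULATE_WRITE
    if (madvise(start, len, MADV_POPULATE_WRITE) == 0)
        return 0;
    if (errno != EINVAL && errno != ENOSYS)
        return -1;              /* a real failure, such as ENOMEM */
#endif
    touch_pages(start, len, (size_t) page_size);
    return 0;
}
```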

Tested: grow (32MB->64MB->128MB), shrink (128MB->48MB), grow-back
(48MB->96MB), exceed-max rejection, resize under concurrent pgbench
load (zero failed transactions), clean shutdown/restart persistence.

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
…writer drain

Address code review findings across all severity levels:

CRITICAL:
- Add pg_read_barrier() in child processes after reading current_buffers
  atomic to pair with pg_write_barrier() in GrowBufferPool/ShrinkBufferPool,
  ensuring descriptor initialization is visible on ARM/POWER architectures
- Add rollback logic to BufferPoolCommitMemory() for partial
  MADV_POPULATE_WRITE failures (release already-committed pages via
  MADV_DONTNEED when a later array fails)

HIGH:
- Replace bare arithmetic with mul_size()/add_size() throughout
  BufferPoolReserveMemory, BufferPoolCommitMemory, BufferPoolDecommitMemory
  to detect integer overflow on 32-bit systems
- Fix stale comment claiming MAP_PRIVATE when code uses MAP_SHARED
- Fix header comment claiming BufResizeLock (LWLock) when struct uses
  mutex spinlock
- Update file header comment to describe SIGHUP-based coordination
  instead of removed ProcSignalBarrier approach
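
For readers unfamiliar with mul_size()/add_size(), the idea behind the first HIGH item is overflow-checked size arithmetic. The sketch below is a standalone analogue with hypothetical names; it returns false on overflow, whereas the PostgreSQL helpers raise an error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Store a + b in *out; return false if the sum would overflow size_t. */
static bool
checked_add_size(size_t a, size_t b, size_t *out)
{
    if (a > SIZE_MAX - b)
        return false;
    *out = a + b;
    return true;
}

/* Store a * b in *out; return false if the product would overflow size_t. */
static bool
checked_mul_size(size_t a, size_t b, size_t *out)
{
    if (a != 0 && b > SIZE_MAX / a)
        return false;
    *out = a * b;
    return true;
}

/* Bytes for nbuffers blocks plus nbuffers descriptors; false on overflow. */
static bool
pool_bytes(size_t nbuffers, size_t block_size, size_t desc_size, size_t *out)
{
    size_t blocks = 0;
    size_t descs = 0;

    return checked_mul_size(nbuffers, block_size, &blocks) &&
           checked_mul_size(nbuffers, desc_size, &descs) &&
           checked_add_size(blocks, descs, out);
}
```

On a 32-bit build, size_t is 32 bits wide, so a large buffer count multiplied by an 8kB block size can wrap silently with bare arithmetic; the checked form turns that into a detectable failure.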

MEDIUM:
- Remove unused includes (postmaster/bgwriter.h, storage/ipc.h,
  storage/pg_shmem.h, storage/lwlock.h)
- Add comment on magic number 16 in Assert (matches GUC minimum)

Move buffer eviction from postmaster to bgwriter:
- ShrinkBufferPool now only updates NBuffers and records condemned range
- New BufPoolDrainCondemnedBuffers() runs in bgwriter main loop with
  full backend infrastructure (ResourceOwner, private refcounts)
- Fixes SIGSEGV crash when EvictUnpinnedBuffer was called from postmaster

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
…sues

Round 2 code review fixes addressing findings from 5 focused review agents.

CRITICAL fixes:
- Bgwriter stale decommit race: BufPoolDrainCondemnedBuffers now validates
  that status is still BUF_RESIZE_DRAINING and drain_from/drain_to match
  cached values before decommitting. Prevents MADV_REMOVE from destroying
  pages that a concurrent grow has reclaimed for active buffers.
- GrowBufferPool active descriptor race: Skip reinitialization of descriptors
  that still have BM_TAG_VALID set from a cancelled shrink, as backends may
  hold active pins. Such buffers are naturally reused by clock sweep.
- EXEC_BACKEND guard: Emit FATAL on platforms using EXEC_BACKEND (Windows)
  when max_shared_buffers is configured, since fork()-based MAP_SHARED
  inheritance is required.
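
The stale-decommit check in the first fix can be sketched as below. All names are hypothetical, a C11 atomic_flag stands in for the control struct's spinlock, and the sketch covers only the validation, not the decommit itself.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef enum { RESIZE_IDLE, RESIZE_DRAINING } ResizeStatus;

typedef struct ResizeCtlSketch
{
    atomic_flag  mutex;         /* stand-in for the spinlock */
    ResizeStatus status;
    int          drain_from;    /* first condemned buffer */
    int          drain_to;      /* one past the last condemned buffer */
} ResizeCtlSketch;

static void
ctl_lock(ResizeCtlSketch *c)
{
    while (atomic_flag_test_and_set_explicit(&c->mutex, memory_order_acquire))
        continue;               /* spin until the flag is released */
}

static void
ctl_unlock(ResizeCtlSketch *c)
{
    atomic_flag_clear_explicit(&c->mutex, memory_order_release);
}

/* Put the control struct into the given state (IDLE ignores the range). */
static void
ctl_set(ResizeCtlSketch *c, ResizeStatus status, int from, int to)
{
    ctl_lock(c);
    c->status = status;
    c->drain_from = from;
    c->drain_to = to;
    ctl_unlock(c);
}

static void
ctl_init(ResizeCtlSketch *c)
{
    atomic_flag_clear(&c->mutex);
    ctl_set(c, RESIZE_IDLE, 0, 0);
}

/*
 * True only if the drain the caller cached is still the active one. A
 * concurrent grow that cancelled or replaced the shrink makes this false,
 * and the caller must then skip the decommit.
 */
static bool
drain_still_current(ResizeCtlSketch *c, int cached_from, int cached_to)
{
    bool ok;

    ctl_lock(c);
    ok = c->status == RESIZE_DRAINING &&
         c->drain_from == cached_from &&
         c->drain_to == cached_to;
    ctl_unlock(c);
    return ok;
}
```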

HIGH fixes:
- Switch decommit to MADV_REMOVE for buffer blocks (always page-aligned).
  MADV_DONTNEED on MAP_SHARED|MAP_ANONYMOUS only zaps PTEs without
  releasing shmem-backed pages. MADV_REMOVE punches holes in the shmem
  backing store. Falls back to MADV_DONTNEED if MADV_REMOVE unavailable.
- Make BufferPoolCommitMemory incremental: only commit [old, new) range
  instead of [0, new). Avoids re-touching live pages and ensures rollback
  on partial failure only affects the new range.
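
The decommit change can be sketched as follows, assuming Linux. The helper name is hypothetical, and falling back on any MADV_REMOVE failure (rather than only when it is unavailable) is a simplification made for this sketch.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Release the pages backing [start, start + len) of a shared anonymous
 * mapping while keeping the address range reserved. Both values must be
 * page-aligned. Returns 0 on success, -1 on failure.
 */
static int
decommit_range(char *start, size_t len)
{
    assert(start != NULL);
    if (len == 0)
        return 0;
#ifdef MADV_REMOVE
    /* Punch a hole in the shmem backing store so the pages are freed. */
    if (madvise(start, len, MADV_REMOVE) == 0)
        return 0;
#endif
    /* Fallback: drops this process's page table entries only. */
    return madvise(start, len, MADV_DONTNEED);
}
```

Passing only the condemned range, in the same spirit as the incremental [old, new) commit, keeps the call away from pages that belong to live buffers.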

MEDIUM fixes:
- Remove dead BUF_RESIZE_COMPLETING enum value (never set anywhere)
- Fix remaining overcount in drain: only count buffers that fail eviction,
  not all valid buffers before attempting eviction

https://claude.ai/code/session_01BvQguZ27Xd1dpgKCfsuFos
@NikolayS NikolayS force-pushed the claude/add-dsm-registry-helpers-Jg0Hy branch from 7fae1ee to 41687fa Compare February 9, 2026 17:42