diff --git a/doc/design-shared-buffers-online-resize.md b/doc/design-shared-buffers-online-resize.md new file mode 100644 index 0000000000000..8793e191657e5 --- /dev/null +++ b/doc/design-shared-buffers-online-resize.md @@ -0,0 +1,1432 @@ +# Design: Online Resizing of `shared_buffers` Without Restart + +**Status:** Proposal / Design Document +**Target:** PostgreSQL 19+ +**Author:** Design analysis based on PostgreSQL source code study +**Date:** 2026-02-06 +**Related work:** Dmitry Dolgov's RFC patch series on pgsql-hackers (October 2024 -- April 2025) + +--- + +## Table of Contents + +1. [Motivation](#1-motivation) +2. [Current Architecture](#2-current-architecture) +3. [Prior Art: How Other Systems Do It](#3-prior-art-how-other-systems-do-it) +4. [Design Overview](#4-design-overview) +5. [Phase 1: Virtual Address Space Reservation](#5-phase-1-virtual-address-space-reservation) +6. [Phase 2: Growing the Buffer Pool](#6-phase-2-growing-the-buffer-pool) +7. [Phase 3: Shrinking the Buffer Pool](#7-phase-3-shrinking-the-buffer-pool) +8. [Phase 4: Hash Table Resizing](#8-phase-4-hash-table-resizing) +9. [Coordination Protocol](#9-coordination-protocol) +10. [GUC and User Interface Changes](#10-guc-and-user-interface-changes) +11. [Edge Cases and Corner Cases](#11-edge-cases-and-corner-cases) +12. [Huge Pages](#12-huge-pages) +13. [Portability](#13-portability) +14. [Performance Impact](#14-performance-impact) +15. [Observability](#15-observability) +16. [Testing Strategy](#16-testing-strategy) +17. [Migration and Compatibility](#17-migration-and-compatibility) +18. [Phased Implementation Plan](#18-phased-implementation-plan) +19. [Open Questions](#19-open-questions) +20. [References](#20-references) + +--- + +## 1. Motivation + +`shared_buffers` is arguably the most important PostgreSQL tuning parameter, yet +changing it requires a full server restart -- the most disruptive operation a +DBA can perform. This creates real-world pain in several scenarios: + +- **Cloud/managed databases** that need to scale vertically without downtime +- **Autoscaling** in response to workload changes (e.g., reporting windows) +- **Initial misconfiguration** discovered under production load +- **Memory rebalancing** on multi-tenant hosts running multiple PG instances +- **Gradual warm-up** strategies: start small, grow as the working set stabilizes + +Other major databases already support this: +- MySQL/InnoDB: `innodb_buffer_pool_size` has been online-resizable since 5.7.5 (2014) +- Oracle: `db_cache_size` dynamically adjustable within SGA since 9i (2001) +- SQL Server: `max server memory` fully dynamic (always was) + +PostgreSQL should close this gap. + +--- + +## 2. Current Architecture + +Understanding what needs to change requires a detailed inventory of every data +structure and code path that depends on `NBuffers` being constant. + +### 2.1 Shared Memory Allocation + +At postmaster startup, `CreateSharedMemoryAndSemaphores()` (`src/backend/storage/ipc/ipci.c:191`) +allocates a single contiguous shared memory segment: + +``` +CalculateShmemSize() -- compute total size including BufferManagerShmemSize() +PGSharedMemoryCreate() -- mmap() one giant anonymous segment (or SysV) +CreateOrAttachShmemStructs() -- carve it up via ShmemInitStruct() +``` + +The segment size is fixed for the lifetime of the postmaster. All subsystems +allocate their shared memory from this segment via `ShmemInitStruct()`, which is +a simple bump allocator. There is no facility to grow or shrink the segment. 
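+
+For reference, every subsystem follows essentially the same allocation pattern
+at startup. The sketch below is illustrative (the subsystem name and struct are
+made up); the point is that each consumer carves a fixed-size piece out of the
+single segment, so no piece can be enlarged later without disturbing its
+neighbors:
+
+```c
+/* Illustrative example of the conventional ShmemInitStruct() pattern. */
+typedef struct MySubsystemShared
+{
+	int			counter;
+	/* ... more fields, all sized at postmaster startup ... */
+} MySubsystemShared;
+
+static MySubsystemShared *MyShared = NULL;
+
+void
+MySubsystemShmemInit(void)
+{
+	bool		found;
+
+	/* Carve a fixed-size piece out of the single shared segment. */
+	MyShared = (MySubsystemShared *)
+		ShmemInitStruct("My Subsystem State",
+						sizeof(MySubsystemShared),
+						&found);
+
+	if (!found)
+	{
+		/* First caller (the postmaster) initializes the contents. */
+		MyShared->counter = 0;
+	}
+	/* Later callers just attach; address and size are fixed for life. */
+}
+```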
+ +### 2.2 Buffer Manager Data Structures + +`BufferManagerShmemInit()` (`src/backend/storage/buffer/buf_init.c:68`) allocates +five arrays, all dimensioned by `NBuffers`: + +| Structure | Size per buffer | Total (default 128MB / 16384 bufs) | Purpose | +|---|---|---|---| +| `BufferDescriptors[]` | 64 bytes (cache-line padded) | 1 MB | Metadata: tag, state (atomic), lock waiters | +| `BufferBlocks` | 8192 bytes (BLCKSZ) | 128 MB | Actual page data | +| `BufferIOCVArray[]` | ~64 bytes (padded) | 1 MB | I/O completion condition variables | +| `CkptBufferIds[]` | 24 bytes | 384 KB | Checkpoint sort array | +| Buffer hash table | ~40 bytes | ~800 KB | Tag-to-buffer-ID lookup (partitioned) | + +**Total overhead beyond the page data:** ~3.3 MB per 16384 buffers (~0.2 KB per buffer). + +### 2.3 Critical Code Paths Depending on NBuffers + +#### 2.3.1 Direct Array Indexing (Hot Path) + +```c +// buf_internals.h:422 -- THE hottest function in PG +static inline BufferDesc *GetBufferDescriptor(uint32 id) +{ + return &(BufferDescriptors[id]).bufferdesc; +} + +// bufmgr.c:73 -- converts descriptor to data pointer +#define BufHdrGetBlock(bufHdr) \ + ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ)) +``` + +These are zero-overhead array lookups. Every buffer pin, unpin, read, write, and +dirty operation goes through `GetBufferDescriptor()`. Any indirection added here +is on the absolute hottest path. + +#### 2.3.2 Clock Sweep (Victim Selection) + +```c +// freelist.c:99-156 +static inline uint32 ClockSweepTick(void) +{ + victim = pg_atomic_fetch_add_u32(&StrategyControl->nextVictimBuffer, 1); + if (victim >= NBuffers) + { + victim = victim % NBuffers; + // ... wrap-around handling with completePasses increment + } + return victim; +} +``` + +The clock hand is a monotonically increasing atomic counter, reduced modulo +`NBuffers` to find the actual buffer. Changing `NBuffers` while the clock hand +is in flight would cause the modulo to produce different results -- but since +the clock hand is already designed to wrap, this is actually one of the easier +parts to handle (see Section 6.3). + +#### 2.3.3 Buffer Lookup Hash Table + +```c +// buf_table.c:50 -- fixed-size, created once +InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS); +// Uses HASH_FIXED_SIZE flag -- cannot grow! +``` + +The buffer mapping hash table is created with `HASH_FIXED_SIZE`, explicitly +preventing dynamic growth. It's partitioned across `NUM_BUFFER_PARTITIONS` (128) +LWLocks. The table is sized for `NBuffers + NUM_BUFFER_PARTITIONS` entries to +handle concurrent insert-before-delete during buffer replacement. + +#### 2.3.4 Background Writer and Checkpointer + +```c +// freelist.c:230 -- scan limit in StrategyGetBuffer +trycounter = NBuffers; + +// bufmgr.c:92 -- threshold for full-pool scan vs. hash lookup +#define BUF_DROP_FULL_SCAN_THRESHOLD (uint64) (NBuffers / 32) +``` + +The bgwriter uses `StrategySyncStart()` which reads `nextVictimBuffer % NBuffers`. +The checkpointer allocates `CkptBufferIds[NBuffers]` at startup for sort space. + +#### 2.3.5 Buffer Access Strategies (Ring Buffers) + +```c +// freelist.c:560 -- ring buffers capped at 1/8 of pool +ring_buffers = Min(NBuffers / 8, ring_buffers); +``` + +Ring buffer sizes for sequential scans, VACUUM, and bulk writes are derived from +`NBuffers`. These are per-backend allocations and can tolerate NBuffers changes +between allocations -- but an active ring buffer referencing a buffer ID that +gets invalidated during shrink is dangerous. 
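+
+Section 11.13 below proposes handling this by having each backend scrub its
+active rings when it learns of a shrink. A hypothetical sketch (the accessor
+functions do not exist in PostgreSQL today and are named only for
+illustration):
+
+```c
+/*
+ * Drop any ring entries that reference a condemned buffer, so the ring
+ * refills from the surviving range.  Note that Buffer values are 1-based,
+ * so buffer ID new_nbuffers-1 corresponds to Buffer new_nbuffers.
+ */
+static void
+StrategyScrubCondemned(BufferAccessStrategy strategy, int new_nbuffers)
+{
+	for (int i = 0; i < StrategyGetRingSize(strategy); i++)	/* hypothetical */
+	{
+		Buffer		buf = StrategyGetRingMember(strategy, i);	/* hypothetical */
+
+		if (buf != InvalidBuffer && buf > new_nbuffers)
+			StrategySetRingMember(strategy, i, InvalidBuffer);	/* hypothetical */
+	}
+}
+```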
+ +#### 2.3.6 Other NBuffers Dependencies + +- `GetAccessStrategyPinLimit()` returns `NBuffers` for NULL strategy +- `PrivateRefCount` hash table (per-backend, in local memory) -- no issue +- Predicate lock manager's buffer-level locks reference buffer IDs +- AIO subsystem references buffer IDs for in-flight I/O operations +- `pg_buffercache` extension iterates `0..NBuffers-1` + +### 2.4 Shared Memory Backend Model + +On Linux (the primary target), the postmaster creates shared memory via +anonymous `mmap()` with `MAP_SHARED`. Child backends inherit the mapping +through `fork()`. All backends see the same physical pages at the same virtual +address. There is no facility to notify backends that the mapping has changed. + +On `EXEC_BACKEND` platforms (Windows), backends re-attach to the shared memory +segment after `exec()` via `AttachSharedMemoryStructs()`. This path already +handles pointer re-initialization -- which is actually advantageous for resize. + +--- + +## 3. Prior Art: How Other Systems Do It + +### 3.1 MySQL/InnoDB (Since 5.7.5) + +**Unit of resize:** 128MB chunks (`innodb_buffer_pool_chunk_size`). + +**Growing:** +1. Background thread allocates new chunks (OS memory) +2. New pages added to free list +3. Hash tables resized +4. Adaptive Hash Index (AHI) re-enabled + +**Shrinking (much harder):** +1. AHI disabled +2. Defragmentation: pages from condemned chunks relocated +3. Dirty pages flushed, chunks freed +4. Hash tables resized + +**Known problems:** +- TPS drops to zero during resize (MySQL Bug #81615) +- Shrink blocked by long-running transactions holding buffer pins +- mmap failures mid-resize treated as fatal +- AHI disabled for entire duration causes latency spikes + +**Lesson:** Chunk-based allocation avoids per-page copying. But the critical +section that blocks all buffer access is the main source of production issues. + +### 3.2 MariaDB (10.11.12+) + +Evolved beyond MySQL's approach: +- Deprecated fixed chunk sizes; arbitrary 1MB increments +- `innodb_buffer_pool_size_max` reserves address space at startup +- Automatic memory-pressure-driven shrinking via Linux `madvise(MADV_DONTNEED)` +- Initially caused performance anomalies (MDEV-35000); disabled by default + +**Lesson:** OS memory pressure integration is attractive but treacherous. +Hysteresis and minimum bounds are essential. + +### 3.3 Oracle (SGA Dynamic Resize) + +**Unit of resize:** Granules (4MB if SGA < 1GB, 16MB otherwise). + +- Components resizable within `SGA_MAX_SIZE` (fixed at startup) +- ASMM/AMM automatic tuning uses cost-benefit analysis +- Shared pool shrink rarely succeeds due to pinned objects + +**Known problems:** +- Memory thrashing: 900+ resize cycles/day ending at same size +- AMM incompatible with HugePages on Linux +- Buffer cache shrank from 2.6GB to 640MB causing system hang + +**Lesson:** Always require explicit minimum bounds. Automatic tuning without +guardrails causes pathological oscillation. Pre-reserve the maximum. + +### 3.4 SQL Server + +Fundamentally different: demand-driven, page-at-a-time acquisition. No discrete +"resize operation." When `max server memory` is lowered, gradual release via +eviction. Resource Monitor handles OS memory pressure. + +**Lesson:** The cleanest model, but requires a completely different memory +architecture than PostgreSQL's. Not directly applicable as a migration target. 
+ +### 3.5 Existing PostgreSQL Patch Work (Dolgov, 2024-2025) + +Dmitry Dolgov's RFC patch series on pgsql-hackers establishes key groundwork: + +| Patch | Approach | +|---|---| +| 0001 | Multiple shared memory mappings (instead of single mmap) | +| 0002 | Place mappings with offset (reserve space for growth) | +| 0003 | Shared memory "slots" for each buffer subsystem array | +| 0004 | Actual resize via `mremap` with GUC assign hook | +| 0005 | `memfd_create` for anonymous file-backed segments | +| 0006 | Coordination for shrinking (prevent SIGBUS from ftruncate) | + +**Key design choices:** +- `max_available_memory` GUC reserves virtual address space at startup +- Extends `ProcSignalBarrier` for global coordination +- Linux-specific (`mremap`, `memfd_create`) +- Currently grow-only; shrink coordination is WIP + +**Open issues identified by reviewers:** +- Portability to non-Linux (macOS, FreeBSD, Windows) +- HugePages interaction with `mremap` +- Address space collisions from other allocations +- No POSIX fallback for `memfd_create` + +--- + +## 4. Design Overview + +Based on the analysis above, we propose a **chunk-based, grow-first** design +that builds on Dolgov's foundation while addressing identified gaps: + +### Core Principles + +1. **Zero overhead on the hot path when not resizing.** The `GetBufferDescriptor()` + and `BufHdrGetBlock()` lookups must remain direct array indexing. No pointer + indirection, no bounds checks, no version counters in steady state. + +2. **Chunk-based allocation.** Buffer pool memory is managed in chunks + (default 128MB, configurable). Growing adds chunks; shrinking removes them. + Within a chunk, memory is contiguous. Chunks need not be contiguous with + each other. + +3. **Reserve virtual address space at startup.** A `max_shared_buffers` GUC + (default: 2x `shared_buffers`, max: total system RAM) reserves virtual + address space at postmaster start. Growing beyond this requires restart. + +4. **Grow is online and nearly non-blocking.** Shrink requires a brief + coordinated pause. + +5. **Phase the implementation.** Grow-only first. Shrink later. Auto-tuning + never (leave to external tools). + +### Architecture Diagram + +``` +Virtual Address Space (reserved at startup for max_shared_buffers): +┌──────────────────────────────────────────────────────────────────┐ +│ BufferBlocks region │ +│ ┌─────────┬─────────┬─────────┬ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ +│ │ Chunk 0 │ Chunk 1 │ Chunk 2 │ (reserved, uncommitted) │ +│ │ 128 MB │ 128 MB │ 128 MB │ │ +│ └─────────┴─────────┴─────────┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ │ +├──────────────────────────────────────────────────────────────────┤ +│ BufferDescriptors region │ +│ ┌─────────┬─────────┬─────────┬ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┐ │ +│ │ Descs 0 │ Descs 1 │ Descs 2 │ (reserved, uncommitted) │ +│ └─────────┴─────────┴─────────┴ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘ │ +├──────────────────────────────────────────────────────────────────┤ +│ BufferIOCVArray region │ +│ (same pattern) │ +├──────────────────────────────────────────────────────────────────┤ +│ CkptBufferIds region │ +│ (same pattern) │ +└──────────────────────────────────────────────────────────────────┘ +``` + +Each region is reserved as a contiguous virtual address range sized for +`max_shared_buffers`. Physical memory is committed only for the active +`shared_buffers` portion. The global pointers (`BufferDescriptors`, +`BufferBlocks`, etc.) never change -- only `NBuffers` changes. + +--- + +## 5. 
Phase 1: Virtual Address Space Reservation + +### 5.1 Separate Buffer Manager Memory from Main Shmem + +**Problem:** Today, buffer pool arrays are allocated from the same `mmap` +segment as everything else (lock tables, proc arrays, CLOG, etc.) via +`ShmemInitStruct()`. We cannot resize one part without affecting the rest. + +**Solution:** Allocate the buffer manager's five arrays as a **separate memory +mapping**, independent of the main shared memory segment: + +```c +/* New function in buf_init.c */ +void +BufferManagerShmemReserve(void) +{ + Size max_bufs = MaxNBuffers; /* from max_shared_buffers GUC */ + + /* Reserve VA space for BufferBlocks */ + BufferBlocks = mmap(NULL, + max_bufs * BLCKSZ + PG_IO_ALIGN_SIZE, + PROT_NONE, /* no access yet */ + MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, + -1, 0); + + /* Similarly for BufferDescriptors, BufferIOCVArray, CkptBufferIds */ + ... + + /* Commit the initial shared_buffers portion */ + BufferManagerShmemCommit(NBuffers); +} +``` + +The key insight: `PROT_NONE` + `MAP_NORESERVE` reserves virtual address space +without committing physical memory or swap. We then `mprotect()` + `mmap()` the +active portion with `MAP_SHARED | MAP_FIXED`. + +### 5.2 New GUC: `max_shared_buffers` + +``` +{ name => 'max_shared_buffers', + type => 'int', + context => 'PGC_POSTMASTER', /* requires restart */ + group => 'RESOURCES_MEM', + short_desc => 'Maximum value to which shared_buffers can be set without restart.', + flags => 'GUC_UNIT_BLOCKS', + variable => 'MaxNBuffers', + boot_val => '0', /* 0 means "same as shared_buffers" */ + min => '0', + max => 'INT_MAX / 2', +} +``` + +When `max_shared_buffers = 0` (default), it equals `shared_buffers` and no +online resize is possible -- preserving current behavior. When set to a value +greater than `shared_buffers`, online resize up to that limit is enabled. + +### 5.3 Shared Memory Backing + +For the reserved region to be shared across `fork()`ed backends, we need a +shared anonymous file descriptor. Options: + +| Method | Pros | Cons | +|---|---|---| +| `memfd_create()` | No filesystem impact, sealed | Linux 3.17+ only | +| `shm_open()` + unlink | POSIX portable | Requires /dev/shm space | +| Anonymous `mmap(MAP_SHARED)` | Simplest | Cannot `mremap()` | + +**Recommended:** Use `memfd_create()` on Linux (the dominant production +platform), with `shm_open()` fallback for FreeBSD/macOS. On Windows +(EXEC_BACKEND), use `CreateFileMapping()` with `SEC_RESERVE`. + +### 5.4 Keeping Pointers Stable + +The critical invariant: `BufferDescriptors`, `BufferBlocks`, `BufferIOCVArray`, +and `CkptBufferIds` pointers must never change after postmaster startup. +Growing the pool extends the committed region *within* the already-reserved +range, so the base address stays fixed. This means: + +- `GetBufferDescriptor(id)` continues to work with zero overhead +- `BufHdrGetBlock(bufHdr)` continues to work with zero overhead +- No pointer indirection is needed on the hot path + +--- + +## 6. Phase 2: Growing the Buffer Pool + +Growing is the simpler operation. New buffers are added at the end of the +arrays with no impact on existing buffers. + +### 6.1 Grow Algorithm + +``` +1. DBA issues: ALTER SYSTEM SET shared_buffers = '2GB'; SELECT pg_reload_conf(); + Or: SET shared_buffers = '2GB'; (with PGC_SIGHUP context) + +2. Postmaster receives SIGHUP, validates new value <= max_shared_buffers. + +3. Postmaster initiates resize sequence: + + a. 
Commit new memory pages: + - mmap(MAP_FIXED | MAP_SHARED) over the PROT_NONE region for each array + - Or: ftruncate() the memfd to the new size + mprotect() + + b. Initialize new buffer descriptors: + for (i = old_NBuffers; i < new_NBuffers; i++) { + BufferDesc *buf = GetBufferDescriptor(i); + ClearBufferTag(&buf->tag); + pg_atomic_init_u64(&buf->state, 0); + buf->wait_backend_pgprocno = INVALID_PROC_NUMBER; + buf->buf_id = i; + ConditionVariableInit(BufferDescriptorGetIOCV(buf)); + } + + c. Emit ProcSignalBarrier to all backends: + EmitProcSignalBarrier(PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE); + + d. Wait for all backends to acknowledge: + WaitForProcSignalBarrier(generation); + + e. Update NBuffers atomically: + pg_atomic_write_u32(&shared_NBuffers, new_NBuffers); + + f. New buffers are immediately available for clock sweep. +``` + +### 6.2 Why Growing Is Nearly Non-Blocking + +During step 3a-3b, existing buffers are untouched. Backends continue operating +normally on buffers 0..old_NBuffers-1. The barrier in step 3c-3d only requires +each backend to: + +1. Call `ProcessProcSignalBarrier()` at the next CHECK_FOR_INTERRUPTS() +2. Read the new `NBuffers` value +3. Acknowledge the barrier + +No buffer access needs to be paused. The new buffers simply appear at the end +of the arrays, and the clock sweep naturally starts visiting them. + +### 6.3 Clock Sweep Interaction + +The clock sweep hand (`nextVictimBuffer`) is a monotonically increasing atomic +counter reduced modulo `NBuffers`. When `NBuffers` increases: + +- If hand is at position H and old NBuffers was N₁ and new is N₂ (N₂ > N₁): + - `H % N₁` and `H % N₂` may differ, but this is harmless -- the clock sweep + already tolerates arbitrary starting positions + - The `completePasses` counter becomes slightly inaccurate for one cycle + - The bgwriter's sync estimation may be off for one cycle (acceptable) + +No special handling is needed beyond updating the value of NBuffers. + +### 6.4 Hash Table Interaction + +The buffer hash table (`SharedBufHash`) is currently fixed-size. After growing +NBuffers, the table may become undersized, leading to longer chains and slower +lookups. Options: + +**Option A: Over-provision at startup.** Size the hash table for +`MaxNBuffers + NUM_BUFFER_PARTITIONS` entries. Wastes memory proportional to +`max_shared_buffers - shared_buffers`, but hash tables are small (~40 bytes per +entry). For a 2x over-provision, the waste is ~40 * NBuffers ≈ 0.6 MB per GB +of buffer pool. This is the recommended approach for Phase 2. + +**Option B: Dynamic hash table.** Replace `HASH_FIXED_SIZE` with a dynamically +resizable hash table. More complex but avoids the waste. Deferred to Phase 4. + +### 6.5 AIO and In-Flight I/O + +The AIO subsystem tracks in-flight I/O operations referencing buffer IDs. +Growing is safe: new buffer IDs (≥ old_NBuffers) won't have any in-flight I/O. +Existing buffer I/O continues undisturbed. + +--- + +## 7. Phase 3: Shrinking the Buffer Pool + +Shrinking is fundamentally harder than growing. Buffers being removed may +contain dirty data, be pinned by active backends, or be referenced by in-flight +I/O operations. + +### 7.1 Shrink Algorithm + +``` +1. DBA issues: ALTER SYSTEM SET shared_buffers = '512MB'; SELECT pg_reload_conf(); + +2. Postmaster validates new value >= min_shared_buffers (16 blocks). + +3. Postmaster initiates drain sequence: + + a. 
Mark condemned range [new_NBuffers, old_NBuffers) as "draining": + - Set a shared flag: drain_target = new_NBuffers + - Clock sweep skips condemned buffers for allocation + - New buffer allocations cannot choose condemned buffers + + b. Drain condemned buffers (may take multiple passes): + for each buffer in condemned range: + - If buffer is dirty, schedule writeback + - If buffer has tag, remove from hash table + - Wait for refcount == 0 (buffer unpinned by all) + - Wait for I/O completion (no in-flight AIO) + - Invalidate: clear tag, set state to 0 + + c. After all condemned buffers are drained: + Emit ProcSignalBarrier(PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE) + + d. Wait for all backends to acknowledge. + + e. Update NBuffers atomically: + pg_atomic_write_u32(&shared_NBuffers, new_NBuffers); + + f. Decommit memory: + madvise(MADV_DONTNEED, ...) on the freed regions + mprotect(PROT_NONE, ...) to prevent accidental access + +4. If drain does not complete within timeout (e.g., 60 seconds): + - Log a WARNING identifying which buffers are still pinned + - Cancel the shrink operation + - Restore original NBuffers +``` + +### 7.2 Drain Coordination Details + +The drain phase is the hardest part. Each condemned buffer can be in one of +several states: + +| Buffer State | Action Required | +|---|---| +| Free (no tag, refcount=0) | Nothing -- already drainable | +| Valid, clean, unpinned | Remove from hash table, clear tag | +| Valid, dirty, unpinned | Flush to disk, then clear | +| Valid, pinned (refcount > 0) | Wait for unpin -- cannot force | +| I/O in progress | Wait for I/O completion | +| Locked (BM_LOCKED) | Wait for unlock | +| Content lock held | Wait for content lock release | + +**Pinned buffers are the critical bottleneck.** A backend holding a pin on a +condemned buffer prevents shrinking. We cannot force-unpin because: +- The backend may be in the middle of reading/writing the page +- The backend's `PrivateRefCount` would become inconsistent +- It could corrupt data + +**Strategy:** Use a cooperative approach: +1. Set a per-buffer flag `BM_CONDEMNED` in the buffer state +2. When a backend unpins a condemned buffer, instead of just decrementing + refcount, it also invalidates the buffer (removes from hash table, clears tag) +3. The postmaster's drain loop polls condemned buffers, flushing dirty ones + and waiting for pins to be released +4. A timeout prevents indefinite blocking + +### 7.3 Preventing SIGBUS on Shrink + +When using `memfd_create()`, shrinking the underlying file with `ftruncate()` +immediately invalidates the pages -- any backend accessing that memory will get +SIGBUS. This is the problem identified in Dolgov's patch 0006. + +**Solution:** The barrier protocol ensures all backends have stopped accessing +the condemned region before `ftruncate()` or `mprotect(PROT_NONE)`: + +``` +Timeline: + 1. All condemned buffers drained (refcount=0, no tags, no I/O) + 2. Barrier emitted -- all backends process it and read new NBuffers + 3. After barrier: NBuffers is smaller, so no backend will access IDs >= new NBuffers + 4. 
Only now: ftruncate/mprotect to release the memory +``` + +The safety invariant: after the barrier completes, no backend can form a +reference to a buffer ID >= new_NBuffers because: +- `GetBufferDescriptor(ClockSweepTick())` returns `victim % NBuffers` where + NBuffers is now smaller +- `BufTableLookup()` can't return an ID >= new_NBuffers because all condemned + entries were removed in the drain phase +- `PrivateRefCount` entries for condemned buffers were cleared during unpin + +### 7.4 In-Flight I/O and AIO + +Before shrinking, ALL in-flight I/O on condemned buffers must complete: +1. Check `io_wref` on each condemned buffer descriptor +2. If AIO is in progress, wait for completion +3. Do NOT initiate new I/O on condemned buffers after drain starts + +The bgwriter and checkpointer must also be aware of the drain -- they should +not attempt to flush condemned buffers after the drain is initiated. + +--- + +## 8. Phase 4: Hash Table Resizing + +### 8.1 Problem Statement + +The buffer hash table (`SharedBufHash`) uses PostgreSQL's `dynahash` with +`HASH_FIXED_SIZE`. After significant growth, the hash table may have excessive +chain lengths. After shrinking, it wastes memory. + +### 8.2 Incremental Rehashing + +Full rehashing requires locking all 128 partitions simultaneously -- equivalent +to stopping all buffer operations. Instead, use **incremental rehashing**: + +1. Allocate new hash table alongside the old one +2. For each partition (0..127): + a. Acquire exclusive lock on partition + b. Move all entries from old bucket to new bucket + c. Release lock + d. (Other partitions continue operating on old table concurrently) +3. After all 128 partitions migrated: + a. Emit barrier to switch all backends to new table + b. Deallocate old table + +**Concurrency:** Since each partition is independently locked, at most one +partition is being migrated at any time. Other backends see consistent state +because they look up the partition lock before accessing the table. Reads in +non-migrating partitions are unaffected. + +### 8.3 Alternative: Over-Provision + +For the initial implementation, simply pre-size the hash table for +`MaxNBuffers + NUM_BUFFER_PARTITIONS`. The additional memory cost is modest: + +| max_shared_buffers | Hash table waste | +|---|---| +| 2x shared_buffers (1GB → 2GB) | ~5 MB | +| 4x shared_buffers (1GB → 4GB) | ~15 MB | +| 8x shared_buffers (1GB → 8GB) | ~35 MB | + +This is a reasonable tradeoff for avoiding the complexity of online hash table +resizing in the initial implementation. + +--- + +## 9. Coordination Protocol + +### 9.1 ProcSignalBarrier Extension + +PostgreSQL already has a `ProcSignalBarrier` mechanism used for +`PROCSIGNAL_BARRIER_SMGRRELEASE`. We extend it with a new barrier type: + +```c +typedef enum +{ + PROCSIGNAL_BARRIER_SMGRRELEASE, + PROCSIGNAL_BARRIER_UPDATE_XLOG_LOGICAL_INFO, + PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE, /* NEW */ +} ProcSignalBarrierType; +``` + +When a backend processes this barrier: +1. Read the new value of `NBuffers` from shared memory +2. Update any backend-local cached values derived from NBuffers +3. Invalidate active `BufferAccessStrategy` objects that reference condemned IDs +4. Check `PrivateRefCount` for entries referencing condemned buffers (should be + none if drain completed correctly -- assert in debug builds) +5. 
Acknowledge the barrier + +### 9.2 Making NBuffers Atomic + +Currently, `NBuffers` is a plain `int` read without synchronization: + +```c +// globals.c +int NBuffers = 16384; +``` + +For online resize, it must become an atomic variable with a local cache: + +```c +// In shared memory: +pg_atomic_uint32 SharedNBuffers; + +// Per-backend cached copy (updated at barrier): +int NBuffers; /* remains a plain int for zero-overhead reads */ +``` + +The barrier protocol ensures all backends update their local `NBuffers` before +the resize is considered complete. Between barriers, the local copy is +guaranteed to be current. + +**Critical safety property:** Between the moment the postmaster updates +`SharedNBuffers` and the moment a backend processes the barrier, the backend +is using the OLD NBuffers value. This is safe because: +- For grow: the backend simply doesn't know about new buffers yet (harmless) +- For shrink: the drain phase ensures all condemned buffers are already free + and removed from the hash table, so no backend can reach them even with the + old NBuffers value (the hash table won't return condemned IDs, and the clock + sweep won't pick them because they're flagged) + +### 9.3 Ordering Guarantees + +The resize sequence must ensure: + +``` +For GROW: + memory committed → descriptors initialized → barrier → NBuffers updated + (Backends must not see new NBuffers before memory is ready) + +For SHRINK: + drain initiated → drain completed → barrier → NBuffers updated → memory freed + (Memory must not be freed before all backends acknowledge) +``` + +These orderings are enforced by the barrier mechanism, which acts as a full +memory fence across all processes. + +--- + +## 10. GUC and User Interface Changes + +### 10.1 GUC Context Change + +``` +shared_buffers: PGC_POSTMASTER → PGC_SIGHUP +max_shared_buffers: new, PGC_POSTMASTER +``` + +When `max_shared_buffers` is 0 (default), `shared_buffers` remains +PGC_POSTMASTER-like (validated at startup, cannot exceed current allocation). +When `max_shared_buffers > shared_buffers`, `shared_buffers` becomes +dynamically adjustable via `SIGHUP`. 
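+
+To make the coordination in Section 9 concrete, the per-backend barrier
+handler might look like the following minimal sketch. The handler name, the
+`BufResizeCtl->current_buffers` field, and the two helper calls are proposals
+of this document (not existing code), and assertions stand in for error
+handling:
+
+```c
+/*
+ * Sketch of the backend-side handler for
+ * PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE, invoked from
+ * ProcessProcSignalBarrier() at the next CHECK_FOR_INTERRUPTS().
+ */
+static bool
+ProcessBarrierBufferPoolResize(void)
+{
+	int			new_nbuffers;
+
+	new_nbuffers = (int) pg_atomic_read_u32(&BufResizeCtl->current_buffers);
+
+	if (new_nbuffers < NBuffers)
+	{
+		/*
+		 * Shrink: the drain phase already emptied the condemned range, so
+		 * only backend-local references need to be dropped here.
+		 */
+		ScrubStrategiesForShrink(new_nbuffers);				/* hypothetical */
+		Assert(!HavePinOnCondemnedBuffers(new_nbuffers));	/* hypothetical */
+	}
+
+	/* Adopt the new size; NBuffers stays a plain int for hot-path reads. */
+	NBuffers = new_nbuffers;
+
+	/* Returning true acknowledges the barrier. */
+	return true;
+}
+```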
+ +### 10.2 Validation Hooks + +```c +/* GUC check hook for shared_buffers */ +bool +check_shared_buffers(int *newval, void **extra, GucSource source) +{ + if (source == PGC_S_FILE || source == PGC_S_CLIENT) + { + /* Runtime change */ + if (*newval > MaxNBuffers) + { + GUC_check_errmsg("shared_buffers cannot exceed max_shared_buffers (%d)", + MaxNBuffers); + return false; + } + if (*newval < MIN_SHARED_BUFFERS) + { + GUC_check_errmsg("shared_buffers must be at least %d", + MIN_SHARED_BUFFERS); + return false; + } + } + return true; +} + +/* GUC assign hook for shared_buffers */ +void +assign_shared_buffers(int newval, void *extra) +{ + if (IsUnderPostmaster && newval != NBuffers) + { + /* Initiate async resize -- actual work happens in postmaster */ + RequestBufferPoolResize(newval); + } +} +``` + +### 10.3 SQL Interface + +```sql +-- Check current and maximum values: +SHOW shared_buffers; -- '1GB' +SHOW max_shared_buffers; -- '4GB' + +-- Grow: +ALTER SYSTEM SET shared_buffers = '2GB'; +SELECT pg_reload_conf(); + +-- Shrink: +ALTER SYSTEM SET shared_buffers = '512MB'; +SELECT pg_reload_conf(); + +-- Monitor resize progress: +SELECT * FROM pg_stat_buffer_pool_resize; +``` + +### 10.4 pg_stat_buffer_pool_resize View + +| Column | Type | Description | +|---|---|---| +| `status` | text | 'idle', 'growing', 'draining', 'completing' | +| `current_buffers` | int8 | Current NBuffers | +| `target_buffers` | int8 | Target NBuffers (= current when idle) | +| `max_buffers` | int8 | Maximum NBuffers (from max_shared_buffers) | +| `condemned_remaining` | int8 | Buffers still to drain (shrink only) | +| `condemned_pinned` | int8 | Condemned buffers blocked by pins | +| `condemned_dirty` | int8 | Condemned buffers being flushed | +| `started_at` | timestamptz | When current resize started | + +--- + +## 11. Edge Cases and Corner Cases + +### 11.1 Concurrent Resize Requests + +**Scenario:** DBA sets `shared_buffers = 2GB`, then immediately `shared_buffers = 4GB` +before the first resize completes. + +**Solution:** Serialize resize operations. Only one resize can be in progress. +If a new target arrives while resizing: +- If same direction (both grow or both shrink): update target, continue +- If opposite direction: complete current operation first, then start new one +- A resize-in-progress flag in shared memory prevents concurrent requests + +### 11.2 Crash During Resize + +**Scenario:** Postmaster crashes or is killed mid-resize. + +**For grow:** New memory was committed but NBuffers wasn't updated yet. On +restart, `shared_buffers` from config is used to compute NBuffers. The extra +committed memory is released when the old mapping is unmapped. No data loss. + +**For shrink:** Drain was in progress but NBuffers wasn't reduced yet. On +restart, full buffer pool is available. Condemned buffers that were flushed +are simply empty buffers. No data loss. + +**Key invariant:** The persistent `shared_buffers` in `postgresql.conf` is +always updated via `ALTER SYSTEM` *before* the resize begins. So on restart, +the new target value is used for fresh initialization. + +### 11.3 Backend Startup During Resize + +**Scenario:** New backend connects while resize is in progress. + +**For grow:** New backend inherits the shared memory mapping via `fork()`. +It reads NBuffers from shared memory. If the barrier hasn't completed yet, +it gets the old value -- safe (just doesn't see new buffers yet). After +processing the barrier, it sees the new value. + +**For shrink:** New backend reads NBuffers. 
If drain is still in progress, +it gets the old value. It won't access condemned buffers because: +1. Hash table entries for condemned pages are being removed +2. Clock sweep skips condemned buffers +3. When it processes the barrier, it gets the new value + +### 11.4 Long-Running Queries Pinning Condemned Buffers + +**Scenario:** A sequential scan holds pins on buffers in the condemned range +for the duration of a multi-hour query. + +**Solutions (in order of preference):** +1. **Wait with timeout:** Default 5 minutes. If pins aren't released, log a + WARNING with the PID and query, and cancel the shrink. +2. **Cooperative release:** When a backend unpins a condemned buffer, don't + re-add it to the ring. The scan will allocate a new buffer from the + surviving range. +3. **Admin override:** `pg_terminate_backend()` or `pg_cancel_backend()` + as a last resort. + +The shrink must NEVER force-unpin a buffer. That would corrupt the backend's +`PrivateRefCount` state and potentially the data. + +### 11.5 Checkpointer During Resize + +**Scenario:** A checkpoint is in progress when resize starts. + +**For grow:** No issue. Checkpoint doesn't know about new buffers yet, but +they're all clean (unused). Next checkpoint will include them if dirtied. + +**For shrink:** Checkpoint's `CkptBufferIds` array was allocated for old +NBuffers. The drain phase must wait for any in-progress checkpoint to +complete before it can deallocate the condemned portion of `CkptBufferIds`. + +**Solution:** Add checkpoint-awareness to the resize protocol: +1. Before initiating shrink drain, request a checkpoint +2. After checkpoint completes, proceed with drain +3. The `CkptBufferIds` array for new NBuffers is a prefix of the old array + (since we shrink from the high end), so no reallocation is needed + +### 11.6 pg_buffercache and External Extensions + +**Scenario:** `pg_buffercache` or third-party extensions iterate +`0..NBuffers-1` and read buffer descriptors. + +**Risk:** If an extension caches NBuffers and iterates after a shrink, +it may access descriptors beyond the valid range. + +**Solution:** +1. `pg_buffercache` and built-in code: update to read NBuffers at iteration + start, not cache it +2. Third-party extensions: document the behavior change. After shrink, + descriptors beyond NBuffers are zero-filled (PROT_NONE on the freed + range will SIGSEGV, which is a loud failure mode -- better than silent + corruption) +3. Provide a `BufferPoolGeneration` counter that extensions can check + +### 11.7 Predicate Locks on Condemned Buffers + +**Scenario:** Serializable transactions hold predicate locks at the buffer +level. A condemned buffer might have active predicate locks. + +**Solution:** The predicate lock manager uses buffer IDs as lock targets. +During drain: +1. Before removing a condemned buffer from the hash table, transfer any + buffer-level predicate locks to relation-level locks (coarser granularity) +2. This is consistent with existing behavior when buffers are evicted normally + +### 11.8 Relation Cache and SMgr References + +`SMgrRelation` objects cache information about which blocks are in the buffer +pool. These are per-backend and not affected by buffer pool resize, since the +buffer manager is the authoritative source. + +### 11.9 WAL Replay (Startup Process) + +**Scenario:** Buffer pool resize during WAL replay (recovery mode). + +**Solution:** Do not allow resize during recovery. Validate this in the GUC +check hook. WAL replay assumes a stable buffer pool configuration. 
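+
+A minimal sketch of that validation, as an addition to the
+`check_shared_buffers()` hook from Section 10.2 (the exact condition is
+illustrative):
+
+```c
+bool
+check_shared_buffers(int *newval, void **extra, GucSource source)
+{
+	/* ... bounds checks from Section 10.2 ... */
+
+	/* Online resize is not supported while WAL replay is in progress. */
+	if (IsUnderPostmaster && RecoveryInProgress() && *newval != NBuffers)
+	{
+		GUC_check_errmsg("shared_buffers cannot be changed during recovery");
+		return false;
+	}
+
+	return true;
+}
+```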
+ +### 11.10 Logical and Physical Replication + +**Scenario:** Primary resizes buffer pool; replica does not. + +**No issue.** `shared_buffers` is an independent per-instance setting. Buffer +pool size is not replicated. Each instance manages its own buffer pool +independently. + +### 11.11 `temp_buffers` Interaction + +`temp_buffers` (local buffers for temporary tables) are per-backend and +completely independent of shared buffers. No interaction. + +### 11.12 Out-of-Memory During Grow + +**Scenario:** System doesn't have enough physical memory when committing +new pages during grow. + +**Solution:** +1. `mmap()` with `MAP_POPULATE` to force page allocation; check return value +2. If allocation fails, log ERROR and abort the grow operation +3. NBuffers remains unchanged -- fully recoverable +4. Alternatively, use `madvise(MADV_POPULATE_WRITE)` after `mmap()` to detect + OOM before committing to the resize + +### 11.13 Buffer Pool Resize and VACUUM + +**Scenario:** VACUUM is running with a ring buffer during shrink. + +**Risk:** The ring buffer may contain buffer IDs in the condemned range. + +**Solution:** When processing the resize barrier, each backend checks its +active `BufferAccessStrategy`: +- If any ring buffer entry references a condemned ID, replace it with + `InvalidBuffer` (the ring will allocate a new buffer from the surviving range) +- This is analogous to `StrategyRejectBuffer()`'s existing logic + +### 11.14 Race Between PIN and NBuffers Update + +**Scenario:** Backend A reads `NBuffers = 2000`, begins to pin buffer 1999. +Concurrently, backend B processes shrink barrier and updates its NBuffers to +1000. Can A successfully pin a condemned buffer? + +**Analysis:** This cannot happen because: +1. The drain phase ensures buffer 1999 has refcount = 0 and no hash table entry + BEFORE the barrier is emitted +2. Backend A can only reach buffer 1999 via: + - Hash table lookup (entry already removed) + - Clock sweep (condemned buffers are skipped) +3. If A already had a pin on 1999 from before the drain, the drain waits for + A to release that pin before proceeding + +### 11.15 Rapid Grow-Shrink Cycles + +**Scenario:** External tooling rapidly adjusts `shared_buffers` up and down. + +**Protection:** +- Minimum cooldown period between resize operations (configurable, default + 30 seconds) +- Each resize logs to the server log with timing and old/new values +- The `pg_stat_buffer_pool_resize` view shows history for monitoring + +--- + +## 12. Huge Pages + +### 12.1 The Challenge + +When `huge_pages = on`, PostgreSQL allocates the shared memory segment using +2MB (or 1GB) huge pages via `mmap()` with `MAP_HUGETLB`. This improves TLB +coverage for the buffer pool. + +**Problem with resize:** +- `mremap()` on `MAP_HUGETLB` regions has historically been unreliable on Linux +- Committing additional huge pages after startup may fail if the system's + huge page pool is exhausted +- Huge pages cannot be partially committed -- you get a full 2MB page or nothing + +### 12.2 Solution + +**For grow with huge pages:** +1. At startup, reserve `max_shared_buffers` worth of huge pages (via + `MAP_HUGETLB | MAP_NORESERVE`) +2. Growing commits additional huge pages from the pre-reserved range +3. If the OS huge page pool is exhausted, fall back to regular pages for the + new portion (with a WARNING) + +**For shrink with huge pages:** +1. After drain and barrier, use `madvise(MADV_DONTNEED)` to release huge pages +2. 
On Linux 4.5+, `MADV_FREE` can be used for lazy release + +**Alternative (Dolgov's approach):** Replace `mremap()` with unmap+remap: +```c +munmap(old_addr + old_size, extend_size); +mmap(old_addr, new_size, ..., MAP_HUGETLB | MAP_FIXED, memfd, 0); +``` +This works because the `memfd` preserves the data; we're just changing the +mapping, not the content. + +### 12.3 `max_shared_buffers` and Huge Page Reservation + +When `huge_pages = on` and `max_shared_buffers > shared_buffers`: +- The system must have enough huge pages for `max_shared_buffers` worth of + virtual address reservation +- The `shared_memory_size_in_huge_pages` GUC should report the maximum + reservation needed +- Document that DBAs must configure `vm.nr_hugepages` for the maximum, not + just the initial `shared_buffers` + +--- + +## 13. Portability + +### 13.1 Linux (Primary Target) + +Full support using: +- `memfd_create()` for shared anonymous file +- `mmap()` with `MAP_FIXED` for commit/decommit +- `mprotect()` for access control +- `madvise(MADV_DONTNEED)` for memory release +- `MAP_HUGETLB` for huge page support + +### 13.2 FreeBSD + +- `memfd_create()` available since FreeBSD 13 +- `shm_open(SHM_ANON)` as alternative +- `MAP_HUGETLB` → `MAP_ALIGNED_SUPER` +- Otherwise similar to Linux + +### 13.3 macOS + +- No `memfd_create()` -- use `shm_open()` with immediate unlink +- No huge page support in `mmap()` (superpages via `VM_FLAGS_SUPERPAGE_SIZE_2MB` + in Mach VM only) +- `mmap()` with `MAP_FIXED` works +- Practical limitation: macOS is rarely used for production PG + +### 13.4 Windows (EXEC_BACKEND) + +- Use `VirtualAlloc()` with `MEM_RESERVE` / `MEM_COMMIT` +- `CreateFileMapping()` with `SEC_RESERVE` for shared memory +- `MapViewOfFile()` for backend attachment +- `VirtualFree()` with `MEM_DECOMMIT` for shrink +- Large pages via `MEM_LARGE_PAGES` + +Windows EXEC_BACKEND mode already re-attaches shared memory after `exec()`. +The resize protocol would extend `AttachSharedMemoryStructs()` to handle +variable-size regions. + +### 13.5 Portability Abstraction Layer + +Create a `pg_shmem_resize.h` abstraction: + +```c +/* Reserve virtual address space without committing physical memory */ +extern void *pg_shmem_reserve(Size size); + +/* Commit physical memory within a reserved region */ +extern bool pg_shmem_commit(void *addr, Size size, bool huge_pages); + +/* Decommit physical memory (return to OS) */ +extern void pg_shmem_decommit(void *addr, Size size); + +/* Is this region committed? */ +extern bool pg_shmem_is_committed(void *addr, Size size); +``` + +Platform-specific implementations in `src/backend/port/`. + +--- + +## 14. Performance Impact + +### 14.1 Steady-State Overhead (Not Resizing) + +**Goal: Zero overhead when not resizing.** + +Analysis of the proposed design: + +| Component | Overhead | Explanation | +|---|---|---| +| `GetBufferDescriptor()` | **None** | Still direct array indexing | +| `BufHdrGetBlock()` | **None** | Still pointer arithmetic | +| `ClockSweepTick()` | **None** | `% NBuffers` unchanged (NBuffers is a local int) | +| `BufTableLookup()` | **Negligible** | Slightly larger hash table (over-provisioned) | +| `NBuffers` reads | **None** | Local cached copy, plain int | + +The only measurable difference is a slightly larger hash table, which may +actually improve performance (fewer collisions at low fill ratio). 
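+
+As a concrete grounding for the portability layer declared in Section 13.5,
+a Linux-only implementation could be sketched as below. This assumes a single
+`memfd` backing the whole reserved range and elides error handling and
+huge-page support; it is illustrative, not a finished implementation:
+
+```c
+#include "postgres.h"
+
+#include <sys/mman.h>		/* memfd_create() needs _GNU_SOURCE / glibc 2.27+ */
+#include <unistd.h>
+
+static int	resize_fd = -1;	/* memfd backing the reserved range */
+
+void *
+pg_shmem_reserve(Size size)
+{
+	void	   *base;
+
+	/* Size the backing object up front; pages are allocated only on touch. */
+	resize_fd = memfd_create("pg_buffer_pool", 0);
+	ftruncate(resize_fd, size);
+
+	/* Reserve address space without committing memory or swap. */
+	base = mmap(NULL, size, PROT_NONE,
+				MAP_SHARED | MAP_NORESERVE, resize_fd, 0);
+	return base;
+}
+
+bool
+pg_shmem_commit(void *addr, Size size, bool huge_pages)
+{
+	/* Enable access to a sub-range of the reservation. */
+	return mprotect(addr, size, PROT_READ | PROT_WRITE) == 0;
+}
+
+void
+pg_shmem_decommit(void *addr, Size size)
+{
+	/*
+	 * Drop the pages and forbid further access.  For a tmpfs-backed shared
+	 * mapping, MADV_REMOVE (or an fallocate() hole punch) may be needed to
+	 * actually release the backing pages, not just MADV_DONTNEED.
+	 */
+	madvise(addr, size, MADV_DONTNEED);
+	mprotect(addr, size, PROT_NONE);
+}
+```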
+ +### 14.2 During Grow + +- Memory allocation: OS kernel overhead for committing pages (~ms) +- Barrier propagation: Each backend processes barrier at next + `CHECK_FOR_INTERRUPTS()` -- typically within milliseconds +- No query pauses or lock contention + +**Expected impact: < 100ms for typical grow operations.** + +### 14.3 During Shrink + +- Drain phase: depends on how many condemned buffers are dirty and/or pinned + - Best case (all clean, unpinned): milliseconds + - Typical case (some dirty): seconds (bounded by flush speed) + - Worst case (pinned by long queries): may need to wait minutes or cancel +- Barrier propagation: same as grow +- Memory decommit: OS kernel overhead (~ms) + +**Expected impact: seconds for typical shrink operations, bounded by the +slowest-to-drain buffer.** + +### 14.4 Benchmarking Plan + +Measure with pgbench at various scales: +1. **Baseline:** Fixed shared_buffers, no resize capability compiled in +2. **Overhead test:** max_shared_buffers > shared_buffers but no resize occurs +3. **Grow test:** Grow from 1GB to 4GB under pgbench load, measure TPS impact +4. **Shrink test:** Shrink from 4GB to 1GB under pgbench load +5. **Stress test:** Rapid grow/shrink cycles to detect race conditions + +--- + +## 15. Observability + +### 15.1 Server Log Messages + +``` +LOG: buffer pool resize started: 131072 -> 262144 buffers (1 GB -> 2 GB) +LOG: buffer pool resize: committing memory for 131072 new buffers +LOG: buffer pool resize: initializing new buffer descriptors +LOG: buffer pool resize: waiting for all backends to acknowledge +LOG: buffer pool resize completed in 127 ms +``` + +For shrink: +``` +LOG: buffer pool resize started: 262144 -> 131072 buffers (2 GB -> 1 GB) +LOG: buffer pool resize: draining 131072 condemned buffers +LOG: buffer pool resize: draining progress: 130000/131072 (1072 remaining, 42 pinned, 15 dirty) +LOG: buffer pool resize: drain complete, waiting for barrier +LOG: buffer pool resize completed in 3247 ms +``` + +### 15.2 Wait Events + +New wait events: +- `BufferPoolResize` -- backend waiting during barrier processing +- `BufferPoolDrain` -- postmaster waiting for condemned buffers to drain + +### 15.3 pg_stat_activity Integration + +During resize, backends processing the barrier show: +``` +wait_event_type = 'IPC' +wait_event = 'BufferPoolResize' +``` + +--- + +## 16. 
Testing Strategy + +### 16.1 Unit Tests + +- Grow from minimum (128kB) to 1GB in increments +- Shrink from 1GB to minimum +- Grow and shrink to same target (no-op) +- Exceed max_shared_buffers (must fail with clear error) +- Shrink below minimum (must fail) +- NBuffers boundary: test buffers at old_NBuffers-1 and new_NBuffers-1 + +### 16.2 Concurrency Tests (TAP Tests) + +- Grow while pgbench is running +- Shrink while pgbench is running +- Grow while VACUUM is running (ring buffer interaction) +- Shrink while long-running SELECT holds pins on condemned buffers +- Grow while checkpoint is in progress +- Shrink while checkpoint is in progress +- Backend connects during resize +- Backend disconnects during resize +- Two concurrent resize requests (must serialize) + +### 16.3 Crash Recovery Tests + +- Kill postmaster during grow (between commit and NBuffers update) +- Kill postmaster during shrink (during drain) +- Kill postmaster during barrier propagation +- Kill individual backend during barrier processing +- OOM during grow (mmap fails) + +### 16.4 Regression Tests + +- `pg_buffercache` output before and after resize +- `EXPLAIN (BUFFERS)` output during resize +- `pg_stat_bgwriter` counters during resize +- Extension loading (`shared_preload_libraries`) with max_shared_buffers + +### 16.5 Stress Tests + +- Rapid grow/shrink cycles (every 5 seconds) under pgbench +- Grow to very large values (256GB) if hardware permits +- Shrink while all buffers are dirty +- 1000 concurrent backends, all active during resize + +### 16.6 Platform Tests + +- Linux x86_64 (primary) +- Linux aarch64 +- FreeBSD +- macOS (development only) +- Windows (EXEC_BACKEND) +- With and without huge_pages = on + +--- + +## 17. Migration and Compatibility + +### 17.1 Default Behavior + +When `max_shared_buffers = 0` (default), the system behaves identically to +current PostgreSQL: +- `shared_buffers` requires restart to change +- Buffer pool memory is allocated exactly as today +- No additional virtual address space reservation +- No performance overhead + +Online resize is opt-in via setting `max_shared_buffers`. + +### 17.2 Extension Compatibility + +Extensions that access buffer internals must be updated: + +| Extension | Impact | Required Change | +|---|---|---| +| `pg_buffercache` | Medium | Read NBuffers at scan start, not at load | +| `pg_prewarm` | Low | No change needed (calls existing buffer manager APIs) | +| `pg_stat_statements` | None | Doesn't access buffers directly | +| Custom bgworkers | Medium | Must handle `PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE` | + +### 17.3 Upgrade Path + +- pg_upgrade: No special handling (max_shared_buffers defaults to 0) +- Replication: No impact (shared_buffers is instance-local) +- Backup/restore: No impact + +--- + +## 18. Phased Implementation Plan + +### Phase 1: Foundation (Target: PostgreSQL 19) + +**Goal:** Separate buffer pool memory from main shared memory segment. + +1. Create `pg_shmem_resize.h` portability layer +2. Move buffer manager arrays to separate memory mapping +3. Add `max_shared_buffers` GUC (PGC_POSTMASTER) +4. Pre-size hash table for `max_shared_buffers` when set +5. Regression tests pass with no behavior change + +**Validation:** All existing tests pass. No performance regression in pgbench. + +### Phase 2: Online Grow (Target: PostgreSQL 19) + +**Goal:** Allow increasing `shared_buffers` without restart. + +1. Change `shared_buffers` context to PGC_SIGHUP (with max_shared_buffers guard) +2. Implement memory commit for new buffer chunks +3. 
Implement new descriptor initialization +4. Add `PROCSIGNAL_BARRIER_BUFFER_POOL_RESIZE` barrier type +5. Implement `NBuffers` update protocol +6. Add `pg_stat_buffer_pool_resize` view +7. Add TAP tests for online grow + +**Validation:** Can double `shared_buffers` under pgbench load with < 100ms +interruption. No data corruption. + +### Phase 3: Online Shrink (Target: PostgreSQL 20) + +**Goal:** Allow decreasing `shared_buffers` without restart. + +1. Implement drain protocol for condemned buffers +2. Add `BM_CONDEMNED` flag to buffer state +3. Implement cooperative buffer invalidation on unpin +4. Add memory decommit after drain +5. Handle SIGBUS prevention +6. Add timeout and cancellation for stuck drains +7. Add TAP tests for online shrink + +**Validation:** Can halve `shared_buffers` under pgbench load. Dirty page +flushing completes within checkpoint_timeout. Pinned-buffer timeout works. + +### Phase 4: Dynamic Hash Table (Target: PostgreSQL 20+) + +**Goal:** Allow the buffer hash table to resize dynamically. + +1. Remove `HASH_FIXED_SIZE` from `SharedBufHash` +2. Implement incremental rehashing across partitions +3. Remove the over-provisioning workaround from Phase 2 +4. Benchmark to ensure no regression + +### Phase 5: Observability and Polish (Ongoing) + +1. Integrate with `pg_stat_io` +2. Add `log_buffer_pool_resize` GUC for detailed logging +3. Document in official PostgreSQL documentation +4. Write pg_buffercache extension updates +5. Consider auto-resize hooks (but NOT automatic tuning) + +--- + +## 19. Open Questions + +1. **Should shrink be interruptible?** If a DBA starts a shrink and realizes + it was a mistake, can they cancel it by setting `shared_buffers` back up? + (Proposed: yes, by detecting the new target during drain.) + +2. **Chunk size configurability.** Should the unit of resize be configurable? + MySQL uses 128MB chunks. We could default to 128MB but allow tuning for + systems with very large or very small buffer pools. + +3. **Memory overcommit.** On systems with `vm.overcommit_memory = 0` (heuristic), + reserving virtual address space for `max_shared_buffers` may fail even though + no physical memory is needed. Should we document this requirement, or detect + it? + +4. **Interaction with cgroups memory limits.** In containerized environments, + growing the buffer pool may hit cgroup memory limits. Should we detect this + proactively? + +5. **WAL implications.** Does buffer pool resize create any WAL consistency + issues? (Believed: no, because WAL replay operates on specific blocks, not + buffer IDs. But needs careful analysis.) + +6. **Relation to DSM registry work.** Can the DSM registry infrastructure + (`GetNamedDSMSegment()`) be leveraged for the buffer pool mapping? Probably + not -- the DSM registry is designed for extension-managed allocations that + can be recreated, not for the core buffer pool which must be persistent and + contiguous. But the DSM registry's patterns for safe cross-backend + initialization are relevant to the coordination protocol. + +7. **Future: online `max_connections` resize.** The same barrier infrastructure + could be reused for online `max_connections` changes (another frequently + requested feature). Should the coordination protocol be designed generically? + +--- + +## 20. 
References + +### PostgreSQL Source Code + +- `src/backend/storage/buffer/buf_init.c` -- Buffer pool initialization +- `src/backend/storage/buffer/bufmgr.c` -- Buffer manager core +- `src/backend/storage/buffer/freelist.c` -- Clock sweep and strategy +- `src/backend/storage/buffer/buf_table.c` -- Buffer hash table +- `src/backend/storage/ipc/ipci.c` -- Shared memory setup +- `src/backend/storage/ipc/dsm_registry.c` -- DSM registry +- `src/backend/storage/ipc/procsignal.c` -- ProcSignalBarrier +- `src/backend/port/sysv_shmem.c` -- Shared memory allocation +- `src/include/storage/buf_internals.h` -- Buffer descriptor definitions + +### PostgreSQL Mailing List + +- Dmitry Dolgov, "Changing shared_buffers without restart" (October 2024) + https://www.postgresql.org/message-id/cnthxg2eekacrejyeonuhiaezc7vd7o2uowlsbenxqfkjwgvwj@qgzu6eoqrglb +- Follow-up discussion with Robert Haas, Thomas Munro, Peter Eisentraut (2024-2025) + https://www.postgresql.org/message-id/eqs6v4rsboazl67xz3wxc6xjkgrpfybitpl45y3lmb2br67wbj@o7czebb3rlgd + +### Other Database Systems + +- MySQL InnoDB online buffer pool resize (WL#6117): + https://dev.mysql.com/doc/refman/8.4/en/innodb-buffer-pool-resize.html +- Oracle SGA dynamic resize: + https://docs.oracle.com/en/database/oracle/oracle-database/19/tgdba/tuning-system-global-area.html +- SQL Server memory management: + https://learn.microsoft.com/en-us/sql/relational-databases/memory-management-architecture-guide + +### Academic Papers + +- Storm et al., "Adaptive Self-Tuning Memory in DB2 (STMM)", VLDB 2006 +- Tan et al., "iBTune: Individualized Buffer Tuning for Cloud Databases", VLDB 2019 +- Leis et al., "Virtual-Memory Assisted Buffer Management (vmcache)", SIGMOD 2023 +- "Evolution of Buffer Management in Database Systems", arXiv:2512.22995, December 2025 diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c index 80e3088fc7e30..7642817a677c2 100644 --- a/src/backend/postmaster/bgwriter.c +++ b/src/backend/postmaster/bgwriter.c @@ -40,6 +40,7 @@ #include "postmaster/interrupt.h" #include "storage/aio_subsys.h" #include "storage/buf_internals.h" +#include "storage/buf_resize.h" #include "storage/bufmgr.h" #include "storage/condition_variable.h" #include "storage/fd.h" @@ -235,6 +236,11 @@ BackgroundWriterMain(const void *startup_data, size_t startup_data_len) */ can_hibernate = BgBufferSync(&wb_context); + /* + * Drain any condemned buffers from a buffer pool shrink. + */ + BufPoolDrainCondemnedBuffers(); + /* Report pending statistics to the cumulative stats system */ pgstat_report_bgwriter(); pgstat_report_wal(true); diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 921d73226d632..56bbcd305a1c5 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -110,6 +110,7 @@ #include "replication/slotsync.h" #include "replication/walsender.h" #include "storage/aio_subsys.h" +#include "storage/buf_resize.h" #include "storage/fd.h" #include "storage/io_worker.h" #include "storage/ipc.h" @@ -2014,6 +2015,16 @@ process_pm_reload_request(void) ereport(LOG, (errmsg("received SIGHUP, reloading configuration files"))); ProcessConfigFile(PGC_SIGHUP); + + /* + * Execute any pending buffer pool resize before notifying children. + * The resize (if any) was requested by assign_shared_buffers() during + * ProcessConfigFile(). We execute it now so that NBuffers is updated + * (via ProcSignalBarrier) in all backends before they process SIGHUP + * and update their SharedBuffersGUC. 
+ */ + ExecuteBufferPoolResize(); + SignalChildren(SIGHUP, btmask_all_except(B_DEAD_END_BACKEND)); /* Reload authentication config files too */ diff --git a/src/backend/storage/buffer/Makefile b/src/backend/storage/buffer/Makefile index fd7c40dcb089d..c908add2c06d7 100644 --- a/src/backend/storage/buffer/Makefile +++ b/src/backend/storage/buffer/Makefile @@ -14,6 +14,7 @@ include $(top_builddir)/src/Makefile.global OBJS = \ buf_init.o \ + buf_resize.o \ buf_table.o \ bufmgr.o \ freelist.o \ diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c index 9a312bcc7b3c6..863df769a0dff 100644 --- a/src/backend/storage/buffer/buf_init.c +++ b/src/backend/storage/buffer/buf_init.c @@ -16,6 +16,7 @@ #include "storage/aio.h" #include "storage/buf_internals.h" +#include "storage/buf_resize.h" #include "storage/bufmgr.h" BufferDescPadded *BufferDescriptors; @@ -63,6 +64,16 @@ CkptSortItem *CkptBufferIds; * * This is called once during shared-memory initialization (either in the * postmaster, or in a standalone backend). + * + * When max_shared_buffers is configured, BufferPoolReserveMemory() has + * already set up the global pointers (BufferDescriptors, BufferBlocks, etc.) + * pointing into separately-mapped VA regions. In that case, we skip the + * ShmemInitStruct allocations for the buffer arrays and just initialize + * the descriptors in the pre-allocated memory. + * + * When max_shared_buffers is not configured (the default), we use the + * traditional path of allocating everything from the main shared memory + * segment via ShmemInitStruct. */ void BufferManagerShmemInit(void) @@ -71,36 +82,55 @@ BufferManagerShmemInit(void) foundDescs, foundIOCV, foundBufCkpt; + bool using_reserved_memory = (MaxNBuffers > 0 && + MaxNBuffers > NBuffers); + + if (using_reserved_memory) + { + /* + * Memory was already reserved by BufferPoolReserveMemory() and + * global pointers are already set. Mark as "not found" so we + * initialize the descriptors below. + */ + foundDescs = false; + foundBufs = false; + foundIOCV = false; + foundBufCkpt = false; + } + else + { + /* Traditional path: allocate from main shared memory segment */ + + /* Align descriptors to a cacheline boundary. */ + BufferDescriptors = (BufferDescPadded *) + ShmemInitStruct("Buffer Descriptors", + NBuffers * sizeof(BufferDescPadded), + &foundDescs); + + /* Align buffer pool on IO page size boundary. */ + BufferBlocks = (char *) + TYPEALIGN(PG_IO_ALIGN_SIZE, + ShmemInitStruct("Buffer Blocks", + NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, + &foundBufs)); + + /* Align condition variables to cacheline boundary. */ + BufferIOCVArray = (ConditionVariableMinimallyPadded *) + ShmemInitStruct("Buffer IO Condition Variables", + NBuffers * sizeof(ConditionVariableMinimallyPadded), + &foundIOCV); - /* Align descriptors to a cacheline boundary. */ - BufferDescriptors = (BufferDescPadded *) - ShmemInitStruct("Buffer Descriptors", - NBuffers * sizeof(BufferDescPadded), - &foundDescs); - - /* Align buffer pool on IO page size boundary. */ - BufferBlocks = (char *) - TYPEALIGN(PG_IO_ALIGN_SIZE, - ShmemInitStruct("Buffer Blocks", - NBuffers * (Size) BLCKSZ + PG_IO_ALIGN_SIZE, - &foundBufs)); - - /* Align condition variables to cacheline boundary. 
*/ - BufferIOCVArray = (ConditionVariableMinimallyPadded *) - ShmemInitStruct("Buffer IO Condition Variables", - NBuffers * sizeof(ConditionVariableMinimallyPadded), - &foundIOCV); - - /* - * The array used to sort to-be-checkpointed buffer ids is located in - * shared memory, to avoid having to allocate significant amounts of - * memory at runtime. As that'd be in the middle of a checkpoint, or when - * the checkpointer is restarted, memory allocation failures would be - * painful. - */ - CkptBufferIds = (CkptSortItem *) - ShmemInitStruct("Checkpoint BufferIds", - NBuffers * sizeof(CkptSortItem), &foundBufCkpt); + /* + * The array used to sort to-be-checkpointed buffer ids is located in + * shared memory, to avoid having to allocate significant amounts of + * memory at runtime. As that'd be in the middle of a checkpoint, or + * when the checkpointer is restarted, memory allocation failures + * would be painful. + */ + CkptBufferIds = (CkptSortItem *) + ShmemInitStruct("Checkpoint BufferIds", + NBuffers * sizeof(CkptSortItem), &foundBufCkpt); + } if (foundDescs || foundBufs || foundIOCV || foundBufCkpt) { @@ -148,32 +178,43 @@ BufferManagerShmemInit(void) * * compute the size of shared memory for the buffer pool including * data pages, buffer descriptors, hash tables, etc. + * + * When max_shared_buffers is configured for online resize, the buffer + * arrays are allocated separately (not from the main shmem segment), + * so we only include the strategy/hash table sizes here. */ Size BufferManagerShmemSize(void) { Size size = 0; + bool using_reserved_memory = (MaxNBuffers > 0 && + MaxNBuffers > NBuffers); - /* size of buffer descriptors */ - size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded))); - /* to allow aligning buffer descriptors */ - size = add_size(size, PG_CACHE_LINE_SIZE); + if (!using_reserved_memory) + { + /* Traditional path: everything in main shared memory */ - /* size of data pages, plus alignment padding */ - size = add_size(size, PG_IO_ALIGN_SIZE); - size = add_size(size, mul_size(NBuffers, BLCKSZ)); + /* size of buffer descriptors */ + size = add_size(size, mul_size(NBuffers, sizeof(BufferDescPadded))); + /* to allow aligning buffer descriptors */ + size = add_size(size, PG_CACHE_LINE_SIZE); - /* size of stuff controlled by freelist.c */ - size = add_size(size, StrategyShmemSize()); + /* size of data pages, plus alignment padding */ + size = add_size(size, PG_IO_ALIGN_SIZE); + size = add_size(size, mul_size(NBuffers, BLCKSZ)); + + /* size of I/O condition variables */ + size = add_size(size, mul_size(NBuffers, + sizeof(ConditionVariableMinimallyPadded))); + /* to allow aligning the above */ + size = add_size(size, PG_CACHE_LINE_SIZE); - /* size of I/O condition variables */ - size = add_size(size, mul_size(NBuffers, - sizeof(ConditionVariableMinimallyPadded))); - /* to allow aligning the above */ - size = add_size(size, PG_CACHE_LINE_SIZE); + /* size of checkpoint sort array in bufmgr.c */ + size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem))); + } - /* size of checkpoint sort array in bufmgr.c */ - size = add_size(size, mul_size(NBuffers, sizeof(CkptSortItem))); + /* size of stuff controlled by freelist.c (always in main shmem) */ + size = add_size(size, StrategyShmemSize()); return size; } diff --git a/src/backend/storage/buffer/buf_resize.c b/src/backend/storage/buffer/buf_resize.c new file mode 100644 index 0000000000000..34559013e6299 --- /dev/null +++ b/src/backend/storage/buffer/buf_resize.c @@ -0,0 +1,864 @@ 
+/*-------------------------------------------------------------------------
+ *
+ * buf_resize.c
+ *    Online buffer pool resizing without server restart.
+ *
+ * This module implements the ability to change shared_buffers at runtime
+ * via SIGHUP, without requiring a PostgreSQL restart. It works by:
+ *
+ * 1. At startup, reserving virtual address space for max_shared_buffers
+ *    worth of buffer pool arrays (descriptors, blocks, CVs, ckpt IDs).
+ *
+ * 2. Committing physical memory only for the initial shared_buffers.
+ *
+ * 3. On grow: committing additional memory, initializing new descriptors,
+ *    and publishing the new NBuffers via an atomic variable. The
+ *    postmaster performs the resize then signals children via SIGHUP;
+ *    each child reads current_buffers from shared memory.
+ *
+ * 4. On shrink: updating NBuffers immediately, then having the bgwriter
+ *    asynchronously drain condemned buffers (flushing dirty pages,
+ *    evicting unpinned buffers) before decommitting memory.
+ *
+ * The key invariant is that the base pointers (BufferDescriptors,
+ * BufferBlocks, etc.) never change -- only NBuffers changes. This means
+ * GetBufferDescriptor() and BufHdrGetBlock() remain zero-overhead.
+ *
+ * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *    src/backend/storage/buffer/buf_resize.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include "miscadmin.h"
+#include "storage/aio.h"
+#include "storage/buf_internals.h"
+#include "storage/buf_resize.h"
+#include "storage/bufmgr.h"
+#include "storage/condition_variable.h"
+#include "storage/proc.h"
+#include "storage/proclist.h"
+#include "storage/shmem.h"
+#include "utils/guc.h"
+#include "utils/guc_hooks.h"
+#include "utils/timestamp.h"
+
+/* GUC variable MaxNBuffers is declared in globals.c */
+
+/* Shared memory control structure */
+BufPoolResizeCtl *BufResizeCtl = NULL;
+
+/*
+ * Separately-mapped regions for each buffer pool array.
+ * These are the reserved VA ranges, sized for MaxNBuffers.
+ * The actual committed portion covers [0, NBuffers).
+ */
+static void *ReservedBufferBlocks = NULL;
+static void *ReservedBufferDescriptors = NULL;
+static void *ReservedBufferIOCVs = NULL;
+static void *ReservedCkptBufferIds = NULL;
+
+/* Effective max: either MaxNBuffers if set, or NBuffers */
+static int
+GetEffectiveMaxNBuffers(void)
+{
+    return MaxNBuffers > 0 ? MaxNBuffers : NBuffers;
+}
+
+/*
+ * Reserve virtual address space for buffer pool arrays.
+ *
+ * This is called once during postmaster startup. We reserve the address
+ * space with mmap(MAP_NORESERVE) so that no physical memory is committed
+ * up front. The reserved ranges are later partially committed as needed.
+ *
+ * After this call, BufferBlocks, BufferDescriptors, BufferIOCVArray,
+ * and CkptBufferIds point to the starts of their reserved regions.
+ */
+void
+BufferPoolReserveMemory(void)
+{
+    int         max_bufs = GetEffectiveMaxNBuffers();
+    Size        blocks_size;
+    Size        descs_size;
+    Size        iocv_size;
+    Size        ckpt_size;
+
+    /* If max equals current, no reservation needed -- use normal shmem path */
+    if (MaxNBuffers <= 0 || MaxNBuffers <= NBuffers)
+        return;
+
+#ifdef EXEC_BACKEND
+    /*
+     * On EXEC_BACKEND (Windows), child processes are started via CreateProcess
+     * rather than fork(), so they do not inherit mmap'd regions.
Online + * buffer pool resize requires fork() semantics for shared anonymous + * mappings. Refuse to start rather than silently breaking. + */ + ereport(FATAL, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("max_shared_buffers is not supported on this platform"), + errhint("Remove the max_shared_buffers setting from postgresql.conf."))); +#endif + + /* + * Calculate sizes for the maximum possible buffer count. + */ + blocks_size = add_size(mul_size((Size) max_bufs, BLCKSZ), PG_IO_ALIGN_SIZE); + descs_size = add_size(mul_size((Size) max_bufs, sizeof(BufferDescPadded)), PG_CACHE_LINE_SIZE); + iocv_size = add_size(mul_size((Size) max_bufs, sizeof(ConditionVariableMinimallyPadded)), PG_CACHE_LINE_SIZE); + ckpt_size = mul_size((Size) max_bufs, sizeof(CkptSortItem)); + + /* + * Reserve virtual address space for each array. MAP_NORESERVE tells + * the kernel not to reserve swap space for pages we haven't touched. + * MAP_SHARED | MAP_ANONYMOUS gives us pages visible across fork(), + * so child processes inherit the same mappings. + * + * Note: On Linux, MAP_NORESERVE means no physical memory or swap is + * consumed until pages are actually touched. + */ + ReservedBufferBlocks = mmap(NULL, blocks_size, + PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE, + -1, 0); + if (ReservedBufferBlocks == MAP_FAILED) + ereport(FATAL, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not reserve %zu bytes of virtual address space for buffer blocks", + blocks_size))); + + ReservedBufferDescriptors = mmap(NULL, descs_size, + PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE, + -1, 0); + if (ReservedBufferDescriptors == MAP_FAILED) + ereport(FATAL, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not reserve virtual address space for buffer descriptors"))); + + ReservedBufferIOCVs = mmap(NULL, iocv_size, + PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE, + -1, 0); + if (ReservedBufferIOCVs == MAP_FAILED) + ereport(FATAL, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not reserve virtual address space for buffer IO CVs"))); + + ReservedCkptBufferIds = mmap(NULL, ckpt_size, + PROT_READ | PROT_WRITE, + MAP_ANONYMOUS | MAP_SHARED | MAP_NORESERVE, + -1, 0); + if (ReservedCkptBufferIds == MAP_FAILED) + ereport(FATAL, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not reserve virtual address space for checkpoint buffer IDs"))); + + /* + * Set global pointers. These will be stable for the lifetime of the + * postmaster (and thus all child backends via fork()). + */ + BufferBlocks = (char *) TYPEALIGN(PG_IO_ALIGN_SIZE, ReservedBufferBlocks); + BufferDescriptors = (BufferDescPadded *) + TYPEALIGN(PG_CACHE_LINE_SIZE, ReservedBufferDescriptors); + BufferIOCVArray = (ConditionVariableMinimallyPadded *) + TYPEALIGN(PG_CACHE_LINE_SIZE, ReservedBufferIOCVs); + CkptBufferIds = (CkptSortItem *) ReservedCkptBufferIds; + + elog(DEBUG1, "reserved buffer pool VA space for %d buffers (%zu MB)", + max_bufs, blocks_size / (1024 * 1024)); +} + +/* + * Commit physical memory for buffers in the range [start_buf, end_buf). + * + * When growing, this makes new pages accessible. The memory was already + * reserved by BufferPoolReserveMemory() using MAP_NORESERVE. On Linux, + * simply touching the pages will fault them in. + * + * We first try MADV_POPULATE_WRITE (Linux 5.14+) for efficient bulk + * population with early OOM detection. If unsupported, we fall back to + * manually touching each page to fault it in. 
+ * + * Only the delta range [start_buf, end_buf) is committed, not the entire + * pool. This avoids re-touching already-committed pages and ensures + * rollback on failure only affects the new range (not live buffers). + * + * Returns true on success, false if memory could not be committed (OOM). + */ +bool +BufferPoolCommitMemory(int start_buf, int end_buf) +{ + Size blocks_off = mul_size((Size) start_buf, BLCKSZ); + Size blocks_len = mul_size((Size) (end_buf - start_buf), BLCKSZ); + Size descs_off = mul_size((Size) start_buf, sizeof(BufferDescPadded)); + Size descs_len = mul_size((Size) (end_buf - start_buf), sizeof(BufferDescPadded)); + Size iocv_off = mul_size((Size) start_buf, sizeof(ConditionVariableMinimallyPadded)); + Size iocv_len = mul_size((Size) (end_buf - start_buf), sizeof(ConditionVariableMinimallyPadded)); + Size ckpt_off = mul_size((Size) start_buf, sizeof(CkptSortItem)); + Size ckpt_len = mul_size((Size) (end_buf - start_buf), sizeof(CkptSortItem)); + bool use_madvise = false; + +#ifdef MADV_POPULATE_WRITE + /* + * Try MADV_POPULATE_WRITE first. This causes the kernel to allocate + * physical pages for the range. If unsupported (EINVAL on older + * kernels), fall back to manual page touching. + * + * If population succeeds for some arrays but fails for others, we + * roll back by releasing only the newly-committed pages. + */ + if (madvise(BufferBlocks + blocks_off, blocks_len, MADV_POPULATE_WRITE) == 0) + { + use_madvise = true; + + if (madvise((char *) BufferDescriptors + descs_off, descs_len, + MADV_POPULATE_WRITE) != 0) + { + madvise(BufferBlocks + blocks_off, blocks_len, MADV_DONTNEED); + ereport(WARNING, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not commit memory for buffer descriptors: %m"))); + return false; + } + if (madvise((char *) BufferIOCVArray + iocv_off, iocv_len, + MADV_POPULATE_WRITE) != 0) + { + madvise(BufferBlocks + blocks_off, blocks_len, MADV_DONTNEED); + madvise((char *) BufferDescriptors + descs_off, descs_len, MADV_DONTNEED); + ereport(WARNING, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not commit memory for buffer IO CVs: %m"))); + return false; + } + if (madvise((char *) CkptBufferIds + ckpt_off, ckpt_len, + MADV_POPULATE_WRITE) != 0) + { + madvise(BufferBlocks + blocks_off, blocks_len, MADV_DONTNEED); + madvise((char *) BufferDescriptors + descs_off, descs_len, MADV_DONTNEED); + madvise((char *) BufferIOCVArray + iocv_off, iocv_len, MADV_DONTNEED); + ereport(WARNING, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not commit memory for checkpoint buffer IDs: %m"))); + return false; + } + } + else if (errno != EINVAL) + { + ereport(WARNING, + (errcode(ERRCODE_OUT_OF_MEMORY), + errmsg("could not commit memory for buffers %d..%d: %m", + start_buf, end_buf))); + return false; + } + /* else: EINVAL means MADV_POPULATE_WRITE not supported, fall through */ +#endif + + if (!use_madvise) + { + volatile char *p; + Size page_size = sysconf(_SC_PAGESIZE); + + /* + * Touch one byte per OS page to fault in the physical memory. + * The volatile pointer prevents the compiler from optimizing this away. 
+ */ + for (p = (volatile char *) BufferBlocks + blocks_off; + p < (volatile char *) BufferBlocks + blocks_off + blocks_len; + p += page_size) + *p = *p; + + for (p = (volatile char *) BufferDescriptors + descs_off; + p < (volatile char *) BufferDescriptors + descs_off + descs_len; + p += page_size) + *p = *p; + + for (p = (volatile char *) BufferIOCVArray + iocv_off; + p < (volatile char *) BufferIOCVArray + iocv_off + iocv_len; + p += page_size) + *p = *p; + + for (p = (volatile char *) CkptBufferIds + ckpt_off; + p < (volatile char *) CkptBufferIds + ckpt_off + ckpt_len; + p += page_size) + *p = *p; + + elog(DEBUG1, "committed buffer pool memory via page touching for buffers %d..%d", + start_buf, end_buf); + } + + return true; +} + +/* + * Decommit physical memory for buffers beyond the given count. + * + * After shrinking, we release physical pages back to the OS but keep the + * virtual address reservation intact for future growth. + * + * For the buffer blocks array (which is always page-aligned since + * BLCKSZ >= page size), we use MADV_REMOVE to punch a hole in the + * shmem backing and actually free the pages. MADV_DONTNEED alone + * is insufficient on MAP_SHARED mappings because it only unmaps PTEs + * without releasing the underlying shmem pages. + * + * For smaller arrays (descriptors, CVs, ckpt IDs), their offsets may + * not be page-aligned, so we use MADV_DONTNEED as a best-effort hint. + * The memory waste from these arrays is small relative to the blocks. + */ +void +BufferPoolDecommitMemory(int old_nbufs, int new_nbufs) +{ + Size blocks_offset = mul_size((Size) new_nbufs, BLCKSZ); + Size blocks_len = mul_size((Size) (old_nbufs - new_nbufs), BLCKSZ); + Size descs_offset = mul_size((Size) new_nbufs, sizeof(BufferDescPadded)); + Size descs_len = mul_size((Size) (old_nbufs - new_nbufs), sizeof(BufferDescPadded)); + Size iocv_offset = mul_size((Size) new_nbufs, sizeof(ConditionVariableMinimallyPadded)); + Size iocv_len = mul_size((Size) (old_nbufs - new_nbufs), sizeof(ConditionVariableMinimallyPadded)); + Size ckpt_offset = mul_size((Size) new_nbufs, sizeof(CkptSortItem)); + Size ckpt_len = mul_size((Size) (old_nbufs - new_nbufs), sizeof(CkptSortItem)); + + /* + * Release physical pages for buffer blocks. MADV_REMOVE punches a hole + * in the shmem backing store, actually freeing the memory. If it fails + * (e.g., unsupported kernel), fall back to MADV_DONTNEED. + */ + if (blocks_len > 0) + { +#ifdef MADV_REMOVE + if (madvise(BufferBlocks + blocks_offset, blocks_len, MADV_REMOVE) != 0) +#endif + madvise(BufferBlocks + blocks_offset, blocks_len, MADV_DONTNEED); + } + + /* + * For smaller arrays, use MADV_DONTNEED as a best-effort hint. + * These offsets may not be page-aligned, in which case madvise + * silently does nothing (returns EINVAL which we ignore). 
+ */ + if (descs_len > 0) + madvise((char *) BufferDescriptors + descs_offset, descs_len, MADV_DONTNEED); + if (iocv_len > 0) + madvise((char *) BufferIOCVArray + iocv_offset, iocv_len, MADV_DONTNEED); + if (ckpt_len > 0) + madvise((char *) CkptBufferIds + ckpt_offset, ckpt_len, MADV_DONTNEED); + + elog(DEBUG1, "decommitted buffer pool memory: %d -> %d buffers", + old_nbufs, new_nbufs); +} + +/* ---------------------------------------------------------------- + * Shared memory initialization + * ---------------------------------------------------------------- + */ + +Size +BufPoolResizeShmemSize(void) +{ + return MAXALIGN(sizeof(BufPoolResizeCtl)); +} + +void +BufPoolResizeShmemInit(void) +{ + bool found; + + BufResizeCtl = (BufPoolResizeCtl *) + ShmemInitStruct("Buffer Pool Resize Ctl", + BufPoolResizeShmemSize(), + &found); + + if (!found) + { + MemSet(BufResizeCtl, 0, sizeof(BufPoolResizeCtl)); + SpinLockInit(&BufResizeCtl->mutex); + BufResizeCtl->status = BUF_RESIZE_IDLE; + BufResizeCtl->target_buffers = NBuffers; + pg_atomic_init_u32(&BufResizeCtl->current_buffers, (uint32) NBuffers); + } +} + +/* ---------------------------------------------------------------- + * Buffer pool grow operation + * ---------------------------------------------------------------- + */ + +/* + * GrowBufferPool - add new buffers to the pool. + * + * This is called from the postmaster via ExecuteBufferPoolResize() after + * processing a SIGHUP that changed shared_buffers. new_nbuffers must be + * > NBuffers and <= MaxNBuffers. + * + * After this function returns, the postmaster's NBuffers is updated and + * the shared current_buffers atomic is set. Child processes update their + * local NBuffers from current_buffers when they process the SIGHUP that + * the postmaster sends after this function returns. + */ +static bool +GrowBufferPool(int new_nbuffers) +{ + int old_nbuffers = NBuffers; + int i; + + Assert(new_nbuffers > old_nbuffers); + Assert(new_nbuffers <= GetEffectiveMaxNBuffers()); + + elog(LOG, "buffer pool resize started: %d -> %d buffers (%d MB -> %d MB)", + old_nbuffers, new_nbuffers, + (int) ((Size) old_nbuffers * BLCKSZ / (1024 * 1024)), + (int) ((Size) new_nbuffers * BLCKSZ / (1024 * 1024))); + + /* + * Step 1: Commit physical memory for the new buffers. + */ + if (ReservedBufferBlocks != NULL) + { + if (!BufferPoolCommitMemory(old_nbuffers, new_nbuffers)) + { + elog(WARNING, "buffer pool grow failed: could not commit memory"); + return false; + } + } + + /* + * Step 2: Initialize new buffer descriptors. + * + * New buffers are appended at the end, so existing buffers are not + * disturbed. This is safe because no backend can access buffer IDs + * >= old_nbuffers yet (NBuffers hasn't been updated). + * + * However, if a previous shrink was cancelled before its drain completed, + * some descriptors in this range may still have BM_TAG_VALID set and + * could have active pins from backends. We must NOT reinitialize those + * -- doing so would zero the refcount and corrupt the buffer state. + * Such buffers will be naturally reused by the clock sweep once NBuffers + * is updated to include them again. 
+ */ + for (i = old_nbuffers; i < new_nbuffers; i++) + { + BufferDesc *buf = GetBufferDescriptor(i); + uint64 buf_state; + + /* Skip buffers still in use from a cancelled shrink */ + buf_state = pg_atomic_read_u64(&buf->state); + if (buf_state & BM_TAG_VALID) + continue; + + ClearBufferTag(&buf->tag); + pg_atomic_init_u64(&buf->state, 0); + buf->wait_backend_pgprocno = INVALID_PROC_NUMBER; + buf->buf_id = i; + pgaio_wref_clear(&buf->io_wref); + proclist_init(&buf->lock_waiters); + + /* Initialize the I/O condition variable for this buffer */ + ConditionVariableInit(BufferDescriptorGetIOCV(buf)); + } + + /* + * Step 3: Write the new NBuffers to shared memory and update the + * postmaster's local copy. A write barrier ensures the descriptor + * initializations above are visible before any backend sees the new + * buffer count. + */ + pg_write_barrier(); + pg_atomic_write_u32(&BufResizeCtl->current_buffers, (uint32) new_nbuffers); + + /* Update the postmaster's local NBuffers */ + NBuffers = new_nbuffers; + + /* + * Child processes will update their local NBuffers when they process + * the SIGHUP that the postmaster sends after this function returns. + * See assign_shared_buffers(). + */ + elog(LOG, "buffer pool resize completed: %d -> %d buffers", + old_nbuffers, new_nbuffers); + + return true; +} + +/* ---------------------------------------------------------------- + * Buffer pool shrink operation + * ---------------------------------------------------------------- + */ + +/* + * ShrinkBufferPool - reduce the buffer pool size. + * + * Called from the postmaster during ExecuteBufferPoolResize(). This + * function only updates NBuffers and records the condemned range. The + * actual eviction of condemned buffers is done asynchronously by the + * bgwriter via BufPoolDrainCondemnedBuffers(), because eviction requires + * full backend infrastructure (ResourceOwner, private refcounts, etc.) + * that the postmaster does not have. + * + * After this call, no new buffer allocations will use the condemned range + * (clock sweep respects NBuffers). Existing pins on condemned buffers + * will complete normally; the bgwriter will evict them once unpinned. + */ +static bool +ShrinkBufferPool(int new_nbuffers) +{ + int old_nbuffers = NBuffers; + + Assert(new_nbuffers < old_nbuffers); + Assert(new_nbuffers >= 16); /* matches GUC minimum for shared_buffers */ + + elog(LOG, "buffer pool shrink started: %d -> %d buffers (%d MB -> %d MB)", + old_nbuffers, new_nbuffers, + (int) ((Size) old_nbuffers * BLCKSZ / (1024 * 1024)), + (int) ((Size) new_nbuffers * BLCKSZ / (1024 * 1024))); + + /* + * Record the condemned range for the bgwriter to drain, then update + * NBuffers. The order matters: we set the drain range before publishing + * the new NBuffers so the bgwriter knows what to clean up. + */ + SpinLockAcquire(&BufResizeCtl->mutex); + BufResizeCtl->status = BUF_RESIZE_DRAINING; + BufResizeCtl->drain_from = new_nbuffers; + BufResizeCtl->drain_to = old_nbuffers; + BufResizeCtl->condemned_remaining = old_nbuffers - new_nbuffers; + SpinLockRelease(&BufResizeCtl->mutex); + + pg_write_barrier(); + pg_atomic_write_u32(&BufResizeCtl->current_buffers, (uint32) new_nbuffers); + NBuffers = new_nbuffers; + + elog(LOG, "buffer pool shrink completed: NBuffers %d -> %d " + "(bgwriter will drain %d condemned buffers)", + old_nbuffers, new_nbuffers, old_nbuffers - new_nbuffers); + + return true; +} + +/* + * BufPoolDrainCondemnedBuffers - evict buffers in the condemned range. 
+ * + * Called from the bgwriter main loop each cycle (~200ms). The bgwriter + * has full backend infrastructure needed for EvictUnpinnedBuffer(). + * + * This does one pass over the condemned range per call, evicting what it + * can. When all condemned buffers are invalidated, it marks the drain + * as complete and optionally decommits memory. + */ +void +BufPoolDrainCondemnedBuffers(void) +{ + int drain_from, + drain_to; + int i; + int remaining = 0; + int pinned = 0; + int dirty = 0; + BufPoolResizeStatus status; + + if (BufResizeCtl == NULL) + return; + + /* Quick check without lock */ + status = BufResizeCtl->status; + if (status != BUF_RESIZE_DRAINING) + return; + + SpinLockAcquire(&BufResizeCtl->mutex); + drain_from = BufResizeCtl->drain_from; + drain_to = BufResizeCtl->drain_to; + SpinLockRelease(&BufResizeCtl->mutex); + + if (drain_from >= drain_to) + return; + + /* One pass over the condemned range */ + for (i = drain_from; i < drain_to; i++) + { + BufferDesc *buf = GetBufferDescriptor(i); + uint64 buf_state; + + buf_state = pg_atomic_read_u64(&buf->state); + + /* Skip already-invalidated buffers */ + if (!(buf_state & BM_TAG_VALID)) + continue; + + /* Can't touch pinned buffers */ + if (BUF_STATE_GET_REFCOUNT(buf_state) != 0) + { + remaining++; + pinned++; + continue; + } + + /* Evict the buffer (handles dirty flush + invalidation) */ + { + bool flushed = false; + bool evicted; + + if (buf_state & BM_DIRTY) + dirty++; + evicted = EvictUnpinnedBuffer(BufferDescriptorGetBuffer(buf), + &flushed); + if (!evicted) + remaining++; + } + } + + /* Update progress under lock */ + SpinLockAcquire(&BufResizeCtl->mutex); + BufResizeCtl->condemned_remaining = remaining; + BufResizeCtl->condemned_pinned = pinned; + BufResizeCtl->condemned_dirty = dirty; + + if (remaining == 0) + { + /* + * All condemned buffers drained. Before decommitting, verify the + * drain hasn't been superseded by a new resize request. A grow + * that overlaps the condemned range could have been initiated by + * the postmaster while we were iterating -- in that case, the + * status and/or drain range will have changed under us. + */ + if (BufResizeCtl->status == BUF_RESIZE_DRAINING && + BufResizeCtl->drain_from == drain_from && + BufResizeCtl->drain_to == drain_to) + { + BufResizeCtl->status = BUF_RESIZE_IDLE; + BufResizeCtl->drain_from = 0; + BufResizeCtl->drain_to = 0; + BufResizeCtl->started_at = 0; + BufResizeCtl->condemned_remaining = 0; + BufResizeCtl->condemned_pinned = 0; + BufResizeCtl->condemned_dirty = 0; + SpinLockRelease(&BufResizeCtl->mutex); + + elog(LOG, "bgwriter: condemned buffer drain complete"); + + /* Now safe to decommit memory */ + if (ReservedBufferBlocks != NULL) + BufferPoolDecommitMemory(drain_to, drain_from); + } + else + { + /* Drain was superseded; skip decommit */ + SpinLockRelease(&BufResizeCtl->mutex); + elog(LOG, "bgwriter: drain superseded by new resize, skipping decommit"); + } + } + else + { + SpinLockRelease(&BufResizeCtl->mutex); + } +} + +/* ---------------------------------------------------------------- + * Resize coordination + * ---------------------------------------------------------------- + */ + +/* + * RequestBufferPoolResize - request an asynchronous resize. + * + * Called from the GUC assign hook. Sets the target and lets the + * postmaster or a bgworker pick it up. 
+ */ +void +RequestBufferPoolResize(int new_nbuffers) +{ + if (BufResizeCtl == NULL) + return; /* Not yet initialized */ + + SpinLockAcquire(&BufResizeCtl->mutex); + + /* + * If a bgwriter drain is in progress (BUF_RESIZE_DRAINING from a + * previous shrink), cancel it -- the new request supersedes. The + * bgwriter validates the drain range before decommitting, so it's + * safe to change the range while it's iterating. + * + * Don't interrupt a grow (BUF_RESIZE_GROWING) since the postmaster + * is actively executing it. + */ + if (BufResizeCtl->status == BUF_RESIZE_GROWING) + { + SpinLockRelease(&BufResizeCtl->mutex); + ereport(WARNING, + (errmsg("buffer pool resize already in progress, " + "ignoring new request"))); + return; + } + + /* Cancel any pending drain */ + BufResizeCtl->drain_from = 0; + BufResizeCtl->drain_to = 0; + BufResizeCtl->condemned_remaining = 0; + BufResizeCtl->condemned_pinned = 0; + BufResizeCtl->condemned_dirty = 0; + + BufResizeCtl->target_buffers = new_nbuffers; + if (new_nbuffers > NBuffers) + BufResizeCtl->status = BUF_RESIZE_GROWING; + else if (new_nbuffers < NBuffers) + BufResizeCtl->status = BUF_RESIZE_DRAINING; + else + BufResizeCtl->status = BUF_RESIZE_IDLE; + + BufResizeCtl->started_at = GetCurrentTimestamp(); + SpinLockRelease(&BufResizeCtl->mutex); +} + +/* + * ExecuteBufferPoolResize - perform a pending resize. + * + * This should be called from the postmaster main loop or a dedicated + * bgworker. It checks for pending resize requests and executes them. + */ +void +ExecuteBufferPoolResize(void) +{ + int target; + BufPoolResizeStatus status; + + if (BufResizeCtl == NULL) + return; + + SpinLockAcquire(&BufResizeCtl->mutex); + status = BufResizeCtl->status; + target = BufResizeCtl->target_buffers; + SpinLockRelease(&BufResizeCtl->mutex); + + if (status == BUF_RESIZE_IDLE) + return; + + if (status == BUF_RESIZE_GROWING && target > NBuffers) + { + GrowBufferPool(target); + + /* Mark grow as complete immediately */ + SpinLockAcquire(&BufResizeCtl->mutex); + BufResizeCtl->status = BUF_RESIZE_IDLE; + BufResizeCtl->started_at = 0; + SpinLockRelease(&BufResizeCtl->mutex); + } + else if (status == BUF_RESIZE_DRAINING && target < NBuffers) + { + /* + * ShrinkBufferPool updates NBuffers and keeps status as + * BUF_RESIZE_DRAINING. The bgwriter will drain the condemned + * buffers asynchronously and set status to BUF_RESIZE_IDLE. + */ + ShrinkBufferPool(target); + } +} + +/* ---------------------------------------------------------------- + * GUC hooks + * ---------------------------------------------------------------- + */ + +/* + * GUC check hook for shared_buffers. + * + * The GUC variable is SharedBuffersGUC, NOT NBuffers. This is critical: + * the GUC mechanism updates SharedBuffersGUC on SIGHUP, but NBuffers is + * only updated by the resize code (or at startup). This prevents NBuffers + * from changing before the buffer pool arrays are actually resized. + * + * Validates that the new value is within the allowed range: + * - At startup: normal validation (min/max from GUC definition) + * - At runtime with max_shared_buffers: must be <= MaxNBuffers + * - At runtime without max_shared_buffers: value is accepted (for ALTER + * SYSTEM writes that take effect on next restart) but the assign hook + * will not trigger a resize + */ +bool +check_shared_buffers(int *newval, void **extra, GucSource source) +{ + /* + * If max_shared_buffers is configured, enforce it as an upper bound. + * This applies both at startup and at runtime. 
+ */
+    if (MaxNBuffers > 0 && *newval > MaxNBuffers)
+    {
+        GUC_check_errmsg("shared_buffers (%d) cannot exceed max_shared_buffers (%d)",
+                         *newval, MaxNBuffers);
+        return false;
+    }
+
+    return true;
+}
+
+/*
+ * GUC assign hook for shared_buffers.
+ *
+ * The GUC variable (SharedBuffersGUC) has already been updated by the GUC
+ * mechanism. At startup, we copy the value into NBuffers. At runtime,
+ * we request an async resize if the infrastructure is available.
+ *
+ * If max_shared_buffers is not set, runtime changes to SharedBuffersGUC
+ * are harmless -- they'll take effect on next restart when NBuffers is
+ * re-initialized from SharedBuffersGUC.
+ */
+void
+assign_shared_buffers(int newval, void *extra)
+{
+    /*
+     * Before shared memory exists (initial startup, or a standalone
+     * backend), just apply the value directly.
+     */
+    if (BufResizeCtl == NULL)
+    {
+        NBuffers = newval;
+        return;
+    }
+
+    /*
+     * At runtime without max_shared_buffers, no online resize is possible.
+     * Leave NBuffers alone; the new SharedBuffersGUC value takes effect at
+     * the next restart. (Overwriting NBuffers here would desynchronize it
+     * from the buffer arrays, which were sized at startup.)
+     */
+    if (MaxNBuffers <= 0)
+        return;
+
+    /*
+     * At runtime with max_shared_buffers configured.
+     *
+     * The postmaster (IsUnderPostmaster=false) requests a resize. This is
+     * a no-op here because ExecuteBufferPoolResize() is called separately
+     * from process_pm_reload_request() after ProcessConfigFile returns.
+     *
+     * Child processes (IsUnderPostmaster=true) update their local NBuffers
+     * from the shared current_buffers atomic, which was set by the postmaster
+     * during ExecuteBufferPoolResize() before signaling children.
+     */
+    if (!IsUnderPostmaster)
+    {
+        /* Postmaster: request resize (executed later by postmaster loop) */
+        if (newval != NBuffers)
+            RequestBufferPoolResize(newval);
+    }
+    else
+    {
+        /*
+         * Child process: read the authoritative NBuffers from shared memory.
+         * The postmaster has already performed the resize and updated
+         * current_buffers before sending us SIGHUP.
+         */
+        int         current = (int) pg_atomic_read_u32(&BufResizeCtl->current_buffers);
+
+        /*
+         * A read barrier ensures we see the fully initialized descriptor
+         * data that the postmaster wrote before publishing current_buffers.
+         * Pairs with the pg_write_barrier() in GrowBufferPool/ShrinkBufferPool.
+         */
+        pg_read_barrier();
+
+        if (current != NBuffers)
+        {
+            elog(DEBUG1, "backend updated NBuffers: %d -> %d",
+                 NBuffers, current);
+            NBuffers = current;
+        }
+    }
+}
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 9a93fb335fcb8..6b651b2e408eb 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -381,8 +381,23 @@ StrategyShmemSize(void)
 {
 	Size		size = 0;
 
-	/* size of lookup hash table ... see comment in StrategyInitialize */
-	size = add_size(size, BufTableShmemSize(NBuffers + NUM_BUFFER_PARTITIONS));
+	/*
+	 * Size of lookup hash table ... see comment in StrategyInitialize.
+	 *
+	 * When max_shared_buffers is configured for online resize, pre-size the
+	 * hash table for the maximum possible buffer count so that growing the
+	 * buffer pool doesn't require rehashing.
+	 */
+	{
+		int			hash_size;
+
+		if (MaxNBuffers > 0 && MaxNBuffers > NBuffers)
+			hash_size = MaxNBuffers + NUM_BUFFER_PARTITIONS;
+		else
+			hash_size = NBuffers + NUM_BUFFER_PARTITIONS;
+
+		size = add_size(size, BufTableShmemSize(hash_size));
+	}
 
 	/* size of the shared replacement strategy control block */
 	size = add_size(size, MAXALIGN(sizeof(BufferStrategyControl)));
@@ -412,7 +427,14 @@ StrategyInitialize(bool init)
 	 * happening in each partition concurrently, so we could need as many as
 	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
*/ - InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS); + /* + * When max_shared_buffers is configured, pre-size for the maximum to + * avoid needing to rehash when the buffer pool grows. + */ + if (MaxNBuffers > 0 && MaxNBuffers > NBuffers) + InitBufTable(MaxNBuffers + NUM_BUFFER_PARTITIONS); + else + InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS); /* * Get or create the shared strategy control block diff --git a/src/backend/storage/buffer/meson.build b/src/backend/storage/buffer/meson.build index ed84bf089716a..269d686125f85 100644 --- a/src/backend/storage/buffer/meson.build +++ b/src/backend/storage/buffer/meson.build @@ -2,6 +2,7 @@ backend_sources += files( 'buf_init.c', + 'buf_resize.c', 'buf_table.c', 'bufmgr.c', 'freelist.c', diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c index 85c67b2c183d6..c6cfd19a7b0f9 100644 --- a/src/backend/storage/ipc/ipci.c +++ b/src/backend/storage/ipc/ipci.c @@ -39,6 +39,7 @@ #include "replication/walreceiver.h" #include "replication/walsender.h" #include "storage/aio_subsys.h" +#include "storage/buf_resize.h" #include "storage/bufmgr.h" #include "storage/dsm.h" #include "storage/dsm_registry.h" @@ -103,6 +104,7 @@ CalculateShmemSize(void) size = add_size(size, dsm_estimate_size()); size = add_size(size, DSMRegistryShmemSize()); size = add_size(size, BufferManagerShmemSize()); + size = add_size(size, BufPoolResizeShmemSize()); size = add_size(size, LockManagerShmemSize()); size = add_size(size, PredicateLockShmemSize()); size = add_size(size, ProcGlobalShmemSize()); @@ -200,6 +202,14 @@ CreateSharedMemoryAndSemaphores(void) size = CalculateShmemSize(); elog(DEBUG3, "invoking IpcMemoryCreate(size=%zu)", size); + /* + * If max_shared_buffers is configured, reserve virtual address space + * for the buffer pool arrays before creating the main shmem segment. + * This sets up the global pointers (BufferDescriptors, BufferBlocks, + * etc.) pointing to separately-mapped memory regions that can grow. + */ + BufferPoolReserveMemory(); + /* * Create the shmem segment */ @@ -276,6 +286,7 @@ CreateOrAttachShmemStructs(void) SUBTRANSShmemInit(); MultiXactShmemInit(); BufferManagerShmemInit(); + BufPoolResizeShmemInit(); /* * Set up lock manager diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index 36ad708b36027..638749d91d319 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -140,6 +140,8 @@ int max_parallel_maintenance_workers = 2; * register background workers. */ int NBuffers = 16384; +int SharedBuffersGUC = 16384; +int MaxNBuffers = 0; int MaxConnections = 100; int max_worker_processes = 8; int max_parallel_workers = 8; diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat index 7c60b12556464..f58a6016cedcb 100644 --- a/src/backend/utils/misc/guc_parameters.dat +++ b/src/backend/utils/misc/guc_parameters.dat @@ -72,8 +72,6 @@ variable => 'archiveCleanupCommand', boot_val => '""', }, - - { name => 'archive_command', type => 'string', context => 'PGC_SIGHUP', group => 'WAL_ARCHIVING', short_desc => 'Sets the shell command that will be called to archive a WAL file.', long_desc => 'An empty string means use "archive_library".', @@ -2029,6 +2027,20 @@ max => 'MAX_BACKENDS /* XXX? */', }, + +# Maximum value shared_buffers can be set to without restart. +# When set to a value greater than shared_buffers, virtual address space +# is reserved at startup and the buffer pool can be resized online. 
+# 0 (default) means same as shared_buffers (no online resize). +{ name => 'max_shared_buffers', type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM', + short_desc => 'Maximum value of shared_buffers that can be set without restart.', + long_desc => '0 means same as shared_buffers, disabling online resize.', + flags => 'GUC_UNIT_BLOCKS', + variable => 'MaxNBuffers', + boot_val => '0', + min => '0', + max => 'INT_MAX / 2', +}, { name => 'max_slot_wal_keep_size', type => 'int', context => 'PGC_SIGHUP', group => 'REPLICATION_SENDING', short_desc => 'Sets the maximum WAL size that can be reserved by replication slots.', long_desc => 'Replication slots will be marked as failed, and segments released for deletion or recycling, if this much space is occupied by WAL on disk. -1 means no maximum.', @@ -2523,8 +2535,6 @@ variable => 'send_abort_for_kill', boot_val => 'false', }, - - { name => 'seq_page_cost', type => 'real', context => 'PGC_USERSET', group => 'QUERY_TUNING_COST', short_desc => 'Sets the planner\'s estimate of the cost of a sequentially fetched disk page.', flags => 'GUC_EXPLAIN', @@ -2594,16 +2604,19 @@ options => 'session_replication_role_options', assign_hook => 'assign_session_replication_role', }, - # We sometimes multiply the number of shared buffers by two without # checking for overflow, so we mustn't allow more than INT_MAX / 2. -{ name => 'shared_buffers', type => 'int', context => 'PGC_POSTMASTER', group => 'RESOURCES_MEM', +# When max_shared_buffers is set, shared_buffers can be changed at runtime +# via SIGHUP without requiring a restart (PGC_SIGHUP context). +{ name => 'shared_buffers', type => 'int', context => 'PGC_SIGHUP', group => 'RESOURCES_MEM', short_desc => 'Sets the number of shared memory buffers used by the server.', flags => 'GUC_UNIT_BLOCKS', - variable => 'NBuffers', + variable => 'SharedBuffersGUC', boot_val => '16384', min => '16', max => 'INT_MAX / 2', + check_hook => 'check_shared_buffers', + assign_hook => 'assign_shared_buffers', }, { name => 'shared_memory_size', type => 'int', context => 'PGC_INTERNAL', group => 'PRESET_OPTIONS', diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index db559b39c4dd4..66c1a0e485896 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -174,6 +174,8 @@ extern PGDLLIMPORT char *DataDir; extern PGDLLIMPORT int data_directory_mode; extern PGDLLIMPORT int NBuffers; +extern PGDLLIMPORT int SharedBuffersGUC; +extern PGDLLIMPORT int MaxNBuffers; extern PGDLLIMPORT int MaxBackends; extern PGDLLIMPORT int MaxConnections; extern PGDLLIMPORT int max_worker_processes; diff --git a/src/include/storage/buf_resize.h b/src/include/storage/buf_resize.h new file mode 100644 index 0000000000000..08328292350e9 --- /dev/null +++ b/src/include/storage/buf_resize.h @@ -0,0 +1,120 @@ +/*------------------------------------------------------------------------- + * + * buf_resize.h + * Declarations for online shared buffer pool resizing. + * + * This module allows shared_buffers to be changed at runtime via SIGHUP + * without requiring a server restart, provided max_shared_buffers was + * set at startup to reserve sufficient virtual address space. 
+ * + * Portions Copyright (c) 1996-2026, PostgreSQL Global Development Group + * Portions Copyright (c) 1994, Regents of the University of California + * + * src/include/storage/buf_resize.h + * + *------------------------------------------------------------------------- + */ +#ifndef BUF_RESIZE_H +#define BUF_RESIZE_H + +#include "storage/shmem.h" +#include "storage/spin.h" + +/* + * Possible states for an in-progress buffer pool resize operation. + */ +typedef enum BufPoolResizeStatus +{ + BUF_RESIZE_IDLE = 0, /* No resize in progress */ + BUF_RESIZE_GROWING, /* Adding new buffers */ + BUF_RESIZE_DRAINING /* Draining condemned buffers for shrink */ +} BufPoolResizeStatus; + +/* + * Shared memory state for buffer pool resize coordination. + * + * Non-atomic fields are protected by the mutex spinlock. The + * current_buffers field is accessed atomically without the lock. + */ +typedef struct BufPoolResizeCtl +{ + /* Spinlock protecting non-atomic fields */ + slock_t mutex; + + /* Current resize state */ + BufPoolResizeStatus status; + + /* Target NBuffers for the current resize operation */ + int target_buffers; + + /* Progress tracking for shrink drain (run by bgwriter) */ + int drain_from; /* start of condemned range (= new NBuffers) */ + int drain_to; /* end of condemned range (= old NBuffers) */ + int condemned_remaining; + int condemned_pinned; + int condemned_dirty; + + /* Timestamp when current resize started (0 if idle) */ + TimestampTz started_at; + + /* The current authoritative NBuffers value (updated atomically) */ + pg_atomic_uint32 current_buffers; +} BufPoolResizeCtl; + +/* MaxNBuffers is declared in miscadmin.h (defined in globals.c) */ + +/* Pointer to shared memory control structure */ +extern PGDLLIMPORT BufPoolResizeCtl *BufResizeCtl; + +/* + * Functions for buffer pool resize. + */ + +/* Shared memory initialization */ +extern Size BufPoolResizeShmemSize(void); +extern void BufPoolResizeShmemInit(void); + +/* + * Reserve virtual address space for buffer pool arrays. + * Called once at postmaster startup, before BufferManagerShmemInit(). + * Returns the base addresses for each array. + */ +extern void BufferPoolReserveMemory(void); + +/* + * Commit physical memory for buffers in the range [start_buf, end_buf) + * within the previously reserved address space. + */ +extern bool BufferPoolCommitMemory(int start_buf, int end_buf); + +/* + * Decommit physical memory for buffers beyond the given count. + */ +extern void BufferPoolDecommitMemory(int old_nbufs, int new_nbufs); + +/* + * Initiate a buffer pool resize to the given target NBuffers. + * Called from the GUC assign hook when shared_buffers changes. + * The actual resize happens asynchronously via the postmaster. + */ +extern void RequestBufferPoolResize(int new_nbuffers); + +/* + * Execute a pending buffer pool resize. Called from the postmaster + * main loop or a dedicated background worker. + */ +extern void ExecuteBufferPoolResize(void); + +/* + * Drain condemned buffers after a shrink. Called from the bgwriter + * main loop, which has full backend infrastructure (ResourceOwner, + * private refcounts, etc.) needed for buffer eviction. + */ +extern void BufPoolDrainCondemnedBuffers(void); + +/* + * GUC hooks for shared_buffers are declared in utils/guc_hooks.h, + * not here, to avoid pulling guc.h into storage headers. 
+ */ + +#endif /* BUF_RESIZE_H */ diff --git a/src/include/utils/guc_hooks.h b/src/include/utils/guc_hooks.h index f723668da9ec2..f8822a53b6166 100644 --- a/src/include/utils/guc_hooks.h +++ b/src/include/utils/guc_hooks.h @@ -131,6 +131,8 @@ extern bool check_serial_buffers(int *newval, void **extra, GucSource source); extern bool check_session_authorization(char **newval, void **extra, GucSource source); extern void assign_session_authorization(const char *newval, void *extra); extern void assign_session_replication_role(int newval, void *extra); +extern bool check_shared_buffers(int *newval, void **extra, GucSource source); +extern void assign_shared_buffers(int newval, void *extra); extern void assign_stats_fetch_consistency(int newval, void *extra); extern bool check_ssl(bool *newval, void **extra, GucSource source); extern bool check_stage_log_stats(bool *newval, void **extra, GucSource source);
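
With the patch series above in place, the operator-facing flow would be, roughly: set `max_shared_buffers` once (it remains a `PGC_POSTMASTER` ceiling, so this needs one final restart), after which `shared_buffers` itself becomes reloadable. The sketch below is illustrative only and not part of the patch -- the sizes are arbitrary and the behavior assumes the GUC semantics defined in this proposal:

```sql
-- One-time setup; max_shared_buffers is PGC_POSTMASTER, so restart once.
ALTER SYSTEM SET max_shared_buffers = '16GB';
ALTER SYSTEM SET shared_buffers = '2GB';
-- ... restart the server ...

-- Grow online: the postmaster commits memory and updates NBuffers during
-- the reload, then SIGHUPs its children.
ALTER SYSTEM SET shared_buffers = '8GB';
SELECT pg_reload_conf();

-- Shrink online: NBuffers drops at the reload; the bgwriter drains the
-- condemned buffers asynchronously before the memory is decommitted.
ALTER SYSTEM SET shared_buffers = '4GB';
SELECT pg_reload_conf();
```

A request to raise `shared_buffers` above `max_shared_buffers` is rejected by the `check_shared_buffers()` hook rather than applied partially.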