pg_ash


Active Session History for Postgres — lightweight wait event sampling with zero bloat.

The anti-extension. Pure SQL + PL/pgSQL that works on any Postgres 14+ — including RDS, Cloud SQL, AlloyDB, Supabase, Neon, and every other managed provider. No C extension, no shared_preload_libraries, no provider approval, no restart. Just \i and go.

Demo (pg_ash v1.4 investigation flow): a short walkthrough of the LLM-assisted investigation against a live row-lock spike in Postgres 18. Source: demos/.

Why

Postgres has no built-in session history. When something was slow an hour ago, there is nothing to look at. pg_ash samples pg_stat_activity every second and stores the results in a compact format queryable with plain SQL.

How it compares

|  | pg_ash | pg_wait_sampling | pgsentinel | External sampling |
|---|---|---|---|---|
| Install | \i (pure SQL) | shared_preload_libraries | shared_preload_libraries (package or compile) | Separate infra |
| Works on managed (RDS, Cloud SQL, Supabase, ...) | Yes | Cloud SQL only (limited managed support) | Not known to be supported | Yes, with effort |
| Sampling rate | 1s (via pg_cron, system cron, or any scheduler) | 10ms (in-process) | 10ms (in-process) | 15-60s typical |
| Visibility | Inside Postgres | Inside Postgres | Inside Postgres | Outside only |
| Storage | Disk (~30 MiB/day) | Memory only | Memory only | External store |
| Historical queries | Yes (persistent) | Ring buffer (lost on restart) | Ring buffer (lost on restart) | Depends on setup |
| Pure SQL | Yes | No (C extension) | No (C extension) | No |
| Maintenance overhead | None | None | None | High |
| Requirements | None (pg_cron optional) | shared_preload_libraries (restart required) | shared_preload_libraries (restart required) | Agent + storage |

Quick start

-- prerequisites (optional but recommended)
create extension if not exists pg_stat_statements;  -- enables query text + execution metrics
-- pg_cron is optional: if installed, ash.start() uses it; otherwise see "Scheduling without pg_cron"

-- install (just run the SQL file — works on RDS, Cloud SQL, AlloyDB, etc.)
\i sql/ash-install.sql

-- start sampling (1 sample/second — uses pg_cron if available, otherwise prints external scheduling instructions)
select ash.start('1 second');

-- wait a few minutes, then query
select * from ash.top_waits('1 hour');
select * from ash.top_queries_with_text('1 hour');
select * from ash.top_by_type('1 hour');

-- stop sampling
select ash.stop();

-- uninstall (drops the ash schema and pg_cron jobs)
select ash.uninstall('yes');

Sampling intervals

ash.start(interval) accepts PostgreSQL interval values. The interval is converted to a pg_cron schedule:

| Interval range | Example input | pg_cron schedule | Description |
|---|---|---|---|
| 1–59 seconds | '1 second' .. '59 seconds' | N seconds | Every N seconds (native pg_cron format) |
| 1–59 minutes | '1 minute' .. '59 minutes' | */N * * * * | Every N minutes via cron syntax |
| 1–23 hours | '1 hour' .. '23 hours' | 0 */N * * * (or 0 * * * * for 1h) | Every N hours via cron syntax (max 23h) |

Notes:

  • Sub-minute intervals must be exact seconds (e.g., '30 seconds')
  • Minute and hour intervals must be exact (e.g., '5 minutes' works, '90 seconds' does not — use '1 minute' instead)
  • Hour intervals are limited to 23 hours maximum (cron syntax limitation). For daily sampling, use '23 hours' or schedule ash.take_sample() externally (see Scheduling without pg_cron)
  • The default and recommended interval is '1 second' for high-resolution sampling
  • See select * from ash.status() for the current sampling interval
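
The conversion can be sketched in plain SQL. This is an illustration of the mapping rules above, not the actual pg_ash implementation:

-- illustration only (not pg_ash internals): interval to pg_cron schedule
select iv,
       case
         when iv < interval '1 minute'
           then extract(epoch from iv)::int || ' seconds'
         when iv < interval '1 hour'
           then format('*/%s * * * *', extract(epoch from iv)::int / 60)
         when iv = interval '1 hour'
           then '0 * * * *'
         else format('0 */%s * * *', extract(epoch from iv)::int / 3600)
       end as cron_schedule
from (values (interval '30 seconds'),
             (interval '5 minutes'),
             (interval '2 hours')) v(iv);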

Privileges

The role that runs sampling (the owner of ash.take_sample(), or the pg_cron job owner) should be a superuser or a member of the built-in pg_read_all_stats role. Without it, pg_stat_activity.query_id is visible only for activity owned by the sampling role; queries run by other users come back with query_id = NULL, which ash records under the sentinel value 0.

This silently skews ash.top_queries*, ash.query_waits, and any per-query drill-downs — all of that "other-user" traffic collapses into a single query_id = 0 bucket. To grant the role:

-- as a superuser
grant pg_read_all_stats to <sampling_role>;

On managed services where pg_read_all_stats is already granted to the primary admin (RDS rds_superuser, Cloud SQL cloudsqlsuperuser, Supabase postgres), installing and running ash as that role is sufficient. Always verify with select pg_has_role(current_user, 'pg_read_all_stats', 'MEMBER');.

If the privilege probe itself errors (e.g. missing pg_roles access on a locked-down managed service), ash.start() does not abort — it emits a RAISE NOTICE 'privilege probe failed: ...' so the skipped check remains visible in server / CI logs.
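
A minimal preflight sketch that combines the membership check with a readable warning (same pg_has_role call as above):

do $$
begin
  if pg_has_role(current_user, 'pg_read_all_stats', 'MEMBER') then
    raise notice 'ok: % sees query_id for all backends', current_user;
  else
    raise notice 'missing pg_read_all_stats: other users'' activity will collapse into query_id = 0';
  end if;
end $$;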

Upgrade

-- from 1.0 to 1.1
\i sql/ash-1.0-to-1.1.sql

-- from 1.1 to 1.2
\i sql/ash-1.1-to-1.2.sql

-- from 1.2 to 1.3
\i sql/ash-1.2-to-1.3.sql

-- from 1.3 to 1.4
\i sql/ash-1.3-to-1.4.sql

-- check version
select * from ash.status();

Function reference

Admin

| Function | Description |
|---|---|
| ash.start(interval) | Start sampling (default: '1 second'). Uses pg_cron if available, otherwise prints external scheduling instructions. Also schedules rollup jobs |
| ash.stop() | Stop sampling and rollups (removes pg_cron jobs, sets sampling_enabled = false) |
| ash.status() | Sampling status, version, partition info, rollup metrics, debug_logging state |
| ash.take_sample() | Take one sample manually (called automatically by the scheduler) |
| ash.rotate() | Rotate sample partitions (called automatically, or manually for external schedulers). Runs a pre-truncation rollup to prevent data loss |
| ash.rebuild_partitions(N, 'yes') | Change partition count (3–32). Destructive — all raw sample data is lost; requires the 'yes' confirmation token. Rollup tables survive. Call ash.start() after to resume |
| ash.rollup_minute([batch]) | Aggregate raw samples into per-minute rollups. Watermark-based with catch-up. Default batch: 60 minutes |
| ash.rollup_hour() | Aggregate minute rollups into hourly rollups. Watermark-based |
| ash.rollup_cleanup() | Delete expired rollup rows per retention config |
| ash.set_debug_logging([bool]) | Enable/disable per-session RAISE LOG in take_sample() for diagnostics. Call with no argument to check current state |
| ash.uninstall('yes') | Drop the ash schema and remove pg_cron jobs |

Relative time (last N hours)

| Function | Description |
|---|---|
| ash.top_waits(interval, limit, width) | Top wait events ranked by sample count, with bar chart |
| ash.top_queries(interval, limit) | Top queries ranked by sample count |
| ash.top_queries_with_text(interval, limit) | Same as top_queries, with pg_stat_statements join (requires pg_stat_statements) |
| ash.query_waits(query_id, interval, width, color) | Wait profile for a specific query |
| ash.top_by_type(interval, width, color) | Breakdown by wait event type |
| ash.wait_timeline(interval, bucket) | Wait events bucketed over time |
| ash.samples_by_database(interval) | Per-database activity |
| ash.activity_summary(interval) | One-call overview: samples, peak backends, top waits, top queries |
| ash.timeline_chart(interval, bucket, top, width) | Stacked bar chart of wait events over time |
| ash.event_queries(event, interval, limit) | Top queries for a specific wait event (requires pg_stat_statements) |
| ash.samples(interval, limit) | Fully decoded raw samples with timestamps and query text |

All interval-based functions default to '1 hour'. Limit defaults to 10 (top 9 + "Other" rollup row).
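
For example, assuming the documented defaults, these two calls are equivalent:

select * from ash.top_waits();          -- defaults: '1 hour', top 10
select * from ash.top_waits('1 hour');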

Absolute time (incident investigation)

| Function | Description |
|---|---|
| ash.top_waits_at(start, end, limit, width) | Top waits in a time range, with bar chart |
| ash.top_queries_at(start, end, limit) | Top queries in a time range |
| ash.query_waits_at(query_id, start, end, width, color) | Query wait profile in a time range |
| ash.event_queries_at(event, start, end, limit) | Top queries for a wait event in a time range (requires pg_stat_statements) |
| ash.samples_at(start, end, limit) | Fully decoded raw samples in a time range |
| ash.top_by_type_at(start, end, width, color) | Breakdown by wait event type in a time range |
| ash.wait_timeline_at(start, end, bucket) | Wait timeline in a time range |
| ash.timeline_chart_at(start, end, bucket, top, width) | Stacked bar chart in a time range |

Start and end are timestamptz. Bucket defaults to '1 minute'.

Long-term trends (from rollup tables)

| Function | Description |
|---|---|
| ash.minute_waits(interval, limit) | Top wait events from minute rollups (default: last 1 hour) |
| ash.minute_waits_at(start, end, limit) | Same, absolute time range |
| ash.hourly_queries(interval, limit) | Top queries from hourly rollups with pg_stat_statements text (default: last 1 day) |
| ash.hourly_queries_at(start, end, limit) | Same, absolute time range |
| ash.daily_peak_backends(interval) | Peak and average backends per day (default: last 7 days) |
| ash.daily_peak_backends_at(start, end) | Same, absolute time range |

Rollup readers query rollup_1m / rollup_1h tables — they work even after raw samples have rotated away.

Helpers

| Function | Description |
|---|---|
| ash.ts_from_timestamptz(timestamptz) | Convert timestamptz to internal int4 epoch offset (useful for querying rollup tables directly) |
| ash.ts_to_timestamptz(int4) | Convert int4 epoch offset back to timestamptz |
| ash.decode_sample(integer[], smallint) | Decode a single packed ash.sample.data array. Pass slot for unambiguous query_id resolution |
| ash.decode_sample(int4) | Convenience: decode every ash.sample row at the given sample_ts (across all datids/slots). Returns (datid, wait_event, query_id, count) |
| ash.decode_sample_at(timestamptz) | Same as above but accepts timestamptz (converted via ts_from_timestamptz). Named with the _at suffix (consistent with samples_at / top_waits_at) to avoid decode_sample(unknown) overload ambiguity |

decode_sample / decode_sample_at have EXECUTE revoked from PUBLIC (per the privilege hardening in #45). Grant explicitly to roles that need them, e.g. grant execute on function ash.decode_sample(int4) to my_reader;.

Example

-- All decoded backends recorded at a specific moment, by database:
select db.datname, d.wait_event, d.query_id, d.count
from ash.decode_sample_at('2026-04-19 14:30:00+00'::timestamptz) d
join pg_database db on db.oid = d.datid
order by db.datname, d.wait_event;

Usage

Check status

select * from ash.status();
           metric           |             value
----------------------------+-------------------------------
 version                    | 1.4
 color                      | off
 num_partitions             | 3
 sampling_enabled           | true
 skipped_samples            | 0
 current_slot               | 0
 sample_interval            | 00:00:01
 rotation_period            | 1 day
 raw_retention              | 1 day + current partial
 include_bg_workers         | false
 debug_logging              | false
 installed_at               | 2026-02-16 08:30:00.000000+00
 rotated_at                 | 2026-02-16 08:30:00.000000+00
 time_since_rotation        | 00:09:03.123456
 last_sample_ts             | 2026-02-16 08:39:03+00
 samples_in_current_slot    | 56
 samples_total              | 56
 wait_event_map_count       | 11
 wait_event_map_utilization | 0.03%
 query_map_count            | 8
 rollup_1m_rows             | 540
 rollup_1m_oldest           | 2026-02-16 08:30:00+00
 rollup_1m_newest           | 2026-02-16 08:39:00+00
 rollup_1m_retention        | 30 days
 rollup_1h_rows             | 0
 rollup_1h_retention        | 1825 days
 pg_cron_available          | yes

What hurt recently?

-- morning coffee: what happened overnight?
select * from ash.activity_summary('5 minutes');
        metric        |                                            value
----------------------+---------------------------------------------------------------------------------------------
 time_range           | 00:05:00
 total_samples        | 56
 avg_active_backends  | 6.6
 peak_active_backends | 10
 peak_time            | 2026-02-16 08:38:16+00
 databases_active     | 1
 top_wait_1           | Client:ClientRead (46.77%)
 top_wait_2           | Timeout:PgSleep (11.83%)
 top_wait_3           | Lock:transactionid (9.68%)
 top_query_1          | -2835399305386018931 — COMMIT (29.73%)
 top_query_2          | 3365820675399133794 — UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE b (23.24%)
 top_query_3          | -4378765880691287891 — UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE t (11.35%)
-- top wait events (default: top 10 + Other)
select * from ash.top_waits('5 minutes');
     wait_event     | samples |  pct  |                  bar
--------------------+---------+-------+---------------------------------------
 Client:ClientRead  |     174 | 46.77 | ██████████████████████████████ 46.77%
 Timeout:PgSleep    |      44 | 11.83 | ████████ 11.83%
 Lock:transactionid |      36 |  9.68 | ██████ 9.68%
 CPU*               |      35 |  9.41 | ██████ 9.41%
 LWLock:WALWrite    |      31 |  8.33 | █████ 8.33%
 IdleTx             |      26 |  6.99 | ████ 6.99%
 IO:WalSync         |      19 |  5.11 | ███ 5.11%
 Lock:tuple         |       5 |  1.34 | █ 1.34%
 LWLock:LockManager |       2 |  0.54 | █ 0.54%
-- top queries with text from pg_stat_statements
select * from ash.top_queries_with_text('5 minutes', 5);
       query_id       | samples |  pct  | calls  | total_exec_time_ms | mean_exec_time_ms |                             query_text
----------------------+---------+-------+--------+--------------------+-------------------+---------------------------------------------------------------------
 -2835399305386018931 |     110 | 29.73 | 283202 |            1234.56 |              0.00 | commit
  3365820675399133794 |      86 | 23.24 | 283195 |          518349.35 |              1.83 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE bid = $2
  5457019535816659310 |      44 | 11.89 |     11 |          195225.25 |          17747.75 | select pg_sleep($1)
 -4378765880691287891 |      42 | 11.35 | 283195 |          113278.00 |              0.40 | UPDATE pgbench_tellers SET tbalance = tbalance + $1 WHERE tid = $2
-- breakdown by wait event type
select * from ash.top_by_type('5 minutes');
 wait_event_type | samples |  pct  |                       bar
-----------------+---------+-------+-------------------------------------------------
 Client          |     174 | 46.77 | ████████████████████████████████████████ 46.77%
 Timeout         |      44 | 11.83 | ██████████ 11.83%
 Lock            |      41 | 11.02 | █████████ 11.02%
 CPU*            |      35 |  9.41 | ████████ 9.41%
 LWLock          |      33 |  8.87 | ████████ 8.87%
 IdleTx          |      26 |  6.99 | ██████ 6.99%
 IO              |      19 |  5.11 | ████ 5.11%

Analyze a specific query

-- what is query 3365820675399133794 waiting on?
select * from ash.query_waits(3365820675399133794, '5 minutes');
     wait_event     | samples |  pct  |                       bar
--------------------+---------+-------+-------------------------------------------------
 Client:ClientRead  |      32 | 54.24 | ████████████████████████████████████████ 54.24%
 Lock:transactionid |      12 | 20.34 | ███████████████ 20.34%
 LWLock:WALWrite    |       6 | 10.17 | ████████ 10.17%
 CPU*               |       4 |  6.78 | █████ 6.78%
 IO:WalSync         |       3 |  5.08 | ████ 5.08%
 IdleTx             |       2 |  3.39 | ██ 3.39%
-- same, but during a specific time window
select * from ash.query_waits_at(3365820675399133794, '2026-02-16 08:38', '2026-02-16 08:40');

Drill into a wait event

-- which queries are stuck on Lock:transactionid?
select * from ash.event_queries('Lock:transactionid', '1 hour');

-- or by wait type (matches all events of that type)
select * from ash.event_queries('IO', '1 hour');

Browse raw samples

-- see the last 20 decoded samples with query text
select * from ash.samples('10 minutes', 20);
      sample_time       | database_name | active_backends |     wait_event     |       query_id       |                          query_text
------------------------+---------------+-----------------+--------------------+----------------------+--------------------------------------------------------------
 2026-02-16 11:18:51+00 | postgres      |               7 | CPU*               | -2835399305386018931 | END
 2026-02-16 11:18:51+00 | postgres      |               7 | CPU*               |  3365820675399133794 | UPDATE pgbench_branches SET bbalance = bbalance + $1 WHERE ...
 2026-02-16 11:18:49+00 | postgres      |               5 | Client:ClientRead  |  9144568883098003499 | SELECT abalance FROM pgbench_accounts WHERE aid = $1
 2026-02-16 11:18:49+00 | postgres      |               5 | IO:WalSync         | -2835399305386018931 | END
 2026-02-16 11:18:49+00 | postgres      |               3 | Lock:transactionid | -2835399305386018931 | END
 2026-02-16 11:18:49+00 | postgres      |               5 | LWLock:WALWrite    | -2835399305386018931 | END
-- raw samples during an incident
select * from ash.samples_at('2026-02-14 03:00', '2026-02-14 03:05', 50);

Dump samples to CSV

Always go through ash.samples() / ash.samples_at() — the underlying ash.sample table stores a packed integer[] and cannot be joined directly. The defaults for ash.samples() are p_interval => '1 hour' and p_limit => 100; pass a large p_limit when dumping.

-- dump every sample from the last hour
\copy (select * from ash.samples('1 hour'::interval, 10000000)) to '/tmp/ash.csv' csv header

-- dump a specific incident window
\copy (select * from ash.samples_at('2026-02-14 03:00', '2026-02-14 03:05', 10000000)) to '/tmp/incident.csv' csv header

Use \copy (psql) rather than server-side COPY TO if /tmp isn't writable by the Postgres user (managed services), and check the exit status — silently redirecting stderr to /dev/null will hide errors like typos in table names.

Timeline chart

Visualize wait event patterns over time — spot spikes, correlate with deployments, see what changed.

select bucket_start, active, detail, chart
from ash.timeline_chart('5 minutes', '30 seconds', 3, 40);
      bucket_start       | active |                             detail                             |                           chart
-------------------------+--------+----------------------------------------------------------------+-----------------------------------------------------------
                         |        |                                                                | █ Client:ClientRead  ▓ LWLock:WALWrite  ░ IdleTx  · Other
 2026-02-16 08:37:30+00  |    2.0 | Other=2.0                                                      | ···········
 2026-02-16 08:38:00+00  |    7.0 | Client:ClientRead=2.3 LWLock:WALWrite=0.8 IdleTx=0.4 Other=3.5 | █████████████▓▓▓▓▓░░····················
 2026-02-16 08:38:30+00  |    6.6 | Client:ClientRead=4.0 LWLock:WALWrite=0.4 IdleTx=0.5 Other=1.7 | ███████████████████████▓▓░░░··········
 2026-02-16 08:39:00+00  |    5.3 | Client:ClientRead=3.3 LWLock:WALWrite=0.3 IdleTx=0.7 Other=1.0 | ███████████████████▓▓░░░░······

Each rank gets a distinct character — █ (rank 1), ▓ (rank 2), ░ (rank 3), ▒ (rank 4+), · (Other) — so the breakdown is visible without color.

-- zoom into a specific time window
select * from ash.timeline_chart_at(
  now() - interval '10 minutes', now(),
  '1 minute', 3, 50
);

Experimental: ANSI colors. Enable per-session or per-call — green = CPU*, blue = IO, red = Lock, pink = LWLock, cyan = IPC, yellow = Client, orange = Timeout, teal = BufferPin, purple = Activity, light purple = Extension, light yellow = IdleTx.

-- Option 1: enable once for the session (recommended)
set ash.color = on;

-- Option 2: per-call
select * from ash.top_waits('1 hour', p_color => true);

psql's table formatter escapes ANSI codes — to render colors, pipe through sed:

-- add to ~/.psqlrc for a reusable :color command
\set color '\\g | sed ''s/\\\\x1B/\\x1b/g'' | less -R'

-- then use it
select * from ash.top_waits('1 hour') :color
select * from ash.timeline_chart('1 hour') :color

Colors also render natively in pgcli, DataGrip, and other clients that pass raw bytes.

(screenshot: top_waits with ANSI colors)

(screenshot: timeline_chart with ANSI colors)

Example data generated with pgbench -c 8 -T 65 on Postgres 17 with concurrent lock contention and idle-in-transaction sessions.

Investigate an incident

Use the _at functions with absolute timestamps to zoom into a specific time window:

-- what happened between 3:00 and 3:10 am?
select * from ash.top_waits_at('2026-02-14 03:00', '2026-02-14 03:10');

-- which queries were running during the incident?
select * from ash.top_queries_at('2026-02-14 03:00', '2026-02-14 03:10');

-- minute-by-minute timeline of the incident
select * from ash.wait_timeline_at(
    '2026-02-14 03:00',
    '2026-02-14 03:10',
    '1 minute'
);

LLM-assisted investigation

pg_ash functions chain naturally with the way an LLM investigates a problem — each answer tells it what to ask next.

Prompt: "There was a performance issue about 5 minutes ago. Investigate."

Step 1 — the LLM checks the big picture:

select * from ash.activity_summary('10 minutes');
       metric        |  value
---------------------+---------
 samples             | 600
 avg_active_sessions | 4.2
 max_active_sessions | 12
 top_wait_event      | Lock:tuple
 top_query_id        | 7283901445

"Average 4.2 active sessions but peak 12 — something spiked. And Lock:tuple is the top wait event."

Step 2 — drill into the waits:

select * from ash.top_waits('10 minutes');
     wait_event     | samples | pct  |         bar
--------------------+---------+------+---------------------
 Lock:tuple         |    2810 |  68% | ████████████████████
 CPU*               |     830 |  20% | ██████
 IO:DataFileRead    |     290 |   7% | ██
 Client:ClientRead  |     125 |   3% | █
 Other              |      83 |   2% |

"Lock:tuple is 68% of all waits. Multiple sessions fighting over the same rows."

Step 3 — see the timeline:

select * from ash.timeline_chart('10 minutes', '30 seconds', 5, 60);
      bucket_start       | active |                              chart                              |              detail
-------------------------+--------+-----------------------------------------------------------------+----------------------------------
                         |        | █ Lock:tuple  ▓ CPU*  ░ IO:DataFileRead  ▒ Client:ClientRead    |
 2026-02-17 14:00:00+00  |    2.1 | ▓▓▓▓░░···                                                       | CPU*=1.2 IO=0.5 Other=0.4
 2026-02-17 14:00:30+00  |    2.3 | ▓▓▓▓▓░░···                                                      | CPU*=1.4 IO=0.5 Other=0.4
 2026-02-17 14:01:00+00  |    3.8 | ██▓▓▓▓░░····                                                    | Lock=0.6 CPU*=1.5 IO=0.5 Other=1.2
 2026-02-17 14:01:30+00  |    8.5 | █████████████████▓▓▓░░···                                       | Lock=5.8 CPU*=1.3 IO=0.5 Other=0.9
 2026-02-17 14:02:00+00  |   12.0 | █████████████████████████████▓▓░░··                             | Lock=9.8 CPU*=1.1 IO=0.4 Other=0.7
 2026-02-17 14:02:30+00  |   11.2 | ███████████████████████████▓▓░░··                               | Lock=9.0 CPU*=1.0 IO=0.5 Other=0.7
 2026-02-17 14:03:00+00  |    4.1 | ███▓▓▓▓░░····                                                   | Lock=1.2 CPU*=1.5 IO=0.6 Other=0.8
 2026-02-17 14:03:30+00  |    2.0 | ▓▓▓▓░···                                                        | CPU*=1.1 IO=0.4 Other=0.5
 2026-02-17 14:04:00+00  |    2.2 | ▓▓▓▓░░···                                                       | CPU*=1.3 IO=0.5 Other=0.4

"The spike is clearly 14:01:30 to 14:02:30 — Lock:tuple dominates. Let me find which queries."

Step 4 — find the guilty queries:

select * from ash.event_queries_at(
  'Lock:tuple',
  '2026-02-17 14:01:30', '2026-02-17 14:02:30'
);
  query_id   | samples | pct  |         bar
-------------+---------+------+---------------------
  7283901445 |     412 |  85% | ████████████████████
  9102384756 |      53 |  11% | ███
  Other      |      19 |   4% | █

"Query 7283901445 accounts for 85% of the lock waits."

Step 5 — get the SQL text:

select * from ash.top_queries_with_text('10 minutes');
  query_id   | samples | pct  | mean_exec_time_ms | total_exec_time_ms |                    query_text
-------------+---------+------+-------------------+--------------------+--------------------------------------------------
  7283901445 |    2395 |  58% |            842.30 |          534120.50 | UPDATE orders SET status = $1 WHERE id = $2
  9102384756 |     530 |  13% |            215.60 |           42347.20 | UPDATE orders SET shipped_at = $1 WHERE id = $2

LLM's conclusion:

Root cause: multiple concurrent UPDATE orders ... WHERE id = $2 statements are contending on the same rows (Lock:tuple). Two different update patterns hit the orders table — status updates and shipping updates — and when they target overlapping rows, they serialize on tuple locks.

Mitigation options:

  1. Use SELECT ... FOR UPDATE SKIP LOCKED to skip already-locked rows and process them later
  2. Batch the status and shipping updates into a single statement to reduce lock duration
  3. If these run from a queue worker, reduce concurrency or partition the work by order ID range

How it works

Sampling

ash.take_sample() runs every second via pg_cron. It reads pg_stat_activity, groups active backends by (wait_event_type, wait_event, state), and encodes the result into a single integer[] per database:

{-5, 3, 101, 102, 103, -1, 2, 104, 105, -8, 1, 106}
 │   │  │              │  │  │           │  │  │
 │   │  └─ query_ids   │  │  └─ qids     │  │  └─ qid
 │   └─ count=3        │  └─ count=2     │  └─ count=1
 └─ wait_event_id=5    └─ weid=1         └─ weid=8

6 active backends across 3 wait events = 1 row, 12 array elements. Each query_id is one backend — if two backends run the same query, the same map_id appears twice (the count reflects total backends, not distinct queries). Full row size: 24 (tuple header) + 4 (sample_ts) + 4 (datid) + 2 (active_count) + 2 (slot) + 68 (array: 20-byte header + 12 × 4) + alignment = 106 bytes (measured with pg_column_size).
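
The layout can be unpacked with plain SQL. A minimal sketch against the example array above (prefer ash.decode_sample() in practice, which also resolves the map IDs to names):

-- a negative element opens a group (wait_event_id = -element), the next
-- element is the backend count, followed by that many query map ids
with s(data) as (
  values ('{-5,3,101,102,103,-1,2,104,105,-8,1,106}'::int[])
)
select -data[i]  as wait_event_id,
       data[i+1] as backend_count,
       (select array_agg(data[j])
          from generate_series(i + 2, i + 1 + data[i+1]) j) as query_map_ids
from s, generate_subscripts(data, 1) as i
where data[i] < 0;
-- wait_event_id | backend_count | query_map_ids
-- 5             | 3             | {101,102,103}
-- 1             | 2             | {104,105}
-- 8             | 1             | {106}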

Dictionary tables

| Table | Purpose |
|---|---|
| ash.wait_event_map | Maps (state, wait_event_type, wait_event) to integer IDs |
| ash.query_map_0..{N-1} | Maps query_id (from pg_stat_activity) to integer IDs (one per sample partition, truncated on rotation) |
| ash.rollup_1m | Per-minute aggregated samples (30-day retention) |
| ash.rollup_1h | Per-hour aggregated rollups (5-year retention) |

Dictionaries are auto-populated by the sampler. Wait events are stable (~600 entries max across all Postgres versions). Query map grows as new queries appear and is garbage-collected based on last_seen.

Encoding version is tracked in ash.config.encoding_version, not in the array itself — zero per-row overhead.

Note on CPU*: When wait_event_type and wait_event are both NULL in pg_stat_activity, the backend is active but not in a known wait state. This is either genuine CPU work or an uninstrumented code path where Postgres does not report a wait event. The asterisk signals this ambiguity. See gaps.wait.events for details on uninstrumented wait events in Postgres — these gaps are being closed over time, making CPU* increasingly accurate.

Rotation

Skytools PGQ-style N-partition ring buffer (default N=3, configurable 3–32 via ash.rebuild_partitions(N, 'yes')). Physical tables (sample_0 through sample_{N-1}) rotate at rotation_period intervals. TRUNCATE replaces the oldest partition — zero dead tuples, zero bloat, no VACUUM needed for sample tables.

N-1 partitions hold data at any time. One is always empty, ready for the next rotation. Before truncation, rotate() calls rollup_minute() to aggregate endangered samples into rollup tables.

┌──────────┐  ┌───────────┐  ┌──────────┐    ┌───────────────┐
│ sample_0 │  │ sample_1  │  │ sample_2 │    │ sample_{N-1}  │
│ (today)  │  │(yesterday)│  │ (empty)  │... │ (readable)    │
│ writing  │  │ readable  │  │ next     │    │               │
└──────────┘  └───────────┘  └──────────┘    └───────────────┘
                              ↑ TRUNCATE + rotate

Reader optimization

Reader functions decode arrays inline using generate_subscripts() with direct array subscript access. This avoids per-row plpgsql function calls and is 9-17x faster than the CROSS JOIN LATERAL decode_sample() approach.

Storage

Raw samples

| Active backends | Storage/day | Max on disk (N-1 partitions, default N=3) |
|---|---|---|
| 10 | 11 MiB | 22 MiB |
| 50 | 30 MiB | 60 MiB |
| 100 | 50 MiB | 100 MiB |
| 200 | 100 MiB | 200 MiB |
| 500 | 245 MiB | 490 MiB |

At 500+ backends, TOAST LZ4 compression reduces actual storage. Increasing num_partitions increases the number of days kept, not the daily rate.
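
The figures above follow from the row size worked out in the Sampling section. A back-of-envelope check at 10 backends, assuming one ~106-byte row per sample and a single database (indexes account for the remainder up to 11 MiB):

select pg_size_pretty(106::bigint * 86400) as heap_per_day;  -- ≈ 8.7 MiB of heap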

Rollup tables

| Level | Retention | Rows/db | Storage/db |
|---|---|---|---|
| 1-minute (rollup_1m) | 30 days | ~43,200 | ~43 MiB |
| 1-hour (rollup_1h) | 5 years | ~43,800 | ~77 MiB |

Total: ~120 MiB per database for 5 years of trend data.

Performance

Measured on Postgres 17, 50 backends, 1s sampling, jit = off (median of 10 runs, warm cache):

| Metric | Result |
|---|---|
| top_waits('1 hour') | 30 ms |
| top_waits('24 hours') | 6.1 s |
| top_queries_with_text('1 hour') | 31 ms |
| take_sample() overhead | 53 ms |
| WAL per sample | ~29 KiB (~2.4 GiB/day) |
| Rotation (1-day partition) | 9 ms |
| Dead tuples after rotation | 0 |

See issue #1 for full benchmarks — EXPLAIN ANALYZE output, backend scaling, multi-database tests, WAL analysis, and concurrency testing.

Requirements

  • Postgres 14+ (requires query_id in pg_stat_activity)
  • pg_cron 1.5+ (optional — for built-in scheduling; see Scheduling without pg_cron for alternatives)
  • pg_stat_statements (optional but recommended — enables query text and execution metrics; without it, top_queries_with_text(), event_queries(), and event_queries_at() will error, and top_queries(), samples() will return NULL for query_text)

Note on query_id: The default compute_query_id = auto only populates query_id when pg_stat_statements is in shared_preload_libraries. If query_id is NULL in pg_stat_activity, set:

alter system set compute_query_id = 'on';
-- requires reload: select pg_reload_conf();
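
A quick sanity check after the change (a non-NULL query_id for your own backend confirms the per-query functions will see data):

show compute_query_id;
select query_id from pg_stat_activity where pid = pg_backend_pid();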

Configuration

-- change sampling interval (default: 1 second)
select ash.stop();
select ash.start('5 seconds');

-- change rotation interval (default: 1 day)
update ash.config set rotation_period = '12 hours';

-- check current configuration
select * from ash.status();

Defaults

All configuration is in the ash.config singleton table:

| Setting | Default | Description |
|---|---|---|
| sample_interval | 1 second | Time between samples |
| rotation_period | 1 day | How often partitions rotate |
| num_partitions | 3 | Number of sample partitions (3–32) |
| include_bg_workers | false | Sample autovacuum, logical replication, parallel workers |
| debug_logging | false | RAISE LOG for every sampled session |
| rollup_1m_retention_days | 30 | How long to keep minute-level rollups |
| rollup_1h_retention_days | 1825 | How long to keep hourly rollups (5 years) |
| rollup_min_backend_seconds | 3 | Minimum backend-seconds for a query to appear in rollup query_counts |

Configurable partitions

By default, pg_ash uses 3 partitions (1 day of history + current partial). To keep more raw sample history, increase the partition count:

-- keep 7 days of raw samples (9 partitions × 1-day rotation = 7 readable days + current)
-- 'yes' is required because the call drops all raw sample data
select ash.rebuild_partitions(9, 'yes');

-- resume sampling after rebuild
select ash.start();

-- verify
select * from ash.status();
--  num_partitions  | 9
--  raw_retention   | 7 days + current partial

The retention formula is (N - 2) × rotation_period. The minimum is 3 (current + previous + one being truncated), the maximum is 32.

rebuild_partitions() is destructive — all raw samples are lost. To prevent accidents, the call requires a 'yes' confirmation token (e.g. ash.rebuild_partitions(9, 'yes')); calling it without 'yes' raises an error and changes nothing. Rollup tables survive. You must call ash.start() afterward to resume sampling.

Rollup tables for long-term trends

Raw samples rotate away after (N-2) × rotation_period. Rollup tables preserve aggregated data for long-term trend analysis:

  • rollup_1m: per-minute aggregates, kept for 30 days (~43 MiB/db)
  • rollup_1h: per-hour aggregates, kept for 5 years (~77 MiB/db)

Rollups are populated automatically when pg_cron is available (ash.start() schedules them). Without pg_cron, schedule externally:

# Every minute: aggregate raw samples into minute rollups
* * * * * psql -qAtX -d mydb -c "SELECT ash.rollup_minute();"

# Every hour: aggregate minutes into hourly rollups
0 * * * * psql -qAtX -d mydb -c "SELECT ash.rollup_hour();"

# Daily at 3am: delete expired rollup rows
0 3 * * * psql -qAtX -d mydb -c "SELECT ash.rollup_cleanup();"

Query rollup data with the rollup reader functions:

-- what were the top wait events in the last 6 hours? (from minute rollups)
select * from ash.minute_waits('6 hours');

-- top queries over the last week (from hourly rollups)
select * from ash.hourly_queries('7 days');

-- peak concurrency trend over the last 30 days
select * from ash.daily_peak_backends('30 days');

-- investigate a specific time range (even if raw samples are gone)
select * from ash.minute_waits_at('2026-03-01 02:00', '2026-03-01 03:00');

Rollups use backend-seconds as the count unit (Oracle ASH-compatible). Each sample appearance = 1 backend-second at 1s sampling interval.
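
At other sampling intervals each appearance represents one interval of wall time, so the unit scales accordingly. A small arithmetic sketch (inferred from the 1 s rule above, not an ash API):

-- 120 sampled appearances at a 5-second interval ≈ 600 backend-seconds
select 120 * extract(epoch from interval '5 seconds') as backend_seconds;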

To change retention:

update ash.config set rollup_1m_retention_days = 14 where singleton;   -- keep 2 weeks
update ash.config set rollup_1h_retention_days = 365 where singleton;  -- keep 1 year

Debug logging

Enable per-session RAISE LOG output from take_sample() — useful for diagnosing connection pooler issues (e.g., PgBouncer mapping client_addr to pooler sessions):

-- check current state
select ash.set_debug_logging();

-- enable: each take_sample() call logs every active session to the Postgres log
select ash.set_debug_logging(true);

-- sample output in the Postgres server log:
-- LOG: ash.take_sample: pid=107 state=active wait_type=CPU* wait_event=CPU* backend_type=client backend query_id=-5287352711091412819
-- LOG: ash.take_sample: pid=108 state=idle in transaction wait_type=Client wait_event=ClientRead backend_type=client backend query_id=-6949053775937549307

-- disable
select ash.set_debug_logging(false);

pg_cron run history

pg_cron logs every job execution to cron.job_run_details. At 1-second sampling, this adds ~12 MiB/day of unbounded growth with no built-in purge.
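
The figure is plain arithmetic: one job_run_details row per run, 86,400 runs/day at 1-second sampling, at roughly 145 bytes per row (a rough per-row size implied by the ~12 MiB/day claim):

select pg_size_pretty((86400 * 145)::bigint) as run_details_per_day;  -- ≈ 12 MiB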

Recommended: disable cron.log_run. Errors from failed jobs still appear in the Postgres server log (cron.log_min_messages defaults to WARNING) — you lose nothing important, only the job_run_details table entries.

alter system set cron.log_run = off;
-- requires Postgres restart (postmaster context)

If you need run history for other pg_cron jobs, keep cron.log_run enabled and schedule periodic cleanup instead (unfortunately, as of pg_cron 1.6, per-job logging configuration is not supported):

select cron.schedule(
  'ash_purge_cron_log',
  '0 * * * *',
  $$delete from cron.job_run_details where end_time < now() - interval '1 day'$$
);

ash.start() will warn about this overhead.

Scheduling without pg_cron

pg_cron is optional. All core functions — ash.take_sample(), ash.rotate(), and all reporting — work without it. When pg_cron is not installed, ash.start('1 second') records the intended interval in ash.config and prints instructions for external scheduling.

You can call ash.take_sample() from any external scheduler:

System cron (1-minute minimum granularity):

# Every minute
* * * * * psql -qAtX -d mydb -c "SET statement_timeout='500ms'; SELECT ash.take_sample();"

# Every second (cron launches a loop each minute)
* * * * * for i in $(seq 1 59); do psql -qAtX -d mydb -c "SET statement_timeout='500ms'; SELECT ash.take_sample();"; sleep 1; done

Dedicated loop script (most reliable for 1-second sampling):

#!/bin/bash
# ash_sampler.sh — run via systemd, screen, tmux, or nohup
while true; do
  psql -qAtX -d mydb -c "SET statement_timeout='500ms'; SELECT ash.take_sample();" >/dev/null  # discard output, keep stderr visible so failures surface
  sleep 1
done

systemd timer (Linux, precise 1-second ticking):

# /etc/systemd/system/ash-sampler.service
[Service]
Type=oneshot
# systemd requires an absolute binary path; adjust to your psql location
ExecStart=/usr/bin/psql -qAtX -d mydb -c "SET statement_timeout='500ms'; SELECT ash.take_sample();"
User=postgres

# /etc/systemd/system/ash-sampler.timer
[Timer]
OnActiveSec=0
OnUnitActiveSec=1s
AccuracySec=100ms
[Install]
WantedBy=timers.target

psql \watch (quick ad-hoc testing):

SELECT ash.take_sample() \watch 1

Any language (Python example):

import psycopg2, time
conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True
while True:
    with conn.cursor() as cur:
        cur.execute("SET statement_timeout='500ms'; SELECT ash.take_sample()")
    time.sleep(1)

Don't forget to also schedule rotation and rollups:

# System cron: rotate daily at midnight
0 0 * * * psql -qAtX -d mydb -c "SELECT ash.rotate();"

# Rollup: every minute, every hour, daily cleanup
* * * * * psql -qAtX -d mydb -c "SELECT ash.rollup_minute();"
0 * * * * psql -qAtX -d mydb -c "SELECT ash.rollup_hour();"
0 3 * * * psql -qAtX -d mydb -c "SELECT ash.rollup_cleanup();"

Privileges

pg_ash installs with a locked-down privilege model: admin functions (ash.start(), ash.stop(), ash.rotate(), ash.take_sample(), ash.set_debug_logging(), ash.uninstall()) are restricted to the schema owner, and EXECUTE on all reader functions plus SELECT on reader tables (ash.sample, ash.query_map_all, ash.config, ash.wait_event_map, and per-slot partitions) is revoked from PUBLIC. The installing role retains full access.

Grant access to a monitoring or read-only role with the convenience helpers:

-- one call, minimum privileges: USAGE on schema ash, EXECUTE on every
-- public reader function, SELECT on the tables readers depend on
-- (sample + partitions, query_map_all + partitions, config, wait_event_map,
-- rollup_1m, rollup_1h). Idempotent.
create role grafana login password 'xxx';
select ash.grant_reader('grafana');

-- ...later, take it back. Symmetric undo of grant_reader().
select ash.revoke_reader('grafana');

Both helpers are owner-only, validate the role exists in pg_roles, quote the role name, and emit a RAISE NOTICE summarizing what changed.

Note: If you subsequently change the partition count via ash.rebuild_partitions(N, 'yes'), previously-granted reader roles will lose access to the new partition tables. Re-run ash.grant_reader(...) for each monitoring role after any rebuild_partitions call.

Note on pg_cron visibility: ash.grant_reader() does not grant USAGE ON SCHEMA cron, since pg_cron is not an ash object. When pg_cron is loaded but the monitoring role lacks USAGE on schema cron, ash.status() emits a single fallback row of the form cron_jobs = '<no cron.job access; grant USAGE ON SCHEMA cron TO <role>>' instead of per-job cron_job_* rows. To surface real cron job details, either run grant usage on schema cron to <role> (and grant select on cron.job to <role>) once, or simply ignore the row.

If you prefer manual control, the equivalent explicit grants are:

-- allow a monitoring role to call the readers
grant usage on schema ash to my_monitor_role;
grant execute on function ash.top_waits(interval, int, int)           to my_monitor_role;
grant execute on function ash.top_queries(interval, int)              to my_monitor_role;
grant execute on function ash.top_queries_with_text(interval, int)    to my_monitor_role;
grant execute on function ash.samples(interval, int)                  to my_monitor_role;
grant execute on function ash.status()                                to my_monitor_role;

-- or grant all reader functions at once
grant execute on all functions in schema ash to my_monitor_role;

-- grant direct read on raw tables for ad-hoc SQL (optional)
grant select on all tables in schema ash to my_monitor_role;

pg_stat_statements in a non-default schema

pgss reader functions (top_queries, top_queries_at, top_queries_with_text, samples, samples_at, event_queries, event_queries_at) need the pg_stat_statements schema on their search_path. Install detects it automatically. If you install pg_stat_statements after pg_ash, or move it to a non-default schema, re-apply:

-- detect the pgss schema and re-apply search_path on pgss readers
select ash._apply_pgss_search_path();

Known limitations

  • Primary only — pg_ash requires writes (INSERT into sample tables, TRUNCATE on rotation), so it cannot run on physical standbys or read replicas. Install it on the primary; it samples all databases from there.
  • Observer-effect protection — the sampler pg_cron command includes SET statement_timeout = '500ms' to prevent take_sample() from becoming a problem on overloaded servers. If pg_stat_activity is slow (thousands of backends), the sample is canceled rather than piling up. Normal execution is ~50ms — the 500ms cap gives 10× headroom. Adjust in cron.job if needed.
  • Sampling gaps under heavy load — pg_cron runs in a single background worker, and under heavy load (lock storms, many concurrent sessions) it can't always keep up with the 1-second schedule. You may see gaps of 8s, 13s, or even 30s+ between samples — ironically during the most interesting moments. This is a fundamental pg_cron limitation, not a bug. If precise 1-second sampling matters, use an external sampler, which is more reliable under load.
  • 24-hour raw sample queries are slow (~6s for full-day scan) — use rollup reader functions (ash.minute_waits, ash.hourly_queries, ash.daily_peak_backends) for queries over long time ranges.
  • JIT protection built in — all reader functions use SET jit = off to avoid JIT compilation overhead (JIT can slow these queries by 10-750x depending on Postgres version and dataset size). No global configuration needed.
  • Single-database install — pg_ash installs in one database and samples all databases from there. Per-database filtering works via the datid column.
  • query_map hard cap at 50k entries — on Postgres 14-15, volatile SQL comments (e.g., marginalia, sqlcommenter with session IDs or timestamps) produce unique query_id values that are not normalized. This can flood the query_map partitions. A hard cap of 50,000 entries per partition prevents unbounded growth — queries beyond the cap are tracked as "unknown." PG16+ normalizes comments, so this is rarely hit. Check query_map_count in ash.status() to monitor.
  • Parallel query workers counted individually — parallel workers share the same query_id as the leader but are counted as separate backends. This inflates the apparent "weight" of parallel queries in top_queries(). leader_pid grouping is not yet implemented.
  • WAL overhead — 1-second sampling generates ~29 KiB WAL per sample (~2.4 GiB/day), dominated by full_page_writes. This is significant for WAL-sensitive replication setups. Consider 5-second or 10-second sampling intervals (ash.start('5 seconds')) if WAL volume is a concern. The overhead scales linearly with sampling frequency.
  • Epoch overflow horizon (~2094) — sample_ts is stored as int4 seconds since 2026-01-01 UTC, and int4 is exhausted around 2094-01-19. Past that point, the ::int4 cast in ash.take_sample() raises ERROR: integer out of range and sampling hard-fails (it does NOT silently wrap). ash.status() exposes epoch_seconds_remaining so operators can plan a bigint migration of the column well before the horizon. See issue #37 and the arithmetic sketch below.
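
Where the 2094-01-19 horizon comes from, as plain arithmetic over the epoch base stated above:

-- int4 max (2147483647 seconds) added to the 2026-01-01 epoch base
select timestamptz '2026-01-01 00:00:00+00' + 2147483647 * interval '1 second';
-- 2094-01-19 03:14:07+00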

License

Apache 2.0


pg_ash is part of SAMO — self-driving Postgres.
