High memory usage at idle

When biofuse runs against the wide_bench VCZ store (100k samples × 496k variants, default 32-worker readahead), the encoder-server subprocess holds roughly 650–900 MiB of resident memory at idle, that is, after several open/read/close cycles have completed and no streaming file handles remain. The expected idle footprint is well under 100 MiB: a `VczReader` with a handful of cached metadata arrays, plus the precomputed static-sidecar bytes and an idle `ThreadPoolExecutor`.

## Conclusion (short)

The retention is not in Python objects. It is glibc per-thread malloc arena fragmentation. The 32-worker readahead pool inside `VczReader` allocates large zarr-decode buffers across many threads; freed pages stay pinned in per-thread arenas indefinitely. `pympler.asizeof` puts the entire live Python graph at ~30 MiB and `gc.get_objects()` reports zero live numpy arrays larger than 1 MiB, yet RSS sits at several hundred MiB. In the in-process probe, `MALLOC_ARENA_MAX=2` plus an explicit `malloc_trim(0)` brings RSS down to 141–169 MiB, within ~70 MiB of the expected idle budget.

## How the investigation was run

The reproducer is in-process: spin up `biofuse.encoder_server._ServerSession` directly (no FUSE, no socket, no subprocess), then loop 5×: construct an encoder via `spec.encoder_factory(reader)`, issue a short (16 MiB) `encoder.read(0, …)` to drive the StreamReader pipeline, drop the encoder, and call `gc.collect()`. At each lifecycle point we record `psutil.Process().memory_info().rss`, take a `tracemalloc` snapshot, measure the `VczReader` and `_ServerSession` with `pympler.asizeof`, walk `gc.get_objects()` for live numpy arrays larger than 1 MiB, and count live `StreamReader` / `CachedLogicalVariantsChunk` instances. A second probe drives the real `EncoderClient` over its `AF_UNIX` socket and reads the child's RSS via `/proc/<pid>/status`. Both probes were run with the default arena count and with `MALLOC_ARENA_MAX=2` for comparison.
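The two cheapest measurements in that loop can be sketched with the standard library alone. This is an illustrative sketch, not the probe's actual code: `parse_vmrss_mib`, `read_rss_mib`, and this `count_instances` implementation are hypothetical helpers, and the `/proc` path only exists on Linux.

```python
import gc
import re

# /proc/<pid>/status reports resident set size as "VmRSS:   <n> kB".
_VMRSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vmrss_mib(status_text):
    """Extract VmRSS from the text of /proc/<pid>/status and convert to MiB."""
    match = _VMRSS_RE.search(status_text)
    if match is None:
        raise ValueError("no VmRSS line found")
    return int(match.group(1)) / 1024


def read_rss_mib(pid):
    """Read the RSS of a (possibly different) process, e.g. the server child."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_vmrss_mib(fh.read())


def count_instances(cls):
    """Count GC-tracked objects whose type is cls or a subclass of cls."""
    gc.collect()
    return sum(1 for obj in gc.get_objects() if issubclass(type(obj), cls))
```

Checking `type(obj)` rather than calling `isinstance` avoids touching `__class__` on proxy objects while walking the whole heap.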

## Numbers

In-process probe, PLINK and BGEN on wide_bench. "Idle" is RSS after pass 5 followed by two `gc.collect()` calls; the last column is RSS after a single `ctypes` call to `malloc_trim(0)` from the main thread.

| Config | Post-import | Post-session | Post-pass-1 | Idle | After `malloc_trim(0)` |
| --- | --- | --- | --- | --- | --- |
| PLINK, 32 workers, default arenas | 83 | 203 | 528 | **666** | 583 |
| PLINK, 32 workers, `MALLOC_ARENA_MAX=2` | 83 | 191 | 206 | **461** | **141** |
| PLINK, 4 workers, default arenas | 83 | 191 | 436 | **511** | 433 |
| BGEN, 32 workers, default arenas | 83 | 184 | 449 | **703** | 503 |
| BGEN, 32 workers, `MALLOC_ARENA_MAX=2` | 83 | 210 | 196 | **902** | **169** |

All values are MiB. The pattern is the same for both formats: with the default 8×ncores glibc arena cap, `malloc_trim(0)` from a non-arena thread releases only the main arena's pad (about 80–200 MiB out of 500–700 MiB) and the remainder stays pinned in the per-thread arenas. With `MALLOC_ARENA_MAX=2`, freed pages concentrate into one of two arenas and the same `malloc_trim` call returns 320–730 MiB to the kernel. BGEN reaches a higher transient peak than PLINK because `_build_bgen_static` calls `write_bgen_index`, which does a full variant scan during the handshake.

End-to-end probe via the real `EncoderClient` + spawn subprocess (5 short-read passes, child RSS measured at "final idle" after a 2s settle):

| Config | PLINK idle | BGEN idle |
| --- | --- | --- |
| Unpatched | 644.6 MiB | 904.6 MiB |
| `MALLOC_ARENA_MAX=2` (env var only) | 425.4 MiB | 841.6 MiB |

`MALLOC_ARENA_MAX=2` alone gives a 7–34% reduction. The full ~80% reduction the in-process probe demonstrated only materialises when a `malloc_trim(0)` call runs in the child after the working set drains. Without that call, even with two arenas, glibc keeps the high-water-mark pages mapped because the top chunk of the arena is not free.
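The percentages quoted above follow directly from the two tables. The sketch below recomputes them; for the in-process figure it assumes the baseline is the default-arena idle value and the endpoint is the `MALLOC_ARENA_MAX=2` value after `malloc_trim(0)`, which is one reasonable reading of "~80%".

```python
def reduction_pct(before_mib, after_mib):
    """Percentage drop in RSS going from before_mib to after_mib."""
    return 100.0 * (before_mib - after_mib) / before_mib


# End-to-end probe: unpatched vs. MALLOC_ARENA_MAX=2 (env var only).
end_to_end = {
    "PLINK": reduction_pct(644.6, 425.4),
    "BGEN": reduction_pct(904.6, 841.6),
}

# In-process probe: default-arena idle vs. two arenas after malloc_trim(0).
in_process = {
    "PLINK": reduction_pct(666, 141),
    "BGEN": reduction_pct(703, 169),
}

for label, table in (("end-to-end", end_to_end), ("in-process", in_process)):
    for fmt, pct in table.items():
        print(f"{label:>10} {fmt}: {pct:.1f}% lower")
```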

## What is *not* the cause (ruled out)

- `reader._static_field_cache` is always empty during PLINK and BGEN static-file builds. `generate_bim` / `generate_fam` / `generate_sample` access `@cached_property` fields, not `_load_static_field`.
- `gc.get_objects()` filtered to `np.ndarray` with `nbytes > 1 MiB` is empty at idle — no leaked decoded chunks.
- `count_instances(StreamReader)` and `count_instances(CachedLogicalVariantsChunk)` are both zero between passes: no leaked iterator state.
- `tracemalloc` final-vs-baseline diff totals ~20 MiB of attributable Python allocations, dominated by `.bim`/`.fam` bytes and the chunk plan.
- The `@cached_property` arrays on `VczReader` (`raw_sample_ids`, `sample_ids`, `samples_selection`, `contig_ids`, …) together total ~8 MiB on wide_bench.
- `session.static_files` is ~10 MiB on PLINK wide_bench and similar on BGEN — confirmed via `len()` of each entry.

Together these account for ~30 MiB of live Python heap. The remaining 600–870 MiB of RSS is outside Python's view.

## Why the per-thread arena pattern fits

glibc's arena allocator gives each thread that calls `malloc` its own arena (up to `MALLOC_ARENA_MAX`, default `8 × ncores`). The readahead pool decodes zarr chunks on worker threads — these are large allocations (a single call_genotype chunk on wide_bench is 5000 variants × 100k samples × 2 ploidy ≈ 1 GiB uncompressed). When the consumer thread frees the resulting numpy array, the `free()` is routed to the arena the original `malloc` came from. glibc keeps those pages in the arena's freelist; they are returned to the kernel only when the top chunk of the arena is itself free and trim is invoked. A `malloc_trim(0)` call from the main thread reaches only the main arena. Hence: many arenas × deep working set per arena × no trim ⇒ several hundred MB of unreachable-but-mapped RSS.
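The two sizes in that paragraph can be checked with a few lines of arithmetic. This sketch assumes one byte per genotype call (an `int8` dtype, which is an assumption rather than something stated here) and uses a hypothetical 16-core host to illustrate the 64-bit default arena cap.

```python
GIB = 2**30


def chunk_bytes(variants, samples, ploidy, itemsize=1):
    """Uncompressed size of one call_genotype chunk (itemsize=1 assumes int8)."""
    return variants * samples * ploidy * itemsize


def default_arena_cap(ncores):
    """glibc's default arena limit on 64-bit systems: 8 arenas per core."""
    return 8 * ncores


size = chunk_bytes(variants=5_000, samples=100_000, ploidy=2)
print(f"chunk: {size:,} bytes = {size / GIB:.2f} GiB")
print(f"arena cap on a 16-core host: {default_arena_cap(16)}")
```

With 32 decode workers the pool stays well below that cap, so each worker can end up with an arena of its own.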

This is well-documented behaviour in multi-threaded Python workloads that allocate large objects across thread boundaries. `numpy` allocations larger than `M_MMAP_THRESHOLD` (default 128 KiB) go through `mmap` directly and are returned cleanly on free, but the chunked decode path also produces many allocations below that threshold, and those are the ones that accumulate in the arenas and dominate the idle footprint.

## Proposed follow-up — in order of preference

1. **Set `MALLOC_ARENA_MAX=2` for the encoder-server subprocess at spawn.** Smallest blast radius. In `biofuse/encoder_client.py` around `ctx.Process(...).start()`, mutate `os.environ["MALLOC_ARENA_MAX"]` for the duration of the spawn call (the env var is read by glibc at child libc init, so the parent's own allocator is unaffected). Respect a user-set override. Pass: confirmed via `/proc/<pid>/environ` that the value propagates; tests/ green (225 pass). Caveat: with only 2 arenas the 32 worker threads will contend on arena locks; we should validate throughput on the benchmark suite before landing.

2. **Add a `ctypes` `malloc_trim(0)` call in the server.** The env var alone gives modest gains (7–34% in the end-to-end probe); pairing it with periodic trim is what closes the gap. Natural call sites: at the end of `_ServerSession.__init__` after `build_static_files` (releases handshake-time peaks), and either in `serve_forever` between `accept()` calls or at the end of `_handle_connection`. Cheap (a few microseconds), no API impact. Requires `ctypes.CDLL(ctypes.util.find_library("c")).malloc_trim(0)`. The call must be guarded for non-glibc platforms (BSD, musl), although the server only runs on Linux today.

3. **Lower `DEFAULT_READAHEAD_WORKERS` from 32.** Fewer worker threads means fewer arenas receiving decode allocations. The 4-worker run idled at 511 MiB versus 666 MiB at 32 workers (about 23% lower), so this is only a partial mitigation. Worth measuring against icechunk / S3 backends before landing, since high-concurrency I/O is the main reason the default is 32.

4. **`LD_PRELOAD` `jemalloc` or `tcmalloc` in the server.** Biggest win, biggest blast radius — defer.
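
Options 1 and 2 could be combined roughly as follows. This is a minimal sketch, not the project's code: `env_default` and `try_malloc_trim` are hypothetical helper names, and the usage shown in the trailing comments assumes `ctx` is a `multiprocessing` spawn context as described for `biofuse/encoder_client.py`.

```python
import contextlib
import ctypes
import ctypes.util
import os
import sys


@contextlib.contextmanager
def env_default(name, value):
    """Temporarily set an environment variable unless the user already set it."""
    if name in os.environ:
        # Respect a user-set override: leave the existing value untouched.
        yield os.environ[name]
        return
    os.environ[name] = value
    try:
        yield value
    finally:
        # Restore the parent's environment once the child has been spawned.
        os.environ.pop(name, None)


def try_malloc_trim(pad=0):
    """Call glibc's malloc_trim(pad); return its result, or None if unavailable."""
    if not sys.platform.startswith("linux"):
        return None
    try:
        libc = ctypes.CDLL(ctypes.util.find_library("c"))
        trim = libc.malloc_trim  # AttributeError on libcs without malloc_trim
    except (OSError, AttributeError):
        return None
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    return trim(pad)  # 1 if memory was returned to the system, else 0


# Hypothetical usage in the client (spawn) and the server (after a connection):
#
#     with env_default("MALLOC_ARENA_MAX", "2"):
#         ctx.Process(target=server_main).start()
#
#     try_malloc_trim()  # e.g. at the end of _handle_connection
```

Returning `None` instead of raising keeps the call sites unconditional, so the server code does not need its own platform checks.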

