High memory usage at idle #21

@jeromekelleher

Running biofuse against the wide_bench VCZ store (100k samples × 496k variants, default 32-worker readahead), the encoder-server subprocess holds ~900 MB resident memory at idle, i.e. after several open/read/close cycles have completed and no streaming file handles remain. The expected idle footprint is well under 100 MB — a VczReader with a handful of cached metadata arrays, plus the precomputed static-sidecar bytes and an idle ThreadPoolExecutor.

Conclusion (short)

The retention is not in Python objects. It is glibc per-thread malloc arena fragmentation. The 32-worker readahead pool inside VczReader allocates large zarr-decode buffers across many threads; freed pages stay pinned in per-thread arenas indefinitely. pympler.asizeof puts the entire live Python graph at ~30 MiB and gc.get_objects() reports zero live numpy arrays larger than 1 MiB, yet RSS sits at hundreds of MB. MALLOC_ARENA_MAX=2 plus an explicit malloc_trim(0) returns RSS to within ~50 MB of the expected idle budget.

How the investigation was run

The reproducer is in-process: spin up biofuse.encoder_server._ServerSession directly (no FUSE, no socket, no subprocess), then loop 5× constructing an encoder via spec.encoder_factory(reader), issuing a short (16 MiB) encoder.read(0, …) to drive the StreamReader pipeline, dropping the encoder, and gc.collect(). At each lifecycle point we record psutil.Process().memory_info().rss, take a tracemalloc snapshot, asizeof the VczReader + _ServerSession, walk gc.get_objects() for live numpy arrays larger than 1 MiB, and count live StreamReader / CachedLogicalVariantsChunk instances. A second probe drives the real EncoderClient over its AF_UNIX socket and reads the child's RSS via /proc/<pid>/status. Both probes were run with default arena count and with MALLOC_ARENA_MAX=2 for comparison.
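The child-RSS measurement in the second probe can be sketched as below. This is a minimal illustration that assumes a Linux /proc filesystem; the helper names parse_vm_rss_kib and read_rss_mib are hypothetical and are not part of biofuse.

```python
import os
import re


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (KiB) from /proc/<pid>/status text, or None."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def read_rss_mib(pid=None):
    """Read the resident set size in MiB for pid (default: this process)."""
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as f:
        rss_kib = parse_vm_rss_kib(f.read())
    if rss_kib is None:
        # Kernel threads and zombies have no VmRSS line.
        raise RuntimeError(f"no VmRSS line for pid {pid}")
    return rss_kib / 1024
```

Calling read_rss_mib(child_pid) at each lifecycle point (post-import, post-session, after each pass, idle) would give one row of the tables in the Numbers section.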

Numbers

In-process probe, PLINK and BGEN on wide_bench. "Idle" is the state after pass 5 followed by two gc.collect() calls; the "After malloc_trim(0)" column is RSS after a single ctypes call to malloc_trim(0) from the main thread.

| Config | Post-import | Post-session | Post-pass-1 | Idle | After malloc_trim(0) |
| --- | --- | --- | --- | --- | --- |
| PLINK, 32 workers, default arenas | 83 | 203 | 528 | 666 | 583 |
| PLINK, 32 workers, MALLOC_ARENA_MAX=2 | 83 | 191 | 206 | 461 | 141 |
| PLINK, 4 workers, default arenas | 83 | 191 | 436 | 511 | 433 |
| BGEN, 32 workers, default arenas | 83 | 184 | 449 | 703 | 503 |
| BGEN, 32 workers, MALLOC_ARENA_MAX=2 | 83 | 210 | 196 | 902 | 169 |

All values are MiB. The pattern is identical for both formats: with the default 8×ncores glibc arena cap, malloc_trim(0) from a non-arena thread releases only the main arena's pad — typically 60–200 MiB out of ~700 — and the remainder stays pinned in the per-thread arenas. With MALLOC_ARENA_MAX=2, freed pages concentrate into one of two arenas and the same malloc_trim call returns several hundred MB to the kernel. BGEN reaches a higher transient peak than PLINK because _build_bgen_static calls write_bgen_index which does a full variant scan during the handshake.

End-to-end probe via the real EncoderClient + spawn subprocess (5 short-read passes, child RSS measured at "final idle" after a 2s settle):

| Config | PLINK idle | BGEN idle |
| --- | --- | --- |
| Unpatched | 644.6 MiB | 904.6 MiB |
| MALLOC_ARENA_MAX=2 (env var only) | 425.4 MiB | 841.6 MiB |

MALLOC_ARENA_MAX=2 alone gives a 7–34% reduction. The full ~80% reduction the in-process probe demonstrated only materialises when a malloc_trim(0) call runs in the child after the working set drains. Without that call, even with two arenas, glibc keeps the high-water-mark pages mapped because the top chunk of the arena is not free.
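The percentages above follow directly from the reported RSS figures. The short calculation below reproduces them; the helper name reduction_pct is hypothetical, and reading the "~80%" figure as default-arena idle versus MALLOC_ARENA_MAX=2 after malloc_trim(0) is an assumption about which table cells were compared.

```python
def reduction_pct(before_mib, after_mib):
    """Percentage drop in RSS going from before_mib to after_mib."""
    return 100.0 * (before_mib - after_mib) / before_mib


# End-to-end probe: unpatched vs. MALLOC_ARENA_MAX=2 (env var only).
plink_env_only = reduction_pct(644.6, 425.4)
bgen_env_only = reduction_pct(904.6, 841.6)

# In-process probe: default-arena idle vs. MALLOC_ARENA_MAX=2 after trim
# (assumed pairing of table cells).
plink_with_trim = reduction_pct(666, 141)
bgen_with_trim = reduction_pct(703, 169)

for label, value in [
    ("PLINK env only", plink_env_only),
    ("BGEN env only", bgen_env_only),
    ("PLINK env + trim", plink_with_trim),
    ("BGEN env + trim", bgen_with_trim),
]:
    print(f"{label}: {value:.1f}%")
```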

What is not the cause (ruled out)

  • reader._static_field_cache is always empty during PLINK and BGEN static-file builds. generate_bim / generate_fam / generate_sample access @cached_property fields, not _load_static_field.
  • gc.get_objects() filtered to np.ndarray with nbytes > 1 MiB is empty at idle — no leaked decoded chunks.
  • Live-instance counts (via count_instances) for StreamReader and CachedLogicalVariantsChunk are zero between passes, so there is no leaked iterator state.
  • tracemalloc final-vs-baseline diff totals ~20 MiB of attributable Python allocations, dominated by .bim/.fam bytes and the chunk plan.
  • The @cached_property arrays on VczReader (raw_sample_ids, sample_ids, samples_selection, contig_ids, …) together total ~8 MiB on wide_bench.
  • session.static_files is ~10 MiB on PLINK wide_bench and similar on BGEN — confirmed via len() of each entry.

Together these account for ~30 MiB of live Python heap. The remaining 600–870 MiB of RSS is outside Python's view.
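For readers who want to repeat the instance-count check, a helper of this kind can be built on gc.get_objects(). The version below is only a sketch of how such a count_instances helper might look, not the one used in the investigation. It sees GC-tracked objects only, which covers ordinary class instances.

```python
import gc


def count_instances(cls):
    """Count live GC-tracked objects whose type is cls or a subclass of it."""
    gc.collect()  # drop unreachable cycles first so they are not counted
    # issubclass(type(obj), ...) avoids triggering custom __class__ lookups
    # on proxy objects, which isinstance() would do.
    return sum(1 for obj in gc.get_objects() if issubclass(type(obj), cls))
```

In the reproducer this would be called as count_instances(StreamReader) after each pass; a non-zero value between passes would point at retained iterator state rather than allocator behaviour.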

Why the per-thread arena pattern fits

glibc's arena allocator gives each thread that calls malloc its own arena (up to MALLOC_ARENA_MAX, default 8 × ncores). The readahead pool decodes zarr chunks on worker threads — these are large allocations (a single call_genotype chunk on wide_bench is 5000 variants × 100k samples × 2 ploidy ≈ 1 GiB uncompressed). When the consumer thread frees the resulting numpy array, the free() is routed to the arena the original malloc came from. glibc keeps those pages in the arena's freelist; they are returned to the kernel only when the top chunk of the arena is itself free and trim is invoked. A malloc_trim(0) call from the main thread reaches only the main arena. Hence: many arenas × deep working set per arena × no trim ⇒ several hundred MB of unreachable-but-mapped RSS.

This is well-documented behaviour for multi-threaded Python workloads that allocate large objects across thread boundaries. numpy allocations larger than M_MMAP_THRESHOLD (default 128 KiB) go through mmap directly and are returned cleanly on free, but the chunked decode path also produces many allocations below the mmap threshold, and these dominate the overall footprint.

Proposed follow-up — in order of preference

  1. Set MALLOC_ARENA_MAX=2 for the encoder-server subprocess at spawn. Smallest blast radius. In biofuse/encoder_client.py around ctx.Process(...).start(), mutate os.environ["MALLOC_ARENA_MAX"] for the duration of the spawn call (the env var is read by glibc at child libc init, so the parent's own allocator is unaffected). Respect a user-set override. Pass: confirmed via /proc/<pid>/environ that the value propagates; tests/ green (225 pass). Caveat: with only 2 arenas the 32 worker threads will contend on arena locks; we should validate throughput on the benchmark suite before landing.

  2. Add a ctypes malloc_trim(0) call in the server. The env var alone gives modest gains (7–34% in the end-to-end probe); pairing it with periodic trim is what closes the gap. Natural call sites: at the end of _ServerSession.__init__ after build_static_files (releases handshake-time peaks), and either in serve_forever between accept() calls or at the end of _handle_connection. Cheap (a few microseconds), no API impact. Requires ctypes.CDLL(ctypes.util.find_library("c")).malloc_trim(0). The call must be guarded for non-glibc platforms (BSD, musl), although the server only runs on Linux today.

  3. Lower DEFAULT_READAHEAD_WORKERS from 32. Reduces the arena fragmentation surface in proportion to thread count. The 4-worker run idled at 511 MiB versus 666 MiB at 32 workers, so this is a partial mitigation. Worth measuring against icechunk / S3 backends before landing — high-concurrency I/O is the main reason the default is 32.

  4. LD_PRELOAD jemalloc or tcmalloc in the server. Biggest win, biggest blast radius — defer.
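Options 1 and 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the patch itself: the names arena_cap_for_spawn and trim_malloc are hypothetical, and how they would be wired into encoder_client.py and the server loop is left open.

```python
import contextlib
import ctypes
import ctypes.util
import os


@contextlib.contextmanager
def arena_cap_for_spawn(value="2", name="MALLOC_ARENA_MAX"):
    """Set the env var only while the child is spawned inside the block."""
    if name in os.environ:
        # Respect a user-set override: leave the environment untouched.
        yield os.environ[name]
        return
    os.environ[name] = value
    try:
        yield value
    finally:
        # The child has already copied the environment; restore the parent.
        os.environ.pop(name, None)


def trim_malloc(pad=0):
    """Call glibc's malloc_trim(pad).

    Returns True if memory was released, False if not, and None when
    malloc_trim is unavailable (for example on musl, BSD or macOS).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    try:
        malloc_trim = ctypes.CDLL(libc_name).malloc_trim
    except (OSError, AttributeError):
        return None
    malloc_trim.argtypes = [ctypes.c_size_t]
    malloc_trim.restype = ctypes.c_int
    return bool(malloc_trim(pad))
```

The spawn site would then wrap the start call, for example `with arena_cap_for_spawn(): proc.start()`, and the server would call trim_malloc() at the call sites listed in option 2. Returning None on non-glibc platforms keeps the call safe to leave in place if the server is ever run outside Linux.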
