Running biofuse against the wide_bench VCZ store (100k samples × 496k variants, default 32-worker readahead), the encoder-server subprocess holds ~900 MB resident memory at idle, i.e. after several open/read/close cycles have completed and no streaming file handles remain. The expected idle footprint is well under 100 MB — a VczReader with a handful of cached metadata arrays, plus the precomputed static-sidecar bytes and an idle ThreadPoolExecutor.
Conclusion (short)
The retention is not in Python objects. It is glibc per-thread malloc arena fragmentation. The 32-worker readahead pool inside VczReader allocates large zarr-decode buffers across many threads; freed pages stay pinned in per-thread arenas indefinitely. pympler.asizeof puts the entire live Python graph at ~30 MiB and gc.get_objects() reports zero live numpy arrays larger than 1 MiB, yet RSS sits at hundreds of MB. MALLOC_ARENA_MAX=2 plus an explicit malloc_trim(0) returns RSS to within ~50 MB of the expected idle budget.
How the investigation was run
The reproducer is in-process: spin up biofuse.encoder_server._ServerSession directly (no FUSE, no socket, no subprocess), then loop 5× constructing an encoder via spec.encoder_factory(reader), issuing a short (16 MiB) encoder.read(0, …) to drive the StreamReader pipeline, dropping the encoder, and gc.collect(). At each lifecycle point we record psutil.Process().memory_info().rss, take a tracemalloc snapshot, asizeof the VczReader + _ServerSession, walk gc.get_objects() for live numpy arrays larger than 1 MiB, and count live StreamReader / CachedLogicalVariantsChunk instances. A second probe drives the real EncoderClient over its AF_UNIX socket and reads the child's RSS via /proc/<pid>/status. Both probes were run with default arena count and with MALLOC_ARENA_MAX=2 for comparison.
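The two cheapest measurements in that loop, current RSS and a live-instance count, can be sketched with the standard library alone. This is an illustration, not the probe that produced the numbers below: it reads `/proc/self/status` instead of calling psutil, and both `rss_mib` and this implementation of `count_instances` are assumptions about how such helpers might look.

```python
import gc


def rss_mib():
    """Return the current resident set size in MiB, or None off Linux."""
    try:
        with open("/proc/self/status") as fh:
            for line in fh:
                if line.startswith("VmRSS:"):
                    # The line looks like "VmRSS:     12345 kB".
                    return int(line.split()[1]) / 1024
    except OSError:
        return None
    return None


def count_instances(cls):
    """Count live, gc-tracked objects whose type is cls or a subclass."""
    return sum(1 for obj in gc.get_objects() if issubclass(type(obj), cls))


if __name__ == "__main__":
    class Stand_in:  # hypothetical stand-in for StreamReader
        pass

    held = [Stand_in() for _ in range(3)]
    print("RSS MiB:", rss_mib(), "live instances:", count_instances(Stand_in))
```

Only objects tracked by the garbage collector appear in `gc.get_objects()`, so a zero count from a walk like this is evidence only for types that the collector tracks.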
Numbers
In-process probe, PLINK and BGEN on wide_bench. "Idle" is post-pass-5 plus gc.collect() × 2; the "After malloc_trim(0)" column is measured after a single ctypes call to malloc_trim(0) from the main thread.
| Config | Post-import | Post-session | Post-pass-1 | Idle | After malloc_trim(0) |
|---|---|---|---|---|---|
| PLINK, 32 workers, default arenas | 83 | 203 | 528 | 666 | 583 |
| PLINK, 32 workers, MALLOC_ARENA_MAX=2 | 83 | 191 | 206 | 461 | 141 |
| PLINK, 4 workers, default arenas | 83 | 191 | 436 | 511 | 433 |
| BGEN, 32 workers, default arenas | 83 | 184 | 449 | 703 | 503 |
| BGEN, 32 workers, MALLOC_ARENA_MAX=2 | 83 | 210 | 196 | 902 | 169 |
All values are MiB. The pattern is identical for both formats: with the default 8×ncores glibc arena cap, malloc_trim(0) from a non-arena thread releases only the main arena's pad — typically 60–200 MiB out of ~700 — and the remainder stays pinned in the per-thread arenas. With MALLOC_ARENA_MAX=2, freed pages concentrate into one of two arenas and the same malloc_trim call returns several hundred MB to the kernel. BGEN reaches a higher transient peak than PLINK because _build_bgen_static calls write_bgen_index which does a full variant scan during the handshake.
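The amount each malloc_trim(0) call recovered can be recomputed from the idle and post-trim columns of the in-process table. A minimal sketch; `trim_recovery` is a hypothetical helper name and the figures are copied from the table.

```python
# (config, idle MiB, after-trim MiB), copied from the in-process table.
ROWS = [
    ("PLINK, 32 workers, default arenas", 666, 583),
    ("PLINK, 32 workers, MALLOC_ARENA_MAX=2", 461, 141),
    ("PLINK, 4 workers, default arenas", 511, 433),
    ("BGEN, 32 workers, default arenas", 703, 503),
    ("BGEN, 32 workers, MALLOC_ARENA_MAX=2", 902, 169),
]


def trim_recovery(idle, after_trim):
    """Return (MiB freed by the trim, freed share of idle RSS in percent)."""
    freed = idle - after_trim
    return freed, round(100 * freed / idle)


for config, idle, after_trim in ROWS:
    freed, pct = trim_recovery(idle, after_trim)
    print(f"{config}: freed {freed} MiB ({pct}% of idle RSS)")
```

The three default-arena rows free 78 to 200 MiB, while the two MALLOC_ARENA_MAX=2 rows free 320 and 733 MiB.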
End-to-end probe via the real EncoderClient + spawn subprocess (5 short-read passes, child RSS measured at "final idle" after a 2s settle):
| Config | PLINK idle | BGEN idle |
|---|---|---|
| Unpatched | 644.6 MiB | 904.6 MiB |
| MALLOC_ARENA_MAX=2 (env var only) | 425.4 MiB | 841.6 MiB |
MALLOC_ARENA_MAX=2 alone gives a 7–34% reduction. The full ~80% reduction the in-process probe demonstrated only materialises when a malloc_trim(0) call runs in the child after the working set drains. Without that call, even with two arenas, glibc keeps the high-water-mark pages mapped because the top chunk of the arena is not free.
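The 7–34% range follows directly from the end-to-end idle figures. A short check, with `reduction_pct` as a hypothetical helper name:

```python
def reduction_pct(before_mib, after_mib):
    """Relative drop in idle RSS, in percent."""
    return 100 * (before_mib - after_mib) / before_mib


# Idle RSS in MiB: unpatched versus MALLOC_ARENA_MAX=2 (env var only).
plink = reduction_pct(644.6, 425.4)
bgen = reduction_pct(904.6, 841.6)
print(f"PLINK: {plink:.1f}%  BGEN: {bgen:.1f}%")
```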
What is not the cause (ruled out)
- reader._static_field_cache is always empty during PLINK and BGEN static-file builds. generate_bim / generate_fam / generate_sample access @cached_property fields, not _load_static_field.
- gc.get_objects() filtered to np.ndarray with nbytes > 1 MiB is empty at idle: no leaked decoded chunks.
- count_instances(StreamReader) and count_instances(CachedLogicalVariantsChunk) are both zero between passes: no leaked iterator state.
- tracemalloc final-vs-baseline diff totals ~20 MiB of attributable Python allocations, dominated by .bim/.fam bytes and the chunk plan.
- The @cached_property arrays on VczReader (raw_sample_ids, sample_ids, samples_selection, contig_ids, …) together total ~8 MiB on wide_bench.
- session.static_files is ~10 MiB on PLINK wide_bench and similar on BGEN, confirmed via len() of each entry.
Together these account for ~30 MiB of live Python heap. The remaining 600–870 MiB of RSS is outside Python's view.
Why the per-thread arena pattern fits
glibc's arena allocator gives each thread that calls malloc its own arena (up to MALLOC_ARENA_MAX, default 8 × ncores). The readahead pool decodes zarr chunks on worker threads — these are large allocations (a single call_genotype chunk on wide_bench is 5000 variants × 100k samples × 2 ploidy ≈ 1 GiB uncompressed). When the consumer thread frees the resulting numpy array, the free() is routed to the arena the original malloc came from. glibc keeps those pages in the arena's freelist; they are returned to the kernel only when the top chunk of the arena is itself free and trim is invoked. A malloc_trim(0) call from the main thread reaches only the main arena. Hence: many arenas × deep working set per arena × no trim ⇒ several hundred MB of unreachable-but-mapped RSS.
This is well-documented behaviour on multi-threaded Python workloads that allocate large objects across thread boundaries. numpy allocations larger than M_MMAP_THRESHOLD (default 128 KiB) go through mmap directly and are returned cleanly on free, but the chunked decode path produces many sub-mmap-threshold allocations and the issue dominates the macro footprint.
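The sizes involved can be made concrete with a little arithmetic. The sketch below assumes one byte per genotype call (an assumption about the dtype, not something stated above) and uses glibc's initial mmap threshold; glibc may raise that threshold dynamically at run time, so the classification is only a first approximation. All function names are hypothetical.

```python
import os

# glibc's initial M_MMAP_THRESHOLD in bytes; it can grow dynamically.
MMAP_THRESHOLD = 128 * 1024


def default_arena_cap(ncores=None):
    """Default arena cap on 64-bit glibc: 8 arenas per core."""
    ncores = ncores or os.cpu_count() or 1
    return 8 * ncores


def chunk_bytes(variants, samples, ploidy, itemsize=1):
    """Uncompressed size of one genotype chunk (itemsize is an assumption)."""
    return variants * samples * ploidy * itemsize


def served_by_mmap(nbytes, threshold=MMAP_THRESHOLD):
    """True if a request this size is at or above the mmap threshold."""
    return nbytes >= threshold


if __name__ == "__main__":
    size = chunk_bytes(5000, 100_000, 2)
    print(f"chunk: {size / 2**30:.2f} GiB, mmap-sized: {served_by_mmap(size)}")
    print(f"64 KiB buffer mmap-sized: {served_by_mmap(64 * 1024)}")
    print(f"arena cap on this machine: {default_arena_cap()}")
```

Under these assumptions the full decoded chunk is far above the threshold, so the arena-resident memory has to come from the smaller intermediate buffers, which is consistent with the explanation above.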
Proposed follow-up — in order of preference
- Set MALLOC_ARENA_MAX=2 for the encoder-server subprocess at spawn. Smallest blast radius. In biofuse/encoder_client.py around ctx.Process(...).start(), mutate os.environ["MALLOC_ARENA_MAX"] for the duration of the spawn call (the env var is read by glibc at child libc init, so the parent's own allocator is unaffected). Respect a user-set override. Pass: confirmed via /proc/<pid>/environ that the value propagates; tests/ green (225 pass). Caveat: with only 2 arenas the 32 worker threads will contend on arena locks; we should validate throughput on the benchmark suite before landing.
- Add a ctypes malloc_trim(0) call in the server. The env var alone gives modest gains (7–34% in the end-to-end probe); pairing it with periodic trim is what closes the gap. Natural call sites: at the end of _ServerSession.__init__ after build_static_files (releases handshake-time peaks), and either in serve_forever between accept() calls or at the end of _handle_connection. Cheap (a few microseconds), no API impact. Requires ctypes.CDLL(ctypes.util.find_library("c")).malloc_trim(0); this must be guarded on non-glibc platforms (BSD, musl), although the server only runs on Linux today.
- Lower DEFAULT_READAHEAD_WORKERS from 32. Reduces the arena fragmentation surface in proportion to thread count. The 4-worker run idled at 511 MiB versus 666 MiB at 32 workers, so this is a partial mitigation. Worth measuring against icechunk / S3 backends before landing, since high-concurrency I/O is the main reason the default is 32.
- LD_PRELOAD jemalloc or tcmalloc in the server. Biggest win, biggest blast radius; defer.
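The first two proposals could look roughly like the sketch below. It is not the code in biofuse/encoder_client.py or the server: `arena_cap_for_spawn` and `trim_heap` are hypothetical names, and the sketch assumes the child is started with a spawn-style method that inherits the parent's environment at start time.

```python
import contextlib
import ctypes
import ctypes.util
import os


@contextlib.contextmanager
def arena_cap_for_spawn(value="2", name="MALLOC_ARENA_MAX"):
    """Set the env var only while a child process is being started.

    A value the user already exported wins, so nothing is changed then.
    """
    if name in os.environ:
        yield os.environ[name]
        return
    os.environ[name] = value
    try:
        yield value
    finally:
        # Restore the parent's environment once the child has started.
        os.environ.pop(name, None)


def trim_heap():
    """Ask glibc to release free heap pages; return False if unsupported."""
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return False
    try:
        trim = ctypes.CDLL(libc_name).malloc_trim
    except (OSError, AttributeError):
        # Non-glibc libc (for example BSD or musl): no malloc_trim symbol.
        return False
    trim.argtypes = [ctypes.c_size_t]
    trim.restype = ctypes.c_int
    trim(0)
    return True


if __name__ == "__main__":
    with arena_cap_for_spawn() as cap:
        # A real caller would run ctx.Process(...).start() here.
        print("child would see MALLOC_ARENA_MAX =", cap)
    print("trim supported:", trim_heap())
```

Mutating os.environ is process-wide, so the context manager is only safe if no other thread spawns a child at the same moment; passing an explicit environment to the child, where the spawning API allows it, would avoid that window.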