
Compacting the heap for forking #1443

@wks

Description

Some applications, notably the pitchfork HTTP server for CRuby and the Zygote process on Android, use the pre-fork model: the process initializes the application or framework to a certain state, then forks a child for every request or application. This works around the global VM lock in CRuby and avoids the expensive VM initialization for Android applications.

In #1067, we added prepare_fork and after_fork to make it possible to fork. However, to further reduce the memory footprint, we need to compact the heap.

CRuby provides the Process.warmup method so that the application can call it when "the boot sequence is finished", and a comment in the source code lists what it does. On CRuby, Process.warmup:

  • Performs a major GC.
  • Compacts the heap.
  • Promotes all surviving objects to the old generation.
  • Precomputes the coderange of all strings.
  • Frees all empty heap pages and increments the allocatable pages counter by the number of pages freed.
  • Invokes malloc_trim if available to free empty malloc pages.

@k-sareen forked mmtk-core and added a handle_pre_first_zygote_fork_collection_request method that does the same.

It may be worth adding an API to mmtk-core to compact the heap and reduce the memory footprint.

The API and its semantics

We should have an API function with clear semantics rather than a hint that MMTk can ignore. (FYI, a JVM is allowed to ignore System.gc().)

The semantics can be "compacting the heap to make the memory overhead as small as possible". We can make "friendly to forking" an explicit part of the API so that any VMs that have the need to fork can use this API.

API design

We can reuse the MMTK::handle_user_collection_request method. It currently has two parameters:

  • force: bool: If it is true, the request cannot be ignored. It is currently used to override Options::ignore_system_gc and Collection::is_collection_enabled.
  • exhaustive: bool: When the plan is generational, it will perform full-heap GC instead of nursery GC. It is currently documented as a "hint".

The heap compaction is orthogonal to those arguments. Specifically, an "exhaustive" GC can be full-heap, but is not required to be moving, and Immix-based plans can still do non-moving (non-defrag) full-heap GC. We should add a third argument:

  • compact: bool: When true, the GC will try to compact the heap as much as possible. It should imply both force and exhaustive.

Alternatively, we can introduce an arguments struct with default values so that we won't have to change the signature of MMTK::handle_user_collection_request every time a new parameter is added.

/// Arguments for a user-triggered GC request.
#[derive(Default)] // all fields default to false
struct UserGCArgs {
    force: bool,
    exhaustive: bool,
    compact: bool,
}

impl MMTK {
    fn handle_user_collection_request(&self, args: &UserGCArgs) { ... }
}

Users that construct UserGCArgs from its default value won't need to modify their source code when the API gains new fields.

let args = UserGCArgs { force: true, ..Default::default() };  // No need to change after we introduce `compact`.
mmtk.handle_user_collection_request(&args);
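If compact is meant to imply force and exhaustive, the request handler can normalize the flags once, up front. A minimal sketch (the normalized helper and this standalone UserGCArgs are illustrative, not existing mmtk-core API):

```rust
/// Illustrative stand-in for the proposed arguments struct.
#[derive(Default, Debug)]
struct UserGCArgs {
    force: bool,
    exhaustive: bool,
    compact: bool,
}

impl UserGCArgs {
    /// `compact` implies both `force` and `exhaustive`: a compacting
    /// request cannot be ignored and must be a full-heap GC.
    fn normalized(mut self) -> Self {
        if self.compact {
            self.force = true;
            self.exhaustive = true;
        }
        self
    }
}

fn main() {
    let args = UserGCArgs { compact: true, ..Default::default() }.normalized();
    assert!(args.force && args.exhaustive);
    // A plain `force` request stays as-is:
    let args = UserGCArgs { force: true, ..Default::default() }.normalized();
    assert!(!args.exhaustive && !args.compact);
}
```

Normalizing once keeps the rest of the GC-triggering code from having to check the implication in several places.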

Implementation

Compacting the spaces

Only a moving GC can compact the heap. Therefore, NoGC treats a compaction request as a no-op, and MarkSweep and the non-moving Immix and StickyImmix plans only perform a non-moving full-heap GC when asked to compact.

Compacting is trivial for SemiSpace, GenCopy, MarkCompact and Compressor: their normal full-heap GC already evacuates the entire from-space (and the nursery if present), or compacts the whole space.

It is a bit more complicated for Immix-based plans. Immix needs "headroom" to defragment; by default it is a small fraction (2%) of the currently reserved pages. When the user requests compaction, however, MMTk may go beyond that threshold and use all the memory available to the heap for defragmentation. The simplest solution is to mark every block as a defrag source and (for dynamic heap sizes) raise the heap size to the maximum. This essentially degenerates Immix into SemiSpace. It will be slower than a usual GC, but that is what the user asked for.
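As an illustration only (this is a toy model, not mmtk-core's actual defrag-source policy; Block, BLOCK_BYTES and select_defrag_sources are made up here), the difference between headroom-limited defrag and "mark everything" could look like:

```rust
/// Toy model of defrag-source selection.
#[derive(Clone, Copy)]
struct Block {
    live_bytes: usize,
}

const BLOCK_BYTES: usize = 32 * 1024; // Immix blocks are 32 KiB.

/// Select which blocks to evacuate. With `compact`, every block becomes a
/// defrag source, essentially degenerating Immix into SemiSpace. Otherwise
/// only the emptiest blocks that fit in the 2% headroom are selected.
fn select_defrag_sources(blocks: &[Block], compact: bool) -> Vec<usize> {
    if compact {
        return (0..blocks.len()).collect();
    }
    let headroom = blocks.len() * BLOCK_BYTES / 50; // 2% of reserved bytes
    let mut order: Vec<usize> = (0..blocks.len()).collect();
    order.sort_by_key(|&i| blocks[i].live_bytes); // emptiest blocks first
    let mut budget = headroom;
    let mut sources = Vec::new();
    for i in order {
        if blocks[i].live_bytes > budget {
            break; // evacuating this block would overflow the headroom
        }
        budget -= blocks[i].live_bytes;
        sources.push(i);
    }
    sources
}

fn main() {
    let blocks = vec![Block { live_bytes: 1000 }; 10];
    // Compaction evacuates everything; the default policy stops at the headroom.
    assert_eq!(select_defrag_sources(&blocks, true).len(), 10);
    assert_eq!(select_defrag_sources(&blocks, false).len(), 6);
}
```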

We can collect but not compact the LOS since it is non-moving.

Returning pages to the OS

After compacting, MMTk needs to return free pages/blocks/chunks to the OS. On Linux, we can do this with madvise, using MADV_DONTNEED, or MADV_FREE on Linux 4.5 and later.
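A self-contained sketch of the mechanism on Linux x86-64 (the syscall bindings are declared inline here for brevity; real code would use the libc crate, and the constants differ on some architectures):

```rust
// Map an anonymous region, dirty it, then release the physical pages back
// to the OS with madvise(MADV_DONTNEED). The virtual mapping stays valid;
// the pages are refaulted as zero pages on the next touch.
use std::ffi::c_void;

extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: i32, flags: i32, fd: i32, off: i64) -> *mut c_void;
    fn madvise(addr: *mut c_void, len: usize, advice: i32) -> i32;
    fn munmap(addr: *mut c_void, len: usize) -> i32;
}

const PROT_READ_WRITE: i32 = 0x1 | 0x2; // PROT_READ | PROT_WRITE
const MAP_PRIVATE_ANON: i32 = 0x02 | 0x20; // MAP_PRIVATE | MAP_ANONYMOUS (Linux x86-64)
const MADV_DONTNEED: i32 = 4;

fn main() {
    let len = 1 << 20; // 1 MiB, standing in for a run of free blocks/chunks
    unsafe {
        let addr = mmap(std::ptr::null_mut(), len, PROT_READ_WRITE, MAP_PRIVATE_ANON, -1, 0);
        assert_ne!(addr as isize, -1, "mmap failed");
        std::ptr::write_bytes(addr as *mut u8, 0xAB, len); // dirty the pages
        // Return the physical pages immediately; MADV_FREE (advice 8, Linux
        // >= 4.5) would instead let the kernel reclaim them lazily.
        let rc = madvise(addr, len, MADV_DONTNEED);
        assert_eq!(rc, 0, "madvise failed");
        assert_eq!(*(addr as *const u8), 0); // refaulted as a zero page
        munmap(addr, len);
    }
    println!("released {} bytes back to the OS", len);
}
```

MADV_DONTNEED drops the pages synchronously, which is the desirable behavior right before a fork; MADV_FREE is cheaper but only reclaims under memory pressure.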

Dedicated "ZygoteSpace"

The ZygoteSpace in the mmtk-core fork is designed specifically for this use case. It wraps an ImmixSpace and behaves like a regular ImmixSpace before "warming up" (in CRuby's terminology). After "warming up", the ZygoteSpace no longer releases its objects, so it essentially behaves like an ImmortalSpace.

The advantage of the pre-fork model is that a child process can share most of its memory pages with its parent process due to copy on write (CoW). Modifying (writing to) objects in the ZygoteSpace will trigger copy-on-write, which takes time and increases memory usage. So the application should consider objects created up to the point of "warming up" as immortal. It is up to the application to take advantage of this property, but an explicit ZygoteSpace which enforces the non-moving immortal properties can help the application by preventing certain things (such as moving objects or modifying object headers) from happening.

We may consider porting the ZygoteSpace to the master branch of mmtk-core.

Number of GC worker threads

There may be a correlation between the number of GC threads and the compactness of the resulting heap. I am not certain about the effect, but I am noting it here for further discussion. We can ignore this part in our initial implementation.

  • For evacuating GC, each GC worker holds a private block to evacuate objects into. The more GC workers there are, the more partially filled blocks there are, and the heap may end up more fragmented due to unused space at the ends of blocks. But even a single-threaded GC can leave unused space at the end of a block if an object is larger than the remaining space.
  • In Compressor, multi-threaded compaction divides the heap into ranges and compacts each range separately, leaving unused space at the end of each range. Making it single-threaded reduces this to a single range, similar to MarkCompact. But since a range is much larger than a page, the benefit of single-threaded compaction for page-grained compactness may not be as high as expected.

Related topics

Returning pages to the OS and heap compaction

Returning pages to the OS and compacting the heap are two orthogonal operations. We can return pages to the OS during regular, non-compacting collections, too. It just happens to be especially helpful to return pages to the OS before forking.

We may use options, parameters or heuristics to control whether mmtk-core returns pages to the OS.

  • We may add an option Options::eager_return_pages so that the GC will return completely free pages/blocks/chunks to the OS during every GC. We may further control whether we only return pages during major GC, or during all GCs.
  • We may add another field to UserGCArgs: return_pages: Option<bool>, which defaults to None (use the default behavior).
  • We may let the GCTrigger decide whether to return pages. For example, if the used pages suddenly drop significantly, it may indicate that the application has gone past its peak of memory usage, and it may be a good chance to return pages to the OS.

We may also add a dedicated API function that triggers page returning without a GC. It should be faster than an actual GC because it doesn't need to stop mutators; it only needs to walk some bitmaps (or free lists) and call madvise on the relevant memory ranges.
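A toy sketch of that scan, assuming a page-grained free bitmap (the real metadata layout in mmtk-core differs; free_ranges is a made-up helper):

```rust
/// Given a page-grained "free" bitmap, compute the contiguous free ranges
/// (start page, page count) that a page-returning pass would hand to
/// madvise. No GC and no stopping of mutators, just a scan over metadata.
fn free_ranges(free: &[bool]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = None;
    for (i, &is_free) in free.iter().enumerate() {
        match (is_free, start) {
            (true, None) => start = Some(i),       // a free run begins
            (false, Some(s)) => {                  // a free run ends
                ranges.push((s, i - s));
                start = None;
            }
            _ => {}                                // run continues
        }
    }
    if let Some(s) = start {
        ranges.push((s, free.len() - s));          // run reaches the end
    }
    ranges
}

fn main() {
    let bitmap = [true, true, false, true, true, true, false];
    assert_eq!(free_ranges(&bitmap), vec![(0, 2), (3, 3)]);
}
```

Each returned range would then be converted to a byte range and passed to a single madvise call, so coalescing adjacent free pages also keeps the number of syscalls small.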
