
Compacting the heap for forking #1443

@wks

Description

Some applications, notably the pitchfork HTTP server for CRuby and the Zygote process on Android, use the pre-fork model: the process initializes the application or framework to a certain state, then forks a child for every request or application. This works around the global VM lock in CRuby and avoids the expensive VM initialization for Android applications.

In #1067, we added prepare_fork and after_fork to make it possible to fork. However, to further reduce the memory footprint, we need to compact the heap.

CRuby provides the Process.warmup method so that the application can call it when "the boot sequence is finished", and a comment in the source code lists what it does. On CRuby, Process.warmup:

  • Performs a major GC.
  • Compacts the heap.
  • Promotes all surviving objects to the old generation.
  • Precomputes the coderange of all strings.
  • Frees all empty heap pages and increments the allocatable pages counter by the number of pages freed.
  • Invokes malloc_trim if available to free empty malloc pages.

@k-sareen forked mmtk-core and added a handle_pre_first_zygote_fork_collection_request method that does the same.

It may be worth adding an API to mmtk-core to compact the heap and reduce the memory footprint.

The API and its semantics

We should have an API function with clear semantics rather than a hint that MMTk can ignore. (FYI, a JVM is allowed to ignore System.gc().)

The semantics can be "compacting the heap to make the memory overhead as small as possible". We can make "friendly to forking" an explicit part of the API so that any VMs that have the need to fork can use this API.

API design

We can reuse the MMTK::handle_user_collection_request method. It currently has two parameters:

  • force: bool: If it is true, the request cannot be ignored. It is currently used to override Options::ignore_system_gc and Collection::is_collection_enabled.
  • exhaustive: bool: When the plan is generational, it will perform full-heap GC instead of nursery GC. It is currently documented as a "hint".

The heap compaction is orthogonal to those arguments. Specifically, an "exhaustive" GC can be full-heap, but is not required to be moving, and Immix-based plans can still do non-moving (non-defrag) full-heap GC. We should add a third argument:

  • compact: bool: When true, the GC will try to compact the heap as much as possible. It should imply both force and exhaustive.

Alternatively, we can introduce an arguments struct with default values so that we won't have to change the signature of MMTK::handle_user_collection_request every time a new parameter is added.

/// Arguments for a user-triggered GC request.
#[derive(Default)] // all fields default to false
struct UserGCArgs {
    force: bool,
    exhaustive: bool,
    compact: bool,
}

impl MMTK {
    fn handle_user_collection_request(&self, args: &UserGCArgs) { ... }
}

Users that construct UserGCArgs from its default value won't need to modify their source code when the API gains new fields.

let args = UserGCArgs { force: true, ..Default::default() };  // No need to change after we introduce `compact`.
mmtk.handle_user_collection_request(&args);
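If compact is meant to imply force and exhaustive, the request handler can normalize the flags once, up front. A minimal sketch (the normalized helper and this standalone UserGCArgs are illustrative, not existing mmtk-core API):

```rust
/// Illustrative stand-in for the proposed arguments struct.
#[derive(Default, Debug)]
struct UserGCArgs {
    force: bool,
    exhaustive: bool,
    compact: bool,
}

impl UserGCArgs {
    /// `compact` implies both `force` and `exhaustive`: a compacting
    /// request cannot be ignored and must be a full-heap GC.
    fn normalized(mut self) -> Self {
        if self.compact {
            self.force = true;
            self.exhaustive = true;
        }
        self
    }
}

fn main() {
    let args = UserGCArgs { compact: true, ..Default::default() }.normalized();
    assert!(args.force && args.exhaustive);
    // A plain `force` request stays as-is:
    let args = UserGCArgs { force: true, ..Default::default() }.normalized();
    assert!(!args.exhaustive && !args.compact);
}
```

Normalizing once keeps the rest of the GC-triggering code from having to check the implication in several places.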

Implementation

Compacting the spaces

Only a moving GC can compact the heap. Therefore, NoGC treats a compaction request as a no-op, and MarkSweep and the non-moving Immix and StickyImmix plans only perform a non-moving full-heap GC when asked to compact.

Compacting is trivial for SemiSpace, GenCopy, MarkCompact and Compressor: their normal full-heap GC already evacuates the entire from-space (and the nursery if present), or compacts the whole space.

It is a bit more complicated for Immix-based plans. Immix needs "headroom" to defragment; by default it is a small fraction (2%) of the currently reserved pages. When the user requests compaction, however, MMTk may go beyond that threshold and use all the memory available to the heap for defragmentation. The simplest solution is to mark every block as a defrag source and (for dynamic heap sizes) raise the heap size to the maximum. This essentially degenerates Immix into SemiSpace. It will be slower than a usual GC, but that is what the user asked for.
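As an illustration only (this is a toy model, not mmtk-core's actual defrag-source policy; Block, BLOCK_BYTES and select_defrag_sources are made up here), the difference between headroom-limited defrag and "mark everything" could look like:

```rust
/// Toy model of defrag-source selection.
#[derive(Clone, Copy)]
struct Block {
    live_bytes: usize,
}

const BLOCK_BYTES: usize = 32 * 1024; // Immix blocks are 32 KiB.

/// Select which blocks to evacuate. With `compact`, every block becomes a
/// defrag source, essentially degenerating Immix into SemiSpace. Otherwise
/// only the emptiest blocks that fit in the 2% headroom are selected.
fn select_defrag_sources(blocks: &[Block], compact: bool) -> Vec<usize> {
    if compact {
        return (0..blocks.len()).collect();
    }
    let headroom = blocks.len() * BLOCK_BYTES / 50; // 2% of reserved bytes
    let mut order: Vec<usize> = (0..blocks.len()).collect();
    order.sort_by_key(|&i| blocks[i].live_bytes); // emptiest blocks first
    let mut budget = headroom;
    let mut sources = Vec::new();
    for i in order {
        if blocks[i].live_bytes > budget {
            break; // evacuating this block would overflow the headroom
        }
        budget -= blocks[i].live_bytes;
        sources.push(i);
    }
    sources
}

fn main() {
    let blocks = vec![Block { live_bytes: 1000 }; 10];
    // Compaction evacuates everything; the default policy stops at the headroom.
    assert_eq!(select_defrag_sources(&blocks, true).len(), 10);
    assert_eq!(select_defrag_sources(&blocks, false).len(), 6);
}
```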

We can collect but not compact the LOS since it is non-moving.

Returning pages to the OS

After compacting, MMTk needs to return free pages/blocks/chunks to the OS. On Linux, we can do this with madvise, using MADV_DONTNEED, or MADV_FREE on Linux 4.5 and later.
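A self-contained sketch of the mechanism on Linux x86-64 (the syscall bindings are declared inline here for brevity; real code would use the libc crate, and the constants differ on some architectures):

```rust
// Map an anonymous region, dirty it, then release the physical pages back
// to the OS with madvise(MADV_DONTNEED). The virtual mapping stays valid;
// the pages are refaulted as zero pages on the next touch.
use std::ffi::c_void;

extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: i32, flags: i32, fd: i32, off: i64) -> *mut c_void;
    fn madvise(addr: *mut c_void, len: usize, advice: i32) -> i32;
    fn munmap(addr: *mut c_void, len: usize) -> i32;
}

const PROT_READ_WRITE: i32 = 0x1 | 0x2; // PROT_READ | PROT_WRITE
const MAP_PRIVATE_ANON: i32 = 0x02 | 0x20; // MAP_PRIVATE | MAP_ANONYMOUS (Linux x86-64)
const MADV_DONTNEED: i32 = 4;

fn main() {
    let len = 1 << 20; // 1 MiB, standing in for a run of free blocks/chunks
    unsafe {
        let addr = mmap(std::ptr::null_mut(), len, PROT_READ_WRITE, MAP_PRIVATE_ANON, -1, 0);
        assert_ne!(addr as isize, -1, "mmap failed");
        std::ptr::write_bytes(addr as *mut u8, 0xAB, len); // dirty the pages
        // Return the physical pages immediately; MADV_FREE (advice 8, Linux
        // >= 4.5) would instead let the kernel reclaim them lazily.
        let rc = madvise(addr, len, MADV_DONTNEED);
        assert_eq!(rc, 0, "madvise failed");
        assert_eq!(*(addr as *const u8), 0); // refaulted as a zero page
        munmap(addr, len);
    }
    println!("released {} bytes back to the OS", len);
}
```

MADV_DONTNEED drops the pages synchronously, which is the desirable behavior right before a fork; MADV_FREE is cheaper but only reclaims under memory pressure.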

Dedicated "ZygoteSpace"

The ZygoteSpace in the mmtk-core fork is designed specifically for this use case. It wraps an ImmixSpace and behaves like a regular ImmixSpace before "warming up" (in CRuby's terminology). After "warming up", the ZygoteSpace no longer releases its objects, so it essentially behaves like an ImmortalSpace.

The advantage of the pre-fork model is that a child process can share most of its memory pages with its parent process due to copy on write (CoW). Modifying (writing to) objects in the ZygoteSpace will trigger copy-on-write, which takes time and increases memory usage. So the application should consider objects created up to the point of "warming up" as immortal. It is up to the application to take advantage of this property, but an explicit ZygoteSpace which enforces the non-moving immortal properties can help the application by preventing certain things (such as moving objects or modifying object headers) from happening.

We may consider porting the ZygoteSpace to the master branch of mmtk-core.

Number of GC worker threads

There may be a correlation between the number of GC threads and the compactness of the resulting heap. I am not certain about the effect, but I am noting it here for further discussion. We can ignore this part in our initial implementation.

  • For evacuating GC, each GC worker holds a private block to evacuate objects into. The more GC workers there are, the more partially filled blocks there are, and the heap may end up more fragmented due to unused space at the ends of blocks. But even a single-threaded GC can leave unused space at the end of a block if an object is larger than the remaining space.
  • In Compressor, multi-threaded compaction divides the heap into ranges and compacts each range separately, leaving unused space at the end of each range. Making it single-threaded reduces this to a single range, similar to MarkCompact. But since a range is much larger than a page, the benefit of single-threaded compaction for page-grained compactness may not be as high as expected.

Related topics

Returning pages to the OS and heap compaction

Returning pages to the OS and compacting the heap are two orthogonal operations. We can return pages to the OS during regular, non-compacting collections, too. It just happens to be especially helpful to return pages to the OS before forking.

We may use options, parameters or heuristics to control whether mmtk-core returns pages to the OS.

  • We may add an option Options::eager_return_pages so that the GC will return completely free pages/blocks/chunks to the OS during every GC. We may further control whether we only return pages during major GC, or during all GCs.
  • We may add another field to UserGCArgs: return_pages: Option<bool>, which defaults to None (use the default behavior).
  • We may let the GCTrigger decide whether to return pages. For example, if the used pages suddenly drop significantly, it may indicate that the application has gone past its peak of memory usage, and it may be a good chance to return pages to the OS.

We may also add a dedicated API function that triggers page returning without a GC. It should be faster than an actual GC because it doesn't need to stop mutators; it only needs to walk some bitmaps (or free lists) and call madvise on the relevant memory ranges.
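A toy sketch of that scan, assuming a page-grained free bitmap (the real metadata layout in mmtk-core differs; free_ranges is a made-up helper):

```rust
/// Given a page-grained "free" bitmap, compute the contiguous free ranges
/// (start page, page count) that a page-returning pass would hand to
/// madvise. No GC and no stopping of mutators, just a scan over metadata.
fn free_ranges(free: &[bool]) -> Vec<(usize, usize)> {
    let mut ranges = Vec::new();
    let mut start = None;
    for (i, &is_free) in free.iter().enumerate() {
        match (is_free, start) {
            (true, None) => start = Some(i),       // a free run begins
            (false, Some(s)) => {                  // a free run ends
                ranges.push((s, i - s));
                start = None;
            }
            _ => {}                                // run continues
        }
    }
    if let Some(s) = start {
        ranges.push((s, free.len() - s));          // run reaches the end
    }
    ranges
}

fn main() {
    let bitmap = [true, true, false, true, true, true, false];
    assert_eq!(free_ranges(&bitmap), vec![(0, 2), (3, 3)]);
}
```

Each returned range would then be converted to a byte range and passed to a single madvise call, so coalescing adjacent free pages also keeps the number of syscalls small.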
