
[Request]: Cross-runtime performance regression suite as part of release validation #729

@ylluminate

Feature or enhancement request details

Summary

I propose that apple/containerization adopt a public, cross-runtime performance benchmark suite as part of its regression and release validation process - measuring volume I/O, container-to-container networking, startup latency, port-mapped TTFB, small-file throughput, and image build performance against the runtimes Apple Silicon developers use day-to-day (Colima, Docker Desktop, OrbStack).

@zot24 has published a reproducible suite covering these dimensions, with runs on both Sequoia (15.7) and Tahoe (26.4).

Why this matters for apple/containerization

Apple Container is the only runtime in the comparison built on a 1-VM-per-container isolation model. That's an architectural choice with real upside - a full kernel-level boundary per workload, the strongest isolation of the four - but it also means the project sits on a different performance curve than the shared-VM runtimes, and developers picking a runtime for daily use will compare them directly anyway.

Public, repeatable performance numbers tracked release-over-release would:

  1. Make it visible when an OS-level change (Virtualization.framework, vmnet, kernel I/O paths) moves the needle for Apple Container.
  2. Give the team early-warning regression signal before users notice.
  3. Document wins (e.g., volume I/O is already competitive on writes) so the picture isn't dominated by the axes that need the most work.
  4. Establish Apple as the runtime that publishes its own numbers - which, given the hardware advantage on Apple Silicon, is a strong long-term position.

Snapshot of current measurements (Tahoe 26.4.1, M3, 16GB)

From @zot24's published results - 4 runs × 3 iterations per metric, per-container limits enforced at --cpus 4 --memory 4g:

| Test | Colima | Docker Desktop | OrbStack | Apple Container |
|---|---|---|---|---|
| Container Startup | 0.291s | 0.329s | 0.371s | 0.935s |
| Volume Write (256MB seq.) | 1,306 MB/s | 1,327 MB/s | 1,566 MB/s | 1,280 MB/s |
| Volume Read (256MB seq.) | 3,354 MB/s | 2,970 MB/s | 10,061 MB/s | 2,765 MB/s |
| Small Files (1000 × 4KB) | 0.761s | 0.711s | 0.620s | 1.257s |
| HTTP Fetch (curl) | 0.830s | 0.716s | 0.774s | 0.771s |
| C2C Throughput (iperf3) | 110.3 Gbps | 119.1 Gbps | 130.2 Gbps | 23.0 Gbps |
| Port-mapped TTFB (nginx) | 1.56ms | 2.69ms | 2.18ms | 3.24ms |
| Image Build (Dockerfile) | 6.81s | 4.98s | 5.28s | n/a* |

* Build skipped because the builder VM has no outbound network access, so apk add (and any other package installer) fails.
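
For a sense of scale, each metric in such a suite is only a few lines of shell. Below is a minimal sketch of the startup-latency measurement - the docker context names and the container CLI invocation are assumptions for illustration; @zot24's benchmark.sh (which also enforces the --cpus 4 --memory 4g limits and filters outliers) remains the authoritative implementation.

```bash
#!/usr/bin/env bash
# Minimal sketch of one metric: container startup latency per runtime.
# Assumes docker contexts named "colima", "desktop-linux" and "orbstack",
# that Apple Container exposes the `container` CLI, and that $IMAGE is already
# pulled on every runtime so image download is excluded from the timing.
set -euo pipefail

ITERATIONS=3
IMAGE=alpine:3.20
TIMEFORMAT='%R'   # make bash's `time` keyword print wall-clock seconds only

measure() {
  local label="$1"; shift
  for i in $(seq 1 "$ITERATIONS"); do
    local elapsed
    elapsed=$( { time "$@" run --rm "$IMAGE" true >/dev/null 2>&1; } 2>&1 )
    printf '%s,startup,run%d,%ss\n' "$label" "$i" "$elapsed"
  done
}

measure colima          docker --context colima
measure docker-desktop  docker --context desktop-linux
measure orbstack        docker --context orbstack
measure apple-container container
```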

What's strong

  • Volume write throughput (1,280 MB/s) is within 4% of Docker Desktop and 2% of Colima - Apple Container is in the same class as the shared-VM runtimes on sequential write.
  • HTTP fetch (0.771s) is second-fastest in the set - effectively tied with OrbStack (0.774s) and ahead of Colima (0.830s).
  • The 1-VM-per-container isolation model has no peer in the comparison - the right architectural foundation for security-sensitive and multi-tenant workloads, and worth keeping prominent in messaging.
  • Tahoe-era Virtualization.framework gains lifted volume-write numbers across the three runtimes that ran on both releases (+61% to +104%). The underlying platform work is paying off across the ecosystem.

Where the gaps are, and likely root causes

These are presented as observations from the benchmark plus a working hypothesis for each, not as criticism - the goal is to make the gaps measurable so progress is visible.

  1. Container startup: 0.935s vs 0.291s (Colima) - ~3.2x slower.
    Hypothesis: full VM kernel boot per container, vs. namespace creation inside an already-running VM. The architectural choice is deliberate, but the absolute number is something users feel on every container run and every test invocation. Worth tracking whether kernel image size, paravirt boot path, or VM warm-pooling could close some of that gap without compromising the isolation boundary.

  2. C2C throughput: 23.0 Gbps vs 130.2 Gbps (OrbStack) - ~5.7x slower.
    Per the benchmark notes, this is attributed to per-container vmnet networking; the shared-VM runtimes have an in-VM bridge between containers. This is the largest single gap and probably the biggest pain point for multi-container workloads (databases + app servers, microservice fan-out, build pipelines). Would benefit from a public roadmap item: shared overlay, vsock-based fast path, or accelerated inter-VM transport. (A minimal sketch of this measurement follows the list.)

  3. Image build: blocked entirely - the builder VM cannot reach the internet, so any Dockerfile that pulls dependencies (apk add, apt-get, pip install, npm install, etc.) fails. This is the single biggest blocker to daily-driver adoption today, because almost every real-world Dockerfile pulls something. Even an opt-in network mode for the builder would unblock most users.

  4. Small-file write: 1.257s vs 0.620s (OrbStack) - ~2x slower. Likely a shared-volume metadata path effect; relevant for node_modules, Python venvs, and Git operations inside containers.

  5. Port-mapped TTFB: 3.24ms vs 1.56ms (Colima) - ~2x slower. Relevant for local web development workflows where the user is hitting the container from the host browser repeatedly.
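
To keep the C2C gap from item 2 measurable release-over-release, the core of that measurement is small. A minimal sketch for one Docker-compatible runtime follows; the context and image names are illustrative, and for Apple Container the same pair of iperf3 containers would be launched through its own CLI with its own IP-discovery step - which is exactly the part the published suite normalizes across runtimes.

```bash
#!/usr/bin/env bash
# Minimal sketch of the container-to-container (C2C) throughput measurement,
# shown for one Docker-compatible runtime. The server-IP discovery step is the
# part that differs per runtime/network model.
set -euo pipefail

DOCKER="docker --context orbstack"   # swap the context to test another runtime

# Start an iperf3 server container on the runtime's default bridge network.
$DOCKER run -d --rm --name iperf-server networkstatic/iperf3 -s >/dev/null
sleep 2   # give the server a moment to start listening

# Resolve the server container's IP as seen by other containers.
SERVER_IP=$($DOCKER inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' iperf-server)

# Run the client for 10 seconds; the summary line reports the bitrate.
$DOCKER run --rm networkstatic/iperf3 -c "$SERVER_IP" -t 10

$DOCKER stop iperf-server >/dev/null
```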

The actual request

Incorporate a benchmark suite of this shape into the project's regression and release process. Concretely:

  1. Adopt or fork the existing suite. @zot24's benchmark.sh + aggregate.py + charts.py is MIT-licensed and runs all four runtimes on the same machine in the same session, with outlier filtering and per-container limits. Forking it (with attribution) gives the project a turnkey baseline.

  2. Track the eight metrics above as release gates / dashboards. Publish per-release numbers under apple/containerization so users see progress, and so regressions surface immediately rather than being discovered by users in the field. (A sketch of a per-release results artifact follows this list.)

  3. Run the suite cross-OS (e.g., macOS 26.x point releases, future major versions). The Sequoia → Tahoe deltas in @zot24's data show how much OS-level changes matter; an Apple-side baseline means Apple sees its own platform's effect on its own runtime first.

  4. Publish the methodology and raw results. Even if Apple's internal benchmarks are more comprehensive, a public reproducible subset gives the community a shared reference and removes ambiguity in third-party comparisons. Publishing the project's own numbers establishes leadership; otherwise the framing gets set by whoever else publishes first.

  5. Optionally extend the suite with metrics specific to Apple Container's strengths - VM-boundary isolation tests, attack-surface measurements, cold-start-from-snapshot if/when implemented - so the comparison highlights where the architectural choice pays off.
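
As a concrete example of the per-release artifact mentioned in item 2, the published numbers could start as one small JSON file per release tag. The sketch below is hypothetical - the path, schema, and field names are not an existing format, and the metric values are placeholders copied from the Apple Container column of the snapshot above.

```bash
#!/usr/bin/env bash
# Hypothetical shape of a per-release results artifact; illustrative only.
set -euo pipefail

RELEASE="${1:?usage: $0 <release-tag>}"
OUT="benchmarks/results/${RELEASE}.json"
mkdir -p "$(dirname "$OUT")"

cat > "$OUT" <<EOF
{
  "release": "${RELEASE}",
  "macos": "$(sw_vers -productVersion)",
  "hardware": "$(sysctl -n machdep.cpu.brand_string)",
  "date": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
  "runs": 4,
  "iterations": 3,
  "metrics": {
    "startup_s": 0.935,
    "volume_write_mb_s": 1280,
    "volume_read_mb_s": 2765,
    "small_files_s": 1.257,
    "http_fetch_s": 0.771,
    "c2c_gbps": 23.0,
    "ttfb_ms": 3.24,
    "image_build_s": null
  }
}
EOF
echo "wrote $OUT"
```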

What this looks like adopted

  • A benchmarks/ directory in the repo with a runnable script and documented setup.
  • A CI job (or scheduled workflow on representative hardware) that produces a results JSON per release; a minimal regression-gate check on that file is sketched after this list.
  • A BENCHMARKS.md that publishes the latest numbers and the trend over the last N releases.
  • Cross-runtime numbers kept honest - including the axes where competitors are ahead - because that's how regression suites earn trust and how the team gets the clearest signal about where to invest.
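
The regression-gate part of that CI job could initially be a plain threshold comparison between the new results file and the previous release's baseline. The sketch below assumes jq and the hypothetical JSON layout from the previous example, and checks only the lower-is-better metrics (throughput metrics would invert the comparison).

```bash
#!/usr/bin/env bash
# Sketch of a release-gate check: fail if a metric regresses by more than a
# tolerance relative to the previous release's baseline.
set -euo pipefail

BASELINE="${1:?usage: $0 <baseline.json> <current.json>}"
CURRENT="${2:?usage: $0 <baseline.json> <current.json>}"
TOLERANCE=10   # percent regression allowed before the gate fails

status=0
for metric in startup_s small_files_s http_fetch_s ttfb_ms; do
  old=$(jq -r ".metrics.${metric}" "$BASELINE")
  new=$(jq -r ".metrics.${metric}" "$CURRENT")
  worse=$(echo "$new > $old * (1 + $TOLERANCE / 100)" | bc -l)
  if [ "$worse" -eq 1 ]; then
    echo "REGRESSION: ${metric} went from ${old} to ${new} (>${TOLERANCE}% worse)"
    status=1
  fi
done
exit "$status"
```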

Open to helping with metric definition or contributing to a fork of the suite.

Code of Conduct

  • I agree to follow this project's Code of Conduct
