Skip to content

Add syscall binary rewriting infrastructure#22

Merged
jserv merged 1 commit intomainfrom
syscall-rewriting
Mar 26, 2026
Merged

Add syscall binary rewriting infrastructure#22
jserv merged 1 commit intomainfrom
syscall-rewriting

Conversation

@jserv
Copy link
Contributor

@jserv jserv commented Mar 26, 2026

This implements syscall rewriting fast-path, adding a complete binary rewriting layer alongside the existing seccomp-unotify (Tier 3) and SIGSYS trap (Tier 1) interception paths.

Core additions:

  • x86-64 instruction length decoder (x86-decode.c) that walks true instruction boundaries, eliminating false-positive 0F 05/0F 34 matches inside immediates and displacements
  • Site-aware rewrite classification (WRAPPER vs COMPLEX) to gate the inline fast path: only simple syscall+ret wrapper sites return virtualized PID values (getpid=1, gettid=1, getppid=0) directly; complex sites (raise->gettid->tgkill) use full dispatch
  • aarch64 binary rewriting: SVC #0 patched to B-branch trampolines with veneer pages (LDR+BR indirect stubs) for sites beyond 128MB
  • Userspace ELF loader for in-process trap/rewrite launch
  • Unified syscall request abstraction across all three tiers
  • Per-architecture AUTO mode: seccomp on x86_64, rewrite/trap on aarch64 for non-shell commands without fork wrappers
  • W^X enforcement (PROT_WRITE|PROT_EXEC blocked in trap/rewrite mode)
  • CLONE_THREAD blocked in trap/rewrite (EPERM); multi-threaded guests must use --syscall-mode=seccomp
  • Shadow FD cache with dup() per open for independent file offsets
  • Safe argv/envp validation in trap-mode execve via process_vm_readv
  • Execve re-translation: rewrite runtime re-installed on new binary
  • Fork-site scanner for AUTO mode safety (detects fork/clone wrappers)

Benchmark (bench-test, 200 iterations, release build, us/call):

  • x86_64 (AMD Threadripper 2990WX, Linux 6.8.0):
    syscall      native  seccomp   trap   rewrite  auto
    stat            2.6    13.5    22.0      20.2   13.5
    open+close      5.4    25.2    89.6      58.9   25.3
    lseek+read      1.9     2.3     2.8       2.1    2.4
    write           2.7     2.2     2.5       2.0    2.4
    getpid          0.3     0.0     0.1       0.1    0.0
  • aarch64 (Marvell ThunderX2, Linux 6.14.0):
    syscall      native  seccomp   trap   rewrite  auto
    stat            1.5    21.7    25.1       1.0   21.4
    open+close      3.4    40.3   138.4     117.0   39.2
    lseek+read      1.0     1.3     1.6       1.6    1.4
    write           1.1     1.4     1.8       1.9    1.4
    getpid          0.3     0.0     0.1       0.1    0.0
  • AUTO selects seccomp on x86_64 (lower USER_NOTIF overhead), rewrite on aarch64 (stat 1.0us via in-process LKL inode cache vs 21.7us
  • seccomp round-trip). aarch64 rewrite stat at 0.7x native is real: LKL serves from memory without hitting the block layer.

Change-Id: I9052fc27388027b9147348c0998bc8fa4e1e123e

@jserv jserv requested a review from rota1001 March 26, 2026 03:21
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

14 issues found across 65 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/image.c">

<violation number="1" location="src/image.c:1067">
P3: `munmap(elf_buf, elf_buf_len)` is called when `kbox_read_elf_header_window_fd` failed. At that point `elf_buf` is still `NULL` (never assigned by the failing function). The `munmap(NULL, 0)` call is a no-op on Linux but is dead/incorrect code — remove it or guard with `if (elf_buf)`.</violation>
</file>

<file name="src/loader-image.c">

<violation number="1" location="src/loader-image.c:61">
P2: `fail:` path discards the specific `rc` error code and always returns `-1`. Since `rc` is set to meaningful negative errno values at each failure site, this should be `return rc;` to preserve diagnostic information for callers.</violation>
</file>

<file name="src/fd-table.c">

<violation number="1" location="src/fd-table.c:151">
P3: `kbox_fd_table_insert_hostonly` is newly added but has no call sites, so it is dead code and can drift from the real host-only allocation path.</violation>
</file>

<file name="src/probe.c">

<violation number="1" location="src/probe.c:181">
P2: Check `waitpid` return status in the seccomp filter probe; failures/interruption can currently be misclassified as success.</violation>
</file>

<file name="tests/unit/test-syscall-trap.c">

<violation number="1" location="tests/unit/test-syscall-trap.c:451">
P2: Avoid an unbounded blocking `read()` in the test; add a bounded wait so failures don't hang CI.</violation>

<violation number="2" location="tests/unit/test-syscall-trap.c:478">
P2: Poll `has_pending_dispatch` with an atomic load; plain reads race the service thread.</violation>
</file>

<file name="tests/guest/trap-bench.S">

<violation number="1" location="tests/guest/trap-bench.S:18">
P2: The benchmark loop adds per-iteration `push/pop` overhead inside the timed section, which skews the reported getpid latency.</violation>
</file>

<file name="src/procmem.c">

<violation number="1" location="src/procmem.c:141">
P2: `close(fd)` between a failed `pwrite` and the `errno` read can clobber the original error code. Save `errno` before calling `close()`.</violation>
</file>

<file name="src/syscall-trap.c">

<violation number="1" location="src/syscall-trap.c:165">
P0: Incorrect argument register assignment in `rt_sigprocmask` assembly: `sigset_size` is placed in `%rdx` (oldset) instead of `%r10` (sigsetsize), and `%ecx` is zeroed instead of `%r10`. The kernel's 4th syscall argument is `%r10`, not `%rcx`. This will either EINVAL (stale `%r10`) or crash by writing the old sigmask to address `sigset_size`.</violation>

<violation number="2" location="src/syscall-trap.c:438">
P0: Same argument-shuffle bug on aarch64: `sigset_size` is placed in `x2` (oldset) instead of `x3` (sigsetsize), and `x3` is zeroed. This will crash or EINVAL.</violation>

<violation number="3" location="src/syscall-trap.c:998">
P2: When `service_running` is true, `trap_host_write` returns raw kernel error codes (e.g., `-EAGAIN`) without setting the C `errno`. The subsequent `errno == EAGAIN` check is only correct for the libc `write()` path; the raw-syscall path will incorrectly return -1 on a benign `EAGAIN`.</violation>
</file>

<file name="tests/unit/test-rewrite.c">

<violation number="1" location="tests/unit/test-rewrite.c:930">
P2: The classification test is ineffective because it expects the same result for known and unknown syscall numbers, so it cannot catch a broken implementation.</violation>
</file>

<file name="docs/gdb-workflow.md">

<violation number="1" location="docs/gdb-workflow.md:210">
P3: The new dispatch breakpoint description is inaccurate: `kbox-syscall-trace` currently breaks on `kbox_dispatch_syscall` only, not `kbox_dispatch_request`.</violation>
</file>

<file name="src/rewrite.c">

<violation number="1" location="src/rewrite.c:1">
P1: `current[4]` is too small for x86_64 wrapper patches (width 8). The `width > sizeof(current)` guard will reject every x86_64 site, making this function silently fail for x86_64 binaries. Other memfd patching functions in this file use `KBOX_REWRITE_MAX_PATCH_BYTES` for this buffer.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment on lines +165 to +169
"mov %rsi, %rdx\n"
"mov $" XSTR(FUTEX_WAIT_PRIVATE) ", %esi\n"
"xor %r10d, %r10d\n"
"xor %r8d, %r8d\n"
"xor %r9d, %r9d\n"
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0: Incorrect argument register assignment in rt_sigprocmask assembly: sigset_size is placed in %rdx (oldset) instead of %r10 (sigsetsize), and %ecx is zeroed instead of %r10. The kernel's 4th syscall argument is %r10, not %rcx. This will either EINVAL (stale %r10) or crash by writing the old sigmask to address sigset_size.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall-trap.c, line 165:

<comment>Incorrect argument register assignment in `rt_sigprocmask` assembly: `sigset_size` is placed in `%rdx` (oldset) instead of `%r10` (sigsetsize), and `%ecx` is zeroed instead of `%r10`. The kernel's 4th syscall argument is `%r10`, not `%rcx`. This will either EINVAL (stale `%r10`) or crash by writing the old sigmask to address `sigset_size`.</comment>

<file context>
@@ -0,0 +1,1534 @@
+    ".globl kbox_syscall_trap_host_futex_wait_start\n"
+    "kbox_syscall_trap_host_futex_wait_start:\n"
+    "kbox_syscall_trap_host_futex_wait_private:\n"
+    "mov %rsi, %rdx\n"
+    "mov $" XSTR(FUTEX_WAIT_PRIVATE) ", %esi\n"
+    "xor %r10d, %r10d\n"
</file context>
Suggested change
"mov %rsi, %rdx\n"
"mov $" XSTR(FUTEX_WAIT_PRIVATE) ", %esi\n"
"xor %r10d, %r10d\n"
"xor %r8d, %r8d\n"
"xor %r9d, %r9d\n"
"mov %rsi, %r10\n"
"mov %rdi, %rsi\n"
"xor %edx, %edx\n"
"xor %r8d, %r8d\n"
"xor %r9d, %r9d\n"
Fix with Cubic

size_t size)
{
if (!image || image->region_count >= KBOX_LOADER_MAX_MAPPINGS)
return -1;
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: fail: path discards the specific rc error code and always returns -1. Since rc is set to meaningful negative errno values at each failure site, this should be return rc; to preserve diagnostic information for callers.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/loader-image.c, line 61:

<comment>`fail:` path discards the specific `rc` error code and always returns `-1`. Since `rc` is set to meaningful negative errno values at each failure site, this should be `return rc;` to preserve diagnostic information for callers.</comment>

<file context>
@@ -0,0 +1,186 @@
+                         size_t size)
+{
+    if (!image || image->region_count >= KBOX_LOADER_MAX_MAPPINGS)
+        return -1;
+    image->regions[image->region_count++] = (struct kbox_loader_image_region) {
+        .addr = addr,
</file context>
Suggested change
return -1;
return rc;
Fix with Cubic

@jserv jserv force-pushed the syscall-rewriting branch 3 times, most recently from caf83d2 to 07915f3 Compare March 26, 2026 06:29
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Mar 26, 2026
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8 issues found across 71 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/loader-image.c">

<violation number="1" location="src/loader-image.c:150">
P2: Memory leak: if `record_region` fails, the mapping from `map_region_exact` is never recorded and the `fail` cleanup path (`unmap_recorded_regions`) won't unmap it. Add `munmap(mapped, mapping->size)` before the `goto fail`.</violation>
</file>

<file name="src/elf.c">

<violation number="1" location="src/elf.c:328">
P3: Redundant `memcpy` — `pread_full` on the next line re-reads from offset 0 into the same buffer, overwriting everything that was just copied. Remove the `memcpy` to avoid up to 256 KB of unnecessary work (especially relevant since the comment notes this path can execute inside a SIGSYS handler).</violation>
</file>

<file name="tests/unit/test-syscall-trap.c">

<violation number="1" location="tests/unit/test-syscall-trap.c:447">
P2: Initialize ctx.host_nrs before starting the service thread; the dispatch path dereferences ctx->host_nrs and will crash with the zeroed ctx.</violation>
</file>

<file name="src/procmem.c">

<violation number="1" location="src/procmem.c:97">
P2: Stale `errno` when `pread` returns 0: `errno` is not cleared on success/EOF, so `errno ? -errno : -EIO` may return a stale error code from a prior failed call instead of the intended `-EIO`. Split the check into `n < 0` (use errno) and `n == 0` (return -EIO directly).</violation>
</file>

<file name="src/rewrite.c">

<violation number="1" location="src/rewrite.c:1">
P2: Unsigned underflow in bounds check: `image_len - 8` wraps to `SIZE_MAX - 7` when `image_len < 8`, silently bypassing the guard. Replace with an addition-based comparison to avoid the wraparound.</violation>
</file>

<file name="src/syscall-trap.c">

<violation number="1" location="src/syscall-trap.c:1113">
P2: `service_stop` is read/written across threads without atomics, unlike the other shared flags in this struct (`has_pending_request`, `has_pending_dispatch`) which use `__atomic_load_n`/`__atomic_store_n`. The compiler may hoist the non-atomic load out of the service-thread loop, preventing the thread from seeing the stop flag.</violation>
</file>

<file name="tests/guest/bench-test.c">

<violation number="1" location="tests/guest/bench-test.c:31">
P2: The benchmark silently measures the error path if stat() fails.</violation>

<violation number="2" location="tests/guest/bench-test.c:46">
P2: The benchmark silently ignores open() failures, which can result in measuring the error path and outputting deceptively fast numbers.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

&mapped);
if (rc < 0)
goto fail;
if (record_region(image, mapped, (size_t) mapping->size) < 0) {
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Memory leak: if record_region fails, the mapping from map_region_exact is never recorded and the fail cleanup path (unmap_recorded_regions) won't unmap it. Add munmap(mapped, mapping->size) before the goto fail.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/loader-image.c, line 150:

<comment>Memory leak: if `record_region` fails, the mapping from `map_region_exact` is never recorded and the `fail` cleanup path (`unmap_recorded_regions`) won't unmap it. Add `munmap(mapped, mapping->size)` before the `goto fail`.</comment>

<file context>
@@ -0,0 +1,186 @@
+                              &mapped);
+        if (rc < 0)
+            goto fail;
+        if (record_region(image, mapped, (size_t) mapping->size) < 0) {
+            rc = -ENOMEM;
+            goto fail;
</file context>
Fix with Cubic

ASSERT_EQ(
kbox_syscall_trap_runtime_init(&runtime, &ctx, &capture_only_trap_ops),
0);
ASSERT_EQ(kbox_syscall_trap_runtime_service_start(&runtime), 0);
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Initialize ctx.host_nrs before starting the service thread; the dispatch path dereferences ctx->host_nrs and will crash with the zeroed ctx.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/unit/test-syscall-trap.c, line 447:

<comment>Initialize ctx.host_nrs before starting the service thread; the dispatch path dereferences ctx->host_nrs and will crash with the zeroed ctx.</comment>

<file context>
@@ -0,0 +1,512 @@
+    ASSERT_EQ(
+        kbox_syscall_trap_runtime_init(&runtime, &ctx, &capture_only_trap_ops),
+        0);
+    ASSERT_EQ(kbox_syscall_trap_runtime_service_start(&runtime), 0);
+    ASSERT_EQ(kbox_syscall_trap_runtime_capture(&runtime, &req), 0);
+
</file context>
Fix with Cubic

@@ -0,0 +1,4060 @@
/* SPDX-License-Identifier: MIT */
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Unsigned underflow in bounds check: image_len - 8 wraps to SIZE_MAX - 7 when image_len < 8, silently bypassing the guard. Replace with an addition-based comparison to avoid the wraparound.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/rewrite.c:

<comment>Unsigned underflow in bounds check: `image_len - 8` wraps to `SIZE_MAX - 7` when `image_len < 8`, silently bypassing the guard. Replace with an addition-based comparison to avoid the wraparound.</comment>
Fix with Cubic

@jserv jserv force-pushed the syscall-rewriting branch 2 times, most recently from 18c2e70 to 5c09f57 Compare March 26, 2026 09:35
@sysprog21 sysprog21 deleted a comment from cubic-dev-ai bot Mar 26, 2026
Copy link

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

4 issues found across 71 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/loader-layout.c">

<violation number="1" location="src/loader-layout.c:76">
P2: Missing overflow checks for `load_bias + seg->map_start` (and similar `load_bias +` additions at lines 99, 103, 113). The function uses `__builtin_add_overflow` for `seg->vaddr` arithmetic but performs these bias additions unchecked, which could wrap around with a large load bias. Consider adding overflow checks for consistency with the existing pattern.</violation>
</file>

<file name="src/seccomp-bpf.c">

<violation number="1" location="src/seccomp-bpf.c:672">
P1: Bounds check on `filter[]` runs after all writes complete — if `idx` ever exceeds `MAX_PROG_LEN` during emission, the stack is already corrupted. Move the overflow check into the emit helpers or add a guard before each `filter[idx++]` write, or at minimum assert `idx < MAX_PROG_LEN` at the start of each emit loop iteration.</violation>
</file>

<file name="src/syscall-trap.c">

<violation number="1" location="src/syscall-trap.c:1232">
P0: On x86_64, `host_syscall` reads the syscall number from `REG_RAX` in the ucontext, but the kernel has already overwritten RAX with `-ENOSYS` before delivering SIGSYS for `SECCOMP_RET_TRAP`. This means every CONTINUE-dispatched syscall in trap mode will execute `syscall(-ENOSYS)` instead of the intended syscall. The request struct built earlier by `kbox_syscall_regs_from_sigsys` has the correct number (from `si_syscall`), but `host_syscall` doesn't use it.</violation>
</file>

<file name="tests/unit/test-syscall-trap.c">

<violation number="1" location="tests/unit/test-syscall-trap.c:358">
P2: Initialize `ctx.host_nrs` before calling `kbox_syscall_dispatch_sigsys`; the dispatch path dereferences it unconditionally, so leaving it NULL will crash this test.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

static int64_t host_syscall(const ucontext_t *uc)
{
#if defined(__x86_64__)
long nr = uc->uc_mcontext.gregs[REG_RAX];
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0: On x86_64, host_syscall reads the syscall number from REG_RAX in the ucontext, but the kernel has already overwritten RAX with -ENOSYS before delivering SIGSYS for SECCOMP_RET_TRAP. This means every CONTINUE-dispatched syscall in trap mode will execute syscall(-ENOSYS) instead of the intended syscall. The request struct built earlier by kbox_syscall_regs_from_sigsys has the correct number (from si_syscall), but host_syscall doesn't use it.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/syscall-trap.c, line 1232:

<comment>On x86_64, `host_syscall` reads the syscall number from `REG_RAX` in the ucontext, but the kernel has already overwritten RAX with `-ENOSYS` before delivering SIGSYS for `SECCOMP_RET_TRAP`. This means every CONTINUE-dispatched syscall in trap mode will execute `syscall(-ENOSYS)` instead of the intended syscall. The request struct built earlier by `kbox_syscall_regs_from_sigsys` has the correct number (from `si_syscall`), but `host_syscall` doesn't use it.</comment>

<file context>
@@ -0,0 +1,1542 @@
+static int64_t host_syscall(const ucontext_t *uc)
+{
+#if defined(__x86_64__)
+    long nr = uc->uc_mcontext.gregs[REG_RAX];
+
+    if (nr == __NR_exit || nr == __NR_exit_group)
</file context>
Fix with Cubic

filter[idx++] = (struct kbox_sock_filter) KBOX_BPF_STMT(
KBOX_BPF_RET | KBOX_BPF_K, KBOX_SECCOMP_RET_TRAP);

if (idx > MAX_PROG_LEN) {
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1: Bounds check on filter[] runs after all writes complete — if idx ever exceeds MAX_PROG_LEN during emission, the stack is already corrupted. Move the overflow check into the emit helpers or add a guard before each filter[idx++] write, or at minimum assert idx < MAX_PROG_LEN at the start of each emit loop iteration.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At src/seccomp-bpf.c, line 672:

<comment>Bounds check on `filter[]` runs after all writes complete — if `idx` ever exceeds `MAX_PROG_LEN` during emission, the stack is already corrupted. Move the overflow check into the emit helpers or add a guard before each `filter[idx++]` write, or at minimum assert `idx < MAX_PROG_LEN` at the start of each emit loop iteration.</comment>

<file context>
@@ -334,15 +492,214 @@ int kbox_install_seccomp_listener(const struct kbox_host_nrs *h)
+    filter[idx++] = (struct kbox_sock_filter) KBOX_BPF_STMT(
+        KBOX_BPF_RET | KBOX_BPF_K, KBOX_SECCOMP_RET_TRAP);
+
+    if (idx > MAX_PROG_LEN) {
+        errno = EINVAL;
+        return -1;
</file context>
Fix with Cubic

info.si_signo = SIGSYS;
#endif

ASSERT_EQ(kbox_syscall_dispatch_sigsys(&ctx, 55, &info, &uc), expected_rc);
Copy link

@cubic-dev-ai cubic-dev-ai bot Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2: Initialize ctx.host_nrs before calling kbox_syscall_dispatch_sigsys; the dispatch path dereferences it unconditionally, so leaving it NULL will crash this test.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At tests/unit/test-syscall-trap.c, line 358:

<comment>Initialize `ctx.host_nrs` before calling `kbox_syscall_dispatch_sigsys`; the dispatch path dereferences it unconditionally, so leaving it NULL will crash this test.</comment>

<file context>
@@ -0,0 +1,517 @@
+    info.si_signo = SIGSYS;
+#endif
+
+    ASSERT_EQ(kbox_syscall_dispatch_sigsys(&ctx, 55, &info, &uc), expected_rc);
+#if defined(__x86_64__)
+    ASSERT_EQ(uc.uc_mcontext.gregs[REG_RAX], info.si_syscall);
</file context>
Fix with Cubic

This implements syscall rewriting fast-path, adding a complete binary
rewriting layer alongside the existing seccomp-unotify (Tier 3) and
SIGSYS trap (Tier 1) interception paths.

Core additions:
- x86-64 instruction length decoder (x86-decode.c) that walks true
  instruction boundaries, eliminating false-positive 0F 05/0F 34
  matches inside immediates and displacements
- Site-aware rewrite classification (WRAPPER vs COMPLEX) to gate the
  inline fast path: only simple syscall+ret wrapper sites return
  virtualized PID values (getpid=1, gettid=1, getppid=0) directly;
  complex sites (raise->gettid->tgkill) use full dispatch
- aarch64 binary rewriting: SVC #0 patched to B-branch trampolines
  with veneer pages (LDR+BR indirect stubs) for sites beyond 128MB
- Userspace ELF loader for in-process trap/rewrite launch
- Unified syscall request abstraction across all three tiers
- Per-architecture AUTO mode: seccomp on x86_64, rewrite/trap on
  aarch64 for non-shell commands without fork wrappers
- W^X enforcement (PROT_WRITE|PROT_EXEC blocked in trap/rewrite mode)
- CLONE_THREAD blocked in trap/rewrite (EPERM); multi-threaded guests
  must use --syscall-mode=seccomp
- Shadow FD cache with dup() per open for independent file offsets
- Safe argv/envp validation in trap-mode execve via process_vm_readv
- Execve re-translation: rewrite runtime re-installed on new binary
- Fork-site scanner for AUTO mode safety (detects fork/clone wrappers)

Benchmark (bench-test, 200 iterations, release build, us/call):

  x86_64 (AMD Threadripper 2990WX, Linux 6.8.0):
    syscall      native  seccomp   trap   rewrite  auto
    stat            2.6    13.5    22.0      20.2   13.5
    open+close      5.4    25.2    89.6      58.9   25.3
    lseek+read      1.9     2.3     2.8       2.1    2.4
    write           2.7     2.2     2.5       2.0    2.4
    getpid          0.3     0.0     0.1       0.1    0.0

  aarch64 (Marvell ThunderX2, Linux 6.14.0):
    syscall      native  seccomp   trap   rewrite  auto
    stat            1.5    21.7    25.1       1.0   21.4
    open+close      3.4    40.3   138.4     117.0   39.2
    lseek+read      1.0     1.3     1.6       1.6    1.4
    write           1.1     1.4     1.8       1.9    1.4
    getpid          0.3     0.0     0.1       0.1    0.0

* AUTO selects seccomp on x86_64 (lower USER_NOTIF overhead), rewrite on
  aarch64 (stat 1.0us via in-process LKL inode cache vs 21.7us
* seccomp round-trip). aarch64 rewrite stat at 0.7x native is real: LKL
  serves from memory without hitting the block layer.

Change-Id: I9052fc27388027b9147348c0998bc8fa4e1e123e
@jserv jserv force-pushed the syscall-rewriting branch from 5c09f57 to 6080aad Compare March 26, 2026 10:17
@jserv jserv merged commit 78d13e1 into main Mar 26, 2026
3 checks passed
@jserv jserv deleted the syscall-rewriting branch March 26, 2026 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant