ptraceomatic: fix page fault retry and F2/F3 prefix in undefined flags #2705

Open
evnchn wants to merge 2 commits into ish-app:master from evnchn:ptraceomatic-libc-fix

Conversation

@evnchn

@evnchn evnchn commented Mar 15, 2026

Summary

Two fixes for pre-existing ptraceomatic failures with libc-linked static binaries. After these fixes, ptraceomatic successfully validates return 42, printf("hello"), and printf("%f", sqrt(2.0)) programs end-to-end on kernel 5.4.

Bug 1: Page fault retry (EIP desync)

When glibc probes the stack after a large allocation:

sub esp, 0x1000
or  [esp], 0x0    ; <-- page fault: stack growth

cpu_run_to_interrupt returns INT_GPF without advancing EIP (the instruction didn't complete). handle_interrupt maps the new stack page, but step_tracing then steps the real CPU forward — creating a permanent 4-byte EIP desync. The fix retries the emulated instruction once after the page is mapped.

Q: What if the retry itself faults? handle_interrupt delivers SIGSEGV to the emulated process. The next compare_cpus detects the divergence. This is the same failure path as any other genuine segfault — no worse than status quo.
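The retry described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_cpu`, `toy_step`, `toy_handle_interrupt`, and `toy_step_with_retry` are hypothetical names invented for illustration and are not the functions in tools/ptraceomatic.c. The model assumes a 4-byte faulting instruction and a handler that maps the missing stack page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: none of these names come from the iSH source. */
enum { TOY_INT_NONE = 0, TOY_INT_GPF = 13 };

struct toy_cpu {
    uint32_t eip;
    bool stack_page_mapped; /* stands in for the page just below the stack */
};

/* Emulate one 4-byte instruction that touches the stack page.
 * On a fault, eip is left unchanged because the instruction did not complete. */
static int toy_step(struct toy_cpu *cpu) {
    if (!cpu->stack_page_mapped)
        return TOY_INT_GPF;
    cpu->eip += 4;
    return TOY_INT_NONE;
}

/* Stand-in for the interrupt handler: a fault grows the stack. */
static void toy_handle_interrupt(struct toy_cpu *cpu, int interrupt) {
    if (interrupt == TOY_INT_GPF)
        cpu->stack_page_mapped = true;
}

/* Step once; if eip did not move, handle the interrupt and retry,
 * at most max_retries extra times. Returns the number of attempts made. */
static int toy_step_with_retry(struct toy_cpu *cpu, int max_retries) {
    assert(max_retries >= 0);
    uint32_t start_eip = cpu->eip;
    int attempts = 0;
    do {
        int interrupt = toy_step(cpu);
        attempts++;
        if (interrupt != TOY_INT_NONE)
            toy_handle_interrupt(cpu, interrupt);
    } while (cpu->eip == start_eip && attempts <= max_retries);
    return attempts;
}
```

With `max_retries` set to 1, a single stack-growth fault is absorbed before the real CPU is stepped, so both sides advance past the same instruction. If the retry also fails to move eip, the loop gives up and the mismatch is left for the comparison step to report.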

Bug 2: F2/F3 prefix not skipped in undefined_flags_mask

undefined_flags_mask determines which flags are architecturally undefined after an instruction (so ptraceomatic doesn't compare them). It already skipped the 0x66 prefix to reach the actual opcode, but not F2/F3. As a result, the mask computed for tzcnt (F3 0F BC, decoded as rep bsf on non-BMI1 CPUs like Ivy Bridge) was 0 instead of O|S|A|P|C, triggering a false eflags mismatch in glibc's __printf_fp_l.
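A minimal sketch of the prefix-skipping idea follows. The helpers `toy_skip_prefixes` and `toy_undefined_flags` and the `TOY_*` constants are hypothetical and written only for this illustration; they are not the code from the patch, and the sketch knows just one opcode (0F BC, bsf). The flag bit positions are the standard EFLAGS ones (CF bit 0, PF bit 2, AF bit 4, SF bit 7, OF bit 11).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag constants using the standard EFLAGS bit positions. */
enum {
    TOY_CF = 1 << 0,
    TOY_PF = 1 << 2,
    TOY_AF = 1 << 4,
    TOY_SF = 1 << 7,
    TOY_OF = 1 << 11
};

/* Hypothetical helper: count leading 0x66, F2, and F3 prefix bytes
 * so the caller can index the actual opcode. */
static size_t toy_skip_prefixes(const uint8_t *code, size_t len) {
    size_t i = 0;
    while (i < len && (code[i] == 0x66 || code[i] == 0xF2 || code[i] == 0xF3))
        i++;
    return i;
}

/* Return the flags treated as undefined for the single opcode this
 * sketch knows about (0F BC, bsf); every other opcode yields 0. */
static uint32_t toy_undefined_flags(const uint8_t *code, size_t len) {
    assert(code != NULL);
    size_t i = toy_skip_prefixes(code, len);
    if (len - i >= 2 && code[i] == 0x0F && code[i + 1] == 0xBC)
        return TOY_OF | TOY_SF | TOY_AF | TOY_PF | TOY_CF;
    return 0;
}
```

Under these assumptions, the byte sequence F3 0F BC C0 yields the same mask as 0F BC C0, which is the behaviour the fix is after: the prefix does not change which flags are skipped in the comparison.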

How to test

Requires an x86_64 Linux host with gcc-multilib installed and ptrace_scope=0:

# Build
meson setup build -Dengine=asbestos -Dkernel=ish
ninja -C build

# Compile test binaries
echo 'int main() { return 42; }' | gcc -m32 -static -o /tmp/ret42 -x c -
echo '#include <stdio.h>
int main() { printf("hello\n"); return 0; }' | gcc -m32 -static -o /tmp/hello -x c -
echo '#include <stdio.h>
#include <math.h>
int main() { printf("%.10f\n", sqrt(2.0)); return 0; }' | gcc -m32 -static -o /tmp/math -x c - -lm

# Run ptraceomatic — all should exit cleanly (no SIGTRAP)
./build/tools/ptraceomatic /tmp/ret42    # expected exit code: 42
./build/tools/ptraceomatic /tmp/hello    # prints "hello", exit 0
./build/tools/ptraceomatic /tmp/math     # prints "1.4142135624", exit 0

Without this patch, /tmp/ret42 crashes with an EIP mismatch at _dl_get_origin (glibc stack probe), and /tmp/math crashes with an eflags mismatch at __printf_fp_l (tzcnt undefined flags).

Process note

This fix was developed with Claude Code (Anthropic's AI coding agent). The initial investigation involved extensive instrumentation (ring buffer traces, fprintf diagnostics, GDB sessions) to locate the root causes across ~3000 instructions of glibc init code. The final diff was then distilled down to the minimal 9-line fix you see here — every line was individually questioned and justified during review. The instrumentation scaffolding was deliberately removed to keep the PR focused.

Tested on an Ubuntu 20.04 KVM VM (kernel 5.4.0-216-generic).

@evnchn
Author

evnchn commented Mar 15, 2026

Human note: I think the fix for bug 1 is valid. I'm not sure about bug 2, since it is specific to my setup: I run iSH dev on an i7-3632QM (Ivy Bridge, as mentioned). Nevertheless, removing either fix causes the test to fail on my machine.

Member

@tbodt tbodt left a comment

Second fix looks right to me actually. Even generally, rep prefix shouldn't affect which flags are undefined.

Human note: To respect my time, please handwrite PR descriptions and review comments, even if claude did the work - I want to know that another human looked at the code and can vouch for it being ready for review.

Comment thread tools/ptraceomatic.c
@evnchn
Author

evnchn commented Mar 16, 2026

Also @tbodt please use "hide whitespace" view so that the diff is more manageable.

Member

@tbodt tbodt left a comment

Pls clean up history, then lgtm

Comment thread tools/ptraceomatic.c Outdated
Evan Chan and others added 2 commits March 16, 2026 16:38
When cpu_run_to_interrupt returns INT_GPF (e.g. stack growth), the
faulting instruction doesn't execute and eip doesn't advance.
handle_interrupt maps the page, but step_tracing then stepped the
real CPU forward, creating a permanent eip desync.

Loop until eip changes, matching the existing real-CPU pattern for
repeated string instructions. This also handles the case where a
page fault goes to a signal handler.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Only the 0x66 prefix was skipped when looking up the opcode to
determine undefined flags. This caused tzcnt (F3 0F BC, decoded
as rep bsf on non-BMI1 CPUs) to not get its flags (O,S,A,P,C)
marked as undefined, triggering a false eflags mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@evnchn evnchn force-pushed the ptraceomatic-libc-fix branch from aac4ac7 to 6f7f72b Compare March 16, 2026 08:41
@evnchn
Author

evnchn commented Mar 16, 2026

@tbodt I hope I didn't mess up 😅
