ptraceomatic: fix page fault retry and F2/F3 prefix in undefined flags #2705
evnchn wants to merge 2 commits into ish-app:master
Human note: I think the fix for bug 1 is valid. I'm less sure about bug 2, since it is specific to my setup: running iSH dev on an i7-3632QM (Ivy Bridge, as mentioned). Nevertheless, removing either fix causes the test to fail on my machine.
tbodt
left a comment
Second fix looks right to me actually. Even generally, rep prefix shouldn't affect which flags are undefined.
Human note: To respect my time, please handwrite PR descriptions and review comments, even if claude did the work - I want to know that another human looked at the code and can vouch for it being ready for review.
|
Also @tbodt please use "hide whitespace" view so that the diff is more manageable.
|
tbodt
left a comment
Pls clean up history, then lgtm
When cpu_run_to_interrupt returns INT_GPF (e.g. stack growth), the faulting instruction doesn't execute and eip doesn't advance. handle_interrupt maps the page, but step_tracing then stepped the real CPU forward, creating a permanent eip desync. Loop until eip changes, matching the existing real-CPU pattern for repeated string instructions. This also handles the case where a page fault goes to a signal handler. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
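The "loop until eip changes" idea from this commit message can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions: every `toy_` name, the fixed 4-byte instruction length, and the fault handler that simply maps the page are hypothetical stand-ins, not the actual iSH or ptraceomatic code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the real iSH definitions. */
enum { TOY_INT_NONE = 0, TOY_INT_GPF = 13 };

struct toy_cpu {
    uint32_t eip;
    bool page_mapped;   /* is the page touched by the instruction mapped? */
    int faults_handled; /* how many faults have been serviced so far */
};

/* Run one instruction. On a fault the instruction does not complete,
 * so eip is left unchanged. */
static int toy_run_one(struct toy_cpu *cpu) {
    if (!cpu->page_mapped)
        return TOY_INT_GPF;
    cpu->eip += 4; /* assumption: every toy instruction is 4 bytes long */
    return TOY_INT_NONE;
}

/* Service a fault by mapping the missing page (models stack growth). */
static void toy_handle_interrupt(struct toy_cpu *cpu, int interrupt) {
    if (interrupt == TOY_INT_GPF) {
        cpu->page_mapped = true;
        cpu->faults_handled++;
    }
}

/* One emulated step: keep running until eip actually moves, so the
 * emulated side has really executed an instruction before the real CPU
 * is stepped and the two are compared. */
static void toy_step(struct toy_cpu *cpu) {
    assert(cpu != NULL);
    uint32_t start_eip = cpu->eip;
    while (cpu->eip == start_eip) {
        int interrupt = toy_run_one(cpu);
        if (interrupt != TOY_INT_NONE)
            toy_handle_interrupt(cpu, interrupt);
    }
}
```

In this model a faulting step costs two passes through the loop (fault, then retry) but still counts as a single step from the caller's point of view. The model does not cover an instruction that legitimately leaves eip unchanged, such as a jump to itself.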
Only the 0x66 prefix was skipped when looking up the opcode to determine undefined flags. This caused tzcnt (F3 0F BC, decoded as rep bsf on non-BMI1 CPUs) to not get its flags (O,S,A,P,C) marked as undefined, triggering a false eflags mismatch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
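To make the prefix handling concrete, here is a minimal sketch of an undefined-flags lookup that skips 0x66, F2, and F3 before matching the opcode. The function name `toy_undefined_flags`, its signature, and the single-opcode table are assumptions for illustration and are not taken from the real `undefined_flags_mask`; the flag bit positions are the architectural EFLAGS positions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Architectural EFLAGS bit positions. */
#define FLAG_C (1u << 0)
#define FLAG_P (1u << 2)
#define FLAG_A (1u << 4)
#define FLAG_S (1u << 7)
#define FLAG_O (1u << 11)

/* Prefixes that sit in front of the opcode but do not change which
 * flags the instruction leaves undefined. */
static int toy_is_skipped_prefix(uint8_t byte) {
    return byte == 0x66 || byte == 0xF2 || byte == 0xF3;
}

/* Hypothetical lookup: return the mask of flags that are undefined after
 * the instruction starting at code[0]. Only bsf (0F BC) is modeled. */
static uint32_t toy_undefined_flags(const uint8_t *code, size_t len) {
    assert(code != NULL);
    size_t i = 0;
    while (i < len && toy_is_skipped_prefix(code[i]))
        i++;
    /* Two-byte opcode 0F BC: bsf leaves O, S, A, P, C undefined. */
    if (i + 1 < len && code[i] == 0x0F && code[i + 1] == 0xBC)
        return FLAG_O | FLAG_S | FLAG_A | FLAG_P | FLAG_C;
    return 0;
}
```

With this shape, the byte sequence F3 0F BC (tzcnt, or rep bsf on CPUs without BMI1) yields the same mask as a plain 0F BC, which is the behavior the commit describes.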
aac4ac7 to 6f7f72b
|
@tbodt I hope I didn't mess up 😅

Summary
Two fixes for pre-existing ptraceomatic failures with libc-linked static binaries. After these fixes, ptraceomatic successfully validates return 42, printf("hello"), and printf("%f", sqrt(2.0)) programs end-to-end on kernel 5.4.

Bug 1: Page fault retry (EIP desync)
When glibc probes the stack after a large allocation, cpu_run_to_interrupt returns INT_GPF without advancing EIP (the instruction didn't complete). handle_interrupt maps the new stack page, but step_tracing then steps the real CPU forward, creating a permanent 4-byte EIP desync. The fix retries the emulated instruction once after the page is mapped.

Q: What if the retry itself faults?
handle_interrupt delivers SIGSEGV to the emulated process. The next compare_cpus detects the divergence. This is the same failure path as any other genuine segfault, so it is no worse than the status quo.

Bug 2: F2/F3 prefix not skipped in undefined_flags_mask
undefined_flags_mask determines which flags are architecturally undefined after an instruction (so ptraceomatic doesn't compare them). It already skipped the 0x66 prefix to reach the actual opcode, but didn't skip F2/F3. As a result, the lookup for tzcnt (F3 0F BC, decoded as rep bsf on non-BMI1 CPUs like Ivy Bridge) returned 0 instead of O|S|A|P|C, triggering a false eflags mismatch in glibc's __printf_fp_l.

How to test
Requires an x86_64 Linux host with gcc-multilib installed and ptrace_scope=0.

Without this patch, /tmp/ret42 crashes with an EIP mismatch at _dl_get_origin (glibc stack probe), and /tmp/math crashes with an eflags mismatch at __printf_fp_l (tzcnt undefined flags).

Process note
This fix was developed with Claude Code (Anthropic's AI coding agent). The initial investigation involved extensive instrumentation (ring buffer traces, fprintf diagnostics, GDB sessions) to locate the root causes across ~3000 instructions of glibc init code. The final diff was then distilled down to the minimal 9-line fix you see here; every line was individually questioned and justified during review. The instrumentation scaffolding was deliberately removed to keep the PR focused.
Tested on an Ubuntu 20.04 KVM VM (kernel 5.4.0-216-generic).