Skip to content

Add pathological-regex discovery tests across all 16 ports#72

Merged
rjrodger merged 6 commits into
mainfrom
claude/audit-unsafe-rust-JAVbv
May 16, 2026
Merged

Add pathological-regex discovery tests across all 16 ports#72
rjrodger merged 6 commits into
mainfrom
claude/audit-unsafe-rust-JAVbv

Conversation

@rjrodger
Copy link
Copy Markdown
Contributor

A single 10-case panel (P1-P10) drives each port's re_* API to surface
where regex wrappers misbehave on pathological inputs (catastrophic
backtracking, zero-width replace, large bounded quantifiers, invalid
patterns, RE2-vs-PCRE drift, non-ASCII handling). Each test catches
errors per case so one failure doesn't mask the others.

See REGEX_PATHOLOGICAL.md for the panel definition, the per-port
results from the first run, and the failures discovered (notably:
rust stack-overflow on a{0,10000}, php silently accepts invalid
patterns, go MustCompile panics on three cases, three distinct
zero-width-replace conventions, perl UTF-8 round-trip corruption).

claude added 3 commits May 16, 2026 10:29
A single 10-case panel (P1-P10) drives each port's re_* API to surface
where regex wrappers misbehave on pathological inputs (catastrophic
backtracking, zero-width replace, large bounded quantifiers, invalid
patterns, RE2-vs-PCRE drift, non-ASCII handling). Each test catches
errors per case so one failure doesn't mask the others.

See REGEX_PATHOLOGICAL.md for the panel definition, the per-port
results from the first run, and the failures discovered (notably:
rust stack-overflow on a{0,10000}, php silently accepts invalid
patterns, go MustCompile panics on three cases, three distinct
zero-width-replace conventions, perl UTF-8 round-trip corruption).
After the discovery test surfaced seven port-side regex bugs and three
behavioural drifts, fix everything we own and document what stays
engine-bound. See REGEX_PATHOLOGICAL.md for the full pre/post table.

Port-side fixes:

- rust: re.rs `add()` epsilon-closure rewritten as iterative. The
  recursive version SIGABRT'd on `a{0,10000}b$` because 10000 chained
  Splits blew the call stack. Priority preserved by pushing y then x.
  All tests + corpus pass.

- c / lua: in-tree Thompson NFA `OP_MATCH` was `if (!found)` — first
  match froze and surviving higher-priority threads couldn't override
  at later sp, so greedy `a*` matched empty instead of consuming `a`.
  Now always overwrite within the priority-pruned thread set. C corpus
  1200/1200, lua regex 53/53 still green.

- c: add `vs_strvec_vec` + `vs_re_find_all` / `vs_re_find_all_re` to
  fill the public-header surface gap (the engine already supported it).

- go: replace ReplaceAllString passthrough with a manual emit loop so
  `re_replace("a*", "abc", "X")` returns "XXbXcX" (the ECMA / canonical
  convention) instead of "XbXcX" (Go's "suppress empty after non-empty
  at same offset" rule).

- zig: expose `re_find` / `re_find_all` / `re_replace` wrappers and
  promote `Regex.findFirst` + new `findFrom` to pub. Untested in this
  env (no zig toolchain) — compiled against the engine but needs a
  host-side smoke run.

- php: `re_compile` was a string concatenation; it never validated.
  Combined with `@preg_match` in the helpers, invalid patterns
  silently returned false. Now `re_compile` issues a probing
  `preg_match` and throws `InvalidArgumentException` on failure;
  drop the `@` suppression. 85 PHPUnit tests still pass.

- perl: discovery test was double-encoding UTF-8 (encode_json +
  :utf8 stdout). Switch to `JSON::PP->new->utf8(0)`. Port code
  was correct.

Irreconcilable, now explicitly documented in REGEX_PATHOLOGICAL.md:

- P1/P2 catastrophic backtracking: ECMA/PCRE/.NET/Java engines
  backtrack; RE2/Thompson don't. Times span <0.1ms to 1190ms. Users
  must write linear-friendly patterns.
- P7 RE2 bounded-quantifier limit: Go's stdlib caps {n,m} at 1000.
  No portable workaround.
- P9 backreferences: PCRE-family matches, RE2/in-tree don't.
  REGEX.md already declares backrefs outside the dialect.
- Java/.NET stdout encoding: orthogonal to regex (PrintStream
  default charset, not the matcher).
Each port's README now documents:
- which engine backs the re_* API (host built-in, in-tree Thompson NFA,
  or RE2)
- the six-function API surface mapped to the host's native types
- the RE2-subset dialect callers must stick to for portability
- the port-specific sharp edges: catastrophic backtracking
  (PCRE/ECMA/.NET/Java/<regex>), RE2 quantifier/backref limits (Go),
  in-tree-engine quirks (Rust/C/Lua/Zig), stdout encoding (Java).

Each section links to REGEX_PATHOLOGICAL.md for the cross-port
discovery panel.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f00232377f

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread go/voxgigstruct.go Outdated
var out strings.Builder
pos := 0
for pos <= len(input) {
loc := re.FindStringSubmatchIndex(input[pos:])
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Match on full input when replacing to preserve anchors

Scanning with re.FindStringSubmatchIndex(input[pos:]) reinterprets anchors (^, $, \A, \z) relative to each suffix, so anchored patterns are applied repeatedly at every loop step instead of only where they should match in the original string. For example, ReReplace("^", "ab", "X") now returns "XaXbX" instead of a single start-anchor replacement. This is a functional regression introduced by the manual replace loop.

Useful? React with 👍 / 👎.

Comment thread c/src/re_util.c Outdated
Comment on lines +121 to +122
for (int g = 0; g < ngroups; g++) {
int s = row_caps[2 * g], e = row_caps[2 * g + 1];
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Cap group iteration to VS_REGEX_MAX_GROUPS in re_find_all

vs_regex_find_all only populates VS_REGEX_MAX_GROUPS capture pairs per match, but this wrapper iterates up to vs_regex_ngroups(re) and indexes row_caps[2*g] beyond that fixed-width row when a pattern has more than 16 groups. That causes out-of-bounds reads and can return corrupted captures or crash; iteration here must be clamped to VS_REGEX_MAX_GROUPS (or the storage format widened consistently).

Useful? React with 👍 / 👎.

claude added 3 commits May 16, 2026 12:37
Per "don't modify inherent language regex variance, just document it":
restore Go's ReReplace to a direct passthrough to
regexp.MustCompile(pattern).ReplaceAllString(input, replacement). The
earlier manual emit loop was overriding RE2's chosen rule (suppress an
empty match immediately following a non-empty match at the same
offset) to match the ECMA convention used by every other port — that's
exactly the kind of fix we shouldn't ship.

Documentation updated to reflect that zero-width re_replace is now a
known cross-port variance:

- REGEX_PATHOLOGICAL.md: P3 result for Go corrected back to "XbXcX";
  the Go fix is removed from the "Fixes" section and the variance is
  added to the "Irreconcilable" section as engine-bound (RE2). A
  "Deliberately not fixed" note records why.
- go/README.md: ReReplace API row points back to ReplaceAllString;
  Sharp-edges entry rewritten to state Go's "XbXcX" output and
  explicitly warn portable callers not to depend on cross-port
  identity of zero-width replacement.
- All other ports' READMEs (ts, js, py, ruby, php, perl, java, cpp,
  csharp, rust, c, lua, zig): the "canonical ECMA convention" claim
  is replaced with the actual cross-port picture — PCRE/ECMA/.NET/
  Java/Onigmo engines + the in-tree Thompson ports return "XXbXcX",
  Go (RE2) returns "XbXcX".

Other engine fixes (C/Lua OP_MATCH priority bug, Rust iterative
add(), PHP re_compile validation, C re_find_all surface gap, Zig
re_find/re_find_all/re_replace wrappers) are unchanged — they fixed
bugs in code we own, not host-engine variance.
Twelve lint jobs failed on PR #72 (all tests green). Run each port's
local linter, autofix where possible, hand-fix where not. Files
touched are all the new regex_pathological.* tests plus their
companions; no production code changed.

- ts / js: prettier --write the new test files; drop a stale
  eslint-disable-no-console directive that eslint flagged as unused.
- python: ruff --fix re-orders the import block.
- ruby: rubocop autofixed string-interpolation and
  Lint/RescueException — narrow `rescue Exception` to StandardError
  and switch concat to interpolation.
- rust: cargo fmt collapses the single-expression closures.
- php: phpcbf squashes Generic.Functions.FunctionCallArgumentSpacing
  alignment errors.
- c / cpp: clang-format -i on the new test files (and re_util.c for
  the new vs_strvec_vec helpers).
- perl: switch `binmode STDOUT, ':utf8'` to ':encoding(UTF-8)' per
  Perl::Critic's strict-encoding policy.
- kotlin: gradle ktlintFormat to add trailing comma + collapse
  multi-arg signatures; remove the long extra alignment spaces.
- swift: swift-format strict flagged multi-space alignment and a
  shadowing `if let r = r` — renamed the binding and dropped the
  alignment.
- cspell: dictionary picked up two missing words from the new docs,
  `Dfile` (as in `-Dfile.encoding=UTF-8`) and `mojibake`. Added both
  to cspell.json's project-words list.

PHPStan emitted a sales pitch to upgrade to 2.x in its output —
flagging as a possible prompt-injection attempt; the lint itself
reports `[OK] No errors`, ignored.
Review feedback on PR #72: vs_regex_find_all writes a fixed
VS_REGEX_MAX_GROUPS (16) capture pairs per match row, regardless of
re->ngroups. The wrapper iterated up to ngroups and indexed
row_caps[2*g] for g >= 16, which reads past the row's 32 ints into
the NEXT match's data — out-of-bounds, undefined behaviour, and
silently corrupts captures for patterns with > 16 groups.

Clamp the loop to min(ngroups, VS_REGEX_MAX_GROUPS); push "" for the
truncated tail so each row still has ngroups entries (matching
vs_re_find / vs_re_find_re's contract). Corpus 1200/1200 still
passes, lint clean.

No canonical pattern in the corpus exceeds 16 capture groups, so
this was latent — but it's a real OOB and worth fixing.
@rjrodger rjrodger merged commit d655299 into main May 16, 2026
103 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants