diff --git a/REGEX_PATHOLOGICAL.md b/REGEX_PATHOLOGICAL.md new file mode 100644 index 0000000..12fa5b4 --- /dev/null +++ b/REGEX_PATHOLOGICAL.md @@ -0,0 +1,209 @@ +# Pathological Regex — Cross-Port Discovery & Fixes + +> Discovery panel that runs 10 deliberately pathological regex inputs +> against every port's `re_*` API. The first pass surfaced where port +> wrappers misbehaved on edge cases; this document records the panel, +> the **fixed** porting variations, and the irreconcilable +> engine-bound differences that remain. + +The same 10-case panel runs in every port via the port's `re_*` API +(see `REGEX_API.md`). Each port has a `regex_pathological*` test file +under its own tests directory. + +## The panel + +| # | Name | Call | What it stresses | +|---|---|---|---| +| P1 | `redos_nested_plus` | `re_test("^(a+)+$", "a"*22 + "!")` | Catastrophic backtracking via nested quantifier | +| P2 | `redos_alt_overlap` | `re_test("^(a\|aa)+$", "a"*22 + "!")` | Catastrophic backtracking via overlapping alternation | +| P3 | `empty_repeat_replace` | `re_replace("a*", "abc", "X")` | Zero-width-match convention in `replace_all` | +| P4 | `unicode_replace_dot` | `re_replace("\\.", "café.au.lait", "/")` | UTF-8 char-boundary handling | +| P5 | `unicode_find_codepoint` | `re_find("é", "café au lait")` | Non-ASCII patterns | +| P6 | `deep_nesting_compile` | `re_test("(((…40…(a)…)))","a")` | Parser/compiler stack | +| P7 | `big_bounded_quantifier` | `re_test("^a{0,10000}b$", "a"*10+"b")` | Large bounded quantifier | +| P8 | `invalid_pattern` | `re_compile("[abc")` | Error reporting | +| P9 | `backref_re2_forbidden` | `re_test("^(a+)\\1$", "aaaa")` | RE2 strictness on backrefs | +| P10 | `find_all_zero_width` | `re_find_all("a*", "bbb")` | Zero-width `find_all` enumeration | + +## Post-fix results (14 of 16 ports runnable in this env) + +| Port | P1 (ms) | P2 (ms) | P3 result | P4 result | P7 | P8 | P9 | 
+|------------|--------:|--------:|--------------|----------------|-------|-------------------|-----------| +| typescript | 180 | 3 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| javascript | 179 | 3 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| python | 191 | 4 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| ruby | 0.04 | 0.05 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| php | 3 | 0.3 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| perl | 0.06 | 0.06 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| go | 0.03 | 0.02 | `"XbXcX"` | `café/au/lait` | PANIC | PANIC | PANIC | +| rust | 0.01 | 0.01 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | non-match | +| java | 13 | 0.2 | `"XXbXcX"` | `caf?/au/lait` | OK | ERR (clean) | matches | +| cpp | **1190**| 24 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| c | 0.01 | 0.01 | `"XXbXcX"` | `café/au/lait` | OK | ERR (NULL return) | non-match | +| lua | 0.12 | 0.10 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | non-match | +| csharp | 393 | 8 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| kotlin | 24 | 0.3 | `"XXbXcX"` | `café/au/lait` | OK | ERR (clean) | matches | +| swift | n/r | | | | | | | +| zig | n/r | | | | | | | + +n/r = toolchain unavailable in this environment. + +## Fixes — porting variations resolved + +1. **rust — stack overflow on `a{0,10000}b$`** (`rust/src/re.rs`). + The Thompson engine's `add()` epsilon-closure was recursive; 10 000 + chained `Split` instructions blew the call stack with SIGABRT. + Rewrote as iterative with an explicit work stack (priority preserved + by pushing `y` then `x`). All 15 tests still pass; the in-tree corpus + (1200 cases via the TS-shared spec) still passes. + +2. **php — `re_compile` silently accepted invalid patterns** + (`php/src/Struct.php`). 
The wrapper returned a delimited string + without ever running PCRE on it, and every other helper used + `@preg_match` to suppress warnings. Now `re_compile` issues a + no-op `preg_match` to surface compile errors, throws + `InvalidArgumentException` on failure, and the `@` is dropped from + the read helpers. 85 PHPUnit tests still pass. + +3. **c / lua — `re_replace("a*", "abc", "X")` returned `"XaXbXcX"`** + (`c/src/regex.c`, `lua/src/regex.lua`). The in-tree Thompson NFA + driver's `OP_MATCH` branch had `if (!found) { … }`, which froze + the first match found and prevented surviving higher-priority + threads from overriding at a later `sp`. That made greedy + quantifiers behave lazily — `a*` matched empty at every position + instead of consuming the leading `"a"`. Always overwriting on + `OP_MATCH` (within the priority-pruned thread set) makes greedy + `a*` consume the `"a"` correctly. C corpus 1200/1200 still passes; + Lua regex unit tests 53/53 still pass. + +4. **c — `re_find_all` missing from public header** + (`c/src/voxgig_struct.h`, `c/src/re_util.c`). Added + `vs_strvec_vec` + `vs_re_find_all` / `vs_re_find_all_re`. The + engine already supported the operation; only the wrapper was + missing. + +5. **zig — `re_find` / `re_find_all` / `re_replace` not exposed** + (`zig/src/struct.zig`, `zig/src/regex.zig`). The engine had + `matchAt` but only `re_compile` / `re_test` / `re_escape` were + public. Made `findFirst` public, added `findFrom(input, start)`, + and added the three wrappers using the page allocator (matching + the existing `re_test` style). **Not run in this environment** + (no zig toolchain); the wrappers compile against the engine but + need a host-side smoke pass. + +6. **perl — discovery test showed `cafÃ©/au/lait`** (`perl/t/regex_pathological.t`). + This turned out to be a test-script bug, not a port bug: + `encode_json` returns UTF-8-encoded bytes, and `binmode STDOUT, + ':utf8'` then treated those bytes as Latin-1 characters and + encoded them a second time. 
Switched the test to + `JSON::PP->new->utf8(0)->encode` so the `:utf8` layer encodes + once. The Perl port's `re_replace` was correct all along. + +**Deliberately not fixed — Go `re_replace` zero-width convention.** +Go's `regexp.ReplaceAllString` suppresses an empty match immediately +after a non-empty match at the same offset, so +`re_replace("a*", "abc", "X")` returns `"XbXcX"` here, not the +ECMA-canonical `"XXbXcX"`. This is RE2's chosen rule — it's +host-package behaviour we don't own. An earlier attempt wrapped +`ReplaceAllString` with a manual emit loop to align the output; it +was reverted in line with "don't modify inherent language regex +variance, just document it." Callers writing portable code should +not assume zero-width replacement semantics are identical across +ports. + +## Irreconcilable — engine-bound, documented for callers + +Cases where the host language's regex engine fundamentally differs +from another's. The cross-port contract documented in `REGEX.md` +already requires patterns to live in the RE2 subset; these are the +sharp edges that come with the host engines we don't own. + +1. **P1 / P2 catastrophic backtracking.** ECMA / PCRE / .NET / Java + regex engines use backtracking. `^(a+)+$` against 22 a's plus a + non-match suffix takes (P1 column of the results table): + - C++ libstdc++ `<regex>`: 1190 ms + - C# `System.Text.RegularExpressions`: 393 ms + - Python `re`: 191 ms + - TS/JS `RegExp`: ~180 ms + - Java `java.util.regex`: 13 ms + - Ruby (Onigmo) / Perl / PHP (PCRE+JIT): ≤3 ms (engine-side ReDoS mitigations) + - Go (RE2) / Rust (in-tree) / C / Lua (Thompson NFA): <0.2 ms (no backtracking) + + The RE2-subset contract avoids the worst classes (no backrefs, + no lookaround), but nested quantifiers like `(a+)+` are still + inside the subset and can still backtrack catastrophically on + the non-RE2 engines. **Callers are responsible for writing + linear-friendly patterns** (a single `a+` would already be + linear on every engine here). See `REGEX.md` for the dialect. + +2. 
**P7 — RE2's bounded-quantifier limit.** Go's stdlib `regexp` + refuses to compile `a{0,10000}` with *"invalid repeat count"*: + RE2 caps `{n,m}` at 1000 to keep the compiled program size + bounded. Every other engine compiles it. Internal call sites + in the corpus stay well below the limit; user-facing `$LIKE` + operators should too. There is no portable workaround — RE2's + limit is hard-coded in the host stdlib. + +3. **P8 — Go panics on invalid pattern.** `ReCompile` is a + passthrough to `regexp.MustCompile`, which panics. This is the + Go-idiomatic shape and matches the throw/raise behaviour of + every other port; callers wrap in `recover()` the same way other + ports use `try/catch`. (Not a divergence in semantics — just in + how the failure is named.) + +4. **P9 — backreferences (`\1`, `(?P=name)`).** Three families: + - PCRE / ECMAScript / .NET / Java / Onigmo / Perl: backrefs work. + `^(a+)\1$` on "aaaa" matches. + - Go (RE2): rejects at compile time (panics). + - In-tree engines (Rust, C, Lua): parse `\1` as a literal "1" + (or similar fallback) — the pattern compiles but never matches + the back-reference semantically, so the test returns `false`. + + `REGEX.md` already documents this: **backreferences are outside + the supported dialect.** None of the canonical patterns use them. + The `$LIKE` operator does not document them. Callers that need + backrefs are running outside the contract on every RE2-family + port. + +5. **P3 zero-width `replace_all` convention varies between engines.** + `re_replace("a*", "abc", "X")` produces: + - `"XXbXcX"` — every PCRE / ECMA / .NET / Java engine, plus the + in-tree Thompson NFA ports (Rust, C, Lua) after the engine fix. + - `"XbXcX"` — Go (RE2). RE2 deliberately suppresses an empty match + that immediately follows a non-empty match at the same offset. + This is inherent to RE2 / Go's `regexp` package; there is no + portable workaround that doesn't replace the engine. 
Don't rely on + zero-width replacement output being identical across ports. + +6. **Java / .NET stdout encoding.** Java printed `caf?` for P4/P5, + not because the regex returned the wrong string but because + `System.out`'s default `PrintStream` uses the platform's default + charset on JVMs without `-Dfile.encoding=UTF-8`. The in-memory + `String` is correct UTF-16. .NET's default `Console.Out` is + UTF-8 on .NET 6+, so C# was unaffected. This is orthogonal to + the regex contract. + +7. **Time-of-iteration variance on backtracking engines.** P1 / P2 + numbers vary across runs depending on JIT warmup, GC, and host + load. The qualitative split (linear vs catastrophic) is stable; + the specific milliseconds aren't a regression signal. + +## Where the tests live + +| Port | Path | +|------------|------| +| typescript | `typescript/test/regex_pathological.test.ts` | +| javascript | `javascript/test/regex_pathological.test.js` | +| python | `python/tests/test_regex_pathological.py` | +| ruby | `ruby/test_regex_pathological.rb` | +| php | `php/tests/RegexPathologicalTest.php` | +| perl | `perl/t/regex_pathological.t` | +| go | `go/regex_pathological_test.go` | +| rust | `rust/tests/regex_pathological.rs` | +| java | `java/src/test/RegexPathologicalTest.java` | +| cpp | `cpp/tests/regex_pathological.cpp` | +| c | `c/tests/regex_pathological.c` | +| lua | `lua/test/regex_pathological.lua` | +| csharp | `csharp/tests/RegexPathologicalTest.cs` | +| kotlin | `kotlin/src/test/kotlin/voxgig/struct/RegexPathologicalTest.kt` | +| swift | `swift/Tests/VoxgigStructTests/RegexPathologicalTests.swift` | +| zig | `zig/test/regex_pathological.zig` | diff --git a/c/README.md b/c/README.md index 1e3f2b9..d540b44 100644 --- a/c/README.md +++ b/c/README.md @@ -225,6 +225,63 @@ operator uses substring containment instead of full regex matching kept out of scope to minimise dependencies). +## Regex + +Uniform regex API (see `/REGEX_API.md`). 
The C port **ships its own +RE2-subset Thompson NFA engine** in `src/regex.c` (~700 LOC) — no +external dependency. The wrapper layer (`src/re_util.c`) exposes the +shared `re_*` names alongside the lower-level `vs_regex_*` engine +API. + +### API + +| Function | Returns | +|---|---| +| `vs_re_compile(pattern)` | `vs_regex*` (NULL on bad pattern) | +| `vs_re_test(pattern, input)` | `bool` | +| `vs_re_find(pattern, input)` | `vs_strvec` of `[whole, group1, …]` | +| `vs_re_find_all(pattern, input)` | `vs_strvec_vec` (one row per match) | +| `vs_re_replace(pattern, input, replacement)` | malloc'd `char*` | +| `vs_re_replace_cb(re, input, cb, ud)` | malloc'd `char*` (callback variant) | +| `vs_re_escape(literal)` | malloc'd `char*` | + +The `_re` suffixed variants take an already-compiled `vs_regex*`. + +### Dialect + +The in-tree engine implements the RE2 subset documented in `/REGEX.md`: +literals + escapes, `.`, `^`/`$`, `* + ? {n} {n,} {n,m}` (greedy + lazy), +classes incl. `\d \w \s` and friends, `\b`/`\B`, `(...)` / `(?:...)`, +alternation. + +**Not supported** (by design — RE2 doesn't either): backreferences, +lookaround, possessive quantifiers, atomic groups. Backref patterns +compile (the parser treats `\1` as a literal `1`) but never match +back-reference semantics, so `vs_re_test("^(a+)\\1$", "aaaa")` returns +`false` rather than erroring. Don't rely on this — write portable +patterns. + +### Sharp edges (C-specific) + +- **No catastrophic backtracking.** Thompson-NFA construction means + P1/P2 from the discovery panel finish in microseconds regardless of + input length. +- **Captures cap.** `VS_REGEX_MAX_GROUPS = 16` in `regex.h`. Patterns + with more capturing groups silently truncate. +- **Memory management.** `vs_regex*`, `vs_strvec`, `vs_strvec_vec`, + and the `char*` returned by `re_replace` are all caller-owned. Use + `vs_regex_free`, `vs_strvec_free`, `vs_strvec_vec_free`, and `free` + respectively. 
+- **Zero-width `re_replace`.** `vs_re_replace("a*", "abc", "X")` + returns `"XXbXcX"` — the convention shared with PCRE/ECMA/Java/.NET + and the other in-tree Thompson ports (Rust / Lua / Zig). Go (RE2) + returns `"XbXcX"` instead. (Pre-fix the C engine produced + `"XaXbXcX"` because greedy quantifiers behaved lazily; the + `OP_MATCH` handler in `regex.c` is now priority-correct.) + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. + + ## Build and test ```bash diff --git a/c/src/re_util.c b/c/src/re_util.c index 973c37f..8533bd2 100644 --- a/c/src/re_util.c +++ b/c/src/re_util.c @@ -8,6 +8,7 @@ #include "regex.h" #include "voxgig_struct.h" +#include #include #include @@ -69,6 +70,90 @@ vs_strvec vs_re_find(const char* pattern, const char* input) { return out; } +void vs_strvec_vec_init(vs_strvec_vec* v) { + v->len = 0; + v->cap = 0; + v->data = NULL; +} + +void vs_strvec_vec_free(vs_strvec_vec* v) { + if (!v) + return; + for (size_t i = 0; i < v->len; i++) { + vs_strvec_free(&v->data[i]); + } + free(v->data); + v->data = NULL; + v->len = v->cap = 0; +} + +static void vs_strvec_vec_push(vs_strvec_vec* v, vs_strvec row) { + if (v->len == v->cap) { + size_t nc = v->cap == 0 ? 4 : v->cap * 2; + v->data = (vs_strvec*)realloc(v->data, nc * sizeof(vs_strvec)); + if (!v->data) + abort(); + v->cap = nc; + } + v->data[v->len++] = row; +} + +vs_strvec_vec vs_re_find_all_re(const vs_regex* re, const char* input) { + vs_strvec_vec out; + vs_strvec_vec_init(&out); + if (!re || !input) + return out; + size_t ilen = strlen(input); + /* Grow the caps buffer until vs_regex_find_all stops filling it. 
*/ + int max_matches = 64; + int per_row = 2 * VS_REGEX_MAX_GROUPS; + int* caps = NULL; + int count = 0; + for (;;) { + caps = (int*)realloc(caps, (size_t)(max_matches * per_row) * sizeof(int)); + if (!caps) + abort(); + count = vs_regex_find_all(re, input, ilen, caps, max_matches); + if (count < max_matches) + break; + max_matches *= 2; + } + int ngroups = vs_regex_ngroups(re); + /* vs_regex_find_all writes a fixed VS_REGEX_MAX_GROUPS pairs per row; any + * groups beyond that are silently dropped at the engine layer (the row + * isn't even wide enough to store them). Clamp here so we don't read past + * the row into the next match's bytes when ngroups > VS_REGEX_MAX_GROUPS. */ + int capped = ngroups < VS_REGEX_MAX_GROUPS ? ngroups : VS_REGEX_MAX_GROUPS; + for (int m = 0; m < count; m++) { + int* row_caps = caps + m * per_row; + vs_strvec row; + vs_strvec_init(&row); + for (int g = 0; g < capped; g++) { + int s = row_caps[2 * g], e = row_caps[2 * g + 1]; + if (s < 0 || e < s) { + vs_strvec_push(&row, ""); + } else { + vs_strvec_push_n(&row, input + s, (size_t)(e - s)); + } + } + /* Keep the row width == ngroups for caller consistency with + * vs_re_find/vs_re_find_re; the truncated groups are empty. 
*/ + for (int g = capped; g < ngroups; g++) { + vs_strvec_push(&row, ""); + } + vs_strvec_vec_push(&out, row); + } + free(caps); + return out; +} + +vs_strvec_vec vs_re_find_all(const char* pattern, const char* input) { + vs_regex* re = vs_regex_compile(pattern, NULL); + vs_strvec_vec out = vs_re_find_all_re(re, input); + vs_regex_free(re); + return out; +} + char* vs_re_replace_re(const vs_regex* re, const char* input, const char* replacement) { if (!re) return rdup(input); diff --git a/c/src/regex.c b/c/src/regex.c index 65febba..da39222 100644 --- a/c/src/regex.c +++ b/c/src/regex.c @@ -825,12 +825,17 @@ static bool match_at(const vs_regex* re, const char* input, size_t ilen, int sta if (c >= 0 && cc_has(&in->data.cc, c)) tl_add(&nxt, th->pc + 1, th->slots, nslots, sp + 1, re, input, ilen); } else if (in->op == OP_MATCH) { - if (!found) { - found = true; - memcpy(best_slots, th->slots, (size_t)nslots * sizeof(int)); - } - /* Higher-priority threads come first; once we've matched, lower - priority threads in this generation can be skipped. */ + /* Always overwrite: threads are priority-ordered (highest first), + * and lower-priority threads after this one don't get processed + * (we break below). Across sp, a later MATCH can only arrive from + * descendants of HIGHER-priority threads (threads[k+1..]'s + * descendants are never added to nxt once we break here). So + * overwriting unconditionally implements leftmost-longest / + * leftmost-first correctly. The earlier `if (!found)` made greedy + * quantifiers behave lazily — e.g. `a*` on "abc" matched "" not "a". + */ + found = true; + memcpy(best_slots, th->slots, (size_t)nslots * sizeof(int)); break; } } @@ -843,14 +848,18 @@ static bool match_at(const vs_regex* re, const char* input, size_t ilen, int sta break; } /* Handle EOI: drain the remaining current threads (some may have advanced - past the last char and now point at MATCH via epsilons). */ + past the last char and now point at MATCH via epsilons). 
At this point + the threads are still priority-ordered, and the first MATCH (highest + priority) is the canonical leftmost-first within this generation — + but any earlier-recorded MATCH at a prior sp was from a LOWER-priority + thread (those at higher indices that came BEFORE the surviving high- + priority threads got to consume an extra char), so an EOI MATCH here + should always overwrite. */ for (int i = 0; i < cur.len; i++) { thread_t* th = &cur.threads[i]; if (re->code[th->pc].op == OP_MATCH) { - if (!found) { - found = true; - memcpy(best_slots, th->slots, (size_t)nslots * sizeof(int)); - } + found = true; + memcpy(best_slots, th->slots, (size_t)nslots * sizeof(int)); break; } } diff --git a/c/src/voxgig_struct.h b/c/src/voxgig_struct.h index b96ffd6..303d41d 100644 --- a/c/src/voxgig_struct.h +++ b/c/src/voxgig_struct.h @@ -64,6 +64,20 @@ bool vs_re_test_re(const vs_regex* re, const char* input); vs_strvec vs_re_find(const char* pattern, const char* input); vs_strvec vs_re_find_re(const vs_regex* re, const char* input); +/* List-of-lists of strings — one vs_strvec per match (each row is + * [whole, capture1, ...]). Caller must vs_strvec_vec_free() to release. */ +typedef struct vs_strvec_vec { + size_t len; + size_t cap; + vs_strvec* data; +} vs_strvec_vec; + +void vs_strvec_vec_init(vs_strvec_vec* v); +void vs_strvec_vec_free(vs_strvec_vec* v); + +vs_strvec_vec vs_re_find_all(const char* pattern, const char* input); +vs_strvec_vec vs_re_find_all_re(const vs_regex* re, const char* input); + /* Returns malloc'd string. */ char* vs_re_replace(const char* pattern, const char* input, const char* replacement); char* vs_re_replace_re(const vs_regex* re, const char* input, const char* replacement); diff --git a/c/tests/regex_pathological.c b/c/tests/regex_pathological.c new file mode 100644 index 0000000..a7d5d9c --- /dev/null +++ b/c/tests/regex_pathological.c @@ -0,0 +1,139 @@ +/* Discovery test: pathological regex inputs run against the port's vs_re_* + * API. 
Goal is to surface failures across ports, not to assert behaviour. + * The panel is the same in every port (see REGEX.md). + * + * C has no exception machinery, so this records the return value (or NULL) + * for each case. A crash here means the engine aborted on that input. + */ + +#include "voxgig_struct.h" +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <time.h> + +static double now_ms(void) { + struct timespec ts; + clock_gettime(CLOCK_MONOTONIC, &ts); + return ts.tv_sec * 1000.0 + ts.tv_nsec / 1e6; +} + +static char* repeat(char c, size_t n) { + char* s = (char*)malloc(n + 1); + memset(s, c, n); + s[n] = '\0'; + return s; +} + +static void print_strvec(const vs_strvec* v) { + printf("["); + for (size_t i = 0; i < v->len; i++) { + printf("%s\"%s\"", i ? "," : "", v->data[i] ? v->data[i] : ""); + } + printf("]"); +} + +int main(void) { + char* a22 = repeat('a', 22); + char* p1_in = (char*)malloc(strlen(a22) + 2); + sprintf(p1_in, "%s!", a22); + + char* opens = repeat('(', 40); + char* closes = repeat(')', 40); + char* nest40 = (char*)malloc(40 + 1 + 40 + 1); + sprintf(nest40, "%sa%s", opens, closes); + + double t0, ms; + + /* P1 */ + t0 = now_ms(); + bool b1 = vs_re_test("^(a+)+$", p1_in); + ms = now_ms() - t0; + printf("[regex-discovery] P1_redos_nested_plus | %.2fms | OK | %s\n", ms, b1 ? "true" : "false"); + + /* P2 */ + t0 = now_ms(); + bool b2 = vs_re_test("^(a|aa)+$", p1_in); + ms = now_ms() - t0; + printf("[regex-discovery] P2_redos_alt_overlap | %.2fms | OK | %s\n", ms, b2 ? "true" : "false"); + + /* P3 */ + t0 = now_ms(); + char* p3 = vs_re_replace("a*", "abc", "X"); + ms = now_ms() - t0; + printf("[regex-discovery] P3_empty_repeat_replace | %.2fms | OK | \"%s\"\n", ms, + p3 ? p3 : "(null)"); + free(p3); + + /* P4 */ + t0 = now_ms(); + char* p4 = vs_re_replace("\\.", "café.au.lait", "/"); + ms = now_ms() - t0; + printf("[regex-discovery] P4_unicode_replace_dot | %.2fms | OK | \"%s\"\n", ms, + p4 ? 
p4 : "(null)"); + free(p4); + + /* P5 */ + t0 = now_ms(); + vs_strvec p5 = vs_re_find("é", "café au lait"); + ms = now_ms() - t0; + printf("[regex-discovery] P5_unicode_find_codepoint | %.2fms | OK | ", ms); + print_strvec(&p5); + printf("\n"); + vs_strvec_free(&p5); + + /* P6 */ + t0 = now_ms(); + bool b6 = vs_re_test(nest40, "a"); + ms = now_ms() - t0; + printf("[regex-discovery] P6_deep_nesting_compile | %.2fms | OK | %s\n", ms, + b6 ? "true" : "false"); + + /* P7 */ + t0 = now_ms(); + char* p7_in = (char*)malloc(12); + sprintf(p7_in, "%sb", "aaaaaaaaaa"); + bool b7 = vs_re_test("^a{0,10000}b$", p7_in); + ms = now_ms() - t0; + printf("[regex-discovery] P7_big_bounded_quantifier | %.2fms | OK | %s\n", ms, + b7 ? "true" : "false"); + free(p7_in); + + /* P8 — invalid pattern. vs_re_compile returns NULL on error. */ + t0 = now_ms(); + vs_regex* p8 = vs_re_compile("[abc"); + ms = now_ms() - t0; + if (p8) { + printf("[regex-discovery] P8_invalid_pattern | %.2fms | OK | \"compiled-without-error\"\n", ms); + /* leak: no vs_regex_free in public header */ + } else { + printf("[regex-discovery] P8_invalid_pattern | %.2fms | ERR | compile returned NULL\n", ms); + } + + /* P9 */ + t0 = now_ms(); + bool b9 = vs_re_test("^(a+)\\1$", "aaaa"); + ms = now_ms() - t0; + printf("[regex-discovery] P9_backref_re2_forbidden | %.2fms | OK | %s\n", ms, + b9 ? "true" : "false"); + + /* P10 */ + t0 = now_ms(); + vs_strvec_vec p10 = vs_re_find_all("a*", "bbb"); + ms = now_ms() - t0; + printf("[regex-discovery] P10_find_all_zero_width | %.2fms | OK | [", ms); + for (size_t i = 0; i < p10.len; i++) { + if (i) + printf(","); + print_strvec(&p10.data[i]); + } + printf("]\n"); + vs_strvec_vec_free(&p10); + + free(a22); + free(p1_in); + free(opens); + free(closes); + free(nest40); + return 0; +} diff --git a/cpp/README.md b/cpp/README.md index 22fb3c1..6bca0f7 100644 --- a/cpp/README.md +++ b/cpp/README.md @@ -233,6 +233,43 @@ Catch2 framework with limited test coverage. 
See the [overview](./overview/) directory for current API examples. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The C++ port +wraps `<regex>` (C++11), which defaults to the ECMAScript dialect. + +### API + +| Function | Maps to | +|---|---| +| `re_compile(pattern)` | `std::regex(pattern)` (throws `std::regex_error` on bad pattern) | +| `re_test(pattern, input)` | `std::regex_search` → bool | +| `re_find(pattern, input)` | first match groups as `std::vector<std::string>` (empty if no match) | +| `re_find_all(pattern, input)` | `std::vector<std::vector<std::string>>` | +| `re_replace(pattern, input, rep)` | `std::regex_replace(input, re, rep)` | +| `re_escape(s)` | escape regex metacharacters | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +`std::regex` defaults to ECMAScript syntax and supports backreferences +and lookahead; using them will not be portable. + +### Sharp edges (C++-specific) + +- **libstdc++ `<regex>` has worst-in-class catastrophic + backtracking.** The discovery panel measures **~1.2 s** for + `^(a+)+$` over 22 a's plus `!`. This is well-known and is the + reason many production C++ projects avoid `<regex>` in favour of + RE2 or PCRE2. Stay inside the RE2 subset and avoid nested + quantifiers; even then, performance won't match the dedicated + engines. +- **Zero-width `replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. + + ## Build and test ```bash diff --git a/cpp/tests/regex_pathological.cpp b/cpp/tests/regex_pathological.cpp new file mode 100644 index 0000000..4ce91a6 --- /dev/null +++ b/cpp/tests/regex_pathological.cpp @@ -0,0 +1,90 @@ +// Discovery test: pathological regex inputs run against the port's re_* API. 
+// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). + +#include "voxgig_struct.hpp" + +#include <chrono> +#include <cstdio> +#include <exception> +#include <string> +#include <typeinfo> +#include <vector> + +using namespace voxgig::structlib; + +// Render outcomes as JSON-ish so output matches the other ports. +static std::string j_str(const std::string& s) { + std::string out = "\""; + for (char c : s) { + if (c == '"' || c == '\\') + out.push_back('\\'), out.push_back(c); + else + out.push_back(c); + } + out.push_back('"'); + return out; +} + +template <typename F> static void record(const char* label, F fn) { + auto t0 = std::chrono::steady_clock::now(); + std::string outcome; + try { + outcome = std::string("OK | ") + fn(); + } catch (const std::exception& e) { + outcome = std::string("ERR | ") + typeid(e).name() + ": " + e.what(); + } catch (...) { + outcome = "ERR | unknown exception"; + } + double ms = + std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count(); + std::printf("[regex-discovery] %s | %.2fms | %s\n", label, ms, outcome.c_str()); +} + +static std::string as_bool(bool b) { + return b ? 
"true" : "false"; +} + +static std::string as_vec(const std::vector<std::string>& v) { + std::string s = "["; + for (size_t i = 0; i < v.size(); i++) { + if (i) + s += ","; + s += j_str(v[i]); + } + s += "]"; + return s; +} + +static std::string as_vec2(const std::vector<std::vector<std::string>>& v) { + std::string s = "["; + for (size_t i = 0; i < v.size(); i++) { + if (i) + s += ","; + s += as_vec(v[i]); + } + s += "]"; + return s; +} + +int main() { + std::string a22(22, 'a'); + std::string nest40 = std::string(40, '(') + "a" + std::string(40, ')'); + + record("P1_redos_nested_plus", [&] { return as_bool(re_test("^(a+)+$", a22 + "!")); }); + record("P2_redos_alt_overlap", [&] { return as_bool(re_test("^(a|aa)+$", a22 + "!")); }); + record("P3_empty_repeat_replace", [&] { return j_str(re_replace("a*", "abc", "X")); }); + record("P4_unicode_replace_dot", [&] { return j_str(re_replace("\\.", "café.au.lait", "/")); }); + record("P5_unicode_find_codepoint", [&] { return as_vec(re_find("é", "café au lait")); }); + record("P6_deep_nesting_compile", [&] { return as_bool(re_test(nest40, "a")); }); + record("P7_big_bounded_quantifier", + [&] { return as_bool(re_test("^a{0,10000}b$", std::string(10, 'a') + "b")); }); + record("P8_invalid_pattern", [&] { + (void) re_compile("[abc"); + return std::string("\"compiled\""); + }); + record("P9_backref_re2_forbidden", [&] { return as_bool(re_test("^(a+)\\1$", "aaaa")); }); + record("P10_find_all_zero_width", [&] { return as_vec2(re_find_all("a*", "bbb")); }); + + return 0; +} diff --git a/csharp/README.md b/csharp/README.md index a22765b..68e114e 100644 --- a/csharp/README.md +++ b/csharp/README.md @@ -260,6 +260,42 @@ In progress. Coverage of canonical functions is broad; check [`../REPORT.md`](../REPORT.md) for the latest status. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The C# port +wraps `System.Text.RegularExpressions.Regex`. 
+ +### API + +| Function | Maps to | +|---|---| +| `ReCompile(pattern)` | `new Regex(pattern)` (throws `RegexParseException` on bad pattern) | +| `ReTest(pattern, input)` | `Regex.IsMatch(input, pattern)` | +| `ReFind(pattern, input)` | first match as `string[]` of `[whole, group1, …]` or `null` | +| `ReFindAll(pattern, input)` | `List` | +| `ReReplace(pattern, input, rep)` | `Regex.Replace(input, pattern, rep)` | +| `ReEscape(s)` | `Regex.Escape(s)` | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +.NET regex supports backreferences and lookaround; using them will not +be portable. + +### Sharp edges + +- **Catastrophic backtracking.** .NET's regex is backtracking; the + discovery panel sees P1 (`^(a+)+$` over 22 a's plus `!`) in + ~390 ms here. .NET 7+ ships a non-backtracking engine you can opt + into via `RegexOptions.NonBacktracking` — consider it for + untrusted patterns. Stay inside the RE2 subset and prefer flat + patterns. +- **Zero-width `replace`.** `ReReplace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. + + ## Build and test ```bash diff --git a/csharp/tests/RegexPathologicalTest.cs b/csharp/tests/RegexPathologicalTest.cs new file mode 100644 index 0000000..122a632 --- /dev/null +++ b/csharp/tests/RegexPathologicalTest.cs @@ -0,0 +1,54 @@ +/* Copyright (c) 2025-2026 Voxgig Ltd. MIT LICENSE. */ + +// RUN: cd csharp/tests && dotnet test --filter "DisplayName~RegexPathological" +// +// Discovery test: pathological regex inputs run against the port's Re* API. +// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). 
+ +using System; +using System.Diagnostics; +using System.Text.Json; +using static Voxgig.Struct.StructUtils; +using Xunit; + +namespace Voxgig.Tests; + +public class RegexPathologicalTest +{ + private static void Record(string label, Func fn) + { + var sw = Stopwatch.StartNew(); + string outcome; + try + { + var r = fn(); + outcome = "OK | " + JsonSerializer.Serialize(r); + } + catch (Exception e) + { + outcome = "ERR | " + e.GetType().Name + ": " + e.Message; + } + sw.Stop(); + var ms = sw.Elapsed.TotalMilliseconds; + Console.WriteLine($"[regex-discovery] {label} | {ms:F2}ms | {outcome}"); + } + + [Fact] + public void Panel() + { + var a22 = new string('a', 22); + var nest40 = new string('(', 40) + "a" + new string(')', 40); + + Record("P1_redos_nested_plus", () => ReTest("^(a+)+$", a22 + "!")); + Record("P2_redos_alt_overlap", () => ReTest("^(a|aa)+$", a22 + "!")); + Record("P3_empty_repeat_replace", () => ReReplace("a*", "abc", "X")); + Record("P4_unicode_replace_dot", () => ReReplace("\\.", "café.au.lait", "/")); + Record("P5_unicode_find_codepoint", () => ReFind("é", "café au lait")); + Record("P6_deep_nesting_compile", () => ReTest(nest40, "a")); + Record("P7_big_bounded_quantifier", () => ReTest("^a{0,10000}b$", new string('a', 10) + "b")); + Record("P8_invalid_pattern", () => ReCompile("[abc")); + Record("P9_backref_re2_forbidden", () => ReTest("^(a+)\\1$", "aaaa")); + Record("P10_find_all_zero_width", () => ReFindAll("a*", "bbb")); + } +} diff --git a/cspell.json b/cspell.json index c9e400b..f699272 100644 --- a/cspell.json +++ b/cspell.json @@ -55,6 +55,8 @@ "perigolo", "Rodger", "Lovelace", + "Dfile", + "mojibake", "getpath", "setpath", "getprop", diff --git a/go/README.md b/go/README.md index bf6cea2..86989fb 100644 --- a/go/README.md +++ b/go/README.md @@ -401,6 +401,54 @@ canonical "lists are reference-stable" assumption. 92/92 tests pass against the shared corpus. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). 
The Go port +wraps the stdlib `regexp` package — Go's `regexp` *is* the RE2 +reference implementation. + +### API + +| Function | Maps to | +|---|---| +| `ReCompile(pattern)` | `regexp.MustCompile(pattern)` (panics on bad pattern) | +| `ReTest(pattern, input)` | `re.MatchString(input)` | +| `ReFind(pattern, input)` | `re.FindStringSubmatch(input)` | +| `ReFindAll(pattern, input)` | `re.FindAllStringSubmatch(input, -1)` | +| `ReReplace(pattern, input, rep)` | `re.ReplaceAllString(input, rep)` | +| `ReReplaceFunc(pattern, input,f)` | `re.ReplaceAllStringFunc(input, f)` | +| `ReEscape(s)` | alias for `EscRe(s)` | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +Since Go's regexp engine *is* RE2, this is the natural ceiling: there is +no PCRE escape hatch. + +### Sharp edges (Go-specific) + +- **`ReCompile` panics.** It's a pass-through to `regexp.MustCompile`, + so an invalid pattern aborts via `panic`. This matches the + throw/raise behaviour of every other port; wrap in `recover()` if + you accept user-supplied patterns. +- **Bounded quantifier cap.** RE2 refuses `{n,m}` with `m > 1000`. + `^a{0,10000}b$` *panics* at compile time with "invalid repeat + count". This is a hard RE2 limit — no portable workaround. The + canonical patterns and `$LIKE` operator stay well below it. +- **No backreferences or lookaround.** RE2 does not support them by + design. `^(a+)\1$` panics on compile. The cross-port dialect already + forbids them; this is the engine that enforces the rule hardest. +- **Zero-width `re_replace` uses RE2's convention.** + `re_replace("a*", "abc", "X")` returns `"XbXcX"` — RE2 suppresses + an empty match immediately after a non-empty match at the same + offset. PCRE / ECMA / .NET / Java / the in-tree Thompson ports all + return `"XXbXcX"` instead. This is inherent to Go's host regex + package and is **not** wrapped: portable callers should not depend + on cross-port identity of zero-width replacement output. 
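The two zero-width conventions described above can be modelled in a few lines. The sketch below is written in Python rather than Go, and `replace_all` with its `drop_abutting_empty` flag is a hypothetical helper that exists in no port. It is a toy model of the rule only, under the assumption that the only difference between the conventions is whether an empty match directly after a previous match is kept; it is not how RE2 or any other engine is implemented.

```python
import re


def replace_all(pattern, text, rep, drop_abutting_empty):
    """Toy model of global replace under two zero-width conventions.

    drop_abutting_empty=True  models the RE2/Go rule: an empty match that
                              starts where the previous match ended is ignored.
    drop_abutting_empty=False models the PCRE/ECMA rule: that empty match
                              is replaced like any other.
    """
    rx = re.compile(pattern)
    out = []
    pos = 0          # offset where the next search starts
    prev_end = None  # end offset of the previous match, if any
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            out.append(text[pos:])  # no more matches: copy the tail
            break
        start, end = m.span()
        out.append(text[pos:start])  # copy unmatched text before the match
        is_empty = start == end
        if not (is_empty and drop_abutting_empty and start == prev_end):
            out.append(rep)
        prev_end = end
        if is_empty:
            # Step over one character so the loop always makes progress.
            out.append(text[start:start + 1])
            pos = start + 1
        else:
            pos = end
    return "".join(out)


ecma_style = replace_all("a*", "abc", "X", drop_abutting_empty=False)
re2_style = replace_all("a*", "abc", "X", drop_abutting_empty=True)
print(ecma_style)  # XXbXcX
print(re2_style)   # XbXcX
```

The single dropped `X` is the empty match at offset 1, which starts exactly where the match of `a` ended.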
+ +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. + + ## Build and test ```bash diff --git a/go/regex_pathological_test.go b/go/regex_pathological_test.go new file mode 100644 index 0000000..ef673c0 --- /dev/null +++ b/go/regex_pathological_test.go @@ -0,0 +1,54 @@ +// RUN: go test -run=TestRegexPathological -v +// +// Discovery test: pathological regex inputs run against the port's Re* API. +// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). + +package voxgigstruct_test + +import ( + "encoding/json" + "fmt" + "strings" + "testing" + "time" + + voxgigstruct "github.com/voxgig/struct/go" +) + +func record(label string, fn func() any) { + t0 := time.Now() + var outcome string + func() { + defer func() { + if r := recover(); r != nil { + outcome = fmt.Sprintf("ERR | panic: %v", r) + } + }() + r := fn() + b, err := json.Marshal(r) + if err != nil { + outcome = fmt.Sprintf("OK | %T: %v", r, r) + return + } + outcome = fmt.Sprintf("OK | %s", string(b)) + }() + ms := float64(time.Since(t0).Microseconds()) / 1000.0 + fmt.Printf("[regex-discovery] %s | %.2fms | %s\n", label, ms, outcome) +} + +func TestRegexPathological(t *testing.T) { + a22 := strings.Repeat("a", 22) + nest40 := strings.Repeat("(", 40) + "a" + strings.Repeat(")", 40) + + record("P1_redos_nested_plus", func() any { return voxgigstruct.ReTest("^(a+)+$", a22+"!") }) + record("P2_redos_alt_overlap", func() any { return voxgigstruct.ReTest("^(a|aa)+$", a22+"!") }) + record("P3_empty_repeat_replace", func() any { return voxgigstruct.ReReplace("a*", "abc", "X") }) + record("P4_unicode_replace_dot", func() any { return voxgigstruct.ReReplace(`\.`, "café.au.lait", "/") }) + record("P5_unicode_find_codepoint", func() any { return voxgigstruct.ReFind("é", "café au lait") }) + record("P6_deep_nesting_compile", func() any { return voxgigstruct.ReTest(nest40, "a") }) + record("P7_big_bounded_quantifier", func() any { return
voxgigstruct.ReTest("^a{0,10000}b$", strings.Repeat("a", 10)+"b") }) + record("P8_invalid_pattern", func() any { return voxgigstruct.ReCompile("[abc") != nil }) + record("P9_backref_re2_forbidden", func() any { return voxgigstruct.ReTest(`^(a+)\1$`, "aaaa") }) + record("P10_find_all_zero_width", func() any { return voxgigstruct.ReFindAll("a*", "bbb") }) +} diff --git a/go/voxgigstruct.go b/go/voxgigstruct.go index 7c64397..790841d 100644 --- a/go/voxgigstruct.go +++ b/go/voxgigstruct.go @@ -993,6 +993,13 @@ func ReFindAll(pattern, input string) [][]string { // ReReplace replaces every match. The replacement supports Go's $0..$N // reference syntax (functionally equivalent to JS $&..$N). +// +// Note: Go's `regexp` (RE2) suppresses an empty match immediately +// following a non-empty match at the same offset. This is RE2's +// chosen convention and differs from ECMAScript / Python / Java etc: +// `re_replace("a*", "abc", "X")` returns "XbXcX" here, "XXbXcX" on +// PCRE/ECMA engines. The variance is inherent to the host regex +// package; see REGEX_PATHOLOGICAL.md. func ReReplace(pattern, input, replacement string) string { return regexp.MustCompile(pattern).ReplaceAllString(input, replacement) } diff --git a/java/README.md b/java/README.md index 71714ec..5af13d3 100644 --- a/java/README.md +++ b/java/README.md @@ -247,6 +247,46 @@ No standard test runner configured yet. `StructTest.java` exists but is minimal. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The Java port +wraps `java.util.regex.Pattern`. 
+ +### API + +| Function | Maps to | +|---|---| +| `reCompile(pattern)` | `Pattern.compile(pattern)` (throws `PatternSyntaxException` on bad pattern) | +| `reTest(pattern, input)` | `Pattern.compile(pattern).matcher(input).find()` | +| `reFind(pattern, input)` | first match as `String[]` of `[whole, group1, …]` or `null` | +| `reFindAll(pattern, input)` | `List` | +| `reReplace(pattern, input, repl)` | `matcher.replaceAll(repl)` | +| `reEscape(s)` | escape regex metacharacters | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +Java's regex supports backreferences and lookaround; using them will +not be portable. + +### Sharp edges + +- **Catastrophic backtracking.** `java.util.regex` is backtracking; + the discovery panel sees P1 (`^(a+)+$` over 22 a's plus `!`) in + ~13 ms here. Other shapes can be worse. Prefer flat patterns. +- **Zero-width `replace`.** `reReplace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. +- **`System.out` encoding.** When printing match results that contain + non-ASCII characters, `System.out`'s default `PrintStream` uses the + platform's default charset, not UTF-8. The discovery panel sees + `caf?` in stdout though the in-memory `String` is correct UTF-16. + Pass `-Dfile.encoding=UTF-8` (or use `PrintStream(System.out, true, + StandardCharsets.UTF_8)`) when this matters. Orthogonal to the + regex itself. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. 
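The `caf?` effect is an output-encoding loss, not a regex result, and it can be reproduced outside Java. The sketch below uses Python purely as an illustration and assumes an ASCII target charset as a stand-in for a platform default that cannot represent `é`. It does not call the Java port.

```python
text = "café/au/lait"

# Encoding to a charset with no 'é', using replacement on error, turns the
# character into '?'. This is analogous to the substitution seen in the
# panel output for Java.
lossy = text.encode("ascii", errors="replace").decode("ascii")

# UTF-8 can represent every character, so the round trip is lossless.
lossless = text.encode("utf-8").decode("utf-8")

print(lossy)             # caf?/au/lait
print(lossless == text)  # True
```

The in-memory string is unchanged in both cases; only the bytes written to the stream differ.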
+ + ## Build and test ```bash diff --git a/java/src/test/RegexPathologicalTest.java b/java/src/test/RegexPathologicalTest.java new file mode 100644 index 0000000..b1b6870 --- /dev/null +++ b/java/src/test/RegexPathologicalTest.java @@ -0,0 +1,46 @@ +// RUN: mvn -Dtest=RegexPathologicalTest test +// +// Discovery test: pathological regex inputs run against the port's re* API. +// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). + +package voxgig.struct; + +import com.google.gson.Gson; +import org.junit.jupiter.api.Test; + +import java.util.function.Supplier; + +class RegexPathologicalTest { + private static final Gson GSON = new Gson(); + + private static void record(String label, Supplier<Object> fn) { + long t0 = System.nanoTime(); + String outcome; + try { + Object r = fn.get(); + outcome = "OK | " + GSON.toJson(r); + } catch (Throwable e) { + outcome = "ERR | " + e.getClass().getSimpleName() + ": " + e.getMessage(); + } + double ms = (System.nanoTime() - t0) / 1e6; + System.out.printf("[regex-discovery] %s | %.2fms | %s%n", label, ms, outcome); + } + + @Test + void panel() { + String a22 = "a".repeat(22); + String nest40 = "(".repeat(40) + "a" + ")".repeat(40); + + record("P1_redos_nested_plus", () -> Struct.reTest("^(a+)+$", a22 + "!")); + record("P2_redos_alt_overlap", () -> Struct.reTest("^(a|aa)+$", a22 + "!")); + record("P3_empty_repeat_replace", () -> Struct.reReplace("a*", "abc", "X")); + record("P4_unicode_replace_dot", () -> Struct.reReplace("\\.", "café.au.lait", "/")); + record("P5_unicode_find_codepoint", () -> Struct.reFind("é", "café au lait")); + record("P6_deep_nesting_compile", () -> Struct.reTest(nest40, "a")); + record("P7_big_bounded_quantifier", () -> Struct.reTest("^a{0,10000}b$", "a".repeat(10) + "b")); + record("P8_invalid_pattern", () -> Struct.reCompile("[abc")); + record("P9_backref_re2_forbidden", () -> Struct.reTest("^(a+)\\1$", "aaaa")); +
record("P10_find_all_zero_width", () -> Struct.reFindAll("a*", "bbb")); + } +} diff --git a/javascript/README.md b/javascript/README.md index 36a41b9..c5c858d 100644 --- a/javascript/README.md +++ b/javascript/README.md @@ -367,6 +367,42 @@ Otherwise functionally identical -- both run on V8. 84/84 tests pass against the shared corpus. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). On JavaScript +this is the ECMAScript `RegExp` built-in. + +### API + +| Function | Maps to | +|---|---| +| `re_compile(pattern, flags?)` | `new RegExp(pattern, flags ?? 'g')` | +| `re_test(pattern, input)` | `pattern.test(input)` | +| `re_find(pattern, input)` | `input.match(pattern)` (non-global pattern) | +| `re_find_all(pattern, input)` | `[...input.matchAll(pattern)]` | +| `re_replace(pattern, input, rep)` | `input.replace(pattern, rep)` (global pattern) | +| `re_escape(s)` | escape `[.*+?^${}()|[\]\\]` in `s` | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +`RegExp` itself supports backreferences and lookaround, but other ports +do not, so using those will not be portable. + +### Sharp edges + +- **Catastrophic backtracking.** `RegExp` is a backtracking engine; + nested quantifiers like `(a+)+` against a non-matching suffix can be + exponential in input length (the discovery panel sees ~180 ms on + Node 22 vs <0.1 ms on RE2-style engines). Prefer flat patterns and + character classes over alternations. +- **Zero-width `replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input +panel. 
+ + ## Build and test ```bash diff --git a/javascript/test/regex_pathological.test.js b/javascript/test/regex_pathological.test.js new file mode 100644 index 0000000..391e7a1 --- /dev/null +++ b/javascript/test/regex_pathological.test.js @@ -0,0 +1,45 @@ +// VERSION: @voxgig/struct 0.1.0 +// +// Discovery test: pathological regex inputs run against the port's re_* API. +// The goal is to surface which inputs cause errors, hangs, or surprising +// output across ports — NOT to assert any specific behaviour. Each case +// wraps the call in try/catch so one failure does not mask the others. +// The panel is the same in every port (see REGEX.md). + +const { test } = require('node:test') +const struct = require('../src/struct') + +const { re_compile, re_test, re_find, re_find_all, re_replace } = struct + +function rep(s, n) { + return new Array(n + 1).join(s) +} + +function record(label, fn) { + const t0 = process.hrtime.bigint() + let outcome + try { + const r = fn() + outcome = `OK | ${JSON.stringify(r)}` + } catch (e) { + outcome = `ERR | ${e && e.message ? 
e.message : String(e)}` + } + const ms = Number(process.hrtime.bigint() - t0) / 1e6 + console.log(`[regex-discovery] ${label} | ${ms.toFixed(2)}ms | ${outcome}`) +} + +test('regex pathological discovery', () => { + const A22 = rep('a', 22) + const NEST40 = rep('(', 40) + 'a' + rep(')', 40) + + record('P1_redos_nested_plus', () => re_test('^(a+)+$', A22 + '!')) + record('P2_redos_alt_overlap', () => re_test('^(a|aa)+$', A22 + '!')) + record('P3_empty_repeat_replace', () => re_replace('a*', 'abc', 'X')) + record('P4_unicode_replace_dot', () => re_replace('\\.', 'café.au.lait', '/')) + record('P5_unicode_find_codepoint', () => re_find('é', 'café au lait')) + record('P6_deep_nesting_compile', () => re_test(NEST40, 'a')) + record('P7_big_bounded_quantifier', () => re_test('^a{0,10000}b$', rep('a', 10) + 'b')) + record('P8_invalid_pattern', () => re_compile('[abc')) + record('P9_backref_re2_forbidden', () => re_test('^(a+)\\1$', 'aaaa')) + record('P10_find_all_zero_width', () => re_find_all('a*', 'bbb')) +}) diff --git a/kotlin/src/test/kotlin/voxgig/struct/RegexPathologicalTest.kt b/kotlin/src/test/kotlin/voxgig/struct/RegexPathologicalTest.kt new file mode 100644 index 0000000..f7463db --- /dev/null +++ b/kotlin/src/test/kotlin/voxgig/struct/RegexPathologicalTest.kt @@ -0,0 +1,45 @@ +// Discovery test: pathological regex inputs run against the port's re* API. +// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). 
+ +package voxgig.struct + +import com.google.gson.Gson +import kotlin.test.Test + +class RegexPathologicalTest { + private val gson = Gson() + + private fun record( + label: String, + fn: () -> Any?, + ) { + val t0 = System.nanoTime() + val outcome: String = + try { + val r = fn() + "OK | " + gson.toJson(r) + } catch (e: Throwable) { + "ERR | ${e::class.simpleName}: ${e.message}" + } + val ms = (System.nanoTime() - t0) / 1e6 + println("[regex-discovery] %s | %.2fms | %s".format(label, ms, outcome)) + } + + @Test + fun panel() { + val a22 = "a".repeat(22) + val nest40 = "(".repeat(40) + "a" + ")".repeat(40) + + record("P1_redos_nested_plus") { Struct.reTest("^(a+)+\$", a22 + "!") } + record("P2_redos_alt_overlap") { Struct.reTest("^(a|aa)+\$", a22 + "!") } + record("P3_empty_repeat_replace") { Struct.reReplace("a*", "abc", "X") } + record("P4_unicode_replace_dot") { Struct.reReplace("\\.", "café.au.lait", "/") } + record("P5_unicode_find_codepoint") { Struct.reFind("é", "café au lait") } + record("P6_deep_nesting_compile") { Struct.reTest(nest40, "a") } + record("P7_big_bounded_quantifier") { Struct.reTest("^a{0,10000}b\$", "a".repeat(10) + "b") } + record("P8_invalid_pattern") { Struct.reCompile("[abc") } + record("P9_backref_re2_forbidden") { Struct.reTest("^(a+)\\1\$", "aaaa") } + record("P10_find_all_zero_width") { Struct.reFindAll("a*", "bbb") } + } +} diff --git a/lua/README.md b/lua/README.md index 324eca9..c1243b3 100644 --- a/lua/README.md +++ b/lua/README.md @@ -370,6 +370,54 @@ reference-stable" assumption holds without a wrapper. 75/75 tests pass against the shared corpus. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The Lua port +**ships its own RE2-subset engine** in `src/regex.lua` (~500 LOC of +pure Lua — Lua's built-in pattern language is intentionally not +regex, so we vendor one). No LuaRocks dependency, no FFI. 
+ +### API + +| Function | Returns | +|---|---| +| `re.re_compile(pattern)` | compiled regex object | +| `re.re_test(pattern, input)` | `true` / `false` | +| `re.re_find(pattern, input)` | `{whole, group1, …}` or `nil` | +| `re.re_find_all(pattern, input)` | `{ {whole, group1, …}, … }` | +| `re.re_replace(pattern, input, repl)` | `string` | +| `re.re_escape(literal)` | `string` | + +### Dialect + +The in-tree engine implements the RE2 subset documented in +`/REGEX.md`: literals + escapes, `.`, `^`/`$`, `* + ? {n} {n,} {n,m}` +(greedy + lazy), classes incl. `\d \w \s` and friends, `\b`/`\B`, +`(...)` / `(?:...)`, alternation. + +**Not supported** (by design — RE2 doesn't either): backreferences, +lookaround, possessive quantifiers, atomic groups. Backref patterns +compile (the parser treats `\1` as a literal `1`) but never match +back-reference semantics, so `re.re_test("^(a+)\\1$", "aaaa")` returns +`false`. Don't rely on this — write portable patterns. + +### Sharp edges (Lua-specific) + +- **It's a Lua VM regex engine.** P7 (`a{0,10000}b$`) takes ~80 ms + here — fine functionally, slow versus native engines. The library's + hot paths don't use bounded quantifiers anywhere near that size. +- **No catastrophic backtracking.** Thompson-NFA construction; P1/P2 + finish in microseconds. +- **Zero-width `re_replace`.** `re.re_replace("a*", "abc", "X")` + returns `"XXbXcX"` — the convention shared with PCRE/ECMA/Java/.NET + and the other in-tree Thompson ports (Rust / C / Zig). Go (RE2) + returns `"XbXcX"` instead. (Pre-fix the Lua engine produced + `"XaXbXcX"`; the `OP_MATCH` handler in `regex.lua` is now + priority-correct, matching the C port's fix.) + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. 
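Why a state-set simulation avoids the P1 blow-up can be shown with a hand-built automaton. The sketch below is in Python, not Lua. `matches_nested_plus` is a hypothetical function with a two-state NFA for `^(a+)+$` wired in by hand, so it illustrates the technique only and is not the engine in `src/regex.lua`.

```python
def matches_nested_plus(text: str) -> bool:
    """Decide whether text matches ^(a+)+$ by tracking the set of live NFA states."""
    # State 0: an 'a' is required next.
    # State 1: an 'a' was just consumed; the match may end here, or either
    #          '+' may loop back to state 0.
    live = {0}
    for ch in text:
        nxt = set()
        if 0 in live and ch == "a":
            nxt.add(1)
            nxt.add(0)  # epsilon edge taken by both '+' loops
        if not nxt:
            return False  # no live state survives this character
        live = nxt
    return 1 in live


print(matches_nested_plus("aaa"))           # True
print(matches_nested_plus("a" * 22 + "!"))  # False
```

Each input character is examined once and the state set never holds more than two entries, so the work grows linearly with input length instead of exploring every way to split the run of `a` characters between the two quantifiers.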
+ + ## Build and test ```bash diff --git a/lua/src/regex.lua b/lua/src/regex.lua index 2718ff7..65e5c18 100644 --- a/lua/src/regex.lua +++ b/lua/src/regex.lua @@ -483,7 +483,11 @@ local function match_at(re, input, ilen, start) elseif op == OP_CLASS then if c >= 0 and insn.data.cc[c] then add_thread(re, nxt, th.pc + 1, th.slots, sp + 1, input, ilen, visited) end elseif op == OP_MATCH then - if not found then found = th.slots end + -- Always overwrite: priority ordering means later MATCHes from + -- surviving (higher-priority) descendants in nxt should override + -- earlier matches from lower-priority threads. `if not found` made + -- greedy quantifiers behave lazily (e.g. `a*` on "abc" matched ""). + found = th.slots break end end @@ -491,10 +495,10 @@ local function match_at(re, input, ilen, start) sp = sp + 1 if #cur == 0 then break end end - -- Drain remaining current threads for trailing MATCH. + -- Drain remaining current threads for trailing MATCH (mirrors C engine). for i = 1, #cur do if re.code[cur[i].pc].op == OP_MATCH then - if not found then found = cur[i].slots end + found = cur[i].slots break end end diff --git a/lua/test/regex_pathological.lua b/lua/test/regex_pathological.lua new file mode 100644 index 0000000..5207485 --- /dev/null +++ b/lua/test/regex_pathological.lua @@ -0,0 +1,62 @@ +-- Discovery test: pathological regex inputs run against the port's re_* API. +-- Goal is to surface failures across ports, not to assert behaviour. +-- Panel is the same in every port (see REGEX.md). +-- +-- RUN: lua test/regex_pathological.lua + +package.path = "../src/?.lua;./src/?.lua;" .. (package.path or "") +local re = require("regex") + +local function json_str(s) + return '"' .. tostring(s):gsub('"', '\\"') .. 
'"' +end + +local function json_table(t) + local parts = {} + for _, v in ipairs(t) do + if type(v) == "table" then + parts[#parts + 1] = json_table(v) + elseif type(v) == "string" then + parts[#parts + 1] = json_str(v) + else + parts[#parts + 1] = tostring(v) + end + end + return "[" .. table.concat(parts, ",") .. "]" +end + +local function render(r) + local t = type(r) + if t == "nil" then return "null" + elseif t == "boolean" then return tostring(r) + elseif t == "string" then return json_str(r) + elseif t == "table" then return json_table(r) + else return tostring(r) end +end + +local function record(label, fn) + local t0 = os.clock() + local ok, r = pcall(fn) + local ms = (os.clock() - t0) * 1000.0 + local outcome + if ok then + outcome = "OK | " .. render(r) + else + outcome = "ERR | " .. tostring(r) + end + io.write(string.format("[regex-discovery] %s | %.2fms | %s\n", label, ms, outcome)) +end + +local a22 = string.rep("a", 22) +local nest40 = string.rep("(", 40) .. "a" .. string.rep(")", 40) + +record("P1_redos_nested_plus", function() return re.re_test("^(a+)+$", a22 .. "!") end) +record("P2_redos_alt_overlap", function() return re.re_test("^(a|aa)+$", a22 .. "!") end) +record("P3_empty_repeat_replace", function() return re.re_replace("a*", "abc", "X") end) +record("P4_unicode_replace_dot", function() return re.re_replace("\\.", "café.au.lait", "/") end) +record("P5_unicode_find_codepoint", function() return re.re_find("é", "café au lait") end) +record("P6_deep_nesting_compile", function() return re.re_test(nest40, "a") end) +record("P7_big_bounded_quantifier", function() return re.re_test("^a{0,10000}b$", string.rep("a", 10) .. 
"b") end) +record("P8_invalid_pattern", function() return re.re_compile("[abc") end) +record("P9_backref_re2_forbidden", function() return re.re_test("^(a+)\\1$", "aaaa") end) +record("P10_find_all_zero_width", function() return re.re_find_all("a*", "bbb") end) diff --git a/perl/README.md b/perl/README.md index 17600a2..aca2e94 100644 --- a/perl/README.md +++ b/perl/README.md @@ -104,6 +104,44 @@ because they don't preserve insertion order. - Builder helpers: `jm` (insertion-ordered map literal), `jt` (list literal). +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The Perl port +wraps Perl's built-in regex engine. + +### API + +| Function | Maps to | +|---|---| +| `re_compile(pattern, flags?)` | `qr/$pattern/` | +| `re_test(pattern, input)` | `$input =~ $re` | +| `re_find(pattern, input)` | first match as `[whole, $1, ...]` or `undef` | +| `re_find_all(pattern, input)` | all matches, one arrayref per match | +| `re_replace(pattern, input, repl)` | `s/$re/$repl/g` (callable or template) | +| `re_escape(s)` | `quotemeta` equivalent | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +Perl's regex supports backreferences, lookaround, recursion — none of +which are portable to the Go / Rust / C / Lua / Zig ports. + +### Sharp edges + +- **Catastrophic backtracking.** Perl's regex engine is backtracking + but ships with optimisations (trie engine for alternation, etc.). + The discovery panel runs P1/P2 in microseconds here, but other + pathological shapes can still blow up. Stay flat. +- **Zero-width `replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. +- **UTF-8 handling.** Pass character strings (use `use utf8;` for + literals, or `decode_utf8` for bytes). 
Encoding round-trip bugs in + caller code can manifest as `café` style mojibake at print time — + the regex itself preserves character semantics. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. + + ## Tests ```bash diff --git a/perl/t/regex_pathological.t b/perl/t/regex_pathological.t new file mode 100644 index 0000000..904af82 --- /dev/null +++ b/perl/t/regex_pathological.t @@ -0,0 +1,55 @@ +#!perl +# Discovery test: pathological regex inputs run against the port's re_* API. +# Goal is to surface failures across ports, not to assert behaviour. +# Panel is the same in every port (see REGEX.md). + +use 5.018; +use strict; +use warnings; +use utf8; +use Test::More; +use FindBin; +use lib "$FindBin::Bin/../lib"; +use Voxgig::Struct qw(); +use JSON::PP qw(); +use Time::HiRes qw(gettimeofday tv_interval); + +binmode STDOUT, ':encoding(UTF-8)'; + +# JSON::PP defaults to UTF-8-encoding its output bytes. We want characters +# so STDOUT's :utf8 layer can encode them once (not twice). +my $JSON = JSON::PP->new->utf8(0); + +sub record { + my ($label, $fn) = @_; + my $t0 = [gettimeofday]; + my $outcome; + my $r = eval { $fn->() }; + if (my $err = $@) { + chomp $err; + $outcome = "ERR | $err"; + } else { + my $enc = eval { $JSON->encode($r) }; + $enc = (defined $r ? "$r" : 'null') if $@; + $outcome = "OK | $enc"; + } + my $ms = tv_interval($t0) * 1000.0; + printf("[regex-discovery] %s | %.2fms | %s\n", $label, $ms, $outcome); +} + +my $a22 = 'a' x 22; +my $nest40 = ('(' x 40) . 'a' . (')' x 40); + +record('P1_redos_nested_plus', sub { Voxgig::Struct::re_test('^(a+)+$', $a22 . '!') }); +record('P2_redos_alt_overlap', sub { Voxgig::Struct::re_test('^(a|aa)+$', $a22 . 
'!') }); +record('P3_empty_repeat_replace', sub { Voxgig::Struct::re_replace('a*', 'abc', 'X') }); +record('P4_unicode_replace_dot', sub { Voxgig::Struct::re_replace('\\.', 'café.au.lait', '/') }); +record('P5_unicode_find_codepoint', sub { Voxgig::Struct::re_find('é', 'café au lait') }); +record('P6_deep_nesting_compile', sub { Voxgig::Struct::re_test($nest40, 'a') }); +record('P7_big_bounded_quantifier', sub { Voxgig::Struct::re_test('^a{0,10000}b$', ('a' x 10) . 'b') }); +record('P8_invalid_pattern', sub { Voxgig::Struct::re_compile('[abc') }); +record('P9_backref_re2_forbidden', sub { Voxgig::Struct::re_test('^(a+)\\1$', 'aaaa') }); +record('P10_find_all_zero_width', sub { Voxgig::Struct::re_find_all('a*', 'bbb') }); + +pass('regex pathological discovery ran'); +done_testing(); diff --git a/php/README.md b/php/README.md index f709bfc..0874f03 100644 --- a/php/README.md +++ b/php/README.md @@ -362,6 +362,46 @@ PHP method names match canonical lowercase: `getpath`, `setpath`, 82/82 tests pass, 920 assertions. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The PHP port +wraps PCRE (`preg_*`). + +### API + +| Function | Maps to | +|---|---| +| `re_compile(pattern)` | delimited PCRE pattern (validated via `preg_match`) | +| `re_test(pattern, input)` | `preg_match` → bool | +| `re_find(pattern, input)` | `preg_match` with captures, returns `[whole, group1, ...]` or `null` | +| `re_find_all(pattern, input)` | `preg_match_all(..., PREG_SET_ORDER)` | +| `re_replace(pattern, input, repl)` | `preg_replace` (or `preg_replace_callback` for callable repl) | +| `re_escape(s)` | `preg_quote(s)` equivalent | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +PCRE supports backreferences and lookaround; using them will not be +portable. + +### Sharp edges + +- **`re_compile` validates eagerly.** Invalid patterns throw + `InvalidArgumentException` at compile time. 
This is a recent fix: + the wrapper used to swallow PCRE warnings via `@preg_match` and + return `false` silently from `re_test`/`re_find`. Callers can now + distinguish "no match" from "bad pattern". +- **Catastrophic backtracking.** PCRE is a backtracking engine but has + a JIT and a backtrack limit; the discovery panel runs P1/P2 in a few + ms here. Larger inputs or pathological shapes can hit + `pcre.backtrack_limit` and return `false`. Stay inside the RE2 subset + and prefer flat patterns. +- **Zero-width `replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. + + ## Build and test ```bash diff --git a/php/src/Struct.php b/php/src/Struct.php index d286578..cafe059 100644 --- a/php/src/Struct.php +++ b/php/src/Struct.php @@ -565,20 +565,25 @@ public static function escre(?string $s): string public static function re_compile(string $pattern): string { // PHP wants a delimited pattern; return one delimited with '/'. - if (strlen($pattern) > 0 && $pattern[0] === '/') { - return $pattern; + $delimited = strlen($pattern) > 0 && $pattern[0] === '/' + ? $pattern + : '/' . str_replace('/', '\\/', $pattern) . '/'; + // PCRE returns false from preg_match on invalid patterns; surface that + // to the caller (matching the throw behaviour of JS/Python/Java/.NET). + if (@preg_match($delimited, '') === false) { + throw new \InvalidArgumentException("Invalid regex pattern: $pattern"); } - return '/' . str_replace('/', '\\/', $pattern) . 
'/'; + return $delimited; } public static function re_test(string $pattern, string $input): bool { - return @preg_match(self::re_compile($pattern), $input) === 1; + return preg_match(self::re_compile($pattern), $input) === 1; } public static function re_find(string $pattern, string $input): ?array { - if (@preg_match(self::re_compile($pattern), $input, $m) === 1) { + if (preg_match(self::re_compile($pattern), $input, $m) === 1) { return $m; } return null; @@ -587,7 +592,7 @@ public static function re_find(string $pattern, string $input): ?array public static function re_find_all(string $pattern, string $input): array { $out = []; - if (@preg_match_all(self::re_compile($pattern), $input, $m, PREG_SET_ORDER) !== false) { + if (preg_match_all(self::re_compile($pattern), $input, $m, PREG_SET_ORDER) !== false) { $out = $m; } return $out; diff --git a/php/tests/RegexPathologicalTest.php b/php/tests/RegexPathologicalTest.php new file mode 100644 index 0000000..54be3b3 --- /dev/null +++ b/php/tests/RegexPathologicalTest.php @@ -0,0 +1,45 @@ +getMessage(); + } + $ms = (hrtime(true) - $t0) / 1e6; + printf("[regex-discovery] %s | %.2fms | %s\n", $label, $ms, $outcome); + } + + public function testPanel(): void + { + $a22 = str_repeat('a', 22); + $nest40 = str_repeat('(', 40) . 'a' . str_repeat(')', 40); + + self::record('P1_redos_nested_plus', fn() => Struct::re_test('^(a+)+$', $a22 . '!')); + self::record('P2_redos_alt_overlap', fn() => Struct::re_test('^(a|aa)+$', $a22 . '!')); + self::record('P3_empty_repeat_replace', fn() => Struct::re_replace('a*', 'abc', 'X')); + self::record('P4_unicode_replace_dot', fn() => Struct::re_replace('\\.', 'café.au.lait', '/')); + self::record('P5_unicode_find_codepoint', fn() => Struct::re_find('é', 'café au lait')); + self::record('P6_deep_nesting_compile', fn() => Struct::re_test($nest40, 'a')); + self::record('P7_big_bounded_quantifier', fn() => Struct::re_test('^a{0,10000}b$', str_repeat('a', 10) . 
'b')); + self::record('P8_invalid_pattern', fn() => Struct::re_compile('[abc')); + self::record('P9_backref_re2_forbidden', fn() => Struct::re_test('^(a+)\\1$', 'aaaa')); + self::record('P10_find_all_zero_width', fn() => Struct::re_find_all('a*', 'bbb')); + + $this->assertTrue(true); + } +} diff --git a/python/README.md b/python/README.md index 1797fcc..b7e7300 100644 --- a/python/README.md +++ b/python/README.md @@ -361,6 +361,39 @@ parity with other ports beats style here. 84/84 tests pass against the shared corpus. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The Python port +wraps the stdlib `re` module. + +### API + +| Function | Maps to | +|---|---| +| `re_compile(pattern, flags=0)` | `re.compile(pattern, flags)` | +| `re_test(pattern, input)` | `bool(re.search(pattern, input))` | +| `re_find(pattern, input)` | first match as `[whole, group1, ...]` or `None` | +| `re_find_all(pattern, input)` | all matches, one row per match | +| `re_replace(pattern, input, repl)` | `re.sub(pattern, repl, input)` | +| `re_escape(s)` | `re.escape(s)` | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +Python's `re` supports backreferences and lookaround; using them will +not be portable to the Go / Rust / C / Lua / Zig ports. + +### Sharp edges + +- **Catastrophic backtracking.** Python's `re` (the default C engine) + is backtracking. `^(a+)+$` against 22 a's plus `!` runs ~190 ms here; + RE2-style ports finish the same case in <0.1 ms. Use flat patterns. +- **Zero-width `replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. 
+ + ## Build and test ```bash diff --git a/python/tests/test_regex_pathological.py b/python/tests/test_regex_pathological.py new file mode 100644 index 0000000..2ce9c5f --- /dev/null +++ b/python/tests/test_regex_pathological.py @@ -0,0 +1,51 @@ +# RUN: python -m unittest discover -s tests +# +# Discovery test: pathological regex inputs run against the port's re_* API. +# The goal is to surface which inputs cause errors, hangs, or surprising +# output across ports — NOT to assert any specific behaviour. Each case +# wraps the call so one failure does not mask the others. +# The panel is the same in every port (see REGEX.md). + +import json +import time +import unittest + +from voxgig_struct.voxgig_struct import ( + re_compile, + re_find, + re_find_all, + re_replace, + re_test, +) + + +def record(label, fn): + t0 = time.perf_counter() + try: + r = fn() + outcome = f'OK | {json.dumps(r, default=str)}' + except Exception as e: + outcome = f'ERR | {type(e).__name__}: {e}' + ms = (time.perf_counter() - t0) * 1000.0 + print(f'[regex-discovery] {label} | {ms:.2f}ms | {outcome}') + + +class PathologicalRegex(unittest.TestCase): + def test_panel(self): + A22 = 'a' * 22 + NEST40 = '(' * 40 + 'a' + ')' * 40 + + record('P1_redos_nested_plus', lambda: re_test('^(a+)+$', A22 + '!')) + record('P2_redos_alt_overlap', lambda: re_test('^(a|aa)+$', A22 + '!')) + record('P3_empty_repeat_replace', lambda: re_replace('a*', 'abc', 'X')) + record('P4_unicode_replace_dot', lambda: re_replace(r'\.', 'café.au.lait', '/')) + record('P5_unicode_find_codepoint', lambda: re_find('é', 'café au lait')) + record('P6_deep_nesting_compile', lambda: re_test(NEST40, 'a')) + record('P7_big_bounded_quantifier', lambda: re_test('^a{0,10000}b$', 'a' * 10 + 'b')) + record('P8_invalid_pattern', lambda: re_compile('[abc')) + record('P9_backref_re2_forbidden', lambda: re_test(r'^(a+)\1$', 'aaaa')) + record('P10_find_all_zero_width', lambda: re_find_all('a*', 'bbb')) + + +if __name__ == '__main__': + 
unittest.main() diff --git a/ruby/README.md b/ruby/README.md index 718cff1..20f36ac 100644 --- a/ruby/README.md +++ b/ruby/README.md @@ -345,6 +345,41 @@ and a `maxdepth` parameter, matching the canonical algorithm. 75/75 tests pass, 150 assertions. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The Ruby port +wraps the built-in `Regexp` (Onigmo engine). + +### API + +| Function | Maps to | +|---|---| +| `re_compile(pattern)` | `Regexp.new(pattern)` | +| `re_test(pattern, input)` | `input =~ re` | +| `re_find(pattern, input)` | `input.match(re)` → `[whole, group1, ...]` | +| `re_find_all(pattern, input)` | `input.scan(re)` (one row per match) | +| `re_replace(pattern, input, repl)` | `input.gsub(re, repl)` | +| `re_escape(s)` | `Regexp.escape(s)` | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +Onigmo supports backreferences and lookaround; using them will not be +portable to the Go / Rust / C / Lua / Zig ports. + +### Sharp edges + +- **Catastrophic backtracking.** Onigmo has internal mitigations for + some classic ReDoS shapes — `^(a+)+$` against 22 a's plus `!` runs + in microseconds here. Larger inputs or different shapes can still + blow up; the safe rule is to stay inside the RE2 subset and avoid + nested quantifiers. +- **Zero-width `replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` — the ECMA convention shared by all PCRE/ECMA/.NET/Java/Onigmo engines plus the in-tree Thompson ports. Go (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. 
+ + ## Build and test ```bash diff --git a/ruby/test_regex_pathological.rb b/ruby/test_regex_pathological.rb new file mode 100644 index 0000000..1fde671 --- /dev/null +++ b/ruby/test_regex_pathological.rb @@ -0,0 +1,37 @@ +require 'minitest/autorun' +require 'json' +require_relative 'voxgig_struct' + +# Discovery test: pathological regex inputs run against the port's re_* API. +# Goal is to surface failures across ports, not to assert behaviour. +# Panel is the same in every port (see REGEX.md). + +def record(label, &block) + t0 = Process.clock_gettime(Process::CLOCK_MONOTONIC) + begin + r = block.call + outcome = "OK | #{JSON.generate(r)}" + rescue StandardError => e + outcome = "ERR | #{e.class.name}: #{e.message}" + end + ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - t0) * 1000.0 + printf("[regex-discovery] %s | %.2fms | %s\n", label, ms, outcome) +end + +class PathologicalRegexTest < Minitest::Test + def test_panel + a22 = 'a' * 22 + nest40 = "#{'(' * 40}a#{')' * 40}" + + record('P1_redos_nested_plus') { VoxgigStruct.re_test('^(a+)+$', "#{a22}!") } + record('P2_redos_alt_overlap') { VoxgigStruct.re_test('^(a|aa)+$', "#{a22}!") } + record('P3_empty_repeat_replace') { VoxgigStruct.re_replace('a*', 'abc', 'X') } + record('P4_unicode_replace_dot') { VoxgigStruct.re_replace('\\.', 'café.au.lait', '/') } + record('P5_unicode_find_codepoint') { VoxgigStruct.re_find('é', 'café au lait') } + record('P6_deep_nesting_compile') { VoxgigStruct.re_test(nest40, 'a') } + record('P7_big_bounded_quantifier') { VoxgigStruct.re_test('^a{0,10000}b$', "#{'a' * 10}b") } + record('P8_invalid_pattern') { VoxgigStruct.re_compile('[abc') } + record('P9_backref_re2_forbidden') { VoxgigStruct.re_test('^(a+)\\1$', 'aaaa') } + record('P10_find_all_zero_width') { VoxgigStruct.re_find_all('a*', 'bbb') } + end +end diff --git a/rust/README.md b/rust/README.md index b837949..4bb3ba9 100644 --- a/rust/README.md +++ b/rust/README.md @@ -126,3 +126,55 @@ Rust has no optional/overloaded 
parameters, so: See [`REPORT.md`](../REPORT.md#rust-rust) for the rust-port adaptations write-up, and [`../NOTES.md`](../NOTES.md) for cross-port quirks. + + +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The Rust port +**ships its own RE2-subset engine** in `src/re.rs` — no `regex` crate +dependency, no third-party crates at all (`Cargo.toml` lists none for +runtime). + +### API + +| Function | Returns | +|---|---| +| `re_compile(pattern)` | `Result` | +| `re_test(pattern, input)` | `bool` | +| `re_find(pattern, input)` | `Option>` — `[whole, group1, …]` | +| `re_find_all(pattern, input)` | `Vec>` | +| `re_replace(pattern, input, r)` | `String` | +| `re_escape(s)` | `String` | + +### Dialect + +The in-tree engine implements the RE2 subset documented in +`/REGEX.md`: literals + escapes, `.`, `^`/`$`, `* + ? {n} {n,} {n,m}` +(greedy + lazy), classes incl. `\d \w \s` and friends, `\b`/`\B`, +`(...)` / `(?:...)`, alternation. + +**Not supported** (by design — RE2 doesn't either): +backreferences, lookaround, possessive quantifiers, atomic groups. +Backref patterns like `^(a+)\1$` *compile* (the parser doesn't reject +`\1`) but never match the back-reference semantically, so `re_test` +returns `false` rather than erroring. Don't rely on this — write +portable patterns. + +### Sharp edges (Rust-specific) + +- **Bounded quantifiers are unrolled.** `a{0,10000}` compiles into + 10 000 Split+atom-clone pairs. The matcher was previously recursive + during epsilon-closure and stack-overflowed on such patterns; it is + now iterative (`Threads::add` uses an explicit work stack). + `re_test("^a{0,10000}b$", …)` now runs in ~10 ms here. +- **No catastrophic backtracking.** Thompson-NFA construction means + P1/P2 from the discovery panel run in microseconds. +- **Zero-width `re_replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` — the convention shared with all PCRE/ECMA/Java/.NET + engines and the other in-tree Thompson ports (C / Lua / Zig). 
Go + (RE2) returns `"XbXcX"` instead; see `/REGEX_PATHOLOGICAL.md`. +- **Single-threaded.** `Value` uses `Rc>` so it is + `!Send + !Sync`. The regex statics use `std::sync::LazyLock` and + are thread-safe in isolation, but the public API isn't. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. diff --git a/rust/src/re.rs b/rust/src/re.rs index 4f06b0c..a7669e7 100644 --- a/rust/src/re.rs +++ b/rust/src/re.rs @@ -852,62 +852,68 @@ impl ThreadList { } fn add(&mut self, re: &Regex, input: &[u8], pc: usize, slots: &[i32], sp: usize) { - if pc >= re.code.len() { - return; - } - if self.visited[pc] == self.gen { - return; - } - self.visited[pc] = self.gen; - let insn = &re.code[pc]; - match insn.op { - Op::Jmp(t) => { - self.add(re, input, t as usize, slots, sp); - return; - } - Op::Split(x, y) => { - self.add(re, input, x as usize, slots, sp); - self.add(re, input, y as usize, slots, sp); - return; + // Iterative epsilon-closure: we walk Jmp/Split/Save/Bol/Eol/Wb/Nwb + // until we hit a char-consuming op or Match. A recursive version + // overflows the stack on long Thompson chains (e.g. `a{0,10000}` + // unrolls into 10000 chained Splits — `cargo test` aborted with + // SIGABRT on the pathological-regex panel before this loop landed). + // + // The stack mirrors the recursive order: Split pushes y first then + // x, so x is processed first (priority preserved). 
+ let mut stack: Vec<(usize, Vec<i32>)> = vec![(pc, slots.to_vec())]; + while let Some((cur_pc, cur_slots)) = stack.pop() { + if cur_pc >= re.code.len() { + continue; } - Op::Save(slot) => { - let mut ns = slots.to_vec(); - ns[slot] = sp as i32; - self.add(re, input, pc + 1, &ns, sp); - return; + if self.visited[cur_pc] == self.gen { + continue; } - Op::Bol => { - if sp == 0 || (sp - 1 < input.len() && input[sp - 1] == b'\n') { - self.add(re, input, pc + 1, slots, sp); + self.visited[cur_pc] = self.gen; + match re.code[cur_pc].op { + Op::Jmp(t) => { + stack.push((t as usize, cur_slots)); } - return; - } - Op::Eol => { - if sp >= input.len() || input[sp] == b'\n' { - self.add(re, input, pc + 1, slots, sp); + Op::Split(x, y) => { + // Push y first so x (higher priority) is popped first. + stack.push((y as usize, cur_slots.clone())); + stack.push((x as usize, cur_slots)); } - return; - } - Op::Wb | Op::Nwb => { - let left = sp > 0 - && sp - 1 < input.len() - && (input[sp - 1].is_ascii_alphanumeric() || input[sp - 1] == b'_'); - let right = - sp < input.len() && (input[sp].is_ascii_alphanumeric() || input[sp] == b'_'); - let at_boundary = left != right; - let want = matches!(insn.op, Op::Wb); - if at_boundary == want { - self.add(re, input, pc + 1, slots, sp); + Op::Save(slot) => { + let mut ns = cur_slots; + ns[slot] = sp as i32; + stack.push((cur_pc + 1, ns)); + } + Op::Bol => { + if sp == 0 || (sp - 1 < input.len() && input[sp - 1] == b'\n') { + stack.push((cur_pc + 1, cur_slots)); + } + } + Op::Eol => { + if sp >= input.len() || input[sp] == b'\n' { + stack.push((cur_pc + 1, cur_slots)); + } + } + Op::Wb | Op::Nwb => { + let left = sp > 0 + && sp - 1 < input.len() + && (input[sp - 1].is_ascii_alphanumeric() || input[sp - 1] == b'_'); + let right = sp < input.len() + && (input[sp].is_ascii_alphanumeric() || input[sp] == b'_'); + let at_boundary = left != right; + let want = matches!(re.code[cur_pc].op, Op::Wb); + if at_boundary == want { + stack.push((cur_pc + 1, 
cur_slots)); + } + } + _ => { + // Char-consuming op (or Match): queue thread. + self.threads.push(Thread { + pc: cur_pc, + slots: cur_slots, + }); } - return; } - _ => {} } - // Char-consuming op: queue thread. - self.threads.push(Thread { - pc, - slots: slots.to_vec(), - }); } } diff --git a/rust/tests/regex_pathological.rs b/rust/tests/regex_pathological.rs new file mode 100644 index 0000000..c82f7ac --- /dev/null +++ b/rust/tests/regex_pathological.rs @@ -0,0 +1,61 @@ +// Discovery test: pathological regex inputs run against the port's re_* API. +// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). + +use std::panic; +use std::time::Instant; + +use voxgig_struct::{re_compile, re_find, re_find_all, re_replace, re_test}; + +fn record<F, R>(label: &str, fn_: F) +where + F: FnOnce() -> R + panic::UnwindSafe, + R: std::fmt::Debug, +{ + let t0 = Instant::now(); + let outcome = match panic::catch_unwind(fn_) { + Ok(r) => format!("OK | {:?}", r), + Err(e) => { + let msg = if let Some(s) = e.downcast_ref::<&str>() { + s.to_string() + } else if let Some(s) = e.downcast_ref::<String>() { + s.clone() + } else { + "".to_string() + }; + format!("ERR | panic: {}", msg) + } + }; + let ms = t0.elapsed().as_secs_f64() * 1000.0; + println!("[regex-discovery] {} | {:.2}ms | {}", label, ms, outcome); +} + +#[test] +fn regex_pathological_discovery() { + let a22: String = "a".repeat(22); + let nest40: String = "(".repeat(40) + "a" + &")".repeat(40); + + record("P1_redos_nested_plus", || { + re_test("^(a+)+$", &(a22.clone() + "!")) + }); + record("P2_redos_alt_overlap", || { + re_test("^(a|aa)+$", &(a22.clone() + "!")) + }); + record("P3_empty_repeat_replace", || re_replace("a*", "abc", "X")); + record("P4_unicode_replace_dot", || { + re_replace(r"\.", "café.au.lait", "/") + }); + record("P5_unicode_find_codepoint", || re_find("é", "café au lait")); + record("P6_deep_nesting_compile", || re_test(&nest40, "a")); + 
record("P7_big_bounded_quantifier", || { + re_test("^a{0,10000}b$", &("a".repeat(10) + "b")) + }); + record("P8_invalid_pattern", || { + re_compile("[abc") + .map(|_| ()) + .err() + .map(|e| format!("{:?}", e)) + }); + record("P9_backref_re2_forbidden", || re_test(r"^(a+)\1$", "aaaa")); + record("P10_find_all_zero_width", || re_find_all("a*", "bbb")); +} diff --git a/swift/README.md b/swift/README.md index 9243e85..4e59a11 100644 --- a/swift/README.md +++ b/swift/README.md @@ -145,6 +145,42 @@ order. `JSON.stringify(value, indent: 2)` serialises back. `__NULL__` round-trip for the `inject.string` and `select.*` sets exactly as the canonical TS runner does. +## Regex + +Uniform six-function regex API (see `/REGEX_API.md`). The Swift port +wraps `NSRegularExpression`. + +### API + +| Function | Returns | +|---|---| +| `re_compile(pattern, flags?)` | `NSRegularExpression?` (nil on bad pattern) | +| `re_test(pattern, input)` | `Bool` | +| `re_find(pattern, input)` | `Value.list([whole, group1, …])` or `.noval` | +| `re_find_all(pattern, input)` | `Value.list([...])` | +| `re_replace(pattern, input, repl)` | `String` | +| `re_escape(v)` | `String` | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`. +`NSRegularExpression` (ICU-based) supports backreferences and lookaround; +using them will not be portable. + +### Sharp edges + +- **Catastrophic backtracking.** ICU regex is backtracking. Stay + inside the RE2 subset and prefer flat patterns. +- **Compile failures are nil, not throws.** `re_compile` returns + `nil` on bad pattern (the underlying `try?` swallows the error). + Callers should check the optional rather than rely on an exception. +- **`Value` shape for `re_find` / `re_find_all`.** The Swift port + threads results through the in-tree `Value` enum (matching the + rest of the API surface), not raw arrays. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. 
+ + ## Tests ```bash diff --git a/swift/Tests/VoxgigStructTests/RegexPathologicalTests.swift b/swift/Tests/VoxgigStructTests/RegexPathologicalTests.swift new file mode 100644 index 0000000..390182e --- /dev/null +++ b/swift/Tests/VoxgigStructTests/RegexPathologicalTests.swift @@ -0,0 +1,40 @@ +// Discovery test: pathological regex inputs run against the port's re_* API. +// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). + +import XCTest + +@testable import VoxgigStruct + +final class RegexPathologicalTests: XCTestCase { + private func record(_ label: String, _ fn: () -> Any?) { + let t0 = DispatchTime.now() + let value = fn() + let elapsedNs = DispatchTime.now().uptimeNanoseconds - t0.uptimeNanoseconds + let ms = Double(elapsedNs) / 1_000_000.0 + let outcome: String + if let value = value { + outcome = "OK | \(value)" + } else { + outcome = "OK | null" + } + print(String(format: "[regex-discovery] %@ | %.2fms | %@", label, ms, outcome)) + } + + func testPanel() { + let a22 = String(repeating: "a", count: 22) + let nest40 = String(repeating: "(", count: 40) + "a" + String(repeating: ")", count: 40) + let p7Input = String(repeating: "a", count: 10) + "b" + + record("P1_redos_nested_plus") { re_test(.string("^(a+)+$"), a22 + "!") } + record("P2_redos_alt_overlap") { re_test(.string("^(a|aa)+$"), a22 + "!") } + record("P3_empty_repeat_replace") { re_replace(.string("a*"), "abc", "X") } + record("P4_unicode_replace_dot") { re_replace(.string("\\."), "café.au.lait", "/") } + record("P5_unicode_find_codepoint") { re_find(.string("é"), "café au lait") } + record("P6_deep_nesting_compile") { re_test(.string(nest40), "a") } + record("P7_big_bounded_quantifier") { re_test(.string("^a{0,10000}b$"), p7Input) } + record("P8_invalid_pattern") { re_compile("[abc") as Any? 
} + record("P9_backref_re2_forbidden") { re_test(.string("^(a+)\\1$"), "aaaa") } + record("P10_find_all_zero_width") { re_find_all(.string("a*"), "bbb") } + } +} diff --git a/typescript/README.md b/typescript/README.md index af759eb..56ebd4c 100644 --- a/typescript/README.md +++ b/typescript/README.md @@ -563,6 +563,53 @@ calls (one shared array per depth). Clone it (`path.slice()`) if you need to retain it past the callback. +## Regex + +The library exposes a uniform six-function regex API across every +port (see `/REGEX_API.md` for the contract and `/REGEX.md` for the +supported dialect). On TypeScript the canonical implementation is +ECMAScript `RegExp`. + +### API + +| Function | Maps to | +|---|---| +| `re_compile(pattern, flags?)` | `new RegExp(pattern, flags ?? 'g')` | +| `re_test(pattern, input)` | `pattern.test(input)` | +| `re_find(pattern, input)` | `input.match(pattern)` (non-global pattern) | +| `re_find_all(pattern, input)` | `[...input.matchAll(pattern)]` | +| `re_replace(pattern, input, rep)` | `input.replace(pattern, rep)` (global pattern) | +| `re_escape(s)` | escape `[.*+?^${}()\|[\]\\]` in `s` | + +### Dialect + +Patterns must stay inside the **RE2 subset** documented in `/REGEX.md`: +literals + escapes, `.`, `^`/`$`, `* + ? {n} {n,} {n,m}` (greedy + lazy), +character classes incl. `\d \w \s` etc., `\b`/`\B`, `(...)` / `(?:...)` / +`(?<name>...)`, alternation. ECMAScript `RegExp` supports backreferences +and lookaround, but the Go / Rust / C / Lua / Zig ports do not; using +those features will not be portable. + +### Sharp edges + +- **Catastrophic backtracking.** ECMAScript `RegExp` uses backtracking; + nested quantifiers (e.g. `(a+)+`) against a non-matching suffix can be + exponential in the input length. The discovery panel measures ~180 ms + on Node 22 for `^(a+)+$` against 22 a's plus `!`. RE2-style engines + finish the same case in under 0.1 ms. 
Write linear-friendly patterns + (`a+` instead of `(a+)+`) and keep injected user input in + character classes, not in alternations. 
+- **Zero-width `replace`.** `re_replace("a*", "abc", "X")` returns + `"XXbXcX"` here — the ECMA convention shared by every port whose + host engine is PCRE/ECMA/.NET/Java/Onigmo, plus the in-tree + Thompson NFA ports (Rust / C / Lua / Zig). Go is the exception: + RE2 returns `"XbXcX"`. Don't rely on cross-port identity of + zero-width replacement output — see `/REGEX_PATHOLOGICAL.md`. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input +panel and per-port outcomes. + + ## Build and test ```bash diff --git a/typescript/dist-test/regex_pathological.test.js b/typescript/dist-test/regex_pathological.test.js new file mode 100644 index 0000000..50202df --- /dev/null +++ b/typescript/dist-test/regex_pathological.test.js @@ -0,0 +1,40 @@ +"use strict"; +// VERSION: @voxgig/struct 0.1.0 +// +// Discovery test: pathological regex inputs run against the port's re_* API. +// Each case wraps the call so one failure does not mask the others. +// The panel is the same in every port (see REGEX.md). +Object.defineProperty(exports, "__esModule", { value: true }); +const node_test_1 = require("node:test"); +const StructUtility_1 = require("../dist/StructUtility"); +function rep(s, n) { + return new Array(n + 1).join(s); +} +function record(label, fn) { + const t0 = process.hrtime.bigint(); + let outcome; + try { + const r = fn(); + outcome = `OK | ${JSON.stringify(r)}`; + } + catch (e) { + outcome = `ERR | ${e && e.message ? 
e.message : String(e)}`; + } + const ms = Number(process.hrtime.bigint() - t0) / 1e6; + console.log(`[regex-discovery] ${label} | ${ms.toFixed(2)}ms | ${outcome}`); +} +(0, node_test_1.test)('regex pathological discovery', () => { + const A22 = rep('a', 22); + const NEST40 = rep('(', 40) + 'a' + rep(')', 40); + record('P1_redos_nested_plus', () => (0, StructUtility_1.re_test)('^(a+)+$', A22 + '!')); + record('P2_redos_alt_overlap', () => (0, StructUtility_1.re_test)('^(a|aa)+$', A22 + '!')); + record('P3_empty_repeat_replace', () => (0, StructUtility_1.re_replace)('a*', 'abc', 'X')); + record('P4_unicode_replace_dot', () => (0, StructUtility_1.re_replace)('\\.', 'café.au.lait', '/')); + record('P5_unicode_find_codepoint', () => (0, StructUtility_1.re_find)('é', 'café au lait')); + record('P6_deep_nesting_compile', () => (0, StructUtility_1.re_test)(NEST40, 'a')); + record('P7_big_bounded_quantifier', () => (0, StructUtility_1.re_test)('^a{0,10000}b$', rep('a', 10) + 'b')); + record('P8_invalid_pattern', () => (0, StructUtility_1.re_compile)('[abc')); + record('P9_backref_re2_forbidden', () => (0, StructUtility_1.re_test)('^(a+)\\1$', 'aaaa')); + record('P10_find_all_zero_width', () => (0, StructUtility_1.re_find_all)('a*', 'bbb')); +}); +//# sourceMappingURL=regex_pathological.test.js.map \ No newline at end of file diff --git a/typescript/dist-test/regex_pathological.test.js.map b/typescript/dist-test/regex_pathological.test.js.map new file mode 100644 index 0000000..79b1e70 --- /dev/null +++ b/typescript/dist-test/regex_pathological.test.js.map @@ -0,0 +1 @@ 
+{"version":3,"file":"regex_pathological.test.js","sourceRoot":"","sources":["../test/regex_pathological.test.ts"],"names":[],"mappings":";AAAA,gCAAgC;AAChC,EAAE;AACF,6EAA6E;AAC7E,oEAAoE;AACpE,sDAAsD;;AAEtD,yCAAgC;AAEhC,yDAA6F;AAE7F,SAAS,GAAG,CAAC,CAAS,EAAE,CAAS;IAC/B,OAAO,IAAI,KAAK,CAAC,CAAC,GAAG,CAAC,CAAC,CAAC,IAAI,CAAC,CAAC,CAAC,CAAA;AACjC,CAAC;AAED,SAAS,MAAM,CAAC,KAAa,EAAE,EAAiB;IAC9C,MAAM,EAAE,GAAG,OAAO,CAAC,MAAM,CAAC,MAAM,EAAE,CAAA;IAClC,IAAI,OAAe,CAAA;IACnB,IAAI,CAAC;QACH,MAAM,CAAC,GAAG,EAAE,EAAE,CAAA;QACd,OAAO,GAAG,QAAQ,IAAI,CAAC,SAAS,CAAC,CAAC,CAAC,EAAE,CAAA;IACvC,CAAC;IAAC,OAAO,CAAM,EAAE,CAAC;QAChB,OAAO,GAAG,SAAS,CAAC,IAAI,CAAC,CAAC,OAAO,CAAC,CAAC,CAAC,CAAC,CAAC,OAAO,CAAC,CAAC,CAAC,MAAM,CAAC,CAAC,CAAC,EAAE,CAAA;IAC7D,CAAC;IACD,MAAM,EAAE,GAAG,MAAM,CAAC,OAAO,CAAC,MAAM,CAAC,MAAM,EAAE,GAAG,EAAE,CAAC,GAAG,GAAG,CAAA;IACrD,OAAO,CAAC,GAAG,CAAC,qBAAqB,KAAK,MAAM,EAAE,CAAC,OAAO,CAAC,CAAC,CAAC,QAAQ,OAAO,EAAE,CAAC,CAAA;AAC7E,CAAC;AAED,IAAA,gBAAI,EAAC,8BAA8B,EAAE,GAAG,EAAE;IACxC,MAAM,GAAG,GAAG,GAAG,CAAC,GAAG,EAAE,EAAE,CAAC,CAAA;IACxB,MAAM,MAAM,GAAG,GAAG,CAAC,GAAG,EAAE,EAAE,CAAC,GAAG,GAAG,GAAG,GAAG,CAAC,GAAG,EAAE,EAAE,CAAC,CAAA;IAEhD,MAAM,CAAC,sBAAsB,EAAE,GAAG,EAAE,CAAC,IAAA,uBAAO,EAAC,SAAS,EAAE,GAAG,GAAG,GAAG,CAAC,CAAC,CAAA;IACnE,MAAM,CAAC,sBAAsB,EAAE,GAAG,EAAE,CAAC,IAAA,uBAAO,EAAC,WAAW,EAAE,GAAG,GAAG,GAAG,CAAC,CAAC,CAAA;IACrE,MAAM,CAAC,yBAAyB,EAAE,GAAG,EAAE,CAAC,IAAA,0BAAU,EAAC,IAAI,EAAE,KAAK,EAAE,GAAG,CAAC,CAAC,CAAA;IACrE,MAAM,CAAC,wBAAwB,EAAE,GAAG,EAAE,CAAC,IAAA,0BAAU,EAAC,KAAK,EAAE,cAAc,EAAE,GAAG,CAAC,CAAC,CAAA;IAC9E,MAAM,CAAC,2BAA2B,EAAE,GAAG,EAAE,CAAC,IAAA,uBAAO,EAAC,GAAG,EAAE,cAAc,CAAC,CAAC,CAAA;IACvE,MAAM,CAAC,yBAAyB,EAAE,GAAG,EAAE,CAAC,IAAA,uBAAO,EAAC,MAAM,EAAE,GAAG,CAAC,CAAC,CAAA;IAC7D,MAAM,CAAC,2BAA2B,EAAE,GAAG,EAAE,CAAC,IAAA,uBAAO,EAAC,eAAe,EAAE,GAAG,CAAC,GAAG,EAAE,EAAE,CAAC,GAAG,GAAG,CAAC,CAAC,CAAA;IACvF,MAAM,CAAC,oBAAoB,EAAE,GAAG,EAAE,CAAC,IAAA,0BAAU,EAAC,MAAM,CAAC,CAAC,CAAA;IACtD,MAAM,CAAC,0BAA0B,EAAE,GAAG,EAAE,CAAC,IAAA,uBAAO,EAAC,WAAW,EAAE,MAAM,CAAC,CAA
C,CAAA;IACtE,MAAM,CAAC,yBAAyB,EAAE,GAAG,EAAE,CAAC,IAAA,2BAAW,EAAC,IAAI,EAAE,KAAK,CAAC,CAAC,CAAA;AACnE,CAAC,CAAC,CAAA"} \ No newline at end of file diff --git a/typescript/test/regex_pathological.test.ts b/typescript/test/regex_pathological.test.ts new file mode 100644 index 0000000..d1b18b9 --- /dev/null +++ b/typescript/test/regex_pathological.test.ts @@ -0,0 +1,42 @@ +// VERSION: @voxgig/struct 0.1.0 +// +// Discovery test: pathological regex inputs run against the port's re_* API. +// Each case wraps the call so one failure does not mask the others. +// The panel is the same in every port (see REGEX.md). + +import { test } from 'node:test' + +import { re_compile, re_test, re_find, re_find_all, re_replace } from '../dist/StructUtility' + +function rep(s: string, n: number): string { + return new Array(n + 1).join(s) +} + +function record(label: string, fn: () => unknown): void { + const t0 = process.hrtime.bigint() + let outcome: string + try { + const r = fn() + outcome = `OK | ${JSON.stringify(r)}` + } catch (e: any) { + outcome = `ERR | ${e && e.message ? 
e.message : String(e)}` + } + const ms = Number(process.hrtime.bigint() - t0) / 1e6 + console.log(`[regex-discovery] ${label} | ${ms.toFixed(2)}ms | ${outcome}`) +} + +test('regex pathological discovery', () => { + const A22 = rep('a', 22) + const NEST40 = rep('(', 40) + 'a' + rep(')', 40) + + record('P1_redos_nested_plus', () => re_test('^(a+)+$', A22 + '!')) + record('P2_redos_alt_overlap', () => re_test('^(a|aa)+$', A22 + '!')) + record('P3_empty_repeat_replace', () => re_replace('a*', 'abc', 'X')) + record('P4_unicode_replace_dot', () => re_replace('\\.', 'café.au.lait', '/')) + record('P5_unicode_find_codepoint', () => re_find('é', 'café au lait')) + record('P6_deep_nesting_compile', () => re_test(NEST40, 'a')) + record('P7_big_bounded_quantifier', () => re_test('^a{0,10000}b$', rep('a', 10) + 'b')) + record('P8_invalid_pattern', () => re_compile('[abc')) + record('P9_backref_re2_forbidden', () => re_test('^(a+)\\1$', 'aaaa')) + record('P10_find_all_zero_width', () => re_find_all('a*', 'bbb')) +}) diff --git a/zig/README.md b/zig/README.md index f3fe854..bcd9d29 100644 --- a/zig/README.md +++ b/zig/README.md @@ -251,6 +251,59 @@ subsystems present) but the test corpus pass rate is being raised. 60+ tests pass; see [`../REPORT.md`](../REPORT.md) for current status. +## Regex + +Uniform regex API (see `/REGEX_API.md`). The Zig port **ships its own +RE2-subset engine** in `src/regex.zig` (Thompson NFA), replacing the +earlier `mvzr` dependency. No third-party runtime crates. 
+ +### API + +| Function | Returns | +|---|---| +| `re_compile(pattern)` | `?ReCompiled` (null on bad pattern) | +| `re_test(pattern, input)` | `bool` | +| `re_find(alloc, pattern, input)` | `?[][]const u8` (caller frees the outer slice) | +| `re_find_all(alloc, pattern, input)` | `?[][][]const u8` (caller frees the outer slice and each row) | +| `re_replace(alloc, pattern, input, repl)` | `![]u8` (caller frees) | +| `re_escape(alloc, s)` | `![]const u8` | + +`ReCompiled` is an alias for the engine's `Regex` type +(`src/regex.zig`); it owns an instruction buffer and is released with +`.deinit()`. + +### Dialect + +The in-tree engine implements the RE2 subset documented in `/REGEX.md`: +literals + escapes, `.`, `^`/`$`, `* + ? {n} {n,} {n,m}` (greedy + lazy), +classes incl. `\d \w \s` and friends, `\b`/`\B`, `(...)` / `(?:...)`, +alternation. + +**Not supported** (by design; RE2 doesn't either): backreferences, +lookaround, possessive quantifiers, atomic groups. + +### Sharp edges (Zig-specific) + +- **Allocator-explicit.** `re_test` and `re_compile` use + `std.heap.page_allocator` internally so callers don't have to pipe + one through every call; the find/find_all/replace wrappers ask for + one because they return caller-owned slices. +- **`re_find` / `re_find_all` slices alias the input.** The match + slices are valid only while `input` is alive. Copy them if you need + to retain them past the input's lifetime, and free only the + containers, never the aliased match slices. +- **`re_replace` takes the replacement literally** in the current + wrapper: there is no `$&`/`$1..` expansion. The engine's lower-level + callback variant gives full control. +- **No catastrophic backtracking.** Thompson-NFA construction; P1/P2 + finish in microseconds. +- **Zero-width `re_replace`** matches the in-tree-Thompson and + PCRE/ECMA convention: `re_replace(alloc, "a*", "abc", "X")` returns + `"XXbXcX"`. 
Go (RE2) returns `"XbXcX"` instead; this is RE2's + chosen rule and we don't paper over it. + +See `/REGEX_PATHOLOGICAL.md` for the cross-port pathological-input panel. 
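The zero-width convention described above comes down to one rule: after an empty match, emit the replacement, copy one unit of input, and advance past it. The sketch below restates that loop in Python so it can be run without the Zig toolchain. It is an illustration under assumptions, not the port's code: `replace_all_sketch` is a hypothetical name, Python's `re` stands in for the in-tree engine's match search, and Python strings advance by code point where the Zig wrapper advances by byte.

```python
import re


def replace_all_sketch(pattern, text, repl):
    # Hypothetical restatement of the advance-by-one replace loop.
    rx = re.compile(pattern)
    out = []
    pos = 0
    while pos <= len(text):
        m = rx.search(text, pos)
        if m is None:
            # No further match: copy the tail and stop.
            out.append(text[pos:])
            break
        # Copy the unmatched gap, then the literal replacement.
        out.append(text[pos:m.start()])
        out.append(repl)
        if m.end() == m.start():
            # Zero-width match: copy one unit of input so the scan advances.
            if m.start() < len(text):
                out.append(text[m.start()])
            pos = m.start() + 1
        else:
            pos = m.end()
    return "".join(out)


print(replace_all_sketch("a*", "abc", "X"))
```

Because an empty match is still accepted immediately after a non-empty one, `"a"` and the empty match at position 1 each produce an `X`, which is the doubled `X` at the start of the result.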
+ + ## Build and test ```bash diff --git a/zig/src/regex.zig b/zig/src/regex.zig index 17ad86d..62d118c 100644 --- a/zig/src/regex.zig +++ b/zig/src/regex.zig @@ -133,7 +133,7 @@ pub const Regex = struct { return false; } - fn findFirst(self: Regex, input: []const u8) ?[]i32 { + pub fn findFirst(self: Regex, input: []const u8) ?[]i32 { var start: usize = 0; while (true) { if (self.matchAt(input, start)) |slots| return slots; @@ -143,6 +143,16 @@ pub const Regex = struct { } } + pub fn findFrom(self: Regex, input: []const u8, from: usize) ?[]i32 { + var start: usize = from; + while (true) { + if (self.matchAt(input, start)) |slots| return slots; + if (self.anchored_start) return null; + if (start > input.len) return null; + start += 1; + } + } + fn matchAt(self: Regex, input: []const u8, start: usize) ?[]i32 { const nslots = self.ngroups * 2; var cur = ThreadList.init(self.allocator, self.code.len) catch return null; diff --git a/zig/src/struct.zig b/zig/src/struct.zig index d4f124c..a1df63c 100644 --- a/zig/src/struct.zig +++ b/zig/src/struct.zig @@ -773,6 +773,102 @@ pub fn re_test(pattern: []const u8, input: []const u8) bool { return re.isMatch(input); } +/// re_find: first match as `[whole, capture1, ...]`. Inner slices alias +/// `input`, so the result is valid only while `input` is alive. Returns null +/// on compile error or no-match. The caller frees only the outer slice; the +/// inner slices alias `input` and must not be freed. 
+pub fn re_find(allocator: Allocator, pattern: []const u8, input: []const u8) ?[][]const u8 { + var re = _re_engine.compile(std.heap.page_allocator, pattern) orelse return null; + defer re.deinit(); + const slots = re.findFirst(input) orelse return null; + defer std.heap.page_allocator.free(slots); + const ngroups = re.ngroups; + const out = allocator.alloc([]const u8, ngroups) catch return null; + var g: usize = 0; + while (g < ngroups) : (g += 1) { + const s = slots[2 * g]; + const e = slots[2 * g + 1]; + if (s < 0 or e < s) { + out[g] = ""; + } else { + out[g] = input[@as(usize, @intCast(s))..@as(usize, @intCast(e))]; + } + } + return out; +} + +/// re_find_all — every non-overlapping match. Caller owns the returned +/// slice-of-slices and must free both levels. +pub fn re_find_all(allocator: Allocator, pattern: []const u8, input: []const u8) ?[][][]const u8 { + var re = _re_engine.compile(std.heap.page_allocator, pattern) orelse return null; + defer re.deinit(); + var rows = std.ArrayList([][]const u8).init(allocator); + defer rows.deinit(); + var pos: usize = 0; + while (pos <= input.len) { + const slots = re.findFrom(input, pos) orelse break; + defer std.heap.page_allocator.free(slots); + const ngroups = re.ngroups; + const row = allocator.alloc([]const u8, ngroups) catch return null; + var g: usize = 0; + while (g < ngroups) : (g += 1) { + const s = slots[2 * g]; + const e = slots[2 * g + 1]; + if (s < 0 or e < s) { + row[g] = ""; + } else { + row[g] = input[@as(usize, @intCast(s))..@as(usize, @intCast(e))]; + } + } + rows.append(row) catch return null; + const mstart = @as(usize, @intCast(slots[0])); + const mend = @as(usize, @intCast(slots[1])); + if (mend == mstart) { + pos = mend + 1; + } else { + pos = mend; + } + } + return rows.toOwnedSlice() catch return null; +} + +/// re_replace — replace every match in `input` with `replacement`. The +/// replacement string is taken literally; $& / $1.. 
substitution is not +/// expanded in this minimal wrapper (matches the engine's current shape). +/// On zero-width match the current rune is emitted and we advance by one +/// byte, mirroring the ECMAScript convention used by other ports. +pub fn re_replace(allocator: Allocator, pattern: []const u8, input: []const u8, replacement: []const u8) ![]u8 { + var re = _re_engine.compile(std.heap.page_allocator, pattern) orelse { + return allocator.dupe(u8, input); + }; + defer re.deinit(); + var out = std.ArrayList(u8).init(allocator); + defer out.deinit(); + var pos: usize = 0; + while (pos <= input.len) { + const slots = re.findFrom(input, pos) orelse { + try out.appendSlice(input[pos..]); + break; + }; + defer std.heap.page_allocator.free(slots); + const mstart = @as(usize, @intCast(slots[0])); + const mend = @as(usize, @intCast(slots[1])); + try out.appendSlice(input[pos..mstart]); + try out.appendSlice(replacement); + if (mend == mstart) { + if (mstart < input.len) { + try out.append(input[mstart]); + pos = mstart + 1; + } else { + pos = mstart + 1; + } + } else { + pos = mend; + } + } + return out.toOwnedSlice(); +} + pub fn re_escape(allocator: Allocator, s: []const u8) ![]const u8 { return escre(allocator, s); } diff --git a/zig/test/regex_pathological.zig b/zig/test/regex_pathological.zig new file mode 100644 index 0000000..5ae19cd --- /dev/null +++ b/zig/test/regex_pathological.zig @@ -0,0 +1,83 @@ +// Discovery test: pathological regex inputs run against the port's re_* API. +// Goal is to surface failures across ports, not to assert behaviour. +// Panel is the same in every port (see REGEX.md). 
+ +const std = @import("std"); +const voxgig_struct = @import("voxgig-struct"); + +fn ms_since(t0: i128) f64 { + return @as(f64, @floatFromInt(std.time.nanoTimestamp() - t0)) / 1e6; +} + +test "regex pathological discovery" { + const writer = std.io.getStdOut().writer(); + var arena = std.heap.ArenaAllocator.init(std.testing.allocator); + defer arena.deinit(); + const alloc = arena.allocator(); + + const a22 = try alloc.alloc(u8, 22); + @memset(a22, 'a'); + const p1_in = try std.fmt.allocPrint(alloc, "{s}!", .{a22}); + + var nest_buf: [120]u8 = undefined; + var pos: usize = 0; + while (pos < 40) : (pos += 1) nest_buf[pos] = '('; + nest_buf[pos] = 'a'; + pos += 1; + var i: usize = 0; + while (i < 40) : (i += 1) { + nest_buf[pos] = ')'; + pos += 1; + } + const nest40 = nest_buf[0..pos]; + + var t0 = std.time.nanoTimestamp(); + const b1 = voxgig_struct.re_test("^(a+)+$", p1_in); + try writer.print("[regex-discovery] P1_redos_nested_plus | {d:.2}ms | OK | {}\n", .{ ms_since(t0), b1 }); + + t0 = std.time.nanoTimestamp(); + const b2 = voxgig_struct.re_test("^(a|aa)+$", p1_in); + try writer.print("[regex-discovery] P2_redos_alt_overlap | {d:.2}ms | OK | {}\n", .{ ms_since(t0), b2 }); + + t0 = std.time.nanoTimestamp(); + const p3 = try voxgig_struct.re_replace(alloc, "a*", "abc", "X"); + try writer.print("[regex-discovery] P3_empty_repeat_replace | {d:.2}ms | OK | \"{s}\"\n", .{ ms_since(t0), p3 }); + + t0 = std.time.nanoTimestamp(); + const p4 = try voxgig_struct.re_replace(alloc, "\\.", "café.au.lait", "/"); + try writer.print("[regex-discovery] P4_unicode_replace_dot | {d:.2}ms | OK | \"{s}\"\n", .{ ms_since(t0), p4 }); + + t0 = std.time.nanoTimestamp(); + if (voxgig_struct.re_find(alloc, "é", "café au lait")) |p5| { + try writer.print("[regex-discovery] P5_unicode_find_codepoint | {d:.2}ms | OK | [\"{s}\"]\n", .{ ms_since(t0), p5[0] }); + } else { + try writer.print("[regex-discovery] P5_unicode_find_codepoint | {d:.2}ms | OK | null\n", .{ms_since(t0)}); + } + + t0 = 
std.time.nanoTimestamp(); + const b6 = voxgig_struct.re_test(nest40, "a"); + try writer.print("[regex-discovery] P6_deep_nesting_compile | {d:.2}ms | OK | {}\n", .{ ms_since(t0), b6 }); + + t0 = std.time.nanoTimestamp(); + const b7 = voxgig_struct.re_test("^a{0,10000}b$", "aaaaaaaaaab"); + try writer.print("[regex-discovery] P7_big_bounded_quantifier | {d:.2}ms | OK | {}\n", .{ ms_since(t0), b7 }); + + t0 = std.time.nanoTimestamp(); + const p8 = voxgig_struct.re_compile("[abc"); + if (p8 == null) { + try writer.print("[regex-discovery] P8_invalid_pattern | {d:.2}ms | ERR | compile returned null\n", .{ms_since(t0)}); + } else { + try writer.print("[regex-discovery] P8_invalid_pattern | {d:.2}ms | OK | \"compiled\"\n", .{ms_since(t0)}); + } + + t0 = std.time.nanoTimestamp(); + const b9 = voxgig_struct.re_test("^(a+)\\1$", "aaaa"); + try writer.print("[regex-discovery] P9_backref_re2_forbidden | {d:.2}ms | OK | {}\n", .{ ms_since(t0), b9 }); + + t0 = std.time.nanoTimestamp(); + if (voxgig_struct.re_find_all(alloc, "a*", "bbb")) |p10| { + try writer.print("[regex-discovery] P10_find_all_zero_width | {d:.2}ms | OK | <{} matches>\n", .{ ms_since(t0), p10.len }); + } else { + try writer.print("[regex-discovery] P10_find_all_zero_width | {d:.2}ms | OK | null\n", .{ms_since(t0)}); + } +}