Fix C tokenizer NUL byte truncation #358

Open

kimjune01 wants to merge 4 commits into earwig:main from kimjune01:fix-nul-truncation

Conversation

@kimjune01

Summary

  • The C tokenizer returned '\0' both for real NUL bytes in the input and for end-of-input, causing silent truncation at the first NUL byte. This patch introduces a TOKENIZER_EOF sentinel (0x110000, the first value outside the valid Unicode range) and replaces all !this / !last truthiness checks with explicit == TOKENIZER_EOF comparisons (a sketch follows this list).
  • Removes '\0' from the MARKERS array since NUL is no longer a special marker.
  • Adds regression tests confirming NUL bytes are preserved in plain text, templates, and multi-NUL inputs.
  • Both C and Python tokenizers now produce identical output for NUL-containing wikitext.
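A minimal sketch of the sentinel idea, using simplified toy types and names rather than the project's actual tokenizer structures:

```c
#include <stddef.h>
#include <stdint.h>

/* First value outside the valid Unicode range (code points end at 0x10FFFF),
 * so it can never collide with a character read from the input. */
#define TOKENIZER_EOF 0x110000

/* Hypothetical, simplified tokenizer state for illustration only. */
typedef struct {
    const uint32_t *text;   /* decoded code points */
    size_t length;          /* number of code points */
    size_t head;            /* current read position */
} Toy_Tokenizer;

/* Before: '\0' signalled end-of-input, so a genuine U+0000 in the text
 * looked exactly like EOF and everything after it was dropped. */
static uint32_t toy_read_old(Toy_Tokenizer *self, size_t delta)
{
    size_t index = self->head + delta;
    return (index >= self->length) ? '\0' : self->text[index];
}

/* After: an out-of-band sentinel makes end-of-input unambiguous. */
static uint32_t toy_read_new(Toy_Tokenizer *self, size_t delta)
{
    size_t index = self->head + delta;
    return (index >= self->length) ? TOKENIZER_EOF : self->text[index];
}
```

Because 0x110000 lies outside the Unicode range, no decoded character can ever equal the sentinel, which is what makes the explicit comparison safe.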

Test plan

  • pytest tests/test_tokenizer.py::test_nul_byte_preservation passes for both CTokenizer and PyTokenizer
  • Full test suite passes with no regressions

kimjune01 added 4 commits May 11, 2026 20:49

The C tokenizer was silently truncating input at the first NUL byte (\x00)
because it used '\0' both as a valid input character and as the EOF sentinel.

Root cause:
- Tokenizer_read() returned '\0' for both:
  1. Real NUL bytes in the input
  2. End-of-input (when index >= text.length)
- This made them indistinguishable, causing real NULs to be treated as EOF (see the sketch below)
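Building on the toy helpers above, a hedged sketch of how the old truthiness check truncates (the real parse loop is far more involved):

```c
/* With the old read, scanning stops at the first real NUL byte and the
 * remainder of the input is silently dropped. */
static size_t toy_scan_old(Toy_Tokenizer *tok)
{
    size_t emitted = 0;
    while (1) {
        uint32_t this = toy_read_old(tok, 0);
        if (!this)              /* true at end-of-input AND at a real U+0000 */
            break;
        /* ... emit `this` into the current token ... */
        emitted++;
        tok->head++;
    }
    return emitted;             /* only counts characters before the first NUL */
}
```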

Fix (see the sketch after this list):
1. Define TOKENIZER_EOF as 0x110000 (first invalid Unicode code point)
2. Update Tokenizer_read() and Tokenizer_read_backwards() to return
   TOKENIZER_EOF instead of '\0' for out-of-bounds reads
3. Replace all `!this` and `'\0'` checks with explicit `TOKENIZER_EOF` checks
4. Remove '\0' from the MARKERS array (no longer needed as EOF marker)
5. Move the EOF check before is_marker() in the main parse loop so that
   TOKENIZER_EOF is never emitted as a character
6. Fix Tokenizer_has_leading_whitespace() to recognize TOKENIZER_EOF
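Continuing the toy example, a sketch of the corrected loop shape after steps 3–5; the real loop dispatches markers to many handlers, so this only shows the ordering:

```c
/* Toy stand-in for the MARKERS lookup; note that '\0' is no longer listed. */
static int toy_is_marker(uint32_t ch)
{
    static const uint32_t markers[] = {'{', '}', '[', ']', '<', '>', '|', '='};
    for (size_t i = 0; i < sizeof(markers) / sizeof(markers[0]); i++)
        if (markers[i] == ch)
            return 1;
    return 0;
}

/* The explicit EOF comparison runs first, so the sentinel never reaches
 * is_marker() or gets emitted; a genuine U+0000 falls through and is
 * emitted like any other code point. */
static size_t toy_scan_new(Toy_Tokenizer *tok)
{
    size_t emitted = 0;
    while (1) {
        uint32_t this = toy_read_new(tok, 0);
        if (this == TOKENIZER_EOF)      /* step 5: EOF check comes first */
            break;
        if (toy_is_marker(this)) {
            /* ... dispatch to the appropriate handler ... */
        }
        /* ... emit `this`, including genuine NUL bytes ... */
        emitted++;
        tok->head++;
    }
    return emitted;
}
```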

The Python tokenizer already preserved NUL bytes correctly; this brings
the C tokenizer into parity.

Regression test added: test_nul_byte_preservation() verifies that both
tokenizers now preserve NUL bytes in plain text, templates, and
multiple-NUL scenarios.

Four start-of-input checks in the main parse loop still used `!last`
(falsy NUL) to detect the beginning of input. After the TOKENIZER_EOF
sentinel change, Tokenizer_read_backwards returns 0x110000 instead of
'\0', so `!last` is always false, and headings, lists, and horizontal
rules at position 0 would silently fail to parse.
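A hedged sketch of that repair in the same toy style (the real code goes through Tokenizer_read_backwards and separate handlers per construct):

```c
/* Toy backwards read: positions before the start of input also yield the
 * sentinel instead of '\0'. */
static uint32_t toy_read_backwards(Toy_Tokenizer *self, size_t delta)
{
    return (delta > self->head) ? TOKENIZER_EOF
                                : self->text[self->head - delta];
}

/* Headings, lists, and horizontal rules must begin a line.  The old test
 * `!last` only worked while out-of-range reads returned '\0'; with the
 * sentinel the comparison has to be explicit. */
static int toy_at_line_start(Toy_Tokenizer *tok)
{
    uint32_t last = toy_read_backwards(tok, 1);
    return last == TOKENIZER_EOF || last == '\n';   /* was: !last || last == '\n' */
}
```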

Tokenizer_read returns TOKENIZER_EOF for end-of-input, not a falsy
value. The old truthiness check let NUL bytes truncate parsing.

Two bugs introduced by the NUL truncation fix (both sketched after the
list below):

1. Tokenizer_handle_invalid_tag_start: the tag name scanning loop
   checked is_marker() and Py_UNICODE_ISSPACE() to terminate, but
   TOKENIZER_EOF (0x110000) matches neither, causing an infinite loop
   when an incomplete closing tag like "</ref" reaches EOF without ">".

2. Tokenizer_parse colon handling: the bare external link check
   "this == ':' && !is_marker(last)" fired at start-of-input because
   is_marker(TOKENIZER_EOF) returns false (0x110000 not in MARKERS).
   This intercepted ":" before the list handler could run, breaking
   definition list items like ":text" → <dd>.
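A hedged sketch of both corrections, reusing the toy helpers above; the real code lives in Tokenizer_handle_invalid_tag_start and the colon branch of Tokenizer_parse (which uses Py_UNICODE_ISSPACE rather than the literal whitespace test shown here), and the exact guards in the patch may differ:

```c
/* 1. Tag-name scan: TOKENIZER_EOF is neither a marker nor whitespace, so
 *    an explicit EOF test is needed, or "</ref" with no closing ">" would
 *    spin forever. */
static size_t toy_scan_tag_name(Toy_Tokenizer *tok)
{
    size_t pos = 0;
    while (1) {
        uint32_t this = toy_read_new(tok, pos);
        if (this == TOKENIZER_EOF)                  /* added termination */
            break;
        if (toy_is_marker(this) || this == ' ' || this == '\n')
            break;
        pos++;
    }
    return pos;
}

/* 2. Colon handling: at start of input, `last` is the sentinel, which is
 *    not in MARKERS, so the bare-external-link shortcut used to fire and
 *    ":text" never reached the definition-list handler.  Guarding against
 *    the sentinel restores the intended behaviour. */
static int toy_colon_is_bare_link(uint32_t this, uint32_t last)
{
    return this == ':' && last != TOKENIZER_EOF && !toy_is_marker(last);
}
```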