Fix C tokenizer NUL byte truncation #358

Open

kimjune01 wants to merge 4 commits into earwig:main from kimjune01:fix-nul-truncation

Conversation

@kimjune01

Summary

  • The C tokenizer returned '\0' both for real NUL bytes in the input and for end-of-input, causing silent truncation at the first NUL byte. This patch introduces a TOKENIZER_EOF sentinel (0x110000, the first value outside the valid Unicode range) and replaces all !this / !last truthiness checks with explicit == TOKENIZER_EOF comparisons (a sketch follows this list).
  • Removes '\0' from the MARKERS array since NUL is no longer a special marker.
  • Adds regression tests confirming NUL bytes are preserved in plain text, templates, and multi-NUL inputs.
  • Both C and Python tokenizers now produce identical output for NUL-containing wikitext.
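A minimal sketch of the sentinel idea, using simplified toy types and names rather than the project's actual tokenizer structures:

```c
#include <stddef.h>
#include <stdint.h>

/* First value outside the valid Unicode range (code points end at 0x10FFFF),
 * so it can never collide with a character read from the input. */
#define TOKENIZER_EOF 0x110000

/* Hypothetical, simplified tokenizer state for illustration only. */
typedef struct {
    const uint32_t *text;   /* decoded code points */
    size_t length;          /* number of code points */
    size_t head;            /* current read position */
} Toy_Tokenizer;

/* Before: '\0' signalled end-of-input, so a genuine U+0000 in the text
 * looked exactly like EOF and everything after it was dropped. */
static uint32_t toy_read_old(Toy_Tokenizer *self, size_t delta)
{
    size_t index = self->head + delta;
    return (index >= self->length) ? '\0' : self->text[index];
}

/* After: an out-of-band sentinel makes end-of-input unambiguous. */
static uint32_t toy_read_new(Toy_Tokenizer *self, size_t delta)
{
    size_t index = self->head + delta;
    return (index >= self->length) ? TOKENIZER_EOF : self->text[index];
}
```

Because 0x110000 lies outside the Unicode range, no decoded character can ever equal the sentinel, which is what makes the explicit comparison safe.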

Test plan

  • pytest tests/test_tokenizer.py::test_nul_byte_preservation passes for both CTokenizer and PyTokenizer
  • Full test suite passes with no regressions

kimjune01 added 4 commits May 11, 2026 20:49

The C tokenizer was silently truncating input at the first NUL byte (\x00)
because it used '\0' both as a valid input character and as the EOF sentinel.

Root cause:
- Tokenizer_read() returned '\0' for both:
  1. Real NUL bytes in the input
  2. End-of-input (when index >= text.length)
- This made them indistinguishable, causing real NULs to be treated as EOF (see the sketch below)
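Building on the toy helpers above, a hedged sketch of how the old truthiness check truncates (the real parse loop is far more involved):

```c
/* With the old read, scanning stops at the first real NUL byte and the
 * remainder of the input is silently dropped. */
static size_t toy_scan_old(Toy_Tokenizer *tok)
{
    size_t emitted = 0;
    while (1) {
        uint32_t this = toy_read_old(tok, 0);
        if (!this)              /* true at end-of-input AND at a real U+0000 */
            break;
        /* ... emit `this` into the current token ... */
        emitted++;
        tok->head++;
    }
    return emitted;             /* only counts characters before the first NUL */
}
```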

Fix (see the sketch after this list):
1. Define TOKENIZER_EOF as 0x110000 (first invalid Unicode code point)
2. Update Tokenizer_read() and Tokenizer_read_backwards() to return
   TOKENIZER_EOF instead of '\0' for out-of-bounds reads
3. Replace all `!this` and `'\0'` checks with explicit `TOKENIZER_EOF` checks
4. Remove '\0' from the MARKERS array (no longer needed as EOF marker)
5. Move the EOF check before is_marker() in the main parse loop so that
   TOKENIZER_EOF is never emitted as a character
6. Fix Tokenizer_has_leading_whitespace() to recognize TOKENIZER_EOF
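Continuing the toy example, a sketch of the corrected loop shape after steps 3–5; the real loop dispatches markers to many handlers, so this only shows the ordering:

```c
/* Toy stand-in for the MARKERS lookup; note that '\0' is no longer listed. */
static int toy_is_marker(uint32_t ch)
{
    static const uint32_t markers[] = {'{', '}', '[', ']', '<', '>', '|', '='};
    for (size_t i = 0; i < sizeof(markers) / sizeof(markers[0]); i++)
        if (markers[i] == ch)
            return 1;
    return 0;
}

/* The explicit EOF comparison runs first, so the sentinel never reaches
 * is_marker() or gets emitted; a genuine U+0000 falls through and is
 * emitted like any other code point. */
static size_t toy_scan_new(Toy_Tokenizer *tok)
{
    size_t emitted = 0;
    while (1) {
        uint32_t this = toy_read_new(tok, 0);
        if (this == TOKENIZER_EOF)      /* step 5: EOF check comes first */
            break;
        if (toy_is_marker(this)) {
            /* ... dispatch to the appropriate handler ... */
        }
        /* ... emit `this`, including genuine NUL bytes ... */
        emitted++;
        tok->head++;
    }
    return emitted;
}
```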

The Python tokenizer already preserved NUL bytes correctly; this brings
the C tokenizer into parity.

Regression test added: test_nul_byte_preservation() verifies that both
tokenizers now preserve NUL bytes in plain text, templates, and
multiple-NUL scenarios.

Four start-of-input checks in the main parse loop still used `!last`
(falsy NUL) to detect the beginning of input. After the TOKENIZER_EOF
sentinel change, Tokenizer_read_backwards returns 0x110000 instead of
'\0', so `!last` is always false, and headings, lists, and horizontal
rules at position 0 would silently fail to parse.
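A hedged sketch of that repair in the same toy style (the real code goes through Tokenizer_read_backwards and separate handlers per construct):

```c
/* Toy backwards read: positions before the start of input also yield the
 * sentinel instead of '\0'. */
static uint32_t toy_read_backwards(Toy_Tokenizer *self, size_t delta)
{
    return (delta > self->head) ? TOKENIZER_EOF
                                : self->text[self->head - delta];
}

/* Headings, lists, and horizontal rules must begin a line.  The old test
 * `!last` only worked while out-of-range reads returned '\0'; with the
 * sentinel the comparison has to be explicit. */
static int toy_at_line_start(Toy_Tokenizer *tok)
{
    uint32_t last = toy_read_backwards(tok, 1);
    return last == TOKENIZER_EOF || last == '\n';   /* was: !last || last == '\n' */
}
```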

Tokenizer_read returns TOKENIZER_EOF for end-of-input, not a falsy
value. The old truthiness check let NUL bytes truncate parsing.

Two bugs introduced by the NUL truncation fix (both sketched after the
list below):

1. Tokenizer_handle_invalid_tag_start: the tag name scanning loop
   checked is_marker() and Py_UNICODE_ISSPACE() to terminate, but
   TOKENIZER_EOF (0x110000) matches neither, causing an infinite loop
   when an incomplete closing tag like "</ref" reaches EOF without ">".

2. Tokenizer_parse colon handling: the bare external link check
   "this == ':' && !is_marker(last)" fired at start-of-input because
   is_marker(TOKENIZER_EOF) returns false (0x110000 not in MARKERS).
   This intercepted ":" before the list handler could run, breaking
   definition list items like ":text" → <dd>.
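A hedged sketch of both corrections, reusing the toy helpers above; the real code lives in Tokenizer_handle_invalid_tag_start and the colon branch of Tokenizer_parse (which uses Py_UNICODE_ISSPACE rather than the literal whitespace test shown here), and the exact guards in the patch may differ:

```c
/* 1. Tag-name scan: TOKENIZER_EOF is neither a marker nor whitespace, so
 *    an explicit EOF test is needed, or "</ref" with no closing ">" would
 *    spin forever. */
static size_t toy_scan_tag_name(Toy_Tokenizer *tok)
{
    size_t pos = 0;
    while (1) {
        uint32_t this = toy_read_new(tok, pos);
        if (this == TOKENIZER_EOF)                  /* added termination */
            break;
        if (toy_is_marker(this) || this == ' ' || this == '\n')
            break;
        pos++;
    }
    return pos;
}

/* 2. Colon handling: at start of input, `last` is the sentinel, which is
 *    not in MARKERS, so the bare-external-link shortcut used to fire and
 *    ":text" never reached the definition-list handler.  Guarding against
 *    the sentinel restores the intended behaviour. */
static int toy_colon_is_bare_link(uint32_t this, uint32_t last)
{
    return this == ':' && last != TOKENIZER_EOF && !toy_is_marker(last);
}
```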