Fix unreachable!() panic when DOCTYPE appears between text runs in element content #964
Open
williamareynolds wants to merge 1 commit into
Mingun (Collaborator) requested changes on May 13, 2026, and left a comment:
Generally approve, except that the trailing whitespace handling question still needs to be checked.
Codecov Report
❌ Patch coverage is
Additional details and impacted files

| Coverage Diff | master | #964 | +/- |
|---|---|---|---|
| Coverage | 55.08% | 57.31% | +2.22% |
| Files | 44 | 46 | +2 |
| Lines | 16911 | 18197 | +1286 |
| Hits | 9316 | 10429 | +1113 |
| Misses | 7595 | 7768 | +173 |

Flags with carried forward coverage won't be shown.
0587e39 to 7b7991d
…ement content

The deserializer's `drain_text` merges consecutive `Text`, `CData`, and `GeneralRef` payload events into a single `DeEvent::Text` so that `read_text` and friends see at most one text run per element. DOCTYPE events were not part of that merge: `drain_text`'s break condition (`current_event_is_last_text`) treats DocType as a non-text event and exits the loop, and the outer `next()` resumes by capturing the DocType and continuing, emitting a *second* `DeEvent::Text` for the trailing text run.

For input like `<a>x<!DOCTYPE y>z</a>` (and shapes derived from it), `read_text` then matched on a second `DeEvent::Text` and tripped `unreachable!()` with the comment 'Cannot be two consequent Text events, they would be merged into one'. The merge invariant was the right idea; the implementation just missed the DocType case.

Fix: treat DocType as transparent during the text drain. It still goes through `entity_resolver.capture()` (so DTD entities defined inside the document body remain available for later `&entity;` resolution), but the surrounding text runs are merged into one `DeEvent::Text` and the `unreachable!()` is no longer reachable from valid or malformed real-world input.

Discovered via libFuzzer running against a SAML deserializer harness on the consuming application. The input that surfaced it is a 100-byte sequence containing nested DOCTYPE declarations inside what the parser treats as element content.

Adds four regression tests in a new `doctype_in_element_text` module in `tests/serde-de.rs`:
- a minimal repro (`<a>x<!DOCTYPE y>z</a>`),
- a multi-DOCTYPE shape mirroring the original fuzzer find,
- a leading-DOCTYPE variant, and
- a whitespace-around-DOCTYPE variant (added per review feedback) confirming that adjacent whitespace is preserved verbatim when the surrounding text runs are merged.

Full `cargo test --all-features` stays green (1,712 passing).
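The difference between the old and new drain behavior can be illustrated with a toy model. This is only a sketch, not quick-xml's code: `Payload`, `DeEvent`, `de_events`, and the `doctype_is_transparent` flag are hypothetical stand-ins invented here, and the real `drain_text` also handles `CData` and `GeneralRef` and forwards each DOCTYPE to the entity resolver.

```rust
// Toy model only: hypothetical stand-ins, not quick-xml's real types.
#[derive(Debug)]
enum Payload {
    Text(&'static str),
    DocType,
    End,
}

#[derive(Debug, PartialEq)]
enum DeEvent {
    Text(String),
    End,
}

/// Converts a payload stream into deserializer-level events.
/// `doctype_is_transparent = false` models the old behavior (the drain
/// stops at a DOCTYPE); `true` models the behavior after the fix.
fn de_events(payload: &[Payload], doctype_is_transparent: bool) -> Vec<DeEvent> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < payload.len() {
        match &payload[i] {
            Payload::Text(first) => {
                let mut merged = String::from(*first);
                i += 1;
                // Drain: keep merging while the lookahead may belong to this text run.
                while i < payload.len() {
                    match &payload[i] {
                        Payload::Text(t) => {
                            merged.push_str(t);
                            i += 1;
                        }
                        // The real code would capture the DOCTYPE here.
                        Payload::DocType if doctype_is_transparent => i += 1,
                        _ => break,
                    }
                }
                out.push(DeEvent::Text(merged));
            }
            // Outer loop: a DOCTYPE is consumed and produces no event.
            Payload::DocType => i += 1,
            Payload::End => {
                out.push(DeEvent::End);
                i += 1;
            }
        }
    }
    out
}
```

Fed the event shape of `<a>x<!DOCTYPE y>z</a>` (text, DOCTYPE, text, end), the model yields two `Text` events with the flag off and a single merged `Text("xz")` with it on, which is the invariant `read_text` relies on.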
7b7991d to e00ae5c
Summary
The serde deserializer panics with `internal error: entered unreachable code` (`src/de/mod.rs`, `read_text`'s "Cannot be two consequent Text events" branch) when a DOCTYPE declaration appears between text runs inside element content.
Minimal repro (input): `<a>x<!DOCTYPE y>z</a>`
I found this while fuzzing a new library I'm working on. I'm not an expert on quick-xml so I hope I got this all right.
Root cause
`Deserializer::drain_text` is the merge step that guarantees `read_text` (and friends) see at most one `DeEvent::Text` per element. It loops while `current_event_is_last_text` returns `false`, draining `Text`/`CData`/`GeneralRef` payload events into a single result.
`DocType` is not in `current_event_is_last_text`'s match arm, so the loop exits when a DocType appears. The outer `Deserializer::next()` then:
1. Emits `DeEvent::Text` for the first text run.
2. Reaches the `DocType` (captures it via the entity resolver) and `continue`s.
3. Emits a second `DeEvent::Text` for the trailing text run.

`read_text`, expecting a single text run followed by `End`, hits the consecutive-`Text` `unreachable!()`.

Fix
Treat `DocType` as transparent during the text drain:
- `current_event_is_last_text` now also returns `false` when the lookahead is `DocType`, so `drain_text` keeps draining instead of breaking at a DOCTYPE boundary.
- `drain_text`'s match arm gains a `PayloadEvent::DocType(e)` case that forwards the event to `entity_resolver.capture()` (identical to the existing `Deserializer::next()` path) and continues draining.

DTD entities defined inside the document body remain available for subsequent `&entity;` resolution: the entity resolver sees DOCTYPE events in exactly the same order as before. The only observable change is that the surrounding text runs merge into one `DeEvent::Text`, which matches what `read_text` and the rest of the serde deserializer have always assumed.

Tests
Adds a new `doctype_in_element_text` module in `tests/serde-de.rs` with three cases:
- `single_doctype_between_text`: `<a>x<!DOCTYPE y>z</a>`
- `multiple_doctypes_between_text`: `<a>x<!DOCTYPE y><!DOCTYPE z>w</a>` (matches the fuzzer-discovered shape)
- `leading_doctype_then_text`: `<a><!DOCTYPE y>x</a>`

All three previously panicked and now pass. Full `cargo test --all-features` stays green (1,711 passing, no regressions).

MSRV / fmt / minimal-versions
- `cargo fmt --check`: clean.
- `cargo check` on `rust-toolchain 1.79.0`: clean (no new language or stdlib features used).
- `src/de/mod.rs` plus the new test cases.
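The claim that the entity resolver still sees DOCTYPE events in their original order, while the text around them merges verbatim, can be sketched with a small stand-alone model. Everything here is hypothetical and written for illustration: `Payload`, `RecordingResolver`, and `drain_element_text` are not quick-xml APIs, and the model assumes the slice holds only the text and DOCTYPE events of a single element.

```rust
// Hypothetical stand-ins for illustration; not quick-xml's real types.
enum Payload<'a> {
    Text(&'a str),
    DocType(&'a str),
}

/// Records DOCTYPE bodies in the order they are seen.
/// This stands in for the role of `entity_resolver.capture()`.
#[derive(Default)]
struct RecordingResolver {
    captured: Vec<String>,
}

impl RecordingResolver {
    fn capture(&mut self, doctype: &str) {
        self.captured.push(doctype.to_string());
    }
}

/// Merges every text run of one element's content into a single string,
/// forwarding each DOCTYPE to the resolver instead of splitting the text.
fn drain_element_text(content: &[Payload<'_>], resolver: &mut RecordingResolver) -> String {
    let mut merged = String::new();
    for event in content {
        match event {
            // Text is appended verbatim, including any whitespace.
            Payload::Text(t) => merged.push_str(t),
            // DOCTYPE is transparent to the text but still captured in order.
            Payload::DocType(d) => resolver.capture(d),
        }
    }
    merged
}
```

For the shape of `<a>x<!DOCTYPE y><!DOCTYPE z>w</a>`, the model returns `"xw"` and records `y` then `z`; whitespace next to a DOCTYPE is kept as written because text is only ever appended, never trimmed.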