Skip to content

Fix unreachable!() panic when DOCTYPE appears between text runs in element content#964

Open
williamareynolds wants to merge 1 commit into
tafia:masterfrom
williamareynolds:fix/de-doctype-in-text-unreachable
Open

Fix unreachable!() panic when DOCTYPE appears between text runs in element content#964
williamareynolds wants to merge 1 commit into
tafia:masterfrom
williamareynolds:fix/de-doctype-in-text-unreachable

Conversation

@williamareynolds
Copy link
Copy Markdown

Summary

The serde deserializer panics with internal error: entered unreachable code (src/de/mod.rs read_text's "Cannot be two consequent Text events" branch) when a DOCTYPE declaration appears between text runs inside element content.
Minimal repro:

use serde::Deserialize;

#[derive(Deserialize)]
struct Wrapper { #[serde(rename = "$text")] text: String }

// Panics on master:
//   thread '...' panicked at quick-xml/src/de/mod.rs:2910:21:
//   internal error: entered unreachable code
let _: Wrapper = quick_xml::de::from_str(r#"<a>x<!DOCTYPE y>z</a>"#).unwrap();

I found this while fuzzing a new library I'm working on. I'm not an expert on quick-xml so I hope I got this all right.

Root cause

Deserializer::drain_text is the merge step that guarantees read_text (and friends) see at most one DeEvent::Text per element. It loops while current_event_is_last_text returns false, draining Text / CData / GeneralRef
payload events into a single result.

DocType is not in current_event_is_last_text's match arm, so the loop exits when a DocType appears. The outer Deserializer::next() then:

  1. Returns the accumulated DeEvent::Text for the first text run.
  2. On the next call, processes the DocType (captures it via the entity resolver) and continues.
  3. Returns the second text run as a separate DeEvent::Text.

read_text, expecting a single text run followed by End, hits the consecutive-Text unreachable!().

Fix

Treat DocType as transparent during the text drain:

  • current_event_is_last_text now also returns false when the lookahead is DocType, so drain_text keeps draining instead of breaking at a DOCTYPE boundary.
  • drain_text's match arm gains a PayloadEvent::DocType(e) case that forwards the event to entity_resolver.capture() (identical to the existing Deserializer::next() path) and continues draining.

DTD entities defined inside the document body remain available for subsequent &entity; resolution — the entity resolver sees DOCTYPE events in exactly the same order as before. The only observable change is that the surrounding text
runs merge into one DeEvent::Text, which matches what read_text and the rest of the serde deserializer have always assumed.

Tests

Adds a new doctype_in_element_text module in tests/serde-de.rs with three cases:

Case Input
single_doctype_between_text <a>x<!DOCTYPE y>z</a>
multiple_doctypes_between_text <a>x<!DOCTYPE y><!DOCTYPE z>w</a> (matches the fuzzer-discovered shape)
leading_doctype_then_text <a><!DOCTYPE y>x</a>

All three previously panicked, now pass. Full cargo test --all-features stays green (1,711 passing, no regressions).

MSRV / fmt / minimal-versions

  • cargo fmt --check: clean.
  • cargo check on rust-toolchain 1.79.0: clean (no new language or stdlib features used).
  • No new dependencies; the fix is local to src/de/mod.rs plus the new test cases.

Copy link
Copy Markdown
Collaborator

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally approve except needing to check the trailing whitespace handling question.

Comment thread tests/serde-de.rs
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 13, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 28.57143% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.31%. Comparing base (a759d65) to head (e00ae5c).
⚠️ Report is 54 commits behind head on master.

Files with missing lines Patch % Lines
src/de/mod.rs 28.57% 10 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #964      +/-   ##
==========================================
+ Coverage   55.08%   57.31%   +2.22%     
==========================================
  Files          44       46       +2     
  Lines       16911    18197    +1286     
==========================================
+ Hits         9316    10429    +1113     
- Misses       7595     7768     +173     
Flag Coverage Δ
unittests 57.31% <28.57%> (+2.22%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@williamareynolds williamareynolds force-pushed the fix/de-doctype-in-text-unreachable branch from 0587e39 to 7b7991d Compare May 13, 2026 20:55
…ement content

The deserializer's `drain_text` merges consecutive `Text`, `CData`, and
`GeneralRef` payload events into a single `DeEvent::Text` so that
`read_text` and friends see at most one text run per element. DOCTYPE
events were not part of that merge: `drain_text`'s break condition
(`current_event_is_last_text`) treats DocType as a non-text event and
exits the loop, and the outer `next()` resumes by capturing the
DocType and continuing — emitting a *second* `DeEvent::Text` for the
trailing text run.

For input like `<a>x<!DOCTYPE y>z</a>` (and shapes derived from it),
`read_text` then matched on a second `DeEvent::Text` and tripped
`unreachable!()` with the comment 'Cannot be two consequent Text
events, they would be merged into one'. The merge invariant was the
right idea — the implementation just missed the DocType case.

Fix: treat DocType as transparent during the text drain. It still goes
through `entity_resolver.capture()` (so DTD entities defined inside
the document body remain available for later `&entity;` resolution),
but the surrounding text runs are merged into one `DeEvent::Text` and
the `unreachable!()` is no longer reachable from valid or malformed
real-world input.

Discovered via libFuzzer running against a SAML deserializer harness
on the consuming application — the input that surfaced it is a
100-byte sequence containing nested DOCTYPE declarations inside what
the parser treats as element content.

Adds four regression tests in a new `doctype_in_element_text` module
in `tests/serde-de.rs`: a minimal repro (`<a>x<!DOCTYPE y>z</a>`), a
multi-DOCTYPE shape mirroring the original fuzzer find, a leading-
DOCTYPE variant, and a whitespace-around-DOCTYPE variant (added per
review feedback) confirming that adjacent whitespace is preserved
verbatim when the surrounding text runs are merged. Full
`cargo test --all-features` stays green (1,712 passing).
@williamareynolds williamareynolds force-pushed the fix/de-doctype-in-text-unreachable branch from 7b7991d to e00ae5c Compare May 14, 2026 21:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants