@lowellstewart

Some DOCX files contain <w:lastRenderedPageBreak/> elements, which appear to be Word's way of recording that "the last time I calculated the pagination for this document, a page break fell here." That may be useful to some applications, but the element does NOT represent any visible or editable content in the document.

The element is not recognized by UnicodeMapper, which renders it as a U+0001 control character, and that in turn breaks the behavior of OpenXmlRegex. The UnicodeMapper issue also breaks DocumentAssembler when the <w:lastRenderedPageBreak/> happens to fall within the contents of a field: the control character becomes part of the XML that DocumentAssembler is trying to parse, and it throws an exception. Fixing the issue in UnicodeMapper fixes that DocumentAssembler exception, but a related issue then surfaces in OpenXmlRegex (which DocumentAssembler also uses), so both must be fixed at the same time.
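
To make the failure mode concrete, here is a minimal sketch in Python (standard library only). It is not the UnicodeMapper code from this PR; `run_to_text`, its `skip_layout_hints` flag, and the sample field text are hypothetical names and data chosen for illustration. The only details taken from the description above are that an unrecognized element becomes U+0001 and that the resulting string is later parsed as XML.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PLACEHOLDER = "\u0001"  # stand-in emitted for any element the mapper does not recognize


def run_to_text(run, skip_layout_hints=True):
    """Flatten a w:r element into a plain string (hypothetical helper)."""
    parts = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            parts.append(child.text or "")
        elif child.tag == f"{{{W}}}tab":
            parts.append("\t")
        elif child.tag == f"{{{W}}}rPr":
            continue  # run properties carry formatting, not text
        elif child.tag == f"{{{W}}}lastRenderedPageBreak" and skip_layout_hints:
            continue  # pagination hint only: contributes no characters
        else:
            parts.append(PLACEHOLDER)
    return "".join(parts)


# A run whose text is a field-style XML fragment, split by a pagination hint.
RUN_XML = (
    f'<w:r xmlns:w="{W}">'
    '<w:t>&lt;Content Select="./</w:t>'
    '<w:lastRenderedPageBreak/>'
    '<w:t>Name"/&gt;</w:t>'
    '</w:r>'
)

run = ET.fromstring(RUN_XML)
before = run_to_text(run, skip_layout_hints=False)  # contains U+0001
after = run_to_text(run)                            # hint skipped

ET.fromstring(after)  # parses cleanly
try:
    ET.fromstring(before)
except ET.ParseError as exc:
    print("field text no longer parses:", exc)
```

With the hint skipped, the two text pieces join back into a well-formed fragment. With it mapped to U+0001, the parser rejects the string, which mirrors the exception described above.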

I have added test cases that highlight specific failure cases and then fixed the bugs so the test cases pass.

@lowellstewart lowellstewart requested a review from stesee as a code owner January 6, 2026 21:46
@stesee stesee merged commit 7c43497 into Codeuctivity:main Jan 7, 2026
6 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Jan 7, 2026
@lowellstewart lowellstewart deleted the fix/open-xml-regex-bugs branch January 7, 2026 23:19