Ensure proper cleanup for <pre> tags containing complex HTML#22

Open
rainux wants to merge 1 commit into kepano:main from rainux:handle-complex-pre

Conversation

rainux (Contributor) commented Apr 1, 2025

This PR improves compatibility with pages where <pre> tags contain block-level elements or line breaks (e.g., <p>, <ul>, <br>), which Defuddle previously did not clean correctly.

Problem: The previous logic treated all <pre>-like elements as potential code blocks. Complex HTML inside them was formatted incorrectly because the standard cleanup ignores the contents of <pre>.

Solution: The rule handling preformatted elements (codeBlockRules in code.ts) now checks the children of <pre> (and similar containers):

  1. If block-level elements or <br> are found, the <pre> is converted to a <div>. This ensures its content is treated as standard HTML by later cleanup steps.
  2. Otherwise, it's standardized into a <pre><code> block (maintaining the original behavior for simple preformatted text and code).
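
The two branches above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: it uses a hypothetical `SketchNode` tree type instead of the real DOM so it runs on its own, and the names `BLOCK_TAGS`, `containsBlockContent`, `textOf`, and `transformPre` are made up for the example. The list of block-level tags is an assumption (the description only names <p>, <ul>, and <br>), and so is the choice to search all descendants rather than only direct children.

```typescript
// Hypothetical minimal tree type so the sketch runs without a DOM.
interface SketchNode {
  tag: string;              // lowercase tag name, or "#text" for a text node
  text?: string;            // only used when tag === "#text"
  children?: SketchNode[];
}

// Assumed list of tags that mark a <pre> as "complex"; <br> is included
// because the PR description treats it the same way as block elements.
const BLOCK_TAGS = new Set<string>([
  "p", "ul", "ol", "li", "div", "table", "blockquote",
  "h1", "h2", "h3", "h4", "h5", "h6", "br",
]);

// True if any descendant is one of the assumed block-level tags.
function containsBlockContent(node: SketchNode): boolean {
  for (const child of node.children || []) {
    if (BLOCK_TAGS.has(child.tag) || containsBlockContent(child)) {
      return true;
    }
  }
  return false;
}

// Concatenates the text of all text nodes under `node`, in document order.
function textOf(node: SketchNode): string {
  if (node.tag === "#text") {
    return node.text || "";
  }
  return (node.children || []).map(textOf).join("");
}

// Branch 1: complex content, so re-tag as <div> and keep the children
// so that later cleanup steps treat them as ordinary HTML.
// Branch 2: simple content, so standardize to <pre><code>text</code></pre>.
function transformPre(pre: SketchNode): SketchNode {
  if (containsBlockContent(pre)) {
    return { tag: "div", children: pre.children || [] };
  }
  return {
    tag: "pre",
    children: [
      { tag: "code", children: [{ tag: "#text", text: textOf(pre) }] },
    ],
  };
}
```

With this shape, a <pre> holding a <p> or a <br> comes back as a <div> with its children untouched, while a <pre> holding only text and inline wrappers comes back as a <pre> with a single <code> child containing the flattened text.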

Potential Refactoring:

The transform function now does two different things depending on the content. I'm not sure whether it's better to keep it this way for simplicity or to split the logic for "convert complex <pre> to <div>" and "standardize to <pre><code>" into separate helper functions within code.ts. The filename and rule name could also be renamed to something like preformattedXxx.

Happy to discuss or explore this in a follow-up if you think it makes sense!

This PR was co-authored with Gemini 2.5 Pro.

kepano (Owner) commented Apr 2, 2025

Thanks. Can you provide an example of a page that didn't work before?

rainux (Contributor, Author) commented Apr 3, 2025

I'm sorry, I didn't provide the URL directly because it's a site for pornographic novels: https://hlib.cc/n/15263018

Also, I understand that complex HTML tags should not exist inside <pre> tags. This PR is not "fixing" something but trying to maximize compatibility with sites that don't respect web standards.
