fix: update HTML cleanup to preserve text after <sup> tags #1955

Open
er-mene wants to merge 1 commit into unclecode:develop from er-mene:bugfix/sup-tag-text-deletion
@er-mene er-mene commented May 4, 2026

Summary

Fixes a regression in the only_text cleanup: removing an inline tag such as <sup> could also delete the text that follows its closing tag.
The fix is in the BeautifulSoup-based optimized content extraction path. Once an inline tag has been unwrapped or replaced, the handler now returns immediately instead of continuing to process the detached node.
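
The early-return pattern described above can be sketched roughly as follows. This is a minimal illustration, not the actual code in crawl4ai/utils.py: the names `INLINE_TAGS`, `flatten_inline`, and `strip_inline_tags` are hypothetical, the tag set is an assumed subset, and the input assumes the "1" in the regression case is wrapped in <sup>.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical subset of inline tags; the real list lives in the project's config.
INLINE_TAGS = {"sup", "sub", "b", "i", "em", "strong", "span"}


def flatten_inline(element: Tag) -> None:
    """Replace inline tags with their text, leaving sibling text untouched."""
    if element.name in INLINE_TAGS:
        # Swap the tag for its text content. After this call the tag is
        # detached from the tree, so return instead of processing it further.
        element.replace_with(element.get_text())
        return
    # Iterate over a snapshot, because replace_with() mutates the child list.
    for child in list(element.children):
        if isinstance(child, Tag):
            flatten_inline(child)


def strip_inline_tags(html: str) -> str:
    """Parse an HTML fragment, flatten inline tags, and return the markup."""
    soup = BeautifulSoup(html, "html.parser")
    for child in list(soup.children):
        if isinstance(child, Tag):
            flatten_inline(child)
    return str(soup)


if __name__ == "__main__":
    print(strip_inline_tags("<p>Alpha<sup>1</sup>Beta</p>"))
```

In this sketch the text after the closing </sup> stays in place because nothing touches the tag once it has been replaced.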

List of files changed and why

  • crawl4ai/utils.py - Fixed inline-tag handling in get_content_of_website_optimized() so text after the closing tag is preserved, and reused the shared ONLY_TEXT_ELIGIBLE_TAGS constant from config instead of duplicating the list.

  • tests/regression/test_reg_utils.py - Added a regression test covering Alpha1Beta in only_text=True mode to ensure trailing text is not removed.

How Has This Been Tested?

  • Ran targeted regression test:
    .venv/bin/python -m pytest tests/regression/test_reg_utils.py -k sup_preserves_following_text

  • Result: 1 passed, 71 deselected

Checklist:

  • [x] My code follows the style guidelines of this project
  • [x] I have performed a self-review of my own code
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [x] I have made corresponding changes to the documentation
  • [x] I have added/updated unit tests that prove my fix is effective or that my feature works
  • [x] New and existing unit tests pass locally with my changes
