fix: update HTML cleanup to preserve text after <sup> tags by er-mene · Pull Request #1955 · unclecode/crawl4ai

er-mene · 2026-05-04T11:37:26Z

Summary

Fixes the only_text cleanup regression where removing inline tags such as <sup> could also delete text that appears after the closing tag.
This updates the BeautifulSoup-based optimized content extraction path so that, once an inline tag has been unwrapped or replaced, the handler returns immediately instead of continuing to process the now-detached node.
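
To illustrate the control flow described above, here is a minimal sketch using BeautifulSoup directly. It is not the code from crawl4ai/utils.py: the names INLINE_TAGS, flatten_inline and only_text are hypothetical, and the tag set is an illustrative stand-in for the project's ONLY_TEXT_ELIGIBLE_TAGS constant.

```python
from bs4 import BeautifulSoup, Tag

# Illustrative stand-in for the project's ONLY_TEXT_ELIGIBLE_TAGS (assumption).
INLINE_TAGS = {"sup", "sub", "b", "i", "em", "strong", "span"}


def flatten_inline(element):
    """Unwrap inline tags bottom-up, keeping their text and any following text."""
    # Copy the child list first, because unwrapping mutates the tree.
    for child in list(element.children):
        if isinstance(child, Tag):
            flatten_inline(child)

    if element.name in INLINE_TAGS:
        # unwrap() replaces the tag with its children and detaches the tag.
        element.unwrap()
        # Return right away: the tag is no longer part of the tree.
        return

    # Further cleanup applies only to elements that are still attached.
    element.attrs = {}


def only_text(html):
    """Hypothetical helper: return the text of html with inline tags flattened."""
    soup = BeautifulSoup(html, "html.parser")
    flatten_inline(soup)
    return soup.get_text()
```

Because unwrap() keeps the tag's children in place, the text node that follows the closing tag stays a sibling in the parent and survives extraction.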

List of files changed and why

crawl4ai/utils.py - Fixed inline-tag handling in get_content_of_website_optimized() so that text following an inline tag is preserved, and reused the shared ONLY_TEXT_ELIGIBLE_TAGS constant from config instead of duplicating the list.
tests/regression/test_reg_utils.py - Added a regression test covering Alpha¹Beta in only_text=True mode to ensure trailing text is not removed.
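
The regression test might take roughly the following shape. This is a sketch only: it assumes the input markup is Alpha<sup>1</sup>Beta, takes its test name from the -k filter used in the testing section, and exercises a hypothetical stand-in (strip_inline) rather than get_content_of_website_optimized(), whose signature is not shown in this PR description.

```python
from bs4 import BeautifulSoup


def strip_inline(html, tags=("sup",)):
    """Hypothetical stand-in for the only_text cleanup: unwrap the given tags."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(list(tags)):
        tag.unwrap()  # keep the tag's text, drop only the markup
    return soup.get_text()


def test_sup_preserves_following_text():
    text = strip_inline("<p>Alpha<sup>1</sup>Beta</p>")
    # Text before, inside and after the <sup> tag must all survive.
    assert "Alpha" in text
    assert "1" in text
    assert "Beta" in text
```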

How Has This Been Tested?

Ran targeted regression test:
.venv/bin/python -m pytest tests/regression/test_reg_utils.py -k sup_preserves_following_text
Result: 1 passed, 71 deselected

Checklist:

[ x ] My code follows the style guidelines of this project
[ x ] I have performed a self-review of my own code
[ x ] I have commented my code, particularly in hard-to-understand areas
[ x ] I have made corresponding changes to the documentation
[ x ] I have added/updated unit tests that prove my fix is effective or that my feature works
[ x ] New and existing unit tests pass locally with my changes

fix: update HTML cleanup to preserve text after <sup> tags

bbe3e1e

er-mene wants to merge 1 commit into unclecode:develop from er-mene:bugfix/sup-tag-text-deletion
