fix: update HTML cleanup to preserve text after <sup> tags #1955
Open
er-mene wants to merge 1 commit into unclecode:develop
Summary
Fixes the only_text cleanup regression where removing inline tags such as <sup> could also delete text that appears after the closing tag.
This updates the BeautifulSoup-based optimized content extraction path so that, once an inline tag has been unwrapped or replaced, the handler returns immediately instead of continuing to process the now-detached node.
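As a rough illustration of the "replace, then return immediately" pattern described above, here is a minimal sketch. It is not the code in crawl4ai/utils.py: INLINE_TAGS, flatten_inline() and clean() are hypothetical names, and the tag set is only an assumed stand-in for ONLY_TEXT_ELIGIBLE_TAGS.

```python
from bs4 import BeautifulSoup, NavigableString

# Hypothetical stand-in for the shared constant; the real list lives in the project's config.
INLINE_TAGS = {"sup", "sub", "b", "i", "em", "strong", "span"}


def flatten_inline(node, only_text=True):
    """Replace inline tags with their text and stop working on them right away."""
    if isinstance(node, NavigableString):
        return  # plain text needs no processing
    if only_text and node.name in INLINE_TAGS:
        node.replace_with(node.get_text())
        return  # node is now detached from the tree; do not touch it again
    # Snapshot the children first, because replace_with() mutates the tree.
    for child in list(node.children):
        flatten_inline(child, only_text)


def clean(html):
    """Parse html, flatten inline tags, and return the remaining text."""
    soup = BeautifulSoup(html, "html.parser")
    for child in list(soup.children):
        flatten_inline(child)
    return soup.get_text()


print(clean("<p>Alpha<sup>1</sup>Beta</p>"))  # Alpha1Beta
```

The early return is the point of the sketch: after replace_with() the original tag no longer belongs to the document, so any later step that inspects or removes it would be acting on a detached node.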
List of files changed and why
crawl4ai/utils.py - Fixed inline-tag handling in get_content_of_website_optimized() so text after a <sup> tag is preserved, and reused the shared ONLY_TEXT_ELIGIBLE_TAGS constant from config instead of duplicating the list.
tests/regression/test_reg_utils.py - Added a regression test covering Alpha<sup>1</sup>Beta in only_text=True mode to ensure trailing text is not removed.
How Has This Been Tested?
Ran targeted regression test:
.venv/bin/python -m pytest tests/regression/test_reg_utils.py -k sup_preserves_following_text
Result: 1 passed, 71 deselected
Checklist: