Add DOM mutation operations: remove/2, remove_attribute/2, set_attribute/3 #34
Open
jsmestad wants to merge 1 commit into dashbitco:main from
Three new functions that mutate the native Lexbor DOM in place, enabling
efficient tree transformation without round-tripping through Elixir tuples.
## New functions
- `remove/2` - Removes all elements matching a CSS selector from the DOM.
Uses the same selector engine as `query/2`. Collects matching nodes first,
then destroys them via `lxb_dom_node_destroy` (unlink + free).
- `remove_attribute/2` - Removes a named attribute from all element nodes
and their descendants. Walks the subtree in C via `lxb_dom_node_simple_walk`.
- `set_attribute/3` - Sets an attribute on all element nodes in the set.
Uses the existing `lxb_dom_element_set_attribute` (already used by `from_tree`).
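The difference in scope between the two attribute functions (one recurses into descendants, the other touches only the nodes in the set) can be sketched with a pure-Elixir reference model. This is only an illustration of the semantics as described above, not the NIF implementation: the `AttrModel` module is hypothetical, and it assumes the `{tag, attributes, children}` tuple shape that `to_tree` produces.

```elixir
defmodule AttrModel do
  # Hypothetical reference model, not part of LazyHTML and not the NIF code.

  # Models the described remove_attribute/2: drops `name` from every element
  # in the list and from all of their descendants.
  def remove_attribute(nodes, name) when is_list(nodes) do
    Enum.map(nodes, fn
      {tag, attrs, children} when is_binary(tag) ->
        attrs = Enum.reject(attrs, fn {key, _value} -> key == name end)
        {tag, attrs, remove_attribute(children, name)}

      # Text nodes and comments carry no attributes; pass them through.
      other ->
        other
    end)
  end

  # Models the described set_attribute/3: sets `name` on each element in the
  # list itself (descendants are untouched), overwriting any existing value.
  def set_attribute(nodes, name, value) when is_list(nodes) do
    Enum.map(nodes, fn
      {tag, attrs, children} when is_binary(tag) ->
        {tag, List.keystore(attrs, name, 0, {name, value}), children}

      other ->
        other
    end)
  end
end

tree = [
  {"a", [{"href", "/x"}, {"onclick", "track()"}],
   [{"span", [{"onclick", "y()"}], ["go"]}]}
]

tree
|> AttrModel.remove_attribute("onclick")
|> AttrModel.set_attribute("rel", "nofollow")
```

In this model the `onclick` attribute disappears from both the `a` element and its nested `span`, while `rel` is added only to the top-level `a`.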
## Motivation
For HTML transformation workloads (sanitization, content stripping), the
current workflow requires exporting the DOM to Elixir tuples via `to_tree`,
walking them in pure Elixir, then importing back. On a corpus of 51 real
production pages (8.4MB HTML), this takes ~500ms.
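For context, the "walking them in pure Elixir" step of the current workflow might look roughly like the sketch below. It is a minimal illustration rather than the code that was benchmarked: the `TupleStrip` module is hypothetical, it matches on tag names only (no CSS selectors), and it assumes the `{tag, attributes, children}` tuple shape returned by `to_tree`.

```elixir
defmodule TupleStrip do
  # Hypothetical helper, not part of LazyHTML: drops every element whose tag
  # is in `tags`, together with its whole subtree.
  def strip(nodes, tags) when is_list(nodes) do
    Enum.flat_map(nodes, fn
      {tag, attrs, children} when is_binary(tag) ->
        if tag in tags do
          []
        else
          [{tag, attrs, strip(children, tags)}]
        end

      # Text nodes (binaries) and comments ({:comment, text}) pass through.
      other ->
        [other]
    end)
  end
end

tree = [
  {"div", [{"class", "page"}],
   [
     {"script", [], ["alert(1)"]},
     {"p", [], ["Hello ", {"nav", [], ["menu"]}, "world"]}
   ]}
]

TupleStrip.strip(tree, ["script", "nav"])
```

Every surviving node is rebuilt as a new tuple on the way back up, which is part of the cost the native approach avoids.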
With native mutation, the same work can stay in C throughout:
    html
    |> LazyHTML.from_fragment()
    |> LazyHTML.remove("script, style, nav, footer")
    |> LazyHTML.remove("[hidden]")
    |> LazyHTML.to_html()
Estimated speedup: ~8x (500ms -> ~60ms).
## Safety note
`remove/2` mutates the underlying DOM. Any `%LazyHTML{}` values previously
obtained via `query/2` that reference removed nodes become invalid. This is
documented in the function's warning admonition.
## Tests
17 new tests covering:
- Simple and compound CSS selectors
- Nested element removal
- Attribute selector removal (`[hidden]`)
- Root node removal
- No-op when nothing matches
- Subsequent queries reflecting mutations
- Attribute removal from nested elements
- Attribute set/overwrite
- Multi-node attribute set from query results
3 new doctests.
Member
Unfortunately, mutations introduce a bunch of side effects into the tree and leave it up to the user to manage state and deal with those side effects. I think it would make more sense to introduce a transform API, where you express query selectors and the operations you want to perform on their matches, and then apply it to a copy of the tree. That way you get the benefits you mentioned, but on top of a pure API. Something like:
## Summary
Three new functions that mutate the native Lexbor DOM in place, enabling efficient tree transformation without round-tripping through Elixir tuples.
## New functions
- `remove/2` - removes all elements matching a CSS selector from the DOM tree. Uses the same CSS selector engine as `query/2`. Collects matching nodes first (can't modify during traversal), then destroys them via `lxb_dom_node_destroy` (unlink + free).
- `remove_attribute/2` - removes a named attribute from all element nodes and their descendants. Walks the subtree in C via `lxb_dom_node_simple_walk`.
- `set_attribute/3` - sets an attribute on all element nodes in the set. Uses the existing `lxb_dom_element_set_attribute` (already called by `from_tree`).
## Motivation
For HTML transformation workloads (sanitization, content stripping), the current workflow requires exporting the DOM to Elixir tuples via `to_tree`, walking them in pure Elixir, then importing back via `from_tree`. On a corpus of 51 real production pages (8.4MB HTML), this takes ~500ms.
Cost breakdown of the current approach:
- `from_fragment`
- `to_tree`
- `Tree.to_html`
With native mutation, the entire pipeline stays in C:
The `to_tree` export and Elixir tree walking are eliminated entirely.
## Implementation
All three NIF functions follow the existing patterns in `lazy_html.cpp`:
- `dom_remove` reuses `parse_css_selector` and `lxb_selectors_find` from `query`. After collecting matching nodes, it calls `lxb_dom_node_destroy` (which both unlinks and frees). It also scrubs the `LazyHTML.nodes` vector to prevent dangling pointer access.
- `dom_remove_attribute` uses `lxb_dom_node_simple_walk` to traverse descendants and `lxb_dom_element_remove_attribute` on each element.
- `dom_set_attribute` calls `lxb_dom_element_set_attribute`, which is already used in `from_tree`.
## Safety
`remove/2` mutates the underlying DOM. Any `%LazyHTML{}` values previously obtained via `query/2` that reference removed nodes become invalid. This is documented with a warning admonition in the function docs.
## Tests
17 new tests + 3 new doctests covering:
All 96 tests pass (39 doctests + 57 tests, 0 failures).