Add DOM mutation operations: remove/2, remove_attribute/2, set_attribute/3 #34

Open
jsmestad wants to merge 1 commit into dashbitco:main from jsmestad:dom-mutation-api

Conversation

@jsmestad jsmestad commented Apr 9, 2026

Summary

Three new functions that mutate the native Lexbor DOM in place, enabling efficient tree transformation without round-tripping through Elixir tuples.

New functions

  • remove/2 - removes all elements matching a CSS selector from the DOM tree. Uses the same CSS selector engine as query/2. Collects matching nodes first (can't modify during traversal), then destroys them via lxb_dom_node_destroy (unlink + free).

  • remove_attribute/2 - removes a named attribute from all element nodes and their descendants. Walks the subtree in C via lxb_dom_node_simple_walk.

  • set_attribute/3 - sets an attribute on all element nodes in the set. Uses the existing lxb_dom_element_set_attribute (already called by from_tree).

Motivation

For HTML transformation workloads (sanitization, content stripping), the current workflow requires exporting the DOM to Elixir tuples via to_tree, walking them in pure Elixir, then importing back via from_tree. On a corpus of 51 real production pages (8.4MB HTML), this takes ~500ms.

Cost breakdown of the current approach:

| Step | Time | What it does |
|---|---|---|
| from_fragment | 28ms | Parse HTML (Lexbor C, fast) |
| to_tree | 75ms | Convert native DOM to BEAM tuples |
| Elixir tree walk | 370ms | Remove junk tags, strip attrs, etc. |
| Tree.to_html | 30ms | Serialize Elixir tree |
| Total | 503ms | |
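
For readers unfamiliar with the tuple format, the "Elixir tree walk" step can be pictured roughly as follows. This is a minimal sketch, not the code that was benchmarked: the module name `TreeStrip`, the junk tag list, and the list of stripped attributes are all assumptions made for illustration. It only relies on the `{tag, attributes, children}` shape that `to_tree` produces, with text nodes as binaries and comments as `{:comment, text}`.

```elixir
defmodule TreeStrip do
  # Hypothetical helper, not part of LazyHTML. It works on the
  # {tag, attributes, children} tuples returned by LazyHTML.to_tree/1.
  @junk_tags ~w(script style nav)
  @junk_attrs ~w(style onclick)

  # Walks a list of nodes and returns a new, cleaned list.
  def strip(nodes) when is_list(nodes) do
    Enum.flat_map(nodes, &strip_node/1)
  end

  # Junk elements are dropped together with their whole subtree.
  defp strip_node({tag, _attrs, _children}) when tag in @junk_tags, do: []

  # Elements carrying `hidden` are dropped; all others lose the
  # unwanted attributes and have their children walked recursively.
  defp strip_node({tag, attrs, children}) do
    if List.keymember?(attrs, "hidden", 0) do
      []
    else
      kept = Enum.reject(attrs, fn {name, _value} -> name in @junk_attrs end)
      [{tag, kept, strip(children)}]
    end
  end

  # Text and comment nodes pass through unchanged.
  defp strip_node(other), do: [other]
end

tree = [
  {"div", [{"style", "color: red"}],
   [
     {"script", [], ["alert(1)"]},
     {"p", [{"hidden", ""}], ["secret"]},
     {"p", [], ["hello"]}
   ]}
]

TreeStrip.strip(tree)
# => [{"div", [], [{"p", [], ["hello"]}]}]
```

Every element visited this way is rebuilt as a new tuple on the BEAM heap, which is the work the table attributes to the tree walk.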

With native mutation, the entire pipeline stays in C:

html
|> LazyHTML.from_fragment()                      #  28ms
|> LazyHTML.remove("script, style, nav, ...")    #  ~5ms
|> LazyHTML.remove("[hidden]")                   #  ~2ms
|> LazyHTML.to_html()                            # ~12ms
# Total: ~50ms estimated (10x faster)

The to_tree export and Elixir tree walking are eliminated entirely.

Implementation

All three NIF functions follow the existing patterns in lazy_html.cpp:

  • dom_remove reuses parse_css_selector and lxb_selectors_find from query. After collecting matching nodes, it calls lxb_dom_node_destroy (which both unlinks and frees). It also scrubs the LazyHTML.nodes vector to prevent dangling pointer access.

  • dom_remove_attribute uses lxb_dom_node_simple_walk to traverse descendants and lxb_dom_element_remove_attribute on each element.

  • dom_set_attribute calls lxb_dom_element_set_attribute, which is already used in from_tree.

Safety

remove/2 mutates the underlying DOM. Any %LazyHTML{} values previously obtained via query/2 that reference removed nodes become invalid. This is documented with a warning admonition in the function docs.

Tests

17 new tests + 3 new doctests covering:

  • Simple, compound, and attribute CSS selectors
  • Nested element removal
  • Root node removal
  • No-op when nothing matches
  • Subsequent queries reflecting mutations
  • Attribute removal from nested elements
  • Attribute set/overwrite
  • Multi-node attribute set from query results

All 96 tests pass (39 doctests + 57 tests, 0 failures).

…ute/3

Three new functions that mutate the native Lexbor DOM in place, enabling
efficient tree transformation without round-tripping through Elixir tuples.

## New functions

- `remove/2` - Removes all elements matching a CSS selector from the DOM.
  Uses the same selector engine as `query/2`. Collects matching nodes first,
  then destroys them via `lxb_dom_node_destroy` (unlink + free).

- `remove_attribute/2` - Removes a named attribute from all element nodes
  and their descendants. Walks the subtree in C via `lxb_dom_node_simple_walk`.

- `set_attribute/3` - Sets an attribute on all element nodes in the set.
  Uses the existing `lxb_dom_element_set_attribute` (already used by `from_tree`).

## Motivation

For HTML transformation workloads (sanitization, content stripping), the
current workflow requires exporting the DOM to Elixir tuples via `to_tree`,
walking them in pure Elixir, then importing back. On a corpus of 51 real
production pages (8.4MB HTML), this takes ~500ms.

With native mutation, the same work can stay in C throughout:

    html
    |> LazyHTML.from_fragment()
    |> LazyHTML.remove("script, style, nav, footer")
    |> LazyHTML.remove("[hidden]")
    |> LazyHTML.to_html()

Estimated speedup: ~8x (500ms -> ~60ms).

## Safety note

`remove/2` mutates the underlying DOM. Any `%LazyHTML{}` values previously
obtained via `query/2` that reference removed nodes become invalid. This is
documented in the function's warning admonition.

## Tests

17 new tests covering:
- Simple and compound CSS selectors
- Nested element removal
- Attribute selector removal (`[hidden]`)
- Root node removal
- No-op when nothing matches
- Subsequent queries reflecting mutations
- Attribute removal from nested elements
- Attribute set/overwrite
- Multi-node attribute set from query results

3 new doctests.

josevalim (Member) commented Apr 10, 2026

Unfortunately, mutations introduce a bunch of side effects into the tree and leave it up to the user to manage state and deal with those side effects. I think it would make more sense to introduce a transform API, where you express query selectors and the operations you want to perform on the matches, and then apply them to a copy of the tree. That way you get the benefits you mentioned, but on top of a pure API.

Something like:

html
|> LazyHTML.from_fragment()
|> LazyHTML.transform([
  LazyHTML.Transform.remove("script, style, nav, ..."),
  LazyHTML.Transform.set_attribute("#omg", "data-foo", "bar")
])
|> LazyHTML.to_html()
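
To make the shape of that idea concrete, here is a minimal pure-Elixir sketch of "operations as data, applied to a fresh tree". Everything in it is an assumption for illustration: `TransformSketch` is a hypothetical module, it works on `to_tree`-style tuples rather than on a native DOM copy, and instead of CSS selectors it only matches tag names (for removal) and an `id` value (for setting an attribute). It is not the proposed `LazyHTML.Transform` API.

```elixir
defmodule TransformSketch do
  # Hypothetical illustration only; LazyHTML has no such module.
  # Operations are plain data, so building them has no side effects.
  def remove(tags) when is_list(tags), do: {:remove, tags}
  def set_attribute(id, name, value), do: {:set_attribute, id, name, value}

  # Applies each operation in order and returns a new tree.
  # The input tree is never modified, since Elixir data is immutable.
  def transform(tree, operations) do
    Enum.reduce(operations, tree, &apply_op/2)
  end

  defp apply_op(op, nodes), do: Enum.flat_map(nodes, &apply_node(op, &1))

  # Removal: drop elements whose tag is listed, recurse into the rest.
  defp apply_node({:remove, tags} = op, {tag, attrs, children}) do
    if tag in tags, do: [], else: [{tag, attrs, apply_op(op, children)}]
  end

  # Set or overwrite an attribute on the element with the given id.
  defp apply_node({:set_attribute, id, name, value} = op, {tag, attrs, children}) do
    attrs =
      if {"id", id} in attrs do
        List.keystore(attrs, name, 0, {name, value})
      else
        attrs
      end

    [{tag, attrs, apply_op(op, children)}]
  end

  # Text and comment nodes are left untouched.
  defp apply_node(_op, other), do: [other]
end

tree = [{"div", [{"id", "omg"}], [{"script", [], ["x"]}, "hi"]}]

ops = [
  TransformSketch.remove(["script", "style"]),
  TransformSketch.set_attribute("omg", "data-foo", "bar")
]

TransformSketch.transform(tree, ops)
# => [{"div", [{"id", "omg"}, {"data-foo", "bar"}], ["hi"]}]
```

Because the caller keeps the original `tree` value and receives a new one, previously obtained references cannot be invalidated, which is the property the comment above is after.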
