fix: improve browser tool instructions and element resolution#351
fix: improve browser tool instructions and element resolution#351
Conversation
- Rewrite browser section in worker prompt with parameter table, concrete JSON examples, and explicit act_kind emphasis - Add multi-strategy element selector (native tags, aria-label, title attr) to handle implicit ARIA roles on native HTML elements - Improve all error messages to include valid values and examples - Update tool description to front-load act_kind requirement
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
WalkthroughThis PR clarifies the browser tool workflow (launch → navigate → snapshot → act → close), expands action definitions and act requirements, and refactors element resolution in Rust to a JS-driven, multi-pattern selector strategy with new helper functions and improved error handling. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/tools/browser.rs (2)
1419-1435: Consider extendingtextboxmapping to also trytextarea.The current mapping
"textbox" => Some("input")misses<textarea>elements, which also have the implicittextboxrole. This could reduce match rates for multi-line text inputs. Sincebuild_selectors_for_refiterates through selectors, the existing fallbacks should still work, but you could improve coverage by handling both.💡 Potential enhancement (optional)
You could return a comma-separated selector or handle this in
build_selectors_for_refby pushing bothinputandtextareavariants when role istextbox:- "textbox" => Some("input"), + "textbox" => Some("input, textarea"),Or handle it at the
build_selectors_for_reflevel for finer control. This is a nice-to-have improvement that can be deferred.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/browser.rs` around lines 1419 - 1435, The role_to_native_tag function currently maps "textbox" only to "input", missing textarea elements; update role_to_native_tag to include textarea as well (e.g., return a selector that matches both input and textarea for the "textbox" case) or alternatively modify build_selectors_for_ref to push both the input and textarea selectors when it sees role "textbox" so multi-line textareas are covered by the selector generation; locate function role_to_native_tag and adjust the "textbox" branch (or update build_selectors_for_ref's handling of textbox) accordingly.
1442-1448: Consider more robust CSS escaping for element names.The current escaping at line 1444 only handles single quotes. Element names containing backslashes, newlines, or null characters could break CSS selector parsing or cause unexpected matches. While accessible names from the AX tree are typically sanitized, a more defensive approach would be safer.
💡 Optional: More robust escaping
- let escaped_name = elem_ref.name.as_ref().map(|n| n.replace('\'', "\\'")); + let escaped_name = elem_ref.name.as_ref().map(|n| { + n.replace('\\', "\\\\") + .replace('\'', "\\'") + .replace('\n', "\\n") + });Alternatively, consider using CSS.escape() via the
cssparsercrate if names become more complex. This is low priority since AX tree names are typically well-formed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/browser.rs` around lines 1442 - 1448, The code in build_selectors_for_ref only replaces single quotes when building escaped_name; update it to perform robust CSS attribute-value escaping for elem_ref.name (or call a library escape function) so backslashes, control characters (newlines, nulls) and quotes are properly escaped before formatting selectors; implement a small helper (e.g., escape_css_attr) or use a crate-provided escape (cssparser or similar) and replace the current escaped_name computation and subsequent uses in build_selectors_for_ref to use the new safe-escaped value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/tools/browser.rs`:
- Around line 1419-1435: The role_to_native_tag function currently maps
"textbox" only to "input", missing textarea elements; update role_to_native_tag
to include textarea as well (e.g., return a selector that matches both input and
textarea for the "textbox" case) or alternatively modify build_selectors_for_ref
to push both the input and textarea selectors when it sees role "textbox" so
multi-line textareas are covered by the selector generation; locate function
role_to_native_tag and adjust the "textbox" branch (or update
build_selectors_for_ref's handling of textbox) accordingly.
- Around line 1442-1448: The code in build_selectors_for_ref only replaces
single quotes when building escaped_name; update it to perform robust CSS
attribute-value escaping for elem_ref.name (or call a library escape function)
so backslashes, control characters (newlines, nulls) and quotes are properly
escaped before formatting selectors; implement a small helper (e.g.,
escape_css_attr) or use a crate-provided escape (cssparser or similar) and
replace the current escaped_name computation and subsequent uses in
build_selectors_for_ref to use the new safe-escaped value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0d7828a5-4ee2-4f5b-b9f6-6395fbf9ea99
📒 Files selected for processing (3)
prompts/en/tools/browser_description.md.j2prompts/en/worker.md.j2src/tools/browser.rs
| | `act` | `act_kind`, `element_ref` | Interact with an element. **`act_kind` is mandatory.** | | ||
| | `screenshot` | `full_page` (optional) | Capture the viewport (or full page). | |
There was a problem hiding this comment.
Minor nit: the quick-reference table makes element_ref look required for all act kinds, but press_key can omit it. Also screenshot supports element_ref (element screenshot) per the tool args, but the table doesn’t mention it.
| | `act` | `act_kind`, `element_ref` | Interact with an element. **`act_kind` is mandatory.** | | |
| | `screenshot` | `full_page` (optional) | Capture the viewport (or full page). | | |
| | `act` | `act_kind`, `element_ref` (except `press_key`) | Interact with an element. **`act_kind` is mandatory.** | | |
| | `screenshot` | `element_ref` (optional), `full_page` (optional) | Capture the viewport, full page, or a specific element. | |
| @@ -1 +1 @@ | |||
| Browser automation tool. Launch a headless Chrome browser, navigate pages, interact with elements, take screenshots, and extract page content. Workflow: launch → navigate → snapshot (get element refs) → act (click/type by ref) → screenshot. Element refs like "e1", "e2" are assigned during snapshot and used in act calls. No newline at end of file | |||
| Browser automation tool. Workflow: launch → navigate → snapshot → act → close. The `act` action REQUIRES `act_kind` (click, type, press_key, hover, scroll_into_view, focus) and `element_ref` from the last snapshot. Example: {"action": "act", "act_kind": "click", "element_ref": "e3"}. Always snapshot before acting — refs reset on navigation. Use `navigate` to go to URLs (not `open`, which creates new tabs). No newline at end of file | |||
There was a problem hiding this comment.
Small consistency thing with the worker prompt: press_key doesn’t require an element_ref, so saying act requires element_ref can push workers into always inventing one.
| Browser automation tool. Workflow: launch → navigate → snapshot → act → close. The `act` action REQUIRES `act_kind` (click, type, press_key, hover, scroll_into_view, focus) and `element_ref` from the last snapshot. Example: {"action": "act", "act_kind": "click", "element_ref": "e3"}. Always snapshot before acting — refs reset on navigation. Use `navigate` to go to URLs (not `open`, which creates new tabs). | |
| Browser automation tool. Workflow: launch → navigate → snapshot → act → close. The `act` action REQUIRES `act_kind` (click, type, press_key, hover, scroll_into_view, focus). For most act kinds, pass `element_ref` from the last snapshot; for `press_key`, pass `key` and optionally `element_ref`. Example: {"action": "act", "act_kind": "click", "element_ref": "e3"}. Always snapshot before acting — refs reset on navigation. Use `navigate` to go to URLs (not `open`, which creates new tabs). |
src/tools/browser.rs
Outdated
| for selector in &selectors { | ||
| if let Ok(element) = page.find_element(selector).await { | ||
| return Ok(element); | ||
| } | ||
| } | ||
|
|
||
| Err(BrowserError::new(format!( | ||
| "failed to find element for ref '{ref_id}' \ | ||
| (role: {}, name: {:?}) — the page may have changed since \ | ||
| the last snapshot. Run snapshot again to get fresh refs, \ | ||
| then retry with the updated ref. \ | ||
| Selectors tried: {}", | ||
| elem_ref.role, | ||
| elem_ref.name, | ||
| selectors.join(", "), | ||
| ))) |
There was a problem hiding this comment.
Right now we swallow all find_element errors and always return the same “page may have changed” message. If the selector is invalid (or CDP errors), snapshot+retry won’t help and we’ll just loop.
Could be worth capturing the last error and including it in the final message so the worker can course-correct faster.
| for selector in &selectors { | |
| if let Ok(element) = page.find_element(selector).await { | |
| return Ok(element); | |
| } | |
| } | |
| Err(BrowserError::new(format!( | |
| "failed to find element for ref '{ref_id}' \ | |
| (role: {}, name: {:?}) — the page may have changed since \ | |
| the last snapshot. Run snapshot again to get fresh refs, \ | |
| then retry with the updated ref. \ | |
| Selectors tried: {}", | |
| elem_ref.role, | |
| elem_ref.name, | |
| selectors.join(", "), | |
| ))) | |
| let mut last_error: Option<String> = None; | |
| for selector in &selectors { | |
| match page.find_element(selector).await { | |
| Ok(element) => return Ok(element), | |
| Err(error) => last_error = Some(error.to_string()), | |
| } | |
| } | |
| let last_error = last_error.unwrap_or_else(|| "unknown error".to_string()); | |
| Err(BrowserError::new(format!( | |
| "failed to find element for ref '{ref_id}' \ | |
| (role: {}, name: {:?}) — the page may have changed since \ | |
| the last snapshot. Run snapshot again to get fresh refs, \ | |
| then retry with the updated ref. \ | |
| Selectors tried: {}. Last error: {last_error}", | |
| elem_ref.role, | |
| elem_ref.name, | |
| selectors.join(", "), | |
| ))) |
| fn build_selectors_for_ref(elem_ref: &ElementRef) -> Vec<String> { | ||
| let mut selectors = Vec::with_capacity(4); | ||
| let escaped_name = elem_ref.name.as_ref().map(|n| n.replace('\'', "\\'")); | ||
|
|
||
| // Strategy 1: [role='X'][aria-label='Y'] — explicit ARIA attributes | ||
| if let Some(name) = &escaped_name { | ||
| selectors.push(format!("[role='{}'][aria-label='{name}']", elem_ref.role)); | ||
| } | ||
|
|
||
| if let Some(name) = &elem_ref.name { | ||
| // Escape single quotes in the name for CSS selector safety | ||
| let escaped = name.replace('\'', "\\'"); | ||
| format!("{role_selector}[aria-label='{escaped}']") | ||
| } else { | ||
| role_selector | ||
| // Strategy 2: native tag + aria-label (e.g., button[aria-label='Submit']) | ||
| if let Some(tag) = role_to_native_tag(&elem_ref.role) { | ||
| if let Some(name) = &escaped_name { | ||
| selectors.push(format!("{tag}[aria-label='{name}']")); | ||
| } | ||
| } | ||
|
|
||
| // Strategy 3: native tag with text content match via XPath-style selector | ||
| // using the name as button/link text | ||
| if let Some(tag) = role_to_native_tag(&elem_ref.role) { |
There was a problem hiding this comment.
Nit: the “XPath-style selector / text content match” comment doesn’t match what the code actually does (title + tag-only fallbacks). Also with_capacity(4) looks a bit low given the max selectors here is 5.
| fn build_selectors_for_ref(elem_ref: &ElementRef) -> Vec<String> { | |
| let mut selectors = Vec::with_capacity(4); | |
| let escaped_name = elem_ref.name.as_ref().map(|n| n.replace('\'', "\\'")); | |
| // Strategy 1: [role='X'][aria-label='Y'] — explicit ARIA attributes | |
| if let Some(name) = &escaped_name { | |
| selectors.push(format!("[role='{}'][aria-label='{name}']", elem_ref.role)); | |
| } | |
| if let Some(name) = &elem_ref.name { | |
| // Escape single quotes in the name for CSS selector safety | |
| let escaped = name.replace('\'', "\\'"); | |
| format!("{role_selector}[aria-label='{escaped}']") | |
| } else { | |
| role_selector | |
| // Strategy 2: native tag + aria-label (e.g., button[aria-label='Submit']) | |
| if let Some(tag) = role_to_native_tag(&elem_ref.role) { | |
| if let Some(name) = &escaped_name { | |
| selectors.push(format!("{tag}[aria-label='{name}']")); | |
| } | |
| } | |
| // Strategy 3: native tag with text content match via XPath-style selector | |
| // using the name as button/link text | |
| if let Some(tag) = role_to_native_tag(&elem_ref.role) { | |
| fn build_selectors_for_ref(elem_ref: &ElementRef) -> Vec<String> { | |
| let mut selectors = Vec::with_capacity(5); | |
| let escaped_name = elem_ref.name.as_ref().map(|n| n.replace('\'', "\\'")); | |
| // Strategy 1: [role='X'][aria-label='Y'] — explicit ARIA attributes | |
| if let Some(name) = &escaped_name { | |
| selectors.push(format!("[role='{}'][aria-label='{name}']", elem_ref.role)); | |
| } | |
| // Strategy 2: native tag + aria-label (e.g., button[aria-label='Submit']) | |
| if let Some(tag) = role_to_native_tag(&elem_ref.role) { | |
| if let Some(name) = &escaped_name { | |
| selectors.push(format!("{tag}[aria-label='{name}']")); | |
| } | |
| } | |
| // Strategy 3: native tag + title fallback, then tag-only fallback | |
| if let Some(tag) = role_to_native_tag(&elem_ref.role) { |
Element clicks/type/hover/focus now use Runtime.evaluate with DOM methods instead of CDP node IDs that go stale between snapshot and interaction. - Rewrite handle_act to use JS injection for all ActKind variants - Add build_js_selector: tries CSS selectors then text-content fallback - Add run_js_action: executes JS via page.evaluate, parses JSON result - Update handle_screenshot to use JS + clip viewport instead of Element - Remove old resolve_element_ref (replaced by JS injection) - Merge role_to_native_tag entries from both branches
Summary
Workers keep making the same browser tool mistakes in a loop — forgetting
act_kind, usingopeninstead ofnavigate, failing to find elements. Three fixes:act_kindemphasis, and a "common mistakes" section<button>,<a>,<input>) have implicit ARIA roles without[role]attributes in the DOM. Now tries:[role][aria-label]→tag[aria-label]→tag[title]→tag→[role]fallbackNote
This PR improves browser tool usability by providing clearer documentation, better error guidance, and more robust element selection. Changes span prompt files (worker.md.j2, browser_description.md.j2) and implementation (browser.rs), with updated error messages that include JSON examples and parameter validation guidance for faster LLM self-correction.
Written by Tembo for commit 5aab401. This will update automatically on new commits.