Status: Open
Labels: enhancement (New feature or request)
Description
Summary
Run the same Wikipedia tasks from experiments/nexa/ using Nexa's vision-based Web-Agent-Qwen3VL (Playwright + screenshots) and compare against the DOMShell results.
Why
The current Nexa experiment compares 1.7B/4B local models against Claude Opus — that's a model-size comparison, not an interface comparison. To validate DOMShell's text/AX-tree design, we need an apples-to-apples test: same model, same tasks, different interface (DOMShell text vs Playwright screenshots).
Approach
- Use Nexa's existing Web-Agent cookbook (Playwright + browser-use + Qwen3-VL)
- Run the same 3 Wikipedia tasks from experiments/nexa/nexa_prompts.md
- Compare: tool calls, correctness, completeness, token usage
- Key question: does DOMShell's structured text approach outperform pixel-based browsing at the same model size?
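The comparison step above could be collected into a small harness. A minimal sketch, assuming each run produces per-task records with the metrics listed (the `TaskResult` fields and the `summarize` helper are hypothetical names, not part of the existing `integrations/nexa/agent.py`):

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task: str            # e.g. one of the 3 Wikipedia tasks
    interface: str       # "domshell" or "vision"
    tool_calls: int
    correct: bool
    prompt_tokens: int
    completion_tokens: int

def summarize(results):
    """Aggregate per-interface totals so the two runs can be compared directly."""
    summary = {}
    for r in results:
        s = summary.setdefault(
            r.interface,
            {"tasks": 0, "correct": 0, "tool_calls": 0, "tokens": 0},
        )
        s["tasks"] += 1
        s["correct"] += int(r.correct)
        s["tool_calls"] += r.tool_calls
        s["tokens"] += r.prompt_tokens + r.completion_tokens
    return summary
```

Feeding both the DOMShell and the Playwright/screenshot runs into `summarize` yields one row per interface, which is the apples-to-apples table the issue asks for.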
Expected Outcome
DOMShell should be more token-efficient (text vs images) and may enable smaller models to succeed where vision models need 4B+ parameters.
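The token-efficiency claim can be sanity-checked with back-of-envelope arithmetic. A sketch under two stated assumptions: Qwen-VL-style models spend roughly one token per 28x28 merged image patch, and plain text averages about 4 characters per token. Both constants are illustrative assumptions, not measured values from the experiments:

```python
def screenshot_tokens(width, height, patch=28):
    # Assumption: ~1 token per 28x28 merged patch (Qwen-VL-style tokenization)
    return (width // patch) * (height // patch)

def text_tokens(num_chars, chars_per_token=4):
    # Rough heuristic: ~4 characters per text token
    return num_chars // chars_per_token

# One 1280x720 screenshot per agent step:
per_step_image = screenshot_tokens(1280, 720)   # 45 * 25 = 1125 tokens
# A hypothetical 3000-character DOMShell text snapshot per step:
per_step_text = text_tokens(3000)               # 750 tokens
```

Under these assumptions the gap per step is modest, but it compounds over a multi-step browsing episode, since a fresh screenshot is re-tokenized at every step; the actual experiment should report measured token counts rather than these estimates.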
References
- experiments/nexa/ — existing DOMShell results (0/12 with 1.7B-4B)
- integrations/nexa/agent.py — DOMShell agent
- Web-Agent-Qwen3VL cookbook — vision-based comparison target