
Experiment: DOMShell vs Vision (browser-use) with same local model #28

@apireno


Summary

Run the same Wikipedia tasks from experiments/nexa/ using Nexa's vision-based Web-Agent-Qwen3VL (Playwright + screenshots) and compare against the DOMShell results.

Why

The current Nexa experiment compares 1.7B/4B local models against Claude Opus — that's a model-size comparison, not an interface comparison. To validate DOMShell's text/AX-tree design, we need an apples-to-apples test: same model, same tasks, different interface (DOMShell text vs Playwright screenshots).

Approach

  • Use Nexa's existing Web-Agent cookbook (Playwright + browser-use + Qwen3-VL)
  • Run the same 3 Wikipedia tasks from experiments/nexa/nexa_prompts.md
  • Compare: tool calls, correctness, completeness, token usage
  • Key question: does DOMShell's structured text approach outperform pixel-based browsing at the same model size?
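The comparison step above could be tabulated with a small helper. This is a sketch only: the field names (`tool_calls`, `correct`, `tokens`) are hypothetical and not taken from the existing experiments/nexa/ harness.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    task: str        # e.g. one of the 3 Wikipedia tasks
    interface: str   # "domshell" or "vision" (hypothetical labels)
    tool_calls: int  # number of browser/tool actions taken
    correct: bool    # did the final answer match the rubric?
    tokens: int      # total prompt + completion tokens

def summarize(results: list[TaskResult]) -> dict[str, dict[str, float]]:
    """Aggregate success rate, mean tool calls, and mean tokens per interface."""
    summary: dict[str, dict[str, float]] = {}
    for iface in {r.interface for r in results}:
        rs = [r for r in results if r.interface == iface]
        summary[iface] = {
            "success_rate": sum(r.correct for r in rs) / len(rs),
            "mean_tool_calls": sum(r.tool_calls for r in rs) / len(rs),
            "mean_tokens": sum(r.tokens for r in rs) / len(rs),
        }
    return summary
```

Feeding both runs' per-task records through one aggregator like this keeps the DOMShell and vision numbers directly comparable.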

Expected Outcome

DOMShell should be more token-efficient (text vs images) and may enable smaller models to succeed where vision models need 4B+ parameters.
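The token-efficiency claim can be sanity-checked with back-of-envelope arithmetic before running anything. The constants below are assumptions, not measurements: Qwen-VL-family encoders spend very roughly one visual token per 28x28-pixel patch (after 2x2 patch merging), and English text averages about 4 characters per token.

```python
import math

def screenshot_tokens(width: int, height: int, px_per_token: int = 28) -> int:
    """Rough visual-token count for one screenshot, assuming ~28x28 px
    per token (Qwen2-VL-style encoder; an assumption, not a measurement)."""
    return math.ceil(width / px_per_token) * math.ceil(height / px_per_token)

def text_tokens(n_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough text-token count for a DOMShell/AX-tree observation."""
    return math.ceil(n_chars / chars_per_token)

# One 1280x720 screenshot vs. a hypothetical ~6 KB accessibility-tree dump:
per_step_vision = screenshot_tokens(1280, 720)  # 46 * 26 patches
per_step_text = text_tokens(6000)
```

Note the crossover depends heavily on screenshot resolution and on how large the text observation is per step, so the real answer has to come from the logged token counts of the two runs; these helpers just make the comparison explicit.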

References

  • experiments/nexa/ — existing DOMShell results (0/12 with 1.7B-4B)
  • integrations/nexa/agent.py — DOMShell agent
  • Web-Agent-Qwen3VL cookbook — vision-based comparison target

Labels: enhancement (New feature or request)
