A Linux command-line tool that clicks on-screen UI elements from a natural-language
description. You give it a target like "the green X to close the terminal"; it
captures a screenshot, locates the target, and issues a click at that point.
The pipeline (in src/main.rs) is:
- Screenshot: capture the current screen. Backends are tried in order (src/screenshot.rs): Spectacle DBus, KWin DBus ScreenShot2, grim, then the spectacle CLI as a fallback.
- CV fast path (optional, --fast): before calling the model, try cheap computer-vision heuristics (src/cv_fast.rs), namely synthetic-template matching for close (X) buttons, edge/contour-based button finding, and color-based targeting (e.g. "the red X"). If a match clears the confidence threshold, the model is skipped.
- VLM grounding: otherwise POST the screenshot plus the target text to a Qwen2.5-VL server (OpenAI-compatible /v1/chat/completions, default http://127.0.0.1:8082). The model is asked to return JSON ({"point_2d": [x, y], ...} or a bbox_2d); src/vlm.rs parses it, tolerating code fences and surrounding prose, and clamps the point to image bounds.
- Click: dispatch the click with ydotool (src/click.rs), using mousemove --absolute followed by click 0xC0.
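
The parse-and-clamp step of the VLM grounding stage can be sketched roughly as follows. This is a minimal std-only illustration, not the code in src/vlm.rs: the names `Grounding`, `parse_grounding`, and `click_point` are hypothetical, the key search is a simple text scan rather than a real JSON parser, and using the centre of a bbox_2d as the click point is an assumption.

```rust
/// What the model returned, in image pixel coordinates (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Grounding {
    Point(f64, f64),
    BBox(f64, f64, f64, f64),
}

/// Collect the numbers inside the first `[...]` that follows `key`.
/// Scanning the raw text means code fences and surrounding prose are ignored.
fn numbers_after(text: &str, key: &str) -> Option<Vec<f64>> {
    let rest = &text[text.find(key)? + key.len()..];
    let open = rest.find('[')?;
    let close = open + rest[open..].find(']')?;
    rest[open + 1..close]
        .split(',')
        .map(|s| s.trim().parse::<f64>().ok())
        .collect()
}

/// Prefer "point_2d"; fall back to "bbox_2d" when no valid point is present.
pub fn parse_grounding(reply: &str) -> Option<Grounding> {
    if let Some(v) = numbers_after(reply, "\"point_2d\"") {
        if v.len() == 2 {
            return Some(Grounding::Point(v[0], v[1]));
        }
    }
    if let Some(v) = numbers_after(reply, "\"bbox_2d\"") {
        if v.len() == 4 {
            return Some(Grounding::BBox(v[0], v[1], v[2], v[3]));
        }
    }
    None
}

/// Reduce the grounding to one pixel and clamp it to the image bounds.
/// Assumption: a bbox is clicked at its centre.
pub fn click_point(g: &Grounding, width: u32, height: u32) -> (u32, u32) {
    let (x, y) = match *g {
        Grounding::Point(x, y) => (x, y),
        Grounding::BBox(x1, y1, x2, y2) => ((x1 + x2) / 2.0, (y1 + y2) / 2.0),
    };
    let max_x = width.saturating_sub(1) as f64;
    let max_y = height.saturating_sub(1) as f64;
    (
        x.clamp(0.0, max_x).round() as u32,
        y.clamp(0.0, max_y).round() as u32,
    )
}
```

With this sketch, a reply that wraps `{"point_2d": [1930, 12]}` in a code fence on a 1920x1080 image would be clamped to the last valid column, x = 1919.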
There is also a background sampler (src/sampler.rs, --sampler/--daemon)
that captures frames continuously at a configured rate so a click request can use a
pre-captured frame instead of taking a fresh screenshot.
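
One way to picture the sampler is a worker thread that overwrites a shared "latest frame" slot at a fixed rate, with click requests reading that slot when the frame is recent enough. The sketch below makes that concrete under stated assumptions: `Sampler`, `Frame`, `fresh_frame`, and the `max_age` freshness check are hypothetical names and behaviour, not taken from src/sampler.rs, and the capture function is passed in as a closure so the example stays self-contained.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::{Duration, Instant};

/// A captured frame plus the time it was taken (hypothetical type).
#[derive(Clone)]
pub struct Frame {
    pub captured_at: Instant,
    pub pixels: Vec<u8>,
}

pub struct Sampler {
    latest: Arc<Mutex<Option<Frame>>>,
    stop: Arc<AtomicBool>,
    worker: Option<JoinHandle<()>>,
}

impl Sampler {
    /// Spawn a worker that calls `capture` about `hz` times per second
    /// and keeps only the most recent successful frame.
    pub fn start<F>(hz: u32, mut capture: F) -> Sampler
    where
        F: FnMut() -> Option<Vec<u8>> + Send + 'static,
    {
        let latest = Arc::new(Mutex::new(None));
        let stop = Arc::new(AtomicBool::new(false));
        let period = Duration::from_secs_f64(1.0 / f64::from(hz.max(1)));
        let (slot, flag) = (Arc::clone(&latest), Arc::clone(&stop));
        let worker = thread::spawn(move || {
            while !flag.load(Ordering::Relaxed) {
                if let Some(pixels) = capture() {
                    let frame = Frame { captured_at: Instant::now(), pixels };
                    *slot.lock().unwrap() = Some(frame);
                }
                thread::sleep(period);
            }
        });
        Sampler { latest, stop, worker: Some(worker) }
    }

    /// Return the latest frame if it is no older than `max_age`;
    /// a caller would take a fresh screenshot when this returns None.
    pub fn fresh_frame(&self, max_age: Duration) -> Option<Frame> {
        let guard = self.latest.lock().unwrap();
        guard
            .as_ref()
            .filter(|f| f.captured_at.elapsed() <= max_age)
            .cloned()
    }

    /// Signal the worker to stop and wait for it to finish.
    pub fn shutdown(mut self) {
        self.stop.store(true, Ordering::Relaxed);
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}
```

Keeping a single slot rather than a queue means a click request never sees a backlog of stale frames; the trade-off is that intermediate frames are dropped.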
Working but environment-specific and not widely tested. It was built and run against
one specific setup: KDE Plasma 6 on Wayland, with ydotool and a local Qwen2.5-VL
server. It depends on external tools being present and configured (a screenshot
backend, ydotool + its socket, the VLM endpoint), and there is no setup automation
for those. Unit tests cover JSON parsing and point/bbox handling; there are no
integration tests exercising real screenshots, the model, or clicking.
- Linux with a working screenshot path: Spectacle (KDE), KWin DBus, or a wlr-screencopy compositor for grim.
- ydotool with ydotoold running. The default socket path is /run/user/1000/.ydotool_socket (override with --ydotool-socket).
- A Qwen2.5-VL server exposing an OpenAI-compatible chat-completions API (override with --vl-url). The --fast CV path can skip this for some targets but does not replace it.
- A recent stable Rust toolchain.
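
To check the ydotool requirement by hand, it can help to see what a click amounts to on the command line. The sketch below builds the two ydotool invocations and runs them with `std::process::Command`. It is an illustration, not the code in src/click.rs: the helper names are hypothetical, passing the socket through the `YDOTOOL_SOCKET` environment variable is an assumption about how the path reaches ydotool, and the `-x`/`-y` spelling of the coordinates matches ydotool 1.x and may differ on older releases.

```rust
use std::io;
use std::process::Command;

/// Arguments for an absolute pointer move (ydotool 1.x style flags).
pub fn mousemove_args(x: u32, y: u32) -> Vec<String> {
    vec![
        "mousemove".to_string(),
        "--absolute".to_string(),
        "-x".to_string(),
        x.to_string(),
        "-y".to_string(),
        y.to_string(),
    ]
}

/// Arguments for a left click: 0xC0 is left button down (0x40) plus up (0x80).
pub fn click_args() -> Vec<String> {
    vec!["click".to_string(), "0xC0".to_string()]
}

/// Run one ydotool subcommand against the given socket.
/// Returns Ok(true) when ydotool exits successfully.
pub fn run_ydotool(socket: &str, args: &[String]) -> io::Result<bool> {
    let status = Command::new("ydotool")
        .env("YDOTOOL_SOCKET", socket) // assumption: socket passed via env
        .args(args)
        .status()?;
    Ok(status.success())
}

/// Move the pointer, then click. Stops early if the move fails.
pub fn click_at(socket: &str, x: u32, y: u32) -> io::Result<bool> {
    if !run_ydotool(socket, &mousemove_args(x, y))? {
        return Ok(false);
    }
    run_ydotool(socket, &click_args())
}
```

If `click_at("/run/user/1000/.ydotool_socket", 100, 100)` returns an error or `false`, the usual suspects are ydotoold not running or the socket path not matching.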
cargo build --release
# binary at target/release/screen-click

# One-shot: locate and click
screen-click --target "the green X to close the terminal"
# Plan only, do not click
screen-click --target "the OK button" --dry-run
# Show per-stage timing and the raw model output
screen-click --target "the search box" --verbose
# Try CV heuristics before the model
screen-click --target "the red X" --fast
# Read the target from stdin
echo "the settings gear" | screen-click --from-stdin
# Background sampler / daemon: read targets line-by-line from stdin
screen-click --daemon --fast
# then type a target per line; "stats" prints sampler counters, "quit"/"exit" stops

Key flags (see screen-click --help):
- --target <TEXT> / --from-stdin: the target description.
- --dry-run: print the planned click without performing it.
- --verbose / -v: per-stage timing and raw VLM output.
- --fast: try the CV fast path before the VLM.
- --sampler / --sampler-hz <N>: run the background screen sampler.
- --daemon: sampler plus a stdin command loop.
- --vl-url <URL>: VLM endpoint (default http://127.0.0.1:8082).
- --ydotool-socket <PATH>: ydotool socket path.
- KDE/Wayland-oriented. Other environments may work only if a supported screenshot backend and ydotool are available.
- Accuracy of the click depends on the VLM's grounding quality; out-of-bounds coordinates are clamped to the screen.
- The CV fast path is heuristic and covers a limited set of cases (close buttons, simple buttons, color cues); it falls through to the VLM when not confident.
- Timing figures in the source comments are observations from the author's machine, not guaranteed benchmarks.
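
The fall-through from the CV fast path to the VLM noted in the list above can be sketched as a small decision function. This is an assumption-laden illustration rather than the logic in src/main.rs: `Candidate`, `Source`, `locate`, and the explicit `threshold` parameter are hypothetical, and both locators are passed in as closures so the example runs without a screenshot or a model.

```rust
/// A CV match with its confidence score (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Candidate {
    pub x: u32,
    pub y: u32,
    pub confidence: f32,
}

/// Which stage produced the click point.
#[derive(Debug, PartialEq)]
pub enum Source {
    CvFastPath,
    Vlm,
}

/// Try the CV heuristics first when `fast` is set; accept the match only if
/// it clears `threshold`, otherwise fall through to the VLM.
pub fn locate<C, V>(
    fast: bool,
    threshold: f32,
    cv: C,
    vlm: V,
) -> Option<((u32, u32), Source)>
where
    C: FnOnce() -> Option<Candidate>,
    V: FnOnce() -> Option<(u32, u32)>,
{
    if fast {
        if let Some(c) = cv() {
            if c.confidence >= threshold {
                // Confident CV match: the model call is skipped entirely.
                return Some(((c.x, c.y), Source::CvFastPath));
            }
        }
    }
    vlm().map(|point| (point, Source::Vlm))
}
```

Because the VLM closure is only invoked on the fall-through branch, a confident CV match costs no model round trip, which is the point of --fast.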
MIT — see LICENSE.