screen-click

A Linux command-line tool that clicks on-screen UI elements from a natural-language description. You give it a target like "the green X to close the terminal"; it captures a screenshot, locates the target, and issues a click at that point.

What it does

The pipeline (in src/main.rs) is:

1. Screenshot: capture the current screen. Backends are tried in order (src/screenshot.rs): Spectacle DBus, KWin DBus ScreenShot2, grim, then the spectacle CLI as a fallback.
2. CV fast path (optional, --fast): before calling the model, try cheap computer-vision heuristics (src/cv_fast.rs): synthetic-template matching for close (X) buttons, edge/contour-based button finding, and color-based targeting (e.g. "the red X"). If a match clears the confidence threshold, the model is skipped.
3. VLM grounding: otherwise, POST the screenshot plus the target text to a Qwen2.5-VL server (OpenAI-compatible /v1/chat/completions, default http://127.0.0.1:8082). The model is asked to return JSON ({"point_2d": [x, y], ...} or a bbox_2d); src/vlm.rs parses it, tolerating code fences and surrounding prose, and clamps the point to image bounds.
4. Click: dispatch the click with ydotool (src/click.rs), using mousemove --absolute followed by click 0xC0.
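
The parsing and clamping in the VLM grounding step can be sketched roughly as follows. This is a std-only illustration, not the code in src/vlm.rs: the function names are hypothetical, a real implementation would more likely use a JSON crate, and two things are assumptions rather than facts from this README: that bbox_2d is ordered [x1, y1, x2, y2], and that a box is reduced to its center.

```rust
/// Collect the numbers inside the first [...] that follows `"key"` in `text`.
/// Hypothetical helper: scanning for the key lets the reply carry code fences
/// or surrounding prose without a full JSON parse.
fn numbers_after_key(text: &str, key: &str) -> Option<Vec<f64>> {
    let quoted = format!("\"{key}\"");
    let start = text.find(&quoted)? + quoted.len();
    let rest = &text[start..];
    let open = rest.find('[')?;
    let close = open + rest[open..].find(']')?;
    rest[open + 1..close]
        .split(',')
        .map(|s| s.trim().parse::<f64>().ok())
        .collect()
}

/// Turn a model reply into a pixel inside a `width` x `height` screenshot.
/// Prefers point_2d; otherwise falls back to the center of bbox_2d
/// (assumed here to be [x1, y1, x2, y2]).
fn parse_click_point(reply: &str, width: u32, height: u32) -> Option<(u32, u32)> {
    if width == 0 || height == 0 {
        return None;
    }
    let point = numbers_after_key(reply, "point_2d").filter(|p| p.len() == 2);
    let (x, y) = match point {
        Some(p) => (p[0], p[1]),
        None => {
            let b = numbers_after_key(reply, "bbox_2d").filter(|b| b.len() == 4)?;
            ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)
        }
    };
    // Clamp each coordinate into the valid pixel range of the image.
    let clamp_to = |v: f64, size: u32| v.round().clamp(0.0, f64::from(size - 1)) as u32;
    Some((clamp_to(x, width), clamp_to(y, height)))
}
```

With a 1920x1080 screenshot, a reply containing {"point_2d": [1930, 12]} would come back as (1919, 12) under this sketch, and a reply with neither key yields None.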

There is also a background sampler (src/sampler.rs, --sampler/--daemon) that captures frames continuously at a configured rate so a click request can use a pre-captured frame instead of taking a fresh screenshot.
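
One way to picture the sampler is as a shared slot that always holds the newest frame. The sketch below is an assumption-heavy illustration, not the contents of src/sampler.rs: the Frame and LatestFrame types, the max_age freshness check, and sample_interval are all hypothetical names, and pixels stands in for whatever image type the real code uses.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// One captured frame with the time it was taken.
#[derive(Clone, Debug, PartialEq)]
struct Frame {
    pixels: Vec<u8>,
    captured_at: Instant,
}

/// Shared slot holding only the most recent frame (hypothetical type).
#[derive(Clone, Default)]
struct LatestFrame {
    slot: Arc<Mutex<Option<Frame>>>,
}

impl LatestFrame {
    /// Called by the sampler thread after each capture; the older frame is dropped.
    fn publish(&self, pixels: Vec<u8>, captured_at: Instant) {
        *self.slot.lock().unwrap() = Some(Frame { pixels, captured_at });
    }

    /// Called by a click request: return the stored frame only if it is no
    /// older than `max_age`; otherwise the caller takes a fresh screenshot.
    fn fresh(&self, now: Instant, max_age: Duration) -> Option<Frame> {
        let guard = self.slot.lock().unwrap();
        guard
            .as_ref()
            .filter(|f| now.saturating_duration_since(f.captured_at) <= max_age)
            .cloned()
    }
}

/// Interval between captures for a rate given in Hz (captures per second).
fn sample_interval(hz: f64) -> Option<Duration> {
    if !hz.is_finite() || hz <= 0.0 {
        return None;
    }
    Duration::try_from_secs_f64(1.0 / hz).ok()
}
```

In this sketch the sampler thread would sleep for sample_interval(hz) between captures and call publish after each one, while a click request calls fresh and falls back to the normal screenshot path on None.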

Status

Working but environment-specific and not widely tested. It was built and run against one specific setup: KDE Plasma 6 on Wayland, with ydotool and a local Qwen2.5-VL server. It depends on external tools being present and configured (a screenshot backend, ydotool + its socket, the VLM endpoint), and there is no setup automation for those. Unit tests cover JSON parsing and point/bbox handling; there are no integration tests exercising real screenshots, the model, or clicking.

Requirements

Linux with a working screenshot path: Spectacle (KDE), KWin DBus, or a wlr-screencopy compositor for grim.
ydotool with ydotoold running. The default socket path is /run/user/1000/.ydotool_socket, which assumes UID 1000; pass --ydotool-socket if your runtime directory differs.
A Qwen2.5-VL server exposing an OpenAI-compatible chat-completions API (override with --vl-url). The --fast CV path can skip this for some targets but does not replace it.
A recent stable Rust toolchain.
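
To make the ydotool requirement concrete, here is a rough sketch of the two invocations a click might translate to. It is not the code in src/click.rs. It assumes a ydotool 1.x client that accepts -x/-y with mousemove --absolute and reads the socket path from the YDOTOOL_SOCKET environment variable; the function name is hypothetical. The commands are only built here, not run.

```rust
use std::process::Command;

/// Build (but do not run) the two ydotool invocations for a left click at (x, y).
/// 0xC0 combines the "down" (0x40) and "up" (0x80) bits for the left button.
fn ydotool_click_commands(x: u32, y: u32, socket: &str) -> [Command; 2] {
    let xs = x.to_string();
    let ys = y.to_string();

    // Step 1: move the pointer to absolute screen coordinates.
    let mut mv = Command::new("ydotool");
    mv.env("YDOTOOL_SOCKET", socket).args([
        "mousemove",
        "--absolute",
        "-x",
        xs.as_str(),
        "-y",
        ys.as_str(),
    ]);

    // Step 2: press and release the left button at the current position.
    let mut click = Command::new("ydotool");
    click.env("YDOTOOL_SOCKET", socket).args(["click", "0xC0"]);

    [mv, click]
}
```

Calling status() on each command in order would perform the click. If either call fails, checking that ydotoold is running and that the socket path is correct is a reasonable first step.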

Build

cargo build --release
# binary at target/release/screen-click

Usage

# One-shot: locate and click
screen-click --target "the green X to close the terminal"

# Plan only, do not click
screen-click --target "the OK button" --dry-run

# Show per-stage timing and the raw model output
screen-click --target "the search box" --verbose

# Try CV heuristics before the model
screen-click --target "the red X" --fast

# Read the target from stdin
echo "the settings gear" | screen-click --from-stdin

# Background sampler / daemon: read targets line-by-line from stdin
screen-click --daemon --fast
# then type a target per line; "stats" prints sampler counters, "quit"/"exit" stops

Key flags (see screen-click --help):

--target <TEXT> / --from-stdin — the target description.
--dry-run — print the planned click without performing it.
--verbose / -v — per-stage timing and raw VLM output.
--fast — try the CV fast path before the VLM.
--sampler / --sampler-hz <N> — run the background screen sampler.
--daemon — sampler plus a stdin command loop.
--vl-url <URL> — VLM endpoint (default http://127.0.0.1:8082).
--ydotool-socket <PATH> — ydotool socket path.

Notes and limitations

KDE/Wayland-oriented. Other environments may work only if a supported screenshot backend and ydotool are available.
Click accuracy depends on the VLM's grounding quality. Out-of-bounds coordinates are clamped to the screenshot's bounds.
The CV fast path is heuristic and covers a limited set of cases (close buttons, simple buttons, color cues); it falls through to the VLM when not confident.
Timing figures in the source comments are observations from the author's machine, not guaranteed benchmarks.
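
As an illustration of the color-cue case and the "fall through when not confident" behavior noted above, the sketch below finds the centroid of pixels close to a target color. It is a simplified stand-in, not the heuristic in src/cv_fast.rs: the function name, the per-channel tolerance, and the use of a minimum pixel count as the confidence gate are all assumptions.

```rust
/// Centroid of pixels within `tol` (per channel) of `target` in a tightly
/// packed RGB8 buffer that is `width` pixels wide.
/// Returns None when fewer than `min_pixels` match, which is where a caller
/// would fall through to the model.
fn color_centroid(
    rgb: &[u8],
    width: usize,
    target: [u8; 3],
    tol: u8,
    min_pixels: usize,
) -> Option<(usize, usize)> {
    if width == 0 {
        return None;
    }
    let (mut sum_x, mut sum_y, mut count) = (0usize, 0usize, 0usize);
    for (i, px) in rgb.chunks_exact(3).enumerate() {
        // A pixel matches when every channel is within the tolerance.
        let close = px.iter().zip(target).all(|(&c, t)| c.abs_diff(t) <= tol);
        if close {
            sum_x += i % width;
            sum_y += i / width;
            count += 1;
        }
    }
    (count > 0 && count >= min_pixels).then(|| (sum_x / count, sum_y / count))
}
```

A target such as "the red X" would map to something like target = [255, 0, 0] in this sketch; a None result means the cue was too weak and the request should go to the VLM instead.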

License

MIT — see LICENSE.
