Peterc3-dev/screen-click

screen-click

A Linux command-line tool that clicks on-screen UI elements from a natural-language description. You give it a target like "the green X to close the terminal"; it captures a screenshot, locates the target, and issues a click at that point.

What it does

The pipeline (in src/main.rs) is:

  1. Screenshot — capture the current screen. Backends are tried in order (src/screenshot.rs): Spectacle DBus, KWin DBus ScreenShot2, grim, then the spectacle CLI as a fallback.
  2. CV fast path (optional, --fast) — before calling the model, try cheap computer-vision heuristics (src/cv_fast.rs): synthetic-template matching for close (X) buttons, edge/contour-based button finding, and color-based targeting (e.g. "the red X"). If a match clears the confidence threshold, the model is skipped.
  3. VLM grounding — otherwise POST the screenshot plus the target text to a Qwen2.5-VL server (OpenAI-compatible /v1/chat/completions, default http://127.0.0.1:8082). The model is asked to return JSON ({"point_2d": [x, y], ...} or a bbox_2d); src/vlm.rs parses it, tolerating code fences and surrounding prose, and clamps the point to image bounds.
  4. Click — dispatch the click with ydotool (src/click.rs): mousemove --absolute to position the pointer, then click 0xC0 (left button, press and release).
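
To make step 3 concrete, here is a minimal sketch of tolerant reply parsing using only the standard library. It is not the code in src/vlm.rs: the names `Point`, `numbers_after_key`, and `extract_point` are hypothetical, and reducing a `bbox_2d` to its center is an assumption about how a box becomes a click point. Rather than parsing full JSON, it looks for the key and reads the array that follows, which is one way to tolerate code fences and surrounding prose.

```rust
/// A click target in image pixel coordinates (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
pub struct Point {
    pub x: i64,
    pub y: i64,
}

/// Returns the numbers in the first `[...]` array that follows `"key"` in `text`.
/// Anything before or after the array (prose, code fences) is ignored.
fn numbers_after_key(text: &str, key: &str) -> Option<Vec<f64>> {
    let quoted = format!("\"{key}\"");
    let rest = &text[text.find(&quoted)? + quoted.len()..];
    let open = rest.find('[')?;
    let close = open + rest[open..].find(']')?;
    rest[open + 1..close]
        .split(',')
        .map(|n| n.trim().parse::<f64>().ok())
        .collect()
}

/// Extracts a click point from a model reply. `point_2d` is used directly;
/// `bbox_2d` ([x1, y1, x2, y2]) is reduced to its center (an assumption).
/// The result is clamped to an image of `width` x `height` pixels, both >= 1.
pub fn extract_point(reply: &str, width: i64, height: i64) -> Option<Point> {
    let (x, y) = if let Some(p) = numbers_after_key(reply, "point_2d") {
        if p.len() != 2 {
            return None;
        }
        (p[0], p[1])
    } else {
        let b = numbers_after_key(reply, "bbox_2d")?;
        if b.len() != 4 {
            return None;
        }
        ((b[0] + b[2]) / 2.0, (b[1] + b[3]) / 2.0)
    };
    Some(Point {
        x: (x.round() as i64).clamp(0, width - 1),
        y: (y.round() as i64).clamp(0, height - 1),
    })
}
```

Calling `extract_point(reply, 1920, 1080)` on a reply with no recognizable array returns `None`, which a caller can treat as "no click" instead of guessing.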

There is also a background sampler (src/sampler.rs, --sampler/--daemon) that captures frames continuously at a configured rate so a click request can use a pre-captured frame instead of taking a fresh screenshot.
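
One way to picture the sampler is a shared slot that always holds the most recent frame. The sketch below is illustrative only and is not taken from src/sampler.rs: `Frame`, `LatestFrame`, `sample_interval`, and the `max_age` freshness check are all hypothetical, and `pixels` stands in for real image data.

```rust
use std::sync::{Arc, Mutex};
use std::time::{Duration, Instant};

/// A captured frame; `pixels` is a placeholder for real image data.
#[derive(Clone, Debug, PartialEq)]
pub struct Frame {
    pub captured_at: Instant,
    pub pixels: Vec<u8>,
}

/// Shared slot holding only the most recent frame.
#[derive(Clone, Default)]
pub struct LatestFrame {
    slot: Arc<Mutex<Option<Frame>>>,
}

impl LatestFrame {
    /// Called by the sampler thread after each capture; replaces the old frame.
    pub fn publish(&self, frame: Frame) {
        *self.slot.lock().unwrap() = Some(frame);
    }

    /// Returns the stored frame if it is no older than `max_age` at time `now`.
    /// A caller that gets `None` would take a fresh screenshot instead.
    pub fn fresh(&self, now: Instant, max_age: Duration) -> Option<Frame> {
        let guard = self.slot.lock().unwrap();
        guard
            .as_ref()
            .filter(|f| now.duration_since(f.captured_at) <= max_age)
            .cloned()
    }
}

/// Interval between captures for a sampling rate in Hz (0 is treated as 1).
pub fn sample_interval(hz: u32) -> Duration {
    Duration::from_secs_f64(1.0 / f64::from(hz.max(1)))
}
```

A sampler thread would loop on `publish` and sleep for `sample_interval(hz)`, while the click path calls `fresh` first. Keeping a single slot rather than a queue means a click request never waits behind stale frames.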

Status

Working but environment-specific and not widely tested. It was built and run against one specific setup: KDE Plasma 6 on Wayland, with ydotool and a local Qwen2.5-VL server. It depends on external tools being present and configured (a screenshot backend, ydotool + its socket, the VLM endpoint), and there is no setup automation for those. Unit tests cover JSON parsing and point/bbox handling; there are no integration tests exercising real screenshots, the model, or clicking.

Requirements

  • Linux with a working screenshot path: Spectacle (KDE), KWin DBus, or a wlr-screencopy compositor for grim.
  • ydotool with ydotoold running. The default socket path is /run/user/1000/.ydotool_socket, which assumes UID 1000; override it with --ydotool-socket for any other user.
  • A Qwen2.5-VL server exposing an OpenAI-compatible chat-completions API (override with --vl-url). The --fast CV path can skip this for some targets but does not replace it.
  • A recent stable Rust toolchain.
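
Only one screenshot path has to work, because the backends are tried in order. A rough sketch of that fallback follows. It is not the code in src/screenshot.rs: the `Backend` enum, `ORDER`, and `capture_with` are hypothetical names, and the actual capture is passed in as a closure so the sketch stays free of DBus or process calls.

```rust
/// Screenshot backends, named after the ones the README lists.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Backend {
    SpectacleDbus,
    KwinScreenShot2,
    Grim,
    SpectacleCli,
}

/// The order in which backends are attempted.
pub const ORDER: [Backend; 4] = [
    Backend::SpectacleDbus,
    Backend::KwinScreenShot2,
    Backend::Grim,
    Backend::SpectacleCli,
];

/// Tries each backend in order and returns the first success, or the
/// collected error messages if every backend fails.
pub fn capture_with<F>(mut try_backend: F) -> Result<(Backend, Vec<u8>), Vec<String>>
where
    F: FnMut(Backend) -> Result<Vec<u8>, String>,
{
    let mut errors = Vec::new();
    for backend in ORDER {
        match try_backend(backend) {
            Ok(image) => return Ok((backend, image)),
            Err(e) => errors.push(format!("{backend:?}: {e}")),
        }
    }
    Err(errors)
}
```

Collecting every error, rather than returning only the last one, makes it easier to see which of the requirements above is missing on a given machine.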

Build

cargo build --release
# binary at target/release/screen-click

Usage

# One-shot: locate and click
screen-click --target "the green X to close the terminal"

# Plan only, do not click
screen-click --target "the OK button" --dry-run

# Show per-stage timing and the raw model output
screen-click --target "the search box" --verbose

# Try CV heuristics before the model
screen-click --target "the red X" --fast

# Read the target from stdin
echo "the settings gear" | screen-click --from-stdin

# Background sampler / daemon: read targets line-by-line from stdin
screen-click --daemon --fast
# then type a target per line; "stats" prints sampler counters, "quit"/"exit" stops

Key flags (see screen-click --help):

  • --target <TEXT> / --from-stdin — the target description.
  • --dry-run — print the planned click without performing it.
  • --verbose / -v — per-stage timing and raw VLM output.
  • --fast — try the CV fast path before the VLM.
  • --sampler / --sampler-hz <N> — run the background screen sampler.
  • --daemon — sampler plus a stdin command loop.
  • --vl-url <URL> — VLM endpoint (default http://127.0.0.1:8082).
  • --ydotool-socket <PATH> — ydotool socket path.
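
To show how --dry-run and --ydotool-socket could interact with the click step, here is a small sketch built on `std::process::Command`. It rests on several assumptions that are not confirmed by the source: the `-x`/`-y` arguments follow ydotool 1.x syntax, the socket path is passed through the `YDOTOOL_SOCKET` environment variable, and `click_commands`, `render`, and `perform` are hypothetical helpers.

```rust
use std::io;
use std::process::Command;

/// Builds the two ydotool invocations for a left click at (x, y).
/// Assumption: the socket is handed to ydotool via YDOTOOL_SOCKET.
pub fn click_commands(x: i64, y: i64, socket: &str) -> [Command; 2] {
    let mut mv = Command::new("ydotool");
    mv.env("YDOTOOL_SOCKET", socket)
        .args(["mousemove", "--absolute", "-x"])
        .arg(x.to_string())
        .arg("-y")
        .arg(y.to_string());
    let mut click = Command::new("ydotool");
    click.env("YDOTOOL_SOCKET", socket).args(["click", "0xC0"]);
    [mv, click]
}

/// Renders a command as text, which is all a dry run needs to print.
pub fn render(cmd: &Command) -> String {
    let mut parts = vec![cmd.get_program().to_string_lossy().into_owned()];
    parts.extend(cmd.get_args().map(|a| a.to_string_lossy().into_owned()));
    parts.join(" ")
}

/// Logs every command and runs it unless `dry_run` is set.
pub fn perform(cmds: &mut [Command], dry_run: bool) -> io::Result<Vec<String>> {
    let mut log = Vec::new();
    for cmd in cmds.iter_mut() {
        log.push(render(cmd));
        if !dry_run {
            let status = cmd.status()?;
            if !status.success() {
                let msg = format!("{} exited with {status}", render(cmd));
                return Err(io::Error::new(io::ErrorKind::Other, msg));
            }
        }
    }
    Ok(log)
}
```

With `dry_run` set to true nothing is executed, so the returned log can be checked on a machine without ydotool installed.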

Notes and limitations

  • KDE/Wayland-oriented. Other environments may work only if a supported screenshot backend and ydotool are available.
  • Click accuracy depends on the VLM's grounding quality; out-of-bounds coordinates are clamped to the screenshot's bounds.
  • The CV fast path is heuristic and covers a limited set of cases (close buttons, simple buttons, color cues); it falls through to the VLM when not confident.
  • Timing figures in the source comments are observations from the author's machine, not guaranteed benchmarks.
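
The fall-through behaviour of the CV fast path can be sketched as a simple threshold check. This is an illustration, not the logic in src/cv_fast.rs or src/main.rs: `Candidate`, `Source`, and `locate` are hypothetical, and the threshold is a parameter because the source does not state its value.

```rust
/// A CV fast-path result with a confidence score (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Candidate {
    pub x: i64,
    pub y: i64,
    pub confidence: f32,
}

/// Which stage produced the final point.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Source {
    CvFastPath,
    Vlm,
}

/// Uses the CV candidate when it clears `threshold`; otherwise calls `vlm`.
/// The closure runs only on fall-through, so a confident CV match skips the
/// model request entirely.
pub fn locate<F>(cv: Option<Candidate>, threshold: f32, vlm: F) -> Option<(i64, i64, Source)>
where
    F: FnOnce() -> Option<(i64, i64)>,
{
    if let Some(c) = cv {
        if c.confidence >= threshold {
            return Some((c.x, c.y, Source::CvFastPath));
        }
    }
    vlm().map(|(x, y)| (x, y, Source::Vlm))
}
```

Passing the model call as a closure keeps the expensive step lazy: it costs nothing unless the heuristics come back empty or unsure.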

License

MIT — see LICENSE.

About

Linux CLI: click on-screen UI from a natural-language target via screenshot + CV heuristics + Qwen2.5-VL grounding + ydotool
