Murphy automatically evaluates websites by generating and executing test scenarios in a real browser with an AI judge. It supports two planning strategies — broad feature discovery (default) and goal-directed exploration (--goal) — followed by test execution. It produces structured evaluation reports with pass/fail results, failure categorization, and actionable summaries.
Built on top of browser-use (AI browser automation library).
- Python >= 3.11
- An OpenAI API key (
OPENAI_API_KEY) — default model isgpt-5-mini
| Local (uv) | Docker | |
|---|---|---|
| Best for | Sites requiring login (--auth), development |
Public sites, reproducible environments |
| Auth support | Full — opens a visible browser for manual login | No — browser runs headless with no visible window, so --auth cannot work |
| Review pauses | Works — you edit files on disk and press Enter | Works — files are on a mounted volume, press Enter in the same terminal |
| Requires | Python >= 3.11, uv, Chromium | Docker |
Rule of thumb: use local if the site requires authentication. Both setups are interactive — Murphy pauses for you to review and edit the generated test plan (and features, when not using --goal) before continuing.
1. Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh2. Clone and install dependencies:
git clone https://github.com/ProsusAI/Murphy.git
cd Murphy
uv sync3. Install Chromium:
uv run playwright install chromium4. Create .env with your API key:
cp .env.example .envThen set your key:
OPENAI_API_KEY=sk-...
Note: Docker runs the browser in headless mode. The
--authflag (manual login) will not work — use the local setup above if your site requires login.
1. Build the image:
docker build -f docker/Dockerfile . -t murphy --no-cache2. Create .env with your API key (same as above).
3. Run via the helper script:
./run.sh --url https://example.com [options]The script mounts murphy/ and .env into the container and runs python -m murphy with your arguments.
All examples below use uv run murphy. If running via Docker, replace with ./run.sh.
# Full run: auto-detect auth -> analyze site -> generate tests -> execute
uv run murphy --url https://example.com
# Goal-directed: explores with focus, skips feature discovery, generates plan directly
uv run murphy --url https://example.com --goal "test the checkout flow"
# Site requires login — opens browser for manual auth first (local only, not Docker)
uv run murphy --url https://example.com --auth
# Public site, skip auth detection entirely
uv run murphy --url https://example.com --no-auth
# Resume from previously generated/edited files
uv run murphy --url https://example.com --features murphy/output/example_com_features.md
uv run murphy --url https://example.com --plan murphy/output/test_plan.yamlMurphy.mp4
Murphy supports two planning strategies, both followed by the same execution phase:
Strategy A — Feature discovery (default, no --goal):
An AI agent navigates the site, discovers pages, and catalogs features. Murphy saves an editable <site>_features.md and pauses for review. Then an LLM reads the features and produces test scenarios, saving an editable test_plan.yaml with another pause for review.
Strategy B — Goal-directed exploration (--goal):
An AI agent explores the site with the given goal in mind, then synthesizes a test plan directly from the exploration. Murphy saves an editable test_plan.yaml and pauses for review. No features.md is generated — the exploration replaces broad feature discovery.
Execution (both strategies): An AI agent runs each test scenario in a real browser, and a separate judge LLM evaluates pass/fail. Saves evaluation_report.json and evaluation_report.md.
You can resume from any point by passing --features or --plan with a previously generated (and optionally edited) file.
Default output directory: ./murphy/output/
| File | Description |
|---|---|
<site>_features.md |
Discovered features, pages, and user flows (editable; only generated without --goal) |
test_plan.yaml |
Generated test scenarios with steps and success criteria (editable) |
evaluation_report.json |
Full structured results (machine-readable) |
evaluation_report.md |
Human-readable summary with pass/fail per test |
After a run, evaluation_report.md looks like this (abbreviated):
# Evaluation Report: Example Store
> **An e-commerce site with product listings, search, and checkout.**
| | |
|---|---|
| URL | https://example.com |
| Category | ecommerce |
| Date | 2026-03-07 |
## Results at a Glance
**6/8 tests passed (75.0%)**
- Website Issues: 1
- Test Limitations: 1
| Test | Persona | Result | Category | Duration |
|------|---------|--------|----------|----------|
| Search for existing product | Happy Path | Passed | | 42s |
| Submit empty checkout form | Edge Case | Passed | | 38s |
| XSS payload in search bar | Adversarial | Passed | | 35s |
| Add item without logging in | Confused Novice | Failed | Website Issue | 51s |
| Rapid checkout button clicks | Impatient User | Passed | | 29s |
| ... | ... | ... | ... | ... |
## Executive Summary
The site handles core flows well but lacks feedback for unauthenticated
actions — adding an item to cart silently fails with no error message.
### Key Findings
1. No feedback when guest user attempts cart actions (Website Issue)
2. Search handles XSS payloads correctly via silent sanitization
3. Empty form submission shows clear inline validation errors
### Recommended Actions
1. Show a login prompt or error when unauthenticated users attempt cart actions
2. Add rate-limiting feedback for rapid repeated submissions
3. Improve loading indicators on slow network requestsThe full JSON report (evaluation_report.json) contains structured results, action traces, screenshots, trait evaluations, and feedback quality scores.
| Flag | Default | Description |
|---|---|---|
--url |
(required) | Target URL to evaluate |
--goal |
Free-text goal to bias test generation (e.g. "test the checkout flow") |
|
--auth |
false |
Skip auto-detection, go straight to manual login wait |
--no-auth |
false |
Skip auth detection entirely, treat site as public |
--features |
Path to existing features markdown (skips feature discovery) | |
--plan |
Path to existing YAML test plan (skips planning, goes straight to execution) | |
--max-tests |
8 |
Maximum number of test scenarios to generate |
--model |
gpt-5-mini |
LLM model for agent tasks |
--judge-model |
gpt-5-mini |
LLM model for judging verdicts |
--output-dir |
./murphy/output |
Output directory for all generated files |
--category |
Site category hint (ecommerce, saas, content, social) |
|
--ui |
false |
Launch interactive web UI instead of terminal output |
--no-highlights |
false |
Disable bounding boxes on interactive elements in the browser |
--max-steps |
30 |
Max agent steps per exploration/execution phase |
--parallel |
3 |
Number of tests to run concurrently |
Launch the web UI with:
murphy --url https://example.com --uiThe UI lets you review the generated test plan, run all tests with a live progress bar, and view detailed results with pass/fail verdicts, action traces, and failure analysis.
Murphy exposes a REST API for programmatic evaluation. Start the server with:
murphy-apiEndpoints: /analyze, /generate-plan, /execute, /evaluate, /jobs/{job_id}. Each POST endpoint supports synchronous, async+webhook, and async+polling modes. Auth via X-API-Key header.
See docs/ARCHITECTURE.md for full endpoint documentation.
All variables are optional unless noted. See .env.example for a template.
| Variable | Description |
|---|---|
OPENAI_API_KEY |
OpenAI API key (required) |
| Variable | Default | Description |
|---|---|---|
MURPHY_API_KEY |
(none) | API key for REST API authentication (open access if unset) |
MURPHY_API_HOST |
0.0.0.0 |
Host to bind the API server |
MURPHY_API_PORT |
8000 |
Port for the API server |
MURPHY_MAX_CONCURRENT_JOBS |
2 |
Maximum concurrent browser jobs |
MURPHY_REQUEST_TIMEOUT |
1800 |
HTTP keep-alive timeout (seconds) |
MURPHY_JOB_TIMEOUT_OVERRIDE |
(none) | Override all per-endpoint job timeouts (seconds) |
| Variable | Default | Description |
|---|---|---|
BROWSER_USE_EXECUTABLE_PATH |
(auto) | Path to Chrome/Chromium executable |
BROWSER_USE_HEADLESS |
true |
Run browser in headless mode |
BROWSER_USE_LOGGING_LEVEL |
info |
Logging level |
See docs/ARCHITECTURE.md for details on the codebase structure, including the Murphy evaluation pipeline and the vendored browser-use engine.
See CONTRIBUTING.md for development workflow, testing, code style, and contribution guidelines.