Developed under Nexus Labs · University College London
Built upon Avenir-Web by the Princeton AI² Lab.
Authors: Wee Joe Tan, Zi Rui Lucas Lim, Shashank Durgad, Karim Obegi, Aiden Yiliu Li
OpenFlo is an autonomous web agent framework for systematic measurement of web usability patterns at scale. Extending the Avenir-Web architecture, OpenFlo introduces a multi-metric UX evaluation layer that scores each agent action across four dimensions — overall ease (SEQ), efficiency, clarity, and confidence — using the established Single Ease Question (SEQ) methodology extended beyond its original single-item form. Step-level scores are then synthesised into a System Usability Scale (SUS) report at session end, following the standard Brooke (1996) scoring formula and Sauro-Lewis curved grading. A configurable persona framework injects user archetypes (defined by digital literacy, device, reading speed, and friction tolerance) directly into LLM scoring prompts, enabling comparative usability analysis across demographic profiles without re-running tasks. Together these components form a fully automated pipeline for studying usability patterns across real websites at scale.
(No releases yet — check back soon.)
Requirements:
- Python >=3.9 (3.11 recommended)
- Playwright-compatible browser (Chromium recommended)
- An API key for your chosen LLM provider (OpenRouter preferred)
conda create -n openflo python=3.11
conda activate openflo
pip install -e src
playwright install chromium

Recommended: set the API key as an environment variable:

export OPENROUTER_API_KEY="your-key"

Or copy the template and fill it in:

cp .env.example .env
# edit .env and set OPENROUTER_API_KEY

Environment variables take precedence over any [api_keys] values in the config TOML.
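The precedence rule can be illustrated with a minimal sketch. This is not OpenFlo's actual loader: the function name `resolve_api_key` and the `openrouter` key inside `[api_keys]` are assumptions made for illustration, and the config is assumed to be an already-parsed TOML dictionary.

```python
import os
from typing import Any, Mapping, Optional


def resolve_api_key(
    config: Mapping[str, Any],
    env: Optional[Mapping[str, str]] = None,
    env_var: str = "OPENROUTER_API_KEY",
    config_key: str = "openrouter",  # hypothetical key name inside [api_keys]
) -> Optional[str]:
    """Return the API key, preferring the environment over the config file."""
    env = os.environ if env is None else env

    # 1. A non-empty environment variable always wins.
    value = env.get(env_var)
    if value:
        return value

    # 2. Otherwise fall back to the [api_keys] table of the parsed TOML config.
    api_keys = config.get("api_keys")
    if isinstance(api_keys, Mapping):
        return api_keys.get(config_key) or None
    return None
```

Passing `env` explicitly keeps the function easy to test without touching the real process environment.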
Run all commands from src/ (config paths are relative to src/):
cd src

Set experiment.task_file_path in src/config/auto_mode.toml, then:

uv run run_agent.py -c config/auto_mode.toml

To evaluate with a persona, pass a persona file with -p:

uv run run_agent.py -c config/auto_mode.toml -p config/persona.toml

The persona is injected into SEQ scoring prompts and surfaced in the final sus_report.json. See src/config/persona.toml for all available fields with inline documentation.
Batch mode expects a JSON array:
[
  {
    "task_id": "task_001",
    "confirmed_task": "Find the official API docs for X",
    "website": "https://example.com/"
  }
]

OpenFlo's primary research contribution is an automated UX evaluation pipeline built on top of Avenir-Web's agent execution layer.
The Single Ease Question (SEQ) is a validated 7-point micro-usability metric for individual task steps. OpenFlo extends it with three additional dimensions, all scored 1–7 after each agent action:
| Metric | What it captures |
|---|---|
| SEQ (overall ease) | Perceived difficulty of completing the action |
| Efficiency | Whether the path to the action was direct and fast |
| Clarity | How understandable the UI element or system response was |
| Confidence | How certain the user felt about the action and its outcome |
Each metric also produces a 1–2 sentence qualitative assessment. Low-scoring steps (SEQ ≤ 3) are flagged as friction points and classified by severity.
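As a rough illustration of the step-level record and the friction rule described above, the following sketch validates the 1–7 range and flags steps with SEQ ≤ 3. The `StepScore` class and `friction_points` helper are hypothetical names, not OpenFlo's API, and severity classification is left out because the text does not give its thresholds.

```python
from dataclasses import dataclass
from typing import List

FRICTION_THRESHOLD = 3  # from the text: SEQ <= 3 marks a friction point


@dataclass
class StepScore:
    """Four 1-7 scores recorded after a single agent action (hypothetical type)."""

    step: int
    seq: int
    efficiency: int
    clarity: int
    confidence: int

    def __post_init__(self) -> None:
        # Every metric uses the same 7-point scale.
        for name in ("seq", "efficiency", "clarity", "confidence"):
            value = getattr(self, name)
            if not 1 <= value <= 7:
                raise ValueError(f"{name} must be in 1..7, got {value}")

    @property
    def is_friction_point(self) -> bool:
        return self.seq <= FRICTION_THRESHOLD


def friction_points(steps: List[StepScore]) -> List[int]:
    """Return the step numbers flagged as friction points, in order."""
    return [s.step for s in steps if s.is_friction_point]
```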
At session end, the System Usability Scale (SUS) (Brooke, 1996) is computed from the accumulated step-level data using the standard formula:
Final SUS Score = (X + Y) × 2.5
X = Σ (score − 1) for positive items {1, 3, 5, 7, 9}
Y = Σ (5 − score) for negative items {2, 4, 6, 8, 10}
Grades follow the Sauro-Lewis curved scale (A+ ≥ 84.1 → F < 51.7). If the LLM is unavailable, a statistical heuristic fallback is used instead, based on volatility and learning-curve analysis across the four metric dimensions.
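The formula above can be sketched in a few lines. This is a minimal illustration, not the contents of `sus_calculator.py`: the function names are hypothetical, the input is assumed to be ten SUS item responses on the standard 1–5 scale, and how OpenFlo derives those responses from step-level data is not covered here. Only the A+ and F boundaries come from the text; the intermediate grade boundaries follow the commonly published Sauro-Lewis scale and should be treated as an assumption.

```python
from typing import Sequence

# (minimum score, grade), highest first. Intermediate boundaries are an
# assumption taken from the commonly published Sauro-Lewis curved scale.
GRADE_FLOORS = [
    (84.1, "A+"), (80.8, "A"), (78.9, "A-"),
    (77.2, "B+"), (74.1, "B"), (72.6, "B-"),
    (71.1, "C+"), (65.0, "C"), (62.7, "C-"),
    (51.7, "D"),
]


def sus_score(responses: Sequence[int]) -> float:
    """Compute the 0-100 SUS score from ten 1-5 item responses (Brooke, 1996)."""
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(not 1 <= r <= 5 for r in responses):
        raise ValueError("each SUS response must be in 1..5")

    # Items 1, 3, 5, 7, 9 are positively worded: contribute (score - 1).
    x = sum(r - 1 for r in responses[0::2])
    # Items 2, 4, 6, 8, 10 are negatively worded: contribute (5 - score).
    y = sum(5 - r for r in responses[1::2])
    return (x + y) * 2.5


def sus_grade(score: float) -> str:
    """Map a SUS score to a letter grade; anything below 51.7 is an F."""
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"
```

For example, a respondent who answers 4 on every positive item and 2 on every negative item gets X = 15 and Y = 15, so the score is (15 + 15) × 2.5 = 75.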
To enable UX evaluation, set in your config:
[ux]
enable_synthesis = true

UX sessions can be evaluated through the lens of a configurable user persona. When a persona is active, the LLM embodies the described user's profile when scoring each step, adjusting judgements based on their characteristics:
| Field | Options |
|---|---|
| `digital_literacy` | `"expert"`, `"intermediate"`, `"beginner"`, `"very_low"` |
| `primary_device` | `"desktop_keyboard"`, `"desktop_mouse"`, `"tablet_touch"`, `"mobile_touch"` |
| `reading_speed` | `"fast"`, `"normal"`, `"slow"` |
| `tolerance_for_friction` | `"high"`, `"medium"`, `"low"`, `"very_low"` |
| `prior_experience` | Free text fed directly into scoring prompts |
| `description` | 3–4 sentence narrative the LLM embodies during scoring |
| `common_friction_types` | Labels surfaced in the report (e.g. waiting, confusion, searching) |
| `[persona.scoring_bias]` | Integer offsets applied per metric after LLM response (e.g. `seq_modifier = -1`) |
This makes it possible to compare how the same workflow scores for an expert desktop user vs. a low-literacy mobile user without re-running the task.
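The `[persona.scoring_bias]` offsets can be pictured as a small post-processing step on the LLM's scores. The sketch below is an assumption-heavy illustration: only `seq_modifier` appears in the documentation, so the `<metric>_modifier` naming for the other metrics is hypothetical, as is clamping the adjusted value back into the 1–7 range.

```python
from typing import Dict


def apply_scoring_bias(scores: Dict[str, int], bias: Dict[str, int]) -> Dict[str, int]:
    """Apply integer per-metric offsets to LLM scores, keeping results in 1..7."""
    adjusted = {}
    for metric, value in scores.items():
        # Assumed naming scheme: "seq" -> "seq_modifier", and so on.
        offset = bias.get(f"{metric}_modifier", 0)
        # Clamping is an assumption; it keeps the result on the 7-point scale.
        adjusted[metric] = max(1, min(7, value + offset))
    return adjusted
```

With a bias of `{"seq_modifier": -1}`, a step the LLM scored as SEQ 4 would be recorded as SEQ 3, which is enough to push it into friction-point territory.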
OpenFlo/
├── src/
│ ├── openflo/
│ │ ├── agent/
│ │ │ ├── agent.py # Central orchestrator: predict → execute → evaluate loop
│ │ │ ├── config.py # Config loading and validation (TOML + env)
│ │ │ ├── executor.py # Action dispatch: click, type, scroll, drag, …
│ │ │ ├── predictor.py # LLM interaction, action prediction, history compression
│ │ │ ├── evaluation.py # Task completion verification and termination logic
│ │ │ └── reporting.py # Result serialisation and action summary generation
│ │ ├── browser/ # Playwright integration and browser state management
│ │ ├── llm/ # LLM engine abstraction (via LiteLLM)
│ │ ├── managers/
│ │ │ └── ux_synthesis.py # SEQ-to-SUS orchestration (UXSynthesisManager)
│ │ ├── ux/
│ │ │ ├── seq_scorer.py # Multi-metric SEQ evaluator
│ │ │ ├── sus_calculator.py # SUS scoring with Sauro-Lewis grading + heuristic fallback
│ │ │ └── report_generator.py # Markdown and JSON UX report output
│ │ ├── personas/
│ │ │ └── profile.py # PersonaProfile dataclass
│ │ ├── prompts/ # Prompt templates and builders
│ │ └── utils/ # Image processing, reasoning utilities
│ ├── run_agent.py # Entry point: demo + batch runner
│ └── config/
│ ├── auto_mode.toml # Primary config (model, playwright, UX, experiment)
│ └── persona.toml # Example persona config with inline documentation
├── data/ # Example task JSON files
└── .env.example # API key template
Configs are TOML files. See src/config/auto_mode.toml for a fully annotated example.
- `save_file_dir`: output root directory
- `default_task`, `default_website`: used when no batch file is provided

- `name`: model identifier (e.g. `openrouter/anthropic/claude-sonnet-4-5`)
- `temperature`, `rate_limit`
- `reasoning_model`: separate model for termination/evaluation reasoning
- `checklist_model`: model for checklist management
- `completion_eval_model`: model for final task success evaluation

- `task_file_path`: JSON task list for batch mode
- `overwrite`: skip or overwrite existing task output folders
- `max_op`, `max_continuous_no_op`: execution limits
- `highlight`: draw labeled overlays on screenshots

- `headless`, `viewport`, `tracing`, `save_video`
- `locale`, `geolocation`: for locale/region-sensitive tasks

- `enable_synthesis`: enable SEQ/SUS evaluation (default `false`)
- `generate_report`: write `sus_report.json` at session end (default `true`)
- `ux_model`: model for SEQ scoring; falls back to main model if omitted
- `seq_screenshot_context`: include screenshots in step evaluation (default `true`)
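The documented UX defaults and the `ux_model` fallback can be summarised in a short sketch. It assumes the UX table has already been parsed from TOML into a dictionary; `resolve_ux_settings` is a hypothetical helper, not a function from OpenFlo's `config.py`.

```python
from typing import Any, Dict, Mapping

# Defaults as stated in the field descriptions.
UX_DEFAULTS: Dict[str, Any] = {
    "enable_synthesis": False,
    "generate_report": True,
    "seq_screenshot_context": True,
}


def resolve_ux_settings(ux_section: Mapping[str, Any], main_model: str) -> Dict[str, Any]:
    """Merge user-supplied UX settings over the documented defaults."""
    settings = {**UX_DEFAULTS, **ux_section}
    # ux_model falls back to the main model when omitted or empty.
    settings["ux_model"] = ux_section.get("ux_model") or main_model
    return settings
```

A config that only sets `enable_synthesis = true` would therefore still write the report, include screenshots in step evaluation, and score with the main model.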
All persona fields are optional. The persona can also be passed as a separate file with `-p persona.toml`. See the [Persona Framework](#persona-framework) section for field definitions.
Each task writes to <save_file_dir>/<task_id>/:
| File | Contents |
|---|---|
| `agent.log` | Per-task execution log |
| `result.json` | Final summary (task, actions, outcome, timing) |
| `config.toml` | Resolved config snapshot |
| `all_predictions.json` | Full LLM I/O trace for the task |
| `screenshots/` | `screen_<step>.png` and `screen_<step>_labeled.png` |
| `sus_report.json` | UX evaluation: SEQ scores per step, SUS score and grade, friction analysis, persona context |
Run-level logs are written to src/logs/.
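When post-processing a batch run, it can help to check that a task folder contains the per-task files listed above before parsing them. The following is a small sketch under stated assumptions: `missing_outputs` is a hypothetical helper, and it assumes `sus_report.json` is only expected when UX evaluation was enabled.

```python
from pathlib import Path
from typing import List

# File and directory names taken from the per-task output table.
EXPECTED_OUTPUTS = [
    "agent.log",
    "result.json",
    "config.toml",
    "all_predictions.json",
    "screenshots",
    "sus_report.json",
]


def missing_outputs(task_dir: Path, ux_enabled: bool = True) -> List[str]:
    """List the expected outputs that are absent from a task's output folder."""
    expected = [
        name for name in EXPECTED_OUTPUTS
        # Assumption: the SUS report only exists when UX evaluation ran.
        if ux_enabled or name != "sus_report.json"
    ]
    return [name for name in expected if not (task_dir / name).exists()]
```

An empty list means the folder is complete and safe to hand to downstream analysis.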
- Missing API key: fill in `OPENROUTER_API_KEY` in `.env` (copy from `.env.example`)
- Playwright browser not found: run `playwright install chromium`
- Want to watch the browser: set `playwright.headless = false`
- Config paths look wrong: run from `src/` or pass an absolute path with `-c`
- Brooke, J. (1996). SUS: A quick and dirty usability scale. In P. Jordan et al. (Eds.), Usability Evaluation in Industry. Taylor & Francis.
- Sauro, J. & Lewis, J. R. (2011). Correlations among prototypical usability metrics: evidence for the construct of usability. CHI '11, ACM. — basis for the Sauro-Lewis grading scale
- measuringu.com/seq10/ — SEQ methodology and benchmarks
- measuringu.com/sus/ — SUS grading and percentile lookup
OpenFlo is built upon Avenir-Web by the Princeton AI² Lab. We thank the Avenir-Web authors for open-sourcing their framework.
This repository is provided for research use. Model outputs may be incorrect, incomplete, or unsafe. You are responsible for reviewing agent actions and complying with applicable laws and website terms of service when running web automation.
Wee Joe Tan — joe.tan.25@ucl.ac.uk
This project is licensed under the Apache License 2.0 — see the LICENSE file for details.
Copyright © 2025 UCL Nexus Labs. All rights reserved.