From 5de30071a9e1355de1dc901f01fab4019c4b9003 Mon Sep 17 00:00:00 2001 From: 0x_hackerfren <0xhackerfren@gmail.com> Date: Thu, 14 May 2026 13:25:35 -0400 Subject: [PATCH] Add agentic product testing: agents verify the product as real users Adds a new verification layer where AI agents operate the assembled application as real users. Each agent receives a user profile (technical literacy, domain knowledge, constraints) and a goal-oriented objective, then navigates the product through its UI. Captures UX quality metrics: click count, navigation depth, backtracking, error encounters, confusion index. UX budgets set quantitative thresholds as quality gates. This is a uniquely agentic capability that no previous testing methodology could provide. --- CHANGELOG.md | 8 + MANIFESTO.md | 10 +- README.md | 12 ++ ROADMAP.md | 1 + harness/.cursor/rules/codex-automata.mdc | 11 +- harness/PLAYBOOK.md | 52 ++++- harness/templates/product-test-template.md | 95 ++++++++++ harness/templates/test-plan-template.md | 9 + harness/templates/user-profile-template.md | 82 ++++++++ reference/glossary.md | 12 ++ reference/product-testing.md | 210 +++++++++++++++++++++ reference/workflow.md | 28 ++- 12 files changed, 522 insertions(+), 8 deletions(-) create mode 100644 harness/templates/product-test-template.md create mode 100644 harness/templates/user-profile-template.md create mode 100644 reference/product-testing.md diff --git a/CHANGELOG.md b/CHANGELOG.md index 57f5ca9..7c4cb64 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -19,6 +19,14 @@ Types of changes: **Added**, **Changed**, **Deprecated**, **Removed**, **Fixed** - Retroactive Gap Discovery section in `reference/workflow.md` feedback loops. - Recovery terms added to `reference/glossary.md`: contract gap, coverage erosion, gap assessment, mold gap, recovery, spec gap. - Recovery constraint and terminology added to `harness/.cursor/rules/codex-automata.mdc`. +- Agentic product testing (`reference/product-testing.md`): verification layer where AI agents operate the assembled application as real users with defined profiles and goal-oriented objectives. Measures UX quality through click counts, navigation depth, backtracking, error encounters, and time budgets. +- Product test template (`harness/templates/product-test-template.md`): structured template for defining product test scenarios with objectives, UX budgets, and specification traceability. +- User profile template (`harness/templates/user-profile-template.md`): test fixture template for defining simulated user personas with technical literacy, domain knowledge, constraints, and behavioral tendencies. +- Phase 5b (Product Testing) added to `harness/PLAYBOOK.md` between Review and Deployment. +- Product test section added to `harness/templates/test-plan-template.md`. +- Product testing terms added to `reference/glossary.md`: product test, test objective, user profile, UX budget. +- Product testing section added to `harness/.cursor/rules/codex-automata.mdc`. +- Product testing section added to `MANIFESTO.md` Section VI (The Machine). ## [0.1.0] - 2026-05-11 diff --git a/MANIFESTO.md b/MANIFESTO.md index 2a83668..3c73c8e 100644 --- a/MANIFESTO.md +++ b/MANIFESTO.md @@ -150,7 +150,15 @@ An agent reads each spec and builds the mold. Unit tests, integration tests, con Then the cast. Five modules. Five agents. Each receives its spec, its tests, and the interface contracts of its neighbors. They work in parallel, independently, silently. They do not communicate because they do not need to. 
The specs and contracts contain everything. One finishes in minutes. Another takes an hour. They are not synchronized. When each agent's tests pass, the module is done. -Automated gates fire. An agent reviews against the spec. A human reviews last, checking for coherence. Does the system still make sense as a whole? The pipeline runs. The code deploys. Monitoring confirms that production matches the spec. If it does not, the cycle restarts at the mold. +Automated gates fire. An agent reviews against the spec. A human reviews last, checking for coherence. Does the system still make sense as a whole? + +But one verification layer remains. The modules pass their tests. The contracts hold. The code is correct. Is the product any good? A system can satisfy every unit test and still be unusable. The signup flow requires twelve clicks when three would suffice. The error message is technically accurate but incomprehensible. The navigation makes sense to the engineer who built it but bewilders the customer who needs it. + +Traditional testing cannot catch this. Scripted end-to-end tests verify a predetermined path: click this button, assert this text. They confirm the path works, not that a user can find it. Agentic product testing inverts the approach. Give an agent a user profile and an objective. "As a first-time visitor, create an account and reach the dashboard." The agent navigates the application the way a real user would, reading labels, clicking buttons, making mistakes, recovering from errors. If it cannot accomplish the objective, the product has a usability defect. If it takes twenty clicks when the budget is eight, the product has a friction defect. The agent's journey is measurable evidence of experience quality. + +This is a capability that exists only in the agentic era. No previous testing methodology could instruct a test to "figure out how to accomplish this goal" and then measure whether the experience was efficient, discoverable, and humane. Click counts, backtracking rates, dead ends, hesitation points, error recovery paths: these become quantitative signals derived from agents operating the product exactly as users would. The specification defines what the product must do. Product testing verifies that real people can actually do it. + +The pipeline runs. The code deploys. Monitoring confirms that production matches the spec. If it does not, the cycle restarts at the mold. Remove any piece and the machine breaks. Without specs, agents hallucinate, producing plausible code aimed at the wrong problem. Without tests, infinite variety with no definition of correct. Without modularity, agents collide in merge conflicts, contradictions, and chaos. Without flow, work pools in the wrong places. Without CI/CD, the whole arrangement depends on discipline that humans will eventually forget and agents will never have. diff --git a/README.md b/README.md index fb2bfef..fb2a5ed 100644 --- a/README.md +++ b/README.md @@ -117,6 +117,7 @@ codex-automata/ | |-- kanban.md Flow-based project management | |-- agent-operating-model.md How agents operate | |-- recovery.md Recovery protocol for closing gaps +| |-- product-testing.md Agentic product testing reference | |-- glossary.md Terminology reference | |-- examples/ WORKED EXAMPLES (read, don't copy) @@ -144,6 +145,17 @@ Specification --> Tests --> Code If the casting is defective, fix the mold. If the mold is wrong, fix the specification. Do not debug the implementation directly. 
+## Product Testing + +After the code is assembled, AI agents verify the product by operating it as real users. Each agent receives a user profile (technical literacy, domain knowledge, constraints) and a goal-oriented objective ("as a first-time user, create an account and reach the dashboard"). The agent navigates the application through the UI, and the journey produces measurable signals: + +- **Click count and navigation depth** measure friction. +- **Backtracking and dead ends** measure discoverability. +- **Error encounters** measure input guidance quality. +- **Abandonment** flags critical usability failures. + +UX budgets set quantitative thresholds for each metric. Product tests run as quality gates in CI, staging, and production canaries. See `reference/product-testing.md` for the full reference. + ## When Gaps Are Discovered Real projects discover gaps after code exists: a production incident exposes an unspecified failure mode, a review reveals missing tests, or a new team member finds a module with no contract tests. Codex Automata defines a formal recovery protocol for these situations. diff --git a/ROADMAP.md b/ROADMAP.md index 59aabf0..d8079b9 100644 --- a/ROADMAP.md +++ b/ROADMAP.md @@ -6,6 +6,7 @@ This document sketches planned evolution of the Codex Automata methodology harne - **Version 0.1.0**: Baseline harness: specification doctrine, molds and casting metaphors across docs, bounded context guidance, agent task boundaries, interface contract discipline, Cursor integration (rules, skills, subagents, hooks), templates, examples, GitHub-oriented quality gate patterns. - **Recovery protocol**: Formal process for closing gaps in specification, tests, or coverage after code exists. Includes gap classification taxonomy, severity-based triage, recovery sequence (audit, spec patch, mold patch, recast, re-review), kanban integration, gap assessment template, recurrence prevention, and health metrics. Wired into manifesto, playbook, agent rules, Cursor rules, and glossary. +- **Product testing**: Agentic verification layer where AI agents operate the assembled application as real users. Agents receive user profiles (technical literacy, domain knowledge, constraints) and goal-oriented objectives, then navigate the product through its UI. Captures UX quality metrics (click count, navigation depth, backtracking, error encounters, time budgets). Includes product test template, user profile template, Phase 5b in the playbook, and glossary terms. ## v0.2.0 (goals) diff --git a/harness/.cursor/rules/codex-automata.mdc b/harness/.cursor/rules/codex-automata.mdc index 8229b34..cb88f11 100644 --- a/harness/.cursor/rules/codex-automata.mdc +++ b/harness/.cursor/rules/codex-automata.mdc @@ -25,8 +25,17 @@ This project follows the Codex Automata methodology: Documentation first, Tests - If spec or tests are missing, stop and report before proceeding. Recommend a gap assessment for triage. - During recovery tasks (closing gaps in existing code), derive specifications from domain knowledge and tests from specifications. Never derive specs or tests from the existing code. +## Product Testing + +When the project has user-facing features, define product tests alongside module-level molds. Product tests use AI agents operating the application as real users: + +- Define user profiles (`templates/user-profile-template.md`) representing the actual user base. +- Define test objectives as goals, not scripts: "As a first-time user, create an account and reach the dashboard." 
+- Set UX budgets (click limits, navigation depth, backtracking limits) grounded in product standards. +- Product tests run against the assembled application, not individual modules. + ## Terminology -Use these terms consistently: specification, mold, casting, bounded context, interface contract, quality gate, agent task, human review, flow, recovery, gap assessment, spec gap, mold gap, coverage erosion, contract gap. +Use these terms consistently: specification, mold, casting, bounded context, interface contract, quality gate, agent task, human review, flow, recovery, gap assessment, spec gap, mold gap, coverage erosion, contract gap, product test, user profile, test objective, UX budget. See `agent/AGENT_RULES.md` for the complete operating manual. diff --git a/harness/PLAYBOOK.md b/harness/PLAYBOOK.md index f0a9106..9a2427d 100644 --- a/harness/PLAYBOOK.md +++ b/harness/PLAYBOOK.md @@ -317,6 +317,55 @@ Verify that castings match the mold and that the mold matches the original inten --- +## Phase 5b: Product Testing + +**Purpose** + +Verify the assembled product by deploying AI agents that operate the application as real users would. Product testing catches experience defects that module-level tests cannot: unusable workflows, excessive friction, confusing navigation, poor error guidance, and accessibility failures. + +This phase runs after review confirms the code is correct and before deployment ships it to users. It answers the question: the code works, but does the product work for the people who use it? + +For the full product testing reference, see the [Product Testing](https://github.com/0xhackerfren/Codex-Automata/blob/main/reference/product-testing.md) document in the Codex Automata repository. + +**Inputs** + +- Approved code from Phase 5 (assembled, running in a staging or preview environment) +- User profile documents (use `templates/user-profile-template.md`) +- Product test scenarios (use `templates/product-test-template.md`) +- UX budgets from the specification (click budgets, navigation depth limits, time budgets) + +**Outputs** + +- Product test results with journey logs for each scenario +- UX metrics: click counts, backtracking rates, navigation depth, error encounters, time to completion +- Experience signals: hesitation points, dead ends, discovery paths, confusion indicators +- Defect reports for failed objectives or budget violations + +**Exit Criteria** + +- [ ] All critical journey objectives pass (agent completes the objective). +- [ ] All UX budgets are met (click count, navigation depth, time within thresholds). +- [ ] No critical errors (crashes, data loss, security failures) during any journey. +- [ ] Accessibility-constrained profiles complete all required objectives. +- [ ] Journey logs are stored for trend analysis. + +**Human Responsibilities** + +- Define user profiles that represent the actual user base. +- Set UX budgets grounded in product standards and competitive benchmarks. +- Interpret experience signals and decide whether friction patterns warrant specification changes. +- Product testing reveals experience defects, but humans decide which defects matter enough to fix before shipping. + +**Agent Responsibilities** + +- Operate the application through its user interface using browser tools, screen readers, or mobile emulators as the profile dictates. +- Follow the user profile's behavioral model (technical literacy, domain knowledge, behavioral tendencies). +- Do not use implementation knowledge. 
Navigate using only what is visible on screen and what the profile's user would reasonably know. +- Record every action, observation, hesitation, error, and recovery in a journey log. +- Report honestly when an objective cannot be completed, including where the agent got stuck and what it tried. + +--- + ## Phase 6: Deployment and Observation **Purpose** @@ -325,7 +374,7 @@ Ship verified code to production and confirm that production behavior matches th **Inputs** -- Approved code from Phase 5 +- Approved code from Phase 5b - Deployment configuration and infrastructure - Monitoring and alerting requirements from the specification @@ -450,5 +499,6 @@ Batch related gaps within a single module into one recovery card. Create separat | 3: Test Molding | Agent (human review) | No | `test-plan-template.md` | | 4: Code Casting | Agent | No | `agent-task-template.md` | | 5: Review | Human | **Yes** | `human-review-template.md` | +| 5b: Product Testing | Agent (human review) | No | `product-test-template.md` | | 6: Deployment | Human + Agent | No | N/A | | Recovery | Human + Agent | No | `gap-assessment-template.md` | diff --git a/harness/templates/product-test-template.md b/harness/templates/product-test-template.md new file mode 100644 index 0000000..b800866 --- /dev/null +++ b/harness/templates/product-test-template.md @@ -0,0 +1,95 @@ +# Product Test: [Scenario Name] + + + +## Metadata + +| Field | Value | +|-------|-------| +| Scenario name | [ ] | +| Version | [ ] | +| Date | [YYYY-MM-DD] | +| Author | [name or role] | +| Priority | Critical / High / Medium / Low [pick one] | + +## User Profile + +[Reference the user profile document this test uses.] + +- **Profile:** [name, e.g., "First-time visitor" or path to user profile document] +- **Technical literacy:** [high / moderate / low] +- **Domain knowledge:** [expert / familiar / novice] +- **Constraints:** [e.g., mobile-only, screen reader, slow network, non-native language, none] + +## Objective + +[State what the agent must accomplish as a goal, not a script. Write from the user's perspective.] + +**Goal:** [ ] + +**Example:** "As a first-time visitor, create an account, complete onboarding, and reach the main dashboard." + +## Preconditions + +[What state must the environment be in before this test runs?] + +- **Application state:** [clean database / seeded with test data / specific configuration] +- **User state:** [no existing account / logged in / specific role or permissions] +- **Environment:** [staging URL / local / production canary] +- **Data fixtures:** [describe any required test data] + +## Success Criteria + +[What defines successful completion? Be specific about the end state.] + +- [ ] [Observable condition that proves the objective was met, e.g., "Dashboard is displayed with the user's name"] +- [ ] [Additional conditions as needed] + +## UX Budgets + +[Quantitative thresholds for acceptable experience quality. Set based on product standards.] + +| Metric | Budget | Rationale | +|--------|--------|-----------| +| Click budget | [max interactions] | [why this number] | +| Navigation depth | [max page transitions] | [why this number] | +| Backtracking limit | [max backward navigations] | [expected: 0 for simple flows] | +| Error encounters | [max user-facing errors] | [expected: 0 for happy path] | +| Time budget | [max seconds, normalized] | [baseline or industry benchmark] | + +## Specification Traceability + +[Link this test to the specification sections it verifies.] 
+ +| Spec section | Behavior tested | +|-------------|-----------------| +| [section ID] | [which user-facing behavior] | +| [ ] | [ ] | + +## Failure Handling + +[What should happen if the test fails?] + +- **Objective not completed:** File as a product defect. The journey log identifies where the agent got stuck. +- **Budget exceeded:** Review the journey log for friction points. Determine whether the budget is too tight or the product needs improvement. +- **Critical error encountered:** File as a severity-1 defect. Product should not crash or lose data during normal use. + +## Journey Log Requirements + +[What should the agent record during execution?] + +- Every page or view visited +- Every interaction (click, type, scroll, navigate) +- Every error message encountered and how it was handled +- Points where the agent hesitated or re-read content +- Dead ends where the agent had to backtrack +- Time spent on each step +- Screenshots at key decision points (optional but recommended) + +## Notes + +[Additional context, known limitations, or special instructions for this scenario.] + +[ ] diff --git a/harness/templates/test-plan-template.md b/harness/templates/test-plan-template.md index 2e7c08a..ccba042 100644 --- a/harness/templates/test-plan-template.md +++ b/harness/templates/test-plan-template.md @@ -55,6 +55,15 @@ |-----------|-------------|--------|----------------|-------| | [ ] | [latency / throughput] | [load tool, benchmark] | [ ] | [ ] | +## Product Tests + +[Include when the specification defines user-facing behaviors. Product tests verify the assembled application by having agents operate it as real users. See `templates/product-test-template.md` and `templates/user-profile-template.md` for full templates.] + +| Objective | User profile | Spec reference | UX budget (clicks) | Priority | +|-----------|-------------|----------------|---------------------|----------| +| [goal-oriented statement] | [profile name] | [section] | [max clicks] | [critical / high / medium / low] | +| [ ] | [ ] | [ ] | [ ] | [ ] | + ## Test Environment Requirements - **Dependencies:** [services, databases, queues] diff --git a/harness/templates/user-profile-template.md b/harness/templates/user-profile-template.md new file mode 100644 index 0000000..36843ad --- /dev/null +++ b/harness/templates/user-profile-template.md @@ -0,0 +1,82 @@ +# User Profile: [Profile Name] + + + +## Metadata + +| Field | Value | +|-------|-------| +| Profile name | [e.g., "First-time visitor," "Power user," "Accessibility-dependent user"] | +| Version | [ ] | +| Date | [YYYY-MM-DD] | +| Author | [name or role] | + +## Identity + +- **Role:** [What role does this user play? e.g., end user, admin, team lead, external partner] +- **Context:** [What situation is this user in? e.g., evaluating the product for the first time, performing a daily task, handling an emergency] + +## Technical Literacy + +[How comfortable is this user with software?] + +- **Level:** High / Moderate / Low [pick one] +- **Behavioral implications:** + - High: scans for patterns, uses keyboard shortcuts, skips tutorials, expects conventional UI placement + - Moderate: follows visible cues, reads labels, uses help when stuck, tries obvious actions first + - Low: reads everything carefully, avoids unfamiliar interfaces, needs explicit guidance at each step + +## Domain Knowledge + +[How well does this user understand the problem domain?] 
+ +- **Level:** Expert / Familiar / Novice [pick one] +- **Behavioral implications:** + - Expert: understands terminology, anticipates workflows, uses domain-specific shortcuts + - Familiar: recognizes core concepts, may need help with advanced features + - Novice: needs plain-language explanations, may not understand domain jargon + +## Goals + +[What is this user generally trying to accomplish? Goals shape how the agent prioritizes actions.] + +1. [ ] +2. [ ] + +## Constraints + +[What limitations does this user operate under? Check all that apply and elaborate.] + +- [ ] **Device:** [desktop / mobile / tablet; screen size; input method] +- [ ] **Accessibility:** [screen reader / keyboard-only / high contrast / reduced motion / magnification] +- [ ] **Network:** [broadband / slow connection / intermittent connectivity] +- [ ] **Session time:** [unlimited / limited to N minutes / frequently interrupted] +- [ ] **Language:** [native speaker / non-native / specific language] +- [ ] **None:** This profile has no special constraints. + +## Behavioral Tendencies + +[How does this user approach software? These tendencies constrain the agent's navigation strategy.] + +- **Instructions:** [reads carefully / skims / skips entirely] +- **Exploration:** [follows the most obvious path / explores menus and options / uses search first] +- **Error response:** [reads error messages carefully / retries the same action / abandons the task / seeks help] +- **Patience:** [persistent, tries many approaches / gives up after 2-3 failures] +- **Trust level:** [trusts the application / suspicious of data collection / cautious with permissions] + +## Specification Traceability + +[Link this profile to the specification sections that define target users.] + +| Spec section | User class described | +|-------------|---------------------| +| [section ID] | [which user class this profile represents] | +| [ ] | [ ] | + +## Notes + +[Additional context about this profile. What makes this user type important to test?] + +[ ] diff --git a/reference/glossary.md b/reference/glossary.md index d5393c9..184d26b 100644 --- a/reference/glossary.md +++ b/reference/glossary.md @@ -38,6 +38,9 @@ The executable test suite and fixtures derived from a specification that define **Mold Gap** A gap class where behavior is documented in the specification but has no tests, or the tests are too weak to constrain the implementation meaningfully. Elaboration: the specification says what should happen, but no mold enforces it; recovery derives tests from the existing specification following standard test molding rules. +**Product Test** +A verification scenario where an AI agent operates the assembled application as a real user, given a user profile and a goal-oriented objective. Elaboration: measures both functional completion (did the agent accomplish the goal?) and experience quality (click count, backtracking, navigation depth, error encounters, time to completion); see `product-testing.md` in this directory. + **Quality Gate** An automated check in the CI/CD pipeline that enforces process discipline mechanically (build, test, policy, security, signing, compatibility, performance budgets as applicable). Elaboration: failures block progression by default; waivers require explicit human risk acceptance tied to records. 
@@ -53,8 +56,17 @@ The phase where humans produce or revise specifications and contracts; typically **Specification** A precise, testable document defining what a module or bounded context must do, including behaviors, edge cases, non-functional expectations, and failure modes. Elaboration: the primary human authored engineering artifact that downstream molds and reviews reference. +**Test Objective** +A goal-oriented statement of what a simulated user must accomplish during a product test, stated as an outcome rather than a script of steps. Elaboration: "create an account and reach the dashboard" is an objective; "click the signup button, fill in the email field" is a script; objectives test discoverability and usability, scripts test only a predetermined path. + **Test Molding** The phase where tests and fixtures are derived from specifications to build the mold. Elaboration: separates acceptance shape from implementation and enables mechanical enforcement through quality gates. +**User Profile** +A test fixture that defines who a simulated user is during a product test, constraining agent behavior to simulate a specific class of user. Elaboration: includes technical literacy, domain knowledge, goals, constraints (accessibility, device, network), and behavioral tendencies; derived from the specification's user-facing sections; uses `templates/user-profile-template.md` in the harness. + +**UX Budget** +A quantitative threshold for acceptable user experience during a product test, such as maximum clicks, navigation depth, backtracking count, error encounters, or time to completion. Elaboration: transforms subjective experience assessments into falsifiable metrics; set per objective and per profile; calibrated from baselines and product standards. + **WIP Limit** A cap on work in process at a station to prevent overload, shorten feedback loops, and expose bottlenecks. Elaboration: pulling work only when capacity exists pairs with explicit definitions of ready for each station. diff --git a/reference/product-testing.md b/reference/product-testing.md new file mode 100644 index 0000000..84aff49 --- /dev/null +++ b/reference/product-testing.md @@ -0,0 +1,210 @@ +# Product Testing + +This document defines how Codex Automata projects verify the assembled product by deploying AI agents that operate the application as real users would. Product testing is a verification layer that runs on the working product, not on individual modules. It tests what users experience, not what code does. + +For module-level molds (unit, integration, contract tests), see the test molding phase in `PLAYBOOK.md`. For the overall pipeline, see `workflow.md` in this directory. For terminology, see `glossary.md` in this directory. + +## Why Product Testing Exists + +The existing mold pipeline verifies that each module satisfies its specification. Unit tests confirm individual behavior. Integration tests confirm that modules compose. Contract tests confirm that boundaries hold. All of these run against code, not against the product a user actually touches. + +A system can pass every unit test, every integration test, and every contract test while still being unusable. The signup flow might require twelve clicks when three would suffice. The error message might be technically correct but incomprehensible to the target audience. The navigation might be logical to the developer who built it but opaque to a first-time user. The mobile layout might render correctly but be impossible to operate with a thumb. 
+ +These are not code defects. They are product defects. They live in the gap between "the code works" and "the product works for the people who use it." Traditional E2E testing addresses this partially by scripting user journeys: click this button, type this text, assert this element appears. But scripted E2E tests are brittle. They break when the UI changes. They test a predetermined path, not whether a user can actually find and follow that path. They measure "does the path work" but not "is the path discoverable, efficient, and humane." + +Agentic product testing closes this gap. Instead of scripting steps, you give an AI agent a user profile and an objective. The agent navigates the application the way a real user would: reading labels, clicking buttons, filling forms, making mistakes, recovering from errors. If the agent cannot accomplish the objective, the product has a usability defect. If the agent accomplishes it but requires excessive effort, the product has a friction defect. The agent's journey becomes measurable evidence of product quality. + +This is a capability that exists only in the agentic era. No previous testing methodology could instruct a test to "figure out how to sign up" and measure whether the experience was any good. Codex Automata treats it as a first-class verification layer. + +## Core Concepts + +### User Profiles + +A user profile defines who the simulated user is. It is not a persona document for marketing. It is a test fixture that constrains agent behavior to simulate a specific class of user. + +A user profile includes: + +- **Name and role.** A label for reference (e.g., "First-time visitor," "Power user," "Accessibility-dependent user"). +- **Technical literacy.** How comfortable is this user with software? Do they scan for patterns or read every label? Do they know keyboard shortcuts or rely entirely on visible UI? +- **Domain knowledge.** How much does this user understand about the problem domain? A financial analyst using a trading platform has different expectations than a retail investor. +- **Goals.** What is this user trying to accomplish in general terms? Goals shape how the agent prioritizes actions and evaluates progress. +- **Constraints.** What limitations does this user operate under? Screen reader dependency, mobile-only access, slow network, limited session time, non-native language. +- **Behavioral tendencies.** Does this user read instructions or skip them? Do they explore or follow the most obvious path? Do they recover gracefully from errors or abandon the task? + +Profiles are derived from the specification's user-facing sections. If the specification defines target users, each user class should have a corresponding test profile. Use `templates/user-profile-template.md` in the harness. + +### Test Objectives + +A test objective defines what the agent must accomplish, stated as a goal, not a script. The distinction is fundamental. + +**Scripted (traditional E2E):** Navigate to /signup. Fill in email field. Fill in password field. Click "Create Account." Assert redirect to /dashboard. + +**Objective-based (product test):** As a first-time visitor, create an account and reach the main dashboard. + +The scripted version tests one hardcoded path. The objective version tests whether the product enables a user to accomplish the goal by whatever path they discover. If the signup button is buried under three menus and the agent cannot find it, that is a product defect even though the signup endpoint works perfectly. 
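+
+To make the contrast concrete, here is a minimal sketch of how an objective-based scenario might be encoded and checked. All names here (`UserProfile`, `ProductTest`, `JourneyLog`, `evaluate`, the spec section ID) are hypothetical illustrations, not a harness API; the budgets shown are the UX budgets defined in the next subsection, and in practice the journey log is produced by the agent run itself:
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class UserProfile:
+    """Test fixture constraining agent behavior (see the user profile template)."""
+    name: str
+    technical_literacy: str  # "high" / "moderate" / "low"
+    domain_knowledge: str    # "expert" / "familiar" / "novice"
+    constraints: list[str] = field(default_factory=list)
+
+
+@dataclass
+class ProductTest:
+    """A goal-oriented scenario: an objective plus budgets, never a script of steps."""
+    profile: UserProfile
+    objective: str            # the outcome the agent must reach
+    spec_sections: list[str]  # specification traceability
+    budgets: dict[str, int]   # UX budgets: metric name -> maximum allowed
+
+
+@dataclass
+class JourneyLog:
+    """Metrics summarized from the agent's recorded journey."""
+    completed: bool
+    clicks: int = 0
+    navigation_depth: int = 0
+    backtracks: int = 0
+    errors: int = 0
+
+
+def evaluate(test: ProductTest, log: JourneyLog) -> list[str]:
+    """Return gate failures: abandonment first, then any exceeded UX budget."""
+    failures = []
+    if not log.completed:
+        failures.append("objective not completed (abandonment)")
+    observed = {
+        "clicks": log.clicks,
+        "navigation_depth": log.navigation_depth,
+        "backtracks": log.backtracks,
+        "errors": log.errors,
+    }
+    for metric, budget in test.budgets.items():
+        if observed.get(metric, 0) > budget:
+            failures.append(f"{metric}: {observed[metric]} exceeds budget {budget}")
+    return failures
+
+
+signup = ProductTest(
+    profile=UserProfile("First-time visitor", "moderate", "novice"),
+    objective="As a first-time visitor, create an account and reach the dashboard.",
+    spec_sections=["onboarding-signup"],  # hypothetical section ID
+    budgets={"clicks": 8, "navigation_depth": 4, "backtracks": 0, "errors": 0},
+)
+
+# The log below is hand-written for illustration; a real one is recorded by the agent.
+print(evaluate(signup, JourneyLog(completed=True, clicks=11, navigation_depth=3)))
+# -> ['clicks: 11 exceeds budget 8']
+```
+
+Note that no path appears anywhere in the scenario: it carries only the persona, the goal, and the thresholds, which is what leaves the agent free to discover, or fail to discover, the path on its own.
+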
+ +Objectives trace to specification sections the same way unit test cases do. Every user-facing behavior in the specification should have a product test objective that verifies a real user can actually exercise that behavior. + +### UX Budgets + +A UX budget defines quantitative thresholds for acceptable user experience during a product test. Budgets transform subjective "this feels clunky" assessments into falsifiable metrics. + +- **Click budget.** Maximum number of interactions (clicks, taps, keystrokes) to accomplish the objective. If the agent requires more, the journey has excessive friction. +- **Navigation depth.** Maximum number of page transitions or view changes. Deep navigation suggests poor information architecture. +- **Backtracking count.** Number of times the agent navigates backward or returns to a previous state. Backtracking indicates confusion, misleading labels, or dead ends. +- **Error encounters.** Number of user-facing error states triggered during the journey. Frequent errors suggest poor input guidance or unclear form design. +- **Time budget.** Wall-clock time for the agent to complete the objective (normalized for agent execution speed). Exceptionally long journeys indicate friction even when click counts are acceptable. +- **Abandonment.** Whether the agent gives up before completing the objective. Abandonment is a critical failure. + +Budgets are defined per objective and per profile. A power user completing a routine task should have tighter budgets than a first-time user performing initial setup. Budgets are set by humans during specification, informed by domain norms and product standards. + +### Experience Signals + +Beyond pass/fail and budget compliance, product tests capture qualitative signals that inform product decisions: + +- **Path taken.** The sequence of pages, actions, and decisions the agent made. Compare across profiles to identify whether different user types navigate differently. +- **Hesitation points.** Where did the agent pause, re-read, or explore before acting? These indicate UI elements that are not self-evident. +- **Dead ends.** Pages or states where the agent could not find a way forward and had to backtrack. Dead ends are navigation defects. +- **Error recovery.** When the agent triggered an error, how did it recover? Did the error message guide it toward the correct action, or did it have to experiment? +- **Discovery path.** How did the agent find the feature it needed? Through navigation menus, search, help documentation, or trial and error? Discovery paths reveal information architecture quality. + +These signals are not pass/fail gates. They are diagnostic data that feeds back into specification and design decisions. + +## How Product Testing Fits the Pipeline + +Product testing runs after the code is assembled and before (or during) deployment verification. It occupies the space between "the code works" and "the product ships." + +```text +Code Casting --> Review --> Product Testing --> Deployment and Observation + | + v + Feedback to Spec / UX +``` + +### Relationship to Existing Phases + +**Phase 3 (Test Molding)** produces module-level molds: unit, integration, contract tests. These run against code. Product test objectives are defined alongside module molds in the test plan but are not executable until the product is assembled. + +**Phase 5 (Review)** verifies that castings match the specification. Product testing extends this verification to the assembled product's user experience. 
+ +**Phase 6 (Deployment and Observation)** includes smoke tests and monitoring. Product tests can serve as pre-deployment smoke tests (does the critical user journey still work?) and post-deployment verification (does the journey work in production?). + +Product testing does not replace any existing phase. It adds a verification layer that no existing phase covers: the experience of operating the assembled product as a user. + +### When Product Tests Run + +**Pre-merge (CI gate).** Critical journey objectives (signup, core workflow, payment) run as quality gates on integration branches. These are the product equivalent of unit test gates. + +**Staging environment.** Full product test suites run against staging after deployment. This catches experience regressions that module-level tests cannot detect. + +**Production (canary).** A subset of product tests run against production canaries during progressive rollout. If a journey breaks in production, the canary fails before full rollout. + +**Scheduled (regression).** Product test suites run on a schedule against the live product to detect experience degradation from data growth, configuration drift, or third-party changes. + +## Defining Product Tests + +Product tests are defined in a product test template (`templates/product-test-template.md` in the harness). Each test specifies: + +1. **The user profile** that the agent will adopt (reference to a user profile document). +2. **The objective** the agent must accomplish, stated as a goal. +3. **Preconditions** for the test (clean account, seeded data, specific environment state). +4. **Success criteria** that define completion (what state must the product be in when the objective is met?). +5. **UX budgets** for the journey (click budget, navigation depth, backtracking, time). +6. **Specification traceability** linking the objective to spec section(s). + +### Writing Good Objectives + +**Good objectives** are goal-oriented and user-centric: + +- "As a new user, create an account and see a personalized dashboard." +- "As a returning user, find and re-order a previous purchase." +- "As an admin, invite a new team member and verify they can access the shared workspace." +- "As a screen-reader user, navigate from the homepage to the pricing page and identify the enterprise tier." + +**Bad objectives** are implementation-aware or scripted: + +- "Click the signup button in the top-right corner." (Tests a specific UI location, not a user goal.) +- "Navigate to /api/users and POST a new user." (Tests an API, not a user experience.) +- "Fill in the form fields in order." (Prescribes a path instead of an outcome.) + +### Setting UX Budgets + +Budgets should be grounded in product standards and competitive benchmarks, not arbitrary numbers. Consider: + +- Industry norms for the task type (e.g., e-commerce checkout averages 4-6 steps). +- Accessibility standards (WCAG requires that all functionality be operable through a keyboard interface). +- Your own product's stated experience goals from the specification. +- Progressive tightening: start with generous budgets, measure baselines, then tighten as the product matures. + +A budget that no real user could meet is useless. A budget so generous it never fails is also useless. Calibrate using initial agent runs as baseline data, then set budgets as quality commitments. + +## Agent Behavior During Product Tests + +The agent operating a product test is not the same as the agent writing code. 
It operates under a different set of constraints: + +- **The agent uses the product through its user interface.** Browser tools, screen readers, mobile emulators, or API clients depending on the profile. It does not access internal code, databases, or admin tools unless the profile explicitly grants that access. +- **The agent follows the user profile's behavioral model.** A "cautious first-time user" reads labels and explores tentatively. A "power user" looks for shortcuts and keyboard navigation. The profile constrains the agent's strategy. +- **The agent does not use knowledge of the implementation.** It knows what the user profile would know: what is visible on screen, what the product's help documentation says, what a reasonable person in that role would expect. It does not know internal routes, hidden features, or implementation details. +- **The agent records its journey.** Every action, observation, hesitation, error, and recovery is logged. The journey log is the test artifact, equivalent to a test report for module-level molds. +- **The agent reports honestly.** If it cannot accomplish the objective, it reports exactly where it got stuck, what it tried, and why it believes the product blocked it. This report is diagnostic data for the specification and design team. + +## Metrics and Quality Gates + +### Gate Metrics (Pass/Fail) + +- **Objective completion.** Did the agent accomplish the stated goal? Binary pass/fail. +- **Budget compliance.** Did the journey stay within all UX budgets? A budget violation is a gate failure. +- **Critical error absence.** Did the agent encounter any critical errors (crashes, data loss, security failures) during the journey? + +### Diagnostic Metrics (Informational) + +- **Click count.** Total interactions to complete the objective. +- **Navigation depth.** Page transitions or view changes. +- **Backtracking rate.** Backward navigations as a fraction of total navigations. +- **Time to completion.** Normalized wall-clock time. +- **Error encounter rate.** User-facing errors per journey. +- **Confusion index.** A composite of hesitation points, re-reads, and exploratory actions that did not advance the objective. Higher values indicate less intuitive UX. +- **Path variance.** When multiple profiles complete the same objective, how different are their paths? High variance may indicate that the product's navigation model is inconsistent. + +### Tracking Over Time + +Product test metrics tracked over releases reveal experience trends: + +- **Friction creep.** Click counts or navigation depth increasing over time for the same objective. +- **Accessibility regression.** Constrained profiles (screen reader, keyboard-only) failing objectives that unconstrained profiles complete. +- **Onboarding degradation.** First-time user profiles taking longer or backtracking more as features accumulate. + +These trends feed back to specification and architecture. If onboarding friction is climbing, the specification should address it. If accessibility is regressing, the molds should encode accessibility requirements more sharply. + +## Relationship to Other Test Types + +Product testing does not replace existing test types. 
It covers a distinct layer: + +| Test type | What it verifies | Runs against | When | +|-----------|------------------|-------------|------| +| Unit tests | Individual module behavior | Code | Before casting (red state) and after | +| Integration tests | Module composition | Assembled modules | After casting | +| Contract tests | Boundary compliance | Interface boundaries | After casting, in CI | +| Product tests | User experience quality | Running application | After assembly, in staging/production | + +Product tests depend on all other test types passing first. If unit tests fail, there is no product to test. Product tests are the outermost ring of verification. + +## Feedback to Specification + +Product test results are a primary input to specification revision. When a product test reveals a UX defect: + +1. **Diagnose.** Is the defect in the specification (the spec said to build it this way), the implementation (the code does not match the spec), or the design (the spec needs to change)? +2. **Route.** Specification defects go back to Phase 2. Implementation defects go back to Phase 4. Design defects require human judgment about whether and how to revise the specification. +3. **Update budgets.** If a journey consistently uses fewer interactions than its budget, tighten the budget to lock in the improvement. If a justified design change increases interactions, update the budget with rationale. + +Product test feedback is the primary mechanism by which user experience quality enters the specification-first pipeline. Without it, specifications define what the system does but not whether it is good to use. + +## Companion Documents + +- `workflow.md` in this directory for end-to-end pipeline context. +- `principles.md` in this directory for foundational principles product testing enforces. +- `recovery.md` in this directory for handling gaps discovered through product testing. +- `kanban.md` in this directory for flow and WIP management. +- `PLAYBOOK.md` in the harness for phase-by-phase execution guidance. +- `templates/product-test-template.md` in the harness for defining product test scenarios. +- `templates/user-profile-template.md` in the harness for defining test user profiles. diff --git a/reference/workflow.md b/reference/workflow.md index 5d4eb99..08c1200 100644 --- a/reference/workflow.md +++ b/reference/workflow.md @@ -46,13 +46,27 @@ Unresolved mismatches escalate to revised specifications or contracts before wea Outputs: approvals, rework instructions referencing spec clauses, backlog follow-ups for ambiguity removal. -## 7. Integration and CI/CD Gates +## 7. Product Testing -Merged changes proceed through centralized quality gates that bundle mold execution, static analysis, supply chain and license checks, signing and provenance where required, contract tests across bounded contexts, integration smoke suites, performance and reliability budgets tied to SLOs, schema and API compatibility checks for migrations, progressive delivery controls (feature flags, canaries, rollbacks), and operational readiness checks (alert routes, runbooks, synthetic monitors) appropriate to the system class. +After human review approves the casting and before deployment, AI agents verify the assembled product by operating it as real users. Each agent receives a user profile (technical literacy, domain knowledge, constraints, behavioral tendencies) and a goal-oriented objective ("as a first-time user, create an account and reach the dashboard"). 
The agent navigates the application through its user interface, recording every interaction, hesitation, error, and recovery. + +Product tests measure both functional completion (did the agent accomplish the goal?) and experience quality: click count, navigation depth, backtracking rate, error encounters, time to completion, and a confusion index derived from hesitation and exploratory actions. UX budgets set quantitative thresholds for each metric. Budget violations are quality gate failures. + +Product testing catches defects that module-level molds cannot: unusable workflows, excessive friction, confusing navigation, poor error guidance, and accessibility regression. Agents operating under constrained profiles (screen reader, keyboard-only, mobile) verify that the product works for all target user classes. + +Product test results feed back to specification when they reveal experience defects. If the specification directed a design that produces poor UX metrics, the specification is revised. If the casting deviates from a good specification, the casting is recast. + +Outputs: journey logs, UX metrics per scenario, pass/fail status per objective and budget, experience signal reports, defect filings for failed or over-budget journeys. + +For the full product testing reference, see `product-testing.md` in this directory. + +## 8. Integration and CI/CD Gates + +Merged changes proceed through centralized quality gates that bundle mold execution, static analysis, supply chain and license checks, signing and provenance where required, contract tests across bounded contexts, integration smoke suites, product test objective suites, performance and reliability budgets tied to SLOs, schema and API compatibility checks for migrations, progressive delivery controls (feature flags, canaries, rollbacks), and operational readiness checks (alert routes, runbooks, synthetic monitors) appropriate to the system class. Outputs: immutable artifacts destined for deployment with provenance appropriate to organizational policy. -## 8. Deployment and Observation +## 9. Deployment and Observation Deploy using agreed progressive strategies. Observability verifies golden signals correlated with intake metrics. Incident learning updates specifications rather than patching undocumented behavior indefinitely. @@ -82,7 +96,10 @@ Human Review ---------------------------+ | | (approve) v -CI/CD Quality Gates (molds + policy) +Product Testing (agents as users) + | + v +CI/CD Quality Gates (molds + policy + product tests) | v Deploy / Observe ----------------------> feedback to Spec / Contracts / Architecture (ADRs) @@ -130,6 +147,7 @@ Operational tracking mirrors stations: | Test Molding | Derive molds and fixtures from specs | Agent with human adjudication | | Code Casting | Implement to satisfy molds within contracts | Agent | | Human Review | Confirm casting matches specification and systemic risk posture | Human | +| Product Testing | Verify assembled product by agents operating as users | Agent with human interpretation | | Deployment | Progressive release and verification | Humans with automation gates | Treat integration testing and CI/CD quality gates either as implicit sub stages within molding and pre-deployment review lanes or explicit columns when WIP instrumentation demands finer grain. Naming remains consistent across boards: Specification, molds, casting, review, gates, rollout, observation tie back to the same vocabulary referenced in dashboards and retrospective notes. 
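+
+Sections 7 and 8 treat product test results as gate inputs: each journey either completes its objective within budget or fails the gate. A minimal sketch of that aggregation step, under the same illustrative conventions as the sketch in `product-testing.md` (`ScenarioResult` and `gate` are hypothetical names, not a harness API):
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class ScenarioResult:
+    """One product test journey, summarized for the CI/CD gate."""
+    scenario: str
+    completed: bool          # did the agent accomplish the objective?
+    budget_violations: int   # UX budget thresholds exceeded
+    critical_errors: int     # crashes, data loss, security failures
+
+
+def gate(results: list[ScenarioResult]) -> bool:
+    """Pass only if every journey completes, within budget, with no critical errors."""
+    return all(
+        r.completed and r.budget_violations == 0 and r.critical_errors == 0
+        for r in results
+    )
+
+
+results = [
+    ScenarioResult("signup-first-time-visitor", True, 0, 0),
+    ScenarioResult("reorder-returning-user", True, 1, 0),  # one UX budget exceeded
+]
+print(gate(results))  # False: the re-order journey went over budget
+```
+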
Consult `PLAYBOOK.md` (in the harness) when expanding any station into granular work instructions.

## Companion Documents

-For the recovery protocol covering retroactive gap discovery, classification, triage, and remediation, consult `recovery.md` in this directory. For pull policies, limits, bottleneck interpretation, Toyota Production System parallels, metric guidance, consult `kanban.md` in this directory. For vocabulary alignment across teams, consult `glossary.md` in this directory.
+For the recovery protocol covering retroactive gap discovery, classification, triage, and remediation, consult `recovery.md` in this directory. For agentic product testing with user profiles, test objectives, and UX budgets, consult `product-testing.md` in this directory. For pull policies, WIP limits, bottleneck interpretation, Toyota Production System parallels, and metric guidance, consult `kanban.md` in this directory. For vocabulary alignment across teams, consult `glossary.md` in this directory.