8 changes: 8 additions & 0 deletions CHANGELOG.md
@@ -19,6 +19,14 @@ Types of changes: **Added**, **Changed**, **Deprecated**, **Removed**, **Fixed**
- Retroactive Gap Discovery section added to the feedback loops in `reference/workflow.md`.
- Recovery terms added to `reference/glossary.md`: contract gap, coverage erosion, gap assessment, mold gap, recovery, spec gap.
- Recovery constraint and terminology added to `harness/.cursor/rules/codex-automata.mdc`.
- Agentic product testing (`reference/product-testing.md`): verification layer where AI agents operate the assembled application as real users with defined profiles and goal-oriented objectives. Measures UX quality through click counts, navigation depth, backtracking, error encounters, and time budgets.
- Product test template (`harness/templates/product-test-template.md`): structured template for defining product test scenarios with objectives, UX budgets, and specification traceability.
- User profile template (`harness/templates/user-profile-template.md`): test fixture template for defining simulated user personas with technical literacy, domain knowledge, constraints, and behavioral tendencies.
- Phase 5b (Product Testing) added to `harness/PLAYBOOK.md` between Review and Deployment.
- Product test section added to `harness/templates/test-plan-template.md`.
- Product testing terms added to `reference/glossary.md`: product test, test objective, user profile, UX budget.
- Product testing section added to `harness/.cursor/rules/codex-automata.mdc`.
- Product testing section added to `MANIFESTO.md` Section VI (The Machine).

## [0.1.0] - 2026-05-11

10 changes: 9 additions & 1 deletion MANIFESTO.md
@@ -150,7 +150,15 @@ An agent reads each spec and builds the mold. Unit tests, integration tests, contract tests

Then the cast. Five modules. Five agents. Each receives its spec, its tests, and the interface contracts of its neighbors. They work in parallel, independently, silently. They do not communicate because they do not need to. The specs and contracts contain everything. One finishes in minutes. Another takes an hour. They are not synchronized. When each agent's tests pass, the module is done.

Automated gates fire. An agent reviews against the spec. A human reviews last, checking for coherence. Does the system still make sense as a whole? The pipeline runs. The code deploys. Monitoring confirms that production matches the spec. If it does not, the cycle restarts at the mold.
Automated gates fire. An agent reviews against the spec. A human reviews last, checking for coherence. Does the system still make sense as a whole?

But one verification layer remains. The modules pass their tests. The contracts hold. The code is correct. Is the product any good? A system can satisfy every unit test and still be unusable. The signup flow requires twelve clicks when three would suffice. The error message is technically accurate but incomprehensible. The navigation makes sense to the engineer who built it but bewilders the customer who needs it.

Traditional testing cannot catch this. Scripted end-to-end tests verify a predetermined path: click this button, assert this text. They confirm the path works, not that a user can find it. Agentic product testing inverts the approach. Give an agent a user profile and an objective. "As a first-time visitor, create an account and reach the dashboard." The agent navigates the application the way a real user would, reading labels, clicking buttons, making mistakes, recovering from errors. If it cannot accomplish the objective, the product has a usability defect. If it takes twenty clicks when the budget is eight, the product has a friction defect. The agent's journey is measurable evidence of experience quality.

This is a capability that exists only in the agentic era. No previous testing methodology could instruct a test to "figure out how to accomplish this goal" and then measure whether the experience was efficient, discoverable, and humane. Click counts, backtracking rates, dead ends, hesitation points, error recovery paths: these become quantitative signals derived from agents operating the product exactly as users would. The specification defines what the product must do. Product testing verifies that real people can actually do it.

The pipeline runs. The code deploys. Monitoring confirms that production matches the spec. If it does not, the cycle restarts at the mold.

Remove any piece and the machine breaks. Without specs, agents hallucinate, producing plausible code aimed at the wrong problem. Without tests, infinite variety with no definition of correct. Without modularity, agents collide in merge conflicts, contradictions, and chaos. Without flow, work pools in the wrong places. Without CI/CD, the whole arrangement depends on discipline that humans will eventually forget and agents will never have.

12 changes: 12 additions & 0 deletions README.md
@@ -117,6 +117,7 @@ codex-automata/
| |-- kanban.md Flow-based project management
| |-- agent-operating-model.md How agents operate
| |-- recovery.md Recovery protocol for closing gaps
| |-- product-testing.md Agentic product testing reference
| |-- glossary.md Terminology reference
|
|-- examples/ WORKED EXAMPLES (read, don't copy)
@@ -144,6 +145,17 @@ Specification --> Tests --> Code

If the casting is defective, fix the mold. If the mold is wrong, fix the specification. Do not debug the implementation directly.

## Product Testing

After the code is assembled, AI agents verify the product by operating it as real users. Each agent receives a user profile (technical literacy, domain knowledge, constraints) and a goal-oriented objective ("as a first-time user, create an account and reach the dashboard"). The agent navigates the application through the UI, and the journey produces measurable signals:

- **Click count and navigation depth** measure friction.
- **Backtracking and dead ends** measure discoverability.
- **Error encounters** measure input guidance quality.
- **Abandonment** flags critical usability failures.

UX budgets set quantitative thresholds for each metric. Product tests run as quality gates in CI, staging, and production canaries. See `reference/product-testing.md` for the full reference.
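
For illustration, a CI gate could evaluate a journey against its UX budgets along these lines. This is a minimal sketch, not part of the harness: the `JourneySummary` shape, field names, and default thresholds are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class JourneySummary:
    """Aggregate metrics a product-test agent reports for one journey (illustrative)."""
    completed: bool   # did the agent accomplish the objective?
    clicks: int       # total interactions
    depth: int        # deepest navigation level reached
    backtracks: int   # backward navigations (dead ends, discovery failures)
    errors: int       # user-facing errors encountered

@dataclass
class UXBudget:
    max_clicks: int
    max_depth: int
    max_backtracks: int = 0  # simple flows should need no backtracking
    max_errors: int = 0      # happy paths should surface no errors

def gate(journey: JourneySummary, budget: UXBudget) -> list[str]:
    """Return budget violations; an empty list means the quality gate passes."""
    violations = []
    if not journey.completed:
        violations.append("objective not completed: usability defect")
    for metric, limit in [
        ("clicks", budget.max_clicks),
        ("depth", budget.max_depth),
        ("backtracks", budget.max_backtracks),
        ("errors", budget.max_errors),
    ]:
        value = getattr(journey, metric)
        if value > limit:
            violations.append(f"{metric} budget exceeded: {value} > {limit}")
    return violations
```

A journey that fails the gate blocks the pipeline the same way a failing unit test would.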

## When Gaps Are Discovered

Real projects discover gaps after code exists: a production incident exposes an unspecified failure mode, a review reveals missing tests, or a new team member finds a module with no contract tests. Codex Automata defines a formal recovery protocol for these situations.
1 change: 1 addition & 0 deletions ROADMAP.md
@@ -6,6 +6,7 @@ This document sketches planned evolution of the Codex Automata methodology harness.

- **Version 0.1.0**: Baseline harness: specification doctrine, molds and casting metaphors across docs, bounded context guidance, agent task boundaries, interface contract discipline, Cursor integration (rules, skills, subagents, hooks), templates, examples, GitHub-oriented quality gate patterns.
- **Recovery protocol**: Formal process for closing gaps in specification, tests, or coverage after code exists. Includes gap classification taxonomy, severity-based triage, recovery sequence (audit, spec patch, mold patch, recast, re-review), kanban integration, gap assessment template, recurrence prevention, and health metrics. Wired into manifesto, playbook, agent rules, Cursor rules, and glossary.
- **Product testing**: Agentic verification layer where AI agents operate the assembled application as real users. Agents receive user profiles (technical literacy, domain knowledge, constraints) and goal-oriented objectives, then navigate the product through its UI. Captures UX quality metrics (click count, navigation depth, backtracking, error encounters, time budgets). Includes product test template, user profile template, Phase 5b in the playbook, and glossary terms.

## v0.2.0 (goals)

11 changes: 10 additions & 1 deletion harness/.cursor/rules/codex-automata.mdc
@@ -25,8 +25,17 @@ This project follows the Codex Automata methodology: Documentation first, Tests
- If spec or tests are missing, stop and report before proceeding. Recommend a gap assessment for triage.
- During recovery tasks (closing gaps in existing code), derive specifications from domain knowledge and tests from specifications. Never derive specs or tests from the existing code.

## Product Testing

When the project has user-facing features, define product tests alongside module-level molds. Product tests use AI agents operating the application as real users:

- Define user profiles (`templates/user-profile-template.md`) representing the actual user base.
- Define test objectives as goals, not scripts: "As a first-time user, create an account and reach the dashboard."
- Set UX budgets (click limits, navigation depth, backtracking limits) grounded in product standards.
- Product tests run against the assembled application, not individual modules.

## Terminology

Use these terms consistently: specification, mold, casting, bounded context, interface contract, quality gate, agent task, human review, flow, recovery, gap assessment, spec gap, mold gap, coverage erosion, contract gap.
Use these terms consistently: specification, mold, casting, bounded context, interface contract, quality gate, agent task, human review, flow, recovery, gap assessment, spec gap, mold gap, coverage erosion, contract gap, product test, user profile, test objective, UX budget.

See `agent/AGENT_RULES.md` for the complete operating manual.
52 changes: 51 additions & 1 deletion harness/PLAYBOOK.md
@@ -317,6 +317,55 @@ Verify that castings match the mold and that the mold matches the original intent.

---

## Phase 5b: Product Testing

**Purpose**

Verify the assembled product by deploying AI agents that operate the application as real users would. Product testing catches experience defects that module-level tests cannot: unusable workflows, excessive friction, confusing navigation, poor error guidance, and accessibility failures.

This phase runs after review confirms the code is correct and before deployment ships it to users. It answers the question: the code works, but does the product work for the people who use it?

For the full product testing reference, see the [Product Testing](https://github.com/0xhackerfren/Codex-Automata/blob/main/reference/product-testing.md) document in the Codex Automata repository.

**Inputs**

- Approved code from Phase 5 (assembled, running in a staging or preview environment)
- User profile documents (use `templates/user-profile-template.md`)
- Product test scenarios (use `templates/product-test-template.md`)
- UX budgets from the specification (click budgets, navigation depth limits, time budgets)

**Outputs**

- Product test results with journey logs for each scenario
- UX metrics: click counts, backtracking rates, navigation depth, error encounters, time to completion
- Experience signals: hesitation points, dead ends, discovery paths, confusion indicators
- Defect reports for failed objectives or budget violations
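
A journey log could be captured with a structure along these lines. This is a sketch only: the methodology does not prescribe a log format, and every field name here is an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class JourneyStep:
    action: str                # e.g. "click", "type", "navigate_back", "scroll"
    target: str                # on-screen label or URL, never an internal ID
    outcome: str               # what the agent observed after acting
    hesitation_ms: int = 0     # time spent deciding before acting
    error: str | None = None   # user-facing error text, if any

@dataclass
class JourneyLog:
    scenario: str
    profile: str
    abandoned: bool = False    # set when the agent gives up on the objective
    steps: list[JourneyStep] = field(default_factory=list)

    @property
    def click_count(self) -> int:
        return sum(1 for s in self.steps if s.action == "click")

    @property
    def backtracks(self) -> int:
        # each backward navigation marks a dead end or a discovery failure
        return sum(1 for s in self.steps if s.action == "navigate_back")
```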

**Exit Criteria**

- [ ] All critical journey objectives pass (agent completes the objective).
- [ ] All UX budgets are met (click count, navigation depth, time within thresholds).
- [ ] No critical errors (crashes, data loss, security failures) during any journey.
- [ ] Accessibility-constrained profiles complete all required objectives.
- [ ] Journey logs are stored for trend analysis.

**Human Responsibilities**

- Define user profiles that represent the actual user base.
- Set UX budgets grounded in product standards and competitive benchmarks.
- Interpret experience signals and decide whether friction patterns warrant specification changes.
- Decide which defects matter enough to fix before shipping. Product testing reveals experience defects; the judgment call is human.

**Agent Responsibilities**

- Operate the application through its user interface using browser tools, screen readers, or mobile emulators as the profile dictates.
- Follow the user profile's behavioral model (technical literacy, domain knowledge, behavioral tendencies).
- Do not use implementation knowledge. Navigate using only what is visible on screen and what the profile's user would reasonably know.
- Record every action, observation, hesitation, error, and recovery in a journey log.
- Report honestly when an objective cannot be completed, including where the agent got stuck and what it tried.
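
A runner that honors these responsibilities might look like the following sketch, reusing the `JourneyLog` sketch from the Outputs section above. The `BrowserAgent` interface and the `scenario` attributes are hypothetical stand-ins, not a real library; substitute whatever browser-automation or computer-use tooling the project adopts.

```python
def run_product_test(scenario, profile, app_url, browser_agent_cls):
    """Drive the app as the profiled user and record the journey (illustrative).

    browser_agent_cls is a hypothetical interface: it is assumed to expose
    open(), screen(), act_toward(), and attempt_recovery() as used below.
    """
    agent = browser_agent_cls(persona=profile)  # behavioral model from the user profile
    log = JourneyLog(scenario=scenario.name, profile=profile.name)
    agent.open(app_url)
    while not scenario.objective_met(agent.screen()):
        # Choose the next action from what is visible on screen only,
        # never from implementation knowledge.
        step = agent.act_toward(scenario.objective)
        log.steps.append(step)
        if step.error:
            agent.attempt_recovery()  # recover the way a real user would
        if len(log.steps) >= scenario.step_limit:
            log.abandoned = True      # report honestly where the agent got stuck
            break
    return log
```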

---

## Phase 6: Deployment and Observation

**Purpose**
@@ -325,7 +374,7 @@ Ship verified code to production and confirm that production behavior matches the specification.

**Inputs**

- Approved code from Phase 5
- Approved code from Phase 5b
- Deployment configuration and infrastructure
- Monitoring and alerting requirements from the specification

@@ -450,5 +499,6 @@ Batch related gaps within a single module into one recovery card. Create separate
| 3: Test Molding | Agent (human review) | No | `test-plan-template.md` |
| 4: Code Casting | Agent | No | `agent-task-template.md` |
| 5: Review | Human | **Yes** | `human-review-template.md` |
| 5b: Product Testing | Agent (human review) | No | `product-test-template.md` |
| 6: Deployment | Human + Agent | No | N/A |
| Recovery | Human + Agent | No | `gap-assessment-template.md` |
95 changes: 95 additions & 0 deletions harness/templates/product-test-template.md
@@ -0,0 +1,95 @@
# Product Test: [Scenario Name]

<!--
[Define a product test scenario where an AI agent operates the application as a real user. The agent receives a user profile and an objective, then navigates the product to accomplish the goal. For the full product testing reference, see https://github.com/0xhackerfren/Codex-Automata/blob/main/reference/product-testing.md]
-->

## Metadata

| Field | Value |
|-------|-------|
| Scenario name | [ ] |
| Version | [ ] |
| Date | [YYYY-MM-DD] |
| Author | [name or role] |
| Priority | Critical / High / Medium / Low [pick one] |

## User Profile

[Reference the user profile document this test uses.]

- **Profile:** [name, e.g., "First-time visitor" or path to user profile document]
- **Technical literacy:** [high / moderate / low]
- **Domain knowledge:** [expert / familiar / novice]
- **Constraints:** [e.g., mobile-only, screen reader, slow network, non-native language, none]

## Objective

[State what the agent must accomplish as a goal, not a script. Write from the user's perspective.]

**Goal:** [ ]

**Example:** "As a first-time visitor, create an account, complete onboarding, and reach the main dashboard."

## Preconditions

[What state must the environment be in before this test runs?]

- **Application state:** [clean database / seeded with test data / specific configuration]
- **User state:** [no existing account / logged in / specific role or permissions]
- **Environment:** [staging URL / local / production canary]
- **Data fixtures:** [describe any required test data]

## Success Criteria

[What defines successful completion? Be specific about the end state.]

- [ ] [Observable condition that proves the objective was met, e.g., "Dashboard is displayed with the user's name"]
- [ ] [Additional conditions as needed]

## UX Budgets

[Quantitative thresholds for acceptable experience quality. Set based on product standards.]

| Metric | Budget | Rationale |
|--------|--------|-----------|
| Click budget | [max interactions] | [why this number] |
| Navigation depth | [max page transitions] | [why this number] |
| Backtracking limit | [max backward navigations] | [expected: 0 for simple flows] |
| Error encounters | [max user-facing errors] | [expected: 0 for happy path] |
| Time budget | [max seconds, normalized] | [baseline or industry benchmark] |

## Specification Traceability

[Link this test to the specification sections it verifies.]

| Spec section | Behavior tested |
|-------------|-----------------|
| [section ID] | [which user-facing behavior] |
| [ ] | [ ] |

## Failure Handling

[What should happen if the test fails?]

- **Objective not completed:** File as a product defect. The journey log identifies where the agent got stuck.
- **Budget exceeded:** Review the journey log for friction points. Determine whether the budget is too tight or the product needs improvement.
- **Critical error encountered:** File as a severity-1 defect. Product should not crash or lose data during normal use.

## Journey Log Requirements

[What should the agent record during execution?]

- Every page or view visited
- Every interaction (click, type, scroll, navigate)
- Every error message encountered and how it was handled
- Points where the agent hesitated or re-read content
- Dead ends where the agent had to backtrack
- Time spent on each step
- Screenshots at key decision points (optional but recommended)

## Notes

[Additional context, known limitations, or special instructions for this scenario.]

[ ]
9 changes: 9 additions & 0 deletions harness/templates/test-plan-template.md
@@ -55,6 +55,15 @@
|-----------|-------------|--------|----------------|-------|
| [ ] | [latency / throughput] | [load tool, benchmark] | [ ] | [ ] |

## Product Tests

[Include when the specification defines user-facing behaviors. Product tests verify the assembled application by having agents operate it as real users. See `templates/product-test-template.md` and `templates/user-profile-template.md` for full templates.]

| Objective | User profile | Spec reference | UX budget (clicks) | Priority |
|-----------|-------------|----------------|---------------------|----------|
| [goal-oriented statement] | [profile name] | [section] | [max clicks] | [critical / high / medium / low] |
| [ ] | [ ] | [ ] | [ ] | [ ] |

## Test Environment Requirements

- **Dependencies:** [services, databases, queues]