Merged
1 change: 1 addition & 0 deletions .gitignore
@@ -78,6 +78,7 @@ actions-runner/

# caches
.cache/
__pycache__/
.turbo/
.parcel-cache/
.eslintcache
56 changes: 56 additions & 0 deletions tools/code-exec-harness/README.md
@@ -0,0 +1,56 @@
# Code Exec Harness

`tools/code-exec-harness/harness.py` runs isolated `code exec --json` scenarios
and saves compact evidence under `.tmp/code-exec-harness/`.

Use it to compare Every Code behavior across prompt, memory, skill, model, and
configuration variants without writing to real GitHub or reusing the real
`CODE_HOME`.

The harness defaults `code exec` to `danger-full-access` because external tool
shims such as the fake `gh` need to write logs and state outside the fixture
workspace. That mode permits writes outside the workspace, so run the harness
only with trusted scenarios.

By default, each scenario gets an empty `CODE_HOME`. Pass `--inherit-auth` only
when you want a live model-backed run; it copies auth files into the isolated
run home without copying the rest of your config.

`HOME`, `ZDOTDIR`, `XDG_CONFIG_HOME`, and `XDG_CACHE_HOME` are also redirected
inside the run directory so shell startup files and home-directory tooling do
not silently use the real user profile.
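The redirection described above can be pictured with a small sketch. This is not the code in `harness.py`: the function name `isolated_env` and the subdirectory names (`code-home`, `home`) are assumptions made for illustration; only the environment variable names come from this README.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional


def isolated_env(run_dir: Path, base_env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with home-like variables pointed inside run_dir."""
    env = dict(os.environ if base_env is None else base_env)
    # Hypothetical layout: the real harness may name these directories differently.
    redirects = {
        "CODE_HOME": run_dir / "code-home",
        "HOME": run_dir / "home",
        "ZDOTDIR": run_dir / "home",
        "XDG_CONFIG_HOME": run_dir / "home" / ".config",
        "XDG_CACHE_HOME": run_dir / "home" / ".cache",
    }
    for name, path in redirects.items():
        # Create each directory so tools that expect it to exist do not fail.
        path.mkdir(parents=True, exist_ok=True)
        env[name] = str(path)
    return env
```

Passing the returned mapping as `env=` to `subprocess.run` would keep a child process from reading the real profile while leaving unrelated variables such as `PATH` untouched.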

## Run

```sh
python3 tools/code-exec-harness/harness.py \
tools/code-exec-harness/scenarios/github-plan-smoke.json \
--skill-root /Users/cbusillo/Developer/codex-skills \
--inherit-auth
```

Each run writes:

- `artifacts/stdout.jsonl`: raw `code exec --json` events
- `artifacts/stderr.log`: stderr from the run
- `artifacts/summary.json`: final answer, token usage, tool commands, fake `gh`
calls, fake GitHub state, and expectation failures
- `artifacts/gh-calls.jsonl`: fake `gh` invocations when the scenario enables it
- `artifacts/gh-state.json`: fake issue state after the run
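To inspect a finished run, the artifacts can be loaded with the standard library alone. The sketch below assumes only the file names listed above. The helper names `load_events` and `load_summary` are hypothetical, and nothing is assumed about the keys inside each JSON object.

```python
import json
from pathlib import Path


def load_events(artifacts_dir):
    """Parse artifacts/stdout.jsonl into a list, one decoded JSON value per non-empty line."""
    events = []
    path = Path(artifacts_dir) / "stdout.jsonl"
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between events
            events.append(json.loads(line))
    return events


def load_summary(artifacts_dir):
    """Load artifacts/summary.json as a plain Python object."""
    path = Path(artifacts_dir) / "summary.json"
    return json.loads(path.read_text(encoding="utf-8"))
```

Call both helpers with the `artifacts` directory of a single run, then print or diff the results to compare variants.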

## Scenario Shape

Scenarios are JSON files. Common fields:

- `prompt`: prompt passed to `code exec`
- `files`: workspace files to create before the run
- `skill_roots`: skill roots copied or symlinked into isolated `CODE_HOME/skills`
- `gh`: fake GitHub fixture; when present, the harness prepends a fake `gh` to
`PATH`
- `config_toml`: isolated `CODE_HOME/config.toml` contents
- `config_overrides`: `-c key=value` arguments passed to `code exec`
- `inherit_auth`: copy auth files from the current `CODE_HOME` for this scenario
- `expect`: simple assertions over the final answer, commands, fake `gh` calls,
and exit code
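As a rough illustration, a scenario file could be generated and sanity-checked from Python. The field names come from the list above, but the value shapes are assumptions: `files` is guessed to map relative paths to file contents, and the `gh` and `expect` fields are omitted because this README does not describe their structure. The helper names are hypothetical.

```python
import json
from pathlib import Path

# Top-level fields documented in this README; the harness may accept others.
DOCUMENTED_FIELDS = {
    "prompt", "files", "skill_roots", "gh", "config_toml",
    "config_overrides", "inherit_auth", "expect",
}


def undocumented_fields(scenario):
    """Return top-level keys not listed in the README, sorted, as a typo check."""
    return sorted(set(scenario) - DOCUMENTED_FIELDS)


def write_scenario(path, scenario):
    """Write the scenario dict as indented JSON and return the path."""
    path = Path(path)
    path.write_text(json.dumps(scenario, indent=2) + "\n", encoding="utf-8")
    return path


# Assumed shapes: "files" as {relative path: contents}, "inherit_auth" as a bool.
example_scenario = {
    "prompt": "Summarize the open work in NOTES.md.",
    "files": {"NOTES.md": "- [ ] write the release plan\n"},
    "inherit_auth": False,
}
```

Running `undocumented_fields` before a run catches misspelled keys such as `promt`, which might otherwise go unnoticed. Check the bundled scenarios under `tools/code-exec-harness/scenarios/` for the shapes the harness really expects.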

The harness is intentionally black-box: the unit under test is the real `code
exec` binary and its emitted JSONL stream.