Noer-Automata is a template for autonomous experimentation on structured machine learning tasks. It is designed for workflows where an agent iteratively proposes a change, runs a local evaluation, logs the result, and keeps or reverts the change based on the measured metric. It can be used for benchmark-style ML tasks, challenge-format coursework, structured assignments, or any project where the input/output contract looks like a train/test pipeline with an objective metric.
This framework is intended to run with OpenAI Codex as the code-generation and experiment agent.
- Experimental Research: the loop is built for repeatable trials, evaluation, comparison, and improvement.
- Automata: the system is designed to operate with minimal manual intervention once the project is configured.
- Codex reads the project instructions.
- Codex makes exactly one experiment change.
- A local evaluation command runs.
- The controller logs the result.
- The controller keeps the new version only if it improves the configured metric.
This makes the workflow easier to audit than a free-form coding session. Each attempt is recorded, each metric is tracked, and changes can be reverted automatically.
```text
Noer-Automata/
├─ .codex/
│  └─ config.toml
├─ scripts/
│  └─ run_loop.py
├─ work/
│  └─ solution.py
├─ references/
│  └─ .gitkeep
├─ experiments/
│  ├─ history.jsonl
│  ├─ metrics.csv
│  └─ current/
├─ output/
├─ Challenge/
├─ AGENTS.md
├─ challenge_config.json
├─ challenge_details.md
├─ README.md
└─ .gitignore
```
- `Challenge/`: local dataset and task assets. Treat this as read-only.
- `references/`: optional public or prior solution material. Treat this as read-only.
- `work/`: the active mutable solution that Codex is allowed to change.
- `experiments/`: history, metrics, per-run logs, and temporary experiment outputs.
- `output/`: generated outputs such as submission files or predictions.
- `scripts/`: controller logic such as the experiment loop.
- `AGENTS.md`: repo-level instructions that Codex reads before starting work.
- `challenge_config.json`: project-specific settings, commands, metric name, and bootstrap mode.
- `challenge_details.md`: human-readable task description, schema notes, evaluation, and submission contract.
Noer-Automata supports two starting modes.
Improve-existing mode: use this when you already have a working baseline.
- Put the current solution in `work/solution.py`.
- Set `bootstrap_mode` to `improve_existing`.
- Optionally allow `references/`.
- The controller can register the current solution as the initial baseline.
- Later runs try to improve it.
Cold-start mode: use this when you want Codex to build the solution over time.
- Start with a minimal placeholder in `work/solution.py`.
- Set `bootstrap_mode` to `cold_start`.
- Usually set `use_references` to `false`.
- Clear old experiment history.
- The first accepted run becomes the initial baseline.
- Python 3.10+ recommended
- Node.js and npm for the Codex CLI
- Git initialized in the repository
```bash
mkdir Noer-Automata
cd Noer-Automata
git init
```

Using conda:

```bash
conda create -n noer_automata python=3.12 -y
conda activate noer_automata
```

Install your project dependencies after that. At minimum, your solution script should be able to run the configured dev and test commands.

```bash
npm install -g @openai/codex
codex
```

Run `codex` once so you can sign in locally before using the automated loop.
Create `.codex/config.toml`:

```toml
model = "gpt-5-codex"
approval_policy = "on-request"
sandbox_mode = "workspace-write"

[sandbox_workspace_write]
network_access = true
```

Codex supports project-scoped `.codex/config.toml` files, and project config overrides user config for trusted projects.
Create AGENTS.md in the repo root. Codex reads repo-level AGENTS.md files as part of its instruction layering.
A good AGENTS.md should define:
- allowed editable paths
- read-only paths
- preferred environment
- one-experiment-per-run rule
- logging policy
- whether references are allowed
- whether the repo is in `cold_start` or `improve_existing` mode
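A minimal sketch of such an `AGENTS.md` (the exact wording and section names are up to you; the paths match the layout above):

```markdown
# AGENTS.md

## Editable paths
- work/, scripts/, experiments/

## Read-only paths
- Challenge/, references/, challenge_details.md

## Rules
- Make exactly one experiment change per run.
- Log a one-line hypothesis for each change.
- References are allowed; the repo is in improve_existing mode.
```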
This file should describe:
- task overview
- dataset files
- schema and important columns
- targets
- evaluation metric
- submission/output format
- constraints such as runtime or hardware limits
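A hedged skeleton of what `challenge_details.md` might contain (the task shown is invented for illustration):

```markdown
# Challenge details

## Task
Predict the target label for each test row.

## Dataset
- Challenge/train.csv: training rows with labels
- Challenge/test.csv: unlabeled test rows

## Evaluation
Primary metric: accuracy (higher is better).

## Submission
Write output/submission.csv with columns: id, prediction.
```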
This file controls how the controller behaves.
Example:
```json
{
  "challenge_name": "YOUR-Challenge",
  "challenge_root": ".",
  "dataset_dir": "./Challenge",
  "references_dir": "./references",
  "challenge_details_path": "./challenge_details.md",
  "bootstrap_mode": "improve_existing",
  "use_references": true,
  "register_existing_solution_as_baseline": true,
  "require_real_code_change_after_baseline": true,
  "editable_paths": ["work", "scripts", "experiments"],
  "read_only_paths": ["Challenge", "references", "challenge_details.md"],
  "primary_metric": "score",
  "metric_direction": "max",
  "time_budget_minutes": 480,
  "experiment_time_limit_minutes": 30,
  "max_experiments_per_loop": 20,
  "environment": {
    "conda_env": "noer_automata",
    "python_command": "python",
    "allow_package_installs": true
  },
  "codex": {
    "command": "codex",
    "full_auto": true,
    "skip_git_repo_check": false
  },
  "evaluation": {
    "dev_command": "python work/solution.py --mode dev --metrics-out ./experiments/current/metrics.json",
    "test_command": "python work/solution.py --mode test --submission-out ./output/submission.csv",
    "metrics_path": "./experiments/current/metrics.json",
    "submission_path": "./output/submission.csv"
  },
  "logging": {
    "history_jsonl": "./experiments/history.jsonl",
    "metrics_csv": "./experiments/metrics.csv",
    "current_run_dir": "./experiments/current"
  }
}
```

A typical run does this:
- Load the config and current best result.
- Decide whether to register an existing baseline.
- Build a single Codex prompt for one experiment.
- Snapshot the mutable source files.
- Run Codex once.
- Run the local dev evaluation.
- Parse the metrics.
- Keep the change only if it improves the primary metric.
- Log the result to `experiments/history.jsonl` and `experiments/metrics.csv`.
This means Codex is responsible for proposing and implementing one experiment, while the controller is responsible for evaluation, logging, and acceptance.
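The accept-or-revert step above can be sketched as follows. This is an illustrative snippet, not the actual `scripts/run_loop.py`; the function names and the snapshot location are assumptions.

```python
import json
import shutil
import subprocess
from pathlib import Path

def is_improvement(new, best, direction):
    """Return True if `new` beats `best` for the configured metric direction."""
    if best is None:
        return True  # first accepted run becomes the baseline
    return new > best if direction == "max" else new < best

def run_one_experiment(config, best_score):
    """One iteration: snapshot, let the agent edit, evaluate, keep or revert."""
    snapshot = Path("experiments/current/snapshot")
    shutil.copytree("work", snapshot, dirs_exist_ok=True)  # snapshot mutable files

    # Placeholder for the single Codex invocation (one experiment change).
    subprocess.run(config["codex"]["command"], shell=True)

    # Run the local dev evaluation and parse the metrics it wrote.
    subprocess.run(config["evaluation"]["dev_command"], shell=True)
    metrics = json.loads(Path(config["evaluation"]["metrics_path"]).read_text())
    score = metrics[config["primary_metric"]]

    if is_improvement(score, best_score, config["metric_direction"]):
        return score, "accepted"
    shutil.copytree(snapshot, "work", dirs_exist_ok=True)  # revert the change
    return best_score, "reverted"
```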
- Put your current pipeline into `work/solution.py`.
- Put any optional prior material into `references/`.
- Set:

```json
"bootstrap_mode": "improve_existing",
"use_references": true,
"register_existing_solution_as_baseline": true
```

- Run one loop:

```bash
python scripts/run_loop.py --max-experiments 1
```

The first run can register your current solution as the initial baseline. Later runs will try to beat it.
- Replace `work/solution.py` with a minimal placeholder scaffold.
- Empty `references/` or disable it in config.
- Set:

```json
"bootstrap_mode": "cold_start",
"use_references": false,
"register_existing_solution_as_baseline": false
```

- Clear old logs in `experiments/`.
- Run:

```bash
python scripts/run_loop.py --max-experiments 1
```

The first accepted run becomes the initial baseline. Later runs try to improve it.
The template assumes that `work/solution.py` supports two modes.

```bash
python work/solution.py --mode dev --metrics-out ./experiments/current/metrics.json
```

This should:
- run a local validation or dev pipeline
- write a metrics JSON file
- include the primary metric named in `challenge_config.json`
Example output JSON:
```json
{
  "score": 0.8421,
  "perfect_rate": 0.412,
  "approach_summary": "TF-IDF + logistic regression baseline with calibrated decoding."
}
```

```bash
python work/solution.py --mode test --submission-out ./output/submission.csv
```

This should:
- train on the full training set if needed
- generate predictions for the test set
- write the final output file expected by the task
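A minimal skeleton of this two-mode contract might look like the following. The training and prediction logic here is a placeholder stub, not a real pipeline; replace the bodies of `run_dev` and `run_test` with your own code.

```python
import argparse
import json
import sys
from pathlib import Path

def run_dev(metrics_out):
    """Run a local validation pipeline and write the metrics contract."""
    # Stand-in for real training/validation; replace with your pipeline.
    metrics = {
        "score": 0.0,  # must match "primary_metric" in challenge_config.json
        "approach_summary": "placeholder baseline",
    }
    Path(metrics_out).parent.mkdir(parents=True, exist_ok=True)
    Path(metrics_out).write_text(json.dumps(metrics, indent=2))

def run_test(submission_out):
    """Train on the full training set and write the final output file."""
    Path(submission_out).parent.mkdir(parents=True, exist_ok=True)
    Path(submission_out).write_text("id,prediction\n")  # header-only stub

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["dev", "test"], required=True)
    parser.add_argument("--metrics-out", default="./experiments/current/metrics.json")
    parser.add_argument("--submission-out", default="./output/submission.csv")
    args = parser.parse_args()
    if args.mode == "dev":
        run_dev(args.metrics_out)
    else:
        run_test(args.submission_out)

if __name__ == "__main__" and len(sys.argv) > 1:  # only parse args when given
    main()
```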
Noer-Automata writes two main logs:

- `experiments/history.jsonl`: one JSON object per experiment. Useful for debugging, audit trails, and detailed run analysis.
- `experiments/metrics.csv`: a flat table useful for plotting metric progress over time.
Typical logged fields include:
- experiment id
- timestamp
- accepted or reverted
- hypothesis
- primary metric value
- previous best
- delta vs best
- status
- short notes
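One way the controller might append to both logs, sketched below. The exact field names and record schema are assumptions based on the list above, not the template's actual format.

```python
import csv
import json
import time
from pathlib import Path

def log_experiment(record, history_path, metrics_path):
    """Append one experiment record to history.jsonl and metrics.csv."""
    history = Path(history_path)
    history.parent.mkdir(parents=True, exist_ok=True)
    with history.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")  # one JSON object per line

    metrics = Path(metrics_path)
    write_header = not metrics.exists()
    with metrics.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(record.keys()))
        if write_header:
            writer.writeheader()  # header only on first write
        writer.writerow(record)

record = {
    "experiment_id": 1,
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    "status": "accepted",
    "hypothesis": "add TF-IDF bigrams",
    "score": 0.8421,
    "previous_best": 0.8300,
    "delta": 0.0121,
}
```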
- Codex refuses to run because the repo is not under version control: initialize Git and make an initial commit, or use the appropriate CLI flag if you intentionally want to bypass the repo check.
- Runs are accepted without real changes: tighten `AGENTS.md` and the controller prompt so one run must modify at least one meaningful source file under `work/`.
- The dev evaluation fails or produces no metrics: check the configured `dev_command`, environment activation, file paths, and metrics output path.
- Garbled characters in logs on Windows: use UTF-8 subprocess decoding with replacement in the controller. This is especially useful when Codex or shell tools emit characters outside the default Windows code page.
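The UTF-8-with-replacement decoding can be sketched as follows; `run_logged` is an illustrative helper name, not part of the template.

```python
import subprocess
import sys

def run_logged(cmd):
    """Run a command, decoding its output as UTF-8 with replacement characters.

    Avoids UnicodeDecodeError on Windows, where subprocess output is otherwise
    decoded with the active code page (often cp1252).
    """
    result = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # never crash on undecodable bytes
    )
    return result.returncode, result.stdout, result.stderr

code, out, err = run_logged([sys.executable, "-c", "print('ok')"])
```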
The goal is not to let an agent rewrite everything blindly. It is to create a repeatable experimentation loop that reflects how AI/ML engineers normally work: propose one hypothesis, implement one meaningful change, measure the outcome, and decide whether the change should be kept or reverted. Each run should correspond to one idea, one implementation attempt, one evaluation result, and one keep-or-revert decision. That is what makes the framework useful for research, benchmark-style tasks, and structured ML assignments.