Noer-Automata is a template for autonomous experimentation on structured machine learning tasks. It is designed for workflows where an agent iteratively proposes a change, runs a local evaluation, logs the result, and keeps or reverts the change based on the measured metric. It can be used for benchmark-style ML tasks, challenge-format coursework, structured assignments, or any project where the input/output contract looks like a train/test pipeline with an objective metric.
This framework is intended to run with OpenAI Codex as the code-generation and experiment agent.
- Experimental Research: the loop is built for repeatable trials, evaluation, comparison, and improvement.
- Automata: the system is designed to operate with minimal manual intervention once the project is configured.
- Codex reads the project instructions.
- Codex makes exactly one experiment change.
- A local evaluation command runs.
- The controller logs the result.
- The controller keeps the new version only if it improves the configured metric.
This makes the workflow easier to audit than a free-form coding session. Each attempt is recorded, each metric is tracked, and changes can be reverted automatically.
```text
Noer-Automata/
├─ .codex/
│  └─ config.toml
├─ scripts/
│  └─ run_loop.py
├─ work/
│  └─ solution.py
├─ references/
│  └─ .gitkeep
├─ experiments/
│  ├─ history.jsonl
│  ├─ metrics.csv
│  └─ current/
├─ output/
├─ Challenge/
├─ AGENTS.md
├─ challenge_config.json
├─ challenge_details.md
├─ README.md
└─ .gitignore
```
- `Challenge/`: local dataset and task assets. Treat this as read-only.
- `references/`: optional public or prior solution material. Treat this as read-only.
- `work/`: the active mutable solution that Codex is allowed to change.
- `experiments/`: history, metrics, per-run logs, and temporary experiment outputs.
- `output/`: generated outputs such as submission files or predictions.
- `scripts/`: controller logic such as the experiment loop.
- `AGENTS.md`: repo-level instructions that Codex reads before starting work.
- `challenge_config.json`: project-specific settings, commands, metric name, and bootstrap mode.
- `challenge_details.md`: human-readable task description, schema notes, evaluation, and submission contract.
Noer-Automata supports two starting modes.
Improve-existing mode: use this when you already have a working baseline.
- Put the current solution in `work/solution.py`.
- Set `bootstrap_mode` to `improve_existing`.
- Optionally allow `references/`.
- The controller can register the current solution as the initial baseline.
- Later runs try to improve it.
Cold-start mode: use this when you want Codex to build the solution over time.
- Start with a minimal placeholder in `work/solution.py`.
- Set `bootstrap_mode` to `cold_start`.
- Usually set `use_references` to `false`.
- Clear old experiment history.
- The first accepted run becomes the initial baseline.
- Python 3.10+ recommended
- Node.js and npm for the Codex CLI
- Git initialized in the repository
```bash
mkdir Noer-Automata
cd Noer-Automata
git init
```

Using conda:

```bash
conda create -n noer_automata python=3.12 -y
conda activate noer_automata
```

Install your project dependencies after that. At minimum, your solution script should be able to run the configured dev and test commands.

```bash
npm install -g @openai/codex
codex
```

Run `codex` once so you can sign in locally before using the automated loop.
Create `.codex/config.toml`:

```toml
model = "gpt-5-codex"
approval_policy = "on-request"
sandbox_mode = "workspace-write"

[sandbox_workspace_write]
network_access = true
```

Codex supports project-scoped `.codex/config.toml` files, and project config overrides user config for trusted projects.
Create AGENTS.md in the repo root. Codex reads repo-level AGENTS.md files as part of its instruction layering.
A good AGENTS.md should define:
- allowed editable paths
- read-only paths
- preferred environment
- one-experiment-per-run rule
- logging policy
- whether references are allowed
- whether the repo is in `cold_start` or `improve_existing` mode
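A minimal sketch of such an `AGENTS.md` (the exact wording and section names are up to you; the paths match the layout above):

```markdown
# AGENTS.md

## Editable paths
- work/, scripts/, experiments/

## Read-only paths
- Challenge/, references/, challenge_details.md

## Rules
- Make exactly one experiment change per run.
- Log a one-line hypothesis for each change.
- References are allowed; the repo is in improve_existing mode.
```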
This file should describe:
- task overview
- dataset files
- schema and important columns
- targets
- evaluation metric
- submission/output format
- constraints such as runtime or hardware limits
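A hedged skeleton of what `challenge_details.md` might contain (the task shown is invented for illustration):

```markdown
# Challenge details

## Task
Predict the target label for each test row.

## Dataset
- Challenge/train.csv: training rows with labels
- Challenge/test.csv: unlabeled test rows

## Evaluation
Primary metric: accuracy (higher is better).

## Submission
Write output/submission.csv with columns: id, prediction.
```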
This file controls how the controller behaves.
Example:
```json
{
  "challenge_name": "YOUR-Challenge",
  "challenge_root": ".",
  "dataset_dir": "./Challenge",
  "references_dir": "./references",
  "challenge_details_path": "./challenge_details.md",
  "bootstrap_mode": "improve_existing",
  "use_references": true,
  "register_existing_solution_as_baseline": true,
  "require_real_code_change_after_baseline": true,
  "editable_paths": ["work", "scripts", "experiments"],
  "read_only_paths": ["Challenge", "references", "challenge_details.md"],
  "primary_metric": "score",
  "metric_direction": "max",
  "time_budget_minutes": 480,
  "experiment_time_limit_minutes": 30,
  "max_experiments_per_loop": 20,
  "environment": {
    "conda_env": "noer_automata",
    "python_command": "python",
    "allow_package_installs": true
  },
  "codex": {
    "command": "codex",
    "full_auto": true,
    "skip_git_repo_check": false
  },
  "evaluation": {
    "dev_command": "python work/solution.py --mode dev --metrics-out ./experiments/current/metrics.json",
    "test_command": "python work/solution.py --mode test --submission-out ./output/submission.csv",
    "metrics_path": "./experiments/current/metrics.json",
    "submission_path": "./output/submission.csv"
  },
  "logging": {
    "history_jsonl": "./experiments/history.jsonl",
    "metrics_csv": "./experiments/metrics.csv",
    "current_run_dir": "./experiments/current"
  }
}
```

A typical run does this:
- Load the config and current best result.
- Decide whether to register an existing baseline.
- Build a single Codex prompt for one experiment.
- Snapshot the mutable source files.
- Run Codex once.
- Run the local dev evaluation.
- Parse the metrics.
- Keep the change only if it improves the primary metric.
- Log the result to `experiments/history.jsonl` and `experiments/metrics.csv`.
This means Codex is responsible for proposing and implementing one experiment, while the controller is responsible for evaluation, logging, and acceptance.
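The accept-or-revert step above can be sketched as follows. This is an illustrative snippet, not the actual `scripts/run_loop.py`; the function names and the snapshot location are assumptions.

```python
import json
import shutil
import subprocess
from pathlib import Path

def is_improvement(new, best, direction):
    """Return True if `new` beats `best` for the configured metric direction."""
    if best is None:
        return True  # first accepted run becomes the baseline
    return new > best if direction == "max" else new < best

def run_one_experiment(config, best_score):
    """One iteration: snapshot, let the agent edit, evaluate, keep or revert."""
    snapshot = Path("experiments/current/snapshot")
    shutil.copytree("work", snapshot, dirs_exist_ok=True)  # snapshot mutable files

    # Placeholder for the single Codex invocation (one experiment change).
    subprocess.run(config["codex"]["command"], shell=True)

    # Run the local dev evaluation and parse the metrics it wrote.
    subprocess.run(config["evaluation"]["dev_command"], shell=True)
    metrics = json.loads(Path(config["evaluation"]["metrics_path"]).read_text())
    score = metrics[config["primary_metric"]]

    if is_improvement(score, best_score, config["metric_direction"]):
        return score, "accepted"
    shutil.copytree(snapshot, "work", dirs_exist_ok=True)  # revert the change
    return best_score, "reverted"
```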
- Put your current pipeline into `work/solution.py`.
- Put any optional prior material into `references/`.
- Set:

```json
"bootstrap_mode": "improve_existing",
"use_references": true,
"register_existing_solution_as_baseline": true
```

- Run one loop:

```bash
python scripts/run_loop.py --max-experiments 1
```

The first run can register your current solution as the initial baseline. Later runs will try to beat it.
- Replace `work/solution.py` with a minimal placeholder scaffold.
- Empty `references/` or disable it in config.
- Set:

```json
"bootstrap_mode": "cold_start",
"use_references": false,
"register_existing_solution_as_baseline": false
```

- Clear old logs in `experiments/`.
- Run:

```bash
python scripts/run_loop.py --max-experiments 1
```

The first accepted run becomes the initial baseline. Later runs try to improve it.
The template assumes that `work/solution.py` supports two modes.

```bash
python work/solution.py --mode dev --metrics-out ./experiments/current/metrics.json
```

This should:
- run a local validation or dev pipeline
- write a metrics JSON file
- include the primary metric named in `challenge_config.json`
Example output JSON:
```json
{
  "score": 0.8421,
  "perfect_rate": 0.412,
  "approach_summary": "TF-IDF + logistic regression baseline with calibrated decoding."
}
```

```bash
python work/solution.py --mode test --submission-out ./output/submission.csv
```

This should:
- train on the full training set if needed
- generate predictions for the test set
- write the final output file expected by the task
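A minimal skeleton of this two-mode contract might look like the following. The training and prediction logic here is a placeholder stub, not a real pipeline; replace the bodies of `run_dev` and `run_test` with your own code.

```python
import argparse
import json
import sys
from pathlib import Path

def run_dev(metrics_out):
    """Run a local validation pipeline and write the metrics contract."""
    # Stand-in for real training/validation; replace with your pipeline.
    metrics = {
        "score": 0.0,  # must match "primary_metric" in challenge_config.json
        "approach_summary": "placeholder baseline",
    }
    Path(metrics_out).parent.mkdir(parents=True, exist_ok=True)
    Path(metrics_out).write_text(json.dumps(metrics, indent=2))

def run_test(submission_out):
    """Train on the full training set and write the final output file."""
    Path(submission_out).parent.mkdir(parents=True, exist_ok=True)
    Path(submission_out).write_text("id,prediction\n")  # header-only stub

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode", choices=["dev", "test"], required=True)
    parser.add_argument("--metrics-out", default="./experiments/current/metrics.json")
    parser.add_argument("--submission-out", default="./output/submission.csv")
    args = parser.parse_args()
    if args.mode == "dev":
        run_dev(args.metrics_out)
    else:
        run_test(args.submission_out)

if __name__ == "__main__" and len(sys.argv) > 1:  # only parse args when given
    main()
```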
Noer-Automata writes two main logs:

- `experiments/history.jsonl`: one JSON object per experiment. Useful for debugging, audit trails, and detailed run analysis.
- `experiments/metrics.csv`: a flat table useful for plotting metric progress over time.
Typical logged fields include:
- experiment id
- timestamp
- accepted or reverted
- hypothesis
- primary metric value
- previous best
- delta vs best
- status
- short notes
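One way the controller might append to both logs, sketched below. The exact field names and record schema are assumptions based on the list above, not the template's actual format.

```python
import csv
import json
import time
from pathlib import Path

def log_experiment(record, history_path, metrics_path):
    """Append one experiment record to history.jsonl and metrics.csv."""
    history = Path(history_path)
    history.parent.mkdir(parents=True, exist_ok=True)
    with history.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")  # one JSON object per line

    metrics = Path(metrics_path)
    write_header = not metrics.exists()
    with metrics.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(record.keys()))
        if write_header:
            writer.writeheader()  # header only on first write
        writer.writerow(record)

record = {
    "experiment_id": 1,
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    "status": "accepted",
    "hypothesis": "add TF-IDF bigrams",
    "score": 0.8421,
    "previous_best": 0.8300,
    "delta": 0.0121,
}
```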
- Codex refuses to run because the repo is not under version control: initialize Git and make an initial commit, or use the appropriate CLI flag if you intentionally want to bypass the repo check.
- Runs are accepted without real changes: tighten `AGENTS.md` and the controller prompt so one run must modify at least one meaningful source file under `work/`.
- The dev evaluation fails or produces no metrics: check the configured `dev_command`, environment activation, file paths, and metrics output path.
- Garbled characters in logs on Windows: use UTF-8 subprocess decoding with replacement in the controller. This is especially useful when Codex or shell tools emit characters outside the default Windows code page.
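The UTF-8-with-replacement decoding can be sketched as follows; `run_logged` is an illustrative helper name, not part of the template.

```python
import subprocess
import sys

def run_logged(cmd):
    """Run a command, decoding its output as UTF-8 with replacement characters.

    Avoids UnicodeDecodeError on Windows, where subprocess output is otherwise
    decoded with the active code page (often cp1252).
    """
    result = subprocess.run(
        cmd,
        capture_output=True,
        encoding="utf-8",
        errors="replace",  # never crash on undecodable bytes
    )
    return result.returncode, result.stdout, result.stderr

code, out, err = run_logged([sys.executable, "-c", "print('ok')"])
```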
The goal is not to let an agent rewrite everything blindly. It is to create a repeatable experimentation loop that reflects how AI/ML engineers normally work: propose one hypothesis, implement one meaningful change, measure the outcome, and decide whether the change should be kept or reverted. Each run should correspond to one idea, one implementation attempt, one evaluation result, and one keep-or-revert decision. That is what makes the framework useful for research, benchmark-style tasks, and structured ML assignments.