Noer-Automata

Next-step Optimization for Experimental Research: Automata

Noer-Automata is a template for autonomous experimentation on structured machine learning tasks. It is designed for workflows where an agent iteratively proposes a change, runs a local evaluation, logs the result, and keeps or reverts the change based on the measured metric. It can be used for benchmark-style ML tasks, challenge-format coursework, structured assignments, or any project where the input/output contract looks like a train/test pipeline with an objective metric.

This framework is intended to run with OpenAI Codex as the code-generation and experiment agent.

What it does

  • Experimental Research: the loop is built for repeatable trials, evaluation, comparison, and improvement.
  • Automata: the system is designed to operate with minimal manual intervention once the project is configured.

Each run follows the same loop:
  1. Codex reads the project instructions.
  2. Codex makes exactly one experiment change.
  3. A local evaluation command runs.
  4. The controller logs the result.
  5. The controller keeps the new version only if it improves the configured metric.

This makes the workflow easier to audit than a free-form coding session. Each attempt is recorded, each metric is tracked, and changes can be reverted automatically.

Repository structure

Noer-Automata/
├─ .codex/
│  └─ config.toml
├─ scripts/
│  └─ run_loop.py
├─ work/
│  └─ solution.py
├─ references/
│  └─ .gitkeep
├─ experiments/
│  ├─ history.jsonl
│  ├─ metrics.csv
│  └─ current/
├─ output/
├─ Challenge/
├─ AGENTS.md
├─ challenge_config.json
├─ challenge_details.md
├─ README.md
└─ .gitignore

Folder purpose

  • Challenge/: local dataset and task assets. Treat this as read-only.
  • references/: optional public or prior solution material. Treat this as read-only.
  • work/: the active mutable solution that Codex is allowed to change.
  • experiments/: history, metrics, per-run logs, and temporary experiment outputs.
  • output/: generated outputs such as submission files or predictions.
  • scripts/: controller logic such as the experiment loop.
  • AGENTS.md: repo-level instructions that Codex reads before starting work.
  • challenge_config.json: project-specific settings, commands, metric name, and bootstrap mode.
  • challenge_details.md: human-readable task description, schema notes, evaluation, and submission contract.

Bootstrap modes

Noer-Automata supports two starting modes.

1. improve_existing

Use this when you already have a working baseline.

  • Put the current solution in work/solution.py.
  • Set bootstrap_mode to improve_existing.
  • Optionally allow references/.
  • The controller can register the current solution as the initial baseline.
  • Later runs try to improve it.

2. cold_start

Use this when you want Codex to build the solution over time.

  • Start with a minimal placeholder in work/solution.py.
  • Set bootstrap_mode to cold_start.
  • Usually set use_references to false.
  • Clear old experiment history.
  • The first accepted run becomes the initial baseline.

Prerequisites

  • Python 3.10+ recommended
  • Node.js and npm for the Codex CLI
  • Git initialized in the repository

Setup

1. Create or clone the repo

mkdir Noer-Automata
cd Noer-Automata
git init

2. Create a Python environment

Using conda:

conda create -n noer_automata python=3.12 -y
conda activate noer_automata

Install your project dependencies after that. At minimum, the environment must be able to run the configured dev and test commands against work/solution.py.

3. Install the Codex CLI

npm install -g @openai/codex
codex

Run codex once so you can sign in locally before using the automated loop.

4. Add project-level Codex config

Create .codex/config.toml:

model = "gpt-5-codex"
approval_policy = "on-request"
sandbox_mode = "workspace-write"

[sandbox_workspace_write]
network_access = true

Codex supports project-scoped .codex/config.toml files, and project config overrides user config for trusted projects.

5. Add repo instructions

Create AGENTS.md in the repo root. Codex reads repo-level AGENTS.md files as part of its instruction layering.

A good AGENTS.md should define:

  • allowed editable paths
  • read-only paths
  • preferred environment
  • one-experiment-per-run rule
  • logging policy
  • whether references are allowed
  • whether the repo is in cold_start or improve_existing mode
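For illustration, a minimal AGENTS.md covering those points might read as follows (the paths and phrasing are placeholders to adapt, not required wording):

```markdown
# Agent instructions

- Editable paths: `work/`, `scripts/`, `experiments/`
- Read-only paths: `Challenge/`, `references/`, `challenge_details.md`
- Environment: conda env `noer_automata`, Python 3.12
- Make exactly one experiment change per run.
- Log a one-line hypothesis for every change.
- References are allowed (`use_references: true`).
- Mode: `improve_existing`
```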

6. Create challenge_details.md

This file should describe:

  • task overview
  • dataset files
  • schema and important columns
  • targets
  • evaluation metric
  • submission/output format
  • constraints such as runtime or hardware limits

7. Create challenge_config.json

This file controls how the controller behaves.

Example:

{
  "challenge_name": "YOUR-Challenge",
  "challenge_root": ".",
  "dataset_dir": "./Challenge",
  "references_dir": "./references",
  "challenge_details_path": "./challenge_details.md",

  "bootstrap_mode": "improve_existing",
  "use_references": true,
  "register_existing_solution_as_baseline": true,
  "require_real_code_change_after_baseline": true,

  "editable_paths": ["work", "scripts", "experiments"],
  "read_only_paths": ["Challenge", "references", "challenge_details.md"],

  "primary_metric": "score",
  "metric_direction": "max",

  "time_budget_minutes": 480,
  "experiment_time_limit_minutes": 30,
  "max_experiments_per_loop": 20,

  "environment": {
    "conda_env": "noer_automata",
    "python_command": "python",
    "allow_package_installs": true
  },

  "codex": {
    "command": "codex",
    "full_auto": true,
    "skip_git_repo_check": false
  },

  "evaluation": {
    "dev_command": "python work/solution.py --mode dev --metrics-out ./experiments/current/metrics.json",
    "test_command": "python work/solution.py --mode test --submission-out ./output/submission.csv",
    "metrics_path": "./experiments/current/metrics.json",
    "submission_path": "./output/submission.csv"
  },

  "logging": {
    "history_jsonl": "./experiments/history.jsonl",
    "metrics_csv": "./experiments/metrics.csv",
    "current_run_dir": "./experiments/current"
  }
}

How the experiment loop works

A typical run does this:

  1. Load the config and current best result.
  2. Decide whether to register an existing baseline.
  3. Build a single Codex prompt for one experiment.
  4. Snapshot the mutable source files.
  5. Run Codex once.
  6. Run the local dev evaluation.
  7. Parse the metrics.
  8. Keep the change only if it improves the primary metric.
  9. Log the result to experiments/history.jsonl and experiments/metrics.csv.

This means Codex is responsible for proposing and implementing one experiment, while the controller is responsible for evaluation, logging, and acceptance.

How to use it

Scenario A: improve an existing solution

  1. Put your current pipeline into work/solution.py.
  2. Put any optional prior material into references/.
  3. Set:
"bootstrap_mode": "improve_existing",
"use_references": true,
"register_existing_solution_as_baseline": true
  4. Run one loop:
python scripts/run_loop.py --max-experiments 1

The first run can register your current solution as the initial baseline. Later runs will try to beat it.

Scenario B: start from scratch

  1. Replace work/solution.py with a minimal placeholder scaffold.
  2. Empty references/ or disable it in config.
  3. Set:
"bootstrap_mode": "cold_start",
"use_references": false,
"register_existing_solution_as_baseline": false
  4. Clear old logs in experiments/.
  5. Run:
python scripts/run_loop.py --max-experiments 1

The first accepted run becomes the initial baseline. Later runs try to improve it.

Expected contract for work/solution.py

The template assumes that work/solution.py supports two modes.

Dev mode

python work/solution.py --mode dev --metrics-out ./experiments/current/metrics.json

This should:

  • run a local validation or dev pipeline
  • write a metrics JSON file
  • include the primary metric named in challenge_config.json

Example output JSON:

{
  "score": 0.8421,
  "perfect_rate": 0.412,
  "approach_summary": "TF-IDF + logistic regression baseline with calibrated decoding."
}

Test mode

python work/solution.py --mode test --submission-out ./output/submission.csv

This should:

  • train on the full training set if needed
  • generate predictions for the test set
  • write the final output file expected by the task

Logging

Noer-Automata writes two main logs:

experiments/history.jsonl

One JSON object per experiment. Useful for debugging, audit trails, and detailed run analysis.

experiments/metrics.csv

A flat table useful for plotting metric progress over time.

Typical logged fields include:

  • experiment id
  • timestamp
  • accepted or reverted
  • hypothesis
  • primary metric value
  • previous best
  • delta vs best
  • status
  • short notes

Troubleshooting

Codex says the repo is not trusted

Initialize Git in the repo and make an initial commit, or use the appropriate CLI flag if you intentionally want to bypass the repo check.

Codex runs but makes no real change

Tighten AGENTS.md and the controller prompt so one run must modify at least one meaningful source file under work/.

Evaluation works manually but fails in the loop

Check the configured dev_command, environment activation, file paths, and metrics output path.

Windows output decoding errors

Use UTF-8 subprocess decoding with replacement in the controller. This is especially useful when Codex or shell tools emit characters outside the default Windows code page.
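A minimal example of such decoding with Python's subprocess module:

```python
import subprocess
import sys

# Decode child-process output as UTF-8, replacing undecodable bytes with
# U+FFFD instead of raising on Windows' default code page.
result = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    capture_output=True,
    encoding="utf-8",
    errors="replace",
)
print(result.stdout.strip())  # prints "ok"
```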

Philosophy

The goal is not to let an agent rewrite everything blindly. It is to create a repeatable experimentation loop that reflects how AI/ML engineers normally work: propose one hypothesis, implement one meaningful change, measure the outcome, and decide whether the change should be kept or reverted. Each run should correspond to one idea, one implementation attempt, one evaluation result, and one keep-or-revert decision. That is what makes the framework useful for research, benchmark style tasks, and structured ML assignments.
