
Commit 625edf7

joyemang33 and andylizf authored
Feat: Add batch eval script for public test only (#87)
* feat: add run_eval_public.sh
* feat: update readme
* fix: add solutions_dir support and filter options for batch eval
  - Fix BatchEvaluator to use custom solutions_dir for hash computation
  - Add --model and --problem filter options to CLI and run_eval_public.sh
  - Allow users to test specific models or problems without running everything
* docs: add filter options examples to README
* refactor: improve batch CLI with --track and auto paths
  - Replace --algorithmic flag with --track research|algorithmic (required)
  - Set default solutions/problems paths based on track
  - Auto-create track subdir in results_dir (results/{track}/)
  - Simplify shell script to rely on CLI defaults
  - Update README with CLI usage examples
* docs: clarify batch evaluation with custom solutions directory
* feat: add SkyPilot cleanup and default backend for research track
  - Add signal handler to cleanup SkyPilot clusters on Ctrl+C
  - Research track defaults to SkyPilot backend (no need for --skypilot flag)
  - Add --docker flag to override default for research track
  - Simplify shell script since CLI handles backend defaults
* refactor: simplify batch CLI with --backend flag and auto-adjust workers
* fix: update run_eval.sh for new CLI flags
* docs: update batch CLI examples in markdown files
* refactor: use positional track argument for all CLI commands

---------

Co-authored-by: Andy Lee <andylizf@outlook.com>
1 parent 25f159c commit 625edf7

7 files changed

Lines changed: 288 additions & 242 deletions


README.md

Lines changed: 60 additions & 20 deletions
````diff
@@ -97,13 +97,13 @@ Here's [Algorithmic Problem 0](algorithmic/problems/0/statement.txt) - try to be
 
 ```bash
 # Run the example solution (Human Expert Solution)
-frontier eval --algorithmic 0 algorithmic/problems/0/examples/reference.cpp
+frontier eval algorithmic 0 algorithmic/problems/0/examples/reference.cpp
 
 # Run the example solution (GPT-5 Thinking Solution)
-frontier eval --algorithmic 0 algorithmic/problems/0/examples/gpt5.cpp
+frontier eval algorithmic 0 algorithmic/problems/0/examples/gpt5.cpp
 
 # Try your own solution!
-frontier eval --algorithmic 0 <your_solution.cpp>
+frontier eval algorithmic 0 <your_solution.cpp>
 ```
 
 <p align="center">
@@ -114,13 +114,13 @@ frontier eval --algorithmic 0 <your_solution.cpp>
 
 ```bash
 # List all problems
-frontier list
+frontier list research
 
 # Evaluate a generated solution locally for flash_attn problem (requires Docker)
-frontier eval flash_attn <your_solution.py>
+frontier eval research flash_attn <your_solution.py>
 
 # Evaluate on cloud (requires SkyPilot)
-frontier eval flash_attn <your_solution.py> --skypilot
+frontier eval research flash_attn <your_solution.py> --skypilot
 ```
 
 See [research/README.md](research/README.md) for full documentation.
@@ -129,10 +129,10 @@ See [research/README.md](research/README.md) for full documentation.
 
 ```bash
 # Evaluate a solution locally (requires Docker)
-frontier eval --algorithmic 1 <your_solution.cpp>
+frontier eval algorithmic 1 <your_solution.cpp>
 
 # Evaluate on cloud (requires SkyPilot)
-frontier eval --algorithmic 1 <your_solution.cpp> --skypilot
+frontier eval algorithmic 1 <your_solution.cpp> --skypilot
 ```
 
 See [algorithmic/README.md](algorithmic/README.md) for full documentation.
@@ -143,8 +143,8 @@ Frontier-CS supports unbounded scoring, enabling open-ended evaluation compatibl
 
 ```bash
 # Get unbounded score (without clipping to 100)
-frontier eval --unbounded flash_attn <your_solution.py>
-frontier eval --algorithmic --unbounded 1 <your_solution.cpp>
+frontier eval research flash_attn <your_solution.py> --unbounded
+frontier eval algorithmic 1 <your_solution.cpp> --unbounded
 ```
 
 ### Python API
@@ -170,23 +170,63 @@ print(f"Score (unbounded): {result.score_unbounded}")
 
 ### Batch Evaluation
 
-For running evaluations at scale, use the batch evaluation script:
+For testing your solutions at scale with public test cases.
+
+**Solution directory structure:**
+```
+{track}/solutions/
+  {problem}/
+    {model}.py    # variant 0
+    {model}_1.py  # variant 1
+    {model}_2.py  # variant 2
+```
+
+Example for research track:
+```
+research/solutions/
+  flash_attn/
+    gpt5.py
+    claude4.5sonnet.py
+  cross_entropy/
+    gpt5.py
+```
+
+**Basic usage:**
 
 ```bash
-# Evaluate all research solutions (uses SkyPilot)
-./scripts/run_eval.sh --track research
+# Evaluate all research solutions (uses SkyPilot by default)
+uv run frontier-eval batch research
 
-# Evaluate all algorithmic solutions (uses Docker)
-./scripts/run_eval.sh --track algorithmic
+# Evaluate all algorithmic solutions (uses Docker by default)
+uv run frontier-eval batch algorithmic
 
-# Custom parallelism
-./scripts/run_eval.sh --track research -j 20
+# Filter by model or problem
+uv run frontier-eval batch research --model gpt5.1
+uv run frontier-eval batch research --problem flash_attn
+uv run frontier-eval batch research --model gpt5.1 --problem flash_attn
 
-# Force re-evaluation (ignore cache)
-./scripts/run_eval.sh --track algorithmic --force
+# Override default backend
+uv run frontier-eval batch research --backend docker
+uv run frontier-eval batch algorithmic --backend skypilot
 ```
 
-The script auto-clones the internal and results repositories. See `./scripts/run_eval.sh --help` for all options.
+**Custom solutions directory:** You can test solutions from a custom directory with the same structure:
+
+```bash
+# Your custom directory should have the same structure:
+# my_solutions/{problem}/{model}.py
+
+uv run frontier-eval batch research --solutions-dir ./my_solutions
+```
+
+Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
+- Resume interrupted evaluations automatically
+- Run multiple times with different `--solutions-dir` and results accumulate
+
+See `--help` for all options.
+
+> **Note:** For maintainers, `./scripts/run_eval.sh` is used for full evaluation with private test cases.
+
 
 ## Submitting Results
 
````
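As a quick illustration of the layout the updated README documents, a custom solutions directory can be scaffolded like this (the directory and model names are made up for the example):

```shell
# Hypothetical scaffold for the documented layout:
#   {track}/solutions/{problem}/{model}.py, with _1/_2 suffixes for variants
mkdir -p my_solutions/flash_attn my_solutions/cross_entropy
echo '# stub' > my_solutions/flash_attn/gpt5.py    # variant 0
echo '# stub' > my_solutions/flash_attn/gpt5_1.py  # variant 1
echo '# stub' > my_solutions/cross_entropy/gpt5.py
find my_solutions -name '*.py' | sort
```

With that in place, the `--solutions-dir ./my_solutions` invocation shown in the diff would pick up all three files.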

SUBMIT.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -137,13 +137,13 @@ Before submitting, you can verify your solutions locally:
 
 ```bash
 # Evaluate a single solution
-frontier-eval flash_attn solution.py
+frontier-eval research flash_attn solution.py
 
 # Batch evaluation with progress tracking
-frontier-eval batch --pairs-file pairs.txt --results-dir results/
+frontier-eval batch research --pairs-file pairs.txt --results-dir results/
 
 # Batch evaluation with SkyPilot (cloud)
-frontier-eval batch --pairs-file pairs.txt --skypilot --max-concurrent 4
+frontier-eval batch research --pairs-file pairs.txt --backend skypilot --workers 4
 ```
 
 ## How to Submit
````
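The exact format of `pairs.txt` is not shown in this commit; as a sketch only, assuming one whitespace-separated `problem solution_path` pair per line, a pairs file could be prepared like this:

```shell
# Hypothetical pairs file -- the real format is not defined in this diff;
# assume one "problem solution_path" pair per line.
cat > pairs.txt <<'EOF'
flash_attn research/solutions/flash_attn/gpt5.py
cross_entropy research/solutions/cross_entropy/gpt5.py
EOF
# Count the pairs we would pass via --pairs-file pairs.txt
wc -l < pairs.txt
```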

algorithmic/README.md

Lines changed: 8 additions & 8 deletions
````diff
@@ -59,23 +59,23 @@ print(f"Score (unbounded): {result.score_unbounded}")
 
 ```bash
 # Evaluate a solution
-frontier-eval --algorithmic 1 solution.cpp
+frontier-eval algorithmic 1 solution.cpp
 
 # Get unbounded score
-frontier-eval --algorithmic 1 solution.cpp --unbounded
+frontier-eval algorithmic 1 solution.cpp --unbounded
 ```
 
 ### Batch Evaluation
 
 ```bash
 # Evaluate all solutions in algorithmic/solutions/
-frontier-eval batch --algorithmic --workers 10
+frontier-eval batch algorithmic
 
 # With SkyPilot (cloud go-judge)
-frontier-eval batch --algorithmic --skypilot --workers 10
+frontier-eval batch algorithmic --backend skypilot
 
 # Check status
-frontier-eval batch --algorithmic --status
+frontier-eval batch algorithmic --status
 ```
 
 **Note:** For algorithmic track, `--clusters` is not used. All workers share a single go-judge server (local Docker or SkyPilot).
@@ -86,11 +86,11 @@ For environments where Docker privileged mode is unavailable (e.g., gVisor, Clou
 
 ```bash
 # Auto-launch cloud judge
-frontier eval --algorithmic --skypilot 1 solution.cpp
+frontier eval algorithmic 1 solution.cpp --skypilot
 
 # Or manually launch
 sky launch -c algo-judge algorithmic/sky-judge.yaml --idle-minutes-to-autostop 10
-frontier eval --algorithmic --judge-url http://$(sky status --ip algo-judge):8081 1 solution.cpp
+frontier eval algorithmic 1 solution.cpp --judge-url http://$(sky status --ip algo-judge):8081
 ```
 
 ### Customized Problems
@@ -118,7 +118,7 @@ checker: chk.cc # or interactor: interactor.cc
 
 #### docker-compose.yml
 
-The judge server will be auto-started when running `frontier-eval --algorithmic`.
+The judge server will be auto-started when running `frontier-eval algorithmic`.
 
 ```yaml
 environment:
````
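To illustrate the manual-launch flow above, here is a sketch of how the `--judge-url` value is composed; the IP below is a placeholder, since the real command reads it from `sky status --ip algo-judge`:

```shell
# Placeholder IP standing in for: $(sky status --ip algo-judge)
JUDGE_IP="203.0.113.7"
JUDGE_URL="http://${JUDGE_IP}:8081"  # go-judge listens on port 8081 per the README
echo "frontier eval algorithmic 1 solution.cpp --judge-url $JUDGE_URL"
```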

research/README.md

Lines changed: 12 additions & 12 deletions
````diff
@@ -6,13 +6,13 @@ Real-world systems challenges requiring domain expertise in GPU computing, distr
 
 ```bash
 # List all problems
-frontier list
+frontier list research
 
 # Evaluate a solution (requires Docker)
-frontier eval flash_attn <your_solution.py>
+frontier eval research flash_attn <your_solution.py>
 
 # Evaluate multiple problems
-frontier eval --problems flash_attn,cross_entropy <your_solution.py>
+frontier eval research --problems flash_attn,cross_entropy <your_solution.py>
 ```
 
 ## Cloud Evaluation with SkyPilot
@@ -30,32 +30,32 @@ See [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/getting-started/in
 **Usage:**
 
 ```bash
-frontier eval flash_attn <your_solution.py> --skypilot
+frontier eval research flash_attn <your_solution.py> --skypilot
 ```
 
 ## Batch Evaluation
 
 Batch evaluation automatically scans `solutions/` and parses problem IDs from filenames:
 
 ```bash
-# Evaluate all solutions (auto-skips completed)
-frontier-eval batch
+# Evaluate all solutions (uses SkyPilot by default, auto-skips completed)
+frontier-eval batch research
 
-# With SkyPilot (cloud VMs)
-frontier-eval batch --skypilot --workers 20 --clusters 4
+# With custom parallelism
+frontier-eval batch research --workers 20 --clusters 4
 
 # Check status
-frontier-eval batch --status
+frontier-eval batch research --status
 
 # Force re-evaluate all
-frontier-eval batch --no-resume
+frontier-eval batch research --no-resume
 
 # Retry failed evaluations
-frontier-eval batch --retry-failed
+frontier-eval batch research --retry-failed
 ```
 
 **Parameters:**
-- `--workers`: Number of parallel workers (default: 1)
+- `--workers`: Number of parallel workers (default: 10)
 - `--clusters`: Number of SkyPilot clusters for load-balancing (default: same as workers, research + skypilot only)
 
 With `--workers 20 --clusters 4`, 20 workers share 4 clusters via load-balancing.
````
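The workers-to-clusters relationship described in that diff can be sketched as a simple round-robin assignment (illustrative only; the CLI's actual scheduler may balance differently):

```shell
# 20 workers sharing 4 clusters via round-robin (one plausible scheme).
WORKERS=20
CLUSTERS=4
i=0
while [ "$i" -lt "$WORKERS" ]; do
  echo "worker $i -> cluster $((i % CLUSTERS))"
  i=$((i + 1))
done
```

Each cluster ends up serving `WORKERS / CLUSTERS` workers (here, 5 apiece).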

scripts/run_eval.sh

Lines changed: 5 additions & 13 deletions
````diff
@@ -447,18 +447,15 @@ echo "Running tools from: $PUBLIC_DIR"
 
 # Set paths based on track
 # Solutions always from public, problems from internal (more test cases)
-# Results saved directly to results repo
+# Results saved directly to results repo (CLI adds track subdir automatically)
 if [[ "$TRACK" == "algorithmic" ]]; then
   SOLUTIONS_DIR="$PUBLIC_DIR/algorithmic/solutions"
-  RESULTS_DIR="$RESULTS_REPO/algorithmic/batch"
   PROBLEMS_DIR="$INTERNAL_DIR/algorithmic/problems"
-  EXTRA_ARGS="--algorithmic"
 else
   SOLUTIONS_DIR="$PUBLIC_DIR/research/solutions"
-  RESULTS_DIR="$RESULTS_REPO/research/batch"
   PROBLEMS_DIR="$INTERNAL_DIR/research/problems"
-  EXTRA_ARGS=""
 fi
+RESULTS_DIR="$RESULTS_REPO/batch"
 
 if [[ ! -d "$SOLUTIONS_DIR" ]]; then
   echo "ERROR: Solutions directory not found: $SOLUTIONS_DIR"
@@ -469,18 +466,13 @@ fi
 mkdir -p "$RESULTS_DIR"
 
 # Build command
-CMD="uv run frontier-eval batch"
+CMD="uv run frontier-eval batch $TRACK"
 CMD="$CMD --solutions-dir $SOLUTIONS_DIR"
 CMD="$CMD --results-dir $RESULTS_DIR"
-CMD="$CMD $EXTRA_ARGS"
-
-# For algorithmic track, use internal's problems (more test cases)
-if [[ -n "$PROBLEMS_DIR" ]]; then
-  CMD="$CMD --problems-dir $PROBLEMS_DIR"
-fi
+CMD="$CMD --problems-dir $PROBLEMS_DIR"
 
 if $SKYPILOT; then
-  CMD="$CMD --skypilot --workers $PARALLELISM --clusters $PARALLELISM"
+  CMD="$CMD --backend skypilot --workers $PARALLELISM --clusters $PARALLELISM"
 else
   CMD="$CMD --workers $PARALLELISM"
 fi
````
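The simplified command assembly can be exercised in isolation; the directory values below are stand-ins for the variables the real script derives from its cloned repos (`$PUBLIC_DIR`, `$INTERNAL_DIR`, `$RESULTS_REPO`):

```shell
# Stand-in values; the real script computes these from cloned repositories.
TRACK="algorithmic"
PARALLELISM=10
SKYPILOT=false
SOLUTIONS_DIR="public/$TRACK/solutions"
PROBLEMS_DIR="internal/$TRACK/problems"
RESULTS_DIR="results/batch"

# Same assembly order as the patched script: track first, then dirs, then backend.
CMD="uv run frontier-eval batch $TRACK"
CMD="$CMD --solutions-dir $SOLUTIONS_DIR"
CMD="$CMD --results-dir $RESULTS_DIR"
CMD="$CMD --problems-dir $PROBLEMS_DIR"
if [ "$SKYPILOT" = true ]; then
  CMD="$CMD --backend skypilot --workers $PARALLELISM --clusters $PARALLELISM"
else
  CMD="$CMD --workers $PARALLELISM"
fi
echo "$CMD"
```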
