
Commit 625edf7

joyemang33 and andylizf authored
Feat: Add batch eval script for public test only (#87)
* feat: add run_eval_public.sh
* feat: update readme
* fix: add solutions_dir support and filter options for batch eval
  - Fix BatchEvaluator to use custom solutions_dir for hash computation
  - Add --model and --problem filter options to CLI and run_eval_public.sh
  - Allow users to test specific models or problems without running everything
* docs: add filter options examples to README
* refactor: improve batch CLI with --track and auto paths
  - Replace --algorithmic flag with --track research|algorithmic (required)
  - Set default solutions/problems paths based on track
  - Auto-create track subdir in results_dir (results/{track}/)
  - Simplify shell script to rely on CLI defaults
  - Update README with CLI usage examples
* docs: clarify batch evaluation with custom solutions directory
* feat: add SkyPilot cleanup and default backend for research track
  - Add signal handler to cleanup SkyPilot clusters on Ctrl+C
  - Research track defaults to SkyPilot backend (no need for --skypilot flag)
  - Add --docker flag to override default for research track
  - Simplify shell script since CLI handles backend defaults
* refactor: simplify batch CLI with --backend flag and auto-adjust workers
* fix: update run_eval.sh for new CLI flags
* docs: update batch CLI examples in markdown files
* refactor: use positional track argument for all CLI commands

---------

Co-authored-by: Andy Lee <andylizf@outlook.com>
1 parent 25f159c commit 625edf7

7 files changed

Lines changed: 288 additions & 242 deletions


README.md

Lines changed: 60 additions & 20 deletions
````diff
@@ -97,13 +97,13 @@ Here's [Algorithmic Problem 0](algorithmic/problems/0/statement.txt) - try to be
 
 ```bash
 # Run the example solution (Human Expert Solution)
-frontier eval --algorithmic 0 algorithmic/problems/0/examples/reference.cpp
+frontier eval algorithmic 0 algorithmic/problems/0/examples/reference.cpp
 
 # Run the example solution (GPT-5 Thinking Solution)
-frontier eval --algorithmic 0 algorithmic/problems/0/examples/gpt5.cpp
+frontier eval algorithmic 0 algorithmic/problems/0/examples/gpt5.cpp
 
 # Try your own solution!
-frontier eval --algorithmic 0 <your_solution.cpp>
+frontier eval algorithmic 0 <your_solution.cpp>
 ```
 
 <p align="center">
@@ -114,13 +114,13 @@ frontier eval --algorithmic 0 <your_solution.cpp>
 
 ```bash
 # List all problems
-frontier list
+frontier list research
 
 # Evaluate a generated solution locally for flash_attn problem (requires Docker)
-frontier eval flash_attn <your_solution.py>
+frontier eval research flash_attn <your_solution.py>
 
 # Evaluate on cloud (requires SkyPilot)
-frontier eval flash_attn <your_solution.py> --skypilot
+frontier eval research flash_attn <your_solution.py> --skypilot
 ```
 
 See [research/README.md](research/README.md) for full documentation.
@@ -129,10 +129,10 @@ See [research/README.md](research/README.md) for full documentation.
 
 ```bash
 # Evaluate a solution locally (requires Docker)
-frontier eval --algorithmic 1 <your_solution.cpp>
+frontier eval algorithmic 1 <your_solution.cpp>
 
 # Evaluate on cloud (requires SkyPilot)
-frontier eval --algorithmic 1 <your_solution.cpp> --skypilot
+frontier eval algorithmic 1 <your_solution.cpp> --skypilot
 ```
 
 See [algorithmic/README.md](algorithmic/README.md) for full documentation.
@@ -143,8 +143,8 @@ Frontier-CS supports unbounded scoring, enabling open-ended evaluation compatibl
 
 ```bash
 # Get unbounded score (without clipping to 100)
-frontier eval --unbounded flash_attn <your_solution.py>
-frontier eval --algorithmic --unbounded 1 <your_solution.cpp>
+frontier eval research flash_attn <your_solution.py> --unbounded
+frontier eval algorithmic 1 <your_solution.cpp> --unbounded
 ```
 
 ### Python API
@@ -170,23 +170,63 @@ print(f"Score (unbounded): {result.score_unbounded}")
 
 ### Batch Evaluation
 
-For running evaluations at scale, use the batch evaluation script:
+For testing your solutions at scale with public test cases.
+
+**Solution directory structure:**
+```
+{track}/solutions/
+  {problem}/
+    {model}.py    # variant 0
+    {model}_1.py  # variant 1
+    {model}_2.py  # variant 2
+```
+
+Example for research track:
+```
+research/solutions/
+  flash_attn/
+    gpt5.py
+    claude4.5sonnet.py
+  cross_entropy/
+    gpt5.py
+```
+
+**Basic usage:**
 
 ```bash
-# Evaluate all research solutions (uses SkyPilot)
-./scripts/run_eval.sh --track research
+# Evaluate all research solutions (uses SkyPilot by default)
+uv run frontier-eval batch research
 
-# Evaluate all algorithmic solutions (uses Docker)
-./scripts/run_eval.sh --track algorithmic
+# Evaluate all algorithmic solutions (uses Docker by default)
+uv run frontier-eval batch algorithmic
 
-# Custom parallelism
-./scripts/run_eval.sh --track research -j 20
+# Filter by model or problem
+uv run frontier-eval batch research --model gpt5.1
+uv run frontier-eval batch research --problem flash_attn
+uv run frontier-eval batch research --model gpt5.1 --problem flash_attn
 
-# Force re-evaluation (ignore cache)
-./scripts/run_eval.sh --track algorithmic --force
+# Override default backend
+uv run frontier-eval batch research --backend docker
+uv run frontier-eval batch algorithmic --backend skypilot
 ```
 
-The script auto-clones the internal and results repositories. See `./scripts/run_eval.sh --help` for all options.
+**Custom solutions directory:** You can test solutions from a custom directory with the same structure:
+
+```bash
+# Your custom directory should have the same structure:
+# my_solutions/{problem}/{model}.py
+
+uv run frontier-eval batch research --solutions-dir ./my_solutions
+```
+
+Results are saved to `./results/batch/{track}/` by default. The state file tracks which (solution, problem) pairs have been evaluated, so you can:
+- Resume interrupted evaluations automatically
+- Run multiple times with different `--solutions-dir` and results accumulate
+
+See `--help` for all options.
+
+> **Note:** For maintainers, `./scripts/run_eval.sh` is used for full evaluation with private test cases.
+
 
 ## Submitting Results
 
````
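As a quick illustration of the layout the updated README documents, a custom solutions directory can be scaffolded like this (the directory and model names are made up for the example):

```shell
# Hypothetical scaffold for the documented layout:
#   {track}/solutions/{problem}/{model}.py, with _1/_2 suffixes for variants
mkdir -p my_solutions/flash_attn my_solutions/cross_entropy
echo '# stub' > my_solutions/flash_attn/gpt5.py    # variant 0
echo '# stub' > my_solutions/flash_attn/gpt5_1.py  # variant 1
echo '# stub' > my_solutions/cross_entropy/gpt5.py
find my_solutions -name '*.py' | sort
```

With that in place, the `--solutions-dir ./my_solutions` invocation shown in the diff would pick up all three files.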

SUBMIT.md

Lines changed: 3 additions & 3 deletions
````diff
@@ -137,13 +137,13 @@ Before submitting, you can verify your solutions locally:
 
 ```bash
 # Evaluate a single solution
-frontier-eval flash_attn solution.py
+frontier-eval research flash_attn solution.py
 
 # Batch evaluation with progress tracking
-frontier-eval batch --pairs-file pairs.txt --results-dir results/
+frontier-eval batch research --pairs-file pairs.txt --results-dir results/
 
 # Batch evaluation with SkyPilot (cloud)
-frontier-eval batch --pairs-file pairs.txt --skypilot --max-concurrent 4
+frontier-eval batch research --pairs-file pairs.txt --backend skypilot --workers 4
 ```
 
 ## How to Submit
````
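The exact format of `pairs.txt` is not shown in this commit; as a sketch only, assuming one whitespace-separated `problem solution_path` pair per line, a pairs file could be prepared like this:

```shell
# Hypothetical pairs file -- the real format is not defined in this diff;
# assume one "problem solution_path" pair per line.
cat > pairs.txt <<'EOF'
flash_attn research/solutions/flash_attn/gpt5.py
cross_entropy research/solutions/cross_entropy/gpt5.py
EOF
# Count the pairs we would pass via --pairs-file pairs.txt
wc -l < pairs.txt
```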

algorithmic/README.md

Lines changed: 8 additions & 8 deletions
````diff
@@ -59,23 +59,23 @@ print(f"Score (unbounded): {result.score_unbounded}")
 
 ```bash
 # Evaluate a solution
-frontier-eval --algorithmic 1 solution.cpp
+frontier-eval algorithmic 1 solution.cpp
 
 # Get unbounded score
-frontier-eval --algorithmic 1 solution.cpp --unbounded
+frontier-eval algorithmic 1 solution.cpp --unbounded
 ```
 
 ### Batch Evaluation
 
 ```bash
 # Evaluate all solutions in algorithmic/solutions/
-frontier-eval batch --algorithmic --workers 10
+frontier-eval batch algorithmic
 
 # With SkyPilot (cloud go-judge)
-frontier-eval batch --algorithmic --skypilot --workers 10
+frontier-eval batch algorithmic --backend skypilot
 
 # Check status
-frontier-eval batch --algorithmic --status
+frontier-eval batch algorithmic --status
 ```
 
 **Note:** For algorithmic track, `--clusters` is not used. All workers share a single go-judge server (local Docker or SkyPilot).
@@ -86,11 +86,11 @@ For environments where Docker privileged mode is unavailable (e.g., gVisor, Clou
 
 ```bash
 # Auto-launch cloud judge
-frontier eval --algorithmic --skypilot 1 solution.cpp
+frontier eval algorithmic 1 solution.cpp --skypilot
 
 # Or manually launch
 sky launch -c algo-judge algorithmic/sky-judge.yaml --idle-minutes-to-autostop 10
-frontier eval --algorithmic --judge-url http://$(sky status --ip algo-judge):8081 1 solution.cpp
+frontier eval algorithmic 1 solution.cpp --judge-url http://$(sky status --ip algo-judge):8081
 ```
 
 ### Customized Problems
@@ -118,7 +118,7 @@ checker: chk.cc # or interactor: interactor.cc
 
 #### docker-compose.yml
 
-The judge server will be auto-started when running `frontier-eval --algorithmic`.
+The judge server will be auto-started when running `frontier-eval algorithmic`.
 
 ```yaml
 environment:
````
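To illustrate the manual-launch flow above, here is a sketch of how the `--judge-url` value is composed; the IP below is a placeholder, since the real command reads it from `sky status --ip algo-judge`:

```shell
# Placeholder IP standing in for: $(sky status --ip algo-judge)
JUDGE_IP="203.0.113.7"
JUDGE_URL="http://${JUDGE_IP}:8081"  # go-judge listens on port 8081 per the README
echo "frontier eval algorithmic 1 solution.cpp --judge-url $JUDGE_URL"
```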

research/README.md

Lines changed: 12 additions & 12 deletions
````diff
@@ -6,13 +6,13 @@ Real-world systems challenges requiring domain expertise in GPU computing, distr
 
 ```bash
 # List all problems
-frontier list
+frontier list research
 
 # Evaluate a solution (requires Docker)
-frontier eval flash_attn <your_solution.py>
+frontier eval research flash_attn <your_solution.py>
 
 # Evaluate multiple problems
-frontier eval --problems flash_attn,cross_entropy <your_solution.py>
+frontier eval research --problems flash_attn,cross_entropy <your_solution.py>
 ```
 
 ## Cloud Evaluation with SkyPilot
@@ -30,32 +30,32 @@ See [SkyPilot docs](https://skypilot.readthedocs.io/en/latest/getting-started/in
 **Usage:**
 
 ```bash
-frontier eval flash_attn <your_solution.py> --skypilot
+frontier eval research flash_attn <your_solution.py> --skypilot
 ```
 
 ## Batch Evaluation
 
 Batch evaluation automatically scans `solutions/` and parses problem IDs from filenames:
 
 ```bash
-# Evaluate all solutions (auto-skips completed)
-frontier-eval batch
+# Evaluate all solutions (uses SkyPilot by default, auto-skips completed)
+frontier-eval batch research
 
-# With SkyPilot (cloud VMs)
-frontier-eval batch --skypilot --workers 20 --clusters 4
+# With custom parallelism
+frontier-eval batch research --workers 20 --clusters 4
 
 # Check status
-frontier-eval batch --status
+frontier-eval batch research --status
 
 # Force re-evaluate all
-frontier-eval batch --no-resume
+frontier-eval batch research --no-resume
 
 # Retry failed evaluations
-frontier-eval batch --retry-failed
+frontier-eval batch research --retry-failed
 ```
 
 **Parameters:**
-- `--workers`: Number of parallel workers (default: 1)
+- `--workers`: Number of parallel workers (default: 10)
 - `--clusters`: Number of SkyPilot clusters for load-balancing (default: same as workers, research + skypilot only)
 
 With `--workers 20 --clusters 4`, 20 workers share 4 clusters via load-balancing.
````
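The workers-to-clusters relationship described in that diff can be sketched as a simple round-robin assignment (illustrative only; the CLI's actual scheduler may balance differently):

```shell
# 20 workers sharing 4 clusters via round-robin (one plausible scheme).
WORKERS=20
CLUSTERS=4
i=0
while [ "$i" -lt "$WORKERS" ]; do
  echo "worker $i -> cluster $((i % CLUSTERS))"
  i=$((i + 1))
done
```

Each cluster ends up serving `WORKERS / CLUSTERS` workers (here, 5 apiece).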

scripts/run_eval.sh

Lines changed: 5 additions & 13 deletions
````diff
@@ -447,18 +447,15 @@ echo "Running tools from: $PUBLIC_DIR"
 
 # Set paths based on track
 # Solutions always from public, problems from internal (more test cases)
-# Results saved directly to results repo
+# Results saved directly to results repo (CLI adds track subdir automatically)
 if [[ "$TRACK" == "algorithmic" ]]; then
   SOLUTIONS_DIR="$PUBLIC_DIR/algorithmic/solutions"
-  RESULTS_DIR="$RESULTS_REPO/algorithmic/batch"
   PROBLEMS_DIR="$INTERNAL_DIR/algorithmic/problems"
-  EXTRA_ARGS="--algorithmic"
 else
   SOLUTIONS_DIR="$PUBLIC_DIR/research/solutions"
-  RESULTS_DIR="$RESULTS_REPO/research/batch"
   PROBLEMS_DIR="$INTERNAL_DIR/research/problems"
-  EXTRA_ARGS=""
 fi
+RESULTS_DIR="$RESULTS_REPO/batch"
 
 if [[ ! -d "$SOLUTIONS_DIR" ]]; then
   echo "ERROR: Solutions directory not found: $SOLUTIONS_DIR"
@@ -469,18 +466,13 @@ fi
 mkdir -p "$RESULTS_DIR"
 
 # Build command
-CMD="uv run frontier-eval batch"
+CMD="uv run frontier-eval batch $TRACK"
 CMD="$CMD --solutions-dir $SOLUTIONS_DIR"
 CMD="$CMD --results-dir $RESULTS_DIR"
-CMD="$CMD $EXTRA_ARGS"
-
-# For algorithmic track, use internal's problems (more test cases)
-if [[ -n "$PROBLEMS_DIR" ]]; then
-  CMD="$CMD --problems-dir $PROBLEMS_DIR"
-fi
+CMD="$CMD --problems-dir $PROBLEMS_DIR"
 
 if $SKYPILOT; then
-  CMD="$CMD --skypilot --workers $PARALLELISM --clusters $PARALLELISM"
+  CMD="$CMD --backend skypilot --workers $PARALLELISM --clusters $PARALLELISM"
 else
   CMD="$CMD --workers $PARALLELISM"
 fi
````
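The simplified command assembly can be exercised in isolation; the directory values below are stand-ins for the variables the real script derives from its cloned repos (`$PUBLIC_DIR`, `$INTERNAL_DIR`, `$RESULTS_REPO`):

```shell
# Stand-in values; the real script computes these from cloned repositories.
TRACK="algorithmic"
PARALLELISM=10
SKYPILOT=false
SOLUTIONS_DIR="public/$TRACK/solutions"
PROBLEMS_DIR="internal/$TRACK/problems"
RESULTS_DIR="results/batch"

# Same assembly order as the patched script: track first, then dirs, then backend.
CMD="uv run frontier-eval batch $TRACK"
CMD="$CMD --solutions-dir $SOLUTIONS_DIR"
CMD="$CMD --results-dir $RESULTS_DIR"
CMD="$CMD --problems-dir $PROBLEMS_DIR"
if [ "$SKYPILOT" = true ]; then
  CMD="$CMD --backend skypilot --workers $PARALLELISM --clusters $PARALLELISM"
else
  CMD="$CMD --workers $PARALLELISM"
fi
echo "$CMD"
```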
