
[WIP] eval: unified Evaluator + EvalManager refactor#420

Open
eugenevinitsky wants to merge 26 commits into emerge/temp_training from ev/unified-evals

Conversation

@eugenevinitsky

Summary

Replaces the five overlapping eval pipelines with one base class + manager. See docs/eval_unification.md (gitignored working spec) and the design discussion preceding this PR.

WIP — do not merge yet. Functional integration tested locally on macOS (15 unit + map-type tests green); not yet exercised on the cluster against a real training run.

What changed

  • New pufferlib/ocean/benchmark/evaluators/ package: `Evaluator` base class + `EvalResult` dataclass + four subclasses (`MultiScenarioEvaluator`, `HumanReplayEvaluator`, `BehaviorClassEvaluator`, `WOSACEvaluator`).
  • New `pufferlib/ocean/benchmark/manager.py`: `EvalManager` with section discovery, recursive `inherits` chain, `clean_eval` macro expansion, dotted-key flattening, inline + subprocess dispatch, wandb namespacing.
  • `drive.ini`: `[eval]` replaced with `[eval.<name>]` sections (one per evaluator). 12 nuPlan behavior classes inherit from the `behaviors_defaults` template.
  • `driving_behaviours_eval.ini`: deleted, folded into `drive.ini`.
  • C-side `goal_advance_mode` knob: `c_step`'s last-goal branch dispatches on this instead of `simulation_mode`. Defaults preserve current behavior (gigaflow → regenerate, replay → saturate).
  • Removed from `pufferl.py` (~990 lines): `eval()`, `eval_multi_scenarios()`, `eval_multi_scenarios_render()`, `build_eval_overrides()`, `load_eval_multi_scenarios_config()`, `_swap_policy_obs_counts()`, `_render_driving_behaviours()`, `_export_metrics()`, `_log_eval_metrics()`, `verify_scenario_coverage()`, `verify_scenario_coverage_gigaflow()`.
  • Removed from `utils.py` (~300 lines): `run_human_replay_eval_in_subprocess`, `run_wosac_eval_in_subprocess`, `run_driving_behaviours_eval_in_subprocess`.
  • `PuffeRL.evaluate()` now calls `self._eval_manager.maybe_run()` — a single unified path.
  • CLI `puffer eval --evaluator [--out path]` wired for both standalone and subprocess use.

Net diff: ~+1100, ~−1900.
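For illustration, a minimal sketch of the new section layout (the `template` key name is an assumption; `inherits`, `type`, `enabled`, and `env.map_dir` match the drive.ini excerpts quoted later in this PR):

```ini
; Template: inherited by the 12 behavior classes, never run directly.
; (`template = true` is illustrative; the real skip mechanism may differ.)
[eval.behaviors_defaults]
template = true
type = "behavior_class"
interval = 250

; One concrete class; child keys win over inherited ones.
[eval.behaviors_lane_change]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/lane_change"
```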

Tests

  • `tests/test_eval_manager.py` (new, 10 cases): config-merge logic — inheritance chain, child-wins, cycle detection, unknown parent, clean macro applied/disabled/overridden, template skipping, unknown-type error.
  • `tests/test_drive_map_types.py` (existing, 5 cases): all still green after the C-side `goal_advance_mode` refactor.
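The config-merge semantics those tests pin down can be sketched in a few lines (a sketch only — function names and error types are assumptions, not the actual EvalManager code):

```python
def resolve_section(name, sections, _seen=None):
    """Child-wins merge along a recursive `inherits` chain, with cycle detection."""
    _seen = _seen or set()
    if name in _seen:
        raise ValueError(f"inheritance cycle at {name!r}")
    if name not in sections:
        raise KeyError(f"unknown parent section {name!r}")
    _seen.add(name)
    section = dict(sections[name])
    parent = section.pop("inherits", None)
    merged = resolve_section(parent, sections, _seen) if parent else {}
    merged.update(section)  # child keys win over inherited ones
    return merged

def expand_dotted(flat):
    """One-level dotted-key flattening: {'env.map_dir': x} -> {'env': {'map_dir': x}}."""
    out = {}
    for key, value in flat.items():
        node, *rest = key.split(".", 1)
        if rest:
            out.setdefault(node, {})[rest[0]] = value
        else:
            out[key] = value
    return out
```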

Test plan

  • Run a short training job on the cluster end-to-end and verify `validation_gigaflow` fires every 25 epochs, `behaviors_*` fires every 250.
  • Verify wandb panels land under `{eval_name}/` and renders under `{eval_name}/render/{view}`.
  • Verify `puffer eval puffer_drive --evaluator behaviors_hard_stop` runs standalone against a fixed checkpoint.
  • Verify `mode = "subprocess"` works (currently untested; the JSON-out path is plumbed but not exercised).
  • CI pre-commit + utest workflows green.

Out of scope

  • Persistent eval-worker pool (deferred until subprocess overhead becomes painful).
  • C-engine multi-pipe render (one rollout, N ffmpeg outputs per view).
  • Engine-level vecenv reuse via runtime `map_dir` swap.
  • `goal_advance_mode = "shift_reuse"` — design discussion deferred.
  • `eval_summary` rollup metric.

🤖 Generated with Claude Code

Eugene Vinitsky and others added 5 commits May 9, 2026 18:27
…ation yet)

Adds the new evaluator framework as pure additions:

  pufferlib/ocean/benchmark/evaluators/
    base.py            Evaluator + EvalResult dataclass
    multi_scenario.py  MultiScenarioEvaluator (replaces eval_multi_scenarios)
    human_replay.py    HumanReplayEvaluator (replay + control_sdc_only loop)
    behavior_class.py  BehaviorClassEvaluator (per-class nuPlan suite)
    wosac.py           Thin wrapper around the existing WOSACEvaluator
  pufferlib/ocean/benchmark/manager.py
    EvalManager: section discovery, inheritance chain, clean-eval macro,
    dotted-key flattening, inline + subprocess dispatch, wandb logging.

Plus tests/test_eval_manager.py for the config-merge logic.

See docs/eval_unification.md for the full design rationale.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-discovered evaluator sections, schema A (dotted keys). Each
behavior class is its own section inheriting from behaviors_defaults
template; gigaflow validation has separate metric-only and render
sections at different intervals.

driving_behaviours_eval.ini deleted — folded into drive.ini.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removed from pufferl.py (~990 lines):
  eval(), eval_multi_scenarios(), eval_multi_scenarios_render(),
  build_eval_overrides(), load_eval_multi_scenarios_config(),
  _swap_policy_obs_counts(), _render_driving_behaviours(),
  _export_metrics(), _log_eval_metrics(),
  verify_scenario_coverage(), verify_scenario_coverage_gigaflow().
  Plus the legacy eval block in PuffeRL.evaluate() and the
  driving_behaviours_eval.ini loader in load_config.

Removed from utils.py (~300 lines):
  run_human_replay_eval_in_subprocess,
  run_wosac_eval_in_subprocess,
  run_driving_behaviours_eval_in_subprocess.

PuffeRL.evaluate() now calls self._eval_manager.maybe_run() — single
unified path for all evals. main() wires `puffer eval --evaluator <name>
--out <path>` for both standalone and subprocess use.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…EPLAY in c_step

Add GOAL_ADVANCE_REGENERATE / GOAL_ADVANCE_SATURATE constants and a
goal_advance_mode field on Drive. c_step's last-goal branch now
dispatches on goal_advance_mode instead of simulation_mode.

Drive.__init__ accepts goal_advance_mode kwarg with auto-pick based
on simulation_mode (gigaflow → regenerate, replay → saturate) — same
behavior as before, but the choice is explicit and per-eval-overridable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 9, 2026 22:33
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

This PR refactors PufferDrive’s evaluation system by replacing multiple overlapping eval entry points with a unified Evaluator framework and an EvalManager that discovers [eval.<name>] sections from drive.ini and runs them inline or via subprocess. It also introduces a new C/Python goal_advance_mode knob to decouple last-goal behavior from simulation_mode.

Changes:

  • Added pufferlib/ocean/benchmark/evaluators/ (base Evaluator + concrete evaluators) and pufferlib/ocean/benchmark/manager.py (discovery, inheritance merge, clean macro, dispatch, logging).
  • Refactored pufferlib/pufferl.py eval CLI + training loop to route evaluation through EvalManager; removed legacy subprocess eval helpers from pufferlib/utils.py.
  • Updated drive.ini to [eval.<name>] suites (with templates + inheritance), removed driving_behaviours_eval.ini, and added goal_advance_mode through Python/C bindings.

Reviewed changes

Copilot reviewed 16 out of 17 changed files in this pull request and generated 9 comments.

Summary per file:

  • tests/test_eval_manager.py: New unit tests for eval-section discovery, inheritance merge, dotted-key expansion, and clean macro behavior.
  • README.md: Updates cluster install wording (NYU cluster specifics).
  • pufferlib/utils.py: Removes legacy subprocess eval helpers; retains video rendering helper.
  • pufferlib/pufferl.py: Routes eval through EvalManager; rewires CLI `puffer eval --evaluator ... --out ...`; removes legacy eval pipelines.
  • pufferlib/ocean/drive/drive.py: Adds Python-side goal_advance_mode and passes it into C env init kwargs.
  • pufferlib/ocean/drive/drive.h: Adds GOAL_ADVANCE_* constants + goal_advance_mode field; switches last-goal behavior to the new knob in c_step.
  • pufferlib/ocean/drive/binding.c: Unpacks goal_advance_mode into the C env struct.
  • pufferlib/ocean/benchmark/manager.py: New EvalManager with section discovery, inheritance merge, clean macro, inline/subprocess dispatch, and logging.
  • pufferlib/ocean/benchmark/evaluators/base.py: New Evaluator base class + EvalResult.
  • pufferlib/ocean/benchmark/evaluators/multi_scenario.py: New multi-scenario evaluator + optional EGL render pass.
  • pufferlib/ocean/benchmark/evaluators/human_replay.py: New human replay evaluator implementation.
  • pufferlib/ocean/benchmark/evaluators/behavior_class.py: New behavior-class evaluator (sampling + symlink-dir cleanup).
  • pufferlib/ocean/benchmark/evaluators/wosac.py: New WOSAC evaluator adapter around existing realism evaluator logic.
  • pufferlib/ocean/benchmark/evaluators/__init__.py: Registers evaluator types for manager lookup.
  • pufferlib/config/ocean/driving_behaviours_eval.ini: Deleted (folded into drive.ini).
  • pufferlib/config/ocean/drive.ini: Replaces [eval] with [eval.<name>] suites + templates/inheritance; adds WOSAC and behavior-class sections.
  • .gitignore: Un-ignores pufferlib/ocean/benchmark/** (was previously matched by benchmark*/).


Comment on lines +62 to +68
obs, _, _, _, info_list = vecenv.step(action_np)
if info_list:
    all_infos.extend(info_list)
# Stop once every bin has yielded one info to avoid double-counting
# on the second cycle through the dir.
if len(all_infos) >= num_maps:
    break
Comment thread pufferlib/ocean/benchmark/manager.py Outdated
Comment on lines +105 to +125
def _run_inline(self, ev: Evaluator, policy, env_name: str, global_step) -> EvalResult:
    args = self._build_eval_args(ev, env_name=env_name, global_step=global_step)

    package = args.get("package", "ocean")
    module_name = "pufferlib.ocean" if package == "ocean" else f"pufferlib.environments.{package}"
    env_module = importlib.import_module(module_name)
    make_env = env_module.env_creator(env_name)

    vec_kwargs = ev.vec_overrides()
    num_envs = int(vec_kwargs.get("num_envs", 1))
    env_kwargs_list = [args["env"] for _ in range(num_envs)]
    env_creators = [make_env] * num_envs
    env_args_list = [[]] * num_envs

    vec_call_kwargs = dict(vec_kwargs)
    vec_call_kwargs.setdefault("num_workers", num_envs)
    vec_call_kwargs.setdefault("batch_size", num_envs)

    vecenv = pufferlib.vector.make(
        env_creators, env_args=env_args_list, env_kwargs=env_kwargs_list, **vec_call_kwargs
    )
Comment on lines +132 to +137
def _run_subprocess(self, ev: Evaluator, env_name: str, global_step) -> EvalResult:
    out_path = Path(self.train_config.get("data_dir", ".")) / "eval_subprocess_out" / f"{ev.name}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cfg_path = out_path.with_suffix(".cfg.json")
    with open(cfg_path, "w") as f:
        json.dump({"name": ev.name, "global_step": global_step}, f)
Comment on lines +153 to +155
subprocess.run(cmd, check=True)
with open(out_path) as f:
    payload = json.load(f)
Comment thread pufferlib/pufferl.py
eval_args["load_model_path"] = os.path.join(
    self.config["data_dir"], experiment_name, "models", f"inline_epoch_{self.epoch}.pt"
)
[...]
    logger=self.logger,
    global_step=self.global_step,
Comment thread pufferlib/config/ocean/drive.ini Outdated
Comment on lines +266 to +284
env.map_dir = "/scratch/$USER/data/nuplan/nuplan_mini_train_bins"

[eval.behaviors_hard_stop]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/hard_stop"

[eval.behaviors_highway_straight]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/highway_straight"

[eval.behaviors_lane_change]
inherits = "behaviors_defaults"
type = "behavior_class"
enabled = true
env.map_dir = "/scratch/$USER/data/nuplan/categories_v021/lane_change"
Comment thread pufferlib/utils.py
Comment on lines 1 to 8
import os
import sys
import glob
import random
import shutil
import subprocess
import tempfile
import json
Comment on lines +71 to +84
def maybe_run(self, epoch: int, policy, env_name: str, logger=None, global_step=None) -> dict:
    """Called from the training loop. Runs every enabled evaluator
    whose `interval` divides `epoch`. Returns a dict of {eval_name → metrics}."""
    results = {}
    for ev in self.evaluators:
        if not ev.enabled:
            continue
        if ev.interval <= 0:
            continue
        if epoch % ev.interval != 0:
            continue
        res = self._run_one(ev, policy=policy, env_name=env_name, logger=logger, global_step=global_step)
        results[ev.name] = res
    return results
Comment on lines +126 to +127
try:
    res = ev.rollout(vecenv, policy, args)
Eugene Vinitsky and others added 20 commits May 9, 2026 18:46
The HumanReplay and MultiScenario rollouts shared ~80% of their step loop
(reset → forward_eval → step → collect infos → aggregate). Move it to
the base class as `_run_rollout_loop`. Subclasses override only the
hooks they actually need to diverge on:

  _initial_reset       (sync vs async reset)
  _maybe_reset_lstm    (per-scenario reset cadence)
  _should_stop         (termination condition)
  _flatten_infos       (multi-worker vs single-worker info shape)
  _aggregate_infos     (per-key mean is the default)
  _render_pass         (no-op default; MultiScenario implements EGL)

HumanReplayEvaluator now overrides only `env_overrides` + `_should_stop`.
BehaviorClassEvaluator still inherits HumanReplay; only adds the tmp
symlink dir for random sampling. WOSAC keeps its custom rollout —
its per-scene multi-rollout loop doesn't fit the default shape.
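The hook structure reads as a classic template method; a minimal sketch under the names above (bodies are illustrative, not the actual base class):

```python
class Evaluator:
    def _run_rollout_loop(self, vecenv, policy):
        obs = self._initial_reset(vecenv)
        infos = []
        while not self._should_stop(infos):
            action = policy(obs)
            obs, _, _, _, info = vecenv.step(action)
            infos.extend(self._flatten_infos(info))
            self._maybe_reset_lstm(policy)
        metrics = self._aggregate_infos(infos)
        self._render_pass()
        return metrics

    # Default hooks; subclasses override only where they diverge.
    def _initial_reset(self, vecenv):
        return vecenv.reset()

    def _maybe_reset_lstm(self, policy):
        pass

    def _should_stop(self, infos):
        return len(infos) >= 1

    def _flatten_infos(self, info):
        return list(info or [])

    def _aggregate_infos(self, infos):
        # Per-key mean is the default aggregation.
        keys = {k for i in infos for k in i}
        return {k: sum(i[k] for i in infos if k in i) / sum(k in i for i in infos)
                for k in keys}

    def _render_pass(self):
        pass
```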

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The benchmark/** unignore was over-broad — it pulled in pycache files
that should never be committed. Add an explicit re-ignore for
__pycache__ under that path.
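A plausible shape for the resulting .gitignore stanza (patterns are assumptions; the repo's actual rules may differ):

```gitignore
# Broad ignore elsewhere in the file:
benchmark*/
# Un-ignore the new package, but keep bytecode out of it:
!pufferlib/ocean/benchmark/
pufferlib/ocean/benchmark/**/__pycache__/
```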

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- eval.render_num_scenarios: explicit per-evaluator render budget;
  defaults to min(eval.num_scenarios, 3). Renders are expensive
  (mp4 encode + wandb upload) and shouldn't run at metric scale.
- Render path randomizes starting_map per epoch when not pinned,
  so successive renders show different bins from the dir instead
  of the first N alphabetically. Restores the old behavior from
  _render_driving_behaviours that the refactor lost.
- behaviors_defaults pins render_num_scenarios = 2 so 12 classes
  × 2 views × 2 scenarios = 48 mp4s/epoch (vs 1200 if it inherited
  num_scenarios=50).
- Test: render_num_scenarios is inheritable.

"Worst N" selection deferred — needs per-scenario scores piped out
of the metric rollout into the render call.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled fixes:

  EvalManager:
    - Accepts run_id at construction. Used to locate the per-run
      models/ directory.
    - latest_checkpoint(env_name) walks data_dir/<env>_<run_id>/models/
      for the newest model_*.pt. Falls back to train_config.load_model_path
      if no checkpoints exist yet.
    - has_subprocess_evals_at(epoch) reports whether any enabled
      subprocess evaluator would fire at that epoch.
    - _run_subprocess uses latest_checkpoint instead of train_config.load_model_path.

  PuffeRL.evaluate:
    - Calls save_checkpoint() before maybe_run() if any subprocess
      evaluator would fire. Mirrors the old run_driving_behaviours
      _eval_in_subprocess flow.
    - train() passes logger.run_id when constructing the manager.

For all-inline configs (today's drive.ini default) this is a no-op.
Activates when an evaluator is flipped to mode=subprocess.

Tests: latest_checkpoint picks the newest by ctime, falls back to
load_model_path when no models exist; has_subprocess_evals_at fires
only on enabled subprocess evaluators at matching intervals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unified-eval refactor doesn't actually need this knob, and pushing
it through the C struct + binding + drive.py at the same time bloats
this PR. Restoring the original `if (env->simulation_mode == SIMULATION_REPLAY)`
branch in c_step. The "promote implicit branches to explicit knobs"
audit is real work but lives in its own PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…erride stack

Adds 4 tests covering the regression-prone surface the parser-only
tests miss:

  test_maybe_run_dispatches_by_interval_and_enabled
    Stubs _run_one and verifies that maybe_run fires only enabled
    evaluators whose interval divides epoch. epoch=33 fires nothing,
    epoch=250 fires both 25-interval and 250-interval evaluators.

  test_flatten_infos_handles_shape_variations
    _flatten_infos must handle multi-worker (list of lists) AND
    PufferEnv (flat list) backends, plus None / empty entries —
    one bad isinstance check silently drops episode infos.

  test_behavior_class_cleanup_removes_symlink_dir
    Builds a real 5-bin map_dir, requests a 2-bin sample, verifies
    the tmp symlink dir gets created with 2 bins, then cleanup()
    removes it. tempfile.mkdtemp leftovers are a real footgun.

  test_eval_args_compose_train_section_and_clean_macro
    The full override stack: train baseline → section overrides →
    clean macro. Section beats baseline, explicit beats macro,
    untouched baseline survives.

25 tests total (was 21).
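The symlink-dir behavior exercised by test_behavior_class_cleanup_removes_symlink_dir could look roughly like this (a sketch assuming `.bin` map files; not the actual BehaviorClassEvaluator code):

```python
import os
import random
import shutil
import tempfile
from pathlib import Path

def make_sample_dir(map_dir, num_bins, seed=None):
    """Symlink a random subset of bins into a throwaway dir the env can point at."""
    bins = sorted(Path(map_dir).glob("*.bin"))
    picked = random.Random(seed).sample(bins, min(num_bins, len(bins)))
    tmp = Path(tempfile.mkdtemp(prefix="behavior_sample_"))
    for b in picked:
        os.symlink(b, tmp / b.name)
    return tmp

def cleanup_sample_dir(tmp):
    """Remove the symlink dir (links only; targets are untouched)."""
    shutil.rmtree(tmp, ignore_errors=True)
```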

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the heavier-eval cadence the rest of the new eval
sections use. 25-epoch interval was too aggressive for inline
gigaflow validation given the per-pass setup cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Time the rollout in Evaluator.rollout (base class) and inject
`eval_seconds` into the returned metrics dict. Manager's _log
posts it to wandb under {ev.name}/eval_seconds — wall-clock cost
per evaluator becomes a first-class panel.

Refactor WOSACEvaluator to override _run_rollout_loop instead of
rollout — now WOSAC also benefits from the timing without code
duplication.

Test: stub evaluator with a forced 20ms floor; verifies eval_seconds
lands in the result and the inner metrics survive.
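The timing hook can be sketched as (method names follow the commit message; the body is illustrative):

```python
import time

class Evaluator:
    def _run_rollout_loop(self):
        return {}  # subclasses do the real work

    def rollout(self):
        """Time the rollout in the base class and inject eval_seconds."""
        start = time.perf_counter()
        metrics = self._run_rollout_loop()
        metrics["eval_seconds"] = time.perf_counter() - start
        return metrics
```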

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The eval-mode resample path at drive.py:447 was missing goal_radius
in its kwargs to binding.shared, while the initial-spawn call at
line 293 has it. binding.shared's C side (binding.c:1545) requires
goal_radius via unpack(), so the resample crashes with
"Missing required keyword argument 'goal_radius'" once a scenario
batch completes and a new one is requested.

Latent bug — never triggered because the legacy [eval] section had
multi_scenario_eval = False as the default, so the eval rollout
path that triggers resample never fired in production. The new
EvalManager flips multi_scenario_eval-equivalent ([eval.validation_gigaflow])
on by default, surfacing the crash.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single section now does both: 250-scenario metric pass + 5-scenario
render pass via the render_num_scenarios knob. The split made sense
before render became a per-section flag; it's redundant now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PufferEnv backend in pufferlib.vector.make treats env_creator as a
single callable and passes env_args/env_kwargs to it directly (line
697 in vector.py). Multiprocessing/Serial backends expect lists.
Render must use PufferEnv (one ffmpeg pipe per env), so pass a
single creator + dict, not lists.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 8-map carla dir doesn't need 250 scenarios — gigaflow's C-side eval
already creates one internal env per scenario (capped at num_scenarios)
and steps them in a single batched rollout, so num_scenarios=8 covers
every map exactly once in parallel. Drop num_agents 512→400 to fill
exactly 8 × 50 slots with no wasted capacity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…o_length

Render path was falling back to env.scenario_length (3000 = 100s mp4 at
30fps) because render_max_steps wasn't a real config knob. Two fixes:

  1. multi_scenario.py: read eval.render_max_steps from args["eval"]
     where _build_eval_args puts evaluator-private fields, and default
     to 91 (~3 sec) — not scenario_length, which is the metric pass.

  2. drive.ini: set eval.render_max_steps = 91 on validation_gigaflow
     explicitly so the comment + value document the intent.

EGL render is ~3 fps wall-clock at 1080p, so 91 × 5 × 2 ≈ 5 min/render
pass instead of 28 min when defaulting to scenario_length=3000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The old MultiScenarioEvaluator forced Multiprocessing + async_reset by
default, which (a) added fork/IPC startup cost without throughput gain
at our scale, (b) broke render (one ffmpeg pipe per env requires
single-process), and (c) couldn't construct the vec env at all because
the manager passed the PufferEnv-shaped single creator to the list-shaped
backend dispatch. Three changes:

  1. multi_scenario.py: drop the vec_overrides() and _initial_reset()
     overrides that pinned Multiprocessing — inherit base-class PufferEnv
     + sync reset.

  2. manager.py:_run_inline: branch on backend so Multiprocessing remains
     a valid opt-in via [eval.<name>.vec] backend = "Multiprocessing".
     Useful for memory-bound replay sweeps, hetero-config evals, or
     async overlap on long rollouts — not needed for the 8-map carla
     validation eval that's the hot path today.

  3. The 8-map validation eval now spins up one Drive with 8 internal
     envs in-process, no fork.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… path

Two fixes for the multi-evaluator smoke test:

  1. multi_scenario._render_view: glob filter excludes mp4s that already
     existed in out_dir before this pass. The dir is shared across epochs
     and across views, so a bare `out_dir.glob('*.mp4')` was returning
     every mp4 from prior render passes too — turning epoch 12510's
     16-mp4 render into a 32-mp4 wandb upload (the 16 from epoch 12505
     were still on disk).

  2. drive.ini: replace `/scratch/$USER/data/nuplan/...` with the literal
     path. configparser doesn't expand env vars, and the manager doesn't
     either, so behavior evals were crashing on FileNotFoundError trying
     to open `/scratch/$USER/data/...`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously `{scenario_id}{view_suffix}.mp4` — each render epoch
overwrote the previous epoch's mp4 in place, making the prior fix's
existing-file snapshot filter incorrect (filter would exclude the
freshly-written file because Path equality matches the prior path).

Now `{scenario_id}_step{N}{view_suffix}.mp4` so:
  - successive eval epochs produce distinct mp4s (no overwrites).
  - wandb's render carousel shows one entry per epoch, letting the
    user watch policy evolve over training.
  - the return-paths glob is exact: just files matching this step's
    pattern, no snapshot-filtering trickery.
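The exact-glob idea can be sketched as (the pattern is assumed from the naming scheme above):

```python
from pathlib import Path

def collect_render_outputs(out_dir, global_step):
    """Return only this step's mp4s; prior epochs' files never match.
    A stricter pattern (e.g. anchoring the view suffix) may be needed
    if step numbers can prefix each other."""
    return sorted(Path(out_dir).glob(f"*_step{global_step}*.mp4"))
```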

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Behavior-class evaluators previously had render=true in config but no
_render_pass implementation, so the videos silently never appeared.
Three changes:

  1. base.py: lift _render_pass + _render_view from MultiScenarioEvaluator
     into Evaluator. The render loop is generic — fresh PufferEnv with
     render_mode=headless, ffmpeg pipe per active env per view, mp4s
     stamped with global_step in the filename.

  2. base.py: add _render_env_overrides hook so subclasses can tweak the
     render env (default = metric env + render_mode=headless). Render
     loop now caps internal-env render count at eval.render_num_scenarios
     instead of always rendering the full batch (the C kernel still steps
     the full batch — just fewer ffmpeg pipes). Drops the dead
     batch_size_eval write.

  3. multi_scenario.py: keeps the random-starting_map override (its only
     real difference from the default render path) — everything else is
     now inherited.

Result: behavior_class and human_replay inherit a working render path.
For 12 behavior classes with render_num_scenarios=2, render is bounded
by ~3 fps × 300 steps × 2 views ≈ 3.5 min/class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Every evaluator runs at the same global_step, so the previous shared
out_dir + step glob made each evaluator's _render_view re-collect every
earlier evaluator's mp4s into result.frames. wandb then logged
validation_gigaflow's videos under behaviors_*/render too.

Per-evaluator subdir (`mp4/<ev_name>/`) keeps each evaluator's render
output isolated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>