lift #4 scaffold: FluxVLA pi0.5 LIBERO-10 checkpoint eval + shared rollout helper by rylinjames · Pull Request #139 · FastCrest/reflex-vla

rylinjames · 2026-05-20T04:15:09Z

Summary

Scaffolds lift #4 of the fluxvla-lift-program — Modal pipeline to validate FluxVLA's published 97.85% LIBERO-10 number against reflex's export + serve path. Closes the customer-visible 64% vs 97.85% benchmark-comparison gap.

Two commits:

0d89dd6: scripts/modal_fluxvla_checkpoint_eval.py (559 LOC). 4-stage Modal pipeline (download → format conversion → reflex export → LIBERO eval). Pinned at FluxVLA upstream subdir pi05_paligemma_libero_10_full_finetune_bs64, file step-028548-epoch-18-loss=0.0111.safetensors. Apache 2.0 attribution preserved.
e4faf17: Extract LIBERO rollout helper to src/reflex/eval/libero_rollout.py (513 LOC). The existing rollout in modal_libero_pi05_decomposed.py was bound to @app.function and couldn't be called from another Modal app. Per CLAUDE.md "real fixes not band-aids" + "fix it now not later," it was extracted in a behavior-preserving refactor instead of being copy-pasted. modal_libero_pi05_decomposed.py drops 600 → 355 LOC; the new shared helper is callable from both scripts and from any future LIBERO-using Modal pipeline.

Two known stubs to surface on first fire

These are intentional + clearly marked. They surface on --smoke fire and inform the next iteration:

_convert_fluxvla_to_lerobot() — name_mapping = {} is empty because FluxVLA ships raw training safetensors, not lerobot-format. Script logs the first 10 state_dict keys + shapes on first fire so the next iteration fills in the rename map. ~1 hour of iteration after first-fire output lands.
HF org fastcrest — Day 2 republish step needs an HF org with write perms. Setup is one-time, ~10 min, blocks the republish step. Not blocking on this PR.
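
As a rough illustration of the first stub, the key-sampling log and the rename step could be sketched as below. This is a minimal sketch, not the code in the PR: the helper names `sample_keys` and `rename_keys`, the `FakeTensor` stand-in, and the example key names are all hypothetical, and the real FluxVLA key prefixes are unknown until the first fire prints them.

```python
from typing import Any, Dict, List, Mapping, Tuple


class FakeTensor:
    """Stand-in for a tensor; only the .shape attribute matters here."""

    def __init__(self, *shape: int) -> None:
        self.shape = shape


def sample_keys(state_dict: Mapping[str, Any], limit: int = 10) -> List[Tuple[str, tuple]]:
    """Return the first `limit` (key, shape) pairs, suitable for a log line."""
    items = list(state_dict.items())[:limit]
    return [(key, tuple(getattr(value, "shape", ()))) for key, value in items]


def rename_keys(state_dict: Mapping[str, Any], name_mapping: Mapping[str, str]) -> Dict[str, Any]:
    """Rewrite key prefixes; keys that match no prefix pass through unchanged."""
    renamed: Dict[str, Any] = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in name_mapping.items():
            if key.startswith(old_prefix):
                new_key = new_prefix + key[len(old_prefix):]
                break
        if new_key in renamed:
            # Two source keys mapping to one target would silently drop weights.
            raise ValueError(f"rename collision on {new_key!r}")
        renamed[new_key] = value
    return renamed


# Hypothetical keys, for illustration only.
raw = {
    "model.vision.proj.weight": FakeTensor(8, 4),
    "model.action_head.bias": FakeTensor(7),
}
for key, shape in sample_keys(raw):
    print(key, shape)

unchanged = rename_keys(raw, {})                    # empty map, as in the stub
renamed = rename_keys(raw, {"model.": "policy."})   # assumed prefix pair
```

With an empty mapping every key passes through untouched, which is why the first fire can only report the layout rather than produce a loadable lerobot-format checkpoint.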

Cost authorization needed

First-fire: modal run scripts/modal_fluxvla_checkpoint_eval.py --smoke (~$5).
Expected total to land lift #4: ~$30-40 across 3-4 iterations (smoke → name_mapping fix → second smoke → full N=50/task × 4 suites).

Plan doc

reflex_context/features/02_distill/fluxvla-checkpoint-republish_plan.md — day-by-day execution plan with per-day acceptance criteria.

Test plan

python -c "import ast; ast.parse(open('scripts/modal_libero_pi05_decomposed.py').read())" — clean (already verified locally)
python -c "import ast; ast.parse(open('src/reflex/eval/libero_rollout.py').read())" — clean (already verified locally)
CI green (pytest + doctor regression smoke; no test cases removed)
First-fire --smoke surfaces FluxVLA state_dict layout in logs (authorize ~$5 Modal)
Second-fire --smoke after name_mapping fill validates the conversion produces a loadable lerobot-format checkpoint (~$5)
Full fire validates 4-suite LIBERO numbers; experiment note lands at 03_experiments/2026-05-2X-fluxvla-checkpoint-eval.md (~$15-20)
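
The two `ast.parse` one-liners in the test plan can be folded into a small reusable check. The sketch below assumes nothing beyond the standard library; the function names `check_source` and `check_files` are illustrative and not part of the repo.

```python
import ast
from pathlib import Path
from typing import Dict, Iterable, Optional


def check_source(source: str, filename: str = "<string>") -> Optional[str]:
    """Return None if `source` parses, else a short description of the syntax error."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return f"{filename}:{exc.lineno}: {exc.msg}"
    return None


def check_files(paths: Iterable[str]) -> Dict[str, Optional[str]]:
    """Syntax-check each file on disk and map path -> error string (or None if clean)."""
    return {
        path: check_source(Path(path).read_text(encoding="utf-8"), filename=path)
        for path in paths
    }
```

Usage would be `check_files(["scripts/modal_libero_pi05_decomposed.py", "src/reflex/eval/libero_rollout.py"])`. A clean parse only proves the files are syntactically valid; it says nothing about names that are used but never imported.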

🤖 Generated with Claude Code

The 2026-05-20 rollout refactor removed the inline imports that the extracted helper now owns, including the import of Path, but left a `Path(decomposed_dir).name` call on line 278 that still needs it. CI didn't catch this because pytest doesn't import Modal scripts; it would have surfaced as a NameError on first Modal fire. Caught during the pre-merge self-review pass of PR #139.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
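
This class of bug passes an `ast.parse` check, because a missing import is not a syntax error. A coarse static check for it could look like the sketch below. It is an assumption-laden illustration, not something the PR adds: `unbound_uses` is a hypothetical helper, and it ignores scoping (an import anywhere in the file counts as binding the name), so it is a smoke check rather than a substitute for a real linter.

```python
import ast
from typing import List


def unbound_uses(source: str, name: str) -> List[int]:
    """Line numbers where `name` is read, if nothing in the module ever binds it.

    Coarse by design: any import, assignment, def, class, or argument with this
    name anywhere in the file counts as a binding, regardless of scope.
    """
    tree = ast.parse(source)
    bound = False
    loads: List[int] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                local = alias.asname or alias.name.split(".")[0]
                if local == name:
                    bound = True
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.name == name:
                bound = True
        elif isinstance(node, ast.arg):
            if node.arg == name:
                bound = True
        elif isinstance(node, ast.Name) and node.id == name:
            if isinstance(node.ctx, ast.Store):
                bound = True
            elif isinstance(node.ctx, ast.Load):
                loads.append(node.lineno)
    return [] if bound else sorted(loads)


# The shape of the bug described above: Path is used but never imported.
broken = "def label(decomposed_dir):\n    return Path(decomposed_dir).name\n"
fixed = "from pathlib import Path\n" + broken
```

Running `unbound_uses(broken, "Path")` reports the offending line, while the fixed source reports nothing.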

LIBERO-10 eval pipeline for FluxVLA's pi05_paligemma_libero_10_full_finetune_bs64 checkpoint.

Stage 1 (HF download) + Stage 2 (FluxVLA→lerobot format conversion) + Stage 3 (reflex export with parity gate) + Stage 4 (LIBERO eval at N=50/task across 4 subsuites) wired end-to-end. Matches the proven image recipe + LIBERO constants from modal_libero_pi05_decomposed.py.

Two intentional stubs marked TODO for first-fire iteration:
- _convert_fluxvla_to_lerobot: name_mapping dict is empty until we observe FluxVLA's actual state_dict key prefixes on first fire (log line emits sample of 10 keys + shapes)
- _run_libero_suite: needs the existing run_decomposed_libero rollout loop factored out of its @app.function decorator so we can call it from a separate Modal app. Refactor planned for next turn.

Cost estimate: ~$15-20 per full N=50 fire; expect ~3 iterations to converge (first surfaces state_dict layout, second tunes conversion, third measures final number). Total $45-60 worst case across iterations.

Lift #4 of the fluxvla-lift-program. Plan doc: reflex_context/features/02_distill/fluxvla-checkpoint-republish_plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
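
The commit message does not spell out what the Stage 3 parity gate checks. One common reading is that the exported model's outputs must match the reference model's outputs within a tolerance before the eval is allowed to run. The sketch below illustrates that reading only; the names `parity_gate` and `ParityError` and the `atol=1e-4` threshold are assumptions, not values taken from reflex.

```python
import math
from typing import Sequence


class ParityError(RuntimeError):
    """Raised when exported outputs drift too far from the reference outputs."""


def parity_gate(
    reference: Sequence[float],
    exported: Sequence[float],
    atol: float = 1e-4,  # assumed tolerance, not the project's actual threshold
) -> float:
    """Return the max absolute difference, or raise ParityError if the gate fails."""
    if len(reference) != len(exported):
        raise ParityError(f"length mismatch: {len(reference)} vs {len(exported)}")
    diffs = [abs(r - e) for r, e in zip(reference, exported)]
    if any(math.isnan(d) for d in diffs):
        # NaN compares false against everything, so check for it explicitly.
        raise ParityError("NaN in outputs")
    max_diff = max(diffs, default=0.0)
    if max_diff > atol:
        raise ParityError(f"max abs diff {max_diff:.3e} exceeds atol {atol:.1e}")
    return max_diff
```

Failing loudly before Stage 4 keeps a broken export from spending the cost of a full LIBERO fire on numbers that could not be trusted anyway.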

…t.py

Lift #4 prerequisite per fluxvla-lift-program. The 370-LOC rollout body inside modal_libero_pi05_decomposed.py was @app.function-bound and couldn't be called from another Modal app. Per CLAUDE.md "real fixes not band-aids" + "fix it now not later," extracted to a shared module instead of copy-pasting.

Changes:
- src/reflex/eval/libero_rollout.py (new, 513 LOC):
  - run_libero_rollout(): LIBERO env + per-step preprocessor → inference → postprocessor pipeline + action chunk dispatch + video frame capture + aggregate results. Takes an InferenceProtocol-compatible object so future exporters (DreamZero, fast-kernels Pi0.5, GR00T DiT) can swap in.
  - load_pi05_policy_and_processors(): extracted policy + processor pipeline loader. Handles SnapFlow-student vs HF-fallback dispatch + state-out preprocessor swap. Both Modal scripts call into it.
  - TASK_SUITE_MAX_STEPS + LIBERO_DUMMY_ACTION constants moved here too (shared with both scripts).
  - All LIBERO + mujoco imports are LAZY (inside the function body) so the reflex package itself does not depend on LIBERO; only callers that actually run a rollout pay the dep cost.
- scripts/modal_libero_pi05_decomposed.py (600 → 355 LOC, -245):
  - Replaced inline rollout body with run_libero_rollout() call
  - Replaced inline policy + processor loader with load_pi05_policy_and_processors() call
  - Behavior is bit-identical: same processor pipeline, same Pi05DecomposedInference invocation, same per-task aggregation, same video frame capture.
  - Caller-specific metadata (cache_mode, cache_ttl_sec, phash_hamming) added to results dict after the rollout returns.
- scripts/modal_fluxvla_checkpoint_eval.py: _run_libero_suite() stub wired up. Was NotImplementedError; now delegates to run_libero_rollout() via load_pi05_policy_and_processors() + Pi05DecomposedInference. Lift #4 first-fire is now technically possible (still gated on _convert_fluxvla_to_lerobot's empty name_mapping dict which fills in on first-fire log inspection).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
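
The design described in this commit, a plain function that accepts any InferenceProtocol-compatible object and dispatches action chunks, can be sketched in miniature as follows. Only the name InferenceProtocol comes from the commit message; the `predict_chunk` method, `ToyEnv`, `ConstantPolicy`, and the `run_rollout` signature are hypothetical stand-ins, and the real helper additionally runs pre/postprocessors, captures video frames, and lazily imports LIBERO inside the function body.

```python
from typing import Any, Dict, Protocol, Sequence, Tuple


class InferenceProtocol(Protocol):
    """Assumed shape: anything that maps an observation to a chunk of actions."""

    def predict_chunk(self, observation: Dict[str, Any]) -> Sequence[Sequence[float]]:
        ...


class ToyEnv:
    """Stand-in for a LIBERO env: succeeds once the summed first action dim reaches `goal`."""

    def __init__(self, goal: float = 3.0) -> None:
        self.goal = goal
        self.total = 0.0

    def reset(self) -> Dict[str, Any]:
        self.total = 0.0
        return {"total": self.total}

    def step(self, action: Sequence[float]) -> Tuple[Dict[str, Any], bool]:
        self.total += action[0]
        return {"total": self.total}, self.total >= self.goal


def run_rollout(policy: InferenceProtocol, env: ToyEnv, max_steps: int) -> Dict[str, Any]:
    """Query the policy for a chunk, execute it step by step, repeat until done."""
    obs = env.reset()
    steps = 0
    success = False
    while steps < max_steps and not success:
        chunk = policy.predict_chunk(obs)
        if not chunk:
            break  # an empty chunk would otherwise loop forever
        for action in chunk:
            obs, success = env.step(action)
            steps += 1
            if success or steps >= max_steps:
                break
    return {"success": success, "steps": steps}


class ConstantPolicy:
    """Toy policy: always emits a two-action chunk."""

    def predict_chunk(self, observation: Dict[str, Any]) -> Sequence[Sequence[float]]:
        return [[1.0], [1.0]]


result = run_rollout(ConstantPolicy(), ToyEnv(goal=3.0), max_steps=10)
```

Because `run_rollout` is an ordinary function with no Modal decorator, any number of Modal apps can import and call it, and a caller can attach its own metadata to the returned dict afterwards, which is the pattern the commit describes for cache_mode and the other caller-specific fields.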


rylinjames mentioned this pull request May 20, 2026

test(decomposed-server): fix np.float32 vs python-float equality flake #142

Merged

rylinjames mentioned this pull request May 20, 2026

fix(deps): scope numpy<2 upper bound to aarch64 (Jetson) only #143

Merged

rylinjames and others added 3 commits May 20, 2026 07:42

rylinjames force-pushed the lift/fluxvla-checkpoint-eval-scaffold branch from 90a1d05 to a9a582e Compare May 20, 2026 11:42

rylinjames merged commit d1057e6 into main May 20, 2026
6 checks passed

rylinjames deleted the lift/fluxvla-checkpoint-eval-scaffold branch May 20, 2026 16:15

rylinjames merged 3 commits into main from lift/fluxvla-checkpoint-eval-scaffold
