lift #4 scaffold: FluxVLA pi0.5 LIBERO-10 checkpoint eval + shared rollout helper #139
Merged
rylinjames added a commit that referenced this pull request on May 20, 2026
The 2026-05-20 rollout refactor removed all the inline imports that the extracted helper now owns, but left a `Path(decomposed_dir).name` call on line 278 that still needs `Path`. CI didn't catch this because pytest doesn't import Modal scripts; it would have surfaced as a NameError on the first Modal fire. Caught during the pre-merge self-review pass of PR #139.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
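A syntax-only check or a test suite that never imports the script will not see a name that is used but never bound. The sketch below is one coarse way to flag that class of bug with only the standard library. It is not what this repository's CI runs; `unbound_names` and `SAMPLE` are hypothetical names, and the check ignores scoping entirely (a name bound anywhere in the file counts as bound), so it can miss real problems that a linter such as pyflakes would catch.

```python
import ast
import builtins


def unbound_names(source: str) -> set[str]:
    """Return names that are read in `source` but never bound anywhere in it.

    Scope-insensitive on purpose: this is a smoke check, not a linter.
    """
    tree = ast.parse(source)
    bound = set(dir(builtins))
    loaded = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            for alias in node.names:
                # `import a.b` binds `a`; `import a as x` binds `x`.
                bound.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.ExceptHandler) and node.name:
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
            else:  # Store or Del contexts bind (or unbind) the name
                bound.add(node.id)
    return loaded - bound


# Illustrative snippet: `Path` is called but its import is gone.
SAMPLE = '''
import json

def export_name(decomposed_dir):
    return Path(decomposed_dir).name
'''

print(unbound_names(SAMPLE))  # {'Path'}
```

Prepending `from pathlib import Path` to `SAMPLE` makes the function return an empty set.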
LIBERO-10 eval pipeline for FluxVLA's pi05_paligemma_libero_10_full_finetune_bs64 checkpoint. Stage 1 (HF download) + Stage 2 (FluxVLA→lerobot format conversion) + Stage 3 (reflex export with parity gate) + Stage 4 (LIBERO eval at N=50/task across 4 subsuites) wired end-to-end. Matches the proven image recipe + LIBERO constants from modal_libero_pi05_decomposed.py.

Two intentional stubs marked TODO for first-fire iteration:
- _convert_fluxvla_to_lerobot: name_mapping dict is empty until we observe FluxVLA's actual state_dict key prefixes on first fire (log line emits sample of 10 keys + shapes)
- _run_libero_suite: needs the existing run_decomposed_libero rollout loop factored out of its @app.function decorator so we can call it from a separate Modal app. Refactor planned for next turn.

Cost estimate: ~$15-20 per full N=50 fire; expect ~3 iterations to converge (first surfaces state_dict layout, second tunes conversion, third measures final number). Total $45-60 worst case across iterations.

Lift #4 of the fluxvla-lift-program. Plan doc: reflex_context/features/02_distill/fluxvla-checkpoint-republish_plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
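The first stub in this commit message (log a sample of keys and shapes, then fill in a rename map) can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: `sample_keys`, `rename_by_prefix`, and every key and prefix in the demo are hypothetical, and plain objects with a `.shape` attribute stand in for real tensors so the snippet runs without any ML dependency.

```python
from types import SimpleNamespace


def sample_keys(state_dict, n=10):
    """Return the first `n` (key, shape) pairs, e.g. for a first-fire log line."""
    return [(k, tuple(v.shape)) for k, v in list(state_dict.items())[:n]]


def rename_by_prefix(state_dict, name_mapping):
    """Rewrite key prefixes per `name_mapping` (old prefix -> new prefix).

    The longest matching prefix wins; keys with no matching prefix pass
    through unchanged, so an empty mapping is a no-op.
    """
    prefixes = sorted(name_mapping, key=len, reverse=True)
    out = {}
    for key, value in state_dict.items():
        new_key = key
        for old in prefixes:
            if key.startswith(old):
                new_key = name_mapping[old] + key[len(old):]
                break
        if new_key in out:
            # Two source keys collapsing onto one target would silently drop weights.
            raise KeyError(f"rename collision on {new_key!r}")
        out[new_key] = value
    return out


# Hypothetical keys and prefixes, for illustration only.
fake_state_dict = {
    "vlm.backbone.layer0.weight": SimpleNamespace(shape=(4, 4)),
    "action_head.proj.bias": SimpleNamespace(shape=(4,)),
}
name_mapping = {"vlm.backbone.": "model.paligemma."}

print(sample_keys(fake_state_dict, n=1))
renamed = rename_by_prefix(fake_state_dict, name_mapping)
print(list(renamed))
```

Raising on a collision rather than overwriting keeps a bad mapping from producing a checkpoint that loads but is missing weights.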
…t.py

Lift #4 prerequisite per fluxvla-lift-program. The 370-LOC rollout body inside modal_libero_pi05_decomposed.py was @app.function-bound and couldn't be called from another Modal app. Per CLAUDE.md "real fixes not band-aids" + "fix it now not later," extracted to a shared module instead of copy-pasting.

Changes:
- src/reflex/eval/libero_rollout.py (new, 513 LOC):
  - run_libero_rollout(): LIBERO env + per-step preprocessor → inference → postprocessor pipeline + action chunk dispatch + video frame capture + aggregate results. Takes an InferenceProtocol-compatible object so future exporters (DreamZero, fast-kernels Pi0.5, GR00T DiT) can swap in.
  - load_pi05_policy_and_processors(): extracted policy + processor pipeline loader. Handles SnapFlow-student vs HF-fallback dispatch + state-out preprocessor swap. Both Modal scripts call into it.
  - TASK_SUITE_MAX_STEPS + LIBERO_DUMMY_ACTION constants moved here too (shared by both scripts).
  - All LIBERO + mujoco imports are LAZY (inside the function body) so the reflex package itself does not depend on LIBERO; only callers that actually run a rollout pay the dep cost.
- scripts/modal_libero_pi05_decomposed.py (600 → 355 LOC, -245):
  - Replaced inline rollout body with run_libero_rollout() call
  - Replaced inline policy + processor loader with load_pi05_policy_and_processors() call
  - Behavior is bit-identical: same processor pipeline, same Pi05DecomposedInference invocation, same per-task aggregation, same video frame capture.
  - Caller-specific metadata (cache_mode, cache_ttl_sec, phash_hamming) added to results dict after the rollout returns.
- scripts/modal_fluxvla_checkpoint_eval.py: _run_libero_suite() stub wired up. Was NotImplementedError; now delegates to run_libero_rollout() via load_pi05_policy_and_processors() + Pi05DecomposedInference. Lift #4 first-fire is now technically possible (still gated on _convert_fluxvla_to_lerobot's empty name_mapping dict, which fills in on first-fire log inspection).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
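The shape of that extraction (a rollout function that accepts any protocol-compatible inference object, dispatches action chunks, and imports the simulator lazily) might look something like the sketch below. It is not the contents of libero_rollout.py: the `predict` method on `InferenceProtocol`, `ToyEnv`, `ConstantPolicy`, `run_rollout`, and the `sim_backend` module are all assumptions made for illustration, and the toy environment replaces LIBERO entirely.

```python
from typing import Any, Protocol, Sequence


class InferenceProtocol(Protocol):
    """Hypothetical interface: any object with a matching `predict` qualifies."""

    def predict(self, observation: dict[str, Any]) -> Sequence[Sequence[float]]: ...


class ToyEnv:
    """Stand-in simulator: succeeds once the summed first action dim reaches `goal`."""

    def __init__(self, goal: float = 3.0):
        self.goal, self.total = goal, 0.0

    def reset(self) -> dict[str, Any]:
        self.total = 0.0
        return {"state": self.total}

    def step(self, action: Sequence[float]):
        self.total += action[0]
        return {"state": self.total}, self.total >= self.goal


def run_rollout(policy: InferenceProtocol, env=None, max_steps: int = 10) -> dict:
    if env is None:
        # Lazy import: only callers that need the real simulator pay for it.
        # `sim_backend.make_env` is a placeholder, not a real LIBERO API.
        from sim_backend import make_env
        env = make_env()
    obs = env.reset()
    steps = 0
    while steps < max_steps:
        chunk = policy.predict(obs)  # one inference call yields a chunk of actions
        if len(chunk) == 0:
            raise ValueError("policy returned an empty action chunk")
        for action in chunk:
            obs, done = env.step(action)
            steps += 1
            if done:
                return {"success": True, "steps": steps}
            if steps >= max_steps:
                break
    return {"success": False, "steps": steps}


class ConstantPolicy:
    """Toy policy: always emits a chunk of two one-dimensional actions."""

    def predict(self, observation):
        return [[1.0], [1.0]]


print(run_rollout(ConstantPolicy(), ToyEnv(goal=3.0)))  # {'success': True, 'steps': 3}
```

Because the function depends only on the structural protocol, a different exporter's inference object can be passed in without changing the rollout loop.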
90a1d05 to a9a582e
Summary
Scaffolds lift #4 of the fluxvla-lift-program — Modal pipeline to validate FluxVLA's published 97.85% LIBERO-10 number against reflex's export + serve path. Closes the customer-visible 64% vs 97.85% benchmark-comparison gap.
Two commits:
- 0d89dd6: `scripts/modal_fluxvla_checkpoint_eval.py` (559 LOC). 4-stage Modal pipeline (download → format conversion → reflex export → LIBERO eval). Pinned at FluxVLA upstream subdir `pi05_paligemma_libero_10_full_finetune_bs64`, file `step-028548-epoch-18-loss=0.0111.safetensors`. Apache 2.0 attribution preserved.
- e4faf17: Extract LIBERO rollout helper to `src/reflex/eval/libero_rollout.py` (513 LOC). The existing `modal_libero_pi05_decomposed.py` `@app.function`-bound rollout couldn't be called from another Modal app. Per CLAUDE.md "real fixes not band-aids" + "fix it now not later," refactored properly (in-place behavior-preserving) instead of copy-pasting. `modal_libero_pi05_decomposed.py` drops 600 → 355 LOC; new shared helper is callable from both scripts and (future) any other LIBERO-using Modal pipeline.

Two known stubs to surface on first fire
These are intentional + clearly marked. They surface on `--smoke` fire and inform the next iteration:
- `_convert_fluxvla_to_lerobot()`: `name_mapping = {}` is empty because FluxVLA ships raw training safetensors, not lerobot-format. Script logs the first 10 state_dict keys + shapes on first fire so the next iteration fills in the rename map. ~1 hour of iteration after first-fire output lands.
- HF org `fastcrest`: Day 2 republish step needs an HF org with write perms. Setup is one-time, ~10 min, blocks the republish step. Not blocking on this PR.

Cost authorization needed
First-fire: `modal run scripts/modal_fluxvla_checkpoint_eval.py --smoke` (~$5).

Expected total to land lift #4: ~$30-40 across 3-4 iterations (smoke → name_mapping fix → second smoke → full N=50/task × 4 suites).
Plan doc
`reflex_context/features/02_distill/fluxvla-checkpoint-republish_plan.md`: day-by-day execution plan with per-day acceptance criteria.

Test plan
- `python -c "import ast; ast.parse(open('scripts/modal_libero_pi05_decomposed.py').read())"`: clean (already verified locally)
- `python -c "import ast; ast.parse(open('src/reflex/eval/libero_rollout.py').read())"`: clean (already verified locally)
- `--smoke` surfaces FluxVLA state_dict layout in logs (authorize ~$5 Modal)
- `--smoke` after name_mapping fill validates the conversion produces a loadable lerobot-format checkpoint (~$5)
- `03_experiments/2026-05-2X-fluxvla-checkpoint-eval.md` (~$15-20)

🤖 Generated with Claude Code