
lift #4 scaffold: FluxVLA pi0.5 LIBERO-10 checkpoint eval + shared rollout helper #139

Merged: rylinjames merged 3 commits into main from lift/fluxvla-checkpoint-eval-scaffold on May 20, 2026

@rylinjames
Collaborator

Summary

Scaffolds lift #4 of the fluxvla-lift-program: a Modal pipeline to validate FluxVLA's published 97.85% LIBERO-10 number against reflex's export + serve path. Targets the customer-visible 64% vs 97.85% benchmark-comparison gap.

Two commits:

  1. 0d89dd6: scripts/modal_fluxvla_checkpoint_eval.py (559 LOC), a 4-stage Modal pipeline (download → format conversion → reflex export → LIBERO eval). Pinned to the FluxVLA upstream subdir pi05_paligemma_libero_10_full_finetune_bs64, file step-028548-epoch-18-loss=0.0111.safetensors. Apache 2.0 attribution preserved.

  2. e4faf17: Extract the LIBERO rollout helper to src/reflex/eval/libero_rollout.py (513 LOC). The existing rollout in modal_libero_pi05_decomposed.py was bound to @app.function and couldn't be called from another Modal app. Per CLAUDE.md ("real fixes not band-aids" and "fix it now not later"), it was refactored properly (in place, behavior-preserving) instead of copy-pasted. modal_libero_pi05_decomposed.py drops from 600 to 355 LOC; the new shared helper is callable from both scripts and from any future LIBERO-using Modal pipeline.

Two known stubs to surface on first fire

These are intentional + clearly marked. They surface on --smoke fire and inform the next iteration:

  • _convert_fluxvla_to_lerobot(): name_mapping = {} is empty because FluxVLA ships raw training safetensors, not lerobot-format checkpoints. The script logs the first 10 state_dict keys + shapes on first fire so the next iteration can fill in the rename map. Expect ~1 hour of iteration after the first-fire output lands.

  • HF org fastcrest: the Day 2 republish step needs an HF org with write perms. Setup is one-time (~10 min) and blocks only the republish step, not this PR.
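
For reviewers who want a picture of what filling in name_mapping might look like, here is a minimal sketch, not the script's actual code. It assumes the rename turns out to be a prefix rewrite, which is unknown until the first-fire logs land. The function names and the example prefixes are hypothetical, and plain shape tuples stand in for real tensors so the sketch runs without torch or safetensors.

```python
from typing import Dict, List, Mapping, Tuple, TypeVar

T = TypeVar("T")


def summarize_keys(shapes: Mapping[str, Tuple[int, ...]], limit: int = 10) -> List[str]:
    """Return up to `limit` "key: shape" lines in insertion order, for logging."""
    return [f"{key}: {tuple(shape)}" for key, shape in list(shapes.items())[:limit]]


def rename_keys(state_dict: Mapping[str, T], prefix_mapping: Mapping[str, str]) -> Dict[str, T]:
    """Rewrite key prefixes; keys that match no prefix pass through unchanged."""
    renamed: Dict[str, T] = {}
    for key, value in state_dict.items():
        new_key = key
        for old_prefix, new_prefix in prefix_mapping.items():
            if key.startswith(old_prefix):
                new_key = new_prefix + key[len(old_prefix):]
                break  # first matching prefix wins
        if new_key in renamed:
            # Two source keys mapping to one target would silently drop a weight.
            raise ValueError(f"rename collision on {new_key!r}")
        renamed[new_key] = value
    return renamed


if __name__ == "__main__":
    # Hypothetical keys and prefixes, for illustration only.
    demo = {"vlm.layer.weight": (4, 4), "head.bias": (4,)}
    print("\n".join(summarize_keys(demo)))
    print(rename_keys(demo, {"vlm.": "model.paligemma."}))
```

With an empty mapping the function is an identity pass, which matches the current stub's behavior of leaving keys untouched; the collision check is an added safeguard that the PR does not claim to have.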

Cost authorization needed

First-fire: modal run scripts/modal_fluxvla_checkpoint_eval.py --smoke (~$5).
Expected total to land lift #4: ~$30-40 across 3-4 iterations (smoke → name_mapping fix → second smoke → full N=50/task × 4 suites).

Plan doc

reflex_context/features/02_distill/fluxvla-checkpoint-republish_plan.md — day-by-day execution plan with per-day acceptance criteria.

Test plan

  • python -c "import ast; ast.parse(open('scripts/modal_libero_pi05_decomposed.py').read())" — clean (already verified locally)
  • python -c "import ast; ast.parse(open('src/reflex/eval/libero_rollout.py').read())" — clean (already verified locally)
  • CI green (pytest + doctor regression smoke; no test cases removed)
  • First-fire --smoke surfaces FluxVLA state_dict layout in logs (authorize ~$5 Modal)
  • Second-fire --smoke after name_mapping fill validates the conversion produces a loadable lerobot-format checkpoint (~$5)
  • Full fire validates 4-suite LIBERO numbers; experiment note lands at 03_experiments/2026-05-2X-fluxvla-checkpoint-eval.md (~$15-20)

🤖 Generated with Claude Code

rylinjames added a commit that referenced this pull request May 20, 2026
The 2026-05-20 rollout refactor removed all the inline imports that the
extracted helper now owns, but left a `Path(decomposed_dir).name` call
on line 278 that still uses Path, whose import went away with the rest.
CI didn't catch this because pytest doesn't import Modal scripts; it
would have surfaced as a NameError on first Modal fire.

Caught during pre-merge self-review pass of PR #139.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
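
The gap described in this commit message can be reproduced in a few lines. The sketch below is illustrative only: the snippet and function names are hypothetical, not the script's code. It shows that an `ast.parse` check (as used in the test plan) accepts a function that references an unimported name, and that the NameError only appears when the function is actually called.

```python
import ast

# Hypothetical stand-in for the leftover call; note there is no pathlib import.
SNIPPET = '''
def checkpoint_name(decomposed_dir):
    return Path(decomposed_dir).name
'''


def parses_cleanly(source: str) -> bool:
    """True if the source is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def call_raises_name_error(source: str, func_name: str, *args: object) -> bool:
    """Define the source in a fresh namespace, call func_name, report NameError."""
    namespace: dict = {}
    exec(compile(source, "<snippet>", "exec"), namespace)  # definition alone succeeds
    try:
        namespace[func_name](*args)
    except NameError:
        return True
    return False


if __name__ == "__main__":
    print("ast.parse clean:", parses_cleanly(SNIPPET))
    print("NameError on call:", call_raises_name_error(SNIPPET, "checkpoint_name", "/tmp/x"))
```

A static linter that reports undefined names would flag this class of bug without running the script; whether the repo's CI runs one over scripts/ is not stated here.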
rylinjames and others added 3 commits May 20, 2026 07:42
LIBERO-10 eval pipeline for FluxVLA's pi05_paligemma_libero_10_full_finetune_bs64
checkpoint. Stage 1 (HF download) + Stage 2 (FluxVLA→lerobot format conversion)
+ Stage 3 (reflex export with parity gate) + Stage 4 (LIBERO eval at N=50/task
across 4 subsuites) wired end-to-end. Matches the proven image recipe + LIBERO
constants from modal_libero_pi05_decomposed.py.

Two intentional stubs marked TODO for first-fire iteration:
- _convert_fluxvla_to_lerobot: name_mapping dict is empty until we observe
  FluxVLA's actual state_dict key prefixes on first fire (log line emits sample
  of 10 keys + shapes)
- _run_libero_suite: needs the existing run_decomposed_libero rollout loop
  factored out of its @app.function decorator so we can call it from a
  separate Modal app. Refactor planned for next turn.

Cost estimate: ~$15-20 per full N=50 fire; expect ~3 iterations to converge
(first surfaces state_dict layout, second tunes conversion, third measures
final number). Total $45-60 worst case across iterations.

Lift #4 of the fluxvla-lift-program. Plan doc:
reflex_context/features/02_distill/fluxvla-checkpoint-republish_plan.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t.py

Lift #4 prerequisite per fluxvla-lift-program. The 370-LOC rollout body
inside modal_libero_pi05_decomposed.py was @app.function-bound and
couldn't be called from another Modal app. Per CLAUDE.md "real fixes
not band-aids" + "fix it now not later," extracted to a shared module
instead of copy-pasting.

Changes:

- src/reflex/eval/libero_rollout.py (new, 513 LOC):
  - run_libero_rollout() — LIBERO env + per-step preprocessor → inference
    → postprocessor pipeline + action chunk dispatch + video frame capture
    + aggregate results. Takes an InferenceProtocol-compatible object so
    future exporters (DreamZero, fast-kernels Pi0.5, GR00T DiT) can swap in.
  - load_pi05_policy_and_processors() — extracted policy + processor
    pipeline loader. Handles SnapFlow-student vs HF-fallback dispatch +
    state-out preprocessor swap. Both Modal scripts call into it.
  - TASK_SUITE_MAX_STEPS + LIBERO_DUMMY_ACTION constants moved here too
    (shared with both scripts).
  - All LIBERO + mujoco imports are LAZY (inside the function body) so
    the reflex package itself does not depend on LIBERO; only callers
    that actually run a rollout pay the dep cost.

- scripts/modal_libero_pi05_decomposed.py (600 → 355 LOC, -245):
  - Replaced inline rollout body with run_libero_rollout() call
  - Replaced inline policy + processor loader with
    load_pi05_policy_and_processors() call
  - Behavior is bit-identical — same processor pipeline, same Pi05DecomposedInference
    invocation, same per-task aggregation, same video frame capture.
  - Caller-specific metadata (cache_mode, cache_ttl_sec, phash_hamming)
    added to results dict after the rollout returns.

- scripts/modal_fluxvla_checkpoint_eval.py: _run_libero_suite() stub
  wired up. Was NotImplementedError; now delegates to run_libero_rollout()
  via load_pi05_policy_and_processors() + Pi05DecomposedInference. Lift
  #4 first-fire is now technically possible (still gated on
  _convert_fluxvla_to_lerobot's empty name_mapping dict which fills in
  on first-fire log inspection).
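
The two design points in the change list above (a protocol-typed inference object and lazy imports inside the function body) can be sketched as follows. This is a toy under stated assumptions, not the contents of libero_rollout.py: `InferenceLike`, its `predict` method, `run_rollout_sketch`, and the list-of-observations "environment" are all hypothetical, and the stdlib `statistics` module stands in for the heavy LIBERO + mujoco imports.

```python
from typing import Any, Dict, List, Protocol, Sequence


class InferenceLike(Protocol):
    """Hypothetical stand-in for the PR's InferenceProtocol."""

    def predict(self, observation: Dict[str, Any]) -> List[float]: ...


def run_rollout_sketch(
    policy: InferenceLike,
    episodes: Sequence[Sequence[Dict[str, Any]]],
    max_steps: int,
) -> Dict[str, float]:
    """Toy rollout: each episode is a pre-recorded list of observations."""
    # Lazy import: only callers that run a rollout pay this cost. In the real
    # helper this is where simulator imports would sit; statistics is a stand-in.
    import statistics

    outcomes: List[float] = []
    steps = 0
    for episode in episodes:
        success = False
        for observation in episode[:max_steps]:
            policy.predict(observation)  # action is unused in this toy env
            steps += 1
            if observation.get("goal_reached"):
                success = True
                break
        outcomes.append(1.0 if success else 0.0)
    rate = statistics.mean(outcomes) if outcomes else 0.0
    return {"episodes": float(len(outcomes)), "steps": float(steps), "success_rate": rate}


class ConstantPolicy:
    """Satisfies InferenceLike structurally; no inheritance needed."""

    def predict(self, observation: Dict[str, Any]) -> List[float]:
        return [0.0] * 7
```

Because the protocol is structural, any object with a matching `predict` method can be passed in, which is the property that would let other exporters swap in without touching the rollout loop.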

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
rylinjames force-pushed the lift/fluxvla-checkpoint-eval-scaffold branch from 90a1d05 to a9a582e on May 20, 2026 11:42
rylinjames merged commit d1057e6 into main on May 20, 2026 (6 checks passed)
rylinjames deleted the lift/fluxvla-checkpoint-eval-scaffold branch on May 20, 2026 16:15