Optimize LenVM guided sampling by namezhenzhang · Pull Request #4 · UCSB-AI/Length-Value-Model

namezhenzhang · 2026-05-27T22:08:57Z

Summary

Add sparse LenVM guided sampling so only LenVM-guided rows override native SGLang sampling results instead of rebuilding a full guided probability tensor for the whole batch.
Add a fused in-process LenVM prefix+candidate forward path to avoid separate prefix extend and candidate launches on the text path.
Add full-vocab entropy skip support with a Triton helper and CPU fallback, plus lightweight per-step timing summaries via SGLANG_LVM_TIMING_LOG.
Fix candidate value slicing for fused tree-value batches in Qwen2/Qwen3/Qwen2.5-VL LenVM heads.
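
As a rough illustration of the sparse override idea in the first bullet (not the code in this PR), the sketch below keeps the natively sampled token ids and replaces only the rows that received guidance. The function name `apply_sparse_overrides` and the dict-of-rows representation are assumptions made for this example; the real implementation operates on GPU tensors inside SGLang.

```python
from typing import Dict, List


def apply_sparse_overrides(native_token_ids: List[int],
                           guided_token_ids: Dict[int, int]) -> List[int]:
    """Return a copy of the natively sampled ids with guided rows replaced.

    Hypothetical helper: `guided_token_ids` maps a batch row index to the
    token id chosen under guidance. Rows not in the map keep the native result.
    """
    out = list(native_token_ids)
    for row, token_id in guided_token_ids.items():
        if not 0 <= row < len(out):
            raise IndexError(f"guided row {row} is outside the batch")
        out[row] = token_id
    return out


# Batch of four requests; only rows 1 and 3 are guided.
native = [11, 42, 7, 99]
guided = {1: 5, 3: 8}
merged = apply_sparse_overrides(native, guided)
print(merged)
```

The point of the design is that work scales with the number of guided rows rather than with the full batch-by-vocabulary probability tensor.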

Simplification before PR

Dropped the experimental value_guidance_interval path and related CLI plumbing.
Dropped duplicate guidance/inproc stats code; the PR keeps only the timing summary mechanism.
Kept experiment launch scripts and local logs out of this PR branch.

Validation

git diff --check
python -m py_compile on modified SGLang Python files
CPU entropy helper smoke test vs torch.special.entr
Same-parameter speed sanity run from this optimization pass:
- Qwen2.5-7B math500 q=30 n=4 max_tokens=500 c=8: base 5716.2 tok/s, +0.5B LenVM 1989.2 tok/s
- Qwen3-A3B math500 q=30 n=4 max_tokens=500 c=8: base 3071.4 tok/s, +1.7B LenVM 1377.4 tok/s
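
For readers who want to see what an entropy smoke test of the kind listed above might check, here is a minimal pure-Python sketch. It is an assumption about the shape of such a test, not the helper from this PR: `entr` mirrors the elementwise definition documented for `torch.special.entr` (`-x * ln(x)`, with `entr(0) = 0`), and `entropy_fused` is a hypothetical single-pass formulation compared against it.

```python
import math
from typing import List


def entr(x: float) -> float:
    """Elementwise -x * ln(x); 0 at x == 0 and -inf for x < 0."""
    if x < 0:
        return -math.inf
    if x == 0:
        return 0.0
    return -x * math.log(x)


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_reference(logits: List[float]) -> float:
    """Entropy as the sum of entr(p) over the softmax probabilities."""
    return sum(entr(p) for p in softmax(logits))


def entropy_fused(logits: List[float]) -> float:
    """Entropy as ln(sum exp(s)) - sum(p * s) over max-shifted logits s."""
    m = max(logits)
    shifted = [z - m for z in logits]
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    weighted = sum(e * s for e, s in zip(exps, shifted))
    return math.log(total) - weighted / total


uniform = [0.0, 0.0, 0.0, 0.0]
peaked = [0.0, -1000.0, -1000.0]
for logits in (uniform, peaked, [2.0, -1.0, 0.5]):
    assert abs(entropy_reference(logits) - entropy_fused(logits)) < 1e-12
```

A uniform distribution over four tokens has entropy `ln(4)`, and a sharply peaked one has entropy close to zero, which gives two easy sanity points.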

Note: a direct import smoke test on the login node is blocked by the local sgl_kernel/libcuda environment, so the validation above relies on compile checks plus the GPU experiment runs.

namezhenzhang · 2026-05-27T22:42:18Z

I reviewed PR #4. I would recommend not merging yet; at least the first two findings are substantive regressions.

[P1] Guidance outside the GPU in-proc path crashes outright.
sampler.py now calls sample_token_ids() unconditionally, but lvm_guided_sampling.py only accepts pending.gpu_candidates is not None and inproc; otherwise it raises "Sparse LenVM sampling requires the GPU candidate path". This breaks previously supported modes such as the external /tree_value server, custom guidance fn, top_k=ALL, and hard target constraint. The sparse fast path needs a capability gate that falls back to the old apply() probability path when a mode is not supported.
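[P1] The logprob of a guided token now comes from the unguided base distribution.
The new logic in sampler.py only overrides batch_next_token_ids, but sampler.py still computes logprobs from the original probs, so the next_token_logprobs that sampler.py finally returns do not correspond to the guided distribution the token was actually sampled from. The old logic set probs = guided_probs, so the default return_logprob semantics were the post-guidance probabilities. This affects OpenAI-style logprobs, debugging, evaluation, and any caller that depends on token logprobs.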
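[P2] The GPU fast path does not preserve the full semantics of lvm_expectation_guidance.
The CPU version handles skipping on value_min, returning the original distribution when scale <= 0, a custom length_gamma, and similar logic, for example in lvm_guided_sampling.py and lvm_guided_sampling.py. The GPU version in lvm_guided_sampling.py falls back directly to mul when value_mode is mixed, and hard-codes gamma=0.997 in lvm_guided_sampling.py. As a result, the same request gets a different sampling distribution on the fast path than on the old path, which is a silent behavior change.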
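
The first two findings above can be illustrated with a small sketch. It is not the SGLang code: `PendingGuidance`, `choose_path`, and `token_logprob` are hypothetical names, and only the condition `gpu_candidates is not None and inproc` is taken from the review text. The sketch shows a capability gate that selects the dense fallback when the sparse path cannot run, and how a logprob read from the base distribution differs from one read from the guided distribution the token was sampled from.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PendingGuidance:
    """Hypothetical stand-in for the per-step guidance state."""
    gpu_candidates: Optional[object]
    inproc: bool


def supports_sparse_fast_path(pending: PendingGuidance) -> bool:
    # Condition quoted in the review: GPU candidates present and in-process.
    return pending.gpu_candidates is not None and pending.inproc


def choose_path(pending: PendingGuidance) -> str:
    """Pick the sparse path only when supported; otherwise fall back."""
    return "sparse" if supports_sparse_fast_path(pending) else "dense_apply"


def token_logprob(probs: List[float], token_id: int) -> float:
    """Log-probability of `token_id` under the given distribution."""
    p = probs[token_id]
    return math.log(p) if p > 0 else -math.inf


# An external-server style request has no GPU candidates, so it should
# take the dense fallback instead of raising.
external = PendingGuidance(gpu_candidates=None, inproc=False)
fast = PendingGuidance(gpu_candidates=object(), inproc=True)
print(choose_path(external), choose_path(fast))

# Toy distributions: guidance moves mass toward token 2.
base_probs = [0.7, 0.2, 0.1]
guided_probs = [0.1, 0.3, 0.6]
sampled = 2  # token drawn under guidance
reported_from_base = token_logprob(base_probs, sampled)
reported_from_guided = token_logprob(guided_probs, sampled)
print(reported_from_base, reported_from_guided)
```

In this toy case the base distribution reports a much lower logprob for the sampled token than the guided one, which is the mismatch the second finding describes.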

Validation: py_compile passes on the changed files; I did not run the full runtime sampling tests because the local Python environment lacks torch.

Optimize LenVM guided sampling

13b756c

namezhenzhang wants to merge 1 commit into main from codex/lenvm-speed-optimizations
