Optimize LenVM guided sampling #4

Open
namezhenzhang wants to merge 1 commit into main from codex/lenvm-speed-optimizations

@namezhenzhang
Collaborator

Summary

  • Add sparse LenVM guided sampling so only LenVM-guided rows override native SGLang sampling results instead of rebuilding a full guided probability tensor for the whole batch.
  • Add a fused in-process LenVM prefix+candidate forward path to avoid separate prefix extend and candidate launches on the text path.
  • Add full-vocab entropy skip support with a Triton helper and CPU fallback, plus lightweight per-step timing summaries via SGLANG_LVM_TIMING_LOG.
  • Fix candidate value slicing for fused tree-value batches in Qwen2/Qwen3/Qwen2.5-VL LenVM heads.
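
As a rough illustration of the sparse override idea in the first bullet, here is a minimal pure-Python sketch. It is not the PR's code: `apply_sparse_overrides` and the use of `None` to mark unguided rows are hypothetical names and conventions chosen for illustration, and plain lists stand in for the batch tensors.

```python
from typing import List, Optional


def apply_sparse_overrides(
    native_token_ids: List[int],
    guided_token_ids: List[Optional[int]],
) -> List[int]:
    """Keep the native sample for every row, except rows with a guided token.

    Hypothetical helper: None marks a row that was not LenVM-guided.
    """
    if len(native_token_ids) != len(guided_token_ids):
        raise ValueError("batch size mismatch")
    return [
        native if guided is None else guided
        for native, guided in zip(native_token_ids, guided_token_ids)
    ]


# Rows 1 and 3 are guided; rows 0 and 2 keep their native samples.
native = [11, 42, 7, 99]
guided = [None, 5, None, 3]
merged = apply_sparse_overrides(native, guided)
print(merged)
```

The point of the design is that unguided rows are never touched, so no batch-wide guided probability tensor has to be built just to change a few rows.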

Simplification before PR

  • Dropped the experimental value_guidance_interval path and related CLI plumbing.
  • Dropped duplicate guidance/inproc stats code; the PR keeps only the timing summary mechanism.
  • Kept experiment launch scripts and local logs out of this PR branch.

Validation

  • git diff --check
  • python -m py_compile on modified SGLang Python files
  • CPU entropy helper smoke test vs torch.special.entr
  • Same-parameter speed sanity run from this optimization pass:
    • Qwen2.5-7B math500 q=30 n=4 max_tokens=500 c=8: base 5716.2 tok/s, +0.5B LenVM 1989.2 tok/s
    • Qwen3-A3B math500 q=30 n=4 max_tokens=500 c=8: base 3071.4 tok/s, +1.7B LenVM 1377.4 tok/s

Note: a direct import smoke test on the login node is blocked by the local sgl_kernel/libcuda environment, so the validation above relies on compile checks plus the GPU experiment runs.
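
For readers without the GPU environment, the entropy check mentioned under Validation can be approximated in pure Python. This is a sketch under assumptions, not the PR's helper: `entr` follows the elementwise convention of torch.special.entr (-p ln p for p > 0, 0 at p = 0, -inf for p < 0), while `should_skip_guidance` and its "skip when entropy is below a threshold" rule are hypothetical, since the description does not state how the skip decision is made.

```python
import math
from typing import Sequence


def entr(p: float) -> float:
    """Elementwise -p * ln(p), using the same convention as torch.special.entr."""
    if p > 0.0:
        return -p * math.log(p)
    if p == 0.0:
        return 0.0
    return -math.inf


def full_vocab_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one row of probabilities."""
    return sum(entr(p) for p in probs)


def should_skip_guidance(probs: Sequence[float], threshold: float) -> bool:
    """Hypothetical rule: skip guidance when the row is already near-deterministic."""
    return full_vocab_entropy(probs) < threshold


uniform = [0.25, 0.25, 0.25, 0.25]
one_hot = [1.0, 0.0, 0.0, 0.0]
print(full_vocab_entropy(uniform), full_vocab_entropy(one_hot))
```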

@namezhenzhang
Collaborator Author

I reviewed PR #4. I would recommend not merging yet; at least the first two items are substantive regressions.

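  1. [P1] In-proc guidance on non-GPU paths crashes outright.
    sampler.py now calls sample_token_ids() unconditionally, but lvm_guided_sampling.py only accepts the case pending.gpu_candidates is not None and inproc; otherwise it raises "Sparse LenVM sampling requires the GPU candidate path". This breaks modes that were previously supported, including the external /tree_value server, custom guidance fn, top_k=ALL, and hard target constraints. The sparse fast path needs a capability gate that falls back to the old apply() probability path when a request is not supported.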

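  2. [P1] The logprob of a guided token now comes from the unguided base distribution.
    The new logic in sampler.py only overrides batch_next_token_ids, while sampler.py still uses the original probs/logprobs, so the next_token_logprobs that sampler.py finally returns no longer correspond to the guided distribution the token was actually sampled from. The old logic set probs = guided_probs, so the default return_logprob semantics were the post-guidance probabilities. This affects OpenAI-style logprobs, debugging, evaluation, and any caller that depends on token logprobs.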

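  3. [P2] The GPU fast path does not preserve the full semantics of lvm_expectation_guidance.
    The CPU version handles skipping on value_min, returning the original distribution when scale <= 0, a custom length_gamma, and similar logic, for example at two places in lvm_guided_sampling.py. The GPU version in lvm_guided_sampling.py falls straight back to mul when value_mode is mixed, and hard-codes gamma=0.997 in lvm_guided_sampling.py. The same request therefore gets a different sampling distribution on the fast path than on the old path, which is a silent behavior change.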
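
The capability gate suggested in item 1 could look roughly like the sketch below. It is only an illustration under assumptions: `PendingGuidance` is a hypothetical stand-in that mirrors just the two fields named above, and `supports_sparse_path`, `sample_with_gate`, `sparse_fn`, and `dense_fn` are made-up names, with `dense_fn` standing in for the old apply() probability path.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PendingGuidance:
    # Hypothetical stand-in; only the two fields named in the review are modeled.
    gpu_candidates: Optional[List[int]]
    inproc: bool


def supports_sparse_path(pending: PendingGuidance) -> bool:
    """Same condition the sparse sampler currently requires before it raises."""
    return pending.gpu_candidates is not None and pending.inproc


def sample_with_gate(
    pending: PendingGuidance,
    sparse_fn: Callable[[PendingGuidance], str],
    dense_fn: Callable[[PendingGuidance], str],
) -> str:
    """Use the sparse fast path only when supported, else fall back."""
    if supports_sparse_path(pending):
        return sparse_fn(pending)
    return dense_fn(pending)


# An external-server style request has no GPU candidates, so it falls back.
external = PendingGuidance(gpu_candidates=None, inproc=False)
fast = PendingGuidance(gpu_candidates=[1, 2, 3], inproc=True)
print(sample_with_gate(external, lambda p: "sparse", lambda p: "dense"))
print(sample_with_gate(fast, lambda p: "sparse", lambda p: "dense"))
```

Checking the capability before dispatch, rather than catching the exception, keeps the unsupported modes on their old code path without paying for a failed fast-path attempt.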

Validation: py_compile passes on the changed files; I did not run the full runtime sampling tests because the local Python environment lacks torch.
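
To make item 2 concrete without torch, here is a toy calculation. All numbers are invented for illustration, and the multiplicative reweighting is only an assumed guidance form, not necessarily what the PR implements. It shows that the logprob of the same token differs between the base and the guided distribution, so reporting the base value misstates the probability the token was sampled with.

```python
import math
from typing import List


def renormalize(weights: List[float]) -> List[float]:
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    return [w / total for w in weights]


# Toy three-token vocabulary; values are illustrative only.
base_probs = [0.7, 0.2, 0.1]
guidance_weights = [0.1, 1.0, 1.0]  # assumed multiplicative guidance
guided_probs = renormalize([p * w for p, w in zip(base_probs, guidance_weights)])

sampled_token = 1  # suppose guidance led to sampling token 1
base_logprob = math.log(base_probs[sampled_token])
guided_logprob = math.log(guided_probs[sampled_token])
print(base_logprob, guided_logprob)
```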
