Summary
Add sparse LenVM guided sampling so only LenVM-guided rows override native SGLang sampling results instead of rebuilding a full guided probability tensor for the whole batch.
Add a fused in-process LenVM prefix+candidate forward path to avoid separate prefix extend and candidate launches on the text path.
Add full-vocab entropy skip support with a Triton helper and CPU fallback, plus lightweight per-step timing summaries via SGLANG_LVM_TIMING_LOG.
Fix candidate value slicing for fused tree-value batches in Qwen2/Qwen3/Qwen2.5-VL LenVM heads.
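To illustrate the sparse override idea from the first item, here is a minimal framework-free sketch. The function name `apply_sparse_guidance` and its arguments are hypothetical and do not come from the PR; in the actual code these would presumably be tensors and the override an indexed assignment. It only shows the shape of the technique: native results are kept for every row, and only the guided rows are overwritten.

```python
def apply_sparse_guidance(native_tokens, guided_rows, guided_tokens):
    """Overwrite only the guided rows of a batch of sampled token ids.

    native_tokens: token ids produced by the native sampler, one per batch row.
    guided_rows:   indices of the rows that received LenVM guidance (hypothetical name).
    guided_tokens: replacement token ids, aligned with guided_rows.
    """
    if len(guided_rows) != len(guided_tokens):
        raise ValueError("guided_rows and guided_tokens must have the same length")

    # Copy so the native result stays untouched for unguided rows and for the caller.
    out = list(native_tokens)
    for row, token in zip(guided_rows, guided_tokens):
        out[row] = token
    return out


native = [5, 9, 2, 7]
merged = apply_sparse_guidance(native, guided_rows=[1, 3], guided_tokens=[11, 13])
```

The cost of the override scales with the number of guided rows rather than with batch size times vocabulary size, which is the point of not rebuilding a full guided probability tensor.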
Simplification before PR
Dropped the experimental value_guidance_interval path and related CLI plumbing.
Dropped duplicate guidance/inproc stats code; the PR keeps only the timing summary mechanism.
Kept experiment launch scripts and local logs out of this PR branch.
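For reviewers unfamiliar with the timing summary mechanism that remains, the following is a rough sketch of an environment-gated per-step timer. Only the variable name `SGLANG_LVM_TIMING_LOG` comes from this PR; the `StepTimer` class, its methods, and the rule that any value other than empty or `0` enables it are assumptions made for illustration, not the PR's implementation.

```python
import os
import time
from collections import defaultdict
from contextlib import contextmanager


class StepTimer:
    """Accumulates wall-clock time per named section (hypothetical helper)."""

    def __init__(self, enabled=None):
        if enabled is None:
            # Assumption: the variable acts as an on/off switch.
            enabled = os.environ.get("SGLANG_LVM_TIMING_LOG", "") not in ("", "0")
        self.enabled = enabled
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def section(self, name):
        if not self.enabled:
            # Disabled: run the body without recording anything.
            yield
            return
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.counts[name] += 1

    def summary(self):
        # Mean milliseconds and call count for each recorded section.
        return {
            name: {
                "calls": self.counts[name],
                "mean_ms": 1000.0 * self.totals[name] / self.counts[name],
            }
            for name in self.totals
        }


timer = StepTimer(enabled=True)
for _ in range(3):
    with timer.section("lenvm_forward"):
        sum(range(1000))  # stand-in for real work
report = timer.summary()
```

Note that timing GPU work this way only measures launch time unless the device is synchronized first; the sketch ignores that detail.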
Validation
git diff --check
python -m py_compile on modified SGLang Python files
CPU entropy helper smoke test vs torch.special.entr
Same-parameter speed sanity run from this optimization pass:
Note: direct import smoke on the login node is blocked by the local sgl_kernel/libcuda environment, so the validation above uses compile checks plus the GPU experiment runs.
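As a reference for the CPU entropy smoke test, here is a small pure-Python sketch of the quantity being checked. It follows the elementwise definition used by `torch.special.entr` (`-x * ln(x)` for `x > 0`, `0` at `x = 0`, `-inf` for `x < 0`). The names `row_entropy` and `should_skip_guidance`, the threshold, and the reading that low-entropy rows skip guidance are assumptions for illustration; the PR's helper may gate differently.

```python
import math


def entr(x):
    """Elementwise entropy term, matching the definition of torch.special.entr."""
    if x > 0:
        return -x * math.log(x)
    if x == 0:
        return 0.0
    return -math.inf


def softmax(logits):
    # Subtract the max for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def row_entropy(logits):
    """Shannon entropy (in nats) of the full-vocab distribution for one row."""
    return sum(entr(p) for p in softmax(logits))


def should_skip_guidance(logits, threshold):
    # Assumption: a row whose distribution is already sharply peaked
    # (entropy below the threshold) does not need guidance.
    return row_entropy(logits) < threshold


uniform = [0.0, 0.0, 0.0, 0.0]
peaked = [50.0, 0.0, 0.0, 0.0]
```

A uniform distribution over `n` tokens has entropy `ln(n)`, which gives a convenient closed-form check for the helper.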