feat: one-sided target probability acceptance for MTP drafts increases acceptance rate and throughput compared to argmax alone #8
Conversation
MTP drafters use greedy argmax internally: by design, for speed, they do not expose a full logit distribution. This change adds a further tok/s improvement by letting users tune the acceptance threshold, achieving ~20% throughput gains by accepting more draft tokens, while retaining the ability to manually verify the threshold at which semantic breakdown occurs for their specific model/task combination.

When the drafter and target model disagree on a token, rather than immediately rejecting (standard argmax behaviour), --draft-p-accept triggers a one-sided softmax check over the target model's logits for the draft token. If the target assigns p >= draft-p-accept to that token, it is accepted in place of the target's own argmax prediction and decoding continues.

No drafter logits are required, keeping the drafter inference path unchanged and preserving the speed advantage of argmax-only drafting. This is intentionally lighter than the full ratio test in the MTP paper.

Changes:
- common/sampling.cpp: add a p_accept parameter to sample_and_accept_n; on drafter/target disagreement, compute a softmax over the target logits and accept the draft token if p_target(draft_token) >= p_accept
- common/sampling.h: update both overloads of the sample_and_accept_n signature
- common/arg.cpp: register the --draft-p-accept CLI argument
- common/common.h: add a p_accept field to the common_params_speculative struct
- tools/server/server-context.cpp: wire p_accept into the speculative config

Usage:
--draft-p-accept 0.005   # accept draft token if p_target >= 0.005
--draft-p-accept 0.0     # standard argmax-only behaviour (default)
Review: sampler state vs accepted token

In the current patch, common_sampler_accept is called with the target id before the p_accept check resolves, so when the draft token is substituted, the sampler state tracks the wrong token.

Suggestion: decide the chosen token first, then call common_sampler_accept with it. Minor: the full-vocabulary softmax on each mismatch is O(n_vocab); worth noting for large vocabs. Otherwise the feature direction looks useful.
Fixes the sampler state bug identified by Ooooze: previously, common_sampler_accept was called with the target id before the p_accept check, leaving the grammar FSM and gsmpl->prev tracking the wrong token when the draft token was substituted.
Thanks for catching my error! Yes, we need to defer common_sampler_accept until after the p_accept check resolves; otherwise stale tokens are passed to the grammar FSM. I've pushed the fix with your corrections to the PR. I have several further enhancements to this feature, but I feel it's more important to get the throughput gains to users quickly and avoid scope creep in this PR. I'll follow up in subsequent PRs.
Requirements
Best test: 15.5 t/s at a 300,000-token context.