
llama + spec: MTP Support #22673

Draft
am17an wants to merge 9 commits into ggml-org:master from am17an:mtp-clean

Conversation

@am17an
Contributor

@am17an am17an commented May 4, 2026

Overview

This PR adds support for MTP (Multi Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is a more than 2x speed-up over baseline. The design decisions I took to get to this stage are as follows:

Performance

A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:

Performance on DGX Spark 🧵

No MTP (baseline)

./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0
  code_cpp           pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  explain_concept    pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.3
  summarize          pred=  53 draft=   0 acc=   0 rate=n/a tok/s=7.1
  qa_factual         pred= 177 draft=   0 acc=   0 rate=n/a tok/s=7.0
  translation        pred=  22 draft=   0 acc=   0 rate=n/a tok/s=7.7
  creative_short     pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.1
  stepwise_math      pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.2
  long_code_review   pred= 192 draft=   0 acc=   0 rate=n/a tok/s=7.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1404,
  "total_draft": 0,
  "total_draft_accepted": 0,
  "aggregate_accept_rate": null,
  "wall_s_total": 201.07
}

MTP --spec-draft-n-max 3

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3

  code_python        pred= 192 draft= 153 acc= 139 rate=0.908 tok/s=21.6
  code_cpp           pred= 192 draft= 176 acc= 132 rate=0.750 tok/s=18.7
  explain_concept    pred= 192 draft= 191 acc= 126 rate=0.660 tok/s=16.3
  summarize          pred=  55 draft=  51 acc=  37 rate=0.726 tok/s=17.9
  qa_factual         pred= 177 draft= 174 acc= 118 rate=0.678 tok/s=16.5
  translation        pred=  22 draft=  24 acc=  13 rate=0.542 tok/s=13.9
  creative_short     pred= 192 draft= 200 acc= 123 rate=0.615 tok/s=15.8
  stepwise_math      pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=19.3
  long_code_review   pred= 192 draft= 179 acc= 131 rate=0.732 tok/s=18.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1319,
  "total_draft_accepted": 952,
  "aggregate_accept_rate": 0.7218,
  "wall_s_total": 83.8
}

MTP --spec-draft-n-max 2

./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2

  code_python        pred= 192 draft= 134 acc= 123 rate=0.918 tok/s=17.4
  code_cpp           pred= 192 draft= 145 acc= 118 rate=0.814 tok/s=16.5
  explain_concept    pred= 192 draft= 148 acc= 116 rate=0.784 tok/s=16.1
  summarize          pred=  55 draft=  44 acc=  32 rate=0.727 tok/s=15.6
  qa_factual         pred= 192 draft= 132 acc= 125 rate=0.947 tok/s=18.2
  translation        pred=  22 draft=  18 acc=  12 rate=0.667 tok/s=15.2
  creative_short     pred= 192 draft= 149 acc= 116 rate=0.778 tok/s=16.1
  stepwise_math      pred= 192 draft= 139 acc= 121 rate=0.871 tok/s=17.2
  long_code_review   pred= 192 draft= 153 acc= 114 rate=0.745 tok/s=15.6

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1421,
  "total_draft": 1062,
  "total_draft_accepted": 877,
  "aggregate_accept_rate": 0.8258,
  "wall_s_total": 90.44
}

Draft model (Qwen3.5 0.8B), --spec-draft-n-max 16, with partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 188 acc= 156 rate=0.830 tok/s=26.4
  code_cpp           pred= 192 draft= 201 acc= 126 rate=0.627 tok/s=16.8
  explain_concept    pred= 192 draft= 263 acc= 112 rate=0.426 tok/s=12.7
  summarize          pred=  57 draft=  63 acc=  39 rate=0.619 tok/s=16.9
  qa_factual         pred= 192 draft= 178 acc= 177 rate=0.994 tok/s=47.7
  translation        pred=  23 draft=  18 acc=  15 rate=0.833 tok/s=18.7
  creative_short     pred= 192 draft= 189 acc= 120 rate=0.635 tok/s=15.4
  stepwise_math      pred= 192 draft= 190 acc= 148 rate=0.779 tok/s=22.3
  long_code_review   pred= 192 draft= 207 acc= 120 rate=0.580 tok/s=14.5

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1424,
  "total_draft": 1497,
  "total_draft_accepted": 1013,
  "aggregate_accept_rate": 0.6767,
  "wall_s_total": 81.39
}

Master with draft model, --spec-draft-n-max 64, no partial rollback

llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"

  code_python        pred= 192 draft= 174 acc= 159 rate=0.914 tok/s=27.2
  code_cpp           pred= 192 draft= 138 acc= 120 rate=0.870 tok/s=15.0
  explain_concept    pred= 192 draft= 170 acc= 101 rate=0.594 tok/s=11.4
  summarize          pred=  55 draft=  48 acc=  36 rate=0.750 tok/s=14.6
  qa_factual         pred= 177 draft= 126 acc= 106 rate=0.841 tok/s=13.9
  translation        pred=  22 draft=  13 acc=  13 rate=1.000 tok/s=16.5
  creative_short     pred= 192 draft= 136 acc= 104 rate=0.765 tok/s=12.8
  stepwise_math      pred= 192 draft= 172 acc= 147 rate=0.855 tok/s=22.0
  long_code_review   pred= 192 draft= 160 acc= 111 rate=0.694 tok/s=13.0

Aggregate: {
  "n_requests": 9,
  "total_predicted": 1406,
  "total_draft": 1137,
  "total_draft_accepted": 897,
  "aggregate_accept_rate": 0.7889,
  "wall_s_total": 97.13
}
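
As a rough sanity check on these aggregates (a back-of-the-envelope reading, not a separate measurement): each target verification pass emits one token of its own plus whatever draft tokens it accepts, so the MTP run with 3 draft tokens produces about

  total_predicted / (total_predicted - total_draft_accepted) = 1406 / (1406 - 952) ≈ 3.1 tokens per target pass.

The observed wall-clock speed-up (201.07 s → 83.8 s, roughly 2.4x) is lower than 3.1x because each pass also runs the MTP head and has to verify the drafted tokens in addition to decoding its own.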

How to use

I've uploaded the GGUF, which I made using the convert_hf_to_gguf.py changes in this PR. Here is another GGUF for the MoE (35BA3B) model.
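
For anyone who wants to regenerate a GGUF with the MTP tensors themselves, something along these lines should work (the path and quant type below are placeholders; the key parts are using the convert script from this branch and enabling MTP at serve time):

python convert_hf_to_gguf.py /path/to/Qwen3.6-27B --outfile qwen3.6-q8_0-mtp.gguf --outtype q8_0
./llama-server -m qwen3.6-q8_0-mtp.gguf --spec-type mtp --spec-draft-n-max 3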

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for debugging and reviewing, for the convert_hf_to_gguf.py changes and model definitions, and for writing the bench used for validation against vLLM.

@github-actions github-actions bot added the model, testing, Nvidia GPU, Vulkan, examples, python, server, and ggml labels on May 4, 2026
@ngxson
Contributor

ngxson commented May 4, 2026

Nice, I think this is a fresh start, better than my WIP #18886 (which I still haven't found the time to continue).

There were some other attempts to add MTP support, but they all relied heavily on host <--> device data copies. I assume you tried to address this, right? (Maybe there was a discussion somewhere that I wasn't aware of.)

Contributor

@ngxson ngxson left a comment


(not a review, but opening some discussions)

Comment on lines +73 to +74
// number of recurrent-state snapshots per seq for rollback; tensors are widened to (1 + n_rs_seq) groups
uint32_t n_rs_seq = 0;
Contributor


not 100% sure but maybe the naming with _seq is a bit confusing (or I'm misunderstanding this)

I imagine that we want to keep a ring-buffer style of recurrent state(s), similar to SWA in the KV cache, right? If that's the case, it's probably better to call it n_rs_window.

Contributor Author


Yes, this is partly from the review comment here: #22400 (comment)
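
For readers following the thread, the windowed-snapshot idea being discussed looks roughly like the sketch below. The struct and field names are illustrative only (they are not the PR's actual types); the assumption is simply that one copy of the recurrent state is kept per position inside a fixed-size window so a later rollback can restore any of them.

#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative ring of recurrent-state snapshots for one sequence.
struct rs_snapshot_ring {
    uint32_t n_rs_window;                  // snapshots kept per sequence (e.g. draft_max)
    std::vector<std::vector<float>> slots; // n_rs_window copies of the recurrent state

    rs_snapshot_ring(uint32_t window, size_t state_size)
        : n_rs_window(window), slots(window, std::vector<float>(state_size)) {}

    // store the state produced at absolute position pos
    void store(uint32_t pos, const std::vector<float> & state) {
        slots[pos % n_rs_window] = state;
    }

    // fetch the snapshot for pos; valid while pos is still inside the window
    const std::vector<float> & load(uint32_t pos) const {
        return slots[pos % n_rs_window];
    }
};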

Comment thread src/models/qwen35.cpp

for (int il = 0; il < n_layer; ++il) {
// MTP/NextN layers are loaded as extra decoder blocks but not executed in the main pass.
const int n_transformer_layers = n_layer - (int)hparams.nextn_predict_layers;
Contributor


nit, but maybe call it n_main_layers, as technically the nextn layer is also a transformer layer
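
In other words (an illustrative paraphrase of the hunk above using the suggested name, not the PR's literal code), the NextN/MTP block sits at the end of the layer list and the main graph only walks the layers before it:

// paraphrase only: skip the trailing NextN/MTP block(s) in the main pass
const int n_main_layers = n_layer - (int) hparams.nextn_predict_layers;

for (int il = 0; il < n_main_layers; ++il) {
    // build the regular decoder block il ...
}
// layers [n_main_layers, n_layer) hold the MTP head; they are loaded with the
// model but only executed by the speculative (MTP) graph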

Comment on lines +811 to +823
//TODO: generalize if this is ok, we should load <arch_name>_mtp arch?
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
SRV_INF("loading MTP head from '%s' (override_arch=qwen35_mtp)\n",
params_base.model.path.c_str());

auto mparams_mtp = common_model_params_to_llama(params_base);
mparams_mtp.override_arch = "qwen35_mtp";

model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));
if (model_mtp == nullptr) {
SRV_ERR("failed to load MTP head from '%s'\n", params_base.model.path.c_str());
return false;
}
Contributor


if you look at #18886, the better way is to move llama_graph_type to the public API, then load the context with the appropriate graph type

Contributor Author


Yes that seems like the correct way to do this if we want to support MTP in a generic way

@am17an
Contributor Author

am17an commented May 4, 2026

@ngxson yes, the h2d copy was discussed with GG; he's working on a refactor which will allow us to share tensors between two llama contexts.

@pwilkin
Member

pwilkin commented May 4, 2026

Great work, this should massively bridge the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.

am17an added 4 commits May 4, 2026 20:15
Currently, speculative decoding needs to restart from a checkpoint after some draft tokens are not accepted, which leads to some wasted work re-running the target. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
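
A minimal sketch of that idea, with illustrative types and names (not the PR's actual data structures), assuming one GDN state snapshot is saved after each drafted token:

#include <cstddef>
#include <vector>

using gdn_state = std::vector<float>;

struct draft_rollback {
    std::vector<gdn_state> snapshots; // snapshots[i] = GDN state after draft token i

    void push(const gdn_state & s) { snapshots.push_back(s); }

    // verification accepted n_accepted of the drafted tokens; return the state
    // to resume decoding from, or nullptr if nothing was accepted and the
    // caller must fall back to the pre-draft checkpoint
    const gdn_state * resume_state(size_t n_accepted) const {
        if (n_accepted == 0 || n_accepted > snapshots.size()) {
            return nullptr;
        }
        return &snapshots[n_accepted - 1];
    }
};

With something like this in place, rejecting the tail of a draft only discards the rejected positions instead of replaying the whole window from the older checkpoint.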
@cmp-nct
Contributor

cmp-nct commented May 4, 2026

In my opinion, Qwen 3.6 is the most important thing to happen in open-source models in a long time; this is going to be so valuable.
I wonder if this, once merged, could be combined with ngram drafting? So MTP is used until ngram is triggered, switching to ngram until rejection and then back to MTP.

ngram could be set to match only very strong and long candidates (for large repetitive paraphrasing),
and MTP fills the gap.

@Dampfinchen

Dampfinchen commented May 4, 2026

" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?)

If so, there should always be an option to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even with very small draft models for that exact reason: they need their own context and KV cache. Such low-to-midrange systems already operate on the edge in terms of memory.

@mbednarek360

I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282.

Prompt: "Hello!"
Response:

The from,

;::...

... on;srible威风to{ islitor

\ ...

• We
&eq和chn ***, on
Prompt (:
mouth

“ ? forM� P 

@am17an
Contributor Author

am17an commented May 4, 2026

@cmp-nct I'm not sure, but could be possible

@Dampfinchen as of right now it is opt-in via --spec-type mtp, but in terms of memory it should be < 10% of overall memory used (it's just a single layer transformer + kv cache, much lighter than draft models)

@mbednarek360 I've only tested this on a small number of CUDA devices so far; once it's ready for review I will have tested more devices/backends. In particular, this PR relies on #22400, which is not implemented for Vulkan yet; if you ask an LLM to add support for that you might get a little further. Vulkan and Metal have since also been tested.

@nawoa

nawoa commented May 4, 2026

Might it be possible/useful to run the draft model on a second GPU? Given that the MTP weights are relatively small, this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU and a cheaper low-VRAM "normal" GPU used for display output, etc., and possibly prevent some degree of resource contention.

@cturan

cturan commented May 4, 2026

Thank you, we are eagerly awaiting this becoming stable. Here are the automated test results from my machine:

Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens
Generation: 256 tokens
Runs: 3 measured runs after warmup

mode                      model                                                               prefill tok/s avg   generation tok/s avg   MTP acceptance   loaded VRAM
MTP enabled               Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3    665.14              42.45                   76.0%            24.96 GiB
MTP disabled, same GGUF   Qwen3.6-27B-MTP-Q6_K.gguf, no spec                                  1315.46             22.97                   n/a              22.47 GiB
Existing non-MTP Q6       Qwen3.6-27B-Q6_K.gguf, no spec                                      1260.12             22.39                   n/a              22.59 GiB

Result:

  • MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.
  • Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.
  • Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).
  • MTP adds about 2.49 GiB loaded VRAM in this setup.

@am17an
Contributor Author

am17an commented May 4, 2026

@cturan Thanks for testing, I'm aware of the issue with prefill and will work on a fix.

@iiLaurens

Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be so it won't be gimped.

@nybblr

nybblr commented May 4, 2026

Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any mtp layers.

@am17an Think it would work if I just "steal" the layers from your q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)

@volkermauel

Only a quick test run: 1x 5090, qwen3.6-27b, MTP 3, q4_0 quantized, KV also q4_0.

slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
slot launch_slot_: id  0 | task 532 | processing task, is_child = 0
slot update_slots: id  0 | task 532 | new prompt, n_ctx_slot = 200192, n_keep = 0, task.n_tokens = 16
slot update_slots: id  0 | task 532 | n_past = 3, slot.prompt.tokens.size() = 1327, seq_id = 0, pos_min = 1326, n_swa = 0
slot update_slots: id  0 | task 532 | forcing full prompt re-processing due to lack of cache data (likely due to SWA or hybrid/recurrent memory, see https://github.com/ggml-org/llama.cpp/pull/13194#issuecomment-2868343055)
slot update_slots: id  0 | task 532 | n_tokens = 0, memory_seq_rm [0, end)
srv  log_server_r: done request: POST /v1/chat/completions 192.168.178.49 200
slot update_slots: id  0 | task 532 | prompt processing progress, n_tokens = 12, batch.n_tokens = 12, progress = 0.750000
slot update_slots: id  0 | task 532 | n_tokens = 12, memory_seq_rm [12, end)
slot init_sampler: id  0 | task 532 | init sampler, took 0.01 ms, tokens: text = 16, total = 16
slot update_slots: id  0 | task 532 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 532 |
prompt eval time =      63.16 ms /    16 tokens (    3.95 ms per token,   253.34 tokens per second)
       eval time =   56063.04 ms /  5913 tokens (    9.48 ms per token,   105.47 tokens per second)
      total time =   56126.20 ms /  5929 tokens
draft acceptance rate = 0.79728 ( 4169 accepted /  5229 generated)
statistics mtp: #calls(b,g,a) = 2 2272 1976, #gen drafts = 2272, #acc drafts = 1976, #gen tokens = 6816, #acc tokens = 4950, dur(b,g,a) = 0.007, 15393.656, 64.921 ms
slot      release: id  0 | task 532 | stop processing: n_tokens = 5928, truncated = 0
srv  update_slots: all slots are idle

same model, same config (except mtp)

slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 16, batch.n_tokens = 4
slot print_timing: id  0 | task 0 | 
prompt eval time =      91.85 ms /    16 tokens (    5.74 ms per token,   174.20 tokens per second)
       eval time =  103127.94 ms /  6571 tokens (   15.69 ms per token,    63.72 tokens per second)
      total time =  103219.79 ms /  6587 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 6586, truncated = 0
srv  update_slots: all slots are idle

prompt „create a flappy bird clone“

(I‘m not creative, sorry)

Great Speedup!

@Viktor-Osika

Observing very nice speed boost on 5090 with RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF!

I also tried loading this model https://huggingface.co/lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP but I get "llama.cpp\src\models\qwen35_mtp.cpp:8: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0") failed"
Launch command:
.\llama-server -hf llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP-GGUF --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 0.0 --repeat-penalty 1.0 --host 0.0.0.0 -c 124000 -np 1 --spec-type mtp --spec-draft-n-max 3
It loads fine without --spec-type mtp.

@PkmX

PkmX commented May 5, 2026

Metal should work now as well

M1 Ultra: Qwen3.6 27B Q8_0 went from 17 t/s to 25 t/s with --spec-type mtp --spec-draft-n-max 3.

@pwilkin
Member

pwilkin commented May 5, 2026

Observing very nice speed boost on 5090 with RDson/Qwen3.6-27B-MTP-Q4_K_M-GGUF!

I also tried loading this model https://huggingface.co/lyf/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP but I get "llama.cpp\src\models\qwen35_mtp.cpp:8: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0") failed" Launch command: .\llama-server -hf llmfan46/Qwen3.6-27B-uncensored-heretic-v2-NVFP4-MTP-GGUF --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00 --presence_penalty 0.0 --repeat-penalty 1.0 --host 0.0.0.0 -c 124000 -np 1 --spec-type mtp --spec-draft-n-max 3 It loads fine without --spec-type mtp.

That message translates to ELI5 as "this GGUF has the MTP layers stripped".

@Dampfinchen

Google has just released the MTP layers for Gemma 4:

https://huggingface.co/google/gemma-4-26B-A4B-it-assistant
https://huggingface.co/google/gemma-4-31B-it-assistant
https://huggingface.co/google/gemma-4-E4B-it-assistant
https://huggingface.co/google/gemma-4-E2B-it-assistant

@adriabama06

adriabama06 commented May 5, 2026

I just tested it with the iGPU of my R9 6900HX, using ROCm + force-host-allocation-APU + Qwen3.5-4B Q8:

I manually converted Qwen3.5-4B to GGUF using convert_hf_to_gguf.py and these are the results:

Prompt:
Consider the subspace \(F = \langle \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 4 \\ 1 \\ -1 \end{pmatrix}, \begin{pmatrix} 2 \\ 1 \\ 0 \end{pmatrix} \rangle\) in \(\mathbb{R}^3\). Find a basis of \(F\) and the condition (as a system of homogeneous linear equations) that a vector \(\begin{pmatrix} x \\ y \\ z \end{pmatrix}\) must satisfy in order to belong to \(F\).

Both gave the correct answer

No MTP ~6.6 tok/s
$ LD_PRELOAD=/home/adri/libforcegttalloc.so HSA_OVERRIDE_GFX_VERSION='10.3.0' ./llama.cpp_mtp/build/bin/llama-server --host 0.0.0.0 --port 8001 -c 8192 --temp 0.6 --repeat-penalty 1.05 --top-k 20 --top-p 0.95 --min-p 0.00 --chat-template-kwargs '{"enable_thinking":false}' -m llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf -np 1 --no-mmap --no-warmup -fa on -ctk f16 -ctv f16 -ngl 999 --jinja --context-shift

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 6856 MiB):
  Device 0: AMD Radeon Graphics, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 6856 MiB
Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
build_info: b9026-f8c6b03da
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - ROCm0 (Graphics)   |  6856 = 9729 + (5054 =  4386 +     178 +     490) +       -7927 |
common_memory_breakdown_print: |   - Host               |                  662 =   644 +       0 +      18                |
common_params_fit_impl: projected to use 5054 MiB of device memory vs. 9729 MiB of free device memory
common_params_fit_impl: will leave 4674 >= 1024 MiB of free device memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 1.73 seconds
llama_model_loader: loaded meta data with 42 key-value pairs and 441 tensors from llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen/Qwen3.5-4B
llama_model_loader: - kv   3:                           general.finetune str              = 851bf6e806efd8d0a36b00ddf55e13ccb7b8cd0a
llama_model_loader: - kv   4:                         general.size_label str              = 4.3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-4...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3.5 4B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-4...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                         qwen35.block_count u32              = 33
llama_model_loader: - kv  13:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  14:                    qwen35.embedding_length u32              = 2560
llama_model_loader: - kv  15:                 qwen35.feed_forward_length u32              = 9216
llama_model_loader: - kv  16:                qwen35.attention.head_count u32              = 16
llama_model_loader: - kv  17:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  18:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  19:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  20:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  22:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  23:                          general.file_type u32              = 7
llama_model_loader: - kv  24:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  25:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  26:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  27:                  qwen35.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  28:                      qwen35.ssm.inner_size u32              = 4096
llama_model_loader: - kv  29:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  30:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  31:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  41:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - type  f32:  184 tensors
llama_model_loader: - type q8_0:  257 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 4.28 GiB (8.51 BPW)
llama_prepare_model_devices: using device ROCm0 (AMD Radeon Graphics) (0000:e3:00.0) - 9756 MiB free
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 2560
print_info: n_embd_inp            = 2560
print_info: n_layer               = 33
print_info: n_head                = 16
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 9216
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 4096
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 32
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 4B
print_info: model params          = 4.33 B
print_info: general.name          = Qwen/Qwen3.5-4B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248044 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloaded 34/34 layers to GPU
load_tensors:        ROCm0 model buffer size =  4386.53 MiB
load_tensors:    ROCm_Host model buffer size =   644.14 MiB
.............................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (4096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.95 MiB
llama_kv_cache:      ROCm0 KV buffer size =   128.00 MiB
llama_kv_cache: size =  128.00 MiB (  4096 cells,   8 layers,  1/1 seqs), K (f16):   64.00 MiB, V (f16):   64.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_memory_recurrent:      ROCm0 RS buffer size =    50.25 MiB
llama_memory_recurrent: size =   50.25 MiB (     1 cells,  33 layers,  1 seqs), R (f32):    2.25 MiB, S (f32):   48.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      ROCm0 compute buffer size =   490.00 MiB
sched_reserve:  ROCm_Host compute buffer size =    18.02 MiB
sched_reserve: graph nodes  = 1833
sched_reserve: graph splits = 2
sched_reserve: reserve took 1725.91 ms, sched copies = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
clip_model_loader: model name:   Qwen/Qwen3.5-4B
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    298
clip_model_loader: n_kv:         29

clip_model_loader: has vision encoder
clip_ctx: CLIP using ROCm0 backend
load_hparams: projector:          qwen3vl_merger
load_hparams: n_embd:             1024
load_hparams: n_head:             16
load_hparams: n_ff:               4096
load_hparams: n_layer:            24
load_hparams: ffn_op:             gelu
load_hparams: projection_dim:     2560

--- vision hparams ---
load_hparams: image_size:         768
load_hparams: patch_size:         16
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: n_merge:            2
load_hparams: n_wa_pattern: 0
load_hparams: image_min_pixels:   1048576 (custom value)
load_hparams: image_max_pixels:   4194304

load_hparams: model size:         644.26 MiB
load_hparams: metadata size:      0.10 MiB
srv    load_model: loaded multimodal model, 'llama.cpp_mtp/mmproj-Qwen-Qwen3.5-4B-bf16.gguf'
srv    load_model: initializing slots, n_slots = 1
common_context_can_seq_rm: the target context does not support partial sequence removal
srv    load_model: speculative decoding will use checkpoints
no implementations specified for speculative decoding
slot   load_model: id  0 | task -1 | new slot, n_ctx = 4096
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: init: --cache-idle-slots requires --kv-unified, disabling
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8001
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: done request: GET /tools 192.168.1.154 404
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 4096 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, task.n_tokens = 1053
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 8, batch.n_tokens = 8, progress = 0.007597
srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.154 200
slot update_slots: id  0 | task 0 | n_tokens = 8, memory_seq_rm [8, end)
srv  process_chun: processing image...
encoding image slice...
alloc_compute_meta:      ROCm0 compute buffer size =   109.22 MiB
alloc_compute_meta:        CPU compute buffer size =    12.19 MiB
alloc_compute_meta: graph splits = 1, nodes = 736
warmup: flash attention is enabled
image slice encoded in 7058 ms
decoding image batch 1/1, n_tokens_batch = 1035
find_slot: non-consecutive token position 8 after 7 for sequence 0 with 512 new tokens
find_slot: non-consecutive token position 8 after 8 for sequence 0 with 512 new tokens
find_slot: non-consecutive token position 8 after 8 for sequence 0 with 11 new tokens
find_slot: non-consecutive token position 8 after 7 for sequence 0 with 512 new tokens
find_slot: non-consecutive token position 8 after 8 for sequence 0 with 512 new tokens
find_slot: non-consecutive token position 8 after 8 for sequence 0 with 11 new tokens
image decoded (batch 1/1) in 5621 ms
srv  process_chun: image processed in 12680 ms
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 1049, batch.n_tokens = 6, progress = 0.996201
find_slot: non-consecutive token position 58 after 8 for sequence 0 with 6 new tokens
find_slot: non-consecutive token position 58 after 8 for sequence 0 with 6 new tokens
slot update_slots: id  0 | task 0 | n_tokens = 1049, memory_seq_rm [59, end)
slot init_sampler: id  0 | task 0 | init sampler, took 0.04 ms, tokens: text = 18, total = 1053
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 1053, batch.n_tokens = 4
slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 58, pos_max = 58, n_tokens = 1049, size = 50.251 MiB)
slot print_timing: id  0 | task 0 |
prompt eval time =   13187.54 ms /  1053 tokens (   12.52 ms per token,    79.85 tokens per second)
       eval time =   63816.31 ms /   424 tokens (  150.51 ms per token,     6.64 tokens per second)
      total time =   77003.84 ms /  1477 tokens
slot      release: id  0 | task 0 | stop processing: n_tokens = 1476, truncated = 0
srv  update_slots: all slots are idle
^Csrv    operator(): operator(): cleaning up before exit...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - ROCm0 (Graphics)   |  6856 = 3126 + (5054 =  4386 +     178 +     490) +       -1325 |
common_memory_breakdown_print: |   - Host               |                  662 =   644 +       0 +      18                |
MTP `--spec-type mtp --spec-draft-n-max 3` ~11.6 tok/s
$ LD_PRELOAD=/home/adri/libforcegttalloc.so HSA_OVERRIDE_GFX_VERSION='10.3.0' ./llama.cpp_mtp/build/bin/llama-server --host 0.0.0.0 --port 8001 -c 8192 --temp 0.6 --repeat-penalty 1.05 --top-k 20 --top-p 0.95 --min-p 0.00 --chat-template-kwargs '{"enable_thinking":false}' -m llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf -np 1 --no-mmap --no-warmup -fa on -ctk f16 -ctv f16 -ngl 999 --jinja --context-shift --spec-type mtp --spec-draft-n-max 3

ggml_cuda_init: found 1 ROCm devices (Total VRAM: 6856 MiB):
  Device 0: AMD Radeon Graphics, gfx1030 (0x1030), VMM: no, Wave Size: 32, VRAM: 6856 MiB
Setting 'enable_thinking' via --chat-template-kwargs is deprecated. Use --reasoning on / --reasoning off instead.
build_info: b9026-f8c6b03da
system_info: n_threads = 8 (n_threads_batch = 8) / 16 | ROCm : NO_VMM = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
Running without SSL
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - ROCm0 (Graphics)   |  6856 = 9834 + (5333 =  4386 +     457 +     490) +       -8311 |
common_memory_breakdown_print: |   - Host               |                  670 =   644 +       0 +      26                |
common_params_fit_impl: projected to use 5333 MiB of device memory vs. 9834 MiB of free device memory
common_params_fit_impl: will leave 4500 >= 1024 MiB of free device memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 2.44 seconds
llama_model_loader: loaded meta data with 42 key-value pairs and 441 tensors from llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen/Qwen3.5-4B
llama_model_loader: - kv   3:                           general.finetune str              = 851bf6e806efd8d0a36b00ddf55e13ccb7b8cd0a
llama_model_loader: - kv   4:                         general.size_label str              = 4.3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-4...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3.5 4B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-4...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                         qwen35.block_count u32              = 33
llama_model_loader: - kv  13:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  14:                    qwen35.embedding_length u32              = 2560
llama_model_loader: - kv  15:                 qwen35.feed_forward_length u32              = 9216
llama_model_loader: - kv  16:                qwen35.attention.head_count u32              = 16
llama_model_loader: - kv  17:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  18:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  19:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  20:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  22:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  23:                          general.file_type u32              = 7
llama_model_loader: - kv  24:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  25:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  26:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  27:                  qwen35.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  28:                      qwen35.ssm.inner_size u32              = 4096
llama_model_loader: - kv  29:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  30:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  31:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  41:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - type  f32:  184 tensors
llama_model_loader: - type q8_0:  257 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 4.28 GiB (8.51 BPW)
llama_prepare_model_devices: using device ROCm0 (AMD Radeon Graphics) (0000:e3:00.0) - 9962 MiB free
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 2560
print_info: n_embd_inp            = 2560
print_info: n_layer               = 33
print_info: n_head                = 16
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 9216
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 4096
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 32
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 4B
print_info: model params          = 4.33 B
print_info: general.name          = Qwen/Qwen3.5-4B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248044 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloaded 34/34 layers to GPU
load_tensors:        ROCm0 model buffer size =  4386.53 MiB
load_tensors:    ROCm_Host model buffer size =   644.14 MiB
.............................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.95 MiB
llama_kv_cache:      ROCm0 KV buffer size =   256.00 MiB
llama_kv_cache: size =  256.00 MiB (  8192 cells,   8 layers,  1/1 seqs), K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_memory_recurrent:      ROCm0 RS buffer size =   201.00 MiB
llama_memory_recurrent: size =  201.00 MiB (     1 cells,  33 layers,  1 seqs), R (f32):    9.00 MiB, S (f32):  192.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      ROCm0 compute buffer size =   490.00 MiB
sched_reserve:  ROCm_Host compute buffer size =    26.02 MiB
sched_reserve: graph nodes  = 1833
sched_reserve: graph splits = 2
sched_reserve: reserve took 1718.52 ms, sched copies = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
srv    load_model: loading MTP head from 'llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf' (override_arch=qwen35_mtp)
llama_model_loader: loaded meta data with 42 key-value pairs and 441 tensors from llama.cpp_mtp/Qwen-Qwen3.5-4B-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Qwen/Qwen3.5-4B
llama_model_loader: - kv   3:                           general.finetune str              = 851bf6e806efd8d0a36b00ddf55e13ccb7b8cd0a
llama_model_loader: - kv   4:                         general.size_label str              = 4.3B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.5-4...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen3.5 4B Base
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.5-4...
llama_model_loader: - kv  11:                               general.tags arr[str,1]       = ["image-text-to-text"]
llama_model_loader: - kv  12:                         qwen35.block_count u32              = 33
llama_model_loader: - kv  13:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  14:                    qwen35.embedding_length u32              = 2560
llama_model_loader: - kv  15:                 qwen35.feed_forward_length u32              = 9216
llama_model_loader: - kv  16:                qwen35.attention.head_count u32              = 16
llama_model_loader: - kv  17:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  18:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  19:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  20:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  21:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  22:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  23:                          general.file_type u32              = 7
llama_model_loader: - kv  24:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  25:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  26:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  27:                  qwen35.ssm.time_step_rank u32              = 32
llama_model_loader: - kv  28:                      qwen35.ssm.inner_size u32              = 4096
llama_model_loader: - kv  29:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  30:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  31:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - kv  32:               general.quantization_version u32              = 2
llama_model_loader: - kv  33:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  34:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  35:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  36:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  37:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 248044
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  41:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - type  f32:  184 tensors
llama_model_loader: - type q8_0:  257 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 4.28 GiB (8.51 BPW)
llama_model_create: overriding architecture qwen35 -> qwen35_mtp
llama_prepare_model_devices: using device ROCm0 (AMD Radeon Graphics) (0000:e3:00.0) - 4126 MiB free
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35_mtp
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 2560
print_info: n_embd_inp            = 2560
print_info: n_layer               = 33
print_info: n_head                = 16
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 9216
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: model type            = ?B
print_info: model params          = 4.33 B
print_info: general.name          = Qwen/Qwen3.5-4B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 11 ','
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248044 '<|endoftext|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
done_getting_tensors: partial load — used 17 of 441 tensors in the file (rest belong to a sibling model on the same .gguf)
load_tensors: offloading output layer to GPU
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloaded 34/34 layers to GPU
load_tensors:        ROCm0 model buffer size =   766.39 MiB
load_tensors:    ROCm_Host model buffer size =   644.14 MiB
....srv    load_model: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 8192
llama_context: n_ctx_seq     = 8192
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (8192) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  ROCm_Host  output buffer size =     0.95 MiB
llama_kv_cache:      ROCm0 KV buffer size =    32.00 MiB
llama_kv_cache: size =   32.00 MiB (  8192 cells,   1 layers,  1/1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      ROCm0 compute buffer size =   490.00 MiB
sched_reserve:  ROCm_Host compute buffer size =    26.02 MiB
sched_reserve: graph nodes  = 50
sched_reserve: graph splits = 2
sched_reserve: reserve took 1346.77 ms, sched copies = 1
set_mtp: MTP draft head registered (ctx_mtp=0x5cd411b88ae0, n_ubatch=512, n_embd=2560)
slot   load_model: id  0 | task -1 | speculative decoding context initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 8192
srv    load_model: prompt cache is enabled, size limit: 8192 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: init: --cache-idle-slots requires --kv-unified, disabling
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:8001
main: starting the main loop...
srv  update_slots: all slots are idle
srv  log_server_r: done request: GET /tools 192.168.1.154 404
srv  params_from_: Chat format: peg-native
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
srv  get_availabl: updating prompt cache
srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 8192 tokens, 8589934592 est)
srv  get_availabl: prompt cache update took 0.01 ms
slot launch_slot_: id  0 | task -1 | sampler chain: logits -> penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> ?min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, task.n_tokens = 156
slot update_slots: id  0 | task 0 | n_tokens = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_tokens = 152, batch.n_tokens = 152, progress = 0.974359
srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.154 200
slot update_slots: id  0 | task 0 | n_tokens = 152, memory_seq_rm [152, end)
slot init_sampler: id  0 | task 0 | init sampler, took 0.10 ms, tokens: text = 156, total = 156
slot update_slots: id  0 | task 0 | prompt processing done, n_tokens = 156, batch.n_tokens = 4
slot create_check: id  0 | task 0 | created context checkpoint 1 of 32 (pos_min = 151, pos_max = 151, n_tokens = 152, size = 50.251 MiB)
slot print_timing: id  0 | task 0 |
prompt eval time =    1503.36 ms /   156 tokens (    9.64 ms per token,   103.77 tokens per second)
       eval time =  235881.32 ms /  2740 tokens (   86.09 ms per token,    11.62 tokens per second)
      total time =  237384.68 ms /  2896 tokens
draft acceptance rate = 0.65906 ( 1819 accepted /  2760 generated)
statistics mtp: #calls(b,g,a) = 1 920 771, #gen drafts = 920, #acc drafts = 771, #gen tokens = 2760, #acc tokens = 1819, dur(b,g,a) = 0.015, 72899.020, 5.252 ms
slot      release: id  0 | task 0 | stop processing: n_tokens = 2895, truncated = 0
srv  update_slots: all slots are idle
^Csrv    operator(): operator(): cleaning up before exit...
common_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - ROCm0 (Graphics)   |  6856 = 1735 + (5333 =  4386 +     457 +     490) +        -212 |
common_memory_breakdown_print: |   - Host               |                  670 =   644 +       0 +      26                |
double free or corruption (!prev)
Aborted (core dumped)

Edit: Add GGUF link: https://huggingface.co/adriabama06/Qwen3.5-4B-q8_0-am17an-MTP

@Ezzz-dev

Ezzz-dev commented May 5, 2026

Can confirm that with the latest changes on Vulkan I no longer have issues with token generation suddenly stopping. At 120k tokens the generation dropped to 45 t/s, so that's still a huge boost in performance!

Comment thread src/llama-context.h
Comment on lines +243 to +248
void handle_mtp_for_ubatch(
        int32_t              n_tokens,
        const llama_token  * tokens,
        const llama_pos    * positions,
        struct ggml_tensor * t_h_pre_norm);

Member

@ggerganov ggerganov May 5, 2026

This logic has to be extracted into the server/speculative contexts. One of the main reasons is that the token positions alone are not enough to do this correctly. For example, for multi-modal use cases, this implementation does not account for multiple tokens sharing the same position, and there is no way to resolve that inside llama_context.

Also, restoring prompts in a slot from the prompt cache does not work because we forget the computed embeddings for the prompt.

Hence, we have to actually extract the embeddings from the context (in a similar way to how we extract the logits) and manage them alongside the prompts (store them, prefix-cache them, etc.). Quite a lot of work is needed, but I think that's the proper way to implement support for these methods.

For MTP, we can already use the existing embeddings API because MTP only needs the output embeddings, and we already have the API for that, llama_get_embeddings_* (edit: on second look, we actually need the embeddings before the output norm, so an API for that is needed as well). But for Eagle3 this is not enough; for that I am preparing a new embeddings API (#22728) that will be used to extract the internal-layer embeddings. Feedback is welcome.
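
For reference, here is a minimal sketch of pulling the output embeddings through the existing API mentioned above. `llama_decode`, `llama_get_embeddings_ith` and `llama_model_n_embd` are the current public calls; the helper itself and its wiring are purely illustrative, and the context is assumed to have been created with `cparams.embeddings = true`:

```cpp
#include "llama.h"

// Sketch only: extract the output embedding of a batch's last token.
// `model` and `batch` are assumed to be set up by the caller, with
// batch.logits[batch.n_tokens - 1] = true so that token produces an output.
static const float * get_last_output_embd(llama_context * ctx, const llama_model * model, const llama_batch & batch) {
    if (llama_decode(ctx, batch) != 0) {
        return nullptr;
    }
    // n_embd floats per output; note these are the post-output-norm embeddings,
    // which is exactly the limitation discussed above (MTP wants pre-norm).
    const int n_embd = llama_model_n_embd(model);
    (void) n_embd;
    return llama_get_embeddings_ith(ctx, batch.n_tokens - 1);
}
```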

Contributor Author

@ggerganov IMO the only blocker to running this inside the same llama graph was that the kv-cache becomes awkward to handle. Perhaps we can restructure the kv-cache so that there is an auxiliary cache for these "speculators"; then everything more or less already works from a sync perspective. Or maybe I am missing something.
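
To make the idea concrete, a purely illustrative sketch (not code from this PR) of what an auxiliary cache for speculators could look like, using the public `llama_memory_*` API to keep the draft and target caches in lockstep:

```cpp
#include "llama.h"

// Illustrative only: pair the target model's memory with a small sidecar
// memory for the MTP layer(s), so draft tokens can be rolled back from both
// at once after verification. The memories would come from llama_get_memory()
// on the respective contexts.
struct spec_kv_pair {
    llama_memory_t mem_main; // memory of the target context
    llama_memory_t mem_mtp;  // memory of the MTP draft context

    // drop everything from position `p0` onwards in both caches
    void rollback(llama_seq_id seq, llama_pos p0) {
        llama_memory_seq_rm(mem_main, seq, p0, -1);
        llama_memory_seq_rm(mem_mtp,  seq, p0, -1);
    }
};
```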

nybblr pushed a commit to nybblr/llama.cpp that referenced this pull request May 5, 2026
Without this override, upstream PR ggml-org#22673's MTP context inherits n_ubatch=1024
from the main context, sizing its compute buffer at ~1.1 GiB. At 150k context on a
24 GB GPU, that triggers a cudaMalloc OOM during MTP context init. The override drops
the buffer to ~362 MiB at n_ubatch=256 and lets the MTP context fit.

A decode bench post-patch reveals the real wall is an MTP head weight-format
mismatch (3.4% acceptance on this lab's sidecar GGUF vs 72-82% in the PR
description), not VRAM. See RESULTS_UPSTREAM_MTP_22673.md in the lab repo.
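
A rough sketch of the override described above (the variable names `cparams_main` and `model_mtp` are illustrative; only the `llama_context_params` fields and `llama_init_from_model` are real API):

```cpp
// Sketch: give the sidecar MTP context its own, smaller micro-batch so its
// compute buffer is sized for a handful of draft tokens rather than for the
// main context's n_ubatch.
llama_context_params cparams_mtp = llama_context_default_params();
cparams_mtp.n_ctx    = cparams_main.n_ctx; // assumed: share the sequence length
cparams_mtp.n_batch  = 256;                // values from the patch description
cparams_mtp.n_ubatch = 256;                // ~362 MiB compute buffer instead of ~1.1 GiB
llama_context * ctx_mtp = llama_init_from_model(model_mtp, cparams_mtp);
```
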
@coder543

coder543 commented May 5, 2026

Just as a note: Gemma 4 MTP support will likely require other changes.

From the blog post:

To make these MTP drafters exceptionally fast and accurate, we introduced several architectural enhancements under the hood. The draft models seamlessly utilize the target model's activations and share its KV cache, meaning they don't have to waste time recalculating context the larger model has already figured out. For our E2B and E4B edge models, where the final logit calculation becomes a big bottleneck, we even implemented an efficient clustering technique in the embedder to further accelerate generation.

In the transformers implementation, Google also uses a heuristic approach to dynamically adjust how many tokens are drafted at each step, which could be nice to have for all MTP models.
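
For what it's worth, a minimal sketch of such a heuristic (nothing here is from this PR or from transformers; the struct and thresholds are illustrative):

```cpp
// Illustrative draft-length tuner: lengthen drafts while acceptance stays
// high, shorten them when the target model starts rejecting.
struct mtp_draft_tuner {
    int   n_draft     = 3;     // current draft length (--spec-draft-n-max equivalent)
    int   n_draft_max = 8;
    float accept_ema  = 0.75f; // smoothed acceptance rate

    void update(int n_drafted, int n_accepted) {
        const float rate = n_drafted > 0 ? (float) n_accepted / n_drafted : 0.0f;
        accept_ema = 0.9f*accept_ema + 0.1f*rate;
        if (accept_ema > 0.80f && n_draft < n_draft_max) { n_draft++; }
        if (accept_ema < 0.50f && n_draft > 1)           { n_draft--; }
    }
};
```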

@msb-msb

msb-msb commented May 5, 2026

Did a clean DFlash + DDTree reproduction on a single RTX 3090 (24GB) a few days ago, both Qwens, full bench_llm.py suite from the Luce repo. Posting numbers here as a comparison point for your MTP results.

Bench results

| Model | HumanEval | GSM8K | Math500 | Mean |
| --- | --- | --- | --- | --- |
| Qwen 3.5-27B Q4_K_M | 2.76x | 2.48x | 2.53x | 2.59x |
| Qwen 3.6-27B Q4_K_M | 2.81x | 2.25x | 2.61x | 2.56x |

Both with their matching 3.5-1.5B / 3.6-1.5B drafts.

Two things worth flagging on these numbers:

  • Qwen 3.5 came in below the README's headline 3.43x HumanEval — I got 2.76x. Probably DDTree variance at temp=1.0 plus bench_llm.py defaults vs whatever produced the 3.43x.
  • Qwen 3.6 came in above the README's 1.98x. The 3.6 draft matured between April 26 and April 30 — HumanEval acceptance length climbed from 5.94 to 7.48.

vs MTP on Qwen 3.6

So on Qwen 3.6 specifically: ~2.56x mean for DFlash + DDTree vs the ~1.85x being reported here for MTP. Different mechanisms, different tradeoffs:

  • DFlash needs the Luce fork + separate draft GGUF
  • MTP is single-checkpoint and lands in mainline

Whether they compose — MTP draft tokens fed into DDTree verification — is the interesting question, but I haven't tried it yet.

Plan to reproduce MTP on the same machine when this lands. Full methodology + per-task tables at https://insiderllm.com/guides/dflash-rtx-3090-bench-both-qwens/ if useful.

@gmaxwell

gmaxwell commented May 5, 2026

If the user invokes with a non-MTP model it dies with an assert:

llama_model_create: overriding architecture qwen35 -> qwen35_mtp
/src/models/qwen35_mtp.cpp:8: GGML_ASSERT(hparams.nextn_predict_layers > 0 && "QWEN35_MTP requires nextn_predict_layers > 0") failed

Obviously I don't expect it to do MTP in this case, but it should probably reject the option politely and either abort cleanly or ignore the spec option.
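
Something along these lines would do it (a sketch only; `hparams.nextn_predict_layers` is the field the assert already checks and `LLAMA_LOG_ERROR` exists in the tree, but the surrounding control flow and return convention are assumed):

```cpp
// Sketch: fail the MTP override gracefully instead of asserting, so the
// server can fall back to plain decoding when the GGUF has no MTP head.
if (hparams.nextn_predict_layers == 0) {
    LLAMA_LOG_ERROR("%s: model has no MTP head (nextn_predict_layers == 0), "
                    "cannot use --spec-type mtp\n", __func__);
    return false; // assumed: caller treats this as "MTP unavailable", not a crash
}
```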

Aside: Qwen3.6-27B-Q8 with a 256k context on an RTX A6000 gives ~20 tok/s without MTP and ~55 tok/s with MTP. Seems to work. Would be nice if parallel was still possible, but one step at a time! :)

@sluflyer06

Thank you, we are eagerly awaiting this becoming stable. Here are automated test results from my machine:

Qwen3.6-27B Q6_K benchmark on llama.cpp b9025-10829dbcc / PR #22673 branch
Hardware: RTX 3090 24GB + RTX 3060 12GB
Runtime flags: -fa on -c 10000 -np 1 -ngl 99 --no-mmap --no-cache-prompt
Endpoint: /completion, raw text prompt
Prompt: 6978 tokens, Generation: 256 tokens, Runs: 3 measured runs after warmup

| mode | model | prefill tok/s avg | generation tok/s avg | MTP acceptance | loaded VRAM |
| --- | --- | --- | --- | --- | --- |
| MTP enabled | Qwen3.6-27B-MTP-Q6_K.gguf + --spec-type mtp --spec-draft-n-max 3 | 665.14 | 42.45 | 76.0% | 24.96 GiB |
| MTP disabled, same GGUF | Qwen3.6-27B-MTP-Q6_K.gguf, no spec | 1315.46 | 22.97 | n/a | 22.47 GiB |
| Existing non-MTP Q6 | Qwen3.6-27B-Q6_K.gguf, no spec | 1260.12 | 22.39 | n/a | 22.59 GiB |

Result:

* MTP improves decode from 22.97 tok/s to 42.45 tok/s on the same GGUF: ~1.85x speedup.

* Against the existing non-MTP Q6 file, decode improves from 22.39 tok/s to 42.45 tok/s: ~1.90x speedup.

* Prefill is slower with MTP enabled in this PR path: 665 tok/s vs 1315 tok/s on the same GGUF (~0.51x).

* MTP adds about 2.49 GiB loaded VRAM in this setup.

Ooooof, that is bad - PP takes an absolute dive. Hope you're not processing any long prompts or caring about how responsive the model is. PP would need to roughly double.

@y-almannaee

After transplanting the MTP head from your provided MTP file onto unsloth/Qwen3.6-27B-Q4_K_M and testing it on a 4090 on Windows 10, it seems to "silently" crash if GPU VRAM is, or is about to be, full. I don't fully understand the mode of the crash, but:

Common to both:

G:\AI\llama\llama-server.exe --port 5001 --chat-template-file "G:\AI\models\Qwen3.6_merged_template.jinja" 
-m G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf -fa 1 --cpu-strict 1 -np 1 --no-warmup 
--slot-save-path G:\AI\models\cache --host 0.0.0.0 --jinja --min-p 0.0 -t 14  --no-mmap
--ctx-checkpoints 16 -cram 12288 -ngl 60 --mlock --spec-type mtp --spec-draft-n-max 3
First scenario - "silent" crash

-fit 163840 --fit-target 1536 --ctx-size 163840

Key item: using device CUDA0 ... - 0 MiB free

PS G:\AI\llama-swap> .\llama-swap.exe
llama-swap listening on http://:8080
Tuesday, 05-May-26 23:24:28 +04 [INFO] Preloading model: Qwen3.6-27B-Q4_K_M-mtp
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24563 MiB):
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24563 MiB
build_info: b0-unknown
system_info: n_threads = 14 (n_threads_batch = 14) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (RTX 4090)   | 24563 = 22450 + (25703 = 14323 +   10136 +    1243) +      -23589 |
common_memory_breakdown_print: |   - Host               |                   3187 =  2134 +     702 +     350                |
common_params_fit_impl: projected to use 25703 MiB of device memory vs. 22450 MiB of free device memory
common_params_fit_impl: cannot meet free memory target of 1536 MiB, need to reduce device memory by 4789 MiB
common_params_fit_impl: context size set by user to 163840 -> no change
common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 60, abort
common_fit_params: fitting params to free memory took 0.63 seconds
llama_model_loader: loaded meta data with 52 key-value pairs and 866 tensors from G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.6-27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.6-27B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 27B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.6-2...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.6 27B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.6-27B
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  18:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  19:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  20:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  21:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  22:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  23:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  24:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  26:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  27:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  28:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  29:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  30:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  31:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  32:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  33:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  41:                tokenizer.ggml.bos_token_id u32              = 248044
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 15
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = Qwen3.6-27B-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.6-27B.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 76
llama_model_loader: - kv  50:                         qwen35.block_count u32              = 65
llama_model_loader: - kv  51:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - type  f32:  456 tensors
llama_model_loader: - type q8_0:    8 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q5_K:   48 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 16.07 GiB (5.05 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 22988 MiB free
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 65
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 6144
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 48
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 27B
print_info: model params          = 27.32 B
print_info: general.name          = Qwen3.6-27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 248044 '<|endoftext|>'
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 60/66 layers to GPU
load_tensors:          CPU model buffer size =   682.03 MiB
load_tensors:        CUDA0 model buffer size = 14323.45 MiB
load_tensors:    CUDA_Host model buffer size =  1452.62 MiB
............................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 163840
llama_context: n_ctx_seq     = 163840
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (163840) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.95 MiB
llama_kv_cache:        CPU KV buffer size =   640.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  9600.00 MiB
llama_kv_cache: size = 10240.00 MiB (163840 cells,  16 layers,  1/1 seqs), K (f16): 5120.00 MiB, V (f16): 5120.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_memory_recurrent:        CPU RS buffer size =    62.34 MiB
llama_memory_recurrent:      CUDA0 RS buffer size =   536.16 MiB
llama_memory_recurrent: size =  598.50 MiB (     1 cells,  65 layers,  1 seqs), R (f32):   22.50 MiB, S (f32):  576.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =  1243.73 MiB
sched_reserve:  CUDA_Host compute buffer size =   350.13 MiB
sched_reserve: graph nodes  = 3657
sched_reserve: graph splits = 100 (with bs=512), 12 (with bs=1)
sched_reserve: reserve took 91.75 ms, sched copies = 1
srv    load_model: loading MTP head from 'G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf' (override_arch=qwen35_mtp)
llama_model_loader: loaded meta data with 52 key-value pairs and 866 tensors from G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.6-27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.6-27B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 27B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.6-2...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.6 27B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.6-27B
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  18:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  19:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  20:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  21:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  22:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  23:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  24:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  26:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  27:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  28:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  29:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  30:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  31:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  32:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  33:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  41:                tokenizer.ggml.bos_token_id u32              = 248044
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 15
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = Qwen3.6-27B-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.6-27B.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 76
llama_model_loader: - kv  50:                         qwen35.block_count u32              = 65
llama_model_loader: - kv  51:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - type  f32:  456 tensors
llama_model_loader: - type q8_0:    8 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q5_K:   48 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 16.07 GiB (5.05 BPW)
llama_model_create: overriding architecture qwen35 -> qwen35_mtp
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 0 MiB free
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35_mtp
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 65
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: model type            = ?B
print_info: model params          = 27.32 B
print_info: general.name          = Qwen3.6-27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 248044 '<|endoftext|>'
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
llama_model_load: error loading model: invalid vector subscript
llama_model_load_from_file_impl: failed to load model
srv    load_model: failed to load MTP head from 'G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf'
srv   operator (): operator (): cleaning up before exit...
main: exiting due to model loading error
Tuesday, 05-May-26 23:24:41 +04 [WARN] <Qwen3.6-27B-Q4_K_M-mtp> ExitError >> exit status 1, exit code: 1
Tuesday, 05-May-26 23:24:41 +04 [INFO] <Qwen3.6-27B-Q4_K_M-mtp> process exited but not StateStopping, current state: starting
Received signal interrupt, shutting down...
Second scenario - successful start, choosing small values so as not to overload the GPU, just to get it working

-fit 81920 --fit-target 512 --ctx-size 81920

PS G:\AI\llama-swap> .\llama-swap.exe
llama-swap listening on http://:8080
Tuesday, 05-May-26 23:25:12 +04 [INFO] Preloading model: Qwen3.6-27B-Q4_K_M-mtp
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 24563 MiB):
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes, VRAM: 24563 MiB
build_info: b0-unknown
system_info: n_threads = 14 (n_threads_batch = 14) / 16 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 |
init: using 15 threads for HTTP server
start: binding port with default address family
main: loading model
srv    load_model: loading model 'G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf'
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
common_params_fit_impl: getting device memory data for initial parameters:
common_memory_breakdown_print: | memory breakdown [MiB] | total    free     self   model   context   compute    unaccounted |
common_memory_breakdown_print: |   - CUDA0 (RTX 4090)   | 24563 = 22450 + (20343 = 14323 +    5336 +     683) +      -18229 |
common_memory_breakdown_print: |   - Host               |                   2707 =  2134 +     382 +     190                |
common_params_fit_impl: projected to use 20343 MiB of device memory vs. 22450 MiB of free device memory
common_params_fit_impl: will leave 2106 >= 512 MiB of free device memory, no changes needed
common_fit_params: successfully fit params to free device memory
common_fit_params: fitting params to free memory took 0.58 seconds
llama_model_loader: loaded meta data with 52 key-value pairs and 866 tensors from G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.6-27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.6-27B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 27B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.6-2...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.6 27B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.6-27B
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  18:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  19:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  20:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  21:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  22:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  23:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  24:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  26:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  27:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  28:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  29:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  30:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  31:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  32:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  33:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  41:                tokenizer.ggml.bos_token_id u32              = 248044
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 15
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = Qwen3.6-27B-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.6-27B.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 76
llama_model_loader: - kv  50:                         qwen35.block_count u32              = 65
llama_model_loader: - kv  51:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - type  f32:  456 tensors
llama_model_loader: - type q8_0:    8 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q5_K:   48 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 16.07 GiB (5.05 BPW)
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 22988 MiB free
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 65
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: ssm_d_conv            = 4
print_info: ssm_d_inner           = 6144
print_info: ssm_d_state           = 128
print_info: ssm_dt_rank           = 48
print_info: ssm_n_group           = 16
print_info: ssm_dt_b_c_rms        = 0
print_info: model type            = 27B
print_info: model params          = 27.32 B
print_info: general.name          = Qwen3.6-27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 248044 '<|endoftext|>'
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 60/66 layers to GPU
load_tensors:          CPU model buffer size =   682.03 MiB
load_tensors:        CUDA0 model buffer size = 14323.45 MiB
load_tensors:    CUDA_Host model buffer size =  1452.62 MiB
............................................................................................
common_init_result: added <|endoftext|> logit bias = -inf
common_init_result: added <|im_end|> logit bias = -inf
common_init_result: added <|fim_pad|> logit bias = -inf
common_init_result: added <|repo_name|> logit bias = -inf
common_init_result: added <|file_sep|> logit bias = -inf
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 81920
llama_context: n_ctx_seq     = 81920
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (81920) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.95 MiB
llama_kv_cache:        CPU KV buffer size =   320.00 MiB
llama_kv_cache:      CUDA0 KV buffer size =  4800.00 MiB
llama_kv_cache: size = 5120.00 MiB ( 81920 cells,  16 layers,  1/1 seqs), K (f16): 2560.00 MiB, V (f16): 2560.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
llama_memory_recurrent:        CPU RS buffer size =    62.34 MiB
llama_memory_recurrent:      CUDA0 RS buffer size =   536.16 MiB
llama_memory_recurrent: size =  598.50 MiB (     1 cells,  65 layers,  1 seqs), R (f32):   22.50 MiB, S (f32):  576.00 MiB
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   683.73 MiB
sched_reserve:  CUDA_Host compute buffer size =   190.13 MiB
sched_reserve: graph nodes  = 3657
sched_reserve: graph splits = 100 (with bs=512), 12 (with bs=1)
sched_reserve: reserve took 25.77 ms, sched copies = 1
srv    load_model: loading MTP head from 'G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf' (override_arch=qwen35_mtp)
llama_model_loader: loaded meta data with 52 key-value pairs and 866 tensors from G:\AI\models\Qwen3.6-27B-Q4_K_M-MTP\model.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen35
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                     general.sampling.top_k i32              = 20
llama_model_loader: - kv   3:                     general.sampling.top_p f32              = 0.950000
llama_model_loader: - kv   4:                      general.sampling.temp f32              = 1.000000
llama_model_loader: - kv   5:                               general.name str              = Qwen3.6-27B
llama_model_loader: - kv   6:                           general.basename str              = Qwen3.6-27B
llama_model_loader: - kv   7:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   8:                         general.size_label str              = 27B
llama_model_loader: - kv   9:                            general.license str              = apache-2.0
llama_model_loader: - kv  10:                       general.license.link str              = https://huggingface.co/Qwen/Qwen3.6-2...
llama_model_loader: - kv  11:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv  12:                   general.base_model.count u32              = 1
llama_model_loader: - kv  13:                  general.base_model.0.name str              = Qwen3.6 27B
llama_model_loader: - kv  14:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  15:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen3.6-27B
llama_model_loader: - kv  16:                               general.tags arr[str,2]       = ["unsloth", "image-text-to-text"]
llama_model_loader: - kv  17:                      qwen35.context_length u32              = 262144
llama_model_loader: - kv  18:                    qwen35.embedding_length u32              = 5120
llama_model_loader: - kv  19:                 qwen35.feed_forward_length u32              = 17408
llama_model_loader: - kv  20:                qwen35.attention.head_count u32              = 24
llama_model_loader: - kv  21:             qwen35.attention.head_count_kv u32              = 4
llama_model_loader: - kv  22:             qwen35.rope.dimension_sections arr[i32,4]       = [11, 11, 10, 0]
llama_model_loader: - kv  23:                      qwen35.rope.freq_base f32              = 10000000.000000
llama_model_loader: - kv  24:    qwen35.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  25:                qwen35.attention.key_length u32              = 256
llama_model_loader: - kv  26:              qwen35.attention.value_length u32              = 256
llama_model_loader: - kv  27:                     qwen35.ssm.conv_kernel u32              = 4
llama_model_loader: - kv  28:                      qwen35.ssm.state_size u32              = 128
llama_model_loader: - kv  29:                     qwen35.ssm.group_count u32              = 16
llama_model_loader: - kv  30:                  qwen35.ssm.time_step_rank u32              = 48
llama_model_loader: - kv  31:                      qwen35.ssm.inner_size u32              = 6144
llama_model_loader: - kv  32:             qwen35.full_attention_interval u32              = 4
llama_model_loader: - kv  33:                qwen35.rope.dimension_count u32              = 64
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = qwen35
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,248320]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,248320]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,247587]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  39:                tokenizer.ggml.eos_token_id u32              = 248046
llama_model_loader: - kv  40:            tokenizer.ggml.padding_token_id u32              = 248055
llama_model_loader: - kv  41:                tokenizer.ggml.bos_token_id u32              = 248044
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  43:                    tokenizer.chat_template str              = {%- set image_count = namespace(value...
llama_model_loader: - kv  44:               general.quantization_version u32              = 2
llama_model_loader: - kv  45:                          general.file_type u32              = 15
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = Qwen3.6-27B-GGUF/imatrix_unsloth.gguf
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = unsloth_calibration_Qwen3.6-27B.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count u32              = 496
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count u32              = 76
llama_model_loader: - kv  50:                         qwen35.block_count u32              = 65
llama_model_loader: - kv  51:                qwen35.nextn_predict_layers u32              = 1
llama_model_loader: - type  f32:  456 tensors
llama_model_loader: - type q8_0:    8 tensors
llama_model_loader: - type q4_K:  289 tensors
llama_model_loader: - type q5_K:   48 tensors
llama_model_loader: - type q6_K:   65 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 16.07 GiB (5.05 BPW)
llama_model_create: overriding architecture qwen35 -> qwen35_mtp
llama_prepare_model_devices: using device CUDA0 (NVIDIA GeForce RTX 4090) (0000:01:00.0) - 2286 MiB free
load: 0 unused tokens
load: printing all EOG tokens:
load:   - 248044 ('<|endoftext|>')
load:   - 248046 ('<|im_end|>')
load:   - 248063 ('<|fim_pad|>')
load:   - 248064 ('<|repo_name|>')
load:   - 248065 ('<|file_sep|>')
load: special tokens cache size = 33
load: token to piece cache size = 1.7581 MB
print_info: arch                  = qwen35_mtp
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 262144
print_info: n_embd                = 5120
print_info: n_embd_inp            = 5120
print_info: n_layer               = 65
print_info: n_head                = 24
print_info: n_head_kv             = 4
print_info: n_rot                 = 64
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 256
print_info: n_embd_head_v         = 256
print_info: n_gqa                 = 6
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-06
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 17408
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = -1
print_info: rope type             = 40
print_info: rope scaling          = linear
print_info: freq_base_train       = 10000000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 262144
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: mrope sections        = [11, 11, 10, 0]
print_info: model type            = ?B
print_info: model params          = 27.32 B
print_info: general.name          = Qwen3.6-27B
print_info: vocab type            = BPE
print_info: n_vocab               = 248320
print_info: n_merges              = 247587
print_info: BOS token             = 248044 '<|endoftext|>'
print_info: EOS token             = 248046 '<|im_end|>'
print_info: EOT token             = 248046 '<|im_end|>'
print_info: PAD token             = 248055 '<|vision_pad|>'
print_info: LF token              = 198 'Ċ'
print_info: FIM PRE token         = 248060 '<|fim_prefix|>'
print_info: FIM SUF token         = 248062 '<|fim_suffix|>'
print_info: FIM MID token         = 248061 '<|fim_middle|>'
print_info: FIM PAD token         = 248063 '<|fim_pad|>'
print_info: FIM REP token         = 248064 '<|repo_name|>'
print_info: FIM SEP token         = 248065 '<|file_sep|>'
print_info: EOG token             = 248044 '<|endoftext|>'
print_info: EOG token             = 248046 '<|im_end|>'
print_info: EOG token             = 248063 '<|fim_pad|>'
print_info: EOG token             = 248064 '<|repo_name|>'
print_info: EOG token             = 248065 '<|file_sep|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
done_getting_tensors: partial load — used 18 of 866 tensors in the file (rest belong to a sibling model on the same .gguf)
load_tensors: offloading output layer to GPU
load_tensors: offloading 59 repeating layers to GPU
load_tensors: offloaded 60/66 layers to GPU
load_tensors:          CPU model buffer size =   682.03 MiB
load_tensors:        CUDA0 model buffer size =  1425.06 MiB
....srv    load_model: initializing slots, n_slots = 1
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 81920
llama_context: n_ctx_seq     = 81920
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 10000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (81920) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.95 MiB
llama_kv_cache:      CUDA0 KV buffer size =   320.00 MiB
llama_kv_cache: size =  320.00 MiB ( 81920 cells,   1 layers,  1/1 seqs), K (f16):  160.00 MiB, V (f16):  160.00 MiB
llama_kv_cache: attn_rot_k = 0, n_embd_head_k_all = 256
llama_kv_cache: attn_rot_v = 0, n_embd_head_k_all = 256
sched_reserve: reserving ...
sched_reserve: resolving fused Gated Delta Net support:
sched_reserve: fused Gated Delta Net (autoregressive) enabled
sched_reserve: fused Gated Delta Net (chunked) enabled
sched_reserve:      CUDA0 compute buffer size =   495.00 MiB
sched_reserve:  CUDA_Host compute buffer size =   180.02 MiB
sched_reserve: graph nodes  = 50
sched_reserve: graph splits = 2
sched_reserve: reserve took 20.02 ms, sched copies = 1
set_mtp: MTP draft head registered (ctx_mtp=00000258C56E6C90, n_ubatch=512, n_embd=5120)
slot   load_model: id  0 | task -1 | speculative decoding context initialized
slot   load_model: id  0 | task -1 | new slot, n_ctx = 81920
srv    load_model: prompt cache is enabled, size limit: 12288 MiB
srv    load_model: use `--cache-ram 0` to disable the prompt cache
srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
srv          init: init: --cache-idle-slots requires --kv-unified, disabling
init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>
'
srv          init: init: chat template, thinking = 1
main: model loaded
main: server is listening on http://0.0.0.0:5802
main: starting the main loop...
srv  update_slots: all slots are idle
Tuesday, 05-May-26 23:25:23 +04 [INFO] <Qwen3.6-27B-Q4_K_M-mtp> Health check passed on http://localhost:5802/health
srv  log_server_r: done request: GET / 127.0.0.1 200

It shouldn't assign the MTP head to a GPU that reports 0 MiB of free VRAM; instead it should either fall back to the CPU or emit a clear error like “cannot load second MTP model on GPU: no free memory; reduce offload or force MTP to CPU.”
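
For illustration, a minimal sketch of the placement policy being asked for, not the actual llama.cpp loader code; `device_info`, `place_mtp_head`, and the size estimate are hypothetical names used only to show the fallback/error decision:

```cpp
// Hypothetical sketch of the requested placement policy -- not llama.cpp's
// actual loader API. Device names and the size estimate are illustrative.
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

struct device_info {
    std::string name;       // e.g. "CUDA0", "CPU"
    uint64_t    free_bytes; // reported free memory on that device
};

// Decide where the MTP head's tensors should live.
// Policy from the comment above: never place them on a GPU that reports
// (near) zero free memory; either fall back to the CPU or fail loudly.
std::string place_mtp_head(const std::vector<device_info> & gpus,
                           uint64_t mtp_bytes_needed,
                           bool allow_cpu_fallback) {
    for (const auto & gpu : gpus) {
        if (gpu.free_bytes >= mtp_bytes_needed) {
            return gpu.name; // enough VRAM -> keep the head on this GPU
        }
    }
    if (allow_cpu_fallback) {
        return "CPU"; // no GPU has room -> fall back instead of failing mid-load
    }
    throw std::runtime_error(
        "cannot load second MTP model on GPU: no free memory; "
        "reduce offload or force MTP to CPU");
}
```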

@wsbagnsv1
Copy link
Copy Markdown
Contributor

Since most current mainstream GGUFs ship without MTP tensors, would it make sense to allow loading a secondary GGUF containing only the MTP tensors, rather than requiring a single file that holds both the main and the MTP model?
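
For what it's worth, the tensors in such a secondary file could presumably be discovered with ggml's existing gguf reader. Below is a small, hedged sketch that only lists candidate MTP tensors; the `nextn` name substring is an assumption borrowed from other MTP models and may not match the naming this PR uses, and the `gguf.h` include path may differ between ggml versions:

```cpp
// Minimal sketch, assuming MTP tensors live in a separate GGUF and can be
// recognized by name. Only lists candidates; it does not implement the merge.
#include <cstdio>
#include <cstring>
#include "gguf.h"

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s <mtp.gguf>\n", argv[0]);
        return 1;
    }
    // Read metadata and tensor infos only; no ggml context, no data allocation.
    struct gguf_init_params params = { /*no_alloc =*/ true, /*ctx =*/ NULL };
    struct gguf_context * ctx = gguf_init_from_file(argv[1], params);
    if (!ctx) {
        fprintf(stderr, "failed to open %s\n", argv[1]);
        return 1;
    }
    const int64_t n = gguf_get_n_tensors(ctx);
    for (int64_t i = 0; i < n; ++i) {
        const char * name = gguf_get_tensor_name(ctx, i);
        if (strstr(name, "nextn") != NULL) { // hypothetical MTP tensor marker
            printf("MTP tensor: %s\n", name);
        }
    }
    gguf_free(ctx);
    return 0;
}
```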

@1337hero
Copy link
Copy Markdown

1337hero commented May 6, 2026

Decided to test this myself and report back with results. Eh, I guess this is better suited to more memory-bound systems.

So my setup:

  • GPU: 3x AMD Radeon AI PRO R9700 (32 GB), RADV / Vulkan, single-GPU pinned via --device Vulkan1
  • Model: Qwen3.6-27B Q8_0 (am17an/Qwen3.6-27B-MTP-GGUF for MTP runs, vanilla Q8_0 for baseline)
  • Builds:
    • Baseline: master 66bafdcf1 (b8979), Vulkan
    • MTP: this PR branch 17df5830e (b9030), Vulkan
  • Bench: [am17an's gist](https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090), unmodified

Ran a baseline first, then did MTP with n_max 2 and 3.

Aggregate

| Run | Wall (s) | Accept rate | Avg tok/s |
|---|---|---|---|
| Baseline | 72.70 | n/a | ~20.1 |
| MTP n_max=2 | 78.02 | 0.819 | ~18.9 |
| MTP n_max=3 | 82.45 | 0.713 | ~17.6 |

Per-prompt accept rate

| prompt | n=2 | n=3 |
|---|---|---|
| code_python | 0.918 | 0.908 |
| code_cpp | 0.790 | 0.740 |
| explain_concept | 0.819 | 0.672 |
| summarize | 0.738 | 0.630 |
| qa_factual | 0.862 | 0.684 |
| translation | 0.812 | 0.542 |
| creative_short | 0.750 | 0.603 |
| stepwise_math | 0.904 | 0.804 |
| long_code_review | 0.740 | 0.677 |

Per-prompt tok/s

| prompt | baseline | n=2 | n=3 |
|---|---|---|---|
| code_python | 20.1 | 20.2 | 21.1 |
| code_cpp | 20.1 | 18.7 | 18.5 |
| explain_concept | 20.1 | 18.8 | 17.3 |
| summarize | 20.3 | 17.6 | 16.9 |
| qa_factual | 20.1 | 19.6 | 17.7 |
| translation | 20.9 | 19.8 | 15.5 |
| creative_short | 20.1 | 18.0 | 16.2 |
| stepwise_math | 20.1 | 20.1 | 19.5 |
| long_code_review | 20.0 | 17.7 | 17.3 |

Couldn't reproduce the OP's results myself, but thought I'd share.

@seadra
Copy link
Copy Markdown

seadra commented May 6, 2026

Apologies for the spam, but can anyone familiar with the process explain how to create the secondary, smaller MTP model locally for Qwen3.6 variants like this that don't yet have one?

@clintonium-119
Copy link
Copy Markdown

clintonium-119 commented May 6, 2026

I just tested this on my ROG Flow Z13 (Strix Halo), and it gave me a big boost. All models are Q6, except the 'normal' 27B, which was a Q5. I used a local bench script that runs a simple test 3 times for each model.

I believe the context size was 128k with a Q8 KV cache.

The only issue I ran into was that it crashed when trying to run parallel slots.

| Model | Prefill (t/s) | Decode (t/s) |
|---|---|---|
| Qwen3.6-27B | 55.3 | 10.3 |
| Qwen3.6-27B-MTP | 47.8 | 20.0 |
| Qwen3.6-35B | 153.1 | 42.3 |
| Qwen3.6-35B-MTP | 136.6 | 58.2 |

@AdamNiederer
Copy link
Copy Markdown

Results on a 7900 XT w/ Vulkan, Qwen3.6 27B IQ4_XS converted by me from HF (Q4_K_M doesn't fit w/ MTP):

| Configuration | Input | Output | PP t/s | TG t/s | Total Time | Acceptance Rate |
|---|---|---|---|---|---|---|
| MTP IQ4_XS | 13,259 | 2,887 | 398.45 | 51.60 | 89.24 s | 63.84% (1,896 / 2,970) |
| No MTP Q4_K_M | 13,259 | 2,160 | 674.94 | 32.13 | 86.89 s | N/A |

The prompt was "explain this Rust code". I'm seeing ~60 with Python and ~40 in multi-turn multimodal/English workflows.

Smashing work. With the PP regression fixed, this will make these dense models usable on whole new classes of hardware.

@michaelasper
Copy link
Copy Markdown

I ran an independent smoke benchmark of this PR using the prompt suite from the linked gist. This is not a full quality/eval run, just a quick check for whether MTP produces a measurable speedup on my local machine.

Environment:

  • Mac Studio, Apple M3 Ultra, 256 GB unified memory
  • PR checkout: 267f8afe8 from mtp-clean
  • Model: am17an/Qwen3.6-35BA3B-MTP-GGUF / Qwen3.6-35BA3B-MTP.gguf
  • Server flags included: --ctx-size 32768 --n-gpu-layers 999 --parallel 1 --cont-batching --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 --batch-size 4096 --ubatch-size 1024
  • MTP flags tested: --spec-type mtp --spec-draft-n-max 2 and 3
  • Prompt suite: 9 prompts from https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090
  • n_predict: 192

Aggregate result:

| profile | ok | total wall time | avg tok/s | aggregate accept rate | speedup vs baseline |
|---|---|---|---|---|---|
| baseline | yes | 19.94s | 71.17 | n/a | n/a |
| MTP draft-n=2 | yes | 14.72s | 96.38 | 80.28% | 1.35x |
| MTP draft-n=3 | yes | 15.15s | 93.65 | 69.00% | 1.32x |

The result looks positive: MTP produced a clear speedup on all 9 prompts in this short completion suite. In this configuration, draft-n=2 was slightly better than draft-n=3; draft-n=3 accepted more total draft tokens, but the lower acceptance rate made it a bit slower overall.
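
As a rough sanity check on that trade-off (a simplified model that assumes each draft token is accepted independently with probability $p$, which real decoding does not guarantee), one verification pass yields about

$$1 + p + p^2 + \dots + p^n$$

tokens. With the rates above that is roughly $1 + 0.80 + 0.64 \approx 2.4$ tokens per pass at draft-n=2 versus $1 + 0.69 + 0.48 + 0.33 \approx 2.5$ at draft-n=3, so the extra draft step buys almost nothing here while adding its own latency.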

One resource note: observed RSS was about 37.2 GB for baseline and about 74.2 GB for the MTP profiles. That is fine on this machine, but probably worth calling out for users testing larger Qwen3.6 variants.

Thanks for working on this. Happy to run a longer-output or longer-context variant if that would be more useful for the PR.

@y-almannaee
Copy link
Copy Markdown

> Apologies for the spam, but can anyone familiar with the process explain how to create the secondary, smaller MTP model locally for Qwen3.6 variants like this that don't yet have one?

You may use https://gist.github.com/buzz/1c439684d5e3f36492ae9f64ef7e3f67#file-convert-py with the 35B-A3B MTP file provided in this thread under the "How to use" subheading.
