llama + spec: MTP Support #22673
Conversation
|
Nice, I think this is a better fresh start than my WIP #18886 (which I still haven't found the time to continue). There were some other attempts to add MTP support, but they all heavily rely on host <--> device data copies. I assume you tried to address this, right? (Maybe there was a discussion somewhere that I wasn't aware of.)
ngxson left a comment

(not a review, but opening some discussions)
```cpp
// number of recurrent-state snapshots per seq for rollback; tensors are widened to (1 + n_rs_seq) groups
uint32_t n_rs_seq = 0;
```
Not 100% sure, but maybe the naming with `_seq` is a bit confusing (or I'm misunderstanding this).
I imagine that we want to keep a ring-buffer style of recurrent state(s), similar to SWA in the KV cache, right? If that's the case, it's probably better to call it `n_rs_window`.
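To illustrate the ring-buffer reading (names like `n_rs_window` and `rs_snapshot` are placeholders for this sketch, not code from the PR):

```cpp
// Sketch only: keep the last n_rs_window recurrent-state snapshots per sequence
// in a ring buffer, so a rejected draft can restore any of the recent states.
#include <cstdint>
#include <utility>
#include <vector>

struct rs_snapshot {
    std::vector<uint8_t> data; // serialized recurrent state for one sequence
    int32_t pos = -1;          // token position this snapshot corresponds to
};

struct rs_ring {
    uint32_t n_rs_window = 0;       // how many snapshots are kept per sequence
    uint32_t head        = 0;       // next slot to overwrite
    std::vector<rs_snapshot> slots; // fixed-size backing storage

    explicit rs_ring(uint32_t window) : n_rs_window(window), slots(window) {}

    // overwrite the oldest slot with the newest snapshot
    void push(rs_snapshot s) {
        if (n_rs_window == 0) {
            return;
        }
        slots[head] = std::move(s);
        head = (head + 1) % n_rs_window;
    }
};
```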
Yes, this is partly from the review comment here: #22400 (comment)
|
```cpp
for (int il = 0; il < n_layer; ++il) {
// MTP/NextN layers are loaded as extra decoder blocks but not executed in the main pass.
const int n_transformer_layers = n_layer - (int)hparams.nextn_predict_layers;
```
Nit, but maybe call it `n_main_layers`, as technically the NextN layer is also a transformer layer.
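In other words, something along these lines (a sketch of the suggested rename, not the PR's actual code):

```cpp
// the NextN/MTP blocks are still transformer layers, they are just not part of
// the main decode pass, so count the "main" layers explicitly
const int n_main_layers = n_layer - (int) hparams.nextn_predict_layers;
```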
```cpp
//TODO: generalize if this is ok, we should load <arch_name>_mtp arch?
if (params_base.speculative.type == COMMON_SPECULATIVE_TYPE_MTP) {
    SRV_INF("loading MTP head from '%s' (override_arch=qwen35_mtp)\n",
            params_base.model.path.c_str());

    auto mparams_mtp = common_model_params_to_llama(params_base);
    mparams_mtp.override_arch = "qwen35_mtp";

    model_mtp.reset(llama_model_load_from_file(params_base.model.path.c_str(), mparams_mtp));
    if (model_mtp == nullptr) {
        SRV_ERR("failed to load MTP head from '%s'\n", params_base.model.path.c_str());
        return false;
    }
```
if you look at #18886, the better way is to move llama_graph_type to the public API, then load the context with the appropriate graph type
Yes, that seems like the correct way to do this if we want to support MTP in a generic way.
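To make that concrete, a rough sketch of the shape it could take (none of these declarations exist in llama.h today; the names are assumptions):

```cpp
// Hypothetical sketch: expose the graph type in the public API so that a single
// model load can back two contexts over the same weights, one building the
// regular decode graph and one building only the NextN/MTP head graph, instead
// of loading a second model with override_arch.
enum llama_graph_type {
    LLAMA_GRAPH_TYPE_DEFAULT = 0, // regular decoder graph
    LLAMA_GRAPH_TYPE_MTP     = 1, // graph for the NextN/MTP head only
};

// assumed extra field on the context parameters:
// struct llama_context_params {
//     ...
//     enum llama_graph_type graph_type; // which graph this context builds
// };
```

The server would then create a second context with `graph_type = LLAMA_GRAPH_TYPE_MTP` for drafting, sharing the model (and, after the refactor mentioned below, tensors) with the target context.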
|
@ngxson yes, the h2d copy was discussed with GG; he's working on a refactor which will allow us to share tensors between two llama contexts.
|
Great work, this should massively close the TG gap with vLLM, or maybe even surpass it together with tensor-parallel.
Currently, speculative decoding needs to restart from a checkpoint after some draft tokens are not accepted, which leads to some wasted work when running the target again. This PR adds the ability to roll back up to `draft_max` tokens by storing the GDN intermediates.
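Roughly, the bookkeeping looks like this (an illustrative sketch with invented names, not the PR's actual structures):

```cpp
// Sketch: snapshot the GDN/recurrent state once per drafted token so that, when
// only k of the drafted tokens are accepted, the state can be rolled back to the
// last accepted position instead of restarting from an older checkpoint.
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct gdn_state { std::vector<float> packed; }; // placeholder for the per-layer state

struct draft_window {
    int32_t base_pos = 0;             // position before drafting started
    std::vector<gdn_state> snapshots; // snapshots[i] = state after draft token i

    void record(gdn_state st) { snapshots.push_back(std::move(st)); }

    // state to restore when the target accepted k draft tokens (k == 0 keeps the base state)
    const gdn_state * rollback_to(size_t k) const {
        if (k == 0 || k > snapshots.size()) {
            return nullptr;
        }
        return &snapshots[k - 1];
    }
};
```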
|
In my opinion, Qwen 3.6 is the most important thing that has happened in open-source models in a long time; this is going to be so valuable. ngram could be set to match only very strong and long candidates, for large repetitive paraphrasing.
|
" idea is that MTP should automatically start and we shouldn't need to distribute the MTP gguf separately but also it has it's own context/kv-cache etc." -> Does this mean MTP needs additional resources (RAM/VRAM?) If so, there should always be an option to remain to disable it. Right now on my system (6 GB VRAM, 32 GB RAM), speculative decoding just makes things much slower even on very small draft models because of that exact reason, they need own context and kv-cache. Such low to midrange systems already operate on the edge in terms of memory. |
|
I'm getting garbage responses running this PR on the Vulkan backend with an R9700 using llama-server. I'm using the GGUF you linked above. Interestingly, draft acceptance is only 0.01282. Prompt: "Hello!"
|
@cmp-nct I'm not sure, but it could be possible. @Dampfinchen as of right now it is opt-in via @mbednarek360
|
Might it be possible/useful to run the draft model on a second GPU? Given that the MTP weights are relatively small, this might provide a useful speedup on systems with a dedicated high-VRAM "AI" GPU and a cheaper low-VRAM "normal" GPU used for display output, etc., and possibly prevent some degree of resource contention.
|
Thank you, we are eagerly awaiting this to become stable. Here are the automated test results for my machine:
Result:
|
@cturan Thanks for testing. I'm aware of the prefill issue and will work on a fix.
|
Might be a long shot, but any chance of supporting MTP with a reduced vocabulary? MTP layers are rather chonky and reducing token embeddings might help users with less VRAM by filtering out certain languages. Obviously the full model will still be able to produce those tokens if need be, so it won't be gimped.
|
Working on taking this for a spin with the Q4_K_M quant of Qwen3.6-35BA3B. I was gonna try to start from unsloth's quant since they already perform really well, but of course they don't have any MTP layers. @am17an Think it would work if I just "steal" the layers from your Q8 quant and merge them into the unsloth quant? (add blk.40 and bump some top-level config like block_count and kv_count)
Overview
This PR adds support for MTP (Multi-Token Prediction) heads. I tested this on Qwen3.6 27B and Qwen3.6 35BA3B, but in principle it should work for any MTP model. I've posted the detailed results below, but typically I see a steady-state acceptance of around 75% with 3 draft tokens, which is more than a 2x speed-up over baseline. The design decisions I took to get to this stage are as follows:
Performance
A simple bench for testing various prompts is here: https://gist.github.com/am17an/228edfb84ed082aa88e3865d6fa27090. Posting the results below:
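For reference, treating the ~75% acceptance as an independent per-token probability $p$ and drafting $n = 3$ tokens, the expected number of tokens committed per target pass (ignoring drafting overhead) is

$$\frac{1 - p^{\,n+1}}{1 - p} = \frac{1 - 0.75^{4}}{1 - 0.75} \approx 2.7,$$

which lines up with the >2x speed-up over baseline reported above.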
Performance on DGX Spark 🧵
No MTP (baseline)
```sh
./llama-server -m ../qwen3.6-q8_0.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

MTP --spec-draft-n-max 3
```sh
./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 3
```

MTP --spec-draft-n-max 2
```sh
./llama-server -m ../qwen3.6-q8_0-mtp.gguf -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}" --spec-type mtp --spec-draft-n-max 2
```

Draft model (Qwen3.5 0.8B) with spec-draft-n-max 16 with partial rollback
```sh
llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 16 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

Master with draft model with spec-draft-n-max 64 with no partial rollback
```sh
llama-server -m ../qwen3.6/Qwen3.6-27B-Q8_0.gguf -hfd unsloth/Qwen3.5-0.8B-GGUF:Q8_0 --spec-draft-n-max 64 -np 1 --chat-template-kwargs "{\"preserve_thinking\": true}"
```

How to use
I've uploaded the GGUF which I made by using the `convert_hf_to_gguf.py` changes in this PR. Here is another GGUF for the MoE (35BA3B) model.

Requirements