
feat: Add Mimo v2.5 model support #22493

Merged
ngxson merged 19 commits into ggml-org:master from AesSedai:mimo-v2.5 on May 7, 2026

Conversation

@AesSedai
Contributor

@AesSedai AesSedai commented Apr 29, 2026

Overview

This PR adds support for MiMo V2.5 (+ Pro) for text-to-text inference. The non-Pro MiMo V2.5 has audio and vision components that are not included in this PR.

Additional information

I haven't re-tested the Pro model, but I think it should still convert and quantize correctly; I'll follow up on that once I finish the non-Pro model quantizations.

convert_hf_to_gguf.py now dequantizes the FP8 safetensors correctly. MiMo uses an oddly packed TP-aware sharding for its weights, in addition to fusing the attention QKV. To maintain compatibility with the existing MiMo V2 Flash path, I've opted to un-fuse the attention QKV and use the existing modeling code.
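
For reference, the core of the dequantization is roughly the following (an illustrative sketch assuming a DeepSeek-style block-wise `weight_scale_inv` layout; the actual code in this PR also has to undo MiMo's TP-aware sharding and the fused QKV packing):

```python
# Illustrative sketch only: block-wise FP8 -> float dequantization, assuming an
# e4m3 `weight` plus one float scale per 128x128 block. MiMo's real checkpoint
# additionally needs the TP-aware shard reassembly before this step.
import torch

def dequant_fp8_block(weight: torch.Tensor, scale_inv: torch.Tensor, block: int = 128) -> torch.Tensor:
    w = weight.to(torch.float32)
    # expand each per-block scale so it covers its 128x128 tile, then trim the padding
    scale = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return w * scale[: w.shape[0], : w.shape[1]]
```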

One small tweak to note: the MiMo V2 and V2.5 models have an attention_value_scale provided in config.json that was not being used. I've plumbed it through, which should bring the model closer to parity with the transformers implementation.

MiMo-V2.5-Q8_0-KLD.txt

====== Perplexity statistics ======
Mean PPL(Q)                   :   5.135221 ±   0.030263
Mean PPL(base)                :   5.128919 ±   0.030176
Cor(ln(PPL(Q)), ln(PPL(base))):  99.65%
Mean ln(PPL(Q)/PPL(base))     :   0.001228 ±   0.000494
Mean PPL(Q)/PPL(base)         :   1.001229 ±   0.000495
Mean PPL(Q)-PPL(base)         :   0.006302 ±   0.002539

====== KL divergence statistics ======
Mean    KLD:   0.012455 ±   0.000173
Maximum KLD:  10.765786
99.9%   KLD:   0.548270
99.0%   KLD:   0.125446
95.0%   KLD:   0.043128
90.0%   KLD:   0.025489
Median  KLD:   0.004163
10.0%   KLD:   0.000084
 5.0%   KLD:   0.000021
 1.0%   KLD:   0.000002
 0.1%   KLD:  -0.000002
Minimum KLD:  -0.000098

Requirements

  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: Yes, used to implement the TP-aware FP8 dequantization

@AesSedai AesSedai requested review from CISC and ggerganov as code owners April 29, 2026 01:19
@github-actions github-actions bot added the model (Model specific) and python (python script changes) labels Apr 29, 2026
@AesSedai
Contributor Author

also cc @ngxson for review

@sayap
Contributor

sayap commented Apr 29, 2026

Getting this error when converting:

AttributeError: 'GGUFWriter' object has no attribute 'add_attn_value_scale'. Did you mean: 'add_attn_output_scale'?

Need to include changes to gguf-py?

@segmond

segmond commented Apr 29, 2026

I'm going to find some disk space, download it, and give this a go!

@AesSedai
Contributor Author

@segmond oops, forgot to include that in the commit. I've pushed it now, give it another shot?

@segmond

segmond commented Apr 29, 2026

@segmond oops, forgot to include that in the commit. I've pushed it now, give it another shot?

I'm downloading the Q8; at the rate it's going, it will take about 9 hours if there are no issues. I'll pull and rebuild when I get up in the morning, before I try it.

@AesSedai
Contributor Author

Ah I meant to ping @sayap about the convert issue, my eyes are crossed :P

I pushed the commit that added the writer and constant updates.

@AesSedai
Contributor Author

I've just tried converting the MiMo V2.5 Pro version and the conversion fails at the TP dequant; I'll look into it.

Contributor

@ngxson ngxson left a comment


Not 100% sure, given the attention formula:

[image: attention formula]

It seems like softmax * (V * scale_v) should be equivalent to (softmax * V) * scale_v, right?

If so, we may be able to reuse attn_output_scale for it?
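
For what it's worth, that identity is easy to sanity-check numerically (illustrative snippet, not code from this PR):

```python
# Quick check that scaling V before the matmul equals scaling the output after it,
# since value_scale is just a scalar. Purely illustrative.
import torch

q, k, v = torch.randn(3, 8, 16, 64, dtype=torch.float64).unbind(0)
value_scale = 0.5773
probs = torch.softmax(q @ k.transpose(-1, -2) / 64**0.5, dim=-1)
assert torch.allclose(probs @ (v * value_scale), (probs @ v) * value_scale)
```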

@drrros
Contributor

drrros commented Apr 29, 2026

Built from this branch and it fails to autofit:

srv          load: spawning server instance with name=mimo-2.5-q5-k-m:thinking-coding on port 60083
srv          load: spawning server instance with args:
srv          load:   /home/drros/llama.cpp/build/bin/llama-server
srv          load:   --draft-max
srv          load:   64
srv          load:   --draft-n-min
srv          load:   4
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   --mlock
srv          load:   --no-mmap
srv          load:   --no-mmproj-offload
srv          load:   --port
srv          load:   60083
srv          load:   --spec-ngram-size-n
srv          load:   48
srv          load:   --spec-type
srv          load:   ngram-mod
srv          load:   --temperature
srv          load:   1.0
srv          load:   --top-p
srv          load:   0.95
srv          load:   --webui-mcp-proxy
srv          load:   --alias
srv          load:   mimo-2.5-q5-k-m:thinking-coding
srv          load:   --ctx-size
srv          load:   262144
srv          load:   --cache-ram
srv          load:   65536
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --swa-checkpoints
srv          load:   128
srv          load:   --fit
srv          load:   on
srv          load:   --fit-target
srv          load:   1536,512,512
srv          load:   --kv-unified
srv          load:   --model
srv          load:   /mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf
srv          load:   --parallel
srv          load:   6
srv          load:   --reasoning
srv          load:   on
srv          load:   --ubatch-size
srv          load:   2048
srv  log_server_r: done request: POST /models/load 192.168.0.61 200
[60083] ggml_cuda_init: found 3 CUDA devices (Total VRAM: 71963 MiB):
[60083]   Device 0: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
[60083]   Device 1: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
[60083]   Device 2: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
[60083] build_info: b8955-3dcaba985
[60083] system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
[60083] Running without SSL
[60083] init: using 47 threads for HTTP server
[60083] srv          main: -----------------
[60083] srv          main: CORS proxy is enabled, do not expose server to untrusted environments
[60083] srv          main: This feature is EXPERIMENTAL and may be removed or changed in future versions
[60083] srv          main: -----------------
[60083] start: binding port with default address family
[60083] main: loading model
[60083] srv    load_model: loading model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[60083] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[60083] common_params_fit_impl: getting device memory data for initial parameters:
[60083] llama_init_from_model: failed to initialize the context: quantized V cache was requested, but this requires Flash Attention
[60083] common_fit_params: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
[60083] common_fit_params: fitting params to free memory took 1.16 seconds
...
[60083] load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[60083] ggml_backend_cuda_buffer_type_alloc_buffer: allocating 73814.24 MiB on device 0: cudaMalloc failed: out of memory
[60083] alloc_tensor_range: failed to allocate CUDA0 buffer of size 77399838208
[60083] llama_model_load: error loading model: unable to allocate CUDA0 buffer
[60083] llama_model_load_from_file_impl: failed to load model
[60083] common_init_from_params: failed to load model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[60083] srv    load_model: failed to load model, '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[60083] srv    operator(): operator(): cleaning up before exit...
[60083] main: exiting due to model loading error

Update:
--n-cpu-moe 99 doesn't help; it still fails to load:

[56501] load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[56501] load_tensors: offloading output layer to GPU
[56501] load_tensors: offloading 47 repeating layers to GPU
[56501] load_tensors: offloaded 49/49 layers to GPU
[56501] load_tensors:        CUDA0 model buffer size =  1878.24 MiB
[56501] load_tensors:        CUDA1 model buffer size =  1578.58 MiB
[56501] load_tensors:        CUDA2 model buffer size =  2112.19 MiB
[56501] load_tensors:    CUDA_Host model buffer size = 211945.25 MiB
[56501] ....................................................................................................
[56501] common_init_result: added </s> logit bias = -inf
[56501] common_init_result: added <|endoftext|> logit bias = -inf
[56501] common_init_result: added <|im_end|> logit bias = -inf
[56501] common_init_result: added <|fim_pad|> logit bias = -inf
[56501] common_init_result: added <|repo_name|> logit bias = -inf
[56501] common_init_result: added <|file_sep|> logit bias = -inf
[56501] llama_context: constructing llama_context
[56501] llama_context: n_seq_max     = 6
[56501] llama_context: n_ctx         = 262144
[56501] llama_context: n_ctx_seq     = 262144
[56501] llama_context: n_batch       = 2048
[56501] llama_context: n_ubatch      = 2048
[56501] llama_context: causal_attn   = 1
[56501] llama_context: flash_attn    = auto
[56501] llama_context: kv_unified    = true
[56501] llama_context: freq_base     = 10000000.0
[56501] llama_context: freq_scale    = 1
[56501] llama_context: n_ctx_seq (262144) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
[56501] llama_context:  CUDA_Host  output buffer size =     3.49 MiB
[56501] llama_kv_cache_iswa: creating non-SWA KV cache, size = 262144 cells
[56501] llama_kv_cache:      CUDA0 KV buffer size =  1020.00 MiB
[56501] llama_kv_cache:      CUDA1 KV buffer size =  1020.00 MiB
[56501] llama_kv_cache:      CUDA2 KV buffer size =  1020.00 MiB
[56501] llama_kv_cache: size = 3060.00 MiB (262144 cells,   9 layers,  6/1 seqs), K (q8_0): 1836.00 MiB, V (q8_0): 1224.00 MiB
[56501] llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 192
[56501] llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 128
[56501] llama_kv_cache_iswa: creating     SWA KV cache, size = 2816 cells
[56501] llama_kv_cache:      CUDA0 KV buffer size =   102.27 MiB
[56501] llama_kv_cache:      CUDA1 KV buffer size =    94.96 MiB
[56501] llama_kv_cache:      CUDA2 KV buffer size =    87.66 MiB
[56501] llama_kv_cache: size =  284.88 MiB (  2816 cells,  39 layers,  6/1 seqs), K (q8_0):  170.93 MiB, V (q8_0):  113.95 MiB
[56501] llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 192
[56501] llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 128
[56501] sched_reserve: reserving ...
[56501] sched_reserve: layer 0 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
[56501] sched_reserve: Flash Attention was auto, set to disabled
[56501] sched_reserve: resolving fused Gated Delta Net support:
[56501] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[56501] sched_reserve: fused Gated Delta Net (chunked) enabled
[56501] ggml_backend_cuda_buffer_type_alloc_buffer: allocating 133698.49 MiB on device 0: cudaMalloc failed: out of memory
[56501] ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 140193026688
[56501] graph_reserve: failed to allocate compute buffers
[56501] llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
[56501] common_init_result: failed to create context with model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[56501] common_init_from_params: failed to create context with model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'

Am I doing something wrong?
This is the .ini section for MiMo:

[mimo-2.5-q5-k-m:thinking-coding]
model = /mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf
c = 262144
temp = 1.0
top-p = 0.95
cache-ram = 65536
load-on-startup = false
ub = 2048
reasoning = on
no-mmap = 1
ctk = q8_0
ctv = q8_0
ctx-checkpoints = 128
fit = off
ncmoe = 99

@Andryusz

Andryusz commented Apr 29, 2026

I did some tests with the IQ3_S quant, and while the model seems sane on this PR, I get quite different behavior compared to the official API via OpenRouter.

I have a quite specific prompt that causes non-English, in-character reasoning on many models, including MiMo V2.5 on the API, and it's 100% consistent there. However, on this PR the model always thinks in English as a normal assistant, and the final response is also quite different from the API's. The token that would result in non-English reasoning has only about 6% probability, so I don't think quantization would explain such a big difference in token distribution.
While it's not definite proof that something is wrong with the implementation, it would be nice to do a logit comparison against vllm/transformers.

As a side note, the performance is terrible for me. I get only 30%-40% of decode speed compared to something like Qwen 3.5 397B.

@segmond

segmond commented Apr 29, 2026

Still needs work; fit doesn't seem to work. I have 8 GPUs... I'm letting this load just to get a feel for the inference, then I'll manually assign layers and see how it is.

load_tensors: offloading output layer to GPU
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 1/49 layers to GPU
load_tensors: CUDA7 model buffer size = 633.27 MiB
load_tensors: CUDA_Host model buffer size = 312384.99 MiB
.................................................................

@AesSedai
Contributor Author

@drrros @segmond for autofit problems, that should be a separate issue I think. Autofit was working fine for me on both the Pro and non-Pro versions at least.

@ngxson thanks for looking it over, I'll give that an eyeball later today 👀

@Andryusz I'll do some more digging later today with logit dumps. If there are issues, I'd lean towards them being somewhere in the inference implementation (I think?), since I haven't touched that; this PR was mostly about the convert stage, and if that were FUBAR the output would be total gibberish. This model doesn't have a shared expert, which may contribute to some of the perf issues (and IQ quants are a bit slower on CPU, I think).

@coder543

coder543 commented Apr 29, 2026

To provide some performance context on DGX Spark... it is surprisingly slow. I also noticed no difference when running on this branch versus not.

llama-bench -d 0,32768 -p 8192 -n 100 -fa 1 -b 2048 -ub 2048 -mmp 0 -m mimo-v2.5-iq3_s.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 210.36 ± 0.48 |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 6.48 ± 0.09 |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 42.45 ± 0.06 |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 3.02 ± 0.01 |

Compare that to:

llama-bench -d 0,32768 -p 8192 -n 100 -fa 1 -b 2048 -ub 2048 -mmp 0 -m nemotron-3-super-120b-a12b-ud-q4_k_xl.gguf,qwen3.5-397b-a17b-ud-iq2_xxs.gguf,step-3.5-flash.q4_k_s.gguf,minimax-m2.7-ud-q3_k_xl.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 822.80 ± 0.99 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 16.77 ± 0.01 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 759.38 ± 6.63 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 16.38 ± 0.02 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 515.73 ± 3.14 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 18.84 ± 0.02 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 512 | 1 | 0 | pp8192 @ d32768 | 284.37 ± 2.62 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 512 | 1 | 0 | tg100 @ d32768 | 17.21 ± 0.04 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 933.19 ± 2.56 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 27.42 ± 0.04 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 730.20 ± 1.79 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 21.42 ± 0.04 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 883.13 ± 3.21 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 28.41 ± 0.02 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 382.06 ± 1.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 13.12 ± 0.02 |

Whether you compare against models of similar on-disk size or models with more total and active parameters, whichever way you look at it, MiMo V2.5's performance is really low compared to anything comparable.

I also noticed a lot of CPU usage on mimo-v2.5, even though the model was entirely pinned to the GPU, and llama-server logs don't seem to indicate anything running on the CPU.

@AesSedai
Contributor Author

@coder543 there wouldn't be a performance difference between this branch and master; the only difference is the one attention scale hparam that gets plumbed through to the inference code, which alters the output slightly.

Re: performance, that may need to be addressed too, but I'm not sure it's in scope for this PR, which just adds support in the first place. I do appreciate the feedback and the detailed comparison, though.

@coder543

coder543 commented Apr 29, 2026

For whatever it is worth, I had codex investigate a little, and maybe flash attention is broken for this model in some way. Disabling flash attention significantly increases throughput, but also requires vastly more memory.

Quoting GPT-5.5 running through codex

Root cause looks like MiMo’s head shape:

n_embd_head_k = 192
n_embd_head_v = 128

CUDA Flash Attention support in this llama.cpp build rejects that shape. In ggml-cuda/fattn.cu, supported K->ne[0] cases include 128, 256, 320 with V=256, 576 with V=512, etc., but not K=192, V=128. Since we explicitly pass -fa 1, llama.cpp keeps Flash Attention enabled and the scheduler assigns GGML_OP_FLASH_ATTN_EXT to CPU.

I tested -fa 0:

-fa 1 tg100: ~6.5 t/s
-fa 0 tg100: ~21.8 t/s
-fa 1 pp8192: ~210 t/s
-fa 0 pp8192: ~418 t/s

So disabling Flash Attention fixes the CPU fallback and makes generation ~3.3x faster.

Bad news: -fa 0 greatly increases memory pressure. Even -c 32768 -ub 2048 failed to create context because CUDA compute buffer reservation wanted ~16.9 GiB and OOMed.
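
Rough intuition for that last point: without flash attention the full KQ score matrix gets materialized, so for a large ubatch and context it alone can dominate the compute buffer. A back-of-the-envelope sketch (the head count here is a made-up example value, not MiMo's actual hyperparameter):

```python
# Back-of-the-envelope only; n_head is a hypothetical example value.
n_head   = 64        # hypothetical
n_ubatch = 2048      # from the run above
n_kv     = 32768     # context size from the run above

kq_bytes = n_head * n_ubatch * n_kv * 4   # fp32 KQ scores for one layer's attention
print(f"~{kq_bytes / 2**30:.1f} GiB of scores")   # ~16 GiB, the same order as the OOM above
```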

@AesSedai
Contributor Author

@ngxson

It seems like softmax * (V * scale_v) should be equivalent to (softmax * V) * scale_v, right?

If so, we maybe able to reuse attn_output_scale for it?

A bit of digging shows that attn_output_scale is only used by the Grok arch at the moment, so it's gated there? I could reuse it and plumb it through for MiMo as attn_output_scale, which would save a tiny bit of wiring work; up to you though.

The other recommendation I saw from an LLM review was to pre-bake that scale into v_proj at convert time, which wouldn't need any new hparams at all (but would require newly converted GGUFs, which I could do).

@AesSedai
Contributor Author

@Andryusz do you mind sharing the prompt? I've managed to get transformers inference working with dequantization to BF16, and compared the forced-prefix KLD from transformers against the BF16 GGUF logits; there is a bit of variance but overall it's very close:

hf_top_k=20 gg_top_k=20  hf_n=64 gg_n=64

=== factual  steps=64 ===
  top-1 agreement     : 100.00%
  top-K overlap (mean):  78.91%
  KL(hf||gg) mean/p50/p95: 0.013956 / 0.000056 / 0.010251
  KL(gg||hf) mean     : 0.004526
  logit cosine mean   : 0.970220
  disagreements       : 0 (near-ties=0, drift=0)
  KL hotspots (step, KL):
    step  22: 0.421583
    step  17: 0.270204
    step  16: 0.130762
    step  14: 0.012261
    step   2: 0.010251

=== math_step  steps=64 ===
  top-1 agreement     :  98.44%
  top-K overlap (mean):  95.08%
  KL(hf||gg) mean/p50/p95: 0.001669 / 0.000000 / 0.007076
  KL(gg||hf) mean     : 0.001732
  logit cosine mean   : 0.995724
  disagreements       : 1 (near-ties=1, drift=0)
    step  17 [near-tie]: hf=220 (gap=0.375)  gg=369 (gap=0.086)  gg-pick logit in hf=29.125, hf-pick logit in gg=29.324
  KL hotspots (step, KL):
    step  13: 0.027873
    step  17: 0.025531
    step   1: 0.020404
    step  28: 0.016059
    step  57: 0.007076

=== roleplay  steps=64 ===
  top-1 agreement     :  98.44%
  top-K overlap (mean):  94.69%
  KL(hf||gg) mean/p50/p95: 0.008640 / 0.000304 / 0.033271
  KL(gg||hf) mean     : 0.008350
  logit cosine mean   : 0.993676
  disagreements       : 1 (near-ties=1, drift=0)
    step   7 [near-tie]: hf=264 (gap=0.125)  gg=438 (gap=0.075)  gg-pick logit in hf=25.500, hf-pick logit in gg=25.593
  KL hotspots (step, KL):
    step  40: 0.080738
    step  52: 0.052327
    step  33: 0.051458
    step  47: 0.047668
    step  55: 0.033271

=== OVERALL ===
  top-1 agreement     :  98.96%
  top-K overlap       :  89.56%
  KL(hf || gg) mean   : 0.008088
  KL(gg || hf) mean   : 0.004869
  logit cosine mean   : 0.986540

So unless you have a more specific reproduction, I'd chalk it up to IQ3_S quantization error being the cause.
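
(For anyone wanting to reproduce this kind of comparison, the per-step KL numbers above boil down to roughly the following; an illustrative helper, not the actual comparison script.)

```python
# KL(P_hf || P_gg) for a single forced-prefix decoding step; both logit vectors
# are [vocab_size] for the same step of the same prompt. Illustrative only.
import torch
import torch.nn.functional as F

def step_kl(hf_logits: torch.Tensor, gg_logits: torch.Tensor) -> float:
    log_p = F.log_softmax(hf_logits.float(), dim=-1)
    log_q = F.log_softmax(gg_logits.float(), dim=-1)
    return torch.sum(log_p.exp() * (log_p - log_q)).item()
```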

@ngxson
Contributor

ngxson commented Apr 30, 2026

A bit of digging shows that attn_output_scale is only used for the Grok arch at the moment, so that's gated? I could re-use it and plumb it through for MiMo as attn_output_scale and that would save a tiny bit of wiring work, up to you though.

Hmm right, Grok has specific logic for it. I think it's OK to keep a dedicated var for v_scale for now, then.

The other recommendation I saw from some LLM review was to pre-bake that scale into the v_proj at convert time which wouldn't need any new hparams at all then (but would require newly converted ggufs, which I could do)

No, it should not be baked into v_proj, for numerical stability. For example, NVFP4 also has a separate scale applied to the activation, not baked into the projection matrix.

@Andryusz

@AesSedai Thank you for checking the logits, they certainly look pretty good. After poking around a bit more with the model, I agree the effect I observed is probably caused by quantization + imatrix, which possibly skews the model toward English. I could share the prompt, but to be honest I don't think it's worth spending more time on this particular example. I will do a bit more testing, and if I see more concrete indications of something being wrong I will share the details.

Regarding the bad performance: I can confirm @coder543's findings. FA seems broken; disabling it brings speeds into reasonable territory.

@ChicoPinto70

Great work, AesSedai!!! Using -ot instead of --fit, I got the non-Pro Q5_K_M version working on 3x3090 with 256 GB DDR4. The model seems great! But when the prompt is a bit more complex, it goes into an endless reasoning chain of thought. Since the previous version already had this behavior, I believe it's a Xiaomi issue. Thanks again!

@AesSedai
Contributor Author

AesSedai commented May 5, 2026

@ngxson I've merged master in, fixed the conflicts, and added fused QKV. It's very slightly faster:

[image: benchmark screenshot]

and the BF16 PPL is still fine: Final estimate: PPL = 5.9630 +/- 0.03940

@AesSedai AesSedai marked this pull request as draft May 5, 2026 09:24
@AesSedai
Contributor Author

AesSedai commented May 5, 2026

There is a regression somewhere; I'm working on tracking it down. I was testing the Pro Q8_0 and it was off somehow. I still had the Q8_0 logits from previously KLD-testing Pro, and I still have the unfused Pro GGUF, and the mean KLD was very different:

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.699453 ±   0.052520
Mean PPL(base)                :   3.198633 ±   0.016744
Cor(ln(PPL(Q)), ln(PPL(base))):  76.28%
Mean ln(PPL(Q)/PPL(base))     :   0.739302 ±   0.005124
Mean PPL(Q)/PPL(base)         :   2.094474 ±   0.010731
Mean PPL(Q)-PPL(base)         :   3.500820 ±   0.041196

====== KL divergence statistics ======
Mean    KLD:   0.775122 ±   0.004258
Maximum KLD:  27.474590
99.9%   KLD:  14.151838
99.0%   KLD:   8.829903
95.0%   KLD:   3.696737
90.0%   KLD:   1.970265
Median  KLD:   0.209342
10.0%   KLD:   0.000715
 5.0%   KLD:   0.000148
 1.0%   KLD:   0.000010
 0.1%   KLD:  -0.000000
Minimum KLD:  -0.000007

Putting this into draft mode for now while I root cause.

@AesSedai AesSedai marked this pull request as ready for review May 5, 2026 09:58
@AesSedai
Contributor Author

AesSedai commented May 5, 2026

Fixed: the `ml.get_key(LLM_KV_ATTENTION_VALUE_SCALE, hparams.f_attn_value_scale, false);` call got lost during the merge by accident. KLD is normal again:

====== Perplexity statistics ======
Mean PPL(Q)                   :   3.195142 ±   0.016734
Mean PPL(base)                :   3.198633 ±   0.016744
Cor(ln(PPL(Q)), ln(PPL(base))):  99.19%
Mean ln(PPL(Q)/PPL(base))     :  -0.001092 ±   0.000666
Mean PPL(Q)/PPL(base)         :   0.998909 ±   0.000666
Mean PPL(Q)-PPL(base)         :  -0.003491 ±   0.002130

====== KL divergence statistics ======
Mean    KLD:   0.021039 ±   0.000290
Maximum KLD:  11.382697
99.9%   KLD:   1.377634
99.0%   KLD:   0.255950
95.0%   KLD:   0.075872
90.0%   KLD:   0.040932
Median  KLD:   0.003563
10.0%   KLD:   0.000020
 5.0%   KLD:   0.000006
 1.0%   KLD:  -0.000001
 0.1%   KLD:  -0.000006
Minimum KLD:  -0.000058

@drrros
Contributor

drrros commented May 5, 2026

Running the latest ..-vision branch, performance is decent (350-390 t/s pp and 15-20 t/s tg on 3x RTX PRO 4000 and an EPYC 9274F with 12-channel DDR5-4800, on Q5_K_M). The model seems smart, but it often goes into loops when used as an agentic backend; I'm using Claude Code. Right now I'm trying to mitigate it by raising repeat-penalty, currently at 1.2, and testing further (1.0 didn't help, though I'm not sure that isn't the default).

@AesSedai
Contributor Author

AesSedai commented May 5, 2026

I've made updated QKV fused quants and am uploading them to HF now. The PPL / KLD for non-Pro is as follows (the mixture column is the MoE-optimized quant schema for Default Type / FFN Up / FFN Gate / FFN Down types):

| Quant | Size | Mixture | PPL | (Mean PPL(Q)/PPL(base)) - 1 | KLD |
| --- | --- | --- | --- | --- | --- |
| Q8_0 | 305.68 GiB (8.50 BPW) | Q8_0 | 5.135595 ± 0.030275 | +0.1271% | 0.012539 ± 0.000329 |
| Q5_K_M | 212.42 GiB (5.91 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 5.148091 ± 0.030387 | +0.3708% | 0.014915 ± 0.000309 |
| Q4_K_M | 176.70 GiB (4.92 BPW) | Q8_0 / Q4_K / Q4_K / Q5_K | 5.195349 ± 0.030767 | +1.2921% | 0.020670 ± 0.000214 |
| IQ4_XS | 136.78 GiB (3.80 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 5.271397 ± 0.031170 | +2.7748% | 0.041177 ± 0.000349 |
| IQ3_S | 105.33 GiB (2.93 BPW) | Q6_K / IQ2_S / IQ2_S / IQ3_S | 5.552710 ± 0.033286 | +8.2595% | 0.092639 ± 0.000604 |

@ggerganov @CISC this PR should be fully ready for review now. @ngxson approved it earlier, but that's stale with the merge from master and the change to fused QKV.

@BahamutRU

#22673
Did you see this? Will there be MTP heads in MiMo-V2.5? Seems like a real way to speed things up.
Sorry for going off-topic!

Comment thread: convert_hf_to_gguf.py (outdated)
@AesSedai
Contributor Author

AesSedai commented May 7, 2026

@BahamutRU I hadn't seen that PR yet, I'll take a closer look at it. MiMo does have MTP heads so I'll take a quick peek at what it'd take to keep those tensors in the conversion. Hopefully it'd be drop-in support then with that PR?

@BahamutRU

Hopefully it'd be drop-in support then with that PR?

I don't know, but I really hope so. It works great for Qwen3.6-27B (+95% tg) and Qwen3.6-35B-A3B (+40%); even the 40% is a very nice bonus.

Comment thread: src/llama-hparams.h (outdated)
Comment thread: src/models/mimo2.cpp (outdated)
Comment thread: src/models/mimo2.cpp (outdated)
AesSedai and others added 3 commits May 7, 2026 02:32
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@AesSedai
Contributor Author

AesSedai commented May 7, 2026

I'm also testing the convert locally for including the MTP tensors in the GGUF, following the GLM-4.5/DS MTP convention. I'll push that commit up in a few hours when I confirm the convert works correctly, the tensors get stored in nextn properly, it can load properly, and the PPL / KLD look correct.
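
(Roughly, that's a tensor-name remap along these lines; the HF-side names below are hypothetical placeholders, while the GGUF-side blk.*.nextn.* names follow the existing GLM-4.5/DeepSeek MTP convention visible in the load log further down:)

```python
# Hypothetical illustration only: the left-hand (HF checkpoint) names are made up;
# the right-hand GGUF names follow the GLM-4.5/DeepSeek nextn convention.
NEXTN_TENSOR_MAP = {
    "model.mtp.eh_proj.weight": "blk.{bid}.nextn.eh_proj.weight",
    "model.mtp.enorm.weight":   "blk.{bid}.nextn.enorm.weight",
    "model.mtp.hnorm.weight":   "blk.{bid}.nextn.hnorm.weight",
}
```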

Member

@CISC CISC left a comment


Clean these up later. :)

Comment thread: convert_hf_to_gguf.py (outdated)
Comment thread: convert_hf_to_gguf.py (outdated)
Comment thread: src/models/mimo2.cpp (outdated)
AesSedai and others added 4 commits May 7, 2026 03:05
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@AesSedai
Contributor Author

AesSedai commented May 7, 2026

It didn't take as long to write and test as I thought, so I'm going to bed (very late) now :)

The MTP tensors were saved and the model loads correctly:

model has unused tensor blk.48.attn_output.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.48.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.48.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.ffn_gate.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.48.ffn_down.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.48.ffn_up.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.48.nextn.eh_proj.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.48.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_output.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.49.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.49.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.ffn_gate.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.49.ffn_down.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.49.ffn_up.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.49.nextn.eh_proj.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.49.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_output.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.50.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.50.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.ffn_gate.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.50.ffn_down.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.50.ffn_up.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.50.nextn.eh_proj.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.50.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.layer_output_norm.weight (size = 16384 bytes) -- ignoring

and comparing BF16 (logits collected from the previous conversion) to BF16 (new conversion) shows they're identical:

====== Perplexity statistics ======
Mean PPL(Q)                   :   5.132410 ±   0.030237
Mean PPL(base)                :   5.130804 ±   0.030196
Cor(ln(PPL(Q)), ln(PPL(base))):  99.99%
Mean ln(PPL(Q)/PPL(base))     :   0.000313 ±   0.000094
Mean PPL(Q)/PPL(base)         :   1.000313 ±   0.000095
Mean PPL(Q)-PPL(base)         :   0.001606 ±   0.000486

====== KL divergence statistics ======
Mean    KLD:  -0.000010 ±   0.000000
Maximum KLD:   0.000004
99.9%   KLD:   0.000000
99.0%   KLD:   0.000000
95.0%   KLD:   0.000000
90.0%   KLD:  -0.000000
Median  KLD:  -0.000009
10.0%   KLD:  -0.000022
 5.0%   KLD:  -0.000025
 1.0%   KLD:  -0.000031
 0.1%   KLD:  -0.000037
Minimum KLD:  -0.000051

I think that's all of it now 🤞

Contributor

@ngxson ngxson left a comment


Let's merge when the CI passes; any bugs can be fixed via follow-up PRs.

@ngxson ngxson merged commit 8e52631 into ggml-org:master May 7, 2026
48 of 50 checks passed
@coder543

coder543 commented May 7, 2026

Apparently there was a bug in the config, fixed a few hours ago: https://huggingface.co/XiaomiMiMo/MiMo-V2.5/commit/13b5e3f92ab9572523fa21c7f1bfe9c92228aaca

Might need new GGUFs?

@CISC
Member

CISC commented May 7, 2026

Apparently there was a bug in the config, fixed a few hours ago: https://huggingface.co/XiaomiMiMo/MiMo-V2.5/commit/13b5e3f92ab9572523fa21c7f1bfe9c92228aaca

Might need new GGUFs?

This array is never used for GGUFs, but the tokens may or may not need to be added as EOG.


Labels

model (Model specific), python (python script changes)
