
feat: Add Mimo v2.5 model support #22493

Merged
ngxson merged 19 commits into ggml-org:master from AesSedai:mimo-v2.5 on May 7, 2026

Conversation

@AesSedai
Contributor

@AesSedai AesSedai commented Apr 29, 2026

Overview

This PR adds support for MiMo V2.5 (+ Pro) for text-to-text inference. The non-Pro MiMo V2.5 has audio and vision components that are not included in this PR.

Additional information

I haven't re-tested the Pro model, but I think it should still convert and quantize correctly; I'll follow up on that once I finish the non-Pro model quantizations.

convert_hf_to_gguf.py now dequantizes the FP8 safetensors correctly. MiMo uses an oddly packed TP-aware sharding for its weights, in addition to fusing the attention QKV. To maintain compatibility with the existing MiMo V2 Flash path, I've opted to un-fuse the attention QKV and use the existing modeling code.
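
For reference, the core of the dequantization is roughly the following (an illustrative sketch assuming a DeepSeek-style block-wise `weight_scale_inv` layout; the actual code in this PR also has to undo MiMo's TP-aware sharding and the fused QKV packing):

```python
# Illustrative sketch only: block-wise FP8 -> float dequantization, assuming an
# e4m3 `weight` plus one float scale per 128x128 block. MiMo's real checkpoint
# additionally needs the TP-aware shard reassembly before this step.
import torch

def dequant_fp8_block(weight: torch.Tensor, scale_inv: torch.Tensor, block: int = 128) -> torch.Tensor:
    w = weight.to(torch.float32)
    # expand each per-block scale so it covers its 128x128 tile, then trim the padding
    scale = scale_inv.repeat_interleave(block, dim=0).repeat_interleave(block, dim=1)
    return w * scale[: w.shape[0], : w.shape[1]]
```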

One small tweak to note: the MiMo V2 and V2.5 models have an attention_value_scale provided in config.json that was not being used. I've plumbed it through, which should bring the model closer to parity with the transformers implementation.

MiMo-V2.5-Q8_0-KLD.txt

====== Perplexity statistics ======
Mean PPL(Q)                   :   5.135221 ±   0.030263
Mean PPL(base)                :   5.128919 ±   0.030176
Cor(ln(PPL(Q)), ln(PPL(base))):  99.65%
Mean ln(PPL(Q)/PPL(base))     :   0.001228 ±   0.000494
Mean PPL(Q)/PPL(base)         :   1.001229 ±   0.000495
Mean PPL(Q)-PPL(base)         :   0.006302 ±   0.002539

====== KL divergence statistics ======
Mean    KLD:   0.012455 ±   0.000173
Maximum KLD:  10.765786
99.9%   KLD:   0.548270
99.0%   KLD:   0.125446
95.0%   KLD:   0.043128
90.0%   KLD:   0.025489
Median  KLD:   0.004163
10.0%   KLD:   0.000084
 5.0%   KLD:   0.000021
 1.0%   KLD:   0.000002
 0.1%   KLD:  -0.000002
Minimum KLD:  -0.000098

Requirements

  • I have read and agree with the contributing guidelines: Yes
  • AI usage disclosure: Yes, used to implement the TP-aware FP8 dequantization

@AesSedai AesSedai requested review from CISC and ggerganov as code owners April 29, 2026 01:19
@github-actions github-actions bot added the model (Model specific) and python (python script changes) labels Apr 29, 2026
@AesSedai
Contributor Author

also cc @ngxson for review

@sayap
Contributor

sayap commented Apr 29, 2026

Getting this error when converting:

AttributeError: 'GGUFWriter' object has no attribute 'add_attn_value_scale'. Did you mean: 'add_attn_output_scale'?

Need to include changes to gguf-py?

@segmond

segmond commented Apr 29, 2026

I'm going to find some disk space, download it, and give this a go!

@AesSedai
Contributor Author

@segmond oops, forgot to include that in the commit. I've pushed it now, give it another shot?

@segmond

segmond commented Apr 29, 2026

@segmond oops, forgot to include that in the commit. I've pushed it now, give it another shot?

I'm downloading the Q8; at the rate it's going, it will take about 9 hours if there are no issues. I'll pull and rebuild when I get up in the morning, before I try it.

@AesSedai
Contributor Author

Ah I meant to ping @sayap about the convert issue, my eyes are crossed :P

I pushed the commit that added the writer and constant updates.

@AesSedai
Contributor Author

I've just tried converting the MiMo V2.5 Pro version and the conversion fails at the TP dequant; I'll look into it.

Contributor

@ngxson ngxson left a comment


Not 100% sure, given the attention formula:

[image: attention formula]

It seems like softmax * (V * scale_v) should be equivalent to (softmax * V) * scale_v, right?

If so, we may be able to reuse attn_output_scale for it?
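
For what it's worth, that identity is easy to sanity-check numerically (illustrative snippet, not code from this PR):

```python
# Quick check that scaling V before the matmul equals scaling the output after it,
# since value_scale is just a scalar. Purely illustrative.
import torch

q, k, v = torch.randn(3, 8, 16, 64, dtype=torch.float64).unbind(0)
value_scale = 0.5773
probs = torch.softmax(q @ k.transpose(-1, -2) / 64**0.5, dim=-1)
assert torch.allclose(probs @ (v * value_scale), (probs @ v) * value_scale)
```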

@drrros
Contributor

drrros commented Apr 29, 2026

Built from this branch and it fails to autofit:

srv          load: spawning server instance with name=mimo-2.5-q5-k-m:thinking-coding on port 60083
srv          load: spawning server instance with args:
srv          load:   /home/drros/llama.cpp/build/bin/llama-server
srv          load:   --draft-max
srv          load:   64
srv          load:   --draft-n-min
srv          load:   4
srv          load:   --host
srv          load:   127.0.0.1
srv          load:   --mlock
srv          load:   --no-mmap
srv          load:   --no-mmproj-offload
srv          load:   --port
srv          load:   60083
srv          load:   --spec-ngram-size-n
srv          load:   48
srv          load:   --spec-type
srv          load:   ngram-mod
srv          load:   --temperature
srv          load:   1.0
srv          load:   --top-p
srv          load:   0.95
srv          load:   --webui-mcp-proxy
srv          load:   --alias
srv          load:   mimo-2.5-q5-k-m:thinking-coding
srv          load:   --ctx-size
srv          load:   262144
srv          load:   --cache-ram
srv          load:   65536
srv          load:   --cache-type-k
srv          load:   q8_0
srv          load:   --cache-type-v
srv          load:   q8_0
srv          load:   --swa-checkpoints
srv          load:   128
srv          load:   --fit
srv          load:   on
srv          load:   --fit-target
srv          load:   1536,512,512
srv          load:   --kv-unified
srv          load:   --model
srv          load:   /mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf
srv          load:   --parallel
srv          load:   6
srv          load:   --reasoning
srv          load:   on
srv          load:   --ubatch-size
srv          load:   2048
srv  log_server_r: done request: POST /models/load 192.168.0.61 200
[60083] ggml_cuda_init: found 3 CUDA devices (Total VRAM: 71963 MiB):
[60083]   Device 0: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
[60083]   Device 1: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
[60083]   Device 2: NVIDIA RTX PRO 4000 Blackwell, compute capability 12.0, VMM: yes, VRAM: 23987 MiB
[60083] build_info: b8955-3dcaba985
[60083] system_info: n_threads = 24 (n_threads_batch = 24) / 48 | CUDA : ARCHS = 1200 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | BLACKWELL_NATIVE_FP4 = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
[60083] Running without SSL
[60083] init: using 47 threads for HTTP server
[60083] srv          main: -----------------
[60083] srv          main: CORS proxy is enabled, do not expose server to untrusted environments
[60083] srv          main: This feature is EXPERIMENTAL and may be removed or changed in future versions
[60083] srv          main: -----------------
[60083] start: binding port with default address family
[60083] main: loading model
[60083] srv    load_model: loading model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[60083] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[60083] common_params_fit_impl: getting device memory data for initial parameters:
[60083] llama_init_from_model: failed to initialize the context: quantized V cache was requested, but this requires Flash Attention
[60083] common_fit_params: encountered an error while trying to fit params to free device memory: failed to create llama_context from model
[60083] common_fit_params: fitting params to free memory took 1.16 seconds
...
[60083] load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[60083] ggml_backend_cuda_buffer_type_alloc_buffer: allocating 73814.24 MiB on device 0: cudaMalloc failed: out of memory
[60083] alloc_tensor_range: failed to allocate CUDA0 buffer of size 77399838208
[60083] llama_model_load: error loading model: unable to allocate CUDA0 buffer
[60083] llama_model_load_from_file_impl: failed to load model
[60083] common_init_from_params: failed to load model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[60083] srv    load_model: failed to load model, '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[60083] srv    operator(): operator(): cleaning up before exit...
[60083] main: exiting due to model loading error

Update:
--n-cpu-moe 99 doesn't help; it still fails to load:

[56501] load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
[56501] load_tensors: offloading output layer to GPU
[56501] load_tensors: offloading 47 repeating layers to GPU
[56501] load_tensors: offloaded 49/49 layers to GPU
[56501] load_tensors:        CUDA0 model buffer size =  1878.24 MiB
[56501] load_tensors:        CUDA1 model buffer size =  1578.58 MiB
[56501] load_tensors:        CUDA2 model buffer size =  2112.19 MiB
[56501] load_tensors:    CUDA_Host model buffer size = 211945.25 MiB
[56501] ....................................................................................................
[56501] common_init_result: added </s> logit bias = -inf
[56501] common_init_result: added <|endoftext|> logit bias = -inf
[56501] common_init_result: added <|im_end|> logit bias = -inf
[56501] common_init_result: added <|fim_pad|> logit bias = -inf
[56501] common_init_result: added <|repo_name|> logit bias = -inf
[56501] common_init_result: added <|file_sep|> logit bias = -inf
[56501] llama_context: constructing llama_context
[56501] llama_context: n_seq_max     = 6
[56501] llama_context: n_ctx         = 262144
[56501] llama_context: n_ctx_seq     = 262144
[56501] llama_context: n_batch       = 2048
[56501] llama_context: n_ubatch      = 2048
[56501] llama_context: causal_attn   = 1
[56501] llama_context: flash_attn    = auto
[56501] llama_context: kv_unified    = true
[56501] llama_context: freq_base     = 10000000.0
[56501] llama_context: freq_scale    = 1
[56501] llama_context: n_ctx_seq (262144) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
[56501] llama_context:  CUDA_Host  output buffer size =     3.49 MiB
[56501] llama_kv_cache_iswa: creating non-SWA KV cache, size = 262144 cells
[56501] llama_kv_cache:      CUDA0 KV buffer size =  1020.00 MiB
[56501] llama_kv_cache:      CUDA1 KV buffer size =  1020.00 MiB
[56501] llama_kv_cache:      CUDA2 KV buffer size =  1020.00 MiB
[56501] llama_kv_cache: size = 3060.00 MiB (262144 cells,   9 layers,  6/1 seqs), K (q8_0): 1836.00 MiB, V (q8_0): 1224.00 MiB
[56501] llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 192
[56501] llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 128
[56501] llama_kv_cache_iswa: creating     SWA KV cache, size = 2816 cells
[56501] llama_kv_cache:      CUDA0 KV buffer size =   102.27 MiB
[56501] llama_kv_cache:      CUDA1 KV buffer size =    94.96 MiB
[56501] llama_kv_cache:      CUDA2 KV buffer size =    87.66 MiB
[56501] llama_kv_cache: size =  284.88 MiB (  2816 cells,  39 layers,  6/1 seqs), K (q8_0):  170.93 MiB, V (q8_0):  113.95 MiB
[56501] llama_kv_cache: attn_rot_k = 1, n_embd_head_k_all = 192
[56501] llama_kv_cache: attn_rot_v = 1, n_embd_head_k_all = 128
[56501] sched_reserve: reserving ...
[56501] sched_reserve: layer 0 is assigned to device CUDA0 but the Flash Attention tensor is assigned to device CPU (usually due to missing support)
[56501] sched_reserve: Flash Attention was auto, set to disabled
[56501] sched_reserve: resolving fused Gated Delta Net support:
[56501] sched_reserve: fused Gated Delta Net (autoregressive) enabled
[56501] sched_reserve: fused Gated Delta Net (chunked) enabled
[56501] ggml_backend_cuda_buffer_type_alloc_buffer: allocating 133698.49 MiB on device 0: cudaMalloc failed: out of memory
[56501] ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 140193026688
[56501] graph_reserve: failed to allocate compute buffers
[56501] llama_init_from_model: failed to initialize the context: failed to allocate compute pp buffers
[56501] common_init_result: failed to create context with model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'
[56501] common_init_from_params: failed to create context with model '/mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf'

Am I doing something wrong?
This is the .ini section for MiMo:

[mimo-2.5-q5-k-m:thinking-coding]
model = /mnt/ds1nfs/codellamaweights/mimo2.5-aessedai-q5-k-m/MiMo-V2.5-Q5_K_M-00001-of-00006.gguf
c = 262144
temp = 1.0
top-p = 0.95
cache-ram = 65536
load-on-startup = false
ub = 2048
reasoning = on
no-mmap = 1
ctk = q8_0
ctv = q8_0
ctx-checkpoints = 128
fit = off
ncmoe = 99

@Andryusz

Andryusz commented Apr 29, 2026

I did some tests with the IQ3_S quant, and while the model seems sane on this PR, I get quite different behavior compared to the official API via OpenRouter.

I have a quite specific prompt that causes non-English, in-character reasoning on many models, including MiMo V2.5 on the API, and it's 100% consistent there. However, on this PR the model always thinks in English as a normal assistant, and the final response is also quite different from the API's. The token that would result in non-English reasoning has only about 6% probability, so I don't think quantization would explain such a big difference in token distribution.
While it's not definite proof that something is wrong with the implementation, it would be nice to do a logit comparison against vllm/transformers.

As a side note, the performance is terrible for me. I get only 30%-40% of decode speed compared to something like Qwen 3.5 397B.

@segmond

segmond commented Apr 29, 2026

Still needs work; fit doesn't seem to work. I have 8 GPUs... I'm letting this load just to get a feel for the inference, then I'll manually assign layers and see how it is.

load_tensors: offloading output layer to GPU
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 1/49 layers to GPU
load_tensors: CUDA7 model buffer size = 633.27 MiB
load_tensors: CUDA_Host model buffer size = 312384.99 MiB
.................................................................

@AesSedai
Contributor Author

@drrros @segmond for autofit problems, that should be a separate issue I think. Autofit was working fine for me on both the Pro and non-Pro versions at least.

@ngxson thanks for looking it over, I'll give that an eyeball later today 👀

@Andryusz I'll do some more digging later today with logit dumps. If there are issues, I'd lean towards them being somewhere in the inference implementation (I think?), since I haven't touched that; this PR was mostly about the convert stage, and if that were FUBAR the output would be total gibberish. This model doesn't have a shared expert, which may contribute to some of the perf issues (and IQ quants are a bit slower on CPU, I think).

@coder543

coder543 commented Apr 29, 2026

To provide some performance context on DGX Spark... it is surprisingly slow. I also noticed no difference when running on this branch versus not.

llama-bench -d 0,32768 -p 8192 -n 100 -fa 1 -b 2048 -ub 2048 -mmp 0 -m mimo-v2.5-iq3_s.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 210.36 ± 0.48 |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 6.48 ± 0.09 |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 42.45 ± 0.06 |
| mimo2 310B.A15B Q6_K | 105.33 GiB | 308.78 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 3.02 ± 0.01 |

Compare that to:

llama-bench -d 0,32768 -p 8192 -n 100 -fa 1 -b 2048 -ub 2048 -mmp 0 -m nemotron-3-super-120b-a12b-ud-q4_k_xl.gguf,qwen3.5-397b-a17b-ud-iq2_xxs.gguf,step-3.5-flash.q4_k_s.gguf,minimax-m2.7-ud-q3_k_xl.gguf
ggml_cuda_init: found 1 CUDA devices (Total VRAM: 124610 MiB):
  Device 0: NVIDIA GB10, compute capability 12.1, VMM: yes, VRAM: 124610 MiB
| model | size | params | backend | ngl | n_ubatch | fa | mmap | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 822.80 ± 0.99 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 16.77 ± 0.01 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 759.38 ± 6.63 |
| nemotron_h_moe 120B.A12B Q4_K - Medium | 78.02 GiB | 120.67 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 16.38 ± 0.02 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 515.73 ± 3.14 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 18.84 ± 0.02 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 512 | 1 | 0 | pp8192 @ d32768 | 284.37 ± 2.62 |
| qwen35moe 397B.A17B IQ2_XXS - 2.0625 bpw | 106.97 GiB | 396.35 B | CUDA | 99 | 512 | 1 | 0 | tg100 @ d32768 | 17.21 ± 0.04 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 933.19 ± 2.56 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 27.42 ± 0.04 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 730.20 ± 1.79 |
| step35 196B.A11B Q4_K - Small | 103.84 GiB | 196.96 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 21.42 ± 0.04 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 | 883.13 ± 3.21 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | tg100 | 28.41 ± 0.02 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | pp8192 @ d32768 | 382.06 ± 1.07 |
| minimax-m2 230B.A10B Q3_K - Medium | 94.93 GiB | 228.69 B | CUDA | 99 | 2048 | 1 | 0 | tg100 @ d32768 | 13.12 ± 0.02 |

Whether you compare against models of similar on-disk size or models with more total and active parameters, whichever way you look at it, MiMo V2.5's performance is really low compared to anything comparable.

I also noticed a lot of CPU usage on mimo-v2.5, even though the model was entirely pinned to the GPU, and llama-server logs don't seem to indicate anything running on the CPU.

@AesSedai
Contributor Author

@coder543 there wouldn't be a performance difference between this branch and master; the only difference is the one attention scale hparam that gets plumbed through to the inference code, which alters the output slightly.

Re: performance, that may need to be addressed too, but I'm not sure it's in scope for this PR, which just adds support in the first place. I do appreciate the feedback and the detailed comparison, though.

@coder543

coder543 commented Apr 29, 2026

For whatever it is worth, I had codex investigate a little, and maybe flash attention is broken for this model in some way. Disabling flash attention significantly increases throughput, but also requires vastly more memory.

Quoting GPT-5.5 running through codex

Root cause looks like MiMo’s head shape:

n_embd_head_k = 192
n_embd_head_v = 128

CUDA Flash Attention support in this llama.cpp build rejects that shape. In ggml-cuda/fattn.cu, supported K->ne[0] cases include 128, 256, 320 with V=256, 576 with V=512, etc., but not K=192, V=128. Since we explicitly pass -fa 1, llama.cpp keeps Flash Attention enabled and the scheduler assigns GGML_OP_FLASH_ATTN_EXT to CPU.

I tested -fa 0:

-fa 1 tg100: ~6.5 t/s
-fa 0 tg100: ~21.8 t/s
-fa 1 pp8192: ~210 t/s
-fa 0 pp8192: ~418 t/s

So disabling Flash Attention fixes the CPU fallback and makes generation ~3.3x faster.

Bad news: -fa 0 greatly increases memory pressure. Even -c 32768 -ub 2048 failed to create context because CUDA compute buffer reservation wanted ~16.9 GiB and OOMed.
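
Rough intuition for that last point: without flash attention the full KQ score matrix gets materialized, so for a large ubatch and context it alone can dominate the compute buffer. A back-of-the-envelope sketch (the head count here is a made-up example value, not MiMo's actual hyperparameter):

```python
# Back-of-the-envelope only; n_head is a hypothetical example value.
n_head   = 64        # hypothetical
n_ubatch = 2048      # from the run above
n_kv     = 32768     # context size from the run above

kq_bytes = n_head * n_ubatch * n_kv * 4   # fp32 KQ scores for one layer's attention
print(f"~{kq_bytes / 2**30:.1f} GiB of scores")   # ~16 GiB, the same order as the OOM above
```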

@AesSedai
Contributor Author

@ngxson

It seems like softmax * (V * scale_v) should be equivalent to (softmax * V) * scale_v, right?

If so, we maybe able to reuse attn_output_scale for it?

A bit of digging shows that attn_output_scale is only used by the Grok arch at the moment, so it's gated there? I could reuse it and plumb it through for MiMo as attn_output_scale, which would save a tiny bit of wiring work; up to you though.

The other recommendation I saw from an LLM review was to pre-bake that scale into v_proj at convert time, which wouldn't need any new hparams at all (but would require newly converted GGUFs, which I could do).

@AesSedai
Contributor Author

@Andryusz do you mind sharing the prompt? I've managed to get transformers inference working with dequantization to BF16, and compared the forced-prefix KLD from transformers against the BF16 GGUF logits; there is a bit of variance but overall it's very close:

hf_top_k=20 gg_top_k=20  hf_n=64 gg_n=64

=== factual  steps=64 ===
  top-1 agreement     : 100.00%
  top-K overlap (mean):  78.91%
  KL(hf||gg) mean/p50/p95: 0.013956 / 0.000056 / 0.010251
  KL(gg||hf) mean     : 0.004526
  logit cosine mean   : 0.970220
  disagreements       : 0 (near-ties=0, drift=0)
  KL hotspots (step, KL):
    step  22: 0.421583
    step  17: 0.270204
    step  16: 0.130762
    step  14: 0.012261
    step   2: 0.010251

=== math_step  steps=64 ===
  top-1 agreement     :  98.44%
  top-K overlap (mean):  95.08%
  KL(hf||gg) mean/p50/p95: 0.001669 / 0.000000 / 0.007076
  KL(gg||hf) mean     : 0.001732
  logit cosine mean   : 0.995724
  disagreements       : 1 (near-ties=1, drift=0)
    step  17 [near-tie]: hf=220 (gap=0.375)  gg=369 (gap=0.086)  gg-pick logit in hf=29.125, hf-pick logit in gg=29.324
  KL hotspots (step, KL):
    step  13: 0.027873
    step  17: 0.025531
    step   1: 0.020404
    step  28: 0.016059
    step  57: 0.007076

=== roleplay  steps=64 ===
  top-1 agreement     :  98.44%
  top-K overlap (mean):  94.69%
  KL(hf||gg) mean/p50/p95: 0.008640 / 0.000304 / 0.033271
  KL(gg||hf) mean     : 0.008350
  logit cosine mean   : 0.993676
  disagreements       : 1 (near-ties=1, drift=0)
    step   7 [near-tie]: hf=264 (gap=0.125)  gg=438 (gap=0.075)  gg-pick logit in hf=25.500, hf-pick logit in gg=25.593
  KL hotspots (step, KL):
    step  40: 0.080738
    step  52: 0.052327
    step  33: 0.051458
    step  47: 0.047668
    step  55: 0.033271

=== OVERALL ===
  top-1 agreement     :  98.96%
  top-K overlap       :  89.56%
  KL(hf || gg) mean   : 0.008088
  KL(gg || hf) mean   : 0.004869
  logit cosine mean   : 0.986540

So unless you have a more specific reproduction, I'd chalk it up to IQ3_S quantization error being the cause.
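
(For anyone wanting to reproduce this kind of comparison, the per-step KL numbers above boil down to roughly the following; an illustrative helper, not the actual comparison script.)

```python
# KL(P_hf || P_gg) for a single forced-prefix decoding step; both logit vectors
# are [vocab_size] for the same step of the same prompt. Illustrative only.
import torch
import torch.nn.functional as F

def step_kl(hf_logits: torch.Tensor, gg_logits: torch.Tensor) -> float:
    log_p = F.log_softmax(hf_logits.float(), dim=-1)
    log_q = F.log_softmax(gg_logits.float(), dim=-1)
    return torch.sum(log_p.exp() * (log_p - log_q)).item()
```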

@ngxson
Contributor

ngxson commented Apr 30, 2026

A bit of digging shows that attn_output_scale is only used for the Grok arch at the moment, so that's gated? I could re-use it and plumb it through for MiMo as attn_output_scale and that would save a tiny bit of wiring work, up to you though.

Hmm right, Grok has specific logic for it. I think it's OK to keep a dedicated var for v_scale for now, then.

The other recommendation I saw from some LLM review was to pre-bake that scale into the v_proj at convert time which wouldn't need any new hparams at all then (but would require newly converted ggufs, which I could do)

No, it should not be baked into v_proj, for numerical stability. For example, NVFP4 also has a separate scale applied to the activation, not baked into the projection matrix.

@Andryusz

@AesSedai Thank you for checking the logits, they certainly look pretty good. After poking around a bit more with the model, I agree the effect I observed is probably caused by quantization + imatrix, which possibly skews the model toward English. I could share the prompt, but to be honest I don't think it's worth spending more time on this particular example. I will do a bit more testing, and if I see more concrete indications of something being wrong I will share the details.

Regarding the bad performance: I can confirm @coder543's findings. FA seems broken; disabling it brings speeds into reasonable territory.

@ChicoPinto70

Great work, AesSedai!!! Using -ot instead of --fit, I got the non-Pro Q5_K_M version working on 3x3090 with 256 GB DDR4. The model seems great! But when the prompt is a bit more complex, it goes into an endless reasoning chain of thought. Since the previous version already had this behavior, I believe it's a Xiaomi issue. Thanks again!

@AesSedai
Contributor Author

AesSedai commented May 5, 2026

@ngxson I've merged master in, fixed the conflicts, and added fused QKV. It's very slightly faster:

[image: benchmark screenshot]

and the BF16 PPL is still fine: Final estimate: PPL = 5.9630 +/- 0.03940

@AesSedai AesSedai marked this pull request as draft May 5, 2026 09:24
@AesSedai
Contributor Author

AesSedai commented May 5, 2026

There is a regression somewhere; I'm working on tracking it down. I was testing the Pro Q8_0 and it was off somehow. I still had the Q8_0 logits from previously KLD-testing Pro, and I still have the unfused Pro GGUF, and the mean KLD was very different:

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.699453 ±   0.052520
Mean PPL(base)                :   3.198633 ±   0.016744
Cor(ln(PPL(Q)), ln(PPL(base))):  76.28%
Mean ln(PPL(Q)/PPL(base))     :   0.739302 ±   0.005124
Mean PPL(Q)/PPL(base)         :   2.094474 ±   0.010731
Mean PPL(Q)-PPL(base)         :   3.500820 ±   0.041196

====== KL divergence statistics ======
Mean    KLD:   0.775122 ±   0.004258
Maximum KLD:  27.474590
99.9%   KLD:  14.151838
99.0%   KLD:   8.829903
95.0%   KLD:   3.696737
90.0%   KLD:   1.970265
Median  KLD:   0.209342
10.0%   KLD:   0.000715
 5.0%   KLD:   0.000148
 1.0%   KLD:   0.000010
 0.1%   KLD:  -0.000000
Minimum KLD:  -0.000007

Putting this into draft mode for now while I root cause.

@AesSedai AesSedai marked this pull request as ready for review May 5, 2026 09:58
@AesSedai
Contributor Author

AesSedai commented May 5, 2026

Fixed: the `ml.get_key(LLM_KV_ATTENTION_VALUE_SCALE, hparams.f_attn_value_scale, false);` call got lost during the merge by accident. KLD is normal again:

====== Perplexity statistics ======
Mean PPL(Q)                   :   3.195142 ±   0.016734
Mean PPL(base)                :   3.198633 ±   0.016744
Cor(ln(PPL(Q)), ln(PPL(base))):  99.19%
Mean ln(PPL(Q)/PPL(base))     :  -0.001092 ±   0.000666
Mean PPL(Q)/PPL(base)         :   0.998909 ±   0.000666
Mean PPL(Q)-PPL(base)         :  -0.003491 ±   0.002130

====== KL divergence statistics ======
Mean    KLD:   0.021039 ±   0.000290
Maximum KLD:  11.382697
99.9%   KLD:   1.377634
99.0%   KLD:   0.255950
95.0%   KLD:   0.075872
90.0%   KLD:   0.040932
Median  KLD:   0.003563
10.0%   KLD:   0.000020
 5.0%   KLD:   0.000006
 1.0%   KLD:  -0.000001
 0.1%   KLD:  -0.000006
Minimum KLD:  -0.000058

@drrros
Contributor

drrros commented May 5, 2026

Running the latest ..-vision branch, performance is decent (350-390 t/s pp and 15-20 t/s tg on 3x RTX PRO 4000 and an EPYC 9274F with 12-channel DDR5-4800, on Q5_K_M). The model seems smart, but it often goes into loops when used as an agentic backend; I'm using Claude Code. Right now I'm trying to mitigate it by raising repeat-penalty, currently at 1.2, and testing further (1.0 didn't help, though I'm not sure that isn't the default).

@AesSedai
Contributor Author

AesSedai commented May 5, 2026

I've made updated QKV fused quants and am uploading them to HF now. The PPL / KLD for non-Pro is as follows (the mixture column is the MoE-optimized quant schema for Default Type / FFN Up / FFN Gate / FFN Down types):

| Quant | Size | Mixture | PPL | (Mean PPL(Q)/PPL(base)) - 1 | KLD |
| --- | --- | --- | --- | --- | --- |
| Q8_0 | 305.68 GiB (8.50 BPW) | Q8_0 | 5.135595 ± 0.030275 | +0.1271% | 0.012539 ± 0.000329 |
| Q5_K_M | 212.42 GiB (5.91 BPW) | Q8_0 / Q5_K / Q5_K / Q6_K | 5.148091 ± 0.030387 | +0.3708% | 0.014915 ± 0.000309 |
| Q4_K_M | 176.70 GiB (4.92 BPW) | Q8_0 / Q4_K / Q4_K / Q5_K | 5.195349 ± 0.030767 | +1.2921% | 0.020670 ± 0.000214 |
| IQ4_XS | 136.78 GiB (3.80 BPW) | Q8_0 / IQ3_S / IQ3_S / IQ4_XS | 5.271397 ± 0.031170 | +2.7748% | 0.041177 ± 0.000349 |
| IQ3_S | 105.33 GiB (2.93 BPW) | Q6_K / IQ2_S / IQ2_S / IQ3_S | 5.552710 ± 0.033286 | +8.2595% | 0.092639 ± 0.000604 |

@ggerganov @CISC this PR should be fully ready for review now. @ngxson approved it earlier, but that's stale with the merge from master and the change to fused QKV.

@BahamutRU

#22673
Did you see this? Will there be MTP heads in MiMo-V2.5? Seems like a real way to speed things up.
Sorry for going off-topic!

Comment thread: convert_hf_to_gguf.py (outdated)
@AesSedai
Contributor Author

AesSedai commented May 7, 2026

@BahamutRU I hadn't seen that PR yet, I'll take a closer look at it. MiMo does have MTP heads so I'll take a quick peek at what it'd take to keep those tensors in the conversion. Hopefully it'd be drop-in support then with that PR?

@BahamutRU

Hopefully it'd be drop-in support then with that PR?

I don't know, but I really hope so. It works great for Qwen3.6-27B (+95% tg) and Qwen3.6-35B-A3B (+40%); even the 40% is a very nice bonus.

Comment thread: src/llama-hparams.h (outdated)
Comment thread: src/models/mimo2.cpp (outdated)
Comment thread: src/models/mimo2.cpp (outdated)
AesSedai and others added 3 commits May 7, 2026 02:32
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@AesSedai
Contributor Author

AesSedai commented May 7, 2026

I'm also testing the convert locally for including the MTP tensors in the GGUF, following the GLM-4.5/DS MTP convention. I'll push that commit up in a few hours when I confirm the convert works correctly, the tensors get stored in nextn properly, it can load properly, and the PPL / KLD look correct.
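
(Roughly, that's a tensor-name remap along these lines; the HF-side names below are hypothetical placeholders, while the GGUF-side blk.*.nextn.* names follow the existing GLM-4.5/DeepSeek MTP convention visible in the load log further down:)

```python
# Hypothetical illustration only: the left-hand (HF checkpoint) names are made up;
# the right-hand GGUF names follow the GLM-4.5/DeepSeek nextn convention.
NEXTN_TENSOR_MAP = {
    "model.mtp.eh_proj.weight": "blk.{bid}.nextn.eh_proj.weight",
    "model.mtp.enorm.weight":   "blk.{bid}.nextn.enorm.weight",
    "model.mtp.hnorm.weight":   "blk.{bid}.nextn.hnorm.weight",
}
```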

Member

@CISC CISC left a comment


Clean these up later. :)

Comment thread: convert_hf_to_gguf.py (outdated)
Comment thread: convert_hf_to_gguf.py (outdated)
Comment thread: src/models/mimo2.cpp (outdated)
AesSedai and others added 4 commits May 7, 2026 03:05
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
@AesSedai
Contributor Author

AesSedai commented May 7, 2026

It didn't take as long to write and test as I thought, so I'm going to bed (very late) now :)

The MTP tensors were saved and the model loads correctly:

model has unused tensor blk.48.attn_output.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.48.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.48.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.ffn_gate.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.48.ffn_down.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.48.ffn_up.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.48.nextn.eh_proj.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.48.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.48.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_output.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.49.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.49.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.ffn_gate.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.49.ffn_down.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.49.ffn_up.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.49.nextn.eh_proj.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.49.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.49.layer_output_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_output.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.50.attn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.attn_sinks.weight (size = 256 bytes) -- ignoring
model has unused tensor blk.50.ffn_norm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.ffn_gate.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.50.ffn_down.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.50.ffn_up.weight (size = 134217728 bytes) -- ignoring
model has unused tensor blk.50.nextn.eh_proj.weight (size = 67108864 bytes) -- ignoring
model has unused tensor blk.50.nextn.enorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.nextn.hnorm.weight (size = 16384 bytes) -- ignoring
model has unused tensor blk.50.layer_output_norm.weight (size = 16384 bytes) -- ignoring

and comparing BF16 (logits collected from the previous conversion) to BF16 (new conversion) shows they're identical:

====== Perplexity statistics ======
Mean PPL(Q)                   :   5.132410 ±   0.030237
Mean PPL(base)                :   5.130804 ±   0.030196
Cor(ln(PPL(Q)), ln(PPL(base))):  99.99%
Mean ln(PPL(Q)/PPL(base))     :   0.000313 ±   0.000094
Mean PPL(Q)/PPL(base)         :   1.000313 ±   0.000095
Mean PPL(Q)-PPL(base)         :   0.001606 ±   0.000486

====== KL divergence statistics ======
Mean    KLD:  -0.000010 ±   0.000000
Maximum KLD:   0.000004
99.9%   KLD:   0.000000
99.0%   KLD:   0.000000
95.0%   KLD:   0.000000
90.0%   KLD:  -0.000000
Median  KLD:  -0.000009
10.0%   KLD:  -0.000022
 5.0%   KLD:  -0.000025
 1.0%   KLD:  -0.000031
 0.1%   KLD:  -0.000037
Minimum KLD:  -0.000051

I think that's all of it now 🤞

Contributor

@ngxson ngxson left a comment


Let's merge when the CI passes; any bugs can be fixed via follow-up PRs.

@ngxson ngxson merged commit 8e52631 into ggml-org:master May 7, 2026
48 of 50 checks passed
@coder543

coder543 commented May 7, 2026

Apparently there was a bug in the config, fixed a few hours ago: https://huggingface.co/XiaomiMiMo/MiMo-V2.5/commit/13b5e3f92ab9572523fa21c7f1bfe9c92228aaca

Might need new GGUFs?

@CISC
Member

CISC commented May 7, 2026

Apparently there was a bug in the config, fixed a few hours ago: https://huggingface.co/XiaomiMiMo/MiMo-V2.5/commit/13b5e3f92ab9572523fa21c7f1bfe9c92228aaca

Might need new GGUFs?

This array is never used for GGUFs, but the tokens may or may not need to be added as EOG.


Labels

model (Model specific), python (python script changes)
