feat: Add Mimo v2.5 model support #22493
Conversation
|
also cc @ngxson for review |
|
Getting this error when converting: Need to include changes to gguf-py? |
|
I'm going to find some disk and download and give this a go! |
|
@segmond oops, forgot to include that in the commit. I've pushed it now, give it another shot? |
I'm downloading the q8; at the rate it's going, it will take about 9 hours if there's no issue. I'll just pull down and rebuild when I get up in the morning before I try it. |
|
Ah I meant to ping @sayap about the convert issue, my eyes are crossed :P I pushed the commit that added the writer and constant updates. |
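For anyone following along, the gguf-py side of a new architecture is mostly constants plus tensor-name mappings. A minimal sketch of the shape of that change (the enum member, architecture string, and mappings below are illustrative, not the exact entries in this PR):

```python
# Illustrative sketch of the kind of gguf-py changes a new architecture needs
# (constants + tensor name mapping). The identifiers mirror the style of
# gguf-py's constants but are NOT the exact entries from this PR.
from enum import IntEnum, auto

class MODEL_ARCH(IntEnum):
    LLAMA = auto()
    MIMO2 = auto()   # hypothetical new enum member for MiMo V2.5

# architecture string written into the GGUF header
MODEL_ARCH_NAMES: dict[MODEL_ARCH, str] = {
    MODEL_ARCH.LLAMA: "llama",
    MODEL_ARCH.MIMO2: "mimo2",   # assumed name, check the actual PR
}

# HF tensor name -> GGUF tensor name mapping (heavily abbreviated)
TENSOR_MAP: dict[str, str] = {
    "model.layers.{bid}.self_attn.q_proj": "blk.{bid}.attn_q",
    "model.layers.{bid}.self_attn.k_proj": "blk.{bid}.attn_k",
    "model.layers.{bid}.self_attn.v_proj": "blk.{bid}.attn_v",
}

if __name__ == "__main__":
    print(MODEL_ARCH_NAMES[MODEL_ARCH.MIMO2])
```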
|
I've just tried converting the MiMo V2.5 Pro version and the conversion fails at the TP dequant, will look into it. |
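For reference, the FP8 checkpoints carry a per-block scale tensor next to each quantized weight, and dequantizing is essentially tiling that scale back over the weight; the TP-aware sharding is the extra wrinkle here. A rough numpy sketch of the block-wise part only (block size and shapes are assumptions based on similar FP8 checkpoints, not the exact MiMo layout):

```python
import numpy as np

def dequant_fp8_blockwise(weight_fp8: np.ndarray, scale_inv: np.ndarray,
                          block: int = 128) -> np.ndarray:
    """Dequantize a 2D FP8 tensor using per-(block x block) scales.

    weight_fp8: (rows, cols) weights already cast to float32
    scale_inv:  (rows // block, cols // block) per-block scale factors
    """
    rows, cols = weight_fp8.shape
    out = weight_fp8.astype(np.float32)
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            out[i:i + block, j:j + block] *= scale_inv[i // block, j // block]
    return out

# toy example: 256x256 weight with a 2x2 grid of block scales
w = np.random.randn(256, 256).astype(np.float32)
s = np.ones((2, 2), dtype=np.float32) * 0.5
print(dequant_fp8_blockwise(w, s).shape)  # (256, 256)
```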
|
Built from this branch and it fails to autofit: Upd: Am I doing something wrong? |
|
I did some tests with the IQ3_S quant, and while the model seems sane on this PR, I get quite different behavior compared to the official API via openrouter. I have a quite specific prompt that causes non-English, in-character reasoning on many models, including Mimo v2.5 on the API - and it's 100% consistent on the API. However, on this PR, the model always thinks in English as a normal assistant, and the final response is also quite different compared to the API. The token that would result in non-English reasoning has only about 6% probability, so I don't think quantization would explain such a big difference in token distribution. As a side note, the performance is terrible for me. I get only 30%-40% of decode speed compared to something like Qwen 3.5 397B. |
|
Still needs work, fit doesn't seem to work, and I have 8 GPUs... I'm letting this load just to get a feel for the inference, then I'll manually assign layers afterwards and see how it is. load_tensors: offloading output layer to GPU |
|
@drrros @segmond for the autofit problems, I think that should be a separate issue. Autofit was working fine for me on both the Pro and non-Pro versions at least. @ngxson thanks for looking it over, I'll give that an eyeball later today 👀 @Andryusz I'll do some more digging later today with logit dumps. If there are issues, I'd lean towards them being somewhere in the inference implementation (I think?), since I haven't touched that; this work was mostly in the convert stage, and if that were FUBAR then the output would be total gibberish. This model doesn't have a shared expert, which may contribute to some perf issues (and IQ quants are a bit slower on CPU, I think). |
|
To provide some performance context on DGX Spark... it is surprisingly slow. I also noticed no difference when running on this branch versus not.
Compare that to:
Whether you consider models of similar on-disk size or models with more total and active parameters... whatever way you look at it, mimo-v2.5's performance is really low compared to anything else that is comparable. I also noticed a lot of CPU usage on mimo-v2.5, even though the model was entirely pinned to the GPU, and |
|
@coder543 there wouldn't be a performance difference between this branch and RE: performance, that may need to be addressed too, but I'm not sure it's in scope for this PR, which just adds support in the first place. I do appreciate the feedback though, and the detailed comparison. |
|
For whatever it is worth, I had GPT-5.5 (running through codex) look at it. Quoting GPT-5.5:
|
A bit of digging shows that The other recommendation I saw from some LLM review was to pre-bake that scale into the |
|
@Andryusz do you mind sharing the prompt? I've managed to get transformers inference working with some dequantization to BF16 and compared the forced-prefix KLD from transformers to the BF16 gguf logits and there is a bit of variance but overall it's very close: so unless you have a more specific reproduction, I'd chalk it up to the IQ3_S quantization error being the cause. |
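For anyone who wants to run a similar check, a minimal sketch of the forced-prefix KLD comparison between two logit dumps (the .npy file names are made up; you'd substitute your own dumps):

```python
import numpy as np

def mean_kld(logits_ref: np.ndarray, logits_test: np.ndarray) -> float:
    """Mean KL(ref || test) over positions; inputs are (n_tokens, n_vocab) logits."""
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))
    logp = log_softmax(logits_ref.astype(np.float64))
    logq = log_softmax(logits_test.astype(np.float64))
    kld = (np.exp(logp) * (logp - logq)).sum(axis=-1)
    return float(kld.mean())

# e.g. logits dumped from transformers (BF16) vs. a GGUF quant on the same forced prefix
ref = np.load("logits_transformers_bf16.npy")   # hypothetical file names
test = np.load("logits_gguf_iq3_s.npy")
print(f"mean KLD: {mean_kld(ref, test):.6f}")
```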
Hmm right, grok has specific logic for it. I think it's ok to keep a dedicated var for v_scale for now, then.
No, it should not be baked into v_proj, for numerical stability. For example, NVFP4 also has a separate scale applied to the activation, not baked into the projection matrix. |
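To illustrate the trade-off being discussed: mathematically the value scale could be folded into v_proj, since (x @ W) * s == x @ (W * s), but applying it as a separate multiply keeps the stored weights in their original range, which matters once they go through low-bit quantization. A tiny numpy sketch of the equivalence (the scale value is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # activations
W = rng.standard_normal((64, 64)).astype(np.float32)  # v_proj weight
s = 0.0883                                            # illustrative value scale, not MiMo's actual number

separate = (x @ W) * s   # scale applied to the projection output
baked    = x @ (W * s)   # scale pre-baked into the weight

# identical in float32 up to rounding...
print(np.abs(separate - baked).max())

# ...but baking shrinks the stored weight values by ~11x, which changes how
# they round under low-bit quantization, hence keeping a dedicated v_scale.
print(np.abs(W).max(), np.abs(W * s).max())
```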
|
@AesSedai Thank you for checking the logits, they certainly look pretty good. After poking around a bit more with the model, I agree the effect I observed is probably caused by quantization + iMatrix, which possibly skews the model toward English. I could share the prompt, but to be honest I don't think it's worth spending more time on this particular example. I will do a bit more testing, and if I see more concrete indications of something being wrong I will share the details. Regarding the bad performance - I can confirm @coder543's findings - FA seems broken; disabling it brings speeds back into reasonable territory. |
|
Great work, AesSedai!!! Using -ot instead of --fit, I got the non-Pro Q5_K_M version working on 3x3090 with 256 GB DDR4. The model seems great!! But when the prompt is a bit more complex, it goes into an endless reasoning chain of thought. Since the previous version already had this behavior, I believe it's a Xiaomi issue. Thanks again! |
|
@ngxson merged master in and fixed the conflicts, and added fused QKV. It's very slightly faster:
and the BF16 PPL is still fine: |
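For context on what the fused QKV change amounts to: instead of three separate matmuls for Q, K and V, the weights are concatenated so the attention input goes through one larger matmul and the result is split afterwards. A small numpy sketch of the equivalence (shapes are illustrative, not MiMo's actual dimensions):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_q, d_kv = 128, 128, 32
x  = rng.standard_normal((8, d_model)).astype(np.float32)
Wq = rng.standard_normal((d_model, d_q)).astype(np.float32)
Wk = rng.standard_normal((d_model, d_kv)).astype(np.float32)
Wv = rng.standard_normal((d_model, d_kv)).astype(np.float32)

# separate projections: three matmuls
q, k, v = x @ Wq, x @ Wk, x @ Wv

# fused projection: one matmul over the concatenated weight, then a split
Wqkv = np.concatenate([Wq, Wk, Wv], axis=1)
qkv = x @ Wqkv
q2, k2, v2 = np.split(qkv, [d_q, d_q + d_kv], axis=1)

print(np.allclose(q, q2), np.allclose(k, k2), np.allclose(v, v2))  # True True True
```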
|
There is a regression somewhere, working on tracking it down. I was testing the Pro Q8_0 out and it's off somehow. I still had the Q8_0 logits from KLD-testing Pro previously, and I still have the unfused Pro GGUF; the mean KLD was very different: Putting this into draft mode for now while I root cause. |
|
Fixed, the |
|
Running the latest ..-vision branch, performance is decent (350-390 t/s pp and 15-20 t/s tg on 3 RTX 4000 Pros and an EPYC 9274F with 12-channel DDR5-4800, on Q5_K_M). The model seems smart, but often goes into loops when used as an agentic backend. I'm using Claude Code. Right now I'm trying to mitigate it by upping repeat-penalty - now at 1.2, but testing further. (1.0 didn't help; not sure whether that's the default, though). |
|
I've made updated QKV fused quants and am uploading them to HF now. The PPL / KLD for non-Pro is as follows (the mixture column is the MoE-optimized quant schema for
@ggerganov @CISC this PR should be fully ready for review now. @ngxson approved it earlier, but that's stale with the merge from |
|
#22673 |
|
@BahamutRU I hadn't seen that PR yet, I'll take a closer look at it. MiMo does have MTP heads so I'll take a quick peek at what it'd take to keep those tensors in the conversion. Hopefully it'd be drop-in support then with that PR? |
I don't know, but I really hope so. It works great for Qwen3.6-27B (+95% tg) and Qwen3.6-35B-A3B (+40%); even the 40% is a very nice bonus. |
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
|
I'm also testing the convert locally for including the MTP tensors in the GGUF, following the GLM-4.5/DS MTP convention. I'll push that commit up in a few hours when I confirm the convert works correctly, the tensors get stored in |
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
|
It didn't take as long to write and test as I thought, so I'm going to bed (very late) now :) The MTP tensors were saved and the model loads correctly: and BF16 (logits collected from previous conversion) to BF16 (new conversion) shows it's identical: I think that's all of it now 🤞 |
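For anyone curious what "following the GLM-4.5/DS MTP convention" means in practice: the MTP head tensors get kept under an extra block index past the last regular layer, with nextn-style names. A rough sketch of the idea (the HF-side names, layer count, and suffix set below are illustrative, not the exact mapping in the convert script):

```python
# Rough idea of keeping MTP ("nextn") tensors at convert time, in the spirit of
# the GLM-4.5 / DeepSeek convention: the MTP head lives in an extra block index
# past the last regular layer. All names here are illustrative placeholders.

N_LAYERS = 48            # regular transformer layers (assumed)
MTP_BLOCK = N_LAYERS     # MTP tensors go into blk.<n_layers>.nextn.*

def map_mtp_tensor(hf_name: str) -> str | None:
    """Map a hypothetical HF MTP tensor name to a GGUF-style nextn name (sketch only)."""
    mtp_suffixes = {
        "mtp.eh_proj.weight": f"blk.{MTP_BLOCK}.nextn.eh_proj.weight",
        "mtp.enorm.weight":   f"blk.{MTP_BLOCK}.nextn.enorm.weight",
        "mtp.hnorm.weight":   f"blk.{MTP_BLOCK}.nextn.hnorm.weight",
    }
    return mtp_suffixes.get(hf_name)

print(map_mtp_tensor("mtp.eh_proj.weight"))  # blk.48.nextn.eh_proj.weight
```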
ngxson
left a comment
Let's merge when the CI passes, any bugs can be fixed via follow-up PRs
|
Apparently there was a bug in the config, fixed a few hours ago: https://huggingface.co/XiaomiMiMo/MiMo-V2.5/commit/13b5e3f92ab9572523fa21c7f1bfe9c92228aaca Might need new GGUFs? |
This array is never used for GGUFs, but the tokens may or may not need to be added as EOG. |


Overview
This PR adds support for MiMo V2.5 (+ Pro) for text-to-text inference. The non-Pro MiMo V2.5 has audio and vision components that are not included in this PR.
Additional information
I haven't re-tested the Pro model but I think it should still convert and quantize correctly, will follow-up with that again when I finish with the non-Pro model quantizations.
The convert_hf_to_gguf.py now dequantizes the FP8 safetensors correctly. MiMo has an oddly packed TP-aware sharding for its weights, in addition to fusing the attention_qkv. To maintain compatibility with the existing MiMo V2 Flash path, I've opted to un-fuse the attention_qkv and use the existing modeling code.
One small tweak to note is that the MiMo V2 and V2.5 models have an attention_value_scale that was provided in the config.json but not being used. I've plumbed that through, which should bring the model closer to parity with the transformers implementation.
MiMo-V2.5-Q8_0-KLD.txt
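Roughly, the convert-side plumbing for that scale is just reading the value out of config.json and emitting it as a GGUF key-value so the runtime can apply it. A minimal sketch, assuming config.json is in the current directory; the metadata key name and the writer call in the comment are illustrative, not necessarily the ones this PR uses:

```python
import json

# Sketch of plumbing a config.json value through to GGUF metadata.
# The key "mimo2.attention.value_scale" and the gguf_writer call below are
# illustrative; check the actual PR for the real names.
with open("config.json") as f:
    hparams = json.load(f)

v_scale = float(hparams.get("attention_value_scale", 1.0))

# inside a convert_hf_to_gguf.py model class this would look roughly like:
#   self.gguf_writer.add_float32("mimo2.attention.value_scale", v_scale)
print(f"attention_value_scale = {v_scale}")
```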
Requirements