gguf: optimize prefill speeds for Q4_K quants#1395

Open
AlpinDale wants to merge 3 commits into main from q4k_prefill_optim

AlpinDale (Collaborator) commented Jul 18, 2025

Our prefill is currently about 8x slower than native llama.cpp (478.1 vs. 4209.21 tokens/s in the benchmark below). This PR is an attempt at closing that gap.

Llama 3.1 8B Q4_K_M, RTX 3090
Main

Request completed - E2E time: 6.76s, TTFT: 4.03s, Prefill: 1927 tokens (478.1 tokens/s), Decode: 264 tokens (96.9 tokens/s)

PR

Request completed - E2E time: 10.25s, TTFT: 1.67s, Prefill: 1927 tokens (1150.5 tokens/s), Decode: 823 tokens (96.0 tokens/s)

Baseline (llama.cpp)

prompt eval time =     458.04 ms /  1928 tokens (    0.24 ms per token,  4209.21 tokens per second)
       eval time =    2932.51 ms /   156 tokens (   18.80 ms per token,    53.20 tokens per second)
      total time =    3390.55 ms /  2084 tokens

Note that llama.cpp's decode numbers are normally closer to ours; these tests were run with top_k=vocab_size, which llama.cpp struggles with.

There's still a lot of room for optimization, and the approach needs to be extended to other quant types as well. For now, this gives roughly a 2.4x prefill throughput improvement over main (478.1 to 1150.5 tokens/s), which still leaves us about 3.7x behind llama.cpp. The optimizations here are deliberately conservative to avoid some NaN output issues I ran into, so more aggressive approaches should be possible.
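
For anyone who wants to redo the arithmetic, here is a minimal sketch that derives the prefill ratios from the three log lines in this description. The dictionary keys and the `ratio` helper are illustrative names only, not part of the codebase; the throughput figures are copied verbatim from the logs.

```python
# Prefill throughput in tokens/s, copied from the benchmark logs in this PR
# description (Llama 3.1 8B Q4_K_M, RTX 3090). The keys are illustrative labels.
PREFILL_TOK_S = {
    "main": 478.1,
    "pr": 1150.5,
    "llama.cpp": 4209.21,
}


def ratio(faster: str, slower: str) -> float:
    """Return how many times higher the prefill throughput of `faster` is."""
    return PREFILL_TOK_S[faster] / PREFILL_TOK_S[slower]


speedup_over_main = ratio("pr", "main")      # improvement from this PR
gap_before = ratio("llama.cpp", "main")      # gap to llama.cpp on main
gap_after = ratio("llama.cpp", "pr")         # gap to llama.cpp with this PR

print(f"PR vs main:          {speedup_over_main:.2f}x")
print(f"llama.cpp vs main:   {gap_before:.2f}x")
print(f"llama.cpp vs PR:     {gap_after:.2f}x")
```

Only the prefill and TTFT figures are directly comparable across the runs: the decode lengths differ (264, 823, and 156 tokens), so the E2E and total times are not like-for-like.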
