gguf: optimize prefill speeds for Q4_K quants by AlpinDale · Pull Request #1395 · dphnAI/aphrodite-engine

AlpinDale · 2025-07-18T19:01:41Z

Our prefill is currently more than 8x slower than native llama.cpp. This PR is an attempt at closing that gap.

Llama 3.1 8B Q4_K_M, RTX 3090
Main

Request completed - E2E time: 6.76s, TTFT: 4.03s, Prefill: 1927 tokens (478.1 tokens/s), Decode: 264 tokens (96.9 tokens/s)

PR

Request completed - E2E time: 10.25s, TTFT: 1.67s, Prefill: 1927 tokens (1150.5 tokens/s), Decode: 823 tokens (96.0 tokens/s)

Baseline (llama.cpp)

prompt eval time =     458.04 ms /  1928 tokens (    0.24 ms per token,  4209.21 tokens per second)
       eval time =    2932.51 ms /   156 tokens (   18.80 ms per token,    53.20 tokens per second)
      total time =    3390.55 ms /  2084 tokens

Note that llama.cpp's decode numbers are normally closer to ours. These tests were run with top_k=vocab_size, which llama.cpp struggles with.
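To illustrate why `top_k=vocab_size` is an awkward setting, here is a minimal sketch of a generic top-k filter. It is not llama.cpp's or Aphrodite's sampler code; `top_k_filter` is a hypothetical helper, and the vocabulary size is the Llama 3.1 value used only for illustration. With k equal to the vocabulary size the filter removes nothing, yet a selection-based implementation still has to order every logit on each decode step.

```python
import heapq
import random

VOCAB_SIZE = 128_256  # Llama 3.1 vocabulary size, used here only for illustration


def top_k_filter(logits, k):
    """Return the indices of the k largest logits (generic top-k, hypothetical helper)."""
    return heapq.nlargest(k, range(len(logits)), key=logits.__getitem__)


random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(VOCAB_SIZE)]

kept_small = top_k_filter(logits, 40)         # typical setting: keeps 40 candidates
kept_all = top_k_filter(logits, VOCAB_SIZE)   # benchmark setting: keeps every token

print("kept with k=40:        ", len(kept_small))
print("kept with k=vocab_size:", len(kept_all))
print("anything filtered out? ", len(kept_all) < VOCAB_SIZE)
```

Whether this is the actual cause of the slower llama.cpp decode figure is an assumption; the PR only states that llama.cpp struggles with this setting.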

There's still a lot of room for optimization, and the work needs to be extended to other quant types as well. For now, this gives roughly a 2.4x prefill throughput improvement over main (478.1 to 1150.5 tokens/s). The optimizations here are deliberately conservative to avoid some NaN output issues I ran into, so more aggressive approaches should still be possible.
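The ratios implied by the logs above can be cross-checked with a short script. This is only a sketch: the log lines are copied verbatim from this PR, while the helper names (`prefill_tps`, `ttft_seconds`) are made up for illustration. E2E times are not compared because the two runs decoded different numbers of tokens (264 vs 823).

```python
import re

# Log lines copied verbatim from the benchmark output in this PR.
MAIN = ("Request completed - E2E time: 6.76s, TTFT: 4.03s, "
        "Prefill: 1927 tokens (478.1 tokens/s), Decode: 264 tokens (96.9 tokens/s)")
PR = ("Request completed - E2E time: 10.25s, TTFT: 1.67s, "
      "Prefill: 1927 tokens (1150.5 tokens/s), Decode: 823 tokens (96.0 tokens/s)")
LLAMA_CPP = ("prompt eval time =     458.04 ms /  1928 tokens "
             "(    0.24 ms per token,  4209.21 tokens per second)")


def prefill_tps(line):
    """Extract prefill throughput (tokens/s) from either log format."""
    m = re.search(r"Prefill: \d+ tokens \(([\d.]+) tokens/s\)", line)
    if m is None:
        m = re.search(r"([\d.]+) tokens per second", line)
    if m is None:
        raise ValueError("no prefill throughput found in log line")
    return float(m.group(1))


def ttft_seconds(line):
    """Extract time to first token (seconds) from an engine log line."""
    m = re.search(r"TTFT: ([\d.]+)s", line)
    if m is None:
        raise ValueError("no TTFT found in log line")
    return float(m.group(1))


main_tps, pr_tps, llama_tps = (prefill_tps(x) for x in (MAIN, PR, LLAMA_CPP))

speedup_vs_main = pr_tps / main_tps      # PR prefill relative to main
gap_before = llama_tps / main_tps        # llama.cpp relative to main
gap_after = llama_tps / pr_tps           # llama.cpp relative to this PR
ttft_ratio = ttft_seconds(MAIN) / ttft_seconds(PR)

print(f"PR vs main prefill:    {speedup_vs_main:.2f}x")
print(f"llama.cpp vs main:     {gap_before:.2f}x")
print(f"llama.cpp vs PR:       {gap_after:.2f}x")
print(f"TTFT main / TTFT PR:   {ttft_ratio:.2f}x")
```

On these numbers, llama.cpp remains somewhat under 4x faster than this branch at prefill, and TTFT drops by about the same factor as prefill throughput rises.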

AlpinDale added 3 commits July 18, 2025 18:58

gguf: optimize prefill speeds for Q4_K quants

4e54b58

tile size optimizations

ca370b9

better shared memory calculation for q4k

bde0968
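For context on the tile size and shared memory commits, the following is a rough sketch of the kind of arithmetic involved, not the calculation used in this PR. It relies on the Q4_K super-block layout from llama.cpp's ggml (256 weights in 144 bytes). The tile shapes and the `q4k_tile_bytes` helper are hypothetical, and the sketch counts only raw block bytes, while a real kernel may stage data in a different, unpacked layout.

```python
# Q4_K super-block layout as defined in llama.cpp's ggml (block_q4_K).
QK_K = 256          # weights per super-block
K_SCALE_SIZE = 12   # bytes of packed 6-bit scales and mins
# fp16 scale + fp16 min + packed scales + 4-bit quants (two weights per byte)
BLOCK_Q4_K_BYTES = 2 + 2 + K_SCALE_SIZE + QK_K // 2

# Default per-block shared memory limit in CUDA without opting in to more.
DEFAULT_SMEM_LIMIT = 48 * 1024


def q4k_tile_bytes(tile_rows, tile_k):
    """Bytes to stage a tile_rows x tile_k tile of raw Q4_K blocks (hypothetical scheme)."""
    if tile_k % QK_K != 0:
        raise ValueError("tile_k must be a multiple of the super-block size")
    return tile_rows * (tile_k // QK_K) * BLOCK_Q4_K_BYTES


bits_per_weight = BLOCK_Q4_K_BYTES * 8 / QK_K
print(f"Q4_K block: {BLOCK_Q4_K_BYTES} bytes, {bits_per_weight} bits per weight")

# Hypothetical tile shapes, chosen only to show how quickly the budget fills up.
for rows, k in [(64, 256), (128, 512), (256, 512)]:
    need = q4k_tile_bytes(rows, k)
    fits = need <= DEFAULT_SMEM_LIMIT
    print(f"tile {rows}x{k}: {need} bytes, fits in default limit: {fits}")
```

Larger tiles amortize more work per load but can exceed the default limit, which is one reason a kernel's shared memory size has to be computed per quant type rather than assumed.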

AlpinDale wants to merge 3 commits into main from q4k_prefill_optim
