
tools/mllm-llm-benchmark: estimate KV cache bytes from model config #623

Open

huangzhenhua111 wants to merge 3 commits into UbiquitousLearning:main from huangzhenhua111:fix/llm-benchmark-kv-est

Conversation

huangzhenhua111 commented Jan 31, 2026

What has changed

  • KV cache estimate now derived from model config when available (Llama/TinyLlama implemented).
  • No behavior change for models that don’t provide kvEstimateInfo().

How to test

  • Tested on x86_64 (WSL/Ubuntu), CPU backend
ninja -C build -v mllm-llm-benchmark
./build/bin/mllm-llm-benchmark \
  -n tiny_llama \
  -m /home/huangzhenhua/models/mllm_tinyllama/tinyllama-fp32.mllm \
  -c /home/huangzhenhua/mllm-runok/examples/llama/config_tiny_llama.json \
  -pp 8 -tg 4 -t 4 -cl 2048 \
  -r 3 -cs 1 \
  -kv 4 \
  -oc out_kv.csv
cat out_kv.csv

Output

schema_version,git_commit,arch,model_name,pp,tg,ttft_ms,prefill_speed,decode_speed,prefill_ms,decode_ms_per_tok,kv_est_bytes_pp,kv_est_bytes_final
1,f5c0006ce4f93d378e960648936c852418e69c88,x86_64,tiny_llama,8,4,410.549,19.8104,5.22112,403.829,191.53,360448,540672
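
As a sanity check on the last two columns: with the LLaMA-like estimate used in main.cpp (2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element) and the usual TinyLlama-1.1B config (22 hidden layers, 4 KV heads, hidden_size 2048, 32 attention heads, hence head_dim = 64; these config values are assumed from the standard TinyLlama config rather than quoted from this PR), the arithmetic reproduces the CSV values:

kv_est_bytes_pp    = 2 * 22 * 4 * 64 * 8  * 4 = 360448   (pp = 8 tokens, -kv 4 bytes per element)
kv_est_bytes_final = 2 * 22 * 4 * 64 * 12 * 4 = 540672   (pp + tg = 12 tokens)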

Summary by CodeRabbit

Release Notes

  • New Features
    • Configurable benchmark runs via new command-line arguments (runs, cooldown, output CSV path, KV-dtype settings)
    • CSV export for benchmark results with model metadata and commit information
    • Per-run latency measurements (prefill and decode latency)
    • KV cache size estimation support for benchmarks
    • LLaMA model benchmarking with comprehensive performance metrics and warmup functionality


coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

The changes enhance the benchmark tool with configurable runs via command-line arguments, CSV output capability, and KV cache estimation infrastructure. A new Llama model-specific benchmark implementation is added alongside expanded per-run timing metrics and dynamic runtime environment initialization.

Changes

  • CLI & CSV Output (tools/mllm-llm-benchmark/main.cpp): Adds command-line arguments for runs, cooldown, output CSV, schema version, KV-dtype bytes, and threads. Implements CSV file creation with header and per-run data logging. Introduces input validation, a configurable benchmark loop, per-run timing computation (prefill/decode latency), and KV cache byte estimation.
  • Benchmark Infrastructure (tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp, tools/mllm-llm-benchmark/models/All.hpp): Adds a KVCacheEstimateInfo struct and virtual kvEstimateInfo() method to the BenchmarkTemplate base class. Marks createBenchmark as inline in All.hpp and introduces explicit Llama model selection logic with safe tolower conversion.
  • Llama Model Implementation (tools/mllm-llm-benchmark/models/Llama.hpp): New file introducing the Llama_Benchmark subclass with KV cache estimation, model initialization, configuration printing, warmup, and a comprehensive benchmark run method measuring prefill/decode latency, throughput, and time-to-first-token.
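
For orientation, here is a minimal sketch of the new estimation hook, reconstructed from the review excerpts later in this thread; the field types and the surrounding class shape are assumptions, only the names (KVCacheEstimateInfo, kvEstimateInfo, num_layers, num_kv_heads, head_dim) come from this PR:

#include <cstdint>
#include <optional>

// Sketch only: names taken from the review excerpts below, types assumed.
struct KVCacheEstimateInfo {
  int64_t num_layers = 0;   // transformer layers that hold a KV cache
  int64_t num_kv_heads = 0; // key/value heads (GQA-aware)
  int64_t head_dim = 0;     // hidden_size / num_attention_heads
};

class BenchmarkTemplate {
 public:
  virtual ~BenchmarkTemplate() = default;
  // Optional hook: models that cannot derive KV sizes from config keep this default.
  virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
  // ... init/warmup/run hooks omitted ...
};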

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/main.cpp
    participant Factory as createBenchmark()
    participant Template as BenchmarkTemplate
    participant Llama as Llama_Benchmark
    participant CSV as CSV Output

    CLI->>CLI: Parse CLI args (runs, output_csv, etc.)
    CLI->>CLI: Validate inputs
    loop For each run (1 to R)
        CLI->>Factory: createBenchmark(model_name)
        Factory->>Llama: Create Llama_Benchmark
        CLI->>Llama: init(cfg_path, model_path)
        CLI->>Llama: printModelInfo()
        CLI->>Llama: warmup()
        CLI->>Llama: run(pp, tg)
        Llama->>Llama: Measure prefill latency
        Llama->>Llama: Measure decode latency
        Llama->>Template: kvEstimateInfo() override
        Llama-->>CLI: Return BenchmarkTemplateResult
        CLI->>CSV: Write per-run metrics + KV estimates
    end
    CLI->>CSV: Close file

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 Hop along, the benchmarks now run with flair,
CSV trails mark each lap with care,
KV caches sized, Llama models shine,
Configurable, precise—a test suite fine!

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 12.50%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly and concisely summarizes the main change: adding KV cache byte estimation from model configuration to the benchmark tool.
  • Description check: ✅ Passed. The description covers what changed, how to test with a specific command example, and actual output. It adequately documents the feature and testing approach.


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1)

69-72: Consider adding [[nodiscard]] attribute.

The return value of kvEstimateInfo() should always be checked by callers. Adding [[nodiscard]] helps prevent accidental discard of the optional result, which aligns with the static analysis hint.

💡 Suggested fix
   // Optional: provide info for KV cache size estimation.
   // If a model does not support it, return std::nullopt.
-  virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
+  [[nodiscard]] virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
tools/mllm-llm-benchmark/models/All.hpp (1)

31-35: Redundant substring check for "tinyllama".

The condition normalized_model_name.find("tinyllama") is redundant because find("llama") will already match any string containing "tinyllama". The "tiny_llama" check (with underscore) is valid though.

💡 Simplified condition
-  if (normalized_model_name.find("llama") != std::string::npos ||
-      normalized_model_name.find("tinyllama") != std::string::npos ||
-      normalized_model_name.find("tiny_llama") != std::string::npos) {
+  if (normalized_model_name.find("llama") != std::string::npos ||
+      normalized_model_name.find("tiny_llama") != std::string::npos) {
     return std::make_shared<Llama_Benchmark>();
   }
tools/mllm-llm-benchmark/main.cpp (3)

17-18: Remove duplicate stringify macros.

STR_HELPER/STR (lines 26-27) duplicates STRINGIFY_INTERNAL/STRINGIFY (lines 17-18). Use only one set.

♻️ Remove duplicate macros
 #define STRINGIFY_INTERNAL(x) #x
 #define STRINGIFY(x) STRINGIFY_INTERNAL(x)

 #include "models/All.hpp"

 #ifndef MLLM_GIT_COMMIT_HASH
 #define MLLM_GIT_COMMIT_HASH unknown
 #endif

-#define STR_HELPER(x) #x
-#define STR(x) STR_HELPER(x)
-

Also applies to: 26-27


194-202: Consider using a named constant for the K+V multiplier.

The factor 2.0 represents K and V caches. A named constant improves clarity.

💡 Suggested improvement
+    // K and V each need storage, hence factor of 2
+    constexpr double kKVCacheMultiplier = 2.0;
     // Rough KV cache estimate (bytes)
     double kv_est_bytes_pp = 0.0;
     double kv_est_bytes_final = 0.0;
     if (auto info = benchmark->kvEstimateInfo(); info.has_value()) {
       const int32_t bytes_per = kv_dtype_bytes.get();  // 1/2/4
       // LLaMA-like KV: 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes
-      kv_est_bytes_pp = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
-      kv_est_bytes_final = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
+      kv_est_bytes_pp = kKVCacheMultiplier * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
+      kv_est_bytes_final = kKVCacheMultiplier * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
     }

Based on learnings: "Use named constants instead of magic numbers."


163-164: Cast pp to float explicitly in the latency calculation.

pp is int32_t. The implicit conversion to float already yields floating-point division here, but an explicit cast documents the intent and avoids implicit-conversion warnings.

💡 Explicit cast
-      float prefill_ms = (result.prefill_speed > 0.0f) ? (pp / result.prefill_speed) * 1000.0f : 0.0f;
+      float prefill_ms = (result.prefill_speed > 0.0f) ? (static_cast<float>(pp) / result.prefill_speed) * 1000.0f : 0.0f;

Comment on lines +17 to +24
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
  if (!cfg_) return std::nullopt;
  KVCacheEstimateInfo info;
  info.num_layers = cfg_->num_hidden_layers;
  info.num_kv_heads = cfg_->num_key_value_heads;
  info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
  return info;
}

⚠️ Potential issue | 🟡 Minor

Add division-by-zero guard for head_dim calculation.

printModelInfo() (line 58) defensively checks num_attention_heads > 0 before dividing, but kvEstimateInfo() does not. If the config has num_attention_heads == 0, this would cause a division by zero.

🛡️ Proposed fix
   std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
     if (!cfg_) return std::nullopt;
+    if (cfg_->num_attention_heads <= 0) return std::nullopt;
     KVCacheEstimateInfo info;
     info.num_layers = cfg_->num_hidden_layers;
     info.num_kv_heads = cfg_->num_key_value_heads;
     info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
     return info;
   }

Also consider adding [[nodiscard]] per the static analysis hint.

🧰 Tools
🪛 Clang (14.0.6)

[error] 17-17: function 'kvEstimateInfo' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 17 - 24,
kvEstimateInfo() can divide by zero when cfg_->num_attention_heads == 0; add a
guard similar to printModelInfo(): if cfg_->num_attention_heads == 0 either
return std::nullopt or set info.head_dim = 0 (choose behavior consistent with
callers) before computing head_dim, and ensure cfg_ is checked first; also mark
the method [[nodiscard]] to satisfy static analysis. Update the function
signature kvEstimateInfo() and the body (referencing cfg_, KVCacheEstimateInfo,
head_dim, num_attention_heads) accordingly.

huangzhenhua111 (Author) commented Jan 31, 2026

Hi @chenghuaWang and @jialilve,

I hope you're doing well! I have a quick question about the review process for my recent PRs (#617, #622), which this change builds on, and my new PR (#623), which I'm not sure whether to keep separate from them.

My new PR includes the changes related to KV cache estimation and is built on top of the earlier PRs. However, those previous PRs haven't been reviewed yet, so I wanted to ask whether you would prefer me to:

  1. Separate out the current change into a standalone PR focused only on the KV cache estimate, and keep the earlier changes as separate PRs (#617: tools/mllm-llm-benchmark: add llama benchmark template; #622: tools/mllm-llm-benchmark: add CSV output and configurable runs). This would let you review the current change independently of the earlier ones.

  2. Keep everything in one PR, since the changes in #617/#622 are related to the new functionality here, and reviewing them together may give you better context.

My suggestion would be to review the earlier PRs (#617/#622) first. Once those are approved and merged into main, the later PR (#623) will automatically pick up the updates.

I want to make it easier for you to review and ensure the PR process is as smooth as possible, so please let me know which option works better for you.

Thanks for your time and feedback!

Best,
huangzhenhua
