
tools/mllm-llm-benchmark: estimate KV cache bytes from model config #623

Open

huangzhenhua111 wants to merge 3 commits into UbiquitousLearning:main from huangzhenhua111:fix/llm-benchmark-kv-est

Conversation

huangzhenhua111 commented Jan 31, 2026

What has changed

  • KV cache estimate now derived from model config when available (Llama/TinyLlama implemented).
  • No behavior change for models that don’t provide kvEstimateInfo().

How to test

  • Tested on x86_64 (WSL/Ubuntu), CPU backend
ninja -C build -v mllm-llm-benchmark
./build/bin/mllm-llm-benchmark \
  -n tiny_llama \
  -m /home/huangzhenhua/models/mllm_tinyllama/tinyllama-fp32.mllm \
  -c /home/huangzhenhua/mllm-runok/examples/llama/config_tiny_llama.json \
  -pp 8 -tg 4 -t 4 -cl 2048 \
  -r 3 -cs 1 \
  -kv 4 \
  -oc out_kv.csv
cat out_kv.csv

Output

schema_version,git_commit,arch,model_name,pp,tg,ttft_ms,prefill_speed,decode_speed,prefill_ms,decode_ms_per_tok,kv_est_bytes_pp,kv_est_bytes_final
1,f5c0006ce4f93d378e960648936c852418e69c88,x86_64,tiny_llama,8,4,410.549,19.8104,5.22112,403.829,191.53,360448,540672
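
As a sanity check on the last two columns: with the LLaMA-like estimate used in main.cpp (2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_element) and the usual TinyLlama-1.1B config (22 hidden layers, 4 KV heads, hidden_size 2048, 32 attention heads, hence head_dim = 64; these config values are assumed from the standard TinyLlama config rather than quoted from this PR), the arithmetic reproduces the CSV values:

kv_est_bytes_pp    = 2 * 22 * 4 * 64 * 8  * 4 = 360448   (pp = 8 tokens, -kv 4 bytes per element)
kv_est_bytes_final = 2 * 22 * 4 * 64 * 12 * 4 = 540672   (pp + tg = 12 tokens)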

Summary by CodeRabbit

Release Notes

  • New Features
    • Configurable benchmark runs via new command-line arguments (runs, cooldown, output CSV path, KV-dtype settings)
    • CSV export for benchmark results with model metadata and commit information
    • Per-run latency measurements (prefill and decode latency)
    • KV cache size estimation support for benchmarks
    • LLaMA model benchmarking with comprehensive performance metrics and warmup functionality


coderabbitai bot commented Jan 31, 2026

📝 Walkthrough

The changes enhance the benchmark tool with configurable runs via command-line arguments, CSV output capability, and KV cache estimation infrastructure. A new Llama model-specific benchmark implementation is added alongside expanded per-run timing metrics and dynamic runtime environment initialization.

Changes

  • CLI & CSV Output (tools/mllm-llm-benchmark/main.cpp): Adds command-line arguments for runs, cooldown, output CSV, schema version, KV-dtype bytes, and threads. Implements CSV file creation with header and per-run data logging. Introduces input validation, a configurable benchmark loop, per-run timing computation (prefill/decode latency), and KV cache byte estimation.
  • Benchmark Infrastructure (tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp, tools/mllm-llm-benchmark/models/All.hpp): Adds a KVCacheEstimateInfo struct and virtual kvEstimateInfo() method to the BenchmarkTemplate base class. Marks createBenchmark as inline in All.hpp and introduces explicit Llama model selection logic with safe tolower conversion.
  • Llama Model Implementation (tools/mllm-llm-benchmark/models/Llama.hpp): New file introducing the Llama_Benchmark subclass with KV cache estimation, model initialization, configuration printing, warmup, and a comprehensive benchmark run method measuring prefill/decode latency, throughput, and time-to-first-token.
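
For orientation, here is a minimal sketch of the new estimation hook, reconstructed from the review excerpts later in this thread; the field types and the surrounding class shape are assumptions, only the names (KVCacheEstimateInfo, kvEstimateInfo, num_layers, num_kv_heads, head_dim) come from this PR:

#include <cstdint>
#include <optional>

// Sketch only: names taken from the review excerpts below, types assumed.
struct KVCacheEstimateInfo {
  int64_t num_layers = 0;   // transformer layers that hold a KV cache
  int64_t num_kv_heads = 0; // key/value heads (GQA-aware)
  int64_t head_dim = 0;     // hidden_size / num_attention_heads
};

class BenchmarkTemplate {
 public:
  virtual ~BenchmarkTemplate() = default;
  // Optional hook: models that cannot derive KV sizes from config keep this default.
  virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
  // ... init/warmup/run hooks omitted ...
};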

Sequence Diagram(s)

sequenceDiagram
    participant CLI as CLI/main.cpp
    participant Factory as createBenchmark()
    participant Template as BenchmarkTemplate
    participant Llama as Llama_Benchmark
    participant CSV as CSV Output

    CLI->>CLI: Parse CLI args (runs, output_csv, etc.)
    CLI->>CLI: Validate inputs
    loop For each run (1 to R)
        CLI->>Factory: createBenchmark(model_name)
        Factory->>Llama: Create Llama_Benchmark
        CLI->>Llama: init(cfg_path, model_path)
        CLI->>Llama: printModelInfo()
        CLI->>Llama: warmup()
        CLI->>Llama: run(pp, tg)
        Llama->>Llama: Measure prefill latency
        Llama->>Llama: Measure decode latency
        Llama->>Template: kvEstimateInfo() override
        Llama-->>CLI: Return BenchmarkTemplateResult
        CLI->>CSV: Write per-run metrics + KV estimates
    end
    CLI->>CSV: Close file

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 Hop along, the benchmarks now run with flair,
CSV trails mark each lap with care,
KV caches sized, Llama models shine,
Configurable, precise—a test suite fine!

🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 12.50%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)
  • Title check: ✅ Passed. The title clearly and concisely summarizes the main change: adding KV cache byte estimation from model configuration to the benchmark tool.
  • Description check: ✅ Passed. The description covers what changed, how to test with a specific command example, and actual output. It adequately documents the feature and testing approach.


coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (5)
tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1)

69-72: Consider adding [[nodiscard]] attribute.

The return value of kvEstimateInfo() should always be checked by callers. Adding [[nodiscard]] helps prevent accidental discard of the optional result, which aligns with the static analysis hint.

💡 Suggested fix
   // Optional: provide info for KV cache size estimation.
   // If a model does not support it, return std::nullopt.
-  virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
+  [[nodiscard]] virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
tools/mllm-llm-benchmark/models/All.hpp (1)

31-35: Redundant substring check for "tinyllama".

The condition normalized_model_name.find("tinyllama") is redundant because find("llama") will already match any string containing "tinyllama". The "tiny_llama" check (with underscore) is valid though.

💡 Simplified condition
-  if (normalized_model_name.find("llama") != std::string::npos ||
-      normalized_model_name.find("tinyllama") != std::string::npos ||
-      normalized_model_name.find("tiny_llama") != std::string::npos) {
+  if (normalized_model_name.find("llama") != std::string::npos ||
+      normalized_model_name.find("tiny_llama") != std::string::npos) {
     return std::make_shared<Llama_Benchmark>();
   }
tools/mllm-llm-benchmark/main.cpp (3)

17-18: Remove duplicate stringify macros.

STR_HELPER/STR (lines 26-27) duplicates STRINGIFY_INTERNAL/STRINGIFY (lines 17-18). Use only one set.

♻️ Remove duplicate macros
 #define STRINGIFY_INTERNAL(x) #x
 #define STRINGIFY(x) STRINGIFY_INTERNAL(x)

 #include "models/All.hpp"

 #ifndef MLLM_GIT_COMMIT_HASH
 #define MLLM_GIT_COMMIT_HASH unknown
 #endif

-#define STR_HELPER(x) #x
-#define STR(x) STR_HELPER(x)
-

Also applies to: 26-27


194-202: Consider using a named constant for the K+V multiplier.

The factor 2.0 represents K and V caches. A named constant improves clarity.

💡 Suggested improvement
+    // K and V each need storage, hence factor of 2
+    constexpr double kKVCacheMultiplier = 2.0;
     // Rough KV cache estimate (bytes)
     double kv_est_bytes_pp = 0.0;
     double kv_est_bytes_final = 0.0;
     if (auto info = benchmark->kvEstimateInfo(); info.has_value()) {
       const int32_t bytes_per = kv_dtype_bytes.get();  // 1/2/4
       // LLaMA-like KV: 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes
-      kv_est_bytes_pp = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
-      kv_est_bytes_final = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
+      kv_est_bytes_pp = kKVCacheMultiplier * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
+      kv_est_bytes_final = kKVCacheMultiplier * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
     }

Based on learnings: "Use named constants instead of magic numbers."


163-164: Cast pp to float explicitly in the latency calculation.

pp is int32_t. The implicit conversion to float already yields floating-point division here, but an explicit cast documents the intent and avoids implicit-conversion warnings.

💡 Explicit cast
-      float prefill_ms = (result.prefill_speed > 0.0f) ? (pp / result.prefill_speed) * 1000.0f : 0.0f;
+      float prefill_ms = (result.prefill_speed > 0.0f) ? (static_cast<float>(pp) / result.prefill_speed) * 1000.0f : 0.0f;

Comment on lines +17 to +24
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
  if (!cfg_) return std::nullopt;
  KVCacheEstimateInfo info;
  info.num_layers = cfg_->num_hidden_layers;
  info.num_kv_heads = cfg_->num_key_value_heads;
  info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
  return info;
}

⚠️ Potential issue | 🟡 Minor

Add division-by-zero guard for head_dim calculation.

printModelInfo() (line 58) defensively checks num_attention_heads > 0 before dividing, but kvEstimateInfo() does not. If the config has num_attention_heads == 0, this would cause a division by zero.

🛡️ Proposed fix
   std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
     if (!cfg_) return std::nullopt;
+    if (cfg_->num_attention_heads <= 0) return std::nullopt;
     KVCacheEstimateInfo info;
     info.num_layers = cfg_->num_hidden_layers;
     info.num_kv_heads = cfg_->num_key_value_heads;
     info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
     return info;
   }

Also consider adding [[nodiscard]] per the static analysis hint.

🧰 Tools
🪛 Clang (14.0.6)

[error] 17-17: function 'kvEstimateInfo' should be marked [[nodiscard]]

(modernize-use-nodiscard,-warnings-as-errors)

🤖 Prompt for AI Agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp` around lines 17 - 24,
kvEstimateInfo() can divide by zero when cfg_->num_attention_heads == 0; add a
guard similar to printModelInfo(): if cfg_->num_attention_heads == 0 either
return std::nullopt or set info.head_dim = 0 (choose behavior consistent with
callers) before computing head_dim, and ensure cfg_ is checked first; also mark
the method [[nodiscard]] to satisfy static analysis. Update the function
signature kvEstimateInfo() and the body (referencing cfg_, KVCacheEstimateInfo,
head_dim, num_attention_heads) accordingly.

huangzhenhua111 (Author) commented Jan 31, 2026

Hi @chenghuaWang and @jialilve,

I hope you're doing well! I have a quick question about the review process for my recent PRs (#617, #622), which this change builds on, and my new PR (#623), which I'm not sure whether to keep separate from them.

My new PR includes the changes related to KV cache estimation and is built on top of the earlier PRs. However, those previous PRs haven't been reviewed yet, so I wanted to ask whether you would prefer me to:

  1. Separate out the current change into a standalone PR focused only on the KV cache estimate, and keep the earlier changes as separate PRs (#617: tools/mllm-llm-benchmark: add llama benchmark template; #622: tools/mllm-llm-benchmark: add CSV output and configurable runs). This would let you review the current change independently of the earlier ones.

  2. Keep everything in one PR, since the changes in #617/#622 are related to the new functionality here, and reviewing them together may give you better context.

My suggestion would be to review the earlier PRs (#617/#622) first. Once those are approved and merged into main, the later PR (#623) will automatically pick up the updates.

I want to make it easier for you to review and ensure the PR process is as smooth as possible, so please let me know which option works better for you.

Thanks for your time and feedback!

Best,
huangzhenhua
