tools/mllm-llm-benchmark: estimate KV cache bytes from model config #623

huangzhenhua111 wants to merge 3 commits into UbiquitousLearning:main from
Conversation
📝 Walkthrough

The changes enhance the benchmark tool with configurable runs via command-line arguments, CSV output capability, and KV cache estimation infrastructure. A new Llama model-specific benchmark implementation is added alongside expanded per-run timing metrics and dynamic runtime environment initialization.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant CLI as CLI/main.cpp
    participant Factory as createBenchmark()
    participant Template as BenchmarkTemplate
    participant Llama as Llama_Benchmark
    participant CSV as CSV Output
    CLI->>CLI: Parse CLI args (runs, output_csv, etc.)
    CLI->>CLI: Validate inputs
    loop For each run (1 to R)
        CLI->>Factory: createBenchmark(model_name)
        Factory->>Llama: Create Llama_Benchmark
        CLI->>Llama: init(cfg_path, model_path)
        CLI->>Llama: printModelInfo()
        CLI->>Llama: warmup()
        CLI->>Llama: run(pp, tg)
        Llama->>Llama: Measure prefill latency
        Llama->>Llama: Measure decode latency
        Llama->>Template: kvEstimateInfo() override
        Llama-->>CLI: Return BenchmarkTemplateResult
        CLI->>CSV: Write per-run metrics + KV estimates
    end
    CLI->>CSV: Close file
```

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@tools/mllm-llm-benchmark/models/Llama.hpp`:
- Around line 17-24: kvEstimateInfo() can divide by zero when
cfg_->num_attention_heads == 0; add a guard similar to printModelInfo(): if
cfg_->num_attention_heads == 0 either return std::nullopt or set info.head_dim =
0 (choose behavior consistent with callers) before computing head_dim, and
ensure cfg_ is checked first; also mark the method [[nodiscard]] to satisfy
static analysis. Update the function signature kvEstimateInfo() and the body
(referencing cfg_, KVCacheEstimateInfo, head_dim, num_attention_heads)
accordingly.
🧹 Nitpick comments (5)
tools/mllm-llm-benchmark/models/BenchmarkTemplate.hpp (1)

69-72: Consider adding a [[nodiscard]] attribute.

The return value of kvEstimateInfo() should always be checked by callers. Adding [[nodiscard]] helps prevent accidental discard of the optional result, which aligns with the static analysis hint.

💡 Suggested fix

```diff
 // Optional: provide info for KV cache size estimation.
 // If a model does not support it, return std::nullopt.
-virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
+[[nodiscard]] virtual std::optional<KVCacheEstimateInfo> kvEstimateInfo() const { return std::nullopt; }
```

tools/mllm-llm-benchmark/models/All.hpp (1)
31-35: Redundant substring checks for "tinyllama" and "tiny_llama".

Both extra conditions are redundant: find("llama") already matches any string containing "tinyllama", and "tiny_llama" (with underscore) also contains "llama" as a substring, so a single check suffices.

💡 Simplified condition

```diff
-  if (normalized_model_name.find("llama") != std::string::npos ||
-      normalized_model_name.find("tinyllama") != std::string::npos ||
-      normalized_model_name.find("tiny_llama") != std::string::npos) {
+  if (normalized_model_name.find("llama") != std::string::npos) {
     return std::make_shared<Llama_Benchmark>();
   }
```

tools/mllm-llm-benchmark/main.cpp (3)
17-18: Remove duplicate stringify macros.

STR_HELPER/STR (lines 26-27) duplicates STRINGIFY_INTERNAL/STRINGIFY (lines 17-18). Use only one set.

♻️ Remove duplicate macros

```diff
 #define STRINGIFY_INTERNAL(x) #x
 #define STRINGIFY(x) STRINGIFY_INTERNAL(x)
 #include "models/All.hpp"
 #ifndef MLLM_GIT_COMMIT_HASH
 #define MLLM_GIT_COMMIT_HASH unknown
 #endif
-#define STR_HELPER(x) #x
-#define STR(x) STR_HELPER(x)
```

Also applies to: 26-27
194-202: Consider using a named constant for the K+V multiplier.

The factor 2.0 represents the K and V caches. A named constant improves clarity.

💡 Suggested improvement

```diff
+  // K and V each need storage, hence factor of 2
+  constexpr double kKVCacheMultiplier = 2.0;
   // Rough KV cache estimate (bytes)
   double kv_est_bytes_pp = 0.0;
   double kv_est_bytes_final = 0.0;
   if (auto info = benchmark->kvEstimateInfo(); info.has_value()) {
     const int32_t bytes_per = kv_dtype_bytes.get();  // 1/2/4
     // LLaMA-like KV: 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes
-    kv_est_bytes_pp = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
-    kv_est_bytes_final = 2.0 * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
+    kv_est_bytes_pp = kKVCacheMultiplier * info->num_layers * info->num_kv_heads * info->head_dim * (double)pp * bytes_per;
+    kv_est_bytes_final = kKVCacheMultiplier * info->num_layers * info->num_kv_heads * info->head_dim * (double)(pp + tg) * bytes_per;
   }
```

Based on learnings: "Use named constants instead of magic numbers."
163-164: Cast pp to float explicitly in the latency calculation.

pp is int32_t. Because result.prefill_speed is a float, pp is implicitly converted before the division, so the arithmetic is already floating-point; an explicit cast makes that intent obvious and guards against a future change of prefill_speed to an integer type silently introducing integer division.

💡 Explicit cast

```diff
-  float prefill_ms = (result.prefill_speed > 0.0f) ? (pp / result.prefill_speed) * 1000.0f : 0.0f;
+  float prefill_ms = (result.prefill_speed > 0.0f) ? (static_cast<float>(pp) / result.prefill_speed) * 1000.0f : 0.0f;
```
```cpp
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
  if (!cfg_) return std::nullopt;
  KVCacheEstimateInfo info;
  info.num_layers = cfg_->num_hidden_layers;
  info.num_kv_heads = cfg_->num_key_value_heads;
  info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
  return info;
}
```
Add division-by-zero guard for head_dim calculation.
printModelInfo() (line 58) defensively checks num_attention_heads > 0 before dividing, but kvEstimateInfo() does not. If the config has num_attention_heads == 0, this would cause a division by zero.
🛡️ Proposed fix

```diff
 std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
   if (!cfg_) return std::nullopt;
+  if (cfg_->num_attention_heads <= 0) return std::nullopt;
   KVCacheEstimateInfo info;
   info.num_layers = cfg_->num_hidden_layers;
   info.num_kv_heads = cfg_->num_key_value_heads;
   info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
   return info;
 }
```

Also consider adding [[nodiscard]] per the static analysis hint.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```cpp
std::optional<KVCacheEstimateInfo> kvEstimateInfo() const override {
  if (!cfg_) return std::nullopt;
  if (cfg_->num_attention_heads <= 0) return std::nullopt;
  KVCacheEstimateInfo info;
  info.num_layers = cfg_->num_hidden_layers;
  info.num_kv_heads = cfg_->num_key_value_heads;
  info.head_dim = cfg_->hidden_size / cfg_->num_attention_heads;
  return info;
}
```
🧰 Tools
🪛 Clang (14.0.6)
[error] 17-17: function 'kvEstimateInfo' should be marked [[nodiscard]]
(modernize-use-nodiscard,-warnings-as-errors)
Hi @chenghuaWang and @jialilve, I hope you're doing well! I have a quick question about the review process for my recent PRs (#617, #622), which split out the earlier changes, and my new PR (#623), which I'm not sure whether to keep separate from them. The new PR includes changes related to KV cache estimation and is built on top of the earlier PRs. Since those haven't been reviewed yet, I wanted to ask which you prefer:

My suggestion is to review the earlier changes (#617/#622) first. Once those are approved and merged into main, the later PR (#623) will automatically reflect the updates. I want to make the review as easy and smooth as possible, so please let me know which option works better for you. Thanks for your time and feedback! Best, |
What has been changed
kvEstimateInfo().

How to test
Output
Summary by CodeRabbit
Release Notes