feat: Add Support of Qwen2.5Omni Model #612
KKkai0315 wants to merge 15 commits into UbiquitousLearning:main from
Conversation
- …fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
- … debug print statements from Qwen3DecoderLayer
- …ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
- …improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
- … zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
- …d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
- …es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
- … input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
📝 Walkthrough

Adds a new Qwen2.5-Omni multimodal example and full model support: build targets and three CLI runners; configuration JSON; audio preprocessing, tokenizer, multimodal model implementation (vision/audio/text), and new ops/layers (ConvTranspose1D, Tanh) with CPU backend implementations and tests.

Changes
Sequence Diagram(s)

sequenceDiagram
actor User
participant CLI as Runner (text/image/audio)
participant Tokenizer as Qwen2_5OmniTokenizer
participant Preproc as Audio/Image Preprocessor
participant Model as Qwen2_5OmniForCausalLM
participant Thinker as Qwen2_5OmniThinker
User->>CLI: provide input (text/image/audio)
CLI->>Tokenizer: load tokenizer & config
CLI->>Preproc: (if media) process file -> features
CLI->>Tokenizer: convertMessage / convertVisionMessage / convertAudioMessage
Tokenizer-->>CLI: token ids + feature tensors
CLI->>Model: forward(input_ids, feature_tensors)
Model->>Thinker: encode modalities, fuse into decoder input
Thinker->>Thinker: vision/audio encoders -> multimodal embeddings
Thinker->>Model: decode and produce logits
loop streaming
CLI->>Tokenizer: detokenize(next_token)
Tokenizer-->>CLI: text chunk
CLI->>User: stream output
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 24-29: The code currently silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is not "v1" or "v2";
instead validate model_version.get() explicitly: examine the string returned by
model_version.get(), set file_version to mllm::ModelFileVersion::kV1 or ::kV2
for "v1" and "v2" respectively, and for any other value print a clear error
mentioning the allowed values and terminate (non-zero exit) to avoid loading the
wrong format; update the block that assigns file_version (and any related usage)
to enforce this validation and fail fast on unsupported values.
- Around line 48-60: The example currently sets audio_path and prompt_text to
empty strings causing processAudioFile()/convertAudioMessage() to get empty
audio via mllm::audio::readWAV("") and fail; restore interactive input by
re-enabling the std::getline calls (or alternatively accept paths via CLI args)
so audio_path and prompt_text are populated before calling
processAudioFile()/convertAudioMessage(); ensure the check for "exit"/"quit"
remains and that prompt_text falls back to a default if empty to avoid empty
prompt usage.
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json`:
- Around line 5-7: The config contains unused flags enable_audio_output and
enable_talker (alongside model_type "qwen2_5_omni"); update the JSON to avoid
misleading users by either removing the keys enable_audio_output and
enable_talker or setting both to false. Locate the entries for
"enable_audio_output" and "enable_talker" in the config_qwen2_5omni_7B.json and
change their values to false (or delete those lines) so only supported
functionality (the thinker/text output) is advertised.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 24-29: The current logic silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is unknown; update the
branch around model_version.get() so you explicitly accept only "v1" and "v2",
set file_version to mllm::ModelFileVersion::kV1 or kV2 accordingly, and
otherwise report an error (include the invalid value and allowed values) and
terminate with a non‑zero exit or throw an exception; reference the
model_version variable and the file_version / mllm::ModelFileVersion symbols
when making this validation change.
- Around line 52-60: Uncomment the two std::getline calls so the program reads
user input instead of using empty strings: after the fmt::print("Image path (or
'exit/quit'): "); call restore std::getline(std::cin, image_path) to populate
image_path (so the exit/quit check works), and after fmt::print("Prompt text:
"); restore std::getline(std::cin, prompt_text) to populate prompt_text; ensure
the variables image_path and prompt_text are used as before and that included
<iostream> usage is consistent with the surrounding code.
In `@examples/qwen2_5omni/text_infer.cpp`:
- Around line 24-29: The code silently defaults to mllm::ModelFileVersion::kV1
when model_version is unknown; change this to validate the input from
model_version.get() and reject unsupported values instead of defaulting. In the
block that sets file_version (referencing model_version.get(), file_version, and
mllm::ModelFileVersion), check for "v1" and "v2" explicitly and if neither
matches, print a clear error mentioning the unsupported model_version and
terminate (e.g., return/exit with non-zero or throw) so the program fails fast
rather than loading the wrong format.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 942-958: The code leaves attn uninitialized when
key_states.dtype() is neither kFloat32 nor kFloat16; add a fallback else branch
in the attention block (around the key_states.dtype() checks) that handles other
dtypes by casting query_states and key_states to kFloat32, computing attn =
nn::functional::matmul(...)* (1.f / sqrtf(head_dim_)), applying mask_ and
softmax_, and then casting attn back to the original key_states.dtype() if
needed (or alternatively throw a clear runtime_error mentioning unsupported
dtype) so that attn is always initialized before using
nn::functional::matmul(attn, value_states).
In `@mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp`:
- Around line 303-316: The code computes img_token_nums from grid_thw and then
inserts tokens without validating it; add a check (similar to
convertAudioMessage's audio_token_nums > 0) after computing img_token_nums and
before ids.insert to ensure img_token_nums > 0, and on failure call
MLLM_ERROR_EXIT with a clear message (e.g., "Invalid image token count") so you
don't call ids.insert(img_token_nums - 1, ...) with zero/negative counts; locate
symbols grid_thw, img_token_nums, image_token_id, ids.insert and the surrounding
convertVisionMessage logic to make the change.
🧹 Nitpick comments (7)
mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp (2)
98-114: Redundant condition in whitespace handling logic. After the while loop on line 100 consumes all whitespace, `str[pos]` is guaranteed to be a non-whitespace character (when `pos < str.size()`). Therefore, the condition `std::iswspace(str[pos])` on line 101 is never true, making the second half of the condition unreachable.

This also makes lines 109-114 redundant, since the only way to reach them is if the first block restored `pos` (line 105), but that only happens when the condition on line 101 is false, which it always is unless we're at the end of the string. Consider clarifying the intended behavior or simplifying this logic.
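A minimal sketch of the simplified logic this nitpick suggests, assuming the goal is simply to skip leading whitespace; the helper name is illustrative and not a member of the actual tokenizer:

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical helper: advance `pos` past any leading whitespace in `str`.
// After the loop, either pos == str.size() or str[pos] is non-whitespace,
// so no second iswspace() check is needed.
inline size_t skipWhitespace(const std::wstring& str, size_t pos) {
  while (pos < str.size() && std::iswspace(str[pos])) { ++pos; }
  return pos;
}
```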
189-193: Clarify the technical reason for the max_pixels constraint. The comment "interestingly, the answer went bad when setting max_pixels higher" is informal and doesn't explain the root cause. Consider documenting the technical reason (e.g., memory constraints, model training resolution, numerical stability) to help future maintainers understand why this limit exists.
mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp (2)
19-37: Consider adding existence checks or documenting required fields. The config parser accesses required fields like `text_config`, `hidden_size`, `intermediate_size`, etc. directly without `.contains()` checks. If the config file is missing these required fields, the parser will throw an exception. This fail-fast behavior may be intentional, but consider either:
- Adding explicit existence checks with descriptive error messages
- Documenting the required config structure in a header comment
80-118: Consider extracting common parsing logic to reduce duplication. The root-level config parsing (lines 80-118) largely duplicates the thinker_config parsing (lines 23-79). This is understandable for supporting two config formats, but could be refactored into helper methods if the config format evolves.
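A rough sketch of how both nitpicks could be addressed together: a shared helper that checks required keys before reading them, reusable by the root-level and thinker_config parsing paths. The helper name and the use of nlohmann::json are assumptions for illustration, not the project's actual API:

```cpp
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>

// Hypothetical helper: fetch a required field with a descriptive error
// instead of an opaque exception from operator[] / at().
inline const nlohmann::json& requireField(const nlohmann::json& cfg, const std::string& key) {
  if (!cfg.contains(key)) {
    throw std::runtime_error("Qwen2.5-Omni config is missing required field: " + key);
  }
  return cfg.at(key);
}

// Both the root-level and thinker_config paths could then share one routine,
// e.g. parseTextConfig(requireField(cfg, "text_config")).
```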
mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (2)
958-961: Consider passing layer_idx as a constructor parameter instead of a public member.

`layer_idx_` is a public member that's set externally (line 1003). This breaks encapsulation and could lead to bugs if the member is not set before use.

🛠️ Suggested refactor
```diff
 class Qwen2_5OmniAttention final : public nn::Module {
+  int layer_idx_;
   // ... other members ...
  public:
-  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg, int layer_idx)
+      : nn::Module(name), layer_idx_(layer_idx) {
     // ...
   }
-
-  int layer_idx_;
 };
```

Then update Qwen2_5OmniDecoder and the module registration accordingly.
853-858: Add the `explicit` keyword for consistency with other constructors.

`Qwen2_5OmniMLP`, `Qwen2_5OmniAttention` (line 889), `Qwen2_5OmniDecoder` (line 972), and `Qwen2_5OmniText` (line 1001) are missing the `explicit` keyword that other similar classes use (e.g., `Qwen2_5OmniVisionMLP` at line 421).

🛠️ Suggested fix

```diff
- Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+ explicit Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
```

Apply similar changes to `Qwen2_5OmniAttention`, `Qwen2_5OmniDecoder`, and `Qwen2_5OmniText`.

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp (1)
43-55: Add an M_PI definition guard for portability.

`M_PI` is not part of the C++ standard and is only a POSIX extension. While commonly available on Unix-like systems, it may be undefined on other platforms (notably Windows with MSVC without `_USE_MATH_DEFINES`). The codebase already uses this pattern in `third_party/wenet_audio/fft.h` and `mllm/backends/cpu/ops/STFTOp.cpp`.

🛠️ Suggested fix

```diff
+#ifndef M_PI
+#define M_PI 3.14159265358979323846
+#endif
+
 inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
```
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail in confusing ways. Consider rejecting unsupported values explicitly.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
```cpp
std::string audio_path;
std::string prompt_text;

fmt::print("Audio path (or 'exit/quit'): ");
//std::getline(std::cin, audio_path);
//if (audio_path == "exit" || audio_path == "quit") { return 0; }
audio_path = "";

fmt::print("Prompt text: ");
//std::getline(std::cin, prompt_text);
//if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
prompt_text = "";
```
Enable interactive input for the example to work as intended.
The example currently has input hardcoded to empty strings, causing convertAudioMessage() to fail. When an empty audio_path is passed to processAudioFile(), it attempts to read from an empty path via mllm::audio::readWAV(""), which returns empty audio data. This triggers the error check if (audio_result.input_features.isNil() || audio_result.feature_length <= 0) and exits with MLLM_ERROR_EXIT(ExitCode::kIOError, "Failed to extract audio features for Qwen2.5-Omni.").
Restore the commented std::getline() calls to enable interactive input, or wire the paths to CLI arguments.
Suggested fix (restore interactive input)
- fmt::print("Audio path (or 'exit/quit'): ");
- //std::getline(std::cin, audio_path);
- //if (audio_path == "exit" || audio_path == "quit") { return 0; }
- audio_path = "";
-
- fmt::print("Prompt text: ");
- //std::getline(std::cin, prompt_text);
- //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
- prompt_text = "";
+ fmt::print("Audio path (or 'exit/quit'): ");
+ std::getline(std::cin, audio_path);
+ if (audio_path == "exit" || audio_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
+ if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
| "enable_audio_output": true, | ||
| "enable_talker": true, | ||
| "model_type": "qwen2_5_omni", |
Remove or set enable_talker and enable_audio_output to false—these flags are currently unused.
These flags are defined in the config but never parsed or checked in the model code. Setting them to true misleads users about available functionality since only the thinker (text output) is currently loaded and used. To avoid confusion, either remove these flags or set them to false until talker and audio output support are actually implemented.
Suggested change
- "enable_audio_output": true,
- "enable_talker": true,
+ "enable_audio_output": false,
+ "enable_talker": false,📝 Committable suggestion
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
| fmt::print("Image path (or 'exit/quit'): "); | ||
| image_path = ""; | ||
| //std::getline(std::cin, image_path); | ||
| if (image_path == "exit" || image_path == "quit") { return 0; } | ||
|
|
||
| fmt::print("Prompt text: "); | ||
| prompt_text = ""; | ||
| //std::getline(std::cin, prompt_text); | ||
|
|
Uncomment the std::getline() calls to restore interactive input.
The example prints prompts for user input but then hardcodes empty strings and has the std::getline() calls commented out. This makes the example non-functional since it will always attempt to process empty image paths and prompts. Restore the interactive input handling by uncommenting lines 54 and 59 (the std::getline() calls).
Suggested fix
- fmt::print("Image path (or 'exit/quit'): ");
- image_path = "";
- //std::getline(std::cin, image_path);
- if (image_path == "exit" || image_path == "quit") { return 0; }
-
- fmt::print("Prompt text: ");
- prompt_text = "";
- //std::getline(std::cin, prompt_text);
+ fmt::print("Image path (or 'exit/quit'): ");
+ std::getline(std::cin, image_path);
+ if (image_path == "exit" || image_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
```cpp
  Tensor attn;
  if (key_states.dtype() == kFloat32) {
    attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
    attn = mask_(attn);
    attn = softmax_(attn);
  } else if (key_states.dtype() == kFloat16) {
    attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
    attn = mask_(attn);
    attn = softmax_(attn);
    attn = attn.to(kFloat16);
  }

  auto output = nn::functional::matmul(attn, value_states);
  output = output.transpose(1, 2).view({B, S, num_attention_heads_ * head_dim_});
  output = o_proj_(output);
  return {output};
}
```
Missing fallback for unsupported dtype in attention computation.
If key_states.dtype() is neither kFloat32 nor kFloat16, the attn tensor remains uninitialized (line 942 declares it as Tensor attn;), and the subsequent nn::functional::matmul(attn, value_states) on line 954 would cause undefined behavior.
🛠️ Suggested fix
if (key_states.dtype() == kFloat32) {
attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
attn = mask_(attn);
attn = softmax_(attn);
} else if (key_states.dtype() == kFloat16) {
attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
attn = mask_(attn);
attn = softmax_(attn);
attn = attn.to(kFloat16);
+ } else {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported dtype for attention computation.");
}
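As an alternative to exiting, the AI-agent prompt above describes a cast-based fallback: compute in float32 for any other dtype and cast back. A minimal sketch of that branch, reusing only the calls already shown in the excerpt (`to`, `dtype`, `matmul`, `mask_`, `softmax_`); whether to prefer this over failing fast is a design choice left to the authors:

```cpp
} else {
  // Assumed fallback: compute attention in fp32, then cast back to the input dtype.
  const auto orig_dtype = key_states.dtype();
  attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true)
         * (1.f / sqrtf(head_dim_));
  attn = mask_(attn);
  attn = softmax_(attn);
  attn = attn.to(orig_dtype);
}
```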
```cpp
auto grid_t = grid_thw.ptr<int32_t>()[0];
auto grid_h = grid_thw.ptr<int32_t>()[1];
auto grid_w = grid_thw.ptr<int32_t>()[2];
int32_t img_token_nums = grid_t * grid_h * grid_w;
img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);

auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
{
  auto it = std::find(ids.begin(), ids.end(), image_token_id);
  if (it == ids.end()) {
    MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing <|IMAGE|> token in Qwen2.5-Omni prompt template.");
  }
  ids.insert(it + 1, img_token_nums - 1, image_token_id);
}
```
Add validation for image token count before insertion.
Unlike convertAudioMessage which validates audio_token_nums > 0 (lines 347-349), convertVisionMessage doesn't validate img_token_nums before the ids.insert() call. If grid dimensions somehow result in zero or negative values, inserting img_token_nums - 1 tokens would cause undefined behavior.
🛠️ Suggested fix
int32_t img_token_nums = grid_t * grid_h * grid_w;
img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);
+ if (img_token_nums <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid image token count for Qwen2.5-Omni.");
+ }
 auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/ops/ConvTranspose1DOp.hpp`:
- Around line 11-22: Add concise doc comments above the CPUConvTranspose1DOp
class and CPUConvTranspose1DOpFactory to document their roles and expected
tensor shapes/semantics: describe that CPUConvTranspose1DOp implements the CPU
backend of aops::ConvTranspose1DOp, list expected input tensors (e.g., input,
weight, optional bias), expected output shape behavior (how output length is
computed from stride/padding/dilation/output_padding), and any preconditions
(memory layout, dtype). Also add a brief comment for CPUConvTranspose1DOpFactory
explaining it constructs CPUConvTranspose1DOp instances from
aops::ConvTranspose1DOpOptions so readers know this factory ties the OpOptions
to the CPU implementation.
In `@mllm/core/aops/ConvTranspose1DOp.cpp`:
- Around line 17-23: Validate options_.groups before any division/modulo in
ConvTranspose1DOp::load and ConvTranspose1DOp::reshape by checking it is > 0 and
that it cleanly divides the relevant channel counts (e.g., options_.out_channels
% options_.groups == 0 and any other channel dimension used with groups is
divisible). If a check fails, return/raise a clear error (e.g., throw
std::invalid_argument or use the existing error/reporting mechanism) with a
message referencing the op name and invalid group value. Update the logic around
weight_ = weight_.view(...) in load() and the corresponding shape calculations
in reshape() to assume groups is validated so the divisions/modulo are safe.
Ensure tests or callers expecting validation get a deterministic error rather
than UB/crash.
- Around line 72-81: The code must validate computed seq_out and option
constraints before allocating the output tensor: compute seq_out as currently
done, then check that seq_out > 0 and that options_.output_padding <
options_.stride and that options_.padding >= 0, options_.dilation >= 1,
options_.kernel_size > 0 and options_.stride > 0; if any check fails,
return/raise a clear configuration error (e.g., throw std::invalid_argument or
use the project’s error/reporting helper) instead of calling
outputs.emplace_back(Tensor::empty(...)); place these checks immediately after
the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
In `@mllm/core/aops/ConvTranspose1DOp.hpp`:
- Around line 12-49: Add API documentation comments for the new public types:
place a brief docstring above the struct ConvTranspose1DOpOptions describing its
purpose (options for 1D transposed convolution), and document each field
(in_channels, out_channels, kernel_size, stride, padding, output_padding,
dilation, groups, bias) including units, valid ranges/constraints (e.g.,
positive ints, output_padding < max(stride, dilation), groups divides
in_channels/out_channels, kernel_size > 0), and default meanings; also add a
short doc comment above the ConvTranspose1DOp class describing its role,
lifecycle methods (load, trace, forward, reshape, setup, getParams), and
invariants about weight_ and bias_ (shapes derived from options, bias present
only if options.bias is true). Ensure comments reference the exact symbols
ConvTranspose1DOpOptions, ConvTranspose1DOp, weight_, bias_ so callers can
locate the documented behavior.
In `@mllm/core/aops/TanhOp.hpp`:
- Around line 11-30: Add API doc comments for TanhOp and TanhOpOptions: place a
brief class-level comment above struct TanhOpOptions and class TanhOp describing
the operation's purpose (element-wise hyperbolic tangent), expected inputs
(single tensor of arbitrary shape), outputs (single tensor with same shape and
dtype), and any error/shape expectations (e.g., require one input tensor,
matching output count; throw or assert on incorrect counts). Also add short
comments on the public methods load, trace, forward, reshape, setup and the
options() accessor indicating their roles (e.g., load: load parameters from
ParameterFile; forward: compute element-wise tanh; reshape/setup: validate and
set output shapes; trace: record shapes/metadata) and mention any preconditions
or postconditions (shape invariants), following existing doc style in the
codebase.
🧹 Nitpick comments (4)
mllm/backends/cpu/ops/ConvTranspose1DOp.cpp (2)
48-48: Unnecessary zero-initialization of output. The output buffer is zero-filled here, but the subsequent loop at line 86 directly assigns `output_ptr[output_idx] = sum;` rather than accumulating. Each output element is computed completely before being written, making this zero-fill redundant overhead.

♻️ Suggested fix

```diff
- std::fill_n(output_ptr, output.numel(), 0.0f);
+ // No zero-fill needed: each output element is fully computed before assignment
```
52-90: Redundant dtype switch after assertion. The `switch (output.dtype())` at line 52 is redundant since `MLLM_RT_ASSERT_EQ(output.dtype(), kFloat32)` at line 39 already guarantees the dtype. Consider removing the switch or, if future dtype support is planned, removing the assertion.

♻️ Suggested simplification

```diff
- switch (output.dtype()) {
-   case kFloat32:
-     MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
+ MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
       // ... loop body unchanged ...
-     });
-     break;
-   default: NYI("ConvTranspose1D: unsupported data type");
- }
+ });
```

tests/cpu/ConvTranspose1DKernelTest.hpp (1)
114-133: Redundant variable extraction in failure path. Lines 117-128 extract all config values into local variables, but these are only used in the `print()` call. Consider simplifying by passing the config directly or just printing the failing config map.

♻️ Simplified failure handling

```diff
 bool testConvTranspose1D(const std::vector<std::unordered_map<std::string, int32_t>>& cfgs) {
   for (auto& cfg : cfgs) {
     if (!testConvTranspose1DOnce(cfg)) {
-      auto batch = cfg.at("batch");
-      auto in_channel = cfg.at("in_channel");
-      auto out_channel = cfg.at("out_channel");
-      auto sequence = cfg.at("sequence");
-      auto kernel_size = cfg.at("kernel_size");
-      auto stride = cfg.at("stride");
-      auto padding = cfg.at("padding");
-      auto output_padding = cfg.at("output_padding");
-      auto dilation = cfg.at("dilation");
-      auto groups = cfg.at("groups");
-      auto bias = cfg.at("bias");
-      print(batch, in_channel, out_channel, sequence, kernel_size, stride, padding, output_padding, dilation, groups, bias);
+      // Print failing configuration for debugging
+      for (const auto& [key, value] : cfg) {
+        print(key, "=", value);
+      }
       return false;
     }
   }
   return true;
 }
```

mllm/nn/layers/Tanh.hpp (1)
11-18: Add a brief class doc comment. Public APIs should have a short description of purpose/inputs/outputs. As per coding guidelines, please add a brief doc comment for the Tanh layer.
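A minimal sketch of what such a comment might look like; the base-class name is a guess from the surrounding excerpts, not the file's actual contents:

```cpp
/// Tanh layer: applies the element-wise hyperbolic tangent activation.
/// Input:  a single tensor of any shape and floating-point dtype.
/// Output: a tensor with the same shape and dtype, where each element x
///         is replaced by tanh(x).
class Tanh final : public Layer {
  // ...
};
```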
```cpp
class CPUConvTranspose1DOp final : public aops::ConvTranspose1DOp {
 public:
  explicit CPUConvTranspose1DOp(const aops::ConvTranspose1DOpOptions& options);

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class CPUConvTranspose1DOpFactory : public TypedOpFactory<OpTypes::kConvTranspose1D, aops::ConvTranspose1DOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::ConvTranspose1DOpOptions& options) override {
    return std::make_shared<CPUConvTranspose1DOp>(options);
  }
```
Add brief docs for CPU op/factory.
Please add short comments describing the CPU op’s role and any expectations about inputs/outputs to keep backend APIs self-documenting. As per coding guidelines, public APIs should include clear docstrings/comments.
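A rough sketch of the kind of comments being asked for, based only on the declarations shown above; the shape and layout details are assumptions that would need to be checked against the implementation:

```cpp
/// CPU backend implementation of aops::ConvTranspose1DOp (1D transposed convolution).
/// forward() presumably expects one 3D input tensor shaped [batch, in_channels, sequence]
/// and writes one output tensor whose length is derived from stride, padding,
/// dilation and output_padding by the base op's reshape().
class CPUConvTranspose1DOp final : public aops::ConvTranspose1DOp { /* ... */ };

/// Factory tying aops::ConvTranspose1DOpOptions to the CPU implementation:
/// createOpImpl() constructs a CPUConvTranspose1DOp for OpTypes::kConvTranspose1D.
class CPUConvTranspose1DOpFactory; // see the declaration above
```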
```cpp
void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
  switch (ploader->version()) {
    case ModelFileVersion::kV1: {
      weight_ = ploader->pull(getName() + ".weight");
      if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
      weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
      if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
```
Guard against invalid groups before division/modulo.
options_.groups is used in division/modulo in both load() and reshape(). If it’s 0 (or incompatible with channels), this will crash or corrupt shapes. Add explicit validation before use. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix
void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
+ if (options_.groups <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+ }
+ MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);
switch (ploader->version()) {
case ModelFileVersion::kV1: {
weight_ = ploader->pull(getName() + ".weight");
if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
break;
}
@@
void ConvTranspose1DOp::reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
const auto& i = inputs[0];
const auto& ishape = i.shape();
if (ishape.size() != 3) {
MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp expects 3D input, got {} D", ishape.size());
outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
return;
}
+ if (options_.groups <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+ outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+ return;
+ }
const int batch = ishape[0];
const int in_channels = ishape[1];
const int sequence = ishape[2];
MLLM_RT_ASSERT_EQ(in_channels, options_.in_channels);
MLLM_RT_ASSERT_EQ(in_channels % options_.groups, 0);
  MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);

Also applies to: 68-71
```cpp
const int kernel_size = options_.kernel_size;
const int stride = options_.stride;
const int dilation = options_.dilation;
const int padding = options_.padding;
const int output_padding = options_.output_padding;

const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;

auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));
```
Validate output size and output_padding constraints.
seq_out can become <= 0 for invalid parameters (e.g., negative padding/dilation, output_padding ≥ stride). Right now this creates an invalid tensor. Add explicit checks before allocation to fail fast and surface configuration errors. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix
const int kernel_size = options_.kernel_size;
const int stride = options_.stride;
const int dilation = options_.dilation;
const int padding = options_.padding;
const int output_padding = options_.output_padding;
+ if (kernel_size <= 0 || stride <= 0 || dilation <= 0 || padding < 0 || output_padding < 0 || output_padding >= stride) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid ConvTranspose1D options");
+ outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+ return;
+ }
const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;
+ if (seq_out <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1D produces non-positive output length");
+ outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+ return;
+ }
auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.cpp` around lines 72 - 81, The code must
validate computed seq_out and option constraints before allocating the output
tensor: compute seq_out as currently done, then check that seq_out > 0 and that
options_.output_padding < options_.stride and that options_.padding >= 0,
options_.dilation >= 1, options_.kernel_size > 0 and options_.stride > 0; if any
check fails, return/raise a clear configuration error (e.g., throw
std::invalid_argument or use the project’s error/reporting helper) instead of
calling outputs.emplace_back(Tensor::empty(...)); place these checks immediately
after the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
```cpp
struct ConvTranspose1DOpOptions : public BaseOpOptions<ConvTranspose1DOpOptions> {
  int32_t in_channels;
  int32_t out_channels;
  int32_t kernel_size;
  int32_t stride = 1;
  int32_t padding = 0;
  int32_t output_padding = 0;
  int32_t dilation = 1;
  int32_t groups = 1;
  bool bias = true;
};

class ConvTranspose1DOp : public BaseOp {
 public:
  explicit ConvTranspose1DOp(const ConvTranspose1DOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  ParameterFile::ptr_t getParams() override;

  inline Tensor& weight() { return weight_; }

  inline Tensor& bias() { return bias_; }

  inline ConvTranspose1DOpOptions& options() { return options_; }

 protected:
  Tensor weight_;
  Tensor bias_;
  ConvTranspose1DOpOptions options_;
```
Add API docs for ConvTranspose1D options and op.
These are new public types; please add brief comments covering purpose, parameter meaning (channels/stride/padding/output_padding/groups/bias), and invariants so callers know expected constraints. As per coding guidelines, public APIs should include clear docstrings/comments.
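One possible shape for those comments, sketched from the fields shown above; the stated constraints (positive sizes, groups dividing both channel counts, output_padding below the stride) follow the usual ConvTranspose1d conventions and are assumptions to confirm against this implementation:

```cpp
/// Options for a 1D transposed convolution (deconvolution).
struct ConvTranspose1DOpOptions : public BaseOpOptions<ConvTranspose1DOpOptions> {
  int32_t in_channels;         ///< Input channel count; > 0 and divisible by groups.
  int32_t out_channels;        ///< Output channel count; > 0 and divisible by groups.
  int32_t kernel_size;         ///< Kernel length; must be > 0.
  int32_t stride = 1;          ///< Upsampling stride; must be > 0.
  int32_t padding = 0;         ///< Implicit padding removed from both ends of the output; >= 0.
  int32_t output_padding = 0;  ///< Extra length added to one side of the output; conventionally < stride.
  int32_t dilation = 1;        ///< Spacing between kernel elements; must be >= 1.
  int32_t groups = 1;          ///< Blocked connections between input and output channels; must be > 0.
  bool bias = true;            ///< Whether a learnable per-output-channel bias is loaded and added.
};
```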
```cpp
struct TanhOpOptions : public BaseOpOptions<TanhOpOptions> {};

class TanhOp : public BaseOp {
 public:
  explicit TanhOp(const TanhOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  inline TanhOpOptions& options() { return options_; }

 protected:
  TanhOpOptions options_;
```
Add API docs for TanhOp.
Please add brief doc comments describing purpose, inputs/outputs, and any error/shape expectations. As per coding guidelines, public APIs should include clear docstrings/comments.
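A brief sketch of what such comments could say, derived from the declaration above; the input/output contract (one tensor in, one tensor of the same shape and dtype out) is the standard element-wise-activation convention and is stated here as an assumption:

```cpp
/// TanhOp: element-wise hyperbolic tangent.
/// Expects exactly one input tensor of arbitrary shape; produces one output
/// tensor of the same shape and dtype with tanh applied to every element.
/// TanhOpOptions is empty, so load() presumably has no parameters to pull;
/// reshape()/setup() mirror the input shape onto the output.
class TanhOp : public BaseOp { /* ... */ };
```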
Summary by CodeRabbit