feat: Add Support of Qwen2.5Omni Model #612
KKkai0315 wants to merge 15 commits into UbiquitousLearning:main from
Conversation
- …fixed quantization parameters, updated ActivationQDQ to use MovingAverageMinMaxObserver, and adjusted eps values for better precision. Modified Qwen3 model to utilize FixedActivationQDQ for sigmoid output and ensured dtype consistency in attention calculations.
- … debug print statements from Qwen3DecoderLayer
- …ackend in CMake, enhance PTQPass with unsolved tensor value checks, and update quantization specifications in RMSNorm and model file conversion.
- …improved quantization, enhance rotate_half function to utilize observers, and ensure consistent scale and zero_point across concatenated inputs.
- … zeros, ones, specific values, arange, and random fills. Introduce a new fill-inl.hpp file for optimized implementations and update kernel dispatch to include these operations. Enhance CPUFillOp to utilize the new fill functions for better performance and maintainability.
- …d error handling; update default log level to verbose. Add QEmbedding class for quantized embedding operations in PyTorch. Introduce build tasks for Android and x86 QNN AOT SDKs.
- …es; ensure position-independent code for flatbuffers. Enhance context creation with existing context checks and improve weight quantization specifications.
- … input layer normalization handling in Qwen3DecoderLayer. Update weight conversion logic in training script to address model compatibility issues.
📝 Walkthrough

Adds a new Qwen2.5-Omni multimodal example and full model support: build targets and three CLI runners; configuration JSON; audio preprocessing, tokenizer, multimodal model implementation (vision/audio/text), and new ops/layers (ConvTranspose1D, Tanh) with CPU backend implementations and tests.

Changes
Sequence Diagram(s)

sequenceDiagram
actor User
participant CLI as Runner (text/image/audio)
participant Tokenizer as Qwen2_5OmniTokenizer
participant Preproc as Audio/Image Preprocessor
participant Model as Qwen2_5OmniForCausalLM
participant Thinker as Qwen2_5OmniThinker
User->>CLI: provide input (text/image/audio)
CLI->>Tokenizer: load tokenizer & config
CLI->>Preproc: (if media) process file -> features
CLI->>Tokenizer: convertMessage / convertVisionMessage / convertAudioMessage
Tokenizer-->>CLI: token ids + feature tensors
CLI->>Model: forward(input_ids, feature_tensors)
Model->>Thinker: encode modalities, fuse into decoder input
Thinker->>Thinker: vision/audio encoders -> multimodal embeddings
Thinker->>Model: decode and produce logits
loop streaming
CLI->>Tokenizer: detokenize(next_token)
Tokenizer-->>CLI: text chunk
CLI->>User: stream output
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
Actionable comments posted: 8
🤖 Fix all issues with AI agents
In `@examples/qwen2_5omni/audio_infer.cpp`:
- Around line 24-29: The code currently silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is not "v1" or "v2";
instead validate model_version.get() explicitly: examine the string returned by
model_version.get(), set file_version to mllm::ModelFileVersion::kV1 or ::kV2
for "v1" and "v2" respectively, and for any other value print a clear error
mentioning the allowed values and terminate (non-zero exit) to avoid loading the
wrong format; update the block that assigns file_version (and any related usage)
to enforce this validation and fail fast on unsupported values.
- Around line 48-60: The example currently sets audio_path and prompt_text to
empty strings causing processAudioFile()/convertAudioMessage() to get empty
audio via mllm::audio::readWAV("") and fail; restore interactive input by
re-enabling the std::getline calls (or alternatively accept paths via CLI args)
so audio_path and prompt_text are populated before calling
processAudioFile()/convertAudioMessage(); ensure the check for "exit"/"quit"
remains and that prompt_text falls back to a default if empty to avoid empty
prompt usage.
In `@examples/qwen2_5omni/config_qwen2_5omni_7B.json`:
- Around line 5-7: The config contains unused flags enable_audio_output and
enable_talker (alongside model_type "qwen2_5_omni"); update the JSON to avoid
misleading users by either removing the keys enable_audio_output and
enable_talker or setting both to false. Locate the entries for
"enable_audio_output" and "enable_talker" in the config_qwen2_5omni_7B.json and
change their values to false (or delete those lines) so only supported
functionality (the thinker/text output) is advertised.
In `@examples/qwen2_5omni/image_infer.cpp`:
- Around line 24-29: The current logic silently defaults file_version to
mllm::ModelFileVersion::kV1 when model_version.get() is unknown; update the
branch around model_version.get() so you explicitly accept only "v1" and "v2",
set file_version to mllm::ModelFileVersion::kV1 or kV2 accordingly, and
otherwise report an error (include the invalid value and allowed values) and
terminate with a non‑zero exit or throw an exception; reference the
model_version variable and the file_version / mllm::ModelFileVersion symbols
when making this validation change.
- Around line 52-60: Uncomment the two std::getline calls so the program reads
user input instead of using empty strings: after the fmt::print("Image path (or
'exit/quit'): "); call restore std::getline(std::cin, image_path) to populate
image_path (so the exit/quit check works), and after fmt::print("Prompt text:
"); restore std::getline(std::cin, prompt_text) to populate prompt_text; ensure
the variables image_path and prompt_text are used as before and that included
<iostream> usage is consistent with the surrounding code.
In `@examples/qwen2_5omni/text_infer.cpp`:
- Around line 24-29: The code silently defaults to mllm::ModelFileVersion::kV1
when model_version is unknown; change this to validate the input from
model_version.get() and reject unsupported values instead of defaulting. In the
block that sets file_version (referencing model_version.get(), file_version, and
mllm::ModelFileVersion), check for "v1" and "v2" explicitly and if neither
matches, print a clear error mentioning the unsupported model_version and
terminate (e.g., return/exit with non-zero or throw) so the program fails fast
rather than loading the wrong format.
In `@mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp`:
- Around line 942-958: The code leaves attn uninitialized when
key_states.dtype() is neither kFloat32 nor kFloat16; add a fallback else branch
in the attention block (around the key_states.dtype() checks) that handles other
dtypes by casting query_states and key_states to kFloat32, computing attn =
nn::functional::matmul(...)* (1.f / sqrtf(head_dim_)), applying mask_ and
softmax_, and then casting attn back to the original key_states.dtype() if
needed (or alternatively throw a clear runtime_error mentioning unsupported
dtype) so that attn is always initialized before using
nn::functional::matmul(attn, value_states).
In `@mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp`:
- Around line 303-316: The code computes img_token_nums from grid_thw and then
inserts tokens without validating it; add a check (similar to
convertAudioMessage's audio_token_nums > 0) after computing img_token_nums and
before ids.insert to ensure img_token_nums > 0, and on failure call
MLLM_ERROR_EXIT with a clear message (e.g., "Invalid image token count") so you
don't call ids.insert(img_token_nums - 1, ...) with zero/negative counts; locate
symbols grid_thw, img_token_nums, image_token_id, ids.insert and the surrounding
convertVisionMessage logic to make the change.
🧹 Nitpick comments (7)
mllm/models/qwen2_5omni/tokenization_qwen2_5omni.hpp (2)
98-114: Redundant condition in whitespace handling logic. After the while loop on line 100 consumes all whitespace, `str[pos]` is guaranteed to be a non-whitespace character (when `pos < str.size()`). Therefore, the condition `std::iswspace(str[pos])` on line 101 is never true, making the second half of the condition unreachable.

This also makes lines 109-114 redundant, since the only way to reach them is if the first block restored `pos` (line 105), but that only happens when the condition on line 101 is false, which it always is unless we're at the end of the string. Consider clarifying the intended behavior or simplifying this logic.
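A minimal sketch of the simplified logic this nitpick suggests, assuming the goal is simply to skip leading whitespace; the helper name is illustrative and not a member of the actual tokenizer:

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical helper: advance `pos` past any leading whitespace in `str`.
// After the loop, either pos == str.size() or str[pos] is non-whitespace,
// so no second iswspace() check is needed.
inline size_t skipWhitespace(const std::wstring& str, size_t pos) {
  while (pos < str.size() && std::iswspace(str[pos])) { ++pos; }
  return pos;
}
```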
189-193: Clarify the technical reason for the max_pixels constraint. The comment "interestingly, the answer went bad when setting max_pixels higher" is informal and doesn't explain the root cause. Consider documenting the technical reason (e.g., memory constraints, model training resolution, numerical stability) to help future maintainers understand why this limit exists.
mllm/models/qwen2_5omni/configuration_qwen2_5omni.hpp (2)
19-37: Consider adding existence checks or documenting required fields. The config parser accesses required fields like `text_config`, `hidden_size`, `intermediate_size`, etc. directly without `.contains()` checks. If the config file is missing these required fields, the parser will throw an exception. This fail-fast behavior may be intentional, but consider either:
- Adding explicit existence checks with descriptive error messages
- Documenting the required config structure in a header comment
80-118: Consider extracting common parsing logic to reduce duplication. The root-level config parsing (lines 80-118) largely duplicates the thinker_config parsing (lines 23-79). This is understandable for supporting two config formats, but could be refactored into helper methods if the config format evolves.
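A rough sketch of how both nitpicks could be addressed together: a shared helper that checks required keys before reading them, reusable by the root-level and thinker_config parsing paths. The helper name and the use of nlohmann::json are assumptions for illustration, not the project's actual API:

```cpp
#include <nlohmann/json.hpp>
#include <stdexcept>
#include <string>

// Hypothetical helper: fetch a required field with a descriptive error
// instead of an opaque exception from operator[] / at().
inline const nlohmann::json& requireField(const nlohmann::json& cfg, const std::string& key) {
  if (!cfg.contains(key)) {
    throw std::runtime_error("Qwen2.5-Omni config is missing required field: " + key);
  }
  return cfg.at(key);
}

// Both the root-level and thinker_config paths could then share one routine,
// e.g. parseTextConfig(requireField(cfg, "text_config")).
```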
mllm/models/qwen2_5omni/modeling_qwen2_5omni.hpp (2)
958-961: Consider passing layer_idx as a constructor parameter instead of a public member.

`layer_idx_` is a public member that's set externally (line 1003). This breaks encapsulation and could lead to bugs if the member is not set before use.

🛠️ Suggested refactor
```diff
 class Qwen2_5OmniAttention final : public nn::Module {
+  int layer_idx_;
   // ... other members ...
  public:
-  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+  Qwen2_5OmniAttention(const std::string& name, const Qwen2_5OmniConfig& cfg, int layer_idx)
+      : nn::Module(name), layer_idx_(layer_idx) {
     // ...
   }
-
-  int layer_idx_;
 };
```

Then update Qwen2_5OmniDecoder and the module registration accordingly.
853-858: Add the `explicit` keyword for consistency with other constructors.

`Qwen2_5OmniMLP`, `Qwen2_5OmniAttention` (line 889), `Qwen2_5OmniDecoder` (line 972), and `Qwen2_5OmniText` (line 1001) are missing the `explicit` keyword that other similar classes use (e.g., `Qwen2_5OmniVisionMLP` at line 421).

🛠️ Suggested fix

```diff
- Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
+ explicit Qwen2_5OmniMLP(const std::string& name, const Qwen2_5OmniConfig& cfg) : nn::Module(name) {
```

Apply similar changes to `Qwen2_5OmniAttention`, `Qwen2_5OmniDecoder`, and `Qwen2_5OmniText`.

mllm/models/qwen2_5omni/audio_preprocessor_qwen2_5omni.hpp (1)
43-55: Add an M_PI definition guard for portability.

`M_PI` is not part of the C++ standard and is only a POSIX extension. While commonly available on Unix-like systems, it may be undefined on other platforms (notably Windows with MSVC without `_USE_MATH_DEFINES`). The codebase already uses this pattern in `third_party/wenet_audio/fft.h` and `mllm/backends/cpu/ops/STFTOp.cpp`.

🛠️ Suggested fix

```diff
+#ifndef M_PI
+#define M_PI 3.14159265358979323846
+#endif
+
 inline Tensor create_hann_window(int32_t window_length, bool periodic = true) {
```
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail in confusing ways. Consider rejecting unsupported values explicitly.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
```cpp
std::string audio_path;
std::string prompt_text;

fmt::print("Audio path (or 'exit/quit'): ");
//std::getline(std::cin, audio_path);
//if (audio_path == "exit" || audio_path == "quit") { return 0; }
audio_path = "";

fmt::print("Prompt text: ");
//std::getline(std::cin, prompt_text);
//if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
prompt_text = "";
```
Enable interactive input for the example to work as intended.
The example currently has input hardcoded to empty strings, causing convertAudioMessage() to fail. When an empty audio_path is passed to processAudioFile(), it attempts to read from an empty path via mllm::audio::readWAV(""), which returns empty audio data. This triggers the error check if (audio_result.input_features.isNil() || audio_result.feature_length <= 0) and exits with MLLM_ERROR_EXIT(ExitCode::kIOError, "Failed to extract audio features for Qwen2.5-Omni.").
Restore the commented std::getline() calls to enable interactive input, or wire the paths to CLI arguments.
Suggested fix (restore interactive input)
- fmt::print("Audio path (or 'exit/quit'): ");
- //std::getline(std::cin, audio_path);
- //if (audio_path == "exit" || audio_path == "quit") { return 0; }
- audio_path = "";
-
- fmt::print("Prompt text: ");
- //std::getline(std::cin, prompt_text);
- //if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
- prompt_text = "";
+ fmt::print("Audio path (or 'exit/quit'): ");
+ std::getline(std::cin, audio_path);
+ if (audio_path == "exit" || audio_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
+ if (prompt_text.empty()) { prompt_text = "Please describe the audio."; }
| "enable_audio_output": true, | ||
| "enable_talker": true, | ||
| "model_type": "qwen2_5_omni", |
Remove or set enable_talker and enable_audio_output to false—these flags are currently unused.
These flags are defined in the config but never parsed or checked in the model code. Setting them to true misleads users about available functionality since only the thinker (text output) is currently loaded and used. To avoid confusion, either remove these flags or set them to false until talker and audio output support are actually implemented.
Suggested change
- "enable_audio_output": true,
- "enable_talker": true,
+ "enable_audio_output": false,
+ "enable_talker": false,📝 Committable suggestion
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
| fmt::print("Image path (or 'exit/quit'): "); | ||
| image_path = ""; | ||
| //std::getline(std::cin, image_path); | ||
| if (image_path == "exit" || image_path == "quit") { return 0; } | ||
|
|
||
| fmt::print("Prompt text: "); | ||
| prompt_text = ""; | ||
| //std::getline(std::cin, prompt_text); | ||
|
|
Uncomment the std::getline() calls to restore interactive input.
The example prints prompts for user input but then hardcodes empty strings and has the std::getline() calls commented out. This makes the example non-functional since it will always attempt to process empty image paths and prompts. Restore the interactive input handling by uncommenting lines 54 and 59 (the std::getline() calls).
Suggested fix
- fmt::print("Image path (or 'exit/quit'): ");
- image_path = "";
- //std::getline(std::cin, image_path);
- if (image_path == "exit" || image_path == "quit") { return 0; }
-
- fmt::print("Prompt text: ");
- prompt_text = "";
- //std::getline(std::cin, prompt_text);
+ fmt::print("Image path (or 'exit/quit'): ");
+ std::getline(std::cin, image_path);
+ if (image_path == "exit" || image_path == "quit") { return 0; }
+
+ fmt::print("Prompt text: ");
+ std::getline(std::cin, prompt_text);
```cpp
mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
if (model_version.get() == "v1") {
  file_version = mllm::ModelFileVersion::kV1;
} else if (model_version.get() == "v2") {
  file_version = mllm::ModelFileVersion::kV2;
}
```
Validate --model_version instead of silently falling back to v1.
An unknown value currently defaults to v1, which can load the wrong format and fail unexpectedly. Consider rejecting unsupported values.
Suggested fix
- mllm::ModelFileVersion file_version = mllm::ModelFileVersion::kV1;
- if (model_version.get() == "v1") {
- file_version = mllm::ModelFileVersion::kV1;
- } else if (model_version.get() == "v2") {
- file_version = mllm::ModelFileVersion::kV2;
- }
+ mllm::ModelFileVersion file_version;
+ if (model_version.get() == "v1") {
+ file_version = mllm::ModelFileVersion::kV1;
+ } else if (model_version.get() == "v2") {
+ file_version = mllm::ModelFileVersion::kV2;
+ } else {
+ fmt::print("Unsupported --model_version: {}\n", model_version.get());
+ return 1;
+ }
```cpp
  Tensor attn;
  if (key_states.dtype() == kFloat32) {
    attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
    attn = mask_(attn);
    attn = softmax_(attn);
  } else if (key_states.dtype() == kFloat16) {
    attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
    attn = mask_(attn);
    attn = softmax_(attn);
    attn = attn.to(kFloat16);
  }

  auto output = nn::functional::matmul(attn, value_states);
  output = output.transpose(1, 2).view({B, S, num_attention_heads_ * head_dim_});
  output = o_proj_(output);
  return {output};
}
```
Missing fallback for unsupported dtype in attention computation.
If key_states.dtype() is neither kFloat32 nor kFloat16, the attn tensor remains uninitialized (line 942 declares it as Tensor attn;), and the subsequent nn::functional::matmul(attn, value_states) on line 954 would cause undefined behavior.
🛠️ Suggested fix
if (key_states.dtype() == kFloat32) {
attn = nn::functional::matmul(query_states, key_states, false, true) * (1.f / sqrtf(head_dim_));
attn = mask_(attn);
attn = softmax_(attn);
} else if (key_states.dtype() == kFloat16) {
attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true) * (1.f / sqrtf(head_dim_));
attn = mask_(attn);
attn = softmax_(attn);
attn = attn.to(kFloat16);
+ } else {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "Unsupported dtype for attention computation.");
}
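As an alternative to exiting, the AI-agent prompt above describes a cast-based fallback: compute in float32 for any other dtype and cast back. A minimal sketch of that branch, reusing only the calls already shown in the excerpt (`to`, `dtype`, `matmul`, `mask_`, `softmax_`); whether to prefer this over failing fast is a design choice left to the authors:

```cpp
} else {
  // Assumed fallback: compute attention in fp32, then cast back to the input dtype.
  const auto orig_dtype = key_states.dtype();
  attn = nn::functional::matmul(query_states.to(kFloat32), key_states.to(kFloat32), false, true)
         * (1.f / sqrtf(head_dim_));
  attn = mask_(attn);
  attn = softmax_(attn);
  attn = attn.to(orig_dtype);
}
```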
```cpp
auto grid_t = grid_thw.ptr<int32_t>()[0];
auto grid_h = grid_thw.ptr<int32_t>()[1];
auto grid_w = grid_thw.ptr<int32_t>()[2];
int32_t img_token_nums = grid_t * grid_h * grid_w;
img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);

auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
{
  auto it = std::find(ids.begin(), ids.end(), image_token_id);
  if (it == ids.end()) {
    MLLM_ERROR_EXIT(ExitCode::kCoreError, "Missing <|IMAGE|> token in Qwen2.5-Omni prompt template.");
  }
  ids.insert(it + 1, img_token_nums - 1, image_token_id);
}
```
Add validation for image token count before insertion.
Unlike convertAudioMessage which validates audio_token_nums > 0 (lines 347-349), convertVisionMessage doesn't validate img_token_nums before the ids.insert() call. If grid dimensions somehow result in zero or negative values, inserting img_token_nums - 1 tokens would cause undefined behavior.
🛠️ Suggested fix
int32_t img_token_nums = grid_t * grid_h * grid_w;
img_token_nums /= (spatial_merge_size_ * spatial_merge_size_);
+ if (img_token_nums <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid image token count for Qwen2.5-Omni.");
+ }
 auto image_token_id = bpe_._lookup_vocab(L"<|IMAGE|>");
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@mllm/backends/cpu/ops/ConvTranspose1DOp.hpp`:
- Around line 11-22: Add concise doc comments above the CPUConvTranspose1DOp
class and CPUConvTranspose1DOpFactory to document their roles and expected
tensor shapes/semantics: describe that CPUConvTranspose1DOp implements the CPU
backend of aops::ConvTranspose1DOp, list expected input tensors (e.g., input,
weight, optional bias), expected output shape behavior (how output length is
computed from stride/padding/dilation/output_padding), and any preconditions
(memory layout, dtype). Also add a brief comment for CPUConvTranspose1DOpFactory
explaining it constructs CPUConvTranspose1DOp instances from
aops::ConvTranspose1DOpOptions so readers know this factory ties the OpOptions
to the CPU implementation.
In `@mllm/core/aops/ConvTranspose1DOp.cpp`:
- Around line 17-23: Validate options_.groups before any division/modulo in
ConvTranspose1DOp::load and ConvTranspose1DOp::reshape by checking it is > 0 and
that it cleanly divides the relevant channel counts (e.g., options_.out_channels
% options_.groups == 0 and any other channel dimension used with groups is
divisible). If a check fails, return/raise a clear error (e.g., throw
std::invalid_argument or use the existing error/reporting mechanism) with a
message referencing the op name and invalid group value. Update the logic around
weight_ = weight_.view(...) in load() and the corresponding shape calculations
in reshape() to assume groups is validated so the divisions/modulo are safe.
Ensure tests or callers expecting validation get a deterministic error rather
than UB/crash.
- Around line 72-81: The code must validate computed seq_out and option
constraints before allocating the output tensor: compute seq_out as currently
done, then check that seq_out > 0 and that options_.output_padding <
options_.stride and that options_.padding >= 0, options_.dilation >= 1,
options_.kernel_size > 0 and options_.stride > 0; if any check fails,
return/raise a clear configuration error (e.g., throw std::invalid_argument or
use the project’s error/reporting helper) instead of calling
outputs.emplace_back(Tensor::empty(...)); place these checks immediately after
the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
In `@mllm/core/aops/ConvTranspose1DOp.hpp`:
- Around line 12-49: Add API documentation comments for the new public types:
place a brief docstring above the struct ConvTranspose1DOpOptions describing its
purpose (options for 1D transposed convolution), and document each field
(in_channels, out_channels, kernel_size, stride, padding, output_padding,
dilation, groups, bias) including units, valid ranges/constraints (e.g.,
positive ints, output_padding < max(stride, dilation), groups divides
in_channels/out_channels, kernel_size > 0), and default meanings; also add a
short doc comment above the ConvTranspose1DOp class describing its role,
lifecycle methods (load, trace, forward, reshape, setup, getParams), and
invariants about weight_ and bias_ (shapes derived from options, bias present
only if options.bias is true). Ensure comments reference the exact symbols
ConvTranspose1DOpOptions, ConvTranspose1DOp, weight_, bias_ so callers can
locate the documented behavior.
In `@mllm/core/aops/TanhOp.hpp`:
- Around line 11-30: Add API doc comments for TanhOp and TanhOpOptions: place a
brief class-level comment above struct TanhOpOptions and class TanhOp describing
the operation's purpose (element-wise hyperbolic tangent), expected inputs
(single tensor of arbitrary shape), outputs (single tensor with same shape and
dtype), and any error/shape expectations (e.g., require one input tensor,
matching output count; throw or assert on incorrect counts). Also add short
comments on the public methods load, trace, forward, reshape, setup and the
options() accessor indicating their roles (e.g., load: load parameters from
ParameterFile; forward: compute element-wise tanh; reshape/setup: validate and
set output shapes; trace: record shapes/metadata) and mention any preconditions
or postconditions (shape invariants), following existing doc style in the
codebase.
🧹 Nitpick comments (4)
mllm/backends/cpu/ops/ConvTranspose1DOp.cpp (2)
48-48: Unnecessary zero-initialization of output. The output buffer is zero-filled here, but the subsequent loop at line 86 directly assigns `output_ptr[output_idx] = sum;` rather than accumulating. Each output element is computed completely before being written, making this zero-fill redundant overhead.

♻️ Suggested fix

```diff
- std::fill_n(output_ptr, output.numel(), 0.0f);
+ // No zero-fill needed: each output element is fully computed before assignment
```
52-90: Redundant dtype switch after assertion. The `switch (output.dtype())` at line 52 is redundant since `MLLM_RT_ASSERT_EQ(output.dtype(), kFloat32)` at line 39 already guarantees the dtype. Consider removing the switch or, if future dtype support is planned, removing the assertion.

♻️ Suggested simplification

```diff
- switch (output.dtype()) {
-   case kFloat32:
-     MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
+ MLLM_CONDITIONAL_PARALLEL_FOR(options_.getThreads() > 1, 4, idx, 0, total_iterations, 1, {
       // ... loop body unchanged ...
-     });
-     break;
-   default: NYI("ConvTranspose1D: unsupported data type");
- }
+ });
```

tests/cpu/ConvTranspose1DKernelTest.hpp (1)
114-133: Redundant variable extraction in failure path. Lines 117-128 extract all config values into local variables, but these are only used in the `print()` call. Consider simplifying by passing the config directly or just printing the failing config map.

♻️ Simplified failure handling

```diff
 bool testConvTranspose1D(const std::vector<std::unordered_map<std::string, int32_t>>& cfgs) {
   for (auto& cfg : cfgs) {
     if (!testConvTranspose1DOnce(cfg)) {
-      auto batch = cfg.at("batch");
-      auto in_channel = cfg.at("in_channel");
-      auto out_channel = cfg.at("out_channel");
-      auto sequence = cfg.at("sequence");
-      auto kernel_size = cfg.at("kernel_size");
-      auto stride = cfg.at("stride");
-      auto padding = cfg.at("padding");
-      auto output_padding = cfg.at("output_padding");
-      auto dilation = cfg.at("dilation");
-      auto groups = cfg.at("groups");
-      auto bias = cfg.at("bias");
-      print(batch, in_channel, out_channel, sequence, kernel_size, stride, padding, output_padding, dilation, groups, bias);
+      // Print failing configuration for debugging
+      for (const auto& [key, value] : cfg) {
+        print(key, "=", value);
+      }
       return false;
     }
   }
   return true;
 }
```

mllm/nn/layers/Tanh.hpp (1)
11-18: Add a brief class doc comment. Public APIs should have a short description of purpose/inputs/outputs. As per coding guidelines, please add a brief doc comment for the Tanh layer.
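A minimal sketch of what such a comment might look like; the base-class name is a guess from the surrounding excerpts, not the file's actual contents:

```cpp
/// Tanh layer: applies the element-wise hyperbolic tangent activation.
/// Input:  a single tensor of any shape and floating-point dtype.
/// Output: a tensor with the same shape and dtype, where each element x
///         is replaced by tanh(x).
class Tanh final : public Layer {
  // ...
};
```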
```cpp
class CPUConvTranspose1DOp final : public aops::ConvTranspose1DOp {
 public:
  explicit CPUConvTranspose1DOp(const aops::ConvTranspose1DOpOptions& options);

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;
};

class CPUConvTranspose1DOpFactory : public TypedOpFactory<OpTypes::kConvTranspose1D, aops::ConvTranspose1DOpOptions> {
 public:
  std::shared_ptr<BaseOp> createOpImpl(const aops::ConvTranspose1DOpOptions& options) override {
    return std::make_shared<CPUConvTranspose1DOp>(options);
  }
```
Add brief docs for CPU op/factory.
Please add short comments describing the CPU op’s role and any expectations about inputs/outputs to keep backend APIs self-documenting. As per coding guidelines, public APIs should include clear docstrings/comments.
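A rough sketch of the kind of comments being asked for, based only on the declarations shown above; the shape and layout details are assumptions that would need to be checked against the implementation:

```cpp
/// CPU backend implementation of aops::ConvTranspose1DOp (1D transposed convolution).
/// forward() presumably expects one 3D input tensor shaped [batch, in_channels, sequence]
/// and writes one output tensor whose length is derived from stride, padding,
/// dilation and output_padding by the base op's reshape().
class CPUConvTranspose1DOp final : public aops::ConvTranspose1DOp { /* ... */ };

/// Factory tying aops::ConvTranspose1DOpOptions to the CPU implementation:
/// createOpImpl() constructs a CPUConvTranspose1DOp for OpTypes::kConvTranspose1D.
class CPUConvTranspose1DOpFactory; // see the declaration above
```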
```cpp
void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
  switch (ploader->version()) {
    case ModelFileVersion::kV1: {
      weight_ = ploader->pull(getName() + ".weight");
      if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
      weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
      if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
```
Guard against invalid groups before division/modulo.
options_.groups is used in division/modulo in both load() and reshape(). If it’s 0 (or incompatible with channels), this will crash or corrupt shapes. Add explicit validation before use. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix
void ConvTranspose1DOp::load(const ParameterFile::ptr_t& ploader) {
+ if (options_.groups <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+ }
+ MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);
switch (ploader->version()) {
case ModelFileVersion::kV1: {
weight_ = ploader->pull(getName() + ".weight");
if (options_.bias) { bias_ = ploader->pull(getName() + ".bias"); }
weight_ = weight_.view({options_.in_channels, options_.out_channels / options_.groups, options_.kernel_size});
if (options_.bias) { bias_ = bias_.view({options_.out_channels}); }
break;
}
@@
void ConvTranspose1DOp::reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) {
const auto& i = inputs[0];
const auto& ishape = i.shape();
if (ishape.size() != 3) {
MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp expects 3D input, got {} D", ishape.size());
outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
return;
}
+ if (options_.groups <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1DOp groups must be > 0");
+ outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+ return;
+ }
const int batch = ishape[0];
const int in_channels = ishape[1];
const int sequence = ishape[2];
MLLM_RT_ASSERT_EQ(in_channels, options_.in_channels);
MLLM_RT_ASSERT_EQ(in_channels % options_.groups, 0);
  MLLM_RT_ASSERT_EQ(options_.out_channels % options_.groups, 0);

Also applies to: 68-71
```cpp
const int kernel_size = options_.kernel_size;
const int stride = options_.stride;
const int dilation = options_.dilation;
const int padding = options_.padding;
const int output_padding = options_.output_padding;

const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;

auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));
```
Validate output size and output_padding constraints.
seq_out can become <= 0 for invalid parameters (e.g., negative padding/dilation, output_padding ≥ stride). Right now this creates an invalid tensor. Add explicit checks before allocation to fail fast and surface configuration errors. As per coding guidelines, validate inputs for public APIs.
🔧 Suggested fix
const int kernel_size = options_.kernel_size;
const int stride = options_.stride;
const int dilation = options_.dilation;
const int padding = options_.padding;
const int output_padding = options_.output_padding;
+ if (kernel_size <= 0 || stride <= 0 || dilation <= 0 || padding < 0 || output_padding < 0 || output_padding >= stride) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "Invalid ConvTranspose1D options");
+ outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+ return;
+ }
const int seq_out = (sequence - 1) * stride - 2 * padding + dilation * (kernel_size - 1) + output_padding + 1;
+ if (seq_out <= 0) {
+ MLLM_ERROR_EXIT(ExitCode::kCoreError, "ConvTranspose1D produces non-positive output length");
+ outputs.emplace_back(Tensor::empty(i.shape(), i.dtype(), i.device()));
+ return;
+ }
auto new_shape = std::vector<int32_t>{batch, options_.out_channels, seq_out};
outputs.emplace_back(Tensor::empty(new_shape, i.dtype(), i.device()));🤖 Prompt for AI Agents
In `@mllm/core/aops/ConvTranspose1DOp.cpp` around lines 72 - 81, The code must
validate computed seq_out and option constraints before allocating the output
tensor: compute seq_out as currently done, then check that seq_out > 0 and that
options_.output_padding < options_.stride and that options_.padding >= 0,
options_.dilation >= 1, options_.kernel_size > 0 and options_.stride > 0; if any
check fails, return/raise a clear configuration error (e.g., throw
std::invalid_argument or use the project’s error/reporting helper) instead of
calling outputs.emplace_back(Tensor::empty(...)); place these checks immediately
after the seq_out calculation and before the outputs.emplace_back line in
ConvTranspose1DOp (referencing seq_out, options_.output_padding,
options_.stride, options_.padding, options_.dilation, options_.kernel_size).
```cpp
struct ConvTranspose1DOpOptions : public BaseOpOptions<ConvTranspose1DOpOptions> {
  int32_t in_channels;
  int32_t out_channels;
  int32_t kernel_size;
  int32_t stride = 1;
  int32_t padding = 0;
  int32_t output_padding = 0;
  int32_t dilation = 1;
  int32_t groups = 1;
  bool bias = true;
};

class ConvTranspose1DOp : public BaseOp {
 public:
  explicit ConvTranspose1DOp(const ConvTranspose1DOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  ParameterFile::ptr_t getParams() override;

  inline Tensor& weight() { return weight_; }

  inline Tensor& bias() { return bias_; }

  inline ConvTranspose1DOpOptions& options() { return options_; }

 protected:
  Tensor weight_;
  Tensor bias_;
  ConvTranspose1DOpOptions options_;
```
Add API docs for ConvTranspose1D options and op.
These are new public types; please add brief comments covering purpose, parameter meaning (channels/stride/padding/output_padding/groups/bias), and invariants so callers know expected constraints. As per coding guidelines, public APIs should include clear docstrings/comments.
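One possible shape for those comments, sketched from the fields shown above; the stated constraints (positive sizes, groups dividing both channel counts, output_padding below the stride) follow the usual ConvTranspose1d conventions and are assumptions to confirm against this implementation:

```cpp
/// Options for a 1D transposed convolution (deconvolution).
struct ConvTranspose1DOpOptions : public BaseOpOptions<ConvTranspose1DOpOptions> {
  int32_t in_channels;         ///< Input channel count; > 0 and divisible by groups.
  int32_t out_channels;        ///< Output channel count; > 0 and divisible by groups.
  int32_t kernel_size;         ///< Kernel length; must be > 0.
  int32_t stride = 1;          ///< Upsampling stride; must be > 0.
  int32_t padding = 0;         ///< Implicit padding removed from both ends of the output; >= 0.
  int32_t output_padding = 0;  ///< Extra length added to one side of the output; conventionally < stride.
  int32_t dilation = 1;        ///< Spacing between kernel elements; must be >= 1.
  int32_t groups = 1;          ///< Blocked connections between input and output channels; must be > 0.
  bool bias = true;            ///< Whether a learnable per-output-channel bias is loaded and added.
};
```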
```cpp
struct TanhOpOptions : public BaseOpOptions<TanhOpOptions> {};

class TanhOp : public BaseOp {
 public:
  explicit TanhOp(const TanhOpOptions& options);

  void load(const ParameterFile::ptr_t& ploader) override;

  void trace(void* trace_context, const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void forward(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void reshape(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  void setup(const std::vector<Tensor>& inputs, std::vector<Tensor>& outputs) override;

  inline TanhOpOptions& options() { return options_; }

 protected:
  TanhOpOptions options_;
```
Add API docs for TanhOp.
Please add brief doc comments describing purpose, inputs/outputs, and any error/shape expectations. As per coding guidelines, public APIs should include clear docstrings/comments.
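A brief sketch of what such comments could say, derived from the declaration above; the input/output contract (one tensor in, one tensor of the same shape and dtype out) is the standard element-wise-activation convention and is stated here as an assumption:

```cpp
/// TanhOp: element-wise hyperbolic tangent.
/// Expects exactly one input tensor of arbitrary shape; produces one output
/// tensor of the same shape and dtype with tanh applied to every element.
/// TanhOpOptions is empty, so load() presumably has no parameters to pull;
/// reshape()/setup() mirror the input shape onto the output.
class TanhOp : public BaseOp { /* ... */ };
```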
Summary by CodeRabbit