
Sync with Microsoft ONNX Runtime - 23032026 #986

Merged
ankitm3k merged 41 commits into ovep-develop from
sync_msft_23032026
Mar 23, 2026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

patryk-kaiser-ARM and others added 30 commits March 17, 2026 10:13
…icrosoft#26773)

**Description**
This PR integrates Arm® KleidiAI™ SME2 BF16 kernel through MLAS SBGEMM
path.

Rework of microsoft#24346

**Motivation and Context**
This kernel provides performance improvements on SME-enabled devices.

---------

Signed-off-by: Patryk Kaiser <patryk.kaiser@arm.com>
Upgrading the dependency to resolve CVE-2026-27904, which is raising
component governance issues in internal Microsoft builds of ORT.

Co-authored-by: Kevin Taha <kevintaha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#27694)

### Description

Fix the conditions guarding the following definitions:
- DAWN_ENABLE_VULKAN
- DAWN_ENABLE_D3D12
…#27688)

This deletes three per-head-size .cu files and merges their content into
a single file to avoid cross-file dependencies during CUDA compilation.

Currently, the masked_multihead_attention_kernel template is implemented
in decoder_masked_multihead_attention_impl.cu. The other three .cu files
use the masked_multihead_attention_kernel template but do not include
the implementation. That causes problems when they are built in the CUDA
plugin EP.
microsoft#27671)

## Description

This PR fixes longstanding MLAS issues that were causing
`NhwcTransformerTests.*` and `QDQTransformerTests.*` failures in
quantized convolution paths (see
microsoft#27670). The failures
were not in the graph transformers themselves; they came from incorrect
qgemm dispatch selection and broken backend kernel behavior in specific
AVX2-VNNI and AMX paths.

The fix removes incorrect `U8U8` dispatch upgrades, avoids a broken
AVX2-VNNI row-panel fallback, and corrects the AMX `U8S8` 32-row kernel
path. It also adds MLAS regression coverage for the conv-shaped qgemm
dimensions that exposed the problems.

## Summary of Changes

### Dispatch Selection Fixes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/platform.cpp` | Remove three incorrect assignments that upgraded `GemmU8U8Dispatch` to `U8S8` dispatch objects in the AVXVNNI, AVX512VNNI, and AMX feature paths. |

### AVX2-VNNI Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_avx2.cpp` | Reduce `StrideM` from `6` to `4` for the `U8U8`, `S8S8`, and `S8U8` AVX2-VNNI qgemm dispatch objects so they never enter the legacy `>4` row fallback path. |

### AMX Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_amx.cpp` | Replace the broken pipelined `CountM >= 32` `U8S8` AMX fast path with the same per-K tile update pattern already used by the working smaller-row path. |

### Regression Coverage

| File | Change |
|------|--------|
| `onnxruntime/test/mlas/unittest/test_qgemm_fixture.h` | Add MLAS qgemm regression cases for conv-like shapes `6x30x207` and `169x30x207` in packed/non-packed and int32 or fp32 variants. |

## Root Cause

There were three separate MLAS correctness issues:

1. `platform.cpp` was incorrectly overwriting `GemmU8U8Dispatch` with
`U8S8` dispatch objects when newer CPU features were detected. That
caused `U8U8` conv workloads to run through the wrong dispatch path.
2. The AVX2-VNNI qgemm dispatch objects advertised an `M` stride of `6`,
but the assembly kernel only handled VNNI packing safely up to 4 rows.
For 5- or 6-row panels it fell back to an older AVX2 path with
incompatible packing and sign assumptions.
3. The AMX `U8S8` qgemm kernel had a bug in its `CountM >= 32` fast
path. The smaller-row AMX path was correct, but the 32-row pipelined
update logic produced wrong accumulators for conv-shaped workloads and
caused the remaining QDQ/NHWC failures on AMX-capable hosts.
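
The stride interaction in issue 2 can be illustrated with a small sketch (a hypothetical helper, not the MLAS code): with `StrideM = 6`, a 6-row job is handed to the kernel as a single 6-row panel and falls into the legacy `>4`-row fallback; with `StrideM = 4`, it is split into panels of at most 4 rows and stays on the safe path.

```python
def split_rows(m, stride_m):
    """Split an M-row GEMM job into row panels of at most stride_m rows."""
    panels = []
    row = 0
    while row < m:
        panels.append(min(stride_m, m - row))
        row += panels[-1]
    return panels

assert split_rows(6, 6) == [6]     # one 6-row panel -> legacy >4-row fallback
assert split_rows(6, 4) == [4, 2]  # panels of <= 4 rows avoid the fallback
```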

## Why This Fix

- The `platform.cpp` cleanup restores the intended `U8U8` dispatch
selection on feature-rich x86 hosts.
- The AVX2-VNNI stride change is a targeted mitigation that avoids the
known-bad legacy fallback until that assembly path is corrected.
- The AMX kernel change keeps the AMX `U8S8` dispatch enabled, but
replaces the broken 32-row implementation with a proven update pattern
that matches the working smaller-row path.
- The new MLAS regression tests cover the exact conv-derived qgemm
shapes that exposed the bug, so future dispatch or kernel changes will
fail at the MLAS layer before surfacing as transformer test regressions.

## Testing

- `cd build/cuda/Release && ./onnxruntime_mlas_test
--gtest_filter='QGemmU8S8_*169xN30xK207*:*QGemmU8S8_*6xN30xK207*'`
- `cd build/cuda/Release && ./onnxruntime_test_all
--gtest_filter='NhwcTransformerTests.*:QDQTransformerTests.*'`
- Verified that the filtered transformer suite passes with AMX `U8S8`
dispatch enabled.

## Motivation and Context

These test failures had been present for a long time and were initially
attributed to transformer rewrites because they surfaced in NHWC and QDQ
test suites. Investigation showed that the optimized graphs were
structurally correct and that the failures came from lower-level MLAS
qgemm execution instead. Fixing the behavior in MLAS is the right layer
because it restores correctness for both direct qgemm coverage and
higher-level quantized conv paths.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes
- [x] CI passes
## Description

This PR fixes clang-specific build failures that show up in both the
standalone clang build and the CUDA clang build. It keeps the
build-system changes targeted, prefers source fixes where the warnings
indicate real type or declaration issues, and avoids broader warning
suppression than necessary for the CUDA provider target.

## Summary of Changes

### Build System

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Stop forwarding `-Wshorten-64-to-32` through CUDA host compilation where the GNU host compiler does not recognize it. |
| `cmake/onnxruntime_providers_cuda.cmake` | Add targeted clang `-Wno-error` handling for warning classes that are currently triggered by CUDA provider code and third-party CUDA headers under clang. |

### CPU / Common clang fixes

| File | Change |
|------|--------|
| `onnxruntime/core/common/cpuid_info.cc` | Replace the clang-incompatible `__builtin_cpu_supports("waitpkg")` path with the CPUID-bit check for TPAUSE detection. |
| `onnxruntime/test/framework/allocation_planner_test.cc` | Refactor `typeid` assertions to avoid clang's potentially-evaluated-expression warning while keeping test coverage unchanged. |

### CUDA provider and contrib fixes

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/utils/dump_cuda_tensor.h` | Mark the `IConsoleDumper` overrides explicitly while leaving CUDA-only overloads unchanged. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Use `template` on the dependent `GetAttrOrDefault` call so clang parses it correctly. |
| `onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc` | Make narrowing conversions to flash-attention parameter fields explicit. |
| `onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc` | Make the `nbits_` conversion explicit when calling the CUDA helper. |
| `onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc` | Restrict the GCC-only warning pragma so clang does not treat it as an unknown warning option. |
| `onnxruntime/contrib_ops/cuda/transformers/generation_device_helper.cc` | Fix explicit state-field assignments to use the actual `int` field type. |
| `onnxruntime/core/providers/cuda/cuda_mempool_arena.h` | Remove an unused private field that clang flagged in the CUDA provider build. |

## Testing

Tested CPU and CUDA 12.8 builds in Azure Linux with
- clang 18.1.8
- gcc 13.2
- cmake 4.2.3

Example for CPU build:
```
export CC=clang
export CXX=clang++
bash build.sh --config RelWithDebInfo --parallel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON
```
## Motivation and Context

Clang is stricter than GCC/MSVC in a few areas that affect this tree:
CUDA host flag forwarding, explicit narrowing, dependent template
parsing, warnings emitted from third-party CUDA headers, and RTTI/typeid
expressions in tests. The goal here is to keep the staged fix minimal
and maintainable by correcting real source issues where practical and
confining warning downgrades to the CUDA provider target where
third-party header noise is currently unavoidable.
This pull request addresses a critical validation gap in the "reflect"
mode of the `Pad` operator for both CPU and CUDA backends, ensuring
compliance with the ONNX specification and preventing out-of-bounds
memory access. The main change is the addition of checks that prevent
the pad size from exceeding the maximum allowed value (`extent - 1`) for
each axis, and the introduction of comprehensive regression tests to
verify the new behavior.

Validation fixes for reflect-mode padding:

* Added explicit checks in
`onnxruntime/core/providers/cpu/tensor/pad.cc` and
`onnxruntime/core/providers/cuda/tensor/pad.cc` to ensure that, for
reflect mode, both pre-pad and post-pad values do not exceed `extent -
1` for each axis, as required by the ONNX spec. This prevents heap
out-of-bounds errors and aligns with numpy behavior.
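
The reflect-mode constraint can be sketched as a small standalone check (illustrative Python using the ONNX `pads` layout of all begin values followed by all end values; not the ORT implementation):

```python
def check_reflect_pads(shape, pads):
    """Reject reflect-mode pads exceeding extent - 1 on any axis.

    `pads` uses the ONNX Pad layout: [x1_begin, x2_begin, ...,
    x1_end, x2_end, ...]. Matches numpy's np.pad(mode="reflect") limit.
    """
    rank = len(shape)
    for axis, extent in enumerate(shape):
        for pad in (pads[axis], pads[rank + axis]):
            if pad > extent - 1:
                raise ValueError(
                    f"reflect pad {pad} exceeds limit {extent - 1} on axis {axis}")

check_reflect_pads([4, 3], [3, 2, 1, 0])  # every pad <= extent - 1: accepted
```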

Testing and regression coverage:

* Added a suite of regression tests in
`onnxruntime/test/providers/cpu/tensor/pad_test.cc` to verify that
invalid pad sizes in reflect mode are correctly rejected, including edge
cases for 1D and 2D inputs, boundary conditions, and scenarios with
slicing. These tests ensure that the operator fails gracefully when pad
sizes exceed the allowed limit and succeeds when within bounds.

Other changes:

* Minor file encoding update in
`onnxruntime/test/providers/cpu/tensor/pad_test.cc`.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
microsoft#27578)

## Summary
This PR adds CUDA support for optimized **nearest-neighbor 3D resize
mapping/execution** in the Resize operator path, and adds targeted
regression coverage.

The implementation introduces a dedicated 3D fast path for nearest
resize to handle the last three spatial dimensions (`D/H/W`) efficiently
when outer dimensions are unchanged.

## What Changed

### CUDA Resize implementation
File: `onnxruntime/core/providers/cuda/tensor/resize_impl.cu`

- Added 3D nearest mapping kernel:
  - `_ResizeNearestMappingKernel3D`
- Added 3D nearest compute kernel:
  - `_ResizeNearestKernel3D`
- Added optimized 3D dispatch path in `ResizeNearestImpl`:
  - Enabled when:
    - `rank >= 3`
    - `coordinate_transformation_mode != tf_crop_and_resize`
    - all outer scales (except last 3 dims) are `1.0`

This keeps existing behavior unchanged for other cases while using the
optimized path for true 3D nearest resize workloads.
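
The dispatch condition above can be mirrored in a short sketch (illustrative only; names are not the ORT identifiers):

```python
def use_optimized_3d_nearest(rank, coord_mode, scales):
    """Eligibility check for the dedicated 3D nearest-resize fast path."""
    if rank < 3 or coord_mode == "tf_crop_and_resize":
        return False
    # all scales except the last three dims (D/H/W) must be exactly 1.0
    return all(s == 1.0 for s in scales[:-3])

assert use_optimized_3d_nearest(5, "asymmetric", [1.0, 1.0, 2.0, 2.0, 2.0])
assert not use_optimized_3d_nearest(5, "tf_crop_and_resize", [1.0, 1.0, 2.0, 2.0, 2.0])
assert not use_optimized_3d_nearest(5, "asymmetric", [2.0, 1.0, 2.0, 2.0, 2.0])
```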

### Regression tests
File: `onnxruntime/test/providers/cpu/tensor/resize_op_test.cc`

Added CUDA-targeted regression tests:
- `ResizeOpNearestUpSampleTest_5D_CudaRegression_Optimized3DMapping`
- `ResizeOpNearestDownSampleTest_5D_CudaRegression_Optimized3DMapping`

## Why
The previous nearest implementation relied on the generic path for these
3D scenarios. This change introduces a dedicated CUDA 3D path to improve
performance for 5D nearest resize workloads.

Fixes microsoft#14596
…bugs (microsoft#27692)

### Description

This PR adds fp16 (half-precision) support for 8-bit MatMulNBits on
ARM64 NEON and fixes several pre-existing bugs discovered during
testing.

**New features:**
- **HQNBIT_CompFp16 for 8-bit:** Added
`HQ8BitGemmPackQuantBData_CompFp16` and
`HQ8BitBlkDequantBForHgemm_CompFp16` NEON kernels that pack and
dequantize 8-bit quantized weights for fp16 GEMM. Reuses the existing
`HQ4BitGemmKernel_CompFp16` for the actual compute since the dequantized
B matrix has the same layout.
- **HQNBIT_CompInt8 for 4-bit:** Added accuracy level 4 (int8 compute)
support for fp16 4-bit MatMulNBits. Converts fp16 activations to fp32,
then uses the existing SQ4Bit int8 kernels.
- **HQNBIT_CompInt8 for 8-bit:** Added accuracy level 4 (int8 compute)
support for fp16 8-bit MatMulNBits. Converts fp16 scales to fp32 for
packing, then uses the existing SQ8Bit int8 kernels.

**Bug fixes:**
- **Bias offset bug in CompFp16 (Windows ARM multithreading):** Fixed
missing `+ RangeStartN` when initializing `Bias` pointer in
`HQ4BitGemm_CompFp16` and `HQ8BitGemm_CompFp16`. This caused incorrect
results when using multiple threads, as worker threads processing column
ranges beyond the first would read bias values from the wrong offset.
- **QuantBDataWorkspace not set for MLFloat16 fallback (macOS ARM
crash):** Removed `#ifdef MLAS_TARGET_AMD64_IX86` guard around setting
`QuantBDataWorkspace` in `ComputeBPacked<MLFloat16>`, so macOS ARM
(which uses the fp32 fallback path) correctly sets the workspace pointer
for SQNBIT_CompInt8.
- **Scale/ZP packing skipped on non-x64 in MLFloat16 PrePack (macOS ARM
gibberish):** Removed `#ifdef MLAS_TARGET_AMD64_IX86` guard around the
SQNBIT_CompInt8 scale and zero-point packing in the
`MatMulNBits<MLFloat16>::PrePack` specialization. Added `nbits_ == 8`
condition to match the generic template's behavior on ARM (only 8-bit
needs separate scale packing on ARM, while x64 needs it for both 4-bit
and 8-bit).
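
The bias offset bug can be pictured with a minimal sketch (hypothetical helper, not the MLAS kernel): each worker thread handles a column range `[RangeStartN, RangeStartN + RangeCountN)` and must offset its bias reads accordingly.

```python
def worker_bias_slice(bias, range_start_n, range_count_n):
    """Bias values seen by a worker handling a given column range.

    The bug omitted the `+ range_start_n` offset, so every worker
    effectively read from the start of the bias buffer.
    """
    return bias[range_start_n:range_start_n + range_count_n]

bias = [10, 20, 30, 40]
assert worker_bias_slice(bias, 0, 2) == [10, 20]  # first worker
assert worker_bias_slice(bias, 2, 2) == [30, 40]  # second worker, correct offset
```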

### Motivation and Context

8-bit quantized models with fp16 inputs are increasingly common on ARM
devices (Windows ARM, macOS Apple Silicon). The existing MatMulNBits
implementation only supported 4-bit for the HQNBIT fp16 paths. This
change extends support to 8-bit, enabling faster inference for 8-bit
quantized models on ARM64 without requiring fp16→fp32 conversion of the
weights.

The bug fixes address issues that were either pre-existing (the `#ifdef`
guards were copy-paste inconsistencies from prior PRs) or introduced
alongside the fp16 NEON support (the Bias offset issue). These caused
crashes or incorrect output on macOS ARM and multithreaded Windows ARM
configurations.

### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`

#### Accuracy level 4 (uses HQNBIT_CompInt8) vs Accuracy level 1 (uses
HQNBIT_CompFp16)

| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.19× (9.6ms) | 1.36× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.89× (39.8ms) | 1.62× (1371ms) | 1.54× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.54× (2654ms) | 1.43× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | 0.79× (22.5ms) | 2.59× (257ms) | 2.16× (642ms) |
| Qwen 1.5B | 1.14× (41.4ms) | 2.50× (848ms) | 2.55× (1636ms) |
| Qwen 3B | 1.07× (52.9ms) | 1.95× (2133ms) | 2.29× (3799ms) |

#### Latest changes vs ORT 1.24.3 (both accuracy level 4)
On ORT 1.24.3:
- 4 bit uses HQNBIT_CompFp16
- 8 bit uses naive unpacked dequantize and matmul

| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.13× (9.6ms) | 1.35× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.82× (39.8ms) | 1.40× (1371ms) | 1.47× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.47× (2654ms) | 1.51× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | **35.4×** (22.5ms) | **5.0×** (257ms) | **3.2×** (642ms) |
| Qwen 1.5B | **98.0×** (41.4ms) | **6.8×** (848ms) | **4.7×** (1636ms) |
| Qwen 3B | **107.8×** (52.9ms) | **4.1×** (2133ms) | **3.1×** (3799ms) |
…icrosoft#27091)

### Description
This PR fixes test errors encountered during the build and compilation
of onnxruntime-cann.
error1:
The HiSilicon CPU vendor was unrecognized because its CPU info had not
been added, leading to CTest errors.
```
onnxruntime cpuid_info warning: Unknown CPU vendor. cpuinfo_vendor value: 15

10: [----------] Global test environment tear-down
10: [==========] 15 tests from 1 test suite ran. (772 ms total)
10: [  PASSED  ] 15 tests.
10/10 Test #10: onnxruntime_ep_graph_test ...............   Passed    0.98 sec

90% tests passed, 1 tests failed out of 10

Total Test time (real) = 674.68 sec

The following tests FAILED:
          1 - onnxruntime_test_all (Failed)
Errors while running CTest
```
error2: 
Some Python tests are failing here due to a previously submitted PR
(microsoft#25867). In that PR, we
introduced a new parameter enable_cann_subgraph to control subgraph
partitioning for unsupported operators, with a preference for executing
the entire graph as a whole. However, this change causes certain test
cases to fail when specific operator versions in the models are not
supported, leading to execution errors.
…oft#27698)

### Description
Don't try to show extended ccache stats.

### Motivation and Context
The current docker images ship a package set with an old version of
ccache that doesn't understand `--verbose` (`-v`), so if `ccache` is
installed via the package manager, the build scripts fail.
This matters because we will soon be enabling ccache throughout various
pipelines/workflows.
### Description

Release packaging currently requires manually triggering and monitoring
multiple independent pipelines (Python, Nuget, Java, NPM, iOS, etc.).
This PR introduces a unified orchestration framework that triggers all
packaging pipelines from a single master pipeline, monitors them in
parallel, and reports aggregated results.

### Motivation and Context
* main-release-pipeline.yml — A 1ES pipeline with per-pipeline
enable/disable toggles, dry-run support, and configurable target branch.
* `trigger_and_wait_pipelines.py` — Python orchestrator that triggers
~11 packaging pipelines via the ADO REST API, polls them every 60s, and
logs status to Azure Kusto for analytics.

---------

Co-authored-by: Kusuma Padma Kavya Bandi <kusbandi@microsoft.com>
### Description
File mapping is enabled by default, but when the model contains an
embedded EP context in a node, QNN EP attempts to map the EP context as
a file that doesn't exist. This change disables file mapping entirely
when `ep_context_embed_mode = 1`.

---------

Co-authored-by: calvnguy <calvnguy@qti.qualcomm.com>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
### Description
Allow using KAI SBGemm on any ARM64 build (previously it needed the
condition that NCHWc be not enabled as the logistics of using the KAI
SBGemm path for the SBGemm based Conv was not clear). This change
instead allows using the KAI path for other call sites (by enabling it
in the build) but explicitly disallows using the KAI path only for the
Conv and continue using the default MLAS path for the same (as before).



### Motivation and Context

microsoft#26773 (comment)

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…els (microsoft#27749)

## Description

Fix two bugs in the WebGPU ConvTranspose shader code generation
(conv_backprop.cc) in the `pack_input_as4_` code path when
`a_components_ == 1` (triggered when input channels per group is not
divisible by 2 or 4, e.g., 5 or 7).

### Bug 1: Wrong offset for weight reads
Weight values were read using `x_offset` (the input/dy tensor offset)
instead of `w_offset` (the weight tensor offset), producing incorrect
convolution results.

### Bug 2: Missing weight multiplication in remainder loop
The remainder loop (handling leftover channels when
`inputChannelsPerGroup` is not a multiple of 4) was adding raw input
values to `dotProd` without multiplying by the corresponding weight
values.
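
Bug 2 reduces to a one-line omission, sketched here (illustrative Python standing in for the generated WGSL; names are not the shader identifiers):

```python
def remainder_dot(x_vals, w_vals):
    """Remainder-loop accumulation for leftover channels.

    The buggy shader did `dot += x` for each leftover channel; the fix
    multiplies by the corresponding weight: `dot += x * w`.
    """
    dot = 0.0
    for x, w in zip(x_vals, w_vals):
        dot += x * w
    return dot

assert remainder_dot([1.0, 2.0], [3.0, 4.0]) == 11.0
```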

## Motivation and Context

The `inChannels = 5` and `inChannels = 7` test cases in
`conv-transpose.jsonc` were failing because these channel counts aren't
divisible by 2 or 4, triggering the buggy `a_components_ == 1` branch.
Cases like `inChannels = 6` (`a_components_ = 2`) and `inChannels = 8`
(`a_components_ = 4`) were unaffected.

## Testing

All 22 conv-transpose WebGPU tests now pass:
```
npm test -- op conv-transpose.jsonc -b=webgpu -e=node
22 passing (23s)
```

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

This change updates the KleidiAI SGEMM post-processing path in
onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp with two parts:
- Correctness fix: in the alpha == 0 || K == 0 fast path, beta handling
is now applied for every batch entry (not just batch 0), so batched
SGEMM behaviour is correct.
- NEON SGEMM epilogue optimisation: adds a vectorised alpha/beta
post-processing path for contiguous outputs, with guarded fallback to
scalar for non-contiguous or small cases. The 2D epilogue path also
routes contiguous tiles through the contiguous 1D epilogue path to
enable vectorisation.

### Motivation and Context

This change addresses correctness and performance in the SGEMM
post-processing stage:
- The batched alpha == 0 || K == 0 path previously used only Data[0],
which could produce incorrect results for BatchSize > 1.
- The post-processing loop (C = alpha * (A*B) + beta * C) is a known
latency contributor when memcpy fast paths are not applicable. The NEON
epilogue changes are intended to reduce this cost on supported ARM
platforms while preserving existing fallback behaviour.
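
The batched fast-path fix can be sketched as follows (illustrative Python, not the KleidiAI code): when `alpha == 0` or `K == 0`, the epilogue reduces to `C = beta * C`, and it must be applied to every batch entry, not just the first.

```python
def sgemm_alpha_zero_path(c_batches, beta):
    """alpha == 0 (or K == 0) fast path: scale C by beta for EVERY batch.

    The bug applied beta only to batch 0 (Data[0]), corrupting results
    for BatchSize > 1.
    """
    for c in c_batches:          # iterate all batch entries
        for i in range(len(c)):
            c[i] *= beta

cs = [[1.0, 2.0], [3.0, 4.0]]
sgemm_alpha_zero_path(cs, 0.5)
assert cs == [[0.5, 1.0], [1.5, 2.0]]
```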

---------

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>
### Description

This PR updates the logic for identifying a large model in Intel's
Neural Compressor.

### Motivation and Context

The original logic was not sufficient to detect whether a model produced
by the model builder is too large or not. Here is an example traceback
from an internal customer.

```
Traceback (most recent call last):
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builder.py", line 502, in <module>
    create_model(
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\torch\utils\_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builder.py", line 346, in create_model
    onnx_model.save_model(output_dir)
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builders\base.py", line 748, in save_model
    model = self.to_int4()
            ^^^^^^^^^^^^^^
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builders\base.py", line 738, in to_int4
    quant.process()
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\matmul_nbits_quantizer.py", line 1442, in process
    self.int4_quant_algo()
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\matmul_nbits_quantizer.py", line 1388, in int4_quant_algo
    self.model = rtn_quantize(
                 ^^^^^^^^^^^^^
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\weight_only.py", line 456, in rtn_quantize
    model = ONNXModel(model)
            ^^^^^^^^^^^^^^^^
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\onnx_model.py", line 52, in __init__
    self.check_is_large_model()
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\onnx_model.py", line 91, in check_is_large_model
    raise e
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\onnx_model.py", line 84, in check_is_large_model
    init_bytes = init.SerializeToString()
                 ^^^^^^^^^^^^^^^^^^^^^^^^
google.protobuf.message.EncodeError: Failed to serialize proto
```
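
For context, the failure mode in the traceback revolves around protobuf's 2 GiB serialization limit. A rough standalone sketch of a size check (a hypothetical helper with an assumed threshold, not the Neural Compressor code):

```python
PROTOBUF_LIMIT = 2**31 - 1  # protobuf cannot serialize messages of 2 GiB or more

def is_large_model(initializer_sizes_bytes):
    """Treat a model as 'large' when its initializers alone reach the
    protobuf serialization limit (illustrative heuristic only)."""
    return sum(initializer_sizes_bytes) >= PROTOBUF_LIMIT

assert not is_large_model([10 * 1024 ** 2])  # 10 MiB model: not large
assert is_large_model([2 ** 31, 1])          # past the limit: large
```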
…nds read (microsoft#27748)

### Description

Add an `ORT_ENFORCE` check in the `QMoECPU` constructor to require
`swiglu_fusion == 1` when using SwiGLU activation, preventing an
out-of-bounds read.

When `swiglu_fusion=0` (the default), `fc1_out_features` is computed as
`inter_size` instead of `2*inter_size`. However, `ApplySwiGLUActivation`
reads `2*inter_size` values from the FC1 output buffer (via
`input_data[2*i]` for `i` in `[0, inter_size)`), causing an
out-of-bounds read that produces NaN on Windows x86.

This matches the existing validation already present in the `MoE` CPU
operator (`moe_cpu.cc:26-27`).
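
The interleaved read pattern behind the out-of-bounds is easy to see in a simplified sketch (illustrative Python with no alpha/clamp handling; not the ORT kernel):

```python
import math

def apply_swiglu_interleaved(fc1_out, inter_size):
    """Interleaved SwiGLU consumes (gate, linear) pairs, reading
    2*inter_size values from the FC1 output buffer.

    With swiglu_fusion=0 the buffer holds only inter_size values,
    so indices 2*i run past the end -- the out-of-bounds read.
    """
    assert len(fc1_out) >= 2 * inter_size, "fc1 output must hold 2*inter_size values"
    silu = lambda x: x / (1.0 + math.exp(-x))
    return [silu(fc1_out[2 * i]) * fc1_out[2 * i + 1] for i in range(inter_size)]

out = apply_swiglu_interleaved([1.0, 2.0, 0.0, 3.0], 2)
assert len(out) == 2
```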

### Motivation and Context

- The NaN was caused by a missing `swiglu_fusion=1` attribute. With the
default `swiglu_fusion=0`, the SwiGLU activation reads past the
allocated FC1 output buffer — an out-of-bounds read.
- The `MoE` CPU operator already enforces `swiglu_fusion == 1` for
SwiGLU; this change adds the same guard to `QMoECPU` for consistency and
safety.
- Non-interleaved SwiGLU format (`swiglu_fusion=2`) is not implemented
(throws `ORT_NOT_IMPLEMENTED`), and `swiglu_fusion=0` is invalid for
SwiGLU, so only `swiglu_fusion=1` is valid.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
### Description

Building Linux onnxruntime-qnn python wheel didn't work through WSL
because QNN lib dependencies were not copied over to the wheel, so this
PR resolved the issue.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…egation routing (microsoft#27687)

### Description

Adds optional input `router_weights` (index 14) to `com.microsoft.QMoE`
to decouple Top-K expert selection from output aggregation weighting.

When `router_weights` is provided:
- `router_probs` → Top-K expert selection only
- `router_weights` → values gathered at selected expert indices used as
mixing weights

When omitted, existing softmax-of-`router_probs` behavior is preserved
(backward compatible).
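
The selection/aggregation split can be sketched as follows (illustrative Python for a single token; names are not the ORT identifiers):

```python
import math

def route_token(router_probs, top_k, router_weights=None):
    """Pick top-k experts from router_probs; mix expert outputs either by
    softmax(router_probs) (legacy) or by router_weights (new input)."""
    order = sorted(range(len(router_probs)), key=lambda i: -router_probs[i])
    idx = order[:top_k]                      # selection always uses router_probs
    if router_weights is None:
        exps = [math.exp(p) for p in router_probs]
        total = sum(exps)
        return idx, [exps[i] / total for i in idx]
    return idx, [router_weights[i] for i in idx]  # aggregation decoupled

idx, w = route_token([0.1, 2.0, 0.5], 2, router_weights=[0.9, 0.6, 0.3])
assert idx == [1, 2] and w == [0.6, 0.3]
```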

**Changes:**
- **Schema** (`contrib_defs.cc`): New optional input 14
`router_weights`, type T, shape `(num_tokens, num_experts)`
- **CPU provider** (`moe_quantization_cpu.cc`): Implements the separate
routing path with MLFloat16/float support and optional
`normalize_routing_weights` normalization
- **CUDA provider** (`moe_quantization.cc`): Reads input, enforces
not-implemented if provided
- **WebGPU provider** (`qmoe.cc`): Same not-implemented guard
- **Tests** (`moe_test.cc`): `QMoETest_CPU_RouterWeights` covering both
normalized and unnormalized paths with non-zero expected outputs via FC2
bias to validate correct aggregation weights
- **Docs** (`OperatorKernels.md`): Updated CPU and CUDA entries

This pattern matches DeepSeek-V2/V3/R1 routing where `sigmoid(logits)`
is used for aggregation while `logits + bias` with group masking drives
selection:

```python
# DeepSeek-style: different tensors for selection vs aggregation
topk_indices = torch.topk(scores_for_choice, k=top_k)[1]  # selection from modified logits
topk_weights = router_logits.gather(1, topk_indices)        # aggregation from original sigmoid
```

### Motivation and Context

`QMoE` previously required the same tensor for both routing and
weighting, blocking DeepSeek-style `noaux_tc` MoE models where these are
intentionally separate. This unblocks ONNX Runtime export/serving of
DeepSeek-V2/V3/R1 MoE architectures.




<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details on the original issue you should resolve*
> 
> <issue_title>[Feature Request] Support noaux_tc MoE routing in
com.microsoft.QMoE via separate router_weights</issue_title>
> <issue_description>### Describe the feature request
> 
> `com.microsoft.QMoE` currently accepts a single routing tensor
(commonly router_probs) that is used both for:
> 
> Top‑K expert selection (routing / dispatch), and
> Weighting the outputs of selected experts (aggregation).
> 
> This design makes it impossible to represent DeepSeek‑style `noaux_tc`
`MoE` routing, where different tensors are intentionally used for:
> 
> * expert selection (Top‑K routing), and
> * expert output weighting (mixing).
> 
> This issue proposes adding an optional input `router_weights` to
`com.microsoft.QMoE` so that:
> 
> * `router_probs` is used only for Top‑K selection, and
> * `router_weights` is used only for multiplying / aggregating expert
outputs.
> 
> The change is backward compatible
> This also allows for any other methodology in future where different
tensors are used for selection/aggregation
> 
> ### Describe scenario use case
> 
> Enables exporting and serving DeepSeek‑V2/V3/R1‑style MoE models in
ONNX Runtime</issue_description>
> 
> <agent_instructions>Please update operator spec and implement it in
CPU provider. For CUDA provider, it is fine to throw not implemented
exception for now.
> 
> Example Deepseek MoE script can be found in
https://github.com/huggingface/transformers/blob/75c836b7853cb65f48ab2ce13cddfb12d14ecf5a/src/transformers/models/deepseek_v3/modular_deepseek_v3.py
like the following:
> 
> class DeepseekV3MoE(nn.Module):
>     """
>     A mixed expert module containing shared experts.
>     """
> 
>     def __init__(self, config):
>         super().__init__()
>         self.config = config
>         self.experts = DeepseekV3NaiveMoe(config)
>         self.gate = DeepseekV3TopkRouter(config)
>         self.shared_experts = DeepseekV3MLP(
> config=config, intermediate_size=config.moe_intermediate_size * config.n_shared_experts
>         )
>         self.n_routed_experts = config.n_routed_experts
>         self.n_group = config.n_group
>         self.topk_group = config.topk_group
>         self.norm_topk_prob = config.norm_topk_prob
>         self.routed_scaling_factor = config.routed_scaling_factor
>         self.top_k = config.num_experts_per_tok
> 
>     def route_tokens_to_experts(self, router_logits):
>         router_logits = router_logits.sigmoid()
>         router_logits_for_choice = router_logits + self.gate.e_score_correction_bias
>         group_scores = (
>             router_logits_for_choice.view(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .topk(2, dim=-1)[0]
>             .sum(dim=-1)
>         )
>         group_idx = torch.topk(group_scores, k=self.topk_group, dim=-1, sorted=False)[1]
>         group_mask = torch.zeros_like(group_scores)
>         group_mask.scatter_(1, group_idx, 1)
>         score_mask = (
>             group_mask.unsqueeze(-1)
>             .expand(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .reshape(-1, self.n_routed_experts)
>         )
>         scores_for_choice = router_logits_for_choice.masked_fill(~score_mask.bool(), 0.0)
>         topk_indices = torch.topk(scores_for_choice, k=self.top_k, dim=-1, sorted=False)[1]
>         topk_weights = router_logits.gather(1, topk_indices)
>         if self.norm_topk_prob:
>             denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
>             topk_weights /= denominator
>         topk_weights = topk_weights * self.routed_scaling_factor
>         return topk_indices, topk_weights
> 
>     def forward(self, hidden_states):
>         residuals = hidden_states
>         orig_shape = hidden_states.shape
>         router_logits = self.gate(hidden_states)
>         topk_indices, topk_weights = self.route_tokens_to_experts(router_logits)
>         hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
>         hidden_states = self.experts(hidden_states, topk_indices, topk_weights).view(*orig_shape)
>         hidden_states = hidden_states + self.shared_experts(residuals)
>         return hidden_states
> 
> </agent_instructions>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> </comments>
> 


</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes microsoft#27675


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
… unsqueeze_elimination against invalid model (microsoft#27638)

This pull request introduces improvements to the
`concat_slice_elimination` and `unsqueeze_elimination` optimizer passes,
focusing on correctness, robustness, and code clarity. The changes
include enhanced handling of optional Slice operator attributes,
stricter validation of axes and steps, and improved error handling for
invalid model inputs.

Improvements to `concat_slice_elimination`:

* Materialized default values for optional `axes` and `steps` in the
Slice operator, ensuring safe indexing and alignment with ONNX defaults.
(`onnxruntime/core/optimizer/concat_slice_elimination.cc`)
* Refined the fusion pattern to only allow `starts.size() == 1`,
clarifying the scope of the optimization and preventing incorrect
fusions. (`onnxruntime/core/optimizer/concat_slice_elimination.cc`)
* Added `<numeric>` include to support new code using `std::iota`.
(`onnxruntime/core/optimizer/concat_slice_elimination.cc`)
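
The defaulting rule being materialized follows the ONNX Slice spec: a missing `axes` input means `[0, 1, ..., len(starts)-1]` and a missing `steps` input means all ones. A minimal sketch of that rule (the helper name is illustrative, not the ORT code; the C++ pass uses `std::iota` for the axes case):

```python
def materialize_slice_defaults(num_starts, axes=None, steps=None):
    """Fill in ONNX Slice defaults: axes -> [0..n-1], steps -> all ones."""
    if axes is None:
        axes = list(range(num_starts))  # C++ equivalent: std::iota over 0..n-1
    if steps is None:
        steps = [1] * num_starts
    return axes, steps

# Slice with one start/end pair and no explicit axes/steps:
print(materialize_slice_defaults(1))  # ([0], [1])
```

With the defaults materialized, the fusion can index `axes[i]` and `steps[i]` safely for any valid Slice node.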

Improvements to `unsqueeze_elimination`:

* Added validation for axes values, including range checks and detection
of duplicate axes, returning errors for invalid models instead of
silently proceeding.
(`onnxruntime/core/optimizer/unsqueeze_elimination.cc`)
* Added `core/providers/common.h` include for utility functions used in
validation. (`onnxruntime/core/optimizer/unsqueeze_elimination.cc`)
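
The added axes validation amounts to the following rule, sketched with an illustrative helper (the real code relies on ORT utilities from `core/providers/common.h`, e.g. negative-axis normalization):

```python
def validate_unsqueeze_axes(axes, output_rank):
    """Normalize and validate Unsqueeze axes: each axis must fall in
    [-output_rank, output_rank) after the ONNX convention for negative
    axes, and duplicate axes make the model invalid."""
    normalized = [a + output_rank if a < 0 else a for a in axes]
    if any(a < 0 or a >= output_rank for a in normalized):
        raise ValueError(f"Unsqueeze axis out of range for output rank {output_rank}")
    if len(set(normalized)) != len(normalized):
        raise ValueError("Unsqueeze: duplicate axes")
    return sorted(normalized)

print(validate_unsqueeze_axes([-1, 0], 3))  # [0, 2]
```

Invalid models now produce an error status at this point instead of the eliminator silently proceeding on bad axes.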
### Description

Extends CUDA EP Squeeze and Unsqueeze kernel registrations from opset 23
to opset 25, matching CPU provider coverage.

- **`squeeze.cc` / `unsqueeze.cc`**: Cap opset 23 to versioned `23–23`,
add versioned `24–24`, add non-versioned `25`
- **`cuda_execution_provider.cc`**: Add corresponding forward
declarations and `BuildKernelCreateInfo` registry entries for opsets 23
(now versioned), 24, and 25
- **`docs/OperatorKernels.md`**: Update CUDA Squeeze and Unsqueeze
entries to reflect `25+` coverage with individual `24` and `23` version
rows

No new computation logic is needed: these ops are shape-only (the data copy is a single `cudaMemcpyAsync`), so the same kernel implementation covers all new opsets.

### Motivation and Context

CUDA EP registered Squeeze/Unsqueeze only up to opset 23 while the ONNX
spec defines them through opset 25. Models exported at opset 24+ would
fail to find a matching CUDA kernel. Part of the broader opset gap audit
tracked in microsoft#27729.

### Limitation

It does not include new data types for float8, float4, int4 etc. That
will be added later if needed.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
…27693)

### Description

Add a pre-check for zero values in the divisor tensor for integral types
in `Div<T>`. Returns an error `Status` instead of hitting undefined
behavior (SIGFPE / structured exception).

- **`element_wise_ops.h`**: When the divisor is a constant initializer,
`TryGetConstantInput` validates for zeros once at kernel creation time
in the constructor, avoiding per-`Compute` overhead. A
`divisor_is_validated_constant_` flag tracks whether the one-time check
was performed.
- **`element_wise_ops.cc`**: `if constexpr (std::is_integral<T>::value)`
guard scans non-constant divisors before calling `UntypedBroadcastTwo`,
skipping the check when the constant was already validated. Compiled
away for float/double/half — zero cost for non-integer paths.
- **`element_wise_ops_test.cc`**: Added `Div_int8_by_zero`,
`Div_int32_by_zero`, `Div_int64_by_zero_scalar` tests covering tensor
and scalar divisor cases, plus `Div_int32_by_zero_constant_initializer`
to exercise the `TryGetConstantInput` constructor path with
`is_initializer = true`.

### Motivation and Context

Integer division by zero is UB in C++ and causes a hardware exception
that crashes the process. Float types produce inf/NaN naturally, but
int8/int16/int32/int64/uint* types do not. This was reported via
Chromium (https://issues.chromium.org/issues/491835014) with a trivial
repro: `tensor<int8> / scalar(0)`.
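
The shape of the fix can be sketched outside C++ as a pre-scan of the divisor before any element-wise division happens (names are illustrative; the real code guards the scan with `if constexpr` so floating-point paths pay nothing):

```python
def checked_int_div(a, b):
    """Divide two integer 'tensors' elementwise, but scan the divisor for
    zeros first and fail with a clean error instead of crashing the
    process (integer division by zero is undefined behavior in C++)."""
    if any(v == 0 for v in b):
        raise ValueError("Div: zero found in integer divisor tensor")
    # int(x / y) truncates toward zero, matching C++ integer division
    return [int(x / y) for x, y in zip(a, b)]

print(checked_int_div([6, 8], [2, 4]))  # [3, 2]
```

For a constant-initializer divisor the scan runs once at kernel creation rather than on every call, which is what the `divisor_is_validated_constant_` flag tracks.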

<!-- START COPILOT ORIGINAL PROMPT -->



<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details on the original issue you should resolve*
> 
> <issue_title>int8 / 0 exception not caught for cpu ep</issue_title>
> <issue_description>See https://issues.chromium.org/issues/491835014.
> 
> Repro:
> a=tensor<int8>
> b=tensor<int8>, ie a scalar that is 0
> model that does a/b
> 
> Stack trace:
> ```
> onnxruntime.dll!Eigen::internal::scalar_quotient_op<signed char,signed char>::operator()(const char &) Line 437      C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::binary_evaluator<Eigen::CwiseBinaryOp<Eigen::internal::scalar_quotient_op<signed char,signed char>,Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<signed char>,Eigen::Array<signed char,-1,1,0,-1,1> const> const ,Eigen::ArrayWrapper<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1> const ,0,Eigen::Stride<0,0>>> const>,Eigen::internal::IndexBased,Eigen::internal::IndexBased,signed char,signed char>::coeff(__int64) Line 910    C++
>  ...
>      [Inline Frame] onnxruntime.dll!Eigen::internal::Assignment<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>>,Eigen::CwiseBinaryOp<Eigen::internal::scalar_quotient_op<signed char,signed char>,Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<signed char>,Eigen::Array<signed char,-1,1,0,-1,1> const> const ,Eigen::ArrayWrapper<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1> const ,0,Eigen::Stride<0,0>>> const>,Eigen::internal::assign_op<signed char,signed char>,Eigen::internal::Dense2Dense,void>::run(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 855      C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::call_assignment_no_alias(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 797      C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::call_assignment(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 768   C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::call_assignment(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 750   C++
>      [Inline Frame] onnxruntime.dll!Eigen::MatrixBase<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>>>::operator=(const Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_quotient_op<signed char,signed char>,Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<signed char>,Eigen::Array<signed char,-1,1,0,-1,1> const> const ,Eigen::ArrayWrapper<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1> const ,0,Eigen::Stride<0,0>>> const>> &) Line 59 C++
>      [Inline Frame] onnxruntime.dll!onnxruntime::Div<signed char>::Compute::__l2::<lambda_998187df037dec36fd0905b4142c682e>::operator()(onnxruntime::BroadcastHelper &) Line 685   C++
>      onnxruntime.dll!<lambda_998187df037dec36fd0905b4142c682e>::<lambda_invoker_cdecl>(onnxruntime::BroadcastHelper & per_iter_bh) Line 686    C++
>      [External Code]
>      [Inline Frame] onnxruntime.dll!std::_Func_class<void,__int64,__int64>::operator()(__int64 <_Args_0>, __int64 <_Args_1>) Line 926    C++
>      onnxruntime.dll!onnxruntime::concurrency::ThreadPool::ParallelFor(__int64 n, const onnxruntime::TensorOpCost & c, const std::function<void __cdecl(__int64,__int64)> & f) Line 628  C++
>      onnxruntime.dll!onnxruntime::concurrency::ThreadPool::TryParallelFor(onnxruntime::concurrency::ThreadPool * tp, __int64 total, const onnxruntime::TensorOpCost & cost_per_unit, const std::function<void __cdecl(__int64,__int64)> & fn) Line 705     C++
>      onnxruntime.dll!onnxruntime::ParallelizeSingleSpan<onnxruntime::BroadcastHelper>(onnxruntime::BroadcastHelper & helper, const onnxruntime::ProcessBroadcastSpanFuncs & functors) Line 955 C++
>      onnxruntime.dll!onnxruntime::BroadcastLooper<onnxruntime::BroadcastHelper>(onnxruntime::BroadcastHelper & helper, const onnxruntime::ProcessBroadcastSpanFuncs & functors) Line 1006      C++
>      onnxruntime.dll!onnxruntime::UntypedBroadcastTwo(onnxruntime::OpKernelContext & context, const onnxruntime::ProcessBroadcastSpanFuncs & funcs, double unit_cost, void * user_data) Line 2305    C++
>      onnxruntime.dll!onnxruntime::Div<signed char>::Compute(onnxruntime::OpKernelContext * context) Line 695     C++
> ```
> </issue_description>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> </comments>
> 


</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes microsoft#27686


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: skottmckay <979079+skottmckay@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…in onnxruntime.native.path (microsoft#27668)

## Summary

Fixes the Java native provider loading flow so that
`extractProviderLibrary` checks
`onnxruntime.native.path` before attempting extraction from classpath
resources.

Previously, when a provider library was already present in
`onnxruntime.native.path`,
the Java loader could still attempt `extractFromResources(...)` first,
and only fall
back to the configured native path afterwards. This caused an
unnecessary extraction
attempt even though the library was already available on disk.

This change updates the lookup order to:

1. Return immediately if the provider was already marked as ready
2. Check whether the provider library already exists in
`onnxruntime.native.path`
3. Only if not found there, attempt extraction from classpath resources
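
The lookup order above can be sketched as follows (Python for brevity; names are illustrative, the real logic lives in the Java loader):

```python
import os

def extract_provider_library(lib_name, native_path, already_ready, extract_from_resources):
    """Resolve a provider library using the fixed lookup order:
    1) already marked ready, 2) already present on disk under
    onnxruntime.native.path, 3) only then extract from classpath."""
    if already_ready:
        return True
    if native_path and os.path.exists(os.path.join(native_path, lib_name)):
        return True
    return extract_from_resources(lib_name)
```

When the library is already on disk, `extract_from_resources` is never invoked, which is exactly what the new regression test asserts via its test-only hook.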

## Tests

Added a regression test covering the case where the requested provider
library already
exists in `onnxruntime.native.path`.

The test verifies that:

- `extractProviderLibrary(...)` returns `true`
- extraction from resources is not attempted in that case

A small test-only hook was added to observe calls to
`extractFromResources(...)` so the
regression can be validated directly and deterministically.

## Issue

Fixes microsoft#27655
…7714)

Fix microsoft#27712 

This pull request improves support and validation for the `softcap` and
`softmax_precision` attributes in the CUDA Attention operator, updates
kernel eligibility and fallback logic, and enhances test coverage for
these features. The changes ensure that only valid values are accepted,
propagate new parameters to eligible kernels, and clarify backend
capabilities in code comments and tests.

**CUDA Attention operator improvements:**

* Added validation to enforce that `softcap` is non-negative and that
`softmax_precision` is one of the supported TensorProto types (0, 1, 10,
or 16).
* Updated code comments and eligibility checks to clarify that `softcap`
is now supported natively in Flash and Memory Efficient Attention (MEA)
kernels, and that `softmax_precision` is inherently satisfied (always
computed in FP32 on CUDA).
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL174-R183)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL548-R556)
[[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL824-R834)
* Propagated the `softcap` parameter to the MEA kernel invocation to
enable native support.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR696)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR746)
* Modified fallback and rejection logic: unfused attention now
explicitly rejects `softcap` with a clear error message, while
`softmax_precision` is always considered satisfied.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL1096-R1110)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR1179-R1186)

**Testing improvements:**

* Added a new test to verify that `softmax_precision=1` (FLOAT) produces
identical results to the default, since all CUDA backends compute
softmax in FP32.
* Clarified in existing softcap-related tests that certain
configurations are not supported by CUDA unfused attention and require
Flash or MEA; updated test comments for clarity.
[[1]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1088-R1089)
[[2]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1118-R1119)
* Expanded Python test cases for GQA (grouped-query attention) to
include nonzero `softcap` values, increasing coverage of this feature.
[[1]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL613-R613)
[[2]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL648-R648)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…27751)

### Description

Extends the KleidiAI-accelerated `MatMulNBits` path on ARM64 to support
asymmetric 4-bit quantization (models with per-block zero points).
Previously, KleidiAI was only used for symmetric quantization (`!HasZp`
guard), and asymmetric models fell back to a significantly slower
non-KleidiAI kernel.

Since KleidiAI only provides symmetric int4 micro-kernels
(`kai_matmul_clamp_f32_qai8dxp_qsi4c32p`), this PR runs KleidiAI as-if
symmetric (hardcoded `rhs_zero_point=8`) and applies a float-domain
zero-point correction post-GEMM:

$$C_{\text{actual}} = C_{\text{symmetric}} + A_{\text{blksum}} \times
B_{\text{ZpCorr}}^T$$

where:
- `BZpCorr[n, blk] = scale_b[n, blk] × (8 - zp_b[n, blk])` — precomputed
at weight packing time
- `AFloatBlkSum[m, blk] = Σ A_float[m, blk_start..blk_end]` — computed
per inference alongside A quantization

Key changes:
- **`UseKleidiAIBase()`**: New function that checks KleidiAI eligibility
without the `!HasZp` guard. `UseKleidiAI()` now delegates to `!HasZp &&
UseKleidiAIBase()`, preserving symmetric behavior.
- **B packing (`SQ4BitGemmPackQuantBDataAndBlkSum`)**: Computes and
stores `BZpCorr` after KleidiAI packed B data when zero points are
present.
- **Workspace expansion**: Allocates space for `AFloatBlkSum` (M ×
BlockCountK floats) in the per-GEMM workspace for asymmetric models.
- **`ComputeAFloatBlkSum`**: NEON-vectorized (4× unrolled `vaddq_f32`)
function to compute per-block float sums of A.
- **`ApplyBZpCorrection`**: NEON-vectorized correction kernel tiled
4-N-wide (`vfmaq_f32`) for L1-friendly BZpCorr reuse.
- **PrePack**: Computes `BZpCorr` during the scales PrePack (not
zero_points PrePack), since ORT may erase constant inputs after marking
them packed.

No changes to the symmetric path. No changes to x64. No changes to 8-bit
quantization.

### Motivation and Context

Asymmetric 4-bit quantized models (e.g., GPTQ/RTN with zero points) on
ARM64 were **23–72% slower** than their symmetric counterparts because
KleidiAI's `sdot`/`i8mm` micro-kernels only support symmetric RHS,
forcing a fallback to a slower non-KleidiAI kernel path.

This change closes most of that gap:

| Model | Seq Len | Asym/Sym (before) | Asym/Sym (after) | Asym speedup | Asym latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.35× | 1.17× | **1.16×** | 1107.8ms |
| Qwen 1.5B | 512 | 1.23× | 1.06× | **1.14×** | 2259.7ms |
| Qwen 3B | 256 | 1.43× | 1.12× | **1.28×** | 2029.7ms |
| Qwen 3B | 512 | 1.39× | 1.22× | **1.24×** | 4188.0ms |
| Qwen 7B | 256 | 1.61× | 1.11× | **1.52×** | 3661.6ms |
| Qwen 7B | 512 | 1.72× | 1.11× | **1.58×** | 7263.8ms |

The remaining 6–22% asym/sym gap comes from the extra pass over A to
compute float block sums — this cannot be fused into KleidiAI's sealed
A-packing function and would require an upstream KleidiAI API change.
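
The correction identity can be checked on a toy block: dequantizing with the true zero point equals dequantizing as-if-symmetric (zero point 8) plus a per-block `scale * (8 - zp)` term weighted by the float sum of A (values below are made up purely for illustration):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One quantization block (toy values): int4 codes in 0..15, a true zero
# point, a per-block scale, and the float activations hitting this block.
q = [3, 12, 7, 9]
zp, scale = 5, 0.25
a = [1.0, -2.0, 0.5, 3.0]

exact = dot(a, [scale * (v - zp) for v in q])   # true asymmetric dequant
sym = dot(a, [scale * (v - 8) for v in q])      # KleidiAI "as-if zp = 8" result
corr = sum(a) * (scale * (8 - zp))              # AFloatBlkSum * BZpCorr

print(abs(exact - (sym + corr)) < 1e-9)  # True
```

Summed over all K-blocks this is exactly the post-GEMM correction `C_actual = C_symmetric + A_blksum @ B_ZpCorr.T` described above.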
## Description

This PR refactors a set of CUDA provider helpers and kernel call sites
so more of the remaining CUDA ops can build cleanly when the CUDA EP is
compiled as a plugin. The main goal is to reduce direct dependencies on
framework-internal CUDA EP types such as `CUDAExecutionProvider` and
`CudaStream`, and to move reusable CUDA type/handle helpers into
header-visible utilities that are available on both sides of the plugin
boundary.

These changes follow the plugin EP guidance, where the next stage of
enablement is to remove kernel assumptions about provider-only
infrastructure and rely on stream-handle based access instead. Along the
way, the PR also fixes a few small compatibility issues in contrib CUDA
kernels that surfaced while working through the remaining excluded ops.

## Summary of Changes

### Shared CUDA Helper Refactoring

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/cuda_common_type_helpers.h` | Adds
header-only CUDA type conversion and string helper utilities so `.cu`
and plugin builds can consume them without relying on `cuda_common.cc`.
|
| `onnxruntime/core/providers/cuda/cuda_common.h` | Includes
`core/util/math.h` for local half conversion helpers and pulls in the
new header-only helper definitions. |
| `onnxruntime/core/providers/cuda/cuda_common.cc` | Removes the moved
helper implementations and keeps runtime GEMM option initialization
focused on shared state. |
| `onnxruntime/core/providers/cuda/shared_inc/accumulation_type.h` |
Adds a default accumulation type mapping so unsupported specializations
no longer fail at compile time. |
| `onnxruntime/contrib_ops/cuda/math/gemm_float8.cu` | Switches float8
GEMM code to consume the new shared CUDA type helper header. |

### Stream and Handle Access Cleanup

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/cuda_kernel.h` | Adds `Stream*`
overloads for retrieving cuBLAS and cuDNN handles so kernels do not need
to reach into `CudaStream` directly. |
| `onnxruntime/core/providers/cuda/integer_gemm.cc` | Replaces direct
`CudaStream` handle access with `CudaKernel` helper-based cuBLAS
retrieval. |
| `onnxruntime/core/providers/cuda/reduction/reduction_ops.cc` | Reuses
a single cuDNN handle derived from the ORT stream and removes repeated
direct `CudaStream` assumptions. |
| `onnxruntime/core/providers/cuda/reduction/reduction_ops.h` | Stops
caching a `CUDAExecutionProvider*` in reduction kernels, reducing
provider coupling. |
| `onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.cc` | Switches RNN
weight reorganization to retrieve the cuDNN handle through the shared
stream helper. |
| `onnxruntime/core/providers/cuda/tensor/transpose.cc` | Uses the new
stream-based cuBLAS handle helper in transpose fast paths. |

### Kernel-Level Plugin Compatibility Fixes

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/tensor/reshape.cc` | Replaces
`CopyTensor` usage with explicit `cudaMemcpyAsync` on the kernel stream
to avoid plugin-incompatible stream assumptions. |
| `onnxruntime/core/providers/cuda/tensor/reshape.h` | Applies the same
stream-based reshape copy path to the attribute-driven reshape variant.
|
| `onnxruntime/core/providers/cuda/math/clip.h` | Removes the CPU
`Clip_6Base` dependency by inlining min/max attribute handling into the
CUDA kernel. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Stores
present KV outputs once and reuses those tensors instead of re-fetching
them through `context->Output()`. |
|
`onnxruntime/contrib_ops/cuda/quantization/qordered_ops/qordered_qdq.cc`
| Improves shape validation error reporting by emitting the tensor shape
object directly. |

## Motivation and Context

The CUDA plugin EP compiles kernels into a separate shared library and
can only depend on types and helpers that are visible through the EP API
boundary. The background doc for this work calls out several recurring
incompatibility patterns, especially direct use of `CudaStream`, direct
inclusion of real CUDA EP infrastructure headers, and helper
implementations that only exist in provider-owned `.cc` files.

This PR addresses that class of issues for another slice of the
remaining excluded CUDA ops by:

- moving broadly useful CUDA type helpers into header-visible code,
- routing cuBLAS/cuDNN handle lookup through stream-oriented helpers
instead of provider internals,
- removing a CPU-base-class dependency from `Clip_6`, and
- simplifying a few kernel call sites that were still assuming the
non-plugin CUDA EP environment.

Together, these changes make the CUDA provider code more self-contained
and reduce the amount of plugin-specific adaptation needed to bring the
remaining CUDA ops online.

## Checklist

- [ ] Tests added/updated
- [x] Documentation updated (background captured in
`docs/cuda_plugin_ep/cuda_ops_for_plugin_ep.md`)
- [x] No breaking changes
- [ ] CI passes
### Description

This pull request introduces a new graph optimization pass to fuse Add +
SkipLayerNormalization subgraphs into a single SkipLayerNormalization
node that incorporates a bias input. This helps simplify the computation
graph, especially for models using bias after MatMul, and extends
support for more execution providers. The main changes include the
implementation of the new fusion, its integration into the optimizer
pipeline, and updates to provider compatibility.

**New Bias + SkipLayerNormalization Fusion:**

* Added a new `BiasSkipLayerNormFusion` class and implementation to
detect and fuse subgraphs where a 1D bias is added to a MatMul
(optionally through a Cast) before SkipLayerNormalization, replacing
them with a single node that absorbs the bias as a fifth input.

**Integration into Optimization Pipeline:**

* Registered the new `BiasSkipLayerNormFusion` in the graph transformer
utility, ensuring it runs after the standard SkipLayerNorm fusion and
covers more execution providers (CPU, ACL, CUDA, DML, JS, WebGPU).

**Test and Include Updates:**

* Updated test and implementation files to include the new fusion header
where relevant.

### Motivation and Context

These changes collectively improve model optimization by reducing node
count and improving runtime efficiency for supported providers.

This PR also helps perform this fusion on many models inside the Foundry
Local catalog without needing to re-deploy models.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…1→22) (microsoft#27733)

### Description

Extends CUDA kernel registrations for `GlobalAveragePool` and
`GlobalMaxPool` from opset 1 only to the full opset 1–22 range. Follows
the same pattern used for `MaxPool` in microsoft#27715.

- **`core/providers/cuda/nn/pool.cc`** — Split single opset-1
registrations into versioned 1–21 + opset 22 for both NCHW and NHWC
variants
- **`core/providers/cuda/cuda_execution_provider.cc`** — Updated class
declarations and `BuildKernelCreateInfo` entries (versioned 1–21, added
opset 22)
- **`core/providers/cuda/cuda_nhwc_kernels.cc`** — Same for NHWC kernel
registrations
- **`test/providers/cpu/nn/pool_op_test.cc`** — Added
`GlobalAveragePool_22_CUDA` test
- **`docs/OperatorKernels.md`** — Updated GlobalAveragePool and
GlobalMaxPool entries from `1+` to `22+` / `[1, 21]` in both the ai.onnx
and com.microsoft.internal.nhwc domains under CUDAExecutionProvider

No functional changes to the kernel implementations—opsets 1 through 22
are spec-compatible for these ops.

### Motivation and Context

`GlobalAveragePool` and `GlobalMaxPool` were registered at opset 1 only
in the CUDA provider, creating a 21-version gap to the latest ONNX opset
22. Models exported at higher opsets would fail to find a matching CUDA
kernel. Identified as P1 gaps in microsoft#27729.

### Limitations

BF16 support for GlobalAveragePool-22 and GlobalMaxPool-22 is not added
in this PR.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
edgchen1 and others added 11 commits March 20, 2026 09:42
### Description

Fix some issues that show up as test failures in
`js/web/test/data/ops/dequantizelinear.jsonc`.

1. When `component=4`, output shapes where the last dimension was not
divisible by `component` were not handled.
`onnxruntime/core/providers/webgpu/program.cc:247 TensorShape
onnxruntime::webgpu::(anonymous namespace)::GetReducedShape(const
TensorShape &, int) shape.NumDimensions() > 0 &&
shape.GetDims()[shape.NumDimensions() - 1] % component == 0 was false.
Cannot reduce shape {2,2} by component=4`
Added `ProgramOutput::Flatten` to the output definition to address this.

2. Fix handling of zero point in blocked quantization path.

Also renamed some test cases with more descriptive names.

### Motivation and Context

Fix some issues with WebGPU DequantizeLinear op implementation.
### Description
React Native is currently limited by network isolation.


…crosoft#27549)

### Description

Fixes float16 tensor support in the React Native binding by mapping
`ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16` to `Uint16Array` instead of
`Float16Array`.

[Hermes does not support
`Float16Array`](https://github.com/facebook/hermes/blob/main/include/hermes/VM/TypedArrays.def).
When the binding tries to construct a `Float16Array` via the JSI global
lookup, it gets `undefined` and crashes.

This is the React Native equivalent of microsoft#27327 (the Node.js fix for the
same issue, merged Feb 2026).

### Motivation

Any React Native app using Hermes (the default since RN 0.70) using a
model with float16 inputs/outputs crashes at runtime on both iOS and
Android.

### Changes

- `js/react_native/cpp/TensorUtils.cpp`: Change `"Float16Array"` to
`"Uint16Array"` in `dataTypeToTypedArrayMap`

Resolves microsoft#27548
### Description

Extends GRU CUDA kernel registration from opset 14 to opset 22,
following the same pattern as other recent opset gap fills (e.g.,
ConvTranspose in microsoft#27710).

- **`gru.cc`**: Cap existing opset-14 non-versioned kernel to versioned
14–21; add new non-versioned kernel at opset 22+
- **`cuda_execution_provider.cc`**: Update forward declarations and
`BuildKernelCreateInfo` entries for versioned 14–21 and non-versioned
22+
- **`deep_cpu_gru_op_test.cc`**: Add CUDA-specific test for GRU at opset
22 with `linear_before_reset=1` (cuDNN requirement)
- **`docs/OperatorKernels.md`**: Update CUDA provider GRU entry to
reflect `22+`, `[14, 21]`, and `[7, 13]` version ranges

No functional changes to the kernel implementation—the GRU spec is
unchanged between opsets 14 and 22.

### Motivation and Context

CUDA EP registered GRU only up to opset 14, while ONNX defines GRU
through opset 22. Models exported at opset ≥15 would fail to find a
matching CUDA kernel and fall back to CPU. This is one of the P1 gaps
tracked in microsoft#27729.

### Limitation

BF16 version is not added for GRU-22. It can be added later if needed.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Add bounds checking for label tensor values in
`SparseSoftmaxCrossEntropy::Compute` to prevent out-of-bounds memory
reads.


The `SparseSoftmaxCrossEntropy` operator uses `label_data[i]` (int64_t)
directly as an array index into the log-probability buffer without
validating that the value falls within `[0, D)` where `D` is the number
of classes. A malicious ONNX model can embed arbitrary label values in a
model initializer, causing the operator to read heap memory beyond the
log-probability buffer.

Affected expressions in `cross_entropy.cc`:
```cpp
loss_sample[i] = -log_prob_data[i * d + label_data[i]] * weight_data[i];  // weighted path
loss_sample[i] = -log_prob_data[i * d + label_data[i]];                   // unweighted path
```

Existing shape validation confirms label and logit dimensions are
compatible, but never validates label **values** against the class
dimension.

### Fix

Added a validation loop before the loss computation that returns an
error status if any label value is outside `[0, D)`:

```cpp
for (ptrdiff_t i = 0; i < n; i++) {
  ORT_RETURN_IF(label_data[i] < 0 || label_data[i] >= d,
                "SparseSoftmaxCrossEntropy: label value ", label_data[i],
                " at index ", i, " is out of range [0, ", d, ")");
}
```
…#27787)

Bumps [flatted](https://github.com/WebReflection/flatted) from 3.2.7 to
3.4.2.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/WebReflection/flatted/commit/3bf09091c3562e17a0647bc06710dd6097079cf7"><code>3bf0909</code></a>
3.4.2</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/885ddcc33cf9657caf38c57c7be45ae1c5272802"><code>885ddcc</code></a>
fix CWE-1321</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/0bdba705d130f00892b1b8fcc80cf4cdea0631e3"><code>0bdba70</code></a>
added flatted-view to the benchmark</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/2a02dce7c641dec31194c67663f9b0b12e62da20"><code>2a02dce</code></a>
3.4.1</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/fba4e8f2e113665da275b19cd0f695f3d98e9416"><code>fba4e8f</code></a>
Merge pull request <a
href="https://redirect.github.com/WebReflection/flatted/issues/89">#89</a>
from WebReflection/python-fix</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/5fe86485e6df7f7f34a07a2a85498bd3e17384e7"><code>5fe8648</code></a>
added &quot;when in Rome&quot; also a test for PHP</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/53517adbefe724fe472b2f9ebcdb01910d0ae3f0"><code>53517ad</code></a>
some minor improvement</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/b3e2a0c387bf446435fec45ad7f05299f012346f"><code>b3e2a0c</code></a>
Fixing recursion issue in Python too</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/c4b46dbcbf782326e54ea1b65d3ebb1dc7a23fad"><code>c4b46db</code></a>
Add SECURITY.md for security policy and reporting</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/f86d071e0f70de5a7d8200198824a3f07fc9c988"><code>f86d071</code></a>
Create dependabot.yml for version updates</li>
<li>Additional commits viewable in <a
href="https://github.com/WebReflection/flatted/compare/v3.2.7...v3.4.2">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=flatted&package-manager=npm_and_yarn&previous-version=3.2.7&new-version=3.4.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Loop, Scan, ConstantOfShape, Size (microsoft#27728)

## Summary
- Extend CUDA EP opset 21/23 kernel registrations to 7 additional
operators that were updated in ONNX opset 21 but lacked proper CUDA
kernel version declarations
- Operators fixed: **Flatten**, **Identity**, **If**, **Loop**,
**Scan**, **ConstantOfShape**, **Size**
- Follows the identical pattern established in PR microsoft#26075 for Shape,
Reshape, Transpose, Squeeze, Unsqueeze

## Motivation
Fixes microsoft#27102.

When ONNX introduces a new operator version in opset 21, ORT's
`VerifyVersion` function in `kernel_registry.cc` rejects non-versioned
(open-ended) CUDA kernels. The check at
[kernel_registry.cc:L126-L133](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/kernel_registry.cc#L126)
requires either an exact version match or a bounded version range — a
kernel registered as `since_version=N, end_version=INT_MAX` fails when
`since_ver` (from the opset 21 schema) differs from `N`.

This causes the affected operators to fall back from CUDA to CPU,
introducing unnecessary host↔device memory copies. On Windows with CUDA
EP, this fallback path can produce corrupted shape computation values
(e.g., `124647109376` instead of `6`), leading to downstream Reshape
failures.

PR microsoft#26075 fixed this for Shape, Reshape, Transpose, Squeeze, and
Unsqueeze. This PR extends the same fix to the 7 remaining operators
that were updated in ONNX opset 21 and had non-versioned CUDA kernels.
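The version check described above can be sketched as a small Python model (a deliberate simplification of the C++ logic in `kernel_registry.cc`; names here are illustrative, not ORT's actual API):

```python
INT_MAX = 2**31 - 1

def verify_version(kernel_start, kernel_end, op_since_ver):
    """Simplified model of the VerifyVersion check: a kernel matches only on
    an exact start-version match or a bounded range covering the op version."""
    if kernel_start == op_since_ver:
        return True
    return kernel_end != INT_MAX and kernel_start <= op_since_ver <= kernel_end

# An open-ended kernel (since_version=14, end_version=INT_MAX) is rejected
# once the op schema's since_ver moves to 21:
open_ended_rejected = not verify_version(14, INT_MAX, 21)

# A bounded kernel whose range covers opset 21 matches:
bounded_accepted = verify_version(21, 22, 21)
```

Under this model, any kernel registered with an open upper bound only ever matches the exact schema version it was registered against, which is why each new ONNX opset bump requires capping the old kernel and adding a new registration.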

## Changes
For each of the 7 operators:
1. **Cap existing non-versioned kernel** to opset 20
(`ONNX_OPERATOR_KERNEL` → `ONNX_OPERATOR_VERSIONED_KERNEL`)
2. **Add VERSIONED(21, 22) kernel** with identical type constraints
3. **Add non-versioned opset 23 kernel** for forward compatibility
(opset 23 introduced another schema update for these operators)

Files modified:
- `onnxruntime/core/providers/cuda/cuda_execution_provider.cc` — forward
declarations + `BuildKernelCreateInfo` registration
- `onnxruntime/core/providers/cuda/tensor/flatten.cc`
- `onnxruntime/core/providers/cuda/tensor/identity_op.cc`
- `onnxruntime/core/providers/cuda/tensor/size.cc`
- `onnxruntime/core/providers/cuda/generator/constant_of_shape.cc`
- `onnxruntime/core/providers/cuda/controlflow/if.cc`
- `onnxruntime/core/providers/cuda/controlflow/loop.cc`
- `onnxruntime/core/providers/cuda/controlflow/scan.cc`
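Under the same simplified matching rules, the resulting three-tier registration can be modeled as picking the first kernel whose version range satisfies the check (an illustrative sketch with made-up legacy version numbers, not ORT's registry code):

```python
INT_MAX = 2**31 - 1

def matches(kernel, op_since_ver):
    start, end = kernel
    if start == op_since_ver:
        return True
    return end != INT_MAX and start <= op_since_ver <= end

# The fixed registration set for one operator: capped legacy kernel,
# a VERSIONED(21, 22) kernel, and an open-ended opset-23 kernel.
kernels = [(13, 20), (21, 22), (23, INT_MAX)]

def resolve(op_since_ver):
    for k in kernels:
        if matches(k, op_since_ver):
            return k
    return None  # no CUDA kernel -> falls back to CPU

resolved = {v: resolve(v) for v in (13, 20, 21, 23)}
```

Every opset from the legacy range through 23 now resolves to a CUDA kernel, so the CPU fallback (and its host-device copies) is avoided.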

## Test Plan
- [ ] Verify CUDA EP build compiles successfully (CI)
- [ ] Existing opset 21 tests for Shape/Reshape/Squeeze/Unsqueeze pass
(validates the pattern)
- [ ] Verify operators are no longer falling back to CPU when running
opset 21 models on CUDA
- [ ] No regression in existing CUDA EP tests
### Description
Fuses a pattern of 4 Slices + 1 Concat into a single SpaceToDepth op (plus 1
Gather if the channel order doesn't match the expected default pattern). Saves
about 1ms on Yolox_tiny, a sub-15ms model on an AVX512 machine, so it is a
meaningful saving.


### Motivation and Context
Improves performance especially when the SpaceToDepth operation occurs
in a cheap real-time model

Original SpaceToDepth pattern in the model (4 Slices + 1 Concat incur many
memory transactions that a single optimized kernel can replace):

<img width="389" height="213" alt="image"
src="https://github.com/user-attachments/assets/670be33e-84cf-4235-87dd-4086fc43d5f8"
/>
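The equivalence behind the fusion can be checked numerically. Below is a minimal pure-Python sketch for one channel and blocksize 2, assuming ONNX's SpaceToDepth channel ordering of (row_offset, col_offset); the YOLO-style slice order differs, which is where the extra Gather (a channel permutation) comes in:

```python
# One input channel, 4x4 spatial (C=1, H=4, W=4), blocksize 2.
x = [[[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]]

def strided_slice(x, r0, c0):
    # x[:, r0::2, c0::2] -- one of the four Slice nodes in the original pattern.
    return [[row[c0::2] for row in ch[r0::2]] for ch in x]

def space_to_depth2(x):
    # ONNX SpaceToDepth, blocksize 2: output channel index is
    # (row_offset * 2 + col_offset) * C + c.
    h2, w2 = len(x[0]) // 2, len(x[0][0]) // 2
    return [[[ch[2 * h + r][2 * w + c] for w in range(w2)] for h in range(h2)]
            for r in (0, 1) for c in (0, 1) for ch in x]

# YOLO-style Focus: 4 Slices concatenated on the channel axis in
# (0,0), (1,0), (0,1), (1,1) order.
yolo = (strided_slice(x, 0, 0) + strided_slice(x, 1, 0)
        + strided_slice(x, 0, 1) + strided_slice(x, 1, 1))

# One Gather (channel permutation) maps SpaceToDepth's channel order onto
# the YOLO slice order.
permuted = [yolo[i] for i in (0, 2, 1, 3)]
equal = permuted == space_to_depth2(x)
```

This is why the fusion needs at most one extra Gather: the four strided slices recover exactly the SpaceToDepth channels, just possibly in a different order.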

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#27691)

### Description

2 main changes:

1) Handle activations seen in modern CNNs (`QuickGelu`, `Gelu`) in the NCHWc
transformer, and avoid inserting reorder nodes before and after them to perform
the NCHWc <-> NCHW data layout transforms. These reorders can be avoided
because the ops are elementwise and therefore data-layout agnostic.

2) Rewrites a channel-scaling Mul (with a scale input of shape 1,C,1,1 or
C,1,1) into a depthwise NCHWc Conv operation. This avoids reorder nodes
and enables fusion of any subsequent `Add` operations into the new
`Conv` node.
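The rewrite in point 2 relies on the identity that a per-channel scale is exactly a 1x1 depthwise convolution whose weights are the scale values. A minimal pure-Python check (illustrative only, not the NCHWc transformer's code):

```python
# Input: C=2 channels, 2x2 spatial; per-channel scale of shape (C, 1, 1).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
scale = [0.5, 2.0]

# The Mul node: broadcast multiply by the channel scale.
mul_out = [[[v * scale[c] for v in row] for row in x[c]] for c in range(len(x))]

# Depthwise 1x1 Conv (groups=C): each output channel is its own input channel
# convolved with a single 1x1 weight -- set that weight to the scale value.
weights = [[[0.5]], [[2.0]]]  # shape (C, 1, 1)

def depthwise_conv_1x1(x, w):
    # out[c][i][j] = x[c][i][j] * w[c][0][0]
    return [[[x[c][i][j] * w[c][0][0] for j in range(len(x[c][i]))]
             for i in range(len(x[c]))] for c in range(len(x))]

conv_out = depthwise_conv_1x1(x, weights)
equal = mul_out == conv_out
```

Because the two produce identical outputs, the Mul can be folded into the NCHWc Conv path without changing the model's numerics.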


### Motivation and Context
Avoid unnecessary data layout operations and enable more NCHWc
compatible compute and fusions
## Summary

- Move `tf.function`-decorated forward functions out of the inner
benchmark loop to prevent unnecessary graph retracing on every
`(batch_size, sequence_length)` iteration
- Update deprecated `experimental_compile` to `jit_compile` (available
since TF 2.4)
- Hoist `import random` out of the inner loop

Fixes microsoft#14953

## Motivation

When `run_with_tf_optimizations` is used as a decorator inside the
innermost `(batch_size, sequence_length)` loop, each iteration creates a
new Python function object. Since `tf.function` keys its trace cache on
function identity, a new object means a forced retrace every iteration —
the cached graph is never reused. This defeats the purpose of
`tf.function` and adds significant overhead from repeated graph
construction and optimization passes.

The [TensorFlow documentation on
tracing](https://www.tensorflow.org/guide/function#rules_of_tracing)
explicitly warns against defining `tf.function`-decorated functions
inside loops.
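The retracing failure mode can be illustrated without TensorFlow: model a trace cache keyed on the function object (which is how `tf.function` keys its cache) plus the input signature, and compare loop-local vs hoisted definitions. This is a hedged stand-in, not TF's actual implementation:

```python
cache = {}
traces = 0

def maybe_trace(fn, input_shape):
    # Stand-in for tf.function's cache lookup: keyed on the function object
    # itself plus the input signature.
    global traces
    key = (fn, input_shape)
    if key not in cache:
        traces += 1          # a "retrace": a new graph would be built here
        cache[key] = fn
    return cache[key]

# Anti-pattern: defining the function inside the loop creates a new object
# each iteration, so the cache never hits.
for _ in range(3):
    def forward(x):
        return x
    maybe_trace(forward, (1, 128))
loop_local_traces = traces

# Hoisted: one function object, traced once per distinct input shape.
traces = 0
cache.clear()
def forward(x):
    return x
for _ in range(3):
    maybe_trace(forward, (1, 128))
hoisted_traces = traces
```

Three iterations cost three traces in the loop-local version but only one when the function is hoisted, which is the entire benefit of the refactor.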

## Changes

**`onnxruntime/python/tools/transformers/benchmark.py`** (1 file, ~35
insertions / ~31 deletions):

1. **Hoisted forward function definitions** (`encoder_forward`,
`encoder_decoder_forward`, `lxmert_forward`) from the inner `batch_size
× sequence_length` loop to the per-model scope. They are now defined
once per model, and the `@run_with_tf_optimizations` decorator (which
applies `@tf.function`) is only invoked once per model.

2. **Changed forward functions to accept `input_ids` as a parameter**
instead of closing over the loop variable. This lets `tf.function` trace
based on the tensor's `(dtype, shape)` spec and reuse cached concrete
functions when shapes repeat across iterations.

3. **Updated `experimental_compile=use_xla`** to
**`jit_compile=use_xla`**. The `experimental_compile` parameter was
deprecated in TF 2.4 (Dec 2020) and removed in TF 2.12.

4. **Moved `import random`** from the innermost loop body to before the
outer model loop — the module only needs to be imported once.

5. **Moved inference function selection** (`if config.is_encoder_decoder
... elif isinstance(config, LxmertConfig) ...`) outside the
batch/sequence loops since it depends only on the model config, not on
batch size or sequence length. The original priority order
(`is_encoder_decoder` checked before `LxmertConfig`) is preserved.
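Change 2 (passing `input_ids` as a parameter) matters because the cache then keys on the input's shape and hits whenever shapes repeat across iterations. A sketch using a hypothetical `jit_stub` decorator standing in for `tf.function`:

```python
traced_shapes = []

def jit_stub(fn):
    # Hypothetical stand-in for tf.function: trace once per input shape,
    # reuse the cached "graph" afterwards.
    cache = {}
    def wrapper(input_shape):
        if input_shape not in cache:
            traced_shapes.append(input_shape)  # a new trace
            cache[input_shape] = fn
        return cache[input_shape](input_shape)
    return wrapper

@jit_stub
def encoder_forward(input_ids_shape):
    # Placeholder forward pass; real code would run the model here.
    return input_ids_shape

# Defined once per model, then reused across the batch/sequence sweep:
for batch_size in (1, 4):
    for sequence_length in (32, 128):
        for _ in range(3):  # repeated measurement runs at this shape
            encoder_forward((batch_size, sequence_length))

num_traces = len(traced_shapes)  # one per distinct (batch, seq) pair
```

Twelve calls trigger only four traces, one per distinct shape, instead of a retrace on every call.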

## Test Plan

- [x] `lintrunner -a` passes cleanly (no RUFF or RUFF-FORMAT violations)
- [x] `python -m py_compile benchmark.py` — syntax verified
- [x] Change is purely structural — function behavior (inputs, outputs,
control flow) is identical
- [ ] Manual verification with TensorFlow installed (TF is an optional
dependency not present in the standard CI matrix; this code path is
exercised via `python benchmark.py -e tensorflow`)
@ankitm3k ankitm3k merged commit f153255 into ovep-develop Mar 23, 2026
7 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_23032026 branch March 23, 2026 07:49