
Sync with Microsoft ONNX Runtime - 23032026 #986

Merged
ankitm3k merged 41 commits into ovep-develop from
sync_msft_23032026
Mar 23, 2026

Conversation

@ai-fw-intg

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

patryk-kaiser-ARM and others added 30 commits March 17, 2026 10:13
…icrosoft#26773)

**Description**
This PR integrates Arm® KleidiAI™ SME2 BF16 kernel through MLAS SBGEMM
path.

Rework of microsoft#24346

**Motivation and Context**
This kernel provides performance improvements on SME-enabled devices.

---------

Signed-off-by: Patryk Kaiser <patryk.kaiser@arm.com>
Upgrading the dependency to resolve CVE-2026-27904, which is raising
component governance issues in internal Microsoft builds of ORT.

Co-authored-by: Kevin Taha <kevintaha@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#27694)

### Description

Fix the conditions guarding the following definitions:
- DAWN_ENABLE_VULKAN
- DAWN_ENABLE_D3D12
…#27688)

This deletes three per-head-size .cu files and merges their content into
a single file to avoid cross-file dependencies during CUDA compilation.

Currently, the masked_multihead_attention_kernel template is implemented
in decoder_masked_multihead_attention_impl.cu. The other three .cu files
use the masked_multihead_attention_kernel template but do not include
the implementation. That causes problems when they are built in the CUDA
plugin EP.
microsoft#27671)

## Description

This PR fixes longstanding MLAS issues that were causing
`NhwcTransformerTests.*` and `QDQTransformerTests.*` failures in
quantized convolution paths (see
microsoft#27670). The failures
were not in the graph transformers themselves; they came from incorrect
qgemm dispatch selection and broken backend kernel behavior in specific
AVX2-VNNI and AMX paths.

The fix removes incorrect `U8U8` dispatch upgrades, avoids a broken
AVX2-VNNI row-panel fallback, and corrects the AMX `U8S8` 32-row kernel
path. It also adds MLAS regression coverage for the conv-shaped qgemm
dimensions that exposed the problems.

## Summary of Changes

### Dispatch Selection Fixes

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/platform.cpp` | Remove three incorrect assignments that upgraded `GemmU8U8Dispatch` to `U8S8` dispatch objects in the AVXVNNI, AVX512VNNI, and AMX feature paths. |

### AVX2-VNNI Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_avx2.cpp` | Reduce `StrideM` from `6` to `4` for the `U8U8`, `S8S8`, and `S8U8` AVX2-VNNI qgemm dispatch objects so they never enter the legacy `>4` row fallback path. |

### AMX Kernel Fix

| File | Change |
|------|--------|
| `onnxruntime/core/mlas/lib/qgemm_kernel_amx.cpp` | Replace the broken pipelined `CountM >= 32` `U8S8` AMX fast path with the same per-K tile update pattern already used by the working smaller-row path. |

### Regression Coverage

| File | Change |
|------|--------|
| `onnxruntime/test/mlas/unittest/test_qgemm_fixture.h` | Add MLAS qgemm regression cases for conv-like shapes `6x30x207` and `169x30x207` in packed/non-packed and int32 or fp32 variants. |

## Root Cause

There were three separate MLAS correctness issues:

1. `platform.cpp` was incorrectly overwriting `GemmU8U8Dispatch` with
`U8S8` dispatch objects when newer CPU features were detected. That
caused `U8U8` conv workloads to run through the wrong dispatch path.
2. The AVX2-VNNI qgemm dispatch objects advertised an `M` stride of `6`,
but the assembly kernel only handled VNNI packing safely up to 4 rows.
For 5- or 6-row panels it fell back to an older AVX2 path with
incompatible packing and sign assumptions.
3. The AMX `U8S8` qgemm kernel had a bug in its `CountM >= 32` fast
path. The smaller-row AMX path was correct, but the 32-row pipelined
update logic produced wrong accumulators for conv-shaped workloads and
caused the remaining QDQ/NHWC failures on AMX-capable hosts.
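
The stride interaction in issue 2 can be illustrated with a small sketch (a hypothetical helper, not the MLAS code): with `StrideM = 6`, a 6-row job is handed to the kernel as a single 6-row panel and falls into the legacy `>4`-row fallback; with `StrideM = 4`, it is split into panels of at most 4 rows and stays on the safe path.

```python
def split_rows(m, stride_m):
    """Split an M-row GEMM job into row panels of at most stride_m rows."""
    panels = []
    row = 0
    while row < m:
        panels.append(min(stride_m, m - row))
        row += panels[-1]
    return panels

assert split_rows(6, 6) == [6]     # one 6-row panel -> legacy >4-row fallback
assert split_rows(6, 4) == [4, 2]  # panels of <= 4 rows avoid the fallback
```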

## Why This Fix

- The `platform.cpp` cleanup restores the intended `U8U8` dispatch
selection on feature-rich x86 hosts.
- The AVX2-VNNI stride change is a targeted mitigation that avoids the
known-bad legacy fallback until that assembly path is corrected.
- The AMX kernel change keeps the AMX `U8S8` dispatch enabled, but
replaces the broken 32-row implementation with a proven update pattern
that matches the working smaller-row path.
- The new MLAS regression tests cover the exact conv-derived qgemm
shapes that exposed the bug, so future dispatch or kernel changes will
fail at the MLAS layer before surfacing as transformer test regressions.

## Testing

- `cd build/cuda/Release && ./onnxruntime_mlas_test
--gtest_filter='QGemmU8S8_*169xN30xK207*:*QGemmU8S8_*6xN30xK207*'`
- `cd build/cuda/Release && ./onnxruntime_test_all
--gtest_filter='NhwcTransformerTests.*:QDQTransformerTests.*'`
- Verified that the filtered transformer suite passes with AMX `U8S8`
dispatch enabled.

## Motivation and Context

These test failures had been present for a long time and were initially
attributed to transformer rewrites because they surfaced in NHWC and QDQ
test suites. Investigation showed that the optimized graphs were
structurally correct and that the failures came from lower-level MLAS
qgemm execution instead. Fixing the behavior in MLAS is the right layer
because it restores correctness for both direct qgemm coverage and
higher-level quantized conv paths.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes
- [x] CI passes
## Description

This PR fixes clang-specific build failures that show up in both the
standalone clang build and the CUDA clang build. It keeps the
build-system changes targeted, prefers source fixes where the warnings
indicate real type or declaration issues, and avoids broader warning
suppression than necessary for the CUDA provider target.

## Summary of Changes

### Build System

| File | Change |
|------|--------|
| `cmake/CMakeLists.txt` | Stop forwarding `-Wshorten-64-to-32` through CUDA host compilation where the GNU host compiler does not recognize it. |
| `cmake/onnxruntime_providers_cuda.cmake` | Add targeted clang `-Wno-error` handling for warning classes that are currently triggered by CUDA provider code and third-party CUDA headers under clang. |

### CPU / Common clang fixes

| File | Change |
|------|--------|
| `onnxruntime/core/common/cpuid_info.cc` | Replace the clang-incompatible `__builtin_cpu_supports("waitpkg")` path with the CPUID-bit check for TPAUSE detection. |
| `onnxruntime/test/framework/allocation_planner_test.cc` | Refactor `typeid` assertions to avoid clang's potentially-evaluated-expression warning while keeping test coverage unchanged. |

### CUDA provider and contrib fixes

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/utils/dump_cuda_tensor.h` | Mark the `IConsoleDumper` overrides explicitly while leaving CUDA-only overloads unchanged. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Use `template` on the dependent `GetAttrOrDefault` call so clang parses it correctly. |
| `onnxruntime/contrib_ops/cuda/bert/flash_attention/flash_api.cc` | Make narrowing conversions to flash-attention parameter fields explicit. |
| `onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc` | Make the `nbits_` conversion explicit when calling the CUDA helper. |
| `onnxruntime/contrib_ops/cuda/quantization/moe_quantization.cc` | Restrict the GCC-only warning pragma so clang does not treat it as an unknown warning option. |
| `onnxruntime/contrib_ops/cuda/transformers/generation_device_helper.cc` | Fix explicit state-field assignments to use the actual `int` field type. |
| `onnxruntime/core/providers/cuda/cuda_mempool_arena.h` | Remove an unused private field that clang flagged in the CUDA provider build. |

## Testing

Tested CPU and CUDA 12.8 builds in Azure Linux with
- clang 18.1.8
- gcc 13.2
- cmake 4.2.3

Example for CPU build:
```
export CC=clang
export CXX=clang++
bash build.sh --config RelWithDebInfo --parallel --cmake_extra_defines onnxruntime_BUILD_UNIT_TESTS=ON
```
## Motivation and Context

Clang is stricter than GCC/MSVC in a few areas that affect this tree:
CUDA host flag forwarding, explicit narrowing, dependent template
parsing, warnings emitted from third-party CUDA headers, and RTTI/typeid
expressions in tests. The goal here is to keep the staged fix minimal
and maintainable by correcting real source issues where practical and
confining warning downgrades to the CUDA provider target where
third-party header noise is currently unavoidable.
This pull request addresses a critical validation gap in the "reflect"
mode of the `Pad` operator for both CPU and CUDA backends, ensuring
compliance with the ONNX specification and preventing out-of-bounds
memory access. The main change is the addition of checks that prevent
the pad size from exceeding the maximum allowed value (`extent - 1`) for
each axis, and the introduction of comprehensive regression tests to
verify the new behavior.

Validation fixes for reflect-mode padding:

* Added explicit checks in
`onnxruntime/core/providers/cpu/tensor/pad.cc` and
`onnxruntime/core/providers/cuda/tensor/pad.cc` to ensure that, for
reflect mode, both pre-pad and post-pad values do not exceed `extent -
1` for each axis, as required by the ONNX spec. This prevents heap
out-of-bounds errors and aligns with numpy behavior.
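
The reflect-mode constraint can be sketched as a small standalone check (illustrative Python using the ONNX `pads` layout of all begin values followed by all end values; not the ORT implementation):

```python
def check_reflect_pads(shape, pads):
    """Reject reflect-mode pads exceeding extent - 1 on any axis.

    `pads` uses the ONNX Pad layout: [x1_begin, x2_begin, ...,
    x1_end, x2_end, ...]. Matches numpy's np.pad(mode="reflect") limit.
    """
    rank = len(shape)
    for axis, extent in enumerate(shape):
        for pad in (pads[axis], pads[rank + axis]):
            if pad > extent - 1:
                raise ValueError(
                    f"reflect pad {pad} exceeds limit {extent - 1} on axis {axis}")

check_reflect_pads([4, 3], [3, 2, 1, 0])  # every pad <= extent - 1: accepted
```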

Testing and regression coverage:

* Added a suite of regression tests in
`onnxruntime/test/providers/cpu/tensor/pad_test.cc` to verify that
invalid pad sizes in reflect mode are correctly rejected, including edge
cases for 1D and 2D inputs, boundary conditions, and scenarios with
slicing. These tests ensure that the operator fails gracefully when pad
sizes exceed the allowed limit and succeeds when within bounds.

Other changes:

* Minor file encoding update in
`onnxruntime/test/providers/cpu/tensor/pad_test.cc`.

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
microsoft#27578)

## Summary
This PR adds CUDA support for optimized **nearest-neighbor 3D resize
mapping/execution** in the Resize operator path, and adds targeted
regression coverage.

The implementation introduces a dedicated 3D fast path for nearest
resize to handle the last three spatial dimensions (`D/H/W`) efficiently
when outer dimensions are unchanged.

## What Changed

### CUDA Resize implementation
File: `onnxruntime/core/providers/cuda/tensor/resize_impl.cu`

- Added 3D nearest mapping kernel:
  - `_ResizeNearestMappingKernel3D`
- Added 3D nearest compute kernel:
  - `_ResizeNearestKernel3D`
- Added optimized 3D dispatch path in `ResizeNearestImpl`:
  - Enabled when:
    - `rank >= 3`
    - `coordinate_transformation_mode != tf_crop_and_resize`
    - all outer scales (except last 3 dims) are `1.0`

This keeps existing behavior unchanged for other cases while using the
optimized path for true 3D nearest resize workloads.
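
The dispatch condition above can be mirrored in a short sketch (illustrative only; names are not the ORT identifiers):

```python
def use_optimized_3d_nearest(rank, coord_mode, scales):
    """Eligibility check for the dedicated 3D nearest-resize fast path."""
    if rank < 3 or coord_mode == "tf_crop_and_resize":
        return False
    # all scales except the last three dims (D/H/W) must be exactly 1.0
    return all(s == 1.0 for s in scales[:-3])

assert use_optimized_3d_nearest(5, "asymmetric", [1.0, 1.0, 2.0, 2.0, 2.0])
assert not use_optimized_3d_nearest(5, "tf_crop_and_resize", [1.0, 1.0, 2.0, 2.0, 2.0])
assert not use_optimized_3d_nearest(5, "asymmetric", [2.0, 1.0, 2.0, 2.0, 2.0])
```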

### Regression tests
File: `onnxruntime/test/providers/cpu/tensor/resize_op_test.cc`

Added CUDA-targeted regression tests:
- `ResizeOpNearestUpSampleTest_5D_CudaRegression_Optimized3DMapping`
- `ResizeOpNearestDownSampleTest_5D_CudaRegression_Optimized3DMapping`

## Why
The previous nearest implementation relied on the generic path for these
3D scenarios. This change introduces a dedicated CUDA 3D path to improve
performance for 5D nearest resize workloads.

Fixes microsoft#14596
…bugs (microsoft#27692)

### Description

This PR adds fp16 (half-precision) support for 8-bit MatMulNBits on
ARM64 NEON and fixes several pre-existing bugs discovered during
testing.

**New features:**
- **HQNBIT_CompFp16 for 8-bit:** Added
`HQ8BitGemmPackQuantBData_CompFp16` and
`HQ8BitBlkDequantBForHgemm_CompFp16` NEON kernels that pack and
dequantize 8-bit quantized weights for fp16 GEMM. Reuses the existing
`HQ4BitGemmKernel_CompFp16` for the actual compute since the dequantized
B matrix has the same layout.
- **HQNBIT_CompInt8 for 4-bit:** Added accuracy level 4 (int8 compute)
support for fp16 4-bit MatMulNBits. Converts fp16 activations to fp32,
then uses the existing SQ4Bit int8 kernels.
- **HQNBIT_CompInt8 for 8-bit:** Added accuracy level 4 (int8 compute)
support for fp16 8-bit MatMulNBits. Converts fp16 scales to fp32 for
packing, then uses the existing SQ8Bit int8 kernels.

**Bug fixes:**
- **Bias offset bug in CompFp16 (Windows ARM multithreading):** Fixed
missing `+ RangeStartN` when initializing `Bias` pointer in
`HQ4BitGemm_CompFp16` and `HQ8BitGemm_CompFp16`. This caused incorrect
results when using multiple threads, as worker threads processing column
ranges beyond the first would read bias values from the wrong offset.
- **QuantBDataWorkspace not set for MLFloat16 fallback (macOS ARM
crash):** Removed `#ifdef MLAS_TARGET_AMD64_IX86` guard around setting
`QuantBDataWorkspace` in `ComputeBPacked<MLFloat16>`, so macOS ARM
(which uses the fp32 fallback path) correctly sets the workspace pointer
for SQNBIT_CompInt8.
- **Scale/ZP packing skipped on non-x64 in MLFloat16 PrePack (macOS ARM
gibberish):** Removed `#ifdef MLAS_TARGET_AMD64_IX86` guard around the
SQNBIT_CompInt8 scale and zero-point packing in the
`MatMulNBits<MLFloat16>::PrePack` specialization. Added `nbits_ == 8`
condition to match the generic template's behavior on ARM (only 8-bit
needs separate scale packing on ARM, while x64 needs it for both 4-bit
and 8-bit).
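
The bias offset bug can be pictured with a minimal sketch (hypothetical helper, not the MLAS kernel): each worker thread handles a column range `[RangeStartN, RangeStartN + RangeCountN)` and must offset its bias reads accordingly.

```python
def worker_bias_slice(bias, range_start_n, range_count_n):
    """Bias values seen by a worker handling a given column range.

    The bug omitted the `+ range_start_n` offset, so every worker
    effectively read from the start of the bias buffer.
    """
    return bias[range_start_n:range_start_n + range_count_n]

bias = [10, 20, 30, 40]
assert worker_bias_slice(bias, 0, 2) == [10, 20]  # first worker
assert worker_bias_slice(bias, 2, 2) == [30, 40]  # second worker, correct offset
```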

### Motivation and Context

8-bit quantized models with fp16 inputs are increasingly common on ARM
devices (Windows ARM, macOS Apple Silicon). The existing MatMulNBits
implementation only supported 4-bit for the HQNBIT fp16 paths. This
change extends support to 8-bit, enabling faster inference for 8-bit
quantized models on ARM64 without requiring fp16→fp32 conversion of the
weights.

The bug fixes address issues that were either pre-existing (the `#ifdef`
guards were copy-paste inconsistencies from prior PRs) or introduced
alongside the fp16 NEON support (the Bias offset issue). These caused
crashes or incorrect output on macOS ARM and multithreaded Windows ARM
configurations.

### Improvements
Measured on `Snapdragon X Elite - X1E78100 - Qualcomm Oryon CPU`

#### Accuracy level 4 (uses HQNBIT_CompInt8) vs Accuracy level 1 (uses
HQNBIT_CompFp16)

| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.19× (9.6ms) | 1.36× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.89× (39.8ms) | 1.62× (1371ms) | 1.54× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.54× (2654ms) | 1.43× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | 0.79× (22.5ms) | 2.59× (257ms) | 2.16× (642ms) |
| Qwen 1.5B | 1.14× (41.4ms) | 2.50× (848ms) | 2.55× (1636ms) |
| Qwen 3B | 1.07× (52.9ms) | 1.95× (2133ms) | 2.29× (3799ms) |

#### Latest changes vs ORT 1.24.3 (both accuracy level 4)
On ORT 1.24.3:
- 4 bit uses HQNBIT_CompFp16
- 8 bit uses naive unpacked dequantize and matmul

| Model | Seq 1 | Seq 256 | Seq 512 |
|-------|-------|---------|---------|
| **4-bit** | | | |
| Qwen 0.5B | 1.13× (9.6ms) | 1.35× (428ms) | 1.27× (1119ms) |
| Qwen 1.5B | 0.82× (39.8ms) | 1.40× (1371ms) | 1.47× (2694ms) |
| Qwen 3B | 1.16× (46.8ms) | 1.47× (2654ms) | 1.51× (5427ms) |
| **8-bit** | | | |
| Qwen 0.5B | **35.4×** (22.5ms) | **5.0×** (257ms) | **3.2×** (642ms) |
| Qwen 1.5B | **98.0×** (41.4ms) | **6.8×** (848ms) | **4.7×** (1636ms) |
| Qwen 3B | **107.8×** (52.9ms) | **4.1×** (2133ms) | **3.1×** (3799ms) |
…icrosoft#27091)

### Description
This PR fixes test errors encountered during the build and compilation
of onnxruntime-cann.
error1:
The HiSilicon CPU vendor was unrecognized because its CPU info had not
been added, leading to CTest errors.
```
onnxruntime cpuid_info warning: Unknown CPU vendor. cpuinfo_vendor value: 15

10: [----------] Global test environment tear-down
10: [==========] 15 tests from 1 test suite ran. (772 ms total)
10: [  PASSED  ] 15 tests.
10/10 Test #10: onnxruntime_ep_graph_test ...............   Passed    0.98 sec

90% tests passed, 1 tests failed out of 10

Total Test time (real) = 674.68 sec

The following tests FAILED:
          1 - onnxruntime_test_all (Failed)
Errors while running CTest
```
error2: 
Some Python tests are failing here due to a previously submitted PR
(microsoft#25867). In that PR, we
introduced a new parameter enable_cann_subgraph to control subgraph
partitioning for unsupported operators, with a preference for executing
the entire graph as a whole. However, this change causes certain test
cases to fail when specific operator versions in the models are not
supported, leading to execution errors.
…oft#27698)

### Description
Don't try to show extended ccache stats.

### Motivation and Context
The current docker images ship a package set with an old version of
ccache that doesn't understand `--verbose` (`-v`), so if `ccache` is
installed via the package manager, the build scripts fail.
This matters because we will soon be enabling ccache throughout various
pipelines/workflows.
### Description

Release packaging currently requires manually triggering and monitoring
multiple independent pipelines (Python, Nuget, Java, NPM, iOS, etc.).
This PR introduces a unified orchestration framework that triggers all
packaging pipelines from a single master pipeline, monitors them in
parallel, and reports aggregated results.

### Motivation and Context
* main-release-pipeline.yml — A 1ES pipeline with per-pipeline
enable/disable toggles, dry-run support, and configurable target branch.
* `trigger_and_wait_pipelines.py` — Python orchestrator that triggers
~11 packaging pipelines via the ADO REST API, polls them every 60s, and
logs status to Azure Kusto for analytics.

---------

Co-authored-by: Kusuma Padma Kavya Bandi <kusbandi@microsoft.com>
### Description
File mapping is enabled by default, but when the model contains an
embedded EP context in a node, QNN EP attempts to map the EP context as
a file that doesn't exist. This change disables file mapping entirely
when `ep_context_embed_mode = 1`.

---------

Co-authored-by: calvnguy <calvnguy@qti.qualcomm.com>
Co-authored-by: quic_calvnguy <quic_calvnguy@quic_inc.com>
### Description
Allow using KAI SBGemm on any ARM64 build (previously it needed the
condition that NCHWc be not enabled as the logistics of using the KAI
SBGemm path for the SBGemm based Conv was not clear). This change
instead allows using the KAI path for other call sites (by enabling it
in the build) but explicitly disallows using the KAI path only for the
Conv and continue using the default MLAS path for the same (as before).



### Motivation and Context

microsoft#26773 (comment)

---------

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
---------

Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…els (microsoft#27749)

## Description

Fix two bugs in the WebGPU ConvTranspose shader code generation
(conv_backprop.cc) in the `pack_input_as4_` code path when
`a_components_ == 1` (triggered when input channels per group is not
divisible by 2 or 4, e.g., 5 or 7).

### Bug 1: Wrong offset for weight reads
Weight values were read using `x_offset` (the input/dy tensor offset)
instead of `w_offset` (the weight tensor offset), producing incorrect
convolution results.

### Bug 2: Missing weight multiplication in remainder loop
The remainder loop (handling leftover channels when
`inputChannelsPerGroup` is not a multiple of 4) was adding raw input
values to `dotProd` without multiplying by the corresponding weight
values.
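
Bug 2 reduces to a one-line omission, sketched here (illustrative Python standing in for the generated WGSL; names are not the shader identifiers):

```python
def remainder_dot(x_vals, w_vals):
    """Remainder-loop accumulation for leftover channels.

    The buggy shader did `dot += x` for each leftover channel; the fix
    multiplies by the corresponding weight: `dot += x * w`.
    """
    dot = 0.0
    for x, w in zip(x_vals, w_vals):
        dot += x * w
    return dot

assert remainder_dot([1.0, 2.0], [3.0, 4.0]) == 11.0
```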

## Motivation and Context

The `inChannels = 5` and `inChannels = 7` test cases in
`conv-transpose.jsonc` were failing because these channel counts aren't
divisible by 2 or 4, triggering the buggy `a_components_ == 1` branch.
Cases like `inChannels = 6` (`a_components_ = 2`) and `inChannels = 8`
(`a_components_ = 4`) were unaffected.

## Testing

All 22 conv-transpose WebGPU tests now pass:
```
npm test -- op conv-transpose.jsonc -b=webgpu -e=node
22 passing (23s)
```

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
### Description

This change updates the KleidiAI SGEMM post-processing path in
onnxruntime/core/mlas/lib/kleidiai/sgemm_kleidiai.cpp with two parts:
- Correctness fix: in the alpha == 0 || K == 0 fast path, beta handling
is now applied for every batch entry (not just batch 0), so batched
SGEMM behaviour is correct.
- NEON SGEMM epilogue optimisation: adds a vectorised alpha/beta
post-processing path for contiguous outputs, with guarded fallback to
scalar for non-contiguous or small cases. The 2D epilogue path also
routes contiguous tiles through the contiguous 1D epilogue path to
enable vectorisation.

### Motivation and Context

This change addresses correctness and performance in the SGEMM
post-processing stage:
- The batched alpha == 0 || K == 0 path previously used only Data[0],
which could produce incorrect results for BatchSize > 1.
- The post-processing loop (C = alpha * (A*B) + beta * C) is a known
latency contributor when memcpy fast paths are not applicable. The NEON
epilogue changes are intended to reduce this cost on supported ARM
platforms while preserving existing fallback behaviour.
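
The batched fast-path fix can be sketched as follows (illustrative Python, not the KleidiAI code): when `alpha == 0` or `K == 0`, the epilogue reduces to `C = beta * C`, and it must be applied to every batch entry, not just the first.

```python
def sgemm_alpha_zero_path(c_batches, beta):
    """alpha == 0 (or K == 0) fast path: scale C by beta for EVERY batch.

    The bug applied beta only to batch 0 (Data[0]), corrupting results
    for BatchSize > 1.
    """
    for c in c_batches:          # iterate all batch entries
        for i in range(len(c)):
            c[i] *= beta

cs = [[1.0, 2.0], [3.0, 4.0]]
sgemm_alpha_zero_path(cs, 0.5)
assert cs == [[0.5, 1.0], [1.5, 2.0]]
```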

---------

Signed-off-by: Cathal Lawlor <cathal.lawlor@arm.com>
### Description

This PR updates the logic for identifying a large model in Intel's
Neural Compressor.

### Motivation and Context

The original logic was not sufficient to detect whether a model produced
by the model builder is too large or not. Here is an example traceback
from an internal customer.

```
Traceback (most recent call last):
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builder.py", line 502, in <module>
    create_model(
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\torch\utils\_contextlib.py", line 124, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builder.py", line 346, in create_model
    onnx_model.save_model(output_dir)
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builders\base.py", line 748, in save_model
    model = self.to_int4()
            ^^^^^^^^^^^^^^
  File "D:\a\_work\1\s\edge.onnxruntime-genai\src\python\py\models\builders\base.py", line 738, in to_int4
    quant.process()
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\matmul_nbits_quantizer.py", line 1442, in process
    self.int4_quant_algo()
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\matmul_nbits_quantizer.py", line 1388, in int4_quant_algo
    self.model = rtn_quantize(
                 ^^^^^^^^^^^^^
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\weight_only.py", line 456, in rtn_quantize
    model = ONNXModel(model)
            ^^^^^^^^^^^^^^^^
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\onnx_model.py", line 52, in __init__
    self.check_is_large_model()
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\onnx_model.py", line 91, in check_is_large_model
    raise e
  File "C:\ToolCache\Python\3.12.10\x64\Lib\site-packages\onnxruntime\quantization\neural_compressor\onnx_model.py", line 84, in check_is_large_model
    init_bytes = init.SerializeToString()
                 ^^^^^^^^^^^^^^^^^^^^^^^^
google.protobuf.message.EncodeError: Failed to serialize proto
```
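
For context, the failure mode in the traceback revolves around protobuf's 2 GiB serialization limit. A rough standalone sketch of a size check (a hypothetical helper with an assumed threshold, not the Neural Compressor code):

```python
PROTOBUF_LIMIT = 2**31 - 1  # protobuf cannot serialize messages of 2 GiB or more

def is_large_model(initializer_sizes_bytes):
    """Treat a model as 'large' when its initializers alone reach the
    protobuf serialization limit (illustrative heuristic only)."""
    return sum(initializer_sizes_bytes) >= PROTOBUF_LIMIT

assert not is_large_model([10 * 1024 ** 2])  # 10 MiB model: not large
assert is_large_model([2 ** 31, 1])          # past the limit: large
```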
…nds read (microsoft#27748)

### Description

Add an `ORT_ENFORCE` check in the `QMoECPU` constructor to require
`swiglu_fusion == 1` when using SwiGLU activation, preventing an
out-of-bounds read.

When `swiglu_fusion=0` (the default), `fc1_out_features` is computed as
`inter_size` instead of `2*inter_size`. However, `ApplySwiGLUActivation`
reads `2*inter_size` values from the FC1 output buffer (via
`input_data[2*i]` for `i` in `[0, inter_size)`), causing an
out-of-bounds read that produces NaN on Windows x86.

This matches the existing validation already present in the `MoE` CPU
operator (`moe_cpu.cc:26-27`).
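
The interleaved read pattern behind the out-of-bounds is easy to see in a simplified sketch (illustrative Python with no alpha/clamp handling; not the ORT kernel):

```python
import math

def apply_swiglu_interleaved(fc1_out, inter_size):
    """Interleaved SwiGLU consumes (gate, linear) pairs, reading
    2*inter_size values from the FC1 output buffer.

    With swiglu_fusion=0 the buffer holds only inter_size values,
    so indices 2*i run past the end -- the out-of-bounds read.
    """
    assert len(fc1_out) >= 2 * inter_size, "fc1 output must hold 2*inter_size values"
    silu = lambda x: x / (1.0 + math.exp(-x))
    return [silu(fc1_out[2 * i]) * fc1_out[2 * i + 1] for i in range(inter_size)]

out = apply_swiglu_interleaved([1.0, 2.0, 0.0, 3.0], 2)
assert len(out) == 2
```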

### Motivation and Context

- The NaN was caused by a missing `swiglu_fusion=1` attribute. With the
default `swiglu_fusion=0`, the SwiGLU activation reads past the
allocated FC1 output buffer — an out-of-bounds read.
- The `MoE` CPU operator already enforces `swiglu_fusion == 1` for
SwiGLU; this change adds the same guard to `QMoECPU` for consistency and
safety.
- Non-interleaved SwiGLU format (`swiglu_fusion=2`) is not implemented
(throws `ORT_NOT_IMPLEMENTED`), and `swiglu_fusion=0` is invalid for
SwiGLU, so only `swiglu_fusion=1` is valid.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
### Description

Building Linux onnxruntime-qnn python wheel didn't work through WSL
because QNN lib dependencies were not copied over to the wheel, so this
PR resolved the issue.

---------

Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>
…egation routing (microsoft#27687)

### Description

Adds optional input `router_weights` (index 14) to `com.microsoft.QMoE`
to decouple Top-K expert selection from output aggregation weighting.

When `router_weights` is provided:
- `router_probs` → Top-K expert selection only
- `router_weights` → values gathered at selected expert indices used as
mixing weights

When omitted, existing softmax-of-`router_probs` behavior is preserved
(backward compatible).
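
The selection/aggregation split can be sketched as follows (illustrative Python for a single token; names are not the ORT identifiers):

```python
import math

def route_token(router_probs, top_k, router_weights=None):
    """Pick top-k experts from router_probs; mix expert outputs either by
    softmax(router_probs) (legacy) or by router_weights (new input)."""
    order = sorted(range(len(router_probs)), key=lambda i: -router_probs[i])
    idx = order[:top_k]                      # selection always uses router_probs
    if router_weights is None:
        exps = [math.exp(p) for p in router_probs]
        total = sum(exps)
        return idx, [exps[i] / total for i in idx]
    return idx, [router_weights[i] for i in idx]  # aggregation decoupled

idx, w = route_token([0.1, 2.0, 0.5], 2, router_weights=[0.9, 0.6, 0.3])
assert idx == [1, 2] and w == [0.6, 0.3]
```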

**Changes:**
- **Schema** (`contrib_defs.cc`): New optional input 14
`router_weights`, type T, shape `(num_tokens, num_experts)`
- **CPU provider** (`moe_quantization_cpu.cc`): Implements the separate
routing path with MLFloat16/float support and optional
`normalize_routing_weights` normalization
- **CUDA provider** (`moe_quantization.cc`): Reads input, enforces
not-implemented if provided
- **WebGPU provider** (`qmoe.cc`): Same not-implemented guard
- **Tests** (`moe_test.cc`): `QMoETest_CPU_RouterWeights` covering both
normalized and unnormalized paths with non-zero expected outputs via FC2
bias to validate correct aggregation weights
- **Docs** (`OperatorKernels.md`): Updated CPU and CUDA entries

This pattern matches DeepSeek-V2/V3/R1 routing where `sigmoid(logits)`
is used for aggregation while `logits + bias` with group masking drives
selection:

```python
# DeepSeek-style: different tensors for selection vs aggregation
topk_indices = torch.topk(scores_for_choice, k=top_k)[1]  # selection from modified logits
topk_weights = router_logits.gather(1, topk_indices)        # aggregation from original sigmoid
```

### Motivation and Context

`QMoE` previously required the same tensor for both routing and
weighting, blocking DeepSeek-style `noaux_tc` MoE models where these are
intentionally separate. This unblocks ONNX Runtime export/serving of
DeepSeek-V2/V3/R1 MoE architectures.




<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details on the original issue you should resolve*
> 
> <issue_title>[Feature Request] Support noaux_tc MoE routing in
com.microsoft.QMoE via separate router_weights</issue_title>
> <issue_description>### Describe the feature request
> 
> `com.microsoft.QMoE` currently accepts a single routing tensor
(commonly router_probs) that is used both for:
> 
> Top‑K expert selection (routing / dispatch), and
> Weighting the outputs of selected experts (aggregation).
> 
> This design makes it impossible to represent DeepSeek‑style `noaux_tc`
`MoE` routing, where different tensors are intentionally used for:
> 
> * expert selection (Top‑K routing), and
> * expert output weighting (mixing).
> 
> This issue proposes adding an optional input `router_weights` to
`com.microsoft.QMoE` so that:
> 
> * `router_probs` is used only for Top‑K selection, and
> * `router_weights` is used only for multiplying / aggregating expert
outputs.
> 
> The change is backward compatible
> This also allows for any other methodology in future where different
tensors are used for selection/aggregation
> 
> ### Describe scenario use case
> 
> Enables exporting and serving DeepSeek‑V2/V3/R1‑style MoE models in
ONNX Runtime</issue_description>
> 
> <agent_instructions>Please update operator spec and implement it in
CPU provider. For CUDA provider, it is fine to throw not implemented
exception for now.
> 
> Example Deepseek MoE script can be found in
https://github.com/huggingface/transformers/blob/75c836b7853cb65f48ab2ce13cddfb12d14ecf5a/src/transformers/models/deepseek_v3/modular_deepseek_v3.py
like the following:
> 
> class DeepseekV3MoE(nn.Module):
>     """
>     A mixed expert module containing shared experts.
>     """
> 
>     def __init__(self, config):
>         super().__init__()
>         self.config = config
>         self.experts = DeepseekV3NaiveMoe(config)
>         self.gate = DeepseekV3TopkRouter(config)
>         self.shared_experts = DeepseekV3MLP(
> config=config, intermediate_size=config.moe_intermediate_size * config.n_shared_experts
>         )
>         self.n_routed_experts = config.n_routed_experts
>         self.n_group = config.n_group
>         self.topk_group = config.topk_group
>         self.norm_topk_prob = config.norm_topk_prob
>         self.routed_scaling_factor = config.routed_scaling_factor
>         self.top_k = config.num_experts_per_tok
> 
>     def route_tokens_to_experts(self, router_logits):
>         router_logits = router_logits.sigmoid()
>         router_logits_for_choice = router_logits + self.gate.e_score_correction_bias
>         group_scores = (
>             router_logits_for_choice.view(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .topk(2, dim=-1)[0]
>             .sum(dim=-1)
>         )
>         group_idx = torch.topk(group_scores, k=self.topk_group, dim=-1, sorted=False)[1]
>         group_mask = torch.zeros_like(group_scores)
>         group_mask.scatter_(1, group_idx, 1)
>         score_mask = (
>             group_mask.unsqueeze(-1)
>             .expand(-1, self.n_group, self.n_routed_experts // self.n_group)
>             .reshape(-1, self.n_routed_experts)
>         )
>         scores_for_choice = router_logits_for_choice.masked_fill(~score_mask.bool(), 0.0)
>         topk_indices = torch.topk(scores_for_choice, k=self.top_k, dim=-1, sorted=False)[1]
>         topk_weights = router_logits.gather(1, topk_indices)
>         if self.norm_topk_prob:
>             denominator = topk_weights.sum(dim=-1, keepdim=True) + 1e-20
>             topk_weights /= denominator
>         topk_weights = topk_weights * self.routed_scaling_factor
>         return topk_indices, topk_weights
> 
>     def forward(self, hidden_states):
>         residuals = hidden_states
>         orig_shape = hidden_states.shape
>         router_logits = self.gate(hidden_states)
>         topk_indices, topk_weights = self.route_tokens_to_experts(router_logits)
>         hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
>         hidden_states = self.experts(hidden_states, topk_indices, topk_weights).view(*orig_shape)
>         hidden_states = hidden_states + self.shared_experts(residuals)
>         return hidden_states
> 
> </agent_instructions>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> </comments>
> 


</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes microsoft#27675


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
… unsqueeze_elimination against invalid model (microsoft#27638)

This pull request introduces improvements to the
`concat_slice_elimination` and `unsqueeze_elimination` optimizer passes,
focusing on correctness, robustness, and code clarity. The changes
include enhanced handling of optional Slice operator attributes,
stricter validation of axes and steps, and improved error handling for
invalid model inputs.

Improvements to `concat_slice_elimination`:

* Materialized default values for optional `axes` and `steps` in the
Slice operator, ensuring safe indexing and alignment with ONNX defaults.
(`onnxruntime/core/optimizer/concat_slice_elimination.cc`)
* Refined the fusion pattern to only allow `starts.size() == 1`,
clarifying the scope of the optimization and preventing incorrect
fusions. (`onnxruntime/core/optimizer/concat_slice_elimination.cc`)
* Added `<numeric>` include to support new code using `std::iota`.
(`onnxruntime/core/optimizer/concat_slice_elimination.cc`)
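
The defaulting rule being materialized follows the ONNX Slice spec: a missing `axes` input means `[0, 1, ..., len(starts)-1]` and a missing `steps` input means all ones. A minimal sketch of that rule (the helper name is illustrative, not the ORT code; the C++ pass uses `std::iota` for the axes case):

```python
def materialize_slice_defaults(num_starts, axes=None, steps=None):
    """Fill in ONNX Slice defaults: axes -> [0..n-1], steps -> all ones."""
    if axes is None:
        axes = list(range(num_starts))  # C++ equivalent: std::iota over 0..n-1
    if steps is None:
        steps = [1] * num_starts
    return axes, steps

# Slice with one start/end pair and no explicit axes/steps:
print(materialize_slice_defaults(1))  # ([0], [1])
```

With the defaults materialized, the fusion can index `axes[i]` and `steps[i]` safely for any valid Slice node.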

Improvements to `unsqueeze_elimination`:

* Added validation for axes values, including range checks and detection
of duplicate axes, returning errors for invalid models instead of
silently proceeding.
(`onnxruntime/core/optimizer/unsqueeze_elimination.cc`)
* Added `core/providers/common.h` include for utility functions used in
validation. (`onnxruntime/core/optimizer/unsqueeze_elimination.cc`)
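
The added axes validation amounts to the following rule, sketched with an illustrative helper (the real code relies on ORT utilities from `core/providers/common.h`, e.g. negative-axis normalization):

```python
def validate_unsqueeze_axes(axes, output_rank):
    """Normalize and validate Unsqueeze axes: each axis must fall in
    [-output_rank, output_rank) after the ONNX convention for negative
    axes, and duplicate axes make the model invalid."""
    normalized = [a + output_rank if a < 0 else a for a in axes]
    if any(a < 0 or a >= output_rank for a in normalized):
        raise ValueError(f"Unsqueeze axis out of range for output rank {output_rank}")
    if len(set(normalized)) != len(normalized):
        raise ValueError("Unsqueeze: duplicate axes")
    return sorted(normalized)

print(validate_unsqueeze_axes([-1, 0], 3))  # [0, 2]
```

Invalid models now produce an error status at this point instead of the eliminator silently proceeding on bad axes.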
### Description

Extends CUDA EP Squeeze and Unsqueeze kernel registrations from opset 23
to opset 25, matching CPU provider coverage.

- **`squeeze.cc` / `unsqueeze.cc`**: Cap opset 23 to versioned `23–23`,
add versioned `24–24`, add non-versioned `25`
- **`cuda_execution_provider.cc`**: Add corresponding forward
declarations and `BuildKernelCreateInfo` registry entries for opsets 23
(now versioned), 24, and 25
- **`docs/OperatorKernels.md`**: Update CUDA Squeeze and Unsqueeze
entries to reflect `25+` coverage with individual `24` and `23` version
rows

No new computation logic is needed: these ops are shape-only (the data copy is a single `cudaMemcpyAsync`), so the same kernel implementation covers all new opsets.

### Motivation and Context

CUDA EP registered Squeeze/Unsqueeze only up to opset 23 while the ONNX
spec defines them through opset 25. Models exported at opset 24+ would
fail to find a matching CUDA kernel. Part of the broader opset gap audit
tracked in microsoft#27729.

### Limitation

It does not include new data types for float8, float4, int4 etc. That
will be added later if needed.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
…27693)

### Description

Add a pre-check for zero values in the divisor tensor for integral types
in `Div<T>`. Returns an error `Status` instead of hitting undefined
behavior (SIGFPE / structured exception).

- **`element_wise_ops.h`**: When the divisor is a constant initializer,
`TryGetConstantInput` validates for zeros once at kernel creation time
in the constructor, avoiding per-`Compute` overhead. A
`divisor_is_validated_constant_` flag tracks whether the one-time check
was performed.
- **`element_wise_ops.cc`**: `if constexpr (std::is_integral<T>::value)`
guard scans non-constant divisors before calling `UntypedBroadcastTwo`,
skipping the check when the constant was already validated. Compiled
away for float/double/half — zero cost for non-integer paths.
- **`element_wise_ops_test.cc`**: Added `Div_int8_by_zero`,
`Div_int32_by_zero`, `Div_int64_by_zero_scalar` tests covering tensor
and scalar divisor cases, plus `Div_int32_by_zero_constant_initializer`
to exercise the `TryGetConstantInput` constructor path with
`is_initializer = true`.

### Motivation and Context

Integer division by zero is UB in C++ and causes a hardware exception
that crashes the process. Float types produce inf/NaN naturally, but
int8/int16/int32/int64/uint* types do not. This was reported via
Chromium (https://issues.chromium.org/issues/491835014) with a trivial
repro: `tensor<int8> / scalar(0)`.
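
The shape of the fix can be sketched outside C++ as a pre-scan of the divisor before any element-wise division happens (names are illustrative; the real code guards the scan with `if constexpr` so floating-point paths pay nothing):

```python
def checked_int_div(a, b):
    """Divide two integer 'tensors' elementwise, but scan the divisor for
    zeros first and fail with a clean error instead of crashing the
    process (integer division by zero is undefined behavior in C++)."""
    if any(v == 0 for v in b):
        raise ValueError("Div: zero found in integer divisor tensor")
    # int(x / y) truncates toward zero, matching C++ integer division
    return [int(x / y) for x, y in zip(a, b)]

print(checked_int_div([6, 8], [2, 4]))  # [3, 2]
```

For a constant-initializer divisor the scan runs once at kernel creation rather than on every call, which is what the `divisor_is_validated_constant_` flag tracks.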

<!-- START COPILOT ORIGINAL PROMPT -->



<details>

<summary>Original prompt</summary>

> 
> ----
> 
> *This section details on the original issue you should resolve*
> 
> <issue_title>int8 / 0 exception not caught for cpu ep</issue_title>
> <issue_description>See https://issues.chromium.org/issues/491835014.
> 
> Repro:
> a=tensor<int8>
> b=tensor<int8>, ie a scalar that is 0
> model that does a/b
> 
> Stack trace:
> ```
> onnxruntime.dll!Eigen::internal::scalar_quotient_op<signed char,signed char>::operator()(const char &) Line 437      C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::binary_evaluator<Eigen::CwiseBinaryOp<Eigen::internal::scalar_quotient_op<signed char,signed char>,Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<signed char>,Eigen::Array<signed char,-1,1,0,-1,1> const> const ,Eigen::ArrayWrapper<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1> const ,0,Eigen::Stride<0,0>>> const>,Eigen::internal::IndexBased,Eigen::internal::IndexBased,signed char,signed char>::coeff(__int64) Line 910    C++
>  ...
>      [Inline Frame] onnxruntime.dll!Eigen::internal::Assignment<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>>,Eigen::CwiseBinaryOp<Eigen::internal::scalar_quotient_op<signed char,signed char>,Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<signed char>,Eigen::Array<signed char,-1,1,0,-1,1> const> const ,Eigen::ArrayWrapper<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1> const ,0,Eigen::Stride<0,0>>> const>,Eigen::internal::assign_op<signed char,signed char>,Eigen::internal::Dense2Dense,void>::run(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 855      C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::call_assignment_no_alias(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 797      C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::call_assignment(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 768   C++
>      [Inline Frame] onnxruntime.dll!Eigen::internal::call_assignment(Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>> &) Line 750   C++
>      [Inline Frame] onnxruntime.dll!Eigen::MatrixBase<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1>,0,Eigen::Stride<0,0>>>::operator=(const Eigen::DenseBase<Eigen::CwiseBinaryOp<Eigen::internal::scalar_quotient_op<signed char,signed char>,Eigen::CwiseNullaryOp<Eigen::internal::scalar_constant_op<signed char>,Eigen::Array<signed char,-1,1,0,-1,1> const> const ,Eigen::ArrayWrapper<Eigen::Map<Eigen::Matrix<signed char,-1,1,0,-1,1> const ,0,Eigen::Stride<0,0>>> const>> &) Line 59 C++
>      [Inline Frame] onnxruntime.dll!onnxruntime::Div<signed char>::Compute::__l2::<lambda_998187df037dec36fd0905b4142c682e>::operator()(onnxruntime::BroadcastHelper &) Line 685   C++
>      onnxruntime.dll!<lambda_998187df037dec36fd0905b4142c682e>::<lambda_invoker_cdecl>(onnxruntime::BroadcastHelper & per_iter_bh) Line 686    C++
>      [External Code]
>      [Inline Frame] onnxruntime.dll!std::_Func_class<void,__int64,__int64>::operator()(__int64 <_Args_0>, __int64 <_Args_1>) Line 926    C++
>      onnxruntime.dll!onnxruntime::concurrency::ThreadPool::ParallelFor(__int64 n, const onnxruntime::TensorOpCost & c, const std::function<void __cdecl(__int64,__int64)> & f) Line 628  C++
>      onnxruntime.dll!onnxruntime::concurrency::ThreadPool::TryParallelFor(onnxruntime::concurrency::ThreadPool * tp, __int64 total, const onnxruntime::TensorOpCost & cost_per_unit, const std::function<void __cdecl(__int64,__int64)> & fn) Line 705     C++
>      onnxruntime.dll!onnxruntime::ParallelizeSingleSpan<onnxruntime::BroadcastHelper>(onnxruntime::BroadcastHelper & helper, const onnxruntime::ProcessBroadcastSpanFuncs & functors) Line 955 C++
>      onnxruntime.dll!onnxruntime::BroadcastLooper<onnxruntime::BroadcastHelper>(onnxruntime::BroadcastHelper & helper, const onnxruntime::ProcessBroadcastSpanFuncs & functors) Line 1006      C++
>      onnxruntime.dll!onnxruntime::UntypedBroadcastTwo(onnxruntime::OpKernelContext & context, const onnxruntime::ProcessBroadcastSpanFuncs & funcs, double unit_cost, void * user_data) Line 2305    C++
>      onnxruntime.dll!onnxruntime::Div<signed char>::Compute(onnxruntime::OpKernelContext * context) Line 695     C++
> ```
> </issue_description>
> 
> ## Comments on the Issue (you are @copilot in this section)
> 
> <comments>
> </comments>
> 


</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes microsoft#27686


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: skottmckay <979079+skottmckay@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
…in onnxruntime.native.path (microsoft#27668)

## Summary

Fixes the Java native provider loading flow so that
`extractProviderLibrary` checks
`onnxruntime.native.path` before attempting extraction from classpath
resources.

Previously, when a provider library was already present in
`onnxruntime.native.path`,
the Java loader could still attempt `extractFromResources(...)` first,
and only fall
back to the configured native path afterwards. This caused an
unnecessary extraction
attempt even though the library was already available on disk.

This change updates the lookup order to:

1. Return immediately if the provider was already marked as ready
2. Check whether the provider library already exists in
`onnxruntime.native.path`
3. Only if not found there, attempt extraction from classpath resources
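
The lookup order above can be sketched as follows (Python for brevity; names are illustrative, the real logic lives in the Java loader):

```python
import os

def extract_provider_library(lib_name, native_path, already_ready, extract_from_resources):
    """Resolve a provider library using the fixed lookup order:
    1) already marked ready, 2) already present on disk under
    onnxruntime.native.path, 3) only then extract from classpath."""
    if already_ready:
        return True
    if native_path and os.path.exists(os.path.join(native_path, lib_name)):
        return True
    return extract_from_resources(lib_name)
```

When the library is already on disk, `extract_from_resources` is never invoked, which is exactly what the new regression test asserts via its test-only hook.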

## Tests

Added a regression test covering the case where the requested provider
library already
exists in `onnxruntime.native.path`.

The test verifies that:

- `extractProviderLibrary(...)` returns `true`
- extraction from resources is not attempted in that case

A small test-only hook was added to observe calls to
`extractFromResources(...)` so the
regression can be validated directly and deterministically.

## Issue

Fixes microsoft#27655
…7714)

Fix microsoft#27712 

This pull request improves support and validation for the `softcap` and
`softmax_precision` attributes in the CUDA Attention operator, updates
kernel eligibility and fallback logic, and enhances test coverage for
these features. The changes ensure that only valid values are accepted,
propagate new parameters to eligible kernels, and clarify backend
capabilities in code comments and tests.

**CUDA Attention operator improvements:**

* Added validation to enforce that `softcap` is non-negative and that
`softmax_precision` is one of the supported TensorProto types (0, 1, 10,
or 16).
* Updated code comments and eligibility checks to clarify that `softcap`
is now supported natively in Flash and Memory Efficient Attention (MEA)
kernels, and that `softmax_precision` is inherently satisfied (always
computed in FP32 on CUDA).
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL174-R183)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL548-R556)
[[3]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL824-R834)
* Propagated the `softcap` parameter to the MEA kernel invocation to
enable native support.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR696)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR746)
* Modified fallback and rejection logic: unfused attention now
explicitly rejects `softcap` with a clear error message, while
`softmax_precision` is always considered satisfied.
[[1]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffL1096-R1110)
[[2]](diffhunk://#diff-0701e4cc6d4951894ae1a60f35c1e6c0f69ba7595f896a23c8f5ed7265eab4ffR1179-R1186)

**Testing improvements:**

* Added a new test to verify that `softmax_precision=1` (FLOAT) produces
identical results to the default, since all CUDA backends compute
softmax in FP32.
* Clarified in existing softcap-related tests that certain
configurations are not supported by CUDA unfused attention and require
Flash or MEA; updated test comments for clarity.
[[1]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1088-R1089)
[[2]](diffhunk://#diff-3ff6dfa2ce407ae0073009174c37d1756509e8bbc434dee7c44cd55a996bb777R1118-R1119)
* Expanded Python test cases for GQA (grouped-query attention) to
include nonzero `softcap` values, increasing coverage of this feature.
[[1]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL613-R613)
[[2]](diffhunk://#diff-8795174e6967f83c53fcd5de6d7bfe55782a1ae05cf720378b33b7a7c4cee7dcL648-R648)

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…27751)

### Description

Extends the KleidiAI-accelerated `MatMulNBits` path on ARM64 to support
asymmetric 4-bit quantization (models with per-block zero points).
Previously, KleidiAI was only used for symmetric quantization (`!HasZp`
guard), and asymmetric models fell back to a significantly slower
non-KleidiAI kernel.

Since KleidiAI only provides symmetric int4 micro-kernels
(`kai_matmul_clamp_f32_qai8dxp_qsi4c32p`), this PR runs KleidiAI as-if
symmetric (hardcoded `rhs_zero_point=8`) and applies a float-domain
zero-point correction post-GEMM:

$$C_{\text{actual}} = C_{\text{symmetric}} + A_{\text{blksum}} \times
B_{\text{ZpCorr}}^T$$

where:
- `BZpCorr[n, blk] = scale_b[n, blk] × (8 - zp_b[n, blk])` — precomputed
at weight packing time
- `AFloatBlkSum[m, blk] = Σ A_float[m, blk_start..blk_end]` — computed
per inference alongside A quantization

Key changes:
- **`UseKleidiAIBase()`**: New function that checks KleidiAI eligibility
without the `!HasZp` guard. `UseKleidiAI()` now delegates to `!HasZp &&
UseKleidiAIBase()`, preserving symmetric behavior.
- **B packing (`SQ4BitGemmPackQuantBDataAndBlkSum`)**: Computes and
stores `BZpCorr` after KleidiAI packed B data when zero points are
present.
- **Workspace expansion**: Allocates space for `AFloatBlkSum` (M ×
BlockCountK floats) in the per-GEMM workspace for asymmetric models.
- **`ComputeAFloatBlkSum`**: NEON-vectorized (4× unrolled `vaddq_f32`)
function to compute per-block float sums of A.
- **`ApplyBZpCorrection`**: NEON-vectorized correction kernel tiled
4-N-wide (`vfmaq_f32`) for L1-friendly BZpCorr reuse.
- **PrePack**: Computes `BZpCorr` during the scales PrePack (not
zero_points PrePack), since ORT may erase constant inputs after marking
them packed.

No changes to the symmetric path. No changes to x64. No changes to 8-bit
quantization.

### Motivation and Context

Asymmetric 4-bit quantized models (e.g., GPTQ/RTN with zero points) on
ARM64 were **23–72% slower** than their symmetric counterparts because
KleidiAI's `sdot`/`i8mm` micro-kernels only support symmetric RHS,
forcing a fallback to a slower non-KleidiAI kernel path.

This change closes most of that gap:

| Model | Seq Len | Asym/Sym (before) | Asym/Sym (after) | Asym speedup | Asym latency (after) |
|-------|---------|-------------------|------------------|--------------|----------------------|
| Qwen 1.5B | 256 | 1.35× | 1.17× | **1.16×** | 1107.8ms |
| Qwen 1.5B | 512 | 1.23× | 1.06× | **1.14×** | 2259.7ms |
| Qwen 3B | 256 | 1.43× | 1.12× | **1.28×** | 2029.7ms |
| Qwen 3B | 512 | 1.39× | 1.22× | **1.24×** | 4188.0ms |
| Qwen 7B | 256 | 1.61× | 1.11× | **1.52×** | 3661.6ms |
| Qwen 7B | 512 | 1.72× | 1.11× | **1.58×** | 7263.8ms |

The remaining 6–22% asym/sym gap comes from the extra pass over A to
compute float block sums — this cannot be fused into KleidiAI's sealed
A-packing function and would require an upstream KleidiAI API change.
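
The correction identity can be checked on a toy block: dequantizing with the true zero point equals dequantizing as-if-symmetric (zero point 8) plus a per-block `scale * (8 - zp)` term weighted by the float sum of A (values below are made up purely for illustration):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# One quantization block (toy values): int4 codes in 0..15, a true zero
# point, a per-block scale, and the float activations hitting this block.
q = [3, 12, 7, 9]
zp, scale = 5, 0.25
a = [1.0, -2.0, 0.5, 3.0]

exact = dot(a, [scale * (v - zp) for v in q])   # true asymmetric dequant
sym = dot(a, [scale * (v - 8) for v in q])      # KleidiAI "as-if zp = 8" result
corr = sum(a) * (scale * (8 - zp))              # AFloatBlkSum * BZpCorr

print(abs(exact - (sym + corr)) < 1e-9)  # True
```

Summed over all K-blocks this is exactly the post-GEMM correction `C_actual = C_symmetric + A_blksum @ B_ZpCorr.T` described above.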
## Description

This PR refactors a set of CUDA provider helpers and kernel call sites
so more of the remaining CUDA ops can build cleanly when the CUDA EP is
compiled as a plugin. The main goal is to reduce direct dependencies on
framework-internal CUDA EP types such as `CUDAExecutionProvider` and
`CudaStream`, and to move reusable CUDA type/handle helpers into
header-visible utilities that are available on both sides of the plugin
boundary.

These changes follow the plugin EP guidance, where the next stage of
enablement is to remove kernel assumptions about provider-only
infrastructure and rely on stream-handle based access instead. Along the
way, the PR also fixes a few small compatibility issues in contrib CUDA
kernels that surfaced while working through the remaining excluded ops.

## Summary of Changes

### Shared CUDA Helper Refactoring

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/cuda_common_type_helpers.h` | Adds
header-only CUDA type conversion and string helper utilities so `.cu`
and plugin builds can consume them without relying on `cuda_common.cc`.
|
| `onnxruntime/core/providers/cuda/cuda_common.h` | Includes
`core/util/math.h` for local half conversion helpers and pulls in the
new header-only helper definitions. |
| `onnxruntime/core/providers/cuda/cuda_common.cc` | Removes the moved
helper implementations and keeps runtime GEMM option initialization
focused on shared state. |
| `onnxruntime/core/providers/cuda/shared_inc/accumulation_type.h` |
Adds a default accumulation type mapping so unsupported specializations
no longer fail at compile time. |
| `onnxruntime/contrib_ops/cuda/math/gemm_float8.cu` | Switches float8
GEMM code to consume the new shared CUDA type helper header. |

### Stream and Handle Access Cleanup

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/cuda_kernel.h` | Adds `Stream*`
overloads for retrieving cuBLAS and cuDNN handles so kernels do not need
to reach into `CudaStream` directly. |
| `onnxruntime/core/providers/cuda/integer_gemm.cc` | Replaces direct
`CudaStream` handle access with `CudaKernel` helper-based cuBLAS
retrieval. |
| `onnxruntime/core/providers/cuda/reduction/reduction_ops.cc` | Reuses
a single cuDNN handle derived from the ORT stream and removes repeated
direct `CudaStream` assumptions. |
| `onnxruntime/core/providers/cuda/reduction/reduction_ops.h` | Stops
caching a `CUDAExecutionProvider*` in reduction kernels, reducing
provider coupling. |
| `onnxruntime/core/providers/cuda/rnn/cudnn_rnn_base.cc` | Switches RNN
weight reorganization to retrieve the cuDNN handle through the shared
stream helper. |
| `onnxruntime/core/providers/cuda/tensor/transpose.cc` | Uses the new
stream-based cuBLAS handle helper in transpose fast paths. |

### Kernel-Level Plugin Compatibility Fixes

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/tensor/reshape.cc` | Replaces
`CopyTensor` usage with explicit `cudaMemcpyAsync` on the kernel stream
to avoid plugin-incompatible stream assumptions. |
| `onnxruntime/core/providers/cuda/tensor/reshape.h` | Applies the same
stream-based reshape copy path to the attribute-driven reshape variant.
|
| `onnxruntime/core/providers/cuda/math/clip.h` | Removes the CPU
`Clip_6Base` dependency by inlining min/max attribute handling into the
CUDA kernel. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Stores
present KV outputs once and reuses those tensors instead of re-fetching
them through `context->Output()`. |
|
`onnxruntime/contrib_ops/cuda/quantization/qordered_ops/qordered_qdq.cc`
| Improves shape validation error reporting by emitting the tensor shape
object directly. |

## Motivation and Context

The CUDA plugin EP compiles kernels into a separate shared library and
can only depend on types and helpers that are visible through the EP API
boundary. The background doc for this work calls out several recurring
incompatibility patterns, especially direct use of `CudaStream`, direct
inclusion of real CUDA EP infrastructure headers, and helper
implementations that only exist in provider-owned `.cc` files.

This PR addresses that class of issues for another slice of the
remaining excluded CUDA ops by:

- moving broadly useful CUDA type helpers into header-visible code,
- routing cuBLAS/cuDNN handle lookup through stream-oriented helpers
instead of provider internals,
- removing a CPU-base-class dependency from `Clip_6`, and
- simplifying a few kernel call sites that were still assuming the
non-plugin CUDA EP environment.

Together, these changes make the CUDA provider code more self-contained
and reduce the amount of plugin-specific adaptation needed to bring the
remaining CUDA ops online.

## Checklist

- [ ] Tests added/updated
- [x] Documentation updated (background captured in
`docs/cuda_plugin_ep/cuda_ops_for_plugin_ep.md`)
- [x] No breaking changes
- [ ] CI passes
### Description

This pull request introduces a new graph optimization pass to fuse Add +
SkipLayerNormalization subgraphs into a single SkipLayerNormalization
node that incorporates a bias input. This helps simplify the computation
graph, especially for models using bias after MatMul, and extends
support for more execution providers. The main changes include the
implementation of the new fusion, its integration into the optimizer
pipeline, and updates to provider compatibility.

**New Bias + SkipLayerNormalization Fusion:**

* Added a new `BiasSkipLayerNormFusion` class and implementation to
detect and fuse subgraphs where a 1D bias is added to a MatMul
(optionally through a Cast) before SkipLayerNormalization, replacing
them with a single node that absorbs the bias as a fifth input.

**Integration into Optimization Pipeline:**

* Registered the new `BiasSkipLayerNormFusion` in the graph transformer
utility, ensuring it runs after the standard SkipLayerNorm fusion and
covers more execution providers (CPU, ACL, CUDA, DML, JS, WebGPU).

**Test and Include Updates:**

* Updated test and implementation files to include the new fusion header
where relevant.

### Motivation and Context

These changes collectively improve model optimization by reducing node
count and improving runtime efficiency for supported providers.

This PR also helps perform this fusion on many models inside the Foundry
Local catalog without needing to re-deploy models.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…1→22) (microsoft#27733)

### Description

Extends CUDA kernel registrations for `GlobalAveragePool` and
`GlobalMaxPool` from opset 1 only to the full opset 1–22 range. Follows
the same pattern used for `MaxPool` in microsoft#27715.

- **`core/providers/cuda/nn/pool.cc`** — Split single opset-1
registrations into versioned 1–21 + opset 22 for both NCHW and NHWC
variants
- **`core/providers/cuda/cuda_execution_provider.cc`** — Updated class
declarations and `BuildKernelCreateInfo` entries (versioned 1–21, added
opset 22)
- **`core/providers/cuda/cuda_nhwc_kernels.cc`** — Same for NHWC kernel
registrations
- **`test/providers/cpu/nn/pool_op_test.cc`** — Added
`GlobalAveragePool_22_CUDA` test
- **`docs/OperatorKernels.md`** — Updated GlobalAveragePool and
GlobalMaxPool entries from `1+` to `22+` / `[1, 21]` in both the ai.onnx
and com.microsoft.internal.nhwc domains under CUDAExecutionProvider

No functional changes to the kernel implementations—opsets 1 through 22
are spec-compatible for these ops.

### Motivation and Context

`GlobalAveragePool` and `GlobalMaxPool` were registered at opset 1 only
in the CUDA provider, creating a 21-version gap to the latest ONNX opset
22. Models exported at higher opsets would fail to find a matching CUDA
kernel. Identified as P1 gaps in microsoft#27729.

### Limitations

BF16 support for GlobalAveragePool-22 and GlobalMaxPool-22 is not added
in this PR.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
edgchen1 and others added 11 commits March 20, 2026 09:42
### Description

Fix some issues that show up as test failures in
`js/web/test/data/ops/dequantizelinear.jsonc`.

1. When `component=4`, output shapes where the last dimension was not
divisible by `component` were not handled.
`onnxruntime/core/providers/webgpu/program.cc:247 TensorShape
onnxruntime::webgpu::(anonymous namespace)::GetReducedShape(const
TensorShape &, int) shape.NumDimensions() > 0 &&
shape.GetDims()[shape.NumDimensions() - 1] % component == 0 was false.
Cannot reduce shape {2,2} by component=4`
Added `ProgramOutput::Flatten` to the output definition to address this.

2. Fix handling of zero point in blocked quantization path.

Also renamed some test cases with more descriptive names.

### Motivation and Context

Fix some issues with WebGPU DequantizeLinear op implementation.
### Description
React Native is currently limited by network isolation.


…crosoft#27549)

### Description

Fixes float16 tensor support in the React Native binding by mapping
`ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT16` to `Uint16Array` instead of
`Float16Array`.

[Hermes does not support
`Float16Array`](https://github.com/facebook/hermes/blob/main/include/hermes/VM/TypedArrays.def).
When the binding tries to construct a `Float16Array` via the JSI global
lookup, it gets `undefined` and crashes.

This is the React Native equivalent of microsoft#27327 (the Node.js fix for the
same issue, merged Feb 2026).

### Motivation

Any React Native app using Hermes (the default since RN 0.70) using a
model with float16 inputs/outputs crashes at runtime on both iOS and
Android.

### Changes

- `js/react_native/cpp/TensorUtils.cpp`: Change `"Float16Array"` to
`"Uint16Array"` in `dataTypeToTypedArrayMap`

Resolves microsoft#27548
### Description

Extends GRU CUDA kernel registration from opset 14 to opset 22,
following the same pattern as other recent opset gap fills (e.g.,
ConvTranspose in microsoft#27710).

- **`gru.cc`**: Cap existing opset-14 non-versioned kernel to versioned
14–21; add new non-versioned kernel at opset 22+
- **`cuda_execution_provider.cc`**: Update forward declarations and
`BuildKernelCreateInfo` entries for versioned 14–21 and non-versioned
22+
- **`deep_cpu_gru_op_test.cc`**: Add CUDA-specific test for GRU at opset
22 with `linear_before_reset=1` (cuDNN requirement)
- **`docs/OperatorKernels.md`**: Update CUDA provider GRU entry to
reflect `22+`, `[14, 21]`, and `[7, 13]` version ranges

No functional changes to the kernel implementation—the GRU spec is
unchanged between opsets 14 and 22.

### Motivation and Context

CUDA EP registered GRU only up to opset 14, while ONNX defines GRU
through opset 22. Models exported at opset ≥15 would fail to find a
matching CUDA kernel and fall back to CPU. This is one of the P1 gaps
tracked in microsoft#27729.

### Limitation

BF16 version is not added for GRU-22. It can be added later if needed.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Add bounds checking for label tensor values in
`SparseSoftmaxCrossEntropy::Compute` to prevent out-of-bounds memory
reads.


The `SparseSoftmaxCrossEntropy` operator uses `label_data[i]` (int64_t)
directly as an array index into the log-probability buffer without
validating that the value falls within `[0, D)` where `D` is the number
of classes. A malicious ONNX model can embed arbitrary label values in a
model initializer, causing the operator to read heap memory beyond the
log-probability buffer.

Affected expressions in `cross_entropy.cc`:
```cpp
loss_sample[i] = -log_prob_data[i * d + label_data[i]] * weight_data[i];  // weighted path
loss_sample[i] = -log_prob_data[i * d + label_data[i]];                   // unweighted path
```

Existing shape validation confirms label and logit dimensions are
compatible, but never validates label **values** against the class
dimension.

### Fix

Added a validation loop before the loss computation that returns an
error status if any label value is outside `[0, D)`:

```cpp
for (ptrdiff_t i = 0; i < n; i++) {
  ORT_RETURN_IF(label_data[i] < 0 || label_data[i] >= d,
                "SparseSoftmaxCrossEntropy: label value ", label_data[i],
                " at index ", i, " is out of range [0, ", d, ")");
}
```
…#27787)

Bumps [flatted](https://github.com/WebReflection/flatted) from 3.2.7 to
3.4.2.
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/WebReflection/flatted/commit/3bf09091c3562e17a0647bc06710dd6097079cf7"><code>3bf0909</code></a>
3.4.2</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/885ddcc33cf9657caf38c57c7be45ae1c5272802"><code>885ddcc</code></a>
fix CWE-1321</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/0bdba705d130f00892b1b8fcc80cf4cdea0631e3"><code>0bdba70</code></a>
added flatted-view to the benchmark</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/2a02dce7c641dec31194c67663f9b0b12e62da20"><code>2a02dce</code></a>
3.4.1</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/fba4e8f2e113665da275b19cd0f695f3d98e9416"><code>fba4e8f</code></a>
Merge pull request <a
href="https://redirect.github.com/WebReflection/flatted/issues/89">#89</a>
from WebReflection/python-fix</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/5fe86485e6df7f7f34a07a2a85498bd3e17384e7"><code>5fe8648</code></a>
added &quot;when in Rome&quot; also a test for PHP</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/53517adbefe724fe472b2f9ebcdb01910d0ae3f0"><code>53517ad</code></a>
some minor improvement</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/b3e2a0c387bf446435fec45ad7f05299f012346f"><code>b3e2a0c</code></a>
Fixing recursion issue in Python too</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/c4b46dbcbf782326e54ea1b65d3ebb1dc7a23fad"><code>c4b46db</code></a>
Add SECURITY.md for security policy and reporting</li>
<li><a
href="https://github.com/WebReflection/flatted/commit/f86d071e0f70de5a7d8200198824a3f07fc9c988"><code>f86d071</code></a>
Create dependabot.yml for version updates</li>
<li>Additional commits viewable in <a
href="https://github.com/WebReflection/flatted/compare/v3.2.7...v3.4.2">compare
view</a></li>
</ul>
</details>
<br />


[![Dependabot compatibility
score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=flatted&package-manager=npm_and_yarn&previous-version=3.2.7&new-version=3.4.2)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot show <dependency name> ignore conditions` will show all
of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the
[Security Alerts
page](https://github.com/microsoft/onnxruntime/network/alerts).

</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…Loop, Scan, ConstantOfShape, Size (microsoft#27728)

## Summary
- Extend CUDA EP opset 21/23 kernel registrations to 7 additional
operators that were updated in ONNX opset 21 but lacked proper CUDA
kernel version declarations
- Operators fixed: **Flatten**, **Identity**, **If**, **Loop**,
**Scan**, **ConstantOfShape**, **Size**
- Follows the identical pattern established in PR microsoft#26075 for Shape,
Reshape, Transpose, Squeeze, Unsqueeze

## Motivation
Fixes microsoft#27102.

When ONNX introduces a new operator version in opset 21, ORT's
`VerifyVersion` function in `kernel_registry.cc` rejects non-versioned
(open-ended) CUDA kernels. The check at
[kernel_registry.cc:L126-L133](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/framework/kernel_registry.cc#L126)
requires either an exact version match or a bounded version range — a
kernel registered as `since_version=N, end_version=INT_MAX` fails when
`since_ver` (from the opset 21 schema) differs from `N`.

This causes the affected operators to fall back from CUDA to CPU,
introducing unnecessary host↔device memory copies. On Windows with CUDA
EP, this fallback path can produce corrupted shape computation values
(e.g., `124647109376` instead of `6`), leading to downstream Reshape
failures.

PR microsoft#26075 fixed this for Shape, Reshape, Transpose, Squeeze, and
Unsqueeze. This PR extends the same fix to the 7 remaining operators
that were updated in ONNX opset 21 and had non-versioned CUDA kernels.
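The version check described above can be sketched as a small Python model (a deliberate simplification of the C++ logic in `kernel_registry.cc`; names here are illustrative, not ORT's actual API):

```python
INT_MAX = 2**31 - 1

def verify_version(kernel_start, kernel_end, op_since_ver):
    """Simplified model of the VerifyVersion check: a kernel matches only on
    an exact start-version match or a bounded range covering the op version."""
    if kernel_start == op_since_ver:
        return True
    return kernel_end != INT_MAX and kernel_start <= op_since_ver <= kernel_end

# An open-ended kernel (since_version=14, end_version=INT_MAX) is rejected
# once the op schema's since_ver moves to 21:
open_ended_rejected = not verify_version(14, INT_MAX, 21)

# A bounded kernel whose range covers opset 21 matches:
bounded_accepted = verify_version(21, 22, 21)
```

Under this model, any kernel registered with an open upper bound only ever matches the exact schema version it was registered against, which is why each new ONNX opset bump requires capping the old kernel and adding a new registration.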

## Changes
For each of the 7 operators:
1. **Cap existing non-versioned kernel** to opset 20
(`ONNX_OPERATOR_KERNEL` → `ONNX_OPERATOR_VERSIONED_KERNEL`)
2. **Add VERSIONED(21, 22) kernel** with identical type constraints
3. **Add non-versioned opset 23 kernel** for forward compatibility
(opset 23 introduced another schema update for these operators)

Files modified:
- `onnxruntime/core/providers/cuda/cuda_execution_provider.cc` — forward
declarations + `BuildKernelCreateInfo` registration
- `onnxruntime/core/providers/cuda/tensor/flatten.cc`
- `onnxruntime/core/providers/cuda/tensor/identity_op.cc`
- `onnxruntime/core/providers/cuda/tensor/size.cc`
- `onnxruntime/core/providers/cuda/generator/constant_of_shape.cc`
- `onnxruntime/core/providers/cuda/controlflow/if.cc`
- `onnxruntime/core/providers/cuda/controlflow/loop.cc`
- `onnxruntime/core/providers/cuda/controlflow/scan.cc`
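Under the same simplified matching rules, the resulting three-tier registration can be modeled as picking the first kernel whose version range satisfies the check (an illustrative sketch with made-up legacy version numbers, not ORT's registry code):

```python
INT_MAX = 2**31 - 1

def matches(kernel, op_since_ver):
    start, end = kernel
    if start == op_since_ver:
        return True
    return end != INT_MAX and start <= op_since_ver <= end

# The fixed registration set for one operator: capped legacy kernel,
# a VERSIONED(21, 22) kernel, and an open-ended opset-23 kernel.
kernels = [(13, 20), (21, 22), (23, INT_MAX)]

def resolve(op_since_ver):
    for k in kernels:
        if matches(k, op_since_ver):
            return k
    return None  # no CUDA kernel -> falls back to CPU

resolved = {v: resolve(v) for v in (13, 20, 21, 23)}
```

Every opset from the legacy range through 23 now resolves to a CUDA kernel, so the CPU fallback (and its host-device copies) is avoided.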

## Test Plan
- [ ] Verify CUDA EP build compiles successfully (CI)
- [ ] Existing opset 21 tests for Shape/Reshape/Squeeze/Unsqueeze pass
(validates the pattern)
- [ ] Verify operators are no longer falling back to CPU when running
opset 21 models on CUDA
- [ ] No regression in existing CUDA EP tests
### Description
Fuses a pattern of 4 Slices + 1 Concat into a single SpaceToDepth op (plus 1
Gather if the channel order doesn't match the expected default pattern). Saves
about 1ms on Yolox_tiny, a sub-15ms model on an AVX512 machine, so it is a
meaningful saving.


### Motivation and Context
Improves performance especially when the SpaceToDepth operation occurs
in a cheap real-time model

Original SpaceToDepth pattern in the model (4 Slices + 1 Concat incur many
memory transactions that a single optimized kernel can replace):

<img width="389" height="213" alt="image"
src="https://github.com/user-attachments/assets/670be33e-84cf-4235-87dd-4086fc43d5f8"
/>
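The equivalence behind the fusion can be checked numerically. Below is a minimal pure-Python sketch for one channel and blocksize 2, assuming ONNX's SpaceToDepth channel ordering of (row_offset, col_offset); the YOLO-style slice order differs, which is where the extra Gather (a channel permutation) comes in:

```python
# One input channel, 4x4 spatial (C=1, H=4, W=4), blocksize 2.
x = [[[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]]

def strided_slice(x, r0, c0):
    # x[:, r0::2, c0::2] -- one of the four Slice nodes in the original pattern.
    return [[row[c0::2] for row in ch[r0::2]] for ch in x]

def space_to_depth2(x):
    # ONNX SpaceToDepth, blocksize 2: output channel index is
    # (row_offset * 2 + col_offset) * C + c.
    h2, w2 = len(x[0]) // 2, len(x[0][0]) // 2
    return [[[ch[2 * h + r][2 * w + c] for w in range(w2)] for h in range(h2)]
            for r in (0, 1) for c in (0, 1) for ch in x]

# YOLO-style Focus: 4 Slices concatenated on the channel axis in
# (0,0), (1,0), (0,1), (1,1) order.
yolo = (strided_slice(x, 0, 0) + strided_slice(x, 1, 0)
        + strided_slice(x, 0, 1) + strided_slice(x, 1, 1))

# One Gather (channel permutation) maps SpaceToDepth's channel order onto
# the YOLO slice order.
permuted = [yolo[i] for i in (0, 2, 1, 3)]
equal = permuted == space_to_depth2(x)
```

This is why the fusion needs at most one extra Gather: the four strided slices recover exactly the SpaceToDepth channels, just possibly in a different order.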

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…#27691)

### Description

2 main changes:

1) Handle activations seen in modern CNNs (`QuickGelu`, `Gelu`) in the NCHWc
transformer, and avoid inserting reorder nodes before and after them to perform
the NCHWc <-> NCHW data layout transforms. These reorders can be avoided
because the ops are elementwise and therefore data-layout agnostic.

2) Rewrites a channel-scaling Mul (with a scale input of shape 1,C,1,1 or
C,1,1) into a depthwise NCHWc Conv operation. This avoids reorder nodes
and enables fusion of any subsequent `Add` operations into the new
`Conv` node.
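The rewrite in point 2 relies on the identity that a per-channel scale is exactly a 1x1 depthwise convolution whose weights are the scale values. A minimal pure-Python check (illustrative only, not the NCHWc transformer's code):

```python
# Input: C=2 channels, 2x2 spatial; per-channel scale of shape (C, 1, 1).
x = [[[1.0, 2.0], [3.0, 4.0]],
     [[5.0, 6.0], [7.0, 8.0]]]
scale = [0.5, 2.0]

# The Mul node: broadcast multiply by the channel scale.
mul_out = [[[v * scale[c] for v in row] for row in x[c]] for c in range(len(x))]

# Depthwise 1x1 Conv (groups=C): each output channel is its own input channel
# convolved with a single 1x1 weight -- set that weight to the scale value.
weights = [[[0.5]], [[2.0]]]  # shape (C, 1, 1)

def depthwise_conv_1x1(x, w):
    # out[c][i][j] = x[c][i][j] * w[c][0][0]
    return [[[x[c][i][j] * w[c][0][0] for j in range(len(x[c][i]))]
             for i in range(len(x[c]))] for c in range(len(x))]

conv_out = depthwise_conv_1x1(x, weights)
equal = mul_out == conv_out
```

Because the two produce identical outputs, the Mul can be folded into the NCHWc Conv path without changing the model's numerics.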


### Motivation and Context
Avoid unnecessary data layout operations and enable more NCHWc
compatible compute and fusions
## Summary

- Move `tf.function`-decorated forward functions out of the inner
benchmark loop to prevent unnecessary graph retracing on every
`(batch_size, sequence_length)` iteration
- Update deprecated `experimental_compile` to `jit_compile` (available
since TF 2.4)
- Hoist `import random` out of the inner loop

Fixes microsoft#14953

## Motivation

When `run_with_tf_optimizations` is used as a decorator inside the
innermost `(batch_size, sequence_length)` loop, each iteration creates a
new Python function object. Since `tf.function` keys its trace cache on
function identity, a new object means a forced retrace every iteration —
the cached graph is never reused. This defeats the purpose of
`tf.function` and adds significant overhead from repeated graph
construction and optimization passes.

The [TensorFlow documentation on
tracing](https://www.tensorflow.org/guide/function#rules_of_tracing)
explicitly warns against defining `tf.function`-decorated functions
inside loops.
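The retracing failure mode can be illustrated without TensorFlow: model a trace cache keyed on the function object (which is how `tf.function` keys its cache) plus the input signature, and compare loop-local vs hoisted definitions. This is a hedged stand-in, not TF's actual implementation:

```python
cache = {}
traces = 0

def maybe_trace(fn, input_shape):
    # Stand-in for tf.function's cache lookup: keyed on the function object
    # itself plus the input signature.
    global traces
    key = (fn, input_shape)
    if key not in cache:
        traces += 1          # a "retrace": a new graph would be built here
        cache[key] = fn
    return cache[key]

# Anti-pattern: defining the function inside the loop creates a new object
# each iteration, so the cache never hits.
for _ in range(3):
    def forward(x):
        return x
    maybe_trace(forward, (1, 128))
loop_local_traces = traces

# Hoisted: one function object, traced once per distinct input shape.
traces = 0
cache.clear()
def forward(x):
    return x
for _ in range(3):
    maybe_trace(forward, (1, 128))
hoisted_traces = traces
```

Three iterations cost three traces in the loop-local version but only one when the function is hoisted, which is the entire benefit of the refactor.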

## Changes

**`onnxruntime/python/tools/transformers/benchmark.py`** (1 file, ~35
insertions / ~31 deletions):

1. **Hoisted forward function definitions** (`encoder_forward`,
`encoder_decoder_forward`, `lxmert_forward`) from the inner `batch_size
× sequence_length` loop to the per-model scope. They are now defined
once per model, and the `@run_with_tf_optimizations` decorator (which
applies `@tf.function`) is only invoked once per model.

2. **Changed forward functions to accept `input_ids` as a parameter**
instead of closing over the loop variable. This lets `tf.function` trace
based on the tensor's `(dtype, shape)` spec and reuse cached concrete
functions when shapes repeat across iterations.

3. **Updated `experimental_compile=use_xla`** to
**`jit_compile=use_xla`**. The `experimental_compile` parameter was
deprecated in TF 2.4 (Dec 2020) and removed in TF 2.12.

4. **Moved `import random`** from the innermost loop body to before the
outer model loop — the module only needs to be imported once.

5. **Moved inference function selection** (`if config.is_encoder_decoder
... elif isinstance(config, LxmertConfig) ...`) outside the
batch/sequence loops since it depends only on the model config, not on
batch size or sequence length. The original priority order
(`is_encoder_decoder` checked before `LxmertConfig`) is preserved.
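Change 2 (passing `input_ids` as a parameter) matters because the cache then keys on the input's shape and hits whenever shapes repeat across iterations. A sketch using a hypothetical `jit_stub` decorator standing in for `tf.function`:

```python
traced_shapes = []

def jit_stub(fn):
    # Hypothetical stand-in for tf.function: trace once per input shape,
    # reuse the cached "graph" afterwards.
    cache = {}
    def wrapper(input_shape):
        if input_shape not in cache:
            traced_shapes.append(input_shape)  # a new trace
            cache[input_shape] = fn
        return cache[input_shape](input_shape)
    return wrapper

@jit_stub
def encoder_forward(input_ids_shape):
    # Placeholder forward pass; real code would run the model here.
    return input_ids_shape

# Defined once per model, then reused across the batch/sequence sweep:
for batch_size in (1, 4):
    for sequence_length in (32, 128):
        for _ in range(3):  # repeated measurement runs at this shape
            encoder_forward((batch_size, sequence_length))

num_traces = len(traced_shapes)  # one per distinct (batch, seq) pair
```

Twelve calls trigger only four traces, one per distinct shape, instead of a retrace on every call.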

## Test Plan

- [x] `lintrunner -a` passes cleanly (no RUFF or RUFF-FORMAT violations)
- [x] `python -m py_compile benchmark.py` — syntax verified
- [x] Change is purely structural — function behavior (inputs, outputs,
control flow) is identical
- [ ] Manual verification with TensorFlow installed (TF is an optional
dependency not present in the standard CI matrix; this code path is
exercised via `python benchmark.py -e tensorflow`)
@ankitm3k ankitm3k merged commit f153255 into ovep-develop Mar 23, 2026
7 of 8 checks passed
@ankitm3k ankitm3k deleted the sync_msft_23032026 branch March 23, 2026 07:49