Sync with Microsoft ONNX Runtime - 25032026 by ai-fw-intg · Pull Request #989 · intel/onnxruntime

ai-fw-intg · 2026-03-24T21:03:19Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

### Description

This change adds support for the Deformable Convolution 2D operator (DeformConv2D) to ONNX Runtime. The branch implements the operator schema and registration, provides kernel implementations (CPU and GPU/CUDA where available), implements shape inference, and adds unit and integration tests to validate correctness and numerical parity with reference implementations. The changes include performance-oriented optimizations and necessary changes to build/test scripts.

### Motivation and Context

Deformable convolutions are widely used in vision models that require spatial sampling flexibility (e.g., Deformable ConvNets, some detection/segmentation models). Native support in ONNX Runtime enables these models to run efficiently without custom operators or external runtimes, broadening the set of compatible models and improving performance and portability.

### See also

- https://onnx.ai/onnx/operators/onnx__DeformConv.html
- https://docs.pytorch.org/vision/main/generated/torchvision.ops.deform_conv2d.html
- https://arxiv.org/abs/1811.11168
- https://arxiv.org/abs/1703.06211
- https://github.com/pytorch/vision/blob/0f6d91d9fe514e6de2f5519114cbeb389d498b2d/torchvision/csrc/ops/cuda/deform_conv2d_kernel.cu
- https://github.com/open-mmlab/mmdetection/blob/master/mmdet/ops/dcn/src/deform_conv_cuda.cpp
- https://github.com/pytorch/vision/blob/0f6d91d9fe514e6de2f5519114cbeb389d498b2d/torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp
- microsoft#22060
- microsoft#15572
- microsoft#20810
- microsoft#16903
- onnx/onnx#5451
- ZhengPeng7/BiRefNet#167
- pytorch/pytorch#68910
- pytorch/vision#2066

…de (microsoft#27724)

### Description

On Windows, `std::filesystem::path::string()` converts the internal UTF-16 representation to a narrow string using the system's active ANSI code page. When the path contains characters outside that code page (Japanese, Chinese, Korean, etc.), this throws `std::system_error` with 'No mapping for the Unicode character exists in the target multi-byte code page.'

This affected both the core session telemetry logging (causing `InferenceSession::Initialize()` to fail) and execution provider code (OpenVINO, TensorRT, TensorRT RTX, QNN, MIGraphX) where model paths are converted for EPContext attributes and profiling.

### Motivation and Context

Fix: Replace `.filename().string()` with `PathToUTF8String(.filename().native())` which uses `WideCharToMultiByte(CP_UTF8, ...)` and handles all Unicode characters correctly. This pattern is already used elsewhere in the codebase for path-to-string conversions.

Note: Two remaining instances in Linux-only code (`cann_utils.cc`, `device_discovery.cc`) are left as-is since `.string()` is safe on Linux where paths are already narrow strings.

Fixes microsoft/WindowsAppSDK#6173

---------

Co-authored-by: Sagar Bhure <sagarbhure@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
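The failure mode described above can be reproduced in miniature outside ONNX Runtime. The following is only an analogy in Python, not the ORT code path: `cp1252` stands in for a Windows ANSI code page, and `to_narrow` and the sample file name are hypothetical names chosen for illustration.

```python
from typing import Optional


def to_narrow(file_name: str, encoding: str) -> Optional[bytes]:
    """Encode a file name, returning None when the encoding cannot represent it."""
    try:
        return file_name.encode(encoding)
    except UnicodeEncodeError:
        # Comparable in spirit to the "no mapping for the Unicode character" error.
        return None


# Hypothetical model file name containing Japanese characters.
model_name = "モデル.onnx"

# cp1252 (a stand-in for an ANSI code page) has no mapping for these characters.
print(to_narrow(model_name, "cp1252"))

# UTF-8 can represent every Unicode character, so the conversion succeeds.
print(to_narrow(model_name, "utf-8"))
```

The point of the sketch is that the conversion target, not the path itself, decides whether the operation can fail, which is why converting to UTF-8 avoids the exception.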

Support RoiAlign for opset versions 16 and 22 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

…27774)

### Description

This PR consolidates PRs microsoft#27416 and microsoft#27708 to extend CUDA Pad kernel support through opset 25, including wrap mode implementation.

### Motivation and Context

The CUDA execution provider previously only registered the Pad kernel up to opset 18 and did not implement wrap mode. When an ONNX model exported with opset 19+ was run on the CUDA executor, the Pad operation was forced to fall back to CPU, resulting in significant performance degradation. This PR aligns CUDA Pad registration with the ONNX Pad schema evolution through opset 25 and provides a correct wrap mode implementation.

Related issues: microsoft#26393
Related PRs: microsoft#27416, microsoft#27708

### Summary of Changes

#### Kernel registration and opset coverage

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/tensor/pad.cc` | Adds CUDA Pad kernel registrations for opset ranges 18, 19-20, 21-22, 23, 24, and 25. |
| `onnxruntime/core/providers/cuda/cuda_execution_provider.cc` | Registers the new Pad kernel versions in the CUDA EP registry under the existing per-opset sections. |

#### CUDA Pad implementation

| File | Change |
|------|--------|
| `onnxruntime/core/providers/cuda/tensor/pad_impl.h` | Extends the Pad kernel interface to pass effective sliced extents and per-axis input offsets. |
| `onnxruntime/core/providers/cuda/tensor/pad_impl.cu` | Adds CUDA wrap mode using a `WrapCoordinate` device helper with `if constexpr` compile-time specialization. Removes dead wrap code from the NCHW-specialized kernel path. |
| `onnxruntime/core/providers/cuda/tensor/pad.cc` | Computes effective sliced input extents/offsets for wrap behavior with negative pads. Bypasses the NCHW fast-path for wrap mode and routes through the generic implementation. |

#### Documentation

| File | Change |
|------|--------|
| `docs/OperatorKernels.md` | Updates the CUDA Pad kernel opset coverage to reflect the new version splits (25+, 24, 23, [21,22], [19,20], 18) up to opset 25. |

#### Test coverage

| File | Change |
|------|--------|
| `onnxruntime/test/providers/cpu/tensor/pad_test.cc` | Adds CUDA-only Pad coverage for `edge` across opsets 18-25 and `wrap` across opsets 19-25. Updates existing wrap test comment. |

### Checklist

- [x] Tests added/updated
- [x] No breaking changes

---------

Co-authored-by: Shirasawa <764798966@qq.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tianleiwu <30328909+tianleiwu@users.noreply.github.com>
Co-authored-by: Tianlei Wu <tlwu@microsoft.com>
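To make the wrap behavior concrete, here is a minimal one-dimensional sketch in Python. It is not the CUDA kernel from this PR: `wrap_coordinate` and `pad_wrap_1d` are hypothetical names, only non-negative pads are handled, and the negative-pad slicing described in the table is deliberately left out.

```python
from typing import List, Sequence


def wrap_coordinate(index: int, extent: int) -> int:
    """Map any integer index into [0, extent) by wrapping around.

    The double-modulo form also works in languages where % can return a
    negative result for negative operands (for example C, C++ and CUDA).
    """
    return ((index % extent) + extent) % extent


def pad_wrap_1d(data: Sequence[int], pad_before: int, pad_after: int) -> List[int]:
    """Pad a 1-D sequence in wrap mode (non-negative pads only in this sketch)."""
    if pad_before < 0 or pad_after < 0:
        raise ValueError("this sketch does not handle negative pads")
    extent = len(data)
    if extent == 0:
        raise ValueError("cannot wrap an empty input")
    out_len = pad_before + extent + pad_after
    # Each output position reads from the wrapped input position.
    return [data[wrap_coordinate(o - pad_before, extent)] for o in range(out_len)]


print(pad_wrap_1d([1, 2, 3], pad_before=2, pad_after=1))
```

Each output element is computed independently from its own output index, which is the property that makes this kind of mapping a natural fit for a one-thread-per-element GPU kernel.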

…ft#27800)

### Description

The QNN Python wheel pipeline for Linux did not take into account the QNN version that the user can override when running the pipeline; it always used the default from the pipeline YAML files. This change allows overriding the QNN version in the onnxruntime-qnn Linux Python wheel pipelines.

…crosoft#27812)

The kernel computes `neg_scaled_zp = -(scale * zp)` in fp16 first (intermediate rounding), then uses it in the fma. For `scale * zp` in range [128, 256), the fp16 ULP is 0.125, so this intermediate rounding error (~0.06) propagates to the result and exceeded the test tolerance. The reference was computing everything in fp32 and converting to fp16 only at the end, avoiding this intermediate rounding. This caused mismatches up to 0.15 (29 fp16 ULPs).

Fix: emulate the kernel's fp16 computation order in the reference:

1. `neg_szp = MLAS_FP16(-(scale * zp)).ToFloat()` (fp16 round-trip)
2. `result = MLAS_FP16(neg_szp + value * scale)` (emulates fma)
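The effect of the intermediate rounding can be sketched with Python's standard library, which can round-trip values through IEEE 754 half precision via the `struct` format character `e`. This is an illustration under assumptions, not the MLAS test code: the function names are hypothetical, Python doubles stand in for fp32, and the single-rounding reference is assumed to compute `(value - zp) * scale`.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dequant_single_rounding(value: float, scale: float, zp: float) -> float:
    # Assumed form of the old reference: full-precision math, one final fp16 rounding.
    return to_fp16((value - zp) * scale)


def dequant_kernel_order(value: float, scale: float, zp: float) -> float:
    # Step 1: -(scale * zp) is rounded to fp16 before it is used (fp16 round-trip).
    neg_szp = to_fp16(-(scale * zp))
    # Step 2: the fused multiply-add rounds once, after the product is added.
    return to_fp16(neg_szp + value * scale)


# Hypothetical inputs chosen so that scale * zp falls in [128, 256).
scale = to_fp16(13.37)  # scales are held in fp16, so round the input first
zp = 15.0
value = 15.0

print(dequant_single_rounding(value, scale, zp))
print(dequant_kernel_order(value, scale, zp))
```

With these inputs the two orders disagree even though `value == zp`, because `scale * zp` is not exactly representable in fp16. Matching the reference to the kernel's rounding order removes that disagreement instead of widening the tolerance.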

webgpu ep will now pass https://wpt.live/webnn/conformance_tests/where.https.any.html?gpu

…sor default op… (microsoft#27773) fixes a few webnn test cases for webgpu ep. This https://wpt.live/webnn/conformance_tests/transpose.https.any.html?gpu should pass 100% now

… quantization (microsoft#27769)

Extends the QDQ `DQMatMulToMatMulNBits` fusion to handle additional quantization patterns beyond the existing blockwise DQ→MatMul case.

### New support

- **Gemm**: Fuses DQ→Gemm (with optional bias, including DQ bias) into MatMulNBits, stripping Gemm-specific attributes (`alpha`, `beta`, `transB`).
- **Per-tensor & per-channel quantization**: Expands scalar/1D scales and zero-points into block-quantized format expected by MatMulNBits. Block size is configurable via `session.qdq_matmulnbits_block_size` (default: 32).

### Changes

- **Selectors** (qdq_selectors.cc): Replaced `ValidateBlockwiseDQForMatMulNBits` with `ValidateDQForMatMulNBits` supporting all three quantization modes. Added Gemm-specific validation.
- **Actions** (qdq_actions.cc): Added scale/zp expansion for non-blockwise cases, Gemm attribute cleanup, and bias wiring to MatMulNBits input 5.
- **Registration** (qdq_selector_action_transformer.cc): Registered `Gemm` alongside `MatMul`; threaded `qdq_matmulnbits_block_size` from session config.
- **Tests** (qdq_matmulnbits_transformer_test.cc): Added tests for per-tensor, per-channel, Gemm (no bias, constant bias, DQ bias), block size options, and negative cases.
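The scale expansion for the non-blockwise cases can be pictured as replicating each scale across all blocks of its output channel. The sketch below is an assumption-laden illustration, not the code in `qdq_actions.cc`: `expand_scales_to_blocks` is a hypothetical helper, and the `[N][blocks per column]` layout is assumed for readability.

```python
import math
from typing import List, Sequence


def expand_scales_to_blocks(
    scales: Sequence[float], n: int, k: int, block_size: int = 32
) -> List[List[float]]:
    """Expand per-tensor (1 value) or per-channel (n values) scales to one per block.

    Returns a list of n rows, each holding ceil(k / block_size) scales.
    """
    blocks_per_col = math.ceil(k / block_size)
    if len(scales) == 1:
        per_channel = [scales[0]] * n  # per-tensor: one scale shared by every channel
    elif len(scales) == n:
        per_channel = list(scales)  # per-channel: one scale per output channel
    else:
        raise ValueError("expected 1 scale (per-tensor) or n scales (per-channel)")
    # Every block along K within a channel reuses that channel's scale.
    return [[s] * blocks_per_col for s in per_channel]


# Per-tensor: a single scale for a weight with N=2 output channels and K=64.
print(expand_scales_to_blocks([0.5], n=2, k=64))

# Per-channel: one scale per output channel, K=40 needs two blocks of 32.
print(expand_scales_to_blocks([1.0, 2.0], n=2, k=40))
```

Zero-points would be expanded the same way under the same assumptions; a smaller block size produces more replicated entries per channel but does not change the dequantized values.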

…7785) Bumps [flatted](https://github.com/WebReflection/flatted) from 3.3.3 to 3.4.2. <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/WebReflection/flatted/commit/3bf09091c3562e17a0647bc06710dd6097079cf7"><code>3bf0909</code></a> 3.4.2</li> <li><a href="https://github.com/WebReflection/flatted/commit/885ddcc33cf9657caf38c57c7be45ae1c5272802"><code>885ddcc</code></a> fix CWE-1321</li> <li><a href="https://github.com/WebReflection/flatted/commit/0bdba705d130f00892b1b8fcc80cf4cdea0631e3"><code>0bdba70</code></a> added flatted-view to the benchmark</li> <li><a href="https://github.com/WebReflection/flatted/commit/2a02dce7c641dec31194c67663f9b0b12e62da20"><code>2a02dce</code></a> 3.4.1</li> <li><a href="https://github.com/WebReflection/flatted/commit/fba4e8f2e113665da275b19cd0f695f3d98e9416"><code>fba4e8f</code></a> Merge pull request <a href="https://redirect.github.com/WebReflection/flatted/issues/89">#89</a> from WebReflection/python-fix</li> <li><a href="https://github.com/WebReflection/flatted/commit/5fe86485e6df7f7f34a07a2a85498bd3e17384e7"><code>5fe8648</code></a> added "when in Rome" also a test for PHP</li> <li><a href="https://github.com/WebReflection/flatted/commit/53517adbefe724fe472b2f9ebcdb01910d0ae3f0"><code>53517ad</code></a> some minor improvement</li> <li><a href="https://github.com/WebReflection/flatted/commit/b3e2a0c387bf446435fec45ad7f05299f012346f"><code>b3e2a0c</code></a> Fixing recursion issue in Python too</li> <li><a href="https://github.com/WebReflection/flatted/commit/c4b46dbcbf782326e54ea1b65d3ebb1dc7a23fad"><code>c4b46db</code></a> Add SECURITY.md for security policy and reporting</li> <li><a href="https://github.com/WebReflection/flatted/commit/f86d071e0f70de5a7d8200198824a3f07fc9c988"><code>f86d071</code></a> Create dependabot.yml for version updates</li> <li>Additional commits viewable in <a href="https://github.com/WebReflection/flatted/compare/v3.3.3...v3.4.2">compare view</a></li> </ul> </details> <br /> 
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

### Description

Add fused Silu and Exact Gelu (Erf based) for AVX512f

Silu benchmarks:

<img width="225" height="442" alt="image" src="https://github.com/user-attachments/assets/42ce53a2-10cb-496f-b3d9-23b9ebf26be3" />

GELU exact (Erf) benchmarks:

<img width="218" height="431" alt="image" src="https://github.com/user-attachments/assets/c68b260a-c209-437e-819b-fb200212ee54" />

### Motivation and Context

Improve performance on AVX512F

Silu shows small regression at B=1 but I don't think the absolute difference is much

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

### Description

This PR makes it possible to build WebGPU EP as an EP API based plugin EP.

#### Requirements

The goal of this PR is to support building WebGPU EP both as a bundled EP and as an EP API based plugin EP. This approach allows:

- enabling WebGPU EP as a standalone plugin EP package for WCR usage
- a graceful transition for WebGPU EP as a native EP for language bindings, from the bundled EP to an EP API based plugin EP
- keeping the existing usage (static library) working (mainly for web)

#### Design & Implementation

Instead of **changing** WebGPU EP from a bundled EP to an EP API based plugin EP in one shot, this PR **extends** WebGPU EP to support building as a plugin EP.

- add a new folder `include/onnxruntime/ep` with a set of header files. Those files are not WebGPU specific. They are used to:
  - include common defines/functions/macros for plugin EPs to use
  - include a few "adapter" classes that take C-API objects to simulate the behavior of ORT internal classes
  - include a few "override" classes that simulate ORT internal classes, but using implementations that only depend on the C-API
  - include a special base class `onnxruntime::ep::Ep` to inherit from

  These header files allow a compile-time "switch" to a different set of types to minimize changes to existing code. Specifically, `pch.h` must be included as the PCH to make sure the "override" takes place correctly.

- add a new folder `onnxruntime/core/providers/webgpu/ep` for the EP API implementation, specifically:
  - `api.cc`: implements `CreateEpFactories` and `ReleaseEpFactory`
  - `ep.cc` `ep.h`: implement class `onnxruntime::webgpu::ep::Ep`
  - `factory.cc` `factory.h`: implement class `onnxruntime::webgpu::ep::Factory`

#### Dependencies and Prerequisites

(unmerged changes are included as a part of current PR)

- microsoft#26855
- microsoft#26803
- microsoft#26859
- microsoft#26879
- microsoft#26919
- microsoft#27569
- microsoft#27587

#### Missing Parts

- Allow setting Global/Default EP options

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Edward Chen <18449977+edgchen1@users.noreply.github.com>

### Description

Fixes WebGPU Einsum op by replacing the manual `uniforms.inputN_shape[idx]` string with `GetElementAt(...)` which correctly handles uniform shape access for all tensor ranks. I also added a bunch of tests for this...

### Motivation and Context

Closes microsoft#27762

…eation (microsoft#27634)

## Description

We had a weird behavior in Transformers.js V4. After calling `InferenceSession.release()` on a WebGPU session, attempting to create a new WebGPU session fails with:

```
WebGPU device lost (2): Device was destroyed.
```

In Transformers.js we encourage the use of the `create -> release -> create` pattern, because we expect the application to run for some time and might use multiple models. So it makes sense to unload models after the job is done.

It seems like this was introduced in [e03631e](microsoft@e03631ee528), which added the `preserveDevice` option with a default value of `false`. When the last session is released and `preserveDevice=false`, the C++ side destroys the WebGPU device, but the JavaScript reference in `env.webgpu.device` is never cleared, leaving a stale reference to a destroyed device.

## Changes

**Clear stale device reference when lost** (`backend-webgpu.ts`)

1. Made device property `configurable: true` to allow deletion
2. Added cleanup logic in `dispose()` to detect device loss via `device.lost` promise
3. When device is lost (destroyed, driver crash, etc.), delete the stale `env.webgpu.device` reference

This allows subsequent session creation to acquire a fresh device instead of attempting to reuse a lost one.

…c transformer suite (microsoft#27821)

### Description

As title

### Motivation and Context

Tiny continuation to microsoft#27691

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

ShirasawaSama and others added 18 commits March 23, 2026 08:46

[CUDA] RoiAlign for opset versions 16 and 22 (microsoft#27646)

45b5900

Support RoiAlign for opset versions 16 and 22 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix webnn/where compliance tests for webgpu (microsoft#27776)

0c3e5fc

webgpu ep will now pass https://wpt.live/webnn/conformance_tests/where.https.any.html?gpu

fix webnn test case for webgpu ep: 'transpose float32 1D constant ten…

16b556d

…sor default op… (microsoft#27773) fixes a few webnn test cases for webgpu ep. This https://wpt.live/webnn/conformance_tests/transpose.https.any.html?gpu should pass 100% now

Merge remote-tracking branch 'origin/master' into sync_msft_25032026

6af93e8

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel March 24, 2026 21:03

ai-fw-intg wants to merge 18 commits into ovep-develop from sync_msft_25032026

13 participants
