WebGPU: add memory aliasing for intermediate tensor buffers#19305

Open
digantdesai wants to merge 7 commits into main from wgpu_memory_aliasing
Conversation

@digantdesai
Contributor

Use ETVK's mem_obj_id for the WebGPU runtime to implement memory aliasing

@pytorch-bot

pytorch-bot Bot commented May 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19305

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 19 Unrelated Failures

As of commit 3666881 with merge base af90130:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 5, 2026
@digantdesai
Contributor Author

@claude review this PR

@github-actions

github-actions Bot commented May 5, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@claude

claude Bot commented May 5, 2026

Claude finished @digantdesai's task in 1m 37s


Review: WebGPU memory aliasing for intermediate tensor buffers

  • Read changed files and understand the implementation
  • Review WebGPUBackend changes
  • Review WebGPUGraph changes (core aliasing logic)
  • Review test changes
  • Post review feedback

Overall: Clean, well-structured implementation. The approach of using ETVK's mem_obj_id to share GPU buffers between non-overlapping intermediate tensors is sound and the test coverage is good. A few items worth addressing:


Issues

1. Global mutable state for memory stats (WebGPUBackend.cpp:41)

static WebGPUMemoryStats s_last_memory_stats;

This is a process-wide global that gets overwritten on every init() call. If two models are loaded concurrently (or sequentially), the stats from the first are silently lost. This is fine for a test-only helper, but consider documenting that intent (e.g., naming it something like s_last_memory_stats_for_testing) or gating it behind a test/debug flag so it doesn't ship in production builds. If this is intentionally test-only, could it live in the test file instead?

2. Destructor correctness — shared buffer double-release guard (WebGPUGraph.cpp:55-58)

for (size_t i = 0; i < tensors_.size(); i++) {
    if (tensors_[i].buffer &&
        (i >= tensor_mem_obj_ids_.size() || tensor_mem_obj_ids_[i] < 0)) {
      wgpuBufferRelease(tensors_[i].buffer);
    }
}

The guard i >= tensor_mem_obj_ids_.size() should never be true since both vectors are resized to num_vals in build(). If build() throws before that resize (e.g., device is null), both vectors are empty and the loop wouldn't execute anyway. The check is harmless but somewhat misleading—a simple tensor_mem_obj_ids_[i] < 0 would suffice given the invariant. Minor nit, no action needed.

3. total_bytes() excludes tensor_buffer_bytes field (WebGPUGraph.h:44-46)

size_t total_bytes() const {
    return shared_buffer_bytes + unshared_tensor_buffer_bytes +
        staging_buffer_bytes + uniform_buffer_bytes;
}

This is correct (it equals tensor_buffer_bytes + staging + uniform), but it's mildly confusing that total_bytes() doesn't reference the tensor_buffer_bytes field directly. Rewriting as return tensor_buffer_bytes + staging_buffer_bytes + uniform_buffer_bytes; would make the semantics clearer, since tensor_buffer_bytes is already computed as shared + unshared in memory_stats().


Minor Suggestions

4. Test assertion could verify specific aliasing count (test_webgpu_native.cpp:143)

The test asserts stats.num_shared_objects > 0 — for a 5-op chained add with 2 inputs + 1 output + intermediates, you could assert a more specific expected value (e.g., that num_shared_objects equals a known count) to catch regressions where aliasing becomes less effective. Though I recognize this is fragile if the partitioner changes its allocation strategy.

5. export_chained_add_model not invocable from CLI (test_add.py)

The export functions are defined but only callable programmatically or via the test script's -c block. Not a blocker, just noting that adding a --export-chained CLI arg would be nice for manual testing.


What looks good

  • The two-pass approach (collect max sizes in pass 1, allocate shared buffers in pass 2) is clean and avoids over-allocation.
  • The destructor properly separates ownership: shared buffers are released once via shared_buffers_, unshared per-tensor buffers are released individually.
  • The memory stats struct gives good observability into the aliasing savings.
  • The chained-add test validates both correctness and memory savings, which is the right thing to test here.
  • Replacing zero-size buffer fallbacks with asserts (ET_CHECK_MSG) is a good cleanup — zero-size tensors would indicate a bug upstream.

Verdict: LGTM with minor suggestions. The core aliasing logic is correct and well-tested. The global stats accessor is the only thing I'd recommend addressing before merge (even if just renaming to clarify it's test-only).

The export pipeline already runs a greedy memory planning pass that assigns
mem_obj_id to tensors with non-overlapping lifetimes, but the WebGPU runtime
was ignoring it and allocating a dedicated WGPUBuffer per tensor.

Read mem_obj_id from the flatbuffer during graph build. Tensors sharing the
same mem_obj_id now share a single WGPUBuffer sized to the largest user.
Constants and tensors without a mem_obj_id still get dedicated buffers.

Adds a chained-add native test (z=x+y; z=z+x; z=z+y) that verifies both
correctness and that memory aliasing produces savings (~20% for this model).

Co-authored with Claude.
Replace the silent `nbytes > 0 ? nbytes : 4` fallback pattern with
ET_CHECK_MSG assertions. If a zero-byte tensor reaches buffer creation,
we want to know immediately rather than silently creating a dummy 4-byte
buffer that masks the issue.

Co-authored with Claude.
Invert the condition to eliminate the empty if-body with a comment.

Co-authored with Claude.
Export and run the chained-add memory aliasing test in
test_build_webgpu.sh so it runs automatically instead of requiring
a manual WEBGPU_TEST_CHAINED_MODEL env var.

Co-authored with Claude.
Longer chain produces more intermediates, giving the memory planner
more opportunity to alias buffers. Expected output: 3x + 3y.

Co-authored with Claude.
Fix: if a constant tensor has mem_obj_id >= 0, force it to -1 so the
dedicated buffer path and the destructor stay consistent. Previously
the buffer would leak and get overwritten by the shared buffer pass.

Also make the chained-add test actually fail when aliasing is absent
instead of just printing informational messages.

Co-authored with Claude.
…tes()

Rename the static to s_last_memory_stats_for_testing and document
the test-only, single-graph, not-thread-safe intent in the header.

Simplify total_bytes() to use tensor_buffer_bytes directly since it
is already computed as shared + unshared in memory_stats().

Co-authored with Claude.
@digantdesai digantdesai force-pushed the wgpu_memory_aliasing branch from a402f89 to 3666881 on May 7, 2026 03:18
@digantdesai digantdesai marked this pull request as ready for review May 7, 2026 03:18
@digantdesai digantdesai requested review from SS-JIA and Copilot May 7, 2026 03:18
Contributor

Copilot AI left a comment

Pull request overview

This PR adds WebGPU runtime support for tensor-buffer memory aliasing by reusing ETVK/Vulkan mem_obj_id so intermediate tensors can share underlying GPU buffers, and extends the WebGPU native test flow to validate the aliasing behavior and report memory stats.

Changes:

  • Implement shared WGPUBuffer allocation/assignment in WebGPUGraph based on mem_obj_id, and extend memory stats to account for shared vs unshared tensor bytes.
  • Add a test-only mechanism to retrieve the last graph’s memory stats and a new native test that validates aliasing + memory savings using a chained-add model.
  • Update WebGPU test scripts and Python export utilities to generate and run the chained-add model.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
backends/webgpu/runtime/WebGPUGraph.h Extends memory stats and adds state for mem_obj_id-based shared buffers.
backends/webgpu/runtime/WebGPUGraph.cpp Allocates shared buffers by mem_obj_id, adjusts destruction logic, and updates memory stats accounting.
backends/webgpu/runtime/WebGPUBackend.h Declares a test-only accessor for last graph memory stats.
backends/webgpu/runtime/WebGPUBackend.cpp Stores last graph memory stats at init time for tests to query.
backends/webgpu/test/test_webgpu_native.cpp Adds a chained-add native test that checks correctness and confirms aliasing memory savings.
backends/webgpu/test/test_build_webgpu.sh Exports both simple and chained models and runs the native test with both paths.
backends/webgpu/test/ops/add/test_add.py Extends chained-add model and adds an export helper for the chained model.
Comments suppressed due to low confidence (1)

backends/webgpu/runtime/WebGPUGraph.cpp:230

  • Output staging buffer creation now ET_CHECK_MSGs on tensors_[oid].nbytes > 0, which will abort for valid models that produce empty outputs. The previous code handled this by allocating a small non-zero staging buffer while still copying 0 bytes. Consider restoring that behavior (allocate at least 4 bytes, but allow 0-byte outputs) to avoid hard process termination.
      // Create staging buffer for output readback
      WGPUBufferDescriptor staging_desc = {};
      ET_CHECK_MSG(tensors_[oid].nbytes > 0, "Output tensor has zero bytes");
      staging_desc.size = tensors_[oid].nbytes;
      staging_desc.usage = WGPUBufferUsage_MapRead | WGPUBufferUsage_CopyDst;
      staging_desc.mappedAtCreation = false;
      output_staging_buffers_.push_back(
          wgpuDeviceCreateBuffer(device_, &staging_desc));

Comment on lines +136 to +145
// Constants always get dedicated buffers regardless of mem_obj_id
if (constant_id >= 0 || mem_obj_id < 0) {
  tensor_mem_obj_ids_[i] = -1;
  WGPUBufferDescriptor buf_desc = {};
  ET_CHECK_MSG(tensor.nbytes > 0, "Tensor has zero bytes");
  buf_desc.size = tensor.nbytes;
  buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst |
      WGPUBufferUsage_CopySrc;
  buf_desc.mappedAtCreation = false;
  tensor.buffer = wgpuDeviceCreateBuffer(device_, &buf_desc);
Comment on lines +192 to +201
// Allocate shared buffers and assign to tensors
shared_buffers_.resize(shared_buffer_sizes_.size(), nullptr);
for (size_t id = 0; id < shared_buffer_sizes_.size(); id++) {
  WGPUBufferDescriptor buf_desc = {};
  ET_CHECK_MSG(shared_buffer_sizes_[id] > 0, "Shared buffer has zero bytes");
  buf_desc.size = shared_buffer_sizes_[id];
  buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst |
      WGPUBufferUsage_CopySrc;
  buf_desc.mappedAtCreation = false;
  shared_buffers_[id] = wgpuDeviceCreateBuffer(device_, &buf_desc);
Comment on lines 17 to +20

// Test-only: returns memory stats from the most recently initialized graph.
// Not thread-safe; only valid when a single graph is loaded at a time.
WebGPUMemoryStats get_last_memory_stats();