WebGPU: add memory aliasing for intermediate tensor buffers#19305

Open
digantdesai wants to merge 7 commits into main from wgpu_memory_aliasing
Conversation

@digantdesai
Contributor

Use ETVK's mem_obj_id for the WebGPU runtime to implement memory aliasing

@pytorch-bot

pytorch-bot Bot commented May 5, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19305

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 1 Cancelled Job, 19 Unrelated Failures

As of commit 3666881 with merge base af90130:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla Bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 5, 2026
@digantdesai
Contributor Author

@claude review this PR

@github-actions

github-actions Bot commented May 5, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@claude

claude Bot commented May 5, 2026

Claude finished @digantdesai's task in 1m 37s


Review: WebGPU memory aliasing for intermediate tensor buffers

  • Read changed files and understand the implementation
  • Review WebGPUBackend changes
  • Review WebGPUGraph changes (core aliasing logic)
  • Review test changes
  • Post review feedback

Overall: Clean, well-structured implementation. The approach of using ETVK's mem_obj_id to share GPU buffers between non-overlapping intermediate tensors is sound and the test coverage is good. A few items worth addressing:


Issues

1. Global mutable state for memory stats (WebGPUBackend.cpp:41)

static WebGPUMemoryStats s_last_memory_stats;

This is a process-wide global that gets overwritten on every init() call. If two models are loaded concurrently (or sequentially), the stats from the first are silently lost. This is fine for a test-only helper, but consider documenting that intent (e.g., naming it something like s_last_memory_stats_for_testing) or gating it behind a test/debug flag so it doesn't ship in production builds. If this is intentionally test-only, could it live in the test file instead?

2. Destructor correctness — shared buffer double-release guard (WebGPUGraph.cpp:55-58)

for (size_t i = 0; i < tensors_.size(); i++) {
    if (tensors_[i].buffer &&
        (i >= tensor_mem_obj_ids_.size() || tensor_mem_obj_ids_[i] < 0)) {
      wgpuBufferRelease(tensors_[i].buffer);
    }
}

The guard i >= tensor_mem_obj_ids_.size() should never be true since both vectors are resized to num_vals in build(). If build() throws before that resize (e.g., device is null), both vectors are empty and the loop wouldn't execute anyway. The check is harmless but somewhat misleading—a simple tensor_mem_obj_ids_[i] < 0 would suffice given the invariant. Minor nit, no action needed.

3. total_bytes() excludes tensor_buffer_bytes field (WebGPUGraph.h:44-46)

size_t total_bytes() const {
    return shared_buffer_bytes + unshared_tensor_buffer_bytes +
        staging_buffer_bytes + uniform_buffer_bytes;
}

This is correct (it equals tensor_buffer_bytes + staging + uniform), but it's mildly confusing that total_bytes() doesn't reference the tensor_buffer_bytes field directly. Rewriting as return tensor_buffer_bytes + staging_buffer_bytes + uniform_buffer_bytes; would make the semantics clearer, since tensor_buffer_bytes is already computed as shared + unshared in memory_stats().


Minor Suggestions

4. Test assertion could verify specific aliasing count (test_webgpu_native.cpp:143)

The test asserts stats.num_shared_objects > 0 — for a 5-op chained add with 2 inputs + 1 output + intermediates, you could assert a more specific expected value (e.g., that num_shared_objects equals a known count) to catch regressions where aliasing becomes less effective. Though I recognize this is fragile if the partitioner changes its allocation strategy.

5. export_chained_add_model not invocable from CLI (test_add.py)

The export functions are defined but only callable programmatically or via the test script's -c block. Not a blocker, just noting that adding a --export-chained CLI arg would be nice for manual testing.


What looks good

  • The two-pass approach (collect max sizes in pass 1, allocate shared buffers in pass 2) is clean and avoids over-allocation.
  • The destructor properly separates ownership: shared buffers are released once via shared_buffers_, unshared per-tensor buffers are released individually.
  • The memory stats struct gives good observability into the aliasing savings.
  • The chained-add test validates both correctness and memory savings, which is the right thing to test here.
  • Replacing zero-size buffer fallbacks with asserts (ET_CHECK_MSG) is a good cleanup — zero-size tensors would indicate a bug upstream.

Verdict: LGTM with minor suggestions. The core aliasing logic is correct and well-tested. The global stats accessor is the only thing I'd recommend addressing before merge (even if just renaming to clarify it's test-only).

The export pipeline already runs a greedy memory planning pass that assigns
mem_obj_id to tensors with non-overlapping lifetimes, but the WebGPU runtime
was ignoring it and allocating a dedicated WGPUBuffer per tensor.

Read mem_obj_id from the flatbuffer during graph build. Tensors sharing the
same mem_obj_id now share a single WGPUBuffer sized to the largest user.
Constants and tensors without a mem_obj_id still get dedicated buffers.

Adds a chained-add native test (z=x+y; z=z+x; z=z+y) that verifies both
correctness and that memory aliasing produces savings (~20% for this model).

Co-authored with Claude.
Replace the silent `nbytes > 0 ? nbytes : 4` fallback pattern with
ET_CHECK_MSG assertions. If a zero-byte tensor reaches buffer creation,
we want to know immediately rather than silently creating a dummy 4-byte
buffer that masks the issue.

Co-authored with Claude.
Invert the condition to eliminate the empty if-body with a comment.

Co-authored with Claude.
Export and run the chained-add memory aliasing test in
test_build_webgpu.sh so it runs automatically instead of requiring
a manual WEBGPU_TEST_CHAINED_MODEL env var.

Co-authored with Claude.
Longer chain produces more intermediates, giving the memory planner
more opportunity to alias buffers. Expected output: 3x + 3y.

Co-authored with Claude.
Fix: if a constant tensor has mem_obj_id >= 0, force it to -1 so the
dedicated buffer path and the destructor stay consistent. Previously
the buffer would leak and get overwritten by the shared buffer pass.

Also make the chained-add test actually fail when aliasing is absent
instead of just printing informational messages.

Co-authored with Claude.
…tes()

Rename the static to s_last_memory_stats_for_testing and document
the test-only, single-graph, not-thread-safe intent in the header.

Simplify total_bytes() to use tensor_buffer_bytes directly since it
is already computed as shared + unshared in memory_stats().

Co-authored with Claude.
@digantdesai digantdesai force-pushed the wgpu_memory_aliasing branch from a402f89 to 3666881 on May 7, 2026 03:18
@digantdesai digantdesai marked this pull request as ready for review May 7, 2026 03:18
@digantdesai digantdesai requested review from SS-JIA and Copilot May 7, 2026 03:18
Contributor

Copilot AI left a comment

Pull request overview

This PR adds WebGPU runtime support for tensor-buffer memory aliasing by reusing ETVK/Vulkan mem_obj_id so intermediate tensors can share underlying GPU buffers, and extends the WebGPU native test flow to validate the aliasing behavior and report memory stats.

Changes:

  • Implement shared WGPUBuffer allocation/assignment in WebGPUGraph based on mem_obj_id, and extend memory stats to account for shared vs unshared tensor bytes.
  • Add a test-only mechanism to retrieve the last graph’s memory stats and a new native test that validates aliasing + memory savings using a chained-add model.
  • Update WebGPU test scripts and Python export utilities to generate and run the chained-add model.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
backends/webgpu/runtime/WebGPUGraph.h Extends memory stats and adds state for mem_obj_id-based shared buffers.
backends/webgpu/runtime/WebGPUGraph.cpp Allocates shared buffers by mem_obj_id, adjusts destruction logic, and updates memory stats accounting.
backends/webgpu/runtime/WebGPUBackend.h Declares a test-only accessor for last graph memory stats.
backends/webgpu/runtime/WebGPUBackend.cpp Stores last graph memory stats at init time for tests to query.
backends/webgpu/test/test_webgpu_native.cpp Adds a chained-add native test that checks correctness and confirms aliasing memory savings.
backends/webgpu/test/test_build_webgpu.sh Exports both simple and chained models and runs the native test with both paths.
backends/webgpu/test/ops/add/test_add.py Extends chained-add model and adds an export helper for the chained model.
Comments suppressed due to low confidence (1)

backends/webgpu/runtime/WebGPUGraph.cpp:230

  • Output staging buffer creation now ET_CHECK_MSGs on tensors_[oid].nbytes > 0, which will abort for valid models that produce empty outputs. The previous code handled this by allocating a small non-zero staging buffer while still copying 0 bytes. Consider restoring that behavior (allocate at least 4 bytes, but allow 0-byte outputs) to avoid hard process termination.
      // Create staging buffer for output readback
      WGPUBufferDescriptor staging_desc = {};
      ET_CHECK_MSG(tensors_[oid].nbytes > 0, "Output tensor has zero bytes");
      staging_desc.size = tensors_[oid].nbytes;
      staging_desc.usage = WGPUBufferUsage_MapRead | WGPUBufferUsage_CopyDst;
      staging_desc.mappedAtCreation = false;
      output_staging_buffers_.push_back(
          wgpuDeviceCreateBuffer(device_, &staging_desc));

Comment on lines +136 to +145
// Constants always get dedicated buffers regardless of mem_obj_id
if (constant_id >= 0 || mem_obj_id < 0) {
  tensor_mem_obj_ids_[i] = -1;
  WGPUBufferDescriptor buf_desc = {};
  ET_CHECK_MSG(tensor.nbytes > 0, "Tensor has zero bytes");
  buf_desc.size = tensor.nbytes;
  buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst |
      WGPUBufferUsage_CopySrc;
  buf_desc.mappedAtCreation = false;
  tensor.buffer = wgpuDeviceCreateBuffer(device_, &buf_desc);
Comment on lines +192 to +201
// Allocate shared buffers and assign to tensors
shared_buffers_.resize(shared_buffer_sizes_.size(), nullptr);
for (size_t id = 0; id < shared_buffer_sizes_.size(); id++) {
  WGPUBufferDescriptor buf_desc = {};
  ET_CHECK_MSG(shared_buffer_sizes_[id] > 0, "Shared buffer has zero bytes");
  buf_desc.size = shared_buffer_sizes_[id];
  buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst |
      WGPUBufferUsage_CopySrc;
  buf_desc.mappedAtCreation = false;
  shared_buffers_[id] = wgpuDeviceCreateBuffer(device_, &buf_desc);
Comment on lines 17 to +20

// Test-only: returns memory stats from the most recently initialized graph.
// Not thread-safe; only valid when a single graph is loaded at a time.
WebGPUMemoryStats get_last_memory_stats();