WebGPU: add memory aliasing for intermediate tensor buffers#19305
WebGPU: add memory aliasing for intermediate tensor buffers#19305digantdesai wants to merge 7 commits intomainfrom
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/19305
Note: Links to docs will display an error until the docs builds have been completed. ❌ 2 New Failures, 1 Cancelled Job, 19 Unrelated FailuresAs of commit 3666881 with merge base af90130 ( NEW FAILURES - The following jobs have failed:
CANCELLED JOB - The following job was cancelled. Please retry:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
BROKEN TRUNK - The following jobs failed but were present on the merge base:👉 Rebase onto the `viable/strict` branch to avoid these failures
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
|
@claude review this PR |
This PR needs a
|
|
Claude finished @digantdesai's task in 1m 37s —— View job Review: WebGPU memory aliasing for intermediate tensor buffers
Overall: Clean, well-structured implementation. The approach of using ETVK's Issues1. Global mutable state for memory stats ( static WebGPUMemoryStats s_last_memory_stats;This is a process-wide global that gets overwritten on every 2. Destructor correctness — shared buffer double-release guard ( for (size_t i = 0; i < tensors_.size(); i++) {
if (tensors_[i].buffer &&
(i >= tensor_mem_obj_ids_.size() || tensor_mem_obj_ids_[i] < 0)) {
wgpuBufferRelease(tensors_[i].buffer);
}
}The guard 3. size_t total_bytes() const {
return shared_buffer_bytes + unshared_tensor_buffer_bytes +
staging_buffer_bytes + uniform_buffer_bytes;
}This is correct (it equals Minor Suggestions4. Test assertion could verify specific aliasing count ( The test asserts 5. The export functions are defined but only callable programmatically or via the test script's What looks good
Verdict: LGTM with minor suggestions. The core aliasing logic is correct and well-tested. The global stats accessor is the only thing I'd recommend addressing before merge (even if just renaming to clarify it's test-only). |
The export pipeline already runs a greedy memory planning pass that assigns mem_obj_id to tensors with non-overlapping lifetimes, but the WebGPU runtime was ignoring it and allocating a dedicated WGPUBuffer per tensor. Read mem_obj_id from the flatbuffer during graph build. Tensors sharing the same mem_obj_id now share a single WGPUBuffer sized to the largest user. Constants and tensors without a mem_obj_id still get dedicated buffers. Adds a chained-add native test (z=x+y; z=z+x; z=z+y) that verifies both correctness and that memory aliasing produces savings (~20% for this model). Co-authored with Claude.
Replace the silent `nbytes > 0 ? nbytes : 4` fallback pattern with ET_CHECK_MSG assertions. If a zero-byte tensor reaches buffer creation, we want to know immediately rather than silently creating a dummy 4-byte buffer that masks the issue. Co-authored with Claude.
Invert the condition to eliminate the empty if-body with a comment. Co-authored with Claude.
Export and run the chained-add memory aliasing test in test_build_webgpu.sh so it runs automatically instead of requiring a manual WEBGPU_TEST_CHAINED_MODEL env var. Co-authored with Claude.
Longer chain produces more intermediates, giving the memory planner more opportunity to alias buffers. Expected output: 3x + 3y. Co-authored with Claude.
Fix: if a constant tensor has mem_obj_id >= 0, force it to -1 so the dedicated buffer path and the destructor stay consistent. Previously the buffer would leak and get overwritten by the shared buffer pass. Also make the chained-add test actually fail when aliasing is absent instead of just printing informational messages. Co-authored with Claude.
…tes() Rename the static to s_last_memory_stats_for_testing and document the test-only, single-graph, not-thread-safe intent in the header. Simplify total_bytes() to use tensor_buffer_bytes directly since it is already computed as shared + unshared in memory_stats(). Co-authored with Claude.
a402f89 to
3666881
Compare
There was a problem hiding this comment.
Pull request overview
This PR adds WebGPU runtime support for tensor-buffer memory aliasing by reusing ETVK/Vulkan mem_obj_id so intermediate tensors can share underlying GPU buffers, and extends the WebGPU native test flow to validate the aliasing behavior and report memory stats.
Changes:
- Implement shared
WGPUBufferallocation/assignment inWebGPUGraphbased onmem_obj_id, and extend memory stats to account for shared vs unshared tensor bytes. - Add a test-only mechanism to retrieve the last graph’s memory stats and a new native test that validates aliasing + memory savings using a chained-add model.
- Update WebGPU test scripts and Python export utilities to generate and run the chained-add model.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| backends/webgpu/runtime/WebGPUGraph.h | Extends memory stats and adds state for mem_obj_id-based shared buffers. |
| backends/webgpu/runtime/WebGPUGraph.cpp | Allocates shared buffers by mem_obj_id, adjusts destruction logic, and updates memory stats accounting. |
| backends/webgpu/runtime/WebGPUBackend.h | Declares a test-only accessor for last graph memory stats. |
| backends/webgpu/runtime/WebGPUBackend.cpp | Stores last graph memory stats at init time for tests to query. |
| backends/webgpu/test/test_webgpu_native.cpp | Adds a chained-add native test that checks correctness and confirms aliasing memory savings. |
| backends/webgpu/test/test_build_webgpu.sh | Exports both simple and chained models and runs the native test with both paths. |
| backends/webgpu/test/ops/add/test_add.py | Extends chained-add model and adds an export helper for the chained model. |
Comments suppressed due to low confidence (1)
backends/webgpu/runtime/WebGPUGraph.cpp:230
- Output staging buffer creation now ET_CHECK_MSGs on tensors_[oid].nbytes > 0, which will abort for valid models that produce empty outputs. The previous code handled this by allocating a small non-zero staging buffer while still copying 0 bytes. Consider restoring that behavior (allocate at least 4 bytes, but allow 0-byte outputs) to avoid hard process termination.
// Create staging buffer for output readback
WGPUBufferDescriptor staging_desc = {};
ET_CHECK_MSG(tensors_[oid].nbytes > 0, "Output tensor has zero bytes");
staging_desc.size = tensors_[oid].nbytes;
staging_desc.usage = WGPUBufferUsage_MapRead | WGPUBufferUsage_CopyDst;
staging_desc.mappedAtCreation = false;
output_staging_buffers_.push_back(
wgpuDeviceCreateBuffer(device_, &staging_desc));
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Constants always get dedicated buffers regardless of mem_obj_id | ||
| if (constant_id >= 0 || mem_obj_id < 0) { | ||
| tensor_mem_obj_ids_[i] = -1; | ||
| WGPUBufferDescriptor buf_desc = {}; | ||
| ET_CHECK_MSG(tensor.nbytes > 0, "Tensor has zero bytes"); | ||
| buf_desc.size = tensor.nbytes; | ||
| buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst | | ||
| WGPUBufferUsage_CopySrc; | ||
| buf_desc.mappedAtCreation = false; | ||
| tensor.buffer = wgpuDeviceCreateBuffer(device_, &buf_desc); |
| // Allocate shared buffers and assign to tensors | ||
| shared_buffers_.resize(shared_buffer_sizes_.size(), nullptr); | ||
| for (size_t id = 0; id < shared_buffer_sizes_.size(); id++) { | ||
| WGPUBufferDescriptor buf_desc = {}; | ||
| ET_CHECK_MSG(shared_buffer_sizes_[id] > 0, "Shared buffer has zero bytes"); | ||
| buf_desc.size = shared_buffer_sizes_[id]; | ||
| buf_desc.usage = WGPUBufferUsage_Storage | WGPUBufferUsage_CopyDst | | ||
| WGPUBufferUsage_CopySrc; | ||
| buf_desc.mappedAtCreation = false; | ||
| shared_buffers_[id] = wgpuDeviceCreateBuffer(device_, &buf_desc); |
|
|
||
| // Test-only: returns memory stats from the most recently initialized graph. | ||
| // Not thread-safe; only valid when a single graph is loaded at a time. | ||
| WebGPUMemoryStats get_last_memory_stats(); |
USE ETVK's mem_obj_id for the WebGPU runtime to implement memory aliasing