
Optimize CPU deform_conv2d forward pass with parallel im2col#9442

Open
developer0hye wants to merge 1 commit into pytorch:main from developer0hye:feat/dcnv2-cpu-forward-optimization

Conversation

@developer0hye (Contributor)

Summary

The CPU deform_conv2d forward pass spends 89–97% of its time in the deformable_im2col_kernel (confirmed via torch.profiler), yet this kernel runs entirely single-threaded. GEMM (addmm_) accounts for only 3–10% and is already parallelized by BLAS.

This PR introduces three changes to torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp that together yield a 2.5–3.3x end-to-end speedup on the forward pass:

  1. Parallelize deformable_im2col_kernel with at::parallel_for.
    Each loop iteration writes to a non-overlapping region of the columns buffer (the write offset is uniquely determined by (in_c, out_b, out_y, out_x)), so parallelization is safe with no synchronization needed. Results are bit-for-bit identical regardless of thread count.

  2. Replace at::zeros with at::empty for the columns buffer.
    deformable_im2col_kernel writes every element of this buffer (n_in_channels × kH × kW × parallel_imgs × out_h × out_w elements total), so zero-initialization is wasted work.

  3. Replace at::zeros with at::empty for out_buf and use addmm_ with beta=0.
    Each out_buf[b][g] is written exactly once per (batch_block, weight_group) pair. Using beta=0 skips the accumulation of uninitialized values while preserving in-place operation (unlike at::mm, which allocates a new tensor).
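The effect of changes 2 and 3 can be sketched in a few lines of Python. With beta=0, addmm_ never reads the output buffer's prior contents (PyTorch documents that beta=0 does not propagate NaN/Inf from the input), so allocating it with torch.empty is safe. The shapes below are arbitrary illustration values, not the kernel's actual buffer sizes.

```python
import torch

# Arbitrary illustrative shapes (not the kernel's real buffer sizes).
m, k, n = 4, 5, 6
a = torch.randn(m, k)
b = torch.randn(k, n)

# out_buf analog: allocated uninitialized instead of zero-filled.
out = torch.empty(m, n)

# beta=0 means out = 0*out + a @ b: the uninitialized contents are
# never read, and the write stays in-place (unlike at::mm / torch.mm,
# which allocate a fresh result tensor).
out.addmm_(a, b, beta=0)

assert torch.allclose(out, a @ b)
```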

Benchmark

All measurements use time.perf_counter(), 10 warmup + 100 timed iterations, reporting the median.

Hardware: Apple M2, torch.get_num_threads() = 4
Dtype: float32, with mask (DCNv2 mode)
Config format: s{spatial}-b{batch}, e.g. s32-b4 = 64 in/out channels, 3×3 kernel, stride 1, padding 1, 32×32 spatial, batch 4. s64-* uses 256 in/out channels.

Config     Baseline (ms)  This PR (ms)   Speedup
─────────────────────────────────────────────────
s32-b1           2.78          0.83         3.3x
s32-b3           9.62          3.54         2.7x
s32-b4          15.99          5.01         3.2x
s32-b8          32.90         11.17         2.9x
s64-b1          76.16         30.52         2.5x
s64-b4         315.69        122.65         2.6x
s64-b7         566.37        230.67         2.5x
Profiler breakdown (baseline, s32-b1)
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
    torchvision::deform_conv2d        92.30%      25.091ms       100.00%      27.183ms       2.718ms            10
                  aten::addmm_         2.82%     766.166us         2.82%     767.458us      76.746us            10
                   aten::zeros         0.57%     154.080us         2.94%     798.875us      79.888us            10
Benchmark script
import time
import torch
from torchvision.ops import deform_conv2d

def benchmark_forward(batch_sz, in_channels, out_channels, in_h, in_w,
                      kernel_h, kernel_w, stride, padding,
                      n_warmup=10, n_iter=100):
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)
    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]


if __name__ == "__main__":
    # s32-b1 from the table: 64 in/out channels, 3x3 kernel, stride 1, padding 1
    print(f"s32-b1: {benchmark_forward(1, 64, 64, 32, 32, 3, 3, 1, 1):.2f} ms")

Numerical correctness

Output is bit-for-bit identical between 1-thread and 8-thread execution (torch.equal returns True). Each thread operates on a disjoint slice of the columns buffer, so floating-point evaluation order is unchanged.
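The disjoint-write argument can be checked independently of the C++ code. Assuming the usual im2col layout, where columns is a [n_in_channels * kH * kW, parallel_imgs * out_h * out_w] matrix stored row-major (row indexed by channel and kernel position, column by (out_b, out_y, out_x)), every loop iteration maps to a unique flat offset. A pure-Python check with small hypothetical sizes:

```python
from itertools import product

# Small hypothetical sizes; layout assumption:
# columns is [n_in_channels*kH*kW, parallel_imgs*out_h*out_w], row-major.
n_in_channels, kH, kW = 2, 3, 3
parallel_imgs, out_h, out_w = 2, 4, 4

n_cols = parallel_imgs * out_h * out_w
offsets = set()
for in_c, out_b, out_y, out_x in product(
        range(n_in_channels), range(parallel_imgs),
        range(out_h), range(out_w)):
    for i, j in product(range(kH), range(kW)):
        row = (in_c * kH + i) * kW + j
        col = (out_b * out_h + out_y) * out_w + out_x
        offsets.add(row * n_cols + col)

# Every iteration writes a distinct element, so parallel workers
# partitioned over these indices can never collide.
assert len(offsets) == n_in_channels * kH * kW * n_cols
```

Since no element is written twice, no synchronization is needed and the per-element arithmetic (and hence floating-point rounding) is identical at any thread count.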

All existing TestDeformConv tests pass (forward, backward, scripting, opcheck).

Related

  • #6619 — RFC noting that CPU deform_conv2d kernels are sequential and don't utilize multicore resources

cc @NicolasHug

@pytorch-bot bot commented Mar 16, 2026

Artifacts and rendered test results: hud.pytorch.org/pr/pytorch/vision/9442

@meta-cla bot added the cla signed label Mar 16, 2026

Commit message:

Three changes to the CPU deformable convolution forward kernel:

1. Replace at::zeros with at::empty for columns and out_buf buffers.
   The deformable_im2col_kernel writes every element of the columns
   buffer, and out_buf is fully written by addmm_, so zero-initialization
   is wasted work.

2. Use addmm_ with beta=0 instead of the default beta=1. This avoids
   accumulating into uninitialized memory while preserving in-place
   operation (no extra allocation unlike at::mm).

3. Parallelize deformable_im2col_kernel with at::parallel_for. The
   im2col loop was the only single-threaded phase in the forward pass
   (GEMM is already parallelized by BLAS). Each loop iteration writes
   to a non-overlapping region of the columns buffer, so parallelization
   is safe.

Benchmark results on Apple M2 (CPU, float32):

  Config          Before (ms)   After (ms)    Change
  small-b1              9.76        2.44       -75%
  small-b8             91.77       33.88       -63%
  medium-b1           216.70       75.80       -65%
  medium-b8          1152.09      650.00       -44%
  large-b1            348.86      302.70       -13%
  large-b4           1342.75     1289.96        -4%

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@developer0hye developer0hye force-pushed the feat/dcnv2-cpu-forward-optimization branch from e653cad to 8a89fb8 Compare March 16, 2026 14:56