
Optimize CPU deform_conv2d forward pass with parallel im2col#9442

Open
developer0hye wants to merge 1 commit into pytorch:main from developer0hye:feat/dcnv2-cpu-forward-optimization

Conversation

@developer0hye (Contributor)

Summary

The CPU deform_conv2d forward pass spends 89–97% of its time in the deformable_im2col_kernel (confirmed via torch.profiler), yet this kernel runs entirely single-threaded. GEMM (addmm_) accounts for only 3–10% and is already parallelized by BLAS.

This PR introduces three changes to torchvision/csrc/ops/cpu/deform_conv2d_kernel.cpp that together yield a 2.5–3.3x end-to-end speedup on the forward pass:

  1. Parallelize deformable_im2col_kernel with at::parallel_for.
    Each loop iteration writes to a non-overlapping region of the columns buffer (the write offset is uniquely determined by (in_c, out_b, out_y, out_x)), so parallelization is safe with no synchronization needed. Results are bit-for-bit identical regardless of thread count.

  2. Replace at::zeros with at::empty for the columns buffer.
    deformable_im2col_kernel writes every element of this buffer (n_in_channels × kH × kW × parallel_imgs × out_h × out_w elements total), so zero-initialization is wasted work.

  3. Replace at::zeros with at::empty for out_buf and use addmm_ with beta=0.
    Each out_buf[b][g] is written exactly once per (batch_block, weight_group) pair. Using beta=0 skips the accumulation of uninitialized values while preserving in-place operation (unlike at::mm, which allocates a new tensor).
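The effect of changes 2 and 3 can be sketched in a few lines of Python. With beta=0, addmm_ never reads the output buffer's prior contents (PyTorch documents that beta=0 does not propagate NaN/Inf from the input), so allocating it with torch.empty is safe. The shapes below are arbitrary illustration values, not the kernel's actual buffer sizes.

```python
import torch

# Arbitrary illustrative shapes (not the kernel's real buffer sizes).
m, k, n = 4, 5, 6
a = torch.randn(m, k)
b = torch.randn(k, n)

# out_buf analog: allocated uninitialized instead of zero-filled.
out = torch.empty(m, n)

# beta=0 means out = 0*out + a @ b: the uninitialized contents are
# never read, and the write stays in-place (unlike at::mm / torch.mm,
# which allocate a fresh result tensor).
out.addmm_(a, b, beta=0)

assert torch.allclose(out, a @ b)
```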

Benchmark

All measurements use time.perf_counter(), 10 warmup + 100 timed iterations, reporting the median.

Hardware: Apple M2, torch.get_num_threads() = 4
Dtype: float32, with mask (DCNv2 mode)
Config format: s{spatial}-b{batch}, e.g. s32-b4 = 64 in/out channels, 3×3 kernel, stride 1, padding 1, 32×32 spatial, batch 4. s64-* uses 256 in/out channels.

Config     Baseline (ms)  This PR (ms)   Speedup
─────────────────────────────────────────────────
s32-b1           2.78          0.83         3.3x
s32-b3           9.62          3.54         2.7x
s32-b4          15.99          5.01         3.2x
s32-b8          32.90         11.17         2.9x
s64-b1          76.16         30.52         2.5x
s64-b4         315.69        122.65         2.6x
s64-b7         566.37        230.67         2.5x
Profiler breakdown (baseline, s32-b1)
                          Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
    torchvision::deform_conv2d        92.30%      25.091ms       100.00%      27.183ms       2.718ms            10
                  aten::addmm_         2.82%     766.166us         2.82%     767.458us      76.746us            10
                   aten::zeros         0.57%     154.080us         2.94%     798.875us      79.888us            10
Benchmark script
import time
import torch
from torchvision.ops import deform_conv2d

def benchmark_forward(batch_sz, in_channels, out_channels, in_h, in_w,
                      kernel_h, kernel_w, stride, padding,
                      n_warmup=10, n_iter=100):
    out_h = (in_h + 2 * padding - kernel_h) // stride + 1
    out_w = (in_w + 2 * padding - kernel_w) // stride + 1
    x = torch.randn(batch_sz, in_channels, in_h, in_w)
    weight = torch.randn(out_channels, in_channels, kernel_h, kernel_w)
    offset = torch.randn(batch_sz, 2 * kernel_h * kernel_w, out_h, out_w)
    mask = torch.randn(batch_sz, kernel_h * kernel_w, out_h, out_w)
    bias = torch.randn(out_channels)
    for _ in range(n_warmup):
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
    times = []
    for _ in range(n_iter):
        t0 = time.perf_counter()
        deform_conv2d(x, offset, weight, bias, stride=stride, padding=padding, mask=mask)
        times.append((time.perf_counter() - t0) * 1000)
    times.sort()
    return times[len(times) // 2]


if __name__ == "__main__":
    # s32-b1 from the table: 64 in/out channels, 3x3 kernel, stride 1, padding 1
    print(f"s32-b1: {benchmark_forward(1, 64, 64, 32, 32, 3, 3, 1, 1):.2f} ms")

Numerical correctness

Output is bit-for-bit identical between 1-thread and 8-thread execution (torch.equal returns True). Each thread operates on a disjoint slice of the columns buffer, so floating-point evaluation order is unchanged.
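The disjoint-write argument can be checked independently of the C++ code. Assuming the usual im2col layout, where columns is a [n_in_channels * kH * kW, parallel_imgs * out_h * out_w] matrix stored row-major (row indexed by channel and kernel position, column by (out_b, out_y, out_x)), every loop iteration maps to a unique flat offset. A pure-Python check with small hypothetical sizes:

```python
from itertools import product

# Small hypothetical sizes; layout assumption:
# columns is [n_in_channels*kH*kW, parallel_imgs*out_h*out_w], row-major.
n_in_channels, kH, kW = 2, 3, 3
parallel_imgs, out_h, out_w = 2, 4, 4

n_cols = parallel_imgs * out_h * out_w
offsets = set()
for in_c, out_b, out_y, out_x in product(
        range(n_in_channels), range(parallel_imgs),
        range(out_h), range(out_w)):
    for i, j in product(range(kH), range(kW)):
        row = (in_c * kH + i) * kW + j
        col = (out_b * out_h + out_y) * out_w + out_x
        offsets.add(row * n_cols + col)

# Every iteration writes a distinct element, so parallel workers
# partitioned over these indices can never collide.
assert len(offsets) == n_in_channels * kH * kW * n_cols
```

Since no element is written twice, no synchronization is needed and the per-element arithmetic (and hence floating-point rounding) is identical at any thread count.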

All existing TestDeformConv tests pass (forward, backward, scripting, opcheck).

Related

  • #6619 — RFC noting that CPU deform_conv2d kernels are sequential and don't utilize multicore resources

cc @NicolasHug

@pytorch-bot bot commented Mar 16, 2026

Artifacts and rendered test results: hud.pytorch.org/pr/pytorch/vision/9442

@meta-cla bot added the cla signed label Mar 16, 2026

Commit message:

Three changes to the CPU deformable convolution forward kernel:

1. Replace at::zeros with at::empty for columns and out_buf buffers.
   The deformable_im2col_kernel writes every element of the columns
   buffer, and out_buf is fully written by addmm_, so zero-initialization
   is wasted work.

2. Use addmm_ with beta=0 instead of the default beta=1. This avoids
   accumulating into uninitialized memory while preserving in-place
   operation (no extra allocation unlike at::mm).

3. Parallelize deformable_im2col_kernel with at::parallel_for. The
   im2col loop was the only single-threaded phase in the forward pass
   (GEMM is already parallelized by BLAS). Each loop iteration writes
   to a non-overlapping region of the columns buffer, so parallelization
   is safe.

Benchmark results on Apple M2 (CPU, float32):

  Config          Before (ms)   After (ms)    Change
  small-b1              9.76        2.44       -75%
  small-b8             91.77       33.88       -63%
  medium-b1           216.70       75.80       -65%
  medium-b8          1152.09      650.00       -44%
  large-b1            348.86      302.70       -13%
  large-b4           1342.75     1289.96        -4%

Signed-off-by: Yonghye Kwon <developer.0hye@gmail.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@developer0hye developer0hye force-pushed the feat/dcnv2-cpu-forward-optimization branch from e653cad to 8a89fb8 Compare March 16, 2026 14:56