vulkan: optimize operations in the IM2COL shader #22685

Open
daniandtheweb wants to merge 2 commits into ggml-org:master from daniandtheweb:im2col

Conversation

daniandtheweb (Contributor) commented May 4, 2026

Overview

This optimizes the IM2COL shader by hoisting redundant operations out of the loops, similar to what I already did in #11826.
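To illustrate the kind of change, here is a minimal GLSL sketch of the pattern, not the actual shader from this PR: the buffer layout, push constants, and the stand-in store are assumptions; only the index decomposition mirrors the code discussed below.

  #version 450
  // Minimal sketch only: decomposes the flattened index the same way the
  // im2col loop does, then stores the pieces instead of gathering a patch.
  layout(local_size_x = 256) in;

  layout(binding = 0) writeonly buffer Dst { uvec4 dst[]; };

  layout(push_constant) uniform Push {
      uint KW;     // kernel width
      uint KH;     // kernel height
      uint total;  // number of elements to process
  } p;

  void main() {
      const uint KHKW       = p.KH * p.KW;
      const uint BLOCK_SIZE = gl_WorkGroupSize.x * gl_NumWorkGroups.x;

      // Before: every iteration redoes the integer divisions and modulos.
      for (uint i = gl_GlobalInvocationID.x; i < p.total; i += BLOCK_SIZE) {
          const uint ic = i / KHKW;            // input channel
          const uint ky = (i % KHKW) / p.KW;   // kernel row
          const uint kx = (i % KHKW) % p.KW;   // kernel column
          dst[i] = uvec4(ic, ky, kx, 0u);      // stand-in for the real gather/store
      }
      // After: ic/ky/kx are carried across iterations and advanced by the
      // precomputed deltas quoted further down, with a small wrap fixup
      // (sketched later in the review thread).
  }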

Radeon RX 7800XT
[benchmark chart: 7800XT_im2col]

Radeon RX 5700XT
[benchmark chart: 5700XT_im2col]

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, Gemini was used for planning the possible optimizations and reviewing the final code.

daniandtheweb requested a review from a team as a code owner May 4, 2026 16:22
jeffbolznv (Contributor) commented:

Perf on RTX 5090:

before
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              98280 runs -    10.25 us/run -    10244 kB/run -  952.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              27060 runs -    37.40 us/run -    40964 kB/run - 1044.61 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1352 runs -   766.24 us/run -   655364 kB/run -  817.25 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     8528 runs -   121.87 us/run -   102445 kB/run -  801.93 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     2132 runs -   475.39 us/run -   409645 kB/run -  822.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              47058 runs -    21.72 us/run -    23536 kB/run - 1033.39 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               8710 runs -   117.16 us/run -   100208 kB/run -  815.78 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      560 runs -  1795.14 us/run -  1678448 kB/run -  893.42 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3289 runs -   306.13 us/run -   235365 kB/run -  733.44 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      850 runs -  1216.93 us/run -  1002085 kB/run -  786.25 GB/s
  
after
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):             114660 runs -     8.86 us/run -    10244 kB/run - 1103.00 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              28700 runs -    35.26 us/run -    40964 kB/run - 1108.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     1352 runs -   755.95 us/run -   655364 kB/run -  828.37 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     9512 runs -   107.82 us/run -   102445 kB/run -  906.44 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     2296 runs -   446.94 us/run -   409645 kB/run -  875.15 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              55614 runs -    18.24 us/run -    23536 kB/run - 1230.41 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               9045 runs -   114.88 us/run -   100208 kB/run -  831.96 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      620 runs -  1647.06 us/run -  1678448 kB/run -  973.74 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     3718 runs -   279.38 us/run -   235365 kB/run -  803.67 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      918 runs -  1116.62 us/run -  1002085 kB/run -  856.88 GB/s

  const uint delta_ic = BLOCK_SIZE / KHKW;
  const uint delta_rem = BLOCK_SIZE % KHKW;
  const uint delta_ky = delta_rem / p.KW;
  const uint delta_kx = delta_rem % p.KW;
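
For context, this is the hunk the review thread below is attached to. Assuming KHKW stands for p.KH * p.KW, the four constants describe how far the (channel, kernel-row, kernel-column) decomposition of the flattened index moves each time that index advances by BLOCK_SIZE, so the loop body can add them instead of redoing the division and modulo on every iteration.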
Contributor


I'm not totally following this. In general it seems unsafe to precompute divs/mods and add them, as sometimes you would wrap to the next value and need to do a fixup. Maybe that's what the fixup logic is doing, but it's not clear.

I wonder if it might be better to pass KW as a spec constant and let the compiler transform it into something faster.
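
For reference, a minimal sketch of that alternative, not code from this PR: KW would be supplied via VkSpecializationInfo when the pipeline is created, so the shader compiler sees a compile-time constant and can strength-reduce the division and modulo.

  #version 450
  layout(local_size_x = 256) in;

  // Hypothetical: KW specialized at pipeline-creation time instead of being
  // read from a push constant at runtime.
  layout(constant_id = 0) const uint KW = 3;

  layout(binding = 0) writeonly buffer Dst { uvec2 dst[]; };

  void main() {
      const uint i = gl_GlobalInvocationID.x;
      // With KW known to the compiler, these can become shifts/multiplies.
      dst[i] = uvec2(i / KW, i % KW);   // kernel row, kernel column
  }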

Contributor Author


kx_wrap and ky_wrap should take care of the wrapping. Moreover, it shouldn't be possible for the values to wrap around twice (see the sketch below), because:

  • delta_kx is a modulo result, so it's always less than p.KW; kx is likewise always less than p.KW, so the largest value their sum can reach is 2 * p.KW - 2, which is always less than 2 * p.KW.
  • delta_ky is at most p.KH - 1, so always less than p.KH; ky is at most p.KH - 1, so the largest value it can reach (including the carry from the kx wrap) is 2 * p.KH - 1, which is less than 2 * p.KH.
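
A sketch of the single-wrap fixup this argument justifies (variable names are illustrative, not necessarily the exact ones in the PR):

  // Per-iteration update inside the loop; each field wraps at most once
  // because its sum stays strictly below twice the corresponding modulus.
  kx += delta_kx;                              // kx <= 2 * p.KW - 2
  ky += delta_ky;                              // ky <= 2 * p.KH - 2 before the carry
  ic += delta_ic;
  if (kx >= p.KW) { kx -= p.KW; ky += 1u; }    // ky now at most 2 * p.KH - 1
  if (ky >= p.KH) { ky -= p.KH; ic += 1u; }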

Contributor Author


I'm not that experienced with Vulkan shaders, so this is the most I could confidently achieve; it should (in theory) be mathematically correct. If there are better approaches I'll gladly look into them.

Contributor Author


Maybe it's better if I add some comments on the wrap values to make them clearer?

Contributor


Comments would help, but I think spec constants would keep the code clearer. I'll defer to @0cc4m on what to do.

Contributor Author

daniandtheweb commented May 4, 2026


Got it, I'll start by adding some comments on the most confusing parts for now.

@github-actions bot added the Vulkan (issues specific to the Vulkan backend) and ggml (changes relating to the ggml tensor library for machine learning) labels May 4, 2026