
SYCL: reduce allocation overhead during flash attention#22732

Open
sanmai wants to merge 2 commits into ggml-org:master from sanmai:fa-overhead-sycl

Conversation


@sanmai sanmai commented May 5, 2026

Fixes #22585

Overview

I found that flash attention allocates quite a few K/V buffers with little reuse, and they remain in the legacy pool until teardown. A better strategy is to allocate the FA buffers outside the common pool and grow them on demand, so that at most FA holds the largest buffers it ever needed.

Arguably there are more optimal strategies, since the buffers still stay occupied once grown. That said, the legacy pool has no evictions either, so at the very least we end up in a better spot.

  • To reduce the number of allocations, buffers grow in chunks of 16 MiB (a minimal sketch of this growth policy follows the list).
  • The requests grow roughly like 0.5, 1.0, 1.5, 2.0, 2.5, ..., though that is likely model-dependent.
  • FA completes before ggml_sycl_fattn_alloc::alloc is called again, so no queue sync should be needed, but I added one anyway just to be extra safe.
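A minimal sketch of the growth policy, assuming the FA buffers live in a small per-device struct owned by the backend. The names fattn_kv_buffer and ensure_half follow the PR description, but the member layout and the SYCL calls here are illustrative, not the actual patch:

#include <sycl/sycl.hpp>
#include <cstddef>

struct fattn_kv_buffer {
    static constexpr size_t CHUNK = 16 * 1024 * 1024; // grow in 16 MiB steps

    sycl::queue * q   = nullptr; // queue used for the FA kernels
    sycl::half  * ptr = nullptr; // current device allocation
    size_t        cap = 0;       // capacity in bytes, only ever grows

    // Return a buffer that can hold n_elems halves, reallocating only when
    // the request exceeds the current capacity.
    sycl::half * ensure_half(size_t n_elems) {
        const size_t need = n_elems * sizeof(sycl::half);
        if (need <= cap) {
            return ptr; // the existing buffer is already large enough
        }
        // round the new capacity up to the next 16 MiB chunk
        const size_t new_cap = ((need + CHUNK - 1) / CHUNK) * CHUNK;
        if (ptr != nullptr) {
            q->wait(); // extra-safe: make sure no kernel still reads the old buffer
            sycl::free(ptr, *q);
        }
        ptr = sycl::malloc_device<sycl::half>(new_cap / sizeof(sycl::half), *q);
        cap = new_cap;
        return ptr;
    }

    ~fattn_kv_buffer() {
        if (ptr != nullptr) {
            sycl::free(ptr, *q);
        }
    }
};

With this shape, even the first ~0.5 request reserves a full 16 MiB chunk, and the buffer only reallocates when a ubatch needs more than the current capacity, which is consistent with the 16.00 MiB and 96.00 MiB buffer sizes in the logs below.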

Additional information

Memory benchmarks with B60 and Qwen3.6-35B-A3B UD-Q4_K_M, q4_0:

Baseline (without buffers)

Smaller prompt:

~ggml_sycl_pool_leg: 9 buffers, cached = 181.67 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/50.40/8.40/50.40

Larger prompt:

~ggml_sycl_pool_leg: 23 buffers, cached = 1170.24 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/50.40/85.05/8.40/50.40/53.02/56.17/59.32/62.47/65.62/67.20/69.30/70.88/72.97/74.55/76.65/78.75/80.85/82.95

With buffers

Smaller prompt:

ggml_sycl_fattn_kv_buffer[0]: 16.00 MiB
ggml_sycl_fattn_kv_buffer[0]: 16.00 MiB
~ggml_sycl_pool_leg: 7 buffers, cached = 80.87 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/8.40

That is 80.87 MiB cached plus two 16.00 MiB FA buffers = 112.87 MiB, or 68.80 MiB of savings compared to the baseline's 181.67 MiB.

Larger prompt:

ggml_sycl_fattn_kv_buffer[0]: 96.00 MiB
ggml_sycl_fattn_kv_buffer[0]: 96.00 MiB
~ggml_sycl_pool_leg: 7 buffers, cached = 80.87 MiB
~ggml_sycl_pool_leg: slots MiB: 0.01/0.00/0.00/1.05/4.20/67.20/8.40

That's 272.87 MiB total (80.87 MiB cached plus two 96.00 MiB FA buffers), or 897.37 MiB of savings compared to the baseline's 1170.24 MiB. In a memory-constrained environment that could be a deal breaker.

The extra usage shows up in the common memory breakdown just as one would expect:

 | memory breakdown [MiB]                        | total   free     self   model   context   compute    unaccounted |
-|   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 =  854 + (20003 = 18727 +     782 +     493) +        2398 |
+|   - SYCL0 (Intel(R) Arc(TM) Pro B60 Graphics) | 23256 = 1129 + (20643 = 18727 +    1422 +     493) +        1483 |
 |   - Host                                      |                  2786 =  2522 +       0 +     264                |

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: YES; I used AI tools in an assistive capacity for profiling, prototyping, and code review.

@github-actions github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language) labels May 5, 2026

sycl::half * alloc(size_t n_elems) {
    ptr = buf.ensure_half(n_elems);
    return ptr;
}
Author

@sanmai sanmai May 5, 2026


The calling code does not use the return value, but pool_alloc::alloc returns one, so this keeps doing the same to reduce surprise if someone later changes the old code to rely on the return value.
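For illustration, a hedged sketch of how such a wrapper could look, building on the fattn_kv_buffer sketch in the description; the real class in the patch may differ, but alloc() keeps handing back the pointer just like ggml_sycl_pool_alloc::alloc does:

// Illustrative only: RAII-style handle over a persistent FA buffer.
struct ggml_sycl_fattn_alloc {
    fattn_kv_buffer & buf;           // persistent buffer, owned by the backend
    sycl::half      * ptr = nullptr;

    explicit ggml_sycl_fattn_alloc(fattn_kv_buffer & b) : buf(b) {}

    sycl::half * alloc(size_t n_elems) {
        ptr = buf.ensure_half(n_elems); // grow on demand instead of hitting the pool
        return ptr;                     // mirror pool_alloc::alloc's return value
    }

    sycl::half * get() const { return ptr; }
};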

- ggml_sycl_pool_alloc<sycl::half> K_f16(pool);
- ggml_sycl_pool_alloc<sycl::half> V_f16(pool);
+ ggml_sycl_fattn_alloc K_f16(fbuf.K);
+ ggml_sycl_fattn_alloc V_f16(fbuf.V);
Author

@sanmai sanmai May 5, 2026


I considered adding a no-op template to make it look more uniform:

ggml_sycl_fattn_alloc<sycl::half>   K_f16(fbuf.K);
ggml_sycl_fattn_alloc<sycl::half>   V_f16(fbuf.V);
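One possible shape for that no-op template, sketched against the wrapper above (purely illustrative, not part of the patch):

#include <type_traits>

// The template parameter exists only so call sites read like
// ggml_sycl_pool_alloc<T>; only T = sycl::half is meaningful here.
template <typename T = sycl::half>
struct ggml_sycl_fattn_alloc_t : ggml_sycl_fattn_alloc {
    static_assert(std::is_same_v<T, sycl::half>, "FA buffers hold sycl::half");
    using ggml_sycl_fattn_alloc::ggml_sycl_fattn_alloc; // reuse the buffer-taking ctor
};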

@sanmai sanmai marked this pull request as ready for review May 5, 2026 23:05
@sanmai sanmai requested a review from a team as a code owner May 5, 2026 23:05

Labels

ggml (changes relating to the ggml tensor library for machine learning), SYCL (https://en.wikipedia.org/wiki/SYCL - GPU programming language)


Development

Successfully merging this pull request may close these issues.

SYCL: flash-attention buffers are retained across long-context ubatches causing linear VRAM growth

1 participant