
common: do not fit to unknown device memory #22614

Merged
JohannesGaessler merged 3 commits into ggml-org:master from fl0rianr:fix/fit-unknown-device-memory
May 6, 2026

Conversation

@fl0rianr
Contributor

@fl0rianr fl0rianr commented May 2, 2026

Overview

'--fit' currently treats a free == 0 && total == 0 device-memory report as if it were host memory. This can make fit approve parameters against the wrong budget.

This PR treats such reports as unknown.

Reasoning

A 0/0 device-memory report is a problem in its own right (e.g. on the user side). But substituting the host-memory budget can approve context sizes that are too large for the device, so the actual model load can still hit OOM because the selected parameters were validated against the wrong budget.
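
To make the failure mode concrete, here is a minimal sketch of the budget selection being changed (illustrative only, names made up, not the actual fit.cpp code):

#include <cstddef>
#include <cstdio>

#include "ggml-backend.h"

// illustrative sketch of the budget selection, not the actual fit.cpp code
static void sketch_pick_device_budget(ggml_backend_dev_t dev, size_t host_free, size_t * budget) {
    size_t free = 0, total = 0;
    ggml_backend_dev_memory(dev, &free, &total); // some backends report 0/0 here

    if (free == 0 && total == 0) {
        // master: silently substitutes the host-memory budget (host_free), so the
        // selected parameters get validated against host RAM instead of the device
        // this PR: treat the report as unknown and do not fit to this device
        fprintf(stderr, "device %s did not report memory\n", ggml_backend_dev_name(dev));
        *budget = 0;
        return;
    }
    *budget = free;
    (void) host_free; // only relevant for the old fallback described above
}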

Additional information

This is intentionally a smaller subset of the larger proposal from #22602.

Requirements

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr fl0rianr requested a review from JohannesGaessler as a code owner May 2, 2026 11:20
Contributor

@JohannesGaessler JohannesGaessler left a comment


As described in the PR linked in the comment, this was done because of the BLAS backend. It is not acceptable to break those backends to fix another one.

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr
Contributor Author

fl0rianr commented May 2, 2026

Thanks! The fix is now limited to GPU devices only, not CPU/BLAS.
I also verified that the BLAS build still reaches common_fit_params: successfully fit params to free device memory, and GPU runs are unaffected as well.

@fl0rianr fl0rianr requested a review from JohannesGaessler May 2, 2026 13:05
Comment thread common/fit.cpp
Comment on lines +122 to +126
const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
if (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU) {
throw common_params_fit_exception(std::string("device ") + ggml_backend_dev_name(dev)
+ " did not report memory; cannot safely fit to an unknown device budget");
}
Member


We have an OpenCL backend that reports 0 bytes of free and total memory as well and reports itself as GGML_BACKEND_DEVICE_TYPE_GPU. This will affect OpenCL too.

static void ggml_backend_opencl_device_get_memory(ggml_backend_dev_t dev, size_t * free, size_t * total) {
// no memory to report
*free = 0;
*total = 0;
GGML_UNUSED(dev);
}

static enum ggml_backend_dev_type ggml_backend_opencl_device_get_type(ggml_backend_dev_t dev) {
return GGML_BACKEND_DEVICE_TYPE_GPU;
GGML_UNUSED(dev);
}

Contributor


That is a bug in the OpenCL backend then and should be fixed there. In ggml-backend.h GPU devices are defined as "GPU device using dedicated memory".
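
For reference, the device-type enum in ggml-backend.h looks roughly like this (paraphrased; ordering and comments approximate):

enum ggml_backend_dev_type {
    // CPU device using system memory
    GGML_BACKEND_DEVICE_TYPE_CPU,
    // GPU device using dedicated memory
    GGML_BACKEND_DEVICE_TYPE_GPU,
    // integrated GPU device using host memory
    GGML_BACKEND_DEVICE_TYPE_IGPU,
    // accelerator devices intended to be used together with the CPU backend (e.g. BLAS or AMX)
    GGML_BACKEND_DEVICE_TYPE_ACCEL,
};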

Member

@taronaeo taronaeo May 2, 2026


I did previously ask about OpenCL's memory reporting, and according to this comment from @/lhez #18587 (comment), there isn't a way to report memory. So I guess this isn't a bug but a technical limitation of OpenCL?

Comment thread common/fit.cpp
Comment on lines +122 to +126
const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
if (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU) {
throw common_params_fit_exception(std::string("device ") + ggml_backend_dev_name(dev)
+ " did not report memory; cannot safely fit to an unknown device budget");
}
Contributor


Instead of throwing a hard error, print a warning and use the (0, 0) values. Otherwise the fitting code will break if any one bad device is found. With (0, 0) the fitting code should not assign anything to that device.

Contributor Author


Thanks both, and thanks for the suggestion. Updated to avoid the hard error.

For GPU-like devices that report 0/0, we now keep the 0/0 budget and print a warning, so the fitter avoids assigning memory to that device.
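
Conceptually the updated handling looks like this (a sketch in the spirit of the change with illustrative names, not the exact diff; it assumes common's LOG_WRN macro):

#include <cstddef>

#include "ggml-backend.h"
#include "log.h" // common's LOG_WRN

// sketch of the updated check, not the exact diff
static void sketch_check_device_memory(ggml_backend_dev_t dev, size_t free, size_t total) {
    const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
    if (free == 0 && total == 0 &&
            (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU)) {
        // warn instead of throwing; keeping the 0/0 budget means the fitter
        // simply does not assign anything to this device
        LOG_WRN("device %s did not report memory; --fit will not use it\n", ggml_backend_dev_name(dev));
    }
}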

Signed-off-by: Florian Reinle <f.reinle@otec.de>
@fl0rianr fl0rianr requested a review from JohannesGaessler May 2, 2026 14:38
@taronaeo
Member

taronaeo commented May 2, 2026

@lhez Can you verify if this PR breaks anything for OpenCL?

If llama_fit returns (0,0) for OpenCL, I'm pretty sure the backend will be skipped.

@fl0rianr
Contributor Author

fl0rianr commented May 2, 2026

This PR improves the behavior for backends that can provide a reliable memory budget. OpenCL remains the special case; I assume special adaptations of the OpenCL backend are out of scope here.

Not ideal, but safer than reintroducing an incorrect fallback.

@lhez
Contributor

lhez commented May 3, 2026

@taronaeo Thank you for tagging me.

I believe the Hexagon backend also reports 0 free memory, and it reports itself as a GPU device, @max-krasnyansky.

For the OpenCL backend, this change doesn't seem to break anything - it runs as normal. common_fit_params does return an error because free memory is reported as 0, but the result doesn't seem to be checked.

llama.cpp/common/common.cpp

Lines 1149 to 1157 in d05fe1d

if (params.fit_params) {
LOG_INF("%s: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on\n", __func__);
common_fit_params(params.model.path.c_str(), &mparams, &cparams,
params.tensor_split,
params.tensor_buft_overrides.data(),
params.fit_params_target.data(),
params.fit_params_min_ctx,
params.verbosity >= 4 ? GGML_LOG_LEVEL_DEBUG : GGML_LOG_LEVEL_ERROR);
}

The annoying part about OpenCL is that the standard does allow querying total memory (CL_DEVICE_GLOBAL_MEM_SIZE), but it does not provide a way to directly query free memory (we have to keep track of it ourselves, which I plan to do).

I think as an intermediate step we can simply return total memory (global mem size) for free memory; although not very accurate, it is probably better than just returning 0 for both.
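
A minimal sketch of what that intermediate step could look like (the helper name is made up and this only shows the clGetDeviceInfo query, not the actual backend wiring):

#include <cstddef>

#include <CL/cl.h>

// sketch only: report the device's global memory size for both free and total
// (free is just an upper bound here, the real free amount is unknown)
static void opencl_get_memory_sketch(cl_device_id device, size_t * free, size_t * total) {
    cl_ulong global_mem_size = 0;
    if (clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(global_mem_size), &global_mem_size, NULL) != CL_SUCCESS) {
        *free  = 0; // keep the old 0/0 report if the query fails
        *total = 0;
        return;
    }
    *free  = (size_t) global_mem_size;
    *total = (size_t) global_mem_size;
}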

@taronaeo
Member

taronaeo commented May 3, 2026

For the OpenCL backend, this change doesn't seem to break anything - it runs as normal. common_fit_params does return an error because free memory is reported as 0, but the result doesn't seem to be checked.

Hmm, what about performance? If I'm following Johannes's comments, this PR will skip assigning to OpenCL and instead run on the CPU, which would be bad for performance.

I think as an intermediate step we can simply return total memory (global mem size) for free memory; although not very accurate, it is probably better than just returning 0 for both.

Yeah it's definitely better than the current upstream implementation of assuming host memory instead :)

@fl0rianr
Contributor Author

fl0rianr commented May 3, 2026

I think as an intermediate step we can simply return total memory (global mem size) for free memory; although not very accurate, it is probably better than just returning 0 for both.

My suggestion would be to add another 1 GiB cap on top of the default 1 GiB cap in this special case (where global memory size is assumed to be the free memory size). Even if the GPU runs no other big job, an amount like that might be used by the display manager etc., and the normal default cap is still needed for a safe runtime. That way we run on the GPU again, and if llama.cpp is the "major application" it should run safely.

@fl0rianr
Contributor Author

fl0rianr commented May 3, 2026

Additional thoughts: I think we can make this useful without pretending that total memory is accurate free memory.

For OpenCL, the backend can report CL_DEVICE_GLOBAL_MEM_SIZE as both total and a best-effort upper bound for free memory. This is not actually free memory, so common fitting should treat it more conservatively (e.g. 1 GiB extra).

My suggestion would be:

  • backends that know real free memory: keep the current behavior
  • OpenCL, and possibly Hexagon if it only has a backend-specific upper bound: report that upper bound from the backend instead of 0/0
  • in common fit code, if the device is one of these estimated-memory backends and the margin is the default 1024 MiB, add another 1024 MiB safety margin

So with the default --fit-target, such devices effectively keep 2 GiB free instead of 1 GiB. For example, on a 16 GiB display-attached GPU this would fit closer to 14 GiB instead of treating the full 16 GiB as available and leaving only the normal 1 GiB margin. With a normal CUDA or HIP backend, maybe 0.8 GiB would be used by the display and the current fit would arrive at about 14.2 GiB of model size, so basically the same.

If the user explicitly changes --fit-target, that value still controls how tight or conservative fitting should be. The only slightly imperfect case is an explicit --fit-target 1024, which is indistinguishable from the default unless we add another parameter flag. I think that is acceptable for this intermediate fix.
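
A rough sketch of the margin logic I have in mind (the constant and the estimated-memory flag are placeholders, not existing fit.cpp code):

#include <cstdint>

// placeholder names, not existing fit.cpp code: widen the margin for devices
// whose "free" memory is only an upper-bound estimate (e.g. OpenCL total memory)
static constexpr int64_t FIT_DEFAULT_MARGIN_MIB = 1024;

static int64_t sketch_fit_margin_mib(int64_t fit_target_mib, bool dev_memory_is_estimate) {
    int64_t margin = fit_target_mib;
    if (dev_memory_is_estimate && margin == FIT_DEFAULT_MARGIN_MIB) {
        margin += 1024; // default target on an estimated-memory device: keep an extra 1 GiB free
    }
    return margin; // an explicit non-default --fit-target is respected as-is
}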

The important question for me is how to proceed from here.

@JohannesGaessler
Contributor

My perspective is this: the behavior on master is wrong and the behavior with this PR is also wrong but in a more well-defined way. @taronaeo @lhez do you approve of the PR as-is? I would prefer not to block it until some indeterminate time in the future when OpenCL and Hexagon implement missing functionality.

@fl0rianr
Contributor Author

fl0rianr commented May 4, 2026

I could implement and test an OpenCL fix locally as well, just as I proposed. But for Hexagon, no hardware is available to me. Hexagon can then be cleanly integrated into this fix design later on, if needed.

@JohannesGaessler
Contributor

@fl0rianr make a new PR for any OpenCL additions.

@fl0rianr
Contributor Author

fl0rianr commented May 4, 2026

The fix I have in mind is independent of this PR (code-wise); it just means OpenCL would no longer report 0/0. But it would require changes in fit.cpp and the OpenCL backend. Thanks for all your input! Now let's wait for further guidance regarding the next step.

@taronaeo
Member

taronaeo commented May 4, 2026

If both OpenCL and Hexagon can confirm that they are able to report some form of memory information for --fit to use, then this PR would be acceptable. But if they can't, then I can't agree to skipping 0/0 backends, since this would force OpenCL and Hexagon to go unused every time with the CPU used automatically instead, which is undesirable.

I'm sorry, but I don't have a better solution to this. I was thinking that reverting to a hard error and informing the user that auto-fitting is unavailable for XXX backend(s), and that they should manually disable fitting and configure the context size and layer offload themselves, might be a better way than skipping the device and logging only one line among the vast logs we produce.

@taronaeo
Member

taronaeo commented May 4, 2026

On another note, I may be wrong, but apparently Hexagon uses host memory as part of its UMA, so reporting 0/0 for Hexagon is actually correct: it will result in host memory information being reported to --fit instead of some form of dedicated memory. From Google Gemini:

Yes, the Qualcomm Hexagon NPU uses host memory (system RAM).

@max-krasnyansky
Member

On another note, I may be wrong, but apparently Hexagon uses host memory as part of its UMA, so reporting 0/0 for Hexagon is actually correct: it will result in host memory information being reported to --fit instead of some form of dedicated memory. From Google Gemini:

Yes, the Qualcomm Hexagon NPU uses host memory (system RAM).

Sorry for the delayed response on this thread.

Yes, the Snapdragon SoCs have unified memory. The buffers need to be allocated from CMA/DMA allocators, but otherwise it's just regular memory shared between CPU/GPU/NPU (i.e. not dedicated on-device memory).
The NPU has a ~4 GB window, but the total memory size is not limited by that; we dynamically mmap/unmap the buffers as needed.

So yes, reporting 0/0 seems like correct behaviour to me.
Technically, the CMA allocator does usually have a limit, but it's very much platform-dependent (Windows, Linux, different Android vendors, IoT devices, etc.). I'm not aware of a robust way to query that limit.

@fl0rianr
Contributor Author

fl0rianr commented May 4, 2026

Thanks for the info! I think I'll make another PR that specifically handles the OpenCL issue alongside this PR. It works; I checked it. With the new info we don't touch Hexagon at all. An open question for me regarding OpenCL: add another 1 GiB margin since total memory is used as the free value, or don't do it (a simpler change with no fit.cpp adaptation for that PR)? Appreciate your feedback!

@fl0rianr
Contributor Author

fl0rianr commented May 4, 2026

Runtime update:

A small test using an Intel iGPU, since OpenCL in llama.cpp is limited to Intel and Adreno:
master(openCL).log
With the change from the current PR, fit does warn but still uses the device, NOT the CPU.
master_and_PR#22614.log
With the simplest fix for OpenCL, fit does not warn anymore.
with_openCL_fix_and_PR#22614.log

I can re-run the test with different parameters if requested. Thanks for all your effort!

@lhez
Contributor

lhez commented May 4, 2026

An open question for me regarding OpenCL: add another 1 GiB margin since total memory is used as the free value, or don't do it (a simpler change with no fit.cpp adaptation for that PR)?

I think it makes sense to leave another 1 GiB margin.

@taronaeo
Member

taronaeo commented May 5, 2026

With the change from the current PR, fit does warn but still uses the device, NOT the CPU.
master_and_PR#22614.log

I don't think this is an accurate test. You have -ngl 99 in the launch command, which still forces up to 99 layers onto the backend, and if I am reading your logs correctly, the auto-fitting failed, aborted, and fell back to your launch setting of -ngl 99, which is why OpenCL is being used despite supposedly skipping backends that report 0/0 memory.

For reference, I am seeing these lines in your logs

common_get_device_memory_data: device GPUOpenCL did not report memory; --fit will not use it
common_params_fit_impl: projected to use 891 MiB of device memory vs. 0 MiB of free device memory
common_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 1915 MiB
common_params_fit_impl: user has requested full context size of 32768 -> no change
common_fit_params: failed to fit params to free device memory: n_gpu_layers already set by user to 99, abort # <-- Failure point where auto-fitting did not change anything
common_fit_params: fitting params to free memory took 0.16 seconds

I don't have hardware that can run the OpenCL backend. Can you help me re-test it using only this command? I want to see the auto-fitting behavior with this PR without any additional knobs added:

./build-opencl-intel/bin/llama-server \
  -m ./qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --fit on \

@fl0rianr
Contributor Author

fl0rianr commented May 5, 2026

@taronaeo Good point, thanks for calling it out! I updated all 3 logs with the correct start parameters. I was indeed a bit unsure whether I had done it right. What I consider a very good sign now is that we get the expected behavior:
master_and_PR#22614_v2.log
We see the CPU fallback.

With PR #22688 this is fixed again:
with_openCL_fixPR#22688_and_PR#22614_v2.log

current master for reference:
master(openCL)v2.log

What is your verdict? Do you think we are good to go now?

@taronaeo
Member

taronaeo commented May 5, 2026

Given that OpenCL has a resolution now, it would be a yes from me because that was my main concern - silently skipping the backend because it will always report 0/0 memory.

For the Hexagon side of things, I was just thinking, @max-krasnyansky if Hexagon is actually an NPU that uses host memory, shouldn't ggml_backend_hexagon_device_get_type report GGML_BACKEND_DEVICE_TYPE_ACCEL instead of GGML_BACKEND_DEVICE_TYPE_GPU since it is actually an accelerator?

If my understanding is correct, then Hexagon can continue to use 0/0 memory to automatically fallback to host memory calculations without being silently skipped.

Member

@taronaeo taronaeo left a comment


Let's wait a little bit for @/max-krasnyansky's answer before merging :)

@max-krasnyansky
Member

Given that OpenCL has a resolution now, it would be a yes from me because that was my main concern - silently skipping the backend because it will always report 0/0 memory.

For the Hexagon side of things, I was just thinking, @max-krasnyansky if Hexagon is actually an NPU that uses host memory, shouldn't ggml_backend_hexagon_device_get_type report GGML_BACKEND_DEVICE_TYPE_ACCEL instead of GGML_BACKEND_DEVICE_TYPE_GPU since it is actually an accelerator?

If my understanding is correct, then Hexagon can continue to use 0/0 memory to automatically fallback to host memory calculations without being silently skipped.

Good question. I'm trying to recall why we used DEV_TYPE_GPU. It made sense at the time but things have evolved quite a bit. Let me try setting to ACCEL and see how things go. Updates later today ...

@max-krasnyansky
Member

Given that OpenCL has a resolution now, it would be a yes from me because that was my main concern - silently skipping the backend because it will always report 0/0 memory.
For the Hexagon side of things, I was just thinking, @max-krasnyansky if Hexagon is actually an NPU that uses host memory, shouldn't ggml_backend_hexagon_device_get_type report GGML_BACKEND_DEVICE_TYPE_ACCEL instead of GGML_BACKEND_DEVICE_TYPE_GPU since it is actually an accelerator?
If my understanding is correct, then Hexagon can continue to use 0/0 memory to automatically fallback to host memory calculations without being silently skipped.

Good question. I'm trying to recall why we used DEV_TYPE_GPU. It made sense at the time but things have evolved quite a bit. Let me try setting to ACCEL and see how things go. Updates later today ...

Quick update. Now it's coming back to me.

Basically, there are still a bunch of logic paths/params that are technically generic but check for a "GPU".
One of the most obvious ones is -ngl. See the code below.
For example, suppose we want to offload only 10 layers to HTP0 (i.e. Hexagon backend dev 0) with -ngl 10. If HTP0 is registered as a GPU then -ngl works as expected; otherwise it does not.

I'm thinking ideally we'd want to overhaul the definition of a "GPU" in llama.cpp. It basically just means a device with dedicated on-device memory. That on-device memory is really the only key difference.
In the Snapdragon case both GPU and NPU have similar access to the main memory, can share buffers, etc.

Anyway, sorry for a long response :-)
Seems like at this point we should keep Hexagon as DEVICE_TYPE_GPU to keep it fully usable with the layer/model split features and such.

bool llama_supports_gpu_offload(void) {
    return ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_GPU) != nullptr ||
           ggml_backend_dev_by_type(GGML_BACKEND_DEVICE_TYPE_IGPU) != nullptr ||
           llama_supports_rpc();
}
    if (llama_supports_gpu_offload()) {
        const int n_gpu = std::min(n_gpu_layers, int(hparams.n_layer));
        
        int n_repeating = n_gpu;
        if (n_repeating > 0) {
            LLAMA_LOG_INFO("%s: offloading output layer to GPU\n", __func__);
            n_repeating--;
        }
        LLAMA_LOG_INFO("%s: offloading %d repeating layers to GPU\n", __func__, n_repeating);
    
        const int max_backend_supported_layers = hparams.n_layer + 1;
        const int max_offloadable_layers       = hparams.n_layer + 1;
    
        LLAMA_LOG_INFO("%s: offloaded %d/%d layers to GPU\n", __func__, std::min(n_gpu_layers, max_offloadable_layers), max_backend_supported_layers);
    }

@fl0rianr
Contributor Author

fl0rianr commented May 5, 2026

I'm thinking ideally we'd want to overhaul the definition of a "GPU" in llama.cpp. It basically just means a device with dedicated on-device memory. That on-device memory is really the only key difference.
In the Snapdragon case both GPU and NPU have similar access to the main memory, can share buffers, etc.

@max-krasnyansky Thanks! I have a somewhat bigger PR planned for iGPU fixes with a similar memory issue, since shared memory is not yet an implemented concept as far as I can see. I'm not sure whether running NPUs the iGPU way, i.e. treating them as a GPU with UMA, is the right call here. You can also engage on that PR when it comes around; let me know if I should mention you then.

@JohannesGaessler JohannesGaessler merged commit a010122 into ggml-org:master May 6, 2026
45 of 46 checks passed
@fl0rianr fl0rianr deleted the fix/fit-unknown-device-memory branch May 6, 2026 15:13