common: do not fit to unknown device memory #22614
JohannesGaessler merged 3 commits into ggml-org:master from
Conversation
Signed-off-by: Florian Reinle <f.reinle@otec.de>
JohannesGaessler
left a comment
As is described in the PR that is linked in the comment, this was done because of the BLAS backend. It is not acceptable to break those backends to fix another one.
Signed-off-by: Florian Reinle <f.reinle@otec.de>
Thanks! The fix is now limited to GPU devices only, not CPU/BLAS.
```cpp
const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
if (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU) {
    throw common_params_fit_exception(std::string("device ") + ggml_backend_dev_name(dev)
        + " did not report memory; cannot safely fit to an unknown device budget");
}
```
We have an OpenCL backend that reports 0 bytes of free and total memory as well and reports itself as GGML_BACKEND_DEVICE_TYPE_GPU. This will affect OpenCL too.
llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp, lines 6549 to 6555 in 0929436
llama.cpp/ggml/src/ggml-opencl/ggml-opencl.cpp, lines 6557 to 6561 in 0929436
That is a bug in the OpenCL backend then and should be fixed there. In ggml-backend.h GPU devices are defined as "GPU device using dedicated memory".
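For reference, the device types in ggml-backend.h look roughly like this (reproduced from memory of the header, so treat the exact comments as approximate):

```cpp
enum ggml_backend_dev_type {
    // CPU device using system memory
    GGML_BACKEND_DEVICE_TYPE_CPU,
    // GPU device using dedicated memory
    GGML_BACKEND_DEVICE_TYPE_GPU,
    // accelerator devices intended to be used together with the CPU backend (e.g. BLAS)
    GGML_BACKEND_DEVICE_TYPE_ACCEL,
    // integrated GPU device using host memory
    GGML_BACKEND_DEVICE_TYPE_IGPU,
};
```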
I did previously ask about OpenCL's memory reporting, and according to this comment from @/lhez in #18587 (comment), there isn't a way to report memory. So I guess this isn't a bug but a technical limitation of OpenCL?
```cpp
const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
if (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU) {
    throw common_params_fit_exception(std::string("device ") + ggml_backend_dev_name(dev)
        + " did not report memory; cannot safely fit to an unknown device budget");
}
```
Instead of throwing a hard error, print a warning and use the (0, 0) values. Otherwise the fitting code will break if any one bad device is found. With (0, 0) the fitting code should not assign anything to that device.
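A minimal sketch of that suggestion (the helper name and warning text are hypothetical, not the merged code; the ggml device calls are the real API):

```cpp
#include <cstdio>
#include "ggml-backend.h"

// Hypothetical helper: fetch a device's memory budget for fitting.
// On a 0/0 report from a GPU-like device, keep the 0/0 budget and warn,
// so the fitter assigns nothing to that device instead of aborting.
static void fit_device_budget(ggml_backend_dev_t dev, size_t * free, size_t * total) {
    ggml_backend_dev_memory(dev, free, total);
    const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
    if (*free == 0 && *total == 0 &&
        (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU)) {
        fprintf(stderr, "warning: device %s did not report memory; "
                "keeping a 0/0 budget so nothing is assigned to it\n",
                ggml_backend_dev_name(dev));
    }
}
```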
There was a problem hiding this comment.
Thanks both, and thanks for the suggestion. Updated to avoid the hard error.
For GPU-like devices that report 0/0, we now keep the 0/0 budget and print a warning, so the fitter avoids assigning memory to that device.
Signed-off-by: Florian Reinle <f.reinle@otec.de>
@lhez Can you verify if this PR breaks anything for OpenCL? If llama_fit returns (0,0) for OpenCL, I'm pretty sure the backend will be skipped.
This PR improves the behavior for backends that can provide a reliable memory budget. OpenCL remains the special case; I also guess special adaptations of the OpenCL backend are out of scope. Not ideal, but safer than reintroducing an incorrect fallback.
@taronaeo Thank you for tagging me. I believe the Hexagon backend also reports 0 free memory and reports itself as a GPU device @max-krasnyansky. For the OpenCL backend, this change doesn't seem to break anything; it runs as normal (lines 1149 to 1157 in d05fe1d). The annoying part about OpenCL is that the standard does allow querying total memory (CL_DEVICE_GLOBAL_MEM_SIZE), but not free memory. I think as an intermediate step we can simply return total memory (global mem size) for free memory; although not very accurate, it is probably better than just returning 0 for both.
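A minimal sketch of that intermediate step, assuming a plain OpenCL query (the helper name is hypothetical; only clGetDeviceInfo and CL_DEVICE_GLOBAL_MEM_SIZE are standard OpenCL API):

```cpp
#include <CL/cl.h>
#include <cstddef>

// Best-effort memory report: core OpenCL has no free-memory query, so
// report CL_DEVICE_GLOBAL_MEM_SIZE as both total and an upper bound on free.
static void opencl_device_memory(cl_device_id device, size_t * free, size_t * total) {
    cl_ulong global_mem_size = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem_size), &global_mem_size, NULL);
    *total = (size_t) global_mem_size;
    *free  = (size_t) global_mem_size; // not actual free memory, upper bound only
}
```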
Hmm, what about performance? If I'm following Johannes's comments, this PR will skip assigning to OpenCL and instead run on the CPU, which would be bad for performance.
Yeah, it's definitely better than the current upstream implementation of assuming host memory instead :)
My suggestion would be to add another 1 GB cap besides the default 1 GB cap in this special case (when global mem size is assumed to be free mem size). Even if the GPU runs no other big job, an amount like that might be used by the display manager etc., and the normal default cap is still needed for a safe runtime. That way we run on the GPU again, and if llama.cpp is the "major application" it should run safely.
Additional thoughts: I think we can make this useful without pretending that total memory is accurate free memory. For OpenCL, the backend can report CL_DEVICE_GLOBAL_MEM_SIZE as both total and a best-effort upper bound for free memory. This is not actually free memory, so common fitting should treat it more conservatively (e.g. 1 GiB extra). My suggestion would be an extra 1 GiB margin whenever a device reports this kind of best-effort budget, as sketched below.
So with the default --fit-target, such devices effectively keep 2 GiB free instead of 1 GiB. For example, on a 16 GiB display-attached GPU, this would fit closer to 14 GiB instead of treating the full 16 GiB as available and only leaving the normal 1 GiB margin. With a normal CUDA or HIP backend we would have maybe 0.8 GiB used for the display, and the current fit would get about 14.2 GiB of model size working; basically the same. If the user explicitly changes --fit-target, that value still controls how tight or conservative fitting should be. The only slightly imperfect case is an explicit --fit-target 1024, which is indistinguishable from the default unless we add another parameter flag. I think that is acceptable for this intermediate fix. The important question for me is how to proceed from here.
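An illustrative sketch of that margin arithmetic under these assumptions (all names hypothetical; this is not the actual fit.cpp code):

```cpp
#include <cstddef>

constexpr size_t MiB = 1024ull * 1024ull;

// Hypothetical budget computation: subtract the user's --fit-target margin,
// plus an extra 1 GiB when "free" is only a best-effort total (free == total).
static size_t effective_device_budget(size_t free, size_t total, size_t fit_target_mib) {
    size_t margin = fit_target_mib * MiB;
    if (free == total && total > 0) {
        margin += 1024 * MiB; // free is just CL_DEVICE_GLOBAL_MEM_SIZE, be conservative
    }
    return free > margin ? free - margin : 0;
}
// Example: a 16 GiB device reporting free == total with the default
// --fit-target of 1024 MiB yields a budget of 14 GiB.
```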
My perspective is this: the behavior on master is wrong, and the behavior with this PR is also wrong but in a more well-defined way. @taronaeo @lhez do you approve of the PR as-is? I would prefer not to block it until some indeterminate time in the future when OpenCL and Hexagon implement the missing functionality.
I could implement and test an OpenCL fix locally as well, just as I proposed. But for Hexagon, no hardware is available to me. Hexagon can then be cleanly integrated into this fix design later on, if needed.
@fl0rianr make a new PR for any OpenCL additions.
The fix I have in mind is independent of this PR code-wise; it just means OpenCL would not report 0/0 anymore. But it would require changes in fit.cpp and the OpenCL backend. Thanks for all your input! Now let's wait for further guidance regarding the next step.
If both OpenCL and Hexagon can confirm that they are able to report some form of memory information for fitting, that would address my concern. I'm sorry, but I don't have a better solution to this. I was thinking that maybe reverting it to a hard error and informing the user that auto-fitting is unavailable for XXX backend(s), and that they should manually disable fitting and configure the context size and layer offload themselves, would be a better way than skipping the device and logging only one line across the vast logs we produce.
On another note, I may be wrong, but apparently Hexagon uses host memory as part of its UMA, so reporting 0/0 for Hexagon is actually correct: it will result in the host memory information being reported to the fitting code instead.
Sorry for the delayed response on this thread. Yes, the Snapdragon SoCs have unified memory. The buffers need to be allocated from CMA/DMA allocators, but otherwise it's just regular memory shared between CPU/GPU/NPU (i.e. not dedicated on-device memory). So yes, reporting 0/0 seems like correct behaviour to me.
Thanks for the info! I think I'll make another PR that specifically handles the OpenCL issue on top of this PR. It works, I checked it. With the new info we don't touch Hexagon at all. An open question for me regarding OpenCL: add another 1 GiB margin since total memory is used, or don't (a simpler change with no fit.cpp adaptation needed for that PR). Appreciate your feedback!
Runtime update: a small test using an Intel iGPU, since OpenCL in llama.cpp is limited to Intel and Adreno. I can re-run the test with different parameters if requested. Thanks for all your effort!
I think it makes sense to leave another 1 GiB margin.
I don't think this is an accurate test; you have additional parameters set. For reference, I am seeing these lines in your logs:
I don't have hardware that can run the OpenCL backend. Can you help me re-test it using only this command? I want to see the auto-fitting behavior with this PR without any additional knobs added:

```sh
./build-opencl-intel/bin/llama-server \
  -m ./qwen2.5-0.5b-instruct-q4_k_m.gguf \
  --fit on \
```
@taronaeo Good point, thanks for calling it out! I updated all 3 logs with the correct start parameters; I was indeed a bit unsure if I did it right. What I definitely consider a very good sign now is that we get the expected behavior. With the PR #22688 we fixed this again; the current master run is included for reference. What is your verdict? Do you think we are good to go now?
Given that OpenCL has a resolution now, it would be a yes from me, because that was my main concern: silently skipping the backend because it will always report 0/0 memory. For the Hexagon side of things, I was just thinking, @max-krasnyansky: if Hexagon is actually an NPU that uses host memory, shouldn't it report itself as an ACCEL device instead of a GPU? If my understanding is correct, then Hexagon can continue to use 0/0 memory to automatically fall back to host memory calculations without being silently skipped.
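To make the reasoning concrete, here is a hypothetical predicate mirroring the snippet quoted earlier: only GPU-like devices with a 0/0 report get the unknown-budget treatment, so an ACCEL device reporting 0/0 would keep the host-memory fallback:

```cpp
#include "ggml-backend.h"

// Hypothetical helper: returns true only for GPU/IGPU devices with a 0/0
// report. An ACCEL device reporting 0/0 returns false here and therefore
// keeps falling through to the host-memory accounting path.
static bool budget_is_unknown(ggml_backend_dev_t dev) {
    size_t free = 0, total = 0;
    ggml_backend_dev_memory(dev, &free, &total);
    const enum ggml_backend_dev_type type = ggml_backend_dev_type(dev);
    return free == 0 && total == 0 &&
           (type == GGML_BACKEND_DEVICE_TYPE_GPU || type == GGML_BACKEND_DEVICE_TYPE_IGPU);
}
```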
taronaeo
left a comment
Let's wait a little bit for @/max-krasnyansky's answer before merging :)
Good question. I'm trying to recall why we used DEV_TYPE_GPU. It made sense at the time, but things have evolved quite a bit. Let me try setting it to ACCEL and see how things go. Updates later today ...
Quick update. Now it's coming back to me. Basically, there is still a bunch of logic/params that are technically generic but check for a "GPU". I'm thinking ideally we'd want to overhaul the definition of a "GPU" in llama.cpp; it basically just means a device with dedicated on-device memory. That on-device memory is really the only key difference. Anyway, sorry for the long response :-)
@max-krasnyansky Thanks! I have a somewhat bigger PR planned on iGPU fixes with a similar issue regarding memory, since shared memory is not yet an implemented concept as far as I can see. I'm not sure whether running NPUs via that iGPU route, i.e. treating them as a GPU with UMA, is the right call here. You can also engage on that PR if it comes around; let me know if I should mention you then.
Overview
`--fit` currently treats `free == 0 && total == 0` device-memory reports as if they were host memory. This can make fit approve parameters against the wrong budget. This PR treats such reports as unknown.
Reasoning
A 0/0 device-memory report is an error on its own (e.g. on the user's side). But using the host-memory budget as a substitute can approve too-large context fits on the device, so the real model load can still hit an OOM because the selected parameters were validated against the wrong budget.
Additional information
This is intentionally a partial change extracted from the larger proposal in #22602.
Requirements
YES, based on code from "common: make --fit validate shared UMA memory budgets" #22602, but this condensed portion was extracted and checked manually.