Advanced Topics in cuPyNumeric (profiling & debugging) #1242
base: main
Conversation
…eviewed on github for ease
…nto a single file that displays on github
Updated references and formatting in profiling_debugging.rst for clarity and consistency.
Corrected numbering and formatting in profiling debugging documentation.
Updated profiler output images and descriptions for both inefficient and efficient CPU, utility, I/O, system, channel, GPU, and framebuffer results.
Updated formatting for section titles and removed example text.
@ipdemes @shriram-jagan @Jacobfaib could you please take a look over this tutorial that Nathan (Sunita Chandrasekaran's student from University of Delaware) wrote for us? Also @lightsighter FYI
**3.) After a run completes, in the directory you ran the command you’ll see:**

- A folder: ``legate_prof/``, a self-contained HTML report
I don't think this folder is generated with the recent legate/legate_prof
Yes, you're right, I checked my latest runs. That section is now updated. Thank you.
…script. Removed unnecessary line breaks and adjusted formatting for clarity.
Updated usage examples to include the --provenance flag for diagnostic commands.
@manopapad, what's the status of the review?
shriram-jagan left a comment
This looks really good, please address some of the comments I left.
see:

- Setting up your environment and running `cuPyNumeric <https://docs.nvidia.com/cupynumeric/latest/user/tutorial.html>`_
- Extending cuPyNumeric with `Legate Task <https://docs.nvidia.com/cupynumeric/25.10/user/task.html>`_
https://docs.nvidia.com/cupynumeric/latest/user/task.html instead of pointing to 25.10
Great point, fixed!
multi-node clusters. Previous sections covered how to get code running; here
the focus shifts to making workloads production-ready. At scale, success is
not just about adding GPUs or nodes, it requires ensuring that applications
remain efficient, stable, and resilient under load. That means finding
not sure what "stable, and resilient under load" means in this context. I'd probably leave out this sentence that defines what success is at scale and instead continue to the next sentence which is more specific and relatable.
I agree, I can see how that would be ambiguous; it's been removed.
* - **What you'll gain:** By combining profiling tools with solid
    OOM-handling strategies, you can significantly improve the
    efficiency, scalability, and reliability of cuPyNumeric
by reliability, do you mean that the library doesn't fail on different processor variants or architectures? (you don't have to update the doc with the definition, maybe just tell me what it is in a comment)
Yeah, it's definitely a broad statement. What I meant by "reliability" here is execution stability: applying profiling and OOM-handling practices makes cuPyNumeric runs less likely to fail (OOM/job crashes) and reduces stalling/underutilization from memory pressure and tiny tasks, especially at scale, which would in turn make it more reliable.
It doesn't exactly mean the program is reliable in the sense that it would always work across different architectures, as that would require more context (the specific runtime and hardware environment), which is outside the scope of the profiler and OOM sections.
Please feel free to let me know if you think this part should be altered for clarity or removed, I'd be more than happy to change it!
Applying profiling and OOM-handling practices makes cuPyNumeric runs less likely to fail (OOM/job crashes) and reduces stalling/underutilization from memory pressure and tiny tasks, especially at scale, which would in turn make it more reliable.
I like this part and I understand now what you are trying to convey. Profiling gives you an understanding of how your application is performing at scale. In particular, it helps you understand different metrics -- memory pressure and tiny tasks, like you mentioned, are a couple of them. Can you rephrase the original sentence and make it more specific instead of saying "reliability"? Somehow I don't feel comfortable using the word reliability in the context of profiling when we have an asynchronous runtime underneath.
Yes, I have this now: "What you'll gain: By combining profiling with practical OOM-handling strategies, you can improve efficiency and scaling by identifying memory pressure and over-granular execution, while reducing OOM crashes and runtime stalls across CPUs, GPUs, and multi-node systems."
**For more detail, see the official references:**

- `Usage — NVIDIA legate <https://docs.nvidia.com/legate/24.11/usage.html>`_
https://docs.nvidia.com/legate/latest/manual/usage/index.html
here and elsewhere in this page, please link to "latest" instead of linking to a specific version
Done, all sections should be fixed!
# Multi-GPU/Multi-Node: multiple ranks (pass them all, e.g. N0, N1, N2, etc.)
legate_prof view /path/to/legate_*.prof
Should we leave a link to the Legate profiler Stanford page?
I think that could be a good idea. Is this what you were referring to? https://legion.stanford.edu/profiling/index.html
Hey Bo or Manolis, what do you think? If yes, I can go ahead and add it in.
in many tiny tasks; runtime overhead dominates useful computation.

Profiler Output and Interpretation - Inefficient CPU Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I guess it might be easy to interpret if you tell the user how the data is presented in the profiler -- the profiler's x-axis is time and the y-axis is utilization, and each panel in the profile is some kind of resource, so you essentially see resource utilization in each panel (e.g., mem, processor utilization (cpu/gpu/omp), etc.).
Good catch, I added a general interpretation under each profiling section, for inefficient (2x) and efficient (2x). Under the first image in each part:
"Interpretation: The profiler is presented as a timeline. The x-axis is time, the y-axis is organized by resource/utilization lanes. each horizontal lane represents a particular resource stream (CPU workers, GPU Device/Host, runtime/Utility threads, memory pools like Framebuffer/Zerocopy, and copy/Channel). Colored boxes show work on that resource; the box width is how long it ran, gaps indicate idle/waiting, and dense “barcode” slivers usually mean many tiny tasks (high overhead), while long solid blocks indicate fewer, larger tasks (better utilization)."
"each horizontal lane" -> "Each horizontal lane". Looks good.
fixed, thanks!
production-ready code. Profiling turns performance tuning from guesswork into
an intentional, data-driven process that elevates code quality from functional
to excellent.
Can you also mention how you can "trace back" dependencies by, say, looking at a task in a panel (GPU utilization) and finding its task ID, and then searching for that ID and looking at other panels (say, utility to see when it got mapped, or channel to see if there was any data movement, etc.)? This is how we can interactively find what operations were associated with a task.
Note that the profiler allows you to search by other keys as well, not just ID.
Absolutely, it's been added in the wrap-up section.
Updated links to the latest NVIDIA Legate documentation.
Added detailed interpretation of profiler timelines for CPU and GPU resources.
Added details about the traceable view feature in profiling, explaining how to use task identifiers to connect performance symptoms to runtime activities.
@NathanGraddon, left a few nits. Looks good on my end.
Enhanced the explanation of benefits from profiling tools and OOM-handling strategies, emphasizing memory pressure identification and execution granularity.
Integration of Part 4: Advanced Topics in cuPyNumeric (profiling & debugging) rst file.