Add a README.md file to ck/library/util #3665
base: develop
I'm collecting information about our current testing and added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances.
Can I get a review from @aosewski, @amd-anclark, @bartekxk, @johannes-graner, and @kabrahamAMD? If you know any other engineers working on this kind of testing improvement, can you invite them to review this code?
Pull request overview
This PR adds comprehensive documentation to the CK library utility directory, explaining the testing infrastructure with an emphasis on modern GPU-first validation strategies and automatic tolerance computation based on IEEE 754 precision limits.
Changes:
- Added a detailed README.md file documenting testing utilities, validation approaches, and best practices
- Documented the performance advantages of GPU-first validation over legacy CPU-based approaches
- Provided reference tables for tolerance computation across different data types
| FP32 | 23 | 1e-5 | 3e-6 |
| TF32 | 10 | 5e-4 | 5e-4 |
| FP16 | 10 | 1e-3 | 1e-3 |
| BF16 | 7 | 1e-1 | 1e-3 |
Copilot AI, Jan 28, 2026
The relative tolerance for BF16 (1e-1 or 0.1) appears unusually high compared to other data types. This suggests 10% relative error is acceptable, which seems inconsistent with typical numerical validation standards. Verify this value is correct or clarify if this is for specific use cases.
Suggested change:
- | BF16 | 7 | 1e-1 | 1e-3 |
+ | BF16 | 7 | 1e-2 | 1e-3 |
- `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling
- Grid-stride kernels with LDS reduction for optimal performance

**Performance**: 10-100x faster than CPU validation for large tensors.
Maybe remove the 10-100x number?

- `gpu_verify()`: Compares device tensors entirely on GPU
- Automatic tolerance computation based on data types
- Only transfers error statistics (~12 bytes), not tensors
We can remove "(~12 bytes)".
| FP16 | 10 | 1e-3 | 1e-3 |
| BF16 | 7 | 1e-1 | 1e-3 |
| FP8 | 3-4 | 1e-3 | 1e-3 |
| BF8 | 2-3 | 1e-3 | 1e-3 |
Rtol for BF8 lower than BF16?
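One way to sanity-check the rtol column questioned in this thread is to compare each value against the rounding granularity implied by the mantissa-bit column. The sketch below is not CK code: `machine_epsilon` and `rtol_in_epsilons` are hypothetical helper names, and it assumes the mantissa-bit column counts explicit fraction bits, so that the spacing just above 1.0 is 2^-bits.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not a CK function): spacing between 1.0 and the next
// representable value for a binary format with `mantissa_bits` fraction bits.
inline double machine_epsilon(int mantissa_bits)
{
    assert(mantissa_bits >= 0);
    return std::ldexp(1.0, -mantissa_bits); // 2^-mantissa_bits
}

// Hypothetical helper: expresses a relative tolerance as a multiple of the
// format's machine epsilon, so rows of the table can be compared directly.
inline double rtol_in_epsilons(double rtol, int mantissa_bits)
{
    return rtol / machine_epsilon(mantissa_bits);
}
```

Under that assumption, `rtol_in_epsilons(1e-1, 7)` is 12.8 for the BF16 row, and the suggested 1e-2 gives 1.28, which is close to the FP16 row (`rtol_in_epsilons(1e-3, 10)` is 1.024). For BF8 with 2 mantissa bits the epsilon is 0.25, so an rtol of 1e-3 is well below one rounding step. This only compares the listed tolerances to format precision; it does not say which values the README should use.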
↑ BOTTLENECK: PCIe transfer of entire tensor

- **Problem**: Transferring multi-GB tensors over PCIe is 10-100x slower than computation
We can remove the 10-100x figure.

- **Advantage**: All data stays on GPU, only error statistics transfer to CPU
- **Performance**: 10-100x faster for large tensors
Can remove or rephrase

This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility.

## Quick Start
This section seems to summarize what our good practices are, key principles, or validation guidelines rather than initial setup steps.
Suggested change:
- ## Quick Start
+ ## Recommended Practices
2. **Let the system compute tolerances** automatically based on data types
3. **Only transfer error statistics**, not full tensors

## File-to-Purpose Quick Reference
A small thing, but the table is organized from left-to-right as purpose-to-file. The current section name is the reverse.
Suggested change:
- ## File-to-Purpose Quick Reference
+ ## Purpose-to-Utility Quick Reference
// Explicit tolerance
bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision
bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);
Is it worth mentioning when to use an explicit vs. automatic tolerance?
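To make the explicit-tolerance case concrete, here is a minimal host-side sketch of a combined relative/absolute check. It is not CK's `gpu_verify`: the name `host_verify`, the `HostVerifyStats` struct, and the pass criterion `|out - ref| <= atol + rtol * |ref|` are assumptions made for illustration, as is reading the two literals in the excerpt as rtol followed by atol.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for illustration; not part of CK.
struct HostVerifyStats
{
    std::size_t error_count = 0;    // elements outside tolerance or non-finite
    float max_abs_error     = 0.0f; // largest finite |out - ref| seen
};

// Assumed criterion: an element passes when
//   |out - ref| <= atol + rtol * |ref|
// Non-finite values on either side are counted as errors.
inline HostVerifyStats host_verify(const std::vector<float>& out,
                                   const std::vector<float>& ref,
                                   float rtol,
                                   float atol)
{
    assert(out.size() == ref.size());
    HostVerifyStats stats;
    for(std::size_t i = 0; i < out.size(); ++i)
    {
        if(!std::isfinite(out[i]) || !std::isfinite(ref[i]))
        {
            ++stats.error_count;
            continue;
        }
        const float err = std::fabs(out[i] - ref[i]);
        if(err > stats.max_abs_error)
            stats.max_abs_error = err;
        if(err > atol + rtol * std::fabs(ref[i]))
            ++stats.error_count;
    }
    return stats;
}
```

With a criterion of this shape, rtol dominates for large reference values and atol dominates near zero, which is one reason a fixed explicit pair may suit a test with a known output range while a type-derived pair suits generic tests.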
- `get_relative_threshold<ComputeType, OutType, AccType>()`: Computes relative tolerance from mantissa bits
- `get_absolute_threshold<ComputeType, OutType, AccType>()`: Computes absolute tolerance scaled by magnitude
A short example for each of these calls could help users see how they are used.
johannes-graner left a comment:
Very nice to have a summary of the existing utility functionality.
My review mainly covers the GPU verification since that's what I'm most familiar with, although I looked through the rest too.
bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision
bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);
`K_dim` should be renamed to `accumulation_count` or similar, since it is not necessarily equal to the K dimension.
On a slightly separate note, this does not currently support split-k, which requires accounting for accumulation in multiple data types. See issue #3673.
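A small host-side experiment shows why the tolerance depends on the number of accumulated terms rather than on a particular tensor dimension. This is a sketch, not CK code: `accumulation_relative_error` is a hypothetical name, and it only covers single-type accumulation in `float`, not the split-k case mentioned above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of CK): sums `accumulation_count` copies of
// 0.1f in float and returns the relative error against a double reference.
inline double accumulation_relative_error(std::size_t accumulation_count)
{
    assert(accumulation_count > 0);
    const float term = 0.1f;

    // Accumulate in float, as a kernel with a float accumulator would.
    float acc = 0.0f;
    for(std::size_t i = 0; i < accumulation_count; ++i)
        acc += term;

    // Reference: the float term widened to double, times the count.
    const double exact =
        static_cast<double>(term) * static_cast<double>(accumulation_count);
    return std::fabs(static_cast<double>(acc) - exact) / exact;
}
```

Calling this with increasing counts (for example 100 and then 100000) shows the rounding error growing with the number of additions, which is the quantity an automatic tolerance would need to scale with.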
1. **Use GPU-first validation** for all new tests
2. **Avoid CPU transfers** unless debugging specific values
3. **Generate data on GPU** when possible
4. **Batch verification** to amortize kernel launch overhead
I don't think we do or support batch verification. It would be very VRAM-intensive, since many device-side tensors would have to be kept in memory instead of clearing the non-reference output after each tested kernel.
I'm collecting information about our current testing (#3664). As part of this work I added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances.
This README contains internal code comments for CK developers and does not need ROCm documentation review.