
Conversation

@shumway (Collaborator) commented Jan 28, 2026

I'm collecting information about our current testing (#3664). As part of this work I added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances.

This readme contains internal code comments for CK developers and does not need ROCm documentation review.

@shumway (Collaborator, Author) commented Jan 28, 2026

Can I get a review from @aosewski , @amd-anclark , @bartekxk , @johannes-graner, and @kabrahamAMD?

If you know any other engineers working on this kind of testing improvement, can you invite them to review this code?

@afagaj afagaj requested a review from Copilot January 28, 2026 17:19
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds comprehensive documentation to the CK library utility directory, explaining the testing infrastructure with an emphasis on modern GPU-first validation strategies and automatic tolerance computation based on IEEE 754 precision limits.

Changes:

- Added a detailed README.md file documenting testing utilities, validation approaches, and best practices
- Documented the performance advantages of GPU-first validation over legacy CPU-based approaches
- Provided reference tables for tolerance computation across different data types


| Type | Mantissa bits | rtol | atol |
|------|---------------|------|------|
| FP32 | 23 | 1e-5 | 3e-6 |
| TF32 | 10 | 5e-4 | 5e-4 |
| FP16 | 10 | 1e-3 | 1e-3 |
| BF16 | 7 | 1e-1 | 1e-3 |
Copilot AI commented Jan 28, 2026

The relative tolerance for BF16 (1e-1 or 0.1) appears unusually high compared to other data types. This suggests 10% relative error is acceptable, which seems inconsistent with typical numerical validation standards. Verify this value is correct or clarify if this is for specific use cases.

Suggested change
| BF16 | 7 | 1e-1 | 1e-3 |
| BF16 | 7 | 1e-2 | 1e-3 |

- `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling
- Grid-stride kernels with LDS reduction for optimal performance

**Performance**: 10-100x faster than CPU validation for large tensors.
Contributor commented:

Maybe remove the 10-100x number?


- `gpu_verify()`: Compares device tensors entirely on GPU
- Automatic tolerance computation based on data types
- Only transfers error statistics (~12 bytes), not tensors
Contributor commented:

we can remove (~12 bytes)

| Type | Mantissa bits | rtol | atol |
|------|---------------|------|------|
| FP16 | 10 | 1e-3 | 1e-3 |
| BF16 | 7 | 1e-1 | 1e-3 |
| FP8 | 3-4 | 1e-3 | 1e-3 |
| BF8 | 2-3 | 1e-3 | 1e-3 |
Contributor commented:

Rtol for BF8 lower than BF16?

```
↑ BOTTLENECK: PCIe transfer of entire tensor
```

- **Problem**: Transferring multi-GB tensors over PCIe is 10-100x slower than computation
Contributor commented:

can remove the 10-100x


- **Advantage**: All data stays on GPU, only error statistics transfer to CPU
- **Performance**: 10-100x faster for large tensors
Contributor commented:

Can remove or rephrase


This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility.

## Quick Start
Collaborator commented:

This section seems to summarize our good practices, key principles, or validation guidelines rather than initial setup steps.

Suggested change
## Quick Start
## Recommended Practices

2. **Let the system compute tolerances** automatically based on data types
3. **Only transfer error statistics**, not full tensors

## File-to-Purpose Quick Reference
Collaborator commented:

A small thing, but the table is organized from left-to-right as purpose-to-file. The current section name is the reverse.

Suggested change
## File-to-Purpose Quick Reference
## Purpose-to-Utility Quick Reference

Comment on lines +44 to +48
```cpp
// Explicit tolerance
bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision
bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);
```
Collaborator commented:

Is it worth mentioning when to use an explicit vs. automatic tolerance?

Comment on lines +57 to +58
- `get_relative_threshold<ComputeType, OutType, AccType>()`: Computes relative tolerance from mantissa bits
- `get_absolute_threshold<ComputeType, OutType, AccType>()`: Computes absolute tolerance scaled by magnitude
Collaborator commented:

If it is helpful, a short example for each of these calls could help users see its usage.

@johannes-graner (Contributor) left a comment

Very nice to have a summary of the existing utility functionality.
My review mainly covers the GPU verification since that's what I'm most familiar with, although I looked through the rest too.

```cpp
bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision
bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);
```
Contributor commented:

K_dim should be changed to accumulation_count or similar, it's not necessarily equal to the K dimension.

On a slightly separate note, this does not currently support split-k, which requires accounting for accumulation in multiple data types. See issue #3673.

1. **Use GPU-first validation** for all new tests
2. **Avoid CPU transfers** unless debugging specific values
3. **Generate data on GPU** when possible
4. **Batch verification** to amortize kernel launch overhead
Contributor commented:

I don't think we do or support batch verification. It would be very VRAM-intense since many device-side tensors would have to be kept in memory instead of clearing the non-reference output after each kernel that is tested.
