
Conversation

@shumway (Collaborator) commented Jan 28, 2026

I'm collecting information about our current testing (#3664). As part of this work I added a README to the directory to emphasize the GPU-first testing strategy and our support for type-specific tolerances.

This readme contains internal code comments for CK developers and does not need ROCm documentation review.

@shumway (Collaborator, Author) commented Jan 28, 2026

Can I get a review from @aosewski , @amd-anclark , @bartekxk , @johannes-graner, and @kabrahamAMD?

If you know any other engineers working on this kind of testing improvement, can you invite them to review this code?

@afagaj afagaj requested a review from Copilot January 28, 2026 17:19
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds comprehensive documentation to the CK library utility directory, explaining the testing infrastructure with an emphasis on modern GPU-first validation strategies and automatic tolerance computation based on IEEE 754 precision limits.

Changes:

- Added a detailed README.md file documenting testing utilities, validation approaches, and best practices
- Documented the performance advantages of GPU-first validation over legacy CPU-based approaches
- Provided reference tables for tolerance computation across different data types


| Type | Mantissa bits | rtol | atol |
|------|---------------|------|------|
| FP32 | 23 | 1e-5 | 3e-6 |
| TF32 | 10 | 5e-4 | 5e-4 |
| FP16 | 10 | 1e-3 | 1e-3 |
| BF16 | 7 | 1e-1 | 1e-3 |
Copilot AI commented Jan 28, 2026

The relative tolerance for BF16 (1e-1 or 0.1) appears unusually high compared to other data types. This suggests 10% relative error is acceptable, which seems inconsistent with typical numerical validation standards. Verify this value is correct or clarify if this is for specific use cases.

Suggested change
| BF16 | 7 | 1e-1 | 1e-3 |
| BF16 | 7 | 1e-2 | 1e-3 |

- `gpu_reduce_max()`: Computes max(abs(tensor)) on GPU for tolerance scaling
- Grid-stride kernels with LDS reduction for optimal performance

**Performance**: 10-100x faster than CPU validation for large tensors.
Contributor commented:

Maybe remove the 10-100x number?


- `gpu_verify()`: Compares device tensors entirely on GPU
- Automatic tolerance computation based on data types
- Only transfers error statistics (~12 bytes), not tensors
Contributor commented:

we can remove (~12 bytes)

| Type | Mantissa bits | rtol | atol |
|------|---------------|------|------|
| FP16 | 10 | 1e-3 | 1e-3 |
| BF16 | 7 | 1e-1 | 1e-3 |
| FP8 | 3-4 | 1e-3 | 1e-3 |
| BF8 | 2-3 | 1e-3 | 1e-3 |
Contributor commented:

Rtol for BF8 lower than BF16?

```
↑ BOTTLENECK: PCIe transfer of entire tensor
```

- **Problem**: Transferring multi-GB tensors over PCIe is 10-100x slower than computation
Contributor commented:

can remove the 10-100x


- **Advantage**: All data stays on GPU, only error statistics transfer to CPU
- **Performance**: 10-100x faster for large tensors
Contributor commented:

Can remove or rephrase


This directory contains utility headers for testing, benchmarking, and validating Composable Kernel (CK) operations. The utilities support both modern GPU-first validation for high-performance testing and legacy CPU-based approaches for backward compatibility.

## Quick Start
Collaborator commented:

This section seems to summarize our good practices, key principles, or validation guidelines rather than initial setup steps.

Suggested change
## Quick Start
## Recommended Practices

2. **Let the system compute tolerances** automatically based on data types
3. **Only transfer error statistics**, not full tensors

## File-to-Purpose Quick Reference
Collaborator commented:

A small thing, but the table is organized from left-to-right as purpose-to-file. The current section name is the reverse.

Suggested change
## File-to-Purpose Quick Reference
## Purpose-to-Utility Quick Reference

Comment on lines +44 to +48
```cpp
// Explicit tolerance
bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision
bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);
```
Collaborator commented:

Is it worth mentioning when to use an explicit vs. automatic tolerance?

Comment on lines +57 to +58
- `get_relative_threshold<ComputeType, OutType, AccType>()`: Computes relative tolerance from mantissa bits
- `get_absolute_threshold<ComputeType, OutType, AccType>()`: Computes absolute tolerance scaled by magnitude
Collaborator commented:

If it is helpful, a short example for each of these calls could help users see its usage.

@johannes-graner (Contributor) left a comment

Very nice to have a summary of the existing utility functionality.
My review mainly covers the GPU verification since that's what I'm most familiar with, although I looked through the rest too.

```cpp
bool pass = gpu_verify<float>(output_dev, reference_dev, 1e-5f, 1e-6f, size);

// Automatic tolerance for mixed precision
bool pass = gpu_verify<float, half_t, float>(output_dev, reference_dev, K_dim, size);
```
Contributor commented:

K_dim should be changed to accumulation_count or similar, it's not necessarily equal to the K dimension.

On a slightly separate note, this does not currently support split-k, which requires accounting for accumulation in multiple data types. See issue #3673.

1. **Use GPU-first validation** for all new tests
2. **Avoid CPU transfers** unless debugging specific values
3. **Generate data on GPU** when possible
4. **Batch verification** to amortize kernel launch overhead
Contributor commented:

I don't think we do or support batch verification. It would be very VRAM-intense since many device-side tensors would have to be kept in memory instead of clearing the non-reference output after each kernel that is tested.
