# Conv3D Implicit GEMM (Experimental)

An experimental Conv3D kernel prototype based on implicit GEMM (the convolution is computed as a GEMM without materializing an im2col buffer), with optional fused FP4 fake quantization for activations.

This code is kept under `experimental/` by design and is **not** part of the stable `modelopt.torch.quantization` API.

## Model Support

| Model/Framework | Supported | Notes |
|-----------------|-----------|-------|
| Video diffusion backbones using Conv3D | Partial | Intended for experimentation and microbenchmarking; see the wrapper sketch after this table |
| Generic LLM backbones | No | Conv3D path is not relevant |
| End-to-end ModelOpt PTQ/QAT pipeline | No | Not wired into formal quantization/export/compress flows |
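
One way to experiment with this kernel in an existing Conv3D-based backbone is to wrap an `nn.Conv3d` module and route its forward pass through `conv3d_implicit_gemm_cuda`. The sketch below is illustrative only: the wrapper class is hypothetical, and it assumes `groups == 1`, numeric (non-string) padding, and the keyword arguments listed in the API table further down.

```python
import torch
from torch import nn

from experimental.conv.implicit_gemm_cuda import conv3d_implicit_gemm_cuda


class ImplicitGemmConv3d(nn.Module):
    """Hypothetical wrapper that reuses an existing nn.Conv3d's parameters."""

    def __init__(self, conv: nn.Conv3d):
        super().__init__()
        assert conv.groups == 1, "grouped convolution is assumed to be unsupported"
        self.conv = conv  # keep the original module for its weight/bias/attributes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        c = self.conv
        # Route the forward pass through the experimental kernel.
        return conv3d_implicit_gemm_cuda(
            x,
            c.weight,
            bias=c.bias,
            stride=tuple(c.stride),
            padding=tuple(c.padding),
            dilation=tuple(c.dilation),
        )


# Example: swap one layer for a microbenchmark.
layer = nn.Conv3d(128, 512, kernel_size=3, padding=1, bias=False).cuda()
wrapped = ImplicitGemmConv3d(layer)
y = wrapped(torch.randn(1, 128, 21, 60, 106, device="cuda"))
```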

## Deployment

| Framework | Supported | Notes |
|-----------|-----------|-------|
| TensorRT-LLM | No | No formal export integration for this kernel path |
| vLLM | No | No integration |
| SGLang | No | No integration |
| PyTorch runtime (CUDA) | Yes (experimental) | JIT-compiles CUDA extension on first use |

## Usage

```python
import torch

from experimental.conv.implicit_gemm_cuda import conv3d_implicit_gemm_cuda
from modelopt.torch.quantization.tensor_quant import dynamic_block_quantize_op

x = torch.randn(1, 128, 21, 60, 106, device="cuda")
w = torch.randn(512, 128, 3, 3, 3, device="cuda")
block_size = 128

# Without FP4 activation quantization (drop-in-style Conv3D call)
out = conv3d_implicit_gemm_cuda(x, w, stride=(1, 1, 1), padding=(1, 1, 1))

# Optional block quantization of weights for experiments
w_q = dynamic_block_quantize_op(
    w,
    block_size,
    w.abs().max().unsqueeze(0),
    4,  # num_bits
    2,  # exponent_bits
    8,  # scale_num_bits
    4,  # scale_exponent_bits
)

# With FP4 activation fake quantization
out_q = conv3d_implicit_gemm_cuda(
    x,
    w_q,
    stride=(1, 1, 1),
    padding=(1, 1, 1),
    act_amax=x.abs().max().unsqueeze(0),
    quant_act=True,
    fp4_block_size=block_size,  # 128 or 256
)
```
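
Because the output shape is documented to match `torch.nn.functional.conv3d` (see Notes), a quick comparison against the PyTorch reference is a convenient sanity check when experimenting. The snippet below is illustrative; numerical closeness of the unquantized path to the reference is an assumption here, and the tolerances are arbitrary.

```python
import torch
import torch.nn.functional as F

from experimental.conv.implicit_gemm_cuda import conv3d_implicit_gemm_cuda

x = torch.randn(1, 128, 21, 60, 106, device="cuda")
w = torch.randn(512, 128, 3, 3, 3, device="cuda")

out = conv3d_implicit_gemm_cuda(x, w, stride=(1, 1, 1), padding=(1, 1, 1))
ref = F.conv3d(x, w, stride=(1, 1, 1), padding=(1, 1, 1))

# The output shape should match the PyTorch reference exactly.
assert out.shape == ref.shape

# Without activation quantization, values are expected to be close to the
# reference up to accumulation-order differences (tolerances are arbitrary).
print("max abs diff:", (out - ref).abs().max().item())
torch.testing.assert_close(out, ref, rtol=1e-2, atol=1e-2)
```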

## API

Function: `conv3d_implicit_gemm_cuda(...)` from `experimental/conv/implicit_gemm_cuda.py`

| Parameter | Description |
|-----------|-------------|
| `x` | Input tensor `[N, Cin, D, H, W]` |
| `w` | Weight tensor `[Cout, Cin, kD, kH, kW]` |
| `bias` | Optional bias `[Cout]` |
| `stride` | Convolution stride `(D, H, W)` |
| `padding` | Convolution padding `(D, H, W)` |
| `dilation` | Convolution dilation `(D, H, W)` |
| `act_amax` | Activation abs-max scalar tensor (required when `quant_act=True`) |
| `quant_act` | Enable FP4 fake quantization on activations |
| `fp4_block_size` | FP4 quantization block size (`128` or `256`) |
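
Putting the table together, a keyword-argument call exercising every parameter might look like the sketch below (reusing `x`, `w`, and `block_size` from the Usage example; `bias=None` and `dilation=(1, 1, 1)` are assumed neutral values, not documented defaults):

```python
out = conv3d_implicit_gemm_cuda(
    x,                                    # [N, Cin, D, H, W]
    w,                                    # [Cout, Cin, kD, kH, kW]
    bias=None,                            # optional [Cout]
    stride=(1, 1, 1),
    padding=(1, 1, 1),
    dilation=(1, 1, 1),
    act_amax=x.abs().max().unsqueeze(0),  # required when quant_act=True
    quant_act=True,
    fp4_block_size=block_size,            # 128 or 256
)
```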

## Status

Current state: **Prototype**

Known limitations:

- API is unstable and may change without notice.
- Not registered in core quantization module registries.
- Not covered by formal export/compress integration.
- CUDA extension compile latency on first invocation (see the warm-up sketch after this list).
- Validation and performance coverage are limited to local experiments.
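
One way to keep the first-call compile latency out of measured or user-facing paths is to trigger a small warm-up call at startup, as sketched below. This is a suggested pattern, not a utility shipped with the module; the warm-up shapes are arbitrary, chosen only to be small while keeping `Cin = 128` as in the Usage example.

```python
import time

import torch

from experimental.conv.implicit_gemm_cuda import conv3d_implicit_gemm_cuda


def warm_up_conv3d_kernel() -> None:
    """Run one small call so the CUDA extension is built before timing-sensitive work."""
    x = torch.randn(1, 128, 4, 8, 8, device="cuda")
    w = torch.randn(128, 128, 3, 3, 3, device="cuda")
    start = time.perf_counter()
    conv3d_implicit_gemm_cuda(x, w, stride=(1, 1, 1), padding=(1, 1, 1))
    torch.cuda.synchronize()
    print(f"warm-up call (including any JIT compile): {time.perf_counter() - start:.1f}s")


warm_up_conv3d_kernel()
```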

## Notes

- The CUDA kernel is JIT-compiled on first call (can take several seconds).
- Output shape matches `torch.nn.functional.conv3d`.
- FP4 path applies quantize-dequantize in-kernel for activation tiles (a conceptual sketch follows below).
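
For readers unfamiliar with block-wise fake quantization, the sketch below shows the quantize-dequantize idea in plain PyTorch for the FP4 E2M1 format. It is a conceptual reference only, not the in-kernel implementation: the blocking axis, the rounding mode, and the scale encoding used by the real kernel (which takes a single `act_amax` plus a block size) may differ, and mapping the block amax to the E2M1 maximum of 6.0 is a common convention assumed here.

```python
import torch

# Positive magnitudes representable in FP4 E2M1 (sign is handled separately).
_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])


def fake_quant_fp4_blockwise(x: torch.Tensor, block_size: int = 128) -> torch.Tensor:
    """Conceptual block-wise FP4 quantize-dequantize along the last dimension."""
    orig_shape = x.shape
    assert orig_shape[-1] % block_size == 0, "last dim must be a multiple of block_size"
    xb = x.reshape(-1, block_size)

    # Per-block scale so that the block amax maps onto the FP4 maximum (6.0).
    amax = xb.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12)
    scale = amax / 6.0
    scaled = xb / scale

    # Round each scaled magnitude to the nearest representable E2M1 value
    # (nearest-value rounding; the real kernel's tie-breaking may differ).
    grid = _E2M1_VALUES.to(x.device, x.dtype)
    idx = (scaled.abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    quant = grid[idx] * scaled.sign()

    # Dequantize back to the original range ("fake" quantization).
    return (quant * scale).reshape(orig_shape)


x = torch.randn(1, 128, 4, 8, 8, device="cuda")
# Blocks are taken along the channel axis here purely for illustration.
x_q = fake_quant_fp4_blockwise(x.movedim(1, -1), block_size=128).movedim(-1, 1)
```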

## References

- Implicit GEMM-based convolution design patterns in GPU kernels.
- ModelOpt FP4-related quantization utilities in `modelopt.torch.quantization.tensor_quant`.