8587 test errors on pytorch release 2508 on series 50 #8770
garciadias wants to merge 12 commits into Project-MONAI:dev
Signed-off-by: R. Garcia-Dias <rafaelagd@gmail.com>
…en USE_COMPILED=True
monai._C (grid_pull) was not compiled with sm_120 (Blackwell) architecture support, causing spatial_resample to produce incorrect results on RTX 50-series GPUs when USE_COMPILED=True.
Add _compiled_unsupported() to detect compute capability major >= 12 at runtime and transparently fall back to the PyTorch-native affine_grid + grid_sample path, which is verified correct on sm_120.
Fixes test_flips_inverse_124 in tests.transforms.spatial.test_spatial_resampled on NVIDIA GeForce RTX 5090 (Blackwell, sm_120).
The same USE_COMPILED guard that was fixed in spatial_resample (functional.py) was also present in Resample.__call__ (array.py), used by Affine, RandAffine and related transforms. Apply the same _compiled_unsupported() check so that grid_pull is not called on sm_120 (Blackwell) devices when monai._C lacks sm_120 support, preventing garbage output in test_affine, test_affined, test_rand_affine and test_rand_affined on RTX 50-series GPUs.
📝 Walkthrough
This diff mainly reformats various error/exception messages from multi-part string concatenations into single-line strings across several modules. It also adds a private helper
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (2 inconclusive)
✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
monai/transforms/spatial/array.py (1)
2066-2117: ⚠️ Potential issue | 🟠 Major
Convert grid coordinates when falling back to grid_sample with norm_coords=False.
When _compiled_unsupported() returns True, the fallback path at line 2103+ passes grids directly to grid_sample without converting from the compiled convention [0, size-1] to the PyTorch convention [-1, 1]. This causes silent missampling on unsupported devices (Blackwell, cc 12.x). Add coordinate conversion in the else block or reject the norm_coords=False + compiled fallback combination explicitly. Add test coverage for this scenario.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/transforms/spatial/array.py` around lines 2066 - 2117, The fallback branch that calls torch.nn.functional.grid_sample (else block) fails to convert compiled-style grid coordinates [0, size-1] to PyTorch normalized coords [-1, 1] when self.norm_coords is False; update the else branch before calling grid_sample to convert grid_t per-dimension: for each i and corresponding spatial size dim from img_t.shape, compute grid_t[0, ..., i] = (grid_t[0, ..., i] * 2.0 / max(1, dim - 1)) - 1.0 (use max to avoid division by zero) so grid_sample receives normalized coords, and preserve dtype/device/memory_format as done elsewhere (target symbols: _compiled_unsupported, grid_sample, self.norm_coords, grid_t, img_t, moveaxis).
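The conversion this comment asks for can be sketched in plain Python. This is only an illustration, assuming the `align_corners=True` convention in which voxel index 0 maps to -1 and index `size - 1` maps to +1; the helper name `to_normalized` is hypothetical and is not part of MONAI or PyTorch.

```python
def to_normalized(coord: float, size: int) -> float:
    """Map a voxel-space coordinate in [0, size - 1] to the range [-1, 1].

    Hypothetical helper for illustration only; assumes align_corners=True.
    """
    # max(1, ...) avoids a division by zero for a singleton dimension.
    return coord * 2.0 / max(1, size - 1) - 1.0


if __name__ == "__main__":
    size = 5
    # Voxel indices 0..4 map to -1.0, -0.5, 0.0, 0.5, 1.0.
    print([to_normalized(i, size) for i in range(size)])
```

For a size-1 dimension the guard maps coordinate 0 to -1.0 instead of raising. A real fallback may also need to reorder the grid axes, since `grid_sample` expects the last grid dimension ordered as (x, y[, z]).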
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@monai/apps/auto3dseg/bundle_gen.py`:
- Around line 266-268: The error message raised in BundleAlgo._run_cmd uses a
concatenated string missing a space after the period; update the
NotImplementedError text that references self.device_setting['MN_START_METHOD']
so it reads "...not supported yet. Try modify BundleAlgo._run_cmd for your
cluster." (preserve the current exception chaining "from err") to fix the
missing space between sentences.
In `@monai/apps/detection/networks/retinanet_detector.py`:
- Around line 519-522: The error message raised when self.inferer is None
contains a missing space after the period; update the string in the raise
ValueError call to insert a space so it reads "`self.inferer` is not defined.
Please refer to function self.set_sliding_window_inferer(*)" (locate the check
referencing self.inferer and the call to self.set_sliding_window_inferer to
apply the fix).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: aa93797a-0eea-4904-a754-c937994e99c8
📒 Files selected for processing (14)
- monai/apps/auto3dseg/bundle_gen.py
- monai/apps/detection/networks/retinanet_detector.py
- monai/apps/detection/utils/anchor_utils.py
- monai/apps/detection/utils/detector_utils.py
- monai/auto3dseg/analyzer.py
- monai/data/wsi_reader.py
- monai/losses/unified_focal_loss.py
- monai/metrics/meandice.py
- monai/networks/blocks/patchembedding.py
- monai/networks/layers/factories.py
- monai/transforms/croppad/array.py
- monai/transforms/regularization/array.py
- monai/transforms/spatial/array.py
- monai/transforms/spatial/functional.py
  raise NotImplementedError(
-     f"{self.device_setting['MN_START_METHOD']} is not supported yet."
-     "Try modify BundleAlgo._run_cmd for your cluster."
+     f"{self.device_setting['MN_START_METHOD']} is not supported yet.Try modify BundleAlgo._run_cmd for your cluster."
  ) from err
Missing space after period in error message.
The concatenation dropped the space between sentences: "not supported yet.Try modify" should be "not supported yet. Try modify".
Proposed fix
raise NotImplementedError(
- f"{self.device_setting['MN_START_METHOD']} is not supported yet.Try modify BundleAlgo._run_cmd for your cluster."
+ f"{self.device_setting['MN_START_METHOD']} is not supported yet. Try modify BundleAlgo._run_cmd for your cluster."
) from err
📝 Committable suggestion
raise NotImplementedError(
    f"{self.device_setting['MN_START_METHOD']} is not supported yet. Try modify BundleAlgo._run_cmd for your cluster."
) from err
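The missing space comes from how Python joins adjacent string literals: they are concatenated with nothing in between, so the separator has to be inside one of the literals. A minimal sketch of the effect follows; the function name `build_message` and the placeholder value `"some_method"` are assumptions used only for illustration.

```python
from __future__ import annotations


def build_message(method: str) -> tuple[str, str]:
    """Return the message without and with the separating space."""
    # Adjacent string literals are joined with no separator between them.
    broken = (
        f"{method} is not supported yet."
        "Try modify BundleAlgo._run_cmd for your cluster."
    )
    # Adding a trailing space to the first literal restores the sentence break.
    fixed = (
        f"{method} is not supported yet. "
        "Try modify BundleAlgo._run_cmd for your cluster."
    )
    return broken, fixed


if __name__ == "__main__":
    broken, fixed = build_message("some_method")
    print(broken)
    print(fixed)
```

The two-literal original had the same defect; collapsing it into a single line only made the missing space visible.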
  if self.inferer is None:
      raise ValueError(
-         "`self.inferer` is not defined." "Please refer to function self.set_sliding_window_inferer(*)."
+         "`self.inferer` is not defined.Please refer to function self.set_sliding_window_inferer(*)."
      )
Missing space after period in error message.
"is not defined.Please refer" should be "is not defined. Please refer".
Proposed fix
raise ValueError(
- "`self.inferer` is not defined.Please refer to function self.set_sliding_window_inferer(*)."
+ "`self.inferer` is not defined. Please refer to function self.set_sliding_window_inferer(*)."
)
🧰 Tools
🪛 Ruff (0.15.4)
[warning] 520-522: Avoid specifying long messages outside the exception class (TRY003)
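One common way to address a TRY003 warning is to move the message into a dedicated exception class so the raise site stays short. The sketch below assumes that approach; `InfererNotDefinedError` and `Detector` are hypothetical names and do not exist in MONAI.

```python
class InfererNotDefinedError(ValueError):
    """Hypothetical exception that owns its message (not part of MONAI)."""

    def __init__(self) -> None:
        super().__init__(
            "`self.inferer` is not defined. "
            "Please refer to function self.set_sliding_window_inferer(*)."
        )


class Detector:
    """Hypothetical stand-in for a detector with an optional inferer."""

    def __init__(self, inferer=None) -> None:
        self.inferer = inferer

    def check_inferer(self) -> None:
        # The raise site carries no long message, which is what TRY003 asks for.
        if self.inferer is None:
            raise InfererNotDefinedError


if __name__ == "__main__":
    try:
        Detector().check_inferer()
    except ValueError as err:
        print(err)
```

Subclassing `ValueError` keeps existing `except ValueError` handlers working.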
Tolerance tests are passing.
🐧 ❯ export CUDAVERSION=slim; docker run --rm -v ./:/opt/monai/ --name monai_$CUDAVERSION --gpus=all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it monai:$CUDAVERSION python -m unittest \
tests.transforms.test_affine \
tests.transforms.test_affined \
tests.transforms.test_affine_grid \
tests.transforms.test_rand_affine \
tests.transforms.test_rand_affine_grid \
tests.transforms.test_rand_affined \
tests.transforms.test_create_grid_and_affine \
tests.transforms.test_spatial_resample \
tests.transforms.spatial.test_spatial_resampled \
tests.networks.layers.test_affine_transform \
tests.data.utils.test_zoom_affine \
tests.integration.test_meta_affine
tf32 enabled: False
monai.transforms.spatial.array Orientation.__init__:labels: Current default value of argument `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` was changed in version None from `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` to `labels=None`. Default value changed to None meaning that the transform now uses the 'space' of a meta-tensor, if applicable, to determine appropriate axis labels.
monai.transforms.spatial.dictionary Orientationd.__init__:labels: Current default value of argument `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` was changed in version None from `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` to `labels=None`. Default value changed to None meaning that the transform now uses the 'space' of a meta-tensor, if applicable, to determine appropriate axis labels.
.........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................__array__ implementation doesn't accept a copy keyword, so passing copy=False failed. __array__ must implement 'dtype' and 'copy' keyword arguments. To learn more, see the migration guide https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
........__array__ implementation doesn't accept a copy keyword, so passing copy=False failed. __array__ must implement 'dtype' and 'copy' keyword arguments. To learn more, see the migration guide https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword
...................................................................................................................................................................................................................................................................................Since version 1.3.0, affine_grid behavior has changed for unit-size grids when align_corners=True. This is not an intended use case of affine_grid. See the documentation of affine_grid for details.
...........................2026-03-09 13:37:04,857 - INFO - Verified 'ref_avg152T1_LR.nii.gz', sha256: c01a50caa7a563158ecda43d93a1466bfc8aa939bc16b06452ac1089c54661c8.
2026-03-09 13:37:04,858 - INFO - File exists: /opt/monai/tests/testing_data/ref_avg152T1_LR.nii.gz, skipped downloading.
2026-03-09 13:37:04,858 - INFO - Verified 'ref_avg152T1_RL.nii.gz', sha256: 8a731128dac4de46ccb2cc60d972b98f75a52f21fb63ddb040ca96f0aed8b51a.
2026-03-09 13:37:04,858 - INFO - File exists: /opt/monai/tests/testing_data/ref_avg152T1_RL.nii.gz, skipped downloading.
builtin type SwigPyPacked has no __module__ attribute
builtin type SwigPyObject has no __module__ attribute
builtin type swigvarlink has no __module__ attribute
...........monai.transforms.spatial.array Orientation.__init__:labels: Current default value of argument `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` was changed in version None from `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` to `labels=None`. Default value changed to None meaning that the transform now uses the 'space' of a meta-tensor, if applicable, to determine appropriate axis labels.
........................monai.transforms.spatial.dictionary Orientationd.__init__:labels: Current default value of argument `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` was changed in version None from `labels=(('L', 'R'), ('P', 'A'), ('I', 'S'))` to `labels=None`. Default value changed to None meaning that the transform now uses the 'space' of a meta-tensor, if applicable, to determine appropriate axis labels.
............
----------------------------------------------------------------------
Ran 910 tests in 12.732s
OK
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
…ytorch-release-2508-on-series-50
DCO Remediation Commit for R. Garcia-Dias <rafaelagd@gmail.com>
I, R. Garcia-Dias <rafaelagd@gmail.com>, hereby add my Signed-off-by to this commit: ba56a6d
I, R. Garcia-Dias <rafaelagd@gmail.com>, hereby add my Signed-off-by to this commit: 09c2cd9
I, R. Garcia-Dias <rafaelagd@gmail.com>, hereby add my Signed-off-by to this commit: 7cd0607
Signed-off-by: R. Garcia-Dias <rafaelagd@gmail.com>
f64b17a to 356956a
Actionable comments posted: 1
♻️ Duplicate comments (1)
monai/apps/detection/networks/retinanet_detector.py (1)
519-522: ⚠️ Potential issue | 🟡 Minor
Missing space after period.
"is not defined.Please" → "is not defined. Please".
Proposed fix
raise ValueError(
- "`self.inferer` is not defined.Please refer to function self.set_sliding_window_inferer(*)."
+ "`self.inferer` is not defined. Please refer to function self.set_sliding_window_inferer(*)."
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/apps/detection/networks/retinanet_detector.py` around lines 519 - 522, The ValueError message raised when self.inferer is None in retinanet_detector.py contains a missing space after the period; update the string in the check inside the method where self.inferer is validated (the block referencing self.inferer and the suggestion to call self.set_sliding_window_inferer) to read "`self.inferer` is not defined. Please refer to function self.set_sliding_window_inferer(*)`" so there is a space after the period.
🧹 Nitpick comments (1)
monai/transforms/spatial/functional.py (1)
57-76: Function implementation is sound; docstring lacks formal Args/Returns.
The logic correctly gates Blackwell GPUs (sm_120+). Consider adding explicit Args: and Returns: sections to align with Google-style docstrings per project conventions.
📝 Suggested docstring format
  def _compiled_unsupported(device: torch.device) -> bool:
      """
      Return True if ``monai._C`` (the compiled C extension providing ``grid_pull``) is not
      compiled with support for the given CUDA device's compute capability.
-
+
+     Args:
+         device: The torch device to check for compiled extension support.
+
+     Returns:
+         True if the device is CUDA with compute capability major >= 12 (Blackwell+),
+         False otherwise.
+
      ``monai._C`` is built at install time against a fixed set of CUDA architectures.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@monai/transforms/spatial/functional.py` around lines 57 - 76, The _compiled_unsupported function's docstring is missing formal Google-style "Args:" and "Returns:" sections; update the docstring for _compiled_unsupported(device: torch.device) to include an "Args:" entry describing the device parameter (type and semantics) and a "Returns:" entry stating it returns a bool indicating whether the compiled monai._C lacks support for the device (True for CUDA devices with major compute capability >= 12, False otherwise), keeping the existing explanatory text intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@monai/transforms/spatial/functional.py`:
- Around line 57-76: Warp.forward currently calls grid_pull when USE_COMPILED is
true without checking whether the compiled C extension supports the device;
update Warp.forward to compute _use_compiled = USE_COMPILED and not
_compiled_unsupported(image.device) (reuse the existing _compiled_unsupported
function from functional.py) and branch: if not _use_compiled use the PyTorch
native affine_grid + grid_sample path, else call grid_pull as before, ensuring
grid_pull is only invoked on supported CUDA devices (reference symbols:
Warp.forward, USE_COMPILED, _compiled_unsupported, grid_pull).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a83bf9dd-d11c-4601-9888-15b532b84f0b
📒 Files selected for processing (7)
- monai/apps/auto3dseg/bundle_gen.py
- monai/apps/detection/networks/retinanet_detector.py
- monai/auto3dseg/analyzer.py
- monai/data/wsi_reader.py
- monai/transforms/regularization/array.py
- monai/transforms/spatial/array.py
- monai/transforms/spatial/functional.py
💤 Files with no reviewable changes (1)
- monai/transforms/regularization/array.py
🚧 Files skipped from review as they are similar to previous changes (2)
- monai/data/wsi_reader.py
- monai/apps/auto3dseg/bundle_gen.py
def _compiled_unsupported(device: torch.device) -> bool:
    """
    Return True if ``monai._C`` (the compiled C extension providing ``grid_pull``) is not
    compiled with support for the given CUDA device's compute capability.

    ``monai._C`` is built at install time against a fixed set of CUDA architectures.
    NVIDIA Blackwell GPUs (sm_120, compute capability 12.x) and newer were not included in
    the default ``TORCH_CUDA_ARCH_LIST`` when the MONAI slim image was originally built,
    so executing ``grid_pull`` on those devices produces incorrect results. Falling back to
    the PyTorch-native ``affine_grid`` + ``grid_sample`` path (``USE_COMPILED=False``) gives
    correct output on all architectures.

    The threshold (``major >= 12``) matches the first architecture family (Blackwell, sm_120)
    that shipped after the highest sm supported in the current default build list (sm_90,
    Hopper). Adjust this constant when ``monai._C`` is rebuilt with sm_120+ support.
    """
    if device.type != "cuda":
        return False
    return torch.cuda.get_device_properties(device).major >= 12
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find all USE_COMPILED usages that might need the _compiled_unsupported guard
rg -n "USE_COMPILED" --type=py -C2 | grep -v "_compiled_unsupported"
Repository: Project-MONAI/MONAI
Length of output: 22885
🏁 Script executed:
cat -n monai/networks/blocks/warp.py | head -200
Repository: Project-MONAI/MONAI
Length of output: 9110
🏁 Script executed:
sed -n '57,250p' monai/transforms/spatial/functional.py | head -100
Repository: Project-MONAI/MONAI
Length of output: 6288
🏁 Script executed:
sed -n '170,200p' monai/transforms/spatial/functional.py
Repository: Project-MONAI/MONAI
Length of output: 1558
🏁 Script executed:
sed -n '140,152p' monai/networks/blocks/warp.py
Repository: Project-MONAI/MONAI
Length of output: 693
Add device compatibility check before grid_pull call.
The Warp.forward() method (line 141+) calls grid_pull when USE_COMPILED=True without verifying the device supports the compiled C extension. Following the pattern in functional.py (line 183), use:
_use_compiled = USE_COMPILED and not _compiled_unsupported(image.device)
if not _use_compiled:  # pytorch native grid_sample
    ...
else:  # grid_pull
    return grid_pull(...)
This prevents incorrect results on Blackwell GPUs (compute capability 12.x+) where monai._C lacks support.
Fixes #8587.
Description
NVIDIA Blackwell GPUs (compute capability 12.x, sm_120) are not included in the TORCH_CUDA_ARCH_LIST used when building the MONAI compiled C extension (monai._C). When USE_COMPILED=True, calling grid_pull on a Blackwell device produces incorrect results because the extension was not compiled for that architecture.
This PR adds a _compiled_unsupported(device) helper in monai/transforms/spatial/functional.py that detects whether the current CUDA device's compute capability is unsupported by the compiled extension (major >= 12). When detected, both spatial_resample and the Resample class fall back to the PyTorch-native affine_grid + grid_sample path, giving correct output on all architectures. The behaviour on all previously supported GPUs (sm_75 through sm_90) is unchanged.
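The fallback decision described above can be exercised without a GPU using a pure-Python sketch. The names `compiled_unsupported`, `choose_backend` and `UNSUPPORTED_MAJOR`, and the `(device_type, capability)` signature, are illustrative assumptions; the PR's actual helper takes a `torch.device`.

```python
from typing import Optional, Tuple

# Assumption: first compute-capability major version without compiled support.
UNSUPPORTED_MAJOR = 12


def compiled_unsupported(device_type: str, capability: Optional[Tuple[int, int]]) -> bool:
    """Pure-Python stand-in for the PR's helper (hypothetical signature)."""
    # Non-CUDA devices are never flagged as unsupported.
    if device_type != "cuda" or capability is None:
        return False
    major, _minor = capability
    return major >= UNSUPPORTED_MAJOR


def choose_backend(
    use_compiled: bool, device_type: str, capability: Optional[Tuple[int, int]] = None
) -> str:
    """Return the name of the resampling path that would be taken."""
    if use_compiled and not compiled_unsupported(device_type, capability):
        return "grid_pull"
    return "grid_sample"


if __name__ == "__main__":
    print(choose_backend(True, "cuda", (9, 0)))   # Hopper keeps the compiled path
    print(choose_backend(True, "cuda", (12, 0)))  # Blackwell falls back
```

In real code the capability tuple could be obtained from `torch.cuda.get_device_capability(device)`, which returns `(major, minor)`.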
Types of changes
Non-breaking change (fix or new feature that would not break existing functionality).
Breaking change (fix or new feature that would cause existing functionality to change).
New tests added to cover the changes.
Integration tests passed locally by running ./runtests.sh -f -u --net --coverage.
Quick tests passed locally by running ./runtests.sh --quick --unittests --disttests.
In-line docstrings updated.
Documentation updated, tested make html command in the docs/ folder.