Conversation
There was a problem hiding this comment.
Pull request overview
Improves ROCm GPU power monitoring robustness by handling unsupported power fields returned by amdsmi_get_power_info() on newer hardware (e.g., MI300X) and normalizing power_limit units so monitoring doesn’t crash and reports correct watts.
Changes:
- Update ROCm
get_device_power()to fall back fromaverage_socket_powertocurrent_socket_powerwhen the former is unsupported ('N/A'). - Update ROCm
get_device_power_limit()to treatpower_limitas microwatts (when applicable) and convert to watts.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (0.00%) is below the target coverage (80.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #794 +/- ##
==========================================
- Coverage 85.69% 83.18% -2.51%
==========================================
Files 103 103
Lines 7890 7900 +10
==========================================
- Hits 6761 6572 -189
- Misses 1129 1328 +199
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
- Use numbers.Real instead of (int, float) so numeric types like numpy scalars and ctypes ints are accepted, while still rejecting bool. - Explicitly check for missing keys in amdsmi power_info and log a warning so monitoring failures are diagnosable instead of silently returning None. - Replace the magic threshold 100000 with named constants _AMDSMI_MICROWATTS_PER_WATT and _AMDSMI_MICROWATTS_THRESHOLD, with a clearer comment explaining the µW vs W detection heuristic. - Add unit tests covering: average_socket_power supported, fallback to current_socket_power on 'N/A', both unsupported -> None, missing keys -> None, power_limit µW->W conversion, watts passthrough, non-numeric power_limit -> None, and missing power_limit key -> None.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
tests/common/test_device_manager.py:31
test_nvidia_helper_utilsis only guarded by@decorator.cuda_test(SB_TEST_CUDA), but it will still run (and fail) on hosts without NVIDIA GPUs (e.g., AMD-only or CPU-only), becausedm.device_managerwill not be aNvidiaDeviceManagerand methods likeget_device_compute_capability()returnNone. Gate this test on GPU vendor/availability (or split into CUDA vs ROCm tests using the existingrocm_testdecorator).
@decorator.cuda_test
@mock.patch('superbench.common.utils.process.run_command')
def test_nvidia_helper_utils(mock_run_command):
"""Test util functions of nvidia_helper."""
assert (isinstance(dm.device_manager.get_device_count(), numbers.Number))
assert (isinstance(dm.device_manager.get_device_compute_capability(), numbers.Number))
assert (isinstance(dm.device_manager.get_device_utilization(0), numbers.Number))
assert (isinstance(dm.device_manager.get_device_temperature(0), numbers.Number))
assert (isinstance(dm.device_manager.get_device_power_limit(0), numbers.Number))
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Source: - Extract _amdsmi_power_to_watts helper that applies the microwatts->watts heuristic uniformly. This makes get_device_power and get_device_power_limit symmetric: previously the limit was µW-corrected but the live power was not, so on amdsmi versions reporting power in µW the monitored gpu_power could be off from gpu_power_limit by a factor of 1e6. - Make AmdDeviceManager.__del__ defensive: swallow exceptions from amdsmi_shut_down so interpreter-shutdown / partial-import scenarios cannot raise (e.g., NameError when the module global has been torn down). This is the proper place for the fix that the test was previously working around via sys.modules injection. Tests: - Drop the sys.modules-level rocml injection (no longer needed once __del__ is safe). - Add test_amd_get_device_power_microwatts_converted to lock in the symmetric unit handling for get_device_power. - Add test_amd_device_manager_lifecycle exercising __init__/__del__ through normal construction and verifying __del__ tolerates a raising amdsmi_shut_down.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| with mock.patch(f'{_DM_MODULE}.rocml', rocml_mock, create=True): | ||
| assert manager.get_device_power(0) == 350 | ||
| assert manager.get_device_power_limit(0) == 750 | ||
|
|
| _DM_MODULE = 'superbench.common.utils.device_manager' | ||
|
|
| @@ -3,6 +3,7 @@ | |||
|
|
|||
| """Device Managerment Library Utility.""" | |||
Description
GPU monitor crashes with invalid literal for int() with base 10: 'N/A' — amdsmi_get_power_info() returns 'N/A' for average_socket_power on MI300X (hardware reports 0xFFFF for unsupported fields). The
code called int() on this value without validation. Additionally, power_limit is returned in microwatts but was treated as watts.
Changes
get_device_power(): Fall back to current_socket_power when average_socket_power is 'N/A'. get_device_power_limit(): Convert microwatts to watts when needed.