NVIDIA Open GPU Kernel Modules Version
a9284ecf7ab29e599e96de82168484728627eb7e06727467053719b785401e0a /root/xconn/wade/open-gpu-kernel-modules/kernel-open/nvidia.ko
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
Operating System and Version
Description: Ubuntu 22.04.5 LTS
Kernel Release
Linux h3 6.8.0-78-generic #78~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Wed Aug 13 14:32:06 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
Hardware: GPU
After inputting"nvidia-smi -L", terminal hangs.
Describe the bug
I have one AMD Turin server with two NVIDIA A10 cards installed.
The server recognizes both A10 cards.
lspci:
+-[0000:c0]-+-00.0  Advanced Micro Devices, Inc. [AMD] Turin Root Complex
|           +-00.3  Advanced Micro Devices, Inc. [AMD] Turin RCEC
|           +-01.0  Advanced Micro Devices, Inc. [AMD] Turin PCIe Dummy Host Bridge
|           +-01.1-[c1-c4]----00.0-[c2-c4]--+-07.0-[c3]----00.0  NVIDIA Corporation GA102GL [A10]
|           |                               \-0a.0-[c4]----00.0  NVIDIA Corporation GA102GL [A10]
|           \-02.0  Advanced Micro Devices, Inc. [AMD] Turin PCIe Dummy Host Bridge
nvidia-smi shows the following error message.
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.
dmesg shows the following error messages.
[ 271.395530] nvidia-nvlink: Nvlink Core is being initialized, major device number 510
[ 271.395537] NVRM: GPU 0000:c3:00.0 is already bound to nouveau.
[ 271.397851] NVRM: GPU 0000:c4:00.0 is already bound to nouveau.
[ 271.397920] NVRM: The NVIDIA probe routine was not called for 2 device(s).
[ 271.397921] NVRM: This can occur when another driver was loaded and
NVRM: obtained ownership of the NVIDIA device(s).
[ 271.397922] NVRM: Try unloading the conflicting kernel module (and/or
NVRM: reconfigure your kernel without the conflicting
NVRM: driver(s)), then try loading the NVIDIA kernel module
NVRM: again.
[ 271.397922] NVRM: No NVIDIA devices probed.
[ 271.398524] nvidia-nvlink: Unregistered Nvlink Core, major device number 510
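For reference, the NVRM messages above suggest unloading whatever driver already owns the devices. On Ubuntu that would roughly look like the sketch below (only an illustration of what the message suggests; the blacklist file name is my own choice and I have not confirmed this resolves the issue):

# Keep nouveau from binding the A10s (file name is arbitrary)
echo -e "blacklist nouveau\noptions nouveau modeset=0" | sudo tee /etc/modprobe.d/blacklist-nouveau.conf
sudo update-initramfs -u
sudo reboot
# Or, without rebooting, try unloading nouveau and reloading the open module
sudo rmmod nouveau
sudo modprobe nvidia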
To Reproduce
Steps to reproduce:
- Power on the AMD Turin server
- lspci -vt  # confirm the server recognizes the two NVIDIA A10 cards (see the driver-binding check after this list)
- nvidia-smi
- dmesg | tail -n 50
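To see which kernel driver actually claimed each A10, the following check can be added (a small sketch; 10de is the NVIDIA PCI vendor ID):

lspci -nnk -d 10de:   # prints each NVIDIA device together with "Kernel driver in use: ..."
# In the failing state this should report nouveau for 0000:c3:00.0 and 0000:c4:00.0, matching the NVRM messages above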
Bug Incidence
Always
nvidia-bug-report.log.gz
nvidia-bug-report.log.gz
More Info
I ran the same test on the AMD Turin server with two RTX 5090 cards installed, and everything worked, as shown below.
- nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.44.01 Driver Version: 590.44.01 CUDA Version: 13.1 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 5090 Off | 00000000:21:00.0 Off | N/A |
| 0% 28C P8 12W / 600W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 5090 Off | 00000000:C1:00.0 Off | N/A |
| 0% 27C P8 14W / 600W | 2MiB / 32607MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
- p2pBandwidthLatencyTest
[P2P (Peer-to-Peer) GPU Bandwidth Latency Test]
Device: 0, NVIDIA GeForce RTX 5090, pciBusID: c3, pciDeviceID: 0, pciDomainID:0
Device: 1, NVIDIA GeForce RTX 5090, pciBusID: c4, pciDeviceID: 0, pciDomainID:0
Device=0 CAN Access Peer Device=1
Device=1 CAN Access Peer Device=0
***NOTE: In case a device doesn't have P2P access to other one, it falls back to normal memcopy procedure.
So you can see lesser Bandwidth (GB/s) and unstable Latency (us) in those cases.
P2P Connectivity Matrix
D\D 0 1
0 1 1
1 1 1
Unidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1522.95 11.47
1 11.46 1556.27
Unidirectional P2P=Enabled Bandwidth (P2P Writes) Matrix (GB/s)
D\D 0 1
0 1525.93 57.19
1 57.19 1547.03
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1527.32 11.57
1 11.46 1540.12
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1
0 1528.07 112.29
1 112.34 1538.58
P2P=Disabled Latency Matrix (us)
GPU 0 1
0 2.09 15.43
1 15.42 2.08
CPU 0 1
0 2.27 6.21
1 6.21 2.24
P2P=Enabled Latency (P2P Writes) Matrix (us)
GPU 0 1
0 2.07 0.37
1 0.45 2.07
CPU 0 1
0 2.28 1.58
1 1.59 2.23
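For completeness, the p2pBandwidthLatencyTest above is the sample from NVIDIA's cuda-samples repository (https://github.com/NVIDIA/cuda-samples). A rough sketch of how it can be built and run is shown below; the directory layout and build system differ between releases, so treat the path as an assumption:

git clone https://github.com/NVIDIA/cuda-samples.git
cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest   # path varies by release
make                                                                # newer releases use CMake instead
./p2pBandwidthLatencyTest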