Open
Description
I am trying to install the NVIDIA GPU Operator in my self-managed Kubernetes cluster:
helm install --wait --generate-name \
-n gpu-operator --create-namespace \
nvidia/gpu-operator \
--version=v25.10.1
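Note that the pod description later in this report shows chart and image version v24.9.2 rather than the requested v25.10.1, so it may be worth confirming which release actually produced the failing pod. A minimal sketch of that comparison follows. The helper names `image_tag` and `check_version` are hypothetical, and the image string is copied from the describe output in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the tag portion of an image reference.
# Assumes the reference carries a tag and no digest or registry port.
image_tag() {
  local ref="$1"
  # Strip everything up to and including the last ':'.
  printf '%s\n' "${ref##*:}"
}

# Hypothetical helper: compare the requested chart version with the
# tag of the image the pod is trying to pull.
check_version() {
  local requested="$1" image="$2" actual
  actual="$(image_tag "$image")"
  if [ "$actual" = "$requested" ]; then
    echo "match: $actual"
  else
    echo "mismatch: requested $requested, pod runs $actual"
  fi
}

check_version "v25.10.1" "nvcr.io/nvidia/gpu-operator:v24.9.2"
```

Against a live cluster, `helm list -n gpu-operator` shows the chart version of each installed release, and `helm repo update` refreshes the local chart index before a reinstall.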
Pod creation fails: the gpu-operator pod is stuck in ImagePullBackOff.
abhijithmallya@abhijithmallya-ROG-Strix-G512LI-G512LI:~$ kubectl get pods -n gpu-operator
NAME                                                              READY   STATUS             RESTARTS   AGE
gpu-operator-1767031327-node-feature-discovery-gc-6fff94bftgpjn   1/1     Running            0          6m7s
gpu-operator-1767031327-node-feature-discovery-master-55b6z7f4g   1/1     Running            0          6m7s
gpu-operator-1767031327-node-feature-discovery-worker-dslpc       1/1     Running            0          6m7s
gpu-operator-6996bfc8df-82c66                                     0/1     ImagePullBackOff   0          6m7s
Describing the failing pod:
abhijithmallya@abhijithmallya-ROG-Strix-G512LI-G512LI:~$ kubectl describe pod -n gpu-operator gpu-operator-6996bfc8df-82c66
Name:                 gpu-operator-6996bfc8df-82c66
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 abhijithmallya-rog-strix-g512li-g512li/192.168.1.16
Start Time:           Mon, 29 Dec 2025 23:32:11 +0530
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator-1767031327
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v24.9.2
                      helm.sh/chart=gpu-operator-v24.9.2
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=6996bfc8df
Annotations:          cni.projectcalico.org/containerID: 9998b0f74cef936f7d7e95951d4011cb621420e046a9ea46e32d51a16d94a41a
                      cni.projectcalico.org/podIP: 172.16.111.146/32
                      cni.projectcalico.org/podIPs: 172.16.111.146/32
                      openshift.io/scc: restricted-readonly
Status:               Pending
IP:                   172.16.111.146
IPs:
  IP:           172.16.111.146
Controlled By:  ReplicaSet/gpu-operator-6996bfc8df
Containers:
  gpu-operator:
    Container ID:
    Image:         nvcr.io/nvidia/gpu-operator:v24.9.2
    Image ID:
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
      --zap-time-encoding=epoch
      --zap-log-level=info
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:
      OPERATOR_NAMESPACE:    gpu-operator (v1:metadata.namespace)
      DRIVER_MANAGER_IMAGE:  nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.7.0
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtskt (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True
  Initialized                 True
  Ready                       False
  ContainersReady             False
  PodScheduled                True
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:
  kube-api-access-dtskt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m35s                  default-scheduler  Successfully assigned gpu-operator/gpu-operator-6996bfc8df-82c66 to abhijithmallya-rog-strix-g512li-g512li
  Normal   Pulling    2m48s (x5 over 6m35s)  kubelet            Pulling image "nvcr.io/nvidia/gpu-operator:v24.9.2"
  Warning  Failed     2m45s (x5 over 5m59s)  kubelet            Failed to pull image "nvcr.io/nvidia/gpu-operator:v24.9.2": unable to pull image or OCI artifact: pull image err: initializing source docker://nvcr.io/nvidia/gpu-operator:v24.9.2: Requesting bearer token: received unexpected HTTP status: 403 Forbidden; artifact err: pull artifact: initializing source docker://nvcr.io/nvidia/gpu-operator:v24.9.2: Requesting bearer token: received unexpected HTTP status: 403 Forbidden
  Warning  Failed     2m45s (x5 over 5m59s)  kubelet            Error: ErrImagePull
  Warning  Failed     52s (x20 over 5m59s)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    38s (x21 over 5m59s)   kubelet            Back-off pulling image "nvcr.io/nvidia/gpu-operator:v24.9.2"
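The 403 is returned while the runtime requests a bearer token, that is, before any image layer is downloaded. In the registry token flow, an unauthenticated request to `/v2/` returns a `WWW-Authenticate` header naming the token realm, and the client then asks that realm for a token. The sketch below reproduces those two steps with `curl` from the node, so the failure can be examined outside the kubelet. The helper names `auth_param` and `probe_registry` are hypothetical, the scope string is derived from the image name in the events above, and the header nvcr.io actually returns is not shown in this report, so its exact parameters are an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract one key="value" parameter from a
# "WWW-Authenticate: Bearer ..." header line read on stdin.
auth_param() {
  local key="$1"
  sed -n "s/.*[ ,]${key}=\"\([^\"]*\)\".*/\1/p"
}

# Hypothetical helper:
#   step 1: ask the registry which token endpoint (realm) it expects;
#   step 2: request an anonymous pull token and print only the HTTP status.
# If the header carries no service parameter, an empty one is sent.
probe_registry() {
  local registry="$1" repo="$2" header realm service
  header="$(curl -sI "https://${registry}/v2/" | tr -d '\r' | grep -i '^www-authenticate:')"
  realm="$(printf '%s\n' "$header" | auth_param realm)"
  service="$(printf '%s\n' "$header" | auth_param service)"
  echo "realm=${realm} service=${service}"
  curl -s -o /dev/null -w '%{http_code}\n' -G "$realm" \
    --data-urlencode "service=${service}" \
    --data-urlencode "scope=repository:${repo}:pull"
}

# Run on the affected node (needs network access):
#   probe_registry nvcr.io nvidia/gpu-operator
```

If the anonymous request succeeds here while the kubelet still gets 403, one possibility is that the container runtime is sending stale nvcr.io credentials from an auth file it reads. If the request fails here too, the registry may be rejecting requests from the node's network. Both are hypotheses to test, not conclusions.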
Please help me resolve this issue.