Error: ImagePullBackOff #2016

@AbhijithMallya

Description

I am trying to install the NVIDIA GPU Operator in my self-managed Kubernetes cluster.

helm install --wait --generate-name \
    -n gpu-operator --create-namespace \
    nvidia/gpu-operator \
    --version=v25.10.1

The operator pod fails to start and stays in ImagePullBackOff:

abhijithmallya@abhijithmallya-ROG-Strix-G512LI-G512LI:~$ kubectl get pods -n gpu-operator 
NAME                                                              READY   STATUS             RESTARTS   AGE
gpu-operator-1767031327-node-feature-discovery-gc-6fff94bftgpjn   1/1     Running            0          6m7s
gpu-operator-1767031327-node-feature-discovery-master-55b6z7f4g   1/1     Running            0          6m7s
gpu-operator-1767031327-node-feature-discovery-worker-dslpc       1/1     Running            0          6m7s
gpu-operator-6996bfc8df-82c66                                     0/1     ImagePullBackOff   0          6m7s

Describing the failing pod shows that the image pull is rejected with 403 Forbidden:

abhijithmallya@abhijithmallya-ROG-Strix-G512LI-G512LI:~$ kubectl describe pod -n gpu-operator  gpu-operator-6996bfc8df-82c66 
Name:                 gpu-operator-6996bfc8df-82c66
Namespace:            gpu-operator
Priority:             2000001000
Priority Class Name:  system-node-critical
Service Account:      gpu-operator
Node:                 abhijithmallya-rog-strix-g512li-g512li/192.168.1.16
Start Time:           Mon, 29 Dec 2025 23:32:11 +0530
Labels:               app=gpu-operator
                      app.kubernetes.io/component=gpu-operator
                      app.kubernetes.io/instance=gpu-operator-1767031327
                      app.kubernetes.io/managed-by=Helm
                      app.kubernetes.io/name=gpu-operator
                      app.kubernetes.io/version=v24.9.2
                      helm.sh/chart=gpu-operator-v24.9.2
                      nvidia.com/gpu-driver-upgrade-drain.skip=true
                      pod-template-hash=6996bfc8df
Annotations:          cni.projectcalico.org/containerID: 9998b0f74cef936f7d7e95951d4011cb621420e046a9ea46e32d51a16d94a41a
                      cni.projectcalico.org/podIP: 172.16.111.146/32
                      cni.projectcalico.org/podIPs: 172.16.111.146/32
                      openshift.io/scc: restricted-readonly
Status:               Pending
IP:                   172.16.111.146
IPs:
  IP:           172.16.111.146
Controlled By:  ReplicaSet/gpu-operator-6996bfc8df
Containers:
  gpu-operator:
    Container ID:  
    Image:         nvcr.io/nvidia/gpu-operator:v24.9.2
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      gpu-operator
    Args:
      --leader-elect
      --zap-time-encoding=epoch
      --zap-log-level=info
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     500m
      memory:  350Mi
    Requests:
      cpu:      200m
      memory:   100Mi
    Liveness:   http-get http://:8081/healthz delay=15s timeout=1s period=20s #success=1 #failure=3
    Readiness:  http-get http://:8081/readyz delay=5s timeout=1s period=10s #success=1 #failure=3
    Environment:
      WATCH_NAMESPACE:       
      OPERATOR_NAMESPACE:    gpu-operator (v1:metadata.namespace)
      DRIVER_MANAGER_IMAGE:  nvcr.io/nvidia/cloud-native/k8s-driver-manager:v0.7.0
    Mounts:
      /host-etc/os-release from host-os-release (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-dtskt (ro)
Conditions:
  Type                        Status
  PodReadyToStartContainers   True 
  Initialized                 True 
  Ready                       False 
  ContainersReady             False 
  PodScheduled                True 
Volumes:
  host-os-release:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/os-release
    HostPathType:  
  kube-api-access-dtskt:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    Optional:                false
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node-role.kubernetes.io/control-plane:NoSchedule
                             node-role.kubernetes.io/master:NoSchedule
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  6m35s                  default-scheduler  Successfully assigned gpu-operator/gpu-operator-6996bfc8df-82c66 to abhijithmallya-rog-strix-g512li-g512li
  Normal   Pulling    2m48s (x5 over 6m35s)  kubelet            Pulling image "nvcr.io/nvidia/gpu-operator:v24.9.2"
  Warning  Failed     2m45s (x5 over 5m59s)  kubelet            Failed to pull image "nvcr.io/nvidia/gpu-operator:v24.9.2": unable to pull image or OCI artifact: pull image err: initializing source docker://nvcr.io/nvidia/gpu-operator:v24.9.2: Requesting bearer token: received unexpected HTTP status: 403 Forbidden; artifact err: pull artifact: initializing source docker://nvcr.io/nvidia/gpu-operator:v24.9.2: Requesting bearer token: received unexpected HTTP status: 403 Forbidden
  Warning  Failed     2m45s (x5 over 5m59s)  kubelet            Error: ErrImagePull
  Warning  Failed     52s (x20 over 5m59s)   kubelet            Error: ImagePullBackOff
  Normal   BackOff    38s (x21 over 5m59s)   kubelet            Back-off pulling image "nvcr.io/nvidia/gpu-operator:v24.9.2"
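
One thing that stands out in the output above: the helm command requests chart version v25.10.1, but the pod labels and the image tag both report v24.9.2. Below is a minimal sketch for comparing the two values, assuming the image reference is copied by hand from the describe output. The helper name `image_tag` is hypothetical and not part of kubectl or helm, and it assumes a plain `repo:tag` reference (no digest, and a tag is present).

```shell
#!/bin/sh
# Hypothetical helper: print the tag portion of an image reference.
# Assumes the reference ends in ":<tag>" (no digest, tag always present).
image_tag() {
    # Remove the longest prefix ending in ':' so only the tag remains.
    printf '%s\n' "${1##*:}"
}

# Version passed to `helm install --version=...` in this report.
requested="v25.10.1"
# Image reference taken from the `kubectl describe pod` output.
running="$(image_tag "nvcr.io/nvidia/gpu-operator:v24.9.2")"

if [ "$requested" != "$running" ]; then
    echo "version mismatch: requested $requested, pod uses $running"
fi
```

If the versions differ, as they do here, it may be worth checking which release actually owns the Deployment, for example with `helm list -n gpu-operator`, before looking further into the 403 from nvcr.io.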

====

Please help me resolve this issue.
