diff --git a/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/1_Understanding_AI_Workloads_for_Kubernetes_Scheduling.md b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/1_Understanding_AI_Workloads_for_Kubernetes_Scheduling.md new file mode 100644 index 000000000..ef785ab8f --- /dev/null +++ b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/1_Understanding_AI_Workloads_for_Kubernetes_Scheduling.md @@ -0,0 +1,202 @@ +# Understanding AI Workloads for Kubernetes Scheduling + +**Authors (in alphabetical order):** + +Abhishek Malvankar, Alexander Scammon, Andrey Velichkevich, Ekin Karabulut, Kevin Hannon, Marlow Warnicke, Rajas Kakodkar, Sabrina DiLeva, Victor Lu, Yuan Tang + +**Contributors (in alphabetical order):** + +Adel Zaalouk, Andrew Shunichi Aikawa, Andrey Velichkevich, Boris Kurktchiev, Brian Redmond, Cathy Zhang, Claudia Misale, Joel Roberts, Johnu George, Josh Halley, Kay Yan, Malini Bhandaru, Michael Yao, Mike SHNg, Naadir Jeewa, Niki Manoledaki, Ricardo Aravena, Ronald Petty, Sachi Desai, Shravan Achar, Sudhanshu Prajapati, Victor Lu, Vijay Rodrigues + +# Series Introduction + +This paper is the first in a five-part series on scheduling for cloud native AI workloads: + +1. **Understanding AI Workloads for Kubernetes Scheduling (this paper):** What AI workloads are and why they differ from traditional cloud-native applications + +2. Scheduling Fundamentals for AI Workloads: How Kubernetes scheduling works and why additional layers are needed + +3. Job Orchestration Challenges for AI Workloads: Gang scheduling, fairness, queues, preemption, priority, and reservation + +4. Resource and Infrastructure Challenges for AI Workloads: Topology, GPU sharing, scalability, I/O, fault tolerance, and cost + +5. Solutions and Practical Guidance for AI Workload Scheduling: Tools, reference architectures, and real-world use cases + +Each paper can be read independently, but together they provide a comprehensive guide to running AI workloads on Kubernetes. + +# Executive Summary + +AI and machine learning workloads have fundamentally different characteristics than the stateless microservices Kubernetes was originally designed for. Training jobs consume expensive GPUs for days or weeks, require all workers to start simultaneously, and are sensitive to hardware topology. Inference workloads must maintain low latency while managing substantial state in GPU memory. This paper explains the AI/ML lifecycle, examines the resource requirements of each stage, and identifies why these workloads present unique scheduling challenges that traditional Kubernetes scheduling cannot address. + +# Introduction + +## Purpose and Scope + +This paper provides the foundational understanding of AI workloads necessary to address their scheduling requirements on Kubernetes. + +What this paper does: + +* Explains the stages of the AI/ML lifecycle and their resource characteristics +* Identifies what makes AI workloads different from traditional cloud-native applications +* Examines the distinct scheduling needs of training and inference workloads + +Who this paper is for: + +* Platform engineers building infrastructure for ML teams +* ML engineers who need to understand the infrastructure layer +* Infrastructure teams evaluating scheduling solutions for AI workloads + +What this paper assumes: + +* Basic familiarity with Kubernetes concepts (pods, nodes, deployments) +* No deep expertise in machine learning algorithms is required + +## Context + +Kubernetes has become the standard platform for deploying containerized applications. Its default scheduler excels at placing stateless microservices: it finds a node with sufficient CPU and memory, starts the pod, and moves on. If a pod fails, the system restarts it. If demand increases, horizontal autoscaling adds more pods. This model works well for web servers, APIs, and similar workloads. + +AI workloads break these assumptions. Understanding why requires examining the AI/ML lifecycle in detail. + +**Suggested pre-reading:** + +If you are new to Kubernetes scheduling, you may find these resources helpful: + +* [Kubernetes Scheduler documentation](https://kubernetes.io/docs/concepts/scheduling-eviction/kube-scheduler/) +* [Pod Priority and Preemption](https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/) +* [Resource Management for Pods and Containers](https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/) +* [Dynamic Resource Allocation](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) + +# Understanding AI Workloads + +## The AI/ML Lifecycle + +AI and machine learning projects follow a lifecycle with distinct stages. Each stage has different resource requirements and scheduling characteristics. + +### Data Preparation + +Data preparation transforms raw data into a format suitable for model training. This includes: + +* Data collection: Gathering data from various sources (databases, APIs, file systems) +* Data cleaning: Removing duplicates, handling missing values, fixing inconsistencies +* Data transformation: Normalizing, encoding categorical variables, feature scaling +* Data splitting: Dividing data into training, validation, and test sets + +From a scheduling perspective, data preparation is typically CPU and I/O intensive rather than GPU-intensive. That said, GPU-accelerated frameworks can significantly speed up large-scale data processing tasks such as filtering, joining, and aggregating datasets. Jobs are often parallelizable—you can clean different partitions of a dataset independently. Event-driven scheduling is common: new data arriving triggers a preparation pipeline. + +Kubernetes resources like Jobs and CronJobs handle these workloads reasonably well. Workflow orchestrators (Airflow, Argo Workflows, Flyte) coordinate multi-step pipelines. + +### Model Development + +Model development has two distinct activities that are often combined: + +* **Feature engineering** transforms prepared data into input features the model can use. This involves creating new variables, encoding categorical data, and selecting which features to include. Feature engineering is computationally similar to data preparation—CPU and I/O bound, parallelizable, often triggered by new data. +* **Model architecture** involves selecting the type of model (linear regression, decision tree, neural network, transformer) and designing its structure. For deep learning, this means defining layers, attention mechanisms, and other architectural choices. This work is often interactive—a data scientist experimenting in a notebook—and does not require significant compute resources until training begins. + +The scheduling needs for model development are modest. Interactive notebook environments need stable, long-running pods. Feature engineering jobs look like data preparation jobs. The heavy resource demands come in the next stage. + +### Model Training + +Model training is the process of exposing a model to data and iteratively adjusting its parameters (weights and biases) so it learns patterns that generalize to new inputs. + +Training requirements vary enormously depending on the model: + +* Training a linear regression model takes seconds on a laptop +* Training a random forest might take minutes to hours on a single machine +* Training a large language model takes weeks to months on thousands of GPUs + +For deep learning at scale, training is distributed across multiple machines. Workers communicate using collective operations like all-reduce, which requires all participants to synchronize. This creates the gang scheduling requirement: if you need 64 workers and only 60 are available, the job cannot start. If one worker fails mid-training, the entire job may need to restart from a checkpoint. + +Training jobs are: + +* Long-running: Days to months for large models +* Resource-intensive: Hundreds to thousands of GPUs +* Tightly coupled: All workers must run simultaneously +* Sensitive to topology: Communication speed depends on GPU interconnects + +The default Kubernetes scheduler cannot handle these requirements. It will start pods as resources become available, potentially leaving a job stuck with partial resources indefinitely. + +### Model Inference + +After training, the model is deployed to make predictions on new data. Inference has two common patterns: + +* **Batch inference** processes large datasets offline. For example, running a trained model over satellite imagery for an entire continent, or scoring millions of records for a recommendation system. Batch inference looks like a large parallel job—it can use many GPUs, but individual tasks are independent. Gang scheduling is typically not required. +* **Real-time inference** responds to individual requests with low latency. A user sends a query; the model returns a prediction in milliseconds. Real-time inference requires: + * Models preloaded in GPU memory (loading a model can take longer than serving a request) + * Horizontal scaling to handle variable request rates + * Low-latency networking + +Resource requirements per request are often lower than those of large-scale training, but real-time inference creates its own scheduling challenges: keeping models loaded, autoscaling with demand to meet SLOs (Service Level Objectives), and handling multiple base models competing for GPU’s. + +### Emerging Patterns + +* **Agentic pipelines** chain multiple model calls together, with models making decisions about what to do next. These create dynamic, unpredictable workloads that are difficult to schedule in advance. +* **Disaggregated inference** separates LLM serving into distinct phases; typically prefill (processing the input prompt) and decode (generating output tokens) that run on different hardware. Prefill is compute-intensive; decode is memory-bandwidth bound. This architecture enables independent optimization and scaling of each phase but introduces new scheduling requirements: KV cache (a data structure that stores intermediate computations from the prefill phase so they don't need to be recalculated during decoding) must transfer between phases, components need coordinated topology aware placement, and the system must maintain minimum viable combinations while scaling each component type independently. + +## What Makes AI Workloads Different + +Traditional cloud-native applications—web servers, APIs, microservices—share common characteristics that Kubernetes handles well: + +* Stateless or easily replicated state +* Short request/response cycles +* Horizontal scaling by adding more pods +* Graceful degradation under load (serve some requests slower, drop others) +* Pods are fungible; any pod can handle any request + +AI workloads break these assumptions: + +* **Long-running jobs.** A training run is not a request that completes in milliseconds. It is a job that runs for days or weeks. Interrupting it wastes all the work done since the last checkpoint. The scheduler must account for job duration, not just instantaneous resource needs. +* **Massive resource consumption.** Training large models requires hundreds or thousands of GPUs running simultaneously. A single job can consume the majority of a cluster's capacity for extended periods. This is not "scale horizontally by adding pods"—it is "reserve a large fraction of the cluster for one workload." +* **Tightly coupled distribution.** Distributed training uses collective communication patterns where all workers must participate. You cannot start with 7 of 8 workers and add the 8th later. You cannot lose one worker and continue with the remaining 7\. Either all workers are running, or the job cannot proceed. This is fundamentally different from web services, where losing one replica just shifts load to the others. +* **A large amount of state.** AI inference workloads answer user queries by creating and maintaining substantial state in GPU memory. This state management is key to consider when scaling AI inference server pods up or down while maintaining latency and SLOs. +* **GPU memory constraints.** LLM models consume all available GPU memory. A model that requires 80GB of VRAM cannot share a GPU with another workload—there is no memory left. This makes GPU sharing difficult. Loading and unloading models is slow (tens of GB transferred from storage), so swapping models for different requests is impractical for real-time inference. +* **I/O patterns.** AI workloads have distinct I/O phases: + * Data loading at the start of training can take 10-70% of total execution time (per research from [Google](https://arxiv.org/pdf/2101.12127.pdf) and [Microsoft](https://www.cs.utexas.edu/~vijay/papers/vldb21-datastalls.pdf)), leaving GPUs idle + * Checkpointing periodically saves model state to storage, creating bursty write patterns + * Model loading for inference transfers large model files from storage to GPU memory before serving can begin +* **Topology sensitivity.** Performance depends on which specific GPUs are allocated and how they are connected. Two GPUs on the same node with high-speed interconnects can communicate at hundreds of GB/s. Two GPUs on different nodes communicate over the network at 10-100 GB/s. For workloads dominated by communication (like distributed training with frequent synchronization), this difference determines whether a job takes hours or days. +* **Latency requirements:** Some jobs have high latency tolerances and others, such as inference, have much lower latency tolerances, as there is an end user waiting for an answer + +## Resource Characteristics by Lifecycle Stage + +The following table summarizes the resource profile of each lifecycle stage: + +| Stage | Primary Resources | Duration | Scheduling Characteristics | +| :---- | :---- | :---- | :---- | +| Data Preparation | CPU, storage I/O, network | Minutes to hours | Parallelizable, event-driven, no gang requirement | +| Feature Engineering | CPU, memory, storage I/O | Minutes to hours | Similar to data preparation | +| Model Development | CPU (notebooks), minimal GPU for experiments | Interactive sessions | Long-running pods, modest resources | +| Training (small models) | 1-8 GPUs | Hours to days | Standard job scheduling | +| Training (large models) | 100s-1000s of GPUs | Days to months | Gang scheduling, topology awareness, checkpointing | +| Fine-tuning | 1-64 GPUs | Hours to days | Similar to small model training | +| Batch Inference | Variable GPU count | Hours to days | Parallelizable, throughput-oriented | +| Real-time Inference | GPUs with models preloaded | Continuous | Low latency, autoscaling, model serving | + +The key insight: different stages need different scheduling strategies. A cluster running the full ML lifecycle must handle event-driven pipelines, interactive notebooks, gang-scheduled training, and latency-sensitive inference—often simultaneously, competing for the same GPU resources. + +Of these stages, training and real-time inference present the most significant scheduling challenges and the most significant conflicts when sharing infrastructure. The following section examines their requirements in detail. + +## What Makes Training and Real Time Inference Workloads Different + +Model training and model inference represent different workload classes with distinct resource consumption patterns, scheduling needs, and optimization objectives. Infrastructure supporting both workload types requires scheduling mechanisms capable of managing both of their requirements. + +### Scheduling Needs + +Training workloads have the following scheduling needs: + +* Gang scheduling: Distributed training using collective communication (e.g., all-reduce) requires all workers to execute simultaneously. Partial allocation results in resource waste and potential deadlock. +* Topology awareness: Inter-worker communication performance depends on GPU interconnect topology. Suboptimal placement can degrade training throughput by an order of magnitude. +* Long resource commitment: Resources remain allocated for the duration of training, which may span days to weeks. + +Inference workloads impose different needs and objectives: + +* Gang scheduling: Inference workloads spanning multiple pods also require all-or-nothing scheduling. A model-parallel or disaggregated deployment is non-functional if only some components are scheduled. + * Hierarchical gang scheduling: Disaggregated inference adds complexity. The system must guarantee minimum viable component combinations (e.g., at least one prefill and one decode worker) while allowing independent scaling. Prefill scales with prompt length and request rate; decode scales with output length and concurrent sessions. Traditional gang scheduling conflicts with this need by forcing all components to scale together. +* Topology awareness: Multi-pod inference deployments (whether model-parallel or disaggregated) require low-latency interconnects between pods. This topology must be preserved across when new versions are deployed. +* SLO and availability: SLOs specify response times in milliseconds. Unlike batch jobs, real-time inference services run continuously with no completion state. Resources must remain available while meeting latency and throughput targets. +* Scaling: Inference scaling spans multiple layers: autoscalers monitor request metrics and adjust replica counts; cluster schedulers handle pod placement; inference request routers distribute requests within the fleet. Effective scaling requires coordination across all three. + * Multi-level autoscaling: Disaggregated inference requires autoscaling at multiple levels; individual components (prefill workers for traffic spikes), related component groups (prefill leaders with their workers), and entire service replicas for overall capacity. These levels affect one another: scaling prefill workers may require more decode capacity, and new service replicas need proper component ratios. +* Cold start tradeoff: Pre-deployed replicas guarantee low latency but consume resources when idle. Reactive scaling is efficient but incurs cold start latency (seconds to minutes) while model weights load. Accelerated model loading (fast storage, model streaming) reduces this penalty. The choice depends on SLO requirements and cost tolerance. + +# What's Next + +This paper established what AI workloads are and why they differ from traditional cloud-native applications. The next paper in this series, **Scheduling Fundamentals for AI Workloads**, examines how Kubernetes scheduling works, introduces the different units of scheduling for different workload types, and explains why additional layers like meta-schedulers and queue managers are needed for AI workloads. diff --git a/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/2_Scheduling_Fundamentals_for_AI_Workloads.md b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/2_Scheduling_Fundamentals_for_AI_Workloads.md new file mode 100644 index 000000000..c3731ce09 --- /dev/null +++ b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/2_Scheduling_Fundamentals_for_AI_Workloads.md @@ -0,0 +1,160 @@ +# Scheduling Fundamentals for AI Workloads + +**Authors (in alphabetical order):** + +Abhishek Malvankar, Alexander Scammon, Andrey Velichkevich, Ekin Karabulut, Kevin Hannon, Marlow Warnicke, Rajas Kakodkar, Sabrina DiLeva, Victor Lu, Yuan Tang + +**Contributors (in alphabetical order):** + +Adel Zaalouk, Andrew Shunichi Aikawa, Andrey Velichkevich, Boris Kurktchiev, Brian Redmond, Cathy Zhang, Claudia Misale, Joel Roberts, Johnu George, Josh Halley, Kay Yan, Malini Bhandaru, Michael Yao, Mike SHNg, Naadir Jeewa, Niki Manoledaki, Ricardo Aravena, Ronald Petty, Sachi Desai, Shravan Achar, Sudhanshu Prajapati, Victor Lu, Vijay Rodrigues + +# Series Introduction + +This paper is the second in a five-part series on scheduling for cloud native AI workloads: + +1. Understanding AI Workloads for Kubernetes Scheduling: What AI workloads are and why they differ from traditional cloud-native applications + +2. **Scheduling Fundamentals for AI Workloads (this paper):** How Kubernetes scheduling works and why additional layers are needed + +3. Job Orchestration Challenges for AI Workloads: Gang scheduling, fairness, queues, preemption, priority, and reservation + +4. Resource and Infrastructure Challenges for AI Workloads: Topology, GPU sharing, scalability, I/O, fault tolerance, and cost + +5. Solutions and Practical Guidance for AI Workload Scheduling: Tools, reference architectures, and real-world use cases + +Each paper can be read independently, but together they provide a comprehensive guide to running AI workloads on Kubernetes. + +# Executive Summary + +The default Kubernetes scheduler operates at the pod level, making placement decisions independently for each pod without understanding relationships between them. This model works well for stateless microservices but falls short for AI workloads that require job-level scheduling, coordinated placement, and queue management. This paper explains how Kubernetes scheduling works, contrasts it with traditional HPC scheduling, and introduces the meta-schedulers and queue managers that bridge the gap for AI workloads. Kubernetes provides a [scheduling framework](https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/) that enables these extensions. + +# Previously + +The first paper in this series, **Understanding AI Workloads for Kubernetes Scheduling**, established what AI workloads are and why they differ from traditional cloud-native applications. Key points: + +* AI workloads span multiple lifecycle stages with different resource profiles: data preparation (CPU/I/O bound), training (GPU-intensive, long-running), and inference (latency-sensitive, stateful) + +* Training jobs are tightly coupled. All workers must run simultaneously for collective communication + +* GPUs cannot be shared easily because workloads must fit entirely in GPU memory + +* Performance depends heavily on hardware topology and interconnects + +This paper builds on that foundation by examining how scheduling systems work and where they fall short for AI workloads. + +# Scheduling Fundamentals + +## Unit of Scheduling + +Different systems operate at different abstraction levels to make scheduling possible for different classes of workloads. This section makes those units of scheduling explicit for different workload classes within a single cluster. + +### **Kubernetes: Pod-Level Scheduling** + +In Kubernetes, the fundamental unit of scheduling is the **pod**. The default Kubernetes scheduler generally makes placement decisions independently for each pod, assigning it to a single node based on resource availability, constraints, and policies. + +Characteristics: + +* Scheduling decisions are per pod. +* The scheduler does **not understand relationships** between pods (e.g., that multiple pods belong to the same distributed training job). + +In Kubernetes-native scheduling, the scheduling unit is a pod within **a single Kubernetes cluster**. + +### **Traditional HPC Schedulers: Task-Level Scheduling** + +In traditional high-performance computing (HPC) environments, schedulers such as Slurm operate at the task level (also called a rank or process group member). + +Characteristics: + +* A job submission explicitly declares the number of tasks, CPUs, GPUs, and nodes required. +* Tasks are scheduled **as a single atomic unit**: either all tasks are allocated resources, or the job does not start. +* The scheduler has native awareness of job-level structure, topology requirements, and inter-task communication. + +This model naturally supports gang scheduling, topology-aware placement, and reservation-based execution. These capabilities are particularly well-suited for model training workloads. + +In HPC-style schedulers, the scheduling unit is a task within a job, with all tasks scheduled together. + +**AI Training Workloads: Job-Level Scheduling** + +For distributed AI training, the natural unit of scheduling is the **job**. Here, "job" refers to a complete unit of work in the general sense, not the specific Kubernetes Job resource. + +Characteristics: + +* Multiple workers (e.g., data-parallel or model-parallel processes, etc) +* Strong synchronization requirements across all workers +* The job cannot make progress unless **all required workers are running simultaneously** +* Partial allocation wastes resources and may block other work + +AI training scheduling refers to job-level scheduling. + +### **AI Inference: Request-Level Scheduling** + +For both real-time and batch inference, the effective scheduling unit is often an **inference request**, typically an HTTP or gRPC request. + +Characteristics: + +* The request's input and output lengths vary based on the user prompt, thereby affecting completion time. +* State management becomes an important consideration in the system +* Resources for serving requests need to be scheduled on the correct hardware that satisfies the service level objective (SLO). + +AI inference scheduling refers to the scheduling of requests to meet the target latency. + +## Distributed Scheduling in Cloud Native Systems + +Kubernetes extends scheduling from a single machine to a cluster. Understanding the division of responsibilities is key to understanding where AI workloads hit limitations. + +### The Kubernetes scheduler + +The default scheduler (kube-scheduler) has one job: assign pods to nodes. When a pod is created, the scheduler: + +1. Filters nodes that cannot run the pod (insufficient resources, taints, affinity rules) +2. Scores the remaining nodes based on criteria like resource balance +3. Binds the pod to the highest-scoring node + +The scheduler operates on individual pods. It does not understand relationships between pods—it does not know that eight pods belong to the same distributed training job and must all be scheduled together or not at all. + +## The Need for Meta-Schedulers + +The gap between what Kubernetes provides and what AI workloads require has led to the addition of extra scheduling layers. + +Why are additional layers needed? The default scheduler is designed for: + +* Services that can start with partial capacity and scale up +* Pods that are mostly independent +* Workloads where any available resources are better than none + +AI workloads need: + +* All-or-nothing scheduling (gang scheduling) +* Coordinated placement across multiple pods +* Awareness of job-level semantics (not just pod-level) +* Queue management and fair sharing across teams +* Understanding of hardware topology + +Building this into the core Kubernetes scheduler would add complexity that most users do not need. Instead, the ecosystem has developed meta-schedulers and queue managers that use well-defined extensions to sit alongside or extend the default scheduler. + +**Event-driven scheduling.** Some workloads should start in response to events rather than immediately upon submission: + +* New data arrives in a storage bucket → trigger a data preparation pipeline +* A model training job completes → trigger an evaluation job +* An upstream job fails → trigger a notification or retry + +Tools like [Argo Events](https://argoproj.github.io/argo-events/), [Apache Airflow](https://airflow.apache.org/), and [Flyte](https://flyte.org/) handle event-driven orchestration. They decide *when* jobs should run; the Kubernetes scheduler decides *where* the resulting pods are placed. + +**Queue-based scheduling.** For AI workloads, queue-based scheduling provides: + +* **Job queues.** Jobs are submitted to queues rather than immediately creating pods. The queue manager decides when to admit jobs based on available resources and policies. +* **Fair sharing.** Resources are divided among queues (often representing teams or projects). A team that has been using less than its share gets priority; a team over quota must wait. +* **Borrowing and lending.** Unused quota from one team can be temporarily used by another, then reclaimed when needed. +* **Gang scheduling.** The queue manager waits until all resources for a job are available before admitting it, preventing partial allocations. + +Projects like Armada, KAI Scheduler, Kueue, Slinky, and Volcano provide these capabilities. They work with the Kubernetes scheduler—typically using mechanisms like pod scheduling gates or custom schedulers—to control when and how pods are scheduled. + +The result is a layered architecture: + +1. **Workflow orchestrators**: Decide what jobs to run and when +2. **Queue managers / batch schedulers**: Manage job admission, fairness, and gang scheduling +3. **Device plugins / [DRA](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/)**: Manage specialized resources, such as GPUs and CPU alignment. + +# What's Next + +This paper explained how Kubernetes scheduling works and why additional layers are needed for AI workloads. The next paper in this series, **Job Orchestration Challenges for AI Workloads**, dives into the specific challenges related to job orchestration: gang scheduling, resource fairness, queue management, preemption, priority scheduling, and resource reservation. diff --git a/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/3_Job _Orchestration_Challenges_for_AI_Workloads.md b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/3_Job _Orchestration_Challenges_for_AI_Workloads.md new file mode 100644 index 000000000..19f440bcc --- /dev/null +++ b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/3_Job _Orchestration_Challenges_for_AI_Workloads.md @@ -0,0 +1,110 @@ +# Job Orchestration Challenges for AI Workloads + +**Authors (in alphabetical order):** + +Abhishek Malvankar, Alexander Scammon, Andrey Velichkevich, Ekin Karabulut, Kevin Hannon, Marlow Warnicke, Rajas Kakodkar, Sabrina DiLeva, Victor Lu, Yuan Tang + +**Contributors (in alphabetical order):** + +Adel Zaalouk, Andrew Shunichi Aikawa, Andrey Velichkevich, Boris Kurktchiev, Brian Redmond, Cathy Zhang, Claudia Misale, Joel Roberts, Johnu George, Josh Halley, Kay Yan, Malini Bhandaru, Michael Yao, Mike SHNg, Naadir Jeewa, Niki Manoledaki, Ricardo Aravena, Ronald Petty, Sachi Desai, Shravan Achar, Sudhanshu Prajapati, Victor Lu, Vijay Rodrigues + +# Series Introduction + +This paper is the third in a five-part series on scheduling for cloud native AI workloads: + +1. Understanding AI Workloads for Kubernetes Scheduling: What AI workloads are and why they differ from traditional cloud-native applications + +2. Scheduling Fundamentals for AI Workloads: How Kubernetes scheduling works and why additional layers are needed + +3. **Job Orchestration Challenges for AI Workloads (this paper):** Gang scheduling, fairness, queues, preemption, priority, and reservation + +4. Resource and Infrastructure Challenges for AI Workloads: Topology, GPU sharing, scalability, I/O, fault tolerance, and cost + +5. Solutions and Practical Guidance for AI Workload Scheduling: Tools, reference architectures, and real-world use cases + +Each paper can be read independently, but together they provide a comprehensive guide to running AI workloads on Kubernetes. + +# Executive Summary + +Running AI workloads on Kubernetes introduces job orchestration challenges that the default scheduler was not designed to handle. Distributed training requires all workers to start simultaneously (gang scheduling). Multiple teams competing for GPUs need fair resource allocation. Jobs must be prioritized, queued, and sometimes preempted to make room for higher-priority work. This paper examines these orchestration challenges in detail: gang scheduling, resource fairness, queue management, preemption, priority scheduling, and resource reservation with backfill. + +# Previously + +The first two papers in this series established: + +* **Paper 1:** AI workloads have fundamentally different characteristics than traditional microservices—they are long-running, tightly coupled, GPU-intensive, and topology-sensitive. + +* **Paper 2:** The default Kubernetes scheduler operates at the pod level and does not understand job-level semantics. Meta-schedulers and queue managers are needed to bridge the gap. + +This paper examines the specific job orchestration challenges that arise when running AI workloads. + +# Job Orchestration Challenges + +This section outlines the scheduling challenges related to job orchestration—how jobs are admitted, ordered, and coordinated. + +## Gang Scheduling + +* **Problem:** Distributed training jobs require all workers to run simultaneously. If a job needs 8 GPUs and only 6 are available, the default Kubernetes scheduler will start 6 pods and leave 2 pending. Those 6 GPUs sit idle—the job cannot make progress, and other jobs cannot use those resources. Worse, this can cause deadlock: multiple partial jobs hold resources while waiting for more, and none can complete. +* **Solution approach:** Gang scheduling treats a group of pods as a single unit. Either all pods are scheduled together, or none are. The scheduler waits until sufficient resources are available for the entire job before starting any pods. +* **Lifecycle impact:** Gang scheduling is critical across multiple stages of the AI lifecycle: + * **Distributed training:** Workers use collective communication (all-reduce) that requires every participant. A partial allocation is useless. + * **Multi-pod inference:** Model-parallel deployments and disaggregated serving architectures require all components (e.g., prefill and decode workers) to be running before the system can serve requests. + * **Distributed data preparation:** Parallel jobs that must complete together to produce consistent output benefit from all-or-nothing scheduling. +* **Current state:** Kubernetes-native batch schedulers that support gang scheduling include the coscheduling plugin (via PodGroups), Armada, KAI Scheduler, and Volcano. Native gang scheduling with the Workloads API h[as been implemented in Kubernetes 1.35 as an alpha feature](https://kubernetes.io/docs/concepts/workloads/workload-api/), with a goal to reach beta in Kubernetes 1.36. + +## Resource Fairness and Quota Management + +* **Problem:** Without controls, a single team or job can monopolize cluster resources, starving others. A large training job might consume all available GPUs for weeks, preventing other teams from running anything. +* **Solution approaches:** + * Fair-share algorithms allocate resources proportionally. If Team A has 60% quota and Team B has 40%, resources are divided accordingly when both teams have pending work. + * Hierarchical quotas allow nested allocation—a department gets a quota, then subdivides it among teams. + * Borrowing and lending lets teams use idle quota from others, with mechanisms to reclaim it when the owner needs it back. +* **Key concepts:** + * **Min/max quotas vs. hard quotas.** Hard quotas block any work beyond the limit. Min/max quotas guarantee a minimum while allowing borrowing up to a maximum. This improves utilization—unused quota doesn't sit idle. + * **Fairness over time.** Batch workloads don't need instantaneous fairness. It's acceptable for one job to use 80% of the cluster for a week, as long as other teams get their share over the month. + * **Namespace-based vs. job-based fairness.** Kubernetes namespaces can represent teams, with quotas applied per namespace. But within a namespace, one user submitting 1,000 jobs shouldn't crowd out another user's 10 jobs—job-level or user-level fairness may also be needed. + +## Queue Management + +* **Problem:** A busy cluster has many jobs competing for resources: urgent production inference, routine nightly training, experimental notebooks, batch data processing. Without organization, chaos ensues. +* **Solution approach:** Queue-based scheduling organizes workloads: + * Hierarchical queues group related work. A "production" queue might have sub-queues for "inference" and "retraining." Each queue can have its own policies and resource allocation. + * Priority classes determine which jobs run first when resources are scarce. Production inference might be highest priority; experimental training might be lowest. +* **Key mechanics:** + * Enqueue decisions. Before a job enters a queue, the system checks whether it can ever run—does it request more resources than exist? Does it meet the minimum requirements for gang scheduling? Jobs that can never run should be rejected early, not left pending forever. + * Task topology. Some jobs require specific hardware (GPUs, TPUs, FPGAs). A multi-level queue system can route jobs to appropriate resource pools automatically. + +## Preemption + +* **Problem:** Higher-priority work arrives, but the cluster is full of lower-priority jobs. Without preemption, urgent work must wait for batch jobs to finish—which could take days. +* **Solution approach:** Preemption stops lower-priority work to free resources for higher-priority work. But naive preemption causes problems for AI workloads. +* **Key considerations:** + * **Preemption vs. reclaim.** Preemption is proactive—stopping jobs to make room for new, higher-priority work. Reclaim is reactive—taking back borrowed resources when the owner needs them. Both are necessary. + * **Checkpointing before preemption.** A training job that's been running for 3 days holds valuable state. Killing it without warning loses that work. Cooperative preemption gives jobs time to checkpoint before being stopped. + * **Preemption within jobs.** In gang-scheduled jobs, preempting one worker may crash the entire job. If a job has 64 workers and you preempt 1, you might lose all 64 GPUs worth of work. Schedulers must understand job structure. + * **Cascading failures.** AI workflows have dependencies. Preempting a data preparation job may invalidate downstream training jobs that depended on its output. + * **Cost-aware preemption.** GPU time is expensive. Some systems allow "GPU time budgets"—jobs can be preempted if they exceed their allocated GPU-hours, forcing checkpoint and resume when resources are next available. + * **Avoiding waste.** Oversubscribing a cluster (promising more resources than exist) can backfire. Constant preemption churns resources, and the overhead of stopping/restarting jobs may waste more than it saves. + +## Priority Scheduling + +* **Problem:** Not all jobs are equal. A production inference service that users are waiting on matters more than an experimental training run that can wait until tomorrow. +* **Solution approach:** Kubernetes has pod-level PriorityClasses, but AI workloads need job-level priority. Preempting one pod from each of ten jobs to fit one high-priority pod is worse than preempting one complete low-priority job. +* **Typical priority levels:** + * Production inference — Highest priority; user-facing latency matters + * Production training/retraining — High priority; needed to keep models fresh + * Development/experimentation — Medium priority; can wait for resources + * Batch/background jobs — Low priority; run when resources are idle +* **Interactive vs. batch:** Interactive workloads (notebooks, debugging) benefit from higher priority even if they use fewer resources—a data scientist waiting for a GPU to run a quick test is blocked until they get one. + +## Resource Reservation and Backfill + +* **Problem:** A large job needs 256 GPUs. Smaller jobs keep arriving and consuming resources. The large job can never find a 256-GPU window and starves indefinitely. +* **Solution approach:** + * Reservation locks resources for a specific job. The scheduler identifies which resources will be needed and stops scheduling new work to them, even if they're currently idle. + * Backfill allows small, short jobs to use reserved resources temporarily, as long as they'll finish before the reserved job needs them. +* **Mechanics:** The scheduler estimates when reserved resources will be free (based on running jobs' expected completion), then allows backfill jobs that fit within that window. This requires jobs to declare (or the system to estimate) their expected runtime. + +# What's Next + +This paper examined the job orchestration challenges for AI workloads. The next paper in this series, **Resource and Infrastructure Challenges for AI Workloads**, covers challenges related to where and how jobs run: topology awareness, resource heterogeneity, GPU utilization and sharing, scalability, I/O bottlenecks, fault tolerance, and cost constraints. diff --git a/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/4_Resource_and_Infrastructure_Challenges_for_AI_Workloads.md b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/4_Resource_and_Infrastructure_Challenges_for_AI_Workloads.md new file mode 100644 index 000000000..a755118bc --- /dev/null +++ b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/4_Resource_and_Infrastructure_Challenges_for_AI_Workloads.md @@ -0,0 +1,155 @@ +# Resource and Infrastructure Challenges for AI Workloads + +**Authors (in alphabetical order):** + +Abhishek Malvankar, Alexander Scammon, Andrey Velichkevich, Ekin Karabulut, Kevin Hannon, Marlow Warnicke, Rajas Kakodkar, Sabrina DiLeva, Victor Lu, Yuan Tang + +**Contributors (in alphabetical order):** + +Adel Zaalouk, Andrew Shunichi Aikawa, Andrey Velichkevich, Boris Kurktchiev, Brian Redmond, Cathy Zhang, Claudia Misale, Joel Roberts, Johnu George, Josh Halley, Kay Yan, Malini Bhandaru, Michael Yao, Mike SHNg, Naadir Jeewa, Niki Manoledaki, Ricardo Aravena, Ronald Petty, Sachi Desai, Shravan Achar, Sudhanshu Prajapati, Victor Lu, Vijay Rodrigues + +# Series Introduction + +This paper is the fourth in a five-part series on scheduling for cloud native AI workloads: + +1. Understanding AI Workloads for Kubernetes Scheduling: What AI workloads are and why they differ from traditional cloud-native applications + +2. Scheduling Fundamentals for AI Workloads: How Kubernetes scheduling works and why additional layers are needed + +3. Job Orchestration Challenges for AI Workloads: Gang scheduling, fairness, queues, preemption, priority, and reservation + +4. **Resource and Infrastructure Challenges for AI Workloads (this paper):** Topology, GPU sharing, scalability, I/O, fault tolerance, and cost + +5. Solutions and Practical Guidance for AI Workload Scheduling: Tools, reference architectures, and real-world use cases + +Each paper can be read independently, but together they provide a comprehensive guide to running AI workloads on Kubernetes. + +# Executive Summary + +Beyond job orchestration, AI workloads face challenges related to the underlying infrastructure: hardware topology affects communication performance, heterogeneous GPU types require careful placement, expensive GPUs should be shared when possible, and long-running jobs must tolerate failures. This paper examines these resource and infrastructure challenges: topology awareness, resource heterogeneity, GPU utilization and sharing, scalability, I/O bottlenecks, fault tolerance and elasticity, and budget constraints. + +# Previously + +The first three papers in this series established: + +* **Paper 1:** AI workloads have fundamentally different characteristics than traditional microservices. They are long-running, tightly coupled, GPU-intensive, and topology-sensitive. + +* **Paper 2:** The default Kubernetes scheduler operates at the pod level. Meta-schedulers and queue managers bridge the gap for AI workloads. + +* **Paper 3:** Job orchestration challenges include gang scheduling, resource fairness, queue management, preemption, priority scheduling, and resource reservation. + +This paper examines challenges related to the underlying resources and infrastructure. + +# Resource and Infrastructure Challenges + +This section outlines scheduling challenges related to hardware resources and infrastructure—where and how jobs run. While many examples reference training workloads, these challenges apply equally to multi-node inference deployments, such as model-parallel or disaggregated serving architectures. + +## Topology Awareness + +Hardware topology significantly impacts AI workload performance. Two different challenges require different solutions: + +* **Node-level topology:** + * NUMA (Non-Uniform Memory Access). Memory access speed depends on which CPU socket is accessing which memory bank. Placing a workload's threads and memory on the same NUMA node avoids slow cross-socket access. + * CPU-GPU affinity. GPUs are attached to specific PCIe buses, which connect to specific CPU sockets. Misaligned placement adds latency. + * GPU interconnects within a node. Eight GPUs in a server might be connected via high-speed interconnects or PCIe (slower). A job that needs 4 GPUs performs better if those 4 have high-speed connections. + * Unified memory architectures. Some hardware platforms use system-on-chip (SoC) designs where GPU and CPU share a single memory pool. On these architectures, GPU memory consumption directly reduces available system memory. The Kubernetes scheduler treats GPU memory and the memory available to CPU as independent resources, unaware of this coupling. This can lead to over-commitment and system instability (e.g., OOM kills) when GPU workloads consume memory that the scheduler assumed was available for system use. +* **Cluster-level topology:** + * Network fabric. Nodes under the same top-of-rack switch can communicate faster than nodes across the datacenter. For distributed training with frequent synchronization, network topology dominates performance. + * Rack placement. Placing all workers in the same rack minimizes network hops but creates a single point of failure. + * GPU interconnects across nodes. Technologies like high-bandwidth networking with RDMA provide high-bandwidth, low-latency connections between GPUs on different servers. The scheduler must know which nodes have these connections. +* **Why this matters:** For communication-heavy workloads like distributed training, placing workers on well-connected nodes can be the difference between a job taking hours vs. days. The default Kubernetes scheduler has limited topology awareness—it knows about NUMA but not about GPU interconnects or network fabric. + +## Resource Heterogeneity + +* **Problem:** Clusters contain mixed hardware: different GPU models, different generations, different memory sizes. Some nodes have TPUs, some have FPGAs, some have only CPUs. +* **Solution approach:** + * Node pools group nodes with similar hardware. Jobs can target specific pools. + * Resource labels tag nodes with their capabilities (gpu-type=high-memory, gpu-generation=current). + * Topology hints help the scheduler understand hardware characteristics beyond simple labels. +* **Challenges:** Jobs may have preferences rather than hard requirements—"prefers current-generation GPUs but can run on previous-generation with reduced batch size." Expressing and handling these preferences adds complexity. + +## GPU Utilization and Sharing + +* **Problem:** GPUs are expensive, but many workloads don't fully utilize them. An inference service or an interactive session might use 20% of a GPU's compute capacity. Allocating whole GPUs to such workloads wastes resources. +* **Solution approaches:** + * **GPU partitions support** into hardware-isolated instances. A single physical GPU becomes multiple smaller virtual GPUs, each with dedicated memory and compute. + * **Time-slicing** shares a GPU between multiple workloads by switching between them rapidly. This improves utilization but adds latency and provides no memory isolation. + * **MPS (Multi-Process Service)** allows multiple containers to use a single GPU concurrently, rather than taking turns. This improves utilization for workloads that don't need a full GPU but provides no hardware isolation between processes. + * **vGPU** (virtual GPU) provides software-level GPU sharing, typically in virtualized environments. + * [**Dynamic Resource Allocation (DRA)**](https://kubernetes.io/docs/concepts/scheduling-eviction/dynamic-resource-allocation/) is a Kubernetes feature that allows flexible resource allocation, including GPU partitioning on demand which need to be implemented via vendor DRA drivers. +* **Caveats:** + * Memory constraints. LLMs consume all available GPU memory. Sharing works for smaller workloads, not for large model inference. + +## Scalability + +* **Problem:** Large AI training runs require massive scale—hundreds of thousands of GPUs across thousands of nodes. The scheduling system must handle this scale. +* **Challenges:** + * Cluster autoscaling. Tools like Karpenter and the Kubernetes Cluster Autoscaler add nodes when demand exceeds capacity. But adding GPU nodes takes time (minutes), and large jobs may need more nodes than a cloud provider can quickly provision. + * Workload autoscaling. Beyond cluster-level scaling, workloads need to scale dynamically. Elastic training and data processing jobs can opportunistically expand beyond quota when resources are idle and shrink back when demand returns, while respecting gang scheduling constraints. For inference, scaling is driven by request load; metrics like tokens-per-second, queue depth, and KV cache utilization may be more relevant than CPU or memory. Disaggregated inference adds complexity, as prefill and decode components may scale independently. + * Scheduler throughput. The scheduler must make placement decisions for thousands of pods without becoming a bottleneck. At scale, scheduling overhead matters—every millisecond spent per pod adds up. + * Coordination overhead. Gang scheduling at scale (coordinating placement of thousands of pods simultaneously) is harder than scheduling independent pods. + +## I/O Bottlenecks + +* **Problem:** GPUs are fast, but getting data to them is slow. I/O bottlenecks waste expensive GPU time. +* **Data loading:** Research from [Google](https://arxiv.org/pdf/2101.12127.pdf) and [Microsoft](https://www.cs.utexas.edu/~vijay/papers/vldb21-datastalls.pdf) found that data loading consumes 10-70% of training execution time. During data loading, GPUs sit idle, waiting for the next batch. +* **Checkpointing:** Training jobs periodically save model state to storage. Checkpointing a large model can take minutes and requires high storage bandwidth. Frequent checkpointing (for fault tolerance) vs. infrequent checkpointing (for performance) is a tradeoff. +* **Model loading for inference:** Before serving requests, inference services must load model weights to GPU memory. The time and method of loading depends on the scenario: + * Cold start: Loading from storage (local filesystems or cloud object storage (S3, GCS)) can take minutes for large models. Optimized streaming tools (such as Run:ai Model Streamer, Tensorizer, and fastsafetensors) reduce this to seconds by streaming weights directly into GPU memory. + * +* **Storage performance:** The MLCommons Storage benchmark measures storage system performance for ML workloads. Inadequate storage throughput can bottleneck the entire training pipeline, regardless of how many GPUs are available. + +## Fault Tolerance and Elasticity + +* **Problem:** Long-running training jobs (days to weeks) will encounter failures. Hardware fails, nodes go down, preemption happens. Without fault tolerance, a failure after 6 days of training means starting over. The frequency of these interruptions is substantial: [Meta reported](https://arxiv.org/pdf/2407.21783) that during a 54-day snapshot of Llama 3 pre-training, they experienced 466 job interruptions: 47 planned (firmware upgrades, configuration updates) and the remainder unplanned hardware or infrastructure failures. +* **Checkpoint/restart:** Periodically saving model state to durable storage allows resuming from the last checkpoint rather than from scratch. The tradeoff: more frequent checkpoints mean less work lost but more overhead. +* **Elastic jobs:** Some frameworks (like PyTorch Elastic) support changing the number of workers during training. If a worker fails, training continues with fewer workers (within limits). If resources become available, workers can be added. This requires: + * Minimum and maximum worker counts + * Ability to redistribute work when workers join/leave + * Coordination with the scheduler to add/remove workers gracefully + * Role-aware scheduling: the scheduler must understand pod roles (e.g., master vs. worker) and preempt workers before masters to avoid job failure +* **Handling failures without full restart:** For gang-scheduled jobs, one worker failure typically crashes the entire job. Elastic training relaxes this—the job continues with the surviving workers, and a replacement worker can join later. + +## Budget and Cost Constraints + +* **Problem:** GPU resources are expensive. Teams have budgets. Without cost controls, a single runaway job can consume a month's cloud budget overnight. +* **Potential approaches:** + * GPU-hour budgets limit how much GPU time a team or job can consume + * Cost-aware scheduling considers the cost of different resource allocations (spot vs. on-demand instances, different GPU types) + * Automatic job termination when budgets are exhausted + * Cost visibility helps teams understand and optimize their spending +* This challenge is mentioned in comments on the original document but not fully addressed in the current Kubernetes ecosystem. It remains an area for future development. + +# Infrastructure Considerations + +Scheduling decisions depend on the underlying infrastructure. This section covers datacenter architecture and network topology as they relate to AI workload scheduling. + +## GPU Direct Communications + +GPUs can communicate without involving the CPU, which reduces latency and increases throughput. + +Within a chassis (same server): + +* High-speed GPU interconnects provide high-bandwidth, low-latency connections between GPUs. Interconnect bandwidth can exceed hundreds of GB/s between connected GPUs, far faster than PCIe. + +Not all GPUs in a server are necessarily connected via high-speed links. In a GPU dense server only certain pairs have NVLink connections. The scheduler must understand this topology. + +Between chassis (across servers): + +* InfiniBand provides high-bandwidth, low-latency networking between servers, with RDMA (Remote Direct Memory Access) support for GPU-direct communication. +* RoCE (RDMA over Converged Ethernet) offers similar capabilities over Ethernet infrastructure. +* GPUDirect RDMA enables direct data transfer between GPUs on different servers without CPU involvement. + +## Network Topology for AI Workloads + +For communication-heavy distributed training, network topology can matter more than raw compute speed. Key considerations: + +* **Locality matters.** Workers that communicate frequently should be placed close together in the network. "Close" means fewer hops and higher bandwidth—ideally on the same leaf switch. +* **Bandwidth requirements.** Large-scale distributed training with frequent synchronization can saturate network links. The scheduler should avoid placing communication-heavy jobs in ways that create network bottlenecks. +* **Failure domains.** Placing all workers in the same rack minimizes network hops but means a rack failure kills the entire job. Spreading workers across racks improves resilience but increases communication latency. + +The scheduler must balance these concerns. For latency-sensitive training, co-location may be worth the reduced resilience. For long-running jobs, spreading across failure domains and accepting higher latency may be preferable to risking a full restart. + +# What's Next + +This paper examined the resource and infrastructure challenges for AI workloads. The final paper in this series, **Solutions and Practical Guidance for AI Workload Scheduling**, catalogs the tools and Kubernetes features that address these challenges, provides a reference table mapping challenges to solutions, and offers practical guidance including real-world use cases. diff --git a/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/5_Solutions_and_Practical_Guidance_for_AI_Workload_Scheduling.md b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/5_Solutions_and_Practical_Guidance_for_AI_Workload_Scheduling.md new file mode 100644 index 000000000..8f62a4a87 --- /dev/null +++ b/initiatives/1641_Cloud_Native_AI_Scheduling_Challenges_Whitepaper/5_Solutions_and_Practical_Guidance_for_AI_Workload_Scheduling.md @@ -0,0 +1,248 @@ +# Solutions and Practical Guidance for AI Workload Scheduling + +**Authors (in alphabetical order):** + +Abhishek Malvankar, Alexander Scammon, Andrey Velichkevich, Ekin Karabulut, Kevin Hannon, Marlow Warnicke, Rajas Kakodkar, Sabrina DiLeva, Victor Lu, Yuan Tang + +**Contributors (in alphabetical order):** + +Adel Zaalouk, Andrew Shunichi Aikawa, Andrey Velichkevich, Boris Kurktchiev, Brian Redmond, Cathy Zhang, Claudia Misale, Joel Roberts, Johnu George, Josh Halley, Kay Yan, Malini Bhandaru, Michael Yao, Mike SHNg, Naadir Jeewa, Niki Manoledaki, Ricardo Aravena, Ronald Petty, Sachi Desai, Shravan Achar, Sudhanshu Prajapati, Victor Lu, Vijay Rodrigues + +# Series Introduction + +This paper is the fifth and final in a series on scheduling for cloud native AI workloads: + +1. Understanding AI Workloads for Kubernetes Scheduling: What AI workloads are and why they differ from traditional cloud-native applications + +2. Scheduling Fundamentals for AI Workloads: How Kubernetes scheduling works and why additional layers are needed + +3. Job Orchestration Challenges for AI Workloads: Gang scheduling, fairness, queues, preemption, priority, and reservation + +4. Resource and Infrastructure Challenges for AI Workloads: Topology, GPU sharing, scalability, I/O, fault tolerance, and cost + +5. **Solutions and Practical Guidance for AI Workload Scheduling (this paper):** Tools, reference architectures, and real-world use cases + +Each paper can be read independently, but together they provide a comprehensive guide to running AI workloads on Kubernetes. + +# Executive Summary + +Understanding AI workloads, scheduling fundamentals, and the challenges is necessary but not sufficient—organizations need to know what tools exist and how to apply them. This paper catalogs the solutions available for AI workload scheduling: Kubernetes native features, scheduler extensions, batch schedulers, ML platform tools, and workflow orchestrators. It provides a reference table mapping challenges to solutions, practical guidance on choosing a scheduling stack, common patterns and anti-patterns, and concrete use cases showing how organizations configure scheduling for different scenarios. + +# Previously + +The first four papers in this series established: + +* **Paper 1:** AI workloads span multiple lifecycle stages with different resource profiles. Training is GPU-intensive and tightly coupled; inference is latency-sensitive and stateful. + +* **Paper 2:** The default Kubernetes scheduler operates at the pod level. Meta-schedulers and queue managers are needed for AI workloads. + +* **Paper 3:** Job orchestration challenges include gang scheduling, resource fairness, queue management, preemption, priority scheduling, and resource reservation. + +* **Paper 4:** Resource and infrastructure challenges include topology awareness, GPU sharing, scalability, I/O bottlenecks, fault tolerance, and cost constraints. Infrastructure considerations like GPU direct communications and network topology also affect scheduling decisions. + +This paper catalogs the solutions and provides practical guidance on applying them. + +# Solutions Landscape + +This section catalogs the tools and Kubernetes features that address the scheduling challenges discussed in Papers 3 and 4\. + +## Kubernetes Native Features + +Kubernetes includes several features relevant to AI workload scheduling: + +* **Pod Priority and Preemption.** PriorityClasses assign priority levels to pods. Higher-priority pods can preempt lower-priority ones when resources are scarce. Limitation: operates at the pod level, not the job level. +* **Topology Spread Constraints.** Distribute pods across failure domains (regions, zones, nodes). Useful for resilience but not for optimizing GPU communication topology. +* **Mutable Scheduling Directives.** Allows modifying a Job's scheduling constraints while suspended. Custom queue controllers use this to implement admission control. +* **Non-preempting PriorityClass.** Jobs can have high priority (to queue ahead of others) without preempting running work. Useful for urgent jobs that can wait for resources to free naturally. +* **Dynamic Resource Allocation (DRA).** A newer feature (GA, Kubernetes 1.34+) that enables flexible allocation of specialized resources. DRA supports on-demand GPU partitioning, allowing workloads to request specific GPU configurations rather than whole GPUs. + +## Scheduler Extensions and Plugins + +The Kubernetes scheduler can be extended through plugins: + +**Scheduler Plugins** (kubernetes-sigs/scheduler-plugins) is a repository of out-of-tree scheduler plugins. Notable plugins include: + +* [Coscheduling](https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling): Implements gang scheduling via PodGroups +* [Capacity Scheduling](https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/capacityscheduling): Elastic quota management with borrowing +* [Node Resources](https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/noderesources): Enhanced resource allocation algorithms + +Multiple plugins can run together, but interactions between plugins require careful configuration. + +## Batch Schedulers and Queue Managers + +These tools add job queuing, fair sharing, and gang scheduling on top of Kubernetes: + +* **Volcano** is a CNCF project providing comprehensive batch scheduling. Features include gang scheduling, fair-share policies, queue management, and topology-aware scheduling. Widely used for training workloads. +* **Kueue** (kubernetes-sigs) is a cloud-native Job scheduler that works with the default Kubernetes scheduler, the Job controller, and the cluster autoscaler to provide an end-to-end batch system. Kueue implements Job queueing, deciding when Jobs should wait and when they should start, based on quotas and a hierarchy for sharing resources fairly among teams. +* **KAI Scheduler** (formerly Run:ai Scheduler, a CNCF sandbox project) is an AI workload scheduler designed for large scale, high throughput clusters. Features include gang scheduling, topology awareness, hierarchical queues with quotas, fair-share policies (with/without time based), GPU sharing (partitioning, time-slicing), elastic workloads, and multi-framework support. Used in enterprise deployments with thousands of nodes. +* **Apache YuniKorn** provides unified scheduling for batch and interactive workloads with hierarchical queues and fair sharing. +* **Slinky** (opensource project maintained by NVIDIA) is a set of powerful integration tools designed to bring Slurm capabilities into Kubernetes, both allowing Slurm jobs to run within Kubernetes and bringing the full suite of Slurm scheduling capabilities into Kubernetes. +* + +## ML Platform Tools + +These tools provide higher-level abstractions for ML workflows: + +* **Kubeflow** + * **Kubeflow Trainer** supports distributed training across frameworks (PyTorch, TensorFlow, PaddlePaddle, XGBoost). Provides job abstractions that handle worker coordination, including gang scheduling requirements. + * **Kubeflow MPI Operator** enables MPI-based collective communication for distributed training. Essential for workloads using all-reduce synchronization. + * **Kubeflow Katib** manages AutoML workloads—hyperparameter optimization and neural architecture search. Coordinates multiple training jobs with different configurations. +* **KServe** provides a standardized distributed generative and predictive AI inference platform for scalable, multi-framework deployment on Kubernetes. +* **MLflow** offers experiment tracking, model registry, and deployment tools. Not a scheduler, but integrates with scheduling systems for workflow management. +* **Ray / KubeRay** provides a distributed computing framework with its own scheduler. KubeRay is the Kubernetes operator for running Ray clusters. Useful for workloads that need Ray's programming model. + +## Data and Storage Tools + +* [**Fluid**](https://github.com/fluid-cloudnative/fluid) (CNCF Sbx) accelerates data access for AI workloads through dataset caching and data movement optimization. Reduces I/O bottlenecks that leave GPUs idle. +* [**HAMi**](https://github.com/Project-HAMi/HAMi) (Heterogeneous AI Computing Virtualization Middleware) enables flexible GPU sharing and partitioning across different GPU types. +* [**Kubeflow**](https://www.kubeflow.org/) (CNCF incubating project): + * **Kubeflow Data Cache** enables efficient data streaming for distributed training workloads. + * **Kubeflow Spark Operator** aims to make specifying and running Apache Spark applications as easy and idiomatic as running other workloads on Kubernetes. + * **Kubeflow Hub** provides a single pane of glass for ML model developers to index and manage models, versions, and ML artifacts metadata. + +## Workflow Orchestration + +These tools manage multi-step ML pipelines: + +* **Apache Airflow** is a general-purpose workflow orchestrator commonly used for ML pipelines. Defines DAGs (directed acyclic graphs) of tasks and handles scheduling, monitoring, and retries. +* **Argo Workflows** is a Kubernetes-native workflow engine. Workflows are defined as Kubernetes resources, making integration with other Kubernetes tools straightforward. +* **Flyte** is a workflow orchestration platform designed for ML. Provides versioning, caching, and strong typing for workflow inputs and outputs. +* **Kubeflow Pipelines** is a platform for building and deploying portable and scalable machine learning (ML) workflows using containers on Kubernetes-based systems. + +## Non-Kubernetes Schedulers + +* **Slurm** is the dominant scheduler in HPC environments. Many organizations run Slurm alongside or instead of Kubernetes for AI workloads (especially training and multi-node inference), when they have existing HPC infrastructure. Integration approaches exist for bridging Slurm and Kubernetes environments. + +# Mapping Challenges to Solutions + +A reference table that maps each challenge to the solutions that address it. All solutions in alphabetical order. + +| Challenge | Kubernetes Native | Schedulers | Platforms/Tools | Applies To | Notes | +| :---- | :---- | :---- | :---- | :---- | :---- | +| Gang Scheduling | Workload API \+ GangScheduling feature gate (alpha, K8s 1.35+) | Armada, Coscheduling plugin, KAI, Kueue, Slinky, Volcano, YuniKorn | Kubeflow Trainer | Both | Essential for distributed training | +| Resource Fairness | ResourceQuota (limited) | Armada, KAI, Kueue, Slinky, Volcano, YuniKorn | \- | Both | Hierarchical quotas require external tools | +| Queue Management | \- | Armada, KAI, Kueue, Slinky, Volcano, YuniKorn | Airflow, Flyte | Both | Core capability of batch schedulers | +| Preemption | PriorityClass (pod-level) | KAI, Kueue, Slinky, Volcano | \- | Both | Job-level preemption needs external tools | +| Priority Scheduling | PriorityClass | All batch schedulers | \- | Both | Job-level priority in batch schedulers | +| Reservation & Backfill | \- | Slinky, Volcano, YuniKorn | \- | Training | Advanced feature in some schedulers | +| Topology Awareness (Node) | Topology Manager (NUMA), DRA CPU Driver (CPU topology) | KAI, Kueue, Slinky, Volcano | \- | Both | GPU interconnect awareness varies | +| Topology Awareness (Cluster) | Topology Spread Constraints, DRANET (network DRA Driver) (limited) | KAI, Kueue, Slinky, Volcano | \- | Both | Network topology awareness is emerging | +| Resource Heterogeneity | Node selectors, labels | All batch schedulers | \- | Both | Standard Kubernetes features usually sufficient | +| GPU Sharing | DRA (GA, K8s 1.34+) | KAI | HAMi, KubeRay, Volcano | Both | MIG requires DRA or vendor tools | +| Scalability | Cluster Autoscaler, Karpenter | Armada, KAI, Kueue, Slinky, Volcano | interLink | Both | Large-scale scheduling is challenging | +| I/O Bottlenecks | PersistentVolumes | \- | Fluid | Both | Storage and caching solutions | +| Fault Tolerance | \- | Slinky, | Kubeflow (elastic training) | Training | Framework-dependent | +| Elasticity | HPA, VPA | KAI, Slinky, Volcano | Kubeflow Trainer | Both | PyTorch Elastic, etc. | +| Budget/Cost | \- | Limited support | Flyte (spot/interruptible tasks) | Both | Emerging area | +| Inference request scheduling | Pod-level metrics | Dynamo Router , llm-d-inference scheduler | Kubernetes | Inference | Emerging area | +| Inference request autoscaling | Pod-level metrics, event-driven | Dynamo Planner, llm-d-WVA-autoscaler | KServe | Inference | Emerging area | + +Key observations: + +* Gang scheduling is the most critical capability gap in default Kubernetes. All major batch schedulers address it. +* Fairness and queuing require external tools for anything beyond basic ResourceQuota. +* Topology awareness is partially addressed but remains an area of active development. +* GPU sharing is becoming more accessible with DRA, but memory constraints limit applicability for large models. +* Cost management is underserved by current tools. + +# Practical Guidance + +## Choosing a Scheduling Stack + +The right scheduling stack depends on your workloads, scale, and existing infrastructure. + +**Start with the basics if:** + +* You run primarily single-node training or inference + +* Your workloads don't require gang scheduling + +* You have a small team and simple fairness requirements + +In this case, standard Kubernetes features (PriorityClasses, ResourceQuota, node selectors) may suffice. + +**Deploy a batch scheduler if:** + +* You run workloads ranging from single pods to multi-pod configurations (such as distributed training or inference) that require gang scheduling. + +* You have multiple hierarchical teams competing for GPU resources + +* You need sophisticated fair-share policies + +* You operate at significant scale (hundreds of GPUs or more) + +**Consider multi-cluster scheduling if:** + +* Your resource needs exceed a single cluster's capacity + +* You have infrastructure distributed across regions or clouds + +* You want unified job management across multiple clusters + +**Use a traditional HPC scheduler that integrates with Kubernetes if:** + +* You have existing HPC infrastructure and workflows + +* You are running training workloads + +* You have users that need a traditional HPC scheduler + +* You need capabilities that Kubernetes schedulers don't yet provide + +* You want hybrid abilities to run both traditional HPC workloads and Kubernetes loads on the same nodes + +### Common Patterns and Anti-Patterns + +**Patterns that work well:** + +* Separate queues for different workload types. Production inference, training, and experimentation have different priorities and resource profiles. Separate queues make policies clearer. +* Use non-preempting priority for urgent-but-not-critical work. Jobs that should run soon but don't need to kill running work can use high priority with preemptionPolicy: Never. +* Set realistic resource requests. Over-requesting wastes resources; under-requesting causes OOM kills. Profile your workloads. +* Use either framework or container state checkpointing for long-running jobs. Training jobs that run for days should checkpoint regularly. This enables preemption without total loss and recovery from failures and minimizes loss of valuable run time. +* Use topology constraints deliberately. For communication-heavy workloads, specify topology requirements. For resilience-critical workloads, spread across failure domains. + +**Anti-patterns to avoid:** + +* Running gang-scheduled workloads without a gang scheduler. Partial allocations waste resources and can deadlock. +* Oversubscribing without preemption policies. Promising more resources than exist only works if you have clear rules about who gets preempted when. +* Ignoring GPU memory constraints. Scheduling multiple workloads to a GPU when they won't fit in memory causes crashes, not sharing. +* Treating all GPUs as equivalent. Different GPU models, memory sizes, and interconnects matter. Jobs that need current-generation GPUs shouldn't land on previous-generation hardware. +* No quotas in multi-tenant clusters. Without quotas, one team's large job can starve everyone else indefinitely. + +### Getting Started + +**For platform engineers building new infrastructure:** + +1. Start with a single batch scheduler to handle basic queuing and quota management. +2. Define queues that map to your organizational structure (teams, projects, environments). +3. Set quotas based on expected resource allocation—you can adjust later. +4. Ensure gang scheduling support if you run distributed training. +5. Implement monitoring to understand actual resource usage patterns. + +**For ML engineers working with existing infrastructure:** + +1. Understand what scheduling tools are available in your cluster. +2. Use the appropriate job abstractions (PyTorchJob, MPIJob, etc.) rather than raw pods. +3. Set accurate resource requests based on profiling. +4. Implement checkpointing for any training job longer than a few hours. +5. Work with platform engineers to tune quotas and priorities for your workloads. + +**For organizations evaluating their options:** + +1. Inventory your workloads: What fraction need gang scheduling? GPU sharing? Low latency? +2. Assess your scale: How many GPUs? How many concurrent users/teams? +3. Evaluate your existing infrastructure: Kubernetes-only? Hybrid with HPC? +4. Start small and iterate: Deploy a scheduler, run real workloads, observe, adjust. + +# Conclusion + +Kubernetes was not initially designed for AI workloads; however, it has become one of the de facto standards for managing them. Its default scheduler excels at placing stateless microservices but historically lacked capabilities that machine learning jobs require: gang scheduling, fair sharing across teams, topology awareness for GPU interconnects, and job-level (not just pod-level) management. The ecosystem has responded, and Kubernetes itself continues to evolve to address these gaps. Key takeaways from this series: + +1. Understand your workloads. Different stages of the ML lifecycle have different requirements. Data preparation, training, and inference each need different scheduling strategies. +2. Gang scheduling is essential for AI workloads. Without it, partial allocations waste resources and cause deadlocks. If you run distributed training, you need a scheduler that supports gang scheduling. +3. Fairness requires explicit policies. ResourceQuota provides basic limits, but real multi-tenant clusters need hierarchical quotas, borrowing/lending, and fair-share algorithms. +4. Topology matters. For communication-intensive workloads, placing workers on well-connected nodes significantly improves performance. This is an area of active development. +5. The ecosystem is maturing. Projects like Kueue, Slinky, Volcano, and KAI Scheduler are production-ready. Dynamic Resource Allocation is reaching stability. The tools exist—the question is choosing and integrating them. +6. Start simple, iterate. You don't need every feature on day one. Start with basic queuing and quotas, and add capabilities as your needs grow. + +The cloud-native AI landscape continues to evolve. New hardware (different GPU architectures, accelerators), new workload patterns (agentic systems, distributed inference), and new Kubernetes features will create both challenges and opportunities. This series provides a foundation; staying current requires ongoing engagement with the community. \ No newline at end of file