diff --git a/api-reference/overview.mdx b/api-reference/overview.mdx index b898cc7a..cf522383 100644 --- a/api-reference/overview.mdx +++ b/api-reference/overview.mdx @@ -3,41 +3,24 @@ title: "Overview" description: "Use the Runpod API to programmatically manage your compute resources." --- -import { PodsTooltip, InferenceTooltip, ServerlessTooltip, NetworkVolumeTooltip, TemplatesTooltip, WorkersTooltip } from "/snippets/tooltips.jsx"; - -The Runpod API provides programmatic access to all of Runpod's cloud compute resources. It enables you to integrate GPU infrastructure directly into your applications, workflows, and automation systems. - -Use the Runpod API to: - -* Create, monitor, and manage for persistent workloads. -* Deploy and scale endpoints for AI . -* Configure for data persistence. -* Integrate Runpod's GPU computing power into your existing applications and CI/CD pipelines. - -The API follows REST principles and returns JSON responses, making it compatible with virtually any programming language or automation tool. Whether you're building a machine learning platform, automating model deployments, or creating custom dashboards for resource management, the Runpod API provides a foundation for seamless integration. +The Runpod REST API provides programmatic access to all Runpod compute resources. Integrate GPU infrastructure into your applications, workflows, and automation systems. ## Available resources -The Runpod API provides complete access to Runpod's core resources: - -* **Pods**: Create and manage persistent GPU instances for development, training, and long-running workloads. Control Pod lifecycles, configure hardware specifications, and manage SSH access programmatically. -* **Serverless endpoints**: Deploy and scale containerized applications for AI inference and batch processing. Configure autoscaling parameters, manage pools, and monitor job execution in real-time. -* **Network volumes**: Create persistent storage that can be attached to multiple resources. Manage data persistence across Pod restarts and share datasets between different compute instances. -* **Templates**: Save and reuse Pod and endpoint configurations with to standardize deployments across projects and teams. -* **Container registry authentication**: Securely connect to private Docker registries to deploy custom containers and models. -* **Billing and usage**: Access detailed billing information and resource usage metrics to optimize costs and monitor spending across projects. +- **Pods**: Create and manage persistent GPU instances for development, training, and long-running workloads. +- **Serverless endpoints**: Deploy and scale containerized applications with autoscaling and job monitoring. +- **Network volumes**: Create persistent storage attachable to multiple resources. +- **Templates**: Save and reuse Pod and endpoint configurations. +- **Container registry auth**: Connect to private Docker registries. +- **Billing**: Access usage metrics and billing information. -## Getting started +## Authentication -To use the REST API, you'll need a [Runpod API key](/get-started/api-keys) with appropriate permissions for the resources you want to manage. API keys can be generated and managed through your account settings in the Runpod console. +All requests require a [Runpod API key](/get-started/api-keys) in the request headers. The API uses standard HTTP methods and returns JSON responses. -All API requests require authentication using your API key in the request headers. 
The API uses standard HTTP methods (GET, POST, PATCH, DELETE) and returns JSON responses with detailed error information when needed. +## OpenAPI schema -## Retrieve the OpenAPI schema - -You can get the complete OpenAPI specification for the Runpod API using the `/openapi.json` endpoint. Use this to generate client libraries, validate requests, or integrate the API specification into your development tools. - -The schema includes all available endpoints, request and response formats, authentication requirements, and data models. +Retrieve the complete OpenAPI specification for client generation, request validation, or tooling integration. @@ -57,7 +40,3 @@ print(response.json()) ``` - -The endpoint returns the OpenAPI 3.0 specification in JSON format. You can use it with tools like Swagger UI, Postman, or code generation utilities. - -For detailed endpoint documentation, request/response schemas, and code examples, explore the sections in the sidebar to the left. diff --git a/docs.json b/docs.json index 52a5497a..1110d90b 100644 --- a/docs.json +++ b/docs.json @@ -93,7 +93,6 @@ "serverless/workers/handler-functions", "serverless/development/local-testing", "serverless/development/validation", - "serverless/development/error-handling", "serverless/development/cleanup", "serverless/development/write-logs", "serverless/development/huggingface-models", @@ -118,6 +117,7 @@ "pages": [ "serverless/endpoints/overview", "serverless/endpoints/send-requests", + "serverless/endpoints/operation-reference", "serverless/endpoints/endpoint-configurations", "serverless/endpoints/model-caching", "serverless/development/optimization" @@ -155,7 +155,8 @@ "serverless/vllm/openai-compatibility", "serverless/vllm/environment-variables" ] - } + }, + "serverless/troubleshooting" ] }, { @@ -192,6 +193,18 @@ "pods/templates/environment-variables", "pods/templates/secrets" ] + }, + { + "group": "Troubleshooting", + "pages": [ + "pods/troubleshooting/zero-gpus", + "pods/troubleshooting/pod-migration", + "pods/troubleshooting/jupyterlab-blank-page", + "pods/troubleshooting/jupyterlab-checkpoints-folder", + "pods/troubleshooting/token-authentication-enabled", + "pods/troubleshooting/storage-full", + "pods/troubleshooting/troubleshooting-502-errors" + ] } ] }, @@ -210,69 +223,7 @@ "public-endpoints/reference", "public-endpoints/requests", "public-endpoints/ai-sdk", - "public-endpoints/ai-coding-tools", - { - "group": "Models", - "pages": [ - { - "group": "Image models", - "pages": [ - "public-endpoints/models/flux-dev", - "public-endpoints/models/flux-schnell", - "public-endpoints/models/flux-kontext-dev", - "public-endpoints/models/p-image-t2i", - "public-endpoints/models/p-image-edit", - "public-endpoints/models/qwen-image", - "public-endpoints/models/qwen-image-lora", - "public-endpoints/models/qwen-image-edit", - "public-endpoints/models/qwen-image-edit-2511", - "public-endpoints/models/qwen-image-edit-2511-lora", - "public-endpoints/models/seedream-4-t2i", - "public-endpoints/models/seedream-4-edit", - "public-endpoints/models/seedream-3", - "public-endpoints/models/wan-2-6-t2i", - "public-endpoints/models/z-image-turbo", - "public-endpoints/models/nano-banana-edit", - "public-endpoints/models/nano-banana-pro-edit" - ] - }, - { - "group": "Video models", - "pages": [ - "public-endpoints/models/infinitetalk", - "public-endpoints/models/kling-v2-1", - "public-endpoints/models/kling-v2-6-motion-control", - "public-endpoints/models/kling-video-o1-r2v", - "public-endpoints/models/seedance-1-pro", - 
"public-endpoints/models/seedance-1-5-pro", - "public-endpoints/models/sora-2", - "public-endpoints/models/sora-2-pro", - "public-endpoints/models/wan-2-6-t2v", - "public-endpoints/models/wan-2-5", - "public-endpoints/models/wan-2-2-i2v-lora", - "public-endpoints/models/wan-2-2-i2v", - "public-endpoints/models/wan-2-2-t2v", - "public-endpoints/models/wan-2-1-i2v", - "public-endpoints/models/wan-2-1-t2v" - ] - }, - { - "group": "Text models", - "pages": [ - "public-endpoints/models/granite-4", - "public-endpoints/models/qwen3-32b" - ] - }, - { - "group": "Audio models", - "pages": [ - "public-endpoints/models/chatterbox-turbo", - "public-endpoints/models/whisper-v3", - "public-endpoints/models/minimax-speech" - ] - } - ] - } + "public-endpoints/ai-coding-tools" ] }, { @@ -325,20 +276,7 @@ "references/billing-information", "references/referrals", "references/security-and-compliance", - { - "group": "Troubleshooting", - "pages": [ - "references/troubleshooting/zero-gpus", - "references/troubleshooting/pod-migration", - "references/troubleshooting/jupyterlab-blank-page", - "references/troubleshooting/jupyterlab-checkpoints-folder", - "references/troubleshooting/token-authentication-enabled", - "references/troubleshooting/leaked-api-keys", - "references/troubleshooting/storage-full", - "references/troubleshooting/troubleshooting-502-errors", - "references/troubleshooting/manage-payment-cards" - ] - }, + "references/manage-payment-cards", { "group": "Migrations", "pages": [ @@ -565,6 +503,68 @@ } ] }, + { + "tab": "Models", + "groups": [ + { + "group": "Image models", + "pages": [ + "public-endpoints/models/flux-dev", + "public-endpoints/models/flux-schnell", + "public-endpoints/models/flux-kontext-dev", + "public-endpoints/models/p-image-t2i", + "public-endpoints/models/p-image-edit", + "public-endpoints/models/qwen-image", + "public-endpoints/models/qwen-image-lora", + "public-endpoints/models/qwen-image-edit", + "public-endpoints/models/qwen-image-edit-2511", + "public-endpoints/models/qwen-image-edit-2511-lora", + "public-endpoints/models/seedream-4-t2i", + "public-endpoints/models/seedream-4-edit", + "public-endpoints/models/seedream-3", + "public-endpoints/models/wan-2-6-t2i", + "public-endpoints/models/z-image-turbo", + "public-endpoints/models/nano-banana-edit", + "public-endpoints/models/nano-banana-pro-edit" + ] + }, + { + "group": "Video models", + "pages": [ + "public-endpoints/models/infinitetalk", + "public-endpoints/models/kling-v2-1", + "public-endpoints/models/kling-v2-6-motion-control", + "public-endpoints/models/kling-video-o1-r2v", + "public-endpoints/models/seedance-1-pro", + "public-endpoints/models/seedance-1-5-pro", + "public-endpoints/models/sora-2", + "public-endpoints/models/sora-2-pro", + "public-endpoints/models/wan-2-6-t2v", + "public-endpoints/models/wan-2-5", + "public-endpoints/models/wan-2-2-i2v-lora", + "public-endpoints/models/wan-2-2-i2v", + "public-endpoints/models/wan-2-2-t2v", + "public-endpoints/models/wan-2-1-i2v", + "public-endpoints/models/wan-2-1-t2v" + ] + }, + { + "group": "Text models", + "pages": [ + "public-endpoints/models/granite-4", + "public-endpoints/models/qwen3-32b" + ] + }, + { + "group": "Audio models", + "pages": [ + "public-endpoints/models/chatterbox-turbo", + "public-endpoints/models/whisper-v3", + "public-endpoints/models/minimax-speech" + ] + } + ] + }, { "tab": "Release notes", "groups": [ @@ -622,7 +622,43 @@ }, { "source": "/references/faq", - "destination": "/references/troubleshooting/zero-gpus" + "destination": 
"/pods/troubleshooting/zero-gpus" + }, + { + "source": "/references/troubleshooting/zero-gpus", + "destination": "/pods/troubleshooting/zero-gpus" + }, + { + "source": "/references/troubleshooting/pod-migration", + "destination": "/pods/troubleshooting/pod-migration" + }, + { + "source": "/references/troubleshooting/jupyterlab-blank-page", + "destination": "/pods/troubleshooting/jupyterlab-blank-page" + }, + { + "source": "/references/troubleshooting/jupyterlab-checkpoints-folder", + "destination": "/pods/troubleshooting/jupyterlab-checkpoints-folder" + }, + { + "source": "/references/troubleshooting/token-authentication-enabled", + "destination": "/pods/troubleshooting/token-authentication-enabled" + }, + { + "source": "/references/troubleshooting/storage-full", + "destination": "/pods/troubleshooting/storage-full" + }, + { + "source": "/references/troubleshooting/troubleshooting-502-errors", + "destination": "/pods/troubleshooting/troubleshooting-502-errors" + }, + { + "source": "/references/troubleshooting/manage-payment-cards", + "destination": "/references/manage-payment-cards" + }, + { + "source": "/references/troubleshooting/leaked-api-keys", + "destination": "/get-started/api-keys" }, { "source": "/references/glossary", @@ -750,7 +786,11 @@ }, { "source": "/serverless/workers/handlers/handler-error-handling", - "destination": "/serverless/workers/handler-functions" + "destination": "/serverless/workers/handler-functions#error-handling" + }, + { + "source": "/serverless/development/error-handling", + "destination": "/serverless/workers/handler-functions#error-handling" }, { "source": "/serverless/workers/handlers/overview", diff --git a/flash/apps/deploy-apps.mdx b/flash/apps/deploy-apps.mdx index 000aa19c..ca19b9dd 100644 --- a/flash/apps/deploy-apps.mdx +++ b/flash/apps/deploy-apps.mdx @@ -4,8 +4,6 @@ sidebarTitle: "Deploy to Runpod" description: "Build and deploy your Flash app for production serving." --- -import { LoadBalancingEndpointsTooltip, QueueBasedEndpointsTooltip } from "/snippets/tooltips.jsx"; - When you're satisfied with your endpoint functions and ready to move to production, use `flash deploy` to build and deploy your Flash application: ```bash diff --git a/flash/create-endpoints.mdx b/flash/create-endpoints.mdx index c58dfcb3..04b2b10b 100644 --- a/flash/create-endpoints.mdx +++ b/flash/create-endpoints.mdx @@ -4,7 +4,7 @@ sidebarTitle: "Create endpoints" description: "Learn how to create and configure hardware and scaling behavior with the Flash Endpoint class." --- -import { WorkerTooltip, ServerlessTooltip, NetworkVolumesTooltip } from "/snippets/tooltips.jsx"; +import { WorkerTooltip, ServerlessTooltip } from "/snippets/tooltips.jsx"; In Flash, endpoints are the bridge between your local Python functions and Runpod's cloud infrastructure. When you decorate a function with `@Endpoint`, you're marking it to run remotely on Runpod instead of your local machine: diff --git a/flash/overview.mdx b/flash/overview.mdx index ba3fa401..1bf732cb 100644 --- a/flash/overview.mdx +++ b/flash/overview.mdx @@ -6,8 +6,6 @@ tag: "BETA" mode: "wide" --- -import { ServerlessTooltip, PodsTooltip, WorkersTooltip, LoadBalancingEndpointsTooltip, QueueBasedEndpointsTooltip, EndpointsTooltip } from "/snippets/tooltips.jsx"; -
diff --git a/get-started/manage-accounts.mdx b/get-started/manage-accounts.mdx index 5f409efb..a8872ea6 100644 --- a/get-started/manage-accounts.mdx +++ b/get-started/manage-accounts.mdx @@ -3,7 +3,7 @@ title: "Manage accounts" description: "Create accounts, manage teams, and configure user permissions in Runpod." --- -import { PodsTooltip, ServerlessTooltip, InferenceTooltip } from "/snippets/tooltips.jsx"; +import { PodsTooltip, ServerlessTooltip } from "/snippets/tooltips.jsx"; To access Runpod resources, you need to either create your own account or join an existing team through an invitation. This guide explains how to set up and manage accounts, teams, and user roles. diff --git a/hub/overview.mdx b/hub/overview.mdx index 14960879..717a94cd 100644 --- a/hub/overview.mdx +++ b/hub/overview.mdx @@ -2,13 +2,14 @@ title: "Overview" sidebarTitle: "Overview" description: "Discover, deploy, and share preconfigured AI repos using the Runpod Hub." +mode: "wide" --- -import { ServerlessTooltip, PodTooltip, EndpointTooltip, PublicEndpointTooltip, HandlerFunctionTooltip, WorkerTooltip } from "/snippets/tooltips.jsx"; +import { ServerlessTooltip, PodTooltip, PublicEndpointTooltip, HandlerFunctionTooltip, WorkerTooltip } from "/snippets/tooltips.jsx"; -The [Runpod Hub](https://console.runpod.io/hub) is a centralized repository that enables users to discover, share, and deploy preconfigured AI repos optimized for Runpod's and infrastructure. It offers a catalog of vetted, open-source repositories that can be deployed with minimal setup, creating a collaborative ecosystem for AI developers and users. +
-Whether you're a developer looking to share your work or a user seeking preconfigured solutions, the Hub makes discovering and deploying AI projects seamless and efficient. +The [Runpod Hub](https://console.runpod.io/hub) is a centralized repository for discovering, sharing, and deploying preconfigured AI repos optimized for Serverless and Pod infrastructure. @@ -16,117 +17,70 @@ Whether you're a developer looking to share your work or a user seeking preconfi ## Why use the Hub? -The Hub simplifies the entire lifecycle of repo sharing and deployment, from initial submission through testing, discovery, and usage. +**For users:** +- **Production-ready solutions**: Vetted, open-source repos with minimal setup required. +- **One-click deployment**: Go from discovery to running services in minutes. +- **Configurable**: Customize parameters without diving into code. -### For Runpod users - -- **Find production-ready AI solutions**: Discover vetted, open-source repositories optimized for Runpod with minimal setup required. -- **Deploy in one click**: Go from discovery to running services in minutes, not days. -- **Customize to your needs**: Runpod Hub repos expose configurable parameters for fine-tuning without diving into code. -- **Save development time**: Leverage community innovations instead of building from scratch. - -### For Hub creators - -- **Showcase your work**: Share your projects with the broader AI community. -- **Maintain control**: Your GitHub repo remains the source of truth, while the Hub automatically detects new releases. -- **Streamline your workflow**: Automated building and testing ensures your releases work as expected. -- **Earn credits from your contributions**: Generate Runpod credits when users deploy your repositories. You can earn up to 7% of compute revenue for repos you publish, which is paid directly to your Runpod credit balance based on monthly usage. For more details, see the [Revenue sharing](/hub/revenue-sharing) section of the publishing guide. +**For creators:** +- **Showcase your work**: Share projects with the AI community. +- **Automated pipeline**: The Hub builds and tests your releases automatically. +- **Earn revenue**: Generate up to 7% of compute revenue when users deploy your repos. See [revenue sharing](/hub/revenue-sharing). ## Public Endpoints -In addition to official and community-submitted repos, the Hub also offers s for popular AI models. These are ready-to-use APIs that you can integrate directly into your applications without needing to manage any of the underlying infrastructure. - -Public Endpoints provide: - -- Instant access to state-of-the-art models. -- A playground for interactive testing. -- Simple, usage-based pricing. - -Browse all available models in the [model reference](/public-endpoints/reference). - -## How Hub repos work - -The Hub operates by integrating several key components: - -1. **Repository integration**: The Hub connects with GitHub repositories, using GitHub releases (not commits) as the basis for versioning and updates. -2. **GitHub authorization**: Hub repo administration access is automatically managed via GitHub authorization. -3. **Configuration system**: Repositories use standardized configuration files (`hub.json` and `tests.json`) in a `.runpod` directory to define metadata, hardware requirements, and test procedures. See the [publishing guide](/hub/publishing-guide) to learn more. -4. 
**Automated build pipeline**: When a repository is submitted or updated, the Hub automatically scans, builds, and tests it to ensure it works correctly on Runpod's infrastructure. -5. **Continuous release monitoring**: The system regularly checks for new releases in registered repositories and rebuilds them when updates are detected. -6. **Deployment interface**: Users can browse repos, customize parameters, and deploy them to Runpod infrastructure with minimal configuration. - -## Deploy a repo from the Hub - -You can deploy a repo from the Hub in seconds, choosing between Serverless endpoints or Pods based on your needs: - -### Deploy as a Serverless endpoint +The Hub also offers Public Endpoints for popular AI models. These are ready-to-use APIs with instant access, interactive playgrounds, and usage-based pricing. Browse available models in the [model reference](/public-endpoints/reference). -1. Navigate to the [Hub page](https://www.console.runpod.io/hub) in the Runpod console. -2. Browse the collection and select a repo that matches your needs. -3. Review the repo details, including hardware requirements and available configuration options to ensure compatibility with your use case. -4. Click the **Deploy** button in the top-right of the repo page. You can also use the dropdown menu to deploy an older version. -5. Click **Create Endpoint** +## Deploy a repo -Within minutes you'll have access to a new Serverless , ready for integration with your applications or experimentation. + + + 1. Go to the [Hub](https://www.console.runpod.io/hub) and select a repo. + 2. Review hardware requirements and configuration options. + 3. Click **Deploy** → **Create Endpoint**. -### Deploy as a Pod + Your endpoint will be ready for integration within minutes. + + + 1. Go to the [Hub](https://www.console.runpod.io/hub) and select a repo. + 2. Click **Deploy**, select **Pod** as the deployment type. + 3. Click **Deploy Pod**. -For users with consistent, predictable workloads who prioritize cost-effectiveness over automatic scaling: + After deployment, find a sample request in the Pod details pane: -1. Navigate to the [Hub page](https://www.console.runpod.io/hub) in the Runpod console. -2. Browse the collection and select a repo that matches your needs. -3. Click the **Deploy** button in the top-right of the repo page. -4. In the modal that opens, under **Deployment Type**, select **Pod**. -5. Click **Deploy Pod** + ```python + import requests -After the Pod deploys, you can find a sample Python request for the Pod in the Pod details pane, which you can access from the [Pods section](https://www.console.runpod.io/pods) of the Runpod console. The API interface is identical to the Serverless implementation, using the same endpoints and authentication methods, but using the Pod's internal IP address as a proxy. For example: - -```python -import requests - -headers = { - 'Content-Type': 'application/json', - 'Authorization': 'Bearer YOUR_API_KEY' -} - -data = { - 'input': {"prompt":"Your prompt"} -} - -response = requests.post('https://POD_ID-80.proxy.runpod.net/v2/LOCAL/run', headers=headers, json=data) -``` - -Where `POD_ID` is your Pod's actual ID. 
+ response = requests.post( + 'https://POD_ID-80.proxy.runpod.net/v2/LOCAL/run', + headers={ + 'Content-Type': 'application/json', + 'Authorization': 'Bearer YOUR_API_KEY' + }, + json={'input': {"prompt": "Your prompt"}} + ) + ``` + + ## Publish your own repo -You can [publish your own repo](/hub/publishing-guide) on the Hub by preparing your GitHub repository with a working Serverless endpoint implementation, comprised of a and `Dockerfile`. +Publish your GitHub repository on the Hub by preparing a worker with a handler function and `Dockerfile`. -To learn how to build your first worker, [follow this guide](/serverless/workers/custom-worker). +New to building Serverless workers? Follow the [quickstart guide](/serverless/quickstart). -Once your code is ready to share: - -1. Add the required configuration files in a `.runpod` directory, following the instructions in the [Hub publishing guide](/hub/publishing-guide). -2. Create a GitHub release to establish a versioned snapshot. -3. Submit your repository to the Hub through the Runpod console, where it will undergo automated building and testing. -4. The Runpod team will review your repo. After approval, your repo will appear in the Hub. - -Once your repo is approved and published, you can start earning revenue from user deployments. To receive credits, link your GitHub profile to your Runpod account for verified maintainer status. Revenue is calculated monthly based on compute hours generated by your repos, with tiers ranging from 1% (100-999 hours) to 7% (10,000+ hours). Credits are deposited directly into your Runpod account balance each month. +1. Add configuration files in a `.runpod` directory per the [publishing guide](/hub/publishing-guide). +2. Create a GitHub release. +3. Submit your repository through the Runpod console. +4. After review and approval, your repo appears in the Hub. -## Use cases +Once published, earn revenue from user deployments. Link your GitHub profile to your Runpod account for verified maintainer status. Revenue tiers range from 1% (100-999 compute hours) to 7% (10,000+ hours), paid monthly as Runpod credits. -The Runpod Hub supports a wide range of AI applications and workflows. Here are some common use cases that demonstrate the versatility and power of Hub repositories: - -### For AI researchers and enthusiasts - -Researchers can quickly deploy state-of-the-art models for experimentation without managing complex infrastructure. The Hub provides access to optimized implementations of popular models like Stable Diffusion, LLMs, and computer vision systems, allowing for rapid prototyping and iteration. This accessibility democratizes AI research by reducing the technical barriers to working with cutting-edge models. - -### For individual developers - -Individual developers benefit from the ability to experiment with different AI models and approaches without extensive setup time. The Hub provides an opportunity to learn from well-structured projects. Repos are designed to optimize resource usage, helping developers minimize costs while maximizing performance and potential earnings. In addition, developers who publish their own repos to the Hub can earn credits by participating in the [revenue sharing](/hub/revenue-sharing) program. - -### For enterprises and teams +## How Hub repos work -Enterprises and teams can accelerate their development cycle by using preconfigured repos instead of creating everything from scratch. 
The Hub reduces infrastructure complexity by providing standardized deployment configurations, allowing technical teams to focus on their core business logic rather than spending time configuring infrastructure and dependencies. For organizations that contribute their own repos, the [revenue sharing](/hub/revenue-sharing) program offers an additional incentive, enabling teams to earn credits as their solutions are adopted and used by the wider community. +1. **Repository integration**: Connects with GitHub repos using releases (not commits) for versioning. +2. **Configuration**: Repos use `hub.json` and `tests.json` in a `.runpod` directory to define metadata and test procedures. +3. **Automated pipeline**: The Hub builds and tests repos on submission and monitors for new releases. +4. **Deployment**: Users browse, customize, and deploy with minimal configuration. diff --git a/instant-clusters.mdx b/instant-clusters.mdx index 960fa955..8ed9816a 100644 --- a/instant-clusters.mdx +++ b/instant-clusters.mdx @@ -9,44 +9,30 @@ import { DataCenterTooltip, PyTorchTooltip, TrainingTooltip, InferenceTooltip, S
-Runpod Instant Clusters provide fully managed compute clusters with high-performance networking for distributed workloads. Deploy multi-node jobs or large-scale AI without managing infrastructure, networking, or cluster configuration. +Instant Clusters provide fully managed multi-node compute with high-performance networking for distributed workloads. Deploy training jobs or large-scale inference without managing infrastructure, networking, or cluster configuration. -## Why use Instant Clusters? - -- **Scale beyond single machines.** Train models too large for one GPU, or accelerate training by distributing across multiple nodes. -- **High-speed networking included.** Clusters include 1600-3200 Gbps networking between nodes, enabling efficient gradient synchronization and data movement. -- **Zero configuration.** Clusters come pre-configured with static IPs, environment variables, and framework support. Start training immediately. -- **On-demand availability.** Deploy clusters in minutes and pay only for what you use. Scale up for intensive jobs, then release resources. - -## When to use Instant Clusters - -Instant Clusters offer distributed computing power beyond the capabilities of single-machine setups. Consider using Instant Clusters for: - -- **Multi-GPU language model training.** Accelerate training of models like Llama or GPT across multiple GPUs. -- **Large-scale computer vision projects.** Process massive imagery datasets for autonomous vehicles or medical analysis. -- **Scientific simulations.** Run climate, molecular dynamics, or physics simulations that require massive parallel processing. -- **Real-time AI inference.** Deploy production AI models that demand multiple GPUs for fast output. -- **Batch processing pipelines.** Create systems for large-scale data processing, including video rendering and genomics. +- **Scale beyond single machines**: Train models too large for one GPU, or accelerate training across multiple nodes. +- **High-speed networking**: 1600-3200 Gbps between nodes for efficient gradient synchronization and data movement. +- **Zero configuration**: Pre-configured static IPs, environment variables, and framework support. +- **On-demand**: Deploy in minutes, pay only for what you use. ## Get started -Choose the deployment guide that matches your preferred framework and use case: - - Set up a managed Slurm cluster for high-performance computing workloads. + Managed Slurm for HPC workloads. - - Set up multi-node PyTorch training for deep learning models. + + Multi-node PyTorch for deep learning. - - Use Axolotl's framework for fine-tuning large language models across multiple GPUs. + + Fine-tune LLMs across multiple GPUs. ## How it works -When you deploy an Instant Cluster, Runpod provisions multiple GPU nodes within the same and connects them with high-speed networking. One node is designated as the primary node, and all nodes receive pre-configured environment variables for distributed communication. +Runpod provisions multiple GPU nodes in the same data center, connected with high-speed networking. One node is designated primary (`NODE_RANK=0`), and all nodes receive pre-configured environment variables for distributed communication.
```mermaid @@ -85,9 +71,7 @@ flowchart TD ```
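Those pre-configured variables are what a distributed launcher consumes on each node. A minimal sketch of a per-node launch command, assuming a two-node cluster with 8 GPUs per node and assuming the standard PyTorch rendezvous variables (`MASTER_ADDR`, `MASTER_PORT`) are among the pre-set values alongside `NODE_RANK`:

```sh
# Run on every node. NODE_RANK is pre-set per node (0 on the primary).
# The node count and per-node process count are assumptions for a
# 2-node, 8-GPU-per-node cluster; adjust to your deployment.
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --node_rank=$NODE_RANK \
  --master_addr=$MASTER_ADDR \
  --master_port=$MASTER_PORT \
  main.py
```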
-The high-speed network interfaces (`ens1`-`ens8`) handle inter-node communication for distributed training frameworks like , , and . The `eth0` interface on the primary node handles external traffic like downloading models or datasets. - -For more details on environment variables and network configuration, see the [configuration reference](/instant-clusters/configuration). +The high-speed interfaces (`ens1`-`ens8`) handle inter-node communication for distributed training frameworks. The `eth0` interface on the primary node handles external traffic. See the [configuration reference](/instant-clusters/configuration) for environment variables and network details. ## Supported hardware @@ -102,18 +86,10 @@ For clusters larger than 8 nodes (up to 512 GPUs), [contact our sales team](http ## Pricing -Instant Cluster pricing is based on the GPU type and the number of nodes in your cluster. For current pricing, see the [Instant Clusters pricing page](https://www.runpod.io/pricing). +Pricing is based on GPU type and number of nodes. See [Instant Clusters pricing](https://www.runpod.io/pricing) for current rates. + +Custom pricing is available for enterprise workloads. [Contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA) for details. -All accounts have a default spending limit. To deploy a larger cluster, submit a support ticket at [help@runpod.io](mailto:help@runpod.io). +All accounts have a default spending limit. To deploy larger clusters, contact [help@runpod.io](mailto:help@runpod.io). - -## Next steps - -- [Configuration reference](/instant-clusters/configuration): Learn about environment variables, network interfaces, and NCCL configuration. -- [Deploy a Slurm cluster](/instant-clusters/slurm-clusters): Set up job scheduling for HPC workloads. -- [Deploy a PyTorch cluster](/instant-clusters/pytorch): Get started with distributed deep learning. - - -Runpod offers custom Instant Cluster pricing plans for large scale and enterprise workloads. If you're interested in learning more, [contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA). - \ No newline at end of file diff --git a/instant-clusters/pytorch.mdx b/instant-clusters/pytorch.mdx index 0d28ca64..ad913bcc 100644 --- a/instant-clusters/pytorch.mdx +++ b/instant-clusters/pytorch.mdx @@ -7,13 +7,6 @@ import { PyTorchTooltip } from "/snippets/tooltips.jsx"; This tutorial demonstrates how to use [Instant Clusters](/instant-clusters) with PyTorch to run distributed workloads across multiple GPUs. By leveraging PyTorch's distributed processing capabilities and Runpod's high-speed networking infrastructure, you can significantly accelerate your training process compared to single-GPU setups. -## What you'll learn - -- How to deploy an Instant Cluster with PyTorch -- How to initialize a distributed PyTorch environment using Runpod's pre-configured environment variables -- How to launch multi-node training with `torchrun` -- How local and global ranks map to GPUs across your cluster - ## Requirements - A Runpod account with sufficient credits for a multi-node cluster diff --git a/instant-clusters/slurm-clusters.mdx b/instant-clusters/slurm-clusters.mdx index d61b4ffa..cd1f0238 100644 --- a/instant-clusters/slurm-clusters.mdx +++ b/instant-clusters/slurm-clusters.mdx @@ -6,17 +6,15 @@ description: Deploy Slurm Clusters on Runpod with zero configuration Runpod Slurm Clusters provide a managed high-performance computing and scheduling solution that enables you to rapidly create and manage Slurm Clusters with minimal setup. 
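Because Slurm and munge come pre-configured (see the feature list below), an ordinary batch script is enough for a smoke test. A minimal sketch, assuming a two-node cluster — these are standard Slurm directives, nothing Runpod-specific:

```sh
#!/bin/bash
#SBATCH --job-name=smoke-test
#SBATCH --nodes=2
#SBATCH --output=smoke-test_%j.out

# Launch one task per node and print each node's hostname.
srun hostname
```

Submit it with `sbatch smoke-test.sh` and watch the queue with `squeue`.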
-For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html). - -## Key features - -Slurm Clusters eliminate the traditional complexity of cluster orchestration by providing: +Slurm Clusters provide: - **Zero configuration setup:** Slurm and munge are pre-installed and fully configured. - **Instant provisioning:** Clusters deploy rapidly with minimal setup. - **Automatic role assignment:** Runpod automatically designates controller and agent nodes. -- **Built-in optimizations:** Pre-configured for optimal NCCL performance. -- **Full Slurm compatibility:** All standard Slurm commands work out-of-the-box. + +For more information on working with Slurm, refer to the [Slurm documentation](https://slurm.schedmd.com/documentation.html). + +## Deploy a Slurm Cluster @@ -24,8 +22,6 @@ If you prefer to manually configure your Slurm deployment, see [Deploy an Instan -## Deploy a Slurm Cluster - 1. Open the [Instant Clusters page](https://console.runpod.io/cluster) on the Runpod console. 2. Click **Create Cluster**. 3. Select **Slurm Cluster** from the cluster type dropdown menu. diff --git a/instant-clusters/slurm.mdx b/instant-clusters/slurm.mdx index 3ab4f4af..6e2386a2 100644 --- a/instant-clusters/slurm.mdx +++ b/instant-clusters/slurm.mdx @@ -11,9 +11,7 @@ This guide is for advanced users who want to configure and manage their own Slur -This tutorial demonstrates how to configure Runpod Instant Clusters with to manage and schedule distributed workloads across multiple nodes. Slurm is a popular open-source job scheduler that provides a framework for job management, scheduling, and resource allocation in high-performance computing environments. By leveraging Slurm on Runpod's high-speed networking infrastructure, you can efficiently manage complex workloads across multiple GPUs. - -Follow the steps below to deploy a cluster and start running distributed Slurm workloads efficiently. +This tutorial shows how to configure Runpod Instant Clusters with Slurm to manage and schedule distributed workloads across multiple nodes. ## Requirements diff --git a/pods/choose-a-pod.mdx b/pods/choose-a-pod.mdx index 9a518da8..17d2a014 100644 --- a/pods/choose-a-pod.mdx +++ b/pods/choose-a-pod.mdx @@ -4,74 +4,93 @@ description: "Select the right Pod by evaluating your resource requirements." sidebar_position: 3 --- -import { CUDATooltip, InferenceTooltip, TrainingTooltip } from "/snippets/tooltips.jsx"; +Selecting the right Pod configuration maximizes performance and cost efficiency for your workload. This guide helps you match your requirements to the right GPU, VRAM, and storage configuration. -Selecting the appropriate Pod configuration is a crucial step in maximizing performance and efficiency for your specific workloads. This guide will help you understand the key factors to consider when choosing a Pod that meets your requirements. ## Quick selection by workload -## Understanding your workload needs +Start by identifying your primary workload type: -Before selecting a Pod, take time to analyze your specific project requirements. 
Different applications have varying demands for computing resources: +| Workload | Recommended GPU tier | Minimum VRAM | Notes | +|----------|---------------------|--------------|-------| +| **LLM inference** (7B–13B params) | Mid-range (RTX 4090, L4) | 24 GB | Sufficient for most quantized models | +| **LLM inference** (30B–70B params) | High-end (A100, H100) | 48–80 GB | May require multi-GPU setup | +| **LLM training/fine-tuning** | High-end (A100, H100) | 40–80 GB | Memory bandwidth critical | +| **Image generation** (SDXL, Flux) | Mid-range (RTX 4090, L4) | 16–24 GB | Benefits from fast inference | +| **Computer vision** | Entry to mid-range | 8–16 GB | Depends on model and batch size | +| **3D rendering** | Mid-range with RT cores | 16–24 GB | RT cores accelerate ray tracing | +| **Data processing** | CPU-focused or entry GPU | 8 GB+ | Prioritize CPU cores and RAM | -- Machine learning models require sufficient VRAM and powerful GPUs. -- Data processing tasks benefit from higher CPU core counts and RAM. -- Rendering workloads need both strong GPU capabilities and adequate storage. +For a full list of available GPUs and their specifications, see [GPU types](/references/gpu-types). -For machine learning models, check the model's documentation on platforms like Hugging Face or review the `config.json` file to understand its resource requirements. +## Estimate VRAM requirements -## Resource assessment tools +VRAM is the most common bottleneck. Use these guidelines: -There are several online tools that can help you estimate your resource requirements: +**For LLMs:** Allocate approximately **2 GB of VRAM per billion parameters**. For example: +- 7B model → ~14 GB VRAM +- 13B model → ~26 GB VRAM +- 70B model → ~140 GB VRAM (requires multi-GPU) -- [Hugging Face's Model Memory Usage Calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) provides memory estimates for transformer models. -- [Vokturz's Can it run LLM calculator](https://huggingface.co/spaces/Vokturz/can-it-run-llm) helps determine if your hardware can run specific language models. -- [Alexander Smirnov's VRAM Estimator](https://vram.asmirnov.xyz) offers GPU memory requirement approximations. + +Quantization reduces VRAM requirements significantly. A 4-bit quantized 70B model can run on ~35 GB VRAM. + -## Key factors to consider +**For image models:** SDXL requires ~8 GB minimum, but 16–24 GB provides headroom for larger batch sizes and LoRA training. -### GPU selection +### Resource calculators -The GPU is the cornerstone of computational performance for many workloads. When selecting your GPU, consider the architecture that best suits your software requirements. NVIDIA GPUs with support are essential for most machine learning frameworks, while some applications might perform better on specific GPU generations. Evaluate both the raw computing power (CUDA cores, tensor cores) and the memory bandwidth to ensure optimal performance for your specific tasks. +Use these tools to estimate your specific requirements: -For machine learning , a mid-range GPU might be sufficient, while large models requires more powerful options. Check framework-specific recommendations, as PyTorch, TensorFlow, and other frameworks may perform differently across GPU types. 
+- [Hugging Face Model Memory Calculator](https://huggingface.co/spaces/hf-accelerate/model-memory-usage): Memory estimates for transformer models +- [Can it run LLM?](https://huggingface.co/spaces/Vokturz/can-it-run-llm): Check if hardware can run specific language models +- [VRAM Estimator](https://vram.asmirnov.xyz): GPU memory requirement approximations -For a full list of available GPUs, see [GPU types](/references/gpu-types). +## Storage configuration -### VRAM requirements +Choose storage based on your data persistence needs: -VRAM (video RAM) is the dedicated memory on your GPU that stores data being processed. Insufficient VRAM can severely limit your ability to work with large models or datasets. +| Storage type | Persists after stop? | Persists after delete? | Best for | +|--------------|---------------------|------------------------|----------| +| **Container disk** | No | No | OS, temporary files | +| **Volume disk** | Yes | No | Working files, checkpoints | +| **Network volume** | Yes | Yes | Datasets, model weights, long-term storage | -For machine learning models, VRAM requirements increase with model size, batch size, and input dimensions. When working with LLMs, a general guideline is to **allocate approximately 2GB of VRAM per billion parameters**. For example, running a 13-billion parameter model efficiently would require around 26GB of VRAM. Following this guideline helps ensure smooth model operation and prevents out-of-memory errors. +For data-intensive workloads, ensure sufficient volume disk or network volume capacity for your datasets, model weights, and output files. -### Storage configuration +## Optimize for cost -Your storage configuration affects both data access speeds and your ability to maintain persistent workspaces. Runpod offers both temporary and persistent [storage options](/pods/storage/types). +1. **Right-size your resources**: Start with the minimum viable configuration, then scale up based on actual usage. Development and testing often need less power than production. -When determining your storage needs, account for raw data size, intermediate files generated during processing, and space for output results. For data-intensive workloads, prioritize both capacity and speed to avoid bottlenecks. +2. **Use spot instances**: For fault-tolerant workloads like batch processing or training with checkpoints, spot instances offer significant savings. -## Balancing performance and cost - -When selecting a Pod, consider these strategies for balancing performance and cost: - -1. Use right-sized resources for your workload. For development and testing, a smaller Pod configuration may be sufficient, while production workloads might require more powerful options. - -2. Take advantage of spot instances for non-critical or fault-tolerant workloads to reduce costs. For consistent availability needs, on-demand or reserved Pods provide greater reliability. - -3. For extended usage, explore Runpod's [savings plans](/pods/pricing#savings-plans) to optimize your spending while ensuring access to the resources you need. +3. **Consider savings plans**: For extended usage, Runpod's [savings plans](/pods/pricing#savings-plans) reduce costs for committed usage. ## Secure Cloud vs Community Cloud -Secure Cloud operates in T3/T4 data centers with high reliability, redundancy, security, and fast response times to minimize downtime. It's designed for sensitive and enterprise workloads. 
+| | Secure Cloud | Community Cloud | +|---|--------------|-----------------| +| **Infrastructure** | T3/T4 data centers | Peer-to-peer providers | +| **Reliability** | High redundancy | Variable | +| **Best for** | Production, sensitive data | Cost-sensitive workloads | +| **Pricing** | Standard | Competitive | -Community Cloud connects individual compute providers to users through a peer-to-peer GPU computing platform. Community Cloud offers competitive pricing with good server quality, though with less redundancy for power and networking compared to Secure Cloud. - -Runpod is no longer accepting new hosts for Community Cloud. Existing Community Cloud resources remain available to users. + +Runpod is no longer accepting new hosts for Community Cloud. Existing Community Cloud resources remain available. + ## Next steps -Once you've determined your resource requirements, you can learn how to: - -- [Deploy a Pod](/get-started). -- [Manage your Pods](/pods/manage-pods). -- [Connect to a Pod](/pods/connect-to-a-pod). - -Remember that you can always deploy a new Pod if your requirements evolve. Start with a configuration that meets your immediate needs, then scale up or down based on actual usage patterns and performance metrics. + + + Create your first Pod with your chosen configuration. + + + Compare all available GPUs and specifications. + + + Learn more about storage types and pricing. + + + Learn how to create, start, stop, and delete Pods. + + diff --git a/pods/configuration/use-ssh.mdx b/pods/configuration/use-ssh.mdx index a49cf17e..49a0ee85 100644 --- a/pods/configuration/use-ssh.mdx +++ b/pods/configuration/use-ssh.mdx @@ -4,11 +4,17 @@ sidebarTitle: "Connect with SSH" description: "Manage Pods from your local machine using SSH." --- -Connecting to a Pod through an SSH (Secure Shell) terminal provides a secure and reliable method for interacting with your instance. Use this to manage long-running processes, critical tasks, or when you need the full capabilities of a shell environment. +SSH provides secure, reliable access to your Pod for long-running processes and full shell capabilities. -Every Pod offers the ability to connect through SSH using the [basic proxy method](#basic-ssh-with-key-authentication) below (which does not support commands like SCP or SFTP), but not all Pods support the [full public IP method](#full-ssh-via-public-ip-with-key-authentication). +## Connection methods -You can also SSH into a Pod using a [password-based method](#password-based-ssh) if you want a simple and fast way to enable SSH access without setting up SSH keys. SSH key authentication is recommended for most use cases, as it provides greater security and convenience for repeated use. +| Method | SCP/SFTP support | Setup | Best for | +|--------|------------------|-------|----------| +| [**Basic SSH**](#basic-ssh-with-key-authentication) | No | SSH key | Quick access, most Pods | +| [**Full SSH (public IP)**](#full-ssh-via-public-ip-with-key-authentication) | Yes | SSH key + public IP | File transfers, full SSH features | +| [**Password-based**](#password-based-ssh) | Yes | Script + public IP | Quick setup, temporary access | + +SSH key authentication is recommended for security and convenience. 
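Once a key is registered (next section), the full public-IP method resolves to an ordinary OpenSSH invocation. An illustrative command with placeholder values — the exact IP and port come from your Pod's **Connect** menu:

```sh
# POD_PUBLIC_IP and PORT are placeholders; copy the real values from the
# Pod's Connect menu. The key path assumes an ed25519 key in the default location.
ssh root@POD_PUBLIC_IP -p PORT -i ~/.ssh/id_ed25519
```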
## Generate an SSH key and add it to your Runpod account diff --git a/pods/connect-to-a-pod.mdx b/pods/connect-to-a-pod.mdx index 976186dd..864f7edd 100644 --- a/pods/connect-to-a-pod.mdx +++ b/pods/connect-to-a-pod.mdx @@ -1,52 +1,57 @@ --- title: "Connection options" sidebarTitle: "Connection options" -description: "Explore our Pod connection options, including the web terminal, SSH, JupyterLab, and VSCode/Cursor." +description: "Explore Pod connection options, including the web terminal, SSH, JupyterLab, and VSCode/Cursor." +mode: "wide" --- - + +
+ -## Web terminal connection +Choose a connection method based on your workflow: + +| Method | Best for | Persistence | Setup | +|--------|----------|-------------|-------| +| **Web terminal** | Quick commands, debugging | Session-based | None | +| **SSH** | Long-running processes, reliable access | Persistent | SSH client | +| **JupyterLab** | Data science, notebooks | Session-based | Template-dependent | +| **VS Code/Cursor** | Full development environment | Persistent | Extension | -The web terminal offers a convenient, browser-based method to quickly connect to your Pod and run commands. However, it's not recommended for long-running processes, such as training an LLM, as the connection might not be as stable or persistent as a direct [SSH connection](#ssh-terminal-connection). +## Web terminal -The availability of the web terminal depends on the [Pod's template](/pods/templates/overview). +Browser-based terminal for quick access. Not recommended for long-running processes (use [SSH](#ssh) instead). -To connect using the web terminal: +1. Navigate to the [Pods page](https://console.runpod.io/pods). +2. Expand your Pod and click **Connect**. +3. Click **Start** if the terminal is stopped, then **Open Web Terminal**. -1. Navigate to the [Pods page](https://console.runpod.io/pods) in the Runpod console. -2. Expand the desired Pod and select **Connect**. -3. If your web terminal is **Stopped**, click **Start**. - - If clicking **Start** does nothing, try refreshing the page. - -4. Click **Open Web Terminal** to open a new tab in your browser with a web terminal session. + +If **Start** doesn't respond, refresh the page. + -## JupyterLab connection +## JupyterLab -JupyterLab provides an interactive, web-based environment for running code, managing files, and performing data analysis. Many Runpod templates, especially those geared towards machine learning and data science, come with JupyterLab pre-configured and accessible via HTTP. +Interactive web environment for code, files, and data analysis. Available on templates with JupyterLab pre-configured (e.g., "Runpod Pytorch"). -To connect to JupyterLab (if it's available on your Pod): +1. Deploy a Pod with a JupyterLab-compatible template (all official Runpod PyTorch templates have JupyterLab pre-configured). +2. Navigate to the [Pods page](https://console.runpod.io/pods) and click **Connect**. +3. Under **HTTP Services**, click the **Jupyter Lab** link (usually port 8888). -1. Deploy your Pod, ensuring that the template is configured to run JupyterLab. Official Runpod templates like "Runpod Pytorch" are usually compatible. -2. Once the Pod is running, navigate to the [Pods page](https://console.runpod.io/pods) in the Runpod console. -3. Find the Pod you created and click the **Connect** button. If it's grayed out, your Pod hasn't finished starting up yet. -4. In the window that opens, under **HTTP Services**, look for a link to **Jupyter Lab** (or a similarly named service on the configured HTTP port, often 8888). Click this link to open the JupyterLab workspace in your browser. - - If the JupyterLab tab displays a blank page for more than a minute or two, try restarting the Pod and opening it again. - -5. Once in JupyterLab, you can create new notebooks (e.g., under **Notebook**, select **Python 3 (ipykernel)**), upload files, and run code interactively. + +If the JupyterLab tab displays a blank page for more than a minute or two, try restarting the Pod and opening it again. 
+ -## SSH terminal connection +## SSH -Connecting to a Pod via an SSH (Secure Shell) terminal provides a secure and reliable method for interacting with your instance. To establish an SSH connection, you'll need an SSH client installed on your local machine. The exact command will vary slightly depending on whether you're using the basic proxy connection or a direct connection to a public IP. +Secure, reliable command-line access for long-running processes and development. -To learn more, see [Connect to a Pod with SSH](/pods/configuration/use-ssh). +See [Connect with SSH](/pods/configuration/use-ssh) for setup instructions. -## Connect to VSCode or Cursor +## VS Code / Cursor -For a more integrated development experience, you can connect directly to your Pod instance through Visual Studio Code (VSCode) or Cursor. This allows you to work within your Pod's volume directory as if the files were stored on your local machine, leveraging VSCode's or Cursor's powerful editing and debugging features. +Connect your local IDE directly to your Pod for a full development experience. -For a step-by-step guide, see [Connect to a Pod with VSCode or Cursor](/pods/configuration/connect-to-ide). \ No newline at end of file +See [Connect to VS Code or Cursor](/pods/configuration/connect-to-ide) for setup instructions. diff --git a/pods/manage-pods.mdx b/pods/manage-pods.mdx index 048e5420..7f80a09d 100644 --- a/pods/manage-pods.mdx +++ b/pods/manage-pods.mdx @@ -5,68 +5,46 @@ description: "Create, start, stop, and terminate Pods using the Runpod console o import { MachineTooltip, TemplatesTooltip } from "/snippets/tooltips.jsx"; -## Before you begin - -If you want to manage Pods using the Runpod CLI, you'll need to [install Runpod CLI](/runpodctl/overview), and set your [API key](/get-started/api-keys) in the configuration. - -Run the following command, replacing `RUNPOD_API_KEY` with your API key: +This page covers the core Pod management operations. For CLI usage, first [install the Runpod CLI](/runpodctl/overview) and configure your API key: ```sh runpodctl config --apiKey RUNPOD_API_KEY ``` +## Quick reference + +| Action | Web UI | CLI | +|--------|-----|-----| +| **Deploy** | [Pods page](https://www.console.runpod.io/pods) → Deploy | `runpodctl create pods --name NAME --gpuType "GPU" --imageName "IMAGE"` | +| **Start** | Expand Pod → Play icon | `runpodctl start pod POD_ID` | +| **Stop** | Expand Pod → Stop icon | `runpodctl stop pod POD_ID` | +| **Terminate** | Expand Pod → Trash icon | `runpodctl remove pod POD_ID` | +| **List** | [Pods page](https://www.console.runpod.io/pods) | `runpodctl get pod` | + ## Deploy a Pod -You can deploy preconfigured Pods from the repos listed in the [Runpod Hub](/hub/overview). For more info, see the [Hub deployment guide](/hub/overview#deploy-as-a-pod). +Deploy preconfigured Pods from the [Runpod Hub](/hub/overview#deploy-as-a-pod) for quick setup. - -To create a Pod using the Runpod console: -1. Open the [Pods page](https://www.console.runpod.io/pods) in the Runpod console and click the **Deploy** button. -2. (Optional) Specify a [network volume](/storage/network-volumes) if you need to share data between multiple Pods, or to save data for later use. -3. Select **GPU** or **CPU** using the buttons in the top-left corner of the window, and follow the configuration steps below. +1. Open the [Pods page](https://www.console.runpod.io/pods) and click **Deploy**. +2. (Optional) Attach a [network volume](/storage/network-volumes) for persistent storage. +3. 
Select **GPU** or **CPU**, then configure: -GPU configuration: +**GPU**: Select GPU type → Name your Pod → (Optional) Choose a template → Set GPU count → Click **Deploy On-Demand** -1. Select a graphics card (e.g., A40, RTX 4090, H100 SXM). -2. Give your Pod a name using the **Pod Name** field. -3. (Optional) Choose a **Pod Template** such as **Runpod Pytorch 2.1** or **Runpod Stable Diffusion**. -4. Specify your **GPU count** if you need multiple GPUs. -5. Click **Deploy On-Demand** to deploy and start your Pod. +**CPU**: Select CPU type → Choose instance configuration → Name your Pod → Click **Deploy On-Demand** -**CUDA Version Compatibility** - -When using (especially community templates like `runpod/pytorch:2.8.0-py3.11-cuda12.8.1-cudnn-devel-ubuntu22.04`), ensure the CUDA version of the host matches or exceeds the template's requirements. - -If you encounter errors like "OCI runtime create failed" or "unsatisfied condition: cuda>=X.X", you need to filter for compatible machines: - -1. Click **Additional filters** in the Pod creation interface -2. Click **CUDA Versions** filter dropdown -3. Select a CUDA version that matches or exceeds your template's requirements (e.g., if the template requires CUDA 12.8, select 12.8 or higher) - - - - - -**Note:** Check the template name or documentation for CUDA requirements. When in doubt, select the latest CUDA version as newer drivers are backward compatible. +**CUDA compatibility**: Ensure the host CUDA version matches your requirements. If you see "OCI runtime create failed" errors, use **Additional filters → CUDA Versions** to select compatible machines. -CPU configuration: - -1. Select a **CPU type** (e.g., CPU3/CPU5, Compute Optimized, General Purpose, Memory-Optimized). -2. Specify the number of CPUs and quantity of RAM for your Pod by selecting an **Instance Configuration**. -3. Give your Pod a name using the **Pod Name** field. -4. Click **Deploy On-Demand** to deploy and start your Pod. 
- - -To create a Pod using the CLI, use the `runpodctl create pods` command: + ```sh runpodctl create pods \ @@ -74,14 +52,12 @@ runpodctl create pods \ --gpuType "NVIDIA A40" \ --imageName "runpod/pytorch:3.10-2.0.0-117" \ --containerDiskSize 10 \ - --volumeSize 100 \ - --args "bash -c 'mkdir /testdir1 && /start.sh'" + --volumeSize 100 ``` -To create a Pod using the REST API, send a POST request to the `/pods` endpoint: ```bash curl --request POST \ @@ -89,270 +65,117 @@ curl --request POST \ --header 'Authorization: Bearer RUNPOD_API_KEY' \ --header 'Content-Type: application/json' \ --data '{ - "allowedCudaVersions": [ - "12.8" - ], - "cloudType": "SECURE", - "computeType": "GPU", - "containerDiskInGb": 50, - "containerRegistryAuthId": "clzdaifot0001l90809257ynb", - "countryCodes": [ - "US" - ], - "cpuFlavorIds": [ - "cpu3c" - ], - "cpuFlavorPriority": "availability", - "dataCenterIds": [ - "EU-RO-1", - "CA-MTL-1" - ], - "dataCenterPriority": "availability", - "dockerEntrypoint": [], - "dockerStartCmd": [], - "env": { - "ENV_VAR": "value" - }, - "globalNetworking": true, - "gpuCount": 1, - "gpuTypeIds": [ - "NVIDIA GeForce RTX 4090" - ], - "gpuTypePriority": "availability", - "imageName": "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04", - "interruptible": false, - "locked": false, - "minDiskBandwidthMBps": 123, - "minDownloadMbps": 123, - "minRAMPerGPU": 8, - "name": "my-pod", - "ports": [ - "8888/http", - "22/tcp" - ], - "supportPublicIp": true, - "vcpuCount": 2, - "volumeInGb": 20, - "volumeMountPath": "/workspace" -}' + "name": "my-pod", + "imageName": "runpod/pytorch:2.1.0-py3.10-cuda11.8.0-devel-ubuntu22.04", + "gpuTypeIds": ["NVIDIA GeForce RTX 4090"], + "gpuCount": 1, + "containerDiskInGb": 50, + "volumeInGb": 20 + }' ``` -For complete API documentation and parameter details, see the [Pod API reference](/api-reference/pods/POST/pods). +See the [Pod API reference](/api-reference/pods/POST/pods) for all parameters. - -### Custom templates - -Runpod supports custom [Pod templates](/pods/templates/overview) that let you define your environment using a Dockerfile. - -With custom templates, you can: - -* Install specific dependencies and packages. -* Configure your development environment. -* Create [portable Docker images](/tutorials/introduction/containers) that work consistently across deployments. -* Share environments with team members for collaborative work. - ## Stop a Pod - -If your Pod has a [network volume](/storage/network-volumes) attached, it cannot be stopped, only terminated. When you terminate the Pod, data in the `/workspace` directory will be preserved in the network volume, and you can regain access by deploying a new Pod with the same network volume attached. - - -When a Pod is stopped, data in the container disk is cleared, but data in the `/workspace` directory is preserved. To learn more about how Pod storage works, see [Storage overview](/pods/storage/types). - -By stopping a Pod you are effectively releasing the GPU on the machine, and you may be reallocated [zero GPUs](/references/troubleshooting/zero-gpus) when you start the Pod again. +Stopping a Pod releases the GPU and preserves data in `/workspace` (volume disk). Container disk data is cleared. -After a Pod is stopped, you will still be charged for its [volume disk](/pods/storage/types#volume-disk) storage. If you don't need to retain your Pod environment, you should terminate it completely. +You'll still be charged for [volume disk storage](/pods/storage/types#volume-disk) while stopped. 
Terminate the Pod if you don't need to retain your environment.
+
+Pods with [network volumes](/storage/network-volumes) attached cannot be stopped, only terminated. Your `/workspace` data is preserved in the network volume.
+
+

-To stop a Pod:

-1. Open the [Pods page](https://www.console.runpod.io/pods).
-2. Find the Pod you want to stop and expand it.
-3. Click the **Stop button** (square icon).
-4. Confirm by clicking the **Stop Pod** button.
+1. Open the [Pods page](https://www.console.runpod.io/pods) and expand your Pod.
+2. Click the **Stop** button (square icon) and confirm.


-To stop a Pod, enter the following command.
+

```sh
runpodctl stop pod $RUNPOD_POD_ID
```

-Example output:
-
-```sh
-pod "gq9xijdra9hwyd" stopped
-```
-
-
-
-
-
-
-### Stop a Pod after a period of time
-
-You can also stop a Pod after a specified period of time. The examples below show how to use the CLI or [web terminal](/pods/connect-to-a-pod#web-terminal) to schedule a Pod to stop after 2 hours of runtime.
-
-
-
-Use the following command to stop a Pod after 2 hours:
-
+**Schedule a stop** (e.g., after 2 hours):

```sh
-sleep 2h; runpodctl stop pod $RUNPOD_POD_ID &
+(sleep 2h; runpodctl stop pod $RUNPOD_POD_ID) &
```

+The parentheses run the timer and the stop command together in a background subshell, so your terminal stays free while the two hours count down.
+
-This command uses sleep to wait for 2 hours before executing the `runpodctl stop pod` command to stop the Pod. The `&` at the end runs the command in the background, allowing you to continue using the SSH session.
-
-
-
-
-To stop a Pod after 2 hours using the web terminal, enter:
-
-```sh
-nohup bash -c "sleep 2h; runpodctl stop pod $RUNPOD_POD_ID" &
-```
-
-`nohup` ensures the process continues running if you close the web terminal window.
-

## Start a Pod

-Pods start as soon as they are created, but you can resume a Pod that has been stopped.
+Resume a stopped Pod at any time. Note that you may be allocated [zero GPUs](/references/troubleshooting/zero-gpus) if capacity has changed.

-To start a Pod:

-1. Open the [Pods page](https://www.console.runpod.io/pods).
-2. Find the Pod you want to start and expand it.
-3. Click the **Start** button (play icon).
+1. Open the [Pods page](https://www.console.runpod.io/pods) and expand your Pod.
+2. Click the **Start** button (play icon).


-To start a single Pod, enter the command `runpodctl start pod`. You can pass the environment variable `RUNPOD_POD_ID` to identify each Pod.
+

```sh
runpodctl start pod $RUNPOD_POD_ID
```

-Example output:
-
-```sh
-pod "wu5ekmn69oh1xr" started with $0.290 / hr
-```
-
-
-
-

## Terminate a Pod

-
-Terminating a Pod permanently deletes all associated data that isn't stored in a [network volume](/storage/network-volumes). Be sure to export or download any data that you'll need to access again.
-
+Terminating permanently deletes all data not stored in a [network volume](/storage/network-volumes). Export important data first.


-To terminate a Pod:

-1. Open the [Pods page](https://www.console.runpod.io/pods).
-2. Find the Pod you want to terminate and expand it.
-3. [Stop the Pod](#stop-a-pod) if it's running.
-4. Click the **Terminate** button (trash icon).
-5. Confirm by clicking the **Yes** button.
+1. Open the [Pods page](https://www.console.runpod.io/pods) and expand your Pod.
+2. Stop the Pod if running, then click **Terminate** (trash icon) and confirm.


-To remove a single Pod, enter the following command.
+

```sh
+# Single Pod
runpodctl remove pod $RUNPOD_POD_ID
-```
-
-Example output:
-
-```sh
-pod "wu5ekmn69oh1xr" removed
-```

You can also remove Pods in bulk. For example, the following command terminates up to 40 Pods with the name `my-bulk-task`.
- -```sh +# Bulk remove by name runpodctl remove pods my-bulk-task --podCount 40 ``` -You can also terminate a Pod by name: - -```sh -runpodctl remove pods [POD_NAME] -``` - - -## View Pod details - -You can find a list of all your Pods on the [Pods page](https://www.console.runpod.io/pods) of the web interface. - -If you're using the CLI, use the following command to list your Pods: - -```sh -runpodctl get pod -``` - -Or use this command to get the details of a single Pod: - -```sh -runpodctl get pod [POD_ID] -``` +## View logs -## Access logs +Pods provide two log types: -Pods provide two types of logs to help you monitor and troubleshoot your workloads: +- **Container logs**: Application output (stdout) +- **System logs**: Pod lifecycle events (startup, shutdown, errors) -- **Container logs** capture all output sent to your console standard output, including application logs and print statements. -- **System logs** provide detailed information about your Pod's lifecycle, such as container creation, image download, extraction, startup, and shutdown events. - -To view your logs, open the [Pods page](https://www.console.runpod.io/pods), expand your Pod, and click the **Logs** button. This gives you real-time access to both container and system logs, making it easy to diagnose issues or monitor your Pod's activity. +Access logs from the [Pods page](https://www.console.runpod.io/pods) by expanding your Pod and clicking **Logs**. ## Troubleshooting -Below are some common issues and solutions for troubleshooting Pod deployments. - -### Zero GPU Pods - -See [Zero GPU Pods on restart](/references/troubleshooting/zero-gpus). - -### Pod stuck on initializing - -If your Pod is stuck on initializing, check for these common issues: - -1. You're trying to SSH into the Pod but didn't provide an idle job like `sleep infinity` to keep it running. -2. The Pod received a command it can't execute. Check your logs for syntax errors or invalid commands. - -If you need help, [contact support](https://www.runpod.io/contact). - -### Docker daemon limitations - -Runpod manages the Docker daemon for you, which means you can't run your own Docker instance inside a Pod. This prevents you from building Docker containers or using tools like Docker Compose. - -To work around this, create a [custom template](/pods/templates/overview) with the Docker image you need. +| Issue | Solution | +|-------|----------| +| **Zero GPUs on restart** | See [Zero GPU Pods](/references/troubleshooting/zero-gpus) | +| **Pod stuck initializing** | Check logs for command errors; ensure you have an idle job (e.g., `sleep infinity`) if using SSH | +| **Docker Compose not working** | Not supported. Use a [custom template](/pods/templates/overview) with your dependencies baked in. | +Need help? [Contact support](https://www.runpod.io/contact). diff --git a/pods/overview.mdx b/pods/overview.mdx index f077225a..e8b63956 100644 --- a/pods/overview.mdx +++ b/pods/overview.mdx @@ -1,127 +1,78 @@ --- title: Overview description: "Get on-demand access to powerful computing resources." +mode: "wide" --- -import { NetworkVolumeTooltip, PodContainerDiskTooltip, VolumeDiskTooltip, ServerlessTooltip, RunpodHubTooltip, TemplatesTooltip, InferenceTooltip, TrainingTooltip, FineTuningTooltip, MachinesTooltip } from "/snippets/tooltips.jsx"; +import { NetworkVolumeTooltip, PodContainerDiskTooltip, VolumeDiskTooltip, ServerlessTooltip, RunpodHubTooltip, TrainingTooltip, FineTuningTooltip } from "/snippets/tooltips.jsx"; - - - +
-Pods provide instant access to powerful GPU and CPU resources for AI <TrainingTooltip />, <FineTuningTooltip />, rendering, and other compute-intensive workloads.
+Pods provide instant access to powerful GPU and CPU resources for AI <TrainingTooltip />, <FineTuningTooltip />, rendering, and other compute-intensive workloads. You have full control over your computing environment, allowing you to customize software, storage, and networking to match your exact requirements.

-You have full control over your computing environment, allowing you to customize software, storage, and networking to match your exact requirements.
+## Get started

-When you're ready to get started, [follow this tutorial](/get-started) to create an account and deploy your first Pod.
+
+
+Create an account and deploy your first Pod.
+
+
+Select the right GPU type and configuration for your workload.
+
+
+Access your Pod via SSH, JupyterLab, or VS Code.
+
+

-## Key components
+## Concepts

-Each Pod consists of these core components:
+### [Templates](/pods/templates/overview)

-- **Container environment**: An Ubuntu Linux-based [container](/tutorials/introduction/containers) that can run almost any compatible software.
-- **Unique identifier**: Each Pod receives a dynamic ID (e.g., `2s56cp0pof1rmt`) for management and access.
-- [Storage](#storage-options):
-  - <PodContainerDiskTooltip />: Houses the operating system and temporary storage.
-  - <VolumeDiskTooltip />: Persistent storage that is preserved between Pod starts and stops.
-  - <NetworkVolumeTooltip />: Permanent, portable storage that can be moved between <MachinesTooltip /> and persists even after Pod deletion.
-- **Hardware resources**: Allocated vCPU, system RAM, and multiple GPUs (based on your selection).
-- **Network connectivity**: A proxy connection enabling web access to any [exposed port](/pods/configuration/expose-ports) on your container.
+Pre-configured [Docker image](/tutorials/introduction/containers#what-are-images) setups that let you quickly spin up Pods without manual environment configuration. Instead of installing PyTorch, configuring JupyterLab, and setting up all dependencies yourself, you can select an official Runpod PyTorch template and have everything ready to go instantly.

-## Pod templates
+### [Storage](/pods/storage/types)

-Pod <TemplatesTooltip /> are pre-configured [Docker image](/tutorials/introduction/containers#what-are-images) setups that let you quickly spin up Pods without manual environment configuration. They're essentially deployment configurations that include specific models, frameworks, or workflows bundled together.
+Pods offer three types of storage: <PodContainerDiskTooltip /> for temporary files, <VolumeDiskTooltip /> for persistent storage throughout the Pod's lease, and optional <NetworkVolumeTooltip />s for permanent storage that can be transferred between Pods.

-Templates eliminate the need to manually set up environments, saving time and reducing configuration errors. For example, instead of installing PyTorch, configuring JupyterLab, and setting up all dependencies yourself, you can select an official Runpod PyTorch template and have everything ready to go instantly.
+### [Connection](/pods/connect-to-a-pod)

-To learn how to create your own custom templates, see [Build a custom Pod template](/pods/templates/create-custom-template). If you're new to Docker, start with the [introduction to containers](/tutorials/introduction/containers) tutorial series.
-
-## Storage
-
-Pods offer three types of storage to match different use cases:
-
-Every Pod comes with a resizable <PodContainerDiskTooltip /> that houses the operating system and stores temporary files, which are cleared after the Pod stops.
-
-By contrast, <VolumeDiskTooltip />s provide persistent storage that is preserved throughout the Pod's lease, functioning like a dedicated hard drive. Data stored in the volume disk directory (`/workspace` by default) persists when you stop the Pod, but is erased when the Pod is deleted.
-
-Optional <NetworkVolumeTooltip />s provide more flexible permanent storage that can be transferred between Pods, replacing the volume disk when attached. When using a Pod with network volume attached, you can safely delete your Pod without losing the data stored in your network volume directory (`/workspace` by default).
-
-To learn more, see [Storage options](/pods/storage/types).
+Once deployed, you can connect to your Pod through SSH for command-line access, web proxy for [exposed web services](/pods/configuration/expose-ports), JupyterLab for data science workflows, or [VS Code/Cursor](/pods/configuration/connect-to-ide) for local IDE integration.

## Deployment options

You can deploy Pods in several ways:

- [From a template](/pods/templates/overview): Pre-configured environments for quick setup of common workflows.
-- **Custom containers**: Pull from any compatible container registry such as Docker Hub, GitHub Container Registry, or Amazon ECR.
-- **Custom images**: [Build and deploy your own container images](/tutorials/introduction/containers/create-dockerfiles).
-- [From Serverless repos](/hub/overview#deploy-as-a-pod): Deploy any <ServerlessTooltip />-compatible repository from the <RunpodHubTooltip /> directly as a Pod, providing a cost-effective option for consistent workloads.
-
-
-When building a container image for Runpod on a Mac (Apple Silicon), use the flag `--platform linux/amd64` to ensure your image is compatible with the platform. Learn more about [building Docker images](/tutorials/introduction/containers/create-dockerfiles#building-for-runpod).
-
-
-## Connecting to your Pod
-
-Once deployed, you can [connect to your Pod](/pods/connect-to-a-pod) through:
-
-- **SSH**: Direct [command-line access](/pods/configuration/use-ssh) for development and management.
-- **Web proxy**: HTTP access to [exposed web services](/pods/configuration/expose-ports) via URLs in the format `https://[pod-id]-[port].proxy.runpod.net`.
-- **JupyterLab**: A web-based IDE for data science and machine learning.
-- **VSCode/Cursor**: [Connect to your Pod with VSCode or Cursor](/pods/configuration/connect-to-ide), working within your volume directory as if the files were stored on your local machine.
-
-## Data transfer
-
-You can sync your Pod's data with [most major cloud providers](/pods/storage/cloud-sync), and transfer data to your local machine using the [Runpod CLI](/runpodctl/overview).
-
-To learn more about all available options, see [Transfer files](/pods/storage/transfer-files).
-
-## Customization options
-
-Pods offer extensive customization to match your specific requirements.
-
-You can select your preferred [GPU type](/references/gpu-types) and quantity, adjust system disk size, and specify your container image.
-
-Additionally, you can configure custom start commands, set [environment variables](/pods/references/environment-variables), define [exposed HTTP/TCP ports](/pods/configuration/expose-ports), and implement various [storage configurations](pods/storage/types) to optimize your Pod for your specific workload.
+- **Custom containers**: Pull from any compatible container registry such as Docker Hub, GitHub Container Registry, or Amazon ECR. Learn more about [creating your own container images](/tutorials/introduction/containers/create-dockerfiles).
+- [From Serverless repos](/hub/overview#deploy-as-a-pod): Deploy any <ServerlessTooltip />-compatible repository from the <RunpodHubTooltip /> directly as a Pod.

## Pod types

-Runpod offers two types of Pod:
+Runpod offers two cloud options:

- **Secure Cloud:** Operates in T3/T4 data centers, providing high reliability and security for enterprise and production workloads.
- **Community Cloud:** Connects individual compute providers to users through a vetted, secure peer-to-peer system, with competitive pricing options.

-## Deploy a Pod
-
-Follow these steps to deploy a Pod:
-
-1. [Choose a Pod](/pods/choose-a-pod) based on your computing needs and budget.
-2. Navigate to the [Pod creation page](https://console.runpod.io/pod/create).
-3. Configure your Pod settings, including GPU type, storage, and networking options.
-4. Launch your Pod and connect using SSH, JupyterLab, or your preferred remote access method.
-5. [Manage your Pod](/pods/manage-pods) through the Runpod console.
-
## Pricing

-Pods are billed by the minute with no fees for ingress/egress. Runpod also offers long-term [savings plans](/pods/pricing#savings-plans) for extended usage patterns. See [Pod pricing](/pods/pricing) for details.
+Pods are billed by the second with no fees for ingress/egress. Runpod also offers long-term [savings plans](/pods/pricing#savings-plans) for extended usage patterns. See [Pod pricing](/pods/pricing) for details.

## Limitations

-**Docker Compose is not supported:** Runpod runs Docker for you, so you cannot spin up your own Docker instance or use Docker Compose on Pods. If your workflow requires Docker Compose, create a custom template with a pre-built Docker image that contains all necessary components.
-
-**UDP connections are not supported:** Pods only support TCP and HTTP connections. If your application relies on UDP, you'll need to modify your application to use TCP-based communication instead.
-
-**Windows support:** Pods do not currently support Windows.
-
-## Next steps
-
-Ready to get started? Explore these pages to learn more:
-
-* [Deploy your first Pod](/get-started) using this tutorial.
-* [Choose a Pod](/pods/choose-a-pod) based on your requirements.
-* Learn how to [connect to your Pod](/pods/connect-to-a-pod) after deployment.
-* Learn how to [manage your Pods](/pods/manage-pods) using the console and CLI.
-* Set up [persistent storage](/pods/storage/types) for your data.
-* Configure [global networking](/pods/networking) for your applications.
-* [Set up Ollama on a Pod](/tutorials/pods/run-ollama) to run LLM with HTTP API access.
-* [Build Docker images with Bazel](/tutorials/pods/build-docker-images) to emulate a Docker-in-Docker workflow.
+- **Docker Compose is not supported:** Runpod runs Docker for you, so you cannot spin up your own Docker instance or use Docker Compose on Pods.
+- **UDP connections are not supported:** Pods only support TCP and HTTP connections.
+- **Windows is not supported:** Pods do not currently support Windows.
+
+## Tutorials
+
+
+
+Run LLM inference with HTTP API access.
+
+
+Emulate a Docker-in-Docker workflow.
+
+
+Build your own reusable Pod template.
+
+
diff --git a/pods/pricing.mdx b/pods/pricing.mdx
index c17b076a..2df558bd 100644
--- a/pods/pricing.mdx
+++ b/pods/pricing.mdx
@@ -2,170 +2,73 @@
title: "Pricing"
sidebarTitle: "Pricing"
description: "Explore pricing options for Pods, including on-demand, savings plans, and spot instances."
+mode: "wide"
---

import { MachineTooltip } from "/snippets/tooltips.jsx";

-
-
-Runpod offers custom pricing plans for large scale and enterprise workloads. If you're interested in learning more, [contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA).
+
+
+Runpod offers custom pricing plans for large-scale and enterprise workloads. [Contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA) to learn more.

-Runpod offers multiple flexible pricing options for Pods, designed to accommodate a variety of workloads and budgets.
-
-
-## How billing works
-
-All Pods are billed by the second for compute and storage, with no additional fees for data ingress or egress. Every Pod has an hourly cost based on its [GPU type](/references/gpu-types) or CPU configuration, and your Runpod credits are charged for the Pod every second it is active.
-
-You can find the hourly cost of a specific GPU configuration on the [Runpod console](https://www.console.runpod.io/pods) during Pod deployment.
-
-
-If your account balance is projected to cover less than 10 seconds of remaining run time for your active Pods, Runpod will pre-emptively stop all your Pods. This is to ensure your account retains a small balance, which can help preserve your data volumes. If your balance is completely drained, all Pods are subject to deletion at the discretion of the Runpod system. We highly recommend setting up [automatic payments](https://www.console.runpod.io/user/billing) to avoid service interruptions.
-
+Pods are billed by the second for compute and storage, with no fees for data ingress or egress. Find the latest GPU pricing on the [Runpod console](https://www.console.runpod.io/pods) during Pod deployment.

## Pricing options

-
-
-
-
-
-Runpod provides three options for Pod pricing:
-
-- **On-demand:** Pay-as-you-go pricing for reliable, non-interruptible instances dedicated to your use.
-- **Savings plan:** Commit to a fixed term upfront for significant discounts on on-demand rates, ideal for longer-term workloads where you need prolonged access to compute.
-- **Spot:** Access spare compute capacity at the lowest prices. These instances are interruptible and suitable for workloads that can tolerate such interruptions.
+| | On-demand | Savings plan | Spot |
+|---|-----------|--------------|------|
+| **Pricing** | Standard hourly rate | Discounted (prepaid) | Lowest available |
+| **Commitment** | None | 3 or 6 months upfront | None |
+| **Interruptible?** | No | No | Yes (5-second warning) |
+| **Best for** | Development, testing, variable workloads | Long-running production workloads | Fault-tolerant batch processing |

### On-demand

-On-demand instances are designed for non-interruptible workloads. When you deploy an on-demand Pod, the required resources are dedicated to your Pod. As long as you have sufficient funds in your account, your on-demand Pod cannot be displaced by other users and will run without interruption.
+Pay-as-you-go pricing for non-interruptible instances. Resources are dedicated to your Pod and cannot be displaced by other users.

-You must have at least one hour's worth of time in your balance for your selected Pod configuration to rent an on-demand instance. If your balance is completely drained, all Pods are subject to deletion at the discretion of the Runpod system.
+You must have at least one hour's worth of credits for your selected configuration to deploy an on-demand instance.

-**Benefits:**
-- **Flexibility:** Ideal for workloads with unpredictable durations or for short-term tasks, development, and testing.
-- **No upfront commitment:** Start using resources immediately without any long-term contracts (beyond ensuring sufficient balance).
-- **Reliability:** On-demand instances are non-interruptible, providing a stable environment for your applications. - -**Use on-demand pricing for:** -- Short-term projects or experiments. -- Development and testing environments. -- Workloads where interruption is not acceptable and usage patterns are variable. -- Applications requiring immediate deployment without a long-term resource plan. - ### Savings plans -Savings plans offer a way to pay upfront for a defined period and receive a discount on compute costs in return. This is an excellent option when you know you will need prolonged access to specific compute resources. +Commit to a 3-month or 6-month term upfront for significant discounts on compute costs. When you stop a Pod, the savings plan automatically applies to your next deployment of the same GPU type. - -Savings plans only apply to GPU compute costs. [All storage costs](/pods/storage/types) (container disk, volume disk, and network volume) are billed at standard rates. - -To keep your Pod(s) running, maintain a balance of credits in your Runpod account to pay for ongoing storage costs, even if you've prepaid for them with a savings plan. Otherwise your Pod(s) will be stopped when you run out of funds. - +Savings plans only cover GPU compute costs—[storage costs](/pods/storage/types) are billed at standard rates. Maintain a credit balance for storage, or your Pods will stop when funds run out. Plans are non-refundable and have fixed expiration dates. -You commit to a usage term (3 months or 6 months) by making an upfront payment. During this term, you'll be charged a considerably lower hourly rate for your Pod. - -When you stop a Pod, the savings plan associated with it applies to your next deployment of the same GPU type. This means you can continue to benefit from your savings commitment even during temporary pauses in your Pod usage. - - -Savings plans require an upfront payment for the entire committed term and are generally non-refundable. Stopping your Pod does not extend the duration of your savings plan; each plan has a fixed expiration date set at the time of purchase. - - -**Benefits:** -- **Significant cost reduction:** Offers substantial discounts on hourly rates compared to standard on-demand pricing. -- **Budget predictability:** Lock in compute costs for your long-running workloads with a fixed upfront payment and known discounted rates. -- **Flexible application:** If you stop a Pod with an active savings plan, the plan's benefits automatically apply to the next Pod you deploy using the same GPU type. - -**Use savings plans for:** -- Long-running projects with predictable compute needs. -- Production workloads where cost optimization over time is crucial. -- Users who can commit to specific hardware configurations for an extended period. - ### Spot instances -Spot instances allow you to access spare Runpod compute capacity at significantly lower prices than on-demand rates. These instances are interruptible, meaning they can be terminated by Runpod if the capacity is needed for on-demand or savings plan Pods, or if another user outbids you for the Spot capacity. - -While resources are dedicated to your Pod when it's running, the instance can be stopped if a higher bid is placed or an on-demand deployment requires the resources. +Access spare compute capacity at the lowest prices. Spot instances are interruptible—they can be terminated with only a **5-second warning** (SIGTERM, then SIGKILL) if capacity is needed elsewhere. 
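+If your workload checkpoints, you can use that warning window to flush state to the volume disk before exiting. A minimal bash sketch (`/tmp/checkpoint` and `train.py` are hypothetical placeholders for your own job):
+
+```sh
+#!/bin/bash
+# On SIGTERM, stop the job, copy the latest checkpoint to the persistent volume disk, then exit.
+trap 'kill $! 2>/dev/null; cp -r /tmp/checkpoint /workspace/checkpoint; exit 0' TERM
+
+# Run the job in the background and wait, so the trap fires as soon as the signal arrives.
+python train.py &
+wait $!
+```
+
+Keep checkpoints small enough to copy within the 5-second window, or write them directly to `/workspace` as the job runs.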
-Spot instances can be terminated with only a 5-second warning (SIGTERM signal, followed by SIGKILL). Your application must be designed to handle such interruptions gracefully. - -It is crucial to periodically save your work to a volume disk or push data to cloud storage, especially within the 5-second window after a SIGTERM signal. Your volume disk is retained even if your Spot instance is interrupted. +Save your work frequently to volume disk or cloud storage. Your volume disk is retained even if your Spot instance is interrupted. -**Benefits:** -- **Lowest cost:** Provides the most budget-friendly option for running compute workloads. -- **Scalability for tolerant jobs:** Enables large-scale, parallel processing tasks at a fraction of the on-demand cost. - -**Risks and considerations:** -- **Interruptibility:** Spot instances can be terminated with only a 5-second warning. Your application must be designed to handle such interruptions gracefully. -- **Data persistence:** It is crucial to periodically save your work to a volume disk or push data to cloud storage, especially within the 5-second window after a SIGTERM signal. Your volume disk is retained even if your Spot instance is interrupted. - -**Use spot instances for:** -- Fault-tolerant workloads that can withstand interruptions. -- Stateless applications or those that can quickly resume from a saved state. -- Tasks where minimizing cost is the highest priority and interruptions can be managed effectively. - -## Choosing the right pricing model - -Selecting the optimal pricing model depends on your specific needs. - -For **maximum flexibility and reliability** for short-term or unpredictable workloads, choose an **on-demand** instance. - -For **significant cost savings on long-term, stable workloads**, and if you can make an upfront commitment, choose a **savings plan** instance. - -For the **lowest possible cost on fault-tolerant, interruptible workloads**, choose a **spot instance**. - -Consider your workload's sensitivity to interruptions, your budget, the expected duration of your compute tasks, and data persistence strategies to make the most informed decision. - -## Selecting a pricing model during Pod deployment +## Storage pricing -You can select your preferred pricing model directly from the Runpod console when configuring and deploying a new Pod. +| Storage type | Running Pod | Stopped Pod | Notes | +|--------------|-------------|-------------|-------| +| **Container disk** | \$0.10/GB/month | Not charged | Temporary; erased when Pod stops | +| **Volume disk** | \$0.10/GB/month | \$0.20/GB/month | Persistent; retained until Pod deleted | +| **Network volume** | \$0.07/GB/month (< 1TB) | \$0.07/GB/month | Permanent; portable between Pods | +| | \$0.05/GB/month (> 1TB) | \$0.05/GB/month | | -1. Open the [Pods page](https://www.console.runpod.io/pods) in the Runpod console and select **Deploy**. - -2. Configure your Pod (see [Create a Pod](/pods/manage-pods#create-a-pod)). - -3. Under **Instance Pricing**, select one of the following options: - * **On-demand**: Deploys your Pod with standard, non-interruptible pricing. - * **3 month savings plan**: Deploys your Pod with a 3-month upfront commitment for discounted rates. - * **6 month savings plan**: Deploys your Pod with a 6-month upfront commitment for even greater discounted rates. - * **Spot**: Deploys your Pod as an interruptible instance at the lowest cost. - -4. Review your Pod's configuration details, including the terms of the selected pricing model. 
The combined cost of the Pod's GPU and storage will be displayed during deployment under **Pricing Summary**. - -5. Click **Deploy On-Demand** (or the equivalent deployment button). If you've selected a savings plan, the upfront cost will be charged to your Runpod credits, and your Pod will begin deploying with the discounted rate active. - -## Storage billing - -Runpod offers [three types of storage](/pods/storage/types) for Pods:: - -- **Container disk:** Temporary storage that is erased if the Pod is stopped, billed at \$0.10 per GB per month for storage on running Pods. Billed per-second. -- **Volume disk:** Persistent storage that is billed at \$0.10 per GB per month on running Pods and \$0.20 per GB per month for volume storage on stopped Pods. Billed per-second. -- **Network volumes:** External storage that is billed at \$0.07 per GB per month for storage requirements below 1TB. For requirements exceeding 1TB, the rate is \$0.05 per GB per month. Billed hourly. - -You are not charged for storage if the host is down or unavailable from the public internet. - -Container and volume disk storage will be included in your Pod's displayed hourly cost during deployment. +Storage is billed per-second for container and volume disks, and hourly for network volumes. You are not charged if the host is unavailable. -Runpod is not designed as a long-term cloud storage system. Storage is provided to support compute tasks. We recommend regularly backing up critical data to your local machine or to a dedicated cloud storage provider. +Runpod is not designed for long-term cloud storage. Back up critical data to your local machine or a dedicated storage provider. -## Pricing for stopped Pods - -When you [stop a Pod](/pods/manage-pods#stop-a-pod), you will no longer be charged for the Pod's hourly GPU cost, but will continue to be charged for the Pod's volume disk at a rate of \$0.20 per GB per month. - -## Account spend limits +## Account limits -By default, Runpod accounts have a spend limit of \$80 per hour across all resources. This limit protects your account from unexpected charges. If your workload requires higher spending capacity, you can [contact support](https://www.runpod.io/contact) to increase it. +- **Minimum balance**: If your balance covers less than 10 seconds of remaining runtime, Runpod stops all Pods to preserve data volumes. Set up [automatic payments](https://www.console.runpod.io/user/billing) to avoid interruptions. +- **Spend limit**: Default limit of \$80/hour across all resources. [Contact support](https://www.runpod.io/contact) to increase. -## Tracking costs and savings plans +## Track your costs -You can monitor your active savings plans, including their associated Pods, commitment periods, and expiration dates, by visiting the dedicated [Savings plans](https://www.console.runpod.io/savings-plans) section in your Runpod console. General Pod usage and billing can be tracked through the [Billing section](https://www.console.runpod.io/user/billing). +- **Savings plans**: Monitor active plans, commitment periods, and expiration dates in the [Savings plans](https://www.console.runpod.io/savings-plans) section. +- **Billing**: Track usage and charges in the [Billing section](https://www.console.runpod.io/user/billing). 
diff --git a/pods/storage/cloud-sync.mdx b/pods/storage/cloud-sync.mdx
index f56c28d8..2bb323ff 100644
--- a/pods/storage/cloud-sync.mdx
+++ b/pods/storage/cloud-sync.mdx
@@ -4,15 +4,21 @@ sidebarTitle: "Sync data with cloud storage"
description: "Learn how to sync your Pod data with popular cloud storage providers."
---

-Runpod's Cloud Sync feature makes it easy to upload your Pod data to external cloud storage providers, or download data from cloud storage providers to your Pod. This guide walks you through setting up and using Cloud Sync with supported providers.
+Cloud Sync uploads and downloads data between your Pod and external cloud storage providers.

-Cloud Sync supports syncing data with these cloud storage providers:
+## Supported providers

-- Amazon S3
-- Google Cloud Storage
-- Microsoft Azure Blob Storage
-- Dropbox
-- Backblaze B2 Cloud Storage
+| Provider | Auth method | Setup complexity |
+|----------|-------------|------------------|
+| [**Amazon S3**](#amazon-s3) | Access Key + Secret | Low |
+| [**Google Cloud Storage**](#google-cloud-platform-storage) | Service Account JSON | Medium |
+| [**Microsoft Azure**](#microsoft-azure-blob-storage) | Account Name + Key | Medium |
+| [**Backblaze B2**](#backblaze-b2-cloud-storage) | Application Key | Low |
+| [**Dropbox**](#dropbox) | OAuth Access Token | Medium |
+
+
+Cloud Sync works with Google Cloud Storage, not Google Drive. For Drive transfers, see [file transfer methods](/pods/storage/transfer-files#transfer-with-google-drive).
+

## Security best practices

@@ -56,10 +62,6 @@ Follow the steps below to sync your data with Amazon S3:

## Google Cloud Platform Storage

-
-Cloud Sync is compatible with Google Cloud Storage, but **not Google Drive**. However, you can transfer files between your Pods and Drive [using the Runpod CLI](/pods/storage/transfer-files#transfer-files-between-google-drive-and-runpod).
-
-
Google Cloud Storage offers high-performance object storage with global availability and strong consistency. Follow the steps below to sync your data with Google Cloud Storage:

diff --git a/pods/storage/types.mdx b/pods/storage/types.mdx
index 4729d040..7658ddd3 100644
--- a/pods/storage/types.mdx
+++ b/pods/storage/types.mdx
@@ -3,76 +3,69 @@ title: "Storage options"
description: "Choose the right type of storage for your Pods."
---

-import { PodsTooltip, PodTooltip } from "/snippets/tooltips.jsx";
+import { PodTooltip } from "/snippets/tooltips.jsx";

-Choosing the right type of storage is crucial for optimizing your workloads, whether you need temporary storage for active computations, persistent storage for long-term data retention, or permanent, shareable storage across multiple <PodsTooltip />.
+Pods offer three storage types optimized for different use cases. Choose based on your data persistence, performance, and sharing needs.

-This page describes the different types of storage options available for your Pods, and when to use each in your workflow.
+## Comparison
+
+| | Container disk | Volume disk | Network volume |
+|---|----------------|-------------|----------------|
+| **Persistence** | Lost on stop/restart | Retained until Pod deleted | Retained independently |
+| **Mount path** | System-managed | `/workspace` (default) | `/workspace` (replaces volume disk) |
+| **Performance** | Fastest (local) | Fast (local) | Variable (network) |
+| **Shareable** | No | No | Yes (across Pods) |
+| **Resizable** | Yes | Increase only | Yes |
+| **Cost** | \$0.10/GB/month | \$0.10/GB/month (running) | \$0.07/GB/month |
+| | | \$0.20/GB/month (stopped) | |
+| **Best for** | OS, temp files, cache | Models, datasets, checkpoints | Shared data, portable storage |

## Container disk

-A container disk houses the operating system and provides temporary storage for a <PodTooltip />. It's created when a Pod is launched and is directly tied to the Pod's lifecycle.
+The container disk provides temporary storage for the operating system and session data. It's created when a <PodTooltip /> launches and is cleared when the Pod stops. Use it for temporary files, caches, and data that doesn't need to persist between sessions.

## Volume disk

-A volume disk provides persistent storage that remains available for the duration of the Pod's lease. It functions like a dedicated hard drive, allowing you to store data that needs to be retained even if the Pod is stopped or rebooted.
-
-The volume disk is mounted at `/workspace` by default (this will be replaced by the network volume if one is attached). This can be changed by [editing your Pod configuration](#modifying-storage-capacity).
+The volume disk provides persistent storage that is retained throughout the Pod's lease. Data stored in the `/workspace` directory survives Pod stops and restarts, but is deleted when the Pod is terminated. This is ideal for storing models, datasets, and checkpoints that you need to access across multiple sessions.

## Network volume

-[Network volumes](/storage/network-volumes) offer persistent storage that can be attached to multiple Pods and persists independently from the Pod's lifecycle. This allows you to share and access data across multiple instances or transfer storage between machines, and retain data even after a Pod is deleted.
-`
-When attached to a Pod, a network volume replaces the volume disk, and by default it is mounted at `/workspace`.
-
-
+Network volumes provide permanent storage that exists independently from any Pod. You can attach a network volume to multiple Pods, transfer it between machines, and retain your data even after deleting a Pod. This makes network volumes ideal for shared datasets, collaborative workflows, and portable storage.

-Network volumes must be attached during Pod creation, and cannot be unattached later.
+[Learn more about network volumes](/storage/network-volumes).
+
+Network volumes must be attached during Pod creation and cannot be detached later. When attached, the network volume replaces the volume disk at `/workspace`.
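+To confirm which disk is actually backing `/workspace` on a running Pod, you can inspect the mounts from the Pod terminal (standard Linux commands; the exact output varies by host):
+
+```sh
+# Show the filesystem mounted at /workspace, with its size and usage.
+df -h /workspace
+
+# List the matching mount entries for more detail.
+mount | grep workspace
+```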
-## Storage type comparison - -This table provides a comparative overview of the storage types available for your Pods: -| Feature | Container Disk | Volume Disk | Network Volume | -| :---------------- | :------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------ | :--------------------------------------------------------------------------------------------------------------------- | -| **Data persistence** | Volatile (lost on stop/restart) | Persistent (retained until Pod deletion) | Permanent (retained independently from Pod lifecycles) | -| **Lifecycle** | Tied directly to the Pod's active session | Tied to the Pod's lease period | Independent, can outlive Pods | -| **Performance** | Fastest (locally attached) | Reliable, generally slower than container | Performance can vary (network dependent) | -| **Capacity** | Determined by Pod configuration | Selectable at creation | Selectable and often resizable | -| **Cost** | \$0.1/GB/month | \$0.1/GB/month | \$0.07/GB/month | -| **Best for** | Temporary session data, cache | Persistent application data, models, datasets | Shared data, portable storage, collaborative workflows | - -## Choosing the right storage - -Here's what you should consider when selecting storage for your Pods: +## Modify storage capacity -* **Data persistence needs:** Does your data need to survive Pod restarts or deletions? -* **Performance requirements:** Do your applications require very high-speed I/O, or is standard performance sufficient? -* **Data sharing:** Do you need to share data between multiple Pods? +You can adjust your Pod's storage capacity at any time: -## Modifying storage capacity - -To update the size of a Pod's container or volume disk: - -1. Navigate to the [Pod page](https://console.runpod.io/pod) in the Runpod console. -2. Click the three dots to the right of the Pod you want to modify and select **Edit Pod**. -3. Adjust the storage capacity for the container or volume disk. Volume disk size can be increased, but not decreased. -4. Click **Save** to apply the changes. +1. Navigate to the [Pods page](https://console.runpod.io/pods). +2. Click the three dots next to your Pod and select **Edit Pod**. +3. Adjust the container or volume disk size. Note that volume disk size can only be increased, not decreased. +4. Click **Save** to apply your changes. - -Editing a running Pod will cause it to reset completely, erasing all data that isn't stored in your volume disk/network volume mount directory (`/workspace` by default). - +Editing a running Pod resets it completely, erasing all data that isn't stored in your `/workspace` directory. -## Transferring data to another cloud provider +## Transfer data -You can upload data from your Pod to AWS S3, Google Cloud Storage, Azure, Dropbox, and more by clicking the **Cloud Sync** button on the Pod page. For detailed instructions on connecting to these services, see [Export data](/pods/storage/cloud-sync). +You can export data from your Pod to external cloud providers including AWS S3, Google Cloud Storage, Azure, and Dropbox. Click the **Cloud Sync** button on the Pod page to get started. For detailed instructions, see [Export data](/pods/storage/cloud-sync). + + +Runpod is not designed for long-term cloud storage. We recommend backing up critical data to your local machine or a dedicated cloud storage provider. 
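+For quick one-off transfers, the [Runpod CLI](/runpodctl/overview) can also move individual files between machines using one-time codes (a sketch; the filename and code below are placeholders):
+
+```sh
+# On the source machine: prints a one-time code for the transfer.
+runpodctl send my-dataset.tar.gz
+
+# On the destination machine: pass the code printed by the send command.
+runpodctl receive 8338-galileo-collect-fidel
+```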
+
## Next steps

-* Learn how to [create a network volume](/storage/network-volumes).
-* Learn how to [choose the right Pod](/pods/choose-a-pod) for your workload.
-* Explore options for [managing your Pods](/pods/manage-pods).
-* Understand how to create [Pod templates](/pods/templates/overview) for pre-configured environments.
+
+
+Learn how to set up portable, persistent storage for your Pods.
+
+
+Learn how to move data to and from your Pod.
+
+
diff --git a/pods/templates/create-custom-template.mdx b/pods/templates/create-custom-template.mdx
index acb7ba31..dcf11edf 100644
--- a/pods/templates/create-custom-template.mdx
+++ b/pods/templates/create-custom-template.mdx
@@ -2,7 +2,6 @@
title: "Build a custom Pod template"
sidebarTitle: "Build a custom template"
description: "A step-by-step guide to extending Runpod's official templates."
-tag: "NEW"
---

import { PodTooltip, PyTorchTooltip, CUDATooltip, TemplateTooltip, InferenceTooltip } from "/snippets/tooltips.jsx";

@@ -15,18 +14,6 @@ This tutorial shows how to build a custom fro
By creating custom templates, you can package everything your project needs into a reusable Docker image. Once built, you can deploy your workload in seconds instead of reinstalling dependencies every time you start a new Pod. You can also share your template with members of your team and the wider Runpod community.

-## What you'll learn
-
-In this tutorial, you'll learn how to:
-
-- Create a Dockerfile that extends a Runpod base image.
-- Configure container startup options (JupyterLab/SSH, application + services, or application only).
-- Add Python dependencies and system packages.
-- Pre-load machine learning models from Hugging Face, local files, or custom sources.
-- Build and test your image, then push it to Docker Hub.
-- Create a custom Pod template in the Runpod console
-- Deploy a Pod using your custom template.
-
## Requirements

Before you begin, you'll need:

diff --git a/pods/templates/environment-variables.mdx b/pods/templates/environment-variables.mdx
index c1728cee..4ff0b38e 100644
--- a/pods/templates/environment-variables.mdx
+++ b/pods/templates/environment-variables.mdx
@@ -1,209 +1,100 @@
---
title: "Environment variables"
-description: "Learn how to use environment variables in Runpod Pods for configuration, security, and automation"
+description: "Configure Pods with environment variables for settings, secrets, and runtime information."
---

import { PodsTooltip } from "/snippets/tooltips.jsx";

-Environment variables are key-value pairs that you can configure for your <PodsTooltip />. They are accessible within your containerized application and provide a flexible way to pass configuration settings, secrets, and runtime information to your application without hardcoding them into your code or container image.
+Environment variables are key-value pairs accessible within your container. Use them to pass configuration settings, secrets, and runtime information without hardcoding values into your code or container image.

-## What are environment variables?
+## Set environment variables

-Environment variables are dynamic values that exist in your Pod's operating system environment. They act as a bridge between your Pod's configuration and your running applications, allowing you to:
+You can configure up to 50 environment variables per Pod.

-- Store configuration settings that can change between deployments.
-- Pass sensitive information like API keys securely.
-- Access Pod metadata and system information.
-- Configure application behavior without modifying code. -- Reference [Runpod secrets](/pods/templates/secrets) in your containers. +**During Pod creation:** +1. Click **Edit Template** and expand **Environment Variables**. +2. Click **Add Environment Variable** and enter the key-value pair. -When you set an environment variable in your Pod configuration, it becomes available to all processes running inside that Pod's container. +**In Pod templates:** +1. Navigate to [My Templates](https://www.console.runpod.io/user/templates). +2. Create or edit a template and add variables in the **Environment Variables** section. -## Why use environment variables in Pods? +**Using secrets:** -Environment variables offer several key benefits for containerized applications: - -**Configuration flexibility**: Environment variables allow you to easily change application settings without modifying your code or rebuilding your container image. For example, you can set different model names, API endpoints, or processing parameters for different deployments: - -```bash -# Set a model name that your application can read -MODEL_NAME=llama-2-7b-chat -API_ENDPOINT=https://api.example.com/v1 -MAX_BATCH_SIZE=32 -``` - -**Security**: Sensitive information such as API keys, database passwords, or authentication tokens can be injected as environment variables, keeping them out of your codebase and container images. This prevents accidental exposure in version control or public repositories. - -**Pod metadata access**: Runpod provides [predefined environment variables](#runpod-provided-environment-variables) that give your application information about the Pod's environment, resources, and network configuration. This metadata helps your application adapt to its runtime environment automatically. - -**Automation and scaling**: Environment variables make it easier to automate deployments and scale applications. You can use the same container image with different settings for development, staging, and production environments by simply changing the environment variables. - -## Setting environment variables - -You can configure up to 50 environment variables per Pod through the Runpod interface when creating or editing a Pod or Pod template. - -### During Pod creation - -1. When creating a new Pod, click **Edit Template** and expand the **Environment Variables** section. -2. Click **Add Environment Variable**. -3. Enter the **Key** (variable name) and **Value**. -4. Repeat for additional variables. - -### In Pod templates - -1. Navigate to [My Templates](https://www.console.runpod.io/user/templates) in the console. -2. Create a new template or edit an existing one. -3. Add environment variables in the **Environment Variables** section. -4. Save the template for reuse across multiple Pods. - -### Using secrets - -For sensitive data, you can reference [Runpod secrets](/pods/templates/secrets) in environment variables using the `RUNPOD_SECRET_` prefix. For example: +Reference [Runpod secrets](/pods/templates/secrets) for sensitive data: ``` API_KEY={{ RUNPOD_SECRET_my_api_key }} DATABASE_PASSWORD={{ RUNPOD_SECRET_db_password }} ``` -## Updating environment variables +## Update environment variables -To update environment variables in your Pod: - -1. Navigate to the [Pods](https://www.console.runpod.io/user/pods) section of the console. -2. Click the three dots to the right of the Pod you want to update and select **Edit Pod**. -3. Click the **Environment Variables** section to expand it. -4. 
Add or update the environment variables. -5. Click **Save** to save your changes. +1. Go to [Pods](https://www.console.runpod.io/user/pods) and click the three dots next to your Pod. +2. Select **Edit Pod** and expand **Environment Variables**. +3. Add or update variables and click **Save**. -When you update environment variables your Pod will restart, clearing all data outside of your volume mount path (`/workspace` by default). +Updating environment variables restarts your Pod, clearing all data outside your volume mount path (`/workspace` by default). -## Accessing environment variables - -Once set, environment variables are available to your application through standard operating system mechanisms. - -### Verify variables in your Pod - -You can check if environment variables are properly set by running commands in your Pod's terminal: - -```bash -# View a specific environment variable -echo $ENVIRONMENT_VARIABLE_KEY - -# List all environment variables -env - -# Search for specific variables -env | grep RUNPOD -``` - -### Accessing variables in your applications - -Different programming languages provide various ways to access environment variables: - -**Python:** -```python -import os - -model_name = os.environ.get('MODEL_NAME', 'default-model') -api_key = os.environ['API_KEY'] # Raises error if not found -``` - -**Node.js:** -```javascript -const modelName = process.env.MODEL_NAME || 'default-model'; -const apiKey = process.env.API_KEY; -``` - -**Bash scripts:** -```bash -#!/bin/bash -MODEL_NAME=${MODEL_NAME:-"default-model"} -echo "Using model: $MODEL_NAME" -``` - -## Runpod-provided environment variables - -Runpod automatically sets several environment variables that provide information about your Pod's environment and resources: - -| Variable | Description | -| --------------------- | ------------------------------------------------------------------------------------------------ | -| `RUNPOD_POD_ID` | The unique identifier assigned to your Pod. | -| `RUNPOD_DC_ID` | The identifier of the data center where your Pod is located. | -| `RUNPOD_POD_HOSTNAME` | The hostname of the server where your Pod is running. | -| `RUNPOD_GPU_COUNT` | The total number of GPUs available to your Pod. | -| `RUNPOD_CPU_COUNT` | The total number of CPUs available to your Pod. | -| `RUNPOD_PUBLIC_IP` | The publicly accessible IP address for your Pod, if available. | -| `RUNPOD_TCP_PORT_22` | The public port mapped to SSH (port 22) for your Pod. | -| `RUNPOD_ALLOW_IP` | A comma-separated list of IP addresses or ranges allowed to access your Pod. | -| `RUNPOD_VOLUME_ID` | The ID of the network volume attached to your Pod. | -| `RUNPOD_API_KEY` | The API key for making Runpod API calls scoped specifically to this Pod. | -| `PUBLIC_KEY` | The SSH public keys authorized to access your Pod over SSH. | -| `CUDA_VERSION` | The version of CUDA installed in your Pod environment. | -| `PYTORCH_VERSION` | The version of PyTorch installed in your Pod environment. | -| `PWD` | The current working directory inside your Pod. 
| +## Access environment variables -## Common use cases - -Environment variables are particularly useful for: - -**Model configuration**: Configure which AI models to load without rebuilding your container: +**In your Pod's terminal:** ```bash -MODEL_NAME=gpt-3.5-turbo -MODEL_PATH=/workspace/models -MAX_TOKENS=2048 -TEMPERATURE=0.7 +echo $VARIABLE_NAME # View specific variable +env | grep RUNPOD # List Runpod variables ``` -**Service configuration**: Set up web services and APIs with flexible configuration: - -```bash -API_PORT=8000 -DEBUG_MODE=false -LOG_LEVEL=INFO -CORS_ORIGINS=https://myapp.com,https://staging.myapp.com -``` - -**Database and external service connections**: Connect to databases and external APIs securely: - -```bash -DATABASE_URL=postgresql://user:pass@host:5432/db -REDIS_URL=redis://localhost:6379 -API_BASE_URL=https://api.external-service.com -``` - -**Development vs. production settings**: Use different configurations for different environments: - -```bash -ENVIRONMENT=production -CACHE_ENABLED=true -RATE_LIMIT=1000 -MONITORING_ENABLED=true -``` - -**Port management**: When configuring symmetrical ports, your application can discover assigned ports through environment variables. This is particularly useful for services that need to know their external port numbers. - -For more details, see [Expose ports](/pods/configuration/expose-ports#symmetrical-port-mapping). +**In your code:** + + + + ```python + import os + + model_name = os.environ.get('MODEL_NAME', 'default-model') + api_key = os.environ['API_KEY'] # Raises error if not set + ``` + + + ```javascript + const modelName = process.env.MODEL_NAME || 'default-model'; + const apiKey = process.env.API_KEY; + ``` + + + ```bash + MODEL_NAME=${MODEL_NAME:-"default-model"} + echo "Using model: $MODEL_NAME" + ``` + + + +## Runpod-provided variables + +Runpod automatically sets these environment variables: + +| Variable | Description | +| --- | --- | +| `RUNPOD_POD_ID` | Unique Pod identifier. | +| `RUNPOD_DC_ID` | Data center identifier. | +| `RUNPOD_POD_HOSTNAME` | Server hostname. | +| `RUNPOD_GPU_COUNT` | Number of GPUs available. | +| `RUNPOD_CPU_COUNT` | Number of CPUs available. | +| `RUNPOD_PUBLIC_IP` | Public IP address (if available). | +| `RUNPOD_TCP_PORT_22` | Public port mapped to SSH. | +| `RUNPOD_VOLUME_ID` | Attached network volume ID. | +| `RUNPOD_API_KEY` | Pod-scoped API key. | +| `PUBLIC_KEY` | Authorized SSH public keys. | +| `CUDA_VERSION` | Installed CUDA version. | +| `PYTORCH_VERSION` | Installed PyTorch version. | ## Best practices -Follow these guidelines when working with environment variables: - -**Security considerations**: - -- **Never hardcode secrets**: Use [Runpod secrets](/pods/templates/secrets) for sensitive data. -- **Use descriptive names**: Choose clear, descriptive variable names like `DATABASE_PASSWORD` instead of `DB_PASS`. - -**Configuration management**: - -- **Provide defaults**: Use default values for non-critical configuration options. -- **Document your variables**: Maintain clear documentation of what each environment variable does. -- **Group related variables**: Use consistent prefixes for related configuration (for example, `DB_HOST`, `DB_PORT`, `DB_NAME`). - -**Application design**: - -- **Validate required variables**. Check that critical environment variables are set before your application starts. If the variable is missing, your application should throw an error or return a clear message indicating which variable is not set. 
This helps prevent unexpected failures and makes debugging easier.
-- **Type conversion**: Convert string environment variables to appropriate types (such as integers or booleans) in your application.
-- **Configuration validation**: Validate environment variable values to catch configuration errors early.
\ No newline at end of file
+- **Use secrets for sensitive data**: Never hardcode API keys or passwords. Use [Runpod secrets](/pods/templates/secrets).
+- **Validate required variables**: Check that critical variables are set before your application starts.
+- **Provide defaults**: Use fallback values for non-critical configuration.
+- **Use descriptive names**: Prefer `DATABASE_PASSWORD` over `DB_PASS`.
+- **Group related variables**: Use consistent prefixes like `DB_HOST`, `DB_PORT`, `DB_NAME`.
diff --git a/pods/templates/manage-templates.mdx b/pods/templates/manage-templates.mdx
index d723f668..aacfe082 100644
--- a/pods/templates/manage-templates.mdx
+++ b/pods/templates/manage-templates.mdx
@@ -3,7 +3,7 @@ title: "Manage Pod templates"
description: "Learn how to create and manage custom Pod templates."
---

-import { PodTooltip, PodEnvironmentVariablesTooltip } from "/snippets/tooltips.jsx";
+import { PodTooltip } from "/snippets/tooltips.jsx";

Creating a custom template allows you to package your specific configuration for reuse and sharing. Templates define all the necessary components to launch a <PodTooltip /> with your desired setup.

@@ -104,7 +104,7 @@ For more details, see the [API reference](/api-reference/templates/POST/template

## Using environment variables in templates

-<PodEnvironmentVariablesTooltip /> provide a flexible way to configure your Pod's runtime behavior without modifying the container image.
+Environment variables provide a flexible way to configure your Pod's runtime behavior without modifying the container image.

### Defining environment variables

diff --git a/pods/templates/overview.mdx b/pods/templates/overview.mdx
index 11a59dcd..0d63779f 100644
--- a/pods/templates/overview.mdx
+++ b/pods/templates/overview.mdx
@@ -1,71 +1,48 @@
---
title: "Overview"
description: "Streamline your Pod deployments with templates, bundling prebuilt container images with hardware specs and network settings."
+mode: "wide"
---

import { PodTooltip, PodEnvironmentVariablesTooltip } from "/snippets/tooltips.jsx";

-<PodTooltip /> templates are pre-configured [Docker image](/tutorials/introduction/containers#what-are-images) setups that let you quickly spin up Pods without manual environment configuration. They're essentially deployment configurations that include specific models, frameworks, or workflows bundled together.
-
-Templates eliminate the need to manually set up environments, saving time and reducing configuration errors. For example, instead of installing PyTorch, configuring JupyterLab, and setting up all dependencies yourself, you can select a pre-configured template and have everything ready to go instantly.
-
-
-
-## What Pod templates include
-
-Pod templates contain all the necessary components to launch a fully configured Pod:
-
-- **Container image:** The Docker image with all necessary software packages and dependencies. This is where the core functionality of the template is stored, i.e., the software package and any files associated with it.
-- **Hardware specifications:** Container disk size, volume size, and mount paths that define the storage requirements for your Pod.
-- **Network settings:** Exposed ports for services like web UIs or APIs.
If the image has a server associated with it, you'll want to ensure that the HTTP and TCP ports are exposed as necessary. -- **:** Pre-configured settings specific to the template that customize the behavior of the containerized application. -- **Startup commands:** Instructions that run when the Pod launches, allowing you to customize the initialization process. - -## Types of templates - -Runpod offers three types of templates to meet different needs: - -### Official templates - -Official templates are curated by Runpod with proven demand and maintained quality. These templates undergo rigorous testing and are regularly updated to ensure compatibility and performance. Runpod provides full support for official templates. - -### Community templates - -Community templates are created by users and promoted based on community usage. These templates offer a wide variety of specialized configurations and cutting-edge tools contributed by the Runpod community. +
+
+Pod templates are pre-configured [Docker image](/tutorials/introduction/containers#what-are-images) setups that let you quickly spin up Pods without manual environment configuration. Instead of installing PyTorch, configuring JupyterLab, and setting up dependencies yourself, you can select a template and have everything ready instantly.
+
+
+
+ Browse official and community templates.
+
+
+ Build your own reusable Pod configuration.
+
+
+ Edit, share, and organize your templates.
+
+
+ Configure template behavior with variables.
+
+
+
+## Template types
+
+| Type | Description | Support |
+|------|-------------|---------|
+| **Official** | Curated by Runpod with proven demand and maintained quality. Regularly tested and updated. | Full Runpod support |
+| **Community** | Created by users and promoted based on community usage. Wide variety of specialized configurations. | [Community Discord](https://discord.gg/runpod) only |
+| **Custom** | Created by you for specialized workloads. Can be private or shared publicly. | Self-supported |
 
 
 Runpod does not maintain or provide customer support for community templates. If you encounter issues, contact the template creator directly or seek help on the [community Discord](https://discord.gg/runpod).
 
 
-### Custom templates
-
-You can create custom templates for your own specialized workloads. These can be private (visible only to you or your team) or made public for the community to use.
-
-## Explore templates
-
-
-
-
-
-
-You can discover and use existing templates through the Runpod console:
-
-**Browse all templates:** Visit the **[Explore](https://www.console.runpod.io/explore)** section to find official templates maintained by Runpod and community templates created by other users.
-
-**Manage your templates:** Access templates you've created or that are shared within your team in the **[My Templates](https://www.console.runpod.io/user/templates)** section.
-
-## Why use Pod templates
+## What templates include
 
-Templates provide significant advantages over manual Pod configuration:
+Templates contain all components needed to launch a fully configured Pod:
 
-- **Time savings:** Popular templates include options for machine learning frameworks like PyTorch, image generation tools like Stable Diffusion, and development environments with Jupyter notebooks pre-installed. This eliminates hours of manual setup and dependency management.
-- **Consistency:** Templates ensure that your development and production environments are identical, reducing "it works on my machine" issues.
-- **Best practices:** Official and popular community templates incorporate industry best practices for security, performance, and configuration.
-- **Reduced errors:** Pre-configured templates minimize the risk of configuration mistakes that can lead to Pod startup failures or performance issues.
\ No newline at end of file
+- **Container image**: The Docker image with all software packages and dependencies.
+- **Hardware specifications**: Container disk size, volume size, and mount paths.
+- **Network settings**: Exposed HTTP and TCP ports for web UIs or APIs.
+- **Environment variables**: Pre-configured settings that customize application behavior.
+- **Startup commands**: Instructions that run when the Pod launches. 
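+
+As a rough sketch of how these components could come together programmatically, the following creates a custom template through the REST API's `POST /templates` operation (see the [API reference](/api-reference/templates/POST/templates)). The base URL and field names below are illustrative assumptions rather than the exact schema:
+
+```python
+import os
+
+import requests
+
+# Hypothetical sketch: bundle an image, disk size, ports, and env vars
+# into a reusable template. Field names are assumptions; check the API
+# reference for the exact schema.
+response = requests.post(
+    "https://rest.runpod.io/v1/templates",  # assumed base URL
+    headers={"Authorization": f"Bearer {os.environ['RUNPOD_API_KEY']}"},
+    json={
+        "name": "my-pytorch-template",
+        "imageName": "runpod/pytorch:latest",  # illustrative image tag
+        "containerDiskInGb": 50,
+        "ports": "8888/http,22/tcp",
+        "env": {"JUPYTER_PASSWORD": "change-me"},
+    },
+)
+response.raise_for_status()
+print(response.json())
+```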
diff --git a/references/troubleshooting/jupyterlab-blank-page.mdx b/pods/troubleshooting/jupyterlab-blank-page.mdx similarity index 100% rename from references/troubleshooting/jupyterlab-blank-page.mdx rename to pods/troubleshooting/jupyterlab-blank-page.mdx diff --git a/references/troubleshooting/jupyterlab-checkpoints-folder.mdx b/pods/troubleshooting/jupyterlab-checkpoints-folder.mdx similarity index 100% rename from references/troubleshooting/jupyterlab-checkpoints-folder.mdx rename to pods/troubleshooting/jupyterlab-checkpoints-folder.mdx diff --git a/references/troubleshooting/pod-migration.mdx b/pods/troubleshooting/pod-migration.mdx similarity index 100% rename from references/troubleshooting/pod-migration.mdx rename to pods/troubleshooting/pod-migration.mdx diff --git a/references/troubleshooting/storage-full.mdx b/pods/troubleshooting/storage-full.mdx similarity index 95% rename from references/troubleshooting/storage-full.mdx rename to pods/troubleshooting/storage-full.mdx index 33b8370e..a8e4b478 100644 --- a/references/troubleshooting/storage-full.mdx +++ b/pods/troubleshooting/storage-full.mdx @@ -2,7 +2,7 @@ title: "Storage full" --- -Storage full can occur when users generate many files, transfer files, or perform other storage-intensive tasks. This document provides guidance to help you troubleshoot this. +Storage full errors can occur when users generate many files, transfer files, or perform other storage-intensive tasks. This document provides guidance to help you troubleshoot this. ## Check disk usage diff --git a/references/troubleshooting/token-authentication-enabled.mdx b/pods/troubleshooting/token-authentication-enabled.mdx similarity index 100% rename from references/troubleshooting/token-authentication-enabled.mdx rename to pods/troubleshooting/token-authentication-enabled.mdx diff --git a/references/troubleshooting/troubleshooting-502-errors.mdx b/pods/troubleshooting/troubleshooting-502-errors.mdx similarity index 81% rename from references/troubleshooting/troubleshooting-502-errors.mdx rename to pods/troubleshooting/troubleshooting-502-errors.mdx index 21991bc9..f1a7d1f6 100644 --- a/references/troubleshooting/troubleshooting-502-errors.mdx +++ b/pods/troubleshooting/troubleshooting-502-errors.mdx @@ -2,17 +2,17 @@ title: "502 errors" --- -502 errors can occur when users attempt to access a program running on a specific port of a deployed pod and the program isn't running or has encountered an error. This document provides guidance to help you troubleshoot this error. +502 errors can occur when users attempt to access a program running on a specific port of a deployed Pod and the program isn't running or has encountered an error. This document provides guidance to help you troubleshoot this error. ### Check your Pod's GPU -The first step to troubleshooting a 502 error is to check whether your pod has a GPU attached. +The first step to troubleshooting a 502 error is to check whether your Pod has a GPU attached. -1. **Access your pod's settings**: Click on your pod's settings in the user interface to access detailed information about your pod. +1. **Access your Pod's settings**: Click on your Pod's settings in the user interface to access detailed information about your Pod. -2. **Verify GPU attachment**: Here, you should be able to see if your pod has a GPU attached. If it does not, you will need to attach a GPU. +2. **Verify GPU attachment**: Here, you should be able to see if your Pod has a GPU attached. If it does not, you will need to attach a GPU. 
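+
+If you can still reach a terminal on the Pod, you can also confirm GPU visibility programmatically. A minimal sketch, assuming PyTorch is installed in your image:
+
+```python
+import torch
+
+# Prints True if a GPU is attached and visible to CUDA.
+print(torch.cuda.is_available())
+if torch.cuda.is_available():
+    # Print the name of the first attached GPU, e.g. "NVIDIA RTX A6000".
+    print(torch.cuda.get_device_name(0))
+```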
-If a GPU is attached, you will see it under the Pods screen (e.g. 1 x A6000). If a GPU is not attached, this number will be 0. Runpod does allow you to spin up a pod with 0 GPUs so that you can connect to it via a Terminal or CloudSync to access data. However, the options to connect to Runpod via the web interface will be nonfunctional, even if they are lit up. +If a GPU is attached, you will see it under the Pods screen (e.g. 1 x A6000). If a GPU is not attached, this number will be 0. Runpod does allow you to spin up a Pod with 0 GPUs so that you can connect to it via a Terminal or CloudSync to access data. However, the options to connect to Runpod via the web interface will be nonfunctional, even if they are lit up. @@ -20,9 +20,9 @@ If a GPU is attached, you will see it under the Pods screen (e.g. 1 x A6000). If ### Check your Pod's logs -After confirming that your pod has a GPU attached, the next step is to check your pod's logs for any errors. +After confirming that your Pod has a GPU attached, the next step is to check your Pod's logs for any errors. -1. **Access your pod's logs**: You can view the logs from the pod's settings in the user interface. +1. **Access your Pod's logs**: You can view the logs from the Pod's settings in the user interface. 2. diff --git a/references/troubleshooting/zero-gpus.mdx b/pods/troubleshooting/zero-gpus.mdx similarity index 100% rename from references/troubleshooting/zero-gpus.mdx rename to pods/troubleshooting/zero-gpus.mdx diff --git a/public-endpoints/overview.mdx b/public-endpoints/overview.mdx index 0f0e97f1..0572fb73 100644 --- a/public-endpoints/overview.mdx +++ b/public-endpoints/overview.mdx @@ -7,38 +7,57 @@ mode: "wide"
- - - - -Runpod Public Endpoints provide instant access to state-of-the-art AI models through simple API calls. Generate images, videos, audio, and text without deploying infrastructure or managing GPU resources. - - -Public Endpoints are pre-deployed models hosted by Runpod. If you want to deploy your own models or custom code, use [Runpod Serverless](/serverless/overview). - - -## Why use Public Endpoints? - -- **No deployment required.** Start generating immediately with a single API call. No containers, GPUs, or infrastructure to configure. -- **Production-ready models.** Access optimized versions of [Flux](/public-endpoints/models/flux-dev), [Whisper](/public-endpoints/models/whisper-v3), [Qwen](/public-endpoints/models/qwen3-32b), and other popular models, tuned for performance. -- **Pay per use.** Pay only for what you generate, with transparent per-megapixel, per-second, or per-token pricing. -- **Simple integration.** Standard REST API with OpenAI-compatible endpoints for LLMs. Works with any HTTP client or SDK. - -## When to use Public Endpoints +Runpod offers Public Endpoints for instant API access to pre-deployed AI models for image, video, audio, and text generation. No deployment or infrastructure required—just [create an API key](/get-started/api-keys) and make a request: + + + +```python Python +import requests + +response = requests.post( + "https://api.runpod.ai/v2/black-forest-labs-flux-1-schnell/runsync", + headers={ + "Authorization": "Bearer YOUR_API_KEY", # Replace YOUR_API_KEY with your actual API key + "Content-Type": "application/json" + }, + json={ + "input": { + "prompt": "A beautiful sunset over mountains", # Customize your prompt + "width": 1024, + "height": 1024 + } + } +) + +result = response.json() +print(result["output"]["image_url"]) +``` -Public Endpoints are ideal when you want to use popular AI models without managing infrastructure. Choose Public Endpoints when: +```bash cURL +# Replace YOUR_API_KEY with your actual API key +curl -X POST "https://api.runpod.ai/v2/black-forest-labs-flux-1-schnell/runsync" \ + -H "Authorization: Bearer YOUR_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{ + "input": { + "prompt": "A beautiful sunset over mountains", + "width": 1024, + "height": 1024 + } + }' +``` -- **You need quick access to standard models.** Generate images with [Flux](/public-endpoints/models/flux-dev), transcribe audio with [Whisper](/public-endpoints/models/whisper-v3), or chat with [Qwen](/public-endpoints/models/qwen3-32b) without setup. -- **You want predictable pricing.** Pay-per-output pricing makes costs easy to estimate and budget. -- **You're prototyping or building MVPs.** Test ideas quickly before committing to custom infrastructure. + -Consider [Runpod Serverless](/serverless/overview) instead if you need custom models, specialized preprocessing, or full control over your inference environment. ## Get started Generate your first image in under 5 minutes. + + + Browse available models and their parameters. Use the playground and REST API. @@ -46,9 +65,6 @@ Consider [Runpod Serverless](/serverless/overview) instead if you need custom mo Integrate with JavaScript and TypeScript projects. - - Browse available models and their parameters. - Chain multiple endpoints to generate videos from text. @@ -58,30 +74,6 @@ Consider [Runpod Serverless](/serverless/overview) instead if you need custom mo When you call a Public Endpoint, Runpod routes your request to a pre-deployed model running on optimized GPU infrastructure. 
The model processes your input and returns the result. -
-```mermaid -%%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'15px','fontFamily':'font-inter'}}}%% - -flowchart TD - app["Your application"] - api["Runpod API"] - model["AI model"] - output["Generated output"] - - app -->|"POST /runsync"| api - api -->|"Route request"| model - model -->|"Process"| output - output -->|"Return result"| app - - style app fill:#5F4CFE,stroke:#5F4CFE,color:#FFFFFF,stroke-width:2px - style api fill:#fb923c,stroke:#fb923c,color:#000000,stroke-width:2px - style model fill:#ecc94b,stroke:#ecc94b,color:#000000,stroke-width:2px - style output fill:#22C55E,stroke:#22C55E,color:#000000,stroke-width:2px - - linkStyle default stroke-width:2px,stroke:#5F4CFE -``` -
- Public Endpoints support two request modes: - **Synchronous (`/runsync`)**: Wait for the result and receive it in the response. Best for quick generations. @@ -93,7 +85,7 @@ For JavaScript and TypeScript projects, the [`@runpod/ai-sdk-provider`](/public- Public Endpoints offer models across four categories: -| Type | Models | Use cases | +| Type | Example models | Use cases | |------|--------|-----------| | **Image** | [Flux Dev](/public-endpoints/models/flux-dev), [Flux Schnell](/public-endpoints/models/flux-schnell), [Qwen Image](/public-endpoints/models/qwen-image), [Seedream](/public-endpoints/models/seedream-3) | Text-to-image generation, image editing | | **Video** | [WAN 2.5](/public-endpoints/models/wan-2-5), [Kling](/public-endpoints/models/kling-v2-1), [Seedance](/public-endpoints/models/seedance-1-pro), [SORA 2](/public-endpoints/models/sora-2) | Image-to-video, text-to-video generation | @@ -125,10 +117,3 @@ Pricing is calculated based on actual output. You will not be charged for failed - 1024x1024 image (1.05 MP) with [Flux Schnell](/public-endpoints/models/flux-schnell): ~\$0.0025 For complete pricing information, see the [model reference](/public-endpoints/reference). - -## Next steps - -- [Quickstart](/public-endpoints/quickstart): Generate your first image in under 5 minutes. -- [Make API requests](/public-endpoints/requests): Use the playground and REST API. -- [Vercel AI SDK](/public-endpoints/ai-sdk): Integrate with JavaScript and TypeScript projects. -- [Model reference](/public-endpoints/reference): View all available models and their parameters. \ No newline at end of file diff --git a/references/cpu-types.mdx b/references/cpu-types.mdx index 9bf44ff9..ba32970e 100644 --- a/references/cpu-types.mdx +++ b/references/cpu-types.mdx @@ -1,5 +1,5 @@ --- -title: Serverless CPU types +title: CPU types --- The following list contains all CPU types available on Runpod. diff --git a/references/troubleshooting/manage-payment-cards.mdx b/references/manage-payment-cards.mdx similarity index 98% rename from references/troubleshooting/manage-payment-cards.mdx rename to references/manage-payment-cards.mdx index c1df39ef..0f25bbce 100644 --- a/references/troubleshooting/manage-payment-cards.mdx +++ b/references/manage-payment-cards.mdx @@ -1,5 +1,6 @@ --- title: "Manage payment card declines" +sidebarTitle: "Payment card declines" description: "Learn how to troubleshoot declined payment cards and prevent service interruptions on Runpod." --- diff --git a/references/troubleshooting/leaked-api-keys.mdx b/references/troubleshooting/leaked-api-keys.mdx deleted file mode 100644 index b5d80188..00000000 --- a/references/troubleshooting/leaked-api-keys.mdx +++ /dev/null @@ -1,19 +0,0 @@ ---- -title: "Leaked API Keys" ---- - -Leaked API keys can occur when users accidentally include a plain text API key in a public repository. This document provides guidance to help you remediate a compromised key. - -## Disable - -To disable an API key: - -1. From the console, select **Settings**. -2. Under **API Keys**, select the toggle and select **Yes**. - -## Revoke - -To delete an API key: - -1. From the console, select **Settings**. -2. Under **API Keys**, select the trash can icon and select **Revoke Key**. 
diff --git a/serverless/development/aggregate-outputs.mdx b/serverless/development/aggregate-outputs.mdx
index 2073160a..76451899 100644
--- a/serverless/development/aggregate-outputs.mdx
+++ b/serverless/development/aggregate-outputs.mdx
@@ -4,7 +4,7 @@ sidebarTitle: "Aggregate outputs"
description: "Automatically collect and aggregate yielded results from streaming handler functions."
---

-import { HandlerFunctionTooltip, WorkerTooltip, ServerlessTooltip } from "/snippets/tooltips.jsx";
+import { HandlerFunctionTooltip, ServerlessTooltip } from "/snippets/tooltips.jsx";

When building a streaming handler that yields results incrementally, you can use the `return_aggregate_stream` feature to automatically collect all yielded outputs into a single aggregated response. This simplifies result handling by eliminating the need to manually collect and format streaming results, making your handlers easier to implement and consume.

@@ -276,5 +276,5 @@ This combines the benefits of async processing with automatic output aggregation

- Learn more about [streaming handlers](/serverless/workers/handler-functions#streaming-handlers).
- Explore [async handlers](/serverless/workers/handler-functions#asynchronous-handlers) for concurrent processing.
-- Understand [error handling](/serverless/development/error-handling) for robust batch processing.
+- Understand [error handling](/serverless/workers/handler-functions#error-handling) for robust batch processing.
- Review [payload limits](/serverless/workers/handler-functions#payload-limits) to avoid oversized responses.
diff --git a/serverless/development/benchmarking.mdx b/serverless/development/benchmarking.mdx
index b58c431e..a0597007 100644
--- a/serverless/development/benchmarking.mdx
+++ b/serverless/development/benchmarking.mdx
@@ -4,7 +4,9 @@ sidebarTitle: "Benchmark workers"
description: "Measure the performance of your Serverless workers and identify bottlenecks."
---

-Benchmarking your Serverless workers helps you identify bottlenecks and [optimize your code](/serverless/development/optimization) for performance and cost. Performance is measured by two key metrics:
+Benchmarking your Serverless workers helps you identify bottlenecks and [optimize your code](/serverless/development/optimization) for performance and cost.
+
+Performance is measured by two key metrics:

- **Delay time**: The time spent waiting for a worker to become available. This includes the cold start time if a new worker needs to be spun up.
- **Execution time**: The time the GPU takes to process the request once the worker has received the job.
diff --git a/serverless/development/dual-mode-worker.mdx b/serverless/development/dual-mode-worker.mdx
index a22760d5..25ad3aa6 100644
--- a/serverless/development/dual-mode-worker.mdx
+++ b/serverless/development/dual-mode-worker.mdx
@@ -11,16 +11,6 @@ This "Pod-first" workflow lets you develop and test interactively in a GPU envir

To get started quickly, you can [clone this repository](https://github.com/justinwlin/Runpod-GPU-And-Serverless-Base) for a pre-configured template for a dual-mode worker.

-## What you'll learn
-
-In this tutorial you'll learn how to:
-
-- Set up a project for a dual-mode Serverless worker.
-- Create a handler that adapts based on an environment variable.
-- Write a startup script to manage different operational modes.
-- Build a Docker image that works in both Pod and Serverless environments.
-- Deploy and test your worker in both environments. 
- ## Requirements - You've [created a Runpod account](/get-started/manage-accounts). diff --git a/serverless/development/error-handling.mdx b/serverless/development/error-handling.mdx deleted file mode 100644 index 06d5f473..00000000 --- a/serverless/development/error-handling.mdx +++ /dev/null @@ -1,116 +0,0 @@ ---- -title: "Error handling" -sidebarTitle: "Error handling" -description: "Implement robust error handling for your Serverless endpoints." ---- - -Robust error handling is essential for production Serverless endpoints. It prevents your workers from crashing silently and ensures that useful error messages are returned to the client, making debugging significantly easier. - -## Basic error handling - -The simplest way to handle errors is to wrap your handler logic in a `try...except` block. This ensures that even if your logic fails, the worker remains stable and returns a readable error message. - -```python -import runpod - -def handler(job): - try: - input = job["input"] - - # Replace process_input() with your own handler logic - result = process_input(input) - - return {"output": result} - except KeyError as e: - return {"error": f"Missing required input: {str(e)}"} - except Exception as e: - return {"error": f"An error occurred: {str(e)}"} - -runpod.serverless.start({"handler": handler}) -``` - -## Structured error responses - -For more complex applications, you should return consistent error objects. This allows the client consuming your API to programmatically handle different types of errors, such as [validation failures](/serverless/development/validation) versus unexpected server errors. - -```python -import runpod -import traceback - -def handler(job): - try: - # Validate input - if "prompt" not in job.get("input", {}): - return { - "error": { - "type": "ValidationError", - "message": "Missing required field: prompt", - "details": "The 'prompt' field is required in the input object" - } - } - - prompt = job["input"]["prompt"] - result = process_prompt(prompt) - return {"output": result} - - except ValueError as e: - return { - "error": { - "type": "ValueError", - "message": str(e), - "details": "Invalid input value provided" - } - } - except Exception as e: - # Log the full traceback for debugging - print(f"Unexpected error: {traceback.format_exc()}") - return { - "error": { - "type": "UnexpectedError", - "message": "An unexpected error occurred", - "details": str(e) - } - } - -runpod.serverless.start({"handler": handler}) -``` - -## Timeout handling - - -You can also set an execution timeout in your [endpoint settings](/serverless/endpoints/endpoint-configurations#execution-timeout) to automatically terminate a job after a certain amount of time. - - -For long-running operations, you may want to implement timeout logic within your handler. This prevents a job from hanging indefinitely and consuming credits without producing a result. 
-
-```python
-import runpod
-import signal
-
-class TimeoutError(Exception):
-    pass
-
-def timeout_handler(signum, frame):
-    raise TimeoutError("Operation timed out")
-
-def handler(job):
-    try:
-        # Set a timeout (e.g., 60 seconds)
-        signal.signal(signal.SIGALRM, timeout_handler)
-        signal.alarm(60)
-
-        # Your processing code here
-        result = long_running_operation(job["input"])
-
-        # Cancel the timeout
-        signal.alarm(0)
-
-        return {"output": result}
-
-    except TimeoutError:
-        return {"error": "Request timed out after 60 seconds"}
-    except Exception as e:
-        return {"error": str(e)}
-
-runpod.serverless.start({"handler": handler})
-```
\ No newline at end of file
diff --git a/serverless/development/huggingface-models.mdx b/serverless/development/huggingface-models.mdx
index 4d0611bf..c547366b 100644
--- a/serverless/development/huggingface-models.mdx
+++ b/serverless/development/huggingface-models.mdx
@@ -4,7 +4,7 @@ sidebarTitle: "Use Hugging Face models"
description: "Integrate pre-trained Hugging Face models into your Serverless handler functions."
---

-import { HandlerFunctionTooltip, WorkerTooltip, ServerlessTooltip } from "/snippets/tooltips.jsx";
+import { HandlerFunctionTooltip, WorkerTooltip } from "/snippets/tooltips.jsx";

Hugging Face provides thousands of pre-trained models for natural language processing, computer vision, audio processing, and more. You can integrate these models into your handler functions to deploy AI capabilities without training models from scratch.

@@ -387,4 +387,3 @@ When deploying Hugging Face models to production endpoints, keep these additiona
- [Create a Dockerfile](/serverless/workers/create-dockerfile) to package your handler with its dependencies.
- [Deploy your worker](/serverless/workers/deploy) to a Runpod endpoint.
- Explore [optimization techniques](/serverless/development/optimization) to improve performance.
-- Learn about [error handling](/serverless/development/error-handling) for production deployments.
diff --git a/serverless/development/optimization.mdx b/serverless/development/optimization.mdx
index 40b1b861..b78a8236 100644
--- a/serverless/development/optimization.mdx
+++ b/serverless/development/optimization.mdx
@@ -4,64 +4,78 @@ sidebarTitle: "Optimization guide"
description: "Implement strategies to reduce latency and cost for your Serverless endpoints."
---

-import { MachineTooltip, InferenceTooltip } from "/snippets/tooltips.jsx";
+import { InferenceTooltip } from "/snippets/tooltips.jsx";

-Optimizing your Serverless endpoints involves a cycle of measuring performance with [benchmarking](/serverless/development/benchmarking), identifying bottlenecks, and tuning your [endpoint configurations](/serverless/endpoints/endpoint-configurations). This guide covers specific strategies to reduce startup times and improve throughput.
+Optimization involves measuring performance with [benchmarking](/serverless/development/benchmarking), identifying bottlenecks, and tuning your [endpoint configurations](/serverless/endpoints/endpoint-configurations).

-## Optimization overview
+## Quick optimization checklist

-Effective optimization requires making conscious tradeoffs between cost, speed, and model size. 
+| Strategy | Impact | When to use | +|----------|--------|-------------| +| [Use cached models](/serverless/endpoints/model-caching) | ⬇️ Cold start (major) | Models on Hugging Face | +| [Bake models into image](/serverless/workers/create-dockerfile#including-models-and-files) | ⬇️ Cold start | Private models | +| [Set active workers > 0](/serverless/endpoints/endpoint-configurations#active-workers) | ⬇️ Cold start (eliminates) | Latency-sensitive apps | +| [Select multiple GPU types](/serverless/endpoints/endpoint-configurations#gpu-configuration) | ⬆️ Availability | Production workloads | +| [Increase max workers](/serverless/endpoints/endpoint-configurations#max-workers) | ⬆️ Throughput | High concurrency | +| [Lower queue delay threshold](/serverless/endpoints/endpoint-configurations#auto-scaling-type) | ⬇️ Response time | Traffic spikes | -To ensure high availability during peak traffic, you should select multiple GPU types in your configuration rather than relying on a single hardware specification. When choosing hardware, a single high-end GPU is generally preferable to multiple lower-tier cards, as the superior memory bandwidth and newer architecture often yield better performance than parallelization across weaker cards. When choosing multiple [GPU types](/references/gpu-types), you should select the [GPU categories](/serverless/endpoints/endpoint-configurations#gpu-configuration) that are most likely to be available in your desired data centers. +## Understanding delay time -For latency-sensitive applications, utilizing active workers is the most effective way to eliminate cold starts. You should also configure your [max workers](/serverless/endpoints/endpoint-configurations#max-workers) setting with approximately 20% headroom above your expected concurrency. This buffer ensures that your endpoint can handle sudden load spikes without throttling requests or hitting capacity limits. +Two metrics affect request response time: -Your architectural choices also significantly impact performance. Whenever possible, bake your models directly into the Docker image to leverage the high-speed local NVMe storage of the host . If you utilize [network volumes](/storage/network-volumes) for larger datasets, remember that this restricts your endpoint to specific data centers, which effectively shrinks your pool of available compute resources. +| Metric | Description | Optimization | +|--------|-------------|--------------| +| **Delay time** | Waiting for a worker (includes cold start) | Model caching, active workers | +| **Execution time** | GPU processing the request | Code optimization, GPU selection | -## Reducing worker startup times - -There are two key metrics to consider when optimizing your workers to reduce request response times: - - - **Delay time**: The time spent waiting for a worker to become available. This includes the cold start time if a new worker needs to be spun up. - - **Execution time**: The time the GPU takes to actually process the request once the worker has received the job. +**Delay time** breaks down into: +- **Initialization time**: Downloading Docker image +- **Cold start time**: Loading model into GPU memory -Try [benchmarking your workers](/serverless/development/benchmarking) to measure these metrics. +Use [benchmarking](/serverless/development/benchmarking) to measure these metrics for your workload. -**Delay time** is comprised of: + +If cold start exceeds 7 minutes, the worker is marked unhealthy. Extend with `RUNPOD_INIT_TIMEOUT=800` (seconds). 
+

- - **Initialization time**: The time spent downloading the Docker image.
- - **Cold start time**: The time spent loading the model into memory.
+## Reduce cold starts

-If your delay time is high, use these strategies to reduce it.
+### Use cached models (recommended)

-
-If your worker's cold start time exceeds the default 7-minute limit, the system may mark it as unhealthy. You can extend this limit by setting the `RUNPOD_INIT_TIMEOUT` environment variable (e.g. `RUNPOD_INIT_TIMEOUT=800` for 800 seconds).
-
+For models on Hugging Face, [cached models](/serverless/endpoints/model-caching) provide the fastest cold starts and lowest cost.

-### Use cached models
+### Bake models into images

-If your model is available on Hugging Face, we strongly recommend enabling [cached models](/serverless/endpoints/model-caching). This provides the fastest cold starts and lowest cost for any Serverless deployment option.
+For private models, [embed them in your Docker image](/serverless/workers/create-dockerfile#including-models-and-files). Models load from high-speed local NVMe storage instead of downloading at runtime.

-### Bake models into Docker images
+### Maintain active workers

-If your model is not available on Hugging Face, you can package your ML models [directly into your worker container image](/serverless/workers/create-dockerfile#including-models-and-files) instead of downloading them in your handler function. This strategy places models on the worker's high-speed local storage (SSD/NVMe), dramatically reducing the time needed to load models into GPU memory. Note that extremely large models (500GB+) may still require network volume storage.
+Set [active workers](/serverless/endpoints/endpoint-configurations#active-workers) > 0 to eliminate cold starts entirely. Active workers cost up to 30% less than flex workers.

-### Use network volumes during development
+**Formula**: `Active workers = (Requests/min × Request duration in seconds) / 60`

-For flexibility during development, save large models to a [network volume](/storage/network-volumes) using a Pod or one-time handler, then mount this volume to your Serverless workers. While network volumes offer slower model loading compared to embedding models directly or using cached models, they can speed up your workflow by enabling rapid iteration and seamless switching between different models and configurations.
+Example: 6 requests/min × 30 seconds = 3 active workers needed.

-### Maintain active workers
+## Improve availability
+
+### Select multiple GPU types
+
+Specify multiple [GPU types](/references/gpu-types) in priority order. A single high-end GPU often outperforms multiple lower-tier cards for inference.

-Set [active worker counts](/serverless/endpoints/endpoint-configurations#active-workers) above zero to completely eliminate cold starts. These workers remain ready to process requests instantly and cost up to 30% less when idle compared to standard (flex) workers.
+### Add headroom to max workers

-You can estimate the optimal number of active workers using the formula: `(Requests per Minute × Request Duration) / 60`. For example, with 6 requests per minute taking 30 seconds each, you would need 3 active workers to handle the load without queuing.
+Set [max workers](/serverless/endpoints/endpoint-configurations#max-workers) ~20% above expected concurrency to handle load spikes without throttling. 
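+
+As a back-of-the-envelope check of the sizing guidance above (illustrative arithmetic only, not an official sizing tool):
+
+```python
+import math
+
+# Example traffic from the formula above: 6 requests/min, 30 s per request.
+requests_per_min = 6
+request_duration_s = 30
+
+active_workers = math.ceil(requests_per_min * request_duration_s / 60)
+max_workers = math.ceil(active_workers * 1.2)  # ~20% headroom for spikes
+
+print(active_workers, max_workers)  # 3, 4
+```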
-### Optimize scaling parameters +### Tune auto-scaling -Fine-tune your [auto-scaling configuration](/serverless/endpoints/endpoint-configurations#auto-scaling-type) for more responsive worker provisioning. Lowering the queue delay threshold to 2-3 seconds (default 4) or decreasing the request count threshold allows the system to respond more swiftly to traffic fluctuations. +Lower the [queue delay threshold](/serverless/endpoints/endpoint-configurations#auto-scaling-type) to 2-3 seconds (default: 4) for faster worker provisioning. -### Increase maximum worker limits +## Architecture considerations -Set a higher [max worker](/serverless/endpoints/endpoint-configurations#max-workers) limit to ensure your Docker images are pre-cached across multiple compute nodes and data centers. This proactive approach eliminates image download delays during scaling events, significantly reducing startup times. \ No newline at end of file +| Choice | Tradeoff | +|--------|----------| +| **Baked models** | Fastest loading, but larger images | +| **Network volumes** | Flexible, but restricts to specific data centers | +| **Multiple GPU types** | Higher availability, variable performance | diff --git a/serverless/development/overview.mdx b/serverless/development/overview.mdx index 7dcc8d18..a639a314 100644 --- a/serverless/development/overview.mdx +++ b/serverless/development/overview.mdx @@ -76,7 +76,7 @@ Learn more in [Local testing](/serverless/development/local-testing). Implement robust error handling to ensure your workers remain stable and return useful error messages. -Learn more in [Error handling](/serverless/development/error-handling). +Learn more in [Error handling](/serverless/workers/handler-functions#error-handling). ## SDK utilities diff --git a/serverless/endpoints/endpoint-configurations.mdx b/serverless/endpoints/endpoint-configurations.mdx index 1696768a..3a9c97fe 100644 --- a/serverless/endpoints/endpoint-configurations.mdx +++ b/serverless/endpoints/endpoint-configurations.mdx @@ -7,29 +7,39 @@ description: "Reference guide for all Serverless endpoint settings and parameter import GPUTable from '/snippets/serverless-gpu-pricing-table.mdx'; import { MachinesTooltip, InferenceTooltip } from "/snippets/tooltips.jsx"; -This guide details the configuration options available for Runpod Serverless endpoints. These settings control how your endpoint scales, how it utilizes hardware, and how it manages request lifecycles. +This guide details the configuration options available for Runpod Serverless endpoints. -Some settings can only be updated after deploying your endpoint. For instructions on modifying an existing endpoint, see [Edit an endpoint](/serverless/endpoints/overview#edit-an-endpoint). +Some settings can only be updated after deploying your endpoint. See [Edit an endpoint](/serverless/endpoints/overview#edit-an-endpoint). +## Quick reference + +| Setting | Default | Description | +|---------|---------|-------------| +| **Active workers** | 0 | Always-on workers (eliminates cold starts) | +| **Max workers** | 3 | Maximum concurrent workers | +| **GPUs per worker** | 1 | GPU count per worker instance | +| **Idle timeout** | 5s | Time before idle worker shuts down | +| **Execution timeout** | 600s (10 min) | Max job duration | +| **Job TTL** | 24h | Total job lifespan in system | +| **FlashBoot** | Enabled | Faster cold starts via state retention | + ## General configuration ### Endpoint name -The name assigned to your endpoint helps you identify it within the Runpod console. 
This is a local display name and does not impact the endpoint ID used for API requests. +Display name for identifying your endpoint in the console. Does not affect the endpoint ID used for API requests. ### Endpoint type -Select the architecture that best fits your application's traffic pattern: - -**Queue based endpoints** utilize a built-in queueing system to manage requests. They are ideal for asynchronous tasks, batch processing, and long-running jobs where immediate synchronous responses are not required. These endpoints provide guaranteed execution and automatic retries for failed requests. Queue based endpoints are implemented using [handler functions](/serverless/workers/handler-functions). +**Queue-based endpoints** use a built-in queueing system with guaranteed execution and automatic retries. Ideal for async tasks, batch processing, and long-running jobs. Implemented using [handler functions](/serverless/workers/handler-functions). -**Load balancing endpoints** route traffic directly to available workers, bypassing the internal queue. They are designed for high-throughput, low-latency applications that require synchronous request/response cycles, such as real-time or custom REST APIs. For implementation details, see [Load balancing endpoints](/serverless/load-balancing/overview). +**Load balancing endpoints** route traffic directly to workers, bypassing the queue. Designed for low-latency applications like real-time or custom REST APIs. See [Load balancing endpoints](/serverless/load-balancing/overview). ### GPU configuration -This setting determines the hardware tier your workers will utilize. You can select multiple GPU categories to create a prioritized list. Runpod attempts to allocate the first category in your list. If that hardware is unavailable, it automatically falls back to the subsequent options. Selecting multiple GPU types significantly improves endpoint availability during periods of high demand. +Determines the hardware tier for your workers. Select multiple GPU categories to create a prioritized fallback list. If your first choice is unavailable, Runpod automatically uses the next option. Selecting multiple types improves availability during high demand. @@ -37,83 +47,79 @@ This setting determines the hardware tier your workers will utilize. You can sel ### Active workers -This setting defines the minimum number of workers that remain warm and ready to process requests at all times. Setting this to 1 or higher eliminates cold starts for the initial wave of requests. Active workers incur charges even when idle, but they receive a 20-30% discount compared to on-demand workers. +Minimum number of workers that remain warm and ready at all times. Setting this to 1+ eliminates cold starts. Active workers incur charges when idle but receive a 20-30% discount. ### Max workers -This setting controls the maximum number of concurrent instances your endpoint can scale to. This acts as a safety limit for costs and a cap on concurrency. We recommend setting your max worker count approximately 20% higher than your expected maximum concurrency. This buffer allows for smoother scaling during traffic spikes. +Maximum concurrent instances your endpoint can scale to. Acts as a cost safety limit and concurrency cap. Set ~20% higher than expected max concurrency to handle traffic spikes smoothly. ### GPUs per worker -This defines how many GPUs are assigned to a single worker instance. The default is 1. 
When choosing between multiple lower-tier GPUs or fewer high-end GPUs, you should generally prioritize high-end GPUs with lower GPU count per worker when possible. +Number of GPUs assigned to each worker instance. Default is 1. Generally prioritize fewer high-end GPUs over multiple lower-tier GPUs. ### Auto-scaling type -This setting determines the logic used to scale workers up and down. +**Queue delay**: Adds workers when requests wait longer than the threshold (default: 4 seconds). Best when slight delays are acceptable for higher utilization. -**Queue delay** scaling adds workers based on wait times. If requests sit in the queue for longer than a defined threshold (default 4 seconds), the system provisions new workers. This is best for workloads where slight delays are acceptable in exchange for higher utilization. - -**Request count** scaling is more aggressive. It adjusts worker numbers based on the total volume of pending and active work. The formula used is `Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)`. Use a scaler value of 1 for maximum responsiveness, or increase it to scale more conservatively. This strategy is recommended for LLM workloads or applications with frequent, short requests. +**Request count**: More aggressive scaling based on pending + active work. Formula: `Math.ceil((requestsInQueue + requestsInProgress) / scalerValue)`. Use scaler value of 1 for max responsiveness. Recommended for LLM workloads or frequent short requests. ## Lifecycle and timeouts ### Idle timeout -The idle timeout determines how long a worker remains active after completing a request before shutting down. While a worker is idle, you are billed for the time, but the worker remains "warm," allowing it to process subsequent requests immediately. The default is 5 seconds. +How long a worker stays active after completing a request before shutting down. You're billed during idle time, but the worker remains warm for immediate processing. Default: 5 seconds. ### Execution timeout -The execution timeout specifies the maximum duration a single job is allowed to run while actively being processed by a worker. When exceeded, the job is marked as failed and the worker is stopped. We strongly recommend keeping this enabled to prevent runaway jobs from consuming infinite resources. The default is 600 seconds (10 minutes). The minimum is 5 seconds and maximum is 7 days. +Maximum duration for a single job. When exceeded, the job fails and the worker stops. Keep enabled to prevent runaway jobs. Default: 600s (10 min). Range: 5s to 7 days. -You can configure the execution timeout in the **Advanced** section of your endpoint settings. You can also override this setting on a per-request basis using the `executionTimeout` field in the [job policy](/serverless/endpoints/send-requests#execution-policies). +Configure in **Advanced** settings, or override per-request via `executionTimeout` in the [job policy](/serverless/endpoints/send-requests#execution-policies). ### Job TTL (time-to-live) -The TTL defines the total lifespan of a job in the system. Once the TTL expires, the job's data is deleted from the system regardless of its current state—whether it is queued, actively running, or completed. The default is 24 hours. The minimum is 10 seconds and maximum is 7 days. +Total lifespan of a job in the system. When TTL expires, job data is deleted regardless of state (queued, running, or completed). Default: 24 hours. Range: 10s to 7 days. 
-The TTL timer starts when the job is submitted, not when execution begins. This means if a job sits in the queue waiting for an available worker, that time counts against the TTL. For example, if you set a TTL of 1 hour and the job waits in queue for 45 minutes, only 15 minutes remain for actual execution.
+The timer starts at submission, not execution. If a job queues for 45 minutes with a 1-hour TTL, only 15 minutes remain for execution.

-TTL is a hard limit on the job's existence. If the TTL expires while a job is actively running on a worker, the job is immediately removed from the system and subsequent status checks return a 404. This applies even if the job would have completed successfully given more time. Always set TTL to comfortably cover both expected queue time and execution time.
+TTL is a hard limit. If it expires while a job is running, the job is immediately removed and status checks return 404. Set TTL to cover both expected queue time and execution time.

-You can override this on a per-request basis using the `ttl` field in the [job policy](/serverless/endpoints/send-requests#execution-policies).
+Override per-request via `ttl` in the [job policy](/serverless/endpoints/send-requests#execution-policies).

### Result retention

-After a job completes, the system retains the results for a limited time. This retention period is separate from the Job TTL and cannot be extended:
-
-| Request type | Result retention | Notes |
-|--------------|------------------|-------|
-| Asynchronous (`/run`) | 30 minutes | Retrieve results via `/status/{job_id}` |
-| Synchronous (`/runsync`) | 1 minute | Results returned in the response; also available via `/status/{job_id}` |
+| Request type | Retention | Notes |
+|--------------|-----------|-------|
+| Async (`/run`) | 30 min | Retrieve via `/status/{job_id}` |
+| Sync (`/runsync`) | 1 min | Returned in response; also available via `/status/{job_id}` |

-Once the retention period expires, the job data is permanently deleted.
+Results are permanently deleted after retention expires.

## Performance features

### FlashBoot

-FlashBoot reduces cold start times by retaining the state of worker resources shortly after they spin down. This allows the system to "revive" a worker much faster than a standard fresh boot. FlashBoot is most effective on endpoints with consistent traffic, where workers frequently cycle between active and idle states.
+Reduces cold starts by retaining worker state after spin-down, allowing faster "revival" than fresh boots. Most effective on endpoints with consistent traffic where workers frequently cycle between active and idle.

### Model

-The Model field allows you to select from a list of [cached models](/serverless/endpoints/model-caching). When selected, Runpod schedules your workers on host that already have these large model files pre-loaded. This significantly reduces the time required to load models during worker initialization.
+Select from [cached models](/serverless/endpoints/model-caching) to schedule workers on machines with model files pre-loaded. Significantly reduces model loading time during initialization.

## Advanced settings

### Data centers

-You can restrict your endpoint to specific geographical regions. For maximum reliability and availability, we recommend allowing all data centers. Restricting this list decreases the pool of available GPUs your endpoint can draw from.
+Restrict your endpoint to specific regions. For maximum availability, allow all data centers; restricting this list decreases the available GPU pool. 
### Network volumes

-[Network volumes](/storage/network-volumes) provide persistent storage that survives worker restarts. While they enable data sharing between workers, they introduce network latency and restrict your endpoint to the specific data center where the volume resides. Use network volumes only if your workload specifically requires shared persistence or datasets larger than the container limit.
+[Network volumes](/storage/network-volumes) provide persistent storage across worker restarts. Tradeoffs: adds network latency and restricts your endpoint to the volume's data center. Use only when you need shared persistence or datasets exceeding container limits.

### CUDA version selection

-This filter ensures your workers are scheduled on host with compatible drivers. While you should select the version your code requires, we recommend also selecting all newer versions. CUDA is generally backward compatible, and selecting a wider range of versions increases the pool of available hardware.
+Ensures workers run on machines with compatible drivers. Select your required version plus all newer versions, since CUDA is backward compatible and a wider range increases available hardware.

### Expose HTTP/TCP ports

-Enabling this option exposes the public IP and port of the worker, allowing for direct external communication. This is required for applications that need persistent connections, such as WebSockets.
+Exposes the worker's public IP and port for direct external communication. Required for persistent connections like WebSockets.
diff --git a/serverless/endpoints/model-caching.mdx b/serverless/endpoints/model-caching.mdx
index 0e36945b..d1eac64d 100644
--- a/serverless/endpoints/model-caching.mdx
+++ b/serverless/endpoints/model-caching.mdx
@@ -14,7 +14,7 @@ Enabling cached models on your endpoints can reduce times a

## Why use cached models?

-- **Faster cold starts:** A "cold start" refers to the delay between when a request is received by an endpoint with no running workers and when a worker is fully "warmed up" and ready to handle the request. Using cached models can reduce cold start times to just a few seconds, even for large models.
+- **Faster cold starts:** Using cached models can reduce cold start times to just a few seconds, even for large models.
- **Reduced costs:** You aren't billed for worker time while your model is being downloaded. This is especially impactful for large models that can take several minutes to load.
- **Accelerated deployment:** You can deploy cached models instantly without waiting for external downloads or transfers.
- **Smaller container images:** By decoupling models from your container image, you can create smaller, more focused images that contain only your application logic.
diff --git a/serverless/endpoints/operation-reference.mdx b/serverless/endpoints/operation-reference.mdx
new file mode 100644
index 00000000..e841e751
--- /dev/null
+++ b/serverless/endpoints/operation-reference.mdx
@@ -0,0 +1,920 @@
+---
+title: "Operation reference"
+sidebarTitle: "Operation reference"
+description: "Detailed API reference for all queue-based endpoint operations."
+---
+
+This reference covers all operations available for queue-based endpoints. For conceptual information and advanced options, see [Send API requests](/serverless/endpoints/send-requests). 
+
+## Setup
+
+Before running these examples, install the Runpod SDK:
+
+```bash
+# Python
+python -m pip install runpod
+
+# JavaScript
+npm install --save runpod-sdk
+
+# Go
+go get github.com/runpod/go-sdk && go mod tidy
+```
+
+Set your [API key](/get-started/api-keys) and endpoint ID as environment variables:
+
+```bash
+export RUNPOD_API_KEY="YOUR_API_KEY"
+export ENDPOINT_ID="YOUR_ENDPOINT_ID"
+```
+
+
+You can also send requests using standard HTTP libraries like `fetch` (JavaScript) and `requests` (Python).
+
+
+## /runsync
+
+Synchronous jobs wait for completion and return the complete result in a single response. Best for shorter tasks, interactive applications, and simpler client code without status polling.
+
+- **Maximum payload size**: 20 MB
+- **Result retention**: 1 minute after completion
+- **Default wait time**: 90 seconds (adjustable via `?wait=x` parameter, 1000-300000 ms)
+
+```sh
+https://api.runpod.ai/v2/$ENDPOINT_ID/runsync?wait=120000
+```
+
+
+The `?wait` parameter controls how long the request waits for job completion, not how long results are retained.
+
+
+
+
+```sh
+curl --request POST \
+     --url https://api.runpod.ai/v2/$ENDPOINT_ID/runsync \
+     -H "accept: application/json" \
+     -H "authorization: $RUNPOD_API_KEY" \
+     -H "content-type: application/json" \
+     -d '{ "input": { "prompt": "Hello, world!" }}'
+```
+
+
+
+
+```python
+import runpod
+import os
+
+runpod.api_key = os.getenv("RUNPOD_API_KEY")
+endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID"))
+
+try:
+    run_request = endpoint.run_sync(
+        {"prompt": "Hello, world!"},
+        timeout=60,  # Client timeout in seconds
+    )
+    print(run_request)
+except TimeoutError:
+    print("Job timed out.")
+```
+
+
+
+
+```javascript
+const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env;
+import runpodSdk from "runpod-sdk";
+
+const runpod = runpodSdk(RUNPOD_API_KEY);
+const endpoint = runpod.endpoint(ENDPOINT_ID);
+
+const result = await endpoint.runSync({
+  "input": {
+    "prompt": "Hello, World!",
+  },
+  timeout: 60000, // Client timeout in milliseconds
+});
+
+console.log(result);
+```
+
+
+
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"log"
+	"os"
+
+	"github.com/runpod/go-sdk/pkg/sdk"
+	"github.com/runpod/go-sdk/pkg/sdk/config"
+	rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint"
+)
+
+func main() {
+	apiKey := os.Getenv("RUNPOD_API_KEY")
+	endpointId := os.Getenv("ENDPOINT_ID")
+
+	endpoint, err := rpEndpoint.New(
+		&config.Config{ApiKey: &apiKey},
+		&rpEndpoint.Option{EndpointId: &endpointId},
+	)
+	if err != nil {
+		log.Fatalf("Failed to create endpoint: %v", err)
+	}
+
+	jobInput := rpEndpoint.RunSyncInput{
+		JobInput: &rpEndpoint.JobInput{
+			Input: map[string]interface{}{
+				"prompt": "Hello World",
+			},
+		},
+		Timeout: sdk.Int(60), // Client timeout in seconds
+	}
+
+	output, err := endpoint.RunSync(&jobInput)
+	if err != nil {
+		panic(err)
+	}
+
+	data, _ := json.Marshal(output)
+	fmt.Printf("output: %s\n", data)
+}
+```
+
+
+
+**Response:**
+
+```json
+{
+  "delayTime": 824,
+  "executionTime": 3391,
+  "id": "sync-79164ff4-d212-44bc-9fe3-389e199a5c15",
+  "output": [
+    {
+      "image": "https://image.url",
+      "seed": 46578
+    }
+  ],
+  "status": "COMPLETED"
+}
+```
+
+## /run
+
+Asynchronous jobs process in the background and return immediately with a job ID. Best for longer-running tasks, operations requiring significant processing time, and managing multiple concurrent jobs. 
+
+- **Maximum payload size**: 10 MB
+- **Result retention**: 30 minutes after completion
+
+
+
+```sh
+curl --request POST \
+     --url https://api.runpod.ai/v2/$ENDPOINT_ID/run \
+     -H "accept: application/json" \
+     -H "authorization: $RUNPOD_API_KEY" \
+     -H "content-type: application/json" \
+     -d '{"input": {"prompt": "Hello, world!"}}'
+```
+
+
+
+```python
+import runpod
+import os
+
+runpod.api_key = os.getenv("RUNPOD_API_KEY")
+endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID"))
+
+# Submit asynchronous job
+run_request = endpoint.run({"prompt": "Hello, World!"})
+
+# Check initial status
+status = run_request.status()
+print(f"Initial job status: {status}")
+
+if status != "COMPLETED":
+    # Poll for results with timeout
+    output = run_request.output(timeout=60)
+else:
+    output = run_request.output()
+print(f"Job output: {output}")
+```
+
+
+
+```javascript
+const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env;
+import runpodSdk from "runpod-sdk";
+
+const runpod = runpodSdk(RUNPOD_API_KEY);
+const endpoint = runpod.endpoint(ENDPOINT_ID);
+
+const result = await endpoint.run({
+  "input": {
+    "prompt": "Hello, World!",
+  },
+});
+
+console.log(result);
+```
+
+
+
+```go
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"log"
+	"os"
+
+	"github.com/runpod/go-sdk/pkg/sdk"
+	"github.com/runpod/go-sdk/pkg/sdk/config"
+	rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint"
+)
+
+func main() {
+	apiKey := os.Getenv("RUNPOD_API_KEY")
+	endpointId := os.Getenv("ENDPOINT_ID")
+
+	endpoint, err := rpEndpoint.New(
+		&config.Config{ApiKey: &apiKey},
+		&rpEndpoint.Option{EndpointId: &endpointId},
+	)
+	if err != nil {
+		log.Fatalf("Failed to create endpoint: %v", err)
+	}
+
+	jobInput := rpEndpoint.RunInput{
+		JobInput: &rpEndpoint.JobInput{
+			Input: map[string]interface{}{
+				"prompt": "Hello World",
+			},
+		},
+		RequestTimeout: sdk.Int(120),
+	}
+
+	output, err := endpoint.Run(&jobInput)
+	if err != nil {
+		panic(err)
+	}
+
+	data, _ := json.Marshal(output)
+	fmt.Printf("output: %s\n", data)
+}
+```
+
+
+
+
+**Response:**
+
+```json
+{
+  "id": "eaebd6e7-6a92-4bb8-a911-f996ac5ea99d",
+  "status": "IN_QUEUE"
+}
+```
+
+Retrieve results using the `/status` operation.
+
+## /status
+
+Check the current state, execution statistics, and results of previously submitted jobs.
+
+
+Configure time-to-live (TTL) for individual jobs by appending `?ttl=x` to the request URL. For example, `?ttl=6000` sets the TTL to 6 seconds.
+
+
+
+
+Replace `YOUR_JOB_ID` with the job ID from your `/run` response. 
+
+```sh
+curl --request GET \
+  --url https://api.runpod.ai/v2/$ENDPOINT_ID/status/YOUR_JOB_ID \
+  -H "authorization: $RUNPOD_API_KEY"
+```
+
+
+
+
+```python
+import runpod
+import os
+
+runpod.api_key = os.getenv("RUNPOD_API_KEY")
+endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID"))
+
+input_payload = {"input": {"prompt": "Hello, World!"}}
+
+run_request = endpoint.run(input_payload)
+
+# Initial check without blocking, useful for quick tasks
+status = run_request.status()
+print(f"Initial job status: {status}")
+
+if status != "COMPLETED":
+    # Polling with timeout for long-running tasks
+    output = run_request.output(timeout=60)
+else:
+    output = run_request.output()
+print(f"Job output: {output}")
+```
+
+
+
+
+```javascript
+const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env;
+import runpodSdk from "runpod-sdk";
+
+async function main() {
+  try {
+    const runpod = runpodSdk(RUNPOD_API_KEY);
+    const endpoint = runpod.endpoint(ENDPOINT_ID);
+    const result = await endpoint.run({
+      input: {
+        prompt: "Hello, World!",
+      },
+    });
+
+    const { id } = result;
+    if (!id) {
+      console.error("No ID returned from endpoint.run");
+      return;
+    }
+
+    const status = await endpoint.status(id);
+    console.log(status);
+  } catch (error) {
+    console.error("An error occurred:", error);
+  }
+}
+
+main();
+```
+
+
+
+
+Replace `YOUR_JOB_ID` with the job ID from your `/run` response.
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+    "log"
+    "os"
+
+    "github.com/runpod/go-sdk/pkg/sdk"
+    "github.com/runpod/go-sdk/pkg/sdk/config"
+    rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint"
+)
+
+func main() {
+    apiKey := os.Getenv("RUNPOD_API_KEY")
+    endpointId := os.Getenv("ENDPOINT_ID")
+
+    endpoint, err := rpEndpoint.New(
+        &config.Config{ApiKey: &apiKey},
+        &rpEndpoint.Option{EndpointId: &endpointId},
+    )
+    if err != nil {
+        log.Fatalf("Failed to create endpoint: %v", err)
+    }
+
+    input := rpEndpoint.StatusInput{
+        Id: sdk.String("YOUR_JOB_ID"),
+    }
+    output, err := endpoint.Status(&input)
+    if err != nil {
+        panic(err)
+    }
+
+    dt, _ := json.Marshal(output)
+    fmt.Printf("output: %s\n", dt)
+}
+```
+
+
+
+
+**Response:**
+
+Returns the job status (`IN_QUEUE`, `IN_PROGRESS`, `COMPLETED`, `FAILED`) with an optional `output` field:
+
+```json
+{
+  "delayTime": 31618,
+  "executionTime": 1437,
+  "id": "60902e6c-08a1-426e-9cb9-9eaec90f5e2b-u1",
+  "output": {
+    "input_tokens": 22,
+    "output_tokens": 16,
+    "text": ["Hello! How can I assist you today?\nUSER: I'm having"]
+  },
+  "status": "COMPLETED"
+}
+```
+
+## /stream
+
+Receive incremental results as they become available from jobs that generate output progressively. Best for text generation, long-running jobs where you want to show progress, and large outputs that benefit from incremental processing.
+
+Your handler must support streaming. See [Streaming handlers](/serverless/workers/handler-functions#streaming-handlers) for implementation details.
+
+
+
+
+Replace `YOUR_JOB_ID` with the job ID from your `/run` response.
+
+```sh
+curl --request GET \
+  --url https://api.runpod.ai/v2/$ENDPOINT_ID/stream/YOUR_JOB_ID \
+  -H "accept: application/json" \
+  -H "authorization: $RUNPOD_API_KEY"
+```
+
+
+
+
+```python
+import runpod
+import os
+
+runpod.api_key = os.getenv("RUNPOD_API_KEY")
+endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID"))
+
+run_request = endpoint.run(
+    {
+        "input": {
+            "prompt": "Hello, world!",
+        }
+    }
+)
+
+for output in run_request.stream():
+    print(output)
+```
+
+
+
+```javascript
+const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env;
+import runpodSdk from "runpod-sdk";
+
+async function main() {
+  const runpod = runpodSdk(RUNPOD_API_KEY);
+  const endpoint = runpod.endpoint(ENDPOINT_ID);
+  const result = await endpoint.run({
+    input: {
+      prompt: "Hello, World!",
+    },
+  });
+
+  console.log(result);
+
+  const { id } = result;
+  for await (const chunk of endpoint.stream(id)) {
+    console.log(`${JSON.stringify(chunk, null, 2)}`);
+  }
+  console.log("done streaming");
+}
+
+main();
+```
+
+
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+    "os"
+
+    "github.com/runpod/go-sdk/pkg/sdk/config"
+    rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint"
+)
+
+func main() {
+    apiKey := os.Getenv("RUNPOD_API_KEY")
+    endpointId := os.Getenv("ENDPOINT_ID")
+
+    endpoint, err := rpEndpoint.New(
+        &config.Config{ApiKey: &apiKey},
+        &rpEndpoint.Option{EndpointId: &endpointId},
+    )
+    if err != nil {
+        panic(err)
+    }
+
+    request, err := endpoint.Run(&rpEndpoint.RunInput{
+        JobInput: &rpEndpoint.JobInput{
+            Input: map[string]interface{}{
+                "prompt": "Hello World",
+            },
+        },
+    })
+    if err != nil {
+        panic(err)
+    }
+
+    streamChan := make(chan rpEndpoint.StreamResult, 100)
+
+    err = endpoint.Stream(&rpEndpoint.StreamInput{Id: request.Id}, streamChan)
+    if err != nil {
+        // If the stream timed out, drain any data that was already streamed.
+        if err.Error() == "ctx timeout reached" {
+            for data := range streamChan {
+                dt, _ := json.Marshal(data)
+                fmt.Printf("output: %s\n", dt)
+            }
+        }
+        panic(err)
+    }
+
+    for data := range streamChan {
+        dt, _ := json.Marshal(data)
+        fmt.Printf("output: %s\n", dt)
+    }
+}
+```
+
+
+
+
+
+The maximum size for a single streamed payload chunk is 1 MB. Larger outputs are split across multiple chunks.
+
+
+**Response:**
+
+```json
+[
+  {
+    "metrics": {
+      "avg_gen_throughput": 0,
+      "avg_prompt_throughput": 0,
+      "cpu_kv_cache_usage": 0,
+      "gpu_kv_cache_usage": 0.0016722408026755853,
+      "input_tokens": 0,
+      "output_tokens": 1,
+      "pending": 0,
+      "running": 1,
+      "scenario": "stream",
+      "stream_index": 2,
+      "swapped": 0
+    },
+    "output": {
+      "input_tokens": 0,
+      "output_tokens": 1,
+      "text": [" How"]
+    }
+  }
+]
+```
+
+## /cancel
+
+Stop jobs that are no longer needed or taking too long. Stops in-progress jobs, removes queued jobs before they start, and returns immediately with the canceled status.
+
+
+
+
+Replace `YOUR_JOB_ID` with the job ID from your `/run` response.
+
+```sh
+curl --request POST \
+  --url https://api.runpod.ai/v2/$ENDPOINT_ID/cancel/YOUR_JOB_ID \
+  -H "authorization: $RUNPOD_API_KEY"
+```
+
+
+
+```python
+import time
+import runpod
+import os
+
+runpod.api_key = os.getenv("RUNPOD_API_KEY")
+endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID"))
+
+run_request = endpoint.run(
+    {
+        "input": {
+            "prompt": "Hello, world!",
+        }
+    }
+)
+
+try:
+    while True:
+        status = run_request.status()
+        print(f"Current job status: {status}")
+
+        if status == "COMPLETED":
+            output = run_request.output()
+            print("Job output:", output)
+            break
+        elif status in ["FAILED", "ERROR"]:
+            print("Job failed to complete successfully.")
+            break
+        else:
+            time.sleep(10)
+except KeyboardInterrupt:  # Cancel the job on Ctrl+C
+    print("KeyboardInterrupt detected. Canceling the job...")
+    if run_request:  # Check if a job is active
+        run_request.cancel()
+        print("Job canceled.")
+```
+
+
+
+
+```javascript
+const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env;
+import runpodSdk from "runpod-sdk";
+
+async function main() {
+  try {
+    const runpod = runpodSdk(RUNPOD_API_KEY);
+    const endpoint = runpod.endpoint(ENDPOINT_ID);
+    const result = await endpoint.run({
+      input: {
+        prompt: "Hello, World!",
+      },
+    });
+
+    const { id } = result;
+    if (!id) {
+      console.error("No ID returned from endpoint.run");
+      return;
+    }
+
+    const cancel = await endpoint.cancel(id);
+    console.log(cancel);
+  } catch (error) {
+    console.error("An error occurred:", error);
+  }
+}
+
+main();
+```
+
+
+
+```go
+package main
+
+import (
+    "encoding/json"
+    "fmt"
+    "os"
+
+    "github.com/runpod/go-sdk/pkg/sdk"
+    "github.com/runpod/go-sdk/pkg/sdk/config"
+    rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint"
+)
+
+func main() {
+    apiKey := os.Getenv("RUNPOD_API_KEY")
+    endpointId := os.Getenv("ENDPOINT_ID")
+
+    endpoint, err := rpEndpoint.New(
+        &config.Config{ApiKey: &apiKey},
+        &rpEndpoint.Option{EndpointId: &endpointId},
+    )
+    if err != nil {
+        panic(err)
+    }
+
+    // Replace with the job ID from your /run response.
+    cancelInput := rpEndpoint.CancelInput{
+        Id: sdk.String("YOUR_JOB_ID"),
+    }
+    output, err := endpoint.Cancel(&cancelInput)
+    if err != nil {
+        panic(err)
+    }
+
+    cancelData, _ := json.Marshal(output)
+    fmt.Printf("cancel output: %s\n", cancelData)
+}
+```
+
+
+
+
+**Response:**
+
+```json
+{
+  "id": "724907fe-7bcc-4e42-998d-52cb93e1421f-u1",
+  "status": "CANCELLED"
+}
+```
+
+## /retry
+
+Requeue jobs that have failed or timed out without submitting a new request. Maintains the same job ID, requeues with original input parameters, and removes previous output. Only works for jobs with `FAILED` or `TIMED_OUT` status.
+
+Replace `YOUR_JOB_ID` with the job ID from your `/run` response.
+
+```sh
+curl --request POST \
+  --url https://api.runpod.ai/v2/$ENDPOINT_ID/retry/YOUR_JOB_ID \
+  -H "authorization: $RUNPOD_API_KEY"
+```
+
+**Response:**
+
+```json
+{
+  "id": "60902e6c-08a1-426e-9cb9-9eaec90f5e2b-u1",
+  "status": "IN_QUEUE"
+}
+```
+
+
+Job results expire after a set period. Async job (`/run`) results are available for 30 minutes; sync job (`/runsync`) results for 1 minute (up to 5 minutes with a longer `?wait` value). Once expired, jobs cannot be retried.
+
+
+## /purge-queue
+
+Remove all pending jobs from the queue. Useful for error recovery, clearing outdated requests, and resetting after configuration changes.
+ + + +```sh +curl --request POST \ + --url https://api.runpod.ai/v2/$ENDPOINT_ID/purge-queue \ + -H "authorization: $RUNPOD_API_KEY" +``` + + + +```python +import runpod +import os + +runpod.api_key = os.getenv("RUNPOD_API_KEY") +endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) + +endpoint.purge_queue(timeout=3) +``` + + + +```javascript +const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; +import runpodSdk from "runpod-sdk"; + +async function main() { + try { + const runpod = runpodSdk(RUNPOD_API_KEY); + const endpoint = runpod.endpoint(ENDPOINT_ID); + await endpoint.run({ + input: { + prompt: "Hello, World!", + }, + }); + + const purgeQueue = await endpoint.purgeQueue(); + console.log(purgeQueue); + } catch (error) { + console.error("An error occurred:", error); + } +} + +main(); +``` + + + + + +This operation only affects jobs waiting in the queue. Jobs already in progress continue to run. + + +**Response:** + +```json +{ + "removed": 2, + "status": "completed" +} +``` + +## /health + +Get a quick overview of your endpoint's operational status including worker availability and job queue status. + + + +```sh +curl --request GET \ + --url https://api.runpod.ai/v2/$ENDPOINT_ID/health \ + -H "authorization: $RUNPOD_API_KEY" +``` + + + + + +```python +import runpod +import json +import os + +runpod.api_key = os.getenv("RUNPOD_API_KEY") +endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) + +endpoint_health = endpoint.health() +print(json.dumps(endpoint_health, indent=2)) +``` + + + +```javascript +const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; +import runpodSdk from "runpod-sdk"; + +const runpod = runpodSdk(RUNPOD_API_KEY); +const endpoint = runpod.endpoint(ENDPOINT_ID); + +const health = await endpoint.health(); +console.log(health); +``` + + + +```go +package main + +import ( + "encoding/json" + "fmt" + "log" + "os" + + "github.com/runpod/go-sdk/pkg/sdk/config" + rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint" +) + +func main() { + apiKey := os.Getenv("RUNPOD_API_KEY") + endpointId := os.Getenv("ENDPOINT_ID") + + endpoint, err := rpEndpoint.New( + &config.Config{ApiKey: &apiKey}, + &rpEndpoint.Option{EndpointId: &endpointId}, + ) + if err != nil { + log.Fatalf("Failed to create endpoint: %v", err) + } + + health, err := endpoint.Health() + if err != nil { + log.Fatalf("Failed to get health: %v", err) + } + + data, _ := json.Marshal(health) + fmt.Printf("Health: %s\n", data) +} +``` + + + + +**Response:** + +```json +{ + "jobs": { + "completed": 1, + "failed": 5, + "inProgress": 0, + "inQueue": 2, + "retried": 0 + }, + "workers": { + "idle": 0, + "running": 0 + } +} +``` diff --git a/serverless/endpoints/overview.mdx b/serverless/endpoints/overview.mdx index 85f05306..3d40078c 100644 --- a/serverless/endpoints/overview.mdx +++ b/serverless/endpoints/overview.mdx @@ -2,76 +2,62 @@ title: "Overview" sidebarTitle: "Overview" description: "Deploy and manage Serverless endpoints using the Runpod console or REST API." +mode: "wide" --- - -Endpoints are the foundation of Runpod Serverless, serving as the gateway for deploying and managing your [Serverless workers](/serverless/workers/overview). They provide a consistent API interface that allows your applications to interact with powerful compute resources on demand. - -Endpoints are RESTful APIs that accept [HTTP requests](/serverless/endpoints/send-requests), processing the input using your [handler function](/serverless/workers/handler-functions), and returning the result via HTTP response. 
Each endpoint provides a unique URL and abstracts away the complexity of managing individual GPUs/CPUs. +
+ +Endpoints are the foundation of Runpod Serverless, serving as the gateway for deploying and managing your [Serverless workers](/serverless/workers/overview). Each endpoint provides a unique URL that accepts [HTTP requests](/serverless/endpoints/send-requests), processes them using your [handler function](/serverless/workers/handler-functions), and returns results. + + + + Learn how to send requests to your endpoints. + + + Configure scaling, timeouts, and GPU selection. + + + Monitor job status and metrics. + + + Reduce cold starts with cached models. + + ## Endpoint types -### Queue-based endpoints - -Queue-based endpoints are the traditional type of endpoint that process requests sequentially in a queue (managed automatically by Runpod), providing guaranteed execution and automatic retries for failed requests. - -Queue-based endpoints offer two types execution modes: - -- **Asynchronous processing** via the `/run` endpoint operation, which lets you submit jobs that run in the background and check results later (with `/status`), making this ideal for long-running tasks. -- **Synchronous operations** through the `/runsync` endpoint operation, allowing you to receive immediate results in the same request, which is perfect for interactive applications. - -To learn more about the available endpoint operations, see the [Send API requests](/serverless/endpoints/send-requests#operation-overview) page. - -### Load balancing endpoints - -Load balancing endpoints offer **direct HTTP access** to your worker's HTTP server, bypassing the queueing system. These are ideal for real-time applications and streaming, but provide no queuing mechanism for request backlog (similar to UDP's behavior in networking). - -Load balancing endpoints don't require a handler function, allowing you to define your own custom API endpoints using any HTTP framework (like FastAPI or Flask). - -To learn more, see the [Load balancing endpoints](/serverless/load-balancing/overview) page. - -## Key features - -### Auto-scaling +| | Queue-based | Load balancing | +|---|-------------|----------------| +| **Processing** | Requests queued and processed sequentially | Direct HTTP access to workers | +| **Execution modes** | Async (`/run`) or sync (`/runsync`) | Custom HTTP endpoints | +| **Retries** | Automatic retries on failure | No automatic retries | +| **Handler required?** | Yes | No (use any HTTP framework) | +| **Best for** | Batch jobs, guaranteed execution | Real-time apps, streaming | -Runpod endpoints (both queue-based and load balancing) can automatically scale from zero to hundreds of workers based on demand. You can customize your endpoint configuration to adjust the minimum and maximum worker count, GPU allocation, and memory settings. The system also offers GPU prioritization, allowing you to specify preferred GPU types in order of priority. - -To learn more, see [Endpoint settings](/serverless/endpoints/endpoint-configurations). - -### Integration options - -Runpod endpoints support [webhook notifications](/serverless/endpoints/send-requests#webhook-notifications), allowing you to configure endpoints to call your webhook when jobs complete. - -It also includes [S3-compatible storage integration](/serverless/endpoints/send-requests#s3-compatible-storage-integration) for working with object storage for larger inputs and outputs. +Learn more about [load balancing endpoints](/serverless/load-balancing/overview). 
## Create an endpoint -Before creating an endpoint make sure you have a working [handler function](/serverless/workers/handler-functions) and [Dockerfile](/serverless/workers/create-dockerfile). +Before creating an endpoint, ensure you have a [handler function](/serverless/workers/handler-functions) and [Dockerfile](/serverless/workers/create-dockerfile). -To create a new Serverless endpoint through the Runpod web interface: - -1. Navigate to the [Serverless section](https://www.console.runpod.io/serverless) of the Runpod console. -2. Click **New Endpoint**. -3. On the **Deploy a New Serverless Endpoint** screen, choose your deployment source: - * **Import Git Repository** (if GitHub is connected). See [Deploy from GitHub](/serverless/workers/github-integration) for details. - * **Import from Docker Registry**. See [Deploy from Docker Hub](/serverless/workers/deploy) for details. - * Or select a preconfigured endpoint under **Ready-to-Deploy Repos**. -4. Follow the UI steps to configure your selected source (Docker image, GitHub repo), then click **Next**. -5. Configure your endpoint settings: - * **Endpoint Name**: The display name for your endpoint in the console. - * **Endpoint Type**: Select **Queue** for traditional queue-based processing or **Load balancer** for direct HTTP access. See [Load balancing endpoints](/serverless/load-balancing/overview) for details. - * **GPU Configuration**: Select the appropriate GPU types and configure worker settings. - * **Model**: (Optional) Enter a model URL from Hugging Face to optimize worker startup times. See [Cached models](/serverless/endpoints/model-caching) for details. - * **Container Configuration**: Edit the container start command, specify the [container disk size](/serverless/storage/overview), and expose HTTP/TCP ports. - * **Environment Variables**: Add [environment variables](/serverless/development/environment-variables) for your worker containers. -6. Click **Deploy Endpoint** to deploy. + +1. Navigate to the [Serverless section](https://www.console.runpod.io/serverless) and click **New Endpoint**. +2. Choose your deployment source: + - **Import Git Repository**: See [Deploy from GitHub](/serverless/workers/github-integration) + - **Import from Docker Registry**: See [Deploy from Docker Hub](/serverless/workers/deploy) + - **Ready-to-Deploy Repos**: Select a preconfigured endpoint +3. Configure your endpoint: + - **Endpoint Name** and **Type** (Queue-based or Load balancer) + - **GPU Configuration** and worker settings + - **Model** (optional): Enter a Hugging Face URL for [cached models](/serverless/endpoints/model-caching) + - **Environment Variables**: See [environment variables](/serverless/development/environment-variables) +4. Click **Deploy Endpoint**. 
-To create a Serverless endpoint using the REST API, send a POST request to the `/endpoints` endpoint: ```bash curl --request POST \ @@ -79,80 +65,40 @@ curl --request POST \ --header 'Authorization: Bearer RUNPOD_API_KEY' \ --header 'Content-Type: application/json' \ --data '{ - "allowedCudaVersions": [ - "12.8" - ], - "computeType": "GPU", - "cpuFlavorIds": [ - "cpu3c" - ], - "dataCenterIds": [ - "EU-RO-1", - "CA-MTL-1" - ], - "executionTimeoutMs": 600000, - "flashboot": true, - "gpuCount": 1, - "gpuTypeIds": [ - "NVIDIA GeForce RTX 4090" - ], - "idleTimeout": 5, - "name": "my-endpoint", - "scalerType": "QUEUE_DELAY", - "scalerValue": 4, - "templateId": "30zmvf89kd", - "vcpuCount": 2, - "workersMax": 3, - "workersMin": 0 -}' + "name": "my-endpoint", + "templateId": "30zmvf89kd", + "gpuTypeIds": ["NVIDIA GeForce RTX 4090"], + "workersMin": 0, + "workersMax": 3, + "idleTimeout": 5 + }' ``` -For complete API documentation and parameter details, see the [Serverless endpoint API reference](/api-reference/endpoints/POST/endpoints). +See the [Endpoint API reference](/api-reference/endpoints/POST/endpoints) for all parameters. - -You can optimize cost and availability by specifying GPU preferences in order of priority. Runpod attempts to allocate your first choice GPU. If unavailable, it automatically uses the next GPU in your priority list, ensuring your workloads run on the best available resources. - -You can enable or disable particular GPU types using the **Advanced > Enabled GPU Types** section. +Optimize cost and availability by specifying multiple GPU types in priority order. Runpod allocates your first choice if available, otherwise uses the next in your list. -After deployment, your endpoint takes time to initialize before it is ready to process requests. You can monitor the deployment status on the endpoint details page, which shows worker status and initialization progress. Once active, your endpoint displays a unique API URL (`https://api.runpod.ai/v2/{endpoint_id}/`) that you can use to send requests. +After deployment, your endpoint displays a unique API URL: `https://api.runpod.ai/v2/{endpoint_id}/` ## Edit an endpoint -{/* - - */} - -You can modify your endpoint's configuration at any time: - -1. Navigate to the [Serverless section](https://www.console.runpod.io/serverless) in the Runpod console. -2. Click the three dots in the top right corner of the endpoint you want to modify. -3. Click **Edit Endpoint**. -4. Update any [endpoint settings](/serverless/endpoints/endpoint-configurations) as needed. -5. Click **Save Endpoint** to save your changes. +1. Navigate to the [Serverless section](https://www.console.runpod.io/serverless). +2. Click the three dots on your endpoint → **Edit Endpoint**. +3. Update [endpoint settings](/serverless/endpoints/endpoint-configurations) and click **Save Endpoint**. -Changes to some settings (like GPU types or worker counts) may require restarting active workers to take effect. +Changes to GPU types or worker counts may require restarting active workers. ## Delete an endpoint -To delete an endpoint: - -1. Navigate to the [Serverless section](https://www.console.runpod.io/serverless) in the Runpod console. -2. Click the three dots in the top right corner of the endpoint you want to delete. -3. Click **Delete Endpoint**. -4. Type the name of the endpoint, then click **Confirm**. +1. Navigate to the [Serverless section](https://www.console.runpod.io/serverless). +2. Click the three dots on your endpoint → **Delete Endpoint**. +3. 
Type the endpoint name to confirm.
 
-Deleting an endpoint permanently removes all configuration, logs, and job history. This action cannot be undone.
+Deleting an endpoint permanently removes all configuration, logs, and job history.
 
 
-
-## Next steps
-
-* [Send requests to your endpoint](/serverless/endpoints/send-requests)
-* [Configure endpoint settings](/serverless/endpoints/endpoint-configurations)
-* [Monitor job states and metrics](/serverless/endpoints/job-states)
-* [Optimize endpoint performance](/serverless/development/optimization)
diff --git a/serverless/endpoints/send-requests.mdx b/serverless/endpoints/send-requests.mdx
index ca398f93..e2ad0ec9 100644
--- a/serverless/endpoints/send-requests.mdx
+++ b/serverless/endpoints/send-requests.mdx
@@ -6,58 +6,37 @@ description: "Submit and manage jobs for your queue-based endpoints by sending H
 
 import { QueueBasedEndpointsTooltip, LoadBalancingEndpointTooltip } from "/snippets/tooltips.jsx";
 
-After creating a [Severless endpoint](/serverless/endpoints/overview), you can start sending it HTTP requests (using `cURL` or the Runpod SDK) to submit jobs and retrieve results:
+After creating a [Serverless endpoint](/serverless/endpoints/overview), you can start sending HTTP requests to submit jobs and retrieve results:
 
 ```sh
-curl -x POST https://api.runpod.ai/v2/ENDPOINT_ID/run \
+curl -X POST https://api.runpod.ai/v2/ENDPOINT_ID/runsync \
   -H "authorization: Bearer RUNPOD_API_KEY" \
   -H "content-type: application/json" \
   -d '{ "input": { "prompt": "Hello, world!" }}'
 ```
 
-This page covers everything from basic input structure and job submission, to advanced options, rate limits, and best practices for queue-based endpoints.
-
-This guide is for . If you're building a , the request structure and endpoints will depend on how you define your HTTP servers.
+This guide is for . If you're building a , the request structure and endpoints depend on how you define your HTTP servers.
 
-
-
 ## How requests work
 
-A request can include parameters, payloads, and headers that define what the endpoint should process. For example, you can send a `POST` request to submit a job, or a `GET` request to check the status of a job, retrieve results, or check endpoint health.
-
-A **job** is a unit of work containing the input data from the request, packaged for processing by your [workers](/serverless/workers/overview).
-
-If no worker is immediately available, the job is queued. Once a worker is available, the job is processed using your worker's [handler function](/serverless/workers/handler-functions).
-
-Queue-based endpoints provide a fixed set of operations for submitting and managing jobs. You can find a full list of operations and sample code in the [sections below](/serverless/endpoints/send-requests#operation-overview).
+A **job** is a unit of work containing the input data from the request, packaged for processing by your [workers](/serverless/workers/overview). If no worker is immediately available, the job is queued. Once a worker is available, the job is processed using your worker's [handler function](/serverless/workers/handler-functions).
 
 ## Sync vs. async
 
-When you submit a job request, it can be either synchronous or asynchronous depending on which `POST` operation you use:
-
-- `/runsync` submits a synchronous job.
+- `/runsync` submits a **synchronous** job.
   - Client waits for the job to complete before returning the result.
-  - A response is returned as soon as the job is complete.
- - Results are available for 1 minute by default (5 minutes max). + - Results are available for 1 minute (5 minutes max). - Ideal for quick responses and interactive applications. -- `/run` submits an asynchronous job. - - The job is processed in the background. - - Retrieve the result by sending a `GET` request to the `/status` endpoint. +- `/run` submits an **asynchronous** job. + - The job processes in the background; retrieve results via `/status`. - Results are available for 30 minutes after completion. - Ideal for long-running tasks and batch processing. ## Request input structure -When submitting a job with `/runsync` or `/run`, your request must include a JSON object with the key `input` containing the parameters required by your worker's [handler function](/serverless/workers/handler-functions). For example: +When submitting a job with `/runsync` or `/run`, your request must include a JSON object with the key `input` containing the parameters required by your worker's [handler function](/serverless/workers/handler-functions): ```json { @@ -67,1015 +46,63 @@ When submitting a job with `/runsync` or `/run`, your request must include a JSO } ``` -The exact parameters required in the `input` object depend on your specific worker implementation (e.g. `prompt` commonly used for endpoints serving LLMs, but not all workers accept it). Check your worker's documentation for a list of required and optional parameters. +The exact parameters depend on your specific worker implementation. Check your worker's documentation for required and optional parameters. ## Send requests from the console -The quickest way to test your endpoint is directly in the Runpod console. Navigate to the [Serverless section](https://www.console.runpod.io/serverless), select your endpoint, and click the **Requests** tab. +The quickest way to test your endpoint is in the Runpod console. Navigate to [Serverless](https://www.console.runpod.io/serverless), select your endpoint, and click the **Requests** tab. -You'll see a default test request that you can modify as needed, then click **Run** to test your endpoint. On first execution, your workers will need to initialize, which may take a moment. - -The initial response will look something like this: - -```json -{ - "id": "6de99fd1-4474-4565-9243-694ffeb65218-u1", - "status": "IN_QUEUE" -} -``` - -You'll see the full response after the job completes. If there are any errors, the console will display error logs to help you troubleshoot. +Modify the default test request as needed, then click **Run**. On first execution, workers need to initialize, which may take a moment. ## Operation overview -Queue-based endpoints support comprehensive job lifecycle management through multiple operations that allow you to submit, monitor, manage, and retrieve results from jobs. - -Here's a quick overview of the operations available for queue-based endpoints: - -| Operation | HTTP method | Description | -|------------------|------------|--------------------------------------------------------------------------------------------------| -| `/runsync` | POST | Submit a synchronous job and wait for the complete results in a single response. | -| `/run` | POST | Submit an asynchronous job that processes in the background, and returns an immediate job ID.| -| `/status` | GET | Check the current status, execution details, and results of a submitted job. | -| `/stream` | GET | Receive incremental results from a job as they become available. 
| -| `/cancel` | POST | Stop a job that is in progress or waiting in the queue. | -| `/retry` | POST | Requeue a failed or timed-out job using the same job ID and input parameters. | -| `/purge-queue` | POST | Clear all pending jobs from the queue without affecting jobs already in progress. | -| `/health` | GET | Monitor the operational status of your endpoint, including worker and job statistics. | - - -If you need to create an endpoint that supports custom API paths, use [load balancing endpoints](/serverless/load-balancing/overview). - - -## Operation reference - -Below you'll find detailed explanations and examples for each operation using `cURL` and the Runpod SDK. - - -You can also send requests using standard HTTP request APIs and libraries, such as `fetch` (for JavaScript) and `requests` (for Python). - - - -Before running these examples, you'll need to install the Runpod SDK: - -```bash -# Python -python -m pip install runpod - -# JavaScript -npm install --save runpod-sdk - -# Go -go get github.com/runpod/go-sdk && go mod tidy -``` - -You should also set your [API key](/get-started/api-keys) and endpoint ID (found on the Overview tab for your endpoint in the Runpod console) as environment variables. Run the following commands in your local terminal, replacing `YOUR_API_KEY` and `YOUR_ENDPOINT_ID` with your actual API key and endpoint ID: - -```bash -export RUNPOD_API_KEY="YOUR_API_KEY" -export ENDPOINT_ID="YOUR_ENDPOINT_ID" -``` - -### `/runsync` - -Synchronous jobs wait for completion and return the complete result in a single response. This approach works best for shorter tasks where you need immediate results, interactive applications, and simpler client code without status polling. - -`/runsync` requests have a maximum payload size of 20 MB. - -Results are retained for 1 minute after completion. - -By default, the request waits up to 90 seconds for the job to complete. You can adjust this by appending `?wait=x` to the request URL, where `x` is the number of milliseconds to wait (between 1000 and 300000). For example, `?wait=120000` waits up to 2 minutes for completion: - -```sh -https://api.runpod.ai/v2/$ENDPOINT_ID/runsync?wait=120000 -``` - - -The `?wait` parameter controls how long the request waits for job completion, not how long results are retained. Result retention is fixed at 1 minute for sync requests. - - - - - -```sh -curl --request POST \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/runsync \ - -H "accept: application/json" \ - -H "authorization: $RUNPOD_API_KEY" \ - -H "content-type: application/json" \ - -d '{ "input": { "prompt": "Hello, world!" 
}}' -``` - - - - -```python -import runpod -import os - -runpod.api_key = os.getenv("RUNPOD_API_KEY") -endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) - -try: - run_request = endpoint.run_sync( - {"prompt": "Hello, world!"}, - timeout=60, # Client timeout in seconds - ) - print(run_request) -except TimeoutError: - print("Job timed out.") -``` - - - - -```javascript -const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; -import runpodSdk from "runpod-sdk"; - -const runpod = runpodSdk(RUNPOD_API_KEY); -const endpoint = runpod.endpoint(ENDPOINT_ID); - -const result = await endpoint.runSync({ - "input": { - "prompt": "Hello, World!", - }, - timeout: 60000, // Client timeout in milliseconds -}); -}); - -console.log(result); -``` - - - - -```go -package main - -import ( - "encoding/json" - "fmt" - "log" - "os" - - "github.com/runpod/go-sdk/pkg/sdk" - "github.com.runpod/go-sdk/pkg/sdk/config" - rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint" -) - -func main() { - apiKey := os.Getenv("RUNPOD_API_KEY") - baseURL := os.Getenv("RUNPOD_BASE_URL") - - endpoint, err := rpEndpoint.New( - &config.Config{ApiKey: &apiKey}, - &rpEndpoint.Option{EndpointId: &baseURL}, - ) - if err != nil { - log.Fatalf("Failed to create endpoint: %v", err) - } - - jobInput := rpEndpoint.RunSyncInput{ - JobInput: &rpEndpoint.JobInput{ - Input: map[string]interface{}{ - "prompt": "Hello World", - }, - }, - Timeout: sdk.Int(60), // Client timeout in seconds - } - - output, err := endpoint.RunSync(&jobInput) - if err != nil { - panic(err) - } - - data, _ := json.Marshal(output) - fmt.Printf("output: %s\n", data) -} -``` - - - -`/runsync` returns a response as soon as the job is complete: - -```json -{ - "delayTime": 824, - "executionTime": 3391, - "id": "sync-79164ff4-d212-44bc-9fe3-389e199a5c15", - "output": [ - { - "image": "https://image.url", - "seed": 46578 - } - ], - "status": "COMPLETED" -} -``` - -### `/run` - -Asynchronous jobs process in the background and return immediately with a job ID. This approach works best for longer-running tasks that don't require immediate results, operations requiring significant processing time, and managing multiple concurrent jobs. - -`/run` requests have a maximum payload size of 10 MB. - -Job results are available for 30 minutes after completion. - - - -```sh -curl --request POST \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/run \ - -H "accept: application/json" \ - -H "authorization: $RUNPOD_API_KEY" \ - -H "content-type: application/json" \ - -d '{"input": {"prompt": "Hello, world!"}}' -``` - - - -```python -import runpod -import os - -runpod.api_key = os.getenv("RUNPOD_API_KEY") -endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) - -# Submit asynchronous job -run_request = endpoint.run({"prompt": "Hello, World!"}) +Queue-based endpoints support these operations for job lifecycle management: -# Check initial status -status = run_request.status() -print(f"Initial job status: {status}") +| Operation | Method | Description | +|-------------------|--------|----------------------------------------------------------------------------| +| `/runsync` | POST | Submit a synchronous job and wait for complete results. | +| `/run` | POST | Submit an asynchronous job that processes in the background. | +| `/status` | GET | Check status, execution details, and results of a submitted job. | +| `/stream` | GET | Receive incremental results as they become available. | +| `/cancel` | POST | Stop a job in progress or waiting in the queue. 
| +| `/retry` | POST | Requeue a failed or timed-out job with the same job ID and input. | +| `/purge-queue` | POST | Clear all pending jobs from the queue. | +| `/health` | GET | Monitor endpoint status, including worker and job statistics. | -if status != "COMPLETED": - # Poll for results with timeout - output = run_request.output(timeout=60) -else: - output = run_request.output() -print(f"Job output: {output}") -``` - - - -```javascript -const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; -import runpodSdk from "runpod-sdk"; - -const runpod = runpodSdk(RUNPOD_API_KEY); -const endpoint = runpod.endpoint(ENDPOINT_ID); - -const result = await endpoint.run({ - "input": { - "prompt": "Hello, World!", - }, -}); - -console.log(result); -``` - - - -```go -package main - -import ( - "encoding/json" - "fmt" - "log" - "os" - - "github.com/runpod/go-sdk/pkg/sdk" - "github.com/runpod/go-sdk/pkg/sdk/config" - rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint" -) - -func main() { - client := sdk.New(&config.Config{ - ApiKey: os.Getenv("RUNPOD_API_KEY"), - BaseURL: os.Getenv("RUNPOD_BASE_URL"), - }) - - endpoint, err := client.NewEndpoint("YOUR_ENDPOINT_ID") - if err != nil { - log.Fatalf("Failed to create endpoint: %v", err) - } - - jobInput := rpEndpoint.RunInput{ - JobInput: &rpEndpoint.JobInput{ - Input: map[string]interface{}{ - "prompt": "Hello World", - }, - }, - RequestTimeout: sdk.Int(120), - } - - output, err := endpoint.Run(&jobInput) - if err != nil { - panic(err) - } - - data, _ := json.Marshal(output) - fmt.Printf("output: %s\n", data) -} -``` - - - - -`/run` returns a response with the job ID and status: - -```json -{ - "id": "eaebd6e7-6a92-4bb8-a911-f996ac5ea99d", - "status": "IN_QUEUE" -} -``` - -Further results must be retrieved using the `/status` operation. - -### `/status` - -Check the current state, execution statistics, and results of previously submitted jobs. The status operation provides the current job state, execution statistics like queue delay and processing time, and job output if completed. +See the [operation reference](/serverless/endpoints/operation-reference) for detailed examples using cURL and the Runpod SDK. -You can configure time-to-live (TTL) for individual jobs by appending a TTL parameter to the request URL. - -For example, `https://api.runpod.ai/v2/$ENDPOINT_ID/status/YOUR_JOB_ID?ttl=6000` sets the TTL to 6 seconds. +For custom API paths, use [load balancing endpoints](/serverless/load-balancing/overview). - - -Replace `YOUR_JOB_ID` with the actual job ID you received in the response to the `/run` operation. 
- -```sh -curl --request GET \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/status/YOUR_JOB_ID \ - -H "authorization: $RUNPOD_API_KEY" \ -``` - - - - - -Check the status of a job using the `status` method on the `run_request` object: - -```python - -import runpod - -runpod.api_key = os.getenv("RUNPOD_API_KEY") -endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) - -input_payload = {"input": {"prompt": "Hello, World!"}} - -run_request = endpoint.run(input_payload) - -# Initial check without blocking, useful for quick tasks -status = run_request.status() -print(f"Initial job status: {status}") - -if status != "COMPLETED": - # Polling with timeout for long-running tasks - output = run_request.output(timeout=60) -else: - output = run_request.output() -print(f"Job output: {output}") -print(f"An error occurred: {e}") - -``` - - - - -Check the status of a job using the ID returned by `endpoint.run`: - -```javascript -const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; -import runpodSdk from "runpod-sdk"; - -async function main() { - try { - const runpod = runpodSdk(RUNPOD_API_KEY); - const endpoint = runpod.endpoint(ENDPOINT_ID); - const result = await endpoint.run({ - input: { - prompt: "Hello, World!", - }, - }); - - const { id } = result; - if (!id) { - console.error("No ID returned from endpoint.run"); - return; - } - - const status = await endpoint.status(id); - console.log(status); - } catch (error) { - console.error("An error occurred:", error); - } -} - -main(); -``` - - - - -Replace `YOUR_JOB_ID` with the actual job ID you received in the response to the `/run` request. - -```go - -package main - -import ( - "encoding/json" - "fmt" - "log" - "os" - - "github.com/runpod/go-sdk/pkg/sdk" - "github.com/runpod/go-sdk/pkg/sdk/config" - rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint" -) - -func main() { - - apiKey := os.Getenv("RUNPOD_API_KEY") - baseURL := os.Getenv("RUNPOD_BASE_URL") - - endpoint, err := rpEndpoint.New( - &config.Config{ApiKey: &apiKey}, - &rpEndpoint.Option{EndpointId: &baseURL}, - ) - if err != nil { - log.Fatalf("Failed to create endpoint: %v", err) - } - input := rpEndpoint.StatusInput{ - Id: sdk.String("YOUR_JOB_ID"), - } - output, err := endpoint.Status(&input) - if err != nil { - panic(err) - } - dt, _ := json.Marshal(output) - fmt.Printf("output:%s\n", dt) -} -``` - - - - -`/status` returns a JSON response with the job status (e.g. `IN_QUEUE`, `IN_PROGRESS`, `COMPLETED`, `FAILED`), and an optional `output` field if the job is completed: - -```json -{ - "delayTime": 31618, - "executionTime": 1437, - "id": "60902e6c-08a1-426e-9cb9-9eaec90f5e2b-u1", - "output": { - "input_tokens": 22, - "output_tokens": 16, - "text": ["Hello! How can I assist you today?\nUSER: I'm having"] - }, - "status": "COMPLETED" -} -``` - -### `/stream` - -Receive incremental results as they become available from jobs that generate output progressively. This works especially well for text generation tasks where you want to display output as it's created, long-running jobs where you want to show progress, and large outputs that benefit from incremental processing. - -To enable streaming, your handler must support the `"return_aggregate_stream": True` option on the `start` method of your handler. Once enabled, use the `stream` method to receive data as it becomes available. - -For implementation details, see [Streaming handlers](/serverless/workers/handler-functions#streaming-handlers). 
- - - - - -Replace `YOUR_JOB_ID` with the actual job ID you received in the response to the `/run` request. - -```sh -curl --request GET \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/stream/YOUR_JOB_ID \ - -H "accept: application/json" \ - -H "authorization: $RUNPOD_API_KEY" \ -``` - - - - - -```python -import runpod - -runpod.api_key = os.getenv("RUNPOD_API_KEY") -endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) - -run_request = endpoint.run( - { - "input": { - "prompt": "Hello, world!", - } - } -) - -for output in run_request.stream(): - print(output) -``` - - - -```javascript -const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; -import runpodSdk from "runpod-sdk"; - -async function main() { - const runpod = runpodSdk(RUNPOD_API_KEY); - const endpoint = runpod.endpoint(ENDPOINT_ID); - const result = await endpoint.run({ - input: { - prompt: "Hello, World!", - }, - }); - - console.log(result); - - const { id } = result; - for await (const result of endpoint.stream(id)) { - console.log(`${JSON.stringify(result, null, 2)}`); - } - console.log("done streaming"); -} - -main(); -``` - - - -```go -package main - -import ( - "encoding/json" - "fmt" - - "github.com/runpod/go-sdk/pkg/sdk/config" - rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint" -) - -func main() { - - apiKey := os.Getenv("RUNPOD_API_KEY") - baseURL := os.Getenv("RUNPOD_BASE_URL") - - endpoint, err := rpEndpoint.New( - &config.Config{ApiKey: &apiKey}, - &rpEndpoint.Option{EndpointId: &baseURL}, - ) - if err != nil { - panic(err) - } - - request, err := endpoint.Run(&rpEndpoint.RunInput{ - JobInput: &rpEndpoint.JobInput{ - Input: map[string]interface{}{ - "prompt": "Hello World", - }, - }, - }) - if err != nil { - panic(err) - } - - streamChan := make(chan rpEndpoint.StreamResult, 100) - - err = endpoint.Stream(&rpEndpoint.StreamInput{Id: request.Id}, streamChan) - if err != nil { - // timeout reached, if we want to get the data that has been streamed - if err.Error() == "ctx timeout reached" { - for data := range streamChan { - dt, _ := json.Marshal(data) - fmt.Printf("output:%s\n", dt) - } - } - panic(err) - } - - for data := range streamChan { - dt, _ := json.Marshal(data) - fmt.Printf("output:%s\n", dt) - } - -} -``` - - - - - -The maximum size for a single streamed payload chunk is 1 MB. Larger outputs will be split across multiple chunks. - - -Streaming response format: - -```json -[ - { - "metrics": { - "avg_gen_throughput": 0, - "avg_prompt_throughput": 0, - "cpu_kv_cache_usage": 0, - "gpu_kv_cache_usage": 0.0016722408026755853, - "input_tokens": 0, - "output_tokens": 1, - "pending": 0, - "running": 1, - "scenario": "stream", - "stream_index": 2, - "swapped": 0 - }, - "output": { - "input_tokens": 0, - "output_tokens": 1, - "text": [" How"] - } - } -] -``` - -### `/cancel` - -Stop jobs that are no longer needed or taking too long to complete. This operation stops in-progress jobs, removes queued jobs before they start, and returns immediately with the canceled status. - - - - -Replace `YOUR_JOB_ID` with the actual job ID you received in the response to the `/run` request. - -```sh -curl --request POST \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/cancel/YOUR_JOB_ID \ - -H "authorization: $RUNPOD_API_KEY" \ -``` - - - - -Cancel a job using the `cancel` method on the `run_request` object. 
The script below demonstrates how to cancel a job using a keyboard interrupt: - -```python -import time -import runpod - -runpod.api_key = os.getenv("RUNPOD_API_KEY") -endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) - -run_request = endpoint.run( - { - "input": { - "prompt": "Hello, world!", - } - } -) - -try: - while True: - status = run_request.status() - print(f"Current job status: {status}") - - if status == "COMPLETED": - output = run_request.output() - print("Job output:", output) - break - elif status in ["FAILED", "ERROR"]: - print("Job failed to complete successfully.") - break - else: - time.sleep(10) -except KeyboardInterrupt: # Catch KeyboardInterrupt - print("KeyboardInterrupt detected. Canceling the job...") - if run_request: # Check if a job is active - run_request.cancel() - print("Job canceled.") -``` - - - - -Cancel a job by using the `cancel()` function on the run request. - -```javascript -const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; -import runpodSdk from "runpod-sdk"; - -async function main() { - try { - const runpod = runpodSdk(RUNPOD_API_KEY); - const endpoint = runpod.endpoint(ENDPOINT_ID); - const result = await endpoint.run({ - input: { - prompt: "Hello, World!", - }, - }); - - const { id } = result; - if (!id) { - console.error("No ID returned from endpoint.run"); - return; - } - - const cancel = await endpoint.cancel(id); - console.log(cancel); - } catch (error) { - console.error("An error occurred:", error); - } -} - -main(); -``` - - - -```go -package main - -import ( - "encoding/json" - "fmt" - - "github.com/runpod/go-sdk/pkg/sdk" - "github.com/runpod/go-sdk/pkg/sdk/config" - rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint" -) - -func main() { - - apiKey := os.Getenv("RUNPOD_API_KEY") - baseURL := os.Getenv("RUNPOD_BASE_URL") - - endpoint, err := rpEndpoint.New( - &config.Config{ApiKey: &apiKey}, - &rpEndpoint.Option{EndpointId: &baseURL}, - ) - if err != nil { - panic(err) - } - - cancelInput := rpEndpoint.CancelInput{ - Id: sdk.String("00edfd03-8094-46da-82e3-ea47dd9566dc-u1"), - } - output, err := endpoint.Cancel(&cancelInput) - if err != nil { - panic(err) - } - - healthData, _ := json.Marshal(output) - fmt.Printf("health output: %s\n", healthData) - -} -``` - - - - - -`/cancel` requests return a JSON response with the status of the cancel operation: - -```json -{ - "id": "724907fe-7bcc-4e42-998d-52cb93e1421f-u1", - "status": "CANCELLED" -} -``` - - -### `/retry` - -Requeue jobs that have failed or timed out without submitting a new request. This operation maintains the same job ID for tracking, requeues with original input parameters, and removes previous output. It can only be used for jobs with `FAILED` or `TIMED_OUT` status. - -Replace `YOUR_JOB_ID` with the actual job ID you received in the response to the `/run` request. - -```sh -curl --request POST \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/retry/YOUR_JOB_ID \ - -H "authorization: $RUNPOD_API_KEY" -``` - -You'll see the job status updated to `IN_QUEUE` when the job is retried: - -```json -{ - "id": "60902e6c-08a1-426e-9cb9-9eaec90f5e2b-u1", - "status": "IN_QUEUE" -} -``` - - -Job results expire after a set period. Asynchronous jobs (`/run`) results are available for 30 minutes, while synchronous jobs (`/runsync`) results are available for 1 minute (up to 5 minutes with `?wait=t`). Once expired, jobs cannot be retried. - - -### `/purge-queue` - -Remove all pending jobs from the queue when you need to reset or handle multiple cancellations at once. 
This is useful for error recovery, clearing outdated requests, resetting after configuration changes, and managing resource allocation. - - - -```sh -curl --request POST \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/purge-queue \ - -H "authorization: $RUNPOD_API_KEY" - -H 'Authorization: Bearer RUNPOD_API_KEY' -``` - - - -```python -import runpod -import os - -runpod.api_key = os.getenv("RUNPOD_API_KEY") -endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) - -endpoint.purge_queue(timeout=3) -``` - - - -```javascript -const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; -import runpodSdk from "runpod-sdk"; - -async function main() { - try { - const runpod = runpodSdk(RUNPOD_API_KEY); - const endpoint = runpod.endpoint(ENDPOINT_ID); - await endpoint.run({ - input: { - prompt: "Hello, World!", - }, - }); - - const purgeQueue = await endpoint.purgeQueue(); - console.log(purgeQueue); - } catch (error) { - console.error("An error occurred:", error); - } -} - -main(); -``` - - - - - -`/purge-queue` operation only affects jobs waiting in the queue. Jobs already in progress will continue to run. - - -`/purge-queue` requests return a JSON response with the number of jobs removed from the queue and the status of the purge operation: - -```json -{ - "removed": 2, - "status": "completed" -} -``` - -### `/health` - -Get a quick overview of your endpoint's operational status including worker availability, job queue status, potential bottlenecks, and scaling requirements. - - - -```sh -curl --request GET \ - --url https://api.runpod.ai/v2/$ENDPOINT_ID/health \ - -H "authorization: $RUNPOD_API_KEY" -``` - - - - - -```python -import runpod -import json -import os - -runpod.api_key = os.getenv("RUNPOD_API_KEY") -endpoint = runpod.Endpoint(os.getenv("ENDPOINT_ID")) - -endpoint_health = endpoint.health() -print(json.dumps(endpoint_health, indent=2)) -``` - - - -```javascript -const { RUNPOD_API_KEY, ENDPOINT_ID } = process.env; -import runpodSdk from "runpod-sdk"; - -const runpod = runpodSdk(RUNPOD_API_KEY); -const endpoint = runpod.endpoint(ENDPOINT_ID); - -const health = await endpoint.health(); -console.log(health); -``` - - - -```go -package main - -import ( - "encoding/json" - "fmt" - "log" - "os" - - "github.com/runpod/go-sdk/pkg/sdk/config" - rpEndpoint "github.com/runpod/go-sdk/pkg/sdk/endpoint" -) - -func main() { - apiKey := os.Getenv("RUNPOD_API_KEY") - endpointId := os.Getenv("ENDPOINT_ID") - - endpoint, err := rpEndpoint.New( - &config.Config{ApiKey: &apiKey}, - &rpEndpoint.Option{EndpointId: &endpointId}, - ) - if err != nil { - log.Fatalf("Failed to create endpoint: %v", err) - } - - health, err := endpoint.Health() - if err != nil { - log.Fatalf("Failed to get health: %v", err) - } - - data, _ := json.Marshal(health) - fmt.Printf("Health: %s\n", data) -} -``` - - - - -`/health` requests return a JSON response with the current status of the endpoint, including the number of jobs completed, failed, in progress, in queue, and retried, as well as the status of workers. - -```json -{ - "jobs": { - "completed": 1, - "failed": 5, - "inProgress": 0, - "inQueue": 2, - "retried": 0 - }, - "workers": { - "idle": 0, - "running": 0 - } -} -``` - ## Advanced options -Beyond the required `input` object, you can include optional top-level parameters to enable additional functionality for your queue-based endpoints. +Beyond the required `input` object, you can include optional top-level parameters for additional functionality. 
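+
+The subsections below describe each option in detail. As a quick preview, here is a minimal sketch (using Python's `requests` library, per the standard-HTTP-library note above) that combines a webhook and an execution policy in a single `/run` request; the webhook URL is a placeholder for your own receiver:
+
+```python
+import os
+import requests
+
+endpoint_id = os.environ["ENDPOINT_ID"]
+
+# Submit an async job that posts its result to a webhook and
+# overrides the default execution policy for this job only.
+response = requests.post(
+    f"https://api.runpod.ai/v2/{endpoint_id}/run",
+    headers={"authorization": os.environ["RUNPOD_API_KEY"]},
+    json={
+        "input": {"prompt": "Your input here"},
+        "webhook": "https://your-webhook-url.com",
+        "policy": {"executionTimeout": 900000},  # 15 minutes, in milliseconds
+    },
+)
+print(response.json())  # e.g. {"id": "...", "status": "IN_QUEUE"}
+```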
### Webhook notifications -Receive notifications when jobs complete by specifying a webhook URL. When your job completes, Runpod will send a `POST` request to your webhook URL containing the same information as the `/status/JOB_ID` endpoint. +Receive notifications when jobs complete by specifying a webhook URL: ```json { - "input": { - "prompt": "Your input here" - }, + "input": { "prompt": "Your input here" }, "webhook": "https://your-webhook-url.com" } ``` -Your webhook should return a `200` status code to acknowledge receipt. If the call fails, Runpod will retry up to 2 more times with a 10-second delay between attempts. +Your webhook should return a `200` status code. If the call fails, Runpod retries up to 2 more times with a 10-second delay. ### Execution policies -Control job execution behavior with custom policies. By default, jobs automatically terminate after 10 minutes without completion to prevent runaway costs. +Control job execution behavior with custom policies: ```json { - "input": { - "prompt": "Your input here" - }, + "input": { "prompt": "Your input here" }, "policy": { "executionTimeout": 900000, "lowPriority": false, @@ -1084,39 +111,31 @@ Control job execution behavior with custom policies. By default, jobs automatica } ``` -Policy options: - -| Option | Description | Default | Constraints | -| ------------------ | ------------------------------------------- | ------------------- | ------------------------------ | -| `executionTimeout` | Maximum time a job can run while being processed by a worker | 600000 (10 minutes) | Min 5 seconds, max 7 days | -| `lowPriority` | When true, job won't trigger worker scaling | false | - | -| `ttl` | Total lifespan of the job—once expired, the job is deleted regardless of state | 86400000 (24 hours) | Min 10 seconds, max 7 days | +| Option | Description | Default | Constraints | +|--------------------|----------------------------------------------------------|----------------------|-------------------------| +| `executionTimeout` | Maximum time a job can run while being processed | 600000 (10 minutes) | Min 5 sec, max 7 days | +| `lowPriority` | When true, job won't trigger worker scaling | false | - | +| `ttl` | Total lifespan of the job before deletion | 86400000 (24 hours) | Min 10 sec, max 7 days | Setting `executionTimeout` in a request overrides the default endpoint setting for that specific job only. -#### Understanding TTL vs execution timeout +#### TTL vs. execution timeout -The `ttl` and `executionTimeout` settings serve different purposes: - -- **`ttl`**: Total lifespan of the job in the system. The timer starts when the job is submitted and covers queue time, execution time, and everything in between. When TTL expires, the job is deleted regardless of its current state. +- **`ttl`**: Total lifespan of the job. Timer starts when submitted and covers queue time, execution time, and everything in between. When TTL expires, the job is deleted regardless of state. - **`executionTimeout`**: Maximum time the job can actively run once a worker picks it up. Only enforced during execution. -TTL is a hard limit on the job's existence. If TTL expires while a job is actively running on a worker, the job is immediately removed and subsequent status checks return a 404—even if the job would have completed successfully. The `executionTimeout` does not extend or override the TTL. +TTL is a hard limit. 
If TTL expires while a job is running, the job is immediately removed and status checks return 404, even if the job would have completed successfully. -**Example 1 (queue expiry)**: You set `executionTimeout` to 2 hours and `ttl` to 1 hour. If the job waits in queue for 1 hour, it expires before a worker ever picks it up. The execution timeout never comes into play. - -**Example 2 (mid-execution expiry)**: You set `executionTimeout` to 7 days and `ttl` to 7 days. If the job waits in queue for 1 day, it only has 6 days of TTL remaining for execution. If the job needs the full 7 days to run, it will be deleted on day 7 while still in progress. - #### Long-running jobs -For jobs that need to run longer than the default TTL (24 hours): +For jobs that need to run longer than the default 24-hour TTL: 1. Set `executionTimeout` to your desired maximum runtime. -2. Set `ttl` to cover **both expected queue time and execution time**. Since TTL is a hard limit on the job's total lifespan, it must be long enough for the job to finish before being deleted. +2. Set `ttl` to cover **both expected queue time and execution time**. ```json { @@ -1128,32 +147,28 @@ For jobs that need to run longer than the default TTL (24 hours): } ``` -In this example, the execution timeout allows up to 48 hours of active runtime, while the TTL gives the job 72 hours of total lifespan. The extra 24 hours of TTL headroom accounts for potential queue wait time. +This allows up to 48 hours of active runtime with 72 hours total lifespan (24 hours headroom for queue time). -Both `ttl` and `executionTimeout` have a maximum of 7 days. If your job may queue for an extended period, the effective execution window is reduced: a job with a 7-day TTL that queues for 2 days only has 5 days of TTL remaining for execution, even if `executionTimeout` is also set to 7 days. +Both `ttl` and `executionTimeout` have a maximum of 7 days. A job with 7-day TTL that queues for 2 days only has 5 days remaining for execution. -#### Result retention after completion +#### Result retention -After a job completes, results are retained for a fixed period that is separate from the `ttl` setting: +After completion, results are retained for a fixed period separate from TTL: -| Request type | Retention period | -|--------------|------------------| -| `/run` (async) | 30 minutes | -| `/runsync` (sync) | 1 minute | +| Request type | Retention period | +|--------------------|------------------| +| `/run` (async) | 30 minutes | +| `/runsync` (sync) | 1 minute | -These retention periods are fixed and cannot be extended. Once the retention period expires, the job data is permanently deleted. +### S3-compatible storage -### S3-compatible storage integration - -Configure S3-compatible storage for endpoints working with large files. This configuration is passed directly to your worker but not included in responses. +Configure S3-compatible storage for endpoints working with large files: ```json { - "input": { - "prompt": "Your input here" - }, + "input": { "prompt": "Your input here" }, "s3Config": { "accessId": "BUCKET_ACCESS_KEY_ID", "accessSecret": "BUCKET_SECRET_ACCESS_KEY", @@ -1163,86 +178,49 @@ Configure S3-compatible storage for endpoints working with large files. This con } ``` -Your worker must contain logic to use this information for storage operations. +Your worker must contain logic to use this information for storage operations. Works with any S3-compatible provider including MinIO, Backblaze B2, and DigitalOcean Spaces. 
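+
+For illustration, a minimal `boto3` sketch of that worker-side logic, assuming the configuration arrives on the job payload under an `s3Config` key (this key name is an assumption for the sketch; verify how your worker actually receives the config):
+
+```python
+import boto3
+
+def upload_result(job, local_path, key):
+    # Hypothetical helper: build an S3 client from the job's storage
+    # config and upload a locally generated file to the bucket.
+    cfg = job["s3Config"]  # Assumed delivery key; see note above
+    s3 = boto3.client(
+        "s3",
+        endpoint_url=cfg["endpointUrl"],
+        aws_access_key_id=cfg["accessId"],
+        aws_secret_access_key=cfg["accessSecret"],
+    )
+    s3.upload_file(local_path, cfg["bucketName"], key)
+    return f"{cfg['endpointUrl']}/{cfg['bucketName']}/{key}"
+```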
- -S3 integration works with any S3-compatible provider including MinIO, Backblaze B2, DigitalOcean Spaces, and others. - - -## Rate limits and quotas +## Rate limits -Runpod enforces rate limits to ensure fair platform usage. These limits apply per endpoint and operation: +Runpod enforces rate limits per endpoint and operation: -| Operation | Method | Rate Limit | Concurrent Limit | -| ------------------------------------ | -------- | ---------------------------- | ---------------- | -| `/runsync` | POST | 2000 requests per 10 seconds | 400 concurrent | -| `/run` | POST | 1000 requests per 10 seconds | 200 concurrent | -| `/status` | GET | 2000 requests per 10 seconds | 400 concurrent | -| `/stream` | GET | 2000 requests per 10 seconds | 400 concurrent | -| `/cancel` | POST | 100 requests per 10 seconds | 20 concurrent | -| `/purge-queue` | POST | 2 requests per 10 seconds | N/A | -| `/openai/*` | POST | 2000 requests per 10 seconds | 400 concurrent | -| `/requests` | GET | 10 requests per 10 seconds | 2 concurrent | +| Operation | Method | Rate Limit | Concurrent Limit | +|----------------|--------|------------------------------|------------------| +| `/runsync` | POST | 2000 requests per 10 seconds | 400 concurrent | +| `/run` | POST | 1000 requests per 10 seconds | 200 concurrent | +| `/status` | GET | 2000 requests per 10 seconds | 400 concurrent | +| `/stream` | GET | 2000 requests per 10 seconds | 400 concurrent | +| `/cancel` | POST | 100 requests per 10 seconds | 20 concurrent | +| `/purge-queue` | POST | 2 requests per 10 seconds | N/A | +| `/openai/*` | POST | 2000 requests per 10 seconds | 400 concurrent | +| `/requests` | GET | 10 requests per 10 seconds | 2 concurrent | ### Dynamic rate limiting -In addition to the base rate limits above, Runpod implements a dynamic rate limiting system that scales with your endpoint's worker count. This helps ensure platform stability while allowing higher throughput as you scale. - -Rate limits are calculated using two values: - -1. **Base limit**: A fixed rate limit per user per endpoint (shown in the table above) -2. **Worker-based limit**: A dynamic limit calculated as `number_of_running_workers × requests_per_worker` +Rate limits scale with your endpoint's worker count. The system uses whichever is higher between: -The system uses **whichever limit is higher** between the base limit and worker-based limit. Requests are blocked with a `429 (Too Many Requests)` status when the request count exceeds this effective limit within a 10-second window. This means as your endpoint scales up workers, your effective rate limit increases proportionally. +1. **Base limit**: Fixed rate limit per user per endpoint (shown above) +2. **Worker-based limit**: `number_of_running_workers × requests_per_worker` -For example, if an endpoint has: -- Base limit: 2000 requests per 10 seconds -- Additional limit per worker: 50 requests per 10 seconds -- 20 running workers +Requests exceeding the effective limit return `429 (Too Many Requests)`. Implement retry logic with exponential backoff to handle rate limiting gracefully. -The effective rate limit would be `max(2000, 20 × 50) = 2000` requests per 10 seconds (base limit applies). With 50 running workers, it would scale to `max(2000, 50 × 50) = 2500` requests per 10 seconds (worker-based limit applies). 
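+
+A minimal client-side sketch of that backoff pattern, assuming the `requests` library (`post_with_backoff` is a hypothetical helper):
+
+```python
+import time
+
+import requests
+
+def post_with_backoff(url: str, api_key: str, payload: dict, max_retries: int = 5):
+    """POST to an endpoint, backing off exponentially on 429 responses."""
+    headers = {"Authorization": f"Bearer {api_key}"}
+    response = None
+    for attempt in range(max_retries):
+        response = requests.post(url, headers=headers, json=payload, timeout=30)
+        if response.status_code != 429:
+            return response
+        # Sleep 1s, 2s, 4s, ... before retrying the rate-limited request.
+        time.sleep(2 ** attempt)
+    return response
+```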
+## Error handling -**Key points:** -- Rate limiting is based on request count per 10-second time windows -- The system automatically uses whichever limit gives you more requests - -Implement appropriate retry logic with exponential backoff to handle rate limiting gracefully. - - - - -## Best practices - -Follow these practices to optimize your queue-based endpoint usage: - -- Use asynchronous requests for jobs that take more than a few seconds to complete. -- Implement polling with backoff when checking status of asynchronous jobs. -- Set appropriate timeouts in your client applications and monitor endpoint health regularly to detect issues early. -- Implement comprehensive error handling for all API calls. -- Use webhooks for notification-based workflows instead of polling to reduce API calls. -- Cancel unneeded jobs to free up resources and reduce costs. -- During development, use the console testing interface before implementing programmatic integration. - -## Error handling and troubleshooting - -When sending requests, be prepared to handle these common errors: +Common errors and solutions: | HTTP Status | Meaning | Solution | -| ----------- | --------------------- | ------------------------------------------------- | +|-------------|-----------------------|---------------------------------------------------| | 400 | Bad Request | Check your request format and parameters | | 401 | Unauthorized | Verify your API key is correct and has permission | | 404 | Not Found | Check your endpoint ID | | 429 | Too Many Requests | Implement backoff and retry logic | | 500 | Internal Server Error | Check endpoint logs; worker may have crashed | -Here are some common issues and suggested solutions: - -| Issue | Possible Causes | Solutions | -| ------------------ | ----------------------------------------------- | ---------------------------------------------------------------------------- | -| Job stuck in queue | No available workers, max workers limit reached | Increase max workers, check endpoint health | -| Timeout errors | Job takes longer than execution timeout | Increase timeout in job policy, optimize job processing | -| Failed jobs | Worker errors, input validation issues | Check [endpoint logs](/serverless/development/logs), verify input format, retry with fixed input | -| Rate limiting | Too many requests in short time | Implement backoff strategy, batch requests when possible | -| Missing results | Results expired | Retrieve results within expiration window (30 min for async, 1 min for sync) | +| Issue | Possible Causes | Solutions | +|--------------------|------------------------------------------|---------------------------------------------------------------------| +| Job stuck in queue | No available workers, max workers reached | Increase max workers, check endpoint health | +| Timeout errors | Job takes longer than execution timeout | Increase timeout in job policy, optimize processing | +| Failed jobs | Worker errors, input validation issues | Check [endpoint logs](/serverless/development/logs), verify input | +| Missing results | Results expired | Retrieve within expiration window (30 min async, 1 min sync) | -Implementing proper [error handling](/serverless/development/error-handling) and retry logic will make your integrations more robust and reliable. +See [error handling](/serverless/workers/handler-functions#error-handling) for implementation details. 
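+
+On the worker side, one common pattern is to catch exceptions and return a structured error rather than letting the worker crash. A minimal sketch, relying on the SDK convention that a returned `error` key marks the job as failed:
+
+```python
+import runpod
+
+def handler(job):
+    try:
+        prompt = job["input"]["prompt"]
+        # Replace with your actual processing logic.
+        return {"generated_text": f"Processed: {prompt}"}
+    except KeyError as err:
+        # Malformed input: surface it in the job status rather than crashing.
+        return {"error": f"Missing required input field: {err}"}
+
+runpod.serverless.start({"handler": handler})
+```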
diff --git a/serverless/load-balancing/build-a-worker.mdx b/serverless/load-balancing/build-a-worker.mdx index 4157688d..89007b2d 100644 --- a/serverless/load-balancing/build-a-worker.mdx +++ b/serverless/load-balancing/build-a-worker.mdx @@ -6,15 +6,6 @@ description: "Learn how to implement and deploy a load balancing worker with Fas This tutorial shows how to build a load balancing worker using FastAPI and deploy it as a Serverless endpoint on Runpod. -## What you'll learn - -In this tutorial you'll learn how to: - -- Create a FastAPI application to serve your API endpoints. -- Implement proper health checks for your workers. -- Deploy your application as a load balancing Serverless endpoint. -- Test and interact with your custom APIs. - ## Requirements Before you begin you'll need: diff --git a/serverless/load-balancing/overview.mdx b/serverless/load-balancing/overview.mdx index e4f427cf..022b12b8 100644 --- a/serverless/load-balancing/overview.mdx +++ b/serverless/load-balancing/overview.mdx @@ -6,47 +6,27 @@ description: "Deploy custom direct-access REST APIs with load balancing Serverle import { RequestsTooltip, QueueBasedEndpointsTooltip } from "/snippets/tooltips.jsx"; -Load balancing endpoints offer a completely new paradigm for Serverless endpoint creation, enabling direct access to worker HTTP servers without an intermediary queueing system. +Load balancing endpoints route incoming traffic directly to available workers, bypassing the queueing system. Unlike that process requests sequentially, load balancing distributes requests across your worker pool for lower latency. -Unlike traditional that process requests sequentially, load balancing endpoints route incoming traffic directly to available workers, distributing requests across the worker pool. - -When building a load balancer, you're no longer limited to the standard `/run` or `/runsync` endpoints. Instead, you can create custom REST endpoints that are accessible via a unique URL: +You can create custom REST endpoints accessible via a unique URL: ``` https://ENDPOINT_ID.api.runpod.ai/YOUR_CUSTOM_PATH ``` -## Get started - -When you're ready to get started, follow this tutorial to learn how to [build and deploy a load balancing worker](/serverless/load-balancing/build-a-worker). - -Or, if you're ready for a more advanced use case, you can jump straight into [building a vLLM load balancer](/serverless/load-balancing/vllm-worker). - -You can also watch this video for an brief overview of the concepts explained on this page: - - - -## Key features - -- **Direct HTTP access**: Connect directly to worker HTTP servers, bypassing queue infrastructure for lower latency. -- **Custom REST API endpoints**: Define your own API paths, methods, and contracts to match your specific application needs. -- **Environment variable port configuration**: Control which ports your API listens on through standardized environment variables. -- **Framework agnostic**: Build with FastAPI, Flask, Express.js, or any HTTP server framework of your choice. -- **Multi-endpoint support**: Expose multiple API endpoints through a single worker, creating complete REST API services. -- **Health-based routing**: Requests are only sent to healthy workers, with automatic removal of unhealthy instances. + + + Create and deploy a load balancing worker. + + + Deploy vLLM with load balancing. + + ## Load balancing vs. 
queue-based endpoints

-Here are the key differences between the two endpoint types:

-### Queue-based endpoints (traditional)
+### Queue-based endpoints

With queue-based endpoints, are placed in a queue and processed in order. They use the standard handler pattern (`def handler(job)`) and are accessed through fixed endpoints like `/run` and `/runsync`.

@@ -58,108 +38,64 @@ Load balancing endpoints send requests directly to workers without queuing. You

These endpoints are ideal for real-time applications and streaming, but provide no queuing mechanism for request backlog, similar to UDP's behavior in networking.

-## Endpoint type comparison table
-| **Aspect** | **Load Balancing** | **Queue-Based** |
-| --- | --- | --- |
-| Request flow | Direct to worker HTTP server | Through queueing system |
-| Implementation | Custom HTTP server | Handler function |
-| Protocol flexibility | Supports any HTTP capability | JSON input/output only |
-| Backpressure handling | Request drop when overloaded | Queue buffering |
-| Latency | Lower (single-hop) | Higher (queue+worker) |
-| Error recovery | No built-in retry mechanism | Automatic retries |
+## Endpoint type comparison table

-## Worker implementation comparison
+| Aspect | Load balancing | Queue-based |
+|--------|----------------|-------------|
+| **Request flow** | Direct to worker HTTP server | Through queueing system |
+| **Implementation** | Custom HTTP server (FastAPI, Flask, etc.) | Handler function |
+| **API flexibility** | Custom URL paths, any HTTP capability | Fixed `/run` and `/runsync` endpoints |
+| **Backpressure** | Drops requests when overloaded | Queue buffering |
+| **Latency** | Lower (single-hop) | Higher (queue + worker) |
+| **Error handling** | No built-in retry | Automatic retries |

-### Queue-based Serverless worker
+## Worker comparison

-Traditional Serverless workers require a specific handler function structure:
+**Queue-based worker** (traditional):

```python
import runpod

def handler(job):
-    """Handler function that will be used to process jobs."""
-    job_input = job["input"]
-    prompt = job_input.get("prompt", "Hello world")
-
-    # Process the request
-    result = f"Generated text for: {prompt}"
-
-    return {"generated_text": result}
+    prompt = job["input"].get("prompt", "Hello world")
+    return {"generated_text": f"Generated text for: {prompt}"}

runpod.serverless.start({"handler": handler})
```

-With traditional endpoints:
-
-- Requests are processed through Runpod's queueing system.
-- Access is available via fixed the endpoints `/run` and `/runsync`.
-- You implement a single handler function.
-- You’re limited to JSON input/output.
-
-### Load balancing worker
-
-Load balancing workers do not require standardized handlers, or use the Runpod SDK at all. 
Instead, you can create full REST APIs using frameworks like FastAPI: +**Load balancing worker** (custom HTTP server): ```python from fastapi import FastAPI -from pydantic import BaseModel import os app = FastAPI() -class GenerationRequest(BaseModel): - prompt: str - max_tokens: int = 100 - @app.get("/ping") async def health_check(): return {"status": "healthy"} @app.post("/generate") -async def generate(request: GenerationRequest): - # Process the request - result = f"Generated text for: {request.prompt}" - return {"generated_text": result} +async def generate(request: dict): + return {"generated_text": f"Generated text for: {request['prompt']}"} if __name__ == "__main__": import uvicorn - port = int(os.getenv("PORT", "80")) - uvicorn.run(app, host="0.0.0.0", port=port) - + uvicorn.run(app, host="0.0.0.0", port=int(os.getenv("PORT", "80"))) ``` -Once deployed, this example would expose two custom endpoints on each Serverless worker: - -``` -https://ENDPOINT_ID.api.runpod.ai/ping -https://ENDPOINT_ID.api.runpod.ai/generate -``` - -With load balancing endpoints: - -- Endpoint requests go directly to your HTTP server. -- You can define custom URL paths and endpoints. -- You have control over your entire API structure. +This exposes custom endpoints: `https://ENDPOINT_ID.api.runpod.ai/ping` and `https://ENDPOINT_ID.api.runpod.ai/generate` -## When to use load balancing endpoints - -Consider using load balancing endpoints when you need: - -- Direct access to your model's HTTP server -- To leverage internal batching systems, like those provided by vLLM. -- The ability to return non-JSON payloads -- To implement multiple endpoints within a single worker. -- Lower latency for real-time applications, where immediate processing is more important than guaranteed execution. +## Health checks -## Worker health management +Workers must expose a `/ping` endpoint on the `PORT_HEALTH` port. The load balancer periodically checks this endpoint: -Runpod continuously monitors worker health through a dedicated health check mechanism. Workers must expose a `/ping` endpoint on the port specified by the `PORT_HEALTH` environment variable. The load balancer periodically sends requests to this endpoint. Workers respond with appropriate HTTP status codes: - -- `200` : healthy -- `204` : initializing -- Any other code: unhealthy +| Response code | Status | +|---------------|--------| +| `200` | Healthy | +| `204` | Initializing | +| Other | Unhealthy | Unhealthy workers are automatically removed from the routing pool. @@ -169,107 +105,67 @@ When calculating endpoint metrics, Runpod calculates the cold start time for loa -## Environment variables - -You can use environment variables to configure ports and other settings for your load balancing worker. -- `PORT`: The port for the main application server (default: `80`). -- `PORT_HEALTH`: The port for the health check endpoint (default: `PORT`). +## Environment variables -If you don't set `PORT` or `PORT_HEALTH` during deployment, environment variables will automatically be set to `80` for both, and port 80 will be automatically exposed in the container configuration. +| Variable | Default | Description | +|----------|---------|-------------| +| `PORT` | `80` | Main application server port | +| `PORT_HEALTH` | Same as `PORT` | Health check endpoint port | -If you're using a custom port, make sure to add it to your endpoint's environment variables, and expose it in the container configuration of your endpoint settings (under **Expose HTTP Ports (Max 10)**). 
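+
+As a quick sketch, the defaults in the table above resolve like this inside a worker:
+
+```python
+import os
+
+# PORT defaults to 80; PORT_HEALTH falls back to whatever PORT resolved to.
+port = int(os.getenv("PORT", "80"))
+health_port = int(os.getenv("PORT_HEALTH", str(port)))
+print(f"serving app on :{port}, health checks on :{health_port}")
+```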
+If using a custom port, add it to your endpoint's environment variables and expose it in container configuration (under **Expose HTTP Ports (Max 10)**).

-## Request timeouts
+## Timeouts and limits

-Requests made to a load balancing endpoint have two timeout scenarios:
+| Limit | Value |
+|-------|-------|
+| **Request timeout** | 2 min — returns `400` if no worker becomes available |
+| **Processing timeout** | 5.5 min per request — terminates the connection with `524` |
+| **Payload limit** | 30 MB (request and response) |

-1. **Request timeout (2 minutes)**: If no worker is available to process your request within 2 minutes (e.g., if a worker can't be initialized fast enough, or the endpoint has reached `MAX_WORKERS`), the system returns a `400` error. To implement retries, you should account for this response code in your client-side application.
-2. **Processing timeout (5.5 minutes)**: Once a worker receives and begins processing your request, there is a maximum processing time of 5.5 minutes. If processing exceeds this limit, the connection will be terminated with a `524` error. For tasks that consistently take longer than 5.5 minutes to process, load balancing endpoints may not be suitable.
+For payloads larger than 30 MB, use [network volumes](/storage/network-volumes) or implement chunking. For tasks that consistently take longer than 5.5 minutes, load balancing endpoints may not be suitable.

-
-If your server is misconfigured and the ports are not correctly opened, your workers will stay up for 8 minutes before being terminated. In this case requests will return a `502` error. This is a known issue and a fix is in progress.
-
+If your server ports are misconfigured, workers stay up for 8 minutes before terminating, returning `502` errors.

-## Payload limits
-
-Load balancing endpoints have a 30 MB payload limit for both requests and responses.
-
-If you need to handle payloads larger than 30 MB, you can try these approaches:
+## Handling cold starts

-- Use a [network volume](storage/network-volumes) to store model artifacts and large datasets for access during runtime.
-- Implement chunking strategies to split large payloads into smaller pieces.
-
-
-## Handling cold start errors
-
-When you first send a request to a load balancing endpoint, you might get a "no workers available" error. This happens because workers need time to initialize, i.e. the server is up, but the health check at `/ping` isn't passing yet.
-
-For production applications, you should implement a health check with retries before sending your actual requests.
-
-Here's a Python function that handles this:
+When workers are initializing, you may get "no workers available" errors. 
Implement retry logic to handle this: ```python import requests import time -def health_check_with_retry(base_url, api_key, max_retries=3, delay=2): - """Simple health check with retry logic for Runpod cold starts""" +def health_check_with_retry(base_url, api_key, max_retries=3, delay=5): headers = {"Authorization": f"Bearer {api_key}"} - + for attempt in range(max_retries): try: response = requests.get(f"{base_url}/ping", headers=headers, timeout=10) if response.status_code == 200: - print("✓ Health check passed") return True - except Exception as e: - print(f"Attempt {attempt + 1} failed: {e}") - + except Exception: + pass if attempt < max_retries - 1: time.sleep(delay) - - print("✗ Health check failed after retries") return False -# Usage example -base_url = "https://ENDPOINT_ID.api.runpod.ai" -api_key = "RUNPOD_API_KEY" - -# Ensures that a worker is ready (with retries) -if health_check_with_retry(base_url, api_key): - # Worker is ready, send your actual /generate request - response = requests.post( - f"{base_url}/generate", - headers={"Authorization": f"Bearer {api_key}"}, - json={"prompt": "Hello, world!"} - ) - print(response.json()) -else: - print("Worker failed to initialize") +# Usage +if health_check_with_retry("https://ENDPOINT_ID.api.runpod.ai", "RUNPOD_API_KEY"): + # Worker ready, send requests + pass ``` -The `health_check_with_retry` function: - -- Sends requests to the `/ping` endpoint with configurable retries (default: 3 attempts). -- Waits between attempts to give workers time to initialize (default: 2 seconds). -- Uses a 10-second timeout per health check request. -- Returns `True` when the worker is ready, or `False` if initialization fails. - -Use at least 3 retries with 5-10 second delays between attempts. This gives workers enough time to complete their cold start process before you send production requests. +Use at least 3 retries with 5-10 second delays. -## Technical details -The load balancing system employs an HTTP load balancer that inspects application-level protocols to make routing decisions. When a request arrives at `https://ENDPOINT_ID.api.runpod.ai/PATH`, the system: - -1. Identifies available healthy workers within the endpoint's worker pool. -2. Routes the request to a worker's exposed HTTP server. -3. Returns the worker's response directly to the client. +## When to use load balancing endpoints -Each worker runs an independent HTTP server (such as FastAPI, Flask, or Express) that: +Use load balancing endpoints when you need: -- Listens on ports specified via environment variables. -- Handles requests according to its custom API contract. -- Implements a required health check endpoint. +- Direct access to your model's HTTP server. +- Internal batching systems (like vLLM). +- Non-JSON payloads. +- Multiple endpoints within a single worker. +- Lower latency for real-time applications. \ No newline at end of file diff --git a/serverless/load-balancing/vllm-worker.mdx b/serverless/load-balancing/vllm-worker.mdx index c6055f93..4b7a43d2 100644 --- a/serverless/load-balancing/vllm-worker.mdx +++ b/serverless/load-balancing/vllm-worker.mdx @@ -6,21 +6,12 @@ description: "Learn how to deploy a custom vLLM server to a load balancing Serve This tutorial shows how to build a vLLM application using FastAPI and deploy it as a load balancing Serverless endpoint on Runpod. 
-## What you'll learn - To get a basic understanding of how to build a load balancing worker (or for more general use cases), see [Build a load balancing worker](/serverless/load-balancing/build-a-worker). -In this tutorial you'll learn how to: - -- Create a FastAPI application to serve your vLLM endpoints. -- Implement proper health checks for your vLLM workers. -- Deploy your vLLM application as a load balancing Serverless endpoint. -- Test and interact with your vLLM APIs. - ## Requirements Before you begin you'll need: diff --git a/serverless/overview.mdx b/serverless/overview.mdx index 3b0cc58e..fecc5d0f 100644 --- a/serverless/overview.mdx +++ b/serverless/overview.mdx @@ -1,44 +1,29 @@ --- title: "Overview" description: "Pay-as-you-go compute for AI models and compute-intensive workloads." +mode: "wide" --- -import { WorkersTooltip, HandlerFunctionTooltip, PodTooltip, RunpodHubTooltip, PublicEndpointTooltip, JobTooltip, LoadBalancingEndpointTooltip, QueueBasedEndpointsTooltip, InferenceTooltip, TrainingTooltip } from "/snippets/tooltips.jsx"; +import { JobTooltip, LoadBalancingEndpointTooltip, QueueBasedEndpointsTooltip, InferenceTooltip } from "/snippets/tooltips.jsx"; -Runpod Serverless is a cloud computing platform that lets you serve AI models for and run other compute-intensive workloads without managing servers. You only pay for the actual compute time you use, with no idle costs when your application isn't processing requests. - -## Why use Serverless? +
-* Focus on your code, not infrastructure: Deploy your applications without worrying about server management, scaling, or maintenance. -* GPU-powered computing: Access powerful GPUs for , , and other compute-intensive tasks. -* Automatic scaling: Your application scales automatically based on demand, from zero to hundreds of . -* Cost efficiency: Pay only for what you use, with per-second billing and no costs when idle. -* Fast deployment: Get your code running in the cloud in minutes with minimal configuration. +Runpod Serverless is a cloud computing platform that lets you serve AI models for and run other compute-intensive workloads without managing servers. You only pay for the actual compute time you use, with no idle costs when your application isn't processing requests. ## Get started -To get started with Serverless, follow one of the following guides to deploy your first endpoint: - - + Write a handler function, build a worker image, create an endpoint, and send your first request. - - Deploy a Stable Diffusion endpoint to generate images at scale. + + Deploy a ComfyUI worker and generate images using JSON workflows. + + Use Runpod's worker templates on GitHub as a starting point. + -You can also watch the following video for a quick overview of the endpoint deployment process: - - - ## Concepts ### [Endpoints](/serverless/endpoints/overview) @@ -47,7 +32,7 @@ The access point for your Serverless application. Endpoints provide a URL where ### [Workers](/serverless/workers/overview) -The container instances that execute your code when requests arrive at your endpoint. Each worker runs your custom [Docker container](/tutorials/introduction/containers) with your application code and dependencies. Runpod automatically manages worker lifecycle, starting them when needed and stopping them when idle to optimize resource usage. +The container instances that execute your code when requests arrive at your endpoint. Each worker runs your custom Docker container with your application code and dependencies. Runpod automatically manages worker lifecycle, starting them when needed and stopping them when idle to optimize resource usage. ### [Handler functions](/serverless/workers/handler-functions) @@ -86,7 +71,7 @@ When a user/client sends a request to your endpoint: 5. Workers remain active for a period to handle additional requests. 6. Idle workers eventually shut down if no new requests arrive. -
+
```mermaid %%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'15px','fontFamily':'font-inter'}}}%% @@ -128,7 +113,7 @@ Minimizing cold start times is key to creating a responsive and cost-effective e ### [Load balancing endpoints](/serverless/load-balancing/overview) -These endpoints route incoming traffic directly to available workers, distributing requests across the worker pool. Unlike traditional , they provide no queuing mechanism for request backlog. +These endpoints route incoming traffic directly to available workers, distributing requests across the worker pool. Unlike , they provide no queuing mechanism for request backlog. When using load balancing endpoints, you can define your own custom API endpoints without a handler function, using any HTTP framework of your choice (like FastAPI or Flask). @@ -146,7 +131,7 @@ Here's a typical Serverless development workflow: 7. Adjust your [endpoint settings](/serverless/endpoints/endpoint-configurations) to [optimize performance and cost](/serverless/development/optimization). 8. To update your endpoint logic, go back to step 1 and repeat the process. -
+
```mermaid %%{init: {'theme':'base', 'themeVariables': { 'primaryColor':'#9289FE','primaryTextColor':'#fff','primaryBorderColor':'#9289FE','lineColor':'#5F4CFE','secondaryColor':'#AE6DFF','tertiaryColor':'#FCB1FF','edgeLabelBackground':'#5F4CFE', 'fontSize':'15px','fontFamily':'font-inter'}}}%% @@ -187,96 +172,4 @@ flowchart TD linkStyle default stroke-width:2px,stroke:#5F4CFE ``` -
- - -For faster iteration and debugging of GPU-intensive applications, you can develop on a first before deploying to Serverless. This "Pod-first" workflow gives you direct access to the GPU environment with tools like Jupyter Notebooks and SSH, letting you iterate faster than deploying repeatedly to Serverless. Learn more in [Pod-first development](/serverless/development/dual-mode-worker). - - -## Rapid deployment options - -If you don't want to start from scratch and [build a custom worker](/serverless/workers/custom-worker), Runpod offers several ways to rapidly deploy and test pre-configured AI models, without writing your own handler function: - -### Fork a worker repository - -**Best for**: Creating a custom worker using an existing repository. - -Runpod maintains a collection of [worker repositories](https://github.com/runpod-workers) on GitHub that you can use as a starting point: - -* [worker-basic](https://github.com/runpod-workers/worker-basic): A minimal repository with essential functionality. -* [worker-template](https://github.com/runpod-workers/worker-template): A more comprehensive repository with additional features -* [Model-specific repositories](https://github.com/runpod-workers#worker-collection): Specialized repositories for common AI tasks (image generation, audio processing, etc.) - -After you fork a worker you can learn how to: - -1. Customize the [handler function](/serverless/workers/handler-functions) to add your own logic. -2. [Test the handler function](/serverless/development/local-testing) locally. -3. Deploy it to an endpoint using [Docker Hub](/serverless/workers/deploy) or [GitHub](/serverless/workers/github-integration). - -[Browse worker repositories →](https://github.com/runpod-workers) - -### Deploy a vLLM worker - -**Best for**: Deploying and serving large language models (LLMs) efficiently. - -vLLM workers are specifically optimized for running LLMs: - -* Support for any [Hugging Face model](https://huggingface.co/models). -* Optimized for LLM inference. -* Simple configuration via [environment variables](/serverless/vllm/environment-variables). -* High-performance inference with vLLM's PagedAttention and continuous batching. - -[Deploy a vLLM worker →](/serverless/vllm/get-started) - - -vLLM workers may require significant configuration (using environment variables) depending on the model you are deploying. Consult the README for your model on Hugging Face and the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/) for more details. - - - -### Deploy a repo from the Runpod Hub - -**Best for**: Instantly deploying preconfigured AI models. - -You can deploy a Serverless endpoint from a repo in the in seconds: - -1. Navigate to the [Hub page](https://www.console.runpod.io/hub) in the Runpod console. -2. Browse the collection and select a repo that matches your needs. -3. Review the repo details, including hardware requirements and available configuration options to ensure compatibility with your use case. -4. Click the **Deploy** button in the top-right of the repo page. You can also use the dropdown menu to deploy an older version. -5. Click **Create Endpoint** - -[Deploy a repo from the Runpod Hub →](https://www.console.runpod.io/hub) - -### Use Public Endpoints - -**Best for**: Deploying and serving pre-configured AI models quickly. - -Runpod maintains a collection of s that you can use to integrate pre-configured AI models into your applications quickly, without writing your own or deploying workers. 
- -[Browse Public Endpoints →](https://console.runpod.io/hub?tabSelected=public_endpoints) - -## Next steps - - - - Create and deploy a custom Serverless worker. - - - Learn how to configure and manage endpoints. - - - Understand how workers process requests. - - - Write handler functions to process incoming requests. - - - Deploy large language models in minutes. - - - Review storage options for your endpoints. - - - Learn how to structure and send requests to endpoints. - - +
\ No newline at end of file diff --git a/serverless/pricing.mdx b/serverless/pricing.mdx index aa666bab..e6332bb5 100644 --- a/serverless/pricing.mdx +++ b/serverless/pricing.mdx @@ -2,89 +2,65 @@ title: "Pricing" sidebarTitle: "Pricing" description: "Learn how Serverless billing works to optimize your costs." +mode: "wide" --- import GPUTable from '/snippets/serverless-gpu-pricing-table.mdx'; - - -Runpod offers custom pricing plans for large scale and enterprise workloads. If you're interested in learning more, [contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA). +
+ +Runpod offers custom pricing plans for large scale and enterprise workloads. [Contact our sales team](https://ecykq.share.hsforms.com/2MZdZATC3Rb62Dgci7knjbA) to learn more. -Runpod Serverless offers flexible, pay-per-second pricing with no upfront costs. This guide explains how pricing works and how to optimize your costs. - -## GPU pricing - -Serverless offers two pricing tiers: +Serverless offers pay-per-second pricing with no upfront costs. You're billed from when a worker starts until it fully stops, rounded up to the nearest second. -### Flex workers +## Worker types -**On-demand workers** that scale to zero when not in use, so you only pay when processing requests. Flex workers are ideal for variable workloads, non-time-sensitive applications, and maximizing cost efficiency for sporadic usage. +| | Flex workers | Active workers | +|---|--------------|----------------| +| **Behavior** | Scale to zero when idle | Always running (24/7) | +| **Pricing** | Standard per-second rate | 20–30% discount | +| **Best for** | Variable workloads, cost optimization | Consistent traffic, low-latency requirements | -### Active workers - -**Always-on workers** that run 24/7. Active workers receive a 20-30% discount compared to flex workers, but you are charged continuously regardless of usage. Use active workers for consistent workloads, latency-sensitive applications, and high-volume processing. - -### Pricing table (per second) - -The price of flex/active workers depends on the GPU type and worker configuration: +## GPU pricing -For the latest pricing information, visit the [Runpod pricing page](https://www.runpod.io/pricing). - -## How billing works - -Serverless billing operates on a precise pay-as-you-go model with specific timing mechanisms. +For the latest pricing, visit the [Runpod pricing page](https://www.runpod.io/pricing). -Billing starts when the system signals a worker to wake up and ends when the worker is fully stopped. Runpod Serverless is charged by the second, with partial seconds rounded up to the next full second. For example, if your request takes 2.3 seconds to complete, you'll be billed for 3 seconds. +## What you're billed for -### Compute and storage costs +Your total cost includes compute time and storage: -Your total Serverless costs include both compute time (GPU usage) and temporary storage: - -1. **Compute costs**: Charged per second based on the GPU type as shown in the pricing table above. -2. **Storage costs**: The worker container disk incurs charges only while workers are running, calculated in 5-minute intervals. Even if your worker runs for less than 5 minutes, you'll be charged for the full 5-minute period. The storage cost is \$0.000011574 per GB per 5 minutes (equivalent to approximately \$0.10 per GB per month). - -If you have many workers continuously running with high storage costs, you can utilize [network volumes](/storage/network-volumes) to reduce expenses. Network volumes allow you to share data efficiently across multiple workers, reduce per-worker storage requirements by centralizing common files, and maintain persistent storage separate from worker lifecycles. - -Network volumes are billed hourly at a rate of \$0.07 per GB per month for the first 1TB, and \$0.05 per GB per month for additional storage beyond that. 
+| Cost component | Description | Rate | +|----------------|-------------|------| +| **Compute** | GPU time while workers run | See pricing table above | +| **Container disk** | Worker storage (5-min intervals) | ~\$0.10/GB/month | +| **Network volume** | Shared persistent storage | \$0.07/GB/month (< 1TB), \$0.05/GB/month (> 1TB) | ### Compute cost breakdown -Serverless workers incur charges during these periods: - -1. **Start time:** The time required to initialize a worker and load models into GPU memory. -2. **Execution time:** The time spent actually processing the request. -3. **Idle time:** The period in which the worker remains active after completing a request. - -#### Start time - -A worker start occurs when a worker is initialized from a scaled-down state. This typically involves starting the container, loading models into GPU memory, and initializing runtime environments. Worker start time varies based on model size and complexity. Larger models take longer to load into GPU memory. +Workers incur charges during three phases: -To optimize worker start times, you can use FlashBoot (included at no extra charge) or configure your [endpoint settings](/serverless/endpoints/endpoint-configurations#reducing-worker-startup-times). +1. **Start time**: Initializing the container and loading models into GPU memory. Minimize with [FlashBoot](/serverless/endpoints/endpoint-configurations#flashboot) or [model caching](/serverless/endpoints/model-caching). -#### Execution time +2. **Execution time**: Processing requests. Set [execution timeouts](/serverless/endpoints/endpoint-configurations#execution-timeout) to prevent runaway jobs. -This is the time your worker spends processing a request. Execution time depends on the complexity of your workload, the size of input data, and the performance of the GPUs you've selected. +3. **Idle time**: Waiting for new requests before scaling down (default: 5 seconds). Configure in [endpoint settings](/serverless/endpoints/endpoint-configurations#idle-timeout). -Set reasonable [execution timeout limits](/serverless/endpoints/endpoint-configurations#execution-timeout) to prevent runaway jobs from consuming excessive resources, and optimize your code to reduce processing time where possible. - -#### Idle time - -After completing a request, workers remain active for a specified period before scaling down. This reduces response times for subsequent requests but incurs additional charges. The default idle timeout is 5 seconds, but you can configure this in your [endpoint settings](/serverless/endpoints/endpoint-configurations#idle-timeout). + +For high-volume workloads with significant storage needs, use [network volumes](/storage/network-volumes) to share data across workers and reduce per-worker storage costs. + -## Account spend limits +## Account limits -By default, Runpod accounts have a spend limit of \$80 per hour across all resources. This limit protects your account from unexpected charges. If your workload requires higher spending capacity, you can [contact support](https://www.runpod.io/contact) to increase it. +**Spend limit**: Default limit of \$80/hour across all resources. [Contact support](https://www.runpod.io/contact) to increase. ## Billing support -If you think you've been billed incorrectly, please [contact support](https://www.runpod.io/contact), and include this information in your request: - -1. The Serverless endpoint ID where you experienced billing issues. -2. Request ID for the specific request (if applicable). -3. 
The approximate time when the billing issue occurred. +If you believe you've been billed incorrectly, [contact support](https://www.runpod.io/contact), including the following information in your ticket: -Providing these details will help our support team resolve your issue more quickly. +- Endpoint ID +- Request ID (if applicable) +- Approximate time of the issue diff --git a/serverless/quickstart.mdx b/serverless/quickstart.mdx index 306e7f21..7706b009 100644 --- a/serverless/quickstart.mdx +++ b/serverless/quickstart.mdx @@ -10,62 +10,33 @@ For an even faster start, clone or download the [worker-basic](https://github.co -## What you'll learn - -In this tutorial you'll learn how to: - -* Set up your development environment. -* Create a handler function. -* Test your handler locally. -* Create a Dockerfile to package your handler function. -* Build and push your worker image to Docker Hub. -* Deploy your worker to a Serverless endpoint using the Runpod console. -* Send a test request to your endpoint. - ## Requirements * You've [created a Runpod account](/get-started/manage-accounts). * You've installed [Python 3.x](https://www.python.org/downloads/) and [Docker](https://docs.docker.com/get-started/get-docker/) on your local machine and configured them for your command line. -## Step 1: Create a Python virtual environment +## Step 1: Create project files -First, set up a virtual environment to manage your project dependencies. - - - Run this command in your local terminal: +Create a new directory with empty files for your project: - ```sh - # Create a Python virtual environment - python3 -m venv venv - ``` - +```bash +mkdir serverless-quickstart && cd serverless-quickstart +touch handler.py Dockerfile requirements.txt test_input.json +``` - - - - ```sh - source venv/bin/activate - ``` - +## Step 2: Install the Serverless SDK - - ```sh - venv\Scripts\activate - ``` - - - +Create a virtual environment and install the Serverless SDK - - ```sh - pip install runpod - ``` - - +```bash +python3 -m venv .venv +source .venv/bin/activate +pip install runpod +``` -## Step 2: Create a handler function +## Step 3: Create a handler function -Create a file named `handler.py` and add the following code: +Add the following code to `handler.py`: ```python handler.py import runpod @@ -100,11 +71,17 @@ if __name__ == '__main__': runpod.serverless.start({'handler': handler }) ``` -This is a bare-bones handler that processes a JSON object and outputs a `prompt` string contained in the `input` object. You can replace the `time.sleep(seconds)` call with your own Python code for generating images, text, or running any machine learning workload. +This is a bare-bones handler that processes a JSON object and outputs a `prompt` string contained in the `input` object. + + + +You can replace the `time.sleep(seconds)` call with your own Python code for generating images, text, or running any AI/ML workload. + + -## Step 3: Create a test input file +## Step 4: Create a test input file -You'll need to create an input file to properly test your handler locally. Create a file named `test_input.json` and add the following code: +Add the following code to `test_input.json` to properly test your handler locally: ```json test_input.json { @@ -114,9 +91,9 @@ You'll need to create an input file to properly test your handler locally. 
Creat } ``` -## Step 4: Test your handler function locally +## Step 5: Test your handler function locally -Run your handler function to verify that it works correctly: +Run your handler function using your local terminal: ```sh python handler.py @@ -139,12 +116,12 @@ INFO | Job result: {'output': 'Hey there!'} INFO | Local testing complete, exiting. ``` -## Step 5: Create a Dockerfile +## Step 6: Create a Dockerfile -Create a file named `Dockerfile` with the following content: +Add the following content to `Dockerfile`: -New to Dockerfiles? Learn the fundamentals with our [introduction to containers](/tutorials/introduction/containers) tutorial series, which covers [creating Dockerfiles](/tutorials/introduction/containers/create-dockerfiles), [Docker commands](/tutorials/introduction/containers/docker-commands), and [persisting data](/tutorials/introduction/containers/persist-data). +New to Dockerfiles? Learn the fundamentals with our [introduction to containers](/tutorials/introduction/containers) tutorial series. ```dockerfile Dockerfile @@ -162,7 +139,7 @@ COPY handler.py / CMD ["python3", "-u", "handler.py"] ``` -## Step 6: Build and push your worker image +## Step 7: Build and push your worker image @@ -187,7 +164,7 @@ Before you can deploy your worker on Runpod Serverless, you need to push it to D -## Step 7: Deploy your worker using the Runpod console +## Step 8: Deploy your worker using the Runpod console To deploy your worker to a Serverless endpoint: @@ -205,7 +182,7 @@ To deploy your worker to a Serverless endpoint: The system will redirect you to a dedicated detail page for your new endpoint. -## Step 8: Test your endpoint +## Step 9: Test your endpoint To test your endpoint, click the **Requests** tab in the endpoint detail page: diff --git a/serverless/storage/overview.mdx b/serverless/storage/overview.mdx index a5a5689d..afb735ce 100644 --- a/serverless/storage/overview.mdx +++ b/serverless/storage/overview.mdx @@ -1,66 +1,42 @@ --- title: "Storage options" sidebarTitle: "Storage options" -description: "Explore storage options for your Serverless workers, including container disks, network volumes, and S3-compatible storage." +description: "Storage options for Serverless workers: container disks, network volumes, and S3-compatible storage." --- -import { WorkersTooltip, HandlerFunctionTooltip, ColdStartTooltip } from "/snippets/tooltips.jsx"; - -This guide explains the different types of storage you can configure for your Serverless so they can access and store data when processing requests. - ## Storage types ### Container disk -Container disks hold the temporary storage that exists only while a is running, and are completely lost when the worker is stopped or scaled down. They are created automatically when a worker launches and remain tightly coupled with the worker's lifecycle. +Temporary storage that exists only while a worker is running. Data is lost when the worker stops or scales down. Fast read/write speeds since storage is locally attached. Cost is included in the worker's running cost. -Container disks provide fast read and write speeds since they are locally attached to workers. The cost of storage is included in the worker's running cost, making it an economical choice for temporary data. +All data saved by a worker's [handler function](/serverless/workers/handler-functions) is stored in the container disk by default. To persist data beyond the current worker session, use a network volume or S3-compatible storage. 
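+
+For example, a handler that writes scratch files is writing to the container disk (a minimal sketch; the `/tmp` path is illustrative):
+
+```python
+import os
+
+import runpod
+
+def handler(job):
+    # Anything written here lands on the container disk and disappears
+    # when this worker stops or scales down.
+    scratch_path = "/tmp/scratch.txt"
+    with open(scratch_path, "w") as f:
+        f.write(job["input"].get("prompt", ""))
+    return {"bytes_written": os.path.getsize(scratch_path)}
+
+runpod.serverless.start({"handler": handler})
+```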
-Any data saved by a worker's will be stored in the container disk by default. To persist data beyond the current worker session, use a network volume or S3-compatible storage. ### Network volume -Network volumes provide persistent storage that can be attached to different workers and even shared between multiple workers. Network volumes are ideal for sharing datasets between workers, storing large models that need to be accessed by multiple workers, and preserving data that needs to outlive any individual worker. - -To learn how to attach a network volume to your endpoint, see [Network volumes for Serverless](/storage/network-volumes#network-volumes-for-serverless). - -### S3-compatible storage integration - - - -Runpod's S3 integration works with any S3-compatible storage provider, not just AWS S3. You can use MinIO, Backblaze B2, DigitalOcean Spaces, and other compatible providers. - - - -Runpod's S3-compatible storage integration allows you to connect your Serverless endpoints to external object storage services, giving you the flexibility to use your own storage provider with standardized access protocols. - -You can supply your own credentials for any S3-compatible storage service, which is particularly useful for handling large files that exceed API payload limits. This storage option exists entirely outside the Runpod infrastructure, giving you complete control over data lifecycle and retention policies. Billing depends on your chosen provider's pricing model rather than Runpod's storage rates. - -To configure requests to send data to S3-compatible storage, see [S3-compatible storage integration](/serverless/endpoints/send-requests#s3-compatible-storage-integration). - -## Storage comparison table - -| Feature | Container Disk | Network Volume | S3-Compatible Storage | -| :---------------------- | :----------------------------------- | :----------------------------------- | :----------------------------- | -| **Persistence** | Temporary (erased when worker stops) | Permanent (independent of workers) | Permanent (external to Runpod) | -| **Sharing** | Not shareable | Can be attached to multiple workers | Accessible via S3 credentials | -| **Speed** | Fastest (local) | Fast (networked NVME) | Varies by provider | -| **Cost** | Included in worker cost | \$0.05-\$0.07/GB/month | Varies by provider | -| **Size limits** | Varies by worker config | Up to 4TB self-service | Varies by provider | -| **Best for** | Temporary processing | Multi-worker sharing | Very large files, external access | +Persistent storage that can be attached to multiple workers. Ideal for sharing datasets, storing large models, and preserving data beyond individual worker sessions. -## Serverless storage behavior +See [Network volumes for Serverless](/storage/network-volumes#network-volumes-for-serverless). -### Data isolation and sharing +### S3-compatible storage -Each worker has its own local directory and maintains its own data. This means that different workers running on your endpoint cannot share data directly between each other (unless a network volume is attached). +Connect to external object storage (AWS S3, MinIO, Backblaze B2, DigitalOcean Spaces, etc.) using your own credentials. Useful for large files exceeding API payload limits. Storage exists outside Runpod infrastructure with billing based on your provider. -### Caching and cold starts +See [S3-compatible storage](/serverless/endpoints/send-requests#s3-compatible-storage). 
-Serverless workers cache and load their Docker images locally on the container disk, even if a network volume is attached. While this local caching speeds up initial worker startup, loading large models into GPU memory can still significantly impact times. +## Comparison -For guidance on optimizing storage to reduce cold start times, see [Endpoint configuration](/serverless/endpoints/endpoint-configurations#reducing-worker-startup-times). +| Feature | Container Disk | Network Volume | S3-Compatible Storage | +|-----------------|-----------------------------|------------------------------ |------------------------------| +| **Persistence** | Temporary (lost on stop) | Permanent | Permanent (external) | +| **Sharing** | Not shareable | Multi-worker | Via S3 credentials | +| **Speed** | Fastest (local) | Fast (networked NVMe) | Varies by provider | +| **Cost** | Included in worker cost | \$0.05-\$0.07/GB/month | Varies by provider | +| **Best for** | Temporary processing | Multi-worker sharing, models | Large files, external access | -### Location constraints +## Behavior notes -If you use network volumes with your Serverless endpoint, your deployments will be constrained to the data center where the volume is located. This constraint may impact GPU availability and failover options, as your workloads must run in proximity to your storage. For global deployments, consider how storage location might affect your overall system architecture. \ No newline at end of file +- **Data isolation**: Workers don't share data unless a network volume is attached. +- **Caching**: Docker images cache locally on container disk, but loading large models into GPU memory still impacts cold start times. See [Reducing worker startup times](/serverless/endpoints/endpoint-configurations#reducing-worker-startup-times). +- **Location constraints**: Network volumes constrain deployments to the volume's data center, which may impact GPU availability. diff --git a/serverless/troubleshooting.mdx b/serverless/troubleshooting.mdx new file mode 100644 index 00000000..d55a91d3 --- /dev/null +++ b/serverless/troubleshooting.mdx @@ -0,0 +1,155 @@ +--- +title: "Troubleshooting" +sidebarTitle: "Troubleshooting" +description: "Common issues and solutions for Serverless endpoints and workers." +--- + +## Deployment issues + +### Worker fails to start + +If your worker fails to start or initialize: + +1. **Check logs**: View endpoint logs in the [Runpod console](https://www.console.runpod.io/serverless) for error messages. +2. **Verify local testing**: Ensure your handler works in [local testing](/serverless/development/local-testing) before deploying. +3. **Check dependencies**: Verify all dependencies are installed in your [Docker image](/serverless/workers/create-dockerfile). +4. **GPU compatibility**: Ensure your Docker image is compatible with the selected GPU type. +5. **Input format**: Verify your [input format](/serverless/endpoints/send-requests) matches what your handler expects. 
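+
+For the last item, a schema check at the top of your handler surfaces input-format problems early. A sketch using the SDK's `rp_validator` helper (confirm it is available in your SDK version):
+
+```python
+import runpod
+from runpod.serverless.utils.rp_validator import validate
+
+INPUT_SCHEMA = {
+    "prompt": {"type": str, "required": True},
+    "seconds": {"type": int, "required": False, "default": 0},
+}
+
+def handler(job):
+    validated = validate(job["input"], INPUT_SCHEMA)
+    if "errors" in validated:
+        # Report the validation errors in the job status.
+        return {"error": validated["errors"]}
+    job_input = validated["validated_input"]
+    return {"generated_text": f"Processed: {job_input['prompt']}"}
+
+runpod.serverless.start({"handler": handler})
+```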
+ +### Worker initializes but fails on requests + +| Issue | Solution | +|-------|----------| +| Input validation errors | Add input validation in your handler and check logs for the expected format | +| Missing dependencies | Verify all required packages are in your Dockerfile | +| Model loading failures | Check GPU memory requirements and model path | +| Permission errors | Ensure files are readable and directories are writable | + +## Job issues + +### Jobs stuck in queue + +If jobs remain `IN_QUEUE` for extended periods: + +- **No workers available**: Check if `max_workers` is set appropriately. +- **Workers throttled**: Your endpoint may be hitting rate limits. Check the Workers tab for throttled workers. +- **Cold start delays**: First requests after idle periods require worker initialization. Consider increasing `min_workers` or enabling [FlashBoot](/serverless/endpoints/endpoint-configurations#flashboot). + +### Jobs timing out + +| Cause | Solution | +|-------|----------| +| Processing takes too long | Increase `executionTimeout` in your [job policy](/serverless/endpoints/send-requests#execution-policies) | +| Model loading too slow | Use [model caching](/serverless/endpoints/model-caching) or bake models into your image | +| TTL too short | Set `ttl` to cover both queue time and execution time | + +### Jobs failing + +Check the job status response for error details. Common causes: + +- **Handler exceptions**: Unhandled exceptions in your handler code. Add try/catch blocks and return structured errors. +- **OOM (Out of Memory)**: Model or batch size exceeds GPU memory. Reduce batch size or use a larger GPU. +- **Timeout**: Job exceeded execution timeout. Increase timeout or optimize processing. + +## Cold start issues + +### Slow cold starts + +Cold start time includes container startup, model loading, and initialization. To reduce cold starts: + +1. **Use model caching**: Store models on [network volumes](/serverless/endpoints/model-caching) instead of downloading on each start. +2. **Enable FlashBoot**: Use [FlashBoot](/serverless/endpoints/endpoint-configurations#flashboot) for faster container initialization. +3. **Optimize image size**: Use smaller base images and remove unnecessary dependencies. +4. **Initialize outside handler**: Load models at module level, not inside the handler function. + +```python +# Good: Load model once at startup +model = load_model() + +def handler(job): + return model.predict(job["input"]) + +# Bad: Load model on every request +def handler(job): + model = load_model() # Slow! + return model.predict(job["input"]) +``` + +### Too many cold starts + +If you're seeing frequent cold starts: + +- **Increase idle timeout**: Set a longer `idle_timeout` to keep workers warm between requests. +- **Set minimum workers**: Configure `min_workers` > 0 to maintain warm workers. +- **Check traffic patterns**: Sporadic traffic causes more cold starts than steady traffic. + +## Logging issues + +### Missing logs + +If logs aren't appearing in the console: + +1. **Check throttling**: Excessive logging triggers throttling. Reduce log verbosity. +2. **Verify output streams**: Ensure you're writing to stdout/stderr, not just files. +3. **Check worker status**: Logs only appear for successfully initialized workers. +4. **Retention period**: Logs older than 90 days are automatically removed. + +### Log throttling + +To avoid log throttling: + +- Reduce log verbosity in production. +- Use structured logging for efficiency. 
+- Store detailed logs on [network volumes](/serverless/storage/overview) instead of console output.
+
+## vLLM-specific issues
+
+### OOM errors
+
+If your vLLM worker runs out of memory:
+
+- Lower `GPU_MEMORY_UTILIZATION` from 0.90 to 0.85.
+- Reduce `MAX_MODEL_LEN` to limit the context window.
+- Use a GPU with more VRAM.
+
+### Model not loading
+
+| Issue | Solution |
+|-------|----------|
+| Model not found | Verify `MODEL_NAME` matches the Hugging Face model ID exactly |
+| Gated model access denied | Set `HF_TOKEN` with a token that has access to the model |
+| Incompatible model | Check [vLLM supported models](https://docs.vllm.ai/en/latest/models/supported_models.html) |
+
+### OpenAI API errors
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| 401 Unauthorized | Invalid API key | Verify `RUNPOD_API_KEY` is correct |
+| 404 Not Found | Wrong endpoint URL | Use the format `https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1` |
+| Connection refused | Endpoint not ready | Wait for workers to initialize |
+
+## Load balancing endpoint issues
+
+### "No workers available" error
+
+This means workers didn't initialize in time. Common causes:
+
+- **First request**: Workers need time to start. Retry the request. (See [Handling cold starts](/serverless/load-balancing/overview#handling-cold-starts) for more information.)
+- **All workers busy**: Increase `max_workers` to handle more concurrent requests.
+- **Workers crashing**: Check logs for initialization errors.
+
+### Requests not reaching workers
+
+Verify your HTTP server is:
+- Listening on the port set in your `PORT` environment variable (default: `80`), or the custom port exposed in your endpoint configuration.
+- Binding to `0.0.0.0`, not `127.0.0.1`.
+- Returning proper HTTP responses.
+
+## Getting help
+
+If you're still experiencing issues:
+
+1. **Check endpoint logs** for detailed error messages.
+2. **SSH into workers** using [SSH access](/serverless/development/ssh-into-workers) to debug in real-time.
+3. **Review metrics** in the Metrics tab to identify patterns.
+4. **Contact support** at [help@runpod.io](mailto:help@runpod.io) with your endpoint ID and error details.
diff --git a/serverless/vllm/configuration.mdx b/serverless/vllm/configuration.mdx
index 3619494b..094dffcc 100644
--- a/serverless/vllm/configuration.mdx
+++ b/serverless/vllm/configuration.mdx
@@ -4,27 +4,31 @@ sidebarTitle: "Configure vLLM"
 description: "Learn how to set up vLLM endpoints to work with your chosen model."
 ---
 
-Most LLMs need specific configuration to run properly on vLLM. You need to understand what settings your model expects for loading, tokenization, and generation.
+Most LLMs need specific configuration to run properly on vLLM. Default settings work for some models, but many require custom tokenization, attention mechanisms, or feature flags. Without the right settings, workers may fail to load or produce incorrect outputs.
 
-This guide covers how to configure your vLLM endpoints for different model families, how environment variables map to vLLM command-line flags, and recommended configurations for popular models, and how to select the right GPU for your model.
+When deploying a model, check its Hugging Face README and the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/) for required settings.
 
-## Why is vLLM sometimes hard to configure?
+## Environment variables
 
-vLLM supports hundreds of models, but default settings only work out of the box for a subset of them. Without the right settings, your vLLM workers may fail to load, produce incorrect outputs, or miss key features. 
+vLLM is configured using [command-line flags](https://docs.vllm.ai/en/latest/configuration/engine_args/). On Runpod, set these as [environment variables](/serverless/vllm/environment-variables) instead. -Different model architectures have different requirements for tokenization, attention mechanisms, and features like tool calling or reasoning. For example, Mistral models use a specialized tokenizer mode and config format, while reasoning models like DeepSeek-R1 require you to specify a reasoning parser. +Convert flag names to uppercase with underscores. -When deploying a model, check its Hugging Face README and the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/) for required or recommended settings. +For example: -## Mapping environment variables to vLLM CLI flags +```bash +--tokenizer_mode mistral +``` -When running vLLM with `vllm serve`, the engine is configured using [command-line flags](https://docs.vllm.ai/en/latest/configuration/engine_args/). On Runpod, you set these options with [environment variables](/serverless/vllm/environment-variables) instead. +Becomes: -Each vLLM command-line argument has a corresponding environment variable. Convert the flag name to uppercase with underscores: `--tokenizer_mode` becomes `TOKENIZER_MODE`, `--enable-auto-tool-choice` becomes `ENABLE_AUTO_TOOL_CHOICE`, and so on. +```bash +TOKENIZER_MODE=mistral +``` ### Example: Deploying Mistral -To launch a Mistral model using the vLLM CLI, you would run a command similar to this: +CLI command: ```bash vllm serve mistralai/Ministral-8B-Instruct-2410 \ @@ -35,9 +39,9 @@ vllm serve mistralai/Ministral-8B-Instruct-2410 \ --tool-call-parser mistral ``` -On Runpod, set these options as environment variables when configuring your endpoint: +Equivalent Runpod environment variables: -| Environment variable | Value | +| Variable | Value | | --- | --- | | `MODEL_NAME` | `mistralai/Ministral-8B-Instruct-2410` | | `TOKENIZER_MODE` | `mistral` | @@ -46,74 +50,49 @@ On Runpod, set these options as environment variables when configuring your endp | `ENABLE_AUTO_TOOL_CHOICE` | `true` | | `TOOL_CALL_PARSER` | `mistral` | -This pattern applies to any vLLM command-line flag. Find the corresponding environment variable name and add it to your endpoint configuration. - ## Model-specific configurations -The table below lists recommended environment variables for popular model families. These settings handle common requirements like tokenization modes, tool calling support, and reasoning capabilities. - -Not all models in a family require all settings. Check your model's documentation for exact requirements. +Recommended environment variables for popular model families. Check your model's documentation for exact requirements. | Model family | Example model | Key environment variables | Notes | | --- | --- | --- | --- | -| Qwen3 | `Qwen/Qwen3-8B` | `ENABLE_AUTO_TOOL_CHOICE=true` `TOOL_CALL_PARSER=hermes` | Qwen models often ship in various quantization formats. If you are deploying an AWQ or GPTQ version, ensure `QUANTIZATION` is set correctly (e.g., `awq`). | -| OpenChat | `openchat/openchat-3.5-0106` | None required | OpenChat relies heavily on specific chat templates. If the default templates produce poor results, use `CUSTOM_CHAT_TEMPLATE` to inject the precise Jinja2 template required for the OpenChat correction format. | -| Gemma | `google/gemma-3-1b-it` | None required | Gemma models require an active Hugging Face token. Ensure your `HF_TOKEN` is set as a secret. 
Gemma also performs best when `DTYPE` is explicitly set to `bfloat16` to match its native training precision. | +| Qwen3 | `Qwen/Qwen3-8B` | `ENABLE_AUTO_TOOL_CHOICE=true` `TOOL_CALL_PARSER=hermes` | For AWQ/GPTQ versions, set `QUANTIZATION` accordingly. | +| OpenChat | `openchat/openchat-3.5-0106` | None required | Use `CUSTOM_CHAT_TEMPLATE` if default templates produce poor results. | +| Gemma | `google/gemma-3-1b-it` | None required | Requires `HF_TOKEN`. Set `DTYPE=bfloat16` for best results. | | DeepSeek-R1 | `deepseek-ai/DeepSeek-R1-Distill-Qwen-7B` | `REASONING_PARSER=deepseek_r1` | Enables reasoning mode for chain-of-thought outputs. | -| Phi-4 | `microsoft/Phi-4-mini-instruct` | None required | Phi models are compact but have specific architectural quirks. Setting `ENFORCE_EAGER=true` can sometimes resolve initialization issues with Phi models on older CUDA versions, though it may slightly reduce performance compared to CUDA graphs. | -| Llama 3 | `meta-llama/Llama-3.2-3B-Instruct` | `TOOL_CALL_PARSER=llama3_json` `ENABLE_AUTO_TOOL_CHOICE=true` | Llama 3 models often require strict attention to context window limits. Use `MAX_MODEL_LEN` to prevent the KV cache from exceeding your GPU VRAM. If you are using a 24 GB GPU like a 4090, setting `MAX_MODEL_LEN` to `8192` or `16384` is a safe starting point. | -| Mistral | `mistralai/Ministral-8B-Instruct-2410` | `TOKENIZER_MODE=mistral`, `CONFIG_FORMAT=mistral`, `LOAD_FORMAT=mistral`, `TOOL_CALL_PARSER=mistral` `ENABLE_AUTO_TOOL_CHOICE=true` | Mistral models use specialized tokenizers to work properly. | - -## Selecting GPU size based on the model - -Selecting the right GPU for vLLM is a balance between **model size**, **quantization**, and your required **context length**. Because vLLM pre-allocates memory for its KV (Key-Value) cache to enable high-throughput serving, you generally need more VRAM than the bare minimum required just to load the model. +| Phi-4 | `microsoft/Phi-4-mini-instruct` | None required | `ENFORCE_EAGER=true` can resolve initialization issues on older CUDA versions. | +| Llama 3 | `meta-llama/Llama-3.2-3B-Instruct` | `TOOL_CALL_PARSER=llama3_json` `ENABLE_AUTO_TOOL_CHOICE=true` | Use `MAX_MODEL_LEN` to prevent KV cache from exceeding GPU VRAM. | +| Mistral | `mistralai/Ministral-8B-Instruct-2410` | `TOKENIZER_MODE=mistral` `CONFIG_FORMAT=mistral` `LOAD_FORMAT=mistral` `TOOL_CALL_PARSER=mistral` `ENABLE_AUTO_TOOL_CHOICE=true` | Mistral models require specialized tokenizers. | -### VRAM estimation formula +## GPU selection -A reliable rule of thumb for estimating the required VRAM for a model in vLLM is: +vLLM pre-allocates memory for its KV cache, so you need more VRAM than the minimum to load the model. -* **FP16/BF16 (unquantized):** 2 bytes per parameter. -* **INT8 quantized:** 1 byte per parameter. -* **INT4 (AWQ/GPTQ):** 0.5 bytes per parameter. -* **KV cache buffer:** vLLM typically reserves 10-30% of remaining VRAM for the KV cache to handle concurrent requests. +### VRAM estimation -Use the table below as a starting point to select a hardware configuration for your model. +- **FP16/BF16**: 2 bytes per parameter. +- **INT8**: 1 byte per parameter. +- **INT4 (AWQ/GPTQ)**: 0.5 bytes per parameter. +- **KV cache**: vLLM reserves 10-30% of remaining VRAM for concurrent requests. 
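+
+To sanity-check a GPU choice before deploying, you can turn this rule of thumb into a quick calculation. This is a rough sketch only; actual usage varies with model architecture, quantization overhead, and context length:
+
+```python
+# Back-of-the-envelope VRAM estimate: weight memory plus KV cache headroom.
+# Illustrative only; real usage depends on architecture and context length.
+BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}
+
+def estimate_vram_gb(params_billions, precision="fp16", kv_overhead=0.3):
+    """Estimate total VRAM (GB) needed to serve a model with vLLM."""
+    weights_gb = params_billions * 1e9 * BYTES_PER_PARAM[precision] / 1024**3
+    # Add headroom for the KV cache that vLLM pre-allocates.
+    return weights_gb * (1 + kv_overhead)
+
+print(f"7B fp16: ~{estimate_vram_gb(7):.0f} GB")            # ~17 GB: fits a 24 GB GPU
+print(f"32B fp16: ~{estimate_vram_gb(32):.0f} GB")          # ~78 GB: A100/H100 class
+print(f"70B int4: ~{estimate_vram_gb(70, 'int4'):.0f} GB")  # ~42 GB: fits a 48 GB GPU
+```
+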
-| Model size (parameters) | Recommended GPUs | VRAM | +| Model size | Recommended GPUs | VRAM | | --- | --- | --- | -| **Small (\<10B)** | RTX 4090, A6000, L4 | 16–24 GB | -| **Medium (10B–30B)** | A6000, L40S | 32–48 GB | -| **Large (30B–70B)** | A100, H100, B200 | 80–180 GB | - ---- +| **Small (\<10B)** | RTX 4090, A6000, L4 | 16-24 GB | +| **Medium (10B-30B)** | A6000, L40S | 32-48 GB | +| **Large (30B-70B)** | A100, H100, B200 | 80-180 GB | +### Troubleshooting memory issues -### Context window vs. VRAM - -The more context you need (e.g., 32k or 128k tokens), the more VRAM the KV cache consumes. If you encounter Out-of-Memory (OOM) errors, use the `MAX_MODEL_LEN` environment variable to cap the context. For example, a 7B model that OOMs at 32k context on a 24 GB card will often run perfectly at 16k. - -### GPU memory utilization - -By default, vLLM attempts to use 90% of the available VRAM (`GPU_MEMORY_UTILIZATION=0.90`). - -* **If you OOM during initialization:** Lower this to `0.85`. -* **If you have extra headroom:** Increase it to `0.95` to allow for more concurrent requests. - -### Quantization (AWQ/GPTQ) - -If you are limited by a single GPU, use a quantized version of the model (e.g., `Meta-Llama-3-8B-Instruct-AWQ`). This reduces the weight memory by 50-75% compared to `FP16`, allowing you to fit larger models on cards like the RTX 4090 (24 GB) or A4000 (16 GB). +- **OOM errors**: Lower `GPU_MEMORY_UTILIZATION` from 0.90 to 0.85, or reduce `MAX_MODEL_LEN`. +- **Context window limits**: More context means more KV cache. A 7B model that OOMs at 32k context often runs fine at 16k. +- **Limited VRAM**: Use quantized models (AWQ/GPTQ) to reduce memory by 50-75%. -For production workloads where high availability is key, always select **multiple GPU types** in your [Serverless endpoint configuration](/serverless/endpoints/endpoint-configurations). This allows the system to fall back to a different hardware tier if your primary choice is out of stock in a specific data center. +For production workloads, select multiple GPU types in your [endpoint configuration](/serverless/endpoints/endpoint-configurations) for hardware fallback. -## vLLM recipes - -vLLM provides step-by-step recipes for common deployment scenarios, including deploying specific models, optimizing performance, and integrating with frameworks. - -Find the recipes at [docs.vllm.ai/projects/recipes](https://docs.vllm.ai/projects/recipes/en/latest/index.html). They are community-maintained and updated regularly as vLLM evolves. - -You can often find further information in the documentation for the specific model you are deploying. For example: +## Additional resources -- [Mistral + vLLM deployment guide](https://docs.mistral.ai/deployment/self-deployment/vllm). -- [Qwen + vLLM deployment guide](https://qwen.readthedocs.io/en/latest/deployment/vllm.html#). \ No newline at end of file +- [vLLM recipes](https://docs.vllm.ai/projects/recipes/en/latest/index.html): Step-by-step deployment guides. +- [Mistral + vLLM guide](https://docs.mistral.ai/deployment/self-deployment/vllm). +- [Qwen + vLLM guide](https://qwen.readthedocs.io/en/latest/deployment/vllm.html). diff --git a/serverless/vllm/get-started.mdx b/serverless/vllm/get-started.mdx index 19a98a84..1f5a1491 100644 --- a/serverless/vllm/get-started.mdx +++ b/serverless/vllm/get-started.mdx @@ -4,24 +4,13 @@ sidebarTitle: "Quickstart" description: "Create a Serverless endpoint to serve LLM inference via API request." 
--- -This tutorial shows how to deploy a large language model using Runpod's vLLM worker. At the end, you'll have a fully functional Serverless endpoint that can handle LLM inference requests. - -## What you'll learn - -In this tutorial, you'll learn how to: - -* Configure and deploy a vLLM worker using Runpod Serverless. -* Select the appropriate hardware and scaling settings for your model. -* Set up environment variables to customize your deployment. -* Test your endpoint using the Runpod API. -* Troubleshoot common issues that might arise during deployment. - ## Requirements -* You've [created a Runpod account](/get-started/manage-accounts). -* (For gated models) You've created a [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). +* [Runpod account](/get-started/manage-accounts). +* [Runpod API key](/get-started/api-keys). +* (For gated models) [Hugging Face access token](https://huggingface.co/docs/hub/en/security-tokens). -## Step 1: Choose your model +## Step 1: Choose a model First, decide which LLM you want to deploy. The vLLM worker supports most models available on Hugging Face, including: @@ -39,9 +28,9 @@ For this tutorial, we'll use `openchat/openchat-3.5-0106`, but you can substitut Depending on the model you choose, you may need to [configure your endpoint](/serverless/vllm/configuration) with additional environment variables. -## Step 2: Deploy using the Runpod console +## Step 2: Deploy using the Runpod UI -The easiest way to deploy a vLLM worker is through Runpod's Ready-to-Deploy Repos: +The easiest way to deploy a vLLM worker is through Runpod's ready-to-deploy repos: 1. Find the [vLLM repo](https://console.runpod.io/hub/runpod-workers/worker-vllm) in the Runpod Hub. 2. Click **Deploy**, using the latest vLLM worker version. @@ -59,17 +48,15 @@ For more details on how to optimize your endpoint, see [Endpoint configurations] -## Step 3: Understand your endpoint - -While your endpoint is initializing, let's understand what's happening and what you'll be able to do with it. +## Step 3: Find your endpoint ID -Runpod is creating a Serverless endpoint with your specified configuration, and the vLLM worker image is being deployed using your chosen model. Once deployment is complete, make a note of your **Endpoint ID**, as you'll need this to make API requests. +Once deployment is complete, make a note of your **Endpoint ID**, as you'll need this to make API requests. -## Step 4: Send a test request +## Step 4: Send a test request using the UI To test your worker, click the **Requests** tab in the endpoint detail page: @@ -114,7 +101,22 @@ When the workers finish processing your request, you should see output on the ri } ``` -## Step 5: Customize your deployment with environment variables (optional) +## Step 5: Send a test request using the API + +To send a test request using the API, use the following command, replacing `YOUR_ENDPOINT_ID` and `YOUR_API_KEY` with your actual endpoint ID and API key: + +```bash +curl -X POST "https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/runsync" \ + -H "Authorization: Bearer YOUR_API_KEY" \ + -H "Content-Type: application/json" \ + -d '{"input": {"prompt": "Hello World"}}' +``` + + +Congratulations! You've successfully deployed a vLLM worker on Runpod Serverless. You now have a powerful, scalable LLM inference API that's compatible with both the OpenAI client and Runpod's native API. 
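+
+Since the endpoint is OpenAI-compatible, you can also run the same smoke test with the OpenAI Python client. This is a minimal sketch assuming the tutorial's `openchat/openchat-3.5-0106` model; see [OpenAI API compatibility](/serverless/vllm/openai-compatibility) for details:
+
+```python
+from openai import OpenAI
+
+# Replace YOUR_ENDPOINT_ID and YOUR_API_KEY with your actual values.
+client = OpenAI(
+    api_key="YOUR_API_KEY",
+    base_url="https://api.runpod.ai/v2/YOUR_ENDPOINT_ID/openai/v1",
+)
+
+response = client.chat.completions.create(
+    model="openchat/openchat-3.5-0106",  # The model deployed in this tutorial
+    messages=[{"role": "user", "content": "Hello World"}],
+    max_tokens=50,
+)
+print(response.choices[0].message.content)
+```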
+ + +## Customize your deployment with environment variables (optional) If you need to customize your model deployment, you can edit your endpoint settings to add environment variables. Here are some useful environment variables you might want to set: @@ -147,13 +149,7 @@ If you encounter issues with your deployment: ## Next steps - -Congratulations! You've successfully deployed a vLLM worker on Runpod Serverless. You now have a powerful, scalable LLM inference API that's compatible with both the OpenAI client and Runpod's native API. - - -Next you can try: - -* [Sending requests using the Runpod API](/serverless/vllm/vllm-requests). -* [Learning about vLLM's OpenAI API compatibility](/serverless/vllm/openai-compatibility). -* [Customizing your vLLM worker's handler function](/serverless/workers/handler-functions). -* [Building a custom worker for more specialized workloads](/serverless/workers/custom-worker). +* [Send requests using the Runpod API](/serverless/vllm/vllm-requests). +* [Learn about vLLM's OpenAI API compatibility](/serverless/vllm/openai-compatibility). +* [Customize your vLLM worker's handler function](/serverless/workers/handler-functions). +* [Build a custom worker for more specialized workloads](/serverless/workers/custom-worker). diff --git a/serverless/vllm/openai-compatibility.mdx b/serverless/vllm/openai-compatibility.mdx index 97a80155..6e464f68 100644 --- a/serverless/vllm/openai-compatibility.mdx +++ b/serverless/vllm/openai-compatibility.mdx @@ -1,396 +1,192 @@ --- -title: "OpenAI API compatibility guide" -sidebarTitle: "OpenAI API compatibility" -description: "Integrate vLLM workers with OpenAI client libraries and API-compatible tools." +title: "OpenAI API compatibility" +sidebarTitle: "OpenAI compatibility" +description: "Use OpenAI client libraries with your vLLM workers." --- -Runpod's vLLM workers implement OpenAI API compatibility, allowing you to use familiar [OpenAI client libraries](https://platform.openai.com/docs/libraries) with your deployed models. This guide explains how to leverage this compatibility to integrate your models with existing OpenAI-based applications. - -## Endpoint structure - -You can make OpenAI-compatible API requests to your vLLM workers by sending requests to this base URL pattern: - -``` -https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1 -``` - -## Supported APIs - -vLLM workers support these core OpenAI API endpoints: - -| Endpoint | Description | Status | -| ------------------- | ------------------------------- | --------------- | -| `/chat/completions` | Generate chat model completions | Fully supported | -| `/completions` | Generate text completions | Fully supported | -| `/models` | List available models | Supported | - -## Model naming - -The `MODEL_NAME` environment variable is essential for all OpenAI-compatible API requests. This variable corresponds to either: - -1. The [Hugging Face model](https://huggingface.co/models) you've deployed (e.g., `mistralai/Mistral-7B-Instruct-v0.2`). -2. A custom name if you've set `OPENAI_SERVED_MODEL_NAME_OVERRIDE` as an environment variable. - -This model name is used in chat and text completion API requests to identify which model should process your request. 
- - -## Initialize the OpenAI client - -Before you can send API requests, set up an OpenAI client with your Runpod API key and endpoint URL: - -```python -from openai import OpenAI - -MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.2" # Use your deployed model - -# Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values -client = OpenAI( - api_key="RUNPOD_API_KEY", - base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1", -) -``` - -## Send requests - -You can use Runpod's OpenAI-compatible API to send requests to your Runpod endpoint, enabling you to use the same client libraries and code that you use with OpenAI's services. You only need to change the base URL to point to your Runpod endpoint. - - - -You can also send requests using [Runpod's native API](/serverless/vllm/vllm-requests), which provides additional flexibility and control. - - - -### Chat completions - -The `/chat/completions` endpoint is designed for instruction-tuned LLMs that follow a chat format. - -#### Non-streaming request - -Here's how you can make a basic chat completion request: - -```python -from openai import OpenAI -MODEL_NAME = "YOUR_MODEL_NAME" # Replace with your actual model - -# Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values -client = OpenAI( - api_key="RUNPOD_API_KEY", - base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1", -) - -# Chat completion request (for instruction-tuned models) -response = client.chat.completions.create( - model=MODEL_NAME, - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Hello, who are you?"} - ], - temperature=0.7, - max_tokens=500 -) - -# Print the response -print(response.choices[0].message.content) -``` - -#### Response format - -The API returns responses in this JSON format: - -```json -{ - "id": "cmpl-123abc", - "object": "chat.completion", - "created": 1677858242, - "model": "mistralai/Mistral-7B-Instruct-v0.2", - "choices": [ - { - "message": { - "role": "assistant", - "content": "I am Mistral, an AI assistant based on the Mistral-7B-Instruct model. How can I help you today?" - }, - "index": 0, - "finish_reason": "stop" - } - ], - "usage": { - "prompt_tokens": 23, - "completion_tokens": 24, - "total_tokens": 47 - } -} -``` - -#### Streaming request - -Streaming allows you to receive the model's output incrementally as it's generated, rather than waiting for the complete response. This real-time delivery enhances responsiveness, making it ideal for interactive applications like chatbots or for monitoring the progress of lengthy generation tasks. - -```python -# Create a streaming chat completion request -stream = client.chat.completions.create( - model=MODEL_NAME, - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Write a short poem about stars."} - ], - temperature=0.7, - max_tokens=200, - stream=True # Enable streaming -) - -# Print the streaming response -print("Response: ", end="", flush=True) -for chunk in stream: - if chunk.choices[0].delta.content: - print(chunk.choices[0].delta.content, end="", flush=True) -print() -``` - -### Text completions - -The `/completions` endpoint is designed for base LLMs and text completion tasks. 
- -#### Non-streaming request - -Here's how you can make a text completion request: - -```python -# Text completion request -response = client.completions.create( - model=MODEL_NAME, - prompt="Write a poem about artificial intelligence:", - temperature=0.7, - max_tokens=150 -) - -# Print the response -print(response.choices[0].text) -``` - -#### Response format - -The API returns responses in this JSON format: - -```json -{ - "id": "cmpl-456def", - "object": "text_completion", - "created": 1677858242, - "model": "mistralai/Mistral-7B-Instruct-v0.2", - "choices": [ - { - "text": "In circuits of silicon and light,\nA new form of mind takes flight.\nNot born of flesh, but of human design,\nArtificial intelligence, a marvel divine.", - "index": 0, - "finish_reason": "stop", - "logprobs": null - } - ], - "usage": { - "prompt_tokens": 8, - "completion_tokens": 39, - "total_tokens": 47 - } -} -``` - -#### Streaming request +vLLM workers implement OpenAI API compatibility, so you can use [OpenAI client libraries](https://platform.openai.com/docs/libraries) with your deployed models. + +To integrate with OpenAI-compatible tools, just configure the base URL and API key using your Runpod API key and Serverless endpoint ID. + +## Setup + + + + ```python + from openai import OpenAI + + client = OpenAI( + api_key="RUNPOD_API_KEY", + base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1" + ) + ``` + + + ```javascript + import { OpenAI } from "openai"; + + const client = new OpenAI({ + apiKey: "RUNPOD_API_KEY", + baseURL: "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1" + }); + ``` + + + +Replace `ENDPOINT_ID` and `RUNPOD_API_KEY` with your actual values. + +## Supported endpoints + +| Endpoint | Description | +|----------|-------------| +| `/chat/completions` | Chat model completions (instruction-tuned models) | +| `/completions` | Text completions (base models) | +| `/models` | List available models | + +## Chat completions + +For instruction-tuned models that follow a chat format. + + + + ```python + response = client.chat.completions.create( + model="mistralai/Mistral-7B-Instruct-v0.2", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Hello, who are you?"} + ], + temperature=0.7, + max_tokens=500 + ) + + print(response.choices[0].message.content) + ``` + + + ```python + stream = client.chat.completions.create( + model="mistralai/Mistral-7B-Instruct-v0.2", + messages=[ + {"role": "system", "content": "You are a helpful assistant."}, + {"role": "user", "content": "Write a short poem about stars."} + ], + temperature=0.7, + max_tokens=200, + stream=True + ) + + for chunk in stream: + if chunk.choices[0].delta.content: + print(chunk.choices[0].delta.content, end="", flush=True) + ``` + + + +## Text completions + +For base models and raw text completion. 
+ + + + ```python + response = client.completions.create( + model="mistralai/Mistral-7B-Instruct-v0.2", + prompt="Write a poem about artificial intelligence:", + temperature=0.7, + max_tokens=150 + ) + + print(response.choices[0].text) + ``` + + + ```python + stream = client.completions.create( + model="mistralai/Mistral-7B-Instruct-v0.2", + prompt="The future of AI is", + temperature=0.7, + max_tokens=100, + stream=True + ) + + for chunk in stream: + print(chunk.choices[0].text or "", end="", flush=True) + ``` + + + +## Model name + +The `model` parameter must match either: +- The Hugging Face model you deployed (e.g., `mistralai/Mistral-7B-Instruct-v0.2`) +- A custom name set via the `OPENAI_SERVED_MODEL_NAME_OVERRIDE` environment variable + +List available models: ```python -# Create a completion stream -response_stream = client.completions.create( - model=MODEL_NAME, - prompt="Runpod is the best platform because", - temperature=0, - max_tokens=100, - stream=True, -) - -# Stream the response -for response in response_stream: - print(response.choices[0].text or "", end="", flush=True) +models = client.models.list() +print([model.id for model in models]) ``` -### List available models +## Parameters -The `/models` endpoint allows you to get a list of available models on your endpoint: - -```python -models_response = client.models.list() -list_of_models = [model.id for model in models_response] -print(list_of_models) -``` - -#### Response format - -```json -{ - "object": "list", - "data": [ - { - "id": "mistralai/Mistral-7B-Instruct-v0.2", - "object": "model", - "created": 1677858242, - "owned_by": "runpod" - } - ] -} -``` - -## Chat completion parameters - -Here are all available parameters for the `/chat/completions` endpoint: +Standard OpenAI parameters are supported. Include them directly in your request. + | Parameter | Type | Default | Description | | --- | --- | --- | --- | -| `messages` | `list[dict[str, str]]` | Required | List of messages with `role` and `content` keys. The model's chat template will be applied automatically. | -| `model` | `string` | Required | The model repo that you've deployed on your Runpod Serverless endpoint. | -| `temperature` | `float` | `0.7` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | -| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | -| `n` | `int` | `1` | Number of output sequences to return for the given prompt. | -| `max_tokens` | `int` | None | Maximum number of tokens to generate per output sequence. | -| `seed` | `int` | None | Random seed to use for the generation. | -| `stop` | `string` or `list[str]` | `list` | String(s) that stop generation when produced. The returned output will not contain the stop strings. | -| `stream` | `bool` | `false` | Whether to stream the response. | -| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | -| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | -| `logit_bias` | `dict[str, float]` | None | Unsupported by vLLM. | -| `user` | `string` | None | Unsupported by vLLM. 
| - -### Additional vLLM parameters - -vLLM supports additional parameters beyond the standard OpenAI API: - +| `model` | `string` | Required | Your deployed model name. | +| `messages` | `list` | Required | Chat messages with `role` and `content`. | +| `prompt` | `string` | Required | Text completion prompt. | +| `temperature` | `float` | `0.7` | Sampling randomness. Lower = more deterministic. | +| `max_tokens` | `int` | `16` | Maximum tokens to generate. | +| `top_p` | `float` | `1.0` | Nucleus sampling threshold. | +| `n` | `int` | `1` | Number of completions to generate. | +| `stop` | `string` or `list` | None | Stop sequences. | +| `stream` | `bool` | `false` | Enable streaming. | +| `presence_penalty` | `float` | `0.0` | Penalize tokens already present. | +| `frequency_penalty` | `float` | `0.0` | Penalize frequent tokens. | + + + | Parameter | Type | Default | Description | | --- | --- | --- | --- | -| `best_of` | `int` | None | Number of output sequences generated from the prompt. From these `best_of` sequences, the top `n` sequences are returned. Must be ≥ `n`. Treated as beam width when `use_beam_search` is `true`. | -| `top_k` | `int` | `-1` | Controls the number of top tokens to consider. Set to -1 to consider all tokens. | -| `ignore_eos` | `bool` | `false` | Whether to ignore the EOS token and continue generating tokens after EOS is generated. | -| `use_beam_search` | `bool` | `false` | Whether to use beam search instead of sampling. | -| `stop_token_ids` | `list[int]` | `list` | List of token IDs that stop generation when produced. The returned output will contain the stop tokens unless they are special tokens. | -| `skip_special_tokens` | `bool` | `true` | Whether to skip special tokens in the output. | -| `spaces_between_special_tokens` | `bool` | `true` | Whether to add spaces between special tokens in the output. | -| `add_generation_prompt` | `bool` | `true` | Whether to add generation prompt. Read more [here](https://huggingface.co/docs/transformers/main/en/chat_templating#what-are-generation-prompts). | -| `echo` | `bool` | `false` | Echo back the prompt in addition to the completion. | -| `repetition_penalty` | `float` | `1.0` | Penalizes new tokens based on whether they appear in the prompt and generated text so far. Values > 1 encourage new tokens, values < 1 encourage repetition. | -| `min_p` | `float` | `0.0` | Minimum probability for a token to be considered. | -| `length_penalty` | `float` | `1.0` | Penalizes sequences based on their length. Used in beam search. | -| `include_stop_str_in_output` | `bool` | `false` | Whether to include the stop strings in output text. | - -## Text completion parameters - -Here are all available parameters for the `/completions` endpoint: - -| Parameter | Type | Default | Description | -| --- | --- | --- | --- | -| `prompt` | `string` or `list[str]` | Required | The prompt(s) to generate completions for. | -| `model` | `string` | Required | The model repo that you've deployed on your Runpod Serverless endpoint. | -| `temperature` | `float` | `0.7` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | -| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | -| `n` | `int` | `1` | Number of output sequences to return for the given prompt. | -| `max_tokens` | `int` | `16` | Maximum number of tokens to generate per output sequence. 
| -| `seed` | `int` | None | Random seed to use for the generation. | -| `stop` | `string` or `list[str]` | `list` | String(s) that stop generation when produced. The returned output will not contain the stop strings. | -| `stream` | `bool` | `false` | Whether to stream the response. | -| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on whether they appear in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | -| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | -| `logit_bias` | `dict[str, float]` | None | Unsupported by vLLM. | -| `user` | `string` | None | Unsupported by vLLM. | - -Text completions support the same additional vLLM parameters as chat completions (see the Additional vLLM parameters section above). +| `best_of` | `int` | None | Generate this many, return top `n`. | +| `top_k` | `int` | `-1` | Top-k sampling. -1 = all tokens. | +| `repetition_penalty` | `float` | `1.0` | Penalize repeated tokens. | +| `min_p` | `float` | `0.0` | Minimum probability threshold. | +| `use_beam_search` | `bool` | `false` | Use beam search instead of sampling. | +| `length_penalty` | `float` | `1.0` | Length penalty for beam search. | +| `ignore_eos` | `bool` | `false` | Continue after EOS token. | +| `skip_special_tokens` | `bool` | `true` | Omit special tokens from output. | +| `echo` | `bool` | `false` | Include prompt in output. | + ## Environment variables Use these environment variables to customize the OpenAI compatibility: -| Variable | Default | Description | -| ----------------------------------- | ----------- | ------------------------------------------- | -| `RAW_OPENAI_OUTPUT` | `1` (true) | Enables raw OpenAI SSE format for streaming. | -| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override the model name in responses. | -| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for responses in chat completions. | - -For a complete list of all vLLM environment variables, see the [vLLM environment variables reference](/serverless/vllm/environment-variables). - -## Client libraries - -The OpenAI-compatible API works with standard [OpenAI client libraries](https://platform.openai.com/docs/libraries): - -### Python - -```python -from openai import OpenAI -MODEL_NAME = "YOUR_MODEL_NAME" # Replace with your actual model name - -# Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values -client = OpenAI( - api_key="RUNPOD_API_KEY", - base_url="https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1" -) - -response = client.chat.completions.create( - model=MODEL_NAME, - messages=[ - {"role": "system", "content": "You are a helpful assistant."}, - {"role": "user", "content": "Hello!"} - ] -) -``` - -### JavaScript - -```javascript -import { OpenAI } from "openai"; - -// Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values -const openai = new OpenAI({ - apiKey: "RUNPOD_API_KEY", - baseURL: "https://api.runpod.ai/v2/ENDPOINT_ID/openai/v1" -}); - -// Replace MODEL_NAME with your actual model name -const response = await openai.chat.completions.create({ - model: "MODEL_NAME", - messages: [ - { role: "system", content: "You are a helpful assistant." }, - { role: "user", content: "Hello!" } - ] -}); -``` +| Variable | Default | Description | +| --- | --- | --- | +| `RAW_OPENAI_OUTPUT` | `1` | Enable raw OpenAI SSE format for streaming. 
| +| `OPENAI_SERVED_MODEL_NAME_OVERRIDE` | None | Override model name in responses. | +| `OPENAI_RESPONSE_ROLE` | `assistant` | Role for chat completion responses. | -## Implementation differences +See [environment variables reference](/serverless/vllm/environment-variables) for all options. -While the vLLM worker aims for high compatibility, there are some differences from OpenAI's implementation: +## Differences from OpenAI -**Token counting** may differ slightly from OpenAI models due to different tokenizers. - -**Streaming format** follows OpenAI's Server-Sent Events (SSE) format, but the exact chunking of streaming responses may vary. - -**Error responses** follow a similar but not identical format to OpenAI's error responses. - -**Rate limits** follow Runpod's endpoint policies rather than OpenAI's rate limiting structure. - -### Current limitations - -The vLLM worker has a few limitations: - -* Function and tool calling APIs are not currently supported. -* Some OpenAI-specific features like moderation endpoints are not available. -* Vision models and multimodal capabilities depend on the underlying model support in vLLM. +- **Token counting** may differ due to different tokenizers. +- **Rate limits** follow Runpod's policies, not OpenAI's. +- **Function/tool calling** depends on model and vLLM support. +- **Vision/multimodal** depends on underlying model support. ## Troubleshooting -Common issues and their solutions: - -| Issue | Solution | -| ------------------------- | --------------------------------------------------------------------- | -| "Invalid model" error | Verify your model name matches what you deployed. | -| Authentication error | Check that you're using your Runpod API key, not an OpenAI key. | -| Timeout errors | Increase client timeout settings for large models. | -| Incompatible responses | Set `RAW_OPENAI_OUTPUT=1` in your environment variables. | -| Different response format | Some models may have different output formatting; use a chat template. | - -## Next steps - -* [Learn how to send vLLM requests using Runpod's native API](/serverless/vllm/vllm-requests). -* [Explore environment variables for customization](/serverless/vllm/environment-variables). -* [Review all Serverless endpoint operations](/serverless/endpoints/send-requests). -* [Explore the OpenAI API documentation](https://platform.openai.com/docs/api-reference). +| Issue | Solution | +| --- | --- | +| "Invalid model" error | Verify model name matches your deployment. | +| Authentication error | Use your Runpod API key, not an OpenAI key. | +| Timeout errors | Increase client timeout for large models. | +| Unexpected response format | Set `RAW_OPENAI_OUTPUT=1`. | diff --git a/serverless/vllm/overview.mdx b/serverless/vllm/overview.mdx index 74649a78..78babe63 100644 --- a/serverless/vllm/overview.mdx +++ b/serverless/vllm/overview.mdx @@ -2,119 +2,43 @@ title: "Overview" sidebarTitle: "Overview" description: "Deploy scalable LLM inference endpoints using vLLM workers." +mode: "wide" --- -You can use vLLM workers to deploy and serve large language models on Runpod Serverless, delivering fast and efficient inference with automatic scaling. - -The vLLM worker image can be deployed directly from the [Runpod Hub](https://console.runpod.io/hub/runpod-workers/worker-vllm) or customized and built from the [GitHub repository](https://github.com/runpod-workers/worker-vllm). +
+ +vLLM workers deploy and serve large language models on Runpod Serverless with fast inference and automatic scaling. Deploy directly from the [Runpod Hub](https://console.runpod.io/hub/runpod-workers/worker-vllm) or customize using the [runpod-workers/worker-vllm](https://github.com/runpod-workers/worker-vllm) repository as a base. + + + + Deploy your first vLLM worker in minutes. + + + Configure your vLLM endpoint with environment variables. + + + Send requests using Runpod's native API. + + + Integrate vLLM with OpenAI-compatible tools. + + ## What is vLLM? -vLLM is an open-source inference engine designed to serve large language models efficiently. It maximizes throughput and minimizes latency when running LLM inference workloads. - -The vLLM worker image includes the vLLM engine with GPU optimizations and support for both OpenAI's API and Runpod's native API. You can deploy any supported model from Hugging Face with minimal configuration and start serving requests immediately. The workers run on Runpod Serverless, which automatically scales based on demand. - -* **Pre-built optimization**: vLLM workers come with the vLLM inference engine pre-configured, which includes [PagedAttention](https://docs.vllm.ai/en/latest/design/paged_attention.html) technology for optimized memory usage and faster inference. -* **OpenAI API compatibility**: They provide a drop-in replacement for OpenAI's API, allowing you to use existing OpenAI client code by simply changing the endpoint URL and API key. -* **Hugging Face integration**: vLLM workers support most models available on Hugging Face, including popular options like Llama 2, Mistral, Gemma, and many others. -* **Configurable environments**: Extensive customization options through [environment variables](/serverless/vllm/environment-variables) allow you to adjust model parameters, performance settings, and other behaviors. -* **Auto-scaling architecture**: Serverless automatically scales your endpoint from zero to many workers based on demand, billing on a per-second basis. - -vLLM uses several advanced techniques to achieve high performance when serving LLMs. Understanding these can help you optimize your deployments and troubleshoot issues. - -### PagedAttention for memory efficiency - -PagedAttention is the key innovation in vLLM. It dramatically improves how GPU memory is used during inference. Traditional LLM serving wastes memory by pre-allocating large contiguous blocks for key-value (KV) caches. PagedAttention breaks the KV cache into smaller pages, similar to how operating systems manage memory. - -This reduces memory waste and allows vLLM to serve more requests concurrently on the same GPU. You can handle higher throughput or serve larger models on smaller GPUs. - -### Continuous batching - -vLLM uses continuous batching (also called dynamic batching) to process multiple requests simultaneously. Unlike traditional batching, which waits for a batch to fill up before processing, continuous batching processes requests as they arrive and adds new requests to the batch as soon as previous ones complete. - -This keeps your GPU busy and reduces latency for individual requests, especially during periods of variable traffic. - -### Request lifecycle - -When you send a request to a vLLM worker endpoint: - -1. The request arrives at Runpod Serverless infrastructure. -2. If no worker is available, the request is queued and a worker starts automatically. -3. The worker loads your model from Hugging Face (or from the pre-baked Docker image). -4. 
vLLM processes the request using PagedAttention and continuous batching. -5. The response is returned to your application. -6. If there are no more requests, the worker scales down to zero after a configured timeout. - -vLLM endpoints use the same `/run` and `/runsync` operations as other Runpod Serverless endpoints. The only difference is the input format and the specialized LLM processing inside the worker. - -## Why use vLLM? - -vLLM workers offer several advantages over other LLM deployment options. - -### Performance and efficiency - -vLLM's PagedAttention and continuous batching deliver significantly better throughput than traditional serving methods. You can serve 2-3x more requests per GPU compared to naive implementations, which directly translates to lower costs and better user experiences. - -### OpenAI API compatibility - -vLLM workers provide a drop-in replacement for OpenAI's API. If you're already using the OpenAI Python client or any other OpenAI-compatible library, you can switch to your Runpod endpoint by changing just two lines of code: the API key and the base URL. Your existing prompts, parameters, and response handling code continue to work without modification. - -### Model flexibility - -You can deploy virtually any model available on Hugging Face, including popular options like Llama, Mistral, Qwen, Gemma, and thousands of others. vLLM supports a wide range of model architectures out of the box, and new architectures are added regularly. - -### Auto-scaling and cost efficiency +vLLM is an open-source inference engine optimized for serving large language models. It maximizes throughput and minimizes latency through techniques like PagedAttention and continuous batching. -Runpod Serverless automatically scales your vLLM workers from zero to many based on demand. You only pay for the seconds when workers are actively processing requests. This makes vLLM workers ideal for workloads with variable traffic patterns or when you're getting started and don't want to pay for idle capacity. - -### Production-ready features - -vLLM workers come with features that make them suitable for production deployments, including streaming responses, configurable context lengths, quantization support (AWQ, GPTQ), multi-GPU tensor parallelism, and comprehensive error handling. +- **[PagedAttention](https://docs.vllm.ai/en/latest/design/paged_attention.html)**: Breaks KV cache into pages for efficient memory use, enabling higher concurrency and larger models on smaller GPUs. +- **Continuous batching**: Processes requests as they arrive rather than waiting for batches, keeping GPUs busy and reducing latency. +- **OpenAI compatibility**: Drop-in replacement for OpenAI's API. Switch by changing the endpoint URL and API key. +- **Hugging Face integration**: Supports most models including Llama, Mistral, Qwen, Gemma, DeepSeek, and [many more](https://docs.vllm.ai/en/latest/models/supported_models.html). +- **Auto-scaling**: Scales from zero to many workers based on demand, with per-second billing. ## Deployment options -There are two ways to deploy vLLM workers on Runpod. - -### Using cached models - -If your model is available on Hugging Face, we strongly recommend enabling [cached models](/serverless/endpoints/model-caching) instead of baking/downloading the model into your Docker image. Cached models provide faster startup times, lower costs, and uses less storage. 
- -### Building custom Docker images with models baked in - -For production deployments where cold start time matters, you can build a [custom Docker image](/serverless/workers/create-dockerfile#including-models-and-files) that includes your model weights. This eliminates download time and can reduce cold starts from minutes to seconds. - -This approach requires more upfront work but provides the best performance for production workloads with consistent traffic. - -## Compatible models - -vLLM supports most model architectures available on Hugging Face. You can deploy models from families including Llama (1, 2, 3, 3.1, 3.2), Mistral and Mixtral, Qwen2 and Qwen2.5, Gemma and Gemma 2, Phi (2, 3, 3.5, 4), DeepSeek (V2, V3, R1), GPT-2, GPT-J, OPT, BLOOM, Falcon, MPT, StableLM, Yi, and many others. - -For a complete and up-to-date list of supported model architectures, see the [vLLM supported models documentation](https://docs.vllm.ai/en/latest/models/supported_models.html). +- **[Cached models](/serverless/endpoints/model-caching)** (recommended): Fastest setup with lower storage costs. Best for most deployments. +- **[Baked-in models](/serverless/workers/create-dockerfile#including-models-and-files)**: Eliminates download time and reduces cold starts to seconds. Requires building a custom Docker image. ## Configuration -vLLM supports hundreds of models, but default settings only work out of the box for a subset of them. Depending on the model you're deploying, you may need to [configure your endpoint](/serverless/vllm/configuration) with additional [environment variables](/serverless/vllm/environment-variables) (which map directly to `vllm serve` command line flags) to get it working properly. You will likely need to consult the README for your model on Hugging Face and the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/) for more details. - -## Use cases - -vLLM workers are ideal for several types of applications. - -**Production LLM APIs** benefit from vLLM's high throughput and OpenAI compatibility. You can build scalable APIs for chatbots, content generation, code completion, or any other LLM-powered feature. - -**Cost-effective scaling** is enabled by Serverless auto-scaling. If your traffic varies significantly throughout the day or week, vLLM workers automatically scale down to zero during quiet periods, saving costs compared to always-on servers. - -**OpenAI migration** is straightforward because vLLM provides API compatibility. You can migrate existing OpenAI-based applications to open-source models by changing only your endpoint URL and API key. - -**Custom model hosting** lets you deploy fine-tuned or specialized models. If you've trained a custom model or fine-tuned an existing one, vLLM workers make it easy to serve it at scale. - -**Development and experimentation** is cheaper with pay-per-second billing. You can test multiple models and configurations without worrying about idle costs. - -## Next steps - -Ready to deploy your first vLLM worker? Start with the [get started guide](/serverless/vllm/get-started) to deploy a model in minutes. - -Once your endpoint is running, learn how to send requests using [Runpod's native API](/serverless/vllm/vllm-requests) or the [OpenAI-compatible API](/serverless/vllm/openai-compatibility). - -For advanced configuration options, see the [environment variables documentation](/serverless/vllm/environment-variables). 
- -For a hands-on example, see [Create a chatbot with Gemma 3](/tutorials/serverless/run-gemma-7b) to deploy a vLLM endpoint and build an interactive chatbot using the OpenAI-compatible API. +Default settings work for many models, but some require additional [environment variables](/serverless/vllm/environment-variables) (which map to `vllm serve` flags). Consult your model's Hugging Face README and the [vLLM documentation](https://docs.vllm.ai/en/latest/usage/) for model-specific requirements. diff --git a/serverless/vllm/vllm-requests.mdx b/serverless/vllm/vllm-requests.mdx index 4698eba7..f33eb057 100644 --- a/serverless/vllm/vllm-requests.mdx +++ b/serverless/vllm/vllm-requests.mdx @@ -3,83 +3,14 @@ title: "Send requests to vLLM workers" sidebarTitle: "Send vLLM requests" description: "Use Runpod's native API to send requests to vLLM workers." --- -import { InferenceTooltip, QueueBasedEndpointTooltip, HandlerFunctionTooltip } from "/snippets/tooltips.jsx"; -vLLM workers run on Serverless endpoints. They use the same `/run` and `/runsync` operations as other Runpod endpoints, following the standard [Serverless request structure](/serverless/endpoints/send-requests). - -The key difference is the input format. vLLM workers expect specific parameters for language model , such as prompts, messages, and sampling parameters. The worker's processes these inputs using the vLLM engine and returns generated text. - -## Request operations - -vLLM endpoints support both synchronous and asynchronous requests. - -### Asynchronous requests with `/run` - -Use `/run` to submit a job that processes in the background. You'll receive a job ID immediately, then poll for results using the `/status` endpoint. - -```python - -import requests - -# Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values -url = "https://api.runpod.ai/v2/ENDPOINT_ID/run" -headers = { - "Authorization": "Bearer RUNPOD_API_KEY", - "Content-Type": "application/json" -} - -data = { - "input": { - "prompt": "Explain quantum computing in simple terms.", - "sampling_params": { - "temperature": 0.7, - "max_tokens": 200 - } - } -} - -response = requests.post(url, headers=headers, json=data) -job_id = response.json()["id"] -print(f"Job ID: {job_id}") -``` - -### Synchronous requests with `/runsync` - -Use `/runsync` to wait for the complete response in a single request. The client blocks until processing is complete. - -```python -import requests - -# Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values -url = "https://api.runpod.ai/v2/ENDPOINT_ID/runsync" -headers = { - "Authorization": "Bearer RUNPOD_API_KEY", - "Content-Type": "application/json" -} - -data = { - "input": { - "prompt": "Explain quantum computing in simple terms.", - "sampling_params": { - "temperature": 0.7, - "max_tokens": 200 - } - } -} - -response = requests.post(url, headers=headers, json=data) -print(response.json()) -``` - -For more details on request operations, see [Send API requests to Serverless endpoints](/serverless/endpoints/send-requests). +vLLM workers use the same `/run` and `/runsync` operations as other Runpod Serverless endpoints. The difference is the input format: vLLM expects prompts, messages, and sampling parameters for text generation. ## Input formats -vLLM workers accept two input formats for text generation. - -### Messages format (for chat models) +### Messages (chat models) -Use the messages format for instruction-tuned models that expect conversation history. The worker automatically applies the model's chat template. 
+Use for instruction-tuned models. The worker automatically applies the model's chat template. ```json { @@ -96,9 +27,9 @@ Use the messages format for instruction-tuned models that expect conversation hi } ``` -### Prompt format (for text completion) +### Prompt (text completion) -Use the prompt format for base models or when you want to provide raw text without a chat template. +Use for base models or when providing raw text without a chat template. ```json { @@ -112,163 +43,170 @@ Use the prompt format for base models or when you want to provide raw text witho } ``` -### Applying chat templates to prompts +To apply the model's chat template to a prompt, add `"apply_chat_template": true`. -If you use the prompt format but want the model's chat template applied, set `apply_chat_template` to `true`. +## Send requests -```json -{ - "input": { - "prompt": "What is the capital of France?", - "apply_chat_template": true, - "sampling_params": { - "temperature": 0.7, - "max_tokens": 100 - } - } -} -``` + + + Submit a job that processes in the background. Poll `/status/{job_id}` for results. -## Request input parameters + ```python + import requests -Here are all available parameters you can include in the `input` object of your request. + response = requests.post( + "https://api.runpod.ai/v2/ENDPOINT_ID/run", + headers={ + "Authorization": "Bearer RUNPOD_API_KEY", + "Content-Type": "application/json" + }, + json={ + "input": { + "messages": [{"role": "user", "content": "Explain quantum computing."}], + "sampling_params": {"temperature": 0.7, "max_tokens": 200} + } + } + ) + + job_id = response.json()["id"] + print(f"Job ID: {job_id}") + + # Poll for results + status = requests.get( + f"https://api.runpod.ai/v2/ENDPOINT_ID/status/{job_id}", + headers={"Authorization": "Bearer RUNPOD_API_KEY"} + ) + print(status.json()) + ``` + + + Wait for the complete response in a single request. + + ```python + import requests + + response = requests.post( + "https://api.runpod.ai/v2/ENDPOINT_ID/runsync", + headers={ + "Authorization": "Bearer RUNPOD_API_KEY", + "Content-Type": "application/json" + }, + json={ + "input": { + "messages": [{"role": "user", "content": "Explain quantum computing."}], + "sampling_params": {"temperature": 0.7, "max_tokens": 200} + } + } + ) -| Parameter | Type | Default | Description | -| --- | --- | --- | --- | -| `prompt` | `string` | None | Prompt string to generate text based on. | -| `messages` | `list[dict[str, str]]` | None | List of messages with `role` and `content` keys. The model's chat template will be applied automatically. Overrides `prompt`. | -| `apply_chat_template` | `bool` | `false` | Whether to apply the model's chat template to the `prompt`. | -| `sampling_params` | `dict` | `{}` | Sampling parameters to control generation (see Sampling parameters section below). | -| `stream` | `bool` | `false` | Whether to enable streaming of output. If `true`, responses are streamed as they are generated. | -| `max_batch_size` | `int` | env `DEFAULT_BATCH_SIZE` | The maximum number of tokens to stream per HTTP POST call. | -| `min_batch_size` | `int` | env `DEFAULT_MIN_BATCH_SIZE` | The minimum number of tokens to stream per HTTP POST call. | -| `batch_size_growth_factor` | `int` | env `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | The growth factor by which `min_batch_size` multiplies for each call until `max_batch_size` is reached. 
| + print(response.json()) + ``` + + -## Sampling parameters +For more on request operations, see [Send requests to Serverless endpoints](/serverless/endpoints/send-requests). -Sampling parameters control how the model generates text. Include them in the `sampling_params` dictionary in your request. +## Streaming -| Parameter | Type | Default | Description | -| --- | --- | --- | --- | -| `n` | `int` | `1` | Number of output sequences generated from the prompt. The top `n` sequences are returned. | -| `best_of` | `int` | `n` | Number of output sequences generated from the prompt. The top `n` sequences are returned from these `best_of` sequences. Must be ≥ `n`. Treated as beam width in beam search. | -| `presence_penalty` | `float` | `0.0` | Penalizes new tokens based on their presence in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | -| `frequency_penalty` | `float` | `0.0` | Penalizes new tokens based on their frequency in the generated text so far. Values > 0 encourage new tokens, values < 0 encourage repetition. | -| `repetition_penalty` | `float` | `1.0` | Penalizes new tokens based on their appearance in the prompt and generated text. Values > 1 encourage new tokens, values < 1 encourage repetition. | -| `temperature` | `float` | `1.0` | Controls the randomness of sampling. Lower values make it more deterministic, higher values make it more random. Zero means greedy sampling. | -| `top_p` | `float` | `1.0` | Controls the cumulative probability of top tokens to consider. Must be in (0, 1]. Set to 1 to consider all tokens. | -| `top_k` | `int` | `-1` | Controls the number of top tokens to consider. Set to -1 to consider all tokens. | -| `min_p` | `float` | `0.0` | Represents the minimum probability for a token to be considered, relative to the most likely token. Must be in [0, 1]. Set to 0 to disable. | -| `use_beam_search` | `bool` | `false` | Whether to use beam search instead of sampling. | -| `length_penalty` | `float` | `1.0` | Penalizes sequences based on their length. Used in beam search. | -| `early_stopping` | `bool` or `string` | `false` | Controls stopping condition in beam search. Can be `true`, `false`, or `"never"`. | -| `stop` | `string` or `list[str]` | `None` | String(s) that stop generation when produced. The output will not contain these strings. | -| `stop_token_ids` | `list[int]` | `None` | List of token IDs that stop generation when produced. Output contains these tokens unless they are special tokens. | -| `ignore_eos` | `bool` | `false` | Whether to ignore the End-Of-Sequence token and continue generating tokens after its generation. | -| `max_tokens` | `int` | `16` | Maximum number of tokens to generate per output sequence. | -| `min_tokens` | `int` | `0` | Minimum number of tokens to generate per output sequence before EOS or stop sequences. | -| `skip_special_tokens` | `bool` | `true` | Whether to skip special tokens in the output. | -| `spaces_between_special_tokens` | `bool` | `true` | Whether to add spaces between special tokens in the output. | -| `truncate_prompt_tokens` | `int` | `None` | If set, truncate the prompt to this many tokens. | - -## Streaming responses - -Enable streaming to receive tokens as they're generated instead of waiting for the complete response. +Receive tokens as they're generated instead of waiting for the complete response. 
```python
import requests
import json

-# Replace ENDPOINT_ID and RUNPOD_API_KEY with your actual values
-url = "https://api.runpod.ai/v2/ENDPOINT_ID/run"
-headers = {
-    "Authorization": "Bearer RUNPOD_API_KEY",
-    "Content-Type": "application/json"
-}
-
-data = {
-    "input": {
-        "prompt": "Write a short story about a robot.",
-        "sampling_params": {
-            "temperature": 0.8,
-            "max_tokens": 500
-        },
-        "stream": True
+# Submit with streaming enabled
+response = requests.post(
+    "https://api.runpod.ai/v2/ENDPOINT_ID/run",
+    headers={
+        "Authorization": "Bearer RUNPOD_API_KEY",
+        "Content-Type": "application/json"
+    },
+    json={
+        "input": {
+            "prompt": "Write a short story about a robot.",
+            "sampling_params": {"temperature": 0.8, "max_tokens": 500},
+            "stream": True
+        }
     }
-}
+)

-response = requests.post(url, headers=headers, json=data)
job_id = response.json()["id"]

-# Stream the results
+# Stream results
stream_url = f"https://api.runpod.ai/v2/ENDPOINT_ID/stream/{job_id}"
-with requests.get(stream_url, headers=headers, stream=True) as r:
+with requests.get(stream_url, headers={"Authorization": "Bearer RUNPOD_API_KEY"}, stream=True) as r:
    for line in r.iter_lines():
        if line:
            print(json.loads(line))
```

-For more information on streaming, see the [stream operation documentation](/serverless/endpoints/send-requests#stream).
+See [streaming documentation](/serverless/endpoints/send-requests#stream) for more details.
+
+## Sampling parameters
+
+Add parameters to control how the model generates text. Most of these belong in the `sampling_params` object of your request. The streaming and batching options in the last table are set at the top level of `input`, alongside `prompt` or `messages`, as shown in the streaming example above.
+
+
+| Parameter | Type | Default | Description |
+| --- | --- | --- | --- |
+| `max_tokens` | `int` | `16` | Maximum tokens to generate. |
+| `temperature` | `float` | `1.0` | Randomness of sampling. Lower = more deterministic. |
+| `top_p` | `float` | `1.0` | Cumulative probability of top tokens to consider. |
+| `top_k` | `int` | `-1` | Number of top tokens to consider. -1 = all. |
+| `stop` | `string` or `list` | `None` | Stop generation when these strings are produced. |
+| `presence_penalty` | `float` | `0.0` | Penalize tokens based on presence in output. |
+| `frequency_penalty` | `float` | `0.0` | Penalize tokens based on frequency in output. |
+
+
+
+| Parameter | Type | Default | Description |
+| --- | --- | --- | --- |
+| `n` | `int` | `1` | Number of output sequences to generate. |
+| `best_of` | `int` | `n` | Generate this many sequences, return top `n`. |
+| `repetition_penalty` | `float` | `1.0` | Penalize repeated tokens. Values > 1 discourage repetition. |
+| `min_p` | `float` | `0.0` | Minimum probability threshold relative to top token. |
+| `min_tokens` | `int` | `0` | Minimum tokens before allowing EOS. |
+| `use_beam_search` | `bool` | `false` | Use beam search instead of sampling. |
+| `length_penalty` | `float` | `1.0` | Length penalty for beam search. |
+| `early_stopping` | `bool` | `false` | Stop beam search early. |
+| `stop_token_ids` | `list[int]` | `None` | Token IDs that stop generation. |
+| `ignore_eos` | `bool` | `false` | Continue generating after EOS token. |
+| `skip_special_tokens` | `bool` | `true` | Omit special tokens from output. |
+| `spaces_between_special_tokens` | `bool` | `true` | Add spaces between special tokens. |
+| `truncate_prompt_tokens` | `int` | `None` | Truncate prompt to this many tokens. |
+
+
+
+| Parameter | Type | Default | Description |
+| --- | --- | --- | --- |
+| `stream` | `bool` | `false` | Enable streaming output. |
+| `max_batch_size` | `int` | env `DEFAULT_BATCH_SIZE` | Max tokens per streaming chunk. |
+| `min_batch_size` | `int` | env `DEFAULT_MIN_BATCH_SIZE` | Min tokens per streaming chunk. |
+| `batch_size_growth_factor` | `int` | env `DEFAULT_BATCH_SIZE_GROWTH_FACTOR` | Growth factor for batch size. |
+
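+For example, a `runsync` request that combines several of these parameters might look like the following sketch (prompt and values are illustrative only):
+
+```bash
+curl -X POST "https://api.runpod.ai/v2/ENDPOINT_ID/runsync" \
+  -H "Authorization: Bearer RUNPOD_API_KEY" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "input": {
+      "prompt": "Summarize the history of the telescope.",
+      "sampling_params": {
+        "temperature": 0.7,
+        "top_p": 0.9,
+        "max_tokens": 300,
+        "stop": ["\n\n"]
+      }
+    }
+  }'
+```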

## Error handling

-Implement proper error handling to manage network timeouts, rate limiting, worker initialization delays, and model loading errors.
+Implement retry logic with exponential backoff to handle network issues, rate limits, and cold starts.

```python
import requests
import time

-def send_vllm_request(url, headers, payload, max_retries=3):
+def send_request(url, headers, payload, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.post(url, headers=headers, json=payload, timeout=300)
            response.raise_for_status()
            return response.json()
-        except requests.exceptions.Timeout:
-            print(f"Request timed out. Attempt {attempt + 1}/{max_retries}")
-            if attempt < max_retries - 1:
-                time.sleep(2 ** attempt) # Exponential backoff
        except requests.exceptions.HTTPError as e:
-            if e.response.status_code == 429:
-                print("Rate limit exceeded. Waiting before retry...")
+            if e.response.status_code == 429: # Rate limit
                time.sleep(5)
            elif e.response.status_code >= 500:
-                print(f"Server error: {e.response.status_code}")
-                if attempt < max_retries - 1:
-                    time.sleep(2 ** attempt)
+                time.sleep(2 ** attempt)
            else:
                raise
-        except requests.exceptions.RequestException as e:
-            print(f"Request failed: {e}")
-            if attempt < max_retries - 1:
-                time.sleep(2 ** attempt)
-
+        except requests.exceptions.RequestException:
+            time.sleep(2 ** attempt)
    raise Exception("Max retries exceeded")
-
-# Usage
-result = send_vllm_request(url, headers, data)
```
-
-## Best practices
-
-Follow these best practices when sending requests to vLLM workers.
-
-**Set appropriate timeouts** based on your model size and expected generation length. Larger models and longer generations require longer timeouts.
-
-**Implement retry logic** with exponential backoff for failed requests. This handles temporary network issues and worker initialization delays.
-
-**Use streaming for long responses** to provide a better user experience. Users see output immediately instead of waiting for the entire response.
-
-**Optimize sampling parameters** for your use case. Lower temperature for factual tasks, higher temperature for creative tasks.
-
-**Monitor response times** to identify performance issues. If requests consistently take longer than expected, consider using a more powerful GPU or optimizing your parameters.
-
-**Handle rate limits** gracefully by implementing queuing or request throttling in your application.
-
-**Cache common requests** when appropriate to reduce redundant API calls and improve response times.
-
-## Next steps
-
-* [Learn about OpenAI API compatibility](/serverless/vllm/openai-compatibility).
-* [Explore environment variables for customization](/serverless/vllm/environment-variables).
-* [Review all Serverless endpoint operations](/serverless/endpoints/send-requests).
diff --git a/serverless/workers/concurrent-handler.mdx b/serverless/workers/concurrent-handler.mdx
index 5141f672..a1d0ac6b 100644
--- a/serverless/workers/concurrent-handler.mdx
+++ b/serverless/workers/concurrent-handler.mdx
@@ -3,15 +3,6 @@
title: "Build a concurrent handler"
description: "Build a concurrent handler function to process multiple requests simultaneously on a single worker." 
--- -## What you'll learn - -In this guide you will learn how to: - -* Create an asynchronous handler function. -* Create a concurrency modifier to dynamically adjust concurrency levels. -* Optimize worker resources based on request patterns. -* Test your concurrent handler locally. - ## Requirements * You've [created a Runpod account](/get-started/manage-accounts). diff --git a/serverless/workers/overview.mdx b/serverless/workers/overview.mdx index 48e4c46a..164f1dd7 100644 --- a/serverless/workers/overview.mdx +++ b/serverless/workers/overview.mdx @@ -1,112 +1,84 @@ --- title: "Overview" description: "Package your handler function for deployment." +mode: "wide" --- -import { WorkerContainerDiskTooltip, MachineTooltip, NetworkVolumeTooltip } from "/snippets/tooltips.jsx"; +import { MachineTooltip } from "/snippets/tooltips.jsx"; -Workers are the containerized environments that run your code on Runpod Serverless. After creating and testing your [handler function](/serverless/workers/handler-functions), you need to package it into a Docker image and deploy it to an endpoint. +
-If you're new to containers and Docker, start with the [introduction to containers](/tutorials/introduction/containers) tutorial series to learn the fundamentals. +Workers are containerized environments that run your code on Runpod Serverless. -This page provides an overview of the worker deployment process. +## Deployment workflow -## [Create a Dockerfile](/serverless/workers/create-dockerfile) +After creating your [handler function](/serverless/workers/handler-functions), package it into a Docker image and deploy it to an endpoint: -To deploy your worker to Runpod, you need to [create a Dockerfile](/serverless/workers/create-dockerfile) that packages your handler function and all its dependencies. - -## Package and deploy a worker - -Once you've created your Dockerfile, you can deploy the worker image to a Serverless endpoint using one of the following methods: - -### [Deploy from Docker Hub](/serverless/workers/deploy) - -Build your Docker image locally, push it to Docker Hub (or another container registry), and deploy it to Runpod. This gives you full control over the build process and allows you to test the image locally before deployment. - -### [Deploy from GitHub](/serverless/workers/github-integration) - -Connect your GitHub repository to Runpod and deploy directly from your code. Runpod automatically builds the Docker image from your repository and deploys it to an endpoint. This streamlines the deployment process and enables continuous deployment workflows. - -## Worker storage options - -Serverless offers two storage options for your workers to access and store data for requests: - -- : Temporary storage that exists only while a worker is running, and is completely lost when the worker is stopped or scaled down. -- : Persistent storage that can be attached to different workers and even shared between multiple workers. - -See [Storage options](/serverless/storage/overview) for more details. + + + Package your handler function and all its dependencies [into a Docker image](/serverless/workers/create-dockerfile). + + + Push your image and create an endpoint using one of two methods: + - [Deploy from Docker Hub](/serverless/workers/deploy): Build locally and push to a container registry. + - [Deploy from GitHub](/serverless/workers/github-integration): Auto-build and deploy directly from your repository. + + ## Model deployment -To deploy your workers with AI/ML models, follow this order of preference: - -1. [Use cached models](/serverless/endpoints/model-caching): If your model is available on Hugging Face (public or gated), this is the recommended approach. Cached models provide the fastest cold starts, eliminate download costs, and persist across worker restarts. - -2. [Bake the model into your Docker image](/serverless/workers/create-dockerfile#including-models-and-files): If your model is private and not available on Hugging Face, embed it directly in your worker's container image using `COPY` or `RUN wget`. This ensures the model is always available, but it increases image size and build time. - -3. [Use network volumes](/storage/network-volumes): You can use network volumes to store models and other files that need to persist between workers. Models loaded from network storage are slower than cached or baked models, so you should only use this option when the preceeding approaches don't fit your needs. +To deploy workers with AI/ML models, follow this order of preference: -## Worker configuration +1. 
[**Use cached models**](/serverless/endpoints/model-caching): For models on Hugging Face (public or gated), this is the recommended approach. Cached models provide the fastest cold starts and persist across worker restarts. -When deploying workers, you can configure: +2. [**Bake the model into your Docker image**](/serverless/workers/create-dockerfile#including-models-and-files): For private models not on Hugging Face, embed them directly in your container image. This ensures the model is always available but increases image size. -* **GPU/CPU types**: Select specific GPU models for your workload. -* **GPU count**: Set the number of GPUs for each worker. -* **Max workers**: Set the maximum number of workers for your endpoint. -* **Container disk size**: Allocate temporary storage for your worker. See [Storage options](/serverless/storage/overview). -* **Environment variables**: Pass configuration values to your worker. See [Environment variables](/serverless/development/environment-variables) for usage details. -* **Model caching**: Pre-load models to reduce cold start times. See [Cached models](/serverless/endpoints/model-caching). +3. [**Use network volumes**](/storage/network-volumes): For development workflows or very large models (500GB+), store models on a network volume. This is slower than cached or baked models but offers flexibility for iteration. -These settings are configured when you [create or edit an endpoint](/serverless/endpoints/overview#create-an-endpoint). +## Worker types -## Active vs. flex workers +Workers can run in two modes depending on your latency and cost requirements: -You can deploy workers in two modes: +- **Active workers** run continuously (24/7) and are always ready to process requests instantly. They eliminate cold starts entirely and receive a discounted rate, making them ideal for latency-sensitive or high-traffic applications. -* **Active workers**: "Always on" workers that eliminate cold start delays. They never scale down, so you are charged as long as they are active, but they receive a discount (up to 30%) compared to flex workers. (Default: `0`). -* **Flex workers**: "Sometimes on" workers that scale during traffic surges. They transition to idle after completing jobs. (Default: `max_workers - active_workers = 3`). +- **Flex workers** scale dynamically based on demand, spinning down to zero when idle. They incur cold starts when scaling up but cost nothing when not in use, making them ideal for variable or sporadic workloads. -The system will also sometimes add additiona **extra workers** during traffic spikes when Docker images are cached on host servers. (Default: 2). +The system may also spin up **extra workers** during traffic spikes when Docker images are cached on hosts (default: 2). ## Worker states -Workers move through different states as they handle requests and respond to changes in traffic patterns. Understanding these states helps you monitor and troubleshoot your workers effectively. +| State | Description | Billing | +|-------|-------------|---------| +| **Initializing** | Downloading image, loading code | Yes | +| **Idle** | Ready, waiting for requests | No | +| **Running** | Processing requests | Yes | +| **Throttled** | Ready but host constrained | No | +| **Outdated** | Marked for replacement after update | Yes (while processing) | +| **Unhealthy** | Crashed; auto-retries for up to 7 days | No | -* **Initializing**: The worker starts up while the system downloads and prepares the Docker image. 
The container starts and loads your code. -* **Idle**: The worker is ready but not processing requests. No charges apply while idle. -* **Running**: The worker actively processes requests. Billing occurs per second. -* **Throttled**: The worker is ready but temporarily unable to run due to host resource constraints. -* **Outdated**: The system marks the worker for replacement after endpoint updates. It continues processing current jobs during rolling updates (10% of max workers at a time). -* **Unhealthy**: The worker has crashed due to Docker image issues, incorrect start commands, or machine problems. The system automatically retries with exponential backoff for up to 7 days. +View worker states in the **Workers** tab of your endpoint in the [Runpod console](https://www.console.runpod.io/serverless). -You can view the state of your workers using the **Workers** tab of the Serverless endpoint details page in the [Runpod console](https://www.console.runpod.io/serverless). This page provides real-time information about each worker's current state, resource utilization, and job processing history, allowing you to monitor performance and troubleshoot issues effectively. +## Max worker limits -## Max worker limit +Account balance determines your maximum workers (flex + active combined): -By default, each Runpod account can allocate a maximum of 5 workers (flex + active combined) across all endpoints. If your account balance exceeds a certain threshold, you can increase this limit: +| Balance | Max workers | +|---------|-------------| +| Default | 5 | +| \$100+ | 10 | +| \$200+ | 20 | +| \$300+ | 30 | +| \$500+ | 40 | +| \$700+ | 50 | +| \$900+ | 60 | -- \$100 balance: 10 max workers -- \$200 balance: 20 max workers -- \$300 balance: 30 max workers -- \$500 balance: 40 max workers -- \$700 balance: 50 max workers -- \$900 balance: 60 max workers - -If your workload requires additional capacity beyond 60 workers, [contact our support team](https://www.runpod.io/contact). +Need more capacity? [Contact support](https://www.runpod.io/contact). ## Best practices -Follow these best practices when deploying workers: - -* **Optimize image size**: Smaller images download faster and reduce cold start times. See [Create a Dockerfile](/serverless/workers/create-dockerfile) for optimization techniques. -* **Use model caching**: Pre-load models to avoid downloading them on every cold start. See [Cached models](/serverless/endpoints/model-caching). -* **Test locally first**: Always test your handler locally before deploying. See [Local testing](/serverless/development/local-testing). -* **Handle errors gracefully**: Implement proper error handling to prevent worker crashes. See [Error handling](/serverless/development/error-handling). -* **Debug using logs and SSH**: Use logs and SSH to debug and optimize your workers. See [Monitor logs](/serverless/development/logs) and [SSH into workers](/serverless/development/ssh-into-workers). - -## Next steps - -* [Write a handler function](/serverless/workers/handler-functions) to define your worker's logic. -* [Create a Dockerfile](/serverless/workers/create-dockerfile) to package your handler function and all its dependencies. -* [Build and push](/serverless/workers/deploy) your worker image to Docker Hub (or another container registry). -* [Deploy from GitHub](/serverless/workers/github-integration) to automatically build your worker image from your repository. 
\ No newline at end of file
+| Practice | Benefit |
+|----------|---------|
+| [Optimize image size](/serverless/workers/create-dockerfile) | Faster downloads, reduced cold starts |
+| [Use model caching](/serverless/endpoints/model-caching) | Fastest cold starts |
+| [Test locally first](/serverless/development/local-testing) | Catch issues before deployment |
+| [Use logs and SSH](/serverless/development/logs) | Debug and optimize effectively |
diff --git a/storage/network-volumes.mdx b/storage/network-volumes.mdx
index 3c8f048f..6d1d1e37 100644
--- a/storage/network-volumes.mdx
+++ b/storage/network-volumes.mdx
@@ -3,302 +3,156 @@
title: "Network volumes"
description: "Persistent, portable storage for your AI workloads."
---

-import { PodsTooltip, ServerlessTooltip, WorkersTooltip, WorkerTooltip, HandlerFunctionTooltip, ColdStartTooltip, PodTooltip, InstantClusterTooltip } from "/snippets/tooltips.jsx";
+import { PodsTooltip, ServerlessTooltip, WorkersTooltip, ColdStartTooltip, InstantClusterTooltip } from "/snippets/tooltips.jsx";

-Network volumes offer persistent storage that exists independently of your compute resources. Your data is retained even when your are terminated or your are scaled to zero. You can use them to share data and maintain datasets across multiple machines and [Runpod products](/overview).
+Network volumes provide persistent storage that exists independently of your compute resources. Data is retained when Pods terminate or Serverless workers scale to zero. Use them to share data across multiple machines and Runpod products.

-Network volumes are backed by high-performance NVMe SSDs connected via high-speed networks. Transfer speeds typically range from 200-400 MB/s, with peak speeds up to 10 GB/s depending on location and network conditions.
-
-## When to use network volumes
-
-Consider using network volumes when you need:
-
-- **Persistent data that outlives compute resources**: Your data remains accessible even after Pods are terminated or Serverless workers stop.
-- **Shareable storage**: Share data across multiple Pods or Serverless endpoints by attaching the same network volume.
-- **Portable storage**: Move your working environment and data between different compute resources.
-- **Efficient data management**: Store frequently used models or large datasets to avoid re-downloading them for each new Pod or , saving time, bandwidth, and reducing times.
+Network volumes are backed by high-performance NVMe SSDs with transfer speeds of 200-400 MB/s (up to 10 GB/s peak).

## Pricing

-Network volumes are billed hourly at a rate of \$0.07 per GB per month for the first 1TB, and \$0.05 per GB per month for additional storage beyond that.
+- **First 1 TB**: \$0.07/GB/month
+- **Beyond 1 TB**: \$0.05/GB/month

-
-If your account lacks sufficient funds to cover storage costs, your network volume may be terminated. Once terminated, the disk space is immediately freed for other users, and Runpod cannot recover lost data. Ensure your account remains funded to prevent data loss.
-
+If your account lacks funds to cover storage costs, your network volume may be terminated, after which data cannot be recovered.

## Create a network volume

-
-
-Network volume size can be increased later, but cannot be decreased.
-
-
-
-
-To increase your network volume beyond 4 TB, [contact support](https://www.runpod.io/contact).
-
+Volume size can be increased later but cannot be decreased. For volumes beyond 4 TB, [contact support](https://www.runpod.io/contact).

-
-
-To create a new network volume:
-
-1. 
Navigate to the [Storage page](https://www.console.runpod.io/user/storage) in the Runpod console.
-2. Click **New Network Volume**.
-3. Select a datacenter for your volume. Datacenter location does not affect pricing, but determines which GPU types and endpoints your network volume can be used with.
-4. Provide a descriptive name for your volume (e.g., "project-alpha-data" or "shared-models").
-5. Specify the desired size for the volume in gigabytes (GB).
-6. Click **Create Network Volume**.
-
-You can edit and delete your network volumes using the [Storage page](https://www.console.runpod.io/user/storage).
-
-
-
-
-
-To create a network volume using the REST API, send a POST request to the `/networkvolumes` endpoint:
-
-```bash
-curl --request POST \
-  --url https://rest.runpod.io/v1/networkvolumes \
-  --header 'Authorization: Bearer RUNPOD_API_KEY' \
-  --header 'Content-Type: application/json' \
-  --data '{
-    "name": "my-network-volume",
-    "size": 100,
-    "dataCenterId": "US-KS-2"
-}'
-```
-
-For complete API documentation and parameter details, see the [network volumes API reference](/api-reference/network-volumes/POST/networkvolumes).
-
-
-
+
+  
+    1. Navigate to the [Storage page](https://www.console.runpod.io/user/storage).
+    2. Click **New Network Volume**.
+    3. Select a datacenter, enter a name, and specify size in GB.
+    4. Click **Create Network Volume**.
+  
+  
+    ```bash
+    curl --request POST \
+      --url https://rest.runpod.io/v1/networkvolumes \
+      --header 'Authorization: Bearer RUNPOD_API_KEY' \
+      --header 'Content-Type: application/json' \
+      --data '{
+        "name": "my-network-volume",
+        "size": 100,
+        "dataCenterId": "US-KS-2"
+      }'
+    ```
+
+    See [network volumes API reference](/api-reference/network-volumes/POST/networkvolumes) for details.
+  

## Network volumes for Serverless

-When attached to a Serverless endpoint, a network volume is mounted at `/runpod-volume` within the worker environment.
-
-### Benefits for Serverless
+Network volumes mount at `/runpod-volume` within Serverless workers. Benefits include reduced cold start times (no re-downloading models), lower costs, and centralized data management.

-Using network volumes with Serverless provides several advantages:
+**Attach to an endpoint:**

-- **Reduced cold starts**: Store large models or datasets on a network volume so workers can access them quickly without downloading on each cold start.
-- **Cost efficiency**: Network volume storage costs less than frequently re-downloading large files.
-- **Simplified data management**: Centralize your datasets and models for easier updates and management across multiple workers and endpoints.
-
-### Attach to an endpoint
-
-To enable workers on an endpoint to use network volumes:
-
-1. Navigate to the [Serverless section](https://www.console.runpod.io/serverless/user/endpoints) of the Runpod console.
-2. Select an existing endpoint and click **Manage**, then select **Edit Endpoint**.
-3. In the endpoint configuration menu, scroll down and expand the **Advanced** section.
-4. Click **Network Volumes** and select one or more network volumes you want to attach to the endpoint.
-5. Configure any other fields as needed, then select **Save Endpoint**.
-
-Data from the attached network volume(s) will be accessible to workers from the `/runpod-volume` directory. Use this path to read and write shared data in your . 

-
-When you attach multiple network volumes to an endpoint, you can only select one network volume per datacenter.
-
+1. 
Go to [Serverless](https://www.console.runpod.io/serverless/user/endpoints) and select your endpoint. +2. Click **Manage** → **Edit Endpoint**. +3. Expand **Advanced**, click **Network Volumes**, and select volumes to attach. +4. Click **Save Endpoint**. - -Writing to the same network volume from multiple endpoints or workers simultaneously may result in conflicts or data corruption. Ensure your application logic handles concurrent access appropriately for write operations. - +Writing to the same volume from multiple workers simultaneously may cause data corruption. Handle concurrent write access in your application logic. ### Attach multiple volumes -If you attach a single network volume to your Serverless endpoint, worker deployments will be constrained to the datacenter where the volume is located. This may impact GPU availability and failover options. - -To improve GPU availability and reduce downtime during datacenter maintenance, you can attach multiple network volumes to your endpoint. Workers will be distributed across the datacenters where the volumes are located, with each worker receiving exactly one network volume based on its assigned datacenter. +Attaching a single network volume constrains worker deployments to that volume's datacenter, which may limit GPU availability and reduce failover options. - +To improve availability and reduce downtime during datacenter maintenance, attach multiple network volumes from different datacenters. Workers are distributed across these datacenters, with each worker receiving exactly one volume based on its assigned location. -Data **does not sync** automatically between multiple network volumes even if they are attached to the same endpoint. You'll need to manually copy data (using the [S3-compatible API](/storage/s3-api) or [`runpodctl`](#using-runpodctl)) if you need the same data to be available to all workers on the endpoint (regardless of which volume they're attached to). + +You can only select one network volume per datacenter. + + +**Data does not sync automatically between volumes.** To make the same data available to all workers regardless of datacenter, manually copy data using the [S3-compatible API](/storage/s3-api) or [runpodctl](#using-runpodctl). ## Network volumes for Pods -When attached to a , a network volume replaces the Pod's default volume disk and is typically mounted at `/workspace`. +Network volumes replace the Pod's default volume disk, typically mounted at `/workspace`. -Network volumes are only available for Pods in the Secure Cloud. For more information, see [Pod types](/pods/overview#pod-types). +Network volumes are only available for Pods in the [Secure Cloud](/pods/overview#pod-types). -### Attach to a Pod - -Network volumes must be attached during Pod deployment. They cannot be attached to a previously-deployed Pod, nor can they be detached later without deleting the Pod. - -To deploy a Pod with a network volume attached: - -1. Navigate to the [Pods section](https://www.console.runpod.io/pods) of the Runpod console. -2. Select **Deploy**. -3. Select **Network Volume** and choose the network volume you want to attach from the dropdown list. -4. Select a GPU type. The system will automatically show which Pods are available to use with the selected network volume. -5. Select a **Pod Template**. -6. If you wish to change where the volume mounts, select **Edit Template** and adjust the **Volume Mount Path**. -7. Configure any other fields as needed, then select **Deploy On-Demand**. 
+
**Attach to a Pod:**

1. Navigate to [Pods](https://www.console.runpod.io/pods) and click **Deploy**.
2. Select **Network Volume** and choose your volume.
3. Select a GPU type (available options depend on volume location).
4. Configure template and other settings, then click **Deploy On-Demand**.

-Data from the network volume will be accessible to the Pod from the volume mount path (default: `/workspace`). Use this directory to upload, download, and manipulate data that you want to share with other Pods.

Network volumes must be attached during Pod deployment. They cannot be attached or detached later without deleting the Pod.

-### Share data between Pods
-
-You can attach a network volume to multiple Pods, allowing them to share data seamlessly. Multiple Pods can read files from the same volume concurrently, but you should avoid writing to the same file simultaneously to prevent conflicts or data corruption.

## Network volumes for Instant Clusters

-Network volumes for s work the same way as they do for Pods. They must be attached during cluster creation, and by default are mounted at `/workspace` within each node in the cluster.
-
-### Attach to an Instant Cluster
-
-To enable workers on an Instant Cluster to use a network volume:
+Network volumes for Instant Clusters work like those for Pods. Attach them during cluster creation; each node mounts the volume at `/workspace`.

-1. Navigate to the [Instant Clusters section](https://www.console.runpod.io/cluster) of the Runpod console.
-2. Click **Create Cluster**.
-3. Click **Network Volume** and select the network volume you want to attach to the cluster.
-4. Configure any other fields as needed, then click **Deploy Cluster**.
+1. Go to [Instant Clusters](https://www.console.runpod.io/cluster) and click **Create Cluster**.
+2. Click **Network Volume** and select the volume to attach.
+3. Configure other settings and click **Deploy Cluster**.

## S3-compatible API

-Runpod provides an [S3-compatible API](/storage/s3-api) that allows you to access and manage files on your network volumes directly, without needing to launch a Pod or run a Serverless worker for file management. This is particularly useful for:
+The [S3-compatible API](/storage/s3-api) lets you manage files on network volumes without launching compute resources. Upload datasets before launching Pods, automate workflows with standard S3 tools, or pre-populate volumes to improve cold start performance.
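+
+For example, assuming your AWS CLI is configured with an S3 API key from your Runpod account, listing a volume's contents looks like this sketch (the volume ID serving as the bucket name and the datacenter-specific endpoint URL are assumptions; check the S3-compatible API guide for the exact values):
+
+```bash
+# List files on a network volume. NETWORK_VOLUME_ID and the endpoint URL
+# are placeholders; take both from the Runpod console and the
+# S3-compatible API documentation for your volume's datacenter.
+aws s3 ls s3://NETWORK_VOLUME_ID \
+  --endpoint-url https://S3_ENDPOINT_FOR_YOUR_DATACENTER
+```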
-- **Uploading large datasets or models** before launching compute resources.
-- **Managing files remotely** without maintaining an active connection.
-- **Automating data workflows** using standard S3 tools and libraries.
-- **Reducing costs** by avoiding the need to keep compute resources running for file management.
-- **Pre-populating volumes** to reduce worker initialization time and improve cold start performance.
-
-The S3-compatible API supports standard S3 operations including file uploads, downloads, listing, and deletion. You can use it with popular tools like the AWS CLI and Boto3 (Python).
-
-
-The S3-compatible API is currently available for network volumes in the following datacenters: `EUR-IS-1`, `EU-RO-1`, `EU-CZ-1`, `US-KS-2`, `US-CA-2`.
-
-
-
-
-## Migrate files
-
-You can migrate files between network volumes (including between data centers) using the following methods:
+## Migrate files between volumes

### Using runpodctl

-The simplest way to migrate files between network volumes is to use `runpodctl send` and `receive` on two running Pods. 
-
-Before you begin, you'll need:
-
-- A source network volume containing the data you want to migrate.
-- A destination network volume (which can be empty or contain existing data).
+The simplest way to migrate files between network volumes is to use `runpodctl send` and `receive` on two running Pods:

-
-Deploy two Pods using the default Runpod PyTorch template. Each Pod should have one [network volume attached](#attach-to-a-pod).
-
-1. Deploy the first Pod in the source data center and attach the source network volume.
-2. Deploy the second Pod in the target data center and attach the target network volume.
-3. Start the [web terminal](/pods/connect-to-a-pod#web-terminal) in both Pods.
-
-
-
-Using your source Pod's web terminal, navigate to the network volume directory (usually `/workspace`):
-
-```bash
-cd workspace
-```
-
-
-
-Use `runpodctl send` to start the transfer. To transfer the entire volume:
-
-```bash
-runpodctl send *
-```
-
-You can also specify specific files or directories instead of `*`.
-
-
-
-After running the send command, copy the `receive` command from the output. It will look something like this:
-
-```bash
-runpodctl receive 8338-galileo-collect-fidel
-```
-
-
-
-Using your destination Pod's web terminal, navigate to the network volume directory (usually `/workspace`):
-
-```bash
-cd workspace
-```
-
-
-
-Paste and run the `receive` command you copied earlier:
-
-```bash
-runpodctl receive 8338-galileo-collect-fidel
-```
-
-The transfer will begin and show progress as it copies files from the source to the destination volume.
-

+  
+    Deploy Pods with the source and destination volumes attached. Open web terminals on both.
+  
+  
+    On the source Pod:
+    ```bash
+    cd /workspace
+    runpodctl send *
+    ```
+    Copy the receive command from the output.
+  
+  
+    On the destination Pod:
+    ```bash
+    cd /workspace
+    runpodctl receive 8338-galileo-collect-fidel  # Use your code
+    ```
+  

-For a visual walkthrough using JupyterLab, check out this video tutorial:
-
-
-
-
-
### Using rsync over SSH

-For faster migration speed and more reliability for large transfers, you can use `rsync` over SSH on two running Pods.
-
-Before you begin, you'll need:
-
-- A network volume in the source data center containing the data you want to migrate.
-- A network volume in the target data center (which can be empty or contain existing data).
+For faster, more reliable transfers of large amounts of data, use `rsync` over SSH on two running Pods:

-
-Deploy two Pods using the default Runpod PyTorch template. Each Pod should have one [network volume attached](#attach-to-a-pod).
-
-1. Deploy the first Pod in the source data center and attach the source network volume.
-2. Deploy the second Pod in the target data center and attach the target network volume.
-3. Start the [web terminal](/pods/connect-to-a-pod#web-terminal) in both Pods.
-
-
-
-On the source Pod, install required packages and generate an SSH key pair:
-
-```bash
-apt update && apt install -y vim rsync && \
-ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -q && \
-cat ~/.ssh/id_ed25519.pub
-```
-
-Copy the public key that appears in the terminal output.
-
-
-
+  
+    Deploy Pods with source and destination volumes attached.
+  
+  
+    On the source Pod, install required packages and generate an SSH key pair:
+
+    ```bash
+    apt update && apt install -y rsync
+    ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N "" -q
+    cat ~/.ssh/id_ed25519.pub
+    ```
+    Copy the public key. 
+ + On the destination Pod, install required packages and add the source Pod's public key to `authorized_keys`: ```bash @@ -312,10 +166,11 @@ vi ~/.ssh/authorized_keys In the editor that opens, paste the public key you copied from the source Pod, then save and exit (press `Esc`, type `:wq`, and press `Enter`). The command above also displays the `rsync` command you'll need to run on the source Pod. Copy this command for the next step. - - - -On the source Pod, run the `rsync` command from the previous step. If you didn't copy it, you can construct it manually using the destination Pod's IP address and port number. + + + On the source Pod, run the `rsync` command from the previous step. + + If you didn't copy it, you can construct it manually using the destination Pod's IP address and port number: ```bash # Replace DESTINATION_PORT and DESTINATION_IP with values from the destination Pod @@ -325,20 +180,8 @@ rsync -avzP --inplace -e "ssh -p DESTINATION_PORT" /workspace/ root@DESTINATION_ rsync -avzP --inplace -e "ssh -p 18598" /workspace/ root@157.66.254.13:/workspace ``` -The `rsync` command displays progress as it transfers files. Depending on the size of your data, this may take some time. - - - -After the `rsync` command completes, verify the data transfer by checking disk usage on both Pods: - -```bash -du -sh /workspace -``` - -The destination Pod should show similar disk usage to the source Pod if all files transferred successfully. - - -You can run the `rsync` command multiple times if the transfer is interrupted. The `--inplace` flag ensures that `rsync` resumes from where it left off rather than starting over. - - + + You can run the `rsync` command multiple times if the transfer is interrupted. The `--inplace` flag ensures that `rsync` resumes from where it left off rather than starting over. + + diff --git a/tutorials/flash/build-rest-api-with-load-balancer.mdx b/tutorials/flash/build-rest-api-with-load-balancer.mdx index 0495655d..bf99914e 100644 --- a/tutorials/flash/build-rest-api-with-load-balancer.mdx +++ b/tutorials/flash/build-rest-api-with-load-balancer.mdx @@ -7,17 +7,6 @@ tag: "BETA" This tutorial shows you how to build a REST API using Flash load-balanced endpoints. You'll create a multi-route API that handles text processing, demonstrates both CPU and GPU endpoints, and deploys to production. -## What you'll learn - -In this tutorial you'll learn how to: - -- Create load-balanced endpoints with the `Endpoint` class -- Define multiple routes on a single endpoint with `.get()`, `.post()`, and other HTTP method decorators -- Add GPU-accelerated routes for ML inference -- Test your API locally with `flash run` -- Deploy your API to production with `flash deploy` -- Call your deployed endpoints with proper authentication - ## Requirements - You've [created a Runpod account](/get-started/manage-accounts) diff --git a/tutorials/flash/image-generation-with-sdxl.mdx b/tutorials/flash/image-generation-with-sdxl.mdx index 2b924c59..24422930 100644 --- a/tutorials/flash/image-generation-with-sdxl.mdx +++ b/tutorials/flash/image-generation-with-sdxl.mdx @@ -11,16 +11,6 @@ This tutorial shows you how to build an image generation script using Flash and -## What you'll learn - -In this tutorial you'll learn how to: - -- Use the Hugging Face diffusers library with Flash. -- Load and run Stable Diffusion XL models on GPU workers. -- Generate high-quality images from text prompts. -- Save generated images to disk. -- Configure generation parameters like guidance scale and steps. 
- ## Requirements - You've [created a Runpod account](/get-started/manage-accounts). diff --git a/tutorials/flash/text-generation-with-transformers.mdx b/tutorials/flash/text-generation-with-transformers.mdx index 23c97e58..29f5cf45 100644 --- a/tutorials/flash/text-generation-with-transformers.mdx +++ b/tutorials/flash/text-generation-with-transformers.mdx @@ -7,16 +7,6 @@ tag: "BETA" This tutorial shows you how to build a text generation script using Flash and Hugging Face's transformers library. You'll learn how to load a pretrained language model on a GPU worker and generate text from prompts. -## What you'll learn - -In this tutorial you'll learn how to: - -- Install and use the Hugging Face transformers library with Flash. -- Load pretrained models on remote GPU workers. -- Move models to GPU for faster inference. -- Configure text generation parameters like temperature and max length. -- Return structured results with metadata. - ## Requirements - You've [created a Runpod account](/get-started/manage-accounts). diff --git a/tutorials/introduction/containers.mdx b/tutorials/introduction/containers.mdx index d09dad64..86a0269d 100644 --- a/tutorials/introduction/containers.mdx +++ b/tutorials/introduction/containers.mdx @@ -6,16 +6,6 @@ description: "Learn about Docker containers, images, and how they enable portabl Containers are the foundation of modern cloud computing, enabling you to package applications with all their dependencies and run them consistently across different environments. This tutorial series teaches you the container fundamentals you need to work effectively with Runpod's [Serverless](/serverless/overview) and [Pods](/pods/overview) platforms. -## What you'll learn - -In this tutorial series, you will learn: - -- What containers and images are and why they matter for cloud deployment. -- How to create custom Docker images using Dockerfiles. -- Essential Docker commands for building, running, and managing containers. -- How to persist data outside of containers using volumes. -- How container concepts apply to Runpod's Serverless and Pods platforms. - ## Requirements To follow this tutorial series, you need: diff --git a/tutorials/introduction/containers/create-dockerfiles.mdx b/tutorials/introduction/containers/create-dockerfiles.mdx index fc439c7e..332345ec 100644 --- a/tutorials/introduction/containers/create-dockerfiles.mdx +++ b/tutorials/introduction/containers/create-dockerfiles.mdx @@ -6,17 +6,6 @@ description: "Learn how to write Dockerfiles, build custom images, and run your A Dockerfile is a text file containing instructions for building a Docker image. By creating a Dockerfile, you can package your application with its dependencies and configuration, making it easy to deploy anywhere Docker runs. This guide walks you through creating your first Dockerfile, building an image, and running a container. -## What you'll learn - -In this guide, you will learn how to: - -- Verify your Docker installation. -- Run your first container from an existing image. -- Write a Dockerfile with common instructions. -- Create and configure an entrypoint script. -- Build a custom Docker image. -- Run containers from your custom image. 
- ## Requirements Before starting, you need: diff --git a/tutorials/introduction/containers/docker-commands.mdx b/tutorials/introduction/containers/docker-commands.mdx index 9febbd10..a71c5c41 100644 --- a/tutorials/introduction/containers/docker-commands.mdx +++ b/tutorials/introduction/containers/docker-commands.mdx @@ -6,16 +6,6 @@ description: "Essential Docker CLI commands for building, running, managing, and This reference guide covers the most commonly used Docker commands for working with images and containers. Use this as a quick reference when building and deploying applications, especially for Runpod's bring-your-own-container (BYOC) workflows with [Serverless](/serverless/workers/overview) and [Pods](/pods/overview). -## What you'll learn - -In this reference, you'll find: - -- Commands for building and managing Docker images. -- Commands for running and controlling containers. -- Commands for working with volumes and networks. -- Common command workflows for development and deployment. -- Runpod-specific guidance for container deployment. - ## Building images These commands help you create and manage Docker images. diff --git a/tutorials/introduction/containers/persist-data.mdx b/tutorials/introduction/containers/persist-data.mdx index 04f36164..f05a8fc0 100644 --- a/tutorials/introduction/containers/persist-data.mdx +++ b/tutorials/introduction/containers/persist-data.mdx @@ -8,17 +8,6 @@ By default, containers are ephemeral—when a container stops, any data written This guide shows you how to use volumes to persist data, a fundamental concept for working with Runpod's [Serverless](/serverless/workers/overview) and [Pods](/pods/overview) platforms. -## What you'll learn - -In this guide, you will learn how to: - -- Understand why containers lose data when they stop. -- Create Docker volumes for persistent storage. -- Mount volumes to containers at runtime. -- Read and write data to volumes. -- Access volume data across multiple containers. -- Apply these concepts to Runpod's storage solutions. - ## Requirements Before starting, you should have: diff --git a/tutorials/introduction/overview.mdx b/tutorials/introduction/overview.mdx index cfdbcd15..a415b6d3 100644 --- a/tutorials/introduction/overview.mdx +++ b/tutorials/introduction/overview.mdx @@ -7,31 +7,30 @@ mode: "wide"
-This section includes step-by-step guides to help you build and deploy example applications on the Runpod platform, covering basic concepts and advanced implementations. +Step-by-step guides for building and deploying example applications on Runpod. ## Serverless - - Deploy a Stable Diffusion endpoint and generate your first AI image using Serverless. + + Deploy a Stable Diffusion endpoint and generate your first AI image. - - Deploy an image generation endpoint and integrate it into a web application. + + Deploy an image generation endpoint and integrate it into a web app. - Learn how to create a custom Serverless endpoint that uses model caching to serve a large language model with reduced cost and cold start times. + Serve an LLM with reduced cost and cold start times. - - Deploy a Serverless endpoint with Google's Gemma 3 model using vLLM and the OpenAI API to build an interactive chatbot. + + Use vLLM and the OpenAI API to build an interactive chatbot. - - Deploy ComfyUI on Serverless and generate images using JSON workflows. + + Deploy ComfyUI and generate images using JSON workflows. - - ## Flash + Deploy SDXL as a serverless endpoint with Python decorators. @@ -40,24 +39,24 @@ This section includes step-by-step guides to help you build and deploy example a Deploy a text generation model on Runpod. - Create a REST API with automatic load balancing using Flash. + Create a REST API with automatic load balancing. ## Pods - - Launch JupyterLab on a GPU Pod and run LLM inference using the Python `transformers` library. + + Launch JupyterLab on a GPU Pod and run inference with Transformers. - Deploy Ollama on a GPU Pod and run LLM inference using the Ollama API. + Deploy Ollama on a GPU Pod and run inference using the Ollama API. - + Build Docker images on Pods using Bazel. - - Deploy ComfyUI on a GPU Pod and generate images using the ComfyUI web interface. + + Deploy ComfyUI on a GPU Pod and use the web interface. @@ -65,6 +64,6 @@ This section includes step-by-step guides to help you build and deploy example a - Chain multiple Public Endpoints to generate videos from text prompts using Python. + Chain multiple Public Endpoints to generate videos from text. - \ No newline at end of file + diff --git a/tutorials/pods/build-docker-images.mdx b/tutorials/pods/build-docker-images.mdx index 7e152c1e..0aef2094 100644 --- a/tutorials/pods/build-docker-images.mdx +++ b/tutorials/pods/build-docker-images.mdx @@ -2,22 +2,12 @@ title: "Build Docker images on Runpod with Bazel" sidebarTitle: "Build Docker images with Bazel" description: "Build and push Docker images from inside a Runpod Pod using Bazel." -tag: "NEW" --- -import { TemplateTooltip, PodTooltip, NetworkVolumeTooltip } from "/snippets/tooltips.jsx"; +import { TemplateTooltip, NetworkVolumeTooltip } from "/snippets/tooltips.jsx"; Runpod Pods use custom Docker images, so you can't directly build Docker containers or use Docker Compose on a GPU Pod. However, you can use [Bazel](https://bazel.build) to build and push Docker images from inside a Pod, effectively creating a "Docker in Docker" workflow. -## What you'll learn - -In this tutorial, you'll learn how to: - -- Deploy a for building Docker images. -- Install Bazel and Docker dependencies. -- Configure a Bazel project with [rules_oci](https://github.com/bazel-contrib/rules_oci). -- Build and push a Docker image to Docker Hub. 
-

## Requirements

Before starting, you'll need:
diff --git a/tutorials/pods/comfyui.mdx b/tutorials/pods/comfyui.mdx
index 816a130b..9e189a48 100644
--- a/tutorials/pods/comfyui.mdx
+++ b/tutorials/pods/comfyui.mdx
@@ -4,7 +4,7 @@ sidebarTitle: "Generate images with ComfyUI"
description: "Deploy ComfyUI on Runpod to create AI-generated images."
---

-import { TemplateTooltip, PodTooltip, NetworkVolumeTooltip, RunpodCLITooltip, ServerlessTooltip } from "/snippets/tooltips.jsx";
+import { TemplateTooltip, NetworkVolumeTooltip, RunpodCLITooltip, ServerlessTooltip } from "/snippets/tooltips.jsx";

This tutorial walks you through how to configure ComfyUI on a [GPU Pod](/pods/overview) and use it to generate images with text-to-image models.

@@ -18,16 +18,6 @@ When you're just getting started with ComfyUI, it's important to use a workflow

For example, if you load a workflow created for the Flux Dev model and try to use it with SDXL-Turbo, the workflow might run, but with poor speed or image quality.

-## What you'll learn
-
-In this tutorial, you'll learn how to:
-
-- Deploy a with ComfyUI pre-installed.
-- Connect to the ComfyUI web interface.
-- Browse pre-configured workflow templates.
-- Install new models to your Pod.
-- Generate an image.
-
## Requirements

Before you begin, you'll need:
diff --git a/tutorials/pods/run-ollama.mdx b/tutorials/pods/run-ollama.mdx
index d706a374..9e8a81ec 100644
--- a/tutorials/pods/run-ollama.mdx
+++ b/tutorials/pods/run-ollama.mdx
@@ -2,21 +2,12 @@
title: "Set up Ollama on a Pod"
sidebarTitle: "Set up Ollama"
description: "Install and run Ollama on a Pod with HTTP API access."
-tag: "NEW"
---

-import { TemplateTooltip, PodTooltip, PodEnvironmentVariablesTooltip } from "/snippets/tooltips.jsx";
+import { PodTooltip, PodEnvironmentVariablesTooltip } from "/snippets/tooltips.jsx";

This tutorial shows you how to set up [Ollama](https://ollama.com), a platform for running large language models, on a Runpod GPU Pod. By the end, you'll have Ollama running with HTTP API access for external requests.

-## What you'll learn
-
-In this tutorial, you'll learn how to:
-
-- Deploy a Pod with the PyTorch .
-- Install and configure Ollama for external access.
-- Run AI models and interact via the HTTP API.
-
## Requirements

- A Runpod account with credits.
diff --git a/tutorials/pods/run-your-first.mdx b/tutorials/pods/run-your-first.mdx
index 52d8d98a..432eefa2 100644
--- a/tutorials/pods/run-your-first.mdx
+++ b/tutorials/pods/run-your-first.mdx
@@ -4,23 +4,13 @@ sidebarTitle: "Run LLMs with JupyterLab"
description: "Learn how to run inference on the SmolLM3 model in JupyterLab using the transformers library."
---

-import { TemplateTooltip, PodTooltip, NetworkVolumeTooltip } from "/snippets/tooltips.jsx";
+import { PodTooltip, NetworkVolumeTooltip } from "/snippets/tooltips.jsx";

This tutorial shows how to deploy a Pod and use JupyterLab to generate text with the SmolLM3 model using the Python `transformers` library.

[SmolLM3](https://huggingface.co/docs/transformers/en/model_doc/smollm3) is a family of small language models developed by Hugging Face that provides strong performance while being efficient enough to run on modest hardware. The 3B parameter model we'll use in this tutorial requires only 24 GB of VRAM, making it accessible for experimentation and development.

-## What you'll learn
-
-In this tutorial, you'll learn how to:
-
-- Deploy a Pod with the PyTorch .
-- Access the web terminal and JupyterLab services.
-- Install the transformers and accelerate libraries. 
-- Use SmolLM3 for text generation in a Python notebook.
-- Configure model parameters for different use cases.
-
## Requirements

Before you begin, you'll need:
diff --git a/tutorials/public-endpoints/text-to-video-pipeline.mdx b/tutorials/public-endpoints/text-to-video-pipeline.mdx
index 3db75896..ac71b198 100644
--- a/tutorials/public-endpoints/text-to-video-pipeline.mdx
+++ b/tutorials/public-endpoints/text-to-video-pipeline.mdx
@@ -2,7 +2,6 @@
title: "Build a text-to-video pipeline"
sidebarTitle: "Text-to-video pipeline"
description: "Chain multiple Public Endpoints to generate videos from text prompts using Python."
-tag: "NEW"
---

This tutorial shows you how to build a complete text-to-video pipeline by chaining three Runpod [Public Endpoints](/public-endpoints/overview) together. You'll take a simple idea and transform it into an animated video, all with a single Python script.
diff --git a/tutorials/serverless/comfyui.mdx b/tutorials/serverless/comfyui.mdx
index d753f354..76840b33 100644
--- a/tutorials/serverless/comfyui.mdx
+++ b/tutorials/serverless/comfyui.mdx
@@ -4,22 +4,12 @@ sidebarTitle: "Deploy ComfyUI on Serverless"
description: "Learn how to deploy a Serverless endpoint running ComfyUI from the Runpod Hub and use it to generate images with FLUX Dev."
---

-import { ServerlessTooltip, EndpointTooltip, RunpodHubTooltip, WorkerTooltip, PodTooltip } from "/snippets/tooltips.jsx";
+import { ServerlessTooltip, EndpointTooltip, WorkerTooltip, PodTooltip } from "/snippets/tooltips.jsx";

In this tutorial, you will learn how to deploy a Serverless endpoint running [ComfyUI](https://github.com/comfyanonymous/ComfyUI) on Runpod, submit image generation jobs using workflow JSON, monitor their progress, and decode the resulting images.

[Runpod's Serverless platform](/serverless/overview) allows you to run AI/ML models in the cloud without managing infrastructure, automatically scaling resources as needed. ComfyUI is a powerful node-based interface for Stable Diffusion that provides fine-grained control over the image generation process through customizable workflows.

-## What you'll learn
-
-In this tutorial you'll learn:
-
-- How to deploy a ComfyUI Serverless endpoint using the .
-- How to structure ComfyUI workflow JSON for API requests.
-- How to submit jobs, monitor their progress, and retrieve results.
-- How to generate images using the FLUX.1-dev-fp8 model.
-- How to decode the base64 output to retrieve the generated image.
-
## Requirements

Before starting this tutorial you'll need:
diff --git a/tutorials/serverless/generate-sdxl-turbo.mdx b/tutorials/serverless/generate-sdxl-turbo.mdx
index 4065fc47..4f7f7127 100644
--- a/tutorials/serverless/generate-sdxl-turbo.mdx
+++ b/tutorials/serverless/generate-sdxl-turbo.mdx
@@ -2,7 +2,6 @@
title: "Integrate Serverless with a web application"
sidebarTitle: "Integrate with a web application"
description: "Deploy an image generation endpoint from the Hub and integrate it into a web application."
-tag: "NEW"
---

import { ServerlessTooltip, RunpodHubTooltip, WorkerTooltip, ColdStartTooltip } from "/snippets/tooltips.jsx";

In this tutorial, you'll deploy a pre-built SDXL Turbo worker from the Runpod Hub.

By the end, you'll know how to deploy endpoints from the Hub and integrate them into your applications using standard HTTP requests.

-## What you'll learn
-
-In this tutorial, you'll learn how to:
-
-- Deploy a pre-built AI worker from the Runpod Hub.
-- Build a web application that generates images from text prompts. 
-

## Requirements

- A [Runpod account](/get-started/manage-accounts) with credits.
diff --git a/tutorials/serverless/model-caching-text.mdx b/tutorials/serverless/model-caching-text.mdx
index 56a50e99..184686ee 100644
--- a/tutorials/serverless/model-caching-text.mdx
+++ b/tutorials/serverless/model-caching-text.mdx
@@ -2,10 +2,9 @@
title: "Deploy Phi-3 using model caching"
sidebarTitle: "Deploy a cached model"
description: "Learn how to create a custom Serverless endpoint that uses model caching to serve Phi-3 with reduced cost and cold start times."
-tag: "NEW"
---

-import { ServerlessTooltip, WorkerTooltip, HandlerFunctionTooltip, CachedModelsTooltip } from "/snippets/tooltips.jsx";
+import { ServerlessTooltip, WorkerTooltip, CachedModelsTooltip } from "/snippets/tooltips.jsx";

You can download the finished code for this tutorial [on GitHub](https://github.com/runpod-workers/model-store-cache-example).

This tutorial demonstrates how to build a custom worker that leverages Runpod's model caching feature to serve the Phi-3 language model. You'll learn how to create a handler function that locates and loads cached models in offline mode, which can significantly reduce costs and cold start times.

-## What you'll learn
-
-- How to configure a Serverless endpoint to use cached models.
-- How to programmatically locate a cached model in your handler function.
-- How to create a custom for text generation.
-- How to integrate the Phi-3 model with the Hugging Face Transformers library.
-
## Requirements

Before starting this tutorial, make sure:
diff --git a/tutorials/serverless/run-gemma-7b.mdx b/tutorials/serverless/run-gemma-7b.mdx
index f5b9887a..37d62d37 100644
--- a/tutorials/serverless/run-gemma-7b.mdx
+++ b/tutorials/serverless/run-gemma-7b.mdx
@@ -2,22 +2,14 @@
title: "Deploy a chatbot with Gemma 3 and send requests using the OpenAI API"
sidebarTitle: "Create a chatbot with Gemma 3"
description: "Deploy a Serverless endpoint with Google's Gemma 3 model using vLLM and the OpenAI API to build an interactive chatbot."
-tag: "NEW"
---

-import { ServerlessTooltip, EndpointTooltip, VLLMTooltip, WorkerTooltip, ColdStartTooltip } from "/snippets/tooltips.jsx";
+import { ServerlessTooltip, EndpointTooltip, WorkerTooltip, ColdStartTooltip } from "/snippets/tooltips.jsx";

This tutorial walks you through deploying a Serverless endpoint with Google's Gemma 3 model using the vLLM worker. You'll deploy the `gemma-3-1b-it` instruction-tuned variant, a lightweight model that runs efficiently on a variety of GPUs.

By the end, you'll have a fully functional Serverless endpoint that can respond to chat-style prompts through the [OpenAI-compatible API](/serverless/vllm/openai-compatibility).

-## What you'll learn
-
-- How to accept Google's terms for gated models on Hugging Face.
-- How to deploy a worker from the Runpod Hub.
-- How to interact with your endpoint using the OpenAI-compatible API.
-- How to build a simple command-line chatbot.
-
## Requirements

Before starting this tutorial, you'll need:
diff --git a/tutorials/serverless/run-ollama-inference.mdx b/tutorials/serverless/run-ollama-inference.mdx
index f580cfa5..cddbff6c 100644
--- a/tutorials/serverless/run-ollama-inference.mdx
+++ b/tutorials/serverless/run-ollama-inference.mdx
@@ -2,19 +2,12 @@
title: "Run Ollama on Serverless (CPU)"
sidebarTitle: "Run Ollama (CPU)"
description: "Learn how to run an Ollama server on Serverless CPU workers." 
-tag: "NEW" --- -import { ServerlessTooltip, EndpointTooltip, NetworkVolumeTooltip, WorkersTooltip, ColdStartTooltip } from "/snippets/tooltips.jsx"; +import { ServerlessTooltip, WorkersTooltip } from "/snippets/tooltips.jsx"; Run an Ollama server on CPU for LLM inference. This tutorial focuses on CPU compute, but you can also select a GPU for faster performance. -## What you'll learn - -- Deploy an Ollama container as a Serverless . -- Configure a to cache models and reduce times. -- Send inference requests to your Ollama endpoint. - ## Requirements Before starting, you'll need: diff --git a/tutorials/serverless/run-your-first.mdx b/tutorials/serverless/run-your-first.mdx index 81fa1d92..73bc70c1 100644 --- a/tutorials/serverless/run-your-first.mdx +++ b/tutorials/serverless/run-your-first.mdx @@ -4,21 +4,12 @@ sidebarTitle: "Generate images with SDXL" description: "Learn how to deploy a Serverless endpoint running SDXL from the Runpod Hub and use it to generate images." --- -import { ServerlessTooltip, RunpodHubTooltip, InferenceTooltip } from "/snippets/tooltips.jsx"; +import { ServerlessTooltip, InferenceTooltip } from "/snippets/tooltips.jsx"; In this tutorial, you will learn how to deploy a endpoint running [Stable Diffusion XL](https://stablediffusionxl.com/) (SDXL) on Runpod, submit image jobs, monitor their progress, and decode the resulting images. [Runpod's Serverless platform](/serverless/overview) allows you to run AI/ML models in the cloud without managing infrastructure, automatically scaling resources as needed. SDXL is a powerful AI model that generates high-quality images from text prompts. -## What you'll learn - -In this tutorial you'll learn: - -- How to deploy a Serverless endpoint using the . -- How to submit jobs, monitor their progress, and retrieve results. -- How to generate an image using SDXL. -- How to decode the base64 output to retrieve the image. - ## Requirements Before starting this tutorial you'll need: