
feat: enable multi-datacenter support#266

Open
KAJdev wants to merge 5 commits into main from zeke/ae-2422-enable-adding-dcs

Conversation

KAJdev (Contributor) commented on Mar 11, 2026

flash is currently hardcoded to EU-RO-1: the DataCenter enum has only one value, NetworkVolume.dataCenterId is frozen to it, and ServerlessResource.datacenter defaults to it, with a prod-only guard that syncs it into locations.

DataCenter now includes all 10 supported RunPod DCs. The datacenter parameter on both Endpoint and ServerlessResource accepts a single DC, a list, a string, or None (no location restriction, matching the console's default "all datacenters" behavior). The old EU-RO-1 default is replaced with None.

Endpoint(volume=...) accepts a single NetworkVolume or a list. Network volumes are tied to individual DCs, so a one-volume-per-datacenter constraint is enforced in ServerlessResource.normalize_network_volumes. When multiple volumes are attached, the deploy path sends networkVolumeIds to the GraphQL API instead of the singular networkVolumeId.
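The one-volume-per-datacenter constraint could look like the following sketch (NetworkVolume here is a trimmed stand-in for the SDK model, and the function name is illustrative rather than the actual normalize_network_volumes implementation):

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class NetworkVolume:
    """Trimmed stand-in; the real SDK model has more fields."""
    name: str
    dataCenterId: str


def validate_one_volume_per_datacenter(volumes: list[NetworkVolume]) -> None:
    """Network volumes live in a single DC, so reject two volumes in the same DC."""
    counts = Counter(v.dataCenterId for v in volumes)
    dupes = sorted(dc for dc, n in counts.items() if n > 1)
    if dupes:
        raise ValueError(
            f"Only one network volume per datacenter is allowed; duplicates in: {dupes}"
        )
```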

CPU availability is limited to a subset of DCs. CPU_DATACENTERS (EU-RO-1, US-TX-1, EU-SE-1) is exported and validated against in validate_cpu_datacenters, so you get a clear error up front.
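A minimal sketch of what that validation might amount to (the exported names match the PR description; the function body is an assumption):

```python
CPU_DATACENTERS = frozenset({"EU-RO-1", "US-TX-1", "EU-SE-1"})


def validate_cpu_datacenters(datacenters) -> None:
    """Fail early with a readable message instead of a late API rejection."""
    if datacenters is None:
        return  # no restriction requested
    unsupported = sorted(set(datacenters) - CPU_DATACENTERS)
    if unsupported:
        raise ValueError(
            f"CPU endpoints are not available in {unsupported}; "
            f"choose from {sorted(CPU_DATACENTERS)}"
        )
```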

Closes AE-2422

KAJdev force-pushed the zeke/ae-2422-enable-adding-dcs branch from c567dc6 to c01fa70 on March 11, 2026 at 22:04.
KAJdev requested review from deanq and runpod-Henrik on March 11, 2026 at 22:05.

Copilot AI left a comment

Pull request overview

Enables multi-datacenter deployments by expanding the DataCenter enum, allowing datacenter=None (all DCs) or lists, and adding first-class support for attaching multiple network volumes (one per datacenter) with corresponding GraphQL payload updates.

Changes:

  • Expand DataCenter to all supported RunPod DCs and default endpoint/resource datacenter behavior to None (no restriction / all DCs).
  • Add multi-volume support (accept single or list; validate one volume per DC; send networkVolumeIds when multiple).
  • Add CPU datacenter restrictions via exported CPU_DATACENTERS, with validation and clearer errors.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 5 comments.

Changed files:

  • tests/unit/test_p2_gaps.py: Updates expectations for default locations/datacenter behavior when unset.
  • tests/unit/test_endpoint.py: Adds coverage for volume normalization to lists and expanded datacenter input forms.
  • tests/unit/test_deprecations.py: Marks CPU_DATACENTERS as non-deprecated public API.
  • tests/unit/resources/test_serverless.py: Updates/extends tests for multi-DC datacenter normalization, locations syncing, multi-volume validation, and CPU DC validation.
  • tests/unit/resources/test_network_volume.py: Adds tests for NetworkVolume(datacenter=...) alias behavior.
  • src/runpod_flash/runtime/resource_provisioner.py: Passes locations from the manifest into reconstructed resources.
  • src/runpod_flash/endpoint.py: Accepts volume as single or list; forwards datacenter as-is; serializes multi-volume configs into networkVolumes.
  • src/runpod_flash/core/resources/serverless_cpu.py: Excludes networkVolumes from the CPU endpoint payload.
  • src/runpod_flash/core/resources/serverless.py: Adds multi-volume normalization/validation, datacenter normalization, and CPU DC validation; deploy/update inject networkVolumeIds.
  • src/runpod_flash/core/resources/network_volume.py: Expands DataCenter; adds from_string; exports CPU_DATACENTERS; adds a datacenter alias for dataCenterId.
  • src/runpod_flash/core/resources/load_balancer_sls_resource.py: Excludes networkVolumes from the CPU load balancer payload.
  • src/runpod_flash/core/resources/__init__.py: Re-exports CPU_DATACENTERS.
  • src/runpod_flash/core/api/runpod.py: Extends the saveEndpoint selection set to include networkVolumeIds.
  • src/runpod_flash/cli/commands/build_utils/manifest.py: Persists locations into the build manifest deployment config.
  • src/runpod_flash/__init__.py: Exposes CPU_DATACENTERS at the top-level package API.
  • docs/Flash_SDK_Reference.md: Documents the expanded DC set, datacenter=None, CPU_DATACENTERS, and multi-volume examples.
  • docs/Flash_Deploy_Guide.md: Updates persistent storage examples for explicit DC targeting and multi-volume usage.
Comments suppressed due to low confidence (2)

src/runpod_flash/cli/commands/build_utils/manifest.py:242

  • Manifest extraction still only serializes the singular networkVolume (and networkVolumeId), so endpoints configured with multiple volumes (networkVolumes) will be written to the manifest with only the first volume. This will break deploy/runtime provisioning for multi-volume configs. Extend manifest extraction to include networkVolumes when present (and ensure runtime reconstruction supports it).
        if hasattr(resource_config, "locations") and resource_config.locations:
            config["locations"] = resource_config.locations

        if hasattr(resource_config, "env") and resource_config.env:
            env_dict = dict(resource_config.env)
            env_dict.pop("RUNPOD_API_KEY", None)
            if env_dict:
                config["env"] = env_dict

        if hasattr(resource_config, "networkVolume") and resource_config.networkVolume:
            nv = resource_config.networkVolume
            nv_config = {"name": nv.name}
            if nv.size is not None:
                nv_config["size"] = nv.size
            if hasattr(nv, "dataCenterId") and nv.dataCenterId is not None:
                nv_config["dataCenterId"] = (
                    nv.dataCenterId.value
                    if hasattr(nv.dataCenterId, "value")
                    else nv.dataCenterId
                )
            config["networkVolume"] = nv_config
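A hypothetical extension addressing this comment, with `extract_volume_config`/`extract_network_volumes` as illustrative helper names (not the repo's actual functions) and SimpleNamespace standing in for real resource objects:

```python
from types import SimpleNamespace


def extract_volume_config(nv) -> dict:
    """Serialize one NetworkVolume into plain manifest data."""
    nv_config = {"name": nv.name}
    if getattr(nv, "size", None) is not None:
        nv_config["size"] = nv.size
    dc = getattr(nv, "dataCenterId", None)
    if dc is not None:
        # Enum members serialize via .value; plain strings pass through
        nv_config["dataCenterId"] = getattr(dc, "value", dc)
    return nv_config


def extract_network_volumes(resource_config, config: dict) -> None:
    """Prefer the plural field so multi-volume endpoints round-trip intact."""
    volumes = getattr(resource_config, "networkVolumes", None)
    if volumes:
        config["networkVolumes"] = [extract_volume_config(nv) for nv in volumes]
    elif getattr(resource_config, "networkVolume", None):
        config["networkVolume"] = extract_volume_config(resource_config.networkVolume)
```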

src/runpod_flash/runtime/resource_provisioner.py:157

  • create_resource_from_manifest can reconstruct only networkVolume/networkVolumeId, but multi-volume endpoints are now represented via networkVolumes (and deployed via networkVolumeIds). Without handling those keys, manifest-based provisioning will silently drop additional volumes. Add support for networkVolumes (and/or networkVolumeIds) reconstruction here.
    if "locations" in resource_data:
        deployment_kwargs["locations"] = resource_data["locations"]

    # Reconstruct NetworkVolume from manifest data if present
    if "networkVolume" in resource_data:
        from runpod_flash.core.resources.network_volume import NetworkVolume

        nv_data = resource_data["networkVolume"]
        deployment_kwargs["networkVolume"] = NetworkVolume(**nv_data)
    elif "networkVolumeId" in resource_data:
        deployment_kwargs["networkVolumeId"] = resource_data["networkVolumeId"]



Comment on lines 110 to 128
# CPU endpoints don't use GPU-specific fields, so exclude them from the API payload
# This prevents the RunPod GraphQL API from rejecting CPU endpoints with GPU-specific fields
# Note: instanceIds is NOT in _input_only, so it will be sent to the API
_input_only = {
    "id",
    "cudaVersions",  # GPU-specific, exclude from API payload
    "datacenter",
    "env",
    "gpus",  # Inherited from parent, but always None for CPU endpoints
    "gpuIds",  # GPU-specific API field, exclude from payload
    "gpuCount",  # GPU-specific API field, exclude from payload
    "allowedCudaVersions",  # GPU-specific API field, exclude from payload
    "flashboot",
    "flashEnvironmentId",
    "imageName",
    "networkVolume",
    "networkVolumes",
    "python_version",
}

Copilot AI Mar 12, 2026


CpuServerlessEndpoint now excludes networkVolumes from the API payload via _input_only, but the overridden config_hash below only includes networkVolume (singular). That means changing the multi-volume configuration won’t change config_hash, so drift detection/updates may not trigger. Include networkVolumes in the CPU config hash (or otherwise ensure multi-volume changes affect drift detection).

Comment on lines 332 to 347
_input_only = {
    "id",
    "cudaVersions",
    "datacenter",
    "env",
    "flashboot",
    "flashEnvironmentId",
    "gpus",
    "gpuIds",
    "gpuCount",
    "allowedCudaVersions",
    "imageName",
    "networkVolume",
    "networkVolumes",
    "python_version",
}

Copilot AI Mar 12, 2026


CpuLoadBalancerSlsResource now excludes networkVolumes from the API payload via _input_only, but its custom config_hash implementation later in the class only hashes networkVolume (singular). Multi-volume changes may not be detected as drift, preventing needed updates. Include networkVolumes in the CPU LB config hash (or ensure multi-volume affects drift detection).

from runpod_flash import NetworkVolume, DataCenter

# existing volume by ID
vol = NetworkVolume(id="vol_abc123")

Copilot AI Mar 12, 2026


The docs show vol = NetworkVolume(id="vol_abc123"), but NetworkVolume currently requires a name field (no default) in the model. Either update the example to include name=... when referencing an existing volume, or make name optional when id is provided.

Suggested change
vol = NetworkVolume(id="vol_abc123")
vol = NetworkVolume(id="vol_abc123", name="my-existing-volume")

Comment on lines 580 to 591
async def _ensure_network_volume_deployed(self) -> None:
    """Ensures all network volumes are deployed.

    Deploys each volume in networkVolumes and collects their IDs.
    Sets networkVolumeId (singular) for backward compat with the first volume.
    Populates _deployed_volume_ids for multi-volume API payloads.
    """
    self._deployed_volume_ids: list[str] = []

    if self.networkVolumeId:
        self._deployed_volume_ids.append(self.networkVolumeId)
        return

Copilot AI Mar 12, 2026


_ensure_network_volume_deployed assigns self._deployed_volume_ids on a Pydantic model without declaring it as a field/private attr. With the current BaseResource.model_config (no extra='allow'), setting an undeclared attribute will raise at runtime, breaking deploy/update. Define _deployed_volume_ids as a Pydantic PrivateAttr (or stash it via self.__dict__[...]) so it can be set safely.

if isinstance(volume, NetworkVolume):
    return [volume]
if isinstance(volume, list):
    return volume or None

Copilot AI Mar 12, 2026


_normalize_volumes accepts a list but doesn’t validate that every element is a NetworkVolume. If a caller passes a mixed/incorrect list, the failure will happen later (e.g., when calling model_dump), producing a less clear error. Consider validating list element types here and raising a ValueError that points to the bad element.

Suggested change
return volume or None
if not volume:
    # preserve existing behavior: empty list normalizes to None
    return None
for idx, vol in enumerate(volume):
    if not isinstance(vol, NetworkVolume):
        raise ValueError(
            "volume list elements must be NetworkVolume; "
            f"element at index {idx} is {type(vol).__name__}"
        )
return volume
