
Releases: dstackai/dstack-enterprise

0.19.13-v1

11 Jun 10:28


Clusters

Built-in InfiniBand support in dstack Docker images

The dstack default Docker images now come with built-in InfiniBand support, including the necessary libibverbs library and the InfiniBand utilities from rdma-core. This means you can run torch.distributed and other NCCL-based workloads, and they'll take full advantage of InfiniBand without custom Docker images.

You can try InfiniBand clusters with dstack on Nebius.
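For example, a multi-node task along these lines (a minimal sketch; train.py and the resource spec are illustrative) uses the default image and picks up InfiniBand automatically:

type: task
name: torch-distrib

nodes: 2

commands:
  - uv pip install torch
  - |
    torchrun \
      --nnodes $DSTACK_NODES_NUM \
      --nproc_per_node $DSTACK_GPUS_PER_NODE \
      --node_rank $DSTACK_NODE_RANK \
      --master_addr $DSTACK_MASTER_NODE_IP \
      --master_port 12345 \
      train.py

resources:
  gpu: H100:8
  shm_size: 16GB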

Built-in EFA support in dstack VM images

dstack now uses DLAMI as the default AWS GPU VM image instead of a custom one. DLAMI supports EFA out of the box, so you no longer need a custom VM image to take advantage of EFA.
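For instance, a fleet sketch like this (the name and resources are illustrative) provisions an AWS GPU cluster from the DLAMI-based default image with EFA available:

type: fleet
name: my-efa-fleet

nodes: 2
placement: cluster

backends: [aws]

resources:
  gpu: H100:8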

Server

GCS support for code uploads

It's now possible to configure the dstack server to use Google Cloud Storage (GCS) for code uploads. Previously, only database and S3 storage were supported. Learn more in the Server deployment guide.
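As a sketch, assuming the bucket is configured via an environment variable analogous to the S3 one (the exact variable name and the required credentials setup are described in the Server deployment guide):

$ export DSTACK_SERVER_GCS_BUCKET=my-code-uploads-bucket
$ dstack server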

Full changelog: dstackai/dstack@0.19.12...0.19.13

0.19.12-v1

04 Jun 11:18


Clusters

Simplified use of MPI

startup_order and stop_criteria

New run configuration properties are introduced:

  • startup_order: any/master-first/workers-first specifies the order in which the master and worker jobs are started.
  • stop_criteria: all-done/master-done specifies when a multi-node run is considered finished.

These properties simplify running certain multi-node workloads. For example, MPI requires that the workers are up and running when the master runs mpirun, so you'd use startup_order: workers-first. An MPI workload can be considered done when the master is done, so you'd use stop_criteria: master-done, and dstack won't wait for the workers to exit.

DSTACK_MPI_HOSTFILE

dstack now automatically creates an MPI hostfile and exposes its path via the DSTACK_MPI_HOSTFILE environment variable. It can be passed directly to mpirun: mpirun --hostfile $DSTACK_MPI_HOSTFILE.
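Put together, a minimal MPI task sketch looks like this (./my_mpi_app is a placeholder for your binary):

type: task
name: mpi-task

nodes: 4
startup_order: workers-first
stop_criteria: master-done

commands:
  - |
    if [ $DSTACK_NODE_RANK -eq 0 ]; then
      mpirun --hostfile $DSTACK_MPI_HOSTFILE -n $DSTACK_GPUS_NUM ./my_mpi_app
    else
      sleep infinity
    fi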

CLI

We've also updated how the CLI displays run and job statuses. Previously, the CLI displayed an internal status code that was hard to interpret. Now, the STATUS column in dstack ps and dstack apply displays a status that makes it easy to see why a run or job terminated.

$ dstack ps -n 10
 NAME               BACKEND             RESOURCES                            PRICE    STATUS        SUBMITTED
 oom-task                                                                             no offers     yesterday
 oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
 oom-task           nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  exited (127)  yesterday
 heavy-wolverine-1                                                                    done          yesterday
   replica=0 job=0  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
   replica=0 job=1  aws (us-east-1)     cpu=4 mem=16GB disk=100GB T4:16GB:1  $0.526   exited (0)    yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  stopped       yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  error         yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  interrupted   yesterday
 cursor             nebius (eu-north1)  cpu=2 mem=8GB disk=100GB             $0.0496  aborted       yesterday

Examples

Simplified NCCL tests

With this release's improvements, it's become much easier to run MPI workloads with dstack. This includes NCCL tests, which can now be run using the following configuration:

type: task
name: nccl-tests

nodes: 2
startup_order: workers-first
stop_criteria: master-done

image: dstackai/efa
env:
  - NCCL_DEBUG=INFO
commands:
  - cd /root/nccl-tests/build
  - |
    if [ ${DSTACK_NODE_RANK} -eq 0 ]; then
      mpirun \
        --allow-run-as-root --hostfile $DSTACK_MPI_HOSTFILE \
        -n ${DSTACK_GPUS_NUM} \
        -N ${DSTACK_GPUS_PER_NODE} \
        --mca btl_tcp_if_exclude lo,docker0 \
        --bind-to none \
        ./all_reduce_perf -b 8 -e 8G -f 2 -g 1
    else
      sleep infinity
    fi

resources:
  gpu: nvidia:4:16GB
  shm_size: 16GB

See the updated NCCL tests example for more details.

Distributed training

TRL

The new TRL example walks you through running distributed fine-tuning with TRL, Accelerate, and DeepSpeed.

Axolotl

The new Axolotl example walks you through running distributed fine-tuning with Axolotl and dstack.

Full changelog: dstackai/dstack@0.19.11...0.19.12

0.19.11-v1

28 May 09:40


Runs

Replacing conda with uv

dstack's default Docker images now come with uv installed. Installing Python packages with uv can be significantly faster than with pip or conda. For example, here are uv vs. pip times for installing torch on a GCP VM:

# time uv pip install torch
...
real    0m32.771s
user    0m29.070s
sys     0m8.300s
# time pip install torch
...
real    2m26.338s
user    1m37.514s
sys     0m16.711s

To continue supporting pip, dstack now automatically activates a virtual environment with pip available.
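For example, a task that installs its dependencies with uv in the default image (train.py and the GPU spec are hypothetical):

type: task
name: train

commands:
  - uv pip install torch
  - python train.py

resources:
  gpu: 24GB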

conda is no longer included in dstack's default Docker images. If you need to use conda, it should be installed manually:

commands:
  - wget -O miniconda.sh https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - bash miniconda.sh -b -p /workflow/miniconda
  - eval "$(/workflow/miniconda/bin/conda shell.bash hook)"

Plugins

Built-in rest_plugin

dstack now supports a built-in rest_plugin that allows writing custom plugins as API servers, so you don't need to install plugins as Python packages.

Plugins implemented as API servers have advantages over plugins implemented as Python packages in some cases:

  • No dependency conflicts with dstack.
  • You can use any programming language.
  • If you run the dstack server via Docker, you don't need to extend the dstack server image with plugins or map them via volumes.

To get started, check out the plugin server example. The rest_plugin server API is documented here.
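As a sketch, enabling the plugin on the server side might look like this (the plugins list in the server's config.yml and the environment variable name are assumptions; check the linked docs for the exact setup):

plugins:
  - rest_plugin

$ export DSTACK_PLUGIN_SERVICE_URI=http://127.0.0.1:8000
$ dstack server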

AWS

New CPU series

dstack now supports the most recent AWS CPU VMs based on Intel Xeon Sapphire Rapids: M7i, C7i, and R7i. It also adds support for the burstable T3 family. Previously, only M5, C5, and t2.small CPU instances were supported.

Azure

New CPU series

dstack now supports the most recent Azure CPU VMs based on Intel Xeon Sapphire Rapids: the general-purpose Dsv6 and memory-optimized Esv6 series. Previously, only the Dsv3, Esv4, and Fsv2 series were supported.

GCP

New CPU series

dstack now supports the most recent GCP CPU VMs: C4, M4, H3, N4, and N2. Previously, only E2 and M1 were supported.

Note that C4, M4, H3, and N4 instances do not currently support volumes, since they require Hyperdisk support.
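To check which of the new CPU offers are available in your project, you can list them with the offer command (the flags follow the dstack offer examples below; the exact values are illustrative):

$ dstack offer -b gcp --cpu 8.. --max-offers 5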

Examples

Ray+RAGEN

The new Ray+RAGEN example shows how to use dstack and RAGEN to fine-tune an agent on multiple nodes.

Breaking changes

  • conda is no longer included in dstack's default Docker images.

Deprecations

  • Azure VM series Dsv3 and Esv4 are deprecated.

Full changelog: dstackai/dstack@0.19.10...0.19.11

0.19.11rc2-v1

27 May 09:42


Pre-release

Full changelog: dstackai/dstack@0.19.10...0.19.11rc2

0.19.10-v2

22 May 10:33


Linking Okta accounts

dstack now automatically links Okta accounts to existing dstack users with matching emails on first login, instead of creating new users.

[14:27:08] INFO     dstack_enterprise.services.auth.okta:70 Linked existing  
                    dstack user r4victor to Okta account pqefub12345
                    (victor@dstack.ai)

0.19.10-v1

21 May 09:31


Runs

Priorities

Run configurations now support a new priority property that allows controlling the order in which runs are provisioned:

type: task
nodes: 1
priority: 50
commands:
  - ...

Runs with higher priorities take precedence over runs with lower priorities. Previously, submitted jobs were processed in a FIFO manner, with older jobs processed first. Now, jobs are first sorted by descending priority. Note that if a high-priority run cannot be scheduled, it does not block lower-priority runs from scheduling (a.k.a. best-effort FIFO).

The priority property is updatable: it can be changed for already submitted runs, and the change takes effect immediately.

CLI

dstack project command

The new dstack project command replaces the existing dstack config command.

  1. dstack project (same as dstack project list):

$ dstack project

 PROJECT         URL                    USER            DEFAULT
 peterschmidt85  https://sky.dstack.ai  peterschmidt85
 main            http://127.0.0.1:3000  admin              ✓

  2. dstack project set-default:

$ dstack project set-default peterschmidt85
OK

  3. dstack project add (similar to the old dstack config, but --project is renamed to --name):

$ dstack project add --name peterschmidt85 --url https://sky.dstack.ai --token 76d8dd51-0470-74a7-24ed9ec18-fb7d341
OK

dstack ps -n/--last

The dstack ps command now supports a new -n/--last parameter to show the last N runs:

$ dstack ps -n 3
 NAME             BACKEND             RESOURCES                                    PRICE    STATUS      SUBMITTED    
 good-panther-2   gcp (europe-west4)  cpu=2 mem=8GB disk=100GB                     $0.0738  terminated  49 mins ago  
 new-chipmunk-1   azure (westeurope)  cpu=2 mem=8GB disk=100GB (spot)              $0.0158  terminated  23 hours ago 
 fuzzy-panther-1  runpod (EU-RO-1)    cpu=6 mem=31GB disk=100GB RTX2000Ada:16GB:1  $0.28    terminated  yesterday

Azure

Fsv2 series

The Azure backend now supports compute-optimized Fsv2 series:

$ dstack apply -b azure
 Project              main                           
 User                 admin                          
 Configuration        .dstack.yml                    
 Type                 dev-environment                
 Resources            cpu=4.. mem=8GB.. disk=100GB.. 
 Spot policy          auto                           
 Max price            -                              
 Retry policy         -                              
 Creation policy      reuse-or-create                
 Idle duration        5m                             
 Max duration         -                              
 Inactivity duration  -                              
 Reservation          -                              

 #  BACKEND             RESOURCES                         INSTANCE TYPE      PRICE     
 1  azure (westeurope)  cpu=4 mem=8GB disk=100GB (spot)   Standard_F4s_v2    $0.0278   
 2  azure (westeurope)  cpu=4 mem=16GB disk=100GB (spot)  Standard_D4s_v3    $0.0312   
 3  azure (westeurope)  cpu=4 mem=32GB disk=100GB (spot)  Standard_E4-2s_v4  $0.0416   
    ...                                                                                
 Shown 3 of 98 offers, $40.962 max

Major bugfixes

  • [Bug]: Instances with blocks feature cannot be used for multi-node runs #2650

Deprecations

  • The dstack config CLI command is deprecated in favor of dstack project add.

Full changelog: dstackai/dstack@0.19.9...0.19.10

0.19.9-v1

15 May 09:46


CLI

Container exit status

The CLI now displays the container exit status of each failed run or job in the STATUS column, e.g. exited (127).

Monitoring

Metrics

Previously, dstack stored and displayed only metrics from the last hour, so once a run or job finished, its metrics eventually disappeared. Now, dstack keeps the last hour of metrics for all finished runs.

AMD

On AMD, a wider range of ROCm/AMD SMI versions is now supported. Previously, metrics were not shown properly for certain versions.

Server

Robust handling of networking issues

It sometimes happens that the dstack server cannot connect to running instances due to networking problems or because instances become temporarily unreachable. Previously, dstack failed jobs very quickly in such cases. Now, the server allows a grace period of 2 minutes before considering jobs on unreachable instances failed.

Runs

DSTACK_RUN_ID and DSTACK_JOB_ID

Two new environment variables are now available within runs (see the sketch after this list):

  • DSTACK_RUN_ID stores the UUID of the run. Unlike DSTACK_RUN_NAME, it's unique for every run.
  • DSTACK_JOB_ID stores the UUID of the job submission. It's unique for every replica, job, and retry attempt.
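For example, they can be used to keep outputs from different replicas and retries apart (the paths and train.py are illustrative):

type: task
commands:
  - mkdir -p /output/$DSTACK_RUN_ID/$DSTACK_JOB_ID
  - python train.py --output-dir /output/$DSTACK_RUN_ID/$DSTACK_JOB_ID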

Full changelog: dstackai/dstack@0.19.8...0.19.9

0.19.8-v1

07 May 15:59


ARM

dstack now supports compute instances with ARM CPUs. To request ARM CPUs in a run or fleet configuration, specify the arm architecture in the resources.cpu property:

resources:
  cpu: arm:4..  # 4 or more ARM cores

If the hosts in an SSH fleet have ARM CPUs, dstack will automatically detect them and enable their use.

To see available offers with ARM CPUs, pass --cpu arm to the dstack offer command.

Lambda

GH200

With the lambda backend, it's now possible to use GH200 instances that come with an ARM-based 72-core NVIDIA Grace CPU and an NVIDIA H200 Tensor Core GPU, connected with a high-bandwidth, memory-coherent NVIDIA NVLink-C2C interconnect.

type: dev-environment
name: my-env

ide: vscode

resources:
  gpu: GH200:1

If Lambda has GH200 on-demand instances available at the time, you'll see them when you run dstack apply:

$ dstack apply -f .dstack.yml

 #   BACKEND             RESOURCES                                      INSTANCE TYPE  PRICE
 1   lambda (us-east-3)  cpu=arm:64 mem=464GB disk=4399GB GH200:96GB:1  gpu_1x_gh200   $1.49

Note that if no GH200 is available at the moment, you can specify a retry policy in your run configuration so that dstack runs it once the GPU becomes available, as sketched below.
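A sketch of such a retry policy (see the retry reference for the exact options):

retry:
  on_events: [no-capacity]
  duration: 1d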

Nebius

InfiniBand clusters

The nebius backend now supports InfiniBand clusters. A cluster is automatically created when you apply a fleet configuration with placement: cluster and supported GPUs, e.g., 8xH100 or 8xH200.

type: fleet
name: my-fleet

nodes: 2
placement: cluster

resources:
  gpu: H100,H200:8

A suitable InfiniBand fabric for the cluster is selected automatically. You can also limit the allowed fabrics in the backend settings, as sketched below.
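A sketch of the backend settings, assuming a fabrics list property (check the Nebius backend reference for the exact property name and credentials setup):

projects:
- name: main
  backends:
  - type: nebius
    creds:
      type: service_account
      # ...
    fabrics: ["fabric-2", "fabric-4"]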

Once the cluster is provisioned, you can benefit from its high-speed networking when running distributed tasks, such as NCCL tests or Hugging Face TRL.

Azure

Managed identities

The new vm_managed_identity backend setting allows you to configure the managed identity that is assigned to VMs created in the azure backend.

projects:
- name: main
  backends:
  - type: azure
    subscription_id: 06c82ce3-28ff-4285-a146-c5e981a9d808
    tenant_id: f84a7584-88e4-4fd2-8e97-623f0a715ee1
    creds:
      type: default
    vm_managed_identity: dstack-rg/my-managed-identity

Make sure that dstack has the required permissions for managed identities to work.

Full changelog: dstackai/dstack@0.19.7...0.19.8

0.19.7-v1

01 May 14:07


Plugins

Run configurations have many options. While dstack aims to simplify them and provide sensible defaults, teams may sometimes want to enforce their own defaults and configurations across projects.

To support this, we're introducing a plugin system that allows such enforcements to be defined programmatically. You can now define a plugin using dstack's Python SDK and bundle it with the dstack server.

For example, you can create your own plugin to override run configuration options—e.g., to prepend commands, set policies, and more.

For more information on plugin development, see the documentation and example.

Note

Plugins are currently an experimental feature. Backward compatibility is not guaranteed between releases.

Tenstorrent

The new update introduces initial support for Tenstorrent's Wormhole accelerators.

Now, if you create SSH fleets with hosts that have N150 or N300 PCIe boards, dstack will automatically detect them and allow you to use such a fleet for running dev environments, tasks, and services.

Dedicated examples for using dstack with Tenstorrent's accelerators will be published soon.

Full changelog: dstackai/dstack@0.19.5...0.19.7

0.19.5-v1

23 Apr 10:41


CLI

Offers

You can now list available offers (hardware configurations) from the configured backends using the CLI—without needing to define a run configuration. Just run dstack offer and specify the resource requirements. The CLI will output available offers, including backend, region, instance type, resources, spot availability, and pricing:

$ dstack offer --gpu H100:1.. --max-offers 10

 #   BACKEND     REGION     INSTANCE TYPE          RESOURCES                                     SPOT  PRICE   
 1   datacrunch  FIN-01     1H100.80S.30V          30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 2   datacrunch  FIN-02     1H100.80S.30V          30xCPU, 120GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 3   datacrunch  FIN-02     1H100.80S.32V          32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 4   datacrunch  ICE-01     1H100.80S.32V          32xCPU, 185GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.19   
 5   runpod      US-KS-2    NVIDIA H100 PCIe       16xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.39   
 6   runpod      CA         NVIDIA H100 80GB HBM3  24xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.69   
 7   nebius      eu-north1  gpu-h100-sxm           16xCPU, 200GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.95   
 8   runpod      AP-JP-1    NVIDIA H100 80GB HBM3  20xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
 9   runpod      CA-MTL-1   NVIDIA H100 80GB HBM3  28xCPU, 251GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
 10  runpod      CA-MTL-2   NVIDIA H100 80GB HBM3  26xCPU, 125GB, 1xH100 (80GB), 100.0GB (disk)  no    $2.99   
     ...                                                                                                                
 Shown 10 of 99 offers, $127.816 max

Learn more about how the new CLI command works in the reference.

Configuration

Resource tags

It's now possible to set custom resource-level tags using the new tags property:

type: dev-environment
ide: vscode
tags:
  my_custom_tag: some_value
  another_tag: another_value_123

The tags property is supported by all configuration types: runs, fleets, volumes, gateways, and profiles. Tags are propagated to the underlying cloud resources on backends that support them; currently, these are AWS, Azure, and GCP.

Shell configuration

With the new shell property you can specify the shell used to run commands (or init for dev environments):

type: task
image: ubuntu

shell: bash
commands:
  # now we can use Bash features, e.g., arrays:
  - words=(dstack is)
  - words+=(awesome)
  - echo ${words[@]}  # prints "dstack is awesome"

GCP

A3 High and A3 Edge

dstack now automatically sets up GCP A3 High and A3 Edge instances with GPUDirect-TCPX optimized NCCL communication.

An example of how to provision an A3 High cluster and run NCCL tests on it using dstack is coming soon!

Volumes

Total cost

The UI now shows each volume's total cost and termination date alongside its price. Previously, only the price information was available.


Full changelog: dstackai/dstack@0.19.4...0.19.5