
Commit dde4a07

Docs minor improvements (#3501)
* [Docs] Improved documentation structure (WIP)
  - [x] Introduced `More` under `Concepts`
  - [x] Moved `Metrics` to `Concepts`
  - [x] Improved `Installation` (removed SSH fleets - only keep it in `Backends`; moved `Configure` after `Set up the server`)
  - [x] Mention `server restart is required after updating server/config.yml` in `Backends`
  - [x] Improved `Distributed tasks` (structure; links to `Fleets` and `Examples`)
* [Docs] Documentation improvements
  - [x] Improved `Fleets` documentation
  - [x] Minor improvements of the `Tasks` page under `Concepts`
  - [x] Minor improvements on the home page
* [Docs] Minor updates to `README.md`, `Overview`, `Fleets`, `Quickstart`, and examples
1 parent b4c6f17 commit dde4a07

19 files changed

Lines changed: 325 additions & 382 deletions


README.md

Lines changed: 6 additions & 7 deletions
```diff
@@ -18,7 +18,7 @@
 
 It streamlines development, training, and inference, and is compatible with any hardware, open-source tools, and frameworks.
 
-#### Hardware
+#### Accelerators
 
 `dstack` supports `NVIDIA`, `AMD`, `Google TPU`, `Intel Gaudi`, and `Tenstorrent` accelerators out of the box.
 
```

```diff
@@ -46,7 +46,7 @@ It streamlines development, training, and inference, and is compatible with any
 
 ##### Configure backends
 
-To orchestrate compute across cloud providers or existing Kubernetes clusters, you need to configure backends.
+To orchestrate compute across GPU clouds or Kubernetes clusters, you need to configure backends.
 
 Backends can be set up in `~/.dstack/server/config.yml` or through the [project settings page](https://dstack.ai/docs/concepts/projects#backends) in the UI.
 
```
```diff
@@ -123,12 +123,11 @@ Configuration is updated at ~/.dstack/config.yml
 
 `dstack` supports the following configurations:
 
-* [Dev environments](https://dstack.ai/docs/dev-environments) — for interactive development using a desktop IDE
-* [Tasks](https://dstack.ai/docs/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps
-* [Services](https://dstack.ai/docs/services) — for deployment of models and web apps (with auto-scaling and authorization)
-* [Fleets](https://dstack.ai/docs/fleets) — for managing cloud and on-prem clusters
+* [Fleets](https://dstack.ai/docs/concepts/fleets) — for managing cloud and on-prem clusters
+* [Dev environments](https://dstack.ai/docs/concepts/dev-environments) — for interactive development using a desktop IDE
+* [Tasks](https://dstack.ai/docs/concepts/tasks) — for scheduling jobs (incl. distributed jobs) or running web apps
+* [Services](https://dstack.ai/docs/concepts/services) — for deployment of models and web apps (with auto-scaling and authorization)
 * [Volumes](https://dstack.ai/docs/concepts/volumes) — for managing persisted volumes
-* [Gateways](https://dstack.ai/docs/concepts/gateways) — for configuring the ingress traffic and public endpoints
 
 Configuration can be defined as YAML files within your repo.
 
```
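The configurations listed in the hunk above are defined as YAML files in your repo. For illustration, a minimal task configuration looks roughly like the sketch below (the name, script, and GPU size are hypothetical placeholders):

```yaml
# .dstack.yml — hypothetical minimal task configuration
type: task
name: train          # placeholder run name
python: "3.12"
commands:
  - python train.py  # placeholder training script
resources:
  gpu: 24GB          # request any GPU with at least 24GB of memory
```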

docs/blog/posts/gpu-health-checks.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -12,7 +12,7 @@ categories:
 
 In large-scale training, a single bad GPU can derail progress. Sometimes the failure is obvious — jobs crash outright. Other times it’s subtle: correctable memory errors, intermittent instability, or thermal throttling that quietly drags down throughput. In big experiments, these issues can go unnoticed for hours or days, wasting compute and delaying results.
 
-`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/guides/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.
+`dstack` already supports GPU telemetry monitoring through NVIDIA DCGM [metrics](../../docs/concepts/metrics.md), covering utilization, memory, and temperature. This release extends that capability with passive hardware health checks powered by DCGM [background health checks](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/feature-overview.html#background-health-checks). With these, `dstack` continuously evaluates fleet GPUs for hardware reliability and displays their status before scheduling workloads.
 
 <img src="https://dstack.ai/static-assets/static-assets/images/gpu-health-checks.png" width="630"/>
 
```

```diff
@@ -69,5 +69,5 @@ If you have experience with GPU reliability or ideas for automated recovery, joi
 !!! info "What's next?"
     1. Check [Quickstart](../../docs/quickstart.md)
     2. Explore the [clusters](../../docs/guides/clusters.md) guide
-    3. Learn more about [metrics](../../docs/guides/metrics.md)
+    3. Learn more about [metrics](../../docs/concepts/metrics.md)
     4. Join [Discord](https://discord.gg/u8SmfwPpMd)
```

docs/blog/posts/metrics-ui.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -53,6 +53,6 @@ For persistent storage and long-term access to metrics, we still recommend setti
 metrics from `dstack`.
 
 !!! info "What's next?"
-    1. See [Metrics](../../docs/guides/metrics.md)
+    1. See [Metrics](../../docs/concepts/metrics.md)
     2. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
     3. Join [Discord](https://discord.gg/u8SmfwPpMd)
```

docs/blog/posts/prometheus.md

Lines changed: 2 additions & 2 deletions
```diff
@@ -45,7 +45,7 @@ Overall, `dstack` collects three groups of metrics:
 | **Runs** | Run metrics include run counters for each user in each project. |
 | **Jobs** | A run consists of one or more jobs, each mapped to a container. Job metrics offer insights into execution time, cost, GPU model, NVIDIA DCGM telemetry, and more. |
 
-For a full list of available metrics and labels, check out [Metrics](../../docs/guides/metrics.md).
+For a full list of available metrics and labels, check out [Metrics](../../docs/concepts/metrics.md).
 
 ??? info "NVIDIA"
     NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends,
```
5151
NVIDIA DCGM metrics are automatically collected for `aws`, `azure`, `gcp`, and `oci` backends,
```diff
@@ -59,7 +59,7 @@ For a full list of available metrics and labels, check out [Metrics](../../docs/
     only accessible through the UI and the [`dstack metrics`](dstack-metrics.md) CLI.
 
 !!! info "What's next?"
-    1. See [Metrics](../../docs/guides/metrics.md)
+    1. See [Metrics](../../docs/concepts/metrics.md)
     1. Check [dev environments](../../docs/concepts/dev-environments.md),
        [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md),
        and [fleets](../../docs/concepts/fleets.md)
```
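To sketch how the metrics discussed above might be pulled into Prometheus, a minimal scrape configuration could look as follows. The target address and port are assumptions — the actual endpoint depends on where and how your `dstack` server is deployed:

```yaml
# prometheus.yml — illustrative scrape config; target address is an assumption
scrape_configs:
  - job_name: dstack
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:3000"]  # assumed dstack server address
```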

docs/docs/concepts/backends.md

Lines changed: 13 additions & 8 deletions
```diff
@@ -1,21 +1,22 @@
 # Backends
 
-Backends allow `dstack` to provision fleets across cloud providers or Kubernetes clusters.
+Backends allow `dstack` to provision fleets across GPU clouds or Kubernetes clusters.
 
 `dstack` supports two types of backends:
 
 * [VM-based](#vm-based) – use `dstack`'s native integration with cloud providers to provision VMs, manage clusters, and orchestrate container-based runs.
 * [Container-based](#container-based) – use either `dstack`'s native integration with cloud providers or Kubernetes to orchestrate container-based runs; provisioning in this case is delegated to the cloud provider or Kubernetes.
 
-??? info "SSH fleets"
+!!! info "SSH fleets"
     When using `dstack` with on-prem servers, backend configuration isn’t required. Simply create [SSH fleets](../concepts/fleets.md#ssh-fleets) once the server is up.
 
 Backends can be configured via `~/.dstack/server/config.yml` or through the [project settings page](../concepts/projects.md#backends) in the UI. See the examples of backend configuration below.
 
+> If you update `~/.dstack/server/config.yml`, you have to restart the server.
+
 ## VM-based
 
-VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers.
-Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand.
+VM-based backends allow `dstack` users to manage clusters and orchestrate container-based runs across a wide range of cloud providers. Under the hood, `dstack` uses native integrations with these providers to provision clusters on demand.
 
 Compared to [container-based](#container-based) backends, this approach offers finer-grained, simpler control over cluster provisioning and eliminates the dependency on a Kubernetes layer.
 
```

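To illustrate the SSH fleets note in the hunk above: an on-prem fleet is described by a plain YAML configuration roughly along these lines (the fleet name, user, hosts, and identity file below are placeholders):

```yaml
# fleet.dstack.yml — illustrative SSH fleet sketch; hosts and key are placeholders
type: fleet
name: on-prem-fleet
ssh_config:
  user: ubuntu                   # SSH user with access to each host
  identity_file: ~/.ssh/id_rsa   # private key used for all hosts
  hosts:
    - 192.168.0.10
    - 192.168.0.11
```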
```diff
@@ -1036,9 +1037,13 @@ projects:
 
 No additional setup is required — `dstack` configures and manages the proxy automatically.
 
-??? info "NVIDIA GPU Operator"
-    For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
-    [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed.
+??? info "Required operators"
+    === "NVIDIA"
+        For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
+        [NVIDIA GPU Operator](https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/index.html) pre-installed.
+    === "AMD"
+        For `dstack` to correctly detect GPUs in your Kubernetes cluster, the cluster must have the
+        [AMD GPU Operator](https://github.com/ROCm/gpu-operator) pre-installed.
 
 <!-- ??? info "Managed Kubernetes"
     While `dstack` supports both managed and on-prem Kubernetes clusters, it can only run on pre-provisioned nodes.
```
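For reference alongside the Kubernetes hunk above, a Kubernetes backend entry in `~/.dstack/server/config.yml` can be sketched as below. This is an assumption-laden illustration — the kubeconfig path is a placeholder, and proxy or networking options are omitted:

```yaml
# ~/.dstack/server/config.yml — illustrative Kubernetes backend sketch
projects:
- name: main
  backends:
  - type: kubernetes
    kubeconfig:
      filename: ~/.kube/config  # placeholder path to your kubeconfig
```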
```diff
@@ -1071,7 +1076,7 @@ projects:
 
 Ensure you've created a ClusterRoleBinding to grant the role to the user or the service account you're using.
 
-> To learn more, see the [Kubernetes](../guides/kubernetes.md) guide.
+> To learn more, see the [Lambda](../../examples/clusters/lambda/#kubernetes) and [Crusoe](../../examples/clusters/crusoe/#kubernetes) examples.
 
 ### RunPod
 
```
