---
title: "Model inference with Prefill-Decode disaggregation"
date: 2026-02-19
description: "TBA"
slug: pd-disaggregation
image: https://dstack.ai/static-assets/static-assets/images/dstack-pd-disaggregation.png
categories:
  - Changelog
links:
  - SGLang router integration: https://dstack.ai/blog/sglang-router/
---

# Model inference with Prefill-Decode disaggregation

While `dstack` started as a GPU-native orchestrator for development and training, over the last year it has increasingly brought inference to the forefront, making serving a first-class citizen.

<img src="https://dstack.ai/static-assets/static-assets/images/dstack-pd-disaggregation.png" width="630"/>

At the end of last year, we introduced the [SGLang router](../posts/sglang-router.md) integration, bringing cache-aware routing to [services](../../docs/concepts/services.md). Today, building on that integration, we’re adding native Prefill–Decode (PD) disaggregation.

<!-- more -->

Unlike many PD disaggregation setups that are tied to Kubernetes as the control plane, `dstack` does not depend on Kubernetes. It’s an open-source, GPU-native orchestrator that can provision GPUs directly in your cloud accounts or on bare-metal infrastructure, while also running on top of existing Kubernetes clusters if needed.

For inference, `dstack` provides a [services](../../docs/concepts/services.md) abstraction. While remaining framework-agnostic, it integrates more deeply with leading open-source frameworks, [SGLang](https://github.com/sgl-project/sglang) being one of them for model inference.

> If you’re new to Prefill–Decode disaggregation, see the official [SGLang docs](https://docs.sglang.io/advanced_features/pd_disaggregation.html).
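
As background for the configuration below: prefill and decode are very different workloads. Prefill processes the whole prompt in one compute-bound pass to build the KV cache, while decode generates tokens one at a time, repeatedly reading and extending that cache. In PD disaggregation, the two phases run on separate workers and the KV cache is transferred between them. A toy Python sketch of the two phases (illustrative stand-ins only, not SGLang internals):

```python
# Toy sketch of the two inference phases. The arithmetic below is a
# stand-in for real model computation; only the shape of the control
# flow mirrors prefill vs. decode.

def prefill(prompt_tokens):
    """Process all prompt tokens in one pass; return KV cache and first token."""
    kv_cache = list(prompt_tokens)          # stand-in for attention KV state
    first_token = sum(prompt_tokens) % 100  # stand-in for the first sampled token
    return kv_cache, first_token

def decode(kv_cache, token, steps):
    """Generate tokens one by one, extending the KV cache at each step."""
    out = []
    for _ in range(steps):
        kv_cache.append(token)                 # cache grows with every token
        token = (token + len(kv_cache)) % 100  # stand-in for sampling
        out.append(token)
    return out

# In a disaggregated setup, `kv` would be transferred from the prefill
# worker to the decode worker (e.g. via the mooncake backend).
kv, tok = prefill([1, 2, 3])
generated = decode(kv, tok, steps=4)
print(len(kv), len(generated))
```

Because prefill is compute-bound and decode is memory-bound, separating them lets you scale and size each replica group independently, which is exactly what the service configuration below does.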

## Services

With `dstack` `0.20.10`, you can define a service with separate replica groups for Prefill and Decode workers and enable PD disaggregation directly in the `router` configuration.

<div editor-title="glm45air.dstack.yml">

```yaml
type: service
name: glm45air

env:
  - HF_TOKEN
  - MODEL_ID=zai-org/GLM-4.5-Air-FP8

image: lmsysorg/sglang:latest

replicas:
  - count: 1..4
    scaling:
      metric: rps
      target: 3
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode prefill \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 8000 \
          --disaggregation-bootstrap-port 8998
    resources:
      gpu: H200

  - count: 1..8
    scaling:
      metric: rps
      target: 2
    commands:
      - |
        python -m sglang.launch_server \
          --model-path $MODEL_ID \
          --disaggregation-mode decode \
          --disaggregation-transfer-backend mooncake \
          --host 0.0.0.0 \
          --port 8000
    resources:
      gpu: H200

port: 8000
model: zai-org/GLM-4.5-Air-FP8

probes:
  - type: http
    url: /health_generate
    interval: 15s

router:
  type: sglang
  pd_disaggregation: true
```

</div>

Deploy it as usual:

<div class="termy">

```shell
$ HF_TOKEN=...
$ dstack apply -f glm45air.dstack.yml
```

</div>
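
Once deployed, the service serves an OpenAI-compatible endpoint through the gateway. A minimal sketch of building a chat-completion request with Python's standard library (the URL below is a placeholder assumption; substitute your own gateway domain and check the service's published endpoint):

```python
import json
import urllib.request

# Placeholder endpoint; the actual URL depends on your gateway domain
# and how dstack exposes the service.
url = "https://gateway.example.com/v1/chat/completions"

# OpenAI-compatible payload; the model name matches the `model` field
# in the service configuration above.
payload = {
    "model": "zai-org/GLM-4.5-Air-FP8",
    "messages": [{"role": "user", "content": "What is PD disaggregation?"}],
    "max_tokens": 128,
}

request = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# Uncomment to send against a live deployment:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The router transparently splits each request across the prefill and decode replica groups, so clients interact with a single endpoint as usual.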
102+
103+
### Gateway
104+
105+
Just like `dstack` relies on the SGLang router for cache-aware routing, Prefill–Decode disaggregation also requires a [gateway](../../docs/concepts/gateways.md#sglang) configured with the SGLang router.
106+
107+
<div editor-title="gateway-sglang.dstack.yml">
108+
109+
```yaml
110+
type: gateway
111+
name: inference-gateway
112+
113+
backends: [kubernetes]
114+
region: any
115+
116+
domain: example.com
117+
118+
router:
119+
type: sglang
120+
policy: cache_aware
121+
```
122+
123+
</div>

## Limitations

* Because the SGLang router requires all workers to be on the same network, and `dstack` currently runs the router inside the gateway, the gateway and the service must run in the same cluster.
* Prefill–Decode disaggregation is currently available only with the SGLang backend; vLLM support is coming.
* Autoscaling currently supports only the RPS metric; TTFT and ITL metrics are planned next.

With native support for inference and now Prefill–Decode disaggregation, `dstack` makes it easier to run high-throughput, low-latency model serving across GPU clouds, Kubernetes, and bare-metal clusters.

## What's next?

We’re working on PD disaggregation benchmarks and tuning guidance, coming soon.

In the meantime:

1. Read about [services](../../docs/concepts/services.md), [gateways](../../docs/concepts/gateways.md), and [fleets](../../docs/concepts/fleets.md)
2. Check out the [Quickstart](../../docs/quickstart.md)
3. Join our [Discord](https://discord.gg/u8SmfwPpMd)
