Cake exposes an OpenAI-compatible REST API when started with cake serve. The same server handles text chat, image generation, and audio/TTS, but only the endpoints for the loaded model type return results; all others return 404 with a JSON error.
# Text model
cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct
# Image model
cake serve evilsocket/flux1-dev --model-type image-model --image-model-arch flux1
# Audio model
cake serve evilsocket/VibeVoice-1.5B --model-type audio-model \
--voice-prompt voice.wav

| Endpoint | Method | Model Type | Description |
|---|---|---|---|
| /v1/chat/completions | POST | Text | Chat completion (OpenAI-compatible) |
| /api/v1/chat/completions | POST | Text | Alias for the above |
| /v1/audio/speech | POST | Audio | Text-to-speech generation |
| /v1/images/generations | POST | Image | Image generation (OpenAI-compatible) |
| /api/v1/image | POST | Image | Image generation (legacy format) |
| /v1/models | GET | Any | List loaded models |
| /api/v1/topology | GET | Any | Cluster topology as JSON |
| / | GET | Any | Web UI |
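
The table above can be read as a lookup from loaded model type to the endpoints expected to respond. The following is a minimal sketch of that lookup on the client side; the names `ENDPOINTS` and `available_endpoints` are hypothetical and are not part of Cake.

```python
# Endpoint table from this page, keyed by (method, path).
# "any" marks endpoints that respond regardless of the loaded model type.
ENDPOINTS = {
    ("POST", "/v1/chat/completions"): "text",
    ("POST", "/api/v1/chat/completions"): "text",
    ("POST", "/v1/audio/speech"): "audio",
    ("POST", "/v1/images/generations"): "image",
    ("POST", "/api/v1/image"): "image",
    ("GET", "/v1/models"): "any",
    ("GET", "/api/v1/topology"): "any",
    ("GET", "/"): "any",
}


def available_endpoints(loaded_type):
    """Return the (method, path) pairs expected to respond for a model type.

    loaded_type is one of "text", "audio", or "image".
    """
    return sorted(
        key for key, required in ENDPOINTS.items()
        if required in ("any", loaded_type)
    )
```

A client could use this to skip requests that would only produce a 404, for example by checking `("POST", "/v1/audio/speech") in available_endpoints("text")` before calling the TTS endpoint.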
POST /v1/chat/completions
OpenAI-compatible chat completion with optional streaming.
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Why is the sky blue?"}
],
"stream": true,
"max_tokens": 4096
}'

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| messages | array | Yes | - | Chat messages, each with role (system, user, assistant) and content |
| stream | bool | No | false | Enable Server-Sent Events streaming |
| max_tokens | int | No | 2048 | Maximum tokens to generate |
| model | string | No | - | Ignored (uses the loaded model) |
| temperature | float | No | - | Sampling temperature |
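
As an illustration of the request fields above, here is a small sketch that builds the JSON body in Python. The function name `build_chat_request` is hypothetical, and the client-side checks are assumptions of this sketch rather than documented server behavior; the defaults simply mirror the table.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chat_request(messages, stream=False, max_tokens=2048, temperature=None):
    """Build the JSON body for POST /v1/chat/completions.

    `model` is omitted because the server ignores it and uses the loaded model.
    """
    if not messages:
        raise ValueError("messages is required and must not be empty")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unsupported role: {message.get('role')!r}")
        if "content" not in message:
            raise ValueError("each message needs a content field")
    body = {"messages": messages, "stream": stream, "max_tokens": max_tokens}
    if temperature is not None:
        # Only send temperature when the caller sets it explicitly.
        body["temperature"] = temperature
    return json.dumps(body)
```

The returned string can be sent as the request body with `Content-Type: application/json`, in place of the inline JSON in the curl example.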
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1234567890,
"model": "evilsocket/Qwen2.5-Coder-1.5B-Instruct",
"choices": [
{
"index": 0,
"message": { "role": "assistant", "content": "The sky appears blue because..." },
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 2,
"completion_tokens": 42,
"total_tokens": 44
}
}

With "stream": true, the response is text/event-stream with SSE chunks:
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"...","choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"...","choices":[{"index":0,"delta":{"content":"The"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc123","object":"chat.completion.chunk","created":1234567890,"model":"...","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
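
The chunk sequence above can be reassembled into the full reply by concatenating each `delta.content` until the `[DONE]` sentinel. The sketch below works on an iterable of already-decoded lines; the function name `collect_stream_text` is hypothetical, and how the lines are obtained (for example from an HTTP client's line iterator) is left to the caller.

```python
import json


def collect_stream_text(lines):
    """Concatenate delta.content from SSE lines until the [DONE] sentinel."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end of stream; ignore anything after it
        chunk = json.loads(payload)
        delta = chunk["choices"][0].get("delta", {})
        content = delta.get("content")
        if content:
            # The first chunk carries only the role and the last an empty
            # delta, so neither contributes text.
            parts.append(content)
    return "".join(parts)
```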
HTTP 404
{"error": "No text model loaded"}

POST /v1/audio/speech
Generate speech from text using VibeVoice TTS. Returns audio bytes directly.
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{
"input": "Hello world, this is a test.",
"voice_path": "/path/to/voice_reference.wav",
"response_format": "wav"
}' -o output.wav

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| input | string | Yes | - | Text to synthesize |
| model | string | No | - | Ignored (uses the loaded model) |
| voice | string | No | - | Voice name (reserved for future use) |
| voice_data | string | No | - | Base64-encoded WAV bytes for voice cloning |
| voice_path | string | No | - | Server-side path to voice prompt file |
| response_format | string | No | "wav" | "wav" (16-bit PCM WAV) or "pcm" (raw f32 little-endian) |
| cfg_scale | float | No | 1.5 | Classifier-free guidance scale (1.0-3.0) |
| max_frames | int | No | 150 | Max speech frames (~133ms each at 7.5Hz) |
| diffusion_steps | int | No | 10 | Diffusion steps per frame (higher = better, slower) |
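
Since max_frames is expressed in frames at 7.5 Hz, it implies an upper bound on clip length. The arithmetic below is a small sketch of that conversion; the helper names are hypothetical, and the result is only a ceiling, since generation may presumably stop earlier.

```python
import math

FRAME_RATE_HZ = 7.5  # frame rate quoted in the max_frames description


def max_speech_seconds(max_frames=150, frame_rate_hz=FRAME_RATE_HZ):
    """Upper bound on audio length for a given max_frames value."""
    return max_frames / frame_rate_hz


def frames_for_seconds(seconds, frame_rate_hz=FRAME_RATE_HZ):
    """Smallest max_frames value that allows `seconds` of audio."""
    return math.ceil(seconds * frame_rate_hz)
```

With the default of 150 frames this works out to 150 / 7.5 = 20 seconds, so longer inputs would need a larger max_frames.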
To clone a voice without server-side files, base64-encode the WAV reference and send it in voice_data:
VOICE_B64=$(base64 -w0 voice_reference.wav)
curl http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d "{
\"input\": \"Hello from a cloned voice.\",
\"voice_data\": \"$VOICE_B64\"
}" -o output.wav

- response_format: "wav" returns Content-Type: audio/wav with a binary WAV body (16-bit PCM, 24 kHz, mono)
- response_format: "pcm" returns Content-Type: audio/pcm with raw f32 little-endian samples at 24 kHz
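
The voice_data request shown earlier in this section can also be assembled in Python, which avoids shell quoting and the `-w0` flag (a GNU coreutils option). This is a minimal sketch; `build_cloned_voice_request` is a hypothetical helper name, and only fields from the table above are used.

```python
import base64
import json


def build_cloned_voice_request(text, wav_bytes, response_format="wav"):
    """Build the JSON body for /v1/audio/speech with an inline voice reference."""
    return json.dumps({
        "input": text,
        # voice_data carries the reference WAV as base64 text.
        "voice_data": base64.b64encode(wav_bytes).decode("ascii"),
        "response_format": response_format,
    })
```

The WAV reference would typically be read with `open("voice_reference.wav", "rb").read()` and the resulting string posted with `Content-Type: application/json`.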
HTTP 404
{"error": "No audio model loaded"}

POST /v1/images/generations
Generate images from text prompts. By default returns raw PNG bytes with Content-Type: image/png.
# Raw PNG (default) — pipe to file or viewer
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "A photorealistic landscape at golden hour"}' \
-o landscape.png
# Base64 JSON (OpenAI-compatible)
curl http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "A landscape", "response_format": "b64_json"}'

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| prompt | string | Yes | - | Text description of the image |
| n | int | No | 1 | Number of images (currently always generates 1) |
| size | string | No | - | Reserved (size controlled by --flux-width/--flux-height) |
| response_format | string | No | "png" | "png" (raw image/png) or "b64_json" (OpenAI JSON envelope) |
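With response_format "png" (the default), the response has Content-Type: image/png and a body of raw PNG bytes. With "b64_json", the response is a JSON envelope: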
{
"created": 1234567890,
"data": [
{
"b64_json": "iVBORw0KGgoAAAANSUh..."
}
]
}

HTTP 404
{"error": "No image model loaded"}

POST /api/v1/image
Original image generation endpoint. Accepts SD/FLUX generation arguments directly.
# Raw PNG
curl http://localhost:8080/api/v1/image \
-H "Content-Type: application/json" \
-d '{
"image_args": {"prompt": "An old man at seaside"},
"response_format": "png"
}' -o output.png
# Base64 JSON (default, backwards-compatible)
curl http://localhost:8080/api/v1/image \
-H "Content-Type: application/json" \
-d '{
"image_args": {
"prompt": "An old man sitting on the chair at seaside",
"sd-num-samples": 1,
"image-seed": 2439383
}
}'

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| image_args | object | Yes | - | SD/FLUX generation arguments |
| response_format | string | No | "b64_json" | "b64_json" (backwards-compatible) or "png" (raw image/png) |
{
"images": ["iVBORw0KGgoAAAANSUh..."]
}

With response_format "png", the response has Content-Type: image/png and a body of raw PNG bytes.
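
The two image endpoints use different JSON envelopes: `data[0].b64_json` for /v1/images/generations and `images[0]` for /api/v1/image. The sketch below decodes the first image from either shape. The helper name `extract_png_bytes` is hypothetical, and the PNG signature check assumes both envelopes carry base64-encoded PNG data, as the sample payloads on this page suggest.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def extract_png_bytes(payload):
    """Decode the first image from either JSON envelope shown on this page."""
    if "data" in payload:
        # /v1/images/generations with response_format "b64_json"
        encoded = payload["data"][0]["b64_json"]
    elif "images" in payload:
        # /api/v1/image default response
        encoded = payload["images"][0]
    else:
        raise ValueError("no image data in response")
    png = base64.b64decode(encoded)
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("decoded data is not a PNG")
    return png
```

The returned bytes can be written to a file opened in "wb" mode, giving the same result as requesting raw PNG output directly.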
GET /v1/models
Returns the currently loaded model(s) in OpenAI-compatible format.
{
"object": "list",
"data": [
{
"id": "evilsocket/Qwen2.5-Coder-1.5B-Instruct",
"object": "model",
"owned_by": "cake"
}
]
}

When no model is loaded, data is an empty array.
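
A client can use this response to find out what is loaded before choosing an endpoint. The following sketch extracts the model ids from an already-parsed response; `loaded_model_ids` is a hypothetical helper name.

```python
def loaded_model_ids(models_response):
    """Return the ids from a parsed GET /v1/models response.

    An empty list means no model is loaded.
    """
    return [entry["id"] for entry in models_response.get("data", [])]
```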
GET /api/v1/topology
Returns detailed cluster topology including master/worker nodes, layer assignments, VRAM usage, and per-tensor metadata.
Protected by --ui-auth if configured.
All endpoints return structured JSON errors:
| Status | Condition |
|---|---|
| 404 | Endpoint requires a model type that isn't loaded |
| 400 | Malformed request body or invalid parameters |
| 500 | Internal inference error |
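
One way to handle these statuses on the client side is to turn them into exceptions that carry the server's `error` message. This is a sketch under the assumption that error bodies follow the `{"error": "..."}` shape used throughout this page; `CakeAPIError` and `check_response` are hypothetical names, and the fallback text is taken from the table above.

```python
import json

# Fallback descriptions, taken from the status table above.
CONDITIONS = {
    404: "endpoint requires a model type that isn't loaded",
    400: "malformed request body or invalid parameters",
    500: "internal inference error",
}


class CakeAPIError(Exception):  # hypothetical name, not part of Cake
    def __init__(self, status, message):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status
        self.message = message


def check_response(status, body_text):
    """Raise CakeAPIError for error statuses; return None otherwise."""
    if status < 400:
        return None
    try:
        message = json.loads(body_text)["error"]
    except (ValueError, KeyError, TypeError):
        # Body was not the expected JSON object; fall back to the table.
        message = CONDITIONS.get(status, "unknown error")
    raise CakeAPIError(status, message)
```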
Error response format:
{"error": "description of what went wrong"}

The chat endpoint is fully compatible with OpenAI client libraries. The /v1/images/generations and /v1/audio/speech endpoints return raw binary by default for efficiency; for images, set response_format: "b64_json" if you need the OpenAI JSON envelope.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="unused")

# Chat (streaming)
response = client.chat.completions.create(
    model="any",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

# Image (b64_json for OpenAI client compatibility)
response = client.images.generate(
    prompt="A rusty robot on a beach",
    response_format="b64_json",
)
print(response.data[0].b64_json[:50])

import requests

# Image: raw PNG
resp = requests.post("http://localhost:8080/v1/images/generations",
                     json={"prompt": "A cat"})
resp.raise_for_status()  # a 404 here means no image model is loaded
with open("cat.png", "wb") as f:
    f.write(resp.content)

# Audio: raw WAV
resp = requests.post("http://localhost:8080/v1/audio/speech",
                     json={"input": "Hello world"})
resp.raise_for_status()  # a 404 here means no audio model is loaded
with open("speech.wav", "wb") as f:
    f.write(resp.content)

# Image: save PNG directly
curl -s http://localhost:8080/v1/images/generations \
-H "Content-Type: application/json" \
-d '{"prompt": "A rusty robot"}' \
-o robot.png
# Audio — save WAV directly
curl -s http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Hello from Cake TTS"}' \
-o speech.wav
# Audio — play immediately (requires ffplay)
curl -s http://localhost:8080/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"input": "Streaming audio", "response_format": "pcm"}' \
| ffplay -f f32le -ar 24000 -ac 1 -
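
When ffplay is not available, the "pcm" output can be wrapped into a WAV file with the Python standard library. The sketch below assumes the stream is mono f32 little-endian at 24 kHz, as described for response_format "pcm", and converts it to 16-bit PCM; `pcm_f32le_to_wav` is a hypothetical helper name.

```python
import array
import io
import struct
import wave

SAMPLE_RATE = 24000  # sample rate documented for the audio endpoint


def pcm_f32le_to_wav(pcm_bytes, sample_rate=SAMPLE_RATE):
    """Convert raw mono f32 little-endian samples to 16-bit PCM WAV bytes."""
    count = len(pcm_bytes) // 4  # 4 bytes per f32 sample; drop any partial tail
    floats = struct.unpack(f"<{count}f", pcm_bytes[:count * 4])
    # Clamp to [-1.0, 1.0] and scale to the signed 16-bit range.
    ints = array.array(
        "h", (int(max(-1.0, min(1.0, s)) * 32767) for s in floats)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(sample_rate)
        wav.writeframes(ints.tobytes())
    return buf.getvalue()
```

Writing the returned bytes to a `.wav` file should give something any standard audio player can open.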