You can pass a HuggingFace repo ID as the model argument and Cake will download the model automatically (with progress bars). Files are cached in ~/.cache/huggingface/hub/, so subsequent runs skip the download.

    cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct

To pre-download a model without running inference:

    cake pull evilsocket/Qwen2.5-Coder-1.5B-Instruct

For gated models (like LLaMA 3), set the HF_TOKEN environment variable to your HuggingFace token.
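As a minimal sketch of the gated-model flow, the wrapper below refuses to start a download when HF_TOKEN is missing, so the failure shows up before any network request. The function name `pull_gated` is hypothetical and not part of Cake; only `cake pull` and the HF_TOKEN variable come from the text above.

```shell
# Hypothetical helper (not part of Cake): pull a gated model only if
# HF_TOKEN is set, otherwise fail early with a clear message.
pull_gated() {
    if [ -z "${HF_TOKEN:-}" ]; then
        echo "HF_TOKEN is not set; gated models need a HuggingFace token" >&2
        return 1
    fi
    # Documented command: downloads the model without running inference.
    cake pull "$1"
}
```

A typical session would export the token once (`export HF_TOKEN=<your-token>`) and then call `pull_gated` with the repo ID of the gated model.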
You can also pass a local path to a model directory:

    cake serve /path/to/Meta-Llama-3-8B

To quickly test a model with a single prompt (no API server):

    cake run evilsocket/Qwen2.5-Coder-1.5B-Instruct "Why is the sky blue?"

List all models available locally (downloaded from HuggingFace or received from a master):

    cake list

This scans ~/.cache/huggingface/hub/ and ~/.cache/cake/ and shows each model's status (complete or partial), size, and source.
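To inspect a cached model by hand, it helps to know where it lives on disk. The sketch below assumes Cake downloads through the standard HuggingFace hub cache layout, where a repo ID such as `org/name` is stored under a `models--org--name` directory. The helper name `hf_cache_dir` is hypothetical, and the layout of ~/.cache/cake/ is not covered here.

```shell
# Hypothetical helper: map a HuggingFace repo ID to its directory in the
# default hub cache. Assumes the standard "models--<org>--<name>" layout
# and ignores any HF_HOME or cache-location overrides.
hf_cache_dir() {
    printf '%s/.cache/huggingface/hub/models--%s\n' \
        "$HOME" "$(printf '%s' "$1" | sed 's|/|--|g')"
}
```

For example, `du -sh "$(hf_cache_dir evilsocket/Qwen3-0.6B)"` would report the on-disk size of that model if it has been downloaded.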
Delete a cached model:

    cake rm evilsocket/Qwen3-0.6B   # full name
    cake rm Qwen3-0.6B              # short name (auto-matched)

The command shows the model name, path, and size, then asks for confirmation before deleting.
When started with cake serve, Cake hosts a web interface with two tabs, Chat and Cluster:

    cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct

Open http://localhost:8080 in your browser.
Chat tab — Interactive chat with the model. Supports streaming responses, shows token count and tokens/second throughput, and renders markdown in responses.
Cluster tab — Visualizes the distributed inference topology. Shows master and worker nodes, VRAM usage, layer assignments, and per-tensor details. Auto-refreshes every 10 seconds.
To protect the web UI with basic auth:

    cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct --ui-auth user:pass

Cake includes a terminal-based chat client with two modes.

Local mode loads a model and starts an interactive chat (no separate server needed):

    cake chat Qwen/Qwen3-0.6B

Remote mode connects to a running API server:

    cake chat --server http://localhost:8080

Keyboard controls:

| Key | Action |
|---|---|
| Tab | Switch between Chat and Cluster tabs |
| Enter | Send message |
| Shift+Enter | Insert newline |
| Esc / Ctrl+C | Quit |
| PageUp / PageDown | Scroll |
The Chat tab shows streaming responses with real-time tokens/second stats. Models that use <think> tags (e.g. Qwen3) show a "thinking..." indicator with the reasoning streamed in gray, followed by the final response in white. The Cluster tab displays topology info, VRAM usage, and layer distribution across nodes.
Cake exposes an OpenAI-compatible REST API when using cake serve, supporting chat completion, audio/TTS, and image generation. All endpoints are served from the same server, but only those matching the loaded model type produce results; the others return 404.

    cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct

Quick example:

    curl http://localhost:8080/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "stream": true
      }'

See the full REST API Reference for all endpoints, request/response formats, and client library examples.

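With "stream": true, an OpenAI-compatible server normally answers with server-sent events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The sketch below assumes Cake follows that convention (the REST API Reference is the authority on the exact format). The function name `sse_payloads` is hypothetical; it only strips the SSE framing and prints the raw JSON chunks.

```shell
# Hypothetical helper: read an SSE stream on stdin and print the JSON
# payload of each "data:" line, stopping at the "[DONE]" sentinel.
# Assumes OpenAI-style streaming output.
sse_payloads() {
    # Drop carriage returns in case the server sends CRLF line endings.
    tr -d '\r' | while IFS= read -r line; do
        case "$line" in
            "data: [DONE]") break ;;
            "data: "*) printf '%s\n' "${line#data: }" ;;
        esac
    done
}
```

To try it, add `-sN` to the curl command above (silent mode, no output buffering) and pipe its output into `sse_payloads`. Each printed line can then be handed to a JSON tool of your choice.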
| Command | Positional Args | Description |
|---|---|---|
| `cake run <model> [prompt]` | model (optional), prompt (optional) | Run inference, cluster master, or worker |
| `cake serve <model>` | model (required) | Start OpenAI-compatible API server |
| `cake pull <model>` | model (required) | Download model from HuggingFace |
| `cake list` | - | List locally available models |
| `cake chat [model]` | model (optional) | Interactive TUI chat (local or remote) |
| `cake rm <model>` | model (required) | Delete a cached model |
| `cake split` | - | Split model into per-worker bundles |

| Argument | Default | Description |
|---|---|---|
| `--name` | - | Worker name (used with manual topology or zero-config) |
| `--address` | `0.0.0.0:10128` | Worker bind address and port |
| `--system-prompt` | `"You are a helpful AI assistant."` | System prompt |
| `--topology` | - | Topology file for manual cluster setup |
| `-n` / `--sample-len` | `2048` | Max tokens to generate |
| `--temperature` | `1.0` | Sampling temperature (0 = greedy) |
| `--top-k` | - | Top-K sampling |
| `--top-p` | - | Nucleus sampling cutoff |
| `--repeat-penalty` | `1.1` | Repeat token penalty (1.0 = none) |
| `--repeat-last-n` | `128` | Context window for repeat penalty |
| `--seed` | `299792458` | Random seed |
| `--device` | `0` | GPU device index |
| `--cpu` | `false` | Force CPU inference |
| `--dtype` | - | Override dtype (default: f16) |
| `--text-model-arch` | `auto` | Force model architecture (auto, llama, qwen2, qwen3, qwen3-moe, qwen3-5, qwen3-5-moe, phi4, mistral, gemma3, falcon3, ol-mo2, exaone4, lux-tts) |
| `--cluster-key` | - | Zero-config cluster key (or CAKE_CLUSTER_KEY env) |
| `--discovery-timeout` | `10` | Worker discovery timeout in seconds |
| `--min-workers` | `0` | Stop discovery once this many workers found (0 = wait full timeout) |
| `--ui-auth` | - | Basic auth for web UI (user:pass) |
| `--model-type` | `text-model` | Model type (text-model, image-model, audio-model) |
| `--expert-offload` | `false` | Offload MoE expert weights to disk |
| `--tts-reference-audio` | - | WAV file for LuxTTS voice cloning (24kHz mono) |
| `--tts-t-shift` | `1.0` | LuxTTS flow matching time shift |
| `--tts-speed` | `1.0` | LuxTTS speed factor (lower = longer audio) |
| `--tts-token-ids` | - | Pre-computed IPA token IDs file for LuxTTS |
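The sampling arguments can be combined on one command line. As a rough sketch, the wrapper below runs a single prompt with greedy decoding (temperature 0, per the table) and a shorter generation limit. The function name `cake_greedy` and the 256-token default are hypothetical, and it is an assumption that these flags are accepted by `cake run` after the positional arguments.

```shell
# Hypothetical wrapper: one-shot prompt with greedy decoding.
# Assumes `cake run` accepts the listed flags after its positional args.
cake_greedy() {
    model=$1
    prompt=$2
    max_tokens=${3:-256}   # illustrative default, lower than Cake's 2048
    cake run "$model" "$prompt" --temperature 0 --sample-len "$max_tokens"
}
```

For instance, `cake_greedy evilsocket/Qwen2.5-Coder-1.5B-Instruct "Why is the sky blue?"` would expand to a `cake run` call with `--temperature 0 --sample-len 256`.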