Usage

Downloading Models

You can pass a HuggingFace repo ID as the model argument and Cake will download the model automatically (with progress bars). Files are cached in ~/.cache/huggingface/hub/, so subsequent runs skip the download.

cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct
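
To check from a script whether a model is already in the cache, a minimal sketch follows. It assumes the standard Hugging Face hub layout (a `models--org--name` folder containing a `snapshots/` directory) and the default cache path mentioned above. The helper names `repo_cache_dir` and `is_cached` are hypothetical and not part of Cake.

```python
from pathlib import Path

# Assumed default cache root, taken from the path mentioned above.
HF_HUB_CACHE = Path.home() / ".cache" / "huggingface" / "hub"


def repo_cache_dir(repo_id: str, cache_root: Path = HF_HUB_CACHE) -> Path:
    """Map a repo ID such as 'org/name' to its cache folder.

    Assumes the Hugging Face hub naming scheme: 'models--org--name'.
    """
    return cache_root / ("models--" + repo_id.replace("/", "--"))


def is_cached(repo_id: str, cache_root: Path = HF_HUB_CACHE) -> bool:
    """Return True if the repo has a snapshots/ folder with at least one revision."""
    snapshots = repo_cache_dir(repo_id, cache_root) / "snapshots"
    return snapshots.is_dir() and any(snapshots.iterdir())
```

This only detects that some revision exists on disk; it cannot tell a complete download from a partial one.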

To pre-download a model without running inference:

cake pull evilsocket/Qwen2.5-Coder-1.5B-Instruct

For gated models (like LLaMA 3), set the HF_TOKEN environment variable to your HuggingFace access token.

You can also pass a local path to a model directory:

cake serve /path/to/Meta-Llama-3-8B

Single Prompt

To quickly test a model with a single prompt (no API server):

cake run evilsocket/Qwen2.5-Coder-1.5B-Instruct "Why is the sky blue?"

Listing Local Models

List all models available locally (downloaded from HuggingFace or received from a master):

cake list

This scans ~/.cache/huggingface/hub/ and ~/.cache/cake/ and shows each model's status (complete or partial), size, and source.

Delete a cached model:

cake rm evilsocket/Qwen3-0.6B    # full name
cake rm Qwen3-0.6B               # short name (auto-matched)

The command shows the model name, path, and size, then asks for confirmation before deleting.

Web UI

When started with cake serve, Cake serves a web interface with two tabs, Chat and Cluster. Start the server:

cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct

Open http://localhost:8080 in your browser.

Chat tab — Interactive chat with the model. Supports streaming responses, shows token count and tokens/second throughput, and renders markdown in responses.

Cluster tab — Visualizes the distributed inference topology. Shows master and worker nodes, VRAM usage, layer assignments, and per-tensor details. Auto-refreshes every 10 seconds.

To protect the web UI with basic auth:

cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct --ui-auth user:pass
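
A script that needs to reach a protected UI would have to send the same credentials. The sketch below assumes `--ui-auth` implements standard HTTP Basic authentication (an `Authorization: Basic <base64>` header), which this page does not confirm. The helper names are hypothetical.

```python
import base64
import urllib.request


def basic_auth_header(credentials: str) -> str:
    """Build an HTTP Basic 'Authorization' header value from 'user:pass'."""
    token = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return f"Basic {token}"


def build_ui_request(url: str, credentials: str) -> urllib.request.Request:
    """Prepare (but do not send) a GET request carrying the auth header."""
    return urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(credentials)}
    )


# Example (requires a running server started with --ui-auth user:pass):
# req = build_ui_request("http://localhost:8080", "user:pass")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```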

TUI Chat

Cake includes a terminal-based chat client with two modes:

Local mode — loads a model and starts interactive chat (no separate server needed):

cake chat Qwen/Qwen3-0.6B

Remote mode — connects to a running API server:

cake chat --server http://localhost:8080

Keyboard controls:

Key	Action
`Tab`	Switch between Chat and Cluster tabs
`Enter`	Send message
`Shift+Enter`	Insert newline
`Esc` / `Ctrl+C`	Quit
`PageUp` / `PageDown`	Scroll

The Chat tab shows streaming responses with real-time tokens/second stats. Models that use <think> tags (e.g. Qwen3) show a "thinking..." indicator with the reasoning streamed in gray, followed by the final response in white. The Cluster tab displays topology info, VRAM usage, and layer distribution across nodes.
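
A client that consumes raw model output may want to make the same split between reasoning and answer. The sketch below assumes the `<think>` and `</think>` tags appear literally in the completed response text. `split_reasoning` is a hypothetical helper, not how the TUI itself is implemented.

```python
import re

# Non-greedy match so multiple <think> blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer) from a completed model response."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer
```

A response without any `<think>` block comes back with an empty reasoning string and the text unchanged apart from trimmed whitespace.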

API

When started with cake serve, Cake exposes an OpenAI-compatible REST API that supports chat completion, audio/TTS, and image generation. All endpoints are served from the same server, but only those matching the loaded model type produce results; the others return 404.

cake serve evilsocket/Qwen2.5-Coder-1.5B-Instruct

Quick example:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "stream": true
}'
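
The same request can be made from Python using only the standard library. This sketch assumes the stream follows OpenAI's server-sent-event convention (`data: {json}` lines carrying `choices[].delta.content`, terminated by `data: [DONE]`); the exact chunk shape Cake emits is not confirmed on this page. `iter_deltas` and `stream_chat` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from server-sent-event lines.

    Assumes OpenAI-style chunks: 'data: {json}' lines ending with 'data: [DONE]'.
    """
    for raw in lines:
        line = raw.decode("utf-8").strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                yield content


def stream_chat(prompt: str, base_url: str = "http://localhost:8080") -> Iterator[str]:
    """POST a streaming chat completion and yield the text as it arrives."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        yield from iter_deltas(resp)


# Example (requires a running `cake serve`):
# for piece in stream_chat("Why is the sky blue?"):
#     print(piece, end="", flush=True)
```

Keeping the parser separate from the HTTP call makes it possible to test the parsing against recorded lines without a server.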

See the full REST API Reference for all endpoints, request/response formats, and client library examples.

CLI Arguments

Subcommands

Command	Positional Args	Description
`cake run [model] [prompt]`	`model` (optional), `prompt` (optional)	Run inference, or start a cluster master or worker
`cake serve <model>`	`model` (required)	Start OpenAI-compatible API server
`cake pull <model>`	`model` (required)	Download model from HuggingFace
`cake list`	-	List locally available models
`cake chat [model]`	`model` (optional)	Interactive TUI chat (local or remote)
`cake rm <model>`	`model` (required)	Delete a cached model
`cake split`	-	Split model into per-worker bundles

Flags

Argument	Default	Description
`--name`	-	Worker name (used with manual topology or zero-config)
`--address`	`0.0.0.0:10128`	Worker bind address and port
`--system-prompt`	`"You are a helpful AI assistant."`	System prompt
`--topology`	-	Topology file for manual cluster setup
`-n` / `--sample-len`	`2048`	Max tokens to generate
`--temperature`	`1.0`	Sampling temperature (0 = greedy)
`--top-k`	-	Top-K sampling
`--top-p`	-	Nucleus sampling cutoff
`--repeat-penalty`	`1.1`	Repeat token penalty (1.0 = none)
`--repeat-last-n`	`128`	Context window for repeat penalty
`--seed`	`299792458`	Random seed
`--device`	`0`	GPU device index
`--cpu`	`false`	Force CPU inference
`--dtype`	-	Override dtype (default: f16)
`--text-model-arch`	`auto`	Force model architecture (`auto`, `llama`, `qwen2`, `qwen3`, `qwen3-moe`, `qwen3-5`, `qwen3-5-moe`, `phi4`, `mistral`, `gemma3`, `falcon3`, `ol-mo2`, `exaone4`, `lux-tts`)
`--cluster-key`	-	Zero-config cluster key (or `CAKE_CLUSTER_KEY` env)
`--discovery-timeout`	`10`	Worker discovery timeout in seconds
`--min-workers`	`0`	Stop discovery once this many workers are found (0 = wait for the full timeout)
`--ui-auth`	-	Basic auth for web UI (`user:pass`)
`--model-type`	`text-model`	Model type (`text-model`, `image-model`, `audio-model`)
`--expert-offload`	`false`	Offload MoE expert weights to disk
`--tts-reference-audio`	-	WAV file for LuxTTS voice cloning (24kHz mono)
`--tts-t-shift`	`1.0`	LuxTTS flow matching time shift
`--tts-speed`	`1.0`	LuxTTS speed factor (lower = longer audio)
`--tts-token-ids`	-	Pre-computed IPA token IDs file for LuxTTS
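
To illustrate how the sampling flags above typically interact, here is a pure-Python sketch of a common sampling pipeline: repeat penalty, then temperature, then Top-K, then Top-P. It is an illustration under assumptions, not Cake's sampler; the order of steps and the penalty formula (divide positive scores, multiply negative ones) follow a widespread convention that Cake may or may not share. `sample_token` is a hypothetical function, and `recent_tokens` stands in for the last `--repeat-last-n` generated token IDs.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(
    logits: Sequence[float],
    recent_tokens: Sequence[int] = (),
    temperature: float = 1.0,
    top_k: Optional[int] = None,
    top_p: Optional[float] = None,
    repeat_penalty: float = 1.1,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick the next token ID from raw logits."""
    if rng is None:
        rng = random.Random()
    scores = list(logits)

    # Repeat penalty: make recently generated tokens less likely.
    for tok in set(recent_tokens):
        s = scores[tok]
        scores[tok] = s / repeat_penalty if s >= 0 else s * repeat_penalty

    # Temperature 0 means greedy decoding: take the highest score.
    if temperature <= 0:
        return max(range(len(scores)), key=scores.__getitem__)

    # Softmax over temperature-scaled scores (shifted for numerical stability).
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    ranked = sorted(((w / total, i) for i, w in enumerate(weights)), reverse=True)

    # Top-K: keep only the K most probable tokens.
    if top_k is not None and top_k > 0:
        ranked = ranked[:top_k]

    # Top-P (nucleus): keep the smallest prefix whose mass reaches top_p.
    if top_p is not None:
        kept, mass = [], 0.0
        for prob, idx in ranked:
            kept.append((prob, idx))
            mass += prob
            if mass >= top_p:
                break
        ranked = kept

    # Draw from the surviving tokens; choices() renormalizes the weights.
    probs, ids = zip(*ranked)
    return rng.choices(ids, weights=probs, k=1)[0]
```

Passing a seeded generator, for example `random.Random(299792458)`, makes repeated runs reproducible, which is the role `--seed` plays on the command line.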
