This guide is the practical playbook for running Flow MoonBit AI tasks at the lowest possible invocation latency.
It covers:

- when to use `f` vs `fai`
- how daemon mode actually works
- exact env knobs for tuning
- a benchmark workflow to validate improvements
- troubleshooting for common regressions
Flow supports multiple runtime paths for `.ai/tasks/*.mbt`:

- `moon run` path:

  ```sh
  FLOW_AI_TASK_RUNTIME=moon-run f ai:flow/dev-check
  ```

  Highest flexibility, highest overhead.

- Cached binary path through `f`:

  ```sh
  FLOW_AI_TASK_RUNTIME=cached f ai:flow/dev-check
  ```

  Uses the build cache, still pays full `f` process startup.

- Daemon path through `f`:

  ```sh
  f tasks run-ai --daemon ai:flow/dev-check
  ```

  Uses `ai-taskd` over a Unix socket, still pays `f` process startup.

- Fast daemon client (`fai`):

  ```sh
  fai ai:flow/dev-check
  ```

  Lowest invocation overhead for hot loops.
From `~/code/flow`:

```sh
f install-ai-fast-client
f tasks daemon start
```

What this gives you:

- `~/.local/bin/fai` installed (low-overhead client)
- `ai-taskd` running and warm (`~/.flow/run/ai-taskd.sock`)

Verify:

```sh
which fai
fai --help
f tasks daemon status
fai ai:flow/noop
```

For an always-on daemon across login sessions (recommended for stable latency):
```sh
f ai-taskd-launchd-install
f ai-taskd-launchd-status
```

How a `fai` invocation flows:

- `fai` sends a compact request to `~/.flow/run/ai-taskd.sock`
- `ai-taskd` resolves the task selector (fast exact path first)
- `ai-taskd` reuses a cached binary artifact when available
- the task process runs with `FLOW_AI_TASK_PROJECT_ROOT` set
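The exchange above can be sketched with a toy in-process server standing in for `ai-taskd`. The JSON field names (`selector`, `root`, `exit_code`) are illustrative assumptions, not the real wire schema:

```python
import json
import socket
import tempfile
import threading
from pathlib import Path

def handle_one(srv: socket.socket) -> None:
    # Toy stand-in for ai-taskd: accept one connection, reply with exit_code 0.
    conn, _ = srv.accept()
    request = json.loads(conn.recv(65536).decode())
    conn.sendall(json.dumps({"selector": request["selector"], "exit_code": 0}).encode())
    conn.close()

def run_task(sock_path: str, selector: str, root: str) -> dict:
    # Client side of the exchange: one compact request, one reply, then hang up.
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
        cli.connect(sock_path)
        cli.sendall(json.dumps({"selector": selector, "root": root}).encode())
        return json.loads(cli.recv(65536).decode())

with tempfile.TemporaryDirectory() as d:
    sock_path = str(Path(d) / "ai-taskd.sock")
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(sock_path)   # bind before the client thread can race the connect
    srv.listen(1)
    t = threading.Thread(target=handle_one, args=(srv,))
    t.start()
    reply = run_task(sock_path, "ai:flow/noop", str(Path.home()))
    t.join()
    srv.close()
```

The point of the design is that the client does almost nothing: no task discovery, no build-cache probing, just one round trip on an already-warm daemon.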
Daemon-side caches and fast paths:

- daemon discovery cache with TTL: `FLOW_AI_TASKD_DISCOVERY_TTL_MS` (default `750`)
- daemon artifact cache with TTL: `FLOW_AI_TASKD_ARTIFACT_TTL_MS` (default `1500`)
- fast selector resolution: exact selectors skip full recursive task discovery
- faster cache key computation: file metadata fingerprints instead of full content hashing
- Moon version cached on disk with TTL

Moon version knobs:

- `FLOW_AI_TASK_MOON_VERSION` (explicit override)
- `FLOW_AI_TASK_MOON_VERSION_TTL_SECS` (default `43200`)
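A minimal sketch of the two caching tricks above, under assumed semantics: stat-based cache keys instead of content hashes, and a single-value TTL gate like the one `FLOW_AI_TASKD_DISCOVERY_TTL_MS` controls:

```python
import os
import time

def metadata_fingerprint(path: str) -> tuple:
    # Cheap cache key: one stat() call, no file read, no content hash.
    # Any rewrite that changes size or mtime_ns invalidates the key.
    st = os.stat(path)
    return (path, st.st_size, st.st_mtime_ns)

class TtlCache:
    """Single-value cache gated by a TTL, like the discovery/artifact caches."""

    def __init__(self, ttl_ms: int):
        self.ttl_s = ttl_ms / 1000.0
        self._value = None
        self._stamp = float("-inf")

    def get(self, compute):
        # Recompute only when the cached value is older than the TTL.
        now = time.monotonic()
        if now - self._stamp > self.ttl_s:
            self._value = compute()
            self._stamp = now
        return self._value

calls = []
cache = TtlCache(ttl_ms=750)
cache.get(lambda: calls.append("discovery") or "tasks")
hit = cache.get(lambda: calls.append("discovery") or "tasks")
# Within the TTL window the second lookup never recomputes.
```

The trade-off is the usual one: a stat fingerprint can miss a same-size, same-mtime rewrite, which content hashing would catch, but it turns a full file read into a single syscall.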
Wire protocol knobs:

- `fai --protocol msgpack` (default)
- `fai --protocol json` (compat / debugging)
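The real default payload encoding is msgpack; this stdlib-only sketch shows the same length-prefixed framing idea with a JSON payload instead, and the 4-byte big-endian length header is an assumption about the framing, not the documented format:

```python
import json
import struct

def frame(msg: dict) -> bytes:
    # 4-byte big-endian length header, then the encoded payload.
    payload = json.dumps(msg, separators=(",", ":")).encode()
    return struct.pack(">I", len(payload)) + payload

def unframe(buf: bytes) -> tuple:
    # Decode the first frame; return (message, remaining bytes).
    (n,) = struct.unpack_from(">I", buf)
    return json.loads(buf[4:4 + n]), buf[4 + n:]

# Back-to-back frames decode cleanly, which is what lets a pooled client
# stream multiple requests over one connection.
wire = frame({"selector": "ai:flow/noop"}) + frame({"selector": "ai:flow/dev-check"})
first, rest = unframe(wire)
second, tail = unframe(rest)
```

Swapping `json` for msgpack keeps the framing identical while shrinking and speeding up the payload encode/decode step.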
`f` can optionally route AI task dispatch through the fast client when daemon mode is enabled.

Required:

```sh
export FLOW_AI_TASK_DAEMON=1
export FLOW_AI_TASK_FAST_CLIENT=1
```

Optional selector control:

```sh
export FLOW_AI_TASK_FAST_SELECTORS='ai:flow/noop,ai:flow/bench-cli,ai:project/*'
```

Optional client binary override:

```sh
export FLOW_AI_TASK_FAST_CLIENT_BIN="$HOME/.local/bin/fai"
```

Without `FLOW_AI_TASK_FAST_CLIENT=1`, `f` keeps its normal daemon behavior.
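The selector spec mixes exact names with globs (`ai:project/*`). A sketch of plausible matching semantics, an assumption rather than `f`'s actual rules:

```python
from fnmatch import fnmatchcase

def fast_selector_allowed(selector: str, spec: str) -> bool:
    # Split the comma-separated spec; each entry is an exact selector
    # or a shell-style glob pattern.
    patterns = [p.strip() for p in spec.split(",") if p.strip()]
    return any(fnmatchcase(selector, p) for p in patterns)

spec = "ai:flow/noop,ai:flow/bench-cli,ai:project/*"
```

Under these semantics, `ai:flow/noop` and any `ai:project/...` selector would route through `fai`, while `ai:flow/dev-check` would fall back to normal daemon dispatch.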
Usage:

```sh
fai [--root PATH] [--socket PATH] [--protocol json|msgpack] [--no-cache] [--capture-output] [--timings] <selector> [-- <args...>]
fai [--root PATH] [--socket PATH] [--protocol json|msgpack] [--no-cache] [--capture-output] [--timings] --batch-stdin
```

Examples:

```sh
fai ai:flow/noop
fai --root ~/code/flow ai:flow/bench-cli -- --iterations 50
fai --no-cache ai:flow/dev-check
fai --timings ai:flow/noop
printf 'ai:flow/noop\nai:flow/noop\n' | fai --batch-stdin
```

Notes:

- the default is no-capture mode for lower overhead
- use `--capture-output` if you need command output returned through the client response
- use `--timings` to print server-side phase timings (`resolve_us`, `run_us`, `total_us`)
- use `--batch-stdin` for pooled client bursts (a single client process, multiple requests)
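The win from `--batch-stdin` is amortizing process startup across requests. In this runnable illustration, `echo` and `cat` stand in for `fai` so the sketch works anywhere; with the real client you would stream selectors into a single `fai --batch-stdin` process:

```python
import subprocess

selectors = ["ai:flow/noop"] * 5

# One child process per request: pays process startup five times.
per_request = [
    subprocess.run(["echo", s], capture_output=True, text=True).stdout.strip()
    for s in selectors
]

# Pooled burst: one child process, all requests streamed over stdin.
pooled = subprocess.run(
    ["cat"], input="\n".join(selectors) + "\n", capture_output=True, text=True
).stdout.splitlines()
```

Same work either way; the pooled variant just pays fork/exec once, which is exactly what matters in a hot loop.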
Run the baseline runtime benchmark:

```sh
f bench-ai-runtime --iterations 80 --warmup 10 --json-out /tmp/flow_ai_runtime.json
```

This includes:

- `moon_run_noop`
- `cached_noop`
- `daemon_cached_noop`
- `cached_binary_direct`
- `daemon_client_noop` (if the `ai-taskd-client` binary is present)
For focused hot-loop comparisons:

```sh
python3 - <<'PY'
import statistics
import subprocess
import time
from pathlib import Path

root = Path('~/code/flow').expanduser()
cases = [
    ('f_daemon', ['./target/debug/f', 'tasks', 'run-ai', '--daemon', 'ai:flow/noop']),
    ('fai', ['fai', 'ai:flow/noop']),
    ('f_cached', ['./target/debug/f', 'ai:flow/noop']),
]
for name, cmd in cases:
    xs = []
    for i in range(60):
        t0 = time.perf_counter()
        p = subprocess.run(cmd, cwd=root, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        dt = (time.perf_counter() - t0) * 1000
        if p.returncode != 0:
            raise SystemExit((name, p.returncode))
        if i >= 10:  # first 10 iterations are warmup
            xs.append(dt)
    xs = sorted(xs)
    pct = lambda q: xs[int((len(xs) - 1) * q)]  # nearest-rank percentile
    print(name, 'p50', round(pct(0.5), 2), 'p95', round(pct(0.95), 2), 'mean', round(statistics.mean(xs), 2))
PY
```

Use this default profile for the lowest latency:
```sh
export FLOW_AI_TASK_DAEMON=1
export FLOW_AI_TASK_FAST_CLIENT=1
export FLOW_AI_TASK_FAST_SELECTORS='ai:flow/*'
f tasks daemon start
```

Then:

- latency-critical loops: `fai ai:...`
- normal dev ergonomics: `f ai:...` (auto fast-client when selectors match)
If the daemon is not running, start it:

```sh
f tasks daemon start
```

Check:

```sh
f tasks daemon status
ls -l ~/.flow/run/ai-taskd.sock
```

Or install the persistent daemon:
```sh
f ai-taskd-launchd-install
```

If a selector fails to resolve, use the full selector:

```sh
f tasks list | rg '^ai:'
fai ai:flow/dev-check
```
If latency regresses, check for system load and daemon health:

```sh
ps -Ao pcpu,pmem,comm | sort -k1 -nr | head -n 20
f tasks daemon status
```

Then rerun the benchmark with warmup.

If command output is missing, remember that `fai` skips capture by default; use `--capture-output` on `fai` for output-capture parity with `f`.
To inspect per-request timings:

```sh
fai --timings ai:flow/noop
FLOW_AI_TASKD_TIMINGS_LOG=1 f tasks daemon serve
```

Implemented in this iteration:

- always-on daemon support via launchd tasks (`ai-taskd-launchd-*`)
- binary request framing support (`msgpack`) in `fai` + `ai-taskd`
- pooled client burst mode via `fai --batch-stdin`
- per-request stage timings exposed via `fai --timings` and daemon timing logs
Potential next frontier:
- keep a persistent client-side socket session with framed multi-request protocol
- add lock-free shared-memory ring for local burst dispatch if socket overhead becomes dominant
- push per-stage timing aggregation into benchmark JSON outputs for automatic regression gating