Cost-vs-performance evaluation for LLMs running inside OpenHermit.
HermitBench is a 60-task benchmark that measures how well LLMs (routed through OpenRouter) can operate an OpenHermit agent — the open-source agent infrastructure platform with durable Postgres state, multi-channel delivery, fleet management, and a sandboxed workspace.
Unlike CLI-agent benchmarks that test file-and-shell skills in isolation, HermitBench evaluates a model's ability to drive a production-grade agent platform: managing memory across sessions, configuring fleet-wide skills, routing across channels, scheduling cron jobs, and approving sensitive tool calls under an access policy.
| # | Category | Focus |
|---|---|---|
| 01 | CLI Fluency | hermit ... command usage, config, secrets, agent lifecycle |
| 02 | Tool Composition | Multi-tool chains (file + exec + web + memory) |
| 03 | Access Control | Policy rows, approval flow, role-based grants |
| 04 | Channel Routing | Cross-channel session continuity, group routing rules |
| 05 | Memory & Introspection | Long-term memory recall, working memory updates |
| 06 | Scheduling & Automation | Cron jobs, one-shot schedules, run history |
| 07 | Amiko Social | Post drafting, comment voice, AI-twin DMs, feed rank, privacy, group dynamics — Opus-judged for quality |
| 08 | Amiko Pipelines | Personality vibe cards, friend match reports, memory extraction (chat/file), post translation, moderation, twins-take share cards, share copy — Opus-judged for quality |
```bash
# 1. Build the test image (Ubuntu + Postgres + OpenHermit gateway)
bash script/prepare.sh

# 2. Set OpenRouter API key
echo 'OPENROUTER_API_KEY=sk-or-...' >> .env

# 3. (Optional) Pick a judge model for categories 07 and 08.
#    Tasks in these categories grade prose quality via an LLM-as-judge call
#    to OpenRouter using JUDGE_MODEL (default: anthropic/claude-opus-4.7).
echo 'JUDGE_MODEL=anthropic/claude-opus-4.7' >> .env

# 4. Run all tasks against a model
bash script/run.sh --category all --parallel 4 \
  --model anthropic/claude-sonnet-4.6
```

Categories 07 and 08 require `OPENROUTER_API_KEY` (for the judge call) and read `JUDGE_MODEL` from the environment. Each judge call costs roughly $0.01 with `anthropic/claude-opus-4.7`; for a cheaper test run, set `JUDGE_MODEL=anthropic/claude-haiku-4.7` or similar.
Per-task results land under output/<category>/<task_id>/<model_timestamp_runid>/ with score.json, usage.json, gateway.log, and the seeded Postgres dump.
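A minimal sketch of gathering those per-task results, assuming each `score.json` is a JSON object with an `overall_score` number (the full schema is not documented here, so treat the field access as an assumption). `collect_scores` and `mean_score` are hypothetical helpers, not part of the HermitBench scripts. A missing or unreadable score counts as 0.0, and underscore-prefixed directories such as `_excluded` are skipped.

```python
import json
from pathlib import Path


def collect_scores(output_root):
    """Map each run directory (relative path) to its overall_score."""
    root = Path(output_root)
    scores = {}
    # Layout from the README: output/<category>/<task_id>/<run>/score.json
    for score_file in sorted(root.glob("*/*/*/score.json")):
        rel = score_file.parent.relative_to(root)
        # Skip bookkeeping directories such as output/_excluded/.
        if any(part.startswith("_") for part in rel.parts):
            continue
        try:
            data = json.loads(score_file.read_text())
            # Assumption: the score lives under the "overall_score" key.
            value = float(data.get("overall_score", 0.0))
        except (OSError, ValueError, TypeError, AttributeError):
            # Unreadable or malformed results are scored as zero.
            value = 0.0
        scores[rel.as_posix()] = value
    return scores


def mean_score(scores):
    """Average the collected scores; an empty sweep averages to 0.0."""
    return sum(scores.values()) / len(scores) if scores else 0.0
```

Calling `mean_score(collect_scores("output"))` would give one unweighted average over every run found; the project's own `script/report.py` may aggregate differently.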
Six multimodal "lite/nano/flash" models, 79 tasks each, all routed through OpenRouter. Categories 07 + 08 graded by anthropic/claude-opus-4.7 as judge. meta-llama/llama-4-maverick was excluded post-hoc because OpenRouter returned No endpoints found that support tool use on every task; raw summary kept under output/_excluded/.
| Rank | Model | Overall Score | Mean Time | Mean Cost/Task | Total Cost (79 tasks) |
|---|---|---|---|---|---|
| 1 | google/gemini-3.1-flash-lite | 37.3% | 14.7s | $0.0006 | $0.049 |
| 2 | openai/gpt-5-nano | 36.8% | 1m 1s | $0.0023 | $0.179 |
| 3 | qwen/qwen3.6-flash | 36.7% | 8.4s | $0.0054 | $0.423 |
| 4 | amazon/nova-2-lite-v1 | 33.0% | 1m 1s | $0.0050 | $0.394 |
| 5 | google/gemini-2.5-flash-lite | 32.3% | 15.7s | $0.0014 | $0.108 |
| 6 | amazon/nova-lite-v1 | 28.7% | 14.6s | $0.0026 | $0.204 |
The `gemini-3.1-flash-lite` cost is derived from token counts × OpenRouter's listed pricing ($0.25/M input, $1.5/M output): OpenRouter doesn't return a `cost` field for this model in its chat-completions response, but token counts are present and the math is deterministic. Tasks where the agent or grader errored count as overall_score=0.0: nova-2-lite-v1 (7), gemini-2.5-flash-lite (9), gemini-3.1-flash-lite (3), gpt-5-nano (3), nova-lite-v1 (2).
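The token-based cost derivation described above can be sketched as follows. The two prices are the ones quoted in the text; `derive_cost` and its parameter names are hypothetical, and how HermitBench actually reads token counts out of its logs is not shown here.

```python
# Per-million-token prices quoted for gemini-3.1-flash-lite.
INPUT_USD_PER_M = 0.25
OUTPUT_USD_PER_M = 1.5


def derive_cost(prompt_tokens, completion_tokens,
                input_usd_per_m=INPUT_USD_PER_M,
                output_usd_per_m=OUTPUT_USD_PER_M):
    """Return the USD cost of one call from its token counts."""
    total = (prompt_tokens * input_usd_per_m
             + completion_tokens * output_usd_per_m)
    # Prices are per million tokens, so scale down at the end.
    return total / 1_000_000


# Illustrative token counts, not taken from any real run.
example_cost = derive_cost(prompt_tokens=2000, completion_tokens=400)
```

With these illustrative counts the input side contributes 2000 × $0.25/M = $0.0005 and the output side 400 × $1.5/M = $0.0006, for $0.0011 in total.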
Takeaway: at the cheap-multimodal tier, google/gemini-3.1-flash-lite is the Pareto winner, with the highest score (37.3%) at the lowest cost ($0.049 for the full 79-task sweep, ~3.7× cheaper than gpt-5-nano and ~8.7× cheaper than qwen3.6-flash). gpt-5-nano and qwen3.6-flash score within 0.6 points of it (36.8% and 36.7%) but cost noticeably more per task. All six models struggle on 02_Tool_Composition (multi-step tool chains) and excel on 03_Access_Control (one-shot refusal patterns); the heatmap shows the spread clearly.
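To make the Pareto claim checkable, here is a small sketch that computes the non-dominated set from the score and total-cost columns of the results table. The `pareto_front` helper is hypothetical and not part of `script/report.py`; a model is treated as dominated when another model scores at least as high and costs no more, with at least one of the two strictly better.

```python
# (overall score in %, total cost in USD for the sweep), from the results table.
RESULTS = {
    "google/gemini-3.1-flash-lite": (37.3, 0.049),
    "openai/gpt-5-nano": (36.8, 0.179),
    "qwen/qwen3.6-flash": (36.7, 0.423),
    "amazon/nova-2-lite-v1": (33.0, 0.394),
    "google/gemini-2.5-flash-lite": (32.3, 0.108),
    "amazon/nova-lite-v1": (28.7, 0.204),
}


def pareto_front(results):
    """Return the models that no other model beats on both score and cost."""
    front = []
    for name, (score, cost) in results.items():
        dominated = any(
            other_score >= score and other_cost <= cost
            and (other_score > score or other_cost < cost)
            for other, (other_score, other_cost) in results.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


front = pareto_front(RESULTS)
```

On the table's numbers the front contains only google/gemini-3.1-flash-lite, since it has both the highest score and the lowest total cost.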
Regenerate everything with python3 script/report.py after any new sweep.
Each task runs in its own Docker container that boots Postgres and an OpenHermit gateway. A per-task seed.sql plus workspace files establish a starting agent state. The runner sets model.provider=openrouter and model.model=<args.model>, then sends the task prompt to the agent via the OpenHermit SDK. After the agent exits (or times out), a grader Python function queries the Postgres state and the filesystem to compute scores. Token usage and cost are extracted from gateway logs.
HermitBench's harness (Dockerized per-task containers, a markdown task format with embedded graders, and OpenRouter-driven cost-vs-performance reporting) is derived from WildClawBench, which runs the same kind of evaluation against OpenClaw. If you use HermitBench in research, please also cite the upstream work:
```bibtex
@article{ding2026wildclawbench,
  title={WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation},
  author={Ding, Shuangrui and Dai, Xuanlang and Xing, Long and Ding, Shengyuan and Liu, Ziyu and JingYi, Yang and Yang, Penghui and Zhang, Zhixiong and Wei, Xilin and Fang, Xinyu and others},
  journal={arXiv preprint arXiv:2605.10912},
  year={2026}
}
```

For machine-readable HermitBench citation metadata see CITATION.cff; GitHub will use this file to populate the repository's "Cite this repository" panel.
MIT.




