Evaluating and selecting off-the-shelf or fine-tuned models for a specific use case is difficult.
Choosing the right LLM means navigating a minefield of hidden pitfalls:
| Challenge | Why It Hurts |
|---|---|
| Generic benchmarks don't transfer | Public data and metrics often miss the nuances of your real-world requirements. |
| Custom benchmarks are hard to design | Defining representative tasks, building rubrics, and choosing robustness variations is non-trivial. |
| Multi-model, multi-task benchmarks are expensive to execute | Running every candidate model across every task and rubric quickly multiplies cost and compute. |
| Leakage biases results | Public and private benchmark items (or near-duplicates) may lurk in training data, inflating scores via memorization. |
| Ops and cost are complex | Running evaluations across providers, inference modes, and scoring criteria demands careful orchestration. |
Bottom line: You can't trust a leaderboard number, and building your own eval is a project in itself.
Ensemble-based synthetic self-evaluation benchmarking: let the models evaluate each other.
CoEval generates a synthetic evaluation suite spanning multiple domain-specific tasks and scoring rubrics, then assembles an ensemble of models that rotate through three roles:
```
┌──────────────────────────────────────────────────────────────┐
│                        MODEL ENSEMBLE                        │
│                                                              │
│   ┌───────────┐     ┌───────────┐     ┌───────────┐          │
│   │  Model A  │     │  Model B  │     │  Model C  │   ...    │
│   └─────┬─────┘     └─────┬─────┘     └─────┬─────┘          │
│         │                 │                 │                │
│         ▼                 ▼                 ▼                │
│   ┌──────────────────────────────────────────────────┐       │
│   │             ROTATING ROLE ASSIGNMENT             │       │
│   └───────┬─────────────────┬────────────────┬───────┘       │
│           ▼                 ▼                ▼               │
│        TEACHER           STUDENT           JUDGE             │
│   Generate synthetic   Models under     Score outputs        │
│   challenges &         evaluation take  against the          │
│   reference answers    the challenges   rubric               │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
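The round-robin rotation can be sketched in a few lines. This is an illustrative sketch of the idea, not CoEval's internal scheduler; `rotate_roles` and the model names are hypothetical:

```python
def rotate_roles(models: list[str], rounds: int) -> list[dict[str, str]]:
    """Round-robin role assignment: every model plays every role across rounds."""
    roles = ["teacher", "student", "judge"]
    assignments = []
    for r in range(rounds):
        # Shift the role list by one position each round so roles rotate.
        shifted = roles[r % 3:] + roles[:r % 3]
        assignments.append({m: shifted[i % 3] for i, m in enumerate(models)})
    return assignments

schedule = rotate_roles(["model_a", "model_b", "model_c"], rounds=3)
# After 3 rounds, each model has held each of the three roles exactly once.
```

Rotating roles ensures no single model's quirks dominate the benchmark: every model both sets and sits the exam.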
Not all teachers and judges are created equal. CoEval improves signal quality by identifying the strongest candidates for each role:

| Role | Selection Criterion | Intuition |
|---|---|---|
| Teacher | Differentiating: produces challenges that separate student performance | A good exam question reveals who studied. |
| Judge | Consensus: high agreement with the ensemble majority | A reliable judge aligns with peer consensus. |
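Both criteria boil down to simple statistics over the score matrix. Here is a minimal sketch under assumed definitions (differentiation as the spread of mean student scores on a teacher's items; consensus as closeness to the per-item ensemble median, for scores normalized to [0, 1]); neither is claimed to be CoEval's exact formula:

```python
from statistics import median, pstdev

def teacher_differentiation(scores_by_student: dict[str, list[float]]) -> float:
    """Spread of mean student scores on one teacher's items; higher = more separating."""
    means = [sum(s) / len(s) for s in scores_by_student.values()]
    return pstdev(means)

def judge_consensus(scores_by_judge: dict[str, list[float]]) -> dict[str, float]:
    """Per-judge agreement (1.0 = perfect) with the per-item ensemble median score."""
    per_item = list(zip(*scores_by_judge.values()))
    medians = [median(col) for col in per_item]
    return {
        judge: 1.0 - sum(abs(s - m) for s, m in zip(scores, medians)) / len(scores)
        for judge, scores in scores_by_judge.items()
    }
```

A teacher whose items every student aces (or flunks) scores near zero differentiation; a judge who systematically disagrees with the ensemble median scores low consensus and can be down-weighted.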
```
 Fully Automatic        Semi-Automatic         Manual
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Tasks         │     │ Tasks       * │     │ Tasks       * │
│ Rubrics       │ ──► │ Rubrics       │ ──► │ Rubrics     * │
│ Attr. Space   │     │ Attr. Space * │     │ Attr. Space * │
└───────────────┘     └───────────────┘     └───────────────┘
  AI-generated          Human-guided          Human-defined
                                          (* = human-edited)
```
Tasks, rubrics, and diversity/attribute spaces can be provisioned fully automatically, semi-automatically (human-in-the-loop), or manually; choose the level of control that fits your workflow.
CoEval is an end-to-end system, from benchmark design to interactive reporting.
```
┌───────────────────────────────────────────────────────────────┐
│                         C o E v a l                           │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Multi-Vendor Support                                         │
│   ├─ Multiple LLM providers & interfaces out of the box       │
│   └─ Plug in proprietary / self-hosted models                 │
│                                                               │
│  Benchmark Design & Planning                                  │
│   ├─ Automated task & rubric provisioning                     │
│   └─ Run orchestration with cost optimization                 │
│                                                               │
│  Interactive Visual Reports                                   │
│   ├─ Side-by-side model comparison                            │
│   └─ Drill-down into tasks, rubrics & scores                  │
│                                                               │
│  Experiment Tracking                                          │
│   ├─ Easy reruns & parameter sweeps                           │
│   └─ Repair & resume after interruptions                      │
│                                                               │
│  Complete Documentation                                       │
│   ├─ User guides & tutorials                                  │
│   └─ Developer API reference                                  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
| Feature | Description |
|---|---|
| Multi-vendor | Swap providers without changing your eval pipeline. |
| Auto-provisioning | Generate tasks, rubrics, and attribute spaces from a domain description. |
| Orchestration | Schedule and parallelize runs; optimize for cost and latency. |
| Visual reports | Interactive dashboards for deep-dive analysis. |
| Resilient tracking | Resume interrupted experiments; repair partial results. |
| Docs-first | Comprehensive guides for users and contributors alike. |
OpenAI, Anthropic, Google Gemini, Azure OpenAI, Azure AI Inference, AWS Bedrock, Google Vertex AI, OpenRouter, Groq, DeepSeek, Mistral, DeepInfra, Cerebras, Cohere, HuggingFace API, HuggingFace (local), Ollama
See the Providers & Pricing guide for auth setup, batch discounts, and pricing tables for all 18 interfaces.
```bash
# 1. Install
pip install coeval

# 2. Add your API keys (see: docs/tutorial.md § 2)
cp keys.yaml.template keys.yaml   # then fill in your provider keys

# 3. Probe all models (no tokens consumed)
coeval probe --config benchmark/mixed.yaml

# 4. Estimate cost before spending anything
coeval plan --config benchmark/mixed.yaml

# 5. Run the experiment
coeval run --config benchmark/mixed.yaml --continue

# 6. Generate analysis reports
coeval analyze all --run ./eval_runs/mixed-benchmark --out ./reports
```

Example configuration:

```yaml
models:
  - name: gpt-4o-mini
    interface: openai
    parameters: { model: gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [teacher, student, judge]

tasks:
  - name: text_sentiment
    description: Classify the sentiment of a short customer review.
    output_description: A single word, either Positive or Negative.
    target_attributes:
      sentiment: [positive, negative]
      intensity: [mild, strong]
    sampling: { target: [1,1], nuance: [0,1], total: 20 }
    rubric:
      accuracy: "The label matches the actual sentiment of the review."
    evaluation_mode: single

experiment:
  id: sentiment-v1
  storage_folder: ./eval_runs
```

Interactive HTML examples (click to open rendered in browser):
| Example | Description |
|---|---|
| Education Benchmark β Planning View | Full experiment plan: 3 real-dataset tasks + 10 synthetic tasks, 6 models, per-phase call budget, cost table, and attribute maps |
| Mixed Benchmark β Planning View | Mixed benchmark plan: real benchmark datasets + OpenAI models |
| Paper Dual-Track β Planning View | Paper evaluation: dual-track design with benchmark + generative teachers |
Generate your own planning view:
```bash
coeval describe --config my_experiment.yaml --out my_experiment_plan.html
```
| Report | Description |
|---|---|
| Dashboard | Overview dashboard: all reports in one place with top-line rankings and navigation |
| Student Performance Report | Per-student score breakdowns, task rankings, rubric factor heatmaps |
| Judge Consistency Report | Inter-judge ICC agreement, calibration drift, flagged uncertain items |
| Robust Summary Report | Final model rankings with confidence intervals and robust ensemble weights |
| Score Distribution Report | High / Medium / Low histograms filterable by task, teacher, student, and judge |
| Teacher Report | Per-teacher source quality, attribute stratum coverage, data consistency |
| Interaction Matrix | Teacher × Student pair-quality heatmap: spot which combinations succeed or fail |
| Coverage Summary | Attribute Coverage Ratio (ACR) and rare-attribute recall per task |
| Judge Report | Judge-level bias rates, score calibration, inter-rater reliability |
| Annotated Report Guide | Detailed annotated screenshots of every CoEval report, with explanations of each visualization and metric |
Generate all reports from a completed run:
```bash
coeval analyze all --run ./Runs/my-experiment-v1 --out ./reports
```
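The Attribute Coverage Ratio (ACR) reported in the Coverage Summary can be understood as covered strata over target strata. Here is a sketch assuming that reading; the report's exact definition may differ, and `attribute_coverage_ratio` is a hypothetical helper, not CoEval's API:

```python
from itertools import product

def attribute_coverage_ratio(target_attributes: dict[str, list[str]],
                             items: list[dict[str, str]]) -> float:
    """Fraction of target attribute combinations (strata) hit by generated items."""
    keys = list(target_attributes)
    strata = set(product(*(target_attributes[k] for k in keys)))
    covered = {tuple(item[k] for k in keys) for item in items}
    return len(covered & strata) / len(strata)

acr = attribute_coverage_ratio(
    {"sentiment": ["positive", "negative"], "intensity": ["mild", "strong"]},
    [{"sentiment": "positive", "intensity": "mild"},
     {"sentiment": "negative", "intensity": "strong"}],
)  # 2 of 4 strata covered -> 0.5
```

An ACR below 1.0 flags strata the teachers never produced items for, which is exactly what rare-attribute recall then drills into.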
| Guide | What it covers |
|---|---|
| Concepts Glossary | Every first-class concept explained: teacher, student, judge, attributes, rubric, datapoint, slot, phases, wizard, probing, planning, resume, repair, auto interface, batch API, and more |
| Evaluation Experiment Planning and Preparation Guide | End-to-end walkthrough: installation, config design, probing, running, analysis, and benchmark export |
| Command Line Option Reference | Every `coeval` subcommand, flag, and exit code: run, probe, plan, generate, status, models, analyze, describe, wizard, ingest, repair |
| Running Experiments | Phase modes, `--continue`, batch API, quota control, cost estimation, fault recovery, use-case examples |
| Providers & Pricing | All 18 interfaces with auth, batch support, code examples, and pricing tables |
| Analytics & Reports | 11 interactive HTML dashboards, paper-quality result tables, programmatic API, Excel workbook export |
| Configuration Guide | YAML config schema: models, tasks, attributes, rubric, sampling, prompt overrides, experiment settings |
| Benchmark Datasets | Pre-ingested datasets, `coeval ingest`, the `interface: benchmark` virtual teacher, reproducing published results |
| Testing Guide | All 20 test files, how to run each suite, interpreting failures, CI/CD setup |
| System Feature Wishlist | 35-item prioritized roadmap: 10 benchmark additions, 12 system features, 13 new report types |
```
YAML Config → Phase 1: Attribute Mapping    (teachers infer task dimensions)
            → Phase 2: Rubric Mapping       (teachers build evaluation criteria)
            → Phase 3: Data Generation      (teachers produce benchmark items)
            → Phase 4: Response Collection  (students answer benchmark prompts)
            → Phase 5: Evaluation           (judges score student responses)
            → coeval analyze all            (8 HTML reports + Excel workbook)
```
| Cloud (Async Batch) | Cloud (Real-time) | OpenAI-Compatible | Local / Virtual |
|---|---|---|---|
| openai | azure_openai¹ | groq | huggingface |
| anthropic | azure_ai | deepseek | ollama |
| gemini² | bedrock | mistral | benchmark |
| vertex | | deepinfra | |
| | | openrouter | |
| | | cerebras | |

¹ `azure_openai` supports the Azure Global Batch API (50% discount); enable via `batch: azure_openai:` in config.
² `gemini` uses concurrent requests (pseudo-batch); no async discount.
| Capability | Detail |
|---|---|
| Cost estimation | Itemised call budget and cost table before any phases run; Batch API discounts modelled |
| Batch API | 50% async discount for OpenAI, Anthropic, and Azure OpenAI; Gemini uses concurrent mode (no discount) |
| Resume | --continue resumes at exact JSONL record; no duplicate API calls |
| Auto attributes | Teachers infer task dimensions from a description; no hand-labelling required |
| Auto rubric | Teachers propose rubric factors; merge-and-deduplicate across N teachers |
| Multi-judge ensemble | N judges produce bias-resistant aggregate scores; outlier judges are down-weighted |
| 8 HTML reports | Interactive charts, filterable tables, CSV export, fully self-contained (no CDN) |
| Model probe | Verify all 16 interfaces are reachable before spending a dollar |
| Virtual teachers | Pre-ingested public datasets supply zero-cost Phase 3 ground truth |
| Label accuracy | Judge-free exact-match for classification tasks (`label_attributes`) |
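The cost-estimation logic is, at its core, calls × tokens × unit price, with the batch discount applied per provider. A back-of-envelope sketch follows; the function name, prices, and token counts are placeholders, not CoEval's pricing tables or API:

```python
def estimate_phase_cost(calls: int, in_tokens: int, out_tokens: int,
                        price_in: float, price_out: float,
                        batch_discount: float = 0.0) -> float:
    """Estimated USD for one phase; prices are USD per 1M tokens."""
    per_call = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return calls * per_call * (1.0 - batch_discount)

# e.g. 2,000 judge calls at 1,500 input / 200 output tokens,
# hypothetical $0.15 / $0.60 per 1M tokens, 50% async batch discount
cost = estimate_phase_cost(2000, 1500, 200, 0.15, 0.60, batch_discount=0.5)  # -> 0.345
```

Summing this over phases and models yields the itemised cost table shown before any tokens are spent, which is why the 50% async discount can change which providers are affordable for a large run.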
| Component | Files | LoC |
|---|---|---|
| `Code/runner` (pipeline engine) | 59 .py | 15,087 |
| `Code/analyzer` (analysis & reports) | 21 .py | 9,554 |
| `Public/benchmark` (dataset utilities) | 34 .py | 5,211 |
| `Tests` (test suites) | 41 .py | 16,845 |
| `docs` (documentation) | 35 .md | 12,521 |
CoEval · Multi-Model LLM Evaluation Framework
Designed for LLM developers, integrators, and evaluation practitioners who require robust model evaluation and ranking using custom use-case data and metrics.
Copyright (c) 2026 Alexander Apartsin. All rights reserved.
