Evaluating and selecting off-the-shelf or fine-tuned models for a specific use case is difficult.
Choosing the right LLM means navigating a minefield of hidden pitfalls:
| Challenge | Why It Hurts |
|---|---|
| Generic benchmarks don't transfer | Public data and metrics often miss the nuances of your real-world requirements. |
| Custom benchmarks are hard to design | Defining representative tasks, building rubrics, and choosing robustness variations is non-trivial. |
| Multi-model, multi-task benchmarks are expensive to execute | Running every candidate model across every task and rubric quickly multiplies cost and compute. |
| Leakage biases results | Public and private benchmark items (or near-duplicates) may lurk in training data, inflating scores via memorization. |
| Ops and cost are complex | Running evaluations across providers, inference modes, and scoring criteria demands careful orchestration. |
Bottom line: You can't trust a leaderboard number, and building your own eval is a project in itself.
Ensemble-based synthetic self-evaluation benchmarking: let the models evaluate each other.
CoEval generates a synthetic evaluation suite spanning multiple domain-specific tasks and scoring rubrics, then assembles an ensemble of models that rotate through three roles:
```
┌──────────────────────────────────────────────────────────────┐
│                        MODEL ENSEMBLE                        │
│                                                              │
│   ┌───────────┐     ┌───────────┐     ┌───────────┐          │
│   │  Model A  │     │  Model B  │     │  Model C  │   ...    │
│   └─────┬─────┘     └─────┬─────┘     └─────┬─────┘          │
│         │                 │                 │                │
│         ▼                 ▼                 ▼                │
│   ┌──────────────────────────────────────────────────┐       │
│   │             ROTATING ROLE ASSIGNMENT             │       │
│   └───────┬─────────────────┬────────────────┬───────┘       │
│           ▼                 ▼                ▼               │
│        TEACHER           STUDENT           JUDGE             │
│   Generate synthetic   Models under     Score outputs        │
│   challenges &         evaluation take  against the          │
│   reference answers    the challenges   rubric               │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
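The round-robin rotation can be sketched in a few lines. This is an illustrative sketch of the idea, not CoEval's internal scheduler; `rotate_roles` and the model names are hypothetical:

```python
def rotate_roles(models: list[str], rounds: int) -> list[dict[str, str]]:
    """Round-robin role assignment: every model plays every role across rounds."""
    roles = ["teacher", "student", "judge"]
    assignments = []
    for r in range(rounds):
        # Shift the role list by one position each round so roles rotate.
        shifted = roles[r % 3:] + roles[:r % 3]
        assignments.append({m: shifted[i % 3] for i, m in enumerate(models)})
    return assignments

schedule = rotate_roles(["model_a", "model_b", "model_c"], rounds=3)
# After 3 rounds, each model has held each of the three roles exactly once.
```

Rotating roles ensures no single model's quirks dominate the benchmark: every model both sets and sits the exam.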
Not all teachers and judges are created equal. CoEval improves signal quality by identifying the strongest candidates for each role:

| Role | Selection Criterion | Intuition |
|---|---|---|
| Teacher | Differentiating: produces challenges that separate student performance | A good exam question reveals who studied. |
| Judge | Consensus: high agreement with the ensemble majority | A reliable judge aligns with peer consensus. |
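Both criteria boil down to simple statistics over the score matrix. Here is a minimal sketch under assumed definitions (differentiation as the spread of mean student scores on a teacher's items; consensus as closeness to the per-item ensemble median, for scores normalized to [0, 1]); neither is claimed to be CoEval's exact formula:

```python
from statistics import median, pstdev

def teacher_differentiation(scores_by_student: dict[str, list[float]]) -> float:
    """Spread of mean student scores on one teacher's items; higher = more separating."""
    means = [sum(s) / len(s) for s in scores_by_student.values()]
    return pstdev(means)

def judge_consensus(scores_by_judge: dict[str, list[float]]) -> dict[str, float]:
    """Per-judge agreement (1.0 = perfect) with the per-item ensemble median score."""
    per_item = list(zip(*scores_by_judge.values()))
    medians = [median(col) for col in per_item]
    return {
        judge: 1.0 - sum(abs(s - m) for s, m in zip(scores, medians)) / len(scores)
        for judge, scores in scores_by_judge.items()
    }
```

A teacher whose items every student aces (or flunks) scores near zero differentiation; a judge who systematically disagrees with the ensemble median scores low consensus and can be down-weighted.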
```
 Fully Automatic        Semi-Automatic         Manual
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│ Tasks         │     │ Tasks       * │     │ Tasks       * │
│ Rubrics       │ ──► │ Rubrics       │ ──► │ Rubrics     * │
│ Attr. Space   │     │ Attr. Space * │     │ Attr. Space * │
└───────────────┘     └───────────────┘     └───────────────┘
  AI-generated          Human-guided          Human-defined
                                          (* = human-edited)
```
Tasks, rubrics, and diversity/attribute spaces can be provisioned fully automatically, semi-automatically (human-in-the-loop), or manually; choose the level of control that fits your workflow.
CoEval is an end-to-end system, from benchmark design to interactive reporting.
```
┌───────────────────────────────────────────────────────────────┐
│                         C o E v a l                           │
├───────────────────────────────────────────────────────────────┤
│                                                               │
│  Multi-Vendor Support                                         │
│   ├─ Multiple LLM providers & interfaces out of the box       │
│   └─ Plug in proprietary / self-hosted models                 │
│                                                               │
│  Benchmark Design & Planning                                  │
│   ├─ Automated task & rubric provisioning                     │
│   └─ Run orchestration with cost optimization                 │
│                                                               │
│  Interactive Visual Reports                                   │
│   ├─ Side-by-side model comparison                            │
│   └─ Drill-down into tasks, rubrics & scores                  │
│                                                               │
│  Experiment Tracking                                          │
│   ├─ Easy reruns & parameter sweeps                           │
│   └─ Repair & resume after interruptions                      │
│                                                               │
│  Complete Documentation                                       │
│   ├─ User guides & tutorials                                  │
│   └─ Developer API reference                                  │
│                                                               │
└───────────────────────────────────────────────────────────────┘
```
| Feature | Description |
|---|---|
| Multi-vendor | Swap providers without changing your eval pipeline. |
| Auto-provisioning | Generate tasks, rubrics, and attribute spaces from a domain description. |
| Orchestration | Schedule and parallelize runs; optimize for cost and latency. |
| Visual reports | Interactive dashboards for deep-dive analysis. |
| Resilient tracking | Resume interrupted experiments; repair partial results. |
| Docs-first | Comprehensive guides for users and contributors alike. |
OpenAI, Anthropic, Google Gemini, Azure OpenAI, Azure AI Inference, AWS Bedrock, Google Vertex AI, OpenRouter, Groq, DeepSeek, Mistral, DeepInfra, Cerebras, Cohere, HuggingFace API, HuggingFace (local), Ollama
See the Providers & Pricing guide for auth setup, batch discounts, and pricing tables for all 18 interfaces.
```bash
# 1. Install
pip install coeval

# 2. Add your API keys (see: docs/tutorial.md § 2)
cp keys.yaml.template keys.yaml   # then fill in your provider keys

# 3. Probe all models (no tokens consumed)
coeval probe --config benchmark/mixed.yaml

# 4. Estimate cost before spending anything
coeval plan --config benchmark/mixed.yaml

# 5. Run the experiment
coeval run --config benchmark/mixed.yaml --continue

# 6. Generate analysis reports
coeval analyze all --run ./eval_runs/mixed-benchmark --out ./reports
```

Example configuration:

```yaml
models:
  - name: gpt-4o-mini
    interface: openai
    parameters: { model: gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [teacher, student, judge]

tasks:
  - name: text_sentiment
    description: Classify the sentiment of a short customer review.
    output_description: A single word, either Positive or Negative.
    target_attributes:
      sentiment: [positive, negative]
      intensity: [mild, strong]
    sampling: { target: [1,1], nuance: [0,1], total: 20 }
    rubric:
      accuracy: "The label matches the actual sentiment of the review."
    evaluation_mode: single

experiment:
  id: sentiment-v1
  storage_folder: ./eval_runs
```

Interactive HTML examples (click to open rendered in browser):
| Example | Description |
|---|---|
| Education Benchmark β Planning View | Full experiment plan: 3 real-dataset tasks + 10 synthetic tasks, 6 models, per-phase call budget, cost table, and attribute maps |
| Mixed Benchmark β Planning View | Mixed benchmark plan: real benchmark datasets + OpenAI models |
| Paper Dual-Track β Planning View | Paper evaluation: dual-track design with benchmark + generative teachers |
Generate your own planning view:
```bash
coeval describe --config my_experiment.yaml --out my_experiment_plan.html
```
| Report | Description |
|---|---|
| Dashboard | Overview dashboard: all reports in one place with top-line rankings and navigation |
| Student Performance Report | Per-student score breakdowns, task rankings, rubric factor heatmaps |
| Judge Consistency Report | Inter-judge ICC agreement, calibration drift, flagged uncertain items |
| Robust Summary Report | Final model rankings with confidence intervals and robust ensemble weights |
| Score Distribution Report | High / Medium / Low histograms filterable by task, teacher, student, and judge |
| Teacher Report | Per-teacher source quality, attribute stratum coverage, data consistency |
| Interaction Matrix | Teacher × Student pair-quality heatmap: spot which combinations succeed or fail |
| Coverage Summary | Attribute Coverage Ratio (ACR) and rare-attribute recall per task |
| Judge Report | Judge-level bias rates, score calibration, inter-rater reliability |
| Annotated Report Guide | Detailed annotated screenshots of every CoEval report, with explanations of each visualization and metric |
Generate all reports from a completed run:
```bash
coeval analyze all --run ./Runs/my-experiment-v1 --out ./reports
```
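The Attribute Coverage Ratio (ACR) reported in the Coverage Summary can be understood as covered strata over target strata. Here is a sketch assuming that reading; the report's exact definition may differ, and `attribute_coverage_ratio` is a hypothetical helper, not CoEval's API:

```python
from itertools import product

def attribute_coverage_ratio(target_attributes: dict[str, list[str]],
                             items: list[dict[str, str]]) -> float:
    """Fraction of target attribute combinations (strata) hit by generated items."""
    keys = list(target_attributes)
    strata = set(product(*(target_attributes[k] for k in keys)))
    covered = {tuple(item[k] for k in keys) for item in items}
    return len(covered & strata) / len(strata)

acr = attribute_coverage_ratio(
    {"sentiment": ["positive", "negative"], "intensity": ["mild", "strong"]},
    [{"sentiment": "positive", "intensity": "mild"},
     {"sentiment": "negative", "intensity": "strong"}],
)  # 2 of 4 strata covered -> 0.5
```

An ACR below 1.0 flags strata the teachers never produced items for, which is exactly what rare-attribute recall then drills into.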
| Guide | What it covers |
|---|---|
| Concepts Glossary | Every first-class concept explained: teacher, student, judge, attributes, rubric, datapoint, slot, phases, wizard, probing, planning, resume, repair, auto interface, batch API, and more |
| Evaluation Experiment Planning and Preparation Guide | End-to-end walkthrough: installation, config design, probing, running, analysis, and benchmark export |
| Command Line Option Reference | Every `coeval` subcommand, flag, and exit code: run, probe, plan, generate, status, models, analyze, describe, wizard, ingest, repair |
| Running Experiments | Phase modes, `--continue`, batch API, quota control, cost estimation, fault recovery, use-case examples |
| Providers & Pricing | All 18 interfaces with auth, batch support, code examples, and pricing tables |
| Analytics & Reports | 11 interactive HTML dashboards, paper-quality result tables, programmatic API, Excel workbook export |
| Configuration Guide | YAML config schema: models, tasks, attributes, rubric, sampling, prompt overrides, experiment settings |
| Benchmark Datasets | Pre-ingested datasets, `coeval ingest`, the `interface: benchmark` virtual teacher, reproducing published results |
| Testing Guide | All 20 test files, how to run each suite, interpreting failures, CI/CD setup |
| System Feature Wishlist | 35-item prioritized roadmap: 10 benchmark additions, 12 system features, 13 new report types |
```
YAML Config → Phase 1: Attribute Mapping    (teachers infer task dimensions)
            → Phase 2: Rubric Mapping       (teachers build evaluation criteria)
            → Phase 3: Data Generation      (teachers produce benchmark items)
            → Phase 4: Response Collection  (students answer benchmark prompts)
            → Phase 5: Evaluation           (judges score student responses)
            → coeval analyze all            (8 HTML reports + Excel workbook)
```
| Cloud (Async Batch) | Cloud (Real-time) | OpenAI-Compatible | Local / Virtual |
|---|---|---|---|
| openai | azure_openai¹ | groq | huggingface |
| anthropic | azure_ai | deepseek | ollama |
| gemini² | bedrock | mistral | benchmark |
| vertex | | deepinfra | |
| | | openrouter | |
| | | cerebras | |

¹ `azure_openai` supports the Azure Global Batch API (50% discount); enable via `batch: azure_openai:` in config.
² `gemini` uses concurrent requests (pseudo-batch); no async discount.
| Capability | Detail |
|---|---|
| Cost estimation | Itemised call budget and cost table before any phases run; Batch API discounts modelled |
| Batch API | 50% async discount for OpenAI, Anthropic, and Azure OpenAI; Gemini uses concurrent mode (no discount) |
| Resume | --continue resumes at exact JSONL record; no duplicate API calls |
| Auto attributes | Teachers infer task dimensions from a description; no hand-labelling required |
| Auto rubric | Teachers propose rubric factors; merge-and-deduplicate across N teachers |
| Multi-judge ensemble | N judges produce bias-resistant aggregate scores; outlier judges are down-weighted |
| 8 HTML reports | Interactive charts, filterable tables, CSV export, fully self-contained (no CDN) |
| Model probe | Verify all 16 interfaces are reachable before spending a dollar |
| Virtual teachers | Pre-ingested public datasets supply zero-cost Phase 3 ground truth |
| Label accuracy | Judge-free exact-match for classification tasks (`label_attributes`) |
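The cost-estimation logic is, at its core, calls × tokens × unit price, with the batch discount applied per provider. A back-of-envelope sketch follows; the function name, prices, and token counts are placeholders, not CoEval's pricing tables or API:

```python
def estimate_phase_cost(calls: int, in_tokens: int, out_tokens: int,
                        price_in: float, price_out: float,
                        batch_discount: float = 0.0) -> float:
    """Estimated USD for one phase; prices are USD per 1M tokens."""
    per_call = (in_tokens * price_in + out_tokens * price_out) / 1_000_000
    return calls * per_call * (1.0 - batch_discount)

# e.g. 2,000 judge calls at 1,500 input / 200 output tokens,
# hypothetical $0.15 / $0.60 per 1M tokens, 50% async batch discount
cost = estimate_phase_cost(2000, 1500, 200, 0.15, 0.60, batch_discount=0.5)  # -> 0.345
```

Summing this over phases and models yields the itemised cost table shown before any tokens are spent, which is why the 50% async discount can change which providers are affordable for a large run.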
| Component | Files | LoC |
|---|---|---|
| `Code/runner` (pipeline engine) | 59 .py | 15,087 |
| `Code/analyzer` (analysis & reports) | 21 .py | 9,554 |
| `Public/benchmark` (dataset utilities) | 34 .py | 5,211 |
| `Tests` (test suites) | 41 .py | 16,845 |
| `docs` (documentation) | 35 .md | 12,521 |
CoEval · Multi-Model LLM Evaluation Framework
Designed for LLM developers, integrators, and evaluation practitioners who require robust model evaluation and ranking using custom use-case data and metrics.
Copyright (c) 2026 Alexander Apartsin. All rights reserved.
