
CoEval: Ensemble-Based Self-Evaluation for LLMs

Status: WIP · Python ≥ 3.10 · Version 0.3.0 · Tests: 622 passing · © 2026 Alexander Apartsin

CoEval — Teacher · Student · Judge evaluation ensemble


🚨 The Challenge

Evaluating and selecting off-the-shelf or fine-tuned models for a specific use case is difficult.

Choosing the right LLM means navigating a minefield of hidden pitfalls:

| Challenge | Why It Hurts |
| --- | --- |
| 🎯 Generic benchmarks don't transfer | Public data and metrics often miss the nuances of your real-world requirements. |
| 🧩 Custom benchmarks are hard to design | Defining representative tasks, building rubrics, and choosing robustness variations is non-trivial. |
| 💸 Multi-model, multi-task benchmarks are expensive to execute | Running every candidate model across every task and rubric quickly multiplies cost and compute. |
| 🕳️ Leakage biases results | Public and private benchmark items (or near-duplicates) may lurk in training data, inflating scores via memorization. |
| ⚙️ Ops and cost are complex | Running evaluations across providers, inference modes, and scoring criteria demands careful orchestration. |

Bottom line: You can't trust a leaderboard number, and building your own eval is a project in itself.


πŸ’‘ The Concept

Ensemble-based synthetic self-evaluation benchmarking — let the models evaluate each other.

CoEval generates a synthetic evaluation suite spanning multiple domain-specific tasks and scoring rubrics, then assembles an ensemble of models that rotate through three roles:

┌─────────────────────────────────────────────────────────────┐
│                      MODEL  ENSEMBLE                        │
│                                                             │
│   ┌────────────┐    ┌────────────┐    ┌────────────┐        │
│   │  Model A   │    │  Model B   │    │  Model C   │  ...   │
│   └─────┬──────┘    └─────┬──────┘    └─────┬──────┘        │
│         │                 │                 │               │
│         ▼                 ▼                 ▼               │
│   ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓      │
│   ┃           ROTATING  ROLE  ASSIGNMENT            ┃      │
│   ┗━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━┛      │
│              ▼                ▼              ▼              │
│      🎓 TEACHER         📝 STUDENT        ⚖️ JUDGE          │
│   Generate synthetic   Models under      Score outputs      │
│   challenges &         evaluation take   against the        │
│   reference answers    the challenges    rubric             │
│                                                             │
└─────────────────────────────────────────────────────────────┘
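The rotating role assignment above can be sketched as a simple round-robin schedule. The `role_rotation` helper below is purely illustrative; CoEval's actual scheduler may assign roles per task or per config rather than per round.

```python
def role_rotation(models: list[str], n_rounds: int) -> list[dict[str, str]]:
    """Cycle each model through teacher/student/judge across rounds.

    Hypothetical round-robin sketch, not CoEval's real scheduler:
    in round r, the model at offset (i + r) takes the i-th role.
    """
    roles = ("teacher", "student", "judge")
    return [
        {roles[i]: models[(i + r) % len(models)] for i in range(len(roles))}
        for r in range(n_rounds)
    ]

schedule = role_rotation(["model_a", "model_b", "model_c"], 3)
# round 0: teacher=model_a, student=model_b, judge=model_c
# round 1: teacher=model_b, student=model_c, judge=model_a
```

With three models and three rounds, every model serves once in every role.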

Reliability through selection

Not all teachers and judges are created equal. CoEval improves signal quality by identifying:

| Role | Selection Criterion | Intuition |
| --- | --- | --- |
| 🎓 Teacher | Differentiating — produces challenges that separate student performance | A good exam question reveals who studied. |
| ⚖️ Judge | Consensus — high agreement with ensemble majority | A reliable judge aligns with peer consensus. |

Flexible provisioning

  Fully Automatic          Semi-Automatic               Manual
  ┌─────────────┐          ┌────────────────┐        ┌───────────────┐
  │ Tasks       │          │ Tasks ✏️        │        │ Tasks ✏️       │
  │ Rubrics     │   ──►    │ Rubrics        │  ──►   │ Rubrics ✏️     │
  │ Attr. Space │          │ Attr. Space ✏️  │        │ Attr. Space ✏️ │
  └─────────────┘          └────────────────┘        └───────────────┘
   AI-generated             Human-guided               Human-defined

Tasks, rubrics, and diversity/attribute spaces can be provisioned fully automatically, semi-automatically (human-in-the-loop), or manually — choose the level of control that fits your workflow.


πŸ—οΈ The Framework

CoEval is an end-to-end system — from benchmark design to interactive reporting.

  ╔══════════════════════════════════════════════════════════════╗
  ║                        C o E v a l                           ║
  ╠══════════════════════════════════════════════════════════════╣
  ║                                                              ║
  ║   📦 Multi-Vendor Support                                    ║
  ║   ├── Multiple LLM providers & interfaces out of the box     ║
  ║   └── Plug in proprietary / self-hosted models               ║
  ║                                                              ║
  ║   🗺️ Benchmark Design & Planning                             ║
  ║   ├── Automated task & rubric provisioning                   ║
  ║   └── Run orchestration with cost optimization               ║
  ║                                                              ║
  ║   📊 Interactive Visual Reports                              ║
  ║   ├── Side-by-side model comparison                          ║
  ║   └── Drill-down into tasks, rubrics & scores                ║
  ║                                                              ║
  ║   🔄 Experiment Tracking                                     ║
  ║   ├── Easy reruns & parameter sweeps                         ║
  ║   └── Repair & resume after interruptions                    ║
  ║                                                              ║
  ║   📚 Complete Documentation                                  ║
  ║   ├── User guides & tutorials                                ║
  ║   └── Developer API reference                                ║
  ║                                                              ║
  ╚══════════════════════════════════════════════════════════════╝

At a glance

| Feature | Description |
| --- | --- |
| Multi-vendor | Swap providers without changing your eval pipeline. |
| Auto-provisioning | Generate tasks, rubrics, and attribute spaces from a domain description. |
| Orchestration | Schedule and parallelize runs; optimize for cost and latency. |
| Visual reports | Interactive dashboards for deep-dive analysis. |
| Resilient tracking | Resume interrupted experiments; repair partial results. |
| Docs-first | Comprehensive guides for users and contributors alike. |

Supported Model APIs

OpenAI, Anthropic, Google Gemini, Azure OpenAI, Azure AI Inference, AWS Bedrock, Google Vertex AI, OpenRouter, Groq, DeepSeek, Mistral, DeepInfra, Cerebras, Cohere, HuggingFace API, HuggingFace (local), Ollama

→ Providers & Pricing — auth setup, batch discounts, and pricing tables for all 18 interfaces.


Quick Start

```shell
# 1. Install
pip install coeval

# 2. Add your API keys (see: docs/tutorial.md § 2)
cp keys.yaml.template keys.yaml   # then fill in your provider keys

# 3. Probe all models — no tokens consumed
coeval probe --config benchmark/mixed.yaml

# 4. Estimate cost before spending anything
coeval plan --config benchmark/mixed.yaml

# 5. Run the experiment
coeval run --config benchmark/mixed.yaml --continue

# 6. Generate analysis reports
coeval analyze all --run ./eval_runs/mixed-benchmark --out ./reports
```

Minimal experiment config

```yaml
models:
  - name: gpt-4o-mini
    interface: openai
    parameters: { model: gpt-4o-mini, temperature: 0.7, max_tokens: 512 }
    roles: [teacher, student, judge]

tasks:
  - name: text_sentiment
    description: Classify the sentiment of a short customer review.
    output_description: A single word — either Positive or Negative.
    target_attributes:
      sentiment: [positive, negative]
      intensity: [mild, strong]
    sampling: { target: [1,1], nuance: [0,1], total: 20 }
    rubric:
      accuracy: "The label matches the actual sentiment of the review."
    evaluation_mode: single

experiment:
  id: sentiment-v1
  storage_folder: ./eval_runs
```
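A minimal sanity check over the parsed config (the dict a YAML loader such as `yaml.safe_load` would return) might look like the sketch below. The `validate` helper and its rules are illustrative assumptions, not part of the CoEval API.

```python
def validate(cfg: dict) -> list[str]:
    """Return a list of problems found in a parsed experiment config.

    Illustrative only: checks that the three top-level sections exist
    and that each model's roles are drawn from teacher/student/judge.
    """
    errors = []
    for section in ("models", "tasks", "experiment"):
        if section not in cfg:
            errors.append(f"missing section: {section}")
    for model in cfg.get("models", []):
        for role in model.get("roles", []):
            if role not in {"teacher", "student", "judge"}:
                errors.append(f"unknown role {role!r} on model {model.get('name')}")
    return errors

# Parsed form of a minimal config, as a YAML loader would produce it
cfg = {
    "models": [{"name": "gpt-4o-mini", "interface": "openai",
                "roles": ["teacher", "student", "judge"]}],
    "tasks": [{"name": "text_sentiment"}],
    "experiment": {"id": "sentiment-v1"},
}
assert validate(cfg) == []
```

Catching a typo like `roles: [techer]` before any API call is made saves a failed run.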

Examples

Interactive HTML examples — click to open rendered in browser:

Experiment Planning

| Example | Description |
| --- | --- |
| Education Benchmark — Planning View | Full experiment plan: 3 real-dataset tasks + 10 synthetic tasks, 6 models, per-phase call budget, cost table, and attribute maps |
| Mixed Benchmark — Planning View | Mixed benchmark plan: real benchmark datasets + OpenAI models |
| Paper Dual-Track — Planning View | Paper evaluation: dual-track design with benchmark + generative teachers |

Generate your own planning view:

```shell
coeval describe --config my_experiment.yaml --out my_experiment_plan.html
```

Example Reports

| Report | Description |
| --- | --- |
| Dashboard | Overview dashboard — all reports in one place with top-line rankings and navigation |
| Student Performance Report | Per-student score breakdowns, task rankings, rubric factor heatmaps |
| Judge Consistency Report | Inter-judge ICC agreement, calibration drift, flagged uncertain items |
| Robust Summary Report | Final model rankings with confidence intervals and robust ensemble weights |
| Score Distribution Report | High / Medium / Low histograms filterable by task, teacher, student, and judge |
| Teacher Report | Per-teacher source quality, attribute stratum coverage, data consistency |
| Interaction Matrix | Teacher × Student pair quality heatmap — spot which combinations succeed or fail |
| Coverage Summary | Attribute Coverage Ratio (ACR) and rare-attribute recall per task |
| Judge Report | Judge-level bias rates, score calibration, inter-rater reliability |
| Annotated Report Guide | Detailed annotated screenshots of every CoEval report, explaining each visualization and metric |

Generate all reports from a completed run:

```shell
coeval analyze all --run ./Runs/my-experiment-v1 --out ./reports
```

Related documents

| Guide | What it covers |
| --- | --- |
| Concepts Glossary | Every first-class concept explained: teacher, student, judge, attributes, rubric, datapoint, slot, phases, wizard, probing, planning, resume, repair, auto interface, batch API, and more |
| Evaluation Experiment Planning and Preparation Guide | End-to-end walkthrough: installation, config design, probing, running, analysis, and benchmark export |
| Command Line Option Reference | Every coeval subcommand, flag, and exit code — run, probe, plan, generate, status, models, analyze, describe, wizard, ingest, repair |
| Running Experiments | Phase modes, --continue, batch API, quota control, cost estimation, fault recovery, use-case examples |
| Providers & Pricing | All 18 interfaces with auth, batch support, code examples, and pricing tables |
| Analytics & Reports | 11 interactive HTML dashboards, paper-quality result tables, programmatic API, Excel workbook export |
| Configuration Guide | YAML config schema: models, tasks, attributes, rubric, sampling, prompt overrides, experiment settings |
| Benchmark Datasets | Pre-ingested datasets, coeval ingest, the interface: benchmark virtual teacher, reproducing published results |
| Testing Guide | All 20 test files, how to run each suite, interpreting failures, CI/CD setup |
| System Feature Wishlist | 35-item prioritized roadmap: 10 benchmark additions, 12 system features, 13 new report types |

Pipeline at a Glance

YAML Config  →  Phase 1: Attribute Mapping    (teachers infer task dimensions)
             →  Phase 2: Rubric Mapping       (teachers build evaluation criteria)
             →  Phase 3: Data Generation      (teachers produce benchmark items)
             →  Phase 4: Response Collection  (students answer benchmark prompts)
             →  Phase 5: Evaluation           (judges score student responses)
             →  coeval analyze all            (8 HTML reports + Excel workbook)
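The phase sequence with resume semantics can be sketched as a skip-if-done loop over an ordered phase list. `run_pipeline` and the `state` dict below are illustrative assumptions about what a `--continue`-style resume implies, not CoEval internals.

```python
PHASES = [
    "attribute_mapping",    # Phase 1: teachers infer task dimensions
    "rubric_mapping",       # Phase 2: teachers build evaluation criteria
    "data_generation",      # Phase 3: teachers produce benchmark items
    "response_collection",  # Phase 4: students answer benchmark prompts
    "evaluation",           # Phase 5: judges score student responses
]

def run_pipeline(state: dict, run_phase) -> dict:
    """Run phases in order, skipping any already marked done.

    Illustrative sketch: `state` persists phase completion between
    invocations, so re-running after an interruption picks up where
    the previous run left off.
    """
    for phase in PHASES:
        if state.get(phase) == "done":
            continue  # completed in a previous run
        run_phase(phase)
        state[phase] = "done"
    return state

executed = []
state = run_pipeline({"attribute_mapping": "done"}, executed.append)
# executed skips Phase 1 and runs Phases 2-5
```

Persisting `state` to disk after each phase is what makes a crash mid-run recoverable.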

16 Model Interfaces

| Cloud — Async Batch ✅ | Cloud — Real-time | OpenAI-Compatible | Local / Virtual |
| --- | --- | --- | --- |
| openai | azure_openai¹ | groq | huggingface |
| anthropic | azure_ai | deepseek | ollama |
| gemini² | bedrock | mistral | benchmark |
|  | vertex | deepinfra |  |
|  | openrouter | cerebras |  |

¹ azure_openai supports the Azure Global Batch API (50% discount) — enable via `batch: azure_openai:` in config. ² gemini uses concurrent requests (pseudo-batch) — no async discount.

Key Capabilities

| Capability | Detail |
| --- | --- |
| Cost estimation | Itemised call budget and cost table before any phases run; Batch API discounts modelled |
| Batch API | 50% async discount for OpenAI, Anthropic, and Azure OpenAI; Gemini uses concurrent mode (no discount) |
| Resume | --continue resumes at the exact JSONL record; no duplicate API calls |
| Auto attributes | Teachers infer task dimensions from a description; no hand-labelling required |
| Auto rubric | Teachers propose rubric factors; merge-and-deduplicate across N teachers |
| Multi-judge ensemble | N judges → bias-resistant aggregate scores; outlier judges down-weighted |
| 8 HTML reports | Interactive charts, filterable tables, CSV export, fully self-contained (no CDN) |
| Model probe | Verify all 16 interfaces are reachable before spending a dollar |
| Virtual teachers | Pre-ingested public datasets supply zero-cost Phase 3 ground truth |
| Label accuracy | Judge-free exact-match for classification tasks (label_attributes) |
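The resume-at-exact-record behaviour can be approximated by scanning a run's JSONL output and stopping at any truncated trailing line, so an interrupted record is redone rather than duplicated. The `completed_ids` helper and the `id` field below are hypothetical, not CoEval's actual implementation.

```python
import json
import os

def completed_ids(path: str) -> set[str]:
    """Collect ids of records already written to a JSONL results file.

    A malformed final line (e.g. from a crash mid-write) stops the scan,
    so the interrupted record gets rewritten on resume. Illustrative only;
    assumes each record carries an "id" field.
    """
    done: set[str] = set()
    if not os.path.exists(path):
        return done
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                done.add(json.loads(line)["id"])
            except (json.JSONDecodeError, KeyError):
                break  # partial trailing record; redo it on resume
    return done
```

On resume, the runner would skip any work item whose id is already in this set, issuing no duplicate API calls.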

Project Statistics · System v1.3

| Component | Files | LoC |
| --- | --- | --- |
| Code/runner — pipeline engine | 59 .py | 15,087 |
| Code/analyzer — analysis & reports | 21 .py | 9,554 |
| Public/benchmark — dataset utilities | 34 .py | 5,211 |
| Tests — test suites | 41 .py | 16,845 |
| docs — documentation | 35 .md | 12,521 |

CoEval · Multi-Model LLM Evaluation Framework

Designed for LLM developers, integrators, and evaluation practitioners who require robust model evaluation and ranking using custom use-case data and metrics.

Copyright (c) 2026 Alexander Apartsin. All rights reserved.
