☑️ A curated list of tools, methods & platforms for evaluating AI reliability in real applications.
[NeurIPS 2025] AGI-Elo: How Far Are We From Mastering A Task?
A comprehensive, implementation-focused guide to evaluating Large Language Models, RAG systems, and Agentic AI in production environments.
A comprehensive AI evaluation framework with advanced techniques, including probability-weighted scoring. It supports multiple LLM providers and provides evaluation metrics for RAG systems and AI agents; a hosted evaluation service is available on the project's website.
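The sketch below illustrates the general idea behind probability-weighted scoring, not this repository's actual API: rather than taking a judge model's single hard label, each candidate grade is weighted by the probability the judge assigns to it (e.g., derived from token log-probabilities) and the expected grade is reported. The function name and example distribution are illustrative assumptions.

```python
# Hedged sketch of probability-weighted scoring (illustrative, not the project's API).

def probability_weighted_score(grade_probs: dict[float, float]) -> float:
    """grade_probs maps a numeric grade (e.g. 1-5) to the judge's probability for it.
    Probabilities are renormalized defensively in case they do not sum to 1."""
    total = sum(grade_probs.values())
    if total == 0:
        raise ValueError("empty or zero-probability distribution")
    return sum(grade * p for grade, p in grade_probs.items()) / total

# Example: a judge that puts 70% of its mass on grade 4 and 30% on grade 5
# yields an expected grade of 4.3 instead of a hard 4.
print(probability_weighted_score({4: 0.7, 5: 0.3}))  # 4.3
```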
prompt-evaluator is an open-source toolkit for evaluating, testing, and comparing LLM prompts. It provides a GUI-driven workflow for running prompt tests, tracking token usage, visualizing results, and ensuring reliability across providers such as OpenAI, Anthropic (Claude), and Google (Gemini).
VerifyAI is a simple UI application for testing GenAI outputs.
Multi-dimensional evaluation of AI responses using semantic alignment, conversational flow, and engagement metrics.
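As a rough illustration of multi-dimensional evaluation (assuming per-dimension scores on a 0-1 scale and illustrative weights, not this project's actual scheme), dimension scores can be combined into a single weighted result:

```python
# Hedged sketch: weighted aggregation of per-dimension scores.
# The dimension names come from the description above; the weights are assumptions.

DIMENSION_WEIGHTS = {
    "semantic_alignment": 0.5,
    "conversational_flow": 0.3,
    "engagement": 0.2,
}

def aggregate(scores: dict[str, float]) -> float:
    """Combine 0-1 dimension scores into one weighted score; missing dimensions raise KeyError."""
    return sum(DIMENSION_WEIGHTS[d] * scores[d] for d in DIMENSION_WEIGHTS)

print(aggregate({"semantic_alignment": 0.9, "conversational_flow": 0.7, "engagement": 0.6}))
# 0.5*0.9 + 0.3*0.7 + 0.2*0.6 = 0.78
```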
Pondera is a lightweight, YAML-first framework to evaluate AI models and agents with pluggable runners and an LLM-as-a-judge.
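The following is a minimal sketch of the general pattern a YAML-first evaluator with pluggable runners and an LLM-as-a-judge follows; the spec fields, the "echo" runner, and the stubbed judge are assumptions for illustration and are not Pondera's actual schema or API.

```python
# Hedged sketch: a YAML spec selects a registered runner, and each case's
# output is graded by a judge (stubbed here; a real judge would call an LLM).

import yaml  # PyYAML

SPEC = """
task: summarization-smoke-test
runner: echo            # which registered runner to use (hypothetical name)
judge_prompt: "Rate the summary from 1 to 5 for faithfulness."
cases:
  - input: "The cat sat on the mat."
"""

RUNNERS = {
    # A runner maps an input to a model output; real runners would call a model.
    "echo": lambda text: f"Summary: {text}",
}

def judge(output: str, judge_prompt: str) -> int:
    # Stand-in for an LLM-as-a-judge call; a real judge would prompt a model
    # with judge_prompt plus the output and parse its numeric verdict.
    return 5 if output else 1

def run(spec_text: str) -> list[int]:
    spec = yaml.safe_load(spec_text)
    runner = RUNNERS[spec["runner"]]
    return [judge(runner(case["input"]), spec["judge_prompt"]) for case in spec["cases"]]

print(run(SPEC))  # [5]
```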
A clinical-trial application for benchmarking AI responses in multi-turn mental health conversations. It helps users understand AI interaction patterns and work through personal mental health concerns with therapeutic AI assistance.
Official public release of MirrorLoop Core (v1.3 – April 2025)