"We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations."EvalEval Coalition
The EvalEval Coalition focuses on conducting rigorous research on AI evaluation methods, building practical infrastructure for evaluation work, and organizing collaborative efforts across its researcher community. This repository, `every_eval_ever`, provides a standardized metadata format for storing evaluation results from various leaderboards, research, and local evaluations.
Leaderboard/evaluation data is split up into files by individual model, and data for each model is stored using this JSON Schema. The repository is structured into folders as `{leaderboard_name}/{developer_name}/{model_name}/`.
Each JSON file is named with a UUID (Universally Unique Identifier) in the format {uuid}.json. The UUID is automatically generated (using standard UUID v4) when creating a new evaluation result file. This ensures that:
- Multiple evaluations of the same model can exist without conflicts (each gets a unique UUID)
- Different timestamps are stored as separate files with different UUIDs (not as separate folders)
- A model may have multiple result files, with each file representing different iterations or runs of the leaderboard/evaluation
- UUIDs can be generated using Python's `uuid.uuid4()` function; see the sketch below.
Example: The model `openai/gpt-4o-2024-11-20` might have multiple files, such as:
- `e70acf51-30ef-4c20-b7cc-51704d114d70.json` (evaluation run #1)
- `a1b2c3d4-5678-90ab-cdef-1234567890ab.json` (evaluation run #2)
Note: Each file can contain multiple individual results related to one model. See examples in /data.
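A minimal sketch of generating a new result file path with this convention (the eval, developer, and model names below are placeholders, not real entries in this repository):

```python
import uuid
from pathlib import Path

# Placeholder names -- substitute your own eval, developer, and model.
eval_name = "my_eval"
developer_name = "openai"
model_name = "gpt-4o-2024-11-20"

# UUID v4 filename, as used throughout the repository.
filename = f"{uuid.uuid4()}.json"

# Build and create the nested folder structure under data/.
path = Path("data") / eval_name / developer_name / model_name / filename
path.parent.mkdir(parents=True, exist_ok=True)
print(path)  # e.g. data/my_eval/openai/gpt-4o-2024-11-20/<uuid>.json
```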
To add a new eval:
- Add a new folder under `/data` with a codename for your eval.
- For each model, use the Hugging Face (`developer_name/model_name`) naming convention to create a two-tier folder structure.
- Add a JSON file with the results for each model and name it `{uuid}.json`.
- [Optional] Include a `scripts` folder in your eval folder with any scripts used to generate the data.
- [Validate] A workflow (`.github/workflows/validate-data.yml`) runs the validation script (`scripts/validate_data.py`) to check JSON files against the schema and report errors before merging; a sketch of that kind of check follows this list.
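A quick way to sanity-check a file before opening a PR is to validate it against the schema directly. The snippet below is a hedged sketch using the `jsonschema` package; the schema filename (`schema.json`) and the example path are assumptions, and `scripts/validate_data.py` remains the authoritative check.

```python
import json
from pathlib import Path

from jsonschema import ValidationError, validate  # requires the jsonschema package

# Assumed paths -- adjust to the actual schema file and your result file.
schema = json.loads(Path("schema.json").read_text())
result_file = Path("data/my_eval/openai/gpt-4o-2024-11-20/e70acf51-30ef-4c20-b7cc-51704d114d70.json")

try:
    validate(instance=json.loads(result_file.read_text()), schema=schema)
    print(f"{result_file}: conforms to the schema")
except ValidationError as err:
    print(f"{result_file}: failed validation: {err.message}")
```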
Notes on specific schema fields:
- `model_info`: Use Hugging Face formatting (`developer_name/model_name`). If a model does not come from Hugging Face, use the exact API reference; check the examples in `/data/livecodebenchpro`. Note that some model names include a date while others do not. For example:
  - OpenAI: `gpt-4o-2024-11-20`, `gpt-5-2025-08-07`, `o3-2025-04-16`
  - Anthropic: `claude-3-7-sonnet-20250219`, `claude-3-sonnet-20240229`
  - Google: `gemini-2.5-pro`, `gemini-2.5-flash`
  - xAI (Grok): `grok-2-2024-08-13`, `grok-3-2025-01-15`
- `evaluation_id`: Use the `{eval_name}/{model_id}/{retrieved_timestamp}` format (e.g. `livecodebenchpro/qwen3-235b-a22b-thinking-2507/1760492095.8105888`); see the sketch after this list.
- `inference_platform` vs `inference_engine`: Where possible, specify where the evaluation was run using one of these two fields.
  - `inference_platform`: Use this field when the evaluation was run through a remote API (e.g. `openai`, `huggingface`, `openrouter`, `anthropic`, `xai`).
  - `inference_engine`: Use this field when the evaluation was run through a local inference engine (e.g. `vLLM`, `Ollama`).
- The `source_type` field has two options: `documentation` and `evaluation_platform`. Use `documentation` when the evaluation results are extracted from a documentation source (e.g. a leaderboard website or API). Use `evaluation_platform` when the evaluation was run locally or through an evaluation platform.
- The schema is designed to accommodate both numeric and level-based (e.g. Low, Medium, High) metrics. For level-based metrics, the stored value (`score_details.score`) should be an integer (e.g. Low = 1, Medium = 2, High = 3), and the `level_names` property should be used to specify the mapping of levels to integers; a conversion sketch follows the level-based example at the end of this document.
- Additional details can be provided in several places in the schema. They are not required, but can be useful for detailed analysis.
  - `model_info.additional_details`: Use this field to provide any additional information about the model itself (e.g. number of parameters).
  - `evaluation_results.generation_config.generation_args`: Specify any additional arguments used to generate outputs from the model.
  - `evaluation_results.generation_config.additional_details`: Use this field to provide any additional information about the evaluation process that is not captured elsewhere.
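A minimal sketch of assembling an `evaluation_id` in that format (the eval and model names are reused from the example above; `time.time()` is used here only to illustrate the UNIX-timestamp convention):

```python
import time

# Values reused from the evaluation_id example above.
eval_name = "livecodebenchpro"
model_id = "qwen3-235b-a22b-thinking-2507"

# retrieved_timestamp is a UNIX timestamp stored as a string.
retrieved_timestamp = str(time.time())

evaluation_id = f"{eval_name}/{model_id}/{retrieved_timestamp}"
print(evaluation_id)  # e.g. livecodebenchpro/qwen3-235b-a22b-thinking-2507/1760492095.8105888
```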
This repository has a pre-commit hook that validates that JSON files conform to the JSON schema. The hook requires `uv` for dependency management.
- To run the pre-commit hook on git-staged files only: `uv run pre-commit run`
- To run it on all files: `uv run pre-commit run --all-files`
- To run it on specific files: `uv run pre-commit run --files a.json b.json c.json`
- To install the hook so that it runs before every `git commit` (optional): `uv run pre-commit install`

Repository layout:
data/
├── {eval_name}/
│   └── {developer_name}/
│       └── {model_name}/
│           └── {uuid}.json
scripts/
└── validate_data.py
.github/
└── workflows/
    └── validate-data.yml
Each evaluation (e.g. `livecodebenchpro`, `hfopenllm_v2`) has its own directory under `data/`. Within each evaluation, results are organized by developer and model name, with one or more `{uuid}.json` files containing the evaluation results for each model.
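Given that layout, downstream tooling can discover every result file with a simple glob. The snippet below is a hedged sketch, not a utility shipped with this repository; it assumes the field names shown in the example that follows.

```python
import json
from pathlib import Path

# Walk data/{eval_name}/{developer_name}/{model_name}/{uuid}.json
for result_file in sorted(Path("data").glob("*/*/*/*.json")):
    record = json.loads(result_file.read_text())
    print(result_file, record["model_info"]["name"], len(record["evaluation_results"]))
```

A single result file looks like the following (the `#` comments are explanatory and not part of the stored JSON):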
{
  "schema_version": "0.1.0",
  "evaluation_id": "hfopenllm_v2/Qwen_Qwen2.5-Math-72B-Instruct/1762652579.847774",  # {eval_name}/{model_id}/{retrieved_timestamp}
  "retrieved_timestamp": "1762652579.847775",  # UNIX timestamp
  "source_data": [
    "https://open-llm-leaderboard-open-llm-leaderboard.hf.space/api/leaderboard/formatted"
  ],
  "source_metadata": {  # This information will be repeated in every model file
    "source_name": "HF Open LLM v2",
    "source_type": "documentation",  # This can be documentation OR evaluation_run
    "source_organization_name": "Hugging Face",
    "evaluator_relationship": "third_party"
  },
  "model_info": {
    "name": "Qwen/Qwen2.5-Math-72B-Instruct",
    "developer": "Qwen",
    "inference_platform": "unknown",
    "id": "Qwen/Qwen2.5-Math-72B-Instruct",
    "additional_details": {  # Optional details about the model
      "precision": "bfloat16",
      "architecture": "Qwen2ForCausalLM",
      "params_billions": 72.706
    }
  },
  "evaluation_results": [
    {
      "evaluation_name": "IFEval",
      "metric_config": {  # This information will be repeated in every model file
        "evaluation_description": "Accuracy on IFEval",
        "lower_is_better": false,
        "score_type": "continuous",
        "min_score": 0,
        "max_score": 1
      },
      "score_details": {
        "score": 0.4003466358151926
      }
    }
    ...
  ]
}
Level-based metrics example
{
  "evaluation_name": "Data Transparency Rating",
  "metric_config": {
    "evaluation_description": "Evaluation of data documentation transparency",
    "lower_is_better": false,
    "score_type": "level",
    "level_names": ["Low", "Medium", "High"]
  },
  "score_details": {
    "score": 1
  }
}
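A small sketch of that convention, using the 1-based mapping (Low = 1, Medium = 2, High = 3) described earlier; these helpers are illustrative and not part of the repository:

```python
# level_names as declared in the metric_config above.
level_names = ["Low", "Medium", "High"]

def level_to_score(level: str) -> int:
    # 1-based index into level_names, matching the integer stored in score_details.score.
    return level_names.index(level) + 1

def score_to_level(score: int) -> str:
    # Recover the human-readable label from the stored integer.
    return level_names[score - 1]

print(level_to_score("Low"))  # 1, as in the example above
print(score_to_level(3))      # High
```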