AI Models | Published 30 June 2026 | 3 min read

Standardizing AI Performance: Every Eval Ever Unveils Unified Schema for Fragmented Benchmark Data

Measuring progress in artificial intelligence has long been a fragmented and inconsistent exercise. Evaluations are essential for testing models, yet their results are scattered across academic papers, leaderboards, blog posts, and ad hoc log files. Without a common format, it is hard to compare models, reproduce findings, or conduct meta-analyses. To address these problems, researchers have introduced Every Eval Ever (EEE), a unifying schema and community-governed repository designed to standardize how AI evaluation results are stored and analyzed.

The Challenge of Fragmented AI Benchmarking

Evaluation results are currently saved in incompatible formats across the machine learning ecosystem. Different evaluation frameworks often produce divergent scores for nominally identical benchmarks, and they record metadata inconsistently. Widely used benchmarks and suites such as Massive Multitask Language Understanding (MMLU), BIG-Bench Hard (BBH), GSM8K, TruthfulQA, and the Holistic Evaluation of Language Models (HELM) cover everything from general knowledge to mathematical reasoning. However, because published results often omit critical generation parameters, evaluation settings, and data provenance, researchers struggle to perform systematic scaling or quantization analyses. Even basic tasks, such as loading precision, recall, F1, and accuracy metrics through an evaluation library, can run into implementation bugs and produce inconsistent outputs on custom datasets.
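As a rough illustration of why metric definitions need to be pinned down, the sketch below computes precision, recall, F1, and accuracy for binary labels directly from confusion counts. It is not taken from any particular evaluation library. The function name `binary_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made for this example, and that second choice is exactly the kind of convention on which implementations can differ.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute precision, recall, F1, and accuracy for 0/1 labels.

    Hypothetical helper for illustration; label 1 is the positive class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    def safe_div(num: float, den: float) -> float:
        # Assumed convention: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    accuracy = safe_div(tp + tn, len(pairs))
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Toy labels: 2 true positives, 1 false positive, 1 false negative, 2 true negatives.
metrics = binary_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Running a small hand-checkable case like this alongside a library call is one way to notice when two implementations disagree on averaging or zero-division behavior.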

A Unified Schema and the EEE Datastore

Every Eval Ever offers a solution by defining a standardized metadata format stored in a single JSON document. Governed by the community on GitHub, the project provides a core metadata schema alongside an instance-level schema that allows for fine-grained, per-instance analysis. It is designed to be source-agnostic, automatically converting data from popular evaluation harnesses, leaderboards, and research papers.

The accompanying crowdsourced database, hosted on Hugging Face as the EEE_datastore, already spans 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats. The dataset tracks highly granular fields, including evaluation IDs, schema versions, retrieval timestamps, model developers, inference platforms, and evaluator relationships. For instance, the datastore captures third-party evaluations run by Arcadia Impact using Inspect AI. One such record evaluates DeepSeek R1 on the GAIA benchmark via OpenRouter and reports an accuracy score of 0.115; it sits alongside scores for Mistral Small Latest and Google Gemini 2.0 Flash 001.
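To make the kinds of fields listed above concrete, here is a minimal sketch of what a single evaluation record might look like when serialized to JSON. The class name `EvalRecord` and every field name are hypothetical and chosen for illustration; they are not the actual Every Eval Ever schema. The ID, schema version, and timestamp are placeholders, and only the model, benchmark, platform, evaluator, and score come from the example described above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    # Hypothetical field names; the real EEE schema may differ.
    evaluation_id: str
    schema_version: str
    retrieved_timestamp: str
    model_name: str
    model_developer: str
    inference_platform: str
    benchmark: str
    metric: str
    score: float
    evaluation_framework: str
    evaluator: str
    evaluator_relationship: str


record = EvalRecord(
    evaluation_id="example-0001",                # placeholder, not a real ID
    schema_version="0.0.0",                      # placeholder version
    retrieved_timestamp="1970-01-01T00:00:00Z",  # placeholder timestamp
    model_name="DeepSeek R1",
    model_developer="DeepSeek",
    inference_platform="OpenRouter",
    benchmark="GAIA",
    metric="accuracy",
    score=0.115,
    evaluation_framework="Inspect AI",
    evaluator="Arcadia Impact",
    evaluator_relationship="third_party",        # assumed label for this sketch
)

# Serialize to a JSON document and read it back to confirm a lossless round trip.
document = json.dumps(asdict(record), indent=2)
restored = EvalRecord(**json.loads(document))
```

Keeping provenance fields such as the inference platform and evaluator next to the score is what lets a later analysis separate, for example, first-party from third-party results.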

Hugging Face's Decentralized Evaluation Ecosystem

In parallel with community-led standardization efforts, Hugging Face has been developing a decentralized tracking system for model evaluations on its Hub. Under this system, dataset repositories such as MMLU-Pro, Humanity's Last Exam (HLE), or GPQA can be designated as Benchmarks, which automatically compile submitted results into live leaderboards.

Model developers can store their evaluation scores directly within their model repositories as YAML files in a designated .eval_results folder. Anyone in the community can submit these scores via pull requests. To ensure validity, results can display specific badges: a verified badge indicates the evaluation was run in Hugging Face Jobs using Inspect AI with a valid verifyToken, while other badges indicate community PRs, direct links to leaderboards, or links to external source logs. These efforts complement existing specialized community leaderboards, such as the Massive Text Embedding Benchmark (MTEB), the OpenVLM Leaderboard for vision models, the Open ASR Leaderboard for audio, and the LLM-Perf Leaderboard for measuring latency and throughput.

The transition from fragmented, ad hoc evaluation tables to a unified, community-governed datastore represents a critical step toward making AI performance claims verifiable and comparable across the entire industry.