Standardizing AI Performance: Every Eval Ever Unveils Unified Schema for Fragmented Benchmark Data
Measuring the progress of artificial intelligence has historically been a fragmented and inconsistent endeavor. While evaluations are essential for testing models, results are currently scattered across academic papers, leaderboards, blog posts, and ad hoc log files. This lack of standardization makes it difficult to compare models, reproduce findings, or conduct meta-analyses. To address these challenges, researchers have introduced Every Eval Ever (EEE), a unifying schema and community-governed repository designed to standardize how AI evaluation results are stored and analyzed.
The Challenge of Fragmented AI Benchmarking
Currently, evaluation results are saved in incompatible formats across the machine learning ecosystem. Different evaluation frameworks often produce divergent scores for nominally identical benchmarks, and they record metadata inconsistently. Widely used benchmarks and suites such as Massive Multitask Language Understanding (MMLU), BIG-Bench Hard (BBH), GSM8K, TruthfulQA, and the Holistic Evaluation of Language Models (HELM) measure everything from general knowledge to mathematical reasoning. However, because published results often omit critical generation parameters, evaluation settings, and data provenance, researchers struggle to perform systematic scaling or quantization analyses. Even basic tasks, such as loading metrics like precision, recall, F1, and accuracy via evaluation libraries, can run into implementation bugs and inconsistent outputs across custom datasets.
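One plausible source of such divergence is how each framework decides that a response counts as correct. The toy sketch below scores the same three responses under two scoring rules. The data, the function names, and both rules are hypothetical illustrations, not taken from any particular evaluation harness.

```python
import re

# Made-up gold answers and model responses for a three-question
# multiple-choice benchmark (illustrative data only).
gold = ["B", "D", "A"]
outputs = ["B", "The answer is (D).", " a"]


def strict_match(output: str, answer: str) -> bool:
    """Correct only if the raw response equals the gold letter exactly."""
    return output == answer


def lenient_match(output: str, answer: str) -> bool:
    """Extract the first standalone letter A-D (case-insensitive) and compare.

    This is a toy rule: it would also pick up the article "A" in a sentence.
    """
    found = re.search(r"\b([A-D])\b", output.upper())
    return found is not None and found.group(1) == answer


def accuracy(scorer) -> float:
    """Fraction of responses the given scorer marks as correct."""
    hits = sum(scorer(o, g) for o, g in zip(outputs, gold))
    return hits / len(gold)


strict_acc = accuracy(strict_match)    # only the first response matches exactly
lenient_acc = accuracy(lenient_match)  # all three responses yield the gold letter
print(f"strict: {strict_acc:.3f}, lenient: {lenient_acc:.3f}")
```

The model outputs are identical in both cases; only the scoring rule changes. This is why a score is hard to interpret unless the evaluation settings are recorded alongside it.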
A Unified Schema and the EEE Datastore
Every Eval Ever offers a solution by defining a standardized metadata format stored in a single JSON document. Governed by the community on GitHub, the project provides a core metadata schema alongside an instance-level schema that allows for fine-grained, per-instance analysis. It is designed to be source-agnostic, automatically converting data from popular evaluation harnesses, leaderboards, and research papers.
The accompanying crowdsourced database, hosted on Hugging Face as the EEE_datastore, already spans 22,235 models, 2,273 unique benchmarks, and 31 evaluation formats. The dataset tracks highly granular fields, including evaluation IDs, schema versions, retrieved timestamps, model developers, inference platforms, and evaluator relationships. For instance, the datastore captures third-party evaluations run via Inspect AI by Arcadia Impact, such as evaluating DeepSeek R1 on the GAIA benchmark via OpenRouter, which yielded an accuracy score of 0.115, alongside scores for Mistral Small Latest and Google Gemini 2.0 Flash 001.
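To make the kinds of fields listed above concrete, here is a minimal sketch of how one such record might be represented and round-tripped through JSON. The class name, field names, and placeholder values are assumptions made for illustration; they are not the published Every Eval Ever schema. Only the DeepSeek R1, GAIA, OpenRouter, Inspect AI, Arcadia Impact, and 0.115 details come from the example above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class EvalRecord:
    # Hypothetical field names loosely mirroring the fields the article
    # lists; the real schema may name and nest them differently.
    evaluation_id: str
    schema_version: str
    retrieved_timestamp: str
    model_name: str
    model_developer: str
    inference_platform: str
    evaluator: str
    evaluator_relationship: str
    harness: str
    benchmark: str
    metric: str
    score: float


record = EvalRecord(
    evaluation_id="example-0001",                # placeholder, not a real ID
    schema_version="0.0.0",                      # placeholder version
    retrieved_timestamp="1970-01-01T00:00:00Z",  # placeholder timestamp
    model_name="DeepSeek R1",
    model_developer="DeepSeek",
    inference_platform="OpenRouter",
    evaluator="Arcadia Impact",
    evaluator_relationship="third_party",        # assumed label
    harness="Inspect AI",
    benchmark="GAIA",
    metric="accuracy",
    score=0.115,
)

# Serialize to a single JSON document and read it back.
serialized = json.dumps(asdict(record), indent=2)
restored = EvalRecord(**json.loads(serialized))
print(serialized)
```

Keeping the evaluator, harness, and inference platform next to the score is what allows two results for the same model and benchmark to be compared later.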
Hugging Face's Decentralized Evaluation Ecosystem
In parallel with community-led standardization efforts, Hugging Face has been developing a decentralized tracking system for model evaluations on its Hub. Under this system, dataset repositories can be designated as Benchmarks, such as MMLU-Pro, Humanity's Last Exam (HLE), or GPQA, which automatically compile results to display live leaderboards.
Model developers can store their evaluation scores directly within their model repositories as YAML files in a designated .eval_results folder. Anyone in the community can submit these scores via pull requests. To ensure validity, results can display specific badges: a verified badge indicates the evaluation was run in Hugging Face Jobs using Inspect AI with a valid verifyToken, while other badges indicate community PRs, direct links to leaderboards, or links to external source logs. These efforts complement existing specialized community leaderboards, such as the Massive Text Embedding Benchmark (MTEB), the OpenVLM Leaderboard for vision models, the Open ASR Leaderboard for audio, and the LLM-Perf Leaderboard for measuring latency and throughput.
The transition from fragmented, ad hoc evaluation tables to a unified, community-governed datastore represents a critical step toward making AI performance claims verifiable and comparable across the entire industry.
