We are a research community producing scientifically grounded research outputs and building robust deployment infrastructure for broader impact evaluations.
This project aims to investigate how to systematically characterize the complexity and behavior of AI benchmarks over time, with the overarching goal of informing more robust benchmark design.
This project addresses the need for a structured and systematic approach to documenting AI model evaluations through the creation of "evaluation cards," focusing specifically on technical base syst...
The Eleuther Harness Tutorials project is designed to lower the barrier to entry for using the LM Evaluation Harness, making it easier for researchers and practitioners to onboard, evaluate, and co...
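For readers new to the harness, the sketch below shows roughly what a single evaluation run looks like; it assumes the `simple_evaluate` entry point available in recent `lm-eval` releases, and the model and task names are illustrative placeholders rather than recommendations.

```python
# Minimal sketch of one evaluation run with the LM Evaluation Harness.
# Assumes `pip install lm-eval`; the model and task names are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                      # Hugging Face model backend
    model_args="pretrained=EleutherAI/pythia-160m",  # any HF checkpoint id
    tasks=["lambada_openai"],                        # one or more task names
    batch_size=8,
)

# Per-task metrics (accuracy, perplexity, etc.), keyed by task name.
print(results["results"])
```

An equivalent run can also be launched from the command line with the harness's `lm_eval` CLI.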
Researchers, practitioners, and students are welcome to contribute to our mission.