AI agents are getting better at performing tasks related to human work: writing code, drafting emails, conducting research, and more. But a critical question is still hard to answer:
How representative are today's agent benchmarks of real-world work?
Most progress in agent development is driven by and measured through benchmarks. If those benchmarks are skewed toward a narrow slice of tasks, then improvements may not translate into broad productivity gains or meaningful relief across the labor market. This project aims to make that relationship measurable.
We build a unified database of AI agent benchmarks mapped to real-world work, so we can analyze where agent development is concentrated and what is missing.
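To make the idea concrete, here is a minimal sketch of what such a benchmark-to-work mapping could look like. The schema, field names, category labels, and example entries are hypothetical illustrations under our own assumptions, not the project's actual data model.

```python
# A sketch of one way to represent benchmark tasks mapped to real-world work.
# All names and entries below are hypothetical, for illustration only.
from dataclasses import dataclass, field
from collections import Counter


@dataclass
class BenchmarkTask:
    benchmark: str                      # benchmark suite the task comes from
    task_id: str                        # identifier within that benchmark
    occupation: str                     # real-world occupation the task resembles
    work_activity: str                  # broader category of work activity
    modalities: list[str] = field(default_factory=list)


def coverage_by_activity(tasks: list[BenchmarkTask]) -> Counter:
    """Count benchmark tasks per work-activity category, to show where
    agent evaluation is concentrated and where it is sparse."""
    return Counter(t.work_activity for t in tasks)


if __name__ == "__main__":
    # Hypothetical entries purely for demonstration.
    tasks = [
        BenchmarkTask("code_benchmark_a", "task-001", "Software Developer",
                      "Writing and debugging code", ["text", "code"]),
        BenchmarkTask("email_benchmark_b", "task-042", "Administrative Assistant",
                      "Drafting correspondence", ["text"]),
        BenchmarkTask("code_benchmark_a", "task-017", "Software Developer",
                      "Writing and debugging code", ["text", "code"]),
    ]
    print(coverage_by_activity(tasks))
    # -> Counter({'Writing and debugging code': 2, 'Drafting correspondence': 1})
```

With a structure along these lines, gaps become visible as simple counts: work activities with few or no mapped benchmark tasks are candidates for where agent evaluation, and therefore agent development, is underrepresented.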