AI Agents for Human Work

Agent benchmarks focus on programming, while 92.4% of human employment lies elsewhere.

Explore Domains


We map agent benchmark examples to the real-world work they represent, exposing the gap between agent benchmarks and the work people actually do.

Why This Project?

AI agents are getting better at performing tasks related to human work: writing code, drafting emails, conducting research, and more. But a critical question is still hard to answer:

How representative are today's agent benchmarks of real-world work?

Most progress in agent development is driven by and measured through benchmarks. If those benchmarks are skewed toward a narrow slice of tasks, then improvements may not translate into broad productivity gains or meaningful relief for workers. This project aims to make that relationship measurable.

We build a unified database of AI agent benchmarks mapped to real-world work, so we can analyze where agent development is concentrated and what is missing.

What We Built

We collected and standardized 70,732 tasks from 39 agent benchmarks and mapped them to 1,016 real-world occupations in the U.S. labor market.

Each benchmark instance is mapped to:

  • Domains: e.g., engineering, administrative support, management
  • Skills: e.g., getting information, interacting with others, mental processes
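
To make the mapping concrete, here is a minimal sketch of what one record in such a database could look like. The field names and values are our own illustration, not the project's actual schema (the count of 1,016 occupations suggests a taxonomy such as O*NET, but that is an assumption):

```python
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    """One benchmark instance mapped to the real-world work it represents.
    Field names are illustrative, not the project's actual schema."""
    benchmark: str          # source benchmark, e.g., "SWE-bench"
    task_id: str            # instance identifier within that benchmark
    occupations: list[str]  # mapped occupations (e.g., O*NET-style titles)
    domains: list[str]      # e.g., ["engineering"]
    skills: list[str]       # e.g., ["mental processes"]

# Made-up example record, for illustration only:
task = BenchmarkTask(
    benchmark="SWE-bench",
    task_id="django__django-12345",
    occupations=["Software Developers"],
    domains=["engineering"],
    skills=["mental processes", "getting information"],
)
```

With records in this shape, questions like "what share of tasks targets each domain?" become simple aggregations.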

How Well Do Agents Carry Out Skills?

Explore Skills


What We Found

Agent development is highly concentrated in a few domains and skills. Benchmarks disproportionately focus on programming and math-heavy tasks. Meanwhile, large portions of human labor and capital lie in other fields (e.g., management, legal work) where benchmarks are sparse.
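
One simple way to quantify that concentration, sketched below with made-up toy numbers (not the project's measured values), is to compare each domain's share of benchmark tasks against its share of employment:

```python
# Toy numbers for illustration only; not the project's measured values.
benchmark_tasks = {"engineering": 52_000, "administrative support": 900,
                   "management": 400, "legal": 150}
employment = {"engineering": 2.1e6, "administrative support": 18.9e6,
              "management": 9.5e6, "legal": 1.3e6}

total_tasks = sum(benchmark_tasks.values())
total_jobs = sum(employment.values())

for domain in benchmark_tasks:
    task_share = benchmark_tasks[domain] / total_tasks
    job_share = employment[domain] / total_jobs
    # Ratio > 1: the domain is over-represented in benchmarks
    # relative to its share of employment; < 1: under-represented.
    print(f"{domain:>22}: tasks {task_share:6.1%}, "
          f"jobs {job_share:6.1%}, ratio {task_share / job_share:.2f}x")
```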

Work-like benchmarks are not always realistic. Some tasks resemble real work on the surface but involve limited domain context or require only a narrow set of skills. Real-world jobs often involve coordinating multiple skills across domains, which many benchmarks only partially capture.

Which domains matter most? Which skills should we prioritize?

Measuring Agent Autonomy

To translate benchmark scores into practical insights, we introduce a unified task complexity scale and use it to measure agent autonomy: the frontier of an agent's performance as task complexity increases.

By analyzing agent trajectories across benchmarks, we can:

  • Compare agents across domains of human work (which agent performs best for my domain and skills?)
  • Identify each agent's situated autonomy level

This tells us the appropriate level of autonomy to grant each agent, and when human oversight is still needed.
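
As a concrete (if simplified) reading of that idea, one could estimate an agent's autonomy level as the highest complexity at which its success rate stays above a threshold. In the sketch below, the function name, the (complexity, success) trajectory format, and the 0.8 threshold are all our own assumptions, not the project's exact estimator:

```python
from collections import defaultdict

def autonomy_level(trajectories, threshold=0.8):
    """Highest complexity level at which success rate >= threshold.

    `trajectories`: iterable of (complexity, success) pairs, where
    complexity is an integer on the unified scale and success is a bool.
    An illustrative reading of "performance frontier as task complexity
    increases", not the project's exact estimator.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for complexity, success in trajectories:
        totals[complexity] += 1
        wins[complexity] += int(success)

    level = 0
    for c in sorted(totals):
        if wins[c] / totals[c] >= threshold:
            level = c
        else:
            break  # the frontier ends at the first level the agent fails
    return level

# Toy runs: reliable up to complexity 2, shaky at complexity 3.
runs = [(1, True), (1, True), (2, True), (2, True), (3, True), (3, False)]
print(autonomy_level(runs))  # -> 2
```

Above this level, the sketch suggests keeping a human in the loop; below it, the agent can plausibly run unsupervised.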

What's Next?

We hope this project helps shift the conversation from "Are agents getting better on benchmarks?" to

"Are agents getting better at human work that matters?"

We still have a long way to go:

  • We need more benchmarks that cover a wider range of work areas and skills
  • We need more agent trajectories to measure autonomy comprehensively

Any feedback or contributions are welcome!