Open evaluation harnesses, datasets, and benchmark curation for agentic systems.
Agentic benchmark shards with trajectory transcripts, tool traces, and evaluator labels.