01 Environments
A neutral record, measured the same for every lab.
Verifiable tasks
A real engineering task with a checkable outcome.
Dense reward
Every step scored against ground truth.
Real rollouts
Grounded in production work, not invented.
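To make "every step scored against ground truth" concrete, here is a minimal sketch. It assumes a rollout is a list of step labels and that a reference trajectory of the same length exists; the function names, the action labels, and the exact-match scoring rule are all illustrative assumptions, not a description of the actual grader.

```python
from typing import List, Sequence


def dense_rewards(steps: Sequence[str], ground_truth: Sequence[str]) -> List[float]:
    """Return one reward per step: 1.0 where the step matches the reference, else 0.0.

    Exact string match is an assumed scoring rule, chosen only for illustration.
    """
    if len(steps) != len(ground_truth):
        raise ValueError("rollout and reference must have the same number of steps")
    return [1.0 if s == g else 0.0 for s, g in zip(steps, ground_truth)]


def sparse_reward(steps: Sequence[str], ground_truth: Sequence[str]) -> float:
    """For contrast: a single reward that only checks the final step."""
    return 1.0 if steps and steps[-1] == ground_truth[-1] else 0.0


# Hypothetical rollout: the first two steps are right, the last one is wrong.
rollout = ["open_ticket", "query_db", "send_refund"]
reference = ["open_ticket", "query_db", "close_ticket"]

per_step = dense_rewards(rollout, reference)
final_only = sparse_reward(rollout, reference)
```

The dense signal shows where the rollout went wrong; the sparse signal only shows that it did.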
02 Method
One process, from a capability to a graded environment. The same five steps every time.
01 Capability
Perceive
Map a capability and its failure modes until the reward is well defined.
02 Rubric
Represent
Formalize it into a task distribution with a verifiable rubric.
03 No contamination
Build
Stand up environments that separate cleanly from eval and resist contamination.
04 Distribution
Scale
Mass-produce variants across the distribution. Early environments become training data.
05 pass@k
Choose
Score pass@k per model. Point the next environment at what the models fail.
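The final step can be sketched in a few lines. This uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled attempts of which c passed. The result table, the model and environment names, and the `weakest_environment` helper are hypothetical, included only to show how a score could point at the next environment to build.

```python
from math import comb
from typing import Dict, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n attempts with c passes."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k sample contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def weakest_environment(per_env: Dict[str, Tuple[int, int]], k: int) -> str:
    """Return the environment with the lowest pass@k (hypothetical helper)."""
    return min(per_env, key=lambda env: pass_at_k(*per_env[env], k))


# Hypothetical results: environment -> (attempts n, passes c), per model.
results = {
    "model_a": {"tool_use": (10, 7), "error_recovery": (10, 1)},
    "model_b": {"tool_use": (10, 2), "error_recovery": (10, 6)},
}

# For each model, the environment family to target next.
next_targets = {model: weakest_environment(envs, k=5) for model, envs in results.items()}
```

Under these made-up numbers, model_a would be pointed at error recovery and model_b at tool use.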
03 Domains
Where the method is pointed. In priority order, by stakes.
Safety
Alignment and oversight. The first call on everything.
Priority
Defense
High-stakes capability and red-team work.
High-stakes
Science
Bio, pharma, research automation.
Research
Commerce
Agentic work on real company operations. Live today.
Live
04 Why Idler
Grounded, broad, frontier.
A
Grounded
Environments from real production data, not invented. Less reward hacking, better transfer.
B
Broad
Coverage across coding, tool use, long-horizon tasks, and error recovery.
C
Frontier
Built for the models clearing the hardest evals, on the work they fail next.
05 Contact
Name the capability your models miss. We build the environment, graded against ground truth.