Frontier Long-Horizon Evals

Train and evaluate your model on frontier tasks.

Idler builds evals for coding, finance, science, and defense. Our tasks are fair, frontier-difficulty, long-horizon, diverse, and interesting. Customers use Idler tasks to measure how their models stack up against the competition, to check for regressions across model versions, to train on, and as high-taste hills for researchers to climb.

01 Environments
Every environment is a real engineering task, graded against a working result.
Real tasks
Pulled from live codebases, with a checkable result.
Dense reward
Every step scored, not just the final patch.
Real rollouts
Grounded in production engineering, not invented benchmarks.
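
To make "every step scored" concrete, here is a minimal sketch of dense versus final-only reward. It is an illustration under stated assumptions, not Idler's grader: the `Step` structure, the checkpoint names, and the weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One graded checkpoint in a rollout (hypothetical structure)."""
    name: str
    passed: bool
    weight: float = 1.0


def dense_reward(steps: list[Step]) -> float:
    """Weighted fraction of checkpoints passed, in [0, 1]."""
    total = sum(s.weight for s in steps)
    if total == 0:
        return 0.0
    return sum(s.weight for s in steps if s.passed) / total


def sparse_reward(steps: list[Step]) -> float:
    """Pass/fail on the final checkpoint only."""
    return 1.0 if steps and steps[-1].passed else 0.0


# Hypothetical debugging rollout: good progress, but the final tests fail.
rollout = [
    Step("reproduce_bug", True),
    Step("localize_fault", True),
    Step("apply_patch", True),
    Step("tests_pass", False, weight=2.0),
]

print(dense_reward(rollout))   # partial credit for the three passed steps
print(sparse_reward(rollout))  # no credit, because the final check failed
```

The dense score keeps the signal from the steps that went right. The sparse score throws it away.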
02 Method
From real engineering work to a graded world. The same five steps every time.
01
Perceive
Find where coding agents break on real engineering work.
02
Represent
Turn the task into an environment with a checkable result.
03
Build
Stand up the repo, the tests, and the grader.
04
Scale
Mass-produce variants. Early environments become training data.
05
Measure
Score where models fail, and aim the next environment there.
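
The Measure step can be sketched as a failure-rate tally. This is a minimal sketch, assuming graded results arrive as (category, passed) pairs; the function names and category labels are hypothetical, not part of any Idler API.

```python
from collections import defaultdict


def failure_rates(results):
    """Map each category to its share of failed attempts.

    `results` is an iterable of (category, passed) pairs (assumed format).
    """
    attempts = defaultdict(int)
    failures = defaultdict(int)
    for category, passed in results:
        attempts[category] += 1
        if not passed:
            failures[category] += 1
    return {c: failures[c] / attempts[c] for c in attempts}


def next_target(results):
    """Return the category with the highest failure rate, or None if empty."""
    rates = failure_rates(results)
    if not rates:
        return None
    return max(rates, key=rates.get)


# Hypothetical graded results across three task categories.
results = [
    ("debugging", True),
    ("debugging", False),
    ("refactors", False),
    ("refactors", False),
    ("feature_work", True),
]

print(failure_rates(results))
print(next_target(results))
```

A real pipeline would likely weight by sample size as well, so that a category with two attempts does not outrank one with two hundred.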
03 Domains
The engineering work the environments are built from.
Debugging
Reproduce, localize, and fix real bugs in a live repo.
Fix
Feature work
Build features across an unfamiliar codebase.
Build
Refactors
Restructure code without breaking what works.
Shape
Tests & review
Write tests, read diffs, and catch regressions.
Verify
04 Why Idler
Real, graded, frontier.
A
Real
Environments from real engineering work, never invented benchmarks. The skill transfers.
B
Graded
Every step checked against a working result. Dense reward, not just pass or fail.
C
Frontier
Built for the best models, aimed at the engineering they still get wrong.
05 Contact
Tell us where your models fail at real engineering. We build the world that trains them.