Each environment runs sixty-four rollouts, scored step by step against ground truth. The record is the same for every lab.
The corpus is sorted by where models hold the baseline and where they drift, so you can choose the next capability to target by where models fail.
New environments enter as frontier models clear the old ones. Early environments become training data.
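
As a rough illustration of how sixty-four scored rollouts could be reduced to a single pass@k figure like the one in the table's pass@k column, here is a minimal sketch. It assumes the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k); the text does not say which estimator the corpus actually uses, and the names `pass_at_k` and `N_ROLLOUTS` are hypothetical.

```python
from math import comb

# Assumption: one environment = 64 rollouts, as stated in the text.
N_ROLLOUTS = 64


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k rollouts passes.

    n: total rollouts sampled for the environment
    c: rollouts that passed scoring against ground truth
    k: budget of attempts being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made up entirely of failures.
    # It is 0 when k > n - c, which makes the estimate exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    passed = 16  # hypothetical count, for illustration only
    for k in (1, 8, 64):
        print(f"pass@{k} = {pass_at_k(N_ROLLOUTS, passed, k):.4f}")
```

With `k = 1` the estimate reduces to the plain pass rate `c / n`, so larger `k` values are only meaningful when the rollouts are sampled independently.
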
| ID | Environment | pass@k | Domain |
|---|---|---|---|