ERP-Bench
300 verifiable, long-horizon agent tasks in a real ERP.
Procurement and manufacturing workflows in Odoo 19, a production-grade open-source ERP.
Procurement · manufacturing · sales · finance · ships in Harbor format.
pass@1 · harness: pi · † halted early after >500 zero-point trials.
Anchor
Compile every task from one solved constraint program — so the instruction, environment, solution, and grader can't disagree.
Environments drift.
When a task's pieces are authored independently, they disagree, leading to unsolvable tasks, broken grading, and reward hacks.
Anchor keeps them aligned.
Formalize the workflow as a parametric CP-SAT program. The solver certifies an optimum per sample; deterministic compilers emit the task.
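As a rough illustration of the idea, the sketch below derives an instruction, a reference solution, and a grader from one parametric spec. It is not Anchor's implementation: the toy procurement problem, the names `Spec`, `solve`, and `compile_task`, and the exhaustive search (a stand-in for CP-SAT so the example runs on the standard library alone) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Spec:
    """Hypothetical parametric spec: meet `demand` units at minimum cost."""
    demand: int
    vendors: tuple  # each entry: (name, unit_price, capacity)


def cost(spec, qty):
    # Total purchase cost for one quantity per vendor.
    return sum(q * price for q, (_, price, _) in zip(qty, spec.vendors))


def feasible(spec, qty):
    # Respect every vendor capacity and cover the demand.
    within_caps = all(0 <= q <= cap for q, (_, _, cap) in zip(qty, spec.vendors))
    return within_caps and sum(qty) >= spec.demand


def solve(spec):
    # Exhaustive search stands in for the CP-SAT solver here (assumption).
    ranges = [range(cap + 1) for _, _, cap in spec.vendors]
    candidates = (q for q in product(*ranges) if feasible(spec, q))
    return min(candidates, key=lambda q: cost(spec, q))


def compile_task(spec):
    # Every artifact is derived from the same spec and the same solved optimum.
    best = solve(spec)
    optimum = cost(spec, best)
    vendor_text = ", ".join(
        f"{name} (price {price}, max {cap})" for name, price, cap in spec.vendors
    )
    instruction = (
        f"Purchase at least {spec.demand} units at minimum total cost. "
        f"Vendors: {vendor_text}."
    )

    def grader(qty):
        # Accept any feasible plan that matches the solved optimum.
        return feasible(spec, qty) and cost(spec, qty) == optimum

    return {
        "instruction": instruction,
        "solution": best,
        "optimum": optimum,
        "grader": grader,
    }


spec = Spec(demand=10, vendors=(("A", 5, 6), ("B", 7, 8)))
task = compile_task(spec)
```

Because the instruction text, the reference solution, and the grader all read from the same `spec`, changing a parameter regenerates all three together, which is the property the text describes.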
A certified optimum lets us grade solutions on a sliding scale, separating good-enough results from perfect ones.
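One way such a sliding scale could work for a cost-minimization task is sketched below. The ratio formula and the name `sliding_score` are assumptions; the source does not specify the scoring rule.

```python
def sliding_score(agent_cost, optimum, is_feasible):
    """Hypothetical score in [0, 1] for a minimization task.

    Returns 1.0 when the agent matches the certified optimum and
    0.0 when its end-state violates a constraint.
    """
    if optimum <= 0:
        raise ValueError("this sketch assumes a strictly positive optimum")
    if not is_feasible:
        return 0.0
    if agent_cost < optimum:
        # A feasible plan cannot beat a certified optimum; treat as a bug.
        raise ValueError("feasible cost below the certified optimum")
    return optimum / agent_cost


perfect = sliding_score(agent_cost=58, optimum=58, is_feasible=True)
good_enough = sliding_score(agent_cost=60, optimum=58, is_feasible=True)
invalid = sliding_score(agent_cost=40, optimum=58, is_feasible=False)
```

Without a known optimum, a grader could only check feasibility; with one, a plan that costs slightly more than the best possible earns partial credit instead of a flat pass or fail.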
Tasks that are…
- Verifiable: Solver-certified optimum.
- Open-ended: Many valid end-states.
- Consistent: One spec, no drift.
- Tunable: Difficulty controlled by parameters.
- Scalable: Mint fresh instances at will.
@misc{ivanov2026anchor,
title = {Anchor: Mitigating Artifact Drift in Agent Benchmark Generation},
author = {Ivanov, Maksim and Rana, Abhijay},
year = {2026},
url = {https://openreview.net/forum?id=Vm6HkNyehc},
note = {Presented at the RLEval Workshop, ACM CAIS 2026 (non-archival)}
}