Built with Anchor

by Agentic Labs

paper · openreview.net

code · github.com

data · harborframework.com

ERP-Bench

300 verifiable, long-horizon agent tasks in a real ERP.

Procurement and manufacturing workflows in Odoo 19, a production-grade open-source ERP.

Maksim Ivanov · Abhijay Rana — Agentic Labs

300

Verifiable tasks

Task patterns

100+

Steps per task

50+

Business rules / task

Covers procurement, manufacturing, sales, and finance. Ships in Harbor format.

Example task · multi-order fulfillment

The agent operates the ERP workspace to complete the task:

"40 widgets in 5–9 days. Stock 34. Buy parts, schedule build, send invoices — spend least."

ERP Workspace · db: widget-co · user: agent

Sales › Open orders (4) · due in 5–9 days

#2401 · Acme Robotics · 12 widgets · 5 Aug
#2402 · Crown Distrib. · 10 widgets · 7 Aug
#2403 · Hartwell GmbH · 11 widgets · 7 Aug
#2404 · Voyager Co. · 7 widgets · 9 Aug

In stock 34 · Ordered 40 · Short −6

The agent must: choose suppliers · buy parts · schedule build · send invoices · follow policy
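The shortfall in this example follows directly from the four open orders and the stock level. The sketch below reproduces that arithmetic and shows one way to decide which orders need newly built units. The `Order` class, the function names, and the earliest-due-first allocation rule are illustrative assumptions, not part of ERP-Bench or Odoo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    ref: str
    customer: str
    qty: int
    due: tuple[int, int]  # (month, day); the page gives no year


# Order data copied from the example task above.
ORDERS = [
    Order("#2401", "Acme Robotics", 12, (8, 5)),
    Order("#2402", "Crown Distrib.", 10, (8, 7)),
    Order("#2403", "Hartwell GmbH", 11, (8, 7)),
    Order("#2404", "Voyager Co.", 7, (8, 9)),
]
IN_STOCK = 34


def shortfall(orders: list[Order], in_stock: int) -> int:
    """Units that must be built or bought to cover all orders."""
    return max(0, sum(o.qty for o in orders) - in_stock)


def units_to_build(orders: list[Order], in_stock: int) -> dict[str, int]:
    """Assumed policy: serve orders from stock, earliest due date first."""
    remaining = in_stock
    plan = {}
    for order in sorted(orders, key=lambda o: o.due):
        from_stock = min(order.qty, remaining)
        remaining -= from_stock
        plan[order.ref] = order.qty - from_stock
    return plan


print(shortfall(ORDERS, IN_STOCK))       # 6
print(units_to_build(ORDERS, IN_STOCK))  # only #2404 needs 6 new units
```

Under this assumed policy the whole shortfall lands on the latest order, which gives the agent the most time to buy parts and schedule the build.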

Database end state

customer orders +4
parts ordered +3
products built +2
items shipped +19
invoices sent +4
cash spent −$4.8k

pass@1 (%)

Model            Developer · license        Coding  Browser  Computer
GPT-5.5          OpenAI · proprietary         43.4      9.7       8.0
Claude Opus 4.7  Anthropic · proprietary      30.8     28.8      23.4
GLM-5.1          Zhipu · open-weight          35.8      2.4       0.0†
Kimi K2.5        Moonshot · open-weight        9.1      1.5       0.0†

pass@1 · harness: pi · † halted early after >500 zero-point trials.
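For readers unfamiliar with the metric, the sketch below computes pass@1 with the standard unbiased pass@k estimator, which for k = 1 reduces to the fraction of passing trials per task, averaged over tasks. The trial data and task names are hypothetical, and the page does not state how many trials per task were run, so this is not a reproduction of the reported numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate for one task: n trials, c of them passing."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one pass
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: dict[str, list[bool]]) -> float:
    """Mean pass@1 over tasks, as a percentage."""
    per_task = [pass_at_k(len(t), sum(t), 1) for t in results.values()]
    return 100.0 * sum(per_task) / len(per_task)


# Hypothetical trial outcomes, for illustration only.
trials = {
    "task-001": [True, False, False, True],    # 2 of 4 pass
    "task-002": [False, False, False, False],  # 0 of 4 pass
    "task-003": [True, True, True, True],      # 4 of 4 pass
}
print(round(benchmark_pass_at_1(trials), 1))  # 50.0
```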

The method behind ERP-Bench

Anchor

Compile every task from one solved constraint program — so the instruction, environment, solution, and grader can't disagree.

§ 1 — Problem

Environments drift.

Authored independently, a task's pieces disagree — leading to unsolvable tasks, broken grading, and reward hacks.

Figure: four task artifacts with arrows showing pairwise inconsistency failure modes.

§ 2 — Method

Anchor keeps them aligned.

Formalize the workflow as a parametric CP-SAT program. The solver certifies an optimum per sample; deterministic compilers emit the task.

Figure: the Anchor pipeline, from parametric spec to solved instance to task.
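The idea of compiling every artifact from one solved instance can be sketched on a toy purchasing problem. This is a minimal illustration under stated assumptions, not Anchor's implementation: an exhaustive search stands in for the CP-SAT solver so the example needs only the standard library, and every name here (`Supplier`, `sample_instance`, `compile_task`, the instruction wording) is hypothetical.

```python
import itertools
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Supplier:
    name: str
    unit_price: int  # whole dollars per part
    max_units: int


@dataclass(frozen=True)
class Instance:
    shortfall: int
    suppliers: tuple[Supplier, ...]


def sample_instance(seed: int, n_suppliers: int = 3) -> Instance:
    """Draw one instance of the parametric spec; n_suppliers tunes difficulty."""
    rng = random.Random(seed)
    suppliers = tuple(
        Supplier(f"S{i}", rng.randint(20, 60), rng.randint(3, 8))
        for i in range(n_suppliers)
    )
    capacity = sum(s.max_units for s in suppliers)
    return Instance(rng.randint(1, capacity), suppliers)  # always solvable


def cost(inst: Instance, plan: tuple[int, ...]) -> int:
    return sum(q * s.unit_price for q, s in zip(plan, inst.suppliers))


def feasible(inst: Instance, plan: tuple[int, ...]) -> bool:
    return (
        len(plan) == len(inst.suppliers)
        and all(0 <= q <= s.max_units for q, s in zip(plan, inst.suppliers))
        and sum(plan) >= inst.shortfall
    )


def solve(inst: Instance) -> tuple[int, ...]:
    """Exhaustive search: a stand-in for CP-SAT, exact on tiny instances."""
    ranges = [range(s.max_units + 1) for s in inst.suppliers]
    candidates = (p for p in itertools.product(*ranges) if feasible(inst, p))
    return min(candidates, key=lambda p: cost(inst, p))


def compile_task(inst: Instance) -> dict:
    """Emit instruction, environment, solution, and grader from one instance."""
    best = solve(inst)
    optimum = cost(inst, best)
    offers = "; ".join(
        f"{s.name}: ${s.unit_price}/unit, up to {s.max_units}" for s in inst.suppliers
    )
    return {
        "instruction": f"Buy at least {inst.shortfall} parts at the lowest cost. Offers: {offers}.",
        "environment": [vars(s) for s in inst.suppliers],
        "solution": best,
        "optimum": optimum,
        "grade": lambda plan: feasible(inst, plan) and cost(inst, plan) == optimum,
    }


task = compile_task(sample_instance(seed=7))
print(task["instruction"])
print(task["grade"](task["solution"]))  # True
```

Because the instruction text, the seeded environment records, the reference solution, and the grader all read from the same `Instance`, none of them can be edited out of step with the others, and a new seed yields a fresh task.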

With a certified optimum, each solution can be scored on a sliding scale that separates good-enough from perfect.
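One way such a sliding scale could work is to score a valid solution by the ratio of the certified optimal cost to the cost the agent actually incurred. The page does not give the scoring formula, so the function below, its name, and the dollar figures are assumptions chosen for illustration.

```python
def graded_score(cost: float, optimum: float, feasible: bool) -> float:
    """Score in [0, 1]: 1.0 at the certified optimum, lower as cost rises.

    Assumed rule: infeasible end states score 0; feasible ones score
    optimum / cost.
    """
    if optimum <= 0:
        raise ValueError("optimum must be positive")
    if not feasible:
        return 0.0
    if cost < optimum:
        # A feasible solution cheaper than the certified optimum means the
        # grader and the solver disagree, which is the drift to be caught.
        raise ValueError("cost is below the certified optimum")
    return optimum / cost


# Hypothetical costs in dollars.
print(graded_score(4800, 4800, True))   # 1.0
print(graded_score(6000, 4800, True))   # 0.8
print(graded_score(3000, 4800, False))  # 0.0
```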

§ 3 — What you get

Tasks that are…

Verifiable: solver-certified optimum.

Open-ended: many valid end-states.

Consistent: one spec, no drift.

Tunable: difficulty controlled by parameters.

Scalable: mint fresh instances at will.

Citation

@misc{ivanov2026anchor,
  title  = {Anchor: Mitigating Artifact Drift in Agent Benchmark Generation},
  author = {Ivanov, Maksim and Rana, Abhijay},
  year   = {2026},
  url    = {https://openreview.net/forum?id=Vm6HkNyehc},
  note   = {Presented at the RLEval Workshop, ACM CAIS 2026 (non-archival)}
}