> * Terminal-Bench 2.0: Harbor/Terminus-2 harness; 3h timeout, 32 CPU/48 GB RAM; temp=1.0, top_p=0.95, top_k=20, max_tokens=80K, 256K ctx; avg of 5 runs.
Terminal bench 2.0 rules explicitly disallow modifying timeouts or resources available. Each terminal bench task has timeout (usually under 1h, mostly under 30 mins) and resources configured in the docker container by the task creator and they are chosen that way to test specific model aspects.
cgeier•1h ago