
Watch live AI competitions, follow outcomes, and explore transparent replays across ClashAI arenas.
This is great. I think leaderboards based on static evals will be mostly irrelevant within a year. Continuous benchmarks like this are the only way to get a signal on frontier models.
You mention Opus 4.6 cost $1200 in one match. How do you plan to benchmark economic efficiency? Looking at the performance-vs-cost trade-off, you might say a model that plays 80% as well at 1% of the cost is more impressive than the "top" model.
Unfortunately, for a game that runs 4+ hours, it was configured to use too much reasoning per turn and too large a context. Reducing the context size helped lower the cost (it's still expensive).
In the leaderboard section of the page, I'll be auto-populating each model's token cost as a metric to evaluate on.
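The trade-off described above can be sketched as a simple cost-adjusted ranking. This is a hypothetical illustration, not the actual ClashAI metric; the model names, win rates, and costs below are made up, and "win rate per dollar" is just one of many ways to score cost efficiency.

```python
# Hypothetical sketch of ranking models by performance per dollar.
# All names and numbers are illustrative, not real leaderboard data.

def cost_adjusted_score(win_rate: float, cost_usd: float) -> float:
    """Win rate per dollar spent: one simple cost-efficiency metric."""
    return win_rate / cost_usd

models = {
    "model_a": {"win_rate": 1.00, "cost_usd": 1200.0},  # the "top" model
    "model_b": {"win_rate": 0.80, "cost_usd": 12.0},    # 80% as good, 1% of the cost
}

# Sort models from most to least cost-efficient.
ranked = sorted(
    models,
    key=lambda name: cost_adjusted_score(**models[name]),
    reverse=True,
)
# model_b (0.80 / 12 ≈ 0.067) far outranks model_a (1.00 / 1200 ≈ 0.00083)
```

Under this metric the cheaper model wins by two orders of magnitude, which is exactly the point of the question: a raw-skill leaderboard and a cost-efficiency leaderboard can order the same models very differently.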
This is an amazing product! Can AI agents learn to do long-term planning in environments that are less structured than chess? Great metaphor for life! Are you planning other games?
Congrats on the launch. Big fan of how you add visualization and interactivity to the typical model benchmarking process. Any thoughts on how you plan to monetize down the line?
Appreciate it! I wanted to make the AI behavior easy to understand. Our main focus right now is helping AI researchers align their models and developing an open framework for evaluating AI.
This project is an enhanced reader for Y Combinator Hacker News: https://news.ycombinator.com/.
The interface also lets you comment, post, and interact with the original HN platform. Credentials are stored locally and are never sent to any server; you can check the source code here: https://github.com/GabrielePicco/hacker-news-rich.
For suggestions and feature requests, you can reach me here: gabrielepicco.github.io