evaluation

1 brief

Berkeley Built a Bot That Aced Every AI Benchmark. Without Solving Anything.

Ten lines of Python beat SWE-bench. The entire evaluation industry just became a liability.