Because there's an unlimited amount of work to do. This is the same reason you are not fired once you complete a feature :-) The point of hiring an FTE is to continue to create work that provides business value. For your analogy, FTEs often do that by hiring temps, and you can think of the agent as the new temp in this case - the human drives an unlimited number of them.
Yes, but benchmarks like this are often flawed because leading model labs frequently participate in 'benchmarkmaxxing' - i.e., improvements on ARC-AGI-2 don't necessarily indicate similar improvements in other areas (though this does seem like a step-function increase in intelligence for the Gemini line of models).