I was anticipating that having AI write code to pass tests (human- and/or AI-written) would be worthwhile, but in practice I've found that even models such as Opus 4.6 (Thinking, High Effort) simply "cheat", or rather, fail to generalize, far too often. It's occurred to me that perhaps I need some amount of randomness in the tests to keep the models honest, but it feels wrong. We'll see.
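If I do go the randomness route, what I have in mind is something like property-based testing: generate fresh random inputs each run and check the output against an oracle or invariant, rather than against fixed expected values the model could memorize. A minimal sketch in Python (my_sort and the test are hypothetical stand-ins, not from any real harness):

    import random

    def my_sort(xs):
        # Hypothetical function under test; in practice, the AI-written code.
        return sorted(xs)

    def test_sort_randomized():
        rng = random.Random()  # unseeded on purpose: fresh inputs every run
        for _ in range(100):
            n = rng.randint(0, 50)
            xs = [rng.randint(-1000, 1000) for _ in range(n)]
            result = my_sort(xs)
            # Check against an oracle instead of hardcoded literals, so
            # memorizing specific test cases can't pass.
            assert result == sorted(xs)

    if __name__ == "__main__":
        test_sort_randomized()
        print("ok")

The unseeded RNG is the point: there's no fixed expected output to bake in, only the property itself, so a special-cased solution fails on the next run.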
Maybe it's just me, but this "We have to fudge the truth because nuance would support the alt-right" business just seems to drive a bigger wedge into the political divide than simply being reasonable would. Folks closer to the center see it as narrative control, lies, and conspiracy once the full truth comes out. I'd prefer not driving more people into the fringes.