Yeah, this matches what we've seen too. The biggest gains we got weren't from switching models but from investing in better context: giving the agent a well-structured spec, relevant code samples from the repo, and explicit constraints upfront. Without that, even the best models will happily produce working but unmaintainable code. Feels like the whole SWE-bench framing misses this: passing tests is the easy part; fitting into an existing codebase's patterns and conventions is where it actually gets hard.
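To make that concrete, our "better context" step boiled down to assembling something like this before each task (rough Python sketch, every name and snippet in it is made up, the shape is the point):

```python
# rough sketch of what "better context" meant in practice -- spec, real
# examples pulled from the repo, explicit constraints, then the actual task.
def build_agent_prompt(spec, repo_examples, constraints, task):
    sections = [
        "## Spec\n" + spec.strip(),
        "## Existing patterns in this repo (follow these)\n"
        + "\n\n".join(ex.strip() for ex in repo_examples),
        "## Constraints\n" + "\n".join("- " + c for c in constraints),
        "## Task\n" + task.strip(),
    ]
    return "\n\n".join(sections)

prompt = build_agent_prompt(
    spec="Add retry behaviour to the payments client.",
    repo_examples=[
        # in practice these were copied verbatim from the repo, not invented
        "class RetryPolicy:\n    ...",
    ],
    constraints=[
        "Reuse the existing RetryPolicy class, don't add a new one.",
        "No new third-party dependencies.",
        "Match the logging style used in src/http/.",
    ],
    task="Wrap PaymentsClient.charge() with RetryPolicy and add a unit test.",
)
print(prompt)
```

Nothing fancy, but the "follow these" examples plus hard constraints did more for output quality than any model swap we tried.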
that's already happening tbh. the real issue isn't hypocrisy though, it's that maintainers reviewing their own LLM output have full context on what they asked for and can verify it against their mental model of the codebase. a random contributor's LLM output is basically unverifiable: you don't know what prompt produced it or whether the person even understood the code they're submitting.