
mrothroc

Karma: 5

Created: 2026-03-03

Recent Activity

  • Senior review can definitely help, regardless of whether the code comes from a junior or an LLM. We've done this since the dawn of time. However, it doesn't scale: LLM volume far exceeds what juniors can produce, so you end up overwhelming the seniors, who are normally overbooked anyway.

    The other problem is that the types of errors LLMs make are different from the ones juniors make. There are huge sections of genuinely good code, so the senior gets "review fatigue": so much looks good that they just start rubber-stamping.

    I use an automated pipeline to generate code (including terraform, risking infrastructure nukes), and I am the senior reviewer. But I have gates that do a whole range of checks, both deterministic and stochastic, before it ever gets to me. Easy things are pushed back to the LLM for it to autofix. I only see things where my eyes can actually make a difference.

    Amazon's instinct is right (add a gate), but the implementation is wrong (make it human). Automated checks first, humans for what's left.
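    The "automated checks first, humans for what's left" loop above can be sketched roughly as below. Everything here is illustrative: the check and autofix functions are hypothetical stand-ins, not the actual pipeline.

```python
# Hypothetical sketch of a gate that runs deterministic checks, pushes "easy"
# failures back to the LLM for autofix, and escalates only what's left.
# run_deterministic_checks and llm_autofix are invented stubs for illustration.

def run_deterministic_checks(code: str) -> list[str]:
    """Cheap, deterministic gates: lint, type-check, tests (stubbed here)."""
    issues = []
    if "TODO" in code:
        issues.append("unresolved TODO")
    return issues

def llm_autofix(code: str, issues: list[str]) -> str:
    """Push easy failures back to the model for a fix attempt (stubbed)."""
    return code.replace("TODO", "DONE")

def gate(code: str, max_rounds: int = 3) -> tuple[str, bool]:
    """Returns (code, needs_human). Only unresolved issues reach a person."""
    for _ in range(max_rounds):
        issues = run_deterministic_checks(code)
        if not issues:
            return code, False   # passed all gates, no human review needed
        code = llm_autofix(code, issues)
    return code, True            # still failing after retries: escalate

fixed, needs_human = gate("x = 1  # TODO: validate input")
```

    The point of the retry loop is that the human reviewer only ever sees the residue the automation couldn't resolve.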

  • The disposition problem you describe maps to something I keep running into. I've been running fully autonomous software development agents in my own harness and there's real tension between "check everything" and "agent churns forever".

    It's a liveness constraint: the more checks you add, the less of the agent's output can pass. Even if the probabilistic mass of the output centers on "correct", you can still over-check and shut the pipeline down.

    The thing I noticed: the errors have a pattern, and you can categorize them. If you break the artifact delivery into stages, you can add gates in between to catch specific classes of errors. You keep throughput while improving quality. In the end, instead of LLMs with "personas", I structured my pipeline around the artifact being created.

    I wrote up the data and reasoning framework here: https://michael.roth.rocks/research/trust-topology/
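    The liveness constraint above can be made concrete with a toy model: if each gate independently passes a fraction p of agent output, n gates pass only p^n, so throughput collapses as checks accumulate. The 95% figure is an invented example, not measured data.

```python
# Toy illustration of the liveness constraint: independent gates compound.
# Even a generous 95%-per-gate pass rate shuts the pipeline down at scale.
per_gate_pass = 0.95

for n_gates in (1, 5, 10, 20):
    end_to_end = per_gate_pass ** n_gates
    print(f"{n_gates:2d} gates -> {end_to_end:.0%} of output survives")
```

    This is why categorizing errors and placing a few targeted gates beats piling on generic checks.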

  • Everyone is circling around this. We are shifting to "code factories" that take user intent in at one end and crank out code at the other end. The big question: can you trust it?

    We're building our tooling around it (thanks, Claude!) and seeing what works. Personally, I have my own harness and I've been focused on 1) discovering issues (in the broadest sense) and 2) categorizing the issues into "hard" and "easy" to solve inside the pipeline itself.

    I found patterns in the errors the coding agents made in my harness, which I then exploited. I have an automated workflow that produces code in stages. I added structured checks to catch the "easy" problems at stage boundaries. It fixes those automatically. It escalates the "hard" problems to me.

    In the end, this structure took my first-pass success rate from ~73% to over 90%.
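  • The easy/hard split at stage boundaries can be sketched as a simple router. The categories and rules below are invented examples for illustration; the real classification would come from the patterns observed in the agent's errors.

```python
# Illustrative routing of issues found at a stage boundary: "easy" classes
# get auto-fixed inside the pipeline, everything else escalates to a human.
# The category names are hypothetical, not taken from a real taxonomy.

EASY = {"missing-import", "formatting", "unused-variable"}

def route(issues: list[str]) -> tuple[list[str], list[str]]:
    """Split issues into (auto_fixable, escalate_to_human)."""
    auto = [i for i in issues if i in EASY]
    escalate = [i for i in issues if i not in EASY]
    return auto, escalate

auto, escalate = route(["formatting", "security", "missing-import"])
# "formatting" and "missing-import" are handled in-pipeline;
# "security" goes to the human reviewer.
```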

  • Yeah, this is what happens when there's nothing between "the agent decided to do this" and "it happened." The agent followed the state file logically. It wasn't wrong. It just wasn't checked.

    His post-mortem is solid, but I think he's overcorrecting. If he does this as part of a CI/CD pipeline and manually reviews every time, he will pretty quickly get "verification fatigue". The vast majority of cases are fine, so he'll build the habit of automatically approving. He'll deeply review the first few, but over time he'll almost always find nothing, and he'll pay less and less attention. This is how humans work.

    He could automate the "easy" ones, though. TF plans are parseable, so maybe his time would be better spent only reviewing destructive changes. I've been running autonomous agents on production code for a while and this is the pattern that keeps working: start by reviewing everything, notice you're rubber-stamping most of it, then encode the safe cases so you only see the ones that matter.
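  • The "only review destructive changes" idea is easy to automate because Terraform can emit a machine-readable plan (`terraform show -json plan.out`), where each entry in `resource_changes` lists its `actions`. A rough sketch of the filter; the sample plan data is fabricated for the example:

```python
# Flag only the resource changes that would delete (or replace) something,
# using Terraform's JSON plan format: each resource_changes entry carries a
# change.actions list, and deletes/replacements include "delete" in it.
import json

def destructive_changes(plan_json: str) -> list[str]:
    plan = json.loads(plan_json)
    flagged = []
    for rc in plan.get("resource_changes", []):
        if "delete" in rc["change"]["actions"]:
            flagged.append(rc["address"])
    return flagged

# Fabricated sample plan for illustration.
sample = json.dumps({
    "resource_changes": [
        {"address": "aws_s3_bucket.logs", "change": {"actions": ["delete"]}},
        {"address": "aws_instance.web", "change": {"actions": ["update"]}},
    ]
})
# Only aws_s3_bucket.logs would surface for human review here.
```

    Everything that doesn't show up in the flagged list can be auto-approved, which keeps the reviewer's attention for the changes that can actually hurt.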

HackerNews