Porting 100k lines from TypeScript to Rust using Claude Code in a month

2026-01-26 13:58 · blog.vjeux.com

I read this post: “Our strategy is to combine AI and Algorithms to rewrite Microsoft’s largest codebases [from C++ to Rust]. Our North Star is ‘1 engineer, 1 month, 1 million lines of code.’” It got me curious: how difficult is it really?

I've long wanted to build a competitive Pokemon battle AI after watching a lot of WolfeyVGC and following the PokéAgent challenge at NeurIPS. Thankfully there's an open source project called "Pokemon Showdown" that implements all the rules, but it's written in JavaScript, which is quite slow to run in a training loop. So my holiday project came to life: let's convert it to Rust using Claude!

Escaping the sandbox

Having an AI able to run arbitrary code on your machine is dangerous, so there are a lot of safeguards in place. But... at the same time, this is exactly what I want to do in this case. So let me walk through the ways I escaped the various sandboxes.

git push

Claude runs in a sandbox that limits some operations, including ssh access, which you need in order to push to GitHub. This is very important, as I want to be able to check how the AI is doing from my phone while I do some other activities 😉

What I realized is that I can run such commands from my own terminal, but Claude cannot from its sandboxed one. So I asked Claude to write a Node.js script that opens an HTTP server on a local port and executes git commands based on the URL it receives. Now I just need to keep a tab open in my terminal with this server running, and ask Claude to write instructions in Claude.md on how to interact with it.

rustc

There's an antivirus on my computer that requires human interaction whenever an unknown binary is run. Since every compilation produces a new unknown binary, this wasn't going to work.

What I found is that I can set up a local Docker instance and compile + run the code inside Docker, which doesn't trigger the antivirus. Again, I asked Claude to put the right instructions in Claude.md and problem solved.

The next hurdle was figuring out how to let Claude Code run for hours without any human intervention.

--yes

Claude keeps asking for permission to do things. I tried adding a bunch of entries to the allowed-commands file, and --dangerously-skip-permissions was disabled in my environment (this has since been resolved).

I realized that I could run an AppleScript that presses enter every few seconds in another tab. This way it's going to say Yes to everything Claude asks to do. So far it hasn't decided to hack my computer...

#!/bin/bash
# Press Return (key code 36) every 5 seconds to auto-accept Claude's prompts.

osascript -e \
'tell application "System Events"
    repeat
        delay 5
        key code 36
    end repeat
end tell'

Never give up

After working for some time, Claude always seems to stop to recap things. I tried prompting it to never do this, even threatening it, to no avail.


Comments

  • By thijser 2026-01-26 15:58 · 15 replies

    This reminds me of when I tried to let Claude port an Android libgdx-based game to a WASM-based libgdx version, so I can play the game in the browser.

    No matter how much I tried to force it to stick to a mostly line-by-line port, it kept trying to "improve" the code. At some point it had to undo everything as it introduced a number of bugs. I asked it: "What should I add to your prompt so you won't do this again?" and it gave me this:

      ### CRITICAL LESSON: Don't "Improve" During Porting
      -  **BIGGEST MISTAKE: Reorganizing working code**
        - **What I did wrong:** Tried to "simplify" by splitting `createStartButton()` into separate creation and layout methods
        - **Why it failed:** Introduced THREE bugs:
          1. Layout overlap (getY() vs getY() - getHeight())
          2. Children not sized (Group.setSize() doesn't affect children)
          3. Origins not updated (scaling animations broken)
        - **The fix:** Deleted my "improvements" and copied the original Android pattern faithfully
        - **Root cause:** Arrogance - assuming I could improve production-tested code without understanding all the constraints
        - **Solution:** **FOLLOW THE PORTING PRINCIPLES ABOVE** - copy first, don't reorganize
        - **Time wasted:** ~1 hour debugging self-inflicted bugs that wouldn't exist if I'd just copied the original
        - **Key insight:** The original Android code is correct and battle-tested. Your "improvements" are bugs waiting to happen.
    
    
    I like the self-reflection of Claude; unfortunately, even adding this to CLAUDE.md didn't fix it, and it kept taking wrong turns, so I had to abandon the effort.

    • By phpnode 2026-01-26 16:48 · 6 replies

      Claude doesn't know why it acted the way it acted, it is only predicting why it acted. I see people falling for this trap all the time

      • By LoganDark 2026-01-26 16:57

        It's not even predicting why it acted, it's predicting an explanation of why it acted, which is even worse since there's no consistent mental model.

      • By GuB-42 2026-01-27 11:37 · 3 replies

        It has been shown that LLMs don't know how they work. Researchers asked an LLM to perform computations and explain how it got to the result. The LLM's explanation is typical of how we do it: add the numbers digit by digit, with carry, etc... But looking inside the neural network shows that the reality is completely different and much messier. None of it is surprising.

        Still, feeding it back its own completely made up self-reflection could be an effective strategy, reasoning models kind of work like this.

        • By FireBeyond 2026-01-27 18:55

          Right. Last time I checked this was easy to demonstrate with word logic problems:

          "Adam has two apples and Ben has four bananas. Cliff has two pieces of cardboard. How many pieces of fruit do they have?" (or slightly more complex, this would probably be easily solved, but you get my drift.)

          Change the wording to something entirely random, i.e. something not likely to be found in the LLM corpus, like walruses and skyscrapers and carbon molecules, and the LLM will give you a suitably nonsensical answer, showing that it is incapable of handling even simple substitutions that a middle schooler would recognize.

        • By phpnode 2026-01-27 12:46

          The explanation becomes part of the context which can lead to more effective results in the next turn, it does work, but it does so in a completely misleading way

        • By wongarsu 2026-01-27 14:16 · 2 replies

          Which should be expected, since the same is true for humans. The "adding numbers digit by digit with carry" works well on paper, but it's not an effective method for doing math in your head, and is certainly not how I calculate 14+17. In fact I can't really tell you how I calculate 14+17 since that's not in the "inner monologue" part of my brain, and I have little introspection in any of the other parts

          Still, feeding humans their completely made-up self-reflection back can be an effective strategy

          • By GuB-42 2026-01-27 17:57 · 1 reply

            The difference is that if you are honest and pragmatic and someone asked you how you added two numbers, you would only say you did long addition if that's what you actually did. If you had no idea what you actually did, you would probably say something like "the answer came to me naturally".

            LLMs work differently. Like a human, 14+17=31 may come naturally, but when asked about their thought process, LLMs will not self-reflect on their condition; instead they will treat it like "in your training data, when someone is asked how they added numbers, what follows?", and usually it is long addition, so that is the answer you will get.

            It is the same idea as to why LLMs hallucinate. They will imitate what their dataset has to say, and their dataset doesn't have a lot of "I don't know" answers; an LLM that learns to answer "I don't know" to every question wouldn't be very useful anyway.

            • By sigmoid10 2026-01-27 18:58

              >if you are honest and pragmatic and someone asked you how you added two numbers, you would only say you did long addition if that's what you actually did. If you had no idea what you actually did, you would probably say something like "the answer came to me naturally".

              To me that misses the argument of the above comment. The key insight is that neither humans nor LLMs can express what actually happens inside their neural networks, but both have been taught to express e.g. addition using mathematical methods that can easily be verified. But it still doesn't guarantee for either of them not to make any mistakes, it only makes it reasonably possible for others to catch on to those mistakes. Always remember: All (mental) models are wrong. Some models are useful.

          • By estimator7292 2026-01-27 15:46 · 1 reply

            Life lesson for you: the internal functions of every individual's mind are unique. Your n=1 perspective is in no way representative of how humans as a category experience the world.

            Plenty of humans do use longhand arithmetic methods in their heads. There's an entire universe of mental arithmetic methods. I use a geometric process because my brain likes problems to fit into a spatial graph instead of an imaginary sheet of paper.

            Claiming you've not examined your own mental machinery is... concerning. Introspection is an important part of human psychological development. Like any machine, you will learn to use your brain better if you take a peek under the hood.

            • By wongarsu 2026-01-27 15:55

              > Claiming you've not examined your own mental machinery is... concerning

              The example was carefully chosen. I can introspect how I calculate 356*532. But I can't introspect how I calculate 14+17 or 1+3. I can deliberate the question 14+17 more carefully, switching from "system 1" to "system 2" thinking (yes, I'm aware that that's a flawed theory), but that's not how I'd normally solve it. Similarly I can describe to you how I can count six eggs in a row, I can't describe to you how I count three eggs in a row. Sure, I know I'm subitizing, but that's just putting a word on "I know how many are there without conscious effort". And without conscious effort I can't introspect it. I can switch to a process I can introspect, but that's not at all the same

      • By kaffekaka 2026-01-26 16:54 · 1 reply

        Yes, this pitfall is a hard one. It is very easy to interpret the LLM in a way there is no real ground for.

        • By scotty79 2026-01-26 23:10 · 1 reply

          It must be anthropomorphization that's hard to shake off.

          If you understand how this all works, it's really no surprise that post-factum reasoning is exactly as hallucinated as the answer itself: it might have very little to do with the answer, and it certainly has nothing to do with how the answer actually came to be.

          The value of "thinking" before giving an answer is reserving a scratchpad for the model to write some intermediate information down. There isn't any actual reasoning even there. The model might use the information it writes there in a completely obscure way (one that has nothing to do with what's verbally there) while generating the actual answer.

      • By nnevatie 2026-01-27 1:30

        That's because when the failure becomes the context, it can clearly express the intent of not falling for it again. However, when the original problem is the context, none of this obviousness applies.

        Very typical, and it gives LLMs that annoying Captain Hindsight-like behaviour.

      • By nonethewiser 2026-01-26 17:01 · 1 reply

        IDK how far AIs are from intelligence, but they are close enough that there is no room for anthropomorphizing them. When they are anthropomorphized its assumed to be a misunderstanding of how they work.

        Whereas someone might say "geeze my computer really hates me today" if it's slow to start, and we wouldn't feel the need to explain the computer cannot actually feel hatred. We understand the analogy.

        I mean your distinction is totally valid and I dont blame you for observing it because I think there is a huge misunderstanding. But when I have the same thought, it often occurs to me that people aren't necessarily speaking literally.

        • By amenhotep 2026-01-26 17:44 · 4 replies

          This is a sort of interesting point, it's true that knowingly-metaphorical anthropomorphisation is hard to distinguish from genuine anthropomorphisation with them and that's food for thought, but the actual situation here just isn't applicable to it. This is a very specific mistaken conception that people make all the time. The OP explicitly thought that the model would know why it did the wrong thing, or at least followed a strategy adjacent to that misunderstanding. He was surprised that adding extra slop to the prompt was no more effective than telling it what to do himself. It's not a figure of speech.

          • By zarzavat 2026-01-26 18:13 · 2 replies

            A good time to quote our dear leader:

            > No one gets in trouble for saying that 2 + 2 is 5, or that people in Pittsburgh are ten feet tall. Such obviously false statements might be treated as jokes, or at worst as evidence of insanity, but they are not likely to make anyone mad. The statements that make people mad are the ones they worry might be believed. I suspect the statements that make people maddest are those they worry might be true.

            People are upset when AIs are anthropomorphized because they feel threatened by the idea that they might actually be intelligent.

            Hence the woefully insufficient descriptions of AIs such as "next token predictors" which are about as fitting as describing Terry Tao as an advanced gastrointestinal processor.

            • By jdub 2026-01-27 2:45

              I'm not threatened by the idea that LLMs might actually be intelligent. I know they're not.

              I'm threatened by other people wrongly believing that LLMs possess elements of intelligence that they simply do not.

              Anthropomorphosis of LLMs is easy, seductive, and wrong. And therefore dangerous.

            • By antonvs 2026-01-26 23:49

              The comment you replied to made a point that, if you accept it (which you probably should), makes that PG quote inapplicable here. The issue in this case is that treating the model as though it has useful insight into its own operation - which is being summarized as anthropomorphizing - leads to incorrect conclusions. It’s just a mistake, that’s all.

          • By phpnode 2026-01-26 17:49

            There's this underlying assumption of consistency too - people seem to easily grasp that when starting on a task the LLM could go in a completely unexpected direction, but when that direction has been set a lot of people expect the model to stay consistent. The confidence with which it answers questions plays tricks on the interlocutor.

          • By nonethewiser 2026-01-26 17:58

            Whats not a figure of speech?

            I am speaking general terms - not just this conversation here. The only specific figure of speech I see in the original comment is "self reflection" which doesn't seem to be in question here.

          • By electroglyph 2026-01-27 4:55 · 1 reply

            Some models are capable of metacognition. I've seen Anthropic's research replicated.

            • By lukashahnart 2026-01-27 5:12

              Can you elaborate on what you mean by metacognition and where you’ve seen it in Anthropic’s models?

      • By drob518 2026-01-27 10:13

        It’s not even doing that. It’s just an algorithm for predicting the next word. It doesn’t have emotions and doesn’t actually think. So I had to chuckle when it said it was arrogant. Basically, its training data contains a bunch of postmortem write-ups, and it’s using those as a template for what text to generate, telling us what we want to hear.

    • By everfrustrated 2026-01-26 18:03 · 1 reply

      Worth pointing out that your IDE/plugin usually adds a whole bunch of prompts before yours, not to mention the prompts that the model hosting provider prepends as well.

      This might be what is encouraging the agent to apply "best practices" like unprompted improvements. Looking at mine:

      >You are a highly sophisticated automated coding agent with expert-level knowledge across many different programming languages and frameworks and software engineering tasks - this encompasses debugging issues, implementing new features, restructuring code, and providing code explanations, among other engineering activities.

      I could imagine that an LLM could well interpret that to mean improve things as it goes. Models (like humans) don't respond well to things in the negative (don't think about pink monkeys - Now we're both thinking about them).

      • By hombre_fatal 2026-01-26 18:36

        It's also common for your own CLAUDE.md to have some generic line like "Always use best practices and good software design" that gets in the way of other prompts.

    • By theLiminator 2026-01-27 0:18 · 1 reply

      For anything large like this, I think it's critical that you port over the tests first, and then essentially force it to get the tests passing without mutating the tests. This works nicely for stuff that's very purely functional, a lot harder with a GUI app though.

      • By pwdisswordfishy 2026-01-27 14:51

        The same insight can be applied to the codebase itself.

        When you're porting the tests, you're not actually working on the app. You're getting it to work on some other adjacent, highly useful thing that supports app development, but nonetheless is not the app.

        Rather than trying to get the language model to output constructs in the target PL/ecosystem that go against its training, get it to write a source code processor that you can then run on the original codebase to mechanically translate it into the target PL.

        Not only does this work around the problem where you can't manage to convince the fuzzy machine to reliably follow a mechanical process, it sidesteps problems around the question of authorship. If a binary that has been mechanically translated from source into executable by a conventional compiler inherits the same rightsholder/IP status as the source code that it was mechanically translated from, then a mechanical translation by a source-to-source compiler shouldn't be any different, no matter what the model was trained on. Worst case scenario, you have to concede that your source processor belongs to the public domain (or unknowingly infringed someone else's IP), but you should still be able to keep both versions of your codebase, one in each language.
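As a toy illustration of such a source processor, erasing the very simplest TypeScript annotations can be done mechanically. A real translator would use an actual parser (for example the TypeScript compiler itself); this regex sketch only handles plain `name: Type` annotations and is here purely to make the idea concrete.

```javascript
// Toy mechanical translator: erase simple TypeScript type annotations to
// get JavaScript. Only handles bare `: Identifier` (optionally an array
// type like `number[]`) after parameter names or closing parens -- it
// would mangle object literals, generics, etc. Illustration only.
function stripSimpleTypes(src) {
  return src.replace(/:\s*[A-Za-z_$][\w$]*(\[\])?/g, "");
}

// stripSimpleTypes("function add(a: number, b: number): number { return a + b; }")
// yields plain JavaScript: "function add(a, b) { return a + b; }"
```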

    • By wyldfire 2026-01-26 19:08

      One thing that might be effective at limited-interaction recovery-from-ignoring-CLAUDE.md is the code-review plugin [1], which spawns agents who check that the changes conform to rules specified in CLAUDE.md.

      [1] https://github.com/anthropics/claude-code/blob/main/plugins/...

    • By surajrmal 2026-01-27 16:03

      I recently did a c++ to rust port with Gemini and it was basically a straight line port like I wanted. Nearly 10k lines of code too. It needed to change a bit of structure to get it compiling, but that's only because rust found bugs at compile time. I attribute this success to the fact my team writes c++ stylistically close to what is idiomatic rust, and that generally the languages are quite similar. I will likely do another pass in the future to turn the callback driven async into async await syntax, but off the bat it largely avoided doing so when it would change code structure.

    • By arjie 2026-01-27 8:26

      It's not context-free (haha), but a trick you can try is to include negative examples in the prompt. It used to be an awful trick originally because of the Waluigi Effect, but then became a good trick, and lately with Opus 4.5 I haven't needed it that much. But it did work once: e.g. take the original code, supply the correct answer and the wrong answers as examples in Claude.MD, and then redo.

      If it works, do share.

    • By pwdisswordfishy 2026-01-27 14:11

      Humans act the same way.

      For all the (unfortunately necessary) conversations that have occurred over the years of the form, "JavaScript is not Java—they're two different languages," people sometimes go too far and tack on some remark like, "They're not even close to being alike." The reality, though, is that many times you can take some in-house package (though not the Enterprise-hardened™ ones with six different overloads for every constructor, and four for every method, and that buy hard into Java (or .NET) platform peculiarities—just the ones where someone wrote just enough code to make the thing work in that late-90's OOP style associated with Java), and more or less do a line-by-line port until you end up with a native JS version of the same program, which with a little more work will be able to run in browser/Node/GraalJS/GJS/QuickJS/etc. Generally, you can get halfway there by just erasing the types and changing the class/method declarations to conform to the different syntax.

      Even so, there's something that happens in folks' brains that causes them to become deranged and stray far off-course. They never just take their program, where they've already decomposed the solution to a given problem into parts (that have already been written!), and then just write it out again—same components, same identifier names, same class structure. There's evidently some compulsion where, because they sense the absence of guardrails from the original language, they just go absolutely wild, turning out code that no one would or should want to read—especially not other programmers hailing from the same milieu who explicitly, avowedly, and loudly state their distaste for "JS" (whereby they mean "the kind of code that's pervasive on GitHub and NPM" and is so hated exactly because it's written in the style their coworker, who has otherwise outwardly appeared to be sane up to this point, just dropped on the team).

    • By andai 2026-01-27 0:24 · 1 reply

      Was this Claude Code? If you tried it with one file at a time in the chat UI I think you would get a straight-line port, no?

      Edit: It could be because Rust works a little differently from other languages, a 1:1 port is not always possible or idiomatic. I haven't done much with Rust but whenever I try porting something to Rust with LLMs, it imports like 20 cargo crates first (even when there were no dependencies in the original language).

      Also Rust for gamedev was a painful experience for me, because rust hates globals (and has nanny totalitarianism so there's no way to tell it "actually I am an adult, let me do the thing"), so you have to do weird workarounds for it. GPT started telling me some insane things like, oh it's simple you just need this rube goldberg of macro crates. I thought it was tripping balls until I joined a Rust discord and got the same advice. I just switched back to TS and redid the whole thing on the last day of the jam.

      • By zozbot234 2026-01-27 5:51

        > rust hates globals

        Rust has recently added OnceCell and OnceLock to make thread-safe globals a lot easier for some things. It's not "hate", it just wants you to be consistent about what you're doing.

    • By antonvs 2026-01-26 23:41

      That’s a terrible prompt, more focused on flagellating itself for getting things wrong than actually documenting and instructing what’s needed in future sessions. Not surprising it doesn’t help.

    • By abyesilyurt 2026-01-27 3:07

      Sonnet 4.5 had this problem. Opus 4.5 is much better at focusing on the task instead of getting sidetracked.

    • By 0x696C6961 2026-01-26 16:01 · 5 replies

      I wish there was a feature to say "you must re-read X" after each compaction.

      • By nickreese 2026-01-26 16:01

        Some people use hooks for that. I just avoid CC and use Codex.

      • By taberiand 2026-01-26 22:58 · 1 reply

        Getting the context full to the point of compaction probably means you're already dealing with a severely degraded model, the more effective approach is to work in chunks that don't come close to filling the context window

        • By 0x696C6961 2026-02-03 13:01

          The problem is that I'm not always using it interactively. I'll give it something that I think is going to be a simple task and it turns out to be complex. It overruns the context, compacts, and then starts doing dumb things.

      • By philipp-gayret 2026-01-26 18:03

        There's no PostCompact hook unfortunately. You could try with PreCompact and giving back a message saying it's super duper important to re-read X, and hope that survives the compacting.

      • By root_axis 2026-01-26 19:35 · 1 reply

        What would it even mean to "re-read after a compaction"?

        • By esafak 2026-01-27 5:20

          To enter a file into the context after losing it through compaction.

    • By andai 2026-01-27 0:30

      Tangential but doesn't libgdx have native web support?

    • By esafak 2026-01-27 5:17

      It doesn't seem very bound by CLAUDE.md

    • By badlogic 2026-01-27 11:56

      libGDX, now that's a name I haven't heard in a while.

    • By Lionga 2026-01-26 16:19 · 1 reply

      Well, it's close to AGI; can you really expect AGI to follow simple instructions from dumbos like you when it can do the work of god?

      • By b00ty4breakfast 2026-01-26 16:46

        As an old coworker once said when talking about a certain manager: that boy's just smart enough to be dumb as shit. (The AI, not you; I don't know you well enough to call you dumb.)

  • By danesparza 2026-01-26 14:40 · 9 replies

    Some quotes from the article stand out: "Claude after working for some time seem to always stop to recap things." Question: were you running out of context? That's why frameworks like intentional compaction are being worked on. Large codebases have specific needs when working with an LLM.

    "I've never interacted with Rust in my life"

    :-/

    How is this a good idea? How can I trust the generated code?

    • By johnfn 2026-01-26 15:55 · 2 replies

      The author says that he runs both the reference implementation and the new Rust implementation through 2 million (!) randomly generated battles and flags every battle where the results don't line up.
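That loop can be sketched generically like this. The battle functions below are placeholders I made up; in the actual setup, one side would invoke the JS simulator and the other the Rust binary, comparing their serialized battle logs, with per-run seeds so any mismatch can be replayed in isolation.

```javascript
// Small seeded PRNG (mulberry32) so each run is replayable from its seed.
function mulberry32(seed) {
  return function () {
    seed |= 0; seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Differential test: feed both implementations the same seeded random
// inputs and collect every case where their outputs disagree.
function diffTest(refImpl, portImpl, makeInput, runs, seed = 42) {
  const mismatches = [];
  for (let i = 0; i < runs; i++) {
    const input = makeInput(mulberry32(seed + i)); // one seed per run
    const a = JSON.stringify(refImpl(input));
    const b = JSON.stringify(portImpl(input));
    if (a !== b) mismatches.push({ seed: seed + i, input, ref: a, port: b });
  }
  return mismatches;
}
```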

      • By simonw 2026-01-26 16:00 · 1 reply

        This is the key to the whole thing in my opinion.

        If you ask a coding agent to port code from one language to another and don't have a robust mechanism to test that the results are equivalent, you're inevitably going to waste a lot of time and money on junk code that doesn't work.

        • By storystarling 2026-01-27 8:18 · 1 reply

          Fuzzing handles the logic verification, but I'd be more worried about the architectural debt of mapping GC patterns to Rust. You often end up with a mess of Arc/Mutex wrappers and cloning just to satisfy the borrow checker, which defeats the purpose of the port.

          • By zozbot234 2026-01-27 10:48

            That will vary depending on how the code is architected to begin with, and the problem domain. Single-ownership patterns can be refactored into Rust ownership, and a good AI model might be able to spot them even when not explicitly marked in the code.

            For some problems dealing with complex general graphs, you may even find it best to use a Rust-based general GC solution, especially if it can be based on fast concurrent GC.

      • By Herring 2026-01-26 16:30 · 1 reply

        Yeah and he claims a pass rate of 99.96%. At that point you might be running into bugs in the original implementation.

        • By sanxiyn 2026-01-27 6:53 · 1 reply

          Not really. Due to combinatorial explosion, some paths are hard to hit randomly in this kind of source code. I would have preferred 99% code coverage of the reference implementation after 2M random battles over a 99% pass rate.

          I don't know anything about Pokemon, but I briefly looked at the code. "weather" seemed like a self contained thing I could potentially understand. Looking at https://github.com/vjeux/pokemon-showdown-rs/blob/master/src...

          > NOTE: ignoringAbility() and abilityState.ending not fully implemented

          So it is almost certain that even with a 99.96% pass rate, it never hit a battle with a weather-suppressing Pokemon whose ability was being ignored. A code-coverage-driven testing loop would have found and fixed this one easily.

          • By Herring 2026-01-27 15:35

            Good catch. I should really look at the code before commenting on it.

    • By Palomides 2026-01-26 15:06 · 1 reply

      I'm very skeptical, but this is also something that's easy to compare using the original as a reference implementation, right? Providing lots of random input and fixing any disparities is a classic approach for rewriting/porting a system.

      • By ethin 2026-01-26 15:47 · 1 reply

        This only works up to a certain point. Given that the author openly admits they don't know/understand Rust, there is a really high likelihood that the LLM made all kinds of mistakes that a Rust developer would have avoided, and the dev is going to be left flailing about trying to understand why they happen/what's causing them/etc. A hand-rewrite would've actually taught the author a lot of very useful things, I'm guessing.

        • By galangalalgol 2026-01-26 17:52

          It seems like they have something like differential fuzzing to guarantee identical behavior to the original, but they still are left with a codebase they cannot read...

    • By rkozik1989 2026-01-26 14:57 · 2 replies

      Hopefully they have a test suite written by QA, otherwise they're for sure going to have a buggy mess on their hands. People need to learn that if you must rewrite something (often you don't actually need to), an incremental approach is best.

      • By yieldcrv 2026-01-26 15:27

        1 month of Claude Code would be an incremental approach

        It would honestly try to one-shot the whole conversion in a 30 minute autonomous session

      • By jamesfinlayson 2026-01-27 1:46 · 1 reply

        > often you don't actually need to

        Feels like this one is always a mistake that needs to be made for the lesson to be learned.

        • By port11 2026-01-27 8:05

          At this point it seems pretty clear that all projects ported from Ruby to Python, then Python to Typescript, must now be ported to Rust. It will solve almost all problems of the tech industry…

    • By captbaritone 2026-01-26 15:48 · 1 reply

      His goal was to get a faster oracle that encoded the behavior of Pokemon that he could use for a different training project. So this project provides that without needing to be maintainable or understandable itself.

      • By topaz0 2026-01-27 13:59

        Back of the envelope, they'll need to use this on the order of a billion times to break even, under the (laughable) assumption that running claude code uses comparable compute as the computer he's running his code on. So more like hundreds of billions or trillions, I'd guess.

    • By ferguess_k 2026-01-26 16:02

      I think it could work if they have tests with good coverage, like the "test farm" described by someone who worked in Oracle.

    • By atonse 2026-01-26 15:47 · 1 reply

      My answer to this is to often get the LLMs to do multiple rounds of code review (depending on the criticality of the code, doing reviews on every commit; but this was clearly a zero-impact hobby project).

      They are remarkably good at catching things, especially if you do it every commit.

      • By usrbinbash 2026-01-26 15:57

        > My answer to this is to often get the LLMs to do multiple rounds of code review

        So I am supposed to trust the machine, that I know I cannot trust to write the initial code correctly, to somehow do the review correctly? Possibly multiple times? Without making NEW mistakes in the review process?

        Sorry no sorry, but that sounds like trying to clean a dirty floor by rubbing more dirt over it.

        • By atonse 2026-01-26 20:10

          It sounds to me like you may not have used a lot of these tools yet, because your response sounds like pushback around theoreticals.

          Please try the tools (especially either Claude Code with Opus 4.5, or OpenAI Codex 5.2). Not at all saying they're perfect, but they are much better than you currently think they might be (judging by your statements).

          AI code reviews are already quite good, and are only going to get better.

          • By gixco 2026-01-27 5:11

            Why is the go-to always "you must not have used it" in lieu of the much more likely experience of having already seen and rejected first-hand the slop that it churns out? Synthetic benchmarks can rise all they want; Opus 4.5 is still completely useless at all but the most trivial F# code and, in more mainstream affairs, continues to choke even on basic ASP.NET Core configuration.

            • By atonse 2026-01-27 12:45

              About a year ago they sucked at writing elixir code.

              Now I use them to write nearly 100% of my elixir code.

              My point isn’t a static “you haven’t tried them”. My point is, “try them every 2-3 months and watch the improvements, otherwise your info is outdated”

          • By usrbinbash 2026-01-27 11:27

            > It sounds to me like you may not have used a lot of these tools yet

            And this is more and more becoming the default answer I get whenever I point out obvious flaws of LLM coding tools.

            Did it occur to you that I know these flaws precisely because I work a lot with, and evaluate the performance of, LLM based coding tools? Also, we're almost 4y into the alleged "AI Boom" now. It's pretty safe to assume that almost everyone in a development capacity has spent at least some effort evaluating how these tools do. At this point, stating "you're using it wrong" is like assuming that people in 2010 didn't know which way to hold a smartphone.

            Sorry no sorry, but when every criticism towards a tool elicits the response that people are not using it well, then maybe, just maybe, the flaw is not with all those people, but with the tool itself.

            • By atonse 2026-01-27 12:41

              Spending 4 years evaluating something that’s changing every month means almost nothing, sorry.

              Almost every post exalting these models’ capabilities talks about how good they’ve gotten since November 2025. That’s barely 90 days ago.

              So it’s not about “you’re doing it wrong”. It’s about “if you last tried it more than 3 months ago, your information is already outdated”

              • By usrbinbash 2026-01-27 16:53

                > Spending 4 years evaluating something that’s changing every month means almost nothing, sorry.

                No need to be sorry. Because, if we accept that premise, you just countered your own argument.

                If me evaluating these things for the past 4 years "means almost nothing" because they are changing sooo rapidly... then by the same logic, any experience with them also "means almost nothing". If the timeframe to get any experience with these models before said experience becomes irrelevant is as short as 90 days, then there is barely any difference between someone with experience and someone just starting out.

                Meaning, under that premise, as long as I know how to code, I can evaluate these models, no matter how little I use them.

                Luckily for me though, that's not the case anyway because...

                > It’s about “if you last tried it more than 3 months ago,

                ...guess what: I try these almost every week. It's part of my job to do so.

        • By pluralmonad 2026-01-26 17:01

          Implementation -> review cycles are very useful when iterating with CC. The point of the agent reviewer is not to take the place of your personal review, but to catch any low hanging fruit before you spend your valuable time reviewing.

          • By usrbinbash 2026-01-27 12:13

            > but to catch any low hanging fruit before you spend your valuable time reviewing.

            And that would be great, if it weren't for the fact that I also have to review the reviewer's review. So even for the "low hanging fruit", I need to double-check everything it does.

            Which kinda eliminates the time savings.

            • By pluralmonad 2026-01-27 15:44

              That is not my perspective. I don't review every review, instead use a review agent with fresh context to find as much as possible. After all automated reviews pass, I then review the final output diff. It saves a lot of back and forth, especially with a tight prompt for the review agent. Give the reviewer specific things to check and you won't see nearly as much garbage in your review.

        • By hombre_fatal 2026-01-26 18:04

          Well, you can review its reasoning. And you can passively learn enough about, say, Rust to know if it's making a good point or not.

          Or you will be challenged to define your own epistemic standard: what would it take for you to know if someone is making a good point or not?

          For things you don't understand enough to review as comfortably, you can look for converging lines of conclusions across multiple reviews and then evaluate the diff between them.

          I've used Claude Code a lot to help translate English to Spanish as a hobby. Not being a native Spanish speaker myself, there are cases where I don't know the nuances between two different options that otherwise seem equivalent.

          Maybe I'll ask 2-3 Claude Code instances to compare the difference between two options in context and pitch me a recommendation, and I can drill down into their claims infinitely.

          At no point do I need to go "ok I'll blindly trust this answer".

        • By ctoth 2026-01-26 16:40

          Wait until you start working with us imperfect humans!

          • By Ronsenshi 2026-01-26 17:03

            Humans do have capacity for deductive reasoning and understanding, at least. Which helps. LLMs do not. So would you trust somebody who can reason or somebody who can guess?

            • By galangalalgol 2026-01-26 18:02

              People work differently than LLMs; they find things we don't, and the reverse is also obviously true. As an example, a stack use-after-free was found in a large monolithic C++98 codebase at my megacorp. None of the static analyzers caught it; even after modernizing it and getting clang-tidy's modernize checks to pass, nothing found it. ASan would have found it if a unit test had covered that branch. As a human I found it, but mostly because I knew there was a problem to find. An LLM found and explained the bug succinctly. Having an LLM be a reviewer for merge requests makes a ton of sense.
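              As a minimal sketch of the class of bug described here (not the actual codebase): keeping a pointer across a container reallocation is exactly the kind of use-after-free that AddressSanitizer aborts on at the faulty load when the program is compiled with `-fsanitize=address`, but only if a test actually executes that branch.

              ```cpp
              #include <cassert>
              #include <vector>

              // Returns the first element of v after forcing a reallocation.
              // Dereferencing the old pointer after reserve() would be a
              // heap-use-after-free: ASan (-fsanitize=address) reports it at
              // the bad load with a stack trace; without ASan it is silent UB.
              int first_after_realloc(std::vector<int>& v) {
                  int* first = &v[0];
                  v.reserve(v.capacity() + 1);  // reallocation: 'first' now dangles
                  // return *first;             // BUG: ASan flags heap-use-after-free here
                  first = &v[0];                // fix: re-take the pointer after reallocation
                  return *first;
              }

              int main() {
                  std::vector<int> v = {1, 2, 3};
                  assert(first_after_realloc(v) == 1);
                  return 0;
              }
              ```

              The commented-out line is the bug pattern; static analyzers often miss it because the dangling pointer and the reallocation sit in different code paths, which is why coverage matters for sanitizers.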

    • By rvz 2026-01-26 15:35

      > How is this a good idea? How can I trust the generated code?

      You don't. The LLM wrote the code and it is absolutely right. /s

      What could possibly go wrong?

    • By eddythompson80 2026-01-26 15:39

      Same way you trust any auto translation for a document. You wrote it in English (or whatever language you’re most proficient in), but someone wants it in Thai or Czech, so you click a button and send them the document. It’s their problem now.

  • By hedgehog 2026-01-26 18:28

    I ported a closed source web conferencing tool to Rust over about a week with a few hours of actual attention and keyboard time. From 2.8MB of minified JS hosted in a browser to a 35MB ARM executable that embeds its own audio, WebRTC, graphics, embedded browser, etc. Also a mdbook spec to explain the protocol, client UI, etc. Zero lines of code by me. The steering work did require understanding the overall work to be done, some high level design of threading and buffering strategy, what audio processing to do, how to do sprite graphics on GPU, some time in a profiler to understand actual CPU time and memory allocations, etc. There is no way I could have done this by hand in a comparable amount of time, and given the clearly IP-encumbered nature I wouldn't spend the time to do it except that it was easy enough and allowed me to then fix two annoying usability bugs with the original.

    • By written-beyond 2026-01-26 19:03

      Please give us a write up

      • By hedgehog 2026-01-26 22:35

        I don't have time right now for a proper write-up but the basic points in the process were:

        1. Write a document that describes the work. In this case I had the minified+bundled JS, no documentation, but I did know how I use the system and generally the important behavioral aspects of the web client. There are aspects of the system that I know from experience tend to be tricky, like compositing an embedded browser into other UI, or dealing with VOIP in general. Other aspects, like JS itself, I don't really know deeply. I knew I wanted a Mac .app out the end, as well as Flatpak for Linux. I knew I wanted an mdbook of the protocol and behavioral specs. Do the best you can. Think really hard about how to segment the work for hands-off testability so the assistant can grind the loop of add logs, test run, fix, etc.

        2. In Claude Desktop (or whatever) paste in the text from 1 and instruct it to research and ask you batches of 10 clarifying questions until it has enough information to write a work plan for how to do the job, specific tools, necessary documentation, etc. Then read and critique until you feel like the thread has the elements of a good plan, and have Claude generate a .md of the plan.

        3. Create a repo containing the JS file and the plan.

        4. Add other tools like my preferred template for change implementation plans, Rust style guide, etc (have the chatbot write a language style guide for any language you use that covers the gap between common practice ~3 years ago and the specific version of the language you want to use, common errors, etc). I have specific instructions for tracking current work, work log, and key points to remember in files, everyone seems to do this differently.

        5. Add Claude Code (or whatever) to the container or machine holding the repo.

        Repeat until done:

        6a. Instruct the assistant to do a time-boxed 60 minutes of work towards the goal, or until blocked on questions, then leave changes for your review along with any questions.

        6b. Instruct the assistant to review changes from HEAD for correctness, completeness, and opportunities to simplify, leaving questions in chat.

        6c. Review and give feedback / make changes as necessary. Repeat 6b until satisfied.

        6d. Go back to 6a.

        At various points you'll find that the job is mis-specified in some important way, or the assistant can't figure out what to do (e.g. if you have choppy audio due to a buffer bug, or a slow memory leak, it won't necessarily know about it). Sometimes you need to add guidance to the instructions like "update instructions to emphasize that we must never allocate in situation XYZ". Sometimes the repo will start to go off the rails and get messy, which is improved with instructions like "consider how to best organize this repository for ease of onboarding the next engineer, describe in chat your recommendations" and then having it do what it recommended.

        There's a fair amount of hand-holding but a lot of it is just making sure what it's doing doesn't look crazy and pressing OK.

        • By written-beyond 2026-01-28 19:41

          Oh no, I didn't mean a write-up about the prompting, I meant about the actual client you wrote.

          What was the final framework like, how did the protocols work, etc.

          • By hedgehog 2026-01-28 22:53

            Oh, there's a centrally hosted web server that hosts the assets, some of the conference state, account info, that sort of thing. Clients join a SSE channel for notifications of events relating to other clients. Then a combination of POST to the web service & ICE and STUN to establish all-to-all RTP over WebRTC for audio, and other client state updates as JSON over WebRTC data channel. The UI is very specific to the app but built on winit, webgpu, and egui. wry for embedded browser.

HackerNews