We tasked Opus 4.6 using agent teams to build a C Compiler

2026-02-05 19:07 · www.anthropic.com

Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.

Written by Nicholas Carlini, a researcher on our Safeguards team.

I've been experimenting with a new approach to supervising language models that we’re calling "agent teams."

With agent teams, multiple Claude instances work in parallel on a shared codebase without active human intervention. This approach dramatically expands the scope of what's achievable with LLM agents.

To stress test it, I tasked 16 agents with writing a Rust-based C compiler, from scratch, capable of compiling the Linux kernel. Over nearly 2,000 Claude Code sessions and $20,000 in API costs, the agent team produced a 100,000-line compiler that can build Linux 6.9 on x86, ARM, and RISC-V.

The compiler is an interesting artifact on its own, but I focus here on what I learned about designing harnesses for long-running autonomous agent teams: how to write tests that keep agents on track without human oversight, how to structure work so multiple agents can make progress in parallel, and where this approach hits its ceiling.

Enabling long-running Claudes

Existing agent scaffolds like Claude Code require an operator to be online and available to work jointly. If you ask for a solution to a long and complex problem, the model may solve part of it, but eventually it will stop and wait for continued input—a question, a status update, or a request for clarification.

To elicit sustained, autonomous progress, I built a harness that sticks Claude in a simple loop (if you’ve seen Ralph-loop, this should look familiar). When it finishes one task, it immediately picks up the next. (Run this in a container, not on your actual machine.)

#!/bin/bash

while true; do
    COMMIT=$(git rev-parse --short=6 HEAD)
    LOGFILE="agent_logs/agent_${COMMIT}.log"

    claude --dangerously-skip-permissions \
           -p "$(cat AGENT_PROMPT.md)" \
           --model claude-opus-X-Y &> "$LOGFILE"
done


In the agent prompt, I tell Claude what problem to solve and ask it to break the problem into small pieces, track what it’s working on, figure out what to work on next, and effectively keep going until it’s perfect. (On this last point, Claude has no choice: the loop runs forever. Although in one instance I did see Claude run pkill -9 bash by accident, thus killing itself and ending the loop. Whoops!)


Running Claude in parallel

Running multiple instances in parallel can address two weaknesses of a single-agent harness:

  • One Claude Code session can only do one thing at a time. Especially as the scope of a project expands, debugging multiple issues in parallel is far more efficient.
  • Running multiple Claude agents allows for specialization. While a few agents are tasked to solve the actual problem at hand, other specialized agents can be invoked to (for example) maintain documentation, keep an eye on code quality, or solve specialized sub-tasks.

My implementation of parallel Claude is bare-bones. A new bare git repo is created, and for each agent, a Docker container is spun up with the repo mounted to /upstream. Each agent clones a local copy to /workspace, and when it's done, pushes from its own local container to upstream.

To prevent two agents from trying to solve the same problem at the same time, the harness uses a simple synchronization algorithm:

  1. Claude takes a "lock" on a task by writing a text file to current_tasks/ (e.g., one agent might lock current_tasks/parse_if_statement.txt, while another locks current_tasks/codegen_function_definition.txt). If two agents try to claim the same task, git's synchronization forces the second agent to pick a different one.
  2. Claude works on the task, then pulls from upstream, merges changes from other agents, pushes its changes, and removes the lock. Merge conflicts are frequent, but Claude is smart enough to figure that out.
  3. The infinite agent-generation-loop spawns a new Claude Code session in a fresh container, and the cycle repeats.

This is a very early research prototype. I haven’t yet implemented any other method for communication between agents, nor do I enforce any process for managing high-level goals. I don’t use an orchestration agent.

Instead, I leave it up to each Claude agent to decide how to act. In most cases, Claude picks up the “next most obvious” problem. When stuck on a bug, Claude will often maintain a running doc of failed approaches and remaining tasks. In the git repository of the project, you can read through the history and watch it take out locks on various tasks.

Lessons from programming with Claude agent teams

The scaffolding runs Claude in a loop, but that loop is only useful if Claude can tell how to make progress. Most of my effort went into designing the environment around Claude (the tests, the tooling, the feedback) so that it could orient itself without me. These are the approaches I’ve found most helpful when orchestrating multiple Claude instances.

Write extremely high-quality tests

Claude will work autonomously to solve whatever problem I give it, so it’s important that the task verifier is nearly perfect; otherwise, Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

For example, near the end of the project, Claude started frequently breaking existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline with stricter enforcement, so that Claude could better test its work and new commits couldn’t break existing code.

Put yourself in Claude’s shoes

I had to constantly remind myself that I was writing this test harness for Claude and not for myself, which meant rethinking many of my assumptions about how tests should communicate results.

For example, each agent is dropped into a fresh container with no context and will spend significant time orienting itself, especially on large projects. Before we even reach the tests, to help Claude help itself, I included instructions to maintain extensive READMEs and progress files that should be updated frequently with the current status.

I also kept in mind the fact that language models have inherent limitations, which, in this case, needed to be designed around. These include:

  • Context window pollution: The test harness should not print thousands of useless bytes. At most, it should print a few lines of output and log all important information to a file so Claude can find it when needed. Logfiles should be easy to process automatically: if there are errors, Claude should write ERROR and put the reason on the same line so grep will find it. It helps to pre-compute aggregate summary statistics so Claude doesn't have to recompute them.
  • Time blindness: Claude can't tell time and, left alone, will happily spend hours running tests instead of making progress. The harness prints incremental progress infrequently (to avoid polluting context) and includes a default --fast option that runs a 1% or 10% random sample. This subsample is deterministic per-agent but random across VMs, so Claude still covers all files but each agent can perfectly identify regressions.
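The output conventions above can be sketched as follows. The test names and the run_test stub are hypothetical stand-ins for real compiler tests, not the project’s actual harness:

```shell
# Stand-in for a real compiler test; here, one hypothetical test fails.
run_test() { [ "$1" != test_varargs ]; }

LOG=test_results.log
: > "$LOG"                     # full detail goes to the log, not stdout
pass=0; fail=0
for t in test_add test_ptr_arith test_varargs; do
  if run_test "$t" >>"$LOG" 2>&1; then
    pass=$((pass + 1))
  else
    # Grep-able failure line: "ERROR" plus the reason on the same line.
    echo "ERROR $t: test failed" >> "$LOG"
    fail=$((fail + 1))
  fi
done

# Pre-computed summary: a few lines of context instead of thousands of bytes.
echo "SUMMARY: $pass passed, $fail failed (full log: $LOG)"
grep '^ERROR' "$LOG" | head -5
```

The agent gets one summary line plus at most a handful of ERROR lines; everything else stays in the log file where grep can find it on demand.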

Make parallelism easy

When there are many distinct failing tests, parallelization is trivial: each agent picks a different failing test to work on. After the test suite reached a 99% pass rate, each agent worked on getting a different small open-source project (e.g., SQLite, Redis, libjpeg, QuickJS, Lua) to compile.

But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other's changes. Having 16 agents running didn't help because each was stuck solving the same task.

The fix was to use GCC as a known-good oracle to compare against. I wrote a new test harness that compiled a random majority of the kernel’s files with GCC and only the remaining files with Claude’s C compiler. If the kernel worked, the bug wasn’t in the subset Claude’s compiler had built; if it broke, the agent could narrow things down further by re-compiling some of those files with GCC. This let each agent work in parallel, fixing different bugs in different files, until Claude’s compiler could eventually compile all files. (After this worked, it was still necessary to apply delta debugging techniques to find pairs of files that failed together but worked independently.)
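The partitioning step might look something like this sketch. The file list, percentage, seed variable, and compiler names are illustrative assumptions; the real harness drives an actual kernel build rather than just printing the split:

```shell
# Deterministically split source files between the compiler under test
# ("ccc") and the known-good oracle (GCC). Hashing the path keeps the
# assignment stable for one agent, while a per-VM seed varies it across VMs.
FILES="init/main.c kernel/fork.c mm/memory.c fs/exec.c lib/string.c"
SEED=${AGENT_SEED:-7}   # per-VM seed (assumed environment variable)
PCT=40                  # build ~40% of files with ccc, the rest with GCC

ccc_files=""; gcc_files=""
for f in $FILES; do
  h=$(printf '%s-%s' "$SEED" "$f" | cksum | cut -d' ' -f1)
  if [ $((h % 100)) -lt "$PCT" ]; then
    ccc_files="$ccc_files $f"   # suspect set: compiled by Claude's compiler
  else
    gcc_files="$gcc_files $f"   # oracle set: compiled by GCC
  fi
done

echo "ccc:$ccc_files"
echo "gcc:$gcc_files"
# If the kernel then fails, the bug lies somewhere in ccc_files; shrink the
# set by moving files over to the GCC side until the failure disappears.
```

Because the split is a pure function of seed and path, each agent can reproduce its own partition exactly while different VMs probe different subsets of the kernel.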

Multiple agent roles

Parallelism also enables specialization. LLM-written code frequently re-implements existing functionality, so I tasked one agent with coalescing any duplicate code it found. I put another in charge of improving the performance of the compiler itself, and a third I made responsible for outputting efficient compiled code. I asked another agent to critique the design of the project from the perspective of a Rust developer, and make structural changes to the project to improve the overall code quality, and another to work on documentation.

Stress testing the limits of agent teams

This project was designed as a capability benchmark. I am interested in stress-testing the limits of what LLMs can just barely achieve today in order to help us prepare for what models will reliably achieve in the future.

I’ve been using the C Compiler project as a benchmark across the entire Claude 4 model series. As I did with prior projects, I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects. My goal with Opus 4.6 was to again test the limits.

Evaluation

Over nearly 2,000 Claude Code sessions across two weeks, Opus 4.6 consumed 2 billion input tokens and generated 140 million output tokens, a total cost just under $20,000. Compared to even the most expensive Claude Max plans, this was an extremely expensive project. But that total is a fraction of what it would cost me to produce this myself—let alone an entire team.

This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, PostgreSQL, and Redis, and has a 99% pass rate on most compiler test suites, including the GCC torture test suite. It also passes the developer's ultimate litmus test: it can compile and run Doom.

The compiler, however, is not without limitations. These include:

  • It lacks the 16-bit x86 compiler that is necessary to boot Linux out of real mode. For this, it calls out to GCC (the x86_32 and x86_64 compilers are its own).
  • It does not have its own assembler and linker; these are the very last bits that Claude started automating and are still somewhat buggy. The demo video was produced with a GCC assembler and linker.
  • The compiler successfully builds many projects, but not all. It's not yet a drop-in replacement for a real compiler.
  • The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
  • The Rust code quality is reasonable, but is nowhere near the quality of what an expert Rust programmer might produce.

The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

As one particularly challenging example, Opus was unable to implement the 16-bit x86 code generator needed to boot into real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60 KB, far exceeding the 32 KB code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase. (This is only the case for x86; for ARM and RISC-V, Claude’s compiler can compile everything by itself.)

The source code for the compiler is available. Download it, read through the code, and try it on your favorite C projects. I’ve consistently found the best way to understand what language models can do is to push them to their limits, and then study where they start to break down. Over the coming days, I’ll continue having Claude push new changes if you want to follow along with Claude’s continued attempts at addressing these limitations.

Looking forward

Each generation of language models opens up new ways of working with them. Early models were useful for tab-completion in IDEs. Before long, models could complete a function body from its docstring. The launch of Claude Code brought agents into the mainstream and enabled developers to pair-program with Claude. But each of these products operates under the assumption that a user defines a task, an LLM runs for a few seconds or minutes and returns an answer, and then the user provides a follow-up.

Agent teams show the possibility of implementing entire, complex projects autonomously. This allows us, as users of these tools, to become more ambitious with our goals.

We are still early, and fully autonomous development comes with real risks. When a human sits with Claude during development, they can ensure consistent quality and catch errors in real time. For autonomous systems, it is easy to see tests pass and assume the job is done, when this is rarely the case. I used to work in penetration testing, exploiting vulnerabilities in products produced by large companies, and the thought of programmers deploying software they’ve never personally verified is a real concern.

So, while this experiment excites me, it also leaves me feeling uneasy. Building this compiler has been some of the most fun I’ve had recently, but I did not expect this to be anywhere near possible so early in 2026. The rapid progress in both language models and the scaffolds we use to interact with them opens the door to writing an enormous amount of new code. I expect the positive applications to outweigh the negative, but we’re entering a new world which will require new strategies to navigate safely.

Acknowledgements

Special thanks to Josef Bacik, Edwin Chen, Bernardo Meurer Costa, Jake Eaton, Dan Kelley, Felix Klock, Jannet Park, Steve Weis, and many other people across Anthropic for their assistance and contributions.



Comments

  • By ndesaulniers 2026-02-05 21:41

    I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel. https://clangbuiltlinux.github.io/

    This LLM did it in (checks notes):

    > Over nearly 2,000 Claude Code sessions and $20,000 in API costs

    It may build, but does it boot (was also a significant and distinct next milestone)? (Also, will it blend?). Looks like yes!

    > The 100,000-line compiler can build a bootable Linux 6.9 on x86, ARM, and RISC-V.

    The next milestone is:

    Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.

    > The generated code is not very efficient. Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

    Still a really cool project!

    • By shakna 2026-02-05 21:59

      > Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

      Does it really boot...?

      • By ndesaulniers 2026-02-05 22:11

        > Does it really boot...?

        They don't need 16b x86 support for the RISCV or ARM ports, so yes, but depends on what 'it' we're talking about here.

        Also, FWIW, GCC doesn't directly assemble to machine code either; it shells out to GAS (GNU Assembler). This blog post calls it "GCC assembler and linker" but to be more precise the author should edit this to "GNU binutils assembler and linker." Even then GNU binutils contains two linkers (BFD and GOLD), or did they excise GOLD already (IIRC, there was some discussion a few years ago about it)?

        • By shakna 2026-02-05 22:34

          Yeah, didn't mention gas or ld, for similar reasons. I agree that a compiler doesn't necessarily "need" those.

          I don't agree that all the claims are backed up by their own comments, which means that there's probably other places where it falls down.

          It's... misrepresentation.

          Like Chicken is a Scheme compiler. But they're very up front that it depends on a C compiler.

          Here, they wrote a C compiler that is at least sometimes reliant on having a different C compiler around. So is the project at 50%? 75%?

          Even if it's 99%, that's not the same story as they tried to write. And if they wrote that tale instead, it would be more impressive, rather than "There's some holes. How many?"

          • By Philpax 2026-02-05 22:51

            Their C compiler is not reliant on having another C compiler around. Compiling the 16-bit real mode bootstrap for the Linux kernel on x86(-64) requires another C compiler; you certainly don't need another compiler to compile the kernel for another architecture, or to compile another piece of software not subject to the 32k constraint.

            The compiler itself is entirely functional; it just can't generate code optimal enough to fit within the constraints for that very specific (tiny!) part of the system, so another compiler is required to do that step.

            • By shakna 2026-02-09 08:16

              It also generates the wrong relocations for link time. And so cannot boot, even with help.

              > The “compiles the kernel” claim needs a footnote. CCC compiles all the C source files, but the final binary cannot be produced because CCC generates incorrect relocations for kernel data structures (__jump_table, __ksymtab).

      • By TheCondor 2026-02-06 03:52

        The assembler seems like nearly the easiest part. Slurp arch manuals and knock it out; it’s fixed and complete.

        • By jakewins 2026-02-06 17:21

          I am surprised by the number of comments that say the assembler is trivial - it is admittedly perhaps simpler than some other parts of the compiler chain, but it’s not trivial.

          What you are doing is kind of serialising a self-referential graph of machine-code entries that reference each other's addresses, but you don't know the addresses until you generate the machine code, because the (x86) instructions are variable-length. A chicken-and-egg problem.

          Personally I find writing parsers much much simpler than writing assemblers.

          • By nicebyte 2026-02-06 19:26

            assembler is far from trivial at least for x86 where there are many possible encodings for a given instruction. emitting the most optimal encoding that does the correct thing depends on surrounding context, and you'd have to do multiple passes over the input.

            • By jmalicki 2026-02-07 17:47

              What is a single example where the optimal encoding depends on context? (I am assuming you're just doing an assembler where registers have already been chosen, vs. a compiler that can choose sse vs. scalar and do register allocation etc.)?

              • By chris_swenson 2026-02-07 22:52

                “mov rcx, 0”. At least one assembler (the Go assembler) would at one point blindly (and arguably, incorrectly) rewrite this to “xor rcx, rcx”, which is smaller but modifies flags, which “mov” does not. I believe Go fixed this later, possibly by looking at surrounding instructions to see if the flags were being used, for instance by an “adc” later, to know if the assembler needs to pick the larger “mov” encoding.

                Whether that logic should belong in a compiler or an assembler is a separate issue, but it definitely was in the assembler there.

                • By jmalicki 2026-02-08 18:50

                  Ok fair, I saw that as out of scope for an assembler - since that is a different instruction not just how to encode.

              • By nicebyte 2026-02-10 21:53

                jumps is another one. jmp can have many encodings depending on where the target offset you're jumping to is. but often times, the offset is not yet known when you first encounter the jump insn and have to assemble it.

          • By jmalicki 2026-02-09 15:03

            All you have to do is record a table of fixup locations you can fill in in a second pass once the labels are resolved.

            • By ndesaulniers 2026-02-09 16:42

              In practice, one of the difficulties in getting _clang_ to assemble the Linux kernel (as opposed to GNU `as` aka GAS), was having clang implement support for "fragments" in more places.

              https://eli.thegreenplace.net/2013/01/03/assembler-relaxatio...

              There were a few cases IIRC around usage of the `.` operator which means something to the effect of "the current point in the program." It can be used in complex expressions, and sometimes resolving those requires multiple passes. So supporting GAS compatible syntax in more than just the basic cases forces the architecture of your assembler to be multi-pass.

            • By jakewins 2026-02-10 22:00

              I mean, no, it's more than that.

              You also need to choose optimal instruction encoding, and you need to understand how relocs work - which things can you resolve now vs which require you to encode info for the linker to fill in once the program is launched, etc etc.

              Not sure why I'm on this little micro-rant about this; I'm sure Claude could write a workable assembler. I'm more like.. I've written one assembler and many, many parsers, and the parsers were way simpler, yet this thread is littered with people who seem to think assemblers are just lookup tables from ascii to machine code with a loop slapped on top of them.

        • By shakna 2026-02-06 04:40

          Huh. A second person mentioning the assembler. Don't think I ever referred to one...?

    • By brundolf 2026-02-06 14:22

      One thing people have pointed out is that well-specified (even if huge and tedious) projects are an ideal fit for AI, because the loop can be fully closed and it can test and verify the artifact by itself with certainty. Someone was saying they had it generate a rudimentary JS engine because the available test suite is so comprehensive

      Not to invalidate this! But it's toward the "well-suited for AI" end of the spectrum

      • By HarHarVeryFunny 2026-02-06 15:15

        Yes - the gcc "torture test suite" that is mentioned must have been one of the enablers for this.

        It's notable that the article says Claude was unable to build a working assembler (& linker), which is nominally a much simpler task than building a compiler. I wonder if this was at least in part due to not having a test suite, although it seems one could be auto generated during bootstrapping with gas (GNU assembler) by creating gas-generated (asm, ELF) pairs as the necessary test suite.

        It does beg the question of how they got the compiler to the point of correctly generating a valid C -> asm mapping before tackling the issue of gcc compatibility, since the generated code apparently has no relation to what gcc generates. I wonder which compilers' source code Claude has been trained on, and how closely this compiler's code generation and attempted optimizations compare to those?

        • By spullara 2026-02-06 16:22

          i'm sure claude has been trained on every open source compiler

    • By qarl 2026-02-06 00:43

      > Still a really cool project!

      Yeah. This test sorta definitely proves that AI is legit. Despite the millions of people still insisting it's a hoax.

      The fact that the optimizations aren't as good as the 40 year gcc project? Eh - I think people who focus on that are probably still in some serious denial.

      • By PostOnce 2026-02-06 00:48

        It's amazing that it "works", but viability is another issue.

        It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.

        Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.

        On top of that, Anthropic is losing money on it.

        All of those things combined, viability remains a serious question.

        • By ryanjshaw 2026-02-06 14:28

          > You won't know until you've finished spending the money whether it will fail or not.

          How do you conclude that? You start off with a bunch of tests and build these things incrementally, why would you spend 20k before realizing there’s a problem?

          • By friendzis 2026-02-06 16:27

            Because literally no real-world non-research project starts with "we have an extremely comprehensive test suite and specification complete down to the most finite detail" and then searches for a way to turn it into code.

            • By galdauts 2026-02-06 17:50

              Precisely. Figuring out what the specification is supposed to look like is often the hardest part.

              • By mastermage 2026-02-07 15:42

                100% agreed. I use Claude often just to bounce ideas back and forth on specs I would like to create, which I know will never gain traction because they're either way too ambitious or too niche.

                And the number of times Claude proposes something that's completely contradictory in the same response, or completely does a 180 after two more responses, is ridiculous.

            • By ryanjshaw 2026-02-06 19:13

              I’ve spent nearly 20 years working as a consultant writing software, I know that. How do you think humans solve that problem?

              • By friendzis 2026-02-07 11:38

                Typically by putting cost caps on deliverables.

                • By ryanjshaw 2026-02-10 07:16

                  Which is pretty much what I said once you factor in that you evaluate (test) the deliverables before paying more.

        • By qarl 2026-02-06 02:05

          > It cost $20,000

          I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

          You should look it up. :)

          • By lelanthran 2026-02-06 04:56

            > > It cost $20,000

            > I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

            I'll bite - I can write you an unoptimised C compiler that emits assembly for $20k, and it won't be 100k lines of code (maybe 15k, the last time I did this?).

            It won't take me a week, though.

            I think this project is a good frame of reference and matches my experience - vibing with AI is sometimes more expensive than doing it myself, and always results in much more code than necessary.

            • By flakiness 2026-02-06 05:20

              Does it support x86, x86_64, arm64 and riscv? (sorry, just trolling - we don't know the quality of any backend other than x86_64, which is supposed to be able to build a bootable linux.)

              • By lelanthran 2026-02-06 05:44

                It's not hard to build a compiler just for a bootable linux.

                I see no test criteria that actually runs that built linux through various test plans, so, yeah, emitting enough asm just to boot is doable.

            • By p-e-w 2026-02-06 05:19

              > I can write you an unoptimised C compiler that emits assembly for $20k

              You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly. Even 10 times that would be seriously lowballing in the realm of contract work, regardless of whether it’s “optimised” or not (most software isn’t).

              • By lelanthran 2026-02-06 05:41

                > You may be willing to sell your work at that price, but that’s not the market rate, to put it very mildly.

                It is now.

                At any rate, this is my actual rate. I live in South Africa, and that's about 4 weeks of work for me, without an AI.

                • By qarl 2026-02-06 10:18

                  Deal. I'll pay you IF you can achieve the same level of performance. Heck, I'll double it.

                  You must provide the entire git history with small commits.

                  I won't be holding my breath.

                  • By lelanthran 2026-02-06 10:59

                    > Deal. I'll pay you IF you can achieve the same level of performance. Heck, I'll double it.

                    > You must provide the entire git history with small commits.

                    > I won't be holding my breath.

                    Sure; I do this often (I operate as a company because I am a contractor) - money to be held in escrow, all the usual contracts, etc.

                    It's a big risk for you, though - the level of performance isn't stated in the linked article so a parser in Python is probably sufficient.

                    TCC, which has in the past compiled bootable Linux images, was only around 15k LoC in C!

                    For reference, for an engraved-in-stone spec, producing a command-line program (i.e. no tech stack other than a programming language with its standard library), a coder could reasonably produce 5000+ LoC per week.

                    Adding the necessary extensions to support booting isn't much either, because the 16-bit stuff can be done just the same as CC did it - shell out to GCC (thereby not needing many of the extensions).

                    Are you *really* sure that a simple C compiler will cost more than 4 weeks f/time to do? It takes 4 weeks or so in C, are you really sure it will take longer if I switch to (for example) Python?

                    • By f1shy 2026-02-08 12:45

                      And having TCC, GCC, CLANG and any other project lying around as cheat sheet, as the trained model, in some way, had.

                    • By qarl 2026-02-06 11:21

                      > the level of performance isn't stated in the linked article so a parser in Python is probably sufficient.

                      No, you'll have to match the performance of the actual code, regardless of what happens to be written in the article. It is a C compiler written in Rust.

                      Obviously. Your games reveal your malign intent.

                      EDIT: And good LORD. Who writes a C compiler in python. Do you know any other languages?!?

                      • By lelanthran 2026-02-0611:381 reply

                        > No, you'll have to match the performance of the actual code, regardless of what is in the article. It is a C compiler written in Rust.

                        Look, it's clear that you don't hire s/ware developers very much - your specs are vague and open to interpretation, and it's also clear that I do get hired often, because I pointed out that your spec isn't clear.

                        As far as "playing games" goes, I'm not allowing you to change your single-sentence spec which, very importantly, has "must match performance" - which I shall interpret as "performance of emitted code" and not "performance of compiler".

                        > Your games reveal your intent.

                        It should be obvious to you by now that I've done this sort of thing before. The last C compiler I wrote was 95% compliant with the (at the time, new) C99 standard, and came to around 7000-8000 LoC of C89.

                        > EDIT: And good LORD. Who writes a C compiler in python. Do you know any other languages?!?

                        Many. The last language I implemented (in C99) took about two weeks after hours (so, maybe 40 hours total?), was interpreted, and was a dialect of Lisp. It's probably somewhere on Github still, and that was (IIRC) only around 2000LoC.

                        What you appear to not know (maybe you're new to C) is that C was specifically designed for ease of implementation.

                        1. It was designed to be quick and easy to implement.

                        2. The extensions in GCC to allow building bootable Linux images are minimal, TBH.

                        3. The actual 16-bit emission necessary for booting was not done by CC, but by shelling out to GCC.

                        4. The 100kLoC does not include the tests; it used the GCC tests.

                        I mean, this isn't arcane and obscure knowledge, you know. You can search the net right now and find 100s of undergrad CS projects where they implement enough of C to compile many compliant existing programs.

                        I'm wondering; what languages did you write an implementation for? Any that you designed and then implemented?

                        • By qarl 2026-02-0612:291 reply

                          [flagged]

                          • By lelanthran 2026-02-0612:481 reply

                            > Too late friend, you've revealed your stripes.

                            So you are not willing to put $20k in escrow for, as per your offer:

                            >>>> Deal. I'll pay you IF you can achieve the same level of performance. Heck, I'll double it.

                            I just noticed now that you actually offered double. I will do it. This is my real name; my contact details are not hard to find.

                            I will do it, with emitted binaries performing as well as or better than the binaries emitted by CC.

                            Put your $40k into a recognised South African escrow service (I've used a few in the past, but I'd rather you choose one so you don't accuse me of being some sort of African scammer).

                            Because I am engaged in a 6+ hours/day gig right now, I cannot do it f/time until my current gig is completed (and they are paying me directly, not via escrow, so I am not going to jeopardise that).

                            I can however do a few hours each day, and collect my payment of $40k only once the kernel image boots in about the same time that the CC kernel image boots.

                            > Yes, we all took the compilers class in college. Those of us who went to college, that is.

                            If you knew that, why on earth would you assume that implementing a C compiler is at all a complex task?

                            • By qarl 2026-02-0616:201 reply

                              HA.

                              • By abc123abc123 2026-02-0622:201 reply

                                lelanthran won, qarl lost. Well played lelanthran!

                                • By qarl 2026-02-0623:17

                                  Dude - now I have to turn off my auto notify.

                                  You are fucking nuts.

                  • By bee_rider 2026-02-0614:481 reply

                    You seem to have doubled down on a bluff that was already called.

                    • By qarl 2026-02-0621:401 reply

                      Naw. I got him to reveal himself, which was the whole point.

                      It's amazing what you can get people to do.

                      • By lelanthran 2026-02-078:07

                        > Naw. I got him to reveal himself, which was the whole point.

                        Reveal myself as ... a contractor agreeing to your bid?

                        > It's amazing what you can get people to do.

                        There's a ton of money now floating around in pursuit of "proving" how cost-efficient LLM coding is.

                        I'm sure they can spare you the $40k to put into escrow?

                        After all, if I don't deliver, then the AI booster community gets a huge win - highly respected ex-FAANG staff engineer with 30 years of verified dev experience could not match the cost efficiency of Claude Code.

                        I am taking you up on your original offer: $40k for a C compiler that does exactly what the CCC program in the video does.

                • By zingar 2026-02-0617:151 reply

                  That’s a VERY nice rate for SA; approximately what I charge in the UK. I assume these are not local companies who hire you.

                  • By lelanthran 2026-02-0618:26

                    > That’s a VERY nice rate for SA; approximately what I charge in the UK. I assume these are not local companies who hire you.

                    A local Fintech needing PCI work pays that, but that's not long-term contracts.

              • By wavemode 2026-02-0620:46

                No, you're overestimating how complex it is to write an unoptimized C compiler. C is (in the grand scheme of things) a very simple language to implement a compiler for.

                The rate probably goes up if you ask for more and more standards (C11, C17, C23...) but it's still a lot easier than compilers for almost any other popular language.

              • By seg_lol 2026-02-0719:27

                This is very much a John Brown claim that will, in the end, kill the OP. I'd rather have the OP using LLM-powered code review tools to add their experience to that AI-generated compiler.

              • By psychoslave 2026-02-0611:05

                That feels like a Silicon-Valley-centric point of view. Plus, who would really spend $20k on building any C compiler today, in the actual landscape of software?

                All this is saying is that license laundering of a codebase is now $20k away through automated processes, at least if the original codebase is fully available. Well, with the current state of the art you'll actually end up with a codebase which is not as good as the original, but that's it.

          • By PostOnce 2026-02-069:591 reply

            That's irrelevant in this context, because it's not "get the humans to make a working product OR get the AI to make a working product"

            The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.

            Coincidentally, yes, I am aware: my last contract was building out a SCADA module that AI had failed to develop at the company that contracted me.

            I'm using that money to finance a new software company, and so far, AI hasn't been much help getting us off the ground.

            Edit: oh yeah, and on top of paying Claude to fuck it up, you still have to also pay the salary of the guy arguing with Claude.

            • By AlfeG 2026-02-0612:29

              > The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.

              You can easily pay humans $20k a day and get gibberish in output. Heck, this happens all the time. This is happening right now in multiple companies.

              Yes, sometimes humans produce nice code. This happens from time to time...

          • By sarchertech 2026-02-062:162 reply

            You wouldn’t pay a human to write 100k LOC. Or at least you shouldn’t. You’d pay a human to write a working useful compiler that isn’t riddled with copyright issues.

            If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.

            • By qarl 2026-02-062:191 reply

              Are you trolling me? Companies (made of humans) write 100,000 LOC all the time.

              And it's really expensive, despite your suspicions.

              • By sarchertech 2026-02-063:002 reply

                No, companies don’t pay people to write 100k LOC. They pay people to write useful software.

                We figured out that LOC was a useless productivity metric in the 80s.

                • By qarl 2026-02-063:403 reply

                  [flagged]

                  • By rezonant 2026-02-0610:503 reply

                    I can't stress enough how much LOC is not a measure of anything.

                    • By icedchai 2026-02-0617:10

                      Yep. I’ve seen people copy hundreds of lines instead of adding an if statement.

                    • By f1shy 2026-02-0812:55

                      In fact it is, and it can be useful. IF you have quality controls in place, so the code is of reasonable quality, LOC will correlate with the amount of functionality and/or complexity. Is it a good metric? No. Can it be used just like that to compare arbitrary code bases? Absolutely not!

                      As a seasoned manager, I have an idea how long a feature should take, both in implementation effort and in length of code. I have to know this; it's my everyday work.

                    • By qarl 2026-02-0616:222 reply

                      OK, well, the people in MY software industry use LOC as an informal measure of complexity.

                      LIKE THE WHOLE WORLD DOES.

                      But hey, maybe it's just the extremely high profile projects I've worked on.

                      • By sarchertech 2026-02-071:24

                        As an informal measure of the complexity of the code, sure: 100k lines are inherently more complex than 10k because there’s just more there to look at. And if you assume that two projects were made by competent teams, saying that one application is 10k LoC and the other is 1 million might be useful as a heuristic for the number of man-hours spent.

                        But I can write a 100k LOC compiler where 90k lines are for making error messages look pixel perfect on 10 different operating systems. Or where 90k lines are useless layers upon layers of indirection. That doesn’t mean that someone is willing to pay more for it.

                        AI frequently does exactly that kind of thing.

                        So saying my AI made a 100k LOC program that does X, and then comparing the cost to a 100k LOC program written by a human is a nonsense comparison. The only thing that matters is to compare it to how much a company would pay a human to produce a program capable of the same output.

                        In this case the program is commercially useless. Literally of zero monetary value, so no company would pay any money for it. Therefore there’s nothing to compare it to.

                        That’s not to say it’s not an interesting and useful experiment. Or that things can’t be different in the future.

                      • By rezonant 2026-02-0810:58

                        Such as?

                  • By DSMan195276 2026-02-0619:13

                    Without questioning the LOC metric itself, I'll propose a different problem: LOC for human and AI projects are not necessarily comparable for judging their complexity.

                    For a human, writing 100k LOC to do something that might only really need 15k would be a bit surprising and unexpected - a human would probably reconsider what they were doing well before they typed 100k LOC. Whereas an AI doesn't necessarily have that concern - it can just keep generating code, and it doesn't care how long that will take, so it doesn't have the same practical pressure to produce concise code.

                    The result is that while for large enough human-written programs there's probably an average "density" they reach in relation of LOC vs. complexity of the original problem, AI-generated programs probably average out at an entirely different "density" number.

                  • By beowulfey 2026-02-0612:341 reply

                    Your first post specifically stated:

                    "I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???"

                    which any reasonable reading would take to mean "paid-by-line", which we all know doesn't happen. Otherwise, I could type out 30,000 lines of gibberish and take my fat paycheck.

                    • By qarl 2026-02-0616:23

                      [flagged]

            • By Chaosvex 2026-02-0614:451 reply

              > you could probably get a human to whip you up a C compiler for a lot less than $20k

              I fork Clang or GCC and rename it. I'll take only $10k.

              • By f1shy 2026-02-0812:591 reply

                My question, which I still haven’t found anybody asking: how many compilers, including but not limited to the two most famous, were in the training set?

                • By myng111 2026-02-0914:45

                  Certainly tcc. Probably also rui314's chibicc as it's relatively popular. sdcc is likely in there as well. Among numerous others that are either proprietary or not as well known.

          • By etler 2026-02-068:21

            If my devs are writing that much code they're doing something wrong. Lines of code is an anti metric. That used to be commonly accepted knowledge.

          • By m00x 2026-02-0611:311 reply

            It really depends on the human and the code they output.

            I can get my 2-year-old child to output 100k LoC, but it won't be very good.

            • By qarl 2026-02-0616:281 reply

              Your 2yr old can't build a C compiler in Rust that builds Linux.

              Sorry mate, I think you're tripping.

              • By m00x 2026-02-070:00

                I never said this. I think you're the one tripping mate.

          • By psychoslave 2026-02-0610:57

            Well, if these humans can cheat by taking whatever needed degree of liberty in copycat attitude to fit in the budget, I guess that a simple `git clone https://gcc.gnu.org/git/gcc.git SomeLocalDir` is as close to $0 as one can hope to either reach. And it would end up being far more functional and reliable. But I get that big-corp overlords and their wanna-match-KPI minions will prefer an "clean-roomed" code base.

          • By bopbopbop7 2026-02-062:171 reply

            100k lines of clean, bug free, optimized, and vulnerability free code or 100k lines of outsourced slop? Two very different price points.

            • By qarl 2026-02-062:202 reply

              A compiler that can build linux.

              That level of quality should be sufficient.

              Do you know any low quality programmers that write C compilers in rust THAT CAN BUILD LINUX?

              No you don't. They do not exist.

              • By icedchai 2026-02-062:446 reply

                Yep. Building a working C compiler that compiles Linux is an impossible task for all but the top 1% of developers. And the ones that could do it have better things to do, plus they’d want a lot more than 20K for the trouble.

                • By vbezhenar 2026-02-064:391 reply

                  What's so hard about it? Compiler construction is a well-researched topic and is taught in universities. I made a toy language compiler as a student. Maybe I'm underestimating this task, but I think that I could build a simple C compiler which outputs trivial assembly. Given my salary of $2500, that would probably take me around a year, so that's pretty close LoL.

                  • By cma 2026-02-0616:251 reply

                    You can one shot prompt a toy C compiler. Getting one that can compile Linux in a bootable way is significantly harder.

                    • By f1shy 2026-02-0813:032 reply

                      Everybody talks as if Linux were the most difficult thing in the world to compile. The reality is that Linux is well written and was designed from the beginning with portability, even across crappy compilers, in mind.

                      Also, the booting part, as stated a few times, is debatable.

                      • By icedchai 2026-02-0818:30

                        The reality is you can build Linux with gcc and clang. And that’s it. Years ago you could use Intel’s icc compiler, but that stopped being supported. Let’s stop pretending it’s an undergrad project.

                      • By cma 2026-02-0822:59

                        Just writing a non-toy C preprocessor is non-trivial.

                • By viraptor 2026-02-0610:47

                  It's a bit more nuanced. You can build a simple compiler without too many issues. But once you want it to do optimisations, control-flow protection, good and fast register allocation, inlining, autovectorisation, etc., that's going to take multiples of the original time.

                • By dgacmu 2026-02-073:151 reply

                  Some of the hardest parts of the compiler are optimization and clear error handling/reporting. If you forego those - because you're testing against a codebase that is already free of things that break compilation and have no particular performance requirements for the generated code - it's a substantially simpler task.

                  • By f1shy 2026-02-0813:05

                    Making a basic C compiler, without much error/warning detection and/or optimization, is as a matter of fact not so difficult. In many universities it is a semester project for 2 to 3 students.

                • By rezonant 2026-02-0610:541 reply

                  > Building a working C compiler ... is an impossible task

                  I think you might be thinking of C++

                  • By icedchai 2026-02-0611:01

                    I’m not. I’ve been working with C on and off for 30 years. Linux requires GNU extensions beyond standard C. Once you get the basics done, there’s still a lot more work to do. Compiling a trivial program might work. But you’ll hit an edge case or 50 in the millions of lines in Linux.

                    I also should’ve qualified my message with “in 2 weeks”, or even “in 2 months.” Given more time it’s obviously possible for more people.

                • By rhubarbtree 2026-02-068:061 reply

                  Interesting, why impossible? We studied compiler construction at uni. I might have to dig out a few books, but I’m confident I could write one. I can’t imagine anyone on my course of 120 nerds being unable to do this.

                  • By menaerus 2026-02-069:582 reply

                    You are underestimating the complexity of the task, as do other people in this thread. It's not trivial to implement a working C compiler, much less one that proves its worth by successfully compiling one of the largest open-source code repositories ever - which, btw, is not even written in a plain ISO C dialect.

                    • By rhubarbtree 2026-02-0613:401 reply

                      I didn’t say it was trivial. Just that I thought my course mates would be able to do it.

                      • By qarl 2026-02-0616:323 reply

                        You thought your course mates would be able to write a C compiler that builds Linux?

                        Huh. Interesting. Like the other guy pointed out, compiler classes often get students to write toy C compilers. I think a lot of students don't understand the meaning of the word "toy". I think this thread is FULL of people like that.

                        • By icedchai 2026-02-0618:15

                          I took a compilers course 30 years ago. I have near zero confidence anyone (including myself) could do it. The final project was some sort of toy language for programming robots with an API we were given. Lots of yacc, bison, etc.

                          Lots of segfaults, too.

                        • By rhubarbtree 2026-02-0711:41

                          If it helps, I did a PhD in computer science and went to plenty of seminars on languages, fuzz testing compilers, reviewed for conferences like PLDI. I’m not an expert but I think I know enough to say - this is conceptually within reach if a PITA.

                        • By stevejb 2026-02-0619:181 reply

                          Hey! I built a Lego technic car once 20 years ago. I am fully confident that I can build an actual road worthy electric vehicle. It's just a couple of edge cases and a bit bigger right? /s

                          • By rhubarbtree 2026-02-0712:111 reply

                            That's really helpful, actually, as you may be able to give me some other ideas for projects.

                            So, things you don't think I or my coursemates could do include writing a C compiler that builds a Linux kernel.

                            What else do you think we couldn't do? I ask because there are various projects I'll probably get to at some point.

                            Things on that list include (a) writing an OS microkernel and some of the other components of an OS. Don't know how far I'll take it, but certainly a working microkernel for one machine, if I have time I'll build most of the stack up to a window manager. (b) implementing an LLM training and inference stack. I don't know how close to the metal I'd go, I've done some low level CUDA a long time ago when it was very new and low-level, depends on time. I'll probably start the LLM stuff pretty soon as I'm keen to learn.

                            Are these also impossible? What other things would you add to the impossible list?

                            • By icedchai 2026-02-0719:301 reply

                              Building a microkernel based OS feels feasible because it’s actually quite open ended. An “OS” could be anything from single user DOS to a full blown Unix implementation, with plenty in between.

                              Amiga OS is basically a microkernel and that was built 40 years ago. There are also many other examples, like Minix. Do I think most people could build a full microkernel based mini Unix? No. But they could get “something” working that would qualify as an OS.

                              On the other hand, there are not many C compilers that build Linux. There are many implementations of C compilers, however. The goal of “build Linux” is much more specific.

                              • By rhubarbtree 2026-02-0721:09

                                Minix is a fair example, or Hurd. Something like that.

                                So what other projects are impossible? That was my question.

                    • By f1shy 2026-02-0813:101 reply

                      They did not compile the whole of Linux, mind you - just an absolutely minimal kernel.

                      Doing a real compiler to be used by humans is difficult. Doing a compiler that “gets the thing done” is a different thing.

                      • By menaerus 2026-02-1014:51

                        Nowhere did I imply it is production-ready. I said "working compiler" and by definition Claude built one since they booted up the kernel.

                  • By icedchai 2026-02-0610:511 reply

                    I’ll be shocked if they are able to do it in 4 months, never mind 4 weeks.

                    • By f1shy 2026-02-0813:08

                      Have you ever seen Tsoding youtube channel? I’m sure Mr Zosin can very much do it in one week. And considering russian salaries, it will be like an order of magnitude cheaper.

              • By bopbopbop7 2026-02-062:281 reply

                Do you think this was guided by a low quality Anthropic developer?

                You can give a developer the GCC test suite and have them build the compiler backwards, which is how this was done. They literally brute forced it, most developers can brute force. It also literally uses GCC in the background... Maybe try reading the article.

        • By tumdum_ 2026-02-061:153 reply

          > On top of that, Anthropic is losing money on it.

          It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e

          • By byzantinegene 2026-02-069:082 reply

            No, and that is widely known. The actual problem is that the margins are not sufficient at that scale to make up for the gargantuan cost of training their SOTA models.

            • By JamesBarney 2026-02-0920:42

              They are large enough to cover their previous training costs but not their next gen training costs.

              i.e They made more money on 3.5 than 3.5 cost to train, but didn't make enough money on 3.5 to train 4.0.

            • By aurareturn 2026-02-0612:421 reply

              Source on that?

              Because inference revenue is outpacing training cost based on OpenAI’s report and intuition.

              • By cma 2026-02-0616:24

                Net inference revenue would need to be outpacing training costs to go against his point about margins.

          • By quikoa 2026-02-0618:05

            That's for the API right? The subs are still a loss. I don't know which one of the two is larger.

        • By chamomeal 2026-02-062:42

          That's a good point! Here claude opus wrote a C compiler. Outrageously cool.

          Earlier today, I couldn't get opus to replace useEffect-triggered-redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect rube goldberg machine.

          To be fair, it was a pretty horrible mess of useEffects. But just another data point.

          Also I was hoping opus would finally be able to handle complex typescript generics, but alas...

        • By georgeven 2026-02-0619:58

          It's $20,000 in 2026; with the price of tokens halving every year (at a given performance level), that's around $1,250 in 2030.

        • By RA_Fisher 2026-02-0619:40

          Progress can be reviewed over time, and I'd think that'd take a lot of the risk out.

        • By nly 2026-02-090:58

          Also, heaven knows if the result is maintainable or easy to change.

        • By bdangubic 2026-02-061:194 reply

          > On top of that, Anthropic is losing money on it

          This has got to be my favorite one of them all that keeps coming up in too many comments… You know who else was losing money in the beginning?! Every successful company that ever existed! Some, like Uber, were losing billions for a decade. And when was the last time you rode in a taxi? (I still do; my kid never will.) Not sure how old you are and if you remember “Facebook will never be able to monetize on mobile…” - they all lose money, until they do not.

          • By ThrowawayR2 2026-02-064:032 reply

            Anyone remember the dotcom bust?

            • By qarl 2026-02-064:391 reply

              Oh yeah, I do. That whole internet thing was a total HOAX. I can't believe people bought into that.

              Can you imagine if Amazon, eBay, PayPal, or Salesforce existed today?

              • By pjmlp 2026-02-066:531 reply

                Well, how is your Solaris installation going?

                I also remember having gone into research, because there were no jobs available, and even though I was employed at the time, our salaries weren't being paid.

                • By qarl 2026-02-0610:242 reply

                  What does this even mean?

                  • By f1shy 2026-02-0813:15

                    Seems you don’t remember much of that time. Let me refresh: “we are the dot in dot com”

                  • By pjmlp 2026-02-0611:231 reply

                    ==> Anyone remember the dotcom bust?

                    • By qarl 2026-02-0616:24

                      [flagged]

            • By mikkupikku 2026-02-0619:05

              Remember that thing that caused it? That "Internet" thing? After those companies went bust it pretty much disappeared didn't it.

          • By deaux 2026-02-065:242 reply

            Completely detached from reality - brainwashed SV VCs who have made dumping the norm in their bubble.

            I can guarantee you that 90% of successful businesses in the world made a profit in their first year.

            • By HWR_14 2026-02-0613:091 reply

              A year seems aggressive. Successful restaurants average around a year to break even, with the vast majority taking between 6 and 18 months.

              They are making a profit on each sale, but there are fixed costs to running a business.

              • By deaux 2026-02-073:58

                1 year isn't aggressive because of the modifier "successful". Most businesses that aren't profitable 12 months in go out of business not long after, having remained unsuccessful throughout their lifespan.

                Restaurants have comparatively high start up costs and ramp up time. Compare to e.g. a store selling clothes. If for successful restaurants the average time is already a year, then in general for successful businesses it's going to be less.

            • By brookst 2026-02-066:341 reply

              I’ll bite. Share your data?

              Companies that were not profitable in their first year: Microsoft, Google, SpaceX, airBnB, Uber, Apple, FedEx, Amazon.

              If the vast majority of companies are immediately profitable, why do we have VC and investment at all? Shouldn’t the founders just start making money right away?

              • By deaux 2026-02-068:09

                > Companies that were not profitable in their first year: Microsoft, Google, SpaceX, airBnB, Uber, Apple, FedEx, Amazon.

                US Big Tech, US Big Tech, US Tech-adjacent, US Big Tech, US Big Tech, US Big Tech, FedEx, US Tech-adjacent.

                In other words, exactly what I was getting at.

                Also, a basic search shows Microsoft to have been profitable first year. I'd be very surprised if they weren't. Apple also seems to have taken less than 2 years. And unsurprisingly, these happen to be the only two among the tech companies you named that launched before 1995.

                Check out the Forbes Global 5000. Then go think about the hypothetical Forbes Global 50,000. Is the 50,000th most successful company in the world not successful? Of course not, it's incredibly successful.

                > why do we have VC and investment at all

                Out of all companies started in 2024 I can guarantee you that <0.01% have received VC investment by now (Feb 2026) and <1% of tech companies did. I'll bet my house on it.

          • By PostOnce 2026-02-061:511 reply

            Are we forgetting that sometimes, they just go bankrupt?

            • By bdangubic 2026-02-062:182 reply

              name one with comparable number of users and revenue? not saying you are wrong but I would bet against the outcome

              • By PostOnce 2026-02-069:37

                I'll be able to do just that in 36mo or so after the IPOs and subsequent collapse, I think.

              • By svieira 2026-02-063:421 reply

                Enron

                • By bdangubic 2026-02-064:301 reply

                  I should have guessed someone would answer this question in this thread with Enron :)

                  I did not ask for random company that went under for any reason but specific question related to users and revenue.

                  • By grey-area 2026-02-069:40

                    Well there are lots and lots of examples that don't end in bankruptcy, just a very large loss of capital for investors. The majority of the stars of the dotcom bubble just as one example: Qualcomm, pets.com, Yahoo!, MicroStrategy etc etc.

                    Uber, which you cite as a success, is only just starting to make any money, and any original investors are very unlikely to see a return given the huge amounts ploughed in.

                    MicroStrategy has transformed itself, same company, same founder, similar scam 20 years later, only this time they're peddling bitcoin as the bright new future. I'm surprised they didn't move on to GAI.

                    Qualcomm is now selling itself as an AI first company, is it, or is it trying to ride the next bubble?

                    Even if GAI becomes a roaring success, the prominent companies now are unlikely to be those with lasting success.

          • By qarl 2026-02-063:521 reply

            I love how your comment is getting downvoted.

            Like it's a surprise that startups burn through money. I get the feeling that people really have no idea what they're talking about in here anymore.

            It's a shame.

            • By cowl 2026-02-065:482 reply

              then you are misunderstanding the downvoting. It's not the fact that they are burning money. It's that this cost $20k today, but that is not the real cost once you factor in that they are losing money at this price.

              So tomorrow, when this "startup" has to come out of its money-burning phase, as every startup does sooner or later, that cost will increase, because there is no other monetising avenue, at least not for Anthropic, which "will never use ads".

              At $20k this "might" be a reasonable cost for "the project"; at $200k it might not.

              • By brookst 2026-02-066:381 reply

                That would be insightful if the cost of inference weren’t declining at roughly 90% per year. Source: https://epoch.ai/data-insights/llm-inference-price-trends

                • By EddieRingle 2026-02-068:171 reply

                  According to that article, the data they analyzed was API prices from LLM providers, not their actual cost to perform the inference. From that perspective, it's entirely possible to make "the cost of inference" appear to decline by simply subsidizing it more. The authors even hint at the same possibility in the overview:

                  > Note that while the data insight provides some commentary on what factors drive these price drops, we did not explicitly model these factors. Reduced profit margins may explain some of the drops in price, but we didn’t find clear evidence for this.

                  • By brookst 2026-02-081:38

                    What in the world would the profit motive be to “make it appear” that inference cost is declining? Any investors would have access to the real data. End users don’t care. Why would you do the work for an elaborate deception?

              • By aurareturn 2026-02-0612:43

                Source that they’re losing money on each token?

      • By thesz 2026-02-066:032 reply

          > This test sorta definitely proves that AI is legit.
        
        This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

        The "out of distribution" test would be something like "implement a (self-bootstrapping, Linux-kernel-compatible) C compiler in J." J is different enough from C, and I know of no such compiler.

        • By disgruntledphd2 2026-02-0610:091 reply

          > This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

          It's still really, really impressive though.

          Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.

          Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.

          • By thesz 2026-02-0610:261 reply

              > It's still really, really impressive though.
            
            Do we know how many attempts were made to create such a compiler during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing be a result of luck?

            Much like quoting Quake code almost verbatim not so long ago.

            • By disgruntledphd2 2026-02-0611:511 reply

              > Do we know how many attempts were made to create such a compiler during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing be a result of luck?

              No we don't and yeah we would expect them to only report positive results (this is both marketing and investigation). That being said, they provide all the code et al for people to review.

              I do agree that an out of distribution test would be super helpful, but given what we know about LLMs it would almost certainly fail, so I'm not too pushed about it.

              Look, I'm pretty sceptical about AI boosting, but this is a much better attempt than the windsurf browser thing from a few months back and it's interesting to know that one can get this work.

              I do note that the article doesn't talk much about all the harnesses needed to make this work, which assuming that this approach is plausible, is the kind of thing that will be needed to make demonstrations like this more useful.

              • By thesz 2026-02-0613:331 reply

                  > No we don't and yeah we would expect them to only report positive results (this is both marketing and investigation).
                
                This is a matter of methodology. If they train models on that task, or somehow score/select models based on their progress on that task, then we have test set leakage [1].

                [1] https://en.wikipedia.org/wiki/Leakage_(machine_learning)

                This question is extremely important because test set leakage leads to impressive-looking results that do not generalize at all.

                • By disgruntledphd2 2026-02-0615:591 reply

                  > This is a matter of methodology. If they train models on that task, or somehow score/select models based on their progress on that task, then we have test set leakage [1].

                  I am quite familiar with leakage, having been building statistical models for maybe 15+ years at this point.

                  However, that's not really relevant in this particular case given that LLMs are trained on approximately the entire internet, so leakage is not really a concern (as there is no test set, apart from the tasks they get asked to do in post-training).

                  I think it's impressive that this works at all, even if it's just predicting tokens (which is basically what they're trained to do), as it is a pointer towards potentially more useful tasks (convert this COBOL code base to Java, for instance).

                  I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.

                  • By thesz 2026-02-0620:46

                      > I think the missing bit here is that this only works for cases where there's a really large test set (the HTML spec, the Linux kernel). I'm not convinced that the models would be able to maintain coherence without this, so maybe that's what we need to figure out how to build to make this actually work.
                    
                    Take any language with a compiler and several thousand users and you have plenty of tests that approximate the spec, inward and outward.

                    Here's, for example, the VHDL test suite for GHDL, an open source VHDL compiler and simulator: https://github.com/ghdl/ghdl/tree/master/testsuite

                    The GHDL test suite is sufficient and general enough to develop a pretty capable clone. To my knowledge, GHDL is the only open source VHDL compiler, and it is written in Ada. And the expertise to implement another one from scratch to train an LLM on is very, very scarce: VHDL, being a highly parallel variant of Ada, is quirky as hell.

                    So someone could test your hypothesis on VHDL: agent-code a VHDL compiler and simulator in Rust so that it passes the GHDL test suite. Would it take two weeks and $20,000, as with C? I don't know, but I really doubt it.

        • By Rudybega 2026-02-0616:431 reply

          There are two compilers that can handle the Linux kernel. GCC and LLVM. Both are written in C, not Rust. It's "in distribution" only if you really stretch the meaning of the term. A generic C compiler isn't going to be anywhere near the level of rigour of this one.

          • By thesz 2026-02-0617:301 reply

            There is tinycc, that makes it three compilers.

            There is a C compiler implemented in Rust from scratch: https://github.com/PhilippRados/wrecc/commits/master/?after=... (the very beginning of commit history)

            There are several C compilers written in Rust from scratch of comparable quality.

            We do not know whether Anthropic has a closed source C compiler written in Rust in their training data. We also do not know whether Anthropic validated their models on their ability to implement C compiler from scratch before releasing this experiment.

            The language J I proposed has no C compiler implemented in it at all. Idiomatic J expertise is scarce and expensive, so it would be a significant expense for Anthropic to get a C compiler in J into their training data. Being Turing-complete, J can express all the typical compiler tips and tricks from compiler books, albeit in an unusual way.

            • By Rudybega 2026-02-0620:15

              TinyCC can't compile a modern Linux kernel; it doesn't support a ton of the extensions the kernel uses. That Rust compiler similarly can't do it.

      • By LinXitoW 2026-02-063:12

        How does 20K to replicate code available in the thousands online (toy C compilers) prove anything? It requires a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.

      • By soperj 2026-02-066:11

        Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their Just Walk Out stores, but then finding out it was thousands of people in India just watching you via camera.

      • By cardanome 2026-02-0615:57

        I will write you a C compiler by hand for $19k and it will be better than what Claude made.

        Writing a toy C compiler isn't that hard. Any decent programmer can write one in a few weeks or months. The optimizations are actually the interesting part, and Claude fails hard at that.

      • By kvemkon 2026-02-061:311 reply

        > optimizations aren't as good as the 40 year gcc project

        with all optimizations disabled:

        > Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

        • By qarl 2026-02-061:471 reply

          That distinction doesn't change my point. I am not surprised that a 40 year old project generates better code than this brand new one.

          • By charcircuit 2026-02-063:511 reply

            Not only is it new, there has been zero performance optimization done (well, none prompted for, at least). Once you give the agents a profiler and start a loop focusing on performance, you'll see it start improving.

            • By thesz 2026-02-066:151 reply

              We are talking about a compiler here, and the "performance" referred to above is the performance of the generated code.

              When you are optimizing a program, you have a specific part of code to improve. The part can be found with profiler.

              When you are optimizing a compiler generated code, you have many similar parts of code in many programs and not-so-specific part of compiler that can be improved.

              • By charcircuit 2026-02-066:281 reply

                Yes, the performance of the generated code. You have a benchmark of a handful of common programs going through common workflows and you measure the performance of the generated code. As tweaks are made you see how the different performance experiments affect the overall performance. Some strategies are always a win, but things like how you layout different files and functions in memory have different trade offs and are hard to know up front without doing actual real world testing.

                • By thesz 2026-02-0610:17

                    > As tweaks are made...
                    > ...how you layout different files and functions in memory have different trade offs and are hard to know up front without doing actual real world testing.
                  
                  These are definitely not algorithmic optimizations like privatization [1].

                  https://en.wikipedia.org/wiki/Privatization_(computer_progra...

                  To correctly apply privatization one has to have correct dependency analysis. This analysis uses results of many other analyses, for example, value range analysis, something like Fourier-Motzkin algorithm, etc.

                  So if this agentic-optimized compiler has a program where privatization is not applied, what tweaks should the agents apply?
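                  (A minimal C sketch of the idea, illustrative only and not code from any compiler discussed here: privatization gives each iteration its own copy of a temporary, removing the false loop-carried dependence that blocks parallelization. The compiler can only do this after dependency analysis proves no value flows through the temporary between iterations.)

                  ```c
                  #include <assert.h>

                  #define N 8

                  /* Before privatization: `t` is one variable reused across
                     iterations, a false loop-carried dependence that blocks
                     running the iterations in parallel. */
                  void scale_shared(const int *a, int *b) {
                      int t;
                      for (int i = 0; i < N; i++) {
                          t = a[i] * 2;      /* every iteration writes the same `t` */
                          b[i] = t + 1;
                      }
                  }

                  /* After privatization: each iteration gets its own `t`,
                     so iterations are independent and parallelizable. */
                  void scale_private(const int *a, int *b) {
                      for (int i = 0; i < N; i++) {
                          int t = a[i] * 2;  /* `t` is private to this iteration */
                          b[i] = t + 1;
                      }
                  }

                  int main(void) {
                      int a[N] = {0, 1, 2, 3, 4, 5, 6, 7};
                      int b1[N], b2[N];
                      scale_shared(a, b1);
                      scale_private(a, b2);
                      for (int i = 0; i < N; i++)
                          assert(b1[i] == b2[i]);  /* same results, different parallelizability */
                      return 0;
                  }
                  ```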

      • By dwaite 2026-02-083:52

        It is legit - with some pretty severe caveats. I am hard-pressed to come up with an example that has more formal specification, more published source implementations, and more public unit test coverage than a C compiler.

        It is not feasible that someone will use AI to tackle genuinely new software and provide a tenth of the guardrails Anthropic had for this project. They were able to keep the million monkeys at their million typewriters on an extremely short leash, and to have them do the vast majority of iteration without human intervention.

      • By byzantinegene 2026-02-068:451 reply

        It costs $20,000 to reinvent a wheel it probably trained on. If that's your definition of legit, sure.

        • By organicUser 2026-02-0610:45

          Well, if today it is a matter of cost, tomorrow it won't be anymore. 4 GB of RAM in the '80s would have cost tens of millions of dollars; now even your car runs 4 GB of memory just for the infotainment system, and dozens of GB for the most complex assistants. So I would see this achievement more as a warning: the final result is not what's concerning, it is the premonition behind it.

      • By wqaatwt 2026-02-088:03

        The full source of several compilers being in its training set is somewhat helpful, though. It's not exactly a novel problem, and the optimizations and edge cases it is seemingly struggling with are the overwhelming majority of the work anyway.

        Do we know it didn't just shuffle gcc's source code around a bit?

      • By miohtama 2026-02-067:03

        GCC had 40 years headstart

      • By qarl 2026-02-0621:30

        [flagged]

    • By ip26 2026-02-067:241 reply

      I’m excited and waiting for the team that shows that, with $20k in credits, they can substantially speed up generated code by improving Clang!

      • By byzantinegene 2026-02-069:20

        i'm sorry but that will take another $20 billion in AI capex to train our latest SOTA model so that it will cost $20k to improve the code.

    • By 9rx 2026-02-067:17

      > I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel.

      How much of that time was spent writing the tests that they found to use in this experiment? You (or someone like you) were a major contributor to this. All Opus had to do here was keep brute forcing a solution until the tests passed.

      It is amazing that it is possible at all, but it remains impossible without a heavy human hand. One could easily still spend a good part of their career reproducing this if they first had to rewrite all of the tests from scratch.

    • By beambot 2026-02-0522:159 reply

      This is getting close to a Ken Thompson "Trusting Trust" era -- AI could soon embed itself into the compilers themselves.

      • By bopbopbop7 2026-02-0522:192 reply

        A pay to use non-deterministic compiler. Sounds amazing, you should start.

        • By Aurornis 2026-02-0522:361 reply

          Application-specific AI models can be much smaller and faster than the general purpose, do-everything LLM models. This allows them to run locally.

          They can also be made to be deterministic. Some extra care is required to avoid computation paths that lead to numerical differences on different machines, but this can be accomplished reliably with small models that use integer math and use kernels that follow a specific order of operations. You get a lot more freedom to do these things on the small, application-specific models than you do when you're trying to run a big LLM across different GPU implementations in floating point.
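          (To make the floating-point point concrete, here is a toy C sketch of my own, not anything from an actual inference kernel: float addition is order-sensitive, so changing summation order across machines changes results, while integer arithmetic is exact under reordering.)

          ```c
          #include <assert.h>
          #include <stdint.h>

          int main(void) {
              /* Float addition is not associative: with a 24-bit mantissa,
                 adding 1.0f to 1e8f is absorbed (the ulp at 1e8 is 8). */
              float big = 1e8f, one = 1.0f;
              float left  = (big + one) - big;  /* 1.0f absorbed: 0.0f */
              float right = (big - big) + one;  /* exactly 1.0f */
              assert(left != right);            /* order changed the answer */

              /* Integer arithmetic is exact, so any association gives the
                 same result; a fixed-order integer kernel is deterministic. */
              int32_t ibig = 100000000, ione = 1;
              assert((ibig + ione) - ibig == (ibig - ibig) + ione);
              return 0;
          }
          ```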

          • By soraminazuki 2026-02-063:08

            > They can also be made to be deterministic.

            Yeah, in the same way how pseudo-random number generators are "deterministic." They generate the exact same sequence of numbers every time given the seeds are the same!

            But that's not the "determinism" people are referring to when they say LLMs aren't deterministic.

        • By ndesaulniers 2026-02-0522:262 reply

          Some people care more about compile times than the performance of generated code. Perhaps even the correctness of generated code. Perhaps more so than determinism of the generated code. Different people in different contexts can have different priorities. Trying to make everyone happy can sometimes lead to making no one happy. Thus dichotomies like `-O2` vs `-Os`.

          EDIT (since HN is preventing me from responding):

          > Some people care more about compiler speed than the correctness?

          Yeah, I think plenty of people writing code in languages that have concepts like Undefined Behavior technically don't care as much about correctness as they may claim, as it's pretty hard to write large volumes of code without indirectly relying on UB somewhere. What is correct in such cases was left up to the interpretation of the implementer by ISO WG14.
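          (A tiny illustrative sketch of what "indirectly relying on UB" can look like, mine rather than anything from the kernel: signed overflow is undefined in C, so an optimizer may fold the naive overflow check to always-true, while the well-defined version compares against INT_MAX instead.)

          ```c
          #include <assert.h>
          #include <limits.h>

          /* UB when x == INT_MAX: signed overflow is undefined, so a
             compiler may legally assume x + 1 > x always holds and
             compile this whole function to `return 1;`. */
          int naive_has_room(int x) {
              return x + 1 > x;
          }

          /* Well-defined: never performs the overflowing addition. */
          int safe_has_room(int x) {
              return x < INT_MAX;
          }

          int main(void) {
              assert(safe_has_room(0) == 1);
              assert(safe_has_room(INT_MAX) == 0);
              /* naive_has_room(INT_MAX) invokes UB; its observed result
                 can differ between -O0 and -O2 on the same compiler. */
              return 0;
          }
          ```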

          • By bopbopbop7 2026-02-0522:333 reply

            Some people care more about compiler speed than the correctness? I would love to meet these imaginary people that are fine with a compiler that is straight up broken. Emitting working code is the baseline, not some preference slider.

            • By ndesaulniers 2026-02-0618:58

              > I would love to meet these imaginary people that are fine with a compiler that is straight up broken.

              That's not what I said; you're attacking a strawman.

              My point was more so that some people prefer the madness that is -funsafe-math-optimizations, or happen to rely on UB (intentionally or otherwise). What even is "correct" in the presence of UB? What is correct in such case was left up to interpretation of the implementer by ISO WG14.

            • By gerdesj 2026-02-062:07

              You might have not run Gentoo. Most Gentooers will begrudgingly but eventually admit to cooking their own gonads when updating a laptop.

              Anyway, please define: "correctness".

            • By fragmede 2026-02-0522:572 reply

              Let's pretend, for just a second, that the people who do, having been able to learn how to program, are not absolute fucking morons. Straight up broken is obviously not useful, so maybe the conclusions you've jumped to could use some reexamination.

          • By chasd00 2026-02-0522:363 reply

            A compiler introducing bugs into the code it compiles is a nightmare thankfully few have faced. The only thing worse would be a CPU bug like the legendary Pentium FDIV bug. Imagine you compile something like Postgres only to have it crash in some unpredictable way. How long do you stare at the Postgres source before suspecting the compiler? What if this compiler were used to compile code running all over cloud stacks? Bugs in compilers are very bad news; they have to be correct.

            • By addaon 2026-02-060:46

              > a compiler introducing bugs into code it compiles is a nightmare thankfully few have faced

              Is this true? It’s not an everyday thing, but when using less common flags, or code structures, or targets… every few years I run into a codegen issue. It’s hard to imagine going through a career without a handful…

            • By ndesaulniers 2026-02-0523:17

              Yeah, my current boss spent time weeding out such hardware bugs: https://arxiv.org/abs/2110.11519 (EDIT: maybe https://x.com/Tesla_AI/status/1930686196201714027 is a more relevant citation)

              They found a bimodal distribution in failures over the lifetime of chips. Infant mortality was well understood. Silicon aging over time was much less well understood, and I still find it surprising.

            • By Anon1096 2026-02-069:37

              It's not that uncommon if you work in massive lowish level systems. Clang/LLVM being relatively bug free is the result of many corporate big tech low level compiler swes working with the application swes to debug why XYZ isn't working properly and then writing the appropriate fix. But compiler bugs still come up every so often, I've seen it on multiple occasions.

      • By ndesaulniers 2026-02-0522:311 reply

        We're already starting to see people experimenting with applying AI towards register allocation and inlining heuristics. I think that many fields within a compiler are still ripe for experimentation.

        https://llvm.org/docs/MLGO.html

      • By int_19h 2026-02-068:34

        What I want to know is when we get AI decompilers.

        Intuitively it feels like it should be a straightforward training setup: there's lots of code out there, so compile it with various compilers, flags, etc., and then use those pairs of source + binary to train the model.

      • By jojobas 2026-02-060:24

        Sorry, clang 26.0 requires an Nvidia B200 to run.

      • By psychoslave 2026-02-0611:15

        Hmm, well, they are already embedded in fonts: https://hackaday.com/2024/06/26/llama-ttf-is-ai-in-a-font/

      • By greenavocado 2026-02-060:41

        Then I'll be left wondering why my program requires 512 TB of RAM to open.

      • By andai 2026-02-0523:57

        The asymmetry will be between the frontier AI's ability to create exploits vs find them.

      • By dnautics 2026-02-063:07

        would be hard to miss gigantic kv cache matrix multiplications

    • By iberator 2026-02-066:286 reply

      Claude did not write it. You wrote it, with previous experience, with 20,000 commands telling it exactly what to do.

      Real, usable AI would create it from a simple prompt: 'make a C99 compiler faster than GCC'.

      AI usage should be banned in general. It takes jobs faster than it creates new ones.

      • By arcanemachiner 2026-02-067:441 reply

        That's actually pretty funny. They're patting it on the back for using, in all likelihood, some significant portions of code that they actually wrote, which was stolen from them without attribution so that it could be used as part of a very expensive parlour trick.

        • By whynotminot 2026-02-0613:18

          Did you do diffs to confirm the code was stolen, or are you just speculating?

      • By embedding-shape 2026-02-0610:27

        > AI usage should be banned in general. It takes jobs faster than creating new ones ..

        I don't have a strong opinion about that in either direction, but I'm curious: do you feel the same about everything, or is it just about this specific technology? For example, should the nail gun have been forbidden if it were invented today, as one person with a nail gun can probably replace 3-4 people with normal "manual" hammers?

        Do you feel the same about programmers who are automating others out of work without the use of AI?

      • By wiseowise 2026-02-069:10

        > It takes jobs faster than creating new ones ..

        You think compiler engineer from Google gives a single shit about this?

        They’ll automate millions out of career existence for their amusement while cashing out stock money and retiring early comfortably.

      • By benterix 2026-02-0612:20

        > It takes jobs faster than creating new ones ..

        I have no problem with tech making some jobs obsolete; that's normal. The problem is, the jobs being done with the current generation of LLMs are, at least for now, mostly of inferior quality.

        The tools themselves are quite useful as helpers in several domains if used wisely though.

      • By 7thpower 2026-02-0611:361 reply

        Businesses do not exist to create jobs; jobs are a byproduct.

        • By jaccola 2026-02-0611:521 reply

          Even that is underselling it; jobs are a necessary evil that should be minimised. If we can have more stuff with fewer people needing to spend their lives providing it, why would we NOT want that?

          • By direwolf20 2026-02-0611:572 reply

            Because we've built a system where if you don't have a job, you die.

            • By jaccola 2026-02-0612:112 reply

              This is already hyperbolic; in most countries where software engineers or similar knowledge workers are widely employed there are welfare programmes.

              To add to that, if there is such mass unemployment in this scenario it will be because fewer people are needed to produce and therefore everything will become cheaper... This is the best kind of unemployment.

              So at best: none of us have to work again and will get everything we need for free. At worst, certain professions will need a career switch which I appreciate is not ideal for those people but is a significantly weaker argument for why we should hold back new technology.

              • By direwolf20 2026-02-080:39

                Most of those welfare programs aren't very good, and most of that is on purpose, to make people get jobs at whatever cost.

              • By jelder 2026-02-0613:091 reply

                If you were to rank all of the C compilers in the world and then rank all of the welfare systems in the world, this vibe-coded mess would be at approximately the same rank as the American welfare system. Especially if you extrapolate this narcissistic, hateful kleptocracy out a few more years.

            • By aurareturn 2026-02-0612:121 reply

              Did we build it or did nature?

      • By unglaublich 2026-02-0612:531 reply

        Jobs are a means, not a goal.

        • By sc68cal 2026-02-0616:10

          Jobs are the only way that you survive in this society (food, shelter). Look how we treat unhoused people without jobs. AI is taking jobs away and that is putting people's survival at risk.

    • By MaskRay 2026-02-061:521 reply

      I want to verify the claim that it builds the Linux kernel. It quickly runs into errors, but yeah, still pretty cool!

      make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc -j30 defconfig all

      ```
      /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:44:184: error: expected ';' after expression before 'pto_tmp__'
        do { u32 pto_val__ = ((u32)(((unsigned long) ~0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (~0x80000000); (void)pto_tmp__; } asm ("and" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
        fix-it hint: insert ';'
      /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:49:183: error: expected ';' after expression before 'pto_tmp__'
        do { u32 pto_val__ = ((u32)(((unsigned long) 0x80000000) & 0xffffffff)); if (0) { __typeof_unqual__((__preempt_count)) pto_tmp__; pto_tmp__ = (0x80000000); (void)pto_tmp__; } asm ("or" "l " "%[val], " "%" "[var]" : [var] "+m" (((__preempt_count))) : [val] "ri" (pto_val__)); } while (0);
        fix-it hint: insert ';'
      /home/ray/Dev/linux/arch/x86/include/asm/preempt.h:61:212: error: expected ';' after expression before 'pao_tmp__'
      ```

      • By silver_sun 2026-02-063:591 reply

        They said it builds Linux 6.9, maybe you are trying to compile a newer version there?

        • By MaskRay 2026-02-064:442 reply

          git switch v6.9

          The riscv build succeeded. For the x86-64 build I ran into

              % make O=/tmp/linux/x86 ARCH=x86_64 CC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 HOSTCC=/tmp/p/claudes-c-compiler/target/release/ccc-x86 LDFLAGS=-fuse-ld=bfd LD=ld.bfd -j30 vmlinux -k
              make[1]: Entering directory '/tmp/linux/x86'
              ...
                CC      arch/x86/platform/intel/iosf_mbi.o
              ccc: error: lgdtl requires memory operand
                AR      arch/x86/platform/intel-mid/built-in.a
              make[6]: *** [/home/ray/Dev/linux/scripts/Makefile.build:362: arch/x86/realmode/rm/wakeup_asm.o] Error 1
              ld.bfd: arch/x86/entry/vdso/vdso32/sigreturn.o: warning: relocation in read-only section `.eh_frame'
              ld.bfd: error in arch/x86/entry/vdso/vdso32/sigreturn.o(.eh_frame); no .eh_frame_hdr table will be created
              ld.bfd: warning: creating DT_TEXTREL in a shared object
              ccc: error: unsupported pushw operand
          
          There are many other errors.

          tinyconfig and allnoconfig have fewer errors.

              RELOCS  arch/x86/realmode/rm/realmode.relocs
              Invalid absolute R_386_32 relocation: real_mode_seg
          
          Still very impressive.

          • By pertymcpert 2026-02-066:32

            They said that it wasn't able to support 16 bit real mode. Needs to call gcc for that.

          • By 63stack 2026-02-0613:52

            I feel like I could have done this in a much shorter time, for far fewer tokens, but still very impressive!

    • By the_jends 2026-02-061:271 reply

      Being just a grunt engineer in a product firm I can't imagine being able to spend multiple years on one project. If it's something you're passionate about, that sounds like a dream!

      • By ndesaulniers 2026-02-0618:55

        This work originally wasn't my 100% project, it was my 20% project (or as I prefer to call it, 120% project).

        I had to move teams twice before a third team was able to say: this work is valuable to us, please come work for us and focus just on that.

        I had to organize multiple internal teams, then build an external community of contributors to collaborate on this shared common goal.

        Having carte blanche to contribute to open source projects made this feasible at all; I can see that being a non-starter at many employers, sadly. Having low friction to change teams also helped a lot.

    • By HarHarVeryFunny 2026-02-06 14:55

      > I spent a good part of my career (nearly a decade) at Google working on getting Clang to build the linux kernel

      Did this come down to making Clang 100% gcc compatible (extensions, UB, bugs and all), or were there any issues that might be considered specific to the Linux kernel?

      Did you end up building a gcc compatibility test suite as a part of this? Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?

      • By ndesaulniers 2026-02-06 16:44

        > extensions

        Some were necessary (asm goto), some were not (nested functions, flexible array members not at the end of structs).

        > UB, bugs and all

        Luckily, the kernel didn't intentionally rely on GCC specifics this way. Where it did unintentionally, we fixed the kernel sources properly with detailed commit messages explaining why.

        > or were there any issues that might be considered specific to the Linux kernel?

        Yes, https://github.com/ClangBuiltLinux/linux/issues is our issue tracker. We use tags extensively to mark if we triage the issue to be kernel-side vs toolchain-side.

        > Did you end up building a gcc compatibility test suite as a part of this?

        No, but some tricky cases LLVM got wrong were distilled from kernel sources using one of:

        - creduce
        - cvise (my favorite)
        - bugpoint
        - llvm-reduce

        and then added to LLVM's existing test suite. Many such tests were also simply manually written.

        > Did the gcc project themselves have a regression/test suite that you were able to use as a starting point?

        GCC and binutils have their own test suites. Folks in the LLVM community have worked on being able to test clang against GCC's test suite. I personally have never run GCC's test suite or looked at its sources.

    • By TZubiri 2026-02-06 5:25

      >Is the generated code correct? The jury is still out on that one for production compilers. And then you have performance of generated code.

      It's worth noting that this was developed by compiling Linux and running tests, so at least that is part of the training set and not the testing set.

      But at least for Linux, I'm guessing the tests are robust enough that it will work correctly. That said, if any bugs pop up, they will show weak points in the Linux tests.

    • By VladVladikoff 2026-02-06 14:23

      > $20,000 of tokens.

      > less efficient than existing compilers

      what is the ecological cost of producing this piece of software that nobody will ever use?

      • By ryanjshaw 2026-02-06 14:26

        If you evaluate the cost/benefit in isolation? It’s net negative.

        If you see this as part of a bigger picture to improve human industrial efficiency and bring us one step closer to the singularity? Most likely net positive.

      • By thefounder 2026-02-06 14:27

        With that way of thinking you would just move into a cave.

    • By grey-area 2026-02-06 7:45

      Isn't the AI basing what it does heavily on the publicly available source code for C compilers, though? Without that work it would not be able to generate this, would it? Or, in your opinion, is it sufficiently different from the work people like you did to be classed as a unique creation?

      I'm curious about your take on the references the GAI might have used to create such a project, and whether this matters.

    • By zaphirplane 2026-02-05 22:02

      Out of interest, what were the challenges? Was some of it the use of gcc extensions, which needed equivalents and porting over to those equivalents?

      • By ndesaulniers 2026-02-05 22:17

        `asm goto` was the big one. The x86_64 maintainers broke the clang builds very intentionally just after we had gotten x86_64 building (with necessary patches upstreamed) by requiring compiler support for that GNU C extension. This was right around the time of meltdown+spectre, and the x86_64 maintainers didn't want to support fallbacks for older versions of GCC (and ToT Clang at the time) that lacked `asm goto` support for the initial fixes shipped under duress (embargo). `asm goto` requires plumbing throughout the compiler, and I've learned more about register allocation than I particularly care...

        Fixing some UB in the kernel sources, lots of plumbing to the build system (particularly making it more hermetic).

        Getting the rest of the LLVM binutils substitutes to work in place of GNU binutils was also challenging. Rewriting a fair amount of 32b ARM assembler to be "unified syntax" in the kernel. Linker bugs are hard to debug. Kernel boot failures are hard to debug (thank god for QEMU+gdb protocol). Lots of people worked on many different parts here, not just me.

        Evangelism and convincing upstream kernel developers why clang support was worth anyone's while.

        https://github.com/ClangBuiltLinux/linux/issues for a good historical perspective. https://github.com/ClangBuiltLinux/linux/wiki/Talks,-Present... for talks on the subject. Keynoting LLVM conf was a personal highlight (https://www.youtube.com/watch?v=6l4DtR5exwo).

    • By m463 2026-02-07 1:36

      > getting Clang to build the linux kernel.

      wonder if clang source is part of its model :)

    • By ur-whale 2026-02-06 11:22

      > This LLM did it

      You do realize the LLM had access (via its training set) and "reused" (not as is, of course) your own work, right?

    • By phillmv 2026-02-05 21:56

      i mean… your work also went into the training set, so it's not entirely surprising that it spat a version back out!

      • By underdeserver 2026-02-05 21:59

        Anthropic's version is in Rust though, so at least a little different.

        • By ndesaulniers 2026-02-05 22:22

          There are parts of the LLVM architecture that are long in the tooth, IMO (as is the language it's implemented in).

          I had hoped one day to re-implement parts of LLVM itself in Rust; in particular, I've been curious about concurrently compiling C (and parsing C in parallel, or lazily), approaches that haven't been explored in LLVM and that I think might be safer to do in Rust. I don't know enough about grammars to know if it's technically impossible, but a healthy dose of ignorance can sometimes lead to breakthroughs.

          LLVM is pretty well designed for test. I was able to implement a lexer for C in Rust that could lex the Linux kernel, and use clang to cross-check my implementation (I would compare my interpretation of the token stream against clang's). Just having a standard module system makes reusable pieces seem like perhaps a better way to compose a toolchain, but maybe folks with more experience with rustc have scars and disagree?

          • By jcranmer 2026-02-06 1:30

            > I had hoped one day to re-implement parts of LLVM itself in Rust

            Heh, earlier today I was just thinking about how crazy a proposal it would actually be to have a Rust dependency (specifically, the egg crate, since one of the things I'm banging my head against right now might be better solved with egraphs).

        • By yoz-y 2026-02-05 23:39

          One thing LLMs are really good at is translation. I haven’t tried porting projects from one language to another, but it wouldn’t surprise me if they were particularly good at that too.

          • By andrekandre 2026-02-06 11:33

            as someone who has done that in a professional setting, it really does work well, at least for straightforward things like data classes/initializers and average biz logic with if else statements etc... things like code annotations and other more opaque stuff like that can get more unreliable though because there are less 1-1 representations... it would be interesting to train an llm for each encountered new pattern and slowly build up a reliable conversion workflow

        • By rwmj 2026-02-05 22:06

          It's not really important in latent space / conceptually.

          • By D-Machine 2026-02-06 2:01

            This is the proper deep critique / skepticism (or sophisticated goal-post moving, if you prefer) here. Yes, obviously this isn't just reproducing C compiler code in the training set, since this is Rust, but it is much less clear how much of the generated Rust code can (or can not) be accurately seen as being translated from C code in the training set.

      • By GaggiX 2026-02-05 21:59

        Clang is not written in Rust tho

    • By jbjbjbjb 2026-02-05 23:20

      It’s cool, but there’s a good chance it’s just copying someone else’s homework, albeit in an elaborate, roundabout way.

      • By nomel 2026-02-05 23:43

        I would claim that LLMs desperately need proprietary code in their training before we see any big gains in quality.

        There's some incredible source-available code out there. Statistically, though, I think there's a LOT more not-so-great source-available code out there, because the majority of the output of seasoned/high-skill developers is proprietary.

        To me, a surprising portion of Claude 4.5 output definitely looks like student homework answers, because I think that's closer to the mean of the code population.

        • By dcre 2026-02-06 3:30

          This is dead wrong: essentially the entirety of the huge gains in coding performance in the past year has come from RL, not from new sources of training data.

          I echo the other commenters that proprietary code isn’t any better; plus, it doesn’t matter, because when you use an LLM to work on proprietary code, it has the code right there.

          • By elevation 2026-02-06 3:58

            > it doesn’t matter because when you use LLMs to work on proprietary code, it has the code right there

            The quality of the existing code base makes a huge difference. On a recent greenfield effort, Claude emitted an MVP that matched the design semantics, but the code was not up to standards. For example, it repeatedly loaded a large file into memory in each area where it was needed (rather than loading it once and passing a reference).

            However, after an early refactor, the subsequently generated code vastly improved. It honors the testing and performance paradigms, and it's so clean there's nothing for the linter to do.

          • By thesz 2026-02-06 6:42

              > the huge gains in coding performance in the past year have come from RL, not from new sources of training data.
            
            This one was on HN recently: https://spectrum.ieee.org/ai-coding-degrades

            The author attributes the past year's degradation of code generation by LLMs to excessive use of a new source of training data, namely users' code-generation conversations.

            • By dcre 2026-02-06 13:46

              Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to claim so on the basis of a single problem, which the author describes as technically impossible. It is a very contrived, under-specified prompt.

              And their “explanation” blaming the training data is just a guess on their part, one that I suspect is wrong. There is no argument given that that’s the actual cause of the observed phenomenon. It’s a just-so story: something that sounds like it could explain it but there’s no evidence it actually does.

              My evidence that RL is more relevant is that that’s what every single researcher and frontier-lab employee I’ve heard speak about LLMs in the past year has said. I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.

              • By thesz 2026-02-06 16:15

                  > Yeah, this is a bullshit article. There is no such degradation, and it’s absurd to say so on the basis of a single problem which the author describes as technically impossible. It is a very contrived under-specified prompt.
                
                I see "No True Scotsman" argument above.

                  > My evidence is that RL is more relevant is that that’s what every single researcher and frontier lab employee I’ve heard speak about LLMs in the past year has said.
                
                Reinforcement learning reinforces what is already in the LM: it narrows the search path to possible correct answers, while the wider search path of non-RL-tuned base models results in more correct answers [1].

                [1] https://openreview.net/forum?id=4OsgYD7em5

                  > I have never once heard any of them mention new sources of pretraining data, except maybe synthetic data they generate and verify themselves, which contradicts the author’s story because it’s not shitty code grabbed off the internet.
                
                The sources of training data already were the reasons for allegations, even leading to lawsuits. So I would suspect that no engineer from any LLM company would disclose anything on their sources of training data besides innocently sounding "synthetic data verified by ourselves."

                From the days I worked on blockchains, I am very skeptical of any company riding any hype. They face enormous competition and will buy, borrow, or steal their way to avoid going down even a little. So, until Anthropic opens up the way they train their model so that we can reproduce their results, I will suspect they leaked the test set into it and used users' code-generation conversations as a new source of training data.

                • By dcre 2026-02-06 21:04

                  That is not what No True Scotsman is. I’m pointing out a bad argument with weak evidence.

                  • By thesz 2026-02-06 21:34

                      >>> It is a very contrived under-specified prompt.
                    
                    No True Prompt can be so contrived and underspecified.

                    The article about degradation is a case study (a single prompt), the weakest kind of study in the hierarchy of evidence. Case studies are the basis for further, more rigorous studies. And the author took the time to test his assumptions and presented fairly clear evidence that such degradation might be present and should be investigated.

                    • By dcre 2026-02-07 3:18

                      We have investigated. Millions of people are investigating all the time and finding that coding capability has improved dramatically over that time. A variety of very different benchmarks say the same. This one random guy’s stupid prompt says otherwise. Come on.

                      • By thesz 2026-02-07 7:54

                        As far as I remember, the article stated that he found the same problematic behavior across many prompts, issued by him and his colleagues. The "stupid prompt" in the article is for demonstration purposes.

                        • By dcre 2026-02-09 14:15

                          But that’s not an argument, that’s just an assertion, and it’s directly contradicted by all the more rigorous attempts to do the same thing through benchmarks (public and private).

          • By nextos 2026-02-06 4:32

            Progress with RL is very interesting, but it's still too inefficient. Current models do OK on simple, boring, linear code, but they output complete nonsense when presented with compact but mildly complex code, e.g. a NumPyro model with some nesting and einsums.

            For this reason, to be truly useful, model outputs need to be verifiable. Formal verification with languages like Dafny, F*, or Isabelle might offer some solutions [1]. Otherwise, a gigantic software artifact such as a compiler is going to have critical correctness bugs with far-reaching consequences if deployed in production.

            Right now, I'm not comfortable treating an LLM as anything other than a very useful information-retrieval system with excellent semantic capabilities.

            [1] https://risemsr.github.io/blog/2026-02-04-nik-agentic-pop

            • By dcre 2026-02-06 13:50

              Human-written compilers have bugs too! It takes decades of use to iron them out, and we’re introducing new ones all the time.

        • By bearjaws 2026-02-06 1:59

          I will say many closed-source repos are probably just as poor as open-source ones.

          Even worse in many cases, because they are so over-engineered that nobody understands how they work.

          • By hirvi74 2026-02-06 3:56

            I firmly agree with your first sentence. Just think of the various modders who have created patches and performance-enhancing mods for games with budgets of tens to hundreds of millions of dollars.

            But to give other devs and myself some grace, I do believe plenty of bad code can likely be explained by bad deadlines. After all, what's the Russian idiom? "There is nothing more permanent than the temporary."

        • By bhadass 2026-02-06 0:16

          yeah, but isn't the whole point of claude code to get people to provide preference/telemetry data to anthropic (unless you opt out)? same w/ other providers.

          i'm guessing most of the gains we've seen recently are from post-training rather than pretraining.

          • By nomel 2026-02-06 1:04

            Yes, but then you have the problem that a good portion of that is going to be AI-generated.

            But I naively assume most orgs would opt out. I know some orgs have a proxy in place that will prevent certain proprietary code from passing through!

            This makes me curious if, in the allow case, Anthropic is recording generated output, to maybe down-weight it if it's seen in the training data (or something similar)?

        • By typ 2026-02-06 1:50

          I'd bet, on average, the quality of proprietary code is worse than open-source code. There have been decades of accumulated slop generated by human agents with wildly varied skill levels, all vibe-coded by ruthless, incompetent corporate bosses.

          • By Manouchehri 2026-02-06 2:15

            There are only a few niche fields where closed-source code quality is often better than open-source code.

            Exploits and HFT are the two examples I can think of. Both are usually closed source because of the financial incentives.

            • By ozim 2026-02-06 5:06

              Here we can start debating what "better code" means.

              I haven’t seen HFT code, but I have seen examples of exploit code, and most of it is amateur hour when it comes to building large systems.

              They are of course efficient at getting to the goal. But exploits are one-off code that is not meant to be maintained.

          • By kortilla 2026-02-06 4:15

            It doesn’t matter what the average is, though. If 1% of software is open source, there is significantly more closed-source software out there, and given normal skill distributions, that means there is at least as much high-quality closed-source software out there, if not significantly more. The trick is skipping the 95% of crap.

          • By hirvi74 2026-02-06 4:03

            In my time, I have potentially written code that some legal jurisdictions might classify as a "crime against humanity" due to its quality.

          • By Take8435 2026-02-06 2:30

            Not to mention, a team member is (surprise!) fired or let go, and no knowledge transfer exists. Womp, womp. Codebase just gets worse as the organization or team flails.

            Seen this way too often.

            • By icedchai 2026-02-06 17:22

              Developers are often treated as cogs. Anyone should be able to step in and pick things up instantly. It’s just typing, right? /s

        • By andai 2026-02-05 23:56

          Let's start with the source code for the Flash IDE :)

      • By wvenable 2026-02-06 0:28

        This is cool and actually demonstrates real utility. Using AI to take something that already exists and recreate it for a different library/framework/platform is cool. I'm sure there's a lot of training data for just this case.

        But I wonder how it would fare if given a language specification for a non-existent, non-trivial language and asked to build a compiler for that instead?

        • By nmstoker 2026-02-06 1:18

          If you come up with a realistic language spec and wait maybe six months, by then it'll probably be cheap enough that you could test the scenario yourself!

      • By luke5441 2026-02-05 23:37

        It looks like a much more progressed/complete version of https://github.com/kidoz/smdc-toolchain/tree/master/crates/s... . But that one is only a month old, so I'm a bit confused there. Maybe that was also created via LLM?

      • By nlawalker 2026-02-06 0:59

        I see that as the point that all this is proving: most people, most of the time, are essentially reinventing the wheel at some scope and scale or another, so we’d all benefit from being able to find and copy each other’s homework more efficiently.

      • By computerex 2026-02-06 1:50

        And the goal post shifts.

      • By kreelman 2026-02-06 0:53

        A small thing, but it won't compile the RISC-V version of hello.c if the source isn't installed on the machine it's running on.

        It is standing on the shoulders of giants (all of the compilers of the past, built into its training data... and the recent learnings about getting these agents to break up tasks) to get itself going. Still fairly impressive.

        On a side quest, I wonder where Anthropic is getting their power from. The whole energy debacle in the US at the moment probably means it made some CO2 in the process. That would be hard to avoid.

    • By tdemin 2026-02-06 11:23

      [dead]

    • By eek2121 2026-02-05 23:58

      Also: a large number of folks seem to think Claude Code is losing a ton of money. I have no idea where the final numbers land; however, if the $20,000 figure is accurate, then based on some of the estimates I've seen, they could've hired 8 senior-level developers at a quarter million a year for the same amount of money spent internally.

      Granted, marketing sucks up far too much money for any startup, and again, we don't know the actual numbers in play, however, this is something to keep in mind. (The very same marketing that likely also wrote the blog post, FWIW).

      • By willsmith72 2026-02-06 0:11

        this doesn't add up. The $20k is in API costs. People talk about CC losing money because it's way more efficient than the API, i.e. the same work with efficient use of CC might have cost ~$5k.

        but regardless, hiring is difficult and high-end talent is limited. If the costs were anywhere close to equivalent, the agents are a no-brainer

        • By majormajor 2026-02-06 1:27

          CC hits their APIs, and internally I'm sure Anthropic tracks those calls, which is what they seem to be referencing here. What exactly did Anthropic do in this test to have "inefficient use of CC" vs your proposed "efficient use of CC"?

          Or do you mean that if an external user replicated this experience they might get billed less than $20k due to CC being sold at lower rates than per-API-call metered billing?

        • By NitpickLawyer 2026-02-06 11:27

          > hiring is difficult and high-end talent is limited.

          Not only that, but firing talent is also a pain. You can't "hire" 10 devs for 2 weeks and fire them afterwards. At least, you can't keep doing that; people talk, and no one would apply.

      • By GorbachevyChase 2026-02-06 0:45

        Even if the dollar cost for product created was the same, the flexibility of being able to spin a team up and down with an API call is a major advantage. That AI can write working code at all is still amazing to me.

      • By bloaf 2026-02-06 1:38

        This thing was done in 2 weeks. In the orgs I've worked in, you'd be lucky to get HR approval to create a job posting within 2 weeks.

  • By NitpickLawyer 2026-02-05 19:28

    This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:

    > This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, Postgres, Redis

    > I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.

    > Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.

    And the very open points about limitations (and hacks, as cc loves hacks):

    > It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase

    > It does not have its own assembler and linker;

    > Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

    Ending with a very down to earth take:

    > The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

    All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case. As the author says, "The resulting compiler has nearly reached the limits of Opus’s abilities". Yeah, that's fair, but still highly impressive IMO.

    • By geraneum 2026-02-05 19:39

      > This was a clean-room implementation

      This is really pushing it, considering it’s trained on… the internet, with all available C compilers. The work is already impressive enough; there's no need for such misleading statements.

      • By TacticalCoder 2026-02-05 23:16

        I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.

        It's anything but a clean-room design. "Clean-room design" is a very well-defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."

        https://en.wikipedia.org/wiki/Clean-room_design

        The "without infringing any of the copyrights" contains "any".

        We know for a fact that models are extremely good at storing information, with the highest compression rates ever achieved. The fact that a model typically decompresses that information lossily doesn't mean it didn't use that information in the first place.

        Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there is information being stored.

        It's not a clean-room design, plain and simple.

      • By raincole 2026-02-05 21:38

        It's not a clean-room implementation, but not because it's trained on the internet.

        It's not a clean-room implementation because of this:

        > The fix was to use GCC as an online known-good compiler oracle to compare against

        • By Calavar 2026-02-05 22:38

          The classical definition of a clean room implementation is something that's made by looking at the output of a prior implementation but not at the source.

          I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.

          • By visarga 2026-02-06 5:43

            This is the reimplementation scenario for agentic coding. If you have a good spec and a battery of tests, you can delete the code and reimplement it. Code is no longer the product of eng work; it is more like bytecode now: you regenerate it, you don't read it. If you have to read it, then you are just walking a motorcycle.

            We have seen at least 3 of these projects: the JustHTML one, FastRender, and this one. All started from beefy tests and specs. They show that reimplementation without manual intervention kind of works.

            • By Calavar 2026-02-06 6:18

              I think that's overstating it.

              JustHTML is a success in large part because it's a problem that can be solved with 4 digit LOC. The whole codebase can sit in an LLM's context at once. Do LLMs scale beyond that?

              I would classify both FastRender and Opus C compiler as interesting failures. They are interesting because they got a non-negligible fraction of the way to feature complete. They are failures because they ended with no clear path for moving the needle forward to 80% feature complete, let alone 100%.

              From the original article:

              > The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.

              From the experiments we've seen so far, it seems that a large enough agentic codebase will inevitably collapse under its own weight.

            • By jayd16 2026-02-06 19:12

              > Code is no longer the product of eng work

              Never was.

            • By franktankbank 2026-02-06 14:55

              Great way to get constantly moving holes.

        • By array_key_first 2026-02-05 22:54

          If you read the entire GCC source code and then create a compatible compiler, it's not clean room. Which Opus basically did, since, I'm assuming, its training set contained the entire source of GCC. So even if they weren't actively referencing GCC, I think that counts.

          • By nmilo 2026-02-06 0:00

            What if you just read the entire GCC source code in school 15 years ago? Is that not clean room?

            • By hex4def6 2026-02-06 0:27

              No.

              I'd argue that no one would really care given it's GCC.

              But if you worked for GiantSodaCo on their secret recipe under NDA, then create a new soda company 15 years later that tastes suspiciously similar to GiantSodaCo, you'd probably have legal issues. It would be hard to argue that you weren't using proprietary knowledge in that case.

              • By Zambyte 2026-02-06 15:28

                Given that GCC is not public domain, the copyright holders will probably care.

          • By pertymcpert 2026-02-06 6:44

            I read the source. If anything it takes concepts from LLVM more than GCC, but the similarities aren't very deep.

      • By cryptonector 2026-02-06 2:19

        Hmm... If Claude iterated a lot then chances are very good that the end result bears little resemblance to open source C compilers. One could check how much resemblance the result actually bears to open source compilers, and I rather suspect that if anyone does check they'll find it doesn't resemble any open source C compiler.

      • By GorbachevyChase 2026-02-06 1:23

        https://arxiv.org/abs/2505.03335

        Check out the paper above on Absolute Zero. Language models don't just repeat code they've seen. They can learn to code given the right training environment.

      • By iberator 2026-02-06 6:30

        this. last sane person in HN

      • By inchargeoncall 2026-02-05 21:20 (1 reply)

        [flagged]

        • By teaearlgraycold 2026-02-05 21:25

          With just a few thousand dollars of API credits you too can inefficiently download a lossy copy of a C compiler!

      • By antirez 2026-02-05 20:05 (3 replies)

        The LLM does not contain a verbatim copy of everything it saw during the pre-training stage. It may remember certain over-represented parts; otherwise it has knowledge about a lot of things, but that knowledge, while spanning a huge number of topics, is similar to the way you remember things you know very well. And, indeed, if you give it access to the internet or the source code of GCC and other compilers, it will implement such a project N times faster.

        • By halxc 2026-02-05 20:14 (4 replies)

          We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.

          It is a research topic for heaven's sake:

          https://arxiv.org/abs/2504.16046

          • By RyanCavanaugh 2026-02-05 20:18 (5 replies)

            The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte. While they are certainly capable of doing some verbatim recitations, this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?) and stored inside the model.

            • By philipportner 2026-02-05 21:46 (1 reply)

              This seems related; it may not be a codebase, but they were able to extract "near-verbatim" books out of Claude Sonnet.

              https://arxiv.org/pdf/2601.02671

              > For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

              • By Aurornis 2026-02-05 22:45 (2 replies)

                Their technique really stretched the definition of extracting text from the LLM.

                They used a lot of different techniques to prompt with actual text from the book, then asked the LLM to continue the sentences. I only skimmed the paper but it looks like there was a lot of iteration and repetitive trials. If the LLM successfully guessed words that followed their seed, they counted that as "extraction". They had to put in a lot of the actual text to get any words back out, though. The LLM was following the style and clues in the text.

                You can't literally get an LLM to give you books verbatim. These techniques always involve a lot of prompting and continuation games.

                • By D-Machine 2026-02-06 2:13 (1 reply)

                  To make some vague claims explicit here, for interested readers:

                  > "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) [...]"

                  So, yes, it is not "literally verbatim" (~96% verbatim), and it does indeed take A LOT (hundreds or thousands of prompting attempts) to make this happen.

                  I leave it up to the reader to judge how much this weakens the more basic claims of the form "LLMs have nearly perfectly memorized some of their source / training materials".

                  I am imagining a grueling interrogation that "cracks" a witness, so he reveals perfect details of the crime scene that couldn't possibly have been known to anyone that wasn't there, and then a lawyer attempting the defense: "but look at how exhausting and unfair this interrogation was--of course such incredible detail was extracted from my innocent client!"
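                  To make the metric itself concrete: the paper's nv-recall counts long contiguous spans of ground-truth text recovered in the model's output. A rough illustrative sketch of a block-based recall of this kind (my own simplification; the fixed block size and exact substring matching are assumptions, not the paper's actual greedy longest-common-substring procedure):

```python
def nv_recall_sketch(ground_truth: str, generated: str, block_words: int = 50) -> float:
    """Fraction of fixed-size ground-truth word blocks that appear
    contiguously (verbatim) in the generated text. A simplified stand-in
    for the paper's block-based longest-common-substring metric."""
    words = ground_truth.split()
    blocks = [
        " ".join(words[i:i + block_words])
        for i in range(0, len(words) - block_words + 1, block_words)
    ]
    if not blocks:
        return 0.0
    hits = sum(1 for block in blocks if block in generated)
    return hits / len(blocks)
```

Under this kind of metric, scattered short phrase overlaps score near zero; only long verbatim runs count, which is why a high score is strong evidence of memorization.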

                    • By DiogenesKynikos 2026-02-06 6:38 (1 reply)

                    The one-shot performance of their recall attempts is much less impressive. The two best-performing models were only able to reproduce about 70% of a 1000-token string. That's still pretty good, but it's not as if they spit out the book verbatim.

                    In other words, if you give an LLM a short segment of a very well known book, it can guess a short continuation (several sentences) reasonably accurately, but it will usually contain errors.

                    • By D-Machine 2026-02-06 6:54

                      Right, and this should be contextualized with respect to code generation. It is not crazy to presume that LLMs have effectively nearly perfectly memorized certain training sources, but the ability to generate / extract outputs that are nearly identical to those training sources will of course necessarily be highly contingent on the prompting patterns and complexity.

                      So, dismissals of "it was just translating C compilers in the training set to Rust" need to be carefully quantified, but, also, need to be evaluated in the context of the prompts. As others in this post have noted, there are basically no details about the prompts.

                • By Calavar 2026-02-06 1:34 (2 replies)

                  Sure, maybe it's tricky to coerce an LLM into spitting out a near-verbatim copy of prior data, but that's orthogonal to whether or not the data to create a near-verbatim copy exists in the model weights.

                  • By D-Machine 2026-02-06 2:31

                    Especially since the recalls achieved in the paper are 96% (based on a block longest-common-substring approach), the effort of extraction is utterly irrelevant.

                  • By Paradigma11 2026-02-06 18:02

                    Like with those chimpanzees creating Shakespeare.

            • By silver_sun 2026-02-06 3:31

              > this isn't just a matter of teasing out the compressed C compiler written in Rust that's already on the internet (where?)

              A quick search brings up several C compilers written in Rust. I'm not claiming they are necessarily in Claude's training data, but they do exist.

              https://github.com/PhilippRados/wrecc (unfinished)

              https://github.com/ClementTsang/rustcc

              https://codeberg.org/notgull/dozer (unfinished)

              https://github.com/jyn514/saltwater

              I would also like to add that as language models improve (in the sense of decreasing loss on the training set), they in fact become better at compressing their training data ("the Internet"), so that a model that is "half a terabyte" could represent many times more concepts with the same amount of space. Only comparing the relative size of the internet vs a model may not make this clear.

            • By seba_dos1 2026-02-05 22:32

              > The internet is hundreds of billions of terabytes; a frontier model is maybe half a terabyte.

              The lesson here is that the Internet compresses pretty well.

            • By mft_ 2026-02-05 22:15

              (I'm not needlessly nitpicking, as I think it matters for this discussion)

              A frontier model (e.g. the latest Gemini or GPT) is likely several-to-many times larger than 500GB. Even DeepSeek-V3 was around 700GB.

              But your overall point still stands, regardless.

            • By uywykjdskn 2026-02-05 23:45

              You got a source on frontier models being maybe half a terabyte? That's not passing the sniff test.

          • By ben_w 2026-02-05 20:24 (4 replies)

            We saw partial copies of large or rare documents, and full copies of smaller widely-reproduced documents, not full copies of everything. A 1-trillion-parameter model, for example, is not a lossless copy of a ten-petabyte slice of plain text from the internet.

            The distinction may not have mattered for copyright laws if things had gone down differently, but the gap between "blurry JPEG of the internet" and "learned stuff" is more obviously important when it comes to e.g. "can it make a working compiler?"

            • By tza54j 2026-02-05 21:09 (1 reply)

              We are here in a clean room implementation thread, and verbatim copies of entire works are irrelevant to that topic.

              It is enough to have read even parts of a work for something to be considered a derivative.

              I would also argue that language models, which need gargantuan amounts of training material in order to work, can by definition only output derivative works.

              It does not help that certain people in this thread (not you) edit their comments to backpedal and make the followup comments look illogical, but that is in line with their sleazy post-LLM behavior.

              • By ben_w 2026-02-05 21:35

                > It is enough to have read even parts of a work for something to be considered a derivative.

                For IP rights, I'll buy that. Not as important when the question is capabilities.

                > I would also argue that language models who need gargantuan amounts of training material in order to work by definition can only output derivative works.

                For similar reasons, I'm not going to argue against anyone saying that all machine learning today, doesn't count as "intelligent":

                It is perfectly reasonable to define "intelligence" to be the inverse of how many examples are needed.

                ML partially makes up for being (by this definition) thick as an algal bloom, by being stupid so fast it actually can read the whole internet.

            • By philipportner 2026-02-05 21:49 (2 replies)

              Granted, these are some of the most widely spread texts, but just fyi:

              https://arxiv.org/pdf/2601.02671

              > For Claude 3.7 Sonnet, we were able to extract four whole books near-verbatim, including two books under copyright in the U.S.: Harry Potter and the Sorcerer’s Stone and 1984 (Section 4).

              • By D-Machine 2026-02-06 2:17

                Note "near-verbatim" here is:

                > "We quantify the proportion of the ground-truth book that appears in a production LLM’s generated text using a block-based, greedy approximation of longest common substring (nv-recall, Equation 7). This metric only counts sufficiently long, contiguous spans of near-verbatim text, for which we can conservatively claim extraction of training data (Section 3.3). We extract nearly all of Harry Potter and the Sorcerer’s Stone from jailbroken Claude 3.7 Sonnet (BoN N = 258, nv-recall = 95.8%). GPT-4.1 requires more jailbreaking attempts (N = 5179) and refuses to continue after reaching the end of the first chapter; the generated text has nv-recall = 4.0% with the full book. We extract substantial proportions of the book from Gemini 2.5 Pro and Grok 3 (76.8% and 70.3%, respectively), and notably do not need to jailbreak them to do so (N = 0)."

                if you want to quantify the "near" here.

              • By ben_w 2026-02-05 21:56

                Already aware of that work, that's why I phrased it the way I did :)

                Edit: actually, no, I take that back, that's just very similar to some other research I was familiar with.

            • By antirez 2026-02-05 20:44 (1 reply)

              Besides, the fact that an LLM may recall parts of certain documents, like I can recall the incipits of certain novels, does not mean that when you ask the LLM to do other kinds of work, work that is not recall, it will mix such things in verbatim. The LLM knows what it is doing in a variety of contexts, and uses that knowledge to produce stuff. The fact that, for many people, LLMs being able to do things that replace humans is bitter does not mean (and it is not true) that this happens mainly through memorization. What coding agents can do today has zero explanation in memorization of verbatim stuff. So it's not a matter of copyright. Certain folks are fighting the wrong battle.

              • By shakna 2026-02-05 22:12

                During a "clean room" implementation, the implementor is generally selected for not being familiar with the workings of what they're implementing, and banned from researching using it.

                Because it _has_ been enough, that if you can recall things, that your implementation ends up not being "clean room", and trashed by the lawyers who get involved.

                I mean... It's in the name.

                > The term implies that the design team works in an environment that is "clean" or demonstrably uncontaminated by any knowledge of the proprietary techniques used by the competitor.

                If it can recall... Then it is not a clean room implementation. Fin.

            • By boroboro4 2026-02-05 20:47

              While I mostly agree with you, it's worth noting that modern LLMs are trained on 10–30T tokens, which is quite comparable to their size (especially given how compressible the data is).

          • By Aurornis 2026-02-05 22:39

            Simple logic will demonstrate that you can't fit every document in the training set into the parameters of an LLM.

            Citing a random arXiv paper from 2025 doesn't mean "they" used this technique. It was someone's paper that they uploaded to arXiv, which anyone can do.

          • By soulofmischief 2026-02-05 21:13 (1 reply)

            The point is that it's a probabilistic knowledge manifold, not a database.

            • By PunchyHamster 2026-02-05 21:17 (1 reply)

              we all know that.

              • By soulofmischief 2026-02-05 23:19

                Unfortunately, that doesn't seem to be the case. The person I replied to might not understand this, either.

        • By majormajor 2026-02-06 1:33

          You couldn't reasonably claim you did a clean-room implementation of something you had read the source to even though you, too, would not have a verbatim copy of the entire source code in your memory (barring very rare people with exceptional memories).

          It's kinda the whole point - you haven't read it so there's no doubt about copying in a clean-room experiment.

          A "human style" clean-room copy here would have to be using a model trained on, say, all source code except GCC. Which would still probably work pretty well, IMO, since that's a pretty big universe still.

        • By PunchyHamster 2026-02-05 21:17

          So it will copy most of the code while adding subtle bugs.

    • By modeless 2026-02-05 19:35 (5 replies)

      There still seem to be a lot of people who look at results like this and evaluate them purely on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, that there have been continuous improvements for many years now, and that there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.

      • By zamadatix 2026-02-05 21:22

        The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself is feeling more "slow and linear" lately, but what you can do with models as part of an overall system has been increasing in growth rate and that has been delivering a lot of value. It matters less if the model natively can keep infinite context or figure things out on its own in one shot so long as it can orchestrate external tools to achieve that over time.

      • By LinXitoW 2026-02-06 3:18 (1 reply)

        The main issue with improvements in the last year is that a lot of them come not from the models strictly becoming better, but from tooling being better, and from simply using a fuckton more tokens for the same task.

        Remember that all these companies can only exist because of massive (over)investments in the hope of insane returns and AGI promises. While all these improvements (imho) prove the exact opposite: AGI is absolutely not coming, and the investments aren't going to generate these outsized returns. They will generate decent returns, and the tools are useful.

        • By modeless 2026-02-06 5:48

          I disagree. A year ago the models would not come close to doing this, no matter what tools you gave them or how many tokens you generated. Even three months ago. Effectively using tools to complete long tasks required huge improvements in the models themselves. These improvements were driven not by pretraining like before, but by RL with verifiable rewards. This can continue to scale with training compute for the foreseeable future, eliminating the "data wall" we were supposed to be running into.

      • By nozzlegear 2026-02-05 21:16 (4 replies)

        Every S-curve looks like an exponential until you hit the bend.

        • By NitpickLawyer 2026-02-05 21:24 (4 replies)

          We've been hearing this for 3 years now. And especially 25 was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at more broad tasks, and more useful by the month.

          • By LinXitoW 2026-02-06 3:22

            Model improvement is very much slowing down, if we actually use fair metrics. Most improvements in the last year or so come down to external factors, like better tooling, or the highly sophisticated practice of throwing way more tokens at the same problem (reasoning and agents).

            Don't get me wrong, LLMs are useful. They just aren't the kind of useful that Sam et al. sold investors. No AGI, no full human worker replacement, no massive reduction in cost for SOTA.

          • By kelnos 2026-02-05 23:09

            Yes, and Moore's law took decades to start to fail. Three years of history isn't even close to enough to predict whether we'll see continued exponential improvement or an insurmountable plateau. We could hit it in 6 months or 10 years; who knows.

            And at least with Moore's law, we had some understanding of the physical realities as transistors got smaller and smaller, and could reasonably predict when we'd start to hit limitations. With LLMs, we just have no idea. And that could go either way.

          • By nozzlegear 2026-02-05 21:56 (2 replies)

            > We've been hearing this for 3 years now

            Not from me you haven't!

            > "they've hit a wall, no more data, running out of data, plateau this, saturated that"

            Everyone thought Moore's Law was infallible too, right until they hit that bend. What hubris to think these AI models are different!

            But you've probably been hearing that for 3 years too (though not from me).

            > Models keep on getting better, at more broad tasks, and more useful by the month.

            If you say so, I'll take your word for it.

            • By torginus 2026-02-05 22:07 (2 replies)

              Except that with Moore's law, everyone knew decades ahead what the limits of Dennard scaling were (shrinking geometry through smaller optical feature sizes), and roughly when we would reach them.

              Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

              • By nozzlegear 2026-02-05 22:28

                > Since then, all improvements came at a tradeoff, and there was a definite flattening of progress.

                Idk, that sounds remarkably similar to these AI models to me.

              • By kijiki 2026-02-06 2:17

                Everyone?

                Intel, at the time the unquestioned world leader in semiconductor fabrication, was so unable to accurately predict the end of Dennard scaling that they rolled out the Pentium 4. "10GHz by 2010!" was something they predicted publicly in earnest!

                It, uhhh, didn't quite work out that way.

            • By Cyphase 2026-02-05 22:06 (1 reply)

              25 is 2025.

              • By nozzlegear 2026-02-05 22:25

                Oh my bad, the way it was worded made me read it as the name of somebody's model or something.

          • By fmbb 2026-02-05 22:18 (2 replies)

            > And yet, here we are.

            I dunno. To me it doesn’t even look exponential any more. We are at most on the straight part of the incline.

            • By sdf2erf 2026-02-06 1:36

              Personally, my usage has fallen off a cliff the past few months. I'm not a SWE.

              SWEs may be seeing benefit. But in other areas? Doesn't seem to be the case. Consumers may use it as a preferred interface for search, but that's a different discussion.

        • By raincole 2026-02-05 21:29 (2 replies)

          This quote would be more impactful if people hadn't been repeating it since GPT-4 times.

          • By kimixa 2026-02-05 22:11

            People have also been saying we'd be seeing the results of 100x quality improvements in software, with a corresponding decrease in cost, since GPT-4 times.

            So where is that?

          • By nozzlegear 2026-02-05 21:58

            I agree, I have been informed that people have been repeating it for three years. Sadly I'm not involved in the AI hype bubble so I wasn't aware. What an embarrassing faux pas.

        • By esafak 2026-02-06 5:26 (1 reply)

          What if it plateaus smarter than us? You wouldn't be able to discern where it stopped. I'm not convinced it won't be able to create its own training data to keep improving. I see no ceiling on the horizon, other than energy.

          • By nozzlegear 2026-02-06 22:20 (1 reply)

            Are we talking Terminator or Matrix here? I need to know which shitty future to prepare for.

            • By esafak 2026-02-06 22:44

              Using humans as batteries makes no sense. I expect robots will know better than that.

        • By famouswaffles 2026-02-06 2:30 (2 replies)

          Cool, I guess. Kind of a meaningless statement, yeah? Let's hit the bend, then we'll talk. Until then, repeating "It's an S-curve, guys, and what's more, we're near the bend! Trust me" ad infinitum is pointless. It's not some wise revelation lol.

          • By smj-edison 2026-02-06 2:46

            Maybe the best thing to say is we can only really forecast about 3 months out accurately, and the rest is wild speculation :)

            History has a way of being surprisingly boring, so personally I'm not betting on the world order being transformed in five years, but I also have to take my own advice and take things a day at a time.

          • By nozzlegear 2026-02-06 3:59 (1 reply)

            > Kind of a meaningless statement yeah?

            If you say so. It's clear you think these marketing announcements are still "exponential improvements" for some reason, but hey, I'm not an AI hype beast so by all means keep exponentialing lol

            • By famouswaffles 2026-02-06 20:18 (1 reply)

              I'm not asking you to change your belief. By all means, think we're just around the corner of a plateau, but like I said, your statement is nothing meaningful or profound. It's your guess that things are about to slow down, that's all. It's better to just say that rather than talking about S curves and bends like you have any more insight than OP.

      • By chasd00 2026-02-05 22:05

        I have to admit, even if model and tooling progress stopped dead today, the world of software development has forever changed and will never go back.

      • By uywykjdskn 2026-02-05 23:46

        Yeah, the software engineering profession is over, even if all improvements stop now.

    • By gmueckl 2026-02-05 19:43 (5 replies)

      The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

      Prove this statement wrong.

      • By libraryofbabel 2026-02-05 21:07 (2 replies)

        Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

        Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.

        • By nicoburns 2026-02-05 22:55 (1 reply)

          "Clean room" usually means "without looking at the source code" of other similar projects. But presumably the AI's training data would have included GCC, Clang, and probably a dozen other C compilers.

          • By signatoremo 2026-02-05 23:49 (1 reply)

            Suppose you, the human, are working on a clean room implementation of a C compiler. How do you go about it? Will you need to know about a) the C language, and b) the inner workings of a compiler? How did you acquire that knowledge?

            • By sarchertech 2026-02-06 2:47 (2 replies)

              Doesn’t matter how you gain general knowledge of compiler techniques as long as you don’t have specific knowledge of the implementation of the compiler you are reverse engineering.

              If you have ever read the source code of the compiler you are reverse engineering, you are by definition not doing a clean room implementation.

              • By pertymcpert 2026-02-06 6:51 (1 reply)

                Claude was not reverse engineering here. By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

                • By sarchertech 2026-02-06 12:16 (2 replies)

                  Claude was reverse engineering gcc. It was using it as an oracle and attempting to exactly match its output. That is the definition of reverse engineering. Since Claude was trained on the gcc source code, that's not a clean room implementation.

                  > By your definition no one can do a clean room implementation if they've taken a recent compilers course at university.

                  Clean room implementation has a very specific definition. It’s not my definition. If your compiler course walked through the source code of a specific compiler then no you couldn’t build a clean room implementation of that specific compiler.
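                  The oracle-based approach described here is differential testing: feed the same inputs to the candidate and a trusted reference, and flag any disagreement. A generic sketch of the idea (my own illustration with toy functions, not the project's actual harness):

```python
import random

def differential_test(candidate, oracle, gen_input, trials=1000):
    """Differential testing: run identical random inputs through a
    candidate implementation and a trusted oracle, returning the first
    input on which they disagree, or None if none is found."""
    for _ in range(trials):
        x = gen_input()
        expected, got = oracle(x), candidate(x)
        if expected != got:
            return x, expected, got  # a failing input to debug against
    return None

# Toy usage: the "candidate" has an off-by-one bug the oracle exposes.
oracle = lambda ab: ab[0] // ab[1]
candidate = lambda ab: (ab[0] + 1) // ab[1]
gen = lambda: (random.randint(0, 100), random.randint(1, 10))
```

In the compiler setting, the "inputs" are C programs, the oracle is gcc's compiled output or runtime behavior, and a disagreement is a miscompilation to investigate; whether doing this counts as clean room is exactly what the thread disputes.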

                    • By signatoremo 2026-02-06 22:37 (1 reply)

                    There is no specific definition of clean room implementation. Please provide source for your claim otherwise.

                    There are many well known examples of clean room implementation. One example that survived lawsuits is Sony v. Connectix:

                    During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly. Connectix's successful appeal maintained that the direct disassembly and observation of proprietary code was necessary because there was no other way to determine its behavior - [0]

                    That practice is similar to GCC being used here to verify the output of the generated compiler, arguably even more intrusive.

                      [0] - https://en.wikipedia.org/wiki/Clean-room_design

                      • By sarchertech 2026-02-07 0:41 (1 reply)

                      “clean room implementation” is a term of art with a specific meaning. It has no statutory definition though so you’re technically right. But it is a defense against copyright infringement because you can’t infringe on copyright without knowledge of the material.

                      >During production, Connectix unsuccessfully attempted a Chinese wall approach to reverse engineer the BIOS, so its engineers disassembled the object code directly.

                      This doesn’t mean what you think it means. They unsuccessfully attempted a clean room implementation. What they did do was later ruled to be fair use, but it wasn’t a clean room implementation.

                      Using gcc as an oracle isn’t what makes it not a clean room implementation. Prior knowledge of the source code is what makes it not a clean room implementation. Using gcc as an oracle makes it an attempt to reverse engineer gcc, it says nothing about whether it is a clean room implementation or not.

                      There is no definition of “clean room implementation” that allows knowledge of source code. Otherwise it’s not a clean room implementation. It’s just reverse engineering/copying.

                        • By signatoremo 2026-02-07 5:46 (1 reply)

                        Again, reverse engineering is a valid use case of clean room implementation as I posted above, so you don't have a point there.

                        > “clean room implementation” is a term of art with a specific meaning.

                          What is the specific meaning you are talking about? If I set out to do a clean room implementation of some software, what do I need to do, specifically, so that I will prevail against any copyright infringement claims? The answer is that there is no such surefire guarantee.

                          Re: Sony v. Connectix: clean room is meant to protect against copyright infringement, and since Connectix was ruled not to be infringing Sony's copyrights, their implementation is practically clean room under the law, despite all the pushback. If Connectix prevailed, I'm sure the C compiler in question would prevail as well if they got sued.

                        Finally, take Phoenix vs. IBM re: the former's BIOS implementation of the latter's PC:

                        Whenever Phoenix found parts of this new BIOS that didn't work like IBM's, the isolated programmer would be given written descriptions of the problems, but not any coded solutions that might have hinted at IBM's original version of the software - [0]

                        That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.

                        [0] - https://books.google.com/books?id=Bwng8NJ5fesC&pg=PA56#v=one...

                        • By sarchertech 2026-02-07 9:25

                          You’re getting confused because you are substituting the goal of a clean room implementation for its definition. And you are not understanding that “clean room implementation” is one specific type of reverse engineering.

                          The goal is to avoid copyright infringement claims. A specific clean room implementation may or may not be successful at that.

                          This does not mean that any reverse engineering attempt that successfully avoids copyright infringement was a clean room implementation.

                          A clean room implementation is a specific method of reverse engineering where one team writes a spec by reviewing the original software and the other team attempts to implement that spec. The entire point is so that the 2nd team has no knowledge of proprietary implementation details.

                          If the 2nd team has previously read the entire source code that defeats the entire purpose.

                          > That very much sounds like using GCC as an online known-good compiler oracle to compare against in this case.

                          Yes and that is absolutely fine to do in a clean room implementation. That’s not the part that makes this not a clean room implementation. That’s the part that makes it an attempt at reverse engineering.

                  • By pertymcpert 2026-02-09 9:26 · 1 reply

                    Why do you say it reverse engineered gcc instead of llvm? If you read the code, it has much more of llvm's concepts than gcc's.

                    • By sarchertech 2026-02-09 22:03

                      Because they used gcc output as a reference spec.

              • By signatoremo 2026-02-06 21:56 · 1 reply

                > you are by definition not doing a clean room implementation.

                This makes no sense. Reverse engineering IS an application of clean room implementation. Citing Wikipedia:

                “Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design”

                https://en.wikipedia.org/wiki/Clean-room_design

                • By sarchertech 2026-02-07 9:30

                  There are many ways to reverse engineer a piece of software.

                  A clean room implementation is one such method of reverse engineering.

                  A clean room implementation is always reverse engineering. Reverse engineering is not always done using a clean room method.

        • By gmueckl 2026-02-06 0:19 · 2 replies

          The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests as an explicit filter on those outputs and limiting the acceptable solution space, which excluded unwanted interpolations of the training set that also result from the lossy input compression.

          The fact that the implementation language for the compiler is rust doesn't factor into this. ML based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM and the keyword rust in the input activates one specific to that programming language.

          • By astrange 2026-02-06 10:00 · 1 reply

            > The result is a fuzzy reproduction of the training input, specifically of the compilers contained within.

            Is it? I'm somewhat familiar with gcc and clang's source and it doesn't really particularly look like it to me.

            https://github.com/anthropics/claudes-c-compiler/blob/main/s...

            https://llvm.org/doxygen/LoopStrengthReduce_8cpp_source.html

            https://github.com/gcc-mirror/gcc/blob/master/gcc/gimple-ssa...

            • By gmueckl 2026-02-06 15:49 · 1 reply

              Checking for similarity with compilers that consist of orders of magnitude more code probably doesn't reveal much. There are many more smaller compilers for C-adjacent languages out there, plus code fragments from textbooks.

              • By astrange 2026-02-06 17:36

                There are not many more compilers with the specific optimization pass I linked.

                Also, I don't think you could reuse code from a different compiler unless you used the same IR.

          • By libraryofbabel 2026-02-06 1:00 · 1 reply

            Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?

            • By gmueckl 2026-02-06 15:52

              I personally work on simulation software and create novel simulation methods as part of the job. I find that LLMs can only help if I reduce that task to a translation of detailed algorithms descriptions from English to code. And even then, the output is often riddled with errors.

      • By NitpickLawyer 2026-02-05 19:47 · 4 replies

        > Prove this statement wrong.

        If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

        • By shakna 2026-02-05 22:07 · 1 reply

          Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most were trained on that quite small program - it's 20kB.

          But reimplementing that isn't impressive, because it's not a clean room implementation if you trained on that data to make the model that regurgitates the effort.

          • By signatoremo 2026-02-05 23:52 · 1 reply

            > Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do.

            Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.

            • By shakna 2026-02-06 4:39

              Not ones I recorded. But something I threw at DeepSeek, early Claude, etc.

              And the prompt was just that. Nothing detailed.

        • By gmueckl 2026-02-05 21:41

          This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.

        • By geraneum 2026-02-05 19:51

          Perhaps 4.5 could also do it? We don't really know until we try. I don't trust the marketing material that much. The fact that previous (smaller) versions could or couldn't do it does not really disprove that claim.

        • By hn_acc1 2026-02-05 21:50 · 1 reply

          Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?

          • By Philpax 2026-02-05 22:55

            Please look at the source code and tell me how this is a "simple, basic LLM task".

      • By Marha01 2026-02-05 20:09 · 3 replies

        Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

        • By jesse__ 2026-02-05 20:49 · 1 reply

          This sounds very wrong to me.

          Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

          I could go on, but I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

          • By FeepingCreature 2026-02-05 21:52 · 1 reply

            This would imply that the English internet is not much bigger than 20x the English Wikipedia.

            That seems implausible.

            • By jesse__ 2026-02-05 22:54 · 1 reply

              > That seems implausible.

              Why, exactly?

              Refuting facts with "I doubt it, bro" isn't exactly a productive contribution to the conversation...

              • By onraglanroad 2026-02-06 17:51

                Because we can count? How could you possibly think that Wikipedia was 5% of the whole Internet? It's just such a bizarrely foolish idea.

        • By kgeist 2026-02-05 21:46 · 1 reply

          A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

          • By FeepingCreature 2026-02-05 21:53 · 2 replies

            I would be extremely surprised if it was that small.

            • By artisin 2026-02-06 14:52 · 1 reply

              I was curious about the scale of 1TiB of text. According to WolframAlpha, it's roughly 1.1 trillion characters, which breaks down to 180.2 billion words, 360.5 million pages, or 16.2 billion lines. In terms of professional typing speed, that's about 3800 years of continuous work.

              So post-deduplication, I think it's a fair assessment that a significant portion of high-quality text could fit within 1TiB. Tho 'high-quality' is a pretty squishy and subjective term.
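[Ed.: the figures above are easy to sanity-check. A rough sketch, assuming roughly 1 byte per character, ~6.1 characters per word, 500 words per page, ~68 characters per line, and a ~90 wpm professional typing speed - all ballpark conversion factors, not numbers from the comment:]

```python
TIB = 2**40  # bytes in 1 TiB

chars = TIB          # ~1 byte per character for ASCII-heavy English text
words = chars / 6.1  # ~6.1 characters per word, including the space
pages = words / 500  # ~500 words per page
lines = chars / 68   # ~68 characters per line

# Continuous professional typing at ~90 words per minute
minutes = words / 90
years = minutes / (60 * 24 * 365.25)

print(f"{chars / 1e12:.1f} trillion characters")  # ~1.1 trillion
print(f"{words / 1e9:.1f} billion words")         # ~180.2 billion
print(f"{pages / 1e6:.1f} million pages")         # ~360.5 million
print(f"{lines / 1e9:.1f} billion lines")         # ~16.2 billion
print(f"{years:.0f} years of continuous typing")  # ~3800 years
```

[The numbers line up with the comment's, so the conversion factors WolframAlpha uses are presumably close to these.]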

              • By FeepingCreature 2026-02-09 9:14

                Yes, a million books is a reasonably big library.

                But I would be surprised if the internet only filled a reasonably big library.

            • By kaibee 2026-02-06 4:31

              Well, a terabyte of text is... quite a lot of text.

        • By gmueckl 2026-02-05 21:34

          This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.

      • By 0xCMP 2026-02-05 21:34 · 1 reply

        I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

        If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

        Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.

        • By hn_acc1 2026-02-05 21:55

          The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.

          Now, if it could translate the C standard into an extensive test suite that actually captures most corner cases and doesn't generate false positives - again, without internet access, without using gcc as an oracle, etc. - that would be a different story.

      • By brutalc 2026-02-05 19:50 · 1 reply

        No one needs to prove you wrong. That's just personal insecurity trying to justify one's own worth.

    • By panzi 2026-02-05 21:58 · 1 reply

      > clean-room implementation

      Except it's trained on all the source code out there, so I assume on GCC and clang. I wonder how similar the code is to either.

      • By pertymcpert 2026-02-06 6:54

        I'm familiar with both compilers. There's more similarity to LLVM, it even borrows some naming such as mem2reg (which doesn't really exist anymore) and GetElementPtr. But that's pretty much where things end. The rest of it is just common sense.

    • By shubhamjain 2026-02-06 8:43 · 1 reply

      Yeah, I am amazed how people are brushing this off simply because GCC exists. This was a far more challenging task than the browser thing, given how few open-source compilers are out there. Add to that no internet access and no dependencies.

      At this point, it’s hard to deny that AI has become capable of completing extremely difficult tasks, provided it has enough time and tokens.

      • By bjackman 2026-02-06 8:49

        I don't think this is more challenging than the browser thing. The scope is much smaller. The fact that this is "only" 100k lines is evidence for this. But, it's still very impressive.

        I think this is Anthropic seeing the Cursor guy's bullshit and saying "but, we need to show people that the AI _can actually_ do very impressive shit as long as you pick a more sensible goal"

    • By kelnos 2026-02-05 23:05 · 2 replies

      Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.

      The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.

      Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.

      • By steveklabnik 2026-02-05 23:58

        > Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.

        Hello, this is what I did over my Christmas break. I've been taking some time to do other things, but plan on returning to it. But this absolutely works. Claude has written far more programs in my language than I have.

        https://rue-lang.dev/ if you want to check it out. Spec and code are both linked there.

      • By simonw 2026-02-05 23:17 · 1 reply

        Are you a frequent user of coding agents?

        I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.

        I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.

        • By bdangubic 2026-02-05 23:26

          I am an all-day, daily user (multiple Claude Max accounts). This fits my mental model - not the model I had before, but the one I developed through daily use. My job revolves around two core things:

          1. data analysis / visualization / …

          2. “is this possible? can this even be done?”

          For #1, I don't do much by hand anymore; for #2, I mostly still do it all "by hand," and not for lack of serious trying. So "it can do #1 1000x better than me because those are generally solved problems it was trained on, while it can't effectively do #2" fits perfectly.

    • By dyauspitr 2026-02-05 21:58 · 1 reply

      > Claude did not have internet access at any point during its development

      Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.

      • By simonw 2026-02-05 22:23

        It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".

    • By chamomeal 2026-02-06 2:49 · 1 reply

      What's making these models so much better on every iteration? Is it new data? Different training methods?

      Kinda waiting for them to plateau so I can stop feeling so existential ¯\_(ツ)_/¯

      • By esafak 2026-02-06 5:38

        More compute (bigger models, and prediction-time scaling), algorithmic advances, and ever more data (including synthetic).

        Remember that all white collar workers are in your position.

  • By andrewshawcare 2026-02-06 2:48 · 4 replies

    It used the best tests it could find for existing compilers. This is effectively steering Claude to a well-defined solution.

    Hard to find fully specified problems like this in the wild.

    I think this is more a testament to small, well-written tests than it is agent teams. I imagine you could do the same thing with any frontier model and a single agent in a linear flow.

    I don’t know why people use parallel agents and increase accidental complexity. Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

    > Write extremely high-quality tests

    > Claude will work autonomously to solve whatever problem I give it. So it’s important that the task verifier is nearly perfect, otherwise Claude will solve the wrong problem. Improving the testing harness required finding high-quality compiler test suites, writing verifiers and build scripts for open-source software packages, and watching for mistakes Claude was making, then designing new tests as I identified those failure modes.

    > For example, near the end of the project, Claude started to frequently break existing functionality each time it implemented a new feature. To address this, I built a continuous integration pipeline and implemented stricter enforcement that allowed Claude to better test its work so that new commits can’t break existing code.

    • By tantalor 2026-02-06 3:56

      Why didn't Claude realize on its own that it needed a continuous integration pipeline?

      Far too much human intervention here.

    • By sublimefire 2026-02-06 9:24

      > Isn’t one agent fast enough? Why lose accuracy over +- one week to write a compiler?

      My thinking as well. IMO it's because otherwise you have to wait longer for results - you basically want to shorten the loops to improve the system. It hints at the underlying problem: most of what we see is a challenge of seeding a good enough context for the model to successfully make progress over many iterations.

    • By krzat 2026-02-06 7:37 · 1 reply

      You know what else is well specified? LLM improving on itself.

      • By widdershins 2026-02-06 8:16

        I wouldn't describe intelligence as well specified. We can't even agree on what it is.

    • By GalaxyNova 2026-02-06 2:57 · 1 reply

      > Hard to find fully specified problems like this in the wild.

      This is such a big and obvious cope. This is obviously a very real problem in the wild and there are many, many others like it. Probably most problems are like this honestly or can be made to be like this.

      • By anematode 2026-02-06 3:34

        Impressive, my sarcasm/bait detector almost failed me.

HackerNews