
Opinion: Autonomous agents may generate millions of lines of code, but shipping software is another matter
AI-integrated development environment (IDE) company Cursor recently implied it had built a working web browser almost entirely with its AI agents. I won't say they lied, but CEO Michael Truell certainly tweeted: "We built a browser with GPT-5.2 in Cursor."
He followed up with: "It's 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM."
That sounds impressive, doesn't it? He also added: "It *kind of* works," which is not the most ringing endorsement. Still, numerous news sources and social media chatterboxes ran with the news that AI built a web browser in a week.
Too bad it wasn't true. If you actually look at Cursor engineer Wilson Lin's blog post about FastRender, the AI-created web browser, you won't see much boasting about a working web browser. Instead, there's a video of a web browser sort of working, and a much less positive note that "building a browser from scratch is extremely difficult."
The thing about making such a software announcement with the code on GitHub is that while the headlines proclaim another AI victory, developers have this nasty trick: they actually git the code and try it out.
Developers quickly discovered the "browser" barely compiles, often does not run, and was heavily misrepresented in marketing.
As a techie, I found the actual blog post about how they tried and didn't really succeed much more interesting. Of course, "Cursor sicced hundreds of GPT-5.2 agents on a problem for a week, generating three million lines of new code that yielded, at best, a semi-functional web browser" doesn't make for a good headline.
According to Perplexity, my AI chatbot of choice, this week-long autonomous browser experiment consumed on the order of 10 to 20 trillion tokens and would have cost several million dollars at then-current list prices for frontier models.
I'd sooner just clone a copy of Chromium myself. For all that time and money, independent developers who cloned the repo reported that the codebase is very far from being a functional browser. Recent commits do not compile cleanly, GitHub Actions runs on main are failing, and reviewers could not find a single recent commit that built without errors.
Where builds succeeded after manual patching, performance was abysmal, with reports of pages taking around a minute to load and a heavy reliance on existing projects like Servo, a Rust-based web rendering engine, and QuickJS, a JavaScript engine, despite "from scratch" claims.
Lin defended the project on Y Combinator's Hacker News, saying, for instance: "The JS engine used a custom JS VM being developed in vendor/ecma-rs as part of the browser, which is a copy of my personal JS parser project vendored to make it easier to commit to." If it's derived from his personal JavaScript parser, that's not really from scratch, is it? Nor is it, from the sound of the argument, written by AI.
Gregory Terzian, a Servo maintainer, responded: "The actual code is worse; I can only describe it as a tangle of spaghetti... I can't make much, if anything, out of it." He then gave the backhanded compliment: "So I agree this isn't just wiring up of dependencies, and neither is it copied from existing implementations: it's a uniquely bad design that could never support anything resembling a real-world web engine." Now that's a burn.
From where I sit, what makes the Cursor case more dangerous than just a failed hack‑week project is that the hype is baked into its methodology. The "experiment" wasn't presented as what it really was: an interesting, but messy, internal learning exercise. No, it was rolled out as a milestone that conveniently confirmed the company's long‑running autonomous agent advertising. Missing from the story were basics any senior engineer would demand: passing Continuous Integration (CI), reproducible builds, and real benchmarks that show the browser doing more than limping through a hello-world page.
Zoom out, and CEOs are still predicting that AI will write 90 percent of code in a year, while most enterprise AI pilots still fail to deliver meaningful return on investment.
We're now in a kind of AI uncanny valley for developers. Sure, tools like Cursor can be genuinely helpful as glorified autocomplete and refactoring assistants, but marketing keeps insisting they can replace junior engineers and take whole projects from spec to shipping. When you start believing your own sizzle reel, you stop doing the tedious validation work that separates a demo from a deliverable.
Enough already. The hype has grown cold. Sarah Friar, OpenAI's CFO, recently blogged that in 2026, its focus would be on "practical adoption." Let's see real-world practical results first, and then we can talk about practical AI adoption. ®
I love the quote from Gregory Terzian, one of the Servo maintainers:
> "So I agree this isn't just wiring up of dependencies, and neither is it copied from existing implementations: it's a uniquely bad design that could never support anything resembling a real-world web engine."
It hurts that it wasn't framed as an experiment: "Look, we wanted to see how far AI can go - it kind of fell short of the bar." As it stands, it's grist to the mill of every CEO out there who has no clue about coding but wonders why their people are so expensive when "AI can do it! D'oh!"
In blacksmithing there's the concept of an "anvil shaped object". That is, something that looks like an anvil but is hollow, or made of ceramic, or the like. It might stand up to light tapping for jewelry work, but it should never be worked like a real anvil, for fear of hurting someone or wrecking the workpiece when it breaks.
I feel like a lot of the AI articles and experiments like this one are producing "app shaped objects" that look okay for making content (and indeed are fine for making earrings) but fall apart when pounded on by the real world.
Plus we can suspect a tremendous amount of astroturfing on this topic. When you're spending billions on the tech, a few million (if it even is that much) for "creative marketing" is really nothing.
That was from a conversation here on Hacker News the other day: https://news.ycombinator.com/item?id=46624541#46709191
I wish your recent interview had pushed much harder on this. It came across as politely not wanting to bring up how poorly this really went, even for what the engineer intended.
They were making claims without the level of rigor to back them up. There was an opportunity to learn some difficult lessons, but—and I don’t think this was your intention—it came across to me as kind of access journalism; not wanting to step on toes while they get their marketing in.
pushing would definitely stop the supply of interviews/freebies/speaking engagements
Why would he push back? His whole schtick is to sell only AI hype. He’s not going to hurt his revenue.
If I sell only AI hype why do I keep telling people that many systems built on top of LLMs are inherently insecure? https://simonwillison.net/2025/Jun/16/the-lethal-trifecta/
That's a great way to tell on yourself that you've never read Simon's work.
On the contrary, we get to read hundreds of his comments explaining how the LLM in anecdote X didn't fail, it was the developer's fault and they should know better than to blame the LLM.
I only know this because on occasion I'll notice there was a comment from them (I only check the name of the user if it's a hot take) and I ctrl-F their username to see 20-70 matches on the same thread. Exactly 0 of those comments present the idea that LLMs are seriously flawed in programming environments regardless of who's in the driver seat. It always goes back to operator error and "just you watch, in the next 3 months or years...".
I dunno, I manage LLM implementation consulting teams and I will tell you to your face that LLMs are unequivocally shit for the majority of use cases. It's not hard to directly criticize the tech without hiding behind deflections or euphemisms.
> Exactly 0 of those comments present the idea that LLMs are seriously flawed in programming environments regardless of who's in the driver seat.
Why would I say that when I very genuinely believe the opposite?
LLMs are flawed in programming environments if driven by people who don't know how to use them effectively.
Learning to use them effectively is unintuitive and difficult, as I'm sure you've seen yourself.
So I try to help people learn how to use them, through articles like https://simonwillison.net/2025/Mar/11/using-llms-for-code/ and comments like this one: https://news.ycombinator.com/item?id=46765460#46765940
(I don't ever say variants of "just you watch, in the next 3 months or years..." though, I think predicting future improvements is pointless when we can be focusing on what the models we have right now can do.)
I literally see their posts every (other) day, and it's always glazing something that doesn't fully work (but is kind of cool at a glance) or is really just hyped beyond belief.
Comments usually point out the issues or more grounded reality.
BTW I'm bullish on AI, going through 100s of millions of tokens per month.
the bare minimum of criticism to allow independence to be claimed?
I actually don't think this is true; of the people who cover LLMs, Simon Willison is certainly one of the more critical and measured.
The person you're responding to isn't a journalist, they're a mouthpiece. Pushing means they don't get these interviews anymore.
The quality of whatever they put out as a result of it is yours to take into consideration.
I just don't think that's the case.
The claims they made really weren't that extreme. In the blog post they said:
> To test this system, we pointed it at an ambitious goal: building a web browser from scratch. The agents ran for close to a week, writing over 1 million lines of code across 1,000 files. You can explore the source code on GitHub.
> Despite the codebase size, new agents can still understand it and make meaningful progress. Hundreds of workers run concurrently, pushing to the same branch with minimal conflicts.
That's all true.
On Twitter their CEO said:
> We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week.
> It's 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM.
> It kind of works! It still has issues and is of course very far from Webkit/Chromium parity, but we were astonished that simple websites render quickly and largely correctly.
That's mostly accurate too, especially the "it kind of works" bit. You can take exception to the "from-scratch" claim if you like. It's a tweet; the lack of nuance isn't particularly surprising.
In the overall genre of CEOs over-hyping their companies' achievements, this is a pretty weak example.
I think the people making out that Cursor massively and dishonestly over-hyped this are arguing with a straw man version of what the company representatives actually said.
> That's mostly accurate too, especially the "it kind of works" bit. You can take exception to "from-scratch" claim if you like. It's a tweet, the lack of nuance isn't particularly surprising.
> In the overall genre of CEO's over-hyping their company's achievements this is a pretty weak example
I kind of agree, but kind of not. The tweet isn't too bad when read from an experienced engineer perspective, but if we're being real then the target audience was probably meant to be technically clueless investors who don't and can't understand the nuance.
What people take issue with is the claim that agents built a web browser "from scratch" only to find by looking deeper that they were using Servo, WGPU, Taffy, winit, and other libraries which do most of the heavy lifting.
It's like claiming "my dog filed my taxes for me!" when in reality everything was filled out in TurboTax and your dog clicked the final submit button. Technically true, but clearly disingenuous.
I'm not saying an LLM using existing libraries is a bad thing--in fact I'd consider an LLM which didn't pull in a bunch of existing libraries for the prompt "build a web browser" to be behaving incorrectly--but the CEO is misrepresenting what happened here.
Did you read the comment that started this thread? Let me repeat that, ICYMI:
> "So I agree this isn't just wiring up of dependencies, and neither is it copied from existing implementations: it's a uniquely bad design that could never support anything resembling a real-world web engine."
It didn't use Servo, and it wasn't just calling dependencies. It was terribly slow and stupid, but your comment is more of a mischaracterization than anything the Cursor people have said.
You're right in the sense it didn't `use::servo`, merely Servo's CSS parser `cssparser`[0] and Servo's HTML parser `html5ever`[1]. Maybe that dog can do taxes after all.
[0] https://github.com/search?q=repo%3Awilsonzlin%2Ffastrender%2...
[1] https://github.com/search?q=repo%3Awilsonzlin%2Ffastrender+h...
Taffy is related to Servo too, though apparently not officially part of the Servo project - but Servo does use it.
https://github.com/DioxusLabs/taffy
Used here (I think): https://github.com/servo/servo/tree/c639bb1a7b3aa0fd5e02b40d...
Servo uses Taffy for CSS Grid. It could also very easily use it for Flexbox, but they currently prefer to use their own implementation there.
It was originally a derivative of React Native's Yoga implementation of Flexbox, and is currently developed primarily as part of the Blitz engine.
I agree that "from scratch" is a misrepresentation.
But it was accompanied by a link to the GitHub repo, so you can hardly claim that they were deliberately hiding the truth.
Sorry, just to be clear: the defense against them having pulled something out of their ass is that they linked to something that outed them? So they couldn't actually have been overstating it?
If anything, that proves the point that they weren't rigorous! They claimed a thing. The thing didn't accomplish what they said. I'm not saying that they hid it, but that they misrepresented the thing that they built. My point is that the interview didn't firmly pressure them on this.
Generating a million lines of code in parallel isn't impressive. Burning a mountain of resources in parallel isn't noteworthy (see: the weekly post of someone with an out of control EC2 instance racking up $100k of charges.)
It would have been remarkable if they'd built a browser from scratch, which they said they did, except they didn't. It was a 50 million token hackathon project that didn't work, dressed up as a groundbreaking example of their product.
As feedback, I hope in the future you'll push back firmly on these types of claims when given the opportunity, even if it makes the interviewee uncomfy. Incredible claims require incredible evidence. They didn't have it.
My goal in the interview was to get to as accurate a version of what they actually built and how they built it as possible.
I don't think directly accusing them of being misleading about what they had done would have supported that goal, so I didn't do it.
Instead I made sure to dig into things like what QuickJS was doing in there and why it used Taffy as part of the conversation.
3 days ago: (https://news.ycombinator.com/item?id=46743831)
> Honestly, grilling him about what the CEO had tweeted didn't even cross my mind.
Today:
> I don't think directly accusing them of being misleading about what they had done would have supported that goal, so I didn't do it.
I find it hard to follow how it didn't cross your mind while for the same interview you had also considered the situation and determined it didn't meet the interview goal.
I don't think those two statements are particularly inconsistent.
It didn't cross my mind to grill him over his CEO's tweets.
I also don't think that directly accusing them of being misleading would support the goal of my interview - which was to figure out the truth of what they built and how.
If you like, I'll retract the fragment "so I didn't do it" since that implies that I thought "maybe I should grill him about what the CEO said... no actually I won't" - which isn't what happened.
So I guess you win?
> I agree that "from scratch" is a misrepresentation.
I believe in the UK the term for this is actually fraudulent misrepresentation:
https://en.wikipedia.org/wiki/Misrepresentation#English_law
And in this context it seems to go against The Consumer Protection from Unfair Trading Regulations 2008 and the Digital Markets, Competition and Consumers Act 2024.
I very much don't believe for a second anyone would manage to get a judgement against them on this in the UK.
For starters, the language is highly subjective. They'd be able to show vast amounts of software engineering discourse where "from scratch" often does not mean starting with nothing, and they'd then argue that the person suing hadn't actually had any reason to believe they could replicate a setup described as a complex large-scale experiment without much more information.
The person suing would have an uphill battle showing that whatever assumptions they made were something that was reasonable to infer based on that statement.
And to have a case, a consumer would also then need to have relied on this as a significant factor in choosing to buy their services.
But even if we assume the court would agree it is fraudulent, the remedy is only "directly consequential losses".
In other words, I doubt anyone would lose sleep over this risk.
How many non-developers were going to look at that? They knew exactly what they were doing by saying that.
> But it was accompanied by a link to the GitHub repo, so you can hardly claim that they were deliberately hiding the truth.
Well, yes and no; we live in an era where people consume headlines, not articles, and certainly not links to Github repositories in articles. If VCs and other CEOs read the headline "Cursor Agents Autonomously Create Web Browser From Scratch" on LinkedIn, the project has served its purpose and it really doesn't matter if the code compiles or not.
> I think the people making out that Cursor massively and dishonestly over-hyped this are arguing with a straw man version of what the company representatives actually said.
It's far more dishonest to search for contrived interpretations of their statements in an attempt to frame them as "mostly accurate" when their statements are clearly misleading (and in my opinion, intentionally so).
You're giving them infinite benefit of the doubt where they deserve none, as this industry is well known for intentionally misleading statements, you're brushing off serious factual misrepresentations as simple "lack of nuance" and finally trying to discredit people who have an issue with all of this.
With all due respect, that's not the behavior of a neutral reporter but someone who's heavily invested in maintaining a certain narrative.
According to the twitter analytics you can see on the post (at least on nitter), the original
> We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week.
tweet was seen by over 6 million people.
The follow-up tweet, which includes the link to the actual details, was seen by fewer than 200,000.
That's just how Twitter engagement works and these companies know it. Over 6 million people were fed bullshit. I'm sorry, but it's actually a great example of CEOs over hyping their products.
That Tweet that was seen by 6 million people is here: https://x.com/mntruell/status/2011562190286045552
You only quoted the first line. The full tweet includes the crucial "it kind of works" line - that's not in the follow-up tweet, it's in the original.
Here's that first tweet in full:
> We built a browser with GPT-5.2 in Cursor. It ran uninterrupted for one week.
> It's 3M+ lines of code across thousands of files. The rendering engine is from-scratch in Rust with HTML parsing, CSS cascade, layout, text shaping, paint, and a custom JS VM.
> It kind of works! It still has issues and is of course very far from Webkit/Chromium parity, but we were astonished that simple websites render quickly and largely correctly.
The second tweet, with only 225,000 views, was just the following text and a link to the GitHub repository:
> Excited to continue stress testing the boundaries of coding agents and report back on what we learn.
> Code here: https://github.com/wilsonzlin/fastrender
The fact that the codebase is meaningless drivel has already been established; you don't need to defend them. It's just pure slop, and they're trying to get people to believe that it's a working browser. At the time he bragged about it, `cargo build` didn't even run! It was completely broken going back a hundred commits. So it was a complete lie to claim that it "kind of works".
You have a reputation. You don't need to carry water for people who are misleading others to raise VC money. What's the point of your language-lawyering about the precise meaning of what he said?
“No no, you don’t get it guys. I’m technically right if you look at the precise wording” is the kind of silly thing I do all the time. It’s not that important to be technically right. Let this one go.
Which part of their CEO saying "It kind of works" are you interpreting as "trying to get people to believe that it’s a working browser"?
The reason I won't let this one go is that I genuinely believe people are being unfair to the engineer who built this, because some people will jump on ANY opportunity to "debunk" stories about AI.
I won't stand for misleading rhetoric like "it's just a Servo wrapper" when that isn't true.
> I won't stand for misleading rhetoric like "it's just a Servo wrapper" when that isn't true.
this level of outrage seems absent when it's misleading in the pro-"AI" direction
> "It kind of works"
https://github.com/wilsonzlin/fastrender/issues/98
A project that didn't compile at all counts as "kind of" working now?
> I won't stand for misleading rhetoric like "it's just a Servo wrapper" when that isn't true.
True, at least if it was a wrapper then it would actually kind of work, unlike this which is the most obvious case of hyping lies up for investors I've witnessed in the last... Well, week or so, considering how much bullshit spews out of the mouths of AI bros.
It did compile. It just didn't compile in GitHub Actions CI, since that wasn't correctly configured.
The linked GitHub issue has quotes from multiple people who were not able to compile it locally, not just in CI.
simonw has drunk the koolaid on this one. There’s no point trying to convince him. Relatedly, he made a prediction that AI would be able to write a web browser from scratch in 3 years. He really wants to see this happen, so maybe that’s why he’s defending these scammers.
It’s been fascinating, watching you go from someone who I could always turn into more sensible opinion about technology for the last 15 years, to a sellout whose every message drips with motivated reasoning.
I feel like I spend way too much of my time arguing back against motivated reasoning from people.
"This project is junk that doesn't even compile", for example.
It's largely futile. There's a certain contingent that will not be convinced of this until they see what these tools can do first hand, and they'll refuse to try to do this properly until it's everywhere.
I’m super impressed by how "zillions of lines of code" got re-branded as a reasonable metric by which to measure code, just because it sounds impressive to laypeople and incidentally happens to be the only thing LLMs are good at optimizing.
It really is insane. I really thought we had made progress stamping out the idea that more LOC == better software, and this just flies in the face of that.
I was in a meeting recently where a director lauded Claude for writing "tens of thousands of lines of code in a day", as if that metric in and of itself was worth something. And don't even get me started on "What percentage of your code is written by AI?"
As Dijkstra opined in 1988: "My point today is that, if we wish to count lines of code, we should not regard them as 'lines produced' but as 'lines spent': the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger."
As a fun exercise, I tried to see how close I could get to Cursor's results without using any Rust crates, and by making the agent actually care about the code. End results: 20K LOC for a browser that more or less works the same, on three platforms, leveraging commonly available system libraries and no 3rd party Rust crates: https://emsh.cat/one-human-one-agent-one-browser/ (https://news.ycombinator.com/item?id=46779522)
I'm not entirely sure what the millions of lines of code is supposedly doing.
"What percentage of your code is written by AI?"
"I don't know, what percentage of your sweater is polyester?"
"I don't know, I think it's all cotton, why do you ask me such a random question?"
"Well surely you know that polyester can be made far cheaper in a plastics factory than cotton? Why do you use cotton?"
LOC per day metrics are bovine metrics: how many pounds of dung per day.
I'd argue porcine: how many pounds of slop per day.
KPIs are slowly destroying the American economy. The idea that everything can be easily and meaningfully measured with simple metrics by laypeople is a myth propagated by overpaid business consultants. It's absurd and facetious. Every attempt to do so is degrading and counter-productive.
Other Western economies too. In the UK it's destroying the education system as well.
The problem is that Western societies shifted into a "zero trust" mode - on all levels. It begins with something small, like leaving your house door unlocked when you go to work no longer being reasonable due to theft and vandalism, and it ends with insane amounts of "dumb capital" being flushed into public companies by ETFs and other investment vehicles.
And the latter is what's driving the push for KPIs the most - "active" ETFs already were bad enough because their managers would ask the companies they invested in to provide easy-to-grok KPIs (so that they could keep more of the yearly fee instead of having to pay analysts to dig down into a company's finances), and passive ETFs make that even worse because there is now barely any margin left to pay for more than a cursory review.
America's desire for stock-based pensions is frying the world's economy with its second and third order effects. Unfortunately, that rotten system will most probably only collapse when I'm already dead, so there is zero chance for most people alive today to ever see a world free of this BS.
I completely agree. The issue is that some misconceptions just never go away. People were talking about how bad lines of code is as a metric in the 1980s [1]. Its persistence as a measure of productivity only shows to me that people feel some deep-seated need to measure developer productivity. They would rather have a bad but readily-available metric than no measure of productivity.
Every line of code is technical debt. Some of the hardest projects I’ve ever worked on involved deleting as much code as I wrote.
Exactly. I once worked on a large project where the primary contractor was Accenture. They threw a party when we hit a million lines of C++. I sat in the back at a table with the other folks who knew enough to realize it should have been a wake.
It’s just the easiest metric to measure progress. Measuring real progress in terms of meeting user needs with high quality is a lot harder.
Oh yeah. At a previous job there was a guy who'd deleted more code than he'd written which I always found a little amusing.
Being in a similar position to him now though... if it can be deleted it gets deleted.
That's what got me. I've never written a browser from scratch, but just being told that it took millions of lines of code made me feel like something was wrong. Maybe somehow that's what it takes? But I've worked in massive monorepos that didn't have 3 million lines of code and still facilitated an entire business's function.
To be fair, it easily takes 3 million lines of code to make a browser from scratch. Firefox and Chrome both have around ten times that(!) – presumably including tests etc. But if the browser is in large part third-party libraries glued together, that definitely shouldn't take 3 million lines.
It depends on how functional you want the browser to be. I could technically write a web browser in a few lines of Perl, but you wouldn't get any styling, let alone JavaScript. Plus, 90% of the code is likely going toward fixing compatibility issues with poorly designed sites.
FastRender isn't "in large part third-party libraries glued together". The only dependency that fits that bill in my opinion is Taffy for CSS grid and flexbox layout.
The rest is stuff like HarfBuzz for text shaping, which is an entirely cromulent dependency for a project like this.
Yeah I would have thought 3 million lines for a fully functional browser is a little lean, though I imagine that Chrome and Firefox have probably reinvented some STL stuff over the years (for performance) which would bulk it out.
Lines of code is just phrenology for software development, but a lot of people are very incentivized to believe in phrenology.
These 'metrics' are deliberately meant to trick investors into throwing money into hyped up inflated companies for secondary share sales because it sounds like progress.
The reality was the AI made an uncompilable mess, adding 100+ dependencies including importing an entire renderer from another browser (servo) and it took a human software engineer to clean it all up.
Citing the ability to turn on an endless faucet of code as a benefit and not a liability should be disqualifying.
It's only impressive if you've only ever seen code as a means to an end and SLOC never really mattered to you.
If you write code in any capacity, you'll know that high LOC counts are usually a sign of a bad time, browsers and operating systems aside.
> According to Perplexity, my AI chatbot of choice, this week‑long autonomous browser experiment consumed in the order of 10-20 trillion tokens and would have cost several million dollars at then‑current list prices for frontier models.
Don't publish things like that. At the very least link to a transcript, but this is a very non-credible way of reporting those numbers.
That implies a throughput of around 16 million tokens per second. Since coding agent loops are inherently sequential—you have to wait for the inference to finish before the next step—that volume seems architecturally impossible. You're bound by latency, not just cost.
The original post claimed they were "running hundreds of concurrent agents".
It was 2,000 concurrent agents at peak.
I'd still be surprised if that added up to "trillions" of tokens. A trillion is a very big number.
16 million a second across 2000 agents would be 8000 tokens per second per agent. This doesn't seem right to me.
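For what it's worth, here's the back-of-envelope arithmetic behind those figures, as a quick sanity check (using the lower-end 10-trillion-token estimate quoted upthread, one uninterrupted week, and the claimed 2,000-agent peak):

```python
# Sanity-check the throughput implied by the numbers quoted in this thread:
# 10 trillion tokens over one uninterrupted week, 2,000 agents at peak.
tokens = 10e12           # lower end of the "10-20 trillion" estimate
seconds = 7 * 24 * 3600  # one week = 604,800 seconds

overall_rate = tokens / seconds   # aggregate tokens per second
per_agent = overall_rate / 2000   # per agent, at the claimed peak

print(f"{overall_rate:,.0f} tokens/s overall")   # roughly 16.5 million
print(f"{per_agent:,.0f} tokens/s per agent")    # roughly 8,300
```

So the upthread figures of "around 16 million tokens per second" and "8000 tokens per second per agent" do follow from the 10-trillion estimate - the question is whether any sequential agent loop could plausibly sustain thousands of tokens per second per agent.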
I mean, it's right there in their blog - https://cursor.com/blog/scaling-agents
"We've deployed trillions of tokens across these agents toward a single goal. The system isn't perfectly efficient, but it's far more effective than we expected."