NanoChat – The best ChatGPT that $100 can buy

2025-10-13 · github.com

nanochat logo

The best ChatGPT that $100 can buy.

This repo is a full-stack implementation of an LLM like ChatGPT in a single, clean, minimal, hackable, dependency-lite codebase. nanochat is designed to run on a single 8XH100 node via scripts like speedrun.sh that run the entire pipeline start to end. This includes tokenization, pretraining, finetuning, evaluation, inference, and web serving over a simple UI so that you can talk to your own LLM just like ChatGPT. nanochat will become the capstone project of the course LLM101n being developed by Eureka Labs.

The fastest way to feel the magic is to run the speedrun script speedrun.sh, which trains and inferences the $100 tier of nanochat. On an 8XH100 node at $24/hr, this gives a total run time of about 4 hours. Boot up a new 8XH100 GPU box from your favorite provider (e.g. I use and like Lambda), and kick off the training script:

bash speedrun.sh

Alternatively, since the script runs for 4 hours, I like to launch it inside a new screen session named speedrun (and also log output to speedrun.log):

screen -L -Logfile speedrun.log -S speedrun bash speedrun.sh

See the screen cheatsheet if you are less familiar. You can watch it go inside the screen session, or detach with Ctrl-a d and tail speedrun.log to view progress. Now wait 4 hours. Once it's done, you can talk to your LLM via the ChatGPT-like web UI. Make sure again that your local uv virtual environment is active (run source .venv/bin/activate), and serve it:

python -m scripts.chat_web

And then visit the URL shown. Make sure to access it correctly, e.g. on Lambda use the public IP of the node you're on, followed by the port, so for example http://209.20.xxx.xxx:8000/, etc. Then talk to your LLM as you'd normally talk to ChatGPT! Get it to write stories or poems. Ask it to tell you who you are to see a hallucination. Ask it why the sky is blue. Or why it's green. The speedrun is a 4e19 FLOPs capability model so it's a bit like talking to a kindergartener :).

You can also cat the report.md file that appears in the project directory; it contains the "report card" of the run, i.e. a bunch of evaluations and metrics. At the very end, you'll see a summary table, for example:

  • Characters: 333,989
  • Lines: 8,304
  • Files: 44
  • Tokens (approx): 83,497
  • Dependencies (uv.lock lines): 2,004
| Metric        | BASE   | MID    | SFT    | RL     |
|---------------|--------|--------|--------|--------|
| CORE          | 0.2219 | -      | -      | -      |
| ARC-Challenge | -      | 0.2875 | 0.2807 | -      |
| ARC-Easy      | -      | 0.3561 | 0.3876 | -      |
| GSM8K         | -      | 0.0250 | 0.0455 | 0.0758 |
| HumanEval     | -      | 0.0671 | 0.0854 | -      |
| MMLU          | -      | 0.3111 | 0.3151 | -      |
| ChatCORE      | -      | 0.0730 | 0.0884 | -      |

Total wall clock time: 3h51m

(Your table might be missing the RL number by default.) For a lot more information about the speedrun script and what to look for and expect, please refer to the walkthrough I posted in the repo's Discussions: "Introducing nanochat: The best ChatGPT that $100 can buy".

Unsurprisingly, $100 is not enough to train a highly performant ChatGPT clone. In fact, LLMs are famous for their multi-million dollar capex. For our purposes, I think there are two more scales of interest. First is the ~$300 tier d26 model (i.e. depth=26), which trains in ~12 hours and slightly outperforms the GPT-2 CORE score. Second is the $1000 tier (~41.6 hours), just because it's a nice round number. Neither is fully supported yet, so neither is included in the master branch.

That said, to give a sense, training a GPT-2 grade d26 model only requires three changes to the speedrun.sh file:

...
# you'll need to download more data shards for pretraining
# get the number of parameters, multiply by 20 to get tokens, multiply by 4.8 to get chars,
# then divide by 250 million to get the number of shards. todo: need to improve this...
python -m nanochat.dataset -n 450 &
...
# use --depth to increase model size. to avoid OOM, halve device batch size 32 -> 16:
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26 --device_batch_size=16
...
# make sure to use the same later during midtraining:
torchrun --standalone --nproc_per_node=8 -m scripts.mid_train -- --device_batch_size=16

That's it! The biggest things to pay attention to are making sure you have enough data shards to train on (otherwise the code will loop and do more epochs over the same training set, decreasing learning speed a bit), and managing your memory/VRAM, primarily by decreasing the device_batch_size until things fit (the scripts automatically compensate by increasing the number of gradient accumulation loops, turning parallel compute into sequential compute).
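The shard rule of thumb from the comment in the snippet above can be sketched in Python. The constants (20 tokens per parameter, 4.8 chars per token, ~250 million chars per shard) come from that comment; the function itself is my own illustration, not code from the repo:

```python
import math

def num_shards(num_params: int,
               tokens_per_param: float = 20,    # compute-optimal rule of thumb
               chars_per_token: float = 4.8,    # from the script comment
               chars_per_shard: float = 250e6,  # ~250M chars per data shard
               ) -> int:
    """Estimate how many data shards to download for pretraining."""
    chars_needed = num_params * tokens_per_param * chars_per_token
    return math.ceil(chars_needed / chars_per_shard)

# e.g. a hypothetical 1B-parameter model:
print(num_shards(1_000_000_000))  # 384
```

Pass the result to `python -m nanochat.dataset -n <shards>` as in the snippet above.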

And a bit more about computing environments that will run nanochat:

  • The code will run just fine on an Ampere 8XA100 GPU node as well, just a bit slower.
  • All code will run just fine on even a single GPU by omitting torchrun, and will produce ~identical results (code will automatically switch to gradient accumulation), but you'll have to wait 8 times longer.
  • If your GPU(s) have less than 80GB of VRAM, you'll have to tune some of the hyperparameters or you will OOM. Look for --device_batch_size in the scripts and reduce it until things fit, e.g. from 32 (default) to 16, 8, 4, 2, or even 1. Below that, you'll need to know a bit more about what you're doing and get more creative.
  • Most of the code is fairly vanilla PyTorch, so it should run on anything that supports it - xpu, mps, etc. - but I haven't implemented this out of the box so it might take a bit of tinkering.
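The batch-size bookkeeping in the bullets above can be sketched as follows. This is a minimal illustration rather than the scripts' actual code, and the total batch of 256 rows is just an assumption for the example (the default device batch size of 32 times 8 GPUs):

```python
def grad_accum_steps(total_batch_size: int,
                     device_batch_size: int,
                     num_gpus: int) -> int:
    """Number of sequential micro-batches per optimizer step, chosen so the
    effective batch size stays fixed as the device batch size or GPU count
    shrinks (parallel compute turns into sequential compute)."""
    rows_per_step = device_batch_size * num_gpus
    assert total_batch_size % rows_per_step == 0, "batch sizes must divide evenly"
    return total_batch_size // rows_per_step

print(grad_accum_steps(256, 32, 8))  # 1: default setup, no accumulation needed
print(grad_accum_steps(256, 16, 8))  # 2: halved device batch size to avoid OOM
print(grad_accum_steps(256, 32, 1))  # 8: single GPU, hence ~8x the wall clock
```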

nanochat is designed to be short and sweet. One big advantage of this is that we can package up all of the files together and copy paste them to your favorite LLM to ask arbitrary questions. As an example, I like to package up the repo using the files-to-prompt utility like so:

files-to-prompt . -e py -e md -e rs -e html -e toml -e sh --ignore "*target*" --cxml > packaged.txt

This includes all py, md, rs, html, toml, and sh files, excludes the rustbpe/target folder, and chooses the cxml output format. Everything is written to the packaged.txt file, which currently measures ~330KB (i.e. well below the ~100K-token context of a state-of-the-art LLM), with ~8K lines of code across 45 files.
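The token estimate above is just a chars-per-token rule of thumb; the repo stats earlier (333,989 characters, ~83,497 tokens) imply roughly 4 characters per token. A minimal sketch of that conversion:

```python
def approx_tokens(num_chars: int, chars_per_token: int = 4) -> int:
    """Crude token estimate; BPE tokenizers average roughly 4 chars/token on
    English-heavy text (the pretraining data math above uses 4.8 instead)."""
    return num_chars // chars_per_token

print(approx_tokens(333_989))  # 83497, matching the repo stats above
```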

Alternatively, I recommend using DeepWiki from Devin/Cognition to ask questions of this repo. In the URL of this repo, simply change github.com to deepwiki.com, and you're off.

I haven't invested too much here but some tests exist, especially for the tokenizer. Run e.g. as:

python -m pytest tests/test_rustbpe.py -v -s

nanochat is nowhere near finished. The goal is to improve the state of the art in micro models that are accessible to work with end to end on budgets of < $1000. Accessibility is about overall cost but also about cognitive complexity - nanochat is not an exhaustively configurable LLM "framework"; there will be no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a concrete ChatGPT clone and its report card.

  • The name (nanochat) derives from my earlier project nanoGPT, which only covered pretraining.
  • nanochat is also inspired by modded-nanoGPT, which gamified the nanoGPT repo with clear metrics and a leaderboard, and borrows a lot of its ideas and some implementation for pretraining.
  • Thank you to HuggingFace for fineweb and smoltalk.
  • Thank you Lambda for the compute used in developing this project.
  • Thank you to chief LLM whisperer 🧙‍♂️ Alec Radford for advice/guidance.

If you find nanochat helpful in your research, cite it simply as:

@misc{nanochat,
  author    = {Andrej Karpathy},
  title     = {nanochat: The best ChatGPT that $100 can buy},
  year      = {2025},
  publisher = {GitHub},
  url       = {https://github.com/karpathy/nanochat}
}

MIT


Comments

  • By tehnub 2025-10-13 21:26

    Interesting exchange on the use of AI coding tools:

        curious how much did you write the code by hand of it?
    
        Karpathy: Good question, it's basically entirely hand-written (with tab autocomplete). I tried to use claude/codex agents a few times but they just didn't work well enough at all and net unhelpful, possibly the repo is too far off the data distribution.
    
    https://x.com/karpathy/status/1977758204139331904

    • By gyomu 2025-10-13 22:02

      > the repo is too far off the data distribution

      ah, this explains why these models have been useless to me this whole time. everything i do is just too far off the data distribution!

      • By SchemaLoad 2025-10-13 22:42

        Everything is, unless your app is a React todo list or leetcode questions.

        • By notatoad 2025-10-13 23:00

          people say this like it's a criticism, but damn is it ever nice to start writing a simple crud form and just have copilot autocomplete the whole thing for me.

          • By pja 2025-10-14 9:19

            Yep. I find the hype around AI to be wildly overblown, but that doesn’t mean that what it can do right now isn’t interesting & useful.

            If you told me a decade ago that I could have a fuzzy search engine on my desktop that I could use to vaguely describe some program that I needed & it would go out into the universe of publicly available source code & return something that looks as close to the thing I’ve asked for as it can find then that would have been mindblowing. Suddenly I have (slightly lossy) access to all the code ever written, if I can describe it.

            Same for every other field of human endeavour! Who cares if AI can “think“ or “do new things”? What it can do is amazing & sometimes extremely powerful. (Sometimes not, but that’s the joy of new technology!)

            • By mrugge 2025-10-14 11:26

              Why do you think what you describe being excited about does not warrant the current level of AI hype? I agree with your assessment and sometimes I think there is too much cynicism and not enough excitement.

              • By notatoad 2025-10-14 14:11

                the current level of AI hype amongst a lot of people, but especially investors and bosses, is that you can already give an AI a simple prompt and get it to spit out a fully functional, user-ready application for you. and we're so incredibly far off that.

                the things that AI is able to do are incredible, but hype levels are just totally detached from reality.

                • By pmarreck 2025-10-14 14:51

                  > is that you can already give an AI a simple prompt and get it to spit out a fully functional, user-ready application for you.

                  But it can already do that. Isn't that the whole "one-shotting" thing?

                  The problem is, of course, that it won't be optimized, maintainable or have anyone responsible you can point to if something with it goes wrong. It almost certainly (unless you carefully prompted it to) won't have a test suite, which means any changes (even fixes) to it are risky.

                  So it's basically a working mockup generator.

                  I am so, so tired of "semi-technical" youtubers showing off new models with one-shots. The vast majority of actual devs who use this stuff need it to work over long-term context windows and over multiple iterations.

                  • By derefr 2025-10-14 16:35

                    The thing is, we've already had "working mockup generators" — a.k.a. prototyping tools — for decades now.

                    If you come at the problem from the direction of "I draw a user interface; you guess what it's supposed to do and wire it up for me", then all you need to solve that problem (to a first-order approximation) is some plain-old 1970s "AI" heuristics.

                    The buzz around current AI coding prompting seems to be solely generated by the fact that while prototyping tools require you to at least have some training as a designer (i.e. understanding the problem you're solving on the level of inputs and outputs), these tools allow people with no experience in programming or design to get results. (Mainly by doing for UIs what genAI image/video tools do for art: interpolating the average of many ingested examples of how a designer would respond to a client request for X, with no regard for the designer's personal style†.)

                    † Unless prompted to have such regard... but if you know enough to tell the AI how to design everything, then you may as well just design everything. Just as, if you know art well enough to prompt an AI into developing a unique art style, then you likely know art well enough to just make that same art yourself with less effort than it takes to prompt and re-prompt and patch-erase-infill-prompt the AI into drawing what you want.

                  • By notatoad 2025-10-14 16:16

                    from what i can tell, the one-shot thing only works on youtube.

                    you might produce something that looks usable at first, but the actual application functionality will be significantly broken in most ways. it maybe works enough to do a demo for your video, but it won't work enough to actually distribute to end-users. and of course, as you say, it's not testable or maintainable in any way, so fixing what's broken is a bigger project than just writing it properly in the first place.

              • By chamomeal 2025-10-14 16:49

                I think the cynicism is only on software dev circles, and it’s probably a response to the crazy hype.

                Remember the hype isn’t just “wow it’s so cool and amazing and useful”, it’s also “I can’t wait to fire all my dumb meat-based employees”

              • By SchemaLoad 2025-10-14 22:04

                Because to justify the current hype and spending, these companies have to have a product that will generate trillions of dollars and create mass unemployment. Which they don't have.

              • By afpx 2025-10-14 14:06

                The current AI hype is causing a lot of leaders to put their organizations on the path to destruction.

              • By pja 2025-10-14 12:50

                Oh sure, there’s also way too much cynicism in some quarters. But that’s all part of the fun.

            • By fragmede 2025-10-14 12:09

              They go beyond merely "return something that looks as close to the thing I’ve asked for as it can find". E.g.: say we asked for "A todo app that has 4 buttons on the right that each play a different animal sound effect for no good reason and also you can spin a wheel and pick a random task to do". That isn't something that already exists, so in order to build it, the LLM has to break that down, look for appropriate libraries and source, decide on a framework to use, and then glue those pieces together cohesively. That didn't come from a singular repo off GitHub. The machine had to write new code to fulfill my request. Yeah, some of it existed in the training data somewhere, but not arranged exactly like that. The LLM had to do something in order to glue those together in that way.

              Some people can't see past how the trick is done (take training data and do a bunch of math/statistics on it), but the fact that LLMs are able to build the thing is in-and-of-itself interesting and useful (and fun!).

              • By pja 2025-10-14 12:47

                I’m aware. But the first part is “find me something in the vector space that looks something like the thing I’m asking for”. Then the rest is vibes. Sometimes the vibes are good, sometimes they are ... decidedly not.

                If the results are useful, then that’s what matters. Although I do suspect that some AI users are spending more time pulling the AI one-armed bandit handle than it would take them to just solve their problem the old fashioned way a lot of the time - but if pulling the one-armed bandit gets them a solution to their problem that they wouldn’t work up the motivation to solve themselves then that counts too, I guess.

          • By goalieca 2025-10-14 0:15

            Back in the 90s you could drag and drop a vb6 applet in Microsoft word. Somehow we’ve regressed..

            Edit: for the young, wysiwyg (what you see is what you get) was common for all sorts of languages from c++ to Delphi to html. You could draw up anything you wanted. Many had native bindings to data sources of all kinds. My favourite was actually HyperCard because I learned it in grade school.

            • By squeaky-clean 2025-10-14 1:41

              Wysiwyg kind of fell apart once we had to stop assuming everyone had an 800x600 or 1024x768 screen, because what you saw was no longer what others got.

              • By benterix 2025-10-14 10:36

                Not entirely, in these RAD tools you also had flexible layout choices and obviously you could test it for various window sizes (although the maximum was the one supported by your graphics card). Too bad many chose the lazy way and just enforced fixed window size at 800x600.

              • By hackit2 2025-10-14 1:53

                Most of the internet still assumes you're using a 96 DPI monitor. Though the rise of the mobile phone has changed that, it seems like the vast majority of the content consumed on mobile lends itself to being scaled to any DPI - e.g. movies, pictures, youtube etc.

              • By eternauta3k 2025-10-14 9:01

                Not a big issue with QT layouts (still have to test the result though)

              • By philipallstar 2025-10-14 15:20

                I can imagine adding breakpoints to a wysiwyg editor being not terribly difficult. They decouple presentation from logic pretty well.

            • By mcmoor 2025-10-14 3:33

              I still miss my days of programming Visual Basic 6. Nothing since then ever compares.

            • By ako 2025-10-14 5:47

              4gl or RAD is still here, but now it’s called low- or no-code.

          • By chairmansteve 2025-10-14 0:52

            I agree. I am "writing" simple crud apps for my own convenience and entertainment. I can use unfamiliar frameworks and languages for extra fun and education.

            Good times!

          • By Arisaka1 2025-10-14 8:03

            Before copilot what I'd do is diagnose and identify the feature that resembles the one that I'm about to build, and then I'd copy the files over before I start tweaking.

            Boilerplate generation was never, ever the bottleneck.

          • By Mabusto 2025-10-14 16:04

            I've been using AI like this as well. The code-complete / 'randomly pop up a block of code while typing' feature was cool for a bit but soon became annoying. I just use it to generate a block of boilerplate code or to ask it questions, I do 90% of the 'typing the code' bit myself, but that's not where most programmers time is spent.

            • By notatoad 2025-10-14 16:11

              i'm not sure when you tried it, but if you've had copilot disabled it might be worth giving it another go. in my totally anecdotal experience, over the last few months it's gotten significantly better at shutting up when it can't provide anything useful.

          • By Tade0 2025-10-14 10:50

            It is, because the frontend ecosystem is not just React. There are plenty of projects where LLMs still give weird suggestions just because the app is not written in React.

          • By giancarlostoro 2025-10-14 14:03

            I've probably commented the same thing like 20 times, but my rule of thumb and use with AI / "vibe coding" is two-fold:

            * Scaffolding first and foremost - It's usually fine for this, I typically ask "give me the industry standard project structure for x language as designed by a Staff level engineer" blah blah just give me a sane project structure to follow and maintain so I don't have to wonder after switching around to yet another programming language (I'm a geek, sue me).

            * Code that makes sense at first glance and is easy to maintain / manage, because if you blindly take code you don't understand, you'll regret it the moment you need to be called in for a production outage and you don't know your own codebase.

          • By smrtinsert 2025-10-14 14:10

            "Anything that can be autogenerated by a computer shouldn't have to be, it can be automated"

          • By tclancy 2025-10-14 4:39

            People say inbreeding like it’s criticism too.

        • By meowface 2025-10-14 2:47

          HN's cynicism towards AI coding (and everything else ever) is exhausting. Karpathy would probably cringe reading this.

          • By benterix 2025-10-14 10:32

            First, it's not cynicism but a more realistic approach than blindly following SV marketing, and second, it's not "everything else", just GenAI, NFTs/ICOs/Web3, "Metaverse" (or Zuck's interpretation of it), self-driving cars ready today, maybe a bit of Theranos.

            • By stingraycharles 2025-10-14 12:05

              I’ve recently written a message queue <> database connector in Go using Claude Code, checkpointing, recovery, all that stuff built in.

              I’d say it made me around 2x as productive.

              I don’t think the cynicism of HN is justified, but I think what people forget is that it takes several months of really investing a lot of time into learning how to use AI well. If I see some of the prompts people give, and expect it to work, yeah no wonder that only works for React-like apps.

              • By cantor_S_drug 2025-10-14 12:20

                I asked AI to create a basic autoencoder based deep learning architecture for classifying time series data. This AI is a boon.

            • By meowface 2025-10-15 3:22

              The thing is cryptocurrency and metaverse stuff was obvious bullshit from day one while even GPT-3 was clearly a marvel from day one. It's a false pattern match.

          • By trial3 2025-10-14 3:01

            okay but he literally does have a bridge that non-deterministically might take you to the wrong place to sell you

            • By meowface 2025-10-14 4:49

              The original context of this sub-thread was Karpathy saying how AI coding tools were pretty useless for him when working on this particular project.

              • By troupo 2025-10-14 7:34

                Indeed. And only Karpathy is entitled to say that AI tools produce wrong code for him. And he's only entitled to say it for this project only.

                If anyone else says this, "the skepticism is exhausting", and their experience is completely irrelevant.

                • By kasey_junk 2025-10-14 10:35

                  Go look at the comments on HN whenever someone posts about their AI coding workflow. It will be littered with negative comments that either imply or outright say that the poster is either shilling, ignorant or working only on toy examples.

                  The grievance attitude seems to exist in both directions and is actually what is exhausting.

                  • By troupo 2025-10-14 11:19

                    > It will be littered with negative comments that either imply or outright say that the poster is either shilling, ignorant or working only on toy examples.

                    And they would often be right. Coupled with the fact that most of the glowing "omg I only code with AI" posts don't even try to show what code or products they are working on.

                    And yes, the absolute vast majority of people who are skeptical are skeptical precisely because they use these tools every day themselves.

                    • By kasey_junk 2025-10-14 12:25

                      Just so we are clear, you are upset by people dismissing your experience gained skepticism but have no problem dismissing every positive comment as a shill, ignorant or simple?

                      You don’t see any dissonance in that? It’s only the positive people that are exhausting?

                      • By troupo 2025-10-14 16:32

                        I myself post positive comments about AI from time to time.

                        I never pretend that AI is the be-all end-all of programming, don't claim that it can do all the magical things, or that it's capable of running for hours on end just creating software with no proof, like most positive posts do.

                        See the difference?

                        I'm all for positive posts. I'm against childish belief in magic: https://dmitriid.com/everything-around-llms-is-still-magical...

                  • By AlexeyBelov 2025-10-16 6:20

                    Show HNs about AI startups are littered with positive comments though. It's rare to see people calling out the submitter.

                  • By hirako2000 2025-10-14 15:09

                    Posts about yet another AI workflow, typically presented with hyperbole, are exhausting. The backfires are rather pleasing, entertaining at the least.

          • By hansmayer 2025-10-14 10:54

            I mean Karpathy himself wrote that he could not use the AI tools for the project, so he had to handwrite most of it. I wonder why.

            • By kannanvijayan 2025-10-14 11:29

              One of my hobby projects is an esoteric game engine oriented towards expressing simulation mechanics. I simply do not use agentic tools when editing the core code for this project (mostly rust and wgsl). It always stumbles, and leaves code that I need to fix up manually, and even then feel unsure about. I've tried a few different agents, including the current top of the line. The power is just not there yet.

              At the same time, these tools have helped me reduce the development time on this project by orders of magnitude. There are two prominent examples.

              --- Example 1:

              The first relates to internal tooling. I was debugging a gnarly problem in an interpreter. At some point I had written code to do a step-by-step dump of the entire machine state to file (in json) and I was looking through it to figure out what was going wrong.

              In a flash of insight, I asked my AI service (I'll leave names out since I'm not trying to promote one over another) to build a react UI for this information. Over the course of a single day, I (definitely not a frontend dev by history) worked with it to build out a beautiful, functional, easy to use interface for browsing step-data for my VM, with all sorts of creature comforts (like if you hover over a memory cell, and the memory cell's value happens to be a valid address to another memory cell, the target memory cell gets automatically highlighted).

              This single tool has reduced my debugging time from hours or days to minutes. I never would have built the tool without AI support, because I'm simply not experienced enough in frontend stuff to build a functional UI quickly.. and this thing built an advanced UI for me based on a conversation. I was truly impressed.

              --- Example 2:

              As part of verifying correctness for my project, I wanted to generate a set of tests that validated the runtime behaviour. The task here consists of writing a large set of reference programs, and verifying that their behaviour was identical between a reference implementation and the real implementation.

              Half decent coverage meant at least a hundred or so tests were required.

              Here I was able to use agentic AI to reduce the testcase construction time from a month to about a week. I asked the AI to come up with a coverage plan and write the test case ideas to a markdown file in an organized, categorized way. Then I went through each category in the test case markdown and had the AI generate the test cases and integrate them into the code.

              ---

              I was and remain a strong skeptic of the hype around this tech. It's not the singularity, it's not "thinking". It's all pattern matching and pattern extension, but in ways so sophisticated that it feels like magic sometimes.

              But while the skeptical perspective is something I value, I can't deny that there is core utility in this tech that has a massive potential to contribute to efficiency of software development.

              This is a tool that we as industry are still figuring out the shape of. In that landscape you have all sorts of people trying to evangelize these tools along their particular biases and perspectives. Some of them clearly read more into the tech than is there. Others seem to be allergically reacting to the hype and going in the other direction.

              I can see that there is both noise, and fundamental value. It's worth it to try to figure out how to filter the noise out but still develop a decent sense of what the shape of that fundamental value is. It's a de-facto truth that these tools are in the future of every mainstream developer.

            • By meowface 2025-10-15 3:27

              That's exactly why I said he would cringe at it. Seeing someone look at him saying "it's not able to make a good GPT clone" and going "yeah it's useless for anything besides React todo list demos" would definitely evoke some kind of reaction. He understands AI coding agents are neither geniuses nor worthless CRUD monkeys.

              • By hansmayer 2025-10-15 12:58

                Hm, interesting point. So if he and other GenAI hotshots understand that, why do they keep selling the tools as precisely no less than geniuses? Often with a bit of fear-mongering about all the jobs that will be lost soon, etc.?

        • By SeanAnderson 2025-10-13 22:56

          or a typical CRUD app architecture, or a common design pattern, or unit/integration test scaffolding, or standard CI/CD pipeline definitions, or one-off utility scripts, etc...

          Like 80% of coding is just being a glorified autocomplete, and AI is exceptional at automating those aspects. Yes, there is a lot more to being a developer than writing code, but, in those instances, AI really does make a difference in the amount of time one is able to spend focusing on domain-specific deliverables.

          • By MasterScrat 2025-10-14 0:38

            And even for "out of distribution" code you can still ask questions about how to do the same thing but more optimized, whether a library could help, why a piece of code is giving unexpected output, etc.

          • By positron26 2025-10-13 23:11

            It has gotten to the point that I don't modify or write SQL. Instead I throw some schema and related queries in and use natural language to rubber duck the change, by which point the LLM can already get it right.

        • By KeplerBoy 2025-10-14 7:37

          I don't know. I successfully use it for small changes on VHDL FPGA designs these days.

        • By allochthon 2025-10-14 20:15

          I've had some success with a multi-threaded software defined radio (SDR) app in Rust that does signal processing. It's been useful for trying something out that's beyond my experience. Which isn't to say it's been easy. It's been a learning experience to figure out how to work around Claude's limitations.

        • By lukev 2025-10-14 16:22

          Generative AI for coding isn't your new junior programmer, it's the next generation of app framework.

          • By Zenst 2025-10-14 16:31

            I wished such sentiments prevailed in upper management, as it is true. Much like owning a car that can drive itself - you still need to pass a driving test to be allowed to use it.

        • By SalmoShalazar 2025-10-14 18:00

          Really such an annoying genre of comment. Yes I’m sure your groundbreaking bespoke code cannot be written by LLMs, however for the rest of us that build and maintain 99% of the software people actually use, they are quite useful.

        • By dahcryn 2025-10-14 15:21

          Simple CRUD, as is common in many business applications or backend portals, is a good fit for AI assistance imho. As is fixing some designs here and there, where you can't be bothered to keep track of the latest JS/CSS framework.

      • By teleforce 2025-10-14 1:12

        I wonder if the new GenAI architecture being discussed recently, namely DDN or Discrete Distribution Networks, can outperform the conventional GAN and VAE architectures. As the name suggests, it can provide a multitude of distributions for training and inference purposes [1].

        [1] Show HN: I invented a new generative model and got accepted to ICLR (90 comments):

        https://news.ycombinator.com/item?id=45536694

      • By CapsAdmin 2025-10-14 8:28 (2 replies)

        I work on a typed Lua language written in Lua, and sometimes use LLMs to help fix internal analyzer stuff. That works maybe 30% of the time for complex problems, and sometimes not at all, but it helps me find a solution in the end.

        However, when I ask an LLM to generate my typed Lua code, with examples and all of how the syntax is supposed to look, it mostly gets it wrong.

        My syntax for tables/objects is: local x: {foo = boolean}

        but an LLM will most likely gloss over this and always use : instead of =, writing local x: {foo: boolean}

        • By pmarreck 2025-10-14 15:04

          I've had success in the past with getting it to write YueScript/Moonscript (which is not a very large part of its training data) by pointing it to the root URL for the language docs and thus making that part of the context.

          If your typed version of Lua has a syntax checker, you could also have it try to use that first on any code it's generated

        • By kasey_junk 2025-10-14 10:37 (1 reply)

          Are you using a coding agent or just an LLM chat interface? Do you have a linter or compiler hooked up to the agent that will catch the misuse?

          • By CapsAdmin 2025-10-14 11:12 (1 reply)

            I've dabbled with claude code in this particular project, but not much. My short experience with it is that it's slow, costly and goes off the rails easily.

            I prefer to work with more isolated parts of the code. But again, I don't really know all that much about agents.

            One thing I wanted to do on my project is reorganize all the tests, which sounds like an agent job. But I'd imagine I need to define some hard programmatic constraints to make sure tests are not lost or changed in the process.

            • By kasey_junk 2025-10-14 12:22

              Agents aren’t magic. They are loops with tool calls in them that help keep agents on track. And most of the agent systems have some manner of hook that you can put your own tools in to enforce things like types and styles.

              I’ve had good experiences writing small scripts and linters to enforce things that agents get wrong frequently. What’s nice about those is that the agents are very good at writing them and they are easy to verify. Plus they are valuable for new human devs as well.
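              As a concrete illustration of that kind of guard script (everything here is hypothetical: the file contents, the test_ naming convention, and the failure message), a few lines of Python can count test functions before and after an agent's refactor and flag any loss:

```python
import ast

def count_tests(source: str) -> int:
    """Count functions named test_* in a Python source string."""
    tree = ast.parse(source)
    return sum(
        isinstance(node, ast.FunctionDef) and node.name.startswith("test_")
        for node in ast.walk(tree)
    )

before = "def test_a(): pass\ndef test_b(): pass\n"
after = "def test_a(): pass\n"  # imagine the agent dropped a test during the refactor

if count_tests(after) < count_tests(before):
    print("FAIL: tests were lost during the refactor")
```

              The same pattern (parse, count, compare against a baseline) works for any invariant that is cheap to check mechanically.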

      • By random_cynic 2025-10-14 9:58

        [dead]

    • By rootusrootus 2025-10-13 22:35 (2 replies)

      That is a good thing to hear from someone as reputable as Karpathy. The folks who think we're on the cusp of AGI may want to temper their expectations a bit.

      I do love Claude Code, because one thing I periodically need to do is write some web code, which is not my favorite type of coding but happens to have incredibly good coverage in the training data. Claude is a much better web developer than I am.

      But for digging into the algorithmic core of our automation tooling, it doesn't have nearly as much to work with and makes far more mistakes. Still a net win I'm happy to pay for, even if it's never anything more than my web developer slave.

      • By vunderba 2025-10-14 0:05 (2 replies)

        100%. I find the "LLMs are completely useless" and the "LLMs will usher in a new era of messianic programming" camps to be rather reductive.

        I've already built some pretty large projects [1] with the assistance of agentic tooling like Claude Code. When it comes to the more squirrely algorithms and logic, they can fall down pretty hard. But as somebody who is just dreadful at UI/UX, having it hammer out all the web dev scaffolding saves me a huge amount of time and stress.

        It's just a matter of tempering one's expectations.

        [1] https://animated-puzzles.specr.net

        • By ggsp 2025-10-14 8:57 (1 reply)

          Hey, thank you for making this—I really enjoyed playing it and it feels like it fits the mental-reward-between-work-tasks need. It did spin up my M1's fans after a few minutes which is a rather rare occurrence, but I'm guessing that's par for the course when you're working with a bunch of video on canvas. Either way, hope I remember it the next time I'm looking for a puzzle to solve while I take a break :)

          • By JLC443 2025-10-14 14:12

            Just thought I'd add to this thread that I also had a lot of fun playing this game, and I don't normally enjoy puzzles on the computer!

            A couple of very minor pieces of feedback, if you're open to it: The camera momentum when dragging felt a little unnatural. The videos seemed to have a slightly jumpy framerate and were a bit low-resolution when zoomed in.

            Honestly though, those are minor nitpicks. It's a really fun and polished experience. Thanks for sharing!

        • By meowface 2025-10-14 2:48 (1 reply)

          >and the "LLMs will usher in a new era of messianic programming" camps

          Well, this one might still be borne out. It's just silly to think it's the case right now. Check in again in 10 years and it may be a very different story. Maybe even in 5 years.

          • By handfuloflight 2025-10-14 3:03

            What do we build now to reap the coming of the messianic era?

      • By bdangubic 2025-10-13 22:50 (1 reply)

        > But for digging into the algorithmic core of our automation tooling

        What I find fascinating is reading this same thing in other contexts, like a "UI guru" saying "I would not let CC touch the UI, but I let it rip on the algorithmic core of our automation tooling cause it is better at it than me…"

        • By Filligree 2025-10-13 23:05 (1 reply)

          Both can be true. LLMs tend to be mediocre at (almost) everything, so they're always going to be worse than the user at whatever the user is an expert in.

          But 'mediocre' isn't 'useless'.

          • By rootusrootus 2025-10-14 0:05

            I completely agree. I'm definitely not an expert web developer. I know enough to build functional tools, but it's not exactly art that I'm making. But the core of our tooling is my primary focus, I wrote it, I've spent a lot of time perfecting it. Claude can easily impress me with things like the CSS magic it weaves, because I am unsophisticated.

    • By SeanAnderson 2025-10-13 22:18 (4 replies)

      This makes sense, right? It's a relatively novel thing to be writing. I don't find it to be a damning remark like other comments here seem to be concluding.

      If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.

      • By simonw 2025-10-13 22:34 (1 reply)

        Yeah, if your goal is "build the tightest 8,000 line implementation of training an LLM from scratch, with a focus on both conciseness and educational value" I don't think it's particularly surprising that Claude/Codex weren't much help.

        • By fragmede 2025-10-14 10:56 (1 reply)

          Now to wait for Sonnet 5 and GPT-6, and ask them to build that, and see what they come up with.

          • By Tepix 2025-10-14 11:25 (1 reply)

            Why would you expect an improvement?

            • By bjord 2025-10-14 11:56

              because they'll be trained on karpathy's implementation

      • By JustFinishedBSG 2025-10-14 7:35

        > This makes sense, right? It's a relatively novel thing to be writing.

        It's really not, though? Honestly, I'm surprised coding agents apparently fail hard at this task.

      • By krackers 2025-10-14 5:38 (1 reply)

        It's not _that_ far off distribution though. The math and concepts are well understood.

        • By nomel 2025-10-15 0:26

          That's not really how LLMs work, though. It's fundamentally next-word prediction, based on statistics of the context. Reordering ideas (which can drastically change the outcome) can result in a statistically rare context. The silly failures on simple riddles [1], and the like, demonstrate this well.

          The riddle issue is putting trivial ideas together, but combining them in a statistically rare way, giving low-quality output that tends towards the statistically significant answer, even if it's incorrect. The same thing happens with coding, when combining well known things together in uncommon ways.

          Worse (as with the riddle problem), nearby concepts that have strong statistics are going to act like attractors, with the LLM always trending towards those, removing and rewriting bits of code to better accommodate those, even if they're the opposite of what you want. I have this happen all the time in my somewhat obscure work. It'll rewrite key maths in my code to be the statistically significant textbook example, which is not what I need. I'll fix it manually or point it out, and a few changes later, it'll rewrite it again. A reasonable way around this is to "pin" the concept with a very strongly worded negative comment, like "DO NOT USE DOT PRODUCT. THIS IS NOT VECTOR PROJECTION. DO NOT modify the next line of code.".

          [1] Claude 4.1 Opus:

          > Prompt: "A son and his mother are in a car accident. They’re both rushed to the hospital. The doctor is about to operate on the son. But, before he does, he looks down and says "This is my son! I can’t operate on this boy!". How can this be?"

          > Response: "The doctor is the boy's father.

          This is a classic riddle that plays on assumptions people might make about gender and professions. The boy has two parents - his mother (who was in the car with him) and his father (the doctor at the hospital). The riddle works because some people automatically assume the doctor must be male, forgetting that the boy's father could be the doctor since his mother was already mentioned as being in the accident."

          Another, with output that doesn't match the goal, statistically attracted to the riddle:

          > Prompt: "A man, a sheep, and a wolf are on one side of the river, with a boat that can only hold two. How can the man safely get the boat to the other side of the river, without the sheep being eaten?"

      • By bringmeiron 2025-10-13 23:28

        > If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.

        This is good for bitcoin.

    • By kubb 2025-10-14 13:25

      He probably just doesn’t know how to prompt correctly (heheh).

    • By satvikpendem 2025-10-14 1:56 (1 reply)

      It's funny that the coiner of the term "vibe coding" has eventually found it not useful anymore.

      • By JimDabell 2025-10-14 4:07

        That’s not what he said. This is the new project:

        > My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it.

        This is how he described vibe coding:

        > There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.

        Vibe coding is clearly aimed at having fun hacking around on something that doesn’t matter, and he’s doing the opposite of that with this project. The fact that he’s not using vibe coding for something that is completely inappropriate for vibe coding is neither surprising nor a failure of vibe coding.

    • By samus 2025-10-14 21:52

      The llama.cpp maintainers working on supporting Qwen3-next are also not enthused by LLM output. They had to go over everything and fix it up.

      https://github.com/ggml-org/llama.cpp/pull/16095#issuecommen...

    • By martingalex2 2025-10-14 16:53

      Isn't the point that, now that Andrej's published this, it will be in-distribution soon?

    • By RA_Fisher 2025-10-14 13:20

      > too far off the data distribution.

      I guess his prompts couldn’t provide sufficient information either (there’s no limit). Sounds more like a user issue to me. :) I don’t think there’s anyone that can type faster than ChatGPT.

    • By nurettin 2025-10-14 14:28

      Backprop and transformers aren't exactly off-the-grid coding, but I can see how it would require a lot of patience to force Claude into writing this.

    • By dude250711 2025-10-13 22:13

      How convenient! You know, my code is somewhat far off the data distribution too.

    • By oblio 2025-10-13 21:47

      We're still not ready for ouroboros.

    • By hansmayer 2025-10-14 10:52

      ... or maybe he just forgot to include the claude.md ? :)

    • By bringmeiron 2025-10-13 23:26 (1 reply)

      Clearly he has little idea what he's talking about.

      AI can write better code than 99% of developers. This embarrassingly anti-AI shill included.

      If he used the AI tool my company is developing the code would have been better and shipped sooner.

  • By montebicyclelo 2025-10-13 20:28 (3 replies)

    > nanochat is also inspired by modded-nanoGPT

    Nice synergy here, the lineage is: Karpathy's nanoGPT -> Keller Jordan's modded-nanoGPT (a speedrun of training nanoGPT) -> nanochat

    modded-nanoGPT [1] is a great project, well worth checking out, it's all about massively speeding up the training of a small GPT model.

    Notably it uses the author's Muon optimizer [2], rather than AdamW, (for the linear layers).

    [1] https://github.com/KellerJordan/modded-nanogpt

    [2] https://kellerjordan.github.io/posts/muon/

    • By varunneal 2025-10-13 20:53 (4 replies)

      Muon was invented by Keller Jordan (and then optimized by others) for the sake of this speedrunning competition. Even though it was invented less than a year ago, it has already been widely adopted as SOTA for model training

      • By tbalsam 2025-10-13 21:07 (1 reply)

        This is the common belief but not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added practical things to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc).

        Both share equal credit I feel (also, the paper's co-authors!), both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.

        (Source: am experienced speedrunner who's been in these circles for a decent amount of time)

        • By varunneal 2025-10-14 16:16

          I think it's good to bring up Bernstein & Newhouse as well as Yuchen Jin, Jiacheng You and the other speedrunners who helped iterate on Muon. But I think it's very fair to call Keller Jordan the main author of Muon in its current form. I'm also in the speedrunning community, though maybe not for as long as you have.

      • By swyx 2025-10-13 21:53 (1 reply)

        Sharing some useful resources for learning Muon (since I'm also just catching up on it):

        - https://x.com/leloykun/status/1846842883967692926

        - https://www.yacinemahdid.com/p/muon-optimizer-explained-to-a...

      • By kouteiheika 2025-10-14 15:12

        The most exciting thing about Muon for me is that it requires half the state of Adam while having either equivalent or better performance. That's amazing if you are VRAM limited! And just like Adam, you can also quantize it. I can get it to work relatively well as low as 4-bit, which essentially cuts down the memory requirements from full 32-bit Adam by a factor of 16x! (And by a factor of 4x vs 8-bit Adam).
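        The 16x and 4x figures follow from Adam keeping two moment buffers per parameter versus Muon's single momentum buffer; a sketch of the arithmetic, using the bit widths stated above:

```python
# Optimizer state per parameter, in bits (assumptions spelled out inline):
adam_fp32 = 32 * 2   # Adam/AdamW keep two moments (m, v) at 32 bits each -> 64 bits
adam_int8 = 8 * 2    # 8-bit Adam -> 16 bits
muon_fp32 = 32 * 1   # Muon keeps a single momentum buffer -> 32 bits (half of Adam)
muon_4bit = 4 * 1    # 4-bit quantized Muon -> 4 bits

print(adam_fp32 // muon_4bit)  # 16x smaller than full 32-bit Adam
print(adam_int8 // muon_4bit)  # 4x smaller than 8-bit Adam
```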

      • By ComplexSystems 2025-10-14 8:37 (1 reply)

        I haven't heard of this before. Has Muon dethroned Adam and AdamW as the standard general purpose optimizer for deep learning?

        • By spyder 2025-10-14 15:55

          It's for hidden layers, not for every parameter. From Keller's Muon GitHub page:

          "Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."

          And I just looked into the nanochat repo, and that's also how it's used here.

          https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
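          A toy sketch of that split (parameter names and shapes are made up for illustration; this mirrors the quoted guidance, not nanochat's actual code):

```python
def route_param(name: str, shape: tuple) -> str:
    """Route 2-D hidden weight matrices to Muon; embeddings, the
    classifier head, and 1-D gains/biases to AdamW."""
    if "embed" in name or "lm_head" in name:
        return "adamw"
    return "muon" if len(shape) == 2 else "adamw"

params = {
    "tok_embed.weight": (65536, 1280),       # embedding -> AdamW
    "block0.attn.qkv.weight": (3840, 1280),  # hidden matrix -> Muon
    "block0.norm.gain": (1280,),             # 1-D gain -> AdamW
    "lm_head.weight": (65536, 1280),         # classifier head -> AdamW
}
for name, shape in params.items():
    print(f"{name} -> {route_param(name, shape)}")
```

          In a real training script each group would then be handed to its own optimizer instance.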

    • By echelon 2025-10-13 21:13 (2 replies)

      8xH100 is pretty wild for a single inference node.

      Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?

      At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700ish requests per hour. About $0.01 per request.

      Is my math wrong?
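      The arithmetic checks out under those assumptions, with the caveat that it treats requests as served strictly one at a time:

```python
hourly_cost = 8.0          # the comment's assumed $/hr for the node
seconds_per_request = 5.0  # assumed time to fulfill one request, served sequentially

requests_per_hour = 3600 / seconds_per_request      # 720.0
cost_per_request = hourly_cost / requests_per_hour  # ~$0.011
print(requests_per_hour, round(cost_per_request, 4))
```

      As a reply below notes, real inference servers batch many requests in parallel, which lowers the per-request cost well below this ceiling.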

      • By vessenes 2025-10-13 21:14 (1 reply)

        This is the spec for a training node. The inference requires 80GB of VRAM, so significantly less compute.

        • By andai 2025-10-14 10:07

          The default model is ~0.5B params right?

      • By Tepix 2025-10-13 21:32

        As vessenes wrote, that's for training. But an H100 can also process many requests in parallel.

  • By sammyd56 2025-10-13 19:36 (4 replies)

    I'm doing a training run right now (started 20min ago). You can follow it at https://api.wandb.ai/links/sjd333-none/dsv4zkij

    Will share the resulting model once ready (4 hours from now) for anyone to test inference.

    • By sammyd56 2025-10-14 0:04 (1 reply)

      I've uploaded the model here: https://huggingface.co/sdobson/nanochat

      I didn't get as good results as Karpathy (unlucky seed?)

      It's fun to play with though...

      User: How many legs does a dog have? Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)

      • By simonw 2025-10-14 0:44 (3 replies)

        I got your model working on CPU on macOS by having Claude Code hack away furiously for a while. Here's a script that should work for anyone: https://gist.github.com/simonw/912623bf00d6c13cc0211508969a1...

        You can run it like this:

          cd /tmp
          git clone https://huggingface.co/sdobson/nanochat
          uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
            --model-dir /tmp/nanochat \
            --prompt "Tell me about dogs."

        • By sammyd56 2025-10-14 1:44

          This is a much easier way to run the model. I'm going to update the huggingface README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...

        • By vessenes 2025-10-14 1:16 (1 reply)

          Simon, I had to run "brew install git-lfs && cd nanochat && git lfs install && git lfs pull" and then it worked. Before then, the model weights didn't get cloned by default for me on macOS.

          % uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
              --model-dir nanochat/ --prompt "who is simonw on hacker news?"
          Using device: cpu
          Loading model from nanochat/model_000650.pt
          Loading metadata from nanochat/meta_000650.json
          Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
          Loading model weights (this may take a minute for a 2GB model)...
          Converting model to float32 for CPU...
          Model loaded successfully!
          Loading tokenizer...
          Tokenizer loaded successfully!

          Prompt: who is simonw on hacker news?
          Encoded to 9 tokens

          Generating...
          --------------------------------------------------
          who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.

          In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker
          --------------------------------------------------

          • By homeless_engi 2025-10-14 5:20

            Adding on: Claude also gave me the following line which was necessary to get the model weights to download from HF. This might be obvious for anyone familiar with HF but it helped me so sharing here!

            git lfs install

        • By iamcreasy 2025-10-14 1:42 (1 reply)

          For anyone curious this is the error when running uv sync on macos,

          > uv sync
          Resolved 88 packages in 3ms
          error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform

          hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels

          Also, /tmp/nanochat expects all the contents of the tokenizer and chatsft_checkpoints folders.

          • By stoobs 2025-10-14 10:34 (1 reply)

            Yeah, that's because CUDA on a Mac isn't a thing - it could be swapped to the normal torch package, but you'd have to do some code patching to make sure it's running on MPS, and even then some of the code may need rewriting/patching if there's no MPS version of the CUDA kernels.

            • By iamcreasy 2025-10-14 22:22

              Isn't there a common PyTorch API that could choose an OS/hardware-specific backend automatically? Or is this project hard-coding the CUDA variant of PyTorch as a requirement?
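              For what it's worth, PyTorch exposes runtime availability checks - torch.cuda.is_available() and torch.backends.mps.is_available() - that code commonly uses to pick a backend; the macOS failure above comes from the project pinning the CUDA-specific wheel (+cu128), not from the API itself. A dependency-free sketch of the usual fallback logic (the boolean flags stand in for those real calls):

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's MPS backend, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"

# On a machine with neither GPU backend available:
print(pick_device(cuda_available=False, mps_available=False))  # cpu
```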

    • By Lerc 2025-10-13 20:53 (2 replies)

      The comment beside the first chart

      >Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".

      Is so blindingly obvious that I'm ashamed to think I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and compare how well my tokenizer actually did against how well I imagined it did.

      • By SeanAnderson 2025-10-13 22:27

        ELI5 for anyone else (I had to have this explained to me):

        When you train a language model, it tries to predict the next token.

        We measure how good it is at that using loss aka how surprised it was by the real answer.

        Different models might use different token lengths. So, if you describe loss relative to tokens then you can't easily compare the performance of two models that use different token lengths.

        So, compare loss to bytes of text data instead.
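        A minimal sketch of the conversion (all numbers here are made up): total the cross-entropy loss in nats over the evaluation text, divide by ln(2) to convert to bits, then divide by the byte count of that text:

```python
import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert summed cross-entropy (in nats, over all predicted tokens)
    into bits per byte of the underlying text, a tokenizer-invariant metric."""
    return total_loss_nats / (math.log(2) * total_bytes)

# Hypothetical numbers: 1000 tokens at an average loss of 3.0 nats,
# covering 4200 bytes of raw text.
print(round(bits_per_byte(1000 * 3.0, 4200), 2))  # ~1.03
```

        Because the denominator is bytes rather than tokens, two models with different tokenizers can be compared directly.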

      • By typpilol 2025-10-13 22:13 (3 replies)

        Why hasn't anyone made a tokenizer that's 1 character per token? Is it because it requires an insane amount of compute?

        Or would the loss of efficiency make it dumber than modern tokenizers?

        • By nl 2025-10-14 0:06

          Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.

          Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.

          [1] https://aclanthology.org/P16-1162/

        • By SeanAnderson 2025-10-13 22:49

          yes to both.

          absolutely requires longer training time and more compute.

          once trained, predictions need to hold through many more steps because each step processes one token. if a token early in a sentence heavily implies a token will occur later in the sentence then that awareness needs to be maintained while processing each intermediary token and each step is a bit lossy. the fewer steps you need to take before leveraging that knowledge the better the prediction.

          if you had infinite compute and data for training then performance would be equivalent though, i think.

        • By skirmish 2025-10-13 22:47

          Since OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed "1 char per token tokenizer", the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
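          A quick sketch of what that means for a fixed context window (the window size here is an arbitrary example; 4.2 chars/token is the estimate above):

```python
context_window_tokens = 8192  # illustrative context length, in tokens
chars_per_token = 4.2         # rough estimate for BPE-style tokenizers

bpe_text_span = context_window_tokens * chars_per_token  # ~34,406 characters of text
char_level_span = context_window_tokens * 1.0            # only 8,192 characters
print(int(bpe_text_span), int(char_level_span))
```

          The same factor applies to generation speed, since each character would cost a full forward pass.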

    • By royosherove 2025-10-13 19:43 (1 reply)

      Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?

      • By sammyd56 2025-10-13 19:59 (1 reply)

        There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR

        • By royosherove 2025-10-13 21:59

          Ah, I was missing the WANDB_RUN env var, so I did not get any logs. Thanks!

    • By bravura 2025-10-14 1:21 (1 reply)

      For the measures that drop exponentially, like val/bpb and train/loss, you should put the x-axis in log scale. That will better show you whether it's converged.

      • By sammyd56 2025-10-14 10:21 (1 reply)

        Great call, thank you - I switched to log scale for those metrics - agree that it is much clearer.

        • By bravura 2025-10-14 14:23

          Sorry, fat fingers. It should be the y-axis that is log scale, not the x-axis. (Sometimes both is good.)

          Did you notice the inflection point in which the loss drops faster than expected in the top graph? Maybe you should let it run more…

HackerNews