Backpropagation is a leaky abstraction (2016)

2025-11-02 5:20 · karpathy.medium.com

Andrej Karpathy

When we offered CS231n (Deep Learning class) at Stanford, we intentionally designed the programming assignments to include explicit calculations involved in backpropagation on the lowest level. The students had to implement the forward and the backward pass of each layer in raw numpy. Inevitably, some students complained on the class message boards:

“Why do we have to write the backward pass when frameworks in the real world, such as TensorFlow, compute them for you automatically?”

This is seemingly a perfectly sensible appeal - if you’re never going to write backward passes once the class is over, why practice writing them? Are we just torturing the students for our own amusement? Some easy answers could make arguments along the lines of “it’s worth knowing what’s under the hood as an intellectual curiosity”, or perhaps “you might want to improve on the core algorithm later”, but there is a much stronger and practical argument, which I wanted to devote a whole post to:

> The problem with Backpropagation is that it is a leaky abstraction.

In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data. So let’s look at a few explicit examples where this is not the case in quite unintuitive ways.

Some eye candy: a computational graph of a Batch Norm layer with a forward pass (black) and backward pass (red). (borrowed from this post)

Vanishing gradients on sigmoids

We’re starting off easy here. At one point it was fashionable to use sigmoid (or tanh) non-linearities in the fully connected layers. The tricky part people might not realize until they think about the backward pass is that if you are sloppy with the weight initialization or data preprocessing these non-linearities can “saturate” and entirely stop learning — your training loss will be flat and refuse to go down. For example, a fully connected layer with sigmoid non-linearity computes (using raw numpy):

z = 1/(1 + np.exp(-np.dot(W, x))) # forward pass
dx = np.dot(W.T, z*(1-z)) # backward pass: local gradient for x
dW = np.outer(z*(1-z), x) # backward pass: local gradient for W

If your weight matrix W is initialized too large, the output of the matrix multiply could have a very large range (e.g. numbers between -400 and 400), which will make all outputs in the vector z almost binary: either 1 or 0. But in that case z*(1-z), which is the local gradient of the sigmoid non-linearity, becomes zero (“vanishes”) in both cases, making the gradients for both x and W zero. The rest of the backward pass will come out all zero from this point on due to multiplication in the chain rule.
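To make this concrete, here is a toy numpy sketch (mine, not from the course code): with a large weight initialization the sigmoid's average local gradient collapses toward zero, while a small initialization keeps it near its maximum.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100)

mean_grad = {}
for scale in [0.01, 10.0]:
    W = rng.standard_normal((100, 100)) * scale
    # forward pass (overflow in exp just saturates z to 0, which is the point)
    z = 1 / (1 + np.exp(-np.dot(W, x)))
    # average local gradient of the sigmoid over the layer
    mean_grad[scale] = (z * (1 - z)).mean()
    print(f"init scale {scale}: mean local gradient = {mean_grad[scale]:.4f}")
```

With scale 0.01 the pre-activations stay near zero, so z sits near 0.5 and the local gradient near 0.25; with scale 10 almost every unit saturates and the gradient vanishes.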


Another non-obvious fun fact about sigmoid is that its local gradient (z*(1-z)) achieves a maximum of 0.25, when z = 0.5. That means that every time the gradient signal flows through a sigmoid gate, its magnitude shrinks to a quarter of its value or less. If you’re using basic SGD, this would make the lower layers of a network train much more slowly than the higher ones.
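You can verify the 0.25 peak numerically with a quick sweep (a one-off check, not from the post):

```python
import numpy as np

# sigmoid outputs over a dense sweep of pre-activations
z = 1 / (1 + np.exp(-np.linspace(-10, 10, 10001)))
local_grad = z * (1 - z)
print(local_grad.max())   # peaks at 0.25, exactly where z = 0.5
```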

TLDR: if you’re using sigmoids or tanh non-linearities in your network and you understand backpropagation you should always be nervous about making sure that the initialization doesn’t cause them to be fully saturated. See a longer explanation in this CS231n lecture video.

Dying ReLUs

Another fun non-linearity is the ReLU, which thresholds neurons at zero from below. The forward and backward pass for a fully connected layer that uses ReLU would at the core include:

z = np.maximum(0, np.dot(W, x)) # forward pass
dW = np.outer(z > 0, x) # backward pass: local gradient for W

If you stare at this for a while you’ll see that if a neuron gets clamped to zero in the forward pass (i.e. z=0, it doesn’t “fire”), then its weights will get zero gradient. This can lead to what is called the “dead ReLU” problem: if a ReLU neuron is unfortunately initialized such that it never fires, or if a neuron’s weights ever get knocked into this regime by a large update during training, then this neuron will remain permanently dead. It’s like permanent, irrecoverable brain damage. Sometimes you can forward the entire training set through a trained network and find that a large fraction (e.g. 40%) of your neurons were zero the entire time.
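You can simulate this with a toy numpy sketch (mine, with a deliberately pathological bias init to force the effect): push half the neurons into the dead regime, forward a whole toy "training set", and count the neurons that never fire.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 50))            # toy "training set": 1000 examples
W = rng.standard_normal((100, 50))
b = np.where(np.arange(100) < 50, 0.0, -40.0)  # half the neurons knocked into the dead regime

Z = np.maximum(0, X @ W.T + b)                 # ReLU forward pass over every example
dead = np.all(Z == 0, axis=0)                  # neurons that never fire on any example
print(f"{dead.mean():.0%} of neurons are dead")

# backward pass for one example: dead neurons receive exactly zero gradient
z = np.maximum(0, W @ X[0] + b)
dW = np.outer(z > 0, X[0])                     # local gradient for W
```

The rows of dW belonging to dead neurons are all zero, so no update can ever revive them.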


TLDR: If you understand backpropagation and your network has ReLUs, you’re always nervous about dead ReLUs. These are neurons that never turn on for any example in your entire training set, and will remain permanently dead. Neurons can also die during training, usually as a symptom of aggressive learning rates. See a longer explanation in CS231n lecture video.

Exploding gradients in RNNs

Vanilla RNNs feature another good example of unintuitive effects of backpropagation. I’ll copy paste a slide from CS231n that has a simplified RNN that does not take any input x, and only computes the recurrence on the hidden state (equivalently, the input x could always be zero):

[Slide: a simplified RNN, unrolled over time, computing the recurrence on the hidden state]

This RNN is unrolled for T time steps. When you stare at what the backward pass is doing, you’ll see that the gradient signal going backwards in time through all the hidden states is always being multiplied by the same matrix (the recurrence matrix Whh), interspersed with non-linearity backprop.

What happens when you take one number a and start multiplying it by some other number b (i.e. a*b*b*b*b*b*b…)? This sequence either goes to zero if |b| < 1, or explodes to infinity when |b|>1. The same thing happens in the backward pass of an RNN, except b is a matrix and not just a number, so we have to reason about its largest eigenvalue instead.
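Here is a small numpy experiment (my sketch, with an arbitrary 10-unit hidden state and 50 time steps, ignoring the non-linearity backprop) showing both regimes, controlled by the spectral radius of Whh:

```python
import numpy as np

rng = np.random.default_rng(2)
norms = {}
for radius in [0.5, 1.5]:
    Whh = rng.standard_normal((10, 10))
    # rescale so the spectral radius (largest |eigenvalue|) equals `radius`
    Whh *= radius / np.abs(np.linalg.eigvals(Whh)).max()
    g = np.ones(10)              # stand-in for the gradient arriving at the last step
    for _ in range(50):          # backprop through 50 time steps
        g = Whh.T @ g
    norms[radius] = np.linalg.norm(g)
    print(f"spectral radius {radius}: gradient norm after 50 steps = {norms[radius]:.3g}")
```

Below radius 1 the gradient signal vanishes to numerical dust; above 1 it explodes by many orders of magnitude.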

TLDR: If you understand backpropagation and you’re using RNNs you are nervous about having to do gradient clipping, or you prefer to use an LSTM. See a longer explanation in this CS231n lecture video.
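Clipping here usually means rescaling the whole gradient when its norm exceeds some threshold; a minimal sketch (the 5.0 threshold is an arbitrary choice, not from the lecture):

```python
import numpy as np

def clip_by_norm(g, max_norm=5.0):
    # rescale the gradient vector if its L2 norm exceeds max_norm
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

print(clip_by_norm(np.array([30.0, 40.0])))   # norm 50 -> rescaled to norm 5
print(clip_by_norm(np.array([1.0, 2.0])))     # norm ~2.24 -> left untouched
```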

Spotted in the Wild: DQN Clipping

Let’s look at one more — the one that actually inspired this post. Yesterday I was browsing for a Deep Q Learning implementation in TensorFlow (to see how others deal with computing the numpy equivalent of Q[:, a], where a is an integer vector — turns out this trivial operation is not supported in TF). Anyway, I searched “dqn tensorflow”, clicked the first link, and found the core code. Here is an excerpt:

[Image: excerpt of the DQN loss code, around lines 291–295]

If you’re familiar with DQN, you can see that there is the target_q_t, which is just reward + γ·max_a Q(s’, a), and then there is q_acted, which is Q(s, a) for the action that was taken. The authors subtract the two into the variable delta, which they then want to minimize on line 295 with the L2 loss, tf.reduce_mean(tf.square(delta)). So far so good.

The problem is on line 291. The authors are trying to be robust to outliers, so if the delta is too large, they clip it with tf.clip_by_value. This is well-intentioned and looks sensible from the perspective of the forward pass, but it introduces a major bug if you think about the backward pass.

The clip_by_value function has a local gradient of zero outside of the range min_delta to max_delta, so whenever delta falls outside that range the gradient becomes exactly zero during backprop. The authors are clipping the raw Q delta when they likely intended to clip the gradient for added robustness. In that case the correct thing to do is to use the Huber loss in place of tf.square:

def clipped_error(x):
  return tf.select(tf.abs(x) < 1.0,    # condition
                   0.5 * tf.square(x), # true: quadratic near zero
                   tf.abs(x) - 0.5)    # false: linear tails

It’s a bit gross in TensorFlow because all we want to do is clip the gradient if it is above a threshold, but since we can’t meddle with the gradients directly we have to do it in this round-about way of defining the Huber loss. In Torch this would be much simpler.
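To see why the Huber loss behaves like gradient clipping, note its derivative: it is x for |x| < 1 and ±1 outside, so the gradient on delta is always bounded by 1 in magnitude, while the raw L2 loss passes delta through unbounded. A small numpy check (independent of TensorFlow):

```python
import numpy as np

def huber_grad(x):
    # derivative of the Huber loss: 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise
    return np.where(np.abs(x) < 1.0, x, np.sign(x))

delta = np.array([-100.0, -0.3, 0.0, 0.3, 100.0])
print(huber_grad(delta))   # large deltas give gradient +/-1 instead of +/-100
```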

I submitted an issue on the DQN repo and this was promptly fixed.

In conclusion

Backpropagation is a leaky abstraction; it is a credit assignment scheme with non-trivial consequences. If you try to ignore how it works under the hood because “TensorFlow automagically makes my networks learn”, you will not be ready to wrestle with the dangers it presents, and you will be much less effective at building and debugging neural networks.

The good news is that backpropagation is not that difficult to understand, if presented properly. I have relatively strong feelings on this topic because it seems to me that 95% of backpropagation materials out there present it all wrong, filling pages with mechanical math. Instead, I would recommend the CS231n lecture on backprop which emphasizes intuition (yay for shameless self-advertising). And if you can spare the time, as a bonus, work through the CS231n assignments, which get you to write backprop manually and help you solidify your understanding.

That’s it for now! I hope you’ll be much more suspicious of backpropagation going forward and think carefully through what the backward pass is doing. Also, I’m aware that this post has (unintentionally!) turned into several CS231n ads. Apologies for that :)


Read the original article

Comments

  • By gchadwick 2025-11-02 7:20 | 2 replies

    Karpathy's contribution to teaching around deep learning is just immense. He's got a mountain of fantastic material from short articles like this, longer writing like https://karpathy.github.io/2015/05/21/rnn-effectiveness/ (on recurrent neural networks) and all of the stuff on YouTube.

    Plus his GitHub. The recently released nanochat https://github.com/karpathy/nanochat is fantastic. Having minimal, understandable and complete examples like that is invaluable for anyone who really wants to understand this stuff.

    • By kubb 2025-11-02 9:03 | 9 replies

      I was slightly surprised that my colleagues, who are extremely invested in capabilities of LLMs, didn’t show any interest in Karpathy’s communication on the subject when I recommended it to them.

      Later I understood that they don’t need to understand LLMs, and they don’t care how they work. Rather they need to believe and buy into them.

      They’re more interested in science fiction discussions — how would we organize a society where all work is done by intelligent machines — than what kinds of tasks are LLMs good at today and why.

      • By Al-Khwarizmi 2025-11-02 9:31 | 2 replies

        What's wrong or odd about that? You can like a technology as a user and not want to delve into how it works (sentence written by a human despite use of "delve"). Everyone should have some notions on what LLMs can or cannot do, in order to use them successfully and not be misguided by their limitations, but we don't need everyone to understand what backpropagation is, just as most of us use cars without knowing much about how an internal combustion engine works.

        And the issue you mention in the last paragraph is very relevant, since the scenario is plausible, so it is something we definitely should be discussing.

        • By Archelaos 2025-11-02 11:06

          > What's wrong or odd about that? You can like a technology as a user and not want to delve into how it works

          The question here is whether the details are important for the major issues, or whether they can be abstracted away with a vague understanding. To what extent abstracting away is okay depends greatly on the individual case. Abstractions can work over a large area or for a long time, but then suddenly collapse and fail.

          The calculator, which has always delivered sufficiently accurate results, can produce nonsense when one approaches the limits of its numerical representation or combines numbers with very different levels of precision. This can be seen, for example, when one rearranges commutative operations; due to rounding problems, it suddenly delivers completely different results.

          The 2008 financial crisis was based, among other things, on models that treated certain market risks as independent of one another. Risk could then be spread by splitting and recombining portfolios. However, this only worked as long as the interdependence of the different portfolios was actually quite small. An entire industry, with the exception of a few astute individuals, had abstracted away this interdependence, acted on this basis, and ultimately failed.

          As individuals, however, we are completely dependent on these abstractions. Our entire lives are permeated by things whose functioning we simply have to rely on without truly understanding them. Ultimately, it is the nature of modern, specialized societies that this process continues and becomes even more differentiated.

          But somewhere there should be people who work at the limits of detailed abstractions and are concerned with researching and evaluating the real complexity hidden behind them, and thus correcting the abstraction if necessary, sending this new knowledge upstream.

          The role of an expert is to operate with less abstraction and more detail in his or her field of expertise than a non-expert -- and the more so, the better an expert he or she is.

        • By Marazan 2025-11-02 10:33 | 4 replies

          Because if you don't understand how a tool works, you can't use it to its full potential.

          Imagine if you were using single-layer perceptrons without understanding separability and going "just a few more tweaks and it will approximate XOR!"

          • By famouswaffles 2025-11-03 1:38 | 1 reply

            If you want a good idea of how well LLMs will work for your use case then use them. Use them in different ways, for different things.

            Knowledge of backprop no matter how precise, and any convoluted 'theories' will not make you utilize LLMs any better. You'll be worse off if anything.

            • By Al-Khwarizmi 2025-11-03 7:11

              Yeah, that's what I'm trying to explain (maybe unsuccessfully). I do know backprop, I studied and used it back in the early 00s when it was very much not cool. But I don't think that knowledge is especially useful to use LLMs.

              We don't even have a complete explanation of how we go from backprop to the emerging abilities we use and love, so who cares (for that purpose) how backprop works? It's not like we're actually using it to explain anything.

              As I say in another comment, I often give talks to laypeople about LLMs and the mental model I present is something like supercharged Markov chain + massive training data + continuous vocabulary space + instruction tuning/RLHF. I think that provides the right abstraction level to reason about what LLMs can do and what their limitations are. It's irrelevant how the supercharged Markov chain works, in fact it's plausible that in the future one could replace backprop with some other learning algorithm and LLMs could still work in essentially the same way.

              In the line of your first paragraph, probably many teens who had a lot of time on their hands when Bing Chat was released, and some critical spirit to not get misled by the VS, have better intuition about what an LLM can do than many ML experts.

          • By tarsinge 2025-11-02 11:46 | 1 reply

            I disagree in the case of LLMs, because they really are an accidental side effect of another tool. Not understanding the inner workings will make users attribute false properties to them. Once you understand how they work (how they generate plausible text), you get a far deeper grasp on their capabilities and how to tweak and prompt them.

            And in fact this is true of any tool, you don’t have to know exactly how to build them but any craftsman has a good understanding how the tool works internally. LLMs are not a screw or a pen, they are more akin to an engine, you have to know their subtleties if you build a car. And even screws have to be understood structurally in advanced usage. Not understanding the tool is maybe true only for hobbyists.

            • By adi_kurian 2025-11-03 1:26

              Could you provide an example of an advanced prompt technique or approach that one would be much more likely to employ if they had knowledge of X internal working?

          • By kubb 2025-11-02 10:53

            You hit the nail on the head, in my opinion.

            There are things that you just can’t expect from current LLMs that people routinely expect from them.

            They start out projects with those expectations. And that’s fine. But they don’t always learn from the outcomes of those projects.

          • By Al-Khwarizmi 2025-11-02 11:08 | 2 replies

            I don't think that's a good analogy, because if you're trying to train a single layer perceptron to approximate XOR you're not the end user.

            • By vajrabum 2025-11-02 17:18 | 1 reply

              None of this is about an end user in the sense of the user of an LLM. This is aimed at the prospective user of a training framework which implements backpropagation at a high level of abstraction. As such it draws attention to training problems which arise inside the black box in order to motivate learning what is inside that box. There aren't any ML engineers who shouldn't know all about single layer perceptrons I think, and that makes for a nice analogy to real life issues in using SGD and backpropagation for ML training.

              • By Al-Khwarizmi 2025-11-03 7:04

                The post I was replying to was about "colleagues, who are extremely invested in capabilities of LLMs" and then mentions how they are uninterested in how they work and just interested in what they can do and societal implications.

                It sounds to me very much like end users, not people who are training LLMs.

            • By Marazan 2025-11-02 11:47

              The analogy is: if you don't understand the limitations of the tool, you may try to make it do something it is bad at, and never understand why it will never do the thing you want despite looking like it potentially could.

      • By CuriouslyC 2025-11-02 10:21 | 1 reply

        I think there are a lot of people who just don't care about stuff like nanochat because it's exclusively pedagogical, and a lot of people want to learn by building something cool, not taking a ride on a kiddie bike with training wheels.

        • By HarHarVeryFunny 2025-11-02 14:32 | 1 reply

          That's fine as far as it goes, but there is a middle ground ...

          Feynman was right that "If you can't build it, you don't understand it", but of course not everyone needs or wants to fully understand how an LLM works. However, regarding an LLM as a magic black box seems a bit extreme if you are a technologist and hope to understand where the technology is heading.

          I guess we are in an era of vibe-coded disposable "fast tech" (cf fast fashion), so maybe it only matters what can it do today, if playing with or applying towards this end it is all you care about, but this seems a rather blinkered view.

          • By vrighter 2025-11-05 14:22

            The problem is that not even the ones building them can understand it. Otherwise they wouldn't be breaking their "I promise AGI by the end of next year" promises for the Nth time.

            That or they are flat out lying. My money's on the latter.

      • By tanepiper 2025-11-02 16:32 | 2 replies

        If everyone had to understand how carburettors, engines and brake systems work to be able to drive a car - rather than just learning to drive and get from A to B - I'm guessing there would be a lot fewer cars on the road.

        (Thinking about it, would that necessarily be a bad thing...)

        • By whizzter 2025-11-02 21:28

          The problem is that we have a huge swathe of "mechanics" who basically don't know much more than how to open a paint can and paint a pig, despite promising to deliver finely tuned supercars with their magic car-making machine.

      • By danielbln 2025-11-02 11:58 | 1 reply

        I'm personally very interested in how LLMs work under the hood, but I don't think everyone who uses them as tools needs that. I don't know the wiring inside my drill, but I know how to put a hole in my wall and not my hand regardless.

        • By vrighter 2025-11-05 14:25

          The thing is that if you actually learn how they work, they lose all of their magic. This has happened to anyone I know who bothered studying them and is not selling them. So I'd rather people learned. Knowing how a drill works doesn't make you any less likely to use a drill.

      • By miki123211 2025-11-02 17:29

        Not everybody who drives a car (even as a professional driver) knows how to make one.

        If you live in a world of horse carriages, you can be thinking about what the world of cars is going to be like, even if you don't fully understand what fuel mix is the most efficient or what material one should use for a piston in a four-stroke.

      • By android521 2025-11-02 10:37

        Do you go deep into molecular biology to see how it works? It is much more interesting and important.

      • By amelius 2025-11-02 13:02

        But the question is if you have a better understanding of LLMs from a user's perspective, or they.

      • By arisAlexis 2025-11-02 9:57 | 1 reply

        Obviously they are more focused on making something that works

        • By spwa4 2025-11-02 10:46

          Wow. Definitely NOT management material then.

      • By teiferer 2025-11-02 9:19 | 2 replies

        Which is terrible. That's the root of all the BS around LLMs. People lacking understanding of what they are and ascribing capabilities which LLMs just don't have, by design. Even HN discussions are full of that. Even though this page literally has "hacker" in its name.

        • By tim333 2025-11-02 11:13 | 2 replies

          I see your point, but on the other hand a lot of conversations go: A: "What will we do when AI does all the jobs?" B: "That's silly, LLMs can't do the jobs." The thing is, A didn't say LLM, they said AI, as in whatever that will be a short while into the future. Which is changing rapidly because thousands of bright people are being paid to change it.

          • By teiferer 2025-11-02 21:24

            > a short while into the future

            And what gives you that confidence? A few AI nerds already claimed that in the 80s.

            We're currently exploring what LLMs can do. There is no indication that any further fundamental breakthrough is around the corner. Everybody is currently squeezing the same stone.

          • By HarHarVeryFunny 2025-11-02 13:44 | 1 reply

            The trouble is that "AI" is also very much a leaky abstraction, which makes it tempting to see all the "AI" advances of recent years, then correctly predict that these "AI" advances will continue, but then jump to all sorts of wrong conclusions about what those advances will be.

            For example, things like "AI" image and video generation are amazing, as are things like AlphaGo and AlphaFold, but none of these have anything to do with LLMs, and the only technology they share with LLMs is machine learning and neural nets. If you lump these together with LLMs, calling them all "AI", then you'll come to the wrong conclusion that all of these non-LLM advances indicate that "AI" is rapidly advancing and therefore LLMs (also being "AI") will do too ...

            Even if you leave aside things like AlphaGo, and just focus on LLMs, and other future technology that may take all our jobs, then using terms like "AI" and "AGI" are still confusing and misleading. It's easy to fall into the mindset that "AGI" is just better "AI", and that since LLMs are "AI", AGI is just better LLMs, and is around the corner because "AI" is advancing rapidly ...

            In reality LLMs are, like AlphaFold, something highly specific - they are auto-regressive next-word predictor language models (just as a statement of fact, and how they are trained, not a put-down), based on the Transformer architecture.

            The technology that could replace humans for most jobs in the future isn't going to be a better language model - a better auto-regressive next-word predictor - but will need to be something much more brain like. The architecture itself doesn't have to be brain-like, but in order to deliver brain-like functionality it will probably need to include another half-dozen "Transformer-level" architectural/algorithmic breakthroughs including things like continual learning, which will likely turn the whole current LLM training and deployment paradigm on it's head.

            Again, just focusing on LLMs, and LLM-based agents, regarding them as a black-box technology, it's easy to be misled into thinking that advances in capability are broadly advancing, and will rise all ships, when in reality progress is much more narrow. Headlines about LLMs achievement in math and competitive programming, touted as evidence of reasoning, do NOT imply that LLM reasoning is broadly advancing, but you need to get under the hood and understand RL training goals to realize why that is not necessarily the case. The correctness of most business and real-world reasoning is not as easy to check as is marking a math problem as correct or not, yet that capability is what RL training depends on.

            I could go on .. LLM-based agents are also blurring the lines of what "AI" can do, and again if treated as a black box will also misinform as to what is actually progressing and what is not. Thousands of bright people are indeed working on improving LLM-adjacent low-hanging fruit like this, but it'd be illogical to conclude that this is somehow helping to create next-generation brain-like architectures that will take away our jobs.

            • By tim333 2025-11-02 15:44 | 3 replies

              I'll give you that algorithmic breakthroughs have been quite slow to come about - I think backpropagation in 1986 and then transformers in 2017. Still, the fact that LLMs can do well in things like the maths olympiad has me thinking there must be some way to tweak this to be more brain-like. I recently read how LLMs work and was surprised how text-focused it is, making word vectors and not physical understanding.

              • By dontlikeyoueith 2025-11-02 18:46 | 1 reply

                > Still the fact that LLMs can do well in things like the maths olympiad have me thinking there must be some way to tweak this to be more brain like

                That's because you, as you admit in the next sentence, have almost no understanding of how they work.

                Your reasoning is on the same level as someone in the 1950s thinking ubiquitous flying cars are just a few years away. Or fusion power, for that matter.

                In your defense, that seems to be about the average level of engagement with this technology, even on this website.

                • By tim333 2025-11-03 10:59 | 2 replies

                  Maybe, but the flying cars and fusion ran into the fundamental barrier of the physics being hard. With human-level intelligence, though, we have evidence it's possible from our brains, which seem to use less compute than LLMs going by power usage, so I don't see a fundamental barrier; it just needs some different code.

                  • By HarHarVeryFunny 2025-11-03 15:16

                    You could say there is no fundamental barrier to humans doing anything that is allowed by the laws of Physics, but that is not a very useful statement, and doesn't indicate how long it may take.

                    Since nobody has yet figured out how to build an artificial brain, having that as a proof it's possible doesn't much help. It will be decades or more before we figure out how the brain works and are able to copy that, although no doubt people will attempt to build animal intelligence before fully knowing how nature did it.

                    Saying that AGI "just needs some different code" than an LLM is like saying that building an interstellar spaceship "just needs some different parts than a wheelbarrow". Both are true, and both are useless statements offering zero insight into the timeline involved.

                  • By dontlikeyoueith 2025-11-03 20:01

                    > I don't see a fundamental barrier to it

                    Neither did the people expecting fusion power and flying cars to come quickly.

                    We have just as much evidence that fusion power is possible as we do that human level intelligence is possible. Same with small vehicle flight for that matter.

                    None of that makes any of these things feasible.

              • By teiferer 2025-11-02 21:29

                > Still the fact that LLMs can do well in things like the maths olympiad have me thinking there must be some way to tweak this to be more brain like.

                That's like saying, well, given how fast bicycles make us, so much closer to horse speed, I wonder if we can tweak this a little to move faster than any animal can run. But cars needed more technological breakthroughs, even though some aspects of them used insights gained from tweaking bicycles.

              • By HarHarVeryFunny 2025-11-02 16:09

                Yes, it's a bit shocking to realize that all LLMs are doing is predicting next word (token) from samples in the training data, but the Transformer is powerful enough to do a fantastic job of prediction (which you can think of as selecting which training sample(s) to copy from), which is why the LLM - just a dumb function - appears as smart as the human training data it is copying.

                The Math Olympiad results are impressive, but at the end of the day is just this same next word prediction, but in this case fine tuned by additional LLM training on solutions to math problems, teaching the LLM which next word predictions (i.e. output) will add up to solution steps that lead to correct problem solutions in the training data. Due to the logical nature of math, the reasoning/solution steps that worked for training data problems will often work for new problems it is then tested on (Math Olympiad), but most reasoning outside of logical domains like math and programming isn't so clear cut, so this approach of training on reasoning examples isn't necessarily going to help LLMs get better at reasoning on more useful real-world problems.

        • By kubb 2025-11-02 9:23 | 1 reply

          I’m trying not to be disappointed by people, I’d rather understand what’s going on in their minds, and how to navigate that.

    • By throwaway290 2025-11-02 7:58 | 2 replies

      And to all the LLM heads here, this is his work process:

      > Yesterday I was browsing for a Deep Q Learning implementation in TensorFlow (to see how others deal with computing the numpy equivalent of Q[:, a], where a is an integer vector — turns out this trivial operation is not supported in TF). Anyway, I searched “dqn tensorflow”, clicked the first link, and found the core code. Here is an excerpt:

      Notice how it's "browse" and "search" not just "I asked chatgpt". Notice how it made him notice a bug

      • By stingraycharles 2025-11-02 8:09 | 1 reply

        First of all, this is not a competition between “are LLMs better than search”.

        Secondly, the article is from 2016, ChatGPT didn’t exist back then

        • By code51 2025-11-02 8:25 | 3 replies

          I doubt he's letting LLMs creep into his decision-making in 2025, aside from fun side projects (vibes). We don't ever come across Karpathy going to an LLM or expressing that an LLM helped in any of his YouTube videos about building LLMs.

          He's just test driving LLMs, nothing more.

          Nobody's asking this core question in podcasts: "How much, and how exactly, are you using LLMs in your daily flow?"

          I'm guessing it's like actors not wanting to watch their own movies.

          • By mquander 2025-11-028:361 reply

            Karpathy talking for 2 hours about how he uses LLMs:

            https://www.youtube.com/watch?v=EWvNQjAaOHw

            • By code51 2025-11-029:431 reply

              Vibing, not firing at his ML problems.

              He's doing a capability check in this video (for the general audience, which is good of course), not attacking a hard problem in ML domain.

              Despite this tweet: https://x.com/karpathy/status/1964020416139448359 , I've never seen him cite an LLM as having helped him in ML work.

              • By soulofmischief 2025-11-0212:161 reply

                You're free to believe whatever fantasy you wish, but as someone who frequently consults an LLM alongside other resources when thinking about complex and abstract problems, there is no way in hell that Karpathy intentionally limits his options by excluding LLMs when seeking knowledge or understanding.

                If he did not believe in the capability of these models, he would be doing something else with his time.

                • By strogonoff 2025-11-0213:462 reply

                  One can believe in the capability of a technology but on principle refuse to use implementations of it built on ethically flawed approaches (e.g. violating GPL licensing terms and/or copyright, thus harming the open-source ecosystem).

                  • By CamperBob2 2025-11-0217:251 reply

                    AI is more important than copyright law. Any fight between them will not go well for the latter.

                    Truth be told, a whole lot of things are more important than copyright law.

                    • By esafak 2025-11-0313:341 reply

                      Important for whom, the copyright creators? Being fed is more important than supermarkets, so feel free to raid them?

                      • By CamperBob2 2025-11-0316:492 reply

                        Conflating natural law -- our need to eat -- with something we pulled out of our asses a couple hundred years ago to control the dissemination of ideas on paper is certainly one way to think about the question.

                        A pretty terrible way, but... certainly one way.

                        • By strogonoff 2025-11-0318:061 reply

                          I am sure it had nothing to do with the amount of innovation that has been happening since, including the entire foundation that gave us LLMs themselves.

                          It would be crazy to think the protections of IP laws and the ability to claim original work as your own and have a degree of control over it as an author fostered creativity in science and arts.

                          • By soulofmischief 2025-11-0319:441 reply

                            Innovation? Patents are designed to protect innovation. Copyright is designed to make sure Disney gets a buck every time someone shares a picture of Mickey Mouse.

                            The human race has produced an extremely rich body of work long before US copyright law and the DMCA existed. Instead of creating new financial models which embrace freedoms while still ensuring incentives to create new art, we have contorted outdated financial models, various modes of rent-seeking and gatekeeping, to remain viable via artificial and arbitrary restriction of freedom.

                            • By strogonoff 2025-11-0514:04

                              Patents and copyright are both IP. Feel free to replace “copyright” with “IP” in my comment. Do you not agree that IP laws are related to the explosion of innovation and creativity in the last few hundred years in the Western world?

                              Furthermore, claiming “X is not natural” is never a valid argument. Humans are part of nature, whatever we do is as well by extension. The line between natural and unnatural inevitably ends up being the line between what you like and what you don’t like.

                              The need to eat is as much a natural law as higher human needs—unless you believe we should abandon all progress and revert to pre-civilization times.

                              IP laws ensure that you have a say in the future of the product of your work, can possibly monetise it, etc., which means a creative 1) can fulfil their need to eat (individual benefit), and 2) has an incentive to create it in the first place (societal benefit).

                              In the last few hundred years intellectual property, not physical property, has increasingly been the product of our work and creative activities. Believing that the physical artifacts we create deserve protection against theft while the intellectual property we create does not is a position that needs a lot of explanation.

                  • By soulofmischief 2025-11-0217:281 reply

                    What you see as copyright violation, I see as liberation. I have open models running locally on my machine that would have felled kingdoms in the past.

                    • By strogonoff 2025-11-0318:111 reply

                      I personally see no issue with training and running open local models by individuals. When corporations run scrapers and expropriate IP at an industrial scale, then charge for using them, it is different.

                      • By soulofmischief 2025-11-0319:421 reply

                        What about Meta and the commercially licensed family of Llama open-weight models?

                        • By strogonoff 2025-11-046:281 reply

                          I have not researched closely enough but I think it falls under what corporations do. They are commercially licensed, you cannot use them freely, and crucially they were trained using data scraped at an industrial scale, contributing to degradation of the Web for humans.

                          • By soulofmischief 2025-11-0514:041 reply

                            Since Llama 2, the models have been commercially licensed under an acceptable use policy.

                            So you're able to use them commercially as you see fit, though not freely in the most absolute sense. But then again, this is a thread about restricting the freedoms of organizations in the name of a 25-year-old law that has been a disgrace from the start.

                            > contributing to degradation of the Web for humans

                            I'll be the first to say that Meta did this with Facebook and Instagram, along with other companies such as Reddit.

                            However, we don't yet know what the web is going to look like post-AI, and it's silly to blame any one company for what clearly is an inevitable evolution in technology. The post-AI web was always coming, what's important is how we plan to steward these technologies.

                            • By strogonoff 2025-11-0516:02

                              The models are either commercial or not. They are, and as such they monetise the work of original authors without their consent or compensation, and often in violation of copyleft licensing.

                              > The post-AI web was always coming

                              “The third world war was always coming.”

                              These things are not a force of nature, they are products of human effort, which can be ill-intentioned. Referring to them as “always coming” is 1) objectively false and 2) defeatist.

          • By confirmmesenpai 2025-11-028:37

            > Continuing the journey of optimal LLM-assisted coding experience. In particular, I find that instead of narrowing in on a perfect one thing my usage is increasingly diversifying

            https://x.com/karpathy/status/1959703967694545296

      • By confirmmesenpai 2025-11-028:321 reply

        What you did here is called confirmation bias.

        > I think congrats again to OpenAI for cooking with GPT-5 Pro. This is the third time I've struggled on something complex/gnarly for an hour on and off with CC, then 5 Pro goes off for 10 minutes and comes back with code that works out of the box. I had CC read the 5 Pro version and it wrote up 2 paragraphs admiring it (very wholesome). If you're not giving it your hardest problems you're probably missing out.

        https://x.com/karpathy/status/1964020416139448359

        • By away74etcie 2025-11-0212:38

          Yes, embedding .py code inside of a speedrun.sh to "simplify the [sic] bash scripts."

          Eureka runs LLM101n, which is teaching software for pedagogic symbiosis.

          [1]:https://eurekalabs.ai/

  • By nirinor 2025-11-0214:402 reply

    It's a nitpick, but backpropagation is getting a bad rap here. These examples are about gradients plus gradient-descent variants being a leaky abstraction for optimization [1].

    Backpropagation is a specific algorithm for computing gradients of composite functions, but even the failures that do come from composition (multiple sequential sigmoids cause exponential gradient decay) are not backpropagation specific: that's just how the gradients behave for that function, whatever algorithm you use. The remedy, of having people calculate their own backwards pass, is useful because people are _calculating their own derivatives_ for the functions, and get a chance to notice the exponents creeping in. Ask me how I know ;)

    [1] Gradients being zero would not be a problem with a global optimization algorithm (which we don't use because they are impractical in high dimensions). Gradients getting very small might be dealt with by tools like line search (if they are small in all directions) or approximate Newton methods (if small in some directions but not others). Not saying those are better solutions in this context, just that optimization (+modeling) is the actually hard part, not the way gradients are calculated.
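
    The exponential decay is easy to see with a toy numpy sketch (my own illustration, not from the article): chain a few sigmoids and multiply their local derivatives, which are capped at 0.25. The gradient of the composition collapses no matter which algorithm computes it.

    ```python
    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    # Compose 10 sigmoids and accumulate the chain-rule product.
    # Each local derivative s*(1-s) is at most 0.25, so the overall
    # gradient shrinks by at least 4x per layer.
    a = 0.5
    grad = 1.0
    for _ in range(10):
        a = sigmoid(a)
        grad *= a * (1.0 - a)  # local derivative of this sigmoid

    print(grad)  # on the order of 1e-7: vanished after only 10 layers
    ```

    The same number falls out of backprop, forward-mode autodiff, or a symbolic derivative; the decay is a property of the function, not of the algorithm.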

    • By xpe 2025-11-0215:443 reply

      Yes. No need to be apologetic or timid about it — it’s not a nit to push back against a flawed conceptual framing.

      I respect Karpathy’s contributions to the field, but often I find his writing and speaking to be more than imprecise — it is sloppy in the sense that it overreaches and butchers key distinctions. This may sound harsh, but at his level, one is held to a higher standard.

      • By embedding-shape 2025-11-0216:03

        > often I find his writing and speaking to be more than imprecise

        I think that's more because he's trying to write to an audience who isn't hardcore deep into ML already, so he simplifies a lot, sometimes to the detriment of accuracy.

        At this point I see him more as an "ML educator" than an "ML practitioner" or "ML researcher", and as far as I know he's moving in that direction on purpose. I have no qualms with it overall; he seems good at educating.

        But I think shifting your mindset about the purpose of his writing may help explain why it sometimes feels imprecise.

      • By HarHarVeryFunny 2025-11-0216:481 reply

        Whoever chose this topic title perhaps did him a disservice in suggesting he said the problem was backprop itself, since in his blog post he immediately clarifies what he meant by it. It's a nice pithy way of stating the issue though.

        • By nirinor 2025-11-0217:251 reply

          Nah, Karpathy's title is "Yes you should understand backprop", and his first highlight is "The problem with Backpropagation is that it is a leaky abstraction." This is his choice as a communicator, not the poster to HN.

          And his _examples_ are about gradients, but nowhere does he distinguish between backpropagation, (part of) an algorithm for automatic differentiation, and the gradients themselves. None of the issues are due to BP returning incorrect gradients (it totally could, for example, lose too much precision, but it doesn't).

          • By HarHarVeryFunny 2025-11-0217:38

            Yeah - he chose it as a pithy/catchy description of the issue, then immediately clarified what he meant by it.

            > In other words, it is easy to fall into the trap of abstracting away the learning process — believing that you can simply stack arbitrary layers together and backprop will “magically make them work” on your data.

            Then follows this with multiple clear examples of exactly what he is talking about.

            The target audience was people building and training neural networks (such as his CS231n students), so I think it's safe to assume they knew what backprop and gradients are, especially since he made them code gradients by hand, which is what they were complaining about!

      • By mitthrowaway2 2025-11-0217:121 reply

        But Karpathy is completely right; students who understand and internalize how backprop works, having implemented it rather than treating it as a magic spell cast by TF/PyTorch, will also be able to intuitively understand these problems of vanishing gradients and so on.

        Sure, instead of "the problem with backpropagation is that it's a leaky abstraction" he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading for an introductory-level article for an undergraduate audience, and also unnecessary because he already said that in the introduction.

        • By xpe 2025-11-0315:431 reply

          I never disagreed with the utility and importance of understanding backprop. I'm glad the article exists. And it could be easily improved -- and all of us can gain [1] by acknowledging this rather than circling the wagons [2], so to speak, or excusing unforced errors.

          > ... he could have written "the problem with not learning how back propagation works and just learning how to call a framework is that backpropagation is a leaky abstraction". But that would be a terrible sub-heading ...

          My concern isn't about the heading he chooses. My concern is deeper: he commits a category error [3]. The following things are true, and Karpathy's article gets them wrong: (1) leaky abstractions only occur with interfaces; (2) backpropagation is an algorithm; (3) algorithms can never be leaky abstractions.

          Karpathy could have communicated his point clearly and correctly by saying e.g.: "treating backprop learning as a magical optimization oracle is risky". There is zero need for introducing the concept of leaky abstractions at all.

          ---

          Ok, with the above out of the way, we can get to some interesting technical questions that are indeed about leaky abstractions which can inform the community about pros/cons of the design space: To what degree is the interface provided by [Library] a leaky abstraction? (where [Library] might be PyTorch or TensorFlow) Getting into these details is interesting. (See [4] for example.) There is room for more writing on this.

          [1]: We can all gain because accepting criticism is hard. Once we see that even Karpathy messes up, we probably shouldn't be defensive when we mess up.

          [2]: No one is being robbed here. Criticism is a gift; offering constructive criticism is a sign of respect. It also respects the community by saying i.e. "I want to make it easier for people to get the useful, clear ideas into their heads rather than muddled ones."

          [3]: https://en.wikipedia.org/wiki/Category_mistake

          [4]: https://elanapearl.github.io/blog/2025/the-bug-that-taught-m...

          • By shwaj 2025-11-0420:32

            Hear hear, one of my favorite comments recently.

            Can’t agree more about the technical points (category error etc), and then the unexpected switch to the value of receiving constructive criticism as a gift not an attack.

            Myself, I’m definitely conditioned to receive it as an attack. I’m trying to break this habit. This morning I gave some extensive feedback to some friends who have a startup. The whole time I was writing it, I was stressing out that they’d feel attacked, because that’s how I might take similar criticism.

            How was it actually received? A mix I think. Some people explicitly received it as a gift, and others I’m not so sure.

    • By fjdjshsh 2025-11-0218:461 reply

      I get your point, but I don't think your nit-pick is useful in this case.

      The point is that you can't abstract away the details of back propagation (which involve computing gradients) under some circumstances. For example, when we are using gradient descent. Maybe in other circumstances (a global optimization algorithm) it wouldn't be an issue, but the leaky-abstraction idea isn't that the abstraction is always an issue.

      (Right now, backpropagation is virtually the only way to calculate gradients in deep learning.)

      • By nirinor 2025-11-032:192 reply

        So, are computing gradients details of backpropagation that it is failing to abstract over, or are gradients the goal that backpropagation achieves? It isn't both; it's just the latter.

        This is like complaining about long division not behaving nicely when dividing by 0. The algorithm isn't the problem, and blaming the wrong part does not help understanding.

        It distracts from what is actually helping which is using different functions with nicer behaving gradients, e.g., the Huber loss instead of quadratic.
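
        A minimal numpy sketch of that contrast (my own illustration, with an assumed delta of 1.0): the quadratic loss's gradient grows without bound with the residual, while the Huber loss's gradient is clipped, so a single outlier can't dominate the update.

        ```python
        import numpy as np

        def quadratic_grad(r):
            # gradient of 0.5 * r^2 w.r.t. the residual r: grows linearly with r
            return r

        def huber_grad(r, delta=1.0):
            # gradient of the Huber loss: equals r near zero,
            # clipped to +/- delta for large residuals
            return np.clip(r, -delta, delta)

        residuals = np.array([0.1, 1.0, 100.0])
        print(quadratic_grad(residuals))  # [0.1, 1.0, 100.0]: the outlier dominates
        print(huber_grad(residuals))      # [0.1, 1.0, 1.0]: bounded gradient
        ```

        Same optimizer, same backprop; only the choice of loss changes how the gradients behave.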

        • By grumbelbart2 2025-11-037:28

          > It distracts from what is actually helping which is using different functions with nicer behaving gradients, e.g., the Huber loss instead of quadratic.

          Fully agree. It's not the "fault" of backprop. It does what you tell it to do: find the direction in which your loss is reduced the most. If the first layers get no signal because the gradient vanishes, then the reason is your network layout: very small modifications in the initial layers would lead to very large modifications in the final layers (essentially an unstable computation), so gradient descent simply cannot move that fast.

          Instead, it's a vital signal for debugging your network. Inspecting things like gradient magnitudes per layer shows whether you have vanishing or exploding gradients. And that has led to great inventions for dealing with them, such as residual networks and a whole class of normalization methods (e.g. batch normalization).
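
          As a sketch of that debugging practice (a toy numpy MLP with hypothetical sizes and deliberately sloppy unit-variance initialization, so the sigmoids saturate):

          ```python
          import numpy as np

          rng = np.random.default_rng(0)
          sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

          # Toy 5-layer sigmoid net; the large init scale makes it saturate.
          Ws = [rng.normal(0, 1.0, (50, 50)) for _ in range(5)]
          x = rng.normal(0, 1.0, (32, 50))

          # Forward pass, caching activations for the backward pass.
          acts = [x]
          for W in Ws:
              acts.append(sigmoid(acts[-1] @ W))

          # Backward pass from a dummy upstream gradient of ones;
          # print the per-layer weight-gradient norms and watch them shrink.
          delta = np.ones_like(acts[-1])
          for i in reversed(range(5)):
              delta = delta * acts[i + 1] * (1.0 - acts[i + 1])  # sigmoid derivative
              dW = acts[i].T @ delta
              print(f"layer {i}: |dW| = {np.linalg.norm(dW):.3e}")
              delta = delta @ Ws[i].T
          ```

          Frameworks expose the same signal (e.g. iterating over parameter gradients after the backward pass); the point is simply to look at it rather than trust the training loop blindly.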

        • By DSingularity 2025-11-037:18

          It’s just an observation. It’s an abstraction in the classical computer-science sense, in that you stack some modules and the backprop is generated. It’s leaky in the sense that you can’t fully abstract away the details, because of the vanishing/exploding-gradient issues you must be mindful of.

          It is definitely a useful thing for people who are learning this topic to understand from day 1.

  • By joaquincabezas 2025-11-028:033 reply

    I took a course in my Master's (URV.cat) where we had to do exactly this: implement backpropagation (forward and backward passes) from a paper explaining it, using just basic math operations in a language of our choice.

    I told everyone this was the best single exercise of the whole year for me. It aligns with the kind of activity that I benefit from immensely but won't do by myself, so this push was just perfect.

    If you are teaching, please consider this kind of assignment.

    P.S. Just checked now and it's still in the syllabus :)
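
    In the spirit of that exercise, here is a minimal raw-numpy forward/backward pass for one sigmoid fully-connected layer, checked against a numerical gradient (my own sketch, not the course's actual assignment):

    ```python
    import numpy as np

    def forward(W, x):
        # fully connected layer followed by a sigmoid
        z = x @ W
        return 1.0 / (1.0 + np.exp(-z))

    def backward(W, x, a, dout):
        # chain rule by hand: dL/dz = dL/da * a * (1 - a)
        dz = dout * a * (1.0 - a)
        dW = x.T @ dz
        dx = dz @ W.T
        return dW, dx

    # Verify the hand-written gradient against a central difference.
    rng = np.random.default_rng(0)
    W = rng.normal(0, 0.1, (4, 3))
    x = rng.normal(0, 1.0, (2, 4))
    a = forward(W, x)
    dW, _ = backward(W, x, a, np.ones_like(a))  # loss = a.sum()

    eps, i, j = 1e-6, 1, 2
    Wp = W.copy(); Wp[i, j] += eps
    Wm = W.copy(); Wm[i, j] -= eps
    num = (forward(Wp, x).sum() - forward(Wm, x).sum()) / (2 * eps)
    print(abs(num - dW[i, j]))  # tiny: analytic and numerical gradients agree
    ```

    The gradient check is the part that makes the exercise stick: it forces you to get every factor of the chain rule right.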

    • By blitzar 2025-11-028:271 reply

      The difference in understanding (for me and how my brain works) between reading the paper in what appears to be a future or past alien language and doing a minimal paper/code example is massive.

      • By joaquincabezas 2025-11-028:34

        same here, even more so if I'm doing it over a few days and from different angles

    • By LPisGood 2025-11-028:091 reply

      I did this in high school from some online textbook, in plain Java. I recall implementing matrix multiplication myself being the hardest part.

      I made a UI that showed how the weights and biases changed throughout the training iterations.
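
      The naive algorithm itself is short; a sketch of the triple loop in Python (the commenter's version was in Java):

      ```python
      def matmul(A, B):
          # naive triple loop: C[i][j] = sum over k of A[i][k] * B[k][j]
          n, m, p = len(A), len(B), len(B[0])
          assert len(A[0]) == m, "inner dimensions must match"
          C = [[0.0] * p for _ in range(n)]
          for i in range(n):
              for j in range(p):
                  for k in range(m):
                      C[i][j] += A[i][k] * B[k][j]
          return C

      print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
      ```

      The hard part in practice is less the loop than the indexing conventions, which is exactly where hand-rolled backprop implementations tend to go wrong too.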

      • By aDyslecticCrow 2025-11-0213:351 reply

        I had a whole course just about how computers do maths: matrix multiplication, linear fitting, finding eigenvectors, multiplication and division, square roots, solving linear systems, numerically solving differential equations, spline interpolation, FEM analysis.

        "Computers are good at maths" is normally a pretty obvious statement... but many things we take for granted from analytical mathematics are quite difficult to actually implement in a computer. So there is a mountain of clever algorithms hiding behind some of the seemingly most obvious library operations.

        One of the best courses I've ever had.
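
        A classic example from such courses is the square root via Newton's iteration (a sketch for x > 0, not from the course in question):

        ```python
        def newton_sqrt(x, tol=1e-12):
            # Newton's method on f(g) = g*g - x gives the update
            # g <- (g + x/g) / 2, which converges quadratically for x > 0.
            g = x if x > 1 else 1.0  # crude but safe starting guess
            while abs(g * g - x) > tol * x:
                g = 0.5 * (g + x / g)
            return g

        print(newton_sqrt(2.0))  # 1.41421356...
        ```

        The "obvious" library call hides choices like the starting guess and the stopping criterion, which is exactly the course's point.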

        • By e-master 2025-11-0217:071 reply

          Would you mind sharing which course it was? Is it available online by any chance?

          • By aDyslecticCrow 2025-11-0219:35

            Unfortunately it was a course at my university, and in Swedish. But it wouldn't surprise me if there are similar courses online.

    • By mkl 2025-11-0212:442 reply

      Is that paper publicly available?

HackerNews