How linear regression works intuitively and how it leads to gradient descent

2025-05-05 · briefer.cloud

Building an intuitive understanding of how Linear Regression works and how it leads to Gradient Descent

Learning, to a computer, is just turning bad guesses into better ones. In this post, we’ll see how that starts with a straight line: how linear regression makes the first guess, and gradient descent keeps improving it.

Let's start with something familiar: house prices. Bigger houses tend to cost more; smaller ones, less. It's the kind of pattern you can almost see without thinking: more space, more money.

When we plot it, the shape is clear: a loose upward slope, with some noise but a definite trend.

As you can see, price and size move together in a way that feels predictable. Not in fixed steps or categories, but on a sliding scale. A house might go for $180,000, $305,500, or anything in between.

Now imagine you're selling your own house. It's 1,850 square feet—larger than average, but not a mansion. You've seen what homes go for in your area, but the prices are scattered. What's a fair number to list it at?

One option is to text your real estate friend and get a half-baked guess. A better option is to look at the pattern in past sales and sketch a line that seems to match it. You grab a ruler, hold it up to the scatterplot, and draw something that feels about right. Then you find your square footage on the x-axis, trace upward to your line, and read off the predicted price.

Whatever line you draw, it'll be a steady upward slope. Bigger homes, higher prices. It might not be perfect, but it gives you a way to turn square footage into a price that kind of makes sense.

And while the lines you might draw vary, they all follow the same formula. Each one takes the area of a house (the explanatory variable), multiplies it by a number (the slope), and adds another number (the intercept).

Notice that this isn't just a house-pricing formula. It's how every straight line works. One number sets the tilt (the slope), and the other shifts the line up or down (the intercept). That's all it takes to draw a line: scale, then slide.

In our case, the slope is the "price-per-square-foot". That's the amount each extra foot adds to the total. If the slope is 150, then a 1,000 square foot home would land at $150,000 just from size alone. A steeper slope means prices rise faster as homes get bigger.

The intercept is where the line starts. It's the predicted price of a home with zero square feet. That doesn't mean much in isolation (nobody's buying a 0-square-foot house), but it sets the baseline, like the minimum price you'd expect even for the tiniest studio. If two neighborhoods increase in price at the same rate per square foot, the one with the higher intercept starts from a pricier floor.
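This "scale, then slide" recipe is small enough to sketch in code. A minimal Python version (the slope of 150 comes from the example above; the $50,000 intercept is a made-up baseline for illustration):

```python
def predict_price(sqft, slope, intercept):
    # Scale by the slope, then slide by the intercept.
    return slope * sqft + intercept

# A slope of 150 means each square foot adds $150 to the price.
print(predict_price(1000, slope=150, intercept=0))       # 150000 from size alone
# With a hypothetical $50,000 baseline, the 1,850 sq ft house prices at:
print(predict_price(1850, slope=150, intercept=50_000))  # 327500
```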

Now that we know each line connects an explanatory variable to a prediction using just a slope and an intercept, the question is: which line should we trust to price our house?

To answer that, we need a way to measure how well a line fits the data we already have.

Take one of the houses in our dataset. The line says it should've sold for $350,000. But we know the actual price was $375,000. That $25,000 difference is an "error" that measures how far off the line was for that point. Price it too low and you leave money on the table; price it too high and the house might not sell.

A good line keeps the gaps between predictions and actual values small. The simplest way to measure how well a line fits is to add up those gaps—the errors.

Of course, there's more than one way to measure error. And how we choose to combine them is what turns guessing into regression.

One simple option is to use absolute error, just like we did in the previous example. You take the difference between the predicted price and the actual price (no matter which one is higher) and treat it as a distance. If the house sold for $375,000 and one line guesses $350,000, the error is $25,000. Another line guesses $360,000? That one's off by $15,000. Smaller is better.

To compare lines, you just add up the absolute errors across all the houses. The line with the smallest total error is the winner.
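As a sketch of that comparison (the three houses and both candidate lines are made-up numbers, not the article's actual dataset):

```python
# (square footage, actual sale price): illustrative data only
houses = [(1200, 190_000), (1850, 305_500), (2400, 375_000)]

def total_absolute_error(slope, intercept, data):
    # Sum of |predicted - actual| across all houses; smaller is better.
    return sum(abs(slope * sqft + intercept - price) for sqft, price in data)

line_a = total_absolute_error(150, 10_000, houses)  # $150/sq ft plus $10,000
line_b = total_absolute_error(160, 0, houses)       # $160/sq ft, no baseline
print(line_a, line_b)  # 23000 20500 -> the second line fits these points better
```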

If you're trying to price your own home, this makes intuitive sense: you'd rather be off by $10,000 than $30,000.

But absolute error treats all mistakes the same—two medium errors count the same as one big one. That can hide problems.

If your predictions are always a little off, you're consistently close. But if one estimate nails the price and another misses by $100,000, it's hard to know if the model is solid or just lucky. Consistency matters more than the occasional hit, especially when prices shift over time.

Take home prices, for example. Imagine checking listings in your neighborhood every week. Most of the time, estimates are a little off—maybe $10,000 high or low. Nothing to worry about. But then, one day, a house just like yours is priced $150,000 below market. The next week, another is listed way too high. Suddenly, you're not just surprised—you stop trusting the estimates altogether.

To recover trust in the estimates, we might want big mistakes to count more—not just equally. One way to do that is to make the penalty grow faster as the error increases. In effect, we're plotting the errors on a curve, not a straight line. The further out you go, the steeper the penalty becomes. That's what it means to use a non-linear scale: small misses barely move the needle, but big ones explode.

Instead of measuring error directly, we square it. The effect is clear in the plot above: a $20,000 miss isn't just twice as bad as $10,000—it's four times worse. A $50,000 error? Twenty-five times the cost. Bigger mistakes grow fast. Squaring the errors pushes us toward a line that stays consistently close, not one that swings between lucky guesses and huge misses.

Everything we've done—measuring errors, choosing how to weigh them, and adding them up—comes together into a single idea: the error function. It's just a rule for scoring how well a line fits the data. Squared error is a popular option. Absolute error is another, but they are not the only ones.
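That idea, an error function as a swappable scoring rule, can be sketched directly (the helper names and dataset shape here are illustrative):

```python
def absolute_loss(error):
    return abs(error)

def squared_loss(error):
    return error ** 2

def score_line(slope, intercept, data, loss):
    # Total score of a line under a chosen per-house loss; lower is better.
    return sum(loss(slope * sqft + intercept - price) for sqft, price in data)

# Squaring makes big misses explode relative to small ones:
print(squared_loss(20_000) / squared_loss(10_000))  # 4.0
print(squared_loss(50_000) / squared_loss(10_000))  # 25.0
```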

Different error functions reflect different priorities. For example, the Deming regression uses a different error function that accounts for errors in both variables—not just the one we're trying to predict. It's useful when both measurements are noisy, like when comparing lab results from two different instruments.

Many lines come close to the data points. But we don't want just any good line; we want the best one, the one with the smallest error. No matter which error function you choose, the goal stays the same: find the line that minimizes it.

But that raises the next question: where do these lines come from? One option is brute force: try every possible combination of slope and intercept, one by one, calculate the error for each, and keep the best.

It works, at least in theory. But there's a problem: there are far too many possibilities. Infinitely many, in fact. Testing them all would take forever. We need a more sensible algorithm than computing the error of every possible line.
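To see why, here's what brute force looks like if we cheat and restrict the search to a coarse grid (the dataset, ranges, and step sizes are arbitrary illustrations; the real search space is continuous, so no grid is ever fine enough):

```python
# Illustrative dataset: (square footage, sale price)
houses = [(1200, 190_000), (1850, 305_500), (2400, 375_000)]

def squared_error(slope, intercept, data):
    return sum((slope * sqft + intercept - price) ** 2 for sqft, price in data)

best = None
for slope in range(100, 201):                    # 101 candidate slopes
    for intercept in range(0, 100_001, 1_000):   # 101 candidate intercepts
        err = squared_error(slope, intercept, houses)
        if best is None or err < best[0]:
            best = (err, slope, intercept)

# Over 10,000 lines tested, and we still only ever see grid points.
print(best[1], best[2])
```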

To understand what we're up against, imagine a plot where each point represents a different line: the slope and intercept are the coordinates, and the vertical axis shows how bad the line is—its error. In this landscape of errors, high points mean worse lines; low points mean better ones. Our goal is simple: find the lowest point. We won't actually draw this plot (there are too many possibilities), but picturing it helps make the task clear.

For example, if you look at the absolute error surface, it's a bit of a mess. It's not smooth like a hill or a valley. Instead, it's more like a weird origami sculpture, with sharp edges and folds.

In some cases, like in the example below, every point along the fold gives the same lowest error. If that happens, it means there's not one best line, but many.

You can see this clearly in a 2D slice: fix the intercept at zero, and a whole range of slopes yields the same lowest error.

This ambiguity matters because if multiple lines fit the data equally well, then we won't have a single best answer—just a range of equally good ones. Imagine trying to sell a house whose price could reasonably be $100,000, $500,000, or $1,000,000, depending on which line you pick. Not ideal.

This situation happens when the data points are symmetric: tilting the line slightly brings it closer to some points while moving it away from others, without changing the total error. The plot below shows an example. Any line that passes through the green area (like the red one) is a valid solution.

This kind of ambiguity is specific to absolute error. If we were using squared error instead, the plot would look different: smooth and bowl-shaped, with a single lowest point. One best line, stable and predictable.

If you were standing anywhere on that surface, you'd just need to walk downhill to find the best line. Pick the direction where the slope is steepest, take a step, and keep moving downhill. Eventually, you'll reach the bottom.

You don't have to explore in every direction or worry about getting stuck. No matter where you start, as long as you follow the steepest descent, you'll get to the best solution. It might take 100, 1,000, or more steps, but you'll get there.

Of course, we're not actually hiking across hills. In our case, walking downhill means tweaking numbers: the slope and intercept of the line. Each point on the error surface corresponds to a specific line. Nearby points represent lines with similar parameters. By nudging the slope and intercept step by step, we move through this space—chasing lower and lower error until we land on the best fit.

At that point, we can stop and look at the slope and intercept of the line we found. That's our best guess for the price of your house.

To get a better feel for what's happening as you step down the valley, you can zoom in on just one dimension—the slope. The plot below shows how the error changes as we adjust the slope (while keeping the intercept fixed). Notice how the curve is smooth and bowl-shaped, making it clear how to keep stepping downhill.

At every step, we need to decide which way to move: left or right. To do that, we measure how steep the curve is at our current point. This measurement (the steepness of the curve) is called the derivative.

If the derivative is positive, it means the error increases if we move right, so we should go left. If the derivative is negative, it means the error increases if we move left, so we should go right.

In short: the derivative points uphill. Since we want to go downhill, we move in the opposite direction.

We keep stepping like this—checking the derivative, flipping directions if needed—until we land at a point where the slope is flat. That's when the derivative becomes zero, and we've reached the minimum.
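The left-or-right decision can be sketched in one dimension. Holding the intercept fixed, we compute the derivative of the squared error with respect to the slope and step against its sign (the dataset, starting point, and step size are all arbitrary illustrations):

```python
houses = [(1200, 190_000), (1850, 305_500), (2400, 375_000)]
intercept = 10_000  # held fixed for this one-dimensional walk

def error_derivative(slope):
    # d/d(slope) of the total squared error; positive means uphill to the right.
    return sum(2 * sqft * (slope * sqft + intercept - price) for sqft, price in houses)

slope, step = 100.0, 0.5
for _ in range(1_000):
    if error_derivative(slope) > 0:
        slope -= step  # uphill to the right, so go left
    else:
        slope += step  # uphill to the left, so go right

print(round(slope, 1))  # settles near 154, the flat bottom of the curve
```

With a fixed step size the walk ends up bouncing around the minimum rather than landing exactly on it, which is why practical versions scale the step by the derivative's magnitude instead.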

When using least squares, a zero derivative always marks a minimum. But that's not true in general. It could also be a maximum or an inflection point. To tell the difference between a minimum and a maximum, you'd need to look at the second derivative. If the second derivative is positive, it's a minimum. If it's negative, it's a maximum. If it's zero, we're at an inflection point.

This process of iteratively minimizing a function using derivatives is called gradient descent.
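A minimal end-to-end sketch, assuming squared error and using the derivative's magnitude as the step size (the dataset is illustrative, and the two learning rates are hand-picked because square footage and price live on very different scales; real implementations normalize the data instead):

```python
houses = [(1200, 190_000), (1850, 305_500), (2400, 375_000)]

def gradient(slope, intercept, data):
    # Partial derivatives of the mean squared error with respect to each parameter.
    n = len(data)
    d_slope = sum(2 * sqft * (slope * sqft + intercept - price) for sqft, price in data) / n
    d_intercept = sum(2 * (slope * sqft + intercept - price) for sqft, price in data) / n
    return d_slope, d_intercept

slope, intercept = 0.0, 0.0
lr_slope, lr_intercept = 1e-7, 1e-2  # hand-picked step sizes for unscaled features
for _ in range(100_000):
    d_slope, d_intercept = gradient(slope, intercept, houses)
    slope -= lr_slope * d_slope          # step opposite the gradient: downhill
    intercept -= lr_intercept * d_intercept

print(round(slope, 2), round(intercept))  # close to the least-squares fit for this data
```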

Gradient descent is an intuitive and efficient algorithm, and squared error plays especially nicely with it.

Squaring the errors keeps everything smooth: no sharp corners, no sudden jumps. That matters because gradient descent needs to measure slopes—and if a function isn't smooth, you can't always define a clear slope to follow.

This wasn't always obvious. In the 1800s, mathematicians like Karl Weierstrass showed that you can have a perfectly continuous curve that's still so jagged that it has no slope at all—not even at a single point. It was a wake-up call: just because something looks smooth from far away doesn't mean it actually is.

Minimizing absolute error runs into a smaller version of this problem. The absolute value function has a sudden bend at zero—right where you most need a clean slope. You can work around it with special tricks, but it's messier and less natural.

Squared error, on the other hand, glides along smoothly. Its derivative is clean, continuous, and geometrically meaningful—it not only points you downhill, but also tells you how big a step to take. No hacks required.

This is pretty much the real reason why everyone squares their errors instead of taking the absolute value, even when it might not be the most appropriate pick. Squared error makes optimization smooth and easy and, as we saw earlier, it also guarantees a single, stable best solution.

As a side note, there are many ways to find a function's minimum, but gradient descent rose to fame thanks to its handsome cousin: stochastic gradient descent, the algorithm of choice for training neural networks.

Underneath all the complexity, deep learning still runs on the same basic idea: adjusting parameters to minimize error, step by step, just like we do with Linear Regression and Squared Errors.

It's simple math, sharpened by persistence. The kind of quiet hustle that helps you put a price on a house and actually trust it.



Comments

  • By c7b 2025-05-08 06:55 (4 replies)

    One interesting property of least squares regression is that the predictions are the conditional expectation (mean) of the target variable given the right-hand-side variables. So in the OP example, we're predicting the average price of houses of a given size.

    The notion of predicting the mean can be extended to other properties of the conditional distribution of the target variable, such as the median or other quantiles [0]. This comes with interesting implications, such as the well-known properties of the median being more robust to outliers than the mean. In fact, the absolute loss function mentioned in the article can be shown to give a conditional median prediction (using the mid-point in case of non-uniqueness). So in the OP example, if the data set is known to contain outliers like properties that have extremely high or low value due to idiosyncratic reasons (e.g. former celebrity homes or contaminated land) then the absolute loss could be a wiser choice than least squares (of course, there are other ways to deal with this as well).

    Worth mentioning here I think because the OP seems to be holding a particular grudge against the absolute loss function. It's not perfect, but it has its virtues and some advantages over least squares. It's a trade-off, like so many things.

    [0] https://en.wikipedia.org/wiki/Quantile_regression

    • By easygenes 2025-05-08 09:48 (1 reply)

      Yeah. Squared error is optimal when the noise is Gaussian because it estimates the conditional mean; absolute error is optimal under Laplace noise because it estimates the conditional median. If your housing data have a few eight-figure outliers, the heavy tails break the Gaussian assumption, so a full quantile regression for, say, the 90th percentile will predict prices more robustly than plain least squares.

      • By c7b 2025-05-08 17:38

        True. But it's worth mentioning that normality is only required for asymptotic inference. A lot of things that make least squares stand out, like being a conditional mean forecast, or that it's the best linear unbiased estimator, hold true regardless of the error distribution.

        My impression is that many tend to overestimate the importance of normality. In practice, I'd worry more about other things. The example in the OP, eg, if it were an actual analysis, would raise concerns about omitted variables. Clearly, house prices depend on more factors than size, eg location. Non-normality here could be just an artifact of an underspecified model.

    • By lupire 2025-05-08 11:36 (3 replies)

      How does an incoming college student, or worse, someone who has already graduated, learn statistics like this, with a deep understanding of the meaning of the math, vs just plug-and-chugging cookbook formulas and "proving" theorems mechanically without the deep semantics?

      • By ayhanfuat 2025-05-08 12:48 (1 reply)

        Statistical Rethinking is quite good in explaining this stuff. https://xcelab.net/rm/

        • By disgruntledphd2 2025-05-08 14:37

          Basically all of the Andrew Gelman books are also good.

          Data Analysis...: https://sites.stat.columbia.edu/gelman/arm/
          Regression and Other Stories: https://avehtari.github.io/ROS-Examples/

          Wasserman's All of Statistics is a really good introduction to mathematical statistics (the Gelman stuff above are more practically and analytically focused).

          But yeah, it would probably be easier to find a good statistics course at a local university and try to audit it or do it at night.

      • By monkeyelite 2025-05-08 14:31 (1 reply)

        Don't take the “for engineers” version.

        > and "proving" theorems mechanically

        I think you’ve had a bad experience because writing a proof is explaining deep understanding.

        • By JadeNB 2025-05-08 15:52 (2 replies)

          > I think you’ve had a bad experience because writing a proof is explaining deep understanding.

          I think your wording is the key—coming up with a proof is creating deep understanding, but writing a proof very much need not be explaining or creating deep understanding. Writing a proof can be done mechanically, by both instructor and student, and, if done so, neither demonstrates nor creates understanding.

          (Also, in statistics more than in almost any other mathematically based subject, while the rigorous mathematical foundations are important, a complete theoretical understanding of those foundations need not shed any light on the actual practice of statistics.)

          • By alejohausner 2025-05-08 17:23

            You're right. Coming up with a proof is a creative process. Each major proof in mathematics is so unique, that it usually gets named after its inventor. So we have Euclid's proof that there are infinitely many primes, Euler's proof that e is irrational, and Wiles' proof of Fermat's last theorem.

          • By monkeyelite 2025-05-08 17:00

            > need not shed any light on the actual practice of statistics.

            That’s not what this comment asked for.

      • By c7b 2025-05-08 17:39

        I'd say reading about statistics and being curious is a great start :)

    • By levocardia 2025-05-08 17:35

      Quantile regression is great, especially when you need more than just the average. A quantile model for, say, the 10th and 90th percentiles of something are really useful for decision-making. There is a great R package called qgam that lets you fit very powerful nonlinear quantile models -- one of R's "killer apps" that keeps me from using Python full-time.

  • By easygenes 2025-05-08 07:46 (3 replies)

    This is very light and approachable but stops short of building the statistical intuition you want here. They fixate on the smoothness of squared errors without connecting that to the gaussian noise model and establishing how that relates to the predictive power against natural sorts of data.

    • By akst 2025-05-08 09:25 (1 reply)

      It isn't too hard to find resources on this for anyone genuinely looking to get a deeper understanding of a topic. I think a blog post (likely written for SEO purposes, which is in no way a knock against the content) is probably the wrong place for that kind of enlightenment, but I also think there are limits to the level of detail you can reasonably expect from a high-level blog post.

      And for introductory content there's always the risk that if you provide too much information you overwhelm the reader and make them feel like maybe this is too hard for them.

      Personally I find the process of building a model is a great way of learning all this.

      I think a course is probably helpful, but the problem with things like data camp is they are overly repetitive and they don't do a great job of helping you look up earlier content unless you want to scroll through a bunch of videos, where the formula goes on screen for 5 seconds.

      Would definitely just recommend getting a book for that stuff, I found "All of statistics" good, I just wouldn't recommend trying to read it from cover to cover, but I have found it good as a manual where I could just look up the bits I needed when I needed it. Tho the book may be a bit intimidating if you're unfamiliar with integration and derivatives (as they often express the PDF/CDF of random variables in those terms).

      • By jovial_cavalier 2025-05-08 12:31

        >I think a blog post... is probably the wrong place for that kind of enlightenment

        There's this site full of cool knowledgeable people called Hacker News which usually curates good articles with deep intuition about stuff like that. I haven't been there in years, though.

    • By jfjfjtur 2025-05-08 09:23 (1 reply)

      Yes, and it seems like it could’ve been written in part by an LLM. But the LLM could take your criticism, improve upon the original, and iterate that way until you feel that it has produced something close to an optimal textbook. The one thing missing is soul. I noticeably don’t feel like there was anyone behind this writing.

      • By easygenes 2025-05-08 10:16

        Ah, we’re resorting to ad machinum today. :)

    • By BlueUmarell 2025-05-08 08:58 (1 reply)

      Any resource/link you know of that further develops your point?

  • By stared 2025-05-08 09:51 (1 reply)

    I really recommend this explorable explanation: https://setosa.io/ev/ordinary-least-squares-regression/

    And for actual gradient descent code, here is an older example of mine in PyTorch: https://github.com/stared/thinking-in-tensors-writing-in-pyt...

    • By revskill 2025-05-08 12:22 (4 replies)

      Google search is evil by not giving me those resources.

      • By stared 2025-05-08 14:26

        Yeah - I wanted to post it here, but after searching for "linear regression explorable explanation" I got some other random links. Thankfully, I saved the PyTorch materials + https://pinboard.in/u:pmigdal/t:explorable-explanation.

      • By sorcerer-mar 2025-05-08 12:45

        This is an all-time great blog post for this line alone: "That's why we have statistics: to make us unsure about things."

        The interactive visualizations are a great bonus though!

      • By Nifty3929 2025-05-08 14:28

        Google does however provide this very nice course that explains these things in more detail: https://developers.google.com/machine-learning/crash-course

      • By mhb 2025-05-08 13:18 (1 reply)

        Kagi FTW?

        • By billbrown 2025-05-08 15:22

          That was my initial thought, too. But I didn't know what the original Google search consisted of and the site didn't show up in a couple Kagi searches I tried. (Aside from the obvious titular one, of course.)
