Linear regression, a.k.a. ordinary least squares (OLS), assumes that only Y has noise and that X is measured exactly.
Your "visual inspection" assumes both X and Y have noise. That's called Total Least Squares.
There is an illustration at https://en.wikipedia.org/wiki/Total_least_squares
Yep, to demonstrate, tilt it (swap x and y) and do it again. Maybe this is what TLS does?
>(swap x and y) and do it again.
This is a great diagnostic check for symmetry.
> Maybe this is what TLS does?
No, swapping just exchanges which variable the error is assumed to be in. What one needs to do is put the errors in X and the errors in Y on an equal footing. That's exactly what TLS does.
Another way to think about it is that the error of a point from the line is not measured as a vertical drop parallel to the Y axis, but in a direction orthogonal to the line (so that the error splits into X and Y components). From this orthogonality you can see that TLS is PCA (principal component analysis) in disguise.
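To make the PCA connection concrete, here's a minimal numpy sketch (synthetic data and noise levels are made up): OLS minimizes vertical residuals only, while the TLS line is just the leading principal component of the centered data.

```python
import numpy as np

rng = np.random.default_rng(0)

# True relation y = 2x, with noise added to BOTH coordinates
n = 10_000
x_true = rng.uniform(0, 10, n)
x = x_true + rng.normal(0, 1.0, n)      # noisy X measurements
y = 2 * x_true + rng.normal(0, 1.0, n)  # noisy Y measurements

# OLS slope: minimizes vertical (Y-direction) residuals only
xc, yc = x - x.mean(), y - y.mean()
beta_ols = (xc @ yc) / (xc @ xc)

# TLS slope: leading eigenvector of the covariance matrix,
# i.e. the direction minimizing orthogonal distances (PCA)
w, v = np.linalg.eigh(np.cov(np.stack([x, y])))
u = v[:, np.argmax(w)]
beta_tls = u[1] / u[0]

# OLS is biased toward zero here (attenuation from the X noise);
# TLS, with equal error variances, recovers the slope much better
print(beta_ols, beta_tls)
```

With these settings OLS lands around 1.8 (attenuated), while TLS stays near the true slope of 2.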
Had a QuantSci prof who was fond of asking "Who can name a data collection scenario where the x data has no error?" and then taught Deming regression as a generally preferred analysis [1].
Most of the time, if you have a sensor that you sample at, say, 1 kHz and you're using a reliable MCU and clock, the noise terms in the sensor will vastly dominate the jitter of the sampling.
So for a lot of sensor data, the error in the Y coordinate is orders of magnitude higher than the error in the X coordinate and you can essentially neglect X errors.
That is actually the case in most fields outside of maybe clinical chemistry and such, where Deming became famous for explaining it (despite not even inventing the method). Ordinary least squares originated in astronomy, where people tried to predict the movement of celestial objects. Timing a planet's position was never an issue (in fact, time is defined by celestial position), but getting the actual position of a planet was.
Total least squares regression is also non-trivial in practice, because you usually don't measure the same dimension on both axes. You can't just add up errors, because the fit would then depend on the scale you choose. Deming skirts this problem by using the ratio of the error variances (division also works across different units), but that ratio is rarely known well. Deming works best when the measurement method for the dependent and independent variable is the same (for example when you regress serum levels against one another), so the ratio is simply one. That of course implies they have the same unit, so you don't run into the scale-invariance issues you would in most natural-science fields.
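To illustrate the scale-dependence point with a toy example (names and noise levels are mine): OLS is exactly equivariant under rescaling x, e.g. changing its units, while plain TLS (Deming with variance ratio 1) is not.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
t = rng.uniform(0, 10, n)
x = t + rng.normal(0, 2.0, n)      # noisy independent variable
y = 2 * t + rng.normal(0, 2.0, n)  # noisy dependent variable

def ols_slope(x, y):
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / (xc @ xc)

def tls_slope(x, y):
    # orthogonal regression: leading principal component
    w, v = np.linalg.eigh(np.cov(np.stack([x, y])))
    u = v[:, np.argmax(w)]
    return u[1] / u[0]

c = 1000.0  # re-express x in different units (e.g. s -> ms)
# If a fit is scale-equivariant, slope(c*x, y) * c == slope(x, y)
ratio_ols = ols_slope(c * x, y) * c / ols_slope(x, y)
ratio_tls = tls_slope(c * x, y) * c / tls_slope(x, y)
print(ratio_ols)  # exactly 1 up to float error
print(ratio_tls)  # noticeably different from 1
```

Rescaling x leaves the OLS fit unchanged (in the new units), but it changes which direction TLS treats as "orthogonal", so the TLS answer genuinely depends on the units you picked.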
From that wikipedia article, delta is the ratio of y variance to x variance. If x variance is tiny compared to y variance (often the case in practice) then will we not get an ill-conditioned model due to the large delta?
If you take the limit delta -> infinity, you get beta_1 = s_xy / s_xx, which is the OLS estimator.
In the formula on the wiki page, factor delta^2 out of the sqrt and take delta to infinity and you will get a finite value. Apologies for not detailing the proof here; it's not so easy to type math...
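A quick numerical check of that claim, using the slope formula from the Deming wiki page (data here is synthetic): the estimator converges smoothly to OLS as delta grows, with no blow-up.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000
x = rng.uniform(0, 10, n)          # X measured exactly
y = 2 * x + rng.normal(0, 1.0, n)  # Y noisy

xc, yc = x - x.mean(), y - y.mean()
sxx, syy, sxy = xc @ xc, yc @ yc, xc @ yc

def deming_slope(delta):
    # Deming slope; delta = (Y error variance) / (X error variance)
    d = syy - delta * sxx
    return (d + np.sqrt(d * d + 4 * delta * sxy**2)) / (2 * sxy)

beta_ols = sxy / sxx
for delta in (1.0, 1e2, 1e6, 1e12):
    print(delta, deming_slope(delta))
print("OLS:", beta_ols)
```

The sqrt grows like delta * s_xx, which cancels the -delta * s_xx term, leaving the finite limit s_xy / s_xx. So a large delta just means the fit degenerates gracefully to OLS, not that it becomes ill-conditioned.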
In my field, the X data error (measurement jitter) is generally <10ns, which might as well be no error.
For most time series, noise in time measurement is negligible. However, this does not prevent complex coupling phenomena from occurring for other parameters, such as GPS coordinates.
The issue in that case is that OLS is BLUE, the best linear unbiased estimator (best in the sense of minimum variance). This property is what makes OLS exceptional.
You can think of it as: linear regression models only noise in y and not x, whereas ellipse/eigenvector of the PCA models noise in both x and y.
That brings up an interesting issue, which is that many systems do have more noise in y than in x. For instance, time series data from an analog-to-digital converter, where time is based on a crystal oscillator.
Well yeah, x is specifically the thing you control, y is the thing you don't. For all but the most trivial systems, y will be influenced by something besides x which will be a source of noise no matter how accurately you measure. Noise in x is purely due to setup error. If your x noise was greater than your y noise, you generally wouldn't bother taking the measurement in the first place.
“ If your x noise was greater than your y noise, you generally wouldn't bother taking the measurement in the first place.”
Why not? You could still do inference in this case.
You could, and maybe sometimes you would, but generally you won't. If at all possible, it makes a lot more sense to reduce the x noise, either with a better setup or by changing your x to something you can control more precisely.
This fact underlies a lot of causal inference.
I’m not an SME here and would love to hear more about this.
So when fitting a trend, e.g. for data analytics, should we use eigenvector of the PCA instead of linear regression?
(Generalized) linear models have a straightforward probabilistic interpretation -- E(Y|X) -- which I don't think is true of total least squares. So it's more of an engineering solution to the problem, and in statistics you'd be more likely to go for other methods such as regression calibration to deal with measurement error in the independent variables.
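A toy sketch of the regression-calibration idea (all names and values here are mine, and it assumes the X measurement-error variance is known): replace the noisy measurement W with an estimate of E(X|W), then run ordinary regression on that.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50_000
sigma_u = 1.0                      # assumed known X measurement error std
x = rng.normal(5, 2, n)            # true covariate (unobserved)
w = x + rng.normal(0, sigma_u, n)  # what we actually measure
y = 2 * x + rng.normal(0, 1, n)

def ols(a, b):
    ac, bc = a - a.mean(), b - b.mean()
    return (ac @ bc) / (ac @ ac)

naive = ols(w, y)  # attenuated toward zero by the X noise

# Regression calibration: E(X|W) = mu + lam * (W - mu),
# where lam is the reliability ratio var(X) / var(W)
lam = (w.var() - sigma_u**2) / w.var()
x_hat = w.mean() + lam * (w - w.mean())
corrected = ols(x_hat, y)
print(naive, corrected)
```

In this simple linear case regression calibration reduces to the classic attenuation correction, and the corrected slope recovers the true value of 2.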
Is there any way to improve upon the fit if we know that e.g. y is n times as noisy as x? Or more generally, if we know the (approximate) noise distribution for each free variable?
> Or more generally, if we know the (approximate) noise distribution for each free variable?
This was a thing 30-odd years ago in radiometric spectrometry surveying.
The X var was the time slot, a sequence of (say) one-second observation accumulation windows; the Yn vars were 256 (or 512, etc.) sections of the observable ground gamma-ray spectrum (many low-energy counts from the ground: uranium, thorium, potassium, and associated breakdown daughter products; some high-energy counts from the effectively infinite cosmic background that made it through the radiation belts and atmosphere to near-surface altitudes).
There was a primary NASVD (Noise Adjusted SVD) algorithm (a simple variance adjustment based on expected gamma event distributions by energy level) and a number of tweaks and variations based on how much other knowledge seemed relevant (broad-area geology, radon expression by time of day, etc.).
See, e.g.: Improved NASVD smoothing of airborne gamma-ray spectra, Minty & McFadden (1998) - https://connectsci.au/eg/article-abstract/29/4/516/80344/Imp...
Yeah, you can generally "whiten" the problem by scaling it in each axis until the variance is the same in each dimension. What you describe is x and y having a covariance matrix like

[ σ², 0;
  0, (nσ)² ]

but whitening also works for any arbitrary covariance matrix.

It might be cool to train a neural network by minimizing error with the assumption that there's noise on both inputs and outputs.
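A minimal sketch of that whitening idea, assuming the two error standard deviations are known (the values here are made up): rescale each axis by its error std dev so the noise becomes isotropic, run orthogonal regression there, then map the slope back.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
sigma_x, sigma_y = 0.5, 2.5   # assumed known error std devs (y is 5x noisier)
t = rng.uniform(0, 10, n)
x = t + rng.normal(0, sigma_x, n)
y = 3 * t + rng.normal(0, sigma_y, n)

# Whiten: divide each axis by its error std dev so both
# error variances become 1 (isotropic noise)
xw, yw = x / sigma_x, y / sigma_y

# Orthogonal (TLS) regression in the whitened space
w, v = np.linalg.eigh(np.cov(np.stack([xw, yw])))
u = v[:, np.argmax(w)]
slope_whitened = u[1] / u[0]

# Undo the scaling to get the slope in original units
slope = slope_whitened * sigma_y / sigma_x
print(slope)  # close to the true slope of 3
```

This is equivalent to Deming regression with delta = (sigma_y / sigma_x)^2, which is one way to see why Deming only needs the ratio of the error variances, not their absolute sizes.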