https://www.ibm.com/think/topics/linear-regression
A proven way to scientifically and reliably predict the future
Business and organizational leaders can make better decisions by using linear regression techniques. Organizations collect masses of data, and linear regression helps them use that data to better manage reality, instead of relying on experience and intuition. You can take large amounts of raw data and transform it into actionable information.
You can also use linear regression to gain better insights by uncovering patterns and relationships that your business colleagues may have seen before and assumed they already understood.
For example, performing an analysis of sales and purchase data can help you uncover specific purchasing patterns on particular days or at certain times. Insights gathered from regression analysis can help business leaders anticipate times when their company’s products will be in high demand.
Linear regression, for all its faults, forces you to be very selective about the parameters you believe to be meaningful, and offers simple tools to validate the fit (e.g. examining residuals, or posterior predictive simulations if you want to be fancy).
ML and beyond, on the other hand, throws you into a whirl of hyperparameters that you no longer understand, and traps even clever people in overfitting they don't recognize.
Obligatory xkcd: https://xkcd.com/1838/
So a better critique, in my view, would be something that J.W. Tukey wrote in his famous 1962 paper (paraphrasing because I'm lazy):
"better to have an approximate answer to a precise question rather than an answer to an approximate question, which can always be made arbitrarily precise".
So our problem is not the tools, it's that we fool ourselves by applying the tools to the wrong problems because they are easier.
This can be seen as another occurrence of the "bitter lesson": http://www.incompleteideas.net/IncIdeas/BitterLesson.html
I indeed find the lesson it describes unbearably bitter. Searching and learning, as used in the article, may discover patterns and results (thanks to effectively unlimited scaling of computation) that we humans are physically incapable of discovering -- however, all those learnings will have no meaning; they will not expose any causality. This is what I find unbearable, as it implies that the real world must ultimately remain impervious to human cognizance; it implies that our meaning- and causality-based human reasoning ultimately falls short of modeling the world, while general, computation-only methods (given ever-growing computing power) at least "converge" to a faithful (but meaningless) description of the world.
See examples like protein folding, medicine research, AI-assisted diagnosis, self-driving cars. We're going to rely on their results, but we'll never know why those results work. We're not going to reject self-driving cars if they save more lives per distance and/or time driven; however, we're going to sit in, and drive, those cars blind. To me that's an unbearable thought, even apart from the possibility that at some point the system might break down and cause a huge accident inexplicably. An inexplicable misbehavior of the system is of course catastrophic, but to me even the inexplicable proper behavior of the system is an unsettling thought -- because it is inexplicable.
Edited to add: I think the phrase "how we think we think" is awesome in the essay. We don't even know how our reasoning works, so trying to "machinize" those misconceptions is likely bound to fail.
The notion of predicting the mean can be extended to other properties of the conditional distribution of the target variable, such as the median or other quantiles [0]. This comes with interesting implications, such as the well-known properties of the median being more robust to outliers than the mean. In fact, the absolute loss function mentioned in the article can be shown to give a conditional median prediction (using the mid-point in case of non-uniqueness). So in the OP example, if the data set is known to contain outliers like properties that have extremely high or low value due to idiosyncratic reasons (e.g. former celebrity homes or contaminated land) then the absolute loss could be a wiser choice than least squares (of course, there are other ways to deal with this as well).
Worth mentioning here I think because the OP seems to be holding a particular grudge against the absolute loss function. It's not perfect, but it has its virtues and some advantages over least squares. It's a trade-off, like so many things.
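To make the robustness point concrete, here's a tiny toy sketch (my own made-up numbers, not from the article or the comment): the squared-loss minimizer is the mean, the absolute-loss minimizer is the median, and a single extreme "celebrity home" drags the mean far more than the median.

```python
import numpy as np

# Toy house prices with one extreme outlier (all numbers invented for illustration).
prices = np.array([250_000, 300_000, 320_000, 350_000, 5_000_000])
print(prices.mean())      # 1,244,000 -- pulled way up by the outlier
print(np.median(prices))  # 320,000   -- barely affected

# Brute-force check: which constant c minimizes each loss over a grid?
grid = np.linspace(prices.min(), prices.max(), 100_001)
sq = ((prices[:, None] - grid) ** 2).sum(axis=0)   # squared loss
ab = np.abs(prices[:, None] - grid).sum(axis=0)    # absolute loss
print(grid[sq.argmin()])  # ~ the mean
print(grid[ab.argmin()])  # ~ the median
```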
My impression is that many tend to overestimate the importance of normality. In practice, I'd worry more about other things. The example in the OP, for instance, if it were an actual analysis, would raise concerns about omitted variables. Clearly, house prices depend on more factors than size, e.g. location. Non-normality here could be just an artifact of an underspecified model.
Data Analysis...: https://sites.stat.columbia.edu/gelman/arm/
Regression and Other Stories: https://avehtari.github.io/ROS-Examples/
Wasserman's All of Statistics is a really good introduction to mathematical statistics (the Gelman books above are more practically and analytically focused).
But yeah, it would probably be easier to find a good statistics course at a local university and try to audit it or do it at night.
> and "proving" theorems mechanically
I think you’ve had a bad experience, because writing a proof is explaining deep understanding.
I think your wording is the key—coming up with a proof is creating deep understanding, but writing a proof very much need not be explaining or creating deep understanding. Writing a proof can be done mechanically, by both instructor and student, and, if done so, neither demonstrates nor creates understanding.
(Also, in statistics more than in almost any other mathematically based subject, while the rigorous mathematical foundations are important, a complete theoretical understanding of those foundations need not shed any light on the actual practice of statistics.)
That’s not what this comment asked for.
For gradients, Stanford CS229 [1] jumps right into it.
[0] https://www.stat.cmu.edu/~cshalizi/mreg/15/lectures/06/lectu...
[1] https://cs229.stanford.edu/lectures-spring2022/main_notes.pd...
And for introductory content there's always the risk that if you provide too much information you overwhelm the reader and make them feel like maybe this is too hard for them.
Personally I find the process of building a model is a great way of learning all this.
I think a course is probably helpful, but the problem with things like DataCamp is that they are overly repetitive and don't do a great job of helping you look up earlier content, unless you want to scroll through a bunch of videos where the formula is on screen for 5 seconds.
Would definitely just recommend getting a book for that stuff. I found "All of Statistics" good; I just wouldn't recommend trying to read it cover to cover, but it has worked well as a manual where I could look up the bits I needed when I needed them. Though the book may be a bit intimidating if you're unfamiliar with integration and derivatives (as it often expresses the PDF/CDF of random variables in those terms).
There's this site full of cool knowledgeable people called Hacker News which usually curates good articles with deep intuition about stuff like that. I haven't been there in years, though.
There are plenty of error formulations that give a smooth loss function, and many even a convex one, but most don't have analytical solutions so they are solved via numerical optimization like GD.
The main message is IMHO correct though: square error (and its implicit Gaussian noise assumption) is all too often used just out of convenience and tradition.
And in any case, nobody uses GD for regressions for statistical analysis purposes. In practice, Newton-Raphson or other more sophisticated schemes with much nicer convergence properties (and much higher computation, memory, and I/O demands) are used.
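To make the first point concrete, here's a toy sketch of my own (not from the thread): the Huber loss is smooth and convex but has no closed-form solution, so it has to be minimized numerically -- plain gradient descent is used here purely for illustration, even though, as noted above, statistical software typically prefers Newton-type schemes.

```python
import numpy as np

# Toy data: house sizes vs. prices with one extreme outlier (all numbers made up).
rng = np.random.default_rng(0)
x = rng.uniform(500, 3000, 100)
y = 50_000 + 150 * x + rng.normal(0, 20_000, 100)
y[0] += 1_000_000  # an outlier, e.g. a former celebrity home

def huber_grad(residual, delta=1.0):
    # Derivative of the Huber loss with respect to the residual:
    # linear gradient near zero, clipped gradient in the tails.
    return np.where(np.abs(residual) <= delta, residual, delta * np.sign(residual))

# Standardize so a single learning rate works for both parameters.
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()

a, b = 0.0, 0.0   # slope and intercept in standardized units
lr = 0.1
for _ in range(500):
    r = (a * xs + b) - ys
    g = huber_grad(r)
    a -= lr * np.mean(g * xs)
    b -= lr * np.mean(g)

print(a, b)  # a robust fit, largely ignoring the outlier
```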
What are the common cases for 10^5 parameter OLS? Perhaps something like weather models could include such computations?
Square error is used because minimizing it yields the maximum likelihood estimate under the assumption that observation noise is normally distributed, not because it has an analytical solution.
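For reference, the standard derivation behind that claim (sketched here for completeness, not quoted from the thread): assume y_i = x_i^T beta + eps_i with i.i.d. Gaussian noise eps_i ~ N(0, sigma^2); then the log-likelihood is

```latex
\log L(\beta)
  = \sum_{i=1}^{n} \log\!\left[\frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left(-\frac{(y_i - x_i^\top \beta)^2}{2\sigma^2}\right)\right]
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n} (y_i - x_i^\top \beta)^2 ,
```

so maximizing the likelihood over beta is exactly minimizing the sum of squared errors, whatever sigma is.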
I think that as a field, Machine Learning is the exception rather than the norm, where people start off with, or proceed rapidly to, non-linear models, huge datasets and (stochastic) gradient-based solvers.
Gaussianity of errors is more of a post-hoc justification (which is often not even tested) for fitting with OLS.
Even the most popular more complicated models, like multilevel (linear) regression, make use of the mathematical convenience of the square error, even though the solutions aren't fully analytical.
Square error indeed gives maximum likelihood estimates under normally distributed noise, but as I said, this assumption is quite often implicit, and not really well understood by many practitioners.
Analytical solutions for squared errors have a long history for more or less all fields using regression and related models, and there's a lot of inertia for them. E.g. ANOVA is still the default method (although being replaced by multilevel regression) for many fields. This history is mainly due to the analytical convenience as they were computed on paper. That doesn't mean the normality assumption is not often justifiable. And when not directly, the traditional solution is to transform the variables to get (approximately) normally distributed ones for analytical solutions.
https://www.inference.vc/notes-on-the-origin-of-implicit-reg...
Dissecting the Effects of SGD Noise in Distinct Regimes of Deep Learning https://arxiv.org/abs/2301.13703
That's one of the reasons that multicollinearity is seen as a big deal by statisticians, but ML practitioners couldn't give a hoot.
The point about PCA applies to population genetics and psychometrics (IQ). Some conclusions have been derived using PCA that appear to be supported by little else, and these have come under question.
Statistical modeling is done primarily in service of scientific discovery--for the purpose of making an inference (population estimate from a sample) or a comparison to test a hypothesis derived from a theoretical causal model of a real-world process before viewing data. The parameters of a model are interpreted because they represent an estimate of a treatment effect of some intervention.
Methods like PCA can be part of that modeling process either way, but analyzing and fitting models to data to mine it for patterns without an a priori hypothesis is not science.
But theoretically speaking, in a scientific context, why would you want to fit an explanatory model that includes multiple highly (but not perfectly) correlated independent variables?
It shouldn't be an accident. Usually it's because you've intentionally taken multiple proxy measurements of the same theoretical latent variable and you want to reduce measurement error. So that becomes a part of your measurement and modeling strategy.
I mention the relation to the gaussian distribution. Which part of the comment is incorrect?
OLS is popular because it gives correct answers as a result of the CLT
I did the Stats I -> II -> III pipeline at uni, but you should be fitting basic linear models by the end of Stats I
I used to use em dashes before they were cool. I actually learned about them when I emailed a guy who's a software engineer at Genius and also writes for The New Yorker and The Atlantic.
I asked him for tips on how to write well and he recommended that I read Steven Pinker's "The Sense of Style", which uses em dashes exhaustively, and explains when and why one should use them.
It also pains me that I can't use them anymore or else people will think an AI did the writing.
vs
I also recommend "The Sense of Style"--knowing how to wield punctuation and grammatical structure is critical for clearly and successfully articulating your ideas--and I use semicolons, colons, and parentheticals heavily (but en dashes and em dashes are great too).
I find that dashes are great for a conversational, flowing sentence structure, but sometimes the sentences can become too long and tiring for the reader.
Previously I rarely saw it used in my English-as-second-language peer group, even by otherwise decent writers. Now I see it everywhere in personal/professional updates in my feed. The simpler assumption is that people over-rely on LLMs for crafting these posts, and LLMs disproportionately use em dashes.
And for actual gradient descent code, here is an older example of mine in PyTorch: https://github.com/stared/thinking-in-tensors-writing-in-pyt...
The interactive visualizations are a great bonus though!
I always find those counters to greatly overestimate reading speed, but for a technical article like this it's outright insulting, to be honest.
When you intimately understand a topic, you have an intuition that naturally paves over gaps and bumps. This is excellent for getting work done, but terrible for teaching. Your road from start to finish is 12 minutes, and without that knack for teaching, you are unable to see what that road looks like to a beginner.
I spent some time making it work with interpolation so that the transitions are smooth.
Then I expanded to another version, including a small neural network (nn) [1].
And finally, for the two functions that have a 2d parameter space, I included a viz of the loss [2]. You can click on the 2d space and get a new initial point for the descent, and see the trajectory.
Never really finished it, though I wrote a blog post about it [3]
[0] https://gradfront.pages.dev/
[1] https://f36dfeb7.gradfront.pages.dev/
Are "first" and "second" switched here?
In a few rare cases I have found situations where sqrt(y) or 1/y is a clever and useful transform but they're very situational, often occurring when there's some physical law behind the data generation process with that sort of mathematical form.
The "trick" allows you to fit a linear function in that higher dimensional space without any potentially costly explicit computation in the higher dimensional space based on the observation that the optimal solution's parameters can be represented as a sum of the higher dimensional representations of points in the training set.
It's not about finding a line of best fit or making the dataset appear linear, it's about being able to split a dataset into two classes using a linear function.
Given that the kernel trick is pretty specific jargon used mostly in a specific circumstance, it's in your interest to use that term in that specific context. If you're interested in the more general idea of making things work with respect to some function, which can be linear or Gaussian or some other form, the term is "feature transformation".
In relation to gradient descent, I don't know enough to say whether multiple regression is relevant at all, or why not.
And yeah, for non-normal error distributions, we should be looking at generalized linear models, which allow one to specify other distributions that might better fit the data.
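For instance, a minimal sketch of one such GLM (a Poisson model for counts; the toy data and the use of statsmodels are my own assumptions, not from the comment):

```python
import numpy as np
import statsmodels.api as sm

# Made-up count outcome that is log-linear in a single predictor x.
rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 200)
counts = rng.poisson(np.exp(0.5 + 1.2 * x))

X = sm.add_constant(x)                                    # intercept + x
poisson_fit = sm.GLM(counts, X, family=sm.families.Poisson()).fit()
print(poisson_fit.params)                                 # roughly (0.5, 1.2) on the log scale
```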
This kind of problem is actually a good intro to iterative refitting methods for regression models: How do you know what the weights should be? Well, you fit the initial model with no weights, get its residuals, use those to fit another model, rinse and repeat until convergence. A good learning experience and easy to hand-code.
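One way to realize that recipe is sketched below (a rough hand-coded version under my own assumptions -- the variance model, toy data, and convergence tolerance are illustrative choices, not the comment's): fit OLS, model the squared residuals to estimate the variance at each x, reweight, and repeat.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 200)
X = np.column_stack([np.ones_like(x), x])
y = 2 + 3 * x + rng.normal(0, 0.5 * x, 200)        # noise grows with x

beta = np.linalg.lstsq(X, y, rcond=None)[0]         # step 1: unweighted fit
for _ in range(20):
    resid = y - X @ beta
    # Step 2: fit a simple model for log(residual^2) to estimate the variance.
    gamma = np.linalg.lstsq(X, np.log(resid**2 + 1e-12), rcond=None)[0]
    w = 1.0 / np.exp(X @ gamma)                      # weights = 1 / estimated variance
    # Step 3: weighted least squares with those weights.
    sw = np.sqrt(w)[:, None]
    beta_new = np.linalg.lstsq(sw * X, np.sqrt(w) * y, rcond=None)[0]
    if np.allclose(beta_new, beta, rtol=1e-6):       # step 4: repeat until stable
        break
    beta = beta_new

print(beta)   # close to (2, 3), with better-calibrated uncertainty than plain OLS
```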
A while ago I think I even proved to myself that this hypothetical mechanical system is mathematically equivalent to doing a linear regression, since the system naturally tries to minimize the potential energy.
Technically, physical springs will also have momentum and overshoot/oscillate. But even this is something that is used in practice: gradient descent with momentum.
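A tiny sketch of that idea (my own toy example, not from the comment): a "velocity" term accumulates past gradients, so the estimate can overshoot and oscillate a little before settling, much like the springs.

```python
import numpy as np

# One-parameter least squares fit via gradient descent with momentum.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 4.0 * x + rng.normal(0, 0.1, 100)

a = 0.0          # slope estimate
v = 0.0          # velocity (momentum state)
lr, mu = 0.05, 0.9
for _ in range(200):
    grad = np.mean(2 * (a * x - y) * x)   # d/da of the mean squared error
    v = mu * v - lr * grad
    a += v

print(a)   # converges to roughly 4.0
```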
The reason to look at statistical assumptions is that we want to make probabilistic/statistical statements about the response variable, such as where its central tendency lies and how much it varies as values of X change. The response variable is not easy to measure.
Now, one can easily determine, for example using OLS (or gradient descent), the point estimates for the parameters of a line that needs to be fit to two variables X and Y, without using any probability or statistical theory. OLS is, in point of fact, just an analytical result and has nothing to do with the theory of statistics or inference. The assumptions of simple linear regression are statistical assumptions which can be right or wrong, but if they hold, they help us make inferences, like:
- Is the response variable varying uniformly over values of another r.v., X (the predictors)?
- Assuming an r.v. Y, what model can we make if its expectation is a linear function of X?
So why do we make statistical assumptions instead of just point estimates?
Because no point of measurement can be certain, and making those assumptions is one way of quantifying uncertainty. Indeed, going through history one finds that regression's use outside experimental data (Galton, 1885) was discovered well after least squares (Gauss/Legendre, 1795-1809). The fundamental desire to understand natural variation in data was the original motivation. In Galton's case he wanted to study hereditary traits like wealth over generations, as well as others like height, status, and intelligence. (Coincidentally, that is also what makes linear regression a good tool for studying this: the idea of regression to the mean. Very wealthy or very poor families don't remain so over the generations; they regress towards the mean, and so it is with societal class and intelligence over generations.) When you follow this arc of reasoning, you come to the following _statistical_ conditions the data must satisfy for linear assumptions to work(ish):
- Linear mean function of the response variable conditioned on a value of X: E[Y | X = x] = \beta_0 + \beta_1 x
- Constant variance of the response variable conditioned on a value of X: Var[Y | X = x] = \sigma^2 (or actually, just finite also works well)
Going further, one can also assume that Y|X is:
- A Binomial/Multinomial random variable, which gives you the cross-entropy-like loss function.
- A Normal random variable, which gives you the squared loss.
This point is where many ML textbooks skip to directly. It's not wrong to do this, but it is a much narrower intuition of how regression works!
But there is no reason Y needs to follow those two DGPs (the process could be a Poisson or a mean-reverting process)! There is no reason to believe, prima facie and a priori, that Y|X follows those assumptions. This also gives motivation for using other kinds of models.
It's why you first test whether those statistical assumptions hold, carefully, with a bit of EDA; from that comes some appreciation and understanding of how linear regression actually works.
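As a minimal sketch of that kind of EDA check (toy data and all choices here are mine, not the comment's): plot residuals against fitted values after an OLS fit; a curved trend suggests the linear-mean condition fails, while a funnel shape suggests the constant-variance condition fails.

```python
import numpy as np
import matplotlib.pyplot as plt

# Toy data where the variance grows with x, so the constant-variance condition fails.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 300)
y = 1.0 + 2.0 * x + rng.normal(0, 0.3 * x, 300)

X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
resid = y - fitted

plt.scatter(fitted, resid, s=8)
plt.axhline(0, color="k", lw=1)
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.title("Residuals vs. fitted: look for curvature or a funnel shape")
plt.show()
```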
He isn't talking about how to calculate the linear regression, correct? He's talking about why using squared distances between data points and our line is preferred over using absolute distances. Also, I don't think he explains why absolute distances can produce multiple results? These aren't criticisms, I'm just trying to make sure I understand.
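On the multiple-results question, a quick numeric illustration (my own toy numbers): with an even number of points, every value between the two middle points minimizes the total absolute distance, whereas the squared distance has a single minimizer, the mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 10.0])
for c in [2.0, 2.5, 3.0]:
    print(c, np.abs(y - c).sum())              # all three print 10.0: a whole interval of minimizers
m = y.mean()
print(m, ((y - m) ** 2).sum())                 # the mean is the unique squared-loss minimizer
```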
ISTM that you have no idea how good your regression formula (y = ax + c) is without further info. You may have random data all over the place, and yet you will still come out with one linear regression to rule them all. His house price example is a good example of this: square footage is, obviously, only one of many factors that influence price -- and also the most easily quantified factor by far. Wouldn't a standard deviation be essential info to include?
Also, couldn't the fact that squared distance gives us only one result actually be a negative, since it can so easily oversimplify and therefore cut out a whole chunk of meaningful information?
It's interesting to continue the analysis into higher dimensions, which have interesting stationary points that require looking at the matrix properties of a specific type of second order derivative (the Hessian) https://en.wikipedia.org/wiki/Saddle_point
In general it's super powerful to convert data problems like linear regression into geometric considerations.
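A tiny sketch of that Hessian test (my own example, using the classic saddle f(x, y) = x^2 - y^2): the gradient vanishes at the origin, but the Hessian has eigenvalues of mixed sign, so the stationary point is a saddle rather than a minimum.

```python
import numpy as np

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])       # Hessian of f(x, y) = x^2 - y^2
eigvals = np.linalg.eigvalsh(H)
print(eigvals)                    # [-2.  2.] -> mixed signs: saddle point, not a minimum
```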