In my case the model was logistic regression and I was looking at the classification boundary lines, but the thought is largely the same: viewing it as a 3D shape formed by the boundary lines, and treating hilltops as regions where entire classification boundaries disappeared once the regularization coefficient grew large enough to eliminate them. Impractical on models of any size, and only useful when looking at two features at a time, but a fun consideration.
More on topic with the article: how well does this work when you consider multiple features and the combinations between them? Instead of sigma(n => 50) of x^n, what happens if you have sigma(n => 50) of sigma(m => 50) of (x^n)*(y^m)? Well, probably you'd use less than 50 in the second case; maybe it is fairer to let n and m go to 7, so there are 49 total terms compared to the original 50, instead of the 2,500 terms you'd get if they both went to 50.
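A quick counting sketch (plain Python, nothing from the article; the total-degree cutoff at the end is just one common compromise, not something the article prescribes) shows how fast the cross terms pile up:

```python
from itertools import product

max_deg = 50

# Full tensor product: every exponent pair (n, m) with 0 <= n, m <= max_deg.
tensor = [(n, m) for n, m in product(range(max_deg + 1), repeat=2)]

# A common compromise: keep only terms whose total degree n + m <= max_deg.
total_degree = [(n, m) for (n, m) in tensor if n + m <= max_deg]

print(len(tensor))        # (50 + 1)**2 = 2601 terms
print(len(total_degree))  # 51 * 52 / 2 = 1326 terms
```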
Worth a read if you're ever fitting functions.
They're the kind of math that is non-obvious and a bit intricate, yet if you knew the basics of how they worked and you were bored, you might sit down with pencil and paper and derive everything about them. Then you wouldn't be bored anymore.
[0] https://en.wikipedia.org/wiki/Standard_basis#Generalizations
So... are _all_ introductions to machine learning just extremely wrong here? I feel like I've seen tens of reputable books and courses that introduce overfitting and generalization using severe overfitting and terrible generalization of high-degree polynomials in the usual basis (1,x,x^2,...). Seemingly everyone warns of the dangers of high-degree polynomials, yet here the author just says "use another basis" and proves everyone wrong? Mind blown, or is there a catch?
The other catch, which the author mentions, is that extrapolation is still essentially impossible with polynomials. This is easy to see: the highest-degree term quickly dominates all the others once you step outside the interval the data lives in, and every non-constant polynomial grows without bound.
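A minimal sketch of that blow-up (made-up noisy sine data on [0, 1], not the article's setup):

```python
import numpy as np
from numpy.polynomial import chebyshev as C

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x) + 0.01 * rng.standard_normal(x.size)

# A degree-20 fit in a well-conditioned basis; Chebyshev.fit maps the data
# interval [0, 1] to the basis's natural window internally.
p = C.Chebyshev.fit(x, y, deg=20)

print(p(0.5))   # sensible inside the data interval (close to sin(pi) = 0)
print(p(1.5))   # a little outside the interval: already huge
print(p(3.0))   # far outside: astronomically large
```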
Honestly, though, what the author is doing isn't much different from a Taylor approximation. If you're okay fitting any polynomial in a small neighbourhood of the data, then go whole hog and fit the best possible polynomial: a Taylor polynomial. You usually wouldn't need such a high degree to get what you want in a neighbourhood either.
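For what it's worth, on a small enough neighbourhood the two are nearly indistinguishable. Rough sketch (exp around 0 is my own choice of example):

```python
import numpy as np
from math import factorial
from numpy.polynomial import polynomial as P

# Degree-5 Taylor coefficients of exp(x) around 0: 1/k!
taylor = np.array([1.0 / factorial(k) for k in range(6)])

# Degree-5 least-squares polynomial fit of exp on a small neighbourhood of 0.
x = np.linspace(-0.1, 0.1, 1000)
ls_fit = P.polyfit(x, np.exp(x), deg=5)

print(taylor)
print(ls_fit)   # agrees with the Taylor coefficients to several digits
```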
It's more of a heuristic. Most people have their first experience in Excel, where you can fit a polynomial trendline. Cranking up the degree will always improve R² (since Excel doesn't do a holdout), so it's a very common mistake for new students to make.
It's much more understandable at the beginner level to say "you'll overfit if you crank up the degree" than it is to explain regularization and bases. Later on you can introduce those ideas, but early on they're confusing and distracting to students who might not even know how to solve an ordinary least squares model.
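The Excel failure mode is easy to reproduce; a small sketch with made-up noisy quadratic data (the function and split are arbitrary):

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def r2(y, yhat):
    return 1.0 - np.sum((y - yhat) ** 2) / np.sum((y - y.mean()) ** 2)

# Noisy quadratic data, split into a training set and a holdout set.
x = rng.uniform(0.0, 1.0, 60)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.3 * rng.standard_normal(x.size)
x_tr, y_tr, x_te, y_te = x[:40], y[:40], x[40:], y[40:]

for deg in (1, 2, 5, 10, 15):
    c = P.polyfit(x_tr, y_tr, deg)
    print(deg, r2(y_tr, P.polyval(x_tr, c)), r2(y_te, P.polyval(x_te, c)))
# Training R^2 never goes down as the degree grows; holdout R^2 eventually does.
```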
1) Data availability and compute limited training set sizes. 2) They could simulate infinite datasets.
While hard to wrap your head around, training set sizes today make it highly likely that the patterns in your test set are similar to concept classes in your training set.
This is very different from procedurally or randomly generated test sets, both of which can lead to problems like overfitting with over-parameterized networks.
When similar patterns are likely to exist, the cost of some memorization goes down, and memorization actually becomes somewhat helpful for generalization.
There are obviously more factors at play here, but go look at the double descent papers and their citations to early-'90s papers and you will see this.
The low sensitivity of transformers also helps dramatically, with UHAT without CoT only having the expressiveness of TC0, and with a log-space scratch space having PTIME expressibility.
You can also view this through the lens of autograd requiring a smooth manifold with the ability to approximate a global gradient, if that works better for you.
But yes, all intros have to simplify concepts, and there are open questions.
The article's examples only work because the interval 0 to 1 (or -1 to 1) was chosen. For whatever reason the author does not point that out, or even acknowledge that had he chosen a larger interval, the limitations of floating-point arithmetic would have ruined the argument he was trying to make.
10^100 is a very large number and numerically difficult to handle. For whatever reason the author pretends this is not a valid reason to be cautious about high-degree polynomials.
The series of articles posted here is interesting and I plan to review them in more detail. But I'm concerned about what the "unknown unknowns" are.
Try the examples in the article with the interval 0 to 10.
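A quick way to see it, comparing the conditioning of the raw monomial design matrix on [0, 10] against the same data min-max scaled to [0, 1] (degree 20 is already enough to make the point):

```python
import numpy as np

x_raw = np.linspace(0.0, 10.0, 200)
x_scaled = x_raw / 10.0          # min-max scaling to [0, 1]
deg = 20

V_raw = np.vander(x_raw, deg + 1)
V_scaled = np.vander(x_scaled, deg + 1)

print(np.linalg.cond(V_raw))     # enormous: the columns span ~20 orders of magnitude
print(np.linalg.cond(V_scaled))  # still large for raw monomials, but far more usable
```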
The "particular form" is important--if you don't know that then there is no reason for choosing x^10000 over x^5 (other than computational complexity). But there is also no reason for choosing x^5 over x^10000! Maybe that function really is flat and maybe it isn't. You really just don't know.
If you don't have anything too weird, Fourier is pretty much optimal in the limit--if a bit expensive to calculate. In addition, since you can "bandwidth limit" it, you can very easily control overfitting and oscillation. Even more, Fourier often reflects something about the underlying process (epicycles, for example, were correct--they were pointing at the fact that orbits were ellipses).
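Here's a tiny least-squares version of that "bandwidth limiting" idea (my own toy signal, nothing from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 400, endpoint=False)
y = np.sin(2 * np.pi * x) + 0.3 * np.cos(6 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

def fourier_design(x, n_harmonics):
    # Columns: 1, cos(2*pi*k*x), sin(2*pi*k*x) for k = 1..n_harmonics.
    cols = [np.ones_like(x)]
    for k in range(1, n_harmonics + 1):
        cols.append(np.cos(2 * np.pi * k * x))
        cols.append(np.sin(2 * np.pi * k * x))
    return np.column_stack(cols)

# Bandwidth limiting: keep only the first few harmonics to tame overfitting.
A = fourier_design(x, n_harmonics=4)
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print(coef.round(3))   # the k=1 sine and k=3 cosine coefficients dominate
```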
"Optimal" by what metric? Why is projecting on the Fourier basis "better" than projecting on a polynomial basis?
The interval was chosen as 0 to 1. This single fact is what made this feasible. Had the interval been chosen as 0 to 10, a degree-100 polynomial would have to compute 10^100, which would have led to drastic numerical errors.
The article totally fails to give any of the legitimate and very important reasons why high-degree polynomials are dangerous. It is absurd to say that well-known numerical problems do not exist just because you found one example where they did not occur.
"The second source of their bad reputation is misunderstanding of Weierstrass’ approximation theorem. It’s usually cited as “polynomials can approximate arbitrary continuous functions”. But that’s not entrely true. They can approximate arbitrary continuous functions in an interval. This means that when using polynomial features, the data must be normalized to lie in an interval. It can be done using min-max scaling, computing empirical quantiles, or passing the feature through a sigmoid. But we should avoid the use of polynomials on raw un-normalized features."
As I understand it, one of the main ideas of this series of posts is that normalizing features to very specific intervals is important when fitting polynomials. I don't think this "went completely uncommented".
The scaling to an interval in the quote is about formal mathematical reasons, in particular that polynomials do not approximate continuous functions globally. This is totally unrelated to numerics.
The issue is that the interval 0 to 1 in particular has to be chosen, as otherwise the numerics totally fall apart. The message of the article is that high-degree polynomials pose no danger, but that is wrong. All the examples in the article only work because of a specific choice of interval. All the major numerical issues are totally ignored, and they would immediately invalidate the core thesis of the article. If you calculate 10^100 in 64-bit floating point you will run into trouble. The article pretends that will not be the case.
Indeed, the examples work thanks to this choice of interval, but that comes with the choice of basis. Of course Bernstein basis functions explode outside [0,1], but I think the point is that high-degree polynomials pose no danger if you scale the data *according to the polynomial basis* (use [0,1] for Bernstein and [-1,1] for Chebyshev, for example). So the "magic combo" is polynomial + scaling to its interval. Otherwise all bets are off, of course.
"Any polynomial basis has a “natural domain” where its approximation properties are well-known. Raw features must be normalized to that domain. The natural domain of the Bernstein basis is the interval [0,1][0,1]."
By using the "right" kind of polynomial basis you can get a polynomial approximation which also tells you the mean and variance of the function under a random variable.
https://people.maths.ox.ac.uk/trefethen/mythspaper.pdf