So... Automatic integration?
Proportional, integral, derivative. A PID loop sure sounds like what they're talking about.
It has a lot more overhead than regular forward-mode autodiff because you need to cache values from running the function and refer back to them in reverse order, but the advantage is that for functions with many inputs and very few outputs (the classic example is computing the gradient of a scalar function in a high-dimensional space, as in gradient descent), it is algorithmically more efficient and requires only one pass through the primal function.
On the other hand, traditional forward-mode derivatives are most efficient for functions with very few inputs but many outputs. It's essentially a duality relationship.
For vector-valued functions, the naive way you would learn in a vector calculus class corresponds to forward-mode AD.
As the name implies, the calculation is done forward.
Reverse-mode automatic differentiation starts from the root of the expression (the output) and propagates the derivative back to every subexpression in a single reverse sweep.
The difference between the two is like the difference between calculating the Fibonacci sequence recursively without memoization and calculating it iteratively. You avoid doing redundant work over and over again.
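To make the caching idea concrete, here is a minimal sketch of reverse-mode AD in Python (a toy `Var` class of my own, not any real library's API): the forward pass caches every intermediate value, and one reverse sweep over the recorded operations yields all the partial derivatives at once.

```python
import math

class Var:
    """Toy reverse-mode AD node: the forward pass caches values and local
    derivatives; the backward pass replays the record once, in reverse order."""
    def __init__(self, value, parents=()):
        self.value = value        # primal value cached during the forward pass
        self.parents = parents    # (parent Var, local partial derivative) pairs
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   parents=((self, other.value), (other, self.value)))

    def sin(self):
        return Var(math.sin(self.value), parents=((self, math.cos(self.value)),))

    def backward(self):
        # Visit each node exactly once, in reverse topological order, so shared
        # subexpressions are not re-traversed (this is the "memoization").
        order, seen = [], set()
        def topo(v):
            if id(v) not in seen:
                seen.add(id(v))
                for p, _ in v.parents:
                    topo(p)
                order.append(v)
        topo(self)
        self.grad = 1.0
        for v in reversed(order):
            for p, local in v.parents:
                p.grad += v.grad * local

# One forward pass, one backward pass, all partial derivatives at once:
x, y = Var(2.0), Var(3.0)
z = x.sin() * y
z.backward()
print(x.grad, y.grad)   # 3*cos(2) and sin(2)
```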
e.g. optimization of state-space control coefficients looks something like training an LLM's matrices...
However, from what I have seen, this isn't really a useful way of reframing the problem. The optimal control problem is at least as hard, if not harder, than the original problem of training the neural network, and the latter has mature and performant software for doing it efficiently. That's not to say there isn't good software for optimal control, but it's a more general problem and therefore off-the-shelf solvers can't leverage the network structure very well.
Some researchers have made interesting theoretical connections like in neural ODEs, but even there the practicality is limited.
We can also reduce supervised learning to reinforcement learning, but that doesn't mean we should use RL algorithms to do supervised learning.
We can also reduce sorting a list of integers to SAT, but that doesn't mean we should use a SAT solver to sort lists of integers.
In it, he stated the following:
> Indeed, the famous “backpropagation” algorithm that was rediscovered by David Rumelhart in the early 1980s, and which is now viewed as being at the core of the so-called “AI revolution,” first arose in the field of control theory in the 1950s and 1960s. One of its early applications was to optimize the thrusts of the Apollo spaceships as they headed towards the moon.
I was wondering whether anyone could point me to the paper or piece of work he was referring to. There are many citations in Schmidhuber’s piece, and in my previous attempts I've gotten lost in papers.
AIAA 65‑701 (1965) “optimum thrust programming” for lunar transfers via steepest descent (Apollo‑era) https://arc.aiaa.org/doi/abs/10.2514/6.1965-701
Meditch 1964 (optimal thrust programming for lunar landing) https://openmdao.github.io/dymos/examples/moon_landing/moon_...
Smith 1967 & Colunga 1970 (explicit Apollo‑type trajectory/re‑entry optimization using adjoint gradients) https://ntrs.nasa.gov/citations/19670015714
One thing AI has been great for recently is searching for obscure or indirect references like this, ones that might be one step removed from any specific thing you're searching for, or tip-of-the-tongue searches where you've forgotten a phrase or know you're using the wrong wording.
It's cool that you can trace the work of these rocket scientists all the way to the state of the art AI.
Henry J. Kelley (1960). Gradient Theory of Optimal Flight Paths.
[1] https://claude.ai/public/artifacts/8e1dfe2b-69b0-4f2c-88f5-0...
I am still going through it, but the latter is quite interesting!
So sad to see the current state. Hopefully we can turn it around.
I think "its" refers to control theory, not backpropagation.
The Minimum-Time Thrust-Vector Control Law in the Apollo Lunar-Module Autopilot (1970)
https://www.sciencedirect.com/science/article/pii/S147466701...
> "Since his first work on the subject, the author has found that A. Bryson and Y.-C. Ho [Bryson and Ho, 1969] described the backpropagation algorithm using Lagrange formalism. Although their description was, of course, within the framework of optimal control rather than machine learning, the resulting procedure is identical to backpropagation."
I remember reading this book enthusiastically back in the mid 90s. I don't recall struggling with the proof; it was fairly straightforward. (I was in my senior year of high school at the time.)
Some ask: "Isn't backpropagation just the chain rule of Leibniz (1676) [LEI07-10] & L'Hopital (1696)?" No, it is the efficient way of applying the chain rule to big networks with differentiable nodes (see Sec. XII of [T22][DLH]). (There are also many inefficient ways of doing this.) It was not published until 1970 [BP1].
[1]: https://www.amazon.com/Talking-Nets-History-Neural-Networks/...
https://en.wikipedia.org/wiki/Adaptive_filter
An adaptive filter doesn't need differentiation of the forward term, but if you squint it looks pretty close.
Once people had a sufficiently compelling reason to write differentiable code, the frameworks around differentiable programming (Theano, TensorFlow, Torch, JAX) picked up a lot of steam.
Maybe they are. I'm not here to do a deep research project that involves reading every citation in that article. If it makes you feel better, pretend that what I said was instead:
"I don't have all the relevant citations stored in my short-term memory right this second and I am not interested in writing a lengthy thesis to satisfy pedantic navel-gazers on HN."
Or, if you really know some reinvention of backprop that is not mentioned here,
WTF are you on about? I never made any such claim, or anything remotely close to it.
I don't really understand your negativity here, and what you are reading into my comment. I never asked you to do a research project? I just thought you might know some other references which are not in the article. If you don't, fine.
Note that I don't expect that any relevant reference is missing here. Schmidhuber always tries to be very careful to completely and exhaustively cite everything there is on a topic. That is why I was doubly curious about the possibility that something is missing, and what it could be.
Nah, I wasn't trying to imply that that book had anything more than the article, at least in regards to the backprop question specifically. Just pointing it out as one more good resource for this kind of historical perspective.
> I don't really understand your negativity here, and what you are reading into my comment. I never asked you to do a research project? I just thought you might know some other references which are not in the article. If you don't, fine.
No worries. I may be reacting more to a general HN meme than to you in particular. There's a certain brand of pedantry and obsessive nit-picking that is all too common here IMO. It grates on my nerves, so if I ever seem a little salty, it's probably because I thought somebody was doing that thing. It's all good. My apologies for the argumentative tone earlier.
> Schmidhuber always tries to be very careful to completely and exhaustively cite everything there is on a topic.
Agreed. That's one reason I don't get why people are always busting on Jurgen. For the most part, it seems that he can back up the claims he makes, and then some. I've heard plenty of people complain about him, but I'm not sure any of them have ever been able to state any particular sense in which he is actually wrong about anything. :-)
[a] https://www.nobelprize.org/uploads/2024/11/advanced-physicsp...
Some things never change.
Hint: Schmidhuber has amassed solid evidence over years of digging.
In a recent talk he made a quip that he had to change some slides because if you have a Nobel prize in physics you should at least get the units right.
Now perhaps Hinton does deserve the award, but certainly it should not be because of the reasons you cite: money and popularity.
You refuted an argument about being honest about accepting an award on the basis that the award pays a lot of money and grants one a great deal of popularity.
If your argument didn't involve money and popularity, then why did you choose those two specific criteria as the justification for accepting this award?
I want to be clear, I am not claiming that Dr. Hinton accepted the award in a dishonest manner or that he did it for money, I am simply refuting your position that money is a valid reason to disregard honesty for accepting a prestigious award.
There are two reasons why Hinton got the prize.
A good majority of modern physics research depends on ML in some aspect. Look at the list of talks at any physics conference and count the number of talks that mention ML in the title.
And the 'physics community' has not produced any fundamental physics for a while. Look at the last several years of physics Nobel prizes. You can categorize the last ten years of prizes into two categories: engineering breakthroughs, and confirming important predictions. Both are important, but lacking fundamental physics breakthroughs, they are not clearly ahead in impact compared to ML.
The setup by itself can also be a general technique that is useful beyond confirming one thing (LIGO, for example). But then, ML itself is a more general technique that has enabled a lot more new physics than one new experiment.
And a large number of predictions being made now are unlikely to be ever confirmed.
That's not a good argument. They do in fact sometimes give awards with which the "scientific community" disagrees. Schmidhuber actually gave object level arguments on why the official justification for the Turing award contained substantial errors.
[HIN] J. Schmidhuber (AI Blog, 2020). Critique of Honda Prize for Dr. Hinton. Science must not allow corporate PR to distort the academic record.
[RUM] DE Rumelhart, GE Hinton, RJ Williams (1985). Learning Internal Representations by Error Propagation.
Neither is really an invention; they are discoveries. If anything, the chain rule leans slightly more toward invention than backprop does.
I understand the need for attribution as a means to track the means and validity of discovery, but I intensely dislike it when people act like it is a deed of ownership of an idea.
I remember when I learnt about artificial neural networks at university in the late 00s, my professors were really sceptical of them, rightly explaining that they become hard to train as you add more hidden layers.
See, what makes backpropagation and artificial neural networks work are all of the small optimisations and algorithm improvements that were added on top of backpropagation. Without these improvements it's too computationally inefficient to be practical and you have to contend with issues like exploding gradients.
I think Geoffrey Hinton has noted a few times that for people like him who have been working on artificial neural networks for years, it's quite surprising that today neural networks just work, because for years it was so hard to get them to do anything. In this sense, while backpropagation is the foundational algorithm, it's not sufficient on its own. It was the many improvements that were made on top of backpropagation that actually made artificial neural networks work and take off in the 2010s, when some of the core components of modern neural networks started to fall into place.
I remember when I first learnt about neural networks I thought maybe coupling them with some kind of evolutionary approach might be what was needed to make them work. I had absolutely no idea what I was doing of course, but I spent so many nights experimenting with neural networks. I just loved the idea of an artificial "neural network" being able to learn a new problem and spit out an answer. The biggest regret of my life was coming out of university and going into web development because there were basically no AI jobs back then, and no such thing as an AI startup. If you wanted to do AI back then you basically had to be a researcher which didn't interest me at the time.
I did this in an artificial life simulation. It was pretty fun to see the creatures change from pure random bouncing around to movement that helped them get food and move away from something eating them.
My naive vision was all kinds of advanced movement, like hiding around corners for prey, but it never got close to something like that.
As I worked the evolutionary parameters I began to realize more and more that the process of evolving specific advanced traits requires lots of time and (I think) environmental complexity and compartmentalization of groups of creatures.
There are lots of simple/dumb capabilities that help with survival, and they are much, much easier to acquire than a more advanced capability like being aware of another creature and tracking its movement on the other side of an obstacle.
There are mainly two forms of AD: forward mode (optimal when the function being differentiated has more outputs than latent parameter inputs) and reverse mode (optimal when it has more latent parameter inputs than outputs). If you don't understand why, you don't understand AD.
If you understand AD, you'd know why, but then you'd also see a huge difference from symbolic differentiation. In symbolic differentiation the input is an expression or DAG, and the variables computed along the way are themselves symbolic expressions (typically computed in reverse order in high school or university, so the expression would grow exponentially with each deeper nested function, and only at the end are the input coordinates substituted into the final expression to obtain the gradient). In both forward and reverse mode the variables being calculated are numeric, not symbolic expressions.
The third "option" is numeric differentiation, but for N latent parameter inputs this requires (N+1) forward evaluations: N of the function f(x1, x2, ..., xi + delta, ..., xN) and 1 reference evaluation at f(x1, ..., xN). Picking a smaller delta makes it closer to the real gradient assuming infinite precision, but in practice there will be irregular rounding near the pseudo-"infinitesimal" values of real-world floats; alternatively, take delta big enough, but then it's no longer the theoretical gradient.
So symbolic differentiation was destined to fail due to ever increasing symbolic expression length (the chain rule).
Numeric differentiation was destined to fail due to imprecise gradient computation and huge amounts (N+1, many billions for current models) of forward passes to get a single (!) gradient.
AD gives the theoretically correct result with a single forward and backward pass (as opposed to N+1 passes), without requiring billions of passes, or lots of storage to store strings of formulas.
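To make those costs concrete, here is a rough sketch of the numeric (finite-difference) approach; the toy function and names are just illustrative:

```python
import numpy as np

def numeric_grad(f, x, delta=1e-6):
    """Finite-difference gradient: needs N+1 evaluations of f for N inputs."""
    base = f(x)                      # 1 reference evaluation
    grad = np.empty_like(x)
    for i in range(x.size):          # N perturbed evaluations
        xp = x.copy()
        xp[i] += delta
        grad[i] = (f(xp) - base) / delta
    return grad

f = lambda x: np.sum(np.sin(x) ** 2)   # toy scalar function of N inputs
x = np.linspace(0.0, 1.0, 5)
print(numeric_grad(f, x))               # approximate, and delta-dependent
print(2 * np.sin(x) * np.cos(x))        # exact gradient, what AD would return
```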
AD is just a simple application of the pushforwards/pullbacks from differential geometry, which are just the chain rule. It is important to distinguish between a mathematical concept and a particular algorithm/computation for implementing it. The symbolic manipulation with an 'exponentially growing nested function' is a particular way of applying the chain rule, but it is not the only way.
The problem you describe with symbolic differentiation (exponential growth of expressions) is not inherent to symbolic differentiation itself, but to a particular naïve implementation. If you represent computations as DAGs and apply common subexpression elimination, the blow-up you mention can be avoided. In fact, forward- and reverse-mode AD can be viewed as particular algorithmic choices for evaluating the same derivative information that symbolic differentiation encodes. If you represent your function as a DAG and propagate pushforwards/pullbacks, you've already avoided the expression swell.
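As a small illustration of that point (using SymPy purely for convenience; the nesting depth and the function are arbitrary), the derivative swells under a naive symbolic treatment, while common subexpression elimination recovers the shared DAG structure:

```python
import sympy as sp

x = sp.Symbol('x')
expr = x
for _ in range(5):           # nest a function a few times
    expr = sp.sin(expr) * expr

d = sp.diff(expr, x)         # naive symbolic derivative: the expression swells
print(sp.count_ops(d))       # far more operations than in the original expression

# Common subexpression elimination recovers the shared (DAG) structure,
# which is essentially what forward/reverse-mode AD exploits from the start.
replacements, reduced = sp.cse(d)
print(len(replacements), sp.count_ops(reduced[0]))
```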
[0] https://www.urbandictionary.com/define.php?term=schmidhubere...
Annotated History of Modern AI and Deep Learning - https://people.idsia.ch/~juergen/deep-learning-history.html
Japanese scientists were pioneers of AI, yet they’re being written out of its history - https://theconversation.com/japanese-scientists-were-pioneer...
Iterating gradient descent is much, much older, and was recognized immediately upon defining what a gradient is (regardless of how one computes that gradient).
AD vs. { symbolic differentiation, numeric finite "differentiation" } is about the insight of how to compute a numeric gradient efficiently, both in terms of memory (space) and time (compute) requirements.
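Put differently, the descent loop itself doesn't care where the gradient comes from; a minimal sketch (names are my own):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, steps=200):
    """The update rule is identical whether grad_f comes from finite
    differences, a symbolic derivative, or AD; only cost and accuracy differ."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

# e.g. minimize f(x) = sum(x**2) using its hand-written gradient 2*x
print(gradient_descent(lambda x: 2 * x, [3.0, -4.0]))   # converges toward [0, 0]
```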