Another important lesson is that good ideas often get passed over because of hype or politics. We like to pretend that science is all about merit and what is correct. Unfortunately, that isn't true. It is in the long run, but in the short run there is a lot of politics, and humans still get in their own way. This is a solvable problem, but we need to acknowledge it and make systematic changes. Unfortunately, a lot of that is coupled to the aforementioned problem.
> I do respect the extent to which he continues his credit-attribution crusade even to his own reputational detriment.
As should we all. Clearly he was upset that others got credit for his contributions. But what I do appreciate is that he has recognized that the problem is bigger than him, and is trying to combat it at large rather than just on his own little battlefield. That's respectable.

The person with whom an idea ends up associated often isn't the first person to have had it. Most often it is the person who explains why the idea is important, finds a killer application for it, or otherwise popularizes it.
That said, you can open what Schmidhuber would say is the paper that invented residual NNs. See if you notice anything about it that might have hindered the adoption of its ideas [1].
[1] https://people.idsia.ch/~juergen/SeppHochreiter1991ThesisAdv...
Conversely, a huge amount of science is just scientists saying "here's something I found interesting" that no one can figure out what to do with. Then 30 or 100 years go by and it turns out to be useful in a field that didn't even exist at the time.
It seems that these two people, Schmidhuber and Hochreiter, were perhaps solving the right problem for the wrong reasons. They thought it was important because they expected RNNs to be able to hold memory indefinitely. Because of BPTT, you can think of an RNN as a NN with infinitely many layers. At the time, I believe nobody worried about vanishing gradients in deep NNs, because the compute for networks that deep simply didn't exist. But nowadays that's exactly how their solution is applied.
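To make the "infinitely many layers" point concrete, here's a minimal sketch (mine, not from any of the papers; the weight w = 0.9 and depth T = 100 are arbitrary assumptions for illustration). Unrolling a scalar tanh RNN for T steps and backpropagating multiplies the gradient by the recurrent Jacobian once per step, so with |w| < 1 it shrinks geometrically:

```python
import numpy as np

T = 100   # unrolled time steps: BPTT treats this like a T-layer network
w = 0.9   # recurrent weight, |w| < 1 (assumed value for the sketch)

# forward pass: h_t = tanh(w * h_{t-1})
hs = [0.5]
for _ in range(T):
    hs.append(np.tanh(w * hs[-1]))

# backward pass: dh_T/dh_0 = prod_t  w * (1 - h_t^2),
# since d/dx tanh(w*x) = w * (1 - tanh(w*x)^2)
grad = 1.0
for h in hs[1:]:
    grad *= w * (1 - h ** 2)

print(f"dh_T/dh_0 after {T} unrolled steps: {grad:.3e}")  # ~1e-5: vanished
```

The same product-of-Jacobians argument applies to any very deep feedforward network, which is presumably why the fix, letting gradients pass through a step unattenuated, transferred so directly from LSTM cell states to residual connections.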
That's science for you.
After reading Lang & Witbrock 1988 (https://gwern.net/doc/ai/nn/fully-connected/1988-lang.pdf), I'm not sure how convincing I find this explanation.