This is possible because the Hessian of a deep net has a matrix polynomial structure that factorizes nicely. The Hessian-inverse-product algorithm that takes advantage of this is similar to running backprop on a dual version of the deep net. It echoes an old idea of Pearlmutter's for computing Hessian-vector products.
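For reference, Pearlmutter's trick computes a Hessian-vector product at roughly the cost of two gradient evaluations, without ever forming the Hessian: you differentiate the gradient along a direction v. A minimal JAX sketch of that classical trick (not of the Hessian-inverse-product algorithm described above):

    import jax
    import jax.numpy as jnp

    def hvp(loss_fn, params, v):
        # Hessian-vector product H @ v via forward-over-reverse differentiation,
        # i.e. differentiate the gradient of loss_fn along the direction v.
        return jax.jvp(jax.grad(loss_fn), (params,), (v,))[1]

    # Toy check: loss(w) = sum(a * w^2) has Hessian diag(2a), so H @ v = 2 * a * v.
    a = jnp.array([1.0, 2.0, 3.0])
    loss = lambda w: jnp.sum(a * w ** 2)
    w = jnp.array([0.5, -1.0, 2.0])
    v = jnp.array([1.0, 1.0, 1.0])
    print(hvp(loss, w, v))   # -> [2. 4. 6.]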
Maybe this idea is useful as a preconditioner for stochastic gradient descent?
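Concretely, a preconditioned SGD step replaces w ← w − η g with w ← w − η P g, where P approximates the inverse of the Hessian (or some other curvature matrix). A toy sketch, with the Hessian-inverse-vector product abstracted behind a hypothetical black-box function hivp standing in for whatever cheap factorized routine is used:

    import jax
    import jax.numpy as jnp

    def preconditioned_sgd_step(params, grad, hivp, lr=0.1):
        # One preconditioned SGD step: w <- w - lr * P @ g, where P @ g is
        # supplied by `hivp`, a stand-in for a Hessian-inverse-vector product.
        return params - lr * hivp(grad)

    # Toy quadratic: loss(w) = 0.5 * w^T A w, gradient A w, exact inverse Hessian A^{-1}.
    A = jnp.array([[4.0, 0.0], [0.0, 1.0]])
    loss = lambda w: 0.5 * w @ A @ w
    w = jnp.array([1.0, 1.0])
    g = jax.grad(loss)(w)
    exact_hivp = lambda v: jnp.linalg.solve(A, v)   # exact solve, just for illustration
    print(preconditioned_sgd_step(w, g, exact_hivp))  # both coordinates shrink at the same rate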
MontyCarloHall•31m ago
Silly question, but if you have some clever way to compute the inverse Hessian, why not go all the way and use it for Newton's method, rather than as a preconditioner for SGD?
rahimiali•17m ago
MontyCarloHall•16m ago
rahimiali•11m ago
Probably nomenclature bias on my part: I started this project as a way to find new preconditioners for deep nets.