This is possible because the Hessian of a deep net has a matrix polynomial structure that factorizes nicely. The Hessian-inverse-product algorithm that takes advantage of this is similar to running backprop on a dual version of the deep net. It echoes an old idea of Pearlmutter's for computing Hessian-vector products.
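For readers unfamiliar with the Pearlmutter trick mentioned above: a Hessian-vector product can be computed with two passes of automatic differentiation, without ever materializing the Hessian. A minimal PyTorch sketch, where the toy model and loss are placeholders and not anything from this project:

    import torch

    def hvp(loss, params, vec):
        # First differentiation, keeping the graph so we can differentiate again.
        grads = torch.autograd.grad(loss, params, create_graph=True)
        # Scalar <grad, vec>; its gradient w.r.t. params is H @ vec.
        dot = sum((g * v).sum() for g, v in zip(grads, vec))
        return torch.autograd.grad(dot, params)

    # Toy usage on a tiny least-squares problem (illustration only).
    torch.manual_seed(0)
    w = torch.randn(5, requires_grad=True)
    X, y = torch.randn(32, 5), torch.randn(32)
    loss = ((X @ w - y) ** 2).mean()
    print(hvp(loss, [w], [torch.randn(5)])[0])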
Maybe this idea is useful as a preconditioner for stochastic gradient descent?
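As a rough illustration of what a preconditioner would do here: the gradient is multiplied by an approximate inverse curvature matrix before the parameter update. The sketch below assumes some `apply_inv_hessian` callback; that operator is hypothetical and stands in for whatever inverse-Hessian-product the project actually provides:

    import torch

    def preconditioned_step(params, grads, apply_inv_hessian, lr=0.1):
        # theta <- theta - lr * P^{-1} grad, where P approximates the Hessian.
        # `apply_inv_hessian` is a hypothetical callback returning (approximate)
        # inverse-Hessian-vector products; plain SGD is recovered when it is the
        # identity, e.g. apply_inv_hessian = lambda gs: gs.
        directions = apply_inv_hessian(grads)
        with torch.no_grad():
            for p, d in zip(params, directions):
                p -= lr * d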
MontyCarloHall•36m ago
Silly question, but if you have some clever way to compute the inverse Hessian, why not go all the way and use it for Newton's method, rather than as a preconditioner for SGD?
rahimiali•22m ago
MontyCarloHall•21m ago
rahimiali•16m ago
probably my nomenclature bias: i started this project as a way to find new preconditioners for deep nets.
hodgehog11•4m ago
It's an interesting trick though, so I'd be curious to see how it compares to CG.
[1] https://arxiv.org/abs/2204.09266 [2] https://arxiv.org/abs/1601.04737 [3] https://pytorch-minimize.readthedocs.io/en/latest/api/minimi...
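For comparison, the standard baseline alluded to above is to solve H d = g with conjugate gradient (CG), using only Hessian-vector products. A generic sketch, assuming an `hvp_fn` callback (e.g. built from the Pearlmutter trick) and a positive-definite curvature matrix; this is not the thread's factorization-based algorithm:

    import torch

    def cg_solve(hvp_fn, b, iters=20, tol=1e-6):
        # Solve H x = b by conjugate gradient using only products H @ v.
        # CG assumes H is positive definite; for deep-net Hessians one usually
        # adds damping or substitutes a Gauss-Newton approximation.
        x = torch.zeros_like(b)
        r = b.clone()          # residual b - H x (x starts at zero)
        p = r.clone()
        rs = r.dot(r)
        for _ in range(iters):
            Hp = hvp_fn(p)
            alpha = rs / p.dot(Hp)
            x += alpha * p
            r -= alpha * Hp
            rs_new = r.dot(r)
            if rs_new.sqrt() < tol:
                break
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x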