[0] which I learned from this talk https://youtube.com/watch?v=17gfCTnw6uE
When the spaces are Euclidean, we conflate the tangent space with the space itself, because the two are canonically identified.
By the way, this makes it easy to remember the chain rule formula in 1 dimension. There's only one logical thing it could be between spaces of arbitrary dimensions m, n, p: composition of linear transformations from T_a A to T_f(a) B to T_g(f(a)) C. Now let m = n = p = 1, and composition of linear transformations just becomes multiplication.
(Only half kidding)
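To make that concrete, here is a quick sketch (the toy maps f and g, and the use of JAX, are my own choices, not from the thread): the derivative of the composite is the composition of the two linear maps, which in coordinates is just the matrix product of the Jacobians.

    import jax
    import jax.numpy as jnp

    # Toy maps for illustration: f: R^2 -> R^3 and g: R^3 -> R^2
    def f(x):
        return jnp.array([x[0] * x[1], jnp.sin(x[0]), x[1] ** 2])

    def g(y):
        return jnp.array([y[0] + y[2], y[1] * y[2]])

    a = jnp.array([1.0, 2.0])
    J_f = jax.jacobian(f)(a)                    # linear map T_a A -> T_f(a) B, shape (3, 2)
    J_g = jax.jacobian(g)(f(a))                 # linear map T_f(a) B -> T_g(f(a)) C, shape (2, 3)
    J_gf = jax.jacobian(lambda x: g(f(x)))(a)   # derivative of the composite, shape (2, 2)
    print(jnp.allclose(J_gf, J_g @ J_f))        # True: composition = matrix product

With m = n = p = 1 the matrices are 1x1 and the product really is just ordinary multiplication.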
Of course that picture is not formally correct. We formally define the tangent space without having to embed the manifold in Euclidean space. But that picture is a correct description of an embedding of both the sphere and the tangent space at a single point.
Incorrect!
There is a bit more to it to get from local linearization to a complete view of the derivative, but it's not exactly incorrect.
https://web.archive.org/web/20250817152111/https://blog.demo...
That is so much easier to understand than most descriptions. The whole opening was.
Jacobians can be understood as a collection of gradients, one for each coordinate of the output considered independently.
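A quick way to see this with an autodiff library (the toy f and the use of JAX are my own choices): each row of the Jacobian is the gradient of one output coordinate.

    import jax
    import jax.numpy as jnp

    def f(x):  # toy f: R^3 -> R^2
        return jnp.array([x[0] * x[1] + x[2], jnp.exp(x[0]) * x[2]])

    x = jnp.array([0.5, -1.0, 2.0])
    J = jax.jacobian(f)(x)                # shape (2, 3)
    g0 = jax.grad(lambda x: f(x)[0])(x)   # gradient of the first output coordinate
    g1 = jax.grad(lambda x: f(x)[1])(x)   # gradient of the second output coordinate
    print(jnp.allclose(J[0], g0), jnp.allclose(J[1], g1))  # True True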
My mental picture for the Hessian is to associate each point with the shape of the parabola (or saddle) that best matches the function locally. It's easy to visualize once you realize it's the shape of what you see when you zoom in on the point. (Technically this mental picture is really the second-order multivariate Taylor expansion, the gradient's tangent plane and the Hessian's curvature together, but I find it hard to mentally separate the slope from the curvature.)
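Here is the same mental picture as a tiny numerical sketch (the function and the zoom-in point are my own choices): close to a point, the gradient tangent plane plus the Hessian "parabola/saddle" reproduce the function almost exactly.

    import jax
    import jax.numpy as jnp

    def f(x):  # toy scalar function of two variables
        return jnp.sin(x[0]) * jnp.cos(x[1]) + 0.1 * x[0] ** 2

    x0 = jnp.array([0.3, -0.7])
    g = jax.grad(f)(x0)      # slope: the tangent plane
    H = jax.hessian(f)(x0)   # curvature: the shape of the local parabola/saddle

    d = jnp.array([1e-2, -2e-2])  # a small step, i.e. "zoom in"
    quadratic_model = f(x0) + g @ d + 0.5 * d @ H @ d
    print(f(x0 + d), quadratic_model)  # nearly identical this close to x0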
A lot of optimization theory becomes intuitive once you work through a few of those and compare your understanding to arrow maps like you suggest.
When I started thinking of vector calculus in terms of multiplying both vector components and the corresponding basis vectors, there was a nice unification of ordinary vector operations, jacobians, and the metric tensor.
This is particularly true when you try to apply it to physics, which is often where you are introduced to vector calculus for the first time: things like Maxwell's equations or fluid mechanics.
In physics there are often additional constraints, like conserved quantities. For example, you compute a scalar, the total energy of a system, call it the Hamiltonian, auto-differentiate it with respect to the degrees of freedom, and you get very complex vector equations.
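As a minimal sketch of that recipe (the harmonic-oscillator Hamiltonian below is my own toy example, and JAX is my choice for the auto-differentiation):

    import jax
    import jax.numpy as jnp

    def hamiltonian(q, p, m=1.0, k=1.0):   # total energy of a harmonic oscillator
        return jnp.sum(p ** 2) / (2 * m) + 0.5 * k * jnp.sum(q ** 2)

    dH_dq = jax.grad(hamiltonian, argnums=0)
    dH_dp = jax.grad(hamiltonian, argnums=1)

    q, p = jnp.array([1.0]), jnp.array([0.0])
    q_dot = dH_dp(q, p)    # Hamilton's equations: dq/dt =  dH/dp
    p_dot = -dH_dq(q, p)   #                       dp/dt = -dH/dq
    print(q_dot, p_dot)    # the vector equations of motion fall out of the gradients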
But taking a step back, you realise it's just a field of "objects" of a certain type, and you are just locally comparing these "objects" to their neighbors. The whole mess of vector calculus reduces to the directions in which you rotate, stretch, and project these objects. (As an example, you can imagine a field of balls in various orientations, whose energy is defined by the difference in orientation between neighboring balls.)
Once you wrap your head around the fact that the whole point of vector calculus (and why it was invented) is to describe a state as an integral (a sum of successive infinitesimal linear transformations), this makes more sense. These "objects" are constrained and continuous, so you move them in infinitesimal steps along the tangent space (and then reproject to satisfy the constraints, or exponentiate in a Clifford algebra to stay in the space).
All the infinitesimal transformations are "linear", and linear transformations are rotating, stretching, mirroring, projecting to various degrees.
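One way to convince yourself of that last point numerically (the random matrix and the use of JAX are my own choices): the SVD factors any linear map into a rotation/reflection, an axis-aligned stretch (with zeros projecting directions away), and another rotation/reflection.

    import jax.numpy as jnp
    from jax import random

    A = random.normal(random.PRNGKey(0), (3, 3))   # an arbitrary linear map
    U, s, Vt = jnp.linalg.svd(A)                   # rotate/mirror, stretch, rotate/mirror
    print(jnp.allclose(A, U @ jnp.diag(s) @ Vt, atol=1e-5))  # True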
The Hessian shouldn't have been called a matrix.
The Jacobian describes all the first order derivatives of a vector valued function (of multiple inputs), while the Hessian is all the second order derivatives of a scalar valued function (of multiple inputs). Why doesn't the number of dimensions of the array increase by one as the derivation order increases? It does! The object that fully describes second order derivation of a vector valued function of multiple inputs is actually a 3-dimensional tensor: one dimension for the original vector valued output, and one for each derivation order. Mathematicians are afraid of tensors of more than 2 dimensions for some reason and want everything to be a matrix.
In other words, given a function R^n -> R^m:
Order 0: Output value: 1d array of shape (m) (a vector)
Order 1: First order derivative: 2d array of shape (m, n) (Jacobian matrix)
Order 2: Second order derivative: 3d array of shape (m, n, n) (array of Hessian matrices)
It all makes sense!
Talking about "Jacobian and Hessian" matrices as if they are both naturally matrices is highly misleading.
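For what it's worth, an autodiff library reports exactly the shapes in that list. A quick sketch (the toy f and the use of JAX are my own choices):

    import jax
    import jax.numpy as jnp

    def f(x):  # toy f: R^3 -> R^2
        return jnp.array([x[0] * x[1] + x[2], jnp.sin(x[0]) * x[2]])

    x = jnp.ones(3)
    print(f(x).shape)                # (2,)       order 0: the value
    print(jax.jacobian(f)(x).shape)  # (2, 3)     order 1: the Jacobian matrix
    print(jax.hessian(f)(x).shape)   # (2, 3, 3)  order 2: one Hessian per output coordinate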
What made sense to me is to start from the definition of the derivative (the best linear approximation, in some sense), and then everything else is about how to represent it. Vectors, matrices, etc. are all vectors in the appropriate vector space, the derivative always has the same functional form, and so on.
E.g. you want the derivative of f(M)? Just write f(M+h) - f(M), and then look for the terms in h, h², etc. Apply the chain rule, etc., for more complicated cases. This is IMO a much better way to learn about this.
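As a quick numerical check of that recipe (f(M) = M @ M and the random matrices are my own toy example, computed with JAX): expanding f(M + h) - f(M) gives Mh + hM plus an h² remainder, so the derivative at M is the linear map h -> Mh + hM.

    import jax.numpy as jnp
    from jax import random

    k1, k2 = random.split(random.PRNGKey(0))
    M = random.normal(k1, (3, 3))
    h = 1e-2 * random.normal(k2, (3, 3))   # a small perturbation

    def f(M):
        return M @ M

    lhs = f(M + h) - f(M)
    linear_part = M @ h + h @ M                         # candidate derivative applied to h
    print(jnp.max(jnp.abs(lhs - linear_part)))          # ~|h|^2: the quadratic remainder
    print(jnp.max(jnp.abs(lhs - linear_part - h @ h)))  # ~0: the remainder here is exactly h @ h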
As for notation, you use vec/kronecker product for complicated cases: https://janmagnus.nl/papers/JRM093.pdf
If input x has components xⁿ, and output f(x) components fᵐ, then the Jacobian is ∂ₙfᵐ which has one index upstairs and one downstairs. The derivative has a downstairs index... because x is in the denominator of d/dx, roughly? If x had units seconds, then d/dx has units per second.
Whereas if g(x) is a number, the gradient is ∂ₙg, and the Hessian is ∂ₙ₁∂ₙ₂g with two downstairs indices. You might call this a (0,2) tensor, while the Jacobian is (1,1). Most of the matrices in ordinary linear algebra are (1,1) tensors.
Upstairs/downstairs is kinda cute tho xD
Sub/superscript... strike me as the typographical terms, not the meaning? Like $x_\mathrm{alice}$ is certainly a subscript, and footnote 2 is a superscript, but neither is an index.
I think if you tried to actually make sense of it, you'd expect that you plug in f twice (I think ∇ already secretly lives in R^n* ⊗ L(C^infty, C^infty), so you'd expect 2 copies of each of those). If you think of it as ∇: L(C^infty, L(R^n, C^infty)) then the composition ∇^2: L(C^infty, L(R^n, L(R^n, C^infty))) ≅ C^infty* ⊗ R^n* ⊗ R^n* ⊗ C^infty, so the types should work correctly (though the type of ∇ actually needs some generic parameters for composition to make sense), and you get that you only plug f in once.
Unfortunately ∇^2 is already taken by the Laplacian, so I suppose ∇⊗∇ would be "∇^2, but the one that makes a tensor", and it'll do as long as you don't try to type check it, which physicists won't.
The Hessian is defined as the second order partial derivative of a scalar function. Therefore it will always give you a matrix.
What you're doing with the shape (m, n, n) isn't actually guaranteed at all, since the output of an arbitrary function can be a tensor of any shape, and you can apply the Hessian to each scalar value in that output to get another arbitrary tensor with two extra dimensions.
It's the Jacobian that is weird, since it is just a vector of gradients and therefore its partial derivative must also be a vector of Hessians.
There's a whole workshop of useful matrix tools. Decompositions, spectral theory, etc. These tools really break down when you generalize them to k-tensors. Even basic concepts like rank become sticky. (Iirc, the set of 3-tensors of tensor rank ≤k is not even topologically closed in general. Terrifying.) If you hand me some random 5-tensor, it's quite difficult to begin to understand it without somehow turning it into a matrix first by flattening or slicing or whatever.
Don't get me wrong. People work with these things. They do their best. But in general, mathematicians are afraid of higher order tensors. You should be too.
whatever1•5mo ago
We can all do it in 2-3D. But our algorithms don't, even in 2D.
Sure, if I were blindfolded, feeling the surface and looking for a descent direction would be the way to go. But when I can see, I don't have to.
What are we missing?
3eb7988a1663•5mo ago
Gradient descent: take a step in the steepest downward direction. Look around and repeat. When you reach a level area, how do you know you are at the lowest point?
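For concreteness, a minimal gradient descent loop looks like this (the bowl-shaped toy function, the step size, and the use of JAX are my own choices); note it only tells you when you have reached a level area, not whether that area is the lowest one.

    import jax
    import jax.numpy as jnp

    def f(x):  # toy 2D bowl with its minimum at (1, -2)
        return (x[0] - 1.0) ** 2 + 3.0 * (x[1] + 2.0) ** 2

    grad_f = jax.grad(f)
    x = jnp.zeros(2)
    for _ in range(200):
        x = x - 0.1 * grad_f(x)   # step in the steepest downward direction, look around, repeat

    print(x, jnp.linalg.norm(grad_f(x)))  # near (1, -2); gradient ~ 0, i.e. a "level area"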
ks2048•5mo ago
For a loss-function, the value at each point must be computed.
You can compute them all and "look at" the surface and just directly choose the lowest - that is called a grid search.
For high dimensions, there's just way too many "points" to compute.
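As a sketch of why that's fine in 2D but hopeless in high dimensions (the test function and the 200-point-per-axis grid are my own choices):

    import jax.numpy as jnp

    def loss(x, y):  # toy loss surface
        return jnp.sin(3 * x) + (x - 0.5) ** 2 + (y + 0.3) ** 2

    xs = jnp.linspace(-2.0, 2.0, 200)
    ys = jnp.linspace(-2.0, 2.0, 200)
    X, Y = jnp.meshgrid(xs, ys)
    Z = loss(X, Y)                          # 200 * 200 = 40,000 evaluations: easy in 2D
    i, j = jnp.unravel_index(jnp.argmin(Z), Z.shape)
    print(X[i, j], Y[i, j], Z[i, j])        # "look at" the whole surface, pick the lowest
    # In n dimensions the same grid needs 200**n evaluations, which blows up immediately.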
xigoi•5mo ago
That’s a big “if”.
jpeloquin•5mo ago
It's just that when the function is implemented on the computer, evaluating so many points takes a long time, and using a more sophisticated optimization algorithm that exploits information like the gradient is almost always faster. In physical reality all the points already exist, so if they can be observed cheaply the brute force approach works well.
Edit: Your question was good. Asking superficially-naive questions like that is often a fruitful starting point for coming up with new tricks to solve seemingly-intractable problems.
whatever1•5mo ago
It does feel to me that we do some sort of sampling; it's definitely not a naive grid search.
Also, I find it easier to find the minima in specific directions (up, down, left, right) rather than, say, a 42-degree one. So some sort of priors are probably used to improve sample efficiency.
GuB-42•5mo ago
It is not perfect though, see the many optical illusions.
But we follow gradients all the time, consciously or not. You know you are at the bottom of the hole when all the paths go up for instance.
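That "all the paths go up" check has a direct numerical analogue (the bowl and saddle below are my own toy examples, using JAX): at a point where the gradient vanishes, all-positive Hessian eigenvalues mean every direction curves upward, while a negative eigenvalue means some path still goes down.

    import jax
    import jax.numpy as jnp

    def bowl(x):
        return x[0] ** 2 + x[1] ** 2    # bottom of a hole at the origin

    def saddle(x):
        return x[0] ** 2 - x[1] ** 2    # gradient is zero at the origin, but it's not a bottom

    x0 = jnp.zeros(2)
    print(jnp.linalg.eigvalsh(jax.hessian(bowl)(x0)))    # [2. 2.]  -> all paths go up
    print(jnp.linalg.eigvalsh(jax.hessian(saddle)(x0)))  # [-2. 2.] -> a downhill path remains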
dcanelhas•5mo ago
it's not a bug - it's a feature :D
nwallin•5mo ago
When a computer has a surface which is 2 dimensions in and 1 dimension out, you can actually just do the same thing. Check like 100 values in each of the x/y directions and you only have to check like 10,000 values. A computer can do that easy peasy.
When a computer does ML with a deep neural network, you don't have 2 dimensions in and 1 dimension out. You have thousands to millions of dimensions in and thousands to millions of dimensions out. If you have 100000 inputs, and you check 1000 values for each input, the total number of combinations is 1000^100000. Then remember that you also have 100000 outputs. You ain't doin' that much math. You ain't.
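Back-of-the-envelope version of those counts (the sample counts per axis are the ones above):

    import math

    print(100 ** 2)                           # 2 inputs, 100 samples each: 10,000 points
    log10_points = 100_000 * math.log10(1000)
    print(f"~10^{log10_points:.0f} points")   # 100,000 inputs, 1000 samples each: ~10^300000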
So we need fancy stuff like Jacobians and backtracking.
cvoss•5mo ago
You have suggested that the process in your mind to find a global minimum is immediate, apparently to contrast this with a standard computational algorithm. But such comparison fails. I don't know whether you mean "with few computational steps" or "in very little time"; the former is not knowable to you; the latter is not relevant since the hardware is not the same.
zoogeny•5mo ago
In construction, grading a site for building is a whole process involving surveying. If you dropped a person on a random patch of earth that hasn't previously been levelled and gave them no tools, it would be a significant challenge for that person to level the ground correctly.
What I'm saying is, your intuition that "I can look around me and find the minimum of anything" is almost certainly wrong, unless you have a superpower that no other person has.
shoo•5mo ago
Here are some alternative example problems that are a lot more high-dimensional, and where the dimensions are not spatial dimensions, so your eyes give you absolutely no benefit.
(a) Your objective is to find a recipe that produces a maximally tasty meal, using the ingredients you have in your kitchen cupboard. To sample one point in recipe-space, you need to (1) devise a recipe, (2) prep and cook a candidate meal following the recipe, and (3) evaluate the candidate recipe, say by serving it to a bunch of your friends and family. That gets you one sample point. Maybe there are 1 trillion possible "recipes" you could make. Are you going to brute-force cook and serve them all to find a meal that maximises tastiness, or is there a more efficient way that requires fewer plan recipe->prep&cook->serve->evaluate cycles?
(b) Your objective is to find the most efficient design of a bridge, that can support the required load and stresses, while minimising the construction cost.