But in large-scale applications you may not want to store XX^T at all; instead you're interested in computing products of the form XX^T v on the fly.
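A minimal sketch of that matrix-free idea (names are mine, assuming X is a dense NumPy array with one data point per row):

import numpy as np

def gram_matvec(X, v):
    # (X X^T) v == X (X^T v): two matvecs, and the n x n Gram matrix is never formed
    return X @ (X.T @ v)

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 50))
v = rng.standard_normal(1000)
assert np.allclose(gram_matvec(X, v), (X @ X.T) @ v)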
XX^T is the matrix of all pairwise dot products of the vectors in X, and as others have pointed out there are legitimate applications. Another one is e.g. transforming an overdetermined linear system into a least squares problem (the normal equations).
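A small sketch of that least-squares connection (my example; np.linalg.lstsq is the numerically safer way to actually solve it):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 10))   # overdetermined: 200 equations, 10 unknowns
b = rng.standard_normal(200)

# Normal equations: (A^T A) x = A^T b. Whether the Gram matrix shows up as
# A^T A or A A^T just depends on whether your samples live in rows or columns.
x_normal = np.linalg.solve(A.T @ A, A.T @ b)
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
assert np.allclose(x_normal, x_lstsq)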
Splitting into 4x4 blocks is typically very nice, though. Maybe it doesn't matter so much for practical runtime.
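For what it's worth, plain loop blocking looks something like the sketch below; this is only the generic idea, not the paper's specific 4x4 scheme, and real kernels pick block sizes to match registers and cache.

import numpy as np

def blocked_matmul(A, B, bs=4):
    n, k = A.shape
    _, m = B.shape
    C = np.zeros((n, m))
    # Work on bs x bs tiles so each tile stays hot in cache while it is reused
    for i in range(0, n, bs):
        for j in range(0, m, bs):
            for p in range(0, k, bs):
                C[i:i+bs, j:j+bs] += A[i:i+bs, p:p+bs] @ B[p:p+bs, j:j+bs]
    return C

rng = np.random.default_rng(0)
A, B = rng.standard_normal((8, 8)), rng.standard_normal((8, 8))
assert np.allclose(blocked_matmul(A, B), A @ B)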
I am aware of being able to use hardware counters for profiling systems, but the more fine-grained exposure of what's happening at a lower level seems to be just hidden... Or I haven't found the right reference resource yet.
Btw, it's worth noting that if you know the result will be symmetric (as is the case for X * X^T), you can make things faster. You'd expect cublas*syrk (the cuBLAS variant optimized for a symmetric result) to be the answer, but IME it isn't any faster than gemm, so what you can do instead is do smaller multiplications that fill in one of the two triangles piece by piece, and then copy that triangle over to the other one.
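A rough NumPy sketch of that triangle-by-blocks approach (block size and names are mine; on the GPU these would be batched gemms rather than a Python loop):

import numpy as np

def gram_lower_blocked(X, bs=256):
    # Compute only blocks on/below the diagonal of X X^T, then mirror them up.
    n = X.shape[0]
    C = np.empty((n, n), dtype=X.dtype)
    for i in range(0, n, bs):
        for j in range(0, i + bs, bs):
            block = X[i:i+bs] @ X[j:j+bs].T
            C[i:i+bs, j:j+bs] = block
            if i != j:
                C[j:j+bs, i:i+bs] = block.T   # copy into the upper triangle
    return C

X = np.random.default_rng(0).standard_normal((1000, 64))
assert np.allclose(gram_lower_blocked(X), X @ X.T)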
Aka a galactic algorithm:
Whereas many algorithms are theoretically fine for real data but don't make good use of cache or instruction sets...
>These and other optimizations trade multiplications for more additions, but from the little I know, with floating point, additions and subtractions are risky precision-wise, while multiplications are harmless.
Certainly not true. In general, different orders of magnitude are a problem. With multiplication and division these are more severe problems, as they lead to greater changes in magnitude, although even that is a generalization.
They're a problem with fixed point: might you have been thinking of that?
So they're making the point that if you apply more multiplications before the inevitable addition, you are likely increasing the danger of that addition.
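To make the hazard concrete, a couple of standard float64 examples:

# Addition across very different magnitudes silently drops the small operand,
# while multiplication keeps the *relative* error near machine epsilon.
big, small = 1.0e16, 1.0
print((big + small) - big)   # 0.0 -- small was absorbed entirely
print((big * small) / big)   # 1.0 -- the multiply does no damage

# The classic cancellation example: the tiny representation errors in 0.1 and 0.2
# only become visible once the subtraction strips the leading digits away.
print(0.1 + 0.2 - 0.3)       # small but nonzero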
All of these papers/algos are for the ML hype-train. ML algos are approximate anyway so no one cares about absolute accuracy, only the precision of the overall pipeline (class labels shouldn't change, at least not too much). Consider that very many papers/techniques quantize down to 8 or even 4 bits (yes sometimes even during training) for the purposes of perf.
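For context, the kind of quantization being referred to is roughly this toy int8 scheme (my sketch, not any particular paper's):

import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor int8 quantization: one scale, round, clamp
    scale = np.max(np.abs(x)) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.default_rng(0).standard_normal(1024).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = q.astype(np.float32) * scale
print(np.max(np.abs(x - x_hat)))   # worst-case error ~ scale/2; such pipelines tolerate this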
This whole research area should just be renamed to something like approximate computing so that people don't confuse it (and the goals) with classical numerical analysis.
For example, swapping columns and rows before applying a product, so that accuracy is best kept (then swapping back at the end of the operation). It seems to me the Strassen-like algorithms arbitrarily choose some elements of the matrices over others to form their temporary sum-product variables, so we could exploit this asymmetry (not sure if that's the correct word here) to decide which elements get picked.
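The swap-then-swap-back part is at least mechanically sound, since permuting rows or columns commutes with the product and can be undone afterwards; whether it actually helps accuracy inside a Strassen-style scheme is the open question. A tiny check of the bookkeeping:

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 4))
B = rng.standard_normal((4, 6))

perm = rng.permutation(5)
inv = np.argsort(perm)           # inverse permutation

C_ref = A @ B
C_perm = A[perm, :] @ B          # rows of the result come out permuted
assert np.allclose(C_perm[inv, :], C_ref)   # un-permute to recover the original product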
In the end, all data structures are functions. The academic applications of that may make your eyes glaze over, but the practical application is that if you want a reversed array/tree/etc., you just need the reading/iteration "function" for the array to return the data as if it were reversed. A matrix can be "transposed" just by reading its transpose. (Caching or other consequences may lead you to manifest it in memory anyhow, but if it was stored cleverly in the first place, or you're only going to read it once so that using it costs the same as reconstructing it, then there's no reason to.) An array can be randomly permuted simply by wrapping the read function with a permutation on the index; it need not be manifested in memory. Etc.
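A throwaway Python sketch of the "wrap the read function" idea (mine, purely illustrative); the tree swap below is the same move with accessors instead of indices:

import numpy as np

class LazyTranspose:
    # A "transposed" matrix that never moves data: it just rewrites indices on read.
    # (NumPy's own M.T is the same idea, implemented with strides.)
    def __init__(self, m):
        self.m = m
    def __getitem__(self, idx):
        i, j = idx
        return self.m[j, i]

M = np.arange(6).reshape(2, 3)
T = LazyTranspose(M)
assert T[2, 1] == M[1, 2]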
tmpfn = tree.left
tree.left = tree.right
tree.right = tmpfn
Now when tree.left(node) is applied, it returns the right child, and vice versa. All traversals use the accessors, QED.

There is almost never a good reason to invert a full 4x4/3x3 transform, yet I see it all the time.
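On the transform point: for the common rigid-body case the inverse has a closed form (transpose the rotation, un-rotate the negated translation), so no general 3x3/4x4 inverse is needed. A sketch, assuming the matrix really is rigid:

import numpy as np

def invert_rigid(T):
    # Inverse of [R | t; 0 1] is [R^T | -R^T t; 0 1] -- valid only when R is orthonormal
    R, t = T[:3, :3], T[:3, 3]
    Tinv = np.eye(4)
    Tinv[:3, :3] = R.T
    Tinv[:3, 3] = -R.T @ t
    return Tinv

theta = 0.3
T = np.eye(4)
T[:3, :3] = [[np.cos(theta), -np.sin(theta), 0],
             [np.sin(theta),  np.cos(theta), 0],
             [0, 0, 1]]
T[:3, 3] = [1.0, 2.0, 3.0]
assert np.allclose(invert_rigid(T), np.linalg.inv(T))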
[ a b ] [ a c ]  =  [ (a b) . (a b)   (a b) . (c d) ]  =  [ |(a b)|^2       (a b) . (c d) ]
[ c d ] [ b d ]     [ (c d) . (a b)   (c d) . (c d) ]     [ (c d) . (a b)   |(c d)|^2     ]
Down the diagonal we have the squared norms of (a b) and (c d), which is interesting and could somehow be used. We also have repeated values between the upper and lower triangles, because (a b) . (c d) is the same as (c d) . (a b): the dot product commutes.
Whereas just squaring the matrix:
[ a b ] [ a b ]  =  [ (a b) . (a c)   (a b) . (b d) ]
[ c d ] [ c d ]     [ (c d) . (a c)   (c d) . (b d) ]
all four dot products are unique. So right off the bat, we can find interesting things about X X^T without going deep, and an obvious shortcut.
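The 2x2 case is small enough to check symbolically; a quick SymPy confirmation of the counting:

import sympy as sp

a, b, c, d = sp.symbols('a b c d')
X = sp.Matrix([[a, b], [c, d]])

print(X * X.T)   # symmetric: squared norms on the diagonal, one shared off-diagonal value
print(X * X)     # four distinct entries, nothing to reuse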