One of the weird things you do in transformers is add a position vector that captures the distance between the token being attended to and some other token.
This is obviously not powerful enough to express non-linear relationships - like graph relationships.
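For concreteness, here's a minimal sketch of one common variant, a T5-style learned relative-position bias (toy sizes, untrained random weights, names illustrative). The positional signal enters each attention score as a single scalar indexed by the 1D offset i - j, which is exactly the linearity in question:

    # Minimal sketch (numpy) of a T5-style relative-position bias:
    # the score between positions i and j gets a learned scalar bias[i - j],
    # so "distance" enters attention as a single 1D offset.
    import numpy as np

    T, d = 8, 16                        # toy sequence length and head dim
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(T, d))         # stand-ins for query/key projections
    K = rng.normal(size=(T, d))
    bias = rng.normal(size=2 * T - 1)   # one scalar per offset in [-(T-1), T-1]

    offsets = np.arange(T)[:, None] - np.arange(T)[None, :]   # i - j
    scores = Q @ K.T / np.sqrt(d) + bias[offsets + (T - 1)]   # content + position
    attn = np.exp(scores - scores.max(-1, keepdims=True))     # row-wise softmax
    attn /= attn.sum(-1, keepdims=True)

A scalar per offset can rank "near vs. far" along a line, but it can't encode arbitrary pairwise structure like a graph's.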
This person seems to be experimenting with pre-processing the input token set: linearly reordering it by some other heuristic that might map more closely to the actual underlying relationship between tokens.
They replace dot-product attention with topology-based scalar distances derived from a Laplacian embedding, which effectively reduces attention scoring to a 1D energy comparison and can save memory and compute.
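The write-up doesn't spell out the exact formulation, so treat this as a rough sketch under assumed details (the token graph, the choice of eigenvector, and the absolute-difference "energy" are all my guesses): embed tokens with one Laplacian eigenvector, then score attention by scalar distance in that embedding instead of a d-dimensional dot product.

    # Rough sketch of the idea as I read it; the graph construction and the
    # absolute-difference "energy" are assumptions, not the author's spec.
    import numpy as np

    def laplacian_1d_attention(A):
        """A: (T, T) symmetric adjacency over tokens from some heuristic graph."""
        L = np.diag(A.sum(axis=1)) - A             # unnormalized graph Laplacian
        _, eigvecs = np.linalg.eigh(L)             # eigenvalues in ascending order
        f = eigvecs[:, 1]                          # Fiedler vector: a 1D embedding
        scores = -np.abs(f[:, None] - f[None, :])  # scalar "energy" per token pair
        attn = np.exp(scores - scores.max(-1, keepdims=True))
        return attn / attn.sum(-1, keepdims=True)  # row-wise softmax

    # Toy usage: a ring graph over 6 tokens; tokens with similar embedding
    # values attend to each other most.
    T = 6
    A = np.zeros((T, T))
    for i in range(T):
        A[i, (i + 1) % T] = A[(i + 1) % T, i] = 1.0
    print(laplacian_1d_attention(A).round(2))

If something like this holds up, each pairwise score costs a subtraction and an abs instead of a d-dimensional dot product, which is presumably where the memory and compute savings come from.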
That said, I'd treat the results with a grain of salt given there's no peer review, and benchmarks so far are only on a 30M-parameter model.
Lerc•30m ago
Provided the FLOPs aren't prohibitive. Output quality per model byte might be better; in general, people run the largest model they can.
I certainly think trading speed for quality at the same size is worth looking at, especially if it uses methods that can benefit from the broader efforts to improve speed in general.
That said, the performance difference at 30M may not be representative of the performance difference at 30B.
There are probably a lot of really good ideas out there waiting for someone to drop a few million on training to reveal how good they are at large sizes.
lostmsu•23m ago