[1] https://en.wikipedia.org/wiki/Earth_mover's_distance#More_th...
But!
Wasserstein distances are used instead of a KL divergence inside all kinds of VAEs and diffusion models, because while the Wasserstein distance is hard to compute, it is easy to construct sample-based estimates whose expectation is the gradient with respect to the Wasserstein distance. So you can easily get unbiased gradients, and that is all you need to train big neural networks. [0] Pretty much any time you sample from your current distribution and the target distribution and take the gradient of the distance between the points, you will be minimizing a Wasserstein distance.
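A toy sketch of that sample-and-differentiate recipe, in 1-D only, where the optimal pairing between two equal-size samples is just "sort both and match in order". The names `w1_1d` and `fit_shift` are made up for illustration, and the model is assumed to be a unit Gaussian with a single learnable shift `theta`; this is not how any particular VAE or diffusion model does it.

```python
import random

def w1_1d(xs, ys):
    """Empirical Wasserstein-1 distance between two equal-size 1-D samples.

    In 1-D the optimal transport plan pairs the sorted samples in order.
    """
    assert len(xs) == len(ys)
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)

def fit_shift(target_sampler, steps=2000, batch=64, lr=0.05, seed=0):
    """Fit theta in x = theta + N(0, 1) by SGD on the minibatch W1 distance.

    target_sampler is a hypothetical callable: rng -> one target sample.
    """
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        model = sorted(theta + rng.gauss(0.0, 1.0) for _ in range(batch))
        target = sorted(target_sampler(rng) for _ in range(batch))
        # Every model point moves one-for-one with theta, so the gradient of
        # mean |x_i - y_i| over the sorted pairs is the mean of sign(x_i - y_i).
        grad = sum((x > y) - (x < y) for x, y in zip(model, target)) / batch
        theta -= lr * grad
    return theta
```

Calling `fit_shift(lambda rng: rng.gauss(3.0, 1.0))` should drive `theta` close to 3 without ever computing the true distance between the two distributions, only per-batch distances between sampled points.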
smokel•11h ago
[1] https://deepgenerativemodels.github.io/
[2] https://youtube.com/playlist?list=PLoROMvodv4rPOWA-omMM6STXa...