There is a fully bijective version of the BWT which doesn't require the index. It's not what BZip2 uses though.
Someone•5mo ago
That method adds an extra character to your alphabet, and uses that as a marker in the transformed string to h effectively) store the index of the first character inside the transformed text.
If, as is often the case, your alphabet contains exactly as many letters as a byte/word can store, that makes for awkward encoding. I also think that, typically, that method will require more memory.
vintermann•5mo ago
No, I'm not talking about that. I'm talking about David Scott's bijective BWT variant. It sorts symbols by a slightly different type of context, and so can do away with a sentinel symbol or a stored index entirely. It was described in the wikipedia page last I checked.
Scott was a bit of a weirdo, but his algorithm actually got recognition in academia, there have been many papers written about it. One character saved doesn't make a lot of difference for compression, but the transform being a permutation does give it some nice properties.
horizion2025•5mo ago
I also think when looking at how BWT works, it is initially surprising it is invertible even with such little information as the first character. It appears a bit like sorting the letters and that certainly would require a lot more information to back into place. It is a bit magic even when you know the inversion algorithm.
amy_petrik•5mo ago
well here's a big surprise of BWT
BWT is a transformation transforming a string into "suffix tree space"
You can do searches - very fast searches - in BZ2/BWT compression space
Discussed in the '90s in the legendary dr. dobbs
But more significantly, BWT is the basis for most DNA and RNA sequence aligners nowadays - which perform string search/matching in suffix tree (transformed) space. It's like doing frequency stuff in fourier domain, it's like adding is multiplication in logspace - but in weird data string search space. The BWT.
etrez•5mo ago
Thx, corrected.
BirAdam•5mo ago
This is excellent work, and I love seeing projects in Ada.
nayuki•5mo ago
I think this is not true. The way I learned the BWT, after encoding, you need to store the index of the first character (which is a tiny bit of extra information). https://web.archive.org/web/20170325024404/http://marknelson...
vintermann•5mo ago
Someone•5mo ago
If, as is often the case, your alphabet contains exactly as many letters as a byte/word can store, that makes for awkward encoding. I also think that, typically, that method will require more memory.
vintermann•5mo ago
Scott was a bit of a weirdo, but his algorithm actually got recognition in academia, there have been many papers written about it. One character saved doesn't make a lot of difference for compression, but the transform being a permutation does give it some nice properties.
horizion2025•5mo ago
amy_petrik•5mo ago
BWT is a transformation transforming a string into "suffix tree space"
You can do searches - very fast searches - in BZ2/BWT compression space
Discussed in the '90s in the legendary dr. dobbs
But more significantly, BWT is the basis for most DNA and RNA sequence aligners nowadays - which perform string search/matching in suffix tree (transformed) space. It's like doing frequency stuff in fourier domain, it's like adding is multiplication in logspace - but in weird data string search space. The BWT.
etrez•5mo ago