Is the distinction that an index on a multi-valued attribute is called an inverted index?
The only thing "inverted" here is the context. The author even admits themselves that the word->doc mapping is an index:
"If user wants to search by words - then words should be keys in our "database" (index)"
It's a pointless debate of semantics. An inverted map is still a map.
The discussion about it is not pointless since it clears up confusion. It might not have been for you, but it's clearly for many others, so if you think that's pointless then allowing yourself to appreciate other perspectives could go a long way.
An inverted map is still a map, but if you are typically thinking of the map A->B and then suddenly somebody talks about an inverted map, then it's understandable that people start to assume that this is now about B->A and get confused if it somehow actually isn't really.
But when we talk about inverted indexes, they are almost always term -> posting list, and most index data structures lay these out so that posting lists are sorted and compressed together. Traditional database indexes like B-trees are optimized for rapid insertion and deletion, while inverted indexes tend to be optimized for batch processing, because you typically deconstruct text into words for a large batch and then lazily integrate this batch into the main index.
Part of this is about scale; a row in a database typically has a single column or maybe 2-3 columns in a composite index; but a document text may tokenize into thousands, hundreds of thousands, or millions of words. At this scale, the fine-grained nature of words mean B-trees aren't as a good a fit.
Another part of it is that inverted indexes aren't for point queries, which is what B-trees are optimized for; you typically search for many words at a time in order to rank your search results by some function like cosine similarity. You rarely want a single posting; you want the union or intersection of many posting sorted by score.
The idea of an index is more general, as an index can be built for many different domains. For example, B-trees can index monoidal data and inverted indexes are just an instance of such a monoid that a B-tree can efficiently index.
Furthermore, metric spaces (e.g., levenshtein distance) can also be efficiently indexed using other trees: metric trees. So calling inverted indexes just indexes would be really confusing since string data is not the only kind of data that a database might want to support having efficient indexes for.
The trade-off this kind of index makes is that it is more optimized for (batch) writes than the more popular B-Tree indexes, but it is less optimized for reads on the other hand. If the write throughout of a given table is very high, you might want to remove all B-Tree indexes that are not strongly correlated to the insert order and have BRIN indexes instead. Combine it with table partitioning, and you can add B-Tree indexes in the cold partitions, or even migrate them to columnar storage if available (with the Citus extension).
By the way, a few years ago a Bloom BRIN variant was added, not to be confused with Postgres' Bloom indexes which are something else.
"Coarse" indexes like BRIN and ClickHouse's data-skipping indexes are still indexes in a broad sense of serving to narrow down a search.
A document can be viewed as an object with a set of pointers to the words it contains.
The inverse of that, was a word object, with a list of pointers to the documents it is found it. This was referred to an an inverted DOCUMENT index. This is what people would normally just call an index.
At some point, people dropped the "DOCUMENT" part, and started just calling it an "inverted index". This makes no sense, grammatically, as it's the document that is inverted, not the index, but it is what it is.
So, an inverted index is just an index.
In summary, they are not "inverted index" in the sense of "the inversion of what you'd normally think of as an index" but instead in the sense of "a map which provides the inversion of the map from documents to words in them, in other words, an index".
the_precipitate•11h ago
Edit: I stand corrected: it's called index stream readers (thanks atombender for pointing this out). For those who knows Mike Burrows only for the Burrows-Wheeler transformation (BZip), you might also want to know that he was also one of the main developers of AltaVista, the first real search engine for the internet. He also designed the early versions of Bing search engine. Eventually he worked for Google and designed their lock service called Chubby.
atombender•10h ago
mrkeen•10h ago
I'm not 100% but I don't think you can directly query a BWT in the same way you'd query an inverted index (without the later discovery of wavelet trees and FM-indexes / succinct data structures, and all that jazz.) And that's mostly for genomics? Not sure if it applies to plain old document searches. Would love to be corrected though.
lazamar•6h ago
marginalia_nu•9h ago
[1] https://nlp.stanford.edu/IR-book/html/htmledition/faster-pos...