> The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) and was introduced to Arrow as a new StringViewArray[^3] type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.
The name seems to mean nothing more than that they were invented at a German university. I spent quite some time thinking it had something to do with German's sometimes-SOV word order.
If you mean subordinate clauses in German: there the rule is rather "the finite verb goes at the end of the subordinate clause".
Umbra: A Disk-Based System with In-Memory Performance
https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
Section 3.1 covers string handling.
This article (also linked from TFA) explains German strings in more detail.
- the representation is two 64-bit words (16 bytes)
- the length is a fixed 32-bit field
- short strings (up to 12 bytes) are stored in-place
- long strings store a 4-byte prefix in-place plus a pointer to the rest
- two bits of the pointer are used as flags to further optimize some use cases
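To make the layout above concrete, here is a minimal Python sketch of the 16-byte record. It is not Umbra's or Arrow's actual code: the names `encode`, `decode`, `prefix` and `equal` are made up for illustration, the "pointer" is just an offset into a `bytearray` standing in for out-of-line storage, little-endian packing is an arbitrary choice, and the two flag bits in the pointer are left out.

```python
import struct

INLINE_MAX = 12  # bytes left in a 16-byte record after the 4-byte length


def encode(s: bytes, heap: bytearray) -> bytes:
    """Pack s into a 16-byte record (hypothetical helper, not a real API)."""
    n = len(s)
    if n > 0xFFFFFFFF:
        raise ValueError("length does not fit in 32 bits")
    if n <= INLINE_MAX:
        # Short string: length + up to 12 content bytes, zero-padded.
        return struct.pack("<I12s", n, s)
    # Long string: length + 4-byte prefix + 8-byte "pointer".
    # Assumption: the pointer is modelled as an offset into `heap`.
    offset = len(heap)
    heap.extend(s)
    return struct.pack("<I4sQ", n, s[:4], offset)


def decode(rec: bytes, heap: bytearray) -> bytes:
    """Recover the full string from a 16-byte record."""
    (n,) = struct.unpack_from("<I", rec, 0)
    if n <= INLINE_MAX:
        return rec[4:4 + n]
    (offset,) = struct.unpack_from("<Q", rec, 8)
    return bytes(heap[offset:offset + n])


def prefix(rec: bytes) -> bytes:
    """First four content bytes; always available without touching the heap."""
    return rec[4:8]


def equal(a: bytes, b: bytes, heap: bytearray) -> bool:
    """Compare two records, rejecting early on length or prefix mismatch."""
    # The first 8 bytes hold the length and the 4-byte prefix. If they
    # differ, the strings differ and no out-of-line data is read.
    if a[:8] != b[:8]:
        return False
    return decode(a, heap) == decode(b, heap)
```

The point of the prefix is visible in `equal`: most mismatches are settled by comparing the first eight bytes of each record, so the out-of-line data is only read when length and prefix both match.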
It would be better just to take the storage requirement on the chin and not add a gratuitous variation in encoding which will bite you on the ass somehow (or someone else).
As much as possible, pick one way of doing one thing. Your stuff already has thousands of things to do. Each time you do something in two or more ways, you add combinations between that and surrounding things being done in two or more ways.
In a wider view, that depends. If one is using a general-purpose heap for string storage and a 64-bit instruction set architecture, the heap is often aligning and padding out allocations to such multiples already.
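A rough way to see the padding effect is to round requested sizes up to the allocator's granule. The 16-byte granule below is an assumption (common on 64-bit platforms but not universal), `padded_size` is a hypothetical helper, and real allocators also add per-allocation bookkeeping that this ignores.

```python
def padded_size(n: int, granule: int = 16) -> int:
    """Round n up to the next multiple of granule, with at least one granule."""
    return max(granule, (n + granule - 1) // granule * granule)


# Requested string sizes and what an allocator with a 16-byte
# granule (assumed) would hand back for each of them.
for requested in (1, 13, 16, 17):
    print(requested, "->", padded_size(requested))
```

Under that assumption, a 1-byte and a 13-byte string both occupy 16 bytes, and a 17-byte string occupies 32.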