Das Problem mit German Strings

https://www.polarsignals.com/blog/posts/2025/08/26/das-problem-mit-german-strings

74•asubiotto•2d ago

Comments

dekhn•2d ago

did the hacker news title editor change the "mit" to "MIT"?

asubiotto•2d ago

Seems like it. Changed it back!

dang•2d ago

Oops, sorry.

Tadpole9181•2d ago

Haha, is that automated or was someone trying to be helpful?

dang•1d ago

It's automated. And of course it's usually right, but the wrong cases stand out like sore thumbs.

thayne•1d ago

So... why are they called German strings?

mathieuh•1d ago

https://datafusion.apache.org/blog/2024/09/13/string-view-ge...

> The concept of inlined strings with prefixes (called “German Strings” by Andy Pavlo, in homage to TUM, where the Umbra paper that describes them originated) has been used in many recent database systems (Velox, Polars, DuckDB, CedarDB, etc.) and was introduced to Arrow as a new StringViewArray[^3] type. Arrow’s original StringArray is very memory efficient but less effective for certain operations. StringViewArray accelerates string-intensive operations via prefix inlining and a more flexible and compact string representation.

Seems to be nothing more than they were invented at a German university. I spent quite some time thinking it had something to do with German’s sometimes-SOV word order.

aleph_minus_one•1d ago

> I spent quite some time thinking it had something to do with German’s sometimes-SOV word order.

If you refer to subclauses in the German language: here the rule is rather "the finite verb is at the end of the subclause".

yorwba•1d ago

It also applies to infinitives and participles and the verb in nominalized noun-verb compounds. So the rule is closer to "the verb is at the end of its grammatical unit, except for the finite verb in a main clause, which appears in second position." https://en.wikipedia.org/wiki/V2_word_order

kaladin-jasnah•1d ago

I think this is also called V2 word order.

aleph_minus_one•1d ago

V2 word order (finite verb comes second) is what is used in main clauses.

jandrewrogers•1d ago

This general string format style has been invented many times over the decades. Unfortunately, we seem to need to relearn the tradeoffs each time.

andai•1d ago

Here is the paper in question:

Umbra: A Disk-Based System with In-Memory Performance

https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf

Section 3.1 covers string handling.

This article (also linked from tfa) explains German strings in more detail.

https://cedardb.com/blog/german_strings

chombier•1d ago

My tl;dr after reading the article:

- each string is represented by two 64-bit words (16 bytes)

- the length is a fixed 32 bits

- short strings (up to 12 bytes) are stored in-place

- long strings store a 4-byte prefix in-place + a pointer to the rest

- two bits in the pointer are used as flags to further optimize some use cases
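
The layout in the list above can be sketched in Python with the `struct` module. This is only an illustration under assumptions: `encode_german`, `decode_german`, `prefix`, and `HEAP` are hypothetical names, the "pointer" is an index into a Python list standing in for a real memory address, and the two flag bits are left out.

```python
import struct

# Hypothetical stand-in for out-of-line storage; a real system keeps a raw pointer.
HEAP = []

def encode_german(s: bytes) -> bytes:
    """Pack a byte string into a 16-byte German-style representation."""
    n = len(s)
    if n <= 12:
        # Short string: 4-byte length, then up to 12 inline bytes (zero-padded).
        return struct.pack("<I12s", n, s)
    # Long string: 4-byte length, 4-byte prefix, 8-byte "pointer" (here a list index).
    HEAP.append(s)
    return struct.pack("<I4sQ", n, s[:4], len(HEAP) - 1)

def decode_german(rep: bytes) -> bytes:
    """Recover the full string from its 16-byte representation."""
    (n,) = struct.unpack_from("<I", rep, 0)
    if n <= 12:
        return rep[4:4 + n]
    (idx,) = struct.unpack_from("<Q", rep, 8)
    return HEAP[idx]

def prefix(rep: bytes) -> bytes:
    """First bytes of the content, readable in both forms without touching HEAP."""
    (n,) = struct.unpack_from("<I", rep, 0)
    return rep[4:4 + min(n, 4)]
```

Because the prefix sits at the same offset in both forms, a comparison that differs within the first four bytes never has to follow the pointer.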

on_the_train•1d ago

They aren't. They're called German-style strings. People just like to clickbait and prey on the curiosity of techies.

kazinator•1d ago

> Because it is difficult to assume what the best encoding will be for any given workload, database systems should dynamically choose encodings based on storage and workload characteristics.

It would be better just to take the storage requirement on the chin and not add a gratuitous variation in encoding which will somehow bite you (or someone else) on the ass.

As much as possible, pick one way of doing one thing. Your stuff already has thousands of things to do. Each time you do something in two or more ways, you add combinations between that and surrounding things being done in two or more ways.

kccqzy•6h ago

The combinatorial explosion problem is nicely solved by defining good interfaces. C++ gives you iterators and algorithms that work on iterators. Clojure has sequence interfaces and functions that work on all sequence types.
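
A minimal sketch of that idea, not taken from the article: the hypothetical classes `PlainStrings` and `DictStrings` stand in for two encodings of the same column, and both expose only iteration, so the hypothetical function `count_with_prefix` is written once and works on either.

```python
from typing import Iterable, Iterator

class PlainStrings:
    """Stores each value directly."""
    def __init__(self, values: Iterable[str]) -> None:
        self._values = list(values)

    def __iter__(self) -> Iterator[str]:
        return iter(self._values)

class DictStrings:
    """Dictionary encoding: each distinct value stored once, plus integer codes."""
    def __init__(self, values: Iterable[str]) -> None:
        self._dict = []    # distinct values, in first-seen order
        self._codes = []   # one index into self._dict per row
        index = {}
        for v in values:
            if v not in index:
                index[v] = len(self._dict)
                self._dict.append(v)
            self._codes.append(index[v])

    def __iter__(self) -> Iterator[str]:
        return (self._dict[c] for c in self._codes)

def count_with_prefix(column: Iterable[str], p: str) -> int:
    """Written once against the iteration interface, independent of encoding."""
    return sum(1 for s in column if s.startswith(p))
```

Whether this removes the risk or only organizes it is exactly what the replies below dispute; the sketch shows only the interface side of the argument.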

kazinator•6h ago

That just improves the organization of the program; it doesn't get rid of the increased risks of doing the same thing in N ways that could be pinned down to one.

kccqzy•5h ago

Please elaborate. What are the risks of doing the same thing in N ways, other than code organization issues leading to duplicate or messy code?

JdeBP•1d ago

> Because each element requires at least a 16 byte representation, both tiny and repeated short strings use more memory than they otherwise would.

In a wider view, that depends. If one is using a general-purpose heap for string storage and a 64-bit instruction set architecture, the heap is often aligning and padding out allocations to such multiples already.
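
As a rough illustration of the padding arithmetic, assuming an allocator that rounds every request up to a multiple of 16 bytes and ignoring any per-allocation header (real allocators differ in both respects; `padded_size` is a hypothetical helper):

```python
def padded_size(n: int, align: int = 16) -> int:
    """Round a request of n bytes up to the next multiple of align (assumed granularity)."""
    return (n + align - 1) // align * align

# Requested size versus size after rounding, under the assumption above.
for n in (1, 5, 12, 16, 17):
    print(n, padded_size(n))
```

Under that assumption, a separately allocated 5-byte string already occupies 16 bytes of heap, so the fixed 16-byte representation costs nothing extra for it.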

atoav•3h ago

Well, as long as you know the difference between lowercase ß and uppercase ẞ (introduced in 2008), everything is probably just gonna be fine.
