Why German Strings Are Everywhere?

https://cedardb.com/blog/german_strings/
69•byt3h3ad•1mo ago

Comments

ahartmetz•1mo ago
In case anyone wonders why they are called German strings: the article mentions the "research predecessor" of Cedar, Umbra. Umbra is a project of TU (technical university) Munich, Germany.
f1shy•1mo ago
I knew C++ strings were optimized this way. I do not like calling them "German". This is the first time I have seen them called that (I know the technique as SSO, short string optimization; you can search for it yourself), and it looks like some kind of nationalist pride thing to me. It is certainly not a unique or new idea; many Scheme/Lisp implementations do that for strings AND numbers.

Downvotes coming from other connationals :) love you! I know… never say anything bad about Vaterland

maweki•1mo ago
What do you like to call Hungarian notation?
masklinn•1mo ago
"System's Horrendous Pile Of Shit".

If anyone ever referred to Apps Hungarian that would be "Simonyi's Wish For A Proper Type System", but nobody ever does.

f1shy•1mo ago
I don’t know any other name for it, whereas these strings are basically SSO (or a twist on it).
aidenn0•1mo ago
From TFA and AndyP's slides it seems to specifically refer to a variant of SSO where, for large strings, a fixed-size prefix of the string is stored alongside the pointer, in the same location as that fixed-size prefix would be for SSO strings. This means that strings lacking a common prefix can be quickly compared without pointer-chasing (or even knowing if they are small vs large).
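
A minimal sketch of that comparison, assuming the 16-byte layout discussed elsewhere in this thread (4-byte length, then either 12 inline bytes or a 4-byte prefix plus an 8-byte pointer). The names `descriptor` and `definitely_different` are made up for illustration, and the pointer is just an integer stand-in; this is not CedarDB's code.

```python
import struct

INLINE_MAX = 12  # bytes that fit after the 4-byte length field


def descriptor(data: bytes, pointer: int = 0) -> bytes:
    """Build a 16-byte descriptor; `pointer` is a stand-in for a heap address."""
    if len(data) <= INLINE_MAX:
        # Short string: length + content, zero-padded to 12 bytes.
        return struct.pack("<I12s", len(data), data)
    # Long string: length + first 4 bytes + pointer to the full payload.
    return struct.pack("<I4sQ", len(data), data[:4], pointer)


def definitely_different(a: bytes, b: bytes) -> bool:
    """Compare only length and prefix (the first 8 bytes of each descriptor).

    Works the same for short and long strings because the first 4 content
    bytes sit at the same offset in both layouts. True means the strings
    differ; False means the remaining bytes must still be compared.
    """
    return a[:8] != b[:8]
```

Two long URLs of equal length that both start with "http" are not separated by this check, so a full comparison (and a pointer dereference) is still needed in that case.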
f1shy•1mo ago
Well read.

basically SSO (or a twist of it)

magnat•1mo ago
Also:

* Reverse polish notation

* Chinese remainder theorem

* Byzantine generals problem

f1shy•1mo ago
All names given by people who were NOT of that nationality.
masklinn•1mo ago
C++ strings are not "optimized so". C++ strings (generally) do SSO (up to 23 bytes depending on implementation), these also do SSO but only 8 bytes (to a total of 12), the first 4 bytes are always stored inline for fast lookup even when the rest of the string is on the heap (in which case they're duplicated), and the strings are limited to 4GB (32 bits length). IIRC they also have a bunch of other limitations (e.g. they're not really extensible, by design).

Which is why they're "everywhere"... in databases, especially columnar storage.
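
To make the layout concrete, here is a rough sketch of a 16-byte descriptor as described in this comment. It assumes little-endian packing, uses a Python list index in place of a real pointer, and leaves out the class bits; `encode`, `decode` and `heap` are hypothetical names.

```python
import struct

INLINE_MAX = 12            # 16-byte descriptor minus the 4-byte length
MAX_LENGTH = 2**32 - 1     # the length field is 32 bits, hence the 4 GB cap


def encode(data: bytes, heap: list) -> bytes:
    """Return a 16-byte descriptor; long payloads are appended to `heap`."""
    if len(data) > MAX_LENGTH:
        raise ValueError("string longer than the 32-bit length field allows")
    if len(data) <= INLINE_MAX:
        # Short string: everything lives inside the descriptor.
        return struct.pack("<I12s", len(data), data)
    heap.append(data)  # the list index plays the role of the pointer
    # Long string: the first 4 bytes are duplicated inline as the prefix.
    return struct.pack("<I4sQ", len(data), data[:4], len(heap) - 1)


def decode(desc: bytes, heap: list) -> bytes:
    """Recover the full string from a descriptor."""
    (length,) = struct.unpack_from("<I", desc)
    if length <= INLINE_MAX:
        return desc[4:4 + length]
    _, _prefix, index = struct.unpack("<I4sQ", desc)
    return heap[index]
```

A 12-byte string stays entirely inline, while a 13-byte string already goes to the heap with only its first 4 bytes kept in the descriptor.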

f1shy•1mo ago
Yes. Sorry. That was not 100% correct. Still, your words (my emphasis)

>C++ strings are not "optimized so". C++ strings (generally) do SSO (up to 23 bytes depending on implementation), these also do SSO but only 8 bytes (to a total of 12)

That is what I meant. I would (like you did) call it SSO still.

ahartmetz•1mo ago
Doesn't seem nationalist to me because the name seems to have been coined by the people at Cedar, not TU Munich.
f1shy•1mo ago
Cedar is a German company, in case you did not know. That makes it especially nationalist.
aidenn0•1mo ago
According to TFA, the name was coined by Andy Pavlo, who did his undergrad in New York, Doctorate in Rhode Island, and now teaches in Pittsburgh. I see no indication that he is German.

[edit]

Lecture slide with the term is also linked from TFA: https://15721.courses.cs.cmu.edu/spring2024/slides/05-execut...

f1shy•1mo ago
Andy Pavlo, ask anybody who knows him, is a very interesting character. I’m not sure if he really coined the name, or if it is just a joke, like the one about his daughter (which I do both find funny at all). The database world (in which I was, is very interesting!) ;)
weinzierl•1mo ago
Maybe they should be called "Kemper Strings" then?
f1shy•1mo ago
Probably. Or just SSO, as it is basically a very well known name already.
ninju•1mo ago
What does SSO stand for? (asking for a friend)
weinzierl•1mo ago
In this context, Short String Optimization and not the usual Single Sign On.
pseudohadamard•1mo ago
I'm gonna vote for KarlValentinStruempfe.
jmclnx•1mo ago
If I understood what I was reading about German Strings, I think UTF-8 could add complications to these things.
f1shy•1mo ago
I think that, statistically, those short strings still mostly fall in the lower 128 code points, which is ASCII.
StopDisinfo910•1mo ago
Not really, no.

The main difference is that you don't know how many code points you have in the prefix, as UTF-8 is a variable-length encoding, so it can be up to four but as few as one. I imagine the choice of four bytes for the prefix was actually made specifically for this reason. That's the maximum length of a UTF-8 code point.

The length is not the number of characters anymore but just the size of the string.

Apart from that, it should work exactly the same.

msichert•1mo ago
We chose 4B because that was the maximum number of bytes that would be unused otherwise (4B for the length, 8B for the pointer leaves 4B), the UTF8 encoding doesn't really matter.

Also, for UTF8 specifically, cutting code points in half is fine as long as all strings are valid UTF8. The UTF8 encoding is prefix free, i.e., no valid code point is a prefix of another valid code point, so for prefix matching we can usually just compare bytes.

It only gets more complicated if you add collations or want to match case-insensitively. But at that point you need to take into account all edge cases of the Unicode spec anyway.
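
A small sketch of the byte-wise prefix matching described here, assuming the inline prefix is simply the first 4 bytes of the UTF-8 encoding. `inline_prefix` and `may_start_with` are illustrative names, not part of any real API.

```python
def inline_prefix(text: str) -> bytes:
    """First 4 bytes of the UTF-8 encoding, as the inline prefix would hold them."""
    return text.encode("utf-8")[:4]


def may_start_with(prefix: bytes, query: str) -> bool:
    """Byte-wise prefix test against the inline prefix only.

    False is definitive. True means the inline bytes agree with the query as
    far as they go; a longer query still needs a check against the full string.
    """
    q = query.encode("utf-8")
    n = min(len(prefix), len(q))
    return prefix[:n] == q[:n]


# Each character in this string takes 3 bytes, so the 4-byte prefix holds
# one full code point plus the first byte of the next one.
prefix = inline_prefix("日本語のテキスト")
```

The prefix here is not valid UTF-8 on its own, since the second code point is cut after one byte, yet comparing it byte by byte against a valid UTF-8 query still never produces a false rejection.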

StopDisinfo910•1mo ago
> We chose 4B because that was the maximum number of bytes that would be unused otherwise

I'm sure you did but there is something funny reading this phrase while at the same time considering you have robbed two bits from your pointer to represent class - admittedly the only thing I find questionable in your design.

If that's the case it's a happy accident because having a full code point here is quite nice.

msichert•1mo ago
That's a good point. We just use pointer tagging in many different places (e.g. for pointer swizzling in our buffer pages), so including a few bits of information in a pointer just seemed obvious.
masklinn•1mo ago
They just store bytes. A leading astral codepoint means your prefix store contains just one codepoint, but that doesn't really change anything per se.
cubefox•1mo ago
The added question mark in the HN submission makes little sense.
nkrisc•1mo ago
It also makes it grammatically incorrect. If it were actually a question it should be, “Why are German strings everywhere?”
cubefox•1mo ago
The other form seems to be an Indian English colloquialism.
thaumasiotes•1mo ago
Do you mean "Why German strings are everywhere?" as an interrogative form?

I doubt that's specific to India. I had a teacher in high school who was Greek and who characteristically asked us "what it could be?", meaning "what could it be?".

Questions in Mandarin Chinese use the same sentence structure as their related statements. I imagine this is really common across languages.

Rygian•1mo ago
Title should mention (2024). Some of the info was already outdated back then [1]

[1] https://news.ycombinator.com/item?id=41176051

xnorswap•1mo ago
I really enjoyed this article. The storing in-place of a prefix is a neat idea for faster matching/sorting.

I wonder if they also have the concept of a reverse string which stores the (reversed) suffix instead and stores the short strings backward.

Niche, but would be fast for heavy ends-with filters.

msichert•1mo ago
If you want to improve equality matching for longer strings, you could even store a 4B hash of the entire string instead of the prefix. I guess that should work well if you equality match on URLs since their prefix is always "http".
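
As a rough illustration of that idea, the sketch below compares a 4-byte prefix tag with a 4-byte hash tag on two URLs. `zlib.crc32` is only a convenient stand-in for whatever hash one would actually pick, and `prefix_tag` / `hash_tag` are hypothetical names.

```python
import zlib


def prefix_tag(data: bytes) -> bytes:
    """The usual inline tag: the first 4 bytes of the string."""
    return data[:4]


def hash_tag(data: bytes) -> bytes:
    """Alternative inline tag: a 4-byte hash of the whole string."""
    return zlib.crc32(data).to_bytes(4, "little")


url_a = b"https://example.com/a"
url_b = b"https://example.com/b"

# Both URLs share the prefix "http", so the prefix tag cannot tell them apart.
same_prefix = prefix_tag(url_a) == prefix_tag(url_b)

# The hash covers the whole string, so the tags differ for these two URLs.
same_hash = hash_tag(url_a) == hash_tag(url_b)
```

The trade-off is that a hash tag only helps equality checks; it gives no ordering information, so prefix filters and sorting would lose the shortcut.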
WaitWaitWha•1mo ago
> To solve these problems, Umbra, the research predecessor of CedarDB, invented what Andy Pavlo now affectionately (we assume ;)) calls “German-style strings”.

This is how Borland Turbo Pascal stored strings as far back as the first version in mid-80s.

Length followed by the string.

xnorswap•1mo ago
That's not what it's doing though.

Pascal strings are: { length, pointer }

In these strings:

For short strings it's storing:

  { length, string value }

For longer strings, it's storing:

  { length, prefix, class, pointer }
masklinn•1mo ago
> Pascal strings are: { length, pointer }

The historical P-strings are just a pointer, with the length at the head of the buffer. Hence length-prefixed strings, and their limitation to 255 bytes (only one byte was reserved for the length, you can still see this in the most base string of freepascal: https://www.freepascal.org/docs-html/ref/refsu9.html).

    {length, pointer}

or

    {length, capacity, pointer}

are struct / record strings, which is what pretty much every modern language uses (possibly with optimisations, e.g. SSO23 is basically a p-string when inline, but can move out of line into a full record string).
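
For comparison, a minimal sketch of a historical length-prefixed string as described above: a single buffer whose first byte is the length, which is where the 255-byte limit comes from. `to_pstring` and `from_pstring` are illustrative helpers, not a real library API.

```python
def to_pstring(data: bytes) -> bytes:
    """Encode as one length byte followed by the payload."""
    if len(data) > 255:
        raise ValueError("a one-byte length field cannot describe more than 255 bytes")
    return bytes([len(data)]) + data


def from_pstring(buf: bytes) -> bytes:
    """Read the length from the head of the buffer, then slice out the payload."""
    length = buf[0]
    return buf[1:1 + length]
```

Unlike a record string, nothing is stored next to the pointer itself: even the length requires a read from the buffer.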
f1shy•1mo ago
I think it is about the kind of union they use to store the string differently depending on its length, not the fact of length+data. Anyway, the idea is/was also nothing remotely new, as many Lisp and Scheme implementations have done so for strings and numbers basically for ages.
afandian•1mo ago
Storing the prefix and the tagged union of pointer and inline data structure is a big difference from Pascal strings though.
mau•1mo ago
German-style strings are a way to store an array of strings for columnar DBs. The idea is to have an array of metadata. Each metadata entry has a fixed size (16 bytes). The metadata includes the string length and either a pair of pointer + string prefix or the full string for short strings. For some operations the string prefix is enough, in many cases avoiding the indirection.

This is different from Pascal strings.

kardianos•1mo ago
This is actually really similar to how SQL Server has long encoded its varchar(max) format as I understand it. Short text is stored on the row page, but longer text is bumped to a different page.
masklinn•1mo ago
Postgres does the same thing, however AFAIK postgres does not use a fixed-size string which happens to have inline string data: text is always variable, and stored inline up to 127 bytes (after compression).

These are different because the inline segment is fixed-size, and always exposes a 4-byte prefix inline even when the buffer is stored out of line.

aDyslecticCrow•1mo ago
Interesting to see a deepdive about string formats. I hadn't thought very deeply about it before.

I do agree with the immutable string argument. Mutable and immutable strings have different use cases and design tradeoffs. They perhaps shouldn't be the same type at all.

The transient string is particularly brilliant. I've worked with some low-level networking code in C, and being able to create a string containing the "payload" by pointing directly to an offset in the raw circular packet buffer is very clean. (The alternative is juggling offsets, or doing excessive memcpy.)

So beyond the database usecase it's a clever string format.

It would be nice to have an ISO or equivalent specification on it though.

masklinn•1mo ago
> The transient string is particularly brilliant. I've worked with some low-level networking code in C, and being able to create a string containing the "payload" by pointing directly to an offset in the raw circular packet buffer is very clean. (The alternative is juggling offsets, or doing excessive memcpy.)

It's not anything special? That's just `string_view` (C++17). Java also used to do that as an optimisation (but because it was implicit and not trivial to notice, it caused difficult-to-diagnose memory leaks; IIRC it was introduced in Java 1.4 and removed in 1.7).

aDyslecticCrow•1mo ago
> It's not anything special? That's just `string_view` (C++17)

Just because something already exists in some language doesn't make it less clever. It's not very widespread, and it's very powerful when applicable.

This format can handle "string views" with the same logic as "normal strings" without relying on interfaces or inheritance overhead.

it's clever.

masklinn•1mo ago
> It's not very widespread

It is tho?

> and it's very powerful when applicable.

I don't believe I stated or even hinted otherwise?

> This format can handle "string views" with the same logic as "normal strings" without relying on interfaces or inheritance overhead.

"owned" and "borrowed" strings have different lifecycles and if you can't differentiate them easily it's very easy to misuse a borrowed string into an UAF (or as Java did into a memory leak). That is bad.

And because callees usually know whether they need a borrowed string, and they're essentially free, the utility of making them implicit is close to nil.

Which is why people have generally stopped doing that, and kept borrowed strings as a separate type. Without relying on interfaces or inheritance.

> it's clever.

The wrong type thereof. It's clever in the same way Java 1.4's shared substrings were clever, with worse consequences.

aDyslecticCrow•1mo ago
> "owned" and "borrowed"

> java 1.4's

You're getting into pedantry about specific languages and their implementation. I never made a statement about C++ or Java. I work primarily in C99 myself.

> the utility of making them implicit is close to nil.

> Without relying on interfaces or inheritance.

Implement a function that takes three strings without 3! permutations of that function either explicitly or implicitly created.

masklinn•1mo ago
> You're getting into pedantry about specific languages

No, I'm using terms which clearly express what I'm talking about, and referring to actual historical experience with these concerns.

> Implement a function that takes three strings without 3! permutations of that function either explicitly or implicitly created.

In the overwhelming majority of cases this is a nonsensical requirement, if the function can take 3 borrowed strings you just implement a single function which takes 3 borrowed strings.

In the (rare) situation where optimising for maybe-owned makes sense, you use a wrapper type over "owned or borrowed". Which still needs no "interface or inheritance".

tracker1•1mo ago
I never really put much thought into it either, until I started playing with Rust, which pretty much supports every common way to use strings out there. Mostly for compatibility's sake, but still, it's wild all the same.
bjourne•1mo ago
"Optimized" string types are everywhere and I bet that multiple people have already created string types almost identical to German strings. But the memory savings are small and they are not more efficient than ordinary strings. For string comparison you compare the pointers, which is cheaper than comparing two pairs of registers. If the pointers mismatch you compare the (cached) hashes and only if they match do you need to compare characters. For the prefix query, starts_with(content, 'http'), just store a string of the four-character prefix. With immutable strings the overhead is just one pointer.
f1shy•1mo ago
Do you have a pointer to real-world data about the effectiveness of these optimizations? I learned about it (SSO in the std lib, which is basically the same) in an article which really made it look as if it would make anything in C++ blazing fast. In the codebases I worked on, a couple of times, I did measure (which you should do before optimizing) and the results were between absolutely negligible and worse when active. But those were 3 data points. Mind you, one was in a real-time database.
orphea•1mo ago
Something is very wrong with the site's design. The header's font size is 9.8px, the body is 13px.
tracker1•1mo ago
Yeah, zooming doesn't even work properly and was difficult for me to read.
thaumasiotes•1mo ago
> We would like to have a string that is very cheap to construct and points to a region of memory that is currently valid, but may become invalid later without the string having control over it.

> This is where transient strings come in. They point to data that is currently valid, but may become invalid later, e.g., when we swap out the page on which the payload is stored to disk after we’ve released the lock on the page.

> Creating them has virtually no overhead: They simply point to an externally managed memory location. No memory allocation or data copying is required during construction! When you access a transient string, the string itself won’t know whether the data it points to is still valid, so you as a programmer need to ensure that every transient string you use is actually still valid. So if you need to access it later, you need to copy it to memory that you control.

Hm. What if I don't bother with that and I just read from the transient string? It's probably still good.
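
A small sketch of what can go wrong, using a Python `memoryview` as a stand-in for a transient string and a `bytearray` as a stand-in for an externally managed page or packet buffer. This is only an analogy: Python keeps the read memory-safe, whereas in C or C++ the same pattern could read freed or remapped memory.

```python
packet_buffer = bytearray(b"first-pkt!......")

# "Transient" string: a zero-copy view into memory someone else manages.
transient = memoryview(packet_buffer)[0:10]

# "Persistent" string: a private copy taken while the view is still valid.
persistent = bytes(transient)

# The owner reuses the buffer for the next packet.
packet_buffer[0:10] = b"second-pkt"

# Reading the transient view now silently yields the new contents.
stale = bytes(transient)
```

The view itself gives no hint that its contents changed, which is why the article puts the burden of checking validity on the programmer.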

> In C, strings are just a sequence of bytes with the vague promise that a \0 byte will terminate the string at some point.

> This is a very simple model conceptually, but very cumbersome in practice:

> What if your string is not terminated? If you’re not careful, you can read beyond the intended end of the string, a huge security problem!

This sounds like a problem that transient strings were designed to exemplify. How do they improve on the C model?

-----

I was interested that the short strings use a full 32-bit length field. That's a lot of potential length for a string of at most 12 characters.

If we shaved that down to the four bits necessary to represent a number from 0-12, we'd save 28 bits, which is 3.5 characters. Adding three characters to the content would bring the potential length of a short string up to 15, requiring 0 additional length bits. And we'd have four bits left over.

I assume we aren't worried about this because strings of length 13-15 are already rare and it adds a huge amount of complexity to parsing the string, but it was fun to think about.
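
The arithmetic above can be checked in a few lines; the variable names are just for illustration.

```python
LENGTH_BITS = 32   # current size of the length field
MAX_SHORT = 12     # current maximum length of an inline string

bits_needed = MAX_SHORT.bit_length()            # bits to represent 0..12
bits_saved = LENGTH_BITS - bits_needed          # bits freed by shrinking the field
extra_chars = bits_saved // 8                   # whole bytes gained for content
new_max = MAX_SHORT + extra_chars               # new maximum inline length
bits_for_new_max = new_max.bit_length()         # still fits in the same field?
bits_left_over = bits_saved - extra_chars * 8   # bits remaining after the extra bytes
```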

ReptileMan•1mo ago
Joel had a very nice quote - the whole history of C/C++ is them trying to deal with strings. In a way it is both worrying and encouraging that 50 years in there is still development in the area.