Columnar Storage Is Normalization

https://buttondown.com/jaffray/archive/columnar-storage-is-normalization/

14•ibobev•1h ago

Comments

immanuwell•1h ago

The normalization analogy is genuinely clever as a teaching tool, but it quietly papers over the fact that normalization is a logical design concept while columnar storage is a physical one - treating them as the same thing can mislead more than it clarifies, I think

hilariously•49m ago

Fair, but one of the big benefits of normalization back in the day was the savings in storage and memory, which were tiny by comparison.

There's always a reason for a dev to ship something shitty but when you show you can use 80% less storage for the same operation you can make the accountants your lever.

jklowden•14m ago

Nonsense. See Codd’s first paper.

1NF removes repeating groups, putting for example data for each month in its own row, not an array of 12 months in 1 row.
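A minimal sketch of that 1NF step, assuming a made-up table where each account row carries an array of 12 monthly amounts (the names `wide_rows` and `to_first_normal_form` are hypothetical, not from Codd's paper):

```python
# Hypothetical un-normalized data: one row per account, with a repeating
# group of 12 monthly amounts stored as an array inside the row.
wide_rows = [
    {"account": "A", "monthly": [10, 20, 30, 0, 0, 0, 0, 0, 0, 0, 0, 5]},
    {"account": "B", "monthly": [1] * 12},
]

def to_first_normal_form(rows):
    """Expand each repeating group into one (account, month, amount) row per month."""
    return [
        (row["account"], month, amount)
        for row in rows
        for month, amount in enumerate(row["monthly"], start=1)
    ]

long_rows = to_first_normal_form(wide_rows)
```

Each month is now addressable as an ordinary row, so a query for "December for account A" no longer depends on knowing a position inside an array.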

Storage efficiency was never the point. IMS had that locked down. Succinctness of expression and accuracy of results was the point. And is: normalization prevents anomalous results.

jerf•34m ago

I've always preferred to think of normalization as more about "removing redundancy" than in the frame it is normally presented. Or, to put it another way, rather than "normalizing" which has as a benefit "removing redundancy", raise the removing of redundancy up to the primary goal which has as a side benefit "normalization".

A nice thing about that point of view is that it fits with your point; redundancy is redundancy whether you look at it with a column-based view or a row-based view.

orangepanda•1h ago

Is this meant to be a poor explanation of sixth normal form?

Lucasoato•54m ago

This is an interesting thought, even if it doesn’t come with practical consequences. A person could argue that if you happen to encode your table in a columnar format, you very likely won’t use indexes for every “value” but rather the order itself within that specific block. But this would mean that if you’re using the data order meaningfully, you’d probably be going against the principles of table normalization. But, again, this too can be considered the result of excessive overthinking rather than something practical that can be used.

parpfish•37m ago

I always thought that the biggest benefit of normalization was deduplicating mutable values so you only need to update values in one place and everything stays nicely in sync.

Classic example being something like a “users” table that tracks account id, display name (mutable), and profile picture (mutable). And then a “posts” table that has post id, account id, and message text. This allows you to change the display name/picture in one place and it can be used across all posts
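A rough sketch of that example using Python's built-in `sqlite3` module. The table and column names follow the description above; the sample rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (
        account_id   INTEGER PRIMARY KEY,
        display_name TEXT NOT NULL,   -- mutable
        profile_pic  TEXT             -- mutable
    );
    CREATE TABLE posts (
        post_id    INTEGER PRIMARY KEY,
        account_id INTEGER NOT NULL REFERENCES users(account_id),
        message    TEXT NOT NULL
    );
    INSERT INTO users VALUES (1, 'old_name', 'a.png');
    INSERT INTO posts VALUES (10, 1, 'first'), (11, 1, 'second');
""")

# The display name lives in exactly one row, so one UPDATE is enough.
conn.execute("UPDATE users SET display_name = 'new_name' WHERE account_id = 1")

# Every post picks up the new name through the join.
names = [row[0] for row in conn.execute(
    "SELECT u.display_name FROM posts p "
    "JOIN users u USING (account_id) ORDER BY p.post_id"
)]
```

Had `display_name` been copied into each `posts` row instead, the same rename would need to touch every post, and a missed row would leave the data inconsistent.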

pwndByDeath•15m ago

None-or-many?

juancn•8m ago

It is possible to treat it as purely relational, but data access can be suboptimal if you follow through with it.

The main cost is the join when you need to access several columns. It's flexible but expensive.

To take full advantage of columnar storage, you usually want that join made implicitly through data alignment, so no explicit join is needed.

For example, segment the tables into chunks of up to N records, and keep all the related columns of each chunk contiguous so they can be independently accessed:

    r0, r1 ... rm; f0, f0 ... f0; f1, f1 ... f1; fn, fn ... fn

That balances pointer chasing and joining: you can avoid the IO by loading only the needed columns from the segment, and skip the join because the data is trivially aligned.
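A toy in-memory sketch of that layout, assuming plain Python lists stand in for on-disk column blocks. `CHUNK_SIZE`, `to_segments`, and `scan` are hypothetical names, and no real storage engine's API is implied:

```python
CHUNK_SIZE = 3  # the "N" above; kept tiny so the result is easy to inspect

def to_segments(records, fields, chunk_size=CHUNK_SIZE):
    """Split records into chunks and store each chunk column by column."""
    segments = []
    for start in range(0, len(records), chunk_size):
        chunk = records[start:start + chunk_size]
        segments.append({f: [rec[f] for rec in chunk] for f in fields})
    return segments

def scan(segments, wanted):
    """Yield tuples of the wanted fields, touching only those columns."""
    for seg in segments:
        # Columns within a segment share row positions, so zip() stands in
        # for the join: position i in one column matches position i in another.
        yield from zip(*(seg[f] for f in wanted))

records = [{"id": i, "name": f"n{i}", "score": i * 10} for i in range(5)]
segments = to_segments(records, ["id", "name", "score"])
pairs = list(scan(segments, ["id", "score"]))
```

Here `scan` never reads the `name` column, which is the IO saving, and it recombines `id` with `score` purely by position, which is the implicit join.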

remywang•3m ago

This is exactly domain key normal form!

https://en.wikipedia.org/wiki/Domain-key_normal_form
