5NF and Database Design

https://kb.databasedesignbook.com/posts/5nf/

84•petalmind•3h ago

Comments

tadfisher•3h ago

I love reading about the normal forms, because it makes me sound like I know what I'm talking about in the conversation where the backend folks tell me, "if we normalized that data then the database would go down". This is usually followed by arguments over UUID versions for some reason.

necovek•3h ago

So which normal form do they argue for and against? And what UUID version wins the argument?

Tostino•3h ago

Not OP, but UUID v7 is what you want for most database workloads (other than something like Spanner)

tossandthrow•54m ago

I use the null uuid as primary key - never had any DB scaling issues.

petalmind•45m ago

Yeah, no NULL is ever equal to any other NULL, so they are basically unique.

Groxx•28m ago

You are also guaranteed to be able to retrieve your data, just query for '... is null'. No complicated logic needed!

RedShift1•21m ago

Me still using bigints... Which haven't given me any problems. Wouldn't use it for client generated IDs but that is not what most applications require anyway.

tadfisher•2h ago

Explaining jokes is poor form.

culi•2h ago

On the internet it is normal.

necovek•42m ago

This was an attempt to extend jokes and not ask for explanation: there are a number of normal forms, and people usually talk about "normalization" without being specific thus conflating all of them; out of 7 UUID versions, only 2 generally make sense for use today depending on whether you need time-incrementing version or not.

DeathArrow•3h ago

There are use cases where is better to not normalize the data.

andrew_lettuce•2h ago

Typically it's better to take normalized data and denormalize for your use case vs. not normalize in the first place. Really depends on your needs

jghn•2h ago

Over time I’ve developed a philosophy of starting roughly around 3NF and adjusting as the project evolves. Usually this means some parts of the db get demoralize and some get further normalized

skeeter2020•1h ago

>> Usually this means some parts of the db get demoralize

I largely agree with your practical approach, but try and keep the data excited about the process, sell the "new use cases for the same data!" angle :)

petalmind•2h ago

One day I hope to write about denormalization, explained explicitly via JOINs.

andrii•27m ago

Please do, you content is great!

abirch•2h ago

I'm a fan of the sushi principle: raw data is better than cooked data.

Each process should take data from a golden source and not a pre-aggregated or overly normalized non-authorative source.

layer8•1h ago

Sometimes the role of your system is to be the authoritative source of data that it has aggregated, validated, and canonicalized.

abirch•1h ago

This is great. Then I would consider the aggreated, validated, and canonicalized source as a Golden Source. Where I've seen issues is that someone starts to query from a nonauthoritative source because they know about it, instead of going upstream to a proper source.

bob1029•1h ago

JSON is extremely fast these days. Gzipped JSON perhaps even more so.

I find that JSON blobs up to about 1 megabyte are very reasonable in most scenarios. You are looking at maybe a millisecond of latency overhead in exchange for much denser I/O for complex objects. If the system is very write-intensive, I would cap the blobs around 10-100kb.

Quarrelsome•21m ago

I adore contiguous reads that ideas like that yield. I'd rather push that out to a read-only end point, then getting sucked into the entropy of treating what is effectively an unschema-ed blob into editable data.

estetlinus•2h ago

The lost art of normalizing databases. ”Why is the ARR so high on client X? Oh, we’re counting it 11 times lol”.

I would maybe throw in date as an key too. Bad idea?

petalmind•2h ago

Frankly I don't think that overcounting is solved by normalizing, because it's easy to write an overcounting SQL query over perfectly normalized data.

I tried to explain the real cause of overcounting in my "Modern Guide to SQL JOINs":

https://kb.databasedesignbook.com/posts/sql-joins/#understan...

jerf•2h ago

In a roundabout way this article captures well why I don't really like thinking in terms of "normal forms", especially as a numbered list like that. The key insights are really 1. Avoid redundancy and 2. This may involve synthesizing relationships that don't immediately obviously exist from a human perspective. Both of those can be expanded on at quite some length, but I never found much value in the supposedly-blessed intermediate points represented by the nominally numbered "forms". I don't find them useful either for thinking about the problem or for communicating about it.

Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM) doesn't mean it is actually useful... sometimes it just means that it made it easy to write multiple choice test questions. (e.g., "What does Layer 2 of the OSI network model represent? A: ... B: ... C: ... D: ..." to which the most appropriate real-world answer is "Who cares?")

petalmind•2h ago

> Someone, somewhere writing down a list and that list being blessed with the imprimatur of Academic Approval (TM)

One problem is that normal forms are underspecified even by the academy.

E.g., Millist W. Vincent "A corrected 5NF definition for relational database design" (1997) (!) shows that the traditional definition of 5NF was deficient. 5NF was introduced in 1979 (I was one year old then).

2NF and 3NF should basically be merged into BCNF, if I understand correctly, and treated like a general case (as per Darwen).

Also, the numeric sequence is not very useful because there are at least four non-numeric forms (https://andreipall.github.io/sql/database-normalization/).

Also, personally I think that 6NF should be foundational, but that's a separate matter.

jerf•1h ago

"1979 (I was one year old then)."

Well, we are roughly the same age then. Our is a cynical generation.

"One problem is that normal forms are underspecified even by the academy."

The cynic in me would say they were doing their job by the example I gave, which is just to provide easy test answers, after which there wasn't much reason to iterate on them. I imagine waiving around normalization forms was a good gig for consultants in the 1980 but I bet even then the real practitioners had a skeptical, arm's length relationship with them.

wolttam•1h ago

Why shouldn’t we care about layer 2? You can do really fun and interesting things at the MAC layer.

jerf•50m ago

You can do what you do at the MAC layer without any regard for whether or not it is "OSI layer 2", or whether your MAC layer "cheats" and has features that extend into layers 1, or 3, or any other layer. Failing to implement something useful because "that's not what OSI layer 2 is and this is data layer 2 and the OSI model says not to do that" is silly.

To stay on the main topic, same for the "normalization forms". Do what your database needs.

The concepts are just attractive nuisances. They are more likely to hurt someone than to help them.

carlyai•1h ago

love this

iFire•1h ago

https://en.wikipedia.org/wiki/Essential_tuple_normal_form is cool!

Since I had bad memory, I asked the ai to make me a mnemonic:

* Every

* Table

* Needs

* Full-keys (in its joins)

minkeymaniac•32m ago

Normalize till it hurts, then denormalize till it works!

Quarrelsome•23m ago

what a marvelous motto <3.

Certainly a lot more concise than the article or the works the article references.

petalmind•4m ago

Imperative mood "normalize" assumes that you had something not-normalized before you received that instruction. It's not useful when your table design strategy is already normalization-preservation, such as the most basic textbook strategy (a table per anchor, a column per attribute or 1:N link, a 2-column table per M:N link).

And this is basically the main point of my critique of 4NF and 5NF. They both traditionally present an unexplained table that is supposed to be normalized. But it's not clear where does this original structure come from. Why are its own authors not aware about the (arguably, quite simple) concept of normalization?

It's like saying that to in order to implement an algorithm you have to remove bugs from its original implementation — where does this implementation come from?

The other side of this coin is that lots of real-world design have a lot of denormalized representations that are often reasonably-well engineered.

Because of that if you, as a novice, look at a typical production schema, and you have this "thou shalt normalize" instruction, you'll be confused.

This is my big teaching pet peeve.

Quarrelsome•29m ago

Especially loved the article linked that was dissing down formal definitions of 4NF.

I wrote to Flock's privacy contact to opt out of their domestic spying program

YouTube now world's largest media company, topping Disney

Rare concert recordings are landing on the Internet Archive

Spain to expand internet blocks to tennis, golf, movies broadcasting times

Claude Code Routines

5NF and Database Design

California ghost-gun bill wants 3D printers to play cop, EFF says

Turn your best AI prompts into one-click tools in Chrome

Let's Talk Space Toilets

Modifying FileZilla to Workaround Bambu 3D Printer's FTP Issue

OpenSSL 4.0.0

guide.world: A compendium of travel guides

Show HN: LangAlpha – what if Claude Code was built for Wall Street?

Show HN: Plain – The full-stack Python framework designed for humans and agents

Backblaze has stopped backing up OneDrive and Dropbox folders and maybe others

Gas Town: From Clown Show to v1.0

ClawRun – Deploy and manage AI agents in seconds

jj – the CLI for Jujutsu

The Mouse Programming Language on CP/M

Carol's Causal Conundrum: a zine intro to causally ordered message delivery

Introspective Diffusion Language Models

Show HN: A memory database that forgets, consolidates, and detects contradiction

Show HN: Kontext CLI – Credential broker for AI coding agents in Go

Nucleus Nouns

Show HN: Kelet – Root Cause Analysis agent for your LLM apps

DaVinci Resolve – Photo

The acyclic e-graph: Cranelift's mid-end optimizer

A new spam policy for “back button hijacking”

The M×N problem of tool calling and open-source models

Lean proved this program correct; then I found a bug

I wrote to Flock's privacy contact to opt out of their domestic spying program

YouTube now world's largest media company, topping Disney

Rare concert recordings are landing on the Internet Archive

Spain to expand internet blocks to tennis, golf, movies broadcasting times

Claude Code Routines

5NF and Database Design

California ghost-gun bill wants 3D printers to play cop, EFF says

Turn your best AI prompts into one-click tools in Chrome

Let's Talk Space Toilets

Modifying FileZilla to Workaround Bambu 3D Printer's FTP Issue

OpenSSL 4.0.0

guide.world: A compendium of travel guides

Show HN: LangAlpha – what if Claude Code was built for Wall Street?

Show HN: Plain – The full-stack Python framework designed for humans and agents

Backblaze has stopped backing up OneDrive and Dropbox folders and maybe others

Gas Town: From Clown Show to v1.0

ClawRun – Deploy and manage AI agents in seconds

jj – the CLI for Jujutsu

The Mouse Programming Language on CP/M

Carol's Causal Conundrum: a zine intro to causally ordered message delivery

Introspective Diffusion Language Models

Show HN: A memory database that forgets, consolidates, and detects contradiction

Show HN: Kontext CLI – Credential broker for AI coding agents in Go

Nucleus Nouns

Show HN: Kelet – Root Cause Analysis agent for your LLM apps

DaVinci Resolve – Photo

The acyclic e-graph: Cranelift's mid-end optimizer

A new spam policy for “back button hijacking”

The M×N problem of tool calling and open-source models

Lean proved this program correct; then I found a bug

5NF and Database Design

Comments