
Al Lowe on model trains, funny deaths and working with Disney

https://spillhistorie.no/2026/02/06/interview-with-sierra-veteran-al-lowe/
50•thelok•3h ago•6 comments

Hoot: Scheme on WebAssembly

https://www.spritely.institute/hoot/
114•AlexeyBrin•6h ago•20 comments

Stories from 25 Years of Software Development

https://susam.net/twenty-five-years-of-computing.html
49•vinhnx•4h ago•7 comments

OpenCiv3: Open-source, cross-platform reimagining of Civilization III

https://openciv3.org/
809•klaussilveira•21h ago•246 comments

Reinforcement Learning from Human Feedback

https://rlhfbook.com/
72•onurkanbkrc•6h ago•5 comments

The AI boom is causing shortages everywhere else

https://www.washingtonpost.com/technology/2026/02/07/ai-spending-economy-shortages/
89•1vuio0pswjnm7•7h ago•101 comments

The Waymo World Model

https://waymo.com/blog/2026/02/the-waymo-world-model-a-new-frontier-for-autonomous-driving-simula...
1053•xnx•1d ago•599 comments

Start all of your commands with a comma (2009)

https://rhodesmill.org/brandon/2009/commands-with-comma/
470•theblazehen•2d ago•173 comments

Selection Rather Than Prediction

https://voratiq.com/blog/selection-rather-than-prediction/
8•languid-photic•3d ago•1 comment

Vocal Guide – belt sing without killing yourself

https://jesperordrup.github.io/vocal-guide/
196•jesperordrup•11h ago•67 comments

Speed up responses with fast mode

https://code.claude.com/docs/en/fast-mode
8•surprisetalk•59m ago•2 comments

France's homegrown open source online office suite

https://github.com/suitenumerique
535•nar001•5h ago•248 comments

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

https://www.forbes.com/sites/mikestunson/2026/02/05/us-jobs-disappear-at-fastest-january-pace-sin...
42•alephnerd•1h ago•14 comments

Coding agents have replaced every framework I used

https://blog.alaindichiappari.dev/p/software-engineering-is-back
204•alainrk•6h ago•309 comments

A Fresh Look at IBM 3270 Information Display System

https://www.rs-online.com/designspark/a-fresh-look-at-ibm-3270-information-display-system
33•rbanffy•4d ago•5 comments

72M Points of Interest

https://tech.marksblogg.com/overture-places-pois.html
25•marklit•5d ago•1 comment

Software factories and the agentic moment

https://factory.strongdm.ai/
63•mellosouls•4h ago•67 comments

Unseen Footage of Atari Battlezone Arcade Cabinet Production

https://arcadeblogger.com/2026/02/02/unseen-footage-of-atari-battlezone-cabinet-production/
110•videotopia•4d ago•30 comments

Where did all the starships go?

https://www.datawrapper.de/blog/science-fiction-decline
67•speckx•4d ago•70 comments

Show HN: Kappal – CLI to Run Docker Compose YML on Kubernetes for Local Dev

https://github.com/sandys/kappal
21•sandGorgon•2d ago•11 comments

Show HN: Look Ma, No Linux: Shell, App Installer, Vi, Cc on ESP32-S3 / BreezyBox

https://github.com/valdanylchuk/breezydemo
271•isitcontent•21h ago•36 comments

Learning from context is harder than we thought

https://hy.tencent.com/research/100025?langVersion=en
199•limoce•4d ago•110 comments

Monty: A minimal, secure Python interpreter written in Rust for use by AI

https://github.com/pydantic/monty
284•dmpetrov•21h ago•151 comments

Making geo joins faster with H3 indexes

https://floedb.ai/blog/how-we-made-geo-joins-400-faster-with-h3-indexes
155•matheusalmeida•2d ago•48 comments

Hackers (1995) Animated Experience

https://hackers-1995.vercel.app/
553•todsacerdoti•1d ago•267 comments

Sheldon Brown's Bicycle Technical Info

https://www.sheldonbrown.com/
424•ostacke•1d ago•110 comments

Ga68, a GNU Algol 68 Compiler

https://fosdem.org/2026/schedule/event/PEXRTN-ga68-intro/
41•matt_d•4d ago•16 comments

Show HN: If you lose your memory, how to regain access to your computer?

https://eljojo.github.io/rememory/
348•eljojo•1d ago•214 comments

An Update on Heroku

https://www.heroku.com/blog/an-update-on-heroku/
466•lstoll•1d ago•308 comments

Show HN: I spent 4 years building a UI design tool with only the features I use

https://vecti.com
367•vecti•23h ago•167 comments

That 16B password story (a.k.a. "data troll")

https://www.troyhunt.com/that-16-billion-password-story-aka-data-troll/
112•el_duderino•5mo ago

Comments

charcircuit•5mo ago
If there were an open database of password breaches, it would be easier for people to research whether a leak is new or just passwords taken from a previous leak. Of course you can get closer to the actual number by filtering out duplicates, but you can't figure out what's new if you can't know what's old.
mananaysiempre•5mo ago
Pwned Passwords[1] is just such a database (with passwords hashed using either SHA-1 or NTLM as an obfuscation measure, and without any emails). Hunt used to distribute versioned snapshots, but these days he directs you to an API scraper[2] in C# instead, so you can still get a list but it probably won’t exactly match anyone else’s.

[1] https://haveibeenpwned.com/passwords

[2] https://github.com/HaveIBeenPwned/PwnedPasswordsDownloader
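
For anyone who hasn't used it: the range API works on a k-anonymity model, so only the first five hex characters of a password's SHA-1 hash are ever sent, and the suffix is matched locally. A minimal Python sketch against the documented endpoint:

    import hashlib
    import urllib.request

    def pwned_count(password: str) -> int:
        # k-anonymity: only the first 5 hex chars of the SHA-1 leave this machine.
        sha1 = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
        prefix, suffix = sha1[:5], sha1[5:]
        with urllib.request.urlopen("https://api.pwnedpasswords.com/range/" + prefix) as resp:
            body = resp.read().decode("utf-8")
        # Each response line is "<hash suffix>:<breach count>".
        for line in body.splitlines():
            candidate, _, count = line.partition(":")
            if candidate == suffix:
                return int(count)
        return 0

    print(pwned_count("password123"))  # non-zero count means it appears in the corpus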

charcircuit•5mo ago
This isn't sufficient for all cases. For example, a breach could contain hashed passwords. If you only have the obfuscated passwords from previous breaches, you can't hash them yourself to check whether the new breach is just a rehash of an existing one.

Data breaches can also contain things other than passwords: phone numbers, addresses, etc., which would also be useful for checking.

anon7000•5mo ago
Publishing someone’s leaked credentials in plaintext for anyone to look at also isn’t ideal. I mean, yes, it’s been leaked, but we also don’t need to make it easier for someone to get hacked.
charcircuit•5mo ago
Pretending it's private is also problematic. People get a false impression of what is public and what isn't.
nojs•5mo ago
In other words, 2.7B -> 109M is a 96% reduction from headline to people. Could we apply the same maths to the 16B headline?

I mean, there aren't 16B people in the world, so one row per person can be ruled out pretty easily
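
A quick back-of-the-envelope, purely illustrative and assuming (a big if) that the same survival rate carries over:

    # Illustrative only: reuse the 2.7B rows -> 109M people ratio from the article.
    survival = 109e6 / 2.7e9                       # ~0.04, i.e. the "96% reduction"
    print(f"{16e9 * survival / 1e6:.0f}M people")  # ~646M, if the ratio held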

NitpickLawyer•5mo ago
> I mean, there aren't 16B people in the world, so one row per person can be ruled out pretty easily

In a hypothetical "master dump", a mix of all the dumps ever leaked, you'd expect dozens if not more entries for every "real person" out there. Think about how many people had a Yahoo account, then how many had several Yahoo accounts, and then multiply that by the hundreds of leaks out there. I can see the number getting into the billions easily, just because of how many accounts people have created on platforms that got hacked over the past ~20 years.

Sure, 99% of those won't be active accounts anymore, but the passwords used serve as a signal, at least for "what kinds of passwords do people use". There's lots to be learned about wordwordnumber wordnumbernumber, and so on.

genewitch•5mo ago
> There's lots to be learned about wordwordnumber wordnumbernumber, and so on.

I had a plan to do statistical studies of some password dumps to try to make a "compressed password list" that could generate password guesses on the fly. I forgot why I didn't do it, but I'm sure it's because the "model" (the statistical dataset from which the program would generate output) wouldn't really be that much smaller; at least not with my poor maths skills.

I'm assuming that someone who really knew what they were doing could recover 15-20% of the full password list. I doubt I could do better than just compressing the dataset and extracting it on the fly.

NitpickLawyer•5mo ago
> I doubt I could do better than just compressing the dataset and extracting it on the fly.

The meta in that field is to extract "rules" (e.g. for hashcat) from datasets. Then you run the rules over the hashed dumps. Rules can be word^number^number, word^number^word^number, or letter^upper^number^lower... etc. Then you build a dictionary, and dict + rules = passwords.

Pretty sure you can extract some nice signals nowadays with embeddings and what not.
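
As a toy illustration of the pattern-extraction idea (these are hashcat-style masks rather than its actual rule syntax, and the corpus is made up):

    from collections import Counter

    def mask(password: str) -> str:
        # Map each character to a hashcat-style class: ?l ?u ?d ?s.
        classes = []
        for ch in password:
            if ch.islower():
                classes.append("?l")
            elif ch.isupper():
                classes.append("?u")
            elif ch.isdigit():
                classes.append("?d")
            else:
                classes.append("?s")
        return "".join(classes)

    # Hypothetical corpus; in practice this would be a cracked-password dump.
    corpus = ["Summer2019", "Winter2020", "letmein1", "Password1", "qwerty12"]
    print(Counter(mask(p) for p in corpus).most_common(2))
    # Top mask '?u?l?l?l?l?l?d?d?d?d' captures the "Wordnumber" shape mentioned above.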

miki123211•5mo ago
I always find it funny how the media characterizes a data breach in terms of the number of records stolen or, even worse, its size on disk.

There are ~335 million Americans. Assume for simplicity that each of them owns one phone, and hence one SIM card. Generously assume that each SIM card has 1 KB of authentication material. A data breach of all US consumer SIM keys would hence be ~335 million records and ~335 GB.

Such a breach would be far, far more catastrophic than anything we have ever seen (and probably anything we will ever see) in computer security, despite being half the size of this one, and containing less than 10% as many records.

SG-•5mo ago
I'm glad someone actually looked at the data and made a real news story about this.
graynk•5mo ago
I am very confused.

> Everything (and I mean it) from that news report went through yours truly.

> Bob is a quality researcher

> The headlines implying this was a massive breach are misleading

But the headlines implying it are literally in the Cybernews article, which is the source of it all. Why does the article talk about "the mass media" throughout, if the original source itself was misleading?