https://www.rfc-editor.org/rfc/rfc9562.html#monotonicity_cou...
The entire universe. Else it's not universally unique.
If your system does need to worry about UUIDv7s generated by the rest of the universe, you likely also need to worry about maliciously created IDs, software bugs, clocks that reset to the unix epoch, etc. I worry about those more than a bona fide collision.
Joke aside, all of this is theoretical. In practical applications it's essentially impossible to hit, so it doesn't matter whether a collision is possible, since you're not at Google scale anyway.
Getting UUID 'A' from app 'X' is easily distinguishable from UUID 'A' from app 'Y'.
Because you're just overreaching at this point. If you can develop a better one, be my guest.
If YouTube wanted to give every incoming pixel its own UUIDv7, they'd see a collision rate just under 0.6%.
> Assuming 4K@60fps [...] they'd see a collision rate just under 0.6%
This doesn't detract from your point that collisions like that become viable at that scale, but assuming an average of 4K@60fps is assuming a lot. The average video upload there is probably south of 1080p@30fps. This plays nicely with the birthday paradox.
> The odds are 1 in 2^122 — that’s approximately 1 in 5,000,000,000,000,000,000,000,000,000,000,000,000.
This is true if you only generate two GUIDs, but if you generate very many GUIDs, the chance of generating two identical ones between any of them increases. E.g. if you generate 2^61 GUIDs, you have about a 1 in 2 chance of a collision, due to the birthday paradox.
2^61 is still a very large number of course, but much more feasible to reach than 2^122 when doing a collision attack. This is the reason that cryptographic hashes are typically 256 bits or more (to make the cost of collision attacks >= 2^128).
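A quick way to sanity-check those numbers with the standard birthday approximation (assuming 122 random bits for UUIDv4; the 2^61 case comes out to roughly a 39% chance, i.e. about a coin flip):

    import math

    # Probability of at least one collision among n values drawn uniformly
    # from a space of 2**bits, via the birthday approximation:
    #   p ~= 1 - exp(-n*(n-1) / (2*N))
    def collision_probability(n: int, bits: int) -> float:
        N = 2 ** bits
        return -math.expm1(-n * (n - 1) / (2 * N))

    print(collision_probability(2, 122))        # two UUIDs: ~1.9e-37
    print(collision_probability(2 ** 61, 122))  # 2^61 UUIDs: ~0.39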
It's trivial to force a collision. Here's the same UUID twice:
6e197264-d14b-44df-af98-39aac5681791
6e197264-d14b-44df-af98-39aac5681791
Typically, you don't care about UUIDs that aren't in your system, and you generate your own to avoid maliciously generated collisions. Your system can't handle 2^61 IDs anyway. It doesn't have the processing power, storage, or bandwidth for that to happen. Not to mention traditional rate limiting.
>2^61 is still a very large number of course, but much more feasible to reach than 2^122 when doing a collision attack. This is the reason that cryptographic hashes are typically 256 bits or more (to make the cost of collision attacks >= 2^128).
For cryptographic applications it really is small; the previous poster is correct that 2^64 is very small for that purpose - a small supercomputing cluster or two could brute-force such a cipher in a reasonable amount of time, which is why symmetric keys are 256 bits and up, to guarantee there's no way to attack them.
The only way it would apply to symmetric keys is if you have a server that stores 2^64 encrypted messages, and can somehow find out which messages used the same symmetric key (normally not possible unless they also have the same IV and plaintext), and can somehow coerce the user who uploaded message #1 to decrypt message #2 for you (or vice versa). Obviously that isn't realistic.
So assuming you use 64-bit counters, you can divide those 12 years by 1024 to get 4 days.
And that's not even considering what you could do on a GPU.
Edit: I might be off by a factor of 2, not sure if the SIMD throughput is per-core or per-thread. Also thermal throttling. Same ballpark though!
Then your 32 HT threads aren't really going to give you full access to the underlying SIMD registers, which are per core - which is where I assume you realized the 2x difference might show up?
And to do += 1 multithreaded you have to partition the range or you won't get any speedup - if you don't amortize the cost of atomic synchronization across threads, you're going to be slower than a non-SIMD increment.
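The partitioning itself is simple enough. A sketch of the idea (Python just to show the disjoint ranges - the worker count is arbitrary, and any real speedup would come from a compiled language with SIMD, not from this):

    # Split [0, total) into disjoint, contiguous sub-ranges so each worker
    # counts through its own chunk with no shared counter and no atomics.
    def partition(total: int, workers: int):
        chunk = total // workers
        for i in range(workers):
            start = i * chunk
            end = total if i == workers - 1 else start + chunk
            yield start, end

    ranges = list(partition(2 ** 64, 32))
    print(ranges[0])   # (0, 576460752303423488)
    print(ranges[-1])  # (17870283321406128128, 18446744073709551616)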
To actually find a collision in a 128-bit cryptographic hash function would take closer to 2^65 hashes. Back-of-the-envelope calculations suggest that with Pollard's rho it would cost a few million dollars of CPU time at Hetzner's super-low prices. Not a mere mortal's budget, but not that far off, I guess.
In any case, in 2023 I back-of-the-envelope estimated that you could compute 2^64 SHA256 for ~$100K, using rented GPU capacity https://www.da.vidbuchanan.co.uk/blog/colliding-secure-hashe...
So if you need 1000 random numbers, generate from 1 to 1 million.
The square root approximation works well for large numbers, but leaves out some factors that are relevant for small numbers.
If you don't check for clashes, the 50% chance of failure is too much. Probably even 0.1% is too much, so you'd need a more elaborate approach.
If you do check for clashes, you can generate from 1 to 2000 with little overhead.
Then you can choose how many collisions to accept on average. (If the answer is zero, then it makes more sense to look at the probability of one or more collisions.)
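Putting rough numbers on the 1-to-1-million vs. 1-to-2000 cases (a sketch; "overhead" here is just the expected number of extra draws when you do check for clashes):

    import math

    # Probability that k uniform draws from 1..n contain at least one repeat.
    def p_any_collision(k: int, n: int) -> float:
        return -math.expm1(-k * (k - 1) / (2 * n))

    # Expected number of draws to collect k distinct values from 1..n when
    # you check for clashes and redraw on a repeat.
    def expected_draws(k: int, n: int) -> float:
        return sum(n / (n - i) for i in range(k))

    print(p_any_collision(1000, 1_000_000))  # ~0.39: too risky if you never check
    print(expected_draws(1000, 2000))        # ~1386 draws to get 1000 distinct values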
One of them was genuine - generated by different systems, and it was caught when loading data from one into the other: the object had the same ID but a different underlying type.
The other one was due to 'error' - two systems (by different companies, supporting the same data exchange standard) used a magic hardcoded GUID that turned out to be the same.
Both of those systems have a full audit trail - each change created a new row in the database, and IDs were formatted as {NAMESPACE}.{GUID}.{TIMESTAMP}. Mutation of an object created a new entry with a different {TIMESTAMP} part. Namespaces are mandated by the standard, so different systems can have the same namespace value.
Anyhow, that was my first thought when you mentioned 2^61 GUIDs: where are you even going to put them? Second thought: I don't think enumerating 2^61 GUIDs is trivial; in fact, I suspect it would take longer than anyone would be willing to spend. And if you are not storing them, why are you generating them?
And what even is a GUID collision attack? It's not like they are a hash, and since they tend to be public identifiers, it turns out that despite their stated purpose of preventing collisions, you can't really use GUIDs generated by others (if they wanted collisions they would straight up just copy yours), so you end up regenerating them anyway.
At a rate of comparing 400,000 guids per second, you have a 99% chance of seeing a collision within the next 553,750 years.
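That lines up with inverting the birthday bound (assuming UUIDv4's 122 random bits and a steady 400,000 IDs per second):

    import math

    N = 2 ** 122                # random space of a UUIDv4
    RATE = 400_000              # IDs per second
    TARGET = 0.99               # desired collision probability

    # Invert p = 1 - exp(-n^2 / (2N)) to get the number of IDs needed:
    n = math.sqrt(2 * N * math.log(1 / (1 - TARGET)))
    years = n / RATE / (60 * 60 * 24 * 365.25)
    print(f"{years:,.0f} years")  # ~554,000 years, same ballpark as the figure above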
Microsoft’s GUID standard is garbage.
https://learn.microsoft.com/en-us/dotnet/api/system.guid?vie...
I don't think there's a "Microsoft standard"; they just use different versions of UUID in different products over time. No idea why they call it GUID instead of UUID, but it's easier to say out loud, so I'm not against it.
v7 has a timestamp indeed, but isn't the time making it more collision resistant? You'd have to generate tons of UUIDv7s in the same millisecond, while v4 is more likely to collide due to not being time-constrained and the birthday paradox.
I think both have their uses though. You might need pure random if you want your UUID not to convey any time information and you're not generating tons of them (e.g. a random user id).
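The bit budget is the crux: v4 has 122 random bits, while v7 spends 48 on a millisecond timestamp and keeps 74 random (per RFC 9562, ignoring the optional counter variants). So within a single millisecond v7 has less random room, but across milliseconds the timestamp keeps values apart. A hand-rolled sketch of the v7 layout, just to show where the bits go (recent Python versions add uuid.uuid7() so you wouldn't do this by hand):

    import os
    import time
    import uuid

    def uuidv7() -> uuid.UUID:
        # 48-bit Unix-ms timestamp, 4-bit version, 12 + 62 random bits, 2-bit variant.
        ms = time.time_ns() // 1_000_000
        rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits, 74 are used
        value = (ms & ((1 << 48) - 1)) << 80          # unix_ts_ms in the top 48 bits
        value |= 0x7 << 76                            # version = 7
        value |= (rand >> 68) << 64                   # rand_a: 12 random bits
        value |= 0b10 << 62                           # variant = RFC 9562
        value |= rand & ((1 << 62) - 1)               # rand_b: 62 random bits
        return uuid.UUID(int=value)

    print(uuidv7())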
What do you mean "model"? Are you referring to UUIDv1 which has time and MAC address?
That seems to depend a whole lot on the pattern your application generates UUIDs in. If you're generating a consistent distribution over time, sure. If you generate a whole lot in bursts, collision seems to be way more likely.
(Not disagreeing with you, just adding perspective.)
(Agreeing with both parents)
After thinking about it more, I have the feeling (against my initial intuition) that v4 might dominate either way unless you consistently generate tons of UUIDs for an impractical number of years.
[0] https://kagi.com/assistant/dd7d8c48-44e4-499b-9f2f-33663d125...
If you were doing RPC in OSF DCE your IDs were UUIDs, and if you were doing COM in Wintel your IDs were GUIDs; and that was basically the difference, a different name for the same thing when used in a different domain.
Plus the difference in endianness, because one was a network-byte-order thing and the other was an Intel-architecture byte-order thing, and only some parts of these IDs were technically multi-byte integers that had a byte order at all.
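You can still see that split in Python's uuid module, which exposes both layouts - bytes is the RFC/network order, bytes_le is the little-endian layout Win32 GUID structs use for the first three fields (using the UUID from upthread):

    import uuid

    u = uuid.UUID("6e197264-d14b-44df-af98-39aac5681791")

    print(u.bytes.hex())     # 6e197264d14b44dfaf9839aac5681791  (big-endian wire order)
    print(u.bytes_le.hex())  # 6472196e4bd1df44af9839aac5681791  (first three fields swapped)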
But by the late 1990s this had already become lost to history, with a sea of people who had made all sorts of inferences and promoted them as gospel truth: from the fact that Microsoft had two programs named GUIDGEN.EXE and UUIDGEN.EXE; from the fact that many generators sprang up and the whole idea spread to Java and databases and this new-fangled WorldWeb thing and all sorts of stuff; from the fact that multiple different versions of these IDs appeared, and what version an ID was depended on tooling and libraries; and from the fact that at the time Microsoft was less likely to go through formal standards processes and more likely to just write and ship things and sponsor a book and a CD-ROM of doco. So if your world was RFCs and the IETF you had one worldview, and if your world was Microsoft Press and the MSDN you had another.
You probably actually want at least a few prefix bytes to be a timestamp (UUIDv7) for B-tree efficiency, but YMMV.
http://mattmitchell.com.au/birthday-problems-friendly-identi...
However, the overall takeaway should be, as always: don't use MongoDB. Period. Every time I learn something new about it I'm baffled about why people continue to use it.
epoch time + MAC Address + transaction counter (catch NTP skew) + Thread PID + new Pointer address = GUID
Then increment the global transaction counter, complete some ops, and check that the current epoch time is in the future before the transaction frees the memory locations.
This is often robust in highly concurrent distributed systems even under network degradation, or corrupted sync states. Has other interesting use-cases too. =3
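If I'm reading that recipe right, a rough sketch might look like this (the field choices, ordering, and hashing the whole thing into a UUID-shaped string are my own guesses for illustration, not a spec):

    import hashlib
    import itertools
    import os
    import threading
    import time
    import uuid

    _counter = itertools.count()  # process-wide transaction counter

    def make_id() -> str:
        parts = (
            time.time_ns(),         # epoch time
            uuid.getnode(),         # MAC address (or a random stand-in)
            next(_counter),         # transaction counter (guards against NTP skew)
            os.getpid(),            # process id
            threading.get_ident(),  # thread id
            id(object()),           # address of a freshly allocated object ("new pointer")
        )
        digest = hashlib.sha256(repr(parts).encode()).digest()
        return str(uuid.UUID(bytes=digest[:16]))

    print(make_id())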
If you want to see how close to a non-ordinal 123456 a random generator can get, you also need to look for stuff like 923456 or 123956, etc.
Also, would 223456 be considered a closer match compared to 323456? (It shouldn't in my opinion because, again, these are non-ordinal strings).