The author also mentions B2B and I'm not sure how that's going to work. I understand B2C, where you can just say "1 user = 1 single-threaded shard" because most user data is isolated/independent from other users' data. But with B2B, we have accounts ranging from 100 users per organization to 200k users per organization. Something tells me making a 200k-user account single-threaded isn't a good idea. On the other hand, artificially sharding inside an organization leads to much more complex queries overall, because a lot of business rules require joining different users' data within one org.
Away from thinking about:
- transaction serializability
- atomicity
- deadlocks (generally locks)
- OCC (unless you do VERY long tx, like a user checkout flow)
- retries
- scale, infrastructure, parameter tuning

towards thinking about:
- separating data into shards
- sharding keys
- cross-shard transactions

which can be sometimes easier, sometimes harder. I think there are a surprising number of problems where it's much easier to think about sharding than about race conditions!
> But with B2B, we have accounts ranging from 100 users per organization to 200k users per organization.

You'd be surprised at how much traffic a single core (or machine) can handle — 200k users is absolutely within reach. At some point you'll need even more granular sharding (e.g. per user within an organization), but at that point you would need sharding anyway, no matter your DB.
Actually, I'd argue a lot of apps can do entirely without cross-shard transactions! (eg. sharding by B2B orgs)
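Roughly the shape I mean, as a minimal sketch (the four-shard layout, table names, and the `add_seat` helper are all invented for illustration; SQLite files just stand in for separate single-threaded database instances):

```python
import sqlite3
import zlib

# Hypothetical setup: four "shards", here just separate SQLite files standing
# in for separate single-threaded database instances.
SHARDS = [sqlite3.connect(f"shard_{i}.db") for i in range(4)]
for db in SHARDS:
    db.execute("CREATE TABLE IF NOT EXISTS orgs (org_id TEXT PRIMARY KEY, seats_used INTEGER)")
    db.execute("CREATE TABLE IF NOT EXISTS users (org_id TEXT, user_id TEXT)")

def shard_for_org(org_id: str) -> sqlite3.Connection:
    # The sharding key is the org id, so every row belonging to an org
    # (users, invoices, ...) lives on the same shard.
    return SHARDS[zlib.crc32(org_id.encode()) % len(SHARDS)]

def add_seat(org_id: str, user_id: str) -> None:
    db = shard_for_org(org_id)
    with db:  # an ordinary transaction that touches exactly one shard
        db.execute("UPDATE orgs SET seats_used = seats_used + 1 WHERE org_id = ?", (org_id,))
        db.execute("INSERT INTO users (org_id, user_id) VALUES (?, ?)", (org_id, user_id))
```

As long as the business rule ("does this org have seats left?") only needs rows keyed by the same org, the transaction never leaves its shard.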
But looking at it a different way, say you're building something like Google Sheets.
One could place user-mgmt in one single-threaded database (even at 200k users you probably don't have too many concurrently modifying administrators) whilst each "document" gets its own database. I'm prototyping one such document-centric tool, and the per-document-DB thinking has come up (rough sketch of the split below); debugging a user's problems could be as "simple" as cloning a SQLite file.
Now on the other hand, if it's some ERP/CRM/etc. system with tons of linked data, that naturally won't fly.
Tool for the job.
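A rough sketch of that split, assuming one SQLite file per document plus a central user-management DB (all paths, schema, and helper names are invented for illustration):

```python
import sqlite3
from pathlib import Path

# One central DB for accounts/permissions; writes here are rare compared to document edits.
users_db = sqlite3.connect("users.db")
users_db.execute("CREATE TABLE IF NOT EXISTS users (user_id TEXT PRIMARY KEY, org_id TEXT)")

DOC_DIR = Path("documents")
DOC_DIR.mkdir(exist_ok=True)

def open_document(doc_id: str) -> sqlite3.Connection:
    # Each document is its own SQLite file, i.e. its own tiny "shard".
    db = sqlite3.connect(DOC_DIR / f"{doc_id}.sqlite")
    db.execute(
        "CREATE TABLE IF NOT EXISTS cells "
        "(row_idx INTEGER, col_idx INTEGER, value TEXT, PRIMARY KEY (row_idx, col_idx))"
    )
    return db

def set_cell(doc_id: str, row_idx: int, col_idx: int, value: str) -> None:
    db = open_document(doc_id)
    with db:  # every edit to a document is a plain single-file transaction
        db.execute(
            "INSERT OR REPLACE INTO cells (row_idx, col_idx, value) VALUES (?, ?, ?)",
            (row_idx, col_idx, value),
        )
```

Reproducing a user's bug then really is copying one `.sqlite` file.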
This article ends up making a compelling case for DynamoDB. It has the properties he describes wanting. Many, many systems inside of Amazon are built with DDB as the primary datastore. I don't know of any OSS system comparable to DDB, but it would be quite interesting for one to appear.
> "Every transaction can only be in one shard" only works for simple business logics.
You'd be quite surprised at what you can get out of this model. Check out the later chapters of the DynamoDB Book [1] for some neat examples.
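To make that concrete, here's a toy sketch of the kind of single-table modeling I mean (my own example, not one from the book; the table name and attributes are invented): all of an org's items share one partition key, and a multi-item write becomes a single conditional `TransactWriteItems` call.

```python
import boto3

ddb = boto3.client("dynamodb")
TABLE = "app-data"  # hypothetical single-table design: PK = org, SK = entity

def add_seat(org_id: str, user_id: str) -> None:
    # All of the org's items share the partition key ORG#<id>, so the org's
    # data forms one item collection a single Query can fetch; the write
    # below atomically bumps the seat counter and inserts the new user,
    # guarded by a condition on the counter.
    ddb.transact_write_items(
        TransactItems=[
            {
                "Update": {
                    "TableName": TABLE,
                    "Key": {"PK": {"S": f"ORG#{org_id}"}, "SK": {"S": "META"}},
                    "UpdateExpression": "ADD seats_used :one",
                    "ConditionExpression": "seats_used < seats_total",
                    "ExpressionAttributeValues": {":one": {"N": "1"}},
                }
            },
            {
                "Put": {
                    "TableName": TABLE,
                    "Item": {
                        "PK": {"S": f"ORG#{org_id}"},
                        "SK": {"S": f"USER#{user_id}"},
                    },
                }
            },
        ]
    )
```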
And quite frankly, I think it is incredibly rare for shards to be both fine-grained and independent in a typical OLTP DB use case.
qouteall•2h ago
It violates the "every transaction can only be in one shard" constraint.
For a specific business requirement it's possible to design clever sharding so that the transaction fits into one shard. However, new business requirements can emerge and invalidate it.
"Every transaction can only be in one shard" only works for simple business logic.
You can also still do optimistic concurrency across shards! That covers most of the remaining ones. Anything that requires anything more complex — sagas, 2PC, etc. — is relatively rare, and at scale, a traditional SQL OLTP will also struggle with those.
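A rough sketch of what I mean by optimistic concurrency across shards (entirely illustrative; in-memory dicts stand in for shards, and the record names are invented): read versioned records from several shards, make the single write conditional on the version you read, and retry on conflict.

```python
import random
import time

# Toy in-memory "shards"; every record carries a version for optimistic checks.
shards = [
    {"org:1/plan":  {"version": 7, "max_seats": 2}},    # shard 0: read-only here
    {"org:1/usage": {"version": 3, "seats_used": 1}},   # shard 1: the shard we write to
]

class Conflict(Exception):
    pass

def read(shard: int, key: str) -> dict:
    return dict(shards[shard][key])  # return a copy so later mutation is explicit

def conditional_write(shard: int, key: str, expected_version: int, **changes) -> None:
    # This compare-and-set is the only write, and it is atomic within one shard.
    rec = shards[shard][key]
    if rec["version"] != expected_version:
        raise Conflict(key)
    rec.update(changes, version=expected_version + 1)

def add_seat(retries: int = 5) -> None:
    for _ in range(retries):
        plan = read(0, "org:1/plan")    # read shard 0
        usage = read(1, "org:1/usage")  # read shard 1
        if usage["seats_used"] >= plan["max_seats"]:
            raise RuntimeError("no seats left")
        try:
            conditional_write(1, "org:1/usage", usage["version"],
                              seats_used=usage["seats_used"] + 1)
        except Conflict:
            time.sleep(random.random() / 100)  # back off and retry on contention
            continue
        # Note: if the data read from shard 0 can change concurrently and the
        # invariant depends on it, you'd re-check it here and compensate, which
        # is where this shades into the sagas/2PC territory (the rare, complex case).
        return
    raise RuntimeError("gave up after repeated conflicts")
```

The write itself is an ordinary single-shard compare-and-set; the only cross-shard part is the retry loop.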
qouteall•1h ago
So in my understanding:
- Transactions that only touch one shard are simple.
- Transactions that read multiple shards but only write to one shard can use simple optimistic concurrency control.
- Transactions that write (and read) multiple shards stay complex (a sketch of why follows below). They can be avoided by designing a smart sharding key, which is hard if the business requirements are complex.
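For the third type, a toy sketch of why it stays complex, written as a saga over two sqlite3-style connections (the `orgs.credits` column and all helpers are invented for illustration):

```python
import sqlite3

def transfer_credits(src_shard: sqlite3.Connection, dst_shard: sqlite3.Connection,
                     src_org: str, dst_org: str, amount: int) -> None:
    # Step 1: local transaction on the source shard.
    with src_shard:
        cur = src_shard.execute(
            "UPDATE orgs SET credits = credits - ? WHERE org_id = ? AND credits >= ?",
            (amount, src_org, amount))
        if cur.rowcount != 1:
            raise RuntimeError("insufficient credits")

    # Step 2: local transaction on the destination shard.
    try:
        with dst_shard:
            dst_shard.execute(
                "UPDATE orgs SET credits = credits + ? WHERE org_id = ?",
                (amount, dst_org))
    except Exception:
        # Compensating action: give the credits back on the source shard.
        # Between step 1 and this point the credits are "in flight", and every
        # reader and crash scenario has to be reasoned about explicitly.
        with src_shard:
            src_shard.execute(
                "UPDATE orgs SET credits = credits + ? WHERE org_id = ?",
                (amount, src_org))
        raise
```

A single-shard transaction (or a single ACID database) hides exactly this in-flight state and compensation logic.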
n2d4•1h ago
If you anticipate you will encounter the third type a lot, and you don't anticipate that you will need to shard either way, what I'm talking about here makes no sense for you.