If you’ve ever administered Postgres at scale, multiply that pain by 10. That’s what dealing with multi-master is like. It’s a nightmare.
IMO, one of the worst offenders is shops running microservices that absolutely do not need microservices, and that are doing reads-after-writes.
For every RDBMS I know of other than MySQL (even MariaDB has this), you can tack a RETURNING clause onto the write, and get the entire row back. This shouldn’t be necessary for most cases anyway, because by definition, you have (almost) all of the row data already. The only times you wouldn’t is DB-generated values like an auto-incrementing ID, UUID from the DB, or a timestamp from the DB. MySQL can actually handle the first case; the cursor holds LAST_INSERT_ID. So if you had a monolith, it’s trivial to pass what you wrote to the next function that needs it – boom, no concern about stale data, and one fewer DB round trip.
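To make that concrete, a minimal sketch (table and column names are made up by me):

INSERT INTO orders (customer_id, total)
VALUES (42, 99.50)
RETURNING order_id, created_at;  -- Postgres, and MariaDB 10.5+: the write itself hands back the DB-generated values

-- MySQL has no RETURNING, but the auto-increment case is covered per connection:
INSERT INTO orders (customer_id, total) VALUES (42, 99.50);
SELECT LAST_INSERT_ID();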
Even with separate services, this is still possible, though somewhat slower, but I’ve not seen anyone do it.
People with no data seem to think that these things matter to the end user.
The only time uptime and performance matter is at point of purchase.
Similarly maddening is the appalling lack of normalization that is simply taken for granted. “It’s faster, bro.” No, no, it is not. Especially not at the hundreds-of-millions or billions-of-rows scale. If you store something low-cardinality like a status column, with an average length of perhaps 7 characters, that’s 8 bytes (1 byte of overhead assumed, though it could be 2). Multiply that by 2 billion rows, and you’re wasting 16 GB. Disk is cheap, but (a) memory isn’t, and (b) don’t be lazy. There’s a right way to use an RDBMS, and a wrong way. If you want a KV store, use a damn KV store.
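A sketch of the kind of normalization being argued for (made-up tables): keep each distinct status text once, and store a 2-byte key in the big table.

-- Lookup table: each distinct status appears exactly once.
CREATE TABLE order_status (
    status_id   smallint PRIMARY KEY,
    status_name text NOT NULL UNIQUE   -- 'pending', 'shipped', 'refunded', ...
);

-- The 2-billion-row table stores a 2-byte smallint instead of a ~8-byte string per row.
CREATE TABLE orders (
    order_id  bigint PRIMARY KEY,
    status_id smallint NOT NULL REFERENCES order_status (status_id)
);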
Finally, I’d be remiss if I failed to point out that Prisma is an unbelievably immature organization who launched without the ability to do JOINS [0]. They are forever dead to me for that. This isn’t “move fast and break things,” it’s “move fast despite having zero clue what we’re doing but convince JS devs that we do.”
With a little pain, if I had to pick an extreme, I'd pick extreme normalization with materialized views that are queried directly (i.e. no joins at read time), rather than joining all of the time.
How, though? Postgres doesn't support auto-refreshing materialized views when the underlying data changes (for good reasons; it's a really hard problem).
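For what it's worth, the usual workaround is a periodic (or post-batch) manual refresh rather than anything automatic; a minimal sketch with made-up tables:

-- Pre-joined data, materialized once, then queried without JOINs.
CREATE MATERIALIZED VIEW order_summaries AS
SELECT o.order_id, c.customer_name, o.total
FROM   orders o
JOIN   customers c ON c.customer_id = o.customer_id;

-- A unique index is required for CONCURRENTLY, which avoids blocking readers during refresh.
CREATE UNIQUE INDEX ON order_summaries (order_id);

-- Run from cron/pg_cron or after batch writes; it re-runs the whole query, so it isn't free.
REFRESH MATERIALIZED VIEW CONCURRENTLY order_summaries;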
This is a safety feature. If my code expects columns A, B, and C, but the migration to add C hasn't run yet and I'm doing something that would otherwise `SELECT *`, my query should fail. If the ORM _actually_ does `SELECT *` I'll get back two columns instead of three and things can get spooky and bad real fast (unless the ORM manually validates the shape of the query response every time, which will come with a real runtime cost). If there are columns that the ORM doesn't know about, you could end up with _far more_ data being returned from the database, which could just as easily cause plenty of spooky issues—not the least of which being "overwhelming your network by flooding the client connections with data the application doesn't even know exists".
Semi-related to this: the ORM explicitly specifying columns ensures that the shape is consistent, which both makes for a faster parse of the rows coming back and, again, eliminates surprises for the parser.
This isn't said enough.
I added about 4 indexes and halved the resources overnight. But Prisma, SELECT *, GraphQL, and whatever other resume-building shit people implemented were the bane of my existence; typically engineers did this believing it would be faster. I remember one engineer got a standing ovation in Slack for his refactor, which was supposedly going to save us $$$$$, except our DB CPU went up 30% because he decided to validate every company every second in every session. In his defense, he added 1 line of code that caused it, and it was obscured through Prisma and GraphQL into an inefficient query.
FWIW, I love CockroachDB, but the price is directly linked to how much your software engineers shit on the database.
But I'm no fan of Prisma either. Drizzle has its own pain points (i.e. sequential numbers for its auto-generated migrations means annoying merge conflicts if multiple people iterate on the schema at the same time), but it's much better than Prisma at sticking close to the metal and allowing good query performance and table design.
Really?
I think the indexes they suspected were being used were in fact unused, but Motion didn't realize CockroachDB was doing zig-zag joins on other indexes to accomplish the same thing, which left the indexes that would obviously be used genuinely unused.
It's a great feature, but CRDB's optimizer would prefer a zig-zag join over a covering index; getting around this required writing indexes in a way that persuaded the optimizer not to plan a zig-zag join.
I already knew about prisma from the infamous https://github.com/prisma/prisma/discussions/19748
..........I'm sorry, what? That seems........absurd.
edit: Might as well throw in: I can't stand ORMs, I don't get why people use it, please just write the SQL.
- Using `JOIN`s (with correlated subqueries and JSON) has been around for a while now via a `relationLoadStrategy` setting.
- Prisma has a Rust service that does query execution & result aggregation, but this is automatically managed behind the scenes. All you do is run `npx prisma generate` and then run your application.
- They are in the process of removing the Rust layer.
The JOIN setting and the removal of the middleware service are going to be defaults soon; they're just in preview.
The removal of the rust service is available in preview for Postgres as of 6.7.[1]
Rewriting significant parts of a complex codebase used by millions is hard, and pushing it to defaults requires prolonged testing periods when the worst case is "major data corruption".
[1]: https://www.prisma.io/blog/try-the-new-rust-free-version-of-...
Harder than just doing joins.
All this complexity, additional abstractions and indirections, with all the bugs, footguns, and gotchas that come with it... when I could just type "JOIN" instead.
i greatly envy y'all having projects where the biggest complexity is... A single setting once, that's clearly documented. We live in very different worlds, apparently.
Funny relevant story: we got an OOM from a query that we used Prisma for. I looked into it - it was a simple SELECT DISTINCT. Turns out (I believe it was changed about a year ago, but I’m not positive), even DISTINCTs were done in memory! I can’t fathom the decision making there…
This is one of those situations where I can't tell if they're operating on some kind of deep insight that is way above my experience and I just don't understand it, or if they just made really bad decisions. I just don't get it, it feels so wrong.
This is answered at the very top of the link in the post you replied to. In no uncertain terms, no less. Direct link here: https://github.com/prisma/prisma/discussions/19748#discussio...
> I want to elaborate a bit on the tradeoffs of this decision. The reason Prisma uses this strategy is because in a lot of real-world applications with large datasets, DB-level JOINs can become quite expensive...
> The total cost of executing a complex join is often higher than executing multiple simpler queries. This is why the Prisma query engine currently defaults to multiple simple queries in order to optimise overall throughput of the system.
> But Prisma is designed towards generalized best practices, and in the "real world" with huge tables and hundreds of fields, single queries are not the best approach...
> All that being said, there are of course scenarios where JOINs are a lot more performant than sending individual queries. We know this, and that's why we are currently working on enabling JOINs in Prisma Client queries as well. You can follow the development on the roadmap.
Though this isn't a complete answer still. Part of it is that Prisma was, at its start, a GraphQL-centric ORM. This comes with its own performance pitfalls, and decomposing joins into separate subqueries with aggregation helped avoid them.
My take, as a MySQL expert: that advice is totally irrelevant now, and has been for quite some time. It's just plain wrong in a modern context.
The Prisma answer just does not summarize correctly what the book was saying.
It mainly boiled down to sharding and external caching. Storage and memory were much smaller back then, so there was a lot of sharding and functional partitioning, and major reliance on memcached; all of those are easier if you minimize excessive JOINs.
The query planner in MySQL wasn't great at the time either, and although index hints could help, huge complex queries sometimes performed worse than multiple decomposed simpler queries. But the bigger issue was definitely enabling sharding (cross-shard joins had to be handled at the application level) and enabling external caching (do a simple range scan DB query to get a list of IDs/PKs, then do point lookups in memcached, then finally do point lookups in the DB for any that weren't in memcached).
Which, frankly, is a good lesson that marketing and docs and hype can make up for any amount of technical failure, and if you live long enough, you can fix the tech issues.
If one's getting OOM errors from a SELECT DISTINCT, then there's no deep insight behind the choice, it's just a mistake.
Granted, my SQL knowledge was much worse back then and Postgres/MySQL had severe limitations in their query planners, so I can see how something like this could have happened. If they support multiple DBs, and even one has this problem, it might be better (for them) to do it application-side.
The specific issue was doing a join with a table for a one-to-many relation: you get a lot more data back from the DB than you actually need if you do the join the naive way, and if the join is nested you get exponentially more data.
It was faster to do a query for each table separately and then stitch the results.
Now it is easy to solve in PG with nested selects and JSON aggregation, which the PG query planner rewrites into efficient joins, but you still get only the bytes you requested, without duplication.
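A sketch of that pattern against made-up tables: the parent columns come back once, and the children arrive as a JSON array instead of duplicated rows.

SELECT a.author_id,
       a.name,
       COALESCE(posts_agg.posts, '[]'::json) AS posts
FROM   authors a
LEFT JOIN LATERAL (
    -- One JSON array per author; no parent-row duplication per child.
    SELECT json_agg(json_build_object('post_id', p.post_id, 'title', p.title)) AS posts
    FROM   posts p
    WHERE  p.author_id = a.author_id
) posts_agg ON true;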
I used to agree until I started using a good ORM. Entity Framework on .NET is amazing.
We have one application built on .NET 8 (and contemporary EF) with about 2000 tables, and its memory usage is okay. The one problem it has is startup time: EF takes about a minute of 100% CPU load to initialize on every application restart, before it passes execution to the rest of your program. Maybe it is solvable, maybe not, I haven't yet had the time to look into it.
Does the startup time/CPU usage cost vary depending on the size of the dataset you’re interacting with?
I disagree. It is probably one of the less terrible ORMs, but it is far from amazing. The object-relational impedance mismatch will always dominate for anything that isn't trivial business. EF works great until you need different views of the model. It does support some kind of view mapping technique, but it's so much boilerplate I fail to see the point.
Dapper + SqlConnection is goldilocks once you get into the nasty edges. Being able to query a result set that always exactly matches your view models is pretty amazing. The idea of the program automagically upgrading and migrating the schemas is something that was interesting to me until I saw what you could accomplish with Visual Studio's SQL Compare tool & RedGate's equivalent. I feel a lot more comfortable running manual schema migrations when working with hosted SQL providers.
You can easily project, or use SQL views that are then projected onto objects. It's very convenient with `.FromSql`:
https://learn.microsoft.com/en-us/ef/core/querying/sql-queri...
Can you call .Select(entity => new SomeSmallerModel { Name = entity.Name }) or something like that to select what you need? If I am understanding your issue correctly.
I also agree that it's one of the least worst, but there are still things that annoy me.
Me too. I use a DB-first approach. Then EF simply rebuilds the application models automatically with a single command.
I removed a $lookup (the mongodb JOIN equivalent) and replaced it with, as Prisma does, two table lookups and an in-memory join
p90 response times dropped from 35 seconds to 1.2 seconds
There is no "MongoDB JOIN equivalent" because MongoDB is not a relational database.
It's like calling "retrieve table results sequentially using previous table's result-set" a JOIN; it's not.
But I've found that you can get better performance in a _few_ situations with application-level joins than with SQL joins: when the SQL join is causing a table lock, slower parallel application joins beat sequential MySQL joins. (The lock also prevents other parallel DB queries, which is generally a bigger deal than whether this one endpoint is faster or not.)
Although I do reach for the SQL join first; if something is slow, then metrics and optimization are necessary.
As someone who has primarily worked with Postgres for relational concerns, I’ve envied the apparent robustness of the MySQL scaling solutions.
There's a handful of companies that have scaled Citus past 1PB for production usage, but the examples I'm aware of all had more engineering to avoid capability or architecture limitations than one might like. I'd love to see someone come back with a fresh approach that covered more use cases effectively.
Disclaimer: former Citus employee
One interesting tradeoff Postgres and MySQL made for efficiency's sake was making xacts not fully ACID by default; instead they guarantee something that's good enough as long as you keep it in mind. Cockroach and Spanner are fully ACID, but that means even if you used those as a single-node DB, it ought to be slower.
I've sworn off ORMs entirely. My application is a Postgres application first and foremost. I use PG-specific features extensively. Why would I sacrifice all the power that Postgres offers me just for some conveniences in Python, or Ruby, or whatever?
Nah. Just write the good code for your database.
When I need something PG specific I have options like writing raw SQL queries.
Having most of my data layer in C# is fantastic for productivity and in most cases the performance compared to SQL is negligible.
The Npgsql driver automatically applies PG-specific tricks without me having to do anything special.
The only part I had to do myself was the data ingress point, which had some race condition issues; everything else seems to perform pretty well out of the box.
To be fair, Prisma's `OR` clause looks so good. Way better than ActiveRecord.
Sometimes debt is worth it. Sometimes the interest rate is too high.
Trying to explain to a modern web dev that the optimum data storage structure is not the same as the optimum application layer data structure, so you can't just take one and map them across 1:1 to the other, is really painful.
Developing without an ORM is just as quick as developing with one (because the time you save on routine queries you will more than lose on the tricky edge cases that the ORM completely screws up on). But you need to know SQL and databases to do it.
And yeah, obviously, there will be bugs in your SQL. And writing good tests for the database layer is always tricky because there's usually some pushback to setting up a database on the CI instance, and so on. It's not simple, but ignoring it by using an ORM doesn't make it simpler, it just means you have fewer options when it goes wrong.
In Postgres that usually means you're not locking rows, you're not using upsert, you might not be writing table DDL yourself. It often means you aren't even using database transactions.
While these things might be extraneous fluff for an all-nighter hackathon, you really have to figure out a sweet spot so that data integrity violations aren't killing your velocity when your service's rubber begins hitting the road.
Admittedly, most of my experience with ORMs was with Ruby on Rails' Active Record + Rails' generated SQL tables + the culture that ensued from it, like large production Rails applications that didn't use a single db transaction (and often no indexes). Though I reckon things could have changed in 15 years.
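For reference, the row locking and upsert mentioned above, sketched against made-up Postgres tables (page_id and account_id are assumed to be primary keys):

-- Upsert: insert, or atomically bump the existing row on a key conflict.
INSERT INTO page_views (page_id, view_count)
VALUES (123, 1)
ON CONFLICT (page_id)
DO UPDATE SET view_count = page_views.view_count + EXCLUDED.view_count;

-- Row locking inside a transaction: nobody else can update this row until we commit.
BEGIN;
SELECT balance FROM accounts WHERE account_id = 42 FOR UPDATE;
UPDATE accounts SET balance = balance - 10 WHERE account_id = 42;
COMMIT;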
I can imagine that an ORM might be the best option for most people. It wasn't until I worked at one specific company that I learned how to really use Postgres. Before that, an ORM and its abstractions probably made more sense than expecting me to figure out how to use a database directly on my own.
People just hate embedding SQL into other languages. I don't know why.
It's not as uncommon as one might think, one of the big products in public sector services where I live offers both SQL Server and Oracle as persistence layer so they can't push logic into stored procedures or similar techniques.
But just sketching out some schemas and booting PostgREST might be good enough for forever; if that's the case, go for it. As for ORMs, I kind of like how Ecto in Elixir settings does things: it solves a few tedious things like validation and 'hydration', and has a macro DSL for generating SQL with concise expressions.
I recently wanted to check it out and wrote a small app that made good use of pgvector for embeddings and custom queries with CTEs for a few complex edge cases, and it was all quite smooth.
Now it might not be at the level of active record, ecto or sqlalchemy but it was quite decent.
If you know your SQL, it gave me options at any point to drop down a level of abstraction, while still keeping the types so as not to break the abstraction too much for the rest of the code.
prisma.$queryRaw<YourType>`SELECT * FROM ...`
I recently looked at migrating a legacy project with basic SQL query generation to a modern ORM. Prisma came up top of course so I tried it.
We use Postgres built-in range types. Prisma does not support these, there's no way to add the type to the ORM. You can add them using "Unsupported", but fields using that aren't available in queries using the ORM, so that's pretty useless.
It also requires a binary to run, which would require different builds for each architecture deployed to. Not a big thing but it was more annoying than just switching the ORM.
That coupled with their attitude to joins - which has truth to it, but it's also short-sighted - eliminated Prisma.
The final decision was to switch to Kysely to do the SQL building and provide type-safe results, which is working well.
> It also requires a binary to run, which would require different builds for each architecture deployed to.
https://www.prisma.io/blog/from-rust-to-typescript-a-new-cha...
> That coupled with their attitude to joins
https://www.prisma.io/blog/prisma-orm-now-lets-you-choose-th...
As another poster has mentioned, a thing Prisma has over the others is type safety if you use the raw SQL escape hatch for performance reasons.
I guess the join one is from the internet memory, good to know for the next time!
We will absolutely share our findings when that migration happens!
- The query objects can become hard to read for anything remotely complex.
- If you need an unsupported Postgres extension you are out of luck.
- The schema is one large file, impossible to split up.
- We have many apps in a monorepo and they cannot have separate prisma connections cause the schema gets baked into "@prisma/client"
Basically the only thing useful about it are the TS types which is something SQL-builder libraries solve better. Long story short, use Kysely, Prisma provides no value that I see.
Wow, what the fuck.
"Also, this chapter about Query Performance Optimization from the High Performance MySQL book has some great insights. One of the techniques it mentions is JOIN decomposition:
Many high-performance web sites use join decomposition. You can decompose a join by running multiple single-table queries instead of a multitable join, and then performing the join in the application."
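Sketched with a made-up tag/post schema in the spirit of the book's example:

-- One multi-table join...
SELECT p.*
FROM   tag t
JOIN   tag_post tp ON tp.tag_id = t.id
JOIN   post p      ON p.id = tp.post_id
WHERE  t.tag = 'mysql';

-- ...decomposed into single-table queries, stitched back together in the application:
SELECT id FROM tag WHERE tag = 'mysql';
SELECT post_id FROM tag_post WHERE tag_id = 1234;
SELECT * FROM post WHERE id IN (123, 456, 9098, 8904);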
This belief that they can write JavaScript that outperforms decades of bare-metal executed optimisations in mainstream database engines is just astonishing.

Also, this is an old quote. Databases didn't all support things like JSON at the time, so joins and subqueries presented an N+1 problem and could balloon data fetch requirements fairly easily. Being a GraphQL-focused ORM originally too, this made some sense.
The default is now being changed and correlated subqueries, JOINs, & JSON aggregation will replace the old approach unless explicitly toggled.
In my first job fresh out of uni, I worked with a "senior" backend developer who believed this. He advocated for crusty, bug-ridden ORMs like Sequelize and Prisma (still very early stage back then, so lots more issues than now though I'd still steer well clear of it). Claiming they did "query optimizations". I knew it made zero sense, but also that I wasn't going to be able to convince him.
Also, query planner maturity is a big deal. It's hard to get Spanner to use the indexes you want.
It's interesting that people typically think of columnstores for strict BI / analytics. But there are so many App / user-facing workloads that actually need it.
ps: we're working on pg_mooncake v0.2. create a columnstore in Postgres that's always consistent with your OLTP tables.
It might help for this workload.
We had to design this columnstore to be 'operational' so it can keep up with changing oltp tables (updates/deletes).
You'll be able to deploy Mooncake as a read-replica regardless of where your Postgres is. Keep the write path unchanged, and query columnar tables from us.
--- v0.2 will be released in preview in ~a couple weeks. stay tuned!
If you want to fully embrace the vibe, tables are difficult.
Even before LLMs, I was at a certain company that preferred MongoDB so we didn’t need migrations.
Sometimes you don’t care about data structure and you just want to toss something up there and worry about it later.
Postgres is the best answer if you have a solid team and you know what you’re doing.
If you want to ride solo and get something done fast, Firebase and its NoSQL cousins might be easier.
> Postgres is the best answer if you have a solid team and you know what you’re doing.
Not every type of data fits neatly into the relational model.
Example: time series data.
So depending on your model - pick your poison.
But for relational models, there is hardly anything better than postgres now.
It makes me happy coz I always rooted for the project from the early 2000s.
A really nice pattern has been to use change data capture and kafka to ship data off to clickhouse for long-term storage and analytics, which allows us to simply drop old partitions in postgres after some time.
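The Postgres side of that pattern, sketched with a made-up table: range-partition by time so that retiring old data is a metadata operation instead of a huge DELETE.

CREATE TABLE events (
    event_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL,
    payload    jsonb
) PARTITION BY RANGE (created_at);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Once CDC has shipped January to ClickHouse, dropping it is near-instant
-- and reclaims the space immediately, no vacuuming required:
DROP TABLE events_2024_01;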
But, for example, I was working on a .NET project and Entity Framework decided it couldn't migrate Postgres tables correctly.
I'm not a SQL person, so at this point my options are to drop the tables and let .NET recreate them, or try to write my own migrations.
This just isn't an issue with Firebase. I can add all the fields I want directly on the client.
As soon as you have a loosely defined object, you can't rely on accessing any specific keys, which makes it useless for 99% of the times you want to store and retrieve data.
That's the entire idea behind Firebase. It makes prototyping much faster. I don't know how well it scales, but it works for most smaller projects.
It all depends on what you need to actually do. The only real weakness of Firebase is the Google lock in.
Let's say you have an object called box.
It has 3 properties: the id, its name, and its weight.
4 months later you add a 4th property, cost. Older records that don't have a cost just return null for that field.
You can be sure that PostgreSQL will be applicable for any work load that MySQL can handle, but not necessarily the other way round, so if you actually have the freedom to make a choice, go with PostgreSQL.
In particular, because PostgreSQL has a lot more features, people imply that the other one, for the lack of those features and its associated complexity, must automatically be faster. Which isn't true, neither generally, nor in special cases, since the latter one can go either way - your particular query might just run 4x on PostgreSQL. There is also no universal approach to even benchmark performance, since every database and query will have completely different characteristics.
That's utter nonsense. What are you even referring to here?
InnoDB is a rock-solid OLTP storage engine, and it's been MySQL's default since 2010. A very large percentage of the global economy relies on InnoDB, and has for quite some time.
But while digging that up it seems there is one with more colors: https://postgresforeverything.com/
And one for the AI crowd https://github.com/dannybellion/postgres-is-all-you-need#pos...
In addition, SQL in itself is a proven technology. The reality is that most problems you might think about solving with specialized databases (Big Data TM etc) could probably easily be solved with your run-of-the-mill RDBMS anyway, if more than five minutes are spent on designing the schema. It's extremely versatile, despite just being one layer above key-value storage.
I don't think I've once seen a migrating away from Postgres article.
> The DB used is PostgreSQL which is not used anywhere else at Yelp, which meant that only a small rotation of long-tenured employees knew Postgres well enough to do outage response. This caused issues in maintenance, visibility, and outage response times. The teams working on the Restaurants products are not infra teams, and the Yelp-wide infra teams (understandably) focus on Yelp-standard infrastructure. As a result, when we did see issues with Postgres it was often a scramble to find people with relevant expertise.
> So, we switched out this DB in-place with a Yelp-standard MySQL DB.
https://www.youtube.com/watch?v=DNHMYp8M40k
I'm trying to avoid editorializing in my above summary, for fear of mischaracterizing their opinions or the current state of postgres. Their use of postgres was 10 years ago, they were using postgres for a high-availability use case -- so they (and I) don't think "postgres bad, cockroach good." But like Bryan Cantrill says, "No one cares about your workload like you do." So benchmark! Don't make technical decisions via "vibes!"
Now we have lovely KQL queries and are pretty much starting fresh with Postgres again...
The compression level is vastly superior to any available Postgres-based solution, and at Instagram’s scale it amounts to extremely compelling hardware cost savings.
Also if they were still primarily on pg, it would be one of the largest pg deployments in existence, and there would be obvious signs of the eng impact of that (conference talks, FOSS contributions, etc).
Bigger-picture: Postgres is an amazing database, and it’s often the right choice, but nothing in tech is always the best choice 100% of the time. There’s always trade-offs somewhere.
But yeah we did migrate our _analytics_ data to ClickHouse (while still keeping Postgres for more transactional stuff) back when I was at PostHog.
Writeup: https://posthog.com/blog/how-we-turned-clickhouse-into-our-e...
Clickhouse is incredible tech. We’ve been very pleased with it for OLAP queries, and it’s taken a lot of load off the postgres instance, so it can more easily handle the very high write load it gets subjected to.
Aside. Job section says not 9-5. What does that mean? Long hours? Or not 9-5 attitude?
When I first stepped into a DBA role with CockroachDB, I was confused why indexes we obviously needed showed up as unused. It wasn't until I ran EXPLAIN on the queries that I learned the planner was doing zig-zag joins instead.
I did a double take at this. At the onset of the article, the fact they're using a distributed database and the mention of a "mid 6 figure" DB bill made me assume they have some obscenely large database that's far beyond what a single node could do. They don't detail the Postgres setup that replaced it, so I assume it's a pretty standard single primary and a 100 million row table is well within the abilities of that—I have a 150 million row table happily plugging along on a 2vCPU+16GB instance. Apples and oranges, perhaps, but people shouldn't underestimate what a single modern server can do.
This is definitely true, but I've seen migrations from other systems struggle to scale on Postgres because of decisions which worked better in a scale-out system, which doesn't do so well in PG.
A number of well meaning indexes, a very wide row to avoid joins and a large number of state update queries on a single column can murder postgres performance (update set last_visited_time= sort of madness - mutable/immutable column family classifications etc.)
There were scenarios where I'd have liked something like zHeap or Citus, to be part of the default system.
If something was originally conceived in postgres and the usage pattern matches how it does its internal IO, everything you said is absolutely true.
But a migration could hit snags in the system, which is what this post celebrates.
The "order by" query is a good example, where a bunch of other systems do a shared boundary variable from the TopK to the scanner to skip rows faster. Snowflake had a recent paper describing how they do input pruning mid-query off a TopK.
Obv it depends on your query patterns
That sounds insane for a crud app with one million users.
What am I missing?
People just throw more compute power (ie money) at performance problems, rather than fixing their queries or making better use of indices.
Technically you’ll need a third server to perform the failover, but it doesn’t need to be nearly as big, as it’s just watching heartbeats and issuing commands.
Bare metal is absolutely where it’s at for base infrastructure and bang-for-your-buck.
People call me a graybeard for such things, but it’s never been said derisively or sarcastically to my face. Usually it’s asking for advice.
I took a “traditional” path here. I started when “the cloud” was just the fluffy white cloud that said “Internet” or “Other Networks” at the top of the diagram.
I’ve contended for a long time that ORMs and their ilk automatically building queries is an antipattern for anything but small data scale. At any reasonable size of db, you’re going to need to know sql well enough to write optimized queries anyway. There’s essential complexity in the DB queries, which ORMs can only hide for so long.
Directly to your point though, I once encountered a salesperson who was running an entire sandbox environment of a very large platform to the tune of about $25k/mo. It sat idle for almost half a year before someone came knocking. The cloud team did an audit and they were a little spicy about it, understandably.
The largest table was over 100 million rows. Some migrations were painful, however. At that time, some of them would lock the whole table and we'd need to run them overnight. Fortunately, this was for an internal app so we could do that.
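(Assuming that was Postgres, which the comment doesn't actually say, the usual way to avoid such locks for index builds nowadays is the CONCURRENTLY variant; a sketch with made-up names:)

-- Builds the index without blocking writes; slower, and it can't run inside a transaction block,
-- but readers and writers keep working while it builds.
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);

-- Same idea for removing one.
DROP INDEX CONCURRENTLY idx_orders_customer_id;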
Secondly, we did most of these things using SQL, Bash scripts, cron jobs and some I/O logic built directly into the application code. They were robust enough to handle some extremely mission critical systems (a failure could bring down a US primary market and if it's bad enough, you hear it on the news).
For tables with a lot of updates, Postgres used to fall over with data fragmentation, but that's mostly been moot since SSDs became standard.
It's also easier than ever to stream data to separate "big data" DBs for those separate use cases.
It all depends though; sometimes 1B is passé.
But 100m is a good point to consider what comes next.
I had a 200 billion row table that was operationally still manageable but, IMO, I had allowed to grow out of control. The enterprise storage costs a fortune. Should have nipped that in the bud by 20 billion at the latest.
Even a naive B-tree index has a logarithmic curve, which means that the time to find a record asymptotically flattens out as the number of records increases: doubling the row count only adds roughly one extra comparison, since log2(2N) = log2(N) + 1.
...use ORMs!
They are always bad, but especially bad when the code base and/or query complexity grows.
I dont think it’s possible to say they’re always bad when offering an example of when they make things better. In my eyes, it’s pretty simple: use the ORM for the simple CRUD and simple data querying, because you’ll have to do a lot of it in most codebases.
For everything else, there is in-database processing or dropping down to raw SQL. Even when using an ORM, nobody is preventing you from making an entity mapping against a DB view or calling stored procedures so you can keep the complex querying logic in pure SQL but get an object that’s easy to work with out of it. Plus, you know, any ORM worth their salt will let you point it at a live DB that you’ve run your (hopefully versioned, for example, with dbmate) migrations against and will let you output the backend code for the entity mapping for everything in a single command.
It’s kind of unfortunate that a lot of ORMs do kinda suck and have jagged edges, as well that most attempts at dynamic SQL also remain pretty niche, like: https://mybatis.org/mybatis-3/dynamic-sql.html
The "danger" here is the movement towards an object-oriented design instead of a SQL-first approach to querying the database.
In my experience, I have seen ORMs used for these simple CRUD cases and what tends to happen is you have objects with dozens of members when you only need one or two. And then down the line you also start doing JOINs in memory - you just don't recognize it like that.
You see, you get Object X using CRUD and then object Y using CRUD and then you need a way to say which relates to what and... oops now we've done joins in memory. And sometimes, maybe even often times, we're doing all of that for a single field we need in this one place.
That might just be an inherent problem with object oriented languages, you probably have some mapping that you need to do even with raw SQL and before long you'll have the same use cases emerge.
Whether that means any of the following is true:
- OOP is bad for interacting with SQL
- SQL is bad for being used by OOP (or rather, relational databases, maybe not object stores)
- other approaches might help make it less bad, but ORMs don't help in common usage
- maybe we just don't have the right pattern yet, perhaps the "object relational impedance mismatch" folks have a point

I can't tell, not even sure what a proper solution would be. I've seen what happens when people try to put all of the business logic inside of the DB and it's absolute trash (from the perspective of developer experience, versioning, debugging etc.; the performance was actually stellar) to work with.
At $WORK, we write ~100M rows per day and keep years of history, all in a single database. Sure, the box is big, but I have beautiful transactional workloads and no distributed systems to worry about!
I'm just saying, simple is nice and fast when it works, until it doesn't. I'm not saying to make everything complex, just to remember life is a survivor's game.
I still think it’s the right tradeoff for us, operating a distributed system is also very expensive in terms of dev and ops time, costs are more unpredictable etc.
It’s all tradeoffs, isn’t it?
Your table on a small VPS (which I concur is totally reasonable, am running something similar myself): Let's say your VPS costs $40/mo x 12 = $480/yr. Divide into 150 million. You get 312,500 rows per dollar.
I'd wager your server was faster under normal load too. But is it webscale? /s
There's waste, then there's "3 orders of magnitude" waste. The pain is self-inflicted. Unless you have actual requirements that warrant a complex distributed database, you should "just use postgres".
And just to calibrate everyone's expectations, I've seen a standard prod setup using open source postgres on AWS EC2s (1 primary, 2 replicas, 1 haproxy+pgbouncer box to load balance queries) that cost ~ $700k annually. This system was capable of handling 1.2 million rows inserted per second, while simultaneously serving thousands of read queries/s from hundreds of internal apps across the enterprise. The cost effectiveness in their case came out to ~ 20k rows per dollar, lower than your VPS since the replicas and connection pooling eat into the budget. But still: 2 orders of magnitude more cost effective than the hosted distributed hotness.
There was something like 120 million rows in the database. It ran on a single VM. It really needed the indexes, but once those were built it just sang.
This was easily 10+ years ago.
HA is important. But Postgres and MySQL both support HA and replication setups without needing to jump straight into a distributed SQL (In this context of using cockroach). We use MySQL Innodb cluster + MySQL router with auto failover on single primary mode.
> If you start having replicas you are already in distributed territory.
But it’s not the same as a distributed database with quorum writes, global consensus, and cross-region latencies. Those systems are built for horizontal write scaling, that come with added complexity and cost, which most apps don’t need.
It’s all about finding the right balance. With modern vertically scalable hardware and fast SSDs, a single-node setup with failover can handle quite a lot of load before hitting real limits.
If you are a startup you can focus on product and leave the nitty grittys to an abstraction by paying more money.
It seems to me LLMs now help fill both those roles, but with the benefit that you can now tune the SQL as needed. I also always prefer actually knowing exactly what queries are going to my DB, but YMMV.
Largest table 100 million rows and they were paying 6 figures for database services annually? I have one now that sits happily enough on an 8yo laptop. I've worked on systems that had similar scale tables chugging along on very average for 20 years ago MSSQL 2000 boxes. There just isn't a need for cloud scale systems and cloud scale bills for that data volume.
The problems they're describing should never have got that far without an experienced hand pointing out they didn't make sense, and if they'd hired that greybeard they'd have spotted it long before.
Might have been monthly.
100,000,000 rows is what I handled on a single Sun server in 2001 with Sybase, no problemo.
I was about to ask what the main constraint was for CockroachDB (iostat and atop info for CPU/disk drives), but realized that is probably something offloaded to some SaaS. So, still curious.
The only information I could extract was that the company made bad architectural decisions, believes in ORM (looking at the queries, there are many doubts that the data layouts in DB are adequate) and cannot clearly explain situations. But this is only interesting to their candidates or investors.
It may sound rude, so I apologise.
ORMs come in two main types, that I'm aware of: Active Record (named after the original Ruby one, I think) and Data Mapper (think Hibernate; SQLAlchemy).
Active Record ORMs are slightly more ergonomic at the cost of doing loads of work in application memory. Data Mapper looks slightly more like SQL in your code but are much more direct wrappers over things you can do in SQL.
Data Mapper also lets you keep various niceties such as generating migration code, that stem from having your table definition as objects.
Use Data Mapper ORMs if you want to use an ORM.
In other words: [Postgres -> exotic solution] is the path everyone should take (and 99% will just stay in postgres), and not [exotic solution -> postgres].
Us grizzled genX devs saw this coming a decade ago.
In other words, there are many companies currently worth $Billion+ that wouldn't have succeeded had they followed your advice. Today, with incredibly powerful infra of all types available, starting with Postgres is almost always the right step unless you know, know, better. That wasn't the case 10+ years ago.
I'm assuming it's largely because Postgres has more momentum, and is much more extensible, but if you're just trying to do 'boring DB stuff' I find it's faster for most use cases. Development has slowed, but it would be hard to argue that it's not battle tested and robust.
I still think MySQL is a better choice for most web apps due to performance, but more general use cases I can understand the debate.
My preference today for Postgres comes down to the fact that its query planner is much easier to understand and interface with, whereas MySQL/Maria would be fine 29/30 times but would then absolutely fuck you with an awful query plan that you needed a lot of experience with to anticipate.
The single biggest thing for MySQL that should be a huge attraction for devs without RDBMS administration experience is that MySQL, by and large, doesn’t need much care and feeding. You’re not going to get paged for txid wraparound because you didn’t know autovacuum wasn’t keeping up on your larger tables. Unfortunately, the Achilles heel of MySQL (technically InnoDB) is also its greatest strength: its clustering index. This is fantastic for range queries, IFF you design your schema to exploit it, and don’t use a non-k-sortable PK like UUIDv4. Need to grab every FooRecord for a given user? If your PK is (user_id, <some_other_id>) then congrats, they’re all physically colocated on the same DB pages, and the DB only has to walk the B+tree once from the root node, then it just follows a linked list.
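A sketch of that layout (made-up table), where InnoDB's clustered primary key physically groups each user's rows together:

CREATE TABLE foo_records (
    user_id   bigint unsigned NOT NULL,
    record_id bigint unsigned NOT NULL,
    payload   varchar(255),
    PRIMARY KEY (user_id, record_id)  -- clustered: a user's rows sit on the same pages
) ENGINE=InnoDB;

-- One descent of the B+tree, then a contiguous range scan:
SELECT * FROM foo_records WHERE user_id = 42;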
If I had to, I'd probably do something like this (haven't tested it beyond validating that it creates):
mysql> SHOW CREATE TABLE ts\G
*************************** 1. row ***************************
Table: ts
Create Table: CREATE TABLE `ts` (
`metric_id` smallint unsigned NOT NULL,
`ts` datetime NOT NULL,
PRIMARY KEY (`metric_id`,`ts`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
/*!50100 PARTITION BY RANGE (dayofmonth(`ts`))
SUBPARTITION BY HASH (`metric_id`)
SUBPARTITIONS 3
(PARTITION p1 VALUES LESS THAN (2) ENGINE = InnoDB,
PARTITION p2 VALUES LESS THAN (3) ENGINE = InnoDB,
PARTITION p3 VALUES LESS THAN MAXVALUE ENGINE = InnoDB) */
1 row in set (0.012 sec)
Obviously there would be more partitions for different days (or whatever other date chunk you wanted – months, weeks, etc.), and the sub partitions number would depend on how many metrics you were tracking. You could also simplify this at the expense of more tables by having a table per metric.

I just see lots of people making CRUD web apps and choosing these new Postgres solutions, and that seems like the one thing that MySQL is almost always better at.
> there were concerns about the eventuality of a multi-region setup (mandated by GDPR)

Can anyone share more details about this? The GDPR is mandating a multi-region setup? This sounds very wild.
Currently the vendor lock-in or requirements for installing plugins make it hard to do with cloud sql providers. Especially hard since by the time it's a problem you're probably at enough scale to make switching db/vendors hard or impossible.
here's my usecase:
- we have a bunch of attributes (all different names by customer, and many different values for each record that a customer stores)
- it's a fairly natural fit for a json value with only one level of key: value mapping
- we use mysql on GCP (no columnar plugins, too hard to switch to postgres)
Someone could go back in time and correctly model it as columns and not json but that ship has sorta sailed. While it's not impossible to change, it would be pretty hard, time will tell if that ends up happening.
I would love to be able to tell MySQL "put this column in a columnar engine and use that when I query on it" (AlloyDB is this for Postgres on GCP).
Could always try and dump to GCP/Parquet/json and use duckdb/bq to just query it all.
with pg_mooncake v0.2 (launching in ~couple weeks), you'll be able to get a columnar copy of your Postgres that's always synced (<s freshness).
Keep your write path unchanged, and keep your Postgres where it is. Deploy Mooncake as a replica for the columnar queries.
https://www.postgresql.org/docs/current/indexes-expressional...
I can't think of any case at all, no matter how contrived, where you'd want to use the non-B versions of the JSON aggregate functions though.
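For anyone following along, the functions in question, on a made-up table:

-- The non-B aggregate returns type json (a plain text representation);
-- jsonb_agg returns jsonb (the parsed binary form, usually what you want downstream).
SELECT json_agg(name)  AS names_json  FROM items;
SELECT jsonb_agg(name) AS names_jsonb FROM items;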