[1] https://vercel.com/blog/investing-in-the-python-ecosystem
> There has been a ton of interest expressed this week about potential community maintenance of Gel moving forward. To help organize and channel these hopes, I'm putting out a call for volunteers to join a Gel Community Fork Working Group (...GCFWG??). We are looking for 3-5 enthusiastic, trustworthy, and competent engineers to form a working group to create a "blessed" community-maintained fork of Gel. I would be available as an advisor to the WG, on a limited basis, in the beginning.
> The goal would be to produce a fork with its own build and distribution infrastructure and a credible commitment to maintainership. If successful, we will link to the project from the old Gel repos before archiving them, and potentially make the final CLI release support upgrading to the community fork.
> Applications accepted here: https://forms.gle/GcooC6ZDTjNRen939
> I'll be reaching out to people about applications in January.
I need to figure out an automatic way to track these.
If you're not familiar with the CMU DB Group you might want to check out their eccentric teaching style [1].
I absolutely love their gangsta intros like [2] and pre-lecture DJ sets like [3].
I also remember a video where he was lecturing with someone sleeping on the floor in the background for some reason. I can't find that video right now.
I'm not too sure about the context or Andy's biography; I'll research that later. I'm even more curious now.
[1] https://youtube.com/results?search_query=cmu+database
Anyone willing to clarify this? I'm quite weak at database stuff, and I'd love to find a proper undergrad-level course to learn and catch up.
He is training up people to work on new features for existing databases, or build new ones.
Not application developers on how to use a database.
Knowing some of the internals can help application developers make better decisions when it comes to using databases though.
You can tell from the topics, it's related to building databases, not using them.
I was hoping to learn about some new, potentially viable alternatives to InfluxDB; alas, it seems I'll continue using it for now.
For example, it has an InfluxDB-compatible ingestion API, so Telegraf can push its data to it or InfluxDB can replicate to it. It also has Prometheus remote read and remote write APIs, so it's compatible with Prometheus.
The storage can be done in various systems, including ClickHouse, SQLite, DuckDB, TimescaleDB… I should try to include QuestDB.
https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...
1: Moving everything to SQLite
2: Using mostly JSON fields
Both started already a few years back and accelerated in 2025.
SQLite is just so nice and easy to deal with, with its no-daemon, one-file-per-db and one-type-per-value approach.
And the JSON arrow functions make it a pleasure to work with flexible JSON data.
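For anyone who hasn't tried them, here's a minimal sketch of those arrow operators (SQLite 3.38+) via Python's built-in sqlite3 module; the events table and payload shape are made up:

```python
# Minimal sketch of SQLite's JSON "arrow" operators (requires SQLite >= 3.38).
# Table and column names are illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
con.execute(
    "INSERT INTO events (payload) VALUES (?)",
    ('{"user": {"name": "alice"}, "tags": ["db", "sqlite"]}',),
)

# ->> returns the unwrapped SQL value, -> returns JSON.
rows = con.execute(
    "SELECT payload ->> '$.user.name', payload -> '$.tags' FROM events"
).fetchall()
print(rows)  # [('alice', '["db","sqlite"]')]
```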
Single file per database, multiple ingestion formats, full-text search, S3 support, Parquet file support, columnar storage, fully typed.
WASM version for full SQL in JavaScript.
Edit: Ah, right, the downside is that this is not going to have good OLAP query performance when interacting directly with the SQLite tables. So it's still necessary to copy out to DuckDB tables (probably in batches) if this matters. Still seems very useful to me though.
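Roughly what the "copy out" step could look like: DuckDB's sqlite extension can attach the live SQLite file and materialize a columnar copy. The app.db file, the events table, and the created_at column are all assumptions:

```python
# Sketch: let DuckDB read the SQLite file via its sqlite extension, then
# materialize a columnar copy for OLAP queries. File/table names are made up.
import duckdb

con = duckdb.connect("analytics.duckdb")
con.execute("INSTALL sqlite")
con.execute("LOAD sqlite")
con.execute("ATTACH 'app.db' AS src (TYPE sqlite)")

# One-shot (or periodic) copy into DuckDB's own columnar storage.
con.execute("CREATE OR REPLACE TABLE events AS SELECT * FROM src.events")

# Aggregations now hit the columnar copy instead of the row-ordered SQLite pages.
print(con.execute("SELECT count(*), max(created_at) FROM events").fetchone())
```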
We know you can't get both row and column order at the same time, and that continuously maintaining both means duplication and ensuring you get the worst case from both worlds.
Local, row-wise writing is the way to go for write performance. Column-oriented reads are the way to do analytics at scale. It seems alright to have a sync process that does the order re-arrangement (maybe with extra precomputed statistics, and sharding to allow many workers if necessary) to let queries of now historical data run fast.
I agree that the basic architecture should be row order -> delay -> column order, but the question (in my mind) is balancing the length of that delay with the usefulness of column order queries for a given workload. I seem to keep running into workloads that do inserts very quickly and then batch reads on a slower cadence (either in lockstep with the writes, or concurrently) but not on the extremely slow cadence seen in the typical olap reporting type flow. Essentially, building up state and then querying the results.
I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right, I'm just not so sure. I agree that it's duplicating storage requirements, but is that such a big deal? And I think if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, that would actually be the best case from both worlds?
I mean, this isn't that different conceptually from the architecture of log-structured merge trees, which have this same kind of "duplication" but for good purpose. (Indeed, rocksdb has been the closest thing to what I want for this workload that I've found; I just think it would be neat if I could use sqlite+duckdb instead, accepting some tradeoffs.)
I see. Can you come up with row/table watermarks? Say your column store is up to date to a certain watermark; any query that requires freshness beyond that will need to snoop into the rows that haven't yet made it into the columnar store, up to the required query timestamp.
In the past I've dealt with a system that had read-optimised columnar data that was overlaid with fresh write-optimised data and used timestamps to agree on the data that should be visible to the queries. It continuously consolidated data into the read-optimised store instead of having the silly daily job that you might have in the extremely slow cadence reporting job you mention.
You can write such a system; in reality I've found it hard to justify building one for continuous updates when a 15 min delay isn't the end of the world, but it's doable if you want it.
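A toy illustration of the watermark idea, with all names and structures made up (the two lists stand in for the row-ordered and column-ordered stores):

```python
# Toy sketch: everything at or below `watermark` lives in the columnar store,
# newer rows exist only in the row store, and a "fresh" query unions both sides.
from dataclasses import dataclass

@dataclass
class Event:
    ts: int
    value: float

columnar_store: list[Event] = []   # stands in for the read-optimised copy
row_store: list[Event] = []        # stands in for the write-optimised copy
watermark = 0                      # highest ts guaranteed to be in the columnar store

def write(event: Event) -> None:
    row_store.append(event)

def sync() -> None:
    """Consolidate rows into the columnar store and advance the watermark."""
    global watermark
    if not row_store:
        return
    new_watermark = max(e.ts for e in row_store)
    columnar_store.extend(e for e in row_store if e.ts <= new_watermark)
    row_store[:] = [e for e in row_store if e.ts > new_watermark]
    watermark = new_watermark

def query(up_to_ts: int) -> list[Event]:
    """Serve from the columnar store alone when the watermark covers the request."""
    results = [e for e in columnar_store if e.ts <= up_to_ts]
    if up_to_ts > watermark:  # snoop into rows not yet consolidated
        results += [e for e in row_store if e.ts <= up_to_ts]
    return results
```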
> I'm not so sure about "continuously maintaining both means duplication and ensuring you get the worst case from both worlds". Maybe you're right, I'm just not so sure. I agree that it's duplicating storage requirements, but is that such a big deal? And I think if fast writes and lookups and fast batch reads are both possible at the cost of storage duplication, that would actually be the best case from both worlds?
I mean that if you want both views in a consistent world, then writes will bring things to a crawl, as both the row- and column-ordered data need to be updated before the write lock is released.
Now that you said this about watermarks, I realize that this is definitely the same idea as streaming systems like flink (which is where I'm familiar with watermarks from), but my use cases are smaller data and I'm looking for lower latency than distributed systems like that. I'm interested in delays that are on the order of double to triple digit milliseconds, rather than 15 minutes. (But also not microseconds.)
I definitely agree that it's difficult to justify building this, which is why I keep looking for a system that already exists :)
If you really need to get performance you'll be building a star schema.
Also, are there SQLite-DuckDB sync engines, or is that an oxymoron?
It's not bad if you need something quick. I haven't had much need for ANN in DuckDB since I use it more for analytical/exploratory work, but it's definitely there if you need it.
If you’re at a point where the application needs to talk over a network to your database then that’s a reasonable heuristic that you should use a different DB. I personally wouldn’t trust my data to NFS.
Local as in "desktop application on the local machine" where you are the sole user.
1. People gaining newfound appreciation of having the database on the same machine as the web server itself. The latency gains can be substantial and obviously there are some small cost savings too as you don't need a separate database server anymore. This does obviously limit you to a single web server, but single machines can have tons of cores and serve tens of thousands of requests per second, so that is not as limiting as you'd think.
2. Tools like Litestream will continuously back up all writes to object storage, so that one web server having a hardware failure is not a problem as long as your SLA allows downtime of a few minutes every few years. (And let's be real, most small companies for which this would be a good architecture don't have any SLA at all.)
3. SQLite has concurrent writes now, so it's gotten much more performant in situations with multiple users at the same time.
So for specific use cases it can be a nice setup because you don't feel the downsides (yet) but you do get better latency and simpler architecture. That said, there's a reason the standard became the standard, so unless you have a very specific reason to choose this I'd recommend the "normal" multitier architectures in like 99% of cases.
Just to clarify: unless I've missed something, this is only WAL mode allowing concurrent reads at the same time as a write; I don't think it can handle multiple concurrent writes at the same time?
See also https://www.sqlite.org/src/doc/begin-concurrent/doc/begin_co...
This type of limitation is exactly why I would recommend "normal" server-based databases like Postgres or MySQL for the vast majority of web backends.
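For reference, the stock setup being discussed looks roughly like this in Python's sqlite3 (the comments table is made up; the experimental BEGIN CONCURRENT branch from the link above is not shown):

```python
# WAL mode gives concurrent readers alongside a single writer; writes are still
# serialized, so set a busy timeout so a second writer waits instead of erroring.
import sqlite3

con = sqlite3.connect("app.db", timeout=5.0)   # wait up to 5s on a locked database
con.execute("PRAGMA journal_mode=WAL;")        # readers no longer block the writer
con.execute("PRAGMA synchronous=NORMAL;")      # common WAL pairing; durability trade-off
con.execute("PRAGMA busy_timeout=5000;")       # same idea as `timeout`, at the SQLite level

con.execute("CREATE TABLE IF NOT EXISTS comments (id INTEGER PRIMARY KEY, body TEXT)")
with con:  # one serialized write transaction
    con.execute("INSERT INTO comments (body) VALUES (?)", ("hello",))
```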
The SQLite docs have a good overview of appropriate and inappropriate uses: https://sqlite.org/whentouse.html It's best to start with Section 2, "Situations Where A Client/Server RDBMS May Work Better".
Any mitigation strategy for larger use cases?
Thanks in advance!
If your writes are fast, doing them serially does not cause anyone to wait.
How often does the typical user write to the DB? Often it is like once per day or so (for example on Hacker News). Say the write takes 1/1000 s. Then you can serve
1000 * 60 * 60 * 24 = 86 million users
And nobody has to wait longer than a second when they hit the "reply" button, as I do now...

Why impose such a limitation on your system when you don't have to, by using some other database actually designed for multi-user systems (Postgres, MySQL, etc.)?
Assuming you can accept 99% uptime (that's ~3.7 days a year being down), and if you were on a single cloud in 2025, that's basically last year.
We need not assume internet/FB-level scale for typical biz apps, where one instance may support a few hundred users max, or even a few thousand. Over-engineering under such assumptions is likely cost-ineffective and may even increase the surface area of risk. $0.02
Most will want to use a managed DB, but for a really basic setup you can just run Postgres or MySQL on the same box. And running your own DB on a separate VPS is not hard either.
Turns out a lot when you have things like "last accessed" timestamps on your models.
Really depends on the app
I also don't think that calculation is valid. Your users aren't going to be accessing the app perfectly uniformly over the course of a day. Invariably you'll have queuing delays above a significantly smaller user count (but maybe the delays are acceptable).
https://www.sqlite.org/src/doc/begin-concurrent/doc/begin_co...
It has been OK so far, but I will definitely have to migrate to Postgres at some point, sooner rather than later.
In my experience, caching makes most sense on the CDN layer. Which not only caches the DB requests but the result of the rendering and everything else. So most requests do not even hit your server. And those that do need fresh data anyhow.
The "web service" is only the user facing part which bears the least load. Read caching is useful there too as users look at statistics, so calculating them once every 5-10 minutes and caching them is needed, as that requires scanning the whole database.
A CDN is something I don't even have. It's not needed for the amount of users I have.
If I were using Postgres, these writer processes + the web service would share the same read cache for free (coming from Postgres itself). The difference wouldn't be huge if I migrated right now, but by now I already have the custom caching.
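The custom stats caching can be as small as something like this (the query, table, and 5-minute TTL are made up; the point is just the recompute-on-expiry check):

```python
# Toy version of "recompute the expensive stats every few minutes and cache them".
import sqlite3
import time

CACHE_TTL_S = 300  # 5 minutes
_cache = {"computed_at": 0.0, "value": None}

def site_stats(con: sqlite3.Connection) -> dict:
    now = time.monotonic()
    if _cache["value"] is None or now - _cache["computed_at"] > CACHE_TTL_S:
        # The full-table scan only runs when the cached value has expired.
        row = con.execute("SELECT count(*), avg(score) FROM items").fetchone()
        _cache["value"] = {"items": row[0], "avg_score": row[1]}
        _cache["computed_at"] = now
    return _cache["value"]
```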
For someone who openly describes his stack and revenue, look up Pieter Levels, how he serves hundreds of thousands of users and makes millions of dollars per year, using SQLite as the storage layer.
_You_ are using it right this second. It's storing your browser's bookmarks (at a minimum, and possibly other browser-internal data).
I have used DuckDB on an application server because it computes aggregations lightning fast which saved this app from needing caching, background services and all the invalidation and failure modes that come with those two.
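The kind of on-the-fly aggregation I mean looks roughly like this; DuckDB runs in-process on the app server, so there's no cache to invalidate. The Parquet path and column names are assumptions:

```python
# In-process aggregation with DuckDB; the data layout is illustrative.
import duckdb

daily = duckdb.sql("""
    SELECT date_trunc('day', ts) AS day,
           count(*)              AS events,
           sum(amount)           AS revenue
    FROM read_parquet('events/*.parquet')
    GROUP BY day
    ORDER BY day
""").fetchall()
```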
SQLite is kind of the middle ground between a full-fat database and 'writing your own object storage'. To put it another way, it provides a 'regularised' object-access API, rather than, say, a variant of types in a vector that you filter or map over.
The only thing I want out of DuckDB core at this point is support for overriding the columnar storage representation for certain structs. Right now, DuckDB decomposes structs into fields and stores each field in a column. I'd like to be able to say "no, please, pre-materialize this tuple subset and store this struct in an internal BLOB or something".
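To make the current behaviour concrete (the table and struct shape are made up; the "store this struct as one blob" knob is the thing that doesn't exist today):

```python
# DuckDB decomposes a STRUCT into per-field sub-columns under the hood.
# This only demonstrates today's behaviour, not the wished-for override.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE events (
        id INTEGER,
        payload STRUCT(user VARCHAR, score DOUBLE)
    )
""")
con.execute("INSERT INTO events VALUES (1, {'user': 'alice', 'score': 0.9})")

# Field access is cheap precisely because each field is its own column...
print(con.execute("SELECT payload.score FROM events").fetchall())
# ...but reassembling the whole struct touches every sub-column.
print(con.execute("SELECT payload FROM events").fetchall())
```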
Java, .NET, C++, nodejs, Sitecore, Adobe Experience Manager, Optimizely, SAP, Dynamics, headless CMSes,...
https://www.hanselman.com/blog/dark-matter-developers-the-un...
Which looks more like a blind spot to me honestly. This category of databases is just fantastic for industries like fintech.
Two candidates are sticking out. https://xtdb.com/blog/launching-xtdb-v2 (2025) https://blog.datomic.com/2023/04/datomic-is-free.html (2023)
The ones you mentioned are large backend databases, but I'm working on an "immutable SQLite"...a single file immutable database that is embedded and works as a library: https://github.com/radarroark/xitdb-java
I know they are great... but I don't see much news around them.
We hosted XTDB to give a tech talk five weeks ago:
https://db.cs.cmu.edu/events/futuredata-reconstructing-histo...
> Which looks more like a blind spot to me honestly.
What do you want me to say about them? Just that they exist?
We also hosted Lloyd to give a talk about Malloy in March 2025:
https://db.cs.cmu.edu/events/sql-death-malloy-a-modern-open-...
Otherwise there are full bitemporal extensions for PG, like this one: https://github.com/hettie-d/pg_bitemporal
What we do is range types for when a row applies or not, so we get history. Then for 'immutability' we have two audit systems: one in-database, as row triggers that keep an online copy of what's changed and by whom. This also gives us built-in undo for everything: if some mistake happens, we can just undo the change, easy peasy. The audit log captures the undo as well, of course, so we keep that history too.
Then we also do an "offline" copy, via PG logs, that gets shipped off the main database into archival storage.
Works really well for us.
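The range-type half of that setup looks roughly like this (the price_history table, values, and connection string are made up, the audit triggers are omitted, and psycopg2 is just one way to run it):

```python
# Rough sketch: each row carries a validity range, and "as of" queries filter with @>.
import psycopg2

DDL = """
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE TABLE IF NOT EXISTS price_history (
    product_id int       NOT NULL,
    price      numeric   NOT NULL,
    valid      tstzrange NOT NULL,
    EXCLUDE USING gist (product_id WITH =, valid WITH &&)  -- no overlapping periods per product
);
"""

AS_OF = """
SELECT price
FROM price_history
WHERE product_id = %s
  AND valid @> %s::timestamptz;  -- the version in effect at that instant
"""

conn = psycopg2.connect("dbname=app")  # connection string is illustrative
with conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(AS_OF, (42, "2025-06-01T00:00:00+00"))
    print(cur.fetchone())
```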
That said, it's kind of frustrating that XTDB has to be its own top-level database instead of a storage engine or plugin for another. XTDB's core competence is its approach to temporal row tagging and querying. What part of this core competence requires a new SQL parser?
I get that the XTDB people don't want to expose their feature set as a bunch of awkward table-valued functions or whatever. Ideally, DB plugins for Postgres, SQLite, DuckDB, whatever would be able to extend the SQL grammar itself (which isn't that hard if you structure a PEG parser right) and expose new capabilities in an ergonomic way so we don't end up with a world of custom database-verticals each built around one neat idea and duplicating the rest.
I'd love to see databases built out of reusable lego blocks to a greater extent than today. Why doesn't Calcite get more love? Is it the Java smell?
Many implementation options were considered before we embarked on v2, including building on Calcite. We opted to maximise flexibility over the long term (we have bigger ambitions beyond the bitemporal angle) and to keep non-Clojure/Kotlin dependencies to a minimum.
https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...
And then of course at the end he has a whole section about Larry Ellison, like always.
It seems like the author is more focused on database features than user base. Every metric I can find online says that MySQL/MariaDB is more popular than PostgreSQL. PostgreSQL seems "better" (more features, better standards compliance) but MySQL/MariaDB works fine for many people. Am I living in a bubble?
There are rumblings that the MySQL project is rudderless after Oracle fired the team working on the open-source project in September 2025. Oracle is putting all its energy in its closed-source MySQL Heatwave product. There is a new company that is looking to take over leadership of open-source MySQL but I can't talk about them yet.
The MariaDB Corporation financial problems have also spooked companies and so more of them are looking to switch to Postgres.
Not just the open-source project; 80%+ (depending a bit on when you start counting) of the MySQL team as a whole was let go, and the SVP in charge of MySQL was, eh, “moving to another part of the org to spend more time with his family”. There was never really a separate “MySQL Community Edition team” that you could fire, although of course there were teams that worked mostly or entirely on projects that were not open-sourced.
They would, so Heatwave is also going to suffer over this.
It does feel like a lot of the momentum has shifted to PostgreSQL recently. You even see it in terms of what companies are choosing for compatibility. Google historically did a lot more MySQL work, but when they created a compatibility interface for Cloud Spanner, they went with PostgreSQL. ClickHouse went with PostgreSQL. More that I'm forgetting at the moment. It used to be that everyone tried for MySQL wire compatibility, but that doesn't feel like what's happening now.
If MySQL is making you happy, great. But there has certainly been a shift toward PostgreSQL. MySQL will continue to be one of the most used databases just as PHP will remain one of the most used programming languages. There's a lot of stuff already built with those things. I think most metrics would say that PHP is more widely deployed than NodeJS, but I think it'd be hard to argue that PHP is what the developer community is excited about.
Even search here on HN. In the past year, 4 MySQL stories with over 100 points compared to 28 PostgreSQL stories with over 100 points (and zero MariaDB stories above 100 points, and 42 for SQLite). What are we talking about here on HN? Not nearly as frequently MySQL; we're talking about SQLite and PostgreSQL. That's not to say that MySQL doesn't work great for you or that it doesn't have a large installed base, but it isn't where our mindshare about the future is.
What do you mean by this? AFAIK they added MySQL wire protocol compatibility long before they added Postgres. And meanwhile their cloud offering still doesn't support Postgres wire protocol today, but it does support MySQL wire protocol.
> Even search here on HN.
fwiw MySQL has been extremely unpopular on HN for a decade or more, even back when MySQL was a more common choice for startups. So there's a bit of a self-fulfilling prophecy where MySQL ecosystem folks mostly stopped submitting stories here because they never got enough upvotes to rank high enough to get eyeballs and discussion.
That all said, I do agree with your overall thesis.
What are those metrics? If you're talking about things like db-engines rankings, those are heavily skewed by non-production workloads. For example, MySQL still being the database for Wordpress will forever have a high number of installations and developers using and asking StackOverflow questions. But when a new company or established company is deciding which new database to use for their custom application, MySQL is seldom in the running like it was 8-10 years ago.
is a bit misleading. Gel (formerly EdgeDB) is sunsetting its development; the (extremely talented) team is joining Vercel to work on other stuff.
That was a hard hit for me in December. I loved working with EdgeQL so much.
https://www.cs.cmu.edu/~pavlo/blog/2026/01/2025-databases-re...
It's disturbing how everyone is gravitating towards the same tools. This started with React and has kept getting worse. Software development sucks nowadays.
All technical decisions about which tools to use are made by people who don't have to use the tools. There is no nuance anymore. There's a blanket solution for every problem and there isn't much to choose from. Meanwhile, software is less reliable than it's ever been.
It's like a bad dream. Everything is bad and getting worse.
> MariaDB Galera Cluster provides a synchronous replication system that uses an approach often called eager replication. In this model, nodes in a cluster synchronize with all other nodes by applying replicated updates as a single transaction. This means that when a transaction COMMITs, all nodes in the cluster have the same value. This process is accomplished using write-set replication through a group communication framework.
* https://mariadb.com/docs/galera-cluster/galera-architecture/...
This isn't necessarily about being "web scale", but having a first-party, fairly automated replication solution would make HA for a number of internal-only things much simpler.
† Yes, I am aware: https://aphyr.com/posts/327-jepsen-mariadb-galera-cluster
For HA: Patroni, stolon, CNPG.
Multimaster doesn't necessarily buy you availability. Usually it trades performance and potentially uptime for data integrity.
Over the past 5 years there have been significant changes and several clear winners. Databricks and Snowflake have really demonstrated the ability to stay resilient despite strong competition from the cloud providers themselves, often through the privatization of what was previously open source. This is especially relevant given the article's mention of how Cloudera and Hortonworks failed to make it.
I also think the quiet execution of databases like ClickHouse has been extremely impressive; they have filled a niche that wasn't previously served by an obvious solution.
Why is it that in "I'm a serious database person" circles, the popular embedded databases don't count?
[1] Yes, I know it's not an exact comparison.
- Reading the data from disk
- Concurrency between different threads reading the same data
- Caching and buffer management
- Eviction of pages from memory
- Playing nice with other processes in the machine
Why would you not leverage it? It's such a great fit for scaling reads.
And losing them.
Anyway, read for yourself; Pavlo & Leis get into it in detail, and there are benchmarks:
When you expose a database via a protocol designed for 'context', you aren't just exposing data; you're exposing the schema's complexity to an entity that handles ambiguity poorly. It feels like we're just reinventing SQL injection, but this time the injection comes from the system's own hallucinations rather than a malicious user.
There are ways to reduce injection risk. Since LLMs are stateless, you can monitor the origin and trustworthiness of the context that enters the LLM, and then decide whether MCP actions that affect state will be dangerous or not.
We've implemented a mechanism like this, based on Simon Willison's lethal trifecta framework, as an MCP gateway monitoring what enters context. LMK if you have any feedback on this approach to MCP security. It's not as elegant as the approach Pavlo talks about in the post, but nonetheless we believe it's a good band-aid solution for the time being, as the technology matures:
https://github.com/Edison-Watch/open-edison
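The gateway idea in miniature (this is a toy sketch, not the open-edison code; tool names and the trusted/untrusted tagging are all made up):

```python
# Toy lethal-trifecta check: track whether untrusted content has entered the
# model's context, and once it has, refuse state-changing or exfiltrating tools.
from dataclasses import dataclass, field

STATE_CHANGING_TOOLS = {"db.write", "db.delete", "email.send", "http.post"}

@dataclass
class ContextMonitor:
    untrusted_seen: bool = False
    log: list[str] = field(default_factory=list)

    def ingest(self, content: str, *, trusted: bool) -> None:
        """Record everything that enters the LLM context and its provenance."""
        self.log.append(f"{'trusted' if trusted else 'UNTRUSTED'}: {content[:60]}")
        if not trusted:
            self.untrusted_seen = True

    def allow_tool_call(self, tool_name: str) -> bool:
        """Read-only tools are always fine; risky tools only before untrusted input."""
        if tool_name not in STATE_CHANGING_TOOLS:
            return True
        return not self.untrusted_seen

monitor = ContextMonitor()
monitor.ingest("SELECT * FROM tickets WHERE id = 7", trusted=True)
monitor.ingest("ticket body pasted from an external user...", trusted=False)
assert monitor.allow_tool_call("db.read") is True
assert monitor.allow_tool_call("db.write") is False  # blocked after untrusted content
```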
Any decent MVCC database should be able to provide MCP access to a mutable yet isolated snapshot of the DB, though, and it doesn't strike me as crazy to let the agent play with that.
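One way to approximate that with stock tools today: run everything the agent sends inside a single transaction that is always rolled back. A sketch with sqlite3 (db path, table, and statements are hypothetical; a server database with snapshot isolation would work the same way in spirit):

```python
# The agent can mutate its own view freely; nothing ever becomes visible or durable.
import sqlite3

def run_agent_sandboxed(db_path: str, agent_statements: list[str]) -> list:
    con = sqlite3.connect(db_path, isolation_level=None)  # manage the transaction manually
    results = []
    try:
        con.execute("BEGIN")
        for sql in agent_statements:  # whatever the MCP server hands through
            results.append(con.execute(sql).fetchall())
    finally:
        con.execute("ROLLBACK")  # the agent's writes never land
        con.close()
    return results

# Hypothetical usage, assuming app.db has an `accounts` table:
# run_agent_sandboxed("app.db", [
#     "UPDATE accounts SET balance = balance - 100 WHERE id = 1",
#     "SELECT balance FROM accounts WHERE id = 1",
# ])
```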
Correct, but nested transaction support doesn't seem that much of a reach if you're an MVCC-style system anyway (although you might have to factor out things like row watermarks to lookaside tables if you want to let them be branchy instead of XID being a write lock.)
You could version the index B-tree nodes too.
Also, do not forget about double COMMIT, intentional or not.
Edit: My apologies for the cynical take. I like to think that this is just the move-fast-and-break-things ethos coming through.
However, that's easy for people to forget, and they throw privileged creds at the MCP server and hope for the best.
The same holds for all LLM tools (MCP servers or otherwise). You always need to implement correct permissions in the tool itself: the LLM is too easily tricked and confused to enforce a permission boundary.
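One concrete form of that: the LLM-facing tool only ever gets a read-only handle, rather than the app's privileged creds. A sketch with sqlite3 (the db path and the crude statement allow-list are illustrative; with a server database you'd use a restricted role instead):

```python
# Even a fully tricked model can't mutate data through a read-only connection.
import sqlite3

ALLOWED_PREFIXES = ("SELECT", "WITH")  # crude allow-list on top of the RO handle

def open_readonly(db_path: str) -> sqlite3.Connection:
    # SQLite URI mode=ro refuses writes at the database layer, not just in app code.
    return sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)

def llm_query_tool(sql: str) -> list:
    if not sql.lstrip().upper().startswith(ALLOWED_PREFIXES):
        raise PermissionError("only read queries are exposed to the model")
    con = open_readonly("app.db")
    try:
        return con.execute(sql).fetchall()
    finally:
        con.close()
```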