gzip is a legacy algorithm that imo only gets used for compatibility with legacy software that understands nothing but gzip.
sqlite3 data/database.db < "{backup_file}"
My first try would've been to copy the db file first, gzip it and then transfer it but I can't tell whether compression will be that useful in binary format.
duckdb -c "attach 'sqlite-database.db' as db; copy db.table_name to 'table_name.parquet' (format parquet, compression zstd)"
in my test database this is about 20% smaller than the gzipped text SQL statements.
You'd want to do this:
duckdb -c "ATTACH 'sqlite-database.db' (READ-ONLY); EXPORT DATABASE 'target_directory' (FORMAT parquet, COMPRESSION zstd)"
Also I wonder how big your test database is and what its schema looks like. For large tables Parquet is way more efficient than a 20% reduction. If there are UUIDs, they're 36 bytes each in text mode and 16 bytes as binary in Parquet. And then if they repeat you can use a dictionary in your Parquet to store the 16 bytes only once.
It's also worth trying brotli instead of zstd if small file size is your goal.
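For example, swapping the codec in the earlier duckdb command (a sketch only; it assumes your DuckDB build supports the Brotli codec for Parquet, and the database/table names are the same hypothetical ones as above) and comparing the resulting file sizes:

  duckdb -c "attach 'sqlite-database.db' as db; copy db.table_name to 'table_name_brotli.parquet' (format parquet, compression brotli)"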
-z, --compress compress file data during the transfer
--compress-level=NUM explicitly set compression level
Probably it's faster to compress to gzip first and transfer afterwards. But it's nice to have the option of improving the transfer with a flag.
A sqlite transaction- and WAL-aware rsync with in-flight compression.
tool   compress speed   compressed size   decompress speed
zstd 361 MB/s 16% 1321 MB/s
lzop 512 MB/s 29% 539 MB/s
lz4 555 MB/s 29% 1190 MB/s
If working from files on disk that happen not to be cached, the speed differences are likely to disappear, even on many NVMe disks. (It just so happens that the concatenation of all text-looking .tar files I have on this machine is roughly a gigabyte, though I did the math for the actual size.)
zstd -b1 --fast -i10 some-rpi-os-image-idk.img
-1#-os-image-idk.img :2097152000 -> 226798302 (x9.247), 6765.0 MB/s, 5897.3 MB/s
zstd -b1 --fast -i10 jdk-11.0.8+10.tar
-1#jdk-11.0.8+10.tar : 688844800 -> 142114709 (x4.847), 2660.7 MB/s, 2506.8 MB/s
zstd -b1 --fast -i10 archlinux-2025.04.01-x86_64.iso
-1#.04.01-x86_64.iso :1236303872 ->1221271223 (x1.012), 3643.5 MB/s, 7836.6 MB/s
For every other network, you should compress as you are likely dealing with multiple tenants that would all like a piece of your 40Gbps bandwidth.
(This is the same spiel I give whenever someone says swap on Linux is or is not always beneficial.)
When I do stuff like this, I stream the dump straight into gzip. (You can usually figure out a way to stream directly to the destination without an intermediate file at all.)
Plus this way it stays stored compressed at its destination, if your purpose is backup rather than a poor man's replication.
ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c" | gunzip -c | sqlite3 my_local_database.db
Anyway I don't think the database file size was really an issue, it was a relatively big schema but not many indices and performance wasn't a big consideration - hence why the backend would concatenate query results into an XML file, then pass it through an xml->json converter, causing 1-2 second response times on most requests. I worked on a rewrite using Go where requests were more like 10-15 milliseconds.
But, I still used sqlite because that was actually a pretty good solution for the problem at hand; relatively low concurrency (up to 10 active simultaneous users), no server-side dependencies or installation needed, etc.
You could do a filesystem snapshot and copy from that, but neither cp nor rsync is atomic.
And then there is also https://www.sqlite.org/rsync.html
What I ended up doing is creating a new database, pg_restore'ing into that one with --no-owner and --no-acl, forcibly dropping the old database, and then renaming the new to the old one's name. This has the benefit of not leaving me high and dry should there be an issue with restoring.
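Roughly this sequence (a sketch; database and dump file names are hypothetical, and DROP DATABASE ... WITH (FORCE) requires PostgreSQL 13+):

  createdb myapp_new
  pg_restore --no-owner --no-acl -d myapp_new latest.dump
  psql -c "DROP DATABASE myapp WITH (FORCE);"      # forcibly drop the old database
  psql -c "ALTER DATABASE myapp_new RENAME TO myapp;"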
Another alternative is to skip compression of the dump output, let rsync calculate the differences from a previous uncompressed dump to the current dump, then have rsync compress the change sets it sends over the network. (rsync -z)
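For example (paths are hypothetical; the uncompressed dump is kept on both ends so rsync can compute deltas, and -z compresses only what actually crosses the wire):

  sqlite3 /path/db.sqlite .dump > /backups/db.sql
  rsync -az /backups/db.sql user@remote:/backups/db.sql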
> Downloading a 250MB database from my web server takes about a minute over my home Internet connection
So for the original 3.4GB database that's nearly 15 minutes waiting for the download.
As others have mentioned an incremental rsync would be much faster, but what bothers me the most is that he claims that sending SQL statements is faster than sending the database while COMPLETELY omitting the fact that you have to execute those statements. And then run optimize. And then run vacuum.
Currently I have a scenario in which I have to "incrementally rebuild" a database from CSV files. In my particular case recreating the database from scratch is actually the better option, yet despite heavy optimization it still takes half an hour just to run the batch inserts into an empty in-memory database, create the indexes, etc.
I create a new in-memory db, run the schema, and then import every table in one single transaction (my testing showed it doesn't matter whether it's a single batch or multiple single inserts, as long as they are part of a single transaction).
I do a single string replacement per every CSV line to handle an edge case. This results in roughly 15 million inserts per minute (give or take, depending on table length and complexity). 450k inserts per second is a magic barrier I can't break.
I then run several queries to remove unwanted data, trim orphans, add indexes, and finally run optimize and vacuum.
Here's quite recent log (on stock Ryzen 5900X):
08:43 import
13:30 delete non-essentials
18:52 delete orphans
19:23 create indexes
19:24 optimize
20:26 vacuum
It's a very good writeup on how to do fast inserts in sqlite3
CREATE INDEX then INSERT vs. INSERT then CREATE INDEX
i.e. they only time INSERTs, not the CREATE INDEX after all the INSERTs.
For my use case (recreating in-memory from scratch) it basically boils down to three points: (1) journal_mode = off (2) wrapping all inserts in a single transaction (3) indexes after inserts.
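A minimal sketch of those three points using the sqlite3 CLI (the file, table, and data are made up):

  sqlite3 rebuild.db <<'SQL'
  PRAGMA journal_mode = OFF;          -- (1) no rollback journal
  CREATE TABLE t (id INTEGER, val TEXT);
  BEGIN;                              -- (2) one big transaction
  INSERT INTO t VALUES (1, 'a');
  INSERT INTO t VALUES (2, 'b');
  -- ...millions more inserts...
  COMMIT;
  CREATE INDEX t_val ON t(val);       -- (3) indexes only after the bulk load
  SQL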
For whatever it's worth I'm getting 15M inserts per minute on average, topping out at around 450k/s for a trivial relationship table, on a stock Ryzen 5900X using the built-in sqlite from NodeJS.
I tested a couple of different approaches, including pglite, but node finally shipped native sqlite with version 23 and it's fine for me.
I'm a huge fan of serverless solutions and one of the absolute hidden gems about sqlite is that you can publish the database on an HTTP server and query it extremely efficiently from a client.
I even have a separate miniature benchmark project I thought I might publish, but then I decided it's not worth anyone's time. x]
I have been looking for a replacement for SQLite for years -- admittedly, not very actively; embedded databases are just a hobby obsession and my life did not allow me much leisure time in recent years -- and still couldn't find one.
The written-in-Rust `sled` database is more like a key-value store; I had partial successes with it, but it's too much work making a KV store into a relational database.
The premise of having 100M/s writes instead of 500k/s sounds a bit unrealistic, but at the same time, while simply importing tuples and completely ignoring stuff like foreign keys, I'm only utilizing one core. I had on my todo list an experiment to run these imports in parallel into different databases and then merge them somehow, but I ran out of time. Again, a 10 GB sqlite database is quite large.
On the other hand, I think the adoption and the fact that you can take your db basically anywhere and it will run right out of the box is something you can't ignore. I was briefly looking at pglite but I don't really see benefits apart from a niche use case when you really need that compatibility with big brother.
And then sqlite has so many hidden gems, like the scenario where you can use a sqlite file hosted on http like a remote database! I can post my 10Gb database on S3 and run count(*) on main table and it will only take like 40kb of bandwidth.
Agreed. I want something better than SQLite, something that learns from it and upgrades it further.
> On the other hand, I think the adoption and the fact that you can take your db basically anywhere and it will run right out of the box is something you can't ignore.
Absolutely. That's why I am soon finishing my Elixir -> Rust -> SQLite library since I happen to believe most apps don't even need a dedicated DB server.
> I was briefly looking at pglite but I don't really see benefits apart from a niche use case when you really need that compatibility with big brother.
I would probably easily pay 1000 EUR next month if I could have SQLite with PostgreSQL's strict schema. That's the one weakness of SQLite that I hate with a passion. I know about strict mode. I am using it. Still not good enough. I want "type affinity" gone forever. It's obviously a legacy feature and many people came to rely on it.
Hence I concluded that SQLite will never change and something newer should arrive at one point. Though how do you beat the (likely) millions of tests that SQLite has? You don't... but we have to start somewhere.
> And then sqlite has so many hidden gems, like the scenario where you can use a sqlite file hosted on http like a remote database! I can post my 10Gb database on S3 and run count(*) on main table and it will only take like 40kb of bandwidth.
Admittedly I never saw the value in that; to me it just seems like you are having a remote database again, at which point why not just go for PostgreSQL, which is stricter and has far fewer surprises. But that's my bias towards strictness and catching bugs at the door and not 10 km down the road.
I hear you. As someone who lost the rest of his hair during the last 10 years talking to frontend kids claiming that types are for grannies - I'm on your side. That said, sqlite has a niche application and you can enforce types at the app layer, the same way you do with web apps. Trust, but verify. Better - don't trust at all. At the end of the day, the way I see it, it's just like protobuffers et al. - you put some data into a black box stream of bytes and it's your responsibility to ensure correctness on both ends.
@serverless
It's twofold. On one hand you have the ability to move faster. On the other you have fewer moving parts that need maintenance and can break. Plus, for me personally, it's the default mindset. Let me give you an example - in 2025 most online shops still have filters that trigger a full reload of the web page. When I'm clicking on TVs in a shop I don't need to reload the page every time I click on something; the app could easily get the whole stock in a single JSON payload and filter results on the fly while I'm fiddling with filters. Sure, it doesn't work for Amazon, but it works for 95% of shops online. Yet no one is doing it. Why?
My point - I'm looking for a way to simplify processes, and for some niche applications it's just more convenient.
I then would add a SQL engine on top - not sure how much SQL would slow things down but hopefully not much. I haven't found anyone who's interested in anything like that though.
And yes I realize this is several orders of magnitude more performance than any other DB out there.
The optimal transaction size is difficult to calculate so should be measured, but it's almost certainly never beneficial to spend multiple seconds on a single transaction.
There will also be weird performance changes when the size of data (or indexed data) exceeds the size of main memory.
Idea for an offline first app, where each app install call pull a changeset and apply it to their local db.
In one of my projects I have implemented my own poor man’s session by writing all the statements and parameters into a separate database, then sync that and replay. Works well enough for a ~30GB database that changes by ~0.1% every day.
https://github.com/crawshaw/sqlite
https://github.com/eatonphil/gosqlite/
Ended up with the latter, but did have to add one function binding in C, to inspect changesets.
Every extra bit makes AOT compiling the Wasm slower (impacting startup time).
I also wanna keep the number of variants reasonable, or my repo blows up.
Add your votes for additional features to this issue: https://github.com/ncruces/go-sqlite3/issues/126
It's not carelessness, it's performance.
Quite simply, I have a table with 4 columns -- A, B, C, D. Each column is just an 8-byte integer. It has hundreds of millions of rows. It has an index on B+C+D, an index on C+D, and one on D.
All of these are required because the user needs to be able to retrieve aggregate data based on range conditions around lots of combinations of the columns. Without all the indices, certain queries take a couple minutes. With them, each query takes milliseconds to a couple seconds.
I thought of every possible way to avoid having all three indices, but it just wasn't possible. It's just how performant data lookup works.
You shouldn't assume people are being careless with indices. Far too often I see the opposite.
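For concreteness, a sketch of the kind of layout being described (table and column names are made up):

  -- four 8-byte integer columns, hundreds of millions of rows
  CREATE TABLE samples (a INTEGER, b INTEGER, c INTEGER, d INTEGER);
  CREATE INDEX idx_bcd ON samples (b, c, d);
  CREATE INDEX idx_cd  ON samples (c, d);
  CREATE INDEX idx_d   ON samples (d);
  -- a range query that uses idx_cd to narrow the rows before aggregating
  SELECT count(*), sum(a) FROM samples WHERE c BETWEEN 10 AND 20 AND d < 5;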
Set this on the connection:
PRAGMA mmap_size=…
https://www.sqlite.org/mmap.html
I think this works as it removes data copying between the os and your process.
Your process reads data from the os to execute the query.
With mmap the data reads have less overhead, so a full table scan may be fast enough.
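For example (the 256 MiB value and the names are illustrative; mmap_size applies per connection, so set it in the same connection that runs the query):

  sqlite3 mydb.sqlite "PRAGMA mmap_size=268435456; SELECT count(*) FROM big_table;"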
Just ssh the machine, dump the SQL and load it back into SQLite locally.
Isn't this a case for proper database servers with replication?
Or if it's an infrequent process done for dev purposes just shut down the application doing writes on the other side?
Of course! You can't copy the file of a running, active db receiving updates, that can only result in corruption.
For replicating sqlite databases safely there is
Do people really not understand how file storage works? I cannot rightly apprehend the confusion of ideas that would produce an attempt to copy a volatile database without synchronization and expect it to work.
I see you are a man of culture.
But things haven't improved much. Today we have "prompt engineers" whose only job is to input the right question in order to get the right answer.
Is Litestream still an active project?
There is a slight 'well akshully' on this. A DB flush and FS snapshot where you copy the snapshotted file will allow this. MSSQL VSS snapshots would be an example of this.
It’s been years since I needed to do this, but if I remember right, you can clone an entire pg db live with a `pg_backup_start()`, rsync the data directory, pg_backup_stop() and rsync the WAL files written since backup start.
Works a treat when other (better) methods are not available.
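A rough sketch of that sequence, assuming the PostgreSQL 15+ function names (older releases use pg_start_backup/pg_stop_backup); paths and the label are hypothetical. Note that pg_backup_start/pg_backup_stop must run in the same session, and pg_backup_stop() returns the backup_label content that has to be stored alongside the copied data directory for it to be restorable:

  psql <<'SQL'
  SELECT pg_backup_start('nightly');
  \! rsync -a /var/lib/postgresql/data/ backuphost:/backups/pgdata/
  SELECT * FROM pg_backup_stop();
  \! rsync -a /var/lib/postgresql/data/pg_wal/ backuphost:/backups/pgdata/pg_wal/
  SQL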
https://learn.microsoft.com/en-us/sql/relational-databases/b...
To push back against "only" -- there is actually one scenario where this works. Copying a file or a subvolume on Btrfs or ZFS can be done atomically, so if it's an ACID database or an LSM tree, in the worst case it will just rollback. Of course, if it's multiple files you have to take care to wrap them in a subvolume so that all of them are copied in the same transaction, simply using `cp --reflink=always` won't do.
Possibly freezing the process with SIGSTOP would yield the same result, but I wouldn't count on that
The problem is that LVM snapshots operate at the block device level and only ensure there are no torn or half-written blocks. It doesn't know about the filesystem's journal or metadata.
To get a consistent point-in-time snapshot without triggering crash-recovery and losing transactions, you also need to lock the sqlite database or filesystem from writes during the snapshot.
PRAGMA wal_checkpoint(FULL);
BEGIN IMMEDIATE; -- locks out writers
/* trigger your LVM snapshot here */
COMMIT;
You can also use fsfreeze to get the same level of safety:
sudo fsfreeze -f /mnt/data    # (A) flush dirty pages & block writes
lvcreate -L1G -s -n snap0 /dev/vg0/data
sudo fsfreeze -u /mnt/data # (B) thaw, resume writes
Bonus - validate the snapshotted db file with:
sqlite3 mydb-snapshot.sqlite "PRAGMA integrity_check;"
https://man7.org/linux/man-pages/man2/ioctl_ficlone.2.html
> Clones are atomic with regards to concurrent writes, so no locks need to be taken to obtain a consistent cloned copy.
I'm not aware of any of the filesystems that use it (Btrfs, XFS, Bcachefs, ZFS) that deviate from expected atomic behavior, at least with single files being the atom in question for `FICLONE` operation.
Well, I don't know rsync that well. If you're saying it doesn't detect changes to files while it's being copied, then I'll believe you.
And, as far as I know, it's impossible to detect absolutely all corruption.
But you can pretty easily detect, e.g., that a file has or has not changed since before you copied it to after, on a system with a basically functioning filesystem and clock, with a reasonable/useful level of confidence.
https://mtlynch.io/litestream/
And here's the flooding story:
https://mtlynch.io/solo-developer-year-6/#the-most-terrifyin...
Sidenote: I still use Litestream in every project where I use SQLite.
A reminder that litestream can run over plain old SFTP[1] which means you can stream database replication to just about any UNIX endpoint over SSH.
I have a favorite[2] but any SFTP server will do ...
[1] https://github.com/benbjohnson/litestream/issues/140
[2] https://www.rsync.net/resources/notes/2021-q3-rsync.net_tech...
[1] https://www.postgresql.org/docs/current/warm-standby.html [2] https://www.postgresql.org/docs/current/logical-replication....
Wot? There are multiple ways of snapshotting/checkpointing, starting at the virty level and working on down the stack through the application level.
Why not just compress the whole database using `gzip` or `lz4` before rsyncing it instead? `zstd` works too but seems like it had a bug regarding compressing files with modified content.
Better yet, split your sqlite file into smaller pieces. It's not like it needs to contain all the app data in a single sqlite file.
Nice tricks in the article, but you can more easily use the builtin utility now :)
I blogged about how it works in detail here: https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...
If you are able, please try out this enhancement and let me know if it solves your problem. See <https://sqlite.org/src/info/2025-05-01T16:07Z> for the patch.
sqlite3_rsync is now built into the rsync.net platform.
ssh user@rsync.net sqlite3_rsync … blah blah …
… just added last week and not rolled out in all regions but … all initial users reported it worked exactly as they expected it to.
So I see basic stuff needs to be repeated as people still miss those kinds of things.
But I learned that you can easily dump SQLite to a text file - neat!
if retaining the snapshot file is of value, great.
I'd be a tiny bit surprised if rsync could recognize diffs in the dump, but it's certainly possible, assuming the dumper is "stable" (it probably is, because it's walking the tables as trees). The amount of change detected by rsync might actually be a useful thing to monitor.
sqlite3 -readonly /path/db.sqlite "VACUUM INTO '/path/backup.sqlite';"
From https://sqlite.org/lang_vacuum.html : The VACUUM command with an INTO clause is an alternative to the backup API for generating backup copies of a live database. The advantage of using VACUUM INTO is that the resulting backup database is minimal in size and hence the amount of filesystem I/O may be reduced.
A text file may be inefficient as is, but it's perfectly compressible, even with primitive tools like gzip. I'm not sure the SQLite binary format compresses equally well, though it might.
I hope you're saying that because of indexes? I think you may want to revisit how compression works to fix your intuition. Text+compression will always be larger and slower than equivalent binary+compression, assuming the text and the binary represent the same contents. Why? Binary is less compressible as a percentage, but it starts off smaller in absolute terms, which results in a smaller absolute output. A way to think about it is information theory - binary should generally represent the data more compactly already, because the structure lives in the code. Compression is about replacing redundant structure with shorter references (the output ends up looking like noise), and it works better if there's a lot of redundant structure to find. However, while text has a lot of redundant structure, that's actually bad for the compressor, because it has to find that structure and process more data to do so. Additionally, it's using generic mathematical techniques to remove that structure, which are generically good but not as good as removing that structure by hand via a binary format.
There's some nuance here because the text represents slightly different things than the raw binary SQLite (how to restore the data in the db vs. the precise relationships + data structures for allowing insertion/retrieval). But still I'd expect it to end up smaller compressed for non-trivial databases.
Yeah there are indexes. And even without indexes there is an entire b-tree sitting above the data. So we're weighing the benefits of having domain-dependent compression (the binary format) vs dropping all of the derived data. I'm not sure how that will go, but let's try one.
Here is sqlite file containing metadata for apple's photo's application:
767979520 May 1 07:28 Photos.sqlite
Doing a VACUUM INTO: 719785984 May 1 08:56 photos.sqlite
gzip -k photos.sqlite (this took 20 seconds): 303360460 May 1 08:56 photos.sqlite.gz
sqlite3 -readonly photos.sqlite .dump > photos.dump (10 seconds): 1277903237 May 1 09:01 photos.dump
gzip -k photos.dump (21 seconds): 285086642 May 1 09:01 photos.dump.gz
About 6% smaller for dump vs the original binary (but there are a bunch of indexes in this one). For me, I don't think it'd be worth the small space savings to spend the extra time doing the dump.
With indexes dropped and vacuumed, the compressed binary is 8% smaller than compressed text (despite btree overhead):
566177792 May 1 09:09 photos_noindex.sqlite
262067325 May 1 09:09 photos_noindex.sqlite.gz
About 13.5% smaller than compressed binary with indices. And one could re-add the indices on the other side.
You can modify the database before vacuuming by making a new in-memory database, copying selected tables into it, and then vacuuming that to disk.
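Something like this, with hypothetical table and file names (note that CREATE TABLE ... AS SELECT copies the data but not indexes or constraints, which is the point here):

  sqlite3 :memory: <<'SQL'
  ATTACH 'photos.sqlite' AS src;
  CREATE TABLE assets AS SELECT * FROM src.assets;   -- copy only the tables you want
  CREATE TABLE albums AS SELECT * FROM src.albums;
  DETACH src;
  VACUUM INTO 'photos_subset.sqlite';                -- write the trimmed db to disk
  SQL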
When is a guy supposed to get a coffee and stretch his legs anymore?
I just tried some comparisons (albeit with a fairly small sqlite file). The text compressed to only about 84% of the size of the compressed binary database, which isn't negligible, but not necessarily worth fussing over in every situation. (The binary compressed to 7.1%, so it's 84% relative to that).
bzip2 performed better on both formats; its compression of the binary database was better than gzip's compression of the text (91.5%), and bzip2's text was better than its binary (92.5%).
Though that is not available inside rsync, it indicates that if you're going with an external compression solution, maybe gzip isn't the best choice if you care about every percentage reduction.
If you don't care about every percentage reduction, maybe just rsync compression.
One thing worth mentioning is that if you are updating the file, rsync will only compress what is sent. To replicate that with the text solution, you will have to be retaining the text on both sides to do the update between them.
It works at the page level:
> The protocol is for the replica to send a cryptographic hash of each of its pages over to the origin side, then the origin sends back the complete content of any page for which the hash does not match.
These tasks aren't that challenging but they sure are a yak shave.
nix-shell -p sqlite-rsync
Clearly, when you don’t transfer indexes, you will not have to transfer as much data.
However, the tradeoff is that the database is going to have to do more work to regenerate those indexes when you reconstruct it from the text dump at the destination (as opposed to the case where the indexes were included)
I found this tool laying around "udpcast" and used it to send the backup over the network just once to all the destinations at the same time.
https://kylecordes.com/2008/multicast-your-db-backups-with-u...
If one dumps tables as separate CSV files/streams and uses DuckDB to convert them to individual Parquet files, the rsync should run faster, since hopefully not every table is modified between syncs; see the sketch below. There is an obvious overhead from the back-and-forth conversions, but DuckDB can directly export a database to SQLite. I have not tested it myself, so this is just brainstorming.
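A sketch of the per-table conversion (file and table names are made up):

  # DuckDB reads each CSV and writes a zstd-compressed Parquet file per table
  duckdb -c "COPY (SELECT * FROM read_csv_auto('users.csv')) TO 'users.parquet' (FORMAT parquet, COMPRESSION zstd)"
  rsync -av ./*.parquet user@host:backups/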
Last but not least: when compressing/decompressing text dumps use igzip or pigz if you want to speed things up. Also benchmark the compression levels.
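For example (dump file name hypothetical; pigz parallelizes gzip across cores, and zstd's benchmark mode makes level comparisons easy):

  pigz -k -9 photos.dump       # parallel gzip at level 9, keep the original
  zstd -b1 -e9 photos.dump     # benchmark zstd levels 1 through 9 on the same file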