gzip is a legacy algorithm that imo only gets used for compatibility with older software that understands nothing but gzip.
My first try would've been to copy the db file, gzip it and then transfer it, but I can't tell how much compression helps on the binary format.
duckdb -c "attach 'sqlite-database.db' as db; copy db.table_name to 'table_name.parquet' (format parquet, compression zstd)"
In my test database this is about 20% smaller than the gzipped text SQL statements.
You'd want to do this:
duckdb -c "ATTACH 'sqlite-database.db' (READ-ONLY); EXPORT DATABASE 'target_directory' (FORMAT parquet, COMPRESSION zstd)"
Also I wonder how big your test database is and what its schema looks like. For large tables Parquet is way more efficient than a 20% reduction. If there are UUIDs, they're 36 bytes each in text mode and 16 bytes as binary in Parquet. And then if they repeat you can use a dictionary in your Parquet to store the 16 bytes only once.
It's also worth trying brotli instead of zstd if small files are your goal.
-z, --compress compress file data during the transfer
--compress-level=NUM explicitly set compression level
Probably it's faster to compress with gzip first and transfer later. But it's nice to have the option to improve the transfer with a flag.
sqlite transaction- and WAL-aware rsync with in-flight compression.
For every other network, you should compress as you are likely dealing with multiple tenants that would all like a piece of your 40Gbps bandwidth.
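If you go that route, plain rsync with in-flight compression looks something like this (host and paths are made up):

rsync -avz --compress-level=6 user@server:/var/data/app.sqlite3 ./app.sqlite3

If both ends run rsync >= 3.2 you can also pick the codec with --compress-choice=zstd instead of relying on the default.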
When I do stuff like this, I stream the dump straight into gzip. (You can usually figure out a way to stream directly to the destination without an intermediate file at all.)
Plus this way it stays stored compressed at its destination, if your purpose is backup rather than a poor man's replication.
ssh username@server "sqlite3 my_remote_database.db .dump | gzip -c" | gunzip -c | sqlite3 my_local_database.db
Anyway I don't think the database file size was really an issue, it was a relatively big schema but not many indices and performance wasn't a big consideration - hence why the backend would concatenate query results into an XML file, then pass it through an xml->json converter, causing 1-2 second response times on most requests. I worked on a rewrite using Go where requests were more like 10-15 milliseconds.
But, I still used sqlite because that was actually a pretty good solution for the problem at hand; relatively low concurrency (up to 10 active simultaneous users), no server-side dependencies or installation needed, etc.
You could do a filesystem snapshot and copy from that, but neither cp nor rsync is atomic.
And then there is also https://www.sqlite.org/rsync.html
What I ended up doing is creating a new database, pg_restore'ing into that one with --no-owner and --no-acl, forcibly dropping the old database, and then renaming the new to the old one's name. This has the benefit of not leaving me high and dry should there be an issue with restoring.
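Roughly this shape, with made-up database and dump names (DROP DATABASE ... WITH (FORCE) needs PostgreSQL 13+):

createdb mydb_new
pg_restore --no-owner --no-acl -d mydb_new mydb.dump
psql -c "DROP DATABASE mydb WITH (FORCE)"
psql -c "ALTER DATABASE mydb_new RENAME TO mydb"

If the restore fails you still have the old database untouched, which is the whole point.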
Another alternative is to skip compression of the dump output, let rsync calculate the differences from a previous uncompressed dump to the current dump, then have rsync compress the change sets it sends over the network. (rsync -z)
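A sketch of that flow, with invented paths (keep the previous dump around locally so rsync has something to diff against):

ssh user@server "sqlite3 /var/data/app.sqlite3 .dump > /tmp/app.sql"
rsync -z user@server:/tmp/app.sql ./app.sql
rm -f fresh_copy.sqlite3 && sqlite3 fresh_copy.sqlite3 < app.sql

Only the last step rebuilds the local database; the rsync call just refreshes the plain-text dump.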
> Downloading a 250MB database from my web server takes about a minute over my home Internet connection
So for the original 3.4GB database that's nearly 15 minutes waiting for the download.
As others have mentioned an incremental rsync would be much faster, but what bothers me the most is that he claims that sending SQL statements is faster than sending the database while COMPLETELY omitting the fact that you have to execute these statements. And then run optimize. And then run vacuum.
Currently I have a scenario in which I have to "incrementally rebuild *" a database from CSV files. In my particular case recreating the database from scratch is the better option, but despite heavy optimization it still takes half an hour just to run the batch inserts into an empty in-memory database, create the indexes, etc.
I create a new in-mem db, run the schema and then import every table in one single transaction (my testing showed that it doesn't matter whether it's one big batch or many individual inserts, as long as they're part of a single transaction).
I do a single string replacement on every CSV line to handle an edge case. This results in roughly 15 million inserts per minute (give or take, depending on table length and complexity). 450k inserts per second is a magic barrier I can't break.
I then run several queries to remove unwanted data, trim orphans, add indexes, and finally run optimize and vacuum.
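For anyone curious, the general shape of that kind of import looks roughly like this (table names, the orphan-trimming query and the index are invented stand-ins; it assumes headerless CSVs and that the schema script creates the tables up front):

sqlite3 working.db <<'EOF'
PRAGMA journal_mode = OFF;  -- no rollback journal while rebuilding from scratch
.read schema.sql
.mode csv
BEGIN;                      -- one transaction around all of the imports
.import items.csv items
.import orders.csv orders
COMMIT;
DELETE FROM orders WHERE item_id NOT IN (SELECT id FROM items);  -- trim orphans
CREATE INDEX idx_orders_item ON orders(item_id);                 -- indexes only after the data is in
PRAGMA optimize;
VACUUM;
EOF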
Here's a quite recent log (on a stock Ryzen 5900X):
08:43 import
13:30 delete non-essentials
18:52 delete orphans
19:23 create indexes
19:24 optimize
20:26 vacuum
It's a very good writeup on how to do fast inserts in sqlite3
CREATE INDEX then INSERT vs. INSERT then CREATE INDEX
i.e. they only time INSERTs, not the CREATE INDEX after all the INSERTs.
For my use case (recreating in-memory from scratch) it basically boils down to three points: (1) journal_mode = off (2) wrapping all inserts in a single transaction (3) indexes after inserts.
For whatever it's worth I'm getting 15M inserts per minute on average, and topping around 450k/s for trivial relationship table on a stock Ryzen 5900X using built-in sqlite from NodeJS.
Idea for an offline-first app, where each app install can pull a changeset and apply it to its local db.
In one of my projects I have implemented my own poor man’s session by writing all the statements and parameters into a separate database, then syncing that and replaying it. Works well enough for a ~30GB database that changes by ~0.1% every day.
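A minimal sketch of that kind of statement log, assuming a JSON encoding for the parameters (table and column names here are just illustrative, not necessarily what the parent used):

CREATE TABLE change_log (
  id INTEGER PRIMARY KEY,  -- replay order
  sql TEXT NOT NULL,       -- e.g. 'INSERT INTO items(name, qty) VALUES (?, ?)'
  params TEXT NOT NULL     -- bound parameters serialized as a JSON array
);

The replica then reads rows in id order and re-executes each statement with its parameters.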
It's not carelessness, it's performance.
Quite simply, I have a table with 4 columns -- A, B, C, D. Each column is just an 8-byte integer. It has hundreds of millions of rows. It has an index on B+C+D, an index on C+D, and one on D.
All of these are required because the user needs to be able to retrieve aggregate data based on range conditions around lots of combinations of the columns. Without all the indices, certain queries take a couple minutes. With them, each query takes milliseconds to a couple seconds.
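For concreteness, that layout is roughly the following (table name, column names and the sample query are invented):

CREATE TABLE samples (a INTEGER, b INTEGER, c INTEGER, d INTEGER);
CREATE INDEX idx_bcd ON samples(b, c, d);
CREATE INDEX idx_cd ON samples(c, d);
CREATE INDEX idx_d ON samples(d);
-- typical shape of a query these indexes keep fast:
SELECT count(*), sum(a) FROM samples WHERE b = 7 AND c BETWEEN 100 AND 200;

Each index starts with a different leading column, which is why none of the three can cover for the others.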
I thought of every possible way to avoid having all three indices, but it just wasn't possible. It's just how performant data lookup works.
You shouldn't assume people are being careless with indices. Far too often I see the opposite.
Just ssh the machine, dump the SQL and load it back into SQLite locally.
Isn't this a case for proper database servers with replication?
Or if it's an infrequent process done for dev purposes just shut down the application doing writes on the other side?
Of course! You can't copy the file of a running, active db receiving updates; that can only result in corruption.
For replicating sqlite databases safely there is Litestream.
Do people really not understand how file storage works? I cannot rightly apprehend the confusion of ideas that would produce an attempt to copy a volatile database without synchronization and expect it to work.
Is Litestream still an active project?
There is a slight 'well akshully' on this. A DB flush and FS snapshot where you copy the snapshotted file will allow this. MSSQL VSS snapshots would be an example of this.
It’s been years since I needed to do this, but if I remember right, you can clone an entire pg db live with a `pg_backup_start()`, rsync the data directory, `pg_backup_stop()`, and rsync the WAL files written since backup start.
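From memory and heavily hedged (paths invented; on PostgreSQL 15+ the start and stop calls have to happen in the same session, hence the single psql script):

psql <<'EOF'
SELECT pg_backup_start('rsync-copy');
\! rsync -a /var/lib/postgresql/data/ backup-host:/backups/pgdata/
SELECT * FROM pg_backup_stop();
EOF
rsync -a /var/lib/postgresql/data/pg_wal/ backup-host:/backups/pgdata/pg_wal/

You'd also want to keep the backup_label contents that pg_backup_stop() returns alongside the copy.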
To push back against "only" -- there is actually one scenario where this works. Copying a file or a subvolume on Btrfs or ZFS can be done atomically, so if it's an ACID database or an LSM tree, in the worst case it will just rollback. Of course, if it's multiple files you have to take care to wrap them in a subvolume so that all of them are copied in the same transaction, simply using `cp --reflink=always` won't do.
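For example on Btrfs, assuming the database directory is its own subvolume (paths made up):

btrfs subvolume snapshot -r /srv/appdata /srv/appdata-snap
rsync -a /srv/appdata-snap/ backup-host:/backups/appdata/
btrfs subvolume delete /srv/appdata-snap

The read-only snapshot gives you a frozen point-in-time view to copy from; worst case, the copy looks like a crash the database can roll back from.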
Possibly freezing the process with SIGSTOP would yield the same result, but I wouldn't count on that
Why not just compress the whole database using `gzip` or `lz4` before rsyncing it instead? `zstd` works too, but it seems like it had a bug regarding compressing files with modified content.
Better yet, split your sqlite file into smaller pieces. It's not like it needs to contain all the app data in a single sqlite file.
Nice tricks in the article, but you can more easily use the builtin utility now :)
I blogged about how it works in detail here: https://nochlin.com/blog/how-the-new-sqlite3_rsync-utility-w...
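Usage of sqlite3_rsync is pretty much what you'd expect from rsync (names made up), e.g.:

sqlite3_rsync user@server:/var/data/app.sqlite3 ./app.sqlite3

The one catch is that the binary has to be installed on both ends, since it runs itself on the remote side over ssh.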
So I see the basic stuff needs to be repeated, since people still miss these kinds of things.
But I learned that you can easily dump SQLite to a text file - neat!