You can also run multiple instances of rsync; the problem then becomes how to efficiently divide the set of files.
It turns out, fpart does just that! Fpart is a Filesystem partitioner. It helps you sort file trees and pack them into bags (called "partitions"). It is developed in C and available under the BSD license.
It comes with an rsync wrapper, fpsync. Now I'd like to see a benchmark of that vs rclone! via https://unix.stackexchange.com/q/189878/#688469 via https://stackoverflow.com/q/24058544/#comment93435424_255320...
find a-bunch-of-files | xargs -P 10 do-something-with-a-file
-P max-procs
--max-procs=max-procs
Run up to max-procs processes at a time; the default is 1.
If max-procs is 0, xargs will run as many processes as
possible at a time.

Edit: Looks like when doing file-by-file, -I{} is still needed:
# find tmp -type f | xargs -0 ls
ls: cannot access 'tmp/b file.md'$'\n''tmp/a file.md'$'\n''tmp/c file.md'$'\n': No such file or directory

xargs -0 will use a null byte as the separator for each argument:
printf 'a\0b\0c\0' | xargs -tI{} echo "file -> {}"
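Putting the pieces above together, here's a minimal sketch of driving rsync in parallel over null-separated file names (the dest: host, /src and /backup/ paths are placeholders; fpsync does this partitioning for you, this is just the bare find/xargs version):

# Null-terminated names survive spaces/newlines; xargs fans them out to
# 8 concurrent rsync processes, 100 files per invocation. -R/--relative
# recreates the paths (relative to /src) on the destination.
cd /src &&
find . -type f -print0 |
  xargs -0 -n 100 -P 8 sh -c 'rsync -aR -- "$@" dest:/backup/' _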
Related to this is the very useful:
rclone serve restic ...
.. workflow that allows you to create append-only (immutable) backups.

This howto is not rsync.net-specific - you can follow this recipe at any standard SSH endpoint:
https://www.rsync.net/resources/notes/2025-q4-rsync.net_tech...
SSH was never really meant to be a high performance data transfer tool, and it shows. For example, it has a hardcoded maximum receive buffer of 2MiB (separate from the TCP one), which drastically limits transfer speed over high BDP links (even a fast local link, like the 10gbps one the author has). The encryption can also be a bottleneck. hpn-ssh [1] aims to solve this issue but I'm not so sure about running an ssh fork on important systems.
There's gotta be a less antisocial way though. I'd say using BBR and increasing the buffer sizes to 64 MiB does the trick in most cases.
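For reference, the Linux knobs involved look roughly like this (a sketch, values illustrative; the ceilings you actually need depend on your bandwidth-delay product):

# Switch congestion control to BBR (older kernels also want
# net.core.default_qdisc=fq for BBR) and raise the TCP buffer
# ceilings to 64 MiB; tcp_rmem/tcp_wmem are "min default max"
# and the kernel still autotunes below the max.
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl -w net.core.rmem_max=67108864
sysctl -w net.core.wmem_max=67108864
sysctl -w net.ipv4.tcp_rmem="4096 131072 67108864"
sysctl -w net.ipv4.tcp_wmem="4096 16384 67108864"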
Can we throw a bunch of AI agents at it? This sounds like a pretty tightly defined problem, much better than wasting tokens on re-inventing web browsers.
The cost of leaking data was/is catastrophic (as in, company-ending). So paying a bit of money to guarantee that your data was being sent to the right place (point to point) and couldn't leak was a worthwhile tradeoff.
For point-to-point transfer, torrenting is a lot higher overhead than you want. Plus, most clients have an anti-leeching setting, so you'd need not only a custom client, but a custom protocol as well.
The idea is sound though: have an index file and then a list of chunks to pull over multiple TCP connections.
As I understand it, this is also the approach of WEKA.io [1]. Another approach is RDMA [2] used by storage systems like Vast which pushes those order and resend tasks to NICs that support RDMA so that applications can read and write directly to the network instead of to system buffers.
0. https://en.wikipedia.org/wiki/Fast_and_Secure_Protocol
1. https://docs.weka.io/weka-system-overview/weka-client-and-mo...
2. https://en.wikipedia.org/wiki/Remote_direct_memory_access
I get 40 Gbit/s over a single localhost TCP stream on my 10-year-old laptop with iperf3.
So TCP does not seem to be a bottleneck if 40 Gbit/s is "high" enough, which it probably is currently for most people.
I have also seen plenty of situations in which TCP is faster than UDP in datacenters.
For example, on Hetzner Cloud VMs, iperf3 gets me 7 Gbit/s over TCP but only 1.5 Gbit/s over UDP. On Hetzner dedicated servers with 10 Gbit links, I get 10 Gbit/s over TCP but only 4.5 Gbit/s over UDP. But this could also be due to my use of iperf3 or its implementation.
I also suspect that TCP being a protocol whose state is inspectable by the network equipment between endpoints allows implementing higher performance, but I have not validated if that is done.
For that use case, Aspera was the best tool for the job. It's designed to be fast over links that single TCP streams couldn't saturate.
You could, if you were so bold, stack up multiple TCP links and send data down those. You got the same speed, but possibly not the same efficiency. It was a fucktonne cheaper to do though.
Do you mean literally just streaming data from one process to another on the same machine, without that data ever actually transiting a real network link? There's so many caveats to that test that it's basically worthless for evaluating what could happen on a real network.
To measure other overhead of what's claimed (TCP the protocol being slow), one should exclude other things that necessarily affect alternative protocols as well (e.g. latency) as much as possible, which is what this does.
But it's much more complicated than that; TCP interacts with latency and congestion and packet loss as both cause and effect. If you're testing TCP without sending traffic over real networks that have their own buffering and congestion control and packet reordering and loss, you're going to miss all of the most important dynamics affecting real-world performance. For example, you're not going to measure how multiplexing multiple data streams onto one TCP connection allows head of line blocking to drastically inflate the impact of a lost or reordered packet, because none of that happens when all you're testing is the speed at which your kernel can context-switch packets between local processes.
And all of that is without even beginning to touch on what happens to wireless networks.
Almost like it makes the point that arguing about "high performance" is useless without saying what that means.
That said:
> you're not going to measure how multiplexing multiple data streams onto one TCP connection
Of course not: When I want to argue against "TCP is not a high performance protocol", why would I want to measure some other protocol that multiplexes connections over TCP? That is not measuring the performance of TCP.
I could conjure any protocol that requires acknowledgement from the other side for each emitted packet before sending the next, and then claim "UDP is not high performance" when running that over UDP - that doesn't make sense.
You gave a vacuous example of a measurement of packets going nowhere. When zero work gets done, you might as well call it infinitely fast.
So it is impossible to compare the performance of TCP and UDP.
UDP is used to implement various other protocols, whose performance can be compared with TCP. Any protocol implemented over UDP must have a performance better than TCP, at least in some specific scenarios, otherwise there would be no reason for its existence.
I do not know how UDP is used by iperf3, but perhaps it uses some protocol akin to TFTP, i.e. it sends a new UDP packet when the other side acknowledges the previous UDP packet. In that case the speed of iperf3 over UDP will always be inferior to that of TCP.
Sending UDP packets without acknowledgment will always be faster than any usable transfer protocol, but the speed in this case does not provide any information about the network, only about the speed of executing a loop in the sending computer and its network-interface card.
You can transfer data without using any transfer protocol, by just sending UDP packets at maximum rate, if you accept that a fraction of the data will be lost. The fraction that is lost can be minimized, but not eliminated, by using an error-correcting code.
It does not; otherwise, per the bandwidth-delay calculation, measuring 4.5 Gbit/s would be impossible by a factor of ~100x (the ping is around the usual 0.2 ms).
With iperf3, as with many other UDP measurement tools, you set a sending rate and the other side reports how many bytes arrived.
It is a long time since I have last used iperf3, but now that you have mentioned it I have also remembered this.
So the previous poster has misinterpreted the iperf3 results by believing that UDP was slower. iperf3 cannot demonstrate a speed difference between TCP and UDP: for the former the speed is determined by the network, while for the latter the speed is determined by the "--bandwidth" iperf3 command-line option, so the poster has probably just seen some default UDP speed.
And I am quite sure that UDP is slower, because I increase `--bandwidth` until throughput stops increasing, which is at 20% of TCP's speed.
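For anyone reproducing this, a sketch of the two measurements (hostname is a placeholder). The UDP number only means something once you sweep `-b` upward and read the receiver-side report for throughput and loss:

# TCP: congestion control finds the rate on its own.
iperf3 -c myserver

# UDP: a target bitrate is required; raise it until the
# receiver-reported rate stops increasing.
iperf3 -c myserver -u -b 5G
iperf3 -c myserver -u -b 10G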
Depending on what you're doing it can be faster to leave your files in a solid archive that is less likely to be fragmented and get contiguous reads.
Source: Been in big tech for roughly ten years now trying to get servers to move packets faster
> MPLS ECMP hashing you over a single path
This is kinda like the traffic shaping I was talking about though, but fair enough. It's not an inherent limitation of a single stream, just a consequence of how your network is designed.
> a single loss event with a high BDP
I thought BBR mitigates this. Even if it doesn't, I'd still count that as a TCP stack issue.
At a large enough scale I'd say you are correct that multiple streams is inherently easier to optimize throughput for. But probably not a single 1-10gb link though.
It is. One stream gets your traffic one path to the infrastructure. Multiple streams get you multiple paths, and possibly also hit different servers to accelerate it even more. Often the limitation isn't the hardware itself but "our networking devices have 4x 10 Gbit ports instead of a single 40 Gbit port".
Especially if the link is saturated, you'd essentially be taking n times your "fair share" of bandwidth on the link.
The issue is the serialization of operations. There is overhead for each operation which translates into dead time between transfers.
However there are issues that can cause singular streams to underperform multiple streams in the real world once you reach a certain scale or face problems like packet loss.
rsync's man page says "pipelining of file transfers to minimize latency costs" and https://rsync.samba.org/how-rsync-works.html says "Rsync is heavily pipelined".
If pipelining is really in rsync, there should be no "dead time between transfers".
If copying a folder with many files is slower than tarring that folder and then moving the tar (but not counting the untar), then disk latency is your bottleneck.
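A rough way to run that comparison (paths and host are placeholders; piping through cat avoids GNU tar's shortcut of skipping file reads when the output is literally /dev/null):

# How fast can the tree be read sequentially, file by file?
time tar cf - /data/manyfiles | cat > /dev/null

# Compare against the actual copy over the network.
time rsync -a /data/manyfiles/ dest:/backup/manyfiles/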
dd is not a magic tool that can deal with block devices while others can't. You can just cp myLinuxInstallDisk.iso to /dev/myUsbDrive, too.
tar cf - *.txt | ssh user@host tar xf - -C /some/dir/

I don't know what rsync does on top of that (pipelining could mean many different things), but my empirical experience is that copying one 1 TB file is far faster than copying 1 billion 1 kB files (both sum to ~1 TB), and that load balancing/partitioning/parallelizing the tool when copying large numbers of small files leads to significant speedups, likely because the per-file overhead is hidden by the parallelism (in addition to dealing with individual copies stalling due to TCP or whatever else).
I guess the question is whether rsync is using multiple threads or otherwise accessing the filesystem in parallel, which I do not think it does, while tools like rclone, kopia, and aws sync all take advantage of parallelism (multiple ongoing file lookups and copies).
No, that is not the question. Even Wikipedia explains that rsync is single-threaded. And even if it were multithreaded "or otherwise" used concurrent file IO:
The question is whether rsync _transmission_ is pipelined or not, meaning: Does it wait for 1 file to be transferred and acknowledged before sending the data of the next?
Somebody has to go check that.
If yes: Then parallel filesystem access won't matter, because a network roundtrip has brutally higher latency than reading data sequentially off an SSD.
The dead time isn't waiting for network trips between files, it's parts of the program that sometimes can't keep up with the network.
That is extremely vague on what that is and I also didn't check that it's true.
Both the original claim "the issue is the serialization of operations" and the counter-claim sound like extreme guesswork to me. If you know for certain, please link the relevant code.
Otherwise somebody needs to go check what it actually does; everything else is just speculating "oh surely it's the files" and then people remember stuff that might just be plain wrong.
The question was what exactly rsync pipelines, and whether it serialises its network sends. If true, that would be a plausible cause of parallelism speeding it up.
Serial local reads are not a plausible cause, because the author describes working on NVMe SSDs, whose latency is so low that it cannot explain reading 59 GB across 3000 files taking 8 minutes.
However:
You might actually be half-right because in the main output shown in the blog post, the author is NOT using local SSDs. The invocation is `rsync ... /Volumes/mercury/* /Volumes/...` where `mercury` is a network share mount (and it is unspecified what kind of share that is). So in that case, every read that looks "local" to rsync is actually a network access. It is totally possible that rsync treats local reads as fast and thus they are not pipelined.
In fact, it is even highly likely that rsync will not / cannot pipeline reading files that appear local to it, because normal POSIX file IO does not really offer any way to do non-blocking reads of regular files, so the only way to do that is with threads, which rsync doesn't use.
(Extra evidence about rsync using normal blocking writes, and not supporting threads, beyond the fact that no threading code exists in rsync's repo: https://github.com/RsyncProject/rsync/blob/236417cf354220669...)
So while "the dead time isn't waiting for network trips between files" would be wrong -- it absolutely would wait for network trips between files -- your "filesystem access and general threading is the question" would be spot-on.
So in that case rclone is just faster because it reads from his network mount in parallel. This would also explain why he reports `tar` as not being faster, because that, too, reads files serially from the network mount. Supposedly this situation could be avoided by running rsync "normally" via SSH, so that file reads are actually fast on the remote side.
The situation is extra confused by the author writing below his run output:
even experimenting with running the rsync daemon instead of SSH
when in fact the output above didn't rsync over SSH at all.

Another weird thing I spotted is that the rsync output shown in the post
Unmatched data: 62947785101 B
seems impossible: the string "Unmatched data" doesn't seem to exist in the rsync source code, and hasn't since 1996. So it is unclear to me what version of rsync was used.

I commented that on https://www.jeffgeerling.com/blog/2025/4x-faster-network-fil...
But the people you responded to were talking about slowdowns that exist in general, not just ones that apply directly to the post.
For the post, my personal guess is that per-file overhead isn't a huge factor here, and it's mostly rsync having trouble doing >1Gbps over the network.
> In fact, it is even highly likely that rsync will not / cannot pipeline reading files that appear local to it, because normal POSIX file IO does not really offer any ways to non-blocking read regular files, so the only way to do that is with threads, which rsync doesn't use.
Makes sense.
> it absolutely would wait for network trips between files
I don't see why you're saying this. I expect it to serially read files and then put that data into a buffer that can have data from multiple files at the same time. In other words, pipelined networking. As long as the transfer queue doesn't bottom out it shouldn't have to wait for any network round trips. What leads you to think otherwise?
I think that's incorrect though. These slowdowns do not exist in general (see my next reply where I run rsync and it immediately maxes out my 10 Gbit/s).
I think original poster digiown is right with "Note there is no intrinsic reason running multiple streams should be faster than one [EDIT: 'at this scale']. It almost always indicates some bottleneck in the application". In this case it's the user running rsync as a serially-reading program reading from a network mount.
> rsync having trouble doing >1Gbps over the network
rsync copies at 10 Gbit/s without problem between my machines.
Though I have to give `-e 'ssh -c aes256-gcm@openssh.com'` or aes128-gcm, otherwise encryption bottlenecks at 5 Gbit/s with the default `chacha20-poly1305@openssh.com`.
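Concretely, an invocation along these lines (host and paths are placeholders):

# Pin a cipher with hardware AES support so encryption isn't the bottleneck.
rsync -a --info=progress2 -e 'ssh -c aes128-gcm@openssh.com' /data/ fastbox:/data/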
> I don't see why you're saying this.
Because of the part you agreed makes sense: it reads each file with the sequence `open()/read()/.../read()/close()`, but those files are on the network mount ("/Volumes/mercury"), so each `read()` of size `#define IO_BUFFER_SIZE (32*1024)` is a network roundtrip.
Though I wonder what the actual delay is. The numbers in the post implied several milliseconds, enough to maybe account for 30 seconds of the 8 minutes. But maybe changing files resets the transfer speed a bunch.
If I assume 0.2 ms ping and each rsync read() is a roundtrip, I arrive at 6.4 minutes = 62955918871 B / (32*1024 B) * 0.0002 s / (60 s/min).
Radxa Orion O6/.DS_Store, 6KB at 4.8MBps, that means it took 1.3ms to transfer
Radxa Orion O6/Micro Center Visit Details.pages, 141KB at 9.8MBps, 14ms
Radxa Orion O6/Radxa Orion O6.md, 19KB at 1.9MBps, 10ms
We know the link can do well over 100MBps, so that time is almost all overhead. But it doesn't seem to be a simple delay. Perhaps a fresh TCP window scaling up on a per-file basis? That would be an unfortunate filesystem design.
Since the two 4MB files both get up to ~100MBps, the same speed as the 250MB file, it seems like the maximum impact from switching files isn't much more than 15ms. If the average is below 10ms then we're looking at half a minute wasted over 3564 files. If the average is 20ms then switching files is responsible for 71 seconds wasted.
By that estimate the file-level serialization is a real issue, but the bigger issue is whatever's preventing that 250MB file from ramping up all the way to 10Gbps.
That's because of fast paths:
- For a large file, assuming the disk isn't fragmented to hell and beyond, there isn't much to do for rsync / the kernel: the source reads data and copies it to the network socket, the receiver copies data from the incoming network socket to the disk, the kernel just dumps it in sequence directly to the disk, that's it.
- The slightly less performant path is on a fragmented disk. Source and network still don't have much to do, but the kernel has a bit more work every now and then to find a contiguous block on the disk to write the data to. For spinning-rust HDDs, the disk also has to do some seeking.
- Many small files? Now that's more nasty. First, the source side has to do a lot of stat(2) calls to get basic attributes of the file. For HDDs, that seeking can incur a sometimes significant latency penalty as well. Then, this information needs to be transferred to the destination, the destination has to do the same stat call again, and then the source needs to transfer the data, involving more seeking, and the destination has to write it.
- The utter worst case is when the files are plenty and small, but large enough to not fit into an inode as inline data [1]. That means two writes and thus seeks per small file. Utterly disastrous for performance.
And that's before stepping into stuff such as systems disabling write caches, soft-RAID (or the impact of RAID in general), journaling filesystems, filesystems with additional metadata...
[1] https://archive.kernel.org/oldwiki/ext4.wiki.kernel.org/inde...
Yeah, this has been my experience with low-overhead streams as well.
Interestingly, I see a ubiquity of this "open more streams to send more data" pattern all over the place for file transfer tooling.
Recent ones that come to mind have been BackBlaze's CLI (B2) and taking a peek at Amazon's SDK for S3 uploads with Wireshark. (What do they know that we don't seem to think we know?)
It seems like they're all doing this? Which is maybe odd, because when I analyse what Plex or Netflix is doing, it's not the same? They do what you're suggesting, tune the application + TCP/UDP stack. Though that could be due to their 1-to-1 streaming use case.
There is overhead somewhere and they're trying to get past it via semi-brute-force methods (in my opinion).
I wonder if there is a serialization or loss handling problem that we could be glossing over here?
cuz in my experience no one is doing that tbh
Its baseline tuning seems to just assume large files, it does no auto-scaling, and it's mostly single-threaded.
Then even when tuning it's still painfully slow, again seemingly limited by its CPU processing and mostly on a single thread, which is highly annoying.
Especially when you’re running it on a high core, fast storage, large internet connection machine.
Just feels like there is a large amount of untapped potential in the machines…
I used B2 as a third leg for our backups and pretty much had to give rclone more connections at once, because the defaults were nowhere close to saturating the bandwidth.
When we were doing 100 TB backups of storage servers we had a wrapper that ran multiple rsyncs over the filesystem; that got throughput up to about 20 gigabits per second over the LAN.
If the server side scales (as cloud services do) it might end up using different end points for the parallel connections and saturate the bandwidth better. One server instance might be serving other clients as well and can't fill one particular client's pipe entirely.
The whole reason for the existence of TCP is to overcome the throughput limit imposed by latency. On a network with negligible latency there is no need for TCP (you could just send each packet only after the previous one is acknowledged); but the higher the throughput of your network interface, the less likely it is that the latency can be considered negligible.
However, for latency to not matter, the TCP windows must be large enough (i.e. the amount of data that is sent before an acknowledgment is received, which arrives after a delay caused by latency).
I use Windows very rarely today, so I do not know its current status, but until the Windows XP days it was very frequent for Windows computers to have very small default TCP window sizes, which caused low throughput on high-latency networks, so on such networks they had to be reconfigured.
On high-latency networks, opening multiple connections is just a workaround for not having appropriate network settings. However, even when your own computer is configured optimally, opening multiple connections can be a workaround against various kinds of throttling implemented either by some intermediate ISP or by the destination server, though nowadays most rate limits are applied globally, to all connections from the same IP address, in order to make this workaround ineffective.
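The window sizing rule is just bandwidth times round-trip time; a quick illustrative calculation (numbers made up):

# Window needed = bandwidth-delay product (BDP); anything smaller
# caps throughput at window / RTT.
# e.g. 10 Gbit/s with 20 ms RTT:
echo "10*1000*1000*1000 * 0.020 / 8 / 1024 / 1024" | bc -l   # ~23.8 MiB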
For completeness, I want to add:
The 2 MiB limit is per SSH "channel" -- the SSH protocol multiplexes multiple independent transmission channels over TCP [1], and each one has its own window size.
rsync and `cat | ssh | cat` only use a single channel, so if their counterparty is an OpenSSH sshd server, their throughput is limited by the 2MiB window limit.
rclone seems to be able to use multiple SSH channels over a single connection; I believe this is what the `--sftp-concurrency` setting [2] controls.
Some more discussion about the 2MiB limit and links to work for upstreaming a removal of these limits can be found in my post [3].
Looking into it just now, I found that the SSH protocol itself already supports dynamically growing per-channel window sizes with `CHANNEL_WINDOW_ADJUST`, and OpenSSH seems to generally implement that. I don't fully grasp why it doesn't just use that to extend as needed.
I also found that there's an official `no-flow-control` extension with the description
> channel behaves as if all window sizes are infinite.
>
> This extension is intended for, but not limited to, use by file transfer applications that are only going to use one channel and for which the flow control provided by SSH is an impediment, rather than a feature.
So this looks exactly as designed for rsync. But no software implements this extension!
I wrote those things down in [4].
It is frustrating to me that we're only a ~200 line patch away from "unlimited" instead of shitty SSH transfer speeds -- for >20 years!
[1]: https://datatracker.ietf.org/doc/html/rfc4254#section-5
[2]: https://rclone.org/sftp/#sftp-concurrency
[3]: https://news.ycombinator.com/item?id=40856136
[4]: https://github.com/djmdjm/openssh-portable-wip/pull/4#issuec...
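If you want to try the multi-channel route with rclone today, something like this should work (remote name and paths are placeholders; --sftp-concurrency is the number of outstanding requests per file, so treat this as a sketch rather than a tuned recommendation):

# Several SFTP requests in flight per file, and several files in parallel.
rclone copy /data/ myssh:/backup/data/ \
  --sftp-concurrency 64 \
  --transfers 8 \
  --progress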
Inherent reasons or no, it's been my experience across multiple protocols, applications, network connections and environments, and machines on both ends, that, _in fact_, splitting data up and operating using multiple streams is significantly faster.
So, ok, it might not be because of an "inherent reason", but we still have to deal with it in real life.
I think a lot of file transfer issues that occur outside of the corporate intranet world involve hardware that you don't fully control on (at least) one hand. In science, for example, transferring huge amounts of data over long distances is pretty common, and I've had to do this on boxes that had poor TCP buffer configurations. Being able to multiplex your streams in situations like this is invaluable and I'd love to see more open source software that does this effectively, especially if it can punch through a firewall.
A practical example can be `ssh -X` vs X11 over Wireguard. The lag is obvious with the former, but X11 windows from remote clients can be indistinguishable performance-wise from those of local ones with the latter.
My goal is to smooth out some of the operational rough edges I've seen companies deal with when using the tool:
- Team workspaces with role-based access control
- Event notifications & webhooks – Alerts on transfer failure or resource changes via Slack, Teams, Discord, etc.
- Centralized log storage
- Vault integrations – Connect 1Password, Doppler, or Infisical for zero-knowledge credential handling (no more plain text files with credentials)
- 10 Gbps connected infrastructure (Pro tier) – High-throughput Linux systems for large transfers

This idea that one must “give back” after receiving a gift freely given is simply silly.
And I would probably suggest to them that if they were interested in profiting from their cookies they should stop giving them away for free and make them commercial instead. They might then tell me they don’t want to spend the effort and money to commercialize their cookies, or maybe they prefer it as a hobby with no obligations to customers, or maybe they tell me they have a philosophical belief that they should give their cookies away for free for anyone to do as they please with them, including commercializing them, as long as they aren’t legally responsible for anything done with the cookies, which is why they handed me that legal contract explicitly stating that when they gave them to me in the first place.
I've adjusted threads and the various other controls rclone offers, but I still feel like I'm not seeing its true potential, because the second it hits a rate limit I can all but guarantee that job will have to be restarted with new settings.
That hasn't been true for more than 8 years now.
Source: https://github.com/rclone/rclone/blob/9abf9d38c0b80094302281...
And the PR adding it: https://github.com/rclone/rclone/pull/2622
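For what it's worth, the knobs I'd start with look something like this (a sketch; remote names and values are placeholders, and --tpslimit only helps against per-request API rate limits, not bandwidth caps):

# More parallel transfers and checkers, plus multi-threaded
# transfers of individual large files.
rclone copy src: dst:bucket/path \
  --transfers 16 \
  --checkers 32 \
  --multi-thread-streams 8 \
  --tpslimit 10 \
  --progress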
2. Do you have an example of what indexed backups would look like? I'm thinking of macOS Time Machine, where each backup only contains deltas from the last backup. Or am I completely off?
For indexing, full text indexing of backups to allow for record retrieval based on keyword or date. E.g. “images in Los Angeles before 2010” or “tax records from 2015”. If possible, low resolution thumbnails of the backups to make retrieval easier.
I think #1 (transforms) would be more generally useful for cross cloud applications, and #2 is more catered toward backups
So the question "does rclone have that" doesn't make much sense, because it usually wouldn't be rclone implementing it.
For example, zsh does it here for rsync, which actually invokes `ssh` itself:
https://github.com/zsh-users/zsh/blob/3e72a52e27d8ce8d8be0ee...
https://github.com/zsh-users/zsh/blob/3e72a52e27d8ce8d8be0ee...
That said, some CLI tools come with tools for shells to help them implement such things. E.g. `mytool completion-helper ...`
But I don't get rclone SSH completions in zsh, as it doesn't call `_remote_files` for rclone:
https://github.com/zsh-users/zsh/blob/3e72a52e27d8ce8d8be0ee...
With rsync, you upload hashes of what you have, then the source has to do all the hashing work to figure out what to send you. It's slightly more efficient, but if you are supporting even tens of downloads it's a lot of work for the source.
The other option is to send just a diff, which I believe e.g. Google Chrome does. Google invented Courgette and Zucchini which partially decompile binaries then recompile them on the other end to reduce the size of diffs. These only work for exact known previous versions, though.
I wonder if the ideas of Courgette and Zucchini can be incorporated into zsync's hashes so that you get the minimal diff, but the flexibility of not having a perfect previous version to work from.
Edit: oh I see, delta transfer only sends the changed parts of files?
You seem to be referring to the selection of candidates of files to transfer (along several possible criteria like modification time, file size or file contents using checksumming) [2]
Rsync is great. However, for huge filesystems (many files and directories) with relatively little change, you'll need to think about "assisting" it somewhat (by feeding it its candidates, obtained in a more efficient way, using --files-from=). For example: in a render farm system you would have additions of files, not really updates. Keep a list of frames that have finished rendering (in a cinematic film production this could be, e.g., 10 h/frame), and use it to feed rsync. Otherwise you'll be spending hours on rsync building its index (both sides) over huge filesystems, instead of transferring relatively few big and new files.
In workloads where you have many sync candidates (files) that have a majority of differing chunks, it might be worth rather disabling the delta-transfer algorithm (--whole-file) and saving on the tradeoffs.
[0] https://www.andrew.cmu.edu/course/15-749/READINGS/required/c...
[1] https://en.wikipedia.org/wiki/Rsync#Determining_which_parts_...
[2] https://en.wikipedia.org/wiki/Rsync#Determining_which_files_...
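A sketch of that "assisted" pattern (paths, host, and the mechanism producing the list are placeholders):

# finished-frames.txt lists paths relative to /renders, one per line,
# appended by the farm as each frame completes. --whole-file skips the
# delta algorithm since these files are new anyway.
rsync -a --whole-file --files-from=finished-frames.txt /renders/ backup:/renders/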
>In fact, some compression modes would actually slow things down as my energy-efficient NAS is running on some slower Arm cores
Depending on the number/type of devices in the setup and usage patterns, it can be effective sometimes to have a single more powerful router and then use it directly as a hop for security or compression (or both) to a set of lower power devices. Like, I know it's not E2EE the same way to send unencrypted data to one OPNsense router, Wireguard (or Nebula or whatever tunnel you prefer) to another over the internet, and then from there to a NAS. But if the NAS is in the same physically secure rack directly attached by hardline to the router (or via isolated switch), I don't think in practice it's significantly enough less secure at the private service level to matter. If the router is a pretty important linchpin anyway, it can be favorable to lean more heavily on that so one can go cheaper and lower power elsewhere. Not that more efficiency, hardware acceleration etc are at all bad, and conversely sometimes it might make sense to have a powerful NAS/other servers and a low power router, but there are good degrees of freedom there. Handier than ever in the current crazy times where sometimes hardware that was formerly easily and cheaply available is now a king's ransom or gone and one has to improvise.
rsync -e "ssh -o Compression=no" ...> Specifies whether to use compression. The argument must be yes or no (the default).
So I'm surprised you see speedups with your invocation.
It doesn’t do that.
Even before this conversation, a few weeks ago I started working on a modern replacement.
I'm currently working on the GUI if you're interested: https://github.com/rclone-ui/rclone-ui
From the readme:
- Warp speed Data Transfer (WDT) is an embeddedable library (and command line tool) aiming to transfer data between 2 systems as fast as possible over multiple TCP paths.
- Goal: Lowest possible total transfer time - to be only hardware limited (disc or network bandwidth not latency) and as efficient as possible (low CPU/memory/resources utilization)
You'd be astonished at how much faster even seemingly fast local IO can go when you unblock the IO
[1] https://www.youtube.com/watch?v=gaV-O6NPWrI
This is my mount configuration. What do you think? Is there anything that might be causing issues??
rclone mount google_drive: X: ^
--vfs-cache-mode full ^
--vfs-cache-max-age 24h ^
--vfs-cache-max-size 50G ^
--vfs-read-ahead 1G ^
--cache-dir "./rclone_cache" ^
--vfs-read-chunk-size 128M ^
--vfs-read-chunk-size-limit off ^
--buffer-size 128M ^
--dir-cache-time 1000h ^
--drive-chunk-size 64M ^
--poll-interval 15s ^
--vfs-cache-poll-interval 1m ^
--multi-thread-streams 32 ^
--drive-skip-shortcuts ^
--drive-acknowledge-abuse ^
--network-mode

rclone's multi-threaded transfers effectively pipeline those operations. It's the same principle as why HTTP/2 multiplexing was such a win — you stop paying the latency tax sequentially.
One thing I'd add: for local-to-local or LAN sync, rsync still often wins because the overhead of rclone's abstraction layer isn't worth it when latency is already sub-millisecond. The 4x speedup is really a story about high-latency, high-bandwidth paths where parallelism dominates.
#1 is the fastest way to send because it keeps the buffers full in a consistent stream, which allows TCP windows to grow as large as possible. #2 and #3 (scp and rsync) do round-trip acks to the remote side, which drastically slows things down, even if done in parallel.
Also, rsync will likely be faster if some files are already in the destination as it handles that intelligently as well as only transferring changed blocks from files.
You can already perform parallel transfers quite trivially by typing "find .. | xargs -P$(nproc) -n 1 rsync ..", carefully managing sync folder sets and just being a Unix god in general like me.
It doesn't seem reasonable to expect people to use and install it, when we can just use rsync.
This solution sucks in a number of different ways, and is downright incorrect. I'm baffled by the frequency at which these hacky parallel commands are recommended when rclone exists, works perfectly fine, and has literally no runtime dependencies.