Conversation

TIL rsync's best checksum is md5. This means that if you're conducting a supply chain attack on a system which uses rsync to keep the mirrors in sync, you could pollute the mirrors w/ bad copies that collide w/ md5 and even rsync --checksum wouldn't know they were modified.
It also does the entire thing in advance and it entirely blocks progress on uploading while calculating all the checksums. It's very impractical for syncing incremental changes to a very large overall amount of data. It's a lot more than just not being parallel / async.
It could start syncing changed directory/file structure and files with a mismatched modification date or time right away. For everything else, it could be calculating those checksums in a thread pool. It's not good at actually using the available network, CPU and I/O resources.
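A rough sketch of that split, not how rsync actually works: compare cheap metadata first, send anything that obviously differs right away, and queue the rest for checksumming in the background. The names `Manifest`, `scan` and `plan_sync` are made up for illustration, and the remote manifest is assumed to be available already.

```python
import os
from typing import Dict, List, Tuple

# Hypothetical structure: relative path -> (size in bytes, mtime in ns).
Manifest = Dict[str, Tuple[int, int]]


def scan(root: str) -> Manifest:
    """Walk a tree and record size and mtime for every regular file."""
    manifest: Manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            manifest[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return manifest


def plan_sync(local: Manifest, remote: Manifest) -> Tuple[List[str], List[str]]:
    """Split local files into 'send now' and 'verify by checksum later'."""
    send_now: List[str] = []
    verify_later: List[str] = []
    for path, meta in local.items():
        if remote.get(path) != meta:
            # Missing on the remote, or size/mtime mismatch: upload immediately.
            send_now.append(path)
        else:
            # Metadata matches: only a checksum can tell whether it changed.
            verify_later.append(path)
    return sorted(send_now), sorted(verify_later)
```

The `send_now` list can go straight to the uploader while `verify_later` is fed to a pool of hashing workers, so the network isn't idle during the checksum pass.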
Some people may want to minimize network usage but I'm used to using OVH where bandwidth is entirely unmetered / unlimited for nearly everything. I simply want it to go as fast as possible. I'd rather have it uploading data opportunistically while it's calculating hashes too...
I mean, it really shouldn't be. Modern secure checksums can run at 1 GB/sec+ and modern disks do 200+ MB/sec each (that's spinning rust; SSD/NVMe is even faster). Yes, if you're syncing a lot of data, it'll take time.
It has a particularly slow implementation and as you pointed out it's entirely serial so it's really bad at using the available CPU and I/O resources. Maxing out I/O while doing a ton of CPU work really requires at least using a large thread pool. Don't need fancy AIO.
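For what a plain thread pool looks like here, a minimal sketch assuming SHA-256 as the digest and 16 workers (both arbitrary picks, as are the names `hash_file` and `hash_many`). CPython's `hashlib` releases the GIL while hashing larger buffers, so ordinary threads can overlap disk reads with hashing without any async I/O machinery.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, Iterable

CHUNK = 1 << 20  # read files in 1 MiB pieces to keep memory use flat


def hash_file(path: str) -> str:
    """Return the SHA-256 hex digest of one file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def hash_many(paths: Iterable[str], workers: int = 16) -> Dict[str, str]:
    """Hash many files concurrently; returns path -> hex digest."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(hash_file, paths)))
```

The right worker count depends on the storage: a single spinning disk may prefer few concurrent readers, while NVMe or a disk array can usually keep many more busy.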