TIL rsync's best checksum is md5. This means that if you're conducting a supply chain attack on a system which uses rsync to keep the mirrors in sync, you could pollute the mirrors w/ bad copies that collide w/ md5 and even rsync --checksum wouldn't know they were modified.
It also does the entire thing in advance and it entirely blocks progress on uploading while calculating all the checksums. It's very impractical for syncing incremental changes to a very large overall amount of data. It's a lot more than just not being parallel / async.
I find rsync very impractical with checksums enabled. For large amounts of data, you need to use -t and rely on the last modified date + size checks to make it happen in a reasonable amount of time. Can use checksum mode as an integrity check but probably not for regular usage.
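As a rough illustration of the size + last-modified check described here, a minimal Python sketch follows. The function name `needs_transfer` and the `mtime_tolerance` parameter are made up for this example; this is not rsync's actual code, just the same idea of skipping files whose size and mtime already match.

```python
import os

def needs_transfer(src_path, dst_path, mtime_tolerance=0.0):
    """Hypothetical quick check: True if dst is missing or differs in size/mtime."""
    try:
        dst = os.stat(dst_path)
    except FileNotFoundError:
        return True  # nothing on the destination yet
    src = os.stat(src_path)
    if src.st_size != dst.st_size:
        return True  # different size, so the content must differ
    # Same size: fall back to comparing modification times.
    # No file content is read at any point, which is why this is fast.
    return abs(src.st_mtime - dst.st_mtime) > mtime_tolerance
```

The trade-off is the one the thread is about: a file changed without altering its size or mtime passes this check unnoticed, which is what a checksum pass is meant to catch.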
It really shouldn't be. Modern secure checksums can run at 1 GB/s+ and modern disks do 200+ MB/s each (spinning rust; SSD/NVMe are even faster). Yes, if you're syncing a lot of data, it'll take time.
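To put numbers on those throughput figures, here is a back-of-the-envelope sketch in Python. The 1 TiB dataset size is an assumption picked for illustration, as is the simplification that reading and hashing overlap perfectly so the slower stage sets the pace.

```python
TIB = 2**40   # bytes in one tebibyte
MB = 10**6    # bytes in one megabyte
GB = 10**9    # bytes in one gigabyte

def checksum_pass_seconds(total_bytes, disk_bytes_per_s, hash_bytes_per_s):
    """Best-case time for one full checksum pass, bounded by the slower stage."""
    return total_bytes / min(disk_bytes_per_s, hash_bytes_per_s)

# Assumed example: 1 TiB on a single 200 MB/s disk, hashing at 1 GB/s.
secs = checksum_pass_seconds(TIB, 200 * MB, 1 * GB)
print(f"{secs / 60:.0f} minutes")  # prints "92 minutes"
```

With these figures the disk, not the hash, is the limit: about an hour and a half per side just to read everything once, even if only a small fraction of the data changed.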
It has a particularly slow implementation and as you pointed out it's entirely serial so it's really bad at using the available CPU and I/O resources. Maxing out I/O while doing a ton of CPU work really requires at least using a large thread pool. Don't need fancy AIO.
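A minimal sketch of the thread pool idea, in Python. This is not how rsync works internally; `hash_file`, `hash_files`, the choice of SHA-256, and the worker count are all assumptions for illustration. Plain blocking reads are enough here because `hashlib` releases the GIL while hashing large buffers, so several files can be read and hashed at once.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_file(path, chunk_size=1 << 20):
    """Hash one file in 1 MiB chunks; returns (path, hex digest)."""
    h = hashlib.sha256()  # SHA-256 is this example's choice, not rsync's
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return path, h.hexdigest()

def hash_files(paths, workers=8):
    """Hash many files concurrently with a pool of ordinary threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(hash_file, paths))
```

Something like `hash_files(list_of_paths, workers=16)` keeps multiple reads in flight without any async I/O machinery; the right worker count depends on the storage and would need measuring.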
Part of why I don't use the checksum feature much is that unless it's an automated background job with low frequency, I tend to actually want it to do something and I'm waiting for it to finish the work. I'd be unblocked once it finishes the main work and is simply doing checksums.
I'm often syncing 1 TiB of data where only a few gigabytes have changed. The difference between using checksums and not using them is enormous. Could partially get what I want by doing 2 passes: a fast sync first, then a 2nd pass that uses checksums to verify afterwards.
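The two-pass idea could be scripted roughly like this. It is a sketch under assumptions: the helper names are made up, and `-a` (which implies `-t`) plus `--checksum` are just one plausible pair of invocations, not a recommended setup.

```python
import subprocess

def two_pass_commands(src, dst):
    """Build the argv lists for a fast pass followed by a checksum pass."""
    # Pass 1: default quick check (size + mtime); -a implies -t.
    fast = ["rsync", "-a", src, dst]
    # Pass 2: same sync, but compare whole-file checksums instead.
    verify = ["rsync", "-a", "--checksum", src, dst]
    return [fast, verify]

def run_two_pass(src, dst):
    """Run both passes in order, stopping if either one fails."""
    for cmd in two_pass_commands(src, dst):
        subprocess.run(cmd, check=True)
```

The point of the split is that the work you're waiting on finishes after the first pass, and the slow checksum pass can run afterwards without blocking anything.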