Conversation

TIL rsync's best checksum is md5. This means that if you're conducting a supply chain attack on a system which uses rsync to keep the mirrors in sync, you could pollute the mirrors w/ bad copies that collide w/ md5 and even rsync --checksum wouldn't know they were modified.
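To make the attack concrete, here's a minimal sketch (my own simplification, not rsync's actual source) of the decision `rsync --checksum` makes: a file whose digest matches the destination's is considered unchanged and is not re-transferred. An attacker who plants a different file with the same MD5 (collisions have been practical since 2004) would be invisible to this check.

```python
# Illustrative sketch of the --checksum decision, NOT rsync's real code:
# a file is skipped when the source and destination MD5 digests match.
import hashlib

def needs_transfer(src_bytes: bytes, dst_bytes: bytes) -> bool:
    """Return True if rsync-style MD5 comparison would re-send the file."""
    return hashlib.md5(src_bytes).digest() != hashlib.md5(dst_bytes).digest()

# Normally this catches tampering:
assert needs_transfer(b"genuine package", b"tampered package")
# But for a crafted collision pair -- two DIFFERENT byte strings with the
# SAME MD5 -- needs_transfer() returns False, so the tampered copy is kept.
```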
Replying to
It also computes every checksum up front, and blocks all upload progress while it does. That makes it very impractical for syncing incremental changes to a very large overall amount of data. It's a lot more than just not being parallel / async.
Replying to and
I find rsync very impractical with checksums enabled. For large amounts of data, you need to use -t and rely on the last-modified-time + size checks to finish in a reasonable amount of time. Checksum mode works as an occasional integrity check, but probably not for regular use.
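The size + mtime comparison being described is rsync's default "quick check" (which is why -t, preserving modification times, matters). A rough sketch of that check, again my own illustration rather than rsync's code:

```python
# Sketch of rsync's default "quick check": a file is assumed unchanged
# when its size and modification time both match. No file contents are
# read, which is why it's so much faster than --checksum.
import os

def quick_check_unchanged(src_path: str, dst_path: str) -> bool:
    s, d = os.stat(src_path), os.stat(dst_path)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)
```

The trade-off is exactly the one in the thread: a file whose contents changed but whose size and mtime were restored (corruption, or a deliberate attack) passes this check.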
Replying to
It really shouldn't be: modern secure checksums can run at 1 GB/sec+ and modern disks at 200+ MB/sec each (spinning rust; SSD/NVMe even faster). Yes, if you're syncing a lot of data, it'll still take time.
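Back-of-envelope arithmetic on those numbers (the figures are the tweet's, the 10 TB data set is my own example): even when hashing is CPU-cheap, checksumming is bounded by how fast the disk can read the data back.

```python
# Hashing at 1 GB/s is not the bottleneck; reading the data is.
# Example figure (mine): a 10 TB data set on a 200 MB/s spinning disk.
data_tb = 10
disk_mb_s = 200                       # per the tweet; SSD/NVMe are faster
seconds = data_tb * 1024 * 1024 / disk_mb_s
hours = seconds / 3600                # ~14.6 hours just to READ 10 TB
```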
Replying to and
I find the main issue is that it simply waits to calculate all of those hashes before it actually starts doing any of the syncing work. It should really be doing as much syncing as it can based on changed file / directory structure and times / sizes before worrying about that.
Replying to and
The checksums should be treated as a way to find corrupted data. It's a lower priority than the more obvious syncing work and it can be done in parallel. It only has to do that for files which look like they shouldn't have any changes. It doesn't do anything like that though.
Replying to
Yeah, I want it to saturate the bandwidth and disk I/O. It may need to spend some time after the main sync completes finishing up all the hashes in parallel but it shouldn't block transferring all the data that obviously needs to be transferred based on non-hash-based checks.
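The design being proposed in the last few replies could be sketched like this. This is a hypothetical pipeline, not anything rsync does: transfer whatever fails the size/mtime quick check right away, and hash-verify the files that look unchanged in parallel, off the critical path. All names and the dict-based file model are my own illustration.

```python
# Hypothetical sketch of the proposed pipeline (NOT rsync's behavior):
# 1) files failing the size/mtime quick check are transferred immediately;
# 2) files that look unchanged are hash-verified in parallel afterwards.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def quick_check_same(a: dict, b: dict) -> bool:
    return a["size"] == b["size"] and a["mtime"] == b["mtime"]

def sync(src: dict, dst: dict):
    transferred, suspect = [], []
    for name, meta in src.items():
        if name not in dst or not quick_check_same(meta, dst[name]):
            transferred.append(name)   # obvious work: start copying now
        else:
            suspect.append(name)       # looks unchanged; verify lazily

    def mismatched(name):
        digest = lambda m: hashlib.md5(m["data"]).hexdigest()
        return None if digest(src[name]) == digest(dst[name]) else name

    # Hash verification runs in a thread pool, so it can overlap with
    # (or follow) the main transfer instead of blocking it up front.
    with ThreadPoolExecutor() as pool:
        corrupted = [n for n in pool.map(mismatched, suspect) if n]
    return transferred, corrupted
```

The key property is that `transferred` is known before a single hash is computed, so bandwidth and disk I/O can be saturated immediately; `corrupted` catches the quiet bit-rot cases later.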