Rsyncable gzip

This article was first written in February 2005 for the BeezNest technical
website (http://glasnost.beeznest.org/articles/206).

GZIP="--rsyncable" tar zcvf toto.tar.gz /toto
Why do you need this special option? Because if you compress your files before synchronising them with rsync, a very small change in one original file may force rsync to re-transmit the whole compressed tar.gz file, instead of just the changed portion.

The basic reason is that rsync works at the byte level: very roughly, it compares the old copy of the file with the latest source, and transmits only the bytes that differ, so as to make the old copy identical to the new one. rsync uses a smart way of doing these comparisons, so that in most cases only a tiny portion of the file actually needs to be transmitted. Unfortunately, file compression algorithms which use an adaptive compression method (as most do) defeat the rsync logic and can cause the whole file to be retransmitted, even if only one byte has changed.

Why is that so? An adaptive compression method uses an analysis of the bytes already processed to determine how best to compress the following bytes of the file. For example, suppose the compression program starts at byte 0 with a certain compression method. After 1000 bytes have been compressed, the program will calculate a new compression method, based on what it found in bytes 0-999. It will then insert a new compression table into the file, and use this table to compress the next 1000 bytes. Then it recalculates its compression table based on bytes 0-1999, does the same again, and so on.

This means that a change of one byte in bytes 0-999 can potentially change the compression method for the rest of the file, so that the rest of the output bytes will be totally different. And because rsync compares the files byte by byte, it will not find any matching block of bytes between the old and new file, and will thus be forced to resend the whole new compressed file.

The --rsyncable option above fixes this problem. With this option, gzip will regularly "reset" its compression algorithm to what it was at the beginning of the file.
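The effect of such resets can be tried out with a small Python sketch. It is only an illustration under stated assumptions, not a reimplementation of gzip: it uses the raw deflate compressor from Python's zlib module, it resets at a fixed interval (the BLOCK constant and all function names are made up for this example), and the single changed byte keeps the input length the same. As far as I know, the real --rsyncable option picks its reset points from the data itself rather than at fixed offsets, which is what lets it cope with insertions and deletions too.

```python
import zlib

BLOCK = 4096  # hypothetical reset interval, chosen only for this sketch


def deflate_plain(data: bytes) -> bytes:
    """Raw deflate in one go: the compressor keeps its history throughout."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    return comp.compress(data) + comp.flush()


def deflate_with_resets(data: bytes, block: int = BLOCK) -> bytes:
    """Raw deflate that forgets its history after every `block` input bytes."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    parts = []
    for start in range(0, len(data), block):
        parts.append(comp.compress(data[start:start + block]))
        # Z_FULL_FLUSH byte-aligns the output and discards the history,
        # so later output no longer depends on earlier input.
        parts.append(comp.flush(zlib.Z_FULL_FLUSH))
    parts.append(comp.flush())  # default mode Z_FINISH: terminate the stream
    return b"".join(parts)


def shared_tail(a: bytes, b: bytes) -> int:
    """Number of identical bytes at the end of both byte strings."""
    n = 0
    limit = min(len(a), len(b))
    while n < limit and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


# Deterministic sample input standing in for the uncompressed tar file.
old = "".join(
    f"record {i}: value {i * i % 9973}\n" for i in range(30000)
).encode()
changed = bytearray(old)
changed[23] ^= 0x01  # flip one bit of byte 23, near the start
new = bytes(changed)

for name, fn in (("plain", deflate_plain), ("with resets", deflate_with_resets)):
    a, b = fn(old), fn(new)
    print(f"{name:12} size={len(b):7} identical tail={shared_tail(a, b):7}")
```

With resets, everything after the first compressed block should come out byte-for-byte identical in both outputs, at the price of a somewhat larger file. Without resets, the identical tail is usually much shorter, because the change shifts or alters the rest of the compressed stream.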
So if, for example, there was a change at byte 23, this change will only affect the output up to, at most, (say) byte 9999. Then gzip will restart "at zero", and the rest of the compressed output will be the same as it was without the change to byte 23. This means that rsync will now be able to re-synchronise between the old and new compressed file, and can avoid sending the portions of the file that were not modified.

Now, for the example above, suppose "/toto" is a directory with plenty of small files, for a total of 50 MB, so that the uncompressed tar file would be about 50 MB. By compressing it with gzip, we bring this down to 15 MB in the tar.gz file. Now we rsync this file with a remote system. If nothing has changed since yesterday in the /toto directory, the tar.gz file will be the same as yesterday's; rsync will detect this and the file will not be transmitted.

On the other hand, if one single small file at the beginning of the tar has changed, then without the --rsyncable option most of the tar.gz file will be different, and rsync will have to transmit almost 15 MB to the remote rsync target system. In that case, it would have been better not to compress the tar file at all! With the --rsyncable option, it is possible that only 1000 bytes would be different in the tar.gz file, so only 1000 bytes would be transmitted by rsync, for the same end result.

References:
For an rsync intro, see here
For a full explanation (and only for Real Programmers), see here
There is also a good summary of the whole rsync/gzip/debian situation here

Comments

thanks for explaining this. Very easy for me to understand now. i'll be using this GZIP feature for my transfers from now on. Cheers,

Felipe

I recently stumbled on that in the man page for gzip, and I also recently started rsyncing a few hundred gzips to a host, so it was quite useful. Glad to have an explanation.