Patches for binary packages

Posted 25 Jul 2001 at 14:02 UTC by gary

In a couple of days I lose my 10MB/s connection to the web and switch to a lowly 64KB/s one instead, so thoughts of bandwidth conservation have been crossing my mind recently. Thinking about automated installers -- apt, up2date, Red Carpet -- made me wonder how feasible it would be to make binary patch rpms and debs.

It's an age-old solution to an age-old problem -- source is routinely distributed as patches. Why not binary packages?

Obviously I'm not talking about a straight diff on the package files, but even a package file that contained only the files which had changed between releases could yield significant savings. Getting slightly cleverer, some logic to recognise and deal with things that won't diff well (gzipped manpages, for instance) would help.
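As a sketch of what "dealing with things that won't diff well" might mean -- entirely hypothetical, not a feature of rpm or dpkg -- a per-file delta routine could decompress gzipped members before diffing, since a tiny edit to the underlying text can rewrite most of the compressed stream:

```python
import difflib
import gzip

def member_delta(name, old_bytes, new_bytes):
    """Hypothetical per-file delta for one package member.

    Gzipped members (manpages, changelogs) are diffed on their
    decompressed payloads; diffing the compressed bytes directly
    would capture almost nothing in common between versions.
    """
    if name.endswith(".gz"):
        old_bytes = gzip.decompress(old_bytes)
        new_bytes = gzip.decompress(new_bytes)
    old_lines = old_bytes.decode("utf-8", "replace").splitlines(keepends=True)
    new_lines = new_bytes.decode("utf-8", "replace").splitlines(keepends=True)
    return "".join(difflib.unified_diff(old_lines, new_lines, name, name))
```

For a one-line change in a long manpage, the resulting delta is a handful of lines instead of a whole re-shipped file.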

As a taster, this is how I'd envisage using it on an RPM system:

Packager

% ls
package-1.0.0-1.i386.rpm     package-1.0.1-1.i386.rpm
% rpm --diff package-1.0.0-1.i386.rpm package-1.0.1-1.i386.rpm > package-1.0.0-1-1.0.1-1.patch.i386.rpm
User
% rpm --query package
package-1.0.0-1
% rpm --update-from-patch package-1.0.0-1-1.0.1-1.patch.i386.rpm
% rpm --query package
package-1.0.1-1
I'd imagine that this sort of thing would be fantastic for all distributions, and for people like Ximian -- think how much they would save on bandwidth if red-carpet, up2date and apt only downloaded the differences.

Thoughts anyone?


Already exists for Debian, posted 25 Jul 2001 at 16:13 UTC by gary » (Master)

tjansen wrote, by email:

Last year I had the same thought and made a diff system for Debian deb files. It worked quite well and the size savings were reasonable, too. However, when I proposed it on the Debian package list the developers were not convinced that it was such a good idea. The main problem was that diffs increase the amount of storage needed on the server. As Debian mirrors are usually donated and run on old hardware, the cost of hard disk space is higher than that of bandwidth (which is often free for the maintainer of the server). This, of course, is a Debian-specific problem. You can find the code I wrote at http://www.tjansen.de/debiff/.

Adapt rsync, posted 25 Jul 2001 at 23:41 UTC by danwang » (Master)

Perhaps you could use the rsync algorithm and modify the tools to do this. Using rsync would avoid the need to store diffs on the server machine.

See http://samba.anu.edu.au/rsync/
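For reference, the heart of the rsync algorithm is a weak rolling checksum that can slide along the file one byte at a time in O(1). A minimal sketch of that scheme (simplified; real rsync pairs this with a strong hash to confirm matches):

```python
def weak_checksum(block):
    """rsync-style weak checksum of one block: s1 is the plain byte
    sum, s2 weights each byte by its distance from the block's end."""
    s1 = sum(block) % 65536
    s2 = sum((len(block) - i) * b for i, b in enumerate(block)) % 65536
    return (s2 << 16) | s1

def roll(checksum, out_byte, in_byte, blocksize):
    """Slide the window one byte to the right in O(1): drop out_byte,
    add in_byte, without re-reading the whole block."""
    s1 = checksum & 0xFFFF
    s2 = (checksum >> 16) & 0xFFFF
    s1 = (s1 - out_byte + in_byte) % 65536
    s2 = (s2 - blocksize * out_byte + s1) % 65536
    return (s2 << 16) | s1
```

Because the checksum rolls cheaply, the sender can test every byte offset of the new file against the receiver's block list, which is what lets rsync find matching data without storing diffs on the server.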

Re: RSync, posted 26 Jul 2001 at 06:24 UTC by thom » (Master)

Using gzip --rsyncable and rsync, Debian packages can easily be transmitted as just a delta. RPMs, I don't think, are quite so easy.
Unfortunately gzip's maintainers haven't accepted the rsyncable patch, and I don't think they are likely to in the near future.

Hmmm, posted 26 Jul 2001 at 13:34 UTC by jono » (Master)

I think this is quite an interesting area, and perhaps the Debian diff software should be developed further. It certainly makes logical sense. I don't know enough about packages yet to make a judgement on an implementation.

Definitely a good idea, posted 26 Jul 2001 at 15:31 UTC by itp » (Master)

Distributing updates as binary patches is something that we (Ximian) have been looking at for a while. It definitely makes sense, and informal testing indicates that with both RPM and .deb packages, binary patches do help for at least some version deltas.

There are two main ways to approach the problem. Binary patches can easily be generated on the server given a complete revision history and copies of all released packages. This way, users need only transmit a checksum of the package they have. The diff can be generated on the server, although it remains to be seen how intensive this process will be in real-world conditions.
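The server-side approach could be as simple as an index keyed on the client's checksum -- a hypothetical sketch with invented names, assuming the server keeps a full revision history and pre-generates patches:

```python
import hashlib

# Hypothetical server-side index: patches are pre-generated per
# (old-package checksum, new version) pair, so the client only has
# to transmit the checksum of the package it already has.
PATCHES = {}    # (old_md5, new_version) -> patch bytes
RELEASES = {}   # new_version -> full package bytes

def register(old_pkg_bytes, new_version, patch_bytes):
    """Record a pre-generated patch from an old package to new_version."""
    key = (hashlib.md5(old_pkg_bytes).hexdigest(), new_version)
    PATCHES[key] = patch_bytes

def lookup(client_md5, new_version):
    """Return the pre-generated patch, or None if the client's
    version is unknown and the full package must be sent instead."""
    return PATCHES.get((client_md5, new_version))
```

How expensive on-demand generation would be if patches were not pre-built is, as itp says, the open question.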

The real variable is whether viable binary patches can be generated against installed packages, or whether a pristine copy of the original package is necessary. In Debian, there's a utility called dpkg-repack that allows one to construct a .deb file given only the installed payload and information from /var/lib/dpkg. RPM 4.0.3 will introduce similar functionality.

However, packages frequently contain config files, cache files, etc., which change with use. This means the reconstructed package may not be identical to the package as shipped, even when all other variables (packaging tool and compression tool versions, etc.) are held constant.

The easy way out is to assume that pristine copies of the original packages may exist on the user's machine. In the apt and RC cases, this is true (/var/cache/apt/archives, /var/cache/redcarpet/packages). This relies on the user being willing to trade some disk space, by not wiping their cache directories, for a saving in bandwidth.

(Of course, regardless of whether pristine copies are stored, it's required to have a fallback mechanism if the currently installed package is a version not available through the update mechanism, or if the calculated binary diff is actually larger than the new package.)
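The fallback rule described here fits in a few lines -- a sketch with invented names, not any real tool's API:

```python
def choose_download(patch, full_package):
    """Fall back to the full package when no patch is available for
    the installed version, or when the computed binary diff turned
    out to be larger than the new package itself."""
    if patch is None or len(patch) >= len(full_package):
        return ("full", full_package)
    return ("patch", patch)
```

Either way the client ends up with the new package; the patch path is purely an optimisation that must never be mandatory.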

Needless to say, this is something we are going to continue to look at, and hopefully future versions of RC (and other tools) will soon support binary diffs.

--
Ian Peters
Red Carpet Developer

Re: Patches for binary packages, posted 26 Jul 2001 at 16:23 UTC by yakk » (Master)

The way we were going to do it for Eazel Services was similar to what itp described - rsyncing against a local copy of the old package. The secret of rsync is that you don't need a file of the same type as what you're rsyncing it to - it just needs to share lots of blocks. This technique was used in Australia (where bandwidth is expensive) to build Debian ISO images by cat-ing all the deb files from a release together and then rsyncing against an ISO image. The same can be done to reconstruct a package - just cat/ar/tar the original files from the package together and you have a close enough approximation. Then you have to either rsync an uncompressed version of the package or use the rsync-friendly version of gzip. A lot of the hard work that the rsync server does can of course be cached - you can keep track of what set of blocks change between two package versions.
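A toy illustration of that block-sharing trick (invented helper names; real rsync additionally uses a rolling weak hash so matches can start at any byte offset): describe the new data as copies out of the old concatenation plus literal bytes.

```python
import hashlib

BLOCK = 512  # fixed block size for the sketch; real rsync negotiates this

def block_signatures(old):
    """Hash every fixed-size block of the reconstructed old data."""
    return {hashlib.md5(old[i:i + BLOCK]).hexdigest(): i
            for i in range(0, len(old), BLOCK)}

def make_delta(old, new):
    """Describe new as (copy, offset, length) ops into old, plus literals."""
    sigs = block_signatures(old)
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        block = new[i:i + BLOCK]
        off = sigs.get(hashlib.md5(block).hexdigest())
        if off is not None and old[off:off + len(block)] == block:
            if literal:
                ops.append(("literal", bytes(literal)))
                literal = bytearray()
            ops.append(("copy", off, len(block)))
            i += len(block)
        else:
            literal.append(new[i])  # no block matched here; ship the byte
            i += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops

def apply_delta(old, ops):
    """Rebuild the new data from the old data and the delta ops."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += old[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

If the cat-ed old package shares most of its blocks with the new one, nearly everything becomes cheap copy instructions and only the genuinely changed bytes travel as literals.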

One of the major problems with this is that a small change in the source for a program can cause a large change in the binary: the symbol offsets are scattered throughout the program. Compiling as PIC (position-independent code) isn't really a good idea, so this is still an open question. One idea was to keep a cache of .o files on the client, resync against those, and do linking at package install time.

One of these days I hope to have the time or energy to implement some of these ideas in apt.

Ian McKellar
Former Eazel Services Hacker

XDelta, posted 27 Jul 2001 at 00:34 UTC by avi » (Apprentice)

Since no one has mentioned it yet, XDelta is a great binary "diff" tool. You can get it from Joshua MacDonald's homepage.

XDelta, posted 27 Jul 2001 at 08:47 UTC by gary » (Master)

Kelley Spoon from Rackspace emailed me about XDelta as well, but I haven't had time to mention it until now; sorry Kelley!

Using rsync, posted 27 Jul 2001 at 08:51 UTC by gary » (Master)

Another emailed response, from Andrew Bennetts:

At linux.conf.au, Rusty Russell described a system like this for deb packages... essentially it involved using rsync, and zipping the debs with a slightly modified gzip designed to work better with rsync.

I think the slides from his talk are here:
http://users.aber.ac.uk/tgb96/rusty/rusty-aber.pdf

And the project website is here:
http://sourceforge.net/projects/apt-proxy/

Cool, posted 30 Jul 2001 at 16:39 UTC by Mulad » (Apprentice)

Wow. I've thought about binary patches before, but I had been thinking about them in the same way Sun, Microsoft, and others do. In many cases, folks end up with patches that undo what had been done before. Of course, the Linux packaging systems are much more powerful than the offerings from those companies, and there are better ways to do it, as you folks have described.

Now that I started thinking about it, I think I may have an idea as well. IIRC, RPMs include the MD5 checksums for each file (either that, or they are computed when installed). It would be possible to dynamically build packages that only contain new or updated files. Some data and patch packages could be cached to make serving a bit quicker, especially for common upgrades. This wouldn't work for FTP-only sites or ones that only have slow processors and I/O, but it might be beneficial to sites that would like to be mirrors and have appropriate hardware, but don't want to be crushed under bandwidth loads.
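Assuming per-file checksum manifests are available (RPM does keep file digests in its header), the selection step Mulad describes could be sketched like this -- hypothetical function and dict layout, not a real rpm API:

```python
def patch_manifest(old_files, new_files):
    """Given {path: md5} manifests for two package versions, decide
    what a dynamically built patch package must carry: files that
    are new or whose checksum changed, plus a list of removals."""
    changed = {path: digest for path, digest in new_files.items()
               if old_files.get(path) != digest}
    removed = sorted(set(old_files) - set(new_files))
    return changed, removed
```

Unchanged files never leave the server; the client keeps its installed copies and only fetches the `changed` set.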
