Patches for binary packages
Posted 25 Jul 2001 at 14:02 UTC by gary
In a couple of days I lose my 10MB/s connection to the web and switch to
a lowly 64KB/s one instead, so thoughts of bandwidth conservation have
been crossing my mind recently. Thinking about automated
installers -- apt, up2date, red carpet -- made me wonder how feasible it
would be to make binary patch rpms and debs.
It's an age old solution to an age old problem -- source is
routinely distributed as patches. Why not binary packages?
Obviously I'm not talking about a straight diff on the package files,
but even a package file that contained only the files which had changed
between releases could make significant savings. Getting slightly
cleverer, some logic to recognise and deal with things that won't diff
well (gzipped manpages, for instance) would help.
As a taster, this is how I'd envisage using it on an RPM system:
Packager
% ls
package-1.0.0-1.i386.rpm package-1.0.1-1.i386.rpm
% rpm --diff package-1.0.0-1.i386.rpm package-1.0.1-1.i386.rpm > package-1.0.0-1-1.0.1-1.patch.i386.rpm
User
% rpm --query package
package-1.0.0-1
% rpm --update-from-patch package-1.0.0-1-1.0.1-1.patch.i386.rpm
% rpm --query package
package-1.0.1-1
I'd imagine that this sort of thing would be fantastic for all
distributions, and for people like Ximian -- think how much they would
save on bandwidth if red-carpet, up2date and apt only downloaded the
differences.
Thoughts anyone?
tjansen wrote, by email:
Last year I had the same thought and made a diff system for Debian deb
files. It worked quite well, and the size savings were decent too.
However, when I proposed it on the Debian package list, the developers
were not convinced that it was such a good idea. The main problem was
that diffs increase the amount of storage needed on the server. As
Debian mirrors are usually donated and run on old hardware, the cost of
hard disk space is higher than that of bandwidth (which is often free
for the maintainer of the server). This, of course, is a
Debian-specific problem.
You can find the code I wrote at http://www.tjansen.de/debiff/.
Adapt the rsync, posted 25 Jul 2001 at 23:41 UTC by danwang »
(Master)
Perhaps you could use the rsync algorithm and modify the tools to do
this. Using rsync would avoid the need to store diffs on the server machine.
See http://samba.anu.edu.au/rsync/
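The block-matching idea at the heart of rsync can be sketched in a few lines of Python. This is only an illustration: real rsync uses a rolling Adler-32-style weak checksum (updated in constant time as the window slides) plus MD4, while this sketch simply recomputes the weak checksum at each offset.

```python
import hashlib

BLOCK = 8  # tiny block size so the demo is visible; rsync uses much larger blocks

def weak(chunk):
    # Adler-style weak checksum (illustrative, not rsync's exact formula)
    a = sum(chunk) % 65521
    b = sum((len(chunk) - i) * c for i, c in enumerate(chunk)) % 65521
    return (b << 16) | a

def signatures(old, block=BLOCK):
    """Checksums of each block of the file the receiver already has."""
    sigs = {}
    for off in range(0, len(old) - block + 1, block):
        chunk = old[off:off + block]
        sigs.setdefault(weak(chunk), {})[hashlib.md5(chunk).digest()] = off
    return sigs

def delta(new, sigs, block=BLOCK):
    """Describe `new` as copies of old blocks plus literal bytes.

    Real rsync rolls the weak checksum forward one byte at a time;
    this sketch recomputes it, which is O(n * block) instead."""
    ops, lit, i = [], bytearray(), 0
    while i + block <= len(new):
        chunk = new[i:i + block]
        off = sigs.get(weak(chunk), {}).get(hashlib.md5(chunk).digest())
        if off is not None:
            if lit:
                ops.append(("literal", bytes(lit)))
                lit = bytearray()
            ops.append(("copy", off))
            i += block
        else:
            lit.append(new[i])  # no match: carry this byte literally, slide one byte
            i += 1
    lit.extend(new[i:])
    if lit:
        ops.append(("literal", bytes(lit)))
    return ops

def patch(old, ops, block=BLOCK):
    """Rebuild the new file from the old file plus the delta."""
    return b"".join(old[arg:arg + block] if op == "copy" else arg
                    for op, arg in ops)
```

Only the literal runs and the (small) copy instructions cross the wire, which is where the bandwidth saving comes from.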
Re: RSync, posted 26 Jul 2001 at 06:24 UTC by thom »
(Master)
Using gzip --rsyncable and rsync, Debian packages can easily be
transmitted as just a delta.
RPMs, I don't think, are quite so easy.
Unfortunately gzip's maintainers haven't accepted the rsync patch, and I
don't think they are likely to in the near future.
Hmmm, posted 26 Jul 2001 at 13:34 UTC by jono »
(Master)
I think this is an area that is quite interesting, and maybe the
Debian diff software deserves some further work. It certainly makes
logical sense. I don't know enough about packages yet to make a
judgement on an implementation.
Distributing updates as binary patches is something that we (Ximian)
have been looking at for a while. It definitely makes sense, and
informal testing indicates that with both RPM and .deb packages, binary
patches do help for at least some version deltas.
There are two main ways to approach the problem. Binary patches can
easily be generated on the server given a complete revision history and
copies of all released packages. This way, users need only transmit a
checksum of the package they have. The diff can be generated on the
server, although it remains to be seen how intensive this process will
be in real-world conditions.
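That server side can be sketched as follows, under the assumption that the server keeps every released package keyed by its checksum and caches the diffs it generates. All names here are hypothetical, and the stand-in diff function would be a real binary-diff tool in practice.

```python
import hashlib

def trivial_diff(old, new):
    # Stand-in for a real binary-diff tool (e.g. something xdelta-like);
    # it just records the new bytes, so it saves nothing.
    return new

class PatchServer:
    """Serves binary patches keyed by the checksum the client reports."""

    def __init__(self, released_packages, diff=trivial_diff):
        # checksum -> package bytes, for every package ever released
        self.releases = {hashlib.md5(p).hexdigest(): p
                         for p in released_packages}
        self.diff = diff
        self.cache = {}  # (from_checksum, to_checksum) -> patch bytes

    def patch_for(self, client_checksum, latest_checksum):
        """Return a cached or freshly generated patch, or None when the
        client has a package the server never released and must fall
        back to downloading the full package."""
        old = self.releases.get(client_checksum)
        if old is None:
            return None
        key = (client_checksum, latest_checksum)
        if key not in self.cache:
            self.cache[key] = self.diff(old, self.releases[latest_checksum])
        return self.cache[key]
```

The cache is what keeps the per-request cost down: each (old, new) pair is diffed at most once, however many users request it.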
The real variable is whether viable binary patches can be generated
against installed packages, or whether a pristine copy of the original
is necessary. In debian, there's a utility called dpkg-repack that
allows one to construct a .deb file given only the installed payload
and information from /var/lib/dpkg. RPM 4.0.3 will introduce a similar
functionality.
However, packages frequently contain conf files, cache files, etc,
which will change over use. This means the reconstructed package may
not be identical to the package as shipped, even assuming other
variables (such as packaging-tool or compression-tool versions) are
held constant.
The easy way out is to assume that pristine copies of the original
packages may exist on the user's machine. In the apt and RC cases, this
is true (/var/cache/apt/archives, /var/cache/redcarpet/packages). This
relies on the user being willing to trade off some disk space, by not
wiping their cache directories, for a savings in bandwidth.
(Of course, regardless of whether pristine copies are stored, it's
required to have a fallback mechanism if the currently installed
package is a version not available through the update mechanism, or if
the calculated binary diff is actually larger than the new package.)
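The fallback rule in that parenthetical comes down to a few lines of client logic. This is a sketch with made-up names; no real apt or RC data structure is implied.

```python
def choose_download(installed_version, patch_sizes, full_size):
    """Pick the cheaper safe download for an update.

    patch_sizes maps a from-version to the size of the binary patch
    that brings it up to the new release (hypothetical bookkeeping).
    Fall back to the full package when no patch exists for the
    installed version, or when the patch would be larger anyway."""
    size = patch_sizes.get(installed_version)
    if size is None or size >= full_size:
        return ("full", full_size)
    return ("patch", size)
```

A user running a version the server never saw (a local rebuild, say) simply gets the full package, so the patch machinery never has to be correct for inputs it doesn't know.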
Needless to say, this is something we are going to continue to look at,
and hopefully future versions of RC (and other tools) will soon support
binary diffs.
--
Ian Peters
Red Carpet Developer
The way we were going to do it for Eazel Services was
similar to what itp described - rsyncing
against a local copy of the old package. The secret of rsync is that
you don't need a file of the same type as what you're rsyncing it to -
it just needs to share lots of blocks. This technique was used in
Australia (where bandwidth is expensive) to build Debian ISO
images by cat-ing all the deb files from a release together and
then rsyncing against an ISO image. The same can be done to
reconstruct a package - just cat/ar/tar the original files from the
package together and you have a close enough approximation.
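How much two files "share lots of blocks" can be estimated with a quick hash comparison. This sketch is block-aligned on both sides, so it understates what rsync can actually reuse: rsync's rolling checksum finds old blocks at any byte offset in the new file, which is exactly why the cat-the-debs trick works despite ISO padding and ordering.

```python
import hashlib

def block_overlap(target, local, block=4096):
    """Rough fraction of target's blocks already present in the local
    approximation (block-aligned, so a lower bound on rsync's reuse)."""
    have = {hashlib.md5(local[i:i + block]).digest()
            for i in range(0, len(local), block)}
    want = [hashlib.md5(target[i:i + block]).digest()
            for i in range(0, len(target), block)]
    return sum(h in have for h in want) / len(want)
```

The closer this fraction is to 1.0, the closer the transfer cost gets to just the genuinely new data.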
Then you have to either rsync an uncompressed version of the
package or use the rsync-friendly version of gzip. A lot of the hard
work that the rsync server does can of course be cached - you can
keep track of what set of blocks change between two package
versions.
One of the major problems with this is that a small change in
the source for a program can produce a large change in the binary:
the symbol offsets are scattered throughout the program. Compiling
as PIC (position independent code) isn't really a good idea, so this
is still an open question. One idea was to keep a cache of .o files
on the client and resync against those and do linking at package
install time.
One of these days I hope to have the time or energy to
implement some of these ideas in apt.
Ian McKellar
Former Eazel Services Hacker
XDelta, posted 27 Jul 2001 at 00:34 UTC by avi »
(Apprentice)
Since no one has mentioned it yet, XDelta is a great binary "diff" tool.
You can get it from Joshua MacDonald's homepage.
XDelta, posted 27 Jul 2001 at 08:47 UTC by gary »
(Master)
Kelley Spoon from Rackspace emailed me about XDelta as well, but I
haven't had time to mention it until now; sorry Kelley!
Using rsync, posted 27 Jul 2001 at 08:51 UTC by gary »
(Master)
Cool, posted 30 Jul 2001 at 16:39 UTC by Mulad »
(Apprentice)
Wow. I've thought about binary patches before, but I had been thinking
about them in the same way Sun, Microsoft, and others do. In many
cases, folks end up with patches that undo what had been done before.
Of course, the Linux packaging systems are much more powerful than the
offerings from those companies, and there are better ways to do it, as
you folks have described.
Now that I've started thinking about it, I think I may have an idea as
well. IIRC, RPMs include the MD5 checksums for each file (either that,
or they are computed at install time). It would be possible to
dynamically build packages that only contain new or updated files. Some
data and patch packages could be cached to make the serving a bit
quicker, especially for common upgrades. This wouldn't work for
ftp-only sites or ones that only have slow processors and I/O, but it
might be beneficial to sites that would like to be mirrors and have
appropriate hardware, but don't want to be crushed under bandwidth loads.
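Computing which files belong in such an update-only package is a straightforward manifest comparison. A sketch, assuming manifests mapping path to MD5 digest of the kind `rpm -q --dump` can provide:

```python
def changed_files(old_manifest, new_manifest):
    """Decide what an update-only package must carry.

    Manifests map path -> MD5 digest (hypothetical format here).
    Returns (files that are new or whose contents changed, files that
    disappeared); removals are reported separately since the package
    manager handles them through its own removal logic."""
    changed = sorted(p for p, digest in new_manifest.items()
                     if old_manifest.get(p) != digest)
    removed = sorted(p for p in old_manifest if p not in new_manifest)
    return changed, removed
```

Cached results for common from/to version pairs would cover most of the load, with on-demand generation only for stragglers.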