What are the issues? Why is it so important to go "distributed"?
Debian is the largest independent, and one of the longest-running, of the Free
Software Distributions in existence. There are over 1000 maintainers;
nearly 20,000 packages. There are over 40 "Primary" Mirrors, and
something like one hundred secondary mirrors (listed here - I'm stunned and
shocked at the numbers!). 14 architectures are supported - 13
Linux ports and one GNU/Hurd port but only for i386 (aww bless
iiit). A complete copy of the mirrors and their architectures,
including source code, is over 160 gigabytes.
At the last major upgrade of Debian/Stable, all the routers at the
major International fibreoptic backbone sites across the world redlined
for a week.
To say that Debian is "big" is an understatement of the first order.
Many mirror sites simply cannot cope with the requirements.
Statistics on the
Debian UK Mirror for July 2004 to June 2005 show 1.4 Terabytes of
data served. As you can see from the list of mirror sites, many
of the Secondary Mirrors and even a couple of the Primary ones have
dropped certain architectures.
security.debian.org - perhaps
the most important of all the Debian sites - is definitely overloaded
and undermirrored.
This isn't all: there are mailing lists (the statistics show almost 30,000
people on each of the announce and security lists, alone), and IRC
channels - and both of those are over-spammed. The load on the mailing
list server is so high that an idea (discussed informally at Debconf7
and outlined here later in this article, for completeness) to create an
opt-in spam/voting system for people to "vet" postings and comments, was
met with genuine concern and trepidation by the mailing list's
maintainers.
It's incredible that Debian Distribution and Development hasn't fallen
into a big steaming heap of broken pieces, with administrators, users
and ISPs all screaming at each other and wanting to scratch each
others' eyes out on the mailing lists and IRC channels, only to find
that those aren't there either.
So it's basically coming through loud and clear: "server-based"
infrastructure is simply not scalable, and the situation is only going
to get worse as time progresses. That leaves "distributed architecture"
- aka peer-to-peer architecture - as the viable alternative.
This problem has been recognised for quite some time. In fact, Debtorrent's Wiki page
describing the motivation and history points out that Debtorrent was
done as a 2006 Google "Summer of Code" project. Debtorrent hints at
the tantalising possibility of being able to reduce or entirely replace
the present "http", "ftp" and "rsync" download methods for individual
packages, leaving jigdo and bittorrent as the methods for downloading
CDs, DVDs and netboot images. Even other methods could be adapted
to use distributed download methods.
What the heck is Debtorrent, anyway, and what does it do?
Debtorrent is a modified version of bittorrent that basically first
goes and looks to see if there is a debtorrent swarm to download a
package from. Lack of response after ten seconds results in
debtorrent automatically going to an HTTP mirror. So, every time you do
an apt-get install of a package, the packages should be downloaded from
other debtorrent users rather than from the (overloaded) mirrors.
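As a rough illustration of that "swarm first, mirror second" behaviour, here is a minimal sketch. It is not Debtorrent's actual code: fetch_package, swarm_fetch and mirror_fetch are hypothetical names, and the two callables simply stand in for the real peer-to-peer and HTTP transports. Only the ten-second default comes from the description above.

```python
import concurrent.futures


def fetch_package(name, swarm_fetch, mirror_fetch, timeout=10.0):
    """Try the peer-to-peer swarm first; fall back to the HTTP mirror.

    swarm_fetch and mirror_fetch are hypothetical callables that take a
    package name and return its bytes. They stand in for the real
    transports, which are not modelled here.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(swarm_fetch, name)
    try:
        # Wait up to `timeout` seconds for the swarm to answer.
        return future.result(timeout=timeout), "swarm"
    except Exception:
        # Timed out, or the swarm lookup failed: use the mirror instead.
        return mirror_fetch(name), "mirror"
    finally:
        # Do not block on a swarm request that is still running.
        pool.shutdown(wait=False)
```

The second element of the returned tuple records which source answered, which makes it easy to measure how much load the swarm is really taking off the mirrors.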
It's simple - and brilliant. Yet it has taken quite a bit of work to
adapt the bittorrent system, and the issues faced and the solutions
adopted are described in detail on Debtorrent's Wiki page.
However, there are still issues remaining that need to be taken into
consideration, to make Debian Distribution truly "Distributed".
What planet are you on? Debtorrent should be enough, surely?
No. The bittorrent protocol does not have "search" capability in it.
Bittorrent is only a file distribution mechanism, not an
information search mechanism. Web sites are therefore set up to
provide "search" capabilities, such that ".torrent" files can be
downloaded containing the initial "seed" site from which you can
download the list of IP addresses to get the actual file from [whew
- got all that? :) ].
And so, with Debtorrent, it's no different: you still have to download
"Packages.bz2" which gives you the names of the packages, information
about the packages, dependencies and also the relative location on a
mirror where the package - and source code - can be downloaded from.
This information must, at present, come from the mirrors themselves.
Now, whilst this sounds like it isn't a big hairy deal, if you look at
the size of the Etch
Binary Packages.bz2 (4.1mb) and that of the Etch Contents
(10mb), you begin to get an idea of the scale of the problem. When you
look at Lenny (Testing) and Sid (Unstable) these are each approximately
10% larger than the previous distro. Bear in mind that there are
nearly twenty thousand packages, all told. For Testing and
Unstable, the problem of continuous downloading of 10mbyte of Contents
and 5mbyte of "Packages" has been recognised and partially addressed by
splitting the "Packages" into daily diffs - an approach that's still not
scalable: it's just a temporary "fix".
The point is: even with Debtorrent installed on every single debian
system in existence, there's still a large dependency on the
single-point-of-failure mirrors, through which the verification and
amalgamation of the information that you see (Contents and Packages)
must be "funnelled" - along with the Packages themselves, the source
code and the ".dsc" file - the GPG-signed guarantee that the source
code and packages have not been tampered with.
It's this last - the Digital Signing - that gives us the crucial clue in
helping alleviate the load on the Debian Mirrors, possibly even doing
away with them altogether (more specifically, turning them entirely into
secondary mirrors of the underlying peer-to-peer "distributed"
distribution). We'll discuss this later.
Distributed Search
Essentially, I take the view that "apt-cache search" should, on
installation of a suitable dpkg-enhancing-package, result in a DHT-based
search, and that the contents of package descriptions should, once
received, be used as the basis for answering DHT-based queries to other
users.
However, of course, this leaves the question: how do you know that the
Package you're querying is in fact the right one? This is answered
simply enough: a file is made available (again as a
distributed peer-to-peer download) containing the list of packages and
the SHA-1 or MD5 checksums of the individual Package descriptions
(or better yet a GPG signature of each Package description). Included
in the file would be a timestamp of its last update; also, it would
itself be GPG signed.
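To make the shape of such a file concrete, here is a minimal sketch of how it might be built. Everything about the layout is an assumption for illustration: the JSON encoding, the field names "updated" and "packages", and the function names are all hypothetical, and the GPG signing step is deliberately left out (the bytes returned by serialise are simply what a detached signature would cover).

```python
import hashlib
import json
import time


def build_index(descriptions, timestamp=None):
    """Build the checksum index from {package name: description text}."""
    entries = {
        name: hashlib.sha1(text.encode("utf-8")).hexdigest()
        for name, text in sorted(descriptions.items())
    }
    return {
        # Timestamp of the last update, as suggested in the text above.
        "updated": int(time.time()) if timestamp is None else timestamp,
        "packages": entries,
    }


def serialise(index):
    """Produce stable bytes, suitable as the input to a detached signature."""
    return json.dumps(index, sort_keys=True).encode("utf-8")
```

Sorting the keys means two machines building the index from the same descriptions produce byte-identical output, which matters if the same signature is to verify on both.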
This file would, for Stable releases, change very little. Security
updates would have their own file. For Testing and Unstable, a "diff"
approach might be considered, but to be honest, the size of the file,
bzipped, would only be around 500k even if every one of the 20,000
packages were included in it: that's a significant difference from
10,000k of the current "Packages.bz2" file (which uncompresses to
enormous proportions - 19mb is not uncommon!) Also it might be a good
idea to include the "last updated" date of each package (in the file)
for reasons explained below.
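As a sanity check on that 500k guess, the arithmetic can be sketched as follows. The only inputs are the round figure of 20,000 packages from above and the fixed digest sizes of SHA-1 (20 bytes) and MD5 (16 bytes); the function name is hypothetical. Digests are effectively random data, so compression cannot shrink them below their raw size, which makes this a floor rather than an estimate.

```python
SHA1_BYTES = 20    # a SHA-1 digest is 160 bits
MD5_BYTES = 16     # an MD5 digest is 128 bits
PACKAGES = 20_000  # the article's round figure


def checksum_floor_bytes(packages=PACKAGES, digest_bytes=SHA1_BYTES):
    """Bytes needed for the digests alone, which compression cannot shrink."""
    return packages * digest_bytes


print(checksum_floor_bytes())                        # 400000
print(checksum_floor_bytes(digest_bytes=MD5_BYTES))  # 320000
```

So the checksums alone account for roughly 400k with SHA-1, and the package names and version numbers (which do compress well) make up the rest.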
So, once you have a digitally-signed list of only the Packages you're
interested in, and you know the SHA-1 or MD5 or GPG checksum of each
package description, you can immediately exclude DHT-query responses
that do not match the latest checksum. In fact, you could actually
use the Checksum as the DHT key query! [hmm, which
over-rides my earlier idea about including the "last updated" date of
each package, oh well - maybe it would be useful anyway :) ]
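Here is a minimal sketch of the "checksum as DHT key" idea. A plain dict stands in for a real DHT client, and dht_key and lookup are hypothetical names, so treat this as an illustration of the verification step rather than a working network client.

```python
import hashlib


def dht_key(description):
    """The DHT key is simply the SHA-1 of the description bytes."""
    return hashlib.sha1(description).hexdigest()


def lookup(dht, index, package):
    """Return the verified description for `package`, or None.

    `dht` is a plain dict standing in for a real DHT client; `index`
    maps package names to the checksums taken from the signed file.
    """
    wanted = index[package]
    candidate = dht.get(wanted)
    # A peer can return anything, so recompute the checksum before trusting it.
    if candidate is not None and dht_key(candidate) == wanted:
        return candidate
    return None
```

Because the key is derived from the content, a peer cannot substitute a different description without the mismatch being detected locally.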
So - to recap:
- The "new Packages" file is restricted to containing package
names, version numbers, possibly timestamps of each package's release,
and Checksums on the package descriptions (the .dsc file).
- The "new Packages" file is Digitally Signed by the Debian
Release Maintainer and made available for download via bittorrent (or
other).
- DHT queries use the Package Checksum (as listed in the "new
Packages" file) to look up information, receiving the .dsc file as a
response.
- Each .dsc file is placed into a DHT cache repository, which can
be /var/lib/apt/lists or /var/lib/dpkg/available or other, and thus
this "shims in" to the existing dpkg system with minimal code changes.
In fact, if you think this through carefully, apt-get update could be
done away with entirely, or made to be so fast that it could be done
every time you perform an apt-get install <package>. The reason is
that apt-get update normally downloads the full Packages file: if instead
you perform an incremental update of "new Packages" (by downloading
daily diffs), that only downloads a few kilobytes...
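To show why an incremental update stays small, here is a minimal sketch of applying one day's diff to a local copy of the checksum list. The diff layout (a "changed" mapping and a "removed" list) and the name apply_diff are assumptions made up for this example; they are not an existing apt or dpkg format.

```python
def apply_diff(index, diff):
    """Return a new index with one day's changes applied.

    `index` maps package name -> checksum. `diff` is a hypothetical
    structure: {"changed": {name: checksum}, "removed": [name, ...]}.
    """
    updated = dict(index)  # leave the caller's copy untouched
    updated.update(diff.get("changed", {}))
    for name in diff.get("removed", []):
        updated.pop(name, None)  # ignore names we never had
    return updated
```

Only the packages that changed that day travel over the wire, so the cost of an update tracks the day's activity rather than the size of the whole archive.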
Release Verification: upload process
(I have to be honest here: I may need a true Debianite to review
this section, to verify the procedure!).
The upload process by individual Debian Package maintainers requires
that they compile up the source code into a package
(using dpkg-buildpackage), and the package will be GPG digitally
signed. Checking is performed on the Package information, such as the
dependencies, and also whether a Package's priority is "extra,
important, optional, required or standard". When a "freeze" is in
place, nothing but critical changes can be made to "required" Packages.
(For example, libselinux1 was excluded from Sarge due to its very
long freeze, resulting in the adoption of SE/Linux in Debian being
delayed by eighteen months. libselinux1 needed only to be added to
the list of "important" packages - nothing else was required - but the
freeze dictated that no additions were to be made, and as libselinux1
was only previously listed as "optional", its uploading "Priority"
was automatically downgraded to "optional" by the automatic verification
process).
My guess - and it has to be a guess I'm afraid! - is that it may be
possible to distribute the verification process, or at least have the
verification carried out in some way and the Package Maintainer's own
development machine be the initial "seed" from which the packages that
they compile up are initially distributed. Or the build farms. Or
the main mirrors. Or bittorrent.com's servers (Note: yes,
bittorrent.com have offered to host .deb files as an experiment - but
only via the bittorrent protocol, not via anything else).
Perhaps the "upload" process could be replaced by distribution of
unverified .debs via peer-to-peer file sharing. Instead of waiting for
an "upload", the verification process would grab the newly built package
from the Maintainer's machine, carry out the checks and create the .dsc
file with its digital signature. The .dsc file would then be released at
the same time as the "new Packages" file is created.
Are we nearly there yet?
(How much further? can i have a biscuiiit? NO! siddown,
shuddup and listen!)
So (whew) so far we have:
- a peer-to-peer mechanism to replace uploading
of packages by Package Maintainers (with some waffling and fudging
going on about how to do Verification using peer-to-peer distribution)
- we have a peer-to-peer mechanism for performing searches of
packages, which is still digitally signed
- we already have a peer-to-peer mechanism for actually downloading
the packages (DebTorrent)
- it's all digitally signed so everyone's happy
- the Mirrors can still be used to provide HTTP, FTP and Rsync
downloads, ramping them down to provide only CDs, DVDs, netboot
installs and even then it's possible to ramp even that down further
through recommending that people use jigdo, bittorrent or other methods.
It's all hunky-dory, does the job, removes a headache from the Mirrors,
makes us independent of single-point-of-failure infrastructure, and,
this is the best bit: by the simple expedient of providing a different
dpkg.conf file and different GPG key master packages, you have provided
the world with a way to distribute other material in an entirely
distributed fashion. entirely as Free Software. without requiring a
web site.
For example, Miro, the
Distributed Media Player, requires that you upload your content to a
web site, and you subscribe to RSS feeds. They're basically recreating
the Debian Distribution system. Surely it would be better to merge the
two? [note to people who may have thought ahead somewhat: video
files are pretty large - however, of course, you can split them into
smaller "Packages" e.g. chapters or scenes or just 10mbyte chunks, and
have a "meta-package" with the sub-packages as "Dependencies", and
recombine the contents into one larger file... but the details are left
as an exercise for the reader :) especially the bit about if you
make the packages small enough you should be able to create them
in real-time and use the distribution system for webcasting ooooh]
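For readers who would rather not do the exercise, here is a minimal sketch of the chunking idea. The naming scheme ("-part0000" and so on), the dictionary layout of the meta-package and both function names are assumptions made up for illustration; a real version would produce proper .deb packages with a Depends field.

```python
import hashlib


def split_into_packages(name, data, chunk_size):
    """Split `data` into chunk "packages" plus a meta-package listing them."""
    chunks = {}
    for offset in range(0, len(data), chunk_size):
        part = f"{name}-part{offset // chunk_size:04d}"
        chunks[part] = data[offset:offset + chunk_size]
    meta = {
        "name": name,
        "depends": list(chunks),  # insertion order is chunk order
        "sha1": hashlib.sha1(data).hexdigest(),
    }
    return meta, chunks


def recombine(meta, chunks):
    """Join the chunks named in the meta-package and verify the result."""
    data = b"".join(chunks[part] for part in meta["depends"])
    if hashlib.sha1(data).hexdigest() != meta["sha1"]:
        raise ValueError("recombined data does not match the meta-package checksum")
    return data
```

The checksum in the meta-package plays the same role as the checksums discussed earlier: whoever reassembles the file can tell locally whether every chunk arrived intact.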
... did I miss anything? not on the distribution side, no, but I did
promise earlier to cover the IRC "anti-spam" and "Mailing list" bit.
But - to be honest, I think that's best covered in a separate article
or later on.
Finally, then, to reiterate: a Distributed Debian Package system
could be used not only for web-infrastructure-independent package
distribution but also for video, music, DNS zone files and other media
which can all benefit from Digital Signatures, timestamping and complex
inter-dependence made easy (where Debian itself is the most incredible
example of solving dependencies in an easy-to-use fashion).
Pretty powerful stuff.