Distributed Debian Distribution Development

Posted 26 Apr 2008 at 01:50 UTC (updated 26 Apr 2008 at 22:34 UTC) by lkcl

As part of the Tech Fusion Outline Series, this article describes some additions to the Debian Distribution model which, if implemented, would make Debian, and Debian development and deployment, entirely independent of Server-based Infrastructure.

The brief outline is expanded in this dedicated article, which points out how tying together components and technology that already exist would be useful not only for Debian but also for other purposes, such as video and audio media distribution. (A method of payment for work on Debian or other media is not within the scope of this article but is easily conceivable). This article therefore explains how and why Debian Distribution Development could go "Distributed".

What are the issues? Why is it so important to go "distributed"?

Debian is the largest independent, and one of the longest-running, of the Free Software Distributions in existence. There are over 1000 maintainers and nearly 20,000 packages. There are over 40 "Primary" Mirrors, and something like one hundred secondary mirrors (listed here - I'm stunned and shocked at the numbers!). 14 architectures are supported - 13 Linux ports and one GNU/Hurd port but only for i386 (aww bless iiit). A complete copy of the mirrors and their architectures, including source code, is over 160 gigabytes.

At the last major upgrade of Debian/Stable, all the routers at the major International fibreoptic backbone sites across the world redlined for a week.

To say that Debian is "big" is an understatement of the first order.

Many mirror sites simply cannot cope with the requirements. Statistics on the Debian UK Mirror for July 2004 to June 2005 show 1.4 Terabytes of data served. As you can see from the list of mirror sites, many of the Secondary Mirrors and even a couple of the Primary ones have dropped certain architectures.

security.debian.org - perhaps the most important of all the Debian sites - is definitely overloaded and undermirrored.

This isn't all: there are mailing lists (the statistics show almost 30,000 people on each of the announce and security lists, alone), and IRC channels - and both of those are over-spammed. The load on the mailing list server is so high that an idea (discussed informally at Debconf7 and outlined here later in this article, for completeness) to create an opt-in spam/voting system for people to "vet" postings and comments, was met with genuine concern and trepidation by the mailing list's maintainers.

It's incredible that Debian Distribution and Development hasn't fallen into a big steaming heap of broken pieces, with administrators, users and ISPs all screaming at each other and wanting to scratch each others' eyes out on the mailing lists and IRC channels, only to find that those aren't there either.

So it's basically coming through loud and clear: "server-based" infrastructure is simply not scalable, and the situation is only going to get worse as time progresses. That leaves "distributed architecture" - aka peer-to-peer architecture - as the viable alternative.

This problem has been recognised for quite some time. In fact, Debtorrent's Wiki page describing the motivation and history points out that Debtorrent was done as a 2006 Google "Summer of Code" project. Debtorrent hints at the tantalising possibility of being able to reduce or entirely replace the present "http", "ftp" and "rsync" download methods for individual packages, leaving jigdo and bittorrent as the methods for downloading CDs, DVDs and netboot images. Even other methods could be adapted to use distributed download methods.

What the heck is Debtorrent, anyway, and what does it do?

Debtorrent is a modified version of bittorrent that first looks to see whether there is a debtorrent swarm to download a package from. Lack of response after ten seconds results in debtorrent automatically going to an HTTP mirror. So, every time you do an apt-get install of a package, the packages should be downloaded from other debtorrent users rather than from the (overloaded) mirrors.
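The swarm-first, mirror-second behaviour can be illustrated with a minimal Python sketch. This is not Debtorrent's actual code: the function name `fetch_with_fallback` and the two callables passed to it are hypothetical stand-ins, and only the ten-second figure comes from the description above.

```python
import concurrent.futures

SWARM_TIMEOUT_SECONDS = 10  # the ten-second figure quoted in the article


def fetch_with_fallback(package, fetch_from_swarm, fetch_from_mirror,
                        timeout=SWARM_TIMEOUT_SECONDS):
    """Try the peer-to-peer swarm first; fall back to an HTTP mirror.

    fetch_from_swarm and fetch_from_mirror are hypothetical callables that
    take a package name and return its bytes. Returns (source, data).
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fetch_from_swarm, package)
    try:
        # Wait for the swarm, but only for `timeout` seconds.
        return "swarm", future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # No answer from the swarm in time: use the mirror instead.
        future.cancel()
        return "mirror", fetch_from_mirror(package)
    finally:
        # Do not block on a swarm request that is still running.
        executor.shutdown(wait=False)
```

Called with a swarm function that answers promptly, the sketch returns the swarm's data; called with one that stalls past the timeout, it returns whatever the mirror function supplies.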

It's simple - and brilliant. Yet it has taken quite a bit of work to adapt the bittorrent system, and the issues faced and the solutions adopted are described in detail on Debtorrent's Wiki page. However, there are still issues remaining that need to be taken into consideration, to make Debian Distribution truly "Distributed".

What planet are you on? Debtorrent should be enough, surely?

No. The bittorrent protocol does not have "search" capability in it. Bittorrent is only a file distribution mechanism, not an information search mechanism. Web sites are therefore set up to provide "search" capabilities, such that ".torrent" files can be downloaded containing the initial "seed" site from which you can download the list of IP addresses to get the actual file from [whew - got all that? :) ].

And so, with Debtorrent, it's no different: you still have to download "Packages.bz2" which gives you the names of the packages, information about the packages, dependencies and also the relative location on a mirror where the package - and source code - can be downloaded from. This information must, at present, come from the mirrors themselves.

Now, whilst this sounds like it isn't a big hairy deal, if you look at the size of the Etch Binary Packages.bz2 (4.1mb) and that of the Etch Contents (10mb), you begin to get an idea of the scale of the problem. When you look at Lenny (Testing) and Sid (Unstable) these are each approximately 10% larger than the previous distro. Bear in mind that there are nearly twenty thousand packages, all told. For Testing and Unstable, the problem of continuous downloading of 5mbyte of Contents and 10mbyte of "Packages" has been recognised and partially addressed by splitting the "Packages" into daily diffs - an approach that's still not scalable: it's just a temporary "fix".

The point is: even with Debtorrent installed on every single Debian system in existence, there's still a large dependency on the single-point-of-failure mirrors, through which the verification and amalgamation of the information that you see (Contents and Packages) must be "funnelled", along with the Packages themselves, the source code and the ".dsc" file - the GPG-signed guarantee that the source code and packages have not been tampered with.

It's this last - the Digital Signing - that gives us the crucial clue in helping alleviate the load on the Debian Mirrors, possibly even doing away with them altogether (more specifically, turning them entirely into secondary mirrors of the underlying peer-to-peer "distributed" distribution). We'll discuss this later.

Distributed Search

Essentially, I take the view that "apt-cache search" should, on installation of a suitable dpkg-enhancing-package, result in a DHT-based search, and that the contents of package descriptions should, once received, be used as the basis for answering DHT-based queries to other users.

However, of course, this leaves the question: how do you know that the Package you're querying is in fact the right one? This is answered simply enough: a file is made available (again as a distributed peer-to-peer download) containing the list of packages and the SHA-1 or MD5 checksums of the individual Package descriptions (or better yet a GPG signature of each Package description). Included in the file would be a timestamp of its last update; also, it would itself be GPG signed.

This file would, for Stable releases, change very little. Security updates would have their own file. For Testing and Unstable, a "diff" approach might be considered, but to be honest, the size of the file, bzipped, would only be around 500k even if every one of the 20,000 packages were included in it: that's a significant difference from 10,000k of the current "Packages.bz2" file (which uncompresses to enormous proportions - 19mb is not uncommon!) Also it might be a good idea to include the "last updated" date of each package (in the file) for reasons explained below.
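The size estimate can be sanity-checked with a rough experiment. The sketch below uses purely synthetic data: the `pkgNNNNN` names, the fixed `1.0-1` version and the one-line-per-package layout are all assumptions, since the article does not define a file format. A 20-byte SHA-1 digest is essentially incompressible, so 20,000 entries cannot shrink much below 400 kB whatever the layout.

```python
import bz2
import hashlib


def build_index(n_packages=20_000):
    """Build a synthetic 'name version sha1' index, one line per package."""
    lines = []
    for i in range(n_packages):
        name = f"pkg{i:05d}"   # hypothetical package name
        version = "1.0-1"      # hypothetical version string
        digest = hashlib.sha1(name.encode("ascii")).hexdigest()
        lines.append(f"{name} {version} {digest}\n")
    return "".join(lines).encode("ascii")


raw = build_index()
compressed = bz2.compress(raw)
print(f"uncompressed: {len(raw):,} bytes, bzip2: {len(compressed):,} bytes")
```

Real package names and version strings are longer and less regular than these, so a real file would come out somewhat larger than the synthetic one.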

So, once you have a digitally-signed list of Packages (of which you're only interested in some), and you know the SHA-1 or MD5 checksum or GPG signature of each package description, you can immediately exclude DHT-query responses that do not match the latest checksum. In fact, you could actually use the Checksum as the DHT query key! [hmm, which over-rides my earlier idea about including the "last updated" date of each package, oh well - maybe it would be useful anyway :) ]
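A minimal sketch of that idea follows, assuming SHA-1 as the checksum. `ToyDHT` is a hypothetical in-memory stand-in for a real DHT, and `trusted_index` stands for the already-verified contents of the signed list; neither name comes from any existing tool.

```python
import hashlib


def description_key(description: bytes) -> str:
    """The checksum of a package description doubles as its DHT key."""
    return hashlib.sha1(description).hexdigest()


class ToyDHT:
    """Hypothetical in-memory stand-in: maps a key to a list of peer responses."""

    def __init__(self):
        self._store = {}

    def put(self, key, value):
        self._store.setdefault(key, []).append(value)

    def get(self, key):
        return list(self._store.get(key, []))


def lookup_description(dht, trusted_index, package):
    """Return the first response whose checksum matches the signed list."""
    key = trusted_index[package]  # checksum taken from the signed list
    for response in dht.get(key):
        if description_key(response) == key:
            return response
        # Any response that does not hash to the key is discarded.
    return None
```

Because the key is derived from the content, a peer cannot answer a query with a stale or forged description without the mismatch being detected locally.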

So - to recap:

  • A new file, called "new Packages", is restricted to containing package names, version numbers, possibly timestamps of each package's release, and Checksums on the package descriptions (the .dsc file).
  • The "new Packages" file is Digitally Signed by the Debian Release Maintainer and made available for download via bittorrent (or other).
  • DHT queries use the Package Checksum (as listed in the "new Packages" file) to look up information, receiving the .dsc file as a response.
  • Each .dsc file is placed into a DHT cache repository, which can be /var/lib/apt/lists or /var/lib/dpkg/available or other, and thus this "shims in" to the existing dpkg system with minimal code changes.

In fact, if you think this through carefully, apt-get update could be done away with entirely, or made to be so fast that it could be done every time you perform an apt-get install <package>. The reason is that apt-get update normally downloads the Releases file: if instead you perform an incremental update of "new Packages" (by downloading daily diffs), that only downloads a few kilobytes...
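To show why an incremental update stays small, here is a sketch of computing and applying a diff between two snapshots of such an index. The representation (a dict mapping package name to a `(version, checksum)` tuple) and the function names are assumptions made for illustration; in practice the diff, or the file it produces, would also need a signature check.

```python
def diff_index(old, new):
    """Describe how to turn the `old` index snapshot into `new`."""
    removed = sorted(set(old) - set(new))
    # Entries that are new, or whose version/checksum changed.
    changed = {name: entry for name, entry in new.items()
               if old.get(name) != entry}
    return {"removed": removed, "changed": changed}


def apply_diff(old, delta):
    """Apply a diff produced by diff_index to an index snapshot."""
    updated = {name: entry for name, entry in old.items()
               if name not in delta["removed"]}
    updated.update(delta["changed"])
    return updated
```

Only the entries listed under "removed" and "changed" have to travel over the network, so a day on which a handful of packages change costs a handful of lines rather than the whole file.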

Release Verification: upload process

(I have to be honest here: I may need a true Debianite to review this section, to verify the procedure!).

The upload process by individual Debian Package maintainers requires that they compile up the source code into a package (using dpkg-buildpackage), and the package will be GPG digitally signed. Checking is performed on the Package information, such as the dependencies, and also whether a Package's priority is "extra, important, optional, required or standard". When a "freeze" is in place, nothing but critical changes can be made to "required" Packages.

(For example, libselinux1 was excluded from Sarge due to its very long freeze, resulting in the adoption of SE/Linux in Debian being delayed by eighteen months. libselinux1 needed only to be added to the list of "important" packages - nothing else was required - but the freeze dictated that no additions were to be made, and as libselinux1 was only previously listed as "optional" its uploading "Priority" was automatically downgraded to "optional" by the automatic verification process).

My guess - and it has to be a guess I'm afraid! - is that it may be possible to distribute the verification process, or at least have the verification carried out in some way and the Package Maintainer's own development machine be the initial "seed" from which the packages that they compile up are distributed. Or the build farms. Or the main mirrors. Or bittorrent.com's servers (Note: yes, bittorrent.com have offered to host .deb files as an experiment - but only via the bittorrent protocol, not via anything else).

Perhaps the "upload" process could be replaced by distribution of unverified .debs via peer-to-peer file sharing. Instead of waiting for an "upload", the verification process would grab the newly built package from the Maintainer's machine and carry out its checks; the .dsc file would then be created, with its digital signature, and released at the same time as the "new Packages" file is created.

Are we nearly there yet?

(How much further? can i have a biscuiiit? NO! siddown, shuddup and listen!)

So (whew) so far we have:

  • a peer-to-peer mechanism to replace uploading of packages by Package Maintainers (with some waffling and fudging going on about how to do Verification using peer-to-peer distribution)
  • we have a peer-to-peer mechanism for performing searches of packages, which is still digitally signed
  • we already have a peer-to-peer mechanism for actually downloading the packages (DebTorrent)
  • it's all digitally signed so everyone's happy
  • the Mirrors can still be used to provide HTTP, FTP and Rsync downloads, ramping them down to provide only CDs, DVDs, netboot installs and even then it's possible to ramp even that down further through recommending that people use jigdo, bittorrent or other methods.

It's all hunky-dory, does the job, removes a headache from the Mirrors, makes us independent of single-point-of-failure infrastructure, and, this is the best bit: by the simple expedient of providing a different dpkg.conf file and different GPG key master packages, you have provided the world with a way to distribute other material in an entirely distributed fashion. Entirely as Free Software. Without requiring a web site.

For example, Miro, the Distributed Media Player, requires that you upload your content to a web site, and you subscribe to RSS feeds. They're basically recreating the Debian Distribution system. Surely it would be better to merge the two? [note to people who may have thought ahead somewhat: video files are pretty large - however, of course, you can split them into smaller "Packages" e.g. chapters or scenes or just 10mbyte chunks, and have a "meta-package" with the sub-packages as "Dependencies", and recombine the contents into one larger file... but the details are left as an exercise for the reader :) especially the bit about if you make the packages small enough you should be able to create them in real-time and use the distribution system for webcasting ooooh]
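As a partial answer to that exercise, the chunking idea can be sketched as follows. The dict used for the "meta-package" only borrows the field names "Package" and "Depends" for illustration; it is not a real Debian control file, and the part-naming scheme is an assumption.

```python
import hashlib


def split_into_packages(name, payload, chunk_size):
    """Split a large payload into sub-packages plus a meta-package."""
    chunks = {}
    depends = []
    checksums = {}
    for offset in range(0, len(payload), chunk_size):
        part_name = f"{name}-part{offset // chunk_size:04d}"  # hypothetical naming
        data = payload[offset:offset + chunk_size]
        chunks[part_name] = data
        depends.append(part_name)
        checksums[part_name] = hashlib.sha1(data).hexdigest()
    meta = {"Package": name, "Depends": depends, "Checksums": checksums}
    return meta, chunks


def recombine(meta, chunks):
    """Verify each sub-package and join them in dependency order."""
    parts = []
    for part_name in meta["Depends"]:
        data = chunks[part_name]
        if hashlib.sha1(data).hexdigest() != meta["Checksums"][part_name]:
            raise ValueError(f"checksum mismatch in {part_name}")
        parts.append(data)
    return b"".join(parts)
```

Only the meta-package would need a signature, since it carries the checksum of every part; each part can then be fetched from any peer and verified independently.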

... did I miss anything? not on the distribution side, no, but I did promise earlier to cover the IRC "anti-spam" and "Mailing list" bit. But - to be honest, I think that's best covered in a separate article or later on.

Finally, then, to reiterate: a Distributed Debian Package system could be used not only for web-infrastructure-independent package distribution but also for video, music, DNS zone files and other media, all of which can benefit from Digital Signatures, timestamping and complex inter-dependence made easy (Debian itself being the most incredible example of solving dependencies in an easy-to-use fashion).

Pretty powerful stuff.


interesting..., posted 26 Apr 2008 at 13:21 UTC by lkcl » (Master)

apt-p2p

bugs.debian.org, posted 26 Apr 2008 at 22:39 UTC by lkcl » (Master)

folks - yep, i spoke to phil today and i'd entirely forgotten about bugs.debian.org so we discussed that and i will have to coalesce all the bits of ideas in my tiny brain and write them up as a separate article (yet another one, good grief).

ironically, the "bugtracker" issue - making a distributed peer-to-peer bugtracker, that is - is one i've been considering carefully for some time but had forgotten about. i had considered ditrack with GIT as the backend. the reason is ditrack's unique off-line ability to provide a "local" cache - including "local numbering" which is easily recognisable, that gets changed to "global numbering" when synchronisation is performed.

anyway - however: phil mentioned the unique issue of SPAM being a prime and serious problem on bugs.debian.org, which i had not considered at all, so i will need to get back to people on that after some thought.

links, posted 27 Apr 2008 at 16:57 UTC by lkcl » (Master)

i kindly received these links from someone unable to post on advogato:

oscomak

Skdb

Open Source Everything Project

distributed bug tracking, posted 8 Jul 2008 at 16:16 UTC by lkcl » (Master)

http://lwn.net/Articles/281849/
