Advogato: Rsync on Steroids

Rsync (the application) performs directory-by-directory and file-by-file synchronisation of a filesystem hierarchy - a POSIX-compliant filesystem hierarchy. Recent modifications to rsync show already some of the limitations of the current approach: storage of userid information into extended attributes when rsync is running as a daemon has just been added! The reason is because rsync as a daemon cannot be run as root, and so, when attempting to synchronise file permissions and userid attributes, thus maintaining file system integrity when performing backups, the previous version of rsync simply threw that information away. As a hack, the information is now stored in "extended attributes" - if an ext3 or other filesystem is used - for later retrieval on a restore / recovery.

How much better would it be if rsync had a VFS plugin layer, such that the storage of userid information and other attributes could be put into an alternative database, of which "storage in extended attributes" was just one example? Would it be nice to be to store that information in a format that was compatible with backuppc?

Or - how about storing an entire filesystem into a Tar ball? TAR (Tape Archives) have supported userid attributes, last modified dates and permissions for decades. Heck, while we're at it - what else is a "hierarchical storage" mechanism in the I.T. world? NTFS and HPFS; XML files and HTML files; Structured Storage and Streams; GVFS and KDE's KIO VFS plugins; FUSE and other user-space file systems; heck, even wget could be back-ended into an Rsync plugin at one end: in combination with a TAR plugin at the other end you could make regular compressed backups of web sites (ok - smart readers will have noticed that the last is stretching things a bit, but wait - there is rproxy! oh darn. hmm... even smarter readers will have noted that U.S. patents are only valid in the U.S. but frequently any patent usually results in a piece of software development being stopped, dead. we neeeed to do something about this, even if it means putting a notice on rproxy that it must not be distributed in binary form to the United States, until Software Patents are neutralised. but anyway - sorry for the interruption!).

What else is "hierarchical"? IMAP (and to some extent POP3) mailstores. How about going actually into the mail messages themselves, unpacking attachments, then looking across the entire mailbox for similar attachments, and performing a pseudo-sync of the "old" version of the attachment and the "new" one? How about doing the same thing across filesystems themselves?

How about the idea of optimising rsync on the server, by storing the (expensive-to-calculate) MD4 block checksums in a database? One of the reasons why rsync is not that widely deployed (Debian mirror sites often do not run rsync) is because of the amount of checksumming that's carried out, each time the file is sychronised. However, if you can guarantee filesystem integrity because the entire filesystem is stored not in a POSIX-compliant filesystem but actually in a SQL database, along with the MD4 checksums, actually splitting the files up into "blocks" rather than storing the entire file as one contiguous binary blob, then you've immediately got not only a method for optimising file storage space (if blocks occur more than once across many files or even the same file) but also you've saved yourself a great deal of CPU time not having to look up the MD4 checksums.

How about storing a hierarchical file system in GIT? (yes - i noticed that GIT itself can use rsync for synchronisation - but I'm talking about rsync using GIT for file storage!).

The list of possibilities are just incredible.

My favourite has to be an IMAP plugin, though, because then finally you can keep as many "offline" copies as you want of your mailbox synchronised with the "online" copy. This is one of the things that Exchange has which has no equivalent offering from Free Software projects (that i know of). In Exchange, synchronisation is a dog, causing immense aggravation to users. An rsync IMAP plugin would allow users to install an imap daemon on their own local system - a desktop or even a PDA - which then automatically synchronised email in a highly efficient manner.

Likewise, even sending of email - rsync with an SMTP plugin - could perform "synchronisation" over to a server before sending it out over the Internet. Close integration between the IMAP plugin and the SMTP plugin could result in massive savings of network traffic, which would be very handy on GSM/GPRS connections, based on analysis performed by the plugins, looking at file attachments that had already been transferred, or modified only by tiny amounts, and transferring only the differences rather than the whole email message.

(whilst we're at it - this of course hints at the possibility of doing away with SMTP altogether, especially with a peer-to-peer distributed IMAP server. Think about this: when you "send" an email, where is a copy first stored? in your IMAP "sent mail" folder. So why send it via SMTP at all? why not drop a DHT-based "notification" message into the peer-to-peer infrastructure for your recipient to pick up (with the hash of their email address as the 'key' of course), providing sufficient information and privileges such that they can "authenticate" against your IMAP server or its online version, and access your "sent mail" folder directly. using rsync-IMAP of course :) wouldn't have it any other way. The advantage of this approach is that the problem of SPAM almost entirely disappears, as you are using an authenticated "pull" mechanism, not the "push" mechanism that is SMTP. Further enhancements are to have a hash of both the sender and the recipients email addresses concatenated; for the recipient to perform regular "polling" of all known senders; for a new recipient to "request authorisation" to send email, just like has been done in every single popular IM system ever invented). Actually, an even better enhancement would be to negotiate a random hash for use by each sender-recipient combination, with the hashes generated at "communcation acceptance" time aka "buddy authorisation" yukk hate that phrase).

My second favourite idea is the one where XML documents are treated as "filesystems", which doesn't sound such a big deal except until you recall that ODF is an XML standard. Thus, the possibility exists to use rsync with a double-VFS-plugin (on input as well as output) to perform real-time peer-to-peer document editing (just like writely.com aka "Google Docs"). Whilst I realise that it is a non-trivial task to make any editor (whether it be Inkscape or Koffice or any other) report and recognise XML fragments as "modified" and "synchronised", at least a convenient and efficient method would exist to perform the document synchronisation, alleviating the need for the developers of each of the editor projects to reinvent that wheel.

I just know that there are more things that could be done, such as making the file-selection method part of the plugin architecture (options such as --exclude, --include, --cvs-exclude and --one-file-system), where instead of having these options you would have a much more suitable set per-plugin. There must be far more uses for rsync, with a VFS plugin layer, than I've been able to describe and hint at, here.

I hope that you enjoy looking for more such creative possibilities.