Rsync is an incredibly powerful tool that synchronises anything from a
single file to an entire hierarchical filesystem, over a network.
Unlike many other synchronisation methods, rsync will use the outdated
copy of a file to save on network traffic, transferring only the
differences (a reduction of anything up to 99% in the data sent).
Rsync the implementation, however, is restricted to POSIX systems
(such as Linux, Cygwin and *BSD), and, worse, it can only perform
operations on POSIX-based filesystems. This seems somewhat puzzling,
and, as part of the continued Tech Fusion series, this article will
outline some of the amazingly powerful things that could be done with
rsync... if it had a VFS layer.
Rsync (the application) performs directory-by-directory and file-by-file
synchronisation of a filesystem hierarchy - a POSIX-compliant
filesystem hierarchy. Recent modifications to rsync already show some
of the limitations of the current approach: storage of userid
information in extended attributes, when rsync is running as a daemon,
has just been added! The reason is that an rsync daemon cannot be run
as root, and so, when attempting to synchronise file permissions and
userid attributes (thus maintaining filesystem integrity when
performing backups), previous versions of rsync simply threw that
information away. As a hack, the information is now stored in
"extended attributes" - if ext3 or another filesystem that supports
them is used - for later retrieval on a restore / recovery.
How much better would it be if rsync had a VFS plugin layer, such that
the storage of userid information and other attributes could be put
into an alternative database, of which "storage in extended attributes"
was just one example? Wouldn't it be nice to be able to store that
information in a format that was compatible with BackupPC?
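To make this concrete, here is a minimal sketch in Python of what such
an attribute-storage plugin layer might look like. Everything here is
invented for illustration - rsync (written in C) has no such API - but
it shows how "extended attributes" would become just one backend among
many:

```python
import json
import os
from abc import ABC, abstractmethod


class AttributeStore(ABC):
    """Hypothetical plugin interface: where rsync would record the
    ownership and permission metadata it cannot apply directly."""

    @abstractmethod
    def save(self, path: str, uid: int, gid: int, mode: int) -> None: ...

    @abstractmethod
    def load(self, path: str) -> dict: ...


class XattrStore(AttributeStore):
    """Today's behaviour: stash the metadata in extended attributes
    (Linux-only; requires a filesystem with xattr support)."""

    def save(self, path, uid, gid, mode):
        blob = json.dumps({"uid": uid, "gid": gid, "mode": mode})
        os.setxattr(path, b"user.rsync.meta", blob.encode())

    def load(self, path):
        return json.loads(os.getxattr(path, b"user.rsync.meta"))


class JsonDbStore(AttributeStore):
    """An alternative backend: a flat JSON database, which a tool
    such as BackupPC could equally well be taught to read."""

    def __init__(self, dbpath):
        self.dbpath, self.db = dbpath, {}

    def save(self, path, uid, gid, mode):
        self.db[path] = {"uid": uid, "gid": gid, "mode": mode}
        with open(self.dbpath, "w") as f:
            json.dump(self.db, f)

    def load(self, path):
        return self.db[path]
```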
Or - how about storing an entire filesystem in a tarball? TAR (Tape
ARchive) files have supported userid attributes, last-modified dates and
permissions for decades. Heck, while we're at it - what else in the
I.T. world is a "hierarchical storage" mechanism? NTFS and HPFS;
XML files and HTML files; Structured Storage and Streams; GVFS and
KDE's KIO VFS plugins; FUSE and other user-space filesystems; heck,
even wget could be back-ended into an rsync plugin at one end: in
combination with a TAR plugin at the other end you could make regular
compressed backups of web sites (ok - smart readers will have noticed
that the last is stretching things a bit, but wait - there is rproxy!
oh darn. hmm... even smarter readers will have noted that U.S.
patents are only valid in the U.S., but frequently a patent results in
a piece of software development being stopped, dead. we need to do
something about this, even if it means putting a notice on rproxy
that it must not be distributed in binary form to the United States,
until Software Patents are neutralised. but anyway - sorry for the
interruption!).
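Here is a rough sketch, again with invented class names, of how a
tarball could serve as the "filesystem" behind such a plugin, using
Python's standard tarfile module (which exposes the uid, gid, mode and
mtime that TAR has carried all along):

```python
import io
import tarfile


class TarBackend:
    """Hypothetical VFS backend: the 'filesystem' rsync sees is
    really the contents of a tar archive."""

    def __init__(self, archive_path, mode="a"):
        # append mode works for uncompressed archives only
        self.tar = tarfile.open(archive_path, mode)

    def list_files(self):
        # TAR members carry uid, gid, mode and mtime natively
        return [(m.name, m.uid, m.gid, oct(m.mode), m.mtime)
                for m in self.tar.getmembers() if m.isfile()]

    def read(self, name):
        return self.tar.extractfile(name).read()

    def write(self, name, data, uid=0, gid=0, mode=0o644):
        info = tarfile.TarInfo(name)
        info.size = len(data)
        info.uid, info.gid, info.mode = uid, gid, mode
        self.tar.addfile(info, io.BytesIO(data))
```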
What else is "hierarchical"? IMAP (and to some extent POP3)
mailstores. How about actually going into the mail messages
themselves, unpacking attachments, then looking across the entire
mailbox for similar attachments, and performing a pseudo-sync of the
"old" version of an attachment against the "new" one? How about doing
the same thing across filesystems themselves?
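As a crude illustration of the attachment idea (the mailbox path is
hypothetical), one could walk a local mailbox, hash every attachment,
and flag identical copies - the candidates a pseudo-sync would
transfer only once:

```python
import hashlib
import mailbox
from collections import defaultdict


def index_attachments(mbox_path):
    """Group attachments by content hash: identical attachments
    across a mailbox need only be transferred once."""
    by_hash = defaultdict(list)
    for key, msg in mailbox.mbox(mbox_path).items():
        for part in msg.walk():
            if part.get_filename():                  # looks like an attachment
                payload = part.get_payload(decode=True) or b""
                digest = hashlib.sha256(payload).hexdigest()
                by_hash[digest].append((key, part.get_filename()))
    return by_hash


if __name__ == "__main__":
    for digest, hits in index_attachments("inbox.mbox").items():
        if len(hits) > 1:
            print(f"duplicate attachment {digest[:12]} in {len(hits)} messages")
```

A real plugin would of course go further, catching near-identical
attachments via rsync's block checksums rather than just exact
duplicates.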
How about the idea of optimising rsync on the server, by storing the
(expensive-to-calculate) MD4 block checksums in a database? One of the
reasons why rsync is not that widely deployed (Debian mirror sites often
do not run rsync) is the amount of checksumming that is carried out
each time a file is synchronised. However, suppose the entire
filesystem were stored not in a POSIX-compliant filesystem but in a
SQL database, along with the MD4 checksums, with each file actually
split up into "blocks" rather than stored as one contiguous binary
blob. Because filesystem integrity is then guaranteed, you immediately
have not only a method for optimising file storage space (if blocks
occur more than once across many files, or even within the same file)
but also a great saving of CPU time, because the MD4 checksums never
need to be recalculated - only looked up.
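A toy version of such a block store - a sketch under assumptions, with
SHA-256 standing in for MD4 and a fixed block size, neither of which
is quite what rsync does - might look like this:

```python
import hashlib
import sqlite3

BLOCK_SIZE = 4096  # assumption: fixed-size blocks


def open_store(path):
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS blocks (checksum TEXT PRIMARY KEY, data BLOB);
        CREATE TABLE IF NOT EXISTS files  (name TEXT, seq INTEGER, checksum TEXT,
                                           PRIMARY KEY (name, seq));
    """)
    return db


def store_file(db, name, data):
    """Split a file into blocks; identical blocks (across files, or
    within one file) are stored once, and their checksums sit ready
    for the rsync algorithm to look up instead of recompute."""
    for offset in range(0, len(data), BLOCK_SIZE):
        block = data[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        db.execute("INSERT OR IGNORE INTO blocks VALUES (?, ?)", (digest, block))
        db.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?)",
                   (name, offset // BLOCK_SIZE, digest))
    db.commit()


def checksums_for(db, name):
    """What the server would hand straight to a client: the
    precomputed per-block checksums, no hashing required."""
    rows = db.execute("SELECT checksum FROM files WHERE name=? ORDER BY seq",
                      (name,))
    return [r[0] for r in rows]
```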
How about storing a hierarchical file system in GIT? (yes -
I noticed that GIT itself can use rsync for synchronisation - but
I'm talking about rsync using GIT for file storage!).
The list of possibilities is just incredible.
My favourite has to be an IMAP plugin, though, because then, finally,
you could keep as many "offline" copies of your mailbox as you want
synchronised with the "online" copy. This is one of the things
Exchange offers for which there is no equivalent from Free Software
projects (that I know of) - and in Exchange, synchronisation is a
dog, causing immense aggravation to users. An rsync IMAP plugin would
allow users to install an IMAP daemon on their own local system - a
desktop or even a PDA - which then automatically synchronised email in
a highly efficient manner.
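A crude sketch of the first half of that synchronisation, using
Python's standard imaplib (the host, user and folder are placeholders;
the delta-transfer of message bodies is exactly the part the imagined
rsync plugin would add):

```python
import imaplib


def remote_uids(host, user, password, folder="INBOX"):
    """Ask the 'online' IMAP server which message UIDs it holds."""
    conn = imaplib.IMAP4_SSL(host)
    conn.login(user, password)
    conn.select(folder, readonly=True)
    status, data = conn.uid("SEARCH", None, "ALL")
    conn.logout()
    return set(data[0].split())


def plan_sync(local_uids, remote):
    """The naive plan: fetch what we lack, purge what went away.
    An rsync-IMAP plugin would go further and delta-transfer the
    bodies of messages that exist on both sides."""
    return {"fetch": remote - local_uids, "purge": local_uids - remote}
```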
Likewise, even the sending of email - rsync with an SMTP plugin -
could perform "synchronisation" over to a server before sending it out
over the Internet. Close integration between the IMAP plugin and the
SMTP plugin could result in massive savings of network traffic - very
handy on GSM/GPRS connections - based on analysis performed by the
plugins: looking for file attachments that had already been
transferred, or modified only by tiny amounts, and transferring only
the differences rather than the whole email message.
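For the curious, the trick that makes "transferring only the
differences" cheap is rsync's rolling weak checksum; a simplified
version (in the same spirit as, though not byte-for-byte identical to,
the real thing) looks like this:

```python
def weak_checksum(block):
    """Two running sums over a block of bytes, packed into one
    32-bit value, in the style of rsync's weak checksum."""
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * byte for i, byte in enumerate(block)) & 0xFFFF
    return (b << 16) | a


def roll(checksum, old_byte, new_byte, block_len):
    """Slide the window one byte without rescanning the block -
    this is what lets every offset of a file be checked against
    the known block checksums cheaply."""
    a = checksum & 0xFFFF
    b = (checksum >> 16) & 0xFFFF
    a = (a - old_byte + new_byte) & 0xFFFF
    b = (b - block_len * old_byte + a) & 0xFFFF
    return (b << 16) | a
```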
(whilst we're at it - this of course hints at the possibility of
doing away with SMTP altogether, especially with a peer-to-peer
distributed IMAP server. Think about this: when you "send" an email,
where is a copy first stored? In your IMAP "sent mail" folder. So why
send it via SMTP at all? Why not drop a DHT-based "notification"
message into the peer-to-peer infrastructure for your recipient to pick
up (with the hash of their email address as the 'key', of course),
providing sufficient information and privileges such that they can
"authenticate" against your IMAP server, or its online version, and
access your "sent mail" folder directly - using rsync-IMAP of
course :) wouldn't have it any other way. The advantage of this
approach is that the problem of SPAM almost entirely disappears,
as you are using an authenticated "pull" mechanism, not the
"push" mechanism that is SMTP. Further enhancements would be to use a
hash of the sender's and the recipient's email addresses concatenated;
to have the recipient perform regular "polling" of all known senders;
and to have a new recipient "request authorisation" to send email,
just as has been done in every single popular IM system ever invented.
Actually, an even better enhancement would be to negotiate a random
hash for each sender-recipient combination, with the hashes generated
at "communication acceptance" time, aka "buddy authorisation" - yukk,
I hate that phrase.)
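The key-derivation part of that scheme is trivial to sketch
(hypothetical, of course - a real design would want salting and the
negotiated per-pair hashes just mentioned):

```python
import hashlib


def dht_key(sender, recipient):
    """Key under which a 'new mail' notification would be dropped
    into the DHT: the hash of both addresses concatenated, so only
    a recipient who knows the sender can poll for it."""
    material = (sender.lower() + recipient.lower()).encode()
    return hashlib.sha256(material).hexdigest()


print(dht_key("alice@example.org", "bob@example.org"))
```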
My second favourite idea is the one where XML documents are treated as
"filesystems", which doesn't sound like such a big deal until you
recall that ODF is an XML standard. Thus, the possibility exists to
use rsync with a double-VFS-plugin (on input as well as output) to
perform real-time peer-to-peer document editing (just like writely.com,
aka "Google Docs"). Whilst I realise that it is a non-trivial task to
make any editor (whether it be Inkscape or KOffice or any other) report
and recognise XML fragments as "modified" and "synchronised", at least
a convenient and efficient method would exist to perform the document
synchronisation, alleviating the need for the developers of each of
the editor projects to reinvent that wheel.
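A first approximation of treating an XML document as a "filesystem" -
a sketch only, with element paths standing in for file paths and
nothing ODF-specific about it:

```python
import xml.etree.ElementTree as ET


def as_tree(xml_text):
    """Flatten an XML document into path -> content, the way a VFS
    plugin might present it to rsync as a tiny 'filesystem'."""
    paths = {}

    def walk(elem, prefix):
        for i, child in enumerate(elem):
            path = f"{prefix}/{child.tag}[{i}]"
            paths[path] = (child.text or "").strip()
            walk(child, path)

    root = ET.fromstring(xml_text)
    walk(root, root.tag)
    return paths


def changed_fragments(old_xml, new_xml):
    """The fragments a sync pass would actually need to transfer."""
    old, new = as_tree(old_xml), as_tree(new_xml)
    return {p for p in new if old.get(p) != new[p]}


# prints {'doc/p[1]'} - only the changed paragraph moves
print(changed_fragments(
    "<doc><p>hello</p><p>world</p></doc>",
    "<doc><p>hello</p><p>there</p></doc>",
))
```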
I just know that there are more things that could be done, such
as making the file-selection method part of the plugin architecture
(options such as --exclude, --include, --cvs-exclude and
--one-file-system), where instead of those fixed options each plugin
would offer a much more suitable set of its own - see the sketch
below. There must be far more uses for rsync, with a VFS plugin layer,
than I've been able to describe and hint at here.
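In terms of the hypothetical plugin interface sketched earlier, file
selection would simply become one more method a plugin overrides - a
POSIX plugin keeping glob-style excludes, an IMAP plugin selecting by
folder instead:

```python
from fnmatch import fnmatch


class Plugin:
    """Hypothetical base class: selection belongs to the plugin."""
    def select(self, path):
        return True  # default: everything


class PosixPlugin(Plugin):
    """POSIX plugins keep the familiar --exclude glob semantics."""
    def __init__(self, excludes):
        self.excludes = excludes

    def select(self, path):
        return not any(fnmatch(path, pat) for pat in self.excludes)


class ImapPlugin(Plugin):
    """An IMAP plugin selects by folder - a far more natural unit
    for a mailstore than a filename glob."""
    def __init__(self, folders):
        self.folders = folders

    def select(self, path):
        return path.split("/")[0] in self.folders
```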
I hope that you enjoy looking for more such creative possibilities.