Advogato's Number: the resurgence of distributed protocols

Posted 22 Mar 2000 at 00:48 UTC by advogato

This week, Advogato takes a look at the historical trends in popularity among different distributed and centralized protocols, from NNTP through HTTP and on to newer systems such as Napster and the ill-fated Gnutella. I argue that, in spite of the trend towards centralization in the Web, distributed protocols have some life left in them.

The NNTP protocol, when it was first published (in RFC 977), was quite a striking advance in Internet protocols. News articles propagated in an entirely distributed, decentralized fashion. The design of the protocol saved bandwidth when there were lots of people at a site reading the same news articles, and it also tolerated the failure of individual nodes. With these advantages, NNTP became one of the most popular Internet protocols (along with email and FTP), and for a time was the mechanism of choice for participating in Internet "communities."

Now, 14 years later, NNTP is a protocol in serious trouble. Its complete openness and lack of centralized control made it vulnerable to spam, abuse, and other nastiness. While I used to "read news" almost every day, the quality has sunk to the point where it's just not worth it.

NNTP's popularity has been largely displaced by Web-based message boards. These systems have many serious disadvantages compared with NNTP, including much poorer utilization of bandwidth, the forced use of clunky "one size fits all" HTML-based interfaces, and vulnerability to the failure or compromise of the individual sites that host the content. Nonetheless, people do find them significantly more useful on balance.

In spite of the mass migration from Usenet to the Web, NNTP does maintain a few strongholds, especially pr0n, mp3z, and warez. In addition, a number of new protocols with peer-to-peer transmission of files are coming down the pike, of which Napster is surely the most popular. A lot of people seem to be working on variations of Napster, including the Gnutella project started by a couple of employees at Nullsoft (the people who make WinAmp) and promptly shut down by their corporate overlords at AOL.

Since the issues are disparate, we'll look at them one at a time, by category.

Bandwidth

One of the main technical issues of all these protocols is bandwidth. In the classical setup, your school has an NNTP server that talks over the outside network to a few other NNTP servers. Within the school, the clients talk to the news server over a very fast local network. NNTP doesn't have any concept of loading files remotely only on demand, so the total bandwidth tradeoff depends on the average number of people accessing any one file (average, in this case, being mean weighted by file size). When this number is above 1, you win. When it's a lot more than 1, you win big.
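
To make the tradeoff concrete, here is a back-of-the-envelope sketch in Python. The article sizes and reader counts are invented purely for illustration; they aren't measurements of any real site.

    # Compare a feed-based (NNTP-style) transfer, where each article crosses the
    # outside link once, with on-demand (Web-style) transfer, one copy per reader.
    articles = [
        # (size_in_bytes, local_readers)
        (2_000, 40),     # a popular discussion post
        (500_000, 3),    # a large binary only a few people fetch
        (10_000, 0),     # an article nobody at the site reads
    ]

    feed_bytes = sum(size for size, _ in articles)
    demand_bytes = sum(size * readers for size, readers in articles)

    # Size-weighted mean readership: the feed wins whenever this exceeds 1.
    weighted_mean_readers = demand_bytes / feed_bytes

    print(f"feed transfer:            {feed_bytes} bytes")
    print(f"on-demand transfer:       {demand_bytes} bytes")
    print(f"weighted mean readership: {weighted_mean_readers:.2f}")

Here the feed moves 512,000 bytes over the outside link where on-demand transfer would move 1,580,000, so the weighted mean readership of about 3.1 means the site wins big.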

With the Web, you basically load files directly from the central server to the clients (this is certainly how mp3.com works). In some cases, especially when there's a local network with a lot more load than the connection to the Internet really can support, it makes sense to add a caching proxy server such as Squid. Web caching has its own set of issues, though, and basically doesn't work well unless the server cooperates.
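
To make "cooperates" concrete, here is a deliberately simplified sketch of the storage decision a shared cache has to make for each response. The header names are standard HTTP/1.1; a real proxy such as Squid implements a far more detailed version of these rules.

    def shared_cache_can_store(headers):
        """Decide whether a shared proxy may store a response, based on its headers."""
        cc = headers.get("Cache-Control", "").lower()
        if "no-store" in cc or "private" in cc:
            return False      # the server explicitly forbids shared caching
        if "max-age" in cc or "s-maxage" in cc or "Expires" in headers:
            return True       # the server granted an explicit freshness lifetime
        # Without freshness information the cache can still store the response,
        # but only usefully if there is a validator to revalidate it cheaply.
        return "ETag" in headers or "Last-Modified" in headers

    print(shared_cache_can_store({"Cache-Control": "private"}))   # False
    print(shared_cache_can_store({"ETag": '"abc123"'}))           # True
    print(shared_cache_can_store({}))                             # False: nothing to go on

A server that sends none of these headers effectively opts out of shared caching, which is why the optimization depends on its cooperation.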

What's possible, of course, is a hybrid that is optimized for the heterogeneous networks common in schools and companies. Basically, you need the protocol to be sensitive to the relative capacities of the networks, and try to share files between multiple clients inside the local network, rather than duplicating their transfer from the outside. This is one of the goals of Gnutella, and it's easy to imagine that it will continue to be an area of active work on the part of distributed protocols.
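
As a rough illustration of the idea (and not Gnutella's actual wire protocol), here is a sketch in which a client asks hypothetical local peers for a file before falling back to the outside link; the helper functions are placeholders.

    from typing import Callable, List, Optional

    def fetch(file_id: str,
              local_peers: List[Callable[[str], Optional[bytes]]],
              origin: Callable[[str], bytes]) -> bytes:
        # Ask peers on the fast local network first; each returns the file's
        # bytes if it holds a copy, or None if it does not.
        for ask_peer in local_peers:
            data = ask_peer(file_id)
            if data is not None:
                return data    # served over the LAN; the outside link is untouched
        # Nobody nearby has it, so pay for one transfer over the slow outside link.
        return origin(file_id)

    # Toy usage: one neighbour already holds the file, so the origin is never contacted.
    neighbour_cache = {"song-42": b"...mp3 bytes..."}

    def ask_neighbour(fid):
        return neighbour_cache.get(fid)

    def fetch_from_origin(fid):
        print("fetching", fid, "over the outside link")
        return b"...mp3 bytes..."

    print(fetch("song-42", [ask_neighbour], fetch_from_origin))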

Note that it is impossible to optimize bandwidth in this way in a fully centralized protocol. Also note that in some networking environments, such as people connected through a cable modem or ADSL line, the optimization doesn't do much.

Complexity

Let's face it, centralized systems are easier to manage and deploy than distributed ones. In a distributed protocol, you have to worry about consistency of namespaces, make sure the propagation algorithm works properly, and deal with things like partition of the network, failure of remote peers, and so on. Many of these problems pretty much go away in a centralized system.

Further, to really take advantage of the added robustness possible in a fully decentralized system, you need client support to browse different servers and select the one with the best availability. This is quite a bit harder than just doing a DNS lookup on the server's domain name, then connecting on a socket.
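
For instance, a client might probe a handful of candidate servers and keep whichever one answers fastest. The sketch below does only that; the host names are placeholders, and a real client would also cache the result and retry on failure.

    import socket
    import time

    CANDIDATES = ["news1.example.org", "news2.example.org", "news3.example.org"]
    NNTP_PORT = 119

    def pick_server(hosts, port=NNTP_PORT, timeout=2.0):
        best_host, best_latency = None, float("inf")
        for host in hosts:
            try:
                start = time.monotonic()
                with socket.create_connection((host, port), timeout=timeout):
                    latency = time.monotonic() - start
            except OSError:
                continue      # unreachable or refused: skip this peer
            if latency < best_latency:
                best_host, best_latency = host, latency
        return best_host

    print(pick_server(CANDIDATES) or "no server reachable")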

Yet, the added complexity shouldn't be overwhelming. NNTP, after all, has lots of clients and servers by now.

Control

Here, I think, is the crux of the distinction between centralized and distributed protocols. In a centralized system, there is a single point of control for things like access control and blocking or removing spam. In a distributed system, this kind of control is difficult or impossible.

The lack of controllability is both a good thing and a bad one. While nobody likes spam and other forms of abuse, the anarchic nature of the Internet is one of its more appealing features. Decentralized systems, in particular, seem especially resistant to censorship, both blatant and the more subtle forms resulting from economic pressures.

From a censorship point of view, content lies on a spectrum from official propaganda and corporate-sponsored messages to flatly illegal stuff, with a lot of the interesting stuff in between. Thus, it's not surprising to see that a lot of the less "official" stuff, such as copyrighted music, pr0n, and warez, gravitates to the more decentralized forms, while e-commerce takes place entirely on centralized servers.

Note that censorship and resource utilization have been linked for a long time. Schools all over the world are now banning Napster because of the intense network utilization. Back in the good old days, the protocol of choice for warez and similar stuff was FSP, which had the major property that it degraded gracefully under load, simply throttling the transfer speed rather than killing the network.

What next?

The success of Napster is fueling a renaissance in distributed protocols for file distribution. While a lot of the development is currently ad hoc, it should be possible to learn from the successes and failures of systems which have gone before, and systematically design new stuff that works pretty well.

In the 14 years since NNTP was specified, a number of techniques have come to light which can help fix some of its limitations. These include:

  • Protocols such as rsync and xdelta for synchronizing remote systems.

  • The use of cryptographic hashes to define a collision-resistant global namespace (see the sketch after this list).

  • Public key cryptography, particularly digital signatures for authentication.

  • Systems such as PolicyMaker and KeyNote for implementing policies.

  • A ton of academic research on special problems within distributed systems.
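
The hash-based namespace in the list above is easy to sketch: name each article by a digest of its contents, so every node derives the same name independently and two distinct articles essentially never collide. SHA-1 via Python's hashlib is used here purely for illustration.

    import hashlib

    def content_name(article: bytes) -> str:
        # The article's name is a function of its bytes alone, so any server
        # can compute it without consulting a central authority.
        return hashlib.sha1(article).hexdigest()

    a = b"Path: news.example.org!not-for-mail\n\nHello, world.\n"
    b = b"Path: news.example.org!not-for-mail\n\nHello, world!\n"

    print(content_name(a))   # the same bytes map to the same name on every node
    print(content_name(b))   # one changed byte yields an unrelated name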

Further, there are a bunch of exciting new things that might just nail the spam and abuse problems that seem to be endemic to distributed communications. These include the existing work from people such as SpamCop and NoCeM, as well as the trust metric work being pioneered on this very website.

Advogato modestly predicts a renaissance in distributed protocols. The next few years seem like a very exciting time for new work in this area.


SMTP did it right, posted 22 Mar 2000 at 06:25 UTC by aaronl » (Master)

Some distributed protocols just don't work well. One of these is Napster, whose centralized servers aren't even linked. SMTP is one of the best distributed protocols I have ever seen. The mail server relays an email to its destination. The destination is determined by a DNS lookup (the MX record) on the hostname that forms the latter part of the e-mail address. The only centralized part of the system is DNS.
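
For what it's worth, the routing step described above can be sketched in a few lines using the third-party dnspython package; the address below is just a placeholder.

    import dns.resolver  # pip install dnspython

    def mail_servers(address):
        domain = address.rsplit("@", 1)[1]
        answers = dns.resolver.resolve(domain, "MX")   # dnspython 2.x API
        # Lower preference values are tried first.
        return [str(r.exchange).rstrip(".")
                for r in sorted(answers, key=lambda r: r.preference)]

    print(mail_servers("someone@example.com"))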

Jabber seems to be using the same method as SMTP (user@host; server is on host), which will make it the only instant messaging system that is not centrally controlled, AFAIK. In systems such as Napster, AIM, and ICQ, only clients are released and users are expected to use the company's servers for their communication. If people go and start smaller servers with these protocols, it all breaks down, because there is no mechanism for linking these servers together, especially to the main servers where many people are signed up (~30*10^6 in AOL's case IIRC). For this reason I am really looking forward to using both the Jabber server and client.

The Internet in itself is a whole lot more distributed than people could have imagined. Steve Levy, in a 1990 MacWorld article, proposed a OneNet monopoly that would eliminate the problem of people using different online services and therefore not being able to communicate. With the Internet, the only centralized management is DNS, and it is actually optional (http://208.163.51.55 will cause the routers to use their routing tables to find a path to my computer - they won't have to look it up in a central directory). You could call the Web a huge distributed network of web servers. This is actually a huge step over what came before, where content was put up on a huge, centralized online service like AOL or CompuServe and the content was hosted and even censored by them. You had to pay a lot to get a "keyword" (AFAIK), and you couldn't shop around for prices because the only way to reach AOL's subscribers was to pay AOL. Because of the advances made with the Internet in general and the Web, I think the rise of distributed networks and protocols actually happened with the widespread adoption of the Internet.

Distributed vs. Decentralized, posted 22 Mar 2000 at 19:10 UTC by nelsonminar » (Master)

An interesting distinction to make in system design is the difference between "distributed" and "decentralized". It's useful to reserve the word "distributed" to talk about the fact of moving bits from place to place. Pretty much any system on the Internet is distributed. The question is how they're distributed.

Some systems on the Internet are fully decentralized - the Web is the premier one. Some systems are centralized, such as a single Web discussion board. In between are hierarchical systems: DNS falls in this category, where there is a single tree of authority but plenty of caching along the way.

NNTP and the current Internet backbone architecture both fall in a different category. Neither system is fully decentralized: there's still a strong tree shape to the network, where leaf sites get feeds from upstream. However, neither system is fully hierarchical either: at the highest levels, traffic is mutually peered and shared between sites, and there is no root authority like InterNIC.

Each type of design - centralized, hierarchical, semi-hierarchical, or fully decentralized - has its advantages. Centralized is the easiest to understand, but the least scalable and the least fault tolerant. Hierarchical has done well - the success of DNS over the last 20 years is nothing short of phenomenal. But hierarchical implies a root monopoly, and we've seen those disappear over time with things like the current Internet route peering architecture.

The thing that's less clear is fully decentralized systems. Full decentralization works very well on the Web, but only because we have full-text search engines to knit things together. I think the most exciting area of future Internet research lies in this regime. The payoff could be huge: truly scalable and self-healing systems. But the complexity is very difficult to manage.

Anarchic vs. Authoritarian, posted 22 Mar 2000 at 19:57 UTC by Ankh » (Master)

The World Wide Web wasn't the only contender for a successful decentralised hypertext system. The HyperGratz system from Austria was (is?) another, and for a while was more widely deployed.

HyperG/HyperGratz had a distribution and cache mechanism that was vastly more efficient than the WWW. It also had the idea that to run a server, you simply filled in a form and applied to someone in Austria to be added, and they'd tell you where your content fit into a global hierarchy.

This sort of bureaucratic centralisation is probably as "un-American" as you can get. I've portrayed it a little brutally to try and make clear how it might sound in North America.

The political advantage of the WWW is that anyone can set up a server right away, with no need to interact with anyone. You can do that with netnews too, both with NNTP and with the older uucp transport and B-news distribution methods. The technical advantage is simplicity.

Another distinguishing factor is the document life cycle. Usenet articles last anywhere from hours to weeks; web pages last from hours to years, or to years after they are out of date. AIM messages last seconds, and unless someone saves a log, IRC messages last a few minutes, or the length of your scrollbar.

Bandwidth is less important in IRC than minimising interruption of service, for example (which is why minimum-spanning-tree routing is inappropriate), whereas a two-hour interruption in a Usenet feed might not be noticed.

The technology and the politics have to work together, and have to be appropriate for the content, the users and the way the content is used.

protocols should be designed to be lawyer-proof, posted 24 Mar 2000 at 19:33 UTC by jwz » (Master)

More protocols should have fundamental privacy features built in, and by that I don't mean crypto; I mean lack of information. One of the most useful features of Usenet is its utter lack of authentication. What good is an anonymous remailer if the server even has the ability to keep logs of who sent what where? If the information exists (or can exist), then someone somewhere someday will get a court order that will trump any promises of privacy.

You can't subpoena Usenet.

Things like Napster can't succeed unless they protect their users with technical features, and not mere words. Napster is maybe not the best example of this, since everyone knows that its primary use is for copyright violations, but our civil liberties are usually defended by people with unpopular views. Defending the pornographers and racists, and yes, even the warez kids, is how we defend ourselves.

Distributed or central admin systems - somebody loses, posted 30 Mar 2000 at 20:50 UTC by Rasputin » (Journeyer)

One of the important areas of concern between distributed and centrally managed systems is, as jwz mentioned, anonymity and privacy versus control and responsibility. Within any centrally managed system, it is a trivial exercise to implement safeguards to limit abuse. The drawback is the inherent limit this imposes on privacy. The users have little choice beyond trusting that the central authority will not do "Bad Things" with the information it tracks. Unfortunately, there are numerous examples of companies that will sell all the personal information they can find, because someone is willing to buy and subsequently abuse it.

The picture within a truly distributed system is actually somewhat worse (IMHO). With no controls on user activity because of complete anonymity, there is no longer a limit on the irresponsibility of the users. I might choose to steal a "respectable" online identity and post bogus stories to the effect that VA Linux is about to report losses triple analysts' estimates, just to see what happens to the stock price. In a completely anonymous Internet, who could stop me? Only I can, assuming I have some sense of ethics that identifies such behaviour as unacceptable.

I think we've all seen what happens within such a large community where the only restraints are personally supplied ethics - the level of online crime/abuse grows daily. The problem is that the concerns about abuse of personal privacy are just as valid. It's a question of who you trust with what - the large organizations with some amount of information, or individuals with some amount of anonymity.

In the real world, this issue is dodged with the concept of "reasonable limits", both on an individual's right to privacy and on an organization's ability to compromise that privacy. It is an offence, to use an example, for law enforcement to listen to your phone conversations unless there is compelling evidence that you have committed a crime and the only way to legally prove that crime is through a wire-tap. In that case, a tap warrant can be issued, at which point you lose the element of privacy you believe you have in your phone conversations. The reasonable limits are on law enforcement, to prevent trolling for criminals with taps, and on your privacy, in the face of evidence of a crime. How many people have heard of cases where this law has been abused? This type of activity works only because there is a limit on the level of anonymity achievable in the "real world" as opposed to the "electronic world" - I might not know who you are, but I can remember your face, so you can still be identified. There is no comparable limit in cyberspace at this time.

I have yet to see a reasonable response to this set of problems from either online communities or governments. There has to be a middle ground between reasonable privacy and reasonable control to limit abuse (abuse can only be eliminated completely when privacy is eliminated completely, and even in that case, it's a questionable call). I honestly don't know what it is, but I hope more discussion and possibly some undreamt of technology will get us there. The answer is, to the best of my ability to guess, going to require some option between distributed and centrally managed systems. Maybe community managed nodes within a distributed system? It comes back to who can you trust with what.
