Older blog entries for raph (starting at number 202)

Blog 101

This entry is a collaboration with aaronsw.

Blogs are pretty basic. At a minimum, you put posts in reverse chronological order, and provide a permalink (a URL for the post which will hopefully never change) for each one. The permalinks allow other bloggers to put links to your posts in theirs, usually with a comment or response. Links are the lifeblood of blogs, and a large part of what makes them interesting.

You don't need any special software to run a blog. Some people (such as David McCusker) just edit the HTML by hand. But using a tool can be convenient, and many offer lots of extra features. The most important extras are keeping a list of other sites you visit (a "blogroll") and exporting RSS. Blogging tools should also let you change HTML templates (so you can tinker with the look of your site) and many provide a way for readers to comment on what you write. There are a number of third-party tools to do the latter, such as YACCS and QuickTopic. Most tools support the Blogger API so that you can use a GUI tool like blogBuddy to edit your site.

Blogs are immediate, but not quite so much as chat or instant messaging. Most of the time, blogs are read within a day of being written. One way to think about this is the lag between when the writer writes and when the reader reads.

Blogs are also firmly in the control of their writers. This means that individual blogs are free of spam, trolls, and other forms of abuse common on the Internet. If someone does try to spam or troll, it's easy to just ignore them.

Blogs are one of the most fertile areas for experimentation on the Web today. People are trying out RSS aggregators, the Google API, referrer log analyzers, and other such toys. It's also a fertile ground for research, with projects like blogdex and reptile analyzing the social networks formed by blogging communities.

Advogato diaries are basic blogs, but lack the fancier options. My guess is that Advogato will add some over time. The focus is different from that of most blog servers - presentation is simple, and free software is a strong theme. The recentlog also occupies a central role, acting as a kind of "communal blog". The scale seems to work well: if many more people wrote diaries regularly, the recentlog would be too much to read; on the other hand, there is enough content to convey the vitality of the community.

Social nets have both strong links (close friends and family) and weak links (casual acquaintances). Social networking theory tells us that both are important. Traditional forums such as mailing lists are pretty good at the strong links, but blogs are public and anyone can read them, so they create weak links among people in different communities that allow information to disseminate very rapidly. Sites such as the Daypop Top 40 track this flow of links.

Even though individual blogs are so simple in structure, you see emergent behavior in the network of blogs. It's common to see conversations between blog authors. In some cases, these blog conversations can take the place of email. This only works if the intended target of the message is reading your blog. Since most people can't read every blog, tools for finding blogs that link to you are quite popular. Many web servers provide "referer logs" which track what site your readers followed to your blog. Some blogging tools even make this part of the blog itself, so readers can follow the links for more information or other points of view. Similarly, backlink tools search the Web (often via Google) to find other pages linking to yours.

Blogs are becoming increasingly popular and mainstream. It's very interesting to see more communities start blogs and how they change and push the medium. Bloggers remind me of Ted Nelson's Hypercorps, the young librarians and teachers of the hypertext system he foresaw, who were "paid to sit around and make things interesting for you". Bloggers are only paid in the admiration of their peers, but that admiration is seductive. Soon after you're hooked on reading blogs, you're likely to want to start a blog yourself.


Watching Max's language evolve continues to be a delight. He's working on irregular forms now. He still says, "breaked", and when I say, "broke" back, he usually says, "broked". But in other cases, his use of irregular forms is perfect. Saturday, he picked up "three leaves" (he loves numbers and counting too). I asked him what it would be if it were just one, and he said "one leaf". Also, most of what he says now is either sentences or fairly complete fragments.

There's only another week of school remaining for Alan. We're hoping that his summer break will give him an opportunity to relax and let go of some of his anxiety. When he's in an anxiety freak-out, he is heartbreakingly eloquent at expressing it. At other times, he's happy, bright, and confident. But the amount of time spent in high anxiety seems to be going up. Does anyone have a recommendation for a child psychiatrist in the Bay Area, preferably one who is good with gifted children?


David has been blogging tension in his relationship with his wife, Lisa. However, he has now removed a lot of this discussion at her request. Announcing that he was going to before doing so seems just a bit hostile to me - it is almost like an invitation to archive (I didn't, btw).

In any case, Heather and I were very much reminded of a rough patch in our own marriage, when we briefly separated. Another parent at Quaker meeting asked how I was when we were putting our kids in the nursery, and I responded, "Heather is leaving me". His response: "who's Heather?" We still laugh about that, but it was painful at the time.

I was obviously reaching out to people I could talk to about the problems ("weak links" in the social net lingo), because most of my closer attachments were either entangled in the difficulty, or I really wanted them to stay untangled (like my advisor and colleagues at school).

If David were to call, I'd be more than happy to talk to him about trees and other stuff. My phone number is pretty easy to find (ordinarily I would send email with it, but this is part of an experiment to avoid email).

3 Jun 2002 (updated 3 Jun 2002 at 07:48 UTC) »

Review of Linked: The New Science of Networks, by Albert-László Barabási, ISBN 0738206679.

Highly recommended. Over the past few years, there has been a flowering of academic research into the nature of networks such as the link structure of the World Wide Web and social networks. This book collects some of the highlights of this research, and presents them as well-written stories. Barabási makes a strong case that network thinking will have profound implications over the coming years, not just for the World Wide Web, but for biology, social networks, and other areas of science as well.

"Scale-free" networks have a central place in the book, which is not surprising, as Barabási is co-author of a groundbreaking paper on the subject. The original paper dealt with the Web, but subsequent research has turned up a number of other networks with similar topology. Even within the Internet, scale-free topology applies both to the network of routers and backbones as well as the link structure of the Web.

Scale-free networks are instantly recognized by their characteristic power-law degree distribution. There are a few highly-connected nodes, and many nodes with just one or two links. By contrast, the degree distribution in random networks tends to be a tightly clustered bell curve.

A simple model generates randomized scale-free networks. Start with a small seed network, possibly a single node. Each new node adds a link to an existing node, chosen with probability proportional to that node's degree (indegree, in the directed case). Thus, the model captures both growth and preferential attachment. If either element is missing, the resulting network is not scale-free.
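The growth-plus-preferential-attachment model is easy to simulate. Here's a minimal Python sketch (the function name and the repeated-endpoints sampling trick are mine, not from the book):

```python
import random

def scale_free_network(n, seed_nodes=2):
    """Grow a network by growth plus preferential attachment.

    Returns an edge list. Each new node links to one existing node,
    chosen with probability proportional to its current degree.
    """
    # Seed network: two nodes joined by a single link.
    edges = [(0, 1)]
    # Each node appears in this list once per incident link, so a
    # uniform choice from it is a degree-proportional choice of node.
    endpoints = [0, 1]
    for new in range(seed_nodes, n):
        target = random.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
    return edges
```

Counting degrees in the result shows the characteristic power law: a few hubs with many links, and many nodes with only one.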

These networks have a few important properties. First, their diameter is very small. This property has been known in social network theory since the brilliant "small world" experiments of Milgram in 1967. The idea was popularized in the 1990 play by John Guare, "Six Degrees of Separation", and has since entered the popular vocabulary.

Second, such a network stays well connected even when random nodes are removed. This is an "attack resistance" property of the network, not directly related to the attack resistance of trust metrics, my own specialty (although the underlying concept of network flow plays an important role in the analysis of both).

However, when a few highly connected nodes are removed, the network fragments. Thus, scale-free networks are in this way more vulnerable than random networks.

Barabási does not address trust metrics. This is a bit surprising, because Google plays a part in the book, as a "more fit" search engine that rapidly becomes a hub even though existing search engines such as Inktomi and Altavista have already established names for themselves. Barabási misses the opportunity to explain why Google is better. Also, Guare's play deals with a con artist who is expert at playing his social network to further his own goals, but Barabási does not pursue the theme of trust (and violation of that trust) in social networks.

Even if you are familiar with scale-free network theory, the book is still a fun read, and the presentation may be helpful in talking with others. For people involved in Internet software design, and in design of p2p networks, this book is essential reading. The book nicely balances an accessible presentation with meaty intellectual content. Most people who enjoy thinking about the world will find something of interest.

Thanks to Jim McCoy for recommending this book to me.

2 Jun 2002 (updated 3 Jun 2002 at 07:42 UTC) »

Yesterday's entry brought a very good e-mail response from Keith Packard. I'm happy about the way that's going now.

There is a big difference between doing yet another fontmap for a new application, and doing something that is technically sound as a universal fontmap for (hopefully) all apps. The latter is clearly a lot more work. I wasn't sure whether Keith was serious about doing it, but it seems like he is.

File format, API, or protocol?

The fontconfig discussion brings to mind a few thoughts I've had about config in general. I'm blogging them here because I think they may be relevant to a lot of people.

When you're designing a config mechanism, one of the big questions is whether it will take the form of a file format, API, or protocol. Each has its own set of tradeoffs. The big question is what you want to pin down, and what you want to stay flexible.

If config is in a file format, there can be multiple implementations. However, when that happens, it can be very difficult to change the file format itself, because the risk of breaking things goes up with each additional implementation. You could explicitly design the file format to be extensible (something that XML makes relatively easy), but even then implementations can have limitations and bugs that need to be worked around.

Specifying an API instead lets you change the underlying file format all you like, at least in theory. The downside of the API approach is the danger of runtime incompatibility. For example, you generally won't be able to write pure Python or Perl scripts that access the config info. Instead, you'll have to write a wrapper.

Using an API instead of a file format also lets the library multiplex more than one file format. This can help portability. For example, it's not hard to imagine an API like fontconfig's being ported to Windows or Macintosh systems. The app would call FcFontList, which on Linux would read and parse the fonts.conf file, but in Carbon might be something like FMCreateFontIterator. So far, I don't think fontconfig has tried to do any of this.

It's also possible to implement config data as a protocol, for example talking through Unix domain sockets to a daemon which manages config information. This is generally a much more heavyweight solution, because of the need to keep a daemon running. There are some advantages, though. If carefully implemented, you can get low-latency change notification. In the case of fonts, this gives apps access to newly installed fonts without having to close and restart the app. It's not clear that this feature would justify the extra implementation cost. It's also worth noting that protocols give you the same kind of decoupling from the file format that APIs give, but without the tight runtime dependencies.

Also note that app configuration wrt X fonts happens through the X protocol. In practice, though, almost nobody speaks the protocol directly. Instead, they all use an API call such as XListFonts.

I'm thinking about fonts now, but these distinctions are probably valid for many types of config info.

XOR metric followup

Earlier, I had said that use of a XOR metric or similar technique for finding short routes was a good indicator of the health of a p2p project. Zooko has put up a nice response arguing why Mnet doesn't yet do it. I agree, "the research isn't really done yet" is a good reason.

Trust, backlinks, Xanadu

Paul Snively suggests that he, David McCusker, and I collaborate on a Scheme implementation of a trust metric for something like Xanadu-style backlinks. It's an interesting idea. If I actually had free time, I'd be inclined to pursue the idea. For one, the recent talk about Lisp has gotten me interested in trying out the language again. It's been years since I've written any serious Lisp. Would I find it a good tool, or would I want to go back to Python as soon as possible? See On the Relationship Between Python and Lisp by Paul Prescod for an argument for the latter.

In any case, if David McCusker and I collaborate on anything soon, it will almost certainly be some kind of IronDoc/Athshe mindmeld. David has already said that he's not into trust metrics.

In any case, I certainly agree that backlinks are a good application for a trust metric. Two-way links have the problem that they're susceptible to spamming. Forward links don't have that problem, which may be one of the reasons why the low-tech, "worse" Web prevailed over the high-tech, "better" designs behind Xanadu.

Implementing backlinks within Advogato, as DV proposed, would solve the spam problem by using the existing trust metric. But this doesn't work for all those interesting backlinks from blogs outside Advogato-space.

If PageRank were available, then I think it would be another good solution. Sort backlinks by the PageRank of the target page, and do some kind of reasonable pruning. If Google won't do it, there's always my independent implementation :)


I had lunch with, then spent the afternoon with Brian Stell of Mozilla. He is on a mission to make Mozilla printing work well on Linux, especially the i18n bits. Already, he's provided us with fixes and test cases for incremental Type 11 fonts.

There are a lot of tricky issues, and the GNU/Linux community has historically done a poor job dealing with printing. It's an area where cooperation between many diverse pieces is needed, but there's nothing that's really motivating a solution except people's frustration. Brian is trying to solve the problem the Right Way for Mozilla, with the hope that others might follow.

Among other things, Brian is trying to figure out which versions of PostScript and PDF to target. For compatibility with existing printers, you want to generate lowest common denominator PostScript. But there are also advantages to generating more recent versions. For example, you can drop alpha-transparent PNG's into a PDF 1.4 file and they'll render correctly. That's not possible with any version of PostScript, or with earlier versions of PDF. On the font side, many (most?) existing printers can't handle incremental Type 11 fonts, even though they're part of PostScript LanguageLevel 3 and supported by many Adobe interpreters before that (version 2015 and later).

A good solution would be to generate the latest stuff, and have converters that downshift it so it works with older printers. Alas, no good solution exists now. Ghostscript can rasterize without a problem, but sending huge bitmap rasters to PostScript printers is slow and generally not a good idea. pdftops can preserve the higher-level structure (fonts, Beziers, etc.), but is limited in many other ways, in part because it doesn't contain an imaging engine. So, at least for the time being, it seems like the best compromise is to have a codebase that generates various levels of PostScript and PDF.

A chronic problem for GNU/Linux is the lack of a mechanism for users to install fonts, and for applications to find them. At least five major application platforms need fonts: Gnome, KDE, Mozilla, OpenOffice, and Java. You also have a number of important traditional (I don't really want to say "legacy") applications that use fonts: TeX, troff, and Ghostscript among them. Plenty of other applications need fonts, including all the vector graphics editors, Gimp, and so on. I suppose I should mention X font servers too.

Most applications that need fonts have a "fontmap" file of some kind. This file is essentially an associative array from font name to pathname in the local file system where the font can be found. Actually, you want a lot more information than just that, including encoding, glyph coverage, and enough metadata at least to group the fonts into families. In some cases, you'll want language tags, in particular for CJK. Unicode has a unified CJK area, so a Japanese, a Simplified Chinese and a Traditional Chinese font can all cover the same code point, but actually represent different glyphs. If you're browsing a Web page that has correct language tagging, ideally you want the right font to show up. Unfortunately, people don't generally do language tagging. In fact, this is one area where you get more useful information out of non-Unicode charsets than from the Unicode way (a big part of the reason why CJK people hate Unicode, I think). If the document (or font) is in Shift-JIS encoding, then it's a very good bet that it's Japanese and not Chinese.
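As a concrete sketch of a fontmap carrying this extra metadata (entirely hypothetical: the fonts, paths, field names, and helper functions below are invented for illustration):

```python
# Hypothetical fontmap entries; every path, field, and font is made up.
fontmap = {
    "Kochi Mincho": {
        "path": "/usr/share/fonts/kochi-mincho.ttf",
        "encoding": "Shift-JIS",   # the charset itself hints the language
        "language": "ja",
        "coverage": ["U+3040-309F", "U+4E00-9FFF"],
    },
    "AR PL Mingti": {
        "path": "/usr/share/fonts/bsmi00lp.ttf",
        "encoding": "Big5",
        "language": "zh-TW",
        "coverage": ["U+4E00-9FFF"],
    },
}

def covers(entry, codepoint):
    """True if any coverage range of the entry contains the codepoint."""
    for rng in entry["coverage"]:
        lo, hi = (int(part, 16) for part in rng[2:].split("-"))
        if lo <= codepoint <= hi:
            return True
    return False

def pick_font(fontmap, codepoint, lang=None):
    """Prefer a language-tagged match; fall back to coverage alone."""
    candidates = [n for n, e in fontmap.items() if covers(e, codepoint)]
    if lang:
        tagged = [n for n in candidates if fontmap[n]["language"] == lang]
        if tagged:
            return tagged[0]
    return candidates[0] if candidates else None
```

A unified CJK codepoint like U+4E00 is covered by both fonts; only the language tag (or the charset hint) can pick the right glyphs.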

This is why, for example, the gs-cjk team created a new fontmap format (CIDFnmap) for Ghostscript. In addition to the info in the classic Ghostscript Fontmap, the CIDFnmap contains a TTC font index (for .ttc font files which contain multiple fonts), and a mapping from the character set encoding to CID's, for example /Adobe-CNS1 or /Adobe-GB1 for Traditional and Simplified Chinese, respectively.

To make matters even more complicated, as of Ghostscript 7.20 we have yet another fontmap format, the xlatmap. The goals are similar to the CIDFnmap's, but with different engineering tradeoffs. One of my tasks is to figure out how to unify these two branches.

In any case, there are really three places where you need to access fonts, and hence fontmaps. First, the app needs to be able to choose a font and format text in that font. That latter task requires font metrics information, including kerning and ligature information for Latin fonts, and potentially very sophisticated rules for combining characters in complex scripts. These rules are sometimes embedded in the font, particularly OpenType formats, but more often not for older formats such as the Type1 family. Interestingly, you don't need the glyphs for the formatting step.

The second place where you need the font is to display it on the screen. Historically, these fonts have lived on the X server. But the new way is for the client to manage the font. The XRender extension supports this approach well, as it supports server-side compositing of glyphs supplied by the client. Even without XRender, it makes sense to do the glyph rendering and compositing client-side, and just send it to the X server as an image. Maybe 15 years ago, the performance tradeoff would not have been acceptable, but fortunately CPU power has increased a bit since then.

The Xft library is one possible way to do font rendering, but I'm not very impressed so far. Among other things, it doesn't do subpixel positioning, so rendering quality will resemble Windows 95 rather than OS X.

The third place where you need the font is when you're printing the document. In most cases today, it's a good tradeoff for the app to embed the font in the file you're sending to the printer. That way, you don't have to worry about whether the printer has the font, or has a different, non-matching version. If you do that, then Ghostscript doesn't actually need to rely on fontmaps at all; it just gets the fonts from the file. However, a lot of people don't embed fonts, so in that case, Ghostscript has to fudge.

So how do you install a font into all these fontmaps? Currently, it's a mess. There are various hackish scripts that try to update multiple fontmaps, but nothing systematic.

One way out of the mess would be to have a standard fontmap file (or possibly API), and have all interested apps check that file. Keith Packard's fontconfig package is an attempt to design such a file format, but so far I'm not happy with it. For one, it's not telling me all the information I need to do substitution well (the main motivation in Ghostscript for the CIDFnmap and xlatmap file formats). Another matter of taste is that it's an XML format file, so we'd need to link in an XML parser just to figure out what fonts are installed. I'd really prefer not to have to do this.

I realize that the Right Thing is to provide enough feedback to KeithP so that he can upgrade the file format, and we can happily use it in Ghostscript. But right now, I don't feel up to it. The issues are complex and I barely feel I understand them myself. Also, I'm concerned that even fixing fontconfig for Ghostscript still won't solve the problems for other apps. After all, Ghostscript doesn't really need the font metrics, just the glyphs. Even thoroughly obsolete apps like troff need to get .afm files for Type1 fonts (to say nothing of font metrics from TrueType fonts). GUI apps on GNU/Linux haven't really caught up to the sophistication of mid-'80s desktop publishing apps on the Mac. As far as I can tell, fontconfig currently has no good story for metrics or language info.

What would really impress me in a standard for fontmap files is a working patch to get TeX to use the fonts. But perhaps this is an overly ambitious goal.

In any case, I really enjoyed meeting Brian in person today, and commend him for having the courage to attack this huge problem starting from the Mozilla corner.

31 May 2002 (updated 1 Jun 2002 at 19:09 UTC) »
Farmer's market

We took the kids to the Benicia Farmer's Market for the first time of the season. Max loved the pony ride and petting zoo. Alan has become good friends with Lacey, the 9yo daughter of the family that brings the animals, and they had a great time running around together. Alan also ran into about 4 kids he knows from school. It was pleasant weather, the sunset was beautiful, and it was altogether a very nice evening.

More on grain size

See David McCusker's response to yesterday's posting on grain size.

A couple of people pointed out that my estimate for modem latency was too low. Assuming 150ms and 4K/s gives a product of 0.6K. That actually makes the clustering around 4K even better.

I've been thinking some more about the relatively large "optimum" grain size for disks. David has obviously put a lot of thought into the problem. What I'll present tonight is a simple technique, optimized for ease of analysis. I'm sure that better techniques are possible, and IronDoc may already implement them. But this is a lecture, not a product :)

Assume 4K blocks and 1M grains, so a grain contains 256 blocks. Coincidentally, if each child reference takes 16 bytes, 256 is also the maximum number of children each internal node can have. The minimum number is half the maximum. That's a big part of what makes it a btree.

In a large tree, each subtree consisting of a node one level up from the bottom and its immediate children (all leaf nodes) gets its own grain. Grains are fixed size. The file consists of a sequence of grains; they're not interleaved at all. Maybe the file is a little shorter because the last grain isn't completely full.

Now we can analyze the utilization. For files smaller than one grain (1M), utilization is the same as for classical btrees: 1/2 in the worst case. For larger files, the utilization of blocks within a grain is also 1/2 in the worst case, so the total worst-case utilization is 1/4.

A grain can contain free blocks, but the file need not store any free grains. If you want to delete a grain, you copy the last grain in the file over it, then truncate the file. As a result, the order of grains within the file can become totally random, but you really don't care. The grain is large enough that one disk seek per grain isn't a killer.

There are tricks you can do to improve utilization. One special case bears mentioning - the append-only scenario. In the classical btree algorithm, you get worst-case utilization. When you overflow a block, you split it in half. The first half never gets touched again, so the utilization remains 1/2. It's pretty easy to tweak it so that in this case, utilization is near unity. It's a common enough case that I think this tweak is sufficient. Remember, 1/4 is only worst case, and disk is really cheap.

As an order-of-magnitude ballpark, reading or writing an entire grain should take about 40ms. Reading or writing an individual block is around 20ms. Sometimes you'll need to do an operation on an entire grain, for example when splitting or merging a subtree 1 level up from the bottom.

I'm not trying to make the argument that this is the best file layout. I'm sure there are better tradeoffs, preserving similar locality while improving utilization. But I do like things that are simple to understand. See also McCusker on engineering design tradeoffs, short and well worth reading.

XOR metric

The XOR metric as proposed in the Kademlia work has a lot of people excited. I've been thinking about similar things in the context of my PhD research for about 3 years now. But none of my thinking was really written down anywhere, much less published. There's some discussion on the bluesky list of other similar ideas in the literature, as well.

The Kademlia people have actually done the work and published the paper. Perhaps just as important as the design is their analysis. Designers of p2p networks tend to be very optimistic that the resulting real-world network will have the desired properties. I'm pleased to see p2p design moving toward real quantitative design.

The XOR metric is good in very large nets. You assign an ID to each node, for example the hash of its public key. Suppose that you distribute blocks among the nodes based on the hash of the block, so that a block is likely to be found in nodes with "nearby" hashes. Then, if you know the hash of a block, you have the problem of finding a nearby node.

If the net is reasonably small, then maybe you broadcast contact information for new nodes joining. That way, all nodes know about all other nodes, so the problem is simple. But as the net scales up, this strategy won't work as well.

Hence the XOR metric. Simplified, the "nearness" of two id's is the number of bits in common in the prefix. So, the nearness of "10100..." and "10010..." is 2, because the first two bits (but not the first three) are identical. Each node now keeps track of a small number of other nodes, as few as one for each distinct "nearness" value (which scales as lg N).

Now if you have a target id in hand, keep iterating this step: choose the closest node to the target id. Ask them for contact information for a closer node. You'll get to the closest node within lg N steps, as each step increases the nearness by one. A picture for the analysis resembles walking down a binary tree.
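Here's a toy sketch of both pieces: the simplified prefix nearness and the greedy walk. The 8-bit ids and hand-built contact table are for illustration only; a real system uses long hashes and Kademlia's bucketed routing tables.

```python
ID_BITS = 8  # toy size; real ids are e.g. 160-bit hashes

def nearness(a, b, bits=ID_BITS):
    """Simplified metric from above: length of the common bit prefix."""
    n = 0
    for i in range(bits - 1, -1, -1):   # walk from the high bit down
        if (a >> i) & 1 != (b >> i) & 1:
            break
        n += 1
    return n

def lookup(start, target, contacts):
    """Greedy walk: hop to a strictly closer contact until none exists.
    `contacts` maps each node id to the list of ids it knows about."""
    node = start
    while True:
        best = max(contacts[node], key=lambda c: nearness(c, target))
        if nearness(best, target) <= nearness(node, target):
            return node  # no strictly closer contact; the walk ends here
        node = best
```

Each hop increases the nearness by at least one, which is where the lg N bound on the route length comes from.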

Many p2p network designs need to find arbitrary other nodes, and are expected to scale up. If so, there are two choices: use an algorithm such as Kademlia's (or based on Chord or Tapestry, which were the inspirations for Kademlia) to find short routes, or pray. It's a good test, I think, of the health of the project.

The stamp-trading network described in Chapter 7 of my thesis is another twist on this idea. The lg N topology is similar, but requests flow only along trusted links in the network. It's an open question whether it will in fact find short routes given this additional constraint.

(Thanks to Roger Dingledine for comments on an earlier draft, and to Zooko for good discussion and links. Responsibility for oversimplification remains mine.)


casper.ghostscript.com seems to be down. I hope it's just because transbay.net, the colo host, is moving, and that it'll be up again in the morning. Ah well.

Linux or Mac

David McCusker wants a development machine for home. I think he would be happy with either a Linux box or a Mac running OS X. In the former case, he'll have to spend some time dealing with Linux weirdness, in the latter, he'll have to spend some time dealing with OS X weirdness.

If the goal is raw price/performance, Wintel is a clear win. Buying the cheapest parts on Pricewatch costs somewhere around 50% of the price of a comparable-spec Mac. But Macs are arguably more elegant than Wintel. And, if stuff like having digital cameras Just Work when you plug them in is important, there is no contest.

When I ordered spectre, raw horsepower was in fact a driving factor. I need to do regression testing as quickly as possible. Otherwise, I would have been sorely tempted to get a Mac.

David also expresses the wish to do cross-platform GUI coding. Unfortunately, there's no good story here. Drawing is only a small part of the overall problem, so using something like OpenGL won't help much. I do expect the situation to improve over time. wxWindows is probably closest, and has OSX support in development versions.

What is the optimum grain size?

One of the most important factors affecting performance of a tree access protocol is grain size. Here I'll present a very simplistic analysis. A lot of people never bother to do any.

How do you slice up a tree into grains? I propose to serialize the tree, chop the serialization into fixed-size blocks (leaf nodes), and use a btree-like structure to tie this together. The cool new idea is to count parentheses in the btree nodes. This lets you fly around the tree without having to dig into the leaves.
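A toy illustration of the parenthesis-counting idea (my own simplification, certainly not any real IronDoc design): chop a parenthesized serialization into fixed-size blocks, store each block's net paren balance in the btree index, and use only those counts to find where a subtree can close.

```python
BLOCK = 8  # tiny block size for illustration

def chop(serialized, block=BLOCK):
    """Chop the serialization into fixed-size leaf blocks."""
    return [serialized[i:i + block] for i in range(0, len(serialized), block)]

def paren_counts(blocks):
    """Per-block net parenthesis balance, kept in the btree index nodes."""
    return [b.count("(") - b.count(")") for b in blocks]

def find_close(counts, start, depth=0):
    """Find the block where nesting depth first returns to zero,
    summing only the index counts -- the leaves are never read."""
    for i in range(start, len(counts)):
        depth += counts[i]
        if depth <= 0:
            return i
    return None
```

For the serialization "((ab)(cd))((ef))" with 8-byte blocks, the counts are [2, -2], so the index alone tells you the first top-level subtree must close inside block 1; only that one leaf block needs to be scanned.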

There are other ways to do it, of course. You can use one grain per node. You can use one grain for the whole tree, perfectly reasonable if the tree is smallish. You can also try to aggregate subtrees into grains. On disk, the advantage of fixed-size blocks is good utilization. On a network, you don't care about utilization of blocks, but you still might care about a large variance in grain size.

For simplicity, let's assume that the tree is read-only. We will analyze two usage patterns. The first is simply to traverse the whole tree. The second is to navigate to random nodes in sequence.

Traversing the tree means scanning the serialization linearly. This is cool. You only touch each block once. Assume that the time to fetch a block is a latency value, plus the size of the block divided by the bandwidth. Total time for traversing the tree is (tree size in bytes) / (bandwidth in bytes per second) + (tree size in bytes) * latency / (block size in bytes). It is easy to see that large blocks are more efficient.

The random node case is different. When you fetch a block, only a small part of it is relevant. The rest is waste. The time to fetch a node is latency + (block size in bytes) / bandwidth. This time, small blocks are more efficient.

What is the optimum grain size, then, for a mixture of both cases? When (block size) = latency * bandwidth, both individual cases are exactly a factor of two slower than their limiting best case (infinitely large blocks in the case of a traversal, infinitely small in the case of random nodes). Thus, the optimum will be on the order of latency * bandwidth.
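To make the factor-of-two claim concrete, here is the model in a few lines of Python (the names are mine; the formulas are exactly the ones above):

```python
def traverse_time(tree, bw, latency, block):
    """Whole-tree scan: every byte pays bandwidth, every block pays latency."""
    return tree / bw + (tree / block) * latency

def random_time(latency, bw, block):
    """Fetch one random node: one latency plus one whole, mostly wasted, block."""
    return latency + block / bw
```

At block = latency * bandwidth, traverse_time is twice its infinite-block limit (tree / bw), and random_time is twice its zero-block limit (latency), so that product is the balance point.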

What is latency * bandwidth for real devices? Here's a quick table. Don't worry about small inaccuracies. We're trying to get the order of magnitude right. This is just disk and network. Memory hierarchy is important too, but the analysis is considerably different, so I won't do that tonight.

modern disk: latency 10ms, max bw 50M/s: 500K
wireless net 802.11b: latency 2.5ms, max bw 0.5M/s: 1.25K
modem: latency 50ms, max bw 4K/s: 0.2K
100Mb lan: latency 0.3ms, max bw 10M/s: 3K
dsl down from a nearby server: 20ms, 100K/s: 2K
dsl up to a nearby server: 20ms, 10K/s: 0.2K
dsl down international: 200ms, 100K/s: 20K
dsl up international: 200ms, 10K/s: 2K
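For what it's worth, the products are easy to recompute (same ballpark figures as the table):

```python
# (latency in seconds, bandwidth in bytes/sec) for each device
devices = {
    "modern disk":       (0.010,  50e6),
    "802.11b wireless":  (0.0025, 0.5e6),
    "modem":             (0.050,  4e3),
    "100Mb LAN":         (0.0003, 10e6),
    "DSL down, nearby":  (0.020,  100e3),
    "DSL up, nearby":    (0.020,  10e3),
    "DSL down, intl":    (0.200,  100e3),
    "DSL up, intl":      (0.200,  10e3),
}
products = {name: lat * bw for name, (lat, bw) in devices.items()}
for name, p in products.items():
    print(f"{name:18s} {p / 1e3:7.2f}K")
```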

In Unix, the traditional block size is 4K. It's interesting that this value is not far from the mark for networks, even over a very broad range of performance. So the traditional block size is actually still reasonable.

Disk is the outlier. What's more, latency * bandwidth scales very roughly as the square root of areal density, and that's scaling like mad. It used to be 4K, but that was a long time ago.

But 500K is two whole orders of magnitude bigger. If we believe this analysis, then access to a tree on disk will spend 99% of the time seeking, and 1% accessing useful data. That would be bad.

The story is a bit more complex, though. Real disk-based systems spend a huge amount of effort trying to increase locality, i.e. the clustering of items likely to be accessed together. If this effort is successful, then the effective grain size goes up. Caching with prefetching is a particularly effective technique. Modern OS kernels implement prefetching, and so do drives. In fact, when you request a 4K block from a drive, it will usually spend on the order of 10ms seeking to the right track, then 5ms waiting for the disk to spin to the right sector. Given a typical raw transfer rate of 50M/s, that means 250K or so of data will fly past the read head. In a modern disk, all that goes into a cache. Then, when the kernel requests blocks from that range, it gets them immediately.
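Here's a toy model of that effect (the numbers and the cache policy are simplified guesses, not a real drive): one 4K request drags the whole ~250K neighborhood into the drive's cache, so a sequential reader pays the seek once while a scattered reader pays it every time.

```python
class TrackCachingDisk:
    SEEK, ROTATE, BW = 0.010, 0.005, 50e6
    READAHEAD = int(ROTATE * BW)       # ~250K sweeps past the head

    def __init__(self):
        self.cached = None             # (start, end) byte range in cache
        self.time = 0.0

    def read(self, offset, size=4096):
        if self.cached and self.cached[0] <= offset and offset + size <= self.cached[1]:
            return                     # served from cache, essentially free
        self.time += self.SEEK + self.ROTATE + size / self.BW
        self.cached = (offset, offset + self.READAHEAD)

disk = TrackCachingDisk()
for i in range(64):                    # 64 sequential 4K reads
    disk.read(i * 4096)
sequential = disk.time

disk = TrackCachingDisk()
for i in range(64):                    # 64 scattered 4K reads
    disk.read(i * 10_000_000)
scattered = disk.time
assert scattered > 10 * sequential
```

The sequential pass hits the read-ahead cache for almost every block; the scattered pass seeks on every read, so its effective grain size never grows beyond 4K.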

So, to do a btree efficiently, you have (at least) two choices. You could specify a really large block size, and not worry about the order of blocks on the disk. Another method is to use a small block size, but try hard to cluster nearby blocks. This demands more intelligence when allocating blocks within the file. It's also well known from filesystems that it's hard to avoid fragmentation when the utilization is very high. When there is ample free space, there are more candidates to choose from to try to optimize locality.

Of course, in the read-only case, you can allocate the blocks exactly in traversal order. In this case, 4K blocks are again reasonable. The problem is avoiding fragmentation as you start updating the tree.

David McCusker talks about this a bit in his writing on database areas, where he suggests 2K blocks, but allocation logic that works in grains of 16K. That's still not good enough (by more than an order of magnitude) if the grains are randomly scattered in the file. Maybe he's doing something else to try to cluster grains; it's not clear to me. But I do believe it is a tricky problem.

The network version is in many ways simpler (this is a relief, because in other important ways it is harder). You don't have to worry about locality, as there really is no such concept. The network latency is the same no matter which block you request. You also don't have to worry as much about utilization, because it's possible to simply skip sending unused byte ranges. Blocks can be variable in size, too, unlike the fixed blocks of disks.

As I warned, this analysis is oversimplified. Yet, I think it is useful to understand real performance problems. It gives a lot of insight into the appeal of flat files for mailboxes. At 50M/s, you can grep an entire 250M mailbox in five seconds. A dumb database design, by contrast, may use a hash table to allocate individual messages effectively at random within the file. Thus, if you try to search through the messages in order, each read will take 15ms or so. Five seconds will give you enough time to search 300 messages, consistent with the two orders of magnitude discrepancy between the typical block size of 4K and the optimum grain size of about 500K for disks.
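The back-of-the-envelope arithmetic checks out:

```python
bw = 50e6                  # streaming bandwidth, bytes/sec
seek = 0.015               # ~15ms per random read (seek + rotation)
mailbox = 250e6            # a 250M flat-file mailbox

grep_time = mailbox / bw             # flat file: one linear scan
assert grep_time == 5.0              # five seconds to grep the lot

random_msgs = grep_time / seek       # messages scattered by a hash table
assert 300 <= random_msgs <= 350     # ~300 messages in the same 5 seconds
```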

Thus, sophisticated file formats have a danger of creating serious performance problems. But I consider that a quantitative problem, one that yields well to careful analysis and design. To me, those are the most fun!

29 May 2002 (updated 29 May 2002 at 08:33 UTC)

I'm reading Linked: The New Science of Networks, by Barabasi. I'll have quite a bit more to say when I'm finished reading it, but in the meantime, if you're interested in networks, I can highly recommend it. In particular, if you're trying to do p2p, then run, do not walk, to your friendly local independent bookstore.

Advogato and community

anselm: you raise good questions. With luck, Advogato can become more vital without sacrificing its thoughtful tone.

In any case, I think the secret to success in online community is for it to be firmly based on real, human community. That's sometimes tricky in the highly geographically dispersed world of free software, but worth cultivating.

Why Athshe will not be based on XML

Athshe is a (completely vaporware) suite of tools for tree-structured data, including an API for tree access, a simple serialized file format, a more sophisticated btree-based random access file format, and a network protocol for remote tree access and update. Everybody is doing trees these days, because XML is hot. Yet, I do not plan to base Athshe on XML. Why?

In short, because it makes sense for Athshe to work with a simpler, lower level data language. The simplicity is a huge win because Athshe will take on quite a bit of complexity trying to optimize performance and concurrency (transactions). Also, XML is weak at doing associative arrays, and I think Athshe should be strong there. Lastly, the goals of Athshe are focussed on performance, which is a bit of an impedance mismatch with most of the XML community.

I hardly feel I have to justify the goal of simplicity - it's such an obvious win. So I won't.

The children of an XML element are arranged in a sequence. However, in a filesystem, the children of a directory are named; the directory is an associative array mapping names to child inodes. I believe this to be an incredibly useful and powerful primitive to expose. Many applications can use associative arrays to good advantage if they are available. One example is mod_virgule, which leverages the associative array provided by the filesystem to store the acct/<username> information.

XML (or, more precisely, DOM) actually does contain associative array nodes. Unfortunately, these are Attribute nodes, so their children are constrained to be leaves. So you get the complexity without the advantages :)

A very common technique is to simulate an associative array by using a sequence of key/value pairs. This is essentially the same concept as the property list from Lisp lore. XPath even defines syntax for doing plist lookup; for example, child::para[attribute::type="warning"] selects all para children of the context node that have a type attribute with value warning. However, mandating this simulation has obvious performance problems, and may also make it harder to get the desired semantics on concurrent updates. In particular, two updates to different keys in an associative array need not interfere, but two updates to the plist simulation very likely will.
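In Python terms (hypothetical data, just to illustrate the difference):

```python
# Plist simulation: an associative array flattened into a sequence of
# key/value pairs. Lookup is a linear scan, and every update touches
# the one shared sequence.
plist = ['type', 'warning', 'lang', 'en']

def plist_get(pl, key):
    for i in range(0, len(pl), 2):
        if pl[i] == key:
            return pl[i + 1]
    return None

assert plist_get(plist, 'type') == 'warning'

# A native associative array: constant-time lookup, and updates to
# different keys are naturally independent.
attrs = {'type': 'warning', 'lang': 'en'}
attrs['lang'] = 'fr'          # doesn't touch the 'type' entry at all
assert attrs['type'] == 'warning'
```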

Nonetheless, this concept of simulation is very powerful. I believe it is the answer to the "impedance mismatch" referenced above. Athshe's language consists of only strings, lists, associative arrays, and simple metadata tags. It's not at all hard to imagine mapping "baby XML" into this language. In Python syntax, you'd end up with something like:

<p>A <a href="#foo">link</a> and some <b>bold</b> text.</p>
['p', {}, 'A ', ['a', {'href': '#foo'}, 'link'], ' and some ', ['b', {}, 'bold'], ' text.']

With a little more imagination, you could get DTD, attributes, entities, processing instructions, and CDATA sections in there.
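For the baby-XML fragment above, the mapping can be written in a few lines using Python's standard library (a sketch of the idea, not Athshe code):

```python
import xml.etree.ElementTree as ET

def to_tree(elem):
    """Map an element to ['tag', {attrs}, child-or-text, ...]."""
    out = [elem.tag, dict(elem.attrib)]
    if elem.text:
        out.append(elem.text)
    for child in elem:
        out.append(to_tree(child))
        if child.tail:                # text following the child element
            out.append(child.tail)
    return out

doc = '<p>A <a href="#foo">link</a> and some <b>bold</b> text.</p>'
assert to_tree(ET.fromstring(doc)) == \
    ['p', {}, 'A ', ['a', {'href': '#foo'}, 'link'],
     ' and some ', ['b', {}, 'bold'], ' text.']
```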

In fact, mappings like this are a common pattern in computer science. They resemble the use of an intermediate language (or virtual machine) in compilers. These types of mappings also tend to be interfaces or boundaries between communities. Often, the lower-level side has a more quantitative, performance-oriented approach, while the higher-level side is more concerned with abstraction. Cisco giveth, and the W3C taketh away :)

Credit where credit is due

My recent piece on link encryption drew deeply on an IRC conversation with Roger Dingledine and Bram, and Zooko provided the OCB link.

A good application for an attack-resistant trust metric

A best-seller on Amazon.

Note especially how the shill reviews have very high "x out of y people found the following review helpful" ratings. A perfect example of a system which is not attack resistant. Reminds me of the /. moderation system :)

Thanks to Will Cox for the link.


I've been busy hacking on Well Tempered Screening. I've got it random-access, and I've coded up the enumeration so you can use PostScript code to define the spot function. I still have to grapple with the "device color" internals, which intimidate me.

On a parallel track, the DeviceN code tree, which up to now has lived in a private branch, is shaping up fairly nicely, at least using regression testing as a metric. We should be able to get it checked into HEAD soon.

Web UI's and robustness

Yesterday, I linked a claim by Dave Winer that users like to use Web browsers as their UI, and a note of frustration by Paul Snively, who lost some work on his blog, and blamed the "Web browser as UI" approach. Paul has since retracted his original post, but I think there's a deeper issue that deserves to be explored. I've lost a fair amount of work posting to Advogato as well, so let that be the whipping boy rather than Radio.

Using the Web browser for the UI has a different set of design tradeoffs than a traditional desktop app. Some things get easier, others get harder. Lots of things can lead to a bad (or good) user experience, including things not foreseen by the software's designer. I know I didn't pay a great deal of attention to robustness when designing Advogato.

I'm not going to try to catalog all the advantages and disadvantages of Web-based UI's - that's a daunting task. Instead, I'll focus on robustness, or the risk of losing work.

I used to lose coding work fairly frequently. Now I don't, because I do lots of things to improve robustness. First, I use an editor with good robustness features. Second, I check my work into a remote CVS server. Third, I back up the repository to CD-R every week. I also frequently copy local working files to another machine on my network, and send patches to mailing lists. As a result, it's been quite a while since I've lost any coding work.

I still lose Advogato posts, though, most recently about a week ago. Why? For one, Netscape 4.7x's text entry box doesn't have any of the paranoia that real editors have about losing work. In fact, pressing Alt-Q (the same command as "format paragraph" in Emacs) is a quick shortcut for "lose all state and quit". This could be fixed in the client, but as the designer of Advogato, I don't have much say about that.

There is more I could do but don't, though. I could make the "Preview" button store the draft in a log, persistent for a day or so and then garbage collected. You could, if you chose, edit in the browser and click "Preview" regularly, much as I regularly press Ctrl-X Ctrl-S in Emacs. In fact, I rather like this idea, as it has other advantages. For example, you could share the draft link with others.

Similarly, I could implement something like version control for diary entries. When you post a new edit, it saves the old version (or, perhaps, the diffs) somewhere. Thus, if you clobber your work (as I did a few days ago), you can get it back. Again, there are other advantages to this approach. This is basically ReversibleChange, one of the "soft security" ideas popular in the Wiki world.

A very high-tech client might even be able to implement something analogous to the autosave feature of Emacs. It would periodically upload the current state of the text entry to the server for safekeeping. However, this would require some pretty major changes to the way the Web works, so I'm not holding my breath.

In the meantime, there are alternatives. For one, it's possible to use an XML-RPC client instead of a Web browser. Many of these clients encourage you to use your favorite editor, which helps with robustness. The client could also keep a log of everything submitted. Such an approach would be complementary to the server-side tweaks I mentioned above.

For now, I generally write my diary entries in Emacs, then simply cut-and-paste into the browser. It's not perfect, but it works.


A Quaker friend of ours, Ricki Anne Jones, invited us (Heather, the kids, and I) to today's A's game. She gets a luxury box every year because she's such a loyal fan, and she invites a bunch of her friends. It was fun. Alan and Max had another kid to play with, and they avoided melting down, so this was the first time they'd lasted through the entire game.


Dan Brickley of the W3C showed up on #p2p-hackers tonight. We talked about AaronSw's manifesto, among other things. It was a fairly pleasant conversation, but I still feel that the W3C is pretty badly broken. Dan encouraged me to write up some of my Athshe stuff (I was trying to talk about it online). As a result, my next Athshe blog will be "why Athshe will not be based on XML."

New computer

I ordered my new dual-Athlon from KC Computers. I'll keep a running log, especially in case other people want to follow the same recipe.

The final price was about $1600. Picking the absolute bottom price from Pricewatch, the parts add up to around $1000, not including assembly and shipping. I consider it money well spent, because I figure I have a considerably lower risk of something going wrong and eating up lots of my time.

Even so, I've been quite satisfied every time I've bought something off Pricewatch, even when the prices seem too good to be true. Good reputation information about the sellers (not unlike what ebay does) would seem to decrease the risk even more.

There are lots of things that can go wrong. The seller could turn out to be shady. The parts could turn out to be defective, possibly seconds or returned merchandise. The parts could be completely legitimate, but of low quality (like the infamous IBM Deskstar 75GXP drives). They could be individually ok, but subtly incompatible with each other, apparently a very serious problem with early Athlon platforms. They could be just fine, but not well supported by the operating system. This last problem has a wide range of variability, as it varies depending on the OS flavor. Something like recompiling the kernel to update a driver may be perfectly reasonable for a sophisticated user, but out of reach for others.

In all these cases, good metadata could help. If I knew I was getting parts with a good chance of working well in the system, I'd have no problem with going through Pricewatch-style vendors, and wielding the screwdriver myself.

Such a metadata system could be quite high-tech. For one, it could compute total cost, including aggregation of shipping and sales tax. It could take shipping delay into account, as well. Optimizing this sounds like a dramatically scaled-down version of the ITA fare calculation used on Orbitz. It's appealing because it maps to minimizing labor and energy costs in the real world, not just getting the best outcome of a game.

You could also do stuff like autogenerating recipes (select this BIOS option, set hdparm to that, etc), and incorporating feedback from others with similar configurations. Even more extreme would be to customize a distribution. Custom kernels, in particular, seem like a win.

A huge part of the value of a brand (such as IBM or Dell) is the QA work they do, essentially creating metadata about reliability and compatibility as part of building and delivering systems. Even so, the assurance is far from absolute. For example, IBM Thinkpad 600's have a defective battery design, causing them to die too early. Metadata from TP600 owners may be more useful input to a buying decision than "IBM is a good brand".

Another reason to believe that a high-tech metadata is useful is the huge variability in the needs of users who run free software. One size most definitely does not fit all. This, I think, is one reason why companies such as VA Research have had such a difficult time being competitive.

There was a lot of talk about "mass customization" being part of the new economy, but not much follow-through. Most dot-com retailers were little more than mail order outfits that happened to publish their catalog through the Web rather than on paper (in fact, many have since added paper catalogs to their mix).

I'm certainly not going to put this kind of metadata system together myself, but I do think it would make an interesting project for someone. Clearly, this type of service would be worth real money. I'm not alone in believing that metadata is important. Amazon has very high quality metadata about books, and that's why their site is so valuable.

26 May 2002 (updated 26 May 2002 at 08:50 UTC)
Advice to young standards authors

aaronsw posted The Standards Manifesto a few days ago. In it, he expresses frustration with the W3C. I can certainly identify.

I've had a fair amount of experience with standards processes over the years. I'm sure I'll have more. Most, but not all, of the experiences have been unpleasant. Standards are incredibly important, but not much fun. Standards committees, in particular, tend to be tedious, and not very smart (even if the individual members of the committee are). In fact, standards committees are in many ways the least qualified entities to be writing standards.

Designing good standards is an art, and an underappreciated one to boot. One of the most important quality metrics is complexity. A standard should be as free from unneeded complexity as possible. The cost of a spec goes up by roughly an order of magnitude from spec to prototype, and again from prototype to production-quality implementation. It's too easy for glorified technical writers to come up with a new feature if they don't have to implement it themselves.

Standards reflect the process that creates them. The biggest problem with standards processes is lack of openness. There is a spectrum, ranging from complete transparency (the IETF at its best, where everybody is welcome to show up at meetings, and "rough consensus and working code" carries the day), to evil cartels such as the DVD CCA and the BPDG. In these extreme cases, only members are allowed to participate, only members are allowed to see the spec, and there are NDA's and all kinds of legal restraints to "protect" implementations of the standard. The W3C is somewhere in the middle of this continuum. In general, you have to be a paying member to participate, the deliberations are private (and NDA'd), but the resulting spec is public, comments from the public are solicited, and there are (usually) no patent royalties. The W3C seemed to be headed down a darker path, including promotion of patent licensing, but to their credit they responded to public outcry and backed off.

It is often said that "the great thing about standards is that there are so many to choose from." I propose that the same is true of standards processes. Small, motivated groups (or even individuals) can and should make standards. In fact, their work is often the most important. Examples abound. While there was a standards committee around JPEG, the really important work was done by the IJG (mostly Thomas Lane), which standardized a patent-free subset of JPEG and produced a very high quality free implementation. Sockets, developed by Bill Joy in the early eighties and quick to become the universal API for the Internet, were ignored by standards committees until just a few years ago. Committees tended to favor things like XTI, now mercifully dead.

Standards bodies are reasonably good at codifying existing practice. They suck at doing research. A good process winnows needless complexity from a standard, focussing on the essence of the problem. It's almost a quantitative science, as different approaches to the same problem may differ quite significantly in complexity.

A naive person might assume that building a global information network, scaling from coke machines to supercomputers, would be a harder problem than, say, synchronizing audio and video in multimedia. Yet, the TCP/IP documents (RFC's 793 and 791) weigh in at about 128 pages and are reasonably complete, while SMIL is about 500 pages, and includes lots of other standards by reference.

The economic incentives for closed, proprietary (and complex) standards are powerful. Who would spend grueling hours in a standards committee to help create a beautiful, free, simple standard, out of pure altruism? In fact, much of this work is similar to writing free code, but it tends to be quite a bit less fun.

I think the salvation lies in making the creation of new standards more fun. I'm not sure the best way to do this, but can offer my own experiences. The most fun I've had in a standards process has been IJS. It was in many ways a purposeful experiment in process. I deliberately narrowed the scope (things like UI and i18n are not included), while still trying to solve an important problem (making it easy to create printer drivers decoupled from the rasterization engine). I also acted as a dictator with respect to the spec. I didn't include suggestions from the mailing list until I was convinced that they were necessary, and properly thought through. Another key part of the process was the reference implementation, not merely free but also designed explicitly to adapt easily to other people's drivers rather than impose my own framework.

Also important was the fact that IJS built on the work done by HPIJS, a working protocol and codebase that merely had a few portability issues and some aspects specific to HP printers. I didn't have to take on a whole research project.

IJS is of course not perfect, but I do think it's doing its job. My time working on it is justified in the improved user experience and decreased support load, and it was kinda fun. (see, it wasn't altruism, it was enlightened self-interest :) The next time I get involved in a standards process, it's going to look a lot more like IJS than a W3C working group.

So, my advice to Aaron? First, by all means seek a better process than the W3C's. I have no doubts that such a process can be found. Second, be clear on whether the task is primarily research into what should be standardized, or codifying a well-understood domain. In the former case, it makes sense to just go do stuff. In the latter case, finding consensus is more important. Third, strive for simplicity. Feel free to ignore those who wish to add their own pet complication, especially if they don't share your vision.

Last, and perhaps most important, treat it as something that's supposed to be fun. Don't get too emotionally wrapped up, especially over the question of whether the rest of the world is getting on board. If you create something really good, it will have an impact. If the standard is simple, it will be expedient for others to implement.

Selected quotes from other blogs

From Dave Winer's Scripting News:

Anyway, we went far and wide and swung around to desktop websites, a subject near and dear to my heart. He wondered why more Mac developers weren't using the combo of PHP and Apache that comes bundled with every Mac. I think it's just a matter of time before Unix developers get there. Users like apps that run in the browser.

From Paul Snively's Gadfly Redux:

OK, this is twice now that I've had a ton of news queued up, posted some things, and... *poof*. Dozens of news items gone. Thank God I know that three pressing ones are actually comments from Paul Prescod and are safely ensconced with YACCS. But there are stacks of other things I wanted to respond to, and they're gone.

You know, I try to be patient. I try to be reasonable. I get a chuckle out of it when Dave says in best ha-ha-only-serious fashion that they write software that sucks. But Radio has a combination of serious architectural flaws: a browser interface that allows the browser's notion of object identity and the database's to get out of sync, possibly due to browser page caching and navigation using the back and forward buttons; and the lack of transactions in the underlying database. Sooner or later, this combination will result in what I've now seen twice.

I'd be a lot happier if Radio would just be an ordinary desktop application. Editing in the browser isn't a win, especially on the Mac. I'm totally with the local flyweight server idea. It's just that I want a well-integrated, rich client to go with it.

I myself have lost plenty of work editing in a browser. Here, I think, we have the classic quality tradeoff between a universal client, and one optimized for a specific task. This is one of the reasons I'm so happy to see all the client work happening around Advogato :)

