30 May 2002

Linux or Mac

David McCusker wants a development machine for home. I think he would be happy with either a Linux box or a Mac running OS X. In the former case, he'll have to spend some time dealing with Linux weirdness; in the latter, with OS X weirdness.

If the goal is raw price/performance, Wintel is a clear win. A box built from the cheapest parts on Pricewatch costs somewhere around 50% of a comparable-spec Mac. But Macs are arguably more elegant than Wintel boxes. And if stuff like having digital cameras Just Work when you plug them in is important, there is no contest.

When I ordered spectre, raw horsepower was in fact a driving factor. I need to do regression testing as quickly as possible. Otherwise, I would have been sorely tempted to get a Mac.

David also expresses the wish to do cross-platform GUI coding. Unfortunately, there's no good story here. Drawing is only a small part of the overall problem, so using something like OpenGL won't help much. I do expect the situation to improve over time. wxWindows is probably the closest, and has OS X support in its development versions.

What is the optimum grain size?

One of the most important factors affecting performance of a tree access protocol is grain size. Here I'll present a very simplistic analysis. A lot of people never bother to do any.

How do you slice up a tree into grains? I propose to serialize the tree, chop the serialization into fixed-size blocks (leaf nodes), and use a btree-like structure to tie this together. The cool new idea is to count parentheses in the btree nodes. This lets you fly around the tree without having to dig into the leaves.
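
To make the parenthesis-counting idea concrete, here's a minimal Python sketch. This is my illustration, not a spec: it serializes the tree as an s-expression, chops the serialization into fixed-size blocks, and stores each block's open/close parenthesis counts in the index. The running balance gives you the tree depth at every block boundary, so you can skip over whole subtrees without reading any leaf data.

    # Sketch only: fixed-size blocks over an s-expression serialization,
    # with per-block parenthesis counts playing the role of btree keys.

    BLOCK_SIZE = 16  # tiny, for illustration; see the analysis below

    def serialize(tree):
        # tree is nested tuples/lists; leaves are strings
        if isinstance(tree, str):
            return tree
        return "(" + " ".join(serialize(c) for c in tree) + ")"

    def build_index(data, block_size=BLOCK_SIZE):
        blocks = [data[i:i + block_size]
                  for i in range(0, len(data), block_size)]
        index = []
        depth = 0
        for b in blocks:
            opens, closes = b.count("("), b.count(")")
            index.append((opens, closes))
            depth += opens - closes
        assert depth == 0  # a balanced serialization ends at depth zero
        return blocks, index

    tree = ("a", ("b", "c"), (("d",), "e"))
    blocks, index = build_index(serialize(tree))
    # To skip a subtree, walk index entries tracking the running balance
    # until it drops back to the starting depth -- no leaf reads needed.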

There are other ways to do it, of course. You can use one grain per node. You can use one grain for the whole tree, perfectly reasonable if the tree is smallish. You can also try to aggregate subtrees into grains. On disk, the advantage of fixed-size blocks is good utilization. On a network, you don't care about utilization of blocks, but you still might care about a large variance in grain size.

For simplicity, let's assume that the tree is read-only. We will analyze two usage patterns. The first is simply to traverse the whole tree. The second is to navigate to random nodes in sequence.

Traversing the tree means scanning the serialization linearly. This is cool. You only touch each block once. Assume that the time to fetch a block is a latency value, plus the size of the block divided by the bandwidth. Total time for traversing the tree is (tree size in bytes) / (bandwidth in bytes per second) + (tree size in bytes) * latency / (block size in bytes). It is easy to see that large blocks are more efficient.

The random node case is different. When you fetch a block, only a small part of it is relevant. The rest is waste. The time to fetch a node is latency + (block size in bytes) / bandwidth. This time, small blocks are more efficient.

What is the optimum grain size, then, for a mixture of both cases? When (block size) = latency * bandwidth, both individual cases are exactly a factor of two slower than their limiting best case (infinitely large blocks in the case of a traversal, infinitely small in the case of random nodes). Thus, the optimum will be on the order of latency * bandwidth.
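
As a sanity check on that factor-of-two claim, here's the arithmetic in Python, using the disk numbers from the table below; the tree size is an arbitrary example value.

    latency = 0.010               # seconds
    bandwidth = 50e6              # bytes/second
    block = latency * bandwidth   # = 500K, the proposed optimum

    tree_size = 250e6             # arbitrary example tree

    # Traversal: stream the whole tree, one latency hit per block.
    traverse = tree_size / bandwidth + (tree_size / block) * latency
    traverse_best = tree_size / bandwidth    # infinitely large blocks
    print(traverse / traverse_best)          # -> 2.0

    # Random node: one latency hit plus one block transfer per fetch.
    fetch = latency + block / bandwidth
    fetch_best = latency                     # infinitely small blocks
    print(fetch / fetch_best)                # -> 2.0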

What is latency * bandwidth for real devices? Here's a quick table. Don't worry about small inaccuracies. We're trying to get the order of magnitude right. This is just disk and network. Memory hierarchy is important too, but the analysis is considerably different, so I won't do that tonight.

modern disk:              latency 10ms,  max bw 50M/s:  500K
wireless net (802.11b):   latency 2.5ms, max bw 0.5M/s: 1.25K
modem:                    latency 50ms,  max bw 4K/s:   0.2K
100Mb lan:                latency 0.3ms, max bw 10M/s:  3K
dsl down, nearby server:  latency 20ms,  max bw 100K/s: 2K
dsl up, nearby server:    latency 20ms,  max bw 10K/s:  0.2K
dsl down, international:  latency 200ms, max bw 100K/s: 20K
dsl up, international:    latency 200ms, max bw 10K/s:  2K
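
If you want to play with the numbers, this little snippet reproduces the last column (treating K and M as powers of ten, which is close enough at this level of accuracy):

    # Latency * bandwidth products for the devices in the table above.
    devices = [
        ("modern disk",        0.010,  50e6),
        ("802.11b",            0.0025, 0.5e6),
        ("modem",              0.050,  4e3),
        ("100Mb lan",          0.0003, 10e6),
        ("dsl down, nearby",   0.020,  100e3),
        ("dsl up, nearby",     0.020,  10e3),
        ("dsl down, intl",     0.200,  100e3),
        ("dsl up, intl",       0.200,  10e3),
    ]
    for name, lat, bw in devices:
        print(f"{name}: {lat * bw / 1e3:.2f}K")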

In Unix, the traditional block size is 4K. It's interesting that this value is not far from the mark for networks, even over a very broad range of performance. So the traditional block size is actually still reasonable.

Disk is the outlier. What's more, disk latency * bandwidth scales very roughly as the square root of areal density, and areal density is scaling like mad. The product used to be around 4K, but that was a long time ago.

But 500K is two whole orders of magnitude bigger than the traditional 4K block. If we believe this analysis, then access to a tree on disk will spend 99% of its time seeking, and 1% accessing useful data. That would be bad.

The story is a bit more complex, though. Real disk-based systems spend a huge amount of effort trying to increase locality, i.e. the clustering of items likely to be accessed together. If this effort is successful, then the effective grain size goes up. Caching with prefetching is a particularly effective technique; modern OS kernels implement prefetching, and so do drives. In fact, when you request a 4K block from a drive, it will usually spend on the order of 10ms seeking to the right track, then 5ms waiting for the disk to spin to the right sector. Given a typical raw transfer rate of 50M/s, that means 250K or so of data will fly past the read head. In a modern disk, all of that goes into a cache. Then, when the kernel requests blocks from that range, it gets them immediately.
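
Putting numbers on that, here is what a single 4K request really costs, and how much data incidentally passes the head during the rotational wait (same assumed figures as above):

    seek, rotation, raw_rate = 0.010, 0.005, 50e6   # s, s, bytes/s
    request = 4096                                   # one 4K block
    print(seek + rotation + request / raw_rate)      # ~0.0151 s per read
    print(rotation * raw_rate)                       # 250000.0 bytes fly past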

So, to do a btree efficiently, you have (at least) two choices. You could specify a really large block size, and not worry about the order of blocks on the disk. Another method is to use a small block size, but try hard to cluster nearby blocks. This demands more intelligence when allocating blocks within the file. It's also well known from filesystems that it's hard to avoid fragmentation when the utilization is very high. When there is ample free space, there are more candidates to choose from to try to optimize locality.
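
As an illustration of the second choice, here's a toy allocator (entirely hypothetical, not how any real filesystem does it) that places a new block as close as possible to a hint, such as its neighbor in traversal order:

    # Toy clustered allocation: pick the free block nearest the hint.
    def alloc_near(free, hint):
        # free: list of free block numbers; hint: desired position
        best = min(free, key=lambda b: abs(b - hint))
        free.remove(best)
        return best

    free_blocks = [3, 4, 9, 10, 11, 40]
    print(alloc_near(free_blocks, 10))   # -> 10, keeping siblings clustered

Note how the quality of the result depends on the free list: with ample free space there's usually a candidate near the hint, but at high utilization you're forced to scatter, which is exactly the fragmentation problem.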

Of course, in the read-only case, you can allocate the blocks exactly in traversal order. In this case, 4K blocks are again reasonable. The problem is avoiding fragmentation as you start updating the tree.

David McCusker talks about this a bit in his database areas. There, he suggests 2K blocks, but allocation logic that works in grains of 16K. That's still not good enough (by more than an order of magnitude) if the grains are randomly scattered in the file. Maybe he's doing something else to try to cluster grains; it's not clear to me. But I do believe it is a tricky problem.

The network version is in many ways simpler (this is a relief, because in other important ways it is harder). You don't have to worry about locality, as there really is no such concept. The network latency is the same no matter which block you request. You also don't have to worry as much about utilization, because it's possible to simply skip sending unused byte ranges. Blocks can be variable in size, too, unlike the fixed blocks of disks.

As I warned, this analysis is oversimplified. Yet I think it is useful for understanding real performance problems. It gives a lot of insight into the appeal of flat files for mailboxes. At 50M/s, you can grep an entire 250M mailbox in five seconds. A dumb database design, by contrast, may use a hash table that allocates individual messages effectively at random within the file. Then, if you try to search through the messages in order, each read will take 15ms or so, and five seconds gives you only enough time to search about 300 messages. That's consistent with the two orders of magnitude discrepancy between the typical 4K block size and the optimum grain size of about 500K for disks.
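
The mailbox arithmetic, spelled out (250M mailbox, 50M/s sequential rate, 15ms per random read, all from the paragraph above):

    mailbox = 250e6          # bytes
    seq_rate = 50e6          # bytes/second, streaming off the disk
    rand_read = 0.015        # seconds per randomly placed message

    grep_time = mailbox / seq_rate
    print(grep_time)                  # -> 5.0 seconds for the flat file
    print(grep_time / rand_read)      # -> ~333 messages in the same time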

Thus, sophisticated file formats have a danger of creating serious performance problems. But I consider that a quantitative problem, one that yields well to careful analysis and design. To me, those are the most fun!
