Linux or Mac
David McCusker wants a development machine for home. I think he would
be happy with either a Linux box or a Mac running OS X. In the former
case, he'll have to spend some time dealing with Linux weirdness; in
the latter, with OS X weirdness.
If the goal is raw price/performance, Wintel is a clear win. Building
from the cheapest parts on Pricewatch costs somewhere around 50% of a
comparable-spec Mac. But Macs are arguably more elegant than Wintel.
And if stuff like having digital cameras Just Work when you plug them
in is important, there is no contest.
When I ordered spectre,
raw
horsepower was in fact a driving factor. I need to do
regression
testing as quickly as possible. Otherwise, I would have been
sorely
tempted to get a Mac.
David also expresses the wish to do cross-platform GUI
coding.
Unfortunately, there's no good story here. Drawing is only a
small
part of the overall problem, so using something like OpenGL
won't help
much. I do expect the situation to improve over time.
wxWindows is probably closest, and has OS X support in its development
versions.
What is the optimum grain size?
One of the most important factors affecting performance of a
tree
access protocol is grain size. Here I'll present a
very
simplistic analysis. A lot of people never bother to do any.
How do you slice up a tree into grains? I propose to
serialize the
tree, chop the serialization into fixed-size blocks (leaf
nodes), and
use a btree-like structure to tie this together. The cool
new idea is
to count parentheses in the btree nodes. This lets you fly
around the
tree without having to dig into the leaves.
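To make the parenthesis-counting idea concrete, here's a minimal
sketch in Python. The names and the exact summary fields are my own
invention, not a settled design: each leaf block of the serialization
gets summarized by its net parenthesis count and its deepest downward
excursion, and those summaries alone are enough to find where a
subtree ends.

    # Sketch only: assumes the tree is serialized with '(' and ')'
    # marks and chopped into fixed-size leaf blocks.  The summaries
    # would live in the btree index nodes.

    def summarize(block):
        """Return (net, low): opens minus closes over the block, and
        the lowest running count reached inside it."""
        net = low = 0
        for ch in block:
            if ch == ord('('):
                net += 1
            elif ch == ord(')'):
                net -= 1
            low = min(low, net)
        return net, low

    def subtree_end(summaries, start, depth=1):
        """Given running depth `depth` at the start of block `start`,
        return the index of the block containing the close paren that
        brings the depth back to zero -- touching no leaf bytes."""
        for i in range(start, len(summaries)):
            net, low = summaries[i]
            if depth + low <= 0:
                return i
            depth += net
        raise ValueError("unbalanced serialization")

Skipping a subtree then costs one index entry per block skipped, not
one byte per byte.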
There are other ways to do it, of course. You can use one
grain per
node. You can use one grain for the whole tree, perfectly
reasonable
if the tree is smallish. You can also try to aggregate
subtrees into
grains. On disk, the advantage of fixed-size blocks is good
utilization. On a network, you don't care about utilization
of blocks,
but you still might care about a large variance in grain
size.
For simplicity, let's assume that the tree is read-only. We
will
analyze two usage patterns. The first is simply to traverse
the whole
tree. The second is to navigate to random nodes in sequence.
Traversing the tree means scanning the serialization
linearly. This is
cool. You only touch each block once. Assume that the time
to fetch a
block is a latency value, plus the size of the block divided
by the
bandwidth. Total time for traversing the tree is (tree size
in bytes)
/ (bandwidth in bytes per second) + (tree size in bytes) *
latency /
(block size in bytes). It is easy to see that large blocks
are more
efficient.
The random node case is different. When you fetch a block,
only a
small part of it is relevant. The rest is waste. The time to
fetch a
node is latency + (block size in bytes) / bandwidth. This
time, small
blocks are more efficient.
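These two cost models are simple enough to write down directly. Here's
a quick Python rendering (variable names are mine): S is tree size in
bytes, b is block size in bytes, B is bandwidth in bytes per second,
and L is latency in seconds.

    def traverse_time(S, b, B, L):
        # one linear scan: pure transfer time, plus one latency hit
        # for each of the S/b blocks
        return S / B + (S / b) * L

    def random_node_time(b, B, L):
        # fetching one node costs one round trip plus one full block,
        # most of which is waste
        return L + b / B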
What is the optimum grain size, then, for a mixture of both
cases?
When (block size) = latency * bandwidth, both individual cases are
exactly a factor of two slower than their limiting best cases
(infinitely large blocks in the case of a traversal, infinitely small
blocks in the case of random nodes). Thus, the optimum will be on the
order of latency * bandwidth.
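Plugging in the disk numbers from the table below (10ms latency,
50M/s bandwidth) checks the factor-of-two claim:

    L, B = 0.010, 50e6       # disk: 10ms latency, 50M/s bandwidth
    b = L * B                # 500K, the claimed optimum
    S = 100e6                # an arbitrary 100M tree

    print(traverse_time(S, b, B, L) / (S / B))  # 2.0x the infinite-block best case
    print(random_node_time(b, B, L) / L)        # 2.0x the tiny-block best case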
What is latency * bandwidth for real devices? Here's a quick
table.
Don't worry about small inaccuracies. We're trying to get
the order of
magnitude right. This is just disk and network. Memory
hierarchy is
important too, but the analysis is considerably different,
so I won't
do that tonight.
    device                      latency    max bandwidth    latency * bw
    modern disk                 10 ms      50M/s            500K
    802.11b wireless            2.5 ms     0.5M/s           1.25K
    modem                       50 ms      4K/s             0.2K
    100Mb LAN                   0.3 ms     10M/s            3K
    DSL down, nearby server     20 ms      100K/s           2K
    DSL up, nearby server       20 ms      10K/s            0.2K
    DSL down, international     200 ms     100K/s           20K
    DSL up, international       200 ms     10K/s            2K
In Unix, the traditional block size is 4K. It's interesting
that this
value is not far from the mark for networks, even over a
very broad
range of performance. So the traditional block size is
actually still
reasonable.
Disk is the outlier. What's more, latency * bandwidth scales very
roughly as the square root of areal density, and areal density is
scaling like mad. Disk's latency * bandwidth used to be around 4K
too, but that was a long time ago.
But 500K is two whole orders of magnitude bigger. If we
believe this
analysis, then access to a tree on disk will spend 99% of
the time
seeking, and 1% accessing useful data. That would be bad.
The story is a bit more complex, though. Real disk-based systems spend
a huge amount of effort trying to increase locality, meaning the
clustering of items likely to be accessed together. If this effort is
successful, then the effective grain size goes up.
Caching with
prefetching is a particularly effective technique. Modern OS
kernels
implement prefetching, and so do drives. In fact, when you
request a
4K block from a drive, it will usually spend on the order of
10ms
seeking to the right track, then 5ms waiting for the disk to
spin to
the right sector. Given a typical raw transfer rate of
50M/s, that
means 250K or so of data will fly past the read head. In a
modern
disk, all that goes into a cache. Then, when the kernel
requests
blocks from that range, it gets them immediately.
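The arithmetic behind that 250K figure is worth a glance:

    bw = 50e6          # raw transfer rate, bytes/second
    rot = 0.005        # ~5ms of rotational delay
    print(bw * rot)    # 250000.0 -- ~250K passes the head for free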
So, to do a btree efficiently, you have (at least) two
choices. You
could specify a really large block size, and not worry about
the order
of blocks on the disk. Another method is to use a small
block size,
but try hard to cluster nearby blocks. This demands more
intelligence
when allocating blocks within the file. It's also well known
from
filesystems that it's hard to avoid fragmentation when the
utilization
is very high. When there is ample free space, there are more
candidates to choose from to try to optimize locality.
Of course, in the read-only case, you can allocate the
blocks exactly
in traversal order. In this case, 4K blocks are again
reasonable. The
problem is avoiding fragmentation as you start updating the
tree.
David McCusker talks about this a bit in his database writings. There,
he suggests 2K blocks, but allocation logic
that
works in grains of 16K. That's still not good enough (by
more than an
order of magnitude) if the grains are randomly scattered in
the file.
Maybe he's doing something else to try to cluster grains;
it's not
clear to me. But I do believe it is a tricky problem.
The network version is in many ways simpler (this is a
relief, because
in other important ways it is harder). You don't have to
worry about
locality, as there really is no such concept. The network
latency is
the same no matter which block you request. You also don't
have to
worry as much about utilization, because it's possible to
simply skip
sending unused byte ranges. Blocks can be variable in size,
too,
unlike the fixed blocks of disks.
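As a sketch of what that buys you (fetch_ranges here is an assumed
transport primitive, not any real protocol): a client can look up the
byte ranges it needs in an index, coalesce adjacent ones, and pull
them in a single round trip, with the unused gaps simply never sent.

    def fetch_nodes(wanted, index, fetch_ranges):
        # index maps a node id to its (offset, length) in the
        # serialization; grains can be any size at all
        ranges = sorted(index[n] for n in wanted)
        merged = []
        for off, ln in ranges:
            if merged and off <= merged[-1][0] + merged[-1][1]:
                # overlaps or abuts the previous range: extend it
                start, length = merged[-1]
                merged[-1] = (start, max(length, off + ln - start))
            else:
                merged.append((off, ln))
        return fetch_ranges(merged)   # one latency hit, zero waste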
As I warned, this analysis is oversimplified. Yet, I think
it is
useful to understand real performance problems. It gives a
lot of
insight into the appeal of flat files for mailboxes. At
50M/s, you can
grep an entire 250M mailbox in five seconds. A dumb database
design,
by contrast, may use a hash table to allocate individual
messages
effectively at random within the file. Thus, if you try to
search
through the messages in order, each read will take 15ms or
so. Five
seconds will give you enough time to search 300 messages,
consistent
with the two orders of magnitude discrepancy between the
typical block
size of 4K and the optimum grain size of about 500K for
disks.
Thus, sophisticated file formats have a danger of creating
serious
performance problems. But I consider that a quantitative
problem, one
that yields well to careful analysis and design. To me,
those are the
most fun!