31 May 2002 raph

Farmer's market

We took the kids to the Benicia Farmer's Market for the first time this season. Max loved the pony ride and petting zoo. Alan has become good friends with Lacey, the 9yo daughter of the family that brings the animals, and they had a great time running around together. Alan also ran into about 4 kids he knows from school. It was pleasant weather, the sunset was beautiful, and it was altogether a very nice evening.

More on grain size

See David McCusker's response to yesterday's posting on grain size.

A couple of people pointed out that my estimate for modem latency was too low. Assume 150ms latency and 4K/s throughput; that's a bandwidth-delay product of 0.15s × 4K/s = 0.6K. That actually makes the clustering around 4K even better.

I've been thinking some more about the relatively large "optimum" grain size for disks. David has obviously put a lot of thought into the problem. What I'll present tonight is a simple technique, optimized for ease of analysis. I'm sure that better techniques are possible, and IronDoc may already implement them. But this is a lecture, not a product :)

Assume 4K blocks and 1M grains, so a grain contains 256 blocks. Coincidentally, if each child reference takes 16 bytes, that's the maximum number of children each internal node can have. The minimum number is half the maximum. That's a big part of what makes it a btree.
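
To make the numbers concrete, here's a minimal sketch of the parameters in Python (the constant names are mine, not part of any real implementation):

    BLOCK_SIZE = 4 * 1024        # 4K blocks
    GRAIN_SIZE = 1024 * 1024     # 1M grains
    CHILD_REF_SIZE = 16          # bytes per child reference

    BLOCKS_PER_GRAIN = GRAIN_SIZE // BLOCK_SIZE     # 256
    MAX_CHILDREN = BLOCK_SIZE // CHILD_REF_SIZE     # also 256
    MIN_CHILDREN = MAX_CHILDREN // 2                # 128, the btree invariant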

In a large tree, each subtree consisting of a node one level up from the bottom, together with its immediate children (all of them leaf nodes), gets its own grain. Grains are fixed size. The file consists of a sequence of grains; they're not interleaved at all. Maybe the file is a little shorter because the last grain isn't completely full.
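
Because grains are fixed size and laid out back to back, addressing is plain arithmetic. A sketch, reusing the constants above:

    def grain_offset(grain_index):
        # Grain i starts at byte i * GRAIN_SIZE; grains are never interleaved.
        return grain_index * GRAIN_SIZE

    def block_offset(grain_index, block_index):
        # A block is addressed by its grain plus its slot within the grain.
        return grain_offset(grain_index) + block_index * BLOCK_SIZE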

Now we can analyze the utilization. For files smaller than one grain (1M), utilization is the same as for classical btrees: 1/2 in the worst case. For larger files, the utilization of blocks within a grain is also 1/2 in the worst case, so the total worst-case utilization is 1/2 × 1/2 = 1/4.

A grain can contain free blocks, but the file need not store any free grains. If you want to delete a grain, you copy the last grain in the file over it, then truncate the file. As a result, the order of grains within the file can become totally random, but you really don't care. The grain is large enough that one disk seek per grain isn't a killer.
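
Here's a sketch of that delete-by-swapping in Python, using GRAIN_SIZE from above. One detail it glosses over: whatever references the moved grain (the pointer to that subtree) has to be updated to its new index; I'm assuming that bookkeeping happens elsewhere.

    def delete_grain(path, victim, grain_count):
        # Copy the last grain over the victim, then truncate by one grain.
        with open(path, "r+b") as f:
            last = grain_count - 1
            if victim != last:
                f.seek(last * GRAIN_SIZE)
                data = f.read(GRAIN_SIZE)
                f.seek(victim * GRAIN_SIZE)
                f.write(data)
            f.truncate(last * GRAIN_SIZE)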

There are tricks you can do to improve utilization. One special case bears mentioning: the append-only scenario. In the classical btree algorithm, appending gives you worst-case utilization: when you overflow a block, you split it in half, and the first half never gets touched again, so utilization stays at 1/2. It's pretty easy to tweak the split so that in this case utilization is near unity. It's a common enough case that I think this tweak is sufficient. Remember, 1/4 is only the worst case, and disk is really cheap.
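
A sketch of the tweaked split (my formulation of it, for sorted keys in a full block):

    import bisect

    def split_block(keys, new_key):
        # keys is a sorted, full block that new_key would overflow.
        if new_key > keys[-1]:
            # Append-only tweak: keep the old block full and start a
            # fresh one, so already-written blocks stay near 100% utilized.
            return keys, [new_key]
        # Classical split: halve, then insert into the appropriate half.
        mid = len(keys) // 2
        left, right = keys[:mid], keys[mid:]
        target = left if new_key < right[0] else right
        bisect.insort(target, new_key)
        return left, right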

As an order-of-magnitude ballpark, reading or writing an entire grain should take about 40ms; reading or writing an individual block, around 20ms. Sometimes you'll need to do an operation on an entire grain, for example when splitting or merging a subtree one level up from the bottom.
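
For what it's worth, here's one set of assumed disk numbers that reproduces those ballparks (the seek and transfer figures are my guesses for a circa-2002 drive, not anything stated above):

    SEEK_MS = 20.0                  # assumed seek + rotational latency
    BYTES_PER_MS = 50e6 / 1000.0    # assumed ~50MB/s sustained transfer

    def io_ms(nbytes):
        return SEEK_MS + nbytes / BYTES_PER_MS

    print(io_ms(4 * 1024))      # one block: ~20ms, seek-dominated
    print(io_ms(1024 * 1024))   # one grain: ~41ms, seek plus transfer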

I'm not trying to make the argument that this is the best file layout. I'm sure there are better tradeoffs, preserving similar locality while improving utilization. But I do like things that are simple to understand. See also McCusker on engineering design tradeoffs, short and well worth reading.

XOR metric

The XOR metric as proposed in the Kademlia work has a lot of people excited. I've been thinking about similar things in the context of my PhD research for about 3 years now. But none of my thinking was really written down anywhere, much less published. There's some discussion on the bluesky list of other similar ideas in the literature, as well.

The Kademlia people have actually done the work and published the paper. Perhaps just as important as the design is their analysis. Designers of p2p networks tend to be very optimistic that the resulting real-world network will have the desired properties. I'm pleased to see p2p design moving toward real quantitative design.

The XOR metric is good in very large nets. You assign an ID to each node, for example the hash of its public key. Suppose that you distribute blocks among the nodes based on the hash of the block, so that a block is likely to be found in nodes with "nearby" hashes. Then, if you know the hash of a block, you have the problem of finding a nearby node.

If the net is reasonably small, then maybe you broadcast contact information for new nodes joining. That way, all nodes know about all other nodes, so the problem is simple. But as the net scales up, this strategy won't work as well.

Hence the XOR metric. Simplified, the "nearness" of two ids is the number of bits in common in the prefix. So, the nearness of "10100..." and "10010..." is 2, because the first two bits (but not the first three) are identical. Each node now keeps track of a small number of other nodes, as few as one for each distinct "nearness" value (which scales as lg N).
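
In code, nearness is just the id width minus the bit length of the XOR of the two ids: the XOR zeroes out the common prefix, so its bit length marks the first differing bit. A minimal sketch in modern Python, defaulting to 160-bit ids as Kademlia uses:

    def nearness(a, b, id_bits=160):
        # Number of leading bits the two ids share.
        return id_bits - (a ^ b).bit_length()

    nearness(0b10100, 0b10010, id_bits=5)   # -> 2, matching the example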

Now if you have a target id in hand, keep iterating this step: choose the closest node to the target id that you know of, and ask it for contact information for an even closer node. You'll reach the closest node within lg N steps, since each step increases the nearness by at least one. A picture of the analysis resembles walking down a binary tree.
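
A sketch of that loop, assuming a hypothetical node interface with an id field and a closest_known(target) method that consults its routing table (and reusing nearness from above):

    def lookup(start, target_id):
        # Walk toward the target: repeatedly ask the closest node we know
        # of for someone closer still. Each hop extends the shared prefix
        # by at least one bit, so this converges within lg N steps.
        node = start
        while True:
            candidate = node.closest_known(target_id)
            if nearness(candidate.id, target_id) <= nearness(node.id, target_id):
                return node     # nothing closer: this is the nearest node
            node = candidate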

Many p2p network designs need to find arbitrary other nodes, and are expected to scale up. Such designs have two choices: use an algorithm such as Kademlia's (or one based on Chord or Tapestry, which were the inspirations for Kademlia) to find short routes, or pray. It's a good test, I think, of the health of the project.

The stamp-trading network described in Chapter 7 of my thesis is another twist on this idea. The lg N topology is similar, but requests flow only along trusted links in the network. It's an open question whether it will in fact find short routes given this additional constraint.

(Thanks to Roger Dingledine for comments on an earlier draft, and to Zooko for good discussion and links. Responsibility for oversimplification remains mine.)

casper

casper.ghostscript.com seems to be down. I hope it's just because transbay.net, the colo host, is moving, and that it'll be up again in the morning. Ah well.
