Older blog entries for Bram (starting at number 28)

Trust Metric Criteria

Confidence values in diary ratings are currently surprisingly low. It turns out that this is due to a serious deficiency in the way diary ratings are calculated, which I will now explain.

Consider the case of a single statement. We have a set of nodes, one of which is 'me', and a directed graph of their certifications. Some nodes 'attest to' the statement in question. We will calculate a confidence level of the statement in the range [0, 1] for each node, with special interest in our own confidence. Diary rankings are more complicated, but this simplified model is sufficient for a full exposition of the current problems.

The current advogato way of calculating confidence is as follows. Start with yourself, and repeatedly hop to a new position by randomly selecting one of the nodes the current position certifies. If you get to a node which attests to the statement, halt. If you get to a dead end, give up. Before each hop, have a 5% chance of giving up. Your confidence is the fraction of the time this procedure doesn't give up.
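
As a rough sketch, the fraction of walks that don't give up can be computed exactly rather than by sampling, since each node's confidence is 0.95 times the average confidence of the nodes it certifies. The function name `walk_confidence`, the dict-of-lists graph format, and the choice to give an attesting start node a confidence of 1 are all assumptions made for illustration, not advogato's actual code.

```python
def walk_confidence(certs, attesters, start, give_up=0.05, sweeps=1000):
    """Probability that the random walk from `start` halts at an attester.

    certs:     dict mapping each node to the list of nodes it certifies
    attesters: set of nodes that attest to the statement
    """
    nodes = set(certs) | set(attesters) | {start}
    for targets in certs.values():
        nodes.update(targets)

    # Attesting nodes halt the walk successfully; everything else starts at 0.
    conf = {n: (1.0 if n in attesters else 0.0) for n in nodes}

    # Repeatedly apply conf(n) = (1 - give_up) * mean(conf of certified nodes).
    # Each sweep shrinks the remaining error by at least the survival factor.
    for _ in range(sweeps):
        for n in nodes:
            if n in attesters:
                continue
            targets = certs.get(n, [])
            if not targets:
                conf[n] = 0.0  # dead end: the walk gives up
            else:
                survive = 1.0 - give_up
                conf[n] = survive * sum(conf[t] for t in targets) / len(targets)
    return conf[start]
```

For a two-hop chain, `walk_confidence({"me": ["a"], "a": ["b"]}, {"b"}, "me")` comes out at about 0.9025, one factor of 0.95 per hop.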

While this technique seems reasonable on its surface, it has a serious deficiency: it fails the following criterion.

the adding certifications criterion - Adding a certification to any graph must not reduce your confidence.

Advogato fails this criterion in even the simplest case. Consider the case where we only cert one other node, and that node attests to the statement. Our confidence will now be 0.95. If we add a cert to a node which doesn't attest to the statement and doesn't cert anyone else, our confidence will fall by a factor of 2, to 0.475. This is clearly undesirable behavior.
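
Spelled out, the arithmetic behind that factor of 2 is as follows, where the subscripts "before" and "after" are just labels chosen here for the graph without and with the extra cert:

```latex
\begin{align*}
c_{\text{before}} &= 0.95 \times 1 = 0.95 \\
c_{\text{after}}  &= 0.95 \times \left( \tfrac{1}{2} \times 1 + \tfrac{1}{2} \times 0 \right) = 0.475
\end{align*}
```

The first hop survives with probability 0.95 in both cases, but after the extra cert only half of the surviving hops land on the attesting node.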

Here are some more criteria, some of which advogato already passes, but all of which are clearly desirable and none of which (I think) imply each other -

adding nodes - If nodes and certs are added in such a way that you have a path along new nodes which ends in a new node which attests to the statement, your confidence must go up. (This eliminates another technique.)

total strangers - If a node you have no path to certifies someone, that must not change your confidence.

extending chain - If you only have one cert and do not personally attest to the statement, your confidence must be less than that of the person you certed.

skepticism - If you do not personally attest to the statement your confidence must be less than 1.

terminators - If a node attests to the statement, additional certs from it to other nodes must not raise your confidence level.

redundant backlinks - If all paths from you to node B must go through node A, then B certifying A must not increase your confidence.

unreachability - If you have no path to a node which attests to the statement, your confidence must be zero.

full attestation - All nodes which attest have confidence one.

side show - If all paths through node B which arrive at a node which attests must pass through some node A both before and after B, then B's certifications must have no effect on your confidence.

That's all I can think of for now; there may be others. Ideally, we would like a set of criteria such that for any two graphs we could unambiguously prove which of them should yield a higher confidence.

Notify Lists

Raph wonders about notify lists. I've spent a lot of time thinking about algorithms for this, but haven't talked about it much, since it turns out that the simple solutions are near optimal.

Unrelated to the technical messaging aspects, I'd like to point out that the most logical form of addressing for instant messaging is person@machine. Reusing email addresses makes the system both easy to use and viral, and doesn't introduce yet another identifier for users to remember and manage.

There are two simple approaches. The first is completely centralized. All users report to a single central machine when they log in, and get data back when their friends come online. This system is very efficient, but it has a single point of failure, although in practice a single site can be made quite reliable. It also doesn't scale very well, although the number of messages involved in online status reports is so small it might hardly matter.

The second basic approach is for each user to have a 'home base' they log into, and for home bases to report interest and online status to each other. This approach is very reliable and scales very nicely, but is inefficient under some circumstances. If a typical user has thousands of friends but only a dozen are online at once, lots of unnecessary interest messages get sent. For notify lists this isn't a big deal, since a large fraction of users are online at all times, and the total number of messages is pretty small anyway, but there might be similar applications where it's a real issue.

Before explaining the hybrid solution, I should explain the inter-home protocol.

There is some reliable ordered transport layer, not necessarily TCP. On top of that, there are four messages, all of them idempotent -

  • I'm interested in user X.

  • I'm not interested in user X.

  • User X is online.

  • User X is not online.

The one subtlety is that if one side changes from not interested to interested, the other side must report online status (not necessarily instantly though, as we'll see below).
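
A minimal sketch of the state one home might keep about a single peer home, assuming the four messages are represented as `(kind, user)` pairs. The class name `PeerLink` and its fields are hypothetical. Each message only sets state, so replaying it changes nothing, and a status report is queued when interest goes from off to on.

```python
from dataclasses import dataclass, field


@dataclass
class PeerLink:
    """What one home tracks about one peer home (illustrative only)."""
    peer_interest: set = field(default_factory=set)  # our users the peer asked about
    local_online: set = field(default_factory=set)   # our users who are online
    outbox: list = field(default_factory=list)       # messages queued for the peer

    def _report(self, user):
        kind = "online" if user in self.local_online else "not_online"
        self.outbox.append((kind, user))

    def receive(self, kind, user):
        """Handle 'interested' / 'not_interested' arriving from the peer."""
        if kind == "interested":
            newly_interested = user not in self.peer_interest
            self.peer_interest.add(user)
            if newly_interested:
                # The subtlety: not interested -> interested triggers a report.
                self._report(user)
        elif kind == "not_interested":
            self.peer_interest.discard(user)
        else:
            raise ValueError(f"unknown message kind: {kind}")

    def set_online(self, user, online):
        """Record a local user's status and tell the peer if it cares."""
        changed = (user in self.local_online) != online
        if online:
            self.local_online.add(user)
        else:
            self.local_online.discard(user)
        if changed and user in self.peer_interest:
            self._report(user)
```

Receiving the same 'interested' twice leaves the state unchanged and queues only one report, which is the property that makes retransmission harmless.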

Let's look at what happens if Alice is in Bob's notify list and Alice comes online first. Messages are sent as follows -

  1. Alice tells her home she's online.

  2. Bob tells his home he's online, and sends it an 'interested in Alice'.

  3. Bob's home sends an 'interested in Alice' to Alice's home.

  4. Alice's home sends an 'Alice is online' to Bob's home.

  5. Bob's home sends an 'Alice is online' to Bob.

Now let's look at what happens if Bob comes online first -

  1. Bob comes online, tells his home he's online and sends it an 'interested in Alice'.

  2. Bob's home sends an 'interested in Alice' to Alice's home.

  3. Alice's home sends an 'Alice is not online' to Bob's home.

  4. Bob's home sends an 'Alice is not online' to Bob.

  5. Alice tells her home she's online.

  6. Alice's home sends Bob's home an 'Alice is online'.

  7. Bob's home sends Bob an 'Alice is online'.

Of course, someone else using Bob's home might also be interested in Alice, in which case several of the messages in the above dialogue would be unnecessary. Also, as an optimization Bob's home may remember his notify list so it doesn't have to be re-uploaded every time he comes online.
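
The two dialogues above can be replayed with a small simulation. This is only a sketch under several assumptions: the `Home` class, the `directory` mapping from user to home, and the log format are all made up for illustration, delivery is modelled as a direct method call, and logging out is not handled.

```python
class Home:
    def __init__(self, name, log):
        self.name, self.log = name, log
        self.online = set()     # local users who are logged in
        self.watchers = {}      # remote user -> local users watching them
        self.subscribers = {}   # local user -> remote homes interested in them
        self.last_seen = {}     # remote user -> last status reported to us

    def login(self, user, notify_list, directory):
        self.log.append(f"{user} -> {self.name}: {user} is online")
        self.online.add(user)
        for peer in self.subscribers.get(user, []):
            self._send_status(peer, user)
        for friend in notify_list:
            self.log.append(f"{user} -> {self.name}: interested in {friend}")
            already_watched = bool(self.watchers.get(friend))
            self.watchers.setdefault(friend, []).append(user)
            if already_watched:
                # Another local user already asked: answer from local state.
                self._notify(user, friend)
            else:
                peer = directory[friend]
                self.log.append(f"{self.name} -> {peer.name}: interested in {friend}")
                peer.on_interested(self, friend)

    def on_interested(self, peer, user):
        subs = self.subscribers.setdefault(user, [])
        if peer not in subs:
            subs.append(peer)
        self._send_status(peer, user)

    def _send_status(self, peer, user):
        state = "online" if user in self.online else "not online"
        self.log.append(f"{self.name} -> {peer.name}: {user} is {state}")
        peer.on_status(user, user in self.online)

    def on_status(self, user, online):
        self.last_seen[user] = online
        for watcher in self.watchers.get(user, []):
            self._notify(watcher, user)

    def _notify(self, watcher, user):
        state = "online" if self.last_seen.get(user) else "not online"
        self.log.append(f"{self.name} -> {watcher}: {user} is {state}")


log = []
alice_home, bob_home = Home("alice-home", log), Home("bob-home", log)
directory = {"alice": alice_home, "bob": bob_home}
bob_home.login("bob", ["alice"], directory)   # Bob comes online first
alice_home.login("alice", [], directory)
print("\n".join(log))
```

If a second user on Bob's home then logs in with Alice in their notify list, the home answers from `last_seen` and no inter-home messages are sent, which is the saving described above.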

The hybrid solution, which has some of the benefits of both of the above methods, is to introduce 'collators' which handle the information for a group of home machines. Collators speak the inter-home protocol to each other, and get some of the message reduction benefit of doing everything with a single centralized server.

Let's look at how the message flow works if Alice is in Bob's notify list and is already online when Bob comes online -

  1. Bob tells his home base he's online, and sends an 'interested in Alice'. If Bob's home already has a user who's online and has Alice in their notify list (unlikely), then Bob's home responds immediately and the protocol stops here.

  2. Bob's home sends an 'interested in Alice' to Bob's home's collator. If another user whose home is using this collator is interested in Alice (likely), then this collator responds immediately, otherwise the protocol continues as follows.

  3. Bob's home's collator sends an 'interested in Alice' to Alice's home's collator. If a user on another collator is interested in Alice (very likely), then this collator responds immediately, otherwise the next two steps are necessary.

  4. Alice's home's collator sends an 'interested in Alice' to Alice's home.

  5. Alice's home sends an 'Alice is online' to Alice's home's collator.

  6. Alice's home's collator sends an 'Alice is online' to Bob's home's collator.

  7. Bob's home's collator sends an 'Alice is online' to Bob's home.

  8. Bob's home sends an 'Alice is online' to Bob.

I've glossed over the details of how collators are found, how homes set and change what their collators are, and how failures on the messaging level are reported, but all those problems can be solved in reasonably straightforward ways.
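
The short-circuiting in steps 1 through 3 can be sketched as a chain of relays, each of which forwards an interest upstream only if it has not already asked. This is a simplified illustration, not a specification: the `Relay` class is hypothetical, every relay has a single fixed upstream (real routing would depend on which home the user belongs to), and the cached status is assumed to be kept fresh by online/offline updates that are not modelled here.

```python
class Relay:
    """A home or collator on the path towards a user's own home."""

    def __init__(self, name, upstream=None, online=None):
        self.name = name
        self.upstream = upstream   # next hop, or None for the user's own home
        self.online = online       # set of online users (own home only)
        self.cache = {}            # user -> status learned from upstream

    def interested(self, user, trace):
        """Return the user's status, recording each machine that was asked."""
        trace.append(self.name)
        if self.upstream is None:
            return user in self.online          # authoritative answer
        if user not in self.cache:
            # Nobody here has asked yet, so the interest travels one hop further.
            self.cache[user] = self.upstream.interested(user, trace)
        return self.cache[user]                 # otherwise respond immediately


alice_home = Relay("alice-home", online={"alice"})
alice_collator = Relay("alice-collator", upstream=alice_home)
bob_collator = Relay("bob-collator", upstream=alice_collator)
bob_home = Relay("bob-home", upstream=bob_collator)
carol_home = Relay("carol-home", upstream=bob_collator)

first, second = [], []
bob_home.interested("alice", first)     # travels the whole chain
carol_home.interested("alice", second)  # stops at the shared collator
```

The first request touches all four machines, while the second stops at the shared collator, which is where the message reduction comes from.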

Leave it to advogato to have a very sophisticated attack launched for testing purposes.

There are two issues at play here. First of all, all arbitrary text should be escaped before being displayed in html. Nothing particularly deep or difficult about that, although advogato text processing is borked in several ways - repeatedly editing old diary entries results in lots of <p> <p> <p>.
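
For illustration, escaping in Python is a one-liner with the standard library's `html.escape`. The wrapper name `render_comment` is made up here, and mod_virgule itself is written in C, so this only shows the idea.

```python
import html


def render_comment(text):
    """Escape untrusted text before embedding it in an HTML page."""
    # quote=True also escapes " and ', so the result is safe in attributes too.
    return "<p>" + html.escape(text, quote=True) + "</p>"


print(render_comment('<iframe src="http://example.com/evil"></iframe>'))
```

Any markup in the input comes out as inert text, so a diary entry can no longer inject tags into the page.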

The other issue is that for many web sites it's fairly simple to trick a user into clicking on a link which causes something nasty to happen to their account. iframes just exacerbate this problem, they don't create it. A straightforward solution is that for every page which has links which have side effects, a one-time use string is randomly generated and put into the links. When the action is performed, the system checks to see if the string is valid and (very important!) that it was generated for the current user and action. This also stops accidental double-posting, since a second usage of the same string can be treated as a modification instead of a new post.
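
A minimal sketch of that token scheme, assuming tokens are kept in server-side memory. The class `ActionTokens` and its method names are hypothetical, and a real site would also want expiry and persistent storage.

```python
import secrets


class ActionTokens:
    """One-time strings bound to a specific user and action."""

    def __init__(self):
        self._issued = {}  # token -> (user, action), not yet used
        self._used = {}    # token -> (user, action), already used once

    def issue(self, user, action):
        """Generate a random string to embed in a link with side effects."""
        token = secrets.token_urlsafe(32)
        self._issued[token] = (user, action)
        return token

    def redeem(self, token, user, action):
        """Return 'ok', 'repeat' (same token used again) or 'invalid'."""
        if self._used.get(token) == (user, action):
            return "repeat"    # e.g. treat as editing the earlier post
        if self._issued.get(token) != (user, action):
            return "invalid"   # unknown token, or wrong user or action
        del self._issued[token]
        self._used[token] = (user, action)
        return "ok"
```

A link planted by an attacker fails the check because the attacker cannot know a token issued for the victim and that action, and a double-click on a legitimate link comes back as 'repeat' rather than creating a second post.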

Unfortunately, this isn't just an issue on advogato, it's an issue on almost all web sites. Advogato's second problem of not escaping makes the attack worse by making it possible for it to be viral, but the attack is there nonetheless.

Props to whoever came up with and implemented the attack. That was very clever.

17 Sep 2002 (updated 17 Sep 2002 at 06:28 UTC)
dyork: Most of the furniture I've put together recently has been assembled with Allen wrenches, which have a hexagonal hole. I'm guessing they have less chance of shearing when tightening, since they have six contact points instead of four, and each of them has more material behind it. It also might have some advantage related to glide planes, but I'm very hazy on that.

Torx heads look like they have even less chance of shearing, and fewer pointy bits to wear down. Amusingly, Torx is a trademarked name, and everything puts (tm) after it, but there doesn't appear to be a generic term.

I just spent more time than I ever should have googling for screw information.

movement: I'm not suggesting that EEXIST should go away, just that in the case where the new file already maps to the same inode it shouldn't be invoked. If you can think of a specific case in which that wouldn't be the preferred behavior, I'd like to hear about it. One can use stat() and compare the inodes in case of EEXIST, but I'll bet hardly anyone thinks to do that.
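
The stat() workaround might look like the following in Python, where `os.link` raises `FileExistsError` for EEXIST. The helper name `link_idempotent` is made up for illustration, and the device number is compared along with the inode, since inode numbers are only unique within one file system.

```python
import os


def link_idempotent(src, dst):
    """Hard-link src to dst, succeeding if dst is already that same link."""
    try:
        os.link(src, dst)
    except FileExistsError:
        a, b = os.stat(src), os.stat(dst)
        # Same device and inode means dst already names the same file,
        # so the requested state holds and there is nothing left to do.
        if (a.st_dev, a.st_ino) != (b.st_dev, b.st_ino):
            raise  # dst is some other file: a genuine EEXIST
```

Calling it twice with the same arguments succeeds both times, while a destination occupied by a different file still raises.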

link() wouldn't be used in the first place if rename() were implemented atomically like it's supposed to be. Unfortunately, some implementations are broken.

This link explains in detail how maildir works. The lengths which it goes to to get reasonable semantics on top of a file system data store are quite comical. It does one very clever thing though, which is that it encodes data into the file name, forcing new renames to declare the old state as a precondition, thus preventing race conditions where something else modified the file info since it was read.
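
As a sketch of the file name trick: if a message's flags live in its name, then changing them is a rename whose source path spells out the state the caller believes is current. The `unique:2,FLAGS` naming below follows the usual maildir convention as best I recall it, and the helper `set_flags` is hypothetical.

```python
import os


def set_flags(cur_dir, unique, old_flags, new_flags):
    """Rename 'unique:2,old_flags' to 'unique:2,new_flags'.

    Raises FileNotFoundError if the message is no longer in the state the
    caller read, i.e. someone else changed its flags in the meantime.
    """
    old = os.path.join(cur_dir, f"{unique}:2,{old_flags}")
    new = os.path.join(cur_dir, f"{unique}:2,{new_flags}")
    os.rename(old, new)  # fails if `old` no longer exists
    return new
```

Two clients that both read the message with no flags and both try to mark it seen cannot silently overwrite each other: the second rename fails because the old name is gone.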

Broken APIs

The other day I had some exposure to the semantics of link().

Maildir does this neat trick of having separate cur/ new/ and tmp/ subdirectories, with a file per message. To move a message from one to another, it first calls link() to bring it to the new directory, then unlink() to remove it from the old one. This does a good job of keeping mail from getting lost or corrupted, but has a failure mode when the machine goes down between the call to link() and unlink() (not as unlikely as you might think - it could be on an NFS partition).

In the failure case, the message will appear in both boxes. This could easily be cleaned up simply by having the next attempt to link it succeed and the extra file disappear in the process. Unfortunately link() isn't idempotent - it instead fails with the reason that the file already exists. This I find astounding. If someone calls link() on a file which is already linked in exactly the way it's asking, the result should be success with nothing changed, not failure with the reason that the file already exists.

Even worse, non-idempotence isn't just the default behavior, it's the only behavior - there's no option available to make it behave in the sensible way.

I simply cannot fathom what they were thinking.


My guess is some of you reading this have ongoing projects. If so, you should submit them to CodeCon '03. It's happening in February, and has an expanded subject matter of 'active hacker projects'.

I'm really excited about it. Last year's CodeCon went great, and everyone presenting at it got good exposure.

Please note the new .info domain name.

mathieu: In C arrays are syntactic sugar, myarray[3] is just a cleaner way of writing *(myarray + 3)

ladypine: Small confidence values are hardly ever dominated by a single edge, they're usually a more overall network effect. Very small confidence values are good log-scale indicators of how indirect and tenuous your certification of that node is. If your path to certifying someone as bad is much more tenuous than your path to certifying them as good, you want to weight them near one, and in the opposite case you want to weight them near zero, which is exactly what the formula I gave does.


Raph mentioned that I've been thinking about anti-certs. My thoughts are still very speculative, but I'll explain their current state.

Advogato's engine currently has essentially two ways one person can feel about another - positive, if they've certified, or neutral, if they haven't. Anti-certs add another value, active distrust.

My general approach to using anti-certs is to have two steps. The first one uses all the anti-certs to compute weights for each node which take the anti-certs into account, and the second uses the plain old eigenvector method to calculate your belief level using the calculated node weightings.

This approach has some practical drawbacks. It can't be calculated in a distributed manner unless you send all data everywhere, which is fortunately probably quite practical for many applications. It also requires greater runtime than the vanilla eigenvector method, but all the variants I give here are still polynomial.

The simplest way to use anti-certs is to simply set the weight of everyone I've anti-certed to zero.

Behaviorally this technique works okay, but it suffers from completely ignoring the anti-certs of people you've certed. Adding that behavior in is a tricky issue. Since the strength of an anti-cert is dependent on the strength of the node giving it, the effects are probably inherently non-linear.

As a result, there are some situations which don't have a unique weightings solution. For example, if you cert A and B, and A and B each anti-cert the other one, you've basically got two solutions - one in which A is weighted heavily and B very little, and one in which B is weighted heavily and A very little.

Likewise there are situations which have no stable solution. For example, if you cert A, B, and C, and A certs B, B certs C, and C certs A, but B anti-certs A, C anti-certs B, and A anti-certs C, then no single optimal solution exists.

It is possible that there might be closed form formulas which get the above situations to be 'balanced', in which all nodes wind up being weighted about the same, but I doubt that that's possible or desirable.

So, how to calculate a solution? My best idea is to do it in multiple passes. In each one the weight of each node is calculated without using any second order effects, and in the next pass the weights from the previous pass are used. This both always yields a reasonable solution and does it within a polynomial amount of time. I suspect that in practice four passes works great for almost all applications.

How to calculate weight? A reasonable sounding technique is to calculate your confidence in each other node's certification, then your confidence in its anti-certification, and set the weight to confidence / (confidence + anti-confidence).
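
Written out as code, the weighting is a one-liner. The function name `node_weight` is made up, and returning zero when both confidences are zero is an assumption added here to avoid dividing by zero; the original formula does not say what to do in that case.

```python
def node_weight(confidence, anti_confidence):
    """Weight of a node given confidence in its cert and in its anti-cert."""
    total = confidence + anti_confidence
    if total == 0:
        return 0.0  # assumption: no path either way means no weight
    return confidence / total


# A tenuous anti-cert path barely dents a strong cert path, and vice versa.
mostly_trusted = node_weight(0.5, 0.005)
mostly_distrusted = node_weight(0.001, 0.1)
```

Because only the ratio matters, a node reached through a strong cert path and a very tenuous anti-cert path ends up weighted near one, and the opposite case ends up near zero.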

All of the above looks reasonable at first blush, but definitely requires experimentation to see how it might behave in practice.

Axiomatic Bases

Raph quoted me as saying that ZF is a hack. I probably should explain.

PA seems logically compelling to me, even preexisting. I know what the number one is, what a successor is, and I absolutely believe in the principle of induction. ZF, on the other hand, has no obvious intuitive basis. What is a set? Is it a bag? A list? A data structure? A function? The inability of sets to contain themselves would seem to imply bag, but the ability to keep the same set in multiple other ones at once would seem to imply list. All around, ZF feels like something which was logically compelling but then had awkward restrictions placed on it to get rid of some paradoxes.

Perhaps if it were presented in some other way, using different names and metaphors, I wouldn't find ZF so awkward. I'm convinced of its practical utility for doing mathematics from the sheer amount of fiddling which has been done with it, but I'd still like for my intuition to naturally accept it as well.


Thanks, dmerrill! I think my work on BitTorrent is a reasonable qualification for master certification. I've spent over a year working on it, and it's now getting over a hundred downloads a day, as you can see on the statistics page.
