# Older blog entries for ciphergoth (starting at number 7)

15 Mar 2003 (updated 17 Mar 2003 at 00:11 UTC) »

I wonder if anyone replied to what I last wrote about Advogato diaries and LiveJournal? I didn't come back in time to check everyone's diaries, so I don't know. If you replied, please mail me and let me know.

Needless to say, I don't have this problem with LiveJournal, because replies are attached to the comment they reply to and I receive email alerts about them.

Bayes and scoring

raph discusses Paul Graham's plan to use Bayesian statistics to defeat spam and points out that his combiner is isomorphic to a linear one. It turns out that Graham's combiner assumes that the a priori probability that a mail is spam is 0.5, which isn't right, but you can easily fix the problem without sacrificing linearity.

For those who, like me, were confused by what "a probability of 0.99 spam" means: it means Pr(S|A), where S is the event "the mail is spam" and A is the event "the mail contains the word viagra". The extent to which two such events A and B together indicate that the mail is spam is given by

Pr(S|A and B) = (a o b) = ab/(ab + (1-a)(1-b)) (where a = Pr(S|A) and b = Pr(S|B))

but this formula is only correct if you assume that

Pr(S) = Pr(not S) = 0.5

and that the events are independent apart from their dependence on whether or not it's spam:

Pr(A and B|S) = Pr(A|S) Pr(B|S)
Pr(A and B|not S) = Pr(A|not S) Pr(B|not S)

Assuming that the events are independent is fundamental to the whole approach, but assuming Pr(S) = 0.5 seems to me like a mistake. Interestingly, though, if my math is right then the correct combiner for arbitrary a priori Pr(S) = s is (a o_s b) = (a o b o (1-s)) (do the math, or mail me and ask me to post it here). Thus if you define f(x) = log(x) - log(1-x) as before so that f(a o b) = f(a) + f(b), then you can define

f_s(x) = f(x) - f(s)

and you find

f_s(a o_s b) = f(a o b o (1-s)) - f(s)
= f(a) + f(b) + f(1 - s) - f(s)
= f(a) + f(b) - f(s) - f(s)
= f_s(a) + f_s(b)

This makes sense if you think about it: if a = s, then Pr(S|A) = Pr(S) so A tells you nothing about whether or not it's spam, and f_s(a) = f(s) - f(s) = 0 so it contributes nothing to the score either way.

So you don't need to assume that the a priori probability of spam is 0.5 to make raph's linear approach work.
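The combiner and its prior-corrected version are easy to check numerically. Here's a minimal sketch; the function names are mine, not Graham's or raph's:

```python
import math

def f(x):
    """Log-odds (logit) of x, chosen so that f(a o b) = f(a) + f(b)."""
    return math.log(x) - math.log(1 - x)

def combine(a, b):
    """Graham-style combiner (a o b), which implicitly assumes Pr(S) = 0.5."""
    return a * b / (a * b + (1 - a) * (1 - b))

def combine_with_prior(a, b, s):
    """Combiner corrected for an arbitrary prior Pr(S) = s:
    (a o_s b) = (a o b o (1 - s))."""
    return combine(combine(a, b), 1 - s)

# Linearity check: f_s(a o_s b) = f_s(a) + f_s(b), where f_s(x) = f(x) - f(s).
a, b, s = 0.9, 0.8, 0.2
lhs = f(combine_with_prior(a, b, s)) - f(s)
rhs = (f(a) - f(s)) + (f(b) - f(s))
assert abs(lhs - rhs) < 1e-9
```

As a sanity check, combine_with_prior(s, x, s) returns x: an event with a = s tells you nothing, so combining it in leaves the probability unchanged.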

Who here (apart from skx) keeps a diary on LiveJournal? I often find myself wishing that the diaries kept here were there. Despite diary rating and the trust metric, in practice I find it much easier to find diaries I'm interested in there than here. I miss the "friends list" facility that lets me choose directly which diaries to read. And again and again I miss the facility to comment on diary entries; where I could make a useful and informative comment, the fact that it would have to appear in my own diary rather than as a comment on the entry itself puts me off.

Is that feature missing simply because it ain't written yet, or is it a deliberate piece of social engineering? I'm sure there must be some who prefer it that way, but I'd be interested to know why.

I quite often think about what would be needed to really unite all the blogs in the world, so that I can have one "friends page" that follows everything I'm trying to follow. At the moment there's a severe network effect that makes a real LiveJournal account much more useful than accounts with any other provider, even those (eg uJournal) which use the LJ code base. RSS is a component, but it would need a blog-specific module to deliver the same level of usefulness.

Graph Isomorphism

I think there's widespread doubt about the hardness of this problem; ISTR caveats along these lines in descriptions of the associated zero-knowledge proof.

2 Oct 2002 (updated 2 Oct 2002 at 23:22 UTC) »
Implementation of Bram's Trust Metric

I've done a quick-and-dirty implementation of Bram's stochastic trust metric, which works on the Advogato graph. It seems to give plausible results, and on the Advogato graph (4000 active nodes, 40000 arcs) it takes 6 seconds to do 1000 iterations (which gives results to a precision of about 1.6%, where 100% is maximum trust). It's a combination of Perl and C.

To test it, you'll need the Advogato trust graph. Thanks to gary for pointing out that this is available as a graphviz .dot file here:

This is the form that my current implementation expects the trust graph to be in. There's a simple test command line in the Makefile.

Update - just got email from Bram saying that this isn't the metric as he originally meant to describe it. Caveat downloader.

1 Oct 2002 (updated 2 Oct 2002 at 15:18 UTC) »
Can I download the entire Advogato trust graph? Or the diary graph? Is there any publicly downloadable trust graph on which new trust metric algorithms can be tried?

Bram's most recent suggestion would be very easy indeed to implement, so I'd like to experiment with it, but I think results on random graphs wouldn't be very meaningful.

I'd most of all like to try the LJ friends graph, but I don't think they'd be up for making it downloadable...

(update: I tried to do a new diary entry, and it replaced this one! Bah! Fortunately I had edited this one and the old version was in my browser cache...)

Bram - I realised after I posted that my attack does not necessarily allow you to cause a given statement to have unlimited confidence. It doesn't in the case of your proposal, for example. The application I usually have in mind when I think about trust metrics is prevention of crapflooding and similar nasties; to do that you need a bound not on the maximum trust the adversary can gain for a particular entity, but on the number of entities they can arrange to have trust for. If you want that, I think you can't have the "Adding Certification Criterion".

Actually, there's a way you can, but only if you add another restriction. If you allow a free graph - that is, a valid graph is any graph which has valid labels on the arcs - then you can't have it. But if you have some local condition affecting the arcs leaving a given node - such as that no two arcs leaving the same node may carry the same positive integer - then you can still do it.

Incidentally the Advogato algorithm (for people, not for diaries) fails the Isomorphism Criterion.

What sampling algorithm are you proposing? Is it (1) make random compromise choices about all nodes (perhaps lazily), then (2) see if there's an uncompromised path? The SD of the trust value computed would be at most 1/(2 * sqrt(n)) for n samples. Also, I guess it's pretty cheap to collect trust levels for all nodes at once - "find all reachable nodes" is a fast algorithm. However, the problem with using randomised algorithms for trust is that if your trust were very near some threshold, you would find yourself flapping from one side of the threshold to the other on different days...
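That two-step sampling scheme might look like this; the graph representation and all names here are mine, purely illustrative:

```python
import random
from collections import deque

def sample_trust(graph, seed, target, compromise_prob, n_samples=1000):
    """Monte Carlo trust estimate: each sample independently 'compromises'
    nodes at random, then checks for an uncompromised path from seed to
    target.  graph: dict node -> list of neighbours; compromise_prob:
    dict node -> probability that node is compromised."""
    hits = 0
    for _ in range(n_samples):
        compromised = {v for v in graph
                       if random.random() < compromise_prob.get(v, 0.0)}
        if seed in compromised or target in compromised:
            continue
        # BFS over uncompromised nodes only
        seen, queue = {seed}, deque([seed])
        while queue:
            v = queue.popleft()
            if v == target:
                hits += 1
                break
            for w in graph.get(v, []):
                if w not in seen and w not in compromised:
                    seen.add(w)
                    queue.append(w)
    return hits / n_samples  # SD at most 1/(2 * sqrt(n_samples))
```

The per-sample work is one reachability search, which is also why collecting trust for all nodes at once is cheap: a single BFS from the seed finds every reachable node.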

Trust metrication criteria

Bram - I think your ideas about criteria are the Right Thing and I've been thinking some of the same things myself. However I think I can list three desirable criteria and prove you can't have all three.

One is the "Adding Certification Criterion" you define. The second is the "Isomorphism criterion" which states that trust metrication should be a function on graphs, and that isomorphic graphs should get isomorphic trust ratings. The third is attack resistance as raph defines it.

If the attacker has total control over a subgraph, they can duplicate trusted parties, trust links and all. The "Adding Certification Criterion" says that this must not reduce the trust of the duplicated parties, while the "Isomorphism Criterion" means that the duplicates must get the same trust. Thus the attacker has increased the total trust in their part of the graph. They can do this indefinitely for unlimited trust; thus they have violated attack resistance. I'm pretty sure your most recent proposal falls to this attack.
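For concreteness, the duplication step can be sketched as follows; the graph representation and names are mine, and this is only an illustration of the attack, not of any particular metric:

```python
def duplicate_subgraph(graph, nodes, suffix="_copy"):
    """Sketch of the duplication attack: copy a set of nodes the attacker
    controls, arcs and all, producing an isomorphic subgraph.  By the
    Isomorphism Criterion the copies must receive the same trust as the
    originals; by the Adding Certification Criterion the originals' trust
    must not decrease.  graph: dict node -> list of neighbours."""
    new_graph = {v: list(ws) for v, ws in graph.items()}
    rename = {v: v + suffix for v in nodes}
    for v in nodes:
        # arcs within the duplicated set are renamed; arcs leaving it are kept
        new_graph[rename[v]] = [rename.get(w, w) for w in graph.get(v, [])]
    return new_graph
```

Repeating the call grows the attacker's total trust without bound, which is the violation of attack resistance described above.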

I'll try and write more on this when I get more time...

I've also posted this to LiveJournal since I prefer the comment forums...

8 Jul 2002 (updated 9 Jul 2002 at 13:42 UTC) »
This diary

While I keep a LiveJournal diary, I don't want to fill it with geek content, since most of its readers aren't geeks. So I thought I'd move that sort of thing here.

Backups

I've been trying to find a way to back up my system onto multiple CD-Rs. Several people have basically told me that every solution out there is woefully inadequate, no-one has suggested anything good, and my searches haven't found anything good. I got quite interested in DAR for a while, but in practice its multi-archive handling is too far from the Good Thing to be worth the bother.

I've been thinking about what the Right Thing should look like. One thing it should be able to do is write each "slice" to a pipe rather than a file. This is more general than calling a hook after a slice is complete; you catch the "beginning-of-slice" as well as "end-of-slice" events, you can do arbitrary processing on the content, and you get to stop and start the archive writing process with simple flow control.

Reading the "afio" manual page suggests that many of its options could be removed if this were possible. I sent some mail about this to the DAR maintainer; here's my discussion of how this feature could help you do backups if you've only the disk space to store a little over one slice at a time.

Isn't using flow control much simpler? It would be very easy to write another general tool that takes a filename on the command line, and writes stdin to that file, and when a write fails because the disk is full, it waits a little while and tries again. dar (or dar_xform) would write to this process. If the disk filled up, dar would simply block during a write to the pipe, because pipes are flow controlled.

While dar is blocked, another process might be busy taking a completed slice file and writing it to CD, deleting it once it's written and verified.

Once the old slice has been deleted, space will appear, and the writing of the new slice will start to work again. dar won't even notice that there was a pause, and no signals-related complexity is needed.

And all of this happens outside dar itself. In accord with the UNIX philosophy, a whole lot of tools that do one thing well come together to perform a more complex task in a very configurable way. All of this is only possible with the "pipes" approach.