Older blog entries for Bram (starting at number 21)

mathieu: In C, arrays are syntactic sugar; myarray[3] is just a cleaner way of writing *(myarray + 3).

ladypine: Small confidence values are hardly ever dominated by a single edge; they're usually an overall network effect. Very small confidence values are good log-scale indicators of how indirect and tenuous your certification of that node is. If your path to certifying someone as bad is much more tenuous than your path to certifying them as good, you want to weight them near one, and in the opposite case you want to weight them near zero, which is exactly what the formula I gave does.

Anti-Certs

Raph mentioned that I've been thinking about anti-certs. My thoughts are still very speculative, but I'll explain their current state.

Advogato's engine currently has essentially two ways one person can feel about another - positive, if they've certified them, or neutral, if they haven't. Anti-certs add a third value: active distrust.

My general approach to using anti-certs has two steps. The first uses all the anti-certs to compute a weight for each node, with the anti-certs taken into account; the second uses the plain old eigenvector method to calculate your belief level from those node weightings.

This approach has some practical drawbacks. It can't be calculated in a distributed manner unless you send all data everywhere, which fortunately is probably quite practical for many applications. It also requires greater runtime than the vanilla eigenvector method, but all the variants I give here are still polynomial.

The simplest way to use anti-certs is to set the weight of everyone I've anti-certed to zero.

Behaviorally this technique works okay, but it suffers from completely ignoring the anti-certs of people you've certed. Adding that behavior in is a tricky issue. Since the strength of an anti-cert is dependent on the strength of the node giving it, the effects are probably inherently non-linear.

As a result, there are some situations which don't have a unique weighting solution. For example, if you cert A and B, and A and B each anti-cert the other, you've basically got two solutions - one in which A is weighted heavily and B very little, and one in which B is weighted heavily and A very little.

Likewise there are situations which have no stable solution. For example, if you cert A, B, and C, and A certs B, B certs C, and C certs A, but B anti-certs A, C anti-certs B, and A anti-certs C, then no single optimal solution exists.

There might be closed-form formulas which make the above situations come out 'balanced', with all nodes weighted about the same, but I doubt that's either possible or desirable.

So, how to calculate a solution? My best idea is to do it in multiple passes. In each pass the weight of each node is calculated without using any second-order effects, and the next pass then uses the weights from the previous one. This always yields a reasonable solution and does so in polynomial time. I suspect that in practice four passes work great for almost all applications.

How to calculate weight? A reasonable-sounding technique is to calculate your confidence in each other node's certification, then your confidence in its anti-certification, and set the weight to confidence / (confidence + anti-confidence).
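To make that concrete, here's a minimal sketch of how the passes might fit together. The data structures and the stand-in confidence measure below are my own placeholders for illustration - a real implementation would plug in whatever confidence values the underlying eigenvector metric actually produces.

# A rough sketch of the multi-pass weighting. certs and anticerts map each
# node to the set of nodes it certifies / anti-certifies; the confidence
# stand-in is NOT the real metric, just a weighted count of in-edges.
def certification_confidence(target, edges, weights):
    return sum(weights[source] for source, targets in edges.items()
               if target in targets)

def compute_weights(nodes, certs, anticerts, passes=4):
    weights = dict((n, 1.0) for n in nodes)   # everyone starts at full weight
    for _ in range(passes):
        new_weights = {}
        for n in nodes:
            conf = certification_confidence(n, certs, weights)
            anti = certification_confidence(n, anticerts, weights)
            # weight = confidence / (confidence + anti-confidence), using only
            # the previous pass's weights, so no second-order effects per pass
            new_weights[n] = 1.0 if conf + anti == 0 else conf / (conf + anti)
        weights = new_weights
    return weights

# The A/B example from above: I cert both, and each anti-certs the other.
# With this symmetric stand-in they converge to the same weight; a real
# path-based confidence measure could break the tie.
nodes = ['me', 'A', 'B']
certs = {'me': set(['A', 'B']), 'A': set(), 'B': set()}
anticerts = {'me': set(), 'A': set(['B']), 'B': set(['A'])}
print(compute_weights(nodes, certs, anticerts))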

All of the above looks reasonable at first blush, but definitely requires experimentation to see how it might behave in practice.

Axiomatic Bases

Raph quoted me as saying that ZF is a hack. I probably should explain.

PA seems logically compelling to me, almost as if it preexisted its axiomatization. I know what the number one is, what a successor is, and I absolutely believe in the principle of induction. ZF, on the other hand, has no obvious intuitive basis. What is a set? Is it a bag? A list? A data structure? A function? The inability of sets to contain themselves would seem to imply bag, but the ability to keep the same set in multiple other ones at once would seem to imply list. All around, ZF feels like something which started out logically compelling but then had awkward restrictions placed on it to get rid of some paradoxes.

Perhaps if it were presented in some other way, using different names and metaphors, I wouldn't find ZF so awkward. I'm convinced of its practical utility for doing mathematics by the sheer amount of fiddling that has been done with it, but I'd still like my intuition to naturally accept it as well.

Certifications

Thanks, dmerrill! I think my work on BitTorrent is a reasonable qualification for master certification. I've spent over a year working on it, and it's now getting over a hundred downloads a day, as you can see on the statistics page.

26 Aug 2002 (updated 26 Aug 2002 at 20:13 UTC) »
Plugging a Hole

The proof of the non-computability of the halting problem for Turing machines presented in most undergraduate programs contains a huge gaping hole. Fortunately it's not hard to fix.

The Broken Proof

Let us say that we have a Turing machine which accepts an encoding of another Turing machine as input, always prints out a 0 before its last move if the input machine halts or a 1 before its last move if it doesn't, and never spends an infinite amount of time thinking about it.

We can then construct a Turing machine which calculates whether its input would halt, and does the opposite. If we give that machine as input to itself, then we have a contradiction.

The problem here is that there's a distinction between machines which take input and machines which don't. The machine which does the opposite takes input, but when we feed it to itself the emulated copy is given no input, hence it's operating on something else entirely, and there is no contradiction.

The Fixed Proof

Let us say that we have a Turing machine which accepts an encoding of another Turing machine as input, always prints out a 0 before its last move if the input machine halts or a 1 before its last move if it doesn't, and never spends an infinite amount of time thinking about it.

We can then construct a Turing machine which spits out an encoding of itself, calculates whether that machine will halt, and then does the opposite. The spitting out is based on the standard quine trick - the program contains a blueprint of itself, with a special notation where the blueprint goes saying 'blueprint goes here'; it then follows the blueprint exactly and spits out a copy of the blueprint where the notation goes.
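The blueprint trick itself is easy to demonstrate. Here's a tiny Python illustration of a program printing its own source - just the self-reproduction part, not the full contradiction machine:

# The blueprint contains a %r placeholder marking 'blueprint goes here';
# filling the placeholder with the blueprint reproduces the whole program.
blueprint = 'blueprint = %r\nprint(blueprint %% blueprint)'
print(blueprint % blueprint)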

Clearly, this machine leads to a contradiction - if the answer is that it halts, it runs forever, and if the answer is that it doesn't halt, it halts.

Why such an obvious and easy-to-fix error is part of the standard basic CS curriculum is beyond me.

No Mix and Matching

While I'm complaining about common errors, I'd like to point out something about log(n) multipliers in runtimes. There are basically two ways of calculating the runtime of an algorithm - the number of gates required to construct it as a directed circuit, and the number of operations it requires on a realistic machine with parallel memory.

If you calculate runtime in the first way, you're perfectly justified in saying that adding two numbers on the scale of n takes log(n) time, but since many algorithms undergo a quadratic blow-up in size in this model due to the lack of parallel memory, it isn't applicable to many problems.

If you calculate runtime in the second way, you have no justification for saying that adding two numbers on the scale of n takes log(n) time unless you also take into account that simple memory retrieval technically takes log(n) time, which nobody does.

When calculating runtime, don't mix and match models - if you allow for constant-time memory retrieval, allow for constant-time addition of small numbers.

24 Aug 2002 (updated 26 Aug 2002 at 17:57 UTC) »
wardv: I used to play corewars all the time, in an integrated development environment on the Commodore 64 which my dad wrote in his spare time. Notably, he was a journalist and not a programmer back then.

There was one tournament in which I entered a top contender, Powerbomb, which won almost all of its battles before the finals. Unfortunately, only five programs were put into the final round, including all the ones which Powerbomb didn't do all that well against. Ah well, I should probably be happy just to have gotten two programs into the finals. I was only twelve at the time.

Mathematical Foundations

I think this post claims an axiomatic system has been discovered which essentially declares its own consistency but manages to avoid the cheap trick of diagonalization.

Update: I'm told I read that all wrong. Ah well.

Intermediate Automata

I believe some of these papers demonstrate that there are (very artificial) cellular automata which exhibit nontrivial behavior but aren't universal, contrary to Wolfram's thesis. I'm increasingly coming to believe that some of the simple rules, especially 18 and 22, are of intermediate degree.

Second Best

I figured out a coherent strategy for Second Best - all players agree on what the two most common words will be, and each player picks one of them at random.

Tweaking Spam

I've been thinking about the spam filtering code I gave earlier. It can be improved a lot.

For starters, multiplying the number of nonspam appearances by two is kind of a hack. It's almost exactly equivalent to subtracting .7 from each token's value. A much more robust approach is to increase the spam threshold from 2.6 to 5.
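For the curious, that .7 is just log 2 - a token's score is a difference of logs, and doubling the nonspam count adds log 2 to the subtracted term:

from math import log

# score = log(bad) - log(good); with good doubled it becomes
# log(bad) - log(2 * good) = score - log(2)
print(log(2))   # 0.693..., roughly .7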

Also, the max value of 4.6 (equivalent to a bayesian probability of .99) seems like a bit much, since just one or two spammy words could easily label an otherwise neutral message as spam. Reducing it to 3 (bayesian about .95) seems much more reasonable. Someone whose entry has scrolled off the advogato recentlog mentioned that a single token, '2002', threw off his filtering quite a bit.

Several other subtle improvements and code cleanups are possible. I've included them all in the following code -

from math import log
from re import findall

class spamtrap:
    def __init__(self):
        self.good = {}
        self.bad = {}

    def add_spam(self, message):
        for t in _maketokens(message):
            self.bad[t] = self.bad.get(t, 0) + 1

    def add_nonspam(self, message):
        for t in _maketokens(message):
            self.good[t] = self.good.get(t, 0) + 1

    def is_spam(self, message):
        # each token scores log(bad + 1) - log(good + 1), capped at +/-3
        ss = []
        for token in _maketokens(message):
            ss.append(min(3, max(-3, log(self.bad.get(token, 0) + 1) - log(self.good.get(token, 0) + 1))))
        sum = 0
        if len(ss) > 16:
            # only the eight most positive and eight most negative scores count
            ss.sort()
            for v in ss[:8] + ss[-8:]:
                sum += v
        else:
            for v in ss:
                sum += v
        return sum > 5

def _maketokens(message):
    ts = {}
    for t in findall("[a-zA-Z0-9'$]+", message):
        ts[t.lower()] = 1
    return ts.keys()

Sorry again about whitespace mangling which interferes with cut'n'paste - it's advogato's doing.
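In case it helps, here's how the class above might be exercised. The training strings are invented, and a real corpus would need to be far larger for the scores to mean anything:

trap = spamtrap()
trap.add_spam("buy cheap pills now $$$ click here")
trap.add_nonspam("meeting notes for the bittorrent protocol discussion")
print(trap.is_spam("cheap pills click here now"))
print(trap.is_spam("notes on the protocol discussion"))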

Hot Potato

In the game Hot Potato, each player starts with 100 points. One unlucky player, chosen at random, is initially given the hot potato. On each turn, the player with the hot potato decides which player gets it next, and the recipient's score is decremented by one. The hot potato may not be passed to a player with a score of zero. The winner is the last player with a positive score.

I cannot fathom any coherent strategy for Hot Potato. How to approach it formally is a complete mystery to me.
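For concreteness, here's a minimal simulation of the rules. The random passing is purely to illustrate the mechanics and isn't meant as a strategy:

import random

def play_hot_potato(num_players=4, start_score=100):
    scores = [start_score] * num_players
    holder = random.randrange(num_players)   # the unlucky initial holder
    while sum(1 for s in scores if s > 0) > 1:
        # legal recipients: anyone else whose score is still positive
        targets = [i for i in range(num_players) if i != holder and scores[i] > 0]
        holder = random.choice(targets)
        scores[holder] -= 1
    return scores.index(max(scores))   # the last player with a positive score

print(play_hot_potato())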

Second Best

In the game Second Best, each player writes down a word and then all words are revealed. All players who wrote down the second most commonly occurring word are winners; everybody else loses. In the case of a tie for second, everybody loses.

I can't make heads or tails of the strategy of Second Best either.

20 Aug 2002 (updated 20 Aug 2002 at 22:30 UTC) »
fxn: Using a persistent connection won't help with throughput much unless you implement pipelining.
barryb: Those numbers were derived by plugging in Raph's formula -

  • log(.01) - log(1 - .01) = -4.6
  • log(.99) - log(1 - .99) = 4.6
  • log(.2) - log(1 - .2) = -1.4
  • log(.9) - log(1 - .9) = 2.2

The formulas for probabilities were likewise converted -

log(a / (a + b)) - log(1 - (a / (a + b))) = log(a) - log(b)

Note how much simpler the formula on the right is. I think it's best to think of this heuristic as based on scores, with bayesian analysis being an intuitive motivation, rather than a rigorous basis.
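A quick numeric check of both the plugged-in values and the simplification, purely for illustration:

from math import log

# The four scores above, from log(p) - log(1 - p):
for p in (.01, .99, .2, .9):
    print(round(log(p) - log(1 - p), 1))   # -4.6, 4.6, -1.4, 2.2

# And the identity log(a/(a+b)) - log(1 - a/(a+b)) == log(a) - log(b):
a, b = 3.0, 7.0   # arbitrary positive counts
print(log(a / (a + b)) - log(1 - a / (a + b)), log(a) - log(b))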

ifile is a noteworthy related project.

Donations

Since there's been discussion on advogato diaries lately about which projects are worth donating to, I'd like to point out that I'm accepting donations for my work on BitTorrent, a project I've been working on full-time for over a year, with a mature, deployed piece of code to show for it.

Retroactive Spam

I realized that tiebreak behavior was unspecified in the spam filtering code I gave yesterday. The way I implemented it happened to work out that if there were 15 max-probability spam tokens and 15 max-probability nonspam tokens, it assumed spam. A much more robust approach is, rather than summing the fifteen scores whose absolute values are the highest, to sum the eight largest positive and the eight most negative ones. I've now retroactively changed my previous diary entry to work that way. You can do that in a weblog.
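A toy check of the new tiebreak behavior, using the 15-and-15 case with scores at yesterday's +/-4.6 cap:

ss = [4.6] * 15 + [-4.6] * 15
ssp = sorted(s for s in ss if s > 0)
ssn = sorted(s for s in ss if s < 0)
print(sum(ssp[-8:] + ssn[:8]))   # ~0, so the tie no longer comes out as spam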

17 Aug 2002 (updated 19 Aug 2002 at 02:09 UTC) »
Translating into English

Paul Graham has some interesting ideas about how to filter spam. Unfortunately he gives code samples in some weird language, so I've translated them into Python and fleshed them out a bit.

This code ignores duplicates of a single token in a message, since handling them resulted in some rather ugly hacks. It also implements Raph's suggestion to use scores. Also, rather than summing the fifteen scores whose absolute values are highest, it sums the eight largest positive and the eight most negative, which is much more robust behavior.

This code has no support for serialization and doesn't strip binary attachments, but other than that should work well.

from math import log
from re import findall

class spamtrap:
    def __init__(self):
        self.good = {}
        self.bad = {}
        self.scores = {}

    def _recompute(self, token):
        # only score tokens seen at least five times, with nonspam counted double
        g = 2 * self.good.get(token, 0)
        b = self.bad.get(token, 0)
        if g + b >= 5:
            self.scores[token] = min(4.6, max(-4.6, log(b + .00001) - log(g + .00001)))

    def add_spam(self, message):
        for t in _maketokens(message):
            self.bad.setdefault(t, 0)
            self.bad[t] += 1
            self._recompute(t)

    def add_nonspam(self, message):
        for t in _maketokens(message):
            self.good.setdefault(t, 0)
            self.good[t] += 1
            self._recompute(t)

    def is_spam(self, message):
        # unknown tokens default to -1.4; sum the eight most positive and
        # the eight most negative scores
        ss = [self.scores.get(t, -1.4) for t in _maketokens(message)]
        ssp = [i for i in ss if i > 0]
        ssn = [i for i in ss if i < 0]
        ssp.sort()
        ssn.sort()
        sum = 0
        for v in ssp[-8:] + ssn[:8]:
            sum += v
        return sum > 2.2

def _maketokens(message):
    ts = {}
    for t in findall("[a-zA-Z0-9'$]+", message):
        ts[t.lower()] = 1
    return ts.keys()

Sorry if there are indentation problems on cut and paste - advogato inserts <p> tags even between <pre> tags.

