25 Mar 2003 Bram   » (Master)

How to be a Pundit

A pundit, of course, is someone who others look to when they want opinions. Here's how to become one in four easy steps.

  1. Post constantly. Preferably daily. Preferably put every inane little thought to come out of your head on a weblog, so the whole world can see you post constantly. The first step to having a following is to reliably offer an opinion.

  2. Always say the same things. It doesn't particularly matter if the things you say are right, or even if they make sense. What's important is that they sound good, and that they're always the same. Followers don't want leaders saying confusing or ambiguous things.

  3. Include what a great guru/authority/visionary/leader you are as part of your message. Don't worry if you aren't actually one; the important thing is that people think you are.

  4. Arrange pre-packaged stories for journalists to print. Journalists rarely have the resources to research a story properly, so if you give them a story they can just reword a bit and print, they'll do that. Don't forget to include how great you are as an important point in the story.

RedHat 9

Since the free download of RedHat 9 won't be available for a week after it becomes available to subscribers, this is a great opportunity to demonstrate BitTorrent. Unfortunately, I'm not a subscriber. If you are a subscriber with a decently fast pipe and would like to help download the ISOs and make them available via BitTorrent within a few hours of their release, please mail me.

Travelling

I'm going to be in Amsterdam from April 27th through May 3rd, and New York City from May 18th till about June 1st. I'd like to meet some of the local open source hackers while I'm there. Anybody who would like to meet up, please mail me.

Bayesian filtering

I'd like to remind everyone that I already posted about Bayesian filtering, and gave some ways of fixing some of the deficiencies in Graham's code. Highlights include:

  • Whether a word occurs in a message should be yes or no; multiple occurrences should be ignored. Graham's code does something very funky with multiple occurrences which can cause lots of problems.

  • Everything should be changed to log scale so scores are simply added together. This makes everything much clearer. In particular, it makes it obvious that you might as well assume a default probability of spam of 1/2, since it just trades off with the cutoff threshold. That many people, including Graham, didn't realize this shows how thoroughly keeping things as probabilities obfuscates the process.

  • Adding up the 16 most extreme scores in either the positive or negative direction is very crude. For long messages it winds up throwing out all subtle details of exact values and just doing a vote count of the top 16 as to whether they're positive or negative. A much better approach is to add up the top 8 and bottom 8. Bruce Schneier and Joel Spolsky have recently complained about very long messages of theirs, which should be easy to flag as non-spam, getting rejected. This is probably the cause, and top-and-bottom scoring is the simplest fix.

    Graham didn't specify what the behavior should be when you have 16 words which say 100% spam and 16 words which say 100% not spam, although it doesn't really matter, because the top and bottom approach is clearly better.
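The three points above can be sketched together as follows. This is only an illustration, not Graham's code or anyone's deployed filter: the names score_message, word_prob and is_spam are made up here, and the clamping of probabilities away from 0 and 1 is an assumption added so the logarithms stay finite.

```python
import math

def log_odds(p_spam):
    """Turn a per-word spam probability into an additive log-scale score."""
    return math.log(p_spam) - math.log(1.0 - p_spam)

def score_message(words, word_prob, n=8, clamp=(0.01, 0.99)):
    """Sum the n lowest and n highest word scores of a message.

    word_prob is a hypothetical table mapping word -> probability of spam.
    Clamping is an assumption, used only to avoid log(0).
    """
    lo, hi = clamp
    # Presence is yes or no: a set ignores multiple occurrences.
    present = set(words)
    scores = sorted(
        log_odds(min(hi, max(lo, word_prob[w])))
        for w in present
        if w in word_prob
    )
    if len(scores) <= 2 * n:
        return sum(scores)
    # Bottom n (most non-spam) plus top n (most spam).
    return sum(scores[:n]) + sum(scores[-n:])

def is_spam(words, word_prob, threshold=0.0):
    # A prior of 1/2 contributes log-odds 0, so only the threshold matters.
    return score_message(words, word_prob) > threshold
```

With this shape, changing the assumed prior is the same as shifting threshold, which is the trade-off described above.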

I hope I don't sound overly critical of Graham here. I think his approach is a great one. But it is a first version, and can be dramatically improved.

I wish someone had a backtracing setup for spam filtering algorithms, so we could get some real data about false positive and false negative values, instead of just having to guess.

Partial Busy-work Functions

ciphergoth: There are a few other criteria:

  • The whole puzzle should require 2k time to solve

  • If you have two nodes which aren't at opposite ends of the arc, then you still have to spend k time to solve the whole puzzle

Quick verification isn't a criterion I'm looking for, although that would be an interesting add-on. I'm thinking about this because it's related to circuit complexity, not because I'm looking for an especially practical crypto primitive.

The two cases with two nodes are easy. If the nodes are connected by an arc, then the puzzle is to compute the root of a binary hash tree, and each of the partial work functions is one of the two things pointed to by the root. For the disconnected case, the puzzle is to compute the output of a hash chain, and the two values are both the value of the halfway point of the chain. For three nodes, I haven't figured out a solution to the case where all three nodes are connected, or to the case where only a single pair of nodes is connected.
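A minimal sketch of the two two-node constructions, assuming SHA-256 as the hash and a made-up way of deriving leaves from a seed (neither is specified above; the function names are hypothetical too):

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def tree_root(seed: bytes, depth: int) -> bytes:
    """Root of a complete binary hash tree; leaves are derived from seed."""
    if depth == 0:
        return h(seed)
    left = tree_root(seed + b"0", depth - 1)
    right = tree_root(seed + b"1", depth - 1)
    return h(left + right)

def chain(seed: bytes, steps: int) -> bytes:
    """Output of a hash chain: apply h to seed the given number of times."""
    value = seed
    for _ in range(steps):
        value = h(value)
    return value

# Connected case: the two partial values are the root's two children.
# Each costs about half the tree, and together they give the root
# with one more hash.
def connected_partials(seed: bytes, depth: int):
    return tree_root(seed + b"0", depth - 1), tree_root(seed + b"1", depth - 1)

# Disconnected case: both partial values are the chain's halfway point,
# so holding both still leaves the second half of the chain to compute.
def disconnected_partial(seed: bytes, k: int) -> bytes:
    return chain(seed, k)
```

For example, h(left + right) over the pair from connected_partials(seed, d) equals tree_root(seed, d), while chain(disconnected_partial(seed, k), k) equals chain(seed, 2 * k).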

Twisty Puzzles

I figured out some big improvements to my pentultimate design. Possibly even more interesting than having a working pentultimate are the internal pieces, which move at half the rate of the external pieces. My basic ideas can be used in a much more robust design for an extension of the 2x2x2 Rubik's Cube in which the outside is transparent and internally there are a bunch of extra pieces which move at half the rate of the outer ones. This would be a very interesting and difficult new puzzle, and my design for it is extremely robust.

Now I just have to learn how to make precision plastic parts. Good thing I don't have lots of other projects to work on.

The Twisty Forum is great.

Certs against Spam

I've given some more thought to stopping spam, and have come up with the following approach, which is still a bit sketchy but seems to be on the right track.

When receiving a piece of mail, determine if you're willing to accept mail from the sender by the following process:

  1. Determine each peer's degree of spammerness. This can be done using a variant of the random walk method, which is used to calculate Advogato's diary ratings.

  2. Back propagate spammerness. Oh, you wanted to know how to do that? For each spammer, calculate max flow from you to that spammer. In the flow diagram, try to even out the flow as much as possible, so if there's a possibility of offloading flow through one peer to other peers, each of which is currently taking on less flow, do so. Calculate this for every spammer, then calculate each peer's degree of spammerness by adding up the flow through them for each spammer.

  3. Throw out all peers whose degree of spammerness is above some threshold. If there's a path from you along certification edges over remaining peers to the sender of the mail being evaluated, accept it. Otherwise reject it.

The real innovation here is the second step, which does back propagation. It ensures that sending spam from fake certified identities is just as damaging as sending spam from the directly compromised identities.
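The third step is the easy part, and can be sketched as a plain breadth-first search. This sketch takes the spammerness values as given (it does not implement the max flow step), and the data layout and names (certs, spammerness, accept_mail) are assumptions made for illustration.

```python
from collections import deque

def accept_mail(certs, spammerness, me, sender, threshold):
    """Return True if sender is reachable from me over acceptable peers.

    certs:       dict mapping each peer to the peers it certifies
    spammerness: dict mapping peer -> degree of spammerness (default 0)
    """
    def acceptable(peer):
        return spammerness.get(peer, 0.0) <= threshold

    seen = {me}
    queue = deque([me])
    while queue:
        peer = queue.popleft()
        if peer == sender:
            return True
        for certified in certs.get(peer, ()):
            # Peers above the threshold are thrown out of the graph.
            if certified not in seen and acceptable(certified):
                seen.add(certified)
                queue.append(certified)
    return False
```

The search itself is linear in the size of the certification graph, so any expensive work lies in computing the spammerness values.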

The database for this could be easily populated using a Friendster-style interface. Unfortunately, to be maximally useful this database has to be global, meaning it's huge, and the steps involved in evaluating it look to be at least quadratic. Thankfully Moore's law will make this much more manageable in due time, although the database will still have to be widely replicated in order to distribute computational load.

In case you wondered what one could ever do with vastly greater than videophone bandwidth, it turns out you can use it nicely for propagating global databases.

An interesting graph problem

Given a graph, and a specification of a 'red' node and a 'blue' node, we wish to know if it's possible to remove k nodes, not including blue and red, so there's no longer a path from blue to red along edges over the remaining nodes. Is this problem NP-complete?
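To make the question concrete, here is a brute-force checker for small instances. It simply tries every set of k removable nodes, so it takes time exponential in k and says nothing about the complexity of the problem itself. The graph representation (a dict mapping every node to its set of neighbours, undirected) and the function names are assumptions.

```python
from collections import deque
from itertools import combinations

def has_path(graph, src, dst, removed):
    """Breadth-first search from src to dst, skipping removed nodes."""
    seen = {src}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in graph[node]:
            if nxt not in seen and nxt not in removed:
                seen.add(nxt)
                queue.append(nxt)
    return False

def can_separate(graph, red, blue, k):
    """True if removing k nodes (never red or blue) disconnects blue from red."""
    others = [n for n in graph if n not in (red, blue)]
    # Removing extra nodes never reconnects anything, so exactly
    # min(k, len(others)) removals covers the "at most k" reading too.
    size = min(k, len(others))
    return any(
        not has_path(graph, blue, red, set(removed))
        for removed in combinations(others, size)
    )
```

If red and blue share an edge, no removal helps and can_separate returns False for every k.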

Architecture

An architecture is a set of libraries which are so poorly encapsulated that if you want to use one of them you have to use all of them.
