Older blog entries for Bram (starting at number 65)

29 Mar 2003 (updated 29 Mar 2003 at 05:51 UTC) »
CodeCon

The CodeCon 2003 archives are now on codecon.org. We had lots of great presentations this year, those of you who missed it can now hear them.

BitTorrent

Pushed out a new release yesterday, and a bugfix release today. Get it on the download page.

tjansen: requiring hosters of big content to purchase a big net connection and then recoup the costs later would force them to take on a huge risk, and slow down new deployments immensely. Besides, we don't have micropayments and won't for the forseeable future, so the point is moot.

Bayesian Filtering

brondsem: Counting a repeated instance multiple times causes long messages to be weighed disproportionately. Each message either is or is not a spam, a long spam message does not count as two pieces of spam.

Also, if a non-spam message is about mortgages it's likely to contain the word 'mortgage' many times. Counting that commonly used spam word against it repeatedly is much more likely to tag it as spam, when in fact the repeated instances are just as likely to occur in non-spam as spam.

The idea behind only counting a few words is to not make all messages look the same. If you count all the words in all messages then most messages wind up with pretty much the same average value.

You are absolutely right that real backtracing is needed to know for sure though. Anyone who's setting up a spam filtering system please save all your spams and set it up so you can easily do backtracing and find out the number of false positives and false negatives different techniques would have hit.

How to be a Pundit

A pundit, of course, is someone who others look to when they want opinions. Here's how to become one in four easy steps.

  1. Post constantly. Preferably daily. Preferably put every inane little thought to come out of your head on a weblog, so the whole world can see you post constantly. The first step to having a following is to reliably offer an opinion.

  2. Always say the same things. It doesn't particularly matter if the things you say are right, or even if they make sense. What's important is that they sound good, and that they're always the same. Followers don't want leaders saying confusing or ambiguous things.

  3. Include what a great guru/authority/visionary/leader you are as part of your message. Don't worry if you aren't actually one, the important thing is that people think you are.

  4. Arrange pre-packaged stories for journalists to print. Journalists rarely have the resources to research a story properly, so if you give them a story they can just reword a bit and print, they'll do that. Don't forget to include how great you are as an important point in the story.

RedHat 9

Since the free download of RedHat 9 won't be available for a week after it becomes available to subscribers, this is a great opportunity to demonstrate BitTorrent. Unfortunately, I'm not a subscriber. Anyone who is a subscriber and has a decently fast pipe and would like to help download the ISOs and make them available via BitTorrent within a few hours of their release please mail me.

Travelling

I'm going to be in Amsterdam from April 27th throug May 3rd, and New York City from May 18th till about June 1st. I'd like to meet some of the local open source hackers while I'm there. Anybody who would like to meet up mail me.

Bayesian filtering

I'd like to remind everyone that I already posted about Bayesian filtering, and gave some ways of fixing some of Graham's code's deficiencies. Highlight include

  • Whether a word occurs in a message should be yes or no, multiple occurences should be ignored. Graham's code does something very funky with multiple occurences which can cause lots of problems.

  • Everything should be changed to log scale so scores are simply added together. This makes everything much clearer. In particular, it makes it obvious that you might as well assume a default probability of spam of 1/2, since it just trades off with the cutoff threshold. It shows how thoroughly keeping things as probabilities obfuscates the process that many people, including Graham, didn't realize this.

  • Adding up the 16 most extreme scores in either the positive or negative direction is very crude. For long messages it winds up throwing out all subtle details of exact values and just doing a vote count of the top 16 as to whether they're positive or negative. A much better approach is to add up the top 8 and bottom 8. Bruce Schneier and Joel Spolsky have recently complained about very long, easy to flag as non-spam messages of theirs getting rejected, and this is probably the cause and simplest fix.

    Graham didn't specify what the behavior should be when you have 16 words which say 100% spam and 16 words which say 100% not spam, although it doesn't really matter, because the top and bottom approach is clearly better.

I hope I don't sound overly critical of Graham here. I think his approach is a great one. But it is a first version, and can be dramatically improved.

I wish someone had a backtracing setup for spam filtering algorithms, so we could get some real data about false positive and false negative values, instead of just having to guess.

Partial Busy-work Functions

ciphergoth: There are a few other criteria:

  • The whole puzzle should require 2k time to solve

  • If you have two nodes which aren't at opposite ends of the arc, then you still have to spend k time to solve the whole puzzle

Quick verification isn't a criterion I'm looking for, although that would be an interesting add-on. I'm thinking about this because it's related to circuit complexity, not because I'm looking for an especially practical crypto primitive.

The two cases with two nodes are easy. If the nodes are connected by an arc, then the puzzle is to compute the root of a binary hash tree, and each of the partial work functions is one of the two things pointed by the root. For the disconnected case, the puzzle is to compute the output of a hash chain, and the two values are both the value of the halfway point of the chain. For three nodes, I haven't figured out a solution to the case where all three nodes are connected and the case where only a single pair of nodes is connected.

Twisty Puzzles

I figured out some big improvements to my pentultimate design. Possibly even more interesting than having a working pentultimate is the internal pieces which move at half the rate of the external pieces. My basic ideas can be used in a much more robust design for an extension of the 2x2x2 rubik's cube in which the outside is transparent and internally there are a bunch of extra pieces which move at half the rate of the outer ones. This would be a very interesting and difficult new puzzle, and my design for it is extremely robust.

Now I just have to learn how to make precision plastic parts. Good thing I don't have lots of other projects to work on.

The Twisty Forum is great.

Certs against Spam

I've given some more thought to stopping spam, and have come up with the following approach, which is still a bit sketchy but seems to be on the right track.

When receiving a piece of mail, determine if you're willing to accept mail from the sender by the following process:

  1. Determine each peer's degree of spammerness. This can be done using a variant of the random walk method, which is used to calculate advogato's diary ratings.

  2. Back propogate spammerness. Oh, you wanted to know how to do that? For each spammer, calculate max flow from you to that spammer. In the flow diagram, try to even out the flow as much as possible, so if there's a possibility of offloading flow through one peer to other peers each of which are currently taking on less flow, do so. Calculate this for every spammer, then calculate each peer's degree of spammerness by adding up the flow through them for each spammer.

  3. Throw out all peers whose degrees of spammerness is above some threshold. If there's a path from you along certification edges over remaining peers to the sender of the mail being evaluated, accept it. Otherwise reject it.

The real innovation here is the second step, which does back propogation. It ensures that sending spam from fake certified identities is just as damaging as sending spam from the directly compromised identities.

The database for this could be easily populated using a Friendster-style interface. Unfortunately, to be maximally useful this database has to be global, meaning it's huge, and the steps involved in evaluating it look to be at least quadratic. Thankfully moore's law will make this much more manageable in due time, although the database will still have to be widely replicated in order to distribute computational load.

In case you wondered what one could ever do with vastly greater than videophone bandwidth, it turns out you can utilize it nicely propogating global databases.

An interesting graph problem

Given a graph, and a specification of a 'red' node and a 'blue' node, which wish to know if it's possible to remove k nodes, not including blue and red, so there's no longer a path from blue to red along node edges over remaining nodes. Is this problem NP-complete?

Architecture

An architecture is a set of libraries which are so poorly encapsulated that if you want to use one of them you have to use all of them.

Building a Pentultimate

I have long been fascinated by the pentultimate. Could one actually be built? I have a general design which I'm fairly confident is workable.

Imagine two blocks which you wish to slide along each other in a straight line. If you cut a slot along their boundary, you can put a disc in the middle which will rotate as they move. You can add teeth to the disc and notches to the edges of the slot, thus forcing the gear to rotate and move exactly half as much as each block does relative to the other one. A more esoteric trick is to add circular notches to the faces of the disc and grooves to the sides of the slot for those notches to move in. This will have the same effect in terms of how the disc moves, but has the additional property that the two blocks can't be pulled apart while the disc is inside.

If we bend things a bit we can make the slot be circular, and the blocks then rotate. My trick for making a pentultimate is to cut grooves of this form along the hidden sides of each visible piece and have a disc sit under each boundary between a triangular and pentagonal piece. The discs keep the puzzle from being taken apart, but allow rotation along the cuts. Every time a rotation is done around a cut, it moves a fifth of a complete rotation, making each of the ten discs beneath move half that distance, resulting in each of them moving to the next position along and the puzzle once again being in a rotatable position.

The details of course must be hashed out, but I'm confident a real physical pentultimate which is robust enough to play with could be made using this technique.

An interesting possibility this internal mechanism raises is that the outside of the the puzzle could be made transparent, and the puzzle would then be to position and orient the internal discs.

There are many other fascinating Rubik-type puzzles, but I think the pentultimate would be by far the most compelling toy.

Certs as an anti-spam system

Let us say we wish to build an anti-spam system based on certifications. To make this work, we will need some method of authentication. (We can punt and make a public key be part of an identity). We also need some method of distributing all the certs widely, which must be resistant to attacks which selectively remove certs and anti-certs and also can handle the likely humongous size of the resulting database.

The above problems seem like work, but are much less of a concern than the problem of not having a cert graph evaluation algorithm which we find acceptable. As we can see from the Advogato example, even punting on all the engineering problems by centralizing everything still leaves an open problem as to whether a certification system can work well. Advogato is encouraging, but still immature.

I am now convinced that if we can come up with a certification evaluation primitive which meets the right criteria we will have a spam-fighting solution which will be work well, even if deployed on a global scale. I am furthur convinced that such a primitive exists, although I haven't figured out exactly how to do it yet.

Here are the criteria -

  1. There is a permissible spam factor, s. Say that a spammer has compromised k typical people in a well-connected graph. The spammer then proceeds to repeatedly send pieces of spam from either compromised people or fake people certified by compromised people. After each piece of spam is sent out, the recipient anti-certs the sender. The spammer never bothers sending mail which will be rejected because of the existing anti-certs. The anti-spam criterion states that the attacker should only be able to get about k*s pieces of spam accepted this way before almost any mail they try to send out will get rejected by the trust system.

  2. Let us say an attacker has k people compromised and wishes to use this power to censor others. Due to the anti-spam criterion, it must be possible for the attacker to censor at least k/s typical peers. The censorship criterion says they must not be able to censor any more than that.

Once we find a primitive which meets the above criteria, then we will be able to straightfaced claim to have plans for a global certification based anti-spam system which works on paper. Then we can start worrying about deployment problems.

Partial busy-work functions

I had an interesting thought while pondering circuit complexity today.

We have busy-work functions, which do little more than require some precisely tunable amount of time be spent computing them. For example, computing the secure hash of a sequence of x null characters. I wish to have a busy-work function, b, which meets some additional criteria criteria.

There are to be a fixed number of partial results we can compute each of which takes almost exactly k/2 time, where k is the total amount of busy work for the complete function. Each pair of partial results may be synergistic, meaning that given both of them you can compute b very quickly. Pairs of partial functions which are not synergistic should have the property that if you know both of them it still takes an additional k/2 time to compute b.

We wish to have a specific set of synergy relationships between partical functions. For example, we may want four partial functions P, Q, R, and S such that (P, Q), (Q, R), (P, R) and (R, S) are synergistic but all other pairs are not. My question is, for any given set of synergy relationships, is there always a b and related partial functions which has that property? Can you think of a way of putting together standard (or even not so standard) cryptographic primitives to make an implementation?

My intuition very strongly says that this is possible, but I don't immediately see a way of doing it.

We had a p2p-hackers meeting today, spent a whole bunch of it talking about Codeville. It was fun.

On the way home I realized that the way Codeville resolutions were being determined could be vastly simplified. After two hours of coding I now have the newly rewritten and retested version up on the Codeville page. Ninety thorny lines which it took me a week of being half-dead to write have now been replaced with twenty lines I hacked out in two hours. Boy do I feel dumb.

InstallShield

InstallShield is still a 16-bit application. This means that on long-running 32-bit Windows machines (like mine) by the time you try to install anything which uses InstallShield the 16-bit VM will inevitably have long since crashed and the installer will simply not work.

Those of you using InstallShield should be aware that a very large fraction of all your end user installs are failing due to a problem which can't be fixed without InstallShield completely rewriting their software or you switching to a different installer. I recommend Nullsoft Installer.

Negative Certs

In any trust metric in which you have negative certs there's an attack in which an attacker creates many fake identities and certifies them. The fake identities may get negative certs attached, but it will still be possible to make more. Therefore, any trust metric which includes negative certs must back propogate negativity in its evaluation function to be effective. I just realized this yesterday and haven't put any real thought into how back propogation should be done. Clearly more thought is necessary.

Simple Trust Metric Evaluation

Here's a trust metric evaluation algorithm which fixes the problems of very low confidence levels advogato has right now.

First we must select a paranoia level. This is the assumed probability that any given given peer is compromised. A paranoia level of zero is extremely trusting, while a paranoia level of one never believes anything.

To determine Alice's rating of someone's diary, do multiple runs of the following:

  1. Decide whether to remove each node from the graph with probability equal to the paranoia level.

  2. Trim all dead ends (that is, nodes which have no path to a node which gives a rating) from the graph.

  3. Starting at Alice, do a random walk along trusted edges through non-removed nodes until you hit a node which gives a rating, then stop. It's possible that there is no such path, in which case this run is counted as having failed.

The confidence value is now the fraction of all runs which succeeded. The diary rating can be computed from the distribution of run results, either by taking the median or the mean.

Computation can be greatly sped up by figuring out exact probabilities for each run of removing nodes using linear equations. Disturbingly, it's still randomized, but approximates the exact value reasonably quickly except in the case where confidence is near zero, in which case rating doesn't really matter.

Raph pointed out that this approach involves combining three already known techniques -

  • random node removal

  • trimming dead ends

  • random walk

The above can be used in just about any combination, but using all three has the best properties.

Looking back on Raph's diary I see that I meant to post about this on the 19th of December, after discussing it with him after dinner. I'm a bit behind on my blogging.

Slashdot Trust Metric

Slashdot's comment rating system appears reasonable at first blush, but is deeply flawed in implementation. Happily, there are straightforward ways it could be vastly improved.

First the good part. Moderator points are a good approach, and the strength of them as a technique is largely responsible slashdot's comment system achieving its current level of usability, which to its credit is considerably better than usenet. (I do not mean than as an insult.) In particular, the policy of never giving a particular person more than a few moderator points at a time strongly limits the amount of damage an attacker could do and also allows the size of the community to be very precisely tweaked.

The problems with slashdot aren't so much with what should be there as what shouldn't. The current meta-moderation interface is completely unnecessary. Whenever anyone rates a post up or down, other people have the opportunity to use their moderation points to change that rating, which implicitly meta-moderates the first change. Analysis of that data could be used as the exclusive source of meta-moderation with good results.

Categories for reason of moderation are also pointless. They complicate the UI without adding any new functionality.

Without any data on how it's currently done, I can't comment on specific formulas for deciding when to give people moderation points. Clearly making posts which get modded up should increase likelihood/frequency of getting moderation points, and the same with positive meta-moderation. Coming up with specific formulas seems to be an interesting and soluble problem, but real-world data is necessary.

Perfect Graphs

There's now an algorithm for testing if a graph is Berge in polynomial time. (Read the links for an explanation of the Berge property, it's straightforward but a bit long to repeat here.) Although I haven't read through the papers yet (which I plan to do in my copious free time) I'm under the impression that the algorithm, while distinctly non-trivial, only employs elementary techniques. It's always nice when new results can be understood without extensive background reading.

Mitch Kapor at CodeCon

We've got Mitch Kapor as our special guest speaker for the benefit dinner at CodeCon!

CodeCon pre-registration closes in just a few hours, and it will cost more at the door, so if you haven't yet, go sign up now.

13 Feb 2003 (updated 14 Feb 2003 at 06:59 UTC) »
CodeCon

Pre-registration for CodeCon is closing on the 15th. After that you'll have to register at the door, which costs more.

We've got a really great program this year, including a presentation on Advogato. Also notable is a panel on version control, including representatives from BitKeeper, Subversion, and OpenCM.

CodeCon is also one of the the best places to meet all the local bay area open source hackers. I hope to see everyone there.

I've created a Codeville tutorial which demonstrates its cool features quite succinctly.

Secret Project Unveiled

My secret project is now ready to be seen. It's Codeville, a version control system.

One thing no longer nagging at my brain, a million left to go.

56 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!