I realized this the other day, and dub it Bram's Law
The easier a piece of software is to write, the worse it's implemented in practice.
Why? Easy software projects can be done by almost any random person, so they are. It's possible to try to nudge your way into being the standard for an easy thing based on technical merit, but that's rather like trying to become a hollywood star based on talent and hard work. You're much better off trading it all in for a good dose of luck.
This is why HTTP is a mess while transaction engines are rock solid. Almost any programmer can do a mediocre but workable job of extending HTTP, (and boy, have they,) but most people can't write a transaction engine which even functions. The result is that very few transaction engines are written, almost all of them by very good programmers, and the few which aren't up to par tend to be really bad and hardly get used. HTTP, on the other hand, has all kinds of random people hacking on it, as a result of which Python has a 'fully http 1.1 compliant' http library which raises assertion failures during normal operation.
Remember this next time you're cursing some ubiquitous but awful third party library and thinking of writing a replacement. With enough coal, even a large diamond is unlikely to be the first thing picked up. Save your efforts for more difficult problems where you can make a difference. The simple problems will continue to be dealt with incompetently. It sucks, but we'll waste a lot less time if we learn to accept this fact.
The last technique I gave for back propogation has a serious flaw, but I think the criteria and general approach are good. Here's a much better algorithm:
Mail is accepted from any node which there's a path to along certification edges over nodes which have weight greater than zero.
I've glossed over some details, but that covers all the important points. I'm very enamored of this approach. It's extremely untuitively compelling, and all the steps in it can be done or at least very well approximated in essentially linear time. It looks like we have an algorithm which really can expand to a global scale now.
Twisty Puzzles
Some posts by me to the twisty forum.
HTML
Web pages which make their hyperlinks non-underlined drive me absolutely batshit. I wish there was a browser option to always underline even when the page says not to.
In other good news, the IETF announced a new policy in RFCs against using MUST to describe behavior which is widely violated in practice, especially when that violation won't change for the forseeable future.
29 Mar 2003 (updated 29 Mar 2003 at 05:51 UTC) »
The CodeCon 2003 archives are now on codecon.org. We had lots of great presentations this year, those of you who missed it can now hear them.
BitTorrent
Pushed out a new release yesterday, and a bugfix release today. Get it on the download page.
tjansen: requiring hosters of big content to purchase a big net connection and then recoup the costs later would force them to take on a huge risk, and slow down new deployments immensely. Besides, we don't have micropayments and won't for the forseeable future, so the point is moot.
Bayesian Filtering
brondsem: Counting a repeated instance multiple times causes long messages to be weighed disproportionately. Each message either is or is not a spam, a long spam message does not count as two pieces of spam.
Also, if a non-spam message is about mortgages it's likely to contain the word 'mortgage' many times. Counting that commonly used spam word against it repeatedly is much more likely to tag it as spam, when in fact the repeated instances are just as likely to occur in non-spam as spam.
The idea behind only counting a few words is to not make all messages look the same. If you count all the words in all messages then most messages wind up with pretty much the same average value.
You are absolutely right that real backtracing is needed to know for sure though. Anyone who's setting up a spam filtering system please save all your spams and set it up so you can easily do backtracing and find out the number of false positives and false negatives different techniques would have hit.
A pundit, of course, is someone who others look to when they want opinions. Here's how to become one in four easy steps.
RedHat 9
Since the free download of RedHat 9 won't be available for a week after it becomes available to subscribers, this is a great opportunity to demonstrate BitTorrent. Unfortunately, I'm not a subscriber. Anyone who is a subscriber and has a decently fast pipe and would like to help download the ISOs and make them available via BitTorrent within a few hours of their release please mail me.
Travelling
I'm going to be in Amsterdam from April 27th throug May 3rd, and New York City from May 18th till about June 1st. I'd like to meet some of the local open source hackers while I'm there. Anybody who would like to meet up mail me.
Bayesian filtering
I'd like to remind everyone that I already posted about Bayesian filtering, and gave some ways of fixing some of Graham's code's deficiencies. Highlight include
Graham didn't specify what the behavior should be when you have 16 words which say 100% spam and 16 words which say 100% not spam, although it doesn't really matter, because the top and bottom approach is clearly better.
I hope I don't sound overly critical of Graham here. I think his approach is a great one. But it is a first version, and can be dramatically improved.
I wish someone had a backtracing setup for spam filtering algorithms, so we could get some real data about false positive and false negative values, instead of just having to guess.
Partial Busy-work Functions
ciphergoth: There are a few other criteria:
Quick verification isn't a criterion I'm looking for, although that would be an interesting add-on. I'm thinking about this because it's related to circuit complexity, not because I'm looking for an especially practical crypto primitive.
The two cases with two nodes are easy. If the nodes are connected by an arc, then the puzzle is to compute the root of a binary hash tree, and each of the partial work functions is one of the two things pointed by the root. For the disconnected case, the puzzle is to compute the output of a hash chain, and the two values are both the value of the halfway point of the chain. For three nodes, I haven't figured out a solution to the case where all three nodes are connected and the case where only a single pair of nodes is connected.
Twisty Puzzles
I figured out some big improvements to my pentultimate design. Possibly even more interesting than having a working pentultimate is the internal pieces which move at half the rate of the external pieces. My basic ideas can be used in a much more robust design for an extension of the 2x2x2 rubik's cube in which the outside is transparent and internally there are a bunch of extra pieces which move at half the rate of the outer ones. This would be a very interesting and difficult new puzzle, and my design for it is extremely robust.
Now I just have to learn how to make precision plastic parts. Good thing I don't have lots of other projects to work on.
The Twisty Forum is great.
Certs against Spam
I've given some more thought to stopping spam, and have come up with the following approach, which is still a bit sketchy but seems to be on the right track.
When receiving a piece of mail, determine if you're willing to accept mail from the sender by the following process:
The real innovation here is the second step, which does back propogation. It ensures that sending spam from fake certified identities is just as damaging as sending spam from the directly compromised identities.
The database for this could be easily populated using a Friendster-style interface. Unfortunately, to be maximally useful this database has to be global, meaning it's huge, and the steps involved in evaluating it look to be at least quadratic. Thankfully moore's law will make this much more manageable in due time, although the database will still have to be widely replicated in order to distribute computational load.
In case you wondered what one could ever do with vastly greater than videophone bandwidth, it turns out you can utilize it nicely propogating global databases.
An interesting graph problem
Given a graph, and a specification of a 'red' node and a 'blue' node, which wish to know if it's possible to remove k nodes, not including blue and red, so there's no longer a path from blue to red along node edges over remaining nodes. Is this problem NP-complete?
Architecture
An architecture is a set of libraries which are so poorly encapsulated that if you want to use one of them you have to use all of them.
I have long been fascinated by the pentultimate. Could one actually be built? I have a general design which I'm fairly confident is workable.
Imagine two blocks which you wish to slide along each other in a straight line. If you cut a slot along their boundary, you can put a disc in the middle which will rotate as they move. You can add teeth to the disc and notches to the edges of the slot, thus forcing the gear to rotate and move exactly half as much as each block does relative to the other one. A more esoteric trick is to add circular notches to the faces of the disc and grooves to the sides of the slot for those notches to move in. This will have the same effect in terms of how the disc moves, but has the additional property that the two blocks can't be pulled apart while the disc is inside.
If we bend things a bit we can make the slot be circular, and the blocks then rotate. My trick for making a pentultimate is to cut grooves of this form along the hidden sides of each visible piece and have a disc sit under each boundary between a triangular and pentagonal piece. The discs keep the puzzle from being taken apart, but allow rotation along the cuts. Every time a rotation is done around a cut, it moves a fifth of a complete rotation, making each of the ten discs beneath move half that distance, resulting in each of them moving to the next position along and the puzzle once again being in a rotatable position.
The details of course must be hashed out, but I'm confident a real physical pentultimate which is robust enough to play with could be made using this technique.
An interesting possibility this internal mechanism raises is that the outside of the the puzzle could be made transparent, and the puzzle would then be to position and orient the internal discs.
There are many other fascinating Rubik-type puzzles, but I think the pentultimate would be by far the most compelling toy.
Let us say we wish to build an anti-spam system based on certifications. To make this work, we will need some method of authentication. (We can punt and make a public key be part of an identity). We also need some method of distributing all the certs widely, which must be resistant to attacks which selectively remove certs and anti-certs and also can handle the likely humongous size of the resulting database.
The above problems seem like work, but are much less of a concern than the problem of not having a cert graph evaluation algorithm which we find acceptable. As we can see from the Advogato example, even punting on all the engineering problems by centralizing everything still leaves an open problem as to whether a certification system can work well. Advogato is encouraging, but still immature.
I am now convinced that if we can come up with a certification evaluation primitive which meets the right criteria we will have a spam-fighting solution which will be work well, even if deployed on a global scale. I am furthur convinced that such a primitive exists, although I haven't figured out exactly how to do it yet.
Here are the criteria -
Once we find a primitive which meets the above criteria, then we will be able to straightfaced claim to have plans for a global certification based anti-spam system which works on paper. Then we can start worrying about deployment problems.
Partial busy-work functions
I had an interesting thought while pondering circuit complexity today.
We have busy-work functions, which do little more than require some precisely tunable amount of time be spent computing them. For example, computing the secure hash of a sequence of x null characters. I wish to have a busy-work function, b, which meets some additional criteria criteria.
There are to be a fixed number of partial results we can compute each of which takes almost exactly k/2 time, where k is the total amount of busy work for the complete function. Each pair of partial results may be synergistic, meaning that given both of them you can compute b very quickly. Pairs of partial functions which are not synergistic should have the property that if you know both of them it still takes an additional k/2 time to compute b.
We wish to have a specific set of synergy relationships between partical functions. For example, we may want four partial functions P, Q, R, and S such that (P, Q), (Q, R), (P, R) and (R, S) are synergistic but all other pairs are not. My question is, for any given set of synergy relationships, is there always a b and related partial functions which has that property? Can you think of a way of putting together standard (or even not so standard) cryptographic primitives to make an implementation?
My intuition very strongly says that this is possible, but I don't immediately see a way of doing it.
On the way home I realized that the way Codeville resolutions were being determined could be vastly simplified. After two hours of coding I now have the newly rewritten and retested version up on the Codeville page. Ninety thorny lines which it took me a week of being half-dead to write have now been replaced with twenty lines I hacked out in two hours. Boy do I feel dumb.
InstallShield is still a 16-bit application. This means that on long-running 32-bit Windows machines (like mine) by the time you try to install anything which uses InstallShield the 16-bit VM will inevitably have long since crashed and the installer will simply not work.
Those of you using InstallShield should be aware that a very large fraction of all your end user installs are failing due to a problem which can't be fixed without InstallShield completely rewriting their software or you switching to a different installer. I recommend Nullsoft Installer.
Negative Certs
In any trust metric in which you have negative certs there's an attack in which an attacker creates many fake identities and certifies them. The fake identities may get negative certs attached, but it will still be possible to make more. Therefore, any trust metric which includes negative certs must back propogate negativity in its evaluation function to be effective. I just realized this yesterday and haven't put any real thought into how back propogation should be done. Clearly more thought is necessary.
Simple Trust Metric Evaluation
Here's a trust metric evaluation algorithm which fixes the problems of very low confidence levels advogato has right now.
First we must select a paranoia level. This is the assumed probability that any given given peer is compromised. A paranoia level of zero is extremely trusting, while a paranoia level of one never believes anything.
To determine Alice's rating of someone's diary, do multiple runs of the following:
The confidence value is now the fraction of all runs which succeeded. The diary rating can be computed from the distribution of run results, either by taking the median or the mean.
Computation can be greatly sped up by figuring out exact probabilities for each run of removing nodes using linear equations. Disturbingly, it's still randomized, but approximates the exact value reasonably quickly except in the case where confidence is near zero, in which case rating doesn't really matter.
Raph pointed out that this approach involves combining three already known techniques -
The above can be used in just about any combination, but using all three has the best properties.
Looking back on Raph's diary I see that I meant to post about this on the 19th of December, after discussing it with him after dinner. I'm a bit behind on my blogging.
Slashdot Trust Metric
Slashdot's comment rating system appears reasonable at first blush, but is deeply flawed in implementation. Happily, there are straightforward ways it could be vastly improved.
First the good part. Moderator points are a good approach, and the strength of them as a technique is largely responsible slashdot's comment system achieving its current level of usability, which to its credit is considerably better than usenet. (I do not mean than as an insult.) In particular, the policy of never giving a particular person more than a few moderator points at a time strongly limits the amount of damage an attacker could do and also allows the size of the community to be very precisely tweaked.
The problems with slashdot aren't so much with what should be there as what shouldn't. The current meta-moderation interface is completely unnecessary. Whenever anyone rates a post up or down, other people have the opportunity to use their moderation points to change that rating, which implicitly meta-moderates the first change. Analysis of that data could be used as the exclusive source of meta-moderation with good results.
Categories for reason of moderation are also pointless. They complicate the UI without adding any new functionality.
Without any data on how it's currently done, I can't comment on specific formulas for deciding when to give people moderation points. Clearly making posts which get modded up should increase likelihood/frequency of getting moderation points, and the same with positive meta-moderation. Coming up with specific formulas seems to be an interesting and soluble problem, but real-world data is necessary.
Perfect Graphs
There's now an algorithm for testing if a graph is Berge in polynomial time. (Read the links for an explanation of the Berge property, it's straightforward but a bit long to repeat here.) Although I haven't read through the papers yet (which I plan to do in my copious free time) I'm under the impression that the algorithm, while distinctly non-trivial, only employs elementary techniques. It's always nice when new results can be understood without extensive background reading.
We've got Mitch Kapor as our special guest speaker for the benefit dinner at CodeCon!
CodeCon pre-registration closes in just a few hours, and it will cost more at the door, so if you haven't yet, go sign up now.
FOAF updates: Trust rankings are now exported, making the data available to other users and websites. An external FOAF URI has been added, allowing users to link to an additional FOAF file.
Keep up with the latest Advogato features by reading the Advogato status blog.
If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!