Recent blog entries for jwb

There are two projects upon which I have embarked, both driven by my increasing discomfort with the modern computer experience.

Project the First: A desktop environment implemented entirely in Java

People give me strange looks when I talk about this project, but I am very excited about the prospects. For the last year or so I have been using Mac OS X, switching between its native interface and a full-screen X11/GNOME session hosted on my Linux box. (Switching between the two is a matter of a single keystroke). Prior to that I used Linux exclusively.

What strikes me about these two systems is not their differences, but their similarities. Mac users and GNOME users (and Windows users) behave in ways which are grossly unsafe with regard to their files and data. We tend to just download and run programs from random places on the Internet, although at least with Linux the distributors exercise some editorial influence. These programs that we download and run are usually native programs. They have access to all the computer's facilities including the screen and keyboard and mouse, the audio output and input, the printer, the network, and the modem. Any of these programs can enumerate, access, and change any of the user's files. They can be trojans or viruses. And the worst feature of these systems is that the user is given woefully insufficient tools for finding out what the applications do.

My new system seeks to address this. The Java language was chosen for its structured method of limiting the actions of a program. Untrusted code can be forbidden from accessing files, using the network, or executing native code. It is perfectly safe to download and run any Java code, so long as the SecurityManager is correctly configured to limit that code's actions. Java's security policy framework is the basis of my new system.

This new system (which badly needs a name) presents a new way of interacting with programs or applications or whatever you care to call them. Suppose you download a new music player, similar to iTunes. When the application is loaded it will lack any kind of file system or network capability. A file inside the archive will describe the kinds of capabilities the program's author feels are needed. For a music player, access to the audio output will be necessary, access to a CD reader or writer may be optional, and the program might be expected to make network connections to musicbrainz.org, freedb.org, and last.fm. Before the application is even loaded, the system, using a full-screen input-capturing user interface, will ask the user to grant these capabilities. If everything seems to be in order, the user can confirm and the application will be loaded with the appropriate capabilities.
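Concretely, the grant could reduce to nothing more exotic than Java's existing policy-file syntax. A rough sketch, where the codeBase, the private data directory, and the particular entries are only placeholders standing in for whatever the descriptor asked for:

grant codeBase "file:/apps/musicplayer.jar" {
    // audio output, which the descriptor marked as required
    permission javax.sound.sampled.AudioPermission "play";
    // outbound connections only, and only to the hosts the author declared
    permission java.net.SocketPermission "musicbrainz.org:80", "connect,resolve";
    permission java.net.SocketPermission "freedb.org:80", "connect,resolve";
    permission java.net.SocketPermission "last.fm:80", "connect,resolve";
    // the application's private directory and nothing else
    permission java.io.FilePermission "${user.home}/.sandbox/musicplayer/-", "read,write,delete";
};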

The nice thing about Java is that the security framework ensures that the application doesn't perform any hanky panky. You might have granted the ability to connect to musicbrainz.org on port 80, but if the application starts making connections to nsa.gov, or listening for connections on port 31337, or sending mail, then 1) these things will not work, and 2) the user can be alerted. The application can be killed if necessary and safely removed.
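The alerting half of that is a small amount of code. A minimal sketch using the stock SecurityManager hook, with System.err standing in for whatever notification UI the real system grows:

import java.security.Permission;

public class AlertingSecurityManager extends SecurityManager {
    public void checkPermission(Permission perm) {
        try {
            super.checkPermission(perm);    // the ordinary policy-driven check
        } catch (SecurityException e) {
            // the action is still refused, but now the shell knows it was attempted
            System.err.println("denied: " + perm);
            throw e;
        }
    }
}

Installed with System.setSecurityManager(new AlertingSecurityManager()), any socket, file, or exec attempt outside the granted set lands in that catch block.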

This system requires a file manager, because by default the applications are loaded only with the ability to read and write files in a private subdirectory dedicated to that application. For an application to gain access to any of the user's files, the user must actually drag and drop that file onto the application. In the example of the music player, the user might drag an entire folder, ~/Music, giving the application the ability to enumerate and read the files contained therein. But if the user chooses not to do this, then the application has no ability to spy on, steal, or alter the user's files and private data. Even if the user gives the music player the full ability to read, write, create, and delete files and folders under ~/Music, they can still be sure that the program won't intentionally or accidentally delete their email.
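How the drag-and-drop grant reaches the security layer is still open; one plausible route is a custom java.security.Policy that the file manager feeds. The class and method names below are mine, and a real version would track grants per application rather than in one shared collection:

import java.io.FilePermission;
import java.security.CodeSource;
import java.security.Permission;
import java.security.PermissionCollection;
import java.security.Permissions;
import java.security.Policy;
import java.security.ProtectionDomain;

public class DropGrantPolicy extends Policy {
    private final Permissions granted = new Permissions();

    // called by the file manager when the user drops a folder onto an application
    public synchronized void grantFolder(String dir, boolean writable) {
        granted.add(new FilePermission(dir + "/-",
                writable ? "read,write,delete" : "read"));
    }

    public synchronized boolean implies(ProtectionDomain domain, Permission permission) {
        // a real implementation would also consult the static policy and key
        // grants by the requesting application's ProtectionDomain
        return granted.implies(permission);
    }

    public PermissionCollection getPermissions(CodeSource cs) {
        return granted;
    }

    public void refresh() {
    }
}

Policy.setPolicy(new DropGrantPolicy()) puts it in effect for the whole VM.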

I only started this project one month ago, but already some of the issues to be tackled have become clear. The first and probably worst is my own ignorance of Java internals. A benefit of this system is supposedly that bugs in the native JPEG library won't cause buffer overflows like they can and commonly do on Windows, Mac OS, and Linux. But it's not entirely clear to me when and how Java might be resorting to native code behind my back. Is the entirety of the JDK pure Java? Can I be sure that Java is never using JNI to access something like libjpeg or libfreetype? If I restrict native code, what parts of Java will I break?

Another problem is that the Java look-and-feel libraries are universally horrible. The Windows L&F looks almost like Windows, the Mac L&F looks sort of like Mac OS, and the GTK+ L&F resembles GTK+. They are all flamboyantly buggy in ways that the user will notice, and ugly as well. I need to find an L&F that is beautiful, well implemented, and available under the right license. Any pointers in this area would be appreciated.

Project the Second: a web forum which is not horrible

The majority of forums and blogs out there are hideously bad. They tend to connect to databases dozens of times for every page load. They run a lot of code on the host CPUs even when generating exactly the same content for the millionth time. They have a knack for being slashdotted at just the wrong moment.

Normally I would not care about this junk, but a website my wife has started building has sucked me into this project. The goal of the project is to build a forum and blog-with-comments system which maintains the bulk of its data as static files on the regular Unix filesystem. As we all know, Linux can spray out static web content at impressive rates, so I believe this system will be much faster than current systems like Scoop and Slash and Movable Type, at least for a read/write ratio above a certain but as yet undetermined level.

This kind of thing seems pretty easy, and I am thinking that a site of the scale of Daily Kos could easily be supported on modest hardware. Daily Kos and Kuro5hin and other Scoop sites love to visit the database, but the need to do so is severely overestimated by the implementors. After an author writes a diary or blog entry, that diary is either not going to change or will be updated so seldom that the ratio of writes to reads will be indistinguishable from zero. The same is true of comments. A busy thread on Daily Kos will be read by hundreds of thousands of people, but will attract only a few hundred comments. There is, therefore, no reason why MySQL should be consulted on every page load.

I tried to do this same project many years ago, working from Slashcode, but for a number of reasons this is now much easier. We now have many wonderful high-quality implementations of the DOM in every practical language: Perl, Python, C, Java, and many more. We also have something on Linux that we certainly did not have in 1999: a filesystem capable of holding billions of files, with very large numbers of files in single directories. We now have Lustre, which in addition to scaling to huge sizes is also cache-coherent across nodes, allowing quite advanced multi-reader/multi-writer behavior even when you have a rack full of machines handling the HTTP requests.
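Since the DOM is doing the heavy lifting, the write path can be as dull as this sketch. The JAXP and DOM classes are standard; the file layout and markup (one static XHTML file per thread, comments appended to the body) are assumptions of mine:

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class ThreadWriter {
    // Load the thread's static page, append one comment node, write it back.
    // Every read after that is the web server sending a plain file.
    public static void appendComment(File page, String author, String text) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(page);

        Element comment = doc.createElement("div");
        comment.setAttribute("class", "comment");
        Element who = doc.createElement("p");
        who.appendChild(doc.createTextNode(author));
        Element body = doc.createElement("p");
        body.appendChild(doc.createTextNode(text));
        comment.appendChild(who);
        comment.appendChild(body);

        // a real template would have a dedicated comments container
        doc.getElementsByTagName("body").item(0).appendChild(comment);

        Transformer out = TransformerFactory.newInstance().newTransformer();
        out.transform(new DOMSource(doc), new StreamResult(page));
    }
}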

In other words, I'm quite excited about this project and optimistic about the prospect for success.

avriettea

If you think mysql didn't install itself in the right place, just pass some more parameters to the configure script. You can also change the path to the database files at runtime with the -h,--datadir option. Agree completely that mysql can seem like a bad joke after using pg, but then again mysql doesn't have the operational headaches that come with postgresql: vacuum, reindex, and so forth. I tend to use both systems.

flaming

My colleagues at work had the genius idea to install a mail filtering system called MailMarshal in front of their Exchange server. I would not recommend this product. It blocks mails containing GIF files, but the same image converted to PNG is not blocked. Also if the GIF is buried in one extra level of MIME hierarchy, it will be allowed. Finally it seems more interested in the memo From: header than the SMTP envelope sender.

The punch line is that I cannot send mail even to people within my own company without resorting to a comical rfc822-in-rfc822 tunneling system. It is the mail analog to ipip tunneling through firewalls. And it proves once again that control freaks don't understand networks.

Unfortunately I was unable to control my frustration after a week of trying to diagnose this mail problem, and flamed the administrators (gently). This never helps the situation but I always do it. I am broken in that regard.

Incremental apt-get update

I'd like to meddle with the way Debian distributes software. It is very silly that apt-get update downloads several megabytes of data, just to get the incremental changes since the last apt-get update. I believe it should be easy to implement an incremental update that sends only the changes since time t. This would require that each upload have a serial number, but either that already exists somewhere in the apt system, or it could be hacked in without too much trouble.
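To make the idea concrete, here is the sort of exchange I imagine, sketched as a toy client. Nothing here is an existing apt interface; the URL, the query parameter, and the delta resource are all hypothetical:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class IncrementalUpdate {
    public static void main(String[] args) throws Exception {
        String mirror = args[0];                    // a mirror URL
        long lastSerial = Long.parseLong(args[1]);  // serial of the last index we applied

        // hypothetical resource: returns only the stanzas newer than lastSerial
        URL delta = new URL(mirror + "/Packages.delta?since=" + lastSerial);
        BufferedReader in = new BufferedReader(new InputStreamReader(delta.openStream()));
        for (String line = in.readLine(); line != null; line = in.readLine()) {
            System.out.println(line);               // a real client would merge these stanzas
        }
        in.close();
    }
}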

By my calculations a daily apt-get update could be reduced from several megabytes to tens of kilobytes.

BitTorrent

After reading numerous enthusiastic endorsements of BitTorrent, I downloaded the client and tried to get Mandrake 9.1 images from the network. I must say I was not impressed. The estimated time to complete the download was 42 hours, at about 10KB/sec. Normally I can download at around 135KB/sec. Worse, my client was uploading at around 20KB/sec! On my asymmetric connection, uploads degrade downloads. Is this something BitTorrent does not account for? The client should limit uploads to at most, say, 10% of downloads. Otherwise there is little reason for the user to participate. Certainly if uploads were going to be 200% of downloads I would never use it.
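The policy I have in mind is nothing more than a running ratio, something like this sketch (the 10% figure is the one suggested above):

public class UploadThrottle {
    private long downloaded;   // bytes received so far
    private long uploaded;     // bytes sent so far

    public synchronized void recordDownload(long bytes) {
        downloaded += bytes;
    }

    // called before sending a block to a peer
    public synchronized boolean mayUpload(long bytes) {
        if (uploaded + bytes > downloaded / 10) {
            return false;      // hold the block until we have downloaded more
        }
        uploaded += bytes;
        return true;
    }
}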

Also their FAQ desperately needs a "How do I run this farking program?" section. /usr/bin/btdownloadprefetched.py is not obvious.

25 Sep 2002 (updated 25 Sep 2002 at 18:58 UTC) »

Someone mailed me to request my spam filter system, so I packaged it up slightly with some command line arguments and documentation. You may download it here:

http://atari.saturn5.com/~jwb/spam-2.0.tar.gz

I added a --gram-length=n option, so you can play with that dimension of the system.

workrave

I recommend you try Workrave, to help keep your hands useful in later life.

n-grams in spam filter

I modified my spam filter to compare the performance of unigrams, digrams, and trigrams. The undesired corpus contains 353 mails; the desired corpus holds 3352. When using trigrams the vocabulary is over 1.1 million terms. My system is the same as Paul Graham's, except I do not double the document frequency of terms in the good corpus, and I consider mail as spam if its probability exceeds 50%.
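For reference, the scoring arithmetic as I understand it from Graham's essay, with the two changes just mentioned. Graham also restricts the combination to the most interesting terms, which this sketch leaves to the caller; the class and method names are mine:

public class TermScore {
    static final double NGOOD = 3352.0;   // desired corpus size, from above
    static final double NBAD = 353.0;     // undesired corpus size, from above

    // g = occurrences of the term in the desired corpus, b = in the undesired corpus
    static double termProbability(double g, double b) {
        if (g == 0 && b == 0) {
            return 0.4;                               // Graham's default for unseen terms
        }
        double p = (b / NBAD) / (g / NGOOD + b / NBAD);
        return Math.min(0.99, Math.max(0.01, p));     // clamped, as in Graham's scheme
    }

    // combine the selected term probabilities; the mail is treated as spam
    // when this exceeds 0.5
    static double combine(double[] probs) {
        double prod = 1.0, inverse = 1.0;
        for (double p : probs) {
            prod *= p;
            inverse *= (1.0 - p);
        }
        return prod / (prod + inverse);
    }
}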

The unigram system identified all but eight mails from the spam corpus, with zero false positives. The digram and trigram systems both identified all but three, also with no false positives. Of course the trigram system takes much longer for the analysis, so I believe I will use digrams for the present system.

The system works so well that I will write a small C library for use by mail clients. I think spam filtering has no effect on spammers until it becomes widespread. So I will try to spread it widely.

tk: I have tried using 2-grams and 3-grams in my spam filter. Of course this tends to bloat the vocabulary and therefore the time required for analysis. Later I will attempt to characterize the effect of term length on filter performance. My parser is something of a hack, so any extensions to the term length will probably result in repulsive code.
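For what it's worth, building the longer terms is nothing clever; roughly the following, where joining adjacent tokens with a space is my assumption, since the parser itself is not shown here:

import java.util.ArrayList;
import java.util.List;

public class NGrams {
    // join n adjacent tokens into one term; with n = 1 this is the plain token stream
    static List<String> terms(List<String> tokens, int n) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            StringBuilder term = new StringBuilder(tokens.get(i));
            for (int j = 1; j < n; j++) {
                term.append(' ').append(tokens.get(i + j));
            }
            out.add(term.toString());
        }
        return out;
    }
}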

7 Sep 2002 (updated 7 Sep 2002 at 23:55 UTC) »
spam

The spam filter is getting better as the training corpus grows. It now has 300 messages, which explains why I need this software. The mail parsing code I posted earlier was not optimal. It turns out that parsing the unencoded content of binary attachments helps to trap viruses and other Windows programs. In both the bodies and the headers it is necessary to ignore very short words, and to remove non-word characters. This means that:

To: jwbaker@acm.org

Is reduced to just

jwbakeracmorg

Which is perfectly fine for these purposes. I'm still tweaking the implementation. The latest iteration found five spams buried in my "good" corpus, but decided some things in my spam corpus were good. There are some odd effects currently. For example, my mail has only gone through mail.saturn5.com comparatively recently. Before that, it went via a different server, and I still have many spams from that server. Therefore the spam corpus disproportionately represents the old mail server. The term for the new mail server, "mailsaturn5com", can tend to make a spam appear legitimate.
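The reduction shown above falls out of a normalization step like this sketch; the minimum term length of four characters is my guess, not a number from the entry:

import java.util.ArrayList;
import java.util.List;

public class Normalize {
    static List<String> terms(String line) {
        List<String> out = new ArrayList<String>();
        for (String raw : line.split("\\s+")) {
            String term = raw.replaceAll("[^A-Za-z0-9]", "");
            if (term.length() >= 4) {   // ignore very short words
                out.add(term);
            }
        }
        return out;
    }
}

Run on the To: header above, the colon, at sign, and dots are stripped and "To" falls below the length cutoff, leaving just jwbakeracmorg.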

The goal is still zero false positives.

victory

The best part of a spam filter is the tiny feeling of victory when a spam takes a detour into the spam folder with 100% confidence, and helps to make sure that any spams like it will follow. Muhhhhahhahhaha

spam

I implemented Paul Graham's probabilistic filtering for spam. At first I attempted to simply use rainbow from the libbow distribution, but rainbow has many problems. It is painfully academic, its HTML parsing code always segfaults, it is miserably slow, and it cannot test more than one document per launch. Therefore did I apt-get remove libbow.

My implementation, in Perl, attempts to understand the MIME and RFC822 structures for better tokenizing. Here is the meat of the lexer:

sub lex_header {
    my ($h, $w) = @_;
    my (@t);
    
    @t = $h->tags();
    
    foreach my $t (@t) {
        my ($n);
        
        $n = $h->count($t);
        
        for (my $i = 0; $i < $n; $i++) {
            foreach my $tok (split(/\s/, $h->get($t, $i))) {
              $w->{$tok}++;
            }
        }
    }
}

sub lex_entity {
    my ($r, $w) = @_;
    my ($t, $h, $n);

    lex_header($r->head(), $w);

    $t = $r->effective_type();
    $h = $r->bodyhandle();

    if (defined($h)) {
        $h = $h->open("r");
        if ($t eq 'text/plain' || $t eq 'text/html') {
            while ($_ = $h->getline()) {
                chomp();
                if (!/[^A-Za-z0-9\+\=]/) {
                    next;
                }
                foreach my $tok (split(/\b/, $_)) {
                    $w->{$tok}++;
                }
            }
        }
    }

    $n = $r->parts();
    for (my $i = 0; $i < $n; $i++) {
        lex_entity($r->parts($i), $w);
    }
}

This is a decidedly informal approach to tokenizing a mail, but I think it works. Especially important is the line that rejects base64-encoded text in a text/plain entity. See if you can spot it.

After training the system on a spam corpus of 142 mails and a legit corpus of 3332 mails, the filter can reject about half of incoming spam and has not yet produced a false positive. Hopefully the rejection ratio improves as the spam corpus grows.

One damning bit of trivia arose from this experiment. The mail servers at my previous employer Critical Path are all 99% reliable indicators of spam. One of Critical Path's features is supposed to be spam filtering. Evidently it doesn't work overly well.

CD Changer UI

I have a 3-disc CD changer. It is pretty useless. You play the first CD, then the second, then the third. To play the fourth, you eject the third. To play the fifth, you eject the fourth and so on. The first and second discs stay in there forever. If I really wanted to program four hours of music, I would plug in the iPod.
