Older blog entries for raph (starting at number 258)

Paul Graham and Spam

xach figured it out: the person analyzing spam is, indeed, Paul Graham. In fact, he has a new piece up.

I'm writing something more detailed about trust now, filling out some email conversation that we've started. However, I'm struck by one feature of his new scheme, so I'll just blog my response here:

Very good. I do believe you're on to something here. The idea of everybody building their own corpus _is_ powerful. Your analysis makes sense to me, and I do believe your tool will do better than most.

I take issue with one thing, though: your assertion that probabilities are superior to scores. Most people not trained in statistics will find adding up scores more transparent than computing Bayesian probabilities. Further, you've got an awful lot of voodoo constants in there: 2x, 0.01, 0.99, 0.20, 0.9. Lastly, the Bayesian probability computation assumes all the probabilities are independent (if I recall my stats correctly), which is definitely not valid in this application.

In fact, I do believe that your probabilities and SpamAssassin-like scores are equivalent. Use the transform: score = log p - log (1-p), or p = exp(score) / (1 + exp(score)). Take the 15 scores with greatest absolute values (see, already a more intuitive formulation), and simply add them. Your voodoo scores are now -4.6, 4.6, -1.4, and 2.2, respectively.
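
For the curious, here's a minimal Python sketch of that equivalence. The function names and the n=15 cutoff are mine, purely for illustration:

    import math

    def prob_to_score(p):
        # log-odds transform: score = log p - log(1 - p)
        return math.log(p) - math.log(1.0 - p)

    def score_to_prob(score):
        # inverse transform: p = exp(score) / (1 + exp(score))
        return math.exp(score) / (1.0 + math.exp(score))

    def combine(probs, n=15):
        # take the n scores with greatest absolute value and simply add them
        scores = sorted((prob_to_score(p) for p in probs), key=abs, reverse=True)
        return score_to_prob(sum(scores[:n]))

    # the "voodoo constants" as scores: -4.6, 4.6, -1.4, 2.2
    for p in (0.01, 0.99, 0.20, 0.9):
        print(p, round(prob_to_score(p), 1))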

Perhaps the best way to look at this is that Bayesian probability can give theoretical justification to a "score" system. That might be interesting to some.

Font problems

It's hard to find good, free TrueType fonts. See this note on debian-devel for more evidence.

Even the nonfree Microsoft fonts that everybody uses aren't available for free download any more (thanks to Peter Novodnorsky for bringing this to my attention).

Now it's my turn to say, "nyah nyah, told you so." I've long argued for Type1 and against TrueType fonts, for political, technical, and aesthetic reasons. Now that we're moving to high-resolution printing and unhinted antialiasing, the hinting embedded in the font is becoming increasingly irrelevant. Even if you are going to do hinting, a good renderer is capable of some very nice quality with Type1's. Formerly, all the free renderers sucked, but FreeType's renderer is now showing quite a bit of promise.

The message linked above doesn't mention it, but there's a good set of Type1 fonts with a GPL license. Recently, Valek Filippov has extended these fonts to include Cyrillic glyphs, thereby fulfilling the spirit of the GPL. We'll include these in the full Ghostscript release, by the way, as soon as we test them to make sure they don't cause regressions in Latin documents.

By the way, URW did not donate these fonts under the GPL out of the goodness of their hearts. Artifex paid good money for them, and donated them out of a mix of self-interest and altruism.

Bug bounty

We've set up a bug bounty program for Ghostscript. It'll be very interesting to see what kind of results we get. If positive, we might do something like this regularly.

GoBe

gobeProductive will be released under the GPL soon. It's very cool software. Why adopt it when we already have KWord, AbiWord, and OpenOffice? For one, it has the experience of the GoBe team behind it. These are the same guys who did ClarisWorks. I haven't seen the code, but it's likely to be a lot more elegant than its competitors.

I know the GoBe folks, especially Bruce Q. Hammond, because I did some Libart work for them. The graphics tool alone will be a significant contribution to free software, especially for less technical users.

The software is based on a "parts" architecture. This should make it relatively easy to get started. A good project would be to do Python bindings for the parts API, so people can do simple parts in Python.

The Gobe people have done some really great work. It's a damn shame the business side didn't work out. Instead of trying to hoard intellectual property into uselessness, they're giving their source to the public. The free software community should adopt them warmly.

jfleck: I agree with what you said, which is of course why I used the word "seemingly". I also don't know if Hatfill is the man or not, but I found the patterns of press coverage very odd.

Your comments on anonymous sources are worth noting too. There's almost no incentive to go on the record; speaking anonymously, you'll still get quoted, and you avoid being held responsible for your words. I'm a big fan of anonymity in some cases, but in others it's simply ridiculous. I was struck, for example, by this passage in the AP wire story on the Chinese Harry Potter sequel:

Rowling's agent, the Christopher Little Literary Agency in London, said it was aware of the fake Chinese Harry. A spokeswoman who asked not to be identified refused to comment by telephone, but sent The Associated Press an e-mail saying, "We are taking this issue extremely seriously."

Uh, sure.

Some followup

graydon wrote me a nice email on my proof thread, pointing out his diary entry, which a lot of people probably missed because the next one was posted so soon after.

In it, he points out the logic advocated by Eric Hehner. In this logic, programs themselves are predicate expressions. Graydon claims that this results in some simplification, presumably because you don't have to separate two different kinds of expressions: programs and predicates about them. Hopefully, when Graydon and I meet, he can explain this to me. I suppose it's possible, but to me these do seem like the kinds of things that should be separated.

Aaron Swartz linked my memorial to Dijkstra. It really made my heart glow when he called it "beautiful". Thanks, Aaron.

I got an exciting email a couple of days ago. Somebody smart (you probably know the name) is working on the spam problem, and asked me about trust metrics. I'll post more as soon as it's clear the person wants this info public.

I wrote recently about the alarmingly weak US media coverage of the Anthrax case. I'm relieved to see a seemingly fair, accurate article in Newsweek.

9 Aug 2002 (updated 9 Aug 2002 at 15:51 UTC) »
Why formal methods?

Formal methods have earned a bad reputation. At one time, I think it was widely held that eventually we'd figure out how to prove large scale programs correct, and that the benefits would be compelling. But it hasn't worked out that way. People program as carelessly as ever, with reams of bugs and security holes to show for it.

Even so, I'm stubbornly hopeful. I think we simply haven't worked out good ways to do formal proofs yet. Mathematical proof style isn't really all that formal, and doesn't seem to adapt well to computer problems. Dijkstra's work provides glimpses of the future, but those techniques won't become popular until we can teach ordinary people to use them.

Another problem is that mathematical logic is fairly nasty. Especially when dealing with infinities, you have to be careful to avoid pitfalls like the Russell set "paradox". It's especially a problem with rigorously formal logic because you really want "metatheorems" to work: essentially, creating new deduction rules along with proofs that they're sound. The problem is that no sufficiently powerful formal system can be both complete and consistent. So you have to place limits on metatheorems, and often getting work done has the flavor of working around these limits.

What's the answer? Well, one way is to bite the bullet and adopt a reasonably powerful axiom set as the basis for all other work. A problem here is that you can't really get people to agree on which axiom set is the right one. In fact, controversy rages on whether or not to accept the law of the excluded middle or the axiom of choice.

But I find these debates deeply unsatisfying. What bearing do they have on whether a program runs according to its specification? My gut feeling is that infinity-prone concepts such as integer, real, and set are a bit too seductive. Computers deal with finite objects. In many ways, a 64-bit word is a simpler concept than an integer, as evidenced by the much larger set of open problems concerning the latter. A 64-bit addressable array of bytes is a bit larger, but still finite. You can do a lot in that space, but a lot of the classically tough problems become conceptually simple or trivial. Solutions to Diophantine equations? Undecidable over the integers, but if you confine yourself to solutions that fit in memory, just enumerate all possible arrays and see if any fit. You wouldn't do it this way in practice, of course, but I find it comforting to know that it presents no conceptual difficulty.
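
Here's a toy Python sketch of that last point, with the word size shrunk from 64 bits to 4 so it actually finishes; the particular equation and the bound are just illustrative:

    # Exhaustively search a finite domain for solutions of x^3 + y^3 = z^3.
    # With a 4-bit bound this runs instantly; with 64-bit words the loop is
    # conceptually identical, just astronomically impractical.
    BITS = 4

    def find_solution(bits=BITS):
        bound = 1 << bits
        for x in range(1, bound):
            for y in range(1, bound):
                for z in range(1, bound):
                    if x**3 + y**3 == z**3:
                        return (x, y, z)
        return None

    print(find_solution())  # None: no solution in this range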

It looks as if the field of formal methods may be heading this direction anyway. Recently, the subfield of model checking has been racking up some success stories. The models in question have much more of a finite flavor than most mathematical approaches to computations. It may well be that the technology for reasoning about finite formal systems evolves from the model checking community to become widely applicable to programs. That would be cool.
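
To give a feel for that finite flavor, here's a little explicit-state reachability check in Python. The toy mutual-exclusion protocol and its safety property are my own invention, not taken from any real model checker:

    def step(config):
        # one process moves at a time; a waiting process may enter the
        # critical section only if nobody else is already in it
        successors = []
        for i, s in enumerate(config):
            if s == "idle":
                successors.append(config[:i] + ("waiting",) + config[i+1:])
            elif s == "waiting" and "critical" not in config:
                successors.append(config[:i] + ("critical",) + config[i+1:])
            elif s == "critical":
                successors.append(config[:i] + ("idle",) + config[i+1:])
        return successors

    def check_mutex(n=3):
        # explore every reachable state, checking the property in each one
        start = ("idle",) * n
        seen, frontier = {start}, [start]
        while frontier:
            config = frontier.pop()
            # safety property: never two processes in the critical section
            if config.count("critical") > 1:
                return False, config
            for nxt in step(config):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        return True, len(seen)

    print(check_mutex())  # (True, number of reachable states)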

I still haven't answered the title question: why formal methods? My answer is that formal techniques are our best hope for producing software of adequate quality. I'm under no illusion that it's a magic bullet. No matter how good the proof technology becomes, I'm sure it will always be at least one or two orders of magnitude more work to produce a provably correct program than to hack one out. Even so, programs will still only be "correct" with respect to the specification. In some cases, a spec will be relatively simple and straightforward. Lossless data compression algorithms are probably my favorite example: essentially you want to prove that the composition of compression and decompression is the identity function. But how can you prove a GUI correct?
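
That spec is simple enough to write down as a runnable check. This is nothing like a proof, of course, just the property sampled on random inputs, with zlib standing in for whatever codec one would actually want to verify:

    import os
    import zlib

    def round_trip_ok(data: bytes) -> bool:
        # the spec: decompression composed with compression is the identity
        return zlib.decompress(zlib.compress(data)) == data

    # spot-check assorted random inputs; a proof would have to cover them all
    assert all(round_trip_ok(os.urandom(n)) for n in range(0, 2000, 37))
    print("round trip held on all sampled inputs")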

You can't, but the use of formal methods will put intense pressure on spec authors to remove needless complexity and ambiguity. When formal methods catch on, I think we'll start seeing specs that truly capture the essence of an idea, rather than the sprawling messes common today.

I also believe that security will be a major driving force in these developments. Our processes for writing software today (free and proprietary alike) are simply incapable of producing secure systems. But I think it's possible to formalize general security properties, and apply them to a wide class of systems. Buffer overflows (still a rich source of vulnerabilities) are obvious, but I think it's also possible to nail down higher-level interactions. It would be cool, for example, to prove that a word processing format with rich scripting capabilities is incapable of propagating viruses.

Even so, getting the right spec is a hard problem. Timing attacks, for example, took a lot of people by surprise. If your spec doesn't take timing into account, you might have a system that's provably impossible to break (I'll assume that P != NP gets proven somewhere along the way), but that falls even so. This brings me to another point: security assertions are often very low-level, while the natural tendency of computer science theorists is to lift the abstraction level as high as possible.

This, I am convinced, is how we'll program in fifty years. A lot of work will go into writing good specifications; more than goes into writing code now. Then, when people actually write programs, they'll do correctness proofs as they go along. It might take a thousand times as much work to crank out a line, but I think we can easily get by on a thousandth as much code.

And I think it's a pretty good bet that this will come out of the free software world rather than from proprietary companies. We'll just have to see.

In memoriam, Edsger Dijkstra

Yesterday night, Edsger Dijkstra died of cancer. His early work was truly pioneering, with many of his ideas, including semaphores, forming an integral part of computing today. He is perhaps most famous for his article, "Go To Statement Considered Harmful". Since then, "considered harmful" has become a distinctive part of the computing lexicon.

Dijkstra was an intellectual father to me. Twenty or so years ago, my father initiated a correspondence with him, and we kept it up for some years. I eagerly awaited the airmail envelope from Holland, addressed beautifully in fountain pen, and always found the contents to be equally beautiful, in words as well as penmanship. It's remarkable that he took time in his busy life to correspond with an arrogant young kid, and I still appreciate it.

Later, I attended the 1990 Marktoberdorf summer school, and had a chance to experience Dijkstra's wit, charm, and warmth. Looking back, it was a key point in my life - I was struggling with what direction it should take. The summer school was an amazingly stimulating and positive experience in academic computer science, and no doubt contributed to my decision to enter graduate school at Berkeley two years later.

I have been deeply influenced by his work on formal techniques. I haven't proved any real programs correct, but very often, thinking "how would I prove this correct?" has led to a better solution. His exposition of weakest preconditions was an intellectual delight, providing deep insight into the basic stuff that programs are made of.

Dijkstra possessed a bright, clear vision of how we should program. To him, programs were theorems, and programming was essentially the same as working out proofs. Over time, he developed a distinct style of writing proofs, emphasizing simplicity and straightforward application of steps.

Today, his vision is, sadly, still exciting and bold. In a more perfect world, the computer industry would have embraced his ideas, and much programming today would be organized around writing provably correct programs. Many of Dijkstra's beautiful, angry essays took the world to task for short-sightedness, upholding the ideals of intellectual rigor and simple, systematic design.

His formal techniques and style have not become wildly popular, but I believe they have been deeply influential nonetheless, through his patient teaching and clear writing. I am sure that when, some day, computing progresses from its current crude, alchemical practice to a real science, it will resemble Dijkstra's vision, and published work, much more closely than the bloated, corporate-designed messes we're saddled with today.

Now that Dijkstra has passed on, it falls to our generation to carry his ideals forward. Let us strive for simplicity in all our work. Let us strive to choose the right path, even when a shortcut seems more expedient.

Calculemus!

Ghostscript

The 7.22 release is out. This is mostly pdfwrite stuff. We're trying to shoot for an 8.0 release by Seybold, 10 Sep. It's tight, but we might just be able to pull it off. One of the things we're trying this time is a "bug bounty". It'll be interesting to see how that works out.

Do hash potatoes grow in Galois fields?

David McCusker wonders how to choose a CRC polynomial for 64 bits. Leaving aside the question of why, here's a generic discussion of LFSR's, and an application note geared towards FPGA's, but still reasonable. Also see this note on why concatenating two 32-bit CRC's doesn't work.
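
For concreteness, here's a plain bit-at-a-time CRC-64 in Python. The polynomial is the one usually cited for CRC-64/ECMA; the zero initial value and the lack of reflection or final XOR are arbitrary choices for the sketch, not recommendations:

    POLY = 0x42F0E1EBA9EA3693  # commonly cited 64-bit CRC polynomial
    MASK = (1 << 64) - 1

    def crc64(data: bytes, crc: int = 0) -> int:
        # MSB-first CRC: shift each byte in, reduce by the polynomial bitwise
        for byte in data:
            crc ^= byte << 56
            for _ in range(8):
                if crc & (1 << 63):
                    crc = ((crc << 1) ^ POLY) & MASK
                else:
                    crc = (crc << 1) & MASK
        return crc

    print(hex(crc64(b"123456789")))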

But CRC's are pretty bad hashes. Collisions are very easy to generate, and resistance to collisions is one of the usual criteria for hash quality. It's hard to imagine an application for which a 32-bit CRC is inadequate but a 64-bit CRC is a good choice.

Proofs

I got some very nice email from Konrad Slind and Ken Friis Larsen about how to make portable proofs, and about HOL. HOL stands for Higher Order Logic, which basically means you can quantify over functions and predicates. You can't do this straightforwardly in first-order logics.

Given this power, you can make nice abstract proofs that don't depend on any particular model. Basically, your theorem is of the form "for all systems of numbers that satisfy these axioms: ...". If you want to apply the theorem to a specific model, you just plug it in, and prove each of the axioms as a theorem. The general technique is called "theory interpretation". In its most general form, it's a transformation of abstract proofs into concrete proofs expressed in terms of a particular model, but in a nice system you can do the deduction directly, by taking one instance of the (polymorphic) universal quantification. Theory interpretation is one of the ideas in QED.
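
Schematically (my notation, not HOL's), theory interpretation looks something like this in LaTeX:

    % abstract theorem: quantified over any structure M satisfying the axioms
    \[ \vdash\ \forall M.\; \bigl(A_1(M) \wedge \dots \wedge A_n(M)\bigr) \Rightarrow \varphi(M) \]
    % interpretation: prove each axiom for a concrete model M_0, conclude the theorem for it
    \[ \vdash A_1(M_0),\ \dots,\ \vdash A_n(M_0) \quad\Longrightarrow\quad \vdash \varphi(M_0) \]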

HOL is based on ML, which is really a very nice language for this sort of thing. I'm not yet convinced that the whole business of "tactics" and "tacticals" is essential in a proof interchange format, but they do seem to be a good way to partially automate the process of producing proofs.

Document formats

jfleck took my bait. Ok, I will explain why I think structural markup is not always the best approach.

In the simplest model, an author creates a document in a document format, then transmits the file (now a sequence of bytes) to a recipient. The recipient is in possession of a "viewer", which is a transform from the document format into a human-readable presentation. The usual viewer displays the document on the screen or prints it out on the printer, but it doesn't have to be that way. In particular, blind people will generally want the text read aloud.

So now we have the issue of variability of the viewer transform. Some document formats (PostScript, say) nail down the presentation quite precisely, while others (HTML, say) might appear very different in different viewers. Which is better?

Many people take a religious position on this issue. But I prefer to look at it in the framework of presentation quality. You have a viewing context: paper vs screen vs audio, window size (or screen size), color vs monochrome, etc. These are all legitimate sources of variability. You also have gratuitous variability, very commonly different sets of available fonts. You can address this variability in a number of ways: restricting the allowable fonts in a document, using freely licensed fonts, or allowing "embedding" of the font's glyphs (but not the complete font) in the document. The latter option is interesting because it restricts editability.

Now you can ask the question: over the space of viewing contexts, what is the quality of the presentation? As always, there are many aspects to quality: aesthetic beauty, ease of reading text, ease of finding something in the text, consistency, etc. The art of graphic design is all about making a high quality presentation. A typical rule of thumb is to place a graphic either on the same page as, or on the page facing, the text which describes it. But the rules of thumb often conflict, and it's a judgement call to decide which ones are more important.

And now we can address the question of structural markup vs a document format that emphasizes presentation. Especially for documents that fit the "structured" model, the former can produce reasonably good presentations across a wide variety of viewing contexts. PostScript, by contrast, can represent a stunningly beautiful presentation, but only in a very narrow context.

There are, of course, many other factors that affect quality. Of popular document formats, HTML is particularly bad when printed. It's good enough on the screen, though. In fact, it could be better for screen viewing than PostScript or PDF, because it could use screen-optimized fonts, while the latter often forces printer-optimized fonts onto the screen. In practice, though, Web browsers haven't paid as much attention to rendering quality as, say, Adobe Acrobat, so the advantage is only potential.

Thus, I hope I've conveyed why I don't think that pure structural markup is always better than pure presentation, or vice versa. The main trend seems to be toward hybrid approaches. For example, PDF 1.4 allows you to specify one layout with great precision, but also includes parallel more-or-less structural markup so that it can reflow the text for small windows, etc. This adds considerable complexity, but one can make the argument that it serves readers of the document well. Similarly, the purity of structural markup is often bent, incorporating directives and hints with the intent of improving the presentation (style sheets are one such technique). However, these hints almost always fall short of pinning down the exact layout of text on the page, so you really can't do as well in print as a presentation-centric approach.

word2pdf

The discussion of document formats reminds me of a wishlist item I haven't mentioned here before: a batch converter from Microsoft Word format(s) directly to PostScript or PDF, faithfully reproducing the original layout. Obviously, wvware goes part of the way, but the print output goes through TeX, so the formatting gets pretty badly mangled.

Doing it as a batch converter is a lot easier than a full GUI word processor. That makes it realistic to focus on faithfulness. Over time, the existence of such a batch renderer would be helpful for building high quality GUI editors, but in the near term, a lot of the time people just want to view or print the documents anyway.

I'm pretty sure that we (Artifex) could sell such a tool, as it fits in with the Ghostscript business, and we've already seen some customer interest. So, if anyone out there is seriously interested in working on something like this, get in touch.

(also see wvWare2 design document)

Document formats

Peter Deutsch and I had a very interesting and wide-ranging discussion yesterday. One point I feel compelled to write about is the current state of document formats.

Peter did his presentation at Adobe in vanilla HTML. It worked reasonably well, but of course didn't have a very polished feel. The A/V techs found the choice unusual; presumably PDF and PowerPoint dominate.

What format should one prepare documents in? It's 2002, and there's still no standard, editable document format that produces reasonable quality printed output. Word, PDF, and HTML are the entrenched solutions, each satisfying two out of those three criteria. Are we ever going to get such a beast? It's possible that Word will become standard, PDF will become editable, or that HTML will become good, but the trends are not really in place.

It's also important to mention TeX. This format is actually pretty close on all three counts, but suffers from serious usability problems, and there are other issues; fonts other than the Metafont family are not standardized. The sad thing is that it was pretty close twenty years ago, but the work needed to make it broadly usable never happened.

The theoretical space of document formats is interesting. I think I'll write about it in coming days, including why the blind faith in structural markup is wrong.
