Older blog entries for raph (starting at number 178)

20 Apr 2002 (updated 22 Apr 2002 at 17:49 UTC) »

I've been thinking about trees a lot lately, and yesterday I had an epiphany about why: I'm actually going to need a high-tech tree implementation for Fitz (the merger of Libart and the Ghostscript graphics library). This tree will hold the graphics objects to be displayed or printed. It's known in the graphics field as a "display list". The PDF 1.4 imaging model is inherently tree structured, so it is most natural for it to be a tree.

The goals are simple. I want a nice, abstract interface, insulated from the implementation details. This interface should support dom-like tree navigation and mutation. Each node will also have a bounding box. The most important query on a tree node is to iterate through all children intersecting a given bounding box.

Also, I want the tree to support a nice Model/View architecture. Mutating the tree (the "model") notifies the view that something has changed. The view then queues a region of the screen for redisplay. A background thread (or, more likely, idle handler) empties this queue, rerendering these regions and drawing them to the screen. To the client, the effect is that the screen updates automagically.

Behind that interface, I want an efficient implementation. The main criterion is RAM usage, because this will go into things like printers, where RAM is far from free. Using an in-memory object per tree node is too heavy, as my experience with both DOM and the Gnome Canvas proved. Of course, it also has to be fast, so things like linear scanning through a serialization are out.

I now have a design that I believe will meet these goals. Here is the broad outline.

The API is fairly similar to DOM, but with one subtle and important difference: instead of Node objects for each tree node, I plan to use "node ids". These node ids are transient, even if the node itself is persistent. When the client is done with a node id, it explicitly lets go its reference. DOM, by contrast, generally relies on the Java garbage collector. I expect that the number of active node ids (ie, those for which the client holds a reference) is small compared to the size of the tree.

Obviously, this API admits a DOM-style implementation. In this case, there is a one-to-one mapping between node id's and node objects. Admitting such an implementation is a good thing, because it should be relatively simple to code up and thus useful as a prototype.

However, the real goal is to admit a more compact implementation. In this implementation, a serialization of the tree is held in a B-tree. Thus, there are two trees. The structure of these two trees need not bear any relation to each other, though.

The alphabet of this serialization consists of left parenthesis, right parenthesis, and atoms. Let's simplify things a lot by treating atoms as if they are one byte in size. That way, counting is easy, and we also don't need to worry about splitting an atom across more than one (fixed size) B-tree block. Thus, a complete depth-2 binary tree is represented by this serialization: "((AB)(CD))"

For each B-tree node, we store a little bit of extra information about the parentheses. Starting at 0, incrementing for each '(', and decrementing for each ')', we record the minimum and ending values. For example, if our B-tree block size is 4, then our blocks are "((AB", ")(CD", and "))". The summary tuples are (0, 2), (-1, 0), and (-2, -2), respectively. These summaries are stored for both leaf and interior B-tree nodes.

Now, there are two notions of node id's, which I will christen "sliding" and "stable". A "sliding node id" is essentially the tuple of a B-tree (leaf) node identifier and an offset within the node. Thus, in our example, if the id's of the three B-tree leaf nodes are x, y, and z, then the root node of our tree is (x, 0), its first child is (x, 1), and its second child is (y, 1).

Given a sliding node id, it is pretty easy to see how the standard DOM methods of first child, next sibling, and previous sibling can be implemented efficiently. In particular, you can jump over irrelevant B-tree nodes without having to dive into them. The details are left as an exercise for the reader.

Finding the parent of a node is a little trickier, but only a little. Basically, you walk down from the root node, and remember where you were. If this operation is important, then it probably makes sense to have a cache. You can also populate this cache when you do a "first child" operation, using the result (the child) as the cache key.

This concept of a sliding node id works well if the tree is read-only, or even in the (interesting) special case where you're appending to the end. But I'm also interested in the general case where you are mutating the tree.

The basic idea is to maintain a mapping from stable node id's to sliding node id's. When the tree is mutated, update this mapping. For example, assume that we have a node id pointing to "B". The sliding node id is (x, 3). Now, let's insert an "E" between "A" and "B", resulting in the following B-tree leaf nodes: x = "((A", w = "EB", y = ")(CD", and z = "))". We update the sliding node id in the stable-to-sliding mapping to (w, 1). Note that the node id for the root's second child stays at (y, 1). In fact, we expect most node id's to remain stable unless that neighborhood of the tree is being mutated.

This design is quite a bit less ambitious than, say, IronDoc. I'm not concerned with updating over a network, doing transactions, or generalized undo.

Messaging without email

David McCusker has also been thinking a lot about trees. Reading over his site has provided much of the inspiration for the above design.

I realized recently that he and I are conducting a full scale dialog through our respective weblogs. This has traditionally been the domain of email.

I feel like I've found a kindred soul. We've spent a lot of time thinking in the same space, and come up with parallel conclusions. I can also relate to much of the more personal info that David blogs - for example, when we were trying to live with another couple in an extended family, we had money issues similar to those that David describes. For us, those ended when we split back into our respective nuclear families.

I get the feeling that I would enjoy meeting David and hanging out with him. The normal thing to do would be to email him, but I'm a bit reluctant to break the blog-not-email pattern.


Yesterday, Roger Dingledine, who's in town from Boston, invited me to go to dinner with him. On the spur of the moment, I hopped over to San Francisco (where the cfp2002 conference is being held. It was great fun. In addition to Roger, I got to meet a number of people I hadn't seen in a while, including adam and Paul Syverson. I also got to meet some people I knew only online, including Len Sassaman and Lance Cottrell. Also, meeting George Danezis was a special treat. Bram stopped by late, but I had to leave to catch Bart by midnight.

These personal connections are vitally important. I'm happy that I was able to do this.


Someone (sorry I don't remember who) recommended Forth. I'm happy that he likes it, but my experience is otherwise. PostScript is actually one of the more powerful Forth dialects (lists and maps as first-class objects, etc), but I avoid it as much as possible.

It's awesome that projects like Parrot and Psyco are underway. Unfortunately, I fear that the goal of getting good performance may impose subtle but deep changes on the languages implemented. Python, in particular, is radically dynamic, much more so than a typical Lisp.

I also think it's great that we have different competing approaches. We don't yet know the Best Way to efficiently implement highly dynamic languages. My personal guess is that Arc will kick the ass of its competitors, by virtue of drawing on the deep knowledge well of efficiently implementing Lisp, but I would be happy to be proved wrong.

My main frustration, as I've written before, is my lack of good choices among stable languages. But this will inevitably happen in time. If there are free software values (in analogy to the extreme programming values), then I'm sure that patience is among them.


This is well worth reading. There are some problem domains in which performance does matter. It used to be that performance mattered in most problem domains; no longer. But the performance-critical domains tend to be the ones that interest me personally.

Languages such as Python and Perl have become popular for good reasons - it really is possible to express concepts in these languages more succinctly than in lower-level languages like C. However, this power generally comes at the price of a dramatic loss of performance. I don't agree that there necessarily has to be a tradeoff between beauty and performance; quite the contrary.

I guess what bothers me most is that the technology to extract good performance from highly dynamic languages is well known by now, but the implementors of Python and friends have largely chosen to ignore this art.

Fortunately, future implementors of popular scripting languages are as free to reject low-performance implementations as past implementors of popular scripting languages have been to reject high-performance implementations. From the looks of it, that's precisely what Paul Graham intends to do. If so, it would rock.


One of the things David McCusker has going at his website is an ongoing discussion of a data structure/algorithm he's hacking, which he calls blobs. From the outside, it looks like a file (ie sequence of bytes), but for which inserting and deleting in the middle is efficient. In most file implementations, only manipulating the end is efficient.

Under the hood, it's basically a b-tree. There seems to be a lot more to it than that - David is obviously a very smart guy, and equally obviously, he's put a lot of thought over a long time into the algorithms. Fortunately, as he points out, you don't necessarily have to understand the implementation of blobs to use them effectively. I think this is true of a lot of the best technology.

I can think of a few applications in my own experience where blobs would be useful. The most obvious is text editors. There is also a large class of binary file formats that can usefully be edited in this way - PDF definitely comes to mind, and many producers/editors of PDF files seem to go through all kinds of contortions to deal with variable length editing.

But, thinking about this a bit today on BART, I'm more intrigued with the idea of using this data structure for generic tree-structured data. In some serializations of trees, the basic tree editing operations map well to edits of the serialization. XML is one such serialization; ASN.1 is not, because it includes length fields.

The basic tree edits I have in mind are inserting and deleting nodes, and moving nodes up or down. In XML, the first two map to just inserting and deleting the XML representations of the nodes. The second two are not much harder - you delete or insert (respectively) matching start and end tags.

You, faithful reader, probably think I'm out of my gourd at this point. "Doesn't he realize", you might say, "that he's proposing to use a tree structure to store a byte sequence representing a tree? Why not just use a tree to store a tree, like DOM does?" Most of the answer is: performance.

Most DOM implementations take on the order of a hundred bytes or so per tree node. To me, this is clearly ridiculous. Even XML, which is a fairly redundant and bloated representation, takes a dozen bytes or so per node, assuming concise tags. It's not hard to imagine more compact tree serializations that take just a few bytes of overhead per node to store the tree structure. Lisp S-expressions, which aren't even binary, are a good example.

With a blob, the amount of memory taken by the data structure is only a tiny amount more than the serialization itself. So it seems like you might get an order of magnitude or so improvement without even trying very hard.

But RAM usage isn't the only reason this is interesting. Manipulating byte ranges is actually a lot simpler than doing edits on a tree. This can be interesting for a number of reasons.

Let's say you're trying to keep two copies of a tree sync'ed up, over some kind of network connection. If you were to do something really stupid, like DOM over CORBA, you'd generally get one method invocation per node. Since DOM over CORBA incurs a round-trip latency for every method invocation, performance would be amazingly bad. You also need to preserve a great deal of DOM complexity so you can express the edits to the tree in rich enough detail.

By contrast, it's very easy to imagine an efficient implementation based on byte range edits. One major win is that you can batch up operations on as many nodes as you like into a single byte range edit. Another major win is the dramatic simplification of the protocol: just insert and delete.

This is just one example; another is undo functionality. It's possible, but painful to implement a generic undo facility in DOM. Basically, what you'd need to do is provide a proxy DOM implementation that records the inverse of each tree edit into a log. When the client inserts, you log "delete". When the client moves a node from A to B, you log "B to A". Then, to undo, you play the log back. The devil is in the details - there are lots of operations, and a vanilla DOM is hard enough to implement without having to worry about maintaining integrity of such a log. It's also difficult to see how you'd aggregate changes, for example undoing back to a blank document in a single step rather than replaying the entire tree edit history.

In a byte range implementation, you just have insert and delete. These operations are each other's inverses. Further, it's pretty easy to reduce a sequence of such edits if need be.

Undo, of course, has lots of applications, including aborting transactions in a transactional system. You can also easily imagine this kind of thing being used for copy-on-write, version control, etc.

I'm not saying that this approach solves every problem. Holding on to a reference to a node as it slides around, fairly easy in DOM, becomes more challenging.

I spent a lot of time thinking about how DOM might be implemented efficiently when I was working on SVG. I've basically come to the conclusion that it can't be. But I do think that it's possible to implement something DOM-like that's a lot simpler, a lot more powerful, and a lot more efficient. Such a thing, I believe, would be very useful in a lot of interesting applications.

11 Apr 2002 (updated 11 Apr 2002 at 08:07 UTC) »

David McCusker (whose blog is well worth reading), posted a nice link to my last entry. I always enjoy ego strokes. Thanks!

David makes some important points. We are taught to "throw one away; you will anyway". This is good advice. The first time you write a program, learning how to write that program is the most important thing. Building a prototype is the best way.

But some of us have already built the prototypes for the programs we want to write; we have learned how. Brooks does not say to throw all programs away, only the first.

I feel strong kinship with David over these issues. After much thought, I find my preference for C reaffirmed. For one, it's a fairly good language for the kind of programming I enjoy most (2D graphics). I am happy packaging my work as libraries that I, and others, can use. C is still the best language for writing libraries, including those intended to be used from scripting languages. These libraries will continue to be useful even as these languages evolve.

For throwaway programs, I like Python.


Arc is the most interesting language design effort at the moment. If Paul Graham meets his goals, it will be a very useful language. He has a pretty good chance at meeting these goals, because Arc is essentially a dialect of Lisp, which already meets most of them.

Paul has solicited input from the community. The collected responses are well worth reading. They should be a Wiki, I think.

Of course, it will be a long time before Arc becomes a good language for writing "timeless" code. Among other things, the language has to stabilize, acquire some good implementations, and develop a rich set of libraries (with a thriving community to support them).


We have approval from the school system for our plan to home-school Alan in the mornings, and send him to public school in the afternoons. I think this will work well.

Alan has expressed interest in chemistry. I'm pretty weak in that subject, so I'm not confident about the best way to teach it. My inclination is to start by doing recipe-style experiments (baking soda and vinegar, color changers, etc). Do people have a favorite book or other resource for this stuff? Is a chemistry set a good purchase?

Max's language development is also rocketing along. He's now doing simple sentences ("dog peed on it"), can count to ten with no difficulty, and appears to be learning some letters. Not only that, but we spent some time this evening kicking a (small) soccer ball, and he's pretty good at kicking it where he wants it to go. Way better than I was at almost-2!


My Plantronics CT-10 arrived a few days ago. I'm still getting used to it, but overall I think I like it. The mike is not significantly "hotter" than a generic phone's, but seems to be better at noise cancelling. The headphones are quite a bit clearer, which is good. And a headset is a huge ergonomic win, especially if you're also typing.

Putting the phone in its charging cradle while off-hook doesn't work well. This is workaroundable, but still a misfeature.

Ego again

I was really pleased that Advogato scooped the rest of the online press with the CIFS license story. Thanks atai!


The 7.20 release is out, and this time rillian did most of the work. Working as part of a larger team sure feels luxurious at times :)


Working on mod_virgule again sure is fun. The patches are flowing, things are going live, improvements are being planned and discussed.

StevenRainwater's duplicate posting prevention patch is now live. This removes one of the usability bugs in the site.

Again, being able to delegate is a major luxury. gary is serving as mod_virgule's "patch penguin". We also have a reasonable system of CVS branches worked out, so the mechanical parts of applying patches and so on is flowing more smoothly.


There's now a wiki for mod_virgule development. It's way too much fun to play with.

I really like wikis. They're much like bumblebees; there's no way a reasonable person could believe they would fly, but they do. I think Advogato should have a wiki, or at least something rather wiki-like. It's the classic build-or-buy decision - do we want to deploy an existing Wiki implementation (probably with some customization, like integrating Advo's user accounts and tmetric results for authorization), or build our own on the mod_virgule framework?

The free software way

Heather got me "Extreme Programming Explained", at zooko's recommendation, and I skimmed through it a bit. It's interesting.

I'm convinced that there's something to XP, but reading the book, it's pretty clearly designed for custom proprietary software projects. There's quite a bit to adapt to make it work for free software. For example, there's a very illuminating discussion of balancing Time, Scope, Quality, and Cost. However, while there is something analogous to Cost in the free software world, it is very different than in the proprietary world. In free software, it's all about attracting bright, motivated volunteers.

Of course, as the XP advocates freely admit, the technique is nothing more than a well-chosen selection of time-tested techniques for writing software. Free software also has some good traditions. A radical new approach is not called for - more a careful look at what works, and how to make it work even better.

To me, Quality is the most important of the four factors listed above. For one, really good software is timeless (see "good taste is timeless" in Taste for Makers). In software, there is an overwhelming urge to throw things away very quickly, which in turn creates a need to make new things as quickly as possible. This is fine when the things are of low quality (so we don't miss throwing them away) and there lots of money to make the new things, but doesn't work well in free software. The cost of writing software is amortized over its useful life. Free software is very cost-constrained, so amortizing over a longer time is worthwhile.

Quality also has a profound effect on resource allocation ("cost"). My own willingness to work on a project is directly linked to the quality I can expect it to attain, and I'm sure a lot of other top programmers feel the same way. There is, in fact, a large pool of bright people willing to donate their time to free software - the problem is that it's spread out over a correspondingly large number of projects. A smaller number of very high quality projects seems like a much better deal to me than a huge collection of low-quality projects.

I feel that the Unix philosophy speaks well to the issue of Scope. Programs should do one thing, and do it well. The real power comes from being able to use such programs in many different contexts. In Unix, the elegant underlying model of processes, pipes, and files binds these small components together. There have been attempts to extend this philosophy to more domains, but it's very difficult to design things like interprocess widget bindings with anywhere near the same level of elegance as the original Unix system ("Good design is Hard", ibid).

Sorry for rambling - these ideas are still rattling around inside my head. I'm hoping for some kind of synthesis soon, at which point I'll probably write an Advogato article.


A long patch of "raph lag" has finally ended, with all of "Gary's branch" being merged into HEAD and applied live. Did you notice?

April fools

I'm really glad I didn't try any April fools pranks with the site this year. Aside from the usual issue of having no time, the obvious jokes (getting bought out by a BigCo) are pretty stale by now.

Two were funny: PigeonRank, and Binary Lexical Octet Ad-hoc Transport. The latter was funny because it had a real point, one well worth paying attention to.

Instant outlining

Dave Winer is rolling out something called Instant Outlining. I think there's something interesting there. I don't think he's doing any interesting new science. The polling/notify protocols look fairly crude compared to some of the more elegant work on such things in distributed systems research. The XML format (OPML) is not particularly elegant, although I must say it is something of a relief to see XML being used to represent primarily text with a tree structure. Cramming all the text into <outline text=""> elements is merely a syntactic nit. The fact that it's relatively simple XML positively invites people to play with it.

What is interesting is that Dave seems to have made propagation of annotated links extremely easy, but still under manual control. Because of the manual control, the trust issues are manageable. If you read the testimonials on Dave's site, you'll comparisons to email, which of course gets its trust all wrong and suffers badly from it.

I don't think Instant Outlining gets everything right. Browsing through some OPML files, I see some dissonance: between the flow of time and the tree, between missives and replies (they don't seem to be formally linked), between freeform text and formal structure, and between freshness and permanence. But at least they're playing with these concepts for real.

I also think that tree-structured outlines are a very useful concept, and underappreciated in today's world. Note how Advogato diary entries have evolved a very outline-like style, with headings tagged by <b>. If the volume of diaries scales up, it might be very useful to formalize that structure, and present a "collapsed view" for at least some diaries, showing the headings but not the bodies.


In response to some inquiries, I wrote up a Attack Resistant Trust Metric Metadata HOWTO. I posted it to p2p-hackers, which seems like the best demographic, but perhaps there are people here who might be interested as well.


More updates to the latest snapshot of my thesis. There's some good work in there, including considerably more detail on my Google analysis, a distributed network model, and more.

Not only that, but I finally went ahead and implemented the PageRank algorithm. It's pretty cool. I did some testing over the Advogato graph, and the results seemed to be very good, probably even better than Advogato's native trust metric.

This implementation is also fast: about 80ms to run the metric.

rms on patents

Richard Stallman's recent speech on software patents is very good. Most of what's been written on software patents falls either into the "patents bad. Ugh!" or the "we need them to incentivize innovation" camps. In this speech, rms goes quite a bit deeper.

Wiki lust

I want Advogato to have a Wiki. Just a simple little one, nothing super-fancy. I might just have to code one up after a bunch of the merging stuff happens.

29 Mar 2002 (updated 30 Mar 2002 at 08:32 UTC) »

Based on the feedback from my last post, as well as a little digging on my own, I decided to try the Plantronics CT10. I'll let you know how I like it.


The 7.20 release process is well underway. rillian put together a release candidate tarball. Feel free to try it. So far, there has been a fair amount of feedback on gs-devel, largely of build problems.

The other major user-visible thing that's happening is that we're trying to tighten up security. In particular -dSAFER now blocks reading of arbitrary files, in addition to writing. Unfortunately, this breaks gv on pdf files, because gv writes out a temporary PS file as a wrapper for the PDF.

I'm preparing a patch now for gv, but it's still tricky getting the fixes in the hands of users, largely because the last release of gv appears to be 5 years old.

Security vs convenience is a delicate tradeoff. I'm trying to do the right thing, but I still have the feeling that it will cause difficulty for users.

Trust metrics and Slashdot

I've heard from a couple of people that the Slashdot crew made the assertion at a panel at SXSW that "Advogato doesn't scale". This angers me. The are nontrivial algorithms at the core of Advogato, so it's not obvious that it does scale. I put quite a bit of work into optimizing these algorithms. Lately, I've been playing with the trust metric code itself, as well as the Advogato implementation. There's no question that the website has performance issues, but most of those are due to the simpleminded arrangement of account profiles as XML files in the filesystem. On a cold cache, crawling through all 10k of these XML files takes about a minute. Once that's done, though, the actual computation of the network flow takes somewhere around 60 milliseconds on my laptop (you have to do three of these to run the Advogato trust metric).

So it's clear that Advogato isn't anywhere near its scaling limits. If we needed to scale it a lot bigger, the obvious next step is to move to an efficient representation of the trust graph. bzip2 on graph.dot gets it down to 134k, so there's obviously a lot of room for improvement.

So I'm upset that misinformation went out to a fairly large number of people. In order to spread the word, I hereby challenge Slashdot to a public duel. In return for their analysis of the scaling limits of Advogato and of the underlying trust metric, I promise to do an analysis of why the Slashdot moderation system sucks. I will cover both mathematical analysis, pointing out its failure to be even modestly attack resistance, as well as some discussion on why this leads to lower quality results in practice.

I hope they take me up on this, as I believe such an analysis would be useful and interesting. If not, I hope they have the courtesy to retract their assertion about Advogato's scaling.

Trust metrics and LNUX

Slashdot is not the only LNUX property that has rubbed me the wrong way. The "peer rating" page on SourceForge has, since its inception, contained this text:

The SourceForge Peer Rating system is based on concepts from Advogato. The system has been re-implemented and expanded in a few ways.

The Peer Rating box shows all rating averages (and response levels) for each individual criteria. Due to the math and processing required to do otherwise, these numbers incoporate responses from both "trusted" and "non-trusted" users.

The actual system they implemented is a simple popularity poll. There's nothing wrong with this. I can certainly appreciate that they've been busy with other things, and trust research is obviously not a core focus of SourceForge. But I really wish they wouldn't pretend that it's something it's not.

I came to the realization a couple of weeks ago that this text is a big part of the reason why I feel such Schadenfreude over the woes of LNUX. Keep in mind, I think SourceForge has been a wonderful service, and I think the people who work on it are great (I know several personally). But I think the corporate structure at LNUX has hurt the goals of free software quality and intellectual inquiry. I have nothing against them trying to make a living doing a proprietary PHP candy shell over legacy free software tools, but I don't think their interests intersect much with mine - which is advancing the state of free software by doing high quality work.

Again, if they revised that text, I wouldn't be quite so happy when I see the red next to their symbol on the LWN stocks page.

Trust metrics and the future

I firmly believe that attack-resistant trust metrics based on peer certification have the potential to solve a lot of important problems. Obviously, I'm biased, because they are, after all, the focus of my thesis. It's certainly possible that, in practice, they won't work out nearly as well as my visions.

Even so, even the most critical person would have to admit that there's a possibility that trust metrics can be the basis for robust, spam-free messaging systems, authorization for ad-hoc wireless networks, not to mention the possibility of making metadata in general trustworthy, without which the real-world implementation of the idealistic "semantic web" concept will surely degenerate into yet another tool for spammers and related slimeballs.

"More research is needed." However, I'm not being paid to do this research, so my own contributions are strictly for fun. Further, centralized systems are much more likely to be profitable to corporations than democratic peer systems (cough, VeriSign, cough). Thus, the research is going to be driven by the currently tiny core of academic and free-software researchers that actually understand and appreciate the issues.

I am hopeful that interest will pick up, especially now that I realize that Google's PageRank algorithm is fundamentally an attack-resistant trust metric. I think everybody will concede that Google is waaaaay better than their competition. If attack-resistant trust metrics work so well for evaluating Web search relevance, then why not in other domains?

This is, I believe, a message that is ripe for more widespread dissemination. That's the main reason why I'm posting this entry, rather than nursing my perceived slights in private.

Call your congress-critters

When the DMCA was being proposed, I thought it was a terrible idea, but I didn't let my congresscritters know. It passed. There is some relation between these two facts.

The CBDTPA is, like the DMCA, a terrible idea. Not only does it have the potential to seriously impact the legal development of free software, there is also no Village People song to which the acronym can be sung. Thus, this time I decided I would at least do something, so I heeded this EFF alert and gave my two Senators and one Representative a call. It was a surprisingly easy and pleasant process. In all three cases, a staffer picked up the phone, and seemed genuinely eager to listen to concerns from their constituents. Not only that, they seemed at least moderately familiar with the issue, indicating that it's been getting some response.

You can make the response even bigger. Do it.

27 Mar 2002 (updated 27 Mar 2002 at 06:43 UTC) »
Advice wanted

It took me until last week to come to the realization that I spend a nontrivial fraction of my work time on the phone, in conference calls, but that my phones are pretty bad.

Thus, I have decided that I want a really good phone. Sound quality is paramount, but comfort is also significant. One friend told me I should get a phone that supports Plantronics headsets. Anybody have experience with these?


Even though I have spamassassin installed, spam seems to be getting worse. This essay (or here) suggests that I'm not the only one. The essay also makes an important point: the spam itself may ultimately not be as harmful as the steps taken to try to fight it.

There is no question about it; our email infrastructure is rotting. It seems like the bad guys are being creative in how to destroy it, but there is nobody actively working on improving it.

Even if email continues to be useful for some time, it is very much evolving in flavor. I don't much like what it's evolving into.

Lions book, C

I bought the Lions book on impulse some time last week, and it arrived today. So far, I've only skimmed it, but it looks like a serious treat.

One of the fun things about the book is that the version of Unix described is written in a rather old dialect of C - for example, the assignment form of addition is written =+.

A most striking feature of the language is that types are optional. This works largely because of the nature of the machine - there's only one interesting word size other than "char". The result is a rather more concise flavor than the C of today.

The idea of optional types has fallen out of favor. Languages today tend to either prohibit them (eg Python) or require them (eg Java). I'm not at all convinced that this is a good thing.

One of the joys of C is its maturity - it's been around long enough to have its rough edges sanded off, and also to attract some reasonably decent implementations and tools (though I believe that quite a bit of useful work remains to be done). It's also reasonably stable by now. I think this is very important - it's very difficult to achieve quality if you're constantly throwing good things away and writing new things that suck. Unfortunately, most of the trendier languages are unstable to an extreme. If I write a large system in a Python/C mix now, using libraries extensively, it's almost certain that the preferred Python implementation five years hence won't be able to run it without modification, and the libraries will no doubt have been superseded as well (especially if one of the libraries happens to be Tkinter).

That said, C is not yet completely static. I do not its current evolution is in very good hands, though. There appear to be four bodies that are actively working with C: the standards committee, Microsoft, gcc, and Apple. The latter two are actually based on the same code base, but seem to be steered in different directions.

These guys are not working together very well. C99 has yet to actually be implemented. A lot of the stuff in the spec is needless complexity - the "restrict" keyword certainly comes to mind, not to mention both digraphs and trigraphs. Then you have the "extern inline" fiasco, which virtually guarantees that a potentially useful feature is rendered useless for years to come. Then, you have really good ideas like standard sized integer types, that can't easily be used because they are incompatible with pre-C99 versions.

I think that if the C99 committee had been doing a good job, they could have fixed that last problem without much difficulty. Basically, they could have developed and released (under a very unrestrictive free license) a set of header files that reliably add these new types to the most popular C implementations of the day. New C implementations would of course include them "out of the box". That way, everybody would get to use them, and they could catch on quickly.

Then you have the useful features that they missed. One of my favorites is the ability to call a varargs function dynamically (vararg functions have had the ability to be called dynamically since at least the days of the Lions book). A special case (va_list) is already implemented. I don't think the more general case would have been that difficult or complex, and it would make life much easier for people trying to do dynamic language binding.

Ah well.


After discussing two instances of good things that are being poorly maintained, it is uplifting to see renewed vitality on mod_virgule.

My only question: if we implement CSS, and then it turns out that there are a lot of people who don't like it, will we have to DeCSS the site?

Btw, one idea that might be worth considering is to use a tiny bit of text to indicate cert level. This would not only distinguish certs on CSS-losing browsers, but also help with screen readers, tiny wireless devices, and other less popular web browsing technologies.

out of the box

It took me a while to figure out exactly what kgb has against the phrase "out of the box." Then it hit me - I was thinking about user experience, as in "it works well out of the box." For example, Ghostscript doesn't work all that well out of the box. It works much better if you know all the fiddly little command line things. I think we should all do more "out of the box" thinking.

schooling out of the box

Both of our kids are obviously highly gifted, which is forcing us to think about their education. A lot of people seem to have the attitude that the traditional school environment builds character, even if it's deathly boring for gifted children. Somehow, it's supposed to prepare the child for later life.

I just flat-out don't agree. Realistically, the chances of Alan becoming a corporate drone are much lower than, say, making a living writing poetry. Why not prepare him for the latter, then?

Our current thinking is to have him in school half a day, and at home half a day. Now that he's reading, I think he can take more of his education into his own hands, just like both his parents.


Playing with Python continues to be great fun. Check this out:

>>> import fitzpy
>>> rp = fitzpy.read_pdf_open('/home/raph/knight.pdf')
>>> t = rp.get_trailer()
>>> t['Root']['Pages']['Kids'].topy()[0]['MediaBox'].topy(1)
[0, 0, 612, 792]

I can't imagine it getting much better than this.

Alas, speed is still a problem. It's not horrible, mind you, but even with C doing all the work of opening the file and parsing tokens, it's still quite a bit slower than a C-only implementation.


The xmlrpc stuff is fun. I'm happy that gary is working on it. The idea of the community taking care of Advogato, making it thrive, thrills me.

People who are interested in mod_virgule development should look at the virgule-dev mailing list. The Badvogato crew has done some good things with the code, and it's just my slackerliness that's kept it from being integrated. I'm hoping that will change, now that Gary has so generously volunteered to take some of the load.

I'm posting this diary entry from the Python command line, using Gary Benson's xmlrpc patches. For more information, see xmlrpc.html. Looks like a lot of fun!

169 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!