Older blog entries for raph (starting at number 218)

20 Jun 2002 (updated 20 Jun 2002 at 06:32 UTC) »
bgeiger: yes, expiring certs after a year was always part of the plan, I just haven't gotten around to implementing it yet.

Joel on Software has a nice piece on how large corporations are adopting free software. Perhaps esr's dream is becoming realized after all, just a bit more slowly than he predicted. In any case, I don't care much about big corps and free software. The lifeblood of the movement is still individual, human people, and will probably remain so for quite some time.

The subtext of yesterday's async musing was to express skepticism that all the city/fiber/dock/boat mechanism in Mithril was actually needed. But, of course, not having seen the design in detail, I can't really criticize it.

Async runtime musings

Matt Welsh's SEDA is very, very interesting. I especially like the quantitative flavor. Often, that's what makes the difference between expressing one's taste and doing real science.

Matt chose Java for his implementation. On the surface, this is a strange choice. For one, until quite recently, Java didn't even have a nonblocking I/O library. Matt had to implement his own, using JNI. The path of least resistance is to use a thread per connection, which has scaling problems as the number of connections grows large. Of course, it's possible to mitigate these scaling problems by throwing more hardware at the problem. Now why would Sun do a thing like that?

Anyway, SEDA is a fairly sophisticated runtime. Each stage is implemented with some number of threads. The number of threads is dynamically tuned (by "controllers") to optimize things like utilization and flow control. Many asynchronous designs have a much more straightforward mapping between threads and the tasklets (or "fibers", in McCusker's terminology). For example, a lot of people use select() to get one thread, one process. Alternatively, people who sell high-end hardware encourage the use of one thread per tasklet. But neither is truly satisfactory. A select()-based server will block on many important tasks, including file I/O, and is also sensitive to individual event handlers taking too long. It also won't get any gain from multiple CPU's.

SEDA solves this, but places very high demands on the runtime. In particular, you have to deal with very dynamic patterns of memory allocation. In a performance-oriented context, it's tempting to do this memory management (at least partly) by hand. But in an environment as dynamic as a typical SEDA server, it gets really tricky. Not to mention that malloc() and free() themselves start sucking when multithreaded (a microbenchmark shows roughly 3x slowdown with zero lock contention).

Similarly, reference counting would start to bite very hard too. In general, you need a mutex around each refcnt bump. No wonder Python has a global interpreter lock instead.

So the Java approach is actually very sensible. You simply require high performance multithreaded garbage collection from your lower level runtime. This is actually quite a Hard Problem; much PhD-level research has gone into it. However, by now it's more or less been solved. Since it's at the very lowest level, a large investment gets amortized over everything else at higher levels.

There's a pattern. Do sophisticated and complex stuff at the low level (here, hairy and nasty might be better adjectives) so that the higher levels become simpler.

SEDA isn't just interesting because of raw performance, either. From the looks of it, you actually do get good decoupling between the various stages. There's also some thought about debuggability, often a terribly hard problem in asynchronous systems. Perhaps it will become realistic to write an asynchronous library and be able to compose it with other components.

Thanks to Don Maas for an enlightening email about OpenGL display lists. As it turns out, these are not directly editable, but you can do splicing-style edits by composing them hierarchically. Not directly relevant, I think, but interesting.


David McCusker is going asynchronous on us again!

There are two pieces of the related work that strike me as quite interesting. First, Matt Welsh's SEDA work, which is quite appealing. In his Hotos paper, he makes a strong case that operating systems should not always provide a transparent virtual machine. Indeed, I can think of lots of ways in which global coupling could lead to better overall performance. One example that comes readily to mind is coordination of caches. In multiple processes on the same machine, many megabytes may be allocated in an unused cache, while another highly active cache could benefit from the space.

Reading about the specific webserver application also gave more insight into why the lock contention here was causing such long waits. When the pool of Apache processes is exhausted, the socket doesn't accept new connections. These connections are retried using the usual TCP exponential backoff algorithms. So delays could be in the dozens of seconds, even after the queue is mitigated server-side. All that should be fixed now.

Second, it looks like some serious quantitative design has gone into BEEP. The section where they talk about duplicating some of TCP's flow control caught David's eye, and mine also.

I conclude that TCP sockets are delicately tuned for performance, and that any abstraction which is placed over them carries the risk of destroying this tuning. CORBA is a particularly bad offender. First, it makes it too hard to do asynchrony. Second, it hides extremely useful information about the state of the TCP connection, particularly when it is closed, and when the window is full. If these properties are important, then protocols composed from CORBA will inherently suck compared with those built directly on top of TCP/IP. Remoting other notions of interprocess communication carries the same risks. Yet, in the local case, much simpler notions such as CSP-style rendezvous may be just as effective.

A couple of people pointed me at Twisted, and also libio by Niels Provos. Thanks.

Asynchrony is fascinating, but difficult to master. All real implementations face difficult design tradeoffs. Systematic approaches like CSP are interesting, but I am firmly convinced that there is no universal answer.

Fitz replies

kai: I'm unaware of the node design in OpenGL display lists, but they sound interesting. The OpenGL folks seem quite performance oriented. Do you by chance have a good reference?

For normal horizontally placed text, the default will be 1/4 pixel horizontally, and a whole pixel in the vertical direction. These defaults are based on fairly careful empirical study. 1/4 pixel horizontal positioning is nearly indistinguishable in quality from ultraprecise positioning, while 1/2 pixel is not quite enough. Vertically, alignment with the baseline helps rendering quality. The only real downside is that line spacing may become slightly irregular. I doubt this will be much of a problem in practice, though.

For vertically oriented text, I'll probably just switch the axes, but I'm not sure, as I haven't done any kind of careful study.

Finally, for rotated text (not aligned with either axis), I'll turn off caching altogether, because the hit rate is likely to be low.

As for progressive rendering, it's not high on my list of interests. I don't expect that the difference in performance between low quality and high quality modes will be enough to justify the complexity. And, as you suggest, it becomes less interesting as systems get faster.

mslicker writes: raph, I have to ask, what problem are you trying to solve?

This is absolutely the right question. I haven't written it down explicitly, and for that I apologize. The top-level problem is rendering a 2D graphics images provided by the client. At first, the main client will be Ghostscript, so the images will be PostScript and PDF files, but I expect interactive applications will use the library as well.

Then, there are several subgoals. Performance is critical. Memory utilization is also very important in some applications (especially embedded printers, which have very limited hardware). Even on desktops, using less memory is usually a big win, because of memory hierarchy effects. Quality of rendering is critical, but doesn't directly affect the tree API.

Given these constraints, I'd also like the client API to be clean and simple, even when the implementation does sophisticated things under the hood. I want to do a extensive unit and regression testing, and I feel that Python is a good language for these, so Python bindings are also important.

So why am I arriving at a general tree infrastructure? For one, the imaging model includes PDF 1.4 transparency, which has a very strong tree structure. Optimized rendering requires analysis of the tree. The painter's algorithm won't work as well.

And the simplest tree implementation, one in-memory object per node, is far too expensive in memory utilization when the tree is very large, or the available memory is limited. Thus, it's important to store the tree in a concise format, competitive with Ghostscript's existing "command list" at the least. I can easily envision multiple tree implementations tuned for different tasks. The simple, in-memory one will be implemented first anyway, because of its simplicity. I am not a fan of gratuitous virtualization, but when you have two different implementations of the same conceptual interfact, it's often a good idea.

At this point, I'm exploring the design space. This exploration is informed by thinking about more general problems, lots of them from my DOM experience. When a general approach is good, I'll use it. However, when generality forces unpleasant tradeoffs, within Fitz it will be a casualty. I am spending time thinking about more general trees so I know which is which.

I haven't written up Fitz in any real detail yet. In some ways, it's unfair to be blogging it, because you see the raw thinking without needed context. When I get to start work on Fitz in earnest, I will do much writing, and it will all be very accessible. But I have to concentrate on several hard problems for the upcoming Ghostscript 8.0 release first.

Change notification in trees

This is another fairly dense technical piece. If you were hoping for diversion or scandal, sorry. It is a companion to my caching in trees piece from a few days ago.

For interactive displays, the Fitz display tree will follow a Model/View pattern. The model is the tree, and the view is a rendering module that renders the tree and displays it on the screen.

What happens when you change the tree? If you wanted the simplest possible implementation and didn't care about performance, you'd re-render the tree from scratch, and display that. But I do care about performance, and I am willing to sacrifice some of that simplicity for "minimal updating". At root, this is the realization that, if you have the rendering for the old tree in hand, then to get the rendering for the new tree, you may only need to do a small incremental computation.

Let's look at minimal update in more detail. A deep optimization running through much of Fitz is culling based on bounding boxes. Changes to an object are guaranteed not to affect rendered pixels outside the bounding box (note that effects like drop shadows don't fit neatly into this category; thus, they're not directly supported in the Fitz imaging model). So, if you change an object, a reasonable approximation to the update area is the union of the old and new bounding boxes. If you delete or insert an object, just the one bounding box.

However, for some objects you may be able to do better. For example, when dragging one corner of a semitransparent rectangle, the minimal repaint area may be L-shaped. In this case, repainting the entire rectangle would be wasteful.

One more performance issue: if the tree is very large, then the cost of having one "event listener" per node could be too much. Keep in mind that the only justification for all this is performance. If you add complexity and make other aspects of performance (like memory usage) worse, then it's probably not a win.

So the broad outlines of a change notification system begin to suggest themselves. If the tree is small, then having an event listener on each node is reasonable. Changes to leaf nodes are distinguished from changes to tree structure. The former are dispatched first to a change handler for the specific leaf type. The fallback is the bounding box approach. Changes to tree structure also invoke the bounding box method.

However, as the tree grows larger, you only attach event listeners to some nodes. It's a type of cache. For nodes that have no event listeners attached to descendants, you expect a notification any time a descendant changes. Otherwise, you only care about changes to the node itself, because you know other listeners will handle the rest.

So now we have the concept of event propagation. For at least some kinds of listeners, you want to bubble the event up the tree until you hit a listener. In the extreme case, you can have just one listener at the root of the tree.

Now I face a difficult design decision. If I want to have multiple Views of the same Model, event propagation gets trickier. In particular, the propagation patterns for different Views can't interfere with each other. You can't simply stop bubbling an event upwards when you hit a listener, because maybe another View has their listener further up. Bubbling all the way to the top isn't very satisfying either, because that will tickle the listener at the root even if a deeper listener has already handled the change.

There are a bunch of other hacks that don't work so well either. You could, for example, check to see whether a deeper handler has already been called, and ignore the notification if so. But you get into trouble if you want to aggregate several changes together into one notification. Keep in mind that if you change most of the nodes in the tree, then re-rendering from scratch is almost certainly better than doing an incremental computation for each node changed.

If you did have a good solution for the one-View case, then you could extend that to the multi-View case by having an "event group" for each listener. When a listener catches an event, it sets a bit so that other listeners with the same event group farther up the tree don't get called. But the complexity is unsatisfying.

That said, let me present one reasonably simple approach supporting multiple Views. Here, the tree implementation does just bubble events all the way to the root. It's up to the client (the View) to apply the event propagation logic. At an internal node, if you have listeners for all children, and if the node itself is unchanged (meaning no insertions or deletions at that node), then you can ignore the notification.

However, there is one unappealing aspect of this approach. If you want to add more listeners (say, after they've been thrown out of the cache), then it only makes sense to add listeners to all children of a node. If you only add it to some, you can't reliably ignore the notification, so you might as well not add it to any. I expect that the fanout may be quite large, so this could be a problem.

So far, I haven't been able to come up with what feels like the Right Answer, only different tradeoffs. Right now, restricting the thing to one View seems like the best tradeoff. For Fitz, I think this will very much be the common case. But for other, more general tree applications (particularly those involving networking), it's probably not good enough. It would be nice to have a general solution that was simple enough to be practical for Fitz, but I guess I'm just too incompetent to come up with one. Ah well, such is life.

15 Jun 2002 (updated 15 Jun 2002 at 07:18 UTC) »

I finally got around to fixing the locks so that lock contention won't cause huge delays in reading pages. Writing (updating diaries and the like) can still be affected, but this is less urgent to fix.

LotR wrote:

We need a diary-writing trust metric!

Okay. I think you're right. I might well be motivated to write a generic metadata engine and apply it to the specific application of "how interesting is diary X?". Here's roughly how it will work.

When you're logged in, you'll get a chance to enter a one-to-ten score for another user's diary page. I might put this right under the "Certify <user> as:" selection at the bottom of individual person pages, but I'm also inclined to make it more accessible, for example allowing bulk updates on a customized version of the recentlog page.

This goes into the database as generalized assertions. At first, the only assertions that will be allowed are of the form "<user>'s diary is 7 on a one-to-ten scale", but the engine doesn't care what kind of assertions are present. "Roquefort is a particularly fine cheese" is also plausible. The reason for limiting the assertion space is to avoid scaling problems, which can become quite severe as the number of assertions scales up.

Then, roughly nightly, there will be a process that computes metadata scores, using the method I presented in my HOWTO. This will compute a confidence value for each user in the trust graph and each assertion. You can see where the scaling problems come from. I am sure there exist techniques for storing this data more sparsely, but I'm not interested in doing that research now.

Finally, the recentlog display will be annotated with the metadata scores. I'll probably also put in an threshold option.


I am trying my best to be patient with bytesplit. I realize he is a human being like all of us, but for whatever reason driven by demons causing him to antagonize people here. I sincerely wish that he is able to tame these demons, and interact positively with Advogato.

At the same time, I realize this is unlikely. As such, bytesplit is providing an opportunity to look at the trust metrics and the dynamics of this site more critically. The current trust metric certainly has limitations, and is definitely not a magic bullet for making this site an interesting read and a comfortable place. That's up to us.

What the trust metric does do is automatically compute membership in the community based on peer certifications. While I personally feel that bytesplit's contributions to free software are marginal at best, ten people here feel that his level of interest is high enough to rate an Apprentice cert. And, he does show interest in learning more, and his on-topic writings are perfectly reasonable for an aspiring apprentice. Given that, I don't think the trust metric should reject bytesplit's ranking.

All this is good motivation to implement the generalized metadata as proposed above. Unlike the existing trust metric, this metadata system would directly address quality and relevance of writing. I'll be very interested to see how it goes.

Cert inflation

We definitely have cert inflation here. Part of that is because the trust metric is generous, part of it is that people here are generally doing an inaccurate job of evaluating peer cert levels. This is useful information for people trying to design metadata systems: a significant fraction of the information input will simply be wrong.

I could certainly make the trust metric less generous. The easiest way to do this would be to have negative certifications as well as positive ones. But I'm not convinced that cert inflation is the most important problem in the world to solve.


David McCusker called again, and we had another nice chat, this time focussing on writing programs in asynchronous style. I think it's a hard problem. I think it's even worse for library writers, because it may not be realistic to assume that most users of your library will understand asynchronous programming very well. I told David of X as a cautionary tale. X actually has very sophisticated logic for dealing with asynchrony properly. For newcomers to X, this all seems very intimidating and complex (asynchronous grabs are a good case in point). In fact, I think there is widespread failure in levels above X to deal with race conditions and the like correctly.

Every time you do something over the network, it's asynchronous whether you like it or not. Yet, event-driven programs seem a lot more complex than their simple, synchronous cousins. David would like to recapture that simplicity in asynchronous programs. A lot of other people have tried things in this direction, without very happy results so far. I feel that CORBA is a cautionary tale in this regard. It pretends that method calls are really local, when in reality they're decomposed into two asynchronous events, and of course all kinds of things can happen in the meantime.

I haven't seen any of the details of Mithril yet, but I'm fairly skeptical that it will make asynchronous programming accessible to less-skilled programmers. On the other hand, I am perfectly willing to believe that it will be a good tool for expressing asynchrony concisely, and thus useful for people who know what they're doing.

One detail we touched on but didn't really go into was whether the fundamental message sending operation on channels should be synchronous (as in CSP) or asynchronous. In CSP, if you send a message on a channel, but there is nobody ready to readon the channel, you block. The other way to do it is to append the message to a queue. Both are reasonable primitives, in that it's quite straightforward to simulate one in terms of the other. So which do you choose?

I mentioned that the CSP way might be easier to reason about. There's another issue that came to mind after our call: the queue required for the fully asynchronous case requires unbounded resources in the general case. Obviously, in tiny embedded systems, this can be a real problem. On desktops, it's less clear. But if a system is operating under very high load, you probably want to worry about whether the queues will keep growing. Of course you can always implement flow control on top of async messages, but that's not really the point. On CSP, the default is not to grow unboundedly.

mwh: I haven't been following Stackless Python closely, but I am aware of it. Looking briefly at the site, I see they are now implementing a concurrency and channel approach directly inspired by Limbo and CSP. That could be very cool.

Farmers Market

We went to the Farmers Market in Benicia, as we do most Thursdays. This was the first week of corn season, so we got some. It was amazingly good. I like farmers markets; they're a good way to make a connection to the people who actually grow your food.


Antony Courtney called me up today, and we had a nice chat. He is interested in working on Fitz, largely so he can use it for his thesis work on functional reactive user interfaces.

His work is done in Haskell, so Haskell bindings for Fitz would be part of the deal. I think these two things could work well together. Fitz has a somewhat functional design. The rendering of a tree is a function of the tree. Caching is strongly related to memoization in functional languages, and minimal updating is a form of incremental computation.

Language bindings are to be an integral part of Fitz. We'll use the Python bindings extensively for testing. It's also, I think, the best language for experimenting. But I don't mean to exclude Perl or Ruby programmers, either. I just wish it was easier to write high quality cross-language wrappers.

What I'd really like is something like Pyrex, only with the ability to generate code for many different runtimes. Pyrex itself seems to be coming along quite nicely. As of 0.3, you can define classes in C, which seems quite useful. At some point, I might be motivated to try an experiment of making Pyrex generate, say, a Ruby extension.


David McCusker writes briefly of picoservers. These sound like fun. Basically, the problem boils down to: how do you best express asynchronous behavior in a programming language? Threads are one way, but they have lots of pitfalls, including performance and scaling issues. Event-based programming is more lightweight, but has a reputation for being very tedious and low-level. Also, event-based programming by itself can't take advantage of multiple processors.

I haven't looked at SEDA carefully yet (it's the thesis work of Matt Welsh), but it looks interesting.

People have been thinking about asynchrony for a long time. One of the more elegant approaches is Hoare's Communicating Sequential Processes (CSP). I'm a bit surprised that CSP hasn't gone further. It seems like a nice higher level abstraction compared with event-driven programming, but without all the nasty problems with race conditions and lock contention that threads bring you.

There is actually a CSP implementation in C. It seems like more typing than languages that have CSP baked-in. Occam is the most famous of these languages, but I think Limbo might be a more useful incarnation. Occam tends to be fairly static, but Limbo lets you create "threads" and channels very dynamically.

Python's generators are already sorta like coroutines. David Mertz talks about using them to implement what he calls weightless threads.

Caching in Trees

This will be another fairly technical entry on trees. It's been in the queue for a few days. The focus is on the Fitz display tree, for which memory efficiency is a major factor. Tonight we look at caching.

If you have a display tree, there are lot of things you might want to cache: intermediate renderings, bounding boxes and other geometry information, etc. For each Bezier curve, for example, you might want to cache a decomposition into triangles. The memory footprint for these cached objects might be significant in size compared to the original tree. This is why we want a cache rather than simply annotating the tree nodes with the extra info.

Mutating the tree can invalidate cached data, as well. In some cases, the relationship between the mutation and the cache is nontrivial. For example, if you change the color of a Bezier shape, you invalidate an intermediate RGBA rendering, but the triangle decomposition remains valid.

We're not going to pin down the exact representation of the tree. One in-memory object per node is the simplest way, and should work. Another approach that should work is storing a serialization of the tree in a btree-like structure. In this case, our node id is effectively a file offset to the beginning of the serialization. Thus, the id of a node can change as the tree is mutated.

Dealing with "sliding" node id's is probably too hard for clients of the tree, so we have an additional concept of "node reference", which is an in-memory object that essentially wraps a node id. When a node id moves, the tree implementation updates the corresponding node reference. This way, clients holding node references don't have to worry about them moving around.

Node reference might take dozens of bytes of RAM each, but node id's are essentially weightless. We hope that tree clients hold a relatively small number of node references, even as the size of the tree scales up.

Now we get to tonight's central design question: what should the key of our various caches be? A node id? A node reference? Something else? Here, we consider some alternatives.

Persistent id

A very common pattern in databases is to add a persistent id to each node. The value is somewhat arbitrary, but must be unique for each node in the tree. If we had persistent id's, then it would make good sense to use them as the cache keys. The problem is the extra storage cost. We're trying to keep that down to the bone.

Node id

It's tempting to use node id's (ie file offsets in the btree case) as cache keys. The problem is that if the node id moves, the cache key needs to be updated. Keeping the inverse map from node id to cache keys has nontrivial storage costs, also.

A more subtle, but important, argument against is that updating them may be very rare in some usage scenarios. Thus, there is a risk that the update machinery won't be adequately tested.

Node reference

The cache key could simply be a pointer to a node reference object. If the tree moves the corresponding node id, it updates the internals of the node reference, but the pointer remains constant.

Cache in tree

A rather different approach is to insert the cached values into the tree. The advantage is that the RAM costs can be very low (near-zero if the tree is stored on disk as a btree). The disadvantage is that computing and evicting cached values now requires traffic with the tree, with attendant performance and fragility problems.

Also, if there are multiple caches, then they'll need to be properly

multiplexed so values from the caches don't interfere with each other.

Cache in node reference

This is something of a hybrid of the above three approaches. Instead of the cache being represented as a hash table to the side of the tree, the cache entry is an extra field in the node reference. If there is a single cache (or small, bounded number of caches), then this approach is appealing. Otherwise, you have to do multiplexing as above.

I think you see most of these approaches in real systems. For one example, the Gnome Canvas includes a bounding box in all nodes, and also an SVP (sorted vector path) in all Bezier shape nodes (thus, I claim, it represents "cache in tree"). Unfortunately, it never evicts any elements from the cache, so the memory requirements can become quite painful.

For Fitz, I now think I have an answer. Most caches will use node references as keys. However, I may treat bounding boxes specially, and store them in node references. Saving a hashtable lookup may be a significant win, and it also helps that the value is of small constant size.

Also, note that navigational links (parent, first child, next sibling), which are internal to the tree implementation, can similarly be cached in the node reference. The basic rendering traversal is: given a bounding box, find all child nodes intersecting that bbox. If the nodes are in-cache, it should be possible to do this very quickly.

I'm happy with this, but still don't feel I've come to terms with change notification. I'll blog that in the next few days, and probably lose my remaining two readers.


Alan started his Korean Royal Court Martial Arts (Koong Joong Mu Sool) classes this week. It seems to be a perfect match for him - he's in much better shape.


David McCusker called on the phone, and we had a nice, wide-ranging conversation. A major theme was social networks and community. I'm looking forward to continuing the conversation.


I'm having fun with the imaging parts of the code, but I'm a bit stuck on the Ghostscript integration parts. I'll just keep at it.


David McCusker has a beautiful short piece on overcomplex technology. We should be asking ourselved, "is there a simpler way?" far more often.

In the case of trust metrics, there definitely was "goal drift". I started out trying to make a better PKI, and came up with the tmetric ideas as part of that. Since then, I've come to consider PKI too hard, and in need of a fundamental rethink. The tmetric ideas still seem sound to me, though.

davidw points out that a hardware metadata site without the trust metric would still be useful. Of course. I think a good rule of thumb is whether you want to automate important decisions. If your query is: "what kind of reliability can I expect from a JXQ-11?", then it's reasonable to wade through the trolls, spammers, etc. (if any), and you probably win it all back in overall simplicity. But part of my idea is that you'd use such data to automatically figure out orders to place electronically. Then, you really do start to care about people putting false information into the system.

209 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!