29 Apr 2003 graydon   » (Master)

lexical reproduction

I have an ongoing thesis at the moment which I'm exploring in the programming language literature, which is that programming language features do well (all other things being equal) when they eliminate either distant or dynamic state and replace it with either close or lexical state. the underlying point being that we may favour language features that facilitate copying and modifying small bits of code -- fragments which work in their new context -- as a fundamental programming activity.

exceptions

so along these lines I was thinking about exception handling mechanisms the other day. exceptions are a dynamic control transfer system, or else they are an "error recovery" system, depending on who you ask. I've been browsing around LtU's recent link to crash-only software, the restart system in common lisp, and the checked-vs-unchecked debate in java.

the static checking part really got me curious, because there's a clearly dynamic bit of state that the java designers have tried to patch into the lexical structure of a program, and have mostly failed. nobody likes checked exceptions because the signature of a method needs an exception specification covering the union of all possible dynamic contexts it will find itself in, which is almost always just "Exception". so there's no point.

I think what lexical exception checking -- if you believe in it at all -- ought to do is specify which exceptions will not be thrown from a method. this would at least let the compiler tell you when a particular catch clause is dead code -- probably an error on your part -- and erase it, along with the associated try context setup. that's something, but not much; you may still fail to catch the "right" exception. but at least it works with an open set of possible thrown exceptions, which is the practical fact, most of the time.

but I wonder if this thinking could be combined with crash-only design, so that you have only one throwing-like statement called crash and a conservative type of signature which says whether your method confines crashes (those without the indicator are assumed to propagate crashes). the confining ones can be checked by a compiler; the "confines crashes" bit can be propagated based on static knowledge between caller and callee, so a method which calls only crash-confining methods (and doesn't do any dangerous pointer fiddling or crashable arithmetic) is itself crash-confining; finding the transfer point during a crash is easier; and the call setup would (I think) be a little shorter on crash handling and crash propagating code. you'd still need destructors, of course, and for diagnostic sake you'd still want to collect stack traces and error descriptors. but I think that's a separate issue: using exceptions as cheap debugging vs. using exceptions to help a program keep running in the face of errors are two different things.

unfortunately I don't think you can quite implement this scheme in java or C++ as it stands. they both treat exceptions a little wrong; java will not statically check for handling of unchecked exceptions, hence the name, and C++ ignores (in a horribly fatal way) everything undeclared, unless you declare nothing. both are a bit broken. ah well.

diffs

pfremy writes that diff's output is unintuitive. this may be the case, but his explanation of the two algorithms used is not right.

diff uses a myers LCS method, a careful dynamic programming construct. it is not the fastest known variant, but it is O(ND) time, where D is the size of the minimal edit script, and it finds that edit script. it also has a very important property that you can implement a version which eats only linear space.

python's difflib uses a "junk-sensitive" recursive greedy matcher (documented here) which is apparantly similar in spirit to the Ratcliff-Oberhelp "gestalt pattern matching" algorithm, but with more desirable running time (R-O is quite awful). it does not find minimal edit scripts, and it might -- or might not -- run at competitive speed and space with myers stuff. I can't find any clear analysis.

anyways, there is a lot of literature on string matching. it's one of the huge problems you can lose the rest of your life in. it is certainly possible to improve diff -- I don't want to put you off the cause -- but be sure to do some large tests and background reading before discounting the decisions made in implementing diff.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!