Older blog entries for graydon (starting at number 114)

hashes

depending on how you view the state of cryptographic research, the results from this week are either very good or very bad. in the short term it probably means not much; in the slightly longer term it probably means we have a lot of replacing and upgrading to do.

this incident points out two facts:

  • cryptography is an arms race and you need to keep spending money on it as long as your opponents are
  • the ability to extend, augment, or replace algorithms in the field is an important feature for a security system

there will inevitably be an increase in pointers to henson's paper. beyond the preceding two points, the paper makes a valid argument that input or algorithm randomization can help turn permanent failure cases into transient ones. however, it extends these points, I think unfairly, into an attack against the whole concept of cryptographic hash functions (CHFs). that's a mistake, and really involves a lot of glossing over of what a CHF is and why we need them:

  • difference detection is the principal task of data integrity
  • humans can see big differences but not small differences
  • the meaning of "big" and "small" can be changed, depending on the type of lens you use
  • a CHF is a lens which enlarges some differences and shrinks others
  • integrity systems should always use as many lenses as they can afford to
  • working with "no lenses" is a meaningless concept: computers produce derived images of data all the time. even loading and storing bytes from memory is a copying operation, and there is always -- even with no attackers -- a certain probability that any bit in a string will flip.
  • CHFs produce good value for the money: you spend a little bit of money on R&D, a little bit of CPU time calculating, and a little bit of disk space for storing; you get a lot of integrity. that's the equation.

I agree with the point about hash randomization, but tossing out CHFs as a concept is a serious mistake. coding theory, along with say binary search, is one of the exceedingly few sources of computers' Real Ultimate Power.

LSP and subtyping

I don't usually care about subtyping. I read a theory of objects and found it pleasantly formal, but I probably missed a lot. the issue just doesn't usually grab me.

then I read oleg's old page about many OO languages failing to satisfy the LSP, and realized how important and overlooked this critique is. the basic result is that subtyping in most OO languages these days is behaviorally wrong, and when it works it is working only by accident.

I find this remarkable!

there appear, from a further evening of digging in the literature, to be only two known ways to produce subtyping relationships which satisfy the LSP: the first statically prohibits the problem by careful construction of the type system and restrictions on the kinds of extensions possible in subtypes: the essential difference being to dispatch by type rather than by object. this is what CLU does, and amazingly the approach seems to have completely died off.

the second approach is to let the problem persist dynamically in your language, but check it at runtime using explicit pre and post conditions ("design by contract"), and combine ancestor contracts appropriately when subtyping. this is of course what Eiffel is famous for, but the only other language I see picking up on it is D. in both cases, I find no mention of the fact that the correct combination of contracts in subtyping is a pure necessity for behavioral correctness. as necessary as an array bounds check or a null pointer check -- oh wait.

looking over the pantheon of modern OO languages which fail to address this issue lends further evidence to my belief that language design is actually regressing. CLU was developed before I was born.

on preventing null:

along the way of discussing pointer idioms, I wrote that nullable pointers should be expressed as a disjoint union between nothing (null) and a pointer value. this means that a pointer, to the type system, is something which does point to a live, non-null, non-special object. if a pointer "can be null", it's not expressed as a pointer; it's expressed as a union. union {null, pointer} so to speak. this is standard in several languages, it works fine; you just need language support for disjoint unions, not C's "inclusive unions". it's really the same thing elanthis is describing, only formalized using a normal language feature. when you have a value of that type, you need to switch on its value in order to dereference it, such as:

  switch (maybenull) {
    case pointer p: p->dosomething();
    case null: donull();
   }

notice that there's only a pointer value p in scope in the switch arm corresponding to the pointer part of the union. the principal benefit to this approach is that the dereference operator is statically inapplicable to the null value. it doesn't even get a name. you know by looking at types alone when you need to write this form and when you have a live pointer. for cases where there may be a null, you're using a union of this type, and you're forced to handle it. for cases where there may not be a null, you just use a pointer. really, this is not at all novel. it already exists in many languages.

on preventing cycles:

the "transfer of ownership" idiom I described is insufficient, as elanthis pointed out. I think a more correct way to do this is to require that an owning pointer (or disjoint owning/null union) can only receive its value when it is initialized. it is then inexpressible to "assign" to that variable, merely to evict its value (transferring to another variable which is being initialized) or destroy it.

I think that restriction does the trick; you would need to initialize A before B and B before A to make a cycle. I believe there is some relationship between this approach and the "linear naming" research that chalst is doing. perhaps I'm underestimating his work though.

in any case, I mostly disagree with elanthis' position that cyclical ownership is some sort of special, unsolvable problem. it's just a delicate problem. I think it's generally understood that structural language restrictions are a tradeoff between "helping to organize your thoughts" and "binding your hands from doing useful work"; such things must be developed carefully and with attention to costs, but not treated as sacred cows.

socializing

met some really nice, slightly older and significantly wiser folks for dinner yesterday; it's always neat to talk with people who have more long-term perspective on computing trends. also spent a few days in north carolina with the red hat gang soaking up the corporate cheer. it was a bit rushed, but I had a lot of fun and met many interesting and friendly hackers. next up: gcc summit!

D

I wanted to lend a moment's support to the D language, which is sensibly crafted and sticks to pragmatics rather than dogma. pragmatics are important. I'm especially happy to see it promoting DBC and formalized support for unit testing.

languages in general

I've been tinkering with some ideas for programming languages recently. not necessarily "new languages", but aspects of programming which I feel ought to be expressable in language. there are two issues in particular I'd very much like to see formalized into language constructs:

  1. "design by accounting" (in a sense analogous to DBC): the notion that the signature of a function includes resource use accounting; typically in terms of memory and time, though conceivably also in terms of user-defined costs. the notion being that a function has a certain cost which is the sum of the costs of its sub-expressions, with loops costing some bounded-above multiple of the cost of their body. costs would be calculated as much as possible by the compiler, and statically verified as much as possible. analogous to other forms of mixed static/dynamic checking, those cost checks which could not be statically verified (say at module boundaries, i/o operations, or hard-to-analyze loop heads) would be left in the program, making "withdrawls" from a runtime-managed cost center, with underflow causing an exception. the technology for this is just a redeployment of conventional value range propagation and bounds checking.

  2. an "ownership tree". this came up last night too. every time I get to discussing modern languages, recently, an uncomfortable fact comes up: I don't like garbage collection. I don't like it because it encourages the programmer to not think about ownership, so they build systems which slowly leak space, by accidental accumulation of dangling references. I don't claim that C's "explicit free" model is any better mind you; it makes you think about ownership but gives you no vocabulary with which to express your thoughts, so you wind up acting on mistaken assumptions (double free, forgotten free).

    I think the solution to this is to differentiate, at a low level, between owning and non-owning (weak) pointers. you should get an owning pointer from an allocation, and each allocation should (until it dies) have exactly 1 owning pointer in the program. you enforce this by prohibiting direct copying of owning pointers, but make 3 variants of the concept of "copying an owning pointer": one which clones the pointee into a new owning pointer, one which forms a weak pointer, and one which forms a non-storable reference (for use as a function argument). if you want an owning pointer you can move around (and possibly set to null), you make a discriminated union of "nothing" and an owning pointer (like haskell's Maybe 'a or ML's 'a option), and add a language construct which takes two of these values and transfers an owning pointer between them atomically. passing an owning pointer as an argument -- another form of "making a copy" -- is similarly prohibited.

    in other words, the rules of the language should make cyclical ownership, null pointers, and double-ownership inexpressible.

    this strikes some people as draconian, but I claim that it is nothing more than the extension of stack discipline from control into data. we already enforce, in many languages, that all control structures must be expressed as a tree of function calls, rather than an arbitrary graph of gotos. at any time, there is a single, obvious stack of activation frames from main() to your current context. this simplifies reasoning about control; imagine debugging without stack traces! all I'm saying is that we should force all data structures to express their ownership as a proper tree, rather than an arbitrary graph of pointers.

swing

spent some weeks digging through the repaint, double buffer management, clipping, and paint coalescing sections of swing. I think I have it mostly right now, but then cairo went and changed underfoot. alas.

monotone

I released a version of monotone with support for human readable version names and win32. two more boxes ticked off the list.

thermostat

a year ago we bought an extra athlon as a workhorse for compute-heavy jobs. frances needed it mostly, but I use it sometimes for big builds or tests. it turns out the athlon was much faster than the one we meant to buy (accidentally got the wrong part) and came with a very bad fan. this gave it the unfortunate habit of critically overheating and shutting down.

three things were done to fix this:

  1. a new heat sink made of copper.
  2. a program called athcool which twiddled the necessary registers to have the chip actually enter a low power state when the OS runs HLT. this is not the default. if you have an athlon, you probably want this program to run at boot.
  3. a homebrew thermostat consisting of the lm_sensors module, a gkrellm alarm, and a shell script. the sensors monitor temperature and the alarm trips after a certain temperature is reached. this runs the shell script, which pulls out the top process id on the system, sends it SIGSTOP, and schedules an 'at' job for 5 minutes in the future to send SIGCONT.
components and text

on a more abstract, cultural note, I'd like to expand on something I've said here before, about programs.

I think the idea of a "software component" is a wrong and damaging distraction from a fundamental fact: program text is the ultimate component technology.

the text of free programs can be read, indexed, searched, edited, copied and pasted, translated from one language to another, paraphrased, commented, printed out for posterity, machine analyzed and machine transformed. that is far more than any other "component" technology (COM, .NET, java beans, CORBA) has ever permitted, and more I suspect than any ever shall.

after speech, text is the deepest, oldest, and most powerful human language technology. free software has grown strong because it treats text as an ally. a new user has a ton of code to read. a ton of examples. command interpreters which respond to single words. tools which speak and can be spoken to. when you want to learn how a system works in free software, you can read through it, edit it, perturb it, tinker directly with the text. if you need to find something, you run grep, use TAGS, ask google, use LXR or global. when someone wants to explain something they can mail you code. you can buy a book on algorithms and see how things are written.

the web too has flourished due to treating text as an ally: page source can be viewed, rendering happens on the fly, markup is minimal and mixed in with plain written text codes. screen scraping may be embarassingly "low tech", but think honestly about some of the wild, out-there programs which people have actually got working using the web's text, and imagine trying to coax a distributed OO system to do the same thing. imagine even agreeing on the interface. it would never happen.

I think arguments about programming language are of trivial importance compared to the argument about text vs. black-boxes. we should always remember the primacy of text. you can translate between the text of different languages (or auto-stub, or what have you) but you're doomed if you have no text.

a few months pass, and many things have happened.

monotone

I switched monotone from a temporal transport model (log replay) to a stateless hashed-index synchronization model. this was an interesting trip into the wilderness of implementing wire protocols. curiously, getting the protocol's shutdown sequence right seemed to be one of the most complicated parts. anyways, now any monotone client is a server, and you don't need to have any shared history with anyone to do a symmetric or asymmetric (push/pull) sync. it seems pretty cool.

I also internationalized it. so now it does localized messages, local filenames, file character set conversions, international branch and tag names, international dns, etc. this was actually kind of fun work; it was especially amusing when it suddenly started returning system error messages to me in german! I can't say I like IDNA much, but it seems to be the consensus. oh well.

free swing

work proceeds reasonably well. we have some other people at work helping on it now, and we have a bunch of widgets working and the ability to make some trivial buttons-and-sliders sort of programs which really work. of course there's still much to do, but things seem to be speeding up. I think we're over the hump.

language wars

as a long time language geek, I can't help but put in a word about the recent threads over language preferences. this is of course a mixture of business and personal thinking, but in the spirit of honesty I figure I ought to lay out my beliefs.

C# and java strike me as a language-war-in-progress between big, old, dangerous proprietary firms. these companies are not blushing virgins; they have a history of fighting dirty fights. we, the free software community in general, should make a conscious policy to stay away from that war. it is one thing to ship cloned toolchains if our customers want them -- how can you say no to a paying customer? -- but we should not write central or critical code in these languages until well after the war is over. that may take decades, it may involve a lot of splintering and pain. it may even involve writing off some huge, tasty programs previously assumed to be "donated" to the community. yes, it sucks. that's the proprietary game and we're all trying to get out of that game. choosing sides won't make it any less painful. we need to abstain.

havoc asks for a compelling high level language platform, and despite my preference for more exotic languages (ocaml, haskell, lisp, ada, eiffel) I have to ultimately rest on the pragmatics I see before me: the high level free software platform in 2004 is C++ and python.

I suspect that high level really means two things these days: lots and lots of libraries, and abstract interfaces which don't involve pointer twiddling. C++ and python meet the bill here. C++ didn't used to, say in 1994, but it's gotten a lot better in 10 years. we have a good compiler for it now (g++ soon-to-be-3.4) and lots of free libraries. if you're careful you can program in it with nary a "*" in your code.

here is my selection; if you haven't read these libraries or used them, honestly, give them a try:

monotone is self hosting now. I massaged it into playing nice with other network protocols, so now it has this HTTP/CGI server type as well as the netnews one. that's good because people can just set up private depots if they have a web server account, and don't have to bother finding a news server.

I feel like this is a nice milestone. a couple years ago I needed a good free distributed version control tool, and now I have one.

I've been spending the past couple months working on native java2d on cairo. it's kinda fun, a bit of a break from low level stuff. brings back memories of berlin only without the corba and windowing system business, and with a garbage collector. of course, libgcj takes just about as long to link on my modern pentium 4 as berlin did on my "modern" pentium 1, so I guess some things never change.

some things even seem to get worse!

coincidence

lkcl writes that finite group theory can be used to model intelligence and invention; I wasn't aware we had such a simple model for it yet. would you care to share some more of the maths from whence this conclusion (simultineity of invention) "pops out"? perhaps a link to a paper or such?

testing tools

movement asks if free software testing tools are "wild and wooly". I would say some, but not all. I think the unit test things are as good as they need to be; unit testing is more about practise than technology anyways. there's a *Unit library for just about everything now, which is nice.

high-level supervisory and tracking stuff -- say aegis or bugzilla -- is pretty good. maybe if I'm lucky someday I'll be able to say that about monotone too. I think we all know that test cases are the core of the "science" which makes up computer science; the muck you have to slog through to make your software strong against reality.

the mid-level supervisory stuff, however, is weak. dejagnu is Not My Idea of Good (sorry rob) and most people seem content to just cobble together shell or python scripts when they're fed up with expect. testing tools which do automatic breadth or boundary calculations, which estimate and control test-run time, or which do much in the way of environment control and capture, seem disappointingly rare.

one thing which bothers me about test infrastructure is that the posix 1003.3 output format doesn't require (or doesn't often seem to be interpreted as requiring) unique identifiers for each assertion tested. as a result, people wind up measuring not the set of tests passed or failed, but the number. this is wrong in an important way: if I make some fixes and go from 438 failures to 245, it is not at all clear whether I am "193 tests better off" or whether I've fixed 438 and regressed on 245. constructing a QA system on this sort of measurement will be broken by design. it's a bit frustrating. the proper regression relationship is subset, and for that tests need identity.

option processing

I read in innumeracy that people generally mis-perceive coincidence, and that the expected frequency of coincidences (amongst all possible things that may happen to you every day) is pretty high. that may be, but it's still creepy. one day, I'm sitting on a go bus sketching out a design for a C++ option processing library. next day, it gets mentionned here. then I open up gnus and boost announces a formal review period for an option processing library. jeepers. I can feel the axons of the pan-network subcortex creeping into my head just thinking about it.

29 Apr 2003 (updated 29 Apr 2003 at 18:34 UTC) »
lexical reproduction

I have an ongoing thesis at the moment which I'm exploring in the programming language literature, which is that programming language features do well (all other things being equal) when they eliminate either distant or dynamic state and replace it with either close or lexical state. the underlying point being that we may favour language features that facilitate copying and modifying small bits of code -- fragments which work in their new context -- as a fundamental programming activity.

exceptions

so along these lines I was thinking about exception handling mechanisms the other day. exceptions are a dynamic control transfer system, or else they are an "error recovery" system, depending on who you ask. I've been browsing around LtU's recent link to crash-only software, the restart system in common lisp, and the checked-vs-unchecked debate in java.

the static checking part really got me curious, because there's a clearly dynamic bit of state that the java designers have tried to patch into the lexical structure of a program, and have mostly failed. nobody likes checked exceptions because the signature of a method needs an exception specification covering the union of all possible dynamic contexts it will find itself in, which is almost always just "Exception". so there's no point.

I think what lexical exception checking -- if you believe in it at all -- ought to do is specify which exceptions will not be thrown from a method. this would at least let the compiler tell you when a particular catch clause is dead code -- probably an error on your part -- and erase it, along with the associated try context setup. that's something, but not much; you may still fail to catch the "right" exception. but at least it works with an open set of possible thrown exceptions, which is the practical fact, most of the time.

but I wonder if this thinking could be combined with crash-only design, so that you have only one throwing-like statement called crash and a conservative type of signature which says whether your method confines crashes (those without the indicator are assumed to propagate crashes). the confining ones can be checked by a compiler; the "confines crashes" bit can be propagated based on static knowledge between caller and callee, so a method which calls only crash-confining methods (and doesn't do any dangerous pointer fiddling or crashable arithmetic) is itself crash-confining; finding the transfer point during a crash is easier; and the call setup would (I think) be a little shorter on crash handling and crash propagating code. you'd still need destructors, of course, and for diagnostic sake you'd still want to collect stack traces and error descriptors. but I think that's a separate issue: using exceptions as cheap debugging vs. using exceptions to help a program keep running in the face of errors are two different things.

unfortunately I don't think you can quite implement this scheme in java or C++ as it stands. they both treat exceptions a little wrong; java will not statically check for handling of unchecked exceptions, hence the name, and C++ ignores (in a horribly fatal way) everything undeclared, unless you declare nothing. both are a bit broken. ah well.

diffs

pfremy writes that diff's output is unintuitive. this may be the case, but his explanation of the two algorithms used is not right.

diff uses a myers LCS method, a careful dynamic programming construct. it is not the fastest known variant, but it is O(ND) time, where D is the size of the minimal edit script, and it finds that edit script. it also has a very important property that you can implement a version which eats only linear space.

python's difflib uses a "junk-sensitive" recursive greedy matcher (documented here) which is apparantly similar in spirit to the Ratcliff-Oberhelp "gestalt pattern matching" algorithm, but with more desirable running time (R-O is quite awful). it does not find minimal edit scripts, and it might -- or might not -- run at competitive speed and space with myers stuff. I can't find any clear analysis.

anyways, there is a lot of literature on string matching. it's one of the huge problems you can lose the rest of your life in. it is certainly possible to improve diff -- I don't want to put you off the cause -- but be sure to do some large tests and background reading before discounting the decisions made in implementing diff.

105 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!