Older blog entries for crhodes (starting at number 160)

For posterity, and to postpone my deletion from planet.lisp a few more months... I was asked by Faré about whether it is possible to speed up the first calls to some user-defined generic functions in SBCL.

To understand that request, first we need to understand the implementation strategy for the discriminating function in SBCL's metaobject protocol. The discriminating function is responsible for computing the set of methods which are applicable given the arguments to the generic function, determining the effective method (the actual code to be run given the generic function's method combination), and then running that effective method.

One possible strategy for implementation is to do nothing at all in advance – simply at each call of the generic function, call compute-applicable-methods to find the ordered set of applicable methods, apply method combination to that set to generate an effective method form, compile it, and then call the resulting function with the argument list and the methods list. This is about as slow as it sounds: while it is correct, since it is typical for generic functions to be called more than once with the same classes of arguments, there is wasted effort in this strategy from repeating the same computations over and over again.

Fortunately, it is possible to do better. Quite apart from some special cases (slot accessors, predicates, single methods – documented in the SBCL Internals manual), if the generic function has the same methods, the result of compute-applicable-methods on arguments of the same classes will be the same (hence compute-applicable-methods-using-classes), and if it also has an unchanged method combination the effective method form and function will be unchanged. This suggests cacheing the result of computing the effective method, attempting a lookup based on the classes of the arguments before the slow path, and invalidating the cache if things change (e.g. adding or removing methods to the generic function or a change in the class hierarchy).

All the above is in “Efficient Method Dispatch in PCL” (Kizcales & Rodruigez, 1990); while there are more details in SBCL's implementation (including an alternate strategy for cacheing based on type dispatch rather than class hash codes) that paper is recommended reading for anyone wanting to understand how to get tolerable function call speed in generic-function-based environments. But there remains the problem of the initial state of the discriminating function: what should it be? The natural choice is of an empty cache, so that the cache gets filled based on what effective methods are actually used, but that means that there can be a substantial amount of work to do at startup to warm up all the caches for all the generic functions in a system. In fact, in the days when I used to develop and deploy CLIM-based applications, this was so noticeable that the deployment script I used started the application up, exited it and scrubbed the application state before dumping the memory image, precisely so that the generic function caches had relevant entries, meaning that application startup was much faster. How much faster? To get some kind of sense, let's look at an example:

(defgeneric foo (x)
  (:method ((x string)) (concatenate 'string "x" x))
  (:method ((x integer)) (1+ x))
  (:method ((x pathname)) (pathname-type x))
  (:method ((x generic-function)) (sb-mop:generic-function-name x)))

(time (foo 3))     ; 239,230 processor cycles
(time (foo 3))     ; 223,855 processor cycles
(time (foo 3))     ; 73,925 processor cycles
(time (foo 3))     ; 4,100 processor cycles
(time (foo #'foo)) ; 176,225 processor cycles
(time (foo #'foo)) ; 5,340 processor cycles
(time (foo "x"))   ; 103,090 processor cycles
(time (foo "x"))   ; 9,645 processor cycles
(time (foo #p""))  ; 136,780 processor cycles
(time (foo #p""))  ; 6,560 processor cycles

Eyeballing these (and knowing something more about the details of the implementation, which always helps), we can estimate the overhead as being about 400,000 processor cycles plus about 100,000 per distinct concrete class passed as an argument: call it about 0.5ms total per generic function based on a 1GHz processor. It doesn't take too many generic functions with empty caches called in sequence (as in protocol-heavy frameworks like CLIM) to make this startup delay noticeable.

What if it isn't possible to run the application beforehand? Well, it is possible to fill caches by hand. Here's one way to do it:

(defun precompile-gf (gf)
  (let* ((lock (sb-pcl::gf-lock gf)))
    (setf (sb-pcl::gf-precompute-dfun-and-emf-p (sb-pcl::gf-arg-info gf)) t)
    (let ((classes-list (mapcar (lambda (x) (sb-mop:method-specializers x))
                                (sb-mop:generic-function-methods gf))))
      (multiple-value-bind (dfun cache info)
          (sb-pcl::make-final-dfun-internal gf classes-list)
        (sb-thread::call-with-recursive-system-lock ; or -SPINLOCK
         (lambda () 
           (sb-pcl::set-dfun gf dfun cache info)
           (sb-mop:set-funcallable-instance-function gf dfun))

This computes the set of classes directly named in the method specializers, and pre-fills the cache with entries corresponding to direct instances of those classes: as long as no subsequent changes occur, every call involving direct instances will be a cache hit and there will be no expensive recomputations:

(fmakunbound 'foo)
(defgeneric foo (x)
  (:method ((x string)) (concatenate 'string "x" x))
  (:method ((x integer)) (1+ x))
  (:method ((x pathname)) (pathname-type x))
  (:method ((x generic-function)) (sb-mop:generic-function-name x)))
(precompile-gf #'foo)

(time (foo #p""))  ; 5,995 processor cycles
(time (foo #p""))  ; 5,940 processor cycles
(time (foo 3))     ; 173,955 processor cycles
(time (foo "x"))   ; 81,945 processor cycles
(time (foo #'foo)) ; 148,435 processor cycles

The “direct instances” restriction there is a strong restriction: if the hierarchy is based around protocol classes (used in specializers) and implementation classes (used as concrete classes for instantiation) then the initial cache filling will be useless (as in the case above: in SBCL, 3 is a direct instance of the (implementation-specific) FIXNUM class.

What's the second try, then? If the class hierarchy is not too deep and the generic functions don't have too many multiple-argument specializers, then pre-filling caches with all possible class argument combinations might not be totally prohibitive:

(defun class-subclasses (class)
  (let ((ds (copy-list (sb-mop:class-direct-subclasses class))))
    (if (null ds)
        (list class)
        (append (list class) 
                (remove-duplicates (mapcan #'class-subclasses ds))))))

(defun class-product (list)
  (if (null list)
      (list nil)
      (let ((first (class-subclasses (car list)))
            (rest (class-product (cdr list)))
        (dolist (c first result)
          (setf result (nconc (mapcar (lambda (r) (cons c r)) rest) result))))))

(defun exhaustive-class-list (gf)
  (let ((specs (mapcar #'sb-mop:method-specializers 
                       (sb-mop:generic-function-methods gf))))
    (remove-duplicates (mapcan #'class-product specs) :test #'equal)))

(defun precompile-gf-harder (gf)
  (let* ((lock (sb-pcl::gf-lock gf)))
    (setf (sb-pcl::gf-precompute-dfun-and-emf-p (sb-pcl::gf-arg-info gf)) t)
    (let ((classes-list (exhaustive-class-list gf)))
      (multiple-value-bind (dfun cache info)
          (sb-pcl::make-final-dfun-internal gf classes-list)
        (sb-thread::call-with-recursive-system-lock ; or -SPINLOCK
         (lambda () 
           (sb-pcl::set-dfun gf dfun cache info)
           (sb-mop:set-funcallable-instance-function gf dfun))

(fmakunbound 'foo)
(defgeneric foo (x)
  (:method ((x string)) (concatenate 'string "x" x))
  (:method ((x integer)) (1+ x))
  (:method ((x pathname)) (pathname-type x))
  (:method ((x generic-function)) (sb-mop:generic-function-name x)))
(precompile-gf-harder #'foo)

(time (foo 3))     ; 4,510 processor cycles
(time (foo #p""))  ; 6,060 processor cycles
(time (foo "x"))   ; 10,685 processor cycles
(time (foo #'foo)) ; 5,985 processor cycles

So, nirvana? Well, what happens if the class hierarchy is not so friendly, and the exhaustive classes list is too exhausting? Stay tuned for the next thrilling episode...

A most excellent day. I mistimed my morning, and had to miss breakfast in order to arrive in time to catch the ferry to Preko (a village on the island opposite Zadar) with a small but perfectly-formed band of intrepid explorers: Didier Verna, David Johnson-Davies, Alessio Stalla and Nils Bertschinger. Given the attractive description of the castle of St Michael as a "telecommunications station", with my employment, how could I not attempt the walk. And so we walked through the sunshine to the top of the mountain, to see what we could see (the other side of the mountain, it turns out, but also more islands, more sea, more sun...). Some fun discussions, about the more esoteric parts of various lisp dialects, machine learning, method-combination-based hacks, the nature of objects, and so on (quick riddle: what are the possible fixed points of type-of?) Then, as we were sitting down to yet another Southern European lunch, what should happen but an extremely localized rain shower?! How odd.

Then it was back to reality; it was time to face the fact that I was going to have to leave the sunny, sleepy Croatian coast to return to rainy, cold, difficult real life. (It has its compensations...) A very good conference; a big thank you to Franjo Pehar and Damir Kero, and the University of Zadar, for being such excellent hosts; it's been a great three days.

Happy Protest Day.

Day two of ELS 2012 began with war stories from Ernst van Waning on his work as a consultant for SRI (his talk should of course not be confused with his employer's views; Pascal Costanza's opinions from yesterday were similarly disclaimed) on their AURA project. We had a confusing video-in-video demonstration, which is perhaps taking the conservative approach to the Demo Effect a little bit too far. A point that might be worth mentioning again is that Lisp is not immune to memory leaks; in the KM (warning: 1990s-compatible website) knowledge management software, many interned symbols get generated during the course of a query and do not get removed later. We've encountered this in SBCL in the past; for example, in the SBCL build itself, towards the end of the process, we compile and load PCL, then in a fresh bootstrap image just load the fasls before dumping the final image; this is so that many internal symbols created by reading the code don't end up in the final image. (Though something that has been suggested a few times is that packages should intern their symbols in a hash table which is weak: dear cleverweb, is there any way in which this can be detected? Think like pfdietz).

After coffee, and surprisingly packaged filled croissants, Marco Benelli gave us his experiences of using Scheme (specifically Gambit-C) in industrial automation. There were some interesting constraints – the approach had to support uClinux as well as more full-fat distributions, and... exotic architectures like sh. After that came Gunnar Völkel, talking about Algorithm Engineering with Clojure ("Algorithm Engineering" here seemed to mean the cycle of algorithm design, analysis, implementation and evaluation); the start of the talk was about implementing tracing and profiling, using a name-based registry before function definition to specify interceptions which wrap function implementation, which worried me because it seemed like a description of what should already exist (the lack of documented support for advice/fwrappers in SBCL notwithstanding). After that, we had a description of their team's Experiment Definition Language, used to generate code in an Experiment Setup Language, which then performs a whole experimental run (of the order of weeks) for various different parameters. I'm not convinced about the composability of the interception design; one issue is that since it overrides the defining forms, it is automatically incompatible with any similar extension (just as SERIES is incompatible with SCREAMER and Lexicons: each of them wants to own defun) – another is that, because the interceptions modify the source code, there's no sensible ordering: one of the example functions was both traced and profiled, which means that either the profiling code becomes traced (where the user is probably not interested in the execution of the otherwise-invisible profiling code) or the tracing code becomes profiled (which detracts from the utility of the profiling data).

One more long lunch break later, and we're into the afternoon iteration: Irène Durand talking about enumerators, and Alessio Stalla about do+, both dealing with ways of structuring iteration. Irène's taxonomy of enumerators might bear closer attention, while Alessio's "I don't hate loop" polemic against the use of a code walker in an iteration construct (iterate) seemed to be a totally reasonable point. There was some interesting probing of the limitations as well as the extensibility of the design – Pascal Costanza brought up the fact that loop allows e.g. if foo collect bar into quux else nconc baz into quux (forgive the attack of metasyntosis), while this variety of accumulation function into the same accumulator appears not possible in the do+ design.

Lightning talks: a virtual filesystem based on queries; return-with-dynamic-extent; HIV as recursive immune system process; pathnames (gah! pathnames!); high-performance network appliances; interoperability choices; and homoiconicity (Didier gives highly-engaging lightning talks: "musical notation is Tuning-complete"). Then announcements (a big thanks to Marco Antoniotti, Franjo Penar and his local team), dinner, bed.

ELS 2012 Liveblogging! Well, I'm constrained to one diary entry a day, so maybe it's a bit of a stretch to claim that I've joined the socially network world, but baby steps...

First impression: Zadar is really quite pretty. Shiny white stone, clean, old buildings, seafront. My first impression is probably coloured by the fact that when I left from London, the weather was 4°C and pouring with rain – and I emerged from the plane into 25°C heat and a blue sky. I confess, I even had a little nap near the Sea Organ while waiting for the evening meeting and welcome reception; at that reception, let us just say that a good amount of Maraschino and another good amount of the local beer was consumed, both in good company. Also, it's asparagus season; yum.

This morning, after a generous welcome from Marco Antoniotti, this year's programme chair, Juanjo García Ripoll gave a very interesting overview of ECL and its history, and made some good points about its design philosophy. The key argument is probably that designing ECL for embeddability adds options, rather than being a limitation; he made a plausible case for those things which are currently lost compared with more traditional implementations – particularly, image saving – are reimplementable, at least up to a point. Juanjo also listed a number of good improvements in ECL since the last time: Unicode support, multithreading, improved CLOS and MOP support, and plenty of other things.

After a good long coffee break, we had the first paper sessions: first, a presentation of Climb (no website yet, apparently), an image processing toolkit developed by Laurent Senta with Didier Verna; some interesting stuff in there, even if the dreaded Demo Effect came along. A particularly neat-looking demo of a (prototype) visual environment for chaining processing tasks; performance is a bit more of a hot topic (read: not yet implemented), both in terms of parallelizing individual operations and (I think) in terms of compiling networks of processing tasks to minimize redundant computation. After that, Giovanni Anzani gave an autocad-based talk on calculating and visualizing optimal (for some value of "optimal"; sufficient for architecture, anyway) points of intersection of incommensurate measurements. Again, a pretty nifty demo, this time within AutoCAD using AutoLisp; somewhat surprisingly, it seems that there is no matrix manipulation library support within AutoLisp. (I think I need to read the paper for this work, to understand exactly what the problem the method presented is aiming to solve).

One lesson in Southern European lunchtimes later (even longer than academic lunchtimes!) we were into the second session, starting with Alessio Stalla talking about ABCL and its interoperability with Java. I got a shout-out, because in amongst the various integrations of ABCL with its JVM host was a note that the sequences in ABCL support the extensible sequence protocol that I proposed in 2007; the example given was of using instances of Java java.util.List class instances as Lisp sequences, directly. The demo effect struck again; instead of launching slime, a button in the modified Java web framework made the compiler enter an infinite compiling loop. Bad luck. (Demoing things is a particular nightmare, I know; the trick as far as I have managed to formalize it is to leave as little as possible to chance: this includes even typing, unless you're very confident: use short file or variable names, define key bindings or keyboard macros, or write scripts to do things for you.) Nils Bertschinger talked about probabilistic programming in Clojure: implementing Metropolis-Hastings sampling of program paths with given probabilities, and consequently allowing conditioning on some program choice points and Bayesian inference on the hidden parameters. It looks interesting, but the killer feature of Clojure (immutable data structures, for cheap undo) might also be the cause of a performance problem. Still, looks interesting. The demo worked.

Pascal Costanza rounded out the day's schedule with his discussion on reflection in Lisp and elsewhere, talking about fexprs, 3-lisp and macros through to metaobject protocols. Unfortunately, as a regular attendee at Lisp events, I've seen much of it before; it's still interesting, but maybe I need to get out more. To dinner!

I got interviewed a couple of weeks ago. (If you're reading this on Planet Lisp, you probably already know this). I had a quick update to one of the points I made, but failed to write it down anywhere and have since forgotten. So, instead, a somewhat delayed and probably more dull update. (What, more dull than the delayed response of someone to their own interview? Well, be ready to be amazed. But I can't help but feel that the additional information was that I had forgotten some whole class of Lisp programming that I actually do, which is a bit embarrassing in an interview in a series titled "Lisp Hackers"...).

Maybe the first thing was to dip my toes back into SBCL maintenance: some nice if minor fixes from me this cycle: one was a simultaneous bug fix and optimization to modular arithmetic, motivated by pfdietz's resurfacing and running of his random form tester, which inevitably revealed that we have been slack in the last five years or so (where does the time go?); the other was a fix to the powerpc implementation of ldb, which broke the build after the previous fix. All sorted out now, phew. (And there's lots of other stuff that's gone in this month, unlike the previous "month" which sort of rolled on for the best part of three months, so it's probably worth testing).

But onward, to my desire to learn a bit more about Emacs Lisp. I've used emacs for many a year – indeed, the interview reminded me that I learnt Lisp by being given a difficult problem to work on, instructions on how to start XEmacs, and time to read USENET – but have never considered myself a real Emacs User; my ~/.emacs is so tiny, I daren't show it to the world for fear that I will lose all my remaining Lisp Hacker credibility. Acting with the view that the best way to learn is to do, shortly after starting to use EMMS as a media player, I've implemented support for DISCNUMBER metadata (this matters if, like me, you have a large number of multiple-disc sets). I've also revived (again) SWANKR, putting it up on github also, since I have started getting patches; I look forward to exploring this "social coding" idea. And I have also written a hacky but just-about-usable interface to the BBC iPlayer (using the excellent but fairly user-hostile get_iplayer utility) – particularly pleasing to my family now that digital switchover has reached London and I no longer possess equipment capable of receiving the UK television signal.

I now return to the Teclo vortex. But I am going to Zadar for the European Lisp Symposium; I hope to see some old and new faces there!

As I said in my last entry, I was in Amsterdam for ECLM 2011, once again smoothly organized by Edi Weitz and Arthur Lemmens, but this time under the aegis of the Stichting Common Lisp Foundation (of which more a bit later). After leaving the comfortable café, where Luke and Tobias (along with a backpack's worth of computing equipment on its way to visit St Petersburg) eventually turned up, it was time to go for the Saturday evening dinner, held at Brasserie Harkema. In the olden days, when I had time to do a certain amount of public-facing Lisp development, I got used to receiving the adulation of a grateful public – this time, at the dinner, I happened to sit next to someone called Lars from Netfonds. “Hmm,” said something at the back of my mind, ”that rings a bell.” Lars who? Lars Magne Ingebrigtsen. My inner fanboy went a bit squeee – even to the point of explaining what gmane was to a third party in his presence. Still, it was nice to be able to say a heartfelt “thank you” in person to someone whose software has saved me time and a certain amount of embarrassment. Other topics of conversation at the dinner included a discussion with R. Matthew Emerson (of Clozure) about the social aspects of Free Lisp development, a topic on which I have written before; contrasting the attitudes and experiences of contributors and users (small and large) of Clozure CL and SBCL was interesting. It was also nice to be able to talk about Lisp-based music analysis, synthesis and generation programs; reminding myself that I do still know about that landscape enough to fill people in.

The meeting itself, as others have observed over the years, is only partly about the talks: a substantial part of the goodness is in the chats over coffee and lunch. Edi and I reminisced about meeting in the venue, Hotel Arena, at a precursor to ECLM (in autumn 2004, I think... I certainly remember being approximately penniless, just after starting my first job); other people present then (as well as Arthur) included Nick Levine, Luke Gorrie, Peter van Eynde, Jim Newton, Pascal Costanza, Marc Battyani, Nicholas Neuss... many of whom were around for the rematch; a total of 95 people registered for the meeting, and the hall (part disco, part church) for the talks felt pleasantly full.

Of the talks, I was most interested in the material of Jack Harper's talk, concerning some of the constraints involved in building a product for (human) fingerprinting, and asserting that using Lisp in this product was not a problem. (Favourite quote: “batteries are complicated things”). I was a little bit disappointed that few of the speakers actually interacted with any code at all (Luke may claim that writing his slides in Squeak Smalltalk counts, but I beg to differ); in fact, Paul Miller of Xanalys was the only one of the speakers spending substantial time demonstrating anything related to the subject of the talk – and that only because the canned demo movie refused to display on the projector. Luke's talk appeared to go down well; the obvious first question came and went, and there were some more interesting questions from the floor. Star of the show was Zach Beane's talk about quicklisp; I spend a lot of time presenting or watching presentations in each of my capacities, and it's nice to have a refreshingly different (and deadpan) delivery, with good use of slides to complement the spoken content. I hope that he's right that his personal scalability will not be taxed, and that volunteers will find ways to assist in the project by taking ownership of particular tasks.

While Hans Hübner may have attempted to be controversial in his opinion slot about style guides for CL, the real controversy for me was Dave Cooper's announcement of the Stichting Common Lisp Foundation. Now, the Foundation has clearly done one thing that is helpful: provided legal and financial infrastructure so that the financial risk of hosting an ECLM is not borne entirely by two individuals; the corporate entity can potentially, after acquiring a buffer, provide the seed funding needed and, if necessary, absorb small ECLM losses (not that I believe there has been one, but hypothetically) through other fund-raising activities. On the other hand, when I asked the question as to how the Stichting CL Foundation would aim to distinguish itself from the ALU, the response from Dave Cooper was that the only difference would be that the foundation would focus on CL, where the ALU's remit extends to all members of the Lisp family. Such a narrowing of focus is, I think, potentially beneficial – indeed, when going through my email archives to look for the date of the 2004 meeting, I found a lucid rationale from Dan Barlow explaining that he had chosen to make CLiki's focus specifically DFSG-free Unix Lisp software in order to promote a sense of cohesion (rather than being motivated primarily by a strongly-held belief about the inherent superiority of DFSG-licensed software). But I don't think that the ALU's only weakness is that it spreads its Lisp net too wide: I think it has lost track of what it as an entity wants to do beyond perform a similar function for the ILC as Stichting has performed for the ECLM; Nick Levine, in his talk about how to find Lisp resources, observed that the ALU has a valuable piece of real estate – the lisp.org domain – which does not seem to be used to grow or meet the needs of the Lisp community, whether Common Lisp specifically or Lisp more generally. I found it a little sad that, Edi and Arthur aside, the overlap between the ALU board and Stichting CL Foundation directors is 100%.

After the longer talks came the lighting ones, and I took the opportunity to repeat my talk and demo about swankr, my implementation of the SLIME backend for R, from the European Lisp Symposium in April. Erik Huelsmann announced ABCL 1.0, a far better milestone to announce at the ECLM rather than my sneaky announcement of SBCL 0.9 (six years ago!? Doesn't time fly! Also, what ugly slides...). And after some more lightning (and less-lightning) talks, it was time to wrap up with drinks, dinner, and good conversation.

I'm in Amsterdam for the European Common Lisp Meeting, 2011 vintage. Still wearing my two hats, as academic and
entrepreneur” – and, somewhat to my surprise, still enjoying it. Though I do have a fairly nasty cold, possibly a result of too many late nights (business), early mornings (children), and interaction with disease-ridden individuals (students).

I'm sitting in the cafe de jaren, a haunt which I think is popular with students – but today looks just plain popular. They seem very accomodating, with newspapers to wade through (admittedly, I brought my own), free Wifi, and tasty soup and sandwiches. I've been here before; in fact, getting on for a decade ago, my wife and I mislaid a copy of Asterix and the Somethings (dunno which) in Dutch. It's a pleasure to sit here, waiting for my colleagues to show up so that I can inspect Luke's presentation for blatant falsehoods off-message content. Looking forward to this evening's brasserie outing and of course the talks tomorrow – it'll be particularly interesting to see how Jack Harper's presentation compares with our Teclo experience –
and of course it'll be good to catch up with old friends, some of them in the flesh for the first time...

Hey, what happened to that resolution to blog weekly about being entrepreneurial? Well, it's been a long few months: course mostly delivered; PhD student approximately completed (well done, Ben); plenty of extra time to actually be entrepreneurial. Before I sink back down into the mire of too much to do and not enough time, an update!

I went to the 4th European Lisp Symposium, held at the Technical University of Hanburg-Harburg. It was great. Compared with last year, when I was Programme Chair, and volcano eruptions closed most of European airspace, leading to scrambles to find alternative keynote speakers and general stress about whether there were going to be any attendees at all, this was a breeze. Sure, I participated by reviewing a few contributions, but the event itself snuck up on me – I found myself on the Monday remembering that straight after my teaching duties on Wednesday, I needed to dash to the airport to catch a plane. Very pleasant; thanks to Didier Verna and Ralf Möller for making things so smooth that I could just turn up and assume that the event would be running perfectly – I know how much work it takes to get to that point.

It was good to catch up there with some of the wider Lisp world; there were about 60 attendees, including a solid transatlantic contingent. I couldn't quite allow myself to relax completely, and so ended up giving a lightning talk about R – a useful warmup for my slightly more substantial talk (slides; audio recording appears to have failed) at the Zürich Stuff'n'Lisp User Group. The cuteness of adding two lattice objects together (in graphical presentation form) to get a new graph combining the two originals seems never to get old, though since it's in fact six months old I did take the time this morning to commit and push the accumulated fixes to my public swankr git repository.

Right. Back to work work work fun hacking.

What I got for Christmas: sufficiently advanced Intel graphics drivers for 855GM, in Linux 2.6.37-rc7. No more missing mouse cursor on boot (and, icing on the Christmas cake, working video playback!) Thank you to those who worked on this, particularly since I couldn't actually work out how or where to submit useful bug reports (and so resorted to my usual strategy when dealing with laptop-related issues, which is to contact mjg59 by whatever means available and follow his suggestions as precisely as possible).

2 Dec 2010 (updated 2 Dec 2010 at 17:05 UTC) »

As the train I'm on ambles its unheated way through the unseasonably Wintry English countryside, it's time for another “weekly” exciting entrepreneurial update. Actually I should be properly working, not just talking about working, but there's a file I need for that elsewhere, and three's mobile Internet coverage evaporates about 3 minutes outside Waterloo station – if only there were a company dedicated to bettering mobile data infrastructure... So, here I am, with means, motive and opportunity to write a diary entry.

Since I last wrote, I have fought with R's handling of categorical variables in linear models; the eventual outcome was a score draw. The notion of a contrast is a useful one; very often, when we have a heap of conditions under which we observe some value, what we're interested in is not so much the predicted value given some condition, but the difference between the value under some condition and the value under some other: the canonical example for this is probably the difference between the condition of some group receiving a trial treatment, and the group receiving a control or placebo: the default contrast for unordered categorical variables in R is called the treatment contrast (contr.treatmen t).

In my particular case, I wanted to know the difference between any particular contrast and the average response – none of the categories I had in my system should have been privileged over any of the others, and there wasn't anything like a “control” group, so comparing against the overall average is a reasonable thing to want to do, and indeed it is supported in R through the use of the sum contrast contr.sum. However, this reveals a slight technical problem: the overall average and differences for each categorical variable is one more variable than the (effective) number of values; just as in simultaneous equations, this is a Bad Thing. (Technically, the system becomes undetermined.) So, in solving the system, one of the differences is jettisoned; my problem was that I wanted to visualise that information for all the differences, whether or not the last one was technically redundant – particularly since I wanted to offer a guideline as to which differences were most strongly different from the average, and I would be out of luck if the most unusual one happened to be the one jettisoned. Obviously I could trivially compute the last difference, simply from the constraint that all the differences must sum to zero (and actually dummy. coef does that for me); but what about its standard error?

Enter se.co ntrast. This operator allows the user to construct an arbitrary contrast, expressed most simply as a vector of contributions to that contrast and ask an aov object for the standard error of that contrast. Some experimentation later, for a linear model m for len observations, and a particular factor variable f, and a function class.ind to construct a matrix of class indicator values (i.e. for a vector vi of observations, construct a matrix xij where xij is 1 if observation i came from condition j, and zero otherwise), I think that:

  anova <- aov(m)
  ci <- class.ind(data[[f]])
  ci <- ci[,colSums(ci) != 0]
  contrasts <- ci %*% diag(1/colSums(ci)) %*% (diag(len)-
(1/len)*matrix(rep(1,len*len), nrow=len))
  ses <- se.contrast(anova, contrasts)
gives me a vector ses of the standard errors corresponding to the sum contrasts in my system, including the degenerate one. (As seems to be standard in this kind of endeavour, the effort per net line of code is huge; please do not think that I wrote these five lines of code off the top of my head. Thanks to denizens of the r-help mailing list and in particular to Greg Snow for his answer to my question about this).

So, this looks like total victory! Why have I described this as only a score draw? Well, because while the above recipe works for a single factor variable, in the case I am actually dealing with I have all sorts of interaction terms between factors, and between factors and numerical variables, and again I want to display and examine all the contrasts, not just some subset of them chosen so that the system of equations to solve is nondegenerate. This looked sufficiently challenging, and the analysis to be done looked sufficiently peripheral to the current business focus, that it's been shelved, maybe for a rematch in the new year.

151 older entries...

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!