As the train I'm on ambles its unheated way through the unseasonably Wintry English countryside, it's time for another “weekly” exciting entrepreneurial update. Actually I should be properly working, not just talking about working, but there's a file I need for that elsewhere, and three's mobile Internet coverage evaporates about 3 minutes outside Waterloo station – if only there were a company dedicated to bettering mobile data infrastructure... So, here I am, with means, motive and opportunity to write a diary entry.
Since I last wrote, I have fought with R's handling of categorical variables in linear models; the eventual outcome was a score draw. The notion of a contrast is a useful one; very often, when we have a heap of conditions under which we observe some value, what we're interested in is not so much the predicted value given some condition, but the difference between the value under some condition and the value under some other: the canonical example for this is probably the difference between the condition of some group receiving a trial treatment, and the group receiving a control or placebo: the default contrast for unordered categorical variables in R is called the treatment contrast (contr.treatmen t).
In my particular case, I wanted to know the difference between any particular contrast and the average response – none of the categories I had in my system should have been privileged over any of the others, and there wasn't anything like a “control” group, so comparing against the overall average is a reasonable thing to want to do, and indeed it is supported in R through the use of the sum contrast contr.sum. However, this reveals a slight technical problem: the overall average and differences for each categorical variable is one more variable than the (effective) number of values; just as in simultaneous equations, this is a Bad Thing. (Technically, the system becomes undetermined.) So, in solving the system, one of the differences is jettisoned; my problem was that I wanted to visualise that information for all the differences, whether or not the last one was technically redundant – particularly since I wanted to offer a guideline as to which differences were most strongly different from the average, and I would be out of luck if the most unusual one happened to be the one jettisoned. Obviously I could trivially compute the last difference, simply from the constraint that all the differences must sum to zero (and actually dummy. coef does that for me); but what about its standard error?
Enter se.co ntrast. This operator allows the user to construct an arbitrary contrast, expressed most simply as a vector of contributions to that contrast and ask an aov object for the standard error of that contrast. Some experimentation later, for a linear model m for len observations, and a particular factor variable f, and a function class.ind to construct a matrix of class indicator values (i.e. for a vector vi of observations, construct a matrix xij where xij is 1 if observation i came from condition j, and zero otherwise), I think that:
anova <- aov(m) ci <- class.ind(data[[f]]) ci <- ci[,colSums(ci) != 0] contrasts <- ci %*% diag(1/colSums(ci)) %*% (diag(len)- (1/len)*matrix(rep(1,len*len), nrow=len)) ses <- se.contrast(anova, contrasts)gives me a vector ses of the standard errors corresponding to the sum contrasts in my system, including the degenerate one. (As seems to be standard in this kind of endeavour, the effort per net line of code is huge; please do not think that I wrote these five lines of code off the top of my head. Thanks to denizens of the r-help mailing list and in particular to Greg Snow for his answer to my question about this).
So, this looks like total victory! Why have I described this as only a score draw? Well, because while the above recipe works for a single factor variable, in the case I am actually dealing with I have all sorts of interaction terms between factors, and between factors and numerical variables, and again I want to display and examine all the contrasts, not just some subset of them chosen so that the system of equations to solve is nondegenerate. This looked sufficiently challenging, and the analysis to be done looked sufficiently peripheral to the current business focus, that it's been shelved, maybe for a rematch in the new year.