As the train I'm on ambles its unheated way through the
unseasonably
Wintry English countryside, it's time for another
“weekly”
exciting
entrepreneurial update. Actually I should be properly
*working*,
not just talking about working, but there's a file I need
for that
elsewhere, and three's
mobile
Internet coverage evaporates about 3 minutes outside
Waterloo station
– if only there were a
company dedicated to bettering mobile data
infrastructure... So,
here I am, with means, motive and opportunity to write a
diary entry.

Since I last wrote, I have fought with R's handling of
categorical
variables in linear models; the eventual outcome was a score
draw.
The notion of a *contrast* is a useful one; very often,
when we
have a heap of conditions under which we observe some value,
what
we're interested in is not so much the predicted value given
some
condition, but the difference between the value under some
condition
and the value under some other: the canonical example for
this is
probably the difference between the condition of some group
receiving
a trial treatment, and the group receiving a control or
placebo: the
default contrast for unordered categorical variables in R is
called
the treatment contrast (`contr.treatmen
t`).

In my particular case, I wanted to know the difference
between any
particular contrast and the average response – none of
the
categories I had in my system should have been privileged
over any of
the others, and there wasn't anything like a
“control”
group, so
comparing against the overall average is a reasonable thing
to want to
do, and indeed it is supported in R through the use of the
sum
contrast `contr.sum`. However, this reveals a slight
technical
problem: the overall average and differences for each
categorical
variable is one more variable than the (effective) number of
values;
just as in simultaneous equations, this is a Bad Thing.
(Technically,
the system becomes undetermined.) So, in solving the
system, one of
the differences is jettisoned; my problem was that I wanted
to
visualise that information for all the differences, whether
or not the
last one was technically redundant – particularly
since I wanted
to offer a guideline as to which differences were most
strongly
different from the average, and I would be out of luck if
the most
unusual one happened to be the one jettisoned. Obviously I
could
trivially compute the last difference, simply from the
constraint that
all the differences must sum to zero (and actually `dummy.
coef`
does that for me); but what about its standard error?

Enter `se.co
ntrast`.
This operator allows the user to construct an arbitrary
contrast,
expressed most simply as a vector of contributions to that
contrast
and ask an `aov` object for the standard error of
that
contrast. Some experimentation later, for a linear model
`m`
for `len` observations, and a particular factor
variable
`f`, and a function `class.ind` to construct a
matrix of
class indicator values (i.e. for a vector v_{i} of
observations, construct a matrix x_{ij} where
x_{ij}
is 1 if observation i came from condition j, and zero
otherwise), I
think that:

anova <- aov(m) ci <- class.ind(data[[f]]) ci <- ci[,colSums(ci) != 0] contrasts <- ci %*% diag(1/colSums(ci)) %*% (diag(len)- (1/len)*matrix(rep(1,len*len), nrow=len)) ses <- se.contrast(anova, contrasts)gives me a vector

`ses`of the standard errors corresponding to the sum contrasts in my system, including the degenerate one. (As seems to be standard in this kind of endeavour, the effort per net line of code is huge; please do not think that I wrote these five lines of code off the top of my head. Thanks to denizens of the r-help mailing list and in particular to Greg Snow for his answer to my question about this).

So, this looks like total victory! Why have I described this as only a score draw? Well, because while the above recipe works for a single factor variable, in the case I am actually dealing with I have all sorts of interaction terms between factors, and between factors and numerical variables, and again I want to display and examine all the contrasts, not just some subset of them chosen so that the system of equations to solve is nondegenerate. This looked sufficiently challenging, and the analysis to be done looked sufficiently peripheral to the current business focus, that it's been shelved, maybe for a rematch in the new year.