Advogato: How did you get started in free software?
Caolan McNamara: In college, late 92 or maybe early the next year, a
guy called John Quinn and friends had succeeded in getting Linux onto
his top-of-the-line 486 and, most importantly, installing circlemud
onto it, so... Eventually chopping the crap out of beasties began to
pall, but being
able to have your own Unix rather than have to fight with the
authorities that be to get some time on a wonderful ApolloOS or Ultrix
system was pretty neat. I like writing code a lot and the vast free
code base accreted around Linux was incredibly useful as tools for
a student living in poverty and for learning from. Sitting on my ass
with my maw open taking continuously didn't appeal so I made a few
stabs at writing useful things to return the favour and perpetuate the
system.
We don't hear too much from the Irish free software scene. How would
you characterize the community there?
Well there's certainly a mountain of commercial software being written
or being passed through the place, world's biggest exporter of software
and all that (see this OECD IT Outlook). There is an active community
of free software users, and
there's been serious Linux and Unix heads around for years. But there
hasn't been the crossover to create a large amount of free software,
which is bothersome. Alan Cox on a brief visit ventured to
suggest that this was because we spent all of our time in the pub, a
blatant unfounded racial slur of course. Nevertheless it's a
mysterious issue.
How did you get involved in grokking Microsoft file formats?
The 97 specs showed up on their website in July 1998 approx. I took a
look at them and thought about implementing a text extraction tool
that would also take the fastsaved nonsense into account. No one else
seemed interested in doing it. Once that was wrapped up it
didn't seem too far fetched to expand it to "simple html markup" and
"simple graphics". The AbiWord people put together an awesome but scary
kludge to import Word files using a wrapper around the old incoherent
mswordview version, so we rewrote it as the wv library. wv spun off a few
other bits and pieces along the way: the wvDecrypt module to decrypt
Word and technically other Office files, libwmf to convert wmf files,
ivt2html, a quick hack to convert those MSDN cd ivt files to html,
wvSummary to dump summary information from ole2 documents, and so
on. wv expanded to take the 95/6 formats into account, the contributed
ole code went into cole, which turned into libole2, which gnumeric and
friends sit on top of nowadays. And then I got a mail from
StarOffice and they kidnapped me in January. Now I get paid to work
on a vast to-be-GPLed code base, pretty neat eh.
What do you think you've learned from these about Microsoft as a
company and the way they create software?
There are so many file formats in Office that it's like an ecosystem,
incestuous couplings of subformats merrily prancing away under the
hood. There's little sign of careful future proofing having gone into
their formats. On the other hand the Escher graphic file format is quite
tidy and the basic OLE2 streams concept is fine, giving programmers a
file system, but giving users a single file to move around the place.
So basically two thoughts:
1. Some reasonably ok ideas, but very bad follow through into
correctly working clean code (not that I might be the best person to
bring that accusation).
2. The same problems all large companies and old projects suffer:
incremental cruft as people forget what chunks of code are for,
lose the overall picture of how things work, and start nailing
functionality onto the side.
What's good and what's bad about the MSWord file format?
The good thing is that it is pretty much unchanged from the beginning:
having a Word 95 reader allows you to make a fair stab at having a 97
reader, which with zero modifications can at least read 2000 documents.
Vice versa, allowing some care to handle the non OLE streamed nature of
older formats, you can handle them as well without an insurmountable
amount of work. And MS sticks to its tried and trusted set of
techniques, for instance always the same two or three compression
schemes. The compression in ivt files is the same as that in cab etc.
The bad stuff is that the format is buggy in places. The 95 lists were
changed to a completely different 97 list format but "95 lists may
still occur in 97 documents". Sounds to me like someone couldn't
figure out how to remove the old code without breaking the whole
thing. The 97 upgrade from 95 for the file format was to simply change
practically all 8bit strings to unicode, nevertheless they themselves
couldn't export to 95 except through rtf. An Indian company mentioned
to me that in contact with the East Asia Microsoft they were told that
there wasn't the expertise internally in MS to handle Word format
technical queries, and were fobbed off to wv. The fastsaved technology
was hijacked to kludge unicode support onto the old format. Reading
the old Word 2 format documentation, the 6 format and the 97 format
shows the exact same document with incremental additions. All in
all, lots of evidence that it's gotten completely out of hand and that
Redmond has been lumbered with a fragile file format that they no more
fully understand than we do. There isn't a conspiracy (or at least it's
a retrofitted one) that MS is actively fighting a file format battle
with the world. It just grew that way.
If you were designing a word processor format from scratch, what
would it be like? XML-based?
File format wise something plain human readable text like XML is the
way to go. Some independent ability to validate that the input/output
is sane, a builtin ability to ignore non understandable tags and
attributes from future versions etc. All good stuff, lots of knowledge
floating around as to how XML works. On the other hand it's a pain to
put a graphic file or, for instance, an OLE2 stream for an embedded
legacy app, say equation editor[1], directly into a text based xml
file, though there are a couple of possibilities, all of which would
work fine with varying degrees of ugliness. But I'm not an XML head,
I'll leave that to the experts.
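
One of those possibilities can be sketched as below. This is only an illustration, assuming that base64 text inside an element is an acceptable way to carry binary data; the interview does not say which approach was preferred. The `object` tag, the `encoding` attribute, and the helper names `embed_binary` and `extract_binary` are all invented for this example.

```python
import base64
import xml.etree.ElementTree as ET


def embed_binary(parent, tag, payload):
    """Attach raw bytes to `parent` as base64 text in a new child element."""
    child = ET.SubElement(parent, tag, {"encoding": "base64"})
    child.text = base64.b64encode(payload).decode("ascii")
    return child


def extract_binary(element):
    """Recover the original bytes from an element written by embed_binary."""
    return base64.b64decode(element.text)


# Build a tiny text-based document with one embedded binary object.
doc = ET.Element("document")
ET.SubElement(doc, "para").text = "See the equation below."
blob = bytes(range(256))  # stand-in for an OLE2 stream or a graphic file
embed_binary(doc, "object", blob)

# Serialize to plain text, parse it back, and pull the bytes out again.
serialized = ET.tostring(doc, encoding="unicode")
roundtrip = ET.fromstring(serialized)
recovered = extract_binary(roundtrip.find("object"))
```

The ugliness here is the size overhead: base64 grows the payload by roughly a third, and the encoded run is not human readable even though the file around it is.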
It's pretty well understood that incompatibilities in the file format
force people to upgrade their Microsoft Office suites (I've heard that
some files saved in MS Word 6.01 can't be loaded in 6.0 - feel free to
elaborate).
I didn't know that the 6.0.1 vs 6.0 was actually a problem for Word as
well, but I have a memory of a wv showstopper difference between 6.0.1
and 6.0. There was something to do with the font names (FFN) or some
similar structure, some extra data being appended onto some of these
structures (for Asian support I theorized, probably incorrectly), so
that the advertised size of the structure didn't match the reality. I
also have a changelog entry for a work around in the summary stream
information as well for 6.0.1, but the exact details escape me.
How is that going to differ in Linux, and how do you think that will
affect adoption of free software office suites?
On the MsOffice incompatibilities, any new MS version will either be
incremental addons to the existing binary format, which we have on the
ropes and which will not be a major problem, or it will be some new
thing, perhaps some sort of humorous standard-mangling XML, which would
be much easier to import in comparison anyhow. So I don't see this as a
problem. It will still mean that the average user may have to upgrade
OpenOffice each time Microsoft release a new version of their suite,
but it's not as if our customers have to shell out an upgrade tax for
the privilege.
Incompatibilities between OpenOffice file versions shouldn't become a
problem. Just ignoring unrecognized tags and attributes should avoid
having to ever write a special "save as older version" filter: instant
time saving, isn't that great. No Word 97 to 95 style fiasco. We
should also be able to avoid problems like this with an open
development system, with a wider group of testers, platforms and
bizarre setups. So stuff where only the English version crashes when
you do "complicated thing" while the German one is fine won't have as
easy a time of slipping through the net.
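
The "just ignore unrecognized tags and attributes" idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the OpenOffice reader: the `document`/`para` layout, the `align` attribute, the made-up future additions (`glow`, `hologram`), and the function `load_paragraphs` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Attributes this (older) reader understands on a <para> element.
KNOWN_PARA_ATTRS = {"align"}


def load_paragraphs(xml_text):
    """Read <para> children, skipping anything this version does not know."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    skipped = []
    for child in root:
        if child.tag != "para":
            # Unknown element from a newer version: note it and move on.
            skipped.append(child.tag)
            continue
        # Keep only the attributes we understand, drop the rest silently.
        attrs = {k: v for k, v in child.attrib.items() if k in KNOWN_PARA_ATTRS}
        paragraphs.append({"text": child.text or "", "attrs": attrs})
    return paragraphs, skipped


# A file written by an imaginary newer version of the format.
NEWER = """<document version="2">
  <para align="left" glow="blue">Hello</para>
  <hologram depth="3"/>
  <para>World</para>
</document>"""

paragraphs, skipped = load_paragraphs(NEWER)
```

The older reader still gets both paragraphs out of the newer file, which is why no separate down-level export filter is needed in this scheme; the cost is that the unknown parts are lost if the file is saved again.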
There's a lot of talk about StarOffice being Gnomified. Any word on
integration with KDE?
StarOffice is not a vast corporation with gazillions of employees; it's
owned by one, but that's not the same thing at all. So it cannot afford
to spread itself thin. My belief is that no barriers will be actively
placed in the way of interoperability with KDE, but a choice has to be
made and the main focus will be on interoperability with
GNOME's Bonobo linking and embedding because it's closer to our own,
the main top-secret internal technical reason being that the foot looks
a lot cuter than the K. But seriously, there was always someone going
to be slightly disappointed here. Anyhow if KDE sticks together a
mechanism for using Bonobo components in their apps then I imagine
they can play too.
You recently passed maintenance of wv to Dom Lachowicz. What are your
thoughts on changing maintainers of free software projects, and wv in
particular?
It's kind of tough to do it actually. There isn't a chance in hell that
I'd ever have time to continue work on wv right now, and of course it
makes absolutely no sense for me to work on my competitor, so a new
maintainer was needed, but I dodged the issue since last Christmas.
Handovers are stressful: you get very very attached to the software,
child surrogate, watching it linger maintainerless is annoying, but
you dread the possibility of future coders trampling all over the
clever bits and making a complete mess of the design. But I think Dom
will be excellent for it.
I am taking some glee from forwarding all wv mail to Dom, and
reclaiming the space from the automated conversion site, (1 gig of
document and wmf files in bzip2 tar files from 1st Jan to 1st Aug). So I
am kind of glad that it's handed over and I can move on. I have no idea
how people like Linus or Alan handle the volume of mail; the incessant
questions wore me down.
What kind of clothes do you wear?
Zero clothes sense, I stick with black, lots of black, sometimes in a
spirit of lightheartedness I wear some black instead.
[1] (which btw is crippleware MathType: Insert object -> Equation
Editor -> bottom left button of equation panel -> choose one of the
horizontal braces, save, reopen, activate equation, barf "you gotta buy
the full version to use that boy")
<disclaimer>
These are my personal opinions, you'd want to be utterly crazed to
consider these official positions of StarOffice/Sun or even vaguely
congruent with those parties.
</disclaimer>