Unix sucks; StarOffice to be free
Posted 20 Jul 2000 at 15:20 UTC by advogato 
This week's LWN has two articles
that will likely be of strong interest to free software hackers. First,
the indefatigable Miguel de Icaza gave a talk promoting the Bonobo
component framework. His main thesis is that Unix sucks at reusing code,
as well as a few other things.
Second, StarOffice has announced that it is being released as GPL code
soon. In fact, code is scheduled to be available October 13 on OpenOffice.org. This is big news;
while there certainly are a number of free office suite projects out
there, none of them clearly has the momentum to win. StarOffice
is already "almost good enough" for most people's needs, and there are a
number of reasons to believe that hackers both within Sun and in the
Gnome world will be working on the "almost". With this announcement, we
are considerably closer to having a Linux platform we can feel
comfortable recommending to our moms.
Advogato wants to spend a little time ruminating on code reuse before
opening up the general discussion. "Code reuse," if anyone recalls, was
a mantra of the '80s, the solution to the productivity crunch. It's also
one of those things that didn't come to pass, certainly not as the
original promoters of the idea foresaw.
What has happened is that the center of the "programming
language" has shifted dramatically. A programming language is no longer
primarily a notation for expressing algorithms. It is a vast,
intertwined network of libraries and APIs, with an extension mechanism
for scripting these together and adding your own little bits that aren't
covered by the libs. Sun attracted some derision from the programming
language community by conflating these two concepts under the
"Java" brand name (especially from people who feel that the Java
algorithm notation is halfway decent, but the APIs are an
out-of-control heap of slog), but in truth that marketing strategy may
reveal some deeper truths about the nature of programming and reuse.
To show how much things have changed, I'll use an example from the mists
of computer time - David Parnas's original landmark paper on modular
decomposition, which itself uses the Key Word In Context problem as its
example. Using the technology of the day, Parnas estimates that "such a
system could be produced by a good programmer within a week or two".
Well, friends, this cat just
implemented the problem in 11 minutes, the way any sane postmillennial
hacker would have: as a Perl script. That's quite
a few orders of magnitude improvement in productivity - it resembles
hardware more than software (although not quite to the insane level as,
say, hard drive capacity per dollar).
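The 11-minute script itself isn't shown in this thread. As a rough illustration of how little code the problem needs once the language's string handling and sort are reused, here is a minimal sketch in Python rather than Perl. The function name `kwic` and the sample input are assumptions made for this example, and it takes the memory-hungry route of copying every rotated line.

```python
def kwic(lines):
    """Return every circular shift of every line, sorted alphabetically."""
    shifts = []
    for line in lines:
        words = line.split()
        for i in range(len(words)):
            # Rotate the line so that word i comes first (this copies the line).
            shifts.append(" ".join(words[i:] + words[:i]))
    # Reuse the built-in sort; ignore case when ordering.
    return sorted(shifts, key=str.lower)


if __name__ == "__main__":
    for shift in kwic(["Unix sucks at code reuse", "StarOffice to be free"]):
        print(shift)
```

Nearly all of the work is done by `split`, `join` and `sorted`, which is the kind of reuse the paragraph above is talking about.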
Of course, what drove this stunning increase is code reuse. There's tons
of code in the Perl language and standard library, much of it optimized
to do exactly this sort of thing - string monkeying. This is the
kind of code that's being reused on a massive scale, anywhere from
one-off scripts to the back-end of quite a few web sites, and, as Iain
observes, quite a lot of
sourcemeat projects.
Dijkstra has said that the only valid motivation for reusing a piece of
code is its extremely high quality. Most software written today is a
writhing mass of bugs, with possibly some functionality thrown in. I
think we do expect better from our tools, and in fact, it's
relatively rare that I've been bitten by a bug in programming language
library code.
But ultimately, stuffing everything into the language isn't going to
work. There's way too much interesting stuff to try to shoehorn into one
language. In many ways, Gnome has served as a
counterweight to the trend of extending the scope of the programming
language - the language of choice is simply C, and we rely on libraries,
lots and lots of libraries (48 for my latest Nautilus build), to hold the
reusable code. Gnome is also starting to use the Bonobo component
framework extensively, especially now that the API shows some signs of
maturing.
As Dijkstra warns, when these libraries aren't of the highest quality,
we can expect problems. But aren't we, as free software hackers, good at
polishing code? Might not the dense interdependencies of modern-day
Gnome development foster exactly the sorts of network effects that I see
as the way out of the Tragedy
of the Commons? I'm hopeful, I'm hopeful.
Well, one problem with the applications mentioned is that they do not reuse code in the Unix way. No surprise here, since they
were written to be portable to Windows. Having the application run on both Windows and Unix pretty much eliminates the
possibility of reusing code from either.
The Unix Way of code reuse is pipe()/fork()/exec(). You simply execute external programs, just like shell scripts do. Traditional
Unix applications have been designed to do useful work this way: they are small, accomplish specific tasks, receive input through
stdin, return results through stdout.
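To make the pipe()/fork()/exec() style concrete, here is a minimal sketch that reuses two standard tools, `sort` and `uniq`, from a Python program. It assumes a Unix-like system with both tools on the PATH; the helper name `sort_unique` is made up for this example, and Python's `subprocess` module does the fork/exec and pipe plumbing underneath.

```python
import os
import subprocess


def sort_unique(text):
    """Run the equivalent of `sort | uniq` on text and return the result."""
    env = dict(os.environ, LC_ALL="C")  # fixed locale, so ordering is predictable
    sort = subprocess.Popen(["sort"], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, env=env)
    uniq = subprocess.Popen(["uniq"], stdin=sort.stdout,
                            stdout=subprocess.PIPE, env=env)
    sort.stdout.close()  # uniq now holds the only read end of that pipe
    sort.stdin.write(text.encode())
    sort.stdin.close()   # end of input: sort can emit its output
    output = uniq.communicate()[0]
    sort.wait()
    return output.decode()


if __name__ == "__main__":
    print(sort_unique("pear\napple\npear\nfig\n"), end="")
```

Neither tool knows anything about the other or about the calling program; the only contract between them is lines of text on stdin and stdout.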
Perhaps those whose only experience with Unix is Linux are unused to this way of doing things. Linux man pages are generally
weak, and the huge number of applications installed on a typical Linux distribution naturally inhibits people from checking out
what those programs do. Furthermore, we have seen a tendency of applications trying to do everything, instead of relying on
other programs. This is probably Windows influence.
Perl, interestingly enough, is one of the great sinners. It tries to replace all these little programs with a single huge one. As
powerful and useful as it is, its impact on the Unix Way of code reuse is very detrimental, I'm afraid.
In many ways, Gnome has served as a counterweight to the trend of
extending the scope of the programming language - the language of
choice is simply C, and we rely on libraries, lots and lots of
libraries
Sadly someone has started a "dependencies are bad" school of thought.
Witness all the people on GNOME lists not wanting to link to GNOME, but
use GNOME. Their solution: Rip out the guts of GNOME (for example,
gnome-canvas) and compile it into the program directly. I don't know
who started it, but really, it's kind of bugging me.
Maybe before we can fully reuse code (in libraries) we have to stop the
"dependencies are bad" thing.
I dunno, maybe I'm just rambling on...
StarOffice apparently links against 32 dynamic libraries, of which 21
are not used by any other application.
Now, Sun is not going to be happy accepting patches that say "Here, we
ripped out all the cruft and made it use a more standard set of Linux
libraries!" They're trying to do to MS Office what MSIE did to
Netscape: make a competing product free and ubiquitous, and you can
flood out the competition. They want to keep portability, and they
want to have "features".
Some people were flamed last year by Ditherati for saying that
sometimes, a feature involves the removal of code. Stability
and speed are features, and being bogged down with
clickbuttons and automated filtering and Estonian grammar-checking
options that you don't need is not.
I foresee a series of forks happening in the near future. We've
already seen a rather lame proof-of-concept browser that pulls all the
cruft from Mozilla. I think we'll see some strip-down projects that
focus on streamlining bloated software that was freed by large
companies.
These projects may try to simply wrap around core code (as galeon did
with gecko), or they may be complete overhauls. With the former, no
forks are necessary. But would it be possible to take those 21
staroffice libraries and use only the parts that are relevant to a
word processor or spreadsheet application?
Of course, since they've gone the GPL route, they now have the chance
to absorb code from AbiWord and Gnumeric.
Don had a
note in his diary about someone who had contributed to a free
software project by removing code from it.
I once sent a patch
to Red Hat to fix their version of the "kibitz" script, and I'm not sure if it changed
/bin to /usr/bin or /usr/bin to /bin. If the latter, I could brag
that I have at least negative four characters of code in the Red
Hat distribution. :-)
Of course companies which release free software typically have
profit-related motives and possibly larger strategies including
non-free software. A long-standing and complex issue, of course. But I think I'm with
Rob: "Nothing is lost."
I think one of the problems preventing more wide-spread code re-use
is that the traditional Unix ethic "do one thing and do it well" is
often implemented as "do one thing, do it well, and tie in a minimal
user interface".
An example of this is something like fetchmail, which
does fetch mail, and does it well, and incorporates a console UI.
I wonder how many other applications make use of the same
functionality, but don't re-use the code because there is no
interface at a more useful level than "fork/exec". Or wget: "gtm" wants
the functionality, but is stuck with the "fork/exec" interface.
dcs above seems to be implying this is a good thing: I'm not sure I
agree. Certainly at the moment people are coming along and looking at
Unix console programs and saying "well, that does that well, but I want
a nice GNOME interface not a console one", and finding the "fork/exec"
interface lacking. And if the result of that is to re-implement it from
scratch, and tie in a GNOME interface this time, then
bang goes code re-use.
I would say that licensing and political issues are probably the
biggest factors preventing more code re-use, in any case.
(Who spotted the sweet irony in the fact that the
leader of the GNOME project (see also, the KDE project) is complaining
about lack of code re-use?)
dcs' and joe's posts about
fork/exec style code reuse and the problems people run into when they want to
reuse code with a UI are interesting. One situation where code reuse has
been done well is with mpg123. As a command line mp3 player it's great,
but many programs tried to use it as the backend with a gui frontend and
did alright, but in my opinion not as well as XMMS, which used the
mpg123 code but not the whole program. Now XMMS does the same great
decoding that mpg123 always has, but with a nice gui and without a bunch
of forks and execs.
I think it would be pretty useful to have libraries implementing the
guts of some of these console tools. This way a GUI program could easily
use the code from find, finger, wget, or whatever. Maybe even console
programs should be designed with a backend/frontend
strategy (model-view-controller), in case their functionality will ever be
incorporated into some other program. It would probably be overkill for
some programs, but for things like wget, mpg123, and others, it would be
very useful.
component software, posted 20 Jul 2000 at 22:15 UTC by higb »
(Observer)
I bought a copy of Clemens Szyperski's Component
Software a couple days ago. It is stimulating a lot of interesting
ideas. I haven't been too comfortable with the division
between "package", "library", and "language" in UNIX land. (One could
alternately say "component", "class", and "language") It may be that
GNOME can find a good way
to integrate the pieces (RPM and Bonobo?), but I've decided to cast
about a bit and look at some of the other solutions (Oberon and C.
(I originally posted this in the wrong article. Stupid, stupid...)
Rob Pike, in his presentation, "Systems Software
Research is Irrelevant," encourages work on component-based
applications, but I don't think that he would agree at all with Miguel's
fawning over MS COM. Pike writes:
There has been much talk about component architectures but only one true
success: UNIX pipes. It should be possible to build interactive and
distributed applications from piece parts.
(I think he considers the "plumbing"
architecture in Plan 9 to be something like the next stage of UNIX
pipes.)
corrections, posted 20 Jul 2000 at 23:03 UTC by higb »
(Observer)
While we're making corrections, I meant to say "Oberon and C#" above.
The main problem with fork, exec and pipe as the fundamental building
blocks for code reuse is that the only common denominator here is
untyped, unstructured bytestreams. If you need more structured
communication than that (and for modern applications you do), you must
build it yourself, and unfortunately, everyone comes up with a different
protocol this way. This is why combining shell tools works great for
simple plain text processing, and not for just about anything else.
Shared libraries, CORBA, COM and Bonobo are all attempts to impose a
higher level of semantics above the basic byte stream, and build a
common set of interfaces that many programs understand. It may be less
universal than pipes, but imagine putting together something like the
Nautilus component model out of raw pipes, or indeed build the program
by calling external executables to parse XML, present a virtual
filesystem view, provide basic onscreen widgets, provide a component
model, render SVG... (just a few of the things Nautilus uses libraries
for).
A Functional OS?, posted 21 Jul 2000 at 02:13 UTC by mettw »
(Journeyer)
I discovered Haskell recently and love it. On the issue of code
reuse you can't get much better than Haskell's solution. Basically
it is like templates in C++ but doesn't require that you reimplement
the function for each type. So if you define a function like
`series f x = x : series f (f x)' then it will work for any
combination of function f and value x that will be accepted by
f. Or, if you define a function like `add a b = a + b' then this
will work with any type that works with the `+' operator.
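For readers who don't know Haskell, the two definitions above can be loosely mirrored in Python. This is only an analogy under stated assumptions: Python checks the operand types at run time, where Haskell infers and checks them at compile time, and a generator stands in for Haskell's lazy list. The names `series` and `add` are taken from the post.

```python
from itertools import islice


def series(f, x):
    """Yield x, f(x), f(f(x)), ... lazily, like the Haskell `series`."""
    while True:
        yield x
        x = f(x)


def add(a, b):
    """Works for any pair of operands that support `+`."""
    return a + b


if __name__ == "__main__":
    # Only the first five elements of the infinite series are ever computed.
    print(list(islice(series(lambda n: n * 2, 1), 5)))
    print(add(1, 2.5), add("ab", "c"))
```

The same `series` works unchanged for numbers, strings, or anything else `f` accepts, which is the reuse property the post is pointing at.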
UNIX pipes incorporate a similar idea to the lazy evaluation of
Haskell so if we had a more sophisticated communication system,
such as allowing the passing of lists, tuples and types then
we should be able to accomplish a `functional' operating system.
This seems (to me at least) to be a much better way to accomplish
code reuse than OO stuff like CORBA.
mjs hit the nail on the head - If you're using fork/exec as your code reuse, and using stdin and stdout as the extent of your IPC,
you're making things a lot harder for yourself. To communicate you have to "flatten" your data structures into a byte stream, write the
encoding, transmission, metaencoding, exception handling, decoding and reconstructing the data structures, debug your implementation,
extend your implementation when you find that you failed to anticipate something, debug the extension, whoo-whee! Call that "code
reuse?"
(One of) the points of CORBA, or Java RMI, or RPC, is to handle all that for you and you never again have to invent YAL4P. (Yet Another
Layer 4 Protocol.) It takes a well-known and well-proven communication method - the typed function call - and makes it as transparent as
possible between processes and between network hosts.
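The flatten / transmit / decode / handle-errors chore described above can be seen in miniature below. This is a sketch, not anyone's actual protocol: the child process is a stand-in for an external tool, JSON is an arbitrary choice of wire format (which is rather the point: every pair of programs has to agree on one), and the names `CHILD` and `order_total` are invented for the example.

```python
import json
import subprocess
import sys

# Stand-in for an external tool: it decodes a JSON request from stdin,
# does its work, and encodes a JSON reply on stdout.
CHILD = r"""
import json, sys
request = json.load(sys.stdin)
reply = {"total": sum(item["qty"] * item["price"] for item in request["items"])}
json.dump(reply, sys.stdout)
"""


def order_total(items):
    payload = json.dumps({"items": items})        # flatten to a byte stream
    result = subprocess.run(
        [sys.executable, "-c", CHILD],            # fork/exec the "tool"
        input=payload, capture_output=True, text=True,
    )
    if result.returncode != 0:                    # error handling is by hand
        raise RuntimeError(result.stderr)
    return json.loads(result.stdout)["total"]     # reconstruct the structure


if __name__ == "__main__":
    print(order_total([{"qty": 2, "price": 3}, {"qty": 1, "price": 4}]))
```

With an RPC layer, the encoding, transport and error mapping in `order_total` would be generated rather than written and debugged by hand for each pair of programs.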
Advogato said:
That's quite a few orders of magnitude improvement in productivity - it
resembles hardware more than software (although not quite to the insane
level as, say, hard drive capacity per dollar).
Of course, what drove this stunning increase is code reuse.
I disagree. I believe the improvements in hardware are a big
factor as well. Reading the Parnas paper I see he had to deal with
things like packing 4 characters to a word, and using indices for
the permutations rather than doing as you do and making copies of
each permutation "in core." He also talked about some of the
tricks like partial sorting which could improve performance.
If there are N lines of w words each (where w<255) then it looks
like you only need w*N bytes of extra memory for the approach he uses.
Each line also stores a table of offsets in order of which word
is alphabetically the smallest. On the other hand, your approach
duplicates each line, so needs w*c*N memory, where c is the number
of characters in the word, so you take about 6 to 8 times as much
memory.
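The difference between the two approaches can be sketched as follows. This is not Parnas's design, only an illustration of the idea under discussion: each circular shift is stored as a small (line number, word offset) pair, and the rotated text is never copied; the comparison function walks the original words instead. The name `kwic_index` and its helpers are hypothetical.

```python
from functools import cmp_to_key


def kwic_index(lines):
    """Return sorted (line number, word offset) pairs, one per circular shift."""
    words = [line.split() for line in lines]
    shifts = [(i, j) for i, ws in enumerate(words) for j in range(len(ws))]

    def word_at(shift, k):
        # k-th word of the rotation, read straight from the original line.
        i, j = shift
        ws = words[i]
        return ws[(j + k) % len(ws)].lower()

    def compare(a, b):
        len_a, len_b = len(words[a[0]]), len(words[b[0]])
        for k in range(min(len_a, len_b)):
            wa, wb = word_at(a, k), word_at(b, k)
            if wa != wb:
                return -1 if wa < wb else 1
        # All shared positions equal: the shorter line sorts first.
        return (len_a > len_b) - (len_a < len_b)

    return sorted(shifts, key=cmp_to_key(compare))


if __name__ == "__main__":
    print(kwic_index(["the cat sat", "a dog"]))
```

The extra storage is one pair per word, matching the w*N estimate above, at the cost of a slower and fiddlier comparison. That trade is cheap to ignore today and was not in 1972.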
You can tell memory is a concern since he talks about using a
symbol table, which likely exists to reduce the number of duplicate
words present in the input data. You can even conceive of replacing
all symbols with a number so the sort's comparisons are done on
integers rather than strings, which would up the performance by
a factor of 2 or 3 (this is done in the n*log(n) step, so the time
spent in the hash table used for uniqueness checking while building
the table is okay).
The Perl approach you took also takes advantage of the more recent joy
that hardware is cheap - easy to see since you split and tear
apart strings all the time in your comparison function. (Instead,
you could have saved the $lineno and @words in a [list reference]
then sorted on elements in the list. Umm, assuming in Perl that
a list comparison compares items in the list (as Python does)
and not just their id values.)
For a fair comparison you would have to target the same sort
of hardware environment and reduce your memory usage drastically - I
estimate a factor of 10 smaller than what you have.
That's a more complicated problem, and Perl doesn't have those
tools built into the language or directly accessible as modules,
so you would probably spend a couple of hours on it - especially
since you would have to do a lot more debugging. Add in
documentation (since it's a bigger problem) and we're talking
half a day (4 hours). This is only about an order of magnitude
faster in development time than 1972, but still an order of
magnitude slower than your assumption of cheap hardware.
Hence, I argue that a major reason productivity increases resemble
hardware performance increases is because of ... the hardware.
I even believe hardware has more effect on productivity than
improvements in software development methods, though the numbers
above suggest they are about equivalent (within a factor of 2).
Now, Sun is not going to be happy accepting patches that say "Here, we ripped out all the
cruft and made it use a more standard set of Linux libraries!"
For what it's worth, StarOffice is a very autonomous company; my firm belief is that you
will not have to get Sun to agree to something, but rather to convince the StarOffice division
of the technical merits of your code reorganization, etc. There is no conspiracy here, or if there is
then those of us working here haven't been let in on the secret. So come October, when we have this
beast back working again (so that we don't start off with a Mozilla-style disappointment), we shall see
what the game is like.
C.
blatantly buzzword, posted 21 Jul 2000 at 08:20 UTC by jdub »
(Master)
mjs and arosey: Understanding that XML is not the cure for
everything, couldn't componentized software use WDDX or similar DTDs and
the 'standard' UNIX fork()/exec()/pipe()?
Yes, blatantly buzzword. It's probably fairly unworkable too. Given
that I'm familiar with COM, it also sounds bloody strange.
I'm glad Miguel has been so outspoken on this. Perhaps it will be a
kick in the bum for some, but questioning the status quo will forever be
the most valuable path to goodness. Blinkers suck; look at something
outside your world every day.
First of all, Unix doesn't "suck" at code re-use strictly speaking,
developers do, and this means us !! ;-)
More seriously, the reasons why we avoid re-using code are various
but it seems that one of the most important ones is that each "software
library" is essentially designed to solve a certain problem under
very specific conditions, even if these are often not clearly mentioned
in the code or documentation.
Change the initial requirements, and you're pretty certain to
need a different design, unless your code was carefully written
from the start to foresee a wide variety of uses (a very unlikely
thing :-). Of course, I'm not mentioning the "hacker culture", NIH
syndrome and other non-technical issues..
Frankly, I can't see a lot of really-well-designed-and-reused
libraries in the
C/C++ world, except maybe the jpeg, png & zip libraries (ok, I'm
stretching this a lot ;-), and that's probably because of all
the constraints imposed by the language and its environment. On the
other hand, there is a lot more code re-use with Perl & Python. That's
essentially because high performance and low memory footprint are not
part of their initial requirements, as they instead focus on
flexibility and ease of use (at least for Python ;-). As the cat
demonstrates, this provides a massive increase of productivity,
though dalke
rightly reminded us that this was clearly at a certain cost. In short,
Raph's solution doesn't solve the same problem as Parnas's because
the initial conditions are far from being equal.
In some way, choosing to re-use a library means limiting your
capabilities, while increasing them in another dimension. As long as
one feels (be it justified or not) that making the choice of
a given implementation will limit the program too much, it simply
won't happen or developers will start to complain. That's also why you
have a "dependencies are bad" school of thought among many programmers,
who do not like to be artificially limited by political,
rather than technical, issues. Remember, it's a matter of
perception.
It's all about finding a balance. Code re-use mainly means taking
great, great care when designing your library to make it flexible, and
this normally takes a massive amount of time and usually several tries.
However, if you succeed, your investment will be paid back each time
you re-use your code with no modification whatsoever.
Finally, I don't really think that the center of "programming
language" has shifted as Raph claims. We're asking more and more things
of our programs, and it may be true that in a C/C++ world, there is no
other option than relying on more and more libraries. However, more
often than not, solid breakthroughs can only be made useful by
integrating them into the language, I'm thinking about garbage
collection, exceptions, serialisation, language-independent object
communication (COM/CORBA), task/thread synchronisation,
distributed processing, secure code generation/execution,
etc..
Just like stuffing everything in the language won't work,
relying exclusively on external libraries with primitive languages
won't help much in the long run.. This world definitely ain't
perfect :-)
And frankly, as much as I love Gnome, I really don't
see how one could claim that it has "served as a counterweight to
the trend of extending the scope of the programming language" ?
I doubt it had any influence on "language" trends of the computing
industry (except maybe in some limited part of the
open-source community).
There are clearly technical reasons why Java is now taught in all
self-respecting computer science departments and I expect this to
have far more consequences for the future of our industry.
Besides, the most widely used languages in the world
are probably still Visual Basic & Perl (evil grin !!)
Why? Because one of the nifty features that showed up in the Berkeley UNIX programming manual back around 1980 was a KWIC index
to the whole shebang. And that KWIC index was created using the standard UNIX tools of 1980, with a couple of extra glue programs.
There you have traditional UNIX code reuse... voila!
The reason you're not seeing that kind of code reuse in GUI tools is because, well, GUI tools are basically editors. You present the user
an object that you want him to manipulate and give back to you. The real work, raytracing or text formatting or whatever... the stuff that
the UNIX tools do so well... isn't ever seen by the user directly, he just gets a spinning stopwatch and when it's done you show him the
results.
There's a couple of things that we can get from this.
First, we need a real GUI front-end to the UNIX pipes and filters world. Something that lets people drag objects around and hook them
together in pipelines that look good. That makes the command line tools useful for GUI users in a way that they can appreciate.
Second, we need to build tools that operate on more structured objects. UNIX pipelines aren't limited to text, you know, look at NetPBM
for pipeline tools that munge image data. The problem is that you can't expect people to keep that stuff straight. So what you do is add a
wrapping layer.
You make sure the GUI knows what kinds of inputs and outputs the various tools expect: PPM files, PNG, UNIX colon delimited text files,
Windows-style CSV files, whatever. Then when you hook "cut" up in a pipeline, it won't accept input from "pnmcut". You add tools like
"csvtopasswd".
This makes the kind of code reuse UNIX is good at attractive to the GUI world. Then we can look for better solutions to the opposite
problem... bringing in the Windows kind of code reuse without landing in the toxic ecosystem of Windows... Windows stability issues tell us
everything we need to know about that.
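The wrapping layer suggested in this comment, where the GUI knows what each tool consumes and produces, could be as small as a lookup table. The sketch below is hypothetical throughout: the `TOOLS` table, the type labels and `check_pipeline` are invented for illustration, "csvtopasswd" is the commenter's imagined converter, and the entries are simplifications of what the real `cut` and `pnmcut` accept.

```python
# Hypothetical registry: tool name -> (input type, output type).
TOOLS = {
    "pnmcut": ("ppm", "ppm"),
    "cut": ("colon-text", "colon-text"),
    "csvtopasswd": ("csv", "colon-text"),
}


def check_pipeline(stages):
    """Return None if adjacent stages agree on types, else an error message."""
    for left, right in zip(stages, stages[1:]):
        out_type = TOOLS[left][1]
        in_type = TOOLS[right][0]
        if out_type != in_type:
            return f"{right} expects {in_type}, but {left} produces {out_type}"
    return None


if __name__ == "__main__":
    print(check_pipeline(["csvtopasswd", "cut"]))  # compatible stages
    print(check_pipeline(["pnmcut", "cut"]))       # image data into a text tool
```

A drag-and-drop front-end would run a check like this before letting the user connect two boxes, and could offer a converter when one exists.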
Obstacles to reuse, posted 21 Jul 2000 at 20:39 UTC by gord »
(Master)
I see two main reasons for the lack of reuse within Free Software:
- Inflexible IPC. When interfacing with an external program,
the protocols for doing so are quite narrow. With some extra attention,
more reasonable means of communication are possible (see X11, Gimp,
CORBA).
However, there is not yet a truly standard object representation.
This forces people to turn to code running in the same address space,
namely libraries or scripts, but brings the issue of...
- Clunky packages. Interdependencies are shunned on all but
the most standard packages because it is a nuisance to install
packages. That means people usually choose to cut-and-paste the parts
of the package they want, or else rewrite them from scratch.
And so it seems to me that the solutions come from a few different
directions: convergence on standard RPC mechanisms; package installers
that know about interdependencies (such as APT);
self-contained packages which are bundled with stubbed versions of their
dependencies (but use the full versions if they are already installed).
This last option is the one explored most by the GNITS group, who
advocate code reuse within GNU, and are responsible for such things as
Automake, Libtool, and libit (a nifty idea with a lot of potential).
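The "bundled stub, but use the full version if installed" option has a familiar shape in scripting languages. The sketch below shows it in Python as an assumption-laden illustration, not as what GNITS or libit actually do: `fulltextwrap` is a made-up name for the full-featured dependency, and the fallback `wrap` is a deliberately minimal stand-in shipped with the program.

```python
try:
    # Hypothetical full-featured dependency; use it if it is installed.
    from fulltextwrap import wrap
except ImportError:
    def wrap(text, width):
        """Bundled stub: greedy word wrap, enough for this program's needs."""
        lines, current = [], ""
        for word in text.split():
            candidate = f"{current} {word}".strip()
            if current and len(candidate) > width:
                lines.append(current)   # line is full, start a new one
                current = word
            else:
                current = candidate
        if current:
            lines.append(current)
        return lines


if __name__ == "__main__":
    print(wrap("a b c d", 3))
```

The program installs and runs with no extra packages, yet picks up the real library wherever a packager or user has provided it.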
Someone mentioned GUI environments for users to easily connect
applications with
pipes/IPC. There was once a project called GNU Piper. Does anyone know
what happened to it?
Well, it doesn't look like there's much disagreement about code reuse around... Let me reply to a few messages, and make
some additional remarks.
First, many complain about the "untyped" interface of pipes. Well, there's ASN.1 for you people, and, besides, CORBA
requires that objects be able to pass a "string" version of themselves (based on ASN.1, I think). RPC is based on XDR, which
isn't really any different than what is required for adding types to text streams. In fact, one could replace the xdr library with
something with the exact same interface, but with a text-stream back-end.
Furthermore, these text streams are not really untyped. You have utilities to deal with text streams as records, as tables, as
b-tree records, etc. Take a look at what is behind "man", for one thing.
And, then, we have XML, which works with text streams just fine. As a matter of fact, HTTP, HTML and the WWW are based on
pipes. Everything you see on the web, including animated images, frames, java applications, etc, could be happening on a
pipe()/fork()/exec().
Finally, complex object types are not all that common, generally speaking. Applications whose input/output are text are very,
very common. Scientific applications usually import/export numbers. Many applications' input are languages (eg, postscript).
Complex data types are usually found in the realm of... GUI. :-) And here is the real problem.
The GUI programming paradigm is very different from the pipe/fork/exec paradigm. For one thing, the former is event-based. It's
difficult to integrate traditional applications in a GUI paradigm. Often, though, it isn't necessary. All that's necessary is a
front-end. Front-ends can be easily written with a drag-and-drop interface. Unfortunately, GUI programmers have a hard time
dealing with the traditional unix paradigm and vice versa. One thing I consider a good example of such integration (based
on tk/tcl, no less :), is Spin/Xspin.
Alas, people heavy into OO face the same problem with the traditional model.
Anyway, AI code reuse. :-)
I decided to talk about this because I bet this is going to be very important in a somewhat distant future, and people come to
advogato to read about interesting stuff (don't you? :).
A small and dedicated branch of AI called Distributed Artificial Intelligence (which works basically with Intelligent Agents, and is
generally upset with the "misuse" of the word "agent" nowadays :) has an interesting framework.
All applications are agents. This is by definition, as a matter of fact. :-) Any application fits the definition of an agent, most of
them are just incredibly dumb agents.
Smarter agents make a clear distinction about what they _know_ (their internal data), and what they _believe_ (what they
_expect_ to be true about their environment).
Really smart agents also have internal models of other agents (anything they interact with), to help them predict their behavior.
Interestingly, this can be applied to external hardware devices (like printers), programs (daemons) and even users. Such a
program would, ideally, have a simplified model of what your intentions, goals, desires and beliefs are, and use that to predict
your wishes. For instance, if Clippy was a REAL agent, it wouldn't keep popping up and annoying you so much. :-)
Of course, communication in this environment is much more complex than what CORBA (for example) allows for. It happens on
multiple levels. On one level, you have something like KQML. This level communicates things like whether you are asking,
demanding, begging :-), stating something or what, the intensity, etc. It also covers the ability to find out which agents out
there are capable of answering to such requests.
On a next level, you have a language capable of expressing knowledge, like KIF. Knowledge, obviously, is a much more
complex data type than any object.
Finally, we have ontologies. Ontologies are "common sense" to be used with particular knowledge fields (roughly speaking,
like everything else :). For instance, text editing applications (agents :) would have a host of common axioms, definitions,
theorems, etc, which would enable _any_ text editing agent to talk to _any_ other text editing agent, find out what it does,
"outsource" tasks :-), etc. Whatever the capabilities of your agents are, you would express them in KIF (or similar), with a
specific text editing ontology as background info.
Incidentally, all of the above would still work with pipe/fork/exec. :-) But the programming model for it would not, since it's an
agent-facilitator model (similar to client-server, but the clients (agents) have server functions, and what the server (facilitator)
does is put clients who need each other's capabilities in touch with one another).
Now, *this* model would promote code reuse. :-)
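The agent-facilitator arrangement can be shown in skeleton form. The sketch below covers only the matchmaking part: agents advertise capabilities, and the facilitator puts a requester in touch with a provider, after which the two talk directly. It does not attempt KQML, KIF or ontologies, and the class and method names (`Facilitator`, `Agent`, `advertise`, `recommend`, `ask`) are invented for this example.

```python
class Facilitator:
    """Keeps track of which agents advertise which capabilities."""

    def __init__(self):
        self._providers = {}

    def advertise(self, capability, agent):
        self._providers.setdefault(capability, []).append(agent)

    def recommend(self, capability):
        """Return the agents that claim they can handle the capability."""
        return list(self._providers.get(capability, []))


class Agent:
    """Both a client (ask) and a server (handle), as in the model above."""

    def __init__(self, name, facilitator, skills):
        self.name = name
        self.facilitator = facilitator
        self.skills = skills  # capability name -> callable
        for capability in skills:
            facilitator.advertise(capability, self)

    def handle(self, capability, content):
        return self.skills[capability](content)

    def ask(self, capability, content):
        """Find a peer via the facilitator, then talk to it directly."""
        for peer in self.facilitator.recommend(capability):
            if peer is not self:
                return peer.handle(capability, content)
        raise LookupError(f"no agent offers {capability!r}")


if __name__ == "__main__":
    hub = Facilitator()
    Agent("shouter", hub, {"uppercase": str.upper})
    editor = Agent("editor", hub, {})
    print(editor.ask("uppercase", "hi"))
```

The editor never names the agent it ends up using; it only names the capability it needs, which is where the reuse would come from.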