cpp considered harmful

Posted 27 Feb 2000 at 01:37 UTC by dan

In which the author attempts to write foreign-language bindings for some commonly available Unix functions

In particular, we plan to have longer file names, file version numbers, a crashproof file system, file name completion perhaps, terminal-independent display support, and perhaps eventually a Lisp-based window system through which several Lisp programs and ordinary Unix programs can share a screen. Both C and Lisp will be available as system programming languages.

That is an excerpt from the GNU Manifesto. I assume that it was written around 1985 - I came to the free software party late, having found it by accident while trying to exit from Emacs 18, but that's what the copyright notice on the file says.

One Linux-based GNU system later, it's interesting to look at that and see where we are now as against where we were back then. Longer file names? Check. Version numbers? Not supported in any widespread fashion, but CVS is actually preferable for many purposes. Crashproof fs? Journalling filesystems RSN, but even with ext2 I can count on the fingers of one foot the number of times I've actually lost work due to fs corruption. File name completion? Yup. Terminal-independent? Certainly. Lisp? What's that?

Fair question. There is a plethora of high-level languages available in the free software world; Perl and Python both have huge followings, Tcl is still widely used, and Java is inexpensive and widely available (even if the precise status of Java™'s freeness is too complex for me to care enough to know about). Lots of other interesting languages have Linux-based implementations and active developer communities too - Ruby was featured here on Advogato recently, Caml is pretty popular with the people who know about it, and by the simple act of listing the ones I can remember I'm opening myself up to vast numbers of responses from people annoyed that I've forgotten their favourite. So, the stated intention that Lisp be supported everywhere may these days be as relevant as the original intention that GNU support MIT Chaosnet, or be initially targeted at 68k machines, because we have this wide choice of other expressive languages that let us get the job done more quickly than C.

But: systems programming?

All of these language implementations - or all that I've seen - are second-class citizens. They're implemented on a C substrate. When you call stat() in Perl, it doesn't make a system call. It calls some C code, and that calls the shared C library, and that makes a system call. The Perl interpreter marshals its arguments and massages them into the C calling convention, and the C library insulates us from the kernel-specific details, allowing kernel developers to make changes as necessary when things like 64-bit uids or files bigger than 2Gb come along.
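
For the curious, here is a minimal sketch of those layers as seen from C, using getpid() because its calling convention is trivial (Linux-specific; syscall() and SYS_getpid are the raw kernel interface):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* the portable route: the C library wrapper... */
    printf("via libc:    pid %ld\n", (long) getpid());

    /* ...and the kernel entry point underneath, with no insulation at all */
    printf("via syscall: pid %ld\n", (long) syscall(SYS_getpid));
    return 0;
}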

Most of this is basically inescapable. The C library and its friends implement POSIX - or whatever it's called today - whereas the kernel implements whatever it feels like. POSIX is documented. The C calling convention is unlikely to be changed any time soon. Using this sounds like a net win.

Now then. My preferred non-C language implementation on Linux comes with a native-code compiler. I can write code in this language that runs within n% of the speed of C (n == "small enough not to care"). If I use appropriate definitions to describe the C-style functions I want to call, it can generate code which calls them just like that. Inline. No timewasting subroutine calls, no copying stuff around - just get the address using dlsym() or equivalent and jump to it.
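
Here's roughly what that looks like, written in C for illustration (link with -ldl; the names and the cast are for the example only - in the Lisp implementation this is what the foreign-function definitions boil down to):

#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    /* find open() in the already-loaded C library */
    void *self = dlopen(NULL, RTLD_LAZY);
    int (*open_fn)(const char *, int) =
        (int (*)(const char *, int)) dlsym(self, "open");

    if (open_fn != NULL) {
        int fd = open_fn("/etc/passwd", 0);   /* 0 happens to be O_RDONLY... */
        printf("fd = %d\n", fd);
    }
    return 0;
}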

There's still a problem with symbolic constants. We can find the address of open() easily enough, but where do we get the value of O_RDONLY? According to the manual page (and, I'm assuming from that, POSIX), we should have this value if we #include <sys/types.h>, <sys/stat.h> and <fcntl.h>. OK. Cool. Now all we need to do is (a) implement cpp, and (b) find out all the symbols that the C compiler has defined when it starts up (echo | gcc -v -E -).

There's another problem with structures. Now we need to implement a complete C parser as well as cpp, and it needs to know the platform's rules for structure packing.

Happily neither of these is insuperable, but it gets ugly. We can write an autoconf-ish thing that runs a C program at build time to get all these values from the actual C compiler. Messy, but it works.
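
Such a generator could be as simple as this (the output format is invented for the example): compile it with the platform's C compiler, run it, and parse what it prints.

#include <stdio.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main(void)
{
    /* symbolic constants: let cpp and the system headers do the work */
    printf("constant O_RDONLY %d\n", O_RDONLY);
    printf("constant O_WRONLY %d\n", O_WRONLY);
    printf("constant O_CREAT %d\n", O_CREAT);

    /* structure layout: sizes and member offsets, packing rules included */
    printf("struct stat size %lu\n", (unsigned long) sizeof(struct stat));
    printf("member stat st_mode %lu\n",
           (unsigned long) offsetof(struct stat, st_mode));
    printf("member stat st_size %lu\n",
           (unsigned long) offsetof(struct stat, st_size));
    return 0;
}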

OK, we've called the function; it should have opened a file. The return value will tell us, and errno will tell us what went wrong if we didn't. That's an integer, right?

Nope. It's a preprocessor macro. Moreover, it's a preprocessor macro which expands into something with leading underscores, which means it almost certainly won't stay like that. How the dickens am I supposed to get at that, then? I have to write a C function that returns the value of errno, and call that function every time I want to find out what it's set to. To find a value which was actually returned by the system call, and subsequently squirreled out of sight by the syscall glue, I have to write and call a function. Do I feel like attention to non-C languages is low on the priority list for libc developers? Yes, maybe I do.
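
The wrapper itself is trivial, which is exactly what makes needing it so galling. Something like this (the name is invented):

#include <errno.h>

/* a real function, with an address dlsym() can find, that returns the
   current value of the errno macro */
int my_errno(void)
{
    return errno;
}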

My opinion: if you're writing an API -

  • Consider keeping symbolic constants in a file with a documented, easily parseable format, and autogenerating header files from that

  • Consider avoiding passing struct pointers unless they really do make life a lot simpler. If you do need to, think about providing a file in an easily parseable format that lists the names of structures and their members

  • If you provide functionality as macros "for speed", also provide an actual function that does the same thing. ncurses does this, and according to the header files even manages it automatically. Can't be that hard, then

  • I didn't mention this before, but it's another point: think carefully before creating an API that requires callbacks; calling back from C to (other language) is often going to be a tad trickier than calling from (other language) to C. If you decide to do it anyway, include a spare argument for 'client data', so that foreign language interfaces can have all callbacks call one C function which switches on the client_data to dispatch to the actual (other language) function that wanted to be called. (A sketch follows this list.)
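
Here is a minimal sketch of that last point (all the names are invented): the library stores and forwards client_data without ever interpreting it, and one C trampoline serves every foreign-language callback.

#include <stdio.h>

/* the recommended shape: every callback carries a void *client_data */
typedef void (*event_cb)(int event, void *client_data);

struct handler { event_cb cb; void *client_data; };

/* library side: pass the pointer through untouched */
static void fire(struct handler *h, int event)
{
    h->cb(event, h->client_data);
}

/* binding side: one C function for all callbacks; client_data says
   which foreign-language function actually wanted to be called */
static void trampoline(int event, void *client_data)
{
    int id = *(int *) client_data;
    printf("dispatching event %d to foreign callback #%d\n", event, id);
}

int main(void)
{
    int callback_id = 42;   /* stand-in for a key into the binding's table */
    struct handler h = { trampoline, &callback_id };
    fire(&h, 7);
    return 0;
}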

I'd be interested to hear what other issues people have come up against when designing APIs that non-C-language users can use, or in trying to use other people's APIs from non-C languages. Garbage collection and threading are two possible other areas of contention. Discuss.

(Footnote: the language I refer to is of course Common Lisp, and the implementation is CMUCL. More detail on my diary page)


Macro expansion considered harmful, posted 27 Feb 2000 at 09:10 UTC by raph » (Master)

I can certainly see the problems you're facing. I'd like to see the language binding people get together and produce a document that contains guidelines for API designers about how to make life easier. One thing that comes to mind: if you're going to have structures, provide set/get functions to muck with the structure internals. From what I've seen, it's easier for language bindings to deal with functions than with data structure internals.
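
Something along these lines, say (the type and functions are made up for illustration); since the binding only ever calls functions, the structure layout can change without breaking it:

#include <stdlib.h>

/* opaque to the binding: only the library knows the layout */
struct point { int x, y; };

struct point *point_new(int x, int y)
{
    struct point *p = malloc(sizeof *p);
    if (p) { p->x = x; p->y = y; }
    return p;
}

int  point_get_x(const struct point *p) { return p->x; }
void point_set_x(struct point *p, int x) { p->x = x; }
void point_free(struct point *p)         { free(p); }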

That said, I want to take your basic argument one step further. I believe that, in nearly all cases, macro expansion is a bad way to get things done. It's seductively easy and relatively powerful, and appears to increase the level of abstraction. However, a lot of the time, it just doesn't work.

More specifically, it works well enough for the narrow purpose for which it was originally intended, but fails pretty badly in any broader context. A particular concern of mine is error reporting. Macro expansion violates one of the main principles of cyberbitchology, which is to catch errors as early as possible. Once you've expanded the macro, you've lost the original context containing the error. Hence, inscrutable error messages.

Here's a simple example off the top of my head:

#include <glib.h>

int main () {
  int i;
  char *p = g_new (i, 1);
}

Try to compile it, and what do you get? "5: parse error before ')'". Not "g_new's first argument must be a type, not a variable". Ok, so this isn't very horrifying, so try making a random error in a LaTeX file. Better yet, an autoconf script :)

In addition to error reporting, macros also make analysis of files intractable. Besides simply being another layer of indirection, it can be difficult to the point of the halting problem to invert what the macros are doing (although I don't think cpp is quite Turing complete, it's still hard).

Extensive use of preprocessor macros is also notorious for making debugging hard. This is basically an insoluble problem. The compiled code is the result of macro expansion, but the source the debugger is trying to show you is the input. This is one good reason to use preprocessor macros as sparingly as possible.

Next time you're tempted to use macros to solve a problem, think twice. Have you given up the possibility of clear error reporting? Are there other contexts, such as analysis, for which the macros are making life harder?

Not all macros are this broken, posted 27 Feb 2000 at 10:24 UTC by mjs » (Master)

Macro systems like that of Lisp don't suffer from the kinds of problems raph describes. That's because the surface syntax is trivial (it's almost impossible to screw up paren matching with a good [read `Emacs'] editor), and the macro-expansion language, being the Lisp language itself, is powerful enough to do any additional error-checking that's required.

On the other hand, the Lisp macro system _can_ make complex systems hard to understand if overused. To understand a complex Lisp system, it is widely reported, one must first understand its macrology.

macro expansion, posted 27 Feb 2000 at 19:40 UTC by DaveD » (Observer)

Couple of questions for Raph:

Would arguments against macros apply to codes which by necessity have deeply nested, highly iterative loops? For example, shoving function calls into the inner loop of an O(n^3), O(n^4) algorithm tends to degrade performance. On the other hand, writing such functionality explicitly may add several pages of printout to a function already a dozen or more (say) pages long. Using macro expansion increases code readability without sacrificing performance.

If such macros were first implemented as functions, exhaustively tested, then turned into macros, what is the harm?
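
For concreteness, here is a small sketch of the pattern I mean, using Gaussian elimination's triply nested loop (the function/macro pair is invented for the example):

#include <stdio.h>

/* first written and tested as a function... */
static double axpy(double a, double x, double y)
{
    return a * x + y;
}

/* ...then turned into a macro for the O(n^3) inner loop */
#define AXPY(a, x, y) ((a) * (x) + (y))

int main(void)
{
    double m[3][3] = { {2, 1, 1}, {4, 3, 3}, {8, 7, 9} };
    int i, j, k;

    /* sanity check: the macro agrees with the tested function */
    if (axpy(2.0, 3.0, 4.0) != AXPY(2.0, 3.0, 4.0))
        return 1;

    /* forward elimination: the inner statement runs O(n^3) times */
    for (k = 0; k < 2; k++)
        for (i = k + 1; i < 3; i++) {
            double f = -m[i][k] / m[k][k];
            for (j = k; j < 3; j++)
                m[i][j] = AXPY(f, m[k][j], m[i][j]);
        }

    printf("last pivot m[2][2] = %g\n", m[2][2]);   /* prints 2 */
    return 0;
}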

inline functions?, posted 28 Feb 2000 at 01:49 UTC by dan » (Master)

Would arguments against macros apply to codes which by necessity have deeply nested, highly iterative loops?

If that's the only way to do it, that's the only way to do it. However, I would want to ask the obvious questions -

  • Is the O(n^3) algorithm really necessary?
  • What about declaring the function as inline? A moderately smart (as opposed to the mythical "sufficiently smart") compiler ought to produce the same result with an inline function as with a macro (see the sketch after this list)
  • Where is your profiling data that demonstrates the need for doing this?
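
To illustrate the inline point, a sketch (the function is invented; inline here is the GNU C extension): with optimisation enabled, gcc will normally expand the function at the call site just as it would the macro, but with argument type checking thrown in for free.

#include <stdio.h>

/* the macro version... */
#define DOT3_M(a, b) ((a)[0]*(b)[0] + (a)[1]*(b)[1] + (a)[2]*(b)[2])

/* ...and the inline function a moderately smart compiler treats the same */
static inline double dot3(const double *a, const double *b)
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

int main(void)
{
    double u[3] = {1, 2, 3}, v[3] = {4, 5, 6};
    printf("%g %g\n", DOT3_M(u, v), dot3(u, v));   /* 32 32 */
    return 0;
}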

Not all of these are necessarily always applicable - you may be targeting users with stupid compilers, for example. But do test first. Premature optimization is the root of all evil, etc etc

A smaller cpp, posted 28 Feb 2000 at 03:01 UTC by mbp » (Master)

Interestingly, the Plan 9 C compiler written by Ken Thompson accepts a somewhat smaller language than ANSI C.

Deeply nested loops, posted 28 Feb 2000 at 16:28 UTC by DaveD » (Observer)

deeply nested, highly iterative loops?

If that's the only way to do it, that's the only way to do it. However, I would want to ask the obvious questions -

  • Is the O(n^3) algorithm really necessary?
Well, yes. Gaussian elimination is O(n^3). This is a provable fact. And at the end of the day, for all the fancy SOR/FFT/Multigrid techniques, sometimes GE is the tool left in the box.

  • What about declaring the function as inline? A moderately smart (as opposed to the mythical "sufficiently smart") compiler ought to produce the same result with an inline function as with a macro

All the documentation I have read concerning inline functions states that inline is a *suggestion* which the compiler is free to ignore at will. I welcome comments to the contrary. (I have Edith Hamilton for mythology.)

  • Where is your profiling data that demonstrates the need for doing this?

In the sense that it is possible to compute exactly how many times an arbitrary operation is performed in an algorithm such as GE, why is profiling data relevant?

The perils of macros, posted 28 Feb 2000 at 20:08 UTC by alan » (Master)

There are several reasons for macro abuse in the C library. The inability of the C standards committee to use a crystal ball is one highly forgivable case.

In general inline is better, but even now gcc's inline doesn't always generate code as good as macro tricks, especially ones that rely on constant evaluation/optimisation.
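
A sketch of the kind of trick in question (the macro is invented; __builtin_constant_p is a real GCC extension): when the argument is a compile-time constant the whole expression folds away at compile time, which no plain function can promise.

#include <ctype.h>
#include <stdio.h>

/* constant arguments are folded at compile time; anything else falls
   back to the library function */
#define TOLOWER(c) \
    (__builtin_constant_p(c) \
        ? (((c) >= 'A' && (c) <= 'Z') ? (c) - 'A' + 'a' : (c)) \
        : tolower(c))

int main(void)
{
    int ch = 'Q';
    printf("%c %c\n", TOLOWER('A'), TOLOWER(ch));   /* prints: a q */
    return 0;
}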

Language bindings for anything but C are hard, posted 29 Feb 2000 at 18:10 UTC by monniaux » (Journeyer)

I am the original author of MlGtk, an interface between Gtk+ and Objective CAML. I would perhaps suggest that APIs be specified in the following fashion:

  • the use of macros would be restricted;
  • the header files would contain additional information.
As for the second point: for instance, functions taking structure pointers as arguments usually fall into one of three cases:
  • the structure is used by the library after the call, so it should be marked as "non garbage-collectable";
  • the structure is released by the library after the call (clear the mark);
  • the structure is not used after the call.
The same holds for strings. Currently, one has to sift carefully through the documentation to understand which case a function is in. Such things should be put in normalized comments after the declarations.
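
For instance, something like this (the functions and the annotation vocabulary are invented; only the idea matters):

struct buffer;

/* case 1: the library keeps the pointer after the call */
void queue_push(struct buffer *b);            /* @b: keep */

/* case 2: the library releases the pointer */
void buffer_unref(struct buffer *b);          /* @b: free */

/* case 3: the pointer is not used after the call */
int buffer_length(const struct buffer *b);    /* @b: borrow */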

Wrapper lib, posted 1 Mar 2000 at 05:38 UTC by hp » (Master)

A nice practical solution to this would be a small library that contains function wrappers for things like the value of errno, and comes with an easily parseable file format (similar to the GTK+ defs files) that lists symbolic constants and such. Many POSIX language bindings could then use this little library.
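
A couple of representative entries in such a library might look like this (the names are invented); note that it would also cover the macros-as-functions point from the original article:

#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>

/* errno is a macro, so give it a callable face */
int posixwrap_errno(void)
{
    return errno;
}

/* S_ISDIR and friends are macros too: wrap them as real functions */
int posixwrap_s_isdir(mode_t mode)
{
    return S_ISDIR(mode);
}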

A similar thing is basically needed for GTK+, except that the functions can go in GTK itself (avoiding the need for the library).
