Recent blog entries for akihabara

Spent the last week adding preprocessor testcases for every bit of odd behaviour I can dream up. Tidying up the #define directive parser at the moment, removing a malloc performance bottleneck. Zack's just completed a nice tidy-up of the macro expanding code, removing excessive recursive calls. I suspect the current code is now faster than the old cpplib and cccp; certainly there is little reason for it to be slower.
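
For flavour, here's the sort of corner case such a testsuite has to pin down (my own illustrative example, not one of the actual tests): self-referential macros must be expanded exactly once and then left alone.

    /* A self-referential macro is not re-expanded during rescanning.  */
    #define foo (4 + foo)
    int foo = foo;        /* expands to: int foo = (4 + foo); */

    /* The same marking stops recursion in function-like macros.  */
    #define f(x) x + f(x)
    int y = f(2);         /* expands to: int y = 2 + f(2); */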

We should be able to scrap support for -traditional (though not -Wtraditional, I expect) since we're now bundling an old preprocessor, tradcpp, just for that job. A token-based preprocessor proved to be too fundamentally different from K&R for the integration to be sustainable, and it was getting in the way.

Cpplib is beginning to look quite clean in most places, and should not be too hard to read. Almost at the stage of being a piece of code to be proud of. A notable exception is the lexer, which still needs a lot of cleaning up and work on improving performance. Lexers tend to be ugly by their very nature, though.

Hopefully we can soon start to think about front-end integration and pre-compiled headers, which will be fun to work on and should give us some really nice performance improvements. The C and C++ front ends should be able to all but abandon their existing lexers, save for odd corners like interpreting numbers and merging adjacent string literals.
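
To give a concrete feel for what stays behind (my examples, not from any patch), string concatenation and the meaning of numeric literals remain front-end jobs:

    /* Adjacent string literals are merged after preprocessing, so the
       front end still has to do it.  */
    const char *msg = "Hello, " "world" "\n";   /* one string */

    /* The preprocessor only hands back a number token; deciding that
       this is an unsigned long is up to the front end.  */
    unsigned long n = 0x1fUL;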

In a few days I'm going to be offline for a month or three, so Zack will be working on it alone for a while. I think he's forgotten his Advogato password, though <g>.

Finally got the new expander and lexer live today. A lot of cleanup and optimisation remains to be done, but the immediate priority is comprehensive testsuites so we can be sure not to introduce regressions when improving the code base.

-traditional is not supported fully at present, but we're working on a solution.

At last, the new macro stuff is nearly done, thanks to some work by Zack yesterday. We bootstrap and pass the tests in the testsuite, and are more precise about corner cases than before. Just the -traditional stuff to go, and we should be able to apply it to CVS. If you use non-ISO stuff like the GNU ## extension to delete the previous token, or token pasting that produces a non-token (remember, we're grown-up and token-based now), you'll get warnings telling you to clean up your act.
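
Concretely, these are the kinds of construct that draw a warning (my own illustrative examples):

    /* GNU extension: the ## deletes the preceding comma when 'args'
       is empty.  Non-ISO, so it warns.  */
    #define eprintf(format, args...) fprintf(stderr, format , ## args)

    /* Pasting tokens that don't form a single valid token also warns:
       '1' pasted with '+' is not one preprocessing token.  */
    #define GLUE(a, b) a ## b
    GLUE(1, +)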

A lot of ugliness remains, but that will be easier to clean up once we're happy we've got working code and have binned the old text-based expander. Many areas are much cleaner already; for example, the three places that need to parse assertions (#assert, #unassert and #if/#elif) all use the same code now, rather than each having its own slightly different version to cope with the variations in syntax.
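
For anyone who hasn't met assertions, the shared answer syntax across those three contexts looks like this (standard GNU cpp assertion syntax):

    #assert machine(i386)       /* assert an answer */
    #unassert machine(i386)     /* retract it again */

    #if #machine(i386)          /* test it in a conditional */
    /* i386-specific code */
    #endif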

The token-based macro expansion process is quite simple in concept, but the reality is a bit messy and hard to understand from the source code. I'll try to clean it up and comment it once we're sure it's working, and have it in CVS.

After -traditional, the next stage is probably to get cpp re-integrated with the front ends, as a library and not a separate process. This will cut out a lot of overhead: an extra exec(), writing out the preprocessed file, the front end reading it in again, and re-tokenising.

Putting the finishing touches on a macro expander that uses the new lexer. Like the lexer, it is token-based. The current lexer and macro expander are both text-based.

Getting this to work has been a very frustrating experience. Macro expansion is a hairy and convoluted process, and stringification and token-pasting just add to the confusion. A dense and strangely-worded C99 specification doesn't help :-)
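
A taste of why it's confusing (textbook examples, nothing cpplib-specific): stringification sees the unexpanded argument, ordinary substitution sees the expanded one, and pasted operands aren't expanded first either.

    #define N 42
    #define STR(x)  #x
    #define XSTR(x) STR(x)

    const char *a = STR(N);     /* "N"  - argument stringified unexpanded */
    const char *b = XSTR(N);    /* "42" - extra level expands N first */

    #define CAT(a, b) a ## b
    int CAT(var, 1) = 0;        /* declares var1 */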

We just have a single token list, and the lexer lexes all tokens in the next logical line into this list. However, a function-like macro invocation can cross multiple logical source lines, so in that case we append to the list rather than writing over the original tokens and causing chaos. The catch is that appending can cause a realloc of the tokens (which are stored consecutively in memory), while arguments to macros are stored as lists of pointers to the original tokens (which needn't be consecutive), so those pointers need to be fixed up whenever we realloc.

Other things still to do include fixing bogus line numbers in errors and in the final output, and squeezing tokens back into 16 bytes on both 32-bit and 64-bit architectures. We also need to run it against a macro abuser like glibc to try to turn up obscure cases we've missed.
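
Roughly, the fix-up looks something like this (a minimal sketch with hypothetical names, not the actual cpplib structures):

    #include <stdlib.h>

    /* Tokens for the current logical line live in one growable array;
       macro arguments hold pointers into it, so growing the array can
       mean re-basing those pointers.  */
    typedef struct { unsigned int type; const char *val; } token;

    typedef struct
    {
      token *tokens;            /* contiguous token storage */
      size_t count, cap;        /* tokens used / allocated */
    } token_list;

    /* Make room for EXTRA more tokens.  If realloc moves the block,
       rebase each argument pointer against the new base.  */
    static int
    grow_and_fixup (token_list *list, size_t extra,
                    token **arg_ptrs, size_t nargs)
    {
      size_t *offs = malloc (nargs * sizeof *offs);
      token *new_tokens;
      size_t i;

      if (!offs)
        return -1;
      for (i = 0; i < nargs; i++)
        offs[i] = arg_ptrs[i] - list->tokens;

      new_tokens = realloc (list->tokens,
                            (list->cap + extra) * sizeof (token));
      if (!new_tokens)
        {
          free (offs);
          return -1;
        }
      list->tokens = new_tokens;
      list->cap += extra;

      for (i = 0; i < nargs; i++)
        arg_ptrs[i] = list->tokens + offs[i];
      free (offs);
      return 0;
    }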

Ah, almost forgot, the gem of -traditional support. Not sure what's best there; I think getting everything right would need a separate pre-pass that does traditional macro text splicing. However, that would lose line and column information and just be a maintenance headache. Probably it's best to support everything we reasonably can in the token-based environment, and drop the really weird stuff like half-strings and macro expansion within strings.
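
For anyone who hasn't met them, this is the sort of K&R-ism in question (illustrative; the behaviour shown is the traditional one, not ISO C):

    /* Under -traditional, macro parameters are substituted even inside
       string literals; ISO C leaves the string alone.  */
    #define WARN(msg) printf("warning: msg\n")
    WARN(oops);
    /* traditional: printf("warning: oops\n")
       ISO:         printf("warning: msg\n")   */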
