Older blog entries for wingo (starting at number 356)

/usr/local, fedora, rpath, foo

Good evening intertubes,

I wish to broach a nerdy topic, once more. May my lay readers be rewarded in some heaven for your earthly sufferings. I'll see you all at the bar.

problem statement

For the rest of us, we have a topic, and that topic is about autotools, software installation, -rpath, and all that kind of thing.

As you might be aware, last month we released Guile 2.0.0. Of course we got a number of new interested users. For some of them things went great. It all started like this:

wget http://ftp.gnu.org/gnu/guile/guile-2.0.0.tar.gz
tar zxvf guile-2.0.0.tar.gz
cd guile-2.0.0
./configure
make
sudo make install

Up to here, things are pretty much awesome for everybody. Obviously the next thing is to run it.

$ guile
GNU Guile 2.0.0
Copyright (C) 1995-2011 Free Software Foundation, Inc.

Guile comes with ABSOLUTELY NO WARRANTY; for details type `,show w'.
This program is free software, and you are welcome to redistribute it
under certain conditions; type `,show c' for details.

Enter `,help' for help.
scheme@(guile-user)> "Hello, World!"
$1 = "Hello, World!"

Sweetness! Let's start going through the manual. (Time passes.) You get to the point of compiling a short program that links to Guile:

gcc -o simple-guile simple-guile.c \
  `guile-config compile` `guile-config link`

And so far so good! But alack:

$ ./simple-guile
./simple-guile: error while loading shared libraries:
    libguile-2.0.so.22: cannot open shared object

Pants! What is the deal?

righteous indignation

The deal is, you appear to be on a Fedora or some other Red Hat-derived system. These systems add /usr/local/bin to the PATH, so the guile-config call succeeds. /usr/local/lib is in the link-time path too, so it finds libguile-2.0.so. But /usr/local/lib is not in the runtime library lookup path by default. Hence the error above.
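If you just want your binary to run on such a system, a couple of workarounds exist. A sketch, assuming the usual glibc dynamic linker (the file name local.conf is just a convention):

```shell
# Tell the dynamic linker about /usr/local/lib (persistent; needs root).
echo /usr/local/lib | sudo tee /etc/ld.so.conf.d/local.conf
sudo ldconfig

# Or, per-session, without touching system configuration:
export LD_LIBRARY_PATH=/usr/local/lib
```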

This, my friends, is a bug. It is a bug in Fedora and similar systems. This recipe works fine on Debian. It works fine with the GNU toolchain, as configured out-of-the-box.

The only reason I can think of for Fedora to break this is their lib versus lib64 split. It would be strange to say, "32-bit libraries go in /usr/lib, but maybe or maybe not in /usr/local/lib." And in fact the FHS explicitly disclaims any say over what happens in /usr/local.

But the fact is, /usr/local is the default location for people to compile, and the $PATH and other settings indicate that Fedora folk know this. I can only conjecture that the situation is this way in order to preserve compatibility with legacy 32-bit closed-source binary apps that cannot be recompiled to politely install their libraries elsewhere.

This decision costs me time and energy in the form of bug reports. Thanks but no thanks, whoever made this decision.

solutions?

It's true, this is a specific case of a more general issue. If you installed instead to /opt/guile, you would have to adjust your PATH to be able to run guile-config, or your PKG_CONFIG_PATH if you wanted to use the underlying pkg-config file instead. So you do that:

export PKG_CONFIG_PATH=/opt/guile/lib/pkgconfig
gcc -o simple-guile simple-guile.c `pkg-config --cflags --libs guile-2.0`

Awesome. You run it:

$ ./simple-guile

Hey awesome it works! Actually that's not true. Or rather, it's only true if you're on some other system, like a Mac. If you're on GNU, you get, again:

./simple-guile: error while loading shared libraries:
   libguile-2.0.so.22: cannot open shared object

Willikers! The same badness! And in this case, we can't even blame Fedora, as no one would think to add /opt/guile/lib to the runtime library path. (Well, actually, Apple did; it linked the binary to the link-time location of the library. That has other problems, though.)

The deal is, that we ourselves need to add /opt/guile/lib to the runtime library lookup path. But how we do that is compiler-specific; and in any case it's not necessary if we're installing to somewhere that's already in the system's runtime search path.

So---and this is the point I was coming to, besides ripping on Fedora's defaults---you need a build-time determination of what to add to the compilation line to set the rpath.

That, my friends, is the function of AC_LIB_HAVE_LINKFLAGS, from the excellent gnulib. It takes some link-time flags and spits out something that will also set the run-time library search path.
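If you are not using gnulib, the flag itself -- for GCC and GNU ld, at least; -Wl,-rpath is toolchain-specific -- looks something like this, continuing the /opt/guile example from above:

```shell
# A sketch: bake the runtime search path into the binary by hand.
export PKG_CONFIG_PATH=/opt/guile/lib/pkgconfig
gcc -o simple-guile simple-guile.c \
  `pkg-config --cflags --libs guile-2.0` \
  -Wl,-rpath,/opt/guile/lib
```

Now the binary itself carries /opt/guile/lib in its runtime search path; readelf -d ./simple-guile should show an RPATH (or RUNPATH) entry.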

why haven't I seen this?

Amusingly I had not personally encountered this issue because all of my projects use libtool for linking, which does add -rpath as appropriate. Libtool-induced blissful ignorance: who'd have thought it possible?

Having said my message, I now entreat my more informed readers to leave corrections below in the comments. Thanks!

Syndicated 2011-03-18 23:52:21 from wingolog

ports, weaks, gc, and dark matter

Just wanted to drop a note about some recent bug-hunting. I was experiencing memory leaks in a long-running web server, and, after fixing some port finalization leaks, ran into a weird situation: if I did a (gc) after every server loop, the memory usage (measured by mem_usage.py) stayed constant. But if I just let things run, memory usage grew and grew.

I believe that the problem centered on the weak ports table. Guile maintains a global hash table of all open ports, so that when the process exits, all of the ports can be flushed. Of course we don't want to allow that table to prevent ports from being garbage collected, so it is a "weak" table.

In Guile, hash tables are implemented as small objects that point to a vector of lists of pairs (a vector of alists). For weak hash tables, the pairs have disappearing links: their car or cdr, depending on the kind of table, will be cleared by the garbage collector when the pointed-to object becomes unreachable.

And by "cleared", I mean set to NULL. But the rib stays around, and the vertebra stays in the bucket alist. So later when you traverse a weak hash table, looking for a value, you lazily vacuum out the dead entries.

Dead entries also get vacuumed out when the hash table grows or shrinks such that it should be resized. Guile has an internal table of roughly doubling prime sizes, and when hash table occupancy rises to 90% of the size, or falls below 25%, the hash table is resized up or down. So ideally there are not many collisions.
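From Scheme the observable behavior is simple enough. A sketch -- make-weak-value-hash-table and (gc) are standard Guile, though whether a given (gc) call actually collects the value is up to the collector:

```scheme
(define table (make-weak-value-hash-table))
(hash-set! table 'port-like (list 'some 'value)) ; value held only weakly
(hash-ref table 'port-like)  ; the value, while it is still alive
(gc)
;; Once the list is unreachable and collected, the entry's value slot
;; is nulled out and lookup returns #f -- but the rib itself lingers
;; until some later traversal or rehash vacuums it.
```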

So, with all of that background information, can you spot the bug?

Well. I certainly couldn't. But here's what I think it was. When you allocate a port, it adds an entry to this weak hash table, allocating at least 4 words and probably more, when you amortize over rehashings. When GC runs and the port is no longer reachable, the port gets collected, and the weak entry nulled out---but the weak entry is still there. Allocation proceeds and your hash table gains in occupancy, vacuuming some slots but, over time, increasing occupancy. Some entries in the hash table might actually be to unreachable ports that haven't been collected yet, for whatever reason.

At some point occupancy increases to the rehashing level. A larger vector is allocated and repopulated with the existing values, vacuuming ribs and vertebrae for dead references in the process. Overall occupancy is lower, but not so much lower as to trigger a rehashing on the low-water-mark side. The process repeats, leading to overall larger memory usage.

You wouldn't think it would be that bad though, would you? Just 32 bytes per port? Well. There are a couple of other things I didn't tell you.

Firstly, port objects are more than meets the eye: the eye of the garbage collector, that is. Besides the memory that they have that GC knows about, they also in general have iconv_t pointers for doing encoding conversions. Those iconv_t values hide kilobytes of allocation, on GNU anyway. This allocation is indeed properly reclaimed on GC---recall that the web server was not leaking when (gc) was run after every request---but it puts pressure on the heap without the GC knowing about it.

See, the GC only runs when it thinks it's necessary. Its idea of "necessary" depends, basically, on how much has been allocated since it last ran. The iconv_t doesn't inform this decision, though; to the GC, it is dark matter. So it is possible for the program to outrun the GC for a while, increasing RSS in the part of the heap the GC doesn't scan or account for. And when it is later freed, you have no guarantees about the fragmentation of those newly-allocated blocks.

I think it was ultimately this, that the GC wouldn't run for a while, we would reach the rehashing condition before GC realized the ports weren't accessible, and the process repeated.

This problem was exacerbated by what might be a bug in Guile. In Scheme, one doesn't typically open and close ports by hand. You use call-with-input-file, or with-output-to-string, or other such higher-order procedures. The reason is that you really want to make sure to delimit the lifetime of, for example, open files from the operating system. So these helper procedures open a port, call a function, close the port, then return the output string or the result from the procedure call or whatever. For the laity, it's like Python's with statement.
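The idiom looks like this; the port exists only for the dynamic extent of the call:

```scheme
(call-with-output-string
  (lambda (port)
    (display "Hello, " port)
    (display "World!" port)))
;; => "Hello, World!"
```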

In the past, string ports did not have much state associated with them, so it wasn't necessary to actually close the port when leaving, for example, with-output-to-string. But now that Guile does unicode appropriately, all ports might have iconv_t descriptors, so closing the ports is a good idea. Unfortunately it's not a change we can make in the 2.0 series, but it will probably land in 2.2.

Well, what to do? As you can tell by the length of this entry, this problem bothered me for some time. In the end, I do think that open ports are still a problem, in that they can lead to an inappropriately low rate of GC. But it's the interaction with the weaks table---remember Alice?---that's the killer. GC runs, you collect the ports, but memory that was allocated when the ports were allocated (the rib and vertebra) stays around.

The solution there is to fix up the weak hash tables directly, when the GC runs, instead of waiting for lazy fixup that might never come until a rehash. But the Boehm-Demers-Weiser collector that we switched to doesn't give you hooks that are run after GC. So, game over, right?

Heh heh. Hie thee hither, hackety hack; sing through me, muse in a duct-tape dress. What we do is to allocate a word of memory, attach a finalizer, and then revive the object in its finalizer. In that way every time the object is collected, we get a callback. This code is so evil I'm going to paste it here:

static void
weak_gc_callback (void *ptr, void *data)
{
  void **weak = ptr;
  void *val = *weak;
  
  if (val)
    {
      void (*callback) (SCM) = data;

      GC_REGISTER_FINALIZER_NO_ORDER
         (ptr, weak_gc_callback, data, NULL, NULL);
      
      callback (PTR2SCM (val));
    }
}

static void
scm_c_register_weak_gc_callback (SCM obj, void (*callback) (SCM))
{
  void **weak = GC_MALLOC_ATOMIC (sizeof (void**));

  *weak = SCM2PTR (obj);
  GC_GENERAL_REGISTER_DISAPPEARING_LINK (weak, SCM2PTR (obj));

  GC_REGISTER_FINALIZER_NO_ORDER
     (weak, weak_gc_callback, (void*)callback, NULL, NULL);
}

And there you have it. I just do a scm_c_register_weak_gc_callback (table, vacuum_weak_hash_table), and that's that.

This discussion does have practical import for readers of this weblog, in that now it shouldn't die every couple days. It used to be that it would leak and leak, and then stop being able to fork out to git to get the data, leading to a number of interesting error cases, but unfortunately none that actually killed the server. It would accept the connection, go and try to serve it, fail, and loop back. It's the worst of all possible error states, if you were actually trying to read this thing; doubtless some planet readers were relieved, though :)

Now with this hackery, I think things are peachy, but time will tell. Tekuti is still inordinately slow in the non-cached case, so there's some ways to go yet.

Hey I'm talking about garbage collectors, yo! Did I mention the most awesome talk I saw at FOSDEM? No I didn't, did I. Well actually it was Eben Moglen's talk, well-covered by LWN. No slides, just the power of thought and voice. I think if he weren't a lawyer he'd make a great preacher. He speaks with the prophetic voice.

But hey, garbage collectors! The most awesome technical talk I saw at FOSDEM was Cliff Click's. It was supposedly about "Azul's foray into Open Source" [sic], but in reality he gave a brief overview of Azul's custom hardware -- their so-called "Vega" systems -- and then about how they've been working on moving downmarket to X86 machines running the Linux kernel.

Let me back up a bit and re-preach what is so fascinating about their work. Click was one of the authors of Java's HotSpot garbage collector. (I don't know how much of this work was all his, but I'm going to personify for a bit.) He decided that for really big heaps--- hundreds of gigabytes---that what was needed was constant, low-latency, multithreaded GC. Instead of avoiding stop-the-world GC for as long as possible, the algorithm should simply avoid it entirely.

What they have is a constantly-running GC thread that goes over all the memory, page by page, collecting everything all the time. My recollection is a little fuzzy right now---I'm writing in a café without internet---but it goes a little like this. The GC reaches a page (bigpages of course). Let's assume that it knows all active objects, somehow. It maps a new page, copies the live objects there, and unmaps the old page. (The OS is then free to re-use that page.) It continues doing so for all pages in the system, rewriting pointers in the copied objects to point to the new locations.

But the program is still running! What happens if the program accesses an object from one of the later pages that points to an object from an earlier one that was already moved? Here's where things get awesome: the page fault accessing the old page causes the GC to fix up the pointer, right there and then. La-la-la-la-life goes on. Now that I'm back on the tubes, here's a more proper link.

OK. So Azul does this with their custom hardware and custom OS, great. But they want to do this on Linux with standard x86 hardware. What they need from the OS is the ability to quickly remap pages, and to have lower latency from the scheduler. Also there are some points about avoiding TLB cache flushes, and lowering TLB pressure due to tagged pointers. (Did you know that the TLB is keyed on byte locations and not word locations? I did not. They have some kernel code to mask off the bottom few bits.)

They want to get these hooks into the standard kernel so that customers running RHEL can just download their JVM and have it working, and to that end have started what they call the "Managed Runtime Initiative".

Basically, what this initiative amounts to is a code dump. The initial patches were totally panned, and when I mentioned this to Click, he said that maybe they should wait for the next "code dump". He actually used that term!

It's really a shame that they are not clueful about interacting with Free Software folk, because their code could help all language runtimes. In particular I wonder: the Boehm collector is a miracle. It's a miracle that it works at all, and furthermore that it works as well as it does. But it can't work optimally because it can't relocate its objects, so it can lead to heap fragmentation. But it seems that with these page-mapping tricks, the Boehm GC could relocate objects. It already knows about all pointers on the system. With some kernel support it could collect in parallel, remapping and compacting the heap as it goes. It's a daunting task, but it sounds possible.

Well. Enough words for one morning. Back to the hack!

Syndicated 2011-02-25 13:59:05 from wingolog

guile 2.0 is out!

Hear ye, hear ye: Guile 2.0 is unleashed upon the world!

Guile is a great implementation of Scheme. It's a joyful programming environment that gives the hacker expressive tools for growing programs.

This release improves Guile tremendously, adding a new compiler and virtual machine, and integrating more powerful hygienic macros in the core of Guile. We managed to pull all of this off with minimal incompatibilities with old code, and maximal awesomeness.

Guile 2.0 is a personal milestone as well, as it is my first stable release as a Guile co-maintainer. I would like to express special thanks to Ludovic Courtès, my co-maintainer, for companionship in the fantastic hack over the last few years. I would also like to express a strange thanks to two people whom I have never met in person, but that, through their code artifacts, really made Guile 2.0 what it is. To Keisuke Nishida, for his initial compiler and VM ten years ago, for teaching me how to write a compiler; and to R. Kent Dybvig, for his psyntax macro expander, for teaching me about macro expanders. Thanks!

Check the release announcement for a short summary of changes, or NEWS for all the delicious details. Start with the manual if you like; it's a great read. Then download and build the thing, and when next you start it up, just think: this could be the beginning of a beautiful hack!

Syndicated 2011-02-16 11:06:21 from wingolog

gnu at fosdem

Yes indeed, FOSDEM approaches, and this year with 100% less smoke. Rocking.

One novelty of this year is that the GNU project was allocated a DevRoom, where we've planned a whole series of talks.

This is a really fantastic opportunity to work on GNU, the social construct. GNU is largely a virtual organization at this point, with much esprit de corps but little real-world reification. This FOSDEM will be another pearl in the short string of recent hackers' meetings, but, unlike the rest, it's in a more ecumenical context of other groups. More outward-looking, is what I mean to say.

So you wanted to meet Bastien Guerry, Org-Mode hacker extraordinaire? He's there. He's speaking! Sweetness! What about Simon Josefsson, the GNU TLS maintainer? What ho! And Ralf Wildenhues, autoconf hero? A veritable hit parade, this room.

Alas, the room is marred slightly by a talk by yours truly, at 14h on Saturday, so either make sure you're there, or make sure you're not there, as preference dictates. The topic is Guile, of course; after a brief propagandistic schtick on how Guile is the knees of the bees, I will -- if things go well -- show some live hacking, just so folks get a feel about how it is to hack with Guile. Extending a web application while it's running. Because that's part of it, you know?

Maybe? Maybe you prefer to go to Dave Neary's talk, and I would understand. I would. But his room only holds so many, so when it's full, come visit us over at H.2214, the GNU room!

Syndicated 2011-02-01 21:31:04 from wingolog

"free as in beer"?

The phrase "free as in beer" makes no sense. It never did.

"Free as in free beer" only makes marginally more sense. I've been hearing it for years, and my mind seems to skip over the "free" and concentrate on the "beer", and beer is a good thing, right? (Though to be honest, the Free Software world's beer fixation is not one of its healthier characteristics.)

Furthermore, the existence of Free Beer, while delicious, does nothing to clarify matters, even in the presence of Irish Moss.

"Free beer" might have been amusing coming from RMS, but it's not useful. Let's leave "free beer" to the brewers, and stop using it for its explanatory power.

Syndicated 2011-01-17 15:45:33 from wingolog

types and the web

An essay expanding on the theme of types and the web. I wrote this for Guile's manual, but it applies generally, I think. The point about SXML can apply fruitfully to python as well. -- Andy

It is a truth universally acknowledged, that a program with good use of data types, will be free from many common bugs. Unfortunately, the common practice in web programming seems to ignore this maxim. This subsection makes the case for expressive data types in web programming.

By "expressive data types", I mean that the data types say something about how a program solves a problem. For example, if we choose to represent dates using SRFI 19 date records (see SRFI-19), this indicates that there is a part of the program that will always have valid dates. Error handling for a number of basic cases, like invalid dates, occurs on the boundary in which we produce a SRFI 19 date record from other types, like strings.
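Concretely, that boundary might look like this. A sketch: parse-iso-date is a made-up helper, but string->date is standard SRFI 19:

```scheme
(use-modules (srfi srfi-19))

;; Hypothetical helper: all error handling for bad date strings
;; happens here, at the boundary. The rest of the program only
;; ever sees valid date records (or #f).
(define (parse-iso-date str)
  (catch #t
    (lambda () (string->date str "~Y-~m-~d"))
    (lambda args #f)))
```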

With regard to the web, data types help in the two broad phases of HTTP messages: parsing and generation.

Consider a server, which has to parse a request, and produce a response. Guile will parse the request into an HTTP request object (see Requests), with each header parsed into an appropriate Scheme data type. This transition from an incoming stream of characters to typed data is a state change in a program---the strings might parse, or they might not, and something has to happen if they do not. (Guile throws an error in this case.) But after you have the parsed request, "client" code (code built on top of the Guile HTTP stack) will not have to check for syntactic validity. The types already make this information manifest.

This state change on the parsing boundary makes programs more robust, as they themselves are freed from the need to do a number of common error checks, and they can use normal Scheme procedures to handle a request instead of ad-hoc string parsers.

The need for types on the response generation side (in a server) is more subtle, though not less important. Consider the example of a POST handler, which prints out the text that a user submits from a form. Such a handler might include a procedure like this:

;; First, a helper procedure
(define (para . contents)
  (string-append "<p>" (string-concatenate contents) "</p>"))

;; Now the meat of our simple web application
(define (you-said text)
  (para "You said: " text))

(display (you-said "Hi!"))
-| <p>You said: Hi!</p>


This is a perfectly valid implementation, provided that the incoming text does not contain the special HTML characters <, >, or &. But this provision is not reflected anywhere in the program itself: we must assume that the programmer understands this, and performs the check elsewhere.

Unfortunately, the short history of the practice of programming does not bear out this assumption. A cross-site scripting (XSS) vulnerability is just such a common error in which unfiltered user input is allowed into the output. A user could submit a crafted comment to your web site which results in visitors running malicious Javascript, within the security context of your domain:

(display (you-said "<script src=\"http://bad.com/nasty.js\" />"))
-| <p>You said: <script src="http://bad.com/nasty.js" /></p>


The fundamental problem here is that both user data and the program template are represented using strings. This identity means that types can't help the programmer to make a distinction between these two, so they get confused.

There are a number of possible solutions, but perhaps the best (in the Guile context) is to treat HTML not as strings, but as native s-expressions: as SXML. The basic idea is that HTML is either text, represented by a string, or an element, represented as a tagged list. So foo becomes "foo", and <b>foo</b> becomes (b "foo"). Attributes, if present, go in a tagged list headed by @, like (img (@ (src "http://example.com/foo.png"))). See the SXML Wikipedia page for more info.

The good thing about SXML is that HTML elements cannot be confused with text. Let's make a new definition of para:

(define (para . contents)
  `(p ,@contents))

(use-modules (sxml simple))
(sxml->xml (you-said "Hi!"))
-| <p>You said: Hi!</p>

(sxml->xml (you-said "<i>Rats, foiled again!</i>"))
-| <p>You said: &lt;i&gt;Rats, foiled again!&lt;/i&gt;</p>


So we see in the second example that HTML elements cannot be unwittingly introduced into the output. However it is now perfectly acceptable to pass SXML to you-said; in fact, that is the big advantage of SXML over everything-as-a-string.

(sxml->xml (you-said (you-said "<Hi!>")))
-| <p>You said: <p>You said: &lt;Hi!&gt;</p></p>

The SXML types allow procedures to compose. The types make manifest which parts are HTML elements, and which are text. (Using types to disallow nested paragraphs is an exercise for the reader.)

So you needn't worry about escaping user input; the type transition back to a string handles that for you. XSS vulnerabilities are a thing of the past.

Syndicated 2011-01-11 07:18:07 from wingolog

doing it wrong

Intrepid hacker Aleix Conchillo Flaqué writes in to say that, against all odds, he actually managed to install the Tekuti blog software on his server. Rock on!

Of course, it was not without a couple of problems, and indeed the important one points to something more fundamentally wrong with existing internet technologies.

in which mod_rewrite fails the author

Whoa, Mr. Wingo, calm down there! Blaming the internet for a bug in your software! Hubris is a virtue and all that, but perhaps this is taking it too far? I'm sure my readers will let me know in the comments, but first, some background.

The symptom of the problem is like this. To refer to a post in a URL, tekuti has an identifier, the post key. For example, the path to edit a post is /admin/posts/key.

Tekuti doesn't have post numbers though, so the easiest way to generate the post key is to serialize its location in the data store, like 2010/12/22/doing-it-wrong. As you can see the post key can include any character, and indeed typically does include slashes, so the actual text representation of the URL to edit a post has to be percent-encoded, like /admin/posts/2010%2f12%2f22%2fdoing-it-wrong.

This scheme works fine, and indeed when accessing the Tekuti web server directly, everything works. But if you put it behind Apache using a mod_rewrite proxying rule, as Aleix did:

RewriteRule ^/blog(/?.*)$ http://localhost:8080/$1 [P]

Then trying to edit posts doesn't work! What's the deal?

parsing it wrong

The deal is, mod_rewrite does a textual match on decoded paths. So the string that mod_rewrite sees isn't the string given to it -- in this case, it sees /admin/posts/2010/12/22/doing-it-wrong. Then it does some arbitrary transformation on that decoded string, and somehow constructs the result.

But you cannot produce the desired result from the intermediate, decoded string!

There are various options to re-encode parts of the output string, or to not encode parts of it -- but you can't determine which slash in the intermediate string is to be re-encoded.

The problem is that mod_rewrite treats slashes specially, not re-encoding them. This problem is fundamental to any technology that processes URL paths using textual comparisons.

To parse a URL path properly, you must first split it according to the delimiters you are interested in (/, in this case). Then you percent-decode the path components into a list, and match on that list.
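Guile's (web uri) module does exactly this; a sketch:

```scheme
(use-modules (web uri))
(split-and-decode-uri-path
 "/admin/posts/2010%2F12%2F22%2Fdoing-it-wrong")
;; => ("admin" "posts" "2010/12/22/doing-it-wrong")
```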

Query parameters need to be parsed similarly -- first you split on &, then split on the first = in each component, then percent-decode the resulting keys and values into an ordered list of key-value pairs.
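A sketch of that order of operations -- parse-query is a made-up name, but string-split and uri-decode are standard Guile:

```scheme
(use-modules (web uri))

;; Hypothetical helper: split first, percent-decode each piece after.
(define (parse-query str)
  (map (lambda (kv)
         (let ((idx (string-index kv #\=)))
           (if idx
               (cons (uri-decode (substring kv 0 idx))
                     (uri-decode (substring kv (1+ idx))))
               (cons (uri-decode kv) #f))))
       (string-split str #\&)))
```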

Quoth the RFC:

When a URI is dereferenced, the components and subcomponents significant to the scheme-specific dereferencing process (if any) must be parsed and separated before the percent-encoded octets within those components can be safely decoded, as otherwise the data may be mistaken for component delimiters.

RFC 3986, section 2.4: When to encode or decode

This problem isn't specific to regular expressions -- it also occurs, for example, when dispatching an HTTP request to a URI handler. The dispatch should be based on path components, and not against the path as a string.

it's a programming language problem

So why does this problem occur, even in technologies as venerable as mod_rewrite? Because it's one that can only be solved by programming languages. You need some sort of sequence data type to parse paths. You need some sort of map to parse query strings, conventionally at least. And then to reconstitute a URI, you need to do so from those data types (lists, maps, &c).

Most contexts in which you do dispatch, like in your .htaccess, aren't equipped with the expressivity to do it right. Regular expressions are only generally applicable on URI subcomponents like path components -- they can only be used in limited situations on full paths.

workarounds

In Aleix's case, the workaround is to use mod_proxy directly:

ProxyPass /blog http://localhost:8080/

It's unfortunate, as you can't use mod_proxy from an .htaccess file -- only from the main configuration, and requiring a restart of the server. Oh well, though.

You could still use mod_rewrite, but with special rules for the specific URI paths that might include escaped slashes, like this:

RewriteRule ^/blog/admin/posts/(.+)$ http://localhost:8080/admin/posts/$1 [P,B]

But such a special solution is just that -- special, i.e., not general.

en brevis

If you're hacking something for the web that is intended for general use, and if you parse or generate URIs, do your users a favor and give them an appropriate programming language. Your users will thank you, or, more likely, continue in blissful ignorance, which is just as well.

Syndicated 2010-12-23 00:04:23 from wingolog

on the new posix

Here's some stuff I've been preparing for Guile's 1.9.14 release, which should come out later today. Readers interested in further web hackery should check out Tekuti for a larger example. Happy hacking!

When Guile started back in the mid-nineties, the GNU system was still focused on producing a good POSIX implementation. This is why Guile's POSIX support is good, and has been so for a while.

But times change, and in a way these days the web is the new POSIX: a standard and a motley set of implementations on which much computing is done. So today's Guile also supports the web at the programming language level, by defining common data types and operations for the technologies underpinning the web: URIs, HTTP, and XML.

It is particularly important to define native web data types. Though the web is text in motion, programming the web in text is like programming with goto: muddy, and error-prone. Most current security problems on the web are due to treating the web as text instead of as instances of the proper data types.

In addition, common web data types help programmers to share code.

Well. That's all very nice and opinionated and such, but how do I use the thing? Read on!

Hello, World!

The first program we have to write, of course, is "Hello, World!". This means that we have to implement a web handler that does what we want. A handler is a function of two arguments and two return values:

(define (handler request request-body)
  (values response response-body))

In this first example, we take advantage of a short-cut, returning an alist of headers instead of a proper response object. The response body is our payload:

(define (hello-world-handler request request-body)
  (values '((content-type . ("text/plain")))
          "Hello World!"))

Now let's test it. Load up the web server module if you haven't yet done so, and run a server with this handler:

(use-modules (web server))
(run-server hello-world-handler)

By default, the web server listens for requests on localhost:8080. Visit that address in your web browser to test. If you see the string, Hello World!, sweet!

Inspecting the Request

The Hello World program above is a general greeter, responding to all URIs. To make a more exclusive greeter, we need to inspect the request object, and conditionally produce different results. So let's load up the request, response, and URI modules, and do just that.

(use-modules (web server)) ; you probably did this already
(use-modules (web request)
             (web response)
             (web uri))

(define (request-path-components request)
  (split-and-decode-uri-path (uri-path (request-uri request))))

(define (hello-hacker-handler request body)
  (if (equal? (request-path-components request)
              '("hacker"))
      (values '((content-type . ("text/plain")))
              "Hello hacker!")
      (not-found request)))

(run-server hello-hacker-handler)

Here we define a helper that returns the components of the URI path as a list of strings, and use it to check for a request to /hacker/. The success case is just as before -- visit http://localhost:8080/hacker/ in your browser to check.

You should always match against URI path components as decoded by split-and-decode-uri-path. The above example will work for /hacker/, //hacker///, and /h%61ck%65r.
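To see why, try split-and-decode-uri-path at the REPL: it splits the path on slashes, drops empty components, and percent-decodes each piece (a sketch using Guile 2.0's (web uri) module):

```scheme
(use-modules (web uri))

;; All three of those request paths decode to the same component list:
(split-and-decode-uri-path "/hacker/")     ; ⇒ ("hacker")
(split-and-decode-uri-path "//hacker///")  ; ⇒ ("hacker")
(split-and-decode-uri-path "/h%61ck%65r")  ; ⇒ ("hacker")
```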

But we forgot to define not-found! If you are pasting these examples into a REPL, accessing any other URI in your web browser will drop your Guile console into the debugger:

<unnamed port>:38:7: In procedure module-lookup:
<unnamed port>:38:7: Unbound variable: not-found

Entering a new prompt.  Type `,bt' for a backtrace or `,q' to continue.
scheme@(guile-user) [1]> 

So let's define the function, right there in the debugger. As you probably know, we'll want to return a 404 response.

;; Paste this in your REPL
(define (not-found request)
  (values (build-response #:code 404)
          (string-append "Resource not found: "
                         (unparse-uri (request-uri request)))))

;; Now paste this to let the web server keep going:
,continue

Now if you access http://localhost:8080/foo/, you get this error message. (Note that some popular web browsers won't show server-generated 404 messages, showing their own instead, unless the 404 message body is long enough.)

Higher-Level Interfaces

The web handler interface is a common baseline that all kinds of Guile web applications can use. You will usually want to build something on top of it, however, especially when producing HTML. Here is a simple example that builds up HTML output using SXML.

First, load up the modules:

(use-modules (web server)
             (web request)
             (web response)
             (sxml simple))

Now we define a simple templating function that takes a list of HTML body elements, as SXML, and puts them in our super template:

(define (templatize title body)
  `(html (head (title ,title))
         (body ,@body)))

For example, the simplest Hello HTML can be produced like this:

(sxml->xml (templatize "Hello!" '((b "Hi!"))))
=| <html><head><title>Hello!</title></head><body><b>Hi!</b></body></html>

Much better to work with Scheme data types than to work with HTML as strings. Now we define a little response helper:

(define* (respond #:optional body #:key
                  (status 200)
                  (title "Hello hello!")
                  (doctype "<!DOCTYPE html>\n")
                  (content-type-params '(("charset" . "utf-8")))
                  (content-type "text/html")
                  (extra-headers '())
                  (sxml (and body (templatize title body))))
  (values (build-response
           #:code status
           #:headers `((content-type
                        . (,content-type ,@content-type-params))
                       ,@extra-headers))
          (lambda (port)
            (when sxml
              (if doctype (display doctype port))
              (sxml->xml sxml port)))))

Here we see the power of keyword arguments with default initializers. By the time the arguments are fully parsed, the sxml local variable will hold the templated SXML, ready for sending out to the client.

Instead of returning the body as a string, here we give a procedure, which will be called by the web server to write out the response to the client.
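For example, here is a minimal page built with this helper. (The name hello-page is hypothetical, and the snippet relies on the respond and templatize definitions above.)

```scheme
;; A handler that hands a list of SXML body elements to respond,
;; which wraps them in the template and streams them out as HTML.
(define (hello-page request body)
  (respond '((h1 "Hello, hacker!")
             (p "This page is built from SXML."))))

(run-server hello-page)
```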

Now, a simple example using this responder, which lays out the incoming headers in an HTML table.

(define (debug-page request body)
  (respond
   `((h1 "hello world!")
     (table
      (tr (th "header") (th "value"))
      ,@(map (lambda (pair)
               `(tr (td (tt ,(with-output-to-string
                               (lambda () (display (car pair))))))
                    (td (tt ,(with-output-to-string
                               (lambda ()
                                 (write (cdr pair))))))))
             (request-headers request))))))

(run-server debug-page)

Now if you visit any local address in your web browser, you actually see some HTML, finally.

Conclusion

Well, this is about as far as Guile's built-in web support goes, for now. There are many ways to make a web application, but hopefully by standardizing the most fundamental data types, users will be able to choose the approach that suits them best, while also being able to switch between implementations of the server. This is a relatively new part of Guile, so if you have feedback, let us know, and we can take it into account. Happy hacking on the web!

Syndicated 2010-12-17 16:34:26 from wingolog

meta data

Hey tubes! Long time no type, in this direction at least.

It seems that most of my writing energy these past few months has been directed towards Guile. For example, right now I should be writing documentation for new hacks, but instead I am typing at another part of the ether.

It's good and bad, this thing. The good thing is the hack-cadence in Guile is high. The bad thing is that not many learn about it, because, well, code doesn't blog about itself, does it?

Except in this case, perhaps. The tin can jiggling the electrons at the other end of this blogline has been my hack, of late. What you are reading is words about Scheme web servers, served by a Scheme web server.

That's right, I ported Tekuti to Guile 2.0. Delicious dogfood, yum!

In the process, I decided that mod-lisp, which I had been using, was stupid. There is already a simple, standard way of serving HTTP requests over a socket, and it is HTTP. So I wrote pieces of a web server, and put them in Guile. I'll probably write more about that later, so no more words about that for now, except to request that folks with spiders, bots, odd rss grabbers and such send me bug reports if things aren't legit.

ciao slicehost, ciao linode

About the same time, my bank decided to change my credit card, so all my old subscriptions stopped working. It was just the thing I needed to make me jump ship, finally, from slicehost to linode.

If you're still on slicehost, I heartily recommend that you switch. (Heartily! Strange word. Like gravy and meatballs or something.) Linode feels faster to me, it's half the price, and otherwise the quality is about the same or perhaps a little better. And from what I hear, the linode offerings continue to improve, while slicehost hasn't changed for the 2+ years that I was with them.

Anyway, rap at yall soon, and keep your parentheses warm in this at-times cold Northern winter. Peace!

Images courtesy of the excellent Hyperbole and a Half.

Syndicated 2010-12-13 21:08:36 from wingolog

words that don't mean what you want them to: limn

The verb "limn" is lovely: terse, and emenny.

I first heard it at work from my boss, jh, a great word-source. I understood it to mean "to go over, thoroughly, and calmly, touching each item. To enumerate things as one runs fingers through long hair."

Turns out it doesn't mean that at all! For posterity, the definition of limn is:

Limn.

Transitive verb. Old English limnen, fr. luminen, for enluminen, French enluminer to illuminate, to limn, Latin illuminare to paint. See also Illuminate, Luminous.

1. To draw or paint; especially, to represent in an artistic way with pencil or brush. [1913 Webster]

Let a painter carelessly limn out a million of faces, and you shall find them all different. --Sir T. Browne. [1913 Webster]

2. Hence: To picture in words; to describe in graphic terms. [PJC]

3. To illumine, as books or parchments, with ornamental figures, letters, or borders. [1913 Webster]

Syndicated 2010-10-06 20:22:09 from wingolog
