30 Dec 2006 (updated 30 Dec 2006 at 00:50 UTC)
parallel computing
We do nightly regression
tests on the Ghostscript codebase to
try and detect inadvertent changes. It's a combination of
established test suites and our own collection of problem
files from the wild.
The problem is that a complete run takes hours. Before we
bought our current server, it was impossible to do a check
on every commit, and even now we'd need a queuing system no
one's been annoyed enough to write. So instead we run once a
day, and then someone has to check the
results and work out which change caused the differences.
So we've been looking at using a cluster to speed up the
runs, hopefully to a few minutes, so we can easily test
things, and get automatic feedback right away after a
commit. My partner does scientific parallel programming and
has been helping set something up.
For the moment, we're renting time. Our usage pattern is
unusual. Most cluster users have algorithms that
are limited by communication between the nodes, and so they
tend to do smaller jobs, but run a simulation for hours,
days, even weeks. We want a lot of nodes, but not for very
long, so it's the sort of thing where renting part of a
shared resource makes sense.
Of course, it works better to be sharing a resource much
larger than the average job size, or with other people with
similar usage patterns to avoid being blocked in the queue.
But we'll see how it goes. For the moment we're using Tsunamic
Technologies' cluster on demand service. They've
provided good support so far, and offer a familiar Linux
environment using the PBS
job queue system (the venerable qsub et al.) to
schedule access to the nodes. So far it's going pretty well,
with the full run scaling down to 5 minutes.
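To make the idea concrete, here's a rough sketch of how a test collection might be dealt out across nodes. This is not our actual harness: the function names, the file names, and the 180-minute figure are all made up for illustration, and the wall-time estimate assumes perfect scaling with no queue or startup overhead.

```python
from typing import List, Sequence


def partition(files: Sequence[str], n_nodes: int) -> List[List[str]]:
    """Deal test files round-robin into n_nodes chunks (hypothetical helper)."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    chunks: List[List[str]] = [[] for _ in range(n_nodes)]
    for i, name in enumerate(files):
        # File i goes to node i mod n_nodes, so chunk sizes differ by at most one.
        chunks[i % n_nodes].append(name)
    return chunks


def ideal_wall_minutes(serial_minutes: float, n_nodes: int) -> float:
    """Best-case wall time: assumes the tests are independent and equally slow."""
    return serial_minutes / n_nodes


if __name__ == "__main__":
    # Placeholder file names, not real entries from our test collection.
    files = [f"test{i:03d}.ps" for i in range(10)]
    for node, chunk in enumerate(partition(files, 3)):
        print(node, chunk)
    # A made-up 3 hour serial run spread over 36 nodes.
    print(ideal_wall_minutes(180, 36))
```

Each chunk would then go to the queue as its own job. Round-robin is the simplest thing that works because the tests don't talk to each other. In practice a few slow files dominate, so sorting by expected run time before dealing them out would probably balance the nodes better.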
wherefore the grid?
People have been talking about Grid
computing for 17
years now, but not much has appeared to
fulfill the promise. Right now, most parallel machine users
are doing research simulations, and there the overhead of
dealing
with heterogeneous environments and dynamic node allocation
isn't especially worthwhile. But once there's the
infrastructure
available to rent time easily, and especially to sell time,
I think we'll see a lot more of our sort of use.
Ironically, it's the overhead of virtualization
that's finally making that possible. The problem with a market in CPU time is
that you have to be able to run untrusted code. An entirely
automatic reputation system isn't really good enough. You
need recourse if your provider is messing with your data,
and providers need to be able to protect jobs from each
other. And because you can move machine images around,
virtualization also fosters the sort of dynamic infrastructure we need to
really have scalable computing available as a utility.
I was therefore excited to see that Amazon is doing
exactly that with their Elastic Compute Cloud beta. To
use the service you upload an OS image to their storage
farm, and then launch as many instances as you want, for as
long as you want. It's a really cool setup. Apparently the
story is that they have this enormous server farm for
dealing with their peak loads (like Christmas)
but of course that means it's idle much of the time. The
same issue we have, really. They
already sell almost everything else online, so they decided
to try
renting out time on their infrastructure as a new business idea.
They have some
other cool things too, like an RPC
interface to human labor.
The best thing about it is that they have a web protocol
for doing all this. So while someone has to provide a credit
card and pay the bills, you can now write code that can
allocate and occupy its own server resources. We're one step
closer to AIs living free on the net. :)