The problem is that a complete run takes hours. Before we bought our current server, it was impossible to do a check on every commit, and even now we'd need a queuing system no one's been annoyed enough to write. So instead we run once a day, and then someone has to check the results and work out which change caused the differences.
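If runs were cheap, the "work out which change caused the differences" step could be automated by bisecting over the day's commits. Here's a rough sketch of the idea, not something we actually have. It assumes a hypothetical `run_passes(commit)` function that does a full run for one commit and says whether the results match, and it assumes that once a change breaks the results, every later commit is broken too.

```python
from typing import Callable, Optional, Sequence


def first_bad_commit(
    commits: Sequence[str],
    run_passes: Callable[[str], bool],
) -> Optional[str]:
    """Return the earliest commit whose run fails, or None if all pass.

    `commits` must be ordered oldest first. `run_passes` is a
    hypothetical callback that performs a complete run for one commit.
    Assumes a single breaking point: all commits after it fail as well.
    """
    lo, hi = 0, len(commits)
    # Invariant: every commit before index lo passes,
    # and every commit at index hi or later fails.
    while lo < hi:
        mid = (lo + hi) // 2
        if run_passes(commits[mid]):
            lo = mid + 1   # breakage is somewhere after mid
        else:
            hi = mid       # mid is bad; breakage is at mid or earlier
    return commits[lo] if lo < len(commits) else None
```

With a run that takes hours, even the handful of runs this needs is a day's work, which is a big part of why the run time matters so much.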
So we've been looking at using a cluster to speed up the runs, hopefully to a few minutes, so we can easily test things, and get automatic feedback right away after a commit. My partner does scientific parallel programming and has been helping set something up.
For the moment, we're renting time. Our usage pattern is unusual. Most cluster users have algorithms that are limited by communication between the nodes, and so they tend to do smaller jobs, but run a simulation for hours, days, even weeks. We want a lot of nodes, but not for very long, so it's the sort of thing where renting part of a shared resource makes sense.
Of course, it works better to be sharing a resource much larger than the average job size, or with other people with similar usage patterns, to avoid being blocked in the queue. But we'll see how it goes. For the moment we're using Tsunamic Technologies' cluster-on-demand service. They've provided good support so far, and offer a familiar Linux environment using the PBS job queue system (the venerable qsub et al.) to schedule access to the nodes. So far it's going pretty well, with scaling down to a 5 minute run.
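To give a flavor of what "a lot of nodes, but not for very long" looks like under PBS, here's a minimal sketch of splitting a test suite into per-node chunks and generating a job script for each. It isn't our actual setup: the `./run_test` command, the job names, and the ten minute walltime are all made up for illustration. The `#PBS` directives and `$PBS_O_WORKDIR` are standard PBS.

```python
import shlex
from typing import List, Sequence


def partition(tests: Sequence[str], n_nodes: int) -> List[List[str]]:
    """Deal the tests round-robin into n_nodes chunks."""
    if n_nodes < 1:
        raise ValueError("need at least one node")
    return [list(tests[i::n_nodes]) for i in range(n_nodes)]


def pbs_script(
    job_name: str,
    tests: Sequence[str],
    walltime: str = "00:10:00",   # assumed limit, not a measured value
    runner: str = "./run_test",   # hypothetical per-test command
) -> str:
    """Build the text of a PBS job script that runs `tests` on one node."""
    lines = [
        "#!/bin/sh",
        f"#PBS -N {job_name}",           # job name shown in the queue
        "#PBS -l nodes=1",               # each chunk asks for a single node
        f"#PBS -l walltime={walltime}",  # short limit: many nodes, brief jobs
        "cd $PBS_O_WORKDIR",             # directory qsub was invoked from
    ]
    lines += [f"{runner} {shlex.quote(t)}" for t in tests]
    return "\n".join(lines) + "\n"


def build_jobs(tests: Sequence[str], n_nodes: int) -> List[str]:
    """Return one job script per non-empty chunk."""
    chunks = partition(tests, n_nodes)
    return [pbs_script(f"chunk{i}", c) for i, c in enumerate(chunks) if c]
```

Each script would be written to a file and handed to `qsub`. Independent single-node jobs suit this kind of workload, since the tests don't talk to each other and the scheduler can fit them in wherever nodes are free.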
wherefore the grid?
People have been talking about Grid computing for 17 years now, but not much has appeared to fulfill the promise. Right now, most parallel machine users are doing research simulations, and there the overhead of dealing with a heterogeneous environment and dynamic node allocation isn't especially worthwhile. But once there's the infrastructure available to rent time easily, and especially to sell time, I think we'll see a lot more of our sort of use.
Ironically, it's the overhead of virtualization that's finally making that possible. The problem with a market in cpu time is that you have to be able to run untrusted code. An entirely automatic reputation system isn't really good enough. You need recourse if your provider is messing with your data, and providers need to be able to protect jobs from each other. Running each job in its own virtual machine gives providers that isolation. And because you can move machine images around, virtualization also fosters the sort of dynamic infrastructure we need to really have scalable computing available as a utility.
I was therefore excited to see that Amazon is doing exactly that with their Elastic Compute Cloud beta. To use the service you upload an OS image to their storage farm, and then launch as many instances as you want, for as long as you want. It's a really cool setup. Apparently the story is that they have this enormous server farm for dealing with their peak loads (like Christmas) but of course that means it's idle much of the time. The same issue we have, really. They already sell almost everything else online, so they decided to try renting out time on their infrastructure as a new business idea.
They have some other cool things too, like an RPC interface to human labor.
The best thing about it is that they have a web protocol for doing all this. So while someone has to provide a credit card and pay the bills, you can now write code that can allocate and occupy its own server resources. We're one step closer to AIs living free on the net. :)