Older blog entries for danstowell (starting at number 1)

Roast pumpkin and aubergine spaghetti

This is a nice way to use pumpkin: a spicy and warming pasta dish. These quantities serve 2; it takes about 45 minutes in total, with some spare time in the middle.

  • 1/2 a pumpkin
  • 1 medium orange chilli
  • 1 tsp paprika
  • 1/2 tsp turmeric
  • Plenty of olive oil
  • 1/2 an aubergine
  • 2 tomatoes
  • Spaghetti

Put the oven on hot, about 210--220 C. Peel and deseed the pumpkin, and cut it into slices about 1/2 cm thick and 4 or 5 cm long - no need to be exact, but we want thinnish pieces. Chop the chilli into rings too.

In a roasting tin, put a good glug of olive oil, then the pumpkin and chilli. Sprinkle over the paprika and turmeric, then toss to mix. Put this in the oven and let it roast for about 40 minutes, preparing the aubergine and pasta in the mean time.

The aubergine needs to be cut into pieces of similar size and shape to the pumpkin. Leave the tomatoes whole, but cut out the stalky bit. Halfway through the pumpkin's cooking time, add the aubergine and another glug of olive oil, toss briefly to mix, sit the tomatoes in the middle somewhere, then put it all back in the oven.

Cook the spaghetti according to the packet instructions (e.g. boil for 15 minutes). Drain it, and get the other stuff out of the oven. In the pan that you used for the pasta (or a new pan), put the two roasted tomatoes and bash them with a serving spoon so they fall apart and become a nice lumpy paste. Add the pasta to them and mix. Then add the other roast vegetables, and mix all together, but gently this time so you don't mush the veg.

Serve with some parmesan perhaps.

Syndicated 2011-10-30 15:14:49 from Dan Stowell

ISMIR 2011: the year of bigness

I'm blogging from the ISMIR 2011 conference, about music information retrieval. One of the interesting trends is how a lot of people are focusing on how to scale things up, to handle millions of audio files (or users, or scores) rather than just hundreds or thousands. Why? Well, in real-world applications it's often important: big music services like Spotify and iTunes have about 15 million tracks, Facebook has millions of users, and so on. At ISMIR one of the stars of the show is the Million Song Dataset, just released, which should help many researchers to develop and test on a big scale. Here I'm going to note some of the talks/posters I've seen with interesting approaches to scalability:

Brian McFee described a simple tweak to the kd-tree data structure called "spill tree" which improves approximate search. Basically, when you split the data in two you allow some of the data points to spill over and fall on both sides. Simple but apparently effective.
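Here's a minimal Python sketch of the split step, just to make the idea concrete (the function name and the spill margin tau are my own, not from the talk): points within tau of the splitting plane land in both children, so a query that descends only one branch is less likely to miss a true neighbour sitting just over the boundary.

    import numpy as np

    def spill_split(points, dim, tau):
        # Split on the median of one coordinate, but let points within
        # tau of the splitting plane "spill" into both children.
        # (Sketch only: a real spill tree recurses and caps the overlap.)
        threshold = np.median(points[:, dim])
        left = points[points[:, dim] < threshold + tau]
        right = points[points[:, dim] >= threshold - tau]
        return threshold, left, right

    pts = np.random.rand(1000, 2)
    thr, left, right = spill_split(pts, dim=0, tau=0.05)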

Dominik Schnitzer introduced a nice way to smooth out a search space and reduce the problem of hub-ness. One way to do it could be to use a minimum spanning tree, for example, but that involves a whole-dataset analysis so it might not scale well. In Dominik's approach, for each data point X you find an estimate of what he calls "mutual proximity": randomly sample 100 data points from your dataset and measure their distance to X, then fit a Gaussian to those distances. Then to find the "mutual proximity" between two data points X and Y, you just evaluate X's Gaussian at Y's location to get a kind of "probability of being a near neighbour". He also makes this a symmetric measure by combining the X->Y measure with the Y->X measure, but I'd imagine you don't always need to do that, depending on your purpose. The end result is a distance measure that pretty much eliminates hubs.
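As a rough Python sketch of how I understand it (the names and the product used to symmetrise are my guesses at the details; in practice you'd precompute each point's Gaussian once rather than refit it per query):

    import numpy as np
    from scipy.stats import norm

    def fit_distance_gaussian(x, data, n_samples=100):
        # Fit a Gaussian to the distances from x to a random sample of the dataset.
        idx = np.random.choice(len(data), n_samples, replace=False)
        d = np.linalg.norm(data[idx] - x, axis=1)
        return d.mean(), d.std()

    def mutual_proximity(x, y, data):
        # Probability, under each point's Gaussian distance model, that a
        # random point lies farther away than the other point does;
        # combined as a product to make it symmetric.
        d = np.linalg.norm(x - y)
        mu_x, sd_x = fit_distance_gaussian(x, data)
        mu_y, sd_y = fit_distance_gaussian(y, data)
        p_xy = norm.sf(d, loc=mu_x, scale=sd_x)  # P(dist from x > d)
        p_yx = norm.sf(d, loc=mu_y, scale=sd_y)  # P(dist from y > d)
        return p_xy * p_yx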

Shazam's music recognition algorithm, described in this 2006 paper, is one of the commercial success stories of scalable audio MIR. Sebastien Fenet tweaked it to be robust to pitch-shifting, essentially by using a log-frequency spectrogram and using delta-log-frequency rather than frequency in the fingerprints.
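The gist, in a toy Python sketch (my own notation, not code from either paper): a Shazam-style fingerprint hashes a pair of spectrogram peaks, and keeping only the log-frequency interval between the peaks makes the hash invariant to pitch-shifting, since a pitch shift is just a constant offset on a log-frequency axis.

    def shazam_hash(peak1, peak2):
        # Classic fingerprint: two absolute frequencies plus the time gap.
        (t1, f1), (t2, f2) = peak1, peak2
        return (f1, f2, t2 - t1)

    def pitch_robust_hash(peak1, peak2):
        # Pitch-shift-robust variant: keep only the log-frequency interval,
        # so shifting both peaks by the same offset leaves the hash unchanged.
        (t1, logf1), (t2, logf2) = peak1, peak2
        return (logf2 - logf1, t2 - t1)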

A small note from the presentation of the Million Song Dataset: apparently if you want a good online linear predictor that is fast for large data, try out Vowpal Wabbit.
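I won't try to reproduce VW's interface here, but as a toy numpy illustration of what "online" buys you - one pass over the data, constant memory - here's a stochastic-gradient linear predictor (this is not VW's actual algorithm or API, just the general idea):

    import numpy as np

    def online_linear_sgd(stream, n_features, lr=0.1):
        # One pass of stochastic gradient descent on squared error;
        # examples arrive one at a time, and no dataset is kept in memory.
        w = np.zeros(n_features)
        for x, y in stream:
            err = w @ x - y
            w -= lr * err * x
        return w

    rng = np.random.default_rng(0)
    true_w = rng.normal(size=5)
    stream = ((x, x @ true_w) for x in rng.normal(size=(10000, 5)))
    w = online_linear_sgd(stream, n_features=5)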

Also, Thierry mentioned that he was a fan of using Amazon's cloud storage/processing - if you store data with Amazon you can run MapReduce jobs over it easily, apparently. Mark Levy of last.fm is also a fan of MapReduce, having done a lot of work using Hadoop (the open-source implementation of MapReduce, developed largely at Yahoo) for big data-crunching jobs.

Mikael Henaff presented a technique for learning a sparse spectrum-derived feature set, similar in spirit to K-SVD. The thing I found interesting was how he then made a fast way of decomposing a new signal (once you've derived your feature basis from some training data). Ordinarily you'd have to run an optimisation - the dictionary is overcomplete, so it can't be done as easily as an orthogonal transform - but you don't want to do that on a lot of data. Instead, he first trains a nonlinear projection which approximates that decomposition (it's a matrix rotation followed by a shrinkage nonlinearity, really simple mathematically). So you have to train that, but then when you want to analyse new data there's no optimisation needed: you just apply the simple transform.
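In other words, something like this sketch (the soft-threshold form and the names are my assumptions about the details; the point is just that encoding a new frame becomes one matrix multiply plus an elementwise shrinkage):

    import numpy as np

    def soft_threshold(v, theta):
        # Shrinkage nonlinearity: pull values towards zero, clipping at zero.
        return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

    def fast_sparse_code(x, W, theta):
        # One-shot approximate sparse code: a learned linear map followed by
        # shrinkage, standing in for the iterative optimisation against the
        # overcomplete dictionary. W and theta are assumed already trained
        # to mimic the optimiser's output.
        return soft_threshold(W @ x, theta)

    W = np.random.randn(256, 64)  # hypothetical learned projection
    x = np.random.randn(64)       # a new signal frame
    z = fast_sparse_code(x, W, theta=0.5)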

There's been plenty of interesting stuff here at ISMIR that isn't about bigness, and it was good of Douglas Eck (of Google) to emphasise that there are still lots of interesting and important problems in MIR that don't need scalability and don't even benefit from it. But there are interesting developments in this area, hence this note.

Syndicated 2011-10-27 23:06:26 (Updated 2011-10-28 12:58:43) from Dan Stowell
