pycage, while MPU-side decoding is the easiest way to go, DSP-side decoding would still be beneficial (albeit somewhat more complicated). Whether the benefits are worth the effort is another matter. The tools you need to roll your own codecs are available, and you can do this mostly in C without resorting to much TMS320C55x assembly. The biggest hurdle is likely familiarizing yourself with the DSP kernel, the socket node interfaces, and so forth. Most of this is documented pretty well on the dspgateway page.
For the adventurous, there's still an unused mailbox line between the MPU and DSP on the 1710 in the current implementation that could probably be round-robin'ed pretty easily. We also don't presently make use of hardware page table walking, which makes the exmap interface a bit clunky (it essentially wires TLB entries by hand, but at least they're pre-faulted).
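To make the round-robin idea a bit more concrete, here is a minimal sketch of alternating between mailbox lines. Everything in it is hypothetical: `struct mbox_rr`, `mbox_rr_init()` and `mbox_rr_pick()` are names made up for illustration and are not part of dspgateway or the kernel mailbox code. It only shows the selection logic, not the actual register access.

```c
#include <assert.h>

/* Hypothetical helper: remembers which mailbox line to use next. */
struct mbox_rr {
    unsigned int nlines; /* number of usable mailbox lines, must be > 0 */
    unsigned int next;   /* index of the line the next message will use */
};

/* Start the rotation at line 0. */
static void mbox_rr_init(struct mbox_rr *rr, unsigned int nlines)
{
    assert(nlines > 0);
    rr->nlines = nlines;
    rr->next = 0;
}

/* Return the line for the next message and advance the rotation. */
static unsigned int mbox_rr_pick(struct mbox_rr *rr)
{
    unsigned int line = rr->next;

    rr->next = (rr->next + 1) % rr->nlines;
    return line;
}
```

With two lines, successive calls to `mbox_rr_pick()` would hand out 0, 1, 0, 1 and so on. A real implementation would also have to deal with message ordering across the two lines, which this sketch ignores.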
It would also be interesting to see how the FP-driven codecs compare to the integer-based one under EABI with a soft-float toolchain. ogg123 might even be usable out of the box with soft-float (though likely at higher CPU utilization than the numbers that have been quoted). On another note, it's also pretty easy to read the DSP load average through the sysfs interface, so it may be worthwhile to profile some of that, especially if the DSP ends up getting more heavily loaded.
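For anyone wondering what "integer-based" means in practice, the following is a rough sketch of the kind of fixed-point arithmetic such a codec does instead of floating-point multiplies. It uses the common Q15 format (a signed 16-bit value representing a number in [-1, 1)). The function names `q15_from_float()` and `q15_mul()` are my own for illustration and are not taken from any particular codec, and the code assumes the compiler does an arithmetic right shift on negative values, which is the usual behaviour but is implementation-defined in C.

```c
#include <stdint.h>

/* Convert a float in [-1, 1) to Q15, saturating values outside that range. */
static int16_t q15_from_float(float x)
{
    int32_t v = (int32_t)(x * 32768.0f);

    if (v > 32767)
        v = 32767;
    if (v < -32768)
        v = -32768;
    return (int16_t)v;
}

/* Multiply two Q15 values with rounding and saturation. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b; /* Q30 product */

    p += 1 << 14;  /* round to nearest */
    p >>= 15;      /* back to Q15 (assumes arithmetic shift) */
    if (p > 32767) /* only reachable for -1 * -1 */
        p = 32767;
    return (int16_t)p;
}
```

The point is that each multiply is a single integer multiply plus a shift, which is cheap on an ARM without an FPU, whereas a soft-float toolchain turns every float multiply into a library call.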