I have gotten my embedded product working quite a bit faster than it was. Recall that the demo I gave my client was way too slow and it took some convincing to get him to continue funding the product, but he did eventually decide to continue paying me for development.
Unfortunately, it is still too slow to be marketable. However, it has become surprisingly usable even at its current performance. Before, it wasn't usable at all: I could demonstrate that it worked correctly and measure its miserable speed, but that was about it.
I can't claim much credit for the speedup. It turned out that there was a more efficient algorithm for what I'm trying to do; it gets the same results but is much faster. However, there was the challenge of getting it to actually work in my firmware, which took a couple of days. I was very relieved when I got the new algorithm functioning and found that it ran about 20 times faster than what I had.
But I need to speed it up by another factor of 10 before my client will consider marketing it. At this point he has only committed to paying me to improve the speed. More work will be required to bring the product to market, and I have to hit a minimum performance threshold for that to happen.
I have always been a big fan of tweaking code for performance. I have yet to find an optimizing compiler that I cannot beat by writing better C. And I am always able to beat my best C performance by writing in assembler.
Everyone says that you should only try to improve performance by selecting a better algorithm, but what if you are already using the best known algorithm and it's still not fast enough? That's when you need to tweak your code.
Also, some algorithms are only more efficient on large data sets, because they are expensive to set up. If you have many small, independent data sets, what would ordinarily be considered the best algorithm may really be a poor choice. In that case an algorithm that scales poorly but can be coded tightly may be the better one.
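To illustrate that trade-off with a stand-in problem (sorting, which is not necessarily what the product does), here is a minimal sketch in C. Insertion sort scales poorly, at O(n^2) in the worst case, but it needs no scratch buffer and no setup, and its inner loop is only a few instructions long. The names `sort_small`, `sort_all_sets`, and `SMALL_SET_MAX`, the 16-element limit, and the `int16_t` element type are all assumptions made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on the size of one small data set. */
#define SMALL_SET_MAX 16

/* Insertion sort: O(n^2) in the worst case, but no setup cost,
 * no scratch buffer, and a very short inner loop. */
static void sort_small(int16_t *a, size_t n)
{
    assert(n <= SMALL_SET_MAX);
    for (size_t i = 1; i < n; i++) {
        int16_t key = a[i];
        size_t j = i;
        /* Shift larger elements one slot right to open a gap for key. */
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

/* Sort many independent small sets stored back to back,
 * each holding set_len items. */
static void sort_all_sets(int16_t *data, size_t set_count, size_t set_len)
{
    for (size_t s = 0; s < set_count; s++) {
        sort_small(data + s * set_len, set_len);
    }
}
```

Whether this actually beats an O(n log n) sort for a given set size depends on the processor and the compiler, so it has to be measured on the target hardware rather than assumed.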
I'm afraid that because I didn't consider performance a real concern for the proof-of-concept demo, I didn't even look for better algorithms when I wrote it. It simply didn't occur to me that the straightforward and obvious way could have such disastrous results. I've done some research now, and I'm pretty sure that the algorithm I have is the best available.
I don't think it will be too hard to write it in assembly, and I can see a number of ways that this will help. I should be able to get some performance gain easily. Getting enough may be difficult, but I still have several days left of the time I have funded. I could take more time if I needed it, but it would make things unpleasant financially again.
Again, I'm sorry that I won't say exactly what I'm working on, but hopefully it will only be about a week and a half until the product is announced.