I have made considerable progress in improving the performance of my embedded application. Switching to a better algorithm (what I believe is now considered the fastest algorithm) in C got me a factor-of-ten improvement. Re-coding that in assembler got me another factor of three.
The problem is, it's still not fast enough to meet the client's requirements. I need to increase the performance by another factor of three. I believe I can do this by tweaking the assembly, but it's not at all obvious to me how.
That's why I submitted an "Ask Kuro5hin" article titled "ARM Assembly Code Optimization?".
I felt the discussion would be more useful to others, and likely to result in more ideas that would inspire me to solve my problem, if I encouraged people to post performance tips for any architecture.
I mention in the article that I don't have the budget to buy all the books that have been recommended to me. One reason for that is that I rather dramatically underestimated the time it would take me to deliver the proof of concept, and I had agreed to do that work for a fixed price (being under the impression it would be so easy!). But the happy news is that I asked my client if he would be willing to help me out with the cost of a couple of books that people recommended to help me speed up my assembly, and he agreed.
After a couple of hours of calling around (and being disappointed that online bookstores won't ship on Saturdays) I got a bookstore a couple of hours from my house to put these on hold for me:
- Computer Architecture: A Quantitative Approach, by Hennessy, Patterson and Goldberg
- Hacker's Delight, by Henry S. Warren, Jr.
My friend Dave Lyons, who recommended Hacker's Delight to me, said that he was quite surprised by what it showed you can get a two's-complement execution unit to do.
