Ken Gilmer has found some
examples of surprisingly bad CACAO performance on ARM
(see slides).
In my understanding, CACAO should be at least as fast as
JamVM for moderately long-running benchmarks such as
probe and decode. Before looking into the
issue, the only cause I could imagine was a large number of
JIT/native transitions, but I was not entirely convinced that
this alone would create such a performance problem.
Digging into it using oprofile, I quickly found
some very inefficient code for handling JNI local
references. Interestingly, it’s mostly a memset
of 64 bytes that is run on every transition from JIT code to
native code. It seems that either memory bandwidth on ARM is
unbelievably low or the machine’s caching behavior is
extremely poor, as the same call, when run on decent x86
hardware, doesn’t even show up in a profile, or at least not
nearly as prominently.
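
To make the cost pattern concrete, here is a minimal sketch in C of the two ways a local reference table could be prepared on each JIT-to-native transition. This is not CACAO's actual code: the names localref_frame, frame_enter_eager, frame_enter_lazy and frame_add are hypothetical, and the 16-slot table is only an assumption chosen so that it comes to 64 bytes with 4-byte pointers on 32-bit ARM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table size: 16 slots, i.e. 64 bytes with 4-byte pointers. */
enum { LOCALREF_SLOTS = 16 };

/* Hypothetical per-call frame holding the local references of a native call. */
typedef struct {
    void  *refs[LOCALREF_SLOTS]; /* local reference slots */
    size_t used;                 /* number of slots currently in use */
} localref_frame;

/* Eager variant: wipe the whole table on every transition. */
void frame_enter_eager(localref_frame *f)
{
    memset(f->refs, 0, sizeof f->refs);
    f->used = 0;
}

/* Lazy variant: only reset the counter. Slots at index >= used are
   treated as dead and are overwritten when they are next handed out. */
void frame_enter_lazy(localref_frame *f)
{
    f->used = 0;
}

/* Store a reference in the next free slot and return its index. */
size_t frame_add(localref_frame *f, void *obj)
{
    assert(f->used < LOCALREF_SLOTS);
    f->refs[f->used] = obj;
    return f->used++;
}
```

The lazy variant only works if everything that walks the table, the garbage collector in particular, looks at the first used slots and ignores the rest. Whether that trade-off matches what the overhaul actually did is not something this sketch claims.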
Anyway, after an overhaul
of this half-decade-old code, performance for these two
benchmarks has improved by more than 50%. JamVM is still
faster for decode, but only slightly. I attribute
this to the garbage collector.
Without Ken’s talk, I would never have found out about this.