Older blog entries for lupus (starting at number 24)

Memory savings with magic trampolines in Mono

Mono is a JIT compiler and as such it compiles a method only when needed: the moment the execution flow requires the method to execute. This mode of execution greatly improves startup time of applications and is implemented with a simple trick: when a method call is compiled, the generated native code can't transfer execution to the method's native code address, because it hasn't been compiled yet. Instead it will go through a magic trampoline: this chunk of code knows which method is going to be executed, so it will compile it and jump to the generated code.

The way the trampoline knows which method to compile is pretty simple: for each method a small specific trampoline is created that will pass the pointer to the method to execute to the real worker, the magic trampoline.

Different architectures implement this trampoline in different ways, but each with the aim to reduce its size: the reason is that many trampolines are generated and so they use quite a bit of memory.

Mono in svn has quite a few improvements in this area compared to mono 1.2.5 which was released just a few weeks ago. I'll try to detail the major changes below.

The first change is related to how the memory for the specific trampolines is allocated: this is executable memory so it is not allocated with malloc, but with a custom allocator, called Mono Code Manager. Since the code manager is used primarily for methods, it allocates chunks of memory that are aligned to multiples of 8 or 16 bytes depending on the architecture: this allows the cpu to fetch the instructions faster. But the specific trampolines are not performance critical (we'll spend lots of time JITting the method anyway), so they can tolerate a smaller alignment. Considering the fact that most trampolines are allocated one after the other and that in most architectures they are 10 or 12 bytes, this change alone saved about 25% of the memory used (they used to be aligned up to 16 bytes).

To give a rough idea of how many trampolines are generated I'll give a few examples:

  • MonoDevelop startup creates about 21 thousand trampolines
  • IronPython 2.0 running a benchmark creates about 17 thousand trampolines
  • a "hello, world" style program creates about 800

In the first case this change saved more than 80 KB of memory (plus about the same again, because reviewing the code allowed me to fix a related overallocation issue).

So reducing the size of the trampolines is great, but it's hardly possible to shrink them any further. The next step is to avoid creating them at all.

There are two primary ways a trampoline is generated: either a direct call to the method is made, or a virtual table slot is filled with a trampoline for the case when the method is invoked through a virtual call. I'll note here that in both cases, after compiling the method, the magic trampoline makes the changes needed so that the trampoline is not executed again and execution goes directly to the newly compiled code. In the direct call case the callsite is changed so that the branch or call instruction transfers control to the new address. In the virtual call case the magic trampoline changes the virtual table slot directly.

The sequence of instructions used by the JIT to implement a virtual call are well-known and the magic trampoline (inspecting the registers and the code sequence) can easily get the virtual table slot that was used for the invocation. The idea here then is: if we know the virtual table slot we know also the method that is supposed to be compiled and executed, since each vtable slot is assigned a unique method by the class loader. This simple fact allows us to use a completely generic trampoline in the virtual table slots, avoiding the creation of many method-specific trampolines.

In the cases above, the number of generated trampolines goes from 21000 to 7700 for MonoDevelop (saving 160 KB of memory), from 17000 to 5400 for the IronPython case and from 800 to 150 for the hello world case.

I'll describe more optimizations (both already committed and forthcoming) in the next blog posts.

Mono GC updates

I guess some of you expected a blog entry about the generational GC in Mono, given the title. From what I can tell, many expect the new GC to solve all the issues they think are caused by the current GC, so they await it with trepidation.
As a matter of fact, from my debugging of all or almost all of those issues, the existing GC is not the culprit. Sometimes there is an unmanaged leak, sometimes a managed or unmanaged excessive retention of objects, but basically 80% of the issues that get attributed to the GC are not GC issues at all.
So, instead of waiting for the holy grail, provide test cases or as much data as you can for the bugs you experience, because chances are the bug can be fixed relatively easily, without waiting for the new GC to stabilize and get deployed.
Now, this is not to say that the new GC won't bring great improvements, but those improvements are mainly in allocation speed and mean pause time. Both, while measurable, are not bugs per se, and so are not among the few issues that people hit with the current Boehm-GC based implementation.

After the long introduction, let's go to the purpose of this entry: svn Mono now can perform an object allocation entirely in managed code. Let me explain why this is significant.

The Mono runtime (including the GC) is written in C code and this is called unmanaged code as opposed to managed code which is all the code that gets JITted from IL opcodes.
The JIT and the runtime cooperate so that managed code is compiled in a way that lets the runtime inspect it, inject exceptions, unwind the stack and so on. The unmanaged code, on the other hand, is compiled by the C compiler, and on most systems and architectures there is no info available on it that would allow the same operations. For this reason, whenever a program needs to make a transition from managed code to unmanaged (for example for an internal call implementation or for calling into the GC), the runtime needs to perform some additional bookkeeping, which can be relatively expensive, especially if the amount of code to execute in unmanaged land is tiny.

For a while now we have made use of the Boehm GC's ability to allocate objects in a thread-local fast path, but we couldn't take full benefit of it because the cost of the managed-to-unmanaged transition (and back) was bigger than the allocation cost itself.
Now the runtime can create a managed method that performs the allocation fast path entirely in managed code, avoiding the cost of the transition in most cases. This infrastructure will also be used for the generational GC, where it will matter even more: the allocation fast-path sequence there is 4-5 instructions versus the dozen or more of the Boehm GC thread-local alloc.

As for actual numbers, a benchmark that repeatedly allocates small objects is now more than 20% faster overall (overall includes the time spent collecting the garbage objects, the actual allocation speed increase is much bigger).

Code coverage with Mono

I uploaded version 0.2 of the monocov coverage tool for Mono here. It is also available from the monocov svn module from the usual Mono svn server.

The release features an improved Gtk# GUI, fixes to html rendering and other minor improvements.
The usage is pretty simple: just run your program or test suite with the following command after having installed monocov:

   mono --debug --profile=monocov program.exe
The coverage information will be output to the program.exe.cov file. Now you can load this file in the GUI with:

   monocov program.exe.cov
and browse the namespaces for interesting types you want to check code coverage for. Double clicking on a method will bring up a viewer with the source file of the method with the lines of code not reached by execution highlighted in red.

To limit the collection of data to a specific assembly you can specify it as an argument to the profiler. For example, to consider only the code in mscorlib, use:

   mono --debug --profile=monocov:+[mscorlib] test-suite.exe
To be able to easily collect coverage information from the unit tests in the mono mcs directory you can also run the test suite as follows, for example in mcs/class/corlib:

   make run-test
Monocov can also generate a set of HTML pages that display the coverage data. Here are the files generated when running the nunit-based test suite for mono's mscorlib with the following command:

    monocov --export-html=/tmp/corlib-cov corlib.cov
Hopefully this tool will help both new and old contributors to easily find untested spots in our libraries and contribute tests for them.
Happy testing!
Debugging managed lock deadlocks in Mono

I just committed to svn a small function that can be used to help debug deadlocks that result from the incorrect use of managed locks.

Managed locks (implemented in the Monitor class and usually taken with the lock () construct in C#) are subject to the same incorrect uses as normal locks, though they can be safely taken recursively by the same thread.

One of the obviously incorrect ways to use locks is to have multiple locks and acquire them in different orders in different code paths. Here is an example:

using System;
using System.Threading;

class TestDeadlock {

    static object lockA = new object ();
    static object lockB = new object ();

    static void normal_order () {
        lock (lockA) {
            Console.WriteLine ("took lock A");
            // make the deadlock more likely
            Thread.Sleep (500);
            lock (lockB) {
                Console.WriteLine ("took lock B");
            }
        }
    }

    static void reverse_order () {
        lock (lockB) {
            Console.WriteLine ("took lock B");
            // make the deadlock more likely
            Thread.Sleep (500);
            lock (lockA) {
                Console.WriteLine ("took lock A");
            }
        }
    }

    static void Main () {
        TestDeadlock td = new TestDeadlock ();
        lock (td) {
            lock (td) { // twice for testing the nest level
                Thread t1 = new Thread (new ThreadStart (normal_order));
                Thread t2 = new Thread (new ThreadStart (reverse_order));
                t1.Start ();
                t2.Start ();
                t1.Join ();
                t2.Join ();
            }
        }
    }
}

I added an explicit Sleep () call to make the race condition happen almost every time you run such a program. The issue with such deadlocks is that usually the race window is very small and the bug will go unnoticed during testing. The new feature in the mono runtime is designed to help find the issue when a process is stuck and we don't know why.

Now you can attach to the stuck process using gdb and issue the following command:

(gdb) call mono_locks_dump (0)
which results in output like this:

Lock 0x824f108 in object 0x2ffd8 held by thread 0xb7d496c0,
nest level: 2
Lock 0x824f150 in object 0x2ffe8 held by thread 0xb7356bb0,
nest level: 1
        Waiting on semaphore 0x40e: 1
Lock 0x824f1b0 in object 0x2ffe0 held by thread 0xb7255bb0,
nest level: 1
        Waiting on semaphore 0x40d: 1
Total locks (in 1 array(s)): 16, used: 8, on freelist: 8, to
recycle: 0
We can see that there are three locks currently held by three different threads. The first has been recursively acquired 2 times. The other two are more interesting because they each have a thread waiting on a semaphore associated with the lock structure: they must be the ones involved in the deadlock.

Once we know the threads that are deadlocking and the objects that hold the lock we might have a better idea of where exactly to look in the code for incorrect ordering of lock statements.

In this particular case it's pretty easy since the objects used for locking are static fields. The easy way to get the class is to notice that the object which is locked twice (0x2ffd8) is of the same class as the static fields:

(gdb) call mono_object_describe (0x2ffd8)
TestDeadlock object at 0x2ffd8 (klass: 0x820922c)
Now we know the class (0x820922c) and we can get a list of the static fields and their values and correlate with the objects locked in the mono_locks_dump () list:

(gdb) call mono_class_describe_statics (0x820922c)
At 0x26fd0 (ofs:  0) lockA: System.Object object at 0x2ffe8
(klass: 0x820beac)
At 0x26fd4 (ofs:  4) lockB: System.Object object at 0x2ffe0
(klass: 0x820beac)
Note that the lockA and lockB objects are the ones listed above as deadlocking.
Mono on the Nokia 770 OS 2006

Starting with Mono version 1.2.1, the Mono JIT supports the new ARM ABI (also called gnueabi or armel). This is the same ABI used by the 2006 OS update of the Nokia 770 and it should be good news for all the people that asked me about having Mono run on their newly-flashed devices.

The changes involved enhancing the JIT to support soft-float targets (this work will also help people porting Mono to other embedded architectures without a hardware floating point instruction set) as well as ARM-specific calling convention changes. There was also some hair-pulling involved, since the gcc version provided with scratchbox goes into an infinite loop while compiling the changed mini.c sources when optimizations are enabled, but I'm sure you don't want to know the details...

This was not enough, though, to be able to run Gtk# applications on the Nokia 770. When I first ran a simple Gtk# test app I got a SIGILL inside gtk_init() at a seemingly simple instruction. Since this happened inside a gcc-compiled binary I had no idea what the JIT could have been doing wrong. Then this morning I noticed that the instructions in gtk_init() were two bytes long: everything became clear again, I needed to implement interworking with Thumb code in the JIT. This required a few changes to how call instructions are emitted and how call sites are patched. The result is that now Mono can P/Invoke shared libraries compiled in Thumb mode (mono itself must still be compiled in ARM mode: this should be easy to fix, but there is no immediate need for it now). Note that this change didn't make it into the mono 1.2.1 release; you'll have to use mono from svn.

As part of this work, I also added an option to mono's configure to disable the compilation of the mcs/ directory, which would require running mono in emulation by qemu inside scratchbox. The new option is --disable-mcs-build. This can also be useful when building the runtime on slow boxes, if the building of the mcs/ dir is not needed (common for embedded environments where the managed assemblies are simply copied from an x86 box).

There are no packages ready for the Nokia 770 yet, though I'll provide a rough tarball of binaries soon: the issue is that at least my version of scratchbox has a qemu build that fails to emulate some syscalls used by mono, so it's hard to build packages that require mono or mcs to be run inside scratchbox. I'm told this bug has been fixed in more recent versions, so I'll report how well jitted code runs in qemu when I install a new scratchbox.

This is not the best way to handle it, though: even if qemu can emulate everything mono does, it would be very slow and silly to run it that way. We should run mono on the host, just like we run the cross-compiling gcc on the host from inside scratchbox and make it appear as a native compiler. From a quick look at the documentation, it should be possible to build a mono devkit for scratchbox that does exactly this. This would be very nice for building packages like Gtk# that involve both managed assemblies and unmanaged shared libraries (the Gtk# I used for testing required lots of painful switches between scratchbox for compiling with gcc and another terminal for running the C#-based build helper tools and mcs...). So, if anyone has the time and skills to develop such a devkit, it will be much appreciated! Alternatively, we could wait for debian packages to be built as part of the debian project's port to armel, which will use armel build boxes.

This afternoon Jonathan Pryor pasted a benchmarklet with interesting results on the mono IRC channel. It came from Rico Mariani at http://blogs.msdn.com/ricom/archive/2006/03/09/548097.aspx as a performance quiz. The results are counter-intuitive: they make it appear that using a simple array is slower than using the generic List<T> implementation (which internally is supposed to use an array itself).

On mono, using the simple array was about 3 times slower than using the generics implementation, so I ran the profiler to find out why.

It turns out that in the implementation of the IList<T> interface methods we used a special generic internal call to access the array elements: this internal call is implemented by a C function that needs to cope with any array element type. But since it is an internal call and the JIT knows what it is supposed to do, I quickly wrote the support to recognize it and inline the instructions to access the array elements. This makes the two versions of the code run with about the same speed (with mono from svn, of course).

The interesting fact is that the MS runtime behaves similarly, with the simple array test running about 3 times slower than the IList<T> implementation. If you're curious about why the MS implementation is so slow, follow the link above: I guess sooner or later some MS people will explain it.

Mono for ARM/Nokia 770 binaries
I made tarballs of binaries for use on Linux/ARM systems, including the Nokia 770. And, yes, Gtk# apps work fine on it:-).
Happy hacking, you'll find them here.
19 Sep 2005 (updated 19 Sep 2005 at 17:09 UTC)
Mono on the Nokia 770
After the Mono JIT port was done using a desktop little-endian ARM computer, Geoff just recompiled it and ran it on a Linksys NSLU2 (which runs an ARM processor in big-endian mode).
That was pretty cool. I wonder if it is as cool as running mono on a Nokia 770 (no recompilation necessary, just copied the binary from my Debian box). Here it is running our hello world app.
Many thanks to the fine folks at Nokia for sending me a prototype so quickly.
Mono ARM port
The Mono ARM port is mostly complete: check it out from svn or from the soon to be released mono 1.1.9.
It has been bootstrapped on both little and big endian Debian ports (thanks to Geoff for enduring and submitting his Linksys NSLU2 to an extended compiling session to test big endian support).
Feedback from folks with Linux PDAs is appreciated: I'll see if I can prepare some binaries later today as well.
Memory footprint improvements in mono
At the end of March 2004, davidw pointed out that mono and Gtk# used a lot of memory for simple applications: in his mail he had 42196 KB of VSIZE and 11096 KB of RSS for a button in a window. Of course this was not good and at the time I told him to check again at the time of the mono 1.0 release or thereabouts and indeed we had some improvements: at the beginning of July the numbers were 31776 KB and 9916 KB respectively for VSIZE and RSS. The changes involved both the way Gtk# was built and the way mono loaded metadata information and how that was stored in the runtime data structures.

Today I ran the same test again to see what the status is, since we're going to release mono 1.2 in a few weeks/months. The numbers look better: 18912 KB for VSIZE and 8200 KB for RSS. These numbers are comparable to the existing perl and python bindings (17032/10028 for the former, 15092/9164 for the latter).

For reference, on my system the equivalent C program gives: 10568 KB for VSIZE and 4728 KB for RSS. The amount of writable mapped memory is what counts, though, so I got some data about that, too, with a small program looking at /proc/pid/maps. The C program has about 935 KB of writable memory mapped, 290 KB of which is what looks like the relocation overhead in the numerous shared libraries loaded. Since memory is mmapped in page-sized chunks, even if a library has only a couple of relocations, it will make a whole page dirty. Ulrich Drepper has a paper on how to make shared libraries behave better that is a worthy read, but in some cases there isn't much we can do if we want to support plugin-like systems.

A very amusingly sad fact is to be noted about the libgmodule library: this is a very tiny lib (9 KB of text size). So, to be able to share 9 KB of readonly data, we waste 4 KB of writable data. Maybe it makes sense to just include the gmodule code inside the standard libglib library. libdl has a similar issue (7 KB of text, note that the wrapper library is bigger than the wrapped one, sigh): maybe some linker magic could be used to include libdl inside libc, it would increase the size of the latter by just 0.5%.

In the mono/Gtk# case, the writable memory mapped was 4.35 MBs. Of these, 1 MB is the stack setup for the thread that runs the finalizers: hardly a few KBs of this memory is ever touched, so the kernel doesn't actually commit this amount of data. There is another megabyte that consists of shared mappings from files, which is used by the io-layer to implement the win32 handle semantics. Usually, very little of this memory should actually be read or written, so it shouldn't matter much, and the io-layer rewrite Dick is working on will hopefully get rid of most of that usage, too.

If we subtract the amount of memory used by the C program itself, we get that the memory actually allocated by mono/Gtk# is about 1.3 MBs; in the perl and python cases it's about 2.5 MBs. The 1.3 MBs include about 200 KB of GC heap size, 128 KB of memory allocated for the jitted code (of which just 20-30 KB are used) and the memory used by the mono runtime data structures: metadata from the IL images, jit bookkeeping etc. Of course we plan to reduce our memory requirements even further, since this is good both for the people using mono in embedded solutions and for the people using multiple mono desktop applications. Hopefully in another six months' time we'll have improvements as good as those we had in the last year :-)
