Older blog entries for gary (starting at number 243)

Debugging the C++ interpreter

Every so often I find myself wanting to add debug printing to the C++ interpreter for specific methods. I can never remember how I did it the last time and have to figure it out all over again, so here’s how:

diff -r 4d8381231af6 openjdk-ecj/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp
--- a/openjdk-ecj/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp  Tue Apr 21 09:50:43 2009 +0100
+++ b/openjdk-ecj/hotspot/src/share/vm/interpreter/bytecodeInterpreter.cpp  Wed Apr 22 11:13:35 2009 +0100
@@ -555,6 +555,15 @@
          topOfStack < istate->stack_base(),
          "Stack top out of range");

+  bool interesting = false;
+  if (istate->msg() != initialize) {
+    ResourceMark rm;
+    if (!strcmp(istate->method()->name_and_sig_as_C_string(),
+                "spec.benchmarks._202_jess.jess.Rete.FindDeffunction(Ljava/lang/String;)Lspec/benchmarks/_202_jess/jess/Deffunction;")) {
+      interesting = true;
+    }
+  }
   switch (istate->msg()) {
     case initialize: {
       if (initialized++) ShouldNotReachHere(); // Only one initialize call

The trick is getting the fully-qualified name of the method right: the method name contains dots, but the class names in its signature contain slashes. You’re there once you have that down.
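
If it helps, the rule is mechanical: dots in the class and method name, slashes left alone inside the descriptor. A tiny hypothetical helper (it just mirrors the format that name_and_sig_as_C_string() prints; it isn't taken from HotSpot) makes it concrete:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the same fully-qualified form that
// name_and_sig_as_C_string() prints -- dots in the class and method
// name, but untouched slash-separated class names in the descriptor.
std::string full_name(std::string klass,
                      const std::string& method,
                      const std::string& descriptor) {
  for (char& c : klass)      // internal form uses '/', printed form '.'
    if (c == '/') c = '.';
  return klass + "." + method + descriptor;
}
```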

Syndicated 2009-04-22 10:27:19 from gbenson.net

I’m not dead

I haven’t blogged for a while. I’ve been working on Shark’s performance, walking through the native code generated for critical methods and looking at what’s happening. There are several cases where I can see that some piece of code is unnecessary, but translating that into a way that Shark can see it’s unnecessary is non-trivial. I’m thinking I may need to separate the code generation, adding an intermediate layer between the typeflow and the LLVM IR so I can add things which are maybe necessary and then remove them if not. It all seems a bit convoluted — bytecode → typeflow → new intermediate → LLVM IR → native — but the vast bulk of Shark’s time is spent in the last step, so a bit more overhead to create simpler LLVM IR should speed up compilation as well as the runtime.

None of this has been particularly bloggable, but I wanted to point out two exciting things that are happening in Shark land. Robert Schuster and Xerxes Rånby have been busy getting Shark to run on ARM, and Neale Ferguson has started porting LLVM to zSeries with the intention of getting Shark running there. I expected to see Shark on ARM sooner or later, but Shark on zSeries came completely out of the blue. I’m really looking forward to seeing that happen!

Syndicated 2009-04-08 10:31:46 from gbenson.net

Good news and bad news

Bad news first. The drop in speed between the Zeros in IcedTea6 1.3 and 1.4 doesn’t seem to come from Zero itself. I did a build of IcedTea6 1.4 with everything in ports/hotspot/src/*cpu reverted to 1.3, and the speed loss remained. It must be something to do with the newer HotSpot, or some other patch that got added or changed. I don’t really want to spend any more time on this than I have, so we’ll just have to live with it.

I’ve not come to any conclusions as to the difference in speed between the native-layer C++ interpreter and Zero either. It’s not the unaligned access stuff I mentioned: I ran some benchmarks, but the results were ambiguous. It may be libffi, but again, I don’t want to spend more time on this…

The good news is that I’ve been checking the Zero sources for SCA cover, emailing various people, and there’s only one tiny easily-removable bit I’m unsure about. I spent the morning preparing and submitting the first of the patches that will be required, the core build patch, which will hopefully be well received.

Syndicated 2009-02-20 14:53:12 from gbenson.net

18 Feb 2009 (updated 19 Feb 2009 at 15:30 UTC) »


Advogato can't display this entry, read it here instead.

Syndicated 2009-02-18 17:19:29 from gbenson.net

State of the world

Well, in case you missed it, Zero passed the TCK! Specifically, the latest OpenJDK packages in Fedora 10 for 32- and 64-bit PowerPC passed the Java SE 6 TCK and are compatible with the Java SE 6 platform. I’ve been working toward this since sometime in November — the sharp-eyed amongst you may have noticed the steady stream of obscure fixes I’ve been committing — and the final 200 hour marathon finished at 5pm on FOSDEM Saturday, less than 24 hours before my talk. It was pretty stressful, and I took the week off to recover!

Needless to say, none of this could have happened without the rest of the OpenJDK team here at Red Hat getting it to pass on the other platforms. Special thanks must go to Lillian for managing the release. She got the blame for a lot of what went wrong, and it’s only fair she should get the credit for what went right.

Of course, all of this wasn’t just so I’d have something exciting to announce at FOSDEM. In a way it validates the decision we took at Red Hat to focus on Zero rather than using Cacao or another VM. By using as much OpenJDK code as possible — Zero builds are 99% HotSpot — we get as much OpenJDK goodness as possible, including the “correctness” of the code. Zero’s speed can make it a standing joke, but I’d like to use these passes to emphasize that Zero isn’t just a neat hack — it’s production quality code that hasn’t been optimized yet. I’ve written fast code and I’ve written correct code, and in my experience it’s easier in the long run to make correct code fast than it is to make fast code correct. The TCK isn’t everything, naturally, but the fact that it’s possible to pass it using Zero builds gives us a firm foundation for future work.

So, what now for me? Well, in the medium term I want to restart work on Shark, but there’s a couple of things for Zero I want to look at while they’re fresh in my mind. The first is speed. As an interpreter Zero will never be “fast”, but in my FOSDEM slides I used some of Andrew Haley’s benchmarks that show Zero as significantly slower than the template interpreter on x86_64. Furthermore, Robert Schuster mentioned in his talk that the Zero in IcedTea 1.4 was significantly slower than the Zero in IcedTea 1.3. I’m not going to spend a great deal of time on it, but I’d like to do a bit of benchmarking and profiling to check that nothing stupid is happening.

The other thing I want to do for Zero is to get it into upstream HotSpot. This is going to require a lot of non-fun stuff — tidying, a bit of rethinking, and an SCA audit.

Finally, Inside Zero and Shark, the articles I’ve been writing. I didn’t mention it at the time, but I was writing them while my TCK runs were in progress, to keep me sane! I do plan to continue them, but they’ll likely be a little more sporadic now I’m starting the fun stuff again. Watch this space!

Syndicated 2009-02-16 13:34:44 from gbenson.net

Inside Zero and Shark: The call stub and the frame manager

Now that we have all that stack stuff out of the way we can get into the core of Zero itself, the replacements for the parts of HotSpot that were originally written in assembly language.

I’ve mentioned already that the bridge from the VM into Java code is the call stub. In Zero, this is the function StubGenerator::call_stub, in stubGenerator_zero.cpp, and its job is really simple. In the previous article I explained how when a method is entered it finds its arguments at the top of the stack, at the end of the previous method’s frame. Well, for the first frame of all there is no previous frame, so the call stub’s job is to create one. If you look in the crash dump I linked in the previous article, you’ll see that right at the bottom of the stack trace is a short frame that’s different from all the others. This is the entry frame, the frame the call stub made. You can see the code that built it in EntryFrame::build, right at the bottom of stubGenerator_zero.cpp.

Once the entry frame is created, the call stub invokes the method by jumping to its entry point. If the method we’re calling hasn’t been JIT compiled then the entry point will be pointing at one of the interpreter’s method entries. These, along with the call stub, are the bits that are written in assembly language in classic HotSpot.

There are several different method entries — Zero has five, classic HotSpot has fourteen! — but most of this is optimization, and you can do pretty much everything with just two: an entry point for normal (bytecode) methods, and an entry for native (JNI) methods. In this article I’m going to talk about the normal entry.

In the C++ interpreter, the normal entry is split into two parts. The larger of the two is the bytecode interpreter. This is written in C++, the function BytecodeInterpreter::run in bytecodeInterpreter.cpp, and it does the bulk of the work. The other part is the frame manager. In non-Zero HotSpot this is written in assembly language, and it handles the various stack manipulations that cannot be performed from within C++. Zero’s frame manager is, of course, written in C++; it’s the function CppInterpreter::normal_entry in cppInterpreter_zero.cpp.

The frame manager performs tasks for the bytecode interpreter, so you might expect the bytecode interpreter to call the frame manager wherever it needs to adjust the stack. It can’t work this way, however; the interleaved stack in classic HotSpot means that once you’re inside the bytecode interpreter the bytecode interpreter’s ABI frame lies on top of the stack, blocking any access to the Java frames beneath. To cope with this, the code is essentially written inside out, with the frame manager calling the bytecode interpreter, and the bytecode interpreter returning to the frame manager whenever it needs something done.

The way it works is this. On entering a method, we start off in the frame manager. The frame manager extends the caller’s frame to accommodate any extra locals, then creates a new frame for the callee. The frame manager then calls the bytecode interpreter with a method_entry message.

Now we’re inside the bytecode interpreter, which executes bytecodes one-by-one until it reaches something it cannot handle. Say it arrives at a method call instruction. In classic HotSpot, the bytecode interpreter’s ABI frame is blocking the top of the stack, so if the bytecode interpreter were to handle the call itself the callee wouldn’t be able to extend its caller’s frame to accommodate its extra locals. The frame manager has to handle this, so the bytecode interpreter returns with a call_method message.

Now we’re back in the frame manager again; the bytecode interpreter’s frame has been removed, and the Java frame is at the top of the stack. The arguments to the call were set up by the bytecode interpreter, so all the frame manager has to do is jump to the callee’s entry point. When the callee returns, the frame manager returns control to the bytecode interpreter by calling it with a method_resume message, and the bytecode interpreter continues from where it was when it issued the call_method.

This process is repeated every time a method call is required. Once the bytecode interpreter is finished with a method, it returns to the frame manager with a return_from_method or a throwing_exception message. The frame manager then removes the method’s frame, copies the result into its caller’s frame if necessary, and returns to its caller.
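
The control flow is easier to see in miniature. Here’s a toy sketch (plain C++, nothing from the real sources; every name except the messages is made up) of a frame manager that keeps calling a pretend bytecode interpreter until it’s done:

```cpp
#include <cassert>
#include <vector>

// Toy model of the inside-out control flow: the frame manager calls the
// bytecode interpreter, and the interpreter *returns* whenever it needs
// a frame operation performed on its behalf.
enum Message { method_entry, method_resume, call_method, return_from_method };

struct Frame { int callees_started = 0; };

// Stand-in for BytecodeInterpreter::run: "executes" until it either
// needs to call another method or is finished with this one.
Message interpret(Frame& frame, int depth) {
  if (depth > 0 && frame.callees_started++ == 0)
    return call_method;          // please invoke a callee for me
  return return_from_method;     // finished with this method
}

// Stand-in for CppInterpreter::normal_entry: builds a frame, then
// bounces messages back and forth with the interpreter.
void normal_entry(int depth, std::vector<int>& trace) {
  Frame frame;                             // frame manager creates the frame
  trace.push_back(depth);
  Message msg = interpret(frame, depth);   // method_entry
  while (msg == call_method) {
    normal_entry(depth - 1, trace);        // jump to the callee's entry
    msg = interpret(frame, depth);         // method_resume
  }
  // return_from_method: pop the frame, hand the result to the caller
}
```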

The frame manager exists because the bytecode interpreter’s frame blocks the stack in classic HotSpot. In Zero, Java frames live on the Zero stack, which is separate from the ABI stack. Why then does Zero need a separate frame manager? The answer is that it doesn’t — it would be perfectly possible to rewrite the bytecode interpreter to stand alone. That, however, is the issue: you’d have to rewrite the bytecode interpreter, making significant modifications to the existing HotSpot code. That runs counter to the design philosophy of Zero, which aims for it to slot into HotSpot with minimal modification. It could be done, but we didn’t do it.

That pretty well sums up the normal entry. Next time I’ll talk about the other essential method entry, the one that handles JNI methods.

Syndicated 2009-02-02 13:49:28 from gbenson.net

Inside Zero and Shark: HotSpot’s stacks

Now that we understand what Java is expecting of the stack we can take a look at how HotSpot and Zero implement it. It would, of course, be perfectly possible to implement a stack exactly as described, but in practice that’s not how it’s done.

The first difference is pretty straightforward. Remember I said that when a method is called the interpreter pops its arguments from the stack and copies them to the callee’s local variables? Well, all that copying would be pretty inefficient, so HotSpot simply doesn’t bother. Imagine you’re executing some method. You’ve just pushed three values onto the stack, value_a, value_b and value_c, and you’re about to invoke a method that takes two arguments:

value_c
value_b
value_a

When it enters the callee, rather than popping the arguments, it leaves them where they are:

value_c   local[1]
value_b   local[0]
value_a

If the callee has more locals than it has arguments (say this method needs four) then extra locals (set to zero) will be pushed onto the stack:

   0      local[3]
   0      local[2]
value_c   local[1]
value_b   local[0]
value_a

Execution continues as normal after that, with values being pushed onto the stack after the locals. As methods call methods call methods call methods, the stack becomes split into layers, with individual methods’ stacks interleaved with blocks of local variables. When a method returns, everything up to and including local[0] is popped. If a method is to return a value, then that will be popped before everything else and pushed back onto the stack afterwards. We’ve exchanged copying the arguments for copying the result, a good tradeoff given that methods can have many arguments but only one result.

So far so good, but there’s another difference. The stack I’ve been talking about until now is the stack of the Java language. In HotSpot this is variously referred to as the Java expression stack, the Java stack or the expression stack. But HotSpot is a program in itself, and it has a stack of its own, the ABI stack or native stack. This is the stack that C and C++ functions use to store their own bits and pieces on. HotSpot was originally written for i386, a platform notoriously starved of registers, and rather than maintaining two separate stack pointers the HotSpot engineers decided to store the Java stack on the ABI stack and save a register. Each time a Java method is invoked, the native code that executes it needs to store some state on the ABI stack, so chunks of Java stuff end up interleaved with chunks of ABI stuff between each method’s local variables and its part of the expression stack.

This tinkering with the ABI stack is one of the two reasons the C++ interpreter in HotSpot required a layer written in assembler — you don’t have that kind of access to the ABI stack in C++. Zero, of course, is written in C++, and doesn’t have that access; Zero maintains a separate stack, an instance of the ZeroStack class from stack_zero.hpp. That could have consigned this interleaving to a side-note from history, but sadly the C++ interpreter expects to find its state information stored between its local variables and its expression stack. Rather than rewriting the C++ interpreter, Zero interleaves too. It’s the path of least resistance.

You can see what I mean in this crash dump that — congratulations! — you are now qualified to understand. The trace is split into frames, with each frame representing one method invocation. The deepest frame is at the top, so here:

called java.util.concurrent.ThreadPoolExecutor$Worker.run
called java.util.concurrent.ThreadPoolExecutor.runWorker
called sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run
called sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0
called java.net.Socket.getInputStream — which crashed.

The top frame’s expression stack is at 0xd0ffe7e4-0xd0ffe7ec, and its local variables are at the start of the next frame down, at 0xd0ffe83c-0xd0ffe848. Between them is the C++ interpreter’s state (at 0xd0ffe7f0-0xd0ffe830) and two words which the stack walker uses to figure out where everything is.

I’m nearly finished, but there’s one final thing. The Java Virtual Machine Specification defines the sizes of the various types in terms of the number of stack slots or local variable slots they occupy: long and double values take two slots, and everything else takes one. If a method’s arguments are an int, an Object, a long and an int, its local variables on entry will look like this:

int       local[4]
long      local[3]
Object    local[1]
int       local[0]

This is pretty straightforward, aside from the fact that the long is officially in local[2] but its address is actually the address of local[3]. The problem arises when you’re using a 64-bit machine — the 64-bit Object pointer has been allocated the same number of slots as a 32-bit int. On 64-bit platforms, therefore, stack slots need to be 64 bits wide, which wastes space, and leaves us the choice of where in the slot to put non-Object types. The various classic HotSpot ports do this in different ways, but on Zero everything is accessed by slot number, so values are positioned such that they start at the address of the start of the slot. This means the calculation is the same regardless of whether the machine is 32- or 64-bit, and makes the majority of this stuff transparent. The same local variable array on 64-bit Zero looks like this:
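
The slot-counting rule itself is mechanical enough to write down. A little sketch (a hypothetical helper, with the types given as plain strings) that computes each argument’s official local index:

```cpp
#include <cassert>
#include <string>
#include <vector>

// The slot-counting rule: long and double occupy two slots, everything
// else (including 64-bit Object pointers on Zero) occupies one.
std::vector<int> local_indices(const std::vector<std::string>& types) {
  std::vector<int> index;
  int slot = 0;
  for (const std::string& t : types) {
    index.push_back(slot);                          // official local[n]
    slot += (t == "long" || t == "double") ? 2 : 1; // wide types take two
  }
  return index;
}
```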

int       local[4]
long      local[3]
Object    local[1]
int       local[0]

I’ll shut up about stacks now!

Syndicated 2009-01-30 10:17:26 from gbenson.net

Inside Zero and Shark: The Java stack

This article will be a little generic — nothing about HotSpot, nothing about Zero — but before we can understand Zero’s calling convention we need to go up a level and understand the calling convention of Java itself, in which arguments and results are passed on the stack. Let’s have a look at an example to see how that works:

class HelloUser {
  public static void main(String[] args) {
    System.out.print("Hello ");
    System.out.println(System.getProperty("user.name"));
  }
}

We’re going to have to disassemble it to see what’s happening:

public static void main(java.lang.String[]);
    0:  getstatic       [Field java/lang/System.out:Ljava/io/PrintStream;]
    3:  ldc             [String "Hello "]
    5:  invokevirtual   [Method java/io/PrintStream.print:(Ljava/lang/String;)V]
    8:  getstatic       [Field java/lang/System.out:Ljava/io/PrintStream;]
   11:  ldc             [String "user.name"]
   13:  invokestatic    [Method java/lang/System.getProperty:(Ljava/lang/String;)Ljava/lang/String;]
   16:  invokevirtual   [Method java/io/PrintStream.println:(Ljava/lang/String;)V]
   19:  return

The getstatic instruction gets a value from a static field of a class (in this case the out field of the System class) and pushes it onto the stack. The ldc instruction loads a constant (the string "Hello ") and pushes that onto the stack. So far we have this:

Before 0: getstatic      (empty)
Before 3: ldc            System.out
Before 5: invokevirtual  System.out, "Hello "

The next instruction is an invokevirtual, which is going to call the method java.io.PrintStream.print. This takes two arguments, the implicit argument this, and the string to print, so the interpreter pops two values from the stack, stores them as the callee’s first two local variables, and starts to execute the callee. When the callee returns the stack will be empty:

Before 8: getstatic      (empty)

We now have another getstatic and another ldc:

Before 11: ldc           System.out
Before 13: invokestatic  System.out, "user.name"

The next instruction is an invokestatic, another method call. This is calling java.lang.System.getProperty, which takes only one argument, the name of the property to get (static methods have no this). Presently there are two values on the stack, but the interpreter doesn’t care about that. It simply pops the top value from the stack, stores it as the callee’s first local variable, and starts to execute the callee. This time, however, the callee returns a value, the user’s name, so when it returns it will have pushed that onto the stack:

Before 16: invokevirtual  System.out, <the user's name>

Now we’re ready for the final call, another invokevirtual. That extra value on the stack may have seemed odd before, but now it makes sense; it’s the first argument for this call! The interpreter pops two values from the stack, stores them as the callee’s first two local variables, and starts to execute the callee. This method returns nothing, so when the callee returns the stack will be empty. HelloUser.main returns nothing, so the stack is now exactly as it should be for us to execute the return instruction:

Before 19: return        (empty)
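
If you want to convince yourself the diagrams add up, here’s a toy replay (plain C++, nothing from a real JVM) of main’s bytecodes as (pops, pushes) pairs, recording the operand stack depth before each instruction:

```cpp
#include <cassert>
#include <vector>

// Replays HelloUser.main's bytecodes as (pops, pushes) pairs and
// records the operand stack depth *before* each instruction executes.
std::vector<int> depths_before() {
  struct Insn { int pops, pushes; };
  const Insn code[] = {
    {0, 1},  //  0: getstatic      pushes System.out
    {0, 1},  //  3: ldc            pushes "Hello "
    {2, 0},  //  5: invokevirtual  print(String): pops this + arg, void
    {0, 1},  //  8: getstatic      pushes System.out
    {0, 1},  // 11: ldc            pushes "user.name"
    {1, 1},  // 13: invokestatic   getProperty: pops 1 arg, pushes result
    {2, 0},  // 16: invokevirtual  println(String): pops this + arg, void
    {0, 0},  // 19: return
  };
  std::vector<int> before;
  int depth = 0;
  for (const Insn& i : code) {
    before.push_back(depth);
    depth += i.pushes - i.pops;
  }
  return before;
}
```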

Next time we’ll see how all this works in HotSpot and Zero.

Syndicated 2009-01-29 12:54:46 from gbenson.net

Inside Zero and Shark: Calling Conventions and The Call Stub

JavaCalls::call is merely a thin wrapper around JavaCalls::call_helper, so let’s have a look in there (they’re both in javaCalls.cpp). The interesting part starts when the JavaCallWrapper is created. JavaCallWrapper’s constructor manages the transition to _thread_in_Java, amongst other things, and its destructor manages the transition back to _thread_in_vm, so the whole of that block will be _thread_in_Java. This idiom of using an object whose constructor and destructor manage things is a common one in HotSpot; the apparently unused HandleMark created directly after the JavaCallWrapper is another example of this.
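
For anyone unfamiliar with the idiom, here’s a minimal standalone illustration (made-up class names, not HotSpot’s) of state managed by a constructor/destructor pair:

```cpp
#include <cassert>

// Minimal illustration of the RAII idiom: the constructor performs the
// transition to _thread_in_Java, the destructor restores _thread_in_vm,
// so the state is tied to the lifetime of a stack-allocated object.
enum ThreadState { _thread_in_vm, _thread_in_Java };

struct Thread { ThreadState state = _thread_in_vm; };

class StateTransitionGuard {   // hypothetical name; HotSpot's is JavaCallWrapper
  Thread& _thread;
 public:
  explicit StateTransitionGuard(Thread& t) : _thread(t) {
    _thread.state = _thread_in_Java;   // transition on entry to the block
  }
  ~StateTransitionGuard() {
    _thread.state = _thread_in_vm;     // transition back, even on early exit
  }
};
```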

Ok, so now we’re _thread_in_Java, and it’s time to execute some Java code. The call to the call stub is the bit that does that, but before we look at that it’s interesting to skip forward a little, to look at what happens before and after the HandleMark and JavaCallWrapper are destroyed. Immediately before the blocks close is this:

// Preserve oop return value across possible gc points
if (oop_result_flag) {
  thread->set_vm_result((oop) result->get_jobject());
}

and immediately after the blocks close is this:

// Restore possible oop return
if (oop_result_flag) {
  result->set_jobject((jobject) thread->vm_result());
}

If the Java code called by the call stub returned an object (a java.lang.Object) then a pointer to that object will now be in result — and it’s an oop. The destructors of both HandleMark and JavaCallWrapper contain code that can GC, so these blocks of code are needed to protect that oop. Here, rather than using a handle, the result is protected by being stored in the thread, in a location the GC knows to check and update.

Back to the call stub. What is it? Well, in what I’ll call “classic” HotSpot (where everything from here on in is written in assembly language) every methodOop has a pair of entry points: pointers to the native code that actually executes the method. When a method has been JIT compiled these entry points will point at the JIT compiled code; for interpreter code they will point to some location within the interpreter. The reason there are two entry points is that the interpreter passes arguments and return values in a different manner to compiled code; the interpreter uses a different calling convention from the compiled code. If a method is compiled then its compiled entry point (the entry point that will be called by compiled code) will point directly at the compiled code, but its interpreted entry point will point to the i2c adaptor, which translates from the interpreter calling convention to the compiler calling convention and then jumps to the compiled entry point. Interpreted methods have similar treatment: their interpreted entry point points to the part of the interpreter responsible for executing that method, and their compiled entry point will point to the c2i adaptor.

What does this have to do with the call stub? Well, the call stub is the interface between VM code and the interpreter calling convention. It takes a C array of parameters and copies them to the locations specified by the interpreter calling convention. Then it invokes the method, by jumping to its interpreted entry point. Finally, it copies the result from the location specified by the interpreter calling convention to the address supplied by JavaCalls::call_helper.
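
In miniature, the call stub’s job looks something like this (a toy sketch with int arguments; the real thing deals in JavaCallArguments, oops and entry frames):

```cpp
#include <cassert>
#include <vector>

// Toy version of the call stub's job: copy a C array of parameters into
// the callee's locals (the interpreter calling convention), invoke the
// entry point, and hand the result back to the VM caller.
typedef int (*EntryPoint)(std::vector<int>& locals);

int call_stub(EntryPoint entry, const int* params, int nparams) {
  // Build the "entry frame": the parameters become the callee's locals.
  std::vector<int> locals(params, params + nparams);
  return entry(locals);   // jump to the method's entry point
}
```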

You’ll notice this description has been with reference to classic HotSpot. Zero and Shark are mostly the same, but there are two significant differences. Firstly, the reason classic HotSpot has two calling conventions is an optimization. The interpreter calling convention gives better performance in the interpreter, the compiler calling convention gives better performance in the compiler, and the difference is enough to more than offset the overhead of using adaptors for bridging. In Zero and Shark, the limits of what can be done in C++ and with LLVM constrain the design of the calling convention such that having different ones doesn’t really make sense. So — for now, at least — Shark code also uses the interpreter calling convention, and the compiled entry point is never set or used. In Zero and Shark there is only “the calling convention”.

The second difference is that Shark methods require a bit of extra information to execute. Compiled methods need to be able to tell HotSpot where they are in the code at certain times, and in classic HotSpot this is done by looking at the PC. LLVM doesn’t allow us access to this — even if it did, it wouldn’t make much sense — so Shark compiled methods feed HotSpot faked PCs. To do this, each method needs to know where HotSpot thinks the compiled code starts, so in Zero, entry points are not pointers to code but pointers to ZeroEntry or SharkEntry objects. The real entry point is stored within those.

Next time, some details about the calling convention, and some stuff about stacks.

Syndicated 2009-01-28 08:49:21 from gbenson.net

Inside Zero and Shark: Handles and Oops, Traps and Checks

You’re about to run the important enterprise application “Hello World”. What’s going to happen?

class HelloWorld {
  public static void main(String[] args) {
    System.out.println("Hello world!");
  }
}

After initializing itself, HotSpot will create a new Java thread. This will initially be _thread_in_vm because it’s running VM code. Eventually it will call JavaCalls::call (in javaCalls.cpp) to bridge from VM code to Java code. Before we can look at what JavaCalls::call does, however, we need to understand a couple of HotSpot conventions. Look at its prototype:

void JavaCalls::call(JavaValue* result, methodHandle method, JavaCallArguments* args, TRAPS);

The first things we need to understand are handles and oops. All Java objects, and in fact all objects in HotSpot managed by the garbage collector, are oops, and when you’re dealing with oops you need to keep the garbage collector in mind. More specifically, you need to know where in your code the GC might run, because when it does run you need to have told it the location of every single oop you’re using, and when it returns you need to deal with the fact that your oops have probably moved. If your C compiler has optimized your code such that an oop is in a register then the oop in that register is now wrong, and you’re going to crash pretty soon.

Dealing with raw oops is hard, but luckily there are ways of protecting them. In VM code — when you’re _thread_in_vm — the protection of choice is to use handles. A handle wraps an oop, managing access to it such that GC activity becomes transparent. If you’re in VM code and you’re using handles then you don’t have to worry. But you do need to know what’s happening, because if you see some code that’s calling methodHandle methods and you grep the OpenJDK tree to find the methodHandle class definition you will not find it. The methods you are looking for are actually the methods of the methodOopDesc class (in methodOop.hpp).

The other thing we need to understand in that prototype is the mysterious TRAPS at the end. It’s kind of a note to the programmer: functions that trap are functions that can throw Java exceptions. When you call them you use CHECK as their final argument for a convenient exception check:

JavaCalls::call(result, method, args, CHECK);

TRAPS and CHECK are defined in exceptions.hpp. You may wish to avert your eyes:

#define TRAPS   Thread* THREAD
#define CHECK   THREAD); if (HAS_PENDING_EXCEPTION) return; (0

Now we can see how HotSpot handles exceptions: they’re simply stored in the thread. Code that cares can access the exception using these guys:

#define PENDING_EXCEPTION       (((ThreadShadow*)THREAD)->pending_exception())
#define HAS_PENDING_EXCEPTION   (((ThreadShadow*)THREAD)->has_pending_exception())
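
To see the machinery in action outside HotSpot, here’s a self-contained mimic (same macro trick, toy Thread class; none of this is the real exceptions.hpp) you can compile and poke at:

```cpp
#include <cassert>
#include <string>

// Self-contained mimic of the TRAPS/CHECK machinery quoted above: the
// "exception" is just a field on the thread, and CHECK expands so the
// caller returns immediately if the callee left one pending.
struct Thread { std::string pending_exception; };

#define TRAPS Thread* THREAD
#define HAS_PENDING_EXCEPTION (!THREAD->pending_exception.empty())
#define CHECK THREAD); if (HAS_PENDING_EXCEPTION) return; (0

void may_throw(bool fail, TRAPS) {
  if (fail) THREAD->pending_exception = "java.lang.RuntimeException";
}

void caller(bool fail, bool* reached_end, TRAPS) {
  may_throw(fail, CHECK);   // expands to an early return on exception
  *reached_end = true;      // only reached if nothing was thrown
}
```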

Next time I really will explain how method invocation works…

Syndicated 2009-01-27 11:11:18 from gbenson.net
