Older blog entries for dreier (starting at number 23)

Cambridge… England, that is

My tutorial Writing RDMA applications on Linux has been accepted at LinuxConf Europe 2007. I’ll try to give a practical introduction to writing native RDMA applications on Linux — “native” meaning directly to RDMA verbs as opposed to using an additional library layer such as MPI or uDAPL.  I’m aiming to make it accessible to people who know nothing about RDMA, so if you read my blog you’re certainly qualified.  Start planning your trip now!

My presentation is on the morning of Monday, September 3, and I’m flying to England across 7 time zones on Sunday, September 2, so I hope I’m able to remain upright and somewhat coherent for the whole three hours I’m supposed to be speaking….

Syndicated 2007-08-06 16:04:31 from Roland's Blog

Do you feel lucky, punk?

Sun just introduced their Constellation supercomputer at ISC Dresden. They’ve managed to get a lot of hype out of this, including mentions in places like the New York Times. But the most interesting part to me is the 3,456-port “Magnum” InfiniBand switch. I haven’t seen many details, and I couldn’t find anything about it on Sun’s Constellation web site.

However I’ve managed to piece together some info about the switch from the news stories as well as the pictures in this blog entry. Physically, this thing is huge–it looks like it’s about half a rack high and two racks wide. The number 3,456 gives a big clue as to the internal architecture: 3456 = 288 * 12. Current InfiniBand switch chips have 24 ports, and the biggest non-blocking switch one can build with two levels (spine and leaf) is 24 * 12 = 288 ports: 24 leaf switches each of which have 12 ports to the outside and 12 ports to the spines (one port to each of the 12 spine switches).

Then, using 12 288-port switches as spines, one can take 288 24-port leaf switches that each have 12 ports to the outside and end up with 288 * 12 = 3456 ports, just like Sun’s Magnum switch. From the pictures of the chassis, it looks like Magnum has the spine switches on cards on one side of the midplane and the leaf switches on the other side, using the cute trick of having one set of cards be vertical and one set horizontal to get all-to-all connections between spines and leaves without having too-long midplane traces.

All of this sounds quite reasonable until you start to consider putting all of this in one box. Each 288-port switch (which is on one card in this design!) has 36 switch chips on it. At about 30 Watts per switch chip, each of these cards draws over 1 kilowatt, and there are 12 of them in a system. In fact, with 720 switch chips in the box, the total system is well over 20 kW!

It also seems that the switch is using proprietary high-density connectors that bring three IB ports out of each connector, which reduces the number of external connectors on the switch down to a mere 1152.

One other thing I noticed is that the Sun press stuff is billing the Constellation as running Solaris, while the actual TACC page about the Ranger system says the cluster will be running Linux. I’m inclined to believe TACC, since running Solaris for an InfiniBand cluster seems a little silly, given how far behind Solaris’s InfiniBand support is when compared to Linux, whose InfiniBand stack is lovingly maintained by yours truly.

Syndicated 2007-06-28 22:11:48 from Roland's Blog

Soon I will be invincible

I just read Austin Grossman’s novel Soon I Will Be Invincible, and I heartily recommend it as an amusing summer read. It would be perfect for a long trip (sorry I didn’t write this post in time for everyone going to OLS). It’s set in a world where super-powers exist, and interleaves first person chapters narrated by Doctor Impossible (a super-villain, “The Smartest Man in The World”) and Fatale (a rookie superhero, “The Next Generation of Warfare”).

Grossman strikes a nice balance by taking everything seriously enough to capture what’s great about superhero comics, while slipping in enough sly jokes to keep things light. For example, the book starts with Doctor Impossible in jail, and it turns out that the authorities have decided he’s not an evil genius–he just suffers from Malign Hypercognition Disorder.

Syndicated 2007-06-26 03:43:49 from Roland's Blog

Enterprise distro kernels

Greg K-H wrote recently about kernels for “Enterprise Linux” distributions. I’m not sure I get the premise of the article; after all, the whole point of having more than one distro company is that they can compete on the basis of the differences in what they do. So it makes no sense to me to present this issue as something that Red Hat and Novell have to agree on (and it also leaves out Ubuntu’s “LTS” distribution, although I’m not sure if that has taken any of the “enterprise distro” market). Obviously Novell sees a reason for both openSUSE and SLES; why should SLES and RHEL have to be identical?

In fact (although Greg didn’t seem to realize it when he wrote his article), there are already significant differences between the SLES and RHEL kernel updates. SLES has relatively infrequent “SP” releases, where the kernel ABI is allowed to break, while RHEL has update releases roughly every quarter but aims to keep the kernel ABI stable through the life of a full major release.

Greg seems to favor the third proposal in his article, namely rebasing to the latest upstream kernel on every update. However, I don’t think that can work for enterprise distros, for a reason that DaveJ alluded to in his response:

W[ith] each upstream point revision, we fix x regressions, and introduce y new ones. This isn’t going to make enterprise customers paying lots of $ each year very happy.

For a lot of customers, the whole point of staying on an enterprise distro is to stick with something that works for them. No kernel is bug-free and every enterprise distro kernel surely has some awful bugs; what enterprise customers want to avoid are regressions. If SLES10 works for my app on my hardware, then SLES10SP1 better not keel over on the same app and the same hardware because of a broken SATA driver or something like that.

Of course customers often want crazy-sounding stuff, for example, “Give me the 2.6.9 kernel from RHEL4, except I want the InfiniBand drivers from 2.6.21.” (And yes, since I work on InfiniBand a lot, that is definitely a real example, and in fact a lot of effort goes into the “OpenFabrics Enterprise Distribution” (OFED) to make those customers happy.) A kernel hacker’s first reaction to that request is most likely, “Then you should just run 2.6.21.” But if you think some more about what the customers are asking for, it starts to make sense. What they are really saying is that they need the latest and greatest IB features (maybe support for new hardware or a protocol that wasn’t implemented until long after the enterprise kernel was frozen), but they don’t want to risk some new glitch in a part of the kernel where RHEL4’s 2.6.9 is perfectly fine for them.

This is just a special case of Greg’s “support the latest toy” request, and if there were some technical solution for pulling just a subset of new features into an enterprise kernel then that would be great. But as I said before, without a major change in the upstream development process, rebasing enterprise kernels during the lifetime of a major release doesn’t seem to be what customers of enterprise distros want. And I agree with Linus when he says that you can’t slow down development without people losing interest or going off onto a branch that’s too unstable for real users to test. So I don’t think we want to change our development process to be closer to an enterprise distro.

And given how new features often have dependencies on core kernel changes, I don’t see much hope of a technical solution for the “latest toy” problem. In fact the OFED solution of having the community that works on a particular class of new toys do the backporting seems to be about the best we can do for now.

Syndicated 2007-06-24 16:21:45 from Roland's Blog

Are we having, like, a conversation?

Pete replied to my earlier reply and argued that mb() should never be used:

NOOOOOOOOOOO!

In theory, it’s possible to program in Linux kernel by using nothing but mb(). In practice, expirience teaches us that every use of mb() in drivers is a bug. I’m not kidding here. For some reason, even competent and experienced hackers cannot get it right.

I agree that memory barriers should be avoided, which is why I said that “a fat comment about why you need to mess with memory barriers anyway” should always go along with a barrier. However:

  1. When a memory barrier is required, I don’t think that using a spinlock as an obfuscated stand-in is an improvement. If a spinlock is just serving as a memory barrier, then you probably don’t have a clear idea of what data the spinlock is protecting. And that’s going to lead to problems when someone does something like expanding the locked region by moving where the spinlock is taken.
  2. Spinlocks actually don’t save you from having your driver blow up on big machines. The number of people that understand mmiowb() is far smaller than the number of people that understand mb(), but no matter how many spinlocks you lock, you may still need mmiowb() for your driver to work on SGI boxes. (Actually I think this is a problem with the interface presented to drivers: someday when I really do feel like tilting at a windmill, I’ll try to kill off mmiowb().)

It’s funny: when I read Pete’s entry, I did a quick grep of drivers/usb looking for mb() to show that even non-exotic device drivers need memory barriers. And the first example I pulled up at random, in drivers/usb/host/uhci-hcd.c looks like a clear bug:

        /* End Global Resume and wait for EOP to be sent */
        outw(USBCMD_CF, uhci->io_addr + USBCMD);
        mb();
        udelay(4);
        if (inw(uhci->io_addr + USBCMD) & USBCMD_FGR)
                dev_warn(uhci_dev(uhci), "FGR not stopped yet!\n");

The mb() after the outw() does not make sure that the IO reaches the device, since it is probably a posted PCI write which might lurk in a chipset buffer somewhere long after the CPU has retired it. The only way to make sure that the write has actually reached the device is to do a read from the same device to flush the posted write. So the udelay() might expire before the previous write has even reached the device. I guess I’ll be a good citizen and report this on the appropriate mailing list.
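For reference, the usual fix looks something like this (a sketch of the standard posted-write idiom, not an actual patch against uhci-hcd): a read from the same device cannot complete until posted writes ahead of it have reached the hardware, so the read-back guarantees the write has landed before the delay starts.

```c
        /* End Global Resume and wait for EOP to be sent */
        outw(USBCMD_CF, uhci->io_addr + USBCMD);
        inw(uhci->io_addr + USBCMD);    /* read back to flush the posted write */
        udelay(4);
        if (inw(uhci->io_addr + USBCMD) & USBCMD_FGR)
                dev_warn(uhci_dev(uhci), "FGR not stopped yet!\n");
```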

An example of what I consider a proper use of a memory barrier in a device driver is in my masterpiece drivers/infiniband/hw/mthca/mthca_cq.c:

        cqe = next_cqe_sw(cq);
        if (!cqe)
                return -EAGAIN;

        /*
         * Make sure we read CQ entry contents after we've checked the
         * ownership bit.
         */
        rmb();

The device DMAs a status block into our memory and then sets a bit in that block to mark it as valid. If we check the valid bit, then we need an rmb() to make sure that we don’t operate on bogus data because the CPU has speculatively executed the reads of the rest of the status block before it was written and then checked the valid bit after it was set (and yes, this was actually seen happening on the PowerPC 970). I can’t think of any reasonable way to write this with a spinlock. Would I just create a dummy spinlock to take and drop for no reason other than memory ordering?

Of course this is driving exactly the sort of high-speed exotic hardware that Pete talked about in his blog, but I think the principle applies to any hardware that DMAs structures to or from host memory. For example, tg3_rx() in drivers/net/tg3.c seems to have a barrier for similar reasons. If you’re writing a driver for DMA-capable hardware, you better have some understanding of memory ordering.

Syndicated 2007-06-07 23:04:57 from Roland's Blog

If I’m a crusader, where’s my cape?

Pete Zaitcev replied to my earlier post about misuse of atomic variables. I never really thought of myself as a crusader or as particularly quixotic, but I’ll respond to the technical content of Pete’s post. I think the disagreement was with my disparagement of the anti-pattern:

int x;

int foo(void)
{
        int y;

        spin_lock(&lock);
        y = x;
        spin_unlock(&lock);

        return y;
}

Pete is entirely correct that spin_lock() and spin_unlock() have full memory barriers; otherwise it would be impossible for anyone to use spinlocks correctly (and of course there’s still mmiowb() to trip you up when someone runs your driver on a big SGI machine).

However, I still think that using a spinlock around an assignment that’s atomic anyway is at best pretty silly. If you just need a memory barrier, then put in an explicit mb() (or wmb() or rmb()) instead.

Also, Pete is not entirely correct when he says that atomic operations lack memory barriers. All atomic operations that return a value (e.g. atomic_dec_and_test()) do have a full memory barrier, and if you want a barrier to go with an atomic operation that doesn’t return a value, such as atomic_inc(), the kernel does supply a full range of primitives such as smp_mb__after_atomic_inc(). The file Documentation/memory-barriers.txt in the kernel source tree explains all of this in excruciating detail.

In the end the cost of an atomic op is roughly the same as the cost of a spin_lock()/spin_unlock() pair (they both have to do one locked operation, and everything else is pretty much in the noise for all but the most performance-critical code). Spinlocks are usually easier to think about, so I recommend only using atomic_t when it fits perfectly, such as a reference count (and even then using struct kref is probably better if you can). I’ve found from doing code review that code using atomic_t almost always has bugs, and we don’t have any magic debugging tools to find them (the way we have CONFIG_PROVE_LOCKING, CONFIG_DEBUG_SPINLOCK_SLEEP and so on for locks).

By the way, what’s up with mark mazurek? His blog seems to be an exact copy taken from Pete’s blog feed, with the added bonus of adding a comment to the post in my blog that Pete linked to. There are no ads or really anything beyond an exact duplicate of Pete’s blog, so I can’t figure out what the angle is.

Syndicated 2007-06-06 03:17:06 from Roland's Blog

Bike to work day

This past Thursday was Bike to Work Day here in Silicon Valley, and while biking to work I thought about why I really like living in the Bay Area while other people seem to hate it. See, I pretty much never drive to work: I work from home three days a week and on the days where I actually go to the office, I ride my bike or light rail.

One of the main complaints I hear about the Bay Area is the traffic, and I can’t disagree really.  But the simple solution is just to avoid driving.  As I said, I bike to work, and I live downtown where I can walk to almost everywhere else I want to go.  When I do get in my car it’s usually to go to the beach or Tahoe or the redwoods or something like that, and living near that stuff is the whole point of being in the Bay Area.

The other usual complaint about the Bay Area is that it’s too expensive, and I guess I can’t argue with that.  But pay is higher here too, and the advantage of having a 1300-square-foot house is that I don’t have to worry about finding enough stuff to fill my rooms.

Anyway, if you don’t like the Bay Area, please don’t move here (or move away if you’re already here).  We have enough people without you haters and your negative attitude….

Syndicated 2007-05-21 16:32:16 from Roland's Blog

Atomic cargo cults

Cargo cult programming refers to “ritual inclusion of code or program structures that serve no real purpose.” One annoying example of this that I see a lot in kernel code that I review is the inappropriate use of atomic_t, in the belief that it magically wards off races.

This type of bogosity is usually marked by variables or structure members of type atomic_t, which are only ever accessed through atomic_read() and atomic_set() without ever using a real atomic operation such as atomic_inc() or atomic_dec_and_test(). Such programming reaches its apotheosis in code like:

        atomic_set(&head, (atomic_read(&head) + 1) % size);

(and yes, this is essentially real code, although I’ve paraphrased it to protect the guilty from embarrassment).

The only point of atomic_t is that arithmetic operations (like atomic_inc() et al) are atomic in the sense that there is no window where two racing operations can step on each other. This helps with situations where an int variable might have ++i and --i race with each other, since these operations are probably implemented as read-modify-write. So if i starts out as 0, the value of i after the two operations might be 0, -1 or 1, depending on when each operation reads the value of i and in what order they write back the final value.

If you never use any of these atomic operations, then there’s no point in using atomic_set() to assign a variable and atomic_read() to read it. Your code is no more or less safe than it would be just using normal int variables and assignments.

One way to think about atomic operations is that they might be an optimization instead of using a lock to protect access to a variable. Unfortunately I’m not sure if this will help anyone get things right, since I also see plenty of code like:

int x;

int foo(void)
{

        int y;

        spin_lock(&lock);
        y = x;
        spin_unlock(&lock);

        return y;
}

which is the analogous cargo-cult anti-pattern for spinlocks.

Maybe the best rule would be: if the only functions using atomic_t in your code are atomic_set() and atomic_read(), then you need to write a comment explaining why you’re using atomic at all. Since the vast majority of the time, such a comment will be impossible to write, maybe this will cut down on cargo cult programming a bit. Or more likely it will just make code review more fun by generating nonsensical comments for me to chuckle at.

Syndicated 2007-05-13 16:36:01 from Roland's Blog

Let’s try WordPress

I’ve said “so long” to Typo and migrated my blog to WordPress.  WordPress just seems to have more momentum than Typo, even though PHP seems kind of 90s to me.

Anyway, I’ve set up a redirect so the previous RSS feed should continue to work, but you’ll probably want to update your URL if for some strange reason you’ve actually subscribed to my blog’s feed.

Syndicated 2007-05-04 21:10:44 from Roland's Blog
