RDMA on Converged Ethernet
I recently read Andy Grover’s post about converged fabrics, and since I participated in the OpenFabrics panel in Sonoma that he alluded to, I thought it might be worth sharing my (somewhat different) thoughts.
The question that Andy is dealing with is how to run RDMA on “Converged Ethernet.” I’ve already explained what RDMA is, so I won’t go into that here, but it’s probably worth talking about Ethernet, since I think the latest developments are not that familiar to many people. The IEEE has been developing a few standards they collectively refer to as “Data Center Bridging” (DCB) and that are also sometimes referred to as “Converged Enhanced Ethernet” (CEE). This refers to high speed Ethernet (currently 10 Gb/sec, with a clear path to 40 Gb/sec and 100 Gb/sec), plus new features. The main new features are:
- Priority-Based Flow Control (802.1Qbb), sometimes called “per-priority pause”
- Enhanced Transmission Selection (802.1Qaz)
- Congestion Notification (802.1Qau)
The first two features let an Ethernet link be split into multiple “virtual links” that operate pretty independently — bandwidth can be reserved for a given virtual link so that it can’t be starved, and by having per-virtual-link flow control, we can make sure certain traffic classes don’t overrun their buffers and avoid dropping packets. Then congestion notification means that we can tell senders to slow down to avoid congestion spreading caused by that flow control.
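To make the "virtual link" idea concrete, here is a toy sketch in Python. It is only an illustration under my own assumptions: the function names, the traffic class names, and the share and threshold numbers are all made up, and this is not the algorithm from 802.1Qaz or 802.1Qbb. Real hardware enforces these properties with per-frame schedulers and pause frames, not with arithmetic like this. The first function shows a class keeping its reserved bandwidth while lending out what it doesn't use; the second shows a pause that hits only the priority whose buffer is filling up.

```python
def ets_allocate(link_gbps, shares, demand):
    """Toy model of ETS-style sharing (hypothetical helper, not the standard's
    algorithm): each class gets up to its reserved share of the link, and
    bandwidth a class leaves unused is lent to classes that still want more."""
    # Step 1: every class gets min(its demand, its reserved share).
    alloc = {c: min(demand[c], link_gbps * shares[c]) for c in shares}
    spare = link_gbps - sum(alloc.values())
    hungry = {c for c in shares if demand[c] > alloc[c]}
    # Step 2: hand out leftover bandwidth in proportion to the reserved shares.
    while spare > 1e-9 and hungry:
        total = sum(shares[c] for c in hungry)
        given = 0.0
        for c in list(hungry):
            extra = min(spare * shares[c] / total, demand[c] - alloc[c])
            alloc[c] += extra
            given += extra
            if demand[c] - alloc[c] <= 1e-9:
                hungry.discard(c)
        spare -= given
    return alloc


def paused_priorities(occupancy, xoff):
    """Toy model of per-priority pause: only classes whose receive buffer has
    reached its (made-up) xoff threshold are told to stop sending."""
    return {c for c, used in occupancy.items() if used >= xoff[c]}


if __name__ == "__main__":
    # Assumed example: a 10 Gb/sec link split 50/30/20 between three classes.
    shares = {"lan": 0.5, "fcoe": 0.3, "rdma": 0.2}
    demand = {"lan": 10.0, "fcoe": 1.0, "rdma": 10.0}
    print(ets_allocate(10.0, shares, demand))
    # Only the FCoE class is near its buffer limit, so only it gets paused.
    print(paused_priorities({"lan": 10, "fcoe": 90, "rdma": 5},
                            {"lan": 80, "fcoe": 80, "rdma": 80}))
```

In the example, the LAN class asks for the whole link but the RDMA class still gets at least its reserved 2 Gb/sec, and the bandwidth FCoE isn't using gets split between the other two.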
The main use case that DCB was developed for was Fibre Channel over Ethernet (FCoE). FC requires a very reliable network — it simply doesn’t work if packets are dropped because of congestion — and so DCB provides the ability to segregate FCoE traffic onto a “no drop” virtual link. However, I think Andy misjudges the real motivation for FCoE; the TCP/IP overhead of iSCSI was not really an issue (and indeed there are many people running iSCSI with very high performance on 10 Gb/sec Ethernet).
The real motivation for FCoE is to give users a way to continue using all the FC storage they already have, while not requiring every server that wants to talk to the storage to have both a NIC and an FC HBA. With a gateway that’s easy to build and scale, legacy FC storage can be connected to an FCoE fabric, and now servers with a “converged network adapter” that functions as both an Ethernet NIC and an FCoE HBA can talk to network and storage over one (Ethernet) wire.
Now, of course, for servers that want to do RDMA, it makes sense that they want a triple-threat converged adapter that does Ethernet NIC, FCoE HBA, and RDMA. The way that people are running RDMA over Ethernet today is via iWARP, which runs an RDMA protocol layered on top of TCP. The idea that Andy and several other people in Sonoma are pushing is to do something analogous to FCoE instead, that is, take the InfiniBand transport layer and stick it into Ethernet somehow. I see a number of problems with this idea.
First, one of the big reasons given for wanting to use InfiniBand on Ethernet instead of iWARP is that it’s the fastest path forward. The argument is, “we just scribble down a spec, and everyone can ship it easily.” That ignores the fact that iWARP adapters are already shipping from multiple vendors (although, to be fair, none with support for the proposed IEEE DCB standards yet; but DCB support should soon be ubiquitous in all 10 gigE NICs, iWARP and non-iWARP alike). And the idea that an IBoE spec is going to be quick or easy to write flies in the face of the experience with FCoE. FCoE sounded dead simple in theory (just stick an Ethernet header on FC frames, what more could there be?), but it turns out that the standards work has taken at least 3 years, and a final spec is still not done. I believe that IBoE would be more complicated to specify, and fewer resources are available for the job, so a realistic view is that a true standard is very far away.
Andy points at a TOE page to say why running TCP on an iWARP NIC sucks. But when I look at that page, pretty much all the issues are still there with running the IB transport on a NIC. Just to take the first few on that page (without quibbling about the fact that many of the issues are just wrong even about TCP offload):
- Security updates: yup, still there for IB
- Point-in-time solution: yup, same for IB
- Different network behavior: a hundred times worse if you’re running IB instead of TCP
- Performance: yup
- Hardware-specific limits: yup
And so on…
Certainly, given infinite resources, one could design an RDMA protocol that was cleaner than iWARP and took advantage of all the spiffy DCB features. But worse is better, and iWARP mostly works well right now; fixing the worst warts of iWARP has a much better chance of success than trying to shoehorn IB onto Ethernet and ending up with a whole bunch of unforeseen problems to solve.