dreier is currently certified at Journeyer level.

Name: Roland Dreier
Member since: 2000-05-04 21:54:41
Last Login: 2007-05-17 02:43:10

FOAF RDF Share This

Homepage: http://www.digitalvampire.org/


I'm currently a software engineer at Cisco. I spend most of my time hacking on the Linux kernel, mostly working on InfiniBand support.


Recent blog entries by dreier

Syndication: RSS 2.0

Some TOEs stink

stinky toes

“Stinky Toes” by ijonas

As you may know, Pure Storage has been shipping iSCSI arrays for a few months.  Recently we got a bug report from a customer: when running heavy IO from a VMware guest, iSCSI IO from the VMware system to a Pure array would stall for minutes at a time.  Fortunately, we were able to get a similar set up in our lab and reproduce the issue, so we could dig in and see what was going on.

The IO was going from the guest to a guest disk image, which is really just a file on the VMware filesystem, so the actual iSCSI initiator was the VMware hypervisor.  We started grabbing packet dumps of the traffic while we reproduced the problem, and noticed a strange pattern of slow retransmissions from the Pure side when the stall occurred.

The test setup had one 10G Ethernet link from the VMware system going through a switch to two 10G ports on the Pure array, so packet loss due to congestion was immediately something we thought of, but we couldn’t understand why we couldn’t recover.  It seemed that when the stall occurred, the Pure side of the TCP connection got stuck with a full send buffer.

The first clue was when we noticed that selective ACK (SACK) was not enabled for our iSCSI TCP connections — and indeed, when we captured the connection establishment, the VMware iSCSI initiator was not advertising SACK in its TCP options.  This was kind of a mystery, because when we did other things such as ssh-ing into the the VMware  command line, it was perfectly happy to set up TCP connections with SACK enabled.

So not having SACK enabled partially explained why we were not able to recover from packet loss very well: we were running with a large TCP window (hundreds of KB) on a high-speed link, and if some packets got dropped, we might send a few hundred more afterwards; without SACK, the VMware system had no way to tell us which packets to retransmit.

However, TCP was behaving even worse than “lack of SACK” could explain.  What we saw in the packet trace was that after a lost packet, our retransmission timer would expire and we would send the packet after the last one that the initiator ACKed (which would be a few hundred packets before the last one that we sent).  The initiator would ACK that packet, which would advance our send window, and so we would send one new packet beyond the last one we sent — right at the end but definitely within our window.

And then the initiator would just ignore that packet!  The way that TCP is supposed to work is that the receiver should either ACK all the way up to that new packet (if our retransmitted packet was the only lost packet, and it now had all the data up to and including our new packet), or it should send a duplicate ACK (“dup ACK”) that re-ACKed the data we already knew it had.  Just ignoring those packets that are within its receive window is completely inexplicable.

The dup ACK is important because enough of them will trigger “fast retransmission” — without that, we’re stuck waiting for a roughly 1/4 second timer to expire between retransmissions, which means it will take way too long to resend a full send window of several hundred packets.  In fact, so long that the inititor just gives up and establishes a new TCP connection after a timeout of a minute or two.

Finally, we realized why the VMware initiator’s TCP behavior was so crazy and primitive.  Fortunately, we had duplicated the customer config and we were using a Broadcom 10G Ethernet adapter with the Broadcom iSCSI offload driver (roughly equivalent to the bnx2i driver in Linux).  This crazy TCP stack wasn’t in the VMware hypervisor — it was running on the network adapter.

(In fact, looking at the Linux kernel sources for bnx2i and cnic, one can see that the Broadcom TCP offload engine apparently has an option “L4_KWQ_CONNECT_REQ1_SACK” for connections, but because the iSCSI initiator driver doesn’t set the “SK_TCP_SACK” flag, it doesn’t get enabled.  One can guess that the Broadcom driver for VMware is probably from a similar codebase, and that kind of explains why we didn’t see SACK enabled)

Once we realized where the problem was coming from, the fix was simple: switch from the Broadcom offload driver to the normal VMware software iSCSI initiator.  Once we did that, performance became pretty stable, just about saturating the 10G Ethernet link, with occasional hiccups of a few seconds when a congestion drop occurred.  (As a side
note, it’s kind of nuts that these days we take it for granted that a storage array can do enough IOPS to get above 1 GB/sec with small random IOs).

In the past I’ve defended TOEs, but in this case the Broadcom NIC and driver aren’t even fully implementing the most primitive form of TCP, so I have to agree that it’s completely unusable.  But it’s worth noting that we tried Chelsio and Emulex adapters with their iSCSI offload drivers, and they worked fine.  I still think TCP and iSCSI offload make  sense because they have a fundamental 3x advantage in memory bandwidth (the NIC puts the data where it’s supposed to go, rather than putting it in some random receive buffer and then having the CPU read it and write it to copy it to where it’s supposed to go)

So I don’t think there’s any broad conclusion that can be drawn beyond the fact that one should really never use Broadcom’s iSCSI offload.


Syndicated 2012-03-17 00:21:24 from Roland's Blog

Do you know everything about RDMA?

As I mentioned on Twitter (by the way, are you following @rolanddreier?), I’ll be speaking at the Linux Foundation Collaboration Summit in San Francisco on April 7.  My general mandate is to give an introduction to RDMA and InfiniBand on Linux, and to talk about recent developments and what might be coming next in the area.  However, I’d like to make my talk a little less boring than my usual talks, so I’d be curious to hear about specific topics you’d like me to cover.  And if you’re at the summit, stop by and say hello.

Syndicated 2011-03-08 04:58:07 from Roland's Blog

Want to work with me?


More seriously, in the past few weeks, my new employer (Pure Storage) has said a little more, and I can now link to a real jobs page.  As you can see from the listings, we’re looking for both kick-ass developers as well people with more QA/tools/scripting skills.  And we definitely are willing to help people fresh out of school learn, as long as they have some experience with Linux.

If you’re interested, you can let me know or just apply directly from the jobs page.  Good luck!

Syndicated 2011-02-17 00:40:45 from Roland's Blog

New testbed installed

Since I changed jobs, I left behind a lot of my test systems, but I now have a couple of test systems set up. Here is the rather crazy set of non-chipset devices I now have in one box:

$ lspci -nn|grep -v 8086:
03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25208 [InfiniHost III Ex] [15b3:6282] (rev 20)
04:00.0 Ethernet controller [0200]: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] [15b3:6750] (rev b0)
05:00.0 InfiniBand [0c06]: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] [15b3:673c] (rev b0)
84:00.0 Ethernet controller [0200]: NetEffect NE020 10Gb Accelerated Ethernet Adapter (iWARP RNIC) [1678:0100] (rev 05)
85:00.0 Ethernet controller [0200]: Chelsio Communications Inc T310 10GbE Single Port Adapter [1425:0030]
86:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev 20)

(I do have a couple of open slots if you have some RDMA cards that I’m missing to complete my collection :) )

Syndicated 2011-02-11 22:56:00 from Roland's Blog

51 older entries...


dreier certified others as follows:

  • dreier certified dmarti as Journeyer
  • dreier certified fusion94 as Journeyer
  • dreier certified bwoodard as Journeyer
  • dreier certified jallison as Master
  • dreier certified dtype as Journeyer

Others have certified dreier as follows:

  • dmarti certified dreier as Journeyer
  • taral certified dreier as Journeyer
  • benson certified dreier as Journeyer
  • dtype certified dreier as Journeyer
  • jallison certified dreier as Journeyer
  • fusion94 certified dreier as Journeyer
  • khazad certified dreier as Journeyer
  • ole certified dreier as Journeyer

[ Certification disabled because you're not logged in. ]

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!

Share this page