28 Mar 2014
(updated 3 Apr 2014 at 03:28 UTC)
Really weird issue yesterday trying to move a customer to AWS. Testing was all fine, but when we switched the IP address, the system ground to a halt.
The cause was the NFS server, which became unresponsive, so the web and php5 farms stopped. Using iotop I found out that this was caused by the jbd2 process (the ext4 journalling thread), jbd2/xvda1-8 in my case. jbd2 was basically at 100% I/O. Initially I thought perhaps the instance was faulty, so I built a new NFS server (simply replaying my ansible script). Got exactly the same behaviour on the new server: all I/O ground to a halt as jbd2 took over as soon as I did even simple things like checking out a Drupal repository (so a single client, doing an svn co).
But why would jbd2 kick in? With iostat -x 1 I determined that we were writing 5 MB/s to the root file system. That made no sense. There is nothing on this NFS server that would do that. All data is on separately mounted EBS disks. The root file system is ext4, but all the other file systems were xfs! And the clients only mount the xfs file systems.
Using a suggestion to debug what's going on, I tried:
echo 1 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable
Waited for a minute and then did:
Got a lot of lines like:
nfsd-943  8559086.521147: ext4_sync_file_enter: dev 202,1 ino 30703 parent 30666 datasync 0
nfsd-942  8559086.527871: ext4_sync_file_enter: dev 202,1 ino 30703 parent 30666 datasync 0
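OK, clearly the NFS daemon causes a lot of sync calls (datasync 0 in the trace means these are full fsync() calls rather than fdatasync()), and they all hit the same inode on dev 202,1, the root file system. But why would NFS have an effect on the root file system?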
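A quick way to see where those syncs land is to tally the trace lines per device and inode. This is only a minimal sketch: it assumes the trace output was saved to a file (trace.txt is a made-up name), and summarize_sync_events is a helper I'm inventing here, not something that ships with the kernel tools.

```shell
# Tally ext4_sync_file_enter events per device/inode.
# Reads trace text on stdin, prints "<count> <dev> <ino>", busiest first.
# summarize_sync_events is a hypothetical helper name.
summarize_sync_events() {
    awk '/ext4_sync_file_enter:/ {
        dev = ""; ino = ""
        # The fields look like "dev 202,1 ino 30703 ...": pick the value
        # that follows each keyword instead of relying on column position.
        for (i = 1; i < NF; i++) {
            if ($i == "dev") dev = $(i + 1)
            if ($i == "ino") ino = $(i + 1)
        }
        count[dev " " ino]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}

# Usage (trace.txt is an assumed file name):
#   summarize_sync_events < trace.txt
```

With the hot inode in hand, something like `find / -xdev -inum 30703` should show which file on the root file system nfsd keeps syncing (-xdev keeps find from wandering onto the xfs mounts). When done, `echo 0 > /sys/kernel/debug/tracing/events/ext4/ext4_sync_file_enter/enable` switches the trace event off again.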
After more googling I found this comment:
Problem vanished after fsck'ing my ext4 partitions
Huh? Worth a try. Stopped NFS server, mounted root disk on another server, ran fsck:
# fsck /dev/xvdf
fsck from util-linux 2.20.1
e2fsck 1.42 (29-Nov-2011)
cloudimg-rootfs: clean, 62399/524288 files, 378952/2097152 blocks
and reattached. Problem solved!!
I have no explanation for this behaviour, except that maybe the latest Ubuntu 12.04 LTS AMI has a bad disk.
PS: I now believe I was wrong, see this update