BTRFS Status April 2014
Since my blog post about BTRFS in March [1] not much has changed for me. Until yesterday I was using 3.13 kernels on all my systems and dealing with the occasional kmail index file corruption problem.
Yesterday my main workstation ran out of disk space and went read-only. I started a BTRFS balance, which didn’t seem to be doing any good because most of the space was actually in use, so I deleted a bunch of snapshots. Then my X session aborted (some problem with KDE or the X server – I’ll never know as logs couldn’t be written to disk). I rebooted the system and had kernel threads go into infinite loops with repeated messages about a lack of response for 22 seconds (I should have photographed the screen). When it got into that state the ALT-Fn keys to change virtual consoles sometimes worked but nothing else did – the terminal usually didn’t respond to input.
To try to stop the kernel from entering an infinite loop on every boot I used “rootflags=skip_balance” on the kernel command line to stop it from continuing the balance, which made the system usable for a little longer. Unfortunately the skip_balance mount option doesn’t permanently apply: the kernel will keep trying to balance the filesystem on every mount until a “btrfs balance cancel” operation succeeds. But my attempts to cancel the balance always failed.
When I booted my system with skip_balance it would sometimes free some space from the deleted snapshots; after two good runs I got to 17G free. But after that, every time I rebooted it would report another gigabyte or two free (according to “btrfs filesystem df”) and then hang without committing the changes to disk.
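For reference, the sequence of commands involved can be sketched as below. This is only an illustration of the steps described above, not a tested recovery procedure: the device name and mount point are hypothetical placeholders, and the function only builds the command lines rather than running them (they would need root).

```python
import shlex


def rescue_commands(device, mountpoint):
    """Return the argv lists for the steps described in the text.

    `device` and `mountpoint` are hypothetical placeholders; nothing is
    executed here.
    """
    return [
        # Mount without resuming the interrupted balance.
        ["mount", "-o", "skip_balance", device, mountpoint],
        # Check whether a balance is still recorded for the filesystem.
        ["btrfs", "balance", "status", mountpoint],
        # Try to cancel it so that later mounts don't resume it.
        ["btrfs", "balance", "cancel", mountpoint],
        # See how much space is reported as allocated and used.
        ["btrfs", "filesystem", "df", mountpoint],
    ]


if __name__ == "__main__":
    # Print the commands instead of running them.
    for argv in rescue_commands("/dev/sdX2", "/mnt"):
        print(shlex.join(argv))
```

In my case the cancel step was the one that kept failing, so treat this as a description of what I tried rather than a recipe that is known to work.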
I solved this problem by upgrading my USB rescue image to kernel 3.14 from Debian/Experimental and mounting the filesystem from the rescue image. After letting kernel 3.14 work on the filesystem for a while it was in a state where I could use it with kernel 3.13 and then boot the system normally to upgrade it to kernel 3.14.
I had a minor extra complication due to the fact that I was running “apt-get dist-upgrade” at the time the filesystem went read-only, so the dpkg records of which packages were installed were a bit messed up. But that was easy to fix by running a diff against /var/lib/dpkg/info on a recent snapshot. In retrospect I should have copied from an old snapshot of the root filesystem, but I fixed the problems faster than I could think of better ways to fix them.
When running a balance the system had a peak IO rate of about 30MB/s reads and 30MB/s writes. That compares to the maximum contiguous file IO speed of 260MB/s for reads and 320MB/s for writes. During that time about 50% of the CPU time on my Q8400 quad-core CPU was in use. So far the only tasks that I do regularly which have CPU speed as a significant bottleneck are BTRFS filesystem balancing and recoding MP4 files. Compiling hasn’t been an issue because recently I haven’t been compiling many programs that are particularly big.
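To put those numbers in proportion, here is a quick calculation of the balance IO rate as a fraction of the contiguous IO speed. It uses only the figures quoted above; the function name is just for illustration.

```python
def fraction_of_peak(observed_mb_s, peak_mb_s):
    """Return the observed throughput as a fraction of the peak throughput."""
    return observed_mb_s / peak_mb_s


# Figures from the text: 30MB/s during balance vs 260MB/s and 320MB/s contiguous.
read_fraction = fraction_of_peak(30, 260)
write_fraction = fraction_of_peak(30, 320)

if __name__ == "__main__":
    print(f"reads:  {read_fraction:.1%} of contiguous read speed")
    print(f"writes: {write_fraction:.1%} of contiguous write speed")
```

So the balance was running at roughly a tenth of what the SSD can do for contiguous IO.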
Lessons Learned
I should photograph the screen regularly when doing things that won’t be logged; those kernel error messages might have been useful to me or someone else.
The fact that the only kernel that runs BTRFS the way I need comes from the Experimental repository in Debian stands in contrast to the recent kernel patch that stops describing BTRFS as experimental. While I have a high opinion of the people who provide support for the kernel in commercial distributions and their ability to back-port fixes from newer kernels I’m concerned about their decision to support BTRFS. I’m also dubious about whether we can offer BTRFS support in Debian/Jessie (the next version of Debian) without a significant warning. OTOH if you find yourself with a BTRFS system that isn’t working well you could always hire me to fix it. I accept payment via Paypal, bank transfer, or Bitcoin. If you want to pay me in Grange then I assure you I will never forget about it. ;)
I thought that I wouldn’t have CPU speed issues when I started using the AMD64 architecture, and for most tasks that’s been the case. But for systems for which storage is important I’ll look at getting faster CPUs because of BTRFS. Using faster CPUs for storage isn’t that uncommon (I used to work for SGI and dealt with some significant CPU power used for file serving), but needing a fast quad-core CPU to drive a single SSD is a little disappointing. While recovery from file system corner cases isn’t going to be particularly common it’s something that you want completed quickly: for personal systems you want to be doing something else and for work systems you don’t want down-time.
The BTRFS problems with running out of disk space are really serious. It seems that even workstations used at home can’t survive without monitoring. For any other filesystem used at home you can just let it get full and then delete stuff.
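As a rough idea of what minimal monitoring could look like, here is a sketch of a free space check that could be run from cron. The 15% threshold is an arbitrary assumption, not a tested recommendation. The numbers come from the generic statvfs interface (via `shutil.disk_usage`), which on BTRFS can look more optimistic than the real situation because it doesn't show how space is allocated between data and metadata, so treat it as a first warning only.

```python
import shutil
import sys


def is_low_on_space(total_bytes, free_bytes, min_free_fraction=0.15):
    """Return True when free space is below the given fraction of the total.

    The 15% default is an arbitrary assumption for this sketch.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return free_bytes / total_bytes < min_free_fraction


def check_mount(path, min_free_fraction=0.15):
    """Check a mounted filesystem using the generic statvfs numbers."""
    usage = shutil.disk_usage(path)
    return is_low_on_space(usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # Path to check; defaults to the root filesystem.
    path = sys.argv[1] if len(sys.argv) > 1 else "/"
    if check_mount(path):
        print(f"WARNING: {path} is low on free space")
        sys.exit(1)
    print(f"{path}: OK")
```

The non-zero exit status on a warning means cron or any other monitoring wrapper can pick it up without parsing the output.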
Include “rootflags=skip_balance” in the boot loader configuration for every system with a BTRFS root filesystem and in the /etc/fstab for every non-root BTRFS filesystem. I haven’t yet encountered a single situation where continuing the balance did any good or where it didn’t do any harm.
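The fstab half of that advice can be sketched as a small text transformation. This is only an illustration under some assumptions: the function name is made up, it works on the text of an fstab rather than editing /etc/fstab in place, it leaves the root filesystem alone (that one is handled by rootflags on the kernel command line), and it rewrites the whitespace of any line it changes to tabs.

```python
def add_skip_balance(fstab_text):
    """Add skip_balance to the options of every non-root btrfs entry.

    Takes and returns fstab text; nothing on disk is modified.
    """
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave comments, blank lines, and malformed entries alone.
        if not fields or fields[0].startswith("#") or len(fields) < 4:
            out.append(line)
            continue
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        opts = options.split(",")
        if fstype == "btrfs" and mountpoint != "/" and "skip_balance" not in opts:
            opts.append("skip_balance")
            fields[3] = ",".join(opts)
            # Changed lines are rewritten with tab separators.
            out.append("\t".join(fields))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical example entries, not taken from a real system.
    example = (
        "/dev/sda1 / btrfs defaults 0 0\n"
        "/dev/sdb1 /home btrfs noatime 0 0\n"
    )
    print(add_skip_balance(example), end="")
```

Running it twice gives the same result as running it once, so it is safe to use in a script that is applied repeatedly. Review the output before copying it over the real file.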