More ways for firmware to screw you
Some of my recent time has been devoted to making our boot media more Mac friendly, which has entailed rather a lot of rebooting. This would have been fine, if tedious, except that some number of boots would fall over with either a clearly impossible kernel panic or userspace segfaulting in places that made no sense. Something was clearly wrong. Crashes that shouldn't happen are generally an indication of memory corruption. The question is how that corruption is being triggered. Hunting that down wasn't terribly easy.
My first thought was that we were possibly managing to load the kernel over a region used by UEFI code. UEFI defines two classes of services - boot services and runtime services. While runtime services code and data must be preserved by the OS, boot services code and data are, in theory, available to the OS once the firmware has exited the boot environment. In practice, that's not always true. It seemed entirely possible that the kernel might be ending up on top of some of that boot services code or data and getting trodden on. GRUB now has code to avoid putting the kernel on top of boot services regions, so testing the latest code seemed like a good plan. But no, crashes still happened.
That pretty much ruled out the bootloader. My next thought was that executing some of the firmware code was triggering a write to some other memory that contained the kernel. Josh Boyer suggested the next trick: mark the kernel read-only and see whether anything was hitting it. x86 lets you mark pages as read-only, so any attempt to write to them should take a fault, and since UEFI functions are executed in the context of the kernel, they share the same page tables. That let me rule this out too - everything still went just as wrong, and I wasn't taking an extra fault first.
However, at this point I was reasonably happy that it wasn't the kernel itself being overwritten - faults were occurring in userspace code as well. That was a pretty strong indication that whatever was happening continued to happen once userspace had started, so it wasn't a direct response to a firmware call. I made sure of that by stubbing out all the calls that could be triggered after kernel initialisation, and saw the same failures. Once all attempts to be clever have failed, it's time to start using brute force. The kernel lets you reserve areas of RAM with arguments of the form memmap=<length>$<start>, which block <length> bytes starting at physical address <start> from being used. It took a while, but I finally found a 256MB range that made a difference: reserving it resulted in the machine booting reliably, while letting the OS use it resulted in occasional crashes.
Definite progress. Comparing that memory range to the EFI memory map was helpful: there were several blocks of UEFI boot services data present in it, which seemed like too much of a coincidence. By reserving each of them in turn, I traced it down to a single 31MB region of boot services data - that is, memory reserved by the firmware for use by the UEFI boot services. Per spec, this is available to the OS once the boot environment has been exited, and nothing other than the OS should be touching it after boot. But something clearly was. Tracking down what was responsible turned out to be far easier than I expected, although the first attempt was a failure: setting the region read-only should have triggered a fault, but didn't, which was rather confusing. Rather than give up, I patched the kernel to fill the region with 0xff at init. Then I booted the system, read the region back and looked for values that weren't 0xff. I got this:
00000000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
001568a0 ff ff ff ff ff ff ff ff ff ff ff ff 84 00 00 00 |................|
001568b0 00 20 a7 ac 46 00 00 00 00 00 00 00 00 00 06 01 |. ..F...........|
001568c0 c2 0b 0c 00 ff ff ff ff ff ff ff ff ff ff ff ff |................|
001568d0 ff ff 0a 04 f0 03 82 0d 40 00 00 00 ff ff ff ff |........@.......|
001568e0 ff ff 00 21 00 36 9a 80 ff ff ff ff ff ff 00 7e |...!.6.........~|
001568f0 00 09 43 48 41 2d 47 75 65 73 74 01 04 02 04 0b |..CHA-Guest.....|
00156900 16 32 08 0c 12 18 24 30 48 60 6c 2d 1a 0e 18 1a |.2....$0H`l-....|
00156910 ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00156920 00 00 00 00 00 00 00 dd 09 00 10 18 02 00 10 01 |................|
00156930 00 00 dd 1e 00 90 4c 33 0e 18 1a ff ff 00 00 00 |......L3........|
00156940 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00156950 00 00 bd ea f8 b3 ff ff ff ff ff ff ff ff ff ff |................|
00156960 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff |................|
*
022fb000

That's a lot of 0xffs (around 31MB of them), with one small section that contains an 802.11 probe packet bearing the SSID of the hospital across the road from my house. Apple supports network booting over wireless networks. It seems that the firmware brought up the wireless card, associated with this network (it's the only public one nearby) and then left the card DMAing packets into RAM. The read-only page attribute only applies to CPU-initiated accesses, so the card could keep doing this without ever triggering a page fault. It also explains why the problem was so random: whether memory corruption occurred depended on whether a packet happened to arrive between that memory being handed to the OS and the kernel reinitialising the wireless card. And it certainly explains why I couldn't reproduce it while leaving the machine repeatedly rebooting on the bus home.
How do we fix this? Unsure. With luck, disconnecting the UEFI driver from the device in the bootloader will quiesce the hardware, but without testing I can't be sure of that yet. For now it's just another example of firmware managing to break expectations in deeply strange ways.