Older blog entries for mjg59 (starting at number 195)

ext4, application expectations and power management

There's been a certain amount of discussion about behavioural differences between ext3 and ext4[1], most notably due to ext4's increased window of opportunity for files to end up empty due to both a longer commit window and delayed allocation of blocks in order to obtain a more pleasing on-disk layout. The applications that failed hardest were doing open("foo", O_TRUNC), write(), close() and then being surprised when they got zero length files back after a crash. That's fine. That was always stupid. Asking the filesystem to truncate a file and then writing to it is an invitation to failure - there's clearly no way for it to intuit the correct answer here. In the end this has been worked around by disabling delayed allocation when writing to a file that's just been truncated, so everything's fine.

However, there's another case that also breaks. A common way of saving files is to open("foo.tmp"), write(), close() and then rename("foo.tmp", "foo"). The mindset here is that a crash will either result in foo.tmp being zero length, foo still being the original file or foo being your new data. The important aspect of this is that the desired behaviour of this code is that foo will contain either the original data or the new data. You may suffer data loss, but you won't suffer complete data loss - the application state will be consistent.

When used with its (default) data=ordered journal option, ext3 provided these semantics. ext4 doesn't. Instead, if you want to ensure that your data doesn't get trampled, it's necessary to fsync() before closing in order to make sure it hits disk. Otherwise the rename can occur before the data is written, and you're back to a zero length file. ext4 doesn't make guarantees about whether data will be flushed before metadata is written.
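For reference, here's a minimal sketch of the two save sequences being compared, written in Python rather than C for brevity. The function name save_atomically, the ".tmp" suffix and the sync flag are all made up for illustration; only the os calls themselves are real. With sync=True it's the open(), write(), fsync(), close(), rename() sequence that ext4 wants; with sync=False it's the shorter one that data=ordered ext3 happened to make safe.

```python
import os


def save_atomically(path, data, sync=True):
    """Write data to a temporary file, then rename it over path.

    Hypothetical helper: the name, the ".tmp" suffix and the sync
    flag are illustrative assumptions, not an existing API.
    """
    tmp_path = path + ".tmp"  # mirrors the "foo.tmp" naming used above
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        if sync:
            # Ask the kernel to push the file data out before the
            # rename below can replace the old file.
            os.fsync(fd)
    finally:
        os.close(fd)
    # rename() atomically replaces the old name with the new file.
    os.rename(tmp_path, path)
```

Calling save_atomically("foo", b"new data") leaves foo holding the new bytes and no foo.tmp behind. What survives a crash part-way through with sync=False is exactly the filesystem-dependent behaviour under discussion.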

Now, POSIX says this is fine, so any application that expected this behaviour is already broken by definition. But this is rules lawyering. POSIX says that many things that are not useful are fine, but it doesn't exist for the pleasure of sadistic OS implementors. POSIX exists to allow application writers to write useful applications. If you interpret POSIX in such a way that gains you some benefit but shafts a large number of application writers then people are going to be reluctant to use your code. You're no longer a general purpose filesystem - you're a filesystem that's only suitable for people who write code with the expectation that their OS developers are actively trying to fuck them over. I'm sure Oracle deals with this case fine, but I also suspect that most people who work on writing Oracle on a daily basis have very, very unfulfilling lives.

But anyway. We can go and fix every single piece of software that saves files to make sure that it fsync()s, and we can avoid this problem. We can probably even do it fairly quickly, thanks to us having the source code to all of it. A lot of this code lives in libraries and can be fixed up without needing to touch every application. It's not the end of the world.

So why do I still think it's a bad idea?

It's simple. open(), write(), close(), rename() and open(), write(), fsync(), close(), rename() are not semantically equivalent. One is "give me either the original data or the new data"[2]. The other is "always give me the new data". This is an important distinction. fsync() means that we've sent the data to the disk[3]. And, in general, that means that we've had to spin the disk up.

So, on the one hand, we're trying to use things like relatime to batch data to reduce the amount of time a disk has to be spun up. And on the other hand, we're moving to filesystems that require us to generate more I/O in order to guarantee that our data hits disk, which is a guarantee we often don't want anyway! Users will be fine with losing their most recent changes to preferences if a machine crashes. They will not be fine with losing the entirety of their preferences. Arguing that applications need to use fsync() and are otherwise broken is ignoring the important difference between these use cases. It's no longer going to be possible to spin down a disk when any software is running at all, since otherwise it's probably going to write something and then have to fsync it out of sheer paranoia that something bad will happen. And then probably fsync the directory as well, because what if someone writes an even more pathological filesystem. And the disks sit there spinning gently and chitter away as they write tiny files[4] and never spin down and the polar bears all drown in the bitter tears of application developers who are forced to drink so much to forget that they all die of acute liver failure by the age of 35 and where are we then oh yes we're fucked.

So. I said we could fix up applications fairly easily. But to do that, we need an interface that lets us do the right thing. The behaviour application writers want is one which ext4 doesn't appear to provide. Can that be fixed, please?

[1] xfs behaves like ext4 in this respect, so the obvious argument is that all our applications have been broken for years and so why are you complaining now. To which the obvious response is "Approximately anyone who ever used xfs expected their data to vanish if their machine crashed so nobody used it by default and seriously who gives a shit". xfs is a wonderful filesystem for all sorts of things, but it's lousy for desktop use for precisely this reason.

[2] Yes, ok, we've just established that it actually isn't that in the same way that GMT isn't UTC and battery refers to a collection of individual cells and so you don't usually put multiple batteries in your bike lights, but the point is that this is, for all practical intents and purposes, an unimportant distinction and not one people should have to care about in their daily lives.

[3] The disk is free to sit there bored for arbitrary periods of time before it does anything, but that's fine, because the OS is behaving correctly. Sigh.

[4] Dear filesystem writers - application developers like writing lots of tiny files, because it makes a large number of things significantly easier. This is fine because sheer filesystem performance is not high on the list of priorities of a typical application developer. The answer is not "Oh, you should all use sqlite". If the only effective way to use your filesystem is to use a database instead, then that indicates that you have not written a filesystem that is useful to typical application developers who enjoy storing things in files rather than binary blobs that end up with an entirely different set of pathological behaviours. If I wanted all my data to be in Oracle then I wouldn't need a fucking filesystem in the first place, would I?

Syndicated 2009-03-14 21:04:01 from Matthew Garrett

After a bit of back and forth with Peter, we came up with a straightforward way of dealing with the fact that the Wacom driver needs a logical input device per input type[1], but the X server only generates an input device per hal device. The simplest solution turned out to be a hal callout that generates additional hal devices on demand, which also means we can add information to the fdi files to only add the appropriate device types. Ought to land in rawhide in the near future, at which point tablets should be basically working out of the box. Except that xsetwacom gets device name -> type mapping by attempting to parse xorg.conf. Pass the suicide.

Today's other accomplishment was spending long enough looking at Toshiba ACPI dumps to figure out how to enable hotkey reporting without needing to poll. Of course, I then found that the FreeBSD driver has done the same thing since 2004. Never mind. Patch has been posted to lkml and I've shoved it into rawhide, so that'll improve things for most Toshiba users. There's a few machines that have an entirely different BIOS and we don't know how the hotkeys there work at all, so life continues to be miserable for those of you that own them. Sorry.

[1] Stylus, cursor, eraser and so on

Syndicated 2009-03-06 04:46:26 from Matthew Garrett


Minor Fedora updates - I fixed up the FDI file in the wacom package, so tablet PCs should have a working stylus out of the box in rawhide. The eraser won't work right now - the driver needs some reworking to bind multiple X devices to a single logical input device. I've also added support for brightness control via smartdimmer to nouveau, which should increase the number of machines that have working brightness control. I don't think this has landed in the rawhide kernel yet, but it should do soon. There's the potential for some conflict with the mbp_nvidia_bl driver. We may end up dropping that.

HP updates - The button bar on my 2510 got replaced last week. I now have working volume buttons again. However, the machine now reboots whenever it is suspended and I close the lid. Diagnosed to either a faulty switch assembly or system board, which will require an engineer visit. An engineer dropped round today to fix the touchpad. Despite the case notes clearly stating that the problem was with the cable assembly, he was sent a replacement top cover unit. Without any cables. So he's coming back at some point. HP's customer support system apparently does not allow these cases to be merged. Which means I now have two visits to look forward to.

Android - I'm gradually working my way through the code, replacing various custom interfaces with standard ones. const char * const LCD_BACKLIGHT = "/sys/class/leds/lcd-backlight/brightness"; is an interesting standout so far.

Syndicated 2009-02-23 18:43:34 from Matthew Garrett

In other news, my HP 2510p's screen was replaced last month after the hinge snapped. A chunk of plastic off the hinge cover snapped off two days ago. I'm somewhat puzzled by this, since I can't see any plausible way force could be applied to it - it's as if it came away slightly and then got crushed when I tried to close the lid. On top of the motherboard having been replaced 4 times now (twice due to faulty power connectors, once due to the fan being replaced and the motherboard being swapped at the same time, once because the machine started refusing to boot at LCA) and it still being slightly temperamental when booting, I'm not overly impressed - especially when I've only had it 18 months. Nobody else I know with one seems to have had the same level of difficulty, though dreadful thermal issues (especially when using the dock) seem to be common.

The X200s looks awfully shiny, but the 1440x900 screen option doesn't appear to be available in the UK. Oddly, despite having a SIM slot, it also doesn't seem to come with an HSDPA option.

Syndicated 2009-02-19 01:29:01 from Matthew Garrett

Aside from the inherent humour in OpenSolaris's attempt to migrate to a 15-year-old shell, today brings the thrilling news that I'll be moving to Boston to join the engineering team in Westford, MA. I look forward to the Applebee's. Some of the more entertaining aspects of US immigration mean that it'll probably be in July at the earliest (365 days with the company, plus time spent in the US since starting), which means that I have plenty of time to properly investigate my local pubs to console myself over having to spend the rest of my life drinking American beer.

Syndicated 2009-02-19 01:18:20 from Matthew Garrett

I'm sitting on a plane. The screen in front of me is currently displaying X root weave. Xorg fo life, yo.

Syndicated 2009-02-01 04:51:41 from Matthew Garrett

One of the points I made in my presentation at LCA this year was that for power management to be effective, it needs to be something that works without anyone having to think about it. One aspect of that is ensuring that it doesn't get in the way of the user, since otherwise the user will eventually get irritated and turn it off. Part of my work at Red Hat is coming up with ways to not only offer power management functionality, but to make it sufficiently useful and unobtrusive that manual configuration is almost never required.

Screensavers are an interesting case. We have a good idea of whether most hardware is "doing something" or not, based on whether it's generating traffic or an application has it open. This is less true of screens - the resource making use of the display is the user, and it's entirely possible for the user to be reading or watching something[1] onscreen without us getting any feedback from them. It's common to see people noticing that their screensaver is activating and hitting the mouse or keyboard to stop it. What's the correct solution?

One solution is to have the user increase the screensaver timeout. This is a poor solution - it's one of those "Think about what you're going to be doing with the computer before starting to do it" ideas that I dislike a lot. Computers are there to serve the user, not the other way around. The other downside to this is that the timeout will be left at a large number and monitors will be turned on for significantly longer than necessary.

Another is to pay attention to what the user's doing. If they keep hitting the keyboard just as the screensaver's activating, it's because they want a larger timeout. It's not difficult to give them that. I spent a while today playing with various complex implementations, but I finally came down to a simple one:

  • If the user generates activity while the screen is blanking or immediately afterwards, bump the timeout by 10 minutes. Perform a further increase each time they do this.
  • If the screen is successfully blanked and the user doesn't immediately unblank it, reset the timeout to the original value.

Another option would be to double the timeout each time the user unblanks the screen, and that may be what I end up going with. A more complex solution might be to keep track of the user behaviour and tie it to time of day (if the system goes idle at 3:30, you might as well blank straight away - they've gone to grab a coffee or something), but I'm leaning towards thinking that that's overkill.
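
The simple version described in the two rules above can be sketched as a tiny bit of state. This is only an illustration in Python, not the code I've actually been running: the class name, method names and the base_minutes argument are all made up, and deciding what counts as "immediately afterwards" is left to whoever calls it.

```python
class AdaptiveBlankTimeout:
    """Track a screensaver timeout that grows when the user objects.

    Hypothetical sketch: names and the event-style interface are
    assumptions made for illustration only.
    """

    def __init__(self, base_minutes, bump_minutes=10):
        self.base = base_minutes      # the user's configured timeout
        self.bump = bump_minutes      # the 10 minute bump from the first rule
        self.current = base_minutes   # the timeout currently in force

    def on_blank_interrupted(self):
        """User activity arrived while blanking or just after it."""
        self.current += self.bump
        return self.current

    def on_blank_settled(self):
        """The screen blanked and the user left it alone."""
        self.current = self.base
        return self.current


timeout = AdaptiveBlankTimeout(base_minutes=5)
timeout.on_blank_interrupted()   # timeout is now 15 minutes
timeout.on_blank_interrupted()   # timeout is now 25 minutes
timeout.on_blank_settled()       # back to the original 5 minutes
```

The doubling variant would just replace the addition in on_blank_interrupted with a multiplication by two.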

To test this out, I've actually gone to the extent of setting my default screensaver timeout to a minute. We'll see whether it gets irritating. I suspect that there's some more fine tuning to do, and I may want some kind of decay function rather than immediately pushing the timeout back to the original value.

Next job is to think about whether there's any reason to not just enter DPMS straight away if the user's selected a blank screen...

[1] I'm thinking along the lines of IRC conversations or logfiles rather than films - media players should be talking to the screensaver already

Syndicated 2009-01-29 09:31:01 from Matthew Garrett


Intel have a magic communication channel between the system firmware and the graphics hardware. It's based on a region of shared memory and judicious use of interrupts, and it's documented here.

Nvidia have a magic communication channel between the system firmware and the graphics hardware. It's based on WMI and bonghits, and it's not documented.

Why, yes, I have spent half the day trying to work out how the NVIF method works.

Syndicated 2009-01-26 11:58:00 from Matthew Garrett

I had a deeply fucked up dream last night. For some reason we'd paid huge quantities of money to obtain nothing of value and George W Bush had been president for 8 years. I clearly need to drink less.

Syndicated 2009-01-22 13:33:57 from Matthew Garrett
