Older blog entries for amits (starting at number 32)

Re-comparing file systems

The previous attempt at comparing file systems based on the ability to allocate large files and zero them met with some interesting feedback. I was asked why I didn't add reiserfs to the tests and also if I could test with larger files.

The test itself had a few problems, making the results unfair:

- I had different partitions for different file systems, so hard drive geometry and seek times would play a part in the test results

- One can never be sure that the data that was requested to be written to the hard disk was actually written unless one unmounts the partition

- Other data that was in the cache before starting the test could be in the process of being written out to the disk and that could also interfere with the results

All these have been addressed in the newer results.
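As a rough illustration of the last two points, here is a minimal Python sketch of a timing wrapper that flushes unrelated dirty data before starting the clock and waits for the file's own data to reach the disk before stopping it. This is not how the test suite itself does it; the name timed_allocation and the callback interface are assumptions made for the example, and it presumes Python 3 on Linux.

```python
import os
import time


def timed_allocation(path, allocate):
    """Time allocate(fd) on a fresh file, including the flush to disk.

    `allocate` is any callable taking an open file descriptor
    (hypothetical interface, chosen for this sketch).
    """
    # Push out dirty data left over from earlier work so that its
    # write-back does not overlap with the run being measured.
    os.sync()

    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        start = time.monotonic()
        allocate(fd)
        # Block until this file's data and metadata have been handed
        # to the storage device, so cached writes are counted too.
        os.fsync(fd)
        return time.monotonic() - start
    finally:
        os.close(fd)
```

For example, timed_allocation("/mnt/test/file.img", lambda fd: os.posix_fallocate(fd, 0, 1 << 30)) would time a 1GiB preallocation; the path here is only a placeholder.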

There are a few more goodies too:
- gnuplot script to ease the charting of data
- A script to automate testing on various file systems
- A big bug fixed that affected the results for the chunk-writing cases (4k and 8k): this existed right from the time I first wrote the test and was the result of using the wrong parameter for calculating chunk size. This was spotted by Mike Galbraith on lkml.

Browse the sources here

or git-clone them by

git clone git://git.fedorapeople.org/~amitshah/alloc-perf.git

So in addition to ext3, ext4, xfs and btrfs, I've added ext2 and reiserfs, and expanded the ext3 test to cover three journalling modes: writeback, ordered and guarded. guarded is the new mode that's being proposed (it's not yet in the Linux kernel). It is meant to have the speed of writeback and the consistency of ordered.

I've run these tests twice. The first run was with a user logged in and a full desktop on. This is to measure the times that a user will see when actually working on the system while some app tries allocating files.

The second run was in single-user mode, so that no background services are running and other processes don't affect the results. This isolates the timing. The fragmentation will of course remain more or less the same; that's not a property of system load.

It's also important to note that I created this test suite to mainly find out how fragmented the files are when allocating them using different methods on different file systems. The comparison of performance is a side-effect. This test is also not useful for any kind of stress-testing file systems. There are other suites that do a good job of it.

That said, the results suggest that btrfs, xfs and ext4 are the best at keeping the number of fragments low. Reiserfs really looks bad in these tests. Time-wise, the file systems that support the fallocate() syscall perform the best, using almost no time in allocating files of any size. ext4, xfs and btrfs support this syscall.

On to the tests. I created a 4GiB file for each test. The tests are: posix_fallocate(), mmap+memset, writing 4k-sized chunks and writing 8k-sized chunks. These tests are repeated inside the same partition sized 20GiB. The script reformats the partition for the appropriate fs before the run.

The results:

The first 4 columns show the times (in seconds) and the last four columns show the fragments resulting from the corresponding test.

The results, in text form, are:

# 4GiB file
# Desktop on
# columns 2-5: time in seconds; columns 6-9: number of fragments
filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192  posix-fallocate  mmap  chunk-4096  chunk-8192
ext2                         73    96          77          80               34    39          39          36
ext3-writeback               89   104          89          93               34    36          37          37
ext3-ordered                 87    98          89          92               34    35          37          36
ext3-guarded                 89   102          90          93               34    35          36          36
ext4                          0    84          74          79                1    10           9           7
xfs                           0    81          75          81                1     2           2           2
reiserfs                     85    86          89          93              938    35         953         956
btrfs                         0    85          79          82                1     1           1           1

# 4GiB file
# Single
# columns 2-5: time in seconds; columns 6-9: number of fragments
filesystem      posix-fallocate  mmap  chunk-4096  chunk-8192  posix-fallocate  mmap  chunk-4096  chunk-8192
ext2                         71    85          73          77               33    37          35          36
ext3-writeback               84    91          86          90               34    35          37          36
ext3-ordered                 85    85          87          91               34    34          37          36
ext3-guarded                 84    85          86          90               34    34          38          37
ext4                          0    74          72          76                1    10           9           7
xfs                           0    72          73          77                1     2           2           2
reiserfs                     83    75          86          91              938    35         953         956
btrfs                         0    74          76          80                1     1           1           1
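For anyone who wants to work with these numbers rather than read them, the column layout described above (four time columns followed by four fragment columns) can be parsed with a few lines of Python. This is only a reader's convenience sketch; the function name parse_results and the dictionary layout are made up for the example, and the embedded rows are copied from the "Desktop on" run.

```python
# Data rows copied from the "Desktop on" results.
RESULTS = """\
ext2 73 96 77 80 34 39 39 36
ext3-writeback 89 104 89 93 34 36 37 37
ext3-ordered 87 98 89 92 34 35 37 36
ext3-guarded 89 102 90 93 34 35 36 36
ext4 0 84 74 79 1 10 9 7
xfs 0 81 75 81 1 2 2 2
reiserfs 85 86 89 93 938 35 953 956
btrfs 0 85 79 82 1 1 1 1
"""

METHODS = ("posix-fallocate", "mmap", "chunk-4096", "chunk-8192")


def parse_results(text):
    """Map each file system to its per-method seconds and fragments."""
    table = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip comments, blank lines and the header row.
        if len(fields) != 9 or fields[0] in ("#", "filesystem"):
            continue
        numbers = [int(value) for value in fields[1:]]
        table[fields[0]] = {
            "seconds": dict(zip(METHODS, numbers[:4])),
            "fragments": dict(zip(METHODS, numbers[4:])),
        }
    return table
```

With this, parse_results(RESULTS)["reiserfs"]["fragments"] gives the four fragment counts for reiserfs, keyed by allocation method.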


[Sorry; couldn't find an option to make this look proper]

Fig. 1, number of fragments. reiserfs performs really badly here.

Fig. 2. The same results, but without reiserfs.


Fig. 3, time results, with desktop on



Fig. 4. Time results, without desktop -- in single user mode.

So in conclusion, as noted above, btrfs, xfs and ext4 are the best at keeping the number of fragments low. Reiserfs really looks bad in these tests. Time-wise, the file systems that support the fallocate() syscall perform the best, using almost no time in allocating files of any size. ext4, xfs and btrfs support this syscall.

Syndicated 2009-04-25 05:44:00 (Updated 2009-07-28 17:43:23) from Amit Shah

The fallocate() Story Continues

Making apps use the fallocate() syscall instead of writing zeros to a file is the preferred way to init a file with all 0s. I was pleasantly surprised ktorrent already does that (but via a non-default config option):



I would like it if they made posix_fallocate() the default, if available on the target system. posix_fallocate() already uses fallocate() if supported by the filesystem; otherwise it falls back to writing zeros block by block. My last post showed the comparison of various file allocation methods, the performance of filesystems and also the fragmentation each method causes.

Reading that post again, it looks like it could've been written much better and could've used a couple of editing rounds. So I've decided to do a second post which will have better results and more file systems added to the fray. I've updated the test to calculate the numbers more reliably and have also run the tests once more with more filesystems and taking factors like hard disk geometry, seek times, etc., out of the equation. The git tree is already updated with the new code, so you can try it out yourself. In any case, stay tuned for the results.

Syndicated 2009-04-15 13:40:00 (Updated 2009-07-28 17:45:02) from Amit Shah

Comparison of File Systems And Speeding Up Applications

Update: I've done a newer article on this subject at http://log.amitshah.net/2009/04/re-comparing-file-systems.html that removes some of the deficiencies in the tests mentioned here and has newer, more accurate results along with some new file systems.

How should one allocate disk space for a file for later writing? ftruncate() (or lseek() followed by write()) creates a sparse file, which is not what is needed. A traditional way is to write zeroes to the file till it reaches the desired file size. Doing things this way has a few drawbacks:
  • Slow, as small chunks are written one at a time by the write() syscall
  • Lots of fragmentation
posix_fallocate() is a library call that handles the chunked writes in one go, so the application doesn't have to code its own block-by-block writes. But this still happens in userspace.

Linux 2.6.23 introduced the fallocate() system call. The allocation is then moved to kernel space and hence is faster. New file systems that support extents make this call very fast indeed: a single extent is marked as being allocated on disk (whereas traditionally each individual block was marked as 'used'). Fragmentation too is reduced as file systems will now keep track of extents, instead of smaller blocks.

posix_fallocate() will internally use fallocate() if the syscall exists in the running kernel.
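To make the idea concrete, here is a minimal Python sketch of an application-side helper that asks for posix_fallocate() first and only writes zeroes itself if that call is unavailable or refused. The helper name preallocate, the returned labels and the choice of error codes to fall back on are assumptions for this example, not taken from any of the projects mentioned here.

```python
import errno
import os


def preallocate(path, size, chunk=4096):
    """Reserve `size` bytes for a new file at `path`.

    Returns a label saying which method was used (hypothetical
    convention, just for this sketch).
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        try:
            # The C library uses fallocate() where it can and may
            # emulate the allocation itself otherwise.
            os.posix_fallocate(fd, 0, size)
            return "posix_fallocate"
        except AttributeError:
            pass  # platform without os.posix_fallocate
        except OSError as err:
            # Assumption: only these errors mean "not supported here".
            if err.errno not in (errno.EINVAL, errno.EOPNOTSUPP, errno.ENOSYS):
                raise

        # Traditional fallback: write zeroes chunk by chunk.
        zeros = bytes(chunk)
        os.lseek(fd, 0, os.SEEK_SET)
        remaining = size
        while remaining > 0:
            written = os.write(fd, zeros[:min(chunk, remaining)])
            remaining -= written
        return "write-zeros"
    finally:
        os.close(fd)
```

Unlike a bare ftruncate(), either path leaves the file with real blocks behind it rather than a hole.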

So I thought it would be a good idea to make libvirt use posix_fallocate() so that systems with the newer file systems will directly benefit when allocating disk space for virtual machines. I wasn't sure of what method libvirt already used to allocate the space. I found out that it allocated blocks in 4KiB sized chunks.

So I sent a patch to the libvir-list to convert to posix_fallocate() and danpb asked me about what the benefits of this approach were and also asked about using alternative approaches if not writing in 4K chunks. I didn't have any data to back up my claims of "this approach will be fast and will result in less fragmentation, which is desirable". So I set out to do some benchmarking. To do that, though, I first had to make some empty disk space to create a few file systems of sufficiently large sizes. Hunting for a test machine with spare disk space proved futile, so I went about resizing my ext3 partition and creating about 15 GB of free disk space. I intended to test ext3, ext4, xfs and btrfs. I could use my existing ext3 partition for the testing, but that would not give honest results about the fragmentation (an existing file system may already be fragmented, so big new files would surely be fragmented too, whereas on a fresh fs I won't run into that risk).

Though even creating separate partitions on rotating storage and testing file system performance won't give perfectly honest results, I figured if the percentage difference in the results was quite high, that won't matter. I grabbed the latest Linus tree and the latest dev trees for the userspace utilities for all the file systems and created about 5GB partitions for each fs.

I then wrote a program that created a file, allocated disk space, closed it and calculated the time taken in doing so. This was done multiple times for different allocation methods: posix_fallocate(), mmap() + memset() and writing zeroes in 4096 byte chunks and 8192 byte chunks.
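The four methods can be sketched in Python as below. This is an illustration of the approach, not the actual test program (which is linked further down); all function names and the METHODS table are assumptions for the example. The mmap variant stands in for mmap() + memset() by assigning zero bytes into the mapping one window at a time.

```python
import mmap
import os
import time


def alloc_posix_fallocate(fd, size):
    os.posix_fallocate(fd, 0, size)


def alloc_mmap(fd, size, window=1 << 20):
    # The file must already have its full length before it is mapped.
    os.ftruncate(fd, size)
    with mmap.mmap(fd, size) as mapping:
        zeros = bytes(window)
        # Equivalent of memset(): touch every byte through the mapping.
        for offset in range(0, size, window):
            length = min(window, size - offset)
            mapping[offset:offset + length] = zeros[:length]
        mapping.flush()


def alloc_chunks(fd, size, chunk):
    zeros = bytes(chunk)
    written = 0
    while written < size:
        # The last chunk may be shorter than `chunk`.
        length = min(chunk, size - written)
        written += os.write(fd, zeros[:length])


METHODS = {
    "posix-fallocate": alloc_posix_fallocate,
    "mmap": alloc_mmap,
    "chunk-4096": lambda fd, size: alloc_chunks(fd, size, 4096),
    "chunk-8192": lambda fd, size: alloc_chunks(fd, size, 8192),
}


def time_method(path, size, method):
    """Create `path`, allocate `size` bytes with `method`, return seconds."""
    start = time.monotonic()
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        method(fd, size)
    finally:
        os.close(fd)
    return time.monotonic() - start
```

Looping over METHODS.items() and calling time_method() on a freshly created file system gives one timing per method; counting fragments is a separate step.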

So I had four methods of allocating files and 5G partitions to work with. I decided to check the performance by creating a 1GiB file with each allocation method.

The program is here. The results, here. The git tree is here.

I was quite surprised to see poor performance for posix_fallocate() on ext4. On digging a bit, I realised mkfs.ext4 hadn't created the file system with extents enabled. I reformatted the partition, but that data was valuable to have as well: it shows how much better a file system does with extents support.

Graphically, it looks like this:
Notice that ext4, xfs and btrfs take only a few microseconds to complete posix_fallocate().


The number of fragments created:

btrfs doesn't yet have the ioctl implemented for calculating fragments.
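The test program gets its fragment counts through an ioctl. As a substitute that is easier to show in a few lines, the sketch below shells out to the filefrag tool from e2fsprogs instead and parses its summary line. It assumes filefrag is installed, that the file system supports the query, and that the summary has the form "<path>: N extents found"; the function names are made up for this example.

```python
import re
import subprocess

# Matches the summary line, e.g. "/mnt/test/file.img: 34 extents found".
_EXTENTS = re.compile(r":\s*(\d+) extents? found")


def parse_filefrag(output):
    """Extract the extent count from filefrag's summary output."""
    match = _EXTENTS.search(output)
    if match is None:
        raise ValueError("no extent count in filefrag output")
    return int(match.group(1))


def count_fragments(path):
    """Run filefrag on `path` and return the number of extents."""
    result = subprocess.run(
        ["filefrag", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_filefrag(result.stdout)
```

On a file system where the underlying ioctl is missing, filefrag itself fails and count_fragments() raises subprocess.CalledProcessError.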

The results are very impressive and the patches to libvirt were finalised pretty quickly. They're now in the development branch of libvirt. Coming soon to a virtual machine management application near you.

Use of posix_fallocate() will be beneficial to programs that know in advance the size of the file being created, like torrent clients, ftp clients, browsers, download managers, etc. It won't be beneficial in the speed sense, as data is only written when it's downloaded, but it's beneficial in the as-little-fragmentation-as-possible sense.

Syndicated 2009-03-20 15:58:00 (Updated 2010-02-01 12:09:54) from Amit Shah

Startups in 14 sentences

Paul Graham has an article on the top 13 things to keep in mind for entrepreneurs. I have one to add (for software startups):

- Going open source can help
You might have a brilliant idea and a cool new product. It mostly will be disruptive technology. You might think of changing the world. But people might have to modify the way they were doing things. What if you run out of funds midway, or some other unforeseen event forces your company to shut shop? Customers will be wary of deploying solutions from startups for fear of them going down. If the customers are given access to the source code, they're at least assured they can have control over the software if your company is unable to support it. And letting them know this can win some additional customers -- who knows!

Syndicated 2009-02-27 07:42:00 (Updated 2009-02-27 07:50:56) from Amit Shah

Making Suspend Safer for File Systems

I saw these File System Freezing patches that got merged into Linus' tree yesterday and instantly thought that these patches could be used to freeze file systems before going into a suspended state. At the recent foss.in/2008, I met with Christoph Hellwig and one of the things we discussed was how he would never trust any file system to be in a consistent state after attempting suspend-to-disk.

The freezing patches are aimed at snapshotting as of now. Extending the suspend routines to make use of them is something I still have to look at. While working with file systems isn't entirely new to me (I've worked on something called the Mosix File System earlier), it's been a really long time. It'll be quite interesting to work on this.

I had a brief chat with hch about this idea and while he says this still will not convince him to suspend to disk, it could be a good thing for suspend to ram where the laptop runs out of power but the fs could be in a good state. I agree. Though I'd like to use s-t-d with this!

I've had many ideas slip by without blogging about them for ages and later seen them implemented by others. In this case, even if I don't end up implementing something, I'd at least have the satisfaction of having penned it down first.

Syndicated 2009-01-12 05:57:00 (Updated 2009-01-12 06:06:35) from Amit Shah

Modesty

A Quote, after a long time.

A few friends were having a discussion on modesty and my opinion was called for. This is what I had to say:

I don't blow my own trumpet. Never. There are people who say things about me, though. So I feel compelled to correct them every now and then so that What They Say Is What I Am (TM).

Some people label me immodest.

update: Some background info added. In all modesty, a few people thought this post sounded arrogant and they were quick to point out that I really am not. I thought labelling this post as 'humour' would suffice...

Syndicated 2008-12-31 14:06:00 (Updated 2009-01-23 09:00:45) from Amit Shah

Animal Farm

I'm (re-)reading Animal Farm these days, 2-3 pages a day. I can't help but correlate it to the headline-grabbing news that's happening these days.

Syndicated 2008-12-30 08:38:00 (Updated 2008-12-30 08:42:28) from Amit Shah

KVM: Disabled by BIOS

I spent some time fixing this on a Dell Optiplex 755. I thought a BIOS update was necessary and had to hunt for a DOS bootable that could run the EXE given by Dell. FreeDOS wouldn't work. I finally found a bootable disk given by Dell (with some other machine). Even the updated BIOS didn't solve this issue.

I then searched around the net and found this post that suggested disabling Trusted Execution. Well, if you have an option that enables virtualization and then give another option that effectively disables it, what good is this UI?

This, however, sounds like something that I don't yet understand. So I should go read what that is and how to make KVM run with it enabled.

Syndicated 2008-12-04 13:43:00 (Updated 2008-12-04 14:07:18) from Amit Shah

Laptops: The New Desktops

As laptop sales are outpacing desktop sales and laptops are becoming more and more capable, it's no wonder that laptops are now the preferred choice for a computer. The prices of laptops have fallen dramatically to aid this trend.

However, with this growing trend, there now comes a need to have even smaller portable computers. The laptops are now "too big". So we now have the onslaught of netbooks and mobile phones doubling as handheld computers. So what exactly is it that people need? They want big screens but the device should be portable. They need more processing power but a small device and one that doesn't heat up. It's going to be very interesting watching this space in the next few years.

While discussing this topic with Vijay today, he mentioned he wanted to take just the laptop screen around without having to undock his laptop for presentations or short meetings. Tablets don't work for him for some reasons. He just wanted the screen to go with him and communicate with the "base" wirelessly. I thought that should be possible with a low-power, low-speed processor on the screen itself running something off the RAM. Anyway, with the cloud-computing phenomenon, all one would need is a browser and a handful of other software (mainly plugins to browsers). The OSes will either have to evolve to support ASMP or the processor manufacturers will have to come out with low-power chips that can share the bus with a stronger processor (Intel's Atom does seem like it could fit here). The OS has to evolve in either case along with the chipset.

The desktop software will have to have support for this, of course, where you would click something like "detach the screen safely" and the necessary plugins can be transferred to the screen's RAM. Or the screen can have some flash storage and the browser, presentation software, etc. can be stored natively all the time.

Anyway, is this still the most-desired gadget? Once this is done (as it has been; Toshiba had a prototype two years back along the same lines), will people stop wanting more? Even in today's world, I can imagine people just wanting to use their super mobile phones to work as the "screen" -- a low-power computer when not in front of their laptops. They can be hooked up to projectors easily, they can be carried around, and they can be hooked up to bigger monitors to get more screen space. What's stopping us from doing that now?

Syndicated 2008-11-09 07:12:00 (Updated 2008-11-09 09:14:09) from Amit Shah

Piracy

The Indian movie industry (and that's not just "Bollywood") is plagued with piracy of movies as well as music. I've had several friends staying abroad telling me about recent releases they saw "on the Internet". Of course, songs are always to be downloaded and not bought.

A movie I saw recently had a note at the end of the screening: "Please buy original CDs. Do not download music." There was laughter in the sparsely-populated movie hall (on the 2nd day of the screening of the movie that talked about youth and music, no less).

That got me thinking: we spend quite a lot of money these days to watch movies in multiplexes. It's about 5x-6x of what I used to pay about 10 years back. And even that doesn't guarantee a seat in the "balcony". These days, the movie halls usually have flat pricing, no matter where you're seated. You could be 5 feet away from the screen or 50.

So it's no wonder people don't want to go to movie theatres. They just walk across the street and buy a DVD for Rs 30 that has 3 or 4 of the latest releases. And they can always download the music or buy MP3 CDs that cost about the same but have music from 50 recent releases. Original audio CDs cost about the same it costs to watch the movie in the movie hall.

I was thinking about what could help curb this piracy, and one thing that came to mind was that the distributors and producers of the movies could give away audio CDs of the movie just after the screening, either for free or for a very small token amount, like Rs 30.

If this were done, people would actually go to the theatre to watch movies since the cost of the ticket no longer only gets them the movie but also gets them the CD to the songs which they've already listened to (and liked?) (side note: movies in India usually run more because of the music and actors than the story or reviews). Also, music gets distributed and listened to legally instead of it being pirated.

The producers need not worry about losing out on income via audio CD sales. I wonder how much they make anyway. Also, if this drives more people to the theatres, it's only going to be good for them. For people who do not want to watch the movie but want the CD, they can buy the CDs as they had been buying previously. For people who wanted the music but did not buy it, there's no negative in the model for the producer, but there's a positive: enticing them to go watch the movie plus get a chance to get the CD.

So it came as a welcome surprise (though I don't know how well this idea will take off) when I saw Google announced putting links in youtube videos for songs in the video.

I've had (non-Indian) friends tell me they don't download music any more since they can get songs for just under a dollar from the various online stores. It hardly makes any difference to their bottomlines plus they get legal music and are free of any hassles they might later get into for doing illegal stuff (downloading).

This might work elsewhere, but in India, the mentality hasn't changed enough that people will buy something instead of getting it for free or from a very cheap alternative. Adding 'buy music you just liked from here' won't take off. I'd like to be proven wrong, though.

There's a lot to be gained in this model for everyone involved. Even the movie halls will see more traffic and hence more income for the various food courts and shopping plazas that are bundled in the movie hall complexes these days.

If this is implemented and takes off, the producers can then think about giving off DVDs of the movie for let's say 50% of the original price. Why not?

Update: xkcd on piracy

Syndicated 2008-10-08 07:46:00 (Updated 2008-10-15 10:17:45) from Amit Shah
