k is currently certified at Journeyer level.

Name: Adrian Chadd
Member since: 2000-04-04 17:58:16
Last Login: 2007-05-24 13:07:58

FOAF RDF Share This

Homepage: http://www.creative.net.au/

Notes:

I work on a few projects, sproadically, in my spare time.

that includes freebsd, squid, ircd-hybrid and stuff I've forgotten.

In my spare time I'm working on some genetic programming engines.

Projects

Articles Posted by k

Recent blog entries by k

Syndication: RSS 2.0

17 May 2006 »

The status of the Squid cyclic fs (COSS) :

COSS was originally implemented as an on-disk LRU. I'll describe the original implementation as I grabbed from Eric Stern now.

A filesystem is just a single large file or physical device.

A membuf - 1 megabyte in size - is initially allocated to represent the first megabyte of the filesystem. Objects are copied into the membuf if their size is known up front (and thus space can be 'pre-allocated' in the stripe.) When the stripe is filled up it is marked as "full" and written to the filesystem. Objects are added to the beginning of a linked list as this happens.

Objects are referenced by their offset on the disk: any read is first checked against the in-memory membuf list. If an object is found to be in a membuf then a copy of the object data is taken and the data is handed back to Squid. If an object is not found in a membuf it is read from the filesystem, placed at the head of the current membuf - and they are re-added to the head of the linked list - and the squid file pointer is updated to point to this new position.

As stripes are successively allocated and written to the filesystem in order the 'popular' objects stay near the 'head'. This happens until the last stripe on disk is written: at which point the write pointer is cycled to the beginning of the filesystem.

At this point the LRU implementation kicks in: the objects which are at the end of this linked list match those at the beginning of the filesystem. COSS will start at the end of the linked list and move backwards, deallocating objects, until it reaches the beginning of the next stripe. It then has enough room to allocate a 1 megabyte stripe (and its membuf.) at the beginning of the disk. It then fills this membuf as described above.

When this membuf is filled it writes the stripe to disk, frees the objects in the next megabyte of disk and then allocates a membuf and fills that.

This implementation wasn't complete:

  • The rebuild-from-logfile didn't seem to work right
  • There was no support for rebuild-from-disk (in case the logfile was missing or corrupt)
  • The implementation used file_read() and file_write() - callback methods of scheduling disk filedescriptor IO - but assumed synchronous behaviour.

When I adapted the codebase to use POSIX AIO I discovered a number of race conditions in the COSS code:

  • Objects which were being read from disk and written into the current membuf had their squid file pointer numbers updated. Subsequent reads of this object would be copied from the current membuf - but async disk IO wouldn't guarantee the data there was written until some time after scheduling. This resulted in a lot of swapin failures as NULL data was being written
  • It was possible, but so far unlikely, that a disk operation would be scheduled for an object which was then overwritten by a full stripe.

The nice features of COSS was the simple writing and object pool maintainence: writes were aggregated and predictable (being in 1 megabyte chunks.) Popular objects had more of a possibility of staying in the current membuf.

I recently took the code and began fixing the bugs. These included:

  • All disk stripes were now an even multiple of the membuf size (1 megabyte.) Eric's original implementation would note when a membuf was free, write the membuf to disk and then start the new membuf at the end of the old membuf. This meant a few bytes weren't being wasted but it did make dividing the filesystem up for analysis/repair/rebuild difficult.
  • Object relocations are now tracked from their creation to completion
  • When an object is relocated its data - and any subsequent read request - is stalled until the object data has been read in from the filesystem.
  • A check (and loud log message!) has been added to catch attempts to write a stripe where a pending relocate is occuring (and the read hasn't completed), hopefully catching (but not repairing for now) instances where said read will result in then-bogus data
  • Rebuild logic has been added - its now easy to read the disk in as 1 megabyte chunks and do basic checks on each stripe. If a stripe has been partially or badly written to disk the contents can be thrown away without affecting the rest of the filesystem
  • Objects no longer live in a single linked list. Each on-disk stripe reigon has an in-memory structure used to track various statistics including a linked list containing which objects are currently there. This makes freeing any arbitrary stripe easy, allowing for much cleaner object expiry and filesystem index rebuild process.

The problems seen so far:

  • The write rate is a function of not only the cacheable data coming in from the network but the hit rate - and subsequent relocation of popular objects - which means the write volume can quickly spiral out of control
  • Some hit-rate issues which I haven't figured out yet. It may be related to my relatively small test caches (~ 8-10 gigabytes) and the polygraph workloads using a much bigger cache data set.

Possible directions to take (although I do need some actual world-testing and statistics first!):

  • Find out what percentage of objects are read in and never referenced again vs objects re-referenced once, twice, four times, eight times, etc.
  • Consider allocating stripe areas as "hot object" stripes which aren't part of the disk LRU. Place popular objects in these stripes and don't relocate them once they're there - this should cut down on the constant object relocation and therefore cut back on the write bandwidth. They can be managed by a different LRU or other replacement scheme.
  • Consider implementing some form of object locality; which will need cooperation from other areas of squid.

Interested in the work? I'm placing snapshots up on my squid website - here.

16 May 2006 »

Henrik Nordstrom pointed out the pread() and pwrite() syscalls which should be supported under Linux. This code now can either use the POSIX AIO calls or the Linux AUFS (user-space threads implementing disk IO) which use pread()/pwrite().

In short; it works, and it works fine.

The trouble now: how to rebuild the store index from disk during startup.

13 May 2006 »

I've been fixing the COSS code in Squid - its a cyclic-style filesystem with a twist - instead of completely cyclic it implements an on-disk LRU.

I've got it mostly stable. The main problem right now? The linux user-space Posix AIO support only seems to de-queue one op at a time per FD. I think this is severely hampering the disk performance; but there's little I can do about it for now. Grr.

23 Mar 2006 »

Today's amusement: how I messed up the precariously-balanced backup system at work. Grr.

Today's amusement #2: my little co-operative multitasking message-passing thingy is running and passing messages between modules. The most it can do? 10 million events in 14 seconds, and not one memory leak. I wonder if it'll leak memory during error conditions..

22 Mar 2006 (updated 22 Mar 2006 at 06:20 UTC) »

Ah, yay.

Updates:

* still owe a lot on my credit card. Wow, who would've thought financial planning was so crucial. :)
* bored at work. There's stuff to do, but its not challenging. Sigh.
* studying psychology/linguistics at UWA. Yes, I'm a second year (ie, not a first year). I gave up studying CompSci - first and second year stuff just frustrates me and I don't need frustrating things whilst I'm working full time.

What I'm currently working on:

* I'm a programmer for an online MUD. No, I won't say which. Its nifty though - it uses a proprietary engine and a crazy syntax which is the bastard child of C and BASIC. Very good for writing MUD code in.
* I'm still tinkering with fast network application frameworks: check my homepage for the CVS repo. "projects/col" has what I'm working on.

69 older entries...

 

k certified others as follows:

  • k certified mtearle as Journeyer
  • k certified gbowland as Apprentice
  • k certified rwatson as Master
  • k certified asmodai as Journeyer
  • k certified phk as Master
  • k certified softweyr as Master
  • k certified winter as Journeyer
  • k certified jwalther as Journeyer
  • k certified benno as Journeyer
  • k certified mnot as Journeyer
  • k certified Skud as Journeyer
  • k certified dancer as Journeyer
  • k certified des as Journeyer
  • k certified msmith as Master
  • k certified grog as Master
  • k certified eivind as Master
  • k certified imp as Master
  • k certified green as Journeyer
  • k certified ashp as Apprentice
  • k certified joshua as Apprentice
  • k certified poombah as Apprentice
  • k certified darkewolf as Journeyer
  • k certified peter as Master
  • k certified billf as Journeyer
  • k certified Simon as Journeyer
  • k certified fusion94 as Journeyer
  • k certified jkh as Master
  • k certified aunty as Journeyer
  • k certified hypatia as Apprentice
  • k certified jmallett as Journeyer

Others have certified k as follows:

  • rwatson certified k as Journeyer
  • asmodai certified k as Journeyer
  • gsutter certified k as Journeyer
  • phk certified k as Journeyer
  • jhb certified k as Journeyer
  • cg certified k as Journeyer
  • cmc certified k as Journeyer
  • mph certified k as Journeyer
  • quiet1 certified k as Journeyer
  • benno certified k as Journeyer
  • jedgar certified k as Journeyer
  • mtearle certified k as Journeyer
  • yakk certified k as Journeyer
  • jmg certified k as Journeyer
  • dancer certified k as Journeyer
  • des certified k as Journeyer
  • eivind certified k as Journeyer
  • ashp certified k as Journeyer
  • joshua certified k as Journeyer
  • poombah certified k as Journeyer
  • jwalther certified k as Journeyer
  • peter certified k as Journeyer
  • bp certified k as Journeyer
  • billf certified k as Journeyer
  • darkewolf certified k as Journeyer
  • mnot certified k as Journeyer
  • winter certified k as Journeyer
  • fusion94 certified k as Master
  • bma certified k as Journeyer
  • cynick certified k as Journeyer
  • mazeone certified k as Journeyer
  • mlsm certified k as Journeyer
  • dcs certified k as Journeyer
  • suso certified k as Journeyer
  • bmilekic certified k as Journeyer
  • nealmcb certified k as Journeyer
  • Kay certified k as Journeyer
  • nixnut certified k as Journeyer
  • Skud certified k as Journeyer
  • XFire certified k as Journeyer
  • jmallett certified k as Journeyer
  • kappa certified k as Journeyer
  • ana certified k as Journeyer
  • footrot certified k as Journeyer
  • dchapes certified k as Journeyer
  • mdeegan certified k as Journeyer
  • fxn certified k as Journeyer
  • trs80 certified k as Journeyer
  • kilmo certified k as Journeyer
  • robertc certified k as Journeyer
  • m certified k as Journeyer
  • bsdgabor certified k as Journeyer
  • okaratas certified k as Master

[ Certification disabled because you're not logged in. ]

New Advogato Features

FOAF updates: Trust rankings are now exported, making the data available to other users and websites. An external FOAF URI has been added, allowing users to link to an additional FOAF file.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!

X
Share this page