k is currently certified at Journeyer level.

Name: Adrian Chadd
Member since: 2000-04-04 17:58:16
Last Login: 2017-05-19 23:46:50


Homepage: http://www.creative.net.au/

Notes:

I work on a few projects in my spare time.

Those include FreeBSD, Squid/Lusca, the Cacheboy CDN, ircd-hybrid, and stuff I've forgotten.

In my spare time I'm working on some genetic programming engines.

Recent blog entries by k


One of the most annoying things about running traffic statistics over a proxy log is the use of "magic hostnames". There are a few examples of this worth mentioning:

  • blogspot/wordpress - one hostname per blog site
  • pheedo.com - using [md5].img.pheedo.com for an image URL
  • youtube/googlevideo/limewire/bitgravity/etc - CDNs with many, many content servers, to which requests are directed via some method

Caching issues aside, this makes it difficult to do straight "per-host" statistics as one can have an entire site "hidden" underneath many, many hostnames which individually serve a small amount of traffic.

Anyway. The naive way of working around this is to craft rules to aggregate specific domains as needed. I've done this in my stats software so I can generate daily statistic sets that are manageable. This gives me live data to play with. :)

Another way is to simply figure out the top level / second level domains and aggregate them at that level. So, you'd aggregate *.[domain].com ; *.[domain].net ; but not *.[domain].au. For .au you would aggregate *.[domain].com.au and so on. This should work fine (as the country domain name structure is reasonably static these days) but it does mean you end up hiding a lot of potential information from the domain name. For example, a CDN may have [server].[client].domain.com - seeing the specific per-client traffic statistics may help identify which websites are causing the traffic, versus just seeing *.domain.com.
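
As a rough illustration of that aggregation rule (this isn't the code from my stats software, and the suffix table here is a made-up, incomplete example - a real one would need the full list of country structures):

/* Reduce a hostname to its aggregate key: normally the last two labels
 * ("*.domain.com"), or the last three if the host ends in one of the
 * listed country suffixes ("*.domain.com.au"). */
#include <stdio.h>
#include <string.h>

static const char *extra_label_suffixes[] = { ".com.au", ".net.au", ".co.uk", NULL };

static void aggregate_key(const char *host, char *out, size_t outlen)
{
    size_t hlen = strlen(host);
    int keep = 2;

    for (int i = 0; extra_label_suffixes[i] != NULL; i++) {
        size_t slen = strlen(extra_label_suffixes[i]);
        if (hlen >= slen && strcmp(host + hlen - slen, extra_label_suffixes[i]) == 0) {
            keep = 3;
            break;
        }
    }

    /* Walk backwards, counting dots, until enough labels are kept. */
    int dots = 0;
    const char *p = host + hlen;
    while (p > host) {
        if (p[-1] == '.' && ++dots == keep)
            break;
        p--;
    }
    snprintf(out, outlen, "*.%s", p);
}

int main(void)
{
    char key[256];
    aggregate_key("img3.cdn.example.com.au", key, sizeof(key));
    printf("%s\n", key);   /* *.example.com.au */
    aggregate_key("somename.blogspot.com", key, sizeof(key));
    printf("%s\n", key);   /* *.blogspot.com */
    return 0;
}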

Out of curiosity, I decided to come up with some way of figuring out where these domain names branch out into multiple sites. Partly so I can generate some aggregation rules, but partly so I can later begin accounting traffic to both individual hosts and all parts of the domain name.

Anyway. This was easy to solve in a very, very naive way. I simply built a tree in memory based on the reversed hostname (so foo.domain.com -> moc.niamod.oof) so I could easily identify where in the name the branching occurs. I then just created one vertex per character.

Every time I added a hostname to the tree I incremented a reference count for all vertex nodes in the path.

Finally, to figure out which prefixes were the most prevalent, I simply depth-first searched the tree via recursion, looking for nodes that met a specific criterion: the node character was "." and the refcount was above some arbitrary hard-coded limit (8). I then sorted the resulting list by refcount.
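
For the curious, here's a minimal sketch of that approach - one vertex per character of the reversed hostname, a refcount bumped along the insert path, and a recursive DFS that reports "." vertices above the hard-coded limit. It's not the original code (and the final sort by refcount is left out); it just reads one hostname per line from stdin:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REFCOUNT_LIMIT 8   /* the arbitrary hard-coded limit mentioned above */

struct vertex {
    char ch;                 /* the character stored at this vertex */
    unsigned long refcount;  /* how many hostnames pass through this vertex */
    struct vertex *child;    /* first child; children are deliberately NOT sorted */
    struct vertex *sibling;  /* next sibling in the unsorted child list */
    struct vertex *parent;   /* kept for the backwards-printing trick described later */
};

static struct vertex *vertex_new(char ch, struct vertex *parent)
{
    struct vertex *v = calloc(1, sizeof(*v));
    if (v == NULL)
        abort();
    v->ch = ch;
    v->parent = parent;
    return v;
}

/* Add one hostname: walk/extend the tree one character at a time, starting
 * from the end of the name (i.e. the reversed hostname), incrementing the
 * refcount of every vertex on the path. */
static void tree_add(struct vertex *root, const char *host)
{
    struct vertex *cur = root;
    for (size_t i = strlen(host); i > 0; i--) {
        char ch = host[i - 1];
        struct vertex *v;
        for (v = cur->child; v != NULL; v = v->sibling)  /* linear scan; unsorted */
            if (v->ch == ch)
                break;
        if (v == NULL) {
            v = vertex_new(ch, cur);
            v->sibling = cur->child;
            cur->child = v;
        }
        v->refcount++;
        cur = v;
    }
}

/* Recursive depth-first search carrying a prefix buffer: report any vertex
 * whose character is '.' and whose refcount is above the limit.  Output is
 * in DFS order; the sort by refcount is left out of the sketch. */
static void report(const struct vertex *v, char *buf, size_t depth, size_t buflen)
{
    if (v->ch == '.' && v->refcount > REFCOUNT_LIMIT) {
        printf("%lu: ", v->refcount);
        for (size_t i = depth; i > 0; i--)   /* print back-to-front to un-reverse */
            putchar(buf[i - 1]);
        putchar('\n');
    }
    for (const struct vertex *c = v->child; c != NULL; c = c->sibling) {
        if (depth < buflen) {
            buf[depth] = c->ch;
            report(c, buf, depth + 1, buflen);
        }
    }
}

int main(void)
{
    struct vertex *root = vertex_new('\0', NULL);
    char line[512], buf[512];

    /* One hostname per line on stdin. */
    while (fgets(line, sizeof(line), stdin) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';
        if (line[0] != '\0')
            tree_add(root, line);
    }
    report(root, buf, 0, sizeof(buf));
    return 0;
}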

The result:


844611: .com
82937: .net
51478: .org
36302: .blogspot.com
18525: .uk
17246: .wordpress.com
16527: .co.uk
15297: .info
15237: .ru
14790: .ningim.com
12359: .pheedo.com
12355: .img.pheedo.com
12328: .edu
9992: .files.wordpress.com
9980: .live.com
9578: .de
8430: .us
7171: .deviantart.com
6484: .photofile.ru
6481: .users.photofile.ru
6112: .profile.live.com
5197: .yuku.com
5044: .stats.esomniture.com
5044: .esomniture.com
4960: .avatar.yuku.com
4817: .bigtube.com
4659: .ss.bigtube.com
4541: .llnwd.net
4246: .au
4161: .vo.llnwd.net

... so, as expected, really. This host list doesn't include all of the hosts seen over a month of proxy access, as I had done some pre-aggregation to keep the host list database manageable.

Now, the hacky part.

This wasn't a string B*Tree - the vertex children were -not- sorted. This means searches aren't as efficient as they could have been, but for the data set in question (11 million hosts) the algorithm ran in O(perfectly sensible) time and didn't degrade noticeably as the data set increased. Adding the extra code to do proper insertion and lookup optimisation would have made it faster, sure, but I wanted to see if I had to care or not. No, I didn't have to care.

It was written in C. Yes, with no pre-defined tree datatypes. This is why I wanted to do the minimum effort required. :)

I initially had domain name fragments for the vertex nodes (ie, foo.domain.com became "foo"->"domain"->"com") but due to the distribution of the strings (ie, a -lot- under .com), the search speed (and thus insert speed) was very bad. I knew the worst case performance of the unsorted-node B*tree would be bad (ie, O(tree depth * number of entries per node)) and I absolutely, positively hit it with .com.

Finally, I needed some way of printing out the hostname given a particular vertex node. A typical way of doing this via recursion is to pass a "prefix string" into the recursive function, which gets modified as nodes are visited. I then realised I could simply traverse the tree backwards to assemble the string when needed, so I didn't need to manage a variable-length string buffer or artificially limit how long the string could be (and track that).
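
In terms of the sketch above (which assumes each vertex keeps a parent pointer - my addition, not necessarily how the original was laid out), that backwards traversal is only a few lines:

/* Print the hostname suffix for a vertex by walking parent pointers back
 * to the root.  Because the tree stores the reversed hostname, walking
 * "up" emits the characters in normal order, so no prefix buffer or
 * length tracking is needed. */
static void print_suffix(const struct vertex *v)
{
    for (; v != NULL && v->parent != NULL; v = v->parent)
        putchar(v->ch);
}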

In summary, this is all basic first year computer science data structures clue. It was an amusing way to spend an hour. It will be much more fun to merge this with the statistics code to provide domain-based aggregate statistics...

To the users of open source:

Hi. Your generous donation of equipment and testing is appreciated. But please understand that software needs writing long before it needs testing. Unless we're writing device drivers or adapting old software to new hardware, we really, really need resources to help us -write- the software in the first place.

Writing software generally requires being employed somehow and paying bills. Please keep this in mind.

These particular email threads, when they eventually show up in the mailing list archives, explain quite nicely why I stopped developing on Squid-3: this, and then this.

I'm sorry, guys - you first agreed on a basic roadmap and timeline for Squid-3.1, and then in the middle of a "release candidate" release cycle you decided to add more damned features.

Considering exactly how successful previous "new features" have been in keeping the stability and performance of Squid-3 up there, I have absolutely no faith anymore in what they do.

30 Jun 2009 (updated 30 Jun 2009 at 14:17 UTC)

I've been busy working on a bunch of stuff.

* The log analysis stuff is coming along nicely. Thank you very much, SQLite. I'll post a specific update or two about that when I've finished fixing the bugs I've introduced.

* I've modified the pygrub boot loader to understand FreeBSD disk labels. The hacking can be found at http://people.freebsd.org/~adrian/xen/ in the bsd_pygrub directory. It turns out that the pygrub/Xen UFS code is (a) Solaris UFS, (b) UFS1-only, and (c) prone to crashing very badly when fed a FreeBSD-formatted UFS1 filesystem, for some reason. I'll investigate that shortly. It is one more step towards sensible FreeBSD/Xen integration, though!

* I've been fixing bugs and adding features to my Squid-2 fork, Lusca. I've found and fixed a couple of nasty bugs inherited from Squid-2.HEAD (especially one to do with 304 replies not making it back to the client!) and I've started documenting how all of the transparent hijacking/intercepting code works.

I'm doing some pretty heavy customisation of "Lightsquid", a GPL'ed squid logfile analysis tool. I'd like to be able to offer Xenion customers a decent set of management and reporting tools.

The Lightsquid interface is reasonably simple, fast and snappy. It captures the right amount of information for the average network/system administrator. There are a few problems though.

The HTML needs an overhaul. It's nested-table hell. It is all done via a custom template engine, so it shouldn't be too painful.

The parser seems to assume you're going to feed it all of the logs for a given day. If you feed it half a day at a time, the second import will overwrite the first.

The data is stored in flat files, indexed by day. This is fine - a year is 365 directories - and trolling each directory to pull the daily stats isn't too bad. But the per-user statistics are kept in single files, one per user per day. Generating a monthly or yearly user report per user is a very, very expensive operation. Multiply that by a few thousand users and it just won't scale.

I'm going to have to abstract the data storage and retrieval out into a simple API, then implement a database backend for it. The API should support an "add" operation so I can handle adding data to an existing day's repository.
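
Something along these lines is the rough shape - sketched in C purely for illustration, with every name made up; a flat-file backend and a database backend would each fill in their own set of operations:

#include <time.h>

struct user_day_stats {
    const char        *user;
    unsigned long      requests;
    unsigned long long bytes;
};

struct stats_store {
    void *ctx;  /* backend-private state: directory path, db handle, ... */

    int (*open)(void *ctx, const char *location);

    /* "add" rather than "replace", so importing half a day at a time
     * accumulates into the existing day instead of overwriting it. */
    int (*add_user_day)(void *ctx, time_t day, const struct user_day_stats *s);

    /* Range query, so monthly/yearly per-user reports don't have to open
     * one file per user per day. */
    int (*get_user_range)(void *ctx, const char *user, time_t from, time_t to,
                          struct user_day_stats *out, int max_out);

    void (*close)(void *ctx);
};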

There probably won't be much of the original Lightsquid code left when I'm eventually done with it.

Then once that is done, I can focus on some better monitoring and management tools.


k certified others as follows:

  • k certified mtearle as Journeyer
  • k certified gbowland as Journeyer
  • k certified rwatson as Master
  • k certified asmodai as Journeyer
  • k certified phk as Master
  • k certified softweyr as Master
  • k certified winter as Journeyer
  • k certified jwalther as Journeyer
  • k certified benno as Journeyer
  • k certified mnot as Journeyer
  • k certified Skud as Journeyer
  • k certified dancer as Journeyer
  • k certified des as Journeyer
  • k certified msmith as Master
  • k certified grog as Master
  • k certified eivind as Master
  • k certified imp as Master
  • k certified green as Journeyer
  • k certified ashp as Apprentice
  • k certified joshua as Apprentice
  • k certified poombah as Apprentice
  • k certified darkewolf as Journeyer
  • k certified peter as Master
  • k certified billf as Journeyer
  • k certified Simon as Journeyer
  • k certified fusion94 as Journeyer
  • k certified jkh as Master
  • k certified aunty as Journeyer

Others have certified k as follows:

  • rwatson certified k as Journeyer
  • asmodai certified k as Journeyer
  • gsutter certified k as Journeyer
  • phk certified k as Journeyer
  • jhb certified k as Journeyer
  • cg certified k as Journeyer
  • cmc certified k as Journeyer
  • mph certified k as Journeyer
  • quiet1 certified k as Journeyer
  • benno certified k as Journeyer
  • jedgar certified k as Journeyer
  • mtearle certified k as Journeyer
  • yakk certified k as Journeyer
  • jmg certified k as Journeyer
  • dancer certified k as Journeyer
  • des certified k as Journeyer
  • eivind certified k as Journeyer
  • ashp certified k as Journeyer
  • joshua certified k as Journeyer
  • poombah certified k as Journeyer
  • jwalther certified k as Journeyer
  • peter certified k as Journeyer
  • bp certified k as Journeyer
  • billf certified k as Journeyer
  • darkewolf certified k as Journeyer
  • mnot certified k as Journeyer
  • winter certified k as Journeyer
  • fusion94 certified k as Master
  • bma certified k as Journeyer
  • cynick certified k as Journeyer
  • mazeone certified k as Journeyer
  • mlsm certified k as Journeyer
  • dcs certified k as Journeyer
  • suso certified k as Journeyer
  • bmilekic certified k as Journeyer
  • nealmcb certified k as Journeyer
  • nixnut certified k as Journeyer
  • Skud certified k as Journeyer
  • XFire certified k as Journeyer
  • kappa certified k as Journeyer
  • ana certified k as Journeyer
  • footrot certified k as Journeyer
  • dchapes certified k as Journeyer
  • mdeegan certified k as Journeyer
  • fxn certified k as Journeyer
  • trs80 certified k as Journeyer
  • kilmo certified k as Journeyer
  • robertc certified k as Journeyer
  • m certified k as Journeyer
  • bsdgabor certified k as Journeyer
  • okaratas certified k as Master
