Older blog entries for monkeyiq (starting at number 71)

Fighto time.

Recently a question was posed to me, and I offered a reasonably off the cuff response. This led to an interesting debate about whether set<string> would be hugely slower than hash_set<string> for exactly the case where hash_set<> should whip a balanced tree's butt: direct lookups.

So without going into that conversation, I decided to benchmark the two containers from both stdc++ and stlport 4.x. This is using gcc 4.0.2, which is shameful as I should have a more recent gcc. I'll likely rerun it on icc and 4.1.x as well.

The core of the code reads strings from cin and shoves them into a std::list. For the set<> part I build a set from the list (which will contain dups) and then iterate the list 50 times, looking up each entry (dups included again) in the built set<> or hash_set<>.

There is of course some cruft there to select the right container from stdc++ and stlport because hash_set is non standard.
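Roughly the shape of that cruft is sketched below. Note this is illustrative only: the real benchmark's #ifdefs differ, and USE_STLPORT plus the simple hash functor are made up stand-ins for this sketch.

    // Illustrative only: the real benchmark's #ifdefs differ, and
    // USE_STLPORT is a made up macro for this sketch.
    #include <cstddef>
    #include <list>
    #include <set>
    #include <string>

    // simple hash functor so neither library needs a hash<std::string>
    struct string_hash
    {
        std::size_t operator()( const std::string& s ) const
        {
            std::size_t h = 0;
            for( std::string::size_type i = 0; i < s.length(); ++i )
                h = 5 * h + s[i];
            return h;
        }
    };

    #ifdef USE_STLPORT
      #include <hash_set>
      typedef std::hash_set< std::string, string_hash > hs_t;
    #else
      #include <ext/hash_set>
      typedef __gnu_cxx::hash_set< std::string, string_hash > hs_t;
    #endif

    typedef std::list< std::string > l_t;

    const int LOOKUPS = 50;
    l_t  l;                          // words read from cin
    std::set< std::string > strset;  // tree based set
    hs_t hstrset;                    // hash based set
    bool use_hash = false;           // set from argv in the real program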

    if( use_hash )
    {
        l_t::iterator e = l.end();
        // build the hash_set from the (dup containing) word list
        for( l_t::iterator iter = l.begin(); iter != e; ++iter )
            hstrset.insert( *iter );
        // then do LOOKUPS passes of find() over every word, dups included
        for( int i=0; i<LOOKUPS; ++i )
            for( l_t::iterator iter = l.begin(); iter != e; ++iter )
                hstrset.find( *iter );
    }
    else
    {
        l_t::iterator e = l.end();
        // same procedure against the tree based std::set
        for( l_t::iterator iter = l.begin(); iter != e; ++iter )
            strset.insert( *iter );
        for( int i=0; i<LOOKUPS; ++i )
            for( l_t::iterator iter = l.begin(); iter != e; ++iter )
                strset.find( *iter );
    }

So, the benchmarks, all compiled with -O9. Other gcc options don't seem to have any real effect. I created input from Gutenberg files; l.size is the number of words read. The hash_set versions are quicker for the completely degenerate case of only doing direct lookups and doing each of them at least 50 times per unique word in the input.

Perhaps the most interesting point is the difference in speed between stlport and libstdc++ here. I am now very interested to see how stlport 5.x compares.


    # Using stdc++::set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset
    l.size:273435 use_hash:0

    real    0m16.980s
    user    0m16.493s
    sys     0m0.028s

    # Using stlport::set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport
    l.size:273435 use_hash:0

    real    0m10.184s
    user    0m9.821s
    sys     0m0.084s

    # Using stdc++::hash_set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset 1
    l.size:273435 use_hash:1

    real    0m4.061s
    user    0m3.868s
    sys     0m0.024s

    # Using stlport::hash_set<>
    foo$ time cat /tmp/largetxt.txt | ./string_xset_stlport 1
    l.size:273435 use_hash:1

    real    0m2.430s
    user    0m2.328s
    sys     0m0.012s

Moving back to blogging here for a while... it seems using university equipment for this is not so optimal: loss of control over the network environment, etc.

Moving my blog over to UQ because it's much easier to update and I can play with blosxom as well :)

Read my article at xml.com on Using libferris with XML.

Enjoy.

Novell's VFS

At Edd's blog there is mention of Novell open sourcing their VFS and the possibility of integrating it into GNOME. Very interesting; I should add support for mounting Simias to libferris, hopefully sometime shortly after they release it. I suspect it might be somewhat hard to install Simias at first though (maybe they want to beat libferris for difficulty of installation ;)

Does make me wonder a little about the whole 'sharing VFS' thing that comes up from time to time (search freedesktop.org's mailing lists): whether there can be code sharing at the level of an Extended Attribute extractor (like something that extracts EXIF data from exif/jfif images). I think this is the most realistic level at which the VFS systems can interoperate, as there is an EA interface and a binary chunk that data is extracted from. Though even at this 'grass roots' level libferris differs from C based VFS systems in that data is accessed through the C++ IOStreams interface... though bolting an obj->read( offset, count ) style interface onto that should be trivial, as the sketch below shows.
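For example, something like the following would do it. The Extractor interface here is a made up stand-in rather than anything from libferris or another VFS; it just shows an obj->read( offset, count ) style object being wrapped in a streambuf so IOStreams code can consume it.

    #include <cstddef>
    #include <istream>
    #include <streambuf>

    // hypothetical EA extractor exposing a read( offset, count ) interface
    class Extractor
    {
    public:
        virtual ~Extractor() {}
        // copy up to count bytes starting at offset into buf,
        // returning the number of bytes copied (0 at EOF)
        virtual std::size_t read( std::size_t offset, char* buf, std::size_t count ) = 0;
    };

    // streambuf adapter: underflow() refills its buffer from Extractor::read()
    class extractor_streambuf : public std::streambuf
    {
        Extractor&  m_ex;
        std::size_t m_offset;
        char        m_buffer[ 4096 ];

    protected:
        virtual int_type underflow()
        {
            std::size_t n = m_ex.read( m_offset, m_buffer, sizeof(m_buffer) );
            if( !n )
                return traits_type::eof();
            m_offset += n;
            setg( m_buffer, m_buffer, m_buffer + n );
            return traits_type::to_int_type( *gptr() );
        }

    public:
        explicit extractor_streambuf( Extractor& ex )
            : m_ex( ex ), m_offset( 0 )
        {}
    };

    // usage: wrap the adapter in an istream and hand it to IOStreams code
    //   extractor_streambuf sb( myExtractor );
    //   std::istream iss( &sb );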

On the flipside, I've been adding support for Adobe's XMP metadata embedding to libferris. This should make for a reasonably kickass /usr/local/doc search engine with the new EA from PDF files and the full-text indexing of HTML and PDF etc. that ferris already does. I think I've managed to sniff out most of the embedded RDF standards now; after XMP there is the JPEG COMMENT1 block RDF standard that I've yet to code support for.
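To give a feel for what the sniffing amounts to (just an illustration, not the libferris code): an XMP packet is an XML blob embedded verbatim in the host file, wrapped in an x:xmpmeta element, so a naive extractor can locate it by scanning for the packet markers and then hand the XML off to an RDF parser.

    #include <fstream>
    #include <iostream>
    #include <iterator>
    #include <string>

    int main( int argc, char** argv )
    {
        if( argc < 2 )
            return 1;
        std::ifstream f( argv[1], std::ios::binary );
        std::string data( (std::istreambuf_iterator<char>( f )),
                          std::istreambuf_iterator<char>() );

        // XMP packets are wrapped in an x:xmpmeta element
        std::string::size_type b = data.find( "<x:xmpmeta" );
        std::string::size_type e = data.find( "</x:xmpmeta>" );
        if( b == std::string::npos || e == std::string::npos )
        {
            std::cerr << "no XMP packet found" << std::endl;
            return 1;
        }
        // the packet body is RDF/XML; hand it to an RDF parser for the EAs
        std::cout << data.substr( b, (e + 12) - b ) << std::endl;
        return 0;
    }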

I might have to make mod_ferris for apache soon so that a web interface to the EA/indexes can be created... maybe.

Dude, where's my filesystem

Trying to use evolution-data-server in little clients, not going all that gleefully as yet... must mount evolution... must mount ...

Have now returned from icFCA04, the international conference on Formal Concept Analysis, which was in Sydney this year. Interesting mix of computer science folks and raw maths guys there. I suspect that next year I'll make my talk involve many more pictures and mathematical descriptions of what I am thinking, rather than the near code-level talks that I tend to be attracted to.

Went along to the SLUG while I was there too. Some very interesting chats always seem to be waiting at that user group... Chatted with jdub for a while over dinner about GNOME, explained some of the crack that I've sprinkled over libferris, and showed some lattices from FCA that I plan to make into the future of n-dimensional filesystem browsing (though not in those terms ;). From what I've heard there are some very interesting plans for the future of gnome-vfs... It would be interesting to chat with some of the dudes planning to extend g-vfs, as I think the open source community needs to move fairly quickly to create an opposing system to WinFS. Parts of WinFS seem really cool and we definitely want that coolness too (though hopefully we can improve some parts of WinFS that are not so cool).

Top of the reading heap

Computer Graphics: Principles and Practice in C, 2/E
ISBN: 0-201-84840-6
Publisher: Addison Wesley Professional
Copyright: 1996
Format: Cloth; 1200 pp
US: $79.99
<already read parts, very good stuff, well worth having>

Principles of Data Mining
Written by David J. Hand, Heikki Mannila, Padhraic Smyth
Published by MIT Press (01 August, 2001)
ISBN: 026208290X
US: $58.00
<read first 150pp. very interesting>

Hmm, should really try to get more sites to publish information about libferris. Seems that /. is loving stories about similar tools, with their past posting on Gnome-Storage and now this one mentioning some of the WinFS stuff.

I've been meaning to brew up a public document comparing libferris and its plans with the WinFS stuff on MSDN-TV, though given a limited time budget...

ACE logging @ osnews.com

Here is the link.

From the first page:

"However, an important factor in the logging facility's design is the ability to effectively "no-op" the logging statements at compile time."

Note that a stream based, compile-time on/off method works easily:

    // in header
    #ifdef NDEBUG
      // with NDEBUG the whole statement is dead code and compiles away
      #define MY_DEBUG if( 0 ) cerr
    #else
      // otherwise pick a real or a null stream at runtime; a ternary
      // keeps MY_DEBUG usable as an expression that can be streamed into
      // (both helpers must return std::ostream&)
      #define MY_DEBUG ( should_runtime_show()    \
          ? get_real_stream_reference()           \
          : get_null_stream_reference() )
    #endif

    // in program
    MY_DEBUG << "foo" << endl;
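For completeness, a rough sketch (my own filler, not from the OSNews article) of what those helpers could look like; the null stream just wraps a streambuf that discards everything:

    #include <cstdlib>
    #include <iostream>
    #include <streambuf>

    // hypothetical helpers backing the MY_DEBUG macro above

    static bool should_runtime_show()
    {
        // e.g. gate debug output on an environment variable
        return std::getenv( "MY_DEBUG" ) != 0;
    }

    // a streambuf that throws every character away
    class null_streambuf : public std::streambuf
    {
    protected:
        virtual int_type overflow( int_type c )
        {
            return traits_type::not_eof( c );
        }
    };

    static std::ostream& get_real_stream_reference()
    {
        return std::cerr;
    }

    static std::ostream& get_null_stream_reference()
    {
        static null_streambuf sb;
        static std::ostream   nullstream( &sb );
        return nullstream;
    }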
