Older blog entries for Fefe (starting at number 38)

A few small updates: The ACLs in tinyldap work, at least for the read case. I also added partial write support now. tinyldap can not only serve data from the read-only data structure, but also from a journal that is read in when the server process starts. Since the tinyldap model is to start one process per client, this works well, but I will also add some code so that running server processes can check whether there is news in the journal.

I also added support for LDAP AddRequests, but no other writes yet. My blog software now writes to the journal directly, and it helped greatly. I initially wrote my blog to use LDAP, so I'd have a "customer zero" for tinyldap, and would be compelled to iron out kinks I found while using it. It worked in that it made me add the journaling code, but it also failed in that I now write to the journal directly instead of using LDAP ModifyRequests and so on. Important lesson: using your own laziness to get things done does not always work.

A small update on gatling, too. I added a mode to have more than one process, and I expected it to speed up the cold cache case, because then more than one disk read request would be outstanding at a time, and the OS could use the I/O elevator to optimize the movement of the disk head. To my great surprise it turned out that this case did not get faster at all, but the hot cache case had a 2x speedup when using two cores. This is all on Linux, mind you, but the results are almost the same on other operating systems. See this page for details.
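The multi-process mode can be sketched like this (an assumed structure, not gatling's actual code): fork N workers that all accept() on the same listening socket, so the kernel hands out connections and several requests can be in flight at once.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fork `n` worker processes. Each child would
 * run the accept()/serve loop on `listenfd`; here the children just
 * exit so the sketch stays self-contained. Returns the number of
 * workers actually forked. */
int spawn_workers(int listenfd, int n) {
  (void)listenfd;
  int spawned = 0;
  for (int i = 0; i < n; ++i) {
    pid_t pid = fork();
    if (pid == 0) { /* child: accept/serve loop would go here */ _exit(0); }
    if (pid > 0) ++spawned;
  }
  return spawned;
}
```

All workers inheriting one listening socket is the classic preforking pattern; no dispatcher process is needed.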

I have been drafted to do two talks at the CCC Camp. One will be about the dark side of C++ (why it sucks to audit C++ code) and the other one will be about what optimizations you should not do on your source code, because the compiler already does them for you, and it makes the code less readable.

27 Sep 2005 (updated 27 Sep 2005 at 10:30 UTC) »

The last half year has been quite successful from a business point of view, but it has cost me in terms of time spent on writing free software.

Nonetheless, I have found the time to write rudimentary ACL support for tinyldap, to find out whether the idea works at all. It is still far from finished: you can for example still query using attributes that you are not allowed to see in the ACLs, as long as you don't ask for the attributes in the answer record. And there is headroom for optimization.

Besides that, I have been thinking about adding threads to gatling, to make open() asynchronous. I have seen a production web server serving lots of small images which don't fit into memory at the same time, where gatling was limited by the serialization of the read I/O. Something needs to be done about this, obviously.

Anyway, I am now moving my diary to blog.fefe.de, where I mostly collect URLs that I find interesting, but you will also find a few articles between the links.

27 Jan 2005 (updated 27 Jan 2005 at 01:06 UTC) »

Another year has passed, and much has happened.

From my event notification framework a web server has risen. I named it gatling because it was designed to serve many small requests. gatling was meant not only as a web server, but as a generic file server. The idea is: you go somewhere, want to give someone some files, and you just go to the directory with the files and start gatling. gatling then exports the files using as many protocols as possible. So far only HTTP, HTTPS and FTP are actually implemented. But SMB and NFS are planned. Both protocols are unfortunately very ugly and need more management infrastructure than HTTP and FTP.

I also wrote an anti-spam patch for qmail, find it at www.fefe.de/qmail/. It checks whether envelope sender addresses can be replied to, whether there is some known open proxy or relay on the connecting IP, and some other stuff. Surprisingly effective.

I started writing some ACL infrastructure for tinyldap. Some interesting questions pose themselves. What parameters do you allow for rule parametrization? Only by DN? From the qmail anti-spam patch I have code to match a string against a known list of (sub-)strings, so I could basically match a DN against an arbitrary list of rules with complexity bounded by the length of the DN to be checked, not by the number of rules. However, the more I think about it, the more I think rules should not be restricted to DNs. It should be possible to, for example, match by objectClass. Some further thought is needed on this.
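Since DNs are hierarchical, the matching idea above can be sketched as follows (names and rule format are made up for illustration): walk the DN comma by comma and look up each suffix in a sorted rule table, so the work is bounded by the length of the DN times a logarithmic lookup, not by a scan over all rules.

```c
#include <stdlib.h>
#include <string.h>

/* Comparison for bsearch over an array of C strings. */
static int cmpstr(const void *a, const void *b) {
  return strcmp(*(const char * const *)a, *(const char * const *)b);
}

/* Hypothetical matcher: returns 1 if the DN itself or any of its
 * comma-separated suffixes appears in the sorted `rules` array. */
int acl_match(const char *dn, const char **rules, size_t nrules) {
  for (const char *p = dn; p; p = strchr(p, ',')) {
    if (*p == ',') ++p;  /* step past the separator to the next suffix */
    if (bsearch(&p, rules, nrules, sizeof *rules, cmpstr))
      return 1;          /* some suffix of the DN is a rule */
  }
  return 0;
}
```

A real implementation would presumably use the substring-set matcher from the qmail patch instead of bsearch, but the complexity argument is the same.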

Oh, and I finally bought myself that Athlon 64 I was talking about in the last entry. Surprisingly it is much less noisy than my last system. I watched out for quiet components without compromising on performance, and it is possible. Now that my computer is so quiet, I start to notice how noisy the rest of my environment is :-)

ARGH! I vastly underestimated the relative size of the strings in the LDAP database and the offsets to the strings. The ratio is about 2:1. I.e. for two megs of raw data, I need one more meg for the offsets. And that is not counting the indices (which are much smaller).

parse still can't handle the phone CD. The reason is that Linux will not let me mmap more than 2 gigs. This probably has to do with some stupid compile time constants like the relative position of code, stack, reserved kernel space and the mmap space. Or so. I'm just pulling this out of thin air, sorry. So if there ever was a compelling reason for 64-bit computing, this would be it. Time to get myself an Athlon 64 box. Anyone have a spare one for me? *sigh*

On the other hand, I could have foreseen it. I haven't counted them, but I expect the phone CD to carry at least 20 million records, which means at least 80 megs just for the pointers to the records. This is a lot of data. Maybe I can think of an easy way to be more space efficient. Let's sleep over it. Comments or ideas welcome.

File systems keep pointer sizes down by segmenting the data into smaller independent chunks. Maybe I should do the same. The data could be encoded with a static Huffman tree. Bad tradeoff here. Maybe I will just have to settle for the data for Berlin and Brandenburg or so. Too bad. I really looked forward to this.
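The segmenting idea can be made concrete like this (an illustrative layout, not tinyldap's actual format): split the blob into 64k segments so that record pointers shrink from 32-bit absolute offsets to 16-bit offsets local to their segment; the segment base is implied by which per-segment table the pointer lives in.

```c
#include <stdint.h>

enum { SEGMENT_SHIFT = 16 };  /* 64 KiB segments */

/* Reconstruct an absolute offset from a segment number and a
 * 16-bit segment-local offset. */
uint32_t global_offset(uint32_t segment, uint16_t local) {
  return ((uint32_t)segment << SEGMENT_SHIFT) | local;
}

/* Extract the 16-bit local part of an absolute offset. */
uint16_t local_offset(uint32_t global) {
  return (uint16_t)(global & 0xffff);
}
```

This halves the pointer storage, at the cost of keeping one table per segment.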

The event notification stuff has progressed quite nicely. It is the basis of some benchmarks I did on Linux and the BSDs, and it made quite a wave when I posted it on Slashdot.

The goal was to create a good abstraction for the various event notification APIs, but also around the various sendfile variants, so it would be easy to write a highly scalable web server. The abstraction part is in libowfat, and the web server is called gatling (it's unreleased so far, but you can get the source code via cvs from :pserver:cvs@cvs.fefe.de:/cvs; check out "gatling").

gatling now uses sendfile on Linux, FreeBSD, Solaris, HP-UX, AIX, and uses mmap+write on the other systems (this provides zero-copy TCP at least on NetBSD and IRIX). gatling uses better-than-poll event notification APIs on Linux, the BSDs, IRIX, Solaris and MacOS X. So it should be the ideal basis to write highly scalable network servers and clients.
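The mmap+write fallback mentioned above amounts to something like this (a hedged sketch with a made-up name, not libowfat's actual API): where no sendfile variant exists, map the file and write the mapping, which still avoids a user-space copy on kernels with zero-copy TCP.

```c
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical fallback: send `len` bytes of file `fd` starting at
 * `off` (which must be page-aligned for mmap) to socket `sock`.
 * Returns bytes written, or -1 on error. */
ssize_t send_file_fallback(int sock, int fd, off_t off, size_t len) {
  void *map = mmap(0, len, PROT_READ, MAP_SHARED, fd, off);
  if (map == MAP_FAILED) return -1;
  ssize_t r = write(sock, map, len);
  munmap(map, len);
  return r;
}
```

A production version would map in windows rather than the whole file, and handle partial writes on non-blocking sockets.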

gatling is not just an HTTP server, it also is an FTP server. And the beauty of it is that it can be used in ad-hoc mode, i.e. you just run it and the current directory is exported read-only to the world. Uploads are possible via FTP, but only if the receiving directory is world writable. This is incredibly useful, much more so than I would have initially thought. I'm planning to also add SMB and NFS read-only export. Then gatling would be the ideal LAN party network publication tool.

Also, I started hacking on tinyldap again. A friend of mine reverse engineered the database format on a "telephone book CD" for Germany. I thought to myself: this would be a formidable data set to put into tinyldap to prove that it is not just a toy. So he wrote a converter that spat out tab-separated data, and I converted it to LDIF; it's over 2 gigs worth of ASCII LDIF. Whew. tinyldap assumes that all your data fits into memory. Not so much tinyldap itself, but the programs that create the data file tinyldap serves from. So I started hacking on parse (which creates the binary data file but no indices on it). I ended up completely reworking it, and for days my local source tree would not create correct data files. It was a strange experience for me because normally I only do small changes on my local cvs checkout and check every change in immediately. On the other hand, I always wait until it at least compiles and survives very basic tests. But now I checked everything in.

The next step is to convert the index generation to work like sort(1), i.e. quicksort on smaller blocks and then mergesort on these blocks.

That's harder than it sounds because I am sorting offsets. The comparison function has to seek to the offset and compare the data there. So the merge sort would drown the disk in wild seeking around unless I attach the first n bytes behind the offset to the offset before the merge sort and throw it away afterwards. I'm curious how much seeking I will end up doing.
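The trick of attaching the first n bytes to each offset can be sketched like this (illustrative types only, not the actual parse code): widen each offset into a record that carries a prefix of the string it points to, so the sort's comparison function works on the record itself instead of seeking around on disk.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

enum { PREFIX = 8 };  /* assumed prefix length, enough for most ties */

struct sortkey {
  char prefix[PREFIX];  /* first bytes of the string at `offset` */
  uint32_t offset;      /* position of the full string in the data file */
};

static int by_prefix(const void *a, const void *b) {
  return memcmp(((const struct sortkey *)a)->prefix,
                ((const struct sortkey *)b)->prefix, PREFIX);
}

/* Sort the offsets by the attached prefixes; only entries whose full
 * PREFIX bytes tie would still need a seek to break the tie. */
void sort_offsets(struct sortkey *keys, size_t n) {
  qsort(keys, n, sizeof *keys, by_prefix);
}
```

After the merge phase the prefixes are thrown away and only the sorted offsets are written out.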

I am also seeking sponsors for my work on tinyldap. If you run a company working with LDAP, please go to your LDAP admin while he is doing LDAP administration and see for yourself. Chances are, he will be cursing all the time. Why not spend some money on the development of an LDAP server that does not suck? Hey, you can even say what features you want implemented! If you are interested, please contact me.

I have been working on event notification again. This time, I decided to throw away the old callback based API. It was not as easy to program as I had hoped.

The impetus was that djb has defined an API for event notification in the meantime; it's called io. At first, it didn't look so good to me, but the second look convinced me to use this API for my new implementation.

So far, the code supports epoll (Linux 2.6), sigio (Linux 2.4), kqueue (FreeBSD, OpenBSD and NetBSD-current) and /dev/poll (Solaris). I also wrote a small web server using this API, to conduct some kernel networking scalability tests. Yesterday, I added read-only FTP support. Maybe I'll also add read-only SMB and NFS support, then it would be the ideal file publishing platform for LAN parties or so.
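On Linux 2.6 the epoll backend of such an io layer boils down to something like this (a rough sketch with made-up names; the real code hides kqueue, /dev/poll and sigio behind the same calls): register interest in a descriptor once, then ask the kernel which of the registered descriptors are actually ready.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Register read interest in `fd` on the epoll instance `epfd`. */
int io_register_read(int epfd, int fd) {
  struct epoll_event e = { .events = EPOLLIN, .data.fd = fd };
  return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &e);
}

/* Block up to `timeout_ms` for one readable descriptor; returns its
 * fd, or -1 on timeout or error. A real server would fetch a whole
 * batch of events per call. */
int io_next_readable(int epfd, int timeout_ms) {
  struct epoll_event e;
  int n = epoll_wait(epfd, &e, 1, timeout_ms);
  return n == 1 ? e.data.fd : -1;
}
```

Unlike poll(), the registered set lives in the kernel, so the per-wakeup cost does not grow with the number of idle connections.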

I completed the vorbis diff. It can now also speed up encoding by 25% if you have SSE.

The CCC is doing another camp in a few days. I find the entrance fee of 100 Euros a little steep, but I nonetheless hope many of you will consider coming. I will probably do a workshop on either pattern matching and text retrieval or SIMD hacking. If you are coming and have a preference in the matter, please drop me a mail.

Other than that there's not much to say. It is too hot in Germany right now to do much meaningful work, and I don't have an air conditioner, so I'm basically waiting for the night to get things done (like now, it's 23:00 local time). The EPIA M decided to come back once more. I'm not sure if it is the mainboard or the power supply. But I use it now to play back Winnie Pooh DivX recordings to my son, and it's quite good enough for that.

Oh, in the meantime AMD has released good documentation on SSE and SSE2 (in their Opteron documents). So now AMD documentation is in every respect much better than the Intel stuff. For some reason, the Intel compiler does not produce better code than gcc for me either, even on my Pentium 3 notebook. Bad karma, maybe ;)

My 1 month old VIA EPIA M has died on me :-( When I press the power button, the CPU fan starts very briefly and then stops again. No beeps, nothing else. The power supply fan has the same behaviour. Too bad.

Anyway, c't had an article about EPIA M and they said theirs had a VIA Nehemiah instead of the Ezra that mine had. The difference is that Nehemiah has SSE and no 3dnow. So I hope that I will get a replacement EPIA board with Nehemiah, it is slightly faster.

I took the opportunity to port my libvorbis 3dnow patch to SSE, but boy was I mistaken about the work that would entail! I thought I'd just do it on the train using the new builtins the Intel C compiler defines and gcc also supports. It turns out that my gcc version (3.1.1) creates bad code for some of them, and it generates bad code for all of them unless you turn the optimizer on. gcc wants to save the xmm registers to the stack otherwise, and forgets to align the stack storage to 16 bytes, which is a requirement for SSE. The result is a seg fault in innocuous looking code.

The Intel documentation really sucks. They find it completely unnecessary to document their SSE stuff, you only get documentation about SSE+SSE2. So in the middle of your hacking you find that the instruction you used does not exist in SSE. Duh. Unfortunately, I haven't found any AMD documentation about SSE; their CPU documentation is excellent, especially in comparison to Intel's crap.

Well, back to my vorbis hacking. The patch only speeds up decoding, and it does 1/4 to 1/3 speedup on my Athlon, although I had to work around unaligned data and data layout that is not so vector friendly (one array per channel, so you have to interleave the data manually for the sound card). I put it on my home page, let's see how many people will find it useful. If you download it, please send me an email and tell me! I want to know! ;)
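The data layout problem mentioned above (one array per channel) means the decoder has to interleave samples by hand for the sound card. A simplified two-channel version, assuming float samples already clamped to [-1,1], looks like this:

```c
#include <stdint.h>

/* Interleave separate left/right float channels into the L/R int16
 * pairs a sound card expects. */
void interleave2(const float *left, const float *right,
                 int16_t *out, int nsamples) {
  for (int i = 0; i < nsamples; ++i) {
    out[2*i]     = (int16_t)(left[i]  * 32767.0f);
    out[2*i + 1] = (int16_t)(right[i] * 32767.0f);
  }
}
```

This strided access pattern is exactly what makes the code unfriendly to SIMD: the vector registers want contiguous data per channel, the output wants it interleaved.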

3 Feb 2003 (updated 3 Feb 2003 at 04:21 UTC) »

I was reading the documentation for the different x86 CPUs regarding my SIMD work, and I have started comparing the latencies. I was shocked just how bad the Pentium 4 is.

The Pentium 4 has really horrible latencies, in particular where it really matters: at the SIMD instructions. For example, movq has a latency of 6 on the Pentium 4 (VIA C3: 1, Pentium 3: 1, Athlon: 2). This means moving one MMX register to another, no memory stalls or AGIs included!

Now movq is pretty important because the x86 architecture uses source+destination addressing in the instructions, not source1+source2+destination as most RISC CPUs do, so you need to copy data around all the time to get something done. This means that you have to find five other instructions to schedule between the movq and the next instruction to keep the pipeline filled. That is next to impossible in typical applications. There are not enough registers to unroll a typical loop four times.

No wonder the Pentium 4 performs so badly per clock cycle in comparison to, well, everything else on the market. I'll stick with my Athlon, thank you. Until I can buy a Hammer, that is ;-)

BTW: Since I mentioned it here, I put the 3dnow vorbis decoder diff on my home page, and 46 people downloaded it so far. Whee! Now I need to find the time to do it again in SSE, that should speed things up even more on my Athlon XP.

@nymia: you describe the code slave, not the coder. You describe someone who does it because he needs the money, not because he likes to do it. Don't just copy other people's code; in most cases it turns out those others didn't know what they were doing, either. Read their code and understand why they did it like that. Then you can copy from yourself. Only copy from others after you have completely understood what and why they were doing.

More SIMD hacking

SIMD hacking is starting to be a lot of fun. That and trying to find clever ways to avoid branches. I spent the last few days hacking vorbis to make it faster on my slow C3. Vorbis is completely float based, and the C3 has 3dnow, so this was a good opportunity to learn 3dnow. So far I converted the q&p loop in vorbis_lsp_to_curve, the overlap/add (only large/large) and copy sections of vorbis_synthesis_blockin, vorbis_apply_window, mdct_butterfly_generic and the 2-channel same-endianness segment of ov_read, and netted a 10% speed-up on my Athlon. The C3 has about the same speed-up, which is interesting since the C3's FPU is running at only half the CPU clock speed, so 3dnow should be a bigger gain. Maybe vorbis is limited by the RAM bandwidth and cache misses and not the FPU?
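As an example of the branch avoidance mentioned above (a standard trick, not taken from the vorbis patch): you can clamp a value to be non-negative without a conditional jump by building an all-ones or all-zero mask from the sign bit.

```c
#include <stdint.h>

/* max(x, 0) without a branch: x >> 31 is all ones exactly when x is
 * negative (arithmetic shift assumed, as on all x86 compilers), so
 * ~(x >> 31) is a mask that zeroes out negative inputs. */
static int32_t branchless_max0(int32_t x) {
  return x & ~(x >> 31);
}
```

On CPUs with long pipelines, a mispredicted branch costs far more than the two extra ALU operations, which is why this style pays off in inner loops.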

I have to say that the AMD documentation is much better than the Intel documentation, especially about SIMD. I'll look for some SSE docs on their web page, because the Intel SSE docs are even worse than their other docs. The Intel web site is less forthcoming and their documentation sucks in comparison. At least all the necessary documents can be downloaded for free and without registration (and as PDF and not winword) ;-)

Anyway, during my SIMD hacking I found that I need a good assembly level debugger, a good profiler (I'm using gcov right now, but it won't tell me where the time is spent, only what code is executed often -- close but no cigar; gcov is too imprecise, hrprof seems to get the timings wrong) and a good stall simulator. If someone knows a free tool where I can specify a given target CPU and run my code and it will then tell me which assembly instructions caused which stalls and for what reason, it would be most helpful. Something like this should be relatively easy to do for the bochs people. Or for valgrind, once someone adds MMX, SSE and 3dnow support to their JIT engine.

By the way: a big hooray for valgrind! What a great piece of software! If you develop software, use valgrind!

I wonder how much crypto bulk cipher performance could be gained using MMX and SSE. SSE is next on my list. Hacking code is great, every day you have the opportunity to learn new exciting things! ;-)

If you are new to SIMD and want to see how really skillful people do it, have a look at the ffmpeg sources. I find the altivec stuff (in libavcodec/ppc) particularly impressive.
