Older blog entries for Fefe (starting at number 30)

3 Feb 2003 (updated 3 Feb 2003 at 04:21 UTC) »

I was reading the documentation for the different x86 CPUs regarding my SIMD work, and I have started comparing the latencies. I was shocked just how bad the Pentium 4 is.

The Pentium 4 has really horrible latencies, in particular where it really matters: in the SIMD instructions. For example, movq has a latency of 6 cycles on the Pentium 4 (VIA C3: 1, Pentium 3: 1, Athlon: 2). That is for moving one MMX register to another, with no memory stalls or AGIs involved!

Now movq is pretty important because the x86 architecture uses source+destination addressing in the instructions, not source1+source2+destination as on most RISC CPUs, so you need to copy data around all the time to get something done. This means that you have to find five other instructions to schedule between the movq and the next instruction that uses its result to keep the pipeline filled. That is next to impossible in typical applications. There are not enough registers to unroll a typical loop four times.
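
To make the scheduling problem concrete, here is a minimal sketch in plain C (not MMX, and not taken from any real project) of the usual way to hide latency: unroll the loop and give each copy its own accumulator, so the CPU has independent work to overlap. The function name dot_unrolled4 is made up for illustration. On x86 the four accumulators plus the temporaries are exactly what eats the registers.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product with four independent accumulators (hypothetical helper).
   Each accumulator forms its own dependency chain, so a CPU with long
   instruction latencies can overlap the four chains instead of waiting
   for a single one to finish. */
float dot_unrolled4(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;

    assert(a != NULL && b != NULL);

    /* Main loop: four elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }

    /* Tail: handle the remaining 0..3 elements. */
    for (; i < n; ++i)
        s0 += a[i] * b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that the summation order differs from a simple loop, so the floating point result can differ in the last bits.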

No wonder the Pentium 4 performs so badly per clock cycle in comparison to, well, everything else on the market. I'll stick with my Athlon, thank you. Until I can buy a Hammer, that is ;-)

BTW: Since I mentioned it here, I put the 3dnow vorbis decoder diff on my home page, and 46 people downloaded it so far. Whee! Now I need to find the time to do it again in SSE, that should speed things up even more on my Athlon XP.

@nymia: you describe the code slave, not the coder. You describe someone who does it because he needs the money, not because he likes to do it. Don't just copy other people's code; in most cases it turns out those others didn't know what they were doing, either. Read their code and understand why they did it like that. Then you can copy from yourself. Only copy from others after you have completely understood what they were doing and why.

More SIMD hacking

SIMD hacking is starting to be a lot of fun. That and trying to find clever ways to avoid branches. I spent the last few days hacking vorbis to make it faster on my slow C3. Vorbis is completely float based, and the C3 has 3dnow, so this was a good opportunity to learn 3dnow. So far I converted the q&p loop in vorbis_lsp_to_curve, the overlap/add (only large/large) and copy sections of vorbis_synthesis_blockin, vorbis_apply_window, mdct_butterfly_generic and the 2-channel same-endianness segment of ov_read, and netted a 10% speed-up on my Athlon. The C3 has about the same speed-up, which is interesting since the C3's FPU is running at only half the CPU clock speed, so 3dnow should be a bigger gain. Maybe vorbis is limited by the RAM bandwidth and cache misses and not the FPU?
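
As an illustration of the branch avoidance idea, here is a small sketch of converting a float sample to 16-bit PCM with clipping, written without if/else. This is not the vorbis code and not the 3dnow version; the names clamp16 and float_to_pcm16 are made up, the code assumes two's complement integers and samples small enough that the scaled value fits in 32 bits, and whether the compiler really emits branch-free machine code is up to the compiler.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp v to the signed 16-bit range using masks instead of if/else.
   Assumes two's complement, so -(1) is a mask with all bits set. */
int32_t clamp16(int32_t v)
{
    int32_t over  = -(int32_t)(v > 32767);   /* all ones if too large */
    int32_t under = -(int32_t)(v < -32768);  /* all ones if too small */

    v = (v & ~over)  | (32767 & over);       /* select 32767 when over   */
    v = (v & ~under) | (-32768 & under);     /* select -32768 when under */
    return v;
}

/* Scale a float sample (nominally in [-1.0, 1.0]) to 16-bit PCM.
   Truncates toward zero; the caller must keep |sample| well below
   65536 so that the conversion to int32_t is defined. */
int16_t float_to_pcm16(float sample)
{
    int32_t v = (int32_t)(sample * 32768.0f);
    return (int16_t)clamp16(v);
}
```

A SIMD version would do the same thing on several samples at once with packed compare and mask instructions, or simply with a saturating pack instruction where one is available.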

I have to say that the AMD documentation is much better than the Intel documentation, especially about SIMD. I'll look for some SSE docs on their web page, because the Intel SSE docs are even worse than their other docs. The Intel web site is less forthcoming and their documentation sucks in comparison. At least all the necessary documents can be downloaded for free and without registration (and as PDF and not winword) ;-)

Anyway, during my SIMD hacking I found that I need a good assembly level debugger, a good profiler (I'm using gcov right now, but it won't tell me where the time is spent, only what code is executed often -- close but no cigar; gcov is too imprecise, hrprof seems to get the timings wrong) and a good stall simulator. If someone knows a free tool where I can specify a given target CPU and run my code and it will then tell me which assembly instructions caused which stalls and for what reason, it would be most helpful. Something like this should be relatively easy to do for the bochs people. Or for valgrind, once someone adds MMX, SSE and 3dnow support to their JIT engine.

By the way: a big hooray for valgrind! What a great piece of software! If you develop software, use valgrind!

I wonder how much crypto bulk cipher performance could be gained using MMX and SSE. SSE is next on my list. Hacking code is great, every day you have the opportunity to learn new exciting things! ;-)

If you are new to SIMD and want to see how really skillful people do it, have a look at the ffmpeg sources. I find the altivec stuff (in libavcodec/ppc) particularly impressive.

12 Jan 2003 (updated 12 Jan 2003 at 02:54 UTC) »
SIMD hacking

Wow, what a great weekend so far! ;-)

I started looking at MMX, SSE and 3dnow, and I started hacking at ffmpeg. I actually found a few routines that were called reasonably often, small enough and candidates for SIMD, and wrote a translation. I learned a lot in the process, and now the patches have been integrated into ffmpeg. I will try to submit enough patches to be mentioned in the ChangeLog or in some comment, so I have something to show.

In the process, I found a great profiling package called hrprof. It basically uses a new gcc instrument-this-code option and the Pentium cycle counter to write profiling data usable by gprof but much more accurate. Great tool!
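
For the curious, the mechanism can be sketched roughly like this. The gcc option in question is presumably -finstrument-functions, which makes gcc call two hooks on every function entry and exit. The sketch below is not hrprof's code: it uses clock_gettime as a portable stand-in for the Pentium cycle counter, and the profile_* variables are made up. It only counts calls and the inclusive time of the outermost function.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <time.h>

#define MAX_DEPTH 64

/* Hypothetical bookkeeping, not hrprof's data format. */
long long profile_total_ns;            /* time spent in outermost calls */
long profile_calls;                    /* number of instrumented calls  */
int profile_depth;                     /* current call nesting depth    */
static long long enter_ns[MAX_DEPTH];  /* entry timestamp per depth     */

__attribute__((no_instrument_function))
static long long now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Called by gcc-instrumented code on every function entry. */
__attribute__((no_instrument_function))
void __cyg_profile_func_enter(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    if (profile_depth < MAX_DEPTH)
        enter_ns[profile_depth] = now_ns();
    profile_depth++;
    profile_calls++;
}

/* Called by gcc-instrumented code on every function exit. */
__attribute__((no_instrument_function))
void __cyg_profile_func_exit(void *this_fn, void *call_site)
{
    (void)this_fn;
    (void)call_site;
    assert(profile_depth > 0);
    profile_depth--;
    if (profile_depth == 0)  /* outermost call returned */
        profile_total_ns += now_ns() - enter_ns[0];
}
```

Compiled together with a program built with -finstrument-functions, the hooks run automatically; a real profiler would record this_fn per call so the data can be attributed to individual functions.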

Using the profiler, I found that ffmpeg is spending much time in quant_psnr8x8_c, dct_sad8x8_c and dct_sad16x16_c (and a whole lot of already mmx/mmx2 optimized functions). Those will be my next targets, let's see how far I get. It's quite an experience using those SIMD instructions, so very different from normal SISD hacking. I find it quite rewarding so far.


The shipment with the Infineon DIMM arrived yesterday, and I put together my new EPIA-M box. I set up a PXE net-boot configuration and it now boots diskless. It's in mobile mode (several free-hanging parts connected with strange cables) and is virtually noiseless, although the CPU does have a fan.

Most of the hardware is supported. The network card, the USB, USB2 and Firewire controllers, the sound card... but not the graphics card. I can use mplayer and XFree86 in VESA mode, but that sucks very much, in particular since I want to use it as web clicking PC for my wife, and scrolling a large Mozilla window without hardware blitting is not what I had in mind. The graphics chipset is not even in the lspci database, and I found no data sheet or description of it. It is apparently called CastleRock and is some sort of S3 chip, but what do I know? This is very disappointing, in particular since this hardware was advertised as having "full Linux support". Well, it's not so full after all. If anyone here can help me, please contact me! I already looked at google and tried the current XFree86 snapshot.

Other than that, this is some great hardware! It is fast enough to play full-screen hq-divx movies with AC3 soundtrack at (according to vmstat) 25-40% CPU load. I got the faster one with the 933 MHz CPU, but I guess the slower one would have been sufficient.

10 Jan 2003 (updated 10 Jan 2003 at 01:07 UTC) »

I found out about gmane today. What a great site! I have been waiting for something like this for years! They basically offer mailing lists accessible via nntp, i.e. it's a Usenet news server, just without the usual Usenet newsgroups, instead it has mailing lists.

Why is that so great? Because I hate web interfaces to mailing lists. In particular I loathe bad ones like the one for ffmpeg on SourceForge. Gmane doesn't have all mailing lists, and it doesn't have all the articles for every mailing list, but it is possible to submit mailing list archives. To my great surprise, I found that they already include the mailing list for one of my projects, the diet libc! Wow, what an ego trip ;-)

Anyway, I mentioned this yesterday in my diary, but lost it when I posted a second diary entry, which overwrote the first one. I started working on an event notification framework. I wrote it to learn about new platform specific APIs like sigio and epoll on Linux. I am planning to add kqueue support as well, but haven't gotten around to it yet. I measured some 8000 HTTP connections per second on my desktop box with it. With HTTP keep-alive I even got over 30000 transactions per second in one benchmark.
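
For readers who have not seen epoll yet, here is a minimal Linux-only sketch of the basic calls. It is not code from the framework mentioned above; wait_readable is a made-up helper that just shows epoll_create, epoll_ctl and epoll_wait working together on a single descriptor.

```c
#include <assert.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Wait until fd has input pending or timeout_ms elapses.
   Returns 1 if an event is pending, 0 on timeout, -1 on error.
   (Hypothetical helper; a real server keeps one epoll descriptor
   around and registers many sockets with it.) */
int wait_readable(int fd, int timeout_ms)
{
    struct epoll_event ev, out;
    int epfd, n;

    epfd = epoll_create(1);          /* size hint, must be > 0 */
    if (epfd < 0)
        return -1;

    memset(&ev, 0, sizeof ev);
    ev.events = EPOLLIN;             /* interested in readability */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(epfd);
        return -1;
    }

    n = epoll_wait(epfd, &out, 1, timeout_ms);
    close(epfd);
    return n < 0 ? -1 : n;           /* n is 0 or 1 here */
}
```

The point of epoll over select or poll is that the interest set lives in the kernel, so the cost of one wait does not grow with the number of idle connections.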

The Economist has run a very interesting story. It's about some polls about how people see other countries. The USA approval in Europe has dropped dramatically, even in the UK only 50% of the population approve! Even more dramatic: Europe's approval of Israel is only 38%, that's way down there, almost as bad as Iraq. The strangest result is that 60% of the Europeans and 80% of the Americans want the EU to be a strong leader. Is that a call to rescue the Americans from their own government? Food for thought.

9 Jan 2003 (updated 9 Jan 2003 at 06:08 UTC) »

I wasted the better half of the day trying to get glibc to compile. It just wouldn't work for me. Nobody else appears to have problems with this.

The problem was that glibc's new ld.so checks whether the ELF run path is set. This is done for shared libraries using -Wl,-rpath, which you might have seen somewhere already. Or you can override it at run time with $LD_LIBRARY_PATH. Or you can specify a default at link time with $LD_RUN_PATH.

This I happen to do to make all those stupid GNOME applications find their libraries on my system, which I do not want clobbering my /usr/local/lib, so I put them somewhere under /opt.

How does ld.so check? With assert(). Apparently nobody ever tested this code, because assert calls __assert_fail, which calls some internal printf clone, which calls some conversion routine for numbers, and that one segfaults before anything is actually printed.

So ld.so segfaults before calling any syscall that could be observed with strace. D'oh!

There is little that makes me as happy as seeing people use my software. At a small local LAN party, I saw people casually using npush and npoll (from my ncp package); people whom I never saw before and who didn't even know me. That was amazing.

And it's even more amazing to get bug reports that show people tried to do more with your code than you did yourself! The main problem with free software is that it is not very rewarding. Most emails are gripes about bugs or licensing issues, it's rare to have someone write you just to tell you that he likes your software. So getting a bug report about a detail of a library routine that is not exposed by the surrounding project is a very special gift, because it shows someone not only downloaded the code, he actually read the source code!

Anyway, I finally got an account on an ia64 machine, which allowed me to diagnose the problems the Debian build system reported on that platform. It turned out to be a bug in the start code, so this opportunity forced me to read about ia64 assembly language. I wonder which planet the designers of this architecture came from, and whether it even was in our galaxy.

My friend Öc is currently chasing the holy grail of qmail; he patched qmail to add RCPT TO batching, and he is now working on a generic filter infrastructure, the lack of which is currently a big problem for integrating spam or virus scanners.

fnord has come around nicely. I just got an email from the GNU project about it, they want to include it in their directory. This also happened to the diet libc a while back and after that it really took off. I take this as a good omen ;) The diet libc has been ported to x86_64 and ia64 now, I think it is time to look at it from a security point of view now and try to get external people to audit the source code. It is getting used on servers more and more.

I haven't had much time for tinyldap lately, regrettably. Too bad as it still does not have ACLs or write capability.

On the other hand, I am thinking about writing an sshd that only supports protocol version 2. I don't have that warm and fuzzy feeling with openssh any more.

I decided to enter the Honeynet challenge. To maximize my own learning experience, I decided to do it the hard way: by not running the binary, not even in strace or a debugger. Just working on the disassembler output. The first part was easier than the last part. I actually thought I could finish this in 24 hours, but then I fell asleep on the keyboard ;-)

I am finished now and I think it was worth it. There is still plenty of time, so if you think you are up to it, go ahead and join the contest!

Time and again, just when I am starting to lose faith in the free software community (usually when I go to Freshmeat and see superfluous crap projects like the hundredth PHP weblog or window manager or trivial 3k hack bloated to multiple megabytes using Java, KDE or GNOME), someone comes along and restores my faith.

Just today, I received a report regarding tinyldap for a cut-and-paste bug in the library that is not visible in any of the actual binaries. So, to find it, you need to read (and understand) the source code. After all the kindergarten behavior and low-quality postings on the mailing lists or Usenet newsgroups for popular software like qmail or Apache, I can really appreciate the difference. Normally most of the messages ask questions straight out of the manual, or worse, the FAQ. So far, the mailing lists for my projects have been great examples of how mailing lists are supposed to be. Very low volume, no fluff. It's good to see that this can actually work.

I am still thinking about how to implement ACLs for tinyldap. This issue has been clouding my mind for over two weeks now. OpenLDAP slows down by a factor of 100 with ten measly ACLs for my test data, so it is very important to get this right.

tinyldap is now actually in a state where it begins to be useful. It has a trivial client that can do EQUALS queries, it has a relatively simple server that is meant to be run from tcpserver (see ucspi-tcp), and it already outperforms openldap by a factor of almost 10 for index generation and simple queries.

Time and again I am amazed at how bad the successful software packages out there are. Apache is easily outperformed by fnord, MySQL isn't even a real database, PHP is so slow that Zend actually made a business model out of selling performance enhancing hacks for it... The only widely used free software projects that actually perform well in their market are GNU grep and the Linux kernel (and the latter is an evil bloat monster).

Sometimes the state of the free software is quite depressing. Well, enough talk. Let's do something about it!
