Older blog entries for Fefe (starting at number 32)

I completed the vorbis diff. It can now also speed up encoding by 25% if you have SSE.

The CCC is doing another camp in a few days. I find the entrance fee of 100 Euros a little steep, but I nonetheless hope many of you will consider coming. I will probably do a workshop on either pattern matching and text retrieval or SIMD hacking. If you are coming and have a preference in the matter, please drop me a mail.

Other than that there's not much to say. It is too hot in Germany right now to do much meaningful work, and I don't have an air conditioner, so I'm basically waiting for the night to get things done (like now, it's 23:00 local time). The EPIA M decided to come back once more. I'm not sure if it is the mainboard or the power supply. But I use it now to play back Winnie Pooh DivX recordings to my son, and it's quite good enough for that.

Oh, in the meantime AMD has released good documentation on SSE and SSE2 (in their Opteron documents). So now AMD documentation is in every respect much better than the Intel stuff. For some reason, the Intel compiler does not produce better code than gcc for me either, even on my Pentium 3 notebook. Bad karma, maybe ;)

My 1 month old VIA EPIA M has died on me :-( When I press the power button, the CPU fan starts very briefly and then stops again. No beeps, nothing else. The power supply fan has the same behaviour. Too bad.

Anyway, c't had an article about EPIA M and they said theirs had a VIA Nehemiah instead of the Ezra that mine had. The difference is that Nehemiah has SSE and no 3dnow. So I hope that I will get a replacement EPIA board with Nehemiah, since it is slightly faster.

I took the opportunity to port my libvorbis 3dnow patch to SSE, but boy was I mistaken about the work that would entail! I thought I'd just do it on the train using the new builtins the Intel C compiler defines and gcc also supports. It turns out that my gcc version (3.1.1) creates bad code for some of them, and it generates bad code for all of them unless you turn the optimizer on. Without the optimizer, gcc wants to save the xmm registers to the stack, and forgets to align the stack storage to 16 bytes, which is a requirement for SSE. The result is a segfault in innocuous-looking code.
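To make the alignment issue concrete, here is a minimal sketch, not code from the patch. The names is_aligned16 and add4 are made up for illustration. It assumes a compiler that provides <xmmintrin.h> when SSE is enabled and falls back to scalar code elsewhere. The unaligned load/store intrinsics work for any address; the aligned variants (_mm_load_ps, _mm_store_ps) fault when the address is not a multiple of 16.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Returns 1 if p is 16-byte aligned, as aligned SSE loads/stores require. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* dst[i] = a[i] + b[i] for four floats.
 * Uses unaligned loads and stores, so any float pointer is acceptable. */
static void add4(float *dst, const float *a, const float *b) {
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb));
#else
    /* Scalar fallback for targets without SSE. */
    for (size_t i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

If is_aligned16() holds for all three pointers, the aligned intrinsics could be used instead; whether that is measurably faster depends on the CPU.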

The Intel documentation really sucks. They find it completely unnecessary to document their SSE stuff, you only get documentation about SSE+SSE2. So in the middle of your hacking you find that the instruction you used does not exist in SSE. Duh. Unfortunately, I haven't found any AMD documentation about SSE; their CPU documentation is excellent, especially in comparison to Intel's crap.

Well, back to my vorbis hacking. The patch only speeds up decoding, and it does 1/4 to 1/3 speedup on my Athlon, although I had to work around unaligned data and data layout that is not so vector friendly (one array per channel, so you have to interleave the data manually for the sound card). I put it on my home page, let's see how many people will find it useful. If you download it, please send me an email and tell me! I want to know! ;)
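For readers who have not met the layout problem: the decoder hands out one float array per channel, while the sound card wants the channels interleaved sample by sample. A plain C sketch of that conversion follows. The function names are made up, and the scaling and truncation are assumptions for illustration, not what libvorbis or my patch does.

```c
#include <stddef.h>
#include <stdint.h>

/* Convert a sample in [-1.0, 1.0] to signed 16 bit, clamping out-of-range input. */
static int16_t float_to_s16(float x) {
    float v = x * 32767.0f;
    if (v > 32767.0f) v = 32767.0f;
    if (v < -32768.0f) v = -32768.0f;
    return (int16_t)v; /* conversion truncates toward zero */
}

/* Interleave planar input pcm[ch][i] into out[i * channels + ch]. */
static void interleave_s16(int16_t *out, float *const *pcm,
                           int channels, size_t frames) {
    for (size_t i = 0; i < frames; i++)
        for (int ch = 0; ch < channels; ch++)
            out[i * (size_t)channels + (size_t)ch] = float_to_s16(pcm[ch][i]);
}
```

The inner loop reads from a different array for every output sample, which is exactly what makes this awkward to vectorize: a SIMD version has to load a vector from each channel and shuffle them together before storing.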

3 Feb 2003 (updated 3 Feb 2003 at 04:21 UTC) »

I was reading the documentation for the different x86 CPUs regarding my SIMD work, and I have started comparing the latencies. I was shocked just how bad the Pentium 4 is.

The Pentium 4 has really horrible latencies, in particular where it really matters: at the SIMD instructions. For example, movq has a latency of 6 on the Pentium 4 (VIA C3: 1, Pentium 3: 1, Athlon: 2). This means moving one MMX register to another, no memory stalls or AGIs included!

Now movq is pretty important because the x86 architecture uses source+destination addressing in the instructions, not source1+source2+destination as most RISC CPUs, so you need to copy data around all the time to get something done. This means that you have to find five other instructions to schedule between the movq and the next instruction to keep the pipeline filled. That is next to impossible in typical applications. There are not enough registers to unroll a typical loop four times.

No wonder the Pentium 4 performs so badly per clock cycle in comparison to, well, everything else on the market. I'll stick with my Athlon, thank you. Until I can buy a Hammer, that is ;-)

BTW: Since I mentioned it here, I put the 3dnow vorbis decoder diff on my home page, and 46 people downloaded it so far. Whee! Now I need to find the time to do it again in SSE, that should speed things up even more on my Athlon XP.

@nymia: you describe the code slave, not the coder. You describe someone who does it because he needs the money, not because he likes to do it. Don't just copy other people's code; in most cases it turns out those others didn't know what they were doing, either. Read their code and understand why they did it like that. Then you can copy from yourself. Only copy from others after you have completely understood what and why they were doing.

More SIMD hacking

SIMD hacking is starting to be a lot of fun. That and trying to find clever ways to avoid branches. I spent the last few days hacking vorbis to make it faster on my slow C3. Vorbis is completely float based, and the C3 has 3dnow, so this was a good opportunity to learn 3dnow. So far I converted the q&p loop in vorbis_lsp_to_curve, the overlap/add (only large/large) and copy sections of vorbis_synthesis_blockin, vorbis_apply_window, mdct_butterfly_generic and the 2-channel same-endianness segment of ov_read, and netted a 10% speed-up on my Athlon. The C3 has about the same speed-up, which is interesting since the C3's FPU is running at only half the CPU clock speed, so 3dnow should be a bigger gain. Maybe vorbis is limited by the RAM bandwidth and cache misses and not the FPU?
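As an example of the kind of branch avoidance I mean, here is a small sketch in plain C (the function names are made up, and this is not taken from the vorbis patch). It builds an all-ones or all-zero mask from a comparison and uses it to select between two values, which mirrors the compare, and, and-not, or sequence that SIMD code uses instead of if statements.

```c
#include <stdint.h>

/* Returns a if mask is all ones, b if mask is zero. */
static int32_t select32(int32_t mask, int32_t a, int32_t b) {
    return (a & mask) | (b & ~mask);
}

/* Clamp x to [lo, hi] without an if statement. */
static int32_t clamp32(int32_t x, int32_t lo, int32_t hi) {
    int32_t below = -(int32_t)(x < lo); /* all ones if x < lo, else 0 */
    x = select32(below, lo, x);
    int32_t above = -(int32_t)(x > hi); /* all ones if x > hi, else 0 */
    return select32(above, hi, x);
}
```

Whether the compiler actually emits branch-free code for this is up to the compiler; the point is the pattern, which maps directly onto SIMD compare and logic instructions.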

I have to say that the AMD documentation is much better than the Intel documentation, especially about SIMD. I'll look for some SSE docs on their web page, because the Intel SSE docs are even worse than their other docs. The Intel web site is less forthcoming and their documentation sucks in comparison. At least all the necessary documents can be downloaded for free and without registration (and as PDF and not winword) ;-)

Anyway, during my SIMD hacking I found that I need a good assembly level debugger, a good profiler (I'm using gcov right now, but it won't tell me where the time is spent, only what code is executed often -- close but no cigar; gcov is too imprecise, hrprof seems to get the timings wrong) and a good stall simulator. If someone knows a free tool where I can specify a given target CPU and run my code and it will then tell me which assembly instructions caused which stalls and for what reason, it would be most helpful. Something like this should be relatively easy to do for the bochs people. Or for valgrind, once someone adds MMX, SSE and 3dnow support to their JIT engine.

By the way: a big hooray for valgrind! What a great piece of software! If you develop software, use valgrind!

I wonder how much crypto bulk cipher performance could be gained using MMX and SSE. SSE is next on my list. Hacking code is great, every day you have the opportunity to learn new exciting things! ;-)

If you are new to SIMD and want to see how really skillful people do it, have a look at the ffmpeg sources. I find the altivec stuff (in libavcodec/ppc) particularly impressive.

12 Jan 2003 (updated 12 Jan 2003 at 02:54 UTC) »
SIMD hacking

Wow, what a great weekend so far! ;-)

I started looking at MMX, SSE and 3dnow, and I started hacking at ffmpeg. I actually found a few routines that were reasonably often called, small enough and candidates for SIMD, and wrote a translation. I learned a lot in the process, and now the patches have been integrated into ffmpeg. I will try to submit enough patches to be mentioned in the ChangeLog or in some comment, so I have something to show.

In the process, I found a great profiling package called hrprof. It basically uses a new gcc instrument-this-code option and the Pentium cycle counter to write profiling data usable by gprof but much more accurate. Great tool!
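For the curious, the mechanism looks roughly like this. The sketch assumes the gcc option in question is -finstrument-functions, which makes gcc call the two hooks below on every function entry and exit. This is not hrprof's code: the counters are made up, and the portable clock() stands in for the Pentium cycle counter.

```c
#include <time.h>

/* The hooks themselves must not be instrumented, or they would recurse. */
#define NOINSTR __attribute__((no_instrument_function))

static unsigned long calls; /* number of instrumented function entries */
static clock_t busy;        /* time spent inside outermost calls */
static clock_t started;     /* entry time of the current outermost call */
static int depth;           /* current call nesting depth */

/* Called by gcc-generated code on entry to every instrumented function. */
NOINSTR void __cyg_profile_func_enter(void *fn, void *call_site) {
    (void)fn;
    (void)call_site;
    calls++;
    if (depth++ == 0)
        started = clock();
}

/* Called by gcc-generated code just before an instrumented function returns. */
NOINSTR void __cyg_profile_func_exit(void *fn, void *call_site) {
    (void)fn;
    (void)call_site;
    if (--depth == 0)
        busy += clock() - started;
}
```

A real profiler would record the fn address per call so the data can be attributed to individual functions; this sketch only keeps totals.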

Using the profiler, I found that ffmpeg is spending much time in quant_psnr8x8_c, dct_sad8x8_c and dct_sad16x16_c (and a whole lot of already mmx/mmx2 optimized functions). Those will be my next targets, let's see how far I get. It's quite an experience using those SIMD instructions, so very different from normal SISD hacking. I find it quite rewarding so far.


The shipment with the Infineon DIMM arrived yesterday, and I put together my new EPIA-M box. I put together some PXE net-boot configuration and it now boots diskless. It's on mobile mode (several free-hanging parts connected with strange cables) and is virtually noise-less, although the CPU does have a fan.

Most of the hardware is supported. The network card, the USB, USB2 and Firewire controllers, the sound card... but not the graphics card. I can use mplayer and XFree86 in VESA mode, but that sucks very much, in particular since I want to use it as web clicking PC for my wife, and scrolling a large Mozilla window without hardware blitting is not what I had in mind. The graphics chipset is not even in the lspci database, and I found no data sheet or description of it. It is apparently called CastleRock and is some sort of S3 chip, but what do I know? This is very disappointing, in particular since this hardware was advertised as having "full Linux support". Well, it's not so full after all. If anyone here can help me, please contact me! I already looked at google and tried the current XFree86 snapshot.

Other than that, this is some great hardware! It is fast enough to play full-screen hq-divx movies with AC3 soundtrack at (according to vmstat) 25-40% CPU load. I got the faster one with the 933 MHz CPU, but I guess the slower one would have been sufficient.

10 Jan 2003 (updated 10 Jan 2003 at 01:07 UTC) »

I found out about gmane today. What a great site! I have been waiting for something like this for years! They basically offer mailing lists accessible via nntp, i.e. it's a Usenet news server, just without the usual Usenet newsgroups, instead it has mailing lists.

Why is that so great? Because I hate web interfaces to mailing lists. In particular I loathe bad ones like the one for ffmpeg on SourceForge. Gmane does not carry every mailing list, and it does not have all the articles for every list it carries, but it is possible to submit mailing list archives. To my great surprise, I found that they already include the mailing list for one of my projects, the diet libc! Wow, what an ego trip ;-)

Anyway, I mentioned this yesterday in my diary, but that entry was overwritten when I posted a second one. I started working on an event notification framework. I wrote it to learn about new platform-specific APIs like SIGIO and epoll on Linux. I am planning to add kqueue support as well, but haven't gotten around to it yet. I measured some 8000 HTTP connections per second on my desktop box with it. With HTTP keep-alive I even got over 30000 transactions per second in one benchmark.
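For those who have not seen epoll yet, here is a minimal, Linux-only sketch of its three calls. It is not code from my framework, and wait_readable is a made-up helper. It uses epoll_create1() as found in current glibc. A real event loop would create the epoll descriptor once and keep many descriptors registered, instead of setting one up per call as done here.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Wait until fd becomes readable or timeout_ms elapses.
 * Returns 1 if readable, 0 on timeout, -1 on error. */
static int wait_readable(int fd, int timeout_ms) {
    int ep = epoll_create1(0);
    if (ep < 0)
        return -1;

    /* Register interest in "readable" events for fd. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    if (epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev) < 0) {
        close(ep);
        return -1;
    }

    /* Block until an event arrives or the timeout expires. */
    struct epoll_event out;
    int n = epoll_wait(ep, &out, 1, timeout_ms);
    close(ep);
    if (n < 0)
        return -1;
    return (n == 1 && (out.events & EPOLLIN)) ? 1 : 0;
}
```

The attraction over select and poll is that the kernel keeps the interest set, so the cost per wait does not grow with the number of idle connections.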

The Economist has run a very interesting story. It's about some polls about how people see other countries. The USA approval in Europe has dropped dramatically, even in the UK only 50% of the population approve! Even more dramatic: Europe's approval of Israel is only 38%, that's way down there, almost as bad as Iraq. The strangest result is that 60% of the Europeans and 80% of the Americans want the EU to be a strong leader. Is that a call to rescue the Americans from their own government? Food for thought.

9 Jan 2003 (updated 9 Jan 2003 at 06:08 UTC) »

I wasted the better half of the day trying to get glibc to compile. It just wouldn't work for me. Nobody else appears to have problems with this.

The problem was that glibc's new ld.so checks whether the ELF run path is set. This is done for shared libraries using -Wl,-rpath, which you might have seen somewhere already. Or you can override it at run time with $LD_LIBRARY_PATH. Or you can specify a default at link time with $LD_RUN_PATH.

This I happen to do to make all those stupid GNOME applications find their libraries on my system, which I do not want clobbering my /usr/local/lib, so I put them somewhere under /opt.

How does ld.so check? With assert(). Apparently nobody ever tested this code, because assert calls __assert_fail, which calls some internal printf clone, which calls some conversion routine for numbers, and that one segfaults before anything is actually printed.

So ld.so segfaults before calling any syscall that could be observed with strace. D'oh!

There is little that makes me as happy as seeing people use your software. At a small local LAN party, I saw people casually using npush and npoll (from my ncp package), people whom I had never seen before and who didn't even know me. That was amazing.

And it's even more amazing to get bug reports that show people tried to do more with your code than you did yourself! The main problem with free software is that it is not very rewarding. Most emails are gripes about bugs or licensing issues, it's rare to have someone write you just to tell you that he likes your software. So getting a bug report about a detail of a library routine that is not exposed by the surrounding project is a very special gift, because it shows someone not only downloaded the code, he actually read the source code!

Anyway, I finally got an account on an ia64 machine, which allowed me to diagnose the problems the Debian build system reported on that platform. It turned out to be a bug in the start code, so this opportunity forced me to read about ia64 assembly language. I wonder which planet the designers of this architecture came from, and whether it even was in our galaxy.

My friend Öc is currently pursuing the holy grail of qmail: he patched qmail to add RCPT TO batching, and he is now working on a generic filter infrastructure, the lack of which currently makes integrating spam or virus scanners a big problem.

fnord has come around nicely. I just got an email from the GNU project about it, they want to include it in their directory. This also happened to the diet libc a while back and after that it really took off. I take this as a good omen ;) The diet libc has been ported to x86_64 and ia64 now, I think it is time to look at it from a security point of view now and try to get external people to audit the source code. It is getting used on servers more and more.

I haven't had much time for tinyldap lately, regrettably. Too bad as it still does not have ACLs or write capability.

On the other hand, I am thinking about writing an sshd that only supports protocol version 2. I don't have that warm and fuzzy feeling with openssh any more.

I decided to enter the Honeynet challenge. To maximize my own learning experience, I decided to do it the hard way: by not running the binary, not even in strace or a debugger. Just working on the disassembler output. The first part was easier than the last part. I actually thought I could finish this in 24 hours, but then I fell asleep on the keyboard ;-)

I am finished now and I think it was worth it. There is still plenty of time, so if you think you are up to it, go ahead and join the contest!
