Older blog entries for dkg (starting at number 56)

16 Mar 2012 »

Compromising webapps: a case study

This paper should be required reading for anyone developing, deploying, or administering web applications.

Syndicated 2012-03-16 01:42:00 from Weblogs for dkg

21 Nov 2011 »

Adobe leaves Linux AIR users vulnerable

A few months ago, Adobe announced a slew of vulnerabilities in its Flash Player, which is a critical component of Adobe AIR:

Adobe recommends users of Adobe AIR 2.6.19140 and earlier versions for Windows, Macintosh and Linux update to Adobe AIR 2.7.0.1948.
[...]
June 14, 2011 - Bulletin updated with information on Adobe AIR

However, looking at Adobe's instructions for installing AIR on "Linux" systems, we see that it is impossible for people running a free desktop OS to follow Adobe's own recommendations:

Beginning June 14 2011, Adobe AIR is no longer supported for desktop Linux distributions. Users can install and run AIR 2.6 and earlier applications but can't install or update to AIR 2.7. The last version to support desktop Linux distributions is AIR 2.6.

So on the exact same day, Adobe said "we recommend you upgrade, as the version you are using is vulnerable" and "we offer you no way to upgrade".

I'm left with the conclusion that Adobe's aggregate corporate message is "users of desktops based on free software should immediately uninstall AIR and stop using it".

If Adobe's software was free, and they had a community around it, they could turn over support to the community if they found it too burdensome. Instead, once again, users of proprietary tools on free systems get screwed by the proprietary vendor.

And they wonder why we tend to be less likely to install their tools?

Application developers should avoid targeting AIR as a platform if they want to reach everyone.

Tags: adobe, proprietary software, security

Syndicated 2011-11-21 17:41:00 from Weblogs for dkg

21 Jun 2011 »

unreproducible buildd test suite failures

I've been getting strange failures on some architectures for xdotool. xdotool is a library and a command-line tool to allow you to inject events into an existing X11 session. I'm trying to understand (or even to reproduce) these errors so i can fix them.

The upstream project ships an extensive test suite; this test suite is failing on three architectures: ia64, armel, and mipsel; it passes fine on the other architectures (the hurd-i386 failure is unrelated, and i know how to fix it). The suite is failing on some "typing" tests -- some symbols "typed" are getting dropped on the failing architectures -- but it is not failing in a repeatable fashion. You can see two attempted armel builds failing with different outputs:

The first failure shows [ and occasionally < failing under a us,se keymap (that is, after the test-suite's invocation of setxkbmap -option grp:switch,grp:shifts_toggle us,se):

Running test_typing.rb
Setting up keymap on new server as us
Loaded suite test_typing
Started
...........F..
Finished in 19.554214 seconds.

  1) Failure:
test_us_se_symbol_typing(XdotoolTypingTests)
    [test_typing.rb:58:in `_test_typing'
     test_typing.rb:78:in `test_us_se_symbol_typing']:
<"`12345678990-=~ !@\#$%^&*()_+[]{}|;:\",./<>?:\",./<>?"> expected but was
<"`12345678990-=~ !@\#$%^&*()_+]{}|;:\",./>?:\",./<>?">.

14 tests, 14 assertions, 1 failures, 0 errors

The second failure, on the same buildd, a day later, shows no failures under us,se, but several failures under other keymaps:

Running test_typing.rb
Setting up keymap on new server as us
Loaded suite test_typing
Started
..F.F.F.......
Finished in 16.784192 seconds.

  1) Failure:
test_de_symbol_typing(XdotoolTypingTests)
    [test_typing.rb:58:in `_test_typing'
     test_typing.rb:118:in `test_de_symbol_typing']:
<"`12345678990-=~ !@\#$%^&*()_+[]{}|;:\",./<>?:\",./<>?"> expected but was
<"`12345678990-=~ !@\#$%^&*()_+]{}|;:\",./<>?:\",./<>?">.

  2) Failure:
test_se_symbol_typing(XdotoolTypingTests)
    [test_typing.rb:58:in `_test_typing'
     test_typing.rb:108:in `test_se_symbol_typing']:
<"`12345678990-=~ !@\#$%^&*()_+[]{}|;:\",./<>?:\",./<>?"> expected but was
<"`12345678990-=~ !@\#$%^&*()_+[]{|;:\",./<>?:\",./<>?">.

  3) Failure:
test_se_us_symbol_typing(XdotoolTypingTests)
    [test_typing.rb:58:in `_test_typing'
     test_typing.rb:88:in `test_se_us_symbol_typing']:
<"`12345678990-=~ !@\#$%^&*()_+[]{}|;:\",./<>?:\",./<>?"> expected but was
<"`12345678990-=~ !@\#$%^&*()_+{}|;:\",./>?:\",./<>?">.

14 tests, 14 assertions, 3 failures, 0 errors

I've tried to reproduce on a cowbuilder instance on my own armel machine; I could not reproduce the problem -- the test suites pass for me.

I've asked for help on the various buildd lists, and from upstream; no one resolutions have been proposed yet. I'd be grateful for any suggestions or hints of things i might want to look for. It would be a win if i could just reproduce the errors.

Of course, one approach would be to disable the test suite as part of the build process, but it has already helped catch a number of other issues with the upstream source. It would be a shame to lose those benefits.

Any thoughts?

Syndicated 2011-06-21 15:05:00 from Weblogs for dkg

10 May 2011 »

the bleeding edge: btrfs (poor performance, alas)

I'm playing with btrfs to get a feel for what's coming up in linux filesystems. To be daring, i've configured a test machine using only btrfs for its on-disk filesystems. I really like some of the ideas put forward in the btrfs design. (i'm aware that btrfs is considered experimental-only at this point).

I'm happy to report that despite several weeks of regular upgrade/churn from unstable and experimental, i have yet to see any data loss or other serious forms of failure.

Unfortunately, i'm not impressed with the performance. The machine feels sluggish in this configuratiyon, compared to how i remember it running with previous non-btrfs installations. So i ran some benchmarks. The results don't look good for btrfs in its present incarnation.

The simplified test system i'm running has Linux kernel 2.6.39-rc6-686-pae (from experimental), 1GiB of RAM (no swap), and a single 2GHz P4 CPU. It has one parallel ATA hard disk (WDC WD400EB-00CPF0), with two primary partitions (one btrfs and one ext3). The root filesystem is btrfs. The ext3 filesystem is mounted at /mnt

I used bonnie++ to benchmark the ext3 filesystem against the btrfs filesystem as a non-privileged user.

Here are the results on the test ext3 filesystem:

consoleuser@loki:~$ cat bonnie-stats.ext3 
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
loki          2264M   331  98 23464  11 10988   4  1174  85 39629   6 130.4   5
Latency             92041us    1128ms    1835ms     166ms     308ms    6549ms
Version  1.96       ------Sequential Create------ --------Random Create--------
loki                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  9964  26 +++++ +++ 13035  26 11089  27 +++++ +++ 11888  24
Latency             17882us    1418us    1929us     489us      51us     650us
1.96,1.96,loki,1,1305039600,2264M,,331,98,23464,11,10988,4,1174,85,39629,6,130.4,5,16,,,,,9964,26,+++++,+++,13035,26,11089,27,+++++,+++,11888,24,92041us,1128ms,1835ms,166ms,308ms,6549ms,17882us,1418us,1929us,489us,51us,650us
consoleuser@loki:~$

And here are the results for btrfs (on the main filesystem):

consoleuser@loki:~$ cat bonnie-stats.btrfs 
Reading a byte at a time...done
Reading intelligently...done
start 'em...done...done...done...done...done...
Create files in sequential order...done.
Stat files in sequential order...done.
Delete files in sequential order...done.
Create files in random order...done.
Stat files in random order...done.
Delete files in random order...done.
Version  1.96       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
loki          2264M    43  99 22682  17 10356   6  1038  79 28796   6  86.8  99
Latency               293ms     727ms    1222ms   46541us     504ms   13094ms
Version  1.96       ------Sequential Create------ --------Random Create--------
loki                -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  1623  33 +++++ +++  2182  57  1974  27 +++++ +++  1907  44
Latency             78474us    6839us    8791us    1746us      66us   64034us
1.96,1.96,loki,1,1305040411,2264M,,43,99,22682,17,10356,6,1038,79,28796,6,86.8,99,16,,,,,1623,33,+++++,+++,2182,57,1974,27,+++++,+++,1907,44,293ms,727ms,1222ms,46541us,504ms,13094ms,78474us,6839us,8791us,1746us,66us,64034us
consoleuser@loki:~$

As you can see, btrfs is significantly slower in several categories:

writing character-at-a-time is *much* slower: 43K/sec vs. 331K/sec
reading block-at-a-time is slower: 28796K/sec vs. 39629K/sec
all forms of file creation and deletion are nearly an order of magnitude slower
Random seeks are almost as fast, but they swamp the CPU

I'm hoping that i just configured the test wrong somehow, or that i've done something grossly unfair in the system setup and configuration. (or maybe i'm mis-reading the bonnie++ output?) Maybe someone can point out my mistake, or give me pointers for what to do to try to speed up btrfs.

I like the sound of the features we will eventually get from btrfs, but these performance figures seem like a pretty rough tradeoff.

Tags: benchmarks, bonnie, btrfs

Syndicated 2011-05-10 18:17:00 from Weblogs for dkg

6 May 2011 »

Please use unambiguous tag names in your DVCS

One of the nice features of a distributed version control system (DVCS) like git is the ability to tag specific states of your project, and to cryptographically sign your tags.

Many projects use simple tag names with the version string like "0.35". This is a plea to ensure that the tags you make explicitly reference the project you're working on. For example, if you are releasing version 0.35 of project Foo, please make your tag "foo_0.35" for security and for future disambiguation.

There is more than one reason to care about unambiguous tags. I'll give two reasons below, but they come from the same fundamental observation: All git repositories are, in some sense, the same git repository; some just store different commits than others.

it's entirely possible to merge two disjoint git repositories, and to have two unassociated threads of development in the same repo. you can also merge threads of developement later if the projects converge.

Avoid tag replay attacks

Let's assume Alice works on two projects, Foo and Bar. She wraps up work on a new version of Foo, and creates and signs a simple tag "0.32". She publishes this tag to the Foo project's public git repo.

Bob is trying to attack the Bar project, which is currently at version 0.31. Bob can actually merge Alice's work on Foo into the Bar repository, including Alice's new tag.

Now looks like there is a tag for version 0.32 of project Bar, and it has been cryptographically-signed by Alice, a known active developer!

If she had named her tag "foo_0.32" (and if all Bar tags were of the form "bar_X.XX"), it would be clear that this tag did not belong to the Bar project.

Be able to merge projects and keep full history

Consider two related projects with separate development histories that later decide to merge (e.g. a library and its most important downstream application). If they merge their git repos, but both projects have a tag "0.1", then one tag must be removed to make room for the other.

If all tags were unambiguously named, the two repos could be merged cleanly without discarding or rewriting history.

I noticed this because of a general principle i try to keep in mind: when making a cryptographic signature, ensure that the thing you are signing is context-independent -- that is, that it cannot be easily misinterpreted when placed in a different context. For example, do not sign e-mails that just say "I approve" -- say specifically what you are approving of in the body of the signed mail. Otherwise, someone could re-send your "I approve" e-mail In-Reply-To a topic that you do not really approve of.

By extension, signing a simple tag is like saying "the source tree in this specific state (and with this specific history) is version 0.3". A close inspection of the state and the history by a sensitive/intelligent human skilled in the art of looking at source code can probably figure out what the project is from its contents. But it's much less ambiguous to say "the source tree in this specific state (and with this specific history) is version 0.3 of project Foo".

Once you start using unambiguous tag names, you make it safe for people to set up simple tools that do automated scans of your repository and can take action when a new signed tag appears. And you respect (and help preserve) the history of any future project that gets merged with yours.

Tags: git, security

Syndicated 2011-05-06 21:05:00 from Weblogs for dkg

4 May 2011 »

test-driven development: refactoring, difficulties

liw recently wrote about test-driven development:

This all sounds like a lot of bureaucratic nonsense, but what I get out of this is this: once all tests pass, I have a strong confidence that the software works. As soon as I've added all the features I want to have for a release, I can push a button to push it out

To my mind, this is only one of the benefits. He doesn't describe another major benefit, which is the confidence with which you can take on re-factoring in projects with well-developed test infrastructure.

If your project has no test infrastructure at all, and you make a deep and/or potentially invasive change, you might well produce something that is heinously broken for users who have a different pattern of using the tool than you do.

But if you have a well-developed, reasonable test suite with fairly wide coverage, you can make a deep or invasive change, and be confident that -- if the tests all pass -- the stuff you release isn't going to be too horrific. And if you do break something with a change which the test suite didn't cover, that's an indication that the test suite is lacking. Hopefully, you can factor out a problem report into its own test, so that future changes will ensure that the behavior doesn't regress too.

The upshot of more-confident re-factoring is that your development can be much bolder, you can roll out new features more quickly, and you can spend less time agonizing about whether you've got the various abstraction layers exactly right the first time through. These are all good things (though i do think the agony of abstraction perfectionism is well-warranted in some contexts, like API definitions, and wouldn't want test-driven development to make people give up on that necessary task).

when testing can't cover everything

Some things are just hard to test well. for example, if you have 20 different boolean options to a command-line tool, you can't realistically test them in all combinations and permutations; that would be over a million tests.

User experience is also notoriously difficult to test, as are tools that rely heavily on network interaction, which can be flakey and unpredictable.

other downsides

Test suites themselves require maintenance, as the components they rely on can change over time. More than once, i've had a test suite failure that was really a failure of the test infrastructure, not the code itself. But in those same projects, that's usually followed or preceded by a test suite failure that picks out a particularly nasty or subtle bug in the tested code that might have persisted for quite a while unnoticed. So the test suite does in some sense create more work all around.

more and better testing is good for Debian

Even when we know that coverage isn't perfect, and even with its additional overhead, well-integrated tests (at the unit level and more generally) are worth the tradeoffs. It's worth doing because of the robustness and guarantees that we can give each other with regularly-tested code. It's a more stable foundation which surprisingly also gives us more flexibility going forward. This is good for free software in general. It also helps us find our bugs before our users do, so it's better for our users. So more and better testing directly supports the two main priorities outlined in the Debian Social Contract.

Syndicated 2011-05-04 23:11:00 from Weblogs for dkg

28 Apr 2011 »

USAA Deposit@Home: bad engineering and terrible UX

I use USAA for some of my finances. They specialize in remote banking (i've never been in a physical branch). Sadly, they can still be pretty clueless about how to use the web properly. My latest frustration with them was trying to use their Deposit@Home service, where you scan your checks and upload them.

No problem, right? I've got a scanner and a web browser and i know how to use them both. Ha ha. Upon first connecting, i'm rejected, and i find the absurd System Requirements -- only Windows and Mac, and only certain versions of certain browsers. You also need Sun's Java plugin, apparently.

Deliberately naïve, i call their helpdesk and ask them if they could just give me a link to let me upload my scanned checks. They tell me that they want 200dpi images, and then give an absurd runaround that includes references to the Gramm-Leach-Bliley Act as the reason they need to limit the system to Windows or Mac, and that security is the reason they need to control the scanner directly (apparently your browser can control your local scanner on Windows? yikes). But they let slip that Mac users don't have the scanner controlled directly, and can just upload images (apparently the federal law doesn't cover them or something). Preposterous silliness.

The Workaround

Of course, it turns out you can get it working on Debian GNU/Linux, mainly by telling them what they want to hear ("yes sir, i'm running Mac OS"), but you'll also have to run Sun's Java (non-free software) to do it, since their Java uploader fails with nasty errors when using icedtea6-plugin.

I set up a dedicated system user for this goofiness, since i'm going to be running apparently untrustworthy applets on top of non-free software. I run a separate instance of iceweasel as that user; all configuration described is for that user's homedir. If you do this yourself, You'll need to decide if you want the same level of isolation for yourself.

So i have the choice of installing sun-java6-plugin from non-free and having the plugin installed for all web browsers, or just doing a per-user install of java for my dedicated user (and avoiding the plugin for my normal user). I opted for the latter. As the dedicated user, I fetched the self-extracting variant from java.com, unpacked it, and added it to the iceweasel plugins:

chmod a+x ~/Download/jre-6u25-linux-i586.bin
mkdir -p ~/lib ~/.mozilla/plugins
cd ~/lib
~/Download/jre-6u25-linux-i586.bin
ln -s ~/lib/jre*/lib/*/libnpjp2.so ~/.mozilla/plugins/

Then i closed iceweasel and restarted it. In the relaunched iceweasel sesson, I told Iceweasel 4 to pretend that it was actually Firefox 3.6 on a Mac. I did this by going to about:config (checking the box that says i know what i'm doing), right-clicking, and choosing "new >> string". The new variable name is general.useragent.override and i found that setting it to the following (insane, false) value worked for me:

Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.9.2.3) Gecko/20060426 Firefox/3.5.18

Note that this configuration might make hotmail think that you are a mobile device. :P If you try to use this browser profile for anything other than visiting USAA, you might want to remove this setting, or install xul-ext-useragentswitcher to be able to set it more easily than using about:config.

Once these changes were made, I was able to log into USAA, use the Deposit@Home service to upload my 200dpi grayscale scans. I guess they think i'm a Mac user now.

The Followup

After completing the upload, I wrote them this review (i doubt they'll post it):

The Deposit@Home service has great potential. Unfortunately, it currently appears to be overengineered and unnecessarily restrictive.
The service requires two scans (front and back) of the deposited check, at 200dpi, in grayscale or black-and-white, in jpeg format, reasonably-cropped. The simplest way to do this would be to show some examples of good scans and bad scans, and provide two file upload forms.
Once the user uploaded their images, the web site could run its own verification, and display them back for the user to confirm, optionally using a simple javascript-based image-cropper if any image seems wrong-sized. This would work fine with any reasonable browser on any OS.
Instead, Deposit@Home requires the user to present a User-Agent header claiming to be from specific versions of Mac or Windows, running certain (older) versions of certain browsers, and requires the use of Sun's Java plugin.
Entirely unnecessary system requirements to do a simple task. Disappointing. :(

Acknowledgements

I found good background for this approach on the ubuntu forums.

The Takeaway

I continue to be frustrated and annoyed by organizations that haven't yet embraced the benefits of the open web. Clearly, USAA has spent a lot of money engineering what they think is a certain experience for their users. However, they didn't design with standard web browsers in mind, so they appear to have boxed themselves into a corner where they think they have to test and approve of the entire software stack running on the end-user's machine.

This is not only foolish -- it's impossible. When you're designing a web-based application, just design it for the web. Keep it simple. If you want to offer some snazzy java-hooked-into-your-scanner insanity, i will only have a mild objection: it seems like a waste of time and engineering effort. My objection is much stronger if your snazzy/incompatible absurdity is the only way to use your service. A simple, web-based, browser-agnostic interface should be available to all your clients.

Even more aggravating is the claim that they don't think they should engineer for everyone. I was told during the runaround that they would only support Linux if 4% of their users were using Linux (which they don't think is the case at the moment -- if you are a USAA customer, and you use something other than Mac and Windows, please tell them so). I tried to tell them that I wasn't asking for Linux support; i was asking for everyone support. If you just use generic engineering in the first place, there's no extra expense for special-casing other users. But they couldn't understand.

And now, since i'll need to lie to them in my User Agent string every time i want to deposit a check online, those visits won't even show up in their logs to be counted. "Our web site deliberately disables itself for $foo users; we haven't written it for them; but that's OK, we don't have any $foo users anyway" is a nasty self-fulfilling prophecy. Why would you do that?

Syndicated 2011-04-28 06:59:00 from Weblogs for dkg

27 Apr 2011 »

The bleeding edge: systemd

Curious about these shiny new things i keep hearing about, i set up a test desktop system using using systemd as the init system (yes, that means using systemd-sysv in place of sysvinit -- so i had to remove an essential package for this to work).

A system-crippling bug was naturally the result of living on the bleeding edge, but fortunately it was resolved with a trivial patch.

In all, i'm pretty happy with some of the things that systemd seems to get right. For example, native supervision of daemon processes, clean process states, elimination of initscript copy-paste-isms, and socket-based service initiation are all obvious, marked improvements over the existing sysvinit legacy cruft.

But i'm a bit concerned about other parts. For example, all the above-mentioned features fit really well within a tightly-tuned, well-managed server. But systemd also appears to rely heavily on complex userland systems like dbus and policykit that would be much more at home on a typical desktop machine. I've never seen a well-managed server installation that warranted either policykit or dbus. Also, given the bug i ran into -- when PID 1 aborts due to a silly assertion, your system is well-and-truly horked. Shouldn't a lot more attention to detail be happening? I'd think that a "recover gracefully from failed assertions" approach would be the first thing to target for a would-be PID 1.

I'm also concerned about the Linux-centricism of systemd. I understand that features like cgroups and reliance on the spiffiness of inotify are part of the appeal, but i also really don't want to see us stuck with only One Single Kernel as an option. The kFreeBSD folks (and the HURD folks) have done a lot of work to get us close to having some level of choice at this critical layer of infrastructure. It'd be good to see that possibility realized, to help us avoid the creeping monoculture. I worry that systemd's over-reliance on Linux-specific features is heading in the wrong direction.

So my question is: why is this all being presented as a package deal? I'd be pretty happy if i could get just the "server-side" features without incurring the dbus/policykit/etc bloat. I already run servers with runit as pid 1 -- they're lean and quite nice, but runit doesn't have systemd's socket-based initiation (or the level of heavyweight backing that systemd seems to have picked up, for that matter).

I understand that Lennart is resistant to UNIX's traditional "do one thing; do it well" philosophy. I can understand some of his reasoning, but i think he might be doing his work and his tools a disservice by taking it this far. Wouldn't systemd be better if it was clearer how to take advantage of parts of it without having to subscribe to the entire thing?

Of course, i might be missing some nice ways that systemd can be effectively tuned and pared down. But i've read his blog posts about systemd and i haven't seen how to get some of the nice features without the parts i don't want. I'd love to be pointed to some explanations that show me how i'm wrong :)

Syndicated 2011-04-27 08:07:00 from Weblogs for dkg

8 Apr 2011 »

EDAC i5000 non-fatal errors

I've got a Debian GNU/Linux lenny installation (2.6.26-2-vserver-amd64 kernel) running on a Dell Poweredge 2950 with BIOS 2.0.1 (2007-10-27).

It has two Intel(R) Xeon(R) CPU 5160 @ 3.00GHz processors (according to /proc/cpuinfo, 8 1GiB 667MHz DDR2 ECC modules (part number HYMP512F72CP8N3-Y5), according to dmidecode, and an Intel Corporation 5000X Chipset Memory Controller Hub (rev 12) according to lspci.

The machine has been running stably for many months.

On the morning of March 31st, i started getting the following messages from the kernel, on the order of one pair of lines every 3 seconds:

Mar 31 07:04:38 zamboni kernel: [16883514.141275] EDAC i5000 MC0: NON-FATAL ERRORS Found!!! 1st NON-FATAL Err Reg= 0x800
Mar 31 07:04:38 zamboni kernel: [16883514.141278] EDAC i5000: 	NON-Retry  Errors, bits= 0x800

A bit of digging turned up a redhat bug report that seems to suggest that these warnings are just noise, and should be ignorable. Another link thinks it's a conflict with IPMI, though i don't think this model actually has an IPMI subsystem.

However, i also notice from munin logs that at the same time the error messages started, the machine exhibited a marked change in CPU activity (including in-kernel activity) and local timer interrupts: [Individual interrupts - by month]

[CPU Usage - by month]

I also note that more rescheduling interrupts started happening, and fewer megasas interrupts at about the same time. I'm not sure what this means.

A review of other logs and graphs on the system turns up no other evidence of interaction that might cause this kind of elevated activity.

One thought was that the elevated activity was just due to writing out a bunch more logs. So i tried removing the i5000_edac module just to keep dmesg and /var/log/kern.log cleaner. Leaving that turned off doesn't lower the CPU utilization or change the interrupts, though.

Any suggestions on what might be going on, or further diagnostics i should run? The machine is in production, and I'd really rather not take down the machine for an extended period of time to do a lengthy memory test. But i also don't want to see this kind of extra CPU usage (more than double the machine's baseline).

Tags: i5000_edac, troubleshooting

Syndicated 2011-04-09 00:00:00 from Weblogs for dkg

5 Apr 2011 »

Debian on Thecus N8800Pro

I recently set up debian on a Thecus N8800Pro. It seems to be a decent rackmount 2U chassis with 8 3.5" hot-swap SATA drives, dual gigabit NICs, a standard DB9 RS-232 serial port, one eSATA port, a bunch of USB 2.0 connectors, and dual power supplies. I'm happy to be able to report that Thecus appears to be attempting to fulfill their obligations under the GPL with this model (i'm fetching their latest source tarball now to have a look).

Internally, it's an Intel Core 2 Duo processor, two DIMMS of DDR RAM, and a PCIe slot for extensions if you like. It has two internally-attached 128MiB SSDs that it wants to boot off of.

The downsides as i see them:

BIOS

The BIOS only runs over VGA, which is not easy to access. I wish it ran over the serial port. It's also not clear how one would find an upgrade to the BIOS. dmidecode reports the BIOS as:

        Vendor: Phoenix Technologies, LTD
        Version: 6.00 PG
        Release Date: 02/24/2009

RAM

The mainboard has only two DIMM sockets. It came pre-populated with two 2GiB DDR DIMMs. memtest86+ reports those DIMMs clearly, but sees only 3070MiB in aggregate. The linux kernel also sees ~3GiB of RAM, despite dmidecode and lshw reporting that the two DIMM modules each are only 1GiB in size. I'm assuming this is a bios problem. Anyone know how to get access to the full 4GiB? Should i be reporting bugs on any of these packages?

SSDs

The internal SSDs are absurdly small and very slow for both reading and writing. This isn't a big deal because i'm basically only using them for the bootloader, kernel, and initramfs. But i'm surprised that they could even find 128MiB devices in these days of $10 1GiB flash units.

processor

/proc/cpuinfo reports two cores of a Intel(R) Core(TM)2 Duo CPU T5500 @ 1.66GHz, each with ~3333 bogomips, and apparently without hardware virtualization support. It's not clear to me whether the lack of virtualization support is something that could be fixed with a BIOS upgrade.

I used Andy Millar's nice description of how to get access to the BIOS -- I found i didn't need a hacksaw to modify the VGA cable i had. just pliers to remove the outer metal shield on the side of the connector i plugged into the open socket.

lshw sees four SiI 3132 Serial ATA Raid II Controller PCI devices, each of which appears to support two SATA ports. I'm currently using only four of the 8 SATA bays, so i've tried to distribute the disks so that each of them is attached to an independent PCI device (instead of having two disks on one PCI device, and another PCI device idle). I don't know if this makes a difference, though, in the RAID10 configuration i'm using.

There's also a neat 2-line character-based LCD display on the front panel which apparently can be driven by talking to a serial port, though i haven't tried. Apparently there are also LEDs that might be controllable directly from the kernel via ICH7 and GPIO, but i haven't tried that yet either.

Any suggestions on how to think about or proceed with a BIOS upgrade to get access to the full 4GiB of RAM, BIOS-over-serial, and/or enable hardware virtualization?

Tags: n8800pro, thecus

Syndicated 2011-04-05 20:07:00 from Weblogs for dkg

47 older entries...