Older blog entries for logic (starting at number 169)

20 Feb 2007 »

Linux on a BladeRunner system

So, you bought one of those shiny new Penguin Computing BladeRunner systems, and were thinking to yourself: "I wish there were a good guide to getting Linux to do what it should on these things". Well, I'm going to try and cover two of the basics here: serial console configuration, and interface bonding. Everything else is pretty much stock stuff, but I had a heck of a time figuring out configurations that worked here. The discussion below assumes you're running RHEL or Fedora, but the idea should be fairly clear.

First step: learn about some of the remote management features of the BladeRunner. You have a couple of configuration commands that are useful here, related to console management and power. Log into the chassis, and type conf; for some reason, they put this under "configuration" rather than "management" or some other similar tree of the command set. Now, you can control power to individual blades with server-blade power <num> [cycle|forced-off|off|on] (where <num> is the number of the blade you're managing). on means exactly what it sounds like: power on the blade if it's currently off, just as if you'd hit the power button on the front. off sends an ACPI power event to the blade, advising it to shut down; it doesn't actually force power to be pulled. That's what forced-off does. cycle removes power from the blade and then restores it; off the top of my head, I don't recall if it does so gracefully or not. Check the documentation, it might be there (but probably not).

The next command that's interesting here is server-blade console-redirect <num>: that grabs the serial console for that particular blade, and displays it to your current session. The serial console configuration appears to be set in stone, from what I can see: 57600 bps, 8 data bits, no parity, 1 stop bit, with hardware (RTS/CTS) flow control. If you power up a blade fresh from the factory, you'll notice that the BIOS is already set up to redirect output to both the serial port and the built-in KVM in the chassis.

That's about all I'll say about the chassis management side of things, since they actually do cover this stuff fairly well in the documentation. The problem I found with the documentation was that it didn't cover host-side configuration of certain things, specifically interface bonding and serial console configuration. Maybe they thought it was out-of-scope for their documentation; I hate to be the bearer of bad news, but the chassis is pretty useless without systems to run on it. ;-) (Here endeth the snide remarks, hopefully.)

Next up is host-side serial console configuration. This is fairly standard stuff, once you know what the settings need to be, but I'll spell it out here too. First is the bootloader; GRUB, in my case. You'll want to add a couple of lines to the GRUB configuration file (in the main stanza):

  serial --unit=0 --speed=57600 --word=8 --parity=no --stop=1
  terminal --timeout=10 serial console

Optionally, you can also add hiddenmenu to the options; that reduces the amount of text being displayed at boot time to a minimum, unless you need it. Set the timeout on the terminal line to whatever is appropriate for you; 10 seconds before you're kicked to the primary display was good enough for my needs, since I didn't want the boot process to take much longer than it already did.

Next up, also in the GRUB configuration file (on the command line of each kernel you want to boot), are the kernel parameters that get your boot output sent to the right place. Adding console=tty0 console=ttyS0,57600n8r to the kernel command line seemed to do the job for me (display on /dev/ttyS0 at 57600 bps, no parity, 8 data bits, hardware flow control). If this is a stock Red Hat/Fedora installation, I suggest getting rid of the rhgb and quiet entries on that line as well; for server use, you really want to see that console output, so that when you inevitably get a kernel that has a bad day, you don't have to monkey with GRUB configuration via a 57600 bps serial line to capture the panic.
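
If the console= syntax looks cryptic, here's a minimal Python sketch (mine, not anything from the vendor or the kernel source) that pulls the pieces apart. It assumes the usual serial console option layout: baud rate, then parity, then data bits, then an optional r for RTS flow control. The function names are made up for illustration.

```python
import re

# Options part of a serial console= argument, e.g. "57600n8r":
# baud rate, optional parity (n/o/e), optional data bits, optional "r" (RTS).
_OPTIONS = re.compile(
    r"^(?P<baud>\d+)(?P<parity>[noe])?(?P<bits>\d)?(?P<flow>r)?$"
)


def parse_console(arg):
    """Split one console= kernel argument into a dict of its parts."""
    if not arg.startswith("console="):
        raise ValueError("not a console= argument: %r" % arg)
    device, _, options = arg[len("console="):].partition(",")
    result = {"device": device}
    if options:
        match = _OPTIONS.match(options)
        if match is None:
            raise ValueError("unrecognized console options: %r" % options)
        result["baud"] = int(match.group("baud"))
        # Assumed defaults when a field is left out: no parity, 8 data bits.
        result["parity"] = match.group("parity") or "n"
        result["bits"] = int(match.group("bits") or 8)
        result["rts_flow_control"] = match.group("flow") is not None
    return result


def consoles_from_cmdline(cmdline):
    """Return the parsed console= arguments in the order they appear."""
    return [
        parse_console(token)
        for token in cmdline.split()
        if token.startswith("console=")
    ]
```

Feeding it the contents of /proc/cmdline after a reboot is a quick way to confirm the settings actually took. Order matters, too: as far as I know, the last console= listed is the one that ends up as /dev/console.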

The final step is making it possible to log in on the console. Adding the following line to /etc/inittab worked great for my needs:

  co:2345:respawn:/sbin/agetty -h ttyS0 57600 vt100

The co is just a name for the entry, but it's become a de facto standard for the inittab name of the serial console entry. 2345 should be obvious (runlevels to operate in), and respawn means run it again after you log out, not just once. The agetty command line just specifies hardware flow control (-h), Linux device (ttyS0), speed (57600), and terminal type (vt100); change terminal type to anything appropriate for your environment, but vt100 is pretty safe for most folks.
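
For anyone who hasn't stared at inittab in a while, here's a small Python sketch of how that line breaks down into its four colon-separated fields. It's only an illustration of the format; InittabEntry and parse_inittab_line are names I made up, not part of any real tool.

```python
from collections import namedtuple

# The four colon-separated fields of an inittab entry.
InittabEntry = namedtuple("InittabEntry", "id runlevels action process")


def parse_inittab_line(line):
    """Parse one inittab line; return None for blank lines and comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    # maxsplit=3 keeps any colons inside the command itself intact.
    parts = line.split(":", 3)
    if len(parts) != 4:
        raise ValueError("expected id:runlevels:action:process, got %r" % line)
    return InittabEntry(*parts)


entry = parse_inittab_line("co:2345:respawn:/sbin/agetty -h ttyS0 57600 vt100")
print(entry.id, entry.runlevels, entry.action)
print(entry.process.split())
```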

That covers serial console configuration. For more details on that piece of the puzzle, I strongly recommend reading over the Remote Serial Console HOWTO; it discusses some of the ins-and-outs of other boot loaders, some other configurations you might be interested in trying, and some general advice for this kind of setup.

Next up is interface bonding; every blade has two Broadcom NetXtreme BCM5780S Gigabit Ethernet NICs built in, each connected to one of the two management blades through an internal backplane. I've been using the tg3 drivers for these quite successfully; the bcm5820 module is also available as a third-party add-on, but appears to be deprecated at this point. So, your /etc/modules.conf or /etc/modprobe.conf (depending on version) will likely have a couple of lines in it like alias eth0 tg3 for each interface. Next up is bonding configuration: add the following lines:

  alias bond0 bonding
  options bonding mode=2 arp_interval=500 arp_ip_target=10.0.0.1

Change 10.0.0.1 to whatever your default gateway is. You might be asking yourself why you shouldn't use MII instead of ARP on these systems, and it's a good question. The failure mode of the chassis internal network precludes it; you can end up with one management switch down, but the MII status of the Ethernet interface still shows it being up. You can test this out by setting your bonding configuration to use MII instead, then yanking out one of the chassis management blades.

The next piece is very specific to RHEL or Fedora; please refer to your Linux vendor's documentation for details on how to do it elsewhere. In /etc/sysconfig/network-scripts, create a file called ifcfg-bond0 that looks exactly like a normal ifcfg-* file (in fact, you might want to just rename your existing ifcfg-eth0 or similar file to ifcfg-bond0 and change its DEVICE= line to match); specify how you want this combined interface to behave. Then, for each of eth0 and eth1, create an ifcfg-* file that looks like:

  DEVICE=eth0
  TYPE=Ethernet
  BOOTPROTO=none
  ONBOOT=yes
  MASTER=bond0
  SLAVE=yes

(Change DEVICE= to the appropriate line for each device.) More information on interface bonding under Linux is available from the OSDL Linux networking site.
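
If you have more than a couple of blades to set up, typing those files by hand gets old. Here's a rough Python sketch that stamps out the slave files from the template above. The function names are my own invention, and I'd point it at a scratch directory first and eyeball the results before copying anything into /etc/sysconfig/network-scripts.

```python
import os

# Same contents as the slave file shown above, with the device and
# master names left as placeholders.
SLAVE_TEMPLATE = """DEVICE={device}
TYPE=Ethernet
BOOTPROTO=none
ONBOOT=yes
MASTER={master}
SLAVE=yes
"""


def slave_configs(devices, master="bond0"):
    """Map each ifcfg-<device> file name to its contents."""
    return {
        "ifcfg-" + device: SLAVE_TEMPLATE.format(device=device, master=master)
        for device in devices
    }


def write_configs(configs, directory):
    """Write each generated file into the given directory."""
    for name, body in configs.items():
        with open(os.path.join(directory, name), "w") as handle:
            handle.write(body)


configs = slave_configs(["eth0", "eth1"])
print(configs["ifcfg-eth1"])
```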

That's the main stuff. Obviously, test all of these changes with a complete boot to make sure console redirection is happening the way you want it, and to be sure that your bonded interfaces are showing up as a single bond0 interface (and that you have connectivity to and from them). Good luck. :-)

Syndicated 2006-08-26 10:23:00 from esm

20 Feb 2007 »

Sysadmin Day

Have you hugged your systems administrator lately? ;-)

Syndicated 2006-07-27 14:27:00 from esm

20 Feb 2007 »

Production vs. Development

I've been doing UNIX administration for quite a few years now, and I know a few other people who would call this their profession. I've been noticing, though, a rather drastic difference in attitude toward systems and users in some of them, and I think I've finally nailed down where that difference might be coming from. Fair warning: as usual, this is written basically as a stream-of-consciousness kind of thing, so I may or may not actually get to the point.

You have a few different priorities as a systems administrator, and your job is to juggle them using your best judgement (weighing business needs, available resources, personal time, etc). One of these priorities is production support; keeping the systems that run your business online and performing whatever task it is that they are supposed to be doing (in the case of an Internet-facing business, for instance, this could mean keeping the web, app, mail, and other servers running; for a development shop, it could mean keeping the build systems working smoothly). Inevitably, though, there is another audience that systems administrators are asked to deal with: users. In some cases, that means developers; in others, it means the sales team. In either case, as a technical lead, your job is to help them get their job done. Some administrators get lucky: they only have to deal with one environment or the other. In most shops, however, the administrative staff shares responsibilities. This sounds fine, until you consider the psychology of the two situations.

Production support generally involves saying no a lot. Change is bad; slow, methodical improvements are fine over time, but radical departures from what is known to work are frowned upon. Taken to the logical extreme, you end up with a heavy-weight change control process where even small system changes are scrutinized heavily. And to those who hate such things (and I'm right there with you), this is a good thing. Production systems are differentiated from others by the fact that the business lives and dies by them functioning correctly. By extension, things that affect those systems ought to be tightly controlled and monitored.

But development (or, in some environments, user) systems are a completely different thing. Joel Spolsky has talked about this kind of thing at length: the systems are a tool to help the users do useful work. If a development group needs a machine or two for testing, then find a couple somewhere. If a user needs access to the Internet to research a sales pitch they're working on, give it to them. This doesn't mean giving them everything they ask for; this means finding out what they need to do their jobs, and working toward that in a collaborative manner. The systems administrator provides the role of technology expert and facilitator in this arena.

Take someone who has done production support for years, and give them a development group to support, and Bad Things Happen. The admin can't understand why on earth they want to do things like arbitrarily shuffle 50GB files around the network, use various outdated tools that only support older insecure protocols, or talk on IRC or an instant messenger with people outside the company. They lack perspective: they don't see how that style of systems management affects real people trying to get their job done.

I won't even get into the mess that can happen if you try the reverse (a development-only systems administrator in a production role).

I'm finding that there are very few administrators who can successfully balance both mindsets. As geeks, we seek out uniformity because it regularizes our workloads, but you simply can't reconcile the two: to be successful at both, you have to treat both as the unique environments that they are. I've talked before about administrative geeks needing to step beyond their job title, and this is part of it: working with a mindset of "value to company" rather than the micro-view of what you happen to be doing today.

Anyone can say "no" all the time.

Syndicated 2006-05-23 09:10:00 from esm

20 Feb 2007 »

Systems Administrators != Programmers

I had a good reminder today of what Erik Naggum was talking about when speaking with a co-worker. He was trying to come up with a "good" solution to a Perl problem, so I had him describe it to me. It turns out that he has an object, which may or may not contain the attribute he's looking for. If it doesn't have it, the object has an attribute referring to another object. This other object, in turn, may or may not have the attribute he's looking for, and also has a reference to another object; this repeats until the "next object" is null.

Astute readers will have recognized this as a basic linked list.

I figure, okay, he hasn't been programming in a long time, so we'll mentally walk him through it. First, a quick nudge to see if he can remember his basic CS from a long time ago: "Why don't you write something that will just recurse through that list?" He nods, but the eyes, they don't have it. He goes away to think about it for a while, and then we wind up on a conference call in a meeting room. While the person we're calling is talking, he's quickly jotting down on the board:

  O -> O -> O -> ...

I think, okay, he has the mental idea now, because he's diagramming it correctly. We talk a bit more, and he's obviously trying to come up with a way to iterate over the list. Once again, I suggest a simple recursive function (because that's natural to me, goddamnit, I don't care what you iterative programmers say), and quickly pseudocode something up for him:

  function blah(o):
    if !o:
      return null
    if o->whatImLookingFor:
      return o->whatImLookingFor
    return blah(o->nextObject)

  found = blah(myList)

He looks at it, and appears puzzled. This appears to be a new construct to him, and he asks a few questions that confirm this. "But it'll just keep calling itself and looking at the same object all the time, won't it?" My first pseudocode example unfortunately used similar variable names in both the "mainline" and the function itself, but after renaming things, he still didn't seem to understand. Ah! Local vs. global scoping rules, maybe? So I suggest that, in Perl, you'd have something like my ($o) = @_; as the first line of that function. He still doesn't get it, and grunts something about just dumping the objects and grepping for what he's looking for. He just couldn't make himself look at the "CS" solution, because it didn't fit what he was expecting from Perl: a quick hack that did what he wanted without having to think too hard about it.

Systems administrators who spend too much time with "qwiky" languages like Perl are doomed to forget everything they ever knew about Computer Science. I'm convinced of this now. You might think I had this conversation with a newbie programmer or someone who whipped up scripts now and then, but you'd be wrong; this is a fellow who did software development professionally, and moved into systems administration later in his career. He's no idiot.

(I liken the problem to "l337 sp33k". I've known colleagues who could carry on a perfectly professional conversation face-to-face, but the moment they sit down at a keyboard, they immediately regress to "OMG r u 4 r34l?!" and, setting aside perceptions, they actually behave dumber while doing this. Perl has a similar effect on programmers; you wind up with something that's a tangled web of spaghetti code and system() calls, getting the job done but disgusting anyone who has to look at it. Sort of like reading, "omg u pwn3d that bug, yo!". I wonder when someone will spec a programming language called "l337"...never mind, Google to the rescue. Ye ghods.)

Postscript: yeah, I probably should have given him an iterative version of the pseudocode. Something like this:

  function blah(o):
    while o:
      if o->whatImLookingFor:
        return o->whatImLookingFor
      o = o->nextObject
    return null

Just as simple, conceptually. Maybe some people have trouble wrapping their heads around recursion, but it's always seemed like a very straightforward idea to me.
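
For the record, here's roughly what both versions look like as real, runnable code. This is a Python sketch rather than the Perl he was working in, and the Node class with its value and next_node attributes is a stand-in I invented for his objects. One deliberate difference from the pseudocode: it tests against None instead of plain truthiness, so a found value of 0 or an empty string still counts.

```python
class Node:
    """Stand-in for his objects: an optional value plus a link to the next one."""

    def __init__(self, value=None, next_node=None):
        self.value = value
        self.next_node = next_node


def find_recursive(node):
    """Return the first value found in the chain, or None if there isn't one."""
    if node is None:
        return None
    if node.value is not None:
        return node.value
    # Each call gets its own "node", so this moves one link down the chain.
    return find_recursive(node.next_node)


def find_iterative(node):
    """Same lookup, walking the chain with a loop instead of recursion."""
    while node is not None:
        if node.value is not None:
            return node.value
        node = node.next_node
    return None


# Three objects; only the last one has what we're looking for.
chain = Node(None, Node(None, Node("found it")))
print(find_recursive(chain))
print(find_iterative(chain))
```

One practical point in favor of the loop: Python caps recursion depth (around a thousand frames by default), so the recursive version would fall over on a very long chain where the iterative one wouldn't care.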

Syndicated 2006-04-29 02:46:00 from esm

20 Feb 2007 »

isolatrbeta

Go away. :-) The funny thing? I'd use IMolatr.

Syndicated 2006-04-28 23:49:00 from esm

20 Feb 2007 »

Code Monkey

Code Monkey think maybe manager want to write goddamn login page himself.

Pure geeky comic gold from Jonathan Coulton's "thing-a-week". :-)

Syndicated 2006-04-24 15:22:00 from esm

20 Feb 2007 »

Migrated...

If you can read this (and in some parts of the 'net, you probably can't), then you're reading it through our new network connection via Comcast. After a pretty nasty false start over the weekend, and the cutoff date for Dataflo coming up this weekend, I bit the bullet and cut things over this morning. DNS was set up such that NS records were available for both address assignments for a while now, as well as MX records pointing at both address pools, so the only thing that's really off the air right now in some places is the webserver, and that'll just have to wait. I also managed to break my backup MX along the way, so I'll need to poke at that a bit more and see what I fat-fingered.

Bah. I like renumbering about as much as I like moving.

So, some initial observations: latency is a touch higher than it was with Dataflo, surprisingly enough. First hop averages went from 2-3 ms on the old wireless link to 8-10 ms (with occasional spikes) on Comcast. On the other hand, I'm now seeing download rates around 6Mbps (vs. 1.5Mbps), which is a nice change of pace. :-) Upstream speed is 768kbps now (vs. 1.5Mbps through Dataflo), which may prove to be a problem, but we'll see how it goes for now.

The biggest negative observation I've made so far is Comcast's complete lack of business-oriented customer service; they're obviously incredibly used to dealing with residential customers, and they've used that same contracted-out infrastructure to deal with their commercial customers as well. Phone support is excellent (when you finally get to them; getting a rep on the phone requires navigating about 10 levels of phone menus), but getting physical on-site service is a multi-day waiting game, with the typical "we'll be there sometime between 8:00 AM and 8:00 PM" kind of scheduling that you've come to expect from their residential service. There is no email interface to the helpdesk whatsoever; any kind of technical support inquiry MUST be made by phone, which is incredibly annoying when wanting to do something complicated with DNS. So far, I'm not impressed, but if I never have to deal with their customer service (much like when I was with Speakeasy, who rarely saw an unannounced service outage), then it won't matter to me. We'll see.

Syndicated 2006-04-18 09:53:00 from esm

20 Feb 2007 »

Dataflo or Datanoflo?

So, the day after I post a blog entry about being forced to switch to Comcast because of the lackluster service I've received with Dataflo, I have another rather significant service outage. There appeared to be problems last night, although I was having a hard time pinning the issue on Dataflo then. However, they definitely made a mid-day routing change today; at 12:34, all of a sudden, my backup MX starts receiving email for everything (the backup MX, in this case, is my router, which usually sees nothing but spam all day), and traceroutes to the static IP block they've assigned me are dropped on the floor the moment Cogent hands off to Dataflo.

Again.

It's not like I'm asking for a lot here. I need a service that stays up, and I'm paying a premium for that. That premium ought to include at least a bit of special attention to making sure the services provided to me are working when a major change is made, and perhaps major changes shouldn't be done in the middle of the workday, in the middle of the week, without any prior notification to the customer base of a potential outage. Screw the SLA; I shouldn't need an SLA to hold over their heads. What I really need is a service that stays up, with planned maintenance windows and some degree of customer notification.

I have no idea why I think Comcast will be any better, but at least they're half the price (actually, a third of the price, if you consider what my renewal contract pricing stated). In the event I get lousy service, at least I won't be crying about the money I'm wasting.

Syndicated 2006-04-04 14:52:00 from esm

20 Feb 2007 »

I'm Comcastic!

Ugh. I finally bit the bullet, and had Comcast "Workplace" service installed to replace the flakey fixed-point wireless service I've been fighting with for the past two years. To be fair, I should qualify that: my "last mile" link (between my rooftop transmitter and the tower) has been rock solid. The problem has always been mid-day routing and configuration changes on the part of the provider, which have knocked me offline anywhere from a few minutes now and then, to one incident that had me offline for the better part of a day. At this price and service level, that's completely unacceptable. So, since I can't get service from my preferred vendor here due to a complete lack of DSL availability in my area, I'm stuck with the local cable company: Comcast.

So, first impressions (note that I haven't moved ANY services to this yet, until I get a feel for how this will work): 6Mbps download speeds are unbelievably fun. :-) The 768kbps upload speed may end up being a bit of a hindrance; most of my traffic is SMTP, but I do receive a fair bit of HTTP traffic too, and that's where that upload bottleneck is going to suck (specifically, both I and Erica maintain pretty large photo galleries that seem to get a bit of traffic, and I'll often host larger items for people for limited times). But, we'll have no idea how that'll go until I bite the bullet and cut stuff over; I'm hoping I can run fairly well off of both links for a while, with plenty of time for DNS updates to propagate.

Ugh. I remember how much I hate renumbering now.

Syndicated 2006-04-03 11:29:00 from esm

20 Feb 2007 »

Fedora Core 5

My initial impressions: more solid than I thought this release was going to be, given how much has changed. My biggest complaint so far: they upgraded OpenSSL to a new major version without providing a compat release for those of us upgrading. All we needed was an equivalent to the already-existing openssl097a package, which does nothing but package up libssl.so.5 and libcrypto.so.5, and upgrading would have been smooth as silk for me on one machine. Instead, I'm stuck rebuilding packages that I should have been able to leave alone until later. Bah.

More later when I've had a little more time to play with it. Playing with it on a desktop machine is proving difficult, as the only "play" machine I have right now running Fedora Core hangs the console whenever I fire up X. (Not too surprising, that machine is a bit of a hodgepodge of hardware.)

Syndicated 2006-03-30 20:50:00 from esm
