Older blog entries for lukeg (starting at number 15)


I've done a small kernel hack, and it doesn't work. I suspect the problem is some deep kernel magic ("blame the compiler" :-)), so I'm hoping that posting it here will lead to some kindly kernel hacker spotting the trouble.

The idea is to make an "emergency rescue" feature for a thrashing system. Recently my desktop machine was thrashing so badly that I couldn't rsh into it, couldn't kill the X server with C-M-Backspace, and basically just couldn't get it working again. It did, however, have perfectly good ping times during the thrashing. I assume that the kernel was healthy but all userspace processes were swapped into oblivion, and I ended up cycling the power to recover.

The attempted solution is a kernel-space program that simply kills the process using the most physical memory. It is implemented as an iptables target, so that it can be triggered remotely by sending a magic packet that matches some special iptables rule. Fun, huh?

Here's the code for the iptables target, which freezes the whole machine when I try to use it. See the problem?

```c
/* -*- linux-c -*-
 * This is a module for recovering a system that is thrashing. When
 * triggered by a matched packet, it will kill the task that is using
 * the most physical memory (RSS). Not too subtle, but hopefully it
 * beats hitting the power button when userspace is totally thrashed
 * out of operation.
 * -- luke@bluetail.com
 */
#include <linux/sched.h>
#include <linux/module.h>
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/spinlock.h>
#include <net/icmp.h>
#include <net/udp.h>
#include <net/tcp.h>
#include <linux/netfilter_ipv4/ip_tables.h>

struct in_device;
#include <net/route.h>

static unsigned int
ipt_killfatso_target(struct sk_buff **pskb, unsigned int hooknum,
		     const struct net_device *in, const struct net_device *out,
		     const void *targinfo, void *userinfo)
{
	struct task_struct *p, *fatso = NULL;
	int fatso_rss;

	/* Find the "fatso" process -- the one with the most RSS */
	read_lock(&tasklist_lock);
	for_each_task(p) {
		spin_lock(&p->mm->page_table_lock);
		if (fatso == NULL || (p->mm->rss > fatso_rss)) {
			fatso = p;
			fatso_rss = p->mm->rss;
		}
		spin_unlock(&p->mm->page_table_lock);
	}
	/* And kill him.. */
	if (fatso != NULL) { /* presumably there is some process.. */
		printk(KERN_NOTICE "killing fatso: %d\n", fatso->pid);
		force_sig(SIGKILL, fatso);
	}
	/* Unlock the task list, *after* sending the signal - this seems
	 * to be important. */
	read_unlock(&tasklist_lock);
	return IPT_CONTINUE;
}

static int
ipt_killfatso_checkentry(const char *tablename, const struct ipt_entry *e,
			 void *targinfo, unsigned int targinfosize,
			 unsigned int hook_mask)
{
	return 1;
}

static struct ipt_target ipt_killfatso_reg = {
	{ NULL, NULL }, "KILLFATSO", ipt_killfatso_target,
	ipt_killfatso_checkentry, NULL, THIS_MODULE
};

static int __init init(void)
{
	if (ipt_register_target(&ipt_killfatso_reg))
		return -EINVAL;
	return 0;
}

static void __exit fini(void)
{
	ipt_unregister_target(&ipt_killfatso_reg);
}

module_init(init);
module_exit(fini);
/* MODULE_LICENSE("GPL"); */
```
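A guess at the problem, not a confirmed diagnosis: in 2.4 kernels for_each_task also walks kernel threads such as kswapd, and those have p->mm == NULL, so spin_lock(&p->mm->page_table_lock) dereferences a NULL pointer. An iptables target runs in softirq context, where an oops typically ends in a panic, and that would look just like a frozen machine. (Separately, taking page_table_lock with a plain spin_lock from softirq context can deadlock if the code it interrupted on the same CPU already holds that lock.) The userspace model below uses hypothetical struct task and struct mm stand-ins for the kernel structures to show the selection loop with kernel threads skipped; in the module itself the corresponding change would be an "if (p->mm == NULL) continue;" at the top of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel's mm_struct and task_struct. */
struct mm {
    unsigned long rss;          /* resident pages */
};

struct task {
    int pid;
    struct mm *mm;              /* NULL for kernel threads, as in 2.4 */
};

/* Return the task with the largest RSS, or NULL if no task has an mm.
 * Tasks without an mm (kernel threads) are skipped rather than
 * dereferenced, which is the suspected bug in the module. */
const struct task *find_fatso(const struct task *tasks, size_t n)
{
    const struct task *fatso = NULL;
    unsigned long fatso_rss = 0;

    assert(tasks != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct task *p = &tasks[i];
        if (p->mm == NULL)
            continue;           /* kernel thread: no mm to read, nothing to kill */
        if (fatso == NULL || p->mm->rss > fatso_rss) {
            fatso = p;
            fatso_rss = p->mm->rss;
        }
    }
    return fatso;
}
```

For example, given two kernel threads and two user processes with RSS of 100 and 5000 pages, find_fatso returns the 5000-page process and never touches a missing mm.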

24 Nov 2002 (updated 24 Nov 2002 at 23:54 UTC)

Presented my paper about Distel at the Erlang conference, where my prerelease Emacs slideware package (shot1 shot2) went over really well - nothin' like executing code from inside the slides to distract people from your blathering :-). Also implemented some really nice new Distel features, like completion of Erlang modules and functions in Emacs based on the lisp-complete-symbol command for Emacs Lisp.

Great fun; the Erlang hackers are a fun bunch :-)

wlach: as an Emacs tutorial I would highly recommend Keywiz!

Did some fun stuff:

  • Released version 3.1 of Distel, my system for Erlang-compatible concurrent and distributed Emacs Lisp programming. Also wrote a paper about it for the Erlang conference. This is my first real paper, and I must say it's an awful lot of work :-). I'm greatly indebted (once again) to demoncrat for his help.

    I hope the paper is interesting to Emacs hackers, though it'll take an afternoon's perusal of Concurrent Programming in Erlang to see the programming model that's being implemented.

  • Did a teeny-tiny bit more hacking on my Lisp network stack. Now if you telnet to its IP address, it prints "Hello, world!" over TCP.

And otherwise occupied with Erlang hacking at work. Also visited Canada, braved the first real snow of the winter in Stockholm, and some other real world things.

More hacking on my Lisp IP stack. Just got it to answer pings! Very exciting :-)

The code's a little untidy at the moment, but it's still down at 823 lines of hand-written code, and I'm pretty happy with that. I've left out everything not absolutely essential so far, like routing, fragmentation, etc.
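Answering a ping needs surprisingly little. Here is a hypothetical C sketch of the ICMP half (the Lisp stack's own code isn't shown here, so the names and structure are illustrative only): an echo request (type 8) is turned into an echo reply (type 0) in place and its RFC 1071 Internet checksum is recomputed; swapping the source and destination addresses is left to the IP layer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 1071 Internet checksum: one's-complement sum of 16-bit
 * big-endian words, with an odd trailing byte padded with zero. */
uint16_t inet_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;

    assert(data != NULL || len == 0);
    while (len > 1) {
        sum += ((uint32_t)data[0] << 8) | data[1];
        data += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)data[0] << 8;
    while (sum >> 16)                       /* fold carries back in */
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}

/* Turn an ICMP echo request (type 8, code 0) into an echo reply
 * (type 0) in place. Returns 0 on success, -1 if the message is not an
 * echo request. Identifier, sequence number and payload are unchanged. */
int icmp_echo_to_reply(uint8_t *icmp, size_t len)
{
    assert(icmp != NULL);
    if (len < 8 || icmp[0] != 8 || icmp[1] != 0)
        return -1;
    icmp[0] = 0;                            /* type: echo reply */
    icmp[2] = icmp[3] = 0;                  /* zero the checksum before summing */
    uint16_t c = inet_checksum(icmp, len);
    icmp[2] = (uint8_t)(c >> 8);            /* store in network byte order */
    icmp[3] = (uint8_t)(c & 0xff);
    return 0;
}
```

For the eight bytes 00 01 f2 03 f4 f5 f6 f7 used in the RFC 1071 example, the folded sum is 0xddf2 and inet_checksum returns its complement, 0x220d. Running it over a packet whose checksum field is already correct yields 0, which is also how a receiver can check one.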

It also seems to me that the low-level internet protocols are a lot more straightforward than CORBA/HTTP/XML/Javathings/etc, and much better specified too. Maybe toy IP stacks could become suitable weekend hacks like toy webservers are today - that would be pretty interesting!

Long time no diary! Many random hacks:

And I learned to use Word and PowerPoint at work, obviously signaling a great disturbance in the Force.

(I also crashed Netscape right before posting this article, and am delighted to find that Advogato preserved it.)

17 Aug 2002 (updated 17 Aug 2002 at 09:36 UTC)

Of Roshambo, Bram writes:

    A tempting strategy is to make your bot 'wimp out' and start playing randomly if it isn't doing well. Tournaments play two programs against each other many times with no persistent information between runs to keep this strategy from being effective.

Going random when you're down sounds like a recipe for staying down. How about going random any time you get a modest lead, in the hope of keeping it?
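The modest-lead idea might look like this as a move chooser: counter the opponent's most frequent move while the score is close, and switch to random play once ahead by some margin. This is a hypothetical C sketch, not anyone's actual bot (the robots mentioned here are written in Idel); lead_threshold is an arbitrary tuning parameter, and the caller supplies random_move so the decision logic stays deterministic.

```c
#include <assert.h>
#include <stddef.h>

enum move { ROCK = 0, PAPER = 1, SCISSORS = 2 };

/* The move that beats m: paper beats rock, scissors beat paper,
 * rock beats scissors. */
static enum move beats(enum move m)
{
    return (enum move)((m + 1) % 3);
}

/* Hypothetical strategy: counter the opponent's most frequent move so
 * far, but once our lead reaches lead_threshold, play the supplied
 * random move instead, leaving no pattern for the opponent to exploit. */
enum move choose_move(int my_score, int opp_score,
                      const int opp_counts[3], int lead_threshold,
                      enum move random_move)
{
    assert(opp_counts != NULL);
    if (my_score - opp_score >= lead_threshold)
        return random_move;             /* protect the lead */

    enum move favourite = ROCK;
    for (int m = PAPER; m <= SCISSORS; m++)
        if (opp_counts[m] > opp_counts[favourite])
            favourite = (enum move)m;
    return beats(favourite);
}
```

Countering the most frequent move is just a placeholder for whatever the bot does when it isn't protecting a lead; the interesting part is only where the switch to random play happens.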

Recently demoncrat, another friend, and I have held some ultra-simple Roshambo tournaments, with robots written in Idel. The game is included in the Idel distro if you're interested in having a crack.

You can see demoncrat's first-generation champion to get the flavour.

Did a nice-looking but nasty Lisp hack today.

Just noticed that the ICFP 2002 programming contest now has a website up. It starts on August 30, so it's probably not too soon to make sure your favourite language is installed on the contest machine.

25 Jul 2002 (updated 26 Jul 2002 at 00:00 UTC)
Cause for Celebration

    luke@cockatoo:~$ ps aux | grep xterm
    luke     18794  0.0  0.2  1336  428 pts/6    R    01:19   0:00 grep xterm

    Since I switched from xterms to eterms, my dear laptop is now populated solely by Emacs windows - a tremendous personal achievement! It results directly from the breakage of the "mouse button" on my Thinkpad a few months ago, which taught me that a mouse is a lousy substitute for some Emacs and Sawfish hacking :-)

    (I still use a mouse for Netscape on my desktop machine at work, w3m not being so universal. One step at a time..)

15 Jul 2002 (updated 16 Jul 2002 at 01:32 UTC)

I got a couple of nice replies to my recent TCP ravings, one from Grit and even a (delightfully titled!) followup article from the author of the original piece. It's an interesting topic.

Actually it's two topics, I think. One is how well TCP works, particularly in some specific situations like "well provisioned" high-speed LANs. The other is about the merits of transport-layer framing, in other words having TCP (or a similar protocol) track discrete "frames" rather than just a continuous stream of bytes.

For the framing issue, I think there must be some specific examples where it's useful that haven't been mentioned. Grit: why do you want the correspondence between write()s and TCP segments? What am I missing?

Here's my reply to the "TCP Apologists Considered Annoying" article (anyone's welcome to respond to this):

    I disagree about pay-as-you-use. The vast majority of TCP's congestion-control features cannot be turned off, either because they're deeply embedded in the protocol itself or because no platform that anyone actually uses provides an interface to disable them. Those features do get in the way, too, kicking in and causing timeout waits even when - intuitively, to an informed human looking at a packet trace - there's absolutely no good reason for it.

By pay-as-you-use, I mean that most congestion-control features are only actually used when you encounter congestion. Or more accurately, when you encounter packet loss, severe packet reordering, or large "spikes" in delay, which TCP interprets as signs of congestion.

On a fast and reliable "specially provisioned" network, I'm assuming these things are extremely rare. That being the case, I don't see why TCP congestion control should cause any problems.
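To illustrate the pay-as-you-use argument, here is a deliberately simplified, Reno-style congestion window model (hypothetical names, windows counted in segments, one ACK per segment assumed; real stacks also have fast retransmit, timeouts and much more). Slow start does still pace the start of every connection, but after that the window only grows, and the single place it is cut is cc_on_loss, which a loss-free network never triggers.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified Reno-style state; windows are counted in segments. */
struct cc_state {
    unsigned cwnd;      /* congestion window */
    unsigned ssthresh;  /* slow-start threshold */
    unsigned acked;     /* ACKs counted during congestion avoidance */
};

void cc_init(struct cc_state *s, unsigned initial_ssthresh)
{
    assert(s != NULL);
    s->cwnd = 1;
    s->ssthresh = initial_ssthresh;
    s->acked = 0;
}

/* Called once per acknowledged segment: the window only grows here. */
void cc_on_ack(struct cc_state *s)
{
    if (s->cwnd < s->ssthresh) {
        s->cwnd++;                      /* slow start: doubles per RTT */
    } else if (++s->acked >= s->cwnd) {
        s->cwnd++;                      /* congestion avoidance: +1 per RTT */
        s->acked = 0;
    }
}

/* Called when loss is detected: the only place the window shrinks. */
void cc_on_loss(struct cc_state *s)
{
    s->ssthresh = s->cwnd / 2 > 2 ? s->cwnd / 2 : 2;
    s->cwnd = s->ssthresh;
    s->acked = 0;
}
```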

It would be more interesting to know what mistakes are showing up in packet traces, and whether they're caused by TCP implementation bugs or by network quirks that the protocol intrinsically doesn't handle well. I'd want to determine this before concluding that TCP's congestion control is expensive, and certainly before turning it off or designing alternatives.

    First off, there's no such thing as a "layer 7 switch". Anybody referring to anything above layer 2 as a switch should be forced to re-take basic networking courses until they learn to get it right.

This is just a disagreement over names. I was referring to boxes that both switch ethernet packets and work with the higher level protocols, for instance to use HTTP cookies for session persistence in load balancing. Vendors and industry press call them "layer 7 switches". I did originally put the name in quotes, after all :-)

    Secondly, transparency requires that service-consumer behavior be preserved. If a transparent proxy wants to do unspecified "useful things" by reshaping traffic internally, it must do so within the limitations of preserving higher-level behavior or it's not actually transparent.

That hits the nail on the head. The reason that TCP proxies can do so much transparently is that TCP and SOCK_STREAM leave so much freedom to the transport, compared with e.g. a datagram protocol where application level frames must correspond to IP packets. With streams, write() isn't defining a frame, it's just writing the next sequence of bytes in the stream, so there is no frame information to be preserved.

    What Luke seems to be missing is that message-boundary information can be more efficiently maintained and transmitted within the transport layer than via application-level framing. If the framing occurs at the application layer, the transport layer is bound to transmit the framing information absolutely verbatim; it has no flexibility. A message-boundary-preserving transport layer can do much more intelligent things that combine this framing with chopping stuff up into network-layer packets, flow control, retransmission, etc.

Apparently I am missing the advantages, but then they haven't been stated specifically. I can't tell what's being proposed from the text above, and TCP streams already have most of the listed advantages.

Streams give tremendous flexibility to the transport layer: its only restriction is to ultimately deliver the bytes in order. Any way it wants to chop them up into network-layer packets for better flow control, retransmission, transfer efficiency, or to fit the receiver's buffer is no problem. It can even split an application-level frame header into pieces if that will be more efficient - it's just stream data like any other. Intermediate proxies are able to re-package the data when they need to, e.g. because their MTU is different on either side or because they want to modify the data. Even the sender and receiver can repackage the data, e.g. with Nagle's algorithm or to compress the send/receive queue.
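On the receiving side, the same freedom means a read() can return part of one write(), or pieces of several, so a receiver that needs an exact number of bytes has to loop. A minimal POSIX sketch (read_full is a hypothetical helper name, not a standard function):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read exactly n bytes from fd, however the stream happens to be split
 * into segments. Returns n on success, the number of bytes read before
 * EOF if the peer closed early, or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t n)
{
    size_t got = 0;

    assert(buf != NULL || n == 0);
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        if (r == 0)
            break;                      /* EOF */
        got += (size_t)r;
    }
    return (ssize_t)got;
}
```

Fed from two writes of 3 and 9 bytes, a 12-byte read_full returns all 12 bytes regardless of how they were split along the way.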

It's also easy to put application frames on top of streams. Writing and reading a 2- or 4-byte in-band length header is cheap and simple, and so are other framing formats like HTTP 1.1 chunked encoding. Taking care of byte ordering is trivial, and often necessary for things other than framing anyway.
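A minimal C sketch of the 4-byte variant, assuming an unsigned 32-bit big-endian (network byte order) payload length; frame_encode and frame_decode are hypothetical names. The decoder looks at whatever bytes have arrived so far and reports 0 until a whole frame is present, so it doesn't matter how the stream was segmented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Write one frame (4-byte big-endian length, then payload) into out.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t frame_encode(uint8_t *out, size_t cap, const uint8_t *payload, uint32_t len)
{
    assert(out != NULL && (payload != NULL || len == 0));
    if (cap < 4 || cap - 4 < len)
        return 0;
    out[0] = (uint8_t)(len >> 24);
    out[1] = (uint8_t)(len >> 16);
    out[2] = (uint8_t)(len >> 8);
    out[3] = (uint8_t)len;
    if (len > 0)
        memcpy(out + 4, payload, len);
    return 4 + (size_t)len;
}

/* Try to decode one frame from the first avail bytes of buf. Returns
 * the total frame size consumed, or 0 if a complete frame has not
 * arrived yet; on success *payload and *len describe the payload. */
size_t frame_decode(const uint8_t *buf, size_t avail,
                    const uint8_t **payload, uint32_t *len)
{
    assert((buf != NULL || avail == 0) && payload != NULL && len != NULL);
    if (avail < 4)
        return 0;                       /* header incomplete */
    uint32_t n = ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
                 ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
    if (avail - 4 < n)
        return 0;                       /* payload incomplete */
    *payload = buf + 4;
    *len = n;
    return 4 + (size_t)n;
}
```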

In summary, streams support simple and cheap application-level framing with great flexibility to the lower layers, in addition to unframed byte streams. So what significant problem would transport-level framing for a TCP-like protocol be solving, and what specific scheme might it use to do so?

    Way too many times, I've seen application messages and message separators appear on the wire as two separate packets, and TCP's byte-stream orientation can make this hard to avoid without extra buffering.

This is a really important point. It's one thing to respect layering, but to completely ignore the existence of lower layers could lead to very inefficient code. It pays to understand what's likely to happen under the hood and give the right "hints", in much the same way it's useful to understand what compilers are going to generate from your programs.

So for instance, if a program is doing a bunch of separate write() calls, this might cause the data to go out in separate IP packets and ethernet frames, when it could have all fit into one. Thoughtful use of writev(2) and TCP_CORK could help out for things like this. It sounds like Jeff from Platypus has a lot of experience in this area -- I don't.
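A minimal POSIX sketch of the writev() approach, using a 4-byte big-endian length header as the example framing (send_frame is a hypothetical name): the header and payload are handed to the kernel in one system call, so TCP at least has the chance to put them in one segment instead of seeing two separate write()s. TCP_CORK, a Linux-specific option set with setsockopt() at the IPPROTO_TCP level, serves a similar purpose when the pieces can't be gathered into one call; it isn't used here. Short writes are treated as errors to keep the sketch small.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send a 4-byte big-endian length header and the payload with a single
 * writev() call. Returns 0 if everything was written, -1 otherwise
 * (a short write counts as failure in this sketch). */
int send_frame(int fd, const void *payload, uint32_t len)
{
    uint8_t hdr[4] = {
        (uint8_t)(len >> 24), (uint8_t)(len >> 16),
        (uint8_t)(len >> 8), (uint8_t)len
    };
    struct iovec iov[2];

    assert(payload != NULL || len == 0);
    iov[0].iov_base = hdr;
    iov[0].iov_len = sizeof hdr;
    iov[1].iov_base = (void *)payload;  /* writev only reads this buffer */
    iov[1].iov_len = len;

    ssize_t n = writev(fd, iov, 2);
    return n == (ssize_t)(sizeof hdr + len) ? 0 : -1;
}
```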
