Older blog entries for apenwarr (starting at number 602)

28 Mar 2011 (updated 29 Mar 2011 at 02:13 UTC) »

I hope IPv6 *never* catches on

(Temporal note: this article was written a few days ago and then time-released.)

This year, like every year, will be the year we finally run out of IPv4 addresses. And like every year before it, you won't be affected, and you won't switch to IPv6.

I was first inspired to write about IPv6 after I read an article by Daniel J. Bernstein (the qmail/djbdns/redo guy) called The IPv6 Mess. Now, that article appears to be from 2002 or 2003, if you can trust its HTTP Last-Modified date, so I don't know if he still agrees with it or not. (If you like trolls, check out the recent reddit commentary about djb's article.) But 8 years later, his article still strikes me as exactly right.

Now, djb's commentary, if I may poorly paraphrase, is really about why it's impossible (or perhaps more precisely, uneconomical, in the sense that there's a chicken-and-egg problem preventing adoption) for IPv6 to catch on without someone inventing something fundamentally new. His point boils down to this: if I run an IPv6-only server, people with IPv4 can't connect to it, and at least one valuable customer is *surely* on IPv4. So if I adopt IPv6 for my server, I do it in addition to IPv4, not in exclusion. Conversely, if I have an IPv6-only client, I can't talk to IPv4-only servers. So for my IPv6 client to be useful, either *all* servers have to support IPv6 (not likely), or I *must* get an IPv4 address, perhaps one behind a NAT.

In short, any IPv6 transition plan involves *everyone* having an IPv4 address, right up until *everyone* has an IPv6 address, at which point we can start dropping IPv4, which means IPv6 will *start* being useful. This is a classic chicken-and-egg problem, and it's unsolvable by brute force; it needs some kind of non-obvious insight. djb apparently hadn't seen any such insight by 2002, and I haven't seen much new since then.

(I'd like to meet djb someday. He would probably yell at me. It would be awesome. </groupie>)

Still, djb's article is a bit limiting, because it's all about why IPv6 physically can't become popular any time soon. That kind of argument isn't very convincing on today's modern internet, where people solve impossible problems all day long using the unstoppable power of "not SQL", Ruby on Rails, and Ajax to-do list applications (ones used by breakfast cereal companies!).

No, allow me to expand on djb's argument using modern Internet discussion techniques:

Top 10 reasons I hope IPv6 never catches on

Just kidding. No, we're going to do this apenwarr-style:

What I hate about IPv6

Really, there's only one thing that makes IPv6 undesirable, but it's a doozy: the addresses are just too annoyingly long. 128 bits: that's 16 bytes, four times as long as an IPv4 address. Or put another way, IPv4 contains almost enough addresses to give one to each human on earth; IPv6 has enough addresses to give 39614081257132168796771975168 (that's 2**95) to every human on earth, plus a few extra if you really must.
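If you want to check that arithmetic yourself, here's a quick sketch. The only assumption is the obvious one: 32-bit IPv4 addresses and 128-bit IPv6 addresses. The variable names are mine.

```python
# Back-of-the-envelope check of the address math above.
IPV4_BITS = 32
IPV6_BITS = 128

ipv4_total = 2 ** IPV4_BITS              # every possible IPv4 address
ipv6_total = 2 ** IPV6_BITS              # every possible IPv6 address

per_person = 2 ** 95                     # the per-human figure quoted above
people_covered = ipv6_total // per_person  # how many people can get that many

print(ipv4_total)      # about 4.3 billion
print(per_person)
print(people_covered)  # 2**33, about 8.6 billion
```

So handing out 2**95 addresses each works for up to 2**33 people, which is in the right ballpark for "every human on earth".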

Of course, you wouldn't really do that; you would waste addresses to make subnetting and routing easier. But here's the horrible ironic part of it: all that stuff about making routing easier... that's from 20 years ago!

Way back in the IETF dark ages when they were inventing IPv6 (you know it was the dark ages, because they invented the awful will-never-be-popular IPsec at the same time), people were worried about the complicated hardware required to decode IPv4 headers and route packets. They wanted to build the fastest routers possible, as cheaply as possible, and IPv4 routing tables are annoyingly complex. It's pretty safe to assume that someday, as the Internet gets more and more crowded, nearly every single /24 subnet in IPv4 will be routed to a different place. That means - hold your breath - an astonishing 2**24 routes in every backbone router's routing table! And those routes might have 4 or 8 or even 16 bytes of information each! Egads! That's... that's... 256 megs of RAM in every single backbone router!

Oh. Well, back in 1995, putting 256 megs of RAM in a router sounded like a big deal. Nowadays, you can get a $99 Sheevaplug with twice that. And let me tell you, the routers used on Internet backbones cost a lot more than $99.
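For the skeptical, here is the worst-case routing table math as a tiny sketch. It assumes exactly what the text does: one route per /24, and a flat 4, 8, or 16 bytes per route. Real routers use cleverer data structures, so treat this as an upper-bound estimate, not a description of any actual router.

```python
def routing_table_bytes(prefix_len, bytes_per_route):
    """Worst case: one route for every possible prefix of this length."""
    return (2 ** prefix_len) * bytes_per_route

for per_route in (4, 8, 16):
    mib = routing_table_bytes(24, per_route) / 2 ** 20
    print(f"{per_route:2d} bytes/route -> {mib:.0f} MiB")
```

Even the fattest case, 16 bytes per route, tops out at 256 MiB.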

It gets worse. IPv6 is much more than just a change to the address length; they totally rearranged the IPv4 header format (which means you have to rewrite all your NAT and firewall software, mostly from scratch). Why? Again, to try to reduce the cost of making a router. Back then, people were seriously concerned about making IPv6 packets "switchable" in the same way ethernet packets are: that is, using pure hardware to read the first few bytes of the packet, look it up in a minimal routing table, and forward it on. IPv4's variable-length headers and slightly strange option fields made this harder. Some would say impossible. Or rather, they would, if it were still 1995.

Since then, FPGAs and ASICs and DSPs and microcontrollers have gotten a whole lot cheaper and faster. If Moore's Law calls for a doubling of transistor performance every 18 months, then between 1995 and 2011 (16 years), that's 10.7 doublings, or 1663 times more performance for the price. So if your $10,000 router could route 1 gigabit/sec of IPv4 in 1995 - which was probably pretty good for 1995 - then nowadays it should be able to route 1663 gigabits/sec. It probably can't, for various reasons, but you know what? I sincerely doubt that's IPv4's fault.
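Here's that Moore's Law estimate as a sketch, assuming the same 18-month doubling period. One wrinkle: the 1663 figure comes from rounding to 10.7 doublings before exponentiating. Without the rounding it's 10.67 doublings and roughly 1625 times. Either way, the conclusion doesn't change.

```python
def moore_factor(years, months_per_doubling=18.0):
    """Return (number of doublings, resulting multiplier) over a span of years."""
    doublings = years * 12 / months_per_doubling
    return doublings, 2 ** doublings

doublings, factor = moore_factor(2011 - 1995)
print(f"{doublings:.2f} doublings -> about {factor:.0f}x")

# The same thing with the doubling count rounded to one decimal first.
rounded_factor = 2 ** 10.7
print(f"10.7 doublings -> about {int(rounded_factor)}x")
```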

If it were still 1995 and you had to route, say, 10 gigabits/sec for the same price as your old 1 gigabit/sec router using the same hardware technology, then yeah, making a more hardware-friendly packet format might be your only option. But the router people somehow forgot about Moore's Law, or else they thought (indications are that they did) that IPv6 would catch on much faster than it has.

Well, it's too late now. The hardware-optimized packet format of IPv6 is worth basically zero to us on modern technology. And so is the simplified routing table. But if we switch to IPv6, we still have to pay the software cost of those things, which is extremely high. (For example, Linux IPv6 iptables rules are totally different from IPv4 iptables rules. So every Linux user would need to totally change their firewall configuration.)

So okay, the longer addresses don't fix anything technologically, but we're still running out of addresses, right? I mean, you can't argue with the fact that 2**32 is less than the number of people on earth. And everybody needs an IP address, right?

Well, no, they don't:

The rise of NAT

NAT is Network Address Translation, sometimes called IP Masquerading. Basically, it means that as a packet goes through your router/firewall, the router transparently changes your IP address from a private one - one reused by many private subnets all over the world and not usable on the "open internet" - to a public one. Because of the way TCP and UDP work, you can safely NAT many, many private addresses onto a single public address.
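To make the mechanism concrete, here's a toy port-translating NAT. It's a minimal sketch, not how any real router does it: the `ToyNat` class, its method names, and the addresses are all made up for illustration, and a real NAT also tracks the remote endpoint, the protocol, and timeouts.

```python
import itertools

class ToyNat:
    """Toy NAT: many private (ip, port) pairs share one public IP."""

    def __init__(self, public_ip, first_port=40000):
        self.public_ip = public_ip
        self._next_port = itertools.count(first_port)
        self._out = {}   # (private_ip, private_port) -> public_port
        self._in = {}    # public_port -> (private_ip, private_port)

    def outbound(self, private_ip, private_port):
        """Rewrite the source of an outgoing packet, allocating a mapping if new."""
        key = (private_ip, private_port)
        if key not in self._out:
            public_port = next(self._next_port)
            self._out[key] = public_port
            self._in[public_port] = key
        return (self.public_ip, self._out[key])

    def inbound(self, public_port):
        """Map a reply back to the private host, or None if nobody asked for it."""
        return self._in.get(public_port)


nat = ToyNat("203.0.113.7")               # documentation-range address
a = nat.outbound("192.168.1.10", 51000)
b = nat.outbound("192.168.1.11", 51000)   # same private port, different host
print(a, b)                               # same public IP, different public ports
print(nat.inbound(a[1]))                  # reply finds its way back
print(nat.inbound(12345))                 # unsolicited packet: no mapping, so None
```

The whole trick is that the port number carries the extra bits needed to tell the private hosts apart.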

So no. Not everybody in the world needs a public IP address. In fact, *most* people don't need one, because most people make only outgoing connections, and you don't need your own public IP address to make an outgoing connection.

By the way, the existence of NAT (and DHCP) has largely eliminated another big motivation behind IPv6: automatic network renumbering. Network renumbering used to be a big annoying pain in the neck; you'd have to go through every computer on your network, change its IP address, router, DNS server, etc, rewrite your DNS settings, and so on, every time you changed ISPs. When was the last time you heard about that being a problem? A long, long time ago, because once you switch to private IP subnets, you virtually never have to renumber again. And if you use DHCP, even the rare mandatory renumbering (like when you merge with another company and you're both using 192.168.1.0/24) is rolled out automatically from a central server.

Okay, fine, so you don't need more addresses for client-only machines. But every server needs its own public address, right? And with the rise of peer-to-peer networking, everyone will be a server, right?

Well, again, no, not really. Consider this for a moment:

Every HTTP Server on Earth Could Be Sharing a Single IP Address and You Wouldn't Know The Difference

That's because HTTP/1.1 (which is what *everyone* uses now... speaking of avoiding chicken/egg problems) supports "virtual hosts." You can connect to an IP address on port 80, and you provide a Host: header at the beginning of the connection, telling it which server name you're looking for. The IP you connect to can then decide to route that request anywhere it wants.
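Here's roughly what that routing decision looks like, as a sketch. The `BACKENDS` table, the hostnames, and the internal addresses are invented for illustration. The parsing is deliberately naive: it doesn't handle bracketed IPv6 literals in the Host header, for one.

```python
def parse_host(raw_request):
    """Return the lowercased Host header value (without any port), or None."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("latin-1")
    for line in head.split("\r\n")[1:]:          # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            return value.strip().lower().split(":")[0]
    return None

# Hypothetical table mapping server names to internal (address, port) pairs.
BACKENDS = {
    "www.example.com": ("10.0.0.5", 8080),
    "blog.example.org": ("10.0.0.9", 8080),
}

def pick_backend(raw_request):
    """Choose where to forward a request, based only on its Host header."""
    return BACKENDS.get(parse_host(raw_request))

req = b"GET / HTTP/1.1\r\nHost: blog.example.org\r\nAccept: */*\r\n\r\n"
print(pick_backend(req))
```

Notice that the IP address the client connected to never appears anywhere in the decision.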

In short, HTTP is IP-agnostic. You could run HTTP over IPv4 or IPv6 or IPX or SMS, if you wanted, and you wouldn't need to care which IP address your server had. In a severely constrained world, Linode or Slicehost or Comcast or whoever could simply proxy all the incoming HTTP requests to their network, and forward the requests to the right server.

(See the very end of this article for discussion of how this applies to HTTPS.)

Would it be a pain? Inefficient? A bit expensive? Sure it would. So was setting up NAT on client networks, when it first arrived. But we got used to it. Nowadays we consider it no big deal. The same could happen to servers.

What I'd expect to happen is that as the IPv4 address space gets more crowded, the cost of a static IP address will go up. Thus, fewer and fewer public IP addresses will be dedicated to client machines, since clients won't want to pay extra for something they don't need. That will free up more and more addresses for servers, who will have to pay extra.

It'll be a *long* time before we reach 4 billion (2**32) server IPs, particularly given the long-term trend toward more and more (infinitely proxyable) HTTP. In fact, you might say that HTTP/1.1 has successfully positioned itself as the winning alternative to IPv6.

So no, we are obviously not going to run out of IPv4 addresses. Obviously. The world will change, as it did when NAT changed from a clever idea to a worldwide necessity (and earlier, when we had to move from static IPs to dynamic IPs) - but it certainly won't grind to a halt.

It is possible to do peer-to-peer when both peers are behind a NAT.

Another argument against widespread NATting is that you can't run peer-to-peer protocols if both ends are behind a NAT. After all, how would they figure out how to connect to each other? (Let's assume peer-to-peer is a good idea, for purposes of this article. Don't just think about movie piracy; think about generally improved distributed database protocols, peer-to-peer filesystem backups, and such.)

I won't go into this too much, other than to say that there are already various NAT traversal protocols out there, and as NAT gets more and more annoyingly mandatory, those protocols and implementations are going to get much better.

Note too that NAT traversal protocols don't have a chicken-and-egg problem like IPv6 does, for the same reason that dynamic IP addresses don't, and NAT itself doesn't. The reason is: if one side of the equation uses it, but the other doesn't, you might never know. That, right there, is the one-line description of how you avoid chicken-and-egg adoption problems. And how IPv6 didn't.

IPv6 addresses are as bad as GUIDs

So here's what I really hate about IPv6: 16-byte (32 hex digit) addresses are impossible to memorize. Worse, auto-renumbering of networks, facilitated by IPv6, means that anything I memorize today might be totally different tomorrow.

IPv6 addresses are like GUIDs (which also got really popular in the 1990s dark ages, notably, although luckily most of us have learned our lessons since then). The problem with GUIDs is now well known: although they're globally unique, they're also totally unrecognizable.

If GUIDs were a good idea, we would use them instead of URLs. Are URLs perfect? Does anyone love Network Solutions? No, of course not. But it's 1000x better than looking at http://b05d25c8-ad5c-4580-9402-106335d558fe and trying to guess if that's *really* my bank's web site or not.

The counterargument, of course, is that DNS is supposed to solve this problem. Give each host a ~~GUID~~ IPv6 address, and then just map a name to that address, and you can have the best of both worlds.

Sounds good, but isn't actually. First of all, go look around in the Windows registry sometime, specifically the HKEY_CLASSES_ROOT section. See how super clean and user-friendly it isn't? Barf. But furthermore, DNS on the Internet is still a steaming pile of hopeless garbage. When I bring my laptop to my friend's house and join his WLAN, why can't he ping it by name? Because DNS sucks. Why doesn't it show up by name in his router control panel so he knows which box is using his bandwidth? Because DNS sucks. Why can the Windows server browse list see it by name (sometimes, after a random delay, if you're lucky), even though DNS can't? Because they got sick of DNS and wrote something that works. Why do we still send co-workers hyperlinks with IP addresses in them instead of hostnames? Because the fascist sysadmin won't add a DNS entry for the server Bob set up on his desktop PC.

DNS is, at best, okay. It will get better over time, as necessity dictates. All the problems I listed above are mostly solved already, in one form or another, in different DNS, DHCP, and routing products. It's certainly not the DNS *protocol* that's to blame, it's just how people use it.

But still, if you had to switch to IPv6, you'd discover that those DNS problems that were a nuisance yesterday are suddenly a giant fork stabbing you in the face today. I'd rather they fixed DNS *before* making me switch to something where I can't possibly remember my IP addresses anymore, thanks.

Server-side NAT could actually make the world a better place

So that's my IPv6 rant. I want to leave you with some good news, however: I think the increasing density of IPv4 addresses will actually make the Internet a better place, not a worse one.

Client-side NAT had an unexpected huge benefit: security. NAT is like "newspeak" in Orwell's 1984: we remove nouns and verbs to make certain things inexpressible. For example, it is not possible for a hacker in Outer Gronkstown to even express to his computer the concept of connecting to the Windows File Sharing port on your laptop, because from where he's standing, there is no name that unambiguously identifies that port. There is no packet, IPv4 or IPv6 or otherwise, that he can send that will arrive at that port.

A NAT can be unbelievably simple-minded, and just because of that one limitation, it will vastly, insanely, unreasonably increase your security. As a society of sysadmins, we now understand this. You could give us all the IPv6 addresses in the world, and we'd still put our corporate networks behind a NAT. No contest.

Server-side NAT is another thing that could actually make life better, not worse. First of all, it gives servers the same security benefits as clients - if I accidentally leave a daemon running on my server, it's not automatically a security hole. (I actually get pretty scared about the vhosts I run, just because of those accidental holes.)

But there's something else, which I would be totally thrilled to see fixed. You see, IPv4 addresses aren't really 32 bits. They're actually 48 bits: a 32-bit IP address plus a 16-bit port number. People treat them as separate things, but what NAT teaches us is that they're really two parts of the same whole: the flow identifier, and you can break them up any way you want.

The address of my personal apenwarr.ca server isn't 74.207.252.179; it's 74.207.252.179:80. As a user of my site, you didn't have to type the IP (which was provided by DNS) or the port number (which is a hardcoded default in your web browser), but if I started another server, say on port 8042, then you *would* have to enter the port. Worse, the port number would be a weird, meaningless, magic number, akin to memorizing an IP address (though mercifully, only half as long).
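To see the "really 48 bits" idea in the flesh, here's a sketch that packs an address and a port into one integer and splits it again. The function names are mine; `ipaddress` is the Python standard library module.

```python
import ipaddress

def pack_endpoint(ip, port):
    """Combine a 32-bit IPv4 address and a 16-bit port into one 48-bit integer."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return (int(ipaddress.IPv4Address(ip)) << 16) | port

def unpack_endpoint(value):
    """Split a 48-bit endpoint back into (dotted-quad string, port)."""
    return str(ipaddress.IPv4Address(value >> 16)), value & 0xFFFF

ep = pack_endpoint("74.207.252.179", 80)
print(hex(ep))               # 0x4acffcb30050
print(ep.bit_length() <= 48)
print(unpack_endpoint(ep))
```

Where you draw the line between "address bits" and "port bits" is a convention, and a NAT is basically a box that moves that line.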

So here's my proposal to save the Internet from IPv6: let's extend DNS to give out not only addresses, but port numbers. So if I go to www2.apenwarr.ca, it could send me straight to 74.207.252.179:8042. Or if I ask for ssh.apenwarr.ca, I get 74.207.252.179:22.

Someday, when IPv4 addresses get too congested, I might have to share that IP address with five other people, but that'll be fine, because each of us can run our own web server on something other than port 80, and DNS would transparently give out the right port number.
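Here's a sketch of what a port-aware lookup might feel like from the client side. The `ZONE` dictionary is a stand-in for a real DNS query, and `resolve_endpoint` is a hypothetical helper, not an existing API. DNS SRV records (RFC 2782) already carry a port next to a target name, so this is similar in spirit to something that exists. The fallback to a default port is my assumption about how old-style records would coexist with new ones.

```python
# Stand-in for DNS: name -> (IPv4 address, port or None).
# The address and ports come from the text above; the table itself is made up.
ZONE = {
    "apenwarr.ca":      ("74.207.252.179", None),   # old-style record, no port
    "www2.apenwarr.ca": ("74.207.252.179", 8042),
    "ssh.apenwarr.ca":  ("74.207.252.179", 22),
}

def resolve_endpoint(name, default_port):
    """Return (ip, port) for name, using default_port when the record has no port."""
    ip, port = ZONE[name.lower().rstrip(".")]
    return ip, (port if port is not None else default_port)

print(resolve_endpoint("apenwarr.ca", 80))        # browser default still applies
print(resolve_endpoint("www2.apenwarr.ca", 80))   # the record overrides the default
print(resolve_endpoint("ssh.apenwarr.ca", 22))
```

A client that doesn't know about the port field just keeps using its hardcoded default, which is why this can be adopted one name at a time.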

This also solves the problem with HTTPS. Alert readers will have noticed, in my comments above, that HTTPS can't support virtual hosts the same way HTTP does, because of a terrible flaw in its certificate handling. ~~Someday, someone might make a new version of the HTTPS standard without this terrible flaw, but in the meantime,~~ transparently supporting multiple HTTPS servers via port numbers on the same machine eliminates the problem; each port can have its own certificate.

(Update 2011/03/28: zillions of people wrote to remind me about SNI, an HTTPS extension that allows it to work with vhosts. Thanks! Now, some of those people seemed to think this refutes my article somehow, which is not true. In fact, the existence of an HTTPS vhosting standard makes IPv6 even *less* necessary. Then again, the standard doesn't work with IE6.)

This proposal has very minor chicken-and-egg problems. Yes, you'll have to update every operating system and every web browser before you can safely use it for *everything*. But for private use - for example, my personal ssh or VPN or testing web server - at least it'll save me remembering stupid port numbers. Like the original DNS, it can be adopted incrementally, and everyone who adopts it sees a benefit. Moreover, it's layered on top of existing standards, and routable over the existing Internet, so enabling it has basically zero admin cost.

Of course, I can't really take credit for this idea. It's already been invented and is being used in a few places.

Embrace IPv4. Love it. Appreciate the astonishing long-lasting simplicity and resilience of a protocol that dates back to the 1970s. Don't let people pressure you into stupid, awful, pain-inducing, benefit-free IPv6. Just relax and have fun.

You're going to be enjoying IPv4 for a long, long time.

Syndicated 2011-03-27 02:00:38 (Updated 2011-03-29 02:13:30) from apenwarr - Business is Programming

27 Mar 2011 (updated 27 Mar 2011 at 02:09 UTC) »

Time capsule: assorted cooking advice

Hi all,

As I mentioned previously, I'm about to disappear into the Google Vortex, across which are stunning vistas of trees and free food and butterflies as far as the eye can see. Thus, I plan to never ever cook for myself again, allowing me to free up all the neurons I had previously dedicated to remembering how.

Just in case I'm wrong, let me exchange some of those neurons for electrons.

Note that I'm not a "foodie" or a gourmet or any of that stuff. This is just baseline information needed in order to be relatively happy without dying of starvation, boredom, or (overpriced ingredient induced) bankruptcy, in countries where you can die of bankruptcy.

Here we go:

Priority order for time-saving appliances: microwave, laundry machine, dishwasher, food processor, electric grill. Under no circumstances should you get a food processor before you get a dishwasher. Seriously. And I include laundry machine here because if you have one, you can do your laundry while cooking, which reduces the net time cost of cooking.

Cast-iron frying pans are a big fad among "foodies" nowadays. Normally I ignore fads, but in this case, they happen to be right. Why cast iron pans are better: 1) they're cheap (don't waste your money on expensive skillets! It's cast iron, it's supposed to be cheap!); 2) unlike nonstick coatings, they never wear out; 3) you're not even supposed to clean them carefully, because microbits from the previous meal help the "seasoning" process *and* make food taste better; 4) they never warp; 5) they heat very gradually and evenly, so frying things (like eggs) is reliable and repeatable; 6) you can use a metal flipper and still never worry about damage; 7) all that crap advice about "properly seasoning your skillet before use" is safe to ignore, because nothing you can do can ever possibly damage it, because it's freakin' cast iron. (Note: get one with a flat bottom, not one with ridges. The latter has fewer uses.)

You can "bake" a potato by prodding it a few times with a fork, then putting it on a napkin or plate, microwaving it for 7 minutes, and adding butter. I don't know of any other form of healthy, natural food that's as cheap and easy as this. The Irish (from whom I descend, sort of) reputedly survived for many years, albeit a bit unhappily, on a diet of primarily potatoes. (Useless trivia: the terrible "Irish potato famine" was deadly because the potatoes ran out, not because potatoes are bad.) Amaze your friends at work by bringing a week's worth of "lunch" to the office in the form of a sack of potatoes. (I learned that trick from a co-op student once. We weren't paying him much... but we aimed to hire resourceful people with bad negotiating skills, and it paid off.)

Boiled potatoes are also easy. You stick them in a pot of water, then boil it for half an hour, then eat.

Bad news: the tasty part of food is the fat. Good news: nobody is sure anymore if fat is bad for you or not, or what a transfat even is, so now's your chance to flaunt it before someone does any real science! Drain the fat if you must, but don't be too thorough about it.

Corollary: cheaper cuts of meat usually taste better, if prepared correctly, because they have more fat than expensive cuts. "Correctly" usually means just cooking on lower heat for a longer time.

Remember that cooking things for longer is not the same as doing more work. It's like wall-clock time vs. user/system time in the Unix "time" command. Because of this, you can astonish your friends by making "roast beef", which needs to cook in the oven for several hours, without using more than about 20 minutes of CPU time.

Recipe for french toast: break two eggs into a bowl, add a splash of milk, mix thoroughly with a fork. Dip slices of bread into bowl. Fry in butter in your cast iron pan at medium heat. Eat with maple syrup. And don't believe anyone who tells you more ingredients are necessary.

Recipe for perogies: buy frozen perogies from store (or Ukrainian grandmother). Boil water. Add frozen perogies to water. Boil them until they float, which is usually 6-8 minutes. Drain water. Eat.

Recipe for meat: Slice, chop, or don't, to taste. Brown on medium-high heat in butter in cast-iron skillet (takes about 2 minutes). Turn heat down to low. Add salt and pepper and some water so it doesn't dry out. Cover. Cook for 45-60 minutes, turning once, and letting the water evaporate near the end.

Recipe for vegetables: this is a trick question. You can just not cook them. (I know, right? It's like food *grows on trees*!)

Hope this advice isn't too late to be useful to you. So long, suckers!

Syndicated 2011-03-24 03:38:29 (Updated 2011-03-27 02:09:10) from apenwarr - Business is Programming

24 Mar 2011 »

The Google Vortex

For a long time I referred to Google as the Programmer Black Hole: my favourite programmers get sucked in, and they never come out again. Moreover, the more of them that get sucked in, the more its gravitation increases, accelerating the pull on those that remain.

I've decided that this characterization isn't exactly fair. Sure, from our view in the outside world, that's obviously what's happening. But rather than all those programmers being compressed into a spatial singularity, they're actually emerging into a parallel universe on the other side. A universe where there *is* such a thing as a free lunch, threads don't make your programs crash, parallelism is easy, and you can have millions of customers but provide absolutely no tech support and somehow get away with it. A universe with self-driving cars, a legitimate alternative to C, a working distributed filesystem, and the entire Internet cached in RAM.

A universe where on average, each employee produced $425,450 in profit in 2010, after deducting their salary and all other expenses. (Or alternatively: $1.2 million in revenue.)

I don't much like the fact of the Google Vortex. It's very sad to me that there are now two programmer universes: the haves and the have-nots. More than half of the programmers I personally know have already gone to the other side. Once you do, you can suddenly work on more interesting problems, with more powerful tools, with on average smarter people, with few financial constraints... and you don't have to cook for yourself. For the rest of us left behind, the world looks more and more threadbare.

A few people do manage to emerge from the vortex, providing circumstantial evidence that human life does still exist on the other side. Many of them emerge talking about bureaucracy, politics, "big company attitude", projects that got killed, and how things "aren't like they used to be." And also how Google stock options aren't worth much because they've already IPO'd. But sadly, this is a pretty self-selecting group, so you can't necessarily trust what they're complaining about; presumably they'll be complaining about something, or they wouldn't have left.

What you really want to know is what the people who didn't leave are thinking. Which is a problem, because Google is so secretive that nobody will tell you much more than, "Google has free food. You should come work here." And I already knew that.

So let's get to the point: in the name of science (and free food, and because all my friends said I should go work there), I've agreed to pass through the Google Vortex starting on Monday. The bad news for you is, once I get through to the other side, I won't be able to tell you what I discover, so you're no better off. Google doesn't stop its employees from blogging, but you might have noticed that the blogs of Googlers don't tell you the things you really want to know. If the NDA they want me to sign is any indication, I won't be telling you either.

What I can do, however, is give you some clues. Here's what I'm hoping to do at Google:

Work on customer-facing real technology products: not pure infrastructure and not just web apps.
Help solve some serious internet-wide problems, like traffic shaping, real-time communication, bufferbloat, excessive centralization, and the annoying way that surprise popularity usually means losing more money (hosting fees) by surprise.
Keep coding. But apparently unlike many programmers, I'm not opposed to managing a few people too.
Keep working on my open source projects, even if it's just on evenings and weekends.
Eat a *lot* of free food.
Avoid the traps of long release cycles and ignoring customer feedback.
Avoid switching my entire life to Google products, in the cases where they aren't the best... at least not without first making them the best.
Stay so highly motivated that I produce more valuable software, including revenue, inside Google than I would have by starting a(nother) startup.

Talking to my friends "on the inside", I believe I can do all those things. If I can't achieve at least most of it, then I guess I'll probably quit. Or else I will discover, with the help of that NDA, that there's something even *better* to work on.

So that's my parting gift to you, my birth universe: a little bit of circumstantial evidence to watch for. Not long from now, assuming the U.S. immigration people let me into the country, I'll know too much proprietary information to be able to write objectively about Google. Plus, speaking candidly in public about your employer is kind of bad form; even this article is kind of borderline bad form.

This is probably the last you'll hear from me on the topic. From now on, you'll have to draw your own conclusions.

Syndicated 2011-03-23 22:29:01 from apenwarr - Business is Programming

20 Mar 2011 »

Suggested One-Line Plot Summaries, Volume 1 (of 1)

"The Summer of My Disco-tent."

Discuss.

Syndicated 2011-03-20 20:42:42 from apenwarr - Business is Programming

19 Mar 2011 »

Celebrating Failure

I receive the monthly newsletter from Engineers Without Borders, an interesting organization providing engineering services to developing countries.

Whether because they're Canadian or because they're engineers, or both, they are unusual among aid organizations because they focus on understanding what didn't work. For the last three years, they've published Failure Reports detailing their specific failures. The reports make an interesting read, not just for aid organizations, but for anyone trying to manage engineering teams.

I wish more organizations, and even more individuals, would write documents like that. I probably should too.

Syndicated 2011-03-18 22:03:22 from apenwarr - Business is Programming

16 Mar 2011 »

Parsing ought to be easier

I just realized that I've spent too much time writing about stuff I actually know about lately. (Although okay, that last article was a stretch.) So here's a change of pace.

I just read an interesting article about parsing, that is, Parsing: The Solved Problem that Isn't. It talks about "composability" of grammars, that is, what it would take to embed literal SQL into your C parser, for example.

It's a very interesting question that I hadn't thought of before. Interesting, because every parser I've seen would be *hellish* to try to compose with another grammar. Take the syntax highlighting in your favourite editor and try to have it auto-shift from one language to another (PHP vs. HTML, or python vs. SQL, or perl vs. regex). It never works. Or if you're feeling really brave, take the C++ parser from gcc and use it to do syntax highlighting in wordpress for when you insert some example code. Not likely, right?

The article was a nice reminder of what I had previously learned about parsing in school: context free grammars, LL(k), and so on. Before I went to school, I had never heard of or used any of those things, but I had written a lot of parsers; I now know that what I had independently "invented" is called recursive descent and it seems to be pretty unfashionable among people who "know" parsing.

I admit it; I'm a troublemaker and I fail badly at academia. I still always write all my parsers as recursive descent. Sometimes I don't even split the tokenizer from the grammar. Booyah! I even write non-conforming XML parsers sometimes, and use them for real work.

So if you're a parsing geek, you'd better leave now, because this isn't going to get any prettier.

Anyway, here's my big uneducated non-academic parsing neophyte theory:

You guys spend *way* too much time caring about arithmetic precedence.

See, arithmetic precedence is important; languages that don't understand it (like Lisp) will never be popular, because they prevent you from writing what you mean in a way that looks like what you mean. Fine. You've gotta have it. But it's a problem, because context-free grammars (and their subsets) have a *seriously hard time* with it. You can't just say "addition looks like E+E" and "multiplication looks like E*E" and "an expression E is either a number or an addition or a multiplication", because then 1+2*3 might mean either (1+2)*3 or 1+(2*3), and those are two different things. Every generic parsing algorithm seems to require hackery to deal with things like that. Even my baby, recursive descent, has a problem with it.
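In case "two different things" sounds abstract, here are the two trees that the naive grammar allows for 1+2*3, written as nested tuples and evaluated. The tuple representation and the `evaluate` helper are just one convenient way to show it.

```python
import operator

OPS = {"+": operator.add, "*": operator.mul}

def evaluate(tree):
    """Evaluate a tree that is either an int or an (op, left, right) tuple."""
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left), evaluate(right))

add_first = ("*", ("+", 1, 2), 3)   # (1+2)*3
mul_first = ("+", 1, ("*", 2, 3))   # 1+(2*3)

print(evaluate(add_first))  # 9
print(evaluate(mul_first))  # 7
```

Same tokens, same grammar rules, two answers.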

So here's what I say: just let it be a hack!

Because precedence is only a tiny part of your language, and the *rest* is not really a problem at all.

When I write a parser that cares about arithmetic precedence - which I do, sometimes - the logic goes like this:

ah, there's the number one
a plus sign!
the number two! Sweet! That's 1+2! It's an expression!
a multiplication sign. Uh oh.
the number three. Hmm. Well, now we have a problem.
(hacky hacky swizzle swizzle) Ta da! 1+(2*3).

I'm not proud of it, but it happens. You know what else? Almost universally, the *rest* of the parser, outside that one little hacky/swizzly part, is fine. The rest is pretty easy. Matching brackets? Backslash escapes? Strings? Function calls? Code blocks? All those are easy and non-ambiguous. You just walk forward in the text one token at a time and arrange your nice tree.
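For the curious, here's one well-known way to keep the swizzle confined to a corner of a recursive descent parser: give each precedence level its own function, and have the looser level call the tighter one. This is only a sketch, and not necessarily the hack I'd reach for on any given day. It assumes tokens separated by single spaces, only + and *, and non-negative integers; the names peek, advance, term, expr_sum and calc are all made up for the example.

```shell
# Tokens are kept in one global string, separated by single spaces.
TOKENS=

# peek: copy the next token into TOK without consuming it.
peek()
{
	TOK=${TOKENS%% *}
}

# advance: drop the first token from TOKENS.
advance()
{
	case $TOKENS in
		*" "*) TOKENS=${TOKENS#* } ;;
		*) TOKENS= ;;
	esac
}

# term := NUMBER ( "*" NUMBER )*     (binds tighter)
term()
{
	peek; VAL=$TOK; advance
	peek
	while [ "$TOK" = "*" ]; do
		advance; peek
		VAL=$(($VAL * $TOK)); advance
		peek
	done
}

# expr_sum := term ( "+" term )*     (binds looser)
expr_sum()
{
	local sum
	term; sum=$VAL
	peek
	while [ "$TOK" = "+" ]; do
		advance
		term; sum=$(($sum + $VAL))
		peek
	done
	VAL=$sum
}

# calc EXPRESSION: hypothetical entry point; leaves the result in VAL.
calc()
{
	TOKENS=$1
	expr_sum
}

calc "1 + 2 * 3"
echo "$VAL"   # prints 7, i.e. 1+(2*3)
```

Everything outside term and expr_sum just walks forward one token at a time, which is the point: the precedence business stays in two small functions.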

The dirty secret about parsing theory is that if you're a purist, it's almost impossible, but if you're willing to put up with a bit of hackiness in one or two places, it's *really* easy. And now that computers are fast, your algorithm rarely has to be even close to optimized.

Even language composition is pretty easy, but only in realistic cases, not generalized ones. If you expect this to parse:

	if (parseit) {
		query = "select "booga" + col1 from table where n="}"";
	}

Then you've got a problem. Interestingly, a human can do it. A computer *could* do it too. You can probably come up with an insane grammar that will make that work, if you want to allow for potentially exponential amounts of backtracking and no chance of separating scanning from parsing. (My own preferred recursive descent technique is almost completely doomed here, so you'll need to pull out the really big Ph.D. parsing cannons.) But it *is* possible. You know it is, because you can look at the above code and know what it means.

So that's an example of the "hard problems" that you're talking about when you try to define composability of independent context-free grammars that weren't designed for each other. It's a virtually impossible problem. An interesting one, but not even one that's guaranteed to have a generalizable solution. Compare it, however, with this:

	if (parseit) {
		query = { select "booga" + col1 from table where n = "}" };
	}

Almost the same thing. But this time, the SQL is embedded inside braces instead of quotes. Aha, but that n="}" business is going to screw us over, right? The C-style parser will see the close-brace and stop parsing!

No, not at all. A simple recursive descent parser, without even needing lookahead, will have no trouble with this one, because it will clearly know it's inside a string at the time it sees the closebrace. Obviously you need to be using a SQL-style tokenizer inside the SQL section, and your SQL-style tokenizer needs to somehow know that when it finds the mismatched brace, that's the end of its job, time to give it back to C. So yeah, if you're writing this parser "Avery style", you're going to have to be writing it as one big ugly chunk of C + SQL parser all in one. But it won't be any harder than any other recursive descent parser, really; just twice the size because it's for two languages instead of one.
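To show how little machinery the brace case needs, here's a rough sketch of the "SQL side" of such a scanner. It assumes a much-simplified SQL where only double-quoted strings matter, with no backslash escapes, no single quotes and no nested braces; scan_sql, SQL and REST are names invented for the example.

```shell
# scan_sql TEXT: TEXT starts just after the opening "{".
# On success, SQL holds everything up to the matching "}" and REST holds
# whatever follows it (to be handed back to the C-style parser).
scan_sql()
{
	local rest c in_str
	rest=$1
	in_str=
	SQL=
	while [ -n "$rest" ]; do
		c=${rest%"${rest#?}"}   # first character of $rest
		rest=${rest#?}          # everything after it
		case $in_str$c in
			'"')  in_str=1 ;;   # opening quote
			'1"') in_str= ;;    # closing quote
			'}')  REST=$rest; return 0 ;;   # brace outside a string: done
		esac
		SQL=$SQL$c
	done
	return 1   # ran out of input before the closing brace
}

scan_sql ' select "booga" + col1 from table where n = "}" }; more C code'
echo "SQL:  [$SQL]"
echo "rest: [$REST]"
```

The only state is "am I inside a string right now?", and that one flag is enough to tell the quoted } apart from the real one. No lookahead, no backtracking.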

So here's my dream: let's ignore the parsing purists for a moment. Let's accept that operator precedence is a special case, and just hack around it when needed. And let's only use composability rules that are easy instead of hard - like using braces instead of quotes when introducing sublanguages.

Can we define a language for grammars - a parser generator - that easily lets us do *that*? And just drop into a lower-level implementation for that inevitable operator precedence hack.

Pre-emptive snarky comments: This article sucks. It doesn't actually solve any problems of parsing or even present a design, it just complains a lot. Also, this type of incorrect and dead-end thinking is already well covered in the literature, it's just that I'm too lazy to send you a link to the article right now because I'm secretly a troll and would rather put you down than be constructive. Also, the author smells funny.

Response to pre-emptive snarky comments: All true. I would appreciate those links though, if you have them, and I promise not to call you a troll. To your face.

Syndicated 2011-03-16 02:13:11 from apenwarr - Business is Programming

13 Mar 2011 (updated 13 Mar 2011 at 23:05 UTC) »

The strange story of etherpad

I don't actually know this story - certainly no more of it than anyone who has read a few web pages. But luckily, I'm pretty good at making things up. So I want to tell you the story of etherpad, the real-time collaborative web-based text editor.

I find this story particularly fascinating because I have lately had an obsession with "simplifying assumptions" - individual concepts, obvious in retrospect, which can eliminate whole classes of problems.

You can stumble on a simplifying assumption by accident, by reading lots of research papers, or by sitting alone in a room and thinking for months and months. What you can't do is invent them by brute force - they're the opposite of brute force. By definition, a simplifying assumption is a work of genius, whether your own or the person you stole it from.

Unix pipes are the quintessential example of a simplifying concept in computer science. The git repository format (mostly stolen from monotone, as the story goes) is trivial to implement, but astonishingly powerful for all sorts of uses. My bupsplit algorithm makes it easy and efficient to store and diff huge files in any hash-based VCS. And Daniel J. Bernstein spent untold effort inventing djb redo, which he never released... but he shared his simplifying ideas, so it was easy to write a redo implementation.

What does all this have to do with etherpad? Simple. Etherpad contains a few startling simplifying assumptions, with a surprising result:

Etherpad is the first (and still only) real-time collaborative editor that has ever made me more productive.

And yet, its authors never saw it as more than just a toy.

My first encounter with etherpad was when Paul Graham posted a real-time etherpad display of him writing an essay. I thought it looked cool, but pointless. Ignore.

Sometime later, I read about Google Wave. I thought that level of collaboration and noise in your inbox sounded like nothing anybody could possibly want; something like crack for lolcats. Ignore.

And then, a little while later, I heard about etherpad again: that it had been bought by Google, and immediately shut down, with the etherpad team being merged into the (technically superior, if you believe the comments at that link) Google Wave team.

Moreover, though, a lot of the commenters were aghast that etherpad had been shut down so suddenly - people were using it for real work, they said. Real work? Etherpad? Did I miss something? Isn't it just a Paul Graham rerun machine?

I couldn't find out, because it was gone.

The outcry was such that it came back again, though, a couple of days later, with a promise that it would soon be open sourced.

So I tried it out. And sure enough, it is good. A nice, simple, multi-user way to edit documents. The open source version is now running for free on multiple sites, including ietherpad and openetherpad, so you can keep using it. Here's what's uniquely great about etherpad:

Start a document with one click without logging in or creating an account.
Share any document with anyone by cut-and-pasting a trivial URL.
Colour coding, with pretty, contrasting colours, easily indicates who wrote what without having to sign your name every time (ie. better than wiki discussions).
Full document history shows what changed when.
Each person's typing shows up immediately on everyone's screen.
Documents persist forever and are accessible from anywhere.
No central document list or filenames, so there's nothing to maintain, organize, or prune.
Easy to import/export various standard file formats.
Simple, mostly bug-free WYSIWYG rich web text editor with good keybindings. (I never thought I would voluntarily use a WYSIWYG text editor. I was wrong.)
Just freeform text (no plugins), so it's flexible, like a whiteboard.
A handy, persistent chat box next to every document keeping a log of your rationale - forever.
A dense screen layout that fits everything I need without cluttering up with stuff I don't. (I'm talking to you, Google Docs.)
Uniquely great for short-lived documents that really need collaboration: meeting minutes, itineraries, proposals, quotes, designs, to-do lists.

Where's etherpad development now? Well, it seems to have stopped. All the open source ones I've seen seem to be identical to the last etherpad that existed before Google bought them. The authors disappeared into the Google Vortex and never came out.

A few months later, Google cancelled the Wave project that had absorbed etherpad. It was a failed experiment, massively overcomplicated for what it could do. Nobody liked it. It didn't solve anyone's problem.

And that could be just another sad story of capitalism: big company acquires little company, sucks life out of it, saps creativity, spits out the chewed-up remains.

But, you see, I don't believe that's what happened. I think what happened is much more strange. I think the people who made etherpad really believed Google Wave was better, and they still do. That's what fascinates me.

See, upon further investigation, I learned that etherpad was never meant to be a real product - it was an example product. The real product was AppJet, some kind of application hosting engine for developers. As an AppJet developer, you could use their tools to make collaborative web applications really easily, with plugins and massive flexibility and workflows and whatnot. (Sound familiar? That's what Google Wave was for, too.) And etherpad was just an example of an app you could build with AppJet. Not just any example: it was the simplest toy example they could come up with that would still work.

I get the impression that the AppJet guys were pretty annoyed at the success of etherpad and the relative failure of AppJet itself. Etherpad is so trivial! Where's the magic? Oh God, WHAT ABOUT EMBEDDED VIDEO? WILL SOMEONE PLEASE THINK ABOUT EMBEDDED VIDEO? Etherpad couldn't do embedded video; still can't. AppJet can. Google Wave can. Etherpad, as the face of their company, was embarrassing. It made their technology look weak. Google Wave was a massive testosterone-powered feature checklist from hell, and Etherpad was... a text editor.

No wonder they sold out as fast as they could.

No wonder they shut down their web site the moment they signed the deal.

They felt inferior. They wanted to get the heck out of this loser business as soon as humanly possible.

And that, my friends, is the story of etherpad.

Epilogue

But I'm expecting a sequel. Despite the Wave project's cancellation, the etherpad/appjet guys have still not emerged from the Google Vortex. Rumour has it that their stuff was integrated into Google Docs or something. (Google Docs does indeed now have realtime collaboration - but it's too much AppJet, too little Etherpad, if you know what I mean.)

When I had the chance to visit Google a few weeks ago, I asked around to see if anybody knew what had happened to the etherpad developers; nobody knew. Google's a big place, I guess.

I would love to talk to them someday.

Etherpad legitimized real-time web document collaboration. It created an entirely new market that Google Docs has been desperately trying, and mostly failing, to create. Google Docs is trying to be Microsoft Office for the web, and the problem is, people don't want Microsoft Office for the web, because Microsoft Office works just fine and Google Docs leaves out zillions of well-loved features. In contrast, etherpad targeted, and ironically is still targeting and progressively winning despite the project's cancellation, an actually new and very important niche activity.

The brilliance of etherpad has nothing to do with plugin architectures or database formats or extensibility; all that stuff is plain brute force. Etherpad's beauty is its simplifying assumption, that just collaboratively editing a trivial, throwaway text file is something a lot of people need to do every single day. If you make that completely frictionless, people will love you. And they did.

Somehow, the etherpad guys never recognized their own genius. They hated it. They wanted it dead, but it refuses to stay dead.

What happens next?

...

Pre-emptive commentary

I expect that as soon as anyone reads this article, I'll get a bunch of comments telling me that Google Wave is the answer, or Google Docs can now do everything Etherpad can do, or please try my MS Office clone of the week, etc. So let me be very specific.

First of all, look at the list of etherpad features I included above. I love all those features. If you want me to switch to a competing product for the times I currently use etherpad, I want all that stuff. I don't actually want anything else, so "we don't do X but we do Y instead, which is better!" is probably not convincing. (Okay, maybe I want inline file attachments and a few bugfixes. And wiki-like hyperlinks between etherpad documents, ooh!)

Specific things I hate about Google Wave (as compared to etherpad):

It's slower.
The plugins/templates make things harder, not easier.
Conversations are regimented instead of free-form; you end up with ThreadMess that takes up much more screen space than in etherpad, and you can't easily trim/edit other people's comments.
It has an "inbox" that forces me to keep track of all my documents, which defeats throwaway uses where etherpad excels.
Sharing with other users is a pain because they have to sign up and I have to authorize them, etc.
The Google Wave screen has more clutter and less content than the etherpad screen.
Google Wave has a zillion settings; etherpad has no learning curve at all.
Google Wave wants to replace my email, but that's simply silly, because I don't collaborate on my email.
Google Wave wants me to live inside it: it's presumptuous. Etherpad is a tool I grab when I want, and put down when I'm done.

Specific things I hate about Google Docs (as compared to etherpad):

It's slower.
The screen layout is very very crud-filled (menu bars, etc).
It creates obnoxious popovers all the time, like when someone connects/disconnects.
Its indication of who changed what is much clumsier.
Its limited IM feature treats conversation as transient and interruptive, not a valuable companion to the document.
The UI for sharing a document (especially with users outside @gmail.com) is too complicated for mere mortals, such as me, to make work. I'm told it can be done, but it's as good as missing.
I can't create throwaway documents because they clutter my personal "list of documents" page that I don't want to maintain.
I have to save explicitly. Except sometimes it saves automatically. Basically I have no idea what it's doing. Etherpad saves every keystroke and has a timeline slider; anybody can understand it.
It encourages "too much" WYSIWYG: like MS Word, it's trying to be a typewriter with paper page layouts and templates and logos and fonts and whatnot, and encourages people to waste their time on formatting. Etherpad has WYSIWYG formatting for bold/italic/etc, but it's lightweight and basic and designed for the screen, not paper, so it's not distracting.

There are probably additional things I would hate about Wave and Docs, but I avoid them both already because of the above reasons, so I don't know what those other reasons are. Conversely, I use etherpad frequently and love it. Try it; I think you will too.

Update 2011/03/13: In case you would like to know the true story instead of my made up one (yeah, right; that would be like reading the book instead of watching the TV movie), you can read a response by one of the etherpad creators. Spoiler: they have, at least physically, emerged from the Google Vortex.

Update 2011/03/13: Someone also linked to PiratePad, which is a modification of etherpad that includes #tags and [[hyperlinks]]. That means they accomplished one of my dreams: making it into a wiki!

Syndicated 2011-03-13 12:05:58 (Updated 2011-03-13 23:05:02) from apenwarr - Business is Programming

28 Feb 2011 (updated 28 Feb 2011 at 09:08 UTC) »

Insufficiently known POSIX shell features

I've seen several articles in the past with titles like "Top 10 things you didn't know about bash programming." These articles are disappointing on two levels: first of all, the tricks are almost always things I already knew. And secondly, if you want to write portable programs, you can't depend on bash features (not every platform has bash!). POSIX-like shells, however, are much more widespread.¹

Since writing redo, I've had a chance to start writing a few more shell scripts that aim for maximum portability, and from there, I've learned some really cool tricks that I haven't seen documented elsewhere. Here are a few.

Update 2011/02/28: Just to emphasize, all the tricks below work in every POSIX shell I know of. None of them are bashisms.

1. Removing prefixes and suffixes

This is a super common requirement. For example, given a *.c filename, you want to turn it into a *.o. This is easy in sh:

	SRC=/path/to/foo.c
	OBJ=${SRC%.c}.o

You might also try OBJ=$(basename $SRC .c).o, but this has an annoying side effect: it *also* removes the /path/to part. Sometimes you want to do that, but sometimes you don't. It's also more typing.

(Update 2011/02/28: Note that the above $() syntax, as an alternative to backquotes, is also valid POSIX and works in every halfway modern shell. I use it all the time. Backquotes get really ugly as soon as you need to nest them.)

Speaking of removing those paths, you can use this feature to strip prefixes too:

	SRC=/path/to/foo.c
	BASE=${SRC##*/}
	DIR=${SRC%$BASE}

These are cheap (ie. non-forking!) alternatives to the basename and dirname commands. The nice thing about not forking is they run much faster, especially on Windows where fork/exec is ridiculously expensive and should be avoided at all costs.

(Note that these are not quite the same as dirname and basename. For example, "dirname foo" will return ".", but the above would set DIR to the empty string instead of ".". You might want to write a dirname function that's a little more careful.)
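Here's a sketch of what such a "more careful" function might look like. The name my_dirname is made up, it returns its answer in DIR rather than on stdout (so the caller doesn't need to fork a subshell), and it assumes the path has no trailing slash, which the real dirname also copes with.

```shell
# my_dirname PATH: hypothetical helper that sets DIR without forking.
# Assumes PATH has no trailing slash.
my_dirname()
{
	local base
	base=${1##*/}
	DIR=${1%"$base"}         # quoted, so $base is not treated as a glob
	case $DIR in
		"") DIR=. ;;         # no slash at all: current directory
		/)  ;;               # file directly under the root
		*)  DIR=${DIR%/} ;;  # drop the trailing slash
	esac
}

my_dirname foo;             echo "$DIR"   # prints .
my_dirname /path/to/foo.c;  echo "$DIR"   # prints /path/to
my_dirname /foo;            echo "$DIR"   # prints /
```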

Some notes about the #/% syntax:

The thing you're stripping is a shell glob, not a regex. So "*", not ".*"
bash has a handy regex version of this, but we're not talking about bashisms here :)
The part you want to remove can include shell variables (using $).
Unfortunately the part you're removing *from* has to be just a variable name, so you might have to strip things in a few steps. In particular, removing prefixes *and* suffixes from one string is a two step process.
##/%% mean "the longest matching prefix/suffix" and #/% mean "the shortest matching prefix/suffix." So to remove the *first* directory only, you could use SUB=${SRC#*/}.
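To make the shortest/longest distinction and the two-step process concrete, here's a small example with a made-up filename that has two dots in it:

```shell
SRC=/path/to/foo.tar.gz

NAME=${SRC##*/}      # longest prefix ending in "/" removed:     foo.tar.gz
STEM=${NAME%%.*}     # longest suffix starting with "." removed:  foo
NOGZ=${NAME%.*}      # shortest suffix starting with "." removed: foo.tar
EXT=${NAME##*.}      # longest prefix ending in "." removed:      gz

echo "$NAME $STEM $NOGZ $EXT"   # prints foo.tar.gz foo foo.tar gz
```

Note that STEM needed two steps: first strip the directory into NAME, then strip the suffix from NAME.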

2. Default values for variables

There are several different substitution modes for variables that don't contain values. They come in two flavours: assignment and substitution, as well as two rules: empty string vs. unassigned variable. It's easiest to show with an example:

	unset a b c d
	e= f= g= h=
	
	# prints 1 2 3 4 6 8
	echo ${a-1} ${b:-2} ${c=3} ${d:=4} ${e-5} ${f:-6} ${g=7} ${h:=8}
	
	# prints 3 4 8
	echo $a $b $c $d $e $f $g $h

The "-" flavours are a one-shot substitution; they don't change the variable itself. The "=" flavours reassign the variable if the substitution takes effect. (You can see the difference by what shows in the second echo statement compared to the first.)

The ":" rules affect both unassigned ("null") variables and empty ("") variables; the non-":" rules affect only unassigned variables, but not empty ones. As far as I can tell, this is virtually the only time the shell cares about the difference between the two.

Personally, I think it's *almost* always wrong to treat empty strings differently from unset ones, so I recommend using the ":" rules almost all the time.

I also think it makes sense to express your defaults once at the top instead of every single time - since in the latter case if you change your default, you'll have to change your code in 25 places - so I recommend using := instead of :- almost all the time.
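A quick illustration of the difference, using a made-up variable called JOBS: with :- you'd have to repeat the default at every use, while := stores it the first time and leaves any existing value alone.

```shell
unset JOBS
echo "${JOBS:-4}"       # prints 4, but JOBS is still unset
echo "${JOBS-unset}"    # prints unset
echo "${JOBS:=4}"       # prints 4 and assigns it
echo "${JOBS-unset}"    # prints 4

JOBS=16
echo "${JOBS:=4}"       # prints 16: an existing value wins over the default
```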

If you're going to do that, I also recommend this little syntax trick for assigning your defaults exactly once at the top:

	: ${CC:=gcc} ${CXX:=g++}
	: ${CFLAGS:=-O -Wall -g}
	: ${FILES:=
		f1
		f2
		f3
	}

The trick here is the ":" command, a shell builtin that never does anything and throws away all its parameters. I find this trick to be a little more readable and certainly less repetitive than:

	[ -n "$CC" ] || CC=gcc
	[ -n "$CXX" ] || CXX=g++
	[ -n "$CFLAGS" ] || CFLAGS="-O -Wall -g"
	[ -n "$FILES" ] || FILES="
		f1
		f2
		f3
	"

3. You can assign one variable to another without quoting

It turns out that these two statements are identical:

	a=$b
	a="$b"

...even if $b contains characters like spaces, wildcards, or quotes. For whatever reason, the substitutions in a variable assignment aren't subject to further expansion, which turns out to be exactly what you want. If $b was "chicken; ls" you wouldn't really want the meaning of "a=$b" to be "a=chicken; ls". So luckily, it isn't.

If you've been quoting all your variable-to-variable assignments, you can take out the quotes now. By the way, more complex assignments like "a=$b$c" are also safe.
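If you don't believe me, here's a quick check with a deliberately nasty value. The caveat is that this only applies to the assignment itself; when you later *use* the variable as a command argument, you still want the quotes.

```shell
b='  two  spaces, a * wildcard, "quotes"; ls  '
a=$b        # no quotes needed on the right-hand side
c=$a$b      # concatenation is safe too

[ "$a" = "$b" ] && echo "a is an exact copy of b"
[ "$c" = "$b$b" ] && echo "c is b twice"
```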

4. Local vs. global variables

In early sh, all variables were global. That is, if you set a variable inside a shell function, it would be visible inside the calling function. For backward compatibility, this behaviour persists today. And from what I've heard, POSIX actually doesn't specify any other behaviour.

However, every single POSIX-compliant shell I've tested implements the 'local' keyword, which lets you declare variables that won't be returned from the current function. So nowadays you can safely count on it working. Here's an example of the standard variable scoping:

	func()
	{
		X=5
		local Y=6
	}
	X=1
	Y=2
	(func)
	echo $X $Y  # returns 1 2; parens throw away changes
	func
	echo $X $Y  # returns 5 2; X was assigned globally

Don't be afraid of the 'local' keyword. Pre-POSIX shells might not have had it, but every modern shell now does.

(Note: stock ksh93 doesn't seem to have the 'local' keyword, at least on MacOS 10.6. But ksh differs from POSIX in lots of ways, and nobody can agree on even what "ksh" means. Avoid it.)

5. Multi-valued and temporary exports, locals, assignments

For historical reasons, some people are afraid of mixing "export" with assignment, or putting multiple exports on one line. I've tested a lot of shells, and I can safely tell you that if your shell is basically POSIX compliant, then it supports syntax like these:

	export PATH=$PATH:/home/bob/bin CHICKEN=5
	local A=5 B=6 C=$PATH
	A=1 B=2
	
	# sets GIT_DIR only while 'git log' runs
	GIT_DIR=$PWD/.githome git log
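You can watch the temporary form at work without needing git around. In this sketch GREETING is a made-up variable and the child is just another sh:

```shell
unset GREETING
GREETING=hello sh -c 'echo "child sees: $GREETING"'   # prints child sees: hello
echo "parent sees: ${GREETING-nothing}"               # prints parent sees: nothing
```

The assignment is exported to that one command and is gone again as soon as it finishes.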

6. Multi-valued function returns

You might think it's crazy that variable assignments by default leak out of the function where you assigned them. But it can be useful too. Normally, shell functions can only return one string: their stdout, which you capture like this:

	X=$(func)

But sometimes you really want to get *two* values out. Don't be afraid to use globals to accomplish this:

	getXY()
	{
		X=$1
		Y=$2
	}
	
	testy()
	{
		local X Y
		getXY 7 8
		echo $X-$Y
	}
	
	X=1 Y=2
	testy       # prints 7-8
	echo $X $Y  # prints 1 2

Did you catch that? If you run 'local X Y' in a calling function, then when a subfunction assigns them "globally", it still only affects your local ones, not the global ones.

7. Avoiding 'set -e'

The set -e command tells your shell to die if a command returns nonzero in certain contexts. Unfortunately, set -e *does* seem to be implemented slightly differently between different POSIX-compliant shells. The variations are usually only in weird edge cases, but it's sometimes not what you want. Moreover, "silently abort when something goes wrong" isn't always the goal. Here's a trick I learned from studying the git source code:

	cd foo &&
	make &&
	cat chicken >file &&
	[ -s file ] ||
	die "resulting file should have nonzero length"

(Of course you'll have to define the "die" function to do what you want, but that's easy.)

This is treating the "&&" and "||" (and even "|" if you want) like different kinds of statement terminators instead of statement separators. So you don't indent lines after the first one any further, because they're not really related to the first line; the && terminator is a statement flow control, not a way to extend the statement. It's like terminating a statement with a ; or & - each type of terminator has a different effect on program flow. See what I mean?

It takes a little getting used to, but once you start writing like this, your shell code starts getting a lot more readable. Before seeing this style, I would tend to over-indent my code, which actually made it worse instead of better.

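By the way, take special note of how && and || combine here. In sh they have equal precedence and are evaluated left to right (unlike in C, where && binds tighter). So all the && statements clump together: if *any* of them fail, the remaining ones are skipped and we fall through to the other side of the || and die.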

Oh, as an added bonus, you can use this technique even if set -e is in effect: capturing the return value using && or || causes set -e to *not* abort. So this works:

	set -e
	mv file1 file2 || true
	echo "we always run this line"

Even if the 'mv' command fails, the program doesn't abort. (Because this technique is available, redo always runs all its scripts with set -e active so it can be more like make. If you don't like it, you can simply catch any "expected errors" as above.)
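For completeness, here's one minimal way the "die" function used above could look. This is just a sketch; what it prints and which exit code it uses are up to you.

```shell
# die MESSAGE...: print a message to stderr and exit with status 1.
die()
{
	printf '%s\n' "${0##*/}: $*" >&2
	exit 1
}

# Example: run in a subshell so that the demo itself keeps going.
( [ -s /nonexistent/file ] || die "resulting file should have nonzero length" ) ||
echo "subshell exited with status $?"   # prints subshell exited with status 1
```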

8. printf as an alternative to echo

The "echo" command is chronically underspecified by POSIX. It's okay for simple stuff, but you never know if it'll interpret a word starting with dash (like -n or -c) as an option or just print it out. And ash/dash/busybox, for example, have a weird "feature" where echo interprets "echo \n" as a command to print a newline. Which is fun, except no other shell does that. The others all just print backslash followed by n.

There's good news, though! It turns out the "printf" command is available everywhere nowadays, and its semantics are much more predictable. Of course, you shouldn't write this:

	# DANGER!  See below!
	printf "path to foo: $PATH_TO_FOO\n"

Because $PATH_TO_FOO might contain format specifiers like %s, which would confuse printf. But you *can* write your own version of echo that works just how you like!

	echo()
	{
		# remove this line if you don't want to support "-n"
		[ "$1" = -n ] && { shift; FMT="%s"; } || FMT="%s\n"
		printf "$FMT" "$*"
	}
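And for the one-off case flagged as dangerous above, the direct fix is to keep the variable out of the format string entirely and pass it as an argument to %s. The value below is made up to be as annoying as possible:

```shell
PATH_TO_FOO='/tmp/100%s/foo\n'
printf 'path to foo: %s\n' "$PATH_TO_FOO"   # prints path to foo: /tmp/100%s/foo\n
```

Arguments substituted through %s are printed verbatim, so neither the %s nor the \n inside the value gets interpreted.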

9. The "read" command is crazier than you think

This is both good news and bad news. The "read" command actually mangles its input pretty severely. It seems the "-r" option (which turns off the mangling) is supported on all the shells that I've tried, but I haven't been able to find a straight answer on this one; I don't think -r is POSIX. But if everyone supports it, maybe it doesn't matter. (Update 2011/02/28: yes, it's POSIX. Thanks to Alex Bradbury for the link.)

The good news is that the mangling behaviour gives you a lot of power, as long as you actually understand it. For example, given this input file, testy.d (produced by gcc -MD -c testy.c):

	testy.o: testy.c /usr/include/stdio.h /usr/include/features.h \
	  /usr/include/sys/cdefs.h /usr/include/bits/wordsize.h \
	  /usr/include/gnu/stubs.h /usr/include/gnu/stubs-32.h \
	  /usr/lib/gcc/i486-linux-gnu/4.3.2/include/stddef.h \
	  /usr/include/bits/types.h /usr/include/bits/typesizes.h \
	  /usr/include/libio.h /usr/include/_G_config.h \
	  /usr/include/wchar.h \
	  /usr/lib/gcc/i486-linux-gnu/4.3.2/include/stdarg.h \
	  /usr/include/bits/stdio_lim.h \
	  /usr/include/bits/sys_errlist.h

You can actually read all that content like this:

	read CONTENT <testy.d

...because the 'read' command understands backslash escapes! It removes the backslashes and joins all the lines into a single line, just like the file intended.

And then you can get a raw list of the dependencies by removing the target filename from the start:

	DEPS=${CONTENT#*:}

Until I discovered this feature, I thought you had to run the file through sed to get rid of all the extra junk - and that's one or more extra fork/execs for every single run of gcc. With this method, there's no fork/exec necessary at all, so your autodependency mechanism doesn't have to slow things down.
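Here's the whole thing end to end on a tiny stand-in for testy.d, with -r shown for comparison. The file names and the scratch file location are made up for the example.

```shell
# Build a small stand-in for testy.d (hypothetical file names).
DEPFILE=${TMPDIR:-/tmp}/read-demo.$$
printf '%s\n' 'testy.o: testy.c a.h \' '  b.h \' '  c.h' >"$DEPFILE"

read COOKED <"$DEPFILE"      # backslash-newlines are joined into one line
read -r RAW <"$DEPFILE"      # -r: first physical line only, backslash kept
rm -f "$DEPFILE"

DEPS=${COOKED#*:}
for d in $DEPS; do
	echo "dep: $d"           # testy.c, a.h, b.h, c.h in turn
done
echo "raw: $RAW"             # prints raw: testy.o: testy.c a.h \
```

The unquoted $DEPS in the for loop is deliberate: we *want* the shell to split it on whitespace. That does mean this approach assumes no spaces in your filenames.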

10. Reading/assigning a variable named by another variable

Say you have a variable $V that contains the name of another variable, say BOO, and you want to read the variable pointed to by $V, then do a calculation, then write back to it. The simplest form of this is an append operation. You *can't* just do this:

	# Doesn't work!
	$V="$$V appended stuff"

...because "$$V" is actually "$$" (the current process id) followed by "V". Also, even this doesn't work:

	# Also doesn't work!
	$V=50

...because the shell assumes that after substitution, the result is a command name, not an assignment, so it tries to run a program called "BOO=50".

The secret is the magical 'eval' command, which has a few gotchas, but if you know how to use it exactly right, then it's perfect.

	append()
	{
		# declare first: some shells word-split "local tmp=..."
		local tmp
		eval tmp=\$$1
		tmp="$tmp $2"
		eval $1=\$tmp
	}
	
	BOO="first bit"
	append BOO "second bit"
	echo "$BOO"

The magic is all about where you put the backslashes. You need to do some of the $ substitutions - like replacing "$1" with "BOO" - before calling eval on the literal '$BOO'. In the second eval, we want $1 to be replaced with "BOO" before running eval, but '$tmp' is a literal string parsed by the eval, so that we don't have to worry about shell quoting rules.

In short, if you're sending an arbitrary string into an eval, do it by setting a variable and then using \$varname, rather than by expanding that variable outside the eval. The only exception is for tricks like assigning to dynamic variables - but then the variable name should be controlled by the calling function, who is presumably not trying to screw you with quoting rules.
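Following that rule, the two basic operations can be wrapped up as a pair of tiny helpers. The names setvar and getvar are my own invention here, and both assume the *name* you pass in is a trusted, valid variable name.

```shell
# setvar NAME VALUE: assign VALUE to the variable called NAME.
# eval sees "NAME=$2", so the value itself is never re-parsed.
setvar()
{
	eval "$1=\$2"
}

# getvar NAME: print the value of the variable called NAME.
# eval sees: printf '%s\n' "$NAME"
getvar()
{
	eval "printf '%s\n' \"\$$1\""
}

setvar BOO 'spaces, a * and "quotes"; ls'
getvar BOO   # prints spaces, a * and "quotes"; ls
```

Keep in mind that capturing the output with $(getvar BOO) costs a subshell, so for hot paths the inline eval from the append example is cheaper.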

11. "read" multiple times from a single input file

This problem is one of the great annoyances of shell programming. You might be tempted to try this:

	(read x; read y) <myfile

But it doesn't work; the subshell eats the variable definitions. The following does work, however, because {} blocks aren't subshells, they're just blocks:

	{ read x; read y; } <myfile

Unfortunately, the trick doesn't work with pipelines:

	ls | { read x; read y; }

Because every sub-part of a pipeline is implicitly a subshell whether it's inside () or not, so variable assignments get lost.

A temp file is always an option:

	ls >tmpfile
	{ read x; read y; } <tmpfile
	rm -f tmpfile

But temp files seem rather inelegant, especially since there's no standard way to make well-named temp files in sh. (The mktemp command is getting popular and even appears in busybox nowadays, but it's not everywhere yet.)

Alternatively you can capture the entire output to a variable:

	tmp=$(ls)

But then you have to break it into lines the hard way (using the eval trick from above):

	nextline()
	{
		local INV=$1 OUTV=$2
		local IN
		eval IN=\$$INV
		
		local IFS=""
		local newline=$(printf "\nX") 
		newline=${newline%X}
		
		[ -z "$IN" ] && return 1
		local rest=${IN#*$newline}
		if [ "$rest" = "$IN" ]; then
			# no more newlines; return remainder
			eval $INV= $OUTV=\$rest
		else
			local OUT=${IN%$rest}
			OUT=${OUT%$newline}
			eval $INV=\$rest $OUTV=\$OUT
		fi
	}
	
	tmp=$(echo "hello 1"; echo "hello 2")
	nextline tmp x
	nextline tmp y
	echo "$x-$y"  # prints "hello 1-hello 2"

Okay, that's a little ugly. But it works, and you can steal the nextline function and never have to look at it again :) You could also generalize it into a "split" function that can split on any arbitrary separator string. Or maybe someone has a cleaner suggestion?
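One possibly cleaner option, at least when you know up front how many lines you want, is to feed the command's output to the {} block through a here-document. This is only a sketch: the two echo commands stand in for whatever command you're really running, and some shells implement here-documents with a temp file behind the scenes, so it isn't necessarily cheaper than the temp file approach, just tidier.

```shell
# The {} block runs in the current shell, so x and y survive.
# The $(...) is expanded inside the here-document body before it is read.
{ read -r x; read -r y; } <<EOF
$(echo "hello 1"; echo "hello 2")
EOF

echo "$x-$y"
```

Unlike nextline, this doesn't let you pull lines off one at a time from different places in your script, so which one is better depends on what you're doing.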

Parting Comments

I just want to say that sh is a real programming language. When you're writing shell scripts, try to think of them as programs. That means don't use insane indentation; write functions instead of spaghetti; spend some extra time learning the features of your language. The more you know, the better your scripts will be.

When early versions of git were released, they were mostly shell scripts. Large parts of git (like 'git rebase') still are. You can write serious code in shell, as long as you treat it like real programming.

autoconf scripts are some of the most hideous shell code imaginable, and I'm a bit afraid that a lot of people nowadays use them to learn how to program in sh. Don't use autoconf as an example of good sh programming! autoconf has two huge things working against it:

It was designed about 20 years ago, *long* before POSIX was commonly available, so its authors avoided really critical stuff like functions. Imagine trying to write a readable program without ever breaking it into functions!
Because of that, their scripts are generated by macro expansion (a poor man's functions), so a generated 'configure' script is more like compiler output than something any real programmer would write.

autoconf solves a lot of problems that have not yet been solved any other way, but it comes with a lot of historical baggage and it leaves a bit of a broken window effect. Please try to hold your shell code to a higher standard, for the good of all of us. Thanks.

Footnote

¹ Of course, finding a shell with POSIX compliance is rather nebulous. The reason autoconf 'configure' scripts are so nasty, for example, is that they didn't want to depend on the existence of a POSIX-compliant shell back in 1992. On many platforms, /bin/sh is anything but POSIX compliant; you have to pick some other shell. But how? It's a tough problem. redo tests your locally-installed shells and picks one that's a good match, then runs it in "sh mode" for maximum compatibility. It's very refreshing to just be allowed to use all the POSIX sh features without worrying. By the way, if you want a portable trying-to-be-POSIX shell, try dash, or busybox, which includes a variant of dash. On *really* ancient Unixes without a POSIX shell, it makes much more sense to just install bash or dash than to forever write all your scripts to assume it doesn't exist.

"But what about Windows?" I can hear you asking. Well, of course you already know about Cygwin and MSys, which both have free ports of bash to Windows. But if you know about them, you probably also know that they're gross: huge painful installation processes, messing with your $PATH, incompatible with each other, etc. My preference is the busybox-w32 (busybox-win32) project, which is a single 500k .exe file with an ash/dash-derived POSIX-like shell and a bunch of standard Unix utilities, all built in. It still has a few bugs, but if we all help out a little, it could be a great answer for shell scripting on Windows machines.

Syndicated 2011-02-28 04:21:46 (Updated 2011-02-28 09:08:39) from apenwarr - Business is Programming

22 Feb 2011 »

Chip-and-pin is *not* broken

I've seen this article about the supposed security holes with chip-and-pin credit cards making the rounds lately. As with my previous article on smartcard PINs, I have just enough knowledge to be dangerous. Which in this case means just enough knowledge to tell you why this latest attack on chip cards is not a very exciting one.

In short, the attackers in the above article have revealed a simple way for anyone to use a stolen chip credit card, without knowing the PIN, regardless of whether or not the credit card reader is online at the time. This security hole is real. I'm not disputing that.

However, it's also not as bad as people are making it sound. Most importantly, chip cards are not even *close* to as insecure as the old ("magstripe") cards they were designed to replace. And while those old cards had annoyingly high levels of fraud, you as a consumer really didn't need to care, because big faceless megacorporations paid for it. In this case, it's still more secure than before, so you should care even less than before.

Here are three reasons why the current security hole is not very exciting:

1. You have to *physically steal the card*.

Most fraud on magstripe cards was from *copying*: anyone with a bit of technical skill can trivially copy a magstripe card just by buying a magstripe writer device for less than $200. Since magstripes are used all over for much more than just financial stuff, there's no regulation about who can buy such devices.

So a common form of attack on magstripes is social engineering. It works like this: go into a store where they swipe your card on a machine. The machine then says "card read error" or "denied" or whatever, and you give up and use a different card or cash. Or maybe they try again on a different terminal, and it mysteriously succeeds this time. But in the meantime, the first terminal has recorded the data on your magstripe - probably less than 1k of data per card. The criminal can pick up the data later, and use it to clone any card that has ever been read by the illegal reader.

Ah, but what about your signature on the back of the card? And how about the hologram of your face that's on some cards? How do they copy *that*, huh? Easy, they don't: they just rewrite the magstripe on a card with *their* picture and signature on it. When they take it to a store, the card is physically theirs, but the account number is yours, so they're spending your money, not theirs. Ouch.

Better still, they can simply re-rewrite the magstripe on their card back to their original identity later, leaving little evidence.

So anyway, be suspicious of any vendor whose scan of your card "fails," especially if they look like they were expecting it. If their reader was *that* unreliable, they would have stopped using it by now. Report them to your bank. It might help them find some fraudsters.

But anyway, that form of fraud is soon to disappear: your credit card still has a magstripe, but it's only for backwards compatibility with old magstripe-only readers. Chip cards are completely immune to this sort of attack, because the chip interface doesn't hand over the card's secret key: it only gives the reader a one-time transaction authorization code, which you can't use to construct a new card.

Of course, someone who clones your card's magstripe can still copy and use it in a magstripe reader, so you're still susceptible to fraud. But it gets less valuable every day, and vendors who sell expensive stuff - the ideal fraud targets - are the first to upgrade their readers to support chip cards. Someday the magstripe "feature" will be safely turned off entirely.

(It *is* still possible to read the real account number and secret key information from a chip card and therefore clone it. But it requires extremely messy and unreliable equipment. As far as I know, there are no simple, reliable machines that can do this in a store setting without arousing suspicion. Moreover, any such machine would obviously have only-nefarious purposes, so you'll never be able to buy one for $200 like you can with general purpose magstripe devices.)

By the way, the extra "security code" on the back of your card is a partial protection against the card-copying attack: the security code isn't on the magstripe. So someone who copies your card can't use it for web purchases that need the security code, unless they manually write down the security code while stealing your card, and that would be too obvious. (Unless you're at a restaurant, and the server takes your card into a back room for "processing." But you wouldn't ever let them do that, right?)

2. You have to hack the *vendor's* card reading terminal.

Another handy thing about the old magstripe card attacks is they work at any store, on any credit card terminal. (Nowadays they only work at terminals that *don't* support chip cards, because the bank refuses magstripe transactions from chip-capable terminals on chip-capable cards. Getting better!)

That means the standard form of attack would be to steal your card number at a shady corner store or restaurant, then delay for an arbitrary amount of time - you don't want it to be too obvious which shady store stole the number, since they probably stole a bunch all at the same time. Then, take it to an expensive store like Future Shop and buy stuff... untraceably. It has your name on it, not the fraudster's, and the vendor is Future Shop, not the fraudster. The perfect crime.

With the chip card hack we're discussing, your crime isn't so perfect anymore. Above, I said that you need to *steal* the card instead of copying it. That's hard enough to do, but it's doubly hard for another reason: the owner will probably notice and immediately call their bank to cancel it. That means you can't wait a few weeks or months before using the stolen card - you have to do it *fast*, before the owner notices. So even if you come up with a clever way of stealing cards in large quantities, it'll be a lot easier for the police to track down the theft just by using statistics.

But even worse, to execute the chip-card-without-PIN hack, you have to break the card *reader* device and make it lie to both the card and the bank. That's not so hard to do, if you're an excellent programmer. Much harder than just copying a magstripe, which any halfway competent techie could do with off-the-shelf equipment. But yes, as with any DRM, it's crackable, so somebody can do it. Based on the "Chip and PIN is Broken" paper, someone already has.

What's *much* harder, though, is getting a store to *use* your hacked card reader on their merchant account. You can't just walk into Future Shop and hand them a card and a special card reader and say, "Yeah, charge it to this." No, you'll have to *infiltrate* a Future Shop and get them to use your special card reader. Not only is that much more traceable - suddenly there are people at Future Shop who *know* who the criminal is - but it's also not much easier than just stealing the stupid TV in the first place. I mean, if you can sneak into Future Shop, then you can sneak out of Future Shop, right?

Or you could get your own merchant account, I guess, and charge the money through that. But no... not likely. What could be more traceable than opening a merchant account at a bank and then running fraudulent transactions through it? I'm sure there's some way to use a fake identity or something, but again, that's *way* harder than walking into a chain store, buying something, and walking out again.

3. The only reason it even works is for backwards compatibility.

The third reason this attack is rather boring is that it exploits an *optional* feature of the EMV specification. Essentially, they independently convince the card, and the bank, that they really don't need a PIN today. In a simpler world, that would be impossible; the card would demand a PIN, and the bank would demand a PIN-authenticated transaction. But because of backwards compatibility and maybe some too-flexible specifications, it's possible to convince the card that the bank has authenticated the PIN, *and* convince the bank that the card has authenticated the PIN, all at the same time.

Yes, it's a real security hole caused by specification bugs; that attack simply shouldn't be possible to do. And yes, because of that attack, a physically stolen card can be used on a hacked card reader without knowing the PIN, so banks should worry about fraud.

But from my reading of the EMV specification (the spec is available for free online, by the way; google it), the particular modes that make this attack possible are optional. You can just turn them off. Banks just don't want to, because in some situations, those modes let you do a transaction (ie. spend money, thus earning the bank money) where you otherwise couldn't.

If I understand correctly - and maybe I don't, as I didn't read the exact form of the attack too carefully - the main point of confusion is that PINs can be verified either online (by the bank) or offline (by the card), and which one we use is determined by a negotiation between the card, the reader, and the bank. (Interestingly, exactly this was the focus of my previous article on chip cards.) If the reader lies to the card and says the bank isn't available right now (offline mode), and lies to the bank and says the card has requested offline PIN verification (maybe the card is set to *require* offline verification; that's one of the options in the spec), then the transaction can go through.

Moreover, the bank will store a "PIN Verified" flag that supposedly means the transaction is known to be much safer than an unverified (eg. magstripe) transaction. That might tell a bank's auto-fraud-detection algorithms to relax more than they should, which is probably the *real* story here. (If you're a bank, you should care about this security hole. If you're a normal person, probably not.)

Bonus trivia: incentives

By the way, here's a bit of information I ran into that I found interesting:

In the magstripe days, most fraud was by default considered the responsibility of the *vendor*. That is, if a store accepted a fraudulent card and you later reversed the transaction, it was the store who lost the money, not the bank. This seems really cruel to store owners, but the idea was that if you don't give them an incentive to reject faked cards, then they won't be careful. For example, a store has no reason to check your signature or your photo ID if it's not them who'll lose in case of fraud. In fact, they have an incentive to *not* check those things, since checking them slows them down, costs money, and discourages you from shopping there. (Plus, a store can make their own decisions about how careful they want to be. Do we want a fast checkout line in exchange for higher fraud, or a slow checkout line with less fraud? What saves us the most money in the long run?)

Remember, even without professional magstripe fraud, there was still the even more trivial low-tech kind: a teenager steals a credit card from their parents' wallet, walks into a store, and buys something. Signature/photo checks are really hopelessly weak, but at least they reduce *that*.

Now, putting all the blame on the vendor was supposedly the default, but I suspect it wasn't what actually happened. I'm guessing that, officially or unofficially, if it was proven that a real forged card was used - as opposed to the vendor just not checking the signature carefully enough - the bank would take on some of the loss. Maybe half, maybe all of it, who knows.

As magstripe fraud got more and more out of control, the industry started switching to chip cards. The problem is, you can't switch the entire world from magstripe to chip all in one day. It's a chicken-and-egg problem. So how do you make it worthwhile for banks to start issuing chip cards - since magstripe fraud isn't their problem - *and* for stores to upgrade to chip readers?

Someone somewhere (in government, perhaps?) came up with a neat solution: we'll change the liability rules. Nowadays it works like this: if a vendor has a chip-capable reader, card fraud is the bank's fault, so they pay for it. If a vendor only has an old magstripe reader, the vendor pays for it. And this is true whether or not the particular credit card is chip capable, because if it's not, then it's the bank's fault for not issuing a newer card.

I find this to be an extremely clever social hack. It aligns everybody's best interests toward reducing fraud, but requires no fines, laws, agreed-upon industry-wide schedules, deprecation periods, or enforcement.

Summary

The "Chip and PIN is Broken" attack:

only works if your card is physically stolen;
only works if the criminal uses a traceable vendor with a modified terminal;
can be disabled someday by turning off some optional protocol features;
is a liability issue for banks and/or vendors, but not consumers like you.

My advice:

If your card is stolen, report it immediately.
If a vendor has trouble reading your card and doesn't look surprised, consider reporting it to your bank.
Yes, you still need to monitor your bank statement for incorrect transactions (not just for fraud; incompetence remains rampant too).
If anyone asks you whether chip cards are more secure than the alternatives, say YES YES OH GOD YES, PLEASE PLEASE LET THE OLD SYSTEM DIE NOW.

I hope that clears things up.

Syndicated 2011-02-22 12:56:37 from apenwarr - Business is Programming

10 Feb 2011 (updated 10 Feb 2011 at 05:04 UTC) »

Daring Fireball linked to me

Oh wow, not only am I internet famous, but now Daring Fireball used my uninformed speculation to inform their uninformed speculation! This, after my original uninformed speculation was inspired by theirs!

I am actually part of an internet circle jerk. Wow. This is too awesome. I love you guys. <snif>

Update 2011/02/09: And The Guardian linked to me too, apparently.

Syndicated 2011-02-10 03:12:33 (Updated 2011-02-10 05:04:03) from apenwarr - Business is Programming
