Older blog entries for apenwarr (starting at number 626)

13 Jan 2013 (updated 13 Jan 2013 at 14:02 UTC) »

In which the White House outshines Canadian Politics

People who are aware of my political view template know that I try to follow a simple process, which is to try to reject low-quality arguments that resort to rhetoric and personal attacks. The result is I sometimes sound like I'm in favour of some policy or motion that I actually disagree with (or vice versa) because I tend to end up arguing about the presentation, and noting the complexity of the problem, rather than just choosing a side and joining the fray. Since I complain that you're being stupid, you assume that I think the opposite point of view is less stupid, but that's missing the point.

In short, I want to see politicians (and politically interested citizens) raising the level of discourse. Having written off American politics long ago, I'm still disappointed when Canadians resort to meaningless sludge instead of stopping to understand what's going on.

So imagine my surprise when I discovered an actual U.S. political web site with actual facts and opinions and policy statements from an actual political party, responding to questions from actual citizens in the hope of raising the level of discourse.

The web site I'm referring to is the whitehouse.gov online petition system. In short, they promise to have some senior policymaker respond to your petition, no matter how stupid, if you can get at least 25,000 people to online-sign it. (25,000 is roughly 0.008% of the population of the United States, so that seems like a reasonable bar for getting the attention of a high-level executive.)

Note what they promise: not that they'll change anything, or that the president himself will read your message, or that the response will be useful. Just that they'll respond, and the response will come from some actual person who matters. The content of the response, well, you'll have to judge that for yourself.

(This reminds me of the rules for petitioning the Government of Canada, except doing that only needs 25 signatures instead of 25,000. On the other hand, you're only guaranteed your petition will be read in parliament, and you probably won't get any response at all, other than the hope they might be thinking about it.)

So, how does it turn out? Well, I read through a few of the responses. Apparently there are 96 existing responses, which seems like a good number to me: it means the filter is blocking out the idiotic petitions (and oh boy, idiotic ones exist) but not just silencing everybody (the total number of responses is bigger than I want to read). Moreover, they sometimes combine multiple related petitions into one response (even if each one has less than 25,000 votes) and sometimes respond to petitions with less than 25,000 even though they didn't promise to do so. That tells me real people are actually reading all the petitions and looking for input, even though they don't have to. Moreover, there are less than 40 petitions open right now with more than 25,000 votes and no responses. Since that's less than half the total responses, that suggests to me that there's simply a time delay to answer them (which I'd expect), not that they don't take it seriously. And I doubt they're just deleting petitions they don't like, since anything that managed to get 25,000 signatures would obviously generate a major internet fuss if the signees found it missing.

So yes, the 25,000 signature threshold works, the accountability works, the promises are being kept, and there are actual answers up there.

Are the answers partisan? Of course, they're written by a political party. Are they all satisfying? No, sometimes they just avoid the question and don't bother to back up their claims, like the Transportation Security Administration one. (On the other hand, the petition itself wasn't so hot either.)

But what I do see is a real effort to respond in a way that really represents what the administration believes. You might not like the TSA response, but after reading it, you know exactly what their policy is about it. There are also things like the several immigration reform responses that are ultra-clear about the policy and beliefs - while admitting that, well, you kinda came to the wrong place, because the President isn't the one who sets the immigration policy.

Even the ones with a "blame the republicans" section, like the NASA funding response, do it pretty respectfully. They say "unfortunately, not everyone is supportive" and explain some problems of the alternative policy, but they do it with a tone that encourages you to think about, and maybe talk to, your representatives to see if you can change their minds. They don't start from the assumption that the alternative viewpoint is idiotic and the only solution is to vote them the hell out. I can respect that.

Canada should have this. The U.S. House and Senate should have this, or at least the Democrats and the Republicans. You know what would be cool? If every party, not just the one in power, submitted a response to every petition that got 25,000 votes, to make their position clear, and we could read them side by side and decide what we believe. And if they could refrain from personal attacks and stick to the issues, like the current site does, and campaigns and TV debates generally don't.

That would be progress.

Syndicated 2013-01-13 13:08:50 (Updated 2013-01-13 14:02:10) from apenwarr - Business is Programming

3D Printing

My first 3d-printed creation (and my 3d model that I printed it from). The photo below is three printings of the same design at different sizes:

The entire car prints, bottom to top, as a single run, and yes, those wheels actually turn. Each two wheel + axle combination is a single solid object, and the frame between the two actually has closed loops around the axles. So we have the magical-seeming trick of passing the solid axles through the loops - without needing any welding or gluing after the fact. Print it, unstick it, and roll it off the platform.

Playing with this has really connected a few physical-world concepts that didn't click for me before. For example, measurement tolerances are absolutes, not percentages. The printer I used is accurate to about 0.1mm. Previously, I had never cared about any distances less than a millimeter ("If it ain't on my ruler, it don't exist") but at this scale (the smallest car is only 1cm tall), it matters. To make the loops that wrap around the axle so they don't blur into the axle itself, I had to leave quite a bit of extra space. This produced a tight fit for the smallest car, but when we naively scale up the design linearly (the big one is about 3cm tall) that excess space leaves the axles pretty floppy. (Or we can call it 4-wheel steering, and then it's a feature.)

The other scaling-related lesson comes back to that old interview question: if somehow you were shrunk down to the size of a coin and put inside a blender, how would you escape?

The answer is that you'd jump out. Why? Because your body's mass scales down with n^3, but the strength in your muscles scales down much more slowly - let's say n^2. Relative to your size, you'd have super strength. That's why grasshoppers can easily leap to 100x their height, but nothing the size of a human can do so.

For the same reason, when you drop my big car on the floor, a wheel tends to break off. When you drop the little one on the floor, it stays intact. Why? Because the mass (and thus the force with which it hits the floor) has scaled up by 3^3 = 27, but the strength of each joint, which scales with its cross-sectional area, is only 3^2 = 9x higher. Relative to its strength, the big car hits the floor about 3x harder - enough to break the rather weak connection between axle and wheel. That could presumably be fixed by better engineering, but I'm not really That Kind of Engineer.
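
The scaling arithmetic above can be sketched in a few lines of javascript (plain arithmetic; the function name is just for illustration):

```javascript
// Square-cube law: scaling a design by a linear factor n multiplies
// mass (volume) by n^3 but strength (cross-sectional area) by only n^2,
// so the force-to-strength ratio gets worse by a factor of n.
function scaleStress(n) {
  var mass = Math.pow(n, 3);      // impact force scales with mass
  var strength = Math.pow(n, 2);  // strength scales with area
  return mass / strength;         // relative stress on each joint
}

// The 3cm car is 3x the size of the 1cm car, so its axle joints take
// roughly 3x the relative stress when it hits the floor:
console.log(scaleStress(3));  // 3
```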

In short, 3D printing is fun. But really it just makes me that much more impatient for nanobots. Ooh, micron-scale accuracy and toy cars made of diamonds!

Syndicated 2012-12-29 11:29:01 from apenwarr - Business is Programming

Programming inside the URL string

Afterquery is hard to explain to people, possibly because it actually combines several pretty unusual concepts. A single unusual concept is bad enough, but several at once is likely to just leave you scratching your head. With that in mind, here's just one unusual concept: a programming language designed for URLs. Not a language for manipulating URLs; the language *is* the URL.

In afterquery, if I write this:

  http://afterquery.appspot.com?url=example.json&group=a,b,c&group=a,b

Then (assuming a..d are strings and e is numeric) it's about equivalent to this SQL:

  select a, b, count(c) as c, sum(d) as d, sum(e) as e
  from (
    select a, b, c, count(d) as d, sum(e) as e
    from example
    group by a, b, c
  )
  group by a, b

This gives, for each combination of a and b, the number of distinct values of c, the number of distinct combinations of (c,d), and the sum of e - each a slightly different useful aggregate.

Some people have used SQL for so long now that they don't remember anymore exactly how redundant the language is. The above SQL mentions 'e' 4 times! The afterquery code doesn't mention it at all; if you say you want to group by a, b, and c, then the assumption is that e (one of the columns in the initial dataset) is an aggregate value field and doesn't need to be mentioned unless you want an unusual aggregate. (The default aggregate is count() for non-numeric fields, and sum() for numeric fields.)
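
As a rough sketch of that rule (illustrative javascript, not afterquery's actual implementation), a single group step might look like this:

```javascript
// Group rows by the named key columns; every other column is aggregated
// automatically: sum() if its values are numeric, count() otherwise.
function groupby(rows, keys) {
  var groups = {};
  var order = [];
  rows.forEach(function(row) {
    var k = keys.map(function(f) { return row[f]; }).join('\0');
    if (!groups[k]) {
      groups[k] = {};
      keys.forEach(function(f) { groups[k][f] = row[f]; });
      order.push(k);
    }
    var g = groups[k];
    Object.keys(row).forEach(function(f) {
      if (keys.indexOf(f) >= 0) return;
      if (typeof row[f] === 'number') {
        g[f] = (g[f] || 0) + row[f];  // sum() for numeric columns
      } else {
        g[f] = (g[f] || 0) + 1;      // count() for everything else
      }
    });
  });
  return order.map(function(k) { return groups[k]; });
}

var rows = [{a: 'x', c: 'p', e: 1},
            {a: 'x', c: 'q', e: 2},
            {a: 'y', c: 'p', e: 5}];
groupby(rows, ['a']);
// → [{a: 'x', c: 2, e: 3}, {a: 'y', c: 1, e: 5}]
```

Note that running two of these back to back - first by a,b,c, then by a,b - gives the same shape as the nested SQL query above, with no column ever named explicitly.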

The sequence of clauses in SQL is also problematic because it's both arbitrary and restrictive. It doesn't reflect the order of operations; in reality, among other things, "from" obviously comes before "select." (Incidentally, LINQ in C# puts "from" first, for that reason.) Worse, "group by" is highly related to the select operation - whether or not something must be aggregated depends on whether it's in the "group by" clause or not, and yet "group by" is way down in the query while "select" is at the top. And worst of all, the entire "group by" clause is actually redundant: you could calculate the "group by" clause entirely by looking at which fields in the select contain aggregation functions and which don't. You know this is true, because if your select clause doesn't use aggregation functions for exactly the right fields (no more, and no less), then your query will abort with an error.

There's a lot of danger in trying to make a programming language that reads too much like English. Maybe it can be done, but you have to be tasteful about it. SQL is not tasteful; it's designed with the same mindset that produced COBOL. (The joke is that an object oriented COBOL would be called "ADD 1 TO COBOL GIVING COBOL", which is the COBOL equivalent of C++ or [incr tcl]. That line is the actual code for incrementing a number in COBOL. Compare with "SELECT sql+1 FROM sql WHERE sql=0 GROUP BY sql".)

Still though, despite SQL's insulting repetitiveness, it has one even more depressing advantage: it's still the most concise way to express a database query of moderate complexity. C# LINQ is the closest thing we have to a competitor - it was specifically designed to try to replace SQL because coding database queries in an imperative language is too messy - and our above nested grouping looks something like this (based on their sample code - I haven't used LINQ in a while and haven't tested it):

  var tmp =
    from x in example
    group x by new { x.a, x.b, x.c } into g
    select new {
      g.Key.a, g.Key.b, g.Key.c,
      d = g.Count(),
      e = g.Sum(p => p.e)
    };
  var result =
    from r in tmp
    group r by new { r.a, r.b } into g
    select new {
      g.Key.a, g.Key.b,
      c = g.Count(),
      d = g.Sum(p => p.d),
      e = g.Sum(p => p.e)
    };
That mentions 'e' 4 times, like the SQL does, but also introduces new temporary variables x, g, r, and p. g is itself a magical complex object that includes a "Key" member we have to use, among other things. Now, LINQ is also much more powerful than SQL (you can use it for things other than databases) and in many cases it can translate itself into SQL (so it can query databases efficiently even though you wrote it imperatively), so it has redeeming features. But it definitely hasn't shortened our basic SQL query.

There are also ORMs (Object-Relational Mappers) out there, like ActiveRecord for example, which can be more concise than plain SQL. But they can't represent complicated concepts all the way through to the database. Generally you end up either downloading all the data and filtering it client-side; or fetching one record at a time, leading to high latency; or splitting the work into two operations, one to get a list of keys and another to fetch a bunch of keys in parallel. A proper "query language" like SQL doesn't require that kind of hand optimizing.

Somehow SQL has held on, since 1974, as still the best way to do that particular thing it does. You've got to give them some credit for that.

Imperative or not?

SQL is a programming language, although a restricted one. In my mind, I like to designate programming languages as one of three types: imperative, functional, or declarative. If you read the official definitions, functional languages are technically a subset of declarative ones, but I usually find those definitions to be more misleading than useful. HTML is a declarative language; LISP is a functional language. They're different, even if they do share underlying mathematical concepts.

Other declarative non-functional languages include CSS, XML, XQuery, JSON, regular expressions, Prolog, Excel formulas, and SQL. We can observe that declarative languages are a pretty weird bunch. They tend to share a few attributes: that it's hard to predict what the CPU will actually do (so performance depends on external knowledge of how a particular interpreter works), that expressing data works well and expressing commands works badly (I'm talking to you, ANT and MSBuild), and that explicit conditionals always look funny if they're supported at all.

These problems are very clear in the case of SQL, after using it for only a little while. The hardest part of SQL to understand is its interaction with the so-called "query optimizer" which decides what database indexes to use, and most importantly, when to use an index and when to use a full table scan. The person writing a SQL query will generally have a really clear idea when a table scan is appropriate (that is, usually for small amounts of data) and when it isn't, but there's no way in SQL to express that; you just have to trust the optimizer. It'll generally work, but sometimes it'll go crazy, and you'll have no idea what just happened to make your query go 100x slower. SQL suffers from a definition of "correctness" that doesn't include performance.

Declarative (and functional) languages also share a major advantage over imperative languages, which is that it's easy to manipulate and rearrange the program without losing its meaning, exactly because the implementation is left unspecified. For example, you should be able to convert an arbitrary SQL query to a map/reduce operation. Declarative and functional languages make things like parallelism easier to implement and reason about. (The "map" and "reduce" operations in map/reduce come from functional programming, of course.)

Let's look at afterquery again. The above afterquery, which matches the functionality of the above SQL query, can be broken down like this:

  url=example.json
  group=a,b,c
  group=a,b

Is it imperative, declarative, or functional? I've been thinking about it for a couple of weeks, and I don't really know. The fact that it can be mapped directly to a (declarative) SQL query suggests its declarative nature. But having written the implementation, I also know that how it works is very clearly imperative. It ends up translating the query to almost exactly this in javascript:

  grid = fetchurl('example.json');
  grid = groupby(grid, ['a', 'b', 'c']);
  grid = groupby(grid, ['a', 'b']);

That is, it's applying a series of changes to a single data grid (global shared state - the only state) in memory. You might notice that all the above imperative commands, though, have the same structure, so you could write the same thing functionally:

  groupby(
    groupby(
      fetchurl('example.json'),
      ['a', 'b', 'c']),
    ['a', 'b'])

Most imperative programs cannot be so easily mapped directly onto pure functional notation. So that leads me to my current theory, which is that afterquery really is an imperative language, but it happens to be one so weak and helpless that it can't express complex concepts (like loops) that would make it incompatible with a functional/declarative interpretation. It's not Turing-complete.

Nevertheless, the imperative-looking representation makes it easier to write and debug queries, and to estimate their performance, than declarative-looking SQL. In theory, a sequence of groups and pivots could be rearranged by a query optimizer to run faster or in parallel, but in practice, afterquery's goal of working on small amounts of data (unlike SQL, which is intended to run on large amounts) makes an optimizer pretty unnecessary, so we can just execute the program as a series of transformations in the order they're given.

An imperative language in the URL string

Traditionally, URLs are about as declarative as things come. At one level, they are just opaque string parameters to one of a very few functions (GET, POST, PUT, etc), some of which take an even bigger opaque string (the POST data) as a second parameter.

One level deeper, we know that URL strings contain certain well-known traditional components: protocol, hostname, path, query string (?whatever), anchor string (#whatever). Inside the query string (and sometimes the anchor string), we have key=value pairs separated by &, with special characters in the values traditionally encoded in a particular way (%-encoding). The URL standard specifies the components, but it says nothing about the structure of the query string; the key=value pairs, the & signs, and the %-encoding are just conventions, originally from HTML forms.

Afterquery uses the same key=value pairs as any query string, but while most apps treat them as a declarative dictionary of essentially unordered key=value pairs - with the only ordering being multiple instances of the same key - afterquery also depends on the ordering of keys. &group=a,b,c&pivot=a,b;c is a totally different command from the other way around.
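
A minimal sketch of what that means (a hypothetical helper, not afterquery's real parser): parse the query string as an ordered list of operations rather than an unordered dictionary:

```javascript
// Split a query string into an ordered program: each key=value pair is
// an operation, and the sequence matters.
function parseProgram(qs) {
  return qs.replace(/^\?/, '').split('&').map(function(pair) {
    var i = pair.indexOf('=');
    return {
      op: decodeURIComponent(pair.slice(0, i)),
      arg: decodeURIComponent(pair.slice(i + 1))
    };
  });
}

parseProgram('url=example.json&group=a,b,c&group=a,b');
// → [{op: 'url', arg: 'example.json'},
//    {op: 'group', arg: 'a,b,c'},
//    {op: 'group', arg: 'a,b'}]
```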

Another huge constraint on a URL is its length: there is no predefined maximum length, but many browsers limit it, maybe to 1024 characters or so. Thus, if you want to keep the program stateless (no state stored between executions, and no programs stored on the server), it's important to keep things concise, so you can say what you need to say inside a single URL.

Luckily, sometimes the best art comes from constraints! While perl-style punctuation-happy languages with lots of implicit arguments are nowadays unfashionable, we have no choice but to adopt them anyway if we want things to fit inside a URL. Our example afterquery is actually equivalent to:

  url=http://afterquery.appspot.com/example.json
  group=a,b,c;count(d),sum(e)
  group=a,b;count(c),sum(d),sum(e)

The URL has a default path based on the script location (as URLs always do), and the semicolon in group= statements separates the keys from the values. That leads to the very confusing at first, but very convenient thereafter, distinction between

  group=a,b,c

and

  group=a,b,c;

which mean very different things: the first means "keep all the other columns, and guess how to aggregate their values," while the second means "throw away all the other columns."
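
In other words (a sketch with a hypothetical parser, not afterquery's actual code), the semicolon splits key columns from explicit aggregates, and its mere presence changes the meaning:

```javascript
// Parse a group= argument. Before the semicolon: key columns. After it:
// explicit aggregate expressions. No semicolon at all means "keep the
// other columns and guess their aggregates"; a trailing semicolon with
// nothing after it means "throw the other columns away."
function parseGroup(arg) {
  var parts = arg.split(';');
  return {
    keys: parts[0] ? parts[0].split(',') : [],
    aggs: parts.length > 1
        ? (parts[1] ? parts[1].split(',') : [])
        : undefined   // undefined = guess defaults for other columns
  };
}

parseGroup('a,b,c');   // → {keys: ['a','b','c'], aggs: undefined}
parseGroup('a,b,c;');  // → {keys: ['a','b','c'], aggs: []}
parseGroup('a,b;count(c),sum(d)');
// → {keys: ['a','b'], aggs: ['count(c)', 'sum(d)']}
```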

Like a regex, it's totally unfriendly to an initial observer, whereas an SQL statement (or COBOL program) might make a beginner feel comfortable that they can read what's going on. But my theory is: once you've got the idea, SQL is just tedious, but afterquery is more fun.

And unlike SQL, a nontrivial program will fit in a URL string.

Syndicated 2012-12-16 04:46:12 from apenwarr - Business is Programming

11 Dec 2012 (updated 11 Dec 2012 at 02:02 UTC) »

Let's think about *small* data for a change

There's a lot of buzz lately about "big data" - huge, Internet scale databases that take a long time to query, but are awesome, I guess, because of how much testosterone you need to have in order to possess one.

Let's leave aside the question of exactly how big big data needs to be. I've heard of people talking about databases as small as a gigabyte as being "big," I guess because downloading it would take a few minutes, and maybe 'grep' isn't the best way to query it. Other people would say a terabyte is big, or a petabyte.

I don't really care. All I know is, beyond a certain threshold that depends on the current state of computer technology, as soon as your data is "big," queries are slow. Maybe a few seconds, which isn't so bad, but maybe a few *minutes*. And when things get slow, you start having to mess with separate "data warehouse" servers so that your big analytical queries don't bring down your whole database. And managing it all becomes a full-time job for someone or many people or a whole company.

I happen to work for an employer that does that sort of thing a lot. And to be honest, I find it pretty boring. Perhaps I have a testosterone deficiency. It's not so much the bigness that bothers me: it's the waiting. I like my compile-test-debug cycle to be on the order of two seconds, but when SQL or mapreduce gets involved, it's more like two minutes.1

I know, cry for me, right? Two minutes of my life, all gone. But seriously, when you're trying to find trends and aren't quite sure what you're looking for and it takes a dozen tries, those two minutes can add up rapidly, especially when added to the 10 minutes of web browsing or email that inevitably ensues once I get bored waiting for the two minutes.

After worrying about this problem for a long time (years now, I guess), I think I've come up with a decent workaround. The trick is to divide your queries into multiple stages. At each stage, we reduce the total amount of data by a few orders of magnitude, and thus greatly decrease the cost of debugging a complex query.

Originally, I might have tried to make a single SQL query that goes from, say, a terabyte of data, down to 10 rows, and made a bar chart. Which 10 rows? Well, it takes a few tries to figure that out, or maybe a few hundred tries. But it's much faster if we set the initial goal instead to, say, only pull out the gigabyte I need from that terabyte. And then I can query a gigabyte instead. From there, I can reduce it to a megabyte, perhaps, which is easy enough to process in RAM without any kind of index or complexity or optimization.

That last part is what I want to talk to you about, because I've been working on a tool to do it. I call it afterquery, tagline: "The real fun starts after the Serious Analysts have gone home." Serious people write mapreduces and SQL queries. Serious people hire statisticians. Serious people have so much data that asking questions about the data requires a coffee break, but they get paid so much they don't have to care.

Afterquery is the opposite of Serious. It downloads the whole dataset into RAM on your web browser and processes it in javascript.

But what it lacks in seriousness, it makes up for in quick turnaround. Here's what it produces from 1582 rows of data I got from Statistics Canada:2