Programming inside the URL string
Afterquery is hard to
explain to people, possibly because it actually combines several pretty
unusual concepts. A single unusual concept is bad enough, but several
at once is likely to just leave you scratching your head. With that in
mind, here's just one unusual concept: a programming language designed
for URLs. Not a language for manipulating URLs; the language *is* the
URL.
In afterquery, if I write this:
http://afterquery.appspot.com?url=example.json&group=a,b,c&group=a,b
Then (assuming a..d are strings and e is numeric) it's about
equivalent to this SQL:
select a, b, count(c) as c, sum(d) as d, sum(e) as e
from (
select a, b, c, count(d) as d, sum(e) as e
from example
group by a, b, c
)
group by a, b
This gives, for each combination of a and b, the number of distinct
values of c, the number of distinct combinations of (c,d), and the sum
of e - each a slightly different useful aggregate.
Some people have used SQL for so long now that they don't remember
anymore exactly how redundant the language is. The above SQL mentions
'e' 4 times! The afterquery code doesn't mention it at all; if you say
you want to group by a, b, and c, then the assumption is that e (one of
the columns in the initial dataset) is an aggregate value field and
doesn't need to be mentioned unless you want an unusual aggregate.
(The default aggregate is count() for non-numeric fields, and sum() for
numeric fields.)
The sequence of clauses in SQL is also problematic because it's both
arbitrary and restrictive. It doesn't reflect the order of operations;
in reality, among other things, "from" obviously comes before "select."
(Incidentally, LINQ in C# puts "from" first, for that reason.) Worse,
"group by" is highly related to the select operation - whether or not
something must be aggregated depends on whether it's in the "group by"
clause or not, and yet "group by" is way down in the query while
"select" is at the top. And worst of all, the entire "group by" clause
is actually redundant: you could calculate the "group by"
clause entirely by looking at which fields in the select contain
aggregation functions and which don't. You know this is true, because
if your select clause doesn't use aggregation functions for exactly the
right fields (no more, and no less), then your query will abort with an
error.
There's a lot of danger in trying to make a programming language that reads
too much like English. Maybe it can be done, but you have to be tasteful
about it. SQL is not tasteful; it's designed with the same mindset that
produced COBOL. (The joke is that an object oriented COBOL would be called
"ADD 1 TO COBOL GIVING COBOL", which is the COBOL equivalent of C++ or [incr
tcl]. That line is the actual code for incrementing a number in COBOL.
Compare with "SELECT sql+1 FROM sql WHERE sql=0 GROUP BY sql".)
Still though, despite SQL's insulting repetitiveness, it has one even
more depressing advantage: it's still the most concise way to
express a database query of moderate complexity. C# LINQ is the
closest thing we have to a competitor - it was specifically designed to
try to replace SQL because coding database queries in an imperative
language is too messy - and our above nested grouping looks something
like this (based on their sample
code - I haven't used LINQ in a while and haven't tested it):
var tmp =
from x in example
group x by x.a, x.b, x.c into g
select new {
a = g.Key[0], b = g.Key[1], c = g.Key[2],
d = g.Count(p => p.d),
e = g.Sum(p => p.e)
};
var result =
from r in tmp
group r by r.a, r.b into g
select new {
a = g.Key[0], b = g.Key[1],
c = g.Count(p => p.c);
d = g.Sum(p => p.d);
e = g.Sum(p => p.e);
};
That mentions 'e' 4 times, like the SQL does, but also introduces new
temporary variables x, g, r, and p. g is itself a magical complex
object that includes a "Key" member we have to use, among other things.
Now, LINQ is also much more powerful than SQL (you can use it for
things other than databases) and in many cases it can translate itself
into SQL (so it can query databases efficiently even though you wrote
it imperatively), so it has redeeming features. But it definitely
hasn't shortened our basic SQL query.
There are also ORMs (Object-Relational Mappers) out there, like
ActiveRecord for example, which can be more concise than plain SQL.
But they can't represent complicated concepts all the way through to
the database. Generally you end up downloading either all the data and
filtering it client-side, or one record at a time, leading to high
latency, or splitting into two operations, one to get a list of keys,
and another to fetch a bunch of keys in parallel. A proper "query
language" like SQL doesn't require that kind of hand optimizing.
Somehow SQL has held on, since 1974, as still the best way to do that
particular thing it does. You've got to give them some credit for that.
Imperative or not?
SQL is a programming language, although a restricted one. In my mind,
I like to designate programming languages as one of three types:
imperative, functional, or declarative. If you read the official
definitions, functional languages are technically a subset of
declarative ones, but I usually find those definitions to be more
misleading than useful. HTML is a declarative language; LISP is a
functional language. They're different, even if they do share
underlying mathematical concepts.
Other declarative non-functional languages include CSS, XML, XQuery,
JSON, regular expressions, Prolog, Excel formulas, and SQL. We can
observe that declarative languages are a pretty weird bunch. They tend
to share a few attributes: that it's hard to predict what the CPU will
actually do (so performance depends on external knowledge of how a
particular interpreter works), that expressing data works well and
expressing commands works badly (I'm talking to you, ANT and MSBuild),
and that explicit conditionals always look funny if they're supported
at all.
These problems are very clear in the case of SQL, after using it for
only a little while. The hardest part of SQL to understand is its
interaction with the so-called "query optimizer" which decides what
database indexes to use, and most importantly, when to use an index and
when to use a full table scan. The person writing a SQL query will
generally have a really clear idea when a table scan is appropriate
(that is, usually for small amounts of data) and when it isn't, but
there's no way in SQL to express that; you just have to trust the
optimizer. It'll generally work, but sometimes it'll go crazy, and
you'll have no idea what just happened to make your query go 100x
slower. SQL suffers from a definition of "correctness" that doesn't
include performance.
Declarative (and functional) languages also share a major advantage
over imperative languages, which is that it's easy to manipulate and
rearrange the program without losing its meaning, exactly because the
implementation is left unspecified. For example, you should be able to
convert an arbitrary SQL query to a map/reduce operation. Declarative
and functional languages make things like parallelism easier to
implement and reason about. (The "map" and "reduce" operations in
map/reduce come from functional programming, of course.)
Let's look at afterquery again. The above afterquery, which matches
the functionality of the above SQL query, can be broken down like
this:
url=example.json
group=a,b,c
group=a,b
Is it imperative, declarative, or functional? I've been thinking about
it for a couple of weeks, and I don't really know. The fact that it
can be mapped directly to a (declarative) SQL query suggests its
declarative nature. But having written the implementation, I also know
that how it works is very clearly imperative. It ends up translating
the query to almost exactly this in javascript:
grid = fetchurl('example.json')
grid = groupby(grid, ['a', 'b', 'c']);
grid = groupby(grid, ['a', 'b']);
That is, it's applying a series of changes to a single data grid
(global shared state - the only state) in memory. You might notice
that all the above imperative commands, though, have the same
structure, so you could write the same thing functionally:
groupby(
groupby(
fetchurl('example.json'),
['a', 'b', 'c']),
['a', 'b'])
Most imperative programs cannot be so easily mapped directly onto pure
functional notation. So that leads me to my current theory, which is
that afterquery really is an imperative language, but it happens to be
one so weak and helpless that it can't express complex concepts (like
loops) that would make it incompatible with functional/declarative
interpretation. It's not turing-complete.
Nevertheless, the imperative-looking representation makes it easier
to write and debug queries, and to estimate their performance, than
declarative-looking SQL. In theory, a sequence of groups and pivots
could be rearranged by a query optimizer to run faster or in parallel,
but in practice, afterquery's goal of working on small amounts of data
(unlike SQL, which is intended to run on large amounts) makes an
optimizer pretty unnecessary, so we can just execute the program as a
series of transformations in the order they're given.
An imperative language in the URL string
Traditionally, URLs are about as declarative as things come. At one
level, they are just opaque string parameters to one of a very few
functions (GET, POST, PUT, etc), some of which take an even bigger
opaque string (the POST data) as a second parameter.
One level deeper, we know that URL strings contain certain well-known
traditional components: protocol, hostname, path, query string
(?whatever), anchor string (#whatever). Inside the query string (and
sometimes the anchor string), we have key=value pairs separated by
&, with special characters in the values traditionally encoded in a
particular way (%-encoding). HTTP specifies the components, but it
doesn't have to say anything about the structure of the query string,
its key=value pairs, the & signs, or its %-encoding.
Afterquery uses the same key=value pairs as any query string, but while most
apps treat them as a declarative dictionary of essentially unordered
key=value pairs - with the only ordering being multiple instances of the
same key - afterquery also depends on the ordering of keys.
&group=a,b,c&pivot=a,b;c is a totally different command
from the other way around.
Another huge constraint on a URL is its length: there is no predefined
maximum length, but many browsers limit it, maybe to 1024 characters or so.
Thus, if you want to keep the program stateless (no state stored between
executions, and no programs stored on the server), it's important to keep
things concise, so you can say what you need to say inside a single URL.
Luckily, sometimes the best art comes from constraints! Where perl-style
punctuation-happy languages with lots of implicit arguments are nowadays
unfashionable, we have no choice but to adopt them anyway if we want things
to fit inside a URL. Our example afterquery is actually equivalent to:
url=http://afterquery.appspot.com/example.json
group=a,b,c;count(d),sum(e)
group=a,b;count(c),sum(d),sum(e)
The URL has a default path based on the script location (as URLs always
do), and the semicolon in group= statements separates the keys from the
values. That leads to the very confusing at first, but very convenient
thereafter, distinction between
group=a,b,c
and
group=a,b,c;
Which mean very different things: the first means "keep all the other
columns, and guess how to aggregate their values" while the second
means "throw away all the other columns."
Like a regex, it's totally unfriendly to an initial observer, where an
SQL statement (or COBOL program) might make a beginner feel comfortable
that they can read what's going on. But my theory is: once you've got
the idea, SQL is just tedious, but afterquery is more fun.
And unlike SQL, a nontrivial program will fit in an URL string.
Syndicated 2012-12-16 04:46:12 from apenwarr - Business is Programming