22 Sep 2005 cinamod   » (Master)

This whole Google print lawsuit bungle has got me thinking.

I'm confused by those who think that Google isn't unambiguously in the clear here. Not because they're doing it for a scholarly purpose. Or because they're only reproducing a terse portion of the work. Or not even because what Google is doing can't possibly affect these author's past decisions to create their works, and thus retroactively disincentivize their respective work's creation (ahem, Sonny Bono CTEA, I'm looking at you here...).

I think that if you're arguing those points, perhaps you're looking at this from an overly-narrow perspective. I invite you to look outside the box. Sure, Google might (or might not) win on the above points alone. But I don't think that Google is copying expressive works. Google is copying databases of words.

In the case of Rural v. Feist, the Court ruled that databases (in that case, telephone directories) were not entitled to copyright protection, as they contained little (if any) expressive content. Copyright protects expression fixed in a tangible media, and even then only within certain limitations.

Here, I believe that Google is treating otherwise expressive, copyrighted texts as databases, thus stripping them of their expressivity in the context of the texts' uses. I think that the use of a derivitive work matters a great deal in determining that work's expressivity before the Court. That the use of a work has a transformative effect on the expressivity of that work, possibly even voiding that work's expressiveness in a given context. In Google's case, the works are copied - perhaps verbatim - but their expressiveness is lost in the process. Granted, this may seem non-obvious.

The search results page rendered by Google most likely have some expressiveness, and would be copyrightable. The texts that Google OCR'd are expressive and copyrightable. But Google's treatment of these texts as search indexes - reverse text lookup databases - is in itself not expressive. They're just unexpressive token sequences, capable of being searched. It is in the translation from meaningful, expressive words into an ordered sequence of cold, machine-searchable tokens that the work loses its expressivity. Note that this distinction would still attach copyright protection to things like eBooks, as the purpose of eBooks is to convey expressivity to a human reader via an electronic medium. The purpose of the tokens is to convey an ordered sequence of words to a machine algorithm incapable of appreciating the work's expressivity or content in any way that we'd call "meaningful".

From that, we're left to conclude that the tokens "John Galt" appearing on page 1 of Ayn Rand's "Atlas Shrugged" next to the tokens "Who is" is merely a fact, absent any inherent meaningful expressivity. And absent this expressivity, copyright doesn't attach to this sentence (which, fwiw, is probably considerably too short for copyright to attach to, anyway). Facts - even collections of facts - simply aren't protected under copyright law.

In the end, Google's Print project is just a fact retrieval system - in essence, no different from the index in the back of the book that they're OCR'ing. Copyright law needn't get involved, because at no point does it affix to what Google is doing. Or so I hope that the Courts decide.

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!