16 Apr 2011 mhausenblas   » (Journeyer)

From CSV data on the Web to CSV data in the Web

In our daily work with Government data such as statistics, geographical data, etc. we often deal with Comma-Separated Values (CSV) files. Now, they are really handy as they are easy to produce and to consume: almost any language and platform I came across so far has some support for parsing CSV files and I can virtually export CSV files from any sort of (serious) application.

There is even a – probably not widely known – standard for CSV files (RFC 4180) that specifies the grammar and registers the normative MIME media type text/csv for CSV files.

So far so well.

From a Web perspective, CSV files really are data objects, which however are rather coarse-granular. If I want to use a CSV file, I always have to use the entire file. There is no agreed-upon concept that allows me to refer to a certain cell, row or column. This was my main motivation to start working on what I called Addrable (from Addressable Table) earlier this year. I essentially hacked together a rather simple implementation of Addrables in JavaScript that understands URI fragment identifiers such as:

  • #col:temperature
  • #row:10
  • #where:city=Galway,reporter=Richard

Let’s have a closer look at what the result of the processing of such a fragment identifier against an example CSV file could be. I’m going to use the last one in the list above, that is, addressing a slice where the city column has the value ‘Galway’ and for the reporter column we ask it to be ‘Richard’.

The client-side implementation in jQuery provides a visual rendering of the selected part, see below a screen-shot (if you want to toy around with it, either clone or download it and open it locally in your browser):

There is also a server-side implementation using node.js available (deployed at addrable.no.de), outputting JSON:

{
  "header":
    ["date","temperature"],
  "rows":
    [
      ["2011-03-01", "2011-03-02", "2011-03-03"],
      ["4","10","5"]
    ]
}

Note: the processing of the fragment identifier is meant to be performed by the User Agent after the retrieval action has been completed. However, the server-side implementation demonstrates a workaround for the fact that the fragment identifier is not sent to the Server (see also the related W3C document on Repurposing the Hash Sign for the New Web).

Fast forwarding a couple of weeks.

Now, having an implementation is fine, but why not pushing the envelope and taking it a step further, in order to help making the Web a better place?

Enter Erik Wilde, who did ‘URI Fragment Identifiers for the text/plain Media Type’ aka RFC 5147 some three years ago; and yes, I admit I was a bit biased already through my previous contributions to the Media Fragments work. We decided to join forces to work on ‘text/csv Fragment Identifiers’, based on the Addrable idea.

As a first step (well beside the actual writing of the Internet-Draft to be submitted to IETF) I had a quick look at what we can expect in terms of deployment. That is, a rather quick and naive survey based on some 60 CSV files manually harvested from the Web. The following figure gives you a rough idea what is going on:

To sum up the preliminary findings: almost half of the CSV files are (wrongly) served with text/plain (followed by some other non-conforming and partially exotic Media Types such as text/x-comma-separated-values. The bottom-line is: only 10% of the CSV files are served correctly with text/csv. Why do we care, you ask? Well, for example, because the spec says that the header row is optional, but the presence can be flagged by an optional HTTP Header parameter. Just wondering what the chances are ;)

Now, I admit that my sample here is rather small, but I think the distribution will roughly stay the same. By the way, anyone aware of a good way to find CSV files, besides filetype:csv in Google or contains:csv in Bing, as I did it?

We’d be glad to hear from you – do you think this is useful for your application? If yes, why? How would you use it? Or, maybe you want to do a proper CSV crawl to help us with the analysis?


Filed under: Announcement, FYI, Idea, IETF

Syndicated 2011-04-16 12:43:35 from Web of Data

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!