13 Dec 2010 mhausenblas

Processing the LOD cloud with BigQuery

Google’s BigQuery is a large-scale, interactive query environment that can handle billions of records in seconds. Now, wouldn’t it be cool to process the 26+ billion triples from the LOD cloud with BigQuery?

I guess so ;)

So, I took a first step in this direction by setting up the BigQuery for Linked Data project, which contains:

  • A Python script called nt2csv.py that converts RDF N-Triples into BigQuery-compliant CSV (a minimal sketch of such a converter follows this list);
  • BigQuery schemas that can be used together with the CSV data from above;
  • Step-by-step instructions on how to use nt2csv.py along with Google’s gsutil and bq command-line tools to import the data into Google Storage and query it in BigQuery.
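
For illustration, here is a minimal sketch of what such a converter boils down to. To be clear: this is a hypothetical reimplementation, not the actual nt2csv.py from the project, and it deliberately ignores the darker corners of the N-Triples grammar (escaped quotes, for instance):

#!/usr/bin/env python
# Hypothetical sketch of an N-Triples-to-CSV converter; NOT the actual
# nt2csv.py from the project. It handles the common cases (URI refs,
# blank nodes, plain/typed/language-tagged literals) and nothing more.
import csv
import re
import sys

TRIPLE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'                        # subject: URI ref or blank node
    r'(<[^>]*>)\s+'                               # predicate: URI ref
    r'(<[^>]*>|_:\S+|".*"(?:@\S+|\^\^<[^>]*>)?)'  # object: URI, blank node, or literal
    r'\s*\.\s*$'
)

def term(token):
    # Strip angle brackets from URI refs; leave other terms untouched.
    if token.startswith('<') and token.endswith('>'):
        return token[1:-1]
    return token

def nt2csv(infile, outfile):
    writer = csv.writer(outfile)
    for line in infile:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match:
            writer.writerow([term(t) for t in match.groups()])

if __name__ == '__main__':
    nt2csv(sys.stdin, sys.stdout)

Run as python nt2csv.py < data.nt > data.csv; the resulting three-column CSV (subject, predicate, object) lines up with a simple string/string/string table schema.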

Essentially, given an account for Google Storage as well as one for BigQuery, one can do the following:

bq query
"SELECT object FROM [mybucket/tables/rdf/tblNames]
WHERE predicate = 'http://xmlns.com/foaf/0.1/knows'
LIMIT 10"

… which roughly translates into the following SPARQL query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?o
WHERE {
  ?s foaf:knows ?o .
}
LIMIT 10
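
As a sanity check of that translation, the SPARQL version can be run locally, for example with rdflib. A sketch, where data.nt stands in for any N-Triples file at hand:

# Local sanity check of the SPARQL translation using rdflib;
# "data.nt" is a placeholder for any N-Triples file.
from rdflib import Graph

g = Graph()
g.parse("data.nt", format="nt")

results = g.query("""
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?o
    WHERE { ?s foaf:knows ?o . }
    LIMIT 10
""")

for row in results:
    print(row.o)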

Currently, I do have a Google Storage account, but unfortunately not yet a BigQuery account (yeah, I’ve signed up, but I’m still in the queue). So I can’t really test this stuff myself – any takers?
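
In the meantime, the expected output can at least be approximated locally over the CSV that nt2csv.py produces. A rough stand-in for the BigQuery run, with data.csv as a placeholder filename:

# Rough local stand-in for the bq query above: filter the CSV produced
# by nt2csv.py for foaf:knows rows and print the first ten objects.
import csv

FOAF_KNOWS = 'http://xmlns.com/foaf/0.1/knows'

with open('data.csv', newline='') as f:
    hits = (row for row in csv.reader(f) if row[1] == FOAF_KNOWS)
    for i, (subject, predicate, obj) in enumerate(hits):
        if i == 10:  # LIMIT 10
            break
        print(obj)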


Filed under: Experiment, Linked Data

Syndicated 2010-12-13 08:54:05 from Web of Data
