Processing the LOD cloud with BigQuery
Google’s BigQuery is a large-scale, interactive query environment that can handle billions of records in seconds. Now, wouldn’t it be cool to process the 26+ billion triples from the LOD cloud with BigQuery?
I guess so
So, I took a first step in this direction by setting up the BigQuery for Linked Data project, which contains:
- A Python script called nt2csv.py that converts RDF/NTriples into BigQuery-compliant CSV;
- BigQuery schemas that can be used together with the CSV data from above;
- Step-by-step instructions on how to use nt2csv.py along with Google’s gsutil and bq command-line tools to import the above data into Google Storage and issue a query against the uploaded data in BigQuery.
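To give an idea of what such a conversion involves, here is a minimal sketch in Python. It is not the project's actual nt2csv.py: the function names, the simplified regular expression and the three-column layout (subject, predicate, object) are my assumptions for illustration, and the pattern does not cover the full N-Triples grammar.

```python
import csv
import re

# One N-Triples statement: subject, predicate, object, then a closing " ."
# Simplified pattern (an assumption); it does not validate the full grammar.
TRIPLE = re.compile(r'^\s*(\S+)\s+(\S+)\s+(.*?)\s*\.\s*$')


def strip_iri(term):
    """Drop the angle brackets around an IRI; leave literals and blank nodes as-is."""
    if term.startswith('<') and term.endswith('>'):
        return term[1:-1]
    return term


def nt_to_csv(nt_lines, out):
    """Write one (subject, predicate, object) CSV row per triple; return the row count."""
    writer = csv.writer(out)
    count = 0
    for line in nt_lines:
        if not line.strip() or line.lstrip().startswith('#'):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip lines the simplified pattern cannot parse
        writer.writerow([strip_iri(term) for term in match.groups()])
        count += 1
    return count
```

Calling nt_to_csv(open('data.nt'), open('data.csv', 'w', newline='')) (both file names hypothetical) would produce a CSV file where the csv module takes care of quoting literals that contain commas or double quotes.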
Essentially, one can – given an account for Google Storage as well as an account for BigQuery – do the following:
SELECT object FROM [mybucket/tables/rdf/tblNames]
WHERE predicate = 'http://xmlns.com/foaf/0.1/knows'
… which roughly translates into the following SPARQL query:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?o WHERE { ?s foaf:knows ?o . }
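If you have neither a BigQuery account nor a SPARQL endpoint at hand, what both queries do can be sketched in plain Python over the converted CSV rows. This is only an illustration: the function name is hypothetical, and the column order (subject, predicate, object) is my assumption about the CSV layout.

```python
import csv

FOAF_KNOWS = 'http://xmlns.com/foaf/0.1/knows'


def objects_for_predicate(csv_file, predicate):
    """Return the object column of every row whose predicate column matches."""
    return [
        row[2]
        for row in csv.reader(csv_file)
        # Assumed layout: subject, predicate, object; malformed rows are ignored.
        if len(row) == 3 and row[1] == predicate
    ]
```

For example, objects_for_predicate(open('data.csv', newline=''), FOAF_KNOWS) would list everything that appears as the object of a foaf:knows statement in a (hypothetical) data.csv file.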
Currently, I do have a Google Storage account, but unfortunately not a BigQuery account (yeah, I’ve signed up but I’m still in the queue). So, I can’t really test this stuff – any takers?