Older blog entries for mhausenblas (starting at number 40)

Processing the LOD cloud with BigQuery

Google’s BigQuery is a large-scale, interactive query environment that can handle billions of records in seconds. Now, wouldn’t it be cool to process the 26+ billion triples from the LOD cloud with BigQuery?

I guess so ;)

So, I took a first step in this direction by setting up the BigQuery for Linked Data project, which contains:

  • A Python script called nt2csv.py that converts RDF/NTriples into BigQuery-compliant CSV;
  • BigQuery schemas that can be used together with the CSV data from above;
  • Step-by-step instructions on how to use nt2csv.py along with Google’s gsutil and bq command line tools to import the above data into Google Storage and issue a query against the uploaded data in BigQuery.
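
To give an idea of what the conversion step involves, here is a minimal sketch in Python. It is not the actual nt2csv.py; the function names, the simplified regular expression (no escape handling inside literals) and the column order subject, predicate, object are assumptions made for illustration.

```python
import csv
import io
import re

# One N-Triples statement: subject, predicate, object, terminated by " ."
# Deliberately simplified: escapes inside literals are not handled.
TRIPLE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+'
    r'(<[^>]*>|_:\S+|".*"(?:@[\w-]+|\^\^<[^>]*>)?)\s*\.\s*$'
)


def plain(term):
    """Strip the angle brackets of a URI or the quotes of a literal."""
    if term.startswith("<"):
        return term[1:-1]
    if term.startswith('"'):
        # Keep only the lexical value; language tag and datatype are dropped.
        return term[1:term.rindex('"')]
    return term  # blank node label, kept as-is


def nt_to_csv(lines):
    """Convert an iterable of N-Triples lines into CSV text (no header row)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip empty lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError("not a valid N-Triples line: %r" % line)
        writer.writerow([plain(term) for term in match.groups()])
    return out.getvalue()


sample = [
    '<http://sw-app.org/mic.xhtml#i> <http://xmlns.com/foaf/0.1/name> "Michael Hausenblas" .',
    '<http://sw-app.org/mic.xhtml#i> <http://xmlns.com/foaf/0.1/knows> <http://richard.cyganiak.de/foaf.rdf#cygri> .',
]
print(nt_to_csv(sample))
```

Each output row then has three columns that a table schema can name subject, predicate and object.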

Essentially, one can – given an account for Google Storage as well as an account for BigQuery – do the following:

bq query \
"SELECT object FROM [mybucket/tables/rdf/tblNames]
WHERE predicate = 'http://xmlns.com/foaf/0.1/knows'"

… which roughly translates into the following SPARQL query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?o WHERE {
?s foaf:knows ?o .
}

Currently, I do possess a Google Storage account, but unfortunately not a BigQuery account (yeah, I’ve signed up but am still in the queue). So, I can’t really test this stuff – any takers?

Filed under: Experiment, Linked Data

Syndicated 2010-12-13 08:54:05 from Web of Data

Open data is the electricity of the 21st century

As I said earlier, today:

Open data is the electricity of the 21st century.

How do I come to this conclusion, you ask?

Well, imagine for a moment that all the electricity on earth were switched off; as an aside: this is unfortunately not an entirely theoretical thing (cf. EMP). What would happen? No, I’m not talking about you missing your favourite TV show. There are serious consequences to be expected, such as people suffering in hospitals and planes crashing, essentially causing our civilisation to grind to a halt. Now, I hope you can acknowledge the pervasiveness of electricity and our dependency on it.

But how does this relate to open data?

Both electricity and open data share a couple of features:

  • You need an infrastructure (generation, distribution, etc.) to be able to benefit from it.
  • On its own it’s pretty useless. You need ‘applications’ to exploit it.
  • You notice it only as soon as it is not available (anymore).

Concerning electricity, a lot of people had numerous ideas on how to utilise it long before it reached wide adoption. They had to overcome serious obstacles, and there were existential fights about deployment and which technologies to use (read more in The Story of Electricity by John Munro).

Now, just like electricity, open data is about to become ubiquitous these days, be it from governments or from private entities that, for example, seek to optimise their Web presence. And there are and will be discussions about how to best expose the data (on the Web).

Note that I’m not trying to advocate that all data on earth should be open to everyone. This is maybe the biggest difference from electricity. There are cases (and let’s be honest, quite a few) where privacy, concerning a person or an organisation, must take precedence over the ‘openness’ of the data. But let this be no excuse not to publish your data on the Web if there are no privacy concerns.

All that is missing now are applications, on a large scale, that utilise the open data. Think about it, the next time you switch on your TV or use your oven to prepare a meal for the ones you love ;)

Filed under: Linked Data

Syndicated 2010-11-20 11:40:35 from Web of Data

Linked Open Data star scheme by example

I like TimBL’s 5-star deployment scheme for Linked Open Data. However, every time I use it to explain the migration path from ‘no-data-on-the-Web’ to the ‘Full Monty’, no matter if to students, in training sessions or to industry partners, there comes a point where it would be very handy to refer to a concrete example that demonstrates the entire scheme.

Well, there we go. At


you can find the examples for the 5-star scheme, ranging from a PDF to full-blown Linked Data (in RDFa).

Now, just for fun – what will the minimal temperature tomorrow be in Galway? See the power of Linked Open Data in action

… and in case you wanna play around with the data yourself, here is the SPARQL query for the previous answer:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX meteo: <http://purl.org/ns/meteo#>
PREFIX : <http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/gtd-5.html#>
SELECT ?tempC FROM <http://any23.org/rdfxml/http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/gtd-5.html>
WHERE {
:Galway meteo:forecast ?forecast .
?forecast meteo:predicted "2010-11-13T00:00:00Z"^^xsd:dateTime ;
meteo:temperature ?temp .
?temp meteo:celsius ?tempC .
}

Note: to execute the SPARQL query, paste it for example into http://sparql.org/sparql.html and toy around with the patterns.
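
To see what a query engine does with these patterns, here is a small sketch in plain Python that joins the same triple patterns over a hand-written set of triples. Note that the triples below, including the blank node labels and the temperature value, are placeholders made up for illustration; they are not the published data.

```python
METEO = "http://purl.org/ns/meteo#"
BASE = "http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/gtd-5.html#"

# Placeholder triples (made up for illustration, NOT the published data).
triples = {
    (BASE + "Galway", METEO + "forecast", "_:f1"),
    ("_:f1", METEO + "predicted", "2010-11-13T00:00:00Z"),
    ("_:f1", METEO + "temperature", "_:t1"),
    ("_:t1", METEO + "celsius", "0"),
}


def objects(subject, predicate):
    """All objects of triples matching the given subject and predicate."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def temp_celsius(place, when):
    """Follow place -> forecast -> temperature -> celsius for a given date."""
    results = []
    for forecast in objects(place, METEO + "forecast"):
        if when not in objects(forecast, METEO + "predicted"):
            continue  # forecast is for another point in time
        for temp in objects(forecast, METEO + "temperature"):
            results.extend(objects(temp, METEO + "celsius"))
    return results


print(temp_celsius(BASE + "Galway", "2010-11-13T00:00:00Z"))
```

Each nested loop corresponds to one triple pattern in the query; changing a pattern in the query changes which hop is followed.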

Filed under: demo, Linked Data

Syndicated 2010-11-12 09:22:10 from Web of Data

Quick RDFa profiling report

The other day at semantic-web@w3.org, William Waites asked how to link to an RDF serialisation from an HTML document (if I understood him correctly), as he does not seem to be much into RDFa. While the answer to this one is IMHO rather straight-forward (use <link href="yadayada.rdf" rel="alternate" type="application/rdf+xml" />), the follow-up discussion reminded me of the section on Usability Issues in our Linked Data with RDFa tutorial (note that this document will soon be moved to another location):

One practical issue you may want to check against sometimes occurs with fine-grained, high-volume datasets. Imagine a detailed description of audio-visual content (say, a one hour video incl. audio track) in RDF, or, equally, a detailed RDF representation of a multi-dimensional table of statistics, with dozens of columns and potentially thousands of rows. In both cases, one ends up with potentially many triples, which might mean some 100k triples or more. As both humans and machines are expected to consume the RDFa document, one certainly has to find a trade-off between using RDFa for the entire description (meaning to embed all triples in the HTML document) and an entirely externalised solution, for example using RDF/XML:

We also give a rough guideline on how to decide how much is too much:

… having the entire RDF graph embedded certainly is desirable, however, one has to check the usability of the site. Usability expert Jakob Nielsen advocates a size limit for Web pages yielding an approximate 10 sec response time limit. Based on this we propose to perform a simple sort of response time testing, once with the plain HTML page and once with the embedded RDF graph. In case of a significant difference, one should contemplate if the all-in-RDFa approach is appropriate for the use case at hand.

Now, I wanted to get some real figures regarding how the number of triples embedded with RDFa impacts the loading time of an HTML page in a browser and did the following: I loaded some 17 cities from DBpedia (such as Amsterdam) into an RDF store and created a number of generic RDFa+HTML documents essentially with:


… where $SIZE would range from 10 to 20,000 – each triple looks essentially like:

<div about='http://dbpedia.org/resource/William_Howitt'>
<a rel='dbp:deathPlace'



Then I used Firebug with the NetExport extension and a shell script to gather the load time. The (raw) results are available online as well as two figures that give a rough idea of what is happening:

Note the following regarding the test setup: I did a local test (no network dependencies) with all caches turned off; the tests were performed with Firefox 3.6 on MacOS 10.5.8 (2.53 GHz Intel Core 2 Duo with 4GB/1067 MHz DDR3 RAM on board). Each document was loaded five times; the numbers above show the averages over the runs.
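
For anyone who wants to repeat such a test, the following Python sketch generates HTML documents with a given number of embedded RDFa triples. It is not the generator I used; the example.org URIs, the ex:related property and the function name are placeholders, so treat it as a starting point only.

```python
from html import escape


def rdfa_document(size):
    """Build an HTML page embedding `size` triples as RDFa (placeholder data)."""
    rows = []
    for i in range(size):
        # Hypothetical subject and object URIs, one triple per <div>.
        subject = "http://example.org/resource/%d" % i
        target = "http://example.org/place/%d" % i
        rows.append(
            "<div about='%s'><a rel='ex:related' href='%s'>%s</a></div>"
            % (escape(subject), escape(target), escape(target))
        )
    return (
        "<html xmlns:ex='http://example.org/ns#'><body>\n"
        + "\n".join(rows)
        + "\n</body></html>\n"
    )


# Document sizes within the range used in the test (10 to 20,000 triples).
for size in [10, 100, 1000, 20000]:
    doc = rdfa_document(size)
    print(size, "triples ->", len(doc.encode("utf-8")), "bytes")
```

The generated files can then be loaded in the browser one by one, with the load time recorded per run and averaged per document size.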

Filed under: Experiment, Linked Data

Syndicated 2010-10-26 14:27:03 from Web of Data

Toying around with Riak for Linked Data

So I stumbled upon Rob Vesse’s tweet the other day, where he said he was about to use MongoDB for storing RDF. A week earlier I watched a nice video about links and link walking in Riak, “a Dynamo-inspired key/value store that scales predictably and easily” (see also the Wiki doc).

Now, I was wondering what it takes to store an RDF graph in Riak using Link headers. Let me say that it was very easy to install Riak and to get started with the HTTP interface.

The main issue then was how to map the RDF graph into Riak buckets, objects and keys. Here is what I came up with so far: I use an RDF resource-level approach with a special object key that I called :id, which is the RDF resource URI or the bNode. Further, in order to maintain the graph provenance, I store the original RDF document URI in the metadata of the Riak bucket. Each RDF resource is mapped into a Riak object; for each literal RDF object value, the literal value is stored directly via a Riak object key; for each resource object (URI ref or bNode), I use a Link header.

Enough words. Action.

Take the following RDF graph (in Turtle):

@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix : <http://sw-app.org/mic.xhtml#>.

:i foaf:name "Michael Hausenblas" ;
foaf:knows <http://richard.cyganiak.de/foaf.rdf#cygri> .

To store the above RDF graph in Riak I would then use the following curl commands:

curl -X PUT -d 'Michael Hausenblas'

curl -X PUT -d 'http://sw-app.org/mic.xhtml#i'

curl -X PUT -d 'http://richard.cyganiak.de/foaf.rdf#cygri'

curl -X PUT -d 'http://sw-app.org/mic.xhtml#i' -H "Link: </riak/res1/:id>; riaktag=\"foaf:knows\""
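
The mapping itself can be sketched independently of Riak. The following Python snippet models it with plain dictionaries: one object per RDF resource, literal values stored under a key, and resource-valued objects kept as tagged links. The data structure and function names are my own assumptions for illustration; no Riak API is involved.

```python
def to_objects(triples):
    """Group (subject, predicate, object, is_literal) tuples by RDF resource."""
    objects = {}

    def entry_for(resource):
        # Every resource gets one object carrying its URI under the :id key.
        return objects.setdefault(
            resource, {":id": resource, "values": {}, "links": []}
        )

    for subject, predicate, obj, is_literal in triples:
        entry = entry_for(subject)
        if is_literal:
            entry["values"][predicate] = obj  # literal stored directly
        else:
            entry["links"].append((obj, predicate))  # like a tagged Link header
            entry_for(obj)  # make sure the link target exists as an object
    return objects


def walk(objects, start, tag):
    """Follow all links of `start` that carry the given tag."""
    return [target for target, link_tag in objects[start]["links"] if link_tag == tag]


graph = [
    ("http://sw-app.org/mic.xhtml#i", "foaf:name", "Michael Hausenblas", True),
    ("http://sw-app.org/mic.xhtml#i", "foaf:knows",
     "http://richard.cyganiak.de/foaf.rdf#cygri", False),
]
store = to_objects(graph)
print(walk(store, "http://sw-app.org/mic.xhtml#i", "foaf:knows"))
```

The walk function plays the role of link walking here: it only looks at the tag of each link, never at the content of the target object.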

Then, querying the store is straight-forward like this (here: list all people I know)


Yes, I know, the prefixes like foaf: etc. need to be taken care of (but that’s rather easy; they can be put in the bucket’s metadata as well, along with the prefix.cc service). Further, the bNodes might cause trouble. And there is no smushing via owl:sameAs or IFPs (yet). But the most challenging area is maybe how to map a SPARQL query onto Riak’s link walking syntax.

Thoughts, anyone?

Filed under: Experiment, Linked Data

Syndicated 2010-10-14 15:18:03 from Web of Data

Linked Enterprise Data in a nutshell

If you haven’t been living under a rock for the last weeks you might have noticed a new release of the LOD cloud diagram with some 200 datasets and some 25 billion triples. Very impressive, one may think, but let’s not forget that publishing Linked Data is not an end in itself.

So, I thought, how can I do something useful with the data? I ended up with a demo app that utilizes LOD data in an enterprise setup: the DERI guide. Essentially, it tells you where in the DERI building you can find an expert for a certain topic. So, if you have some 5min, have a look at the screen-cast:

Behind the curtain

Now, let’s take a deeper look at how the app works. The objective was clear: create a Linked Data app using LOD data with a bunch of shell scripts. And here is what the DERI guide conceptually looks like:

I’m using three datasets in this demo:

All that is needed, then, is an RDF store (I chose 4Store; easy to set up and use, at least on MacOS) to manage the data locally and a bunch of shell scripts to query the data and format the result. The data in the local RDF store (after loading it from the datasets) typically looks like this:

The main script (dg-find.sh) takes a term (such as “Linked Data”) as an input, queries the store for units that are tagged with the topic (http://dbpedia.org/resource/Linked_Data), then pulls in information from the FOAF profiles of the matching members and eventually runs it through an XSLT to produce an HTML page that opens in the default browser:

#!/bin/bash
echo "=== DERI guide v0.1"
echo "Trying to find people for topic: $1"

# Map the term to a DBpedia resource URI (all spaces become underscores)
topicURI=$( echo "http://dbpedia.org/resource/$1" | sed 's/ /_/g' )

# Ask the local store for people in units whose purpose is the topic
curl -s --data-urlencode query="SELECT DISTINCT
 ?person WHERE { ?idperson <http://www.w3.org/2002/07/owl#sameAs> ?person ;
<http://www.w3.org/ns/org#hasMembership> ?membership .
?membership <http://www.w3.org/ns/org#organization> ?org .
?org <http://www.w3.org/ns/org#purpose> <$topicURI> . }" \
http://localhost:8021/sparql/ > tmp/found-people.xml
webids=$( xsltproc get-person-webid.xsl tmp/found-people.xml )

echo "<h2>Results for: $1</h2>" >> result.html
echo "<div style='padding: 20px; width: 500px'>" >> result.html
# Render one block per matching person
for webid in $webids
do
  foaffile=$( util/getfoaflink.sh "$webid" )
  echo "Checking <$foaffile> and found WebID <$webid>"
  ./dg-initdata-person.sh "$foaffile" "$webid"
  ./dg-render-person.sh "$webid" "$topicURI"
done
echo "</div><div style='border-top: 1px solid #3e3e3e;
 padding: 5px'>Linked Data Research Centre, (c) 2010</div>" >> result.html

# Clean up and open the final page
rm tmp/found-people.xml
util/room2roomsec.sh result.html result-final.html
rm result.html
open result-final.html
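
For readers who are more at home in Python than in shell, the first two steps of the script (turning the term into a topic URI and building the form-encoded SPARQL request) can be sketched as follows. The function names are made up for this sketch; only the query text and the DBpedia URI pattern come from the script above, and nothing is actually sent to a store.

```python
from urllib.parse import urlencode


def topic_uri(term):
    """Map a term such as 'Linked Data' to a DBpedia resource URI."""
    return "http://dbpedia.org/resource/" + term.replace(" ", "_")


def people_query(topic):
    """SPARQL query for people who are members of units with the given purpose."""
    return (
        "SELECT DISTINCT ?person WHERE { "
        "?idperson <http://www.w3.org/2002/07/owl#sameAs> ?person ; "
        "<http://www.w3.org/ns/org#hasMembership> ?membership . "
        "?membership <http://www.w3.org/ns/org#organization> ?org . "
        "?org <http://www.w3.org/ns/org#purpose> <%s> . }" % topic
    )


# Form-encoded body, comparable to what curl --data-urlencode sends.
body = urlencode({"query": people_query(topic_uri("Linked Data"))})
print(body)
```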

The result for the example query ./dg-find.sh "Linked Data" yields an HTML page such as this:

Lessons learned

I was amazed by how easy and quick it was to use the data from different sources to build a shell-based app. Most of the time I spent writing the scripts (hey, I’m not a shell guru and reading the sed manual is not exactly fun) and tuning the XSLT to output some nice HTML. The actual data integration part, that is, loading the data into the store and querying it, was straight-forward (besides overcoming some inconsistencies in the data).

Of the approximately eight hours I worked on the demo, some 70% went into the former (shell scripts and XSLT), some 20% into the latter (4store handling via curl and creating the SPARQL queries) and the remaining 10% were needed to create the shiny figures and the screen-cast, above. To conclude: the only thing you really need to create a useful LOD app is a good idea of which sources to use; the rest is pretty straight-forward and, in fact, fun ;)

Filed under: Experiment, Linked Data

Syndicated 2010-09-27 10:32:51 from Web of Data

Linked Data Consumption – where are we?

These are exciting times, aren’t they? Every day new activities around Linked Data are reported.

All this happens at a rate that can be overwhelming. Hence I think one should from time to time step back and have a chilled look at where we are concerning consumption of Linked Data. In the following I try to sum up a (rather high-level) view on the current state of the art and highlight ongoing challenges:

Task Technology Examples
Discovery Follow-Your-Nose on RDF-triple-level, Sitemaps, voiD
Access OpenLink’s ODE, SQUIN, any23
Consolidation sameas.org, Sig.ma
Nurture uberblic

As you can see, the more we get away from the data (discovery, access) and move in the direction of information, the fewer solutions are available. From the perspective of an application aiming at exploiting Linked Data, it is the integrated, cleaned information that is of value, not the raw, distributed and dirty (interlinked) RDF pieces out there. In my experience, most of the consolidation and nurturing is still done on the application-level in an ad-hoc manner. There is plenty of room for frameworks and infrastructure to supply this functionality.

No matter if you’re a start-up, a first-year student or a CIO in an established company – have a look at the challenges and remember: now is the right time to come up with solutions. You and the Web will benefit from it.

Filed under: FYI, Linked Data

Syndicated 2010-06-04 08:43:56 from Web of Data

Linked Data for Dummies

Every now and then I ask myself: how would I explain the Linked Data stuff I’m doing to our children or to my parents, FWIW? So, here is an attempt to explain the Linked Data Web, and I promise that I won’t use any lingo in the following:

Imagine you’re in a huge building with several storeys, each with an incredibly large number of rooms. Each room has tons of things in it. It’s utterly dark in that building; all you can do is walk down a hallway till you bang into a door or a wall. All the rooms in the building are somehow connected but you don’t know how. Now, I tell you that in some rooms there is a treasure hidden and you’ve got one hour to find it.

Here comes the good news: you’re not left to your own resources. You have a jinn, let’s call him Goog, who will help you. Goog is able to take you instantaneously to any room once you tell him a magic word. Let’s imagine the treasure you’re after is a chocolate bar, and you tell Goog: “I want Twox”. Goog now tells you that there are 3579 rooms where there is something with “Twox” in there. So you start with the first room Goog suggests to you, and as a good jinn he of course takes you there immediately; you don’t need to walk there. Now that you’re in the room, you put everything you can grab into your rucksack and get back outside (remember, you can’t see anything in there). Once you are outside the building again and can finally see what you’ve gathered, you find out that what is in your rucksack is not really what you wanted. So, you have to get back into the building again and try the second room. Again and again, till you eventually find the Twox you want (and you are really hungry now, right?).

Now, imagine the same building but all the rooms and stairs are marked with fluorescent stripes in different colours, for example a hallway that leads you to some food is marked with a green stripe. Furthermore, the things in the rooms also have fluorescent markers in different shapes. For example, Twox chocolate bars are marked with green circles. And there is another jinn now as well: say hello to LinD. You ask LinD the same thing as Goog before: “I want Twox” and LinD asks you: do you mean Twox the chocolate bar or Twox the car? Well, the chocolate bar of course, you say, and LinD tells you: I know about 23 rooms that contain Twox chocolate bars, I will get one for you in a moment.

How can LinD do this? Is LinD so much more clever than Goog?

Well, not really. LinD does not understand what a chocolate bar is, pretty much the same as Goog does not know. However, LinD knows how to use the fluorescent stripes and markers in the building, and can thus get you directly what you want.

You see. It’s the same building and the same things in there, but with a bit of help in the form of markers we can find and gather things much quicker and with fewer disappointments involved.

In the Linked Data Web we mark the things and hallways in the building, enabling jinns such as LinD to help you find and use your treasures, as quickly and comfortably as possible and no matter where they are.

Filed under: FYI, Linked Data

Syndicated 2010-05-20 08:28:41 from Web of Data

On the usage of Linksets

Daniel Koller asked on Twitter an interesting question:

… are linksets today evaluated in an automated way? or does it depend on a person to interpret it?

Trying to answer this question here, but let’s step back a bit: in 2008, when I started to dive into ‘LOD metadata’, one of my main use cases was indeed how to automate the handling of LOD datasets. I wanted to have a formal description of a dataset’s characteristics in order to write a sort of middleware (there it is again, this bad word) that could use the dataset metadata and relieve humans of the burden of sifting through the ‘natural language’ descriptions found in the Wiki pages, such as the Dataset page.

Where are we today?

Looking at the deployment of voiD, I guess we can say that there is a certain uptake; several publishers and systems support voiD and there are dedicated voiD stores available out there, such as the Talis voiD store and the RKB voiD store.

In our LDOW2009 paper Describing Linked Datasets we outlined a couple of potential use cases for voiD and gave some examples of actual usage already. Most notably, Linksets are used for ranking of datasets (see the DING! paper) and distributed query processing.

However, to date I’m not aware of any implementation of my above outlined idea of a middleware that exploits Linksets. So, I guess one answer to Daniel’s question is: at the moment, mainly humans look at it and use it.

What can be done?

The key to voiD really is its abstraction level. We describe entire Datasets and their characteristics, not single resources such as a certain place, a book or a gene. Understanding that the links are the essence in a truly global-distributed information space, one can see that the Linksets are the key to automatically process the LOD datasets, as they bear the high-level metadata about the interlinking.

When you write an application today that consumes data from the LOD cloud, you need to manually code which datasets you are going to use. Now, imagine a piece of software that really operates on Linksets: suddenly, it would be possible to specify certain requirements and capabilities (such as: ‘needs to be linked with some geo data and with statistical data’) and dynamically plug in matching datasets. Of course, towards realising this vision, there are other problems to overcome (for example concerning the supported vocabularies vs. SPARQL queries used in the application). However, at least to me, this is a very appealing area, worth investing more resources in.
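
To make the idea a bit more tangible, here is a toy sketch in Python. The dataset descriptions are hand-written stand-ins for voiD metadata (names, topics and linksets are all invented for this example), and the matching logic is just one simple way such a piece of software could select datasets.

```python
# Hypothetical dataset descriptions in the spirit of voiD: each dataset has
# topics and linksets that point at other datasets. All names are invented.
DATASETS = {
    "people": {"topics": {"people"}, "linksets": {"places", "stats"}},
    "places": {"topics": {"geo"}, "linksets": set()},
    "stats": {"topics": {"statistics"}, "linksets": {"places"}},
    "books": {"topics": {"publications"}, "linksets": {"people"}},
}


def linked_topics(name, datasets):
    """Topics of all datasets that `name` is interlinked with via a linkset."""
    topics = set()
    for target in datasets[name]["linksets"]:
        topics |= datasets[target]["topics"]
    return topics


def matching_datasets(required_topics, datasets):
    """Datasets whose linksets cover every required topic."""
    return sorted(
        name
        for name in datasets
        if required_topics <= linked_topics(name, datasets)
    )


# 'needs to be linked with some geo data and with statistical data'
print(matching_datasets({"geo", "statistics"}, DATASETS))
```

The application states what it needs in terms of interlinking, and the selection of concrete datasets happens at run time instead of being hard-coded.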

I hope this answers your question, Daniel, and I’m happy to keep you posted concerning the progress in this area.

Filed under: Linked Data, voiD

Syndicated 2010-05-19 08:20:09 from Web of Data

Oh – it is data on the Web

A little story about OData and Linked Data …

Others already gave some high-level overviews of OData and Linked Data, but I was interested in two concrete questions: how to utilise OData in the Linked Data Web and how to turn Linked Data into OData.

As already mentioned, I consider Atom, which forms one core bit of OData, to be hyperdata that allows publishing data in the Web, not only on the Web. And indeed, the first OData example I examined (http://odata.netflix.com/Catalog/People) looked quite promising:

<title type="text">George Abbott</title>
<name />
<link rel="edit" title="Person" href="People(196)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/Awards" type="application/atom+xml;type=feed" title="Awards" href="People(196)/Awards" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/TitlesActedIn" type="application/atom+xml;type=feed" title="TitlesActedIn" href="People(196)/TitlesActedIn" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/TitlesDirected" type="application/atom+xml;type=feed" title="TitlesDirected" href="People(196)/TitlesDirected" />
<category term="NetflixModel.Person" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<d:Id m:type="Edm.Int32">196</d:Id>
<d:Name>George Abbott</d:Name>

Note that there is a URI in the id element that can be used as entity URI, and also link/@rel values that can be exploited as typed links. I ran it through OpenLink’s URI Burner (result) and hacked a little XSLT that picks out the relevant bits, just to see what an RDF version might look like. Though the @rel values do not dereference (try it out yourself: http://schemas.microsoft.com/ado/2007/08/dataservices/related/Awards), I thought, well, we can still handle it somehow as Linked Data.
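
Picking out the typed links does not need much code. The sketch below does it in Python instead of XSLT and is not the stylesheet I hacked together: the entry is wrapped in a minimal entry element with the Atom namespace so that it parses, and the base URI used to resolve the relative href values is an assumption.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

ATOM = "{http://www.w3.org/2005/Atom}"

# Part of the entry from above; the <entry> wrapper and its namespace
# declaration are added here (an assumption) to get well-formed XML.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
<title type="text">George Abbott</title>
<link rel="edit" title="Person" href="People(196)" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/Awards" type="application/atom+xml;type=feed" title="Awards" href="People(196)/Awards" />
<link rel="http://schemas.microsoft.com/ado/2007/08/dataservices/related/TitlesActedIn" type="application/atom+xml;type=feed" title="TitlesActedIn" href="People(196)/TitlesActedIn" />
</entry>"""


def typed_links(entry_xml, base):
    """Return (rel URI, absolute target URI) pairs for URI-valued @rel links."""
    entry = ET.fromstring(entry_xml)
    links = []
    for link in entry.findall(ATOM + "link"):
        rel = link.get("rel", "")
        if rel.startswith("http://"):  # skip plain keywords such as 'edit'
            links.append((rel, urljoin(base, link.get("href"))))
    return links


# Assumed base URI, derived from the feed address mentioned above.
for predicate, target in typed_links(ENTRY, "http://odata.netflix.com/Catalog/"):
    print(predicate, target)
```

Each pair can serve as predicate and object of a triple, with the entity URI from the id element (not part of this excerpt) as the subject.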

Then, I looked at some more OData examples, just to find out that almost all of the other examples from the OData sources more or less look like the following (from http://datafeed.edmonton.ca/v1/coe/BusStops):

<entry m:etag="W/&quot;datetime'2010-01-14T22%3A43%3A35.7527659Z'&quot;">
<title type="text"></title>
<name />
<link rel="edit" title="BusStops" href="BusStops(PartitionKey='1000',RowKey='3b57b81c-8a36-4eb7-ac7f-31163abf1737')" />
<category term="OGDI.coe.BusStopsItem" scheme="http://schemas.microsoft.com/ado/2007/08/dataservices/scheme" />
<content type="application/xml">
<d:Timestamp m:type="Edm.DateTime">2010-01-14T22:43:35.7527659Z</d:Timestamp>
<d:entityid m:type="Edm.Guid">b0d9924a-8875-42c4-9b1c-246e9f5c8e49</d:entityid>
<d:avenue>Transit Centre</d:avenue>
<d:latitude m:type="Edm.Double">53.57196999</d:latitude>
<d:longitude m:type="Edm.Double">-113.3901687</d:longitude>
<d:elevation m:type="Edm.Double">0</d:elevation>

What you immediately see is the XML payload in the content element, making heavy use of elements in the d: and m: namespaces. Both namespace URIs return a 404 and hence do not allow me to learn more about the schema (besides the fact that they are centrally maintained by Microsoft).
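
In practice this means a consumer has to hard-code its knowledge of the payload. The following Python sketch shows what that looks like for a few of the bus stop properties; the content wrapper with its namespace declarations and the small type table are assumptions of this sketch, not something the feed documents.

```python
import xml.etree.ElementTree as ET

D = "http://schemas.microsoft.com/ado/2007/08/dataservices"
M = "http://schemas.microsoft.com/ado/2007/08/dataservices/metadata"

# A few properties from the entry above; the wrapper element and the
# namespace declarations are added here to make the fragment parse.
CONTENT = """<content type="application/xml" xmlns:d="%s" xmlns:m="%s">
<d:avenue>Transit Centre</d:avenue>
<d:latitude m:type="Edm.Double">53.57196999</d:latitude>
<d:longitude m:type="Edm.Double">-113.3901687</d:longitude>
<d:elevation m:type="Edm.Double">0</d:elevation>
</content>""" % (D, M)

# Hard-coded knowledge about m:type values; anything unknown stays a string.
CASTS = {"Edm.Double": float, "Edm.Int32": int}


def properties(content_xml):
    """Turn the d: elements into a dict, converting values by their m:type."""
    result = {}
    for child in ET.fromstring(content_xml):
        name = child.tag.split("}", 1)[1]  # drop the '{namespace}' prefix
        cast = CASTS.get(child.get("{%s}type" % M), str)
        result[name] = cast(child.text)
    return result


print(properties(CONTENT))
```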

So, what does this all mean?

Imagine a Web (a Web of Documents, if you wish), which is not based on HTML and hyperlinks, but on MS Word documents. The documents are all available on the Internet, so you can download them and consume the content. But after you’re done with a certain document that talks about a book, how do you learn more about it? For example, reviews about the book or where you can purchase it? Maybe the original document mentions that there is some more related information on another server. So you’d need to go there and look for the related bit of information yourself. You see? That’s what the Web is great at – you just click on a hyperlink and it takes you to the document (or section) you’re interested in. All the legwork is taken care of for you through HTML, URIs and HTTP.

Hm, right, but how is this related to OData?

Well, OData feels a bit like the above mentioned scenario, just concerning data. Of course you – well actually rather a software program I guess – can consume it (a single source), but that’s it. To sum up my impression so far:

  • OData enables publishing structured data on the Web and theoretically in the Web (what’s the difference?);
  • OData uses Atom (and APP) as a framework with the actual data as (proprietary) XML payload;
  • OData typically creates data silos; discovering data beyond a single source is, nicely put, not easy;
  • Creating Linked Data from OData does not seem to be a promising route;
  • Creating OData from Linked Data seems feasible and is desirable, in order to leverage tools such as Pivot.

Regarding the last bullet point, the ‘how to turn Linked Data into OData’, I will do some further research and keep you posted, here.

Filed under: FYI, Linked Data

Syndicated 2010-04-14 08:48:50 from Web of Data
