Older blog entries for mhausenblas (starting at number 44)

Can NoSQL help us in processing Linked Data?

This is an announcement and call for feedback. Over the past couple of days I’ve compiled a short review article where I look into NoSQL solutions and to what extent they can be used to process Linked Data.

I’d like to extend and refine this article, but this only works if you share your experiences and let me know what I’m missing and where I’m maybe totally wrong.

If you just want to read it, use the following link: NoSQL solutions for Linked Data processing (read-only Web page).

If you want to provide feedback or rectify stuff I wrote, use: NoSQL solutions for Linked Data processing (Google Docs with discussion enabled).

Thanks, and enjoy reading as well as commenting on the article!


Filed under: Announcement, Linked Data

Syndicated 2011-05-02 20:30:55 from Web of Data

From CSV data on the Web to CSV data in the Web

In our daily work with Government data such as statistics, geographical data, etc., we often deal with Comma-Separated Values (CSV) files. They are really handy as they are easy to produce and to consume: almost any language and platform I have come across so far has some support for parsing CSV files, and I can export CSV files from virtually any sort of (serious) application.

There is even a – probably not widely known – standard for CSV files (RFC 4180) that specifies the grammar and registers the normative MIME media type text/csv for CSV files.

So far, so good.

From a Web perspective, CSV files really are data objects, which however are rather coarse-grained. If I want to use a CSV file, I always have to use the entire file. There is no agreed-upon concept that allows me to refer to a certain cell, row or column. This was my main motivation to start working on what I called Addrable (from Addressable Table) earlier this year. I essentially hacked together a rather simple implementation of Addrables in JavaScript that understands URI fragment identifiers such as:

  • #col:temperature
  • #row:10
  • #where:city=Galway,reporter=Richard

Let’s have a closer look at what the result of the processing of such a fragment identifier against an example CSV file could be. I’m going to use the last one in the list above, that is, addressing a slice where the city column has the value ‘Galway’ and for the reporter column we ask it to be ‘Richard’.

The client-side implementation in jQuery provides a visual rendering of the selected part, see below a screen-shot (if you want to toy around with it, either clone or download it and open it locally in your browser):

There is also a server-side implementation using node.js available (deployed at addrable.no.de), outputting JSON:

{
  "header":
    ["date","temperature"],
  "rows":
    [
      ["2011-03-01", "2011-03-02", "2011-03-03"],
      ["4","10","5"]
    ]
}

Note: the processing of the fragment identifier is meant to be performed by the User Agent after the retrieval action has been completed. However, the server-side implementation demonstrates a workaround for the fact that the fragment identifier is not sent to the Server (see also the related W3C document on Repurposing the Hash Sign for the New Web).
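To make the #where: selector more concrete, here is a minimal sketch in Python. It is not the Addrable implementation (that one is in JavaScript); the function names and the sample CSV are made up for illustration, and it assumes that the columns fixed by the selector are dropped from the result. Unlike the JSON above, it returns the matching values row by row.

```python
import csv
import io


def parse_where(fragment):
    """Turn '#where:city=Galway,reporter=Richard' into {'city': 'Galway', ...}."""
    prefix = "#where:"
    if not fragment.startswith(prefix):
        raise ValueError("only #where: selectors are handled in this sketch")
    pairs = fragment[len(prefix):].split(",")
    return dict(pair.split("=", 1) for pair in pairs)


def select_slice(csv_text, fragment):
    """Return the header and the rows that satisfy all selector conditions.

    Columns named in the selector are left out of the result (an assumption
    of this sketch), because their values are fixed by the selector.
    """
    conditions = parse_where(fragment)
    reader = csv.DictReader(io.StringIO(csv_text))
    header = [name for name in reader.fieldnames if name not in conditions]
    rows = [
        [record[name] for name in header]
        for record in reader
        if all(record[col] == val for col, val in conditions.items())
    ]
    return {"header": header, "rows": rows}


# Hypothetical sample data, not the CSV file used in the post.
SAMPLE = """date,city,reporter,temperature
2011-03-01,Galway,Richard,4
2011-03-01,Berlin,Richard,8
2011-03-02,Galway,Richard,10
2011-03-03,Galway,Richard,5
2011-03-03,Galway,Michael,6
"""

result = select_slice(SAMPLE, "#where:city=Galway,reporter=Richard")
print(result)
```

The #col: and #row: selectors could be handled along the same lines by dispatching on the prefix of the fragment identifier.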

Fast forwarding a couple of weeks.

Now, having an implementation is fine, but why not push the envelope and take it a step further, in order to help make the Web a better place?

Enter Erik Wilde, who did ‘URI Fragment Identifiers for the text/plain Media Type’ aka RFC 5147 some three years ago; and yes, I admit I was a bit biased already through my previous contributions to the Media Fragments work. We decided to join forces to work on ‘text/csv Fragment Identifiers’, based on the Addrable idea.

As a first step (well, besides the actual writing of the Internet-Draft to be submitted to the IETF) I had a quick look at what we can expect in terms of deployment. That is, a rather quick and naive survey based on some 60 CSV files manually harvested from the Web. The following figure gives you a rough idea of what is going on:

To sum up the preliminary findings: almost half of the CSV files are (wrongly) served with text/plain, followed by some other non-conforming and partially exotic media types such as text/x-comma-separated-values. The bottom line is: only 10% of the CSV files are served correctly with text/csv. Why do we care, you ask? Well, for example, because the spec says that the header row is optional, but its presence can be flagged by an optional header parameter on the text/csv media type (which travels in the HTTP Content-Type header). Just wondering what the chances are ;)

Now, I admit that my sample here is rather small, but I think the distribution will roughly stay the same. By the way, is anyone aware of a good way to find CSV files, besides filetype:csv in Google or contains:csv in Bing, as I did?
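For anyone who wants to repeat such a survey, the tallying step can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names are mine, and the list of Content-Type values is invented and is not the data behind the figure. It also shows how the optional header parameter that RFC 4180 defines for text/csv could be read.

```python
from collections import Counter


def parse_content_type(value):
    """Split a Content-Type value into (media type, parameter dict)."""
    media_type, *raw_params = [part.strip() for part in value.split(";")]
    params = {}
    for raw in raw_params:
        if "=" in raw:
            name, val = raw.split("=", 1)
            params[name.strip().lower()] = val.strip().strip('"')
    return media_type.lower(), params


def tally(content_types):
    """Count how often each media type occurs, ignoring parameters."""
    return Counter(parse_content_type(value)[0] for value in content_types)


# Hypothetical Content-Type values as they might come back for CSV files.
observed = [
    "text/plain",
    "text/csv; header=present",
    "text/x-comma-separated-values",
    "text/plain; charset=utf-8",
    "application/octet-stream",
]

counts = tally(observed)
media_type, params = parse_content_type("text/csv; header=present")
print(counts.most_common())
print(media_type, params.get("header"))
```

The Content-Type values themselves would have to be collected separately, for example with HEAD requests against the harvested URLs.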

We’d be glad to hear from you – do you think this is useful for your application? If yes, why? How would you use it? Or, maybe you want to do a proper CSV crawl to help us with the analysis?


Filed under: Announcement, FYI, Idea, IETF

Syndicated 2011-04-16 12:43:35 from Web of Data

CfP: 2nd International Workshop on RESTful Design, Hyderabad, India

If you’re into RESTful stuff, no matter if you’re a researcher or practitioner, consider submitting a paper to our WWW2011 Workshop on RESTful Design (see the Call for Papers for more details on how to participate).

I’m very happy to see the workshop taking place again this year, after the huge success we had last year, and I’m honored to serve on the Program Committee together with people like Jan Algermissen, Mike Amundsen, Joe Gregorio, Stefan Tilkov or Yves Lafon, just to name a few ;)

Hope to see you in India!


Filed under: Announcement

Syndicated 2011-01-06 12:03:13 from Web of Data

2010 in review

The stats helper monkeys at WordPress.com mulled over how this blog did in 2010, and here’s a high level summary of its overall blog health:

Healthy blog!

The Blog-Health-o-Meter™ reads Wow.

Crunchy numbers

The average container ship can carry about 4,500 containers. This blog was viewed about 18,000 times in 2010. If each view were a shipping container, your blog would have filled about 4 fully loaded ships.

In 2010, there were 21 new posts, growing the total archive of this blog to 59 posts. There were 6 pictures uploaded, taking up a total of 2 MB.

The busiest day of the year was February 12th with 449 views. The most popular post that day was Is Google a large-scale contributor to the LOD cloud?.

Where did they come from?

The top referring sites in 2010 were Google Reader, twitter.com, planetrdf.com, linkeddata.org, and sqlblog.com.

Some visitors came searching, mostly for data life cycle, web of data, sparql, hateos, and morphological analysis.

Attractions in 2010

These are the posts and pages that got the most views in 2010.

1. Is Google a large-scale contributor to the LOD cloud? (February 2010, 7 comments)
2. Oh – it is data on the Web (April 2010, 26 comments)
3. Towards Web-based SPARQL query management and execution (April 2010, 10 comments)
4. Linked Data for Dummies (May 2010, 6 comments)
5. Linked Enterprise Data in a nutshell (September 2010, 4 comments)


Filed under: FYI

Syndicated 2011-01-02 07:44:25 from Web of Data

Processing the LOD cloud with BigQuery

Google’s BigQuery is a large-scale, interactive query environment that can handle billions of records in seconds. Now, wouldn’t it be cool to process the 26+ billion triples from the LOD cloud with BigQuery?

I guess so ;)

So, I took a first step in this direction by setting up the BigQuery for Linked Data project containing:

  • A Python script called nt2csv.py that converts RDF/NTriples into BigQuery-compliant CSV;
  • BigQuery schemas that can be used together with the CSV data from above;
  • Step-by-step instructions on how to use nt2csv.py along with Google’s gsutil and bq command line tools to import the above data into Google Storage and issue a query against the uploaded data in BigQuery.
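To give an idea of what the conversion step involves, here is a rough Python sketch of turning N-Triples into three-column CSV (subject, predicate, object). It is not the nt2csv.py script from the project; the names are made up, and it makes simplifying assumptions: one triple per line, datatypes and language tags are dropped, and escape sequences inside literals are left untouched.

```python
import csv
import io
import re

TRIPLE = re.compile(
    r'^\s*(<[^>]*>|_:\S+)\s+'      # subject: URI or blank node
    r'(<[^>]*>)\s+'                # predicate: URI
    r'(<[^>]*>|_:\S+|".*")'        # object: URI, blank node or literal
    r'(?:\^\^<[^>]*>|@[\w-]+)?'    # optional datatype or language tag (dropped)
    r'\s*\.\s*$'                   # terminating dot
)


def strip_term(term):
    """Remove the angle brackets or quotes around an N-Triples term."""
    if term.startswith("<") or term.startswith('"'):
        return term[1:-1]
    return term  # blank node label, kept as-is


def nt_to_csv(nt_text):
    """Convert N-Triples text into CSV with one triple per row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in nt_text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE.match(line)
        if match is None:
            raise ValueError("cannot parse line: " + line)
        writer.writerow(strip_term(term) for term in match.groups())
    return out.getvalue()


SAMPLE_NT = (
    '<http://sw-app.org/mic.xhtml#i> <http://xmlns.com/foaf/0.1/name> "Michael Hausenblas" .\n'
    '<http://sw-app.org/mic.xhtml#i> <http://xmlns.com/foaf/0.1/knows> '
    '<http://richard.cyganiak.de/foaf.rdf#cygri> .\n'
)

csv_text = nt_to_csv(SAMPLE_NT)
print(csv_text)
```

For real-world dumps, a proper N-Triples parser is the safer choice; the regular expression above is only meant to show the shape of the mapping.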

Essentially, one can – given an account for Google Storage as well as an account for BigQuery – do the following:

bq query \
"SELECT object FROM [mybucket/tables/rdf/tblNames]
WHERE predicate = 'http://xmlns.com/foaf/0.1/knows'
LIMIT 10"

… which roughly translates into the following SPARQL query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?o
WHERE {
?s foaf:knows ?o .
}
LIMIT 10

Currently, I do possess a Google Storage account, but unfortunately not a BigQuery account (yeah, I’ve signed up but I’m still in the queue). So, I can’t really test this stuff – any takers?


Filed under: Experiment, Linked Data

Syndicated 2010-12-13 08:54:05 from Web of Data

Open data is the electricity of the 21st century

As I said earlier, today:

Open data is the electricity of the 21st century.

How do I come to this conclusion, you ask?

Well, imagine for a moment that all the electricity on earth were switched off; as an aside: this is unfortunately not an entirely theoretical thing (cf. EMP). What would happen? No, I’m not talking about you missing your favourite TV show. There are serious consequences to be expected, such as people suffering in hospitals and planes crashing, essentially causing our civilisation to grind to a halt. Now, I hope you can acknowledge the pervasiveness of electricity and our dependence on it.

But how does this relate to open data?

Both electricity and open data share a couple of features:

  • You need an infrastructure (generation, distribution, etc.) to be able to benefit from it.
  • On its own it’s pretty useless. You need ‘applications’ to exploit it.
  • You notice it only as soon as it is not available (anymore).

Concerning electricity, a lot of people had numerous ideas on how to utilise it, long before it reached wide adoption. They had to overcome serious obstacles, and there were existential fights about deployment and which technologies to use (read more in The Story of Electricity by John Munro).

Now, just like electricity, open data is about to become ubiquitous these days, be it from governments or from private entities that, for example, seek to optimise their Web presence. And there are and will be discussions about how to best expose the data (on the Web).

Note that I’m not trying to advocate that all data on earth should be open to everyone. This is maybe the biggest difference to electricity. There are cases (and let’s be honest, quite a few) where the privacy, concerning a person or an organisation, must take precedence over the ‘openness’ of the data. But let this be no excuse to not publish your data on the Web, if there are no privacy concerns.

All that is missing now are applications, on a large scale, that utilise the open data. Think about it, the next time you switch on your TV or use your oven to prepare a meal for the ones you love ;)


Filed under: Linked Data

Syndicated 2010-11-20 11:40:35 from Web of Data

Linked Open Data star scheme by example

I like TimBL’s 5-star deployment scheme for Linked Open Data. However, every time I use it to explain the migration path from ‘no-data-on-the-Web’ to the ‘Full Monty’, no matter if to students, in training sessions or to industry partners, there comes a point where it would be very handy to refer to a concrete example that demonstrates the entire scheme.

Well, there we go. At

http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/

you can find the examples for the 5-star scheme, ranging from a PDF to full-blown Linked Data (in RDFa).

Now, just for fun – what will the minimum temperature tomorrow be in Galway? See the power of Linked Open Data in action.

… and in case you wanna play around with the data yourself, here is the SPARQL query for the previous answer:


PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX meteo: <http://purl.org/ns/meteo#>
PREFIX : <http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/gtd-5.html#>
SELECT ?tempC from <http://any23.org/rdfxml/http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/gtd-5.html>
WHERE {
:Galway meteo:forecast ?forecast .
?forecast meteo:predicted "2010-11-13T00:00:00Z"^^xsd:dateTime ;
meteo:temperature ?temp .
?temp meteo:celsius ?tempC .
}

Note: to execute the SPARQL query, paste it for example into http://sparql.org/sparql.html and toy around with the patterns.


Filed under: demo, Linked Data

Syndicated 2010-11-12 09:22:10 from Web of Data

Quick RDFa profiling report

The other day at semantic-web@w3.org, William Waites asked how to link to an RDF serialisation from an HTML document (if I understood him correctly) as he seems not to be too much into RDFa. While the answer to this one is IMHO rather straight-forward (use <link href="yadayada.rdf" rel="alternate" type="application/rdf+xml" />), the follow-up discussion reminded me of the section in our Linked Data with RDFa tutorial on Usability Issues (note that this document will soon be moved to another location):

One practical issue you may want to check against sometimes occurs with fine-grained, high-volume datasets. Imagine a detailed description of audio-visual content (say, a one hour video incl. audio track) in RDF, or, equally, a detailed RDF representation of a multi-dimensional table of statistics, with dozens of columns and potentially thousands of rows. In both cases, one ends up with potentially many triples, which might mean some 100k triples or more. As both humans and machines are expected to consume the RDFa document, one certainly has to find a trade-off between using RDFa for the entire description (meaning to embed all triples in the HTML document) and an entirely externalised solution, for example using RDF/XML:

We also give a rough guideline on how to decide how much is too much:

… having the entire RDF graph embedded certainly is desirable, however, one has to check the usability of the site. Usability expert Jakob Nielsen advocates a size limit for Web pages yielding an approximate 10 sec response time limit. Based on this we propose to perform a simple sort of response time testing, once with the plain HTML page and once with the embedded RDF graph. In case of a significant difference, one should contemplate if the all-in-RDFa approach is appropriate for the use case at hand.

Now, I wanted to get some real figures regarding how the number of triples embedded with RDFa impacts the loading time of an HTML page in a browser and did the following: I loaded some 17 cities from DBpedia (such as Amsterdam) into an RDF store and created a number of generic RDFa+HTML documents essentially with:

SELECT * WHERE { ?s ?p ?o } LIMIT $SIZE

… where $SIZE would range from 10 to 20,000 – each triple looks essentially like:

<div about='http://dbpedia.org/resource/William_Howitt'>
<a rel='dbp:deathPlace'
  href='http://dbpedia.org/resource/Rome'>

http://dbpedia.org/resource/Rome

</a>
</div>
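The generation of these test documents can be sketched roughly as follows in Python. This is not the code I used for the experiment; the function names are hypothetical, and the sketch only covers triples whose object is a URI, with the predicate given as a CURIE whose prefix is assumed to be declared elsewhere in the page.

```python
from html import escape


def triple_to_rdfa(subject, predicate_curie, obj):
    """Render one URI-to-URI triple as an RDFa div like the one shown above."""
    s, p, o = (escape(value, quote=True) for value in (subject, predicate_curie, obj))
    return (
        "<div about='{s}'>\n"
        "<a rel='{p}'\n"
        "  href='{o}'>{o}</a>\n"
        "</div>"
    ).format(s=s, p=p, o=o)


def build_body(triples, size):
    """Concatenate the markup for the first `size` triples (cf. LIMIT $SIZE)."""
    return "\n".join(triple_to_rdfa(*triple) for triple in triples[:size])


triples = [
    ("http://dbpedia.org/resource/William_Howitt",
     "dbp:deathPlace",
     "http://dbpedia.org/resource/Rome"),
] * 20  # repeat one triple to get a list of any desired length

body = build_body(triples, 10)
print(body.count("<div"))
```

Wrapping the body in an HTML skeleton with the needed prefix declarations is left out here.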

Then I used Firebug with the NetExport extension and a shell script to gather the load time. The (raw) results are available online as well as two figures that give a rough idea of what is happening:

Note the following regarding the test setup: I did a local test (no network dependencies) with all caches turned off; the tests were performed with Firefox 3.6 on MacOS 10.5.8 (2.53 GHz Intel Core 2 Duo with 4GB/1067 MHz DDR3 RAM on board). Each document had five runs; the numbers above show the averages over the runs.


Filed under: Experiment, Linked Data

Syndicated 2010-10-26 14:27:03 from Web of Data

Toying around with Riak for Linked Data

So I stumbled upon Rob Vesse’s tweet the other day, where he said he was about to use MongoDB for storing RDF. A week earlier I watched a nice video about links and link walking in Riak, “a Dynamo-inspired key/value store that scales predictably and easily” (see also the Wiki doc).

Now, I was wondering what it takes to store an RDF graph in Riak using Link headers. Let me say that it was very easy to install Riak and to get started with the HTTP interface.

The main issue then was how to map the RDF graph into Riak buckets, objects and keys. Here is what I came up with so far: I use an RDF resource-level approach with a special object key that I called :id, which is the RDF resource URI or the bNode. Further, in order to maintain the graph provenance, I store the original RDF document URI in the metadata of the Riak bucket. Each RDF resource is mapped into a Riak object; for each literal RDF object value, the literal value is stored directly via a Riak object key; for each resource object (URI ref or bNode), I use a Link header.

Enough words. Action.

Take the following RDF graph (in Turtle):


@prefix foaf: <http://xmlns.com/foaf/0.1/>.
@prefix : <http://sw-app.org/mic.xhtml#>.

:i foaf:name "Michael Hausenblas" ;
foaf:knows <http://richard.cyganiak.de/foaf.rdf#cygri> .

To store the above RDF graph in Riak I would then use the following curl commands:

curl -X PUT -d 'Michael Hausenblas' http://127.0.0.1:8098/riak/res0/foaf:name


curl -X PUT -d 'http://sw-app.org/mic.xhtml#i' http://127.0.0.1:8098/riak/res0/:id


curl -X PUT -d 'http://richard.cyganiak.de/foaf.rdf#cygri' http://127.0.0.1:8098/riak/res1/:id


curl -X PUT -d 'http://sw-app.org/mic.xhtml#i' -H "Link: </riak/res1/:id>; riaktag=\"foaf:knows\"" http://127.0.0.1:8098/riak/res0/:id
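The mapping behind these curl commands can be sketched in Python as a function that turns a list of triples into descriptions of HTTP requests. This is a sketch under assumptions, not a tested Riak client: the function name and the tuple layout are mine, nothing is sent over the network, and all links of a resource are put into a single comma-separated Link header, so the two PUTs to res0/:id above collapse into one.

```python
def riak_requests(triples, base="/riak"):
    """Return (method, path, headers, body) tuples for a list of triples.

    Each triple is (subject, predicate, object, object_is_literal), with the
    predicate already shortened to a prefixed name such as 'foaf:knows'.
    """
    buckets = {}   # resource URI -> bucket name such as 'res0'
    links = {}     # resource URI -> list of Link header values
    requests = []

    def bucket_for(resource):
        if resource not in buckets:
            buckets[resource] = "res{}".format(len(buckets))
            links[resource] = []
        return buckets[resource]

    for subject, predicate, obj, is_literal in triples:
        s_bucket = bucket_for(subject)
        if is_literal:
            # Literal values are stored under the predicate as key.
            requests.append(("PUT", f"{base}/{s_bucket}/{predicate}", {}, obj))
        else:
            # Resource objects become links from the subject's :id object.
            o_bucket = bucket_for(obj)
            links[subject].append(f'<{base}/{o_bucket}/:id>; riaktag="{predicate}"')

    # One :id object per resource, carrying all of its outgoing links.
    for resource, bucket in buckets.items():
        headers = {"Link": ", ".join(links[resource])} if links[resource] else {}
        requests.append(("PUT", f"{base}/{bucket}/:id", headers, resource))
    return requests


graph = [
    ("http://sw-app.org/mic.xhtml#i", "foaf:name", "Michael Hausenblas", True),
    ("http://sw-app.org/mic.xhtml#i", "foaf:knows",
     "http://richard.cyganiak.de/foaf.rdf#cygri", False),
]

requests = riak_requests(graph)
for request in requests:
    print(request)
```

Each tuple corresponds to one curl -X PUT call against http://127.0.0.1:8098, with the body passed via -d and the headers via -H.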

Then, querying the store is straight-forward, like this (here: list all people I know):

curl http://127.0.0.1:8098/riak/res0/:id/_,foaf:knows,_

Yes, I know, the prefixes like foaf: etc. need to be taken care of (but that’s rather easy; they can be put in the bucket’s metadata as well, along with the prefix.cc service). Further, the bNodes might cause trouble. And there is no smushing via owl:sameAs or IFPs (yet). But the most challenging area is maybe how to map a SPARQL query onto Riak’s link walking syntax.

Thoughts, anyone?


Filed under: Experiment, Linked Data

Syndicated 2010-10-14 15:18:03 from Web of Data

Linked Enterprise Data in a nutshell

If you haven’t been living under a rock for the last weeks you might have noticed a new release of the LOD cloud diagram with some 200 datasets and some 25 billion triples. Very impressive, one may think, but let’s not forget that publishing Linked Data is not an end in itself.

So, I thought, how can I do something useful with the data? I ended up with a demo app that utilizes LOD data in an enterprise setup: the DERI guide. Essentially, what it does is tell you where in the DERI building you find an expert for a certain topic. So, if you have some 5 min, have a look at the screen-cast:

Behind the curtain

Now, let’s take a deeper look at how the app works. So, the objective was clear: create a Linked Data app using LOD data with a bunch of shell scripts. And here is what the DERI guide conceptually looks like:

I’m using three datasets in this demo:

All that is needed, then, is an RDF store (I chose 4Store; easy to set up and use, at least on MacOS) to manage the data locally and a bunch of shell scripts to query the data and format the result. The data in the local RDF store (after loading it from the datasets) typically looks like this:

The main script (dg-find.sh) takes a term (such as “Linked Data”) as an input, queries the store for units that are tagged with the topic (http://dbpedia.org/resource/Linked_Data), then pulls in information from the FOAF profiles of the matching members and eventually runs it through an XSLT to produce an HTML page that opens in the default browser:

clear
echo "=== DERI guide v0.1"
echo "Trying to find people for topic: $1"

# Map the term onto a DBpedia URI, replacing every space with an underscore
topicURI=$( echo "http://dbpedia.org/resource/$1" | sed 's/ /_/g' )

# Ask the local SPARQL endpoint for people in units tagged with the topic
curl -s --data-urlencode query="SELECT DISTINCT
 ?person WHERE { ?idperson <http://www.w3.org/2002/07/owl#sameAs> ?person ;
<http://www.w3.org/ns/org#hasMembership> ?membership .
?membership <http://www.w3.org/ns/org#organization> ?org .
?org <http://www.w3.org/ns/org#purpose> <$topicURI> . }" \
http://localhost:8021/sparql/ > tmp/found-people.xml
webids=$( xsltproc get-person-webid.xsl tmp/found-people.xml )

# Build the result page, one rendered entry per matching person
echo "<h2>Results for: $1</h2>" >> result.html
echo "<div style='padding: 20px; width: 500px'>" >> result.html
for webid in $webids
do
foaffile=$( util/getfoaflink.sh "$webid" )
echo "Checking <$foaffile> and found WebID <$webid>"
./dg-initdata-person.sh "$foaffile" "$webid"
./dg-render-person.sh "$webid" "$topicURI"
done
echo "</div><div style='border-top: 1px solid #3e3e3e;
 padding: 5px'>Linked Data Research Centre, (c) 2010</div>" >> result.html

# Clean up and open the final page in the default browser
rm tmp/found-people.xml
util/room2roomsec.sh result.html result-final.html
rm result.html
open result-final.html
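For readers who are more at home in Python than in shell, the first half of the script (building the topic URI and the SPARQL request) can be sketched as follows. The function names are made up for this illustration, the query is the same one as in the script, and nothing is actually sent to the store.

```python
from urllib.parse import urlencode

QUERY_TEMPLATE = """SELECT DISTINCT ?person WHERE {{
  ?idperson <http://www.w3.org/2002/07/owl#sameAs> ?person ;
            <http://www.w3.org/ns/org#hasMembership> ?membership .
  ?membership <http://www.w3.org/ns/org#organization> ?org .
  ?org <http://www.w3.org/ns/org#purpose> <{topic_uri}> .
}}"""


def topic_uri(term):
    """Map a term such as 'Linked Data' onto its DBpedia resource URI."""
    return "http://dbpedia.org/resource/" + term.strip().replace(" ", "_")


def sparql_request_body(term):
    """Return the form-encoded body that would be POSTed to the SPARQL endpoint."""
    query = QUERY_TEMPLATE.format(topic_uri=topic_uri(term))
    return urlencode({"query": query})


body = sparql_request_body("Linked Data")
print(topic_uri("Linked Data"))
print(body[:40])
```

Note that every space in the term is replaced, so multi-word topics with more than two words also map onto a valid DBpedia URI.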

The example query ./dg-find.sh "Linked Data" yields an HTML page such as this:

Lessons learned

I was amazed at how easy and quick it was to use the data from different sources to build a shell-based app. Most of the time I spent writing the scripts (hey, I’m not a shell guru and reading the sed manual is not exactly fun) and tuning the XSLT to output some nice HTML. The actual data integration part, that is, loading the data into the store and querying it, was straight-forward (besides overcoming some inconsistencies in the data).

Of the approximately eight hours I worked on the demo, some 70% went into the former (shell scripts and XSLT), some 20% into the latter (4store handling via curl and creating the SPARQL queries) and the remaining 10% were needed to create the shiny figures and the screen-cast, above. To conclude: the only thing you really need to create a useful LOD app is a good idea of which sources to use; the rest is pretty straight-forward and, in fact, fun ;)


Filed under: Experiment, Linked Data

Syndicated 2010-09-27 10:32:51 from Web of Data
