10 May 2011 danbri   » (Journeyer)

Exploring Linked Data with Gremlin

Gremlin is an opensource Java/Groovy system for traversing graphs, including but not limited to RDF graphs. This post is just a log of running some examples from @twarko and the Gremlin wiki and mailing list. The test run below goes pretty slowly, since it uses the Web as its database, via entry-by-entry fetches. In this case it’s fetching from DBpedia, but I’ve ran it with Freebase happily too. The on-demand RDF is handled by the Linked Data Sail; the same thing would work directly against a graph database.

Why is this interesting? Let me see if I can spell out what it’s doing. I’ll edit this post if I screw up …

Ok so the basic thing is that we start exploring the graph from one vertice, ‘v’, representing Stephen fry’s dbpedia entry.

From here, everything else is in one line, the core of which is:

v.inE(‘dbpedia-owl:starring’).outV.outE(‘dbpedia-owl:starring’).inV.groupCount(m).loop(5){it.loops < 3}

This is a series of steps (which map to TinkerPop / Pipes API calls behind the scenes).

  • inE ‘starring’: from v, a vertice, we step onto edges that come in to ‘v’ if they are labelled ‘dbpedia-owl:starring’
  • from those edges, we step to the vertices they come out (‘outV‘) from (these are films etc that Stephen Fry stars in)
  • from those, we step out (‘outE‘) to edges; outgoing edges with that same ‘starring’ label (we don’t try filtering out Stephen here, but we could)
  • from these edges, we step to the vertices that the ‘starring’ edges enter (‘inV‘) (vertices representing films and tv shows)
  • we then call groupCount and pass it our bookkeeping hashtable, m. I believe it increments a counter based on ID of current vertice or edge. As we revisit the same vertice later, the total counter for that entity goes up.
  • from this point, we then go back 5 steps, and recurse 3 times. “{ it.loops < 3 }” (this last is a closure; we can drop any code in here…)

I’m not sure this rushed explanation is 100% right, but maybe gives some flavour. See the Gremlin Wiki for the real goods.

From an application and data perspective, this system is interesting as it allows quantitatively minded graph explorations to be used alongside classically factual SPARQL. The results below show that it can dig out an actor’s co-stars (and then take account of their co-stars, and so on). This sort of neighbourhood exploration helps balance out the messyness of much Linked Data; rather than relying on explicitly asserted facts from the dataset, we can also add in derived data that comes from counting things expressed in dozens or hundreds of pages.

gremlin danbri$ sh gremlin.sh
\,,,/
(o o)
-----oOOo-(_)-oOOo-----

gremlin> g = new LinkedDataSailGraph(new MemoryStoreSailGraph())
==>sailgraph[linkeddatasail]
gremlin> v = g.v(‘http://dbpedia.org/resource/Stephen_Fry‘)
==>v[http://dbpedia.org/resource/Stephen_Fry]
gremlin> g.addNamespace(‘dbpedia-owl’, ‘http://dbpedia.org/ontology/’)
==>null
gremlin> rand = new Random()
==>java.util.Random@594560cf
gremlin> m = [:]
gremlin>
v.inE(‘dbpedia-owl:starring’).outV.outE(‘dbpedia-owl:starring’).inV.groupCount(m).loop(5){ it.loops < 3 }
In the background we can see the various dbpedia links being fetched (try ‘tail -f ripple.log’).
gremlin> m2 = m.sort{ a,b -> b.value <=> a.value }
[...]
gremlin> m2.subMap((m2.keySet() as List)[0..15])

==>v[http://dbpedia.org/resource/Stephen_Fry]=8160
==>v[http://dbpedia.org/resource/Hugh_Laurie]=3641
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=2481
==>v[http://dbpedia.org/resource/Tony_Robinson]=2168
==>v[http://dbpedia.org/resource/Miranda_Richardson]=1791
==>v[http://dbpedia.org/resource/Tim_McInnerny]=1398
==>v[http://dbpedia.org/resource/Emma_Thompson]=1307
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=1303
==>v[http://dbpedia.org/resource/Tony_Slattery]=911
==>v[http://dbpedia.org/resource/Colin_Firth]=854
==>v[http://dbpedia.org/resource/John_Lithgow]=732
==>v[http://dbpedia.org/resource/Emily_Watson]=673
==>v[http://dbpedia.org/resource/John_Hurt]=516
==>v[http://dbpedia.org/resource/John_Cleese]=495
==>v[http://dbpedia.org/resource/Michael_Gambon]=477
==>v[http://dbpedia.org/resource/Helen_Mirren]=472

Syndicated 2011-05-10 19:16:17 from danbri's foaf stories

Latest blog entries     Older blog entries

New Advogato Features

New HTML Parser: The long-awaited libxml2 based HTML parser code is live. It needs further work but already handles most markup better than the original parser.

Keep up with the latest Advogato features by reading the Advogato status blog.

If you're a C programmer with some spare time, take a look at the mod_virgule project page and help us with one of the tasks on the ToDo list!