Older blog entries for danbri (starting at number 194)

K-means test in Octave

Matlab comes with K-means clustering ‘out of the box’. The GNU Octave work-alike system doesn’t, and there seem to be quite a few implementations floating around. I picked the first from Google, pretty carelessly, saving it as myKmeans.m. These are notes from trying to reproduce this Matlab demo with Octave. Not rocket science, but worth writing down so I can find it again.

M=4;   % overall scale factor
W=2;   % x offset for the cluster centres
H=4;   % y offset for the cluster centres
S=500; % samples per cluster
a = M * [randn(S,1)+W, randn(S,1)+H];
b = M * [randn(S,1)+W, randn(S,1)-H];
c = M * [randn(S,1)-W, randn(S,1)+H];
d = M * [randn(S,1)-W, randn(S,1)-H];
e = M * [randn(S,1), randn(S,1)];
all_data = [a;b;c;d;e];
plot(a(:,1), a(:,2),'.');
hold on;
plot(b(:,1), b(:,2),'r.');
plot(c(:,1), c(:,2),'g.');
plot(d(:,1), d(:,2),'k.');
plot(e(:,1), e(:,2),'c.');
% using http://www.christianherta.de/kmeans.html as myKmeans.m
[centroid,pointsInCluster,assignment] = myKmeans(all_data,5)
scatter(centroid(:,1),centroid(:,2),'x');

Syndicated 2011-06-19 14:14:04 from danbri's foaf stories

Querying Linked GeoData with R SPARQL client

Assuming you already have the R statistics toolkit installed, this should be easy.
Install Willem van Hage’s R SPARQL client. I followed the instructions and it worked, although I had to also install the XML library, which was compiled and installed when I typed install.packages("XML", repos = "http://www.omegahat.org/R") within the R interpreter.
Yesterday I set up a simple SPARQL endpoint using Benjamin Nowack’s ARC2 and RDF data from the Ravensburg dataset. The data includes category information about many points of interest in a German town. We can type the following 5 lines into R and show R consuming SPARQL results from the Web:
  • library(SPARQL)
  • endpoint = "http://foaf.tv/hypoid/sparql.php"
  • q = "PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>\nPREFIX foaf: <http://xmlns.com/foaf/0.1/>\nPREFIX rv: <http://www.wifo-ravensburg.de/rdf/semanticweb.rdf#>\nPREFIX gr: <http://purl.org/goodrelations/v1#>\n\nSELECT ?poi ?l ?lon ?lat ?k\nWHERE {\n GRAPH <http://www.heppresearch.com/dev/dump.rdf> {\n  ?poi vcard:geo ?l .\n  ?l vcard:longitude ?lon .\n  ?l vcard:latitude ?lat .\n  ?poi foaf:homepage ?hp .\n  ?poi rv:kategorie ?k .\n }\n}\n"
  • res<-SPARQL(endpoint,q)
  • pie(table(res$k))

This is the simplest thing that works to show the data flow. When combined with richer server-side support (eg. OWL tools, or spatial reasoning) and the capabilities of R plus its other extensions, there is a lot of potential here. A pie chart doesn’t capture all that, but it does show how to get started…

Note also that you can send any SPARQL query you like, so long as the server understands it and responds using W3C’s standard XML response. The R library doesn’t try to interpret the query, so you’re free to make use of any special features or experimental extensions understood by the server.

Exploring Linked Data with Gremlin

Gremlin is an opensource Java/Groovy system for traversing graphs, including but not limited to RDF graphs. This post is just a log of running some examples from @twarko and the Gremlin wiki and mailing list. The test run below goes pretty slowly, since it uses the Web as its database, via entry-by-entry fetches. In this case it’s fetching from DBpedia, but I’ve run it with Freebase happily too. The on-demand RDF is handled by the Linked Data Sail; the same thing would work directly against a graph database.

Why is this interesting? Let me see if I can spell out what it’s doing. I’ll edit this post if I screw up …

Ok so the basic thing is that we start exploring the graph from one vertex, ‘v’, representing Stephen Fry’s dbpedia entry.

From here, everything else is in one line, the core of which is:

v.inE('dbpedia-owl:starring').outV.outE('dbpedia-owl:starring').inV.groupCount(m).loop(5){it.loops < 3}

This is a series of steps (which map to TinkerPop / Pipes API calls behind the scenes).

  • inE ‘starring’: from v, a vertex, we step onto the edges that come in to ‘v’, if they are labelled ‘dbpedia-owl:starring’
  • from those edges, we step to the vertices they come out of (‘outV’); these are the films etc. that Stephen Fry stars in
  • from those, we step out (‘outE’) to edges; outgoing edges with that same ‘starring’ label (we don’t try filtering out Stephen here, but we could)
  • from these edges, we step to the vertices that the ‘starring’ edges enter (‘inV’) (vertices representing films and TV shows)
  • we then call groupCount and pass it our bookkeeping hashtable, m. I believe it increments a counter based on the ID of the current vertex or edge; as we revisit the same vertex later, the total counter for that entity goes up (see the Python sketch just after this list)
  • from this point, we then go back 5 steps, repeating while the closure “{ it.loops < 3 }” is true (we can drop any code in here…)
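
If it helps, here’s the mental model I have of groupCount in plain Python; a rough analogy I’ve made up, not Gremlin’s actual implementation:

m = {}
def group_count(element):
    # bump a per-vertex (or per-edge) tally each time the traversal visits it
    m[element] = m.get(element, 0) + 1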

I’m not sure this rushed explanation is 100% right, but maybe gives some flavour. See the Gremlin Wiki for the real goods.

From an application and data perspective, this system is interesting as it allows quantitatively minded graph explorations to be used alongside classically factual SPARQL. The results below show that it can dig out an actor’s co-stars (and then take account of their co-stars, and so on). This sort of neighbourhood exploration helps balance out the messiness of much Linked Data; rather than relying on explicitly asserted facts from the dataset, we can also add in derived data that comes from counting things expressed in dozens or hundreds of pages.

gremlin danbri$ sh gremlin.sh
\,,,/
(o o)
-----oOOo-(_)-oOOo-----

gremlin> g = new LinkedDataSailGraph(new MemoryStoreSailGraph())
==>sailgraph[linkeddatasail]
gremlin> v = g.v('http://dbpedia.org/resource/Stephen_Fry')
==>v[http://dbpedia.org/resource/Stephen_Fry]
gremlin> g.addNamespace('dbpedia-owl', 'http://dbpedia.org/ontology/')
==>null
gremlin> rand = new Random()
==>java.util.Random@594560cf
gremlin> m = [:]
gremlin> v.inE('dbpedia-owl:starring').outV.outE('dbpedia-owl:starring').inV.groupCount(m).loop(5){ it.loops < 3 }

In the background we can see the various dbpedia links being fetched (try ‘tail -f ripple.log’).
gremlin> m2 = m.sort{ a,b -> b.value <=> a.value }
[...]
gremlin> m2.subMap((m2.keySet() as List)[0..15])

==>v[http://dbpedia.org/resource/Stephen_Fry]=8160
==>v[http://dbpedia.org/resource/Hugh_Laurie]=3641
==>v[http://dbpedia.org/resource/Rowan_Atkinson]=2481
==>v[http://dbpedia.org/resource/Tony_Robinson]=2168
==>v[http://dbpedia.org/resource/Miranda_Richardson]=1791
==>v[http://dbpedia.org/resource/Tim_McInnerny]=1398
==>v[http://dbpedia.org/resource/Emma_Thompson]=1307
==>v[http://dbpedia.org/resource/Robbie_Coltrane]=1303
==>v[http://dbpedia.org/resource/Tony_Slattery]=911
==>v[http://dbpedia.org/resource/Colin_Firth]=854
==>v[http://dbpedia.org/resource/John_Lithgow]=732
==>v[http://dbpedia.org/resource/Emily_Watson]=673
==>v[http://dbpedia.org/resource/John_Hurt]=516
==>v[http://dbpedia.org/resource/John_Cleese]=495
==>v[http://dbpedia.org/resource/Michael_Gambon]=477
==>v[http://dbpedia.org/resource/Helen_Mirren]=472

Syndicated 2011-05-10 19:16:17 from danbri's foaf stories

Video Linking: Archives and Encyclopedias

This is a quick visual teaser for some archive.org-related work I’m doing with NoTube colleagues, and a collaboration with Kingsley Idehen on navigating it.

In NoTube we are trying to match people and TV content by using rich linked data representations of both. I love Archive.org and with their help have crawled an experimental subset of the video-related metadata for the Archive. I’ve also used a couple of other sources; Sean P. Aune’s list of 40 great movies, and the Wikipedia page listing US public domain films. I fixed, merged and scraped until I had a reasonable sample dataset for testing. I wanted to test the Microsoft Pivot Viewer (a Silverlight control), and since OpenLink’s Virtuoso package now has built-in support, I got talking with Kingsley and we ended up with the following demo. Since not everyone has Silverlight, and this is just a rough prototype that may be offline, I’ve made a few screenshots. The real thing is very visual, with animated zooms and transitions, but screenshots give the basic idea.

Notes: the core dataset for now is just links between archive.org entries and Wikipedia/dbpedia pages. In NoTube we’ll also try Lupedia, Zemanta and Reuters’ OpenCalais services on the Archive.org descriptions to see if they suggest other useful links and categories, as well as any other enrichment sources (delicious tags, machine learning) we can find. There is also more metadata from the Archive that we should be using.

This preview simply shows how one extra fact per archived item creates new opportunities for navigation, discovery and understanding. Note that the UI is in no way tuned to be TV, video or archive specific; rather it just lets you explore a group of items by their ‘facets’ or common properties. It also reveals that wiki data is rather chaotic; however, some fields (release date, runtime, director, star etc.) are reliably present. And of course, since the data is from Wikipedia, users can always fix the data.

You often hear Linked Data enthusiasts talk about data “silos”, and the need to interconnect them. All that means here, is that when collections are linked, then improvements to information on one side of the link bring improvements automatically to the other. When a Wikipedia page about a director, actor or movie is improved, it now also improves our means of navigating Archive.org’s wonderful collection. And when someone contributes new video or new HTML5-powered players to the Archive, they’re also enriching the Encyclopedia too.

Archive.org films on a timeline by release date according to Wikipedia.

One thing to mention is that everything here comes from the Wikipedia data that is automatically extracted by DBpedia, and that currently the extractors are not working perfectly on all films. So it should get better in the future. I also added a lot of the image links myself, semi-automatically. For now, this navigation is much more fact-based than topic-based; however we do have Wikipedia categories for each film, director, studio etc., and these have been mapped to other category systems (formal and informal), so there are a lot of other directions to explore.

What else can we do? How about flipping the tiled barchart to organize by the film’s distributor, and constraining the ‘release date‘ facet to the 1940s:

That’s nice. But remember that with Linked Data, you’re always dealing with a subset of data. It’s hard to know (and it’s hard for the interface designers to show us) when you have all the relevant data in hand. In this case, we can see what this is telling us about the videos currently available within the demo. But does it tell us anything interesting about all the films in the Archive? All the films in the world? Maybe a little, but interpretation is difficult.

Next: zoom in to a specific item. The legendary Plan 9 from Outer Space (wikipedia / dbpedia).

Note the HTML-based info panel on the right hand side. In this case it’s automatically generated by Virtuoso from properties of the item. A TV-oriented version would be less generic.

Finally, we can explore the collection by constraining the timeline to show us items organized according to release date, for some facet. Here we show it picking out the career of one Edward J. Kay, at least as far as he shows up as composer of items in this collection:

Now turning back to Wikipedia to learn about ‘Edward J. Kay’, I find he has no entry (beyond these passing mentions of his name) in the English Wikipedia, despite his work on The Ape Man, The Fatal Hour, and other films. While the German Wikipedia does honour him with an entry, I wonder whether this kind of Linked Data navigation will change the dynamics of the ‘deletionism’ debates at Wikipedia. Firstly, by showing that structured data managed elsewhere can enrich the Wikipedia (and vice-versa), it removes some pressure for a single wiki to cover everything. Secondly, it provides a tool to stand further back from the data and view things in a larger context; a context where, for example, Edward J. Kay’s achievements become clearer. Much like Freebase Parallax, the Pivot viewer hints at a future in which we explore data by navigating from sets of things to other sets of things. Pivot doesn’t yet offer this, but it does very vividly present the potential for this kind of navigation, showing that navigation of films, TV shows and actors may be richer when it embraces more general mechanisms.

Syndicated 2011-02-01 19:43:50 from danbri's foaf stories

A Penny for your thoughts: New Year wishes from mechanical turkers

I wanted to learn more about Amazon’s Mechanical Turk service (wikipedia), and perhaps also figure out how I feel about it.

Named after a historical fake chess-playing machine, it uses the Web to allow people around the world to work on short, low-pay ‘micro-tasks’. It’s a disturbing capitalist fantasy come true, echoing Frederick Taylor’s ‘Scientific Management‘ of the 1880s. Workers can be assigned tasks at the touch of a button (or through software automation), and rewarded or punished at the touch of other buttons.

Mechanical Turk has become popular for outsourcing large-scale data cleanup tasks, image annotation, and other work where human judgement outperforms brainless software. It’s also popular with spammers. For more background see ‘try a week as a turker‘ or this Salon article from 2006. Turk is not alone: other sites either build on it, or offer similar facilities. See for example crowdflower, txteagle, or Panos Ipeirotis’ list of micro-crowdsourcing services.

Crowdflower describe themselves as offering “multiple labor channels…  [using] crowdsourcing to harness a round-the-clock workforce that spans more than 70 countries, multiple languages, and can access up to half-a-million workers to dispatch diverse tasks and provide near-real time answers.”

Txteagle focuses on the explosion of mobile access in the developing world, claiming that “txteagle’s GroundSwell mobile engagement platform provides clients with the ability to communicate and incentivize over 2.1 billion people“.

Something is clearly happening here. As someone who works with data and the Web, it’s hard to ignore the potential. As someone who doesn’t like treating others as interchangeable, replaceable and disposable software components, it’s hard to feel very comfortable. Classic liberal guilt territory. So I made an account, both as a worker and as a ‘requester’ (an awkward term, but it’s clear why ‘employer’ is not being used).

I tried a few tasks. I wrote 25-30 words for a blog on some medieval prophecies. I wrote 50 words as fast as I could on “things I would change in my apartment”. I tagged some images with keywords. I failed to pass a ‘qualification’ test sorting scanned photos into scratched, blurred and OK. I ‘like’d some hopeless Web site on Facebook for 2 cents. In all I made 18 US cents. As a way of passing the time, I can see the appeal. This can compete with daytime TV or Farmville or playing Solitaire or Sudoku. I quite enjoyed the mini creative-writing tasks. As a source of income, it’s quite another story, and the awful word ‘incentivize‘ doesn’t do justice to the human reality.

Then I tried the other role: requester. After a little more liberal-guilt navelgazing (“would it be inappropriate to offer to buy people’s immortal souls? etc.”), I decided to offer a penny (well, 2 cents) for up to 100 people’s new year wish thoughts, or whatever of those they felt like sharing for the price.

I copy the results below, stripped of what little detail (eg. time in seconds taken) each result came with. I don’t present this as any deep insight or sociological analysis or arty meditation. It’s just what 100 people somewhere else in the Web responded with, when asked what they wish for 2011. If you want arty, check out the sheep market. If you want more from ‘turkers’ in their own voice, do visit the ‘Turker Nation’ forum. Also Turkopticon is essential reading, “watching out for the crowd in crowdsourcing because nobody else seems to be.”

The exact text used was “Make a wish for 2011. Anything you like, just describe it briefly. Answers will be made public.”, and the question was asked with a simple Web form, “Make a wish for 2011, … any thought you care to share”.


Here’s what they said:

When you’re lonely, I wish you Love! When you’re down, I wish you Joy! When you’re troubled, I wish you Peace! When things seem empty, I wish you Hope! Have a Happy New Year!

wish u a happy new year…………

happy new year 2011. may this year bring joy and peace in your life

My wish for 2011 is i want to mary my Girlfriend this year.

I wish I will get pregnant in 2011!

i wish juhi becomes close to me

wish you a wonderful happy new year

wish you happy new year

for new year 2011 I wish Love of God must fill each human heart
Food inflation must be wiped off quickly
corruption must be rooted out smartly
Terrorism must be curtailed quickly
All People must get love, care, clothes, shelter & food
Love of God must fill each human heart…

Happy life.All desires to be fulfilled.

wish to be best entrepreneur of the year 2011

dont work hard if it is possible to do the same smarter way..
Be happy!

New year is the time to unfold new horizons,realise new dreams,rejoice in simple pleasures and gear up for new challenges.wishing a fulfilling 2011.

Remember that the best relationship is one where your love for each other is greater than your need for each other. Happy New Year

To get a newer car, and have less car problems. and have more income

I wish that my son’s health problems will be answered

Be it Success & Prosperity, Be it Fun and Frolic…

A new year is waiting for you. Go and enjoy the New Year on New Thought,”Rebirth of My Life”.

Let us wish for a world as one family, then we can overcome all the problems man made and otherwise.

My wish is to gain/learn more knowledge than in 2010

My new years wish for 2011 is to be happier and healthier.

I wish that I would be cured of heartache.

I am really very happy to wish you all very happy new year…..I wish you all the things to be success in your life and career…….. Just try to quit any bad habit within you. Just forgot all the bad incidents happen within your friends and try to enjoy this new year with pleasant……

Wish you a happy and prosperous new year.

I wish for a job.

I would hope that people will end the wars in the world.

Discontinue smoking and restrict intake of alcohol

I wish that my retail store would get a bigger client base so I can expand.

I Wish a wish for You Dear.Sending you Big bunch of Wishes from the Heart close to where.Wish you a Very Very Happy New Year

I wish for 2011 to be filled with more love and happiness than 2010.

Everything has the solution Even IMPOSSIBLE Makes I aM POSSIBLE. Happy Journey for New Year.

May each day of the coming year be vibrant and new bringing along many reasons for celebrations & rejoices. Happy New year

I have just moved and want to make some great new friends! Would love to meet a special senior (man!!) to share some wonderful times with!!!

My wish is that i wanna to live with my “Pretty girl” forever and also wanna to meet her as well,please god please, finish my this wish, no more aspire from me only once.

that people treat each other more nicely and with greater civility, in both their private and public lives.

that we would get our financial house in order

Year’s end is neither an end nor a beginning but a going on, with all the wisdom that experience can instill in us. Wish u very happy new year and take care

Wish you a very happy And prosperous new year 2011

Tom Cruise
Angelina Jolie
Aishwarya Rai
Arnold
Jennifer Lopez
Amitabh Bachhan
& me..
All the Stars wish u a Very Happy New Year.

Oh my Dear, Forget ur Fear,
Let all ur Dreams be Clear,
Never put Tear, Please Hear,
I want to tell one thing in ur Ear
Wishing u a very Happy “NEW YEAR”!

May The Year 2011 Bring for You…. Happiness,Success and filled with Peace,Hope n Togetherness of your Family n Friends….

i want to be happy

Good health for my family and friends

I wish my husband’s children would stop being so mean and violent and act like normal children. I want to love my husband just as much as before we got full custody.

to get wonderful loving girl for me.. :))

Keep some good try. Wish u happy new year

happy new year to all

My wish is to find a good job.

i wish i get a big outsourcing contract this year that i can re-set up my business and get back on track.

I wish that I be firm in whatever I do. That I can do justice to all my endeavors. That I give my 100%, my wholehearted efforts to each and every minutest work I do.

My wish for 2011, is a little patience and understanding for everyone, empathy always helps.

To be able to afford a new house

“NEW YEAR 2011″
+NEW AIM + NEW ACHIEVEMENT + NEW DREAM +NEW IDEA + NEW THINKING +NEW AMBITION =NEW LIFE+SUCCESS HAPPY NEW YEAR!

let this year be terrorist free world

Wish the world walk forward in time with all its innocence and beauty where prevails only love, and hatred no longer found in the dictionary.

no

Wish u a very happy New Year Friends and make this year as a pleasant days…

I wish the economy would get better, so people can afford to pay their bills and live more comfortably again.

i wish, god makes life beautiful and very simple to all of us. and happy new year to world.

Be always at war with your vices, at peace with your neighbors, and let each new year find you a better man and I wish a very very prosperous new year.

i wish i would buy a house and car for my mom

I wish to have a new car.
This new year will be full of expectation in the field of investment.We concerned about US dollar. Hope this year will be a good for US dollar.

this year is very enjoyment life

Cheers to a New Year and another chance for us to get it right

to get married

Wishing all a meaningful,purposeful,healthier and prosperous New Year 2011.

WISH YOU A HAPPY NEW YEAR 2011 MAY BRING ALL HAPPINESS TO YOU

RAKKIMUTHU

In 2011 I wish for my family to get in a better spot financially and world peace.

Wish that economic conditions improve to the extent that the whole spectrum of society can benefit and improve themselves.

I want my divorce to be final and for my children to be happy.

This 2011 year is very good year for All with Health & Wealth.

I wish that things for my family would get better. We have had a terrible year and I am wishing that we can look forward to a better and brighter 2011.

This year bring peace and prosperity to all. Everyone attain the greatest goal of life. May god gives us meaning of life to all.

This new year will bring happy in everyone’s life and peace among countries.

I hope for bipartisanship and for people to realize blowing up other people isn’t the best way to get their point across. It just makes everyone else angry.

A better economy would be nice too

I wish that in 2011 the government will work together as a TEAM for the betterment of all. Peace in the world.

i wish you all happy new year. may god bless all……

no i wish for you

I wish that my family will move into our own house and we can be successful in getting good jobs for our future.

I wish my girl comes back to me

Wish You Happy New Year for All, especially to the workers and requester’s of Mturk.

Greetings!!!

Wishing you and your family a very happy and prosperous NEW YEAR – 2011

May this New Year bring many opportunities your way, to explore every joy of life and may your resolutions for the days ahead stay firm, turning all your dreams into reality and all your efforts into great achievements.

Wish u a Happy and Prosperous New Year 2011….

Wishing u lots of happiness..Success..and Love

and Good Health…….

Wish you a very very happy new year

WISHING YOU ALL A VERY HAPPY & PROSPEROUS NEW YEAR…….

I wish in this 2011 is to be happy,have a good health and also my family.

I pray that the coming year should bring peace, happiness and good health.

I wish for my family to continue to be healthy, for my cars to continue running, and for no 10th Anniversary attacks this upcoming September.

be a good and help full for my family .

Happy and Prosperous New Year

New day new morning new hope new efforts new success and new feeling,a new year a new begening, but old friends are never forgotten, i think all who touched my life and made life meaningful with their support, i pray god to give u a verry “HAPPY AND SUCCESSFUL NEW YEAR”.

Be a good person,as good as no one

wish this new year brings cheers and happiness to one and all.

For the year 2011 I simply wish for the ability to support my family properly and have a healthier year.

I wish I have luck with getting a better job.

Greater awareness of climate change, and a recovering US economy.

this new year 2011 brings you all prosperous and happiness in your life…….

happy newyear wishes to all the beautiful hearts in the world in the world.god bless you all.

wishing every happy new year to all my pals and relatives and to all my lovely countrymen

Syndicated 2011-01-01 11:41:08 from danbri's foaf stories

XMPP untethered – serverless messaging in the core?

In the XMPP session at last February’s FOSDEM I gave a brief demo of some NoTube work on how TV-style remote controls might look with XMPP providing their communication link. For the TV part, I showed Boxee, with a tiny Python script exposing some of its localhost HTTP API to the wider network via XMPP. For the client, I have a ‘my first iphone app’ approximation of a remote control that speaks a vapourware XMPP remote control protocol, “Buttons”.

The point of all this is about breaking open the Web-TV environment, so that different people and groups get to innovate without having to be colleagues or close-knit business partners. Control your Apple TV with your Google Android phone; or your Google TV with your Apple iPad, or your Boxee box with either. Write smart linking and bookmarking and annotation apps that improve TV for all viewers, rather than only those who’ve bought from the same company as you. I guess I managed to communicate something of this, because people clapped generously when my iphone app managed to pause Boxee. This post is about how we might get from evocative but toy demos to a useful and usable protocol, and about one of our largest obstacles: XMPP’s focus on server-mediated communications.

So what happened when I hit the ‘pause’ button on the iphone remote app? Well, the app was already connected to the XMPP network, e.g. signed in as bob.notube@gmail.com via Google Talk’s servers. And so an XMPP stanza flowed out from the room we were in, across to Google somewhere, and then via XMPP server-to-server protocol over to my self-run XMPP server (an ejabberd hosted on Amazon EC2’s east USA zone somewhere). And from there, the message returned finally to Brussels, flowing through whichever Python library I was using to Boxee (signed in as buttons@foaf.tv), causing the video to pause. This generally happened very quickly; but sometimes it can take more than a second. That can be very frustrating, and while there are workarounds (keep-alive messages, smart code that ignores sequences of buffered ‘Pause!’ messages, apps that download metadata and bring more UI to the second screen, …), the problem has a simple cause: it just doesn’t make sense for a ‘pause’ message to cross the Atlantic twice, and pass through two XMPP servers, on its way across the living room from remote control to TV.
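
For flavour, here’s roughly what the sending side can look like in Python; a minimal sketch using the SleekXMPP library, where the JIDs, password and the literal ‘pause’ body are stand-ins, since the ‘Buttons’ protocol is still vapourware:

import sleekxmpp

# hypothetical account credentials, for illustration only
remote = sleekxmpp.ClientXMPP('bob.notube@gmail.com', 'secret')

def on_start(event):
    remote.send_presence()
    # 'pause' stands in for a real Buttons stanza; the TV side is signed in as buttons@foaf.tv
    remote.send_message(mto='buttons@foaf.tv', mbody='pause')
    remote.disconnect(wait=True)

remote.add_event_handler('session_start', on_start)
if remote.connect():
    remote.process(block=True)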

But first – why are we even using XMPP at all, rather than say HTTP? Partly because XMPP lets us easily address devices on home networks that aren’t publicly exposed as running a Web server. Partly for the symmetry of the protocol, since iPads, touch tables, smartphones, TVs and media centres can all host and play media items on their own displays, and we may have several such devices in a home setting that need to be in touch with one another. There’s also a certain laziness; XMPP already defines lots of useful pieces, like buddylist rosters, pubsub notifications, group chats; it has an active and friendly community, and it comes with a healthy collection of tools and libraries. My own interests are around exploring and collectively annotating the huge archives of content that are slowly coming online, and an expectation that this could be a more shared experience, so I’m following an intuition that XMPP provides more useful ‘raw materials’ for social content exploration than raw HTTP. That said, many elements of remote control can be defined and implemented in either environment. But for today, I’m concentrating on the XMPP side.

So back at FOSDEM I raised a couple of concerns, as a long-term XMPP well-wisher but non-insider.

The first was that the technology presents itself as a daunting collection of extensions, each of which might or might not be supported in some toolkit. To this someone (likely Dave Cridland) responded with the reassuring observation that most of these could be implemented by 3rd-party app developers simply reading/writing XMPP stanzas. And that in fact pretty much the only ‘core’ piece of XMPP that wasn’t treated as core in most toolkits was the serverless, point-to-point XEP-0174 ‘serverless messaging‘ mode. Everything else, the rest of us mortals could hack in application code. For serverless messaging we are left waiting and hoping for the toolkit maintainers to wire things in, as it generally requires fairly intimate knowledge of the relevant XMPP library.

My second point was in fact related: that if XMPP tools offered better support for serverless operation, it would open up lots of interesting application options. We certainly need it if the TV remotes use case is to be a credible use of XMPP. Beyond TV remotes, there are obvious applications in the area of open, decentralised social networking. The recent buzz around things like StatusNet, GNU Social, Diaspora*, WebID, OneSocialWeb, alongside the old stuff like FOAF, shows serious interest in letting users take more decentralised control of their online social behaviour. Whether the two parties are in the same room on the same LAN, or halfway around the world from each other, XMPP and its huge collection of field-tested, code-supported extensions is relevant, even when those parties prefer to communicate directly rather than via servers.

With XMPP, 3rd-party app developers have a well-defined framework into which they can drop ad-hoc stanzas of information, whether it’s a vCard or details of recently played music. This seems too useful a system to reserve solely for communications that are mediated by a server. And indeed, XMPP in theory is not tied to servers; the XEP-0174 spec tells us both how to do local-network bonjour-style discovery, and how to layer XMPP on top of any communication channel that allows XML stanzas to flow back and forth.

From the abstract,

This specification defines how to communicate over local or wide-area networks using the principles of zero-configuration networking for endpoint discovery and the syntax of XML streams and XMPP messaging for real-time communication. This method uses DNS-based Service Discovery and Multicast DNS to discover entities that support the protocol, including their IP addresses and preferred ports. Any two entities can then negotiate a serverless connection using XML streams in order to exchange XMPP message and IQ stanzas.

But somehow this remains a niche use of XMPP. Many of the toolkits have some support for it, perhaps as work-in-progress or a patch, but it remains somewhat ‘out there’ rather than core to the XMPP approach. I’d love to see this change in 2011. The 0174 spec combines a few themes; it talks a lot about discovery, motivated in part by trade-fair and conference-type scenarios. When your Apple laptop finds people locally on some network to chat with by “Bonjour”, it’s doing more or less XEP-0174. For the TV remote scenario, I’m interested in having nodes from a normal XMPP network drop down and “re-discover” themselves in a hopefully-lower-latency point-to-point mode (within some LAN or across the Internet, or between NAT-protected home LANs). There are lots of scenarios where having a server in the loop isn’t needed, or adds cost and risk (latency, single point of failure, privacy concerns).
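
As a taste of the discovery half of XEP-0174, here’s a small Python sketch using the zeroconf library to advertise a link-local XMPP presence. The _presence._tcp service type is the one the spec builds on; the instance name, address, port and status fields below are invented for illustration:

import socket
from zeroconf import Zeroconf, ServiceInfo

# advertise ourselves for serverless XMPP discovery on the local network
info = ServiceInfo(
    '_presence._tcp.local.',
    'danbri@laptop._presence._tcp.local.',         # hypothetical user@machine instance name
    addresses=[socket.inet_aton('192.168.1.10')],  # this machine's LAN address
    port=5562,                                     # wherever our XML stream listens
    properties={'txtvers': '1', 'status': 'avail', 'nick': 'danbri'},
)
zc = Zeroconf()
zc.register_service(info)  # peers browsing _presence._tcp should now see us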

XEP-0174 continues,

6. Initiating an XML Stream
In order to exchange serverless messages, the initiator and
recipient MUST first establish XML streams between themselves,
as is familiar from RFC 3920.
First, the initiator opens a TCP connection at the IP address
and port discovered via the DNS lookup for an entity and opens
an XML stream to the recipient, which SHOULD include 'to' and
'from' address. [...]

This sounds pretty precise; point-to-point communication is over TCP. The Security Considerations section discusses some of the different constraints for XMPP in serverless mode, and states that …

To secure communications between serverless entities, it is RECOMMENDED to negotiate the use of TLS and SASL for the XML stream as described in RFC 3920

Having stumbled across Datagram TLS (wikipedia, design writeup), I wonder whether that might also be an option for the layer providing the XML stream between entities.  For example, the chownat tool shows a UDP-based trick for establishing bidirectional communication between entities, even when they’re both behind NAT. I can’t help but wonder whether XMPP could be layered somehow on top of that (OpenSSL libraries have Datagram TLS support already, apparently). There are also other mechanisms I’ve been discussing with Mo McRoberts and Libby Miller lately, e.g. Mo’s dynamic dns / pubkeys idea, or his trick of running an XMPP server in the home, and opening it up via UPnP. But that’s for another time.

So back on my main theme: XMPP is holding itself back by always emphasising the server-mediated role. XEP-0174 has the feel of an afterthought rather than a core part of what the XMPP community offers to the wider technology scene, and the support for it in toolkits lags similarly. I’d love to hear from ‘live and breathe XMPP’ folk what exactly they think is needed before it can become a more central part of the XMPP world.

From the TV remotes use case we have a few constraints, such as the need to associate identities established in different environments (eg. via public key). If xmpp:danbri-ipad@danbri.org is already on the server-based XMPP roster of xmpp:nevali-tv@nevali.net, can pubkey info in their XMPP vCards be used to help re-establish trusted communications when the devices find themselves connected in the same LAN? It seems just plain nuts to have a remote control communicate with another box in the same room via transatlantic links through Google Talk and Amazon EC2, and yet that’s the general pattern of normal XMPP communications. What would it take to have more out-of-the-box support for XEP-0174 from the XMPP toolkits? Some combination of beer, money, or a shared sense that this is worth doing and that XMPP has huge potential beyond the server-based communications model it grew from?

Syndicated 2010-12-28 16:49:15 from danbri's foaf stories

How to tell you’re living in the future: bacterial computers, HTML and RDF

Clue no.1. Papers like “Solving a Hamiltonian Path Problem with a bacterial computer” barely raise an eyebrow.

Clue no.2. Undergraduates did most of the work.

And the clincher, …

Clue no.3. The paper is shared nicely in the Web, using HTML, Creative Commons document license, and useful RDF can be found nearby.

From those-crazy-eggheads dept, … bacterial computers solving graph data problems. Can’t wait for the javascript API. Except the thing of interest here isn’t so much the mad science but what they say about how they did it. But the paper is pretty fun stuff too.

The successful design and construction of a system that enables bacterial computing also validates the experimental approach inherent in synthetic biology. We used new and existing modular parts from the Registry of Standard Biological Parts [17] and connected them using a standard assembly method [18]. We used the principle of abstraction to manage the complexity of our designs and to simplify our thinking about the parts, devices, and systems of our project. The HPP bacterial computer builds upon our previous work and upon the work of others in synthetic biology [19-21]. Perhaps the most impressive aspect of this work was that undergraduates conducted every aspect of the design, modeling, construction, testing, and data analysis.

undergraduates! Meanwhile, over on partsregistry.org you can read more about the bits and pieces they squished together. It’s like a biological CPAN. And in fact the analogy is being actively pursued: see openwetware.org’s work on an RDF description of the catalogue.

I grabbed an RDF file from that site and can confirm that simple queries like

select * from <SemanticSBOLv0.13_BioBrick_Data_v0.13.rdf>  where {<http://sbol.bhi.washington.edu/rdf/sbol.owl#BBa_I715022> ?p ?v }

and

select * from <SemanticSBOLv0.13_BioBrick_Data_v0.13.rdf>  where {?x ?p <http://sbol.bhi.washington.edu/rdf/sbol.owl#BBa_I715022>  }

… do navigate me around the graph that describes the pieces described in their paper.
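
If you’d rather poke at the same data from Python, an rdflib sketch along these lines should do the job (assuming the downloaded file sits locally under the name shown):

import rdflib

g = rdflib.Graph()
g.parse('SemanticSBOLv0.13_BioBrick_Data_v0.13.rdf')  # the RDF file grabbed from their site

q = '''
SELECT ?p ?v WHERE {
  <http://sbol.bhi.washington.edu/rdf/sbol.owl#BBa_I715022> ?p ?v .
}
'''
for p, v in g.query(q):
    print(p, v)  # each property/value attached to the BBa_I715022 part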

Here’s what the HTML paper says right now,

We designed and built all the basic parts used in our experiments as BioBrick compatible parts and submitted them to the Registry of Standard Biological Parts [17]. Key basic parts and their Registry numbers are: 5′ RFP (BBa_I715022), 3′ RFP (BBa_ I715023), 5′ GFP (BBa_I715019), and 3′ GFP (BBa_I715020). All basic parts were DNA sequence verified. The basic parts hixC(BBa_J44000), Hin LVA (BBa_J31001) were used from our previous experiments [8]. The parts were assembled by the BioBrick standard assembly method [18] yielding intermediates and devices that were also submitted to the Registry. Important intermediate and devices constructed are: Edge A (BBa_S03755), Edge B (BBa_S03783), Edge C (BBa_S03784), ABC HPP construct (BBa_I715042), ACB HPP construct (BBa_I715043), and BAC HPP construct (BBa_I715044). We previously built the Hin-LVA expression cassette (BBa_S03536) [8].

How nice to have a scholarly publication in HTML format, open-access published under creative commons license, and backed by machine-processable RDF data. Never mind undergrads getting bacteria to solve NP-hard graph problems, it’s the modern publishing and collaboration machinery described here that makes me feel I’m living in the future…

(World Wide Web – Let’s Share What We Know…)

ps. thanks to Dan Connolly for nudging me to get this shared with the planetrdf.com-reading community. Maybe it’ll nudge Kendall into posting something too.

Syndicated 2010-11-30 16:25:09 from danbri's foaf stories

‘Republic of Letters’ in R / Custom Widgets for Second Screen TV navigation trails

As ever, I write one post that perhaps should’ve been two. This is about the use and linking of datasets that aid ’second screen’ (smartphone, tablet) TV remotes, and it takes as a quick example a navigation widget and underlying dataset that show us how we might expect to navigate TV archives, in some future age when TV lives more fully in the World Wide Web. I’ll argue that access to the ‘raw data‘ and frameworks for embedding visualisation apps are of equal importance when thinking about innovative ways of exploring the ever-growing archives. All of this comes from many discussions with my NoTube colleagues and other collaborators; rambling scribblyness is all my own.

Ben Hammersley points us at a lovely Flash visualization of correspondence patterns, “Mapping the Republic of Letters“.

Mapping the Republic of Letters has at its center a multidimensional data set which spans 300 years and nearly 100,000 letters. We use computing tools that help us to measure and analyze data quantitatively, though that will not take us to our goal. While we use software and computing techniques that were designed for scientific and statistical methods, we are seeking to develop computing tools to enhance humanistic methods, to help us to explore qualitative aspects of the Republic of Letters. The subject of our study and the nature of the material require it. The collections of correspondence and records of travel from this period are incomplete. Of that incomplete material only a fraction has been digitized and is available to us. Making connections and resolving ambiguities in the data is something that can only be done with the help of computing, but cannot be done by computing alone. (from ‘methods and philosophy‘)


screenshot of Republic of Letters app, showing social network links superimposed on map of historical western Europe


See their detailed writeup for more on this fascinating and quite beautiful work. As I’m working lately on linking TV content more deeply into the Web, and on ’second screen’ navigation, this struck me as just the kind of interface which it ought to be possible to re-use on a tablet PC to explore TV archives. Forgetting for the moment difficulties with Flash on iPads and so on, the idea roughly is that it would be great to embed such a visualization within a TV watching environment, such that when the ‘republic of letters’ widget is focussed on some person, place, or topic, we should have the opportunity to scan the available TV archives for related materials to show.

So a glance at Chrome’s ‘developer tools’ panel gave me a link to the underlying data used by the visualisation. I don’t know exactly whose it is, nor how they want it used, so please treat it with respect. Still, there it is, sat in the Web, in tab-separated format, begging to be used. There’s a lot you can do with the Flash application that I’ve barely touched, but I’m intrigued by the underlying dataset. In particular, where they have the string “Tonson, Jacob”, the data linker in me wants to see a Wikipedia or DBpedia link, since they provide explanation, context, related people, places and themes; all precious assets when trying to scrape together related TV materials to inform, educate or entertain someone with. From a few test searches, it turns out that (many? most?) of the correspondents are quite easily matched to Wikipedia: William Congreve; Montagu, 1st earl of Halifax, Charles; Hough, bishop of Worcester, John; Stanyan, Abraham; … Voltaire and others. But what about the data?

Lately I’ve been learning just a little about R, a language used mainly for statistics and related analysis. Here’s what it’ll do ‘out of the box’, in untrained hands:

letters<-read.csv('data.txt',sep='\t', header=TRUE)
v_author = letters$Author=="Voltaire"
v_letters = letters[v_author, ]
Where were Voltaire’s letters sent?
> cbind(summary(v_letters$dest_country))
[,1]
Austria            2
Belgium            6
Canada             0
Denmark            0
England           26
France          1312
Germany           97
India              0
Ireland            0
Italy             68
Netherlands       22
Portugal           0
Russia             5
Scotland           0
Spain              1
Sweden             0
Switzerland      342
The Netherlands    1
Turkey             0
United States      0
Wales              0
As the overview and video in the ‘Republic of Letters‘ site point out (“Tracking 18th-century “social network” through letters”), the patterns of correspondence, eg. between Voltaire and England, Scotland and Ireland, jump out of the data (and more so from its visualisation). There are countless ways this information could be explored, presented, sliced-and-diced. Only a custom app can really make the most of it, and the Republic of Letters work goes a long way in that direction. They also note that

The requirements of our project are very much in sync with current work being done in the linked-data/ semantic web community and in the data visualization community, which is why collaboration with computer science has been critical to our project from the start.

So the raw data in the Web here is a simple table; while we could spend time arguing about whether it would better be expressed in JSON, XML or an RDF notation, I’d rather see some discussion around what we can do with this information. In particular, I’m intrigued by the possibilities of R alongside the data-linking habits that come with RDF. If anyone manages to tease anything interesting from this dataset, perhaps mixed in with DBpedia, do post your results.

And of course there are always other datasets to examine; for example see the Darwin correspondence archives, or the Open Knowledge Foundation’s Open Correspondence project, which has a Dickens-based pilot. While it is wonderful having UI that is tuned to the particulars of some dataset, it is also great when we can re-use UI code to explore similarly structured data from elsewhere. On both the data side and the UI side, this is expensive, tough work to do well. My current concern is to maximise re-use of both UI and data for the particular circumstances of second-screen TV navigation, a scenario rarely a first priority for anyone!

My hope is that custom navigation widgets for this sort of data will be natural components of next-generation TV remote controls, and that TV archives (and other collections) will open up enough of their metadata to draw in (possibly paying) viewers. To achieve this, we need the raw data on both sides to be as connectable as possible, so that application authors can spend their time thinking about what their users really need and can use, rather than on whether they’ve got the ‘right’ Henry Newton.

If we get it right, there’s a central role for librarians and archivists in curating the public, linked datasets that tell us about the people, places and topics that will allow us to make new navigation trails through Web-connected television, literature and encyclopedia content. And we’ll also see new roles for custom visualizations, once we figure out an embedding framework for TV widgets that lets them communicate with a display system, with other users in the same room or community, and that is designed for cross-referencing datasets that talk about the same entities, topics, places etc.

As I mentioned regarding Lonclass and UDC, collaboration around open shared data often takes place in a furtive atmosphere of guilt and uncertainty. Is it OK to point to the underlying data behind a fantastic visualisation? How can we make sure the hard work that goes into that data curation is acknowledged and rewarded, even while its results flow more freely around the Web, and end up in places (your TV remote!) that may never have been anticipated?

Lonclass and RDF

Lonclass is one of the BBC’s in-house classification systems – the “London classification”. I’ve had the privilege of investigating lonclass within the NoTube project. It’s not currently public, but much of what I say here is also applicable to the UDC classification system upon which it was based. UDC is also not fully public yet; I’ve made a case elsewhere that it should be, and I hope we’ll see that within my lifetime. UDC and Lonclass have a fascinating history and are rich cultural heritage artifacts in their own right, but I’m concerned here only with their role as the keys to many of our digital and real-world archives.

Why would we want to map Lonclass or UDC subject classification codes into RDF?

Lonclass codes can be thought of as compact but potentially complex sentences, built from the thousands of base ‘words’ in the Lonclass dictionary. By mapping the basic pieces, the words, to other data sources, we also enrich the compound sentences. We can’t map the sentences themselves, as there can be infinitely many of them; it would be an expensive and never-ending task.

For example, we might have a lonclass code for “Report on the environmental impact of the decline of tin mining in sweden in the 20th century“. This would be a jumble of numbers and punctuation which I won’t trouble you with here. But if we parse out that structure we can see the complex code as built from primitives such as ‘tin mining’ (itself e.g. ‘Tin’ and ‘Mining’), ‘Sweden’, etc. By linking those identifiable parts to shared Web data, we also learn more about the complex composite codes that use them. Wikipedia’s Sweden entry tells us in English, “Sweden has land borders with Norway to the west and Finland to the northeast, and water borders with Denmark, Germany, and Poland to the south, and Estonia, Latvia, Lithuania, and Russia to the east.”. Increasingly this additional information is available in machine-friendly form. Right now we can’t learn about Sweden’s borders from the bits of Wikipedia reflected into DBpedia’s Sweden entry, but UN FAO’s geopolitical ontology does have this information and much more in RDF form.

There is more, much more, to know about Sweden than can possibly be represented directly within Lonclass or UDC. Yet those facts may also be very useful for the retrieval of information tagged with Sweden-related Lonclass codes. If we map the Lonclass notion of ‘Sweden’ to identified concepts described elsewhere, then whenever we learn more about the latter, we also learn more about the former, and indirectly, about anything tagged with complex lonclass codes using that concept. Suddenly an archived TV documentary tagged as covering a ‘report on the environmental impact of the decline of tin mining in sweden’ is accessible also to people or machines looking under Scandinavia + metal mining.

Lonclass and UDC codes have a rich hidden structure that is rarely exploited with modern tools. Lonclass, by virtue of its UDC heritage, does a lot of work itself towards representing rich conceptual inter-relationships. It embodies a conceptual map of our world, with mysterious codes (well known in the library world) for topics such as ‘622 – mining’, but also specifics e.g. ‘622.3 Mining of specific minerals, ores, rocks’, and combinations (‘622.3:553.9 Extraction of carbonaceous minerals, hydrocarbons’). By joining a code for ‘mining a specific mineral…’ to a code for ‘553.9 Deposits of carbonaceous rocks. Hydrocarbon deposits’ we get a compound term. So Lonclass/UDC “knows” about the relationship between “Tin Mining” and “Mining”, “metals” etc., and quite likely between “Sweden” and “Scandinavia”. But it can’t know everything! Sooner or later, we have to say, “Sorry, it’s not reasonable to expect the classification system to model the entire world; that’s a bigger problem”.
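
To make the compositional idea concrete, here’s a toy Python sketch. The split-on-‘:’ syntax and the little label table are deliberate oversimplifications of mine; real UDC/Lonclass parsing is far subtler:

# toy decomposition of a compound UDC-style code into primitives
LABELS = {
    '622':   'Mining',
    '622.3': 'Mining of specific minerals, ores, rocks',
    '553.9': 'Deposits of carbonaceous rocks. Hydrocarbon deposits',
}

def primitives(code):
    # ':' joins component codes in a compound; the real syntax has more operators
    return [part.strip() for part in code.split(':')]

for p in primitives('622.3:553.9'):
    print(p, '->', LABELS.get(p, 'unknown'))

Once each primitive is mapped to a shared identifier in the Web, every compound built from it inherits those links for free.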

Even within the closed, self-supporting universe of UDC/Lonclass, this compositional semantics system is a very powerful tool for describing obscure topics in terms of well-known simpler concepts. But it’s too much for any single organisation (whether the BBC, the UDC Consortium, or anyone) to maintain and extend such a system to cover all of modern life, from social, legal and business developments to new scientific innovations. The work needs to be shared, and RDF is currently our best bet on how to create such work-sharing, meaning-sharing, information-linking systems in the Web. The hierarchies in UDC and Lonclass don’t attempt to represent all of objective reality; they instead show paths through information.

If the metaphor of a ‘conceptual map’ holds up, then it’s clear that at some point it’s useful to have our maps made by different parties. The Web now contains a smaller but growing Web of machine-readable descriptions. Over at MusicBrainz is a community who take care of describing the entities and relationships that cover much of music, or at least popular music. Others describe countries, species, genetics, languages, historical events, economics, and countless other topics. The data is sometimes messy or an imperfect fit for some task-in-hand, but it is actively growing, curated and connected.

I’m not arguing that Lonclass or UDC should be thrown out and replaced by some vague ‘linked cloud’. Rather, that there are some simple steps that can be taken towards making sure each of these linked datasets contributes to modernising our paths into the archives. We need to document and share opensource tools for an agreed data model for the arcane numeric codes of UDC and Lonclass. We need at least the raw pieces, the simplest codes, to be described for humans and machines in public, stable Web pages, and for their re-use, mapping, data mining and re-combination to be actively encouraged and celebrated. Currently, it is possible to get your hands on this data if you sign NDAs (Lonclass), pay fees (UDC) or exchange USB sticks with the right party in some shady backstreet. Whether the metaphor of choice is ‘key to the archives’ or ‘conceptual map’, this is a deeply unfortunate situation. There’s a wealth of meaning hidden inside Lonclass and UDC and the collections they index, a lot that can be added by linking it to other RDF datasets, but more importantly there are huge communities out there who’ll do the work when the data is finally opened up…


Syndicated 2010-11-18 10:02:49 from danbri's foaf stories

Disambiguating with DBpedia

Sketchy notes. Say you’re looking for an identifier for something, and you know it’s a company/organization, and you have a label “Woolworths”.

What can be done to choose amongst the results we find in DBpedia for this crude query?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?x where {
?x a <http://dbpedia.org/ontology/Organisation>;  rdfs:label ?l .
FILTER(REGEX(?l, "Woolworths*")).
}

More generally, are the tweaks and tricks needed to optimise this sort of disambiguation going to be cross-domain, or do we have to hand-craft them, case by case?
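
As one starting point, here’s a hedged Python sketch using the SPARQLWrapper library against DBpedia’s public endpoint; the scoring rule at the end is just one guess at a cross-domain heuristic (prefer exact label matches, then shorter labels):

from SPARQLWrapper import SPARQLWrapper, JSON

QUERY = '''
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?x ?l WHERE {
  ?x a <http://dbpedia.org/ontology/Organisation> ; rdfs:label ?l .
  FILTER(REGEX(?l, "Woolworths"))
}
'''

sparql = SPARQLWrapper('http://dbpedia.org/sparql')
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
rows = sparql.query().convert()['results']['bindings']

def score(row):
    label = row['l']['value']
    return (label != 'Woolworths', len(label))  # exact match first, then shortest

for row in sorted(rows, key=score):
    print(row['x']['value'], '--', row['l']['value'])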

Syndicated 2010-11-16 16:42:33 from danbri's foaf stories
