Advogato: Blog for danbri

Lonclass and RDF

Lonclass is one of the BBC’s in-house classification systems – the “London classification”. I’ve had the privilege of investigating lonclass within the NoTube project. It’s not currently public, but much of what I say here is also applicable to the UDC classification system upon which it was based. UDC is also not fully public yet; I’ve made a case elsewhere that it should be, and I hope we’ll see that within my lifetime. UDC and Lonclass have a fascinating history and are rich cultural heritage artifacts in their own right, but I’m concerned here only with their role as the keys to many of our digital and real-world archives.

Why would we want to map Lonclass or UDC subject classification codes into RDF?

Lonclass codes can be thought of as compact but potentially complex sentences, built from the thousands of base ‘words’ in the Lonclass dictionary. By mapping the basic pieces, the words, to other data sources, we also enrich the compound sentences. We can’t map the sentences are there can be infinitely many of them, it would be an expensive and never-ending task.

For example, we might have a lonclass code for “Report on the environmental impact of the decline of tin mining in sweden in the 20th century“. This would be an jumble of numbers and punctuation which I won’t trouble you with here. But if we parsed out that structure we can see the complex code as built from primitives such as ‘tin mining’ (itself e.g. ‘Tin’ and ‘Mining’), ‘Sweden’, etc. By linking those identifiable parts to shared Web data, we also learn more about the complex composite codes that use them. Wikipedia’s Sweden entry tells us in English, “Sweden has land borders with Norway to the west and Finland to the northeast, and water borders with Denmark, Germany, and Poland to the south, and Estonia, Latvia, Lithuania, and Russia to the east.”. Increasingly this additional information is available in machine-friendly form. Although right now we can’t learn about Sweden’s borders from the bits of Wikipedia reflected into DBpedia’s Sweden entry, but UN FAO’s geopolitical ontology does have this information and much more in RDF form.

There is more, much more, to know about Sweden than can possibly be represented directly within Lonclass or UDC. Yet those facts may also be very useful for the retrieval of information tagged with Sweden-related Lonclass codes. If we map the Lonclass notion of ‘Sweden’ to identified concepts described elsewhere, then whenever we learn more about the latter, we also learn more about the former, and indirectly, about anything tagged with complex lonclass codes using that concept. Suddenly an archived TV documentary tagged as covering a ‘report on the environmental impact of the decline of tin mining in sweden’ is accessible also to people or machines looking under Scandinavia + metal mining.

Lonclass and UDC codes have a rich hidden structure that is rarely exploited with modern tools. Lonclass by virtue of its UDC heritage, does a lot of work itself towards representing rich conceptual inter-relationships. It embodies a conceptual map of our world, with mysterious codes (well known in the library world) for topics such as ‘622 – mining’, but also specifics e.g. ‘622.3 Mining of specific minerals, ores, rocks’, and combinations (‘622.3:553.9 Extraction of carbonaceous minerals, hydrocarbons’). By joining a code for ‘mining a specific mineral…’ to a code for ‘553.9 Deposits of carbonaceous rocks. Hydrocarbon deposits’ we get a compound term. So Lonclass/UDC “knows” about the relationship between “Tin Mining” and “Mining”, “metals” etc., and quite likely between “Sweden” and “Scandinavia”. But it can’t know everything! Sooner or later, we have to say, “Sorry, it’s not reasonable to expect the classification system to model the entire world; that’s a bigger problem”.

Even within the closed, self-supporting universe of UDC/Lonclass, this compositional semantics system is a very powerful tool for describing obscure topics in terms of well known simpler concepts. But it’s too much for any single organisation (whether the BBC, the UDC Consortium, or anyone) to maintain and extend such a system to cover all of modern life; from social, legal and business developments to new scientific innovations. The work needs to be shared, and RDF is currently our best bet on how to created such work sharing, meaning sharing, information-linking systems in the Web. The hierarchies in UDC and Lonclass don’t attempt to represent all of objective reality; they instead show paths through information.

If the metaphor of a ‘conceptual map’ holds up, then it’s clear that at some point it’s useful to have our maps made by different parties. The Web now contains a smaller but growing Web of machine readable descriptions. Over at MusicBrainz is a community who take care of describing the entities and relationships that cover much of music, or at least popular music. Others describe countries, species, genetics, languages, historical events, economics, and countless other topics. The data is sometimes messy or an imperfect fit for some task-in-hand, but it is actively growing, curated and connected.

I’m not arguing that Lonclass or UDC should be thrown out and replaced by some vague ‘linked cloud’. Rather, that there are some simple steps that can be taken towards making sure each of these linked datasets contribute to modernising our paths into the archives. We need to document and share opensource tools for an agreed data model for the arcane numeric codes of UDC and Lonclass. We need at least the raw pieces, the simplest codes, to be described for humans and machines in public, stable Web pages, and for their re-use, mapping, data mining and re-combination to be actively encouraged and celebrated. Currently, it is possible to get your hands on this data if you sign NDAs (Lonclass), pay fees (UDC) or exchange USB sticks with the right party in some shady backstreet. Whether the metaphor of choice is ‘key to the archives’ or ‘conceptual map, this is a deeply unfortunate situation. There’s a wealth of meaning hidden inside Lonclass and UDC and the collections they index, a lot that can be added by linking it to other RDF datasets, but more importantly there are huge communities out there who’ll do the work when the data is finally opened up…

Syndicated 2010-11-18 10:02:49 from danbri's foaf stories

16 Nov 2010 »

Disambiguating with DBpedia

Sketchy notes. Say you’re looking for an identifier for something, and you know it’s a company/organization, and you have a label “Woolworths”.

What can be done to choose amongst the results we find in DBpedia for this crude query?

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

select distinct ?x where {

?x a <http://dbpedia.org/ontology/Organisation>; rdfs:label ?l .

FILTER(REGEX(?l, “Woolworths*”)).

}

More generally, are the tweaks and tricks needed to optimise this sort of disambiguation going to be cross-domain, or do we have to hand-craft them, case by case?

Syndicated 2010-11-16 16:42:33 from danbri's foaf stories

9 Nov 2010 »

Sagan on libraries

“Books permit us to voyage through time, to tap the wisdom of our ancestors. The library connects us with the insights and knowledge, painfully extracted from Nature, of the greatest minds that ever were, with the best teachers, drawn from the entire planet and from all of our history, to instruct us without tiring, and to inspire us to make our own contribution to the collective knowledge of the human species. Public libraries depend on voluntary contributions. I think the health of our civilization, the depth of our awareness about the underpinnings of our culture and our concern for the future can all be tested by how well we support our libraries.” –Carl Sagan, http://en.wikiquote.org/wiki/Carl_Sagan

Syndicated 2010-11-09 10:06:27 from danbri's foaf stories

2 Nov 2010 »

Easier in RDFa: multiple types and the influence of syntax on semantics

RDF is defined as an abstract data model, plus a collection of practical notations for exchanging RDF descriptions (eg. RDF/XML, RDFa, Turtle/N3). In theory, your data modelling activities are conducted in splendid isolation from the sleazy details of each syntax. RDF vocabularies define classes of thing, and various types of property/relationship that link those things. And then instance data uses arbitrary combinations of those vocabularies to make claims about stuff. Nothing in your vocabulary design says anything about XML or text formats or HTML or other syntactic details.

All that said, syntactic considerations can mess with your modelling. I’ve just written this up for the Linked Library Data group, but since the point isn’t often made, I thought I’d do so here too.

RDF instance data, ie. descriptions of stuff, is peculiar in that it lets you use multiple independent schemas at the same time. So I might use SKOS, FOAF, Bio, Dublin Core and DOAP all jumbled up together in one document. But there are some considerations when you want to mention that something is in multiple classes. While you can do this in any RDF notation, it is rather ugly in RDF/XML, historically RDF’s most official, standard notation. Furthermore, if you want to mention that two things are related by two or more specified properties, this can be super ugly in RDF/XML. Or at least rather verbose. These practical facts have tended to guide the descriptive idioms used in real world RDF data. RDFa changes the landscape significantly, so let me give some examples.

Backstory – decentralised extensibility

RDF classes from one vocabulary can be linked to more general or specific classes in another; we use rdfs:subClassOf for this. Similarly, RDF properties can be linked with rdfs:subPropertyOf claims. So for example in FOAF we might define a class foaf:Organization, and leave it at that. Meanwhile over in the Org vocabulary, they care enough to distinguish a subclass, org:FormalOrganization. This is great! Incremental, decentralised extensibility. Similarly, FOAF has foaf:knows as a basic link between people who know each other, but over in the relationship vocabulary, that has been specialized, and we see relationships like ‘livesWith‘, ‘collaboratesWith‘. These carry more specific meaning, but they also imply a foaf:knows link too.

This kind of machine-readable (RDFS/OWL) documentation of the patterns of meaning amongst properties (and classes) has many uses. It could be used to infer missing information: if Ian writes RDF saying “Alice collaboratesWith Bob” but doesn’t explicitly say that Alice also knows Bob, a schema-aware processor can add this in. Or it can be used at query time, if someone asks “who does Alice know?”. But using this information is not mandatory, and this creates a problem for publishers. Should they publish redundant information to make it easier for simple data consumers to understand the data without knowing about the more detailed (and often more recent) vocabulary used?

Historically, adding redundant triples to capture the more general claims has been rather expensive – both in terms of markup beauty, and also file size. RDFa changes this.

Here’s a simple RDF/XML description of something.

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:foaf="http://xmlns.com/foaf/0.1/">
<foaf:Person rdf:about="#fred">
 <foaf:name>Fred Flintstone</foaf:name>
</foaf:Person>
</rdf:RDF>

…and here is how it would have to look if we wanted to add a 2nd type:

<foaf:Person rdf:about="#fred"
 rdf:type="http://example.com/vocab2#BiblioPerson">
  <foaf:name>Fred Flintstone</foaf:name>
</foaf:Person>
</rdf:RDF>

To add a 3rd or 4th type, we’d need to add in extra subelements eg.

<rdf:type rdf:resource="http://example.com/vocab2#BiblioPerson"/>

Note that the full URI for the vocabulary needs to be used at every occurence of the type. Here’s the same thing, with multiple types, in RDFa.

<html>
<head><title>a page about Fred</title></head>
<body>
<div xmlns:foaf="http://xmlns.com/foaf/0.1/"
xmlns:vocab2="http://example.com/vocab2#"
 about="#fred" typeof="foaf:Person vocab2:BiblioPerson" >
<span property="foaf:name">Fred Flintstone</span>
</div>
</body>
</html>

RDFa 1.0 requires the second vocabulary’s namespace to be declared, but after that it is pretty concise if you want to throw in a 2nd or a 3rd type, for whatever you’re
describing. If you’re talking about a relationship between people, instead of ” rel=’foaf:knows’ ” you could put “rel=’foaf:knows rel:livesWith’ “; if you wanted to mention that something was in the class not just of organizations, but formal organizations, you could write “typeof=’foaf:Organization org:FormalOrganization’”.

Properties and classes serve quite different social roles in RDF. The classes tend towards being dull, boring, because they are the point of connection between different datasets and applications. The detail, personality and real information content in RDF lives in the properties. But both classes and properties fall into specialisation hierarchies that cross independent vocabularies. It is quite a common experience to feel stuck, not sure whether to use a widely known but vague term, or a more precise but ‘niche’, new or specialised vocabulary. As RDF syntaxes improve, this tension can melt away somewhat. In RDFa it is significantly easier to simply publish both, allowing smart clients to understand your full detail, and simple clients to find the patterns they expect without having to do schema-based processing.

Syndicated 2010-11-02 11:57:08 from danbri's foaf stories

27 Oct 2010 »

Archive.org TV metadata howto

The following is composed from answers kindly supplied by Hank Bromley, Karen Coyle, George Oates, and Alexis Rossi from the archive.org team. I have mixed together various helpful replies and retro-fitted them to a howto/faq style summary.

I asked about APIs and data access for descriptions of the many and varied videos in Archive.org. This guide should help you get started with building things that use archive.org videos. Since the content up there is pretty much unencumbered, it is perfect for researchers looking for content to use in demos. Or something to watch in the evening.

To paraphrase their answer, it was roughly along these lines:

you can do automated lookups of the search engine using a simple HTTP/JSON API
downloading a lot or everything is ok if you need or prefer to work locally, but please write careful scripts
hopefully the search interface is useful and can avoid you needing to do this

Short API overview: each archive entry that is a movie, video or tv file should have a type ‘movie’. Everything in the archive has a short textual ID, and an XML description at a predictable URL. You can find those by using the JSON flavour of the archive’s search engine, then download the XML (and content itself) at your leisure. Please cache where possible!

I was also pointed to http://deweymusic.org/ which is an example of a site that provides a new front-end for archive.org audio content – their live music collection. My hope in posting these notes here is to help people working on new interfaces to Web-connected TV explore archive.org materials in their work.

JSON API to archive.org services

See online documentation for JSON interface; if you’re happy working with the remote search engine and are building a Javascript-based app, this is perfect.

We have been moving the majority of our services from formats like XML, OAI and other to the more modern JSON format and method of client/server interaction.

How to … play well with others

As we do not have unlimited resources behind our services, we request that users try to cache results where they can for the more high traffic and popular installations/uses. 8-)

TV content in the archive

The archive contains a lot of video files; old movies, educational clips, all sorts of fun stuff. There is also some work on reflecting broadcast TV into the system:

First off, we do have some television content available on the site right now:
http://www.archive.org/details/tvarchive - It’s just a couple of SF gov channels, so the content itself is not terribly exciting. But what IS cool is that this being recorded directly off air and then thrown into publicly available items on archive.org automatically. We’re recording other channels as well, but we currently aren’t sure what we can make public and how.

How to… get all metadata

If you really would rather download all the metadata and put it in their own search engine or database, it’s simple to do: get a list of the identifiers of all video items from the search engine (mediatype:movies), and for each one, fetch this file:

http://www.archive.org/download/{itemID}/{itemID}_meta.xml

So it’s a bit of work since you have to retrieve each metadata record separately, but perhaps it is easily programmable.

However, once you have the identifier for an item, you can automatically find the meta.xml for it (or the files.xml if that’s what you want). So if the item is at:
http://www.archive.org/details/Sita_Sings_the_Blues
the meta.xml is at
http://www.archive.org/download/Sita_Sings_the_Blues/Sita_Sings_the_Blues_meta.xml
and the files.xml is at
http://www.archive.org/download/Sita_Sings_the_Blues/Sita_Sings_the_Blues_files.xml

This is true for every single item in the archive.

How to… get a list of all IDs

Use http://www.archive.org/advancedsearch.php

Basically, you put in a query, choose the metadata you want returned, then choose the format you’d like it delivered in (rss, csv, json, etc.).

Downsides to this method – you can only get about 10,000 items at once (you might be able to push it to 20,000) before it crashes on you, and you can only get the metadata fields listed.

How to… monitor updates with RSS?

Once you have a full dump, you can monitor incoming items via the RSS feed on this page:

http://www.archive.org/details/movies

Subtitles / closed captions

For the live TV collection, there should be extracted subtitles. Maybe I just found bad examples. (e.g

http://www.archive.org/details/SFGTV2_20100909_003000).

Todo: more info here!

What does the Archive search engine index?

In general *everything* in the meta.xml files is indexed in the IA search engine, and accessible for scripted queries at http://www.archive.org/advancedsearch.php.

But it may be that the search engine will support whatever queries you want to make, without your having to copy all the metadata to your own site.

How many “movies” are in the database?

Currently 314,624 “movies” items in the search engine. All tv and video items are supposed to be have “movies” for their mediatype, although there has been some leakage now and then.

Should I expect a valid XML file for each id?

eg. ”identifier”:”mosaic20031001″ seemed problematic.
There are definitely items on the archive that have extremely minimally filled outmeta.xml files.

Response from a trouble report:

“I looked at a couple of your examples, i.e. http://www.archive.org/details/HomeElec, and they do have a meta.xml file in our system… but it ONLY contains a mediatype (movies) and identifier and nothing else. That seems to be making our site freak out. There are at least 800 items in movies that do not have a title. There might be other minimal metadata that is required for us to think it’s a real item, but my guess is that if you did a search like this one you’d see fewer of those errors:
http://www.archive.org/search.php?query=mediatype%3Amovies%20AND%20title%3A[*%20TO%20*] ”

The other error you might see is “The item is not available due to issues with the item’s content.” This is an item that has been taken down but for some reason it did not get taken out of the SE – it’s not super common, but it does happen.
I don’t think we’ve done anything with autocomplete on the Archive search engine, although one can use wildcards to find all possible completions by doing a query. For example, the query:

http://www.archive.org/advancedsearch.php?q=mediatype%3Avideo+AND+title%3Aopen*&fl[]=identifier&fl[]=title&rows=10&page=1&output=json&save=yes

will match all items whose titles contain any words that start with “open” – that sample result of ten items shows titles containing “open,” “opening,” and “opener.”

How can I autocomplete against archive.org metadata?

Not at the moment.

“I believe autocomplete *has* been explored with the search engine on our “Open Library” sister site, openlibrary.org.”

How can I find interesting and well organized areas of the video archive?

I assume you’re looking for collections with pretty regular metadata to work on? These collections tend to be fairly filled out:
http://www.archive.org/details/prelinger
http://www.archive.org/details/academic_films
http://www.archive.org/details/computerchronicles

Syndicated 2010-10-27 07:30:58 from danbri's foaf stories

9 Jul 2010 »

Subject classification and Statistics

Subject classification and statistics share some common problems. This post takes a small example discussed at this week’s ODaF event on “Semantic Statistics” in Tilberg, and explores its expression coded in the Universal Decimal Classification (UDC). UDC supports faceted description, providing an abstract grammar allowing sentence-like subject descriptions to be composed from the “raw materials” defined in its master reference file. This makes the mapping of UDC (and to some extent also Dewey classifications) into W3C’s SKOS somewhat lossy, since patterns and conventions for documenting these complex, composed structures are not yet well established. In the NoTube project we are looking into this in a TV context, in large part because the BBC archives make extensive use of UDC via their Lonclass scheme; see my ‘investigating Lonclass‘ and UDC seminar talk for more on those scenarios. Until this week I hadn’t thought enough about the potential for using this to link deep into statistical datasets.

One of the examples discussed on tuesday was as follows (via Richard Cyganiak):

“There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008″

There was much interesting discussion about the proper scope and role of Linked Data techniques for sharing this kind of statistical data. Do we use RDF essentially as metadata, to find ‘black boxes’ full of stats, or do we use RDF to try to capture something of what the statistics are telling us about the world? When do we use RDF as simple factual data directly about the world (eg. school X has N pupils [currently; or at time t]), and when does it become a carrier for raw numeric data whose meaning is not so directly expressed at the factual level?

The state of the art in applying RDF here seems to be SDMX-RDF, see Richard’s slides. The SDMX-RDF work uses SKOS to capture code lists, to describe cross-domain concepts and to indicate subject matter.

Given all this, I thought it would be worth taking this tiny example and looking at how it might look in UDC, both as an example of the ‘compositional semantics’ some of us hope to capture in extended SKOS descriptions, but also to explore scenarios that cross-link numeric data with the bibliographic materials that can be found via library classification techniques such as UDC. So I asked the ever-helpful Aida Slavic (editor in chief of the UDC), who talked me through how this example data item looks from a UDC perspective.

I asked,

So I’ve just got home from a meeting on semweb/stats. These folk encode data values with stuff like “There were 66 fatal occupational injuries in the Washington, DC metropolitan area in 2008″. How much of that could have a UDC coding? I guess I should ask, how would subject index a book whose main topic was “occupational injuries in the Washington DC metro area in 2008″?

Aida’s reply (posted with permission):

You can present all of it & much more using UDC. When you encode a subject like this in UDC you store much more information than your proposed sentence actually contains. So my decision of how to ‘translate this into udc’ would depend on learning more about the actual text and the context of the message it conveys, implied audience/purpose, the field of expertise for which the information in the document may be relevant etc. I would probably wonder whether this is a research report, study, news article, textbook, radio broadcast?

Not knowing more then you said I can play with the following: 331.46(735.215.2/.4)”2008”

Accidents at work — Washington metropolitan area — year 2008
or a bit more detailed: 331.46-053.18(735.215.2/.4)”2008”
Accidents at work — dead persons – Washington metropolitan area — year 2008
[you can say the number of dead persons but this is not pertinent from point of view of indexing and retrieval]

…or maybe (depending what is in the content and what is the main message of the text) and because you used the expression ‘fatal injuries’ this may imply that this is more health and safety/ prevention area in health hygiene which is in medicine.

The UDC structures composed here are:

TIME “2008″

PLACE (735.215.2/.4) Counties in the Washington metropolitan area

TOPIC 1
331 Labour. Employment. Work. Labour economics. Organization of labour
331.4 Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work
331.46 Accidents at work ==> 614.8

TOPIC 2
614 Prophylaxis. Public health measures. Preventive treatment
614.8 Accidents. Risks. Hazards. Accident prevention. Persona protection. Safety
614.8.069 Fatal accidents

NB – classification provides a bit more context and is more precise than words when it comes to presenting content i.e. if the content is focused on health and safety regulation and occupation health then the choice of numbers and their order would be different e.g. 614.8.069:331.46-053.18 [relationship between] health & safety policies in prevention of fatal injuries and accidents at work.

So when you read UDC number 331.46 you do not see only e.g. ‘accidents at work’ but ==> ’accidents at work < occupational health/safety < labour economics, labour organization < economy
and when you see UDC number 614.8 it is not only fatal accidents but rather ==> ‘fatal accidents < accident prevention, safety, hazards < Public health and hygiene. Accident prevention

When you see (735.2….) you do not only see Washington but also United States, North America

So why is this interesting? A couple of reasons…

1. Each of these complex codes combines several different hierarchically organized components; just as they can be used to explore bibliographic materials, similar approaches might be of value for navigating the growing collections of public statistical data. If SKOS is to be extended / improved to better support subject classification structures, we should take care also to consider use cases from the world of statistics and numeric data sharing.

2. Multilingual aspects. There are plans to expose SKOS data for the upper levels of UDC. An HTML interface to this “UDC summary” is already available online, and includes collected translations of textual labels in many languages (see progress report) . For example, we can look up 331.4 and find (in hierarchical context) definitions in English (“Working environment. Workplace design. Occupational safety. Hygiene at work. Accidents at work”), alongside e.g. Spanish (“Entorno del trabajo. Diseño del lugar de trabajo. Seguridad laboral. Higiene laboral. Accidentes de trabajo”), Croatian, Armenian, …

Linked Data is about sharing work; if someone else has gone to the trouble of making such translations, it is probably worth exploring ways of re-using them. Numeric data is (in theory) linguistically neutral; this should make linking to translations particularly attractive. Much of the work around RDF and stats is about providing sufficient context to the raw values to help us understand what is really meant by “66″ in some particular dataset. By exploiting SDMX-RDF’s use of SKOS, it should be possible to go further and to link out to the wider literature on workplace fatalities. This kind of topical linking should work in both directions: exploring out from numeric data to related research, debate and findings, but also coming in and finding relevant datasets that are cross-referenced from books, articles and working papers. W3C recently launched a Library Linked Data group, I look forward to learning more about how libraries are thinking about connecting numeric and non-numeric information.

Syndicated 2010-07-09 09:38:31 from danbri's foaf stories

14 Jan 2010 »

RDFa in Drupal 7: last call for feedback before alpha release

Stephane has just posted a call for feedback on the Drupal 7 RDFa design, before the first official alpha release.

First reaction above all, is that this is great news! Very happy to see this work maturing.

I’ve tried to quickly suggest some tweaks to the vocab, by hacking his diagram in photoshop. All it really shows is that I’ve forgotten how to use photoshop, but I’ll upload it here anyway.

So if you click through to the full image, you can see my rough edits.

I’d suggest:

Use (dcterms) dc:subject as the way of pointing from a document to it’s SKOS subject.
Use (dcterms) dc:creator as the relationship between a document and the person that created it (note that in FOAF, we now declare foaf:maker to map as an equivalentProperty to (dcterms)dc:creator).
Distinguish between the description of the person versus their account in the Drupal system; I would use foaf:Person for the human, and sioc:User (a kind of foaf:OnlineAccount) as the drupal account. The foaf property to link from the former to the latter is foaf:account (new name for foaf:holdsAccount).
Focus on SIOC where it is most at-home: in modelling the structure of the discussion; threading, comments and dialog.
Provide a generated URI for the person. I don’t 100% understand Stephane’s comment, “Hash URIs for identifying things different from the page describing them can be implemented quite easily but this case hasn’t emerged in core” but perhaps this will be difficult? I’d suggest using URIs ending “userpage#!person” so the fragment IDs can’t clash with HTML usage.

If the core release can provide this basic structure, including a hook for describing the human person rather than the site-specific account (ie. sioc:User) then extensions should be able to add their own richness. The current markup doesn’t quite work for that end, as the human user is only described indirectly (unless I understand current reading of sioc:User).

Anyway, I’m nitpicking! This is really great, and a nice and well-deserved boost for the RDFa community.

Syndicated 2010-01-14 13:07:04 from danbri's foaf stories

5 Jan 2010 »

WOT in RDFa?

(This post is written in RDFa…)

To the best of my knowledge, Ludovic Hirlimann’s PGP fingerprint is 6EFBD26FC7A212B2E093 B9E868F358F6C139647C. You might also be interested in his photos on flickr, or his workplace, Mozilla Messaging. The GPG key details were checked over a Skype video call with me, Ludo and Kaare A. Larsen.

This blog post isn’t signed, the URIs it referenced don’t use SSL, and the image could be switched by evildoers at any time! But the question’s worth asking: is this kind of scruffy key info useful, if there’s enough of it?

Syndicated 2010-01-05 21:59:19 from danbri's foaf stories

3 Jan 2010 »

My ’70s Schoolin’ (in RDFa)

I went to Hamsey Green school in the 1970s.

Looking in the UK Govt datasets, I see it is listed there with a homepage of ‘http://www.hamsey-green-infant.surrey.sch.uk’ (which doesn’t seem to work).

Some queries I’m trying via the SPARQL dataset (I’ll update this post if I make them work…)

First a general query, from which I found the URL manually, …

select distinct ?x ?y where { ?x <http ://education.data.gov.uk/def/school/websiteAddress> ?y . }

Then I can go back into the data, and find other properties of the school:

PREFIX sch-ont:  <http://education.data.gov.uk/def/school/>
select DISTINCT ?x ?p ?z WHERE
{
?x sch-ont:websiteAddress "http://www.hamsey-green-infant.surrey.sch.uk" .
?x ?p ?z .
}

Syndicated 2010-01-03 10:48:35 from danbri's foaf stories

29 Dec 2009 »

WordPress trust syndication revisited: F2F plugin

This is a followup to my Syndicating trust? Mediawiki, Wordpress and OpenID post. I now have a simple implementation that exports data from WordPress: the F2F plugin. Also some experiments with consuming aggregates of this information from multiple sources.

FOAF has always had a bias towards describing social things that are shown rather than merely stated; this is particularly so in matters of trust. One way of showing basic confidence in others, is by accepting their comments on your blog or Web site. F2F is an experiment in syndicating information about these kinds of everyday public events. With F2F, others can share and re-use this sort of information too; or deal with it in aggregate to spread the risk and bring more evidence into their trust-related decisions. Or they might just use it to find interesting people’s blogs.

OpenID is a technology that lets people authenticate by showing they control some URL. WordPress blogs that use the OpenID plugin slowly accumulate a catalogue of URLs when people leave comments that are approved or rejected. In my previous post I showed how I was using the list of approved OpenIDs from my blog to help configure the administrative groups on the FOAF wiki.

This may all raise more questions than it answers. What level of detail is appropriate? are numbers useful, or just lists? in what circumstances is it sensible or risky to merge such data? is there a reasonable use for both ‘accept’ lists and ‘unaccept’ lists? What can we do with a list of OpenID URLs once we’ve got it? How do we know when two bits of trust ‘evidence’ actually share a common source? How do we find this information from the homepage of a blog?

If you install the F2F plugin (and have been using the OpenID plugin long enough to have accumulated a database table of OpenIDs associated with submitted comments), you can experiment with this. Basically it will generate HTML in RDFa format describing a list of people . See the F2F Wiki page for details and examples.

The script is pretty raw, but today it all improved a fair bit with help from Ed Summers, Daniel Krech and Morten Frederiksen. Ed and Daniel helped me get started with consuming this RDFa and SPARQL in the latest version of the rdflib Python library. Morten rewrote my initial nasty hack, so that it used Wordpress Shortcodes instead of hardcoding a URL path. This means that any page containing a certain string – f2f in chunky brackets – will get the OpenID list added to it. I’ll try that now, right here in this post. If it works, you’ll get a list of URLs below. Also thanks to Gerald Oskoboiny for discussions on this and reputation-related aggregation ideas; see his page on reputation and trust for lost more related ideas and sites. See also Peter Williams’ feedback on the foaf-dev list.

Next steps? I’d be happy to have a few more installations of this, to get some testbed data. Ideally from an overlapping community so the datasets are linked, though that’s not essential. Ed has a copy installed currently too. I’ll also update the scripts I use to manage the FOAF MediaWiki admin groups, to load data from RDFa blogs; mine and others if people volunteer relevant data. It would be great to have exports from other software too, eg. Drupal or MediaWiki.

Comment accept list for http://danbri.org/words

http://aberingi.pip.verisignlabs.com/

http://acs.myopenid.com/

http://alexandre.alapetite.net/

http://apassant.net/

http://areggiori.livejournal.com/

http://bencompanjen.hyves.nl/#OWI1MzRmYj

http://bendiken.livejournal.com/

http://bentoth.pip.verisignlabs.com/

http://bmuller.myopenid.com/

http://brondsema.net/

http://bruce.darcus.name/