Synchronising dataspaces at scale
So, I have a question for you – how would you approach the following (engineering) problem? Imagine you have two dataspaces, a source dataspace, such as Eurostat with some 5000+ datasets that can take up to several GB in the worst case, and a target dataspace (for example, something like what we’re deploying in the LATC, currently). You want to ensure that the data in the target dataspace is as fresh as possible, that is, providing a minimal temporal delay between the contents of source and target dataspaces.
Don’t get me wrong, this has exactly nothing to do with Linked Data, RDF or the like. This is simply the question of how often one should ‘sample’ the source in order to make sure that the target is ‘always’ up-to-date.
Now, would you say that Shannon’s theorem is of any help? Or, you look at the given source update frequency and decide based on this how often you hammer the server?
It turns out that one should also take into account what happens in the target dataspace. In our case this is mainly the conversion of the XML or TSV into some RDF serialisation. This is, in cases where the source dataset has, say, some 11GB, a non-trivial issue to address. In addition, we see some ~1000 datasets changing in a couple of days time. Which would leave us, in the worst case, with a situation where we would still be in the conversion process of parts of the dataspace while already updated versions of the datasets would be pending.
On the other hand we know, based on our experience with the Eurostat data, that we can rebuild the entire dataspace – that is, downloading all 5000+ files incl. metadata, converting it to RDF and loading the metadata into the SPARQL endpoint – in some 11+ days. Wouldn’t it make sense to simply only look at the update every 10-or-so days?
We discussed this today and settled for a weekly (weekend) update policy. Let’s see where this takes us and I promise that I keep you posted …
Filed under: FYI, Linked Data