Saturday, July 25, 2009

Linked Data and what it really takes to create it

Recently there have been a number of popular discussions (some of them heated) on whether RDF is necessary for Linked Data. For example:

http://cloudofdata.com/2009/07/does-linked-data-need-rdf/

http://www.semanticsincorporated.com/2009/07/if-linked-data-is-a-brand-it-has-big-problems-to-address.html

I believe that for Linked Data to happen, we need a standard for representing semantic information. URIs are just an addressing scheme and do not carry any semantics.

Once one moves beyond the high-level marketing statements and starts to consider the specific technical details of how data linking could work, it quickly becomes apparent that, at a minimum, RDF (or something very much like RDF) is required. This is often lost on bloggers without a sufficiently deep grasp of the underlying technologies, and it results in somewhat unfocused and meandering discussions. Questions about how exactly URIs alone can bring data together are left unanswered and glossed over with high-level rhetoric.
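To make the contrast concrete, here is a minimal Turtle sketch (the ex: vocabulary is hypothetical, and the DBpedia URI is used purely as an illustration). The shared URI is only the join point; it is the RDF triples using it that actually say something:

```turtle
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/buildings#> .    # hypothetical vocabulary
@prefix dbr: <http://dbpedia.org/resource/> .

# Publisher A states what kind of thing the resource is...
dbr:Eiffel_Tower a ex:Tower .

# ...while publisher B states how tall it is.
dbr:Eiffel_Tower ex:heightInMetres "324.0"^^xsd:decimal .
```

Without an agreed statement model of this kind, the two publishers would have nothing in common but a string.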

Unfortunately, this makes it hard to conduct substantive discussions. Perhaps this is why I found the recent post on Data Reconciliation Strategies and Their Impact on the Web of Data more insightful and useful than the discussions above. It makes a number of valid points and raises important questions. Having said this, I disagree with the author's conclusion:

"I cannot help but think that any effort tasked to promote and increase such density will look and feel just like Freebase: carefully selecting datasets and painfully trying to reconcile as much data as possible right when it enters the system and entice a community of volunteers to maintain it, curate it and clean up eventual mistakes."

The query proposed by the author, "the height of all towers located in Paris", runs into the following "connectivity" problems (a SPARQL sketch of the target query follows the list):

1. The identity of a tower

2. The identity of the height predicate

3. The identity of Paris
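For concreteness, here is roughly what the target query could look like in SPARQL once the three identities are resolved against DBpedia-style URIs; the specific class and property names are illustrative assumptions, not a prescription.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?tower ?height
WHERE {
  ?tower a dbo:Tower ;             # problem 1: which class means "tower"?
         dbo:location dbr:Paris ;  # problem 3: which URI identifies Paris?
         dbo:height ?height .      # problem 2: which predicate means "height"?
}
```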

I believe it is important to encourage people to provide mappings from their vocabulary/schema to established vocabularies. These are appearing, and will continue to appear, over time. foaf:Person is a good example. You do not have to, and may not be able to, use it directly in your data, but as you expose your data to the Linked Open Data (LOD) cloud it is good practice to provide such mappings. Otherwise, someone else will have to do it.
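As a sketch of what such a mapping can look like (the my: namespace and its terms are hypothetical), a single triple relating a local class or property to the FOAF term is often enough:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix my:   <http://example.org/hr-schema#> .   # hypothetical local vocabulary

# Keep using my:Employee internally; the mapping exposes it to the LOD cloud.
my:Employee rdfs:subClassOf foaf:Person .

# The same idea works for properties.
my:fullName rdfs:subPropertyOf foaf:name .
```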

This is where DBpedia becomes very useful. It has a URI for height. The NASA/TopQuadrant units ontology (shortly to be released at http://www.oegov.org/) has one as well. We linked units of measure, quantities and dimensions to DBpedia. If the broader community finds this work useful, the ontology may over time become a de facto standard that people link to. But even today it is possible to connect it to other relevant ontologies/datasets through DBpedia.
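In the same spirit, a property-level mapping to the DBpedia height term can be a single statement (the unit: prefix below is a placeholder, not the ontology's actual namespace):

```turtle
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix unit: <http://example.org/units#> .   # placeholder for the units ontology namespace

# Data using unit:height and data using dbo:height can now be queried together.
unit:height owl:equivalentProperty dbo:height .
```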

The same premise applies to Paris. Some authoritative sources have emerged: GeoNames and, again, DBpedia. If the two cross-reference each other, then linking to either one should do the trick.
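Such cross-references are, again, just RDF links; a single owl:sameAs statement like the sketch below (the GeoNames identifier shown is the one commonly used for Paris, included here for illustration) bridges the two:

```turtle
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix dbr: <http://dbpedia.org/resource/> .

# Published by either side, this lets a query against one dataset
# pick up facts stated against the other.
dbr:Paris owl:sameAs <http://sws.geonames.org/2988507/> .
```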

Tower is a trickier problem because it is a type and can be identified in a number of ways. Maybe you have a table (class) Tower. Maybe you have a column (predicate) indicating the type. Maybe you have a table of controlled values for building types and a link to the appropriate resource (this will be the case if something has more than one type, say both a tower and a lighthouse). A similar issue may apply to the height if you reify it, for example to provide a unit of measure or a measurement date.
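To illustrate the three shapes (all prefixes here are hypothetical), each source calls for a slightly different kind of mapping to a shared class such as dbo:Tower:

```turtle
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix a1:   <http://example.org/datasetA#> .
@prefix b1:   <http://example.org/datasetB#> .
@prefix c1:   <http://example.org/datasetC#> .

# (1) A dedicated class: a plain subclass axiom does the job.
a1:Tower rdfs:subClassOf dbo:Tower .

# (2) A "type" column with a literal value: an axiom is not enough,
#     this needs a rule (see the SPARQL sketch after the next paragraph).
b1:building42 b1:buildingType "tower" .

# (3) A controlled-values table: a building can point at several type resources.
c1:lighthouse7 c1:hasType c1:TowerType , c1:LighthouseType .
c1:TowerType rdfs:seeAlso dbo:Tower .
```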

Some mechanisms are needed to describe less simplistic mappings. In our work, SPARQL-based SPIN rules have proven to be an effective standards-based solution for more complex mappings. Overall, I believe there are only a handful of patterns that will cover 80% of the cases. A good number of them are described above.
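As an illustration of the kind of rule involved, case (2) from the sketch above can be handled with a single CONSTRUCT query; SPIN stores rules as SPARQL queries of essentially this form (the b1: prefix is the same hypothetical one as before):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX b1:  <http://example.org/datasetB#>

# Infer an explicit type for every building whose local "type" column says "tower".
CONSTRUCT {
  ?building a dbo:Tower .
}
WHERE {
  ?building b1:buildingType "tower" .
}
```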

The key benefit of this approach is that people can start with their own vocabularies and then, at some later stage, add the links to DBpedia. Or they don't, and someone else does. This freedom is lost if a system (such as Freebase) forces users to do the mapping up front. With the units ontology it was quite easy to add the mappings, and it will likely be just as easy for most other existing data models.

