Saturday, July 25, 2009

Linked Data and what it really takes to create it

Recently there have been a number of popular discussions (some heated) on whether RDF is necessary for Linked Data. For example:

I believe that for Linked Data to happen, we need a standard for representing semantic information. URIs are just an addressing scheme and do not carry any semantics.

Once one moves beyond the high-level marketing statements and starts to consider the specific technical details of how data linking could work, it quickly becomes apparent that, at minimum, RDF (or something very much like RDF) is required. This is often lost on bloggers without a sufficiently deep grasp of the underlying technologies and results in somewhat unfocused and meandering discussions. Questions on how exactly URIs alone can bring the data together are left unanswered and glossed over in a rhetoric of high-level statements.

Unfortunately, this makes it hard to conduct substantive discussions. Perhaps this is why I found the recent post on Data Reconciliation Strategies and Their Impact on the Web of Data more insightful and useful than the discussions above. It makes a number of valid points and raises important questions. Having said this, I disagree with the author's conclusion:

"I cannot help but think that any effort tasked to promote and increase such density will look and feel just like Freebase: carefully selecting datasets and painfully trying to reconcile as much data as possible right when it enters the system and entice a community of volunteers to maintain it, curate it and clean up eventual mistakes."

The query proposed by the author – “the height of all towers located in Paris” – presents the following “connectivity” problems:

1. Identity of a tower

2. Identity of the predicate height

3. Identity of Paris

I believe it is important to encourage people to provide mappings from their vocabulary/schema to some established vocabularies. These will appear (and are appearing) over time. foaf:Person is a good example. Now, you do not have to, and may not be able to, use it directly in your data, but as you expose your data to the Linked Data Cloud (LOD) it would be a good practice to provide such mappings. Otherwise, someone else will have to do it.
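To make the mapping idea concrete, here is a minimal sketch, with plain Python tuples standing in for RDF triples. All names other than foaf:Person (such as myvocab:Employee and ex:jane) are invented for the example; a single rdfs:subClassOf-style mapping is enough to surface locally-typed data as foaf:Person.

```python
# Minimal sketch: triples as tuples, a mapping, and rdfs:subClassOf inference.
# All URIs below except foaf:Person are illustrative stand-ins.

RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Local data, expressed in a home-grown vocabulary.
data = {
    ("ex:jane", RDF_TYPE, "myvocab:Employee"),
}

# The mapping from the local vocabulary to an established one.
mappings = {
    ("myvocab:Employee", SUBCLASS, "foaf:Person"),
}

def infer_types(triples, mappings):
    """Apply rdfs:subClassOf semantics: instances of a subclass
    are also instances of its superclass (transitively)."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        for s, p, o in list(inferred):
            if p != RDF_TYPE:
                continue
            for sub, rel, sup in mappings:
                if rel == SUBCLASS and sub == o and (s, RDF_TYPE, sup) not in inferred:
                    inferred.add((s, RDF_TYPE, sup))
                    changed = True
    return inferred

result = infer_types(data, mappings)
print(("ex:jane", RDF_TYPE, "foaf:Person") in result)
```

The point is that the original data never has to change; the mapping triple is published separately, by the data owner or by someone else.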

This is where dbpedia becomes very useful. It has a URI for height. The NASA/TopQuadrant units ontology (to be released shortly) has one as well. We linked units of measure, quantities and dimensions to dbpedia. If the broader community finds this work useful, it may be that over time this ontology becomes the de-facto standard and people will link to it. But even today it is possible to connect it to other relevant ontologies/datasets - through dbpedia.

The same premise applies to Paris. Some authoritative sources have emerged - geonames and, again, dbpedia. If they both cross-reference each other, then linking to either one should do the trick.
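Why mutual cross-referencing is enough can be sketched with a small union-find over owl:sameAs pairs: sameAs is symmetric and transitive, so a dataset that links only to geonames still ends up in the same identity cluster as the dbpedia URI. The URIs below are illustrative stand-ins, not actual published links.

```python
# Sketch: owl:sameAs closure via union-find. URIs are illustrative.

same_as = [
    ("dbpedia:Paris", "geonames:2988507"),    # hypothetical cross-reference
    ("mydata:paris_fr", "geonames:2988507"),  # my dataset links to geonames only
]

def equivalence_classes(pairs):
    """Union-find over sameAs pairs; returns identity clusters."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    groups = {}
    for x in parent:
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

classes = equivalence_classes(same_as)
# All three URIs end up in a single identity cluster.
print(classes)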

Tower is a trickier problem because it is a type and can be identified in a number of ways. Maybe you have a table (class) Tower. Maybe you have a column (predicate) indicating the type. Maybe you have a table of controlled values for building types and a link to the appropriate resource (this will be the case if something has more than one type - say, both a tower and a lighthouse). A similar issue may apply to the height if you reify it to provide, for example, a unit of measure or a measurement date.
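These three patterns can be illustrated with a toy normalizer that lifts each source shape to the same canonical rdf:type triple. All table, column and URI names here are invented for the example.

```python
# Sketch: three source shapes for "this thing is a tower", normalized
# to one canonical rdf:type triple. All names are illustrative.

def normalize(record):
    """Yield (subject, 'rdf:type', class) triples from three source shapes."""
    triples = []
    # Pattern 1: the record came from a table (class) named Tower.
    if record.get("table") == "Tower":
        triples.append((record["id"], "rdf:type", "ex:Tower"))
    # Pattern 2: a column (predicate) carries the type as a literal.
    if record.get("building_type") == "tower":
        triples.append((record["id"], "rdf:type", "ex:Tower"))
    # Pattern 3: links into a controlled-values table, possibly multi-valued.
    for ref in record.get("type_refs", []):
        triples.append((record["id"], "rdf:type", "ex:" + ref.capitalize()))
    return triples

# A resource that is both a tower and a lighthouse (pattern 3).
rec = {"id": "ex:phare_de_cordouan", "type_refs": ["tower", "lighthouse"]}
print(normalize(rec))
```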

Some mechanisms are needed to describe less simplistic mappings. In our work, SPARQL-based SPIN rules have proven to be an effective standards-based solution for more complex mappings. Overall, I believe there are only a handful of patterns that will constitute 80% of the cases. A good number of these are described above.
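As a toy stand-in for such a query - plain Python over triples, not SPIN or SPARQL itself - here is the author's example run over data that has already been reconciled to shared identifiers. All URIs and the heights are illustrative.

```python
# Sketch: "the height of all towers located in Paris", once tower,
# height and Paris all resolve to shared identifiers. URIs illustrative.

triples = {
    ("ex:eiffel", "rdf:type", "ex:Tower"),
    ("ex:eiffel", "ex:locatedIn", "dbpedia:Paris"),
    ("ex:eiffel", "dbpedia:height", "324"),
    ("ex:montparnasse", "rdf:type", "ex:Tower"),
    ("ex:montparnasse", "ex:locatedIn", "dbpedia:Paris"),
    ("ex:montparnasse", "dbpedia:height", "210"),
    ("ex:shard", "rdf:type", "ex:Tower"),
    ("ex:shard", "ex:locatedIn", "dbpedia:London"),
    ("ex:shard", "dbpedia:height", "310"),
}

def heights_of_towers_in(city, triples):
    """Join three triple patterns, SPARQL-style."""
    towers = {s for s, p, o in triples if p == "rdf:type" and o == "ex:Tower"}
    in_city = {s for s, p, o in triples if p == "ex:locatedIn" and o == city}
    return {s: o for s, p, o in triples
            if p == "dbpedia:height" and s in towers and s in in_city}

print(heights_of_towers_in("dbpedia:Paris", triples))
```

Only once the three identity problems above are solved does this three-way join return anything at all; that is the work the mappings do.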

The key benefit of using such an approach is that people can start with their own vocabularies and then, at some later stage, add the links to dbpedia. Or they don't, and someone else does it. This freedom is lost if a system (such as Freebase) forces users to do the mapping up front. With the units ontology it was quite easy to add the mappings, as it will be for most other existing data models.

Thursday, July 9, 2009

Data Transformation using Semantic Web Standards

I have created the presentation below in response to a recent discussion about converting XML to RDF.

A person I was talking to assumed that there was a mapping process one needed to go through before a translation of XML (or relational databases, spreadsheets, etc.) into RDF could take place.

Indeed, mapping often happens, but it happens after translation. First, the non-RDF information is represented in RDF. Any mappings that are created are also captured in RDF/OWL - either by using constructs such as rdfs:subClassOf and owl:sameAs or, for more complex mappings, by using SPIN (SPARQL rules).
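The translate-first step can be sketched as a naive XML-to-triples lift that makes no mapping decisions at import time: element names, attributes and text all become triples as-is. The xml2rdf: namespace and the blank-node naming scheme are invented for illustration; real tools do considerably more.

```python
# Sketch: a naive, mapping-free lift of XML into triples.
# The "xml2rdf:" namespace and "_:nN" blank nodes are illustrative.
import xml.etree.ElementTree as ET

def xml_to_triples(xml_text, ns="xml2rdf:"):
    root = ET.fromstring(xml_text)
    triples = []
    counter = [0]
    def lift(elem):
        counter[0] += 1
        node = f"_:n{counter[0]}"           # fresh blank node per element
        triples.append((node, ns + "tag", elem.tag))
        for k, v in elem.attrib.items():    # attributes become predicates
            triples.append((node, ns + k, v))
        if elem.text and elem.text.strip(): # text content becomes a literal
            triples.append((node, ns + "text", elem.text.strip()))
        for child in elem:                  # nesting becomes a child link
            triples.append((node, ns + "child", lift(child)))
        return node
    lift(root)
    return triples

doc = "<person id='42'><name>Marco</name></person>"
for t in xml_to_triples(doc):
    print(t)
```

Everything the XML said is now in RDF; deciding that xml2rdf:name should map to foaf:name is a separate, later step - itself expressed in RDF.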

I am always surprised at how often people find this approach novel and need time to understand what is going on. I guess this is because RDF is so flexible - it is quite easy to represent any data structure in RDF. And because both data and models are represented in RDF, once imported, structural transformations are very straightforward. RDF is built for change. Other data models do not have this advantage; hence, they require mappings before importing external data.

The presentation below explains in detail how RDF import and transformations are done including a step by step example. The benefits of the approach are also discussed.

Monday, July 6, 2009

Presentations from the Second TopBraid User Group Meeting - SemTech2009

The Second TopBraid Suite Open User Group Meeting was conducted at the Semantic Technology Conference 2009 in San Jose, CA, Thursday, June 18, 2009, 09:45 AM - 12:45 PM. The agenda for the meeting and links to the presentations used by the speakers are given below.

User Group Meeting Agenda and Presentations
•9:45am - 9:55am
Welcome (Robert Coyne, TopQuadrant)

•9:55am - 10:30am
Keynote User Talk 1:
"Using SPARQLMotion to Execute Task Networks among Distributed Cyber Physical Systems",
Cyber-physical systems define networks of interactive sensors and actuators, grounded in the physical world. Such systems require a high degree of interoperation to achieve the system's objective. SPARQLMotion, a model-driven scripting language, has been used to achieve that level of interoperation. In addition, this approach results in greater operational redundancy among networks through distributed control. Three recent extensions to SPARQLMotion will be shared along with a motivating example for their use.
John T. Carson, Software Engineer, Lockheed Martin Aeronautics

•10:30am - 11:00am
TopBraid - New Capabilities - Sampler 1
Presentation/demo of key aspects of TopQuadrant's Enterprise Vocabulary Management (EVM) Solution Package (just announced). Customers across a range of industries are building EVM solutions on top of the TopBraid Suite platform. In response to customer requirements, TopQuadrant is offering an EVM Solution Package of commonly needed, high-value components including models, scripts (e.g. for workflow management with approvals), and application configuration templates.
Irene Polikoff, CEO and Co-founder, TopQuadrant, Inc.

•11:00am - 11:35am
Keynote User Talk 2:
"Managing Your Online Social Graph with TopBraid Composer"
In this short user experience session Marco will show you how he makes use of TopBraid Composer to keep track of the Semantic Web Meetup community and how he manages events, RSVPs and security. In addition we will take a look at some of the editing and reporting features readily available and built into TopBraid Composer to visualize community data for evaluation and the identification of trends.
Marco Neumann, Information Scientist and CEO & Founder, KONA

•11:35am - 12:05pm
TopBraid - New Capabilities - Sampler 2
For QA of ontologies, many users are interested in knowing key statistics or metrics regarding their models, such as the number of properties that reference each class. A simple, flexible, convenient-to-use solution will be demonstrated within TopBraid. By importing a special ontology and running the provided scripts, the desired metrics fields get populated, and a reporting script is used to pretty-print the results to HTML.
Ralph Hodgson, CTO and Co-founder, TopQuadrant, Inc.

•12:05pm - 12:40pm
User Feedback Session (conducted through lunch),
moderated by Tom Fitzgerald, Director of Sales, TopQuadrant
Notes from the session:
1. Education – tremendous demand for online training materials. Customers recommend we provide more training material (tutorials, videos, examples). Training is a key to success for many TopBraid customers.
2. Help Menu – context search and more examples – need a “Getting Started” tutorial. The tutorial would include how to use the Help Menu effectively.
3. User’s Forum – “great asset” – would like to see it expanded to include a wiki format and possibly all of our products. Discussion of how the forum can be expanded to provide more extensive support, e.g., examples of customer applications.
4. Site Spin – Tim Smith – new capability
5. Mind Mapping integration – Tim Smith talked about the value of providing an easy tool for business users to map out model requirements. The mapping could then be integrated into TBC.
6. Customer Use Cases – request to expand the website to include a tab for Use Cases and a customer scroll

TopQuadrant appreciated and valued this input and is responding quickly, in particular by making the extensive Help provided in TopBraid Composer more accessible and by providing more support assets through our web site. See links on the TopBraid Composer page, Support and other product pages, and this recent post for further details and links. Additionally, two new slide sets are being provided for download that give an extensive tour of TopBraid Composer capabilities and in-depth details on importing data into RDF and transforming data with TopBraid utilities and power tools.

•12:40pm - 12:45pm
Announcements and Closing Remarks
Robert Coyne, TopQuadrant

(See also Reflections from the First TopBraid Suite User Group Meeting, May 21, 2008, San Jose, CA)

Saturday, July 4, 2009

The Meaning of "semantic" - post II

After writing this earlier post, I've decided to expand on my thoughts about the intersection between the Semantic Web and text mining, natural language processing, etc.

The write-up was ready just in time for TopQuadrant's submission to the monthly Semantic Universe column, so I've ended up publishing it there, but here is a link.

RDFa - a good way to provide access to your data?

I have been thinking about RDFa recently. With the announcement from Google and continued support from Yahoo/Search Monkey there is an increased buzz around RDFa. So, why RDFa and what is it good for?

TopBraid has had support for RDFa for as long as I can remember – at least two years now. A user can point to a page with RDFa markup and TopBraid will import it. I remember getting excited about this and wanting to mark up all our web pages with RDFa. This did not happen, at least partially because RDFa’s interaction with HTML formatting tags is pretty funky – the pages become harder to maintain. Then there was also a persistent question of why do it at all. If one wants to provide data in RDF, why not do exactly that?

Each web page on a site could have a corresponding N3 page. There is a standard tag in HTML that can be used to refer to related information. It can be used to point to the N3 page, and/or the naming convention could be the same as for the given HTML page, but with the N3 extension. In TopQuadrant’s case this would be the only alternative solution, since the information on our web site is not in a database (at least not yet; this is changing). If it were in a database, then the way to go would be to provide a SPARQL endpoint.
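The setup described above could be sketched as follows: one generator emits both the HTML page, whose head carries a link tag pointing at the N3 file, and the N3 file itself. The talk record, URIs and filenames are hypothetical, invented for the example.

```python
# Sketch: generate an HTML page plus a companion N3 file from one record.
# The record, URI and filenames are hypothetical.

record = {
    "uri": "<http://example.org/talks/rdfa-api>",
    "title": "RDFa: Now everyone can have an API",
    "creator": "Mark Birbeck",
}

def to_html(r):
    """HTML page whose head links to the companion N3 file."""
    return (f"<html><head>"
            f"<link rel=\"alternate\" type=\"text/n3\" href=\"talk.n3\"/>"
            f"</head><body><h1>{r['title']}</h1><p>{r['creator']}</p>"
            f"</body></html>")

def to_n3(r):
    """The same data as N3/Turtle, with an explicit subject URI."""
    return (f"@prefix dc: <http://purl.org/dc/elements/1.1/> .\n"
            f"{r['uri']} dc:title \"{r['title']}\" ;\n"
            f"  dc:creator \"{r['creator']}\" .\n")

print(to_html(record))
print(to_n3(record))
```

Because both files come from the same record, they cannot drift apart, and the N3 version gets a real subject URI rather than one a crawler has to invent.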

I looked at the RDFa presentation by Mark Birbeck at the Semantic Technologies conference. I did not get a chance to attend – 7:30 AM is way too early for me – but I browsed through the slides. Here is an example of RDFa markup (from the presentation):

This says that there is a dc:creator relationship between the header “RDFa: Now everyone can have an API” and a string “Mark Birbeck”.

Good, but we have not given a URI to the thing we are talking about – a presentation entitled “RDFa: Now everyone can have an API”.

The absence of a URI makes it somewhat hard to talk about the presentation. Any RDFa crawler/importer would have to generate some kind of URI for it. If we used a URI to begin with, we could have simply put the triple {:RDFa_presentation dc:creator “Mark Birbeck” } into an RDF file.

One issue may be maintenance – having two files to maintain. But embedding RDFa into HTML arguably creates even worse maintenance problems. And if the RDFa markup is automatically generated (most serious publishing happens by generation, not hand-crafting), then the maintenance issue is not there – it is easier to generate an RDF file in addition to an HTML file than it is to generate and insert markup. Not to mention that automatic generation means there is a database that could be exposed through SPARQL.

There must be something I am missing here. While I could not attend Mark Birbeck’s presentation, I just discovered he is giving a webinar on July 12th. I think I will sign up and see if some of my questions get answered.

I’ll report what I learn here, so stay tuned.

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.