Monday, August 16, 2010

How to: Find SKOS constraint violations in AGROVOC with SPARQL Rules

The Simple Knowledge Organization System (SKOS) vocabulary management specification is gaining popularity because, as a standard, it makes it easier to share taxonomies and thesaurii between different systems. It also guards investments in vocabulary development against the potential problems of dependence on a proprietary vendor format.

The W3C makes an OWL ontology for SKOS available, which makes it easier to ensure that your vocabulary conforms to the standard. As a comment near the beginning of it tells us, though,

A number of semantic conditions are *not* expressed formally in this schema. These are:

S12
S13
S14
S27
S36
S46

For the conditions listed above, rdfs:comments are used to indicate the conditions.

The comment for S13 says that "skos:prefLabel, skos:altLabel and skos:hiddenLabel are pairwise disjoint properties". In plain English, this means that for a given concept, you can't use the same term for any two of these properties. For example, you shouldn't say that "dog" is both the preferred label and the alternate label for a given concept—it should be one or the other.

If these constraints are in the ontology as comments and not as something that can be implemented by executable code, how do you find violations of these constraints? The simplest way I've found is to use SPARQL Constraints (with SPIN). We've built on Paul Hermans' work to implement these constraints in the ontology at http://topbraid.org/spin/skosspin. It imports the W3C SKOS ontology and adds one rule to the skos:OrderedCollection class for constraint S36 and rules for the other five constraints to the skos:Concept class. For example, it adds the following three rules for constraint S13:

# Constraint S13: skos:prefLabel, skos:altLabel and skos:hiddenLabel 
# are pairwise disjoint properties.
ASK WHERE {
?this skos:prefLabel ?label .
?this skos:altLabel ?label .
}

ASK WHERE {
?this skos:prefLabel ?label .
?this skos:hiddenLabel ?label .
}

ASK WHERE {
?this skos:hiddenLabel ?label .
?this skos:altLabel ?label .
}

If any of these returns a boolean true, then we know that constraint S13 has been violated. (If you look at the skosspin ontology, you'll see these queries represented as triples, which is more difficult to read but easier to implement than rules expressed as SPARQL queries as shown above. TopBraid Composer can convert between the two formats, and so can a SPARQL Text to SPIN RDF Syntax Converter that Holger Knublauch has made available on the web.)

I tested this with the Food and Agriculture Organization of the United Nation's popular AGROVOC thesaurus, a vocabulary "designed to cover the terminology of all subject fields in agriculture, forestry, fisheries, food and related domains", and found over 1600 violations of constraint S13. Because this thesaurus has almost 29,000 concepts and preferred and alternate labels in multiple languages for most concepts, it's easier to violate these constraints than you might think, and I never would have found them without the ability to automate this search. For example, concept http://www.fao.org/aos/agrovoc#c_1135 has an English preferred label of "Buds", 14 preferred labels for other languages such as Farsi and Thai, and 24 alternate labels. Among these, the Slovak skos:prefLabel value and the Sloval skos:altLabel value are both "púèiky", so this concept violates constraint S13.

How do we find the violations? I tried it with the free edition of TopBraid Composer, because it has everything you need to define and use SPARQL Rules. (TopBraid Composer Maestro Edition's ability to use these rules from within applications has made it possible for me to add several nice features to applications for some of our clients.)

The screenshot below of TopBraid Composer's Navigator and Imports views shows that I created a checkAGROVOC project and added a checkAGROVOC.n3 ontology file to it. This ontology only does two things: it imports the ag_skos_080422.rdf file that I downloaded from the fao.org web site and it imports the skosspin ontology described above for its SPARQL Rules. (It imports a web version of the skosspin ontology and, because of its 62 meg size, the local copy of the actual AGROVOC thesaurus.)

With these two files imported, I opened TopBraid Composer's Problems view and clicked that view's "Refresh all problems of current TopBraid file" icon . A "Progress Information" message box told me that TopBraid Composer was "Checking SPIN constraints on skos:Concept", which took a few minutes because there were plenty to check.

After it finished checking, the Problems view said that there were Warnings and had a plus sign that I expanded to see the first few constraint violations:

(You may want to play with the column widths a bit, because the Location column is the one you really want to see.) Double-clicking anywhere on a specific warning line shows the details about that concept on the Resource Form, like this:

If you don't see the little yellow warning symbols on the Resource Form that show where the problems are, click the little "Display constraint violation warnings" icon at the top of TopBraid Composer . In this case, a bit of scrolling down when viewing concept c_1002 shows that the same term appears as both Hungarian skos:altLabel and skos:prefLabel property values for this concept.

The Problems view above only shows the first 100 of the 2,972 warnings. Clicking the context menu white triangle in the view's upper-right lets you configure Preferences for the view so that you can reset the number of items to be displayed; I had no problem with a figure of 3,000.

Now that we know what constraint has been violated, there are other ways to list the concepts that need to be corrected. For example, the following query in TopBraid Composer's SPARQL view lists the identifiers that have the same label for these two properties:

SELECT ?s ?label
WHERE {
?s skos:prefLabel ?label .
?s skos:altLabel ?label .
}

Once you execute this query you can export its results to a file and then use that as a reference point to address the issues in the vocabulary.

We could have started off by executing this query on the AGROVOC SKOS file, but remember, at the time we didn't know which constraints had been violated. Using SPARQL Rules as extra metadata for class definitions helps to automate the identification of quality issues with the data, letting us use other techniques to focus on the specific problems and how to fix them.

Are your SKOS vocabularies violating any of the six extra constraints described in the SKOS specification? As I mentioned, this all works with the free version of TopBraid Composer, which is available for Windows, Mac, and Linux, so you can try it yourself to find out. With the TopBraid Composer Maestro edition, you can build applications for end users who can then maintain these vocabularies with a web-based interface instead of using TopBraid Composer. The user interface for notification of constraint violations then becomes one of the many things you can customize to the needs of your end users. You can also define new constraints around your own shop's business rules—for example, to require that all labels begin with an upper-case letter—and you can set TopBraid Composer or your application to highlight these violations as soon as they occur, instead of checking in batch mode like I did above.

To summarize, what we've seen here is really just a starting point, and there are all kinds of places where you can take it to improve the consistency and value of your vocabularies.

1 comments:

freelance writing said...

great site really informative i learned a lot great stuff keep it up

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.