Thursday, July 26, 2012

Who needs SKOS-XL? Maybe no one

The SKOS-XL extension to the W3C’s SKOS standard for vocabulary management adds flexibility in how you track concept names, but it adds complexity and potential confusion that are rarely, if ever, worth it. 

What is the appeal of SKOS-XL? Information modelers wanting to separate concepts (so-called conceptual ideas) from terms (the names people use for concepts) often base their thinking on the model of Semiotic Triangle http://en.wikipedia.org/wiki/Triangle_of_reference or Peirce’s Triangle. Sometimes also called a triangle of meaning, this philosophy distinguishes a concept that exists in a human mind—a thought—from how it is referred to and from a symbol that evokes it. 



A referent is understood as a word. A symbol is typically explained as a pictorial depiction. A key aspect of this theory is its focus on human cognition. It postulates that there can be no name or identity intrinsic to a concept as it only exists as a thought in a human mind.

The challenge of applying this thinking to information modeling is that, ultimately, in information modeling we must commit everything to paper, electronic or otherwise. Thus, every concept must have an identity and a name. As a result, a separate model for concepts and terms where terms themselves have identity, names, relationships and are tracked separately from concepts is typically an over-complication that does not deliver practical value. For one thing, even explaining to a business audience a difference between a concept and a term is not simple. Colloquially, these words are often used interchangeably. Once explained, distinguishing and keeping track of these on an ongoing basis, when both “concepts” and “terms” are more often than not named using the same words, can be mind boggling. 

SKOS takes a simpler and what we believe to be a more practical approach to information modeling. It provides a way to describe concepts by giving each one:
  •  A globally unique identity
  • A preferred label that is unique for a human language (such as English or German) within a scope of a particular “concept scheme”. It is called skos:prefLabel.
  • Any number of alternative labels called skos:altLabel. Concept's alternative labels in a given language should not be the same as its preferred label in this language.
  • Whatever other properties (attributes and relationships) are deemed necessary:
  • SKOS supplies some standard relationships such as skos:broader, skos:related and skos:exactMatch and a number of annotations that are thought to be universally useful such as skos:definition and skos:editorialNote.
  • Users of SKOS are free to add properties specific to their domain. For example, when using SKOS to describe different companies, a user may want to add a stock ticker field.

If needed, metadata about labels can be captured without giving them identity of their own. TopBraid EVN is a good example of a tool that offers this capability. Besides the language part, such metadata is typically not just about the label itself, but about its relationship to the concept—for example, who said that this is a preferred label for this concept and when. All the relationships are between concepts, not between the labels.

The W3C has published an optional extension to SKOS called SKOS-XL (SKOS eXtension for Labels) that accommodates those who want to give separate identity to concepts and terms. It does not use the word “term”—presumably, because informally terms are often understood as concepts and vice versa. Instead it introduces a class Label explained as a “lexical entity”. While the extension is small with only one new class and five new properties, its implications are far-reaching. As a result, providing tool support for SKOS-XL is considerably more complex than for SKOS proper. 

Following the SKOS-XL model, labels are not strings as in SKOS proper, but RDF resources with their own identity. Each label can have only one literal form; this is where the actual text string (the name) goes. The literal form is not one per Label per language as with SKOS’s constraint for assigning preferred labels, but one per Label. So, to accommodate different languages, different label resources must be created. At the same time, there can be multiple Label resources with the same literal form (for example, two different Label resources with the literal form “Mouse”). Even a simple SKOS-XL vocabulary is considerably bulkier than its SKOS alternative. Since SKOS-XL format takes far more more space, storage, import/export and performance of search and query can become an issue for larger vocabularies.

Concepts are connected to Labels by relationships that indicate preferred (skosxl:prefLabel) and alternative Labels (skosxl:altLabel) for a Concept. There is no cardinality restrictions on these relationships–that is, a Concept can be linked to multiple Labels using skosxl:prefLabel link. Labels can be linked to each other using skosxl:labelRelation relationship. These links are separate from the relationships between Concepts.

Direct use of SKOS properties that associate label strings with Concepts can be tricky when using SKOS-XL. According to SKOS-XL label strings for Concepts are derived using rules such as:

The property chain (skosxl:prefLabel, skosxl:literalForm) is a sub-property of skos:prefLabel

This means that if there is a Label ex:Label1 with literal form “love” and a Concept ex:Concept1 where ex:Concept1 connects to ex:Label1 using a skosxl:prefLabel relationship, we can conclude that ex:Concept1 has a skos:prefLabel  value of “love”.  Since simultaneously keeping the integrity of directly entered and inferred values is problematic, any tool supporting SKOS-XL must protect the user from directly entering label strings for Concepts. This makes it difficult to use the same tool to edit for SKOS and SKOS-XL vocabularies, especially if users want to intermix different vocabulary formats. 

Furthermore, a user will see the same text label for different entities. This will be not only because different Labels can have the same literal forms, but also because the Concept resources “inherit” string labels from the associated Label resources. This can easily lead to confusing results.

There are also various integrity clashes between SKOS and SKOS-XL. For example:

1.       Two different preferred labels in the same language

ex:Concept1skosxl:prefLabelex:Label1; skosxl:prefLabelex:Label2.
ex:Label1 skosxl:literalForm "love"@en .
ex:Label2 skosxl:literalForm "adoration"@en .

This is not “wrong” according to SKOS-XL because a Concept can be connected to multiple Labels using the skosxl:prefLabel relationship. But, it means that ex:Concept1has skos:prefLabel values of both "love"@en and "adoration"@en. This is a violation of SKOS constraint S14, which prohibits a concept from having more than one preferred label string in a given language.
2.        
      Clash between preferred and alternative labels

ex:Concept1skosxl:prefLabel ex:Label1; skosxl:altLabel ex:Label2
ex:Label1 skosxl:literalForm "love"@en .
ex:Label2 skosxl:literalForm "love"@en .

Again, this is not “wrong” according to SKOS-XL because different Labels can have the same literal form, but it’s a problem for SKOS because it implies identical English language preferred and alternative label strings for ex:Concept1.

Without a doubt, these issues play a role in the fact that while the use and tool support for SKOS is growing, there are few if any tools for SKOS-XL or published SKOS-XL vocabularies. An even more important factor is the lack of compelling business value that would justify SKOS-XL complexity. Having talked to a wide range of users working on business vocabularies, we have yet to hear a use case that cannot be supported by SKOS alone.

Does SKOS-XL look like the only viable approach to your vocabulary management needs? Let’s discuss it—maybe we can help you find a simpler solution.

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.