Thursday, July 26, 2012

Who needs SKOS-XL? Maybe no one

The SKOS-XL extension to the W3C’s SKOS standard for vocabulary management adds flexibility in how you track concept names, but it adds complexity and potential confusion that are rarely, if ever, worth it. 

What is the appeal of SKOS-XL? Information modelers wanting to separate concepts (so-called conceptual ideas) from terms (the names people use for concepts) often base their thinking on the model of Semiotic Triangle or Peirce’s Triangle. Sometimes also called a triangle of meaning, this philosophy distinguishes a concept that exists in a human mind—a thought—from how it is referred to and from a symbol that evokes it. 

A referent is understood as a word. A symbol is typically explained as a pictorial depiction. A key aspect of this theory is its focus on human cognition. It postulates that there can be no name or identity intrinsic to a concept as it only exists as a thought in a human mind.

The challenge of applying this thinking to information modeling is that, ultimately, in information modeling we must commit everything to paper, electronic or otherwise. Thus, every concept must have an identity and a name. As a result, a separate model for concepts and terms where terms themselves have identity, names, relationships and are tracked separately from concepts is typically an over-complication that does not deliver practical value. For one thing, even explaining to a business audience a difference between a concept and a term is not simple. Colloquially, these words are often used interchangeably. Once explained, distinguishing and keeping track of these on an ongoing basis, when both “concepts” and “terms” are more often than not named using the same words, can be mind boggling. 

SKOS takes a simpler and what we believe to be a more practical approach to information modeling. It provides a way to describe concepts by giving each one:
  •  A globally unique identity
  • A preferred label that is unique for a human language (such as English or German) within a scope of a particular “concept scheme”. It is called skos:prefLabel.
  • Any number of alternative labels called skos:altLabel. Concept's alternative labels in a given language should not be the same as its preferred label in this language.
  • Whatever other properties (attributes and relationships) are deemed necessary:
  • SKOS supplies some standard relationships such as skos:broader, skos:related and skos:exactMatch and a number of annotations that are thought to be universally useful such as skos:definition and skos:editorialNote.
  • Users of SKOS are free to add properties specific to their domain. For example, when using SKOS to describe different companies, a user may want to add a stock ticker field.

If needed, metadata about labels can be captured without giving them identity of their own. TopBraid EVN is a good example of a tool that offers this capability. Besides the language part, such metadata is typically not just about the label itself, but about its relationship to the concept—for example, who said that this is a preferred label for this concept and when. All the relationships are between concepts, not between the labels.

The W3C has published an optional extension to SKOS called SKOS-XL (SKOS eXtension for Labels) that accommodates those who want to give separate identity to concepts and terms. It does not use the word “term”—presumably, because informally terms are often understood as concepts and vice versa. Instead it introduces a class Label explained as a “lexical entity”. While the extension is small with only one new class and five new properties, its implications are far-reaching. As a result, providing tool support for SKOS-XL is considerably more complex than for SKOS proper. 

Following the SKOS-XL model, labels are not strings as in SKOS proper, but RDF resources with their own identity. Each label can have only one literal form; this is where the actual text string (the name) goes. The literal form is not one per Label per language as with SKOS’s constraint for assigning preferred labels, but one per Label. So, to accommodate different languages, different label resources must be created. At the same time, there can be multiple Label resources with the same literal form (for example, two different Label resources with the literal form “Mouse”). Even a simple SKOS-XL vocabulary is considerably bulkier than its SKOS alternative. Since SKOS-XL format takes far more more space, storage, import/export and performance of search and query can become an issue for larger vocabularies.

Concepts are connected to Labels by relationships that indicate preferred (skosxl:prefLabel) and alternative Labels (skosxl:altLabel) for a Concept. There is no cardinality restrictions on these relationships–that is, a Concept can be linked to multiple Labels using skosxl:prefLabel link. Labels can be linked to each other using skosxl:labelRelation relationship. These links are separate from the relationships between Concepts.

Direct use of SKOS properties that associate label strings with Concepts can be tricky when using SKOS-XL. According to SKOS-XL label strings for Concepts are derived using rules such as:

The property chain (skosxl:prefLabel, skosxl:literalForm) is a sub-property of skos:prefLabel

This means that if there is a Label ex:Label1 with literal form “love” and a Concept ex:Concept1 where ex:Concept1 connects to ex:Label1 using a skosxl:prefLabel relationship, we can conclude that ex:Concept1 has a skos:prefLabel  value of “love”.  Since simultaneously keeping the integrity of directly entered and inferred values is problematic, any tool supporting SKOS-XL must protect the user from directly entering label strings for Concepts. This makes it difficult to use the same tool to edit for SKOS and SKOS-XL vocabularies, especially if users want to intermix different vocabulary formats. 

Furthermore, a user will see the same text label for different entities. This will be not only because different Labels can have the same literal forms, but also because the Concept resources “inherit” string labels from the associated Label resources. This can easily lead to confusing results.

There are also various integrity clashes between SKOS and SKOS-XL. For example:

1.       Two different preferred labels in the same language

ex:Concept1skosxl:prefLabelex:Label1; skosxl:prefLabelex:Label2.
ex:Label1 skosxl:literalForm "love"@en .
ex:Label2 skosxl:literalForm "adoration"@en .

This is not “wrong” according to SKOS-XL because a Concept can be connected to multiple Labels using the skosxl:prefLabel relationship. But, it means that ex:Concept1has skos:prefLabel values of both "love"@en and "adoration"@en. This is a violation of SKOS constraint S14, which prohibits a concept from having more than one preferred label string in a given language.
      Clash between preferred and alternative labels

ex:Concept1skosxl:prefLabel ex:Label1; skosxl:altLabel ex:Label2
ex:Label1 skosxl:literalForm "love"@en .
ex:Label2 skosxl:literalForm "love"@en .

Again, this is not “wrong” according to SKOS-XL because different Labels can have the same literal form, but it’s a problem for SKOS because it implies identical English language preferred and alternative label strings for ex:Concept1.

Without a doubt, these issues play a role in the fact that while the use and tool support for SKOS is growing, there are few if any tools for SKOS-XL or published SKOS-XL vocabularies. An even more important factor is the lack of compelling business value that would justify SKOS-XL complexity. Having talked to a wide range of users working on business vocabularies, we have yet to hear a use case that cannot be supported by SKOS alone.

Does SKOS-XL look like the only viable approach to your vocabulary management needs? Let’s discuss it—maybe we can help you find a simpler solution.

Tuesday, May 29, 2012

New white paper: Controlled vocabularies, taxonomies, and thesauruses (and ontologies)

What's the difference between a controlled vocabulary, a taxonomy, a thesaurus, and an ontology? We've found people using some of these terms interchangeably, and it's difficult to find reliable official definitions for each.
We've come up with some unofficial working definitions that have served us well when discussing different kinds of controlled vocabularies and their associated metadata. Our new white paper Controlled vocabularies, taxonomies, and thesauruses (and ontologies) describes each of these and which TopQuadrant products give customers the control they need over the kinds of vocabularies and metadata that they're working with.
We hope that this short white paper gives you a better idea of why people build, use, and store controlled vocabularies and the advantages that standards bring to this work.

Roget's Thesaurus

Sunday, May 13, 2012

Data cathedrals versus information bazaars?

Enterprises create data cathedrals with an enforced dogma to control data purity, causing much information to be outside its walls where informal information bazaars thrive. These information bazaars have suspect quality, uncertain provenance, yet are responsive to users’ needs. Metcalf's law suggests that the benefit gained from integrated information grows geometrically1 with the number of data communities that are integrated. How can we balance the dogma of the data cathedrals and the spontaneity of the information bazaar?

Enterprise's database cathedrals reflect corporate dogma. Nothing gets changed without approval from high. Change is very slow. New databases orders get integrated only after a considerably long time assuming that the new data is 100% squeaky clean. So there are a lot of databases that are entirely outside the database cathedrals' walls. Badly behaved sources of data might even be excommunicated.
Where does the other data go? It is not as though this other data does not exist, although many would like to pretend it to be so. Instead they are all in the information bazaar. Anyone with any information can set up their own information stall, and store their own data in Excel, Access, anywhere they want. They only specialize in their own data for their own use. This data is pretty good because that is all they need for their business. They share well with others but on a barter basis. In fact the information bazaar is chaotic, but lively, always changing to users’ demands, and a fun place to be. 

Why do we have the conflict between the database cathedral and the information bazaars?

The data cathedral offers security, quality, and good provenance. It provides the system of record for users who then should have complete confidence in their decision making. It does this using accurate relational models capturing enterprise information. But a relational model is designed by the cathedral hierarchy based on the closed model: only pure data can be entered into the database; impure data can lead to excommunication. 
The information bazaar has few rules of entry. As demonstrated by the web, it allows anyone to say anything about anything (AAA). Even with this deficiency we will regularly search the web to help us with our decision making, not exploring sources that are suspect, and filtering information that we feel lacks accuracy until we end up with information to support our decision.

Can we resolve these conflicting objectives?

Can we expect the cathedral hierarchy to relax its admittance criteria to let in as much of the information bazaar as possible? Somewhat, but we cannot expect miracles.
Can we expect the information bazaar to become more sober and responsible so that it can securely provide information with guaranteed quality and provenance? Somewhat, but we cannot expect an evangelical conversion?
Really this is not optimal, because the benefit of having data integrated grows geometrically with the number of interconnected sources, yet the database cathedral cannot grow because the information bazaar does not meet their purity dogma.

So how can these conflicting objectives be redeemed?

One path to redemption is to unite the information bazaar through a common semantic model. This allows all information to be available within a universal graph (model). Of course some riff-raff will get in, but again that is an advantage for the semantic model as you can also declare rules that will verify the accuracy of the data even though it is already stored. 
At the same time the data cathedral can continue to expand, hopefully at faster pace, by integrating those graphs that meet their criteria. 
However we allow users to access both the data cathedral, from where they can obtain the system of record, and information bazaar. We could even report results federating form the two data-sources annotating that information from the information bazaar with its provenance and hence less certain data quality. Doing this in a standards compliant way turns existing enterprise information resources into connectable, responsive and interoperable semantic assets.


Using this approach we don’t need to force the data cathedral to relax its dogma, nor do we ask the information bazaar to shut down. Yet we can offer users access to 99% of the enterprise information providing users the 'Metcalf'1 benefits of full integration. As semantic assets grow and connect, they enable a resilient semantic ecosystem of meaningful interactions between people, applications and data irrespective of the differences in structures, data schemas, governance and technologies. The dividing boundaries between the cathedral and the bazaar no longer need to be obstacles to information users. Semantic ecosystem seamlessly embraces and provides integrated access to data cathedrals and information bazaars alike.

1 If I have 10 database systems running my business that are entirely disconnected, then the benefits are 10 * K, some constant. If I integrate these databases in pairs (operations + accounting, accounting + payroll, etc), then the benefits increase to 10 * K * 2. If I integrate in threes, (operations + accounting + maintenance, accounting + payroll + receiving, etc), then the benefits increase four-fold (a corollary of Metcalf's law) to 10 * K * 4. For quad-wise integration my benefits would be 10 * K * 8 and so on. Now it might not be 8 fold but the point is there is a geometric, not linear, growth in benefits as I integrate all of my information across my organization.

Monday, April 30, 2012

Can semantic technology melt process industry’s icebergs of information?

Icebergs of information loiter throughout process manufacturing IT waiting to sink any information integration project. The impact of semantic technologies is being felt in medicine, life sciences, intelligence, and elsewhere but can it solve this problem in process manufacturing? The ability to federate information from multiple data-sources into a schema-less structure, and then deliver that federated information in any format and in accordance with any standard schema uniquely positions semantic technology. Is this a sweet spot for semantic technologies?

Process Manufacturing Application Focus over the Years

Over the years we have been solving problems within process manufacturing IT only to uncover more problems. Once the problem was that of measurement data in silos which was solved by the introduction of real-time data historians. However that created the problems of data visibility, solved by the introduction of graphical user interfaces. This introduced data overload which was partially solved by the introduction of analytical tools to digest the information and produce diagnostics. Unfortunately these tools were difficult to deploy across all assets within an organization, so we have been trying to solve that problem with information models. The current problem is how to convert the diagnostics into actionable knowledge with the use of work-flow engines and ensuring the sustainability of applications as solutions increases in complexity.
Process Manufacturing Application Problems and Solutions over the Years
Measurement data in silos
Data access and visualization
Analysis and business intelligence
Contextualized information
Consistent actioning
Real-time databases collecting measurements
Graphical user interfaces, trending and reporting tools
Analytical tools to digest data into information and diagnostics
Plant data models (ProdML, ISA-95, ISO15926, IEC 61970/61968, Proprietary)
Data but no user access
Data overload
Deployability of analysis to all assets
Interpretation limited to experts
Complexity, much more than RTDB, limiting sustainability
Improved ongoing application benefits

However it is not only the increased technological complexity that is causing problems. Business decisions now cross many more business boundaries. When measurement data was trapped in silos we were content with unit-wide or plant-wide data historians. Now a well performance problem might involve a maintenance engineer located in Houston accessing a Mimosa[1]-based maintenance management system, an operations engineer located in Aberdeen accessing an OPC-UA[2]-based data historian, a production engineer located in London accessing a custom system driven by WITSML[3]-based feeds, and a facilities engineer using an ISO-15926[4] facilities management model. Not only are the participants in different locations and business units, but they also rely on different systems using different models to support their decision making. However they all should be talking about the same well, measured by the same instruments, producing the same flows, and processed by the same equipment.
The problem is that these operational support systems are not simply data silos whose homogeneous data we need to merge into one to answer our questions. In fact these operational support systems are icebergs of information. Above the surface they publish a public perspective focused on the core operational function of the application. However this data needs context, so below the surface is much of the same information that is contained in other systems. This information provides the context to the operational data so that the operational system can perform its required functions. For example the historian needs to know something about the instruments that are the source of its measurements; maintenance management systems need to know not only about the equipment to be maintained but the location of that equipment, physically and organizationally.

Figure 1: Icebergs of Information
Icebergs of information are not limited to the operational data stores deployed in organizations. An essential practice in these days of interoperability requirements is the adoption of model standards. However even these exhibit the same problems as shown by the diagram below. This diagram maps the available standards to its focus within the hydrocarbon supply chain.

Figure 2: Multiple Overlapping Model Standards
Increasing regulatory and competitive demands on the business are forcing decision making to be more timely, and to be more integrated across the traditional business boundaries. However these icebergs are getting in the way of effective decision making.
One way to make any or all of this information available to consumers is to create the bigger iceberg. ‘Simply’ create the relational database schema that covers every past, current, and future business need, and build adapters to populate this database from the operational data stores. Unfortunately this mega-store can only get more complex as it has to keep up with an expanding scope of information required to support the decision making processes.

Figure 3: Integration using the Bigger Iceberg
Alternatively we can keep building data-marts every time someone has a different business query.  However these do not provide the timeliness required to support operational decision making.

The Need for a Babel-Fish

We cannot meet the needs of the business, and solve their decision making needs by having one mega-store because it will never keep up with the changing business requirements. Instead we need a babel-fish (with thanks to the Hitchhikers Guide to the Galaxy).
This babel-fish can consume all of the different operational data in different standards, and translate them into any standard that the end-consumer wants. Thus the babel-fish will need to know that OPC UA's concept 'hasInstrument' has the same meaning as Mimosa's concept of 'Instrumented'. Similarly 10FIC107 from an OPCUA provider is the same as 10-FIC-1-7 from Mimosa.
1.       Information providers (operational data stores) within the business will want to provide information according to their capabilities, but preferably using the standards appropriate for their application. For example measurements should be OPC UA, maintenance should use Mimosa
2.       Information consumers will want to consume information in the form of one or more standards appropriate for their application.

Figure 4: Integration babel-fish

The Semantic/RDF model comes to the rescue

First of all a definition: a semantic model means organizing all data and knowledge as RDF triples {subject, property, object}. Thus {:Peter, :hasAge, 21^^:years}, and {:Pump101, :manufacturedBy, :Rotek} are examples of RDF triples. RDF triples can be persisted in a variety of ways: SQL table, custom organizations, NoSQL, XML files and many more. If we were designing relational database to hold these RDF triples we would only have one ‘table’ so it may appear that we have no schema, in the relational database design-sense when we have key relationships to enforce integrity, and unique indices to enforce uniqueness. However we can add other statements about the data such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}[5]. Used in combination with a reasoner we can infer consequences from these asserted facts, such as :Pump101 is a type of :Pump, and Peter is not a :Pump, despite rumors to the contrary.   These triples can be visualized as the links in a graph with the subject and object being the nodes of the graph, and the property the name of the edge linking these nodes:

Figure 5: RDF Triples as a graph
Over the years, new modeling metaphors have been introduced to solve perceived or actual problems with their predecessors. For example the Relational Model had perceived difficulties associated with reporting, model complexity, flexibility, and data distribution. A semantic model helps solve these problems.

Figure 6: Evolution of Model Metaphors

·         In response to the perceived reporting issues, OLAP techniques were introduced along with the data warehouse. This greatly eased the problem of user-reporting, and data mining. However it did introduce the problem of data duplication.
o   A semantic model can query against a federated model in which information is distributed throughout the original data sources.
·         In response to the perceived complexity issues, various forms of object-orientated modeling were introduced. There is no doubt that it is easier to think of one’s problem in terms of an object model rather than a complex relational or ER model, especially when there are a large number of entities and relations.
o   The semantic model is built around the very simple concept of statements of facts such as {:Peter, :hasAge, 21^^:years}, and {:Pump101, :manufacturedBy, :Rotek} combined with statements that describe the model such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}.
·         The model flexibility problem occurs when, after the model has been designed, the business needs the model to change. In response to this flexibility issue, the choice is to make the original model anticipate all potential uses but then risk complexity, or use an object-relational approach in which it is possible to add new attributes without changing the underlying storage schema.
o   In semantic models these relationships are expressed in triples, using RDFS, SKOS, OWL, etc. Thus RDF is also used as the physical model (in RDF stores, at least).
·         There have been various responses to data distribution.
o   In the relational world there is not much choice other than to replicate the data from heterogeneous data stores using Extract-Transform-Load (ETL) techniques. In the case of homogenous but distributed databases distributed queries are possible, although it does require intimate knowledge of all the schemas in all of the distributed databases.
o   In the object-orientated world we are in a worse situation: it is very difficult to manage a distributed object in which different objects are distributed or attributes are distributed.
The good news is that a semantic approach is the ideal (or even the only) approach that can solve the information integration problem as follows:
1.       Convert to RDF normal form: Convert all source data into RDF. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage
    • There are already standard ways of doing this for any spreadsheet, relational database, XML schema, and more. For example, TopBraid Suite ( provides converters and adaptors for all common data sources. It is relatively easy to create more mappings such as OPCUA. The dynamic adapters act as SPARQLEndpoints[6].
2.       Federated data model: Create 'rules' that map one vocabulary to another.
    • The language of these rules would be RDFS, SKOS and OWL. For example you can declare {OPCUA:hasInstrument, owl:sameAs, Mimosa:Instrumented}. Note that these are simply additional statements expressed in RDF which are then used by a reasoner to infer the consequences such as :FI101 is actually the same as :10FIC101.
    • More sophisticated rules can also be created using directly RDF and SPARQL. For some examples, see SPIN or SPARQL Rules at and
3.       Chameleon data services: Create consumer queries that extract the information from the combined model into the standard required using SPARQL queries.
    • For example even though all instrument data is in OPCUA, a consumer could use a Mimosa interface to fetch this data. The results can then be published as web-services for consumption by external applications using SPARQLMotion (

Figure 7: Federation End-to-End

Let’s look into these steps in detail:

Convert to RDF normal form

Despite the fact that data will be stored in different formats (relational, XML, object, Excel, etc) according to different schemas they can always be converted into RDF triples. Always is a strong word, but it really does work. There are already ways of doing this for any spreadsheet, relational database, XML schema, and more and it is relatively easy to create more mappings such as OPC-UA. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage. For example, TopBraid Suite ( provides converters and adaptors for all common data sources.

Figure 8: Conversion to RDF Normal Form

Federated Data Model

A federated data model allows different graphs (aka databases) to be aggregated by linking the shared objects. This applies to real-time measurements (OPC-UA), maintenance (MIMOSA), production data (ProdML), or any external database. We can visualize this as combining the graphs of the individual operational data stores into a single graph.
Of course there will be vocabulary differences between the different data-sources. For example, in the OPC-UA data-source you might have a property OPCUA:hasInstrument, and in a MIMOSA data-source the equivalent is called Mimosa:Instrumented. So the federated data model incorporates 'rules' that map one vocabulary to another. The language of these rules would be RDFS, SKOS, and OWL. For example, in OWL, you can declare {OPCUA:hasInstrument owl:sameAs Mimosa:Instrumented}. Note that these are simply additional statements expressed as RDF triples which are then used by a reasoner to infer consequences such as :FI101 is actually the same as :10FIC101.
There will also be identity differences between the different data-sources. These can also be handled by additional statements, such as {:TANK#102, owl:sameAs, :TK102 }. This allows a reasoner to infer that the statement {:TK102, :has_price, 83^^:$} also applies to :TANK#102, implying {:TANK#102, :has_price, 83^^:$}.

Figure 9: Information Federated from MulTiple Datasources

Chameleon Data Services

To extract information from the federated information, the best choice is SPARQL, the semantic equivalent of SQL only simpler. Whilst SQL allows one to query the contents of multiple tables within a database, SPARQL matches patterns within the graph. With SQL we need to know in which table each field belongs. With SPARQL we define the graph pattern that we want to match, and the query engine will search throughout the federated graphs to find the matches. In the example illustrated below we do not need to know that the price attribute comes from one data source, whilst the volume comes from another. In fact SPARQL allows even further flexibility. The price attribute for Tank#101 could come from a different data source than the price attribute for Tank#102. This is part of the magic of the semantic technology.
  Figure 10: Graph Pattern matching with SPARQL
SPARQL can be used to directly query the federated graph for reporting purposes, however most consumers of the information will expect to interface to a web-service, with SOAP or REST being the most popular. These services do not have to be programmed. Instead they can be declared using SPARQLMotion ( to produce easily consumed and adaptable web-services. The designer for SPARQLMotion is shown below:

Figure 11: Example SPARQLMotion

Semantic/RDF advantages for the Process Manufacturing

Despite solving a complex data integration problem, Semantic/RDF is inherently simpler. Can there be anything simpler than storing all knowledge as RDF triples?  Despite this simplicity, we do not lose any expressivity.
There is no predefined schema to limit flexibility. However the schema rules, encoded as tables and keys in the relational model, can still be expressed using RDFS, OWL, and SKOS statements.
Deconstructing all information into statements (triples) allows data from distributed sources to be easily merged into a single graph.
Any information model can be reconstructed from the merged graph using SPARQL and presented as web-services (SOAP or REST).

[1] MIMOSA is a not-for-profit trade association dedicated to developing and encouraging the adoption of open information standards for Operations and Maintenance in manufacturing, fleet, and facility environments. MIMOSA's open standards enable collaborative asset lifecycle management in both commercial and military applications.
[2] The Unified Architecture (UA) is THE next generation OPC standard that provides a cohesive, secure and reliable cross platform framework for access to real time and historical data and events. 
[3] WITSML™ (Wellsite Information Transfer Standard Markup Language) is an industry initiative to provide open, non-proprietary, standard interfaces for technology and software that monitor and manage wells, completions and workovers.
[4] ISO 15926 provides integration of life-cycle data for process plants including oil and gas production facilities

[5] I should really be using URIs instead of text labels for subject, property, and objects, but the intent of the semantic model is conveyed more simply if we avoid identifiers like ‘’ and use :Pump#101
[6] SPARQL is a query language for RDF. A SPARQL endpoint is a protocol service that makes it possible to query a data source using SPARQL. The source itself does not need to be in RDF. It can, for example, be a traditional relational database. Later in this article we will describe SPARQL in more detail and show some query examples.

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.