Friday, September 30, 2011

Ontologies and Data Models – are they the same?

Yesterday a question about how ontologies may be different from logical data models was asked by a newcomer on TopBraid Users Forum. As to be expected on the TopBraid Forum, by ontologies he meant specifically ontology models expressed in RDFS/OWL. Because we frequently hear this or similar questions in our trainings, workshops and in conversations with customers, I decided to respond in a blog post instead of writing an e-mail.

Data modeling was invented more than thirty years ago to help with the design of databases, specifically, relational databases. As quoted below, ANSI definition from 1975 differentiated between three data models – conceptual, logical and physical. Data modeling quickly became recognized as a tool for analyzing the semantics of an organization with the respect to the structure and flow of the information used in carrying out organization’s activities. Wikipedia offers the following definition of Data Modeling:

Data modeling is a method used to define and analyze data requirements needed to support the business processes of an organization. The data requirements are recorded as a conceptual data model with associated data definitions. Actual implementation of the conceptual model is called a logical data model.
<…>
In 1975 ANSI described three kinds of data-model instance:
  • Conceptual schema: describes the semantics of a domain (the scope of the model). For example, it may be a model of the interest area of an organization or of an industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationships assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial "language" with a scope that is limited by the scope of the model.
  • Logical schema: describes the structure of some domain of information. This consists of descriptions of (for example) tables, columns, object-oriented classes, and XML tags.
  • Physical schema: describes the physical means used to store data. This is concerned with partitions, CPUs, tablespaces, and the like.
According to ANSI, this approach allows the three perspectives to be relatively independent of each other. Storage technology can change without affecting either the logical or the conceptual model. The table/column structure can change without (necessarily) affecting the conceptual model.
These definitions describe a clear progression from conceptual to logical to physical data models. SInce their origin is in the 70s, they reflect certain technology assumptions than no longer hold true.

When information modeling is done to create a relational database, conceptual model must be different from a logical model because there is no place in a relational database structure to capture, for example, business rules, create subsumtion relationships and describe other key aspects of a conceptual model. This semantic information collected and documented as part of the initial modeling is left behind when modelers and designers move on to define a logical data model. The "left behind" parts are used by software developers as they encode business semantics directly into custom programs.

Logical data model is a subset of a conceptual model that can be expressed using a particular technology. However, there are always some performance considerations that require additional changes to the logical data model before it can be implemented in a relational database. Hence, some of the aspects of a logical model are left behind as it gets translated into a physical data model.

Since an ontology is a model of a domain describing objects that inhabit it, all three types of data models can be thought of as ontologies. They range from the most expressive one that describes business concepts and processes (the conceptual model) to less expressive and progressively moving from describing business semantics to describing physical structures of the data as it is stored in the databases (the logical and physical data model). Physical model can be thought of as an ontology of a particular database. Wikipedia goes on to note
Early phases of many software-development projects emphasize the design of a conceptual data model. Such a design can be detailed into a logical data model. In later stages, this model may be translated into physical data model. However, it is also possible to implement a conceptual model directly.
Semantic Web standards (governed by the W3C, the World Wide Web Consortium) make it possible to implement conceptual models directly. This is possible due to the layered architecture of the Semantic Web technology stack consisting of:
  • RDF – a canonical data model that is like relational data model in its ability to connect related objects and unlike relational data model in that the data objects (or resources in RDF-speak) are highly granular.
The smallest unit of information in RDF is not a table or a row in a table, but individual statements – a single fact about a resource. 

These statements are called RDF triples. For example, “Atlantis decommission-date July, 2011” is a triple where Atlantis is a subject of a triple, decommission date is a predicate of a triple and July, 2011 is an object of a triple. Atlantis and decommission date are RDF resources and July, 2011 is XML literal. Subjects and predicates of a triple are always RDF resources. An object can be either a resource or a literal value. Predicates that connect two resources are relationships or associations in the data modeling speak. Predicates connecting a resource to a literal value are attributes. In RDF they are called respectively object and data properties.

Because RDF model is highly canonical, RDF data is schema-less. There are no constraints that require it to fit into tables or hierarchies. RDF data is simply a network of connected triples. As such, it can be used to represent, if needed, both - table structures and hierarchies. Standard mappings have been defined from relational tables and XML hierarchies into RDF.

Another key differentiating factor of RDF is that it was “born on the web”. Each RDF resource has a globally unique identity, a URI (uniform resource identifier). For example, the URI for Atlantis may be http://www.nasa.gov/shuttle/Atlantis and the URI for a decommission date may be http://www.nasa.gov/lifecycle#decommissionDate . As a result, it is possible to link RDF data over web in a way similar to how documents can be hyperlinked over the web. By web we mean all HTTP based networks including intranets and extranets.

RDF databases store and provide query access to RDF data. Just like there are standard languages for query of relational and XML data, there is a standard for querying RDF. It is called SPARQL. True to the web-native nature of RDF, SPARQL is not only a query language, but also a protocol that makes it possible to access RDF data over HTTP.
  • RDFS (RDF Schema) and OWL (Web Ontology Language) – RDF-based languages for expressing business semantics.
Jointly RDFS and OWL offer ability to define classes or groups of resources that share common characteristics such as Vehicles and Space Shuttles. The richness of RDFS/OWL makes it possible to fully express the meaning of the business concepts. Data models in RDFS/OWL are stored in the same way as the data, in RDF triples. For example, we can have triples stating that Space Shuttle is a Class and it is a sub class of a Vehicle class and that a vehicle can have only one decommission date (cardinality = 1) and its value must be xsd:date. And you can go beyond cardinality and use the Semantic Web standards to represent a variety of business rules.

Since the data and the schema are stored in the same way, it is possible to query schemas the same way data is queried and to combine search criteria about schemas with the search criteria about data. For example, we can create SPARQL queries to ask for all vehicles that have been decommissioned, all subclasses of a vehicle class, all relationships and attributes a vehicle should have and, when returning decommissioned vehicles, to provide only data values for the fields that have cardinality = 1.

The use of RDF means that the modeling constructs and definitions can be linked and connected. Organizations can refer to each other’s business definitions. Models can be modularized and re-used where appropriate. Differences between related, but not identical concepts can be described. All of this can now be done in a standard compliant and interoperable way.

A growing number of standards bodies and communities of interest are publishing RDF/OWL data models for their particular domains. For example:
  • SKOS – provides a way to represent taxonomies and thesauri
  • ISO 15926 – offers a data model for sharing life-cycle data for process plants including oil and gas production facilities
  • Ontology for Media Resources - defines a core set of metadata properties for multimedia resources
  • SIOC - defines information about online communities
  • QUDT - provides models describing measurable quantities, units for measuring different kinds of quantities and the data types used to store and manipulate these objects in software
  • Provenance Vocabulary - defines provenance-related metadata
There is much more that can be added to this post including a discussion on the best practices for ontology modeling, ontology architecture, approaches for connecting and mapping models, using rules and constraints, publishing, versioning and governing models. Each of these topics, however, deserves an exploration in its own right.

I will end by pointing to a few relevant related blogs and web pages we have published before:

Wednesday, September 28, 2011

Living in the XML and OWL World - Comprehensive Transformations of XML Schemas and XML data to RDF/OWL

Many enterprise information models are expressed using XML Schemas. Data between applications is commonly exchanged in XML, compliant with those schemas. Connecting XML data from different systems in a coherent aggregated way is a challenge that confronts many organizations. Capabilities of RDF/OWL to describe semantics of different data models and aggregate disparate data are a natural fit for addressing these challenges.

For a number of years now, TopBraid Composer included the ability to convert XSDs and associated XML files to RDF/OWL. However, for some XML Schemas our converter did not work as well as customers needed. For the upcoming TopBraid Composer 3.6.0 release, it was significantly improved to have a more comprehensive coverage of XSD constructs and more meaningful conversion to OWL. In parallel, we improved our XML data conversion to RDF so that transformations automatically happen based on the generated OWL models. And we have improved performance of the transformations.

An overview of the approach is illustrated in the following figure:

Approach

Since, the conversion occurs automatically, users do not have to worry about writing any rules for commonly needed mappings. However, those users that need to make further transformations can use SPARQL Rules and SPARQLMotion to customize their generated OWL ontology or further transform RDF triples representing the XML data.

The content of this blog is organized as follows:

XML Schemas converted as part of our tests

We tested the importer on a broad range of complicated and large-scale industry standard XSD files, and converted many XML instances with impressive results. The XSDs we have tested with the new importer include:


  1. Banking
    • FpML, the Financial products Markup Language
    • ISO 20022, a standard for Universal financial industry message scheme

  2. Energy and Utilities
    • MultiSpeak, de-facto standard for defining data needed to be exchanged between software applications in order to support the business processes commonly applied at utilities

  3. Government
    • DoDAF, the Department of Defense Architecture Framework
    • NIEM, the U.S. National Information Exchange Model

  4. Oil and Gas
    • ISO 15926, a standard for integration of life-cycle data for process plants including oil and gas production facilities
    • WITSML, Wellsite Information Transfer Standard Markup Language

  5. Healthcare

  6. Electronics
    • IP-XACT, the XML Schema for meta-data documenting Intellectual Property (IP) used in the development, implementation and verification of electronic systems

  7. Other
    • ATML, the Auto-Test Markup Language

Some of the converted schemas will be published at LinkedModels.org. To get an early access to converted models or for any other questions, contact us at TopQuadrant.

The examples we use in this blog are mainly from the Financial products Markup Language (FpML). All FpML 5.2 XSD and XML instance files were tested. An example is transparency standard under FpML, for which the transparency.ttl ontology was generated from 23 XSD files.


Some challenges in converting XML Schemas to OWL


Some of the challenges in converting XSD to OWL that were addressed are:

  1. Transforming of anonymous types
  2. Converting complex types with simple contents

  3. Resolving conflicting nested element and attribute names during OWL property generation

  4. When and how to distinquish global elements from complex types with similar names during OWL class generation

  5. Generating enumerations

  6. Handling substitution groups both at the XSD and XML levels

  7. Handling the overriding of an XSD type with xsi:type in XML

The example that follows shows the approaches that we have used for the transformation.

Illustrative example of transformation rules


The basic transform for a Complex Type in XSD follows these rules:


  1. An OWL class is generated for a complex type.

  2. The URI of the class is generated in three different ways. If the complex type is global and named, then the name attribute is used. If the complex type is local and named, then the name attribute of the owner element is used. If the complex type is anonymous, then the names of its owner element and its parent element are used.

  3. The xsd:annotation and attribute annotations describing the complex type get generated as dc:description, rdfs:comment and/or skos:definition OWL annotations.

  4. Nested or reference children elements of the complex type become OWL allValuesFrom restrictions on the class. If the element has a simple type, then a restriction with an OWL datatype property and an XSD datatype range or a user-defined RDFS datatype range is generated. If the element has a complex type, then a restriction with an OWL object property and an OWL class range is generated. The minOccurs and maxOccurs values become OWL cardinality restrictions.

  5. Element group and attribute group references are generated as super classes.

  6. Attributes become restrictions with an OWL datatype property and an XSD datatype range or a user-defined RDFS datatype range.

An example of a Complex Type, Trade, in fpml-doc-5-2.xsd of transparency standard is displayed below:


<xsd:complexType name="Trade">
<xsd:annotation>
<xsd:documentation xml:lang="en">
A type defining an FpML trade.</xsd:documentation>
</xsd:annotation>
<xsd:sequence>
<xsd:element name="tradeHeader" type="TradeHeader">
<xsd:annotation>
<xsd:documentation xml:lang="en">
The information on the trade which is not
product specific, e.g. trade date.
</xsd:documentation>
</xsd:annotation>
</xsd:element>
<xsd:group ref="TradeEconomics.model">
<xsd:annotation>
<xsd:documentation xml:lang="en">
The economics of the trade. In the case of an
OTC trade, this is the OTC derivative product.
In the case of a trade of a security,
it is the instrument trade economoics.
</xsd:documentation>
</xsd:annotation>
</xsd:group>
</xsd:sequence>
<xsd:attribute name="id" type="xsd:ID" />
</xsd:complexType>

The following is the graph of the OWL class generated for Trade complex type, which shows the OWL class, restrictions, annotations and superclass.


Trade Graph

The following class diagram shows a more sophisticated view of Trade and its related classes downstream in the generated ontology (click on the diagram to open a window with a bigger image).


Trade Class Diagram

The diagram highlights these advanced features in generation:



  1. A superclass relation exists between Trade, generated from an XSD complex type and TradeEconomics.model, generated from an XSD element group.

  2. In the XSD, Swap element has the substitutionGroup Product element. Thus, A_Global-Swap becomes a subclass of A_Global-Product. A_Global- prefix is used to distinguish the element-derived classes from similarly named complex-type-derived classes.

  3. dtype:value restrictions are generated to hold the simple contents occuring in complex types. The complex content part of the type become other restrictions.

  4. The generated object properties have a Ref suffix to distinguish them from datatype properties with same names. Both types of properties can be used in restrictions on different classes as they may be generated from nested or reference children elements under different complex types.

The instance file, "msg_ex001_new_trade.xml" was imported into the transparency ontology. Here is a peek into that XML file:


...
<trade>
<tradeHeader>
<partyTradeIdentifier>
<tradeId tradeIdScheme=
"http://fpml.org/universal_swap_id">123</tradeId>
<tradeId tradeIdScheme=]
"http://fpml.org/submitter_trade_id">456</tradeId>
</partyTradeIdentifier>
<tradeInformation>
...
<cleared>true</cleared>
<nonStandardTerms>false</nonStandardTerms>
<offMarketPrice>false</offMarketPrice>
<largeSizeTrade>false</largeSizeTrade>
...
</tradeInformation>
<tradeDate>2011-02-04</tradeDate>
</tradeHeader>
<swap>
<productType>InterestRateSwap</productType>
<assetClass>InterestRates</assetClass>
<swapStream>
...
</swapStream>
<swapStream>
...
</swapStream>
</swap>
</trade>
...

The above XML constructs were mapped into the following RDF graph, where you can see how the instances, their relationships and their types are generated with respect to the Trade class diagram (click on the graph to open up a window for a more detailed view).


Trade Instance Example

Another example of transformation rules

The basic transform for an Enumeration in XSD follows these rules:

  1. An OWL class is generated from an XSD simple type having XSD enumeration facets. The localname of the class has Enum suffix to distinguish it from classes generated with similar names.

  2. This class becomes a subclass of EnumeratedValue in the same namespace as the OWL class, which itself becomes a subclass of dtype:EnumeratedValue.

  3. Each XSD enumeration facet becomes an instance of the generated class. dtype:value holds the enumeration value. dtype:order is the order in which the enumeration facet occurs.

  4. An Enumeration class in the same namespace as the OWL class is also generated. This class becomes subclass of dtype:Enumeration. An instance of this class is generated as a container to refer to all the instances generated from the current simple type.

Enumerated value instance URIs are generated using a concatenation of the abbreviation of the class localname's upper case letters and the dtype:value literal.

The following figure shows a graph for PremiumQuoteBasisEnum class and the OWL constructs generated from the related XSD enumeration facets (click on the diagram to open a window with a bigger image):

Trade Instance Example

Complete table of supported transformations


For the reader interested in more details a full overview of the mapping transformations is given in the following table:


Table: Conversion from XSD Constructs to OWL Constructs
#XSD/XML ConstructsOWL Constructs
1xsd:simpleTypeowl:Datatype
2xsd:simpleType with xsd:enumerationBecomes an owl:Class as a subclass of EnumeratedValue. Instances are created for every enumerated value. An instance of Enumeration, referring to all the instances, is created as well as the owl:oneOf union over the instances.
3xsd:complexType over xsd:complexContentowl:Class
4xsd:complexType over xsd:simpleContentowl:Class
5xsd:element (global) with complex typeowl:Class and subclass of the class generated from the referenced complex type
6xsd:element (global) with simple typeowl:Datatype
7xsd:element (local to a type)owl:DatatypeProperty or owl:ObjectProperty depending on the element type. OWL Restrictions are built for the occurrence.
8xsd:groupowl:Class and subclass of A_AbstractElementGroup
9xsd:attributeGroupowl:Class and subclass of A_AbstractAttributeGroup
10xsd:minOccurs and xsd:maxOccursCardinality specified in minimum cardinality, maximum cardinality and universal (allValuesFrom) OWL restrictions.
11Anonymous Complex TypeAs for Complex Type except a URI is constructed from the parent element and the nested element reference. Also, the class is defined as a subclass of A_Anon.
12Anonymous Simple TypeAs for Simple Type except a URI is constructed from the parent element and the nested element reference.
13xsd:default on an attributeUses dtype:defaultValue to attach a value to the OWL restriction representing the associated property.
14Substitution GroupsSubclass statements are generated for the members. Instance files resolve their types by consulting the OWL model at import-time.
15Annotation attributes on elementsOWL Annotation properties are created and placed directly on the relevant class.
16Annotations using xsd:annotationBecome, based on user selection, dc:description, rdfs:comment and/or skos:definition OWL annotations.
17xsi:type on an XML elementOverrides the schema type with the specified type.

A SPARQL Metric Query

As a quick check on the generated OWL models, the following is a useful SPARQL query that counts the number of properties on each OWL class.




SELECT ?class (COUNT(DISTINCT ?p) AS ?properties)
WHERE {
?class a owl:Class .
FILTER( afn:namespace( ?class ) =
"http://www.fpml.org/FpML-5/transparency#") .
OPTIONAL {
?class rdfs:subClassOf ?r .
?r a owl:Restriction .
?r owl:onProperty ?p .
}
}
GROUP BY ?class
ORDER BY DESC( ?properties )

The classes for the transparency ontology have the distribution of properties shown in the following figure (click on the diagram to open a window with a bigger image). For example, TradeInformation has 12 properties:


FpML Example 2

Concluding remarks

The new capability is easy to use. As before, a convenient import wizard will guide the user. The dialog has a number of new options. XML conversion will happen automatically when users open XML files in TBC or use XML import modules in SPARQLMotion. As long as an XML file is valid against an XSD that it is based on, the XML will be transformed in accordance to the schema. Parts of the XML files that do not validate against a schema will continue to be converted using the default Semantic XML structure. There is also a new option to specify which OWL file to use as a schema when mapping a specific XML file to triples. This feature is also available for spreadsheets and will be covered in a separate blog.

We believe that the importance of this work is not only in its value to harvest XML Schemas. Ability to use the automatic creation of triples from XML instance files directly in applications is proving to be key to a number of customers. For example, TopQuadrant is currently using this approach in a project for the North Sea Oil and Gas industry.

The functionality we have described will be released in TopBraid Composer 3.6.0. This release entered internal beta this week. Please contact us if you want a try these capabilities before general availability, which is currently planned for November.


This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.