Friday, September 30, 2011

Ontologies and Data Models – are they the same?

Yesterday a newcomer on the TopBraid Users Forum asked how ontologies may differ from logical data models. As to be expected on the TopBraid Forum, by ontologies he meant specifically ontology models expressed in RDFS/OWL. Because we frequently hear this or similar questions in our trainings, workshops and conversations with customers, I decided to respond in a blog post instead of writing an e-mail.

Data modeling was invented more than thirty years ago to help with the design of databases, specifically relational databases. As quoted below, the ANSI definition from 1975 differentiated between three data models – conceptual, logical and physical. Data modeling quickly became recognized as a tool for analyzing the semantics of an organization with respect to the structure and flow of the information used in carrying out the organization’s activities. Wikipedia offers the following definition of data modeling:

Data modeling is a method used to define and analyze data requirements needed to support the business processes of an organization. The data requirements are recorded as a conceptual data model with associated data definitions. Actual implementation of the conceptual model is called a logical data model.
<…>
In 1975 ANSI described three kinds of data-model instance:
  • Conceptual schema: describes the semantics of a domain (the scope of the model). For example, it may be a model of the interest area of an organization or of an industry. This consists of entity classes, representing kinds of things of significance in the domain, and relationship assertions about associations between pairs of entity classes. A conceptual schema specifies the kinds of facts or propositions that can be expressed using the model. In that sense, it defines the allowed expressions in an artificial "language" with a scope that is limited by the scope of the model.
  • Logical schema: describes the structure of some domain of information. This consists of descriptions of (for example) tables, columns, object-oriented classes, and XML tags.
  • Physical schema: describes the physical means used to store data. This is concerned with partitions, CPUs, tablespaces, and the like.
According to ANSI, this approach allows the three perspectives to be relatively independent of each other. Storage technology can change without affecting either the logical or the conceptual model. The table/column structure can change without (necessarily) affecting the conceptual model.
These definitions describe a clear progression from conceptual to logical to physical data models. Since they originated in the 1970s, they reflect certain technology assumptions that no longer hold true.

When information modeling is done to create a relational database, the conceptual model must differ from the logical model, because a relational database structure has no place to capture business rules, subsumption relationships and other key aspects of a conceptual model. This semantic information, collected and documented as part of the initial modeling, is left behind when modelers and designers move on to define a logical data model. The "left behind" parts are used by software developers as they encode business semantics directly into custom programs.

A logical data model is the subset of a conceptual model that can be expressed using a particular technology. However, there are always some performance considerations that require additional changes to the logical data model before it can be implemented in a relational database. Hence, some aspects of the logical model are in turn left behind as it gets translated into a physical data model.

Since an ontology is a model of a domain describing the objects that inhabit it, all three types of data models can be thought of as ontologies. They range from the most expressive, which describes business concepts and processes (the conceptual model), to the less expressive, which move progressively from describing business semantics to describing the physical structures of the data as stored in databases (the logical and physical data models). A physical model can be thought of as an ontology of a particular database. Wikipedia goes on to note:
Early phases of many software-development projects emphasize the design of a conceptual data model. Such a design can be detailed into a logical data model. In later stages, this model may be translated into physical data model. However, it is also possible to implement a conceptual model directly.
Semantic Web standards (governed by the W3C, the World Wide Web Consortium) make it possible to implement conceptual models directly. This is possible due to the layered architecture of the Semantic Web technology stack consisting of:
  • RDF – a canonical data model that is like the relational data model in its ability to connect related objects and unlike it in that the data objects (or resources in RDF-speak) are highly granular.
The smallest unit of information in RDF is not a table or a row in a table, but an individual statement: a single fact about a resource.

These statements are called RDF triples. For example, “Atlantis decommission-date July, 2011” is a triple where Atlantis is the subject, decommission date is the predicate and July, 2011 is the object. Atlantis and decommission date are RDF resources and July, 2011 is a literal. The subject and predicate of a triple are always RDF resources. The object can be either a resource or a literal value. Predicates that connect two resources are relationships or associations in data modeling speak. Predicates connecting a resource to a literal value are attributes. In OWL they are called, respectively, object properties and datatype properties.
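To make the anatomy of a triple concrete, here is a minimal sketch in plain Python. It is not how any RDF library represents triples; the class names are invented for illustration, the two NASA URIs are the example URIs used in this post, and the `operatedBy` and `NASA` resources are hypothetical additions so that one relationship can be shown next to one attribute.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Resource:
    """Anything identified by a URI: a thing, a class, or a predicate."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A plain data value such as a date, number, or string."""
    value: str


@dataclass(frozen=True)
class Triple:
    subject: Resource
    predicate: Resource
    obj: Union[Resource, Literal]

    def kind(self) -> str:
        # Resource-valued predicates act like relationships;
        # literal-valued predicates act like attributes.
        return "relationship" if isinstance(self.obj, Resource) else "attribute"


atlantis = Resource("http://www.nasa.gov/shuttle/Atlantis")
decommission_date = Resource("http://www.nasa.gov/lifecycle#decommissionDate")

# Hypothetical resources, invented for this sketch only.
operated_by = Resource("http://example.org/model#operatedBy")
nasa = Resource("http://example.org/org/NASA")

facts = [
    Triple(atlantis, decommission_date, Literal("July, 2011")),
    Triple(atlantis, operated_by, nasa),
]

for fact in facts:
    print(fact.predicate.uri, "->", fact.kind())
```

The only thing that distinguishes an attribute from a relationship here is the type of the object, which is the point the paragraph above makes.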

Because the RDF model is highly canonical, RDF data is schema-less. There are no constraints that require it to fit into tables or hierarchies. RDF data is simply a network of connected triples. As such, it can be used to represent both table structures and hierarchies, if needed. Standard mappings have been defined from relational tables and XML hierarchies into RDF.
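As a rough illustration of how a table row can be flattened into triples, consider the sketch below. It is a deliberately simplified mapping written for this post, not an implementation of any standard mapping; the `http://example.org/` base URI, the table name and the column names are all assumptions.

```python
def row_to_triples(table, key_column, row, base="http://example.org/"):
    """Turn one relational row (a dict) into a list of (s, p, o) triples."""
    subject = f"{base}{table}/{row[key_column]}"
    return [
        (subject, f"{base}{table}#{column}", value)
        for column, value in row.items()
        # The key is folded into the subject URI; NULLs yield no triple.
        if column != key_column and value is not None
    ]


# Hypothetical row from a hypothetical "shuttle" table.
row = {"id": "Atlantis", "decommission_date": "2011-07", "crew_size": None}
triples = row_to_triples("shuttle", "id", row)
for triple in triples:
    print(triple)
```

Note that the NULL column simply produces no triple. A relational table needs a placeholder for a missing value, while a graph of triples just has one statement fewer.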

Another key differentiating factor of RDF is that it was “born on the web”. Each RDF resource has a globally unique identity, a URI (uniform resource identifier). For example, the URI for Atlantis may be http://www.nasa.gov/shuttle/Atlantis and the URI for a decommission date may be http://www.nasa.gov/lifecycle#decommissionDate . As a result, it is possible to link RDF data over the web in a way similar to how documents can be hyperlinked over the web. By web we mean all HTTP-based networks, including intranets and extranets.
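One practical consequence of global identifiers is that data from separate sources can be combined without any key reconciliation. The sketch below treats two sources as plain Python sets of triples; the museum source and its URIs are hypothetical and exist only to show the merge.

```python
ATLANTIS = "http://www.nasa.gov/shuttle/Atlantis"

# Two independently published sets of triples (contents are illustrative).
lifecycle_data = {
    (ATLANTIS, "http://www.nasa.gov/lifecycle#decommissionDate", "July, 2011"),
}
museum_data = {
    # Hypothetical predicate and object URIs.
    (ATLANTIS, "http://example.org/museum#displayedAt",
     "http://example.org/museum/VisitorComplex"),
}

# Merging is a set union: both sources use the same URI for Atlantis,
# so their facts attach to the same node with no extra mapping step.
merged = lifecycle_data | museum_data

about_atlantis = sorted((p, o) for s, p, o in merged if s == ATLANTIS)
for predicate, obj in about_atlantis:
    print(predicate, obj)
```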

RDF databases store and provide query access to RDF data. Just as there are standard languages for querying relational and XML data, there is a standard for querying RDF. It is called SPARQL. True to the web-native nature of RDF, SPARQL is not only a query language, but also a protocol that makes it possible to access RDF data over HTTP.
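The core idea behind a SPARQL query is matching a triple pattern that contains variables against the graph. The following sketch imitates that for a single pattern in plain Python. It is a toy matcher written for this post, not a SPARQL engine, and the `operatedBy` triple is a hypothetical addition to give the matcher something to skip.

```python
def match(graph, pattern):
    """Return one dict of variable bindings per triple matching the pattern.

    Terms starting with "?" are variables; all other terms must match exactly.
    """
    results = []
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in the pattern must bind to one value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


ATLANTIS = "http://www.nasa.gov/shuttle/Atlantis"
DECOMMISSION_DATE = "http://www.nasa.gov/lifecycle#decommissionDate"

graph = [
    (ATLANTIS, DECOMMISSION_DATE, "July, 2011"),
    # Hypothetical triple, not taken from any real dataset.
    (ATLANTIS, "http://example.org/model#operatedBy", "http://example.org/org/NASA"),
]

# Roughly the same question in SPARQL:
#   SELECT ?vehicle ?date
#   WHERE { ?vehicle <http://www.nasa.gov/lifecycle#decommissionDate> ?date }
rows = match(graph, ("?vehicle", DECOMMISSION_DATE, "?date"))
print(rows)
```

A real SPARQL engine joins many such patterns and adds filters, optional parts and aggregation, but each pattern is still matched in this spirit.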
  • RDFS (RDF Schema) and OWL (Web Ontology Language) – RDF-based languages for expressing business semantics.
Jointly, RDFS and OWL offer the ability to define classes, or groups of resources that share common characteristics, such as Vehicles and Space Shuttles. The richness of RDFS/OWL makes it possible to fully express the meaning of the business concepts. Data models in RDFS/OWL are stored in the same way as the data, in RDF triples. For example, we can have triples stating that Space Shuttle is a class, that it is a subclass of the Vehicle class, and that a vehicle can have only one decommission date (cardinality = 1) whose value must be an xsd:date. You can also go beyond cardinality and use the Semantic Web standards to represent a variety of business rules.
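Here is one way the schema statements from the example above could look as triples, again sketched as plain Python tuples. The RDF, RDFS, OWL and XSD namespace URIs are the standard W3C ones; the `http://example.org/model#` namespace, the class URIs and the blank node label `_:r1` are assumptions made for this sketch.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"
XSD = "http://www.w3.org/2001/XMLSchema#"
EX = "http://example.org/model#"  # hypothetical namespace for the classes
DECOMMISSION_DATE = "http://www.nasa.gov/lifecycle#decommissionDate"

schema = [
    # Space Shuttle is a class and a subclass of Vehicle.
    (EX + "Vehicle", RDF + "type", OWL + "Class"),
    (EX + "SpaceShuttle", RDF + "type", OWL + "Class"),
    (EX + "SpaceShuttle", RDFS + "subClassOf", EX + "Vehicle"),
    # Decommission date is an attribute whose value must be an xsd:date.
    (DECOMMISSION_DATE, RDF + "type", OWL + "DatatypeProperty"),
    (DECOMMISSION_DATE, RDFS + "range", XSD + "date"),
    # A vehicle has exactly one decommission date. OWL states this by
    # making Vehicle a subclass of an anonymous restriction class.
    (EX + "Vehicle", RDFS + "subClassOf", "_:r1"),
    ("_:r1", RDF + "type", OWL + "Restriction"),
    ("_:r1", OWL + "onProperty", DECOMMISSION_DATE),
    ("_:r1", OWL + "cardinality", "1"),
]


def superclasses(graph, cls):
    """Collect every class reachable from cls via rdfs:subClassOf."""
    found = set()
    frontier = [cls]
    while frontier:
        current = frontier.pop()
        for s, p, o in graph:
            if s == current and p == RDFS + "subClassOf" and o not in found:
                found.add(o)
                frontier.append(o)
    return found


shuttle_supers = superclasses(schema, EX + "SpaceShuttle")
print(sorted(shuttle_supers))
```

Because the restriction is itself stated as a superclass of Vehicle, the walk from SpaceShuttle reaches it too, which is how the cardinality rule is inherited by every kind of vehicle.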

Since the data and the schema are stored in the same way, it is possible to query schemas the same way data is queried and to combine search criteria about schemas with search criteria about data. For example, we can create SPARQL queries that ask for all vehicles that have been decommissioned, all subclasses of the Vehicle class, or all relationships and attributes a vehicle should have. We can also ask that, when returning decommissioned vehicles, a query provide data values only for the fields that have cardinality = 1.
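The last example in the paragraph above, returning decommissioned vehicles with only their cardinality-1 fields, can be sketched over a single list that holds schema and data triples side by side. This is plain Python standing in for what would normally be one SPARQL query; the `http://example.org/model#` names are hypothetical, and for brevity the sketch ignores which class each restriction is attached to.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
ON_PROPERTY = "http://www.w3.org/2002/07/owl#onProperty"
CARDINALITY = "http://www.w3.org/2002/07/owl#cardinality"
DECOMMISSION_DATE = "http://www.nasa.gov/lifecycle#decommissionDate"
ATLANTIS = "http://www.nasa.gov/shuttle/Atlantis"
EX = "http://example.org/model#"  # hypothetical namespace

graph = [
    # Schema triples.
    (EX + "SpaceShuttle", SUBCLASS_OF, EX + "Vehicle"),
    (EX + "Vehicle", SUBCLASS_OF, "_:r1"),
    ("_:r1", ON_PROPERTY, DECOMMISSION_DATE),
    ("_:r1", CARDINALITY, "1"),
    # Data triples.
    (ATLANTIS, RDF_TYPE, EX + "SpaceShuttle"),
    (ATLANTIS, DECOMMISSION_DATE, "July, 2011"),
    (ATLANTIS, EX + "comment", "not constrained by the schema"),
]


def subclasses_of(graph, cls):
    """Return cls plus every class declared, directly or not, below it."""
    found = {cls}
    changed = True
    while changed:
        changed = False
        for s, p, o in graph:
            if p == SUBCLASS_OF and o in found and s not in found:
                found.add(s)
                changed = True
    return found


def single_valued_properties(graph):
    """Schema question: which properties carry a cardinality of 1?"""
    restrictions = {s for s, p, o in graph if p == CARDINALITY and o == "1"}
    return {o for s, p, o in graph if p == ON_PROPERTY and s in restrictions}


def decommissioned_vehicles(graph):
    """Data question, filtered by the answers to the schema questions."""
    vehicle_classes = subclasses_of(graph, EX + "Vehicle")
    vehicles = {s for s, p, o in graph if p == RDF_TYPE and o in vehicle_classes}
    keep = single_valued_properties(graph)
    decommissioned = {
        s for s, p, o in graph if s in vehicles and p == DECOMMISSION_DATE
    }
    return {
        v: {p: o for s, p, o in graph if s == v and p in keep}
        for v in decommissioned
    }


result = decommissioned_vehicles(graph)
print(result)
```

Atlantis is found even though it is typed only as a SpaceShuttle, and its comment is dropped because the schema does not declare that property as single-valued.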

The use of RDF means that the modeling constructs and definitions can be linked and connected. Organizations can refer to each other’s business definitions. Models can be modularized and re-used where appropriate. Differences between related, but not identical, concepts can be described. All of this can now be done in a standards-compliant and interoperable way.

A growing number of standards bodies and communities of interest are publishing RDF/OWL data models for their particular domains. For example:
  • SKOS – provides a way to represent taxonomies and thesauri
  • ISO 15926 – offers a data model for sharing life-cycle data for process plants including oil and gas production facilities
  • Ontology for Media Resources – defines a core set of metadata properties for multimedia resources
  • SIOC – defines information about online communities
  • QUDT – provides models describing measurable quantities, units for measuring different kinds of quantities and the data types used to store and manipulate these objects in software
  • Provenance Vocabulary – defines provenance-related metadata
There is much more that can be added to this post including a discussion on the best practices for ontology modeling, ontology architecture, approaches for connecting and mapping models, using rules and constraints, publishing, versioning and governing models. Each of these topics, however, deserves an exploration in its own right.

I will end by pointing to a few relevant related blogs and web pages we have published before:

4 comments:

QK said...

Thanks!!!

The information in this blog is extremely useful. One thing: the links for the related blogs error out. It would be great if these links were updated.

QK said...

Thanks!!!

The information in this blog is extremely useful. One thing: the links for the related blogs at the end do not work. I would appreciate it if these were updated.

Michel Bohms said...

Clearly 'modelling power' and 'direct implementation' are two key factors but I miss the OWA/CWA discussion here which might be even more key? (false versus unknown).

Furthermore, these first two factors are not 'exact'. With good old EXPRESS (STEP technology) we could also model specialisation/taxonomies and define complex rules (called where-rules in EXPRESS). And as long as you have a clean/complete one-to-one mapping to the implementing system, EXPRESS could also abstract from the underlying implementation mechanism (just like you can have a triple store or RDBMS backend for RDF/OWL). All in all, my first real factor for differentiating between ontologies and data models would be OWA/CWA... my two cents, Michel Böhms

Irene Polikoff said...

The open world assumption adds flexibility and enables knowledge discovery/classification. However, for many practical enterprise applications, it is important to be able to close the world. SPIN (SPARQL Rules) makes this possible. So do some other rule-based approaches. Typical of RDF, this can be done with a lot of flexibility: closing the world for some operations and leaving it open for others.

Keep in mind that this blog entry was not about ontologies in general, but specifically about RDF/Linked Data/Semantic Web ontologies. The second sentence tries to make this point very specific by saying that "As to be expected on the TopBraid Forum, by ontologies he meant specifically ontology models expressed in RDFS/OWL." Thus, for the blog authors the most fundamental differentiation is in RDF itself - its support for globally unique identifiers, data and schema distribution and merging, etc.

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.