Monday, April 30, 2012

Can semantic technology melt process industry’s icebergs of information?


Icebergs of information loiter throughout process manufacturing IT waiting to sink any information integration project. The impact of semantic technologies is being felt in medicine, life sciences, intelligence, and elsewhere but can it solve this problem in process manufacturing? The ability to federate information from multiple data-sources into a schema-less structure, and then deliver that federated information in any format and in accordance with any standard schema uniquely positions semantic technology. Is this a sweet spot for semantic technologies?

Process Manufacturing Application Focus over the Years

Over the years we have been solving problems within process manufacturing IT only to uncover more problems. Once the problem was that of measurement data in silos which was solved by the introduction of real-time data historians. However that created the problems of data visibility, solved by the introduction of graphical user interfaces. This introduced data overload which was partially solved by the introduction of analytical tools to digest the information and produce diagnostics. Unfortunately these tools were difficult to deploy across all assets within an organization, so we have been trying to solve that problem with information models. The current problem is how to convert the diagnostics into actionable knowledge with the use of work-flow engines and ensuring the sustainability of applications as solutions increases in complexity.
Process Manufacturing Application Problems and Solutions over the Years
1985-
1995
1990-
2000
1995-
2005
2000-
2010
2005-
2015
2010-
Problem
Measurement data in silos
Data access and visualization
Analysis and business intelligence
Contextualized information
Consistent actioning
Sustainability
Industry
Response
Real-time databases collecting measurements
(proprietary)
Graphical user interfaces, trending and reporting tools
(proprietary)
Analytical tools to digest data into information and diagnostics
Plant data models (ProdML, ISA-95, ISO15926, IEC 61970/61968, Proprietary)
ISO-9001
Outsourcing
Standards
Consequence
Data but no user access
Data overload
Deployability of analysis to all assets
Interpretation limited to experts
Complexity, much more than RTDB, limiting sustainability
Improved ongoing application benefits

However it is not only the increased technological complexity that is causing problems. Business decisions now cross many more business boundaries. When measurement data was trapped in silos we were content with unit-wide or plant-wide data historians. Now a well performance problem might involve a maintenance engineer located in Houston accessing a Mimosa[1]-based maintenance management system, an operations engineer located in Aberdeen accessing an OPC-UA[2]-based data historian, a production engineer located in London accessing a custom system driven by WITSML[3]-based feeds, and a facilities engineer using an ISO-15926[4] facilities management model. Not only are the participants in different locations and business units, but they also rely on different systems using different models to support their decision making. However they all should be talking about the same well, measured by the same instruments, producing the same flows, and processed by the same equipment.
The problem is that these operational support systems are not simply data silos whose homogeneous data we need to merge into one to answer our questions. In fact these operational support systems are icebergs of information. Above the surface they publish a public perspective focused on the core operational function of the application. However this data needs context, so below the surface is much of the same information that is contained in other systems. This information provides the context to the operational data so that the operational system can perform its required functions. For example the historian needs to know something about the instruments that are the source of its measurements; maintenance management systems need to know not only about the equipment to be maintained but the location of that equipment, physically and organizationally.

Figure 1: Icebergs of Information
Icebergs of information are not limited to the operational data stores deployed in organizations. An essential practice in these days of interoperability requirements is the adoption of model standards. However even these exhibit the same problems as shown by the diagram below. This diagram maps the available standards to its focus within the hydrocarbon supply chain.

Figure 2: Multiple Overlapping Model Standards
Increasing regulatory and competitive demands on the business are forcing decision making to be more timely, and to be more integrated across the traditional business boundaries. However these icebergs are getting in the way of effective decision making.
One way to make any or all of this information available to consumers is to create the bigger iceberg. ‘Simply’ create the relational database schema that covers every past, current, and future business need, and build adapters to populate this database from the operational data stores. Unfortunately this mega-store can only get more complex as it has to keep up with an expanding scope of information required to support the decision making processes.

Figure 3: Integration using the Bigger Iceberg
Alternatively we can keep building data-marts every time someone has a different business query.  However these do not provide the timeliness required to support operational decision making.

The Need for a Babel-Fish

We cannot meet the needs of the business, and solve their decision making needs by having one mega-store because it will never keep up with the changing business requirements. Instead we need a babel-fish (with thanks to the Hitchhikers Guide to the Galaxy).
This babel-fish can consume all of the different operational data in different standards, and translate them into any standard that the end-consumer wants. Thus the babel-fish will need to know that OPC UA's concept 'hasInstrument' has the same meaning as Mimosa's concept of 'Instrumented'. Similarly 10FIC107 from an OPCUA provider is the same as 10-FIC-1-7 from Mimosa.
1.       Information providers (operational data stores) within the business will want to provide information according to their capabilities, but preferably using the standards appropriate for their application. For example measurements should be OPC UA, maintenance should use Mimosa
2.       Information consumers will want to consume information in the form of one or more standards appropriate for their application.

Figure 4: Integration babel-fish

The Semantic/RDF model comes to the rescue

First of all a definition: a semantic model means organizing all data and knowledge as RDF triples {subject, property, object}. Thus {:Peter, :hasAge, 21^^:years}, and {:Pump101, :manufacturedBy, :Rotek} are examples of RDF triples. RDF triples can be persisted in a variety of ways: SQL table, custom organizations, NoSQL, XML files and many more. If we were designing relational database to hold these RDF triples we would only have one ‘table’ so it may appear that we have no schema, in the relational database design-sense when we have key relationships to enforce integrity, and unique indices to enforce uniqueness. However we can add other statements about the data such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}[5]. Used in combination with a reasoner we can infer consequences from these asserted facts, such as :Pump101 is a type of :Pump, and Peter is not a :Pump, despite rumors to the contrary.   These triples can be visualized as the links in a graph with the subject and object being the nodes of the graph, and the property the name of the edge linking these nodes:

Figure 5: RDF Triples as a graph
Over the years, new modeling metaphors have been introduced to solve perceived or actual problems with their predecessors. For example the Relational Model had perceived difficulties associated with reporting, model complexity, flexibility, and data distribution. A semantic model helps solve these problems.

Figure 6: Evolution of Model Metaphors

·         In response to the perceived reporting issues, OLAP techniques were introduced along with the data warehouse. This greatly eased the problem of user-reporting, and data mining. However it did introduce the problem of data duplication.
o   A semantic model can query against a federated model in which information is distributed throughout the original data sources.
·         In response to the perceived complexity issues, various forms of object-orientated modeling were introduced. There is no doubt that it is easier to think of one’s problem in terms of an object model rather than a complex relational or ER model, especially when there are a large number of entities and relations.
o   The semantic model is built around the very simple concept of statements of facts such as {:Peter, :hasAge, 21^^:years}, and {:Pump101, :manufacturedBy, :Rotek} combined with statements that describe the model such as {:Pump101, :type, :ReciprocatingPump} and {:ReciprocatingPump, :subClassOf, :Pump}.
·         The model flexibility problem occurs when, after the model has been designed, the business needs the model to change. In response to this flexibility issue, the choice is to make the original model anticipate all potential uses but then risk complexity, or use an object-relational approach in which it is possible to add new attributes without changing the underlying storage schema.
o   In semantic models these relationships are expressed in triples, using RDFS, SKOS, OWL, etc. Thus RDF is also used as the physical model (in RDF stores, at least).
·         There have been various responses to data distribution.
o   In the relational world there is not much choice other than to replicate the data from heterogeneous data stores using Extract-Transform-Load (ETL) techniques. In the case of homogenous but distributed databases distributed queries are possible, although it does require intimate knowledge of all the schemas in all of the distributed databases.
o   In the object-orientated world we are in a worse situation: it is very difficult to manage a distributed object in which different objects are distributed or attributes are distributed.
The good news is that a semantic approach is the ideal (or even the only) approach that can solve the information integration problem as follows:
1.       Convert to RDF normal form: Convert all source data into RDF. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage
    • There are already standard ways of doing this for any spreadsheet, relational database, XML schema, and more. For example, TopBraid Suite (http://www.topquadrant.com/products/TB_Suite.html) provides converters and adaptors for all common data sources. It is relatively easy to create more mappings such as OPCUA. The dynamic adapters act as SPARQLEndpoints[6].
2.       Federated data model: Create 'rules' that map one vocabulary to another.
    • The language of these rules would be RDFS, SKOS and OWL. For example you can declare {OPCUA:hasInstrument, owl:sameAs, Mimosa:Instrumented}. Note that these are simply additional statements expressed in RDF which are then used by a reasoner to infer the consequences such as :FI101 is actually the same as :10FIC101.
    • More sophisticated rules can also be created using directly RDF and SPARQL. For some examples, see SPIN or SPARQL Rules at http://spinrdf.org/ and http://www.w3.org/Submission/2011/SUBM-spin-overview-20110222/
3.       Chameleon data services: Create consumer queries that extract the information from the combined model into the standard required using SPARQL queries.
    • For example even though all instrument data is in OPCUA, a consumer could use a Mimosa interface to fetch this data. The results can then be published as web-services for consumption by external applications using SPARQLMotion (http://www.topquadrant.com/products/SPARQLMotion.html)

Figure 7: Federation End-to-End

Let’s look into these steps in detail:

Convert to RDF normal form

Despite the fact that data will be stored in different formats (relational, XML, object, Excel, etc) according to different schemas they can always be converted into RDF triples. Always is a strong word, but it really does work. There are already ways of doing this for any spreadsheet, relational database, XML schema, and more and it is relatively easy to create more mappings such as OPC-UA. The data can be left at source and fetched on demand (federated) or moved into temporary RDF storage. For example, TopBraid Suite (http://www.topquadrant.com/products/TB_Suite.html) provides converters and adaptors for all common data sources.

Figure 8: Conversion to RDF Normal Form

Federated Data Model

A federated data model allows different graphs (aka databases) to be aggregated by linking the shared objects. This applies to real-time measurements (OPC-UA), maintenance (MIMOSA), production data (ProdML), or any external database. We can visualize this as combining the graphs of the individual operational data stores into a single graph.
Of course there will be vocabulary differences between the different data-sources. For example, in the OPC-UA data-source you might have a property OPCUA:hasInstrument, and in a MIMOSA data-source the equivalent is called Mimosa:Instrumented. So the federated data model incorporates 'rules' that map one vocabulary to another. The language of these rules would be RDFS, SKOS, and OWL. For example, in OWL, you can declare {OPCUA:hasInstrument owl:sameAs Mimosa:Instrumented}. Note that these are simply additional statements expressed as RDF triples which are then used by a reasoner to infer consequences such as :FI101 is actually the same as :10FIC101.
There will also be identity differences between the different data-sources. These can also be handled by additional statements, such as {:TANK#102, owl:sameAs, :TK102 }. This allows a reasoner to infer that the statement {:TK102, :has_price, 83^^:$} also applies to :TANK#102, implying {:TANK#102, :has_price, 83^^:$}.

Figure 9: Information Federated from MulTiple Datasources

Chameleon Data Services

To extract information from the federated information, the best choice is SPARQL, the semantic equivalent of SQL only simpler. Whilst SQL allows one to query the contents of multiple tables within a database, SPARQL matches patterns within the graph. With SQL we need to know in which table each field belongs. With SPARQL we define the graph pattern that we want to match, and the query engine will search throughout the federated graphs to find the matches. In the example illustrated below we do not need to know that the price attribute comes from one data source, whilst the volume comes from another. In fact SPARQL allows even further flexibility. The price attribute for Tank#101 could come from a different data source than the price attribute for Tank#102. This is part of the magic of the semantic technology.
  Figure 10: Graph Pattern matching with SPARQL
SPARQL can be used to directly query the federated graph for reporting purposes, however most consumers of the information will expect to interface to a web-service, with SOAP or REST being the most popular. These services do not have to be programmed. Instead they can be declared using SPARQLMotion (http:www.sparqlmotion.org) to produce easily consumed and adaptable web-services. The designer for SPARQLMotion is shown below:

Figure 11: Example SPARQLMotion

Semantic/RDF advantages for the Process Manufacturing

Despite solving a complex data integration problem, Semantic/RDF is inherently simpler. Can there be anything simpler than storing all knowledge as RDF triples?  Despite this simplicity, we do not lose any expressivity.
There is no predefined schema to limit flexibility. However the schema rules, encoded as tables and keys in the relational model, can still be expressed using RDFS, OWL, and SKOS statements.
Deconstructing all information into statements (triples) allows data from distributed sources to be easily merged into a single graph.
Any information model can be reconstructed from the merged graph using SPARQL and presented as web-services (SOAP or REST).


[1] MIMOSA is a not-for-profit trade association dedicated to developing and encouraging the adoption of open information standards for Operations and Maintenance in manufacturing, fleet, and facility environments. MIMOSA's open standards enable collaborative asset lifecycle management in both commercial and military applications.
[2] The Unified Architecture (UA) is THE next generation OPC standard that provides a cohesive, secure and reliable cross platform framework for access to real time and historical data and events. 
[3] WITSML™ (Wellsite Information Transfer Standard Markup Language) is an industry initiative to provide open, non-proprietary, standard interfaces for technology and software that monitor and manage wells, completions and workovers.
[4] ISO 15926 provides integration of life-cycle data for process plants including oil and gas production facilities

[5] I should really be using URIs instead of text labels for subject, property, and objects, but the intent of the semantic model is conveyed more simply if we avoid identifiers like ‘http://www.example.org/equipment#Pump101’ and use :Pump#101
[6] SPARQL is a query language for RDF. A SPARQL endpoint is a protocol service that makes it possible to query a data source using SPARQL. The source itself does not need to be in RDF. It can, for example, be a traditional relational database. Later in this article we will describe SPARQL in more detail and show some query examples.

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.