Friday, February 19, 2010

How to: use XSLT in a semantic web application

An updated version of this blog post is now available as a chapter in the TopBraid Application Development Quickstart Guide (pdf).

XML-based tools form the plumbing of modern electronic publishing systems, so when your semantic web application can create XML, it can easily feed content into one of these systems. The W3C standard XSLT is the most popular tool for converting one kind of XML into another, so in this posting we'll see how SPARQLMotion scripts can use XSLT to create customized XML from your triples. (TopBraid Suite is also very good at reading XML; see the TopBraid Composer help panel "Creating, Importing, Querying, Saving XML documents with Semantic XML" for more on this.) Our example script will save its output as an XHTML file, but you can create XML conforming to any DTD or schema you like and save it as a file or return it via a web service.

Creating XML

TopBraid offers several ways to create XML. The Semantic XML feature lets you read arbitrary XML and gives you fine-grained control over XML that you create, but for a quick conversion of data to XML, there are simpler options. The W3C's RDF/XML serialization standard is the most well-known (if not the most popular) format for representing triples in XML, but its flexibility in the representation of triples can add complexity to the XSLT stylesheet, which must account for several possible locations of a given triple's subject, predicate, and object in various combinations of XML elements and attributes.
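To make the RDF/XML point concrete, here is a small Python sketch (not part of any TopBraid tooling) showing one triple written two legal ways. The resource name JosephKennedy is a made-up identifier for illustration. A consumer, whether Python or XSLT, has to look in both places:

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
K = "http://topbraid.org/examples/kennedys#"

# Variant 1: the property value is a child element.
# (JosephKennedy is a hypothetical resource name.)
AS_ELEMENT = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:k="{K}">
  <rdf:Description rdf:about="{K}JosephKennedy">
    <k:firstName>Joseph</k:firstName>
  </rdf:Description>
</rdf:RDF>"""

# Variant 2: the same triple, with the value as a property attribute.
AS_ATTRIBUTE = f"""
<rdf:RDF xmlns:rdf="{RDF}" xmlns:k="{K}">
  <rdf:Description rdf:about="{K}JosephKennedy" k:firstName="Joseph"/>
</rdf:RDF>"""


def first_name(rdf_xml):
    """Return the k:firstName value, checking both places it may appear."""
    description = ET.fromstring(rdf_xml).find(f"{{{RDF}}}Description")
    child = description.find(f"{{{K}}}firstName")
    if child is not None:
        return child.text
    # Fall back to the attribute form of the same property.
    return description.get(f"{{{K}}}firstName")
```

Both variants yield the same answer, but only because the function checks two locations. RDF/XML allows still more variations, such as typed node elements and nesting, and each one adds a case to handle.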

The simplest, most straightforward XML format that a semantic web application can create is the W3C standard SPARQL Query Results XML Format. For example, with the following query against the kennedys.owl file included with TopBraid Composer,

PREFIX k: <http://topbraid.org/examples/kennedys#>
SELECT ?last ?first ?birth
WHERE {
?s k:lastName ?last .
?s k:firstName ?first .
?s k:birthYear ?birth .
}
ORDER BY (?birth)

the XML representation of the result will have the following structure, with a head element listing the variables followed by a results element that contains a result child for each row of results returned by the SPARQL engine:


<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="last"/>
    <variable name="first"/>
    <variable name="birth"/>
  </head>
  <results>
    <result>
      <binding name="last">
        <literal>Kennedy</literal>
      </binding>
      <binding name="first">
        <literal>Joseph</literal>
      </binding>
      <binding name="birth">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">1888</literal>
      </binding>
    </result>
    <result>
      <binding name="last">
        <literal>Fitzgerald</literal>
      </binding>
      <binding name="first">
        <literal>Rose</literal>
      </binding>
      <binding name="birth">
        <literal datatype="http://www.w3.org/2001/XMLSchema#integer">1890</literal>
      </binding>
    </result>
    <!-- more result elements -->
  </results>
</sparql>
As we'll see, a very brief XSLT stylesheet can handle this.
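Because the format is so regular, any XML tool can read it, not just XSLT. As a rough illustration, here is a Python sketch that uses only the standard library to turn a results document into a list of rows. The function name read_rows and the inline SAMPLE string are my own, not anything TopBraid provides:

```python
import xml.etree.ElementTree as ET

NS = {"sp": "http://www.w3.org/2005/sparql-results#"}

# A trimmed copy of the structure shown above.
SAMPLE = """<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="last"/><variable name="first"/><variable name="birth"/>
  </head>
  <results>
    <result>
      <binding name="last"><literal>Kennedy</literal></binding>
      <binding name="first"><literal>Joseph</literal></binding>
      <binding name="birth"><literal>1888</literal></binding>
    </result>
    <result>
      <binding name="last"><literal>Fitzgerald</literal></binding>
      <binding name="first"><literal>Rose</literal></binding>
      <binding name="birth"><literal>1890</literal></binding>
    </result>
  </results>
</sparql>"""


def read_rows(results_xml):
    """Return (variable names, list of {variable: value} dicts)."""
    root = ET.fromstring(results_xml)
    variables = [v.get("name") for v in root.findall("sp:head/sp:variable", NS)]
    rows = []
    for result in root.findall("sp:results/sp:result", NS):
        row = {}
        for binding in result.findall("sp:binding", NS):
            # Each binding holds a single child: literal, uri, or bnode.
            row[binding.get("name")] = binding[0].text
        rows.append(row)
    return variables, rows


variables, rows = read_rows(SAMPLE)
```

Every value sits at the same path (results/result/binding), which is exactly why the stylesheet for this format can be so short.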

Creating an XML version of SPARQL query results in a SPARQLMotion script

In our sample application, the XSLT stylesheet will create an XHTML version of the data returned by the query above.

Start by creating a SPARQLMotion file in TopBraid Composer called xsltdemo. Create a SPARQLMotion script in this file with an ImportRDFFromWorkspace initial module named GetKennedyData, and set its sml:sourceFilePath property to /TopBraid/Examples/kennedys.owl. (For more details on following these steps, see How to: create and run a SPARQLMotion script.)

For your script's second module, drag a SerializeSPARQLResults module from the SPARQLMotion palette's Text Processing section onto the workspace and name it SelectRptData. Set the following three properties to configure it:

  • Set its sml:selectQuery property to the SPARQL query shown above.

  • Set sm:outputVariable, a property that names the variable that will hold the results of the query, to queryResults.

  • For the sml:serialization property, click the white triangle to display the context menu and select Add Existing, because you want to pick from the predefined list of legal values. On the Add existing dialog box, select sm:XML on the right. (If you click on sml:RDFSerialization on the left of this dialog box, you'll see that this module can also output N3, NTriples, RDF/XML, and Turtle RDF.)

Close the Edit SelectRptData dialog box and connect the GetKennedyData module's icon to the SelectRptData one.

When you develop an XSLT stylesheet, you want some sample input XML data handy to test it, so let's have this script create a file for this. Add an Export to XML File icon from the Export to Local section of the palette and call it SaveTestXML. Set its sml:xml property to {?queryResults} to get the data that your SelectRptData module stored in this variable, and set sml:targetFilePath to testdata.xml. That's all you need to set, so connect your SelectRptData icon to it as shown here:

Select that third icon and click the debug icon at the top of the SPARQLMotion workspace to run it, and you should see a testdata.xml file appear in the same directory as your script. This file will have XML that follows the structure of the example shown above.

Applying an XSLT stylesheet in a SPARQLMotion script

First, you need to create the XSLT stylesheet file. If you're using an Eclipse-based XML editor such as oXygen or XMLSpy, go ahead and use one of those editors. Otherwise, after selecting the folder in the Navigator view where you want to store the file, pick New from the File menu. Because you're not creating one of the specialized files that TopBraid Composer typically deals with, pick Other from the cascade menu. Pick XML from the XML section of the New dialog box and click Next. Call the file createKennedyHTMLRpt.xsl, and click Finish on the New XML File dialog box. (Clicking Next on that dialog would lead to wizards that make the process more complicated than this example requires.) Paste the following stylesheet, which I wrote to process the testdata.xml file created above, into the editor and save the stylesheet:


<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:sp="http://www.w3.org/2005/sparql-results#"
    xmlns="http://www.w3.org/1999/xhtml">

  <xsl:template match="sp:sparql">
    <html>
      <head><title>XSLT demo</title></head>
      <body>
        <table border="1">
          <tr>
            <th>Last Name</th>
            <th>First Name</th>
            <th>Birth</th>
          </tr>
          <xsl:apply-templates/>
        </table>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="sp:result">
    <tr>
      <td><xsl:value-of select="sp:binding[@name='last']"/></td>
      <td><xsl:value-of select="sp:binding[@name='first']"/></td>
      <td><xsl:value-of select="sp:binding[@name='birth']"/></td>
    </tr>
  </xsl:template>

</xsl:stylesheet>

(If you create the stylesheet outside of TopBraid Composer, make sure to put it in the same folder as your script, then right-click the folder's icon in the Navigator view and select Refresh so that Eclipse knows that the new file is there.) The stylesheet converts the XML into an XHTML file with a table that has a row for each row of the result set.
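If it helps to see the stylesheet's logic spelled out procedurally, here is a rough Python equivalent. Python's standard library has no XSLT processor, so this is not XSLT and not how TopBraid runs the transformation; it is only a sketch of the same mapping (one table row per result element), and the function name results_to_xhtml is my own:

```python
import xml.etree.ElementTree as ET

SP = {"sp": "http://www.w3.org/2005/sparql-results#"}
XHTML = "http://www.w3.org/1999/xhtml"


def results_to_xhtml(results_xml):
    """Build the same XHTML table that the stylesheet above produces."""
    # Serialize XHTML elements with a default namespace, no prefix.
    ET.register_namespace("", XHTML)
    html = ET.Element(f"{{{XHTML}}}html")
    head = ET.SubElement(html, f"{{{XHTML}}}head")
    ET.SubElement(head, f"{{{XHTML}}}title").text = "XSLT demo"
    body = ET.SubElement(html, f"{{{XHTML}}}body")
    table = ET.SubElement(body, f"{{{XHTML}}}table", border="1")

    # Header row, matching the template for sp:sparql.
    header = ET.SubElement(table, f"{{{XHTML}}}tr")
    for label in ("Last Name", "First Name", "Birth"):
        ET.SubElement(header, f"{{{XHTML}}}th").text = label

    # One row per result, matching the template for sp:result.
    root = ET.fromstring(results_xml)
    for result in root.iterfind("sp:results/sp:result", SP):
        row = ET.SubElement(table, f"{{{XHTML}}}tr")
        for name in ("last", "first", "birth"):
            binding = result.find(f"sp:binding[@name='{name}']", SP)
            cell = ET.SubElement(row, f"{{{XHTML}}}td")
            # Like xsl:value-of, take the text content; empty if unbound.
            if binding is not None:
                cell.text = "".join(binding.itertext()).strip()
            else:
                cell.text = ""
    return ET.tostring(html, encoding="unicode")
```

The XSLT version says the same thing declaratively in about a third of the space, which is a good argument for using it here.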

The next module to add to the SPARQLMotion script is the one that applies the stylesheet to the XML. Drag a Convert XML by XSLT module from the XML Processing section of the palette, name it CreateHTML, and configure it by setting the following three properties:

  • Set sm:outputVariable to HTMLResult. This variable will store the result of the XSLT transformation.

  • Set sml:template to {?stylesheet}. We haven't added the script module that sets this variable yet, but will soon.

  • Set sml:xml to {?queryResults}. As we saw above, the SelectRptData module stores the XML of the query results in this variable; the SaveTestXML module also used the contents of this variable.

Connect the SelectRptData module to your new CreateHTML module. SelectRptData is already connected to SaveTestXML, but sending a module's output to multiple modules is a common technique in SPARQLMotion scripts.

To read the createKennedyHTMLRpt.xsl stylesheet file from disk and send its contents to the CreateHTML module for use in the XSLT transformation, drag an ImportTextFile module from the Import from Local section of the palette and name it ReadXSLTStylesheet. Tell it where to find the stylesheet file by setting its sml:sourceFilePath property to createKennedyHTMLRpt.xsl, and then set its sm:outputVariable property to stylesheet. This is the variable in which the CreateHTML module will expect to find the text of the stylesheet that it applies to the output of the SelectRptData module. Connect the ReadXSLTStylesheet module to the CreateHTML module.

The last module will save the results of the XSLT transformation in a disk file. Drag another Export to XML File module from the Export to Local section of the palette and name it SaveHTMLFile. Set its sml:xml property to {?HTMLResult} so that it reads the variable set by the CreateHTML module, and then indicate where to save this variable's value by setting the module's sml:targetFilePath property to KennedyRpt.html. Connect the CreateHTML module to the new SaveHTMLFile module. Your completed script should look something like this:

To test it, select the SaveHTMLFile icon and click the debug icon. You should see a KennedyRpt.html file appear in the same directory as your xsltdemo script; in a browser, this HTML file will look like this:

Now that the script is applying the XSLT stylesheet to the XML and creating the HTML, you can right-click your SaveTestXML script module and select Delete to remove it from your script. You may prefer to keep it, though: if you ever need to refine the SELECT query in the SelectRptData module or the XSLT stylesheet that the ReadXSLTStylesheet module feeds to the CreateHTML module, the ability to create another disk file of test data may prove handy.

This also demonstrates another technique of sophisticated SPARQLMotion scripts: the ability to share common processing modules for different output modules. You essentially have two SPARQLMotion scripts here—SaveTestXML and SaveHTMLFile—which share several modules. That's why, when you pick Edit SPARQLMotion Script or Execute SPARQLMotion Modules from the Scripts menu, you'll see these two "scripts" listed as choices, even though they're graphically represented as a single flow chart with a choice of end points.

Taking it further

If you had created this script as a web service, then instead of ending with an Export to XML File module from the Export to Local section of the palette, you could route the XML from the Convert XML by XSLT module to a Return XML module from the Export to Remote section of the palette. (Or, that could be a third endpoint added to the two discussed above!) If you set the Return XML module's sml:mimetype property to text/html and had it return XHTML like the xsltdemo script above does, you'd be setting up the dynamic creation of a web page, so that users browsing to the URL that invokes the service would think that they were just linking to a web page.

Of course, it doesn't have to return HTML. It can return anything that an XSLT stylesheet can create, and you can set sml:mimetype to any valid MIME type. Any application that can do an HTTP GET can then request delivery of this data, which opens up a wide range of possibilities for the contributions that this application can make to a larger system.
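The idea of "the same content, served with whatever MIME type the client needs" is not specific to TopBraid. As a generic illustration only (this is plain Python WSGI, not the Return XML module, and make_app is a name I made up), a service that hands back a prepared document with a chosen MIME type can be sketched like this:

```python
def make_app(body, mime_type="text/html"):
    """Return a WSGI app that serves `body` with the given MIME type."""
    payload = body.encode("utf-8")

    def app(environ, start_response):
        # Only GET is supported in this sketch.
        if environ.get("REQUEST_METHOD") != "GET":
            start_response("405 Method Not Allowed", [("Allow", "GET")])
            return [b""]
        start_response("200 OK", [
            ("Content-Type", f"{mime_type}; charset=utf-8"),
            ("Content-Length", str(len(payload))),
        ])
        return [payload]

    return app
```

To try it locally, you could pass the app to make_server from Python's wsgiref.simple_server module and browse to the resulting URL. Swapping text/html for application/xml or another type changes how clients treat the very same bytes.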

Friday, February 12, 2010

How to: read triples from a relational database

An updated version of this blog post is now available as a chapter in the TopBraid Application Development Quickstart Guide (pdf).

TopBraid includes the D2RQ interface to relational database management systems and automates most of the configuration for you, giving your TopBraid applications easy access to relational data. In this entry, we'll see how easily a SPARQLMotion script can use data stored in a MySQL database. You would use a similar process to get at data stored using Oracle, Sybase, SQL Server, and PostgreSQL. Our sample uses MySQL's sample "world" database, a collection of data about countries around the world. See the Other MySQL Documentation page to download the database and for information on installing it.

Configuring TopBraid to read a MySQL database

Once you have a running copy of MySQL with the world database installed, the next step is to tell TopBraid Composer to create the configuration files that let you treat this relational database as a virtual triplestore. Start by selecting the project or folder in the Navigator window where you want TBC to put these files, and then select Import from the File menu and pick "RDF/OWL View on Relational Database via D2RQ" from the list of formats that can be imported. Click the Next button, and you'll see the Specify database connection dialog box.

For the database connection file name, enter worlddb, and then change the Base URI of each of the configuration files to something that reflects this filename, as shown in the screen shot below.

The last four fields are where you specify the key information for giving TopBraid access to the database. The drop-down list for the Database URL field offers a choice of templates for different RDBMS systems; for our MySQL database, select jdbc:mysql://<server>/<database> and change it to jdbc:mysql://localhost/world. (This assumes that MySQL is running on the same machine as your copy of TBC. If not, substitute the appropriate name for localhost. The part after the last slash is the database that you want to read from the MySQL server on that system.) In the next two fields, enter a user name and password for an account that has access to the world database when using MySQL tools, and then in the Driver Class field select com.mysql.jdbc.Driver.

Click the Finish button, and TBC will create the configuration files and display a message box about it:

The "Importing Relational Databases with D2RQ" online help panel has good background on this whole process, especially on the role that each of these configuration files plays. Among other things, it points out that the property names assigned to the columns of the relational database tables are just a starting point that can be reconfigured by editing these files. For example, the tableName_columnName format of the property names that you use to access the data is only a default and can be changed.

Reading the data from a SPARQLMotion script

To see how SPARQLMotion can use this data, we're going to create a simple script that gets the database's value for the president of the United States. After running it, we'll update the data using a MySQL tool and then run the SPARQLMotion script again to see the update reflected in the data that TopBraid is using.

Start by creating a new SPARQLMotion file as described at How to: create and run a SPARQLMotion script and call it mysqltest. Create a new SPARQLMotion script for this file, and for its initial module pick sml:ImportRDFFromWorkspace from the sml:ImportFromLocalModules section of the sml:ImportModules category. Call it GetWorldData, and when its icon appears, set this module's sml:sourceFilePath property to worlddb-Test.owl. This was one of the files created earlier by the configuration process, and it imports two others: the worlddb-Schema.owl schema file and the worlddb.d2rq file that serves as a placeholder for the actual data in the MySQL world database. This is the only property that you need to set for this module.

Next, we'll add a module that queries the database for the name of the president of the United States. Add a Bind by select module from the Control Flow section of the SPARQLMotion palette and call it PresidentQuery. Set its sml:selectQuery property to the following query (note how the namespace for the property names is the "Base URI of generated Schema file" value you assigned when configuring how TopBraid would import the MySQL data):

PREFIX w: <http://example.org/worlddb/myschema.owl#>
SELECT ?headOfState
WHERE {
?s w:country_Name "United States" .
?s w:country_HeadOfState ?headOfState .
}

That's the only property to set for this module, so close the Edit PresidentQuery dialog box, click Add connection on the SPARQLMotion palette, and connect the GetWorldData module icon to the new icon.
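To see what the query above asks of the underlying table, here is a rough Python sketch of the relational equivalent. It is only an illustration of the default tableName_columnName naming convention, not of how D2RQ actually translates queries. Python's built-in sqlite3 stands in for MySQL, the one-row table is a tiny stand-in for the world database, and split_property is a hypothetical helper:

```python
import sqlite3


def split_property(local_name):
    """Split a default-style property name such as 'country_HeadOfState'.

    Assumes the table name itself contains no underscore.
    """
    table, column = local_name.split("_", 1)
    return table, column


# A one-row stand-in for the country table of the world database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Name TEXT, HeadOfState TEXT)")
conn.execute("INSERT INTO country VALUES ('United States', 'George W. Bush')")

name_table, name_column = split_property("country_Name")
head_table, head_column = split_property("country_HeadOfState")

# Identifiers come from the fixed strings above; the value is a parameter.
sql = f"SELECT {head_column} FROM {head_table} WHERE {name_column} = ?"
head_of_state = conn.execute(sql, ("United States",)).fetchone()[0]
```

Each triple pattern in the SPARQL query corresponds to one column of the same row: w:country_Name picks the row, and w:country_HeadOfState picks the value to return.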

When you work with a database like this, TopBraid keeps cached copies of the data to speed up later queries on the same data. This has obvious advantages, but it can be a disadvantage if you're more interested in data currency than in quick retrieval times. Our database is small, and we want to see changes reflected as soon as possible, so we're going to add a module that clears out this cache each time the script is run.

Drag a Collect garbage module from the Control Flow section of the palette to your workspace and name it EmptyCache. Set its sml:baseURI property to the URI for the database you want to clear from the cache: http://example.org/worlddb/myDatabase.owl, which is the base URI for the worlddb.d2rq placeholder file. (This is the URI you entered in the "Base URI of Instances (Database)" field when you set up the initial import and configuration.) After configuring this one property, connect the PresidentQuery module icon to the EmptyCache icon.

Select the EmptyCache icon and click the debug icon to run this short script. You'll see in the Console view (a handy view when developing and debugging scripts) that the headOfState variable has the value "George W. Bush", so the MySQL sample database is a bit out of date:

Let's fix this. Update the data using your favorite MySQL tool. For example, from MySQL's command line interface, you could enter this:

mysql> USE world
Database changed
mysql> UPDATE country SET HeadOfState="Barack Obama"
-> WHERE HeadOfState="George W. Bush";

Run the SPARQLMotion script again, and you should see the update reflected in the Console view. If it weren't for the script's EmptyCache module, your SPARQL query might have used a cached copy of the data that did not include this updated value.
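The trade-off between a cache and fresh data is easy to reproduce in miniature. The Python sketch below is a toy model only: sqlite3 stands in for MySQL, and the CachedQuery class with its dictionary is my own stand-in for a cache, not a description of how TopBraid or D2RQ caches data internally:

```python
import sqlite3


class CachedQuery:
    """Look up a country's head of state, remembering earlier answers."""

    def __init__(self, conn):
        self.conn = conn
        self._cache = {}

    def head_of_state(self, country):
        if country not in self._cache:
            row = self.conn.execute(
                "SELECT HeadOfState FROM country WHERE Name = ?", (country,)
            ).fetchone()
            self._cache[country] = row[0] if row else None
        return self._cache[country]

    def clear(self):
        """Forget cached answers so the next lookup hits the database."""
        self._cache.clear()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE country (Name TEXT, HeadOfState TEXT)")
conn.execute("INSERT INTO country VALUES ('United States', 'George W. Bush')")

reader = CachedQuery(conn)
before = reader.head_of_state("United States")
conn.execute(
    "UPDATE country SET HeadOfState = 'Barack Obama' "
    "WHERE HeadOfState = 'George W. Bush'"
)
stale = reader.head_of_state("United States")  # still the cached value
reader.clear()
fresh = reader.head_of_state("United States")  # re-read after clearing
```

The second lookup is fast but out of date; only after clear() does the new value show up, which is the role the EmptyCache module plays in the script.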

As the online help tells us, it's a good idea to be careful with the sml:CollectGarbage module, especially if you're developing a multi-user application where data is frequently updated. If quick access to updated data is not a priority in your application, omitting this module will mean faster response times for your users. It's up to the needs of your application, and it's good to know that you can have it either way.

Instead of opening the worlddb-Test.owl file with your application, you're more likely to open a file that you created that imports one or more of the configuration files generated above (again, read the "Importing Relational Databases with D2RQ" help panel to learn more about what each file can contribute to your application) and perhaps some other files as well. You could add additional data files and RDFS or OWL files describing relationships between the various imported files so that you can do interesting things with the combinations. For example, you could declare the properties that reference MySQL database columns to be subproperties of others declared for your application or declared in a standard vocabulary such as Dublin Core.

Something else that can make your script development easier is importing the generated schema into your script—in this case, dragging the worlddb-Schema.owl file onto the Import view of the mysqltest file. This way, the namespace prefixes will already be defined for you and you can use autocompletion when you enter class and property names in your SPARQL queries.

The sample application here didn't do much on its own, but when you combine the imported data with other data and metadata and take advantage of the other techniques described in this How-to series, the ability to incorporate live relational data in your semantic web applications adds some great possibilities to what your applications can do.

Saturday, February 6, 2010

How to: get data from SPARQL endpoints and save data in a spreadsheet

An updated version of this blog post is now available as a chapter in the TopBraid Application Development Quickstart Guide (pdf).

This entry covers two useful SPARQLMotion techniques: how to retrieve and use data from SPARQL endpoints, and how to save data so that you can open it with a spreadsheet program such as Excel or OpenOffice Calc. While this may look like a lot of material to cover, both techniques are so simple that this is the shortest How to blog entry yet.

Retrieving data from a SPARQL endpoint

A SPARQL endpoint is a server that accepts SPARQL queries using the SPARQL Protocol for RDF and returns the results, typically in the SPARQL Query Results XML Format, a W3C standard XML-based format that is much simpler than RDF/XML. This endpoint can be part of your intranet, providing access for others there to data behind your firewall, but more and more SPARQL endpoints are available on the public internet to provide you with data to use in your applications, and excitement about the new possibilities provided by this data is a big driver of the Linked Data movement. (In a future entry, we'll see how you can create your own SPARQL endpoint with TopBraid Live.)

The ability to retrieve SPARQL endpoint data from within a SPARQLMotion script is not a feature specific to SPARQLMotion, but actually an extension to the Jena framework that will hopefully become part of SPARQL 1.1: the SERVICE keyword. This lets you name a data source right in your query, and the source can be a remote SPARQL endpoint.

To try it out, enter the following query into TopBraid Composer's SPARQL view and execute it. Instead of pulling data from the currently open data graph, it gets it from the SPARQL endpoint specified by the URI http://dbpedia.org/sparql, which is the endpoint for the huge DBpedia collection of triples harvested from the structured infobox data in Wikipedia pages:


PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subsidiaryName ?description
WHERE {
  {
    SERVICE <http://dbpedia.org/sparql> {
      <http://dbpedia.org/resource/IBM> <http://dbpedia.org/ontology/Company/subsidiary> ?ibmSub .
      ?ibmSub <http://dbpedia.org/property/abstract> ?description .
      ?ibmSub rdfs:label ?subsidiaryName .
      FILTER (lang(?description) = "en")
      FILTER (lang(?subsidiaryName) = "en")
    }
  }
}

This query retrieves names and descriptions of IBM subsidiaries from DBpedia and only passes along the English language ("en") ones.
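Under the hood, talking to an endpoint is just an HTTP request. Outside of TopBraid, a client can do the same thing directly: the SPARQL Protocol lets you pass the query text in a query parameter of a GET request. Here is a minimal Python sketch that builds such a request; the function name build_sparql_request and the trivial example query are my own, and the code stops short of sending anything so that it runs without network access:

```python
import urllib.parse
import urllib.request


def build_sparql_request(endpoint, query):
    """Build a GET request that asks an endpoint for XML query results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+xml"}
    )


example_query = "SELECT * WHERE { ?s ?p ?o } LIMIT 1"
request = build_sparql_request("http://dbpedia.org/sparql", example_query)

# To actually send it (requires network access):
# with urllib.request.urlopen(request) as response:
#     results_xml = response.read().decode("utf-8")
```

The SERVICE keyword saves you from doing any of this by hand: the query engine makes the remote call and merges the results into the rest of your query.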

You can use this technique anywhere that you can use a SPARQL SELECT query in your TopBraid development—including in the sml:selectQuery property of a SPARQLMotion Create Spreadsheet module, as we'll see.

Saving data in a spreadsheet

If you search the TopBraid Composer online help for the word "spreadsheet," you'll see that TopBraid offers a variety of ways to read and write spreadsheet data. The simplest way to create a spreadsheet from a SPARQLMotion script is the Create Spreadsheet module. After you enter a SELECT query in its sml:selectQuery property, it sends a tab-delimited spreadsheet to the next SPARQLMotion module with a column for each variable in the select query. (Any spreadsheet program can read a tab-delimited file. OpenOffice Calc doesn't offer "tab-delimited file" as a choice of file format to open, but if you tell it to open a "Text CSV" file and point it at a tab-delimited file, it will find that tab characters are the delimiters and import it properly.) For example, with the SELECT query shown above, this module will create a two-column spreadsheet with subsidiaryName values in the first column and description values in the second.
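For readers who haven't worked with tab-delimited text, here is a rough Python sketch of the idea: one column per query variable, one line per result row. It is not the Create Spreadsheet module's implementation. Whether that module writes a header row is something I'm assuming here, and the sample row in the usage note is invented:

```python
import csv
import io


def to_tab_delimited(variables, rows):
    """Serialize rows (dicts keyed by variable name) as tab-delimited text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(variables)  # header row (an assumption, see above)
    for row in rows:
        # Unbound variables become empty cells.
        writer.writerow([row.get(name, "") for name in variables])
    return buffer.getvalue()
```

Calling to_tab_delimited(["subsidiaryName", "description"], [{"subsidiaryName": "Example Sub", "description": "A made-up row."}]) gives two lines of text with a single tab between the two columns on each line.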

Let's create a SPARQLMotion script that does this. Create a new SPARQLMotion file as described at How to: create and run a SPARQLMotion script, calling the file endpointtsv. Next, select Create SPARQLMotion script from the Scripts menu, and when it asks you to select an initial module type, select sml:CreateSpreadsheet from the sml:TextProcessing modules choices under sml:ProcessingModules. Name the module GetIBMSubs and click OK.

When the icon for your new module appears on the SPARQLMotion workspace, double-click it to configure it and set sml:selectQuery to the query shown above. The sml:CreateSpreadsheet module creates a spreadsheet in memory, not on disk, to provide you with the flexibility of doing other things with its output before you send it to a specific output destination such as a disk file, so set the sm:outputVariable property to the name spreadsheetData so that subsequent modules can reference it. You're finished configuring this module, so click the Close button.

Our SPARQLMotion script will only have one subsequent module: the one that saves the spreadsheet to a disk file. Drag an Export to text file module from the Export to local section of the palette onto the workspace and name it SaveSpreadsheet. Double-click its icon to configure it and set the following properties:

  • Set sml:targetFilePath to ibmsubs.txt.

  • The sml:text property is where you assign the text to store in the file being created; set it to {?spreadsheetData} so that it uses the value of the variable being set in the GetIBMSubs module.

  • Set sml:replace to true so that running the script more than once replaces the existing disk file instead of appending the data as new lines. (Of course, setting sml:replace to false can be very handy in other scripts that you write.)
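The replace-versus-append choice in the last setting maps onto a familiar file-handling distinction. As a loose analogy only (this is ordinary Python file I/O, and save_text is a hypothetical helper, not the Export to text file module):

```python
def save_text(path, text, replace=True):
    """Write text to path, overwriting when replace is true, else appending."""
    mode = "w" if replace else "a"
    with open(path, mode, encoding="utf-8") as handle:
        handle.write(text)
```

With replace=True, each run leaves only the latest output in the file; with replace=False, output accumulates across runs, which suits log-style files.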

Click the Close button, and then click on Add connection on the palette. Connect your two icons, and you're finished with your simple little script:

Save it, select the SaveSpreadsheet icon, and click the debug icon to run your script. You should see an ibmsubs.txt file appear in the same directory as the script itself, because you didn't specify a path name when you configured the sml:targetFilePath property. (You're welcome to include a pathname with the file in that property.)

Open up this new file in a spreadsheet program, and then you can use any of the program's formatting features that you like before saving the spreadsheet using the program's native format. Here are the first few rows of the spreadsheet in Microsoft Excel:

If you combine these two techniques with others covered in this How-to series, you could have all of this triggered by a web service call, or the script could combine the data retrieved from a SPARQL endpoint with other data from other sources before saving the spreadsheet, or it might output the data in another of the destinations available in the "Export to" sections of the SPARQLMotion workspace palette. Take another look at those palette choices, or better yet, review the help page for them; I'm sure you'll get some great ideas.

This is a blog by TopQuadrant, developers of the TopBraid Suite, created to support the pursuit of our ongoing mission - to explode strange semantic myths, to seek out new models that support a new generation of dynamic business applications, to boldly integrate data that no one has integrated before.