The semantic web is here now!

Last month, Oil IT Journal launched its RSS-enabled news feed. Without realizing it, we were joining, like many other publishers of news sites and blogs, the periphery of a new, improved ‘semantic web’. Editor Neil McNaughton has been boning up on other semweb developments to find tons of interesting stuff. If standard nomenclature, catalogues and metadata are your bag, the semweb has a lot to offer, once you have deciphered the W3C’s terminology.

When you search Google, clever, but very simple technology returns what is usually an exceedingly large amount of information. A search for ‘POSC’ will return a boatload of information from the Center for Poultry Science, the Department of Polymer Science and Structural Chemistry of the Free University of Brussels, information about ‘pop-on screw covers’ (hey—I need some of those for the bathroom!), various political science faculties and a few references to the Petrotechnical Open Standards Consortium.

Semantic Web

The problem is the power of full text search à la Google—which effortlessly seeks every occurrence of a word in some 3.3 billion web pages in under a second—an embarrassment of richesse if there ever was one. A while back, Tim Berners-Lee (Sir Tim) decreed that all this needed fixing and called for the creation of the Semantic Web—where somehow or other, documents would identify themselves—by exposing information about who wrote them, when, what was inside—and whether ‘POSC’ meant poultry, politics or petrol.

Ontology?

This sparked off a plethora of World Wide Web Consortium (W3C) working groups bent on realizing the Semantic Web. How’s it going? If you visit the semweb home page on w3c.org you’ll see a couple of definitions and an impressive list of workgroups that are developing ontologies, vocabularies, and ways of describing resources and data. In fact, just as we in the upstream consider a knowledge-information-data (KID) continuum, Berners-Lee takes a very broad church approach to the semweb, defining it as ‘the representation of data on the web’. Berners-Lee’s target for the semantic web is not so much the document as the relational database.

Why not XML?

This raises the question: if the semweb is about data, like data in databases, then what is wrong with using XML? Berners-Lee has an answer to that: XML files from two different applications can’t understand each other’s data and will never interoperate. WITSML is a parallel universe to, for instance, WellLogML. Wouldn’t it be nice if there was a way of writing XML to embed at least some data that was universally sharable and understandable outside of a ‘closed’ XML environment?

RDF

There is, thanks to the semantic web. The W3C’s Resource Description Framework (RDF) is just that: a relatively straightforward way of representing a data element, one that lets folks extract meaningful information without having to read and understand (to ‘parse’, in the jargon) the whole XML file. RDF statements are simple, coming in three parts: a resource, a property, and a value. An RDF statement could be ‘example.org/index.html has a creator whose value is John Smith’*.
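
For the programmers in the audience, the statement above can be sketched in a few lines of Python. This is only an illustration: the Triple class and the to_ntriples helper are made-up names, and the Dublin Core ‘creator’ URI is assumed here as the property, as it is a common choice for authorship metadata.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: a resource, a property, and a value."""
    resource: str  # the thing being described, named by a URI
    prop: str      # the property, itself named by a URI (assumed: Dublin Core)
    value: str     # a plain literal here; it could also be another URI


def to_ntriples(t: Triple) -> str:
    """Render the statement as a single line in the N-Triples style."""
    # Escape backslashes and double quotes so the literal stays well formed.
    escaped = t.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{t.resource}> <{t.prop}> "{escaped}" .'


statement = Triple(
    resource="http://www.example.org/index.html",
    prop="http://purl.org/dc/elements/1.1/creator",
    value="John Smith",
)
line = to_ntriples(statement)
print(line)
```

The point of the exercise is that the statement stands on its own. Nothing about the rest of the file needs to be understood to read it.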

Triple

This RDF triple is made up of a resource—the example.org document, which has a property ‘creator’ whose value is ‘John Smith’. So simple it is almost dumb. But the neat bit is the way an RDF statement points up and out of an XML document to a reference source—or ‘namespace’—which defines terms and acts as a union between different XML documents. It is not too hard to see how a few RDF statements added to WITSML and WellLogML will both point to the same namespace and fix once and for all the horrible problems of agreeing on a name for a well. A DTI RDF namespace for the North Sea? Remember, you read it here first!
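
A rough sketch of why the shared namespace matters, again in Python. Everything named here is hypothetical: the example.org namespace, the well identifiers and the property names are invented for illustration and are not taken from WITSML, WellLogML or any real registry.

```python
# Hypothetical shared namespace for well names; no real registry is implied.
WELLS = "http://example.org/wells#"

# Statements as they might be pulled from two unrelated XML documents.
# Each is a (resource, property, value) triple; property URIs are invented.
drilling_statements = [
    (WELLS + "well-001", "http://example.org/drilling#currentDepth", "2450"),
]
logging_statements = [
    (WELLS + "well-001", "http://example.org/logging#curve", "GR"),
    (WELLS + "well-002", "http://example.org/logging#curve", "DT"),
]


def merge_by_resource(*sources):
    """Group statements from any number of sources by the resource described."""
    merged = {}
    for statements in sources:
        for resource, prop, value in statements:
            merged.setdefault(resource, []).append((prop, value))
    return merged


merged = merge_by_resource(drilling_statements, logging_statements)
for resource, facts in merged.items():
    print(resource, facts)
```

Because both sources name the well with the same URI, the merge needs no knowledge of either XML format. The two facts about well-001 simply land in the same bucket.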

Metadata

While Berners-Lee wants to use RDF to expose all data, including relational databases, the quick wins for RDF are in sharing metadata. The more ‘top level’ your data is, the easier it is to reach agreement on what to share and how. Which leads me on to the real reason I started out on this topic: RSS, which is probably the first prime-time application of the semweb anywhere.

Content management

As you know, Oil IT Journal is available in print, online and in the form of a monthly headlines distribution. When we are through writing the paper version (in Microsoft Publisher) a sophisticated cut and paste exercise ensues, involving Dreamweaver, FrontPage and a good deal of cussing—to produce a standardized html version. This is then processed by a load of Visual Basic apps to update the website, print out the headlines edition and generate the Corporate online editions.

RSS

Little did I realize that the format for headlines distribution is topologically identical to a blog (a web log) and that there was a whole universe of news feeds out there written in the same style, but using what is variously described as ‘Really Simple Syndication’ or ‘RDF Site Summary’. A couple of extra lines of VB code and we had our own newsfeed up and running. A metadata tag in the oilIT.com home page tells robots where to look for the RSS feed and whaddya know, we had 2,000 hits to that little chunk of XML this month! Is that semantic or what!
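
For the curious, here is roughly what such a feed generator might look like. This is not our Visual Basic code but a minimal Python sketch under a few assumptions: it follows the RSS 2.0 layout (one of several flavours), the function names and the example.org URLs are hypothetical, and the discovery tag uses the link element that is commonly used to advertise a feed.

```python
import xml.etree.ElementTree as ET
from html import escape


def build_feed(title, link, description, headlines):
    """Build a minimal RSS 2.0 document from (title, url) headline pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = description
    for item_title, item_link in headlines:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = item_title
        ET.SubElement(item, "link").text = item_link
    return ET.tostring(rss, encoding="unicode")


def discovery_tag(feed_url, title):
    """Return an HTML head tag telling robots where the feed lives."""
    return (
        '<link rel="alternate" type="application/rss+xml" '
        f'title="{escape(title, quote=True)}" '
        f'href="{escape(feed_url, quote=True)}">'
    )


# Hypothetical headlines and URLs, for illustration only.
feed_xml = build_feed(
    "Example headlines",
    "http://example.org/",
    "Monthly headlines",
    [
        ("First story", "http://example.org/1"),
        ("Second story", "http://example.org/2"),
    ],
)
tag = discovery_tag("http://example.org/feed.xml", "Example headlines")
print(feed_xml)
print(tag)
```

The feed itself is just a channel with a title, a link, a description and a list of items, which is why a headlines edition maps onto it so easily.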

* Example from the W3C’s excellent RDF Primer.

© Oil IT Journal - all rights reserved.