Book review - Linked Data: storing, querying, reasoning

A new book describes the evolution of the web from a web of ‘linked documents’ into a web of linked data, and the potential that semantic technologies have for making sense of it all. Plethoric, rather impenetrable, research avenues are described, as semantics meets up with the big data movement. Making sense of ‘heterogeneous data from different sources’ appears to remain an elusive goal.

The semantic web, Tim Berners-Lee’s notion to ‘bring structure to the meaningful content of the web’ and allow ‘software agents to roam from page to page and carry out sophisticated tasks for users’, was first mooted around 2000. In the early days, a multitude of books were published elaborating on the merits and magic that was to flow from RDF, the seemingly straightforward resource description framework. For our sins, we were on the semantic bandwagon from the get-go (almost) and have over the years traced the evolution of the movement in its slide down the hype curve and into its current resting place in academia. One thing that characterizes these early publications (and indeed much of today’s writing on technology) is a focus on perceived or anticipated benefits. When we spotted the new publication on Linked Data*, we wondered whether this was going to be more of the same or whether it might provide enlightenment as to the direction the semantic web has taken since it was reborn as ‘linked data’. In what follows we capitalize ‘Linked Data’ when referring to the book. A non-capitalized ‘linked data’ refers to the subject of … linked data.

The 236-page publication begins with an explanation of how the world wide web ‘has evolved from a web of linked documents to a web including linked data’. ‘The adoption of linked data technologies has shifted the web from a space of connecting documents to a global space where pieces of data from different domains are semantically linked and integrated to create a global web of data. Linked data enables operations to deliver integrated results as new data is added to the global space. This opens new opportunities for applications such as search engines, data browsers, and various domain-specific applications. … The web of linked data contains heterogeneous data coming from multiple sources and various contributors, produced using different methods and degrees of authoritativeness, and gathered automatically from independent and potentially unknown sources.’ This clearly makes ‘linking’ heterogeneous data sets tricky. Indeed …

Such data size and heterogeneity bring new challenges for linked data management systems. While small amounts of linked data can be handled in-memory or by standard relational database systems, big linked data graphs, which we nowadays have to deal with, are very hard to manage.

The authors seem to conflate heterogeneity and size and, from this point on, Linked Data becomes a discussion of ‘modern’ database technology, big data and graph technology. The critical issue of heterogeneity, and the near impossibility of ‘reasoning’ across inconsistent data sets, seem to take a back seat. In fact, the examples given perpetuate the awkward manner in which RDF captures essential metadata such as units of measure. For instance …

Let us revisit two types of time labels for representing the time information of RDF stream elements. An interval-based label is a pair of timestamps, commonly natural numbers representing logical time. A pair of timestamps, [start, end], is used to specify the interval in which the RDF triple is valid. For instance, :John :at :livingroom, [7, 9] means that John was at the “living room” from 7 to 9.
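
By way of illustration, the interval-based label in the passage above might be modeled as follows. This is our own minimal Python sketch, not code from the book: the class name IntervalTriple, its field names and the valid_at helper are hypothetical, and we assume that the interval is closed at both ends.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntervalTriple:
    """An RDF triple carrying an interval-based time label [start, end]."""
    subject: str
    predicate: str
    obj: str
    start: int  # logical time: a bare natural number, as in the quoted example
    end: int

    def __post_init__(self):
        # Timestamps are natural numbers and the interval must not be reversed.
        if self.start < 0 or self.end < self.start:
            raise ValueError("expected 0 <= start <= end")

    def valid_at(self, t: int) -> bool:
        """True if the triple holds at logical time t (closed interval assumed)."""
        return self.start <= t <= self.end


# The example from the quoted passage: :John :at :livingroom, [7, 9]
john = IntervalTriple(":John", ":at", ":livingroom", 7, 9)
print(john.valid_at(8))   # a time inside the interval
print(john.valid_at(10))  # a time outside the interval
```

Note that nothing in this structure says what the numbers 7 and 9 measure.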

7 to 9 what, one asks: seconds, minutes, days? This may seem trivial, but the treatment given to data in the RDF world is almost always idiosyncratic and error-prone, not at all suitable for engineering use. There is no discussion of how engineering units are managed in a consistent manner in Linked Data. This is unfortunate as one of the purported extensions of linked data is in the field of streaming sensor data. It appears that ‘current linked data query processing engines are not suitable for handling RDF stream data and … the most popular data model used for stream data is the relational model’. Nonetheless ...

The research trend for RDF stream data processing has been established as the main track in the Semantic Web community. With sensors getting deployed everywhere, the availability of streaming data is increasing. Handling such time-dependent data presents some unique challenges. Along with RDF triples, sometimes provenance information such as source of the data, date of creation, or last modification is also captured. In large RDF graphs, adding provenance information would only make the graph larger. Suitable mechanisms are needed to handle and manage provenance information.

Well, provenance is an interesting field. Linked Data has almost 300 references and a whole chapter devoted to the subject.

… provenance has been of concern within the linked data community where a major use case is the integration of datasets published by multiple different actors… The unconstrained publication, use, and interlinking of datasets that is encouraged by the linked open data initiative is both a blessing and a curse.

Linked Data covers various provenance models from Dublin Core, the W3C and others. The section is stuffed full of references and includes a pointer to the W3C’s publication on the subject. Describing provenance introduces new words (‘the monus operator’) and new symbols and structures (‘the m-semiring’, the ‘seba-structure’). The section on RDF provenance dives in with an explanation of the multiple gotchas of embedding provenance in RDF and querying data sets with SPARQL. But all this is in a rather impenetrable tone that may be understandable to the specialist but not to us!

Other chapters cover ‘storing’, ‘querying’ and ‘reasoning’ linked data at a similar academic level. The conflation with the ‘big data’ movement permeates the book, with plethoric references and discussion of topics revolving around Hadoop, Spark etc. There are over 400 references cited, along with IT-related issues of partitioning, in-memory (or not) processing and caching…

Our conclusion is that Linked Data is a large collection of research avenues and references in the field that are rather impenetrable for the outsider. It is different from those early books on the semantic web in that it does not proselytize (too much), but rather enumerates so many different research avenues that it is hard to see the wood for the trees. Moreover, if the semantic web has not taken off, it is at least partly because there is no straightforward, widely accepted way to unequivocally embed real data in a web page. Until that happens, the academics will go on researching techniques that, when they are confronted with the inconsistent data that makes up the web, are doomed to fail.

* Linked Data: storing, querying, reasoning by Sherif Sakr (King Saud bin Abdulaziz University), Marcin Wylot (TU Berlin), Raghava Mutharaju (GE Global Research), Danh Le Phuoc (TU Berlin), Irini Fundulaki (ICS Greece). Springer 2018 ISBN 978-3-319-73514-6 ISBN 978-3-319-73515-3 (eBook).

This article originally appeared in Oil IT Journal 2019 Issue # 4.
