The UK chapter of the International Society for Knowledge Organization (ISKO) organized a Great Debate last month on the role of the ‘traditional thesaurus’ in modern information retrieval. The essential question for knowledge organizers is: do we need to index and classify documents, or can retrieval simply be left to search?
Arguing for search, Judi Vernau (Metataxis) observed that the use of thesaurus standards such as ISO 25964 in document indexing and retrieval has declined, although it remains ‘a small part of the story.’ There is now less emphasis on a narrow, constrained thesaurus and more of a search for a broader set of relationships. Document classification used to be done by professionals, but the World Wide Web has changed everything. Librarians trained in the art of information retrieval have been replaced by end users who just want to find something!
While the classical thesaurus may be overkill, tagging a corporate document archive will help retrieve stuff in context. The question is how much effort to put into tagging and how to do it. The key is to make tagging easy, with a useful semantic structure that is simple to navigate. This is not the case for a traditional thesaurus, which contains too many terms and constraints. While there are good reasons to conform to a standard, there are also good reasons not to, and to slacken off some constraints.
Automated entity extraction may or may not work, but it might be an improvement over the status quo. Dictionaries and whitelists can help, so long as your search is tuned to take advantage of them. Unfortunately, some commercial content management systems can’t even handle a three-level-deep hierarchy, let alone more complex relationships. There will always be a tension between the desire for smarts and the need to dumb down. Currently the thesaurus is a middle way that does not do the job. You need to be more flexible and more sophisticated.
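To make the dictionary and whitelist idea concrete, here is a minimal Python sketch of whitelist-driven tagging. It is an illustration only, not a method presented at the debate: the whitelist entries, the tag names and the tag_document function are all hypothetical, and a real system would need stemming, disambiguation and a much larger vocabulary.

```python
import re

# Hypothetical whitelist: surface form (lower case) -> preferred tag.
# These entries are invented for illustration only.
WHITELIST = {
    "seismic survey": "Seismic acquisition",
    "3d seismic": "Seismic acquisition",
    "well log": "Well logging",
    "wireline log": "Well logging",
}


def tag_document(text, whitelist=WHITELIST):
    """Return the sorted preferred tags whose surface forms occur in text."""
    lowered = text.lower()
    tags = set()
    for surface, preferred in whitelist.items():
        # Word boundaries stop 'well log' from matching inside 'swell logic'.
        if re.search(r"\b" + re.escape(surface) + r"\b", lowered):
            tags.add(preferred)
    return sorted(tags)
```

Mapping several surface forms onto one preferred tag is what lets a tuned search engine treat ‘3D seismic’ and ‘seismic survey’ as the same topic without asking users to learn a controlled vocabulary.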
Vanda Broughton (University College London) defended the whole concept of the thesaurus, which is ‘alive and well and fighting back,’ especially in its looser, less constrained manifestation. The process of building a thesaurus teaches us about the domain, its concepts and relationships, and can help build a first-pass formal model of the domain. The information modeling that underlies a thesaurus should not be lightly abandoned. It is key to semantic web applications, machine reasoning and fuzzy logic. The controlled vocabulary, search tools, domain models and other terms ‘all point to the thesaurus.’
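The concepts and relationships that a thesaurus captures can be sketched in a few lines of Python. The example below is a hedged illustration, not an implementation of ISO 25964: the Concept and Thesaurus classes and the sample terms are hypothetical, and the fields loosely mirror the preferred label, alternative label, broader and related relations familiar from thesaurus practice.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    pref_label: str                               # preferred term
    alt_labels: set = field(default_factory=set)  # synonyms / entry terms
    broader: set = field(default_factory=set)     # broader-term links
    related: set = field(default_factory=set)     # associative links


class Thesaurus:
    """Hypothetical minimal thesaurus keyed by preferred label."""

    def __init__(self):
        self.concepts = {}

    def add(self, pref_label, alt_labels=(), broader=(), related=()):
        self.concepts[pref_label] = Concept(
            pref_label, set(alt_labels), set(broader), set(related)
        )

    def narrower(self, pref_label):
        # Narrower terms are derived by inverting the broader links.
        return {
            c.pref_label
            for c in self.concepts.values()
            if pref_label in c.broader
        }

    def expand(self, pref_label):
        """Collect a term, its synonyms and all transitively narrower terms."""
        labels, seen, stack = set(), set(), [pref_label]
        while stack:
            label = stack.pop()
            if label in seen or label not in self.concepts:
                continue
            seen.add(label)
            labels.add(label)
            labels |= self.concepts[label].alt_labels
            stack.extend(self.narrower(label))
        return labels
```

With ‘cars’ (synonym ‘automobiles’) filed under ‘vehicles’, calling expand("vehicles") returns all three labels, which is the kind of query expansion a search engine can borrow from even a loosely constrained thesaurus.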
Consultant Helen Lippell, arguing for search, observed that it is hard today to make a business case for a thesaurus and that projects tend to mushroom out of control. Modern content management systems provide inbuilt tools such as automated suggestion lists and constraints, reducing the amount of expert time required compared with a full-blown thesaurus. What’s needed is a balance between such simple tools for tagging and annotating on the one hand and, on the other, ontologies, billion-triple stores and semantic web technology ‘that is really hard to implement.’
Leonard Will (Willpower) is a thesaurus believer, mentioning in particular the semantic web’s SKOS tagging scheme. Wikipedia has a thesaurus, as do hard-nosed organizations like eBay and Amazon. Combining the thesaurus with geographic information systems brings even more benefits. The pro-search, anti-thesaurus motion was defeated resoundingly.
© Oil IT Journal - all rights reserved.