Google-like search engines can find masses of information in unstructured documents. But ‘masses’ doesn’t always equate to ‘useful’. How do you ensure that a ‘well’ means the same thing across all the material you are searching? Or, as the web’s architect-in-chief Tim Berners-Lee would put it, how do you build the next-generation ‘semantic web’? The answer, according to a new report from Ark Group, lies in the methodical implementation of taxonomy.
Taxonomy is about building lists of key words and using these to organize and search documents. The granddaddy of taxonomies is the Dewey Decimal Classification, used by librarians to determine where to put books on shelves. Modern taxonomies can be more complex, with hierarchies and linked lists, although there is a lot to be said for keeping your taxonomies simple. The science of taxonomy, naturally enough, uses some pretty fancy words itself. A thesaurus is just a list of words used in the taxonomy. But ontologies? These define relationships between the members of a taxonomy: ‘belongs to’, ‘is a parent of’ etc. Knowledge maps, or topic maps, can be used to give visual representations of taxonomies.
But definitions are not easy. As ‘Taxonomies’ reveals, “The problem is that the definitions of taxonomies, ontologies and knowledge maps tend to overlap. The irony is that one of the most important attributes of taxonomy is the exact and unambiguous application of terminology.” Taxonomies can be used to drive document creation by making sure that consistent terminology is used and that documents are classified up-front in a meaningful way. Failing this, when confronted with a morass of legacy or public domain documentation, software can be used to ‘seek meaning’ by applying taxonomies to intelligent searching.
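To make the vocabulary above a little more concrete, here is a minimal Python sketch of a hierarchical taxonomy, a thesaurus that maps variant words to preferred terms, and a keyword tagger that classifies free text against them. It is an illustration only and is not taken from the book: the class `Term`, the function `tag_document` and the sample terms (‘facility’, ‘well’, ‘pipeline’ and their synonyms) are all hypothetical.

```python
import re


class Term:
    """A preferred term in a hierarchical taxonomy."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # ontology-style relation: "belongs to"
        self.children = []        # inverse relation: "is a parent of"

    def add_child(self, name):
        """Create a child term, link it both ways, and return it."""
        child = Term(name, parent=self)
        self.children.append(child)
        return child

    def path(self):
        """Return term names from the root down to this term."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


def tag_document(text, thesaurus, terms):
    """Return the sorted classification paths of every term the text mentions.

    thesaurus maps each variant word to a preferred term name;
    terms maps each preferred term name to its Term node.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    found = {thesaurus[w] for w in words if w in thesaurus}
    return sorted(" > ".join(terms[name].path()) for name in found)


# Hypothetical example data, not taken from the book.
root = Term("facility")
well = root.add_child("well")
pipeline = root.add_child("pipeline")
terms = {t.name: t for t in (root, well, pipeline)}
thesaurus = {"well": "well", "borehole": "well", "wellbore": "well",
             "pipeline": "pipeline", "flowline": "pipeline"}
```

Calling `tag_document("The borehole was logged.", thesaurus, terms)` files the text under the preferred term ‘well’ even though that word never appears in it. Note that a plain keyword match like this one would also tag ‘well’ used as an adverb, which is exactly the kind of ambiguity that the more elaborate tools are meant to resolve.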
Theory apart, ‘Taxonomies’ starts to get seriously useful (as opposed to stimulatingly interesting) around page 50 (a little late in the day for an 80-page book!) with a short review of commercial taxonomy software. Software-based classification can be manual (Multites), with a visual tree structure of terminology available to the manual indexer. Alternatively, standard templates can be customized for use in specific industries (Sun One). Finally, automated classification (Autonomy etc.) can extract meaning from free text. The section concludes with a discussion of various automated taxonomy engines from Autonomy, Verity and GammaWare (a table compares the functionality of products from 20 or so vendors), along with pointers to research on vector analysis and ‘rough sets’. This review of off-the-shelf software comes with a health warning. As one quoted ‘expert’ says, “The solution is not to buy software that will do a poor job of classification. The answer is to start creating less content of a higher quality – and to integrate classification into the content production process. But can your people write well to begin with? Do they have the skills to write clear headings and organize their documents into a logical and readable manner?”
The answer to the last question, so far as ‘Taxonomies’ is concerned, is unfortunately no. ‘Taxonomies’ is jam-packed with enlightening quotes, but these are all slotted under the same header, ‘what the experts say’. This wouldn’t be so bad if the overall structure of the book were better organized. The one exception is the executive summary which, for those of you in need of some rapid boning-up on this trendy topic, is probably worth the purchase. In summary, if you are looking for a kick-start cookbook to using XML to build the semantic web, this is not the book for you. ‘Taxonomies’ offers a philosophical ‘divertissement’ with a multitude of inspirational quotes ‘from the experts’ and a very top-level overview of taxonomy-oriented software. But the review is more of a ‘bluffer’s guide’ than a textbook.
Taxonomies: Frameworks for corporate knowledge. Jan Wylie, Ark Group. ISBN 0-9543897-1-9. email@example.com.
This article originally appeared in Oil IT Journal 2003 Issue # 4.