Taxonomy Boot Camp, London. NISO, understanding metadata

Taxonomies for publishing, project management, NLP. Primer provides state-of-the-art snapshot.

Speaking at the recent Taxonomy Boot Camp in London, Michael Upshall (Unisilo) showed how two project management (PM) taxonomies were mashed up to underpin Gower Publishing’s internal project management community. The community offers enhanced term-based search across the Gower PM body of knowledge, which comprises over 100 published books on the topic.

In what Upshall claims as a world first, the two industry-leading PM taxonomies, the UK’s PRINCE2 and the US PMBOK, have been combined into a common terminology of some 250 terms. Tools of the trade include Drupal (website construction) and Apache Solr (automated tagging and search). The solution supports natural language queries such as ‘What is the role of the project sponsor?’, search by author and subject, and search across related content generated via the taxonomy.
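The idea of folding two vocabularies into one common terminology and using it to tag a query can be sketched as follows. This is a minimal illustration only: the two schemes, their terms and the function names are hypothetical and are not taken from PRINCE2, PMBOK or the Gower system, which relies on Drupal and Solr rather than on code like this.

```python
import re

# Hypothetical sample mappings: source term -> common preferred term.
SCHEME_A = {"project executive": "project sponsor", "stage plan": "phase plan"}
SCHEME_B = {"project sponsor": "project sponsor", "sponsor": "project sponsor",
            "phase plan": "phase plan"}


def build_common_vocabulary(*schemes):
    """Merge several term -> preferred-term maps into one lookup table."""
    common = {}
    for scheme in schemes:
        for term, preferred in scheme.items():
            common[term.lower()] = preferred.lower()
    return common


def tag_text(text, vocabulary):
    """Return the sorted preferred terms whose variants occur in the text."""
    lowered = text.lower()
    found = set()
    for term, preferred in vocabulary.items():
        # Whole-word match so that short terms do not fire inside longer words.
        if re.search(r"\b" + re.escape(term) + r"\b", lowered):
            found.add(preferred)
    return sorted(found)


vocabulary = build_common_vocabulary(SCHEME_A, SCHEME_B)
tags = tag_text("What is the role of the project executive?", vocabulary)
```

Here a question phrased in one scheme's wording is tagged with the shared preferred term, which is what lets a single search span content written against either taxonomy.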

Charlie Hull described how Flax built a free, open source taxonomy classifier in 10 days, leveraging a ‘brace’ of open source software including Apache Lucene/Solr, Python and jQuery. (Stanford’s natural language processing (NLP) software was tried but found wanting.) Flax uses search-based classification, seeking key terms in documents and using Solr’s ‘MoreLikeThis’ feature to extend hits. The end result was liked by the client, but not by the IT ‘powers that be’, who were opposed to using open source software.
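Search-based classification of this kind can be illustrated with a short sketch that scores each taxonomy category by counting its key terms in a document. The taxonomy, the sample text and the function name are assumptions made for illustration. Plain Python string matching stands in for the Solr index, and the ‘MoreLikeThis’ expansion step is not reproduced.

```python
import re
from collections import Counter

# Hypothetical taxonomy: category -> key terms to look for.
TAXONOMY = {
    "risk": ["risk register", "mitigation", "contingency"],
    "scheduling": ["critical path", "milestone", "gantt"],
}


def classify(document, taxonomy, min_hits=1):
    """Score each category by the number of key-term occurrences found."""
    text = document.lower()
    scores = Counter()
    for category, terms in taxonomy.items():
        for term in terms:
            hits = len(re.findall(r"\b" + re.escape(term) + r"\b", text))
            if hits:
                scores[category] += hits
    # Highest-scoring categories first, dropping those below the threshold.
    return [(cat, n) for cat, n in scores.most_common() if n >= min_hits]


doc = "The risk register lists each mitigation. One milestone slipped."
result = classify(doc, TAXONOMY)
```

Raising `min_hits` trades recall for precision, which is the kind of tuning a search-based classifier typically needs before its tags are trusted.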

Brendan Clarke went one step further, applying machine learning to document tagging. Clarke is a Microsoft content management guru and co-founder of TermSet, a SharePoint metadata and taxonomy add-on. The cloud-based system uses NLP to create taxonomies from information inside documents. Human tagging is expensive. For Clarke, NLP is the future of text analytics.

Sukaina Bharwani showed how PoolParty’s Semantic Suite has been used to create the Climate Tagger, a multi-language thesaurus of concepts and terms relating to energy efficiency, renewables and climate change. Climate Tagger data is available from a portal or as machine-readable linked open data. The PoolParty API exposes NLP functionality, push requests and statistics on activity, trending concepts and recommendations. Under the hood are Drupal, WordPress and the CKAN open source data portal.


A recent ‘Primer Publication’ from the US National Information Standards Organization (NISO), ‘Understanding Metadata’ (UM) by Jenn Riley, describes how metadata has become a household word following the NSA leak. Google’s knowledge graph contains 3.5 billion metadata elements on 500 million people, places and things. UM presents the intricacies of the semantic web/linked data/RDF approach with reference to machine-to-machine content negotiation. The various deployment options (OWL, SKOS, Dublin Core and FOAF) are treated with a succinct historical background and useful code examples. UM concludes with an interesting editorial on the future of metadata, which Riley sees as a continuing shift towards graph-based, sparse data collections as opposed to the rigid structure of the database. The 50-page primer is a free download from NISO.
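To give a flavor of the graph-based approach, the sketch below writes a few Dublin Core and FOAF statements as N-Triples. It is not taken from the primer. The example.org identifiers and the helper names are hypothetical, while the two namespace IRIs are the published Dublin Core Terms and FOAF ones.

```python
DCTERMS = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"


def literal(value):
    """Quote a plain string literal, escaping backslashes and double quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def triple(subject, predicate, obj):
    """Format one N-Triples statement; obj is an <IRI> or a quoted literal."""
    return f"<{subject}> <{predicate}> {obj} ."


# Hypothetical identifiers for the document and its author.
doc = "http://example.org/doc/understanding-metadata"
author = "http://example.org/person/jenn-riley"

graph = [
    triple(doc, DCTERMS + "title", literal("Understanding Metadata")),
    triple(doc, DCTERMS + "creator", f"<{author}>"),
    triple(author, FOAF + "name", literal("Jenn Riley")),
]

print("\n".join(graph))
```

Each statement stands alone, so a record carries only the properties that are known. This is the sparse, graph-style structure that contrasts with a fixed database schema.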


© Oil IT Journal - all rights reserved.