Tuesday, May 25, 2010

Beyond Keyword Searching

Sometimes we put documents into storage for safe-keeping. We want them to be available if we should ever need them, but we are not expecting to review them on a regular basis. Tax filings, expired contracts and wills fall into this category. In a business environment, though, there are many documents we need to look at regularly or be able to retrieve quickly. There is nothing more frustrating than spending several hours hunting for a document you know is out there somewhere when you can’t remember where it was filed, and countless studies have shown that we all spend a significant part of our working lives looking for information.

When SharePoint (and similar document management software) was first introduced, it seemed to offer a solution: behind-the-scenes text indexing (so users didn’t have to do anything other than upload their documents) and a really fast search engine that allowed users to retrieve documents based on the words in the text and a few key metadata fields such as title, author and folder name. However, while keyword searching is very effective in extremely large, highly heterogeneous information environments like the internet as a whole (Google being a case in point, and even they modify this approach for other services such as Shopping), it has significant limitations in more focused environments such as business operations, where one of the primary needs is to group together like documents and separate them from unlike documents.

Without some form of tagging, even quite simple-looking searches are not straightforward, because the language used to describe business concepts is not standardized. For example, the HR Department might be referred to as HR, Human Resources or Personnel. A project might be referred to by a project number, the client name, the project name, some abbreviation of the project name and so on. It is for this reason that most blogging software (such as this one) enables postings to be tagged/coded.

Beyond variation in terminology, there is the problem that nowhere in office documents is the purpose of the document automatically recorded. For example, there is no automatic way to distinguish a Word document that is a contract from one that is a proposal, or an internal PowerPoint presentation from an external one produced for a client meeting. Categorizing documents in this way requires human intervention and a document classification system that is agreed across the business entity.
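
To make the terminology problem concrete, here is a minimal Python sketch of what a tag does that a keyword cannot: it maps every variant of a name onto one agreed value. The synonym table and the function names are hypothetical, made up for illustration, and are not part of SharePoint or any other product.

```python
# Hypothetical synonym table: every variant maps to one agreed (canonical) tag.
SYNONYMS = {
    "hr": "Human Resources",
    "human resources": "Human Resources",
    "personnel": "Human Resources",
}


def canonical_tag(label: str) -> str:
    """Return the agreed tag for a label, or the tidied label if it is unknown."""
    cleaned = " ".join(label.split())  # trim and collapse stray whitespace
    return SYNONYMS.get(cleaned.lower(), cleaned)


def matches(query: str, document_tags: set[str]) -> bool:
    """True when the query resolves to a tag carried by the document."""
    return canonical_tag(query) in document_tags
```

With this in place, a search for "Personnel" finds a document tagged "Human Resources", which a plain keyword match on the document text would miss unless the word itself appeared.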

SharePoint 2007 began to address some of the limitations of keyword searching by enabling documents to be tagged (or coded) on upload. Appropriate values for the tags/codes could be set up in lists (or, for the more sophisticated, as BDCs to a database) that would appear to users as drop-down menus or, if there were few enough values, as checkboxes or radio buttons. User compliance could be enforced by making tagging mandatory, so that documents couldn’t be uploaded unless appropriate values had been selected. However, this tagging could only be managed at the site level, which made enforcing standard values and classification systems across a business entity with many site collections, let alone sites, too labor-intensive.
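
The mandatory-tagging rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not SharePoint's implementation: the field names, the permitted values and the `validate_upload` function are all assumptions.

```python
# Hypothetical required fields and their permitted values (the "drop-down" lists).
ALLOWED_VALUES = {
    "Department": {"Human Resources", "Finance", "Engineering"},
    "Document Type": {"Contract", "Proposal", "Presentation"},
}


def validate_upload(tags: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the upload may proceed."""
    problems = []
    for field, allowed in ALLOWED_VALUES.items():
        value = tags.get(field)
        if value is None:
            problems.append(f"missing required tag: {field}")
        elif value not in allowed:
            problems.append(f"invalid value for {field}: {value}")
    return problems
```

An upload tagged only with a Department of "HR" would be rejected twice over: "HR" is not one of the agreed values, and the Document Type is missing.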

SharePoint 2010 has extended its coding/tagging functionality in a variety of ways. It has introduced centralized coding management (aka Managed Metadata) that can be applied across an entire site collection. The Taxonomy Term Store (accessible to users with site administrator permissions) enables lists of terms to be created or imported (see figures 1 and 2 below), which can then be applied across all sites in a collection. Examples of taxonomies that can usefully be managed in this way are departments, geographic regions, project names, product names and sizes/units. Once a term list has been made available across the site collection, it can be included as a properties column in any document library in that collection (see figure 3 below) and made available as a metadata filter for searching.
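
As a rough sketch of how a centrally managed term list can drive a metadata filter, the Python below models a small term set and uses it to filter a document library. It assumes terms can be nested, so that filtering on a parent term also picks up its children. The `Term` class, the region names and the library contents are all hypothetical, and none of this is the SharePoint API.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    children: list["Term"] = field(default_factory=list)

    def descendants(self) -> set[str]:
        """Names of this term and every term beneath it."""
        names = {self.name}
        for child in self.children:
            names |= child.descendants()
        return names


# Hypothetical term set for geographic regions.
REGIONS = Term("Europe", [Term("UK"), Term("France")])

# Hypothetical library: each document carries a "Region" properties column.
LIBRARY = [
    {"title": "Q1 report", "Region": "UK"},
    {"title": "Sales deck", "Region": "France"},
    {"title": "Budget", "Region": "Canada"},
]


def filter_by_term(documents, column, term):
    """Titles of documents whose column value is the term or one of its children."""
    wanted = term.descendants()
    return [doc["title"] for doc in documents if doc.get(column) in wanted]
```

Filtering the library on "Europe" returns the UK and France documents and leaves out the Canadian one, even though the word "Europe" appears in none of them.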

In SharePoint 2010, content administrators can also define hierarchies of Content Types that are meaningful to their business operation (e.g. Project Contracts, Financial Reports, Job Offer Letters) and deploy them across entire Site Collections. Each Content Type can be assigned its own workflows, permissions and retention policies, which are inherited from general types (e.g. Contract) by more specific types (e.g. Legal Contracts, Engineering Contracts).
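
The inheritance idea can be illustrated with a small Python model in which a specific Content Type falls back to its parent for any setting it does not define itself. The class, the seven-year retention period and the workflow names are assumptions chosen for the example, not SharePoint's actual object model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ContentType:
    name: str
    parent: Optional["ContentType"] = None
    retention_years: Optional[int] = None  # None means "inherit from parent"
    workflow: Optional[str] = None         # None means "inherit from parent"

    def resolve(self, setting: str):
        """Walk up the hierarchy until some type defines the setting."""
        node = self
        while node is not None:
            value = getattr(node, setting)
            if value is not None:
                return value
            node = node.parent
        return None


# Hypothetical hierarchy: a general type and a more specific child.
contract = ContentType("Contract", retention_years=7, workflow="Legal review")
engineering = ContentType(
    "Engineering Contracts", parent=contract, workflow="Engineering sign-off"
)
```

Here `engineering.resolve("retention_years")` falls back to the parent's value, while `engineering.resolve("workflow")` returns the child's own override.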

The ability to centrally define and manage taxonomies and term/coding lists in SharePoint 2010 will make it much easier to manage the large multi-site, multi-library document collections that now exist in many business organizations, and that are likely to grow further.
