Thursday, January 6, 2011

The Art of Searching in an Expanding Information Universe

As the petabytes of data on the internet grow ever larger, it has become harder and harder to find what you are looking for, even when you are sure the information must be out there somewhere. Google is a wonderful thing, but there are inherent problems in basic keyword searching that are becoming more apparent as the volume of data grows and, inevitably, the volume of junk along with it.

One problem with online searching is that most search engines require you to describe in some way – typically through the use of keywords – the information you want to retrieve. That is fine if you know enough about what you are looking for to describe it, but not at all if you don’t.

For example, at the beginning of a legal case, today’s legal teams may be presented with terabytes of emails and documents collected from individuals of interest (aka custodians) but may have little or no idea about what’s in those emails and documents or how to identify items of interest (aka responsive documents). This is such an issue that whole suites of software have been developed to assist with what is known as Early Case Assessment (e.g. Clearwell), attempting to solve the problem by analyzing the document set by topic, key phrases or terms so that the legal team can begin to develop a search strategy.

A more common situation would be one we have all experienced when trying to solve a technical issue in an area with which we are not familiar. “Pop-up thingy” may be how you’d describe the dialog window that keeps popping up but how is it ‘officially’ named in the software you are using? Without knowing that, finding assistance is difficult. You may have to trawl through a few dozen only marginally relevant items to finally track down the keywords you need to do a proper search.

Researching technical issues is also made difficult by the fact that you may not know which element of a systems environment is the one causing the problem and therefore where to focus the search. For example, if a user of hosted SharePoint 2010 on a Win 7 32-bit laptop using IE8 has issues downloading documents after an upgrade to Office 2010, is the primary problem with SharePoint, SharePoint 2010, Win7 UAC, IE 8, 32-bit or MS Office 2010? Entering a search that includes all the software components and their versions is likely to be far too narrow and to remove potentially helpful documents (for example, the problem might not be Win7 related and there may be helpful information referring to a similar situation on desktops running Vista). Not scoping it at all is likely to result in hundreds of irrelevant documents dealing with obscure issues with, say, SharePoint 2003 and XP SP1. Once you have some clue as to what might be the cause of the problem – or even a best hypothesis – you can scope down to the versions of the software environment that are relevant and, hopefully, find articles and postings relating to similar situations. But you need that initial clue or hypothesis, i.e. you need to understand something about the answer before you can pose the question that's going to bring up potentially relevant solutions.
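The scoping trade-off can be illustrated with a toy example. The following is only a minimal sketch: the five "documents" and their terms are invented for illustration, and the matching is a plain all-terms-required (AND) search rather than anything a real search engine does.

```python
# Toy corpus: each document is reduced to the set of terms it mentions.
# All documents and terms are invented for illustration.
DOCS = {
    "doc1": {"sharepoint", "2010", "download", "ie8", "vista"},
    "doc2": {"sharepoint", "2003", "xp", "workflow"},
    "doc3": {"sharepoint", "2010", "download", "office", "win7"},
    "doc4": {"sharepoint", "2010", "download", "ie8", "win7", "32bit", "office"},
    "doc5": {"sharepoint", "2003", "xp", "download"},
}


def search(required_terms):
    """Return the ids of documents that contain every required term."""
    required = set(required_terms)
    return sorted(doc_id for doc_id, terms in DOCS.items() if required <= terms)


# Every component and version: only the one document that mentions them all.
too_narrow = search(["sharepoint", "2010", "win7", "ie8", "32bit", "office", "download"])

# No scoping at all: everything comes back, including SharePoint 2003 on XP.
too_broad = search(["sharepoint"])

# Scoped by a hypothesis (here, that IE8 is the culprit): the Vista
# document that the narrow search excluded is found again.
hypothesis = search(["sharepoint", "2010", "download", "ie8"])

print(too_narrow, too_broad, hypothesis, sep="\n")
```

In this toy corpus the hypothesis-driven query recovers the Vista document that the fully specified query missed, which is the point made above: the useful scope depends on already having a guess at the cause.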

The first difficulty with using keyword searching to find information, as the above examples illustrate, is that you have to know how the information you are searching for is expressed in words. The second is that the same word can have two different meanings or be used in two different contexts, and it is not always easy to frame a search that excludes all meanings but the one you want without losing potentially relevant articles.

The meanings do not need to be as diverse as say, the word “spring”. Take the example of “FedEx”. If you run a search for the keyword ‘FedEx’ on either Google or Bing, you will find that it brings up not only information published by FedEx on its own web site, but business articles about FedEx, articles mentioning FedEx Field (the sports venue), FedEx Air & Ground (NFL) Players of the Week, the FedEx PGA Cup and blogs/forum postings about a delivery or mentions of FedEx’s delivery service in articles which are actually about something else e.g. see highlighted ‘page1’ results from a Google search for Fedex below.

Partial screenshot of the results of a Google search for 'Fedex'


Google does have a News category filter but since the NFL is also news, the results include business news, company news and sports news.

Bing also has category filters. These appear to filter based on source type rather than content (indeed the API refers to them as sourceType). Below the top level ‘News’ source type is a subcategory called ‘business’ which presumably scopes the results to business news sources. When we tried it, it did seem to remove many of the top ranked listings relating to NFL issues, but there remained in the top 10 postings (sorted by most recent) one result for the FedEx PGA Cup, which we presume survived because the article was published in TradingMarkets.com, which is deemed a business news source.

Results of searching Fedex in the News source types on Bing. Note ability to filter by Business, Sports or Political source types listed in the left hand menu.


In practice, we’ve tended to find Yahoo! Finance to be the easiest and quickest way to find recent business oriented articles about a company sorted by date, but obviously this only works for companies that are public or large enough to be tracked by Yahoo! Finance and even then, some of the articles seem only loosely related to the company in question.

The difficulties we have been experiencing trying to find information through the “usual channels” – primarily Google, Bing – had us reading with interest a recent posting on TechCrunch: “Why We Desperately Need a New and Better Google” (https://techcrunch.com/2011/01/01/why-we-desperately-need-a-new-and-better-google-2/)

It was a posting that resonated deeply with us, as we had experienced many of the same issues: wading through the junk “compilation” sites that are nothing more than automatically gathered links to links and add zero value; increasing difficulty searching specifically for people; problems with trying to find only recently written (as opposed to recently indexed) articles.

Inspired by the posting, we decided to check out Blekko, a search engine the author of the article and his team of students at the School of Information at UC-Berkeley had used with some success, to see whether the functionality on offer would assist us with some of our search problems.

Blekko was founded in mid-2007 by a group who had previously worked at Topix and Netscape’s Open Directory. Blekko’s primary differentiator is the use of ‘slashtags’ to filter (or sort) search results. For example, using /people will filter search results to those that are specifically about a person; /date sorts results by published (not indexed) date; topic slashtags, e.g. /health or /recipes, will filter the search to a curated subset of web sites dealing with these categories (thereby avoiding the spammers, the listers and other junk sites as well as minimizing the problem of multiple meanings/contexts for terms). Blekko developed some initial topic slashtags, but users are free to create their own and use them for their own purposes or share them with others.
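To make the idea concrete, here is a minimal sketch of what filtering by a curated site list and sorting by published date might look like. It is not Blekko's implementation: the slashtag site lists, the URLs, the dates and the function names are all hypothetical, chosen only to illustrate the mechanism described above.

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical curated site lists, keyed by slashtag name.
SLASHTAGS = {
    "news": {"example-news.test", "example-wire.test"},
    "tech": {"example-tech.test"},
}

# Invented search results; "published" is the article date, not the index date.
RESULTS = [
    {"url": "https://www.example-news.test/fedex-earnings", "published": date(2010, 12, 16)},
    {"url": "https://linkfarm.test/fedex-links", "published": date(2011, 1, 5)},
    {"url": "https://example-wire.test/fedex-fuel-surcharge", "published": date(2011, 1, 3)},
    {"url": "https://example-tech.test/fedex-tracking-app", "published": date(2010, 11, 2)},
]


def host_matches(url, site):
    """True if the URL's host is the site itself or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == site or host.endswith("." + site)


def apply_slashtags(results, topic=None, sort_by_date=False):
    """Keep results from the topic's curated sites, optionally newest first."""
    kept = list(results)
    if topic is not None:
        sites = SLASHTAGS[topic]
        kept = [r for r in kept if any(host_matches(r["url"], s) for s in sites)]
    if sort_by_date:
        kept = sorted(kept, key=lambda r: r["published"], reverse=True)
    return kept


# Roughly the equivalent of a search scoped with /news /date.
news_by_date = [r["url"] for r in apply_slashtags(RESULTS, topic="news", sort_by_date=True)]
print(news_by_date)
```

Note that the filter works on where a page is published, not on what it is about, so the junk "link farm" result drops out but an off-topic article on a curated site would still get through.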

We searched Blekko for recent news stories about FedEx (Fedex /news /date). We would have liked to have scoped by business, but unfortunately there is currently no ‘business’ slashtag. While the initial results were all company and recent news related (good!), the NFL had crept in by result 11.

Screenshot of results using Blekko and the Search: Fedex /news /date


We also noticed that there didn’t appear to be any results about share prices (compared with, for example, search results for Fedex filtered by NEWS and BUSINESS on Bing), and so we tried the Finance slashtag as an alternative. This brought up a very mixed bag of results, a consequence of filtering by web site rather than topic. There were many mentions of the PGA Cup because golf, it seems, is a well reported topic on financial web sites! Obviously, if we were doing this frequently, it would be worthwhile creating our own slashtag to scope the results to those business information sources we found most useful for this topic.

The results of a search for recent information about technology at Fedex (Fedex /tech /date) show some of the difficulties of achieving precision with keyword searching – even when scoped by source. Only the third article down is relevant.

Screenshot of Blekko search results for Fedex /tech


Without going into the realm of true semantic analysis and the semantic web, one mechanism that would help improve the relevancy of search results in cases where a topic can have multiple foci within the same information source context (e.g. FedEx as a company vs other companies’ incidental use of Fedex) would be to make more use of facets, in the manner of many Solr implementations or indeed SharePoint 2010 Fast. That in turn would require the use of taxonomies and indexing of content, which in a world-wide-web scenario would need to be automated rather than carried out by human content providers as happens in SharePoint environments.

Snapshot of the results of a SharePoint 2010 Fast Search showing 'Refine by' options 
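The ‘Refine by’ idea can be sketched in a few lines. This is only an illustration under stated assumptions: the documents, the facet names (topic, source_type) and their values are invented, and the facet values are assumed to have already been assigned by some indexing step, which is exactly the part that would need to be automated on the open web. It does not reflect the Solr or SharePoint APIs.

```python
from collections import Counter

# Invented results for the keyword "fedex"; the facet values are assumed
# to have been assigned at indexing time.
DOCS = [
    {"title": "FedEx raises quarterly outlook", "facets": {"topic": "company", "source_type": "business"}},
    {"title": "Retailer ships holiday orders by FedEx", "facets": {"topic": "incidental", "source_type": "business"}},
    {"title": "Game report from FedEx Field", "facets": {"topic": "sports", "source_type": "news"}},
    {"title": "FedEx expands hub capacity", "facets": {"topic": "company", "source_type": "news"}},
]


def facet_counts(docs, facet):
    """Count how many documents carry each value of the given facet."""
    return Counter(d["facets"][facet] for d in docs if facet in d["facets"])


def refine(docs, **selected):
    """Keep only documents whose facets match every selected value."""
    return [
        d for d in docs
        if all(d["facets"].get(name) == value for name, value in selected.items())
    ]


topic_counts = facet_counts(DOCS, "topic")
company_business = [d["title"] for d in refine(DOCS, topic="company", source_type="business")]
print(dict(topic_counts))
print(company_business)
```

The counts are what a ‘Refine by’ panel would display, and refine() is what clicking one of its options would do; unlike a source filter, the topic facet separates FedEx the company from incidental mentions on the same business site.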


Overall, we do think the ability to filter search results by a curated set of web sites has potential, and we loved the ability to combine topic slashtags with the /date and /people tags to further refine and sort the results. We also liked the ability to declare a site as “spam” and have it forever banned from our search results (which we would have loved to have known about when trying to do a search on a Drupal related technical issue a few months ago). Another thing we did appreciate about Blekko is its transparency. For instance, it is very easy to find which web sources are included in a slashtag’s scope. Simply go to: find the slashtag and drill down on the link. In contrast, we were unable to find which news sources were included in Bing’s news sourcetype or which business news sources in the news > business category.

On a very minor note: (1) It would be helpful to new users of Blekko to put a link to the list of slashtags on the home page. (2) When we searched for iChromatiq (we couldn’t resist!), our home page was listed 18th after a series of postings for “aChromatic”. We can see why our web site ranked lower than the dictionary entry for ‘achromatic’ on dictionary.com (Blekko does make the reasons for page rankings explicit): the Blekko engine treats ichromatiq and achromatic as the same term and, since the ichromatiq web site has fewer inbound/outbound links than the dictionary.com entry for ‘achromatic’, it is ranked far lower. We would have no argument with this ranking if we had searched for ‘achromatic’ or if our web site was achromatic.com. But logically, shouldn’t a search for a specific term rank pages containing that specific term above pages containing terms which may be similar but are not identical? They are, after all, the best fit. Or, at least, like Google or Bing, ask the user if they meant achromatic rather than ichromatiq and, based on the response, search accordingly. Just a thought!
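For what it is worth, the ranking we have in mind could be sketched roughly as follows. This is our own toy illustration, not a description of how Blekko (or any other engine) ranks pages: the pages, their terms and their link counts are made up, and the similarity measure is simply Python's standard difflib ratio.

```python
from difflib import SequenceMatcher

# Invented pages: the terms each page contains and a made-up link count.
PAGES = [
    {"name": "dictionary-achromatic", "terms": {"achromatic", "definition"}, "links": 500},
    {"name": "ichromatiq-home", "terms": {"ichromatiq", "search"}, "links": 5},
]


def best_similarity(query, terms):
    """Highest string similarity between the query and any term on the page."""
    return max(SequenceMatcher(None, query, term).ratio() for term in terms)


def rank(query, pages):
    """Order pages by exact match first, then similarity, then link count."""
    q = query.lower()

    def sort_key(page):
        terms = {t.lower() for t in page["terms"]}
        return (q in terms, best_similarity(q, terms), page["links"])

    return sorted(pages, key=sort_key, reverse=True)


top_for_ichromatiq = rank("iChromatiq", PAGES)[0]["name"]
top_for_achromatic = rank("achromatic", PAGES)[0]["name"]
print(top_for_ichromatiq, top_for_achromatic)
```

With the exact match as the first sort key, link popularity only decides between pages that fit the query equally well, so a search for ‘achromatic’ still favours the dictionary entry.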