Monday, February 14, 2011

Clustering Connections with LinkedIn InMaps

Last month, LinkedIn announced a new application called InMaps which can be used to visualize a LinkedIn Network. LinkedIn’s aim is to enable its users to see what their network looks like and so better leverage their network, including identifying areas where it could be strengthened and extended.

As readers of this blog will know, data visualization is something in which we are keenly interested and so we went to try it out. Curiously, LinkedIn does not promote its labs area – or at least not that we could tell – even though there are some very interesting experimental applications in it (e.g. try out INFINITY ).

For our evaluation, we chose a relatively small network because we were interested in exploring the representation in some depth. (Note: we have read comments from others that the software may be challenged dealing with very large networks in the 30,000+ region. D.J. Patil, Chief Scientist of LinkedIn, notes the same in his comments on a posting on the FlowingData blog: http://flowingdata.com/2011/01/24/explore-your-linkedin-network-visually-with-inmaps/#comment-63891 ).

It is recommended that InMaps be used with Firefox or Chrome rather than IE. Once you have reached the Labs page and selected the InMaps option, all you need to do is to permit the InMaps application to access your LinkedIn Connections. The application then processes LinkedIn’s connection-network representation and produces a diagram which is not dissimilar in style to Gephi (see previous blog posting: http://ichromatiq.blogspot.com/search/label/Gephi ) and indeed LinkedIn InMaps is listed on Gephi’s own web site as a user of the Gephi toolkit (see: http://gephi.org/2011/happy-new-year/ ).

Example of a LinkedIn InMap

Highly connected individuals within your network are represented with larger nodes and fonts. It is important to bear in mind, however, that the map is only representing the connectedness between the individuals to which you are connected. It is not showing the connectedness of those individuals within LinkedIn. So, for example, if you have a connection to individual A who happens to have a very large LinkedIn network but, for some reason, no one else in your network is connected to them, they will appear as a small node with a single link to you. If, on the other hand, you are connected to individual B who is connected to all the same people with whom you are connected, that individual is going to be represented as a large node.
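To make the distinction concrete, here is a minimal sketch of the kind of count that would produce this behavior. It is not LinkedIn's actual algorithm, which is not published; the function name, the people and the edge list are all hypothetical. It counts only the links that run between two of your own connections and ignores everything outside your network.

```python
def in_network_degree(my_connections, edges):
    """For each of my connections, count links to my *other* connections only.

    Assumes each pair appears at most once in `edges`.
    """
    mine = set(my_connections)
    degree = {person: 0 for person in mine}
    for a, b in edges:
        # Links that leave my network (e.g. to X, Y, Z) are ignored.
        if a in mine and b in mine and a != b:
            degree[a] += 1
            degree[b] += 1
    return degree


# Hypothetical data: A has many links, but all of them are outside my network.
# B is linked to every other person in my network except A.
my_connections = ["A", "B", "C", "D", "E"]
edges = [
    ("A", "X"), ("A", "Y"), ("A", "Z"),   # A's wider LinkedIn network
    ("B", "C"), ("B", "D"), ("B", "E"),   # B's links inside my network
]

degree = in_network_degree(my_connections, edges)
for person in sorted(degree, key=degree.get, reverse=True):
    print(person, degree[person])
```

Under these assumptions A scores zero despite having the most links overall, while B scores highest and would be drawn as the largest node.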

We particularly liked the fact that the map is highly interactive. Not only can you pan, zoom and mouse-over a node to get tool-tip information, but clicking on the node brings up their LinkedIn profile in the right hand sidebar. Very useful!

Most intriguing, however, is the clustering, represented by different colors. InMaps allows you to choose your own label for each cluster/color but gives little information as to how the clusters are derived except to say that they represent different affiliations such as previous employers, educational institutions or industries. Looking at the InMap shown here, it was clear that the dominating factor in the clustering was the employment attribute and specifically company name.

Close-Up of "Misc" Cluster
The small red cluster on the immediate left of center is essentially a “misc” group. Looking at this in more detail, we noticed that connections based on professional organizations did not seem to be picked up – but that may have been because the number of such connections was below the clustering threshold and/or because the individuals concerned had not recorded the organization in their profile. We also noticed that one particular employer affiliation had not been clustered. In this case, we believe the reason is that this particular enterprise was so large that people often reference the operating division in which they work rather than the whole. Further, the name of the enterprise has changed over the years. Since it would be an enormous task to keep track of all the changes – name and organizational structure – many Fortune 5000 companies go through, it might be useful to allow users to overlay the initial map with affiliations they know exist, i.e. to add additional attributes.

We would have liked to compare the representation produced by InMaps with those produced by other visualization tools: in particular NodeXL, because that would have allowed us to add/modify attributes easily. Unfortunately, while it is possible to export your LinkedIn connections, you cannot access the connections between individuals in your network.

Overall, this is a very useful visualization tool, providing valuable insight into one’s professional network. It would be very interesting to overlay this with other perspectives including email traffic flow or Twitter activity to give an extended picture of how one communicates and connects within the business and professional environment. More please!

Thursday, January 6, 2011

The Art of Searching in an Expanding Information Universe

As the petabytes of data on the internet grow ever larger, it has become harder and harder to find what you are looking for even when you are sure the information must be out there somewhere. Google is a wonderful thing but there are inherent problems in basic keyword searching that are becoming more apparent as the volume of data grows and, inevitably, along with it the volume of junk.

One problem with online searching is that most search engines require you to describe in some way – typically through the use of keywords – the information you want to retrieve. Which is fine if you know enough about what you are looking for to describe it but not at all if you don’t.

For example, at the beginning of a legal case, today’s legal teams may be presented with terabytes of emails and documents collected from individuals of interest (aka custodians) but may have little or no idea about what’s in those emails and documents or how to identify items of interest (aka responsive documents). This is such an issue that whole suites of software have been developed to assist with what is known as Early Case Assessment (e.g. Clearwell), attempting to solve the problem by analyzing the document set by topic, key phrases or terms so that the legal team can begin to develop a search strategy.

A more common situation would be one we have all experienced when trying to solve a technical issue in an area with which we are not familiar. “Pop-up thingy” may be how you’d describe the dialog window that keeps popping up but how is it ‘officially’ named in the software you are using? Without knowing that, finding assistance is difficult. You may have to trawl through a few dozen only marginally relevant items to finally track down the keywords you need to do a proper search.

Researching technical issues is also made difficult by the fact that you may not know which element of a systems environment is the one causing the problem and therefore where to focus the search. For example, if a user of hosted SharePoint 2010 on a Win 7 32-bit laptop using IE8 has issues downloading documents after an upgrade to Office 2010, is the primary problem with SharePoint, SharePoint 2010, Win7 UAC, IE 8, 32-bit or MS Office 2010? Entering a search that includes all the software components and their versions is likely to be far too narrow and to remove potentially helpful documents (for example, the problem might not be Win7 related and there may be helpful information referring to a similar situation on desktops running Vista). Not scoping it at all is likely to result in hundreds of irrelevant documents dealing with obscure issues with, say, SharePoint 2003 and XP SP1. Once you have some clue as to what might be the cause of the problem – or even a best hypothesis – you can scope down to the versions of the software environment that are relevant and, hopefully, find articles and postings relating to similar situations. But you need that initial clue/hypothesis, i.e. you need to understand something about the answer before you can pose the question that's going to bring up potentially relevant solutions.

The first difficulty with using keyword searching to find information, as the above examples illustrate, is that you have to know how the information you are searching for is expressed in words. The second is that the same word can have two different meanings or be used in two different contexts and it is not always easy to frame a search to exclude all meanings but the one you want, without losing potentially relevant articles.

The meanings do not need to be as diverse as say, the word “spring”. Take the example of “FedEx”. If you run a search for the keyword ‘FedEx’ on either Google or Bing, you will find that it brings up not only information published by FedEx on its own web site, but business articles about FedEx, articles mentioning FedEx Field (the sports venue), FedEx Air & Ground (NFL) Players of the Week, the FedEx PGA Cup and blogs/forum postings about a delivery or mentions of FedEx’s delivery service in articles which are actually about something else e.g. see highlighted ‘page1’ results from a Google search for Fedex below.

Partial screenshot of the results of a Google search for 'Fedex'


Google does have a News category filter but since the NFL is also news, the results include business news, company news and sports news.

Bing also has category filters. These appear to filter based on source type rather than the content (indeed the API refers to them as sourceType). Below the top level ‘News’ source type is a subcategory called ‘business’ which presumably scopes the results to business news sources. When we tried it, it did seem to remove many of the top ranked listings relating to NFL issues but there remained in the top 10 postings (sorted by most recent) one result for the FedEx PGA Cup, which we presume survived because the article was published in TradingMarkets.com, which is deemed a business news source.

Results of searching Fedex in the News source types on Bing. Note ability to filter by Business, Sports or Political source types listed in the left hand menu.


In practice, we’ve tended to find Yahoo! Finance to be the easiest and quickest way to find recent business oriented articles about a company sorted by date, but obviously this only works for companies that are public or large enough to be tracked by Yahoo! Finance and even then, some of the articles seem only loosely related to the company in question.

The difficulties we have been experiencing trying to find information through the “usual channels” – primarily Google, Bing – had us reading with interest a recent posting on TechCrunch: “Why We Desperately Need a New and Better Google” (https://techcrunch.com/2011/01/01/why-we-desperately-need-a-new-and-better-google-2/)

It was a posting that resonated deeply with us, as we have experienced many of the same issues: wading through the junk “compilation” sites that are nothing more than automatically gathered links to links and add zero value; increasing difficulty searching specifically for people; problems with trying to find only recently written (as opposed to recently indexed) articles.

Inspired by the posting, we decided to check out Blekko, a search engine the author of the article and his team of students at the School of Information at UC-Berkeley had used with some success, to see whether the functionality on offer would assist us with some of our search problems.

Blekko was founded in mid-2007 by a group who had previously worked at Topix and Netscape’s Open Directory. Blekko’s primary differentiator is the use of ‘slashtags’ to filter (or sort) search results. For example, using /people will filter search results to those that are specifically about a person; /date sorts results by published (not indexed) date; topic slashtags, e.g. /health or /recipes, will filter the search to a curated subset of web sites dealing with these categories (thereby avoiding the spammers, the listers and other junk sites as well as minimizing the problem of multiple meanings/contexts for terms). Blekko developed some initial topic slashtags but users are free to create their own and use them for their own purposes or share them with others.
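The slashtag idea can be sketched in a few lines. This is an illustration only, not Blekko's implementation: the site lists, document records and function names are hypothetical, and matching is reduced to a simple substring test. A topic slashtag restricts results to a curated set of host names, and /date sorts by publication date.

```python
from datetime import date
from urllib.parse import urlparse

# Hypothetical curated site list for one topic slashtag.
SLASHTAG_SITES = {"news": {"news.example", "wire.example"}}


def parse_query(query):
    """Split a query into plain keywords and slashtags."""
    words = query.split()
    terms = [w.lower() for w in words if not w.startswith("/")]
    tags = [w[1:].lower() for w in words if w.startswith("/")]
    return terms, tags


def search(query, documents):
    terms, tags = parse_query(query)
    # Naive keyword match: every term must appear in the text.
    results = [d for d in documents
               if all(t in d["text"].lower() for t in terms)]
    # Topic slashtags keep only documents hosted on curated sites.
    for tag in tags:
        if tag in SLASHTAG_SITES:
            allowed = SLASHTAG_SITES[tag]
            results = [d for d in results
                       if urlparse(d["url"]).hostname in allowed]
    # /date sorts by published date, newest first.
    if "date" in tags:
        results.sort(key=lambda d: d["published"], reverse=True)
    return results


documents = [
    {"url": "http://news.example/fedex-earnings",
     "text": "FedEx reports quarterly earnings", "published": date(2011, 1, 3)},
    {"url": "http://linkfarm.example/fedex",
     "text": "FedEx links, links and more links", "published": date(2011, 1, 5)},
    {"url": "http://wire.example/fedex-fleet",
     "text": "FedEx expands its fleet", "published": date(2011, 1, 5)},
    {"url": "http://news.example/weather",
     "text": "Snow expected this week", "published": date(2011, 1, 6)},
]

hits = search("Fedex /news /date", documents)
for d in hits:
    print(d["published"], d["url"])
```

Because the filter works on where a page is hosted rather than on what it is about, an off-topic article on a curated site still passes through.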

We searched Blekko for recent news stories about FedEx (Fedex /news /date). We would have liked to have scoped by business but unfortunately there is currently no ‘business’ slashtag. While the initial results were all company and recent news related (good!), the NFL had crept in by result 11.

Screenshot of results using Blekko and the Search: Fedex /news /date


We also noticed that there didn’t appear to be any results about share prices (compared with, for example, search results for Fedex filtered by NEWS and BUSINESS on Bing), and so we tried the slashtag /finance as an alternative. This brought up a very mixed bag of results, a consequence of filtering by web site rather than topic. There were many mentions of the PGA Cup because golf, it seems, is a well reported topic on financial web sites! Obviously, if we were doing this frequently, it would be worthwhile to create our own slashtag to scope the results to those business information sources we found most useful for this topic.

The results of a search for recent information about technology at Fedex (Fedex /tech /date) show some of the difficulties of achieving precision with keyword searching – even when scoped by source. Only the third article down is relevant.

Screenshot of Blekko search results for Fedex /tech


Without going into the realm of true semantic analysis and the semantic web, one mechanism that would help improve the relevancy of search results in cases where a topic can have multiple foci within the same information source context (e.g. FedEx as a company vs other companies’ incidental use of FedEx) would be to make more use of facets, in the manner of many Solr implementations or indeed SharePoint 2010 Fast. That in turn would require the use of taxonomies and indexing of content, which in a world-wide-web scenario would need to be automated rather than carried out by human content providers as happens in SharePoint environments.
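As a rough illustration of what facets add, the sketch below counts the values of an indexed field across a result set and then refines by one of them. It does not use the Solr or SharePoint APIs; the "topic" field, its values and the function names are hypothetical, and it assumes each document has already been tagged at indexing time.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of an indexed field."""
    return Counter(doc[field] for doc in results if field in doc)


def refine(results, field, value):
    """Keep only the results whose field matches the chosen facet value."""
    return [doc for doc in results if doc.get(field) == value]


# Hypothetical results for the keyword "FedEx", tagged with a topic.
results = [
    {"title": "FedEx posts quarterly results", "topic": "company"},
    {"title": "Players of the Week announced", "topic": "sports"},
    {"title": "FedEx adds new aircraft", "topic": "company"},
    {"title": "Retailer ships orders via FedEx", "topic": "incidental mention"},
]

counts = facet_counts(results, "topic")
print(counts.most_common())

company_only = refine(results, "topic", "company")
print([doc["title"] for doc in company_only])
```

The counts are what a 'Refine by' panel displays; selecting one value applies the refinement.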

Snapshot of the results of a SharePoint 2010 Fast Search showing 'Refine by' options 


Overall, we do think the ability to filter search results by a curated set of web sites has potential and we loved the ability to combine topic slashtags with the /date and /people tags to further refine and sort the results. We also liked the ability to declare a site as “spam” and have it forever banned from our search results. (Which we would have loved to have known about when trying to do a search on a Drupal related technical issue a few months ago). Another thing we did appreciate about Blekko is its transparency. For instance, it is very easy to find which web sources are included in a slashtag’s scope. Simply find the slashtag and drill down on the link. In contrast, we were unable to find which news sources were included in Bing’s news sourcetype or which business news sources in the news > business category.

On a very minor note: (1) It would be helpful to new users of Blekko to put a link to the list of slashtags on the home page. (2) When we searched for iChromatiq (we couldn’t resist!), our home page listed 18th after a series of postings for “aChromatic”. We can see why our web site ranked lower than the dictionary entry for ‘achromatic’ on dictionary.com – Blekko does make reasons for page rankings explicit – but it is because the Blekko engine treats ichromatiq and achromatic as the same term and since the ichromatiq web site has fewer inbound/outbound links than the dictionary.com entry for ‘achromatic’, it is ranked far lower. We would have no argument with this ranking if we had searched for ‘achromatic’ or if our web site was achromatic.com. But logically, shouldn’t a search for a specific term rank pages containing that specific term above pages containing terms which may be similar but are not identical? They are, after all, the best fit. Or, at least – like Google or Bing – ask the user if they meant achromatic rather than ichromatiq and based on the response, search accordingly. Just a thought!
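The ordering we have in mind could look something like the sketch below. It makes no claim about how Blekko actually ranks pages; the page records and link counts are invented for illustration. An exact match on the query term is considered first, then string similarity, and link count only breaks ties.

```python
from difflib import SequenceMatcher


def rank(query, pages):
    """Order pages: exact term match first, then similarity, then links."""
    q = query.lower()

    def key(page):
        term = page["term"].lower()
        exact = term == q
        similarity = SequenceMatcher(None, q, term).ratio()
        return (exact, similarity, page["links"])

    return sorted(pages, key=key, reverse=True)


# Hypothetical pages and link counts.
pages = [
    {"site": "dictionary entry", "term": "achromatic", "links": 5000},
    {"site": "ichromatiq home page", "term": "ichromatiq", "links": 40},
]

ranked = rank("iChromatiq", pages)
print([page["site"] for page in ranked])
```

With this ordering, the heavily linked dictionary entry still wins a search for ‘achromatic’, but no longer outranks the exact match on a search for ‘iChromatiq’.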