Monday, February 14, 2011

Clustering Connections with LinkedIn InMaps

Last month, LinkedIn announced a new application called InMaps which can be used to visualize a LinkedIn network. LinkedIn’s aim is to enable its users to see what their network looks like and so leverage it better, including identifying areas where it could be strengthened and extended.

As readers of this blog will know, data visualization is something in which we are keenly interested and so we went to try it out. Curiously, LinkedIn does not promote its labs area – or at least not that we could tell – even though there are some very interesting experimental applications in it (e.g. try out INFINITY).

For our evaluation, we chose a relatively small network because we were interested in exploring the representation in some depth. (Note: we have read comments from others that the software may struggle with very large networks in the 30,000+ connection region. D.J. Patil, Chief Scientist of LinkedIn, notes the same in his comments on a posting on the FlowingData blog: http://flowingdata.com/2011/01/24/explore-your-linkedin-network-visually-with-inmaps/#comment-63891 ).

It is recommended that InMaps be used with Firefox or Chrome rather than IE. Once you have reached the Labs page and selected the InMaps option, all you need to do is permit the InMaps application to access your LinkedIn connections. The application then processes LinkedIn’s connection-network representation and produces a diagram not dissimilar in style to those produced by Gephi (see previous blog posting: http://ichromatiq.blogspot.com/search/label/Gephi ); indeed, LinkedIn InMaps is listed on Gephi’s own web site as a user of the Gephi toolkit (see: http://gephi.org/2011/happy-new-year/ ).

Example of a LinkedIn InMap

Highly connected individuals within your network are represented with larger nodes and fonts. It is important to bear in mind, however, that the map only represents the connectedness between the individuals to whom you are connected; it does not show how connected those individuals are within LinkedIn as a whole. So, for example, if you have a connection to individual A who happens to have a very large LinkedIn network but, for some reason, no one else in your network is connected to them, they will appear as a small node with a single link to you. If, on the other hand, you are connected to individual B who is connected to all the same people with whom you are connected, that individual will be represented as a large node.
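To make the distinction concrete, here is a minimal sketch (in Python with networkx, using two hypothetical connections A and B) showing how node prominence follows degree within your own network rather than an individual's total LinkedIn connections:

```python
# A minimal sketch (networkx, invented names) of why a well-connected LinkedIn
# member can still appear as a small node on your InMap: node size reflects
# connections *within your own network*, not their total connections.
import networkx as nx

my_network = nx.Graph()

# Individual A: linked only to you, however large their own LinkedIn network is.
my_network.add_edge("You", "A")

# Individual B: linked to you and to several of your other connections.
for contact in ["B", "C", "D", "E"]:
    my_network.add_edge("You", contact)
for contact in ["C", "D", "E"]:
    my_network.add_edge("B", contact)

# Degree within *your* network is what drives node size on the map.
for person in ["A", "B"]:
    print(person, "connections within this network:", my_network.degree(person))
# A -> 1 (small node), B -> 4 (large node)
```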

We particularly liked the fact that the map is highly interactive. Not only can you pan, zoom and mouse over a node to get tool-tip information, but clicking on a node brings up that person’s LinkedIn profile in the right-hand sidebar. Very useful!

Most intriguing, however, is the clustering, represented by different colors. InMaps allows you to choose your own label for each cluster/color but gives little information as to how the clusters are derived, except to say that they represent different affiliations such as previous employers, educational institutions or industries. Looking at the InMap shown here, it was clear that the dominating factor in the clustering was the employment attribute, and specifically company name.
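InMaps does not document its clustering algorithm, but a rough approximation of affiliation-driven clusters can be produced by running standard modularity-based community detection over the connection graph. A hedged sketch on an invented network:

```python
# A rough, hypothetical approximation only: InMaps does not publish its method,
# but modularity-based community detection over a connection graph tends to
# recover employer-centred groups in practice. All names are invented.
import networkx as nx
from networkx.algorithms import community

G = nx.Graph()
G.add_edges_from([
    ("You", "Ann"), ("You", "Bob"), ("Ann", "Bob"),   # ex-colleagues at company X
    ("You", "Cat"), ("You", "Dan"), ("Cat", "Dan"),   # ex-colleagues at company Y
    ("You", "Eve"),                                   # a lone "misc" connection
])

clusters = community.greedy_modularity_communities(G)
for i, members in enumerate(clusters):
    print(f"cluster {i}: {sorted(members)}")
```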

Close-Up of "Misc" Cluster
The small red cluster on the immediate left of center is essentially a “misc” group. Looking at this in more detail, we noticed that connections based on professional organizations did not seem to be picked up, but that may have been because the number of such connections was below the clustering threshold and/or the individuals concerned had not recorded the organization in their profile. We also noticed that one particular employer affiliation had not been clustered. In this case, we believe the reason is that this particular enterprise was so large that people often reference the operating division in which they work rather than the whole. Further, the name of the enterprise has changed over the years. Since it would be an enormous task to keep track of all the changes in name and organizational structure that many Fortune 500 companies go through, it might be useful to allow users to overlay the initial map with affiliations they know exist, i.e. to add additional attributes.

We would have liked to compare the representation produced by InMaps with those produced by other visualization tools, in particular NodeXL, because that would have allowed us to add and modify attributes easily. Unfortunately, while it is possible to export your LinkedIn connections, you cannot access the connections between individuals in your network.

Overall, this is a very useful visualization tool, providing valuable insight into one’s professional network. It would be very interesting to overlay this with other perspectives, including email traffic flow or Twitter activity, to give an extended picture of how one communicates and connects within the business and professional environment. More please!

Thursday, January 6, 2011

The Art of Searching in an Expanding Information Universe

As the petabytes of data on the internet grow ever larger, it has become harder and harder to find what you are looking for, even when you are sure the information must be out there somewhere. Google is a wonderful thing, but there are inherent problems in basic keyword searching that are becoming more apparent as the volume of data grows and, inevitably, along with it the volume of junk.

One problem with online searching is that most search engines require you to describe in some way (typically through the use of keywords) the information you want to retrieve. That is fine if you know enough about what you are looking for to describe it, but not much help if you don’t.

For example, at the beginning of a legal case, today’s legal teams may be presented with terabytes of emails and documents collected from individuals of interest (aka custodians) but may have little or no idea about what’s in those emails and documents or how to identify items of interest (aka responsive documents). This is such an issue that whole suites of software have been developed to assist with what is known as Early Case Assessment (e.g. Clearwell), attempting to solve the problem by analyzing the document set by topic, key phrases or terms so that the legal team can begin to develop a search strategy.

A more common situation would be one we have all experienced when trying to solve a technical issue in an area with which we are not familiar. “Pop-up thingy” may be how you’d describe the dialog window that keeps popping up but how is it ‘officially’ named in the software you are using? Without knowing that, finding assistance is difficult. You may have to trawl through a few dozen only marginally relevant items to finally track down the keywords you need to do a proper search.

Researching technical issues is also made difficult by the fact that you may not know which element of a systems environment is causing the problem and therefore where to focus the search. For example, if a user of hosted SharePoint 2010 on a Windows 7 32-bit laptop using IE8 has issues downloading documents after an upgrade to Office 2010, is the primary problem with SharePoint, SharePoint 2010, Win7 UAC, IE8, 32-bit or MS Office 2010? Entering a search that includes all the software components and their versions is likely to be far too narrow and to remove potentially helpful documents (for example, the problem might not be Win7 related and there may be helpful information referring to a similar situation on desktops running Vista). Not scoping it at all is likely to result in hundreds of irrelevant documents dealing with obscure issues with, say, SharePoint 2003 and XP SP1. Once you have some clue as to what might be the cause of the problem (or even a best hypothesis), you can scope down to the versions of the software environment that are relevant and, hopefully, find articles and postings relating to similar situations. But you need that initial clue or hypothesis, i.e. you need to understand something about the answer before you can pose the question that's going to bring up potentially relevant solutions.

As the above examples illustrate, the first difficulty with using keyword searching to find information is that you have to know how the information you are searching for is expressed in words. The second is that the same word can have different meanings or be used in different contexts, and it is not always easy to frame a search that excludes all meanings but the one you want without losing potentially relevant articles.

The meanings do not need to be as diverse as those of, say, the word “spring”. Take the example of “FedEx”. If you run a search for the keyword ‘FedEx’ on either Google or Bing, you will find that it brings up not only information published by FedEx on its own web site, but also business articles about FedEx, articles mentioning FedEx Field (the sports venue), the FedEx Air & Ground (NFL) Players of the Week, the FedEx Cup (golf) and blog/forum postings about a delivery, as well as mentions of FedEx’s delivery service in articles which are actually about something else; e.g. see the highlighted ‘page 1’ results from a Google search for Fedex below.

Partial screenshot of the results of a Google search for 'Fedex'


Google does have a News category filter but since the NFL is also news, the results include business news, company news and sports news.

Bing also has category filters. These appear to filter based on source type rather than content (indeed the API refers to them as sourceType). Below the top-level ‘News’ source type is a subcategory called ‘business’ which presumably scopes the results to business news sources. When we tried it, it did seem to remove many of the top-ranked listings relating to the NFL, but one result for the FedEx Cup remained in the top 10 postings (sorted by most recent), which we presume survived because the article was published on TradingMarkets.com, which is deemed a business news source.

Results of searching Fedex in the News source types on Bing. Note the ability to filter by Business, Sports or Political source types listed in the left-hand menu.


In practice, we’ve tended to find Yahoo! Finance to be the easiest and quickest way to find recent business-oriented articles about a company sorted by date, but obviously this only works for companies that are public or large enough to be tracked by Yahoo! Finance, and even then some of the articles seem only loosely related to the company in question.

The difficulties we have been experiencing trying to find information through the “usual channels” – primarily Google, Bing – had us reading with interest a recent posting on TechCrunch: “Why We Desperately Need a New and Better Google” (https://techcrunch.com/2011/01/01/why-we-desperately-need-a-new-and-better-google-2/)

It was a posting that resonated deeply, as we have experienced many of the same issues: wading through the junk “compilation” sites that are nothing more than automatically gathered links to links and add zero value; increasing difficulty searching specifically for people; and problems with trying to find only recently written (as opposed to recently indexed) articles.

Inspired by the posting, we decided to check out Blekko, a search engine the author of the article and his team of students at the School of Information at UC Berkeley had used with some success, to see whether the functionality on offer would assist us with some of our search problems.

Blekko was founded in mid-2007 by a group who had previously worked at Topix and Netscape’s Open Directory. Blekko’s primary differentiator is the use of ‘slashtags’ to filter (or sort) search results. For example, using /people will filter to search results that are specifically about a person; /date sorts results by published (not indexed) date; topic slashtags, e.g. /health or /recipes, will filter the search to a curated subset of web sites dealing with these categories (thereby avoiding the spammers, the listers and other junk sites, as well as minimizing the problem of multiple meanings/contexts for terms). Blekko developed some initial topic slashtags, but users are free to create their own and use them for their own purposes or share them with others.
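Conceptually, a topic slashtag is simply a curated list of sites applied as a scope on the result set. A minimal sketch of that idea follows; the host list and results are entirely hypothetical, not Blekko’s actual data:

```python
# A minimal sketch of the idea behind a topic slashtag: restrict results to a
# curated list of hosts. The whitelist and results below are purely hypothetical.
from urllib.parse import urlparse

FINANCE_SLASHTAG = {"ft.com", "reuters.com", "bloomberg.com"}

results = [
    {"title": "FedEx raises guidance", "url": "http://www.reuters.com/article/fedex"},
    {"title": "FedExField parking guide", "url": "http://www.nflfansite.example/fedexfield"},
    {"title": "FedEx Cup standings", "url": "http://www.golfnews.example/fedexcup"},
]

def host(url):
    # Normalize away a leading "www." so the whitelist only needs bare domains.
    netloc = urlparse(url).netloc.lower()
    return netloc[4:] if netloc.startswith("www.") else netloc

scoped = [r for r in results if host(r["url"]) in FINANCE_SLASHTAG]
for r in scoped:
    print(r["title"])     # only the Reuters story survives the scope
```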

We searched Blekko for recent news stories about FedEx (Fedex /news /date). We would have liked to scope by business but unfortunately there is currently no ‘business’ slashtag. While the initial results were all company-related and recent (good!), the NFL had crept in by result 11.

Screenshot of results using Blekko and the Search: Fedex /news /date


We also noticed that there didn’t appear to be any results about share prices (compared with, for example, search results for Fedex filtered by NEWS and BUSINESS on Bing), so we tried the Finance slashtag as an alternative. This brought up a very mixed bag of results, a consequence of filtering by web site rather than topic. There were many mentions of the FedEx Cup because golf, it seems, is a well-reported topic on financial web sites! Obviously, if we were doing this frequently, it would be worthwhile creating our own slashtag to scope the results to those business information sources we found most useful for this topic.

The results of a search for recent information about technology at FedEx (Fedex /tech /date) show some of the difficulties of achieving precision with keyword searching, even when scoped by source: only the third article down is relevant.

Screenshot of Blekko search results for Fedex /tech


Without going into the realm of true semantic analysis and the semantic web, one mechanism that would help improve the relevancy of search results, in cases where a topic can have multiple foci within the same information-source context (e.g. FedEx as a company vs other companies’ incidental use of FedEx), would be to make more use of facets in the manner of many Solr implementations, or indeed SharePoint 2010 FAST search. That, in turn, would require the use of taxonomies and the indexing of content, which in a world-wide-web scenario would need to be automated rather than carried out by human content providers as happens in SharePoint environments.
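For readers unfamiliar with faceting, the sketch below shows what a typical Solr facet request looks like. The core name and field names (“articles”, “company”, “content_type”) and the localhost URL are our own assumptions for illustration, not an actual deployment; the query parameters themselves are standard Solr faceting syntax.

```python
# A hedged sketch of a Solr facet query: the core and field names are
# hypothetical, but facet=true / facet.field are standard Solr parameters.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "q": "fedex",
    "wt": "json",
    "rows": 10,
    "facet": "true",
    "facet.field": ["company", "content_type"],   # request counts per field value
}, doseq=True)

url = "http://localhost:8983/solr/articles/select?" + params

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)

# Facet counts come back as flat [value, count, value, count, ...] lists.
print(data["facet_counts"]["facet_fields"]["company"])
```

The facet counts are what a UI would render as “Refine by” options, letting the user narrow to, say, articles where FedEx is the subject company rather than an incidental mention.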

Snapshot of the results of a SharePoint 2010 Fast Search showing 'Refine by' options 


Overall, we do think the ability to filter search results by a curated set of web sites has potential, and we loved the ability to combine topic slashtags with the /date and /people tags to further refine and sort the results. We also liked the ability to declare a site as “spam” and have it forever banned from our search results (which we would have loved to have known about when trying to research a Drupal-related technical issue a few months ago). Another thing we appreciated about Blekko is its transparency. For instance, it is very easy to find which web sources are included in a slashtag’s scope: simply find the slashtag and drill down on the link. In contrast, we were unable to find which news sources were included in Bing’s news sourceType, or which business news sources in the news > business category.

On a very minor note: (1) it would be helpful to new users of Blekko to put a link to the list of slashtags on the home page; (2) when we searched for iChromatiq (we couldn’t resist!), our home page was listed 18th, after a series of postings for “aChromatic”. We can see why our web site ranked lower than the dictionary entry for ‘achromatic’ on dictionary.com (Blekko does make the reasons for page rankings explicit): the Blekko engine treats ichromatiq and achromatic as the same term and, since the ichromatiq web site has fewer inbound/outbound links than the dictionary.com entry for ‘achromatic’, it is ranked far lower. We would have no argument with this ranking if we had searched for ‘achromatic’ or if our web site was achromatic.com. But logically, shouldn’t a search for a specific term rank pages containing that specific term above pages containing terms which may be similar but are not identical? They are, after all, the best fit. Or, at least, like Google or Bing, ask the user whether they meant achromatic rather than ichromatiq and, based on the response, search accordingly. Just a thought!

Friday, December 10, 2010

Commetrix CMX Analyzer: Dynamic Social Network Visualization

Commetrix CMX Analyzer is a social network analysis platform from Trilexis (www.trilexis.com), a German company which originated in a research group at the Technical University of Berlin. (Note: the website, user interface and documentation are all in English.) What is interesting about this particular tool is its emphasis on the dynamics of social interactions over time. It achieves this through a data format that captures information about each individual link event, including not only originator, destination and time but also user-specified attributes such as communication mode (email, IM, Twitter), type of exchange (social, work, ecommerce) and topic (e.g. keywords extracted from the subject).
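Commetrix’s own file format is proprietary, but the kind of per-event record it builds on can be sketched as follows; the field names here are our own illustration, not the actual Commetrix schema:

```python
# Our own illustrative sketch of a per-link event record of the kind described
# above; the field names and data are invented, not the Commetrix schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class LinkEvent:
    originator: str        # sender of the message
    destination: str       # recipient
    timestamp: datetime    # when the exchange happened
    mode: str              # e.g. "email", "IM", "twitter"
    exchange_type: str     # e.g. "social", "work", "ecommerce"
    topics: tuple          # e.g. keywords extracted from the subject line

events = [
    LinkEvent("alice@example.com", "bob@example.com",
              datetime(2000, 1, 14, 9, 30), "email", "work", ("contract", "review")),
    LinkEvent("bob@example.com", "carol@example.com",
              datetime(2000, 1, 14, 10, 5), "email", "work", ("contract",)),
]

# Every view (time animation, ego views, keyword filters) is then just a
# different aggregation or filter over this list of events.
print(len(events), "link events loaded")
```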

Commetrix CMX Analyzer User Interface

A small subset of the Enron email dataset (from its size and the individuals referenced, we are guessing a single custodian) is provided for demonstration purposes. Part of our interest in this particular software is that we are familiar with the Enron dataset and had researched it using the social network analysis functionality of an eDiscovery system called MetaLINCS. We were curious to see what additional insights CMX Analyzer might provide.

CMX Analyzer is a desktop tool built in Java, incorporating the 3D graphical capabilities of Java 3D and the Java Media Framework. Once we had obtained the license key, the application was straightforward to install, and it comes with a user guide. To date we have only been able to try it out on the sample data set provided, as the process of creating new data sets requires end-user coding (of link attributes) followed by a data transformation process that requires a separate tool (Commetrix Producer) or the data being sent to Trilexis for processing on their systems.

Commetrix Data Preparation Process:

Commetrix is not as functionally or visually rich as some of the other tools we have investigated and reported on in previous blogs (e.g. Gephi, NodeXL). However, where it comes into its own is in the dynamic visualization of email communications over time. The MetaLINCS software we had used in the past had provided a “time-slider” but was essentially a “snapshot” approach. Commetrix has time-sliders too but also animates the traffic, creating a unique perspective on what is, after all, a time-based series of events. (We should also warn readers that the resulting animations make for highly addictive viewing. We were totally captivated!) The start and end of the time period can be set, as can the intervals and speed of animation. It is also possible to run the timeline backwards as well as forwards. This makes it possible to identify “hot spots” of communication activity between group subsets at particular points in time. For other types of communication, e.g. Twitter or Facebook, we can see how this would provide valuable insight into the evolution of a topic of discussion or a social group.
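The mechanic behind such an animation is straightforward: slice the link events into time windows and draw the induced graph for each window. A minimal sketch with invented events:

```python
# A minimal sketch of the mechanic behind a time-slider animation: bucket link
# events into monthly windows and build the induced graph for each window.
# The events below are invented.
from collections import defaultdict
from datetime import datetime
import networkx as nx

events = [
    ("alice", "bob",   datetime(2000, 1, 14)),
    ("bob",   "carol", datetime(2000, 1, 20)),
    ("alice", "carol", datetime(2000, 12, 2)),
    ("carol", "dave",  datetime(2000, 12, 9)),
]

frames = defaultdict(nx.MultiGraph)   # one graph per (year, month) frame
for sender, recipient, when in events:
    frames[(when.year, when.month)].add_edge(sender, recipient, time=when)

# Stepping through the frames in order is, in essence, the animation.
for frame in sorted(frames):
    g = frames[frame]
    print(frame, "nodes:", g.number_of_nodes(), "messages:", g.number_of_edges())
```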

Snapshot of Communications: Jan 2000

Snapshot of Communications: Dec 2000
Visually, Commetrix is more limited than some of the other packages we have used, e.g. it is not possible to pan or zoom. There are options to change node size and color to represent parameters such as communications sent, communications received and number of direct contacts. Color schemes cannot be chosen directly but can be set to show selected attributes; e.g. the following screenshot shows nodes color-coded by the ‘function’ attribute, where dark blue represents employees, pale blue represents directors, green represents traders, wholly purple circles represent managers and purple circles with yellow centers represent in-house lawyers. (Note: we found the use of fully and semi-colored circles somewhat confusing.)

Colorcoding by Function

Included in Commetrix is an “egoview” option which allows you to select a particular node and investigate communication to and from that individual. Links can be filtered to include only direct communications (a 1-step link) or communications involving two or more steps. The image below, for example, shows communications to and from Sara Shackleton. While this capability is helpful for focusing down on traffic to and from a node, in the case of email communications where the data set comes from only one custodian, the egoview has limited value when applied to anyone other than that custodian, as it will show only those communications that happen to have been referenced in emails sent to and from the primary custodian, i.e. it is an imperfect sample.
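The same kind of ego view is easy to reproduce with general-purpose graph tooling. A sketch using networkx’s ego_graph on an invented mail graph:

```python
# A sketch of an "ego view" on an invented mail graph: keep only the selected
# node, its direct contacts (radius=1) and the links among them.
import networkx as nx

mail = nx.Graph()
mail.add_edges_from([
    ("sara", "mark"), ("sara", "tana"), ("mark", "tana"),
    ("tana", "jeff"), ("jeff", "louise"),
])

ego = nx.ego_graph(mail, "sara", radius=1)
print("ego network of sara:", sorted(ego.nodes()))   # mark, sara, tana
print("links shown:", sorted(ego.edges()))
```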

Screenshot Showing Ego View - Tana Jones

Commetrix also comes with a keyword filter. The intent is to allow the user to focus on interactions “about” the selected keywords. The interface is less obvious than some of the other areas, and we confess to wondering if there was a bug until, rereading the manual, we realized that “In” didn’t mean “inbound” but “include”, and “Out” meant “exclude”. Selecting the terms was also rather tedious, as it meant scrolling through a long list of options. To validate the filtering, we took ‘california’-related terms and looked to see if Jeff Dasovich was included, which he is (see screenshot below). It would be interesting to see this concept better developed, with better keyword lists, more complex keyword filtering options and possibly the employment of automated topic determination techniques such as keyword clustering.

Screenshot Showing Use of Keyword Filter
Although the Enron data set was provided only for demo purposes, having worked with this data we were curious about two things. Firstly, how were the keywords derived? We guessed the email subject, but some of the keywords were email domains, indicating other metadata might have been used as well, and some phrases had been concatenated (e.g. ‘californiaattached’), included a leading article (e.g. ‘thenumber’), or were word fragments (e.g. ‘t’, ‘e’). Secondly, and more importantly, how were the “identities” of the individuals represented by the nodes resolved? This is always a major issue in email communications if the only information about senders and recipients is an email address. Most individuals have multiple email addresses, even within companies, and the names on email addresses may be difficult to resolve to a single individual. We raise this question because MetaLINCS included functionality that attempted to link individuals with their email accounts based not only on email address but also on communication patterns. Even then, many individuals/email accounts that a human would identify as probably being connected could not be automatically linked. We are guessing that the identities of individuals were manually coded, since the node table has a clean one-to-one mapping between individuals and a single email address.
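The simplest (and probably manual) approach to that resolution is an alias table mapping every known address to one individual. A sketch of that idea; the addresses below are invented for illustration:

```python
# A sketch of the simplest form of identity resolution for email data: a
# manually curated alias table mapping known addresses to one individual.
# The addresses are invented for illustration.
ALIASES = {
    "sara.shackleton@enron.com": "Sara Shackleton",
    "sshackl@enron.com":         "Sara Shackleton",
    "shackleton@aol.com":        "Sara Shackleton",
    "tana.jones@enron.com":      "Tana Jones",
}

def resolve(address):
    # Fall back to the raw address when no mapping is known.
    return ALIASES.get(address.lower(), address)

print(resolve("SShackl@enron.com"))    # Sara Shackleton
print(resolve("unknown@enron.com"))    # unknown@enron.com
```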

In summary, while we think some of the other software we have used and researched offers better social network visualization options, we really liked the timeline animation Commetrix provides and believe it could be very helpful when studying the evolution of a network or of communication patterns over time. While the keyword filtering option was disappointing in both its implementation and the demo dataset provided, we think it has obvious potential, particularly when analyzing large data sets of email, IM and Twitter traffic, in enabling users to focus in on only those communications “about” a particular topic. Of course, with that come all the provisos of using keywords as a substitute for “aboutness”, but if it were combined with stemming, a better stop word list and some form of thesaurus (to apply synonyms automatically), it would be very powerful.
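To illustrate what that combination might look like, here is a rough sketch of keyword normalization with a stop word list, very light stemming and a synonym map. The word lists are illustrative only, and the crude suffix stripping merely stands in for a real stemmer:

```python
# A sketch of the keyword normalization suggested above: drop stop words,
# apply light stemming and map synonyms onto one canonical term.
STOP_WORDS = {"the", "a", "an", "and", "to", "of", "re", "fw"}
SYNONYMS = {"ca": "california", "calif": "california", "attorneys": "lawyer"}

def stem(word):
    # Very light suffix stripping, standing in for a real stemmer.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(subject):
    terms = []
    for raw in subject.lower().split():
        word = raw.strip(".,:;!?()'\"")
        if not word or word in STOP_WORDS:
            continue
        word = SYNONYMS.get(word, word)
        terms.append(stem(word))
    return terms

print(normalize("RE: the California numbers attached"))
# ['california', 'number', 'attach']
```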

Sunday, November 21, 2010

Anatomy of a Professional Community Portal

Our mission on this project was to create a portal for a professional community. The Portal aimed to support the usual range of community functionality, such as blogging, forums and aggregated news feeds; host curated, searchable libraries of documents, from standard operating procedures to equipment manuals; and provide browsable, searchable directories of key information such as suppliers of equipment or professional programs. In addition, it needed to provide custom, secure workspaces where groups of users could collaborate on activities such as ISO accreditation and quality assurance. These workspaces needed to assist the group in managing the process, monitoring events and storing relevant documents and submissions in one easy-to-find place. The Portal also had to support an ecommerce area where merchants could sell equipment, training courses and quality assurance/proficiency testing programs.

Community Portal Functionality


Since there was a very limited amount of time in which to produce a demonstrable system and an even more limited budget for developing it, we opted to use Drupal as the underlying platform. Some of the pros and cons of this decision, and a comparison with SharePoint, were discussed in the previous blog. The combination of Drupal core, Views and Ubercart enabled us to roll out the ecommerce area and basic community features such as blogs, forums, job center, news aggregation rapidly and with relatively little direct coding.

An Example of a Browsable, Searchable Directory


Drupal’s taxonomy infrastructure, together with Views and Drupal’s core content management and search capabilities, made it very straightforward to roll out a number of different libraries and directories that were both easy to set up and easy for content managers to add to and edit. For situations in which content can be contributed by more than a small group of content managers/writers, Drupal supports workflow management although – like many Drupal functions – it does require a little more work to set up than SharePoint 2010’s more plug-and-play approach.

For the workspaces, we created permission-controlled secure areas that featured a mix of calendars and events, lists (e.g. task lists, subscription lists, member lists) and content libraries (e.g. standard guides, test submissions and test results). Each workspace was set up to support multiple projects within the overall activity type.

An Example of a Secure, Custom Workspace


The area we found least satisfactory was Drupal’s out-of-the-box submission forms for anything other than standard content such as documents and blogs. They did not provide a satisfactory interface for more complex data submission, and we are currently testing various form modules and functional extensions to rectify this.

All in all, we found Drupal a very powerful and effective platform for building a professional community portal. As in any IT project, planning and design are essential ingredients of long-term growth and maintainability. In particular, we would recommend careful consideration of the information architecture in advance of any development. Drupal is underpinned by a relational database, and the same considerations of redundancy, normalization and entity relationships that hold in conventional system design hold for Drupal development and design too. Consideration needs to be given to the relationships between the objects that Drupal nodes represent, and data dictionaries should be set up to define each field. Doing this, you can leverage the power of Views to create a functionally rich, maintainable portal.

Monday, October 11, 2010

SharePoint vs Drupal: A “hands-on” comparison

Recently, we have found ourselves in the unusual position of building two content management oriented sites at the same time: one in SharePoint 2010 Foundation and one in Drupal. While there are various blogs and commentaries out there on the web about the pros and cons of the two, they are mostly written from the point of view of either a system administrator or a developer. In these projects, we are using third-party hosting (so no systems administration) and trying not to code but to use the out-of-the-box functionality, so we hope this blog will provide a different and practical perspective to anyone considering these platforms as options.

In our situation, the choice of platform was dictated by client needs: low cost with ecommerce on the one hand, and a company-internal, office team environment on the other. Both systems are being hosted by third parties, so we did not have to worry about systems administration. We did install Drupal for our development environment and note that, as everyone has commented, it is very straightforward to set up, whereas our previous experience of on-premise SharePoint required significant input and ongoing maintenance from systems engineering. For SharePoint, we are working with SharePoint 2010 Foundation, which has some significant functional limitations compared with the “Standard” version. For Drupal, we are working with version 6.15 and using Panels, Views, PathAuto, ImageCache and Ubercart as our base platform.

In both cases, the intention was to see how far we could go using the system “out-of-the-box” and without coding, which, by and large, we have been able to do. (Although we have to confess to a quick code tweak in Drupal to change the name on a search button from APPLY to SEARCH.) With both systems, we found ourselves frustrated initially by the fact that we had less control over the individual look-and-feel of a page than we were used to in a conventional build-it-yourself, non-templated environment. However, once we adapted, we loved the fact that we could focus on content and functionality and know that the look-and-feel would be consistently applied, and that we didn’t have to design every style and control ourselves.

The steepest learning curve by far was with Drupal, which is to be expected since it is very much intended to be a Lego-like platform with a wealth of options. The quantity and range of available Drupal contributed modules is its great strength and a significant advantage over the more monolithic SharePoint. On the other hand, many times we found ourselves spending hours “shopping” for new modules. While not an unhappy experience (we like shopping!), we had to be quite strict with ourselves to avoid becoming module experts who hadn’t actually built anything!

Another advantage of Drupal is the availability of sophisticated and varied themes. There are 759 freely available on the Drupal site, plus many more that can be purchased for less than $100. This is a huge plus, making it easy to get a reasonable-looking site up and running without spending significant effort designing and coding stylesheets. And if you want to make minor changes to your theme (which you inevitably will), you can make local modifications to the theme stylesheet and/or use a module like CSS Injector to set up rule-based overrides. With SharePoint, the out-of-the-box choice is mostly limited to the color palette, which is OK for company-internal sites, but anyone developing for external use is going to need more, and a broader library of available templates would be useful. Yes, you can use SharePoint Designer, but it is much more effort than CSS tweaking in Drupal.

Drupal's Theme Index


SharePoint’s strengths are undoubtedly its tight integration with Office and the ease of use of its out-of-the-box content management functionality. Once you have mastered the concepts of libraries and lists, you can very quickly create a functional CMS, with most effort going, as it should, into organizing the content. The Office Ribbon look-and-feel and the more consistent user interface in SharePoint 2010, as compared with earlier versions, mean that complete beginners can become effective users in a very short space of time. The multi-file upload feature is a joy: it’s fast, it’s easy to use and it makes large-scale document upload a pain-free operation. The search site gives you effort-free, site-collection-wide search capability and indeed, even at the Foundation level, we have found SharePoint’s searching to be fast and efficient (maybe even over-efficient, as we are not sure of the usefulness of indexing every Excel cell), and users love what they describe as the “Google-style” result displays. Users also like being able to sync their contacts and calendar with Outlook.

The SharePoint 2010 Ribbon


For internal content management systems, SharePoint 2010 is a no-brainer, and a hosted option removes the pain of system setup and administration. However, it could have been, should have been, so much better. It is the small things that don’t quite work that bring SharePoint down. Like the missing spellchecker in the editor (see our previous blog), or the fact that you can’t automatically set the calendar display to show multiple users’ events. The “wiki”-style content creation feature isn’t quite there yet either. In an office/work environment, you often need to create “ordered” content with some kind of index page: “How To” documents, for example. SharePoint wiki pages, while searchable and linkable, cannot be explicitly ordered, and Foundation doesn’t even have tagging options. After using Drupal Views, we also found the limitations on SharePoint list settings frustrating and unnecessary. If Views allows you to set multiple filters and sort levels, why can’t SharePoint, since the underlying architecture (SQL) is fundamentally the same? However, we do note that the UI for SharePoint’s list setup is far more intuitive and can be readily used by non-programmers, whereas Views took some getting used to and is definitely not intuitive.

The downside of Drupal is the learning curve and the fact that you do have to set up and configure much of the functionality you want. While the extensive range of available modules means that most of this can be done without coding, it still takes some time to research and install them. And although there are many helpful blogs and commentaries on various aspects of Drupal (for which we are profoundly grateful: what did we do before Google?), interfaces for the more complex modules are often not at all intuitive, and documentation can be sparse or written from a developer perspective that assumes you are going to want to code. Panels is an example of a module where more extensive documentation and some cookbook examples would have been very helpful.

In summary, there is a place for both Drupal and SharePoint. Each has its strengths and weaknesses, and neither is perfect. Both are impressive in how much functionality is available and configurable without coding. For company-internal content management, SharePoint would be our first choice, and a hosted version makes it easy to get up and running in a matter of days if not hours (as well as being cost-effective compared with purchasing an on-premises license). For external sites needing a broad range of functionality such as ecommerce, Drupal is a great option. It’s hard to beat free, and the extensive ecosystem of freely available modules and themes makes it easy to put together a site that has a stylish look-and-feel and rich functionality while never (or almost never) having to cut a line of code.

Saturday, September 18, 2010

Analyzing Email Communications: An Ego-Centric Approach

As a quick scan through prior blogs will show, throughout this year we have been exploring the application of social network visualization software to email communications. Our interest has been two-fold: firstly, finding tools to support those working in legal and regulatory environments who need to examine large numbers of emails for answers to “Who, What, Where, When and Who Knew” kinds of questions; and secondly, seeing whether this approach might provide behavioral psychologists with tools to identify and/or objectively measure communication issues in workplace teams. In many workplace situations, email has become the primary communication mechanism, whether through cultural factors (as with many IT teams) or because of distance (as with geographically dispersed teams). At the same time, communication issues are cited as one of the primary reasons why projects fail. It seemed to us that tools for analyzing the flow of email communications in a team might help identify team members who are outside the group, or who have significantly fewer interactions with key individuals in the team, thereby enabling remedial action to be taken.

Software we have looked at so far includes Gephi, which is useful for large data sets, and NodeXL, which is easy to use and useful for analyzing smaller groups of individuals, with great options for customizing the appearance of the graphs, e.g. color coding particular attributes or clusters. Data feeds into both are organized basically as edge lists and node lists, with Gephi requiring XML formatting and NodeXL a spreadsheet or CSV list. (Note: in an email environment, a node is an individual, represented by either an email address or a name, and an edge is the communication between two individuals, with the volume of communications represented by a weight measure.) The visualizations produced look at communication and clustering from a bird’s-eye view across the entire data set.
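To make the data formats concrete, the sketch below builds a small, invented weighted edge list and writes it once as a NodeXL-style CSV and once as GraphML, an XML-based graph format that Gephi can open:

```python
# A sketch of the edge-list form both tools consume: a weighted "who emailed
# whom" list, written as a NodeXL-style CSV and as GraphML for Gephi.
# The individuals and counts are invented.
import csv
import networkx as nx

edges = [
    ("carmela", "tony",   42),   # 42 emails exchanged
    ("tony",    "silvio", 17),
    ("carmela", "meadow",  9),
]

# NodeXL: a simple Vertex 1, Vertex 2, Weight spreadsheet/CSV.
with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Vertex 1", "Vertex 2", "Weight"])
    writer.writerows(edges)

# Gephi: an XML-based graph format such as GraphML.
G = nx.Graph()
for sender, recipient, weight in edges:
    G.add_edge(sender, recipient, weight=weight)
nx.write_graphml(G, "edges.graphml")
```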

UCINET takes a somewhat different approach. UCINET is a social network analysis program developed by university researchers at the University of Kentucky and distributed by Analytic Technologies (see www.analytictech.com/ucinet/). There is a free trial version, and there are relatively low-cost options for students, researchers and single users.

Unlike NodeXL or Gephi, UCINET is not a complete visualization package but only the analytic engine. It is, however, integrated with a freeware program called NETDRAW. Since both are included in the download package, installation is straightforward. We did find in practice, though, that the package behaves like a set of separate tools operating on a common data set, compared with the more integrated environments of NodeXL or Gephi. Another difference is that UCINET works on matrices, not edge/node lists. Fortunately, it has an import function which accepts a standard edge list (e.g. person1, person2, weight) in Excel format. The import function then converts this into a matrix for analysis and visualization.
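The conversion the import performs is easy to picture: each (person1, person2, weight) row becomes an entry in a square adjacency matrix. A sketch of the same transformation with invented data:

```python
# A sketch of the kind of conversion UCINET's import performs: turning a
# (person1, person2, weight) edge list into a square adjacency matrix.
# The data is invented. Requires numpy alongside networkx.
import networkx as nx

edge_list = [
    ("carmela", "tony",   42),
    ("tony",    "silvio", 17),
    ("carmela", "meadow",  9),
]

G = nx.Graph()
G.add_weighted_edges_from(edge_list)

people = sorted(G.nodes())
matrix = nx.to_numpy_array(G, nodelist=people, weight="weight")

print(people)
print(matrix)   # symmetric matrix; entry [i][j] = weight between person i and j
```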

Our test data set is the same as before: an anonymized set of email communications. For this investigation we started with a small subset of 368 nodes and 1223 edges.

NETDRAW visualization of entire email network


While NETDRAW is by no means as sophisticated as the graphical packages in Gephi or even NodeXL, where the UCINET/NETDRAW package comes into its own is in its ability to home in easily on a selected set of individuals. A checklist menu of nodes appears on the right-hand side of the graph, and altering the selections immediately redraws the graph showing only those individuals and their connections. We think this is very helpful when drilling down to investigate the interactions between a particular group of people.

Another great feature of UCINET/NETDRAW is its ability to visualize interactions from an “ego” perspective. By selecting an initial “ego”, the software identifies all the individuals in communication with the selected individual and produces a subgraph of communications between them. For example, simply selecting “Carmela Soprano” produced the following subgraph.

"Carmelo Soprano" Ego Network Graph


NETDRAW can be configured to represent the volume of communications as the size of the link:

Network Graph with Link Width Representing Communication Volume


Or with the volume shown in a link label:

Network Graph with Link Label Showing Communication Volume


UCINET offers a range of node centrality measures including Closeness, Betweenness, Degree and Eigenvector. (For information about what these measures represent, see previous blogs or go to: http://en.wikipedia.org/wiki/Betweenness_centrality#Eigenvector_centrality). Once the measures are calculated, nodes can be colorized to represent one of the selected measures. For example, the nodes on the sub-graph below have been colorized to represent the value of the Indegree attribute.
It is also possible to filter based on a particular measure. The graph below shows the entire set filtered to show only nodes with high Eigenvector scores (a measure of the importance of the individual in the network).
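These measures are standard and can be reproduced outside UCINET. The sketch below computes them with networkx on an invented communication graph and keeps only the nodes with a high eigenvector score; the threshold is arbitrary:

```python
# A sketch of the standard centrality measures and an eigenvector-based filter,
# computed with networkx on an invented communication graph.
import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("tony", "carmela"), ("tony", "silvio"), ("tony", "paulie"),
    ("silvio", "paulie"), ("carmela", "meadow"), ("paulie", "chris"),
])

degree      = nx.degree_centrality(G)
closeness   = nx.closeness_centrality(G)
betweenness = nx.betweenness_centrality(G)
eigenvector = nx.eigenvector_centrality(G)

# Keep only the "important" nodes, i.e. those with a high eigenvector score.
threshold = 0.4
important = [n for n, score in eigenvector.items() if score >= threshold]
print("high-eigenvector nodes:", sorted(important))
```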

Network filtered by Eigenvector Measure (to show 'Important' individuals only)



UCINET/NETDRAW also has a number of algorithms for analyzing subgroups. For example, in the subgraph below (an “ego” network for Tom Hagen), it has identified three factions, represented by three different colors: red, blue and black.

Graph identifying Factions within a Subgroup


An analysis of cliques in the entire set identified 60 separate groups shown in the graph below.

Graph showing the 60 cliques identified in the data set


What we liked about UCINET/NETDRAW is the ease with which we could explore the involvement of particular individuals in the network using the ego feature combined with the filtering and attribute-based node coloring. We also liked the wide range of analysis options, which included not only the standard centrality measures but also various clustering algorithms and analyses of cliques and subgroups. While more extensive documentation would have been helpful (although we do appreciate that this was initially developed as a research tool), we were impressed that, whatever we did to it, it never crashed and always managed to catch errors gracefully.

Saturday, September 11, 2010

The Case of the Missing Spell Checker

A recent project involved creating a proof-of-concept SharePoint 2010 Foundation site(s) for a client. The aim was to demonstrate some of SharePoint’s collaboration features and show how the platform could support various teams within the client’s organization. In setting up the demonstration, we decided to create a small Knowledge Base using the built in content creation tools.

The new page editing tools are certainly easier to use than in previous versions of SharePoint, and adding in pictures is a cinch. The range of styles and fonts is also much improved. We did think the mechanism for linking pages, while very wiki-like, could have been made easier for less tech-savvy users. More importantly, since Foundation users do not get the content management and tagging features of the Standard and Enterprise versions, better tools for organizing the pages, other than simple links, would have been helpful. For example, it would have been nice to be able to designate one of the pages as the “Home Page” of the Knowledge Base. Another great feature would have been an “Index Page” with an automatically created index of pages in the wiki.

SharePoint 2010 Foundation Content Editor: Insert Options


SharePoint 2010 Foundation : Text Editing Options


It wasn’t until someone pointed out a glaring spelling error in the copy we’d been writing for the Knowledge Base that we realized that, most strangely, there isn’t any form of spell checker in the content editor. At first we thought we’d simply mislaid it somewhere in the ribbon, but after looking high and low for it and checking several blogs, we realized that it in fact doesn’t exist in Foundation. Microsoft skirts round the issue by declaring that spell checking exists in Standard and Enterprise, thereby carefully not saying that it doesn’t exist in Foundation.

This seems to us very strange and a significant drawback to Foundation (which is almost certain to be the de facto hosted version). After all, blog platforms and software like Blogger, on which ChromaScope is hosted, have incorporated spell checkers for some time now.

Blogger's Editing Options (Spell Check is the last icon on the right)


Intrigued, we decided to do a quick comparison of functionality between the HTML editors in Blogger and SharePoint 2010.

| Feature | Blogger | SharePoint 2010 Foundation |
| --- | --- | --- |
| Cut/Copy/Paste | Yes | Yes |
| Font Styles | Yes (7 available) | Yes (13 available) |
| Font Color | Yes (limited range) | Yes (extensive range) |
| Strike-through/Superscript/Subscript | Strike-through only | Yes |
| Highlight Text | Yes | Yes |
| Paragraph Formatting (e.g. justification) | Yes | Yes |
| Style Gallery (e.g. Byline) | Quote only | Yes (7 available) |
| Markup Style Gallery (e.g. Heading 1) | Title and Body only (from blog content editor) | Yes (14 available) |
| Text Layout (e.g. columns) | Yes, but through Page Design rather than the content editor | Yes |
| Insert Picture/Image | Yes | Yes |
| Insert Video | Yes | Yes (but not as obvious how to do this) |
| Insert Link | Yes | Yes |
| Insert Jump Break | Yes | No |
| Insert Table | No | Yes |
| Select Elements based on HTML tag | No | Yes |
| Check In/Check Out | No (but the publish function enables users to decide when pages become publically available) | Yes |
| Tagging | Yes | No |
| Edit HTML Source | Yes | Yes |
| Page Templating | Yes | Yes, but by using SharePoint Designer |
| Language Support | Yes, including non-latin | Extensive, including non-latin |
| Spell Checking | Yes | No |

While, overall, SharePoint 2010 Foundation has a very rich content editor, some of the features and the rather technical HTML-element orientation may be difficult for the general user or, more likely, may simply languish unused. Blogger, on the other hand, with the exception of the option of easily adding a table, has all the features the general user/content creator would need to compose content AND a spell checker! Hopefully Microsoft will take note of the feedback that we, and we are sure everyone else, will give them and make the text editor in SharePoint 2010 Foundation more like an easy-to-use content editor and less like an HTML editor for web designers.