Saturday, July 10, 2010

Analyzing Email Communications: Processed vs Unprocessed Data

In the previous post, we looked at using NodeXL to visualize communication patterns on emails that had bee preprocessed. In other words, we had run the original email file through a software tool that extracted metadata such as To, From, CC, Subject, Date Sent and stored it in a SQL database. The software we were using also extracted a Person’s Name from the email address.

For import into NodeXL, we created an edge list with the fields: PERSON_NAME1, PERSON_NAME2, CONNECTION COUNT (i.e. the number of communications between the people concerned) by simply querying the database and exporting into Excel. Later on, when we wanted to develop the visualization and look at clustering, we were able to use the database to generate a list of node attributes (e.g. Family Membership) and import that into NodeXL. For Gephi, we followed a similar process except that we output into the required XML format.

The benefits of this approach were brought home to us when we tried the Email Import feature in NodeXL. This function allows you to import network information from your personal email file into NodeXL and to configure the resulting network display. Unfortunately it is limited at present to import of personal email only – which limits its applicability. It would have been nice to have had the option to point it at some sample PSTs e.g. from the Enron data set. (And yes, we know there are workarounds to this and had the result of our test been exceptional, we might have spent time setting it up).


The import process is very simple – a click of the button if you want everything, slightly longer if you want to filter by recipient or time – and pretty quick. The resulting network retains the directionality of the email communications – which we had stripped out of our sample data. (Note: that was by choice, we could have retained it in the sample since the database captured the metadata field from which the name had been extracted).

However, we found the results of this approach not as clean or as insightful as when processed email data was used and it made us appreciate the value of preprocessing first:
(1) People’s names are almost always shorter than their Email address which makes the resulting node labels easier to work with and display.
(2) Using processed data, it is often possible to resolve multiple email addresses into the same identity. This is not a perfect science but a little text manipulation and some judicious review and editing can get you a long way. Some processing software will even support this process. With so many people holding multiple emails accounts: work and personal – this is not an insignificant issue.
(3) Processed data – because it gives you access to all the metadata – enables the network to be enriched with additional information about each node e.g. Organization, Domain. These attributes can then be used, for example, to cluster groups of nodes and provide additional insight (e.g. perhaps Operations isn’t communicating with Sales and vice versa). Attributes such as Month/Year Sent could be added to Edges.
(4) And if the metadata isn’t enough, and there is other information available that can be mapped to the individuals identified in the communications, (role maybe or demographics such as age and gender), with some minimal database work, the email network can be enriched with this information too.
(5) If the data is being imported from a database of processed email, the number of edge-pairs and nodes is known. If NodeXL is applied directly to an email file it isn’t and that means that you could very easily outstrip the capabilities of NodeXL which is designed to handle networks of a few thousand rather than tens of thousands of nodes.

Example of a Network of Email Addresses Showing Directionality with Nodes Sized and Colored by Eigenvector Centrality (i.e. Level of Importance) laid out using the Harel-Koren Method.

Sunday, July 4, 2010

Visualizing Email Communications using NodeXL

Email has become an integral part of communication in both the business and personal spheres. Given its centrality, it is surprising how few tools are generally available for analyzing it outside specialist areas such as Early Case Assessment tools within the litigation area: Xobni being a notable exception at the individual level. However, the rise of social network analysis, and the tools that support it, may change that. Graph theory is remarkably neutral as to whether it is applied to Facebook Friend networks or email communications within a Sales and Marketing division.

In a previous post, we reported on using Gephi – an open source tool for graphing social networks – to visualize email communications. In this post, we look at using NodeXL for the same purpose. We used the same email data set before – the ‘Godfather Sample’ – in which an original email data set was processed to extract the metadata (e.g. sender, recipient, date sent, subject) and subsequently anonymized using fictional names.

NodeXL is a free and open source template for Microsoft Excel 2007 that provides a range of basic network analysis and visualization features intended for use on modest-sized networks of several thousand nodes/vertices. It is targeted at non-programmers and builds upon the familiar concepts and features within Excel. Information about the network, e.g. node data and edge lists, is all contained within worksheets.


Data can be simply loaded by cutting and pasting an edge list from another Excel worksheet but there are also a wide range of other options including the ability to import network data from Twitter (Search and User networks), YouTube and Flickr and from files in GraphML, Pajek and UCINET Full Matrix DL file formats. There is also an option to import directly from an Email PST file which we will discuss a following post. In addition to the basics of an edge list, attribute information can be associated with each edge and node. In our “Godfather” email sample, we added a weighting for communication strength (i.e. the number of emails between the two individuals) to each edge and the affiliation with the Corleone family to each node.

Once an edge list has been added, the vertices/node list is automatically created and a variety of graphical representations can be produced depending on the layout option selected, (Fruchterman Riengold is the default but Harel-Koren Fast Multiscale as well as Grid, Polar, Sugiyama and Sine Wave options are also available), and by mapping data attributes to the visual properties of nodes and vertices. For example, in the graph shown below, nodes were color coded and sized with respect to the individual’s connections with the Corleone family: blue for Corleone family members, green for Corleone allies, orange for Corleone enemies and Pink for individuals with no known associations with the family.



The width of the edges/links was then set to vary in relation to the degree of communication between the two nodes i.e. the number of emails sent between the two individuals concerned.


Labels can be added to both nodes and links showing either information about the node/link or its attributes, as required.






Different graph layout options are available which may be used to generate alternative perspectives and/or easier to view graphs.

Harel-Koren Layout


Circle Layout


Because even a small network can generate a complex, dense graph, NodeXL has a wide range of options for filtering and hiding parts of the graph, the better to elucidate others. The visibility of an edge/vertex for example, can be linked to a particular attribute e.g. degree of closeness. We found the dynamic filters particularly useful for rapidly focusing on areas of interest without altering the properties of the graph themselves. For example, in the following screenshot we are showing only those links where the number of emails between the parties is greater than 40. This allows us to focus on individuals who have been emailing each other more frequently than the average.


In addition to graphical display, NodeXL can be used to calculate key network metrics including: Degree (the number of links on a node and a reflection of the number of relationships an individual has with other members of the network) with In-Degree and Out-Degree options for directed graphs, Betweenness Centrality (the extent to which a node lies between other nodes in the network and a reflection of the number of people an individual is connecting to indirectly), Closeness Centrality (a measure of the degree to which a node is near all other nodes in a network and reflects the ability of an individual to access information through the "grapevine" of network members) and Eigenvector Centrality (a measure of the importance of an individual in the network). In an analysis of email communications, these can be used to identify the degree of connectedness between individuals and their relative importance in the communication flow.

For example, in our Godfather sample, we have sized the nodes in the graph below by their Degree Centrality. While Vito Corleone is, as expected, shown to be highly connected, Ritchie Martin – an individual not thought to have business associations with the Corleone family, is shown to be more connected than supposed.

Node Sized by Degree Centrality


When we look at the same data from the perspective of betweenness, we see that Vito, Connie and Ritchie all have a high degree of indirect connections.

Nodes Sized by Betweenness Centrality


And the Eignevector Centrality measure confirms Vito Corleone's signficance in the network as well as Connie's, two "allies" - Hyman Roth and Salvatore Tessio and two individuals  Ritchie Martin.

Nodes Sized by Eigenvector Centrality


Last but not least, it is also possible to use NodeXL to visualize clusters of nodes to show or identify subgroups within a network. Clusters can be added manually or generated automatically. Manually creating clusters requires first assigning nodes to an attribute or group membership and then determining the color and shape of the nodes for each subgroup/cluster. In our GodFather example, we used “Family” affiliation to create clusters within the network but equally one could use organization/company, country, language, date etc.
"Family Affiliation" Clusters Coded by Node Color

Selected Cluster (Corleone Affiliates)

NodeXL will also generate clusters automatically using a clustering algorithm developed specifically for large scale social network analysis which works by aggregating closely interconnected groups of nodes. The results for the Godfather sample are shown below. We did not find the automated clustering helpful but this is probably a reflection of the relatively small size of the sample.

In the next post, we will look at importing email data directly into NodeXL and compare approaches based on analyzing processed vs unprocessed email data.

Larger Email Network Visualization

To download NodeXL, go to http://nodexl.codeplex.com//. We would also recommend working though the NodeXL tutorial which can be downloaded from: http://casci.umd.edu/images/4/46/NodeXL_tutorial_draft.pdf


A top level overview of social network analysis and the basic concepts behind graph metrics can be found on Wikipedia e.g. http://en.wikipedia.org/wiki/Social_network and http://en.wikipedia.org/wiki/Betweenness_centrality#Eigenvector_centrality