Saturday, July 10, 2010

Analyzing Email Communications: Processed vs Unprocessed Data

In the previous post, we looked at using NodeXL to visualize communication patterns in emails that had been preprocessed. In other words, we had run the original email file through a software tool that extracted metadata such as To, From, CC, Subject and Date Sent, and stored it in a SQL database. The software we were using also extracted a person’s name from each email address.

For import into NodeXL, we created an edge list with the fields: PERSON_NAME1, PERSON_NAME2, CONNECTION COUNT (i.e. the number of communications between the people concerned) by simply querying the database and exporting into Excel. Later on, when we wanted to develop the visualization and look at clustering, we were able to use the database to generate a list of node attributes (e.g. Family Membership) and import that into NodeXL. For Gephi, we followed a similar process except that we output into the required XML format.
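
As a rough illustration of the edge-list step, here is a minimal Python sketch. It assumes the processed metadata has already been pulled out of the database as (sender, recipients) rows; the sample names and the helper functions build_edge_list and to_csv are hypothetical and not part of NodeXL or of the processing software we used. It drops directionality, as we did in our sample data, by treating each pair of people as unordered.

```python
import csv
import io
from collections import Counter

# Hypothetical rows as they might come out of a processed-email database:
# (sender name, recipient names taken from the To and CC fields).
messages = [
    ("Alice Smith", ["Bob Jones", "Carol White"]),
    ("Bob Jones", ["Alice Smith"]),
    ("Carol White", ["Bob Jones"]),
    ("Alice Smith", ["Bob Jones"]),
]


def build_edge_list(messages):
    """Count communications per unordered pair of people."""
    counts = Counter()
    for sender, recipients in messages:
        for recipient in recipients:
            if recipient == sender:
                continue  # ignore self-addressed mail
            pair = tuple(sorted((sender, recipient)))  # drop direction
            counts[pair] += 1
    return [(a, b, n) for (a, b), n in sorted(counts.items())]


def to_csv(edges):
    """Render the edge list as CSV text that can be opened in Excel."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["PERSON_NAME1", "PERSON_NAME2", "CONNECTION COUNT"])
    writer.writerows(edges)
    return buf.getvalue()


edges = build_edge_list(messages)
print(to_csv(edges))
```

In practice the same counts could probably be produced with a GROUP BY query in the database itself; the sketch only shows the shape of the output.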

The benefits of this approach were brought home to us when we tried the Email Import feature in NodeXL. This function allows you to import network information from your personal email file into NodeXL and to configure the resulting network display. Unfortunately, at present it can only import personal email, which restricts its applicability. It would have been nice to have had the option to point it at some sample PSTs, e.g. from the Enron data set. (And yes, we know there are workarounds for this, and had the result of our test been exceptional, we might have spent time setting one up.)


The import process is very simple (a click of the button if you want everything, slightly longer if you want to filter by recipient or time) and pretty quick. The resulting network retains the directionality of the email communications, which we had stripped out of our sample data. (Note: that was by choice. We could have retained it in the sample, since the database captured the metadata field from which each name had been extracted.)

However, we found the results of this approach to be neither as clean nor as insightful as those from processed email data, and it made us appreciate the value of preprocessing first:
(1) People’s names are almost always shorter than their email addresses, which makes the resulting node labels easier to work with and display.
(2) Using processed data, it is often possible to resolve multiple email addresses into the same identity. This is not a perfect science, but a little text manipulation and some judicious review and editing can get you a long way. Some processing software will even support this process. With so many people holding multiple email accounts (work and personal), this is not an insignificant issue.
(3) Processed data – because it gives you access to all the metadata – enables the network to be enriched with additional information about each node e.g. Organization, Domain. These attributes can then be used, for example, to cluster groups of nodes and provide additional insight (e.g. perhaps Operations isn’t communicating with Sales and vice versa). Attributes such as Month/Year Sent could be added to Edges.
(4) And if the metadata isn’t enough, and there is other information available that can be mapped to the individuals identified in the communications, (role maybe or demographics such as age and gender), with some minimal database work, the email network can be enriched with this information too.
(5) If the data is being imported from a database of processed email, the number of edge-pairs and nodes is known in advance. If NodeXL is applied directly to an email file, it is not, and you could very easily outstrip the capabilities of NodeXL, which is designed to handle networks of a few thousand rather than tens of thousands of nodes.
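
Points (2) and (5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ALIASES table stands in for the hand-reviewed mapping of addresses to people, the addresses are made up, and MAX_NODES is a hypothetical threshold chosen to reflect the "few thousand" nodes mentioned above rather than any documented NodeXL limit.

```python
# Hypothetical alias table, built by review and editing: address -> person.
ALIASES = {
    "asmith@example.com": "Alice Smith",
    "alice.smith@example.org": "Alice Smith",
    "bjones@example.com": "Bob Jones",
}

# Hypothetical threshold, not a documented NodeXL limit.
MAX_NODES = 5000


def resolve(address):
    """Map an address to a known identity, or fall back to the address."""
    key = address.strip().lower()
    return ALIASES.get(key, key)


def resolve_edges(raw_edges):
    """Merge edges whose endpoints resolve to the same pair of people."""
    merged = {}
    for a, b, n in raw_edges:
        pa, pb = resolve(a), resolve(b)
        if pa == pb:
            continue  # mail between two accounts of the same person
        pair = tuple(sorted((pa, pb)))
        merged[pair] = merged.get(pair, 0) + n
    return merged


def network_size(merged):
    """Return (node count, edge count) so the size is known before import."""
    nodes = {person for pair in merged for person in pair}
    return len(nodes), len(merged)


raw_edges = [
    ("asmith@example.com", "bjones@example.com", 4),
    ("Alice.Smith@example.org", "bjones@example.com", 2),
    ("asmith@example.com", "alice.smith@example.org", 1),
    ("bjones@example.com", "unknown@example.net", 1),
]

merged = resolve_edges(raw_edges)
node_count, edge_count = network_size(merged)
if node_count > MAX_NODES:
    print("Network may be too large to import comfortably; consider filtering.")
print(node_count, "nodes,", edge_count, "edges")
```

Addresses that are not in the alias table are kept as they are, so unresolved identities remain visible for later review instead of being silently dropped.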

Example of a Network of Email Addresses Showing Directionality with Nodes Sized and Colored by Eigenvector Centrality (i.e. Level of Importance) laid out using the Harel-Koren Method.
