Saturday, September 11, 2010

The Case of the Missing Spell Checker

A recent project involved creating a proof-of-concept SharePoint 2010 Foundation site(s) for a client. The aim was to demonstrate some of SharePoint’s collaboration features and show how the platform could support various teams within the client’s organization. In setting up the demonstration, we decided to create a small Knowledge Base using the built-in content creation tools.

The new page editing tools are certainly easier to use than in previous versions of SharePoint, and adding pictures is a cinch. The range of styles and fonts is also much improved. We did think the mechanism for linking pages – while very wiki-like – could have been made easier for less tech-savvy users. More importantly, since Foundation users do not get the content management and tagging features of the Standard and Enterprise versions, better tools for organizing the pages – beyond simple links – would have been helpful. For example, it would have been nice to be able to designate one of the pages as the “Home Page” of the Knowledge Base. Another welcome feature would have been an “Index Page” with an automatically generated index of the pages in the wiki.

SharePoint 2010 Foundation Content Editor: Insert Options


SharePoint 2010 Foundation : Text Editing Options


It wasn’t until someone pointed out a glaring spelling error in the copy we’d been writing for the Knowledge Base that we realized that, most strangely, there isn’t any form of spell checker in the content editor. At first we thought we’d simply mislaid it somewhere in the ribbon, but after looking high and low for it and checking several blogs, we realized that it simply doesn’t exist in Foundation. Microsoft skirts around the issue by declaring that spell checking exists in Standard and Enterprise, thereby carefully avoiding saying that it doesn’t exist in Foundation.

This seems to us very strange and a significant drawback to Foundation (which is almost certain to be the de facto hosted version). After all, blogging platforms such as Blogger – on which ChromaScope is hosted – have incorporated spell checkers for some time now.

Blogger's Editing Options (Spell Check is the last icon on the right)


Intrigued, we decided to do a quick comparison of functionality between the HTML editors in Blogger and SharePoint 2010.

| Feature | Blogger | SharePoint 2010 Foundation |
| --- | --- | --- |
| Cut/Copy/Paste | Yes | Yes |
| Font styles | Yes (7 available) | Yes (13 available) |
| Font color | Yes (limited range) | Yes (extensive range) |
| Strike-through/Superscript/Subscript | Strike-through only | Yes |
| Highlight text | Yes | Yes |
| Paragraph formatting (e.g. justification) | Yes | Yes |
| Style gallery (e.g. Byline) | Quote only | Yes (7 available) |
| Markup style gallery (e.g. Heading 1) | Title and Body only (from blog content editor) | Yes (14 available) |
| Text layout (e.g. columns) | Yes, but through Page Design rather than the content editor | Yes |
| Insert picture/image | Yes | Yes |
| Insert video | Yes | Yes (but not as obvious how to do this) |
| Insert link | Yes | Yes |
| Insert jump break | Yes | No |
| Insert table | No | Yes |
| Select elements based on HTML tag | No | Yes |
| Check in/Check out | No (but the publish function lets users decide when pages become publicly available) | Yes |
| Tagging | Yes | No |
| Edit HTML source | Yes | Yes |
| Page templating | Yes | Yes, but by using SharePoint Designer |
| Language support | Yes, including non-Latin | Extensive, including non-Latin |
| Spell checking | Yes | No |

Overall, SharePoint 2010 Foundation has a very rich content editor, but some of its features, and its rather technical HTML-element orientation, may prove difficult for the general user or, more likely, simply go unused. Blogger, on the other hand, has all the features the general user/content creator needs to compose content – with the exception of an easy way to add a table – AND a spell checker! We hope Microsoft takes note of the feedback that we, and surely everyone else, will give them and makes the text editor in SharePoint 2010 Foundation more like an easy-to-use content editor and less like an HTML editor for web designers.

Monday, August 16, 2010

PivotViewer: More Than Just Images

PivotViewer (also known simply as Pivot) is a framework from Microsoft Live Labs intended to support analysis of large datasets in which each data entity has an image associated with it. We say this carefully because at first glance it looks like “yet-another-image-gallery-application”, but it really is not (although we’d agree that you could use it for that purpose if you wanted, just as you can use a chisel to pull up carpet tacks and a 6-burner commercial class stove to cook a packet of soup).

Screenshot of AGM Movie Demo in PivotViewer showing the tiled view


The Silverlight enabled viewer works in a not too dissimilar way from an Excel Pivot table. Data can be filtered by any of the facets/categories available, supplemented (if required) by keyword searching. Images can be shown in tiled view or organized in bar chart view by chosen facet. Drilling down to item detail is as simple as zooming into an image. The corresponding data is displayed in a list to the side and adjacent items can be quickly stepped through using forward/backward buttons.

Screenshot of AGM Movie Demo in PivotViewer showing chart view



Underpinning the framework is the concept of a “collection”. A collection comprises a set of images and an XML file describing the images. The CollectionXML schema is a set of property-values that specify the collection as a whole, the facet categories into which the collection is organized and the individual items. The images in the collection are stored in Deep Zoom format and rendered using Seadragon technology.


CollectionXML Schema Overview
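To make the structure more concrete, here is a rough sketch of how a small collection file could be generated by hand. The element and attribute names follow the CXML schema as we understand it, and the lab-equipment values are invented, so treat this as illustrative rather than authoritative – the Live Labs tools described below generate this file for you.

```python
# Illustrative sketch only: builds a tiny CollectionXML (CXML) file by hand.
# Element/attribute names follow the CXML schema as we understand it and may
# need adjusting against the official documentation; the data is made up.
import xml.etree.ElementTree as ET

NS = "http://schemas.microsoft.com/collection/metadata/2009"
ET.register_namespace("", NS)

def q(tag):
    return "{%s}%s" % (NS, tag)

collection = ET.Element(q("Collection"), {"Name": "Lab Equipment", "SchemaVersion": "1.0"})

# Facet categories: the fields users can filter and sort the collection by.
categories = ET.SubElement(collection, q("FacetCategories"))
for name, ftype in [("Material", "String"), ("Category", "String")]:
    ET.SubElement(categories, q("FacetCategory"), {"Name": name, "Type": ftype})

# Items: each one points at a Deep Zoom image and carries its facet values.
items = ET.SubElement(collection, q("Items"), {"ImgBase": "labitems/labitems.dzc"})
item = ET.SubElement(items, q("Item"), {"Id": "1", "Img": "#0", "Name": "250 ml beaker"})
facets = ET.SubElement(item, q("Facets"))
facet = ET.SubElement(facets, q("Facet"), {"Name": "Material"})
ET.SubElement(facet, q("String"), {"Value": "Borosilicate glass"})

ET.ElementTree(collection).write("labitems.cxml", xml_declaration=True, encoding="utf-8")
```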


Creating a Pivot collection is not as intimidating or difficult as it might sound, however, because Live Labs provides several tools to facilitate the process, including one based on Excel.

Screenshot of Pivot Collection Tool for Excel


This summer, Live Labs also released a Silverlight 4 control which can be embedded in web sites (including SharePoint) and used to view, manipulate and analyze collections. The tools are available (for free) from the Pivot site. The Silverlight PivotViewer control can be downloaded from: www.silverlight.net/learn/pivotviewer/

Our initial interest in PivotViewer was its visualization capability and its potential for presenting complex data in ways that make it easier for users to understand and analyze. To this end we decided to try it out for ourselves and build a mini application using the Silverlight control as the viewer and the Pivot Collection Tool for Microsoft Excel to create the underlying collection. We had available a small collection of data and images relating to laboratory equipment and thought this would provide an interesting proof-of-concept.

Unlike many “interesting concept” toolsets we have attempted to deploy in the past, this one turned out to be very straightforward to use – despite a paucity of documentation. While the Excel Collection tool is “plug-and-play”, some knowledge of .NET development and Silverlight is obviously necessary to deploy the PivotViewer control. Thanks to Tim Heuer’s very helpful blog on how to deploy PivotViewer, we were able to get a basic Lab Equipment PivotViewer up and running very quickly.

Screenshot of Laboratory Equipment application - tile view


Screenshot of Laboratory Items by material type (chart view)



Although we knew going in that the small number of data items we were using was less than ideal (more is definitely better here), we thought the set of uniform images we had available (complete with color coding) and the supporting data about the equipment (size, material type, category, descriptions, etc.) would make up for it. We were wrong! We had focused on the images, and these, while necessary, are not sufficient. What is absolutely essential to make the most of this application is rich data. We had only two main facets and a small number of parameters for each, which significantly limited what we could do.

Contrast this with the AMG Movie demo provided as a sample with the control, where each movie is accompanied by a wealth of information: a description as well as faceted data such as date of release, director, actors, genre, box office takings, countries and runtime. It is this information that fuels the application.
Close-up of Movie demo item and accompanying data


When thinking about how Pivot could be used, our first thoughts had been the obvious “image gallery” type applications: a web enabled version of an art gallery or museum for example. The “out-of-the-box” ability to support filtering and search by multiple facets – supplemented by keyword searching – would be ideal. Users could look, for example, for all Impressionist paintings depicting lakes painted in France between this date and that. Similarly, it could be used to develop a very useful, useable interface to any large catalog of items: from clothing (women’s jeans boot-cut dark-wash) to hardware (small plate door knocker solid brass satin nickel finish).

However, it was when playing with the movie application that we realized that thinking of it as simply a front end to a catalog was to underplay its potential. We had started to look at the box office takings facet and it was then that the penny dropped. We found ourselves looking for patterns. What correlations were there between directors, actors and takings? It was very easy to ask these questions and then focus in on the results, arranging the items as tiles or as bar graphs. We could see the visual potential of PivotViewer really coming into play when looking at, for example, trends in sales of clothing or even real estate – anything where visual appearance (from color to style) is a factor in sales, cost of manufacture, page views or some other key metric.

Screenshot from AGM Movie demo showing Movies by Box Office Gross


In the movie demo, the images are a nice-to-have as a visualization but are not an essential part of the analysis per se. In other cases, we could envisage the images themselves being an essential part of the analysis. For example, retailers often study the selling power of pages in their printed catalogs or web sites, to determine which layouts are the most effective. PivotViewer would make this a very easy analysis to conduct. Similarly a greetings card manufacturer could look for patterns and trends in consumer choice of design.

In summary, we believe this technology has great potential in data-rich environments where visual appearance is either correlated with one or more key metrics, or can aid the visualization of complex data simply by making the individual items (or groups of items) more recognizable.

Saturday, July 10, 2010

Analyzing Email Communications: Processed vs Unprocessed Data

In the previous post, we looked at using NodeXL to visualize communication patterns in emails that had been preprocessed. In other words, we had run the original email file through a software tool that extracted metadata such as To, From, CC, Subject and Date Sent and stored it in a SQL database. The software we were using also extracted a Person’s Name from the email address.

For import into NodeXL, we created an edge list with the fields PERSON_NAME1, PERSON_NAME2 and CONNECTION COUNT (i.e. the number of communications between the people concerned) by simply querying the database and exporting to Excel. Later on, when we wanted to develop the visualization and look at clustering, we were able to use the database to generate a list of node attributes (e.g. Family Membership) and import that into NodeXL. For Gephi, we followed a similar process except that we output to the required XML format.
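For anyone wanting to reproduce this step, the query-and-export stage looks roughly like the sketch below. The "messages" table and its columns are hypothetical stand-ins for whatever your preprocessing software produces, and we exported to Excel rather than CSV, so this is only an approximation of our workflow.

```python
# Sketch of the edge-list export step. The "messages" table and its columns are
# hypothetical stand-ins for the output of the email preprocessing tool.
import csv
import sqlite3

conn = sqlite3.connect("processed_email.db")   # database built by the preprocessing tool

query = """
    SELECT person_name1, person_name2, COUNT(*) AS connection_count
    FROM messages
    GROUP BY person_name1, person_name2
"""

with open("edge_list.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["PERSON_NAME1", "PERSON_NAME2", "CONNECTION COUNT"])
    writer.writerows(conn.execute(query))
# The resulting file can then be pasted or imported into the NodeXL Edges worksheet.
```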

The benefits of this approach were brought home to us when we tried the Email Import feature in NodeXL. This function allows you to import network information from your personal email file into NodeXL and to configure the resulting network display. Unfortunately, it is limited at present to importing personal email only, which restricts its applicability. It would have been nice to have had the option to point it at some sample PSTs, e.g. from the Enron data set. (And yes, we know there are workarounds to this, and had the result of our test been exceptional, we might have spent time setting it up.)


The import process is very simple – a click of the button if you want everything, slightly longer if you want to filter by recipient or time – and pretty quick. The resulting network retains the directionality of the email communications – which we had stripped out of our sample data. (Note: that was by choice, we could have retained it in the sample since the database captured the metadata field from which the name had been extracted).

However, we found the results of this approach neither as clean nor as insightful as those obtained with the processed email data, and it made us appreciate the value of preprocessing first:
(1) People’s names are almost always shorter than their email addresses, which makes the resulting node labels easier to work with and display.
(2) Using processed data, it is often possible to resolve multiple email addresses into the same identity (a toy sketch of this follows the list below). This is not a perfect science, but a little text manipulation and some judicious review and editing can get you a long way. Some processing software will even support this process. With so many people holding multiple email accounts – work and personal – this is not an insignificant issue.
(3) Processed data – because it gives you access to all the metadata – enables the network to be enriched with additional information about each node e.g. Organization, Domain. These attributes can then be used, for example, to cluster groups of nodes and provide additional insight (e.g. perhaps Operations isn’t communicating with Sales and vice versa). Attributes such as Month/Year Sent could be added to Edges.
(4) And if the metadata isn’t enough, and there is other information available that can be mapped to the individuals identified in the communications (role maybe, or demographics such as age and gender), the email network can be enriched with this information too with some minimal database work.
(5) If the data is being imported from a database of processed email, the number of edge-pairs and nodes is known. If NodeXL is applied directly to an email file it isn’t, which means you could very easily outstrip the capabilities of NodeXL, which is designed to handle networks of a few thousand rather than tens of thousands of nodes.
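As promised in point (2), here is a toy illustration of identity resolution. The addresses and names are invented, and in practice the alias table comes from review and editing (or from the processing software itself) rather than being hard-coded.

```python
# Toy illustration of point (2): collapsing several addresses onto one identity.
# The alias mapping here is hand-built and entirely fictional.
ALIASES = {
    "v.corleone@genco.example.com": "Vito Corleone",
    "vito@home.example.net": "Vito Corleone",
    "connie.c@genco.example.com": "Connie Corleone",
}

def resolve(address):
    """Return a person name for an address, falling back to the raw address."""
    return ALIASES.get(address.strip().lower(), address)

edges = [("V.Corleone@genco.example.com", "connie.c@genco.example.com")]
resolved = [(resolve(a), resolve(b)) for a, b in edges]
print(resolved)   # [('Vito Corleone', 'Connie Corleone')]
```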

Example of a Network of Email Addresses Showing Directionality with Nodes Sized and Colored by Eigenvector Centrality (i.e. Level of Importance) laid out using the Harel-Koren Method.

Sunday, July 4, 2010

Visualizing Email Communications using NodeXL

Email has become an integral part of communication in both the business and personal spheres. Given its centrality, it is surprising how few tools are generally available for analyzing it outside specialist areas such as Early Case Assessment in litigation, Xobni being a notable exception at the individual level. However, the rise of social network analysis, and the tools that support it, may change that. Graph theory is remarkably neutral as to whether it is applied to Facebook Friend networks or email communications within a Sales and Marketing division.

In a previous post, we reported on using Gephi – an open source tool for graphing social networks – to visualize email communications. In this post, we look at using NodeXL for the same purpose. We used the same email data set as before – the ‘Godfather Sample’ – in which an original email data set was processed to extract the metadata (e.g. sender, recipient, date sent, subject) and subsequently anonymized using fictional names.

NodeXL is a free and open source template for Microsoft Excel 2007 that provides a range of basic network analysis and visualization features intended for use on modest-sized networks of several thousand nodes/vertices. It is targeted at non-programmers and builds upon the familiar concepts and features within Excel. Information about the network, e.g. node data and edge lists, is all contained within worksheets.


Data can be loaded simply by cutting and pasting an edge list from another Excel worksheet, but there is also a wide range of other options, including the ability to import network data from Twitter (Search and User networks), YouTube and Flickr, and from files in GraphML, Pajek and UCINET Full Matrix DL formats. There is also an option to import directly from an email PST file, which we will discuss in a following post. In addition to the basics of an edge list, attribute information can be associated with each edge and node. In our “Godfather” email sample, we added a weighting for communication strength (i.e. the number of emails between the two individuals) to each edge and the affiliation with the Corleone family to each node.

Once an edge list has been added, the vertices/node list is automatically created and a variety of graphical representations can be produced, depending on the layout option selected (Fruchterman-Reingold is the default, but Harel-Koren Fast Multiscale as well as Grid, Polar, Sugiyama and Sine Wave options are also available) and on how data attributes are mapped to the visual properties of nodes and vertices. For example, in the graph shown below, nodes were color coded and sized with respect to the individual’s connections with the Corleone family: blue for Corleone family members, green for Corleone allies, orange for Corleone enemies and pink for individuals with no known associations with the family.



The width of the edges/links was then set to vary in relation to the degree of communication between the two nodes i.e. the number of emails sent between the two individuals concerned.
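In NodeXL these mappings are set up through the worksheet UI rather than code. For readers who prefer a code view, the sketch below reproduces the same idea – node color by affiliation, edge width by email count – using the open source networkx and matplotlib libraries rather than NodeXL itself; all the data is invented.

```python
# Rough code equivalent of the NodeXL attribute mapping described above, using
# networkx + matplotlib instead of NodeXL. Names, affiliations and counts are made up.
import networkx as nx
import matplotlib.pyplot as plt

AFFILIATION_COLORS = {"member": "blue", "ally": "green", "enemy": "orange", "none": "pink"}

G = nx.Graph()
G.add_node("Vito Corleone", affiliation="member")
G.add_node("Hyman Roth", affiliation="ally")
G.add_node("Ritchie Martin", affiliation="none")
G.add_edge("Vito Corleone", "Hyman Roth", emails=55)
G.add_edge("Vito Corleone", "Ritchie Martin", emails=12)

node_colors = [AFFILIATION_COLORS[G.nodes[n]["affiliation"]] for n in G.nodes]
edge_widths = [G.edges[e]["emails"] / 10 for e in G.edges]   # width ~ number of emails

nx.draw_networkx(G, node_color=node_colors, width=edge_widths, with_labels=True)
plt.show()
```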


Labels can be added to both nodes and links showing either information about the node/link or its attributes, as required.






Different graph layout options are available which may be used to generate alternative perspectives and/or easier to view graphs.

Harel-Koren Layout


Circle Layout


Because even a small network can generate a complex, dense graph, NodeXL has a wide range of options for filtering and hiding parts of the graph, the better to elucidate others. The visibility of an edge/vertex for example, can be linked to a particular attribute e.g. degree of closeness. We found the dynamic filters particularly useful for rapidly focusing on areas of interest without altering the properties of the graph themselves. For example, in the following screenshot we are showing only those links where the number of emails between the parties is greater than 40. This allows us to focus on individuals who have been emailing each other more frequently than the average.


In addition to graphical display, NodeXL can be used to calculate key network metrics. Degree is the number of links on a node and reflects the number of relationships an individual has with other members of the network (with In-Degree and Out-Degree options for directed graphs). Betweenness Centrality is the extent to which a node lies between other nodes in the network and reflects the number of people an individual is connecting to indirectly. Closeness Centrality measures the degree to which a node is near all other nodes in a network and reflects the ability of an individual to access information through the "grapevine" of network members. Eigenvector Centrality is a measure of the importance of an individual in the network. In an analysis of email communications, these metrics can be used to identify the degree of connectedness between individuals and their relative importance in the communication flow.
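NodeXL calculates these measures for you from its Graph Metrics dialog. Purely to make the definitions concrete, the sketch below computes the same metrics with networkx; this is not how NodeXL does it, and the sample graph is just a stand-in.

```python
# The same metrics NodeXL reports, computed with networkx purely to make the
# definitions concrete (this is not NodeXL's own calculation).
import networkx as nx

G = nx.karate_club_graph()   # any small undirected network will do

degree      = dict(G.degree())                # number of direct connections
betweenness = nx.betweenness_centrality(G)    # how often a node sits "between" others
closeness   = nx.closeness_centrality(G)      # how near a node is to everyone else
eigenvector = nx.eigenvector_centrality(G)    # importance, weighted by neighbours' importance

top = max(eigenvector, key=eigenvector.get)
print("Most 'important' node:", top)
```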

For example, in our Godfather sample, we have sized the nodes in the graph below by their Degree Centrality. While Vito Corleone is, as expected, shown to be highly connected, Ritchie Martin – an individual not thought to have business associations with the Corleone family – is shown to be more connected than supposed.

Node Sized by Degree Centrality


When we look at the same data from the perspective of betweenness, we see that Vito, Connie and Ritchie all have a high degree of indirect connections.

Nodes Sized by Betweenness Centrality


And the Eigenvector Centrality measure confirms Vito Corleone's significance in the network, as well as that of Connie, two "allies" – Hyman Roth and Salvatore Tessio – and Ritchie Martin.

Nodes Sized by Eigenvector Centrality


Last but not least, it is also possible to use NodeXL to visualize clusters of nodes to show or identify subgroups within a network. Clusters can be added manually or generated automatically. Manually creating clusters requires first assigning nodes to an attribute or group membership and then determining the color and shape of the nodes for each subgroup/cluster. In our Godfather example, we used “Family” affiliation to create clusters within the network, but equally one could use organization/company, country, language, date etc.
"Family Affiliation" Clusters Coded by Node Color

Selected Cluster (Corleone Affiliates)

NodeXL will also generate clusters automatically using a clustering algorithm developed specifically for large scale social network analysis which works by aggregating closely interconnected groups of nodes. The results for the Godfather sample are shown below. We did not find the automated clustering helpful but this is probably a reflection of the relatively small size of the sample.

In the next post, we will look at importing email data directly into NodeXL and compare approaches based on analyzing processed vs unprocessed email data.

Larger Email Network Visualization

To download NodeXL, go to http://nodexl.codeplex.com//. We would also recommend working through the NodeXL tutorial which can be downloaded from: http://casci.umd.edu/images/4/46/NodeXL_tutorial_draft.pdf


A top level overview of social network analysis and the basic concepts behind graph metrics can be found on Wikipedia e.g. http://en.wikipedia.org/wiki/Social_network and http://en.wikipedia.org/wiki/Betweenness_centrality#Eigenvector_centrality

Friday, June 25, 2010

Serious Data Analytics with the Palantir Platform

Every now and then we feel like children outside a candy store, faces pressed to the window, eyeing the good things within. Today was one of those moments, when we came across a reference to Palantir Technologies’ data analytics platform on TechCrunch and went to investigate further.

Palantir is a data analysis platform which enables the integration of structured and unstructured data from a variety of sources – documents, databases, email communications – and provides the sophisticated tools required to search and analyze it. The company – Palantir Technologies (http://www.palantir.com/) – focuses on two verticals: Finance and Government, with the latter accounting for 70% of their business and divided into Intelligence and Defense, Financial Regulation (Palantir is currently being used to monitor ARRA stimulus funding for fraud and alert the various Inspectors General to suspicious activity), Cybersecurity and Healthcare (e.g. tracing the origin of food poisoning outbreaks, correlating hospital quality indicators with Medicare cost reports). Palantir has also teamed up with Thomson Reuters to develop a next generation financial analysis platform.

In order to deliver its functionality, the Palantir platform incorporates a number of different technologies. Its text search engine is based on Lucene – a Java-based text retrieval engine that has been around for a long time. Lucene, like most text retrieval software, operates on an inverted index: it creates a list of key terms (ignoring any stop words – words in a language that are not meaningful or, because they are so common, not useful in a search, like ‘the’ or ‘a’ in English) and records against each term the entire set of documents (and positions within each document) where the term occurs. One of Palantir’s customizations adjusts the retrieved results so that users can only see information they are cleared to view (a necessary requirement for some of Palantir’s national security customers). If a user doesn’t have access to a piece of information, its existence is totally suppressed and it will never appear even in a keyword count.
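For readers unfamiliar with the idea, here is a toy illustration of an inverted index with stop-word removal. It has nothing to do with Lucene's or Palantir's actual internals; it simply shows the term-to-(document, position) mapping described above.

```python
# Toy inverted index to illustrate the idea: term -> list of (doc_id, position).
# This has nothing to do with Lucene's or Palantir's actual internals.
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and"}

def build_index(docs):
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            if word not in STOP_WORDS:
                index[word].append((doc_id, pos))
    return index

docs = {1: "the quick brown fox", 2: "a brown dog and the fox"}
index = build_index(docs)
print(index["fox"])    # [(1, 3), (2, 5)] -> document ids and word positions
```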

To test drive Palantir, go to https://www.analyzethe.us/ and use their 'Analyze the US' application to explore public domain information about the US. The interface is easy to use once you have adjusted to the UI metaphor, and most functions can be achieved by drag-and-drop. A set of test data is provided, e.g. mortality statistics for various US hospitals. As with all data analysis systems, the challenge is knowing what questions to ask within the context of the available data.


Palantir has one of the easiest-to-use geospatial analysis interfaces we’ve seen. Any group of geocodeable entities can be seen in map view by simply dragging and dropping the selection onto the Map icon. Geospatial searches can be carried out over an area defined by radius, polygon or route. In addition, HeatMap and TreeMap geovisualizations are also supported. We did try importing some geocoded distribution data to see if we could produce a HeatMap of delivery density and were able to do so quickly and with minimum effort (see below, based on Richmond, VA).


Palantir would seem to be an ideal tool for use in forensic accounting and fraud investigations where there are a large number of interconnected persons of interest and organizational entities. Similarly, its ability to integrate structured data and documents might also be helpful in complex finance, fraud and IP related litigations where the legal team needs a way of analyzing and understanding a large set of both data and documents. Recent sub-prime related litigations come to mind as do complex Mergers and Acquisitions.

Wednesday, June 16, 2010

Using MapPoint 2010 for Route Analysis

As anyone who has been involved in routing of any scale knows, there are few software tools available and most of those come with a hefty price tag. If you want to create a single route, there is a range of options available at various price points. If you want to create multiple routes from a starting point of a set of addresses, or analyze a large number of routes simultaneously, options are limited and the applications available tend to have been developed for quite specific requirements which may or may not match those of the task in hand.

On a recent project, the requirement was to take a large number of predefined routes and calculate travel time and distance for each route. Each predefined route had to be maintained as such (i.e. stops could not be transferred between routes), but to obtain more accurate times and distances we did decide to optimize the sequencing of stops within routes. This set of requirements is not what most routing applications are designed to do! Higher-end applications such as ESRI’s ArcLogistics will allow you to create optimized routes from a set of delivery addresses, but they are not designed to support analysis of an existing set of routes.

For this assignment, the tool available was MapPoint 2010. While this latest version of MapPoint has been enhanced to meet the needs of business users wanting to carry out various forms of geospatial analysis (e.g. revenue by sales territory, customer location), routing (outside of some minor upgrades such as enabling route information to be sent to GPS devices) has obviously not been a priority. MapPoint does come with an API, so it is possible to engineer a bespoke application in support of a particular need, but deadline constraints meant that we did not have time to pursue this approach.

Importing data into MapPoint 2010 is straightforward (although it would be helpful if the data importer recognized a broader range of data types, e.g. time) and it was possible to load the data so that Route and Stop number information was preserved. However, once imported, it was not possible to use the routeID to manipulate the data. To do what we needed to do, we would have had to import each route individually to create separate datasets. (Note: the ability to transfer pushpins between datasets or to merge datasets does not seem to work as advertised.)

Reporting and/or export to Excel of route information in MapPoint 2010 is also limited. The product seems mainly geared up to produce turn-by-turn directions, which we did not need for this project. The built-in export-to-Excel function allows you to export a dataset (which would have been viable if each route had been imported as a separate dataset), but there is no means to customize the export and, strangely, vital route information such as distance and travel time is not included – making the export useless for any form of route analysis.

The solution turned out to be a third-party add-on (RouteReader/RouteWriter) from Mapping Tools (www.mappingtools.com) which allowed us to select individual routes, optimize them and then output the results – including drive time and distance – to Excel. There were occasional odd results with RouteWriter arising from a particular stop being present on two different routes (the application was obviously using location information rather than routeID when outputting to Excel), but other than that the application worked well. The big “however”, however, was that each route still had to be analyzed individually. Since there were 120 routes, this took a significant amount of time. Our ideal application would have allowed us to set up batch route creation (by routeID, sequenced by either stopID or optimized) and the ability to batch output the results to Excel.

Unfortunately, even RouteReader/RouteWriter could not overcome a fundamental problem with MapPoint 2010 – a strange inability to geocode addresses along interstate or state highways. At first, we thought it was a naming issue: many highways have multiple “names” depending on the segment, and possibly we simply didn’t have the preferred street name for the segment of highway in question. However, this was not the case. Street-number-level geocoding does not seem to be available for many highway segments in the area we were investigating (the Southeast US), even though these are not new developments.

To work around this, we had to laboriously confirm each unidentified address using Bing Maps (which is a great tool because it returns the “official” version of an address, together with the zip+4), and then force the stop back into MapPoint at the correct location using the Lat/Lon obtained from Google. And then, since we could not get MapPoint to transfer pushpins between datasets, we had to manually add these “invalid” stops into their intended route before optimizing the route and reporting out to Excel. This added a considerable amount of time and effort to what was already a slow process. If only we could have routed on Bing Maps! Last but not least, if an address is incorrect (and we had several), it would be very helpful to have the opportunity in MapPoint to correct it and re-match it on the spot.

Sunday, June 6, 2010

Searching SharePoint 2010 with FAST

FAST is a high-end search engine provided by Microsoft (at additional cost) as an enterprise-level alternative to SharePoint’s built-in search engine. Whereas standard SharePoint 2010 can handle millions of documents, the FAST search engine can index and search over a hundred million, i.e. it can scale to handle not only document management for an entire organization but also more specialist requirements such as regulatory compliance and litigation document review. It also has extensive support for languages other than English, including Chinese, Japanese and Korean.

As well as being an enterprise-level search engine, FAST incorporates a number of features designed to make it easier for end users to find things. For example, many users remember documents by their visual appearance. FAST supports visual recognition by displaying a small thumbnail next to the summary of the document, so users looking for a specific document can rapidly identify it. In addition, FAST includes graphical previewers for PowerPoint documents which can be used, for example, to find that one particular slide in a presentation without having to open the whole file and go through it slide by slide. Results also include links to ‘Similar Results’ and to ‘Duplicates’.

Example of a FAST Results Display


To support its search capabilities, FAST includes extremely powerful content processing based on linguistics and text analysis. Examples of linguistic processing applied to items and queries include character normalization, normalization of stemming variations and suggested spelling corrections. FAST automatically extracts document metadata, such as author and date last modified, and makes it available for fielded searching, faceted search refinement and relevancy tuning. In addition to document metadata, it is also possible to define what Microsoft refer to as “managed properties”. These are categories such as organization names, place names and dates that may exist in the content of the document and can help develop or refine a search. Defining a custom extractor will enable such properties to be identified and indexed. (Note: this is a similar capability to that offered by several ‘Early Case Assessment’ tools in the litigation space.)
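To give a feel for what a custom extractor does, here is a toy, regular-expression-based sketch of pulling candidate property values out of document text. FAST has its own configuration-driven mechanism for this; the code below is purely illustrative and the patterns are deliberately simplistic.

```python
# Illustration only of the kind of entity extraction described above. FAST has
# its own configuration-driven extractor mechanism; this is not it.
import re

DATE_PATTERN = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")
ORG_PATTERN = re.compile(r"\b([A-Z][A-Za-z&]+(?: [A-Z][A-Za-z&]+)* (?:Inc|Corp|Ltd|LLC))\b")

def extract_properties(text):
    """Pull candidate 'managed property' values out of raw document text."""
    return {
        "dates": DATE_PATTERN.findall(text),
        "organizations": ORG_PATTERN.findall(text),
    }

print(extract_properties("Acme Widgets Inc filed the report on 12/31/2009."))
# {'dates': ['12/31/2009'], 'organizations': ['Acme Widgets Inc']}
```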

Example of FAST Refinement Category List for a Results Set


SharePoint 2010 Standard provides the ability to refine search results based on key metadata/properties such as document type, author and date created. These refinement values are, by default, based on the first 50 results returned. With FAST, refinement moves to a whole other level – so-called ‘deep’ refinement – where the refinement categories are based on managed properties across the entire result set. Users are presented with a list of refinement categories together with the counts within each category. (Note: this functionality is similar to the refinement capability that many major eCommerce sites provide, e.g. NewEgg.com, BestBuy, etc.)
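Conceptually, deep refinement is just facet counting over the whole result set rather than over the first 50 hits. The sketch below illustrates the idea on made-up results; it says nothing about how FAST actually computes its counts.

```python
# Generic illustration of 'deep' refinement: facet counts computed over the
# whole result set rather than just the first N hits. Not FAST internals.
from collections import Counter

results = [
    {"title": "Q3 report", "file_type": "docx", "author": "A. Smith"},
    {"title": "Q3 deck", "file_type": "pptx", "author": "A. Smith"},
    {"title": "Budget", "file_type": "xlsx", "author": "B. Jones"},
]

def refinement_counts(results, facet):
    """Count how many results fall into each value of the given facet."""
    return Counter(r[facet] for r in results)

print(refinement_counts(results, "file_type"))   # Counter({'docx': 1, 'pptx': 1, 'xlsx': 1})
print(refinement_counts(results, "author"))      # Counter({'A. Smith': 2, 'B. Jones': 1})
```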

SharePoint 2010 with FAST : Architectural Overview


A detailed feature comparison between SharePoint 2010 Standard Search and FAST, together with further information about FAST, is provided in Microsoft’s document “FAST Search Server 2010 for SharePoint Evaluation Guide”, downloadable from http://www.microsoft.com/downloads/details.aspx?FamilyID=f1e3fb39-6959-4185-8b28-5315300b6e6b&displaylang=en