Friday, June 25, 2010

Serious Data Analytics with the Palantir Platform

Every now and then we feel like children outside a candy store, faces pressed to the window, eyeing the good things within. Today was one of those moments: we came across a reference to Palantir Technologies’ data analytics platform on TechCrunch and went to investigate further.

Palantir is a data analysis platform which enables the integration of structured and unstructured data from a variety of sources – documents, databases, email communications – and provides the sophisticated tools required to search and analyze it. The company – Palantir Technologies (http://www.palantir.com/) – focuses on two verticals: Finance and Government, with the latter accounting for 70% of its business and divided into Intelligence and Defense; Financial Regulation (Palantir is currently being used to monitor ARRA stimulus funding for fraud and alert the various Inspectors General to suspicious activity); Cybersecurity; and Healthcare (e.g. tracing the origin of food poisoning outbreaks, or correlating hospital quality indicators with Medicare cost reports). Palantir has also teamed up with Thomson Reuters to develop a next-generation financial analysis platform.

In order to deliver its functionality, the Palantir platform incorporates a number of different technologies. Its text search engine is based on Lucene, a Java-based text retrieval engine that has been around for a long time. Lucene, like most text retrieval software, operates on an inverted index: it builds a list of keywords (ignoring stop words – words such as ‘the’ or ‘a’ in English that are so common they are not useful in a search) and records against each term the set of documents, and positions within each document, where the term occurs. One of Palantir’s customizations adjusts the retrieved results so that users can only see information they are cleared to view (a necessary requirement for some of Palantir’s national security customers). If a user doesn’t have access to a piece of information, its existence is totally suppressed and it will never appear, even in a keyword count.
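For readers curious what inverted-index search with an access-control filter might look like at the code level, here is a minimal Lucene sketch. It is purely illustrative and assumes Lucene 8.x on the classpath; the document fields, the "clearance" field and the filter clause are our own invented example of the general technique, not Palantir’s actual implementation.

// Minimal sketch of inverted-index search with a clearance filter (illustrative only).
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class InvertedIndexDemo {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();           // in-memory index
        StandardAnalyzer analyzer = new StandardAnalyzer();   // tokenizes text, drops stop words

        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("body", "Wire transfer flagged for review", Field.Store.YES));
            // Hypothetical clearance marking; filtering on it loosely mimics the kind
            // of access-control restriction described above.
            doc.add(new StringField("clearance", "unclassified", Field.Store.YES));
            writer.addDocument(doc);
        }

        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Search the inverted index for "transfer", restricted to documents
            // the user is cleared to see.
            BooleanQuery query = new BooleanQuery.Builder()
                .add(new QueryParser("body", new StandardAnalyzer()).parse("transfer"),
                     BooleanClause.Occur.MUST)
                .add(new TermQuery(new Term("clearance", "unclassified")),
                     BooleanClause.Occur.FILTER)
                .build();
            TopDocs hits = searcher.search(query, 10);
            System.out.println("Matching documents: " + hits.totalHits.value);
        }
    }
}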

To test-drive Palantir, go to https://www.analyzethe.us/ and use the ‘Analyze the US’ application to explore public domain information about the US. The interface is easy to use once you have adjusted to the UI metaphor, and most functions can be achieved by drag-and-drop. A set of test data is provided, e.g. mortality statistics for various US hospitals. As with all data analysis systems, the challenge is knowing what questions to ask within the context of the available data.


Palantir has one of the easiest-to-use geospatial analysis interfaces we’ve seen. Any group of geocodeable entities can be seen in map view by simply dragging and dropping the selection onto the Map icon. Geospatial searches can be carried out over an area defined by radius, polygon or route. In addition, HeatMap and TreeMap geovisualizations are also supported. We tried importing some geocoded distribution data to see if we could produce a HeatMap of delivery density and were able to do so quickly and with minimal effort (see below, based on Richmond VA).
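As a point of comparison, a basic radius search over geocoded points reduces to a great-circle distance filter. The sketch below is our own illustration of that general technique (using the haversine formula), not how Palantir implements its geospatial search.

// Illustrative radius search: keep only the points within a given distance of a centre.
import java.util.ArrayList;
import java.util.List;

public class RadiusSearch {
    static class Point {
        final String id;
        final double lat, lon;
        Point(String id, double lat, double lon) { this.id = id; this.lat = lat; this.lon = lon; }
    }

    // Great-circle (haversine) distance in kilometres between two lat/lon pairs.
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        final double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(a));
    }

    // Return only the points within radiusKm of the search centre.
    static List<Point> withinRadius(List<Point> points, double centerLat, double centerLon, double radiusKm) {
        List<Point> result = new ArrayList<>();
        for (Point p : points) {
            if (distanceKm(centerLat, centerLon, p.lat, p.lon) <= radiusKm) {
                result.add(p);
            }
        }
        return result;
    }
}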


Palantir would seem to be an ideal tool for forensic accounting and fraud investigations where there are a large number of interconnected persons of interest and organizational entities. Similarly, its ability to integrate structured data and documents might also be helpful in complex finance, fraud and IP-related litigation where the legal team needs a way of analyzing and understanding a large set of both data and documents. Recent sub-prime-related litigation comes to mind, as do complex mergers and acquisitions.

Wednesday, June 16, 2010

Using MapPoint 2010 for Route Analysis

As anyone who has been involved in routing of any scale knows, there are few software tools available, and most of those come with a hefty price tag. If you want to create a single route, there is a range of options at various price points. If you want to create multiple routes from a set of addresses, or analyze a large number of routes simultaneously, options are limited, and the applications available tend to have been developed for quite specific requirements which may or may not match those of the task at hand.

On a recent project, the requirement was to take a large number of predefined routes and calculate travel time and distance for each route. Each predefined route had to be maintained as such (i.e. stops could not be transferred between routes), but to obtain more accurate times and distances we did decide to optimize the sequencing of stops within each route. This set of requirements is not what most routing applications are designed to do! Higher-end applications such as ESRI’s ArcLogistics will allow you to create optimized routes from a set of delivery addresses, but they are not designed to support analysis of an existing set of routes.
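For readers wondering what “optimizing the sequencing of stops within a route” involves, the sketch below shows one simple heuristic: nearest-neighbour ordering over a distance (or travel-time) matrix. It is only an illustration of the idea; MapPoint, ArcLogistics and RouteWriter use their own optimizers, and we are not suggesting this is how they work.

// Nearest-neighbour stop sequencing for a single route (illustrative heuristic only).
import java.util.ArrayList;
import java.util.List;

public class StopSequencer {
    // dist[i][j] = travel distance (or time) from stop i to stop j; stop 0 is the depot.
    static List<Integer> nearestNeighbourOrder(double[][] dist) {
        int n = dist.length;
        boolean[] visited = new boolean[n];
        List<Integer> order = new ArrayList<>();
        int current = 0;                 // start at the depot
        visited[0] = true;
        order.add(0);
        for (int step = 1; step < n; step++) {
            int next = -1;
            double best = Double.MAX_VALUE;
            for (int j = 0; j < n; j++) {
                if (!visited[j] && dist[current][j] < best) {
                    best = dist[current][j];
                    next = j;
                }
            }
            visited[next] = true;
            order.add(next);
            current = next;
        }
        return order;                    // visiting order of stop indices
    }
}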

For this assignment, the tool available was MapPoint 2010. While this latest version of MapPoint has been enhanced to meet the needs of business users wanting to carry out various forms of geospatial analysis (e.g. revenue by sales territory, customer location), routing (outside of some minor upgrades such as enabling route information to be sent to GPS devices) has obviously not been a priority. MapPoint does come with an API, so it is possible to engineer a bespoke application in support of a particular need, but deadline constraints meant that we did not have time to pursue this approach.

Importing data into MapPoint 2010 is straightforward (although it would be helpful if the data importer recognized a broader range of data types, e.g. time), and it was possible to load the data so that Route and Stop number information was preserved. However, once imported, it was not possible to use the routeID to manipulate the data. To do what we needed to do, we would have had to import each route individually to create separate datasets. (Note: the ability to transfer pushpins between datasets, or to merge datasets, does not seem to work as advertised.)

Reporting and/or export to Excel of route information in MapPoint 2010 is also limited. The product seems mainly geared towards producing turn-by-turn directions, which we did not need for this project. The built-in export-to-Excel function allows you to export a dataset (which would have been viable if each route had been imported as a separate dataset), but there is no means to customize the export and, strangely, vital route information such as distance and travel time is not included in the export – making it useless for any form of route analysis.

The solution turned out to be a third-party add-on (RouteReader/RouteWriter) from Mapping Tools (www.mappingtools.com) which allowed us to select individual routes, optimize them and then output the results – including drive time and distance – to Excel. There were occasional odd results with RouteWriter arising from a particular stop being present on two different routes (the application was obviously using location information rather than routeID when outputting to Excel), but other than that the application worked well. The big “however”, however, was that each route still had to be analyzed individually. Since there were 120 routes, this took a significant amount of time. Our ideal application would have allowed us to set up batch route creation (by routeID, sequenced by either stopID or optimized) and to batch output the results to Excel.
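The batch workflow we would have liked looks roughly like the sketch below: group the imported stops by routeID, process each route in turn, and write one summary row per route to a CSV file that Excel can open. This is not a feature of MapPoint or RouteWriter; it is our own outline, and the distance/time values are placeholders that, in a real implementation, would come from whatever routing engine is being driven.

// Sketch of a batch per-route summary export (our wished-for workflow, not an existing feature).
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class BatchRouteExport {
    static class Stop {
        final String routeId;
        final int stopId;
        Stop(String routeId, int stopId) { this.routeId = routeId; this.stopId = stopId; }
    }

    // Group stops by routeID, then emit one summary row per route.
    static void exportSummary(List<Stop> stops, String csvPath) throws IOException {
        Map<String, List<Stop>> byRoute = new TreeMap<>();
        for (Stop s : stops) {
            byRoute.computeIfAbsent(s.routeId, k -> new ArrayList<>()).add(s);
        }
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(Paths.get(csvPath)))) {
            out.println("routeId,stopCount,distanceMiles,driveTimeMinutes");
            for (Map.Entry<String, List<Stop>> e : byRoute.entrySet()) {
                // Placeholder metrics: a real implementation would ask the routing
                // engine to calculate the optimized route and return its figures.
                double distanceMiles = 0.0;
                double driveTimeMinutes = 0.0;
                out.printf("%s,%d,%.1f,%.1f%n", e.getKey(), e.getValue().size(),
                           distanceMiles, driveTimeMinutes);
            }
        }
    }
}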

Unfortunately even RouteReader/RouteWriter could not overcome a fundamental problem with MapPoint 2010 – a strange inability to geocode addresses along interstate or state highways. At first, we thought it was a naming issue: many highways have multiple “names” depending on the segment, so possibly we simply didn’t have the preferred street name for the segment of highway in question. However, this was not the case. Street-number-level geocoding simply does not seem to be available for many highway segments in the area we were investigating (the Southeast US), even though these are not new developments.

To work around this, we had to laboriously confirm each unidentified address using Bing Maps (which is a great tool because it returns the “official” version of an address, together with the ZIP+4), and then force the stop back into MapPoint at the correct location using the Lat/Lon obtained from Bing. And then, since we could not get MapPoint to transfer pushpins between datasets, we had to manually add these “invalid” stops into their intended route before optimizing the route and reporting out to Excel. This added a considerable amount of time and effort to what was already a slow process. If only we could have routed on Bing Maps! Last but not least, if an address is incorrect (and we had several), it would be very helpful to have the opportunity in MapPoint to correct it and re-match it on the spot.

Sunday, June 6, 2010

Searching SharePoint 2010 with FAST

FAST is a high-end search engine provided by Microsoft (at additional cost) as an enterprise-level alternative to SharePoint’s built-in search engine. Whereas standard SharePoint 2010 can handle millions of documents, the FAST search engine can index and search over a hundred million, i.e. it can scale to handle not only document management for an entire organization but also more specialist requirements such as regulatory compliance and litigation document review. It also has extensive support for languages other than English, including Chinese, Japanese and Korean.

As well as being an enterprise-level search engine, FAST incorporates a number of features designed to make it easier for end users to find things. For example, many users remember documents by their visual appearance. FAST supports this by displaying a small thumbnail next to the summary of each document, so users looking for a specific document can rapidly identify it. In addition, FAST includes graphical previewers for PowerPoint documents which can be used, for example, to find that one particular slide in a presentation without having to open the whole file and go through it slide by slide. Results also include links to ‘Similar Results’ and to ‘Duplicates’.

Example of a FAST Results Display


To support its search capabilities, FAST includes extremely powerful content processing based on linguistics and text analysis. Examples of linguistic processing during item and query processing include character normalization, normalization of stemming variations and suggested spelling corrections. FAST automatically extracts document metadata, such as author and date last modified, and makes it available for fielded searching, faceted search refinement and relevancy tuning. In addition to document metadata, it is also possible to define what Microsoft refers to as “managed properties”. These are categories such as organization names, place names and dates that may exist in the content of the document and can help develop or refine a search. Defining a custom extractor enables such properties to be identified and indexed. (Note: this is a similar capability to that offered by several ‘Early Case Assessment’ tools in the litigation space.)
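To make the idea of a custom extractor concrete, the fragment below shows the general pattern at its simplest: scan document text for matches (here, dates written in one fixed format) and emit them as values for a named property that could then be indexed and used as a refiner. This is our own generic illustration; FAST’s actual extractors are configured through its own administration tooling and use far richer dictionaries and linguistic rules.

// Generic illustration of the extractor pattern: pull "date" values out of document text.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DateExtractor {
    // Matches dates written like "June 16, 2010".
    private static final Pattern DATE = Pattern.compile(
        "(January|February|March|April|May|June|July|August|September|October|November|December)\\s+\\d{1,2},\\s+\\d{4}");

    // Return the values to store in a hypothetical "date" managed property.
    static List<String> extractDates(String documentText) {
        List<String> dates = new ArrayList<>();
        Matcher m = DATE.matcher(documentText);
        while (m.find()) {
            dates.add(m.group());
        }
        return dates;
    }
}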

Example of FAST Refinement Category List for a Results Set


SharePoint 2010 Standard provides the ability to refine search results based on key metadata/properties such as document type, author and date created. By default, these refinement values are based on only the first 50 results returned. With FAST, refinement moves to a whole other level – so-called ‘Deep’ refinement – where the refinement categories are based on managed properties across the entire result set. Users are presented with a list of refinement categories together with the counts within each category. (Note: this functionality is similar to the refinement capability that many major eCommerce sites provide, e.g. NewEgg.com, BestBuy, etc.)
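Conceptually, ‘deep’ refinement just means the category counts are computed over every hit rather than over a sample of the top results. The sketch below shows that difference in plain Java over an in-memory result list; FAST of course computes these counts inside the search engine itself, so treat this only as an illustration of the concept.

// Shallow vs deep refinement counts over a result set (conceptual illustration only).
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RefinementCounts {
    static class Result {
        final String title;
        final String author;   // the property we refine on
        Result(String title, String author) { this.title = title; this.author = author; }
    }

    // 'Shallow' refinement: counts based on the first sampleSize results only.
    static Map<String, Integer> shallowCounts(List<Result> results, int sampleSize) {
        return countByAuthor(results.subList(0, Math.min(sampleSize, results.size())));
    }

    // 'Deep' refinement: counts based on the entire result set.
    static Map<String, Integer> deepCounts(List<Result> results) {
        return countByAuthor(results);
    }

    private static Map<String, Integer> countByAuthor(List<Result> results) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Result r : results) {
            counts.merge(r.author, 1, Integer::sum);
        }
        return counts;
    }
}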

SharePoint 2010 with FAST : Architectural Overview


A detailed feature comparison between SharePoint 2010 Standard Search and FAST, along with further information about FAST, is provided in Microsoft’s document “FAST Search Server 2010 for SharePoint Evaluation Guide”, downloadable from http://www.microsoft.com/downloads/details.aspx?FamilyID=f1e3fb39-6959-4185-8b28-5315300b6e6b&displaylang=en