Wednesday, July 30, 2014

Will "data lakes" cure companies info management problems?

Anyone working in the analytics space knows that one of the primary obstacles to performing analytics is the state of the underlying data. One might naively expect organizations to treat their systems as a strategic asset deserving top attention. In reality, the state of those systems leaves much to be desired. Michael Lewis highlighted this phenomenon in his latest book, Flash Boys. He notes in this interview that his investigation into High Frequency Trading (HFT) began with the arrest of Sergey Aleynikov:

Aleynikov was arrested for stealing code from Goldman Sachs related to its HFT platform, on the assumption that he would use this code at his new employer's trading department. However, as Lewis explains, the new firm used a totally different code base, and the information Aleynikov took was the equivalent of the notes one keeps in a notebook (this, Lewis explains, was the judgement of Aleynikov's programming peers).

So if this is relatively useless information, why was Aleynikov kept in jail for a year?

The theory is that Goldman was afraid the inadequacy of its HFT platform would be revealed to the world. Yes, Goldman. A bank deemed "too big to fail" - and thereby having access to effectively unlimited government funding and bailouts - couldn't spare the capital to invest in a state-of-the-art trading platform. This also seems to be one of the reasons Goldman ultimately backed IEX, the exchange Brad Katsuyama set up to essentially eliminate the advantage that HFT firms have over everyone else.

Given that this is the reality of the technology, it should be no surprise that the promise of data analytics often dies from the inability to get to the data. And this is where data lakes come into the picture.

As noted in this Forbes article, "data lakes" differ from "data warehouses" in that there is no upfront cleansing, sorting and categorization of the data into a specific structure. Instead, the data is stored "in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers."
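The contrast can be sketched in a few lines of Python (the records and field names below are hypothetical). A warehouse cleans and structures data on the way in; a lake dumps raw records into cheap storage and imposes structure only at query time, an approach often called schema-on-read:

```python
import json

# Hypothetical raw records dumped into a "lake" as-is: no upfront schema,
# inconsistent field names, incomplete entries.
raw_lake = [
    '{"customer": "Acme", "amount": "1200"}',
    '{"cust_name": "Beta Corp", "amt": 950}',   # different field names
    '{"customer": "Gamma", "amount": null}',     # incomplete record
]

def query_total(lake):
    """Schema-on-read: reconcile the divergent fields only when a
    question is actually asked, not when the data is loaded."""
    total = 0.0
    for line in lake:
        rec = json.loads(line)
        amount = rec.get("amount") or rec.get("amt")
        if amount is not None:
            total += float(amount)
    return total

print(query_total(raw_lake))  # 2150.0
```

The cost saved at load time is paid at query time: every question has to repeat this reconciliation work, which is exactly the trade-off Gartner picks at below.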

Will this then solve the world's data access and cleansing problems? Gartner does not think so.

In an analyst report released last week, Gartner criticized the concept sharply, noting the following issues with data lakes:

  • No consensus on what "data lake" means. As with any new techno-buzzword, each vendor defines it to mean something different. With cloud computing, by contrast, NIST quickly came onto the scene to define the concept, and most commentators at least referred to that definition as a starting point before proposing their own. 
  • A skills gap prevents the common user from leveraging "data lakes". According to Gartner, the concept "assumes that all are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata". In other words, companies that save time and effort on upfront preprocessing of the data need to equip their users with the necessary skills to leverage the technology. 
  • Data lakes remove context, security and other metadata. Without context, users cannot use the data reliably. Imagine the difficulty of assessing the dimensions of inventory when some subsidiaries use the metric system while others do not. 
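The units problem in that last bullet is easy to make concrete. In this hypothetical sketch, inventory records from two subsidiaries carry a unit tag; strip that metadata away and the raw numbers 100 and 36 become meaningless, which is precisely the context loss Gartner warns about:

```python
# Hypothetical inventory records from two subsidiaries. Without the
# "unit" field, the lengths 100 and 36 cannot be compared at all.
records = [
    {"sku": "A1", "length": 100, "unit": "cm"},   # metric subsidiary
    {"sku": "B2", "length": 36,  "unit": "in"},   # imperial subsidiary
]

TO_CM = {"cm": 1.0, "in": 2.54}  # conversion factors to a common unit

def length_cm(rec):
    """Normalize a record's length to centimetres using its unit metadata."""
    return rec["length"] * TO_CM[rec["unit"]]

for rec in records:
    print(rec["sku"], length_cm(rec))  # A1 100.0 / B2 91.44
```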
It appears that data lakes, in their current iteration, are a work in progress when it comes to solving the data accessibility and quality issues that analytics experts face. However, more robust front-end tools that address these issues could make data lakes a viable way to access an organization's disparate data. 
