Wednesday, July 30, 2014

Will "data lakes" cure companies info management problems?

Anyone working in the analytics space will know that one of the primary obstacles to performing analytics is the state of the underlying data. One might, naively, expect organizations to treat their systems as a strategic asset that requires top attention. In reality, the state of most systems leaves much to be desired. This phenomenon was actually highlighted in Michael Lewis's latest book, Flash Boys. He notes in this interview that his investigation into High Frequency Trading (HFT) began with the arrest of Sergey Aleynikov:


Aleynikov was arrested for stealing code from Goldman Sachs related to its HFT platform, and it was assumed that he would use this code at his new employer's trading department. However, as Lewis explains, the new firm used a totally different code base, and the information he took was the equivalent of the notes one keeps in a notebook (Lewis explains that this was the judgement of Aleynikov's programming peers).

So if this is relatively useless information, why was Aleynikov kept in jail for a year?

The theory is that Goldman was afraid that the inadequacy of its HFT platform would be revealed to the world. Yes. Goldman. A bank deemed "too big to fail" - thereby having access to unlimited government funding and bailouts - couldn't spare the capital to invest in a state-of-the-art trading platform. This may be one of the reasons why Goldman ultimately backed the IEX platform that Brad Katsuyama set up to essentially eliminate the advantage that HFT firms have over everyone else.

Given that this is the reality of the technology, it should be no surprise that the promise of data analytics often gets killed by the inability to get to the data. And this is where data lakes come into the picture.

As noted in this Forbes article, "data lakes" differ from "data warehouses" in that there is no upfront cleansing, sorting and categorization of the data into a specific structure. Instead, the data is stored "in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers."
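To make that "organize only when a question arises" idea concrete, here is a minimal schema-on-read sketch in Python. The file path and field names are invented for illustration, not taken from any particular product: raw event files sit in the lake untouched, and structure is imposed only at query time.

```python
import glob
import json

# Data-lake style: raw events sit in cheap storage as-is, with no upfront
# cleansing or schema enforcement. (Path and field names are hypothetical.)
RAW_FILES = glob.glob("datalake/raw/clickstream/*.json")

def load_events():
    """Read every raw file; tolerate records we don't understand yet."""
    for path in RAW_FILES:
        with open(path) as f:
            for line in f:
                try:
                    yield json.loads(line)
                except json.JSONDecodeError:
                    continue  # leave the junk in the lake, skip it for this question

def sessions_by_country(events):
    """Schema-on-read: project and clean only the fields this question needs."""
    counts = {}
    for event in events:
        country = (event.get("geo", {}).get("country") or "unknown").lower()
        counts[country] = counts.get(country, 0) + 1
    return counts

if __name__ == "__main__":
    print(sessions_by_country(load_events()))
```

Contrast this with a warehouse approach, where the cleansing and the country lookup would have been done once, up front, before anyone could query the data.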

Will this then solve the world's data access and cleansing problems? Gartner does not think so.

In an analyst report released last week, Gartner slammed the concept pretty hard, noting the following issues with data lakes:

  • No consensus around what "data lake" means. As with any new techno-buzzword, each vendor defines data lakes to mean something different. By contrast, when cloud computing emerged, NIST quickly came onto the scene to define the concept, and most commentators at least referred to that definition as a starting point before proposing their own; no equivalent reference point exists for data lakes yet. 
  • A skills gap prevents the common user from leveraging "data lakes": According to Gartner, the data lake concept "assumes that all are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata". In other words, companies that save time and effort by skipping upfront preprocessing of the data need to equip their users with the skills necessary to leverage the technology. 
  • Data lakes remove context, security and other metadata. Without that context, users cannot interpret the data reliably. Imagine the difficulty of assessing the dimensions of inventory when some subsidiaries use the metric system and others do not (a toy sketch of this problem follows this list). 
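As a toy illustration of that lost-context problem (the subsidiary codes, field names and unit mapping below are assumptions, not drawn from any real system): without metadata recording which unit system each subsidiary uses, the raw numbers in the lake cannot be compared safely.

```python
# Toy illustration of missing context/metadata in a data lake.
# The unit mapping below is exactly the kind of governed metadata a
# warehouse would have enforced up front; without it, the raw lengths
# are ambiguous (centimetres? inches?).
RAW_INVENTORY = [
    {"subsidiary": "EU-01", "part": "widget", "length": 25.0},
    {"subsidiary": "US-02", "part": "widget", "length": 9.8},
]

UNIT_BY_SUBSIDIARY = {"EU-01": "cm", "US-02": "in"}  # assumed metadata
TO_CM = {"cm": 1.0, "in": 2.54}

def normalized_lengths(records):
    """Convert every length to centimetres using the (assumed) metadata."""
    for record in records:
        unit = UNIT_BY_SUBSIDIARY.get(record["subsidiary"])
        if unit is None:
            raise ValueError(f"No unit metadata for {record['subsidiary']}")
        yield {**record, "length_cm": record["length"] * TO_CM[unit]}

print(list(normalized_lengths(RAW_INVENTORY)))
```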
It appears that data lakes, in their current iteration, are a work in progress when it comes to solving the data accessibility and quality issues that analytics experts face. However, with more robust front-end tools to address these issues, data lakes could become a genuine means of accessing an organization's disparate data. 



Wednesday, July 16, 2014

Privacy to be cast aside to make Big Data a reality?

This is the fourth and final instalment of a multi-part exploration of the audit, assurance, compliance and related concepts brought up in the book Big Data: A Revolution That Will Transform How We Live, Work, and Think (the book is also available as an audiobook and hey, while I am at it, here's the link to the e-book). In the last two posts we explored more tactical examples of how big data can assist auditors in executing audits, resulting in a more efficient and effective audit. The book also examines the societal implications of big data. In this instalment, we explore the privacy implications of big data.

What are the privacy implications of Big Data?
In the past three instalments, we've explored the opportunities that big data affords the audit profession and society at large. In this article we look at the privacy implications raised by the book.

When we think of a totalitarian state, we flash back to the regimes of World War II or the Soviet era. The book describes how the East German communist state invested vast amounts of resources in gathering data on its citizens in order to see who conformed to the state's ideology and who didn't. The book notes that the East German secret police (the Ministerium für Staatssicherheit, or "Stasi") accumulated, amongst other things, 70 miles of documents. Now, however, big data analytics enables corporations and governments to mine the digital exhaust people leave behind through social media, cell phone use and email, essentially eliminating the privacy people once had.

Some may point to anonymization as a potential solution to the problem. However, the authors highlight how New York Times reporters were able to comb through anonymized search data published by AOL and positively establish the identity of individual users. This shows that the powerful tools that have emerged from big data alter the privacy landscape; consequently, privacy controls need to be rethought from this perspective.
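As a stylized sketch of why anonymization alone can fall short (the records and names below are invented and are not the actual AOL data): even when direct identifiers are removed, a handful of quasi-identifiers or distinctive attributes can be joined against outside information to single someone out.

```python
# Stylized linkage-attack sketch with invented data: an "anonymized" log
# keeps only a pseudonymous id, but quasi-identifiers (zip + age here)
# can be matched against outside knowledge to re-identify a person.
anonymized_log = [
    {"user_id": 481516, "zip": "30338", "age": 62, "query": "landscapers near me"},
    {"user_id": 234234, "zip": "10001", "age": 29, "query": "best pizza downtown"},
]

# Outside knowledge an adversary might hold (public records, social media, ...).
public_facts = [
    {"name": "Jane Doe", "zip": "30338", "age": 62},
]

def reidentify(log, facts):
    """Join the two datasets on the shared quasi-identifiers."""
    return [
        (fact["name"], row["user_id"], row["query"])
        for fact in facts
        for row in log
        if row["zip"] == fact["zip"] and row["age"] == fact["age"]
    ]

print(reidentify(anonymized_log, public_facts))  # links Jane Doe to id 481516
```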

The authors, however, raise a much more interesting point when discussing privacy in the era of big data: the conflict between privacy and profiting from big data. They note how the value of big data emerges from its secondary uses. However, privacy policies require the user to consent to specific uses of the data at the time they sign up for the service, which means future big data analytics are essentially limited to whatever uses the user agreed to at sign-up. Corporations, in their drive to maximize profits, will therefore make privacy policies so loose (i.e. loose enough to cover secondary uses) that the user essentially has to give up all their privacy in order to use the service. What the authors propose instead is an accountability framework: similar to how stock-issuing companies are accountable to securities regulators, organizations would be accountable to a privacy body of sorts that reviews their use of big data and holds them accountable for the negative consequences of that use.

For those of us who have been involved in privacy compliance, such an approach would create real pressure for companies to deal with privacy issues in a proactive manner. We saw how companies' attitudes towards controls over financial reporting shifted from mild interest (or indifference) to active concern with the passage of Sarbanes-Oxley. In contrast, no similar fervour can be found in the business landscape when it comes to privacy. Although the solution is not obvious, the reality is that, left unchecked, companies will make their privacy notices meaningless in order to reap the ROI on the investments they have made in big data.