Most people are familiar with the 4 Vs definition of Big Data: Volume, Variety, Velocity and Veracity. (And if you are not, here is an infographic courtesy of IBM:)
I have written about big data in the past, specifically about its implications for financial audits (here, here, and here) as well as privacy. However, I was meeting with people recently to discuss big data, and I found that business professionals understood what it was but divorced from its operational implications. This is problematic: the potential of big data is lost if we don't understand how it has changed the underlying analytical technique.
But first we must look at the value perspective: how is big data different from the business intelligence techniques that businesses have used for decades?
From a value perspective, big data analytics and business intelligence (BI) ultimately have the same value proposition: mining data to find trends, correlations, and other patterns that identify new products and services or improve existing offerings.
However, what big data is really about is that the technological constraints that once limited analytical technique no longer exist. What I am saying is that big data is more about how we can do analysis differently than about the data itself. To me, big data is a trend in analytical technique where volume, variety, and velocity are no longer an issue in performing analysis. In other words - to flip the official definition into an operational statement - the size, shape (e.g. unstructured or structured), and speed of the data are no longer an impediment to your analytical technique of choice.
And this is where you, as a TechBiz Pro, need to weigh the merits of walking your audience through the technological advances in the NoSQL realm. That is, how did we go from the rows-and-columns world of BI to the open world of big data? Google is a good place to start, as it is a pretty good illustration of big data techniques in action: using Google, we can extract information from the giant mass of data we know as the Internet (volume), within seconds (velocity), regardless of whether it is video, image, or text (variety). However, Internet companies found the existing SQL technologies inadequate for the task, and so they moved into the world of NoSQL technologies such as Hadoop (Yahoo), Cassandra (Facebook), and Google's BigTable/MapReduce. The details aren't really important; what matters is that these companies had to invent new tools to deal with the world of big data.
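To make the paradigm concrete, here is a minimal sketch of the MapReduce idea behind tools like Hadoop - a toy word count written in plain Python (my own illustration, not code from any of these systems). The map step emits key-value pairs, which real frameworks scatter across many machines; the reduce step aggregates them by key:

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each distinct word."""
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data is big", "data beats models"]
print(reduce_phase(map_phase(docs)))
# {'big': 2, 'data': 2, 'is': 1, 'beats': 1, 'models': 1}
```

The point of the split is that the map step is embarrassingly parallel - each document can be processed on a different node - which is exactly what lets these systems shrug off volume.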
And this leads to how it has disrupted conventional BI thinking when it comes to analysis.
From a statistical perspective, you no longer have to sample the data and extrapolate to the larger population. You can just load up the entire population, apply your statistical modeling imagination to it, and identify the correlations that are there. Chris Anderson, of Wired, noted that this is a seismic change in nothing less than the scientific method itself. In a way, what he is saying is that now that you can put your arms around all the data, you no longer really need a model. He got a lot of heat for saying this, but he penned the following to explain his point:
"The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.
But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on."
Science aside, the observation Chris Anderson makes has big implications for business decision making. Advances in big data technologies enable the deployment of statistical techniques that were previously not feasible and can yield insights without the bother of model development. Statisticians and data scientists can play with the data and find something that works through trial and error. From a financial audit perspective, this has tremendous implications - once we figure out the data extraction challenge. And that's where veracity comes in, which is the topic of a future blogpost.
But to close on a more practical level, companies such as Tesco are leveraging big data analytics to improve their bottom line. An example, courtesy of Paul Miller from the Cloud of Data blog/podcast site, is how Tesco extracted the following insight: “[a] 16 degree sunny Saturday in late April will cause a spike. Exactly the same figures a couple of weeks later will not, as people have had their first BBQ of the season”. In terms of overall benefits to the company, he notes “Big Data projects deliver huge returns at Tesco; improving promotions to ensure 30% fewer gaps on shelves, predicting the weather and behaviour to deliver £6million less food wastage in the summer, £50million less stock in warehouses, optimising store operations to give £30million less wastage.”