Why read this book?
This book is written by journalist Kenneth Cukier (who claims in this video to have used the term "big data" before it was commonly used) and Viktor Mayer-Schönberger (a professor of Internet governance at the University of Oxford).
Given the backgrounds of the authors, it is an easy-to-digest book that gives the reader a good understanding of how access to large volumes of data and the use of correlations will change the way business is done and how society as a whole functions, without going into the technical detail of how big data is "crunched" at the back end. The authors also discuss the following:
- Why more is better: Algorithms improve by being exposed to more data, regardless of how messy that data is. On the topic of size, the book also comments on how statistical sampling is a feature of an era when organizations could not wrap their arms around their data.
- Consumer and business implications: The book is filled with examples that anyone can relate to, such as predicting whether the price of an airplane ticket will go up or down, as well as how Google uses search queries to predict flu outbreaks.
- Enter "Datafication": It also distinguishes "datafication" versus "digitization", where the latter is making something into bits and bytes, whereas the former is something that can be analyzed by some sort of analytic engine.
- Potential challenges and negative consequences of big-data-driven decisions: One of the challenges cited by the authors is the "black box" nature of algorithms: how does an ordinary person challenge an algorithm when it takes a rocket scientist to understand the algorithm itself? The authors also explain the danger of subordinating human decision making to algorithms. For example, they note it would be problematic for governments to round up and quarantine people just because they looked up terms related to the flu.
The case for data-driven audits
The book is filled with examples that illustrate the power of big data and its impact on business and society. A couple of them, however, also illustrate how financial audits can benefit from such techniques, given the way non-financial "audits" are already using big data to assess information.
Case 1: New York City and Auditing Illegal Conversions: As discussed in this excerpt of the book, Mike Flowers applied big data techniques to the problem of "illegal conversions" in New York City. As noted in the article, an illegal conversion is "the practice of cutting up a dwelling into many smaller units so that it can house as many as 10 times the number of people it was designed for. They are major fire hazards, as well as cauldrons of crime, drugs, disease, and pest infestation. A tangle of extension cords may snake across the walls; hot plates sit perilously on top of bedspreads. People packed this tightly regularly die in blazes". The data scientists working for Flowers took the 900,000 property lots in the city and correlated "five years of fire data ranked by severity" against the following pieces of data (a rough sketch of this kind of risk scoring follows the list):
- Delinquency in paying property taxes,
- Foreclosure proceedings,
- Odd patterns in their usage of utilities,
- Non-payment of utilities,
- Type of building,
- Date building was built,
- Ambulance visits,
- Rodent complaints,
- External brickwork.
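To make this concrete, here is a minimal sketch in Python (using pandas and scikit-learn) of this kind of risk scoring. The column names and the tiny synthetic data set are hypothetical (the book does not publish the team's actual model), but the shape of the exercise is the same: correlate candidate variables with past fire severity and rank lots by predicted risk.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical property-lot data; in the real project this was stitched
# together from data sets held by 19 different city agencies.
lots = pd.DataFrame({
    "tax_delinquent":    [1, 0, 0, 1, 0, 1, 0, 0],
    "in_foreclosure":    [1, 0, 0, 0, 0, 1, 0, 0],
    "utility_anomalies": [1, 0, 1, 1, 0, 1, 0, 0],
    "ambulance_visits":  [4, 0, 1, 3, 0, 5, 1, 0],
    "rodent_complaints": [6, 1, 2, 4, 0, 7, 1, 0],
    # Label: did a severe fire occur on the lot in the past five years?
    "severe_fire":       [1, 0, 0, 1, 0, 1, 0, 0],
})

features = lots.drop(columns="severe_fire")
model = LogisticRegression().fit(features, lots["severe_fire"])

# Score every lot and send inspectors to the riskiest ones first.
lots["fire_risk"] = model.predict_proba(features)[:, 1]
print(lots.sort_values("fire_risk", ascending=False))
```

Nothing here depends on logistic regression in particular; any method that ranks lots by their correlation with past fire severity serves the same purpose of telling inspectors where to look first.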
This is pretty straightforward evidence for "data-driven audits": financial auditors can identify correlations between financial data and non-financial data to determine which financial transactions need more scrutiny than others.
Not convinced?
Well, investors are already doing this. The book gives the example of an investment firm that uses traffic analysis from Inrix to estimate the sales a retailer will make, and then buys or sells the retailer's stock on that information. In a sense, the investment firm is using traffic as a proxy for sales. In an audit context, auditors can study the vehicular traffic around stores against the sales recorded at those stores and determine if there are issues worth investigating.
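Here is a hedged sketch, again in Python, of what that comparison could look like. The figures are invented and the linear model is deliberately crude; the idea is simply that stores whose recorded sales deviate sharply from a traffic-based estimate become audit leads.

```python
import numpy as np

# Hypothetical per-store data: weekly vehicle counts near each store
# (from a traffic provider such as Inrix) and the sales each store recorded.
traffic = np.array([1200, 3400, 2100, 5600, 4300, 2800])
sales = np.array([60.0, 170.0, 105.0, 150.0, 215.0, 140.0])  # in $000s

# Fit a simple linear relationship: expected sales given traffic.
slope, intercept = np.polyfit(traffic, sales, 1)
expected = slope * traffic + intercept

# Rank stores by how far recorded sales deviate from the traffic-based
# estimate; a large gap is an audit lead, not proof of misstatement.
residuals = sales - expected
for store in np.argsort(-np.abs(residuals)):
    print(f"Store {store}: recorded {sales[store]:.0f}k, "
          f"estimated {expected[store]:.0f}k, gap {residuals[store]:+.0f}k")
```

A store whose recorded sales run well below what its traffic predicts is worth a closer look; the gap does not prove anything by itself, but it tells the auditor where to dig.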
Of course, this endeavor is not merely a matter of copying and pasting data from StatsCan and cobbling together a spreadsheet or two. It is a lot of hard work. However, this is not surprising to anyone who has been performing computer-assisted audit techniques for the last decade or so: the challenge has always been cleaning up the data and making it usable. Some of the challenges that the New York team of statisticians faced include:
- Inconsistent data formats: The team had to bring together data sets from 19 different agencies, and each agency had a different way of describing location. This had to be standardized so that each of the 19 data sets could be correlated to the same property (see the sketch after this list).
- Datafying expert intuition: The article describes how brickwork got added as an element of the correlation model: the data scientists on the team observed how fire inspectors could look at a building and know whether it was okay or not, and turned that intuition into the external-brickwork variable.
- Understanding the significance of each variable: Each variable must be assessed in its own right to avoid the problem of overgeneralization. For example, rodent infestation is not uniform in its significance across New York City. As noted in the article, "A rat spotted on the posh Upper East Side might generate 30 calls within an hour, but it might take a battalion of rodents before residents in the Bronx felt moved to dial 311".
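Two of these challenges lend themselves to a small illustration. The sketch below, in Python with pandas, shows a crude way of (a) reducing differently formatted location strings to a common join key and (b) comparing a lot's rodent complaints against its own neighborhood's baseline rather than a citywide threshold. The normalization rules, agency names, and numbers are all invented for illustration; the New York team's actual pipeline was far more involved.

```python
import pandas as pd

# (a) Inconsistent location formats: each agency describes the same
# property differently, so reduce every record to a crude canonical
# key before joining the data sets.
def canonical_key(raw: str) -> str:
    return " ".join(raw.upper().replace(".", "").replace(",", "").split())

fire_dept = pd.DataFrame({"loc": ["123 Main St.", "9 Elm Ave"], "fires": [3, 0]})
tax_office = pd.DataFrame({"loc": ["123 MAIN ST", "9 elm ave"], "delinquent": [1, 0]})
for df in (fire_dept, tax_office):
    df["key"] = df["loc"].map(canonical_key)
merged = fire_dept.merge(tax_office, on="key")  # now one row per property

# (b) A variable's significance varies by area: a raw rodent-complaint
# count means different things in different neighborhoods, so express
# each lot's count as a z-score against its own neighborhood's baseline.
complaints = pd.DataFrame({
    "neighborhood": ["UES", "UES", "UES", "Bronx", "Bronx", "Bronx"],
    "rodent_calls": [30, 29, 31, 1, 2, 12],
})
grp = complaints.groupby("neighborhood")["rodent_calls"]
complaints["z_score"] = (complaints["rodent_calls"] - grp.transform("mean")) / grp.transform("std")
print(merged)
print(complaints)  # 12 calls in the Bronx is more anomalous than its raw count suggests
```

The z-score version makes the book's point numerically: the Bronx lot with only 12 calls ends up with the highest score, even though its raw count is far below a typical Upper East Side figure.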