Monday, December 29, 2014

Low Decision Agility: BigData's Insurmountable Challenge?

Having worked in the field of data analytics for over a decade, I have found one recurring theme that never seems to go away: the overall struggle organizations have with getting their data in order.

Although this is normally framed in terms of data quality and data management, it's important to link this back to the ultimate raison d'etre for data and information: organizational decision making. Ultimately, when an organization has significant data and information management challenges, they culminate in a lack of "decision agility" for executive or operational management. I define decision agility as follows:

"Decision agility is the ability of an entity to provide relevant information 
to a decision maker in a timely manner." 

Before getting into the field, you would think that with all the hype of the Information Age it would be as easy as pressing a button for a company to get you the data you need to perform your analysis. However, after getting into the field, you soon realize how wrong this thinking is: most organizations have low decision agility.

I think it is fair to say that this problem hits those involved in external (financial) audits the hardest. As we have tight budgets, low decision agility at the clients we audit makes it cost-prohibitive to perform what is now known as audit analytics (previously known as CAATs). Our work is often reined in by the (non-IT) auditors running the audit engagement because it is "cheaper" to do the same test manually rather than parse our way through the client's data challenges.

So what does this have to do with Big Data Analytics?

As I noted in my last post, there is the issue of veracity - the final V in the 4 Vs definition of Big Data. However, veracity is part of the larger problem of low decision agility that you can find at organizations. Low decision agility emerges from a number of factors and can have implications for a big data analytics initiative at an organization. These factors and implications include:

  • Wrong data: Fortune, in this article, notes the obvious issue of "obsolete, inaccurate, and missing information" in the data records themselves. Consequently, the big data analytics initiative needs to assess the veracity of the underlying data to understand how much work is needed to clean it up before meaningful insights can be drawn from it.
  • Disconnect between business and IT: The business has one view of the data and the IT folks see the data in a different way. So when you try to run a "simple" test, it takes a significant amount of time to reconcile the business's view of the data model with IT's view of the data model. To account for this problem, some effort is needed to sync the user's view of the data and IT's view of the data on an ongoing basis, so that the big data analytic can rely on data that syncs up with the ultimate decision maker's view of the world.
  • Spreadsheet mania: Let's face it: organizations treat IT as an expense, not as an investment. Consequently, organizations will rely on spreadsheets to do some of the heavy lifting for information processing because it is the path of least resistance. The overuse of spreadsheets can be a sign of an IT system that fails to meet the needs of the users. However, regardless of why they are used, the underlying problem is dealing with this vast array of business-managed applications that are often fraught with errors and outside the controls of the production system. The control and related data issues become obvious during compliance efforts, such as SOX 404, or major transitions to new financial/data standards, such as the move to IFRS. When developing big data analytics, how do you account for the information trapped in these myriad little apps outside of IT's purview?
  • Silo thinking: I remember the frustration of dealing with companies that lacked a centralized function with a holistic view of the data. Each department would know its portion of the processing rules, etc., but would have no idea of what happened upstream or downstream. Consequently, an organization needs to create a data governance structure that understands the big picture and can identify and address the potential gaps in the data set before it is fed into the Hadoop cluster.
  • Heterogeneous systems: Organizations with a patchwork of systems require extra effort to get the data formatted and synchronized. InfoSec specialists deal with this issue of normalization when it comes to security log analysis: the security logs that are extracted from different systems need to have the event IDs, codes, etc. "translated" into a common language to enable a proper analysis of events that are occurring across the enterprise. The point is that big data analytics must also perform a similar "translation" to enable analysis of data pulled from different systems. Josh Sullivan of Booz Allen states: "your models can take weeks and weeks" to recognize which pieces of content fed into the system are actually the same value. For example, it will take a while for the system to learn that female and woman are the same thing when looking at gender data.
  • Legacy systems: Organizations may have legacy systems that do not retain data, are hard to extract from, and are difficult to import into other tools. Consequently, getting the data into a usable format can cost time and money, which also needs to be factored into the big data analytics initiative.
  • Business rules and semantics: Beyond the heterogeneity between systems, there can also be a challenge in how something is commonly defined. A simple example is currency: in an ERP that spans multiple countries, the amount reported may be in the local currency or in dollars, and metadata is required to give it that meaning. Another issue is that different user groups may define the same thing differently. For example, a sale for the sales/marketing folks may not mean the same thing as a sale for the finance/accounting group (e.g. the sales & marketing people may not account for doubtful accounts or incentives that need to be factored in for accounting purposes).
Of course, this is not an exhaustive list of issues, but it gives an idea of how the promise of analytics is obscured by the tough reality of the state of the data.
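
To make the "translation" step described under heterogeneous systems a little more concrete, here is a minimal Python sketch. It assumes a hand-built synonym table (GENDER_SYNONYMS) and a handful of made-up records; these names and values are purely illustrative and do not come from any particular tool. Note that it is a simple rule-based lookup, not the model-based learning that Josh Sullivan describes, so treat it as a sketch of the idea rather than how a real big data platform would do it.

```python
# Hypothetical synonym table: labels used by different source systems are
# translated into one common code before any analysis is run.
GENDER_SYNONYMS = {
    "female": "F", "woman": "F", "f": "F",
    "male": "M", "man": "M", "m": "M",
}


def normalize_gender(raw):
    """Return the common code for a raw label, or None if it is not recognized."""
    if raw is None:
        return None
    return GENDER_SYNONYMS.get(raw.strip().lower())


def normalize_records(records):
    """Translate every record and collect the labels that could not be mapped."""
    cleaned, unmapped = [], set()
    for rec in records:
        code = normalize_gender(rec.get("gender"))
        if code is None:
            # Keep track of unknown labels so someone can review them later.
            unmapped.add(rec.get("gender"))
        cleaned.append({**rec, "gender": code})
    return cleaned, unmapped


# Made-up records standing in for extracts from two different systems.
records = [
    {"id": 1, "gender": "Female"},
    {"id": 2, "gender": "woman"},
    {"id": 3, "gender": " M "},
    {"id": 4, "gender": "unknown"},
]
cleaned, unmapped = normalize_records(records)
```

The set of unmapped labels is the useful by-product here: it tells you how much clean-up work is left before the data can be trusted, which is exactly the veracity question.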

In terms of the current state of data quality, a recent blog post by Michele Goetz of Forrester noted that 70% of the executive level business professionals they interviewed spent more than 40% of their time vetting and validating data. (Forrester notes the following caveat about the data: "The number is too low to be quantitative, but it does give directional insight.")

Until organizations get to a state of high decision agility - where business users spend virtually no time vetting/validating the data - organizations may not be able to reap the full benefits of a big data analytics initiative. 

Tuesday, December 23, 2014

How would you explain BigData to a business professional? (Updated)

Most people are familiar with the 4 Vs definition of Big Data: Volume, Variety, Velocity and Veracity. (And if you are not, here is an infographic courtesy of IBM:)

I have written about Big Data in the past, specifically on its implications for financial audits (here, here, and here) as well as privacy. However, I was meeting with people recently to discuss big data, and I found that business professionals understood what it was, but divorced from its operational implications. This is problematic, as the potential of big data is lost if we don't understand how big data has changed the underlying analytical technique.

But first we must look at the value perspective: how is big data different from the business intelligence techniques that businesses have used for decades?

From a value perspective, big data analytics and business intelligence (BI) ultimately have the same value proposition: mining the data to find trends, correlations and other patterns to identify new products and services or improve existing offerings and services.

However, what Big Data is really about is that the technological constraints that used to limit analytical technique no longer exist. What I am saying is that big data is more about how we can do analysis differently than about the actual data itself. To me, big data is a trend in analytical technique where volume, variety, or velocity is no longer an issue in performing analysis. In other words - to flip the official definition into an operational statement - the size, shape (e.g. unstructured or structured) and speed of the data are no longer an impediment to your analytical technique of choice.

And this is where you, as a TechBiz Pro, need to weigh the merits of walking them through the technological advances in the NoSQL realm. That is, how did we go from the rows & columns world of BI to the open world of Big Data? Google is a good place to start. It is a pretty good illustration of big data techniques in action: using Google, we extract information from the giant mass of data we know as the Internet (volume), within seconds (velocity), and regardless of whether it's video, image or text (variety). However, Internet companies found the existing SQL technologies inadequate for the task, and so they went into the world of NoSQL technologies such as Hadoop (Yahoo), Cassandra (Facebook), and Google's BigTable/MapReduce. The details aren't really important; what matters is that these companies had to invent tools to deal with the world of big data.

And this leads to how it has disrupted conventional BI thinking when it comes to analysis.

From a statistical perspective, you no longer have to sample the data and extrapolate to the larger population. You can just load up the entire population, apply your statistical modeling imagination to it and identify the correlations that are there. Chris Anderson, of Wired, noted that this is a seismic change in nothing less than the scientific method itself. In a way, what he is saying is that now that you can put your arms around all the data, you no longer really need a model. He did get a lot of heat for saying this, but he penned the following to explain his point:

"The big target here isn't advertising, though. It's science. The scientific method is built around testable hypotheses. These models, for the most part, are systems visualized in the minds of scientists. The models are then tested, and experiments confirm or falsify theoretical models of how the world works. This is the way science has worked for hundreds of years.

But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete. Consider physics: Newtonian models were crude approximations of the truth (wrong at the atomic level, but still useful). A hundred years ago, statistically based quantum mechanics offered a better picture — but quantum mechanics is yet another model, and as such it, too, is flawed, no doubt a caricature of a more complex underlying reality. The reason physics has drifted into theoretical speculation about n-dimensional grand unified models over the past few decades (the "beautiful story" phase of a discipline starved of data) is that we don't know how to run the experiments that would falsify the hypotheses — the energies are too high, the accelerators too expensive, and so on."

Science aside, the observation that Chris Anderson makes has big implications for business decision making. Advances in big data technologies can enable the deployment of statistical techniques that were previously not feasible and can yield insights without having to bother with model development. Statisticians and data scientists can play with the data and find something that works through trial and error. From a financial audit perspective, this has tremendous implications - once we figure out the data extraction challenge. And that's where veracity comes in, which is the topic of a future blog post.

But to close on a more practical level, companies such as Tesco are leveraging big data analytics to improve their bottom line. An example, courtesy of Paul Miller from the Cloud of Data blog/podcast site, is how Tesco extracted the following insight: “[a] 16 degree sunny Saturday in late April will cause a spike. Exactly the same figures a couple of weeks later will not, as people have had their first BBQ of the season”. In terms of overall benefits to the company, he notes “Big Data projects deliver huge returns at Tesco; improving promotions to ensure 30% fewer gaps on shelves, predicting the weather and behaviour to deliver £6million less food wastage in the summer, £50million less stock in warehouses, optimising store operations to give £30million less wastage.”

Wednesday, December 17, 2014

SEC and the Quants: Will RoboCop get a BigData overhaul?

As reported in this Forbes article in 2013, the SEC began to use the so-called RoboCop to assist with its regulatory duties.

Who is RoboCop?

No, it's not that infamous crime-fighting cyborg from the late-80s (coincidentally remade in 2014). It is actually the Accounting Quality Model (AQM) - not quite as exciting I know. According to Forbes:

"AQM is an analytical tool which trawls corporate filings to flag high-risk activity for closer inspection by SEC enforcement teams. Use of the AQM, in conjunction with statements by recently-confirmed SEC Chairman Mary Jo White and the introduction of new initiatives announced July 2, 2013, indicates a renewed commitment by the SEC to seek out violations of financial reporting regulations. This pledge of substantial resources means it is more important than ever for corporate filers to understand SEC enforcement strategies, especially the AQM, in order to decrease the likelihood that their firm will be the subject of an expensive SEC audit."

Another interesting point raised by the Forbes article is the use of XBRL in this accounting model: "AQM relies on the newly-mandated XBRL data which is prone to mistakes by the inexperienced. Sloppy entries could land your company’s filing at the top of the list for close examination."

(On a side note: the AICPA has published this study to help XBRL filers ensure that they are preparing quality statements, given that there are many possible errors, as noted in this study.)

Within this context, we should take note of how the SEC is hiring "quantitative analysts" (or "quants" for short). As noted in this WSJ article:

"And Wall Street firms, for their part, are able to offer quantitative analysts—or “quants”—far higher pay packages than the regulator. The SEC’s access to market data also remains limited. In 2012, it approved a massive new computer system to track markets, known as the Consolidated Audit Trail, but the system isn’t likely to come online for several years, experts say."

Could the SEC pull a fast one and become the source of innovation? Although the WSJ article seems to downplay the possibility that the SEC can outpace the firms, it is not something that the audit industry can ignore.

As noted in a previous post on Big Data, it was with just this type of mindset that Mike Flowers of New York City looked to revolutionize how NYC leveraged big data to improve its "audit" of illegal conversions. Perhaps the SEC will follow in his footsteps.

Thursday, December 11, 2014

Time for Windows 10? I can't wait!

I have been overly optimistic about Windows in the past, but hear me out!

Boy Genius published a post on the upcoming Windows 10 that Microsoft is releasing next year. (Note that Microsoft decided to skip Windows 9 altogether.)

And does it look good. As can be seen in this video, it will feature Cortana, Microsoft's personal digital assistant, which incorporates voice search, voice commands (i.e. you can get Cortana to set up an appointment for you) and machine learning (i.e. it learns from your interactions).

It speaks to the overall move towards using natural language processing (NLP) and elements of Artificial Intelligence. Apple was arguably first on the scene with its Siri application. However, IBM's Watson is a clearer example of where this technology is heading. Gartner refers to these types of technologies as "smart machines", which they claim have the following implications:

"Most business and thought leaders underestimate the potential of smart machines to take over millions of middle-class jobs in the coming decades," said Kenneth Brant, research director at Gartner. "Job destruction will happen at a faster pace, with machine-driven job elimination overwhelming the market's ability to create valuable new ones."

Will Cortana take your job? Well, let's just enjoy the possibility that Microsoft may build some real cool NLP technology into your everyday computer and worry about that one in a future post!

Tuesday, December 9, 2014

Europe vs Google et al: Long term ramifications of the Snowden Revelations?

The Wall Street Journal had an interesting piece today discussing the "clash that pits [European] governments against the new tech titans, established industries against upstart challengers, and freewheeling American business culture against a more regulated European framework". For example, "[t]he European Parliament in late October called on Internet companies operating in the region to “unbundle” its search engines from its other commercial properties". The obvious company that would be impacted by this is Google (and the WSJ article notes that Microsoft is aiding and abetting such calls to help boost its own profile).

However, the WSJ article notes: "And perhaps most fundamentally, it is about control of the Internet, the world’s common connection and crucial economic engine that is viewed as being under the sway of the U.S. This exploded following the revelations by Edward Snowden of widespread U.S. government surveillance of Americans and Europeans—sometimes via U.S. company data and telecommunications networks."

This would not be the first article to note that the Snowden revelations have put a chill on the move to the (US) cloud. However, it does highlight how far the revelations have gone to force the hand of European regulators to at least act in public like they are trying to do something to protect the data of their companies.

What the article did not go into in much detail is the likely reason that the Europeans are concerned. Although it may be presented as an issue of privacy or anti-surveillance, the likely real reason is industrial espionage. As per the Snowden revelations, governmental spy agencies are not just interested in obtaining information on matters relating to national security, but are also interested in obtaining data related to international trade or other business dealings. As noted by the CBC, “NSA does not limit its espionage to issues of national security and he cited German engineering firm, Siemens as one target”. It is unfair to single out the US for such actions, as other governments do it as well. For example, Canada’s CSEC is also alleged to be involved in similar activity. The Globe & Mail reported that “Communications Security Establishment Canada (CSEC) has spied on computers and smartphones affiliated with Brazil’s mining and energy ministry in a bid to gain economic intelligence.” Former Carleton University Professor Martin Rudner explains (in the same G&M article) that the objective of such surveillance is to give the Canadian government a leg up during negotiations, such as NAFTA.

Although most have forgotten the commercial rivalries (see quote from then US president Woodrow Wilson about the roots of international conflict) that exist between the G8 Nations, it is important to understand the implications that this has for data security on the cloud. Anything that is sensitive and is relevant to business dealings should never be put on the cloud. Of course, what constitutes "sensitive" is a matter of judgment, but the criteria can effectively be "reverse engineered" based on what was revealed.

Friday, December 5, 2014

Remembering those Blackberry days

The Globe and Mail reported on BlackBerry's latest approach to rebuilding its mobile user base. The company is offering a $400 trade-in plus a $150 gift card to anyone who trades in their iPhone for the rather odd square-shaped Passport. Here is the review from the Verge regarding the latest:

Coincidentally, I came across a BlackBerry of mine: the Torch. I remember thinking, after using the device, that it was the perfect compromise between the touch screen and the classic keyboard. However, that feeling faded quite quickly: the device was so under-powered compared to the competition, and of course it lacked the apps that you could find in the Apple App Store. But at the time I could never imagine giving up the physical QWERTY keyboard.

Since then I have moved on to Android and, more specifically, to the SwiftKey keyboard - to the point that I can't go back to a physical keyboard!

How did BlackBerry fail to keep up with the times?

As noted in this article, Mike Lazaridis, the company's founder and former co-CEO, was inspired to develop the BlackBerry when he recalled his teacher's advice while watching a presentation in 1987 - almost a decade before the Internet went mainstream - on how Coke used wireless technology to manage the inventory at its vending machines. What was his teacher's advice? His teacher advised him not to get swept up in the computer craze, as the real boon lay in integrating wireless technology with computers.

BlackBerry caused a storm in the corporate world by introducing its smartphones in 1998. It went on to dominate the corporate smartphone market as the gold standard in mobile communications. The following graphic from Bloomberg really captures the subsequent rise and fall quite well:

What happened? How did the iPhone, unveiled in 2007, and the Android operating system outflank the BlackBerry? This article in the New Yorker largely blames BlackBerry's inability to understand the trend of "consumerization of IT": users wanted to use their latest iPhone or Android device instead of the BlackBerry in the corporate environment - and it was just a matter of technology to make this happen.

Luminaries such as Clay Christensen have written extensively on the challenge of innovation, and there's always the problem of hindsight bias. However, is the problem more basic? When we look at the financial crisis, some people like to blame poor modeling. But I think that is more convenient than accepting the reality that people got swept up in the wave.

Isn’t it fair to say that people knew that the house of cards was going to come down (and some of the investment banks were even betting on it falling apart), but were overly optimistic that they would get out before everyone else did?

But that’s the point.

When we are in a situation where we are surrounded by people who confirm our understanding of the world, we may believe them instead of trying to see if our understanding of the situation is correct. With the housing bubble, the key players wanted to believe that those models were correct, even though models had already failed the infamous Long-Term Capital Management. With BlackBerry, what was it? Did they think their hold over corporate IT would last? What I wonder is: did they not even look at their own families and those around them who were using iPhone or Android devices? Weren’t they curious what “all the fuss was about”?

Although this is a problem for many of us who want to believe that the present situation is going to continue indefinitely (especially when things are going our way), there are others who do stay on top of things. Most notable is the Encyclopedia Britannica, which actually stopped issuing physical encyclopedias and moved to the digital channel instead.

Change is a challenge, but the key is to be prepared to admit that the current way of doing things can be done better, faster and in a radically different way.

Tuesday, October 28, 2014

Financial Crisis: Why didn't they use analytics?

For the past while, I have been reviewing the aftermath of the 2007-2008 Financial Crisis. I came across an interesting piece that highlights the importance of using analytics and "dashboarding" to monitor risk within a company. To be specific, I came across this when going through Nomi Prins's book, It Takes a Pillage: An Epic Tale of Power, Deceit, and Untold Trillions. Nomi Prins was in charge of analytics at Goldman Sachs and other banks. The embedded video gives more information about her and the book she wrote:

While listening to her book, I came across a transcript from the hearings in the aftermath of the crisis. As can be seen in the following video, Representative Paul E. Kanjorski is questioning the now-former CEO of Countrywide Financial, Angelo Mozilo, about the sub-prime crisis.

The part to focus on is where he grills the CEO about why they didn't aggregate statistics to monitor the mounting losses from the sub-prime loans (click here for the source of the transcript; please note that the italics and bold are mine):
"Mr. Kanjorski. How long did it take you to come up with the understanding that there was this type of an 18 percent failure rate before you sent the word down the line, "Check all of these loans or future loans for these characteristics so we don't have this horrendous failure?"
Mr. Mozilo. Yes, immediately--within the first--if we don't get payment the first month, we're contacting the borrower. And that's part of what we do. And we are adjusting our----
Mr. Kanjorski. I understand you do to the mortgage holder. But don't you put all those together in statistics and say, "These packages we are selling now are failing at such a horrific rate that they'll never last and there will be total decimation of our business and of these mortgages?""

In other words, the Congressman is wondering how the CEO could not know that his business was failing, because it is only common sense to monitor the key risk indicators (KRIs) associated with his principal business activities.
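
To illustrate the kind of aggregation the Congressman is asking about, here is a minimal Python sketch. Everything in it is an assumption for illustration: the loan records, the cohort labels, the "missed_first_payment" flag and the 30% threshold are made up and are not taken from the hearing or from any real lender's systems. The point is simply that once loan-level outcomes are rolled up by origination cohort, a failure rate becomes a KRI that can be watched on a dashboard.

```python
from collections import defaultdict


def failure_rate_by_cohort(loans):
    """Roll loan-level outcomes up into a failure rate per origination cohort."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for loan in loans:
        totals[loan["cohort"]] += 1
        if loan["missed_first_payment"]:
            failures[loan["cohort"]] += 1
    return {cohort: failures[cohort] / totals[cohort] for cohort in totals}


def flag_cohorts(rates, threshold):
    """Return the cohorts whose failure rate is at or above the threshold."""
    return sorted(cohort for cohort, rate in rates.items() if rate >= threshold)


# Made-up loan records; a real feed would come from the servicing system.
loans = [
    {"cohort": "2006-Q1", "missed_first_payment": False},
    {"cohort": "2006-Q1", "missed_first_payment": False},
    {"cohort": "2006-Q1", "missed_first_payment": False},
    {"cohort": "2006-Q1", "missed_first_payment": True},
    {"cohort": "2006-Q2", "missed_first_payment": True},
    {"cohort": "2006-Q2", "missed_first_payment": True},
    {"cohort": "2006-Q2", "missed_first_payment": False},
    {"cohort": "2006-Q2", "missed_first_payment": False},
]
rates = failure_rate_by_cohort(loans)
flagged = flag_cohorts(rates, threshold=0.30)
```

Contacting each borrower who misses a payment, as Mr. Mozilo describes, is the loan-by-loan view; the roll-up above is the portfolio view that tells management whether a whole batch of loans is going bad.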

I would be the first to argue that there were much bigger issues with the financial crisis, such as the 16-trillion-dollar bank bailout, the failure to properly rate the bonds backed by the sub-prime mortgages, quantitative easing, and so on. That being said, organizations and companies need to be aware of the importance of measuring the KRIs associated with their business. Regulators, and others charged with oversight, will eventually question the insufficiency of such monitoring controls. Furthermore, as these regulators become more tech savvy - such as the judge in the Oracle vs Google trial - the more sophisticated the dashboards they will expect.

Wednesday, August 6, 2014

Worth mentioning: KPMG's take on the state of tech in the audit profession

In a recent post (as in just this week) on Forbes, KPMG's James P. Liddy, who is the Vice Chair, Audit and Regional Head of Audit, Americas, put out a great post that summarizes the current state of analytics in financial audits.

He diplomatically summarizes the current state of the financial audit as "unchanged for more than 80 years since the advent of the classic audit" while stating "[a]dvances in technology and the massive proliferation of available information have created a new landscape for financial reporting. With investors now having access to a seemingly unlimited breadth and depth of information, the need has never been greater for the audit process to evolve by providing deeper and more relevant insights about an organization’s financial condition and performance – while maintaining and continually improving audit quality." [Emphasis added]

Those of us who started off our careers in the world of financial audit as professional accountants and then moved to the world of audit analytics or IT risk management have always felt that technology could help us get audits done more efficiently and effectively.

I was actually surprised that he stated that auditors "perform procedures over a relatively small sample of transactions – as few as 30 or 40 – and extrapolate conclusions across a much broader set of data". We usually don't see this kind of openness when it comes to discussing the inner workings of the profession. However, I think that discussing such fundamentals is inevitable given that those outside the profession are embracing big data analytics in "non-financial audits". For example, see this post where I discuss the New York City fire department's use of big data analytics to identify a better audit population when it comes to identifying illegal conversions that are a high risk and need to be evacuated.
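
To show the difference between testing a sample of 30 and testing everything, here is a minimal Python sketch. The population of 10,000 transactions, the three unapproved items and the function names are all made up for illustration; a real audit test would of course run against the client's actual data and would use a proper sampling methodology rather than this naive extrapolation.

```python
import random


def build_population(n, exception_ids):
    """Create n hypothetical transactions; those in exception_ids lack approval."""
    return [{"id": i, "approved": i not in exception_ids} for i in range(n)]


def full_population_exceptions(transactions):
    """Test every transaction and return the ids of all exceptions."""
    return [t["id"] for t in transactions if not t["approved"]]


def sample_estimate(transactions, sample_size, seed):
    """Test a random sample and naively extrapolate the exception count."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    sample = rng.sample(transactions, sample_size)
    found = sum(1 for t in sample if not t["approved"])
    return found / sample_size * len(transactions)


population = build_population(10_000, exception_ids={17, 4_203, 9_998})

exact = full_population_exceptions(population)   # every exception, by id
estimate = sample_estimate(population, sample_size=30, seed=1)
```

With only three exceptions in 10,000 items, a sample of 30 will usually find none of them, while the full-population test lists each one by id. That is the practical appeal of audit analytics, provided the data can be extracted in the first place.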

For those who take comfort in the regulated nature of the profession as protection from disruption, we should take note of how the regulators are embracing big data analytics. Firstly, the SEC is using RoboCop to better target financial irregularities. Secondly, according to the Wall Street Journal, FINRA is eyeing an automated audit approach to monitoring risk. The program is known as the "Comprehensive Automated Risk Data System" (CARDS). As per FINRA:

"CARDS program will increase FINRA's ability to protect the investing public by utilizing automated analytics on brokerage data to identify problematic sales practice activity. FINRA plans to analyze CARDS data before examining firms on site, thereby identifying risks earlier and shifting work away from the on-site exam process". In the same post, Susan Axelrod, FINRA's Executive Vice President of Regulatory Operations, is quoted as saying "The information collected through CARDS will allow FINRA to run analytics that identify potential "red flags" of sales practice misconduct and help us identify potential business conduct problems with firms, branches and registered representatives".

As a result, I agree with Mr. Liddy: sticking to the status quo is no longer a viable strategy for the profession.

Tuesday, August 5, 2014

Had the Red Coats monitored Paul Revere's Facebook, would America be independent today?

The Globe and Mail reported that Canadian intelligence captures private data without a warrant in its fight against Chinese hackers. As one would expect, the article discusses how a calculation is performed to determine whether the harm of invading the privacy of Canadians is outweighed by the benefit of preserving national security.

The privacy debate ranges between two camps. One camp, such as EPIC, works to shed light on how organizations and governments encroach on individual privacy and sees encroachment as a threat to the individual's ability to express ideas and the like. The other camp is the likes of Jeff Jarvis, a professor at CUNY and self-admitted-Google-fanboy-extraordinaire, who often defends Google's encroachment on the lives of people by slamming people's fear of Google and forcing his opponents to quantify "what's the harm". He especially takes issue with the emotional response of people who feel that Google's knowledge of them is "creepy".

In a sense, I understand where Professor Jarvis is coming from: consumers want more customized services and they don't want to pay cash for them, so companies have to resort to advertising revenues to be paid. Google, Facebook, et al, are profit making companies and they want to be paid.

To me this is not the real cost in terms of privacy.

The real cost is how the government uses the data it gathers directly, or indirectly via Facebook (according to RT, the mood study FB was performing was part of a gov't contract to deal with "civil unrest"), Google, et al, to interact with the politically objectionable.

One way to look at the cost is being spied upon, deemed a threat to national security and then sent somewhere to be tortured. This is what happened to Maher Arar. He was allegedly fingered by 15-year-old Omar Khadr as a terrorist. Based on this information, the US sent him to Syria to be tortured. According to the Garvie Report, the RCMP gave sensitive information about Arar to the US government. Ultimately, Arar was exonerated and all charges were cleared. The Canadian government paid him $10.5 million plus legal fees and apologized to him. But how do you put a price on torturing an innocent man?

And to be sure, democratic governments do actively monitor the politically active within their countries. For example, this article in the New York Times goes on to describe in great detail how the government captured this information. Ultimately, Occupy was defeated through police actions resulting in 8,000 arrests as well as other means. If it hadn't been, how would the government have used this information to interact with the protesters on a go-forward basis?

From another perspective, the harm is also to political engagement. Although the Maher Arar case shows that the government can mishandle the data it gathers about people and put them in harm's way, this happens to a few people (e.g. Ahmad El Maati, Muayyed Nureddin and Abdullah Almalki) and is not a commonly used approach to dealing with protesters. For example, it's not like the Occupy protesters were rounded up in the thousands and sent to Syria.

But there is another cost. Such surveillance, and the potential for being harmed, has a chilling effect on those who want to speak out against the way things are run. Why protest when you will lose your job and can't pay the bills? Think about the American War of Independence. If the British had been able to spy on the "facebook" pages, email accounts and cell phones of Sam Adams, Paul Revere and pro-separatist sympathizers in the colonial militias, would the British have been able to arrest these separatists in a timely manner? Or would a colonial-era surveillance society have taught the Founding Fathers to self-censor and toe the pro-British line? It is pure speculation, but I think the Union Jack would still be flying in the land we now call America.

Wednesday, July 30, 2014

Will "data lakes" cure companies' info management problems?

Anyone working in the analytics space will know that one of the primary obstacles to performing analytics is the state of the underlying data. One may, naively, expect that organizations would see systems as a strategic asset that requires top attention. The reality is that the state of systems leaves much to be desired. This phenomenon was actually highlighted in Michael Lewis's latest book: Flash Boys. He notes in this interview that his investigation into High Frequency Trading (HFT) began with the arrest of Sergey Aleynikov.

Aleynikov was arrested for stealing code from Goldman Sachs related to its HFT platform and it was assumed that he would use this code at his new employer's trading department. However, as Lewis explains, the new firm used a totally different code base and the information he took was the equivalent of the notes one keeps in a notebook (Lewis explains that this was the judgement of Aleynikov's programming peers).

So if this is relatively useless information, why was Aleynikov kept in jail for a year?

The theory is that Goldman was afraid that the inadequacy of the HFT platform would be revealed to the world. Yes. Goldman. A bank deemed "too big to fail" - thereby having access to unlimited government funding and bailouts - can't spare the capital to invest in a state-of-the-art trading platform. This seems to be one of the reasons why Goldman ultimately backed the IEX platform that Brad Katsuyama set up to essentially eliminate the advantage that HFT firms have over everyone else.

Given that this is the reality of the technology, it should be no surprise that the promise of data analytics often gets killed by the inability to get to the data. And this is where data lakes come into the picture.

As noted in this Forbes article, "data lakes" differ from "data warehouses" in that there is no upfront cleansing, sorting and categorization of the data into a specific structure. Instead, the data is stored "in a massive, easily accessible repository based on the cheap storage that’s available today. Then, when there are questions that need answers, that is the time to organize and sift through the chunks of data that will provide those answers."

Will this then solve the world's data access and cleansing problems? Gartner does not think so.

In an analyst report released last week, Gartner effectively slammed this concept pretty hard. Gartner noted the following issues with the concept of data lakes:

  • No consensus around what "data lakes" means. As with any new techno-buzzword, each vendor will define data lakes to mean something different. With cloud computing, NIST quickly came on to the scene to define the concept and most commentators at least referred to this definition as a starting point before proposing their own.
  • A skills gap exists, preventing the common user from leveraging "data lakes". According to Gartner, data lakes "assumes that all are highly skilled at data manipulation and analysis, as data lakes lack semantic consistency and governed metadata". In other words, companies that save time and effort on the upfront preprocessing of the data need to equip their users with the necessary skills to leverage the technology.
  • Data lakes remove context, security and other metadata. Without context, users likely cannot use the data properly. Imagine the difficulty in assessing the dimensions of inventory when some of the subsidiaries use the metric system while others do not.

It appears that data lakes in their current iteration are a work in progress when it comes to solving the data accessibility and quality issues that analytics experts face. However, it appears that more robust front-end tools that address these issues could make data lakes a viable way to access the organization's disparate data.
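
To make the metric-versus-imperial point above concrete, here is a minimal sketch of what "organizing the data at question time" can look like when some records arrive without their unit metadata. The records, field names and subsidiary codes are all invented for illustration; nothing here comes from the Gartner report or the Forbes article.

```python
INCH_TO_CM = 2.54  # exact conversion factor

# Hypothetical raw records, stored as-is with no upfront cleansing.
raw_records = [
    {"subsidiary": "CA", "item": "pallet", "length": 120.0, "unit": "cm"},
    {"subsidiary": "US", "item": "pallet", "length": 48.0, "unit": "in"},
    {"subsidiary": "UK", "item": "pallet", "length": 100.0},  # unit metadata was lost
]


def length_in_cm(record):
    """Return the length in centimetres, or None when the unit is unknown."""
    unit = record.get("unit")
    if unit == "cm":
        return record["length"]
    if unit == "in":
        return record["length"] * INCH_TO_CM
    return None  # without context the number cannot be interpreted safely


# Structure is applied only now, when a question is asked of the data.
usable = {r["subsidiary"]: length_in_cm(r) for r in raw_records if length_in_cm(r) is not None}
unusable = [r["subsidiary"] for r in raw_records if length_in_cm(r) is None]

print("Comparable lengths (cm):", usable)
print("Records missing unit context:", unusable)
```

The third record is the problem Gartner is pointing at: the number is there, but without the metadata the analyst has to guess or go back to the source.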

Wednesday, July 16, 2014

Privacy to be cast aside to make Big Data a reality?

This is the fourth and final instalment of a multi-part exploration of the audit, assurance, compliance and related concepts brought up in the book, Big Data: A Revolution That Will Transform How We Live, Work, and Think (the book is also available as an audiobook and hey while I am at it, here's the link to the e-book). In the last two posts we explored the more tactical examples of how big data can assist auditors in executing audits resulting in a more efficient and effective audit. The book also examines the societal implications of big data. In this instalment, we explore the privacy implications of big data.

What are the privacy implications of Big Data?
In the past 3 instalments, we've explored the opportunities that big data affords the audit profession and society at large. In this article we look at the privacy implications raised by the book.

When we think of a totalitarian state we flash back to the regimes of World War II or the Soviet era. The book talks about how the East German Communist State invested vast amounts of resources in gathering data on its citizens in order to see who conformed with the state's ideology and who didn't. The book notes that the East German secret police (the Ministerium für Staatssicherheit or "Stasi") accumulated (amongst other things) 70 miles of documents. However, big data analytics now essentially enables corporations and governments to mine the digital exhaust people leave through social media, using their cell phones or logging into their email accounts, and essentially eliminate the privacy people have.

Some may point to anonymization as a potential solution to the problem. However, the authors highlight how New York Times reporters were able to comb through anonymized data published by AOL to positively establish the identity of the users. This highlights that the powerful tools that have emerged from big data alter the privacy landscape. Consequently, privacy controls need to be rethought from this perspective.

The authors, however, raise a much more interesting point when discussing privacy in the era of big data. They highlight the conflict between privacy and profiting from big data. They note how the value of big data emerges from the secondary uses of big data. However, privacy policies require the user to consent to specific uses of data at the time they sign up for the service. This means future big data analytics are essentially limited by the uses the user agreed to upon sign-up. However, corporations in their drive to maximize profits will ultimately make privacy policies so loose (i.e. to cover secondary uses) that the user essentially has to give up all their privacy in order to use the service. What the authors propose is an accountability framework. Similar to how stock-issuing companies are accountable to the securities regulators, the idea is that organizations would be accountable to a privacy body of sorts that reviews the use of the big data and ensures that companies are accountable for the negative consequences of the data.

For those of us that have been involved in privacy compliance, such an approach would make it real for companies to deal with privacy issues in a proactive manner. We saw how companies' attitudes towards controls over financial reporting shifted from mild interest (or indifference) to active concern with the passage of Sarbanes-Oxley. In contrast, no similar fervour can be found in the business landscape when it comes to addressing privacy issues. Although the solution is not obvious, the reality is that companies will make their privacy notices meaningless in order to reap the ROI from investments made in big data.

Monday, June 16, 2014

Auditing the Algorithm: Is it time for AlgoTrust?

This is the third instalment of a multi-part exploration of the audit, assurance, compliance and related concepts brought up in the book, Big Data: A Revolution That Will Transform How We Live, Work, and Think (the book is also available as an audiobook and hey while I am at it, here's the link to the e-book). In the last two posts we explored the more tactical examples of how big data can assist auditors in executing audits resulting in a more efficient and effective audit. The book, however, also examines the societal implications of big data. In this instalment, we explore the role of the algorithmist.

Why do we need to audit the "secret sauce"?
When it comes to big data analytics, the decisions and conclusions the analyst will make hinge greatly on the underlying algorithm. Consequently, as big data analytics become more and more part of the drivers of actions in companies and societal institutions (e.g. schools, government, non-profit organizations, etc.), society becomes more dependent on the "secret sauce" that powers these analytics. The term "secret sauce" is quite apt because it highlights the underlying technical opaqueness that is commonplace with such things: the common person likely will not be able to understand how the big data analytic arrived at a specific conclusion. We discussed this in our previous post as the challenge of explainability, but the nuance here is how you explain algorithms to external parties, such as customers, suppliers, and others.

To be sure, this is not the only book that points to the importance of the role of algorithms in society. Another example is "Automate This: How Algorithms Came to Rule Our World" by Chris Steiner, which (as you can see by the title) explains how algorithms are currently dominating our society. The book brings up common examples such as the "flash crash" and the role that "algos" are playing on Wall Street in the banking sector, as well as how NASA used these algorithms to assess personality types for its flight missions. It also goes into the arts. For example, it discusses how there's an algorithm that can predict the next hit song and hit screenplay as well as how algorithms can generate classical music that impresses aficionados - until they find out it is an algorithm that generated it! The author, Chris Steiner, discusses this trend in the following TEDx talk:

So what Mayer-Schönberger and Cukier suggest is the need for a new profession which they term "algorithmists". According to them:

"These new professionals would be experts in the areas of computer science, mathematics, and statistics; they would act as reviewers of big-data analyses and predictions. Algorithmists would take a vow of impartiality and confidentiality, much as accountants and certain other professionals do now. They would evaluate the selection of data sources, the choice of analytical and predictive tools, including algorithms and models, and the interpretation of results. In the event of a dispute, they would have access to the algorithms, statistical approaches, and datasets that produced a given decision."

They also extrapolate this thinking to an "external algorithmist", who would "act as impartial auditors to review the accuracy or validity of big-data predictions whenever the government required it, such as under court order or regulation. They also can take on big-data companies as clients, performing audits for firms that wanted expert support. And they may certify the soundness of big-data applications like anti-fraud techniques or stock-trading systems. Finally, external algorithmists are prepared to consult with government agencies on how best to use big data in the public sector.

As in medicine, law, and other occupations, we envision that this new profession regulates itself with a code of conduct. The algorithmists’ impartiality, confidentiality, competence, and professionalism is enforced by tough liability rules; if they failed to adhere to these standards, they’d be open to lawsuits. They can also be called on to serve as expert witnesses in trials, or to act as “court masters”, which are experts appointed by judges to assist them in technical matters on particularly complex cases.

Moreover, people who believe they’ve been harmed by big-data predictions—a patient rejected for surgery, an inmate denied parole, a loan applicant denied a mortgage—can look to algorithmists much as they already look to lawyers for help in understanding and appealing those decisions."

They also envision that such professionals would work internally within companies, much the way internal auditors do today.

WebTrust for Certification Authorities: A model for AlgoTrust?
The authors bring up a good point: how would you go about auditing an algo? Although auditors lack the technical skills of algorithmists, that doesn't prevent them from auditing algorithms. The WebTrust for Certification Authorities (WebTrust for CAs) could be a model where assurance practitioners develop a standard in conjunction with algorithmists and enable audits to be performed against the standard. Why is WebTrust for CAs a model? WebTrust for CAs is a technical standard where an audit firm would "assess the adequacy and effectiveness of the controls employed by Certification Authorities (CAs)". That is, although the cryptographic key generation process is something that goes beyond the technical discipline of a regular CPA, it did not prevent the assurance firms from issuing an opinion.

So is it time for CPA Canada and the AICPA to put together a draft of "AlgoTrust"?


Although the commercial viability of such a service would be hard to predict, it would at least help start the discussion around how society can achieve the outcomes Mayer-Schönberger and Cukier describe above. Furthermore, some of the groundwork for such a service is already established. Fundamentally, an algorithm takes data inputs, processes them and then delivers a certain output or decision. Therefore, one aspect of such a service is to understand whether the algo has "processing integrity" (i.e. as the authors put it, to attest to the "accuracy or validity of big-data predictions"), which is something the profession established a while back through its SysTrust offering. To be sure, this framework would have to be adapted. For example, algos are used to make decisions, so there needs to be some thinking around how we would identify materiality in terms of the total number of "wrong" decisions as well as defining "wrong" in an objective and auditable manner.
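
As a thought experiment only, the materiality question above could be framed as comparing the rate of "wrong" decisions in a sample against a tolerable rate. The sketch below assumes that "wrong" means the algo's decision disagrees with an independent re-performance of the same case, and that the tolerable rate is simply given; both are assumptions made for illustration, not anything defined by SysTrust, WebTrust or the book. It also ignores sampling risk, which a real engagement could not.

```python
from dataclasses import dataclass


@dataclass
class SampledDecision:
    case_id: str
    algo_decision: str          # what the algorithm decided
    reperformed_decision: str   # what independent re-performance concluded


def error_rate(sample):
    """Fraction of sampled decisions where the algo disagrees with re-performance."""
    if not sample:
        raise ValueError("sample must not be empty")
    wrong = sum(1 for d in sample if d.algo_decision != d.reperformed_decision)
    return wrong / len(sample)


def exceeds_tolerable_rate(sample, tolerable_rate):
    """True when the observed error rate is above the (assumed) tolerable rate."""
    return error_rate(sample) > tolerable_rate


# Hypothetical sample of four loan decisions, one of which disagrees.
sample = [
    SampledDecision("A-1", "approve", "approve"),
    SampledDecision("A-2", "deny", "deny"),
    SampledDecision("A-3", "deny", "approve"),
    SampledDecision("A-4", "approve", "approve"),
]

print("Observed error rate:", error_rate(sample))
print("Exceeds 5% tolerable rate:", exceeds_tolerable_rate(sample, 0.05))
```

The hard part, as noted above, is not the arithmetic but agreeing on what counts as "wrong" and on who gets to re-perform the decision.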

AlgoTrust, as a concept, illustrates not only how auditors can move their assurance skill set into an emerging area but also how the profession can add thought leadership around the issue of dealing with the opaqueness of algorithms - just as it did with financial statements nearly a century ago.

Thursday, June 5, 2014

Big Data Audit Analytics: Dirty data, explainability and data driven decision making

This is the second instalment of a multi-part exploration of the audit, assurance, compliance and related concepts brought up in the book, Big Data: A Revolution That Will Transform How We Live, Work, and Think (the book is also available as an audiobook and hey while I am at it, here's the link to the e-book). In this instalment, I explore another example of Big Data Audit analytics noted in the book and highlight the lessons learned from it.

Con Edison and Exploding Manhole Covers
The book discusses the case of Con Edison (the public utility that provides electricity to New York City) and its efforts to better predict which of its manhole covers will experience "technical difficulties", ranging from the relatively benign (e.g. smoking, heating up, etc.) to the potentially deadly (where a 300-pound manhole cover can explode into the air and potentially harm someone). Given the potential implications for life and limb, Con Edison needed a better audit approach, if you will, than random guessing as to which manhole covers would need maintenance to prevent such problems from occurring.

And this is where Cynthia Rudin, currently an associate professor of statistics at MIT, comes into the picture. She and her team of statisticians at Columbia University worked with Con Edison to devise a model that would predict where the maintenance dollars should be focused.

The team developed a model with 106 data predictors (the biggest factors being the age of the manhole covers and whether there were previous incidents) that ranked manhole covers from those most likely to have issues to those least likely. How accurate was it? As noted in the book, the top 10% of those ranked most likely to have incidents ended up accounting for 44% of the manhole covers with potentially deadly incidents. In other words, Con Edison through big data analytics was able to better "audit" the population of manhole covers for potential safety issues. The following video goes into some detail on what the team did:

What lessons can be drawn from this use of Big Data Analytics?
Firstly, algorithms can overcome dirty data. When Professor Rudin was putting together the data to analyse, it included data from the early days of Con Edison, i.e. as in the 1880s when Thomas Edison was alive! To illustrate, the book notes that there were 38 different ways to enter the term "service box" into service records. This is on top of the fact that some of these records were handwritten and were documented by people who didn't have a concept of a computer, let alone big data analytics.
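
To give a feel for the "service box" problem, here is a minimal sketch of collapsing spelling variants into one canonical label. The variants listed are invented for illustration (the book gives the count of 38, and this is not a reproduction of that list), and the lookup-table approach is an assumption of mine, not a description of what Professor Rudin's team actually did.

```python
import re

# Hypothetical variants of the same term as they might appear in old records.
RAW_ENTRIES = [
    "SB", "S/B", "S.B.", "S BOX", "SERV BOX",
    "Service Box", "serv. box", "SERVICE-BOX",
]

# Hand-built lookup from a stripped-down key to the canonical label (assumed).
CANONICAL = {
    "sb": "service box",
    "sbox": "service box",
    "servbox": "service box",
    "servicebox": "service box",
}


def normalize(entry):
    """Lower-case the entry, drop everything that is not a letter, then look it up."""
    key = re.sub(r"[^a-z]", "", entry.lower())
    # Anything not in the table is flagged rather than silently guessed.
    return CANONICAL.get(key, "UNMAPPED: " + entry)


cleaned = [normalize(e) for e in RAW_ENTRIES]
for raw, clean in zip(RAW_ENTRIES, cleaned):
    print(f"{raw!r:15} -> {clean}")
```

Flagging unmapped entries instead of guessing keeps the cleansing step itself auditable, which matters if the cleaned data later drives a decision.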

Second, although the biggest factors seem obvious in hindsight, we should be wary of such conclusions. The point is that data driven decision making is more defensible than a "gut feel", which speaks directly to the professional judgement versus statistical approach of performing audit procedures. The authors further point out that there were at least 104 other variables that were contenders and their relative importance cannot be known without performing such a rigorous analysis. The point here is that for organizations to succeed and take analytics to the next level, they need to embrace culturally the concept that, where feasible, organizations should invest in the necessary leg work to obtain conclusions based on solid analysis.
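
The 44%-in-the-top-10% figure quoted earlier is the kind of evidence that makes a data driven ranking defensible. A minimal sketch of how such a "capture rate" can be computed follows; the scores and incident flags are made up, and the function name is mine rather than anything from the book.

```python
def capture_rate(scores, had_incident, top_fraction=0.10):
    """Share of all incidents that fall within the top-scored fraction of items."""
    # Rank items from highest to lowest predicted risk.
    ranked = sorted(zip(scores, had_incident), key=lambda pair: pair[0], reverse=True)
    cutoff = max(1, int(len(ranked) * top_fraction))
    captured = sum(flag for _, flag in ranked[:cutoff])
    total = sum(had_incident)
    return captured / total if total else 0.0


# Ten hypothetical manhole covers: a model score and whether an incident occurred.
scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.6, 0.4, 0.8, 0.05]
had_incident = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]

# Inspect the top 20% (two covers) and see how many of the incidents they contain.
print(capture_rate(scores, had_incident, top_fraction=0.20))
```

Random selection would be expected to capture roughly the same share of incidents as the share of covers inspected, so the further the capture rate sits above the inspected fraction, the stronger the case for the model over a gut feel.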

Third, the authors highlight the importance of "explainability". They attribute the term to the world of artificial intelligence, where it refers to the ability of the human user to drill deeper into the analysis generated by the model and explain to operational and executive management why a specific manhole needs to be investigated. In contrast, the authors point out that models that are complex due to the inclusion of numerous variables are difficult to explain. This is a critical point for auditors. As auditors must be able to defend why a particular transaction was chosen over another for audit, big data audit analytics needs to incorporate this concept of explainability.

Finally, it is but another example of how financial audits can benefit from such techniques, given the way non-financial "audits" are using big data techniques to audit and assess information. So internal and external auditors can highlight this (along with the two examples identified in the previous post) as part of their big data audit analytics business case.