Showing posts with label dirty data. Show all posts

Thursday, September 10, 2015

BNY Mellon Software Glitch: Cost of IT Control Failure

In the previous post on BNY Mellon's technology woes, we explored what the company did right, as well as the overall need for independent evaluation of the technology that runs the Information Age. In this post, we explore the costs and consequences of the glitch.

One of the challenges in putting controls around information integrity is that it is a hard sell: what, really, is the value of accurate information? Information security is also a hard sell, but an easier one. The reason? When an information security breach occurs, it is usually to access something of value that can be monetized. The Ponemon Institute puts this cost at approximately $174 per record.

Consequently, it is easier for someone to go to the CEO/CFO and explain how tightening controls around information security will protect the company's bottom line. Furthermore, information security breaches have entered the mass consciousness of the business community: SunGard was quick to reassure everyone that the issue affecting BNY Mellon's accounting software was NOT attributable to "any external or unauthorised systems access".

When making the business case for controls over information, it can be challenging to show how the control will lead to savings in terms of "decision failure", i.e. the cost of making the wrong decision due to unreliable information. Let's face it: most companies are willing to take big risks with their information by continuing to rely on spreadsheets, which have an error rate of 88%. Furthermore, as highlighted by this Protiviti study, internal auditors understand the information integrity challenges but are not getting the funding to tackle them.

So the incident at BNY Mellon is a rare occurrence where mis-priced information actually led to visible costs. As noted in the Wall Street Journal:

"A software glitch this week at fund administrator Bank of New York Mellon Corp. caused difficulties in pricing many mutual funds and exchange-traded funds, prompting some fund sponsors to publish lists of funds whose stated asset values were erroneous.

What can you do if one of your funds is on the list, meaning you may have overpaid for shares?

Reach out to your fund company and ask for a refund. They don’t have to give you one but firms may do so because of their often long-term relationships—ones they want to keep—with investors, analysts said."

The other costs include:

Of course, we won't know the full cost until the regulatory probe finishes and the regulators publish their findings, or until the cost proves material and shows up in the financial statements. Regardless, organizations should be proactive in ensuring that sufficient technology controls are in place and that these types of risk are controlled.









Monday, December 29, 2014

Low Decision Agility: BigData's Insurmountable Challenge?

Having worked in the field of data analytics for over a decade, I keep running into one recurring theme that never seems to go away: the overall struggle organizations have with getting their data in order.

Although this is normally framed in terms of data quality and data management, it's important to link this back to the ultimate raison d'être for data and information: organizational decision making. When an organization has significant data and information management challenges, they culminate in a lack of "decision agility" for executive or operational management. I define decision agility as follows:

"Decision agility is the ability of an entity to provide relevant information 
to a decision maker in a timely manner." 

Prior to getting into the field, you would think that, with all the hype of the Information Age, it would be as easy as pressing a button for a company to get you the data you need to perform your analysis. Once in the field, however, you soon realize how wrong this thinking is: most organizations have low decision agility.

I think it is fair to say that this problem hits those involved in external (financial) audits the hardest. As we have tight budgets, low decision agility at the clients we audit makes it cost-prohibitive to perform what is now known as audit analytics (previously known as CAATs). Our work is often reined in by the (non-IT) auditors running the audit engagement because it is "cheaper" to do the same test manually rather than parse our way through the client's data challenges.

So what does this have to do with Big Data Analytics?

As I noted in my last post, there is the issue of veracity - the final V in the 4 Vs definition of Big Data. However, veracity is part of the larger problem of low decision agility found at many organizations. Low decision agility emerges from a number of factors and has implications for a big data analytics initiative at an organization. These factors and implications include:

  • Wrong data: Fortune, in this article, notes the obvious issue of "obsolete, inaccurate, and missing" data records themselves. Consequently, the big data analytics initiative needs to assess the veracity of the underlying data to understand how much work must be done to clean it up before meaningful insights can be drawn.
  • Disconnect between business and IT: The business has one view of the data, while IT sees it differently. So when you try to run a "simple" test, it takes a significant amount of time to reconcile the business's view of the data model with IT's view. To account for this problem, there needs to be an ongoing effort to sync the two views, so that the big data analytic relies on data that matches the ultimate decision maker's view of the world.
  • Spreadsheet mania: Let's face it: organizations treat IT as an expense, not an investment. Consequently, they rely on spreadsheets to do some of the heavy lifting for information processing, because it is the path of least resistance. The overuse of spreadsheets can be a sign of an IT system that fails to meet the needs of its users. Regardless of why they are used, the underlying problem is dealing with this vast array of business-managed applications, which are often fraught with errors and sit outside the controls of the production system. The control and related data issues become obvious during compliance efforts, such as SOX 404, or major transitions to new financial/data standards, such as the move to IFRS. When developing big data analytics, how do you account for the information trapped in these myriad little apps outside of IT's purview?
  • Silo thinking: I remember the frustration of dealing with companies that lacked a centralized function with a holistic view of the data. Each department would know its portion of the processing rules but would have no idea what happened upstream or downstream. Consequently, an organization needs to create a data governance structure that understands the big picture and can identify and address potential gaps in the data set before it is fed into the Hadoop cluster.
  • Heterogeneous systems: Organizations with a patchwork of systems require extra effort to get the data formatted and synchronized. InfoSec specialists deal with this issue of normalization when it comes to security log analysis: security logs extracted from different systems need to have their event IDs, codes, etc. "translated" into a common language to enable a proper analysis of events occurring across the enterprise. The point is that big data analytics must perform a similar "translation" to enable analysis of data pulled from different systems. Josh Sullivan of Booz Allen states that "...training your models can take weeks and weeks" to recognize when content fed into the system represents the same value. For example, it will take a while for the system to learn that "female" and "woman" are the same thing when looking at gender data.
  • Legacy systems: Organizations may have legacy systems that do not retain data, are hard to extract from, and are difficult to import into other tools. The time and money required to get this data into a usable format also needs to be factored into the big data analytics initiative.
  • Business rules and semantics: Beyond the heterogeneity of systems, there can also be challenges in how something is commonly defined. A simple example is currency: in an ERP that spans multiple countries, amounts may be reported in the local currency or in dollars, and metadata is required to give them meaning. Another issue is that different user groups define the same term differently. For example, a "sale" for the sales/marketing folks may not mean the same thing as a sale for the finance/accounting group (e.g. the sales & marketing people may not account for doubtful accounts or incentives that need to be factored in for accounting purposes).
This is not an exhaustive list, but it gives an idea of how the promise of analytics is obscured by the tough reality of the current state of data.
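Several of the factors above (heterogeneous systems, business rules and semantics) come down to the same mechanical task: mapping each source system's raw values onto one canonical vocabulary before analysis. Here is a minimal sketch of that kind of value normalization, using the gender example mentioned earlier; the mapping table, field names, and source systems are illustrative, not taken from any actual implementation:

```python
# Minimal sketch: normalize heterogeneous field values to a canonical
# vocabulary before analysis. The mapping table below is hypothetical.
CANONICAL_GENDER = {
    "f": "female", "female": "female", "woman": "female",
    "m": "male", "male": "male", "man": "male",
}

def normalize_gender(raw):
    """Map a raw gender value from any source system to a canonical label."""
    key = raw.strip().lower()
    return CANONICAL_GENDER.get(key, "unknown")

# Records as they might arrive from three different (hypothetical) systems.
records = [
    {"source": "hr_system", "gender": "Female"},
    {"source": "crm",       "gender": "woman"},
    {"source": "legacy",    "gender": "M"},
]
cleaned = [normalize_gender(r["gender"]) for r in records]
print(cleaned)  # ['female', 'female', 'male']
```

In practice the mapping table itself is the hard part: it has to be agreed between the business and IT, and maintained as an ongoing data governance artifact rather than a one-off script.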

In terms of the current state of data quality, a recent blog post by Michele Goetz of Forrester noted that 70% of the executive-level business professionals they interviewed spent more than 40% of their time vetting and validating data. (Forrester notes the following caveat about the data: "The number is too low to be quantitative, but it does give directional insight.")

Until organizations get to a state of high decision agility - where business users spend virtually no time vetting/validating the data - organizations may not be able to reap the full benefits of a big data analytics initiative. 



Thursday, June 5, 2014

Big Data Audit Analytics: Dirty data, explainability and data driven decision making

This is the second instalment of a multi-part exploration of the audit, assurance, compliance and related concepts brought up in the book, Big Data: A Revolution That Will Transform How We Live, Work, and Think (the book is also available as an audiobook and, while I am at it, here's the link to the e-book). In this instalment, I explore another example of big data audit analytics noted in the book and highlight the lessons learned from it.

Con Edison and Exploding Manhole Covers
The book discusses the case of Con Edison (the public utility that provides electricity to New York City) and its efforts to better predict which of its manhole covers will experience "technical difficulties", from the relatively benign (e.g. smoking, heating up) to the potentially deadly (where a 300-pound manhole cover can explode into the air and potentially harm someone). Given the potential implications for life and limb, Con Edison needed a better audit approach, if you will, than randomly guessing which manhole covers would need maintenance to prevent such problems from occurring.

And this is where Cynthia Rudin, currently an associate professor of statistics at MIT, comes into the picture. She and her team of statisticians at Columbia University worked with Con Edison to devise a model that would predict where the maintenance dollars should be focused.

The team developed a model with 106 data predictors (the biggest factors being the age of the manhole covers and whether there had been previous incidents) that ranked manhole covers from most likely to have issues to least likely. How accurate was it? As noted in the book, the top 10% ranked most likely to have incidents ended up accounting for 44% of the manhole covers with potentially deadly incidents. In other words, through big data analytics, Con Edison was able to better "audit" the population of manhole covers for potential safety issues. The following video goes into some detail on what the team did:
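To make the "top 10% captured 44% of incidents" style of claim concrete, here is a small sketch of how one might evaluate that kind of risk-ranking model: sort items by predicted risk and measure what share of actual incidents falls in the top decile. The scores and incident labels below are made up for illustration and have nothing to do with Con Edison's actual data:

```python
# Illustrative sketch: top-decile capture rate for a risk-ranking model.
def top_decile_capture(scores, incidents):
    """Fraction of all incidents captured by the top 10% ranked by score."""
    ranked = sorted(zip(scores, incidents), key=lambda p: p[0], reverse=True)
    cutoff = max(1, len(ranked) // 10)   # size of the top decile
    top = ranked[:cutoff]
    total = sum(incidents)
    return sum(label for _, label in top) / total if total else 0.0

# Hypothetical model scores and actual incident flags for ten manholes.
scores    = [0.9, 0.85, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
incidents = [1,   1,    0,   0,   1,   0,   0,   0,   0,   0]
print(top_decile_capture(scores, incidents))  # 1 of 3 incidents in the top 10%
```

A perfectly uninformative model would capture roughly 10% of incidents in its top decile, so anything well above that (such as 44%) represents real lift over random selection.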

What lessons can be drawn from this use of Big Data Analytics?
Firstly, algorithms can overcome dirty data. When Professor Rudin was putting together the data to analyse, it included data from the early days of Con Edison, i.e. the 1880s, when Thomas Edison was alive! To illustrate, the book notes that there were 38 different ways to enter the term "service box" into service records. This is on top of the fact that some of these records were handwritten, documented by people who had no concept of a computer, let alone big data analytics.

Second, although the biggest factors seem obvious in hindsight, we should be wary of such conclusions. Data-driven decision making is more defensible than "gut feel", which speaks directly to the professional judgement versus statistical approaches to performing audit procedures. The authors further point out that there were at least 104 other variables that were contenders, and their relative importance could not be known without performing such a rigorous analysis. The point here is that organizations that want to take analytics to the next level need to culturally embrace the concept that, where feasible, they should invest in the necessary legwork to obtain conclusions based on solid analysis.

Third, the authors highlight the importance of "explainability". They attribute the term to the world of artificial intelligence; it refers to the ability of the human user to drill deeper into the analysis generated by the model and explain to operational and executive management why a specific manhole needs to be investigated. In contrast, the authors point out that models made complex by the inclusion of numerous variables are difficult to explain. This is a critical point for auditors: since auditors must be able to defend why a particular transaction was chosen over another for audit, big data audit analytics needs to incorporate this concept of explainability.

Finally, it is but another example of how financial audits can benefit from such techniques, given the way non-financial "audits" are using big data techniques to audit and assess information. Internal and external auditors can highlight this (along with the two examples identified in the previous post) as part of their big data audit analytics business case.