Showing posts with label disaster recovery. Show all posts

Thursday, August 4, 2016

FlightDelays & Contingency Planning in Real Life


Photo Credit: Trey Ratcliff

Last week, I headed to a day-long conference in New York on Thursday and was expecting to return home to Toronto on the 7:20 flight.

However, things didn't go as planned: LaGuardia (LGA) had cancelled a number of flights due to weather delays.

I decided to drive back instead, booking a one-way car rental.

Originally, I went to Hertz, but they refused to rent to me because I was heading back to Toronto, Canada. If you can believe it, they advised me to rent a car to Buffalo and then take the bus back to Toronto. Yeah right!!!

Thank God, Avis did not give me such ridiculous advice and instead rented me a car to make my way back home. I ended up leaving Avis around 8:45 and made it home around 4:30 AM. One big advantage of travelling at that time of night is that there is no traffic :)

Thinking about this after the fact, I realized it was a good lesson in "real life" contingency planning, so here's what I think I did right, what I could have done better, and some other thoughts.

What I did right:
  • Called the travel agent instead of waiting in line to talk to the airline: I had already cleared customs and was lining up at the Air Canada desk inside LGA when I realized that my flight was cancelled. However, I decided to call the travel agent (while in line) to see what the situation was at other airports (JFK, Newark) and what my options were. That's where I learned that I would be flying out at 11:30 am on Friday (i.e. the next day).
  • Avoided flying out on Friday: I didn't realize this at the time, but my chiropractor told me that after a major flight cancellation the airport deals with at least twice the volume the next day - especially since it was Friday and everyone would want to get home for the weekend. Consequently, how much rest would I get if I had to be at the airport 3 to 4 hours earlier the next day to make sure I got on the plane? My fear at the time was that either the weather delays would continue or something else would force me onto a later flight.
  • Would any hotels be available? Given that so many people had their flights cancelled, the hotels would likely be booked. Also, if I had to book outside the airport, I would have to battle morning traffic on the way back in. So it didn't seem like an appealing option.
  • What's crazy to most may be open to you: The 8-hour drive back did seem daunting. However, most people wouldn't do such a crazy thing, thereby making it a viable option - since everyone else would be trying to get on a plane, there would be plenty of rental cars available. Or at least that's what I expected, and it turned out to be right. Also, when I spoke to the travel agent, she told me that someone else from Deloitte was looking to carpool back to Toronto. Unfortunately, I just missed him. However, realizing someone else was doing it made it seem less crazy. And truth be told, those cars were getting booked fast by the time I got to the car rental companies - many people were driving to Boston, Pittsburgh, etc.
What I could have done better:
  • Monitoring for weather: When my flight got delayed on the way in on Wednesday, that should have been a clue that there could be problems the next day. In the future, I should keep track of weather conditions and be mindful of potential disruptions.
  • Monitoring for cancellations: Although I had checked in via the mobile app, I had my iPhone in low power mode, which prevented me from being alerted right away. The reason I was in low power mode is that the conference organizers didn't have outlets at the tables, and I wanted to make sure I had battery power to call/email/etc. at the airport. Next time, I should sit near an outlet or carry a portable power source to make sure I can charge my phone at the airport or on the plane.
  • Book a car sooner: If I had learned about the cancellation sooner, I could have made alternative arrangements sooner. At the very least, I could have booked the car and picked it up closer to where I was, instead of wasting that time travelling to the airport.
  • Notice airport irregularities: There were more people than usual queuing up at the Air Canada counter outside the security area, but I just dismissed this as volume. The lower volume inside the security area should have been my second clue that something was awry.
  • Check the rental for damage: I was so focused on getting on my way that I didn't check. As it turns out, the car had significant damage on the front. Fortunately, the guy letting me out noticed it and wrote it on the form. It's hard, but in an emergency situation it is important to keep a cool head and not make such errors.
Otherwise: One thing that stuck in my mind is missing the fellow Deloitte colleague on the drive back to Toronto. Was there a better way of organizing ourselves so that, if something like this were to happen again, we could carpool? How could we trust each other if we didn't work at the same company? I think that setting up an app and getting subscribers to sign up ahead of time wouldn't be feasible, because most people don't think about getting stranded at the airport - let alone about finding a way to trust each other using user reviews.

Contingency plans: test, test, test.

My biggest takeaway from this experience is that you can't know how good a contingency plan is until you actually do a real, live test.

And unfortunately, most companies don't test their plans.

As noted in this Deloitte Business Continuity survey, managers were categorized as "aware" (i.e. those who know there's a problem) and "committed" (i.e. those who are willing to take action to resolve it). Of the committed group, only about 50% had tested their plans; of the aware group, only 17% had.

With real estate it's location, location, location, but with business continuity plans it's test, test, test. As noted above, I found a number of gaps in my contingency plan that I never would have known about until I experienced this real-life emergency.

Author: Malik Datardina, CPA, CA, CISA. Malik works at Auvenir as a GRC Strategist that is working to transform the engagement experience for accounting firms and their clients. The opinions expressed here do not necessarily represent UWCISA, UW, Auvenir (or its affiliates), CPA Canada or anyone else.

Wednesday, June 17, 2015

Can Inadequate Disaster Recovery Planning be worse than locusts?

Why are US farmers facing a disaster?

Is it due to locusts? No.

It's due to inadequate IT disaster recovery planning.

As reported in the Wall Street Journal, the US State Department is unable to issue visas to temporary workers due to a system failure. Specifically:

"The system that helps perform necessary security checks has suffered hardware failure," said Niles Cole, a State Department spokesman. "Until it is repaired, no visas can be issued." He said technicians are working around the clock to resolve the issue but couldn't offer a timeline for when the system would be back in action.

Specifically, a central database isn't receiving biometric information from U.S. consulates world-wide, he said. Biometric data, including fingerprints, are used for security screening of applicants.

And the losses are mounting daily. Over 200 workers are sitting at the US-Mexico border waiting to be processed by the system so they can enter the US and help harvest the crops. The article reported that farmers are losing between $500,000 and $1,000,000 per day because their fruit is spoiling.

Reading this article, I had the following questions:

Why isn't there a hot site?
Given the importance of the technology, why don't they have the ability to swap in new hardware instantaneously?

Was the security information backed up, and why is there no manual workaround?
If it's digital information, why isn't there a manual workaround to transmit it and circumvent the faulty hardware? The data could be manually uploaded to the central database.

Was a proper risk assessment done? When a disaster recovery plan (DRP) is created for a system, the organization must determine the Recovery Time Objective (RTO), which defines how quickly the system will be restored after a failure. Google, for example, reportedly aims for an RTO of zero. Setting the RTO requires an assessment of the impact of such a failure. In this case, when setting the RTO, did the risk management professionals account for the fact that this system was critical in supporting the H-2A visa program for temporary farm workers? It should be noted that US farmers had paid into this program and are now suffering losses of over $500,000 per day. The outage will also reduce the number of tourist visas issued, potentially resulting in lost tourist dollars to the US.
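To make the RTO trade-off concrete, here is a minimal sketch of the underlying business impact math. This is not from the article: the recovery options, their annual costs, and the assumption of one outage per year are hypothetical, and the daily loss figure is simply the midpoint of the article's $500,000-$1,000,000 per day estimate.

```python
# Hypothetical business impact sketch: weigh the annual cost of each
# recovery strategy against the downtime losses its RTO implies.

DAILY_LOSS = 750_000  # midpoint of the article's $0.5M-$1M/day estimate

# Hypothetical recovery options: annual cost vs. recovery time in days.
options = {
    "cold site (restore from backups)": {"annual_cost": 100_000, "rto_days": 7},
    "warm standby hardware":            {"annual_cost": 400_000, "rto_days": 1},
    "hot site (instant failover)":      {"annual_cost": 1_500_000, "rto_days": 0},
}

def total_cost(option, outages_per_year=1):
    """Annual cost of the option plus expected losses during downtime."""
    return option["annual_cost"] + option["rto_days"] * DAILY_LOSS * outages_per_year

for name, opt in options.items():
    print(f"{name}: ${total_cost(opt):,} per year")

cheapest = min(options, key=lambda name: total_cost(options[name]))
print("Cheapest overall:", cheapest)
```

The point of the exercise is that the "cheap" option is only cheap if you have assessed what a day of downtime actually costs the business processes the system supports - which is exactly the assessment that seems to have been missed here.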

The lesson we can learn from this is to ensure that we understand what business processes a system supports and understand the impact to those business processes should the system go down.

Monday, December 3, 2012

The other DDoS: Denial of Service by DMCA

In information security, the common definition of DDoS is Distributed Denial of Service attack. However, there is a legally sanctioned form of DDoS: DMCA Denial of Service, where a user acting in good faith is 'denied service' because of an alleged infringement of the DMCA. The DMCA (i.e. the Digital Millennium Copyright Act) provides a means to enforce copyright protections online and was ultimately responsible for killing Napster (which enabled peer-to-peer sharing of music and other files). Although the Napster case was cut and dried to some (like the recording industry), there are cases where users are actually acting in good faith but are taken down through enforcement of the Act.

The case that illustrates this issue is the takedown of 1.45 million education blogs in October. James Farmer, CEO of Edublogs, noted that "ServerBeach, to whom we pay $6,954.37 every month to host Edublogs, turned off our webservers, without notice, less than 12 hours after issuing us with a DMCA email." He went on to explain what the actual infringement was: "one of our teachers, in 2007, had shared a copy of Beck's Hopelessness Scale with his class, a 20 question list, totalling some 279 words, published in 1974, that Pearson would like you to pay $120 for." Reading the blog further, it turns out that Edublogs did actually comply with the DMCA request it received. However, the issues Pearson had were that (a) the file was still accessible via Google's cache and (b) it was accessible via Edublogs' Varnish cache. In other words, James Farmer got legally DDoSed: 1.45 million blogs were made unavailable due to ServerBeach's rush to comply with the DMCA instead of "calling any of the 3 numbers for us [ServerBeach] have on file".

Edublogs, however, is not the only company to be DDoSed in this manner. Small companies that publish news reports on YouTube or other content-sharing sites also face this danger. Take, for example, Leo Laporte's This Week in Tech (TWIT) new media network, which publishes tech-related podcasts and videocasts. The network's business model depends on making each video available soon after it airs. Failure to do so means losing ad revenue because the "eyeballs never made it" to the particular show. Consequently, when one of their episodes gets pulled down by Google's robots, or at the request of the copyright holder (as noted here), it jeopardizes the TWIT business model, making Laporte another DDoS victim.

From a risk perspective, such an event should be evaluated, especially by businesses that rely on revenue from distributing online content. Specifically, agreements with the third parties that host their content should include provisions that let them at least demonstrate compliance prior to being taken down. However, both James Farmer and Leo Laporte have attempted to work with their respective providers to prevent this type of risk: Farmer complied with the request, while Laporte has attempted to contact Google and explain that he runs a news organization. So this is easier said than done. Laporte also hosts the videos on his own servers, but the popularity of YouTube limits the effectiveness of this "backup strategy" (i.e. users won't go to his site to watch the video instead of YouTube). In the end, it may just be an unavoidable cost of relying on such providers.

From a longer-term perspective, it illustrates the clash between legacy laws and the capability of the Internet to "network knowledge". This concept is taken from David Weinberger's "Too Big To Know", which describes how the ability to share, link, and debate information on the Internet transforms knowledge into a more fluid state, in contrast to the static nature of books. He explains this concept in the following video:

James Farmer implicitly argued this point in his rant against Pearson when he said: "Here’s another idea Pearson, maybe one that you could take from Edublogs, howabout you let this tiny useful list be freely available, and then you sell your study materials / textbooks and other material around that… maybe use Creative Commons Non Commercial Attribution license or similar to make sure you get some links and business." In other words, Pearson has failed to understand this new world of networked knowledge, where a link to the "offending" list would lead to other resources that Pearson has, enriching both Pearson and those using its publications.


Monday, November 19, 2012

Hurricane Sandy and Disaster Recovery: Cloud to the rescue?

When looking at the aftermath of Hurricane Sandy, the most important aspect of the event is the toll it took on people. The Atlantic puts the total impact at $60 billion, with the death toll at 123 people. However, those who survived faced the challenges brought about by the flooding and living without power for weeks. For example, 4 million remained without power for an extended period of time, which of course challenged individuals to keep their frozen food cold and live without technology. As for companies, their disaster recovery plans were put to the test. Perhaps the most poignant example was the New York University Langone Medical Center, which had to evacuate patients because its backup generators were located in the basement, which got flooded. Hospital officials defended their preparedness, but critics pointed out that the backup power generators "are not state-of-the-art".

Samara Lynn of PC Magazine published an article on how Sandy taught organizations valuable lessons from a Disaster Recovery (DR) perspective (she previously painstakingly put together a four-part series on DR planning for small and medium-sized businesses; see here, here, here, and here). Before I read the article, I was expecting a bulleted list of dos and don'ts for DR planning. What I was surprised to find is that companies are relying on cloud computing service providers to make up for the unavailability of local processing. Examples include:
  • A New York architectural firm, Diller Scofidio + Renfro, used Amazon Web Services (AWS) to relocate the company's core applications, enabling users with the proper license configuration to access them right from their laptops. The IT Manager, Chris Donnell, also used AWS as a remote desktop during the disaster. (I encourage you to read the whole article, as it details how Chris was in the middle of an email migration from Outlook to Gmail when Sandy hit; poor guy!) The company also used Panzura to store data temporarily in the cloud.
  • RingCentral, a cloud-based PBX hosting service (they sponsor TWIET and other podcasts on the TWIT network), was able to relocate its operations away from the storm. More importantly, it offers near-instant recovery of phone support: by plugging in a piece of hardware, it can "bring in a live extension under 10 minutes". Naturally, there is increased interest in RingCentral from those who were dissatisfied with the lengthy recovery times of their providers.
The article also discusses how one service provider made DR part of its IT outsourcing service, and how the key to DR is backup power.

Although not related directly to cloud, one of the most amazing stories I've heard is how SquareSpace (SQS) kept its platform up and running. Like the hospital, SQS had its backup generator in the basement, and it got flooded. It published this blog post to inform customers of what was happening. But the really interesting story is the lengths the team went to to ensure the site stayed up and running: they physically carried fuel from the basement up 17 flights of stairs to the generator on the roof.

Even more amazing was that the founder and CEO, Anthony Casalena, personally helped in this effort. Talk about Tone at the Top!