Showing posts with label DRP. Show all posts

Tuesday, October 4, 2022

Fiona’s Fury: Flashback to Summer’s Great Rogers Outage (Part 1)

Canadians continue to pick up the pieces after tropical storm Fiona battered the Maritime provinces. Although damage estimates are still being calculated, “Nova Scotia Premier Tim Houston announced over C$40 million ($29.1 million) in support to help those who were impacted by Fiona” (link). In terms of cellphone outages, CBC reported that “there are still areas of the province without cellphone service Monday although companies declined to say exactly how many customers have been affected.”


The Canadian Radio-television and Telecommunications Commission (CRTC) has asked for estimates on how many people were affected by the outage, but the telecom companies are reticent to share this information. As CBC reported: “Bell and Telus asked for some of their submissions to be redacted, while Eastlink and Rogers demanded their entire reports be kept confidential.”


Photo by Pixabay: link


Rogers Outage in Review: What happened last summer?

The outage that hit the Maritimes recalls the situation that unfolded over the summer. In July 2022, the Rogers outage was not limited to the East Coast; it affected the entire country. When Rogers was asked to explain what happened, it struck a notably more conciliatory tone:

“Rogers Communications Canada Inc. (“Rogers”) is in receipt of a letter containing Requests for Information (“RFIs”) from the Canadian Radio-television and Telecommunications Commission (“CRTC” or the “Commission”), dated July 12, 2022, concerning the above-mentioned subject. Attached, please find our Response to that letter… At the outset, Rogers appreciates the opportunity to explain to the Commission, the Government of Canada and all Canadians what transpired on July 8th, 2022. The network outage experienced by Rogers was simply not acceptable. We failed in our commitment to be Canada’s most reliable network. We know how much our customers rely on our networks and we sincerely apologize.” [Emphasis added]


Though the document was redacted, it still provides some background on what happened. In this post, we will take a look at the outage itself; in the next post, we will look at the lessons learned.


Cause of the outage

Rogers explained the cause of the outage as follows:

“Maintenance and update windows always take place in the very early morning hours when network traffic is at its quietest. At 4:43AM EDT, a specific coding was introduced in our Distribution Routers which triggered the failure of the Rogers IP core network starting at 4:45AM… The configuration change deleted a routing filter and allowed for all possible routes to the Internet to pass through the routers. As a result, the routers immediately began propagating abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their capacity levels and were then unable to route traffic, causing the common core network to stop processing traffic. As a result, the Rogers network lost connectivity to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.” [Emphasis added]

In other words, the change inadvertently produced a pattern similar to a denial-of-service attack: the network shut down because it became overwhelmed with traffic.

Rogers goes on to explain that the company “uses a common core network, essentially one IP network infrastructure, that supports all wireless, wireline and enterprise services. The common core is the brain of the network that receives, processes, transmits and connects all Internet, voice, data and TV traffic for our customers… Certain network routing equipment became flooded, exceeded their memory and processing capacity and were then unable to route and process traffic, causing the common core network to shut down.” The implication is that the common core network was a single point of failure.
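This failure mode can be sketched as a toy model: a core router protected by a route filter, where deleting the filter lets the full Internet routing table flood in and exceed the router's capacity. Everything below (the router class, capacities, and route counts) is an illustrative assumption, not Rogers' actual configuration:

```python
# Toy model of the outage's failure mode, NOT Rogers' actual setup:
# a core router accepts routes through a filter; deleting the filter
# lets the full Internet routing table through and exceeds capacity.

FULL_INTERNET_TABLE = 900_000  # assumed rough scale of the global routing table
INTERNAL_ROUTES = 50_000       # assumed routes the core is meant to carry

class CoreRouter:
    def __init__(self, capacity, route_filter=None):
        self.capacity = capacity          # max routes this router can hold
        self.route_filter = route_filter  # caps how many routes are accepted
        self.routes = 0
        self.up = True

    def receive(self, advertised):
        # With no filter, every advertised route is accepted.
        if self.route_filter is None:
            accepted = advertised
        else:
            accepted = self.route_filter(advertised)
        self.routes += accepted
        if self.routes > self.capacity:
            self.up = False  # flooded: the router stops processing traffic

# Normal operation: the filter admits only the internal routes.
router = CoreRouter(capacity=100_000,
                    route_filter=lambda n: min(n, INTERNAL_ROUTES))
router.receive(FULL_INTERNET_TABLE)
print(router.up)   # True: the filter kept the table within capacity

# The configuration change deletes the filter; all routes propagate.
router.route_filter = None
router.receive(FULL_INTERNET_TABLE)
print(router.up)   # False: the table exceeds capacity; the core stops routing
```

Because the wireless, wireline and enterprise services all ride on one common core, a router in this state takes every service down with it, which is exactly the single-point-of-failure concern.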


What was and was not impacted

With respect to Rogers Bank (yes, Rogers operates a bank):

“The impact to the Bank’s customers was minimal as the Bank services were available and the Bank’s customers were able to transact on their Rogers Bank credit cards. There was no interruption in the Bank’s core systems (credit card processing, Interactive Voice Response (“IVR”), Call Centre and customer self-serve mobile application) and these core systems remained available to the Bank’s customers. No critical Bank systems were impacted, and all daily processing was completed as required, including by the Bank’s statement printing vendor and its card personalization bureau which received their daily files and were processing them per standard service level agreements and procedures.”


It was a different story for businesses that relied on Rogers phone lines to process payments, with Interac tweeting:

“There is a nationwide Rogers outage that encompasses all their business and consumer network services. This is impacting INTERAC Debit and INTERAC eTransfer. INTERAC Debit is currently unavailable online and at checkout.”


Beyond the millions who had no service, emergency communications were also impacted:

  • “Unfortunately, the outage of July 8th did impact 9-1-1 service across Rogers’ service area, to both wireline and wireless services.”
  • “Wireline impact: There were approximately [REDACTED] 9-1-1 calls placed successfully across Rogers’ network on July 8th. The typical daily average of total wireline 9-1-1 calls is [REDACTED] per day. Data is unavailable for unsuccessful wireline 9-1-1 calls. On July 9th, there were approximately [REDACTED] 9-1-1 calls placed successfully across Rogers’ network.”
  • “Wireless impact: As can be seen in table below, the outage similarly affected wireless 9-1-1. Total successful calls were [REDACTED] the average daily amount of about [REDACTED] 9-1-1 calls made from Rogers wireless devices.”

Rogers offered service outage credits

The key remedy offered was service credits, but this was not due to breaches in service agreements:

“There was no breach of our service agreements with our retail customers. However, in order to address our customers’ disappointment with the outage, Rogers has already announced it will be crediting 5 days of service fees to its customers. This will be applied automatically to their next invoice.”
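As a rough illustration of what such a credit works out to, assuming a hypothetical C$85 monthly fee (not an actual Rogers rate) and a simple 30-day proration:

```python
# Back-of-the-envelope proration of the 5-day service credit.
# The monthly fee is a made-up figure, not an actual Rogers rate.
monthly_fee = 85.00    # assumed monthly service fee (CAD)
days_in_month = 30     # simple proration basis
credit_days = 5        # credit announced by Rogers

credit = round(monthly_fee / days_in_month * credit_days, 2)
print(credit)  # 14.17
```

On that assumed plan, the automatic credit is modest relative to a full month's fee, which is why some customers viewed it as a goodwill gesture rather than compensation.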


Cooperation with Bell and Telus

Despite the highly competitive nature of the business, it does appear that Rogers, Bell and Telus were coordinating with each other:

  • “On July 17th, 2015, the Canadian Telecom Resiliency Working Group (“CTRWG”), formerly called Canadian Telecom Emergency Preparedness Association, established reciprocal agreements between Rogers and Bell, and between Rogers and TELUS, to exchange alternate carrier SIM cards in support of Business Continuity.”
  • “As we stated in Rogers(CRTC)11July2022-1.xviii above, our Chief Technology and Information Officer reached out to his counterparts at Bell and TELUS early on July 8th. Assistance was offered by both Bell and TELUS. However, given the nature of the issue, Rogers rapidly assessed and concluded that it was not possible to make the necessary network changes to enable our wireless customers to move to their wireless networks.”
  • “Rogers, Bell and TELUS are presently assessing potential options and will report further findings and potential solutions per the creation of the Memorandum of Understanding that will be delivered in September 2022 to the Minister of ISED by CSTAC.”

In closing, the outage comes down to change management. The error was exacerbated by the industry-standard approach of using a single platform to provide the various telecommunication services. Rogers did offer service credits, but was careful to note that this was not due to a breach of agreements. Finally, the industry does come together during crisis situations, putting competitive differences aside.


In our next post, we’ll take a look at the lessons learned from this outage. Stay tuned!

Author: Malik Datardina, CPA, CA, CISA. Malik works at Auvenir as a GRC Strategist that is working to transform the engagement experience for accounting firms and their clients. The opinions expressed here do not necessarily represent UWCISA, UW, Auvenir (or its affiliates), CPA Canada or anyone else.

Thursday, August 4, 2016

Flight Delays & Contingency Planning in Real Life


Photo Credit: Trey Ratcliff

Last week, I headed to a day-long conference in New York on Thursday and was expecting to return home on the 7:20 flight back to Toronto.

However, things didn't go as planned: La Guardia (LGA) had cancelled a number of flights due to weather delays.

I decided to haul it back by renting a car for a one-way trip.

Originally, I went to Hertz, but they refused to rent to me because I was heading back to Toronto, Canada. If you can believe it, they advised me to rent a car to Buffalo and then take the bus back to Toronto. Yeah right!!!

Thank God, Avis did not give me such ridiculous advice and instead gave me a car to make my way back home. I ended up leaving Avis around 8:45 and made it home around 4:30 AM. One big advantage of travelling at that time of night is that there is no traffic :)

Thinking about this after the fact, I realized it was a good lesson in "real life" contingency planning, so here's what I think I did right, what I could have done better, and one further thought.

What I did right:
  • Call the travel agent instead of waiting in line to talk to the airline: I had already cleared customs and was lining up at the Air Canada desk inside LGA when I realized my flight was cancelled. However, I decided to call the travel agent (while in line) to see what the situation was at other airports (JFK, Newark) and what my options were. That's where I learned that I would be flying out at 11:30 am on Friday (i.e. the next day). 
  • Avoided flying out on Friday: I didn't realize this at the time, but my chiropractor told me that after a major flight cancellation the airport deals with at least twice the volume the next day - especially since it was Friday and everyone would want to get home for the weekend. Consequently, how much rest would I have gotten if I had to be back 3 to 4 hours earlier the next day to make sure I got on the plane? My fear at the time was that either the weather delays would continue or something else would force me onto a later flight. 
  • Would any hotels be available? Given that many people had their flights cancelled, the hotels would likely be booked. Also, if I had to book outside the airport, I would have to battle morning traffic on the way back in. So it didn't seem like an appealing option. 
  • What's crazy to most may be open to you: The 8-hour drive back did seem daunting. However, most people wouldn't do such a crazy thing, thereby making it a viable option - since everyone else would be trying to get on a plane, there would be plenty of rental cars available for me. Or at least that's what I expected, and it turned out to be right. Also, when I spoke to the travel agent, she told me that someone else from Deloitte was looking to carpool back to Toronto. Unfortunately, I just missed him. However, realizing someone else was doing it made it seem less crazy. And truth be told, those cars were getting booked fast when I got to the car rental companies - many people were driving to Boston, Pittsburgh, etc. 
What I could have done better:
  • Monitoring for weather: When my flight got delayed on the way in on Wednesday, that should have been a clue that there could be problems the next day. In the future, I should keep track of weather conditions and be mindful of them. 
  • Monitoring for cancellations: Although I had checked in via the mobile app, I had been using low power mode on my iPhone. This prevented me from being alerted right away. The reason I was on low power mode is that the conference organizers didn't have outlets at the tables, and I wanted to make sure I had battery power to call/email/etc. at the airport. Next time, I should sit near an outlet or carry a portable power source so that I can charge my phone at the airport or on the plane. 
  • Book a car sooner: If I had learned about the cancellation sooner, I could have made alternative arrangements sooner. At least I could have booked the car and picked it up closer to where I was, instead of wasting that time driving to the airport. 
  • Noticed airport irregularities: There were more people queuing up at the Air Canada counter outside the security area, but I just dismissed this as volume. The lower volume inside the security area should have been my second clue that something was awry. 
  • Check the rental for damage: I was so focused on getting on my way that I didn't check. As it turns out, the car had massive damage on the front. Fortunately, the attendant letting me out noticed it and wrote it on the form. It's hard, but in an emergency situation it is important to keep a cool head and not make such errors. 
Otherwise: One thing that stuck in my mind is missing the fellow Deloitte colleague on the way back to Toronto. Was there a better way of organizing ourselves so that if something like this were to happen again, we could carpool? How can we trust each other if we don't work at the same company? I think that setting up an app and getting subscribers to sign up ahead of time wouldn't be feasible, because most people don't think about getting stranded at the airport - let alone finding a way to trust each other using user reviews. 

Contingency plans: test, test, test.

My biggest takeaway from this experience is that you can't know how good a contingency plan is until you actually do a real, live test. 

And unfortunately, most companies don't test their plans. 

As noted in this Business Continuity survey, Deloitte categorized managers as "aware" (i.e. those who know there's a problem) and "committed" (i.e. those who are willing to take action to resolve it). Of the committed group, only about 50% had tested their plans; of the aware group, only 17% had. 

With real estate it's location, location, location, but with business continuity plans it's test, test, test. As noted above, I realized a number of gaps in my contingency plan that I never would have known until I experienced this real-life emergency.

Author: Malik Datardina, CPA, CA, CISA. Malik works at Auvenir as a GRC Strategist that is working to transform the engagement experience for accounting firms and their clients. The opinions expressed here do not necessarily represent UWCISA, UW, Auvenir (or its affiliates), CPA Canada or anyone else.

Monday, December 3, 2012

The other DDoS: Denial of Service by DMCA

In information security, the common meaning of DDoS is a Distributed Denial of Service attack. However, there is a legally sanctioned form of DDoS: the DMCA Denial of Service, where a user acting in good faith is 'denied service' because of an alleged infringement of the DMCA. The DMCA (i.e. the Digital Millennium Copyright Act) provides a means to enforce copyright protections online and was ultimately responsible for killing Napster (which enabled peer-to-peer sharing of music and other files). Although the Napster case was cut and dried to some (like the recording industry), there are cases where users are actually acting in good faith but are taken down through enforcement of the Act.

The case that illustrates this issue is the takedown of 1.45 million education blogs in October. James Farmer, CEO of Edublogs, noted that "ServerBeach, to whom we pay $6,954.37 every month to host Edublogs, turned off our webservers, without notice, less than 12 hours after issuing us with a DMCA email." He went on to explain what the actual infringement was: "one of our teachers, in 2007, had shared a copy of Beck’s Hopelessness Scale with his class, a 20 question list, totalling some 279 words, published in 1974, that Pearson would like you to pay $120 for." Reading the blog further, it turns out that Edublogs did actually comply with the DMCA request it received. However, the issue Pearson had was that (a) the content was accessible via Google's cache and (b) it was accessible via Edublogs' Varnish cache. In other words, James Farmer got legally DDoSed: 1.45 million blogs were made unavailable due to ServerBeach's rush to comply with the DMCA instead of "calling any of the 3 numbers for us [ServerBeach] have on file".

Edublogs, however, is not the only company to be DDoSed in this manner. Small companies that publish news reports on YouTube or other content-sharing sites also face this danger. Take, for example, Leo Laporte's This Week in Tech (TWIT) new media network, which publishes tech-related podcasts and videocasts. The network's business model depends on Laporte being able to make each video available soon after it airs. Failure to do so results in the company losing ad revenue because the "eyeballs never made it" to the particular show. Consequently, when one of its episodes gets pulled down by Google's robots or at the request of the copyright holder (as noted here), it jeopardizes the TWIT business model, making him another DDoS victim.

From a risk perspective, the risk of such an event should be evaluated, especially for businesses that rely on revenue from the distribution of online content. Specifically, agreements with the third parties that host their content should include provisions enabling them to at least demonstrate compliance prior to being taken down. However, both James Farmer and Leo Laporte have attempted to work with their respective providers to prevent this type of risk: Farmer complied with the request, while Laporte has attempted to contact Google and explain that he runs a news organization. So this is easier said than done. Laporte hosts the videos on his own servers; however, the popularity of YouTube limits the effectiveness of this "backup strategy" (i.e. users won't go to his site to watch a video instead of YouTube). In the end, it may just be an unavoidable cost of relying on such providers.

From a longer-term perspective, it illustrates the clash between legacy laws and the Internet's capability to "network knowledge". This concept is taken from David Weinberger's "Too Big to Know", which identifies how the ability to share, link and debate information on the Internet transforms knowledge into a more fluid state, in contrast to the static nature of books. He explains the concept in the following video:

James Farmer implicitly argued this point in his rant against Pearson when he said: "Here’s another idea Pearson, maybe one that you could take from Edublogs, howabout you let this tiny useful list be freely available, and then you sell your study materials / textbooks and other material around that… maybe use Creative Commons Non Commercial Attribution license or similar to make sure you get some links and business." In other words, Pearson has failed to understand this new world of networked knowledge, where a link to the "offending" list would lead to the other resources Pearson has, enriching both Pearson and those using its publications.


Monday, November 19, 2012

Hurricane Sandy and Disaster Recovery: Cloud to the rescue?

When looking at the aftermath of hurricane Sandy, the most important aspect of the event is the toll it has taken on people. The Atlantic puts the total impact at $60 billion, with a death toll of 123 people. However, those who survived face the challenges brought about by the flooding and living without power for weeks. For example, 4 million people remained without power for an extended period of time. This challenged individuals to keep their frozen food cold and live without technology for that period. As for companies, their disaster recovery plans were put to the test. Perhaps the most poignant example was the New York University Langone Medical Center, which had to evacuate patients because its backup generators were located in the basement, which got flooded. Hospital officials defended their preparedness, but critics pointed out that the backup power generators "are not state-of-the-art".

Samara Lynn of PC Magazine published an article on how Sandy taught organizations valuable lessons from a Disaster Recovery (DR) perspective (she previously painstakingly put together a four-part series on DR planning for small and medium-sized businesses; see here, here, here, and here). Before I read the article, I was expecting a bulleted list of dos and don'ts for DR planning. What I was surprised to find instead is that companies are relying on cloud computing service providers to make up for the unavailability of local processing. Examples include:
  • The New York architectural firm Diller Scofidio + Renfro used Amazon Web Services (AWS) to relocate the company's core applications, enabling users with the proper license configuration to access these applications right from their laptops. The IT Manager, Chris Donnell, also used AWS as a remote desktop during the disaster. (I encourage you to read the whole article, as it details how Chris was in the middle of an email migration from Outlook to Gmail when Sandy hit; poor guy!) The company also used Panzura to store data temporarily in the cloud.
  • Ring Central, a cloud-based PBX hosting service (they sponsor TWIET and other podcasts on the TWIT network), was able to relocate its operations away from the storm. More importantly, it offers near-instant recovery of phone support: by plugging in a piece of hardware, it can "bring in a live extension under 10 minutes". Naturally, there is increased interest in Ring Central from those who were dissatisfied with the lengthy recovery times of their providers. 
The article also discusses how a service provider made DR part of its IT outsourcing service, and how the key to DR is backup power. 

Although not directly related to cloud, one of the most amazing stories I've heard is how SquareSpace (SQS) kept its platform up and running. Like the hospital, SQS had its backup generator in the basement, and that got flooded. It published this blog post to inform customers of what was happening. However, the really interesting story is the lengths the team went to to ensure the site stayed up and running: they physically carried fuel from the basement to the generator on the roof, going up 17 flights of stairs.

Even more amazing was that the founder and CEO, Anthony Casalena, personally helped in this effort. Talk about Tone at the Top!