The unfortunate truth is this: systems fail. Sadly, it’s nigh-on impossible to build a network that has no vulnerabilities, and there’s always going to be something that could and will go wrong, be it a cyber-attack, power failure, device failure or plain old human error.
So, what’s essential, aside from mitigating the risk, is ensuring you’re fully prepared for not if, but when that something goes wrong. Without defined roles and tasks during a disaster, your team could be running amok, stressing out and stepping on each other’s toes. Naturally, this leads to rising tensions, poor internal and external communication and, ultimately, a team that isn’t doing its job effectively.
We place a great deal of importance on major incidents, and we’ve worked hard on building reliable processes and procedures. So, when a disaster does strike, the systems team is poised to respond in the most efficient way possible, in line with the remarkably calm and reliable manner we pride ourselves on maintaining.
Test, test, and test again. Major incident plans that have never been tested are no plans at all. I’m pretty sure there are a good number of companies out there with disaster plans written and ready to go that have never been tested. Nobody knows whether those plans are any good, which renders them pointless. There are hiccups you simply will never think of without testing. For example, our early testing showed that the instructions for sending an SMS through our system were not clear enough and required further explanation from the Service Desk, which of course distracted them from the task at hand.
This was the first “Major Incident” we’ve had while I was out of the office and unable to get direct feedback from our Service Desk team, and I was absolutely fine with it. Procedures, like systems, will never be perfect, but the iterative process of testing our procedures means we’re all confident when incidents occur, even when we’re out of the office. Every Major Incident at Exmos is followed by an MI review meeting where all involved parties discuss what went well, what didn’t go so well, and what we can improve on for the future. Getting to the stage where everyone is completely comfortable with the process means we’re clearer in our communication and everyone understands their role during an incident. This comes across when we’re talking to clients, and helps us maintain a calm and reliable operation.
It was only following the event that I recalled the Service Desk was testing its major incident procedure, so it may as well have been the real thing in my eyes. Our service engineers did a top job, which helped everyone involved stay reassured.
At 8:20 am I received the first message via SMS:
“Major incident: DLCA Remote Access server failure.”
This was followed by a secure link to a private webpage which featured further information about the incident, and the plan of action to resolve the issue.
During a Major Incident with one of our clients, every stakeholder, both at Exmos and on the client side, is always kept in the loop. We provide frequent updates via SMS, with a link to a web page with further information on the case, including estimated fix times. Knowledge is definitely the enemy of panic in situations like major incidents, and by providing this we minimise anxious calls to the Service Desk, which take our engineers away from the real task at hand.
We’d like to offer our gratitude to David Wheeler, Ross Nicol and the team at Drummond Laurie for allowing us to cross-test our major incident procedures alongside theirs. Just as only the management team at Exmos was aware this was a test, none of the staff at Drummond Laurie was in on the secret, so this was a true test of their procedures too. It shows a forward-thinking, digitally minded company that doesn’t ignore potential threats and is willing to undertake an exercise which ultimately increases its resilience.
I’m delighted with our MI procedures, but we’re not resting on our laurels. Our Major Incident procedures will be continuously tested and improved upon, so major incidents don’t seem so major, and they can be tackled just like any other case.
Posted by Gordon Coulter on Monday, August 14, 2017