There’s nothing worse than having to declare a self-inflicted disaster when you’ve just executed a perfect save…
You’re a hero. Your company had an IT disaster, and everyone saw your team throw on the red capes and swing into action. You brought the failed server back into operation from your backup system, either on spare hardware or as a VM on another host. Oh, and you met your RTO/RPO predictions, so both the downtime and the data loss stayed within acceptable limits.
What’s even more spectacular is that your company runs a factory production process, so any downtime is a financial killer.
But you’re hiding a few secrets.
Firstly, nobody in your team threw on a red superhero cape. In fact, you barely raised an eyebrow because you were executing a well-rehearsed Disaster Recovery plan based on a documented process. You’ve done it so many times it’s become boring. That’s a good secret and one you can, quite rightly, be quietly smug about.
The other secrets are a bit of a killer though.
Now that you’re running on the actual backup copy, there’s no backup being taken. You need to get back onto new hardware quickly so you can reinstate the backup cycle. If there’s another disaster while you’re already in disaster mode, well, you’re screwed. Nobody outside your team knows this.
You’re running regular image backups of your critical servers to your backup system, and that’s what allowed the rapid recovery. Although each incremental backup is relatively small and completes very quickly, the first one, where you seeded the entire contents of the server, took almost a whole day.
To fail back onto a new server, you’re going to have to shut everything down and copy the backup across. Or, in other words, you’re going to have to self-inflict downtime to get back to normal running. The only slight benefit is that you get to choose when. Nobody is going to be happy about this, and it’ll tarnish your original stellar recovery.
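To put rough numbers on that self-inflicted downtime, here’s a back-of-the-envelope sketch. The figures (a 2 TB server image and an 80 MB/s effective copy rate) are illustrative assumptions, not measurements from any particular backup product:

    # Back-of-the-envelope estimate of the downtime needed to copy a full
    # server image back onto replacement hardware (hypothetical numbers).

    image_size_gb = 2_000           # assumed size of the server image (2 TB)
    effective_throughput_mbps = 80  # assumed sustained copy rate in MB/s,
                                    # after network and disk overheads

    copy_seconds = (image_size_gb * 1024) / effective_throughput_mbps
    print(f"Estimated fail-back copy time: {copy_seconds / 3600:.1f} hours")
    # With these assumptions the copy alone is roughly 7 hours of planned
    # downtime, before any checks or application start-up time are added.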
Unfortunately, this is all too often not discussed during the sales process.
Exmos’ Observations
We’ve been caught out by this in the past, including having to run on a backup device for over a month before we could arrange the downtime to fail back.
A few years ago, we changed our backup system to something more focussed on Business Continuity, so we could at least continue to take backups while running on the backup copy.
Last year, we gained a new feature: we can now stream those backups to new equipment while we continue to run on the failed-over hardware. The actual fail-back process involves only a few minutes of downtime while we take one final, tiny backup, shut down the backup copy and boot the new live server.
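As a rough comparison with the earlier sketch, here’s the same calculation for the streamed fail-back, where only the final incremental has to move inside the downtime window. The figures are again illustrative assumptions rather than product measurements:

    # Same back-of-the-envelope style, but for the streamed fail-back:
    # only the last incremental and a reboot sit inside the outage.

    final_increment_gb = 2          # assumed size of the final incremental
    effective_throughput_mbps = 80  # same assumed copy rate in MB/s
    reboot_seconds = 180            # assumed time to boot the new live server

    copy_seconds = (final_increment_gb * 1024) / effective_throughput_mbps
    total_minutes = (copy_seconds + reboot_seconds) / 60
    print(f"Estimated fail-back downtime: {total_minutes:.1f} minutes")
    # With these assumptions, downtime drops from hours to a few minutes.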
We’ve been involved in some gnarly disasters over the years and are happy to relate our experiences over a coffee, although we’ll never name the customer. Why not give us a shout and drop in for a chat?