When we speak to customers about their SQL Servers, the words “reliable” and “stable” often form part of their vocabulary, used interchangeably.
But are they really the same thing?
For me, they’re quite different. In this day and age, stability should be impossible to achieve, but reliability is paramount. And where we find stability, there are usually problems too.
It’s likely you’re now thinking to yourself “what a weird thing to say”.
Reliability
When something’s “reliable”, we know that doing the same task repeatedly will give us the correct results.
In the world of SQL Server, we might be asking for a list of customers beginning with “A” - we get the correct list and the time taken is consistent.
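In T-SQL terms, that request might look something like this (the table and column names here are purely hypothetical):

```sql
-- Hypothetical schema: dbo.Customers with a CustomerName column
SELECT CustomerName
FROM dbo.Customers
WHERE CustomerName LIKE 'A%'   -- customers beginning with "A"
ORDER BY CustomerName;
```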
If we only got half the list, we’d declare it unreliable. If sometimes the result is instant and other times it takes five minutes, unreliable would again figure in our vocabulary (probably along with some other words for emphasis).
If we’re using an application and the SQL Server has crashed, again it’s unreliable.
What if we’re doing queries but there’s a backup running so they time out? Unreliable.
Stability
When something’s “stable” it means it’s not changing.
In a healthcare scenario, we talk about someone’s condition being “stable”. That doesn’t necessarily mean they’re well and all their critical organs are operating reliably; it just means they’re neither better nor worse - just the same. If you’re critically ill, that might be a good thing.
When I started in IT (some 30+ years ago), we craved stability. Once you got an operating system and its applications stable, you left them alone. That usually resulted in reliability for the person using them. It wasn’t uncommon to sit down at a system that was 10+ years old, to be told “it’s really reliable”.
During a recent SQL Server audit, we uncovered some very old installations - circa SQL Server 2005 (so getting close to 20 years old). These installations were also running on old operating systems and several had never been patched. Some of them had never been backed up (a story for another post). We also found a couple of instances running a SQL Server build that Microsoft won’t support due to the potential for data corruption. Who even knew that was a thing?
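For the curious, the first checks in an audit like this are simple enough. A minimal sketch (the backup check assumes msdb history hasn’t been purged):

```sql
-- What version and servicing level is this instance on?
SELECT SERVERPROPERTY('ProductVersion') AS product_version,  -- 9.0.x = SQL Server 2005
       SERVERPROPERTY('ProductLevel')   AS product_level;    -- RTM, SP1, SP2...

-- When was each database last backed up? (NULL = never, as far as msdb knows)
SELECT d.name,
       MAX(b.backup_finish_date) AS last_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS b
       ON b.database_name = d.name
GROUP BY d.name
ORDER BY last_backup;    -- never-backed-up databases sort to the top
```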
Ask anyone who was using these stable systems about upgrades and they’d say “never given us a problem - please don’t mess with it”.
Aim for Reliability, Ditch Stability
We used to achieve reliability by getting things stable and then leaving them alone. The thinking was “If it ain’t broke, don’t fix it”. Then patching arrived…
In the early days of patching, we got bug fixes and new features. The bug fixes were designed to improve reliability and the new features sometimes did the opposite. The mentality of the time was “patch when necessary, otherwise leave things alone - especially if it’s reliable and nobody wants the new features”.
This is why we often find old versions of SQL Server that have never been patched. As an engineer, you don’t want to get yelled at for patching the SQL Server, breaking it, and breaking the business-critical systems that use it. SQL Servers are often looked after by Windows engineers, who are comfortable recovering from Windows Server issues, Exchange Server issues and the like, but SQL Server is a black box where nightmares live. Best left alone, especially if it’s working OK.
Along Came Cybersecurity
Then the world changed. The Internet arrived and hacking/malware growth surged to giddy heights. We were at war. The Slammer Worm (2003) infected over 75,000 SQL Servers all over the world in a matter of minutes, despite Microsoft having released a patch for the vulnerability around six months before it arrived. Hardly anyone had installed it. This is still happening 20 years later with malware such as “Maggie” (Hundreds of Microsoft SQL servers backdoored with new malware (bleepingcomputer.com)) and Trigona (Microsoft SQL servers hacked to deploy Trigona ransomware (bleepingcomputer.com)).
Slowly, patches started being applied more regularly, but still sometimes weeks or months behind their release as engineers waited to see if there were any problems. There’s comfort in letting others jump first.
Crashing a server because of a bad patch has always been a problem. But we changed from “patch if you must” to “you must patch” - and mostly, we do. Some highly critical servers might still be on a slightly delayed schedule, but most people want their patches installed quickly…
Unless it’s SQL Server - the black box where nightmares live. It still feels safer to avoid patching it if possible, and so that’s what happens. Which leaves these servers with exploitable vulnerabilities, or sometimes just an exploitable configuration. Those configurations may have been acceptable practice years ago, but we know better now. Even so, it takes a brave person to change them, as figuring out the possible impact can be complex.
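The irony is that spotting this sort of configuration is easy; it’s judging the impact of changing it that’s hard. A minimal sketch, using a non-exhaustive sample of the options worth reviewing:

```sql
-- A few commonly-abused instance options (value_in_use = 1 means enabled)
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN ('xp_cmdshell',
               'Ole Automation Procedures',
               'Ad Hoc Distributed Queries',
               'remote access');
```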
So we live with it: diligently patching our networks and infrastructure in the hope of avoiding a cybersecurity incident, while our SQL Servers sit exposed and vulnerable - either to direct attack (for data theft), or as a platform from which to attack other parts of the network.
SQL Server Needs TLC
The bottom line is that SQL Server should never be stable. It should be constantly changing: patched for security vulnerabilities, and updated to remove old-school configuration that we now understand is no longer fit for purpose.
So, our stance is “embrace change on your SQL Server”. That doesn’t mean patching with wild abandon - these tasks still need to be properly planned and executed, with a roll-back option in place. But you can no longer afford to achieve reliability by being stable. That’s a problem quietly brewing away in your blind spot, one that’ll eventually surface and give you a massive thwack across the face - probably at the weekend, in the middle of the night. You know how these things pick their moment…
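To end on a practical note: one simple roll-back option is a copy-only backup taken immediately before the change, giving you a restore point without disturbing the normal backup chain. A sketch (the database name and path are placeholders):

```sql
-- Take a one-off restore point before a planned change.
-- COPY_ONLY keeps it out of the regular backup chain;
-- INIT overwrites any existing file at that path.
BACKUP DATABASE [YourDatabase]
TO DISK = N'X:\Backups\YourDatabase_prechange.bak'
WITH COPY_ONLY, CHECKSUM, INIT;
```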