Wednesday, March 3, 2010

Why I Hate "Good Enough"



I really, truly hate the “Good Enough” mentality that’s become so pervasive in IT these days. It’s not because I think everything should be five-nines – that’s a common misconception about my position. Far from it – five-nines is prohibitively expensive and downright absurd for almost everyone. (Five-nines is 99.999% availability, which works out to roughly five minutes of downtime in a year – which is also why I dislike anything claiming five-nines based on nothing more than not going down in the last 12 months. Seriously, that’s not five-nines.) More simply, if “Good Enough” were really the ultimate level of reliability, why would any business bother with Disaster Recovery?

Here’s how Good Enough is implemented most commonly these days:
  • The Chances Of This Going Down Are Too Small To Bother With Planning For
  • We Don’t Think This Will Go Down So We Won’t Plan For If It Does
  • We’re PRETTY SURE This Won’t Go Down But We Have Support On Speed Dial
  • We Clustered This So We Totally Know It Won’t Go Down
  • If Something Goes Wrong, Call Support And Hope They Know Why

Here’s how Good Enough is implemented by yours truly:
  • Chances Of Failure Are Very Very Low, BUT If It Dies, We Do This
  • We Don’t Believe This Can Fail, BUT If It Does, We Do This
  • Confidence Is Moderate, BUT We Have A Plan For Failures
  • It’s Clustered, BUT If The Cluster Has Problems, We Do This
  • If We Have A Problem, Involve Everyone And Find The Root Cause By Any Means Necessary

Notice the difference? I do something very different – I make the presumption of failure. That doesn’t mean everything’s crap, even though much of it these days is. It means presuming that at some point in time, for some reason, failure will occur. I don’t know when, I don’t know why, and I may not even know how. But I absolutely require that there be a plan in place for dealing with that failure. Things like maintenance are planned, of course, but you would probably be shocked at how many organizations plan their maintenance poorly by my standards. And my standards aren’t unreasonably high, either.
I require a plan for the maintenance and a plan to back out if things go wrong – those you’ll find everywhere. But I also require a plan for restoring function if the back-out fails, and a plan for forcing ahead if a back-out is impossible. Why do I require these things? Because what happens if the back-out fails? I’ve had it happen, and it’s not pretty. And what happens when you can’t back out changes at all? I’ve seen that plenty of times – most organizations take the stance that if it can’t be backed out, don’t bother with a back-out plan; just say it has to go forward. Okay, so what happens when that upgrade fails? There’s no plan in place, no way to go back, and you’re left in the lurch.
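
To make that concrete, here’s a minimal sketch of the bare minimum I expect a plan to contain before I’ll call it one. It’s hypothetical Python, written purely as illustration; the class, field, and step names are mine, not from any real change-management tool or template.

```python
# Hypothetical sketch only: the names here are invented to illustrate the point,
# not taken from any real tool.
from dataclasses import dataclass, field


@dataclass
class MaintenancePlan:
    change_steps: list[str]           # the maintenance itself
    backout_steps: list[str]          # how to undo it if something goes wrong
    backout_failure_steps: list[str]  # how to restore function if the back-out itself fails
    forward_fix_steps: list[str] = field(default_factory=list)  # how to force ahead when no back-out exists
    backout_possible: bool = True

    def missing_contingencies(self) -> list[str]:
        """List the contingencies this plan still owes before it should be approved."""
        gaps = []
        if not self.change_steps:
            gaps.append("no change steps")
        if self.backout_possible and not self.backout_steps:
            gaps.append("no back-out plan")
        if self.backout_possible and not self.backout_failure_steps:
            gaps.append("no plan for a failed back-out")
        if not self.backout_possible and not self.forward_fix_steps:
            gaps.append("back-out impossible, but no forward-fix plan")
        return gaps


# "Install the bundle, reboot, done" looks complete until you ask about the gaps:
plan = MaintenancePlan(
    change_steps=["install patch bundle", "reboot", "verify services"],
    backout_steps=["remove patch bundle", "reboot", "verify services"],
    backout_failure_steps=[],  # the gap most "Good Enough" plans leave wide open
)
print(plan.missing_contingencies())  # ['no plan for a failed back-out']
```

A plan that still reports gaps doesn’t get scheduled. Simple as that.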

I’ve been blessed, or cursed depending on who you ask, to see many kinds of failures in many situations – everything from a single byte of corruption causing a failed firmware update to yours truly accidentally deleting the wrong multi-terabyte database. (Hey, think of how many coworkers you know who would actually admit to that, as opposed to just restoring from backups and pretending it never happened.) As I’ve progressed in my career, I’ve learned a lot about failures, and a lot about how to manage and mitigate them. Yet somehow this kind of knowledge seems to be absent, or simply ignored, at a variety of levels.
I wish I had some good answers for how we can inject this back into IT operations and business operations processes. Unfortunately, I don’t, other than pointing it out here. Seriously, folks, think about this: what’s your procedure when a round of maintenance goes awry? Chances are your first and only answer is “call support.” Calling support is all well and good, and an important step, but it shouldn’t be your only step. It’s also not a step you should be injecting between “perform upgrade” and “maintenance complete.” In other words, your process flow chart shouldn’t be a series of straight lines, and the arrows shouldn’t all point down or to the right.
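
To illustrate what a flow chart with branches looks like, here’s a rough runbook skeleton. It’s hypothetical Python, invented purely for illustration (the function and step names are mine, not from any real tool or from the situation below), but it shows every failure branch having a next step written down before the work starts, with “call support” as one branch among several rather than the only arrow on the chart.

```python
# Hypothetical runbook skeleton, for illustration only.
# Each argument is a callable that performs one phase of the maintenance
# and returns True on success or False on failure.
def run_maintenance(apply_patches, verify, back_out, restore_from_backup):
    if apply_patches() and verify():
        return "maintenance complete"

    # The upgrade (or its verification) failed: take the back-out branch.
    if back_out() and verify():
        return "backed out cleanly; reschedule the maintenance"

    # The back-out failed too: this is the branch most plans never write down.
    if restore_from_backup() and verify():
        return "restored from backup; open a root-cause investigation"

    # Only now does "call support" enter the picture, and it still isn't the end.
    return "escalate: vendor support plus internal root-cause analysis"
```

The specifics will differ everywhere; the point is that none of those branches should be getting invented at 9 PM on a Friday.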
Let’s talk example. This is a real situation I’ve been through, with details changed. I’m not going to name names, because absolutely nobody in this situation looks good by any measure.

Maintenance was scheduled on a development system for a Friday afternoon. The maintenance consisted of operating system patches and a reboot as part of the patching process. The process had been done many times before without problems, so there was no established plan for backing out the patches. Install the bundle, reboot, done.
After the bundle was installed, the system was rebooted and refused to come up to multiuser, complaining of problems with system libraries. Upon examination of the logs, repairing the libraries was judged to be too much hassle, so rather than attempt a repair, a quick script was written to back out the patches. The script failed on several individual patches because they simply could not be backed out. This was accepted as “just how it is,” and the system was rebooted again.
Now the system refused to go past single user, and critical services could not start. Files were determined to be missing, and an attempt was made to install them from the OS media. This failed because there were incompatible patches on the system that could not be backed out. A SEV1 call was placed to the operating system vendor’s support.

Now, let’s start with the first failure – the presumption that just because it worked before, it would work again. That was compounded by not having any real plan – install, reboot, done is not a plan. Further complicating things, the back-out attempt was ad-libbed, without understanding that some patches couldn’t be backed out. It only gets worse when that is accepted as “just normal” without any explanation or understanding of why. It’s likely that at this point the removable patches were backed out even though the patches that COULDN’T be backed out depended on them. That rests on the fatal presumption that “the vendor would never do something that stupid.” Sorry; every single vendor is that stupid at one point or another, and they make mistakes just like everyone else.
So at this point, the entire process has become ad-libbed. Do we restore from tape? Back out more? Reattempt patching? Who knows! There’s no plan; we’re shooting from the hip. Now we’re on hold with support, the system has been down for hours, it’s 9 PM on a Friday, and it has to be back online by 7 AM Monday or it’ll throw off a multi-million-dollar project. This came about primarily for the worst reason of all: “development” was treated as a sandbox where it was okay to do just about anything, despite the system being very actively used for development work.
The vendor’s response made the problem worse still: “Oh, yeah, we know about this problem. You have to restore from tape, and if that doesn’t work, you have to reinstall.” So the system was restored from tape, with limited success. Reinstalling wasn’t an option because of the way the system had been configured and had to be built. But a restore from tape or a full reinstall as the only repair methods was what the operating system vendor considered a Good Enough answer for their Enterprise product.

Ultimately, the system continued to have problems and became a very expensive three-month project to completely rebuild it and all its environments, because everyone involved – from management to the system administrators to the operating system vendor – said “that’s Good Enough.” It cost management a lot of respect, it cost the system administrators a lot of time, it cost the business a great deal of money, and the vendor lost the customer – probably forever.

So the next time somebody tells you something is Good Enough, don’t buy it. A “Good Enough” plan isn’t, and never will be, good enough when it’s your business at stake. Good Enough doesn’t mean building the most reliable infrastructure you can and then throwing up your hands and saying “that’s as good as we can get, oh well!” It means accepting that things can and will fail, and that nothing will ever be perfect – then using that knowledge and acceptance to build plans for those failures.

If you’ve planned for and built for the fact that failures are a when and never an if, and defined a process to work around and repair those failures, then hey, that’s Good Enough for me. 

3 comments:

  1. Been there too, Phil, and it isn't fun. Overall, well put. We must never assume bad things won't happen because they do and I've seen some very unlikely stuff happen!

    Expect the best - plan for the worst.

  2. Great post.

    As they say "Failing to plan is planning to fail!".

    Cheers.

    Chris

  3. When I was doing DR/BC planning at a former employer we came across a startling statistic: 95% of companies that suffer a true disaster never reopen their doors.

    I've always referred to this as the "Tyranny of Enough," and it's used to devastating effect on both sides of the metric. "Not doing enough" and "it's good enough" are both merely attempts at shrugging off the responsibility of being accountable for decisions. Neither stance requires planting your flag on an issue and qualifying (and quantifying!) criteria for success.

    No one wants to be the one responsible for identifying the risk, let alone the *probability* of hitting that maximum loss.

    So, Amen, brutha!

    J
