Common Mistakes When Troubleshooting Critical System Issues

Sometimes, though, things break. Sometimes these issues are small, and sometimes they are huge problems. When they fall into the “huge problems” category, we have a name for them that you’ve likely heard: Critical Situations, or “CritSits.” Usually we are talking about massive, “business down” outages affecting company-wide or highly critical systems.

Here is how to avoid the common mistakes made in these types of situations:

  • Stay Calm
    I know this is easier said than done, especially when your environment is down and management is asking you for a status update every 20-30 seconds, but this is not the time to let things fall apart.
  • Bring all parties to the table
    This is not usually the best time for in-fighting or pointing the index or middle finger. Often, though, this can be when you see it the most. Right now you are in a pickle, and the faster this gets resolved the better it will be for everyone involved, period.
  • Have a schedule rotation
    Many times these issues go long into the night and into the next day. This is not the time to pull a marathon session of 40 hours straight. Work your normal 8-, 10-, or 12-hour shifts as much as you can. Having a fresh set of eyes and a sharp mind is critical to getting this thing solved.
  • Death by Status Update
    We’ve all been on these calls. We talk about what we just did for 20 minutes and what the results are, we talk about what we think it might be for 20 minutes. Then we talk about what the next steps are for 20 minutes. Then we have to update everyone else on what we are going to do for 20 minutes. Then we do the actual thing we are going to do for 20 minutes and have to stop because we need to start preparing for our next status update.
  • The Art of Evidence Gathering
    One of the most important things to do after determining there’s a problem is to define it. A critical component of troubleshooting an issue, and of defining it better, is gathering data…
  • Use Solid Troubleshooting Techniques and Start with the Basics
    Don’t go with your gut; go with solid troubleshooting techniques. Making lots of changes at once “to see what happens” is a surefire way to waste time and probably make things worse. One thing that is usually overlooked is documenting what you are changing and doing.
  • Backups! Backups! Backups!
    Many times over the years, during a critical server outage, things could have been back online within minutes (time = $$$) by restoring a recent backup. In those situations, valuable root cause data could often have been captured, the backup restored, and the crisis mitigated, if only a backup could have been restored.
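
To make the “one change at a time, and write it down” advice concrete, here is a minimal sketch of an incident change log in Python. Everything in it (the `ChangeEntry` and `ChangeLog` names, the fields, the "pending" marker) is hypothetical and only for illustration; a shared text file or ticket works just as well. The one idea it tries to capture is that you should not make a second change until you have recorded what the first one did.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChangeEntry:
    """One change made during the incident (field names are illustrative)."""
    description: str
    made_by: str
    made_at: datetime
    outcome: str = "pending"  # filled in once the effect has been observed


@dataclass
class ChangeLog:
    """Hypothetical log that allows only one unresolved change at a time."""
    entries: list = field(default_factory=list)

    def record(self, description, made_by):
        # Refuse a new change while the previous one has no recorded outcome.
        if self.entries and self.entries[-1].outcome == "pending":
            raise RuntimeError("record the outcome of the previous change first")
        entry = ChangeEntry(description, made_by, datetime.now(timezone.utc))
        self.entries.append(entry)
        return entry

    def resolve(self, outcome):
        # Attach the observed result to the most recent change.
        if not self.entries or self.entries[-1].outcome != "pending":
            raise RuntimeError("no pending change to resolve")
        self.entries[-1].outcome = outcome

    def summary(self):
        # One line per change, ready to paste into a status update.
        return [
            f"{e.made_at:%H:%M} UTC {e.made_by}: {e.description} -> {e.outcome}"
            for e in self.entries
        ]


if __name__ == "__main__":
    log = ChangeLog()
    log.record("restarted the application service", "alice")
    log.resolve("no effect, errors continue")
    log.record("rolled back last night's config change", "bob")
    log.resolve("errors stopped")
    print("\n".join(log.summary()))
```

A side benefit of a log like this is that `summary()` gives you the status update for free, which leaves more time for the actual troubleshooting.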


