It's 3 A.M. when you hear the familiar voice beside you.

"Honey, please pick up the phone. Probably another app or whatever going down."

You fumble for the phone and listen to your Ops Manager describe yet another Critical Incident.

Déjà vu all over again, as Yogi Berra used to say.

More lost sleep; more aggravation; more reporting to the CEO that you still do not have a solution in place to prevent additional customer losses.

You think: How wonderfully boring it would be if you could put a solution in place that would reduce this drama.

But how to go about it?

What's a Major Incident Anyway?

The next day at work, you start at the beginning.

Everyone has their version of what makes an incident. But what’s the real definition of an incident? Not someone with their hair on fire about a false positive, but a real incident.

Well, it’s an unplanned interruption or reduction in quality of an IT service (a Service Interruption).

You read a bit more. About a "Major Incident Management Process" that looks like it would help you get more sleep at night.

You share what you've learned with your Ops and Support staffs and iterate on a plan.

Major Incident Management Process

You discover that crafting a Major Incident Management Process plan includes four necessary steps, which you send out in an email to Ops and Support staffs.

Teams, this is what we need to do immediately:

  • Define what an "SI" (Significant Incident) is
  • Create communication channels (e.g. conference bridge and real-time chat) to notify pre-designated people to work to restore service
  • Create a plan to restore service
  • Document the outage (to help prevent recurrence)

How IT Event Management Helps

Armed with this new understanding of how things should happen, you call a meeting with the entire Ops staff. Your Ops team lets you know that the four steps you shared with them (above) can be helped by efficient automation of IT Systems.

"What kind of automation?" you ask.

"A system that will help with the following," they say.

One by one, they proceed to add the following to the whiteboard:

Integrations:

Any system or app that generates events is an integration point that needs to be aggregated by an event management solution.

Filtering:

This removes the "noise" or irrelevant events from the event stream so that the Ops team can focus on events that might have significant impacts on the business.
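If a sketch helps, filtering could look something like the following. This is a minimal illustration rather than how any particular product works; the event fields, the 1-to-5 severity scale, and the source names are all assumptions made up for the example.

```python
from dataclasses import dataclass


# Hypothetical event shape; the field names are assumptions for illustration.
@dataclass(frozen=True)
class Event:
    source: str    # monitoring tool that raised the event
    node: str      # host or component the event is about
    message: str
    severity: int  # assumed scale: 1 = critical ... 5 = informational


def filter_events(events, max_severity=3, ignored_sources=frozenset()):
    """Keep events at or above the severity cutoff from sources we care about."""
    return [
        e for e in events
        if e.severity <= max_severity and e.source not in ignored_sources
    ]


events = [
    Event("monitor-a", "db01", "disk 95% full", 2),
    Event("monitor-a", "web03", "heartbeat ok", 5),
    Event("test-harness", "db01", "synthetic failure", 1),
]

kept = filter_events(events, ignored_sources=frozenset({"test-harness"}))
print([e.message for e in kept])
```

The informational heartbeat and the synthetic test event are dropped, so the Ops team only sees the disk warning.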

Deduplication:

This eliminates duplicated signals so that identical or similar events can be collapsed into a single event.
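One way to picture deduplication is as collapsing events that share a key into one record with a counter. The sketch below assumes that two events are "the same" when they share a node and a check name; that key, and the dictionary fields, are hypothetical choices for the example.

```python
def deduplicate(events, key_fields=("node", "check")):
    """Collapse events that share the same key into one record with a count."""
    merged = {}
    for event in events:
        key = tuple(event[field] for field in key_fields)
        if key in merged:
            # Repeat of an event we already hold: bump the counter.
            merged[key]["count"] += 1
            merged[key]["last_seen"] = event["time"]
        else:
            # First sighting: keep the event and start counting.
            merged[key] = {**event, "count": 1, "last_seen": event["time"]}
    return list(merged.values())


# Times are seconds since an arbitrary start (assumed for illustration).
events = [
    {"node": "db01", "check": "disk", "time": 100},
    {"node": "db01", "check": "disk", "time": 160},
    {"node": "web03", "check": "cpu", "time": 170},
    {"node": "db01", "check": "disk", "time": 220},
]

unique = deduplicate(events)
for record in unique:
    print(record["node"], record["check"], record["count"])
```

Four raw events become two records, and the repeated disk alert carries a count instead of paging someone three times.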

Correlation:

This allows related events to be grouped or linked, providing a cleaner view of events to the Ops team.
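Correlation can be sketched as grouping events that affect the same service and arrive close together in time. The version below is only one simple approach, assuming each event is already tagged with a service name and that a five-minute gap separates unrelated bursts; both are assumptions for illustration.

```python
def correlate(events, window_seconds=300):
    """Group events for the same service that arrive within window_seconds
    of the previous event in that service's current group."""
    groups = []
    current = {}  # service name -> most recent group for that service
    for event in sorted(events, key=lambda e: e["time"]):
        group = current.get(event["service"])
        if group and event["time"] - group[-1]["time"] <= window_seconds:
            group.append(event)
        else:
            # Gap too long (or first event for this service): start a new group.
            group = [event]
            groups.append(group)
            current[event["service"]] = group
    return groups


# Hypothetical events; times are seconds since an arbitrary start.
events = [
    {"service": "checkout", "node": "db01", "time": 0},
    {"service": "checkout", "node": "web03", "time": 120},
    {"service": "search", "node": "idx02", "time": 130},
    {"service": "checkout", "node": "web04", "time": 900},
]

groups = correlate(events)
for group in groups:
    print(group[0]["service"], [e["node"] for e in group])
```

The two early checkout events land in one group, while the much later checkout event starts a group of its own.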

Your Ops manager tells you that all of these can cut unnecessary, wasteful, and mistake-prone human interaction, which in turn reduces incident frequency and improves Mean Time to Recovery (MTTR).
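For reference, MTTR is just the average time from detecting an incident to restoring service. A small worked example, with made-up incident times in minutes:

```python
def mean_time_to_recovery(incidents):
    """Average minutes between detection and restoration across incidents."""
    durations = [i["restored_at"] - i["detected_at"] for i in incidents]
    return sum(durations) / len(durations)


# Hypothetical incidents; times are minutes since an arbitrary start.
incidents = [
    {"detected_at": 0, "restored_at": 90},
    {"detected_at": 10, "restored_at": 40},
    {"detected_at": 5, "restored_at": 65},
]

mttr = mean_time_to_recovery(incidents)
print(mttr)  # (90 + 30 + 60) / 3 = 60.0
```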

"You mean we'll actually get some sleep if we take care of these?" you ask.

“Some,” they reply. “But if you really want to sleep easy, and make our lives better as well, you should help us find a system that would help us prioritize ‘real’ events and incidents, automate root cause analysis, and even predict potential outages before they occur, so we can get in front of issues for a change.”

Incident Management with ServiceNow

The Ops manager nods and mentions leveraging ServiceNow. You don't get it. What does ServiceNow have to do with event and incident management?

“ServiceNow is deeply integrated into our IT Services and Support. And we can leverage ServiceNow to help us get up to speed without introducing some new system that overlaps or tries to compete with ServiceNow. We can implement a ServiceNow Incident Management system that produces ServiceNow Incidents.”

“Ok, anything else?” you ask.

The Ops Manager pushes a brochure across the table. “This company ‘Evanios’ has exactly what you're looking for,” he says. “With full ServiceNow integration and connections to other ITSM platforms, too. It uses machine learning and configured logic, which means we don’t need to do any custom scripting. And it is built on the ITIL Event Management framework, which promotes best practices.”

ITIL Incident Management with Evanios

ITIL? You took a basic exam on ITIL a few years ago but don't remember every detail. How does ITIL factor into this?

Fortunately, your newest Ops staff member has recently received advanced training. You ask him to tell you about it in layman's terms (you remember ITIL as theoretical).

He lays an ITIL Incident Management flow chart in front of you and proceeds to track the phases of a successful Incident Management process.

"Where did you get this?" you ask.

His finger falls on that name again. Evanios.

Evanios can present a "single pane of glass" view of what's happening, which helps prevent surprise outages. It helps Ops and Support communicate by giving both teams a fully integrated event management solution that (for once) automates incident management.

Making Ops Boring

Weeks later, after you've implemented Evanios on your ServiceNow platform, you find that you and your teams no longer fret over distinctions between events and incidents. Evanios helps everyone focus on managing the services that support the business.

You notice something else. No more late-night calls. No more customer complaints or losses.

In the most pain-free Ops meeting you've ever held, you review what's changed. You now have:

  • a single, unified platform to manage services including event and incident management
  • visibility using a single pane of glass dashboard that provides teams with instant status updates
  • complete integration of all your monitoring tools to prevent leakage or rogue alerts
  • major noise reduction with deduplication, correlation, and event hierarchies
  • automated resolution of low-level events that were previously handled manually
  • event prediction you've never seen before and
  • comprehensive IT Operations Analytics (ITOA) you've never compiled before.

You then share the news with your CIO and explain the impressive ROI you'll realize from this investment. It will reduce the operational budget, shorten development time, and cut the service disruptions that made customer loss almost routine.

The following week you pull all the teams together (Ops, Support, and Development) and review the continued progress. At the conclusion, you tell them: "Let's keep everything this boring, OK? I haven't slept this well in years."