
Definition of Infrastructure Monitoring

Infrastructure Monitoring refers to the systematic collection of data within an infrastructure (your basic computing framework). This collected data is used to provide alerts on unexpected downtime, network intrusion, and resource saturation.

Monitoring helps make operations processes auditable, which is essential for forensic investigations and for getting to the root cause of problems via an RCA (Root Cause Analysis). Monitoring therefore supports the objective analysis of Operations practices and of IT processes in general.

Infrastructure monitoring is a critical component of IT infrastructure management. It includes network traffic monitors, which are used for network performance monitoring across all IT infrastructure components.

Key Concepts & Terms

  • Active Monitoring: Refers to systems that collect data by interacting directly with the systems monitored. Administrators need to weigh the cost of the monitoring vs. the overall value of the test. An example of an active test is an agent that tests response times on production databases
  • Alert: A notification about an event that is "captured" by a monitoring system. It is produced when a data stream surpasses a preset threshold (definition: see below). Alerts are typically configurable. Monitoring systems typically send alerts to varied levels of administrators; various thresholds trigger different types of alerts
  • False Negative: An event that a monitoring system does not detect. False negatives occur when tests are not sensitive enough to detect possible issues or do not run at proper intervals. False negatives are critically important and can seriously impede the usefulness of any monitoring system
  • False Positive: An event that crosses the monitoring threshold but does not indicate an operational problem. Incorrectly configured monitoring systems cause false positives. Not only are false positives bothersome, but they increase the likelihood of "Crying Wolf" -- decreasing the effectiveness of similar alerts because operations, support and users become more likely to disregard genuine issues
  • Nagios: Now Nagios Core, a widely used open source (free) application that monitors systems, networks, and infrastructure
  • Passive Monitoring: Refers to monitoring systems that collect data by reviewing data already generated. This data is collected from logs or “traps” or from messages relayed by the monitored system to a passive agent. Collecting syslog messages (see below) is a type of passive monitoring
  • Syslog: A logging format standard that originated with BSD Unix utilities (e.g. sendmail). Syslog is often underused despite its ubiquity, because many applications use their own logging capabilities instead
  • Threshold: A preset configuration that marks a boundary for operation -- outside of which the system is not expected to function. Thresholds are constantly “tuned” to prevent false positives or false negatives (see above for definitions)
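
To make the relationship between thresholds and alerts concrete, here is a minimal Python sketch. It assumes a single numeric metric with two alert levels; the `Threshold` class, the `evaluate` function, and the `cpu_percent` values are hypothetical names chosen for illustration, not part of Nagios or any other tool mentioned here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical threshold with two alert levels (illustrative only)."""
    metric: str
    warning: float   # samples above this value raise a "warning" alert
    critical: float  # samples above this value raise a "critical" alert


def evaluate(threshold: Threshold, value: float) -> Optional[str]:
    """Return the alert level for one sample, or None if it is within bounds."""
    if value > threshold.critical:
        return "critical"
    if value > threshold.warning:
        return "warning"
    return None


# Assumed example values: CPU utilization in percent.
cpu = Threshold(metric="cpu_percent", warning=80.0, critical=95.0)
samples = [42.0, 85.5, 97.2]
alerts = [(value, evaluate(cpu, value)) for value in samples]

for value, level in alerts:
    if level is not None:
        print(f"{level.upper()}: {cpu.metric} = {value}")
```

In a sketch like this, "tuning" means adjusting `warning` and `critical`: setting them too low tends to produce false positives, while setting them too high tends to produce false negatives.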

Integrating Infrastructure Monitoring Data into an AIOps Platform

Infrastructure monitoring depends on an effective monitoring platform. Monitoring platforms listen for, gather, and correlate events from critical applications and their underlying IT environment. A well-defined platform empowers system admins to migrate dynamically to another technology or architecture that scales on demand. Such a platform also helps with monitoring servers, monitoring systems, and monitoring tools.

The Cure for Infrastructure Monitoring Blues "Reds"

Any Ops team that sees too much red on their infrastructure monitoring display can definitely get the “Reds.” The following can help.

Infrastructure Monitoring is about Managing the Flood

Monitoring systems and other point management tools produce a never-ending stream of events, the vast majority of which are irrelevant. It's like “drinking from a fire hose.” These events have to be separately analyzed and then turned into incidents when there's a real issue. In fact, the volume of events is so high that many of the critical events are lost in the chaos. Often, first-level support teams only find out about service issues when users start to scream.

Evanios eliminates event noise in real time. It normalizes, filters, deduplicates, and correlates data from multiple sources, applying advanced noise reduction methodologies to reduce millions of events to a few hundred relevant, actionable events.
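
As a rough illustration of what normalizing, filtering, and deduplicating events can look like, here is a minimal Python sketch. It is not Evanios's implementation. The event fields (`host`, `check`, `severity`), the alias table, and the function names are all assumptions made for the example, and correlation across sources is left out.

```python
from typing import Any, Dict, Iterable, List, Tuple

# Assumed mapping from tool-specific severity labels to a common vocabulary.
SEVERITY_ALIASES = {
    "crit": "critical", "critical": "critical",
    "warn": "warning", "warning": "warning",
    "info": "info", "ok": "info",
}


def normalize(raw: Dict[str, str]) -> Dict[str, str]:
    """Bring one raw event into a common shape (hypothetical field names)."""
    severity = raw.get("severity", "info").strip().lower()
    return {
        "host": raw.get("host", "unknown").strip().lower(),
        "check": raw.get("check", "unknown").strip().lower(),
        "severity": SEVERITY_ALIASES.get(severity, "info"),
    }


def reduce_noise(raw_events: Iterable[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Normalize events, drop informational ones, and collapse duplicates."""
    seen: Dict[Tuple[str, str, str], Dict[str, Any]] = {}
    for raw in raw_events:
        event = normalize(raw)
        if event["severity"] == "info":
            continue  # filter: informational events are not actionable
        key = (event["host"], event["check"], event["severity"])
        if key in seen:
            seen[key]["count"] += 1  # deduplicate: count repeats instead
        else:
            seen[key] = {**event, "count": 1}
    return list(seen.values())


raw = [
    {"host": "DB01", "check": "disk", "severity": "CRIT"},
    {"host": "db01 ", "check": "Disk", "severity": "critical"},
    {"host": "web02", "check": "ping", "severity": "ok"},
    {"host": "web02", "check": "cpu", "severity": "warn"},
]
reduced = reduce_noise(raw)
for event in reduced:
    print(event)
```

Here four raw events collapse to two: the two disk events on the same host merge into one entry with a count of 2, and the informational ping event is dropped.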


... It's also about Ordering Events

Most often, the IT Ops team doesn't have the business context and other relevant information that they need to prioritize events - or the clear diagnostic information necessary to resolve them quickly. For example, is a server issue affecting the company's e-commerce website, or is it just marginally affecting the compute capacity of a cluster?

IT Ops teams have been steadily shrinking due to the mandate to do more with less. With this reduced level of staffing, it is nearly impossible for the IT Ops team to manually sift through all events and prioritize the ones that they should be working on. They should also be focusing on high-value tasks rather than routine ones.

Evanios scores every event, to help you put first things first. Using tunable machine learning algorithms, configured logic, historical context, CMDB relationships, predictive traits and impacted relationships, Evanios scoring logic tells you exactly where to spend your time.
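
To show the general idea of ordering events by business impact, here is a minimal Python sketch. It does not reproduce Evanios's scoring logic, which is not described in detail here. It assumes a simple score equal to a severity weight multiplied by the criticality of the affected service, and the lookup tables stand in for CMDB relationships. All names, hosts, and weights are hypothetical.

```python
from typing import Dict, List, Optional

# Assumed weights; a real system would tune or learn these.
SEVERITY_WEIGHT: Dict[str, float] = {"warning": 1.0, "critical": 3.0}

# Stand-ins for CMDB data: which service a host supports, and how much
# that service matters to the business (hypothetical values).
HOST_TO_SERVICE: Dict[str, str] = {
    "web01": "ecommerce-web",
    "node17": "batch-cluster",
}
SERVICE_CRITICALITY: Dict[str, float] = {
    "ecommerce-web": 5.0,
    "batch-cluster": 1.0,
}


def score(event: Dict[str, str]) -> float:
    """Score one event as severity weight times service criticality."""
    service: Optional[str] = HOST_TO_SERVICE.get(event["host"])
    criticality = SERVICE_CRITICALITY.get(service, 1.0)  # unknown host: 1.0
    return SEVERITY_WEIGHT.get(event["severity"], 0.0) * criticality


def prioritize(events: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return events ordered from highest to lowest score."""
    return sorted(events, key=score, reverse=True)


events = [
    {"host": "node17", "severity": "critical"},
    {"host": "web01", "severity": "warning"},
    {"host": "web01", "severity": "critical"},
]
ranked = prioritize(events)
for event in ranked:
    print(score(event), event)
```

With these assumed weights, a warning on the e-commerce web server outranks a critical event on a single cluster node, which mirrors the question of whether a server issue affects the website or only marginally reduces compute capacity.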

Effective Root Cause Analysis is Key to Infrastructure Monitoring

Traditional methods of finding root cause no longer work effectively, as they were designed assuming that the IT environment is mostly static and does not change often.

Today's IT environment is dynamic and highly complex. Software Defined Networking and Network Function Virtualization, two of the building blocks of the modern IT environment, require a management technology that is highly adaptive. Evanios finds and scores a small set of probable causes based on machine learning, historical probability, CMDB relationships, change management actions, automation actions, and temporal alignment.

Let Evanios support your infrastructure monitoring solution(s). Check out our pre-packaged integrations with tools like Datadog, Nagios, Opsview, SolarWinds, Zabbix, and Zenoss. And contact us for a demo if you would like to see more.