Without holistic service modeling, there's no way to correlate all factors. Moreover, all service components may be within accepted tolerances right now, but the IT service may be at risk because performance is degrading and heading toward a fiasco. Service modeling that includes a proactive, predictive view is required.
Proactive performance management means correcting degradations before they noticeably impact user experience or automated business processes. Instrumentation must be in place to monitor infrastructure performance indicators, business transactions and SLA compliance; policies must be in place to trigger early warnings of degradation trends before they slow transactions or result in a service-crippling outage.
Key performance indicators (KPIs) for infrastructure -- there are hundreds to choose from -- are measured in terms of time over threshold and deviation from normal.
Time-over-threshold rules should be implemented in a way that warns of negative KPI trends. That means indicating persistent conditions and not every transient spike or drop. So rather than sending an alarm every time a switch port reaches 100% utilization and discards packets, the threshold rules should trigger an alarm only if the port discards packets for a cumulative period of, say, 20 minutes, within an hour-long reporting window.
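The switch-port example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the fixed 60-second polling interval and the sample format are hypothetical, and the 20-minutes-in-an-hour figures come from the example rather than from any product's defaults.

```python
from collections import deque


def time_over_threshold(samples, threshold, interval_s=60,
                        window_s=3600, min_over_s=1200):
    """Yield (timestamp, alarm) for each sample.

    samples: iterable of (timestamp_s, value) pairs, assumed to be polled
    every interval_s seconds (an assumption of this sketch).
    The alarm is True only when the KPI has been over the threshold for a
    cumulative min_over_s seconds within the last window_s seconds.
    """
    window = deque()   # (timestamp, is_over) for samples still in the window
    over_count = 0     # number of over-threshold samples in the window
    for ts, value in samples:
        is_over = value > threshold
        window.append((ts, is_over))
        over_count += is_over
        # Drop samples that have aged out of the reporting window.
        while window and window[0][0] <= ts - window_s:
            _, old_over = window.popleft()
            over_count -= old_over
        yield ts, over_count * interval_s >= min_over_s


# One transient burst of discards in an hour never raises the alarm.
spike = [(i * 60, 100 if i == 10 else 0) for i in range(60)]
print(any(alarm for _, alarm in time_over_threshold(spike, threshold=0)))
```

A port that discards packets in one polling interval out of every three, by contrast, accumulates 20 minutes over threshold within the hour and trips the rule.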
Deviation-from-normal rules, combined with time-over-threshold rules, trigger early warnings when KPIs persistently deviate from their historical behavior at the same point in the business cycle. Higher than normal traffic on a router interface may indicate a runaway program taxing a remote server. Higher than normal CPU utilization on a server or excessive database checkpointing informs operators of an impending degradation.
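As a rough illustration of how the two kinds of rule might be combined, the sketch below builds a baseline per business-cycle slot (for example, hour of the week) and warns only when readings stay outside that baseline for several polls in a row. The function names, the three-standard-deviation band and the four-poll persistence requirement are all assumptions chosen for the example, and counting consecutive polls is a simplification of a full time-over-threshold rule.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Return {slot: (mean, stdev)} from (slot, value) history pairs.

    A slot is any label for a point in the business cycle, such as an
    hour-of-week number. Slots with fewer than two values are skipped.
    """
    by_slot = {}
    for slot, value in history:
        by_slot.setdefault(slot, []).append(value)
    return {s: (mean(v), stdev(v)) for s, v in by_slot.items() if len(v) >= 2}


def persistent_deviation(readings, baseline, k=3.0, min_consecutive=4):
    """True if readings deviate from the baseline for min_consecutive polls.

    readings: iterable of (slot, value) pairs in time order.
    A reading deviates when it is more than k standard deviations away
    from the historical mean for its slot.
    """
    run = 0
    for slot, value in readings:
        mu, sigma = baseline[slot]
        deviates = abs(value - mu) > k * max(sigma, 1e-9)
        run = run + 1 if deviates else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical CPU utilization history for one slot (say, Monday 9 a.m.).
baseline = build_baseline([(9, 40), (9, 42), (9, 38), (9, 41), (9, 39)])
print(persistent_deviation([(9, 80)] * 4, baseline))
```

Readings that bounce back to normal between spikes reset the run and do not raise a warning.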
In the application environment, depletion of threads and pooled objects, memory leaks, Java Database Connectivity driver-database version mismatch and bad coding can have significant effects on an IT service. To manage this, baselining and heuristic trending must be applied to enable predictive alarming.
By monitoring 100% of the business transactions that traverse a network, IT gets a real understanding of user experience and transaction success (completion). Watching response trends to warn of impending SLA violations helps IT be proactive.
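One simple way to watch a response trend is to fit a straight line to recent response times and project when it would cross the SLA limit. The sketch below does that with an ordinary least-squares fit. It is a stand-in for whatever heuristic trending a real monitoring tool applies; the function name, the sample format and the SLA value passed in are assumptions for illustration.

```python
def minutes_until_breach(samples, sla_ms):
    """Project how many minutes remain before the trend crosses sla_ms.

    samples: list of (minute, response_ms) pairs in time order.
    Returns 0.0 if the fitted trend is already at or above the SLA,
    None if there is no upward trend (or too little data), otherwise the
    projected minutes from the last sample until the SLA is reached.
    """
    n = len(samples)
    if n < 2:
        return None
    xs = [x for x, _ in samples]
    ys = [y for _, y in samples]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        return None  # all samples share one timestamp; no trend to fit
    # Least-squares slope (ms per minute) and intercept.
    slope = sum((x - x_bar) * (y - y_bar) for x, y in samples) / sxx
    intercept = y_bar - slope * x_bar
    current = intercept + slope * xs[-1]  # fitted value at the last sample
    if current >= sla_ms:
        return 0.0
    if slope <= 0:
        return None
    return (sla_ms - current) / slope


# Response time creeping up by 1 ms per minute from 300 ms.
recent = [(0, 300), (10, 310), (20, 320), (30, 330)]
print(minutes_until_breach(recent, sla_ms=400))
```

An early warning policy could then fire whenever the projected time to breach falls below some lead time that operations needs to react.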
When user response is above its prescribed 400-millisecond SLA, how does IT most efficiently pinpoint the cause? Cross-silo IT service modeling is the best practice. First, it gives IT visibility of the infrastructure components and application elements that comprise the service. Second, it provides predictability by letting IT see the performance trends of the infrastructure and application elements that will affect the overall IT service in the future. Third, it helps IT identify root causes by correlating events that have affected service uptime or performance.
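To make the idea concrete, a service model can be pictured as a dependency map that ties a service to the infrastructure and application elements beneath it. The sketch below walks such a map and returns only the alarmed elements that the slow service actually depends on. The service name, component names and function name are hypothetical, and real service-modeling tools hold far richer relationships than this.

```python
def suspects(model, service, alarmed):
    """Return alarmed components the service depends on, at any depth.

    model: dict mapping a node to the list of nodes it depends on.
    alarmed: set of nodes with active warnings, from any silo.
    """
    seen, stack, found = set(), [service], []
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in alarmed and node != service:
            found.append(node)
        stack.extend(model.get(node, []))
    return sorted(found)


# Hypothetical cross-silo model of one IT service.
model = {
    "order-entry": ["web-tier", "app-server"],
    "app-server": ["jdbc-pool", "db"],
    "db": ["san-switch"],
}
alarmed = {"jdbc-pool", "san-switch", "mail-relay"}
print(suspects(model, "order-entry", alarmed))
```

The warning on the unrelated mail relay is filtered out, which leaves operators with a short list of candidates to investigate.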
Network operations center personnel are in a good position to take advantage of this holistic view. All business transactions flow back and forth across the network -- connecting users to application servers, servers to back-end systems, databases and Web services. Learning IT service availability management best practices that combine service modeling across all silos with proactive performance monitoring puts them in the catbird seat.