Energy Industry
A large energy company was experiencing issues and lengthy repair and recovery times in its most critical customer-facing application.
Overview
A large energy company had nearly zero visibility into the health of its critical customer billing application. Giving customers multiple ways to view and pay their bills leads to more satisfied customers and more bills paid on time. However, when these services are unavailable, customer aggravation quickly escalates and revenue is delayed. With limited insight into the application, diagnosing and troubleshooting issues was a constant challenge.
There were three key issues:
The Challenge
The existing monitoring consisted of limited transaction-level monitoring, database monitoring, network device monitoring and custom-built scripts that queried certain aspects of the application. The transaction and database monitoring were performed by separate third-party tools. However, there was no consistent base operating system monitoring, and other key aspects of the application were not monitored at all. The existing monitoring tools stored data in disparate locations, each accessible only to individuals with access to the database, file or web page where the data was held.
Alerts from these various tools were not centrally managed; each tool used its own processes and procedures for notification. Without a central event management system, it was not possible to correlate the alerts or integrate them with the incident management system.
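To illustrate why a central event format matters for correlation, here is a minimal sketch in Python. It is not the project's implementation: the payload shapes, field names, source labels and the 0 to 5 severity scale are all assumptions made for the example. It only shows that once alerts from separate tools share one record shape, they can be grouped by the node they refer to.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str    # tool that raised the alert
    node: str      # host or device the alert refers to
    severity: int  # assumed scale: 0 (clear) to 5 (critical)
    summary: str


def from_db_tool(raw: dict) -> Event:
    # Hypothetical payload shape for a database monitoring tool.
    return Event("db_monitor", raw["host"].lower(), raw["level"], raw["message"])


def from_network_tool(raw: dict) -> Event:
    # Hypothetical payload shape; this tool reports severity as a word.
    levels = {"info": 2, "warning": 3, "critical": 5}
    return Event("net_monitor", raw["device"].lower(), levels[raw["sev"]], raw["text"])


def group_by_node(events):
    # Correlate events from different tools by the node they share.
    grouped = {}
    for event in events:
        grouped.setdefault(event.node, []).append(event)
    return grouped
```

With both tools normalized this way, a database alert and a network alert for the same host land in the same group, which is the precondition for correlating them or opening a single incident.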
Due to these shortcomings in the existing environment, the application owners wanted consistent base health monitoring across all servers, additional application level monitors, centralized event management and access to all metrics retrieved by the various tools, new and old.
In addition, the business needed a tool that would display the health and status of the application. These dashboards would enable them to understand how each area of the application was performing, the health of each area and how each area was affected by issues in the environment.
The aim was to enable quicker diagnosis of issues, preferably before customers called, along with quicker resolution times and easier identification of root cause.
What was one of the first steps to fix these issues?
Solution
When a project requires business-level dashboards fed by existing and new alerts and metrics, one of the first steps is to design the dashboards. The design then dictates the areas requiring monitoring, the required data feeds and the required metrics. It is essential that the design of the dashboards is guided by the implementer and that the overall design is created by the application owner or another dashboard consumer, such as the operations center. This ensures that the implementation meets the needs of the dashboard users. Three dashboards were created for the customer billing application: an executive dashboard and two technical dashboards.
Executive Dashboard
The executive dashboard was designed to present a high-level status of three critical areas: transaction times, the phone network and the connection between the call centers and the data center. Impact was used to retrieve the transaction times from the application's database, and SmartCloud Application Performance Management (SCAPM) custom agents were created to test the connectivity of the phone network and of the connections between the call centers and the data center. Using this dashboard, executives could quickly determine whether any of these key systems were adversely impacting their customers.
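The case study does not say how the custom agents tested connectivity. As one possible illustration, the sketch below performs a simple TCP connect check in Python and reports whether the endpoint was reachable and how long the attempt took. The function name, the result fields and the choice of a TCP connect test are assumptions for the example, not the SCAPM agent's actual behavior.

```python
import socket
import time


def check_tcp(host: str, port: int, timeout: float = 3.0) -> dict:
    """Attempt a TCP connection and report reachability and elapsed time."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            reachable = True
    except OSError:
        reachable = False
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return {
        "host": host,
        "port": port,
        "reachable": reachable,
        "elapsed_ms": elapsed_ms,
    }
```

A scheduler could run such a check against each call center link at a fixed interval and raise an alert when `reachable` is false or `elapsed_ms` exceeds an agreed threshold.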
Technical Dashboards
The first technical dashboard was designed to show the impact that various metrics and alerts had on the customer service application. The dashboard displayed the status of the five key areas of customer service: phones, web, kiosks, call centers and customer offices. To determine the status of these areas, SCAPM operating system, application-specific and custom agents were deployed, along with synthetic scripts that simulated end users. SCAPM was used to generate alerts for the various resources and to collect the metrics into the data warehouse for reporting. The existing network and database monitoring tools were integrated by deploying the necessary OMNIbus probes. To ensure the alerts could be correlated to the correct services in Tivoli Business Service Manager (TBSM), the alerts were enriched using probe rules and Impact policies.
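Enrichment here means adding to each alert the information needed to tie it to a service. The sketch below shows the idea in plain Python rather than in probe rules or Impact policy syntax; the hostnames, the `service` field name and the `unassigned` fallback are hypothetical.

```python
# Hypothetical mapping from monitored node to the service area it supports.
SERVICE_BY_NODE = {
    "web01": "web",
    "ivr01": "phones",
    "kiosk-gw01": "kiosks",
}


def enrich(alert: dict, lookup: dict = SERVICE_BY_NODE) -> dict:
    """Return a copy of the alert with a 'service' field added."""
    enriched = dict(alert)  # leave the original alert untouched
    node = alert["node"].lower()
    # Alerts from unknown nodes are flagged rather than dropped.
    enriched["service"] = lookup.get(node, "unassigned")
    return enriched
```

Keeping the mapping in one lookup table means a newly deployed server only needs one new entry before its alerts appear under the right service.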
Indicators for critical pieces of the application that spanned the five key areas were also included directly on the dashboard. These indicators were for items such as databases, connections to external credit card processing companies and phone networks. As before, these were monitored using SCAPM agents.
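One common way to turn alerts into a per-area status is to take the worst open severity in each area. The sketch below assumes each alert already carries a `service` field naming one of the five areas and a numeric `severity`; the field names, the 0 to 5 scale and the color thresholds are assumptions for illustration, not the rules configured in the project.

```python
AREAS = ["phones", "web", "kiosks", "call centers", "customer offices"]


def status_label(severity: int) -> str:
    # Assumed thresholds: 5 is critical, 3 to 4 is degraded.
    if severity >= 5:
        return "red"
    if severity >= 3:
        return "yellow"
    return "green"


def area_status(alerts) -> dict:
    """Roll open alerts up to one status per area using the worst severity."""
    worst = {area: 0 for area in AREAS}
    for alert in alerts:
        area = alert.get("service")
        if area in worst:
            worst[area] = max(worst[area], alert["severity"])
    return {area: status_label(sev) for area, sev in worst.items()}
```

A shared component such as a database could be handled by emitting one alert per area it supports, so that a single failure colors every affected area.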
In addition to alerts, the dashboard also displayed the number of payments made via each of the five areas. While this information was already held within the application's database, it was not readily accessible by the application owners and operations center.
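Surfacing payment counts of this kind typically comes down to a grouped query against the application's database. The sketch below uses an in-memory SQLite database so it runs on its own; the `payments` table, its columns and the sample rows are invented for the example and do not reflect the real application's schema.

```python
import sqlite3


def payments_by_channel(conn: sqlite3.Connection) -> dict:
    """Count payments per channel (hypothetical 'payments' table)."""
    rows = conn.execute(
        "SELECT channel, COUNT(*) FROM payments "
        "GROUP BY channel ORDER BY channel"
    ).fetchall()
    return dict(rows)


# Stand-in data; a real deployment would query the application's database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id INTEGER PRIMARY KEY, channel TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO payments (channel, amount) VALUES (?, ?)",
    [("web", 50.0), ("web", 20.0), ("kiosks", 75.5)],
)
counts = payments_by_channel(conn)
```

Running the query on a schedule and publishing the result to the dashboard gives owners the figures without granting them direct database access.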
The goal of this project was to integrate the existing data and implement new monitoring to fill the monitoring gaps and to present everything on easy-to-use dashboards that were available to the application owners and operations staff.
Conclusion
A new implementation of monitoring, event management and dashboards is normally performed in one of two ways:
Horizontally
Monitoring and event management are deployed to the enterprise and dashboards are deployed for select applications.
Vertically
Monitoring, event management and dashboards are deployed one application at a time.

In this project, the components were deployed vertically for the customer service application. The implementation consisted of designing the executive and technical dashboards, implementing new operating system and application-specific monitoring, performing event correlation and enrichment, and integrating existing monitoring tools into the event management and dashboard solutions.
Before the project, the customer service application had minimal monitoring in place and no consistent interface through which the application owners could visualize the data being monitored and collected.
With SCAPM, OMNIbus, Impact and TBSM in place, the goals of the project were met: the application owners and operations staff could see the state of their application and access all monitored metrics from one interface, in real time.