A multinational financial services company was experiencing issues, along with lengthy repair and recovery times, in its most critical lines of business.
Minor issues could affect a large customer base, while severe outages not only affected customers but also received substantial negative publicity in web-based and conventional media outlets.
The process for resolving issues was lengthy and inefficient. Representatives from all areas of the business were called on to investigate within their domain and report back, a process that could take hours or even days. While issues were eventually resolved, the time spent simply identifying the area causing the problem was extremely expensive to the organization. Once the area was identified, additional time and effort were required to determine the root cause, the remediation procedures and the steps to be taken to avoid a similar issue in the future. Unfortunately, due to limitations in the existing tools, identification of the root cause was not always possible.
The initial focus was on three lines of business.
The sheer size of the organization led to many common challenges throughout the implementation, but each line of business also presented its own unique challenges due to its specific requirements.
In order to decrease the amount of time required to diagnose and resolve an issue, each line of business wanted multiple dashboards with various integrations and data feeds, including monitoring metrics. Across the organization, there were over fifty monitoring tools in place, many with different owners. Some of these tools were already integrated into a common event management system but even these were not integrated with dashboards in mind, which meant vital information was missing.
Each line of business had many different groups responsible for the various technology layers that made up their applications. Every group had numerous monitoring tools, each of which had to be investigated to determine what was being monitored, whether any changes were needed and where gaps remained to be filled. With so many groups and tools in use, working with the various owners to perform the investigation, determine the necessary changes and implement them was a time-consuming process.
The overall application owners also presented challenges when it came to project requirements. Over the course of the implementation, requirements constantly changed due to new issues identified, changes in personnel, changes in management, etc. As the requirements changed, individual pieces of the project had to be reevaluated and potentially redone.
In such a large organization, the environment can be extremely complex, both technically (within an application and at the enterprise level) and organizationally. Normally simple tasks become much more difficult: obtaining architecture diagrams, obtaining credentials, understanding the relationships between technologies and making the correct personnel contacts all combine to make the deployment harder and threaten the success of the project.
Credit Card Challenge
The Credit Card line of business had experienced transaction slowdowns within its application but had difficulty pinpointing the root cause. IBM Tivoli Monitoring (ITM) was already in place for log file and custom script monitoring as well as MQ monitoring.
The objective was to provide real-time Tivoli Business Service Manager (TBSM) dashboards that showed application status, as determined by events from both existing and new monitors, and presented transaction performance statistics. The existing monitors required enrichment, and the new monitoring had to encompass each technology layer within the applications.
Online Banking Challenge
The Online Banking line of business had experienced severe outages affecting a large portion of its customers. The application covered many areas of technology and had monitoring in place using a variety of tools. However, no tool in use tied all of the data together to present a holistic view of the application and aid in problem identification, repair and resolution.
The objective was to gather the monitoring events, real-time metrics and transaction performance data from various sources and use them to drive service status in TBSM dashboards, alongside a presentation of the real-time metrics. In addition, the dashboards were to serve as a launch point to other applications that would provide additional information, such as Change Management.
Branch Banking Challenge
The Branch Banking line of business had monitoring in place for the branch banks and their assets. However, it did not have the ability to view aggregate information at the regional level, to define business rules that determine the state of a branch based upon various types of events, or to view the status of a branch’s assets. In addition, certain banks had ongoing severe issues, and the line of business needed a way to view these high-profile banks and monitor their state.
The objective was to provide TBSM dashboards at the regional, branch and asset level, giving insight into the state of each level, along with dashboards showing the state of the chronic banks.
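The kind of roll-up described above, from asset events to branch state to regional state, can be illustrated with a small sketch. This is not TBSM configuration and not the rules used in the project; the status levels, the field names and the "region is bad when at least half of its branches are bad" threshold are all assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    """Hypothetical three-level service status, ordered by severity."""
    GOOD = 0
    MARGINAL = 1
    BAD = 2


@dataclass
class Event:
    """Hypothetical monitoring event raised against a branch asset."""
    branch: str
    asset: str
    severity: Status


def branch_status(events, branches):
    """A branch takes the worst severity reported by any of its assets."""
    result = {b: Status.GOOD for b in branches}
    for e in events:
        if e.branch in result:
            result[e.branch] = max(result[e.branch], e.severity)
    return result


def region_status(branch_states, region_of, bad_fraction=0.5):
    """Assumed rule: a region is BAD when at least `bad_fraction` of its
    branches are BAD, MARGINAL when any branch is not GOOD, else GOOD."""
    by_region = {}
    for branch, state in branch_states.items():
        by_region.setdefault(region_of[branch], []).append(state)

    result = {}
    for region, states in by_region.items():
        bad = sum(1 for s in states if s == Status.BAD)
        if bad / len(states) >= bad_fraction:
            result[region] = Status.BAD
        elif any(s != Status.GOOD for s in states):
            result[region] = Status.MARGINAL
        else:
            result[region] = Status.GOOD
    return result
```

Keeping the branch rule and the regional rule as separate functions mirrors the requirement that each level has its own view: the asset events feed the branch dashboard, and only the resulting branch states feed the regional one.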
Each project presented its own unique challenges, but some were common to all three. Identification of the key personnel was one of the first steps taken. “Project stakeholders/decision makers, dashboard consumers, monitoring tool owners and technology owners all play a critical role and the project’s success and on-time delivery hinged on their involvement and understanding of the projects’ objectives.”
As the projects progressed, there was continuous internal promotion of our activities, which led to the involvement of additional people who wanted to provide their input and requirements. As the project stakeholders were intimately involved, each desired change in scope or deliverable was presented along with the corresponding required change in timeline. This ensured that any modifications to the deliverables or project plan were acknowledged and approved by the stakeholders.
The next step was designing the required dashboards for each line of business. Meetings were held with the consumers of the dashboards to determine layout, content and functionality. It is essential to determine the dashboard requirements at the beginning of the project, as they lead to the identification of the required monitoring, data sources, enrichment and integrations.
In each project, an assessment was performed to review all existing monitoring and monitoring tools, identify gaps and determine whether any changes or additions were needed. This also helped identify the enrichment activities needed to associate existing and new alerts with the various dashboard indicators.
Dashboards, Monitoring & Agents
The dashboards created for the Credit Card applications represented the detailed application transaction flows. Each transaction diagram required the relevant events and metrics to be tied to the corresponding levels within the application. To provide this level of granularity within the dashboards, the existing events required enhanced enrichment, the new monitoring required specific application setup and the real-time dashboard metrics required specific information in order to be correctly correlated.
The existing ITM v5 monitoring was capturing log file alerts as well as application health information obtained via custom scripts. While already in place, it was not designed with dashboards in mind, and the alerts therefore did not contain enough information to be properly correlated with the dashboard services. Impact was used to enrich these events with data retrieved from existing and newly created databases.
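The enrichment step can be pictured as a database lookup keyed on something the raw alert already carries. The sketch below is written in Python with an in-memory SQLite table standing in for the lookup database; it is not an Impact policy, and the table name, column names and event fields (`host`, `service`, `tier`) are hypothetical.

```python
import sqlite3


def build_lookup(conn, rows):
    """Create a hypothetical host-to-service mapping table."""
    conn.execute(
        "CREATE TABLE host_service ("
        "host TEXT PRIMARY KEY, service TEXT, tier TEXT)"
    )
    conn.executemany("INSERT INTO host_service VALUES (?, ?, ?)", rows)
    conn.commit()


def enrich(event, conn):
    """Return a copy of the event with service and tier fields added.

    Events from hosts that are not in the table are marked UNMAPPED so
    that they can be reviewed rather than silently dropped.
    """
    row = conn.execute(
        "SELECT service, tier FROM host_service WHERE host = ?",
        (event["host"],),
    ).fetchone()
    enriched = dict(event)
    if row is None:
        enriched["service"] = "UNMAPPED"
        enriched["tier"] = "UNMAPPED"
    else:
        enriched["service"], enriched["tier"] = row
    return enriched


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build_lookup(conn, [("web01", "CardAuthorization", "web")])
    print(enrich({"host": "web01", "summary": "log error"}, conn))
```

Once an event carries a service identifier, a dashboard can match it to the right service indicator without having to parse the alert text.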
ITM v6 monitoring was in place for distributed and mainframe MQ but, like the ITM v5 monitors, required enrichment. Further ITM v6 agents were deployed to extend application monitoring and fill the monitoring gaps; these agents monitored web servers, WebSphere servers and log files. Transactions were also monitored via web response time and WebSphere transaction tracking agents, as well as via data from application log files.
Custom ITM v6 agents were created to monitor the log files using the Universal Agent and Agent Builder agents (at the time of the project, the Agent Builder was not capable of performing all of the required calculations, so the Universal Agent had to be used in these situations).
The dashboards were built using Tivoli Business Service Manager (TBSM). The application flows were used to build the service model and display the data in a format easily understood by the application teams. The status of the services was driven by the new and existing monitoring, and real-time statistics were collected via SOAP calls to ITM and presented on the dashboards. The ITM TBSM agent was also used to collect service information, which was used in custom-built Tivoli Common Reporting (TCR) reports to provide a history of the application’s services within TBSM.
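The text does not say which calculations were required. As one hypothetical example of deriving transaction statistics from an application log, the sketch below computes a count, average and maximum response time, and error rate per transaction type. The log line format, the field names and the `summarize` function are assumptions for illustration only and do not represent the Universal Agent or Agent Builder.

```python
import re
from collections import defaultdict

# Assumed log line format (hypothetical):
#   2020-01-01T00:00:00 txn=auth elapsed_ms=120 status=OK
LINE_RE = re.compile(
    r"^(?P<ts>\S+) txn=(?P<txn>\w+) elapsed_ms=(?P<ms>\d+) status=(?P<status>\w+)$"
)


def summarize(lines):
    """Aggregate per-transaction statistics from log lines.

    Lines that do not match the assumed format are skipped.
    """
    totals = defaultdict(
        lambda: {"count": 0, "errors": 0, "total_ms": 0, "max_ms": 0}
    )
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        stats = totals[m["txn"]]
        elapsed = int(m["ms"])
        stats["count"] += 1
        stats["total_ms"] += elapsed
        stats["max_ms"] = max(stats["max_ms"], elapsed)
        if m["status"] != "OK":
            stats["errors"] += 1

    return {
        txn: {
            "count": s["count"],
            "avg_ms": s["total_ms"] / s["count"],
            "max_ms": s["max_ms"],
            "error_rate": s["errors"] / s["count"],
        }
        for txn, s in totals.items()
    }
```

In practice a monitor of this kind would run over a sliding window of recent lines, so that the figures reflect current behavior rather than the whole file.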
Each line of business had a unique set of requirements for monitoring, event management, presentation and reporting.
The implementation of monitoring, event management and dashboards for an application is a complex process. Performing these tasks in a large organization with many groups involved, many existing tools and an ever-changing landscape of people, processes and technology can prevent the project from ever reaching the end state. The most critical step to ensure the project’s success is identifying the key personnel: stakeholders/decision makers, dashboard owners/consumers and tool owners.
The stakeholders/decision makers will help identify the project’s goals, deliverables and milestones and will make decisions regarding requested changes in scope. The development of the dashboards must be done with the owners and consumers of the dashboards to ensure that the final product meets the needs and requirements of those that will be using them. This may include multiple groups of people, from executives to application owners to operations staff. Integrating all tools into the dashboards and identifying and remediating monitoring gaps requires the coordinated and combined efforts of the implementers and the existing tool owners.
To meet these requirements, existing monitoring was reviewed and monitoring gaps were addressed through new ITM v5 and ITM v6 agents and monitors. Existing and new events were enriched through OMNIbus probe rules and automations and through Impact policies. Impact was also used to pull in key performance indicators from various tools for presentation.
All of this data was then presented in the dashboards that were designed by the business owners. The dashboards enabled the line of business to view the state of their application and the effect that various occurrences in the environment had on the application, along with key metrics, presented in a single interface. The dashboard also provided integrations with other tools for further diagnostics.
By improving the existing monitoring, adding new monitoring, enriching events and presenting this data along with key metrics in business-focused dashboards, the business and application owners were able to easily see how their applications were performing and quickly pinpoint problem areas. All of this led to easier identification of problems and their root causes and to quicker resolution times.