A multinational financial services company was experiencing issues and lengthy repair and recovery times in their most critical lines of business.
Minor issues could impact a large customer base, while severe outages not only affected customers but also drew substantial negative publicity from web-based and conventional media outlets.
The process for resolving issues was lengthy and inefficient. Representatives from all areas of the business were called on to investigate within their domain and report back, a process that could take hours or even days. While issues were eventually resolved, the amount of time spent to simply identify the area causing the problem was extremely expensive to the organization. Once identified, additional time and effort was then required to determine the root cause, remediation procedures and steps to be taken to avoid a similar issue in the future. Unfortunately, due to limitations in their existing tools, identification of the root cause was not always possible.
The sheer size of the organization led to many common challenges throughout the implementation, but each line of business also presented unique challenges arising from its specific requirements.
In order to decrease the amount of time required to diagnose and resolve an issue, each line of business wanted multiple dashboards with various integrations and data feeds, including monitoring metrics. Across the organization, there were over fifty monitoring tools in place, many with different owners. Some of these tools were already integrated into a common event management system but even these were not integrated with dashboards in mind, which meant vital information was missing.
Each line of business had many different groups responsible for the various technology layers that made up their applications. Every group had numerous monitoring tools that required investigation to determine what was being monitored, whether any changes were needed and whether any gaps remained to be filled. With so many groups and tools in use, it was a time-consuming process to work with the various owners to perform the investigation, determine the necessary changes and implement them.
The overall application owners also presented challenges when it came to project requirements. Over the course of the implementation, requirements constantly changed due to new issues identified, changes in personnel, changes in management, etc. As the requirements changed, individual pieces of the project had to be reevaluated and potentially redone.
In such a large organization, the environment can be extremely complex, both technically (within an application and at the enterprise level) and organizationally. Normally simple tasks become much more difficult: obtaining architecture diagrams, obtaining credentials, understanding the relationships between technologies and identifying the correct personnel contacts all combine to make the deployment more difficult and threaten the success of the project.
The Credit Card line of business had experienced transaction slowdowns within its application but had difficulty pinpointing the root cause. Existing IBM Tivoli Monitoring (ITM) was in place for log file and custom script monitoring as well as MQ monitoring.
The goal was to provide real time Tivoli Business Service Manager (TBSM) dashboards that showed application status, as determined by events from both existing and new monitors, and that presented transaction performance statistics. The existing monitors required enrichment, and the new monitoring had to encompass each technology layer within the applications.
The Online Banking line of business had experienced severe outages affecting a large portion of its customers. The application covered many areas of technology and had monitoring in place using a variety of tools. However, no tool in use tied all of the data together to present a holistic view of the application to aid in problem identification, repair and resolution.
The goal was to gather the monitoring events, real time metrics and transaction performance data from various sources and use them to affect service status in TBSM dashboards, along with presenting the real time metrics. In addition, the dashboards were to serve as a launch point to other applications, such as Change Management, that would provide additional information.
The Branch Banking line of business had monitoring in place for the branch banks and their assets. However, it did not have the ability to view aggregate information at the regional level, to define business rules that determine the state of a branch based upon various types of events or to view the status of a branch's assets. In addition, certain banks had ongoing severe issues, and the line of business needed a way to view these high profile banks and monitor their state.
The goal was to provide TBSM dashboards at the regional, branch and asset levels, giving insight into the state of each, along with dashboards showing the state of the chronic banks.
Each project presented its own unique challenges, but some were common to all three. Identification of the key personnel was one of the first steps taken. Project stakeholders/decision makers, dashboard consumers, monitoring tool owners and technology owners all play a critical role, and the project's success and on-time delivery hinged on their involvement and understanding of the project's objectives.
As the projects progressed, there was continuous internal promotion of our activities, which led to the involvement of additional people who wanted to provide their input and requirements. As the project stakeholders were intimately involved, each desired change in scope or deliverable was presented along with the corresponding required change in time line. This ensured that any modifications to the deliverables or project plan were acknowledged and approved by the stakeholders.
The next step was designing the required dashboards for each line of business. Meetings were held with the consumers of the dashboards to determine layout, content and functionality. It is essential to determine the dashboard requirements at the beginning of the project, as they lead to the identification of the required monitoring, data sources, enrichment and integrations.
In each project, an assessment was performed to identify and review all existing monitoring and monitoring tools to identify gaps and determine if any changes or additions needed to be made. This also helped identify any enrichment activities that were needed to associate existing and new alerts with the various dashboard indicators.
The dashboards created for the Credit Card applications represented the detailed application transaction flows. Each transaction diagram required the corresponding events and metrics to be tied in to the corresponding levels within the application. To provide this level of granularity within the dashboards, the existing events required enhanced enrichment, the new monitoring required specific application setup and the real time dashboard metrics required specific information to be correctly correlated.
The existing ITM v5 monitoring was capturing log file alerts as well as application health information obtained via custom scripts. While already in place, it was not designed with dashboards in mind, and the alerts therefore did not contain enough information to be properly correlated with the dashboard services. To enrich the events properly, Impact was used to retrieve the missing data from existing and newly created databases.
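The enrichment logic itself lived in Impact policies, but its essence can be sketched as a lookup that adds service-mapping fields to each raw event. The sketch below is illustrative Python, not actual Impact policy language; the table contents, node names and field names are hypothetical.

```python
# Illustrative sketch, not actual Impact IPL: enrichment amounted to
# looking up service-mapping data keyed on fields already in the event
# and copying the results into the event before TBSM consumed it.

# Stand-in for the existing and newly created enrichment databases:
SERVICE_MAP = {
    "webhost01": {"Application": "CreditCard", "Component": "WebTier"},
    "mqhost01":  {"Application": "CreditCard", "Component": "Messaging"},
}

def enrich_event(event):
    """Add the fields TBSM needs to associate this alert with a service."""
    extra = SERVICE_MAP.get(event.get("Node"), {})
    enriched = dict(event)       # leave the original event untouched
    enriched.update(extra)
    return enriched

enriched = enrich_event({"Node": "mqhost01", "Summary": "Queue depth high"})
# enriched now carries Application and Component for service correlation
```

In the real deployment the lookup keys and returned fields were dictated by the TBSM service model, so the enrichment databases had to be built and maintained alongside it.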
ITM v6 monitoring was in place for distributed and mainframe MQ but, like the ITM v5 monitors, required enrichment. Additional ITM v6 agents were deployed to provide further application monitoring and fill the monitoring gaps, covering web servers, WebSphere servers and log files. Transactions were also monitored via the web response time and WebSphere transaction tracking agents, as well as via data from application log files.
Custom ITM v6 agents were created to monitor the log files using the Universal Agent and the Agent Builder (at the time of the project, the Agent Builder was not capable of performing all of the required calculations, so the Universal Agent had to be used in those situations).
The dashboards were built using Tivoli Business Service Manager (TBSM). The application flows were used to build the service model and display the data in a format easily understood by the application teams. The services were affected by the new and existing monitoring and real time statistics were collected via SOAP calls to ITM and presented on the dashboards. The ITM TBSM agent was also used to collect service information that was used in custom built Tivoli Common Reporting (TCR) reports to provide a history of the application’s services within TBSM.
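The real time statistics were pulled over the ITM SOAP interface; a minimal sketch of building such a request (using the hub SOAP server's CT_Get query method) follows. The table and attribute names are placeholders, not the site's actual values, and a real call would also carry credentials.

```python
# Sketch of an ITM SOAP query of the kind used to pull real time metrics.
# CT_Get with <object>/<attribute> children queries an attribute table;
# the names below are hypothetical.

def build_ct_get(table, attributes):
    attrs = "".join(f"<attribute>{a}</attribute>" for a in attributes)
    return (
        '<SOAP-ENV:Envelope '
        'xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">'
        "<SOAP-ENV:Body>"
        f"<CT_Get><object>{table}</object>{attrs}</CT_Get>"
        "</SOAP-ENV:Body>"
        "</SOAP-ENV:Envelope>"
    )

# The envelope would be POSTed to the hub's SOAP endpoint, conventionally
# http://<hub-tems>:1920///cms/soap (placeholder host).
envelope = build_ct_get("Web_Applications", ["Request_Rate", "Avg_Resp_Time"])
```

The response rows were then parsed and bound to the corresponding dashboard indicators.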
To meet the goals of the project, four main TBSM dashboards were created utilizing custom canvases: senior executive, Online Banking executive, Online Banking data center and Online Banking technical/operational. The technical dashboard also consisted of two additional sub-dashboards to display the technical details for two critical subcomponents of the application.
The senior executive dashboard displayed metrics retrieved from third party tools, with thresholds applied to them to drive status changes within the dashboard. The data applied not only to Online Banking but to other lines of business as well. The executive dashboard displayed similar data, but for areas specifically within the Online Banking domain. The data center dashboard displayed the same Online Banking specific metrics but also presented them broken down by data center.
The technical dashboard logically represented the flow of traffic from end user through the system’s technology layers all the way through to the mainframe systems. Metrics retrieved from various sources were displayed overall for Online Banking and at each technical layer, including change, risk, user and event data. Links from these metrics to the underlying tools were also included for context sensitive launch points. The sub-dashboards represented additional data flows within sub-domains of the Online Banking application.
All three of the technical dashboards had indicators that reflected various metrics and events that were received from numerous tools. The events were received through various probes and probe rules were utilized to provide TBSM some of the data needed for event to service mapping. Impact was utilized to perform additional event enrichment in order to complete the association of alerts and services.
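The probe rules themselves are written in the Netcool rules-file language, but the event-to-service mapping they performed amounts to parsing identifier fields out of the raw alert text. The Python below is a sketch of that logic; the "App/Layer: message" summary format and the field names are assumed conventions for illustration only.

```python
import re

# Sketch of the mapping logic the probe rules performed: derive the
# fields TBSM keys on (application and technology layer here, both
# hypothetical field names) from the raw alert summary.

EVENT_PATTERN = re.compile(r"(?P<app>\w+)/(?P<layer>\w+): (?P<msg>.+)")

def map_event(summary):
    """Split a raw summary into the service-identifying fields."""
    m = EVENT_PATTERN.match(summary)
    if not m:
        # Unrecognized events pass through unchanged, to be handled by
        # the downstream Impact enrichment instead.
        return {"Summary": summary}
    return {
        "Application": m.group("app"),
        "TechLayer": m.group("layer"),
        "Summary": m.group("msg"),
    }
```

Because the probe rules could only derive what was present in the alert text, Impact handled the remaining fields, as described above.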
Impact was also utilized to retrieve metrics from multiple third party tools and black box devices (such as load balancers), to provide event counts and to interrogate application servers to retrieve critical data to be displayed on the dashboards.
Within TBSM, underlying business rules were created to determine the effect that the metrics and events had on the various components on the dashboards. Multiple integration points from TBSM to other third party tools were also created using right-click menus and submenus. These included launching technology domain specific administrative URLs, launching secure shell sessions into administrative servers and performing context sensitive launches into the change management system.
Custom TCR reports were created to provide historical information about the dashboard indicators.
The branches' assets were monitored using a home grown monitoring and inventory management tool. The events from this tool were integrated into OMNIbus using custom probe rules so that they could be utilized to affect the TBSM services. The inventory data from the tool was used to create custom iDML books that were imported into TBSM via the Discovery Library Toolkit to auto-generate the branch banking service model, including the region, the bank and all assets in each branch. Custom Perl scripts were created to routinely parse the data and generate refresh iDML books.
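The book generation step can be sketched as reading the inventory hierarchy and emitting an XML document describing region, branch and asset. The original scripts were Perl and produced schema-conformant iDML; the Python below is a simplified illustration with placeholder element names and hypothetical inventory rows, not the real Discovery Library schema.

```python
import xml.etree.ElementTree as ET

# Simplified illustration of generating a discovery book from inventory
# data. Element names are placeholders, not the real iDML schema.

INVENTORY = {
    "Northeast": {"Branch-001": ["ATM-1", "Teller-1"]},
}

def build_book(inventory):
    """Emit an XML hierarchy of region -> branch -> asset."""
    book = ET.Element("book")
    for region, branches in inventory.items():
        r = ET.SubElement(book, "region", name=region)
        for branch, assets in branches.items():
            b = ET.SubElement(r, "branch", name=branch)
            for asset in assets:
                ET.SubElement(b, "asset", name=asset)
    return ET.tostring(book, encoding="unicode")

xml_text = build_book(INVENTORY)
```

Regenerating the book from inventory on a schedule, as the real scripts did, keeps the service model in step with branch openings, closures and asset changes without manual service edits.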
To display the required information, two main TBSM custom canvas dashboards were used, along with drill downs to scorecard views. The executive branch banking dashboard showed a high level summary of the number of branches in available, marginal and unavailable states within each region. Each region also showed the number of branches that had experienced recent issues and were marked for closer watch. Drill downs from this dashboard provided a scorecard view into each region, bank and asset, along with all alerts associated with each.
The chronic branch banking dashboard showed the percent and number of each of the six asset types across all chronic branch banks that were in the available, marginal and unavailable states. Drill downs from this dashboard provided a scorecard view into the assets within each bank, with roll-ups to the overall bank and region.
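The roll-ups described above can be sketched as a worst-child aggregation, a common default for service state roll-up (assumed here; the actual TBSM business rules may have been more elaborate), together with the per-state percentage shown for each asset type.

```python
# Sketch of a worst-child roll-up: an asset's state propagates to its
# branch and region. States mirror the available/marginal/unavailable
# levels on the dashboards; the aggregation rule itself is an assumption.

SEVERITY = {"available": 0, "marginal": 1, "unavailable": 2}

def roll_up(child_states):
    """A parent takes the worst state among its children."""
    if not child_states:
        return "available"
    return max(child_states, key=lambda s: SEVERITY[s])

def percent_in_state(states, state):
    """Share of assets in a given state, as shown per asset type."""
    return 100.0 * states.count(state) / len(states)

branch = roll_up(["available", "marginal", "available"])   # -> "marginal"
```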
Events from the custom monitoring tool were fed into OMNIbus via a custom log file agent. OMNIbus probe rules were then utilized to parse the information from the alert into the various fields required by the TBSM service model. Underlying business rules were created within TBSM to determine the effect that the events had on the various components on the dashboards.
The implementation of monitoring, event management and dashboards for an application is a complex process. Performing these tasks in a large organization with many groups involved, many existing tools and an ever-changing landscape of people, processes and technology can prevent the project from ever reaching the end state. The most critical step to ensure the project’s success is identifying the key personnel: stakeholders/decision makers, dashboard owners/consumers and tool owners.
The stakeholders/decision makers will help identify the project’s goals, deliverables and milestones and will make decisions regarding requested changes in scope. The development of the dashboards must be done with the owners and consumers of the dashboards to ensure that the final product meets the needs and requirements of those that will be using them. This may include multiple groups of people, from executives to application owners to operations staff. Integrating all tools into the dashboards and identifying and remediating monitoring gaps requires the coordinated and combined efforts of the implementers and the existing tool owners.
To meet these requirements, existing monitoring was reviewed and monitoring gaps were addressed through new ITM v5 and ITM v6 agents and monitors. Existing and new events were enriched through OMNIbus probe rules and automations and through Impact policies. Impact was also utilized to pull in key performance indicators from various tools for presentation.
All of this data was then presented in the dashboards that were designed by the business owners. The dashboards enabled the line of business to view the state of their application and the effect that various occurrences in the environment had on the application, along with key metrics, presented in a single interface. The dashboard also provided integrations with other tools for further diagnostics.
By improving the existing monitoring and adding new monitors, enriching events and presenting this data along with key metrics in business focused dashboards, the business and application owners were able to easily see how their applications were performing and quickly pinpoint problem areas. All of this led to easier identification of problems and their root causes and to quicker resolution times.