DevOps appears to be here to stay, so from an Operations perspective, we need to ensure that all of the Development teams are playing together nicely and following some common rules.
I just realized that many Dev teams don’t fully understand the need for Ops when they’re implementing DevOps. Here are the foundational reasons, IMO, behind the need for Operations:
In many enterprises, applications never die, and customers continue to need support long after the original application development team has moved on. If applications don’t follow some basic standard practices, they can easily be forgotten by the people who need to support them – Operations. Developers want to move on to the next new thing, which is great for Dev, but horrible for Ops. There are numerous classifications of applications that can’t simply change on a whim due to factors such as regulatory control. Regulations affect a truly stunning number of companies, from utilities to taxis to manufacturing. Unless Dev is going to take responsibility for the support of their application over its entire lifespan (which can be 5 to even 20 years), Operations needs to be involved.
Integration With Other Applications
Applications need to talk to one another at some point. And when those connections fail, all involved application teams usually point fingers at one another. To minimize this finger-pointing, all applications should adhere to some common standards, several categories of which are found below. Even if all Development teams coordinate tightly in your company, there are still MANY external applications being used that need to be supported (e.g. WebSphere, Oracle). And the management of these applications needs to be coordinated with the in-house applications being built. Operations provides this management and coordination.
Application logging should be somewhat standardized to allow the log data to be collected and parsed for important information. This doesn’t mean they all need to log in exactly the same format, but they should all adhere to some best practices, such as:
Every log entry should have a timestamp and a unique identifier (such as transaction ID)
Logs should be human readable
Identify the source of the message
Avoid multi-line messages if possible
Use name-value pairs (possibly log in JSON format)
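The practices above can be sketched with Python's standard logging module. This is only a minimal illustration, not a prescribed format: the field names (timestamp, level, source, transaction_id, message) and the logger name billing-service are assumptions chosen for the example.

```python
import json
import logging
import uuid
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one line of JSON name-value pairs."""

    def format(self, record):
        entry = {
            # Timestamp in UTC, ISO 8601
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            # The logger name identifies the source of the message
            "source": record.name,
            # Hypothetical field, supplied by the caller through `extra`
            "transaction_id": getattr(record, "transaction_id", None),
            # Flatten embedded newlines to avoid multi-line messages
            "message": record.getMessage().replace("\n", " "),
        }
        return json.dumps(entry)


def build_logger(name):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("billing-service")  # hypothetical application name
    log.info("payment accepted", extra={"transaction_id": str(uuid.uuid4())})
```

Each entry stays on one line and is still readable by a person, while a log collector can parse it without application-specific rules.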
Applications NEED to be monitored at the very least for performance (response time) and availability (up/down). Ideally you want to have data collectors at each tier of a multi-tiered application to give you transaction topology and detailed monitoring data, but this can come later. At a bare minimum, all applications need to be monitored using some type of synthetic transactions, which run dummy/non-“real” transactions through the system to gather constant performance and availability metrics.
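A synthetic transaction can be as simple as a timed call that records whether it succeeded. The sketch below is one possible shape, not a reference to any particular monitoring product: ProbeResult, run_synthetic_transaction, and the URL in check_health_page are all hypothetical names invented for the example.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProbeResult:
    available: bool          # up/down
    response_seconds: float  # response time of the dummy transaction
    error: str = ""


def run_synthetic_transaction(transaction: Callable[[], None]) -> ProbeResult:
    """Run one dummy transaction and record availability and response time."""
    start = time.perf_counter()
    try:
        transaction()
    except Exception as exc:
        # Any failure counts as "down" for this probe
        return ProbeResult(False, time.perf_counter() - start, repr(exc))
    return ProbeResult(True, time.perf_counter() - start)


def check_health_page() -> None:
    """Example transaction; the URL is a placeholder, not a real endpoint."""
    url = "https://app.example.com/health"
    with urllib.request.urlopen(url, timeout=5) as response:
        if response.status != 200:
            raise RuntimeError(f"unexpected status {response.status}")
```

A scheduler would call run_synthetic_transaction(check_health_page) at a fixed interval and ship each ProbeResult to whatever stores the metrics.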
While many applications log information, there are parts of the infrastructure that can only send “events” to some remote destination. The most common types of events are “SNMP traps” (SNMP=Simple Network Management Protocol), which are generated by network equipment such as routers and switches. A cohesive Operations management strategy needs to handle both log files and events so that information can be correlated across different systems. For example, a JDBC call from an application may fail, but the application itself doesn’t know if this is a failure of the database itself, the network infrastructure or possibly even DNS misconfiguration. The event management function of the Operations group works on identifying these relationships in order to help perform Root Cause Analysis of incidents. This decreases the time and resources required to resolve an issue.
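To make the correlation idea concrete, here is a deliberately simplified sketch. It assumes that events arriving close together in time are related, and that the event from the lowest layer of a fixed dependency order is the most likely root cause. Both are assumptions made for illustration; real event management tools use far richer topology and rules. The Event type, LAYER_ORDER, and both functions are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical dependency order: earlier entries sit lower in the stack
LAYER_ORDER = ["dns", "network", "database", "application"]


@dataclass
class Event:
    timestamp: float  # seconds since epoch
    layer: str        # one of LAYER_ORDER
    source: str       # host or device that reported the event
    message: str


def correlate(events, window_seconds=60.0):
    """Group events that occur within window_seconds of a group's first event."""
    groups = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if groups and event.timestamp - groups[-1][0].timestamp <= window_seconds:
            groups[-1].append(event)
        else:
            groups.append([event])
    return groups


def probable_root_cause(group):
    """Pick the event from the lowest layer in the group (a heuristic only)."""
    return min(group, key=lambda e: LAYER_ORDER.index(e.layer))
```

With this heuristic, a JDBC failure logged by an application a few seconds after a switch reports a link-down trap would be grouped with that trap, and the trap would be flagged as the place to start looking.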
Who needs to be notified when “something” goes wrong? Do you want every application team to receive an emergency text in the middle of the night for every problem? Probably not. The Operations team is usually responsible for sending (and, more importantly, suppressing) the appropriate notifications. This is tightly related to Event Management and Root Cause Analysis.
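Suppression is often the harder half of notification. As a rough sketch under stated assumptions, the gate below pages only for alerts marked critical and stays quiet if the same alert already paged within a suppression window. The NotificationGate class, the severity labels, and the 15-minute default are invented for the example and are not taken from any specific tool.

```python
from dataclasses import dataclass, field


@dataclass
class NotificationGate:
    """Decide whether an alert should page someone right now."""

    suppress_seconds: float = 900.0  # assumed 15-minute quiet period
    _last_sent: dict = field(default_factory=dict)

    def should_notify(self, alert_key: str, severity: str, now: float) -> bool:
        # Assumption: only critical alerts page; others go to a ticket queue
        if severity != "critical":
            return False
        last = self._last_sent.get(alert_key)
        # Suppress repeats of the same alert inside the quiet period
        if last is not None and now - last < self.suppress_seconds:
            return False
        self._last_sent[alert_key] = now
        return True
```

Keying the gate on something like the root-cause event rather than on each symptom is what keeps one outage from turning into dozens of texts.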
Anyone who is responsible for handling a ticket needs to have some idea of what to do. Runbooks are sequences of steps an operator can run to gather more information and/or resolve an issue. Runbooks need to be maintained to ensure that they’re valid and up-to-date. Application teams often don’t have all of the experience needed to create comprehensive runbooks. Instead, runbooks are built up over time by the Operations staff, who are constantly handling issues.
In an enterprise, the ideal situation is that each user has ONE userid and password (or certificate, etc.) that they use to authenticate to all applications. This authentication storage mechanism needs to be maintained. This is another function provided by Operations.
DevOps is currently a very popular methodology, and it serves its purpose very well. It allows Development teams to continuously deploy applications to provide better business value. Operations is still required to perform quite a few functions that simply aren’t in the purview of Development.