5 Important DevOps Monitoring Metrics


Many engineering organizations embrace the DevOps philosophy as a strategy for speeding software development and delivery while simultaneously improving the overall quality of the end product. DevOps seeks to do this by breaking down traditional organizational barriers between development and operations teams through what it defines as five key pillars of success:

  1. Reducing organizational silos
  2. Accepting failure as normal
  3. Implementing change gradually
  4. Tooling and automation
  5. Measuring everything

In this article, we're going to focus on pillar number five: measuring everything. We'll discuss five key metrics for DevOps software pipelines and explain how each of them can help measure software delivery performance.

Change Lead Time

Change lead time, sometimes simply called lead time, essentially tracks the time it takes for a project to go from inception to implementation. In agile software development, larger projects will generally be broken up into smaller sets of features or changes, so this metric doesn't necessarily mean you’re tracking the delivery of an entire software product all at once.

As requirements are gathered, the team iterates through the development of smaller sets of features. The time that elapses from the start of work on these features, through testing, to their delivery to production is the change lead time.
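
As a rough illustration, here is one way a team might compute change lead time in Python. This is a minimal sketch, not a standard definition: the `changes` records and their timestamps are made up, and we assume the clock starts when work on a change begins and stops when it reaches production. Some teams start the clock at the first commit instead.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical records: (work started, deployed to production).
changes = [
    (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 2, 15, 0)),
    (datetime(2024, 3, 4, 10, 0), datetime(2024, 3, 4, 16, 30)),
    (datetime(2024, 3, 5, 8, 0), datetime(2024, 3, 8, 8, 0)),
]

def lead_times(records):
    """Return the elapsed time from start of work to production for each change."""
    return [deployed - started for started, deployed in records]

def median_lead_time(records):
    """Return the median lead time as a timedelta."""
    return median(lead_times(records))

result = median_lead_time(changes)
```

With this sample data, the median lead time is 30 hours. The sketch uses the median rather than the mean so that a single slow change doesn't dominate the number. That is a design choice, not a requirement.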

Shorter lead times indicate the general health and agility of a team and its ability to quickly deliver new features to customers. They also indicate whether a team is able to rapidly adapt to feedback from customers and turn around new features or improvements. Lead times measured in hours, days, or weeks are better than those measured in months or years.

Along with helping to satisfy the DevOps pillar of measuring everything, this metric also helps quantify the extent to which teams are able to quickly implement smaller, more gradual changes.

Deployment Frequency

Lead time is one aspect of measuring a team's productivity and gauging how rapidly they can adapt to changing requirements. Deployment frequency, which measures how often a team delivers new features to production, is another.

Incomplete features that aren't delivered to production don't provide any value to customers or the organization. There is no benefit in terms of customer satisfaction, feedback for the team, or revenue if the software isn't actually delivered to those who need it.

Frequent deployments keep each release small, which reduces the likelihood that bugs will creep into production. As with change lead time, this metric also helps satisfy the DevOps pillar of implementing gradual yet rapid change.

A very productive team in an agile organization would ideally be able to deliver several releases per week, sometimes as often as several times per day. This normally indicates that the organization's automation and tooling (such as CI/CD pipelines) are up to the task of rapid software iteration and delivery.
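
Deployment frequency can be counted from nothing more than a list of deployment dates. The sketch below is one possible approach, assuming you can export those dates from your CI/CD tooling. The `deployments` list and the two function names are hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical production deployment dates (one entry per deployment).
deployments = [
    date(2024, 3, 4), date(2024, 3, 4), date(2024, 3, 6),
    date(2024, 3, 7), date(2024, 3, 12), date(2024, 3, 14),
]

def deployments_per_iso_week(dates):
    """Count deployments in each ISO (year, week) pair."""
    counts = Counter()
    for d in dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)

def average_per_week(dates, window_start, window_end):
    """Average deployments per week over an inclusive date window."""
    in_window = [d for d in dates if window_start <= d <= window_end]
    weeks = ((window_end - window_start).days + 1) / 7
    return len(in_window) / weeks

weekly = deployments_per_iso_week(deployments)
average = average_per_week(deployments, date(2024, 3, 4), date(2024, 3, 17))
```

For the sample data, this yields four deployments in ISO week 10 and two in week 11, or an average of three per week over the two-week window. Averaging over a fixed window, rather than over only the weeks that had deployments, keeps quiet weeks from being silently ignored.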

Change Failure Rate

Sometimes abbreviated as CFR, Change Failure Rate is the percentage of changes delivered to production that either failed or contained bugs or defects. Changes with severe defects frequently result in rollbacks, which can hurt other metrics such as uptime and revenue.
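
Put as arithmetic, CFR is the number of failed changes divided by the total number of changes delivered to production, multiplied by 100. A minimal sketch follows. The function name and the sample counts are illustrative, and what counts as a "failed" change is something each team has to define for itself.

```python
def change_failure_rate(total_changes, failed_changes):
    """Return the percentage of production changes that failed."""
    if total_changes <= 0:
        raise ValueError("total_changes must be positive")
    if not 0 <= failed_changes <= total_changes:
        raise ValueError("failed_changes must be between 0 and total_changes")
    return 100.0 * failed_changes / total_changes

# Hypothetical month: 40 changes shipped, 2 needed a rollback or hotfix.
cfr = change_failure_rate(total_changes=40, failed_changes=2)
```

In this example, CFR works out to 5 percent.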

In DevOps, failure is accepted as normal. Even so, the successful delivery of rapid, smaller releases is an indication of healthy DevOps processes as well as a healthy DevOps culture.

Frequent failures can be indicative of poor code quality, poor testing, too much pressure on teams to deliver features too quickly, or other problems within the software development process.

In a perfect world, CFR would be zero or very close to it. In reality, this number should probably be in the very low single digits. Double-digit CFR may mean the software is very buggy or the software release cycles are not short and small enough to support a higher rate of success.

Mean Time to Recovery

When failure is considered a fact of life, contingency plans need to be in place to deal with those failures. This is where Mean Time to Recovery, or MTTR, comes into play.

Some software deployments will fail. Ideally, this won't happen often, but when the failure is catastrophic, it's important to be able to recover quickly. MTTR is measured from the moment a problem is detected to the moment the system is restored to a baseline state. The time it takes to detect the problem in the first place is tracked separately as Mean Time to Detection (MTTD).
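
To make the calculation concrete, here is a small sketch that averages recovery times across incidents, measuring each one from detection to restoration. The `incidents` records and their field names are assumptions for illustration, not the export format of any particular tool.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: when each problem was detected and when
# service was restored to its baseline state.
incidents = [
    {"detected": datetime(2024, 3, 4, 10, 2), "recovered": datetime(2024, 3, 4, 10, 6)},
    {"detected": datetime(2024, 3, 9, 14, 6), "recovered": datetime(2024, 3, 9, 14, 16)},
    {"detected": datetime(2024, 3, 15, 8, 31), "recovered": datetime(2024, 3, 15, 8, 32)},
]

def mean_time_to_recovery(records):
    """Average the detection-to-recovery duration across incidents."""
    if not records:
        raise ValueError("at least one incident is required")
    durations = [r["recovered"] - r["detected"] for r in records]
    return sum(durations, timedelta()) / len(durations)

mttr = mean_time_to_recovery(incidents)
```

The three sample incidents took four, ten, and one minute to recover, giving an MTTR of five minutes.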

Recovery is typically accomplished through rollbacks. Rapid and easy rollbacks result in a lower MTTR – and the lower the MTTR, the better. Ideally, your MTTR will be measured in seconds.

If MTTR is longer than a few minutes, the organization may be running afoul of its Service Level Agreements (SLAs) with customers, which can result in ill will and lost revenue.

Better automation via CI/CD pipelines as well as solid incident response and rollback procedures can help reduce MTTR to acceptable levels.

Mean Time to Detection

If you can't detect a problem, you can't act to fix it. The average time between a problem appearing in production and its detection by either humans or automation is called Mean Time to Detection, or MTTD.

It is crucial that organizations are able to quickly detect problems or defects in production. This allows humans or automated systems to take action to correct the problem. MTTD is often reduced through the implementation of solid monitoring and observability best practices, tooling, and automation.

The longer it takes to detect a problem, the more likely it is that other changes will be piled on top of it, making it more difficult to sort out which change caused the problem in the first place.

The last thing a business wants to hear from its customers is that they found a problem before the business did. This can lead to embarrassment or, worse yet, to lost customers and reputation.
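
Both ideas in this section can be tracked from the same incident log: how long detection takes on average, and how often customers notice a problem before the business does. The sketch below assumes each incident records when the problem started, when it was detected, and who detected it. The field names and the `"alert"` and `"customer"` labels are hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical incident log with the source of each detection.
incidents = [
    {"started": datetime(2024, 3, 4, 10, 0), "detected": datetime(2024, 3, 4, 10, 2), "detected_by": "alert"},
    {"started": datetime(2024, 3, 9, 14, 0), "detected": datetime(2024, 3, 9, 14, 6), "detected_by": "customer"},
    {"started": datetime(2024, 3, 15, 8, 30), "detected": datetime(2024, 3, 15, 8, 31), "detected_by": "alert"},
    {"started": datetime(2024, 3, 20, 17, 0), "detected": datetime(2024, 3, 20, 17, 3), "detected_by": "alert"},
]

def mean_time_to_detection(records):
    """Average the delay between a problem starting and being detected."""
    if not records:
        raise ValueError("at least one incident is required")
    delays = [r["detected"] - r["started"] for r in records]
    return sum(delays, timedelta()) / len(delays)

def customer_detected_share(records):
    """Fraction of incidents that customers reported first."""
    if not records:
        raise ValueError("at least one incident is required")
    by_customer = sum(1 for r in records if r["detected_by"] == "customer")
    return by_customer / len(records)

mttd = mean_time_to_detection(incidents)
share = customer_detected_share(incidents)
```

For the sample log, MTTD is three minutes and one incident in four was reported by a customer first. In practice, the start time of a problem is often only known after the fact, so it usually has to be estimated during the post-incident review.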


While there are many metrics organizations can use to measure how effectively they are implementing DevOps, the ones above provide a good starting point.

In the end, it's all about the people, the culture, the technology, and the automation. Through the proper use of these metrics, a healthy and well-managed DevOps culture will help organizations realize faster, higher-quality software delivery with less friction and better team morale.

Guest blog courtesy of Sumo Logic.