DevOps metrics: what to track, how and why to do it

Every business using DevOps workflows, tools and culture wants to ensure the most cost-efficient allocation of its resources. Our article explains which DevOps metrics to track, what tools to use and how to do it.

Vladimir Fedak
10 min read · Feb 10, 2020


DevOps means different things to different people, as everybody fills the term with meaning based on their own experience. For the most part, DevOps is perceived as a methodology for improving the cost-efficiency and predictability of software development and infrastructure management, as well as a culture that fosters communication and collaboration within teams to encourage innovation and creativity. However, businesses prefer to know how their money is spent, so monitoring DevOps metrics is essential. Today we explain where to look, what to keep track of (and what to avoid), how to do it and why it is worth doing.

In 2017 Gartner published a report titled “Data-Driven DevOps: Use Metrics to Guide Your Journey”, where the following pyramid of DevOps metrics was presented.

As you can see, DevOps workflows affect the whole range of business objectives, so the corresponding metrics can be gathered from a variety of sources to analyze overall business efficiency. Some can be obtained using dedicated tools, while others come only from polls and surveys. Tracking them is essential for keeping a finger on the pulse of your business, but how do you do it correctly?

One of the most important prerequisites of long-term business success is the applicability of assessment results. There is no point in monitoring your business if the highlighted issues are never turned into improvements. That said, DevOps metrics are essential for following your KPIs and assessing the effectiveness of the changes you make.

Characteristics of a useful DevOps metric

One of the biggest fallacies of modern consulting is the recommendation to “analyze everything” in the hope of finding (or out of fear of missing) some crucial piece of information that could secure a healthy bottom line. This approach is simply wrong, as it demands dedicating inordinate amounts of resources to watching aspects of your operations that have little to no impact on the resulting value. In short, it does not matter in the least whether the cable is red or blue, as long as it is intact and passes the signal from the producer to the consumer system without errors.

That said, what are the qualities of a DevOps metric you should actually collect?

Here are 5 key DevOps metric properties to look for:

  • Measurable — a metric must be quantifiable; “very good” is not a measurable value.
  • Relevant — the metric must measure something that matters to the business.
  • Incorruptible — team members should not be able to skew or game the measurement results.
  • Actionable — analysis of the metric over time must provide insights into possible improvements of systems, workflows, policies, etc.
  • Traceable — the metric must point directly to root causes instead of merely suggesting that something is wrong.

That said, it is impossible to track all metrics in one dashboard. You have to split them into sets that are relevant to specific aspects of your business and stages of your DevOps workflow. For operational flexibility and ease of management, these metrics can be grouped into logical categories:

  • Change velocity — change complexity, lead time for changes, MeanTimeToResolution/Recovery/Repair (MTTR), deployment frequency.
  • Quality control — average application error rate, number of support tickets after a release, number of bugs making it to production, automated test coverage percentage, deployment success rate.
  • Performance — scalability and high availability of systems, resource utilization efficiency, application latency, etc.
  • Customer satisfaction — median application usage and traffic over a given period, customer satisfaction rates, speed of feature implementation, the business impact of new features (such as an increase in subscriptions and renewals), etc.

Naturally, this list is not exhaustive; depending on your business niche and company lifecycle stage, it might be prudent to also track metrics for software development cycles, application performance in production, infrastructure performance and cost-efficiency, system health, team productivity, or any other aspect of the business you are interested in.
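
As a simple illustration of such grouping, below is a minimal sketch in Python of a metrics catalog split into per-category dashboards. The metric and category names are made-up assumptions for illustration, not a standard taxonomy:

```python
# A minimal sketch of organizing a metrics catalog by category, so each
# dashboard only tracks the set relevant to one aspect of the business.
# Category and metric names are illustrative assumptions.
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MetricSpec:
    name: str        # what is measured
    category: str    # which dashboard/aspect it belongs to
    unit: str        # so the value stays measurable and comparable
    source: str      # where the data comes from (CI/CD, APM, ticketing...)


CATALOG = [
    MetricSpec("deployment_frequency", "change_velocity", "deploys/day", "CI/CD"),
    MetricSpec("lead_time", "change_velocity", "hours", "issue tracker"),
    MetricSpec("defect_escape_rate", "quality_control", "%", "bug tracker"),
    MetricSpec("p95_latency", "performance", "ms", "APM"),
    MetricSpec("nps", "customer_satisfaction", "score", "surveys"),
]


def by_dashboard(catalog: list[MetricSpec]) -> dict[str, list[MetricSpec]]:
    """Group metric definitions into per-category dashboards."""
    grouped: dict[str, list[MetricSpec]] = defaultdict(list)
    for spec in catalog:
        grouped[spec.category].append(spec)
    return dict(grouped)
```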

We have briefly mentioned MTTR, one of several frequently used acronyms related to DevOps metrics (and software engineering in general), and it is important to understand each of them in full; a small computation sketch follows the list below.

  • MTTF or MeanTimeToFailure — the period of time from a product/feature launch to its first failure. It is characterized by uninterrupted service availability and correct system behavior until a failure of some sort occurs.
  • MTTD or MeanTimeToDetection — the period of time from an incident occurring to your team being informed of it and diagnosing its root cause. This metric showcases the efficiency of your issue tracking and monitoring systems.
  • MTTR or MeanTimeToResolution/Recovery/Repair — the period of time between finding the root cause and correcting the issue. It depends on your code complexity, DevOps workflow maturity, operational flexibility and a variety of other parameters.
  • MTBF or MeanTimeBetweenFailures — the period of time until the next failure of the same type occurs. This metric highlights your system stability and process reliability over time.
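
To make these definitions concrete, here is a minimal sketch in Python of how the averages could be computed from a hypothetical incident log. The record fields (occurred_at, detected_at, resolved_at) are illustrative assumptions rather than a standard schema:

```python
# A minimal sketch of computing MTTD, MTTR and MTBF from a hypothetical
# incident log. Field names are illustrative assumptions, not a standard schema.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    occurred_at: datetime   # when the failure actually started
    detected_at: datetime   # when monitoring/alerting flagged it
    resolved_at: datetime   # when service was fully restored


def mean_hours(deltas: list[timedelta]) -> float:
    return mean(d.total_seconds() for d in deltas) / 3600


def mttd(incidents: list[Incident]) -> float:
    """Mean time to detection, in hours."""
    return mean_hours([i.detected_at - i.occurred_at for i in incidents])


def mttr(incidents: list[Incident]) -> float:
    """Mean time to resolution, in hours."""
    return mean_hours([i.resolved_at - i.detected_at for i in incidents])


def mtbf(incidents: list[Incident]) -> float:
    """Mean time between failures, in hours (needs at least two incidents)."""
    ordered = sorted(i.occurred_at for i in incidents)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return mean_hours(gaps)
```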

That said, you should form your own pool of DevOps metrics and organize them into as many dashboards as needed to cover all the aspects of your business you wish to monitor automatically.

Bad business metrics

Unfortunately, people make mistakes, and this especially applies to managers who have to select which business parameters to monitor and how to evaluate success. As a result, some companies adopt misguided DevOps metrics that are either irrelevant or outright detrimental to the process of delivering value to end users.

You must avoid tracking metrics based on:

  • Traditional engineering values — MTBF can be somewhat irrelevant in a DevOps environment, as operational stability is not its paramount goal. For example, virtualized computing resources and Infrastructure as Code allow you to deploy and configure environments for testing ideas at almost no cost in time or money. The code deployed there might fail a lot, but it will help formulate the application concept better and select the most appropriate software architecture (such as microservices), thus justifying the expense. Therefore, the DevOps metrics you monitor must not be based on best practices alone; they must also be adapted to the needs of your business.
  • Metrics based on rivalry — when the top performer gets everything and even second place is an also-ran, it is hard to expect communication and collaboration within the team. Never build your metrics on competition between team members or teams (such as failed builds or the number of severe bugs found). Finding bugs is important, but classifying a minor bug as severe to improve the QA statistics and worsen the Dev team statistics is definitely not the way to go.
  • Vanity metrics — your team should work like the Avengers, supporting each other, not as a group of individual superheroes competing for popularity. The number of lines of code written weekly is irrelevant, as they can all disappear during refactoring, while the timeliness and quality of the resulting feature delivery are what matter. The number of daily deployments is not important unless each deployment adds value to the end-user experience.

Your employees will be tempted to form a pool of metrics they are sure to hit. Filter out the ones that deliver no actual value and don't be discouraged by the complexity of the remaining ones. It is better to have low results on an important metric (and look for ways to improve the situation) than 100% on a useless one.

15 DevOps metrics important for any online business

Almost any business nowadays runs some kind of IT operations, be it delivering its own product or service online, interacting with end users through customer-facing applications, or simply running mission-critical systems in the cloud. That said, the 15 key DevOps metrics listed below will help ensure your business's flexibility and resilience:

  1. Deployment frequency — deploy small batches of code often instead of releasing large chunks rarely. Small and frequent deployments are easier to test, release and for end users to consume. Track deployments to testing environments, staging servers and the production environment separately to ensure complete visibility over the deployment pipeline (a computation sketch follows after this list).
  2. Deployment volume — keep an eye on the quantity and volume of deployment artifacts shipped to the customer with each release. This is hard to evaluate in terms of bug fixes or functionality delivered, so it might be better to track the number of story points delivered or the days' worth of development work released.
  3. Deployment time — this is one of the most crucial metrics, as a long deployment time might be the result of a huge application or of a sub-par release workflow. If your deployments take exorbitant amounts of time, this signals possible room for growth and improvement in your IT operations.
  4. Lead time — the period of time between starting the work on some item and publishing it to production. This is an essential metric for ensuring business continuity and planning, as it enables the PMs and C-suite to plan the operations accordingly.
  5. Support ticket volumes — this is an important metric for several stages of your IT operations. Support tickets showcase possible design flaws, highlight bugs that make it to production and inform you of customer opinions on your latest product updates. This is a crucial source of feedback that can be turned into input for new features or system improvements. Minimizing the number of tickets is good, but shutting them down cuts the feedback loop between you and your audience.
  6. Percentage of code covered by automated tests — DevOps fosters automation of code operations, and testing is no exception. It is a good practice to run every new batch of code against automated unit and integration tests and track the coverage percentage. However, 100% coverage might mean that the code was written to satisfy the tests rather than to work as intended, so the occasional broken test is a normal sign of honest testing.
  7. Flaw escape rate — in the ideal world, all the bugs are found during QA and pre-release testing on the staging server. In real life, some glitches make it to production. DevOps is all about shipping the code quickly, so you must track this metric to minimize the number and severity of defects that get discovered by users.
  8. Uninterrupted app availability — naturally, nobody wants the product to be unavailable to users for long. Certain deployment strategies do allow for some downtime, but keeping this metric minimal is crucial for long-term business success (see the sketch after this list).
  9. Application SLA — even if your customers do not sign an SLA before subscribing to the product, they still expect certain levels of availability, speed of response to support tickets and delivery of other guaranteed services.
  10. Failed deployments — the whole point of DevOps is ensuring that deployments succeed and removing the factors that can prevent this. However, even if you have never had a failed deployment, you must always understand what can go wrong and how to recover from a major post-release crash. This is your contingency strategy, and this metric is closely related to MTBF.
  11. Error rates — there are always some exceptions during normal application operation. Bugs after a new release, database connection issues, query timeouts and other issues all contribute to the uptime and system performance metrics of your IT operations.
  12. App usage and traffic — after a new app version is released, you want to see normal levels of app usage. If there are spikes in traffic or no traffic at all — something is wrong. System uptime and a bunch of other parameters depend on this metric.
  13. App performance — any app has normal performance patterns: it uses a certain amount of CPU power, RAM and I/O, generates a certain number of SQL queries, etc. Track these before a release so you can identify any pattern changes after the release. This is best done using monitoring tools like Prometheus & Grafana, Retrace, Nagios, the ELK stack and similar apps.
  14. MTTD — as we mentioned above, MTTD or MeanTimeToDetection is a crucial metric that showcases the efficiency of your monitoring tools and smart alerting practices. Getting MTTD below 15 minutes is a goal worth investing in.
  15. MTTR — MeanTimeToRecovery is another crucial metric that highlights the efficiency of your workflows, policies and procedures. It is normal to measure MTTR in business hours; an MTTR close to 8 hours is quite a good result for a product company, while for Managed Services Providers and cloud platforms this metric should stay below 4 hours.
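
As a worked example for items 1, 4 and 10 above, here is a minimal sketch in Python of deriving deployment frequency, lead time and deployment success rate from a hypothetical deployment log. The record fields are assumptions for illustration, not the output format of any particular CI/CD tool:

```python
# A minimal sketch of deriving deployment frequency, lead time and
# deployment success rate from a hypothetical deployment log.
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Deployment:
    started_work_at: datetime  # when work on the change item began
    deployed_at: datetime      # when it reached production
    succeeded: bool            # completed without failure or rollback


def deployment_frequency(deploys: list[Deployment], window_days: int = 7) -> float:
    """Average number of production deployments per day over the last window."""
    cutoff = max(d.deployed_at for d in deploys) - timedelta(days=window_days)
    recent = [d for d in deploys if d.deployed_at >= cutoff]
    return len(recent) / window_days


def lead_time_days(deploys: list[Deployment]) -> float:
    """Mean lead time (work started -> in production), in days."""
    return mean(
        (d.deployed_at - d.started_work_at).total_seconds() for d in deploys
    ) / 86400


def success_rate(deploys: list[Deployment]) -> float:
    """Share of deployments that completed without failure or rollback."""
    return sum(d.succeeded for d in deploys) / len(deploys)
```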

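The quality and availability metrics (items 7 to 9) likewise boil down to simple ratios. The sketch below, using made-up bug counters and outage durations, shows one way to compute a defect escape rate and an uptime percentage against an SLA target:

```python
# A minimal sketch for items 7-9: flaw (defect) escape rate and application
# availability measured against an SLA target. The inputs (bug counters and
# downtime intervals) are illustrative assumptions.
from datetime import timedelta


def defect_escape_rate(found_in_qa: int, found_in_production: int) -> float:
    """Share of all known defects that were discovered only in production."""
    total = found_in_qa + found_in_production
    return found_in_production / total if total else 0.0


def availability_pct(downtime: list[timedelta], period: timedelta) -> float:
    """Uptime percentage over a reporting period, given recorded outages."""
    down_seconds = sum(d.total_seconds() for d in downtime)
    return 100.0 * (1 - down_seconds / period.total_seconds())


# Example: two short outages in a 30-day month against a 99.9% SLA target
outages = [timedelta(minutes=12), timedelta(minutes=5)]
month = timedelta(days=30)
uptime = availability_pct(outages, month)
print(f"Uptime: {uptime:.3f}%  SLA met: {uptime >= 99.9}")
```
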
There are, of course, other metrics not related to DevOps, such as customer churn and technical support response time. These are not affected by your DevOps workflows, yet they have a substantial impact on the overall end-user experience, which directly influences long-term business success.

Summary: track your unique DevOps metrics

The 15 DevOps metrics mentioned above can form a reliable backbone for your daily system monitoring routine, but don't hesitate to adjust the list based on your unique project needs and business requirements. Most importantly, align the gathered metrics with business KPIs, as shown above, so that the values and graphs produced by your dashboards become actionable insights that lead to business decisions.

The toolkit used for gathering DevOps metrics differs based on the rest of your infrastructure components, but a general rule of thumb is to build cloud monitoring systems from open-source tools with RESTful APIs, so that new modules can be integrated quickly and without issues. IT Svit uses Terraform, Kubernetes, the ELK Stack, DataDog, FluentD, Prometheus + Grafana, Jenkins, Ansible, AWS CloudWatch Alarms/Metrics/Logs, SmartAlertManager and other tools to track various DevOps metrics for our customers.
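
For instance, here is a minimal sketch of exposing a couple of custom metrics for Prometheus to scrape (and Grafana to visualize) using the open-source prometheus_client library for Python. The metric names and the update loop are illustrative assumptions, not part of any particular production setup:

```python
# A minimal sketch of exposing custom DevOps metrics for Prometheus to scrape.
# In a real setup the values would come from your CI/CD and ticketing systems.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

DEPLOYMENTS = Counter(
    "deployments_total", "Production deployments", ["result"]
)
OPEN_TICKETS = Gauge(
    "support_tickets_open", "Support tickets currently open"
)

if __name__ == "__main__":
    start_http_server(9100)  # metrics become available at :9100/metrics
    while True:
        # Placeholder updates; replace with real data sources.
        DEPLOYMENTS.labels(result="success").inc()
        OPEN_TICKETS.set(random.randint(0, 25))
        time.sleep(60)
```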

Should you have any more questions on DevOps metrics — we would be glad to assist!
