
DevOps metrics that actually matter (and how to measure them)


Most teams have a DevOps monitoring setup that tracks one thing well: deployment frequency. The line goes up, leadership feels good, and nobody asks whether the software is actually getting more reliable.

That is the core problem with DevOps metrics in 2026. Teams measure speed but not health. They track how often they ship, but not what breaks after they ship, how long users wait for responses, how much compute sits idle, or how much engineering time goes to firefighting instead of building.

Now that AI writes 41% of code and deployment frequency is inflated by automation, the gap between “our DevOps metrics look good” and “our software is reliable” is wider than ever (DevOps.com: DORA 2025: Faster, But Are We Any Better?).

This post covers three layers of DevOps metrics that actually matter, the DevOps monitoring tools you need to collect them, and why DevOps observability is the piece most teams are missing.

Three layers of DevOps metrics

Most teams only measure one layer: delivery speed (how fast code gets to production). That is necessary but not sufficient. A complete DevOps monitoring setup covers three layers:

| Layer | What it answers | Example metrics |
| --- | --- | --- |
| Delivery performance | How fast and safely are we shipping? | Deployment frequency, lead time, change failure rate, recovery time, rework rate |
| Operational health | Is the system working for users right now? | Error rate by service, P99 latency, error budget burn rate, cost per request |
| Developer experience | Is the team productive and sustainable? | Cycle time breakdown, deploy confidence, toil ratio |

Delivery performance is where DORA lives. DORA (DevOps Research and Assessment) is Google’s research program that tracks five metrics: deployment frequency, lead time for changes, change failure rate, failed deployment recovery time, and rework rate (added in 2025) (CD Foundation: The DORA 4 Key Metrics Become 5).

These are a reasonable starting point, but they only cover the first layer. Teams with elite DORA scores still have production incidents, high toil, and no visibility into cost or latency. The metrics say “elite.” The engineers say “I am drowning.”

The other two layers are where DevOps monitoring and DevOps observability fill the gap.

DevOps monitoring: operational health metrics

These come from your DevOps observability stack. If you are running Prometheus, you already have most of this data.

Error rate by service. Not just “did the deploy fail” but “is the service returning errors right now, and at what rate.” This is the DevOps monitoring metric that catches slow degradation between deploys.

sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)

P99 latency by service. Average latency hides problems. If your average response time is 200ms but your 99th percentile is 4 seconds, 1% of your requests take over 4 seconds. At 10 million requests per day, that is 100,000 slow requests.

histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

Error budget burn rate. This is the SLO-based metric from the Kubernetes alerting post. If your SLO is 99.9% availability, your error budget is 43 minutes per month. How fast are you burning through it?

  • Burning 5% of your monthly budget in an hour = page someone
  • Burning 1% over 6 hours = create a ticket
  • Burning 0.1% over a day = normal, no action

This single metric replaces dozens of threshold-based alerts with one number that maps directly to user impact.
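The burn-rate thresholds above can be sketched as a small function. This is an illustrative sketch, not production alerting logic: the 5%-in-an-hour and 1%-over-6-hours cut-offs come from the bullets above, the 720-hour period is a 30-day month, and `error_rate` is assumed to come from a query like the error-rate PromQL shown earlier.

```python
def budget_consumed(error_rate: float, window_hours: float,
                    slo: float = 0.999, period_hours: float = 720.0) -> float:
    """Fraction of the 30-day error budget consumed by `error_rate`
    sustained over `window_hours` (a 99.9% SLO leaves a 0.1% budget,
    i.e. ~43 minutes of full outage per month)."""
    budget = 1.0 - slo
    return (error_rate * window_hours) / (budget * period_hours)

def burn_action(error_rate: float, window_hours: float) -> str:
    """Map a burn observation to the actions listed above."""
    consumed = budget_consumed(error_rate, window_hours)
    if window_hours <= 1 and consumed >= 0.05:
        return "page"    # 5% of the monthly budget in an hour
    if window_hours <= 6 and consumed >= 0.01:
        return "ticket"  # 1% over 6 hours
    return "none"        # slow burn, no action

# A full outage (error_rate=1.0) lasting 3 minutes (0.05h) already burns
# ~7% of the monthly budget, which is page-worthy.
action = burn_action(1.0, 0.05)  # -> "page"
```

In Prometheus you would express the same idea as recording rules over two windows; the Python version just makes the arithmetic behind the thresholds explicit.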

Infrastructure cost per request. Most teams know their total cloud bill. Almost nobody knows the cost per API request or per transaction. This is the metric that connects infrastructure spend to business output.

sum(node_instance_cost_per_hour) / sum(rate(http_requests_total[1h]))

You need a cost exporter like OpenCost to make node_instance_cost_per_hour available. Once you have it, you can track whether your cost efficiency is improving or degrading as traffic scales.
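To make the unit economics concrete, here is a hedged sketch of the same calculation done offline, plus the week-over-week regression check used in the alerting section later. The dollar figures and the 50% threshold are illustrative assumptions, not recommendations.

```python
def cost_per_million(hourly_cost_usd: float, requests_per_hour: float) -> float:
    """Cluster cost normalized per million requests."""
    return hourly_cost_usd / requests_per_hour * 1_000_000

def cost_regressed(this_week: float, last_week: float,
                   threshold: float = 0.5) -> bool:
    """True if cost per request grew more than `threshold` (50%) week over week."""
    return (this_week - last_week) / last_week > threshold

# Hypothetical numbers: a $40/hour cluster serving 2M requests/hour
# costs $20 per million requests.
now = cost_per_million(40.0, 2_000_000)
# Last week: cheaper nodes, more traffic -> $12.50 per million.
prev = cost_per_million(35.0, 2_800_000)
flag = cost_regressed(now, prev)  # 60% jump -> True, goes in the weekly digest
```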

DevOps metrics for developer experience

These are harder to collect automatically but just as important.

Cycle time breakdown. Lead time for changes is the total time from commit to production. But that single number does not tell you where the time is being spent. Break it into stages:

  • Coding time (commit to PR open)
  • Review wait time (PR open to first review)
  • Review time (first review to approval)
  • Merge to deploy time (approval to running in production)

Most teams assume coding is the bottleneck. In practice, review wait time is where most of the lead time hides. Here is what a 2-day lead time often looks like when you break it down:

| Stage | Time |
| --- | --- |
| Coding (commit to PR open) | 4 hours |
| Review wait (PR open to first review) | 30 hours |
| Review (first review to approval) | 1 hour |
| Merge to deploy | 30 minutes |
| Total lead time | ~36 hours |

A single lead time number shows “1.5 days.” The breakdown shows the real problem: nobody is reviewing PRs.
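If you pull the four timestamps from your Git host's API, the breakdown is a few lines of arithmetic. A minimal sketch, with hypothetical timestamps matching the table above:

```python
from datetime import datetime, timedelta

def cycle_breakdown(commit, pr_opened, first_review, approved, deployed):
    """Split total lead time into the four stages described above."""
    return {
        "coding": pr_opened - commit,
        "review_wait": first_review - pr_opened,
        "review": approved - first_review,
        "merge_to_deploy": deployed - approved,
    }

t0 = datetime(2026, 1, 5, 9, 0)  # hypothetical commit time
stages = cycle_breakdown(
    commit=t0,
    pr_opened=t0 + timedelta(hours=4),
    first_review=t0 + timedelta(hours=34),
    approved=t0 + timedelta(hours=35),
    deployed=t0 + timedelta(hours=35, minutes=30),
)
bottleneck = max(stages, key=stages.get)  # -> "review_wait" (30 hours)
```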

Deploy confidence score. Ask your team once a month: “On a scale of 1-5, how confident are you that a deploy on Friday afternoon will not cause an incident?” Track the trend. If your delivery metrics say “elite” but deploy confidence is 2 out of 5, something is wrong that the numbers are not capturing.

Toil ratio. What percentage of engineering time goes to unplanned work (incidents, manual operations, firefighting) versus planned work (features, improvements, tech debt)?

The 2026 State of SRE report found toil increased 30% in 2025, the first rise in five years. If your toil ratio is above 30%, your team is spending more time keeping the lights on than building anything new.
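A toil ratio is simple to compute once you tag time somewhere. A minimal sketch; the label names are assumptions, not a standard taxonomy, so map them to whatever your ticket tracker or time-tracking tool actually uses:

```python
# Assumed labels for unplanned work -- adjust to your tracker's taxonomy.
TOIL_LABELS = {"incident", "manual-ops", "firefighting"}

def toil_ratio(entries):
    """entries: list of (label, hours) pairs from tickets or time tracking."""
    toil = sum(h for label, h in entries if label in TOIL_LABELS)
    total = sum(h for _, h in entries)
    return toil / total if total else 0.0

week = [("feature", 60), ("incident", 25), ("manual-ops", 10), ("tech-debt", 5)]
ratio = toil_ratio(week)  # 35 / 100 = 0.35 -> above the 30% warning line
```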

DevOps monitoring tools: how to measure all of this

The good news: you do not need a paid platform to collect these DevOps metrics. The open-source stack handles most of it.

The DevOps monitoring tools you need

| Layer | Tool | What it gives you |
| --- | --- | --- |
| Metric collection | Prometheus | Time-series data: CPU, memory, request rates, latencies, error rates |
| Kubernetes metadata | kube-state-metrics | Pod status, deployment counts, resource requests vs. usage |
| Delivery metrics | dora-exporter or Four Keys | Deployment frequency, lead time, change failure rate, recovery time |
| Cost data | OpenCost | Per-namespace, per-workload cloud cost attribution |
| Visualization | Grafana | Dashboards, alerts, SLO tracking |
| Logs | Loki | Deploy logs, error context, audit trails |
| Traces | Tempo | Request-level latency breakdowns across services |

DevOps monitoring dashboards worth building

| Dashboard | What it shows | Who uses it |
| --- | --- | --- |
| Service health overview | Error rate, P99 latency, request volume, error budget remaining per service | On-call engineers (first thing they check during incidents) |
| Delivery performance | Deployment frequency, lead time, change failure rate, recovery time (weekly, trended monthly) | Engineering leadership (quarterly reviews) |
| Cost efficiency tracker | Total cluster cost, cost per namespace, cost per request, resource utilization | Platform team, FinOps (the dashboard that pays for itself) |
| Deploy timeline | Grafana annotation overlay: every deploy alongside error rate and latency | Anyone debugging "did this start after a deploy?" |

What to alert on

Not all DevOps metrics deserve an alert. Here is what actually warrants a notification:

  • Error budget burning faster than 14x normal rate (page)
  • Error budget burning faster than 3x normal rate for 6+ hours (ticket)
  • P99 latency above SLO threshold for 10+ minutes (page)
  • Cost per request increasing more than 50% week over week (weekly digest)
  • Resource utilization below 15% for a namespace for 7+ days (weekly digest)

Everything else goes to a dashboard, not a pager.

The DevOps metrics that are noise

Some things that teams measure but probably should not:

  • Lines of code. Measures nothing useful. A 10-line fix that prevents a production outage is worth more than a 5,000-line feature.
  • Number of commits. Encourages small, meaningless commits to inflate the number.
  • Uptime percentage without context. 99.9% uptime is meaningless if you do not know the error budget, the blast radius, or how you measured it.
  • Velocity (story points). Every team that tracks velocity eventually inflates their point estimates. The metric becomes self-referential and stops meaning anything.
  • Number of incidents. Fewer incidents can mean better reliability or worse detection. Without context, this metric tells you nothing.

The DevOps observability gap most teams have

The pattern we see is this: teams have some DevOps monitoring in place, usually deployment frequency and maybe error rates. But they cannot answer basic questions like:

  • What is our cost per API request?
  • Which service has the worst change failure rate?
  • How much time do we spend waiting for reviews versus writing code?
  • Is our reliability getting better or worse month over month?

The gap is not awareness. Most engineering leaders know they should measure these things. The gap is DevOps observability infrastructure.

Without Prometheus scraping your services, Loki collecting your deploy logs, Tempo tracing your requests, and Grafana pulling it all together, you cannot build the dashboards described in this post.

And building and maintaining that stack of DevOps monitoring tools is 10-20 hours a month of work that most teams do not have bandwidth for.

That is the difference between having DevOps metrics and having a measurement system. One is a number someone checks occasionally. The other is the foundation for every engineering decision.

Where Obsium fits

If your team is measuring deploys but not the things that actually matter, book a free 30-minute observability consultation. No sales deck, just an engineer-to-engineer chat about your stack.
