Cloud infrastructure spending crossed $675 billion in 2025. About 27% of it was wasted. That percentage has held between 27% and 32% every single year since 2019. Nobody is getting better at this (Spendark: State of Cloud Waste 2026).
Kubernetes was supposed to help. Pack more workloads onto fewer machines, scale up when you need it, scale down when you don’t. The theory is sound.
The practice is that the average Kubernetes cluster runs at 8% CPU utilization and 20% memory utilization, both worse than the year before (Cast AI: 2026 State of Kubernetes Resource Optimization).
This post walks through where the waste is hiding, how to measure it with tools you probably already have, and what to fix first.
## Where the waste actually is
The numbers from Cast AI’s 2026 report, measured across tens of thousands of production clusters on AWS, GCP, and Azure:
| Metric | 2025 | 2026 | Direction |
|---|---|---|---|
| CPU utilization | 10% | 8% | Getting worse |
| Memory utilization | 23% | 20% | Getting worse |
| CPU overprovisioning | 40% | 69% | Getting much worse |
| Memory overprovisioning | — | 79% | — |
| GPU utilization | — | 5% | — |
That CPU number is worth sitting with for a moment. For every dollar you spend on CPU in your Kubernetes cluster, 92 cents of compute sits idle. Not reserved for burst capacity. Not allocated for failover. Just unused.
GPU is even worse. Teams running AI and ML workloads on Kubernetes are paying for GPU instances at $2-4 per hour per GPU and using 5% of the capacity. That is not a rounding error. That is burning money.
## Why this keeps getting worse
The pattern is the same at almost every company.
A developer sets resource requests and limits at deploy time based on:
- A rough guess
- Local testing numbers that do not reflect production
- Whatever the last team used
The application goes to production, and nobody revisits those numbers.
Then the padding starts (the end state is sketched after this list):
- A pod gets OOM-killed because memory requests were too low
- The team doubles the memory request “just to be safe”
- The application runs fine at the new setting
- Nobody ever checks whether the new setting is 4x more than needed
- Multiply this by every pod in the cluster
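Concretely, the end state of that loop looks something like the spec below: a hypothetical service whose real peak is about 500 MiB of memory and a tenth of a CPU core, after one OOM kill and a couple of rounds of doubling. The workload name, image, and numbers are all illustrative.

```yaml
# Hypothetical deployment after a few rounds of "just to be safe" padding.
# Observed production usage: ~0.1 CPU cores and ~500Mi memory at peak.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api            # illustrative name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
    spec:
      containers:
        - name: checkout-api
          image: example.com/checkout-api:1.4.2   # placeholder image
          resources:
            requests:
              cpu: "2"          # averages ~0.1 cores in production
              memory: 4Gi       # doubled twice after an OOM kill at the original 1Gi
            limits:
              cpu: "4"
              memory: 8Gi
```

Six replicas of this reserve roughly 11 CPU cores and 21 GiB of memory that sit unused but still drive the node count, and therefore the bill.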
A January 2026 study of 3,042 production clusters found that 68% of pods request 3-8x more memory than they actually use. 64% of engineering teams admitted to adding “just to be safe” headroom of 2-4x after a single OOM incident (Wozz: Kubernetes Memory Overprovisioning Study 2026).
The cost of that padding is invisible to the team doing the padding. The platform team sees the infrastructure bill. The application team sees “my app runs fine.” Nobody connects the two.
## How to measure the waste
You do not need a paid tool to find overprovisioned workloads. If you are running Prometheus and Grafana (which most Kubernetes teams are), you already have the data. You just need to query it.
### CPU: requested vs. used
This Prometheus query gives each workload's actual CPU usage as a fraction of what it requests, averaged over the past 7 days:
```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[7d])) by (namespace, pod)
/
sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace, pod)
```
Any workload consistently below 20% is a candidate for right-sizing. Below 10% is almost certainly overprovisioned.
### Memory: requested vs. used
Same idea for memory:
```promql
sum(container_memory_working_set_bytes{container!=""}) by (namespace, pod)
/
sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace, pod)
```
This gives you the ratio of actual memory usage to what you requested. A workload requesting 2 GB and using 400 MB is at 20%, meaning you are paying for 1.6 GB of memory that sits empty. Note that the working-set metric is an instantaneous snapshot, so also check peaks over a longer window (for example with max_over_time on the raw metric over the past week) before acting on a single reading.
### The top wasters dashboard
Build a Grafana dashboard with these panels:
- Top 10 CPU wasters by namespace: Workloads with the biggest gap between requested and used CPU
- Top 10 memory wasters by namespace: Same for memory
- Node utilization heatmap: Which nodes are underloaded and could be consolidated
- Pod count vs. resource utilization: Namespaces with many pods but low utilization per pod
This dashboard alone will tell you where 80% of the waste is. Most teams have never built it because nobody asked the question.
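One way to make the top-wasters panels cheap to render is to precompute the requested-minus-used gap as Prometheus recording rules and point Grafana at the recorded series. A sketch only: the rule names and the one-hour averaging window below are arbitrary choices, not part of any standard.

```yaml
# waste-recording-rules.yml -- example Prometheus recording rules.
groups:
  - name: k8s-waste
    rules:
      # CPU cores requested but not used, per namespace (averaged over 1h).
      - record: namespace:cpu_cores_wasted:1h
        expr: |
          sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
          -
          sum(rate(container_cpu_usage_seconds_total{container!=""}[1h])) by (namespace)
      # Bytes of memory requested but not in the working set, per namespace.
      - record: namespace:memory_bytes_wasted:current
        expr: |
          sum(kube_pod_container_resource_requests{resource="memory"}) by (namespace)
          -
          sum(container_memory_working_set_bytes{container!=""}) by (namespace)
```

The "Top 10 CPU wasters" panel is then just `topk(10, namespace:cpu_cores_wasted:1h)`.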
## What to fix first
Not everything needs to be fixed at once. Start with the changes that save the most money with the least risk.
### 1. Right-size the obvious outliers
Sort your workloads by the gap between requested and used resources. The top 10% of wasters usually account for 50-60% of the total waste. These are the pods requesting 4 GB of memory and using 300 MB, or requesting 2 CPU cores and averaging 0.1.
For each one (the end result is sketched after this list):
- Look at 30 days of usage data, not just the average. Check the peaks.
- Set requests to the 95th percentile of actual usage plus a 20-30% buffer
- Set limits to 2x the request (or remove CPU limits entirely, since CPU throttling often causes more problems than it solves)
- Deploy the change to a staging environment first
- Roll out to production and monitor for a week
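Putting those rules together for the 4 GB / 300 MB example above, the container's resources block might end up looking like this. The numbers are placeholders; plug in your own 30-day p95s.

```yaml
resources:
  requests:
    cpu: 250m       # p95 usage ~0.2 cores, plus ~25% buffer
    memory: 384Mi   # p95 usage ~300Mi, plus ~25% buffer
  limits:
    memory: 768Mi   # ~2x the request, down from the old 4Gi
    # no CPU limit: throttling tends to hurt more than it helps
```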
### 2. Use the Vertical Pod Autoscaler (VPA) for continuous right-sizing
Doing this manually works for the top 10 offenders. For the other 200 workloads in your cluster, you need automation.
The Vertical Pod Autoscaler watches actual resource usage and adjusts requests over time. Run it in recommendation mode first (it suggests changes but does not apply them). Review the recommendations for a few weeks. Then switch to auto mode for workloads you trust.
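Recommendation mode is a single field on the VPA object. A minimal manifest, assuming the VPA components are already installed in the cluster and targeting a hypothetical Deployment named checkout-api:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api        # hypothetical workload
  updatePolicy:
    updateMode: "Off"         # recommend only; never evict or resize pods
```

The recommendations show up in the object's status (`kubectl describe vpa checkout-api-vpa`); switching `updateMode` to `"Auto"` later lets VPA apply them.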
VPA is not perfect. It restarts pods when it changes resource requests, which can cause brief disruption.
| Workload type | VPA approach |
|---|---|
| Stateless services | Auto mode (safe to restart) |
| Stateful workloads | Recommendation mode only (review and apply manually) |
| Databases, message queues | Skip VPA, right-size manually based on 30-day usage data |
### 3. Clean up idle workloads
Every cluster has them:
- Dev environments that nobody has touched in months
- Preview deployments from PRs that merged weeks ago
- Staging namespaces for projects that shipped or got cancelled
A quick way to find them:
```promql
sum(rate(container_cpu_usage_seconds_total{container!=""}[7d])) by (namespace) < 0.01
```
Any namespace averaging near zero CPU over 7 days is worth investigating. Either the workloads are genuinely idle (delete them) or they are running but serving no traffic (find out why).
### 4. Implement cluster autoscaling properly
Overprovisioning at the pod level wastes resources within nodes. But if your cluster autoscaler is not configured to remove underutilized nodes, you are also paying for entire machines that could be shut down.
Check these settings:
- Scale-down utilization threshold: The default is 0.5, meaning a node whose requested resources stay below 50% of its capacity for the configured grace period (10 minutes by default) becomes a candidate for removal. Raising it toward 0.6-0.65 makes the autoscaler more aggressive about consolidating lightly loaded nodes.
- Scale-down delay: The default is often 10 minutes. For non-production clusters, you can drop this to 2-5 minutes.
- Node bin-packing: Make sure the scheduler scores nodes with the MostAllocated strategy of the NodeResourcesFit plugin (the successor to the older "MostRequestedPriority" policy), which packs pods onto fewer nodes instead of spreading them evenly. Example configs for both settings follow this list.
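For a self-managed cluster-autoscaler, the first two knobs are flags on its Deployment. The flag names below are the real ones; the values and image tag are illustrative.

```yaml
# Fragment of a cluster-autoscaler Deployment (self-managed clusters).
containers:
  - name: cluster-autoscaler
    image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.30.0   # pick the tag matching your cluster version
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws                      # or gce, azure, ...
      - --scale-down-utilization-threshold=0.65   # nodes below 65% requested/capacity are candidates
      - --scale-down-unneeded-time=5m             # how long a node must stay underutilized first
      - --scale-down-delay-after-add=5m           # cool-down after a scale-up
```

Bin-packing is a scheduler setting rather than an autoscaler one. Where you control kube-scheduler (managed control planes generally do not expose this), the MostAllocated strategy looks roughly like:

```yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: MostAllocated
            resources:
              - name: cpu
                weight: 1
              - name: memory
                weight: 1
```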
### 5. Use spot instances for fault-tolerant workloads
Spot instances cost 60-90% less than on-demand. Good candidates:
- Batch jobs
- CI/CD runners
- Stateless web workers with multiple replicas
The risk is that spot instances can be reclaimed on short notice: two minutes on AWS, and less on some other providers. The mitigation is running enough replicas across multiple availability zones that losing one instance does not affect availability.
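On the workload side this usually means a node selector pointing at the spot node pool plus a topology spread across zones. The capacity-type label varies by provider (the EKS managed-node-group label is shown; GKE uses cloud.google.com/gke-spot), so treat this fragment as a sketch:

```yaml
# Deployment fragment for a stateless worker that tolerates spot interruptions.
spec:
  replicas: 6
  template:
    spec:
      nodeSelector:
        eks.amazonaws.com/capacityType: SPOT        # EKS managed node group label
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone  # spread replicas across AZs
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: batch-worker                     # hypothetical app label
```

If the spot node pool carries a taint, the pods also need a matching toleration.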
## What this looks like in real money
Here is a rough example. You are running 50 nodes on AWS (m5.2xlarge, 8 vCPU, 32 GB memory) at $0.384/hour on-demand, or roughly $276 per node per month at 720 hours.
| | Before | After |
|---|---|---|
| Nodes | 50 | 30 |
| Monthly compute cost | $13,824 | $8,294 |
| Annual compute cost | $165,888 | $99,533 |
| Annual savings | — | $66,355 |
That is from right-sizing pods and letting the cluster autoscaler remove the nodes that become empty. If you add spot instances for eligible workloads, the savings go further.
At larger scale (200+ nodes), teams regularly save $150,000-$300,000 per year just from right-sizing and node consolidation. No tooling changes. No architecture rework. Just measuring what you are actually using and adjusting the numbers.
## The monitoring gap
You cannot optimize what you cannot see. If you do not have resource utilization metrics flowing into a system that lets you query, visualize, and alert on them, every optimization described above is a one-time fix that drifts back to waste within 3-6 months.
This is what an observability stack is for. Not just uptime alerts. Not just error logs. Resource utilization over time, tracked per workload, per namespace, per cluster.
The teams that keep costs under control are the ones that:
- Have dashboards showing resource utilization by namespace and workload
- Run weekly or monthly reviews of the top wasters
- Alert when utilization drops below a threshold, meaning something is overprovisioned or idle (an example rule follows this list)
- Track cost trends over time so they can spot regressions before the bill arrives
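The low-utilization alert can be a plain Prometheus alerting rule built from the same metrics as the queries above. A sketch; the 20% threshold and 24-hour duration are starting points, not magic numbers.

```yaml
# low-utilization-alerts.yml -- example Prometheus alerting rule file.
groups:
  - name: k8s-overprovisioning
    rules:
      - alert: NamespaceCpuRequestsMostlyUnused
        expr: |
          sum(rate(container_cpu_usage_seconds_total{container!=""}[1h])) by (namespace)
            /
          sum(kube_pod_container_resource_requests{resource="cpu"}) by (namespace)
          < 0.20
        for: 24h
        labels:
          severity: info
        annotations:
          summary: '{{ $labels.namespace }} is using under 20% of its requested CPU'
          description: 'Likely overprovisioned or idle; compare requests against 30-day usage before right-sizing.'
```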
Without this, you are flying blind. You fix the waste once, and 6 months later new deployments with padded resource requests have brought you right back to where you started.
## Where Obsium fits
If your cluster costs keep climbing, book a free 30-minute observability consultation. No sales deck, just an engineer-to-engineer chat about your stack.




