“Datadog bill shock” has become enough of a running joke in 2026 that it now has its own Reddit threads, its own memes, and its own line item on FinOps team roadmaps. A recent analysis found that initial Datadog cost estimates typically miss by 3-12x once you actually run it in production (OneUptime: The Real Cost of Observability in 2026).
A mid-sized engineering team (50 people, 200 hosts, a normal microservices setup) now pays around $220,000 a year. The Kubernetes numbers are worse: teams regularly report host counts 3-5x higher than what they estimated at contract signing, because every pod and every node adds to the bill, and autoscaling means your invoice autoscales with it.
That is the conversation that pushes every engineering leader into the same question at some point: should we go with Grafana’s open-source stack, Datadog, or New Relic?
The short version
If you just want the quick take before reading the rest:
- Datadog gives you the fastest time-to-value and the least operational work. It also gives you the least predictable bills and the most vendor lock-in.
- New Relic is cheaper per-seat and the free tier is generous (100 GB/month). But costs get slippery once your data volumes grow, and the per-GB model punishes verbose Kubernetes logging.
- Grafana + Prometheus + Loki (self-hosted or via Grafana Cloud) is the most flexible and typically the cheapest at scale. The tradeoff is real operational overhead unless you have SRE capacity or bring in outside help.
Below is the longer version with actual numbers.
The money question
Because honestly, that’s why you’re reading this. Nobody migrates their observability platform for fun. You’re here because someone in finance flagged a bill that tripled overnight.
Datadog pricing
Datadog charges per host. $15-$23/month for infrastructure monitoring depending on your plan. APM adds another $31-$40/host/month. Logs start at $1.70 per million events indexed. Custom metrics cost extra. RUM costs extra. Synthetics cost extra.
The pricing page lists 23+ separate products. The average customer uses about 9 of them.
Here’s where it gets uncomfortable for Kubernetes teams specifically:
A company with “20 services” doesn’t have 20 billable hosts. Run 4 microservices at 3 replicas each and a single service group already accounts for 12 billable units. One team we know went from 20 expected hosts to 340 billable units.
A Sedai analysis from 2026 estimated that a mid-market company with 50 engineers and ~200 hosts pays roughly $220,000/year for Datadog. And that’s the baseline, not including overages.
One VP of Engineering budgeted $12,000/month. The actual invoice came in at $147,000.
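To make the host-count effect concrete, here is a rough sketch using the upper list prices quoted above ($23 for infrastructure and $40 for APM, per host per month). The function name and the infra-plus-APM-only scope are assumptions for illustration; logs, custom metrics, overages, and negotiated discounts are deliberately left out.

```python
def datadog_annual_cost(hosts, infra_per_host=23.0, apm_per_host=40.0):
    """Rough annual estimate for infrastructure monitoring + APM only.

    Prices are the upper list prices quoted in this article; real
    contracts vary, and logs/custom metrics are not included.
    """
    monthly = hosts * (infra_per_host + apm_per_host)
    return monthly * 12


# The "20 expected hosts vs. 340 billable units" case from above.
budgeted = datadog_annual_cost(20)
billed = datadog_annual_cost(340)
print(f"budgeted: ${budgeted:,.0f}/yr")
print(f"billed:   ${billed:,.0f}/yr ({billed / budgeted:.0f}x)")
```

Because the model is linear in host count, the bill grows by exactly the same factor as the billable units. That is the whole problem when autoscaling decides the host count for you.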
New Relic pricing
New Relic took a different approach: data-based pricing. You pay per GB ingested ($0.30-$0.60/GB depending on your plan) plus per-seat fees for full platform users ($99-$349/month per user).
The free tier is actually useful: 100 GB of data per month, one full platform user, unlimited basic users.
The problem for Kubernetes teams? Kubernetes is chatty. Between pod logs, node metrics, kube-state-metrics, and trace data from a few dozen services, you can burn through 100 GB in days. Once you’re past the free tier, the per-GB model means every deployment, every scaling event, every noisy log line has a dollar sign attached to it.
Teams end up in a weird spot where they’re optimizing log levels and dropping metrics to control costs instead of improving observability. That’s backwards.
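A minimal sketch of how the data-based model behaves, using the low end of the ranges quoted above ($0.30/GB, $99 per full platform user). It assumes the 100 GB and the one free full user are simply deducted before billing, which is a simplification; the function and parameter names are hypothetical and this is not New Relic’s actual billing logic.

```python
def new_relic_monthly_cost(ingest_gb, full_users, per_gb=0.30,
                           per_seat=99.0, free_gb=100, free_users=1):
    """Rough monthly estimate: billable GB plus billable full-platform seats.

    Assumption: the free allowance is subtracted before billing.
    """
    billable_gb = max(0, ingest_gb - free_gb)
    billable_users = max(0, full_users - free_users)
    return billable_gb * per_gb + billable_users * per_seat


# Same team of 11 full users, with ingest growing as the cluster gets chattier.
for gb in (100, 1_000, 5_000, 20_000):
    print(f"{gb:>6} GB/month -> ${new_relic_monthly_cost(gb, 11):,.0f}/month")
```

The seat cost stays flat while the ingest term grows with every noisy log line, which is why teams end up tuning log levels rather than dashboards.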
Grafana stack pricing
Two options here:
Self-hosted (Grafana + Prometheus + Loki + Tempo): The software is free. You pay for the infrastructure to run it, which typically works out to $5-$15/month per node for small setups. A mid-sized Kubernetes cluster might spend $500-$2,000/month on the underlying compute and storage. No per-host fees, no per-GB ingestion charges, no surprises.
Grafana Cloud: Starts free (10K active metric series, 50 GB logs, 50 GB traces). Pro plan is $19/month base + usage. Significantly cheaper than Datadog for most workloads, and you still get a managed experience.
The catch? Self-hosted means you own the uptime, scaling, and maintenance of the stack itself. That takes real engineering time. Grafana Cloud reduces that burden but adds cost.
Beyond cost: how they actually feel in production
Getting started
Datadog is fast. Install the agent via Helm chart, turn on auto-discovery, and you have dashboards within an hour. If your VP says “I need observability by Friday,” Datadog delivers.
New Relic is similarly quick. Their Kubernetes integration and APM auto-instrumentation are solid. Getting started is fast. Understanding your bill takes longer.
The Grafana stack takes 1-3 weeks for a production-ready self-hosted setup. You’re wiring up Prometheus scrape targets, Loki for logs, Tempo for traces, building dashboards, writing alerting rules. Grafana Cloud compresses that to a few days, but you’re still doing more upfront work.
At 100+ services, the tradeoffs get real
Datadog keeps working smoothly. You add services, they appear in dashboards. Your bill does the same thing, though. At 100+ services with APM, you’re in six-figure territory before anyone on the team realizes it.
New Relic handles scale fine technically, but something shifts in team behavior. Engineers start reducing trace sampling, dropping debug logs, shortening retention. You’re trading observability for cost control at that point. Which defeats the purpose.
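For readers who haven’t tuned it before, “reducing trace sampling” usually means keeping only a fixed fraction of traces. The sketch below shows one common approach, a deterministic decision derived from the trace ID, so every service makes the same keep/drop choice for the same trace. The function name, the hashing scheme, and the synthetic trace IDs are assumptions for illustration, not any vendor’s implementation.

```python
import hashlib


def keep_trace(trace_id: str, sample_rate: float) -> bool:
    """Deterministic ratio sampling: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    threshold = int(sample_rate * 2**64)         # keep if bucket falls below this
    return bucket < threshold


# Hypothetical trace IDs, sampled at 10%.
trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t, 0.10) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} traces")
```

Dropping from 100% to 10% cuts trace ingest by roughly 90%, and it also means roughly nine out of ten requests leave no trace to look at during an incident. That is the tradeoff the per-GB model pushes teams toward.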
The Grafana stack handles scale well if you invest in it. Prometheus with Thanos or Mimir for long-term storage, Loki for logs, Tempo for traces. 77% of organizations now run Prometheus in production, per the 2025 CNCF survey. That adoption isn’t accidental. But self-hosting at this scale needs dedicated SRE attention, or your observability platform becomes its own source of outages.
The lock-in question
People underestimate this when they’re choosing platforms. Then they feel it when they try to leave.
Datadog uses its own proprietary query syntax, its own agent, its own data format. We’ve helped clients migrate away from Datadog. It takes weeks, not hours. You’re rebuilding dashboards, alerts, SLOs, and runbooks from scratch.
New Relic has gotten better here. They support OTLP (OpenTelemetry Protocol), so migration is more feasible than with Datadog. Still not painless.
The Grafana stack runs on open standards: PromQL, LogQL, TraceQL. OpenTelemetry fits natively. If you want to swap a backend component later, your instrumentation stays intact.
Cost comparison: 50-host Kubernetes cluster
Rough annual estimates for a team running 50 hosts with infra monitoring, APM, and log management:
| | Datadog | New Relic | Grafana Cloud | Grafana self-hosted |
| --- | --- | --- | --- | --- |
| Infrastructure monitoring | $13,800 | Included | ~$3,600 | ~$1,200 (compute) |
| APM / tracing | $24,000 | Included | ~$2,400 | ~$600 (compute) |
| Log management | $12,000+ | $8,000-$15,000 | ~$3,000 | ~$1,000 (storage) |
| Per-user / seat costs | Included | $12,000-$42,000 | ~$1,200 | $0 |
| Estimated annual total | ~$55,000-$70,000 | ~$25,000-$60,000 | ~$10,000-$15,000 | ~$3,000-$8,000 |
Actual costs depend on data volume, retention, custom metrics count, and how many Datadog features you turn on. The ranges are wide because no two setups look the same.
The self-hosted Grafana number looks absurdly low, and it is, if you don’t count engineering time. Add 10-20 hours/month of SRE effort for maintenance, upgrades, and troubleshooting and the true cost goes up. How much depends on what you pay your engineers.
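As a rough illustration of how much engineering time moves that number, the sketch below adds SRE hours to the $3,000-$8,000 infrastructure range. The $100/hour loaded rate is purely a placeholder assumption, as is the function name; substitute your own figures.

```python
def self_hosted_annual_cost(infra_annual, sre_hours_per_month, hourly_rate):
    """Infrastructure spend plus the engineering time needed to run the stack."""
    return infra_annual + sre_hours_per_month * 12 * hourly_rate


HOURLY_RATE = 100  # hypothetical loaded cost per SRE hour; not from any survey

low = self_hosted_annual_cost(3_000, 10, HOURLY_RATE)
high = self_hosted_annual_cost(8_000, 20, HOURLY_RATE)
print(f"true annual cost: ${low:,} to ${high:,}")
```

Under that placeholder rate, labor costs more than the infrastructure itself. Who runs the stack ends up mattering more to the total than which tools you pick.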
So which one?
Go with Datadog if you’re under 50 hosts, need it working tomorrow, don’t have SRE capacity, and can stomach unpredictable bills. Plenty of teams are happy with Datadog at smaller scale. It’s a good product. The problems start when you grow.
Go with New Relic if you want a commercial platform, your data volumes are moderate, and you like the free tier for early-stage work. Just watch the per-GB costs as your cluster grows.
Go with the Grafana stack if you’re past 50 hosts, you care about cost control or lock-in, or you’re in a regulated environment where data can’t leave your infrastructure. This is where most Kubernetes teams land eventually. The CNCF numbers tell the story: 76% of companies use open-source for observability.
The real question isn’t which tools. It’s who runs them.
That’s where Obsium comes in
We build and operate open-source observability stacks for Kubernetes teams. Grafana, Prometheus, Loki, Tempo, OpenTelemetry. Everything is deployed inside your infrastructure, so your data never leaves your environment.
Think of it as the cost profile of self-hosted open-source, minus the 10-20 hours/month your SREs would spend keeping it running. We handle upgrades, scaling, alert tuning, and dashboard builds. You keep full ownership and zero vendor lock-in. If you ever want to take over operations, everything is yours, and we’ll help with the transition.
We’ve migrated teams off $200K+/year Datadog setups and landed them on a stack that gave better visibility at a fraction of the spend. Not by cutting corners on observability, but by removing the pricing model that punishes Kubernetes environments for scaling.
If your observability bill keeps climbing or you’ve been wanting to move to open-source but don’t have the bandwidth, let’s have a conversation.
Book a free 30-minute observability consultation
We’ll look at your current setup, flag where you’re overspending, and map out what a migration would actually involve. No sales deck, just an engineer-to-engineer chat about your stack.




