A US Banking SaaS provider serving multiple enterprise banking clients needed production-grade observability – metrics, logs, and traces – automatically provisioned for every new customer cluster, with all data flowing entirely within the AWS private network and no cross-tenant data leakage.
Challenge: Observability at Scale, with No Room for Gaps
The client’s SaaS platform provisions a dedicated AWS EKS cluster for each enterprise banking customer. Every new sign-up needed fully operational monitoring, logging, and distributed tracing from the moment it went live – but there was no automated mechanism to make that happen. Engineers were manually configuring observability tools after each onboarding, creating dangerous blind spots and operational drag.
The environment also imposed strict constraints that made this harder than it sounds:
- No data over public internet – Financial services regulatory requirements meant all telemetry – even operational metrics – had to stay within the AWS private network.
- Multi-tenant data isolation – Multiple banking clients sharing a central observability platform required strict per-tenant data segregation, enforced at the storage layer, not just the UI.
- Fragmented tooling – Three separate agents (Prometheus, Promtail, OpenTelemetry Collector) ran in each tenant cluster, each with its own upgrade cycle, config syntax, and failure surface.
- Long-term retention cost – EBS-backed storage for Prometheus and Elasticsearch doesn’t scale economically. A durable, cost-elastic storage tier was needed.
- Tool selection uncertainty – The client needed a structured evaluation of ELK Stack vs Grafana LGTM before committing to a platform.
Action: A Private, Automated, Unified Observability Platform
Obsium designed and built a two-tier observability architecture on AWS, evolving it over two phases to address every constraint.
- Grafana LGTM selected over ELK. A head-to-head evaluation showed the Grafana stack was 5–10x cheaper on storage (label-indexed vs. full-text), natively multi-tenant, built cloud-native for Kubernetes, and fully open-source. ELK was ruled out on cost, licensing, and operational complexity grounds.
- Dedicated Observability Cluster, fully private. Grafana, Mimir (metrics), Loki (logs), and Tempo (traces) were deployed in a dedicated AWS VPC – all in private subnets, all backed by Amazon S3. Internal NLBs and AWS PrivateLink VPC Endpoints gave each tenant cluster a private path to push telemetry, with zero public internet exposure.
- Multi-tenancy enforced at the data layer. Each banking client received a unique Grafana Organisation and X-Scope-OrgID header enforced across Mimir, Loki, and Tempo. A customer can only ever query their own data – by architecture, not by convention.
- Three agents replaced by one. In 2025, Prometheus, Promtail, and the OpenTelemetry Collector were replaced with a single Grafana Alloy deployment per tenant cluster. Alloy handles metrics scraping, log collection, and trace forwarding in one Helm release, one config file, one upgrade process.
- Zero-touch tenant onboarding via Terraform. The entire observability stack – Helm releases, PrivateLink wiring, tenant credentials – was codified in Terraform. Every new banking customer now receives a fully configured observability setup at cluster creation time, with no manual steps.
- Log-to-trace correlation. Grafana derived fields were configured to render trace IDs in log lines as clickable links into Tempo, enabling engineers to jump from a log error to its full distributed trace in one click.
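To make the single-agent, tenant-isolated design concrete, here is a minimal sketch of what a per-tenant Alloy configuration along these lines could look like. The endpoint hostnames and the tenant ID `bank-a` are illustrative placeholders, not the client's actual configuration; note how every write path carries the tenant identity:

```alloy
// Sketch only: hostnames and the "bank-a" tenant ID are placeholders.
discovery.kubernetes "pods" {
  role = "pod"
}

// Metrics: scrape pods and remote-write to Mimir with the tenant header.
prometheus.scrape "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint {
    url     = "https://mimir.observability.internal/api/v1/push"
    headers = { "X-Scope-OrgID" = "bank-a" }
  }
}

// Logs: collect pod logs and push to Loki under the same tenant ID.
loki.source.kubernetes "pods" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [loki.write.default.receiver]
}

loki.write "default" {
  endpoint {
    url       = "https://loki.observability.internal/loki/api/v1/push"
    tenant_id = "bank-a"
  }
}

// Traces: receive OTLP from applications and forward to Tempo.
otelcol.receiver.otlp "default" {
  grpc {}
  output {
    traces = [otelcol.exporter.otlp.tempo.input]
  }
}

otelcol.exporter.otlp "tempo" {
  client {
    endpoint = "tempo.observability.internal:4317"
    headers  = { "X-Scope-OrgID" = "bank-a" }
  }
}
```

One file, one Helm release – yet all three signals are collected and every write is stamped with the tenant's `X-Scope-OrgID`, which is what lets Mimir, Loki, and Tempo enforce isolation at the storage layer.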
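A hedged Terraform sketch of what zero-touch onboarding along these lines might look like (the module names, variables, and endpoint service reference are illustrative assumptions, not the client's actual code):

```hcl
# Illustrative sketch only: module and variable names are assumptions.
variable "tenant_id" {
  type        = string
  description = "Unique identifier for the banking customer"
}

variable "observability_endpoint_service" {
  type        = string
  description = "PrivateLink endpoint service name of the observability cluster"
}

# Private path from the tenant VPC to the observability cluster's internal NLB.
resource "aws_vpc_endpoint" "observability" {
  vpc_id              = module.tenant_vpc.vpc_id
  service_name        = var.observability_endpoint_service
  vpc_endpoint_type   = "Interface"
  subnet_ids          = module.tenant_vpc.private_subnet_ids
  private_dns_enabled = true
}

# One Helm release installs the unified agent, templated with the tenant ID.
resource "helm_release" "alloy" {
  name       = "alloy"
  repository = "https://grafana.github.io/helm-charts"
  chart      = "alloy"
  namespace  = "observability"

  values = [templatefile("${path.module}/alloy-values.yaml.tpl", {
    tenant_id = var.tenant_id
  })]
}
```

Because the endpoint wiring and the agent install are both resources in the same module, a new tenant's observability comes up in the same `terraform apply` that creates the cluster.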
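Log-to-trace correlation of this kind is typically configured through Grafana's datasource provisioning. A minimal sketch, assuming log lines contain a `trace_id=` key and a Tempo datasource with UID `tempo` (both assumptions; the URLs and regex would differ in practice):

```yaml
# Illustrative provisioning sketch; URLs, UIDs, and the regex are assumptions.
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://loki-gateway.observability.svc
    jsonData:
      derivedFields:
        # Render trace IDs in log lines as internal links into Tempo.
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)"
          url: "$${__value.raw}"
          datasourceUid: tempo
  - name: Tempo
    type: tempo
    uid: tempo
    url: http://tempo-query-frontend.observability.svc:3200
```

The `datasourceUid` field is what turns the matched trace ID into an internal Grafana link rather than an external URL, giving the one-click jump from log error to distributed trace.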
Results: One Platform. Every Signal. Every Tenant. Automated.
- 3→1 – Agents per cluster
- 0 – Public endpoints
- 100% – Tenant data isolation
- 3 – Signals unified
- S3 – Long-term retention
- Three collection agents per cluster reduced to one (Grafana Alloy), shrinking Helm releases, config surface, and operational overhead
- Zero telemetry over the public internet – all data flows via AWS PrivateLink, supporting SOC 2 and PCI-DSS network segmentation requirements
- Complete tenant data isolation enforced at the storage layer across metrics, logs, and traces
- New tenant observability provisioned automatically at onboarding – from hours of manual work to zero
- Amazon S3 as the long-term backend for all three signals – storage costs scale with actual volume, not with EBS provisioning
- Single Grafana UI replaces what would have been Kibana + a separate Grafana instance – one interface, one login, full signal coverage
- Platform is forward-compatible – continuous profiling, synthetic monitoring, and eBPF observability can be added with configuration changes alone
What This Makes Possible:
With a fully automated, private, unified observability platform in place, the client’s engineering team can onboard new banking customers without touching a monitoring config. Incidents are diagnosed faster – engineers correlate metrics spikes, log errors, and distributed traces from a single screen. And as the platform grows, adding new tenants adds zero observability overhead.
Grafana Alloy’s component model also means the platform is future-proof: signals like continuous profiling (Pyroscope) or eBPF-based network observability are configuration additions, not new agent deployments.
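As a hedged illustration of what a configuration-only addition could look like, adding continuous profiling is a new pair of components in the existing Alloy config rather than a new agent (the endpoint below is a placeholder, and the snippet assumes the existing `discovery.kubernetes.pods` component):

```alloy
// Sketch only: the Pyroscope endpoint is a placeholder.
pyroscope.scrape "apps" {
  targets    = discovery.kubernetes.pods.targets
  forward_to = [pyroscope.write.backend.receiver]
}

pyroscope.write "backend" {
  endpoint {
    url = "https://pyroscope.observability.internal"
  }
}
```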
Want observability that works from day one, for every customer?
Obsium builds production-grade, automated, multi-tenant observability platforms on AWS EKS. Private networks, S3-backed retention, and unified agents, all delivered as infrastructure code.
