Glossary
Clear, plain-English definitions of the cloud, DevOps, SRE, Kubernetes, observability, and platform engineering terms our team explains to customers every week. Browse the A–Z, or search for a specific term.
A
-
Air-Gapped Environment
An air-gapped environment operates without network connectivity to public or untrusted systems. The term air gap refers to the physical or logical separation that blocks remote access,…
-
Alertmanager
Alertmanager is the alert-handling component of the Prometheus monitoring ecosystem. When Prometheus evaluates alerting rules and finds that their conditions are met, it sends alerts to Alertmanager.
-
Amazon CloudFront
Amazon CloudFront is AWS's content delivery network (CDN) service that accelerates website and application delivery by caching content at edge locations around the world. When users request your…
-
Amazon EBS
Amazon EBS is a high-performance block storage service designed for use with Amazon EC2 instances and containerized workloads on EKS and ECS. EBS volumes behave like raw, unformatted…
-
Amazon EC2
Amazon EC2 (Elastic Compute Cloud) is AWS's service for renting virtual servers in the cloud, allowing organizations to run applications without owning physical hardware. You pay only for…
-
Amazon EKS
Amazon Elastic Kubernetes Service (EKS) is AWS's managed Kubernetes offering that handles the complex work of running container orchestration infrastructure, freeing teams to focus on building applications rather…
-
Amazon RDS
Amazon RDS is a managed database service that handles the heavy lifting of database administration including provisioning, patching, backup, recovery, and scaling. RDS supports multiple database engines including…
-
Amazon S3
Amazon S3 is an object storage service that offers industry-leading scalability, data availability, security, and performance. S3 stores data as objects within buckets, where each object can range…
-
Ansible
Ansible is an open-source IT automation engine that automates provisioning, configuration management, application deployment, and orchestration across your infrastructure. It uses human-readable YAML scripts called playbooks and connects…
-
Apache HTTP Server
Apache HTTP Server is free, open-source web server software that accepts HTTP requests from browsers and delivers web pages in response. It's maintained by the Apache Software…
-
APM (Application Performance Monitoring)
APM stands for Application Performance Monitoring. It refers to tools that help you understand how well an application is running by tracking speed, errors, and overall reliability while…
-
AppArmor
AppArmor is a Linux Security Module that confines applications to specific resources — files, network connections, and system capabilities — by enforcing per-program security profiles. Even if an…
-
Argo CD
Argo CD is a Kubernetes-native continuous delivery tool that uses GitOps principles to manage and deploy applications. It continuously monitors application definitions stored in Git repositories and…
-
Artifact
An artifact is a file or package produced during the process of building or deploying software. It is the output that gets stored, shared, or delivered so it…
-
Auto Scaling
Auto Scaling is a cloud computing feature that automatically adjusts the number of active compute resources based on current demand. It monitors metrics like CPU utilization, request count,…
-
Availability Zones
An availability zone, or AZ, is an isolated location within a cloud region, made up of one or more physical data centers. Each zone operates with its own power, networking, and cooling infrastructure, allowing applications…
-
AWS App Mesh
AWS App Mesh is a fully managed service mesh from Amazon Web Services that handles communication between microservices running in your applications. It uses Envoy proxies deployed alongside…
-
AWS Athena
AWS Athena is Amazon's serverless query service that lets you analyze data directly in S3 using standard SQL — no databases to set up, no servers to manage,…
-
AWS CloudFormation
AWS CloudFormation is an AWS service that lets you model and provision AWS infrastructure resources using declarative templates written in JSON or YAML. You describe the resources you…
-
AWS ECS
AWS ECS is Amazon's fully managed container orchestration service that lets you run, stop, and manage Docker containers on a cluster of EC2 instances or serverless Fargate infrastructure.…
-
AWS Fargate
AWS Fargate is a serverless compute engine for containers that works with both Amazon ECS and Amazon EKS. It removes the need to provision, configure, and scale clusters…
-
AWS IAM
AWS IAM is a web service that helps you securely control access to AWS resources. IAM lets you create and manage users, groups, and roles, and define permissions…
-
AWS Lambda
AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. You upload your code as a function, define what events should…
-
AWS VPC
AWS VPC is a service that lets you create a logically isolated section of the AWS cloud where you can launch resources in a virtual network that you…
-
Azure Kubernetes Service (AKS)
Azure Kubernetes Service (AKS) is Microsoft's managed platform for running containerized applications without the operational burden of maintaining Kubernetes infrastructure yourself. It handles the complex work of container…
B
-
Backstage
Backstage is an open-source platform for building developer portals, originally created by Spotify and now a CNCF incubating project. It provides a centralized hub where developers can discover…
-
Bash (Bourne Again Shell)
Bash stands for Bourne Again Shell. It is a command-line tool that lets users interact with an operating system by typing commands instead of clicking buttons. Bash…
-
Bastion
A bastion, also known as a bastion host, is a specially secured server that acts as the single, controlled entry point into a private network. Instead of exposing…
-
Bitbucket
Bitbucket is a web-based platform used to store, manage, and collaborate on source code using Git. It helps teams work together on software projects by providing tools…
-
Blameless Postmortem
Blameless Postmortem is a structured analysis conducted after an incident that focuses on understanding what happened, why it happened, and how to prevent it from happening again, without…
-
Blue-Green Deployment
Blue-Green Deployment is a release strategy that maintains two identical production environments, called blue and green. At any time, one environment serves live traffic while the other is…
-
Bucket
A bucket acts like a top-level folder in the cloud. Inside a bucket, you store objects, which can be files of any type and size. Each object…
C
-
Canary Deployment
Canary Deployment is a release strategy where a new version of an application is deployed to a small subset of the production infrastructure or user base before being…
-
Chaos Engineering
Chaos Engineering is the discipline of experimenting on a distributed system to build confidence in its ability to withstand turbulent conditions in production. It involves intentionally introducing failures…
-
CI/CD Pipeline
CI/CD Pipeline is an automated workflow that takes code changes from a developer's commit through building, testing, and deploying to production. CI (Continuous Integration) automatically builds and tests…
-
CIDR (Classless Inter-Domain Routing)
CIDR stands for Classless Inter-Domain Routing. It is a way to define and manage IP address ranges more efficiently in computer networks. CIDR is commonly used in cloud…
-
Cloud Computing
Cloud computing means using the internet to access IT services like storage, servers, databases, software and more. Instead of buying and maintaining your own physical servers and equipment,…
-
Cloud Cost Optimization
Cloud Cost Optimization is the continuous process of reducing cloud infrastructure spending while maintaining the performance, availability, and reliability that the business requires. It encompasses rightsizing resources, eliminating…
-
Cloud Service
Cloud services are tools and resources like software, storage, and computing power that are delivered over the internet. Instead of companies buying and maintaining their own servers and…
-
Cloud-Native
Cloud-Native refers to an approach for building and running applications that fully exploits the advantages of cloud computing. Cloud-native applications are designed as loosely coupled microservices, packaged in…
-
Cloudflare
Cloudflare is a web infrastructure and security company that sits between websites and their visitors, protecting against attacks while accelerating content delivery. The platform powers a significant portion…
-
CloudWatch
CloudWatch is a monitoring and observability service provided by Amazon Web Services. It helps you track the health, performance, and behavior of applications and infrastructure running on AWS…
-
Container Orchestration
Container Orchestration is the automated management of containerized applications across a cluster of machines. It handles scheduling containers onto nodes, scaling them based on demand, managing networking between…
-
Container Registry
Container Registry is a centralized repository where container images are stored, versioned, and distributed. It works similarly to a code repository but for Docker and OCI-compatible images.
-
Containerd
Containerd is an industry-standard container runtime that manages the complete lifecycle of containers on a host system, including image transfer, container execution, storage, and networking. It is the…
-
Cronjob
A cronjob is a scheduled task run by cron, the time-based scheduler in Unix-like operating systems, which automates repetitive tasks by running commands or scripts at specified intervals. It's the backbone of server automation,…
-
Crossplane
Crossplane is an open-source CNCF project that extends Kubernetes to provision and manage cloud infrastructure resources. It lets you define and manage AWS, Azure, and GCP resources like…
-
CVE
CVE, or Common Vulnerabilities and Exposures, is a standardized system for identifying and cataloging publicly known cybersecurity weaknesses. Each vulnerability receives a unique identifier that security professionals worldwide…
D
-
Data Center
A data center is a physical facility that houses the computing infrastructure — servers, storage systems, and networking equipment — that organizations use to store, process, and deliver…
-
Developer Experience
Developer Experience (DevEx) encompasses the overall experience developers have when building, deploying, and maintaining software within an organization. It includes the quality of tools, documentation, APIs, development environments,…
-
DevOps
DevOps is a modern approach to software delivery that brings development and operations teams together. Instead of working separately, both sides collaborate throughout the entire process of building,…
-
Distributed Tracing
Distributed Tracing is an observability technique that tracks individual requests as they propagate through a distributed system, recording timing and contextual data at each service boundary. It provides…
-
DNS
DNS, or the Domain Name System, is the internet's phonebook — it translates human-friendly domain names like google.com into the numerical IP addresses that computers use to find…
-
Docker
Docker is a containerization platform that packages applications together with everything they need to run. This includes code, libraries, dependencies, and configuration. By doing this, Docker ensures applications…
-
Drift Detection
Drift Detection is the process of comparing the desired state of infrastructure, as defined in code or configuration, against the actual state of deployed resources. When differences are…
E
-
eBPF
eBPF is a revolutionary technology built into the Linux kernel that allows custom programs to run in a sandboxed environment within the kernel. These programs can observe and…
-
ELK Stack
ELK Stack is a collection of three open-source tools, Elasticsearch, Logstash, and Kibana, that work together to provide comprehensive log management and analytics. Elasticsearch stores and indexes log…
-
Error Budget
Error Budget is the maximum amount of downtime or errors a service can experience while still meeting its Service Level Objective. It is calculated as 100 percent minus…
-
etcd
etcd is a strongly consistent, distributed key-value store used as the primary data store for Kubernetes. It stores all cluster state, including configuration data, secrets, service discovery information,…
F
-
Falco
Falco is an open-source cloud-native runtime security project created by Sysdig and now a CNCF graduated project. It monitors system calls from the kernel to detect anomalous activity…
-
Feature Flags
Feature Flags are conditional statements in application code that control which features or behaviors are active for different users, environments, or conditions. They decouple the act of deploying…
-
FinOps
FinOps is a cloud financial management practice that brings together technology, finance, and business teams to collaborate on data-driven spending decisions. The term combines Finance and DevOps, reflecting…
-
Fluentd
Fluentd is an open-source data collector developed under the Cloud Native Computing Foundation that provides a unified logging layer for distributed systems. It collects log data from multiple…
-
Flux CD
Flux CD is an open-source GitOps operator for Kubernetes maintained by the Cloud Native Computing Foundation. It continuously monitors Git repositories containing Kubernetes manifests, Helm charts, or Kustomize…
G
-
Git
Git is a distributed version control system that tracks changes to files and coordinates work among multiple developers. Created by Linus Torvalds in 2005, it has become the…
-
GitHub Actions
GitHub Actions is a continuous integration and continuous delivery platform built directly into GitHub. It allows you to automate workflows triggered by repository events like pushes, pull requests,…
-
GitOps
GitOps is an operational framework that uses Git as the single source of truth for managing infrastructure and applications, applying DevOps practices like version control and automation to…
-
GKE (Google Kubernetes Engine)
Google Kubernetes Engine (GKE) is Google Cloud's fully managed service for deploying, managing, and scaling containerized applications using Kubernetes. It handles the complex infrastructure work — upgrades, patches,…
-
Golden Path
Golden Path is a pre-built, recommended way to accomplish a common development task that incorporates organizational best practices by default. Golden paths are created by platform engineering teams…
-
Golden Signals
Golden Signals are four key metrics defined in Google's Site Reliability Engineering book that every production service should monitor: latency, traffic, errors, and saturation. These signals provide a…
-
Grafana
Grafana is an open-source platform that transforms raw metrics from databases, cloud services, and applications into interactive dashboards and visualizations. It doesn't store data itself — instead, it…
-
Grafana Mimir
Grafana Mimir is an open-source, horizontally scalable, highly available long-term storage system for Prometheus metrics, created by Grafana Labs. It is the successor to Cortex and handles billions…
-
Grafana Tempo
Grafana Tempo is an open-source, easy-to-operate distributed tracing backend built by Grafana Labs. Unlike traditional tracing systems that index trace data for search, Tempo stores traces in object…
H
-
HashiCorp Vault
HashiCorp Vault is an identity-based secrets management system that provides a centralized place to store, access, and distribute secrets such as API keys, passwords, certificates, and encryption keys.…
-
Helm
Helm is a package manager for Kubernetes. It lets you define, install, and upgrade complex Kubernetes applications using reusable templates called charts. Helm simplifies deployment management by bundling…
I
-
Immutable Infrastructure
Immutable Infrastructure is an infrastructure management approach where components are never modified after deployment. Instead of patching or updating running servers, you build a new version from scratch,…
-
Incident Management
Incident Management is a structured process for detecting, responding to, and resolving service disruptions or security events. It defines clear roles, communication channels, and procedures that teams follow…
-
Infrastructure as Code
Infrastructure as Code (IaC) is the practice of defining, provisioning, and managing computing infrastructure through version-controlled, machine-readable files rather than through manual processes. IaC applies software development practices…
-
Internal Developer Platform
Internal Developer Platform (IDP) is a set of tools, services, and workflows built by platform engineering teams that enables application developers to self-serve infrastructure and deployment needs. An…
-
Istio
Istio is an open-source service mesh platform that layers transparently onto Kubernetes clusters. It provides a uniform way to connect, secure, control, and observe microservices. Istio uses Envoy…
J
-
Jaeger
Jaeger is an open-source, end-to-end distributed tracing system originally developed by Uber and now a graduated CNCF project. It tracks requests as they flow through microservices, recording timing…
-
Jenkins
Jenkins is an open-source automation server written in Java that enables teams to automate virtually any aspect of the software delivery process. It supports thousands of plugins that…
K
-
Karpenter
Karpenter is an open-source Kubernetes cluster autoscaler that automatically provisions right-sized compute nodes in response to pending, unschedulable pods. Instead of relying on static node groups, Karpenter launches…
-
KMS (Key Management Service)
Key Management Service (KMS) is a centralized service used to create, manage, rotate, and control cryptographic keys for securing data. It enables organizations to encrypt sensitive information and…
-
Kubectl
Kubectl is the command-line interface used to interact with Kubernetes clusters. It allows users to deploy applications, inspect cluster resources, manage workloads, and troubleshoot issues by communicating directly…
-
Kubelet
Kubelet is a core component of the Kubernetes framework that runs on every worker node and is responsible for managing pods and containers. It ensures that containers defined…
-
Kubernetes
Kubernetes is an open-source platform that helps you manage and run containerized applications at scale. It automates tasks such as deploying apps, scaling them up or down…
-
Kubernetes ConfigMap
Kubernetes ConfigMap is a Kubernetes API object used to store non-confidential configuration data in key-value pairs. ConfigMaps decouple configuration from container images, allowing you to change application settings…
-
Kubernetes CRD
Kubernetes CRD is a Custom Resource Definition, a mechanism that lets you extend the Kubernetes API by defining entirely new resource types. Once a CRD is registered, users…
-
Kubernetes DaemonSet
Kubernetes DaemonSet is a controller that ensures a copy of a specified pod runs on every node in the cluster, or on a selected subset of nodes. As…
-
Kubernetes Deployment
Kubernetes Deployment is a resource that provides declarative updates for pods and replica sets. You describe the desired state of your application in a Deployment manifest, and the…
-
Kubernetes HPA
Kubernetes HPA is the Horizontal Pod Autoscaler, a built-in Kubernetes controller that automatically adjusts the number of pod replicas in a Deployment, ReplicaSet, or StatefulSet based on observed…
-
Kubernetes Ingress
Kubernetes Ingress is an API resource that defines rules for routing external HTTP and HTTPS traffic to services within a cluster. Instead of exposing each service individually through…
-
Kubernetes Namespace
Kubernetes Namespace is a mechanism for isolating groups of resources within a single Kubernetes cluster. Namespaces provide a scope for names, meaning two resources can have the same…
-
Kubernetes Network Policy
Kubernetes Network Policy is a Kubernetes resource that controls the flow of network traffic between pods at the IP address and port level. By default, all pods in…
-
Kubernetes Operator
Kubernetes Operator is a method of packaging, deploying, and managing applications using custom controllers and custom resource definitions. Operators encode operational knowledge, such as how to deploy, scale,…
-
Kubernetes Pod
Kubernetes Pod is the smallest and simplest unit in the Kubernetes object model. A pod represents a single instance of a running process in your cluster and can…
-
Kubernetes RBAC
Kubernetes RBAC is the authorization system built into Kubernetes that controls access to the Kubernetes API. It lets administrators define who can perform specific actions on specific resources…
-
Kubernetes Secret
Kubernetes Secret is a Kubernetes API object designed to hold sensitive information such as passwords, OAuth tokens, SSH keys, and TLS certificates. Secrets are similar to ConfigMaps but…
-
Kubernetes StatefulSet
Kubernetes StatefulSet is a workload controller designed for applications that require stable, persistent identities and ordered deployment. Unlike Deployments, which treat pods as interchangeable, StatefulSets assign each pod…
L
-
Linkerd
Linkerd is an ultralight service mesh for Kubernetes, originally created by Buoyant and now a CNCF graduated project. It provides critical features like mutual TLS, observability, and traffic…
-
Load Balancer
Load Balancer is a device or service that distributes incoming network traffic across multiple backend servers or instances. It ensures that no single server handles too much traffic,…
-
Log Aggregation
Log Aggregation is the practice of collecting log data from multiple sources, such as applications, containers, servers, and network devices, and centralizing it in a unified system. This…
-
Loki
Loki is a log aggregation system designed by Grafana Labs to be cost-effective and easy to operate. Unlike traditional log management tools that index the full text of…
M
-
Microservices Architecture
Microservices Architecture is a software design approach where an application is structured as a collection of small, independently deployable services. Each service owns its own data, runs in…
-
MTTD
MTTD (Mean Time to Detect) is a reliability metric measuring the average elapsed time between the onset of an issue and its detection by…
-
MTTR
MTTR (Mean Time to Recovery) is a key reliability metric that measures the average duration between the start of an incident and full service restoration. It encompasses detection,…
-
Multi-Cloud
Multi-Cloud is a strategy where an organization uses cloud computing services from two or more providers, such as AWS, Azure, and Google Cloud, rather than relying on a…
N
-
NAT Gateway
NAT Gateway is a managed network address translation service that enables instances in private subnets to connect to the internet or other AWS services while preventing the internet…
O
-
Observability Pipeline
Observability Pipeline is an intermediary layer that sits between telemetry data sources and observability backends. It collects metrics, logs, and traces from applications and infrastructure, then transforms, filters,…
-
On-Call Rotation
On-Call Rotation is a scheduled system where team members take turns being the designated primary responder for production incidents, alerts, and escalations. The on-call engineer is responsible for…
-
OPA
OPA (Open Policy Agent) is an open-source, general-purpose policy engine that decouples policy decision-making from policy enforcement. It uses a high-level declarative language called Rego to define policies…
-
OpenTelemetry
OpenTelemetry is a vendor-neutral, open-source observability framework maintained by the Cloud Native Computing Foundation. It provides a unified set of APIs, SDKs, and tools for generating, collecting, and…
P
-
Platform as a Product
Platform as a Product is an approach to platform engineering that applies product management principles to building and evolving internal developer platforms. Instead of mandating platform usage, teams…
-
Platform Engineering
Platform engineering is the practice of building and maintaining internal platforms that help developers work faster, more safely, and more efficiently. These platforms provide ready-made tools, environments and…
-
Pod Disruption Budget
Pod Disruption Budget is a Kubernetes resource that limits the number of pods from a given application that can be voluntarily disrupted at the same time. Voluntary disruptions…
-
Pod Security Standards
Pod Security Standards are a set of security profiles defined by the Kubernetes project establishing three pod security levels: Privileged, which is unrestricted; Baseline, which prevents known privilege…
-
Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud and now a graduated project of the Cloud Native Computing Foundation. It collects and stores…
-
Pulumi
Pulumi is an open-source infrastructure as code platform that lets you define, deploy, and manage cloud infrastructure using familiar programming languages like TypeScript, Python, Go, C#, and Java.…
R
-
RED Method
RED Method is a monitoring methodology for request-driven microservices that focuses on three key metrics: Rate (requests per second), Errors (failed requests per second), and Duration (distribution of…
-
Rolling Update
Rolling Update is a deployment strategy that gradually replaces old instances of an application with new ones, one or a few at a time, rather than updating all…
-
Runbook
Runbook is a set of documented procedures that describe how to carry out specific operational tasks or respond to specific types of incidents. Runbooks provide step-by-step instructions that…
S
-
Secrets Management
Secrets Management is the set of practices and tools used to securely handle sensitive data throughout its lifecycle. This includes API keys, database passwords, encryption keys, OAuth tokens,…
-
Serverless Computing
Serverless Computing is a cloud execution model where the cloud provider dynamically manages the allocation and provisioning of servers. Despite the name, servers still exist, but the developer…
-
Service Catalog
Service Catalog is a centralized, searchable inventory of all software services, APIs, libraries, and infrastructure components within an organization. It records metadata for each entry including ownership, documentation…
-
Service Mesh
Service Mesh is a dedicated infrastructure layer that handles service-to-service communication within a microservices architecture. It provides capabilities like load balancing, traffic routing, mutual TLS encryption, retries, and…
-
Sidecar Pattern
Sidecar Pattern is a design pattern in Kubernetes where a secondary container, called the sidecar, is deployed alongside the primary application container within the same pod. The sidecar…
-
Site Reliability Engineering
Site Reliability Engineering, or SRE, is the practice of keeping software systems reliable, fast and available. It blends software engineering skills with operations work to make sure websites,…
-
SLA
SLA (Service Level Agreement) is a formal, often legally binding contract between a service provider and its customers that specifies the expected level of…
-
SLI
SLI (Service Level Indicator) is a carefully defined quantitative measure of some aspect of the level of service provided. Common SLIs include request latency,…
-
SLO
SLO (Service Level Objective) is an internal reliability target that specifies the desired level of performance for a service over a given time window. SLOs are typically expressed…
-
SLSA
SLSA (Supply-chain Levels for Software Artifacts) is a security framework, originally created by Google, that provides incrementally adoptable guidelines for securing the software supply chain. SLSA…
-
Spot Instances
Spot Instances are unused compute capacity offered by cloud providers at significantly reduced prices, typically 60 to 90 percent cheaper than on-demand. The tradeoff is that the provider…
T
-
Tekton
Tekton is an open-source, Kubernetes-native framework for building CI/CD pipelines. It defines pipeline components as Kubernetes custom resources, meaning pipelines, tasks, and runs are all native Kubernetes objects.
-
Terraform
Terraform is an open-source infrastructure as code tool created by HashiCorp that enables teams to define and provision infrastructure using declarative configuration files written in HCL. It supports…
-
Thanos
Thanos is an open-source project that extends Prometheus to provide long-term metrics storage, global query capabilities across multiple Prometheus instances, and high availability. It integrates seamlessly with existing…
-
Toil in SRE
Toil is a term used in Site Reliability Engineering to describe operational work that is manual, repetitive, automatable, tactical, and devoid of enduring value. It is work that…
U
-
USE Method
USE Method is a performance analysis methodology created by Brendan Gregg that monitors infrastructure resources by tracking three metrics: Utilization (percentage of time the resource is busy), Saturation…
Z
-
Zero Trust Architecture
Zero Trust Architecture is a security framework that eliminates implicit trust and requires continuous verification of every user, device, and service attempting to access resources. Unlike traditional perimeter-based…
