Brussels / 1 & 2 February 2025


The Art of Fleet-Wide Kubernetes Observability: 3 Core Strategies


Kubernetes is just magical! The benefits it offers for running large-scale applications are well known to all of us. But the same strengths that make Kubernetes so good at running applications can become a huge hurdle when it comes to monitoring those deployments.

A single pod can easily expose 25 different metrics, and each container can add another 20. So even a minimal single-node K8s cluster running just 10 pods can push your metrics volume to around 5,000 data points. Now imagine running multi-node clusters across a multi-cluster fleet: what chaos that can be! This creates a huge challenge: identifying and handling the metrics that are actually useful. And it doesn't end with understanding your metrics. How do you set up alerting? How do you correlate this enormous volume of data? How do you develop or adopt the right tooling?

This talk delves into the art and science of building a robust observability ecosystem tailored for large-scale environments. We'll focus on three key fundamentals for managing observability at scale without sacrificing performance: Metrics, Alerts and Correlation.

Drawing on our daily experience as SREs at Red Hat managing a complex, multi-cluster fleet, we'll share our approach to collecting and managing metrics effectively, developing alerting strategies, and extracting valuable insights by properly correlating telemetry. These practices help establish a solid foundation for a strategic and robust cluster monitoring system. Whether you're starting from scratch or optimising your existing setup, these insights will help you build a resilient, scalable observability framework. This session will deepen your understanding of monitoring capabilities and help you improve reliability across your fleet.

Speakers

Mitali Bhalla
Pratik Kumar Panda