FOSDEM 2024

Chaos Engineering in Action: Enhancing Resilience in Strimzi

Track: Testing and Continuous delivery devroom
Room: UD2.208 (Decroly)
Day: Sunday
Start: 12:40
End: 13:10
Video only: ud2208

This session explores chaos engineering for Strimzi, an operator for running Apache Kafka on Kubernetes. It focuses on practical, hands-on chaos experiments and shows how they can improve the resilience and reliability of Kafka clusters managed by Strimzi in a Kubernetes environment.

The presentation begins by introducing the fundamental principles of chaos engineering, which provide the foundation for the demonstrations that follow. The core of the session is a series of detailed demonstrations, each focusing on a specific chaos experiment. The experiments and their effects on the Strimzi-managed Kafka clusters are observed and analyzed with tools such as Prometheus (a CNCF project) and Grafana. Through these demonstrations, attendees will see the tools and techniques used to create and manage failure scenarios, and how different types of disruption affect system stability and performance in a real-world environment.
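A common principle in chaos engineering is to define a steady state before injecting a failure, then check whether the system returns to it afterwards. The sketch below illustrates that check in Python. It is a minimal illustration rather than anything taken from the talk: the `SteadyState` type, the `recovered` function, the choice of the under-replicated partition count as the metric, and the sample values are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class SteadyState:
    """Hypothetical steady-state definition for one metric."""
    max_under_replicated: int = 0  # highest value still considered healthy
    recovery_window: int = 3       # consecutive healthy samples required


def recovered(samples: Sequence[int], steady: SteadyState) -> bool:
    """Return True if the most recent samples are all within the threshold.

    `samples` is a time-ordered series of metric readings, for example the
    under-replicated partition count scraped at a fixed interval.
    """
    if len(samples) < steady.recovery_window:
        return False  # not enough data to judge recovery
    tail = samples[-steady.recovery_window:]
    return all(value <= steady.max_under_replicated for value in tail)


# Made-up readings: healthy, disrupted by the experiment, then healthy again.
readings = [0, 0, 4, 6, 2, 0, 0, 0]
print(recovered(readings, SteadyState()))
```

In practice the readings would come from the monitoring stack rather than a hard-coded list, and the threshold and window would be chosen per cluster.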

Throughout the session, we will discuss and demonstrate:

- Setting Up the Environment: Preparing a Strimzi-managed Kafka cluster on Kubernetes for chaos experiments.
- Designing Chaos Experiments: Crafting realistic and meaningful chaos scenarios tailored to Kafka clusters.
- Implementing Experiments: Step-by-step execution of chaos experiments, including network failures, pod deletions, and resource constraints.
- Monitoring and Analysis: Using Prometheus and Grafana to observe the impact of chaos experiments and analyze system behavior under stress.
- Learning and Adapting: Interpreting results to improve system design and resilience strategies.
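As an illustration of the pod-deletion experiment mentioned above, the sketch below builds a declarative experiment definition in Python. The abstract does not name a chaos tool, so the use of Chaos Mesh and its `PodChaos` resource here is an assumption, as are the function name `pod_kill_experiment`, the cluster name `my-cluster`, and the namespace `kafka`. The Strimzi pod labels should be verified against the actual cluster before use.

```python
import json


def pod_kill_experiment(cluster: str, namespace: str,
                        name: str = "kafka-broker-kill") -> dict:
    """Build a Chaos Mesh PodChaos manifest that kills one Kafka broker pod.

    Assumes Chaos Mesh is installed and that broker pods carry the
    strimzi.io/cluster and strimzi.io/name labels (check with
    `kubectl get pods --show-labels`).
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",  # delete the selected pod
            "mode": "one",         # pick a single matching pod at random
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {
                    "strimzi.io/cluster": cluster,
                    "strimzi.io/name": f"{cluster}-kafka",
                },
            },
        },
    }


# Hypothetical cluster and namespace names, for illustration only.
manifest = pod_kill_experiment("my-cluster", "kafka")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since Kubernetes accepts JSON manifests as well as YAML. Network-failure and resource-constraint experiments would follow the same pattern with different resource kinds.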