Brussels / 2 & 3 February 2019


Challenges in Monitoring Distributed Storage Environment and how Tendrl addresses them

Monitoring involves dynamically extracting, disseminating, interpreting and presenting system information to the user. The advent of distributed storage environments brings new challenges to the monitoring world. Combing through large volumes of logs across geographically distributed nodes at the same time to find a point of failure is time-consuming and increasingly infeasible. There is a pressing need to replace the standard techniques and practices used to monitor centralized storage with a centralized monitoring system that obtains exactly the information required to track the health, performance, load and capacity of system objects or software processes in the distributed system, and presents it to users in real time and in useful formats.

In this talk, Rishubh and Gowtham will discuss the challenges sys-admins face in monitoring distributed systems, and how Tendrl, an open source monitoring tool, helps them monitor a distributed storage system built on Gluster, a scale-out open source software-defined storage (SDS) platform. You will get an in-depth explanation of how Tendrl monitors each system in the distributed environment, and a glimpse of its modern web interface. They will also show how the metric visualization provided by Grafana makes monitoring simple and fast. The talk will conclude with a demo of how Tendrl helps vendors with capacity planning, detecting and alerting on failure conditions, and keeping track of the performance and health of the system.

The advent of distributed storage has brought many new challenges in monitoring storage systems. It has become impractical for sys-admins to go through the logs of each storage system on geographically dispersed nodes to detect points of failure. It is also infeasible to keep track of the health, performance, and capacity of each system by running CLI commands in a terminal and reading through the output.

To monitor the environment efficiently, the monitoring system should be able to get information about each component on every node in the distributed storage environment. Gathering that much data can flood the system with information, not all of which the user needs. The monitoring system should therefore filter the information and display only what is relevant to the user, when it is required.
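The filtering requirement above can be illustrated with a minimal Python sketch. It is not Tendrl's implementation: the `Sample` type, the metric names, and the threshold values are all hypothetical and chosen only to show the idea of collecting samples from every node and surfacing just the ones that need attention.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    node: str      # host that reported the sample (hypothetical naming)
    metric: str    # metric name, e.g. "disk_used_pct" (hypothetical)
    value: float


# Hypothetical thresholds; real limits depend on the deployment.
THRESHOLDS = {"disk_used_pct": 80.0, "cpu_used_pct": 90.0}


def relevant(samples, thresholds=THRESHOLDS):
    """Return only the samples whose value exceeds the threshold for their metric.

    Metrics without a configured threshold are treated as not relevant.
    """
    return [
        s for s in samples
        if s.metric in thresholds and s.value > thresholds[s.metric]
    ]


# Samples gathered from several nodes; most of them are uninteresting.
samples = [
    Sample("node-1", "disk_used_pct", 42.0),
    Sample("node-2", "disk_used_pct", 91.5),
    Sample("node-3", "cpu_used_pct", 95.0),
    Sample("node-3", "net_rx_mbps", 120.0),
]

for s in relevant(samples):
    print(f"{s.node}: {s.metric}={s.value}")
```

Only the samples from node-2 and node-3 that cross a threshold are shown; the rest stay collected but out of the user's way.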

Monitoring should:

  1. Help users find out why and when their system failed.
  2. Help users predict future problems.
  3. Help users make decisions based on real data and trends instead of hunches.
  4. Avoid outage costs.
  5. Observe and check the status of processes and resources over a period of time and keep them under systematic review.

Tendrl facilitates:

  1. Monitoring a distributed storage system with ease.
  2. Operational consistency and efficiency (streamlined provisioning, enhanced discovery, and management via the integrated service dashboard).
  3. Getting important information about storage utilization to help troubleshoot and diagnose issues (noisy neighbors, flaky disks, network bottlenecks).
  4. Proactively monitoring and managing health, performance, and capacity utilization, gaining operational intelligence at scale, and receiving alerts for operational issues that require intervention.
  5. Alerting on operational issues (OSD state, cluster state, failed drives, etc.).

For better understanding, the presentation also includes interactive demos that show the challenges users face and how they can overcome them.


Rishubh Jain
Gowtham Shanmugasundaram