Distributed Data


When thinking about a monitoring system, the following items must be considered:

  1. Dynamic time intervals for monitoring
    Data should not be collected at fixed intervals. When a situation calls for finer-grained monitoring, the collection interval should shrink; once that level of detail is no longer needed, it should grow again.
  2. Re-use modern distributed logging or messaging systems
    systems such as Logstash can collect all kinds of logs and conduct local processing before shipping the data off. Other platforms, such as LinkedIn's Kafka, provide a high-performance distributed messaging system built largely for log aggregation and monitoring-data collection; its event-oriented architecture decouples incoming streams from their processing.
  3. Watch Big Data analytics
    researchers are starting to use advanced machine learning algorithms to deal with noisy, inconsistent, and voluminous monitoring data. This is an important area to watch.
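The first point above, dynamic time intervals, can be sketched as a simple controller that halves the polling interval when an anomaly is detected and backs off otherwise. The function name and the bounds are illustrative assumptions, not part of any specific tool:

```python
# Sketch of dynamic monitoring intervals: shrink the interval when
# finer-grained data is needed, grow it back when things are calm.
# MIN_INTERVAL and MAX_INTERVAL are illustrative values.
MIN_INTERVAL = 5    # seconds; finest sampling allowed
MAX_INTERVAL = 300  # seconds; coarsest sampling allowed

def next_interval(current: float, anomaly_detected: bool) -> float:
    """Return the next collection interval in seconds."""
    if anomaly_detected:
        # Something looks off: poll twice as often, down to the floor.
        return max(MIN_INTERVAL, current / 2)
    # All quiet: back off exponentially, up to the ceiling.
    return min(MAX_INTERVAL, current * 2)
```

Starting at a 60-second interval, one anomalous cycle drops the interval to 30 seconds, while two calm cycles push it back up to 120 and then 240 seconds.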

Tooling

The market is full of monitoring tools; open source communities and commercial players provide many ways of taking care of your data. I will summarize some typical tools worth considering:

  • Nagios
    a network and host monitoring tool, which is currently weak in cloud environments where servers come and go. The community provides lots of plugins, which can be used as agents for all kinds of use cases. The Nagios core system is rather small and only provides alerting features.
  • Sensu and Icinga
    are highly extensible and scalable systems that work well in cloud environments. Icinga is a fork of Nagios, and both focus on a scalable distributed monitoring architecture and easy extension. They provide a stronger internal reporting system than Nagios; however, Nagios plugins can be re-used.
  • Ganglia
    is a tool for collecting cluster metrics; it replicates data to nearby nodes to prevent data loss and to avoid over-chattiness toward central repositories. Many IaaS providers support Ganglia.
  • Graylog2, Logstash and Splunk
    are distributed log management systems, tailored to processing large amounts of text-based logs. They offer front ends for interactive exploration of logs and powerful search features.
  • CloudWatch and the like
    in public clouds, the cloud provider will usually offer a monitoring solution; Amazon Web Services, for example, offers CloudWatch. These tools collect hundreds of metrics at a fixed interval. Another cloud-based solution is Papertrail, which is also recommended by Ryan Baxter (IBM).
  • Kafka
    is a system designed by LinkedIn that focuses on collecting large amounts of logs and metrics for real-time use by multiple other systems. Specialized components were designed for the collection and dissemination parts. Kafka is based on a publish-subscribe messaging model and can be used for far more than just monitoring.
  • Stream processing tools (Storm, Flume, S4)
    by collecting large numbers of logs and metrics continuously, you are effectively creating monitoring data streams. These kinds of systems can process monitoring data in real time.
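The publish-subscribe decoupling attributed to Kafka above can be illustrated with a toy in-process broker. This is a sketch of the pattern only, not the Kafka API; the class and method names are invented for illustration, and a real deployment would use a Kafka client library against a broker cluster:

```python
from collections import defaultdict
from typing import Any, Callable

class ToyBroker:
    """Minimal in-process publish-subscribe broker (illustration only)."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[Callable[[Any], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[Any], None]) -> None:
        # Consumers register per topic; the producer never sees them.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: Any) -> None:
        # The producer only knows the topic name, so the incoming
        # stream is decoupled from whatever processing is attached.
        for handler in self._subscribers[topic]:
            handler(event)

broker = ToyBroker()
alerts: list[Any] = []
archive: list[Any] = []
broker.subscribe("metrics", alerts.append)   # e.g. an alerting pipeline
broker.subscribe("metrics", archive.append)  # e.g. long-term storage
broker.publish("metrics", {"host": "web-1", "cpu": 0.93})
```

Both consumers receive the same event without the producer knowing either of them, which is the decoupling property that makes this style suitable for feeding monitoring data to several downstream systems at once.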

Application Performance Index (Apdex) is an open standard developed by an alliance of companies that defines a method for reporting and comparing the performance of software applications. It can be leveraged to create and evaluate custom monitoring systems and tools.
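The Apdex score for a target threshold T is defined as (satisfied + tolerating / 2) / total samples, where a request counts as satisfied at or below T seconds and as tolerating between T and 4T seconds. A minimal sketch of the calculation, with illustrative sample data:

```python
def apdex(response_times: list[float], t: float) -> float:
    """Compute Apdex_T = (satisfied + tolerating / 2) / total.

    satisfied:  response time <= T
    tolerating: T < response time <= 4T
    frustrated: everything slower (contributes zero)
    """
    if not response_times:
        raise ValueError("need at least one sample")
    satisfied = sum(1 for rt in response_times if rt <= t)
    tolerating = sum(1 for rt in response_times if t < rt <= 4 * t)
    return (satisfied + tolerating / 2) / len(response_times)

# Six samples against a 0.5 s target: three satisfied (<= 0.5 s),
# one tolerating (1.2 s <= 2.0 s), two frustrated (2.5 s, 5.0 s).
score = apdex([0.2, 0.3, 0.5, 1.2, 2.5, 5.0], t=0.5)
```

Here the score is (3 + 1/2) / 6 ≈ 0.58, which falls in what the standard rates as a "poor" band, so the same number can be used both as a dashboard metric and as an alerting threshold.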

Example Implementation