Agent-Based Monitoring
Some systems ship with internal monitoring facilities that can be accessed through defined protocols. Typical examples are:
- Simple Network Management Protocol (SNMP)
e.g. for gathering metrics from servers and network equipment.
- Windows Management Instrumentation (WMI)
e.g. for gathering data about Microsoft Windows systems.
- Java Management Extensions (JMX)
e.g. for Java-based middleware applications.
- Secure Shell (SSH)
e.g. for accessing resources and gathering data from them.
An agent-based approach is preferable when the collection repository sits outside your network, because fewer ports need to be maintained, e.g. in firewall or server configurations. Furthermore, data from various sources can be collected and sent over a single channel, which reduces network traffic and processing overhead. An industry standard for triggering actions based on transactions is Application Response Measurement (ARM). Such systems need access to agents, which provide the required data.
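The single-channel idea can be sketched as follows. This is a minimal illustration, not the API of any particular monitoring product: the class `MetricAgent`, its methods and the JSON payload layout are all hypothetical, and the `send` callable stands in for whatever outbound connection the agent would actually use.

```python
import json
import time
from typing import Callable, Dict, List


class MetricAgent:
    """Hypothetical local agent: reads several sources and ships the
    readings as one batch over a single outbound channel."""

    def __init__(self, host: str, send: Callable[[bytes], None]) -> None:
        self.host = host
        self.send = send  # the one channel, e.g. a single HTTPS connection
        self.sources: Dict[str, Callable[[], float]] = {}
        self.buffer: List[dict] = []

    def register(self, name: str, read: Callable[[], float]) -> None:
        """Register a named source; `read` returns its current value."""
        self.sources[name] = read

    def collect(self) -> None:
        """Take one reading from every registered source and buffer it."""
        now = time.time()
        for name, read in self.sources.items():
            self.buffer.append(
                {"host": self.host, "metric": name, "ts": now, "value": read()}
            )

    def flush(self) -> int:
        """Send all buffered readings as one payload; return how many were sent."""
        if not self.buffer:
            return 0
        payload = json.dumps(self.buffer).encode("utf-8")
        self.send(payload)
        sent = len(self.buffer)
        self.buffer = []
        return sent
```

Because only `send` talks to the outside world, a firewall would need to allow just that one outbound connection per host, regardless of how many sources the agent reads locally.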
Agentless Monitoring
Agentless monitoring is more useful when agents cannot be installed or when the deployment of the monitoring system should be kept simple. Its strength therefore lies in low deployment and maintenance effort. This method can also be used for external health checks, e.g. from the user's point of view.
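An external health check of this kind could look roughly like the sketch below, which uses only the Python standard library. The function name `check_health`, the result type and the rule "any 2xx status counts as healthy" are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class HealthResult(NamedTuple):
    url: str
    ok: bool
    status: Optional[int]  # None if no HTTP response was received
    latency_s: float


def check_health(url: str, timeout_s: float = 5.0) -> HealthResult:
    """Probe a service from the outside, the way a user would reach it."""
    started = time.monotonic()
    status: Optional[int] = None
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status (4xx/5xx).
        status = exc.code
    except (urllib.error.URLError, OSError):
        # No answer at all: DNS failure, refused connection, timeout, ...
        status = None
    latency = time.monotonic() - started
    ok = status is not None and 200 <= status < 300
    return HealthResult(url, ok, status, latency)
```

Nothing has to be installed on the monitored system; the check only needs network access to the endpoint it probes.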
Monitoring Operations Activities
Some organizations use operations tools such as Chef.io as Infrastructure-as-Code utilities, e.g. to monitor environment resources within the deployment pipeline.
The Infrastructure-as-Code movement leads to the philosophy that infrastructure should contribute monitoring information in the same fashion as other applications. This is important because, for instance, a deployment can drive up CPU utilization during a roll-out; this information must therefore be taken into account while monitoring applications.
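One way to take roll-outs into account is to let the deployment tooling report its time windows and to evaluate CPU alerts against them. The sketch below assumes such windows are available as (start, end) timestamp pairs; the names `Sample` and `cpu_alerts` and the 90 % threshold are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Sample(NamedTuple):
    ts: float   # seconds since epoch
    cpu: float  # utilization between 0.0 and 1.0


def cpu_alerts(
    samples: Iterable[Sample],
    deployments: List[Tuple[float, float]],
    threshold: float = 0.9,
) -> List[Sample]:
    """Return samples above the threshold that fall outside any roll-out window."""

    def during_deployment(ts: float) -> bool:
        return any(start <= ts <= end for start, end in deployments)

    return [
        s for s in samples
        if s.cpu > threshold and not during_deployment(s.ts)
    ]
```

Whether high utilization during a roll-out should be suppressed, downgraded or merely annotated is a policy decision; the sketch simply filters it out.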
Data Collection and Storage
For analytical actions, the collection and storage of data is crucial. Therefore it is important to differentiate the kinds of data which can be gathered:
- Time Series Data
i.e. sequences of time-stamped data points, which represent certain aspects of states and state changes.
- Time-Stamped Event Notifications
which are typically output as logs, statistics or existing data. This data can be used as a direct measurement or turned into metrics, because each item indicates when and where something happened.
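Turning event notifications into a metric can be as simple as counting events per time interval, which yields a time series again. A minimal sketch, with `Event` and `events_per_interval` as illustrative names:

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    ts: float     # seconds since epoch
    message: str


def events_per_interval(
    events: Iterable[Event], interval_s: int = 60
) -> List[Tuple[int, int]]:
    """Count events per interval; returns (interval_start, count) pairs in time order."""
    counts = Counter(int(e.ts // interval_s) * interval_s for e in events)
    return sorted(counts.items())
```

Intervals without any event do not appear in the result, so a consumer that needs a gap-free series would have to fill in zeros itself.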
For all collected data, it is crucial to consider the collection time, the correlation of items (based on context) and the data volume:
- Collating related items by time
timestamps might not match down to the microsecond, due to differences between system clocks.
- Collating related items by context
when running multiple nodes, false positives might be discovered, as multiple nodes report into the same repository.
- Volume of monitoring data
different aspects must be considered, depending on storage capacity, performance requirements and the required audit timeframe.
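The first two points can be combined in one small routine: group records by a shared context key first, then treat records of the same context as related if their timestamps lie within a tolerance. The sketch below assumes that every record carries a request ID as its context key and that a tolerance of half a second is enough to absorb clock differences; both are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple


class Record(NamedTuple):
    ts: float          # timestamp as reported by the node's own clock
    node: str
    request_id: str    # context key shared across nodes
    message: str


def collate(records: Iterable[Record], tolerance_s: float = 0.5) -> List[List[Record]]:
    """Group records that share a request ID and lie close together in time."""
    by_context: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        by_context[record.request_id].append(record)

    groups: List[List[Record]] = []
    for items in by_context.values():
        items.sort(key=lambda r: r.ts)
        current = [items[0]]
        for record in items[1:]:
            # Same context and within tolerance of the previous record.
            if record.ts - current[-1].ts <= tolerance_s:
                current.append(record)
            else:
                groups.append(current)
                current = [record]
        groups.append(current)

    groups.sort(key=lambda g: g[0].ts)
    return groups
```

Grouping by context before comparing timestamps keeps two unrelated requests that happen at the same moment on different nodes from being merged into one incident.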
A popular time series database is the Round-Robin Database (RRD), which is designed for storing and displaying time series data and offers good retention policy configuration. Some use cases may also require a Hadoop Distributed File System (HDFS); all analytical activities in this space are based on big data concepts.
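The round-robin principle itself fits in a few lines: a fixed number of slots is overwritten in a cycle, and several raw readings are consolidated into one stored value, so storage never grows. The sketch below illustrates the concept only; it is not the RRD file format or tool interface, and the class name `RoundRobinArchive` is made up.

```python
from collections import deque
from typing import Deque, List


class RoundRobinArchive:
    """Fixed-size archive: keeps the last `slots` consolidated values."""

    def __init__(self, slots: int, steps_per_slot: int) -> None:
        # A deque with maxlen silently drops the oldest entry when full.
        self.rows: Deque[float] = deque(maxlen=slots)
        self.steps_per_slot = steps_per_slot
        self.pending: List[float] = []

    def update(self, value: float) -> None:
        """Add one raw reading; store an average once enough readings arrived."""
        self.pending.append(value)
        if len(self.pending) == self.steps_per_slot:
            self.rows.append(sum(self.pending) / len(self.pending))
            self.pending = []
```

A retention policy then amounts to choosing `slots` and `steps_per_slot`: for example, one reading per minute with 60 steps per slot and 168 slots would keep one week of hourly averages.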
As any monitoring system will process Time-Stamped Event Notifications (e.g. logs), the following requirements must be met:
- Consistent format
e.g. by leveraging logging facades.
- Explanation why the log message was produced
e.g. by leveraging tags such as "error condition detected" or "tracing of code".
- Context information
e.g. source of the log entry within the code, Process ID, Request ID, Node ID, etc. Please also see my article about this topic.
- Screening information
e.g. severities, alert levels, etc. This is required, as log messages are collected in a repository that is accessed through queries.
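The four requirements above can be met with Python's standard `logging` module and a custom formatter that emits one JSON object per entry. The field names (`reason`, `request_id`, `node_id`) and the helper `build_logger` are assumptions chosen for this sketch, not a fixed schema.

```python
import json
import logging
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render every log record in the same queryable JSON layout."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "time": self.formatTime(record),
            "severity": record.levelname,                          # screening
            "reason": getattr(record, "reason", "unspecified"),    # why it was logged
            "message": record.getMessage(),
            "source": f"{record.module}:{record.funcName}:{record.lineno}",
            "process_id": record.process,                          # context
            "request_id": getattr(record, "request_id", None),
            "node_id": getattr(record, "node_id", None),
        }
        return json.dumps(entry)


def build_logger(name: str, stream: Optional[TextIO] = None) -> logging.Logger:
    """Create a logger that writes JSON entries to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as `logger.error("payment gateway timed out", extra={"reason": "error condition detected", "request_id": "req-42", "node_id": "node-1"})` attaches the context fields to the record; a repository can later filter on `severity` or `request_id` without parsing free text.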
Considering the Big Picture
When monitoring applications, the related workflows should be monitored as well, as they are crucial for the efficiency of operations and therefore for business satisfaction. For DevOps processes, for instance, Damon Edwards lists five things that are important to monitor:
- A business metric
- Cycle time
- Mean time to detect errors
- Mean time to report errors
- Amount of scrap (rework)
These measurements also correlate with Availability Management in ITIL 2011.
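Two of the listed measurements reduce to simple arithmetic over incident timestamps. The sketch below assumes that each incident records when the error occurred, when it was detected and when it was reported, and that "time to report" is measured from detection; the source does not define these terms, so both the record layout and the definitions are assumptions.

```python
from statistics import mean
from typing import List, NamedTuple


class Incident(NamedTuple):
    occurred: float  # seconds since epoch
    detected: float
    reported: float


def mean_time_to_detect(incidents: List[Incident]) -> float:
    """Average delay between an error occurring and being detected."""
    return mean([i.detected - i.occurred for i in incidents])


def mean_time_to_report(incidents: List[Incident]) -> float:
    """Average delay between detection and the report (assumed definition)."""
    return mean([i.reported - i.detected for i in incidents])
```

Both functions raise `statistics.StatisticsError` for an empty list, which is a reasonable signal that there is nothing to average yet.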